Note that with feanor's patches available on GitHub, I also get a 5% improvement on H264 decoding with 1080p videos (Prometheus and Bourne Ultimatum trailers), on my Mac mini under Linux.
@zzd10h No, I haven't recompiled it for OS4 yet, as I am investigating another part of the code with tools (mainly perf, in fact) that are only available for Linux ...
@tommysammy I use this command line: ./ffmpeg_g -cpuflags altivec -benchmark -i Prometheus-1080p-30s.mp4 -f null /dev/null
And really, I often run it prefixed with the perf command.
Note that just like feanor, I extracted 30 seconds from the original video, even if in my case that was not from the very beginning but after a 30-second delay:
(Sort of) update: I'm away from home these days for work (and the upcoming elections). I do have my PowerBook with me to commit the remaining patches, but time is very short. I will try to commit them during the weekend; if that doesn't work out, I'll do it on Tuesday when I return home. I still stand by my promise to work on other components of ffmpeg that I have found to be worth optimizing, regardless of the bounty.
I am not using any "special" AltiVec instructions (in particular the problematic dst ones) which exist only on G4s, so the code should work on G5s as well. However, the code has not yet been tested in a 64-bit PowerPC environment, so I am unsure if it will work or even compile. OTOH, I understand that AmigaOS does not operate in 64-bit (yet?).
Correct, AmigaOS is still 32-bit. However, the PA6T CPU in the A1-X1000 is a 64-bit CPU (like the G5).
VMX128 is exactly the same as AltiVec. However the instruction scheduling is a little different on G4 and G5, due to the different numbers of units for handling different things. G5 also has greater latencies and more restrictions on how many permute instructions it can pipeline. The code scheduling tools mentioned above mean you can look for these cases and simply code around them.
For example, the dnetc AltiVec client (AmigaOS version from Futaura) is slower on the PA6T clocked at 1.8 GHz than on a G4 clocked at 1.26 GHz.
@K-L All the processors mentioned (G4, G5, PA6T) use the same AltiVec (VMX, in IBM terminology) instruction set, even if the implementation is different. For example, G5 and PA6T can issue 3 instructions per cycle but have 2 dispatch units (sub-units). In the past, I thought that the PA6T's AltiVec was weaker, but now ... I don't know. Another interesting point to study (with the dnetc case).
@feanor Are you sure dst instructions only exist on G4s? The G5 user manual mentions them, and the 970 even has 8 streams, instead of 4 on the G4.
I'm pretty positive that the dst instructions cause a performance hit on G5s, as they were not properly implemented by IBM, for whatever reason. In the now-defunct Apple Developer pages on AltiVec, use of the dst instructions was discouraged and dcbz/dcbt were suggested instead. I can look it up in the Power ISA manual to see what the exact problem is, but later today.
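To make the dst versus dcbt point a bit more concrete, here is a minimal sketch of cache-line-at-a-time software prefetching in plain C. It is not taken from the ffmpeg patches: the function name `sum_bytes_prefetched` and the `CACHE_LINE` and `PREFETCH_DISTANCE` values are hypothetical, and it uses GCC's `__builtin_prefetch` as a portable stand-in for a dcbt-style hint. I would expect GCC to turn that into a dcbt-type instruction on PowerPC, but that is an assumption worth checking in the generated assembly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tuning values, not taken from any real patch.
 * The cache-line size differs between CPUs, so treat both as assumptions. */
#define CACHE_LINE        128  /* assumed cache-line size in bytes */
#define PREFETCH_DISTANCE 256  /* how far ahead (in bytes) to hint */

/* Sum all bytes of a buffer, issuing one read-prefetch hint per assumed
 * cache line instead of setting up a data stream (dst). */
uint32_t sum_bytes_prefetched(const uint8_t *buf, size_t len)
{
    assert(buf != NULL || len == 0);

    uint32_t sum = 0;
    for (size_t i = 0; i < len; i++) {
        /* Only hint once per line, and never past the end of the buffer. */
        if (i % CACHE_LINE == 0 && i + PREFETCH_DISTANCE < len) {
            /* Arguments: address, 0 = prefetch for read, 3 = high temporal locality. */
            __builtin_prefetch(buf + i + PREFETCH_DISTANCE, 0, 3);
        }
        sum += buf[i];
    }
    return sum;
}
```

The hint never changes the result, only (possibly) the timing, so the right prefetch distance has to be found by measuring on each CPU rather than guessed.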