I've been thinking about ways to improve video playback further without having to wait for HW decoding. I've already mentioned enabling direct-rendering, which should help a little.
However, comparing the PowerPC code in ffmpeg to x86, it's clear that the H.264 codec in particular is only partially altivec optimized (hint: look at the h264_* files in both directories).
So, do we have any altivec experts in the community who would be interested in checking this out?
Hans
P.S., feel free to ask MorphOS developers too, since any improvements would benefit people in both communities that have altivec machines (G4/G5/PA6T CPUs).
I'd be willing to donate a little attention to this... which would provide an initial base for more knowledgeable coders to update if a 100% native assembly routine is required (I only really know how to optimize for 020/040 processors, and the same rules appear to work well on the PPC from what I have tried).
I'd be willing to donate a little attention to this... which would provide an initial base for more knowledgeable coders to update if a 100% native assembly routine is required (I only really know how to optimize for 020/040 processors, and the same rules appear to work well on the PPC from what I have tried).
This isn't really about using assembly, but using the vector instructions in the altivec unit (like SSE instructions in x86 processors). Ever used anything like that before?
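To illustrate the difference, here is a minimal sketch (not taken from ffmpeg) of what vector code looks like from C. It adds two rows of bytes with saturation, 16 bytes per instruction when compiled with AltiVec support, and falls back to a scalar loop otherwise. The function name `add_rows_sat` and the alignment and length requirements are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#ifdef __ALTIVEC__
#include <altivec.h>
#endif

/* Saturating add of two byte rows: dst[i] = min(a[i] + b[i], 255).
 * Assumptions for this sketch: n is a multiple of 16 and all three
 * pointers are 16-byte aligned (vec_ld/vec_st ignore the low 4 address bits). */
static void add_rows_sat(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
{
    assert(n % 16 == 0);
#ifdef __ALTIVEC__
    for (size_t i = 0; i < n; i += 16) {
        vector unsigned char va = vec_ld(0, a + i);   /* load 16 bytes */
        vector unsigned char vb = vec_ld(0, b + i);
        vec_st(vec_adds(va, vb), 0, dst + i);         /* 16 saturating adds at once */
    }
#else
    /* Scalar fallback: one byte per iteration */
    for (size_t i = 0; i < n; i++) {
        unsigned s = (unsigned)a[i] + b[i];
        dst[i] = s > 255 ? 255 : (uint8_t)s;
    }
#endif
}
```

No hand-written assembly is involved; the compiler maps each intrinsic to one vector instruction.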
You can look at the code that I linked to in my first post, to see if you think that you could contribute.
@K-L Quote:
Hans : LiveForIt seems to be a great expert of AltiVec (with his latest MPlayer version).
I doubt that LiveForIt touched any altivec-specific code when he created his version. The partially altivec optimized code was already there. Plus, you shouldn't expect him to do everything. There are only so many hours in the day, and he has other things to do too.
Of course, he's welcome to have a look at the H.264 SIMD (SSE/Altivec/etc.) code, if he wants to.
@Hans: I think that a request could be made to the ffmpeg team. I'd also like to point out another possible opportunity: the developer behind freevec.org offers his services. He specializes in SIMD and AltiVec, and recently offered his services (as paid work).
I think that a request could be made to the ffmpeg team.
Would that really make a difference? I mean, they already know that the altivec code isn't as developed as the x86 & ARM counterparts.
The impression that I get is that there is little interest in improving the PowerPC-specific code due to the small number of desktop/mobile-devices that use it. This is why I'm asking here.
Quote:
I'd also like to point out another possible opportunity: the developer behind freevec.org offers his services. He specializes in SIMD and AltiVec, and recently offered his services (as paid work).
That might be worthwhile, provided that enough people are interested in footing the bill.
At this stage I have no idea how much difference could be made with more altivec-optimized functions, nor do I know how much work (i.e., cost) it would take. If we had detailed profiling/benchmark data from the X86/ARM code showing what difference the extra SIMD optimized functions make, then we could probably guess how much performance could be gained. However, I have never seen such data.
Right now I'm having to convert between interleaved and non-interleaved video; I'm hoping there is a way to force FFmpeg to use interleaved yuv420p.
I remember discussing this with one or two people before (possibly you too). Anyway, so long as their code treats the Y, U & V as totally independent (i.e., use the pointer and bytes-per-row of each independently), then interleaved/non-interleaved is entirely irrelevant.
Treating them independently also means that the code needs to keep its mitts off the padding area (which is where the other chroma plane is stored in interleaved mode). To be honest, poking about in the padding area is a pointless waste of CPU resources, so there really is no reason for the code to be doing that, interleaved or not.
Code that follows the rules above can handle both interleaved and non-interleaved bitmaps without any special cases.
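As a rough sketch of those rules (the function names are made up for this example, and this is not code from any player or driver): each plane is copied using only its own pointer and bytes-per-row, and only the visible width of each row is touched, so the padding is never read or written.

```c
#include <stdint.h>
#include <string.h>

/* Copy one plane row by row. Only `width` bytes of each row are touched;
 * the bytes between `width` and the stride (the padding) are left alone. */
static void copy_plane(uint8_t *dst, int dst_stride,
                       const uint8_t *src, int src_stride,
                       int width, int height)
{
    for (int y = 0; y < height; y++) {
        memcpy(dst, src, (size_t)width);
        dst += dst_stride;
        src += src_stride;
    }
}

/* Copy a YUV420p frame. Index 0 = Y, 1 = U, 2 = V. Each plane has its own
 * pointer and bytes-per-row, so interleaved and non-interleaved layouts
 * are handled by the same code. */
static void copy_yuv420p(uint8_t *const dst[3], const int dst_stride[3],
                         const uint8_t *const src[3], const int src_stride[3],
                         int width, int height)
{
    int cw = (width + 1) / 2;   /* chroma planes are half size in 4:2:0 */
    int ch = (height + 1) / 2;
    copy_plane(dst[0], dst_stride[0], src[0], src_stride[0], width, height);
    copy_plane(dst[1], dst_stride[1], src[1], src_stride[1], cw, ch);
    copy_plane(dst[2], dst_stride[2], src[2], src_stride[2], cw, ch);
}
```

With an interleaved destination, the U and V pointers simply point into the same rows and share one bytes-per-row value; the copy code does not need to know.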
If the decoder can do direct-rendering (VideoLAN uses direct-rendering with H.264, so it should), then it should already be making no silly assumptions, as it won't know in advance what the locations and bytes-per-row of each plane will be.
Anyway, this has nothing to do with using altivec for DCT/iDCT, deblocking, motion compensation, etc.
I remember discussing this with one or two people before (possibly you too)
Yes, and I want to go over it again, just in case there is something I have missed. This time I'm talking with the MPlayer developers to see what they have to say.
Quote:
so long as their code treats the Y, U & V as totally independent
Well, yes, sure, but I don't want to copy each individual line in the draw_slice() function if it's not absolutely necessary. I want to treat it as a block of memory.
Sure it works fine with direct rendering to use pointers and bytes per row.
But if possible I want to transfer from bitmap format A to bitmap format A; I don't want to translate from A to B if it's not needed.
I have been looking at nv12 as an interleaved mode, but it's interleaved along the x axis, not the y axis: "uvuvuvuv", not "uuuuvvvv".
The interleaved yuv420 format you have implemented looks like IMC4. It works fine, but it would be even better if it were also used by ffmpeg/mplayer internally, I think.
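To make the two layouts concrete, here is a small sketch of where chroma sample (x, y) lives in each. The function names are invented for this example, and `v_start` (the byte offset within a row where the V samples begin, e.g. half the bytes-per-row) is an assumption about the row-interleaved layout rather than something taken from Picasso96.

```c
#include <stddef.h>

/* NV12 chroma plane: U and V alternate within a row ("uvuvuvuv"). */
static size_t nv12_u_offset(size_t stride, size_t x, size_t y)
{
    return y * stride + 2 * x;
}

static size_t nv12_v_offset(size_t stride, size_t x, size_t y)
{
    return y * stride + 2 * x + 1;
}

/* Row-interleaved chroma: each row holds a run of U samples followed by
 * a run of V samples ("uuuuvvvv"). */
static size_t rowil_u_offset(size_t stride, size_t x, size_t y)
{
    return y * stride + x;
}

static size_t rowil_v_offset(size_t stride, size_t v_start, size_t x, size_t y)
{
    return y * stride + v_start + x;
}
```

In the row-interleaved case each chroma plane can still be described by its own pointer and a shared bytes-per-row, which is not possible with NV12.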
MPlayer's "i420" is not interleaved; it's just like yv12 but with U and V swapped. MPlayer's "iyuv" format is, I think, just the same as i420.
I don't understand why there are so many formats in mplayer with different format names where the data is organized in the same way. It makes no sense to me.
Edited by LiveForIt on 2014/11/7 4:45:06
(NutsAboutAmiga)
Basilisk II for AmigaOS4 AmigaInputAnywhere Excalibur and other tools and apps.
Well, yes, sure, but I don't want to copy each individual line in the draw_slice() function if it's not absolutely necessary. I want to treat it as a block of memory.
Sure it works fine with direct rendering to use pointers and bytes per row.
You may want to ask them about using direct rendering with H.264, because the CODEC_CAP_DR1 flag is set for that codec. That would avoid the whole bitmap copying issue, which would still be there even if the source bitmap were interleaved (the bytes-per-row could still be mismatched).
Quote:
The interleaved yuv420 format you have implemented looks like IMC4. It works fine, but it would be even better if it were also used by ffmpeg/mplayer internally, I think.
I didn't actually implement any particular format. The Radeon HD driver's rendering code could handle non-interleaved YUV bitmaps just as easily (and swapped U & V planes, etc.). The layout of the Y, U & V planes in memory is decided by Picasso96.
AFAIK, Picasso96 uses interleaved U & V planes because that's the layout that Radeon 7xxx/9xxx cards use (which is what we had at the time that YUV420p support was added).
Quote:
I don't understand why there are so many formats in mplayer with different format names where the data is organized in the same way. It makes no sense to me.
Yes, that is confusing. It's probably not their fault, but the result of multiple companies developing codecs choosing their own formats and names.
Using altivec, btw, will cut out the AmigaOne 500 and Sam owners.
For those, I would like to suggest that you check out the internal DSP of the PPC 440 and 460. This DSP has 24 instructions that can improve audio and video decoding.
Here are some interesting documents you could check:
Using altivec, btw, will cut out the AmigaOne 500 and Sam owners.
For those, I would like to suggest that you check out the internal DSP of the PPC 440 and 460. This DSP has 24 instructions that can improve audio and video decoding.
Here are some interesting documents you could check:
I did read these docs, and I also profiled ffmpeg on the 440 years ago. I tried to optimize, but the effects were not visible. The ffmpeg developers know how to program, and I think the code is already efficient. Many other CPU features could be used, but I'm afraid the MAC instructions won't be enough.
By the way, there is already a macro in ffmpeg that uses one of these MAC instructions in some places.
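For readers unfamiliar with the idea, a multiply-accumulate (MAC) step looks roughly like this in portable C. This is only a sketch: `MAC16_SKETCH` and `dot16` are hypothetical names for this example, not ffmpeg's actual macro, and a CPU-specific version would replace the macro body with a single MAC instruction.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a MAC macro: acc += a * b, with 16-bit inputs
 * widened to 32 bits before the multiply. */
#define MAC16_SKETCH(acc, a, b) ((acc) += (int32_t)(a) * (int32_t)(b))

/* Dot product of two 16-bit vectors, one MAC step per element. */
static int32_t dot16(const int16_t *a, const int16_t *b, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        MAC16_SKETCH(acc, a[i], b[i]);
    return acc;
}
```

Loops of this shape appear in filters and transforms, which is why MAC instructions are attractive for audio and video decoding; whether they help enough in practice is exactly the open question here.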
Looking again at this topic would be another interesting task!
Isn't that moot now? I thought the decoding stuff has been given to the GPU with the upcoming new gfx driver?
No. Composited video shifts the YUV => RGB conversion to the GPU, which improves performance by eliminating the conversion and reducing the RAM => VRAM copy bandwidth.
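For a sense of the per-pixel work that this moves off the CPU, here is a sketch of a YUV to RGB conversion for a single pixel. It assumes BT.601 limited-range video and uses the common 8-bit fixed-point coefficients; which matrix a given player or driver actually uses is not stated here, and the function names are made up for this example.

```c
#include <stdint.h>

/* Clamp a fixed-point value (scaled by 256) to the 0..255 range. */
static uint8_t clamp_shift8(int v)
{
    if (v < 0)
        return 0;
    v >>= 8;
    return v > 255 ? 255 : (uint8_t)v;
}

/* Convert one pixel, assuming BT.601 limited range (Y in 16..235,
 * U and V centred on 128). Output order is R, G, B. */
static void yuv_to_rgb(uint8_t y, uint8_t u, uint8_t v, uint8_t rgb[3])
{
    int c = y - 16;
    int d = u - 128;
    int e = v - 128;
    rgb[0] = clamp_shift8(298 * c + 409 * e + 128);
    rgb[1] = clamp_shift8(298 * c - 100 * d - 208 * e + 128);
    rgb[2] = clamp_shift8(298 * c + 516 * d + 128);
}
```

On the CPU this runs for every pixel of every frame, and the RGB result is larger than the YUV420 source, which is where the bandwidth saving comes from.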
I think that the result would have to be 596s for realtime CPU decode (18.20% of an AMD FX 4300 quad core 2.8GHz = 108.4732s).
Using the benchmark figures for the Prometheus 1080p clip that I have of 23.778 and 399.791, the Sam460 may take up to 1825s to complete the big_buck_bunny benchmark though.
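The arithmetic behind these estimates can be sketched as follows. This is one reading of the figures, not something confirmed in the posts: it assumes 108.4732s is the decode time on the FX 4300 and that it corresponds to 18.20% of the clip's duration, and that 23.778 and 399.791 are times for the same Prometheus clip on the fast and slow machine respectively. The function names are invented for this example.

```c
/* Clip duration implied by a decode time that took the given fraction
 * of realtime (e.g. 0.182 for 18.20%). */
static double clip_duration_s(double decode_time_s, double fraction_of_realtime)
{
    return decode_time_s / fraction_of_realtime;
}

/* Scale a reference decode time by the slowdown ratio measured on
 * another clip (slow_s / fast_s). */
static double scaled_time_s(double ref_time_s, double fast_s, double slow_s)
{
    return ref_time_s * (slow_s / fast_s);
}
```

Under those assumptions, `clip_duration_s(108.4732, 0.1820)` gives about 596s, and `scaled_time_s(108.4732, 23.778, 399.791)` gives a little under 1825s, roughly three times longer than realtime.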
Edited by Spectre660 on 2014/11/7 21:54:17 Edited by Spectre660 on 2014/11/7 21:56:37 Edited by Spectre660 on 2014/11/7 22:10:15 Edited by Spectre660 on 2014/11/7 22:32:57
I didn't actually implement any particular format. The Radeon HD driver's rendering code could handle non-interleaved YUV bitmaps just as easily (and swapped U & V planes, etc.). The layout of the Y, U & V planes in memory is decided by Picasso96.
Yes, so if I can somehow make a "fake" bitmap (a user-defined bitmap), fill in the Y, U, V pointers with the slices[x] coming from draw_slice(), and set BytesPerRow to strides[x], I can prevent some memcpy's from the codec.
But does it not need to be padded for DMA operation?
Quote:
because that's the layout that Radeon 7xxx/9xxx cards use
In that case I guess it's an industry standard.
Yes, so if I can somehow make a "fake" bitmap (a user-defined bitmap), fill in the Y, U, V pointers with the slices[x] coming from draw_slice(), and set BytesPerRow to strides[x], I can prevent some memcpy's from the codec.
Please do not try something like this. You should treat bitmaps as black boxes and, therefore, not go creating fake bitmaps or poking around in their internals. Those internals can be changed at any time.