On some real AmigaNG systems, like the Sam4x0 and X5000 (maybe the X1000 and A1222 as well, not sure), the embedded CPU DMA engine can be used for copies between VRAM and DRAM, which is faster than CPU copies, though not as fast as GPU-based DMA transfers. AFAIK on QEMU that's not used, and maybe even can't be.
The DMA engine is emulated on sam460ex but it's not used that frequently by AmigaOS. At least with sm501 it boots without it, but some activity does use it; we've seen this when we first got AmigaOS working on sam460ex and needed the DMA engine to avoid some crashes. The DMA in common cases will end up in a memmove (see qemu/hw/ppc/ppc440_uc.c), which is probably optimised by the host libc, but I don't know if it's used for VRAM access, and we can't easily test that as PCIe does not work in sam460ex emulation and the sam460ex firmware thinks that only Radeon cards can appear on PCI, so it won't init a newer card on PCI.
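To make the memmove point concrete, here is a minimal, self-contained model of the idea. The dma_channel struct, the flat guest_ram[] array and map_guest() are my own simplifications standing in for QEMU's real structures and cpu_physical_memory_map(), so treat this as a sketch, not the actual qemu/hw/ppc/ppc440_uc.c code:

#include <stdio.h>
#include <string.h>
#include <stdint.h>

static uint8_t guest_ram[64 * 1024]; /* stand-in for directly mapped guest RAM */

typedef struct {
    uint32_t sa;    /* source address as programmed by the guest */
    uint32_t da;    /* destination address */
    uint32_t count; /* transfer length in bytes */
} dma_channel;      /* illustrative, not the real QEMU struct */

/* Stand-in for QEMU's cpu_physical_memory_map(): returns a direct host
 * pointer when the range is ordinary RAM, NULL otherwise. */
static void *map_guest(uint32_t addr, uint32_t len)
{
    if (addr + len <= sizeof(guest_ram)) {
        return guest_ram + addr;
    }
    return NULL; /* e.g. MMIO/VRAM that has no direct host mapping */
}

static void dma_transfer(const dma_channel *ch)
{
    void *src = map_guest(ch->sa, ch->count);
    void *dst = map_guest(ch->da, ch->count);
    if (src && dst) {
        /* Both ends are plain host memory: the whole "DMA" becomes one
         * memmove(), optimised by the host libc. */
        memmove(dst, src, ch->count);
    }
    /* If one end were not directly mappable, the device model would have
     * to fall back to much slower per-access dispatch instead. */
}

int main(void)
{
    dma_channel ch = { 0x0000, 0x8000, 16 };
    memcpy(guest_ram, "hello DMA engine", 16);
    dma_transfer(&ch);
    printf("%.16s\n", (const char *)(guest_ram + 0x8000));
    return 0;
}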
On amigaone/pegasos2 there's no DMA engine, so the kernel may use tricky copy methods with the FPU, AltiVec or, worst of all, dcbz, which we measured before. This was improved a bit but may still need more optimisation, as only parts of my patch were merged. The RageMem tests show that not doing any tricks is fastest, so if this can be disabled that may help, unless VRAM access really needs some wider load/store. This could be tested and optimised independently of GPUs, but it would need somebody to do it as I don't have time for everything myself.
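For reference, this is roughly what the difference between a plain loop and the dcbz trick looks like. A simplified sketch assuming a 32-byte cache line and a cache-line-aligned destination, not the actual kernel code; the dcbz path only compiles on PPC:

#include <string.h>
#include <stdint.h>
#include <stddef.h>

#define CACHE_LINE 32 /* assumed line size; real code must query the CPU */

/* Plain copy: on QEMU this translates to a simple host load/store loop. */
static void copy_plain(uint8_t *dst, const uint8_t *src, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        dst[i] = src[i];
    }
}

#ifdef __PPC__
/* "Tricky" copy: zero each destination cache line with dcbz first so a
 * real CPU skips reading the old contents before overwriting them. A win
 * for DRAM on real hardware, but dcbz is expensive to emulate and does
 * not help for VRAM/MMIO, which fits the RageMem results. Assumes dst is
 * cache-line aligned; dcbz zeroes the whole line containing the address. */
static void copy_dcbz(uint8_t *dst, const uint8_t *src, size_t len)
{
    size_t i;
    for (i = 0; i + CACHE_LINE <= len; i += CACHE_LINE) {
        __asm__ volatile("dcbz 0,%0" : : "r"(dst + i) : "memory");
        memcpy(dst + i, src + i, CACHE_LINE);
    }
    memcpy(dst + i, src + i, len - i); /* copy the unaligned tail */
}
#endif

int main(void)
{
    static uint8_t src[4096];
    static uint8_t dst[4096] __attribute__((aligned(CACHE_LINE)));
    copy_plain(dst, src, sizeof(dst));
#ifdef __PPC__
    copy_dcbz(dst, src, sizeof(dst));
#endif
    return 0;
}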
Quote:
QEMU probably doesn't even emulate 128-bit AltiVec accesses using 128-bit host CPU accesses, but uses slower 64-bit or even only 32-bit ones instead.
QEMU does have the ability to compile AltiVec to host SIMD instructions, but it could be that not all of them are optimal, so again this could be tested and improved. But I don't even know what to test, as I don't know if this is really needed or used by the RadeonHD/RX drivers, so all this is just shooting in the dark. I think we would need a better understanding of what causes the slowness first, before trying to improve QEMU to solve it. It may not even be just slow VRAM access, as some results were faster, so there's at least one other factor somewhere.
Second, some results with vfio-pci seem to be fast (geennaam, smarkusg) while others are slower (white, nikitas and others). What does that depend on? Motherboard, GPU, BIOS, what else?
That's basically it: the speed of the motherboard and gfx card PCIe interfaces, which may be changed by BIOS or host OS settings. VRAM access speed through the PCIe interface may also differ between gfx cards, even ones using exactly the same GPU chip but with the rest of the card from different manufacturers.
If you compare completely different hardware you can get completely different and surprising results, for example:
- X5000: 533 MB/s write, 40 MB/s read using the CPU; 1,071 MB/s write, 995 MB/s read with RPA/WPA*
- X1000: 448 MB/s write, 26 MB/s read using the CPU; 1,407 MB/s write, 920 MB/s read with RPA/WPA*
- Sam460EX: 266 MB/s write, 52 MB/s read using the CPU; 572 MB/s write, 52 MB/s read with RPA/WPA*
Theoretically the X5000 should be much faster than the X1000, since it has faster PCIe slots, and the Sam460EX should be the slowest, but the results don't show that: for CPU reads (Copy From VRAM) the Sam460EX is even the fastest of these 3.
*) (Read|Write)PixelArray use CPU/SoC specific features like DMA engines, and therefore the results aren't comparable between different systems. Copy To/From VRAM only uses standard CPU read/write instructions instead, like a simple memcpy() for example, and is more comparable, but not 100% either, as it uses AltiVec on systems supporting it.
Quote:
Or does virtio-gpu actually map VRAM?
It would be quite useless if it didn't.
Quote:
If not then maybe 3D drivers do something that's very inefficient and create a bottleneck where SM502 does not have that.
Real SM502 is very slow, and it doesn't support anything which requires 3D features of a GPU, for example the Composite and CompositeSrcMask tests of GfxBench2D.
Quote:
The DMA engine is emulated on sam460ex but it's not used that frequently by AmigaOS.
There is probably much more, but the 3 parts where I know it's used are:
- DMA transfers of the SoC SATA controller.
- CopyMemQuick()/memcpy() for very large copies.
- Read/WritePixelArray() (maybe it's simply using CopyMemQuick(), no idea if it includes different/special code instead).
Quote:
and we can't easily test that as PCIe does not work in sam460ex emulation and the sam460ex firmware thinks that only Radeon cards can appear on PCI, so it won't init a newer card on PCI.
There was a recent U-Boot update for Sam4x0 which should add support for Radeon HD and RX.
Quote:
The DMA in common cases will end up in a memmove (see qemu/hw/ppc/ppc440_uc.c), which is probably optimised by the host libc, but I don't know if it's used for VRAM access [...] The RageMem tests show that not doing any tricks is fastest, so if this can be disabled that may help, unless VRAM access really needs some wider load/store.
Not optimizing it may be faster for DRAM accesses, but it's the opposite for VRAM. The main reason PCIe VRAM access is slow is the PCIe protocol overhead: an 8-bit byte read/write takes the same time as a 64-bit FPU read/write or a 128-bit AltiVec read/write, but the latter transfer more data in that time, and the larger a single transfer is, the faster it gets. On CPUs/SoCs with a DMA engine a single transfer can be very much larger than the max. 128 or 64 bits of CPUs without one, which can make VRAM accesses over PCIe much faster than CPU-based ones.
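A rough way to see the width effect yourself: the sketch below compares 8-bit and 64-bit store loops over one buffer. Run as-is it only measures ordinary RAM, where caches hide most of the difference; pointing the buffer at a mapped PCIe BAR instead (as in the sysfs sketch further down) is where the per-transaction overhead shows up:

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BUF_SIZE (16u * 1024 * 1024)

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

static void report(const char *name, double sec)
{
    printf("%-7s %8.1f MB/s\n", name, BUF_SIZE / sec / 1e6);
}

int main(void)
{
    void *buf = malloc(BUF_SIZE);
    double t;
    if (!buf) {
        return 1;
    }

    t = now_sec(); /* one transaction per byte */
    for (volatile uint8_t *p = buf, *e = p + BUF_SIZE; p < e; p++) {
        *p = 0;
    }
    report("8-bit", now_sec() - t);

    t = now_sec(); /* 8 bytes per transaction */
    for (volatile uint64_t *p = buf, *e = p + BUF_SIZE / 8; p < e; p++) {
        *p = 0;
    }
    report("64-bit", now_sec() - t);

    t = now_sec(); /* libc may use even wider (e.g. 128-bit) stores */
    memset(buf, 0, BUF_SIZE);
    report("memset", now_sec() - t);

    free(buf);
    return 0;
}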
joerg wrote:
@kas1e Other gfx systems, for example X11, avoid most VRAM accesses using a shadow frame buffer in DRAM.
I think that's almost never the case. It should only happen if the X11 driver is not accelerated (like vesa), or if a driver option is added to the xorg config file to disable acceleration. Also, the special "modesetting" driver seems to default to using "glamor" for acceleration (implementing X11 functions using GL), so no shadow buffer by default.
Having 3D accelerated gfx (even GUI libs use GL nowadays) with a shadow frame buffer in RAM: how would you do that (fast)? Quote:
AmigaOS doesn't support anything like that.
Thomas Richter has done some P96 gfx drivers for AOS 3.x which do it, I think sometimes with MMU tricks, but it would be better if there wasn't this P96 gfx system limitation that the gfx system itself insists on having direct access to VRAM. There should be an option for drivers to handle everything themselves, with the gfx system then interacting with VRAM only through driver calls (like driver->readpixels, driver->writepixels for fallback gfx functions that the driver does not "accelerate").
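A tiny sketch of what such a driver-mediated design could look like. Every name here (gfx_driver, read_pixels, write_pixels, etc.) is hypothetical and only illustrates the idea; nothing like this exists in P96/CGFX:

#include <stdint.h>
#include <stdlib.h>

struct gfx_driver {
    /* Accelerated entry point; NULL means "not accelerated". */
    int (*fill_rect)(void *ctx, int x, int y, int w, int h, uint32_t argb);

    /* Mandatory fallbacks: the ONLY way the gfx system touches VRAM, so
     * the driver is free to use DMA or a shadow copy internally. */
    void (*read_pixels)(void *ctx, int x, int y, int w, int h,
                        uint32_t *dst_dram);
    void (*write_pixels)(void *ctx, int x, int y, int w, int h,
                         const uint32_t *src_dram);
};

/* Gfx system side: use the accelerated path if present, otherwise render
 * into DRAM and push the result through the driver. */
static int gfx_fill_rect(struct gfx_driver *drv, void *ctx,
                         int x, int y, int w, int h, uint32_t argb)
{
    if (drv->fill_rect) {
        return drv->fill_rect(ctx, x, y, w, h, argb);
    }
    uint32_t *tmp = malloc((size_t)w * h * sizeof(*tmp));
    if (!tmp) {
        return -1;
    }
    for (int i = 0; i < w * h; i++) {
        tmp[i] = argb; /* render in DRAM, never touching VRAM directly */
    }
    drv->write_pixels(ctx, x, y, w, h, tmp);
    free(tmp);
    return 0;
}

/* Minimal usage example with a do-nothing driver. */
static void dummy_write_pixels(void *ctx, int x, int y, int w, int h,
                               const uint32_t *src)
{
    (void)ctx; (void)x; (void)y; (void)w; (void)h; (void)src;
}

int main(void)
{
    struct gfx_driver drv = { 0 };
    drv.write_pixels = dummy_write_pixels;
    return gfx_fill_rect(&drv, NULL, 0, 0, 4, 4, 0xff000000u);
}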
Quote:
I think we would need a better understanding of what causes the slowness first, before trying to improve QEMU to solve it. It may not even be just slow VRAM access, as some results were faster, so there's at least one other factor somewhere.
If the emulated AOS4 has access to the passed-through gfx card's VRAM, so does QEMU. I would try to hack a little VRAM benchmark into QEMU itself, if you know how to find out the real address to use for this (it's not going to be the same VRAM address as seen in the emulated AOS4, is it?)
That would show what the theoretical max speed is. Maybe the pass-through magic itself slows things down.
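Rather than patching QEMU, one possible shortcut on the Linux host is to mmap() one of the card's BARs through sysfs and time accesses to it directly. A sketch with an example PCI address and BAR, assuming a memory BAR that Linux allows mapping; it needs root, and it should not be run while a guest owns the card, since poking live VRAM can confuse the driver:

#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <time.h>

int main(void)
{
    /* Example path: BAR0 of the card at 0000:01:00.0. */
    const char *bar = "/sys/bus/pci/devices/0000:01:00.0/resource0";
    int fd = open(bar, O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }
    size_t len = (size_t)st.st_size;       /* BAR size */
    if (len > (16u << 20)) {
        len = 16u << 20;                   /* cap the test at 16 MB */
    }

    volatile uint32_t *vram = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, fd, 0);
    if (vram == MAP_FAILED) { perror("mmap"); return 1; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < len / 4; i++) {
        vram[i] = 0;                       /* 32-bit writes into VRAM */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("write: %.1f MB/s\n", len / sec / 1e6);

    munmap((void *)vram, len);
    close(fd);
    return 0;
}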
Quote:
I think that's almost never the case. It should only happen if the X11 driver is not accelerated (like vesa), or if a driver option is added to the xorg config file to disable acceleration.
Of course no sane OS less than 30 years old does anything like that. But the AmigaOS graphics.library was created more than 40 years ago for the "Amiga" computer, later relabelled to "A1000", with only 256 KB RAM ("Chip RAM", but the only available RAM on that system; "Fast RAM" extensions may have existed for the A1000 as well, but only got common with the A500 and newer systems). The A1000 was more or less comparable to ancient PCs with UMA (unified memory architecture), i.e. the same RAM is used by both the CPU and GPU instead of separate DRAM and VRAM, plus the blitter, Copper, Agnus, Paula, etc. co-processors, which were faster than the 68000 CPU.

Most AmigaOS(-like) RTG systems, at least all which are still in use, P96 and CGFX, just added some little support for gfx cards, but didn't reimplement the graphics.library core from scratch, like for example EGS and pOS did.
Quote:
Having 3D accelerated gfx (even GUI libs use GL nowadays) with a shadow frame buffer in RAM: how would you do that (fast)?
On AmigaOS you can't. Most 2D gfx isn't done by the GPU but by the CPU, including sub-pixel text rendering for example. Nearly all BOOPSI classes (gadgets, images, datatypes, etc.), no matter if ReAction/ClassAct or MUI/Zune, use CPU-based rendering as well, not the GPU.
@joerg You're still bringing up numbers from real SM502 and real machines. I don't care about those. What I care about is why it's slow in QEMU with vfio when it's fast with SM502 on the same machine, and whether it's really just the overhead of sending data through PCIe with 32-bit ops, or there's additional overhead somewhere because of emulation or because of using things like dcbz that slow things down. Even with just a simple 32-bit loop PCIe should be faster, but to confirm that we would need tests on the same machine from both host and guest, which we could only get once so far.

So I don't care if it works better on a real machine, or whether it uses DMA or not on a real machine; I just want to understand what happens with QEMU and how it is possible that the same setup is faster for some people than for others. What are the factors that make it usable for some while slow for others?

We got a test case for copy routines before, which now should run fast at least with user emulation (qemu-ppc), but system emulation (qemu-system-ppc) may still have issues. This would need further testing but I had no time for that. It's hard to follow all the results and missing details; maybe it would help if we had a table of all the tests so far showing what host motherboard, GPU, BIOS settings, Linux distro and QEMU version were used, but getting all these details seems quite hopeless.
Quote:
There is probably much more, but the 3 parts where I know it's used are:
- DMA transfers of the SoC SATA controller.
Not emulated on QEMU so this won't happen. Quote:
- CopyMemQuick()/memcpy() for very large copies.
- Read/WritePixelArray() (maybe it's simply using CopyMemQuick(), no idea if it includes different/special code instead).
These may be the only sources of DMA use on sam460ex, and as I said it does not happen often, as far as I remember. One could add logging to the DMA controller emulation to see, but I think it booted without needing any of it and only needed it for some apps; it could be that with a Radeon driver it would be used more.
Quote:
There was a recent U-Boot update for Sam4x0 which should add support for Radeon HD and RX.
I've looked at that, but it still seems to check which bus the device is connected to to decide if it's a RadeonHD or an older card, so it would think a card on PCI cannot be a RadeonHD and take the wrong path. I think it should check device IDs instead, but maybe this was simpler, as there are so many IDs and they may not be sorted.
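For illustration, an ID-table check could look like the sketch below. The ranges in it are made-up placeholders, NOT real Radeon device IDs, and this is not the actual U-Boot code:

#include <stdint.h>
#include <stdbool.h>

struct id_range {
    uint16_t first, last;
};

/* Hypothetical: ranges of device IDs known to be RadeonHD or newer.
 * Placeholder values only; a real table would come from the driver docs. */
static const struct id_range radeonhd_ids[] = {
    { 0x9400, 0x94ff },
    { 0x6600, 0x67ff },
};

static bool is_radeonhd(uint16_t device_id)
{
    for (unsigned i = 0;
         i < sizeof(radeonhd_ids) / sizeof(radeonhd_ids[0]); i++) {
        if (device_id >= radeonhd_ids[i].first &&
            device_id <= radeonhd_ids[i].last) {
            return true;
        }
    }
    return false; /* treat unknown IDs as legacy Radeon */
}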
Quote:
Not optimizing it may be faster for DRAM accesses, but it's the opposite for VRAM. The main reason PCIe VRAM access is slow is the PCIe protocol overhead: an 8-bit byte read/write takes the same time as a 64-bit FPU read/write or a 128-bit AltiVec read/write, but the latter transfer more data in that time, and the larger a single transfer is, the faster it gets. On CPUs/SoCs with a DMA engine a single transfer can be very much larger than the max. 128 or 64 bits of CPUs without one, which can make VRAM accesses over PCIe much faster than CPU-based ones.
This is again true for real hardware but may not be true for QEMU. A simple loop is translated to host code and can run fast, while something using special registers could end up calling into emulation a lot and run slower. Without knowing exactly what happens I can't tell which path to trace to check for that, but it's possible this also plays a role. Even with DRAM, the Tricky test of RageMem using dcbz is the worst on QEMU, so if the same is used for VRAM it may not help.
But all that does not explain why it is faster on some machines and slower on others, so there might be something else too, which I'd like to identify. Maybe some GPUs are better for this than others (newer ones especially seem worse than older ones), so we need more tests with more GPUs, and we need to document the circumstances so we can compare them.
Quote:
If the emulated AOS4 has access to the passed-through gfx card's VRAM, so does QEMU. I would try to hack a little VRAM benchmark into QEMU itself, if you know how to find out the real address to use for this (it's not going to be the same VRAM address as seen in the emulated AOS4, is it?)
That would show what the theoretical max speed is. Maybe the pass-through magic itself slows things down.
QEMU also knows the guest addresses, but all the "pass-through magic" is really just calling Linux to pass through the BAR addresses and set up the IOMMU to map the card's resources into the guest's address space, so there's not much QEMU does with it. QEMU does not map the card itself, so it can't really do a benchmark without breaking the guest, but it would be possible to set up a Linux guest and run a benchmark from there, to see if this is something specific to AmigaOS or happens with all guests, in which case it could be in QEMU.
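(For reference, with vfio-pci such a Linux guest would get the card with something like -device vfio-pci,host=0000:01:00.0, the PCI address being an example, and the sysfs mmap benchmark sketched earlier could then be run both on the host and inside that guest for a direct comparison on identical hardware.)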