Re: QEMU GPU vfio-pci pass through
Posted by balaton

@Hans
If it were only a problem with DMA, we would see the CPU and DMA read results being the same, but they should still be comparable to real-machine CPU results, not much lower. (We do have CPU=DMA, but I think you also said that on amigaone, pegasos2 and sam460ex DMA may not be used, so that's probably expected.) The problem is that the CPU results are much lower than they should be. One could test the same card on the host with the x11perf command, then set up pass through and test the same card from a Linux guest with the same command, to see whether the limitation is on the host side or the guest side. Then the same test with AmigaOS would show whether it depends on the guest OS or only on QEMU.

Re: QEMU GPU vfio-pci pass through
Posted by Hans

@balaton

Quote:
If it were only a problem with DMA, we would see the CPU and DMA read results being the same, but they should still be comparable to real-machine CPU results, not much lower. (We do have CPU=DMA, but I think you also said that on amigaone, pegasos2 and sam460ex DMA may not be used, so that's probably expected.)

DMA is not used on the older platforms, which is why the WritePixelArray and ReadPixelArray results are almost the same as CPU RAM<=>VRAM results. Otherwise the DMA results would likely be higher, as is shown in the results from platforms such as the Sam4x0, A1-X1000, etc.

Quote:
The problem is that the CPU results are much lower than they should be. One could test the same card on the host with the x11perf command, then set up pass through and test the same card from a Linux guest with the same command, to see whether the limitation is on the host side or the guest side. Then the same test with AmigaOS would show whether it depends on the guest OS or only on QEMU.

Does anyone have a suitable Linux test tool for CPU & DMA RAM<=>VRAM copy speeds? It would be very useful to see those results.
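
For the CPU half, such a tool can be sketched on Linux by mmap()ing the card's VRAM BAR from sysfs and timing a plain memcpy(). A minimal, untested sketch; the device path and BAR index are assumptions (check lspci -v for the real ones), it needs root, and it only covers the CPU side, not DMA:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>

int main(void)
{
    /* Hypothetical device address and BAR index; on Radeon cards the
     * VRAM BAR is often resource0 or resource1. */
    int fd = open("/sys/bus/pci/devices/0000:01:00.0/resource1",
                  O_RDWR | O_SYNC);
    if (fd < 0) { perror("open"); return 1; }

    size_t len = 16 << 20;                       /* 16 MiB test block */
    void *vram = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (vram == MAP_FAILED) { perror("mmap"); return 1; }
    void *ram = malloc(len);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    memcpy(ram, vram, len);                      /* CPU read from VRAM */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("CPU read from VRAM: %.1f MiB/s\n", (len >> 20) / sec);
    return 0;
}

The DMA side can't be measured this simply from userspace; that needs driver support, or indirect tests like the x11perf runs discussed below.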

One possible cause could be the code generated by QEMU's JIT. If the code doesn't have the right access patterns, then the PCIe controller won't be able to merge requests, and it'll literally be sending requests for 8/16/32/64 bits at a time. At 1-8 bytes, the efficiency is very low.
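
As a rough illustration of that point (a hypothetical sketch, not code from any driver): the two loops below copy the same amount of data from an uncached BAR mapping, where each volatile load becomes its own bus read. The byte version issues eight times as many requests as the 64-bit version, each with the worst possible payload-to-overhead ratio:

#include <stdint.h>
#include <stddef.h>

/* Byte loop: every iteration is a separate 1-byte read request. */
void copy_narrow(volatile const uint8_t *vram, uint8_t *dst, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = vram[i];
}

/* 64-bit loop: same data, one eighth the number of requests. */
void copy_wide(volatile const uint64_t *vram, uint64_t *dst, size_t n64)
{
    for (size_t i = 0; i < n64; i++)
        dst[i] = vram[i];
}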

I've managed to find the graph showing PCIe efficiency vs packet size. You can find it on page 2 of this doc. The graph goes down to 64-byte payload only, so you'll have to extrapolate down for the efficiency of 1-8 byte transfers.

Hans

http://hdrlab.org.nz/ - Amiga OS 4 projects, programming articles and more.
https://keasigmadelta.com/ - more of my work
Re: QEMU GPU vfio-pci pass through
Posted by joerg

@Hans
You could build special QEmu Sam460 VFIO versions of your drivers with GART enabled; there should be no cache coherency problems in emulation. But as long as QEmu emulates the 4x0 DMA engine with the CPU (memmove()) instead of using host DMA, it may not help much.
DMA-enhanced AmigaOS functions such as graphics.library (Read|Write)PixelArray(), exec.library CopyMem(Quick)(), utility.library MoveMem(), C library bcopy(), etc. would still be extremely slow accessing VRAM over PCIe with the QEmu emulated Sam460.

Re: QEMU GPU vfio-pci pass through
Posted by Georg

@Hans
Quote:
Hans wrote: Does anyone have a suitable Linux test tool for CPU & DMA RAM<=>VRAM copy speeds? It would be very useful to see those results.


The X11 GetImage and ShmGetImage functions are the equivalent of ReadPixelArray in AOS land.

The X11 PutImage and ShmPutImage functions are the equivalent of WritePixelArray in AOS land.

"Shm" means shared memory, so memory does not need to be copied between X11 client app and X11 server.

This can be benchmarked in Linux with "x11perf -getimage500", "x11perf -shmget500", "x11perf -putimage500", "x11perf -shmput500".

With a standard X11 gfx driver (like nvidia) this will use some kind of DMA.

Using the "vesa" X11 driver, this should use the CPU only. But as said in a previous post, it will by default likely use a shadow buffer in RAM to avoid slow reads from VRAM, so you should set ShadowFB to 0 in the X11 config if you want read results from actual VRAM. Also disable the compositor (there might be a shortcut key for that) so that the PutImage calls write directly to the screen/VRAM, instead of to a window pixmap which in a second step gets composited to the screen/VRAM.
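
For reference, the relevant xorg.conf fragment would look something like this (a sketch; the identifier is arbitrary, but ShadowFB is a documented option of the vesa driver):

Section "Device"
    Identifier "Card0"
    Driver     "vesa"
    Option     "ShadowFB" "0"    # force direct VRAM access for the test
EndSection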


Edited by Georg on 2024/6/11 8:32:17
Re: QEMU GPU vfio-pci pass through
Posted by balaton

@Hans
QEMU won't try to merge requests or set up host DMA for CPU accesses; that's probably out of scope for a JIT compiler anyway. It tries to do what the guest code does and translate that to host code, so if the guest code does byte/word reads, the host code will do the same. If the guest code does a block transfer or uses the GPU's DMA, then that should be passed on, unless the guest uses something that's not implemented and falls back to byte reads. Without knowing what the guest does, or which operation makes it slow, it's hard to find the cause and fix it. Is there a way to enable DMA in the driver on amigaone/pegasos2 to test whether that would work better? It was suggested before that at least on pegasos2 the chipset should handle cache coherency, and on QEMU it should not be a problem as long as it works on the host side, so it might be worth a test at least. If the sam460 version used DMA then we could test that, but there seem to be some problems with PCI emulation on that machine which would need to be fixed first; I have no documentation on that, so I don't know what is wrong with it. I've tried to fix PCIe emulation on sam460ex but it would need more work, for which I have no time now.

@joerg
A DMA-enabled pegasos2 build (or an option to enable it) would be more useful, because sam460ex does not work with vfio pass through currently, and it's also slower than machines with a G3/G4 due to the embedded MMU using a lot of exceptions, which is slow on QEMU. I have a series posted on the QEMU list in which I tried to optimise it a bit, but it did not make it much faster, so it would still run slower than amigaone or pegasos2. I'm also not sure the CPU's DMA engine has anything to do with this on sam460ex, as the driver would use the GPU's DMA, which should work with vfio as long as the IOMMU translates the addresses correctly; but if the driver does not use it, QEMU cannot find that out and do it instead of the guest. QEMU only sees guest code that it translates to host code, but doesn't know what that code is doing. If the code accesses emulated hardware QEMU can know, but for a passed-through GPU it just sets up the BARs of the card and then the guest talks to it directly. (I may be wrong as I don't know how vfio works in detail, but I don't think it knows what the guest says to the GPU; it just maps the GPU BARs into the guest PCI bus and the rest works like on a real machine.)

Re: QEMU GPU vfio-pci pass through
Posted by joerg

@balaton
G3 and G4 CPUs are faster for calculations. For example, unaligned (4-byte, maybe even 2-byte) 64-bit FPU loads/stores work, whereas 4x0 FPU loads/stores cause alignment exceptions if an access crosses a cache-line boundary (not possible if correctly aligned, i.e. 8 bytes for a double, but there was a lot of old AmigaOS software using only 4-byte or even 2-byte alignment for doubles). They also have a better MMU that doesn't require a lot of exceptions like the TLB cache on 4x0, G4 CPUs have AltiVec which can make a big difference, etc.
But G3 and G4 CPUs are much slower than 4x0 for gfx card VRAM accesses because they don't have a DMA engine.

Some more GfxBench2D results from real hardware:
https://www.hdrlab.org.nz/benchmark/gf ... 2d/OS/AmigaOS/Result/1652 (Sam460ex, Radeon HD 7700 Series)
Copy to VRAM 300.43
Write Pixel Array 647.84
Copy from VRAM 54.09
Read Pixel Array 99.78

https://www.hdrlab.org.nz/benchmark/gf ... 2d/OS/AmigaOS/Result/2712 (A1222, Radeon RX Polaris11)
Copy to VRAM 146.82
Write Pixel Array 432.25
Copy from VRAM 6.75
Read Pixel Array 332.63

https://www.hdrlab.org.nz/benchmark/gf ... 2d/OS/AmigaOS/Result/2475 (AmigaOne, Radeon HD 7800 Series)
Copy to VRAM 50.71
Write Pixel Array 50.77
Copy from VRAM 6.33
Read Pixel Array 3.41

https://www.hdrlab.org.nz/benchmark/gf ... 2d/OS/AmigaOS/Result/2522 (Sam440EP, Radeon HD 7800 Series)
Copy to VRAM 38.85
Write Pixel Array 97.32
Copy from VRAM 26.56
Read Pixel Array 26.55

The differences between Copy to/from VRAM (CPU) and Write/Read Pixel Array (DMA, except on the AmigaOne of course) are independent of whether the gfx driver uses GART or not; of the above results, only the A1222 one can use GART.

In the A1222 result, the CPU Copy from VRAM is as slow as the fastest QEmu Pegasos2 result so far with a VFIO gfx card.
DMA Read Pixel Array is much faster on the A1222, but since there is no DMA Read Pixel Array on the QEmu Pegasos2, it uses the CPU and has the same slow speed as Copy from VRAM.

Of course the AmigaOne result is additionally slower because it only has PCI and a PCI->PCIe bridge has to be used, but that's the same on the Sam440EP, and only the relative differences between DMA (none on the AmigaOne) and CPU VRAM copies are relevant.


Edited by joerg on 2024/6/11 14:55:16
Re: QEMU GPU vfio-pci pass through
Posted by balaton

@joerg
What you write about real G4 and 460EX hardware doesn't apply to QEMU, so it does not explain the issue. I think we previously did tests with -cpu 750cxe vs the default -cpu 7457 on pegasos2 to see how much effect AltiVec has, and we got the same results, so it seems it does not matter that much (at least in general; for some specific apps it could matter, but I'm not sure it does on QEMU, and both seem to work about the same now). Alignment and cache differences between real chips also don't matter: on QEMU it's all the same CPU emulation for all PPC CPUs; the only differences are the supported instructions and the MMU/exception model. The G3/G4 have an MMU where the guest passes the translations to the CPU in a table that can be looked up on a TLB miss without exiting guest code, but embedded PPC needs to take an exception for that, and this leads to additional synchronisation that makes it slower. This may be optimised a bit more, but I think it will always make sam460ex slower than amigaone/pegasos2, or at least it is slower for now.

I don't know whether the DMA engine of the CPU is used or not, but it only exists on sam460ex, and we could not test it as that machine does not work with the card we tried. So it's both unproven and would not help the amigaone/pegasos2 machines that are otherwise faster; I'd rather find out why the passed-through card is slow on those and fix that instead.

Re: QEMU GPU vfio-pci pass through
Posted by Hans

@balaton
Quote:
QEMU won't try to merge requests or set up host DMA for CPU accesses...

Of course it doesn't. It's the CPU and/or PCIe controller that does the merging. Whether it does or not depends on what code the JIT compiler generates.

Quote:
Is there a way to enable DMA in the driver on amigaone/pegasos2 to test whether that would work better?

I could create a special test version of the driver, as Joerg suggested. Let's get some benchmark results for the host machines first, so we have some reference numbers.

Bear in mind that the driver's GART is only used for command submission and transfers done by 3D drivers. The graphics.library will still use CPU-based transfers. And, as I said, the graphics.library does a lot of CPU-based rendering.

Hans

http://hdrlab.org.nz/ - Amiga OS 4 projects, programming articles and more.
https://keasigmadelta.com/ - more of my work
Re: QEMU GPU vfio-pci pass through
Posted by joerg

@balaton
Quote:
I don't know whether the DMA engine of the CPU is used or not, but it only exists on sam460ex, and we could not test it as that machine does not work with the card we tried. So it's both unproven and would not help the amigaone/pegasos2 machines that are otherwise faster; I'd rather find out why the passed-through card is slow on those and fix that instead.
VRAM access is slow because it's accessed in tiny parts, at best 128 bits (if the AltiVec registers are emulated with 128 bit host registers) on G4, and only 64 bits on G3 and 4x0.

Real Sam4x0, X1000, X5000 and A1222 use larger DMA transfers instead. Less PCIe overhead = much faster speed.

There are 2 ways DMA is used:
1. GPU accessing DRAM with DMA (GART).
2. CPU accessing VRAM with DMA.

1. should work with QEmu VFIO gfx cards, if Hans creates special QEmu drivers with GART enabled.
But it can only help if the GPU does all rendering, which may be the case for Warp3D and the GPU video decoding and playback library.
It doesn't help at all, except for faster command submission, for the AmigaOS mixed CPU + blitter gfx API which was created 40 years ago for 1-8 bit planar screen modes.

2. is impossible on G3 and G4 CPUs, real and emulated.
Sam4x0 real hardware uses DMA, but QEmu emulates the 4x0 DMA with memmove(), and the result is that the much larger DMA VRAM accesses the guest OS is using are split into tiny 64-bit parts accessed by the host CPU, which is extremely slow.
There is nothing you can do for G3/G4 AmigaOne/Pegasos2 emulation, but if you can fix the 4x0 DMA emulation and change it to use host DMA instead of the host CPU, and fix the other problems in the Sam460 emulation to get it working at all, VFIO would be much faster.
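
To make the difference concrete, here is a simplified model (not QEMU's actual code) of what software-emulated DMA amounts to: the copy is performed by the host CPU thread, so a passed-through VRAM BAR still sees a stream of small host loads/stores rather than one large bus transfer.

#include <string.h>
#include <stddef.h>

/* Toy model of a guest-programmed DMA channel; names are made up. */
struct dma_chan {
    void  *src;   /* guest RAM, already translated to a host pointer   */
    void  *dst;   /* e.g. the mapped VRAM BAR of a passed-through card */
    size_t len;
};

/* "DMA" done in software: just a CPU copy.  The host CPU walks the
 * BAR in word-sized accesses, which is exactly the slow case above. */
void dma_kick_emulated(struct dma_chan *c)
{
    memmove(c->dst, c->src, c->len);
}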

Re: QEMU GPU vfio-pci pass through
Posted by Hans

@all

So has anyone run any RAM<=>VRAM benchmarks for the host machine yet? It would be useful to know what the CPU and DMA performance is supposed to look like.

Hans

http://hdrlab.org.nz/ - Amiga OS 4 projects, programming articles and more.
https://keasigmadelta.com/ - more of my work
Re: QEMU GPU vfio-pci pass through
Posted by Georg

@Hans
Quote:
Hans wrote: So has anyone run any RAM<=>VRAM benchmarks for the host machine yet? It would be useful to know what the CPU and DMA performance is supposed to look like.


Pretty old computer here (ASRock Z97 Pro3, i5-4590 3.3 GHz, 16 GB RAM, Nvidia GeForce RTX 2060, OpenSuse Leap 15.4 64-bit):

Nvidia binary driver:
    read  ( 6850.0/sec): ShmGetImage 500x500 square
    write ( 3860.0/sec): ShmPutImage 500x500 square

Vesa driver with shadowfb disabled:
    read  (   21.4/sec): ShmGetImage 500x500 square
    write (  321.0/sec): ShmPutImage 500x500 square

Vesa driver with shadowfb enabled (default):
    read  (14500.0/sec): ShmGetImage 500x500 square
    write (13600.0/sec): ShmPutImage 500x500 square


SysBench memory benchmark says read (7896.88 MiB/sec) and write (6098.10 MiB/sec).

Looking through the sources a bit, it may be that the X11 vesa driver uses memcpy() for GetImage (~ ReadPixelArray) and PutImage (~ WritePixelArray), so it might copy 64 bits at a time, not just 32 bits (~ 1 pixel) as one might expect.
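
A rough sketch of what such a memcpy()-based GetImage path amounts to (names are made up; the real vesa driver code differs): one memcpy() per row lets the C library use the widest loads it supports, instead of one load per pixel.

#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Copy a w x h rectangle of 32-bit pixels out of a linear framebuffer
 * (~ ReadPixelArray).  Pitches are in bytes. */
void get_image_rect(const uint8_t *fb, size_t fb_pitch,
                    uint8_t *out, size_t out_pitch,
                    int x, int y, int w, int h)
{
    const uint8_t *src = fb + (size_t)y * fb_pitch + (size_t)x * 4;
    for (int row = 0; row < h; row++) {
        memcpy(out, src, (size_t)w * 4);   /* one wide copy per row */
        src += fb_pitch;
        out += out_pitch;
    }
}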

I haven't checked, but with shadowfb enabled it may be that not every write/put (~ WritePixelArray) gets copied to real VRAM; updates from the shadowfb to the real framebuffer may only happen at intervals (like once per frame).

Re: QEMU GPU vfio-pci pass through
Posted by joerg

@Georg
Quote:
Vesa driver with shadowfb disabled:
    read  ( 21.4/sec): ShmGetImage 500x500 square
    write (321.0/sec): ShmPutImage 500x500 square

That's 3.5-4.8 times faster than geennaam's QEmu Pegasos2 OS4 results with a VFIO Radeon R9 270x:
Copy from VRAM    6.09
Copy to VRAM    67.15
but the differences may just be due to different PCIe versions of the motherboard and/or gfx card, the QEmu overhead, and the different VRAM access speeds of the different gfx cards.

Your CPU results are slower, but the DMA ones much faster, than the fastest GfxBench2D result on an X5000:
Copy from VRAM    40.33
Copy to VRAM    533.72
Read Pixel Array    995.02
Write Pixel Array    1,071.60
and the fastest one on an X1000:
Copy from VRAM    41.77
Copy to VRAM    438.44
Read Pixel Array    398.56
Write Pixel Array    1,415.46

There are big VRAM speed differences between different gfx cards, especially when accessing VRAM with the CPU, so the numbers aren't really comparable anyway, even without the other differences.

Quote:
Looking through the sources a bit, it may be that the X11 vesa driver uses memcpy() for GetImage (~ ReadPixelArray) and PutImage (~ WritePixelArray), so it might copy 64 bits at a time, not just 32 bits (~ 1 pixel) as one might expect.
It's the same on AmigaOS4/PPC, except that on CPUs with AltiVec 128 bits = 4 pixels are copied at a time. On G3, 4x0 and P5020 CPUs it's 64 bits = 2 pixels as well.


Edited by joerg on 2024/6/15 6:35:12
Edited by joerg on 2024/6/15 6:42:20
Edited by joerg on 2024/6/15 6:55:40
Edited by joerg on 2024/6/15 7:41:52
Re: QEMU GPU vfio-pci pass through
Posted by balaton

@joerg
Quote:
It's the same on AmigaOS4/PPC, except that on CPUs with AltiVec 128 bits = 4 pixels are copied at a time. On G3, 4x0 and P5020 CPUs it's 64 bits = 2 pixels as well.

In one of the profiles from @nikitas I've seen a lot of AltiVec instructions. I've also asked for a test with -cpu 750cxe to find out if that makes a difference, but haven't seen the results yet. It could be that in QEMU some AltiVec instructions are emulated rather than translated, and so fall back to byte reads and writes; in that case it may be faster to use a G3 without AltiVec, but this has to be tested.

From the results above, it seems the AmigaOS driver should really implement a shadow FB and copy to the gfx card in blocks to avoid this problem. Since vfio passes the gfx card through to the guest, QEMU can't easily implement this itself. It might be possible to virtualise a gfx card on top of vfio, but that does not seem easy to do in a generic way, and doing it for every gfx card is not a way to go either, so I can't think of a better solution than finding a card that's fast at reading/writing VRAM, or fixing it in the AmigaOS driver.
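
The shadow-FB idea as a minimal sketch (a hypothetical structure, not Picasso96 or driver code): all rendering goes to a copy in system RAM, and only dirty rows are pushed to VRAM in large sequential writes, so VRAM is written in blocks and never read.

#include <stdint.h>
#include <stddef.h>
#include <string.h>

struct shadow_fb {
    uint8_t *shadow;   /* system RAM copy: fast CPU access     */
    uint8_t *vram;     /* mapped VRAM: slow to read over PCIe  */
    size_t   pitch;    /* bytes per row, same for both buffers */
    int      height;
    uint8_t *dirty;    /* one flag per row, set by renderers   */
};

/* Called once per frame (or on a timer): flush dirty rows with big
 * sequential writes; the bus never has to service a read. */
void shadow_flush(struct shadow_fb *s)
{
    for (int row = 0; row < s->height; row++) {
        if (!s->dirty[row])
            continue;
        memcpy(s->vram + (size_t)row * s->pitch,
               s->shadow + (size_t)row * s->pitch, s->pitch);
        s->dirty[row] = 0;
    }
}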

Re: QEMU GPU vfio-pci pass through
Posted by Hans

@Georg

How do those ShmGetImage/ShmPutImage results translate to MiB/s? What's the pixel format of the 500x500 blocks?

@balaton

Quote:
From the results above, it seems the AmigaOS driver should really implement a shadow FB and copy to the gfx card in blocks to avoid this problem. Since vfio passes the gfx card through to the guest, QEMU can't easily implement this itself. It might be possible to virtualise a gfx card on top of vfio, but that does not seem easy to do in a generic way, and doing it for every gfx card is not a way to go either, so I can't think of a better solution than finding a card that's fast at reading/writing VRAM, or fixing it in the AmigaOS driver.


Unfortunately, the Picasso96 driver API does *NOT* give the driver control over transfers. The graphics library copies bitmaps to/from VRAM at will, and there's no way for the driver to intercept this.

Hans

http://hdrlab.org.nz/ - Amiga OS 4 projects, programming articles and more.
https://keasigmadelta.com/ - more of my work
Re: QEMU GPU vfio-pci pass through
Posted by Georg

@Hans

4 bytes per pixel (xdpyinfo says "depth: 32 planes"). So 500x500x4 = 1,000,000 bytes per operation, meaning the results/sec figures are equal to millions of bytes per second.

Re: QEMU GPU vfio-pci pass through
Posted by Hans

@Georg

Thanks. GfxBench2D results are in MiB/s, so we have to divide by 1.048576 (1 MiB = 1.048576 million bytes). After adjustment, your results are:

Nvidia binary driver:
    read  ( 6532.7 MiB/sec): ShmGetImage 500x500 square
    write ( 3681.2 MiB/sec): ShmPutImage 500x500 square

Vesa driver with shadowfb disabled:
    read  (   20.4 MiB/sec): ShmGetImage 500x500 square
    write (  306.1 MiB/sec): ShmPutImage 500x500 square

Vesa driver with shadowfb enabled (default):
    read  (13828.3 MiB/sec): ShmGetImage 500x500 square
    write (12970.0 MiB/sec): ShmPutImage 500x500 square


Geennaam's QEMU results are:
Copy from VRAM    6.09
Copy to VRAM    67.15


So your results are indeed approx. 3.3-4.6x faster than his. No idea how much of this is due to hardware differences between your system and his, though. Ideally we'd run both tests on the exact same hardware.

Are you able to run the same tests with Linux running on QEMU + VFIO? Ideally both a native version of Linux, and Linux PowerPC. That could tell us whether VFIO adds any overhead, or if the problem might be the host code generated by QEMU's JIT.

Hans

http://hdrlab.org.nz/ - Amiga OS 4 projects, programming articles and more.
https://keasigmadelta.com/ - more of my work