Qemu + VFIO GPU RadeonRX 550 + AmigaOS4 extremely slow

Re: Qemu + VFIO GPU RadeonRX 550 + AmigaOS4 extremely slow

Posted on: 2024/6/22 17:40 #101

Just popping in

@balaton

Running the same test under the same conditions as for my latest test (see above), this time using -cpu 750cxe produced slightly worse results:

https://hdrlab.org.nz/benchmark/gfxbench2d/OS/AmigaOS/Result/2795

Re: Qemu + VFIO GPU RadeonRX 550 + AmigaOS4 extremely slow

Posted on: 2024/6/22 17:51 #102

Quite a regular

@joerg
Quote:

But even the slowest 64 bit double implementations use DCBA (DCBZ as replacement on CPUs not supporting DCBA) and DCBT, which may make it slower on QEmu than on real hardware.

DCBZ is the only potentially problematic op. All the others are no-ops and generate no host code, as there's no cache emulated to operate on. DCBZ needs to be handled because it has to zero the memory. (It was found to be an issue when running PPC32 code on PPC64 with KVM, such as 32-bit code on a G5 with KVM PR: the cache line is larger there, so it would zero too much memory and needs to be patched and emulated. This should not be a problem for TCG, which emulates it correctly, but it may be slower than not using it.)

But DCBZ is problematic, and it explains why copyFromVRAMAltivec is the slowest. Removing the dcbz call from it makes it run in 2.05 sec, so it looks like optimising dcbz might help, as a lot of OSes use it for zeroing memory (it was found to be used by MacOS as well). However, it's questionable why this copy routine uses dcbz when it then overwrites the values, so maybe it should use something that does not zero, which would be a no-op on QEMU. Anyway, I knew about this issue with dcbz so it's nothing new; I just need to eventually do something about it. I think it could be reimplemented to use TCG vector ops, which would then be compiled to host vector ops when available. I don't know how that works in QEMU, but if somebody is interested they could have a look and submit a patch, and then might get advice on how to do it better. Or I may ask on the list and maybe somebody can make a patch for it.
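To illustrate what an emulator has to do for dcbz, here is a minimal sketch in C. It is not QEMU's code: the toy guest_ram array and the emulate_dcbz helper are made up for this example. It only shows the basic semantics (round the address down to the start of its cache line, then zero the whole line) and why a larger line size, 128 bytes on a G5 versus 32 bytes on G3/G4, zeroes more memory than 32-bit code expects.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Toy guest RAM (assumption: a flat array; a real emulator maps guest
 * addresses to host memory in a more involved way). */
enum { GUEST_RAM_SIZE = 1024 };
static uint8_t guest_ram[GUEST_RAM_SIZE];

/*
 * Hypothetical helper: zero the cache line that contains guest address ea.
 * line_size must be a power of two, e.g. 32 for G3/G4 or 128 for a G5.
 */
static void emulate_dcbz(uint32_t ea, uint32_t line_size)
{
    assert(line_size != 0 && (line_size & (line_size - 1)) == 0);

    /* Round down to the first byte of the cache line. */
    uint32_t start = ea & ~(line_size - 1);

    /* The toy model only handles lines that lie entirely inside guest_ram. */
    assert(start <= GUEST_RAM_SIZE - line_size);

    memset(&guest_ram[start], 0, line_size);
}
```

With a 32-byte line, emulate_dcbz(0x45, 32) clears 0x40..0x5F only. With a 128-byte line the same address clears 0x00..0x7F, which is the "zero too much memory" effect mentioned above for 32-bit code on a G5.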

Edited by balaton on 2024/6/22 18:36:51
Edited by balaton on 2024/6/22 18:50:12

Re: Qemu + VFIO GPU RadeonRX 550 + AmigaOS4 extremely slow

Posted on: 2024/6/22 18:10 #103

Just can't stay away

@nikitas
Why -cpu 750cxe?

Only the AmigaOne SE, which was never publicly released and was only available to OS4 developers and beta-testers, had a 750CXe CPU soldered on the motherboard.
AmigaOne XE and µA1 instead used replaceable CPU modules (similar to, but incompatible with, the PowerPC Mac Megarray CPU socket) with G3 750FX and 750GX, or G4 7451 and 7455 (original) or 7457 (ACube CPU module repair service), the V'Ger and Apollo versions.

There may still be some, very slow, workarounds left in AmigaOS 4.x for the broken A1 SE hardware.

Re: Qemu + VFIO GPU RadeonRX 550 + AmigaOS4 extremely slow

Posted on: 2024/6/22 18:34 #104

Just popping in

@joerg

It was requested by @balaton after examining a "perf" report I sent. He can explain the details on this as I don't have the required QEMU & PowerPC ISA knowledge to answer.

In the meantime, I wonder why the overhead in @balaton's results is so big. Is it because the TLB mechanism is software emulated maybe?

Re: Qemu + VFIO GPU RadeonRX 550 + AmigaOS4 extremely slow

Posted on: 2024/6/22 18:40 #105

Quite a regular

@nikitas
I could have said use -cpu g3 but I know some of @joerg's code has some CPU checks which break with an unknown CPU that wasn't in a real machine. I knew 750cxe was in one machine and I think it was tested by others before and found working but looks like it may also have some problems. So try those @joerg suggested but I don't think it would make much difference.

The reason to test with a G3 CPU is just to see whether using AltiVec may be worse than not using it, as the G3 has no AltiVec but the G4 does. The test with @Hans's copy routines showed that AltiVec is worse for these, but the full test may have other uses where it is better, so overall it may help. That's what testing with both the default G4 and a G3 without AltiVec is supposed to show.

Re: Qemu + VFIO GPU RadeonRX 550 + AmigaOS4 extremely slow

Posted on: 2024/6/22 18:44 #106

Just can't stay away

@nikitas
Quote:

Is it because the TLB mechanism is software emulated maybe?

TLB should only be an issue on the embedded 440/460 CPUs where it's handled by CPU exceptions in software, on G2/G3/G4 it's done in hardware by the CPU instead.
Emulation is something completely different, but balaton wrote that one of the reasons why Sam460 emulation is slower than AmigaOne/Pegasos2 G3/G4 emulation is because of the TLB exceptions.

Re: Qemu + VFIO GPU RadeonRX 550 + AmigaOS4 extremely slow

Posted on: 2024/6/22 19:05 #107

Quite a regular

@nikitas, @joerg
This test with VRAMCopy was run in user mode emulation, which does not use a softmmu, so the overhead probably comes only from managing translation blocks (and dcbz as discussed above). In full system emulation there is a softmmu, but amigaone and pegasos2 use CPUs with a hash MMU, where the guest passes translations in a table in memory, so QEMU can look up the new translation without running guest code and a TLB miss is not much additional overhead.

With sam460ex, the PPC440 has a guest-managed MMU, which means that when a TLB entry is missing it needs to raise an exception and have the guest set the new translation. Additionally, when a translation changes it also invalidates all translated blocks for the old translation and has to recompile them. Previously all blocks were invalidated on any TLB write, which made it very slow; this was improved in recent versions to only invalidate the entry that changed. Maybe there's still some improvement possible, but the exception overhead cannot be removed, so PPC440 will always be slower than G3/G4 because of this. This is shown by the stream benchmark posted by @sailor in an A1222 SPE thread. I've tried to optimise the MMU code in QEMU a bit based on that benchmark, but I could not get it to run much faster because of the above. I still have those patches on the list, not merged yet, but they would only give a marginal improvement in speed. The main point was to clean up the code to make it easier to understand and maybe optimise further in the future, but because of the above I don't think it can be optimised a lot.
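The difference between "invalidate everything on any TLB write" and "invalidate only the entry that changed" can be sketched with a toy model in C. This is not QEMU's data structure: struct cached_entry, the cache array and all function names are invented for the illustration. Each slot stands for something that was expensive to build (such as translated code) and is tied to one guest page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { CACHE_SLOTS = 8 };

/* Hypothetical cached item tied to one guest page. */
struct cached_entry {
    bool     valid;
    uint32_t guest_page;
};

static struct cached_entry cache[CACHE_SLOTS];

/* Fill every slot with a valid entry for pages 0..CACHE_SLOTS-1. */
static void fill_cache(void)
{
    for (size_t i = 0; i < CACHE_SLOTS; i++) {
        cache[i].valid = true;
        cache[i].guest_page = (uint32_t)i;
    }
}

static size_t count_valid(void)
{
    size_t n = 0;
    for (size_t i = 0; i < CACHE_SLOTS; i++) {
        n += cache[i].valid ? 1 : 0;
    }
    return n;
}

/* Coarse strategy: any TLB write drops everything. Returns entries dropped. */
static size_t invalidate_all(void)
{
    size_t dropped = 0;
    for (size_t i = 0; i < CACHE_SLOTS; i++) {
        if (cache[i].valid) {
            cache[i].valid = false;
            dropped++;
        }
    }
    assert(count_valid() == 0);
    return dropped;
}

/* Finer strategy: drop only entries for the page whose mapping changed. */
static size_t invalidate_page(uint32_t guest_page)
{
    size_t dropped = 0;
    for (size_t i = 0; i < CACHE_SLOTS; i++) {
        if (cache[i].valid && cache[i].guest_page == guest_page) {
            cache[i].valid = false;
            dropped++;
        }
    }
    return dropped;
}
```

In this model a single changed mapping costs one rebuilt entry with invalidate_page() but all of them with invalidate_all(). The exception taken by the guest to install the new mapping is outside the model and stays in either case, which is the part described above as impossible to remove.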

Re: Qemu + VFIO GPU RadeonRX 550 + AmigaOS4 extremely slow

Posted on: 2024/6/22 21:14 #108

Quite a regular

I've sent a patch which improves dcbz a bit, but it only removes a small overhead that was an easy fix and most of the problem still remains. It seems to spend most of its time checking the address to see if it's in guest RAM, but I don't know how to improve that. Maybe somebody on the list has an idea.

(One can also see from the profile in the patch description that memset on the host already uses vector instructions, so it wouldn't be that bad if the check to see whether memset can be used didn't take so long. It's still better than removing the check and memset altogether, though: that runs in 9.28 seconds, whereas with the above patch it runs in 5.83 sec.)
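The "check, then memset" structure described here can be sketched roughly as follows. This is a toy illustration in C, not the actual QEMU helper: the ram array, slow_store_byte and zero_line are hypothetical names, and the range check is trivial here, whereas in the real emulator that lookup is what reportedly takes most of the time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

enum { RAM_SIZE = 256, LINE = 32 };

static uint8_t  ram[RAM_SIZE];
static unsigned slow_stores;   /* counts byte stores that took the slow path */

/* Hypothetical slow path: stands in for a store that cannot go straight to
 * host memory (for example, one that hits a device region). */
static void slow_store_byte(uint32_t addr, uint8_t value)
{
    slow_stores++;
    if (addr < RAM_SIZE) {
        ram[addr] = value;
    }
}

/* Zero the 32-byte line containing ea. Returns true when the whole line was
 * in RAM and one host memset could be used, false when it fell back to
 * byte-by-byte stores. */
static bool zero_line(uint32_t ea)
{
    uint32_t start = ea & ~(uint32_t)(LINE - 1);

    /* Fast path check: is the whole line backed by plain RAM? */
    if (start < RAM_SIZE && RAM_SIZE - start >= LINE) {
        memset(&ram[start], 0, LINE);
        return true;
    }

    /* Slow path: one store at a time. */
    for (uint32_t i = 0; i < LINE; i++) {
        slow_store_byte(start + i, 0);
    }
    return false;
}
```

The trade-off in the timings above is between the cost of the check plus one vectorised memset on the fast path and always paying for the per-store slow path.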

Edited by balaton on 2024/6/22 21:37:17

Re: Qemu + VFIO GPU RadeonRX 550 + AmigaOS4 extremely slow

Posted on: 2024/6/22 23:36 #109

Just popping in

@balaton

Thank you for your work, I will pull when the patches are merged with master and I'll compile again. I'm not expecting anything close to a miracle, but I'm curious to see if it will slightly improve the results in @Hans GPU benchmark tool.

Re: Qemu + VFIO GPU RadeonRX 550 + AmigaOS4 extremely slow

Posted on: 2024/6/23 6:40 #110

Just can't stay away

@balaton
Quote:

I've sent a patch which improves dcbz a bit but it only removes a small overhead that was an easy fix but most of the problem still remains.

DCBZ should only be used in the G2 and G3 parts of AmigaOS as replacement for DCBA. 60x CPUs don't support DCBA and on 750 CPUs it's an "optional instruction" according to https://www.nxp.com/docs/en/reference-manual/MPC750UM.pdf
But IIRC at least one of 750FX or 750GX supports DCBA.

DCBA has the advantage that it's a no-op on cache-inhibited, write-through and unmapped memory while DCBZ causes an exception in those cases.
DCBZ is emulated in the kernel exception handler, for example in case someone uses it on cache-inhibited VRAM, but of course that's extremely slow.
G4 CPUs support DCBA, and 405 based CPUs according to the PPC405 Core User's Manual as well.
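As a compact summary of the behaviour described above, here is a small decision-table sketch in C. It only encodes the claims made in this post (DCBA is a no-op on cache-inhibited, write-through and unmapped memory, while DCBZ raises an exception there); the enum and function names are invented for the example and it does not model any specific CPU.

```c
#include <assert.h>

/* Memory attributes as distinguished in the post above. */
enum mem_attr {
    MEM_CACHED,            /* ordinary cacheable, write-back memory */
    MEM_CACHE_INHIBITED,   /* e.g. VRAM */
    MEM_WRITE_THROUGH,
    MEM_UNMAPPED
};

enum op_outcome {
    OUT_LINE_ALLOCATED,    /* cache line established without a read from RAM */
    OUT_NO_OP,             /* instruction silently does nothing */
    OUT_EXCEPTION          /* trap to the kernel, which emulates it slowly */
};

/* DCBA as described: allocate on cached memory, otherwise do nothing. */
static enum op_outcome dcba_outcome(enum mem_attr attr)
{
    assert(attr >= MEM_CACHED && attr <= MEM_UNMAPPED);
    return attr == MEM_CACHED ? OUT_LINE_ALLOCATED : OUT_NO_OP;
}

/* DCBZ as described: allocate and zero on cached memory, otherwise trap. */
static enum op_outcome dcbz_outcome(enum mem_attr attr)
{
    assert(attr >= MEM_CACHED && attr <= MEM_UNMAPPED);
    return attr == MEM_CACHED ? OUT_LINE_ALLOCATED : OUT_EXCEPTION;
}
```

For example, dcbz_outcome(MEM_CACHE_INHIBITED) yields OUT_EXCEPTION, which corresponds to the very slow kernel-emulated case for VRAM, while dcba_outcome(MEM_CACHE_INHIBITED) yields OUT_NO_OP.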

On G4 CPUs neither DCBA nor DCBZ should be used but the better vector data stream prefetch instructions instead.
Or simply nothing (slower than using the prefetch instructions, but faster than using DCBA or DCBZ) as 2 consecutive, cache-line aligned 128 bit vector stores don't read the cache-line first but just store the 32 bytes.
Allocating a cache-line to skip the read/modify/write cache-line cycle is only required for smaller stores, for example 4 64 bit double or 8 32 bit integer stores in the main loop of a memcpy().

My newlib.library memcpy() implementations, on which at least some of the optimized memory copy functions in AmigaOS were based, definitely used DCBA on CPUs supporting it and DCBZ as replacement only on the CPUs not supporting it, for example the 60x CPUs in classic Amigas.

A possible reason to use DCBZ instead of DCBA or the vector stream prefetch instructions is to have common code for all CPUs, instead of different code for each of the supported CPUs as is the case in the AmigaOS kernel and used to be in newlib.library.
That may be the reason Hans is using it in his VRAM copy functions.

Re: Qemu + VFIO GPU RadeonRX 550 + AmigaOS4 extremely slow

Posted on: 2024/6/23 10:13 #111

Quite a regular

@nikitas
It might take a while until my patches are merged. Until then if you want to test it you can download mbox from the patchew link and use 'git am mbox' to apply it locally. You might want to do that on a new branch (e.g. after 'git checkout -b test-branch' so you can easily delete that branch and return to master or switch between them with git checkout).

Re: Qemu + VFIO GPU RadeonRX 550 + AmigaOS4 extremely slow

Posted on: 2024/6/23 10:22 #112

Quite a regular

@joerg
Quote:

But IIRC at least one of 750FX or 750GX supports DCBA.

According to QEMU only embedded PPC (e200, 4xx, e5500, e6500) and 74xx have DCBA, it's not emulated for any 750 CPUs which have DCBI instead.

Hans

Re: Qemu + VFIO GPU RadeonRX 550 + AmigaOS4 extremely slow

Posted on: 2024/6/24 0:38 #113

Home away from home

@balaton

Quote:

However it's questionable why this copy routine uses dcbz when it's then overwriting the values so maybe it should use something that does not zero which would be no-op on QEMU.

As Joerg guessed, this is a generic copy routine that needs to work on all CPUs. It's the recommended solution (at least by Apple), given that dcba isn't available everywhere, and causes massive slowdown on CPUs that don't have it.

On actual PowerPC hardware dcbz** makes perfect sense, because it eliminates any fetching from RAM. AFAIK, zeroing the cache line has no overhead at all, because the cache does it in hardware.

Dcba was removed because it's a potential security threat. Data previously in the cache line could end up being written to totally different memory, and accessed from there. This bypasses memory protection. Zeroing the cache line prevents any data leaks.

Emulation is the only place where it's a problem, because it has to be emulated on hardware that doesn't have an equivalent instruction. It has to be emulated, because dcbz is also used for a rapid memory-clear.

** Or dcbzl for CPUs with longer cache lines.

Join Kea Campus' Amiga Corner and support Amiga content creation
https://keasigmadelta.com/ - see more of my work

Hans

Re: Qemu + VFIO GPU RadeonRX 550 + AmigaOS4 extremely slow

Posted on: 2024/6/24 0:41 #114

Home away from home

@balaton

BTW, the graphics.library does have CPU-specific copy routines. So, it's quite possible that the G3 & G4 copy routines are better than the one I sent you.

@all

Can anyone confirm if the Pegasos-II has full memory coherence or not? The Marvell Discovery II datasheet mentions cache coherence, but I haven't seen any clear confirmation.

Hans


Re: Qemu + VFIO GPU RadeonRX 550 + AmigaOS4 extremely slow

Posted on: 2024/6/24 1:23 #115

Just popping in

@Hans

Can't say about real hardware. But if you'd like to know about QEMU PegasosII then...

"System doesn't have full memory coherence. Compensating..."

This is reported at boot time.

You can find it in my first post on this thread...

Re: Qemu + VFIO GPU RadeonRX 550 + AmigaOS4 extremely slow

Posted on: 2024/6/24 1:36 #116

Just popping in

@balaton

I tried to apply the 2 patches you posted here, but I got an error when applying patch 3 of 3 of the first one. Anyway, I made the changes by hand (it's highly possible I messed something up in the process, but it compiled). Now I'm running the GPU test.

@Hans
What does "Inactive" in the Composition field of the Sysmon app mean?
If I enable GUI --> Composite Effects and at the same time have ScreenMode --> Enable Interrupts = checked, then the system freezes at some point while booting AOS4.1.

If I disable Composite Effects and keep Interrupts enabled, then AOS boots, but it is very slow.

If I disable GUI --> Composite Effects and also have ScreenMode --> Enable Interrupts = unchecked, then I get the best performance I can, which is slow, but not that slow. Of course, in any case I still see Workbench windows being drawn, either slower or faster depending on the settings I apply each time.

Hans

Re: Qemu + VFIO GPU RadeonRX 550 + AmigaOS4 extremely slow

Posted on: 2024/6/24 2:26 #117

Home away from home

@nikitas

Quote:

Can't say about real hardware. But if you'd like to know about QEMU PegasosII then...

"System doesn't have full memory coherence. Compensating..."

Yeah, that comes from the RadeonHD/RX driver. There is no proper memory coherency test. Instead, it's based on what motherboard it's being run on.

Quote:

What does "Inactive" in the Composition field of the Sysmon app mean?

No idea.

Your trouble with interrupts being enabled suggests that some interrupts are being missed.

Hans


Re: Qemu + VFIO GPU RadeonRX 550 + AmigaOS4 extremely slow

Posted on: 2024/6/24 4:55 #118

Just can't stay away

@Hans
Quote:

Dcba was removed because it's a potential security threat. Data previously in the cache line could end up being written to totally different memory, and accessed from there. This bypasses memory protection. Zeroing the cache line prevents any data leaks.

At least on some CPUs with DCBA support, for example 405 and 74[45]x, DCBA sets the cache line to 0 as well.
On those CPUs the difference between DCBA and DCBZ is just that DCBA is a no-op on cache-inhibited and write-through memory while DCBZ causes an alignment exception and the kernel alignment exception handler sets the 32 bytes in memory to 0, which is very slow.

Re: Qemu + VFIO GPU RadeonRX 550 + AmigaOS4 extremely slow

Posted on: 2024/6/24 10:29 #119

Quite a regular

@nikitas
If you're asking about patch 3 of the altivec 128 bit series I linked to first, that did not apply cleanly for me either. You can try 'git am --3way mbox', or just use 'git am --skip' when you get the error for patch 3: that patch is for VSX ops and the AltiVec/VMX part is in patch 2, so that should be enough for us and you can skip patch 3.