You mentioned MicroDelay(), and it could indeed be caused by that, as it is probably some kind of busy loop polling a PowerPC timer register. If QEMU's emulation of that register is not very precise (which may depend on the host, or even on the host kernel configuration, so there may be a difference between running Linux distribution A vs distribution B), then this will slow things down, because the delay will last (possibly much) longer than expected.
Could be tested with a little AOS4 program which for example calls MicroDelay(10) 100000 times in a loop. Should complete in 1 second. If it takes (much) longer -> problem.
Yes, that would be worth testing. Libauto automatically opens the timer.device, so linking with -lauto will set up ITimer. Then something like this should work (NOTE: untested):
#include <proto/timer.h>
#include <stdio.h>

int main(int argc, const char **argv) {
    unsigned usDelay = 1;
    unsigned count = 1000000;
    printf("Calling MicroDelay(%u) %u times\n", usDelay, count);
    for (unsigned i = 0; i < count; ++i) {
        ITimer->MicroDelay(usDelay);
    }
    printf("Done! This should have taken %.2f seconds. How long did it actually take?\n", ((double)usDelay * count) / 1000000.0);
}
@Hans Nikitas did not have a USB vfat device in his commands, but he may try removing the network (though I think he also tried that and it didn't help). On pegasos2 all of these may share the same interrupt, so maybe there's still an issue with them even after all the patches and the level-sensitive setting in BBoot? Anyway, trying to reproduce this with Linux would avoid all of these and show whether the problem is only in how AmigaOS does things or is independent of that.
#include <proto/timer.h>
#include <stdio.h>

int main(int argc, const char **argv) {
    uint32 usDelay = 1;
    uint32 count = 1000000;
    printf("Calling MicroDelay(%lu) %lu times\n", usDelay, count);
    for (uint32 i = 0; i < count; ++i) {
        ITimer->MicroDelay(usDelay);
    }
    printf("Done! This should have taken %f seconds. How long did it actually take?\n", ((double)usDelay * count) / 1000000.0);
}
But it may be better to use for example usDelay = 10 and count = 100000, or usDelay = 100 and count = 10000.
I just got a reminder of some of geennaam's older discoveries. According to him, his Radeon R9 270x worked well provided that he didn't share part of his hard-drive with AmigaOS as a USB drive, and he also had to shut down ethernet. With either of those enabled, he got massive slowdown.
Okay, so do I have to remove the ethernet from the QEMU command only, or should I also shut down the ethernet on the host? Regarding the USB drive, I have a secondary real SSD drive on "/dev/sdb" that I use in the QEMU command. Is this OK? For the mouse/keyboard, I use bochs-display. Is this OK, too?
@balaton Indeed, virsh makes it more complex. So, I will create a new QEMU setup for this when I have time.
@nikitas Only remove it from the QEMU command. It was said that if the guest has a USB disk, as with the vfat shared folder, then it ran slower with vfio for some reason. I don't know if @geennaam ever talked about a network card. It does not (or should not) matter what you have on the host; the theory is that having other PCI devices in the guest, like a USB or network card, may interfere with interrupts from the graphics card. They are now on a different bus and we had several patches to fix this, but who knows. This was with a RadeonHD card and used pci.1, so it's different from what you've tried, but we have no better idea at the moment. So remove all -device usb-* and -device rtl8139 entries from the QEMU command line and see if that changes anything.
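For illustration only (an untested sketch; the host IDs and option values here are placeholders, not your actual command line), the idea is to strip the guest down to little more than the passed-through GPU:

# before: GPU passthrough plus a USB disk and a network card
qemu-system-ppc -machine pegasos2 ... \
    -device vfio-pci,host=01:00.0 \
    -device usb-storage,drive=usbdisk \
    -device rtl8139,netdev=net0 ...

# for the test: drop every -device usb-* and the -device rtl8139 entry
qemu-system-ppc -machine pegasos2 ... \
    -device vfio-pci,host=01:00.0 ...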
I also noticed that everyone using VFIO is using QEmu in KVM mode, which means that the guest OS can execute code on the host CPU directly instead of via emulation. I found nothing about VFIO usage with the TCG-based emulator. Looks like we're in uncharted territory.
Most of the people who do this want to play games in a VM running Windows while they run Linux as the host. So they want the most performance and use KVM and vfio. This may mean that using it with TCG is not tested that much, and with PPC not at all, but that does not mean it should not work. Of course we're in uncharted territory: not many people run AmigaOS on QEMU, and even fewer have tried vfio GPU passthrough, so it's not something that was tested and known to work. Some people tried it before for MacOS but gave up, because there the firmware is needed to run the FCode ROM of the Mac graphics card (or a suitable ROM for a PC card) for MacOS to even recognise the card, but QEMU's OpenBIOS can't run FCode ROMs and real Mac ROMs don't run with QEMU. (I had patches to fix both of these, but they aren't upstream, so one can only experiment by applying patches from different places, which is why only a few people even tried. Somebody once managed to get a Rage128Pro working, but I don't know if it was usable.)
nikitas wrote:Okay, so do I have to remove the ethernet from the QEMU command only, or should I also shut down the ethernet on the host? Regarding the USB drive, I have a secondary real SSD drive on "/dev/sdb" that I use in the QEMU command. Is this OK? For the mouse/keyboard, I use bochs-display. Is this OK, too?
Remove it from the QEmu command line, and use your Radeon R7 240 for testing (geennaam said that it had no effect on his RX 5x0 cards).
I have no idea about using the secondary real SSD drive or the bochs-display. If you can boot to AmigaOS without them, then try removing both from the QEmu args.
@balaton VFIO is obviously working with TCG. I was hoping to get some idea of what the overhead was when used with TCG instead of KVM, and maybe some tips on what to try.
balaton wrote:@Georg To help testing, could you please share your Linux kernel options and xorg.config to show how to set up vesafb and the x11perf command again so others can reproduce that test without having to find out the right config?
Could be wrong, but I don't think the x11 "vesa" driver needs any special Linux kernel options. There's another X11 driver "fbdev" which does use that Linux kernel framebuffer stuff.
In theory, to use the "vesa" driver it's just a matter of editing xorg.conf (in /etc/X11) (or saving a modified version wherever you want), looking for the "Device" section in there and editing it to say:
Driver "vesa" Option "ShadowFB" "0"
Many years ago that was enough. But nowadays if you try to start X11 (startx -- -xf86config myxorg.conf) it may fail, and the log (/var/log/Xorg.0.log) says "vesa: Ignoring device with a bound kernel driver". That seems to be because the kernel modules of the normal gfx card (in my case "nvidia") are still in memory.
So here what I do is first log out of the desktop, use CTRL ALT F1 to switch to a virtual console, run "init 3" to get rid of the X11 (KDE) display manager, then "lsmod | grep nvidia", then "rmmod" the modules (you need to find the right order, i.e. which ones to remove first, otherwise it says "module is in use by ..."), and then "startx -- -xf86config myxorg.conf". For some reason here the screen first appears somewhat broken (don't know if it's just the monitor), ~zoomed, ~like_wrong_modulo, so I also have to do some CTRL ALT F1 -> CTRL ALT F7 switching back and forth, and then it displays fine.
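Roughly, that sequence as a console session (the module names and their removal order here are from my NVIDIA setup and only an example; check lsmod yourself):

# switch to a text console with CTRL ALT F1, log in, then as root:
init 3                                  # stop the X11/KDE display manager
lsmod | grep nvidia                     # see which nvidia modules are loaded
rmmod nvidia_drm nvidia_modeset nvidia  # remove them in dependency order
startx -- -xf86config myxorg.conf       # start X11 with the "vesa" config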
If the thing is slow and you see a flickering mouse sprite (because of the disabled shadow framebuffer) in front of gfx updates (like the "glxgears" window), it worked.
Google how to disable "compositing" on your desktop. There may be some shortcut key for it. To verify that it's disabled, run "xcalc" or "xclock" from a terminal. Press CTRL+Z to freeze the program. Then drag its window out of the screen and back in. If this creates gfx trash or makes gfx disappear (like text/numbers), then it worked. (This happens because the program is frozen and cannot update/refresh areas of the window which became hidden and then visible again. With an enabled compositor this does not happen, because the windows' contents are backed up in their own pixmaps=bitmaps and the contents don't get lost when dragged out of view or behind things.)
x11perf -shmput500
x11perf -shmget500
It's unlikely that it is not running in a 4-bytes-per-pixel screenmode (so that you can interpret x11perf results/sec as million_bytes/sec), but if you want to check, then look if "xdpyinfo" says "32" for "bitmap unit". Though I'm not 100% sure that really reflects the "bytes per pixel". (Don't know or remember why, but the AROS hosted X11 driver even creates a dummy test XImage and then picks the bytes per pixel from it.)
Hans wrote:Remove it from the QEmu command line, and use your Radeon R7 240 for testing
No, I tried all the possible combinations, and it didn't go any faster. The only thing that maybe helped a little was removing bochs-display and using Evdev for USB devices.
Also, the command:
cpufreq-set -g performance
made a visible difference, but just a bit.
I also tried a funny thing using a real VGA-to-VGA connection to a small old monitor I found. I got the error "Couldn't create screen mode." I think this monitor supports 640x480.
Running this script with:
- R7 240 attached: took about 24 seconds.
- RadeonRX 550 attached: took about 1.5 or 2 seconds.
- RadeonRX 550 attached, with Screenmode --> Enable Interrupts = Checked: took about 1.0 or 1.5 seconds.
- RadeonRX 550 attached, with Screenmode --> Enable Interrupts = Checked, Ethernet attached, and using bochs-display: took about 1.0 or 1.5 seconds (same as the test above).
When I enable interrupts in Screenmode, the system seems to run slower overall, though this test appears to execute faster.
(Nobody can switch hardware on the fly and run the test faster than me!)
nikitas wrote:Running this script with:
- R7 240 attached: took about 24 seconds.
- RadeonRX 550 attached: took about 1.5 or 2 seconds.
- RadeonRX 550 attached, with Screenmode --> Enable Interrupts = Checked: took about 1.0 or 1.5 seconds.
- RadeonRX 550 attached, with Screenmode --> Enable Interrupts = Checked, Ethernet attached, and using bochs-display: took about 1.0 or 1.5 seconds (same as the test above).
When I enable interrupts in Screenmode, the system seems to run slower overall, though this test appears to execute faster.
Your RX 550 results aren't too far off, but the R7 240 result is 24x slower than it should be. I didn't expect there to be a dramatic difference depending on which graphics card is plugged in. That doesn't make sense. It does confirm that MicroDelay() can indeed be a problem, although it's not necessarily the cause of the massive graphics slow-down.
I'm the poor canary flying into the mineral mine tunnel to see if toxic gas exists further inside. Let's see if I die...
I even cleaned the connectors and the PCIe slot with 90% isopropyl alcohol. I found the (hidden) NVMe SSD, removed it, and placed it in another slot away from the CPU (in case it was using PCIe lanes, as I read). What else can somebody do, I wonder.
Could it be a problem that I don't do single-GPU passthrough? I use the integrated GPU for my host and pass the RX550 through QEMU/VFIO, with a second monitor plugged in for the guest OS.
@nikitas Here's some more motivation on what we might be able to achieve:
While QEMU uses one thread for the vcpu, it is multithreaded and has another thread for other tasks, and maybe also an IO thread. So confining this to a single CPU core may not be a good idea. What if you drop all the tweaks of isolating CPU cores using taskset and setting IRQ affinity, and just run QEMU normally? The host OS should be able to schedule the threads on its own.
Using other cards on the host should not interfere as long as they are in different vfio groups. Did you check the vfio groups and establish that the graphics card you're passing through is in its own group (together with its sound function), and that you pass all devices in that group? Also, re-reading @geennaam's experiment in the long thread, he used multifunction=on for the graphics function to create it as a multifunction device, as the sound part is at the same ID and is another function of the card. I don't think that matters, but I have no better idea now.
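A common way to check this on the host (a generic sketch, not specific to this setup) is to list the IOMMU groups and confirm the GPU and its audio function are alone in theirs:

for g in /sys/kernel/iommu_groups/*; do
    echo "IOMMU group ${g##*/}:"
    ls "$g/devices"    # everything listed here has to be passed through (or unused)
done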
@Hans Maybe the HD card could work if MicroDelay didn't have a problem with that card. This might mean these cards are slow for different reasons. It's also possible that the problem with the HD card is not actually in MicroDelay; could it be that something is disabling multitasking, so the test runs slower even though the delay would return in time? What could do that with only the HD card but not the RX card, and how could that be confirmed? Maybe Snoopy can log calls to Disable/Forbid and see if there are more of these with the HD card than with the RX card?
If MicroDelay shows more or less the expected results with one gfx card but not with another gfx card (with an otherwise identical config), then it's more likely that the problem is not MicroDelay, but something else. Like maybe tons of interrupts happening with one gfx card, but not the other?
I would try repeating the test with the slow gfx card, but with the test loop changed to be surrounded by Disable()/Enable() (if that makes it fast, try Forbid()/Permit()). If MicroDelay is just a busy loop - which is likely - it should still work in the disabled state. You might have to use a watch and check the time it takes yourself, as the AOS timer.device may behave wrong (long disabled state, timer register overflows, whatever).
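Something like this could be used for that (untested sketch based on the earlier test program; IExec should already be available from the startup code, like ITimer is via -lauto):

#include <proto/exec.h>
#include <proto/timer.h>
#include <stdio.h>

int main(int argc, const char **argv) {
    uint32 usDelay = 10;
    uint32 count = 100000;
    printf("Calling MicroDelay(%lu) %lu times with interrupts disabled\n", usDelay, count);
    /* If Disable()/Enable() makes it fast, retry with IExec->Forbid()/IExec->Permit() */
    IExec->Disable();
    for (uint32 i = 0; i < count; ++i) {
        ITimer->MicroDelay(usDelay);
    }
    IExec->Enable();
    /* Time this with a real watch; timer.device may misbehave after a long Disable() */
    printf("This should have taken about %.2f seconds. How long did it actually take?\n", ((double)usDelay * count) / 1000000.0);
}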