If you can get such a gfx card very cheap you may try it, but it's not only the gfx card itself: different host hardware (AMD Ryzen CPUs seem to be much faster for QEMU than Intel CPUs, and ARM based CPUs like the Apple M1/M2/M3 may be faster still than Ryzen CPUs, but since Macs don't have PCIe slots they are useless here), different Linux versions, as well as differences in the BIOS/UEFI, may make a difference as well.
@nikitas Since we don't know what causes the issue we also can't tell if another card would work better. That's why I opened the thread to gather test results, to see which problems are common independent of host or card and which are specific to a particular host/card/machine. If you can use the Linux perf tool with QEMU then you could try to profile the GfxBench tests that show slow speed to see which parts of QEMU are called a lot. I don't know a good tutorial on that, but there should be some docs on profiling on Linux with perf.
@balaton Not that I have much experience in Linux profiling, but I will try it.
@joerg Yes, an Apple M2 for example is much faster with Qemu-PegasosII-AOS4.1. There might be a PCIe solution for Macs, the eGPU enclosures, but they are so expensive, around 800 Euros. I'd rather get a real PPC machine.
I think I've suggested this several years ago already, but can you explain (again) why using something like IMMU->SetMemoryAttrs(page_aligned_command_buffer, page_aligned_buffer_size, MEMATTRF_WRITETHROUGH|MEMATTRF_READ_WRITE);
or, if that doesn't work either because the buffer isn't write-only but has to be (re)read by the CPU without caching as well, IMMU->SetMemoryAttrs(page_aligned_command_buffer, page_aligned_buffer_size, MEMATTRF_CACHEINHIBIT|MEMATTRF_COHERENT|MEMATTRF_GUARDED|MEMATTRF_READ_WRITE); doesn't work?
Write-through or even cache-inhibited DRAM should be much faster than cache-inhibited VRAM over ZorroIII/PCI/PCIe. At least in some of my classic Amiga OS4 parts I used write-through mapped memory because it was faster, and much easier to use, than cached memory with manual cache flushing.
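Such SetMemoryAttrs() calls need a page-aligned start address and a page-size multiple for the length. A minimal portable sketch of that rounding, assuming a 4 KiB page size (on a real system you'd query the OS for the actual page size; all names here are hypothetical):

```c
#include <stdint.h>
#include <stddef.h>

#define PAGE_SIZE 4096u  /* assumption; query the OS for the real page size */

/* Round an arbitrary (addr, len) range outwards to whole pages, so an
 * attribute change covers the buffer completely without clipping it. */
static void page_align(uintptr_t addr, size_t len,
                       uintptr_t *base, size_t *size)
{
    *base = addr & ~(uintptr_t)(PAGE_SIZE - 1);        /* round start down */
    uintptr_t end = (addr + len + PAGE_SIZE - 1)
                    & ~(uintptr_t)(PAGE_SIZE - 1);     /* round end up */
    *size = (size_t)(end - *base);
}
```

For example, a 5000-byte buffer starting 100 bytes into a page yields a two-page (8192-byte) range; those values would then be what gets passed as page_aligned_command_buffer and page_aligned_buffer_size above.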
The challenge is in guaranteeing that the data is there before the GPU touches it. With the command stream, one byte wrong will likely result in a hung GPU.
It's been a while, so I can't remember everything that I've tried any more. I do remember that MEMATTRF_COHERENT does *nothing* on the Sam460, despite what CPU docs may say. I think I tried using MEMATTRF_WRITETHROUGH. If I did, then it didn't work. There's still a window when the GPU could read the wrong data.
What's really annoying is that a cache flush/invalidate followed by a sync instruction should guarantee that the data has reached RAM. Yet the GPU still locked up.
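The fill-flush-notify ordering described here can be sketched portably with C11 atomics. This only illustrates the intended ordering, not the actual driver code: on the real hardware step 2 would be a dcbst/dcbf loop over the buffer plus a sync instruction, and step 3 an MMIO doorbell write; all names below are hypothetical.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

#define CMD_BUF_WORDS 64

static uint32_t cmd_buf[CMD_BUF_WORDS];  /* stands in for the GPU command buffer */
static atomic_uint doorbell;             /* stands in for the MMIO doorbell register */

/* Fill the buffer, then ensure the data is visible before telling the
 * consumer (the GPU, here any other thread) that it may fetch it. */
static void submit_commands(const uint32_t *cmds, size_t n)
{
    if (n > CMD_BUF_WORDS)
        n = CMD_BUF_WORDS;
    memcpy(cmd_buf, cmds, n * sizeof cmds[0]);     /* 1. write commands   */
    atomic_thread_fence(memory_order_release);     /* 2. flush/sync point */
    atomic_store_explicit(&doorbell, (unsigned)n,
                          memory_order_relaxed);   /* 3. ring doorbell    */
}
```

If the GPU observes the doorbell before the buffer contents have actually reached RAM, it fetches stale bytes, which matches the hangs described above.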
Mind you, when I worked on this I wasn't able to test it directly myself. I had to send the test versions to a beta tester who would email me back the results. That's not the easiest way to debug an issue.
Quote:
The glyph cache of ft2.library is in DRAM instead of in a BitMap in VRAM??? That would be very bad. I'm not 100% sure anymore, but I think for the text rendering in OWB I used an 8-bit (alpha-only) BitMap as an additional glyph cache, IIRC for the last 256 used glyphs (because of the Unicode support there can be many more). Each char was blitted from the glyph cache to a text-line-sized 32-bit bitmap, adding the text colour, and the result was then copied to the window with CompositeTags() for the anti-aliasing.
Unless something has changed, the graphics library's Text() function is CPU rendered.
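The glyph-cache scheme described in the quote, a fixed pool of 256 slots recycled least-recently-used, could be sketched like this. Only the bookkeeping is shown; on AmigaOS each slot would correspond to a fixed region of the 8-bit alpha BitMap in VRAM, and all names here are hypothetical, not the actual OWB code:

```c
#include <stdint.h>

#define CACHE_SLOTS 256  /* "last 256 used glyphs" from the post above */

struct glyph_slot {
    uint32_t codepoint;  /* 0 = empty slot */
    uint64_t last_used;  /* LRU stamp; lowest value = oldest */
};

static struct glyph_slot cache[CACHE_SLOTS];
static uint64_t clock_tick;

/* Returns the slot index holding (or now reserved for) the codepoint,
 * and sets *miss when the caller must render the glyph into that slot
 * before blitting from it. */
static int glyph_cache_lookup(uint32_t codepoint, int *miss)
{
    int victim = 0;
    for (int i = 0; i < CACHE_SLOTS; i++) {
        if (cache[i].codepoint == codepoint) {
            cache[i].last_used = ++clock_tick;
            *miss = 0;
            return i;        /* hit: blit straight from the cache */
        }
        if (cache[i].last_used < cache[victim].last_used)
            victim = i;      /* remember the oldest (or an empty) slot */
    }
    cache[victim].codepoint = codepoint;  /* miss: evict and reuse */
    cache[victim].last_used = ++clock_tick;
    *miss = 1;
    return victim;
}
```

On a hit the glyph is blitted straight from the VRAM cache BitMap; only on a miss does the CPU-rendered glyph have to cross the bus once before being reused.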
or similar. If your CPU does not support lbr you may need to compile with debug enabled, but that may introduce a further slowdown, so I'm not sure how useful those results would be.
@nikitas Use -p or --pid= with perf record to only record qemu-system-ppc events. Start it before running the read tests and stop it afterwards. We only need the read tests that show slow performance, but you can do a separate profile of the whole test run for comparison. Then generate the report with perf report -g, since perf script only shows the raw events, which are not too useful, and I don't know if those can be converted or used for reports.
@nikitas I can't use the perf script output because all the other perf commands that analyse the trace need the binary perf.data file, and I don't know how to convert the output back to that format. Either upload the perf.data or the output of perf report -g.
@nikitas Thanks, but I can't open it because it says the format is newer than my perf version supports. I'll need to get a newer version, or I think perf report -g --stdio might work to export the parsed report.
@nikitas This is the whole report, but the function names listed in it don't make sense, so maybe --call-graph=lbr does not work on your CPU, or maybe the --enable-lto option interferes. You could try compiling QEMU with default options (no extra optimisation) and also try the other --call-graph options with perf record.
From the picture you've posted in #30 nothing seems to stand out, and most of the time is spent running JITed guest code, so it's not slowed down somewhere in QEMU (although I don't see what's below that JIT block which takes ~40%; you could try expanding it with the e button and take another picture of that). The helper_raise_exception_err means some exception is happening; you can use the 'info irq' command in the QEMU monitor to see which exception is raised frequently. The numbers are defined in qemu/target/ppc/cpu.h.
It might still help to test with Linux guest and x11perf to check if this is specific to AmigaOS or happens with any guest. There is some documentation here on how to run Linux on these machines.
@nikitas This last one now has usable info. I'm not sure I understand it completely, but it looks like either a lot of exceptions are being raised for some reason, which you can check with 'info irq' in the QEMU monitor to see which exceptions they are, or there's a translation block that runs slowly for some reason, but I don't know how to find out what that TB does. There are QEMU options to log these, but maybe there's a better way to debug it that I don't know. If we can't figure out from the exceptions what happens, we may need to check what the slow TB is doing to find out where it slows down, but we may need to ask on the QEMU list about how to do that.
About the Linux/X11 reports: there is some info in the other thread, so you could also report the results from Linux testing there.
@nikitas There seems to be no excessive number of interrupts, so it's not slowed down by that, and the profile showed that most time was spent in a TB that probably accesses some VRAM. I don't know a good way to find out what's in that TB. Maybe you can try 'perf mem record' like you did at first; I could not parse the results you sent then, so now try 'perf mem report', as we did with the perf record output. Maybe that shows something.