OK, then can you make a small C program that just measures these copy routines and can be compiled on Linux? QEMU has two modes: full system emulation with qemu-system-ppc, but also user mode with qemu-ppc, which can run Linux or BSD executables compiled for a different architecture. With that we could easily check the compiled code, because we wouldn't have to find it among all the other OS code. So if you had a test of just these copy routines that can be compiled for PPC Linux, you could use the -d options with qemu-ppc running on x86_64 Linux to get the guest and host asm and compare them to see how these are transformed. Finding it in the qemu-system-ppc output is probably impossible, so we'd need a separate small test case for that.
This P96 *.chip or *.card function should be used by IGraphics->WaitBlit().
It's not just flush/finish but additionally waiting for completion. Since it has to be used between any GPU rendering and CPU rendering, without it the CPU based rendering functions may work on wrong data. It may also help if several GPU functions are called in a row.
That certainly is a "sync point." However, it's no guarantee that programs will actually call that when they're done drawing.
Another sync point is page-flipping or WaitBOVP()/WaitTOF(). Again, there's no guarantee that those will be called for single-buffered screens, which includes apps drawing to a Window on Workbench.
I've also toyed with the idea of using the vblank interrupt as a signal to flush. And if that isn't performant enough, maybe a timer could be used to flush a few times per frame.
Doing all of the above should work reliably. But, it would require a major rewrite of the driver's pipeline, and possibly patching the graphics.library. That rewritten code should really be put in the graphics.library or some other shared place so that all graphics drivers could potentially use it. At this point we're basically talking about replacing the Picasso96 driver API with something new that's fit for purpose.
Bear in mind that it's not just about batching multiple draw ops. There are two levels of optimization:
1. Sending multiple draw calls in one command-buffer/batch
2. Accumulating multiple draw ops of the same type into one larger draw call (e.g., collect multiple fill-rect ops into a single vertex array)
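Level 2 can be sketched roughly like this. All the names here (RectBatch, queue_fill_rect, flush_rects) are hypothetical, not any real P96 or driver API; the point is only the accumulate-then-flush structure:

```c
#include <stddef.h>

#define MAX_RECTS 256

struct Rect { int x, y, w, h; };

struct RectBatch {
    struct Rect rects[MAX_RECTS];
    size_t count;
    unsigned color;   /* a batch only holds ops sharing one state */
};

/* Placeholder for the single real draw call, e.g. submitting all
   accumulated rects as one vertex array in one command buffer. */
static void flush_rects(struct RectBatch *b)
{
    /* ...submit b->rects[0..b->count) in one draw call... */
    b->count = 0;
}

static void queue_fill_rect(struct RectBatch *b, unsigned color,
                            int x, int y, int w, int h)
{
    /* a state change (different color) or a full batch forces a flush */
    if (b->count == MAX_RECTS || (b->count > 0 && color != b->color))
        flush_rects(b);
    b->color = color;
    b->rects[b->count++] = (struct Rect){ x, y, w, h };
}
```

The same shape generalizes to other op types; the flush would also have to be triggered by the sync points discussed above (WaitBlit(), page flips, vblank).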
When testing with x86 guests QEMU might use KVM, so the guest code runs directly on the CPU, which might give different results. You can add -accel tcg to force it to translate x86 code the same way it translates PPC code. That will be slower, but might be closer to the PPC guest case. PPC and x86 ops are different, so the translation is also different, but at least it goes through much the same process as with a PPC guest.
I'm using Windows 11 VM via Virsh QEMU/KVM. And I don't know what to alter in the produced XML configuration in order to enable TCG...
I ran the Windows version of @Hans Gfx2d tool again. This time on my host I was monitoring my RX550 GPU via
sudo radeontop
The radeontop metrics go like crazy during the Win11 test. So I believe it uses the GPU (of course I can't know better than you, I'm just telling you what I see). The usage on each metric even reaches 90%.
On AmigaOS4, while running @Hans' tool, the radeontop metrics move very slowly. Every half second it uses just a small portion of the resources of the RX550, for example 0.83% of the "Graphics Pipe" metric. Around the same amount of resources is consumed for the other metrics too.
Also I found that when I use bboot v0.7 I can get 2G of RAM, but the Ranger & Sysmon tools report that the CPU clock runs at 999 MHz. When I use the pegasos.rom, I get 1.53 GHz.
Thanks a lot for your explanation about the GPU drivers (Windows/Amiga). I'm enjoying this process because I'm learning things. And this is more valuable than achieving the actual (QEMU/AOS4/RX550) goal.
Regarding the slow VRAM => RAM transfer rate on my system, I'm trying different things, but I have not yet found a clue why this is happening.
And also find out what's in the missing headers and make it compile on Linux, as it seems to have some AmigaOS specific parts that are not #ifdef-ed. Thanks, I'll play with it sometime, but currently it does not seem to be the main cause of the issue, as even commands seem to reach the card much slower than they should. Looking at these routines I see it does dcbz, which may slow it down on QEMU, as QEMU may do a slow byte-by-byte write unless it can call memset (which @joerg thinks is also slow, but it may be optimised on the host and should be much faster than in the guest). I wanted to try optimising dcbz sometime, so this would make a good test case for that as well. For now the simplest code without any tricks should be fastest on QEMU, and code that tries to be smart to run faster on real CPUs is probably slower. Is there a way to disable all CPU specific optimisations in AmigaOS and just use simple code? If there's some option, it could be tested whether that makes any difference.
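For reference, the dcbz pattern being discussed looks roughly like this. This is only a sketch: it assumes a 32-byte cache line (G3/G4; the real value must match the CPU), assumes the destination is cache-line aligned, and is guarded so it still builds on non-PPC hosts:

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>

#define CACHELINE 32  /* assumed G3/G4 data cache line size */

/* Copy using dcbz: zero each destination cache line before overwriting
   it, so a real CPU skips the read-for-ownership of that line. Under
   QEMU this is exactly the instruction that may be emulated slowly. */
static void copy_dcbz(void *dst, const void *src, size_t size)
{
    uint64_t *d = dst;
    const uint64_t *s = src;
    size_t done = size - size % CACHELINE;

    for (size_t off = 0; off < done; off += CACHELINE) {
#if defined(__POWERPC__) || defined(__powerpc__)
        /* establish the destination line without reading it from memory */
        __asm__ __volatile__ ("dcbz 0,%0"
                              : : "r" ((char *)dst + off) : "memory");
#endif
        for (size_t i = 0; i < CACHELINE / 8; i++)
            d[off / 8 + i] = s[off / 8 + i];
    }
    /* tail bytes that don't fill a whole cache line */
    memcpy((char *)dst + done, (const char *)src + done, size - done);
}
```

Comparing this against the plain copy64-style loop under qemu-ppc would show directly how much the dcbz emulation costs.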
I've also looked at the AltiVec implementations. They are in qemu/target/ppc/translate/vmx-impl.c.inc and seem to be using 64-bit ops; only one op uses 128 bits. So for now it would probably be the same as not using AltiVec, at least for the copy routines. Maybe this could be optimised, but for that somebody would need to learn about the AltiVec ops and the TCG vector implementation to see if these could be improved. I know nothing about those, so it would not be any easier for me to do than for anybody else.
@nikitas I don't know libvirt (maybe you could search for 'virt-manager enable tcg' or similar), but if it has a way to add QEMU options then adding -accel tcg should enable TCG (unless libvirt adds its own -accel option which conflicts, but there should be a way to select the accelerator used). It may be interesting to see whether the issue is caused by TCG code vs. running natively on the CPU. Another option is to have libvirt tell you the command it generated, copy it to a script, edit that and run it without virsh. (If it does not tell you, you can find it in /proc/`pidof qemu-system-ppc`/cmdline)
Quote:
Also I found that when I use bboot v0.7 I can get 2G of RAM, but the Ranger & Sysmon tools report that the CPU clock runs at 999 MHz. When I use the pegasos.rom, I get 1.53 GHz.
Those tools probably don't measure the CPU clock, they just report what's in the device tree, so this does not matter. The pegasos2 firmware uses 1.53 GHz for G4 CPUs and a different value for G3, which are probably hardcoded. QEMU has a hardcoded bus-speed * 7.5, which gives the 999 MHz. This does not affect the speed of the emulated CPU though; you would get the same benchmark results with SysMon independent of the reported clock speed, so you probably don't need to care about this.
Quote:
Thanks a lot for your explanation about the GPU drivers (Windows/Amiga). I'm enjoying this process because I'm learning things. And this is more valuable than achieving the actual (QEMU/AOS4/RX550) goal.
Exactly. Doing this is fun because we can learn about things we might not know otherwise. I did not even know about MorphOS, AmigaOS4 or NG Amigas before I started to work on these emulations, just to see where the old AmigaOS went in all those years I did not follow it.
Sorry, I misunderstood what you were asking for. Here are GfxBench2D's main copy routines (both altivec & not):
VRAMCopy.zip
The non-AltiVec part looks similar to my initial newlib.library memcpy() implementation, except for the following:
// Perform the main copy
if(numBytes >= BLOCKSIZE)
{
    // WARNING: BLOCKSIZE *MUST* be a power of two
    double *destEnd = dest64 + (numBytes & ~(BLOCKSIZE - 1)) / sizeof(double);
    double temp1, temp2;
    while(dest64 < destEnd)
    {
#if defined(__POWERPC__) || defined(_M_PPC) || defined(__powerpc__)
        __asm__ __volatile__ ("dcbt 0,%0" : : "r" (src64));
        __asm__ __volatile__ ("dcbt 0,%0" : : "r" (src64 + BLOCKSIZE));
#endif
        ...
dcbt only helps if it's used in advance, not for the cache-line of the current/next access, so it should be something like
// Perform the main copy
if(numBytes >= BLOCKSIZE)
{
    // WARNING: BLOCKSIZE *MUST* be a power of two
    double *destEnd = dest64 + (numBytes & ~(BLOCKSIZE - 1)) / sizeof(double);
    double temp1, temp2;
#if defined(__POWERPC__) || defined(_M_PPC) || defined(__powerpc__)
    __asm__ __volatile__ ("dcbt 0,%0" : : "r" (src64));
#endif
    while(dest64 < destEnd)
    {
#if defined(__POWERPC__) || defined(_M_PPC) || defined(__powerpc__)
        __asm__ __volatile__ ("dcbt 0,%0" : : "r" (src64 + BLOCKSIZE));
#endif
        ...
instead, or maybe even (has to be tested which one is faster)
// Perform the main copy
if(numBytes >= BLOCKSIZE)
{
    // WARNING: BLOCKSIZE *MUST* be a power of two
    double *destEnd = dest64 + (numBytes & ~(BLOCKSIZE - 1)) / sizeof(double);
    double temp1, temp2;
#if defined(__POWERPC__) || defined(_M_PPC) || defined(__powerpc__)
    __asm__ __volatile__ ("dcbt 0,%0" : : "r" (src64));
    __asm__ __volatile__ ("dcbt 0,%0" : : "r" (src64 + BLOCKSIZE));
#endif
    while(dest64 < destEnd)
    {
#if defined(__POWERPC__) || defined(_M_PPC) || defined(__powerpc__)
        __asm__ __volatile__ ("dcbt 0,%0" : : "r" (src64 + BLOCKSIZE*2));
#endif
        ...
There is probably next to no overhead using dcbt twice on each cache line as it's done in your code, but removing any useless instruction in the main copy loop can help.
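The same prefetch-ahead pattern can also be written portably with GCC's __builtin_prefetch, which is emitted as dcbt on PPC (and the corresponding prefetch instruction on other targets). This is a sketch, not the GfxBench2D code; BLOCKSIZE is an assumed power of two in bytes, numBytes is assumed to be a multiple of sizeof(double), and the tail handling is left to the caller:

```c
#include <stddef.h>

#define BLOCKSIZE 64  /* assumed block size in bytes, power of two */

static void copy_prefetch(double *dest64, const double *src64,
                          size_t numBytes)
{
    if (numBytes >= BLOCKSIZE) {
        double *destEnd = dest64 +
            (numBytes & ~(size_t)(BLOCKSIZE - 1)) / sizeof(double);

        /* warm up the first block before entering the loop */
        __builtin_prefetch(src64, 0 /* read */, 0);

        while (dest64 < destEnd) {
            /* prefetch one block ahead of the current accesses */
            __builtin_prefetch((const char *)src64 + BLOCKSIZE, 0, 0);
            for (int i = 0; i < BLOCKSIZE / (int)sizeof(double); i++)
                *dest64++ = *src64++;
        }
    }
    /* tail bytes (numBytes % BLOCKSIZE) left to the caller in this sketch */
}
```

On compilers without the builtin, the inline dcbt asm above remains the way to do it; the builtin simply keeps the prefetch-distance logic in one portable place.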
The AltiVec code should use the vector data stream prefetch instructions (dst, dstt, dstst, dststt, etc.; check for example the IExec->CopyMemQuick() implementation, which may be in the HAL CPU parts, if you have access to its sources) instead of dcb[az] and dcbt.
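A hedged sketch of what such data-stream prefetch looks like: vec_dst()/vec_dss() come from <altivec.h>, and the control-word encoding (block size in 16-byte units, block count, stride in bytes) follows the AltiVec programming model. The particular stream parameters here are arbitrary assumptions, and the plain loop is just a stand-in body, not the real CopyMemQuick(); it's guarded so the file still builds without AltiVec:

```c
#include <stdint.h>
#include <stddef.h>

#ifdef __ALTIVEC__
#include <altivec.h>
/* control word: block size (in 16-byte vectors) | block count | stride */
#define DST_CTRL(size, count, stride) \
    (((size) << 24) | ((count) << 16) | (stride))
#endif

static void copy_stream(uint64_t *dst, const uint64_t *src, size_t size)
{
#ifdef __ALTIVEC__
    /* start a read stream on channel 0 ahead of the copy:
       4 blocks of 2 vectors (32 bytes) each, stride 32 (assumed values) */
    vec_dst(src, DST_CTRL(2, 4, 32), 0);
#endif
    /* stand-in copy body; the real routine would use vec_ld/vec_st */
    for (size_t i = 0; i < size / 8; i++)
        dst[i] = src[i];
#ifdef __ALTIVEC__
    vec_dss(0);  /* stop the stream when done */
#endif
}
```

Unlike dcbt, a started stream keeps prefetching in the background, so it only needs to be (re)issued per buffer rather than per cache line.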
That certainly is a "sync point." However, it's no guarantee that programs will actually call that when they're done drawing.
Not using WaitBlit() after blitter based rendering functions failed even on classic Amigas with the OCS/ECS/AGA chip sets, therefore I doubt that most software fails to use it as required.
Quote:
At this point we're basically talking about replacing the Picasso96 driver API with something new that's fit for purpose.
Except for the PIP/overlay functions in Picasso96API.library and the card/chip driver API there is nothing left of P96 in OS4 anyway, replacing the driver API with something better shouldn't be a big problem.
Not using WaitBlit() after blitter based rendering functions failed even on classic Amigas with the OCS/ECS/AGA chip sets, therefore I doubt that most software fails to use it as required.
RTG software does NOT use WaitBlit(), because it only waits for the classic chipset (i.e., it's useless for RTG).
Quote:
Except for the PIP/overlay functions in Picasso96API.library and the card/chip driver API there is nothing left of P96 in OS4 anyway, replacing the driver API with something better shouldn't be a big problem.
It's possible, but it's a huge task. Driver API calls are spread throughout all code that does any kind of drawing. You also have to identify and rework every instance where the code directly accesses VRAM, because any new API would NOT allow direct CPU VRAM access.
This really needs a team working on it, instead of just me.
I've also toyed with the idea of using the vblank interrupt as a signal to flush.
In AROS hosted on Linux (runs like a normal Linux program, does Exec task switching by poking signal context) the gfx driver uses X11 pixmaps and X11 windows for screens and friend bitmaps of screens (the gfx system does not rely on direct bitmap or framebuffer access). So for example a graphics.library/RectFill() may end up as XFillRectangle() into a X11 pixmap. A BltBitMap() may end up as XCopyArea().
X11 input events (mouse move, key press) are checked/handled at INTB_VERTB intervals and that's where XFlush() is called.
- I enabled the integrated Intel GPU.
- I replaced my primary GPU (an NVIDIA) with the Radeon RX 550. So, the RX is now plugged into the primary PCIe slot, and the NVIDIA card is removed completely.
The improvements are visible. Not enough that you could say the system is workable; it's still slow. But it is improved anyway.
I personally have an ASUS TUF Gaming Plus B550 with a Ryzen 5800X, but it doesn't have an integrated GPU, so I should definitely keep the GeForce. Having 2 PCIe slots, I would necessarily have to pair it with an ATI Radeon RX 550, for example.
From what I understand, the ATI Radeon RX 550 seems to work better in the primary PCIe slot.
Thank you again for the tests you are doing. I will continue to follow developments with a view to a possible purchase, future fixes, possible improvements, etc.
Clearly it would be good if many more people were able to run the tests.
By asking ChatGPT I understand why the RX550 is faster in the primary PCIe slot: it's directly connected to the CPU and provides more bandwidth...
But as I said it is still slow anyway. I'll see what else I can do. And probably @balaton & @Hans could find some room for improvements on Qemu PPC & RadeonRX.chip.
Quote:
The primary PCIe slot on your ASUS Z790 D-4 motherboard is faster than the secondary slot due to differences in the number of PCIe lanes and their connectivity to the CPU and chipset. Here are the key reasons:
1. **Direct CPU Connection**: The primary PCIe slot (usually labeled as PCIe x16_1) is typically directly connected to the CPU. This allows it to take full advantage of the maximum number of PCIe lanes and the fastest possible data transfer rates.
2. **Lane Allocation**: Even if you are using an x8 GPU card, the primary slot might be configured to support up to x16 lanes, providing more bandwidth compared to the secondary slot, which might be limited to x8 or even x4 lanes depending on the motherboard design.
3. **Chipset Connectivity**: The secondary PCIe slot is often connected to the chipset rather than directly to the CPU. This connection can introduce additional latency and reduce the available bandwidth because the data has to pass through the chipset before reaching the CPU.
4. **Bandwidth Sharing**: On many motherboards, secondary PCIe slots share bandwidth with other devices, such as M.2 SSDs or additional PCIe slots. This can further reduce the available bandwidth for the secondary PCIe slot compared to the primary slot, which usually has a dedicated connection.
To illustrate this with an example:
- **Primary PCIe Slot (PCIe x16_1)**: This slot might operate at PCIe 4.0 x16 or x8 if the GPU only uses 8 lanes, directly connected to the CPU, providing up to 32 GB/s (for x16) or 16 GB/s (for x8) bandwidth.
- **Secondary PCIe Slot (PCIe x16_2)**: This slot might operate at PCIe 4.0 x4 or x8, connected through the chipset, providing up to 8 GB/s (for x4) or 16 GB/s (for x8) bandwidth, but with added latency due to the chipset.
In summary, the primary PCIe slot is faster primarily because it has a direct connection to the CPU and typically offers more PCIe lanes with dedicated bandwidth, while the secondary slot has to share resources and connect through the chipset, leading to reduced performance.
Perfect, thanks for the explanation. I was already aware of this regarding bandwidth etc.
I kept the GeForce on the second (slowest) slot for a long time, about 2 years, because its big heatsink covered the faster slot. In practice I did not notice any problems, and the speed difference, at least on Windows with FPS tests in games etc., was not that evident.
Now, having changed to the Corsair heatsink, I have put the GeForce on the primary slot. But if needed, I could put a possible ATI card on the primary slot instead.
But as you said, it is best to wait for further developments when choosing the card.
Thanks, for the detailed explanation.
Regarding the M.2 SSD slot, you are right. With the latest BIOS, ASUS has solved this "bug"; now the problem no longer occurs.
TUF GAMING B550-PLUS BIOS 3405 Version 3405 16.14 MB 2024/01/08
"1.This update includes the patch for the LogoFAIL vulnerabilities 2.Support graphics card with M.2 storage Before running the USB BIOS Flashback tool, please rename the BIOS file (TGB550PS.CAP) using BIOSRenamer."
----
TUF GAMING B550-PLUS BIOS 3607 Version 3607 16.15 MB 2024/04/03
"Update AGESA version to ComboV2PI 1.2.0.Ca. Fix AMD processor vulnerabilities security. Before running the USB BIOS Flashback tool, please rename the BIOS file (TGB550PS.CAP) using BIOSRenamer."
Yes, I know. And moving the graphics card further and further up the board is not a solution. In the end I'd be soldering the GPU to the CPU for a little more bandwidth.
Maybe it's a factor of my age. A few years back I would have had no problem finding an ATI card to run tests with QEMU. At the time, many friends of mine worked in IT, so even just borrowing some ATI graphics cards wouldn't have been a problem.
But things change, that's life.
And only "heaven" knows the thousands of euros that I spent, that we spent together, on the treatment of my wife's cancer, which unfortunately defeated us.
I'm reminded of the AGA+++ chipset designer who put his designs up for sale because he needed the money for the care of his wife. My most sincere respect.
Because you abandon everything without looking back. To keep the hope of finding a cure alive.
Today, after two years and 2 months, the mourning process seems to be over, at least apparently for me.
Even if it's just a simple lie to yourself and finding something to shield your mind. Because it's the only way out of this to lie to yourself.
And I wonder: should man spend all his resources to extend people's lives? Certainly not to shorten them. Medical research, and anything related to the betterment of society, should be the priority. Because the question is simple: does the human being deserve to survive technological progress? Surely not, because humanity is not going in that direction, but towards self-destruction. Moving consciences is not easy at all.
One day if I were asked the question who do you want to meet "God" or "Your Wife"
"My Wife" without hesitation.
Well I felt like writing this and I hope it can be a small message written in a simple way.
Today Elden Ring "Shadow of the Erdtree" arrived and this makes me a little free
Well, I hope we can find a card that can work with qemu. I await good news. And possibly a recommended card to buy that works for most things.
I did not check what these are doing or what makes them slower than expected. That's something for later.
There's also a patch posted on the QEMU list to improve the AltiVec ops to use 128 bits. It isn't final yet; there will be at least a v2, as there was a review comment. It only improves the last copyFromVRAMAltivec case a bit, which takes 6.03 sec with this patch (at least for the memory case; maybe it's better with VRAM, but I don't know how to get a VRAM buffer on Linux). But it seems there's some other issue that makes it slower, which should be fixed first, and for now not using any optimisation is the best. So I ask again if there's a way to disable CPU specific optimisations in AmigaOS to test that.
So I ask again if there's a way to disable CPU specific optimisations in AmigaOS to test that.
No. Which one of the several different CopyMemQuick(), bcopy(), memcpy(), (Read|Write)PixelArray(), etc., implementations is used (DMA on Sam4x0, X1000, X5000 and A1222; 128 bit vector CPU copy loop on G4; 64 bit double copy loop on everything else) depends on the PVR register contents. The only thing you can do in QEMU to use a different implementation is emulating a G3 750GX or 750FX CPU instead of a G4 74xy CPU.
But even the slowest 64 bit double implementations use DCBA (with DCBZ as replacement on CPUs not supporting DCBA) and DCBT, which may make them slower on QEMU than on real hardware.
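Such PVR-based dispatch looks roughly like this. This is only a sketch of the mechanism, not the AmigaOS code: copy_fpu/copy_vec are made-up stand-ins, and the PVR version numbers (0x000C/0x800C/0x8000-0x8004 for 7400/7410/745x G4s) are from public PVR tables; the exact set the OS checks is an assumption. Guarded so it builds on non-PPC hosts too:

```c
#include <stdint.h>

typedef void (*copyfn)(void *, const void *, unsigned long);

/* stand-in for the 64 bit double copy loop */
static void copy_fpu(void *d, const void *s, unsigned long n)
{
    unsigned char *dd = d;
    const unsigned char *ss = s;
    while (n--)
        *dd++ = *ss++;
}

/* stand-in for the 128 bit AltiVec copy loop */
static void copy_vec(void *d, const void *s, unsigned long n)
{
    copy_fpu(d, s, n);  /* real version would use vector loads/stores */
}

static uint32_t read_pvr(void)
{
#if defined(__powerpc__) || defined(__POWERPC__)
    uint32_t pvr;
    __asm__ __volatile__ ("mfspr %0,287" : "=r" (pvr));  /* SPR 287 = PVR */
    return pvr;
#else
    return 0;  /* non-PPC host: unknown CPU */
#endif
}

static copyfn select_copy(void)
{
    switch (read_pvr() >> 16) {          /* upper half = CPU version */
    case 0x000C:                         /* 7400 */
    case 0x800C:                         /* 7410 */
    case 0x8000: case 0x8001: case 0x8002:
    case 0x8003: case 0x8004:            /* 745x */
        return copy_vec;                 /* G4: AltiVec path */
    default:
        return copy_fpu;                 /* G3 and everything else */
    }
}
```

This is why emulating a 750FX/GX in QEMU changes which path is taken: the reported PVR changes, and the dispatch follows it.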