The dcba instruction may also be very slow, just like with the G5. It's an illegal instruction on the G5, and is emulated by doing nothing (but with the overhead of an illegal instruction trap).
At least in AmigaOS 4.0, and for the 32-bit CPUs, the kernel didn't emulate it: when dcba is used on a CPU which doesn't support it at all, the program crashes with an ISI exception. But there may be some CPUs where dcba is implemented as a no-op without causing an ISI exception.
@afxgroup Quote:
So you have to use Exec functions since they are always optimized for the current machine. Maybe they aren't for x5000 but only kernel developers know
I've read somewhere that X5000 optimised functions (maybe X1000 and A1222 as well) are only included in beta kernels, not in public released ones yet, but I don't know if that's still the case. For the X1000 the 74xy AltiVec versions could have been used with (next to) no changes, but X5000 (dcbzl instead of dcbz, or if there is such an instruction dcbal instead of dcba) and A1222 (integer accesses only, never any float/double loads/stores) need new implementations.
Unfortunately the dcbtl opcode is not recognized by any of the gcc versions in the latest SDK, unless I use the compiler switch -mcpu=G5. But the generated code will crash immediately of course.
Even -mcpu=e5500 doesn't recognize those 64-byte cache line opcodes.
dcba is actually still supported by the e5500 (see the e5500 reference manual). But, depending on a bit in the L1 cache control register, it works on half or full cache lines. The new e5500 dcbal opcode always works on full cache lines. But again, it is not recognized by gcc.
Quote:
Unfortunately the dcbtl opcode is not recognized by any of the gcc versions in the latest SDK, unless I use the compiler switch -mcpu=G5. But the generated code will crash immediately of course.
Even -mcpu=e5500 doesn't recognize those 64-byte cache line opcodes.
Try: -mcpu=G5 -mno-powerpc64
Quote:
dcba is actually still supported by the e5500 (see the e5500 reference manual). But, depending on a bit in the L1 cache control register, it works on half or full cache lines. The new e5500 dcbal opcode always works on full cache lines. But again, it is not recognized by gcc.
Interesting. One annoying thing about the cache hint instructions is their varying behaviour on different PowerPC CPUs.
BTW, these days achieving maximum memory copy speeds often seems to require using multiple cores.
Quote:
BTW, these days achieving maximum memory copy speeds often seems to require using multiple cores.
Yes, I've noticed that bandwidth scales more or less linearly when you run a multi-threaded memtest on X5000 Linux (ramsmp-3.5.0). In single-threaded mode it produces the same kind of results as ragemem and my memory test (the version without cache hint instructions). It would be nice to see whether X5040 owners get a ~4x increase when going from a single-threaded to a quad-threaded test run.
https://openbenchmarking.org/test/pts/ramspeed uses the same ramsmp-3.5.0 in its test suite. Those results for the same type of DIMMs are much higher than we can achieve on an X5000. But this is not a fair comparison, because there are vector-enabled assembly routines for Intel and AMD.
However, the disappointing part is that our GCC will never generate optimized code for the X5000, because it simply doesn't understand the cache hint instructions for a full cache line. Therefore "-fprefetch-loop-arrays" generates only marginally faster code, which is still slower than my "half cacheline" test.
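For comparison with what "-fprefetch-loop-arrays" does automatically, here is a minimal sketch of a copy loop with hand-placed prefetches using GCC's __builtin_prefetch. The function name copy_with_prefetch and the ASSUMED_LINE constant are made up for this illustration, and the 64-byte stride is an assumption based on the full cache line size discussed in this thread. On PowerPC targets GCC normally lowers this builtin to dcbt/dcbtst, so it does not need the dcbtl mnemonic, but whether it beats plain code on an X5000 is untested here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption for this sketch: one full cache line is 64 bytes. */
#define ASSUMED_LINE 64

/* Hypothetical helper: copies n_words 32-bit words and asks the compiler
 * to prefetch the next cache line of both buffers once per line. */
static void copy_with_prefetch(uint32_t *dst, const uint32_t *src, size_t n_words)
{
    const size_t words_per_line = ASSUMED_LINE / sizeof(uint32_t);

    assert(dst != NULL && src != NULL);

    for (size_t i = 0; i < n_words; i++) {
        /* Once per assumed line, hint the next line (if there is one). */
        if (i % words_per_line == 0 && i + words_per_line < n_words) {
            __builtin_prefetch(&src[i + words_per_line], 0, 0); /* for reading */
            __builtin_prefetch(&dst[i + words_per_line], 1, 0); /* for writing */
        }
        dst[i] = src[i];
    }
}
```

The prefetch is only a hint, so the loop copies the same data with or without it; only the timing can differ.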
According to the docs, a combination of the following options is enabled by the -mcpu switch, so AltiVec is probably also an issue. Quote:
The -mcpu options automatically enable or disable the following options:
I will try to play with the options later this week.
EDIT: I think I now understand why GCC doesn't support dcbtl/dcbal for the e5500. By default, L1CSR0[DCBZ32] is cleared, which means that dcbz and dcba operate on the full 64-byte cache line. So by default dcbz/dcba behave like dcbzl/dcbal. I will check whether L1CSR0[DCBZ32] is still cleared at runtime on AmigaOS. If it isn't, and if the e5500 core allows changing it at runtime, it would be even easier to simply clear L1CSR0[DCBZ32]. That makes sense, actually: both generic and e5500-specific code would get the same benefit. But I do wonder if there is a catch, because the 32-byte limitation option looks pretty redundant to me, and they don't just waste silicon without good reason.
Edited by geennaam on 2023/2/22 12:26:17 Edited by geennaam on 2023/2/22 12:31:15 Edited by geennaam on 2023/2/22 12:32:18
Quote:
I will check whether L1CSR0[DCBZ32] is still cleared at runtime on AmigaOS. If it isn't, and if the e5500 core allows changing it at runtime, it would be even easier to simply clear L1CSR0[DCBZ32].
That would only be safe if you ran your code inside IExec->Disable()/Enable(), or at least IExec->Forbid()/Permit(). If the 32-byte bit is set in AmigaOS 4.1, it means the kernel functions (IUtility->SetMem(), IExec->CopyMemQuick(), etc.) haven't been updated to 64-byte cache lines yet and still use the old 32-byte versions, which won't work if you change the bit.
Quote:
But I do wonder if there is a catch, because the 32-byte limitation option looks pretty redundant to me, and they don't just waste silicon without good reason.
For dcbt(st) there should be no difference, except that old 32-byte code issues two dcbt instructions per cache line. For dcb[az], however, it's a required workaround to be able to run old code which assumes a 32-byte cache line: if dcbz clears 64 bytes, such old code may clear 32 bytes too much, and a 64-byte dcba may leave 32 bytes of random data.
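To make the "32 bytes too much" hazard concrete, here is a small simulation in portable C. It does not execute any real dcbz; sim_dcbz, old_clear32 and clobbered_outside are hypothetical helpers written for this post, and line_size stands in for the effective dcbz granularity (32 with the DCBZ32 bit set, 64 with it cleared, as described above).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simulated dcbz: zero the whole "cache line" that contains offset.
 * line_size must be a power of two (32 or 64 in this discussion). */
static void sim_dcbz(uint8_t *mem, size_t offset, size_t line_size)
{
    assert(line_size != 0 && (line_size & (line_size - 1)) == 0);
    size_t line_start = offset & ~(line_size - 1);
    memset(mem + line_start, 0, line_size);
}

/* "Old" clear routine that assumes 32-byte lines: one dcbz per 32 bytes.
 * start and len are expected to be multiples of 32. */
static void old_clear32(uint8_t *mem, size_t start, size_t len, size_t line_size)
{
    for (size_t off = start; off < start + len; off += 32)
        sim_dcbz(mem, off, line_size);
}

/* Returns how many bytes OUTSIDE [start, start + len) were wiped when the
 * old routine runs with the given effective dcbz size (256-byte test area). */
static size_t clobbered_outside(size_t start, size_t len, size_t line_size)
{
    uint8_t mem[256];
    size_t bad = 0;

    memset(mem, 0xAA, sizeof mem);          /* marker pattern */
    old_clear32(mem, start, len, line_size);

    for (size_t i = 0; i < sizeof mem; i++)
        if ((i < start || i >= start + len) && mem[i] != 0xAA)
            bad++;
    return bad;
}
```

Clearing the 32 bytes at offset 32 is harmless with a 32-byte dcbz, but with a 64-byte dcbz the same call also wipes bytes 0 to 31, which the old code never asked for.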
Here's a small memory test tool which compares cache and DDR bandwidth for both normal transfers and cache hint "optimised" transfers. For now only dcba and dcbz are used. These instructions are supported by at least the e5500 and ppc440 cores.
The implementation is very basic and there's room for optimisation.
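For readers who want to try the basic idea themselves, the following is a rough sketch of how a plain (no cache hints) copy bandwidth measurement can be done in portable C. It is not the tool mentioned above; copy_bandwidth_mb_s is a hypothetical name, it uses memcpy and the standard clock() timer, and clock() may be too coarse for small buffers unless the repeat count is raised.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: copies `bytes` bytes `repeats` times and returns the
 * rate in MB/s. Returns 0.0 if the timer was too coarse to measure anything
 * and -1.0 if the buffers could not be allocated. */
static double copy_bandwidth_mb_s(size_t bytes, int repeats)
{
    assert(bytes > 0 && repeats > 0);

    unsigned char *src = malloc(bytes);
    unsigned char *dst = malloc(bytes);
    if (src == NULL || dst == NULL) {
        free(src);
        free(dst);
        return -1.0;
    }

    /* Touch both buffers first so page setup is not part of the timing. */
    memset(src, 0x5A, bytes);
    memset(dst, 0, bytes);

    volatile unsigned char sink = 0;   /* keeps the copies from being optimised away */
    clock_t t0 = clock();
    for (int r = 0; r < repeats; r++) {
        memcpy(dst, src, bytes);
        sink ^= dst[(size_t)r % bytes];
    }
    clock_t t1 = clock();
    (void)sink;

    free(src);
    free(dst);

    double seconds = (double)(t1 - t0) / CLOCKS_PER_SEC;
    if (seconds <= 0.0)
        return 0.0;
    return ((double)bytes * (double)repeats) / (1024.0 * 1024.0) / seconds;
}
```

Calling it once with a buffer small enough to stay in cache and once with a buffer much larger than the caches gives the kind of cache versus DDR comparison described above. Note that clock() counts CPU time, which is close enough to elapsed time for a single-threaded copy loop.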
Quote:
dcba is actually still supported by the e5500 (see the e5500 reference manual). But, depending on a bit in the L1 cache control register, it works on half or full cache lines. The new e5500 dcbal opcode always works on full cache lines. But again, it is not recognized by gcc.
If the current gas does not support it, try binutils v2.40 (betas). At least there is something about 'DCBA' in the code, so maybe they will work for you.
I know I am late to the party here, and I know this is a long shot, but reading this post: I only have one 2 GB Kingston module in my X5040. So I purchased a Kingston Fury Beast 2x8 GB DDR3-1866 dual rank kit (KF318C10BBK2/16) from Amazon and should get it next week. Is there a chance that faster bank writes with dual memory modules would help with some of these weird Grim Reaper errors?
I have also noticed how bad the numbers get in RageMem.
I got my Kingston Fury RAM. U-Boot recognizes two banks of 8 GB, but AmigaOS only recognizes 2 GB. Which limit variable in U-Boot do I change to set this?
AmigaOS, every variant, can only address 2 GB. So this is normal. There's extended memory by means of bank switching, but that is hardly used.
AmigaOS 4.x can address 4 GB. The single, ancient exec function which limited the address space to 2 GB in AmigaOS 3.x has not been used for more than 20 years and was replaced by a function supporting the complete 4 GB. However, in AmigaOS 4.x the upper 2 GB of the 32-bit, 4 GB (virtual) address space are reserved for PCI(e), U-Boot, etc., which leaves only 2 GB for normal applications that don't support ExtMem.