Your values are really low. It looks to me that this is caused by the combination of high latencies and single rank. But it also looks like your memory controller is running at 666MHz too. I cannot believe that it's the P5040's fault. After all, it needs to feed 4 CPU cores instead of 2. Hence the update in memory controller clock speed.
We need a CPU-Z like program to check the SPD and memory controller settings.
I'll create one if I can find the right motivation to do so.
Sysmon's Benchmark tab does not show the processor type, although as you can see, my Amiga's CPU is more than 10% faster than the X5020's and delivers more MIPS.
Well, in case there is any doubt about which computer I tested, here is a different Sysmon tab where you can see the type of CPU.
Quote:
Your values are really low. It looks to me that this is caused by the combination of high latencies and single rank. But it also looks like your memory controller is running at 666MHz too. I cannot believe that it's the P5040's fault. After all, it needs to feed 4 CPU cores instead of 2. Hence the update in memory controller clock speed.
There is some problem with the x5000/040 in terms of bandwidth (and it happens not only on Mufa's hardware but on all the others). We found the same issue with Hans some time ago (he has an x5000/040): in every game we compared, the FPS was always worse for him, even though we had the same cards, components, etc.
Some games give 30% slower results in FPS on the 040, and I haven't heard of anyone on the team reporting those issues or looking for the root cause.
Even the GART support in the graphics drivers is _SLOWER_ on the 040 than on the 020. That is not with Ragemem but with our internal tests. I.e. let's say the 020 gives us 550 MB/s, then the 040 gives let's say 350 MB/s.
As I don't have an 040, I can't test it all properly, find the root cause and write a proper bug report; that is work for the 040 beta testers. But the fact is that there is an issue somewhere, and for now it is unknown in which component: hardware or software (kernel and so on).
IMHO it could be something in the U-Boot default values, the default kernel initialization, or something of that sort (I can't believe it is the hardware, as that was done by Varisys as well). Again, no one has made a proper report about it, but I am 101% sure there is some issue somewhere on the 040.
I am still curious what is going on with the DDR3 performance of the X5000.
I wrote a little benchmark program in plain C to test the three cache levels until we reach the DDR3 memory. I wasn't able to build a specific e5500 optimised binary because gcc in the SDK doesn't support the -mcpu=e5500 target. So it's a generic PPC build.
Contrary to rumours on certain forums, the 2MByte CPC is enabled and configured as L3 copy-back cache.
Each test transfers 2GByte of data. The number of passes for each test is 2GByte/blocksize.
You can clearly distinguish between each cache level up to the DDR memory.
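For readers who want to try something similar, here is a minimal sketch of how such a test loop could look. It is not the benchmark program itself: the names `TOTAL_BYTES`, `passes_for_block` and `read_pass_u32` are made up for illustration, and it assumes that 2GByte means 2^31 bytes. Timing and the write/float variants are left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Total amount of data moved per test. Assumption: 2 GByte = 2^31 bytes. */
#define TOTAL_BYTES (2ULL * 1024ULL * 1024ULL * 1024ULL)

/* Number of passes over one block so that each test moves TOTAL_BYTES
   (passes = 2GByte / blocksize). block_size must be non-zero. */
static uint64_t passes_for_block(uint64_t block_size)
{
    return TOTAL_BYTES / block_size;
}

/* One read pass over a block using 32-bit integer loads.
   The pointer is volatile and the sum is returned, so the compiler
   cannot optimise the loads away. */
static uint32_t read_pass_u32(const volatile uint32_t *block, size_t block_bytes)
{
    uint32_t sum = 0;
    size_t n = block_bytes / sizeof(uint32_t);

    for (size_t i = 0; i < n; i++)
        sum += block[i];

    return sum;
}
```

Calling `read_pass_u32()` `passes_for_block(block_bytes)` times and timing the whole run gives one data point. Small block sizes stay in L1, larger ones spill into L2, the CPC and finally DDR3.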
One thing that I noticed is that 64bit integer read/write speed is on the same level as 32bit integers. This is different for floats, where the L1 and L2 cache speed nearly doubles for 64bit floats. I do not know if this is a limitation of the e5500 core or a result of the fact that I've built a generic PPC binary. The 64bit float performance can also explain the write performance of the X1000. When ragemem is optimised for Altivec or otherwise can make use of the 128bit load/store unit, this effectively doubles the theoretical memory bandwidth compared to the X5000.
Another observation worth mentioning is that as soon as a write hits the L3 cache or the DDR3 memory, the speed drops to the same level for all four write scenarios (int32,int64,fp32,fp64). Only the 64bit float read can benefit from the 64bit access.
The e5500 cores, the L3 cache and the DDR3 controllers are all connected to the CoreNet Coherency Fabric (CCF). This internal bus runs at 800MHz and is advertised to be able to support a sustainable read performance of 128bytes/clock cycle. This means a sustained read bandwidth of ~100GByte/s. Furthermore, the reference manual claims that this is a "low latency" datapath to the L3 cache. The L3 cache is also advertised to support a sustained read bandwidth of ~100GByte/s. So if the bandwidth of the L3 cache and the CoreNet Coherency Fabric are not the issue, then the only viable option left is latency or another bottleneck between the e5500 and the CoreNet Coherency Fabric. The reference manual doesn't mention how the DDR3 controller is connected to the CCF, but I can imagine that this is a 128bit interface. The large difference in speed between the level three cache and the DDR3 controller can be explained by the introduction of latency (wait states) because of crossing clock domains (800MHz vs 666MHz). And of course the latencies of the DDR3 memory itself.
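The ~100GByte/s figure is simply clock rate times bytes per clock. As a small worked example (the helper name `peak_bandwidth` is just for illustration):

```c
#include <stdint.h>

/* Peak bandwidth in bytes per second:
   clock rate (Hz) multiplied by bytes transferred per clock cycle. */
static uint64_t peak_bandwidth(uint64_t clock_hz, uint64_t bytes_per_clock)
{
    return clock_hz * bytes_per_clock;
}
```

With the numbers quoted above, `peak_bandwidth(800000000ULL, 128)` gives 102,400,000,000 bytes/s, i.e. about 102.4 GByte/s (decimal), which matches the advertised ~100GByte/s. This is a theoretical ceiling only and says nothing about latency.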
The following topic on the NXP forum suggests that it's the latency that is killing performance in my test loops.
Quote:
We nowhere considered the L2 cache as a core's accelerator in the P2020. When enabled, it inserts additional latency to core's transactions due to time required for cache hit checking. If no hit, the transaction is sent to coherent bus, for example to DDR. L2 cache can help to speed up core's operations only if the data being read is absent in the L1 cache but valid in L2 one. This depends on the code and may or may not happen. Main features of the L2 cache are as follows: - allows access to the cache for the I/O masters (feature called 'stashing'), in this case the core reads data from the L2 cache instead of DDR; - allows to share data between two cores.
So the QorIQ L2 cache is optimised for I/O handling and inter-core communication instead of raw processing speed. That sounds logical considering that this processor is designed to be a network processor.
This also explains why the bandwidth of my simple copy loop drops considerably as we go down the cache hierarchy.
I repeated my memory test with the L3 cache configured as a FIFO and even when it was completely disabled. But I could hardly notice a difference in DDR3 performance.
So the next test would be to see what happens when I disable L2 cache or test DDR3 performance with a DMA transfer.
I never tried to implement optimized read/write/copy code on my X5000, only for some of the other systems supported by AmigaOS4, but from your results it's quite obvious that you aren't using DCBT (for reads) and DCBA (for writes; if it isn't supported by the 5020, use DCBZ instead). On none of the PPC CPUs supported by AmigaOS4 do you get anywhere near the max. speeds without using either the DCBxy instructions or, where supported, the AltiVec streaming instructions. Important: The cache line size of the 5020 is, or may be depending on some CPU configuration bits, different from most other PPC CPUs supported by AmigaOS4. AFAIK the OS4 kernel implements correct (or at least much better optimized) 5020 code only in the beta versions of the kernel, not the public ones.
Very simple test for much faster writes: Use a DCBZ loop (or IUtility->SetMem(), BZero(), newlib memset(), bzero(), etc.). Of course that way you won't get the results most software will get, but the results will be much closer to the hardware limits.
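A rough sketch of what such a DCBZ loop could look like in C with GCC inline assembly follows. The function names are invented for illustration and this is not code from any OS4 library. Because the number of bytes that dcbz actually clears can depend on the CPU and its configuration, the sketch probes it at run time instead of hard-coding it. On non-PowerPC hosts it falls back to memset() with an assumed 64-byte line so that it still compiles and behaves the same. It must only be used on normal cacheable memory.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__powerpc__) || defined(__PPC__) || defined(__ppc__)
#define HAVE_DCBZ 1
#else
#define HAVE_DCBZ 0
#endif

/* Returns how many bytes one dcbz clears on this machine.
   On non-PowerPC hosts an assumed value of 64 is returned. */
static size_t dcbz_block_size(void)
{
#if HAVE_DCBZ
    static unsigned char probe[512] __attribute__((aligned(256)));
    size_t cleared = 0;

    memset(probe, 0xFF, sizeof probe);
    __asm__ volatile ("dcbz 0,%0" : : "r" (probe + 256) : "memory");
    for (size_t i = 0; i < sizeof probe; i++)
        if (probe[i] == 0)
            cleared++;
    return cleared;
#else
    return 64; /* assumption for hosts without dcbz */
#endif
}

/* Zero len bytes at dst: unaligned head and tail with memset(),
   whole cache blocks with dcbz (memset() on other hosts). */
static void zero_lines(void *dst, size_t len)
{
    unsigned char *p = dst;
    size_t line = dcbz_block_size();

    if (line == 0) {            /* probe failed, play safe */
        memset(p, 0, len);
        return;
    }

    size_t head = (line - ((uintptr_t)p % line)) % line;
    if (head > len)
        head = len;
    memset(p, 0, head);
    p += head;
    len -= head;

    while (len >= line) {
#if HAVE_DCBZ
        __asm__ volatile ("dcbz 0,%0" : : "r" (p) : "memory");
#else
        memset(p, 0, line);
#endif
        p += line;
        len -= line;
    }
    memset(p, 0, len);
}
```

In a real benchmark you would call `dcbz_block_size()` once and cache the result rather than probing on every call.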
Probably not relevant for the X5000 at all, but for example with the 603e and 604 CPUs (most likely not a problem of the CPUs themselves but the rest of the BlizzardPPC/CyberstormPPC hardware) using the caches in write through instead of copy back mode resulted in overall faster system performance.
Apart from the last one it wipes the floor with your X5040...huh?
As noted earlier in this thread (post 36), the X1000 seems to be much faster than the X5000 at accessing memory, at least according to RageMem. Geennaam speculated (post 37) why this might be. And the author of RageMem has noted that it was designed for earlier, lower-spec NG Amigas, and may not be that accurate for X1/X5-class machines.
@All Yeah, it looks like ragemem can be a little bit wrong when comparing the memory access speed of the x5000 and x1000. I mean, it clearly shows the x1000 as better there, but real-world tests (at least 3D tests) show that things are faster on the x5000 instead.
It's now clear to me that unlike modern AMD/Intel CPUs, the NXP PowerPCs lack hardware cache management. Hence the slow transfer speeds with GCC generated code.
I have implemented quick and dirty DCBT/DCBA and I can already see a speedup to DDR3.
(My assumption is that the cacheline size of the e5500 is 64bytes.)
For 64-bit CPUs such as the P50x0, you should use dcbzl. Dcbz may (or may not) zero only half a cache line so that its behaviour is consistent with older PowerPC CPUs. This forces it to fetch the remaining 32 bytes, which isn't what you want.
The dcba instruction may also be very slow, just like with the G5. It's an illegal instruction on the G5, and is emulated by doing nothing (but with the overhead of an illegal instruction trap).
Hans
P.S., You can query the cache line size via exec.library's GetCPUInfo() function (GCIT_CacheLineSize).
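A minimal sketch of that query, assuming the varargs form `IExec->GetCPUInfoTags()` from the OS4 SDK and that the tag stores its result through a pointer to a 32-bit value. The wrapper name `cache_line_size` and the 64-byte fallback are my own assumptions, not SDK names. The AmigaOS part is guarded so the snippet also compiles elsewhere:

```c
#include <stdint.h>

#if defined(__amigaos4__)
#include <proto/exec.h>
#include <exec/exectags.h>
#endif

/* Assumption: used when the size cannot be queried. */
#define FALLBACK_LINE_SIZE 64u

/* Returns the CPU cache line size in bytes. */
static uint32_t cache_line_size(void)
{
    uint32_t size = 0;

#if defined(__amigaos4__)
    /* Ask exec.library for the cache line size of the running CPU. */
    IExec->GetCPUInfoTags(GCIT_CacheLineSize, &size, TAG_DONE);
#endif

    return size ? size : FALLBACK_LINE_SIZE;
}
```

Check the autodoc for the exact tag semantics before relying on this in real code.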
Quote:
It's now clear to me that unlike modern AMD/Intel CPUs, the NXP PowerPCs lack hardware cache management. Hence the slow transfer speeds with GCC generated code.
GCC has options to add data cache instructions, for example -fprefetch-loop-arrays, but with GCC 2.95.x and 3.4.x I got very poor results with it, maybe even slower than without using it.
Quote:
I have implemented quick and dirty DCBT/DCBA and I can already see a speedup to DDR3.
For best speed, at least on the old CPUs for which I implemented the memcpy() etc. functions, you need to do the DCB* not for the next reads/writes but 32 or 64 bytes in advance, for example 1 or 2 DCBT/DCBA before a loop and inside the loop current address + 32 or + 64 bytes. On 64 byte cache line size CPUs try + 64 or + 128 bytes instead, or maybe even more, you'll have to test how many cache lines have to be touched in advance for best results.
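To illustrate the "touch ahead" idea, here is a hedged sketch of a copy loop using GCC's `__builtin_prefetch()`, which GCC normally turns into a dcbt-style touch on PowerPC. It is not the newlib/Exec code. `LINE_BYTES` is an assumed 64-byte cache line, and `copy_with_prefetch` and `ahead_bytes` are illustrative names. It only covers the read side (DCBT); a DCBA/DCBZ on the destination would need the alignment care discussed earlier.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines; query the real size where possible. */
#define LINE_BYTES 64u

/* Copy n_words 64-bit words from src to dst, touching the source
   ahead_bytes in advance, once per cache line. Works best when src is
   cache line aligned. */
static void copy_with_prefetch(uint64_t *dst, const uint64_t *src,
                               size_t n_words, size_t ahead_bytes)
{
    const unsigned char *s = (const unsigned char *)src;
    size_t total = n_words * sizeof(uint64_t);

    for (size_t i = 0; i < n_words; i++) {
        size_t off = i * sizeof(uint64_t);

        /* At the start of each line, prefetch the line that will be
           read ahead_bytes later, but never beyond the buffer. */
        if (off % LINE_BYTES == 0 && off + ahead_bytes < total)
            __builtin_prefetch(s + off + ahead_bytes, 0);

        dst[i] = src[i];
    }
}
```

Try `ahead_bytes` of 64, 128 or more and measure; as said above, the best distance has to be found by testing on the actual CPU.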
Quote:
Is bcopy(), memset() and memcpy() "inspired" by the Apple powerpc assembly code?
No, I implemented them for newlib myself, using different code for different CPUs, but only for 603/4, 750, 74xy and 440ep CPUs, and only using integer and/or double accesses. For example on CPUs with 32 byte cache line size using 2 128 bit cache line aligned vector writes you don't need DCBA/DCBZ nor vector streaming instructions, the cache line reads before writes aren't done. But using 4 64 bit writes (integer or double) instead is too slow and there is always a cache line read before the write if you don't use DCB* or AltiVec streaming instructions. But someone else implemented the AltiVec version using the vector streaming instructions for the 74xy CPUs.
AFAIK later the code was moved from newlib.library to Exec or the HAL and newlib.library only calls the Exec/Utility library functions. For the 440 (and probably 460 CPUs too) versions using DMA were added, but the DMA code is only used for very large copies because of the setup overhead. Very likely optimised versions for the X1000, X50[24]0 and A1222 were implemented as well, but I don't know anything about such newer parts of newlib and/or the kernel.
All Exec functions are optimized per CPU. All the newlib code is in Exec, so you should use the Exec functions since they are always optimized for the current machine. Maybe they aren't for the x5000, but only the kernel developers know.