I am still curious what is going on with the DDR3 performance of the X5000.
I wrote a little benchmark program in plain C to test the three cache levels and, beyond them, the DDR3 memory. I wasn't able to build an e5500-optimised binary because gcc in the SDK doesn't support the -mcpu=e5500 target, so it's a generic PPC build.
Contrary to rumours on certain forums, the 2MByte CPC is enabled and configured as a copy-back L3 cache.
Each test transfers 2GByte of data. The number of passes for each test is 2GByte / block size.
The result is the average over all passes.
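For reference, here is a minimal sketch of what such a test loop looks like. This is not the actual program, just an illustration of the method; it assumes a POSIX clock_gettime(), only shows the 32bit integer write case, and the name bench_write32 is mine:

Code:

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

#define TOTAL_BYTES (2048ULL * 1024 * 1024)   /* 2 GByte transferred per test */

/* Write one block sequentially with 32bit stores, repeated until
   2 GByte have been transferred, and return the average MB/s. */
static double bench_write32(uint32_t *buf, size_t block_bytes)
{
    size_t words  = block_bytes / sizeof(uint32_t);
    size_t passes = TOTAL_BYTES / block_bytes;
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t p = 0; p < passes; p++)
        for (size_t i = 0; i < words; i++)
            buf[i] = (uint32_t)i;             /* plain sequential 32bit stores */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    /* crude guard so the compiler can't treat the stores as dead */
    volatile uint32_t sink = buf[0];
    (void)sink;

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    return TOTAL_BYTES / secs / (1024.0 * 1024.0);
}

int main(void)
{
    /* block sizes from 1 KByte (fits in L1) up to 256 MByte (DDR3 only) */
    for (size_t kb = 1; kb <= 262144; kb *= 2) {
        uint32_t *buf = malloc(kb * 1024);
        if (!buf) return 1;
        printf("Block size %zu Kb: %.2f MB/s\n", kb, bench_write32(buf, kb * 1024));
        free(buf);
    }
    return 0;
}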
4.Work:Development/projects/memspeed> memspeed
Memspeed V0.1:
Write 32bit integer:
--------------------------------------
Block size 1 Kb: 7296.37 MB/s
Block size 2 Kb: 7303.30 MB/s
Block size 4 Kb: 7496.34 MB/s
Block size 8 Kb: 7529.59 MB/s
Block size 16 Kb: 7546.76 MB/s
Block size 32 Kb: 7509.58 MB/s
Block size 64 Kb: 5283.98 MB/s
Block size 128 Kb: 5327.65 MB/s
Block size 256 Kb: 5323.72 MB/s
Block size 512 Kb: 5189.57 MB/s
Block size 1024 Kb: 4718.91 MB/s
Block size 2048 Kb: 4398.85 MB/s
Block size 4096 Kb: 1994.95 MB/s
Block size 8192 Kb: 1764.40 MB/s
Block size 16384 Kb: 1715.94 MB/s
Block size 32768 Kb: 1720.12 MB/s
Block size 65536 Kb: 1721.06 MB/s
Block size 131072 Kb: 1710.87 MB/s
Block size 262144 Kb: 1720.05 MB/s
Write 64bit integer:
--------------------------------------
Block size 1 Kb: 7625.49 MB/s
Block size 2 Kb: 7488.31 MB/s
Block size 4 Kb: 7697.63 MB/s
Block size 8 Kb: 7735.36 MB/s
Block size 16 Kb: 7756.57 MB/s
Block size 32 Kb: 7702.04 MB/s
Block size 64 Kb: 5143.28 MB/s
Block size 128 Kb: 5255.21 MB/s
Block size 256 Kb: 5255.51 MB/s
Block size 512 Kb: 5180.61 MB/s
Block size 1024 Kb: 4774.23 MB/s
Block size 2048 Kb: 4499.13 MB/s
Block size 4096 Kb: 1993.73 MB/s
Block size 8192 Kb: 1750.30 MB/s
Block size 16384 Kb: 1716.59 MB/s
Block size 32768 Kb: 1712.13 MB/s
Block size 65536 Kb: 1718.73 MB/s
Block size 131072 Kb: 1706.55 MB/s
Block size 262144 Kb: 1712.28 MB/s
Read 32bit integer:
--------------------------------------
Block size 1 Kb: 7360.47 MB/s
Block size 2 Kb: 7432.67 MB/s
Block size 4 Kb: 7592.63 MB/s
Block size 8 Kb: 7628.53 MB/s
Block size 16 Kb: 7628.96 MB/s
Block size 32 Kb: 7520.78 MB/s
Block size 64 Kb: 4413.41 MB/s
Block size 128 Kb: 4486.52 MB/s
Block size 256 Kb: 4455.85 MB/s
Block size 512 Kb: 4442.14 MB/s
Block size 1024 Kb: 1752.00 MB/s
Block size 2048 Kb: 1504.30 MB/s
Block size 4096 Kb: 778.60 MB/s
Block size 8192 Kb: 717.67 MB/s
Block size 16384 Kb: 713.41 MB/s
Block size 32768 Kb: 714.39 MB/s
Block size 65536 Kb: 711.96 MB/s
Block size 131072 Kb: 713.14 MB/s
Block size 262144 Kb: 712.18 MB/s
Read 64bit integer:
--------------------------------------
Block size 1 Kb: 7660.88 MB/s
Block size 2 Kb: 7648.96 MB/s
Block size 4 Kb: 7724.62 MB/s
Block size 8 Kb: 7766.72 MB/s
Block size 16 Kb: 7650.40 MB/s
Block size 32 Kb: 7694.40 MB/s
Block size 64 Kb: 4497.88 MB/s
Block size 128 Kb: 4420.66 MB/s
Block size 256 Kb: 4498.84 MB/s
Block size 512 Kb: 4457.12 MB/s
Block size 1024 Kb: 1761.04 MB/s
Block size 2048 Kb: 1525.33 MB/s
Block size 4096 Kb: 776.64 MB/s
Block size 8192 Kb: 717.97 MB/s
Block size 16384 Kb: 713.57 MB/s
Block size 32768 Kb: 712.62 MB/s
Block size 65536 Kb: 714.85 MB/s
Block size 131072 Kb: 713.39 MB/s
Block size 262144 Kb: 713.53 MB/s
Write 32bit float:
--------------------------------------
Block size 1 Kb: 7344.76 MB/s
Block size 2 Kb: 7486.04 MB/s
Block size 4 Kb: 7468.08 MB/s
Block size 8 Kb: 7579.15 MB/s
Block size 16 Kb: 7591.77 MB/s
Block size 32 Kb: 7560.02 MB/s
Block size 64 Kb: 5238.74 MB/s
Block size 128 Kb: 5209.86 MB/s
Block size 256 Kb: 5281.30 MB/s
Block size 512 Kb: 5258.00 MB/s
Block size 1024 Kb: 4843.57 MB/s
Block size 2048 Kb: 4593.66 MB/s
Block size 4096 Kb: 1985.41 MB/s
Block size 8192 Kb: 1748.74 MB/s
Block size 16384 Kb: 1708.76 MB/s
Block size 32768 Kb: 1709.08 MB/s
Block size 65536 Kb: 1698.84 MB/s
Block size 131072 Kb: 1708.93 MB/s
Block size 262144 Kb: 1706.82 MB/s
Write 64bit float:
--------------------------------------
Block size 1 Kb: 14671.21 MB/s
Block size 2 Kb: 14723.52 MB/s
Block size 4 Kb: 14958.48 MB/s
Block size 8 Kb: 15108.87 MB/s
Block size 16 Kb: 15149.16 MB/s
Block size 32 Kb: 15088.17 MB/s
Block size 64 Kb: 9300.81 MB/s
Block size 128 Kb: 8960.95 MB/s
Block size 256 Kb: 9294.13 MB/s
Block size 512 Kb: 9258.15 MB/s
Block size 1024 Kb: 5371.73 MB/s
Block size 2048 Kb: 4845.98 MB/s
Block size 4096 Kb: 2003.67 MB/s
Block size 8192 Kb: 1766.85 MB/s
Block size 16384 Kb: 1707.37 MB/s
Block size 32768 Kb: 1716.26 MB/s
Block size 65536 Kb: 1719.92 MB/s
Block size 131072 Kb: 1714.04 MB/s
Block size 262144 Kb: 1708.27 MB/s
Read 32bit float:
--------------------------------------
Block size 1 Kb: 7228.52 MB/s
Block size 2 Kb: 7508.05 MB/s
Block size 4 Kb: 7585.09 MB/s
Block size 8 Kb: 7623.84 MB/s
Block size 16 Kb: 7555.20 MB/s
Block size 32 Kb: 7559.27 MB/s
Block size 64 Kb: 4440.56 MB/s
Block size 128 Kb: 4383.84 MB/s
Block size 256 Kb: 4407.89 MB/s
Block size 512 Kb: 4404.87 MB/s
Block size 1024 Kb: 1756.79 MB/s
Block size 2048 Kb: 1517.62 MB/s
Block size 4096 Kb: 777.15 MB/s
Block size 8192 Kb: 715.84 MB/s
Block size 16384 Kb: 712.32 MB/s
Block size 32768 Kb: 713.52 MB/s
Block size 65536 Kb: 712.54 MB/s
Block size 131072 Kb: 712.34 MB/s
Block size 262144 Kb: 712.24 MB/s
Read 64bit float:
--------------------------------------
Block size 1 Kb: 14763.19 MB/s
Block size 2 Kb: 14706.17 MB/s
Block size 4 Kb: 15021.15 MB/s
Block size 8 Kb: 14814.83 MB/s
Block size 16 Kb: 15228.22 MB/s
Block size 32 Kb: 14990.95 MB/s
Block size 64 Kb: 8336.21 MB/s
Block size 128 Kb: 8342.40 MB/s
Block size 256 Kb: 8334.50 MB/s
Block size 512 Kb: 8047.62 MB/s
Block size 1024 Kb: 3170.09 MB/s
Block size 2048 Kb: 2831.46 MB/s
Block size 4096 Kb: 1354.34 MB/s
Block size 8192 Kb: 1301.80 MB/s
Block size 16384 Kb: 1286.87 MB/s
Block size 32768 Kb: 1293.76 MB/s
Block size 65536 Kb: 1291.12 MB/s
Block size 131072 Kb: 1291.57 MB/s
Block size 262144 Kb: 1289.57 MB/s
You can clearly distinguish each cache level up to the DDR memory: the drop after 32 Kb matches the 32 KByte L1 data cache, the plateau up to 512 Kb the 512 KByte L2, the 1024-2048 Kb range the 2MByte CPC, and from 4096 Kb on we are measuring DDR3.
One thing that I noticed is that the 64bit integer read/write speed is on the same level as for 32bit integers. This is different for floats, where the L1 and L2 cache speed nearly doubles for 64bit floats. I do not know if this is a limitation of the e5500 core or a consequence of having built a generic PPC binary. The 64bit float performance can also explain the write performance of the X1000: when ragemem is optimised for AltiVec or can otherwise make use of the 128bit load/store unit, this effectively doubles the theoretical memory bandwidth compared to the X5000.
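As for the int64 vs fp64 behaviour, my guess -- an assumption on my part, I haven't disassembled the binary -- is that the generic build is the cause: a 32bit PPC binary has to split every 64bit integer access into two 32bit GPR accesses, while the FPU has true 64bit load/store instructions (lfd/stfd) even on 32bit PowerPC. Roughly:

Code:

#include <stdint.h>

/* Assumed code generation of a generic 32bit PPC compiler: */

uint64_t load64i(const uint64_t *p)   /* two 32bit GPR loads,            */
{                                     /* e.g. lwz r3,0(r9); lwz r4,4(r9) */
    return *p;                        /* -> same throughput as int32     */
}

double load64f(const double *q)       /* one 64bit FPR load,             */
{                                     /* e.g. lfd f1,0(r9)               */
    return *q;                        /* -> twice the bytes per access,  */
}                                     /*    hence the ~2x fp32 speed     */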
Another observation worth mentioning is that as soon as a write hits the L3 cache or the DDR3 memory, the speed drops to the same level for all four write scenarios (int32, int64, fp32, fp64). Only the 64bit float read benefits from the 64bit accesses.
The e5500 cores, the L3 cache and the DDR3 controllers are all connected to the CoreNet Coherency Fabric (CCF). This internal bus runs at 800MHz and is advertised to support a sustained read performance of 128 bytes/clock cycle, which works out to 128 bytes x 800 MHz = 102.4 GByte/s, i.e. a sustained read bandwidth of ~100GByte/s. Furthermore, the reference manual claims that this is a "low latency" datapath to the L3 cache. The L3 cache is also advertised to support a sustained read bandwidth of ~100GByte/s.
So if the bandwidth of the L3 cache and the CoreNet Coherency Fabric is not the issue, then the only viable options left are latency or another bottleneck between the e5500 cores and the CCF.
The reference manual doesn't mention how the DDR3 controller is connected to the CCF, but I can imagine that it's a 128bit interface. The large difference in speed between the L3 cache and the DDR3 controller can be explained by the latency (wait states) introduced by crossing clock domains (800MHz vs 666MHz), and of course by the latencies of the DDR3 memory itself.
The following topic on the NXP forum suggests that it's the latency that is killing performance in my test loops.
Quote:
We nowhere considered the L2 cache as a core's accelerator in the P2020. When enabled, it inserts additional latency to core's transactions due to time required for cache hit checking. If no hit, the transaction is sent to coherent bus, for example to DDR. L2 cache can help to speed up core's operations only if the data being read is absent in the L1 cache but valid in L2 one. This depends on the code and may or may not happen.
Main features of the L2 cache are as follows:
- allows access to the cache for the I/O masters (feature called 'stashing'), in this case the core reads data from the L2 cache instead of DDR;
- allows to share data between two cores.
So the QorIQ L2 cache is optimised for I/O handling and inter-core communication rather than raw processing speed, which sounds logical considering that this processor is designed to be a network processor.
This also explains why the bandwidth of my simple test loops drops considerably as we go down the cache hierarchy.
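A back-of-the-envelope calculation -- my own estimate, not from the manual -- makes the latency explanation plausible. If every 64 byte cache line has to be fetched more or less serialised (i.e. with little prefetch overlap), the achievable bandwidth is simply line size divided by load-to-use latency. Turned around, the measured ~712 MB/s for DDR3 reads corresponds to 64 bytes / (712 x 2^20 bytes/s) = ~86 ns per line, which is a realistic latency for a miss that has to cross the CCF, the clock domain boundary and the DDR3 memory itself. The writes presumably fare better (~1700 MB/s) because stores can be buffered and don't stall the core the way a chain of loads does.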
I repeated my memory test with the L3 cache configured as a FIFO, and even with it completely disabled, but I could hardly notice a difference in DDR3 performance.
So the next test would be to see what happens when I disable the L2 cache, or to test the DDR3 performance with a DMA transfer.
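In the meantime, a dependent pointer chase would be a way to measure the latency itself instead of the bandwidth: every load depends on the result of the previous one, so the time per step approximates the load-to-use latency of whatever level of the hierarchy the working set lands in. A hypothetical sketch, again assuming a POSIX clock_gettime():

Code:

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* small xorshift PRNG so we don't depend on the range of rand() */
static uint64_t state = 88172645463325252ULL;
static size_t xrand(void)
{
    state ^= state << 13; state ^= state >> 7; state ^= state << 17;
    return (size_t)state;
}

int main(void)
{
    size_t bytes = 32u * 1024 * 1024;      /* well past the 2MByte CPC */
    size_t n = bytes / sizeof(void *);
    void **ring = malloc(bytes);
    if (!ring) return 1;

    /* Sattolo's algorithm: link all cells into ONE random cycle, so the
       chase visits the whole buffer in prefetcher-hostile order. */
    for (size_t i = 0; i < n; i++) ring[i] = &ring[i];
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = xrand() % i;            /* 0 <= j < i */
        void *tmp = ring[i]; ring[i] = ring[j]; ring[j] = tmp;
    }

    size_t steps = 20u * 1000 * 1000;
    void **p = ring;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < steps; i++)
        p = (void **)*p;                   /* each load waits for the last */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = ((t1.tv_sec - t0.tv_sec) * 1e9
               + (t1.tv_nsec - t0.tv_nsec)) / (double)steps;
    printf("%.1f ns per dependent load (end %p)\n", ns, (void *)p);
    return 0;
}

If the latency theory is right, the nanoseconds per load should jump sharply once the buffer no longer fits in the 2MByte CPC, no matter how much bandwidth the CCF has on paper.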