Re: x5000 benchmarks / speed up
Quite a regular
@mufa

Your values are really low. It looks to me like this is caused by the combination of high latencies and single-rank memory. But it also looks like your memory controller is running at only 666 MHz. I cannot believe that it's the P5040's fault; after all, it needs to feed 4 CPU cores instead of 2, hence the increase in memory controller clock speed.

We need a CPU-Z like program to check the SPD and memory controller settings.

I'll create one if I can find the right motivation to do so.



Re: x5000 benchmarks / speed up
Not too shy to talk
@mufa

Is that a screenshot of a P5040? (at 2 GHz)

Documentation states that a P5040 runs at 2.2 GHz:

X5000/020 with P5020 dual-core CPU running at a clock speed of 2 GHz
X5000/040 with P5040 Rev C quad-core CPU running at a clock speed of 2.2 GHz


AmigaOne X5000 -> 2GHz / 16GB RAM / Radeon RX 550 / ATI X1950 / M-Audio 5.1 -> AmigaOS 4.1 FE / Linux / MorphOS
Amiga 1200 -> Recapped / PiStorm CM4 / SD HDD / WifiPi connected to the NET
Vampire V4SE TrioBoot
RPI4 AmiKit XE
Re: x5000 benchmarks / speed up
Not too shy to talk
@Skateman

Sysmon's Benchmark tab does not show the processor type, but as you can see, my Amiga's CPU is more than 10% faster than the X5020's and delivers more MIPS.

Well, so there is no doubt about which computer I tested on, here is a different Sysmon tab where you can see the CPU type.

[Sysmon screenshot showing the CPU type]



Re: x5000 benchmarks / speed up
Home away from home
@geennaam
Quote:

Your values are really low. It looks to me like this is caused by the combination of high latencies and single-rank memory. But it also looks like your memory controller is running at only 666 MHz. I cannot believe that it's the P5040's fault; after all, it needs to feed 4 CPU cores instead of 2, hence the increase in memory controller clock speed.


There is some problem with the X5000/040 in terms of bandwidth (and it happens not only on Mufa's hardware but on all the others). We found the same issue with Hans some time ago (he has an X5000/040): all the FPS we compared across all the games were always worse for him, even when we had the same cards/components/etc.

Some games give 30% slower FPS results on the 040, and I haven't heard of anyone on the team reporting those issues or finding the roots.

Even the GART support for the graphics drivers shows _SLOWER_ speeds when comparing the 020 and 040 (the 020 is faster). That's not with Ragemem, but with our internal tests. I.e., let's say for the 020 it gives us 550 MB/s, and for the 040 it gives, let's say, 350 MB/s.

As I don't have an 040, I can't test it all properly, find the roots and make a proper bug report; that's the job of the 040 beta testers. But the fact is that there are some issues somewhere, and it is unknown for now in which component: hardware, or software (kernel and stuff).

My hunch is that it could be something about U-Boot default values, or default kernel initialisation, or something of that sort (I can't believe it's the hardware, as that was done by Varisys as well). But again, no one has made a proper report about it; still, there is 101% some issue somewhere on the 040.

Join us to improve dopus5!
AmigaOS4 on youtube
Re: x5000 benchmarks / speed up
Not too shy to talk
@mufa

I see.... thanks for the screenshot.


Re: x5000 benchmarks / speed up
Quite a regular
I am still curious what is going on with the DDR3 performance of the X5000.

I wrote a little benchmark program in plain C to test the three cache levels until we reach the DDR3 memory. I wasn't able to build an e5500-optimised binary because gcc in the SDK doesn't support the -mcpu=e5500 target, so it's a generic PPC build.

Contrary to rumours on certain forums, the 2 MByte CPC is enabled and configured as L3 copy-back cache.

Each test transfers 2 GByte of data, so the number of passes for each test is 2 GByte/blocksize.

The result is the average over the passes.

4.Work:Development/projects/memspeed> memspeed
Memspeed V0.1:

Write 32bit integer:
--------------------------------------
Block size      1 Kb   7296.37 MB/s
Block size      2 Kb   7303.30 MB/s
Block size      4 Kb   7496.34 MB/s
Block size      8 Kb   7529.59 MB/s
Block size     16 Kb   7546.76 MB/s
Block size     32 Kb   7509.58 MB/s
Block size     64 Kb   5283.98 MB/s
Block size    128 Kb   5327.65 MB/s
Block size    256 Kb   5323.72 MB/s
Block size    512 Kb   5189.57 MB/s
Block size   1024 Kb   4718.91 MB/s
Block size   2048 Kb   4398.85 MB/s
Block size   4096 Kb   1994.95 MB/s
Block size   8192 Kb   1764.40 MB/s
Block size  16384 Kb   1715.94 MB/s
Block size  32768 Kb   1720.12 MB/s
Block size  65536 Kb   1721.06 MB/s
Block size 131072 Kb   1710.87 MB/s
Block size 262144 Kb   1720.05 MB/s

Write 64bit integer:
--------------------------------------
Block size      1 Kb   7625.49 MB/s
Block size      2 Kb   7488.31 MB/s
Block size      4 Kb   7697.63 MB/s
Block size      8 Kb   7735.36 MB/s
Block size     16 Kb   7756.57 MB/s
Block size     32 Kb   7702.04 MB/s
Block size     64 Kb   5143.28 MB/s
Block size    128 Kb   5255.21 MB/s
Block size    256 Kb   5255.51 MB/s
Block size    512 Kb   5180.61 MB/s
Block size   1024 Kb   4774.23 MB/s
Block size   2048 Kb   4499.13 MB/s
Block size   4096 Kb   1993.73 MB/s
Block size   8192 Kb   1750.30 MB/s
Block size  16384 Kb   1716.59 MB/s
Block size  32768 Kb   1712.13 MB/s
Block size  65536 Kb   1718.73 MB/s
Block size 131072 Kb   1706.55 MB/s
Block size 262144 Kb   1712.28 MB/s

Read 32bit integer:
--------------------------------------
Block size      1 Kb   7360.47 MB/s
Block size      2 Kb   7432.67 MB/s
Block size      4 Kb   7592.63 MB/s
Block size      8 Kb   7628.53 MB/s
Block size     16 Kb   7628.96 MB/s
Block size     32 Kb   7520.78 MB/s
Block size     64 Kb   4413.41 MB/s
Block size    128 Kb   4486.52 MB/s
Block size    256 Kb   4455.85 MB/s
Block size    512 Kb   4442.14 MB/s
Block size   1024 Kb   1752.00 MB/s
Block size   2048 Kb   1504.30 MB/s
Block size   4096 Kb    778.60 MB/s
Block size   8192 Kb    717.67 MB/s
Block size  16384 Kb    713.41 MB/s
Block size  32768 Kb    714.39 MB/s
Block size  65536 Kb    711.96 MB/s
Block size 131072 Kb    713.14 MB/s
Block size 262144 Kb    712.18 MB/s

Read 64bit integer:
--------------------------------------
Block size      1 Kb   7660.88 MB/s
Block size      2 Kb   7648.96 MB/s
Block size      4 Kb   7724.62 MB/s
Block size      8 Kb   7766.72 MB/s
Block size     16 Kb   7650.40 MB/s
Block size     32 Kb   7694.40 MB/s
Block size     64 Kb   4497.88 MB/s
Block size    128 Kb   4420.66 MB/s
Block size    256 Kb   4498.84 MB/s
Block size    512 Kb   4457.12 MB/s
Block size   1024 Kb   1761.04 MB/s
Block size   2048 Kb   1525.33 MB/s
Block size   4096 Kb    776.64 MB/s
Block size   8192 Kb    717.97 MB/s
Block size  16384 Kb    713.57 MB/s
Block size  32768 Kb    712.62 MB/s
Block size  65536 Kb    714.85 MB/s
Block size 131072 Kb    713.39 MB/s
Block size 262144 Kb    713.53 MB/s

Write 32bit float:
--------------------------------------
Block size      1 Kb   7344.76 MB/s
Block size      2 Kb   7486.04 MB/s
Block size      4 Kb   7468.08 MB/s
Block size      8 Kb   7579.15 MB/s
Block size     16 Kb   7591.77 MB/s
Block size     32 Kb   7560.02 MB/s
Block size     64 Kb   5238.74 MB/s
Block size    128 Kb   5209.86 MB/s
Block size    256 Kb   5281.30 MB/s
Block size    512 Kb   5258.00 MB/s
Block size   1024 Kb   4843.57 MB/s
Block size   2048 Kb   4593.66 MB/s
Block size   4096 Kb   1985.41 MB/s
Block size   8192 Kb   1748.74 MB/s
Block size  16384 Kb   1708.76 MB/s
Block size  32768 Kb   1709.08 MB/s
Block size  65536 Kb   1698.84 MB/s
Block size 131072 Kb   1708.93 MB/s
Block size 262144 Kb   1706.82 MB/s

Write 64bit float:
--------------------------------------
Block size      1 Kb  14671.21 MB/s
Block size      2 Kb  14723.52 MB/s
Block size      4 Kb  14958.48 MB/s
Block size      8 Kb  15108.87 MB/s
Block size     16 Kb  15149.16 MB/s
Block size     32 Kb  15088.17 MB/s
Block size     64 Kb   9300.81 MB/s
Block size    128 Kb   8960.95 MB/s
Block size    256 Kb   9294.13 MB/s
Block size    512 Kb   9258.15 MB/s
Block size   1024 Kb   5371.73 MB/s
Block size   2048 Kb   4845.98 MB/s
Block size   4096 Kb   2003.67 MB/s
Block size   8192 Kb   1766.85 MB/s
Block size  16384 Kb   1707.37 MB/s
Block size  32768 Kb   1716.26 MB/s
Block size  65536 Kb   1719.92 MB/s
Block size 131072 Kb   1714.04 MB/s
Block size 262144 Kb   1708.27 MB/s

Read 32bit float:
--------------------------------------
Block size      1 Kb   7228.52 MB/s
Block size      2 Kb   7508.05 MB/s
Block size      4 Kb   7585.09 MB/s
Block size      8 Kb   7623.84 MB/s
Block size     16 Kb   7555.20 MB/s
Block size     32 Kb   7559.27 MB/s
Block size     64 Kb   4440.56 MB/s
Block size    128 Kb   4383.84 MB/s
Block size    256 Kb   4407.89 MB/s
Block size    512 Kb   4404.87 MB/s
Block size   1024 Kb   1756.79 MB/s
Block size   2048 Kb   1517.62 MB/s
Block size   4096 Kb    777.15 MB/s
Block size   8192 Kb    715.84 MB/s
Block size  16384 Kb    712.32 MB/s
Block size  32768 Kb    713.52 MB/s
Block size  65536 Kb    712.54 MB/s
Block size 131072 Kb    712.34 MB/s
Block size 262144 Kb    712.24 MB/s

Read 64bit float:
--------------------------------------
Block size      1 Kb  14763.19 MB/s
Block size      2 Kb  14706.17 MB/s
Block size      4 Kb  15021.15 MB/s
Block size      8 Kb  14814.83 MB/s
Block size     16 Kb  15228.22 MB/s
Block size     32 Kb  14990.95 MB/s
Block size     64 Kb   8336.21 MB/s
Block size    128 Kb   8342.40 MB/s
Block size    256 Kb   8334.50 MB/s
Block size    512 Kb   8047.62 MB/s
Block size   1024 Kb   3170.09 MB/s
Block size   2048 Kb   2831.46 MB/s
Block size   4096 Kb   1354.34 MB/s
Block size   8192 Kb   1301.80 MB/s
Block size  16384 Kb   1286.87 MB/s
Block size  32768 Kb   1293.76 MB/s
Block size  65536 Kb   1291.12 MB/s
Block size 131072 Kb   1291.57 MB/s
Block size 262144 Kb   1289.57 MB/s


You can clearly distinguish between each cache level up to the DDR memory.

One thing I noticed is that the 64bit integer read/write speed is at the same level as 32bit integers. This is different for floats, where the L1 and L2 cache speed nearly doubles for 64bit floats. I do not know if this is a limitation of the e5500 core or of the fact that I've built a generic PPC binary. The 64bit float performance can also explain the write performance of the X1000: when ragemem is optimised for AltiVec or can otherwise make use of the 128bit load/store unit, this effectively doubles the theoretical memory bandwidth compared to the X5000.
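The 32bit vs 64bit distinction above comes down to the width of each load: moving the same buffer as doubles needs half the load instructions. A portable sketch of the two read loops (hypothetical helper names, not memspeed's actual code):

```c
#include <stddef.h>
#include <stdint.h>

/* Reads a buffer as 32-bit words: one load instruction per 4 bytes. */
static uint32_t read32_sum(const uint32_t *src, size_t bytes)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < bytes / sizeof(uint32_t); i++)
        sum += src[i];
    return sum;
}

/* Reads a buffer as 64-bit doubles: one load instruction per 8 bytes,
 * so half the loads for the same number of bytes. On the e5500 the
 * FPU's 64-bit loads are what nearly doubles the L1/L2 numbers in
 * the listing above, as long as the data still fits in cache. */
static double read64_sum(const double *src, size_t bytes)
{
    double sum = 0.0;
    for (size_t i = 0; i < bytes / sizeof(double); i++)
        sum += src[i];
    return sum;
}
```

Once the working set spills past the caches, the memory interface (not the instruction count) becomes the bottleneck, which is why the 32bit and 64bit numbers converge at large block sizes.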

Another observation worth mentioning is that as soon as a write hits the L3 cache or the DDR3 memory, the speed drops to the same level for all four write scenarios (int32, int64, fp32, fp64). Only the 64bit float read benefits from the 64bit access.

The e5500 cores, the L3 cache and the DDR3 controllers are all connected to the CoreNet Coherency Fabric (CCF). This internal bus runs at 800 MHz and is advertised to support a sustained read performance of 128 bytes/clock cycle, i.e. a sustained read bandwidth of ~100 GByte/s. Furthermore, the reference manual claims that this is a "low latency" datapath to the L3 cache. The L3 cache is also advertised to support a sustained read bandwidth of ~100 GByte/s.
So if the bandwidth of the L3 cache and the CoreNet Coherency Fabric is not the issue, then the only viable options left are latency or another bottleneck between the e5500 and the CCF.
The reference manual doesn't mention how the DDR3 controller is connected to the CCF, but I can imagine that this is a 128bit interface. The large difference in speed between the L3 cache and the DDR3 controller can be explained by the latency (wait states) introduced by crossing clock domains (800 MHz vs 666 MHz), and of course by the latencies of the DDR3 memory itself.

The following topic on the NXP forum suggests that it's the latency that is killing performance in my test loops.

Quote:
We nowhere considered the L2 cache as a core's accelerator in the P2020. When enabled, it inserts additional latency to core's transactions due to time required for cache hit checking. If no hit, the transaction is sent to coherent bus, for example to DDR. L2 cache can help to speed up core's operations only if the data being read is absent in the L1 cache but valid in L2 one. This depends on the code and may or may not happen.
Main features of the L2 cache are as follows:
- allows access to the cache for the I/O masters (feature called 'stashing'), in this case the core reads data from the L2 cache instead of DDR;
- allows to share data between two cores.


So the QorIQ L2 cache is optimised for I/O handling and inter-core communication instead of raw processing speed. Which sounds logical considering that this processor is designed to be a network processor.

This also explains why the bandwidth of my simple copy loop drops considerably as we go down the cache hierarchy.

I repeated my memory test with the L3 cache configured as a FIFO, and even with it completely disabled. But I could hardly notice a difference in DDR3 performance.

So the next test would be to see what happens when I disable the L2 cache, or to test DDR3 performance with a DMA transfer.

Re: x5000 benchmarks / speed up
Just can't stay away
@geennaam

I never tried to implement optimised read/write/copy code on my X5000, I only did it for some of the other systems supported by AmigaOS4, but from your results it's quite obvious that you aren't using DCBT (for reads) and DCBA (for writes; if not supported by the 5020, use DCBZ instead).
On none of the PPC CPUs supported by AmigaOS4 do you get anything near the max. speeds without either using the DCBxy instructions or, where supported, the AltiVec streaming instructions instead.
Important: The cache line size of the 5020 is, or may be depending on some CPU configuration bits, different from most other PPC CPUs supported by AmigaOS4.
AFAIK the OS4 kernel implements correct (or at least much better optimised) 5020 code only in the beta versions of the kernel, not the public ones.

Very simple test for much faster writes: Use a DCBZ loop (or IUtility->SetMem(), BZero(), newlib memset(), bzero(), etc.). Of course that way you won't get the results most software will get, but the results will be much closer to the hardware limits.

Probably not relevant for the X5000 at all, but for example with the 603e and 604 CPUs (most likely not a problem of the CPUs themselves but of the rest of the BlizzardPPC/CyberstormPPC hardware), using the caches in write-through instead of copy-back mode resulted in overall faster system performance.


Edited by joerg on 2022/2/18 17:03:31
Re: x5000 benchmarks / speed up
Just popping in
@geennaam

Any chance I could get a copy of your benchmark program? I'm curious what the X1000 numbers would be.

Re: x5000 benchmarks / speed up
Just can't stay away
Replaced memory on my X5040:

Quote:

Detected UDIMM KF1600C10D3/8G
Detected UDIMM KF1600C10D3/8G


Ragemem before:

Quote:

READ32: 598 MB/Sec
READ64: 1000 MB/Sec
WRITE32: 812 MB/Sec
WRITE64: 812 MB/Sec
WRITE: 2431 MB/Sec (Tricky)


Ragemem after:

Quote:

READ32: 666 MB/Sec
READ64: 1224 MB/Sec
WRITE32: 1593 MB/Sec
WRITE64: 1599 MB/Sec
WRITE: 2471 MB/Sec (Tricky)


Shaderjoy C++ single thread compilation before: ~126 seconds, after: ~117 seconds.

Re: x5000 benchmarks / speed up
Home away from home
@Capehill

I wouldn't have thought it would be so slow???

Here are my X1000 results:
Quote:

---> RAM <---
READ32: 2805 MB/Sec
READ64: 4049 MB/Sec
WRITE32: 2350 MB/Sec
WRITE64: 2266 MB/Sec
WRITE: 334 MB/Sec (Tricky)

Apart from the last one it wipes the floor with your X5040...huh?

Re: x5000 benchmarks / speed up
Home away from home
@Raziel

Read64/Write64: how is that implemented? Have you looked at what assembler opcodes it uses? It can be faster reading/writing as doubles.

(NutsAboutAmiga)

Basilisk II for AmigaOS4
AmigaInputAnywhere
Excalibur
and other tools and apps.
Re: x5000 benchmarks / speed up
Home away from home
@LiveForIt

No idea.

I just let RageMem (v0.37) run (without options) from the shell and that is what it spat out.

Re: x5000 benchmarks / speed up
Just popping in
@Raziel

Quote:
Apart from the last one it wipes the floor with your X5040...huh?

As noted earlier in this thread (post 36), the X1000 seems to be much faster than the X5000 at accessing memory, at least according to RageMem. Geennaam speculated (post 37) why this might be. And the author of RageMem has noted that it was designed for earlier, lower-spec NG Amigas, and may not be that accurate for X1/X5-class machines.

Re: x5000 benchmarks / speed up
Home away from home
@All
Yeah, it looks like ragemem can be a little bit off there when comparing the speed of the x5000 and the x1000 in terms of memory access. I mean, it clearly shows the x1000 as better there, but real-time tests show that things are faster on the x5000 instead (at least 3D tests).

Re: x5000 benchmarks / speed up
Not too shy to talk
X5000
Kingston Fury KF318C10BBK2/8
2x4GB DDR3 1866MHz

RAGEMEM v0.37 compiled 11/06/2010

CPU: Freescale P5020 (E5500 core) 1.2 1995 Mhz
Caches Sizes: L1 32 KB, L2 512 KB, L3 none
Cache Line: 64

---> CPU <---
MAX MIPS:  3988

---> L1 <---
READ32:   7533 MB/Sec
READ64:  15044 MB/Sec
WRITE32:  7535 MB/Sec
WRITE64: 15048 MB/Sec

---> L2 <---
READ32:  4286 MB/Sec
READ64:  7722 MB/Sec
WRITE32: 5020 MB/Sec
WRITE64: 8817 MB/Sec

---> RAM <---
READ32:   681 MB/Sec
READ64:  1217 MB/Sec
WRITE32: 1477 MB/Sec
WRITE64: 1482 MB/Sec
WRITE:   2313 MB/Sec (Tricky)

---> VIDEO BUS <---
READ:  23 MB/Sec
WRITE: 540 MB/Sec

Re: x5000 benchmarks / speed up
Quite a regular
@joerg

It's now clear to me that, unlike modern AMD/Intel CPUs, the NXP PowerPCs lack hardware cache management. Hence the slow transfer speeds with GCC-generated code.

I have implemented quick and dirty DCBT/DCBA and I can already see a speedup to DDR3.

(My assumption is that the cache line size of the e5500 is 64 bytes.)

Memspeed V0.2:

Write 32bit integer:
--------------------------------------
Block size  16384 Kb   2785.91 MB/s  +62%

Write 64bit integer:
--------------------------------------
Block size  16384 Kb   2040.55 MB/s  +19%

Read 32bit integer:
--------------------------------------
Block size  16384 Kb   1587.62 MB/s  +123%

Read 64bit integer:
--------------------------------------
Block size  16384 Kb   1292.63 MB/s  +80%

Write 32bit float:
--------------------------------------
Block size  16384 Kb   2784.67 MB/s  +63%

Write 64bit float:
--------------------------------------
Block size  16384 Kb   2211.16 MB/s  +30%

Read 32bit float:
--------------------------------------
Block size  16384 Kb   1590.21 MB/s  +123%

Read 64bit float:
--------------------------------------
Block size  16384 Kb   1718.09 MB/s  +33%



Are bcopy(), memset() and memcpy() "inspired" by the Apple PowerPC assembly code? From what I've heard it's lightning fast.


Edited by geennaam on 2023/2/20 23:35:07
Re: x5000 benchmarks / speed up
Quite a regular
@joerg

I am probably doing something wrong, but as soon as the buffers don't fit into the CPU cache anymore, the copy performance is very poor.

I tried bcopy(), memcpy() and IExec->CopyMemQuick().

All give more or less the same result:
Copy (CopyMemQuick):
--------------------------------------
Block size      1 Kb   3423.14 MB/s
Block size      2 Kb   3657.73 MB/s
Block size      4 Kb   3751.87 MB/s
Block size      8 Kb   3845.11 MB/s
Block size     16 Kb   3722.72 MB/s
Block size     32 Kb   2970.01 MB/s
Block size     64 Kb   2959.63 MB/s
Block size    128 Kb   2979.04 MB/s
Block size    256 Kb   2940.45 MB/s
Block size    512 Kb   1487.38 MB/s
Block size   1024 Kb   1313.59 MB/s
Block size   2048 Kb    541.94 MB/s
Block size   4096 Kb    501.63 MB/s
Block size   8192 Kb    496.15 MB/s
Block size  16384 Kb    496.39 MB/s


I am not doing any DCBT/DCBA here because I assume those functions are already optimised.

Re: x5000 benchmarks / speed up
Home away from home
@geennaam

For 64-bit CPUs such as the P50x0, you should use dcbzl. Dcbz may (or may not) zero only half a cache line, so that its behaviour is consistent with older PowerPC CPUs. This forces it to fetch the remaining 32 bytes, which isn't what you want.

The dcba instruction may also be very slow, just like on the G5. It's an illegal instruction on the G5, and is emulated by doing nothing (but with the overhead of an illegal instruction trap).

Hans


P.S. You can query the cache line size via the exec.library's GetCPUInfo() function (GCIT_CacheLineSize).

Join Kea Campus' Amiga Corner and support Amiga content creation
https://keasigmadelta.com/ - see more of my work
Re: x5000 benchmarks / speed up
Just can't stay away
@geennaam
Quote:
It's now clear to me that unlike modern AMD/Intel CPUs, the NXP PowerPCs lack hardware cache management. Hence the slow transfer speeds with GCC generated code.
GCC has options to add data cache instructions, for example -fprefetch-loop-arrays, but with GCC 2.95.x and 3.4.x I got very poor results with it, maybe even slower than without using it.

Quote:
I have implemented quick and dirty DCBT/DCBA and I can already see a speedup to DDR3.
For best speed, at least on the old CPUs for which I implemented the memcpy() etc. functions, you need to do the DCB* not for the next reads/writes but 32 or 64 bytes in advance, for example 1 or 2 DCBT/DCBA before a loop and inside the loop current address + 32 or + 64 bytes.
On 64 byte cache line size CPUs try + 64 or + 128 bytes instead, or maybe even more, you'll have to test how many cache lines have to be touched in advance for best results.
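The touch-ahead scheme joerg describes can be sketched with GCC's __builtin_prefetch, which compiles to dcbt on PowerPC (and prefetch instructions on other targets). The function name is mine, and the two-cache-line distance is a starting assumption to tune, not a measured optimum:

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE     64                  /* e5500 line size per the discussion */
#define PREFETCH_AHEAD (2 * CACHE_LINE)    /* tune: 1-2 lines ahead, maybe more */

/* Copy loop that touches source cache lines ahead of the reads, as
 * described above: the prefetch for address X is issued while the data
 * at X - PREFETCH_AHEAD is still being copied, hiding the miss latency. */
static void copy_prefetch(uint64_t *dst, const uint64_t *src, size_t bytes)
{
    size_t words = bytes / sizeof(uint64_t);
    for (size_t i = 0; i < words; i++) {
        /* emits dcbt on PowerPC; prefetching past the end is harmless */
        __builtin_prefetch((const char *)&src[i] + PREFETCH_AHEAD, 0, 0);
        dst[i] = src[i];
    }
}
```

In real code you would hoist the prefetch out so it fires once per cache line rather than once per word, but the principle (prefetch N bytes ahead of the current access) is the same.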

Quote:
Is bcopy(), memset() and memcpy() "inspired" by the Apple powerpc assembly code?
No, I implemented them for newlib myself, using different code for different CPUs, but only for 603/4, 750, 74xy and 440ep CPUs, and only using integer and/or double accesses.
For example, on CPUs with a 32 byte cache line size, using 2 cache-line-aligned 128 bit vector writes you don't need DCBA/DCBZ nor vector streaming instructions; the cache line reads before the writes aren't done. But using 4 64 bit writes (integer or double) instead is too slow, and there is always a cache line read before the write if you don't use DCB* or AltiVec streaming instructions.
But someone else implemented the AltiVec version using the vector streaming instructions for the 74xy CPUs.

AFAIK later the code was moved from newlib.library to Exec or the HAL and newlib.library only calls the Exec/Utility library functions. For the 440 (and probably 460 CPUs too) versions using DMA were added, but the DMA code is only used for very large copies because of the setup overhead.
Very likely optimised versions for the X1000, X50[24]0 and A1222 were implemented as well, but I don't know anything about such newer parts of newlib and/or the kernel.


Edited by joerg on 2023/2/21 7:50:08
Re: x5000 benchmarks / speed up
Amigans Defender
All Exec functions are optimized per CPU, and all the newlib code is in Exec, so you have to use the Exec functions since they are always optimized for the current machine. Maybe they aren't for the x5000, but only the kernel developers know.

i'm really tired...