i did some benchmarking before and after installing latest RadeonHD drivers. but i was using SysMon ragemem and didn't fine the log, but i have the Mips score.
i also was using RadeonHD 0.55 to start with.
before update: ati drv 0.55
GfxBench2D score: 5903 ragemem: 2291 mips
after installed ati driver 1.2
GfxBench2D score: 5890 ragemem: 2308
after installed ati driver 2.4
GfxBench2D score: 5880 ragemem: 2308
Sam460ex 2GB 120Gb SSD&1Tb HD7750 Envy24HT A-Eon Drv 2.10+Warp3D New Uboot Apollo v4 Standalone
Does anyone know why the RAM write speeds on X1000 are so appalling in ragemem? The x1000 RAM Write speeds are the slowest of all the NG Amigas.
Actually, only the "tricky" write speed is "appalling," the others are very good.
I'm guessing that "tricky" means that it's using the dcba cache instruction to prepare a cacheline for a write without wasting bandwidth reading it from RAM. While this boosts performance on 32-bit CPUs, it's an illegal instruction on 64-bit (G5) CPUs like the PA6T in the A1-X1000. Hence, that instruction triggers an exception handler, and that slows things right down.
Interesting. Thanks for the explanation. Do existing programs make that call in real day to day usage?
AFAIK, the vast majority of programs don't use it. A developer has to be pretty concerned with performance to even consider using such lowlevel techniques. Dcba has to be used carefully, because it operates on an entire cacheline. That means handling special cases so you don't accidentally wipe memory before/after the data that you're working on. Besides, the "dcbz" instruction is similar enough, and is safe to use on G5 CPUs (although you don't get all the speed benefits because of the larger cacheline size).
It's possible that the kernel may use it on hardware that supports it. IIRC, some functions such as the memcopy ones are optimised on a per-CPU basis.
Went through the thread but did not spot x1000 results. Someone with better eyes.... please?
- Kimmo --------------------------PowerPC-Advantage------------------------ "PowerPC Operating Systems can use a microkernel architecture with all it�s advantages yet without the cost of slow context switches." - N. Blachford
Edited by KimmoK on 2015/6/9 8:22:12 Edited by KimmoK on 2015/6/9 8:23:13
- Kimmo --------------------------PowerPC-Advantage------------------------ "PowerPC Operating Systems can use a microkernel architecture with all it�s advantages yet without the cost of slow context switches." - N. Blachford
Once can compare different results from different apps all day long.
I was referring to post #122 X1000 result for RAGEMEM v0.37 -Video bus :
---> VIDEO BUS <--- READ: 66 MB/Sec WRITE: 160 MB/Sec
compare against
post #60 CPU: AMCC PPC460EX 1.2 @ 1166 Mhz
---> VIDEO BUS <--- READ: 72 MB/Sec WRITE: 261 MB/Sec
Then think about price difference.
surely bus/bandwidth is better on AMCC for this value - so that needs investigating to improve on future systems.
If we can collate all results into CSV would be great, then a small code can test a games usage of resources and offer recommended system / card requirements.
I wonder if ragemem is (hand) optimized for PA6T? To my understanding we do not yet have gcc that can do PA6T optimizations.
- Kimmo --------------------------PowerPC-Advantage------------------------ "PowerPC Operating Systems can use a microkernel architecture with all it�s advantages yet without the cost of slow context switches." - N. Blachford
Crisot gave an explanation (in French) on Amiga-NG:
"Sur SAM460, la RAM tourne à la vitesse du L2... Soit le L2 du 460EX est pas terrible, soit j'ai loupé un truc (mais quoi...)
(On Sam460, RAM goes as fast as L2 cache. Either L2 from the Sam460 is rubbish or I missed someting)
Sur X1000, c'est encore pire, l'écriture RAM est supérieure à l'écriture L2, ce qui est matériellement impossible, la RAM passant forcément par le L2... Je me demande si ma loop est pas trop courte pour bencher à ces vitesses là.
(It's even worse on X1000 : writing in RAM is faster than writing in L2, which in impossible since RAM goes through the L2 cache... I'm wondering if my Loop is too short to bench such speeds)
Pour le tricky super lent, c'est parceque mon code n'est pas adapté à la Cache Line du PA6T.
(Regarding the super slow tricky test, it's because my code is not adapted to the PA6T Cache Line)
Pour les MIPS. Je n'ai que des hypothèses mais pas de réponse:
(Regarding MIPS, I have only hypothesis but no answers)
-Déjà, mon bench n'est pas multi-threadé, il n'utilise qu'un seul core, donc il faut multiplier par 2 ce chiffre.
(First, my bench is not multithreaed, it uses only one core so we must multiply x2 this value)
-Dans tous les cas le PA6T@2.0 Ghz est donné constructeur pour 8800 MIPS, ça fait 3960 MIPS par core@1.8 Ghz.
(Anyway, PA6T@2.0Ghz should give, from the manufaturer, 8800MIPS, so 3960 MIPS by core@1.8Ghz.)
-Ma loop n'est pas du tout optimisée pour le PA6T, elle a été écrite avec les datasheet des 604e/G3/G4/460EX. Je pense que le PA6T ne dispatche pas du tout les instructions de la même manière, mais sans datasheet je peux rien écrire.
(My loop is not at all optimized for the PA6T, it was written with 604e/G3/G4/460EX datasheets. I suspect the PA6T to not dispatch instructions the same way than the others but without the datasheets, I cannot be sure nor enhance Ragemem[i]).
-Au dela des MIPS la rapidité des caches et de la ram ont une énorme influence sur les performances CPU. A titre d'exemple, lire un seul octet en mémoire sur un XE (240 mo/sec) équipé d'un G4@1.0Ghz (3000 MIPS), c'est en temps machine l'équivalent de 104 instructions perdues (!!!) pendant lesquelles le CPU attend. En clair, un G4 attend continuellement après le reste de la machine, si on pouvait convertir le temps machine perdu en taches ménagères, la pelouse serait toujours tondue et la maison propre.
([i]Beyond MIPS, caches and RAM speed make a huge difference in CPU performances. For example, reading only on byte in memory on the XE (240MB/s) with a G4@1Ghz (3000MIPS), is equivalent of 104 lost instructions (!!!) while the CPU is waiting. To simplfy : a G4 is always waiting for the the other parts of the comuter. If we could convert the lost time by the system in household tasks, the house would always be clean.)
Je pense pas que core à core (ahah) le PA6T soit beaucoup plus puissant que le G4, mais tout ce qui tourne autour (caches, bus, ram) lui laisse infiniment plus de marge de manoeuvre. Le G4 ira beaucoup moins vite, mais paradoxalement glandera beaucoup plus.
(I don't think that the PA6T core is much more powerful than the G4 core, but everything that makes the X1000 (caches, FSB, RAM) gives him more latitude. G4 will be way slower but, paroxally, will wait a lot more for the rest of the architecture.)."
well now, very interesting seeing my much stronger Gigabyte HD7950, 3GB Ram performing worse here compared to my older single slot low profile HD7750 above??
also noticed that SuperTuxkart ran very smooth with my HD7750 and is running very slow & choppy with my HD7950?? more testing later with my 7950
Surely bus/bandwidth is better on AMCC for this value - so that needs investigating to improve on future systems.
The newer AMMC PowerPC SOCs only scale to 1.3Ghz.
The LSI Axxia PowerPC SOCs look much more interesting on paper. They scale as high as 1.8Ghz on a 4 core SOC and the 476fp core is rated at 2.7 DMIPs per Mhz. It also has bigger L2 cache than the 465 cores as well as large L3 cache. It also has a SIMD unit although I'm not sure if this is Altivec/VMX* although it would make sense for IBM to have designed a a new vector unit for the 476fp. LSI don't specify what version of PCIe is used on there product pages but I did see a product breif that sugested that it was PCIe 3.0
A system based on that faster Axxia SOCs should be able to trounce the X1000 in every aspect.
Both the Axxia ACP3500 and ACP3400 have a 1295 pin/ball count so in theory should be able to use the same board for both. That would give a good range of speeds and cores. The 3500 can have 2 cores at 1.1Ghz, 2 cores at 1.26Ghz, 4 cores at 1.26Ghz and 6 cores at 1.26Ghz while the 3400 can have 2 cores at 1.6Ghz or 4 cores at 1.8Ghhz