Re: x5000 benchmarks / speed up
Quite a regular
Quite a regular


See User information
@kas1e


See here for information about the uboot memory controller options.

Uboot should also support some low-level DDR tweaking. It's supposedly enabled by creating a new env variable called "ddr_interactive" and setting it to any value.

After reset, it should bring you to the fsl ddr debugger. The debugger allows you to change DDR parameters and print the contents of the SPD.

But unfortunately this doesn't work for me. The env variable is simply ignored on our uboot.

Go to top
Re: x5000 benchmarks / speed up
Home away from home
Home away from home


See User information
@kas1e

Quote:
And another issue I found that I can't understand for now: the Video Bus "Read" speed is too slow. A few years ago, when I ran the ragemem bench, I got 54 MB/s, that's for sure. Today I get just 23. I tried putting back my old RadeonHD, and it is still 30. Nowhere do I get 54. I tried 3 other video cards, HD and RX ones, and all of them are 50% slower in Read than they should be. I tried rolling back drivers etc., but so far found nothing that makes a difference. Any ideas? Maybe some uboot settings or so?


Maybe they recently modified something in the kernel and now there is a regression?
I remember that at the time Max worked on increasing the speed of the video bus (for the 440).

Go to top
Re: x5000 benchmarks / speed up
Quite a regular
Quite a regular


See User information
@Skateman

You are as fast as your ram

Brand  : Crucial (DR) | Kingston (SR)          | Gain
READ32 :  665         |  670 MB/Sec            | <+1%
READ64 : 1195         | 1218 MB/Sec            | +1.9%
WRITE32: 1429         | 1585 MB/Sec            | +10.9%
WRITE64: 1433         | 1589 MB/Sec            | +10.9%
WRITE  : 2318         | 2320 MB/Sec (Tricky)   | <+1%

Sigh, 11% is worth the effort. Now I have to buy new ram

Go to top
Re: x5000 benchmarks / speed up
Home away from home
Home away from home


See User information
@geennaam
Quote:

See here for information about the uboot memory controller options.

Uboot should also support some low level DDR tweaking. It's supposedly enabled by creating a new env called "ddr_interactive" and set it to any value.

After reset, it should bring you to the fsl ddr debugger. The debugger allows you to change DDR parameters and print the contents of the SPD.

But unfortunately, this doesn't work for me. The env variable is simply ignored on our uboot.


I just grepped the uboot.img file for any strings containing "ddr_" and found only these: fsl_ddr_set_memctl_regs, fsl_ddr_get_, __fsl_ddr_set_lawbar, DDR_SDRAM_CFG[RD_EN]. But no "interactive" at all.

So it seems there is just no ddr_interactive built in.

As for reading SPD values and all the memory stuff: I think it is even possible to change values directly from OS4 (it would be interesting to see whether we can change 1500 for DDR3 to something like 1600). Do you have trm_cyrus_1-1-1_AEON.pdf? There is an interesting section "CPLD" about

Join us to improve dopus5!
AmigaOS4 on youtube
Go to top
Re: x5000 benchmarks / speed up
Quite a regular
Quite a regular


See User information
@kas1e

You could change the register values of the DDR controllers directly, but changing DDR parameters on the fly is a bad idea and will most likely result in a freeze of the system.

I don't see anything useful in the CPLD section of that document, but I've found the I2C controller and SPD addresses.

At least we could dump the contents of the SPD on each DIMM (they must be equal). It would also be possible to change the JEDEC timing (if the SPD EEPROM is not write protected). The values would be applied on the next boot. But one mistake and the machine will not boot anymore.
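To give an idea of what such an SPD dump would contain, here is a minimal sketch that decodes a few fields of a DDR3 SPD image. The byte offsets follow the JEDEC DDR3 SPD layout as I understand it (byte 2 = memory type, byte 7 = ranks and device width, bytes 10/11 = medium timebase, byte 12 = tCKmin, byte 16 = tAAmin). The sample image is made up for illustration and is not a dump from a real module; the function names are hypothetical. Fine timebase corrections and the supported-CL bitmask are ignored.

```python
DDR3_TYPE = 0x0B  # SPD byte 2 value that identifies DDR3 SDRAM


def decode_ddr3_spd(spd):
    """Decode a handful of DDR3 SPD fields from a raw byte image."""
    if spd[2] != DDR3_TYPE:
        raise ValueError("not a DDR3 SPD image")
    mtb_ns = spd[10] / spd[11]           # medium timebase in ns (dividend / divisor)
    return {
        "ranks": ((spd[7] >> 3) & 0x7) + 1,      # byte 7, bits 5..3
        "device_width_bits": 4 << (spd[7] & 0x7),  # byte 7, bits 2..0: 4/8/16/32
        "tCK_min_ns": spd[12] * mtb_ns,          # minimum clock period
        "tAA_min_ns": spd[16] * mtb_ns,          # minimum CAS latency time
        "tAA_min_mtb": spd[16],
    }


def min_cas_latency(taa_mtb, operating_tck_mtb):
    """Smallest CL (in clocks) that still satisfies tAAmin at the given clock period."""
    return -(-taa_mtb // operating_tck_mtb)  # integer ceiling division


def make_sample_spd(ranks, taa_mtb):
    """Build a made-up SPD image (illustration only, not a real module dump)."""
    spd = bytearray(128)
    spd[2] = DDR3_TYPE
    spd[7] = ((ranks - 1) << 3) | 0x1    # x8 devices
    spd[10], spd[11] = 1, 8              # medium timebase = 1/8 ns
    spd[12] = 12                         # tCKmin = 1.5 ns
    spd[16] = taa_mtb
    return bytes(spd)


# 1.5 ns clock period (666 MHz, i.e. 1333 MT/s) expressed in 1/8 ns units
TCK_1333_MTB = 12

info = decode_ddr3_spd(make_sample_spd(ranks=2, taa_mtb=105))
print(info, "CL =", min_cas_latency(info["tAA_min_mtb"], TCK_1333_MTB))
```

With a tAAmin of 13.125 ns the sketch lands on CL9 at a 1.5 ns clock period, and with 12 ns on CL8, so a decoder like this would make it easy to compare what two DIMMs ask the firmware to program.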


Go to top
Re: x5000 benchmarks / speed up
Home away from home
Home away from home


See User information
@geennaam
Quote:

The values would be applied on the next boot. But a mistake and machine will not boot anymore.


Probably not worth trying then :)

Join us to improve dopus5!
AmigaOS4 on youtube
Go to top
Re: x5000 benchmarks / speed up
Quite a regular
Quite a regular


See User information
@kas1e

You can always revive the DIMM offline with the right tools. The SPD is just an I2C EEPROM.

Go to top
Re: x5000 benchmarks / speed up
Quite a regular
Quite a regular


See User information
@kas1e

This is the fastest single rank DDR3 kit that I can find: Kingston KF318C10BBK2/8. (also available in red and blue ).

So I'll order a set in the next few days and see if it makes a difference.

Go to top
Re: x5000 benchmarks / speed up
Home away from home
Home away from home


See User information
@geennaam
Cool! Keep us posted please. If they are faster, I will order those too.

Join us to improve dopus5!
AmigaOS4 on youtube
Go to top
Re: x5000 benchmarks / speed up
Quite a regular
Quite a regular


See User information
@kas1e

I've dug a bit deeper into the subject and it all makes sense to me now. So here is what I know so far. (But you can also skip directly to the end for the performance values )

The technology:
The first implementation of dual channel memory on PCs merged two 64bit DDR3 buses into one 128bit bus (as was my understanding of this technology until today). This doubles the bandwidth of the memory bus. The timing of both DIMMs had to be closely matched for the 128bit bus to work properly. But as it turned out, most consumer applications saw little benefit from a 128bit DDR3 memory bus. So, the 128bit bus idea was dropped and replaced with technology that divides (interleaves) the memory accesses over the two channels (e.g., the even accesses on channel 1 and the odd accesses on channel 2). Since both channels operate concurrently, and an internal access must wait for the (by comparison) slow DDR3 interface anyway, the effective memory bandwidth is doubled again.
The difference with the former 128bit wide dual channel bus is that the bus remains 64bit for consumer applications but at twice the perceived speed. As a result, a lot of those applications saw a considerable increase in performance. Another advantage is that the DIMMs operate independently of each other. So, timing doesn't have to be closely matched anymore. Only the memory layout needs to be the same. It is still preferable that both DIMMs have similar timing, but this is not required. The slowest timing determines the timing for both DIMMs.

Now back to the X5000. The P5020 supports two tricks to increase the effective bandwidth of DDR3 memory accesses. The first trick is controller interleaving and the second one is rank interleaving.

Controller interleaving:
P5020 controller interleaving supports 4 modes (cache-line, page, bank and super-bank). The controller cache-line interleaving mode is basically the same as modern dual channel technology on PCs. DDR3 memory accesses with the size of a cache line are divided (interleaved) over the two available memory controllers. This effectively doubles the memory bandwidth for an application on the X5000 too. The other modes are like rank interleaving and are meant to reduce latencies. But cache-line interleaving will provide the most performance benefit by far.
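As a rough illustration of cache-line interleaving, the sketch below maps physical addresses to one of two controllers by alternating on cache-line boundaries. The 64-byte line size and the simple even/odd mapping are assumptions made for the example; the real P5020 address decoding may differ, and the function names are hypothetical.

```python
CACHE_LINE_BYTES = 64   # assumption: 64-byte cache lines
NUM_CONTROLLERS = 2     # two DDR controllers, as on the P5020


def controller_for(address):
    """Return the controller index that would serve this address."""
    return (address // CACHE_LINE_BYTES) % NUM_CONTROLLERS


def split_access(start, length):
    """Count how many bytes of a linear access land on each controller."""
    per_controller = {c: 0 for c in range(NUM_CONTROLLERS)}
    address = start
    end = start + length
    while address < end:
        # Walk the access one cache line (or partial line) at a time.
        line_end = (address // CACHE_LINE_BYTES + 1) * CACHE_LINE_BYTES
        chunk = min(line_end, end) - address
        per_controller[controller_for(address)] += chunk
        address += chunk
    return per_controller


print(split_access(0, 4096))
```

A linear 4 KiB access ends up split evenly over both controllers in this model, which is why both DDR3 channels can work on the same stream at the same time.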

Rank (chip select) interleaving:
First a bit of background on the internals of DDR3 memory. DDR3 DRAM is organized in multiple two-dimensional arrays of rows and columns called banks. A row (also called a page) in a DDR3 DIMM bank has a total length of 8 Kbyte (for 8x 8bit DDR3 chips) and must be opened, by means of an activation command, before the individual bytes in the 8 Kbyte page can be accessed with column addressing. So, the same page in the same bank is opened at once for all 8 8bit DDR3 chips in the same rank. When access (read/write) to the page columns is completed, the complete row/page must be closed with a pre-charge command before you can open the next row/page with a new activation command.
Rank interleaving reorganizes the memory addressing in such a way that the consecutive memory address is not on the next row in the same bank of the rank but on the same row/page of the same bank on the next rank (chip select). One activation command opens the same row in the same bank across all chip selects. In case of dual rank memory with 2 chip selects and 8x 8bit chips on each chip select, the effective row/page size doubles from 8 Kbyte to 16 Kbyte. The benefit of this addressing method is that it reduces the number of row/page activation and closing commands and therefore reduces the average read/write latency. This results in lower overall access time and therefore an increase of effective bandwidth. (Note that the DIMM is still 64bit. So, while the activation and closing commands can be sent simultaneously to both ranks, still only 64 bits of data can be accessed at the same time.)
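A toy model of the effect described above: for a purely sequential transfer, the number of activate/pre-charge pairs scales with the transfer size divided by the effective page size. This is only a sketch; it ignores bank parallelism, refresh and any controller-side reordering, and the function name is hypothetical.

```python
PAGE_BYTES_PER_RANK = 8 * 1024  # 8 Kbyte row/page for 8x 8bit DDR3 chips


def row_activations(transfer_bytes, interleaved_ranks):
    """Activate/pre-charge pairs needed for a sequential transfer (toy model)."""
    effective_page = PAGE_BYTES_PER_RANK * interleaved_ranks
    return -(-transfer_bytes // effective_page)  # integer ceiling division


one_mib = 1024 * 1024
print("single rank:", row_activations(one_mib, 1))
print("dual rank, CS0+CS1 interleaved:", row_activations(one_mib, 2))
```

In this model the dual rank configuration needs half as many activations for the same sequential transfer, which is where the latency saving comes from.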

The impact of the memory controller and DIMM itself on overall memory performance of the X5000 can be ranked from high to low in the following order:
1. Cache line controller interleaving
2. Rank interleaving
3. Faster timing of the DIMM itself. (Lowering latencies by one or two clocks has less impact than omitting commands with latencies of >10 clocks)
Normally, the clock speed of the DDR3 DIMM would matter as well for the true latency of a DIMM, but the X5000/20 is limited to just 666MHz/1333MT/s.

The DIMMs:
It turns out that the CPU-Z screenshots in my previous post only tell half of the story. In the early days of DDR3 DIMMs, DDR3 chips with double the density were often more than twice as expensive as two DDR3 chips of half the density (for the same annual quantity of DIMMs). So, it was more economical to fit 4GB DIMMs with two ranks of 2GB each. When the price of DDR3 silicon went down, the package price became dominant. So, manufacturers started to fit their DIMMs with single ranks of 4GB because that was now the most economical configuration. Unfortunately, this often happened under the same SKU.
So, the CPU-Z screenshots are not wrong. They simply do not apply to the DIMM of kas1e (and mine) anymore. I verified this by removing the heat spreader on my 4GB Corsair DIMM and it indeed contains 8x 8bit DDR chips (only one side is fitted). So, a single rank. And that is why uboot produced the error message on rank interleaving: you cannot interleave ranks with just a single rank available on your DIMM. The DIMM of Skateman is an 8GB DIMM. This DIMM has 8x 8bit DDR3 chips on both sides of the DIMM (16 chips total). This 128bit in total is divided over two ranks. That's why rank interleaving works for Skateman.
Since the DIMMs of both kas1e and Skateman run on similar SPD JEDEC timing, we can already notice that rank interleaving results in about 11% higher write speeds in ragemem.
The first post in this thread shows that controller interleaving results in about 65% higher write speeds in ragemem.
The result of faster DIMM timing is a bit trickier to predict because we currently cannot control the latencies like in a PC BIOS/UEFI. Uboot SPL simply takes the values from the SPD EEPROM on a DIMM (often JEDEC defined CL9-9-9-24 at 1333MT/s) and applies them to the memory controller. And these so-called JEDEC timings are more relaxed compared to the maximum capability of the DIMM. But fortunately, DIMMs like the Kingston Fury Beast 1866MT modules come with more optimized timing values in their SPD EEPROM (CL8-9-8-24 instead of JEDEC CL9-9-9-24 at 1333MT/s). So, the modules are not necessarily faster than equal DIMMs from other manufacturers, but uboot will configure the P5020 memory controller with faster timing from the SPD EEPROM.
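To put the CL numbers into perspective, CAS latency in clocks can be converted to nanoseconds by dividing by the memory clock, which is half the transfer rate. The short sketch below does that for CL9, CL8 and a hypothetical CL6 at 1333MT/s; the function name is made up for the example.

```python
def cas_latency_ns(cl_clocks, transfer_rate_mts):
    """Convert a CAS latency in clocks to nanoseconds."""
    clock_mhz = transfer_rate_mts / 2.0   # DDR transfers twice per clock
    return cl_clocks * 1000.0 / clock_mhz


for cl in (9, 8, 6):
    print("CL%d @ 1333MT/s: %.2f ns" % (cl, cas_latency_ns(cl, 1333)))
```

Going from CL9 to CL8 saves about 1.5 ns per column access at this clock, which fits with the modest gain one would expect from timing alone.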

Test result:
For the sake of Amiga science, I've bought both the Fury Beast single rank (2x4GB) and dual rank (2x8GB) kits.

Memory modules used in this test:
- Corsair Vengeance LP 2x4GB DDR3-1600 Single Rank (CML8GX3M2A1600C9) -> CL-9-9-9-24@1333MT/s
- Kingston Fury Beast 2x4GB DDR3-1866 Single Rank (KF318C10BRK2/8) -> CL-8-9-8-24@1333MT/s
- Kingston Fury Beast 2x8GB DDR3-1866 Dual Rank (KF318C10BBK2/16) -> CL-8-9-8-24@1333MT/s

Here are the ragemem results with their Uboot initialization output:

Corsair Vengeance LP:
DRAM:  Initializing....using SPD
Detected UDIMM CML8GX3M2A1600C9
Detected UDIMM CML8GX3M2A1600C9
Not enough bank(chip-select) for CS0+CS1 on controller 0, interleaving disabled!
Not enough bank(chip-select) for CS0+CS1 on controller 1, interleaving disabled!
6 GiB left unmapped
8 GiB (DDR3, 64-bit, CL=9, ECC off)
DDR Controller Interleaving Mode: cache line

Read32:   665 MB/Sec
Read64:  1195 MB/Sec
Write32: 1429 MB/Sec
Write64: 1433 MB/Sec


Kingston Fury Beast 2x4GB (Single rank):
DRAM:  Initializing....using SPD
Detected UDIMM KF1866C10D3/4G
Detected UDIMM KF1866C10D3/4G
Not enough bank(chip-select) for CS0+CS1 on controller 0, interleaving disabled!
Not enough bank(chip-select) for CS0+CS1 on controller 1, interleaving disabled!
6 GiB left unmapped
8 GiB (DDR3, 64-bit, CL=8, ECC off)
DDR Controller Interleaving Mode: cache line

Read32:   682 MB/Sec
Read64:  1225 MB/Sec
Write32: 1479 MB/Sec
Write64: 1483 MB/Sec

Kingston Fury Beast 2x8GB (Dual rank):
DRAM:  Initializing....using SPD
Detected UDIMM KF1866C10D3/8G
Detected UDIMM KF1866C10D3/8G
14 GiB left unmapped
16 GiB (DDR3, 64-bit, CL=8, ECC off)
DDR Controller Interleaving Mode: cache line
DDR Chip-Select Interleaving Mode: CS0+CS1

Read32:   685 MB/Sec
Read64:  1261 MB/Sec
Write32: 1638 MB/Sec
Write64: 1644 MB/Sec


---------: --- CL9 SR --|---- CL8 SR ----|---- CL8 DR ----
Read32   :  665 (base)  |  682 ( +2.6%)  |  685 ( +3.0%)
Read64   : 1195 (base)  | 1225 ( +2.5%)  | 1261 ( +5.5%)
Write32  : 1429 (base)  | 1479 ( +3.5%)  | 1638 (+14.6%)
Write64  : 1433 (base)  | 1483 ( +3.5%)  | 1644 (+14.7%)
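For anyone who wants to compare their own modules the same way, the percentages in the table can be reproduced from the raw ragemem numbers with a few lines. This is just a sketch using the values measured above; the dictionary layout and function name are made up for the example.

```python
# ragemem results in MB/Sec: (CL9 single rank, CL8 single rank, CL8 dual rank)
RESULTS = {
    "Read32":  (665, 682, 685),
    "Read64":  (1195, 1225, 1261),
    "Write32": (1429, 1479, 1638),
    "Write64": (1433, 1483, 1644),
}


def gain_percent(base, value):
    """Relative gain of value over base, in percent."""
    return (value - base) / base * 100.0


for test, (base, cl8_sr, cl8_dr) in RESULTS.items():
    print("%-8s CL8 SR %+5.1f%%   CL8 DR %+5.1f%%"
          % (test, gain_percent(base, cl8_sr), gain_percent(base, cl8_dr)))
```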

Conclusion:
As predicted, the controller interleaving mode gives the biggest boost in performance (Write: ~+65%; see post #1). Rank interleaving comes second (Write: ~+10.9%) and improved timing (CL9 -> CL8; Write +3.5%) comes third. But I am sure that I could have pushed this module further if our uboot allowed manual editing. At CL6 timing we could see a boost in performance similar to that of rank interleaving alone.


Edited by geennaam on 2022/2/3 20:02:34
Go to top
Re: x5000 benchmarks / speed up
Home away from home
Home away from home


See User information
@geennaam
At least I know which one to buy now :)

But are you sure about "- Kingston Fury Beast 2x16GB DDR3-1833 Dual Rank (KF318C10BBK2/16) -> CL-8-9-8-24@1333MT/"? Isn't there a mistake in 2x16GB, maybe it is 2x8GB (at least that is what uboot states)?

Join us to improve dopus5!
AmigaOS4 on youtube
Go to top
Re: x5000 benchmarks / speed up
Quite a regular
Quite a regular


See User information
@kas1e

You are right, it's 2x8GB =16GB. I've corrected it.

Go to top
Re: x5000 benchmarks / speed up
Not too shy to talk
Not too shy to talk


See User information
@geennaam

wow!

That's what I call a very detailed and complete post!

As expected to be honest.....

thanks!

AmigaOne X5000 -> 2GHz / 16GB RAM / Radeon RX 550 / ATI X1950 / M-Audio 5.1 -> AmigaOS 4.1 FE / Linux / MorphOS
Amiga 1200 -> Recapped / PiStorm CM4 / SD HDD / WifiPi connected to the NET
Vampire V4SE TrioBoot
RPI4 AmiKit XE
Go to top
Re: x5000 benchmarks / speed up
Home away from home
Home away from home


See User information
@geennaam
Quote:

Uboot spl simply takes the values from the SPD eeprom (often JEDEC defined CL9-9-9-24 at 1333MT/s) on a DIMM and applies them to the memory controller. And these so called JEDEC timings are more relaxed compared to the maximum capability of the DIMM.
....

But I am sure that I could have pushed this module further if our uboot would allow for manual editing. At CL6 timing we could see a similar boost in performance as for rank interleaving alone.


I think we could do it from the OS side, but then it would need a soft reboot, which doesn't work on modern video cards :)

Join us to improve dopus5!
AmigaOS4 on youtube
Go to top
Re: x5000 benchmarks / speed up
Just popping in
Just popping in


See User information
@geennaam
Thanks, extremely valuable info. I got significant performance gains for barely any money at all (in Amiga terms).

Before (HX316C10FR/4)
Quote:
READ32: 656 MB/Sec
READ64: 1225 MB/Sec
WRITE32: 860 MB/Sec
WRITE64: 864 MB/Sec


After (KF318C10BBK2/8):
Quote:
READ32: 698 MB/Sec
READ64: 1270 MB/Sec
WRITE32: 1537 MB/Sec
WRITE64: 1542 MB/Sec




Go to top
Re: x5000 benchmarks / speed up
Just popping in
Just popping in


See User information
@geennaam

Thanks for the informative post. I've never paid that much attention to memory speed, so it was enlightening to read about some of the factors involved. It inspired me to take a closer look at the memory in my X1000.

I wonder how much of what you wrote applies to the X1000? The technical manual indicates that the PA6T's dual DRAM controllers can interleave modules if you have pairs of them installed. It also mentions that the maximum memory speed supported by the controllers is 800 MHz. I've no idea if the controllers (or CFE) support rank interleaving.

I tried running RageMem, and got results similar to those in this thread. I was surprised to see that the X1000 is much faster at memory access than the X5000, even if the latter is using interleaved ranks and modules. In fact, the X1000 is roughly 50% faster at writing, and three to four times as fast at reading. And that's despite the fact that the X5000 is significantly faster at L2 cache access than the X1000.

My X1000 just has the memory that came in it from AmigaKit. It's a single Kingston HyperX KHX8500D2/2G 2GB DDR2 module that was fairly high-spec for the time. It's rated at 800 MHz (5-5-5-18) at the standard 1.8 V supply that the X1000 uses. It has 16 chips on it, so it would seem to be dual rank, though the spec sheet doesn't specifically say that.

Notably, I just have one module, so I'm not taking advantage of controller interleaving. That suggests that it would be even faster if I got a second one, and in fact most of the RageMem results in the link I mentioned earlier are somewhat faster at writing than my results.

Now that OS4 can make use of memory beyond 2GB, it might be worthwhile to get a second module, and make the memory faster as well. Unfortunately, the Kingston module is no longer available. I'll either have to get an A-Tech equivalent (probably a pair of them, to keep both modules the same), or get a used one from eBay.

Go to top
Re: x5000 benchmarks / speed up
Quite a regular
Quite a regular


See User information
@msteed

First of all, the ragemem results are bogus for both the X1000 and X5000. As a rule of thumb for calculating the maximum throughput, you can use 80% of the maximum bus speed.
X1000: 0.8 x 1600 x 8 = 10240 MByte/s (20480 MByte/s with controller interleaving)
X5000: 0.8 x 1333 x 8 = 8531 MByte/s (17062 MByte/s with controller interleaving)
So the ragemem results are about 10% of the expected throughput value.
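The rule of thumb above is easy to put into a small helper. The sketch below uses the same 80% efficiency factor and 8-byte (64bit) bus width, and compares the result with the best Write64 value measured earlier in this thread (1644 MB/Sec); the function name and default parameters are made up for the example.

```python
def expected_throughput_mb_s(transfer_rate_mts, controllers=1,
                             efficiency=0.8, bus_bytes=8):
    """Rule-of-thumb throughput: efficiency x MT/s x bus width x controllers."""
    return efficiency * transfer_rate_mts * bus_bytes * controllers


x1000 = expected_throughput_mb_s(1600)
x5000 = expected_throughput_mb_s(1333)
x5000_interleaved = expected_throughput_mb_s(1333, controllers=2)

print("X1000: %.0f MByte/s" % x1000)
print("X5000: %.0f MByte/s (%.0f with controller interleaving)"
      % (x5000, x5000_interleaved))

measured_write64 = 1644  # MB/Sec, dual rank Fury Beast result from this thread
print("ragemem reaches %.1f%% of the expected value"
      % (measured_write64 / x5000_interleaved * 100.0))
```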

I don't have enough knowledge of the X1000, X5000 and ragemem to pinpoint where the difference comes from. But I can think of (a combination of) several reasons that can explain the difference.

1) Memory is simply faster. 1600MT/s versus 1333MT/s can explain 20% at the same timing. But the CL5-5-5 (at 1600MT/s) timing is noticeably faster than the standard CL9-9-9 for DDR3. So there is another 10%-15%.

2) The X5000 internal bus seems to run at 800MHz and the memory at 666MHz. So there are wait states involved when crossing clock domains. The same goes for transfers to/from cache. The memory controller might contain FIFOs to speed up throughput with DMA transfers. But under software control this would benefit only write access. This might explain why writes are faster than reads. (I don't know anything about the bus and cache speed of the X1000.)

3) Ragemem will most likely not use DMA. So it might be a software loop. Maybe the X1000 has a lower branch delay and therefore the memory controller gets its data faster.

4) DDR3 has a fixed burst length of 8 words. DDR2 has a programmable burst length of 4 or 8. If the DDR2 burst length is programmed to 4 and the software loop can only fill one word in the burst, then fewer cycles are wasted until the next memory access. (This looks like the main reason why we only get 10% of the expected throughput in ragemem.)

5) The X1000 memory controller is simply more efficient.


It would be interesting to see the results of the lucky few with an X5000/40. The P5040 memory controller is also capable of running at 1600MT/s.

Go to top
Re: x5000 benchmarks / speed up
Home away from home
Home away from home


See User information
Ok, got my new memory, that is what i have now on the serial:


Quote:

SPI: ready
DRAM: Initializing....using SPD
Detected UDIMM KF1866C10D3/8G
Detected UDIMM KF1866C10D3/8G
14 GiB left unmapped
16 GiB (DDR3, 64-bit, CL=8, ECC off)
DDR Controller Interleaving Mode: cache line
DDR Chip-Select Interleaving Mode: CS0+CS1



And that is what I have now in ragemem:

Quote:

---> L1 <---
READ32: 7535 MB/Sec
READ64: 15046 MB/Sec
WRITE32: 7536 MB/Sec
WRITE64: 15050 MB/Sec

---> L2 <---
READ32: 4287 MB/Sec
READ64: 7723 MB/Sec
WRITE32: 5019 MB/Sec
WRITE64: 8831 MB/Sec

---> RAM <---
READ32: 679 MB/Sec
READ64: 1251 MB/Sec
WRITE32: 1635 MB/Sec
WRITE64: 1641 MB/Sec
WRITE: 2318 MB/Sec (Tricky)

---> VIDEO BUS <---
READ: 26 MB/Sec
WRITE: 541 MB/Sec


This time, though, I didn't see a noticeable difference in games/apps, compared with the one I got when I doubled the RAM speed by going from one module to two :) But why should it, of course:

Write32: 1551 vs 1635
Write64: 1552 vs 1641

It's just about +5%, so it probably adds a little here and there, but nothing radical.

Join us to improve dopus5!
AmigaOS4 on youtube
Go to top
Re: x5000 benchmarks / speed up
Just can't stay away
Just can't stay away


See User information
I don't know how ragemem works, but if it's using its own code for reading/writing you will get very slow results, especially if it doesn't use DCBT for memory reads and DCBA (DCBZ as a replacement on CPUs without DCBA) for writes.

To get results closer to the maximum speed supported by your hardware, implement a small benchmark tool using the OS functions (CopyMem[Quick](), SetMem(), etc.). The OS4 kernel should include different code optimized for each supported CPU. IIRC I implemented the 60x, 750 and 440 parts; on 74xx CPUs it's probably using AltiVec, and very likely special code for the 460, PA6T and 5020 CPUs was implemented as well. On some systems even DMA might be used instead of the CPU for large copies.

You could compile a portable benchmark tool using C library functions (memcpy(), memset(), etc.) as well. newlib.library should use the optimized kernel functions for them, but you'll get slightly lower results because of the call overhead.

Go to top
Re: x5000 benchmarks / speed up
Not too shy to talk
Not too shy to talk


See User information
@geennaam

Quote:
It would be interesting to see the results of the lucky few with a X5000/40. The P5040 memory controller is also capable of running at 1600MT/s.


Unfortunately, it is even slower than on the X5020. The first reason is that RageMem probably displays abnormal results, especially for newer configurations (it was created in the AmigaOne XE era). The second is that I have budget memory that is not very efficient: CORSAIR ValueSelect CMV8GX3M2A1600C11, 8 GB, 1600 MHz, CL11

Resized Image


Go to top
