When I ran the 'whole disk' test on the Linux side with the 'Disks' tool, it gave average read speeds of 990 MB/s for the Kingston, 180 MB/s for the OWC and 50 MB/s for the Hitachi (the sample sizes were the same as the buffers above).
It surprises me how well the old mechanical Hitachi performed with SCSISpeed compared to the SSDs. In real life it is dog-slow, e.g. when loading Linux programs!
I don't understand why Hitachi SATA is faster than OWC SATA on AmigaOS, but it's the other way round on Linux. The absolute results will of course always be different between AmigaOS and Linux, but the relative differences shouldn't be.
Quote:
I don't understand why Hitachi SATA is faster than OWC SATA on AmigaOS, but it's the other way round on Linux. The absolute results will of course always be different between AmigaOS and Linux, but the relative differences shouldn't be.
Can the basic formatting somehow affect SCSISpeed? OWC and Hitachi both have RDB (I think the Hitachi was formatted with MorphOS), but Kingston has GPT (I wish I remembered the right combination ;) as it is used primarily for Linux.
@Rolar C:Copy with 16 MB buffer is faster than SCSISpeed with 10 and 100 MB ones!? That's even more strange than your SCSISpeed results (SATA HD faster than SATA SSD). SCSISpeed only uses reads, while C:Copy does reads + writes, i.e. has to transfer twice as much data, and with the slow CPU memory speed I wouldn't be surprised at all if RAM: is slower than a partition on a NVMe.
Edit: Or did you calculate the C:Copy speeds with 2 GB (1 GB read + 1 GB write)?
Quote:
C:Copy with 16 MB buffer is faster than SCSISpeed with 10 and 100 MB ones!? That's even more strange than your SCSISpeed results (SATA HD faster than SATA SSD). SCSISpeed only uses reads, while C:Copy does reads + writes, i.e. has to transfer twice as much data, and with the slow CPU memory speed I wouldn't be surprised at all if RAM: is slower than a partition on a NVMe.
Edit: Or did you calculate the C:Copy speeds with 2 GB (1 GB read + 1 GB write)?
No, it was only a one-way copy. And I didn't make a calculation error, as the 'Speed' option of Enhancer Copy gives practically the same result ;). It is also strange that I get a somewhat higher speed with the default 128K buffer:
3.Linux-NTFS:> copy-en Test.mp4 ram: speed
SPEED option - being quiet during copy...
Copied 963.61 MB in 11.411 seconds at 84.44 MB/s
Copy buffer size: 128 KB
Version: 54.5
I begin to suspect this is somehow related to the memory management of the X5040... I have to wait a few weeks before I have an opportunity to test this with an X5020. But I hope geennaam can soon tell what kind of results he gets with an NTFS partition.
Quote:
But I hope geennaam can soon tell what kind of results he gets with an NTFS partition.
I actually just did the performance analysis
My result is ~85 MB/s with NTFS. This is way slower than SFS2. The net write speed isn't all that bad at ~500 MB/s, but still only a third of SFS2's. The 1024 MB file is split into 20 ~50 MB transfers when I use BUF=65536. But for some reason, there's a small read (64 KB each) before and after each such ~50 MB write, with a total command delay of ~500 ms. So in total 20 * ~500 ms = ~10 seconds. That's why the write actions don't take 2 but ~12 seconds to complete. Hence the very slow performance.
As I understand it, AmigaOS4 includes a port of NTFS-3G which has to go through an additional library called FUSE/FileSysBox. The sources are available in the download section of Hyperion (login required). So if anyone has some spare time and motivation to find out where the 100 ms + 400 ms command delay for those two reads comes from, then feel free to do so.
EDIT: I suppose that Linux on your X5040 will use all 4 cores when performing that benchmark. Unlike SFS2, NTFS-3G gives a 100% CPU load. So this is also a factor to consider when comparing NTFS results.
Quote:
My result is ~85 MB/s with NTFS. This is way slower than SFS2. The net write speed isn't all that bad at ~500 MB/s, but still only a third of SFS2's.
Thank you for your analysis :)! Your results are anyway in line with those I got. My Kingston NV2 is not the fastest model and does not have DDR4 cache, so I think it's normal if your Samsung shows somewhat better values.
Quote:
As I understand it, AmigaOS4 includes a port of NTFS-3G which has to go through an additional library called FUSE/FileSysBox. The sources are available in the download section of Hyperion (login required). So if anyone has some spare time and motivation to find out where the 100ms+400ms command delay for those two reads comes from then feel free to do so
I wish I had the skills for that...;).
Quote:
EDIT: I suppose that Linux on your X5040 will use all 4 cores when performing that benchmark. Unlike SFS2, NTFS-3G gives a 100% CPU load. So this is also a factor to consider when comparing NTFS results.
That's true... By default Linux uses the in-kernel NTFS driver, but there is also a separate NTFS-3G driver. The advantage of the latter is that it allows using 'fstrim' on NTFS partitions, but you have to mount the partition via fstab for it to work. There is also the newer NTFS3 driver (in-kernel), but that is broken in PPC kernels.
The CPU load was very low (on average < 10% per core) when I tested an NTFS partition with the 'Disks' tool. It seems that the AmigaOS NTFS-3G driver indeed needs some further work...
EDIT: Have you tried the SCSITools on your X5000? So far the only reference results I have seen are from Sailor, and she has an X1000. It is said to have much better memory management, so the results cannot be directly compared with the X5000.
Quote:
So far the only reference results I have seen are from Sailor, and she has an X1000. It is said to have much better memory management, so the results cannot be directly compared with the X5000.
The CPU memory interface may be faster on the X1000, but the PCIe speed is only half that of the X5000, which should be the limiting factor for NVMe (and any other PCIe DMA).
@geennaam
Quote:
As I understand it, AmigaOS4 includes a port of NTFS-3G which has to go through an additional library called FUSE/FileSysBox. The sources are available in the download section of Hyperion (login required). So if anyone has some spare time and motivation to find out where the 100ms+400ms command delay for those two reads comes from then feel free to do so
I don't have a Hyperion account, but I checked the AROS + AmigaOS 3.x/m68k sources instead. The disk cache it uses employs checksums to make sure the memory used for the caches wasn't trashed by some other software, and it uses a very slow method for that. That can explain some slowdowns and high CPU usage, but just like in my diskcache.library caches it's skipped for "large" (>= 64 KB) transfers. (diskcache.library as used by SFS doesn't use checksums for the cache data but uses IMMU functions to write-protect its caches instead, which should be faster.) However, I only checked the device I/O and disk cache system and didn't find anything there which could explain the 64 KB reads; I didn't check the actual file system (NTFS-3G) code. The 64 KB reads are probably something in the NTFS file system itself; for example, any file system has to search for free space on the partition before it can append data to a file. In SFS the required data for that should nearly always be inside a block in its "buffers", i.e. it doesn't have to (re-)read it from the disk.
Edited by joerg on 2023/5/2 17:29:15
I looked around a bit in the filesysbox sources, and it's a bit suspicious that it has a 100 ms timer (FBX_TIMER_MICROS) and that in FbxHandleTimerEvent() it then may possibly call FbxFlushAll(). Suspicious because 500 and 400 are multiples of 100.
Quote:
So far the only reference results I have seen are from Sailor, and she has an X1000. It is said to have much better memory management, so the results cannot be directly compared with the X5000.
The CPU memory interface may be faster on the X1000, but the PCIe speed is only half that of the X5000, which should be the limiting factor for NVMe (and any other PCIe DMA).
It would be great to define a unified benchmark platform for these tests. For example: SCSISpeed + xyz buffers, Copy (AOS/Enhancer) + xyz buffers, or whatever. Can anybody (who understands the internals) define it, please?
AmigaOS3: Amiga 1200 AmigaOS4: Micro A1-C, AmigaOne XE, Pegasos II, Sam440ep, Sam440ep-flex, AmigaOne X1000 MorphOS: Efika 5200b, Pegasos I, Pegasos II, Powerbook, Mac Mini, iMac, Powermac Quad
Quote:
It would be great to define a unified benchmark platform for these tests.
Depends on what you want to test...
- Only nvme.device driver (and PCIe hardware) speed: SCSISpeed
- File system speed, only really usable to compare different file systems on the same hardware and driver: DiskSpeed
- Speed of the different C:Copy implementations, speed of RAM-Handler (more or less a speed test of the different IExec->CopyMem[Quick]() implementations for the different CPUs), speed of the file system (incl. the speed of any disk cache the file system may be using) + speed of the nvme driver and hardware combined: C:Copy
I.e. C:Copy is the most useless test for comparing only one of the 4-5 involved parts, for example comparing SATA with NVMe when everything else (hardware, file system, RAM-Handler, C:Copy version and number of C:Copy BUFFERS) is the same. But OTOH the DiskSpeed and C:Copy tests are closer to what you can get in real-life usage than a test of one specific part only, like SCSISpeed.
The default buffer/transfer sizes of C:Copy and especially SCSISpeed and DiskSpeed (only 512 bytes to 256 KB) are way too small. Using 16 MB for C:Copy is OK; for SCSISpeed and DiskSpeed you have to use 4 buffer sizes (I don't remember if you can use less, for example only 2 by using something like BUF3=0 BUF4=0), and for those I'd suggest sizes between 128 KB and 128 MB. Although Rolar got different results, except for the HD, using more than 16 MB as transfer/buffer size shouldn't make any noticeable difference with a tool like SCSISpeed, no matter on which hardware and device driver (SCSI, PATA, SATA, NVMe, ...).
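For example, using SCSISpeed's BUFn syntax (the DRIVE value here is just an illustration; adjust the device name and unit to your setup), a spread from 128 KB to 128 MB could look like:

```
scsispeed DRIVE=nvme.device:0 FAST BUF1=131072 BUF2=1048576 BUF3=16777216 BUF4=134217728
```

(131072 = 128 KB, 1048576 = 1 MB, 16777216 = 16 MB, 134217728 = 128 MB.)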
@Georg That could explain additional writes, for example updating the pointers to the 50 MB of data blocks appended to the file on each transfer and making sure they are on the disk before the 50 MB file data writes are done (CMD_UPDATE), but geennaam is getting 64 KB reads before and after each 50 MB write.
scsispeed DRIVE=nvme.device:0 FAST BUF1=16777216 BUF2=16777216 BUF3=16777216 BUF4=16777216
(Yes, I know, 4 times 16MB. But it was for diagnostics purposes)
I've tested the same command on my latest driver both with and without debug printouts to the debug shell.
The debug driver prints a string with the following information for each read action:
- The timing of the IO command
- The timing of the NVMe read handler
- The time between issued IO commands
- The length of the read size in bytes
So that's a debug terminal full of data
Then scsispeed reports 220 MB/s for all four tests. This low speed is understandable due to the additional time the debug strings take.
The debug printouts show that the reads are actually happening at close to maximum PCIe speed. This was expected, given the small overhead at this kind of transfer size. Also, the time between commands is about 8 us.
But when I run the same benchmark with the debug printouts disabled, the reported speed drops to just 50 MB/s.
So something is clearly off here. A quick scan through the scsispeed source shows that there's some timer trickery happening, according to the source because of a low timer resolution. But microsecond resolution is sufficient to time these kinds of transfers, so apparently this trickery is not working properly or some overflow occurs.
Anyway, the benchmark is just a low-level timed CMD_READ loop. I will create one myself without the trickery, using up-to-date Exec calls against the latest SDK.
A simple ITimer->GetSysTime() before and after the DoIO() and then an ITimer->SubTime() should do the trick.
Quote:
A simple ITimer->GetSysTime() before and after the DoIO() and then an ITimer->SubTime() should do the trick.
Maybe it's no longer the case, but IIRC ITimer->GetUpTime() had a higher resolution than ITimer->GetSysTime(): the same as ITimer->ReadEClock(), but without having to convert the results to microseconds.
I use UNIT_MICROHZ to request the current system time, but I could also use the EClock. I don't know if this makes much difference for timing in the milliseconds range.
I could also do two calls to ReadEClock(), subtract the results and divide by the count rate. Maybe this is even faster and more accurate.
Yeah, but thinking about it again, GetSysTime()/TR_GETSYSTIME probably always returns the same result no matter what unit. Especially GetSysTime(), as it doesn't take the unit as a parameter, and it's unlikely that TR_GETSYSTIME behaves differently.