I've seen some reports of drives that fail to work with my driver.
I've tried to find out whether other platforms have similar issues, and it looks like the problem might be related to legacy interrupts (emulated pin interrupts). I've come across multiple websites which recommend using only MSI/MSI-X interrupts with NVMe. Windows even offers a tool (LOGO) which checks which types of interrupt work for an attached NVMe drive and which don't. The publicly available OS4 kernels support legacy interrupts only, but the latest SDK contains traces of MSI support. So if a new kernel is ever released, this might solve the issue.
The driver on os4depot is strictly interrupt based. If an interrupt is not received within the timeout window, it generates an error. In your case this error likely occurs during initialisation and therefore shows the same symptoms as if no NVMe drive had been found at all (a bug in the cleanup routine). My current beta driver checks for an active interrupt inside the NVMe drive itself. But since the NVMe completion is much faster than the interrupt response, I might as well simply poll the completion queues. So stay tuned.
Edit1: The good news is that ignoring the interrupt and checking the completion queue directly works. The bad news is that it has a negative impact on performance, because I need to flush caches during polling.
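For illustration only, here is a minimal sketch (not the actual driver code) of what polling a completion queue instead of waiting for an interrupt can look like. The 16-byte CQE layout follows the NVMe spec; the invalidate callback and the simple spin-count timeout are placeholders for whatever the platform provides.

#include <stdint.h>

/* One 16-byte NVMe completion queue entry (little-endian as defined by the
 * NVMe spec; byte swapping for big-endian PPC is omitted in this sketch). */
struct nvme_cqe {
    uint32_t result;    /* DW0: command specific result                   */
    uint32_t reserved;  /* DW1                                            */
    uint16_t sq_head;   /* DW2: current submission queue head pointer     */
    uint16_t sq_id;     /* DW2: submission queue identifier               */
    uint16_t cid;       /* DW3: command identifier                        */
    uint16_t status;    /* DW3: bit 0 = phase tag, bits 1..15 = status    */
};

/* Poll a completion queue slot until its phase tag matches the expected
 * phase, or give up after 'spins' iterations. The phase tag flips each
 * time the controller wraps the queue, which is how fresh entries are told
 * apart from stale ones without an interrupt. 'invalidate' stands in for
 * the platform's cache invalidation call. */
static int poll_cqe(volatile struct nvme_cqe *cqe, unsigned expected_phase,
                    unsigned long spins,
                    void (*invalidate)(volatile void *addr, uint32_t len))
{
    while (spins--) {
        /* The controller writes the CQE via DMA, so the CPU's cached copy
         * of that line must be discarded before every check. */
        invalidate(cqe, sizeof(*cqe));

        if (((unsigned)cqe->status & 1u) == expected_phase)
            return 0;   /* completion has been posted */
    }
    return -1;          /* nothing arrived within the polling window */
}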
I have it on the best authority that the kernel does not, and never has, supported MSI interrupts. Nothing has changed in that regard with newer kernels.
    BOOL   Is64Bit;        /* True if the device is capable of 64 bit MSI addresses */
    uint64 MessageAddress; /* The message target address. Note that the interrupt
                            * controller code has to set this up accordingly.
                            * 0 means MSI is disabled for this device */
};
I thought that this was the groundwork for MSI support in new kernel versions. But apparently not.
Quote:
I've tried to find out whether other platforms have similar issues, and it looks like the problem might be related to legacy interrupts (emulated pin interrupts). I've come across multiple websites which recommend using only MSI/MSI-X interrupts with NVMe. Windows even offers a tool (LOGO) which checks which types of interrupt work for an attached NVMe drive and which don't. The publicly available OS4 kernels support legacy interrupts only, but the latest SDK contains traces of MSI support. So if a new kernel is ever released, this might solve the issue.
Ah, my usual luck to pick a model which has unexpected issues... Unfortunately it's too late to return it. Is there a Linux tool similar to 'LOGO'?
Quote:
The driver on os4depot is strictly interrupt based. If an interrupt is not received within the timeout window, it generates an error. In your case this error likely occurs during initialisation and therefore shows the same symptoms as if no NVMe drive had been found at all (a bug in the cleanup routine). My current beta driver checks for an active interrupt inside the NVMe drive itself. But since the NVMe completion is much faster than the interrupt response, I might as well simply poll the completion queues. So stay tuned.
Edit1: The good news is that ignoring the interrupt and checking the completion queue directly works. The bad news is that it has a negative impact on performance, because I need to flush caches during polling.
Ok, let me know when there is a new version available... And if you need beta testers for the prerelease versions, just drop me a PM.
I have updated DiskSpeed (not SCSISpeed) with the changes suggested by Joerg (to fix the counter overflow problem) and added a 1 MB buffer setting for testing.
I have submitted DiskSpeed V4.5 to OS4Depot for upload, should be available soon.
Meanwhile, here are results of Geennaam's driver with a 512 GB Kingston "device" on my X5000-20:
Thanks for the DiskSpeed updates, Tony! I am a little surprised there isn't a bigger delta in performance between your NVMe drive and my SSD. The numbers posted below are from an X5000/20 with a Samsung EVO SSD attached to the on-board SATA interface. The volume under test is a 400 GB NGFS\01 partition on that disk:
DiskSpeed is a tool to compare different filesystems using the same driver and hardware, not for comparing different drivers/hardware; that's what SCSISpeed is for.

In tonyw's results the most important details are missing: Which filesystem is used? Which BlockSize is used on the test partition? In the case of NGFS: is it a beta version with the strange 128 KB transfer size limit already fixed, or an old version with this limit, which makes fast reads and writes impossible with any driver and hardware?

Adding a 1 MB default buffer size is better than the old versions (512 bytes to 256 KB only), but still way too small to get fast read/write speeds on any current hardware. For example, the C:Copy tests geennaam did were using a 16 MB buffer. For any usable test, no matter if DiskSpeed, SCSISpeed or C:Copy, the buffer size has to be larger than the disk cache used, or all you'll get is the performance of the IExec->CopyMemQuick() implementation on your system instead of anything related to the disk speed.
It looks like there's some DDR3 memory cache benchmarking going on.
I can assure you that the X5k is SATA2, hence a 300 MByte/s theoretical limit. However, the raw read speed is about 250 MB/s with my SATA SSD. With larger transfer sizes, the P50x20sata.device starts chopping up the transfer into smaller chunks (the driver reports this with debug output on the terminal). As a result, the read speed drops a little. I will post the benchmarks from my SCSISpeed alternative later today.
The transfer size limit is set by the size of the disk cache, the read-ahead cache and the number of available "buffers". Since NGFS has a write-through cache, all Reads and Writes go through the cache. Also, since it is a journalling file system, all Writes to disk (of meta data) take three Write operations, not just one.
The cache "buffers" are permanently allocated from the system and controlled by internal allocation code. Allocating and de-allocating cache buffers from the Exec imparts a heavy speed penalty. For a partition of 100 GB+, 4096-byte blocks are used, which requires 16 MB of cache for each such partition. I have 23 such partitions on my X-5000, so the cache is no bigger than necessary.
Many years ago, when I spent a lot of time optimising performance, I played with cluster sizes, number of cache buffers, etc. The FS was optimised (at the time) for overall speed *of my test suite*, not for the speed of individual transfers.
I have a test suite that runs all sorts of different tests and takes about 12 minutes to complete. The optimisation work was performed on a Sam 460 with a mechanical hard drive (the mid-range machine at the time). The 32-block cluster that limits read/write transfer sizes gave the best *overall* performance at the time.
Now that I have Geennaam's driver working, I can revisit the speed optimisations and check to see if there is anything to be gained by changing the settings. I doubt that any great increase can be achieved.
PS. Naturally, the test results I published were taken using the current version of NGFS. It would be unfair to publish the results of tests performed on other file systems. The partition size in this case was about 120 GiB.
Small, single-command transfers are the Achilles' heel of NVMe. Small sizes are fine as long as you overload the drive with them (the more I/Os, the better). Alternatively, large transfers are fine because they are broken down into multiple smaller transfers (the size depends on the NVMe controller) and fed to the submission queue.
Currently, my driver is optimised for large transfers. A future release will include independent submission and retire queues in order to create a truly pipelined flow. But this will also require a filesystem which is capable of sending multiple I/Os.
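As a rough illustration of that idea, not the actual driver code: a large request is split into controller-sized chunks that are all queued before any completion is retired, so submission and retirement can overlap. The chunk limit and the nvme_submit_read()/nvme_wait_all() helpers below are made up for the sketch.

#include <stdint.h>

#define MAX_CHUNK_BYTES (128 * 1024)   /* assumed per-command limit (MDTS-derived) */

/* Hypothetical helpers: queue one read command without waiting, and wait
 * until 'count' completions have been retired from the completion queue. */
extern int nvme_submit_read(uint64_t lba, void *buf, uint32_t bytes);
extern int nvme_wait_all(uint32_t count);

/* Split one large read into multiple submission queue entries and only
 * then wait for the completions, instead of one submit/wait per chunk. */
static int nvme_read_large(uint64_t lba, void *buf, uint64_t bytes,
                           uint32_t block_size)
{
    uint32_t outstanding = 0;

    while (bytes > 0) {
        uint32_t chunk = (bytes > MAX_CHUNK_BYTES) ? MAX_CHUNK_BYTES
                                                   : (uint32_t)bytes;

        if (nvme_submit_read(lba, buf, chunk) != 0)
            break;                      /* queue full or error: stop early */

        outstanding++;
        lba   += chunk / block_size;
        buf    = (uint8_t *)buf + chunk;
        bytes -= chunk;
    }

    /* With independent submission and retirement, this wait could overlap
     * with queuing the next request, giving a pipelined flow. */
    return nvme_wait_all(outstanding);
}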
Quote:
The transfer size limit is set by the size of the disk cache, the read-ahead cache and the number of available "buffers". Since NGFS has a write-through cache, all Reads and Writes go through the cache.
Read-ahead and copy-back caches only help for small transfers, not for large ones (which become slower than without a cache), and caching everything doesn't make sense either. For metadata blocks SFS has "buffers", which are something completely different from the diskcache.library caches (or the SFS-internal ones if diskcache.library isn't used).

For transfers larger than the cache line size, IIRC 64 KB in diskcache.library, I just invalidate the caches for the transfer range, in case some of the sectors were in the cache and the contents change, and do a single device read or write of the size the file system got from the application, provided its start address and size are multiples of the block size. If it's not block aligned, only the first and/or last part(s) smaller than a block are done through the read-ahead/copy-back cache; the largest part bypasses the cache.

The disk cache used in the AmigaOS port of NTFS, and probably all FUSE/FileSysBox file systems, does the same as I do in diskcache.library: only small transfers use the cache, large ones don't.
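A condensed sketch of that split, with made-up names (cache_read(), cache_invalidate_range(), device_read() and both size constants) purely for illustration:

#include <stdint.h>

#define BLOCK_SIZE       512u           /* assumed device block size      */
#define CACHE_LINE_SIZE  (64u * 1024u)  /* assumed cache-bypass threshold */

/* Hypothetical primitives: read via the read-ahead/copy-back cache, drop
 * cached copies of a range, and read straight from the device. */
extern void cache_read(uint64_t offset, uint8_t *buf, uint32_t len);
extern void cache_invalidate_range(uint64_t offset, uint64_t len);
extern void device_read(uint64_t offset, uint8_t *buf, uint64_t len);

static void fs_read(uint64_t offset, uint8_t *buf, uint64_t len)
{
    /* Unaligned head: the partial first block goes through the cache. */
    uint64_t head = (BLOCK_SIZE - (offset % BLOCK_SIZE)) % BLOCK_SIZE;
    if (head > len)
        head = len;
    if (head) {
        cache_read(offset, buf, (uint32_t)head);
        offset += head; buf += head; len -= head;
    }

    /* Block-aligned bulk above the threshold: invalidate any cached
     * sectors in the range, then do one direct device transfer into the
     * caller's buffer. */
    uint64_t bulk = len - (len % BLOCK_SIZE);
    if (bulk >= CACHE_LINE_SIZE) {
        cache_invalidate_range(offset, bulk);
        device_read(offset, buf, bulk);
        offset += bulk; buf += bulk; len -= bulk;
    }

    /* Whatever is left (small remainder or unaligned tail) is cached. */
    if (len)
        cache_read(offset, buf, (uint32_t)len);
}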
Quote:
Also, since it is a journalling file system, all Writes to disk (of meta data) take three Write operations, not just one.
It's the same in SFS, (at least) 3 writes and a CMD_UPDATE, but delayed by the flush timeout.
Quote:
For a partition of 100 GB+, 4096-byte blocks are used, which requires 16 MB of cache for each such partition. I have 23 such partitions on my X-5000, so the cache is no bigger than necessary.
The diskcache.library cache is much larger, some percent of the installed RAM, but it's a single cache shared by all partitions using diskcache.library.
Thanks for that discussion. I think I tried (years ago) bypassing the cache for large transfers but it did not benefit the overall speed of the test suite, so I removed the extra code (don't like special cases). Of course, the test suite does not use a lot of huge transfer sizes such as we are testing with Geennaam's driver.
I'll try re-enabling the bypass-cache code and see if it improves DiskSpeed's performance.
I keep asking myself: "Why are we striving for maximum benchmark performance if it won't make much difference to real-world operation? What sort of application will benefit from an increase of transfer speed for buffer sizes > 1 MiB?"
I can't help thinking that this whole investigation is a solution looking for a problem.
Quote:
I keep asking myself: "Why are we striving for maximum benchmark performance if it won't make much difference to real-world operation? What sort of application will benefit from an increase of transfer speed for buffer sizes > 1 MiB?"
Some examples:
- Compiling software: 16 MB is too small to keep all the executables (make, gcc, gas, ld, etc.) in the cache, and loading the large executables while bypassing the cache should be faster. Small files like the includes will stay in the cache, and if the large executables aren't cached, many more of them will.
- Playing large audio or video files.
- Editing or converting audio or video files.
- Copying files.
Usual benchmarks are faster if you put everything into the cache (but only if the benchmark uses files <= the cache size), while real-world software is usually faster if you bypass the cache for large transfers. Most software using large transfers uses that data only once, and putting it into the cache evicts a lot of other cached data which is accessed more often.
Quote:
Currently, my driver is optimised for large transfers. A future release will include independent submission and retire queues in order to create a truly pipelined flow. But this will also require a filesystem which is capable of sending multiple I/Os.
The only AmigaOS file system which might still be able to do that, if it hasn't been ported to the new AmigaOS 4.1 FS API yet, is FFS2, using the ACTION_(READ|WRITE)_RETURN packets for device I/O. In file systems using the new AmigaOS 4.1 FS API that's no usable option, and in my AmigaOS 4.x SFS/JXFS implementations, which use neither the old TRIPOS/AmigaOS 0.x-3.9 packet API nor the new AmigaOS 4.1 FS API but a custom one, it's not possible either.
So I modified NGFS' ReadData() function so that for a Read request larger than MAX_CACHE_READ, it bypasses the cache and reads from the device directly into the caller's buffer. I haven't made any changes to Write yet.
The result is surprising: read speeds fall by a factor of 4 or 5. I then tried breaking up the long Read into several shorter Reads, but the overall speed doesn't change much with different sub-read sizes.
I think what is happening is this:
In the current version, everything goes through the cache. So the first read is slow, then all later reads are much faster, leading to an average that is pretty good. But when you ignore the cache and read directly from the disk each time, it's going to be much slower than reading from the memory-resident cache.
The code in DiskSpeed measures the overall time to Read() and Seek() to the beginning again (repeated many times). The actual times of the first disk Read() and the subsequent cache Reads are all averaged, so the difference between them is not visible. Bypass the cache and you see only slow transfers.
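To put some made-up numbers on it: if the first Read() really hits the disk and takes, say, 50 ms while the next nine are served from the cache in 5 ms each, the reported average is 9.5 ms per Read, which makes the storage look about five times faster than it actually is.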
In the case of Writes, they all write into the memory-resident cache, which is written to disk some time later, so short Write() operations appear fast. They only slow down when the Write() length exceeds the cache size. A 1 MiB test size operates at full Write speed, although the reported speed is going to be slower than the maximum because of the included Seek() times.
I will add some longer test transfers to DiskSpeed and see what happens.
@tonyw What you got might be true for SATA, maybe even for the X5000's SATA2, but did you test it with NVMe as well? Please ignore any results you may get from benchmarks, your own benchmark tool as well as foreign ones like DiskSpeed, and only use real-world software tests instead. I guess over the nearly 20 years I worked on SFS I did about as many tests with it as you are doing with NGFS, including an always-cache-everything version of diskcache.library (i.e. the same as you are doing), but in my results nearly all real-world software was much faster with SFS if large transfers aren't cached.
The DiskSpeed results you get with a file/buffer size smaller than the cache size aren't disk-related speed results at all, but just IExec->CopyMemQuick() benchmarks copying data from/to the disk cache memory.