@all Hello. Could someone please check whether Snoopy still runs for you with the NVMe device driver installed? I don't feel like opening the computer again to test it ;)
I'm wondering if NVMe cards exist in x1 form factor? With both long slots occupied, there's no chance for a dual gfx card setup.
Plenty. I have one myself; Amazon has lots of them. Just search for "PCIe x1 NVMe adapter". But beware that the X1000 has only PCIe 1.0, so the x1 slot will be limited to just 250MByte/s.
It's been almost a month since the first release so here's a little progress update:
Changes:
- Fixed a bug which halted the boot procedure when no NVMe drive is connected.
- Host Memory Buffer implemented for SSDs without embedded DDR cache. This means that up to 64MByte of DDR memory is used as buffer for those DDR-less drives.
- Small speed optimizations of the code itself.
- Dropped the interrupt handler as preparation for multiple NVMe units -> MSI or MSI-X is not supported by the latest publicly available kernel, and the simulated pin interrupt is always number 57 for the drives that I own. This would have meant sharing an interrupt between multiple drives, which would impact speed.
- Complete rewrite of the NVMe IO handler -> command queue depth (CQ) increased from 2 to 64 (see the sketch below).
Especially the last bullet should increase the speed a lot. Or so I thought... Unfortunately, speed in the x4 slot is still stuck at about 345MB/s (averaged over multiple 1GB writes).
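To make the queue-depth change concrete, here is a minimal sketch of ring-buffer submission against a 64-entry NVMe submission queue. It follows the public NVMe 1.x spec (64-byte SQ entries, SQ tail doorbell at BAR0 + 0x1000 + (2*qid)*(4 << CAP.DSTRD)); the struct and variable names are my own illustration, not the actual nvme.device source.

#include <stdint.h>
#include <string.h>

#define SQ_DEPTH 64                        /* entries per I/O submission queue */

struct nvme_sqe { uint32_t dw[16]; };      /* one 64-byte submission queue entry */

struct nvme_queue {
    volatile struct nvme_sqe *sqes;        /* DMA memory holding SQ_DEPTH entries */
    volatile uint32_t *sq_doorbell;        /* mapped SQ tail doorbell register */
    uint16_t sq_tail;                      /* next free slot */
    uint16_t sq_head;                      /* last head value seen in completions */
};

/* Returns 1 if the command was queued, 0 if the ring is currently full. */
static int nvme_submit(struct nvme_queue *q, const struct nvme_sqe *cmd)
{
    uint16_t next = (uint16_t)((q->sq_tail + 1) % SQ_DEPTH);

    if (next == q->sq_head)                /* controller hasn't consumed enough yet */
        return 0;

    memcpy((void *)&q->sqes[q->sq_tail], cmd, sizeof(*cmd));
    q->sq_tail = next;

    /* Make sure the entry is visible in memory before ringing the doorbell. */
    __asm__ volatile("sync" ::: "memory"); /* PowerPC write barrier */
    *q->sq_doorbell = q->sq_tail;
    return 1;
}

With only one command in flight at a time (the AmigaOS filesystem behaviour discussed below), the deeper ring never fills up, which is exactly why the change helped less than hoped.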
Finding the bottleneck: The architecture of my driver is pretty basic. BeginIO() differentiates between direct and indirect commands. Direct commands are handled immediately, while indirect commands (like read and write) are forwarded via a List structure (FIFO) to a separate "unit" task. The unit task calls the NVMe write handler in the case of a write command. So basically the textbook OS4 implementation.
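As a rough sketch of that textbook split (OS4-style interfaces; NVMeUnit, unit_port and dev_BeginIO are illustrative names, not the real nvme.device source):

#include <exec/types.h>
#include <exec/io.h>
#include <exec/ports.h>
#include <proto/exec.h>

struct NVMeUnit {
    struct MsgPort *unit_port;   /* FIFO served by the separate "unit" task */
};

static void dev_BeginIO(struct IOStdReq *io, struct NVMeUnit *unit)
{
    switch (io->io_Command) {
    case CMD_READ:
    case CMD_WRITE:
        /* Indirect command: clear the quick flag and queue it for the unit
           task, which calls the NVMe read/write handler and replies later. */
        io->io_Flags &= ~IOF_QUICK;
        IExec->PutMsg(unit->unit_port, &io->io_Message);
        break;

    default:
        /* Direct command: handle immediately; only reply if quick
           completion was not possible for the caller. */
        io->io_Error = 0;
        if (!(io->io_Flags & IOF_QUICK))
            IExec->ReplyMsg(&io->io_Message);
        break;
    }
}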
In search of speed, I've decided to profile my driver. I've used an SFS/02 partition on my Samsung EVO 970 with a blocksize of 4096 and a buffer size of 2048. The benefit of a blocksize of 4096 instead of 512 is that the transfer size limit for a copy is increased to ~16MB instead of ~2MB for a blocksize of 512 bytes. Of course this can be controlled with the "BUF" option of "copy", but I've noticed that it is beneficial for drag and drop copies as well. (It would be nice if there were a global OS4 variable to control this.)
When I copy a 1GB file from RAM: to the SFS/02 partition, the file is divided into ~16MB transfers. Execution time of the write handler for each ~16MB transfer is about 11ms. When I ignore the relatively small overhead for the handler itself, this means that the ~16MB of data is transferred at ~1450MB/s from RAM to SSD. This was a bit surprising to me because until now I could read on several forums that the X5000 would have slow DRAM and PCIe performance. But this is clearly not the case. Another fun fact is that my Samsung drive is somehow using cache coherent magic. It knows when the source data hasn't changed: when I repeat the same copy action, the ~16MB transfers suddenly complete in ~7ms each. This would mean ~2300MB/s, which is faster than the theoretical maximum of 2000MB/s of my PCIe 2.0 x4 slot. My other two drives (WD and Solidigm) don't do this trick.
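Spelling out the arithmetic behind those figures:
~16MB / 11ms ≈ 1450MB/s
~16MB / 7ms ≈ 2300MB/s
PCIe 2.0 x4 = 4 lanes x 500MB/s = 2000MB/s theoretical maximum
Since 2300MB/s exceeds what the link can physically carry, the repeated copy can't be a real end-to-end transfer, which is why it looks like the drive recognises the unchanged data.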
Anyway, back to profiling. Next up was timing the complete write command, i.e. including BeginIO(), the NVMe write handler and replying the IO message to the caller (the filesystem). When I subtract the time that the NVMe write handler needs, this is basically a benchmark of the message port system and scheduler. But to my big surprise, this overhead is only a couple of microseconds. So my driver alone can write at about 1450MB/s. This means the bottleneck is somewhere else in the system.
So finally I've measured the time between consecutive IO commands which are sent by the filesystem to my driver. It turns out that this time between IO commands is huge and scales linearly with the transfer size, so the overhead percentage is more or less equal for each transfer size. As a result, the theoretical maximum transfer speed with SFS/02 is limited to ~425MB/s (when you calculate with zero overhead for processing and executing the actual IO command). Playing with the BUF=xxxxx option for the copy command will increase or decrease the transfer size. But as stated above, the roughly constant relative IO command interval overhead means that this always results in the same ~425MB/s limitation. Reformatting the partition with the recommended blocksize of 512 bytes made no difference.
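For anyone wanting to reproduce this kind of measurement, here is a minimal sketch of timing the gap between consecutive IO commands with timer.device's EClock, assuming the driver already holds a TimerIFace *ITimer obtained at init time; the variable names are placeholders, not the actual profiling code used here.

#include <exec/types.h>
#include <devices/timer.h>
#include <proto/timer.h>

static struct EClockVal last_done;   /* taken right after replying the previous IO */

/* Call at the top of BeginIO() for read/write commands. */
static void log_gap(struct TimerIFace *ITimer)
{
    struct EClockVal now;
    uint32 freq = ITimer->ReadEClock(&now);    /* also returns EClock ticks per second */

    uint64 t_now  = ((uint64)now.ev_hi       << 32) | now.ev_lo;
    uint64 t_last = ((uint64)last_done.ev_hi << 32) | last_done.ev_lo;

    /* Gap between finishing the previous command and receiving this one, in microseconds. */
    uint64 gap_us = (t_now - t_last) * 1000000ULL / freq;
    (void)gap_us;                              /* e.g. dump via IExec->DebugPrintF() */
}

/* ...and right after ReplyMsg() in the unit task:  ITimer->ReadEClock(&last_done); */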
If my driver were able to transfer at the X5000 PCIe x4 bus limit and without overhead, then the maximum speed (without the Samsung cache trick) with SFS/02 would be ~352MB/s. Currently I can transfer at ~330MB/s (without the Samsung cache trick).
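One way to see how those two numbers hang together (my own back-of-the-envelope reading, not a formula stated in the post): if the filesystem-side gap alone would cap throughput at ~425MB/s and the actual data transfer runs at rate R, the combined rate is 1/(1/425 + 1/R):
R = 2000MB/s (PCIe 2.0 x4 theoretical limit) -> 1/(1/425 + 1/2000) ≈ 350MB/s, i.e. the ~352MB/s ceiling
R = 1450MB/s (measured write-handler speed) -> 1/(1/425 + 1/1450) ≈ 329MB/s, i.e. the ~330MB/s currently achieved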
Summarizing the profiling effort:
- My driver is already performing at the limit for SFS/02.
- Pipelining my driver (separate tasks for NVMe command submission and completion) makes no sense because IO commands are sent one at a time, and only after the previous one has finished.
I've been told that NGFilesystem is faster. Like SFS, the IO commands are sent one at a time after the previous one has completed, but the transfer size is limited to just 128kB. So I cannot determine whether this filesystem suffers from the same huge delay between commands for larger transfers. And because of the small transfer size, the NVMe read/write setup overhead becomes dominant and is now the limiting factor.
Other observations:
- NVMe SSDs are designed with multithreading and pipelining in mind. The xxxxMB/s figures that you can read on the box of your NVMe drive are only reached when the NVMe command queue is kept filled with new commands. Unfortunately, AmigaOS filesystems are not multithreaded, which means the NVMe command queue is empty most of the time. Therefore it doesn't really matter what's written on the box; most NVMe drives will perform more or less equally.
- The Samsung cache trick is a nice bonus, but it is only beneficial in my test case and probably not so much in real use cases.
- And last but not least: don't buy a Solidigm P44 Pro NVMe SSD. (I did, because this drive gets very good reviews.) This PCIe 4.0 drive fails to enumerate at PCIe 2.0 x4; instead it falls back to PCIe 1.0 x4 (1000MB/s). At first I thought that this was simply a reporting issue of the drive, but benchmarks showed that it is indeed operating at just 1.0. On top of that, the maximum transfer size of the drive for each NVMe IO command is just 256kBytes (compared to 2MB for the Samsung 970). While this is not an issue for NGfilesystem with its 128kB limit, it means more overhead for SFS2.
Quote:
- Complete rewrite of the NVMe IO handler -> command queue depth (CQ) increased from 2 to 64.
Especially the last bullet should increase the speed a lot. Or so I thought... Unfortunately, speed in the x4 slot is still stuck at about 345MB/s (averaged over multiple 1GB writes).
SFS only uses a single command at a time per partition. Command queue might help a little if you copy data from one SFS partition to another, but only if diskcache.library isn't used, which is limited to one command at a time per device unit.
Not sure anymore, but FFS2 might use multiple read/write commands.
Quote:
Of course this can be controlled with the "BUF" option of "copy", but I've noticed that it is beneficial for drag and drop copies as well. (It would be nice if there were a global OS4 variable to control this.)
There should be a ToolType to set the buffer size used by Workbench copies in AsyncWB. With AsyncWB enabled, Workbench copies are usually faster than C:Copy ones (at least if neither source nor destination is RAM:).
Quote:
When I copy a 1GB file from RAM: to the SFS/02 partition, the file is divided into ~16MB transfers.
No idea where this 16 MB limit comes from. Are you sure it's not just a limit of C:Copy?
To benchmark your driver you should use something like SCSISpeed instead, just not with its tiny default buffer sizes but with much larger ones, for example 1 MB, 16 MB, 64 MB and 256 MB. That way you are only testing your driver and not any limits of the file systems used. SCSISpeed only uses a single command at a time as well, but adding command queuing to it should be easy. If you want to keep using a file system, try an FFS partition with the largest supported block size (IIRC 32 KB). Everything else is much slower in FFS, but simple file reads/writes using large block sizes may be faster than in more complex file systems like SFS and NGFS.
Quote:
Execution time of the write handler for each ~16MB transfer is about 11ms. When I ignore the relatively small overhead for the handler itself, this means that the ~16MB data is transferred at ~1450MB/s from RAM to SSD. This was a bit surprising to me because until now I could read on several forums that the X5000 would have slow DRAM and PCIe performance. But this is clearly not the case.
On the AmigaOne SE/XE/µ the limit wasn't PCI or RAM speed but the CPU memory interface, PCI DMA transfers could exceed the speed of CPU RAM reads/writes. Maybe it's the same on the X1000 and X5000?
joerg wrote: Unless nvme.device includes its own RDB parsing and partition mounting support, like the X1000, X5000 and Sam460 SATA drivers seem to do, you have to add nvme.device to diskboot.config - if you are either using a very old AmigaOS 4.1 version which still included my diskboot.kmod (Hyperion has no licence to use it in 4.1FE or any newer version of AmigaOS) or the Enhancer Software, which might include a legal version of it as well.
Please, what exactly should be in diskboot.config to boot from nvme.device? I found nothing in the documentation, and inside the file diskboot.config there is this comment:
;Device unit flags:
; 1 = mount HD
; 2 = mount CD/DVD
; 3 = both
; 4+ = support LUNs
but the device entries have two numbers (the commented help is probably for the second one), and the first number is not clear to me:
Please, what is the first number? It looks like the max. number of devices (a1ide=4, scsi=8), but why does sii3114 have 16? It is a four-port device and sii3112 a two-port one... Should it be something like "nvme.device 1 1"?
Quote:
joerg wrote: There should be a ToolType to set the buffer size used by Workbench copies in AsyncWB
Yes, there is BUFSIZE. It is not set. Please, what is the recommended value? And should it be the same for old machines with enough RAM like the XE or Pegasos 2 and for new machines like the X5000 or X1000?
Quote:
Please, what exactly should be in diskboot.config to boot from nvme.device?
nvme.device 1 1
First 1: number of units supported by the device; in the current nvme.device versions only one is supported, but that may change in newer versions.
Second 1 (flags): only mount HD partitions, no CD/DVD ones, no support for LUNs (only used by SCSI).
Quote:
but why does sii3114 have 16? It is a four-port device
A PCI sii3114 SATA controller only has 4 ports, but sii3114ide.device has support for using more than one sii3114 card. Usually you don't do that, but there were systems with onboard sii3114 and you can add a 2nd one in a PCI slot. Same for the 2 port 3112/3152 SATA controllers, for 8 units you'd have to install 4 PCI 3112/3152 cards...
Quote:
Yes, there is BUFSIZE. It is not set. Please, what is the recommended value? And should it be the same for old machines with enough RAM like the XE or Pegasos 2 and for new machines like the X5000 or X1000?
The larger BUFSIZE is, the faster the transfers are. On the X1000 and X5000 with 2 GB (or more) RAM you can use for example 64 MB or even 256 MB. On older machines with much less RAM installed you have to make a compromise between the faster copy speed of a large BUFSIZE and the amount of RAM required for it; maybe 16 MB or 32 MB would be a good choice for old systems with for example 512 MB RAM.
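As a concrete example, and assuming the tooltype value is given in bytes (my assumption, check the AsyncWB documentation to be sure), 64 MB on an X1000/X5000 would be set in the AsyncWB icon as:

BUFSIZE=67108864

and something like BUFSIZE=16777216 (16 MB) for an older 512 MB machine.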
Quote:
Unless nvme.device includes its own RDB parsing and partition mounting support, like the X1000, X5000 and Sam460 SATA drivers seem to do, you have to add nvme.device to diskboot.config
I have no idea why parsing the RDB in a device driver should be necessary. Sounds like a hack.
Anyway, after successfully initializing the NVMe drive, my driver announces nvme.device to mounter.library with IMounter->AnnounceDeviceTags(). This mounts the partitions.
Quote:
No idea where this 16 MB limit comes from. Are you sure it's not just a limit of C:Copy?
Yes, it's a combination of SFS/02 blocksize and default c:copy BUF size.
SFS2 with a 512-byte blocksize results in default transfer sizes of 2MB-512. An 8-times larger blocksize (4096) results in 8-times larger default c:copy transfer sizes (16MB-4096). So the c:copy default is 4095*blocksize.
Using e.g. c:copy BUF=65536 increases the transfer size from 2MB-512 to exactly 33MB. But the time between commands scales ~linearly with the transfer size, so the theoretical SFS/02 performance limit remains 425MB/s. That is what I meant by equal relative overhead, independent of transfer size.
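Spelling out those sizes (assuming BUF counts 512-byte buffers, which is what the "exactly 33MB" figure suggests):
default, 512-byte blocks: 4095 x 512 = 2,096,640 bytes = 2MB - 512
default, 4096-byte blocks: 4095 x 4096 = 16,773,120 bytes = 16MB - 4096
BUF=65536, 512-byte blocks: 65536 x 512 = 33,554,432 bytes (the ~33MB above)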
Quote:
I have no idea why parsing the RDB in a device driver should be necessary. Sounds like a hack.
Of course it doesn't make much sense, but on AmigaOS <= 3.x each device driver had to implement its own RDB parsing and partition mounting code; there was nothing else in AmigaOS which could mount partitions. I fixed that in AmigaOS 4.0 with my diskboot.kmod/.config (that way, for example, sg2's PATA, SATA and SCSI drivers didn't have to re-implement it), but since I was never paid by Hyperion for it, it can't be used in AmigaOS 4.1 FE and beyond any more. AFAIK they implemented a replacement for it (MediaBoot?), but way too late, and I don't know how it works.
According to https://wiki.amigaos.net/wiki/Anatomy_of_a_SATA_Device_Driver, all I have to do is make sure that my nvme.device kmod has a lower priority than mounter.library, to ensure that mounter.library is available before I call IMounter->AnnounceDeviceTags().
The way I understand it, the mounter.library call is all that's needed to mount the partitions on the nvme SSD. And this is exactly what happens.
So why would anything else be needed to continue booting from NVMe once the kernel and kmods are loaded and executed from the SATA SSD?
@geennaam Please remember that I haven't been involved in any AmigaOS 4.x development for about 15 years. I don't know what IMounter or mounter.library is, neither of them existed yet when I was still involved in any AmigaOS development, but it's probably either based on the independent USB mounting code (which can be used for mounting partitions used for booting AmigaOS) or the Mounter commodity (which can't be used for booting, only for mounting partitions after the Workbench has started), and is now used as a replacement for my diskboot.kmod.
So I have to create a driver based on bits and pieces of information where available. But most of it is just trial and error to see what works.
Creating the initial NVMe backend was just two weekends' worth of coding: just follow the NVMe standard and everything is fine. The remaining two months were spent figuring out how AmigaOS4 device drivers/kmods are supposed to work.