I have spent the last week with an experimental version of NGFS that can bypass the cache when reading large transfers. Depending on the criterion for reading directly rather than through the cache, I can get speed increases for different transfer sizes. Some speed increases (for particular Read transfer sizes) can be as much as two to four times.
I have not attempted to allow Writes to bypass the cache because of the extra work involved to avoid clashes.
Overall speeds of (say) compiling the file system are not improved to any noticeable extent, and overall speeds of my test suite are not improved either. I have not tried profiling movie playback.
The show-stopper showed up today in the form of intermittent Read errors in one of my tests. If a caller writes a small update to an existing file, that change will be cached and will remain in the cache (not on disk) until the cache contents are flushed to disk. If, before the flush, a large Read comes in, that Read will get the old contents of the file directly from the disk, not seeing the update.
Since the fast-read path shows speed improvements only for a narrow range of benchmark tests, and introduces problems that can only be fixed by adding code for particular conditions, the result is not worth the effort. I have abandoned the experimental version. Maybe one day a flash of inspiration will hit me, but in the meantime I am happy to leave the FS as it is.
@tonyw Of course it's much more complex to bypass the cache for large transfers, for example handling a small write before a large read, and the error handling you mentioned... But it's worth the effort: with SFS and JXFS and diskcache.library it was about 2-3 times faster on average on an A1XE and a Sam440ep with SATA, and on an X5000 with NVMe the difference should be even higher.
Another, maybe easier, way to speed up large reads would be to do the same as IIRC the FFS2 cache does, or did 20 years ago. Instead of using very small cache reads (max. 128 KB in your case) and then CopyMemQuick()ing the cache contents to the application buffer, do it the other way round:

1. Check whether any blocks of the transfer are in the cache and modified, but not written to disk yet. If there are, write and flush the cached blocks of the transfer range first. My IDiskCache->Read() function simply calls Self->Flush() with the same arguments first, before doing a large transfer which bypasses the cache.
2. Do a direct device I/O transfer with the size and buffer you get from the application, which is much faster for large transfers.
3. CopyMemQuick() the data from the application buffer to your disk cache, if there was no I/O error.
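Just to make the shape of that read path concrete, here is a minimal sketch in OS4-style C. The CacheFlushRange()/CacheStoreRange() helpers are hypothetical stand-ins for whatever NGFS and diskcache.library actually do internally, and the IOStdReq is assumed to be already set up for the underlying device:

```c
#include <exec/types.h>
#include <exec/io.h>
#include <proto/exec.h>

/* Hypothetical cache helpers, standing in for the file system's own cache code. */
extern void CacheFlushRange(uint64 offset, uint32 length);
extern void CacheStoreRange(uint64 offset, CONST_APTR data, uint32 length);

/* Bypass-read: flush dirty cached blocks in the range, DMA straight into the
 * caller's buffer, then refresh the cache from that buffer. */
static int32 BypassRead(struct IOStdReq *io, uint64 offset,
                        APTR appBuffer, uint32 length)
{
    /* 1. Dirty cached blocks inside [offset, offset+length) must hit the disk
     *    first, otherwise the direct read below returns stale data. */
    CacheFlushRange(offset, length);

    /* 2. One large transfer directly into the application buffer. */
    io->io_Command = CMD_READ;   /* a real driver path would use the 64-bit/NSD commands for large offsets */
    io->io_Data    = appBuffer;
    io->io_Length  = length;
    io->io_Offset  = (uint32)offset;
    if (IExec->DoIO((struct IORequest *)io) != 0)
        return (int32)io->io_Error;

    /* 3. Keep the cache warm: copy the freshly read data back into the cache
     *    (internally this could be IExec->CopyMemQuick() per cache line). */
    CacheStoreRange(offset, appBuffer, length);
    return 0;
}
```

Step 1 is exactly what avoids the stale-read problem described above.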
Something similar can be done for writes as well: instead of flushing the cached parts before the large, non-cached I/O transfer, you just invalidate the involved cache lines of the transfer (or remove them from the cache if you don't have an "unused" flag).
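Continuing the sketch above (same includes and assumptions, with a hypothetical CacheInvalidateRange() helper):

```c
/* Hypothetical helper that marks the cache lines unused or evicts them. */
extern void CacheInvalidateRange(uint64 offset, uint32 length);

/* Bypass-write: drop any cached copies of the range so later cached reads
 * cannot return stale data, then write straight from the application buffer. */
static int32 BypassWrite(struct IOStdReq *io, uint64 offset,
                         CONST_APTR appBuffer, uint32 length)
{
    CacheInvalidateRange(offset, length);

    io->io_Command = CMD_WRITE;
    io->io_Data    = (APTR)appBuffer;
    io->io_Length  = length;
    io->io_Offset  = (uint32)offset;
    return (IExec->DoIO((struct IORequest *)io) != 0) ? (int32)io->io_Error : 0;
}
```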
With my previous methods there was a considerable drop in performance, especially starting at a transfer size of 16 kB. This has been solved using new methods. Performance of the new method drops again when approaching the maximum transfer size of my Samsung 970 EVO, so a combination of old and new methods will probably maximize performance.
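Purely as an illustration of that combination (the 16 kB figure is from the measurements above, while the function names and the upper threshold are placeholders), the dispatch could be as simple as a size check:

```c
#include <exec/types.h>

#define OLD_PATH_LIMIT  (16 * 1024)      /* below this the old methods performed fine */
#define NEW_PATH_LIMIT  (1024 * 1024)    /* placeholder: where the new method starts to drop off */

/* Hypothetical transfer routines representing the two code paths. */
extern int32 TransferOldMethod(APTR buf, uint32 length, uint64 lba);
extern int32 TransferNewMethod(APTR buf, uint32 length, uint64 lba);

/* Pick whichever path measured faster for the given transfer size. */
static int32 DoTransfer(APTR buf, uint32 length, uint64 lba)
{
    if (length >= OLD_PATH_LIMIT && length < NEW_PATH_LIMIT)
        return TransferNewMethod(buf, length, lba);
    return TransferOldMethod(buf, length, lba);
}
```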
This is the latest result compared to the result in #209
Yes, the larger the transfer size, the lower the relative OS4 overhead. So it's basically benchmarking PCIe-to-DDR3 memory DMA at this point, and that seems to be no bottleneck for the X5000.
For real-world speed, only the area up to 128 kB is interesting because that's the transfer limit of NGFS, and that's where I've concentrated my optimization efforts.
SFS2 does benefit from higher transfer sizes, but this filesystem hits a brick wall around 375 MB/s due to its own overhead. So 128 MByte blocks are transferred at 2 GB/s, but then the drive idles until the next filesystem command arrives.
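Back-of-the-envelope, taking those figures at face value (2 GB/s ≈ 2048 MB/s):

- raw transfer time per 128 MByte block: 128 / 2048 s ≈ 62.5 ms
- effective time per block at 375 MB/s: 128 / 375 s ≈ 341 ms
- so roughly 280 ms per block, around 80% of the wall-clock time, would be filesystem overhead during which the drive sits idle.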
Usually not recommended, but please try whether using an SFS\0 or SFS\2 partition with 32768 bytes/block makes any difference. If it does, I might be able to improve the speed for 512 bytes/block SFS partitions with nvme.device as well.
The SSD reports that 71 GByte are occupied, but in reality only 6 GByte are. So the drive will gradually fill up until everything is occupied, and then it slows down. This is why we need TRIM support.
@geennaam ATA TRIM is not possible, unless there is an HD_ATACmd now, but using UNMAP (3.54) with HD_SCSICmd would be. But even if I added support for it in SFS, it would only work if diskcache.library is disabled/removed in the Kicklayout...
@geennaam It doesn't make much difference whether you implement a private nvme.device command for TRIM or convert HD_SCSICmd UNMAP to ATA TRIM (which would be the better solution IMHO), but SFS without diskcache.library is way too slow to be usable. Of course the file system itself knows which blocks/sectors are freed, but diskcache.library is a file system independent cache system and doesn't. Mixing the APIs in SFS (IExec->DoIO() if diskcache.library isn't installed, IDiskCache->Read()/Write()/etc. if it is) isn't possible either. The easiest way may be implementing a separate TRIM/UNMAP tool for SFS (or adding it to PartitionWizard), which stops the file system, reads the bitmap, uses TRIM/UNMAP on all unused sectors and restarts the file system. That would work both with and without diskcache.library.
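For what it's worth, a rough sketch of what such a tool could send per free extent, assuming the device (nvme.device here, purely as an example) accepts a SCSI UNMAP via HD_SCSICmd at all, which is exactly the open question above. Device opening and the bitmap walk are omitted, and the extent values are placeholders:

```c
#include <exec/types.h>
#include <exec/io.h>
#include <devices/scsidisk.h>
#include <proto/exec.h>
#include <string.h>

/* Send one SCSI UNMAP (opcode 0x42) for a single extent of free blocks.
 * 'io' must be an IOStdReq that is already opened on the target device/unit. */
static int32 UnmapExtent(struct IOStdReq *io, uint64 startLBA, uint32 numBlocks)
{
    UBYTE cdb[10];
    UBYTE param[8 + 16];        /* 8-byte header + one 16-byte block descriptor */
    UBYTE sense[32];
    struct SCSICmd scsi;
    int i;

    memset(cdb,   0, sizeof(cdb));
    memset(param, 0, sizeof(param));
    memset(&scsi, 0, sizeof(scsi));

    /* UNMAP parameter list header */
    param[1] = sizeof(param) - 2;       /* UNMAP data length */
    param[3] = 16;                      /* block descriptor data length */

    /* One block descriptor: 8-byte start LBA + 4-byte block count, big-endian */
    for (i = 0; i < 8; i++)
        param[8 + i]  = (UBYTE)(startLBA  >> (8 * (7 - i)));
    for (i = 0; i < 4; i++)
        param[16 + i] = (UBYTE)(numBlocks >> (8 * (3 - i)));

    /* UNMAP CDB: opcode plus parameter list length */
    cdb[0] = 0x42;
    cdb[8] = sizeof(param);

    scsi.scsi_Data        = (UWORD *)param;
    scsi.scsi_Length      = sizeof(param);
    scsi.scsi_Command     = cdb;
    scsi.scsi_CmdLength   = sizeof(cdb);
    scsi.scsi_Flags       = SCSIF_WRITE | SCSIF_AUTOSENSE;
    scsi.scsi_SenseData   = sense;
    scsi.scsi_SenseLength = sizeof(sense);

    io->io_Command = HD_SCSICmd;
    io->io_Data    = &scsi;
    io->io_Length  = sizeof(struct SCSICmd);

    /* Whether the driver translates this into an NVMe Deallocate is up to the driver. */
    return IExec->DoIO((struct IORequest *)io) ? (int32)io->io_Error : 0;
}
```

The tool would then loop over the free extents from the SFS bitmap (while the file system is stopped) and could coalesce adjacent free blocks into fewer, larger descriptors to keep the number of commands down.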
I must be misunderstanding your statement here. I thought only SFS used diskcache.library. It can be (or is) used by other filesystems?
It could be used by any file system, but the only other one which did was AFAIK my failed attempt to implement a better file system for AmigaOS 4.x (JXFS, should only have been available to OS4 beta testers). IIRC I even added support for it to FFS2, but olsen didn't like it and rejected the changes. IMHO FFS2's own cache system (fs_plugin_cache) is the worst possible way to implement a file system/disk cache...
> but SFS without diskcache.library is way too slow to be usable.
Sorry, I probably miss something, but I read in some other place that you said diskcache.library should be removed from the Kicklayout to get better speed. But now you say the opposite. Can you explain it a bit (again, sorry)? Thanks!
@kas1e Not because of speed, but because diskcache.library is optimized for maximum speed on HDDs. The result is that it (re)writes many more sectors than required. On HDDs that's no problem, but on flash-based storage like SSDs and NVMe drives, which tolerate far fewer overwrites before the hardware dies, or at least gets extremely slow, that's bad.
Do you think it's worth trying an NVMe device via a PCI-to-PCIe bridge on the Pegasos 2? It could be pretty good to compare with IDE, even with CF2IDE or SATA2IDE adapters. I don't know, though, whether it will be faster than a SiI3114 with a SATA disk. After all, it's a Pegasos 2, with only SFS(2) as the fastest option.
I think SATA2IDE is the same speed; you will only get the max PCI speed minus overheads.
133 MB/s on a 66 MHz bus, 66 MB/s on a 33 MHz bus.
Under emulation it’s a different story, I guess.
Our old real PowerPC CPUs have pretty slow PCIe compared to newer x86/ARM CPUs, so under emulation it's simply a question of how fast the host can emulate a PowerPC. In particular, DMA transfers will be a killer on the faster bus.
I feel we need a proper benchmark tool to show the strengths and weaknesses of a system, and to visualize it so that results can be compared.
But I’m thinking more about the benchmarks on Hans' webpage: we are bombarded with 16-bit benchmarks from emulation, while most of the real hardware uses 32-bit modes in the benchmarks. I feel it needs to be split into different categories.
32-bit will always get a lower score than 16-bit on the same hardware because 16-bit moves half the data; if there are no major byte-swap or GFX issues, that is.
Also, I’m not so interested in thousands of QEMU scores; I only need to see the unique ones. Without knowing the host CPU as well, those benchmarks become kind of meaningless.