Using an endian-swapping version of GCC might help, like the one which was used for building native x86/x64 Amithlon software.
Only if all software is compiled the same way. Otherwise the drivers would still have to endian-convert all data coming from (or going to) apps and the rest of the system.
@PixelHi
Quote:
Is the lack of GART support for the Sam460 (whatever that is) a limitation of the Sam460, or is it just not as easy to implement there as on the X1000/X5000?
In other words, will it ever work on the Sam460 with RadeonHD/RX?
It's a Sam460 limitation. The Sam460 doesn't have cache coherency. I tried doing manual cache flushing everywhere where it's needed, but couldn't get it working. It would always lock up.
It's unlikely that we'll get it working.
EDIT: That said, it may still be possible to get a performance boost. We've hit some kind of performance wall, even with GART support. When we finally figure out what's going on, there's a chance we'll get a boost even without GART.
Yes, it's with the R7 240. I will try the 250 and 7750 too, since now at least I can use it 😁. It looks like the RX 550 works better at the moment, but I think I've settled on the R7 240 as more universal. I might try Linux on it.
About SAM460ex cache coherency: I don't know anything about that, and surely you've tried everything possible, but searching for the PPC460EX datasheet I found this:
...
L2 Cache/SRAM
The PPC460EX also provides a 256KB L2 cache between the Processor Local Bus and the processor's D- and I-caches. This memory unit can be alternatively programmed to function as 256KB of SRAM. Features include:
* Four banks of 64KB each
...
* Use as an L2 cache improves processor performance and reduces the PLB load
  - Cache coherency maintained by a hardware snoop mechanism on the Low Latency (LL) Processor Local Bus (PLB) or by software
...
Using an endian-swapping version of GCC might help, like the one which was used for building native x86/x64 Amithlon software.
I wasn't aware of a specific endian-swapping version of GCC. There was an Intel compiler to assist with compiling big-endian-dependent code on x86.
Also, speaking of the Amithlon compiler, where are the programs compiled for it? I thought Aminet would be full of it, but I can't find them anymore. There's more OS4 software on Aminet!
I tried using the byte-reversed load and store instructions, and I was disappointed with the results. If I remember correctly, I got results as poor as with the code that masks the bits, then shifts and ORs the result. Yeah, it might be that my CPU doesn't have the instructions, or maybe there's some trap-instruction interrupt bug, or someone forgot to check which CPUs have or don't have the instructions. I'm not sure how that works.
The instructions should be fine, as only one is needed to read or write directly in memory, or directly through a pointer register on PPC.
I did see an issue with the GCC endian swap macros. The problem is that they're optimised for x86, which has both a register swap and a memory read/write swap, or at least they're optimised for swapping variable data. PPC only has the memory read/write swap. So on PPC, the macro wants to swap data that is already in a register, and PPC doesn't support that. The code then moves bytes around and ends up a mess. It really shouldn't be that bad, and although PPC has a neat all-in-one rotate-and-mask instruction, it can't seem to swap bytes quickly. It looks like an afterthought. On 68K it's a rol, swap, rol. On x86 it's a bswap.
This is ridiculous really, as PPC can do it natively, though I'm not sure whether only some cores support it. Endianness is a memory issue at its core. If the macros were designed to read/write through a pointer instead, where the real action is, then they would work well on PPC, as they could use the instruction that loads byte-reversed from a pointer.
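To illustrate the pointer-based approach, here is a minimal sketch in C. The names load_le32 and store_le32 are made up for this example and don't come from any existing header. It assumes GCC or Clang, which provide __builtin_bswap32 and the __BYTE_ORDER__ macro. Because the swap is applied directly to a value loaded from (or about to be stored to) memory, GCC for PowerPC has the chance to emit a single byte-reversed load or store (lwbrx/stwbrx) instead of shuffling bytes in registers. Whether it actually does depends on compiler version and flags.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helpers (names invented for this sketch).
 * Read a 32-bit little-endian value through a pointer. */
static inline uint32_t load_le32(const void *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);      /* alignment-safe load from memory */
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
    v = __builtin_bswap32(v);     /* load + swap: candidate for lwbrx on PPC */
#endif
    return v;
}

/* Write a 32-bit value to memory in little-endian byte order. */
static inline void store_le32(void *p, uint32_t v)
{
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
    v = __builtin_bswap32(v);     /* swap + store: candidate for stwbrx on PPC */
#endif
    memcpy(p, &v, sizeof v);
}
```

A driver would call load_le32(&desc->field) on data shared with the card rather than loading the value first and swapping it afterwards. On a little-endian host both helpers reduce to a plain load and store.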
It's a Sam460 limitation. The Sam460 doesn't have cache coherency. I tried doing manual cache flushing everywhere where it's needed, but couldn't get it working. It would always lock up.
That makes me wonder whether Linux can't use HW acceleration with the Sam graphics drivers. The A1/XE had the same limitation, so Linux was slower to use. I actually did enable it to test, and it ran well and was fluid on screen. Unfortunately it only lasted a few moments until the system froze. As I recall, it was said to be related to the ring buffer and issues with the DMA engine.
Also, speaking of the Amithlon compiler, where are the programs compiled for it? I thought Aminet would be full of it, but I can't find them anymore. There's more OS4 software on Aminet!
http://amithlon.aminet.net/tree Only seems to work if you enable architecture filtering and i386-amithlon in setup, but there isn't much anyway.
Edited by joerg on 2023/2/7 16:37:34 Edited by joerg on 2023/2/7 16:40:18
That makes me wonder whether Linux can't use HW acceleration with the Sam graphics drivers. The A1/XE had the same limitation, so Linux was slower to use. I actually did enable it to test, and it ran well and was fluid on screen. Unfortunately it only lasted a few moments until the system froze. As I recall, it was said to be related to the ring buffer and issues with the DMA engine.
The ring buffer is likely the first place where you'd hit problems with lack of cache coherency. However, if you're using an SI card or newer, then the problem could also be that their drivers can't handle big-endian.
IIRC, ACube did have a Linux driver that worked with GART enabled. I don't know for which cards, or how it worked.
I can think of one way to get GART working, and that's to mark all memory used for GART as non-cacheable (or disable the data cache entirely). The L2 cache might need to be disabled too. Needless to say, doing so would come with a serious performance penalty.
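For anyone wondering why the ring buffer is so sensitive to this, here is a toy model in C. It is only an illustration, not real driver code and not a model of the actual 460EX hardware: toy_mem, cpu_write, cpu_flush and device_read are all invented names. The "ram" array stands for what the card sees over the bus, and the "cache" array stands for the CPU's write-back data cache. Without snooping, a command the CPU has written is invisible to the card until the line is flushed (on PPC that would be done with instructions such as dcbf or dcbst).

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define RING_WORDS 8

/* Toy model only: 'ram' is what the device sees, 'cache' is the CPU's
 * private write-back copy. All names here are hypothetical. */
typedef struct {
    uint32_t ram[RING_WORDS];
    uint32_t cache[RING_WORDS];
    bool     dirty[RING_WORDS];
} toy_mem;

static void toy_init(toy_mem *m)
{
    memset(m, 0, sizeof *m);
}

/* CPU store: with a write-back cache the data lands in the cache only. */
static void cpu_write(toy_mem *m, unsigned i, uint32_t v)
{
    m->cache[i] = v;
    m->dirty[i] = true;
}

/* Manual flush: write dirty entries in [first, first + count) back to RAM. */
static void cpu_flush(toy_mem *m, unsigned first, unsigned count)
{
    for (unsigned i = first; i < first + count && i < RING_WORDS; i++) {
        if (m->dirty[i]) {
            m->ram[i] = m->cache[i];
            m->dirty[i] = false;
        }
    }
}

/* Device read via DMA: no snooping, so it sees RAM and never the cache. */
static uint32_t device_read(const toy_mem *m, unsigned i)
{
    return m->ram[i];
}
```

In this model, calling cpu_write(&m, 0, cmd) and then device_read(&m, 0) returns the old contents of RAM. Only after cpu_flush(&m, 0, 1) does the device see the command. With hardware snooping the flush step wouldn't be needed, and with non-cacheable memory cpu_write would go straight to RAM, at the cost of every access being slow.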
Latest post at the bottom of the page mentions the following: Quote:
The L2 cache on the 440GX is cache coherent (via snooping). On the 440SP/440SPe the L2 cache is partially coherent. The LL (Low Latency) PLB segment is coherent and the HB (High Bandwidth) PLB segment is unfortunately not. Here is an excerpt from the 440SPe user's manual:
" Cache coherency is limited to the Low Latency (LL) PLB bus and is managed by a hardware snoop mechanism or software (software that is similar to the existing CPU L1 cache) "
So we will need to add something to handle the L2 cache on those platforms correctly. Not needed on 440GX though.
As for 460EX/GT this is currently not clear yet. I'm working on it with AMCC right now.
In the meantime, everything has become clear for the PPC460EX, because this is what the PPC460EX datasheet says:
Quote:
Use as an L2 cache improves processor performance and reduces the PLB load – Cache coherency maintained by a hardware snoop mechanism on the Low Latency (LL) Processor Local Bus (PLB) or by software
No other mention of coherency. So still no cache coherency for the HB PLB where PCIe and DDR are located.
AFAIK, under AOS4.1 DDR and PCIE are located on the LL PLB segment (and mirrored on the HB) since it uses the settings from U-Boot:
#if defined(CONFIG_440SP) || defined(CONFIG_440SPE) || \
defined(CONFIG_460EX) || defined(CONFIG_460GT) || \
defined(CONFIG_460SX)
/*
* Enable high bandwidth access
* This is currently not used, but with this setup
* it is possible to use it later on in e.g. the Linux
* EMAC driver for performance gain.
*/
mtdcr(SDRAM_PLBADDULL, 0x00000000); /* MQ0_BAUL */
mtdcr(SDRAM_PLBADDUHB, 0x00000008); /* MQ0_BAUH */
Yes, the DDR controller has a LL port as well. Probably to allow low speed transfers between LL peripherals and DDR without slowing down the high speed segment. But according to the datasheet, there's no direct connection between PCIe and the LL segment. So those transfers are not snooped.
I can't find a programming manual for the 460EX, so I can't comment on those U-Boot registers.
However I can turn the question around. If everything is indeed configured to be routed over the LL bus, then why is cache coherency not working for DDR? LL transfers are snooped according to the datasheet.
My guess would be that the PLB can be split into an LL and an HB segment, but that this can also be turned off. In that case there is just a common PLB4, and no snooping at all.
A second guess would be that only transfers to/from LL peripheral address ranges are snooped, and not all traffic that is routed over the LL segment.
@geennaam No idea about GART, but for example SATA, ethernet and USB drivers do work with DMA on SAM440/460 systems, i.e. there is no complete cache coherency failure like it used to be on the A1-SE and partially A1-XE, and Pegasos-1 systems.
DMA works and is not the issue. Cache coherency is the issue.
DMA (without using manual cache flushes/invalidates, which were required for example on the A1-SE to get anything working at all, but which aren't used by DMA drivers on SAM440/460) can only work if there is working cache coherency.
Edit: Maybe there is a PCIe-only problem? SATA, and probably USB and ethernet as well, are PCI, not PCIe controllers.