Also, the libc libraries (newlib/clib) need to be recompiled for P1022 support; we need a new SDK to produce the right binaries for the A1222 without any workaround.
A special SPE compiled version of newlib.library for the Tabor/A1222 exists since version 53.54 (released to beta testers in October 2019).
The exposed ABI of the library's "main" interface is and has to remain that of generic PPC code (e.g. floating point parameters and results are passed in the emulated FPU registers) because otherwise it wouldn't be possible to run existing non-SPE compiled programs on the A1222.
What might however be possible would be to also expose the SPE ABI functions directly through another "main.spe" interface but in order for it to be usable special versions of the startup code and libc will likely also be needed.
The ABI for SPE code generated by gcc is identical to soft-float ABI in that double precision floats are passed as register pairs (r3/r4, r5/r6, r7/r8, r9/r10) even though for the SPE they could be passed in a single 64-bit register.
And please, how to use soft-float C library? Is it something like: "gcc -mcrt=clib2 -msoft-float .... -lm" ?
Yes.
Quote:
And how are floating-point parameters passed when I use "-mcpu=powerpc -msoft-float"? Via GPR registers? They are 32-bit in the PowerPC ABI. Or via the stack?
Same as regular integers: the first parameters go in the 8 parameter registers, and any further parameters are passed on the stack. float = int32 = one 32 bit register, double = int64 = two 32 bit registers.
An SPE C library is required, but as long as there is none, and if for some reason rebuilding clib4 for it doesn't work, the old, already existing soft-float clib2 could be used for now:
- Build everything which doesn't use (much) float/double code with -msoft-float and use the soft-float C library.
- Put code which uses float/double calculations in separate sources compiled with -mabi=spe -mfloat-gprs=double instead.
- Make sure SPE functions called from soft-float code, and the other way round, are compatible, for example by only using pointers to float/double instead of direct float/double parameters. This may not even be required if they are compatible anyway, as salass00 wrote.
@salass00 Quote:
What might however be possible would be to also expose the SPE ABI functions directly through another "main.spe" interface but in order for it to be usable special versions of the startup code and libc will likely also be needed.
Unless the way I implemented the newlib libc.(a|so) stub functions was changed only a new startup code (crtbegin) using interface "spe" instead of "main" should be required.
- Build everything which doesn't use (much) float/double code with -msoft-float and use the soft-float C library.
- Put code which uses float/double calculations in separate sources compiled with -mabi=spe -mfloat-gprs=double instead.
And what if I need to use math library functions (sin, cos, ...)? Do you know which is faster: calling newlib the standard PowerPC way, i.e. going through the LTE emulator, or using clib2 with integer emulation? Of course, I can measure it; I am just asking in case someone already knows.
Quote:
- Make sure SPE functions called from soft-float code, and the other way round, are compatible, for example by only using pointers to float/double instead of direct float/double parameters. May not even be required if they are compatible anyway, as salass00 wrote.
At least printf, fprintf and sin() are not identical (newlib.library 53.84): calling them from SPE code returns nonsense. These are the ones I tested.
AmigaOS3: Amiga 1200 AmigaOS4: Micro A1-C, AmigaOne XE, Pegasos II, Sam440ep, Sam440ep-flex, AmigaOne X1000 MorphOS: Efika 5200b, Pegasos I, Pegasos II, Powerbook, Mac Mini, iMac, Powermac Quad
At least printf, fprintf and sin() are not identical (newlib.library 53.84): calling them from SPE code returns nonsense.
There is no soft-float newlib, the function calls are -mhard-float using the PowerPC ABI with FPU registers, even if the A1222 version is internally using SPE code. You have to use clib2 for now.
Unless the way I implemented the newlib libc.(a|so) stub functions was changed only a new startup code (crtbegin) using interface "spe" instead of "main" should be required.
My first SPE-modified application, stream, is finished!
I need some apps for benchmarking the A1222+, and since almost none exist, I have to do it myself. It is on OS4Depot now. It is only one small, easy piece, but it is also my first C code in 20+ years, so I am happy...
#stream
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
The *best* time for each kernel (excluding the first iteration)
will be used to compute the reported bandwidth.
-------------------------------------------------------------
Your clock granularity/precision appears to be 2 microseconds.
Each test below will take on the order of 371870 microseconds.
(= 185935 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s    Avg time    Min time    Max time
Copy:                760.6    0.214048    0.210373    0.223806
Scale:               328.4    0.496581    0.487278    0.506078
Add:                 429.6    0.564791    0.558662    0.571497
Triad:               429.3    0.568371    0.559030    0.578980
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
#
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
The *best* time for each kernel (excluding the first iteration)
will be used to compute the reported bandwidth.
-------------------------------------------------------------
Your clock granularity/precision appears to be 2 microseconds.
Each test below will take on the order of 51567 microseconds.
(= 25783 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s    Avg time    Min time    Max time
Copy:               2647.9    0.064595    0.060425    0.071738
Scale:              4057.5    0.044220    0.039433    0.047337
Add:                3822.7    0.067003    0.062782    0.075438
Triad:              3823.4    0.069087    0.062771    0.071987
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
With QEMU amigaone I get good results (slower than real G4 but we know QEMU FPU is slow and this combines that with memory access that also gets slower for bigger blocks):
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
The *best* time for each kernel (excluding the first iteration)
will be used to compute the reported bandwidth.
-------------------------------------------------------------
Your clock granularity/precision appears to be 2 microseconds.
Each test below will take on the order of 170922 microseconds.
(= 85461 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s    Avg time    Min time    Max time
Copy:               2540.6    0.068012    0.062977    0.077582
Scale:              1009.5    0.163934    0.158498    0.172685
Add:                1217.1    0.208123    0.197197    0.223799
Triad:               965.6    0.261186    0.248550    0.281111
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
But with QEMU sam460ex something seems to be wrong:
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
The *best* time for each kernel (excluding the first iteration)
will be used to compute the reported bandwidth.
-------------------------------------------------------------
Your clock granularity/precision appears to be 3 microseconds.
Each test below will take on the order of 192721 microseconds.
(= 64240 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s    Avg time    Min time    Max time
Copy:               1672.3    0.098496    0.095678    0.109226
Scale:               771.9    0.212180    0.207291    0.219799
Add:                 968.0    0.253143    0.247930    0.264263
Triad:               760.0    0.326087    0.315804    0.345664
-------------------------------------------------------------
Failed Validation on array a[], AvgRelAbsErr > epsilon (1.000000e-13)
Expected Value: 1.153301e+12, AvgAbsErr: 1.141173e+12, AvgRelAbsErr: 9.894843e-01
For array a[], 9895936 errors were found.
Failed Validation on array b[], AvgRelAbsErr > epsilon (1.000000e-13)
Expected Value: 2.306602e+11, AvgAbsErr: 2.282429e+11, AvgRelAbsErr: 9.895204e-01
AvgRelAbsErr > Epsilon (1.000000e-13)
For array b[], 9895936 errors were found.
Failed Validation on array c[], AvgRelAbsErr > epsilon (1.000000e-13)
Expected Value: 3.075469e+11, AvgAbsErr: 3.043100e+11, AvgRelAbsErr: 9.894753e-01
AvgRelAbsErr > Epsilon (1.000000e-13)
For array c[], 9895936 errors were found.
-------------------------------------------------------------
which is odd, as the FPU emulation is the same, so maybe there are still some memory access issues left. (This is with my current development version, but I get the same result with QEMU 8.0.0, so at least it is not something I broke recently; I don't know yet what causes it.)
Compiling stream.c with gcc 10.2.1 for Linux (with -mcpu=powerpc -O3 -DVERBOSE) and running it on QEMU sam460ex with Linux guest instead of AmigaOS I get:
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
The *best* time for each kernel (excluding the first iteration)
will be used to compute the reported bandwidth.
-------------------------------------------------------------
Your clock granularity/precision appears to be 2 microseconds.
Each test below will take on the order of 208608 microseconds.
(= 104304 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s    Avg time    Min time    Max time
Copy:               2126.6    0.076052    0.075236    0.077315
Scale:               775.6    0.208164    0.206299    0.212734
Add:                 950.8    0.256040    0.252422    0.261350
Triad:               825.2    0.294674    0.290838    0.302348
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
Results Validation Verbose Results:
Expected a(1), b(1), c(1): 1153300781250.000000 230660156250.000000 307546875000.000000
Observed a(1), b(1), c(1): 1153300781250.000000 230660156250.000000 307546875000.000000
Rel Errors on a, b, c: 0.000000e+00 0.000000e+00 0.000000e+00
-------------------------------------------------------------
So the validation error only happens on AmigaOS with the binary from @sailor. Could it be some problem with gcc 6 and -O3? I don't have AmigaOS compiler set up now so can't try it but if somebody can compile it with gcc 10 or without -O3 and verify if that runs correctly on QEMU sam460ex that may help further to locate why this fails. But the same binary runs on real Sam460EX as confirmed above so the problem must be in QEMU but I have no idea how to debug it.
8.System:> Work:Benchmark/stream-5.10-AOS/
8.Work:Benchmark/stream-5.10-AOS> stream_spe
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
The *best* time for each kernel (excluding the first iteration)
will be used to compute the reported bandwidth.
-------------------------------------------------------------
Your clock granularity/precision appears to be 2 microseconds.
Each test below will take on the order of 293711 microseconds.
(= 146855 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s    Avg time    Min time    Max time
Copy:                787.1    0.204503    0.203269    0.208423
Scale:               492.9    0.326322    0.324588    0.329637
Add:                 568.0    0.424966    0.422508    0.427871
Triad:               541.6    0.445014    0.443115    0.449225
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
Standard PowerPC FPU code with the LTE emulator:
8.Work:Benchmark/stream-5.10-AOS> stream
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
The *best* time for each kernel (excluding the first iteration)
will be used to compute the reported bandwidth.
-------------------------------------------------------------
Your clock granularity/precision appears to be 2 microseconds.
Each test below will take on the order of 1032608 microseconds.
(= 516304 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s    Avg time    Min time    Max time
Copy:                788.5    0.204721    0.202919    0.208100
Scale:               148.0    1.081844    1.080804    1.085773
Add:                 154.7    1.554502    1.551267    1.557342
Triad:               148.2    1.622773    1.619742    1.626540
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
LTE FPU emulation is very fast: it reaches more than 25% of the speed of native SPE FPU code. Unfortunately, the majority of 3D games do not work with LTE, so the interpretive emulator must be used, and it is very slow.
Slightly higher results here on my A1222. I am running a beta A1222 Tabor system from 2016. I repeated the test with the emergency USB stick that is delivered with the A1222+, but the results are the same. Maybe you have some stuff running in the background...
stream_spe:
Function    Best Rate MB/s    Avg time    Min time    Max time
Copy:                842.8    0.191521    0.189850    0.196177
Scale:               547.7    0.295012    0.292110    0.304204
Add:                 627.5    0.385231    0.382451    0.388252
Triad:               595.9    0.404831    0.402750    0.410643
stream:
Function    Best Rate MB/s    Avg time    Min time    Max time
Copy:                840.9    0.192648    0.190283    0.196379
Scale:               154.2    1.038695    1.037723    1.040899
Add:                 162.7    1.478401    1.475531    1.481802
Triad:               156.5    1.537082    1.533314    1.538960
Indeed, the LTE emulator is doing a great job. In applications with little FPU code the user will not notice any slowdown. That is the main objective of LTE. But I guess these results are also high because of the fast memory transfer on the A1222.
About the 3D games: the problem is primarily caused by the minigl library. Practically nothing works (not even the demos included with minigl), at least in my testing.
Many system parts of AOS4 are optimized for SPE. Warp3DNova and other code done by Hans is also optimized.
But there are two minigl versions, one with mixed FPU and SPE code and one that is SPE only. They can be downloaded from:
Also other SPE optimized games from Daniel are on this page.
There is also Novabridge, which makes the FPU/SPE minigl library work on Radeon RX cards. Rewarp then makes it possible to run software written for WarpUP. There is a great tutorial by kas1e here:
I tested all this on the A1222 and the results are great in some games and not in others (too slow to play). I even managed to get Quake Darkplaces working at more than 20 fps under LTE (Quake uses a lot of FPU code). Quake Darkplaces must be started with the benchmark switch (otherwise it doesn't work; the game must also be started from a previously saved game position, if you have one from another Amiga system):
Lots of things to investigate and play around with on the A1222. Real SPE-optimized games will work excellently because SPE FPU performance is fast. That is when the FPU is used in games; I guess the real solution is to move all the calculations to the graphics card and not use much of the FPU.
Here I have A1222 with Radeon RX 580. It works great.
Slightly higher results here on my A1222. I am running a beta A1222 Tabor system from 2016. I repeated the test with the emergency USB stick that is delivered with the A1222+, but the results are the same. Maybe you have some stuff running in the background...
Indeed, the LTE emulator is doing a great job. In applications with little FPU code the user will not notice any slowdown. That is the main objective of LTE. But I guess these results are also high because of the fast memory transfer on the A1222.
Yes, it depends a lot on memory speed. This benchmark measures "Sustainable Memory Bandwidth", i.e. memory access plus some light FPU operations on it.
Quote:
I tested all this on the A1222 and the results are great in some games and not in others (too slow to play). I even managed to get Quake Darkplaces working at more than 20 fps under LTE (Quake uses a lot of FPU code). Quake Darkplaces must be started with the benchmark switch (otherwise it doesn't work; the game must also be started from a previously saved game position, if you have one from another Amiga system):
Lots of things to investigate and play around with on the A1222. Real SPE-optimized games will work excellently because SPE FPU performance is fast. That is when the FPU is used in games; I guess the real solution is to move all the calculations to the graphics card and not use much of the FPU.
Here I have A1222 with Radeon RX 580. It works great.
It is great that you found a way to run Darkplaces. Thank you, I will use it for benchmarking too. I have had no success with classic MiniGL Quake: LTE crashes, and the compatible emulator is not usable (about 2 FPS).
Maybe it would be wise to have a web page with a how-to for running non-SPE games on the A1222? Maybe ask Eliyahu to add your info to his page?
I would like to compile FFMPEG for Tabor as SPE Native version directly under AmigaOs4.1. SDK version 54.16 is already installed and I have set the GCC version to 6.4.0. Basically that's all, yes I know that's not much and there will certainly be some things missing, but I have no idea how to proceed and would still like to try.
I navigated to the unpacked archive via the shell and ran ./Configure in the SH shell as a test; it only produced an error message:
"SDK:Local/C/grep:
unknown devices method
Compiler lacks support for C11 static assertions"
Probably everything has to be linked and further configuration information is necessary.
I would be happy if someone could explain to me how to do it correctly, if this is even possible directly under AmigaOs4.1. Unfortunately my skills are limited and I am not a programmer, but I would still like to try.
Mainly it is about FFPlay which I would like to use natively as SPE version, the nonspe version already works well, but has its limits and consumes a lot of CPU power. So it would also be a good test what speed advantage a SPE version could bring.
Any help will be appreciated.
MacStudio ARM M1 Max Qemu//Pegasos2 AmigaOs4.1 FE / AmigaOne x5000/40 AmigaOs4.1 FE
From version 7 onwards, ffmpeg checks for some compiler features that cause problems on AOS4. You have to use a flag to bypass this. Do not try the latest version at first, only ffmpeg 6.
The config.log and its location are listed in the screenshot you sent.
See also the ffmpeg package by @Michael on os4depot. It includes an additional description, ffmpeg-os4-howto.txt, of what to add under newlib to work around some problems. Ideally you should compile this with clib4; some of the problems described are out of date. clib4 is not part of the current SDK.
ffmpeg7/ffplay works acceptably even under QEMU emulation with sm501. Maybe someone will adapt the code for SPE if they want to.
smarkusg wrote: @Maijestro ffmpeg7/ffplay works acceptably even under QEMU emulation with sm501. Maybe someone will adapt the code for SPE if they want to.
Since I see that you were again faster than everyone else and there is already a clib4 version, I would of course be happy to be able to test it under Tabor and compare it with the newlib version of os4depot
If the current SDK does not provide a clib4 function, it makes no sense for me to experiment with it directly under AmigaOs4.1.
For ffmpeg 7, edit the configure file and rename static_assert to _Static_assert
Also do the same in libavcodec/ccaption_dec.c and libavutil/hash.c
If you're using clib4 you don't have to comment out anything that needs fenv.h, but if you're using newlib it's possible to use the fenv.h from clib4 with some minor adjustments.