I have just been playing with the auto-vectorization feature of gcc.
Here is the result of the standard powerpc stream:
Work:Benchmark/stream-5.10-AOS/stream
...
-------------------------------------------------------------
Function Best Rate MB/s Avg time Min time Max time
Copy: 2666.8 0.060642 0.059998 0.064029
Scale: 4103.0 0.039175 0.038996 0.039543
Add: 3870.2 0.065242 0.062012 0.083547
Triad: 3901.1 0.061864 0.061521 0.062810
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
Then I compiled my stream.c code with:
gcc -DSTREAM_TYPE=float -DTUNED -mcpu=G4 -maltivec -mabi=altivec -O3 -ftree-vectorize -fopt-info-vec-optimized stream.c -o stream-float-tuned-altivec-g4
Result of stream-float-tuned-altivec-g4:
Work:Benchmark/stream-5.10-AOS/stream-float-tuned-altivec-g4
...
-------------------------------------------------------------
Function Best Rate MB/s Avg time Min time Max time
Copy: 2642.3 0.031379 0.030277 0.039397
Scale: 4866.5 0.017653 0.016439 0.024704
Add: 5440.7 0.022771 0.022056 0.025303
Triad: 5414.9 0.022900 0.022161 0.028060
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-06 on all three arrays
-------------------------------------------------------------
The result (also confirmed by -fopt-info-vec-optimized) is that the Scale, Add and Triad functions are optimized and use AltiVec, while Copy remains unoptimized.
The source code of the functions is here (#pragma omp simd is irrelevant in this example):
#ifdef TUNED
/* stubs for "tuned" versions of the kernels */
/* --- Modified by sailor -------------------*/
void tuned_STREAM_Copy()
{
    ssize_t j;
#pragma omp simd
    for (j=0; j<STREAM_ARRAY_SIZE; j++)
        c[j] = a[j];
}

void tuned_STREAM_Scale(STREAM_TYPE scalar)
{
    ssize_t j;
#pragma omp simd
    for (j=0; j<STREAM_ARRAY_SIZE; j++)
        b[j] = scalar*c[j];
}

void tuned_STREAM_Add()
{
    ssize_t j;
#pragma omp simd
    for (j=0; j<STREAM_ARRAY_SIZE; j++)
        c[j] = a[j]+b[j];
}

void tuned_STREAM_Triad(STREAM_TYPE scalar)
{
    ssize_t j;
#pragma omp simd
    for (j=0; j<STREAM_ARRAY_SIZE; j++)
        a[j] = b[j]+scalar*c[j];
}
/* end of stubs for the "tuned" versions of the kernels */
#endif
Does anybody know why tuned_STREAM_Copy() was not optimized?
Of course, I could change the function to something like c[j] = one*a[j]; as a workaround.
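A minimal sketch of that workaround, for illustration only. It assumes `one` is passed in at run time (the same way `scalar` is passed to Scale), because a literal 1.0 would most likely be folded away by the compiler and leave a plain copy again. The names tuned_STREAM_Copy_mul, check_copy_mul and SKETCH_ARRAY_SIZE are made up for this example and are not part of stream.c, and I have not confirmed that gcc actually vectorizes this variant.

```c
#include <sys/types.h>   /* ssize_t, as used in the kernels above */

#ifndef STREAM_TYPE
#define STREAM_TYPE float
#endif

/* Hypothetical small size for the sketch; stream.c uses STREAM_ARRAY_SIZE. */
#define SKETCH_ARRAY_SIZE 1024

static STREAM_TYPE a[SKETCH_ARRAY_SIZE], c[SKETCH_ARRAY_SIZE];

/* Copy written as a multiply by a run-time value, so the loop has the
   same shape as the Scale kernel. The caller is expected to pass 1.0. */
void tuned_STREAM_Copy_mul(STREAM_TYPE one)
{
    ssize_t j;
    for (j = 0; j < SKETCH_ARRAY_SIZE; j++)
        c[j] = one * a[j];
}

/* Fill a[], run the kernel with one = 1.0, and count the elements of
   c[] that differ from a[]. Returns 0 when c[] is an exact copy. */
int check_copy_mul(void)
{
    ssize_t j;
    int mismatches = 0;

    for (j = 0; j < SKETCH_ARRAY_SIZE; j++) {
        a[j] = (STREAM_TYPE)j;
        c[j] = 0;
    }

    tuned_STREAM_Copy_mul((STREAM_TYPE)1.0);

    for (j = 0; j < SKETCH_ARRAY_SIZE; j++)
        if (c[j] != a[j])
            mismatches++;

    return mismatches;
}
```

Multiplying by exactly 1.0 does not change the values, so the result is still a copy, but the kernel now does one multiply per element and is no longer a pure Copy in the STREAM sense.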
P.S. All with gcc 11.2.0, compiled natively on A1222+, tested on X1000.