> Is there any part that can be optimized by using AltiVec?
Not without introducing extra complexity and branches in the code to cater for machines that aren't AltiVec-enabled. There are some functions that have AltiVec alternatives implemented, but the vast majority of the code does not.
> Are there any complex switches that can be replaced with table lookups?
On faster CPUs (covering multiple architectures), I have observed a trend that switch case is almost always faster than table lookups. The compiler is free to convert any switch case into one or more jump tables anyway.
> Are there any loops that can be unrolled?
> (Other micro optimisations)
The compiler does this already.
> Are there any malloc()/free() calls that are made too often? Maybe there is a way to work around it.
I seem to recall there was some MGLPolygon allocation going on in the past, but I replaced all that ages ago.
> Is data being needlessly copied as parameters when it could be global? Parameter passing does generate extra store operations.
That's not really a suitable approach for shared libraries - writeable global data isn't thread-safe. I had to fix some issues caused by that very problem in the past. All the common stuff is stored in one or more structures that are passed by reference.
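Something like this pattern, sketched with hypothetical names (GLContext and mglSetCullMode are illustrative, not the actual MiniGL types):

    /* All mutable state lives in a context structure owned by the caller,
       so two threads with two contexts never touch shared writable data. */
    typedef struct {
        float modelview[16];   /* per-context transform state */
        int   cullMode;        /* per-context render state */
    } GLContext;

    /* State is passed by reference; nothing writable is global. */
    void mglSetCullMode(GLContext *ctx, int mode)
    {
        ctx->cullMode = mode;
    }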
The fat MGLVertex was a promising lead, but it seems that it's not necessarily the major limiting factor here (it should still be trimmed however).
The PPC 440 core is claimed to be able to track up to four outstanding load misses ("up to three outstanding line fills, up to four outstanding load misses"). Consider a vertex data structure of 96 bytes and a vertex processing loop (in pseudo-code):

        dcbt    currentVertexStructure
        dcbt    currentVertexStructure+32
        dcbt    currentVertexStructure+64
    loop:
        dcbt    nextVertexStructure
        dcbt    nextVertexStructure+32
        dcbt    nextVertexStructure+64
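Roughly the same pattern in C, using GCC's __builtin_prefetch (which maps to dcbt on PowerPC); the MGLVertex layout and process_vertex body here are illustrative stand-ins:

    typedef struct { float data[24]; } MGLVertex;  /* 96 bytes = 3 cache lines */

    static void process_vertex(MGLVertex *v) { v->data[0] += 1.0f; } /* stand-in */

    void transform_vertices(MGLVertex *v, int count)
    {
        for (int i = 0; i < count; i++) {
            /* Prefetch the next vertex's three 32-byte lines; dcbt never
               faults, so touching one element past the end is harmless. */
            const char *next = (const char *)&v[i + 1];
            __builtin_prefetch(next);
            __builtin_prefetch(next + 32);
            __builtin_prefetch(next + 64);
            process_vertex(&v[i]);       /* work overlaps the line fills */
        }
    }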
Quote:
Not without introducing extra complexity and branches in the code to cater for machines that aren't AltiVec-enabled. There are some functions that have AltiVec alternatives implemented, but the vast majority of the code does not.
Just compile two versions of the library, use preprocessor directives, or overload the interface table, or something like that. (That's sort of what FFmpeg does.)
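For instance, a hedged sketch of that dispatch idea (the names are made up; the real AltiVec variant would live in a unit compiled with -maltivec):

    typedef void (*scale_fn)(float *dst, const float *src, int n);

    static void scale_scalar(float *dst, const float *src, int n)
    {
        for (int i = 0; i < n; i++)
            dst[i] = src[i] * 2.0f;         /* stand-in for the real work */
    }

    #ifdef __ALTIVEC__
    static void scale_altivec(float *dst, const float *src, int n)
    {
        /* a real version would use vec_ld/vec_madd/vec_st here */
        scale_scalar(dst, src, n);
    }
    #endif

    static scale_fn scale = scale_scalar;   /* safe default */

    void init_dispatch(int cpu_has_altivec) /* flag queried from the OS once */
    {
    #ifdef __ALTIVEC__
        if (cpu_has_altivec)
            scale = scale_altivec;
    #endif
    }

The hot path then calls scale() through the pointer, with no per-call branch.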
Quote:
On faster CPUs (covering multiple architectures), I have observed a trend that switch case is almost always faster than table lookups. The compiler is free to convert any switch case into one or more jump tables anyway.
In some cases, GCC might automatically optimize your code to use table lookups, but you do not have control over what is checked first; GCC might decide to check the default case first, even if it is the least used case.
But sometimes GCC will not be able to do that for you, because the case numbers are not sequential.
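A small illustration of both situations (the opcode values are made up):

    /* Dense case values 0..3: GCC will usually turn this into a single
       indexed jump table on its own. */
    int handle_dense(int op)
    {
        switch (op) {
            case 0: return 10;
            case 1: return 20;
            case 2: return 30;
            case 3: return 40;
            default: return -1;
        }
    }

    /* Sparse case values defeat that optimisation: the compiler emits a
       compare chain instead, tested in an order you do not control. */
    int handle_sparse(int op)
    {
        switch (op) {
            case 0x10:  return 10;
            case 0x80:  return 20;
            case 0x200: return 30;
            default:    return -1;
        }
    }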
Quote:
> Are there any loops that can be unrolled?
> (Other micro optimisations)
The compiler does this already.
True, but it can be worth disassembling the code and checking; GCC does not always do what you expect. What is generated is not always the most efficient code: often values are punched back to RAM and pulled from RAM again when they could have been kept in a register.
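A classic case you see when reading the disassembly; copying the value into a local lets GCC keep it in a register:

    /* Through the pointer, the compiler must assume *len could alias dst,
       so it reloads *len from RAM on every iteration. */
    void copy_slow(char *dst, const char *src, const int *len)
    {
        for (int i = 0; i < *len; i++)
            dst[i] = src[i];
    }

    /* Read it once into a local; it now lives in a register. */
    void copy_fast(char *dst, const char *src, const int *len)
    {
        int n = *len;
        for (int i = 0; i < n; i++)
            dst[i] = src[i];
    }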
Quote:
> Are there any malloc()/free() calls that are made too often? Maybe there is a way to work around it.
I seem to recall there was some MGLPolygon allocation going on in the past, but I replaced all that ages ago.
In any case, storing things on the stack is faster than in heap memory if you're only going to keep them for a short while.
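For example (sizes and names are illustrative):

    #include <stdlib.h>

    /* Heap version: every call pays malloc bookkeeping plus free. */
    void scratch_heap(int n)
    {
        float *tmp = malloc(n * sizeof *tmp);
        if (!tmp) return;
        tmp[0] = 1.0f;                 /* ... short-lived work ... */
        free(tmp);
    }

    /* Stack version: allocation is a stack-pointer bump, freed on return. */
    void scratch_stack(void)
    {
        float tmp[64];                 /* needs a known upper bound */
        tmp[0] = 1.0f;                 /* ... short-lived work ... */
    }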
So I've implemented the "last used index per vertex" check in the Permedia2 implementation of W3D_DrawElements().
It certainly isn't any slower, but I need to write some synthetic tests on vertex-sharing indexed triangle lists to see the effect on performance. It should be reasonable as the Permedia2 driver is generally up against the bus limit.
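For the curious, the idea is roughly this (a hedged sketch with illustrative types, not the actual driver code):

    typedef struct { float x, y, z; }    Vertex;         /* illustrative */
    typedef struct { float x, y, z, w; } XformedVertex;  /* illustrative */

    static void transform_vertex(XformedVertex *out, const Vertex *in)
    { out->x = in->x; out->y = in->y; out->z = in->z; out->w = 1.0f; }

    static void emit_vertex(const XformedVertex *v) { (void)v; /* to the chip */ }

    /* 'seen' is zeroed before each draw call; shared indices are only
       transformed the first time they appear in the list. */
    void draw_elements(const Vertex *in, const unsigned short *idx, int count,
                       XformedVertex *cache, unsigned char *seen)
    {
        for (int i = 0; i < count; i++) {
            unsigned v = idx[i];
            if (!seen[v]) {
                transform_vertex(&cache[v], &in[v]);
                seen[v] = 1;
            }
            emit_vertex(&cache[v]);
        }
    }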
I'm sorry to come late to the discussion, which I've read since it started. It seems that MiniGL can receive improvements, and it's great to see that some have already begun.
You talked about Valgrind, and I confirm that such a tool can't really be ported to AmigaOS. But tools like that are surely necessary.
I launched Quake3 on my MicroAOne (that is at the maximum of its capabilities) and using my profiler Hieronymus, I've found that much time is spent in W3D_Radeon.library. But maybe Quake3 is not the best example.
Note that I recently added an alternative mode in my profiler that allows profiling on L2 cache misses (at least on my G3 CPU at the moment). That will be interesting too.
I will have to test Cow3D and compile MiniGL with debug symbols to confirm some results.
How does the instrumentation in Hieronymus work, exactly?
Warp3D drivers and MiniGL can be compiled with basic inbuilt profiling that I added to help me diagnose some performance issues, but it isn't a true hierarchical profiler, as only registered functions are timed.
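In outline it works something like this (illustrative code, not the actual API):

    #include <stdio.h>
    #include <time.h>

    #define MAX_SLOTS 32

    static struct { const char *name; double total, start; } prof[MAX_SLOTS];

    static double now(void) { return (double)clock() / CLOCKS_PER_SEC; }

    void prof_enter(int slot, const char *name)
    {
        prof[slot].name  = name;
        prof[slot].start = now();
    }

    void prof_leave(int slot)
    {
        prof[slot].total += now() - prof[slot].start;
    }

    void prof_dump(void)
    {
        for (int i = 0; i < MAX_SLOTS; i++)
            if (prof[i].name)
                printf("%-24s %.3fs\n", prof[i].name, prof[i].total);
    }

A registered function brackets its body with prof_enter()/prof_leave(); anything not wrapped is invisible, which is why it isn't a true hierarchical profiler.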
Quote:
I launched Quake3 on my MicroAOne (that is at the maximum of its capabilities) and using my profiler Hieronymus, I've found that much time is spent in W3D_Radeon.library. But maybe Quake3 is not the best example.
How much compared to the time spent in MiniGL?
Please be aware that MiniGL will drop down to sending one triangle/primitive at a time under certain conditions, and that could increase the amount of time spent in the driver by a fair amount. In Quake 3's case, its engine uses compiled vertex arrays, and MiniGL will drop down to sending a single triangle to Warp3D at a time if even one triangle in the array is clipped. This happens very frequently; so frequently that an early version of the W3D_SI driver managed only a few fps with Open Arena (which uses the Q3 engine).
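Schematically, the fallback looks like this (illustrative names, not MiniGL's actual internals):

    typedef struct { const unsigned char *clipFlags; } DrawCtx; /* illustrative */

    static void draw_whole_batch(DrawCtx *c, int n)            { (void)c; (void)n; }
    static void draw_one_triangle_at_a_time(DrawCtx *c, int n) { (void)c; (void)n; }

    static int any_clipped(const unsigned char *flags, int n)
    {
        for (int i = 0; i < n; i++)
            if (flags[i]) return 1;
        return 0;
    }

    void draw_array(DrawCtx *ctx, int nVerts)
    {
        if (any_clipped(ctx->clipFlags, nVerts))
            draw_one_triangle_at_a_time(ctx, nVerts);  /* frequent, slow path */
        else
            draw_whole_batch(ctx, nVerts);             /* one driver call */
    }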
@Karlos Hieronymus is a statistical profiler. It collects samples (let's say 50 or 60 times per second) that record the address of the instruction being executed. Then it finds the corresponding program and function. Statistically, that gives the proportions of time consumed by the different running applications.
So it is not intrusive, and it gives a great view of the system activity. And when you run a program, you also see the percentage of time spent in the libraries it uses.
The idea is the same as the tool "perf" that comes with Linux.
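For illustration, a minimal POSIX sketch of the sampling idea (AmigaOS would hook a timer interrupt instead, and a real handler records the interrupted program counter rather than just counting):

    #include <signal.h>
    #include <sys/time.h>

    static volatile unsigned long samples;

    static void on_sample(int sig)
    {
        (void)sig;
        samples++;    /* a real profiler records the interrupted PC here */
    }

    int start_sampler(void)
    {
        struct itimerval tv = { { 0, 20000 }, { 0, 20000 } };  /* 50 Hz */
        signal(SIGPROF, on_sample);
        return setitimer(ITIMER_PROF, &tv, 0);                 /* 0 on success */
    }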
@Hans Thanks for the information. About the results, I was not very comfortable giving them now, so take them as early, live results, not confirmed yet. On CPU time:
59% in W3D_Radeon.library
13% in ATIRadeon.chip
11% in Quake3
1% in minigl.library
Note that these figures come from the alternative sampling mode I've just developed (using the performance monitor), so I would like to compare them with the "standard" mode. I obtained them very late yesterday and was too tired to make other runs or check with other programs.
Interesting. However, is such a method not prone to sample aliasing problems? How can you differentiate between code that spends 1% of its time in a function called at (any multiple of) the sampling frequency that you just happen to be in when you measure, and code that spends 99% of its time elsewhere?
@Karlos The use case you describe can theoretically happen; nothing is impossible. But statistically, there is no workload like that. If a program consumes 1% of the CPU time, then over 10 seconds of sampling at 50 Hz (500 samples) you will meet it more or less 5 times.
That would seem to depend on the sample rate. A simpler example: code that alternately spends 20 ms in function A and 20 ms in function B for some compute-bound period of time. If you sample at 50 Hz, you are far more likely to see it spend 100% of its time in A or in B than any other distribution of the two; which one would depend purely on the relative latency of the monitor versus the execution of the code. It wouldn't matter how long you profiled for; in the absence of any other factors, you'd get an all-A or all-B result, unless you managed to start sampling 20 ms in and caught the transition point.
Admittedly a contrived example, but I guess there's no perfect way to profile. You either do it non-intrusively and get approximate or potentially biased results, or you add instrumentation and accept that the changes you make to the running code affect the results.
I suppose only a cycle-exact CPU simulator that gathers statistics on cache misses etc. as it executes would give a true reflection of how code should perform.
@Karlos Right, a simulator could be more accurate. A hardware trace system would be even better. But for now, I think a statistical profiler is useful and could show some surprises. About the sampling frequency, you're right; this is why Brendan Gregg (a master of system performance) uses 99 Hz.
Please do not use the minigl.library that I have built: it is not faster with Quake (even slower), and it has some bugs that I introduced with the modifications I made (lines and other stuff in GLExcess...).
My sources and binary are only given as an example of "a modification that may have accelerated minigl but didn't".
Using Wazp3D I get a GrimReaper crashlog after quitting Cow3D; is there anything you can fix, or is there some Wazp3D setting I'm missing (or setting wrong)?
I updated the Cow3D source code to use AOS4.1FE's gfx_lib, but the original Cow3D crashes here too when quitting. My system: Sam460ex/2GB/RadeonHD 6570 1GB and RadeonHD.chip V2.10.
Crash log for task "Cow3D-AmigaOS4"
Generated by GrimReaper 53.19
Crash occured in module Warp3D.library at address 0x7CB236DC
Type of crash: DSI (Data Storage Interrupt) exception
Alert number: 0x80000003
Hacked the vertex array code a bit; now it draws all the triangles with one call (when using compiled vertex arrays à la Q3). On my Sam440 there seems to be about a 1% FPS boost, so not much to write home about, but at least the direction is correct.
It would be interesting to hear how faster systems perform, though.