Last month I tried to recompile the latest MiniGL to optimize it, using a Cygwin-based cross compiler.
1) I fixed all the "deprecated" warnings about MsgPort etc. in the gl & glut sources.
2) I fixed all the "deprecated" warnings about AllocVec in the gl & glut sources. So there are no more "deprecated" warnings on the glut & minigl sources (except one AllocVec for chip memory that can't be removed) ==> clean compilation on gl & glut (I didn't check the glu sources because glu is not so important).
3) I tried to rewrite the "transform" part to use registers ==> but it doesn't speed up Huno's Ioquake.
4) I apply the functions transform, CodePoints (clip test), LightVertices, VerticesToScreen to the whole primitive, not on a per-triangle basis.
5) I rewrote all the "draw a primitive" functions in draw.c to make them buffer the non-clipped triangles (or points or lines), so a (clipped) primitive can be drawn in a single pass ==> but it doesn't speed up Huno's Ioquake.
6) I test whether a primitive is not clipped (fully on screen) and then draw it immediately (see the sketch after this list) ==> but it doesn't speed up Huno's Ioquake.
7) I also tried to fully remove the "lighting" (i.e. the vertex is simply colored white) ==> it almost doesn't speed up Huno's Ioquake ===> so the "lighting" has almost no influence on this program.
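For point 6, here is a rough sketch of the idea, assuming clip-space coordinates; the names are illustrative and not MiniGL's actual code:

```c
/* Each vertex gets an outcode with one bit per clip plane; the OR/AND of the
   outcodes then classifies the whole primitive. Illustrative sketch only. */
typedef unsigned long ClipCode;

static ClipCode CodePoint(float x, float y, float z, float w)
{
    ClipCode code = 0;
    if (x < -w) code |= 0x01; if (x > w) code |= 0x02;  /* left / right */
    if (y < -w) code |= 0x04; if (y > w) code |= 0x08;  /* bottom / top */
    if (z < -w) code |= 0x10; if (z > w) code |= 0x20;  /* near / far   */
    return code;
}

/* or_codes == 0  : every vertex is inside  -> draw the primitive immediately
   and_codes != 0 : all vertices are outside the same plane -> reject it
   otherwise      : fall back to per-triangle clipping                      */
```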
All these changes have introduced new errors: Glexcess now crashes on line drawing. It's certainly an easy bug, but I am too tired to look into it. Fortunately Ioquake, which serves as my test, still runs perfectly, but at almost the same speed (even a bit slower).
Now I am very tired: those who say "MiniGL can easily be made faster" are wrong. I have spent a lot of time on this for zero progress.
Quote:
2) I fixed all the "deprecated" warnings about AllocVec in the gl & glut sources. So there are no more "deprecated" warnings on the glut & minigl sources (except one AllocVec for chip memory that can't be removed).
Are you sure you can't remove it? CHIP/FAST RAM is just emulated anyway, and it's normally limited to about 50 MB or less, so you run out of memory quickly. Not a good idea: CHIP has nothing to do with video memory.
You should be using MEMF_Private if possible; if not, use MEMF_Shared. You should possibly also ask for aligned memory, and maybe non-swappable memory, but AmigaOS shouldn't swap memory unless absolutely necessary, so you shouldn't need to think about it unless you're writing a DMA or interrupt routine.
You should be using AllocVecTags(), not AllocVec().
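A minimal sketch of what that replacement might look like on OS4; AVT_Type and AVT_Alignment are the tag names as I remember the exec interface, so double-check them against the SDK headers:

```c
#include <proto/exec.h>

/* Sketch: replacing a deprecated AllocVec() call with AllocVecTags(). */
static APTR AllocWorkBuffer(uint32 size)
{
    return IExec->AllocVecTags(size,
        AVT_Type,      MEMF_SHARED,  /* or MEMF_PRIVATE if no other task needs it */
        AVT_Alignment, 16,           /* optional: request aligned memory          */
        TAG_DONE);
}
/* Free with IExec->FreeVec() as before. */
```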
If you're worried about backwards compatibility, you can always wrap the code in a macro, or use preprocessor directives: #if #else #elif #endif.
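For example, something along these lines; __amigaos4__ is whatever macro your cross compiler predefines for an OS4 target, so adjust to your toolchain:

```c
#include <proto/exec.h>

/* Hypothetical compile-time switch between the old and new allocator. */
static APTR PortableAlloc(ULONG size)
{
#ifdef __amigaos4__
    return IExec->AllocVecTags(size, AVT_Type, MEMF_SHARED, TAG_DONE);
#else
    return AllocVec(size, MEMF_PUBLIC);
#endif
}
```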
(NutsAboutAmiga)
Basilisk II for AmigaOS4 AmigaInputAnywhere Excalibur and other tools and apps.
@thellier It's not easy to improve things, because the feature set of Warp3D is stuck in the late 90s. One tip I got from Hans was accumulating the clipped primitives into the vertex buffer, instead of drawing them one by one. I started implementing this for the vertex array, but couldn't finish it yet due to other commitments.
This is just like television, only you can see much further.
@BSzili
>accumulating the clipped primitives into the vertex buffer, instead of drawing them one by one
This is what I did too: it doesn't seem to have improved the speed (at least on Ioquake...).
When the "glexcess lines bug" will be fixed (when I will have some courage) then I will release this sources+binary
If you're using Ioquake as benchmark for any improvements, then it's important to note that the Quake 3 engine almost exclusively renders GL_TRIANGLES in compiled vertex arrays via glDrawElements(). So it uses the GLDrawElementsTriangles() function in vertexarray.c. This bypasses some of the main transformation code (calling v_MaybeTransform() instead).
The bottom line is that only code in or called by GLDrawElementsTriangles() is likely to have any impact on the performance of Ioquake (or any Quake III port).
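Roughly, the engine's draw path looks like the sketch below (illustrative names, not the engine's exact code), which is why everything ends up in GLDrawElementsTriangles():

```c
#include <GL/gl.h>

/* Quake 3 style: compiled vertex arrays + indexed GL_TRIANGLES.
   Assumes the vertex/texcoord client states are already enabled. */
static void DrawSurfaceSketch(const float *xyz, const float *st,
                              const unsigned int *indexes,
                              int numVertexes, int numIndexes)
{
    glVertexPointer(3, GL_FLOAT, 16, xyz);       /* positions, vec4 stride        */
    glTexCoordPointer(2, GL_FLOAT, 0, st);
    glLockArraysEXT(0, numVertexes);             /* compiled vertex array         */
    glDrawElements(GL_TRIANGLES, numIndexes,
                   GL_UNSIGNED_INT, indexes);    /* -> GLDrawElementsTriangles()  */
    glUnlockArraysEXT();
}
```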
Quote:
Now I am very tired: those who say "MiniGL can easily be made faster" are wrong. I have spent a lot of time on this for zero progress.
Thanks for trying anyway, and posting the details of your efforts. That way anyone else who wants to look at it knows what's already been tried.
Personally, I think that it would take a rewrite of MiniGL's rendering pipeline to boost performance, and that's definitely not easy. My thoughts are that the pipeline should write the transformed and clipped vertex data to a compact buffer, and deliver them to Warp3D in large blocks. This would do two things:
- Rendering whole vertex arrays in one go is more efficient than rendering one triangle at a time (with modern hardware)
- Having a dedicated and compact output buffer for the transformed vertices would make more efficient use of CPU caches, and would make it easier to avoid unnecessary copies.
I have no idea how much slowdown we're getting from cache misses, but MGLVertex (which stores both input and output vertex data) is very fat, and it doesn't take many vertices to exceed the CPU's L1 cache size.
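To make the idea concrete, a hypothetical sketch; the field names and the draw call are illustrative, not MiniGL's actual code:

```c
#include <Warp3D/Warp3D.h>

/* Compact output vertex: only the post-transform data the driver needs,
   so many more vertices fit in L1 than with the fat MGLVertex. */
typedef struct {
    float x, y, z, w;        /* transformed position */
    float u, v;              /* texture coordinates  */
    unsigned long color;     /* packed RGBA          */
} OutVertex;

#define OUT_CAPACITY 4096
static OutVertex out_buf[OUT_CAPACITY];
static unsigned long out_count = 0;

/* Accumulate triangles here, then hand the whole block to Warp3D in one
   call instead of one W3D_DrawTriangle() per triangle. */
static void FlushOutput(W3D_Context *ctx)
{
    if (out_count > 0) {
        /* e.g. W3D_DrawArray(ctx, W3D_PRIM_TRIANGLES, 0, out_count); */
        out_count = 0;
    }
}
```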
Needless to say, such a large pipeline rewrite would be a lot of work, with no guarantee that we'd get much of a performance boost.
Speeding up MiniGL was always going to be a "simple" task for someone as long as it was just talk.
The reality is that it's not so easy. I wrote some basic built-in profiling to try to identify slow or often-called code and performed some optimisations based on that. But the problems are mostly not going to be solved that way. Your optimisations will probably help the slowest machines; on faster ones, other factors become more important. Cache utilisation is a much bigger issue there, I think. It is interesting to note that a lot of older MiniGL stuff is faster despite using theoretically* slower V3 Warp3D calls. This is probably due to having more compact MiniGL vertex structures back then (supporting fewer features), which leads to better cache usage.
*In practice, the legacy W3D_Vertex format, even if it is a bit silly, is simpler and easier to write optimised driver code for than split/interleaved pointers.
I'll try to split MGLVertex into a rendering part and a management part. The structure used for rendering will only have a pointer to the management structure. This could minimize the changes required; most of them could be done with mass-replace.
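Something like this, perhaps (hypothetical names, just to illustrate the split):

```c
/* The hot rendering part stays small and cache-friendly, and reaches the
   rarely touched state through a pointer. Not MiniGL's actual types. */
typedef struct MGLVertexState MGLVertexState;   /* fat management/input data */

typedef struct {
    float x, y, z, w;            /* transformed position            */
    float u, v;                  /* texcoords                       */
    unsigned long color;         /* packed RGBA                     */
    MGLVertexState *state;       /* everything else lives elsewhere */
} MGLRenderVertex;
```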
This is just like television, only you can see much further.
OK, my sources + binaries are here: http://thellier.free.fr/src-02-jul-2015.zip
No changes in src/glu. 23 files were modified in src/, mostly just to fix warnings; in fact only hclip.c, light.c, texture.c and especially draw.c were truly modified. include/mgl/context.h was modified too.
The bug in draw.c/DrawStoredLines() is still there: I give up.
I will try to find some time this weekend to merge your changes back into the svn repository. I will start with just the compiler warning fixes for now and examine what else can be incorporated without pulling in any new bugs such as the line draw you mention.
Don't be too disheartened at the lack of apparent performance improvements at this stage. You have eliminated one of several potential areas and it might be that your changes further increase performance after other, more significant bottlenecks are eliminated.
What is really needed is the ability to run parts of this code through a tool like cachegrind. I've done it for whole binaries on Linux but not sure what we can do here.
I tested your new MiniGL on my system, but in IoQuake3 (Huno version) I got a downgrade! Not better at all, but worse. I've switched back to the official 2.20 for now.
Here's the difference; not much in comparison, but for sure not better.
Official MiniGL 2.20, HunoPPC R3 version of IoQuake3 1.36 (Sam440ep Flex 800 MHz + ATI Radeon 9250, 128 MB, 64-bit), 640x480 - 21,6 FPS
Your version, HunoPPC R3 version of IoQuake3 1.36 (Sam440ep Flex 800 MHz + ATI Radeon 9250, 128 MB, 64-bit), 640x480 - 21,0 FPS
Quote:
What is really needed is the ability to run parts of this code through a tool like cachegrind. I've done it for whole binaries on Linux but not sure what we can do here.
I've never used cachegrind myself, but having some hard data on what's going on would be very useful.
The biggest difficulty would be in porting valgrind to AmigaOS. Sure, the source code is available, but something that works at such a low level is virtually guaranteed to be harder to port than the average *nix application.
Even without cachegrind we aren't helpless. One way to test the fat vertex hypothesis would be to write synthetic tests for Warp3D directly:
1) Test W3D_DrawArray/Elements with a triangle list using a compact, minimal vertex structure.
2) Run the same tests again using a fat vertex structure (same size as MGL_Vertex) in which the W3D vertex structure is embedded. Initialise the extra space with any old crap to ensure the cache isn't hot for just the parts we care about.
Time both tests carefully for different sized vertex arrays.
I wouldn't be surprised if, once the vertex array exceeds some CPU-dependent magic number, the performance drops suddenly, especially for DrawElements, which generally makes random accesses into the vertex array. This would correspond to the point at which you keep having to refetch the data from RAM because you can't keep enough vertices in your cache.
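A minimal sketch of such a test, assuming the Warp3D draw calls get wrapped in a callback and plain clock() timing is good enough; the struct layouts and sizes are invented for illustration:

```c
#include <time.h>

/* Test 1: compact, minimal vertex - only what the driver needs. */
typedef struct { float x, y, z, w, u, v; unsigned long color; } CompactVertex;

/* Test 2: fat vertex of roughly MGL_Vertex's size, with the same payload
   embedded and the padding filled with junk so the cache isn't hot for
   just the parts we care about. */
typedef struct { unsigned char pad0[64]; CompactVertex v; unsigned char pad1[96]; } FatVertex;

/* Time a draw callback (wrapping W3D_DrawArray()/W3D_DrawElements()) for a
   given vertex count, repeated enough times to get a stable number. */
static double TimeDraw(void (*draw)(unsigned long count), unsigned long count, int reps)
{
    clock_t start = clock();
    for (int i = 0; i < reps; i++)
        draw(count);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Sweep the vertex count upwards and compare the two layouts; the interesting
   point is where the fat layout suddenly falls behind (cache exhausted). */
```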
I hadn't thought of testing it that way. Yes, that would give us some usable results. It might be challenging to create a test where the vertex array that's sent to the driver has zero gaps. The reason that I'm interested in that case is because CPUs tend to be very good at prefetching data into caches with consecutive accesses. Hence, any gaps in the data are likely to screw up that process and cause delays (via unnecessary reads and cache misses). That's in theory; I have no idea how much of an effect this actually has.
Quote:
I wouldn't be surprised if, once the vertex array exceeds some CPU-dependent magic number, the performance drops suddenly, especially for DrawElements, which generally makes random accesses into the vertex array.
The accesses don't have to be totally random. For example, this document about optimizing drivers for Quake 3 states that the vertices in the GL_TRIANGLES array are in tri-strip order. So, in that case the vertex accesses are always within a small window.