Re: My MiniGL experiments,recompilation,tips,etc...
Not too shy to talk
MGL_Vertex uses 224 bytes.

Cow3D on the Sam440 reaches 38 fps, or 41 fps with buffered animation.
Cow3D has 5813 triangles, so 5813*3 vertices to draw 41 times per second, i.e. about 715,000 vertices/second.
We can take that as the upper speed limit for the Sam440, as Cow3D is a very simple program.

So doing the same with the big MGL_Vertex structure would need about 160 MB/sec.
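For anyone who wants to redo this estimate with other numbers, here is a minimal C sketch of the same arithmetic. The function name vertex_bandwidth is made up for illustration, and it assumes every vertex is read exactly once per frame with no sharing between triangles.

```c
/*
 * Bytes per second needed to stream every vertex of a mesh once per frame.
 * Assumes three unshared vertices per triangle (illustrative helper only).
 */
static unsigned long long vertex_bandwidth(unsigned long long triangles,
                                           unsigned long long fps,
                                           unsigned long long vertex_bytes)
{
    unsigned long long vertices_per_second = triangles * 3ULL * fps;
    return vertices_per_second * vertex_bytes;
}
```

Calling vertex_bandwidth(5813, 41, 224) gives 160,159,776 bytes per second, which is where the roughly 160 MB/sec figure comes from.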

Alain



Re: My MiniGL experiments,recompilation,tips,etc...
Quite a regular
I did a quick test: adding 256 bytes of padding to the point3D structure in Cow3D dropped my framerate from 58 to 40.

This is just like television, only you can see much further.
Re: My MiniGL experiments,recompilation,tips,etc...
Just popping in
@Hans
Quote:
The accesses don't have to be totally random. For example, this document about optimizing drivers for Quake 3 states that the vertices in the GL_TRIANGLES array are in tri-strip order. So, in that case the vertex accesses are always within a small window.


Well, random was meant to be the pathological worst case. However, the above also sounds bad, unless I have misunderstood this: Why render triangle strips out of discrete triangles when drivers can render triangle strips directly?

Rendering a strip (or fan) requires 1 extra vertex for each additional triangle, whereas you need 3 for each discrete triangle.
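As a small worked example of that vertex count, the following C sketch compares the two submission styles. The function names are illustrative only, and the strip case assumes all the triangles form one unbroken strip or fan.

```c
/* Vertices submitted when n triangles are sent as discrete triangles. */
static unsigned discrete_vertex_count(unsigned n)
{
    return 3u * n;
}

/* Vertices submitted when n triangles form one unbroken strip or fan:
 * three for the first triangle, then one more per additional triangle. */
static unsigned strip_vertex_count(unsigned n)
{
    return n ? n + 2u : 0u;
}
```

For 100 triangles that is 300 vertices against 102, so the saving approaches two thirds as the strip gets longer.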

Re: My MiniGL experiments,recompilation,tips,etc...
Just popping in
@BSzili

That's pretty conspicuous, a 31% drop in performance.

Re: My MiniGL experiments,recompilation,tips,etc...
Just popping in
If the Quake 3 engine really is hammering GL_TRIANGLES but passing vertex indices in fan/strip order, then it ought to be theoretically possible to convert the incoming discrete triangles into fans or strips where they share edges. That could give up to a 2/3 reduction in the amount of data that needs to be transferred over the bus. Combine this with more efficient vertex packing in the first place and you might have a result.

Re: My MiniGL experiments,recompilation,tips,etc...
Home away from home
@BSzili
Quote:
I did a quick test, and by adding 256 byte padding to the point3D structure in Cow3D, my framerate dropped from 58 to 40.

Thanks. That's a pretty definitive result. What happens if you pad it by a smaller amount; say 32 bytes (so one cacheline)?


@Karlos

Quote:
Well, random was meant to be the pathological worst case. However, the above also sounds bad, unless I have misunderstood this: Why render triangle strips out of discrete triangles when drivers can render triangle strips directly?

You'll have to ask John Carmack for a definitive reason. Most likely, it's so that multiple short tri-strips can be rendered in one call. Plus, the extra "vertices" are passed to the driver as indices, so the bandwidth increase is modest.

Chaining the triangles up into tri-strips in MiniGL could be detrimental to performance, because clipping would fragment them again. However, a Warp3D driver could do this internally (if it can't use index arrays directly). Of course, that would require MiniGL to send entire arrays in the first place. GLDrawElementsTriangles() currently drops down to sending one triangle at a time if even one triangle is clipped (which is very often with the Q3 engine).

Hans

Join Kea Campus' Amiga Corner and support Amiga content creation
https://keasigmadelta.com/ - see more of my work
Re: My MiniGL experiments,recompilation,tips,etc...
Just popping in
@Hans

Actually strip optimisations are discussed in the page you linked as something to do as a final stage optimisation. The reason given for using the triangles function was to limit the number of different GL calls. It pretty much renders everything using that one call.

Regarding the increase in traffic, it's not modest at the driver level. I don't know about the Radeon HD code, but the older drivers will repeatedly fetch each indexed vertex's data every time that index is used. Indexing only reduces the amount of data you define, not how much goes over the bus. Whether it's duplicated data in a Radeon command packet or duplicated data in a FIFO, the notion of the index applies to source data retrieval only, not to what is then passed to the GPU.

I think I can see an obvious optimisation for the Permedia here at least. Just keep track of the last index associated with each vertex register set (starting with your index trackers set to something impossible like -1 at the start of each draw elements call) and only submit an update to a given vertex register set if its index changes. If the indices are passed in fan or strip order, the end result would be the same behaviour as if you had explicitly requested a fan or strip: one vertex update per triangle rendered.
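The idea can be tried out away from the hardware with a small C sketch that only counts uploads. This is not driver code: count_vertex_uploads and NUM_REGISTER_SETS are illustrative names, indices are assumed to be non-negative, and the sketch assumes the hardware can draw from the three register sets in any order, so winding is ignored. It also goes slightly beyond a strict per-register comparison by checking whether a needed vertex is already resident in any of the three sets.

```c
#include <stddef.h>

#define NUM_REGISTER_SETS 3

/*
 * Count the vertex uploads needed to draw num_tris indexed triangles when
 * the driver remembers which index each vertex register set holds.
 * The trackers start at -1 (an impossible index) for every call.
 */
static unsigned count_vertex_uploads(const int *indices, size_t num_tris)
{
    int resident[NUM_REGISTER_SETS] = { -1, -1, -1 };
    unsigned uploads = 0;

    for (size_t t = 0; t < num_tris; t++) {
        const int *tri = &indices[t * 3];
        int in_use[NUM_REGISTER_SETS] = { 0, 0, 0 };
        int found[3] = { 0, 0, 0 };

        /* Pass 1: claim register sets that already hold a needed vertex. */
        for (int v = 0; v < 3; v++) {
            for (int s = 0; s < NUM_REGISTER_SETS; s++) {
                if (!in_use[s] && resident[s] == tri[v]) {
                    in_use[s] = 1;
                    found[v] = 1;
                    break;
                }
            }
        }

        /* Pass 2: upload the missing vertices into unclaimed register sets. */
        for (int v = 0; v < 3; v++) {
            if (found[v])
                continue;
            for (int s = 0; s < NUM_REGISTER_SETS; s++) {
                if (!in_use[s]) {
                    resident[s] = tri[v];
                    in_use[s] = 1;
                    uploads++;
                    break;
                }
            }
        }
    }
    return uploads;
}
```

With the index list 0,1,2, 2,1,3, 2,3,4, 4,3,5 (four triangles in strip order) this counts 6 uploads instead of 12: three for the first triangle and one for each triangle after it.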

This approach might also work on Radeon, but I would need to revisit the documentation for it.

Re: My MiniGL experiments,recompilation,tips,etc...
Quite a regular
@Hans
With only 32 bytes of padding I get 55 FPS, which is much better. Larger vertex array structures definitely have a performance impact.
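For readers who want to repeat the experiment, a padded vertex might look like the C sketch below. Cow3D's real point3D layout is not shown in this thread, so the three float fields, the struct name and the PAD_BYTES macro are all assumptions made for illustration.

```c
#include <stddef.h>

#ifndef PAD_BYTES
#define PAD_BYTES 32  /* the thread tried 32 and 256 */
#endif

/* Hypothetical stand-in for Cow3D's point3D with dead padding appended. */
struct point3D_padded {
    float x, y, z;
    unsigned char pad[PAD_BYTES];  /* never read, only inflates the stride */
};

/* Memory spanned by an array of vertex_count padded vertices. */
static size_t padded_array_bytes(size_t vertex_count)
{
    return vertex_count * sizeof(struct point3D_padded);
}
```

The padding is never read, so any slowdown comes purely from the larger stride between consecutive vertices.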

This is just like television, only you can see much further.
Re: My MiniGL experiments,recompilation,tips,etc...
Quite a regular
I separated MGLVertex into a projected and management part. MGLVertex now has just a pointer to the latter. I currently access the management data via this pointer, which is unnecessary in most cases (mass-replace rules), but I was able to get a nice speed increase in IOQuake3 from 35.4 to 41.1 FPS. I'll get rid of this indirection to see if I can get more out of it.
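A split along those lines could look roughly like the C sketch below. It is not the actual MiniGL change: the type names, the choice of which fields go in which part, and the MAX_TEXTURE_UNITS value of 4 are all assumptions. The management field names are the ones listed for MGLVertex elsewhere in this thread.

```c
#define MAX_TEXTURE_UNITS 4  /* assumed value, for illustration only */

/* Hypothetical management part: bookkeeping the driver never needs. */
typedef struct MGLVertexInfo {
    unsigned int  frame_code;
    unsigned int  tex_frame_code;
    unsigned int  outcode;
    unsigned char rhw_valid[MAX_TEXTURE_UNITS];
    unsigned char inview;
} MGLVertexInfo;

/* Hypothetical projected part: the data that gets sent for drawing,
 * plus a pointer to the management data kept in a separate array. */
typedef struct MGLVertexSplit {
    float x, y, z, w;                   /* projected position */
    float r, g, b, a;                   /* primary colour */
    float uv[MAX_TEXTURE_UNITS][2];     /* texture coordinates */
    MGLVertexInfo *info;                /* indirection to the management part */
} MGLVertexSplit;

/* The management data is reached through the pointer. */
static int vertex_is_current(const MGLVertexSplit *v, unsigned int frame_code)
{
    return v->info->frame_code == frame_code;
}
```

The point of the split is that the array handed to the drawing code has a smaller stride, at the cost of one extra pointer dereference whenever the bookkeeping is needed.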


Edited by BSzili on 2015/7/4 19:42:36
This is just like television, only you can see much further.
Re: My MiniGL experiments,recompilation,tips,etc...
Just popping in
@BSzili

Not bad! How big is the residual drawing part now? Does it contain, for example, surplus sets of texture coordinates for texture units that aren't present? I wonder what could be trimmed.

Re: My MiniGL experiments,recompilation,tips,etc...
Home away from home
@BSzili

Cool !!

Re: My MiniGL experiments,recompilation,tips,etc...
Quite a regular
@Karlos
It has all of the old projected part including MAX_TEXTURE_UNITS texture coordinates. IMHO making this structure variable sized would complicate things too much. I'll refactor things to get rid of the management pointer, and shrink the structure back to 96 bytes from the current 100.

This is just like television, only you can see much further.
Re: My MiniGL experiments,recompilation,tips,etc...
Just popping in
@BSzili

In theory, because all the T&L is done by this stage, you could convert the colour values from normalized float to uint8 (just use a single uint32 rather than four separate bytes).

That would reduce the size a fair bit (by 24 bytes) and there are already routines for this in util.h; see fast_normalized_to_u8(). They do absolutely require that the input values are clamped 0.0 - 1.0 however.
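As a rough illustration of the conversion, here is a minimal C sketch that packs four normalized floats into one uint32. It is not the util.h routine: pack_rgba8 is a made-up name, it clamps its inputs itself, and the byte order (R in the highest byte, A in the lowest) is an assumption that would have to match whatever the Warp3D driver expects.

```c
#include <stdint.h>

/*
 * Pack four normalized floats (R, G, B, A) into one 32-bit value.
 * Inputs are clamped to 0.0 - 1.0 and rounded to the nearest 8-bit step.
 * Channel order is an assumption: R ends up in the highest byte.
 */
static uint32_t pack_rgba8(const float rgba[4])
{
    uint32_t packed = 0;

    for (int i = 0; i < 4; i++) {
        float c = rgba[i];
        if (c < 0.0f) c = 0.0f;
        if (c > 1.0f) c = 1.0f;
        packed = (packed << 8) | (uint32_t)(c * 255.0f + 0.5f);
    }
    return packed;
}
```

Each colour shrinks from 16 bytes to 4, so packing two colours per vertex would account for the 24 bytes mentioned above.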

The colour format passed to W3D would need to be updated accordingly. For Permedia, this representation is used anyway and for R100 and R200 it's directly supported too.

I've managed to get my A1200 out of storage and will investigate what can be done to optimise DrawElements when discrete triangles are passed in strip/fan order. If this works, it might be applicable more generally.

Re: My MiniGL experiments,recompilation,tips,etc...
Quite a regular
@Karlos
I'll try that, although it will need to be converted back to 4 floats in certain places, for example when W3D_SetCurrentColor is called. I just looked at the projected structure again and the secondary colors are never set in the code. Could W3D_VFORMAT_SCOLOR possibly be omitted, or does it have to be there?

I just tried accessing the management array directly without the pointer, and it resulted in a worse framerate (34 FPS). It looks like the cache usage was better with the other approach, so the pointer is here to stay.

edit: Using W3D_VFORMAT_PACK_COLOR and W3D_VFORMAT_PACK_SCOLOR with a modified MGLVertex broke the rendering. If anyone wants to take a look at it, I can upload it somewhere. In any case I'll commit the MGLVertex split, because I tested it thoroughly and it gives a nice FPS boost.


Edited by BSzili on 2015/7/4 22:18:59
This is just like television, only you can see much further.
Re: My MiniGL experiments,recompilation,tips,etc...
Home away from home
@BSzili

I'd advise against converting colours to uint formats. While it may help with older hardware, it's a hindrance for newer hardware (where floats are the expected format). I'm not sure what the R100 and R200 drivers do, but the W3D_SI driver internally converts uint32 packed formats back to floats. As you've already noted, there are other places where you'll need to convert back and forth.

Added to that, the R100 & R200 Warp3D drivers get the colour channels for the packed formats round the wrong way. I have no idea if that's been fixed in the most recent version or not.

Quote:
Could W3D_VFORMAT_SCOLOR possibly be omitted, or does it have to be there?

As far as I can tell, MiniGL doesn't actually use the secondary colour. So it can probably be removed.

Quote:
...I was able to get a nice speed increase in IOQuake3 from 35.4 to 41.1 FPS.

Nice!

Hans

Join Kea Campus' Amiga Corner and support Amiga content creation
https://keasigmadelta.com/ - see more of my work
Re: My MiniGL experiments,recompilation,tips,etc...
Just popping in
@Hans

It must be the interleaved functions that get the packed colours wrong, because I unit tested all the colour formats for the separate pointers code and fixed all the broken conversions there. I didn't test the v5 functions at that time, however, as I was under the impression they were not broken. The R100 and R200 do support packed colour formats directly as far as I recall.

Re: My MiniGL experiments,recompilation,tips,etc...
Quite a regular
Sigh, the plot thickens. Unfortunately the speed increase wasn't caused by my changes, but by starting IOQuake3 before I had a working network connection (connecting to WiFi takes more than a minute on OS4, at least on my machine). If I do the tests consistently (e.g. doing both of them before, or both after, connecting), I get no speed increase whatsoever. I could never have guessed the network subsystem could be responsible for such a speed drop in IOQuake3.

The numbers in Cow3D still apply, so reducing the vertex size might be worthwhile, just not the way I did it.


Edited by BSzili on 2015/7/5 7:59:43
This is just like television, only you can see much further.
Re: My MiniGL experiments,recompilation,tips,etc...
Not too shy to talk
I think you can use W3D_VFORMAT_PACK_COLOR, but not W3D_VFORMAT_PACK_SCOLOR, which is broken according to my own Microbe3D experiments.
So something like this should be correct:
UBYTE RGBA[4];
float RGBA2[4];


Also, some fields in MGLVertex could be packed more tightly, for example:
GLuint frame_code;
GLuint tex_frame_code;
GLuint outcode;
GLboolean rhw_valid[MAX_TEXTURE_UNITS];
GLboolean inview;

GLboolean could be changed to UBYTE (or even a single bit).
frame_code and tex_frame_code could use UWORD (the matching variables in the context would then need adjusting too), as those variables are only ever checked like this: if(V->frame_code!=context->framecode) then v_not_up_to_date

outcode could be UWORD too.
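To see what that packing would save, here is a small C sketch comparing the two layouts. It uses stdint types as stand-ins (uint32_t for GLuint, uint16_t for UWORD, uint8_t for GLboolean and UBYTE) and assumes MAX_TEXTURE_UNITS is 4. Both assumptions are for illustration only and may not match the MiniGL headers.

```c
#include <stdint.h>

#define MAX_TEXTURE_UNITS 4  /* assumed value, for illustration only */

/* Fields as listed above, with 32-bit codes. */
struct mgmt_original {
    uint32_t frame_code;
    uint32_t tex_frame_code;
    uint32_t outcode;
    uint8_t  rhw_valid[MAX_TEXTURE_UNITS];
    uint8_t  inview;
};

/* Same fields with the three codes narrowed to 16 bits. */
struct mgmt_packed {
    uint16_t frame_code;
    uint16_t tex_frame_code;
    uint16_t outcode;
    uint8_t  rhw_valid[MAX_TEXTURE_UNITS];
    uint8_t  inview;
};

/* The only check described above: compare against the context's code. */
static int vertex_up_to_date(const struct mgmt_packed *v,
                             uint16_t context_frame_code)
{
    return v->frame_code == context_frame_code;
}
```

One thing to keep in mind with a 16-bit frame code is that it wraps after 65536 values, so a vertex left untouched for exactly that many updates would wrongly compare as current.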

Alain

Re: My MiniGL experiments,recompilation,tips,etc...
Home away from home
@thellier

A few tips, I guess; I have not looked at the code.

Is there any part that can be optimized by using AltiVec?
Are there any complex switches that can be replaced by table lookups?
Are there any loops that can be unrolled?
Are there any useless memory copy operations?
Are there any unneeded indirect table lookups?
Is there any code that can use macros instead of functions (fewer branches)?
Are there any dynamic lists that can be replaced by static arrays? Those are normally faster to loop over when you do not need to step into the list.
Are there any integers that can be replaced by unsigned integers? Unsigned integers should be faster, I read somewhere.
Are there any malloc() / free() calls that happen too often? Maybe there are ways to work around them.
Is data being unnecessarily copied as parameters when it could be global? Parameter passing generates extra store operations.
Are there any sort operations that can be made faster?

(NutsAboutAmiga)

Basilisk II for AmigaOS4
AmigaInputAnywhere
Excalibur
and other tools and apps.
Re: My MiniGL experiments,recompilation,tips,etc...
Not too shy to talk
In fact I have only studied/modified hclip.c, light.c, texture.c and especially draw.c.

>Is there any part that can be optimized by using AltiVec?
Yes, but AltiVec code is already there
>Are there any complex switches that can be replaced by table lookups?
Not in the parts I looked at
>Are there any loops that can be unrolled?
Already compiled with -O3, so pointless (unrolling)
>Are there any useless memory copy operations?
I think not
>Are there any unneeded indirect table lookups?
I think not
>Is there any code that can use macros instead of functions (fewer branches)?
Already compiled with -O3, so pointless (inlining)
>Are there any dynamic lists that can be replaced by static arrays? Those are normally faster to loop over when you do not need to step into the list.
Not in the parts I looked at
>Are there any integers that can be replaced by unsigned integers? Unsigned integers should be faster, I read somewhere.
? Don't know
>Are there any malloc() / free() calls that happen too often? Maybe there are ways to work around them.
Definitely not
>Is data being unnecessarily copied as parameters when it could be global? Parameter passing generates extra store operations.
I think not
>Are there any sort operations that can be made faster?
That is the problem
In fact several operations are not optimal, but they are not used often.

Alain Thellier



