@kas1e Drawing a scene like that is about the most inefficient way to do things and one of the big "don'ts" in GL.
I suppose W3DSI does extra batching internally. However, I'm not certain yet whether I should add something like that to ogles2.lib, as it's such an insane API abuse.
If MiniGL does it, then maybe it's something to add to gl4es, since it's the MiniGL equivalent here, right?
@Daytona675x All in all, it's Quake3: it works on all possible platforms and drivers. I assume that if Quake3 does something, then almost all other games do the same.
What makes me curious is why that "bad" MiniGL is 4 times faster, even with Quake3, which does no batching. And of course we don't want "the same as MiniGL" speed (otherwise why would we need it at all), but something faster.. Otherwise there is already MiniGL (which is bad, but faster even with TCL in software :)) )
Maybe it's not Quake3, but gl4es that splits it all up like this? Though that can't explain why he, on a Pandora with a 1 GHz CPU and a not-so-good gfx card, gets results which beat even MiniGL on the X5000 (leaving aside our gl4es version, which gl4es on Pandora just crushes)
Ogles2 is of course not a MiniGL equivalent, but imho we all expect it to be so much better that it will crush MiniGL in comparison.. But..:)
Quote:
I assume that if Quake3 does something, then almost all other games do the same.
Drawing single triangles per glDraw-call? Certainly not. And if a game does so: bad luck, at least with ogles2.
Quote:
What makes me curious is why that "bad" MiniGL is 4 times faster, even with Quake3, which does no batching.
I already told you my guess: probably W3DSI does some batching internally. Or MiniGL itself, dunno.
But I know one thing for sure: ogles2 is not optimized / designed for thousands of single-triangles-from-client-mem-draw-calls per frame, which is why we got that low performance here. ogles2 is well optimized for what people usually do with it, while considering the way Nova likes it: mostly use VBOs and if using client-mem-arrays at all, then those are usually not just a handful of triangles. It is not optimized for what Q3+gl4es deliver right now and I probably won't optimize it for that kind of stuff.
Quote:
Maybe it's not Quake3, but gl4es that splits it all up like this?
Don't know.
Quote:
Ogles2 is of course not a MiniGL equivalent, but imho we all expect it to be so much better that it will crush MiniGL in comparison..
It will, as soon as you start to feed it with something other than single triangles.
Quote:
It is not optimized for what Q3+gl4es deliver right now and I probably won't optimize it for that kind of stuff.
Ok.. then probably all those gl4es and Regal layers make no sense in the end, and somebody should do a proper OpenGL (Mesa probably) over Nova.
Quote:
It will, as soon as you start to feed it with something else than single triangles
Seriously, Quake3 is so popular and works pretty much everywhere (even over gl4es on other GLES drivers) that ruling it out like "written badly, we don't do it like this" is, dunno.. But whatever.
At this point it all probably means there is no need to worry about gl4es or Regal at all anymore, and we need only a proper OpenGL implementation (one which will do all those "bad" things, so that usual things like Quake3 give more fps than the "bad" MiniGL).
I mean, there is no reason to make another layer which is slower than MiniGL in everything I try (simple games give about the same result, Quake3 gives a much worse result).
Right now Q3/gl4es draws a scene in a way that's no good with ogles2/Nova. The latter like rather large batches of triangles. That's what they are designed for. And this is how you get good performance from them. (Not just) people like Entwickler-X demonstrate how it's done.
Making hundreds or thousands of draw-calls with less than 10 triangles each misses the point of those libs. And it was never a good idea. Apparently you're lucky and other ogles2 implementations on other hardware aren't hurt by that so much. And apparently you're lucky that MiniGL/Warp3D(SI) is of some help in the background.
The thing is, as said before, that this type of inefficient drawing is not what 99% of ogles2 programs do. That's why I'm absolutely not convinced that it makes sense to optimize ogles2 for this niche task as well. IMHO something like that has to be implemented in the next higher level (where it's also most likely easier to do and where other systems also benefit from it). The next higher level in the constellation here would be MiniGL or its equivalent.
So the obvious solution is: extend gl4es instead to collect the data of such small draw calls and then issue a bigger one.
So, it's only a "sad final" if you either don't want to do that or if you cannot convince me that I'm wrong and that I should extend ogles2 by such an optimization.
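To make the "collect small draw calls and issue a bigger one" idea concrete, here is a minimal sketch in plain C. It is not gl4es or ogles2 code: `Batcher`, `batch_draw_triangles`, `batch_flush` and `submit_to_driver` are hypothetical names, and the "driver" is just a counter standing in for an expensive call such as glDrawArrays.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch: none of these names exist in gl4es or ogles2. */

#define BATCH_MAX_VERTS 4096   /* capacity of the staging buffer, in vertices */

typedef struct { float x, y, z; } Vertex;

typedef struct {
    Vertex verts[BATCH_MAX_VERTS];
    size_t count;        /* vertices currently staged */
    size_t draw_calls;   /* how many "real" draw calls were issued */
    size_t verts_drawn;  /* total vertices handed to the "driver" */
} Batcher;

/* Stand-in for the real, expensive driver call (assumption: a counter only). */
static void submit_to_driver(Batcher *b)
{
    b->draw_calls++;
    b->verts_drawn += b->count;
}

void batch_init(Batcher *b)
{
    b->count = 0;
    b->draw_calls = 0;
    b->verts_drawn = 0;
}

/* Issue one big draw call for everything staged so far. */
void batch_flush(Batcher *b)
{
    if (b->count == 0)
        return;
    submit_to_driver(b);
    b->count = 0;
}

/* Queue a small triangle list instead of drawing it immediately. */
void batch_draw_triangles(Batcher *b, const Vertex *v, size_t n)
{
    assert(n % 3 == 0 && n <= BATCH_MAX_VERTS);
    if (b->count + n > BATCH_MAX_VERTS)
        batch_flush(b);   /* no room left: emit what we have first */
    memcpy(&b->verts[b->count], v, n * sizeof *v);
    b->count += n;
}
```

A caller would queue its tiny draws with `batch_draw_triangles()` and call `batch_flush()` at the end of the frame. A real layer would also have to flush whenever state that affects rendering changes (texture, blend mode, shader and so on), because geometry drawn under different state cannot share one call.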
Quote:
The thing is, as said before, that this type of inefficient drawing is not what 99% of ogles2 programs do. That's why I'm absolutely not convinced that it makes sense to optimize ogles2 for this niche task as well.
Quake3 isn't a niche task imho. Or do you mean the way gl4es does things in the end? The author said before that "Many GLES2 hardware also doesn't like to have many draw command, so gl4es tries to group them as much as it can." But maybe it's a bit different.
Quote:
IMHO something like that has to be implemented in the next higher level (where it's also most likely easier to do and where other systems also benefit from it). The next higher level in case of the constellation here would be MiniGL or its equivalent.
You mean gl4es? But gl4es already gives more or less the same performance as MiniGL does, on a Pandora with a 1 GHz CPU and a worse graphics card than I have. Maybe their drivers just do batching inside.
Quote:
extend gl4es instead to collect the data of such small draw calls and then issue a bigger one.
But isn't it strange anyway: as Hans says, "5 * 550 = 2750 draw calls/s. We can manage a lot more than that, so something must be getting in the way". So, ok, let it be lots of calls with not many triangles each; does that then mean everything goes slow? :)) Kind of strange to get that from a thing which we expect to be modern, for modern GPUs, with modern Nova, on a modern X5000 :)
Sorry, I'm not much into the technical details; I mean it all from a user's perspective.
Quote:
So, it's only a "sad final" if you either don't want to do that or if you cannot convince me that I'm wrong and that I should extend ogles2 by such an optimization.
You mean gl4es, or rewriting all the "bad" games of Quake3 quality? :) If the latter, then that's impossible: as soon as you tell anyone that things are slow because "Quake3 is written badly", everyone will just say "ok, I got the point", as Quake3 itself is optimized a lot (I can't say that as a developer, just as a user who has heard about it all the time, everywhere, for the last 10 years).
To clear it up: by "game" maybe you mean "game compiled over gl4es"? Maybe it's gl4es that does things "badly", and not the Quake3 code (which has surely been optimized very much by tons of coders)? Maybe it's all only about gl4es? And just by some luck most GLES2 drivers have batching inside, and because of that it works more or less ok on Pandora and co?
I say that making hundreds or thousands of glDraw-calls per frame with 10 triangles each is a niche task for ogles2/Nova.
Quote:
But isn't it strange anyway: as Hans says, "5 * 550 = 2750 draw calls/s.
Absolutely true. But that means something else than throwing hundreds or thousands of such micro-draw-calls per frame at ogles2 / Nova and expecting it to deliver fast results.
Quote:
You mean gl4es, or rewriting all the "bad" games of Quake3 quality?
I told you exactly what I mean above: "So the obvious solution is: extend gl4es instead to collect the data of such small draw calls and then issue a bigger one"
This is the solution to your problem.
Quote:
If the first one, then why does gl4es work better on Pandora?
I already told you my guess on that too: "Apparently you're lucky and other ogles2 implementations on other hardware aren't hurt by that so much."
Quote:
Sure, I will forward that to the gl4es author, but.. he will say "everywhere all is fine, only on AmigaOS4 it's so slow"
Too bad then, because, as said above: in gl4es "it's also most likely easier to do and other systems also benefit from it" (even if it already runs "fast enough" on others). ogles2/Nova can draw scenes with hundreds of thousands of triangles per frame - if you feed it correctly. Right now you don't feed it correctly, simple as that. And the way you incorrectly feed it right now is so uncommon for ogles2 applications that I don't see the need / value / work-value-balance to optimize ogles2 for this (ab)use-case as well. It is something the caller should take care of. The caller is gl4es.
Btw.: I don't know Q3 sources but somehow I cannot imagine that the game itself doesn't produce larger batches of triangles.
Quote:
But isn't it strange anyway
Actually it's not strange at all, and it's consistent with your very first tests. Essentially what Q3/gl4es is doing here is the same as if you put a big for-loop around the begin-end block of your rotating cube example, instead of putting that loop *inside* the begin-end block. Simply put, instead of sending 1x 1000 cubes you are sending 1000x 1 cube (actually Q3/gl4es is even worse, since a cube has at least 12 triangles). This is what I call a niche use / abuse of ogles2. Why should something like that produce better performance than the original 1 cube per draw-call per frame? No, there's nothing strange at all. The only strange thing is that Q3/gl4es issues such draw calls in the first place.
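The loop-placement point can be shown with a tiny simulation. This is only a sketch: the `sim_*` functions are hypothetical stand-ins for glBegin/glVertex/glEnd that count work instead of talking to a GPU, and 36 vertices per cube assumes 12 triangles drawn as plain GL_TRIANGLES.

```c
#include <assert.h>

/* Simulated immediate-mode layer; the sim_* names are hypothetical, not GL. */
typedef struct {
    unsigned long submissions;  /* one per begin/end block, i.e. per draw call */
    unsigned long vertices;     /* total vertices sent */
    int in_block;               /* nonzero while inside a begin/end block */
} SimGL;

void sim_begin(SimGL *gl)  { assert(!gl->in_block); gl->in_block = 1; }
void sim_vertex(SimGL *gl) { assert(gl->in_block);  gl->vertices++; }
void sim_end(SimGL *gl)    { assert(gl->in_block);  gl->in_block = 0; gl->submissions++; }

#define VERTS_PER_CUBE 36   /* assumption: 12 triangles * 3 vertices */

static void emit_cube(SimGL *gl)
{
    for (int i = 0; i < VERTS_PER_CUBE; i++)
        sim_vertex(gl);
}

/* Loop AROUND the begin/end block: n submissions of one cube each. */
void draw_cubes_loop_outside(SimGL *gl, int n)
{
    for (int i = 0; i < n; i++) {
        sim_begin(gl);
        emit_cube(gl);
        sim_end(gl);
    }
}

/* Loop INSIDE the begin/end block: one submission containing n cubes. */
void draw_cubes_loop_inside(SimGL *gl, int n)
{
    sim_begin(gl);
    for (int i = 0; i < n; i++)
        emit_cube(gl);
    sim_end(gl);
}
```

Both variants send the same number of vertices; only the number of submissions differs (n versus 1), and that per-submission overhead is what the discussion here is about.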
Quote:
Btw.: I don't know Q3 sources but somehow I cannot imagine that the game itself doesn't produce larger batches of triangles.
Sure, the idTech3 engine does things right. Otherwise it wouldn't have been so popular and fast in the end.
As Hans said before, when he first ran OpenArena on W3D SI, it also gave him only 10 fps with MiniGL. So probably batching was added to Warp3D SI then, and Warp3D Nova doesn't have it => ogles2 doesn't have it => 5 fps.
@kas1e your post and my edit overlapped, please see addon above for even more clarity
Yes, it may well be that W3DSI contains extra batching-logic. Or MiniGL does, dunno. That would explain why it isn't hurt by that. And that's why I say: put that logic into the MiniGL equivalent, which is gl4es.
@Daytona675x Btw, I have some more info on the topic.
First, "all is like with MiniGL on Pandora" is what the author gets with the regular Q3/gl4es version. "Regular" here means all extensions enabled (!). For us, as you remember, GL_EXT_compiled_vertex_array is disabled, as it currently doesn't work in ogles2 (Nova?).
So we should not forget that once the fix for the vertex attribs is done in Nova, we can enable that extension, GL_EXT_compiled_vertex_array, and have 550 calls per frame. I.e. we will have 550 "large" calls, and that will probably be good.
With all extensions enabled on Pandora's GLES2, he gets that "MiniGL kind" of performance. With exactly that GL_EXT_compiled_vertex_array, Quake3 uses glDrawElements(...) calls that are quite optimized by the idTech3 engine.
That can explain the same kind of fps as we have on MiniGL.
Quote from the gl4es author:
Quote:
I.e. with GL_EXT_compiled_vertex_array you will have the same rendering as he had yesterday on the Pandora. idTech3 engine is pretty good at batching call. It was not the case with idTech1 and 2 (because of the software rendering), but, again, I don't know well the "glBegin/glEnd" path, maybe it's fragmented. I'll check tonight.
So, in our case now, when we have disabled all extensions, it's using the glBegin(...)/glEnd() code path, which may do things wrong.
Though the gl4es code has some "collapse" code to do the batching, it may very well not work, or be broken, or not work with Q3 for some reason.
That is to be checked today, fingers crossed! Maybe it's a bit early for a sad final! :)
@kas1e So apples and pears have been compared? Anyway, we are back to where we left off some posts ago, where I told you: rest, give it a break. Wait until Hans has finished his work on Nova, then wait until I have checked ogles2 and, if there's still an issue despite Nova then supporting mixed types of shader variables, finished fixing it. Until then: let's put this to sleep (at least I will, I won't get back here until then).
@Daniel Btw, when you say "flooded by glcall with less or eq. than 10 triangles", was it many thousands? I mean not a few hundred, but hundreds of thousands?
Just have some good news: we (meaning the gl4es author; I just follow what he and you do like a sucker :) ) checked the Quake3 sources and found that when no extensions are enabled, the code calls R_DrawStripElements( numIndexes, indexes, qglArrayElement );
That call is made from R_DrawElements, which is described like this (quoting code comments):
Quote:
/*
==================
R_DrawElements

Optionally performs our own glDrawElements that looks for strip conditions
instead of using the single glDrawElements call that may be inefficient
without compiled vertex arrays.
==================
*/
So, looking at the code of that function, we see it tries to build TRIANGLE_STRIPs, thus making more calls instead of one single glBegin/glEnd, as that was more optimised in the early days.
The only way to collapse those triangle strips is to turn them back into individual triangles... gl4es is supposed to do that, but (!):
After an analysis of the glBegin/glEnd path (i.e. with extensions disabled) on the Pandora with a gl4es build: it's bad. 2, maybe 3 fps.
And it seems that gl4es doesn't collapse the blocks (but it should). So the author has a bug in gl4es that leads to bad performance in this case.
He gets 3906 draw calls (instead of 550 with extensions enabled), and yeah, performance is terrible.
He has to fix that; the code for collapsing the calls is there, so it's "just" a bug somewhere.
Not that I expect 50 fps anymore once batching is working, but it should at least beat MiniGL.
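For reference, turning a triangle strip back into individual triangles (so that several strips can be merged into one GL_TRIANGLES draw) could look roughly like the sketch below. This is not the gl4es implementation; `strip_to_triangles` is a hypothetical helper that only illustrates the index shuffling, including the swap on every second triangle that keeps the winding order consistent.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not taken from gl4es.
 * Expands one triangle strip of n indices into (n - 2) independent triangles.
 * 'out' must have room for 3 * (n - 2) indices.
 * Returns the number of indices written. */
size_t strip_to_triangles(const unsigned short *strip, size_t n,
                          unsigned short *out)
{
    size_t written = 0;

    if (n < 3)
        return 0;   /* fewer than 3 indices: no complete triangle */

    for (size_t i = 0; i + 2 < n; i++) {
        if (i % 2 == 0) {
            /* even triangle: indices are already in the right order */
            out[written++] = strip[i];
            out[written++] = strip[i + 1];
        } else {
            /* odd triangle: swap the first two to preserve the winding */
            out[written++] = strip[i + 1];
            out[written++] = strip[i];
        }
        out[written++] = strip[i + 2];
    }
    return written;
}
```

Once every strip has been expanded like this, the resulting index lists can simply be appended to one another and drawn with a single call, which is the "collapse" step discussed above.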
Edited by kas1e on 2018/3/1 19:14:54 Edited by kas1e on 2018/3/1 19:17:14
@All Ok, the "batching" bug in gl4es has been found and fixed, so instead of 3906 calls there are now 300. That is probably already good enough to feed ogles2/Nova, and those are not micro calls anymore.
Results are starting to get better, but are still far from beating MiniGL.
I compare everything with the usual "timedemo 1 / demo four" typed in the console, at full maximum detail. And this is what I have:
1) The gl4es version without GL extensions is about 1.7 times slower than the MiniGL version without extensions. So that part probably still needs a lot of optimisation.
2) Quake3 seems to support just a few GL extensions, and even though MiniGL has about 20 of them, Quake3 still uses only those 3 even with MiniGL:
On ogles2 we now have 2 of them, GL_EXT_texture_env_add and GL_ARB_multitexture, and they give just +4 fps, compared with the MiniGL version, where extensions give +30 fps. So the real boost seems to come from GL_EXT_compiled_vertex_array.
Though the non-extension version is still visibly slower, and to beat MiniGL even a little we need to gain about 30 fps more from somewhere..
And activating glDrawElements(...) will not reduce the number of calls at this stage, but it will reduce the CPU load / number of malloc()/free() calls done, if that makes any difference.
@Daniel Sorry for not resting, but I can upload a binary with the fixed "batching", so we have 300 calls now instead of 4000. If you have time for it at all, of course.
We can of course easily wait until Hans adds the vertexattrib stuff, but that will probably help only with GL extensions, while the non-extension variant is still slower than the MiniGL one (that part can probably be checked already)
Quote:
?:) Sadly I need to convince you to do a fast fix for "Shader attribute expansion to four components" :( It's not like we ask for features or file BZ tickets just because :)
You don't need to convince me to fix bugs. I wasn't convinced that the bug in question is the cause of the very serious performance issues. It sounds like I was right...
I also really appreciate the work you're doing, and the bugs/limitations you've reported. Things have improved faster over the last month or so because of this. It's great that the GL4ES author is helping too.
@all Yes, the W3D_SI driver does its best to batch up tiny draw ops. Like I said earlier, MiniGL tends to chop up draw ops into tiny fragments, so this was a must. Clearly GL4ES did that too, and I'm glad that's now been fixed.
Quake3 definitely does *not* do hundreds of tiny draw calls. Some older 3D games do, because that was actually faster back when they were written. Fortunately, Quake 3 is from a newer generation.
There must still be some things in the way of performance, because 300 * 40.1 = 12030 draw calls/s. The system should be able to do better than that, even including the CPU time taken by game logic.
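To put those numbers side by side, here is a small sketch of the arithmetic. The helper names are made up for illustration, and the per-call figure is just the average wall-clock time per draw call, which includes game logic and everything else done in the frame.

```c
#include <assert.h>

/* Draw calls issued per second for a given per-frame count and frame rate. */
double draw_calls_per_second(double calls_per_frame, double fps)
{
    assert(calls_per_frame >= 0.0 && fps >= 0.0);
    return calls_per_frame * fps;
}

/* Average wall-clock time available per draw call, in microseconds. */
double microseconds_per_call(double calls_per_second)
{
    assert(calls_per_second > 0.0);
    return 1.0e6 / calls_per_second;
}
```

With the figures from this thread, 550 calls at 5 fps gives 2750 calls/s (roughly 364 microseconds per call on average), and 300 calls at 40.1 fps gives 12030 calls/s (roughly 83 microseconds per call).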
Quote:
And activating glDrawElements(...) will not reduce the number of calls at this stage, but it will reduce the CPU load / number of malloc()/free() calls done, if that makes any difference.
I missed this. Yes, many malloc()/free() calls will definitely slow things down.
As an experiment, you could try changing R_DrawElements() to always use glDrawElements(). You still won't get the advantage of compiled vertex arrays, but you'll bypass the very inefficient glBegin()/glEnd() code. That code hurts performance on multiple levels:
- The overhead of glBegin/glEnd()
- The overhead of analysing the index array and copying vertices to a new buffer (increases the chance of the data not fitting the CPU's cache)
- It's cutting up the vertex array into smaller fragments (more batches)
- Vertices in a mesh that are shared between tri-strips are duplicated, meaning more vertices need to be sent to the GPU (more data == more transfer time, and it increases the chance of the data not fitting the CPU's cache even more)
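The last two points in the list can be illustrated with an idealised example: a flat grid of w x h quads, drawn either as one indexed call or as one triangle strip per row. This is only a sketch with made-up names (`SubmitCost`, `cost_indexed`, `cost_row_strips`); real Q3 meshes and the way R_DrawStripElements actually splits them depend on the index order, so the numbers are illustrative only.

```c
#include <assert.h>

/* Rough cost of submitting a mesh: vertices sent and batches issued. */
typedef struct {
    unsigned long vertices_sent;
    unsigned long batches;
} SubmitCost;

/* One indexed draw: every grid vertex is sent once, in a single batch. */
SubmitCost cost_indexed(unsigned long w, unsigned long h)
{
    SubmitCost c;
    c.vertices_sent = (w + 1) * (h + 1);
    c.batches = 1;
    return c;
}

/* One strip per row: each row sends both of its vertex lines, so every
 * interior line of vertices is sent twice, and there is one batch per row. */
SubmitCost cost_row_strips(unsigned long w, unsigned long h)
{
    SubmitCost c;
    c.vertices_sent = 2 * (w + 1) * h;
    c.batches = h;
    return c;
}
```

For a 16 x 16 grid this comes to 289 vertices in 1 batch for the indexed draw, versus 544 vertices in 16 batches for row strips.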
@Hans Sure, the system can do a lot more. If you take that 40.1 as a reference (so the value with extensions), then MiniGL already gives 90 there (and that with TCL in software). And with Nova/ogles2 we probably should get more, as TCL is in hardware and we are on better drivers.
As for glBegin/glEnd (so without extensions): with the same mallocs/frees (in the game itself, not in MiniGL/gl4es; in gl4es there may very well be lots of mallocs/frees which aren't in MiniGL) etc., the MiniGL version is still 1.7 times faster, while of course we need the gl4es version to be 1.7 times faster (at least, as a minimum). So experiments with glDrawElements can't explain the non-extension differences.
I mean the non-extension versions are in the same code condition, and the gl4es one should be faster, imho, just because it's on new drivers and batching now works on the gl4es side.
@Hans For the sake of testing I changed R_DrawElements() to always use glDrawElements() anyway, and as a result I get the same mess which I have with GL_EXT_compiled_vertex_array enabled.
It seems the roots of the non-working GL_EXT_compiled_vertex_array are in glDrawElements().
Quote:
Yes, many malloc()/free() calls will definitely slow things down.
If there are frequent allocations of a same-sized memory block then it might be worth adapting the code to use item pool allocations/deallocations instead of malloc() / free(). They are way faster and also help prevent memory fragmentation.
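As an illustration of the idea, a fixed-size pool can be as simple as a free list threaded through a preallocated array. The sketch below is a portable stand-in, not the system's own item pool functions; `Pool`, `pool_alloc` and `pool_free` are hypothetical names, and the item size and count are arbitrary assumptions.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_ITEMS 256   /* assumption: maximum number of live items */
#define ITEM_SIZE   64   /* assumption: size of each item in bytes */

/* While an item is free, its storage holds the link to the next free item. */
typedef union PoolItem {
    union PoolItem *next;
    unsigned char payload[ITEM_SIZE];
} PoolItem;

typedef struct {
    PoolItem items[POOL_ITEMS];
    PoolItem *free_list;
} Pool;

/* Thread all items onto the free list. */
void pool_init(Pool *p)
{
    p->free_list = NULL;
    for (size_t i = 0; i < POOL_ITEMS; i++) {
        p->items[i].next = p->free_list;
        p->free_list = &p->items[i];
    }
}

/* O(1): pop the head of the free list; returns NULL when the pool is empty. */
void *pool_alloc(Pool *p)
{
    PoolItem *it = p->free_list;
    if (it == NULL)
        return NULL;
    p->free_list = it->next;
    return it;
}

/* O(1): push the item back onto the free list. */
void pool_free(Pool *p, void *ptr)
{
    PoolItem *it = ptr;
    assert(it >= p->items && it < p->items + POOL_ITEMS);
    it->next = p->free_list;
    p->free_list = it;
}
```

Because all items have the same size and live in one block, allocation and release are a couple of pointer operations each and cannot fragment the heap. The trade-off is the fixed capacity, which has to be chosen up front.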