I know you are busy with everything, but just as a reminder after 2 months: is it possible that you can find some time soon to work on Nova again?
Not right now, sorry. I really need to get things I'm working on finished ASAP, so I don't have much time for other things.
Quote:
Also, what can you suggest for debugging why Nova fails to make the FrickingShark shaders work correctly, even though it compiles them fine? (Taking into account that the shaders are OK, and that on the win32 and Linux versions the same code using those shaders works correctly.)
From memory, last time we talked about this you had simplified the shaders (disabled the fogging), and it was still drawing a single colour. You weren't sure if the textures were set up correctly.
The best way I can think of, is to simplify the shader until it works, and then work back up. I'd start by simply outputting Texture0 to gl_FragColor, to make sure that works. If it doesn't work, then either the texture isn't bound correctly, or something is going wrong with the texture coordinates (in which case, check the vertex shader).
After that, I'd output Texture1, to make sure that one works, and then gradually add in the rest of the code again.
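The "simplify, then work back up" steps can be sketched as a tiny generator for throwaway test shaders. This is only an illustration in Python: the varying name `vTexCoord` is a made-up placeholder (use whatever the real vertex shader outputs), and the GLSL is written in GLSL ES 1.00 style, which may need adjusting for the game's own shaders.

```python
def passthrough_fragment_shader(sampler, texcoord="vTexCoord"):
    """Build GLSL source that writes one texture straight to gl_FragColor.

    `texcoord` is a hypothetical varying name, not taken from the game.
    """
    return "\n".join([
        "precision mediump float;  // GLSL ES style; drop for desktop GLSL",
        f"uniform sampler2D {sampler};",
        f"varying vec2 {texcoord};",
        "void main() {",
        f"    gl_FragColor = texture2D({sampler}, {texcoord});",
        "}",
        "",
    ])


def texcoord_debug_fragment_shader(texcoord="vTexCoord"):
    """Build GLSL source that shows the texture coordinates as red/green."""
    return "\n".join([
        "precision mediump float;",
        f"varying vec2 {texcoord};",
        "void main() {",
        f"    gl_FragColor = vec4({texcoord}, 0.0, 1.0);",
        "}",
        "",
    ])


# Step 1: only Texture0. Step 2: only Texture1.
step1 = passthrough_fragment_shader("Texture0")
step2 = passthrough_fragment_shader("Texture1")
coords = texcoord_debug_fragment_shader()
print(step1)
```

If step 1 still draws a single colour, the coordinate shader helps tell the two cases apart: a smooth red/green gradient suggests the coordinates arrive correctly and the texture binding is the suspect, while a flat colour points at the vertex shader.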
Some info about gl4es: judging from the commits, ptitSeb has started working on precompiled shader support (which means no pauses in games that change state often, like FrickingShark, LugaruHD and some others).
Another thing worth mentioning: I was able to build the old SuperTuxKart (version 0.6.2a) over gl4es, and the good thing is that rendering and everything else looks fine right from the beginning (though there is a little issue when you run it: for some reason the whole menu is not drawn, and you need to go into any entry and exit it to make the menu appear).
But the bad thing is that the expected speed boost didn't happen :( On some tracks the gl4es build is a little faster, on others it is slower than the MiniGL build. So generally speaking, at the moment there is nothing worth releasing in those terms.
Here are videos of one track called "racetrack" to show the difference:
As you can see, gl4es is a little faster there (especially under the bridge), but not _that_ fast as one would expect.
I will play with different gl4es settings, as well as ask ptitSeb (the gl4es author) if he has any ideas about it.
I may also try to build the whole of SuperTuxKart with LTO enabled; maybe it can give some boost (at least with foobillard++ it gave +10 fps), so maybe I will be lucky with that 0.6.2a version too.
I still get the same visual error I got with v1.58 of the Warp3DNova drivers:
The left one is with OGLES2, the right one with OpenGL
I'm using Warp3DNova.library 1.65 (31.03.2019) W3DN_SI.library 1.65 (31.03.2019) now, but it doesn't seem to be fixed as this post (#483) from you assumed back then.
@All As the profiler in glSnoop seems to be working, I want to understand why SuperTuxKart compiled over gl4es is just as slow as the MiniGL version. So I ran the game on some track, and here is the output I got:
OGLES2:
OpenGL ES 2.0 profiling results:
--------------------------------
-> DrawArrays callcount 137928, duration 4977.818102 milliseconds, 59.32 % of total
-> DrawElements callcount 7109, duration 1727.744436 milliseconds, 20.59 % of total
-> CompileShader callcount 32, duration 1073.860889 milliseconds, 12.80 % of total
-> TexImage2D callcount 1256, duration 409.614268 milliseconds, 4.88 % of total
-> VertexAttribPointer callcount 488300, duration 179.157637 milliseconds, 2.14 % of total
-> SwapBuffers callcount 538, duration 11.084613 milliseconds, 0.13 % of total
-> BindTexture callcount 3734, duration 5.976421 milliseconds, 0.07 % of total
-> GenTextures callcount 142, duration 2.095481 milliseconds, 0.02 % of total
-> DeleteTextures callcount 142, duration 1.718170 milliseconds, 0.02 % of total
-> EnableVertexAttribArray callcount 7787, duration 1.053254 milliseconds, 0.01 % of total
-> TexParameteri callcount 426, duration 0.485383 milliseconds, 0.01 % of total
-> ShaderSource callcount 32, duration 0.234230 milliseconds, 0.00 % of total
Total recorded duration 8390.842884 ms, 13.04 % of total 64337.708425 ms
--------------------------------------------------------
Warp3DNOVA:
Warp3D Nova profiling results:
------------------------------
-> DrawArrays callcount 57256, duration 1588.027950 milliseconds, 56.87 % of total
-> BufferUnlock callcount 64253, duration 501.094117 milliseconds, 17.95 % of total
-> DrawElements callcount 3343, duration 338.133456 milliseconds, 12.11 % of total
-> CreateVertexBufferObject callcount 669, duration 140.691583 milliseconds, 5.04 % of total
-> VBOSetArray callcount 209122, duration 84.408429 milliseconds, 3.02 % of total
-> VBOLock callcount 58728, duration 70.160685 milliseconds, 2.51 % of total
-> DestroyVertexBufferObject callcount 669, duration 50.385371 milliseconds, 1.80 % of total
-> BindVertexAttribArray callcount 207526, duration 17.670089 milliseconds, 0.63 % of total
-> Destroy callcount 1, duration 1.587200 milliseconds, 0.06 % of total
-> DestroyFrameBuffer callcount 2, duration 0.040703 milliseconds, 0.00 % of total
-> FBBindBuffer callcount 2, duration 0.037053 milliseconds, 0.00 % of total
-> CreateFrameBuffer callcount 1, duration 0.002647 milliseconds, 0.00 % of total
-> SetRenderTarget callcount 2, duration 0.000441 milliseconds, 0.00 % of total
Total recorded duration 2792.239724 ms, 4.20 % of total 66428.862092 ms
--------------------------------------------------------
As I see from both reports, the timewaster is DrawArrays, which takes about 60% of everything. Only 12-20% is taken by glDrawElements.
But as I do not understand why it can be that slow and what can be done about it, I am just posting that info here in the hope that someone more skilled can bring some ideas :)
For the sake of testing I also profiled Quake3, and this is what I got:
OGLES2:
OpenGL ES 2.0 profiling results:
--------------------------------
-> DrawElements callcount 16661, duration 15241.116213 milliseconds, 94.09 % of total
-> TexImage2D callcount 2620, duration 445.739704 milliseconds, 2.75 % of total
-> DrawArrays callcount 442, duration 350.937282 milliseconds, 2.17 % of total
-> CompileShader callcount 4, duration 69.808558 milliseconds, 0.43 % of total
-> TexSubImage2D callcount 23, duration 43.077355 milliseconds, 0.27 % of total
-> BindTexture callcount 16160, duration 21.526567 milliseconds, 0.13 % of total
-> SwapBuffers callcount 559, duration 12.450375 milliseconds, 0.08 % of total
-> GenTextures callcount 477, duration 7.757589 milliseconds, 0.05 % of total
-> DeleteTextures callcount 477, duration 4.823114 milliseconds, 0.03 % of total
-> TexParameteri callcount 1431, duration 1.179853 milliseconds, 0.01 % of total
-> ShaderSource callcount 4, duration 0.024903 milliseconds, 0.00 % of total
-> VertexAttribPointer callcount 10, duration 0.005694 milliseconds, 0.00 % of total
-> EnableVertexAttribArray callcount 4, duration 0.001764 milliseconds, 0.00 % of total
Total recorded duration 16198.448971 ms, 58.19 % of total 27836.338613 ms
--------------------------------------------------------
Warp3DNOVA:
Warp3D Nova profiling results:
------------------------------
-> DrawElements callcount 16661, duration 2020.462445 milliseconds, 86.40 % of total
-> BufferUnlock callcount 19708, duration 125.265309 milliseconds, 5.36 % of total
-> CreateVertexBufferObject callcount 675, duration 65.347476 milliseconds, 2.79 % of total
-> DrawArrays callcount 442, duration 51.715724 milliseconds, 2.21 % of total
-> VBOLock callcount 17507, duration 27.010988 milliseconds, 1.16 % of total
-> VBOSetArray callcount 58002, duration 26.906725 milliseconds, 1.15 % of total
-> DestroyVertexBufferObject callcount 675, duration 17.456069 milliseconds, 0.75 % of total
-> BindVertexAttribArray callcount 51165, duration 3.733328 milliseconds, 0.16 % of total
-> Destroy callcount 1, duration 0.560212 milliseconds, 0.02 % of total
-> DestroyFrameBuffer callcount 2, duration 0.065405 milliseconds, 0.00 % of total
-> FBBindBuffer callcount 2, duration 0.036251 milliseconds, 0.00 % of total
-> CreateFrameBuffer callcount 1, duration 0.002807 milliseconds, 0.00 % of total
-> SetRenderTarget callcount 2, duration 0.000441 milliseconds, 0.00 % of total
Total recorded duration 2338.563179 ms, 8.98 % of total 26053.620002 ms
--------------------------------------------------------
So in the Quake3 case, it can be seen that everything is limited by glDrawElements and its optimisation.
Probably those 2 tests mean that gl4es needs optimisation in its glDrawArrays() handling (batching and such). But of course I can't be sure.
In the STK case, it seems that OGLES2 spent 13 % of its "context lifetime" in known functions (that figure also includes W3DNova). This amount may change once we add the rest of the functions, so that glSnoop becomes more aware of everything. But it's also possible that 87 % of the time is spent doing something other than drawing. Compare that to the Q3 case, where OGLES2 spent 58 % of the context lifetime; my interpretation is that Q3 is more GPU-oriented than STK, based on these rough stats.
It's also important to note that Pause doesn't impact profiling, at the moment anyway. So if you let game sit on some menu it may twist the stats.
Yeah, those results were taken without any pause: I just ran the game, chose play in the menu, chose a track, and played it a little.
What is interesting is that (summing both glDrawXXX) there are 137928+7109 draw commands for 538 frames (the number of SwapBuffers calls), which makes 269 draw commands per frame! (For comparison, Quake3 gets 30 draw commands per frame!) That is of course with the menu and "a little playing" in the game itself. But when I play more in the game itself, the results are even worse: 489 draw commands per frame!
Also, the problem I see is that this STK version runs pretty fast on an old 1 GHz PC, so on our 2 GHz machines (even taking into account that it is CPU limited) it should surely be fast too.
It all looks like something is blocking the speed there, and I can't work out what exactly. Probably such a high number of draw commands per frame is the key to the slowdowns here, and they somehow need to be batched or so. Though of course that doesn't explain why the MiniGL build and the gl4es one are exactly the same speed in most cases.
Edit: and yeah, we probably need to add all OGLES2/Warp3D functions to the profiling, so that we get a real, full profile.
@Capehill I ran the latest glSnoop over STK, and as we can now use just "PROFILE" without tracing, I can play one full track and so profile everything that happens there.
So, here are the results:
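The draw-commands-per-frame figure can be recomputed directly from the profiler text. Below is a small Python sketch that assumes the glSnoop line layout shown in the tables above and treats one SwapBuffers call as one frame; the function names are made up for this example.

```python
import re

# Matches lines such as:
# "-> DrawArrays callcount 137928, duration 4977.818102 milliseconds, 59.32 % of total"
LINE_RE = re.compile(
    r"->\s+(\w+)\s+callcount\s+(\d+),\s+duration\s+([\d.]+)\s+milliseconds"
)


def parse_profile(text):
    """Return {function name: (call count, duration in ms)}."""
    return {
        m.group(1): (int(m.group(2)), float(m.group(3)))
        for m in LINE_RE.finditer(text)
    }


def draws_per_frame(profile):
    """Draw calls divided by SwapBuffers calls (one swap per frame)."""
    draws = profile["DrawArrays"][0] + profile["DrawElements"][0]
    return draws / profile["SwapBuffers"][0]


STK = """
-> DrawArrays callcount 137928, duration 4977.818102 milliseconds, 59.32 % of total
-> DrawElements callcount 7109, duration 1727.744436 milliseconds, 20.59 % of total
-> SwapBuffers callcount 538, duration 11.084613 milliseconds, 0.13 % of total
"""
Q3 = """
-> DrawElements callcount 16661, duration 15241.116213 milliseconds, 94.09 % of total
-> DrawArrays callcount 442, duration 350.937282 milliseconds, 2.17 % of total
-> SwapBuffers callcount 559, duration 12.450375 milliseconds, 0.08 % of total
"""

print(f"STK: {draws_per_frame(parse_profile(STK)):.1f} draw calls per frame")
print(f"Q3:  {draws_per_frame(parse_profile(Q3)):.1f} draw calls per frame")
```

With the numbers from the two short runs this gives roughly 269.6 for STK and 30.6 for Quake3, in line with the rounded figures quoted in the post.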
OpenGL ES 2.0 profiling results for Shell Process 'supertuxkart_gl4es_1915':
--------------------------------------------------------
-> DrawElements callcount 94606, duration 28722.920640 milliseconds, 50.61 % of total
-> DrawArrays callcount 1016825, duration 22311.293700 milliseconds, 39.31 % of total
-> Clear callcount 12143, duration 2780.711874 milliseconds, 4.90 % of total
-> CompileShader callcount 36, duration 1208.309059 milliseconds, 2.13 % of total
-> VertexAttribPointer callcount 3399766, duration 802.878293 milliseconds, 1.41 % of total
-> UseProgram callcount 50476, duration 390.844087 milliseconds, 0.69 % of total
-> TexImage2D callcount 1256, duration 373.238160 milliseconds, 0.66 % of total
-> BindTexture callcount 56723, duration 79.214340 milliseconds, 0.14 % of total
-> SwapBuffers callcount 2366, duration 70.471067 milliseconds, 0.12 % of total
-> DeleteTextures callcount 142, duration 5.362273 milliseconds, 0.01 % of total
-> EnableVertexAttribArray callcount 34347, duration 5.205077 milliseconds, 0.01 % of total
-> GenTextures callcount 142, duration 1.960581 milliseconds, 0.00 % of total
-> TexParameteri callcount 426, duration 0.439026 milliseconds, 0.00 % of total
-> ShaderSource callcount 36, duration 0.287364 milliseconds, 0.00 % of total
Total recorded duration 56753.135542 ms, 47.12 % of total 120434.271123 ms
--------------------------------------------------------
Warp3D Nova profiling results for Shell Process 'supertuxkart_gl4es_1915':
--------------------------------------------------------
-> BufferUnlock callcount 1212921, duration 13976.371496 milliseconds, 39.21 % of total
-> DrawArrays callcount 1016825, duration 7000.779364 milliseconds, 19.64 % of total
-> DrawElements callcount 94606, duration 5845.702049 milliseconds, 16.40 % of total
-> Clear callcount 12143, duration 2608.353250 milliseconds, 7.32 % of total
-> Submit callcount 132308, duration 2595.274893 milliseconds, 7.28 % of total
-> CompileShader callcount 36, duration 1110.067811 milliseconds, 3.11 % of total
-> VBOLock callcount 1143981, duration 1038.540442 milliseconds, 2.91 % of total
-> VBOSetArray callcount 3576550, duration 1037.991138 milliseconds, 2.91 % of total
-> CreateVertexBufferObject callcount 826, duration 181.778321 milliseconds, 0.51 % of total
-> BindVertexAttribArray callcount 3531305, duration 163.687051 milliseconds, 0.46 % of total
-> DestroyVertexBufferObject callcount 826, duration 70.044272 milliseconds, 0.20 % of total
-> BindTexture callcount 62278, duration 10.223042 milliseconds, 0.03 % of total
-> SetShaderPipeline callcount 50477, duration 6.143923 milliseconds, 0.02 % of total
-> Destroy callcount 1, duration 2.250832 milliseconds, 0.01 % of total
-> FBBindBuffer callcount 2, duration 0.032883 milliseconds, 0.00 % of total
-> DestroyFrameBuffer callcount 2, duration 0.018607 milliseconds, 0.00 % of total
-> CreateFrameBuffer callcount 1, duration 0.002647 milliseconds, 0.00 % of total
-> SetRenderTarget callcount 2, duration 0.000802 milliseconds, 0.00 % of total
Total recorded duration 35647.262822 ms, 29.79 % of total 119664.569234 ms
--------------------------------------------------------
That is 469 draw calls per frame (!). Visually, when you play, it varies from 5 to 20 fps: 5 fps when many cars are close to you, and 20 fps when you drive alone on the track.
What I also notice now (after being able to play a full track, thanks to the new PROFILE option in glSnoop) is that the actual timewaster is W3DN_BufferUnlock() (taking into account that we have already added the most used functions, and that this % does not include the OGLES2 values).
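To put the W3DN_BufferUnlock() figure in proportion, the totals from the Warp3D Nova table of the full-track run can be divided by the call count and by the number of frames. A rough Python sketch, with the numbers copied from that profile and 2366 SwapBuffers calls taken as the frame count:

```python
FRAMES = 2366  # SwapBuffers callcount from the OGLES2 table of the same run

# (callcount, total duration in ms), copied from the Warp3D Nova table
NOVA = {
    "BufferUnlock": (1212921, 13976.371496),
    "DrawArrays": (1016825, 7000.779364),
    "DrawElements": (94606, 5845.702049),
}


def per_call_us(calls, total_ms):
    """Average cost of one call in microseconds."""
    return total_ms * 1000.0 / calls


def per_frame_ms(total_ms, frames=FRAMES):
    """Average time spent in a function per rendered frame, in ms."""
    return total_ms / frames


for name, (calls, total_ms) in NOVA.items():
    print(f"{name:13s} {per_call_us(calls, total_ms):7.2f} us/call "
          f"{per_frame_ms(total_ms):6.2f} ms/frame")
```

By this estimate a single BufferUnlock costs only about 11.5 microseconds, but with more than 500 of them per frame it adds up to roughly 5.9 ms of every frame.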
Quote:
That is 469 draw calls per frame (!). Visually, when you play, it varies from 5 to 20 fps: 5 fps when many cars are close to you, and 20 fps when you drive alone on the track.
Back in August of 2016, the Sam460ex managed about 25.6k draw-calls/s (measured using Daniel's boingtest).** IIRC, the X1000 managed about double that, but I can't find the data to confirm.
Your test had about 9.2K draw calls/s with the driver using about 47.12% of CPU time (scales up to about 19.6K calls/s if GLES2 + Nova were using 100% of CPU time).
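For reference, here is how those two figures fall out of the profile, as a short Python sketch. It takes the OGLES2 context lifetime (120434.271123 ms) as the run length and the 47.12 % OGLES2 share as the driver's CPU fraction, which is my reading of the numbers rather than something stated explicitly.

```python
draw_calls = 94606 + 1016825               # DrawElements + DrawArrays
context_lifetime_s = 120434.271123 / 1000  # OGLES2 context lifetime
driver_share = 0.4712                      # "47.12 % of total" for OGLES2

# Draw calls actually achieved per second of context lifetime
calls_per_second = draw_calls / context_lifetime_s

# Extrapolation: what if GLES2 + Nova had 100 % of the CPU time?
scaled_calls_per_second = calls_per_second / driver_share

print(f"{calls_per_second:.0f} draw calls/s measured")
print(f"{scaled_calls_per_second:.0f} draw calls/s if scaled to 100 % CPU")
```

This prints roughly 9.2K and 19.6K draw calls per second.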
The number of draw calls/s goes down as the amount of data that's transferred to VRAM increases. So, more shader constants will slow it down, as will more and more vertices. This is why developers should use VBOs instead of vertex arrays...
What system was this test on? If it's a Sam460, then the draw-calls/s bottleneck is probably a big factor. If you're using an X1000 or X5000, then the number of vertices/s being copied to VRAM is probably having a sizeable effect.
Quote:
What I also notice now (after being able to play a full track, thanks to the new PROFILE option in glSnoop) is that the actual timewaster is W3DN_BufferUnlock() (taking into account that we have already added the most used functions, and that this % does not include the OGLES2 values).
That's to be expected. Data is copied to VRAM when the buffer is unlocked, so this is probably the slowest operation. We still lack GART support, so the transfer is slow...
Hans
** Boingtest uses VBOs, so this can be considered the draw-calls/s limit back in 2016 in the absence of large vertex data-sets being uploaded to VRAM. Things have been optimized a bit since then, but lack of GART support is the biggest factor limiting the draw-calls/s limit.
Quote:
Back in August of 2016, the Sam460ex managed about 25.6k draw-calls/s (measured using Daniel's boingtest).** IIRC, the X1000 managed about double that, but I can't find the data to confirm.
Your test had about 9.2K draw calls/s with the driver using about 47.12% of CPU time (scales up to about 19.6K calls/s if GLES2 + Nova were using 100% of CPU time).
The number of draw calls/s goes down as the amount of data that's transferred to VRAM increases. So, more shader constants will slow it down, as will more and more vertices. This is why developers should use VBOs instead of vertex arrays...
What system was this test on? If it's a Sam460, then the draw-calls/s bottleneck is probably a big factor. If you're using an X1000 or X5000, then the number of vertices/s being copied to VRAM is probably having a sizeable effect.
I am on an X5000, so that means those ~500 draw calls per frame are not what makes it slow, as there is still a lot of headroom left.
But for comparison, Quake3 does only 30 draw calls per frame, so probably such a difference still matters too.
Quote:
That's to be expected. Data is copied to VRAM when the buffer is unlocked, so this is probably the slowest operation. We still lack GART support, so the transfer is slow...
Yes, a lot of time is spent in the creation of buffers. Is this operation really necessary? I mean, can't Nova work without buffers (with just main CPU memory) via some environment setting or something?
And yeah, as usual I will state the obvious: GART was and still is much more important than other things, but of course those who pay make the decisions :)
Btw, ptitSeb also built that version of SuperTuxKart for tests, and on the Pandora he also gets a very low framerate. He made a GL capture on his side, and it seems there are a lot of glCallList(...) calls involved. If each glDrawArrays(...) is in a list, then the BATCH mode will not merge them => slow.
Also, regarding the lack of GART on our side being the unfortunate reason for slow gameplay in STK 0.6.2a, he says:
--- Mmm. I see, but that's unfortunate. Most of those legacy games are not using VBOs, so they need to transfer data every time (and even if they were using VBOs, gl4es would not anyway). VBOs are a bit special. Testing on the Pandora showed me that the use of VBOs actually slowed things down in a graphics-intensive game (that was Doom III). And because gl4es is targeted at embedded SoCs, and SoCs mostly have shared memory for VRAM, I'm still not convinced VBOs would speed up anything. I still have some ideas on where to use VBOs easily, but I don't have much incentive to spend time on something that I'm not confident will improve anything (except complexity in gl4es). ---
Quote:
Yes, a lot of time is spent in the creation of buffers. Is this operation really necessary? I mean, can't Nova work without buffers (with just main CPU memory) via some environment setting or something?
Yes, it's necessary. Firstly, you need GART support for the GPU to read directly from main memory. Otherwise, you need to copy stuff over. Next, the GPU is little-endian so at a bare minimum all data needs to be byte-swapped.
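As an illustration of the byte-swapping part only, here is a minimal Python sketch that converts a buffer of 32-bit values from big-endian (as laid out in PowerPC main memory) to little-endian. It is not how the Nova driver does it; it just shows why every element has to be touched on the way to VRAM. The function name is made up for the example.

```python
import struct


def swap_to_little_endian_32(big_endian_bytes):
    """Reverse the byte order of every 32-bit element in the buffer."""
    if len(big_endian_bytes) % 4 != 0:
        raise ValueError("buffer length must be a multiple of 4 bytes")
    count = len(big_endian_bytes) // 4
    # Reinterpret as unsigned 32-bit integers so the bit pattern is kept exactly
    words = struct.unpack(f">{count}I", big_endian_bytes)
    return struct.pack(f"<{count}I", *words)


# Three vertex coordinates as a big-endian CPU would store them
vertices_be = struct.pack(">3f", 1.0, 2.0, 3.0)
vertices_le = swap_to_little_endian_32(vertices_be)

print(vertices_be.hex())  # big-endian layout
print(vertices_le.hex())  # same values, little-endian layout
print(struct.unpack("<3f", vertices_le))
```

The swap width has to match the element size: 16-bit indices, for example, would need a 2-byte swap instead.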
Quote:
Also, regarding the lack of GART on our side being the unfortunate reason for slow gameplay in STK 0.6.2a, he says:
--- Mmm. I see, but that's unfortunate. Most of those legacy games are not using VBOs, so they need to transfer data every time (and even if they were using VBOs, gl4es would not anyway). VBOs are a bit special. Testing on the Pandora showed me that the use of VBOs actually slowed things down in a graphics-intensive game (that was Doom III). And because gl4es is targeted at embedded SoCs, and SoCs mostly have shared memory for VRAM, I'm still not convinced VBOs would speed up anything. I still have some ideas on where to use VBOs easily, but I don't have much incentive to spend time on something that I'm not confident will improve anything (except complexity in gl4es). ---
That makes sense assuming that the Pandora and other hardware ptitSeb is testing on have shared graphics memory (I know the Raspberry Pi has shared graphics memory). In that case, yes, VBOs won't help. That's because VRAM is just a part of main memory, and so the GPU's access speed is the same as for the rest of RAM.
On desktop systems it's different. The GPU has very fast dedicated VRAM, and everything else has to be copied across the PCIe bus. VRAM access is noticeably faster than PCIe access, so VBOs make a big difference.
With Prototype (I noticed this even before, I just didn't report it): when you run glSnoop, I can see some strange visual distortion while playing the actual game: a vertical line near the ship, not fully black, appears from time to time. When I first saw it I feared it was a bug in the game that had gone unnoticed, but then I ran it without glSnoop, and there is no such line. So I conclude that it may have something to do with the patching of framebuffer-related functions, etc. But I can be wrong of course, and it may be unrelated.
As for foobillard++, yeah, I have the same. And when I do just PROFILE (for speed) and play the game, I see no visual glitches (I was hoping it would be the same as with Prototype).
The good thing is that now, even if the GUI window is not active, it refreshes the error count once errors arise (for foobillard++ it shows 672 GL errors and 0 for Nova). For Prototype it shows 0 GL errors and 4 W3D errors, which brings me to the next enhancement, if you don't mind: once errors arise in OGLES2 or in W3DN, maybe show a button nearby like "Show errors"; pressing it would just open a window with the error list only, i.e. not a full logging/tracing copy but just a clean summary (for the full details the user can check the full logs). What do you think?
Also, can you clear up a few things for me if you have time :) :
Let's say we have "44.17% of total context life-time" for Warp3DNova and "60.54% of total context life-time" for OGLES2. Does that mean we have a full 100%, of which 40 goes to Warp3DNova and 60 to OGLES2? Or does it mean that we only catch 60% of the whole (with 40% just untraced/unpatched at the moment), and that those traced 60% include the 44% of Warp3DNova inside (so actual OGLES2 use is about 16%)?
I mean, maybe something cleaner and more general could be split out into another field, not related to OGLES2/Warp3D, but just a general summary of everything at once. I am just starting to get a bit lost in this % which is a % of a % of a % :))
Regarding errors, the simplest might be to display error count in one column at the end.
"44 % of total context life-time" means that if context exists for 100 milliseconds, then known calls used 44 milliseconds of that.
Let's assume for simplicity that both the OGLES2 and Nova contexts existed for exactly the same duration, 1000 ms. Now if OGLES2 spends 80 % of (its) total context life-time and Nova 40 % of (its own) total context life-time, then OGLES2 and Nova actually used the same amount of processing time, because the OGLES2 figure includes Nova (so you can do 80 % - 40 % = 40 %, and both use the same amount of CPU time).
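The same arithmetic as a small Python sketch, under the simplifying assumptions stated above: both contexts live equally long, and all Nova time is spent inside the profiled OGLES2 calls. The function and key names are invented for the example.

```python
def breakdown(lifetime_ms, ogles2_pct, nova_pct):
    """Split a context lifetime into Nova, OGLES2-only and untracked time."""
    ogles2_ms = lifetime_ms * ogles2_pct / 100.0  # includes time spent in Nova
    nova_ms = lifetime_ms * nova_pct / 100.0
    return {
        "nova_ms": nova_ms,
        "ogles2_only_ms": ogles2_ms - nova_ms,     # OGLES2 without the Nova part
        "untracked_ms": lifetime_ms - ogles2_ms,   # everything glSnoop did not see
    }


# The 1000 ms example: 80 % in OGLES2 (inclusive), 40 % in Nova
example = breakdown(1000.0, 80.0, 40.0)
print(example)

# The figures from the question: 60.54 % OGLES2, 44.17 % Nova
question = breakdown(1000.0, 60.54, 44.17)
print(question)
```

For the 1000 ms example this yields 400 ms in Nova, 400 ms in OGLES2 itself and 200 ms untracked. For the 60.54 % / 44.17 % case, OGLES2 itself accounts for about 16 % of the lifetime.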
Let's see if it can be simplified somehow. (my head is spinning too)
Quote:
"44 % of total context life-time" means that if context exists for 100 milliseconds, then known calls used 44 milliseconds of that.
After checking almost all the OGLES2 stuff we have, I can say that most of the time OGLES2 takes around 40-60%, so usually some functions somewhere else eat a lot of time.
Quote:
Let's see if it can be simplified somehow. (my head is spinning too)
At least I get it now, so even if it stays the same, others can get it by reading that info (we can put it in the docs/readme/guide/etc).
Added a check for the W3DN_Submit() return value, and it seems that Foobillard++ and Prototype have one-shot errors where Submit (the first one, maybe?) returns 0.
Check the errCode value. If it's W3DNEC_QUEUEEMPTY, then it's not really an error. It just means that you submitted an empty queue. Otherwise, errCode should give you a hint as to what's wrong.