@kas1e Those two examples trigger one of the many performance boosters in ogles2.library (without which things would crawl) to resize the internal multibuffer's VBOs to rather huge sizes. I have now added a simple safety mechanism to avoid such situations. This quick workaround comes at a rather big performance penalty, though (only) for such situations. I will improve that for the next official ogles2 version.
@Hans However, it's important to note that there is no memory leak or bug in ogles2.library! All those allocations are legal, and Nova always reports "success". What is causing the crash here is something else, pretty interesting actually:
As said, all VBO allocations are reported as successful by Nova. This is true even if the physical gfx memory is already practically depleted. The stuff still runs, albeit with a significant slowdown. Something else causes Nova to freeze the system then, namely a call to DestroyVertexBufferObject. This is what happened in ogles2: when gfx RAM was already in critical territory, yet another VBO was in the process of being resized - which involves destroying it before recreating it - bang. Note that the call to DestroyVertexBufferObject is fully legal and the respective VBO is not in use anywhere.
So, to sum it up: Nova freezes if you call DestroyVertexBufferObject when gfx RAM is low, at least.
EDIT: filed this as new W3D Nova bug report 0000447.
@All Ok, so for the Irrlicht example called TerrainRendering we are done: Daniel added a workaround for the Nova bug (freezing when you call DestroyVertexBufferObject while gfx RAM is low), as well as the safety mechanism to avoid such situations.
But now more news: PtitSeb added real use of VBOs for glBindBuffer() (at the moment only for that), so all the apps/games which use it a lot benefit from it. For example, in FrickingShark I got +10 fps (not a lot, but still), but the TerrainRendering example now runs at 550 fps instead of 50 (!) - ten times faster.
In the case of the TerrainRendering example, it was also slow on Linux with gl4es, so it was gl4es slowing things down this time. But now all Irrlicht examples on Linux with gl4es run at almost the same speed as on Linux with Mesa, so the layer doesn't add speed issues anymore.
Now, probably the last thing I want to understand (and maybe it's possible to speed it up somehow) with Irrlicht is why some Irrlicht examples are 2-3 times slower than on a 10-year-old 1.6 GHz AMD with some shitty built-in gfx card.
To explain it more: in Irrlicht we have a few different rendering modes: OpenGL and software rendering.
Now, let's see the software rendering figures on 3 different machines:
As we can see, the X5000 is, by hardware specs, on the level of the 10-year-old 1.6 GHz AMD notebook.
This time we didn't take into account OpenGL, Warp3D or anything of that sort. It is pure software rendering. The only things that have an impact there are the CPU, graphics.library and the RadeonHD driver.
Though if you think more about it, it's probably ok. Everyone knows the X5000 is 10 years behind the current computer world, so that's kind of ok.
Now, with those results in mind, and knowing that we are on the level of the 1.6 GHz AMD, we expect the X5000 with OpenGL rendering to be at least on the same level (of course, as we have the better gfx card, we would wish it to be faster, but it's ok to just be on the same level).
And here is the table with OpenGL. The same 3 machines are used:
I also added results from the i5 (the first config) under Linux with Mesa and under Linux with gl4es (to show that gl4es is ok and on the same level as Mesa, so we can't blame it for the issues I will point out now). I also added MiniGL results (I somehow managed to make Irrlicht work with MiniGL; it often crashes with it, has rendering bugs, and some examples didn't work either, but it's still something to compare with).
What can we say there? First of all, MiniGL sucks. Only 2 examples are at least on the same speed level (09.MeshViewer and 20.ManagedLights). The others are just too slow indeed.
The next thing we can notice is that the X5000 with gl4es is, in some examples, pretty much faster than the old AMD notebook (at least that's what we expect when we have a modern RadeonHD), like 03.CustomSceneNode and 04.Movement, and most of the others are, if not on the same level, a little bit faster here and there.
The issue I see now is that some examples show pretty degraded results, which I want to discuss and find out why: maybe it will again be possible to speed things up somehow/somewhere, or at least we can find out WHY.
The examples I'm talking about are:
02.Quake3Map (50% slower than even the 10-year-old AMD with its shitty Intel gfx card)
16.Quake3MapShader (again 50% slower than the 10-year-old notebook, but I assume it's the same issue as with the first example). With this example, though, DISABLING compositing makes it run at 90 fps, while with compositing ENABLED it is around 60-65 (maybe that points at something).
18.SplitScreen (3 times slower! This example again loads the quake3 map and splits the screen into 4 parts, with rendering happening independently in each).
So those 3 examples probably share the same single issue (I hope), as all of them use the quake3 map.
And, last one:
20.ManagedLights. That one runs at the same speed as the MiniGL one, nothing changes, so I assume this time it's the GART issue, or the non-DMA graphics.library on the X5000.
Now, to have something to discuss, I first uploaded all those 4 examples ready to run, so everyone can try:
I am interested in whether any X1000 user (i.e. with DMA in graphics.library) can run them and report the maximum fps they get in all of them (the FPS is written in the window title). Then we can rule DMA in or out. Quitting the examples can be done via the close gadget or via "alt+f4". Run them from the directory where they are (bin/amigaos4), as they want the root's "media" directory.
Next, I made a trace/profile for all those 4 examples via today's glSnoop, which catches almost everything now (Capehill, thank you very much for that!). The profilings for both warp3dnova and ogles2 are at the end of the files.
To be honest, the capture is very clean. Only 27 drawing calls in the entire frame! No suspicious state changes. Only the bare minimum per drawing call, and that's all. And most drawing calls use a quite simple shader that does multitexture rendering: Vertex shader:
His first guess was that the time difference is the conversion from big-endian to little-endian when sending the ever-changing vertex array data to the GPU.
But yeah, sure that conversion takes place, but it can't take THAT much. I mean, the old 1.6 GHz AMD with its slow Intel gfx card gives us ~350 fps, and our setup ~150 fps. I can't believe that such a conversion can take that much time. It's probably something else?
Then we checked the tracing/profiling logs, and we can see that it's the drawing itself that is the main bottleneck, not the VBO creation/handling, it seems (well, that takes some time too, but it's ok-ish).
And currently we've run out of ideas. It's pretty simple stuff, as can be seen from the log, yet there is a big difference compared to the 10-year-old AMD with its shitty gfx card. So something is wrong somewhere, and we can't see why and where.
@TearsOfMe Thanks. So it's not because of the missing DMA on the X5000. One thing less to check then..
But from the trace logs it's visible that the bottleneck is the drawing function itself. Why, and what makes it THAT much slower than on the old 1.6 GHz AMD with its crappy built-in gfx board, is still unknown.
Are you sure? (When I coded Wazp3D57 with Nova rendering) I tried different methods for updating a VBO, but it seems to be slow.
Perhaps a patched MiniGL that updates the VBO (say) 11 times would allow us to know how much time a VBO update REALLY takes in a REAL program (delta time / 10).
I mean, when I was testing Cow3D on Wazp3D57 -> Nova, it was (say) 80% of a real Warp3D (massive VBO update, but one time), so the bandwidth seems +- ok.
But when I was testing Quake2 (a real-life test) it was 1-2 FPS ... weird.
Another thing: when I was testing a simple raymarching test, I found that Nova GLSL seems to have very strange bugs: I mean, everything is fine when the GLSL code is compiled, but strange artefacts appear, as if the GLSL code computes badly at some pixels (like an FPU rounding bug) in the frag shader.
Quote:
Are you sure? (When I coded Wazp3D57 with Nova rendering) I tried different methods for updating a VBO, but it seems to be slow.
In the tracer logs I posted, you can see two tables at the end: the warp3dnova profile and the ogles2 profile. In the warp3d one you can see how much time each warp3dnova function takes. VBO creation there is ok-ish; it doesn't take all the time.
Besides, VBOs are used everywhere in other games/examples, but only those examples are that slow. So they seem to do something that makes our drivers that slow compared with even just the 1.6 GHz AMD with its built-in Intel. I'm not even talking about modern computers; being THAT slow shows that something is really wrong somewhere.
Quote:
Also when I was testing a simple raymarching test, I found that Nova GLSL seems to have very strange bugs: everything is fine when the GLSL code is compiled, but strange artefacts appear, as if the GLSL code computes badly at some pixels (like an FPU rounding bug) in the frag shader.
Nova's shader compiler is still WIP, but lately it's been getting better and better. Sadly it has optimisation disabled because of some bug, but visual bugs can be checked and reported, so Hans may fix them.
If you have that raymarching test, we can test it on the latest Nova, then test it with ogles2 on Windows (to be sure your code is correct), and then create a bug report. I can help with the tests (but let's create another topic about it?).
Dunno, though, what that can tell us.. I mean, this time it still doesn't explain why it is that much slower compared with the old 1.6 GHz AMD with its shitty gfx card.
@Capehill Talking with ptitSeb about all that, this is what he says:
--- VBOs are one thing: vertex data in VRAM, i.e. in the graphics card's memory, ready to use. So the main thing that consumes time is the transfer of the data to VRAM. If you reuse that data, you just activate the VBO; no need to transfer the data again.
So when a software/game uses a VBO, it creates one, fills it with data, and then simply uses the transferred data (sometimes it changes part of the VBO, to update some of the data).
The transfer of data traditionally happens at the "Unlock" part of the VBO (Lock gives you an address where to put the data, Unlock transfers from that address to VRAM).
On the Amiga, the transfer to VRAM can be slow if you don't have some kind of DMA for it (that's the first thing), and also, all data needs to be in little-endian, because the graphics card is little-endian (so you need to analyse the VBO to know which data needs swapping and which doesn't).
So yes, I think this VBO transfer can be a bottleneck. ---
But we also tested with working DMA in graphics.library on the X1000, and the results are still the same. Maybe by DMA he means something like GART there, dunno. That DMA in graphics.library is probably a different kind of DMA than what is expected for VRAM transfers?
Interestingly also, in the documentation about warp3dnova's BufferUnlock there is no mention of any big->little endian conversion ..
Even if graphics.library has DMA support on the X1000, it "only" has the capability to transfer bitmap data to VRAM (as far as I know). I suppose Nova needs a way to pull data from RAM using DMA, and this is what GART would provide.
@Capehill But isn't "bitmap" are the same data ? I mean , if there is DMA to transfer bitmap to VRAM (and bitmap are usuall data, or not ?) , then it should the same transfer any other data to VRAM ?
Or is the DMA in graphics.library only for transferring to VRAM some specific data which no one uses? :)
> the same data ? I mean , if there is DMA to transfer bitmap to VRAM
Yes. For Wazp3D57 I made some code that hacks a bitmap transfer so it can be used to copy vertices to a VBO. (Ouch, what a hack, but it works.)
It changed almost nothing speed-wise on the Sam460 & X5000, but I don't know if those machines use DMA for that.
>about BufferUnlock of warp3dnova, there is no mention about any big->little endian conversion
IMHO, what I understood: there are different methods for updating the VBO, but in fact Nova just reads and/or writes the data. When you Lock, it (can) read/reorder the VBO data into a buffer that you will access. When you Unlock, it (can) write/reorder the buffer back into the VBO data. As the reordering is done on Nova's side, you never access the real data in GPU VRAM, only a reordered buffer.
You can also do write-only (i.e. write all new vertex values from your buffer), read-only (i.e. read some GPU data), or read/write (i.e. change some vertices).
Certainly, letting Nova do the reordering was not a good idea, as the data is then accessed several times (vs. a CPU that writes the reordered data directly to real GPU VRAM).
See the Nova doc below:
// W3DN_STATIC_DRAW:  Written: (CPU) once         Read: rendered many times
// W3DN_STATIC_READ:  Written: (GPU) once         Read: CPU many times
// W3DN_STATIC_COPY:  Written: (GPU) once         Read: rendered many times
// W3DN_DYNAMIC_DRAW: Written: (CPU) occasionally Read: rendered many times
// W3DN_DYNAMIC_READ: Written: (GPU) occasionally Read: CPU many times
// W3DN_DYNAMIC_COPY: Written: (GPU) occasionally Read: rendered many times
// W3DN_STREAM_DRAW:  Written: (CPU) frequently   Read: rendered a few times
// W3DN_STREAM_READ:  Written: (GPU) frequently   Read: CPU a few times
// W3DN_STREAM_COPY:  Written: (GPU) very often   Read: rendered a few times
Hans explained those things a bit to me, and this is what I understand:
That "DMA support in graphics.library" in Sam440, Sam460 & x1000 is not real DMA , its just a hack, which used their CPU's DMA to speed up those ram->vram transfers , but only for internal use inside of graphics.library , and only for bitmaps. I.e. its "DMA", but CPU's DMA and not real DMA or GART of anything of that sort. Just some little "speed up".
In other words, it is of no use for drivers (like MiniGL, warp3dnova, the Radeon drivers, etc). Moreover, it only helps those apps which use graphics.library and rely on the parts which are "hacked" inside graphics.library to speed things up (namely those graphics.library bitmaps).
It probably helps somewhere, and not only in benchmarks (at least it wasn't added for no reason), but it doesn't help at all with the drivers, which explains why we didn't see a single difference between the X5000 (without that "CPU DMA hack") and the X1000 (with it) when testing the Irrlicht engine examples. Those examples are built on gl4es, which works over ogles2.library, which works on top of warp3dnova, which in turn doesn't have any kind of DMA acceleration for RAM->VRAM transfers. Only a proper implementation of GART can help there.
Quote:
Certainly let Nova do the reordering was not a good idea as datas are then acessed several times (vs a cpu that will write to real GPU vram directly the reordered datas)
As far as I'm aware now, warp3dnova's BufferUnlock() not only writes from RAM to VRAM, but also does endian conversion from big-endian to little-endian (as the gfx card is little-endian). So we have 2 stop factors there:
1. No real DMA (GART) is used, which means transfers from RAM->VRAM are slow. We should be happy we even have something usable without it. We at least have 100 fps in quake3 without GART, which is for sure not bad.
2. The endian conversion inside BufferUnlock() may slow everything down, especially if it wasn't compiled with -O3 optimisation enabled for whatever reason. As it also means buffers and working with them, etc., it's quite possible that it adds another bottleneck.
All of this probably explains well why some things work at almost the same speed on MiniGL and on gl4es, like quake3, lugaru and supertuxkart: all those games do a lot of drawing per frame, which is speed-limited by the missing GART, so both minigl.library (which works over warp3dnova) and gl4es are limited by the same thing.
But that code which writen "right", i.e. not thousands of draw calls per frame with lots of data, those ones speeduped well by usage of VBO and co.
Quote:
Interestingly also, in the documentation about warp3dnova's BufferUnlock there is no mention of any big->little endian conversion
That's right; that information is not in the Unlock function's description but in the Lock function's.
In VBOLock's description it is mentioned that you must call VBOSetArray first to tell Nova the data types and other info, so that it knows for which parts of the buffer's data it has to perform endian conversion. Although it's not explicitly stated, you may safely assume that the actual conversion is then done inside BufferUnlock (and then back again in VBOLock if a read is requested, unless the driver keeps a copy of the original data; no idea if it does). If the conversion took place later, internally, then this info in the docs wouldn't make sense, because then you could just as well tell Nova the data layout later (prior to first use).
@kas1e @thellier The fun part with that VBOSetArray convention: you can (ab)use it to trick Nova into not doing its slow endian conversion. Simply tell it beforehand that the VBO is just a package full of plain bytes.
But keep your seat: yes, it works, but unfortunately another Nova slowdown area kills the potential gain again. You may remember that ogles2 contains a workaround for plain byte stuff like RGBA8 data: I found the upload of such endian-free simple data to be so extremely dead-slow, for unknown reasons, that the lib converts it to RGBA float32 data... You'd expect that to be much slower (the 4x byte-to-float conversion, 4x as much data to transfer), but it's muuuuuuch faster than letting Nova do the simple job on the plain bytes.
So, unfortunately, avoiding Nova's endian conversion also means switching to that slow byte-data layout, so we end up with a net fps loss. Damn. I will experiment a bit more, but this one looks like a dead end. So we'll have to wait for Hans to improve the speed and eventually also implement that requested manual endian-conversion.
Quote:
The fun part with that VBOSetArray convention: you can (ab)use it to trick Nova into not doing its slow endian conversion. Simply tell it beforehand that the VBO is just a package full of plain bytes.
But keep your seat: yes, it works, but unfortunately another Nova slowdown area kills the potential gain again. You may remember that ogles2 contains a workaround for plain byte stuff like RGBA8 data: I found the upload of such endian-free simple data to be so extremely dead-slow, for unknown reasons, that the lib converts it to RGBA float32 data... You'd expect that to be much slower (the 4x byte-to-float conversion, 4x as much data to transfer), but it's muuuuuuch faster than letting Nova do the simple job on the plain bytes.
Huh? If a VBO contains only uint8 data, then it should be using a straight copy routine (one that uses doubles if possible).
You do need to make sure that *all* VBO arrays are 8-bit or disabled (W3DNEF_NONE), otherwise it'll fall through to the complex case of handling mixed data.
Quote:
You do need to make sure that *all* VBO arrays are 8-bit or disabled (W3DNEF_NONE), otherwise it'll fall through to the complex case of handling mixed data.
No, unfortunately it's not like this. Even setting all "unused" arrays to W3DNEF_NONE and size/stride 0 doesn't change anything. The only thing that helps is to create a simple 1-array VBO in the first place. Which I should have done and usually do for pure index VBOs, but which indeed wasn't enforced in this case here; thanks for pointing me at it.
And oh yes, that makes a difference indeed! However, not for the "own" vs. "Nova" endian conversion; there's no measurable difference in this simple 1-array scenario.
But, damnit, all this revealed again just how slow Nova's buffer copy becomes as soon as you don't have the most trivial 1-array VBO layout! Here it's the difference between 7 and 30 fps! And this happens for every VBO you create with 2 or more arrays inside.
Now the thing is: obviously there is huge optimization potential here. Whatever you do in your multi-array copy function, it's very bad. And if ogles2 could get rid of it, that would result in an incredible speedup for sure.
But unless you make VBOSetArray with W3DNEF_NONE work as you described above, I cannot implement it, because obviously a 1-array VBO is useless in that case. Or is any special parameter combination required for VBOSetArray with W3DNEF_NONE to make it work as promised?