@TearsOfMe Thanks. So it's not because of missing DMA on the X5000. Less to check then.
But from the trace logs it is visible that the bottleneck is the drawing function itself. Why that is, and what makes it THAT much slower than on an old 1.6 GHz AMD with a crappy built-in gfx board, is still unknown.
Are you sure? (When I coded Wazp3D57 with Nova rendering) I tried different methods for updating a VBO, but it seems to be slow.
Perhaps a patched MiniGL that updates the VBO (say) 11 times instead of once would tell us how long a VBO update REALLY takes in a REAL program (delta time / 10).
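The "delta time / 10" idea can be sketched in plain C. This is only an illustration under stated assumptions: `update_fn` is a hypothetical callback standing in for whatever one VBO update (lock, fill, unlock) does in the patched library, and `clock()` is used as a generic timer, not as anything MiniGL or Nova provides.

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical callback: performs one VBO update (lock, fill, unlock). */
typedef void (*update_fn)(void *user);

/* Runs `update` `count` times and returns the elapsed CPU time in seconds. */
double time_updates(update_fn update, void *user, int count)
{
    clock_t start = clock();
    for (int i = 0; i < count; i++) {
        update(user);
    }
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* With 1 update the frame costs t_one, with 11 updates it costs t_eleven.
   The 10 extra updates account for the difference, so one update costs
   (t_eleven - t_one) / 10, independent of the rest of the frame. */
double per_update_cost(double t_one, double t_eleven)
{
    return (t_eleven - t_one) / 10.0;
}
```

Measuring the difference rather than a single update keeps the fixed per-frame cost (draw calls, game logic) out of the result.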
I mean, when I was testing Cow3D on Wazp3D57 -> Nova, it ran at (say) 80% of a real Warp3D (massive VBO update, but only once), so bandwidth seems more or less OK.
But when I was testing Quake2 (real-life test) it was 1-2 FPS... weird.
Another thing: when I was testing a simple raymarching test, I found that Nova GLSL seems to have very strange bugs. Everything is fine when the GLSL code is compiled, but strange artefacts appear, as if the GLSL code were computing badly at some pixels (like an FPU rounding bug) in the frag shader.
Quote:
Are you sure? (When I coded Wazp3D57 with Nova rendering) I tried different methods for updating a VBO, but it seems to be slow.
In the tracer logs I posted, you can see two tables at the end: the warp3dnova profile and the ogles2 profile. In the warp3dnova one you can see which functions take time and how much. Creating the VBO is OK-ish there; it doesn't take all the time.
Besides, VBOs are used everywhere in other games/examples, but only these examples are that slow. So they seem to do something that makes our drivers that slow, even compared with just a 1.6 GHz AMD with built-in Intel graphics. I'm not even talking about modern computers, but being THAT slow shows that something is really wrong somewhere.
Quote:
Also, when I was testing a simple raymarching test, I found that Nova GLSL seems to have very strange bugs. Everything is fine when the GLSL code is compiled, but strange artefacts appear, as if the GLSL code were computing badly at some pixels (like an FPU rounding bug) in the frag shader.
The Nova shader compiler is still WIP, but lately it has been getting better and better. Sadly, it has optimisation disabled because of some bug, but visual bugs can be checked and reported, so Hans can fix them.
If you have that raymarching test, we can test it on the latest Nova, then test it with ogles2 on Windows (to be sure your code is correct), and then create a bug report. I can help with tests (but let's create another topic about it?)
Dunno what that can tell us. I mean, it still doesn't explain why it is that much slower compared with an old 1.6 GHz AMD with a shitty gfx card.
@Capehill I talked with ptitSeb about all that, and this is what he says:
--- VBOs are one thing: vertex data in VRAM, i.e. in the graphics card's memory, ready to use. So the main thing that consumes time is transferring the data to VRAM. If you reuse the data, you just activate the VBO; there is no need to transfer the data again.
So, when a program/game uses a VBO, it creates one, fills it with data, and then simply uses the transferred data (sometimes it changes part of the VBO to update some of the data).
The transfer of data traditionally happens at the "Unlock" part of the VBO (Lock gives you an address where to put the data, Unlock transfers from that address to VRAM).
On the Amiga, the transfer to VRAM can be slow if you don't have some kind of DMA for that (that's the first thing), and all data needs to be little-endian, because the graphics card is little-endian (so you need to analyse the VBO to know which data needs swapping and which doesn't).
So yes, I think this VBO transfer can be a bottleneck. ---
But we also tested with working DMA in graphics.library on the X1000, and the results are still the same. Maybe by DMA he means something like GART, dunno. Is that DMA in graphics.library perhaps a different kind of DMA than the one expected for VRAM transfers?
Interestingly, the warp3dnova documentation for BufferUnlock doesn't mention any big->little endian conversion.
Even if graphics.library has DMA support on the X1000, it "only" has the capability to transfer bitmap data to VRAM (as far as I know). I suppose Nova needs a way to pull data from RAM using DMA, and this is what GART would provide.
@Capehill But isn't a "bitmap" the same kind of data? I mean, if there is DMA to transfer a bitmap to VRAM (and bitmaps are ordinary data, or not?), then it should be able to transfer any other data to VRAM the same way?
Or is the DMA in graphics.library there to transfer only some specific data to VRAM which no one uses? :)
> the same data ? I mean , if there is DMA to transfer bitmap to VRAM
Yes. For Wazp3D57 I wrote some code that hacks a bitmap transfer into copying vertices to a VBO (ooh, what a hack, but it works).
It changed almost nothing speed-wise on Sam460 & X5000, but I don't know if those machines use DMA for that.
>about BufferUnlock of warp3dnova, there is no mention about any big->little endian conversion
IMHO, as I understand it: there are different methods for updating the VBO, but in fact Nova just reads and/or writes the data. When you lock, it (can) read/reorder the VBO data into a buffer that you will access. When you unlock, it (can) write/reorder the buffer to the VBO data. As the reordering is done on Nova's side, you never access the real data in GPU VRAM, only a reordered buffer.
You can also do write-only (i.e. write all new vertex values from your buffer), read-only (i.e. read some GPU data) or read/write (i.e. change some vertices).
Letting Nova do the reordering was certainly not a good idea, as the data is then accessed several times (vs. a CPU that writes the reordered data directly to the real GPU VRAM).
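To illustrate the "write the reordered data directly" idea, here is a minimal sketch in plain C of a single-pass copy that swaps bytes on the way. It does not use any Nova call; `copy_swapped32` and `swap32` are hypothetical helper names, and `dst` is just an ordinary pointer standing in for wherever the data would finally land.

```c
#include <stddef.h>
#include <stdint.h>

/* Reverses the byte order of one 32-bit word (big-endian <-> little-endian). */
uint32_t swap32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) | (v << 24);
}

/* Copies `count` 32-bit words from src to dst, swapping each one in flight.
   Every source word is read once and every destination word is written
   once, so there is no separate "convert" pass followed by a "copy" pass. */
void copy_swapped32(uint32_t *dst, const uint32_t *src, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        dst[i] = swap32(src[i]);
    }
}
```

A float can be passed through the same routine by treating its bits as a `uint32_t`; the swap itself does not care what the 4 bytes mean.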
See the Nova doc below:
// W3DN_STATIC_DRAW:  Written: (CPU) once          Read: rendered many times
// W3DN_STATIC_READ:  Written: (GPU) once          Read: CPU many times
// W3DN_STATIC_COPY:  Written: (GPU) once          Read: rendered many times
// W3DN_DYNAMIC_DRAW: Written: (CPU) occasionally  Read: rendered many times
// W3DN_DYNAMIC_READ: Written: (GPU) occasionally  Read: CPU many times
// W3DN_DYNAMIC_COPY: Written: (GPU) occasionally  Read: rendered many times
// W3DN_STREAM_DRAW:  Written: (CPU) frequently    Read: rendered a few times
// W3DN_STREAM_READ:  Written: (GPU) frequently    Read: CPU a few times
// W3DN_STREAM_COPY:  Written: (GPU) very often    Read: rendered a few times
Hans explained those things to me a bit, and this is what I understood:
That "DMA support in graphics.library" on Sam440, Sam460 & X1000 is not real DMA; it's just a hack that uses the CPU's DMA to speed up RAM->VRAM transfers, but only for internal use inside graphics.library, and only for bitmaps. I.e. it is "DMA", but the CPU's DMA, not real DMA or GART or anything of that sort. Just a little speed-up.
In other words, it is of no use for drivers (minigl, warp3dnova, Radeon drivers, etc.), and what's more, it only benefits apps which use graphics.library and rely on the parts that are "hacked" inside graphics.library to speed things up (namely graphics.library's bitmaps).
Probably it helps somewhere and not only in benchmarks (it wasn't added for nothing), but it doesn't help drivers at all, which explains why we saw no difference at all when testing the Irrlicht engine examples on the X5000 (without that "CPU DMA hack") and on the X1000 (with it). Those examples are built for gl4es, which works over ogles2.library, which works on top of warp3dnova, which in turn doesn't have any kind of DMA acceleration for RAM->VRAM transfers. Only a proper GART implementation can help.
Quote:
Letting Nova do the reordering was certainly not a good idea, as the data is then accessed several times (vs. a CPU that writes the reordered data directly to the real GPU VRAM).
As far as I know now, warp3dnova's BufferUnlock() not only writes from RAM to VRAM, but also does the endian conversion from big-endian to little-endian (as the gfx card is little-endian). So we have two limiting factors here:
1. No real DMA (GART) is used, which means transfers from RAM->VRAM are slow. We should be happy we even have something usable without it. We at least get 100 fps in Quake3 without GART, which is certainly not bad.
2. The endian conversion inside BufferUnlock() may slow everything down, especially if it wasn't compiled with -O3 optimisation for whatever reason. It also means extra buffers, working with them, etc., so it quite possibly adds another bottleneck.
All of this probably explains well why some things run at almost the same speed on minigl and on gl4es, like Quake3, Lugaru and SuperTuxKart: all those games do a lot of drawing per frame, which is limited in speed by the lack of GART, so both minigl.library, which works over warp3dnova, and gl4es are limited by the same thing.
But code which is written "right", i.e. without thousands of draw calls per frame with lots of data, is sped up well by the use of VBOs and co.
Quote:
Interestingly, the warp3dnova documentation for BufferUnlock doesn't mention any big->little endian conversion.
That's right; that documentation is not in the Unlock function's description but in the Lock function's.
In VBOLock's description it is mentioned that you must call VBOSetArray first to tell Nova the data types and other info, so that it knows for which parts of the buffer's data it has to perform endian conversion. Although it's not explicitly stated, you may safely assume that the actual conversion is then done inside BufferUnlock (and then back again in VBOLock if a read is requested, unless the driver keeps a copy of the original data; no idea if it does). If the conversion took place later, internally, this info in the docs wouldn't make sense, because then you could just as well tell Nova the data layout later (prior to first use).
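Why the layout has to be known before the conversion can happen is easy to see in a small plain-C sketch. This is not Nova's code and not its API: `array_desc` is a hypothetical descriptor that only mimics the kind of information (offset, stride, element size, component count) a layout call would have to provide, and `convert_endianness` is an illustrative helper.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one array inside an interleaved vertex buffer. */
typedef struct {
    size_t offset;      /* byte offset of the array's first element      */
    size_t stride;      /* bytes between two consecutive vertices        */
    size_t elem_size;   /* bytes per component: 1, 2 or 4                */
    size_t components;  /* components per vertex, e.g. 3 for x, y, z     */
} array_desc;

/* Reverses n bytes in place. */
static void reverse_bytes(uint8_t *p, size_t n)
{
    for (size_t i = 0; i < n / 2; i++) {
        uint8_t t = p[i];
        p[i] = p[n - 1 - i];
        p[n - 1 - i] = t;
    }
}

/* Swaps every multi-byte component in place; 1-byte data is left untouched.
   Without the descriptors there is no way to tell which bytes belong to a
   float and which are plain bytes, hence the layout must be known first. */
void convert_endianness(uint8_t *buf, size_t vertex_count,
                        const array_desc *arrays, size_t array_count)
{
    for (size_t a = 0; a < array_count; a++) {
        if (arrays[a].elem_size < 2) {
            continue; /* single bytes have no byte order */
        }
        for (size_t v = 0; v < vertex_count; v++) {
            uint8_t *p = buf + arrays[a].offset + v * arrays[a].stride;
            for (size_t c = 0; c < arrays[a].components; c++) {
                reverse_bytes(p + c * arrays[a].elem_size, arrays[a].elem_size);
            }
        }
    }
}
```

Note how a buffer described entirely as 1-byte data passes through unchanged, which is the effect the "plain bytes" trick relies on.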
@kas1e @thellier The fun part with that VBOSetArray convention: you can (ab)use it to trick Nova into not doing its slow endian conversion. Simply tell it beforehand that the VBO is just a package full of plain bytes.
But stay seated: yes, it works, but unfortunately another Nova slowdown area kills the potential gain again. You may remember that ogles2 contains a workaround for plain byte data like RGBA8. I found the upload of such endian-free simple data to be extremely slow for unknown reasons, so the lib converts it to RGBA float32 data... You'd expect that to be much slower (the 4x byte-to-float conversion, 4x as much data to transfer), but it's muuuuuuch faster than letting Nova do the simple job on the plain bytes.
So, unfortunately, avoiding Nova's endian conversion also means switching to that slow byte-data layout, so we end up with a net fps loss. Damn. I will experiment a bit more, but this one looks like a dead end. So we'll have to wait for Hans to improve the speed and eventually also implement the requested manual endian conversion.
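For reference, a byte-to-float expansion of the kind described above could look like the following plain-C sketch. It is an assumption-laden illustration, not ogles2's actual code: the helper name `rgba8_to_float` is made up, and dividing by 255 assumes a normalized colour attribute, which the thread does not state.

```c
#include <stddef.h>
#include <stdint.h>

/* Expands RGBA8 data to four 32-bit floats per pixel.
   Assumption: values are normalized to 0.0 .. 1.0 (divide by 255).
   The output is 4x the size of the input: 16 bytes instead of 4 per pixel. */
void rgba8_to_float(float *dst, const uint8_t *src, size_t pixel_count)
{
    for (size_t i = 0; i < pixel_count * 4; i++) {
        dst[i] = (float)src[i] / 255.0f;
    }
}
```

That this path wins despite the extra conversion work and the 4x larger transfer is what points at the byte path itself as the problem.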
Huh? If a VBO contains only uint8 data, then it should be using a straight copy routine (one that uses doubles if possible).
You do need to make sure that *all* VBO arrays are 8-bit or disabled (W3DNEF_NONE), otherwise it'll fall through to the complex case of handling mixed data.
No, unfortunately it's not like this. Even setting all "unused" arrays to W3DNEF_NONE and size / stride 0 doesn't change anything. The only thing that helps is to create a simple 1 array VBO in the first place. Which I should have done and usually do for pure index-VBOs, but which wasn't enforced in this case here indeed, thanks for pointing me at it
And oh yes, that makes a difference indeed! However, not for the "own" vs. "Nova" endian conversion, there's no measurable difference here in this simple 1-array-scenario.
But, damnit, all this revealed again just how slow Nova buffer copy becomes as soon as you don't have the most trivial 1 array VBO layout! Here it's the difference between 7 and 30 fps! And this happens for every VBO you create with 2 or more arrays inside.
Now the thing is: obviously there is huge optimization potential here. Whatever you do in your multi-array-copy function, it's very bad. And if ogles2 could get rid of it that would result in an incredible speedup for sure.
But: unless you make VBOSetArray with W3DNEF_NONE work as you described above, I cannot implement it, because obviously a 1-array VBO is useless in that case. Or is any special parameter combination required for VBOSetArray with W3DNEF_NONE to make it work as promised?
@Daytona675x
>VBO is just a package full of plain bytes
But will it avoid the "write from buffer to GPU VRAM" part? I mean, perhaps Nova will proceed as usual: copy the reordered data from a buffer to VRAM and only skip the reordering of each item.
>RGBA8 data converted to RGBA float
Interesting too.
Anyway, we just need a new VBO mode (let's call it W3DN_RAW_ACCESS) that doesn't copy GPU VRAM to/from a reordered buffer but simply gives access to the VBO data in place in GPU VRAM (so it is the CPU that manages the reordering & copy).
Quote:
But will it avoid the "write from buffer to GPU VRAM" part?
In theory it could. I mean, we explicitly tell Nova "look, what's coming is just a plain byte buffer, no conversion whatsoever required and write-only" even before locking the VBO! So in theory it could give us a pointer into VRAM at this point in such a scenario. But I bet it doesn't. Actually, I know it doesn't because Hans just told us:
Quote:
Hans: Huh? If a VBO contains only uint8 data, then it should be using a straight copy routine (one that uses doubles if possible).
If we had direct VRAM access, no such extra copy at his end would be necessary at all.
Quote:
I mean, perhaps Nova will proceed as usual: copy the reordered data from a buffer to VRAM and only skip the reordering of each item.
Yes, that's the case. But while we don't have direct VRAM access, we still have a potential up-to-factor-4 (7 vs 30 fps in my test) performance gain around the corner with the plain-bytes trick, if VBOSetArray with W3DNEF_NONE worked as promised. But yes, completely getting rid of the Nova copy in such a scenario would be great. This here is meant to be a hack using what we already have (well, what I thought we already had). And for something like that the potential gain is already unbelievably big.
Quote:
>RGBA8 data converted to RGBA float
Interesting too.
Yes, but that's not everything ogles2 does under the hood to work around some Nova slowdowns through type conversions. E.g. it also silently converts 16-bit index data to 32-bit because native 16-bit indices are slow (and 8-bit indices because they aren't supported at all). ogles2 even does so if you supply your indices in your own VBO: then it is not your VBO that is used, but an internal 32-bit version of it.
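The index widening mentioned above is a simple element-by-element copy. The sketch below is plain C with made-up helper names (`widen_indices16`, `widen_indices8`); it only shows the general technique, not how ogles2 actually implements it.

```c
#include <stddef.h>
#include <stdint.h>

/* Copies 16-bit indices into a 32-bit index buffer, value by value. */
void widen_indices16(uint32_t *dst, const uint16_t *src, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        dst[i] = src[i];
    }
}

/* Same for 8-bit indices. */
void widen_indices8(uint32_t *dst, const uint8_t *src, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        dst[i] = src[i];
    }
}
```

The widened buffer costs 2x (or 4x) the memory and transfer size of the original indices, which is the trade-off being accepted for the faster draw path.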
Quote:
Anyway we just need a new VBO mode (let it call W3DN_RAW_ACCESS)
Yes, that won't hurt; the potential speedup is gigantic. Although for now I'd already be happy if VBOSetArray(W3DNEF_NONE) did what it's supposed to do.
Try setting the unused arrays to UINT8 as well. That should work for now, until I add a proper option to disable endianness conversion.
@thellier
Quote:
Anyway, we just need a new VBO mode (let's call it W3DN_RAW_ACCESS) that doesn't copy GPU VRAM to/from a reordered buffer but simply gives access to the VBO data in place in GPU VRAM (so it is the CPU that manages the reordering & copy).
That would be a workaround for the lack of GART/DMA. We really want to move away from the CPU accessing VRAM directly, because it's really slow.
At some point we'll finally have GART support, which will allow us to use DMA to copy data to/from VRAM, and even allow the GPU to slurp data directly from RAM.