To not deral apitrace-glsnoop topic, let's continue there.
So, questions is : what we have today to do normal profiling of our apps ?
There was "hieronymus" on os4depot. It wasn't feature rich, but it was a start. Through it wasn't working on many machines, and work only on AmigaOne (G3/G4), Sam440 and Sam460. Also started from os4 FE, it didn't works Sam460ex. As it also no more in develop we probably can forget about for now.
Another one, which was of big hope for me, its gcc's "-pg" which create profiling stat which can be then read by "gprof". But sadly, our gcc binaries which compiled with -pg , simple crashed. I ask SBA about, he say that never worry about and never looked at , so it can be easy bug to fix, or some thing not implemented, but whatever is currently it didn't works. For sure, it can be good idea to have a look at this, maybe it will be not _that_ hard to fix.
More of it we have "gprof" ported already, so something seems wrong with gcc part when we build it with -pg , and that should be fixable for sure.
Also, lately Capehill create a simple CPU profiler, called "tequila" : https://github.com/capehill/tequila. That one i tried to use now to profile some game, and that what i have:
Processing symbol data (max addresses 2970)...
297/2970
594/2970
891/2970
1188/2970
1485/2970
1782/2970
2079/2970
2376/2970
2673/2970
Found 14 unique and 2898 non-zero symbols
Sorting symbols...
I'm not sure how reliable is Tequila's way to fetch the instruction pointer of the current task (it seems to work in some cases).
But in many other real world cases the symbol lookup fails and it's puzzling. If the symbol was valid, was it removed later or why it cannot be looked up. I would have expected at the least the "module" part to show up.
@Capehill That profiling was about loading part (which i trying to understand why it take 40-50 seconds vs 5sec on win32).
But then, i tried to profile actual gameplay of the game via glsnoop. And what i found, that during plaing, while my CPU hit 100%, warp3dnova and ogles2 take only 40% of it. For example ogles2's drawelements() take most of time : 32% (that include warp3dnovas drawlements (10%), bufferunlock(9%), sumbit() and other stuff).
That mean that another 60% is taken up by CPU itself. I also do run "tequalia" when i actually play, and this time it show me that:
Processing symbol data (max addresses 2970)...
297/2970
594/2970
891/2970
1188/2970
1485/2970
1782/2970
2079/2970
Found 14 unique and 2049 non-zero symbols
Sorting symbols...
Sample % Count Function in module 'Not available'
63.59 1303
Sample % Count Function in module 'graphics.library.kmod'
15.91 326 (null)
Sample % Count Function in module 'kernel'
13.23 271 (null)
Sample % Count Function in module 'RadeonHD.chip'
3.27 67 (null)
Sample % Count Function in module 'LIBS:Warp3DNova/W3DN_SI.library'
1.51 31 (null)
Sample % Count Function in module 'LIBS:ogles2.library'
1.17 24 (null)
Sample % Count Function in module 'Tequila'
0.54 11 (null)
Sample % Count Function in module 'newlib.library.kmod'
0.39 8 (null)
Sample % Count Function in module 'intuition.library.kmod'
0.10 2 (null)
Sample % Count Function in module 'AmiDock'
0.10 2 (null)
Sample % Count Function in module 'System:Utilities/Dockies/topbar/RamDock/RAMDock.docky'
0.05 1 FONTNAME
Sample % Count Function in module 'RadeonRX.chip'
0.05 1 (null)
Sample % Count Function in module 'dos.library.kmod'
0.05 1 (null)
Sample % Count Function in module 'massstorage.usbfd'
0.05 1 (null)
See, there now something strange : glsnoop says that warp3dnova/ogles2 take about 50% of time, but tequila says that warp3dnova take only 1.51% and ogles 1.17% ?
So, questions is : what we have today to do normal profiling of our apps ?
There was "hieronymus" on os4depot. It wasn't feature rich, but it was a start. Through it wasn't working on many machines, and work only on AmigaOne (G3/G4), Sam440 and Sam460. Also started from os4 FE, it didn't works Sam460ex. As it also no more in develop we probably can forget about for now.
Hieronymus is not as advanced as I would like but don't consider it dead yet
I have unreleased improvements that I would have to publish in a new version, to make it working in an alternative mode without machine restriction. And other things. I was not aware about problems with Sam460.
If you use the SAMPLES parameter, you can get more accuracy. Default is 99 per second.
Quote:
glsnoop says that warp3dnova/ogles2 take about 50% of time, but tequila says that warp3dnova take only 1.51% and ogles 1.17% ?
Right. So those percents reported by glSnoop are "wrapped" times: a function starts and ends, its duration is recorded. They include everything else that is called by OGLES2 or Nova (for example graphics.library).
With Tequila we can see graphics.library taking 15%. That could be blit functions for display update, and it's likely that this is belongs to that 50% total that glSnoop reports.
Right. So those percents reported by glSnoop are "wrapped" times: a function starts and ends, its duration is recorded. They include everything else that is called by OGLES2 or Nova (for example graphics.library).
With Tequila we can see graphics.library taking 15%. That could be blit functions for display update, and it's likely that this is belongs to that 50% total that glSnoop reports.
How intersting .. I mean, if graphics.library take up to 15% from all warp3dnova and ogles2 stuff its quite a lot ! Probabaly that mean that DMA support for x5000 in graphics.library will help pretty much, even without DMA for ram->vram memory accesses.
Is there any way how we can know what functions of graphics.library are involved ? Will be indeed interesting to know all kind of functions involved in some game, so to see full picture of what is un-optimized and take up the time most.
And yeah, using SAMPLES of 10000 make it all be more accurate about processes being involved, but output still all those "60% Not available", ad all functions in modules are (null).
As side note i still think about that "-pg" switch in gcc and ability to fix it somehow. But i am not sure, if without debug build of graphics.library, we can know what functions is used in (and does not matter if it tequila or gcc with -pg) ?
Oops! I'm so sorry ... I really did not remember. So I did not keep my promise ... I was too far too long ... but after an house, my third child and several job changes, I start again to have some free time
@Capehill From code of tequila i not very much understand how it all works, but is it going some oldschool method like we get time at begining of fnction, then at end, calcualte delte, and put that to log ?
1) It's based on a timer interrupt. When interrupt triggers, it read the current task address from ExecBase. That seems to be enough to check which task is using the CPU.
2) For function profiling, in addition to the first step, it tries to get the Link Register (see elfspec_ppc.pdf) from task's stack, to know where in the memory task was running when interrupt came.
3) Finally there is an attempt to query the symbol names based on the collected addresses (last 30 seconds).
I just tried to use hieronymus on my Sam460 and it crashes when I run it on my 4.1 Final Edition or beta installations, but works fine under 4.1 Update 6.
The crash AFAICT is because of a NULL elf handle being passed to IElf->CloseElf() from: hieronymus:Bosch_StopAcquisition()+0x3fc (section 1 @ 0xBA8)
The crash AFAICT is because of a NULL elf handle being passed to IElf->CloseElf() from: hieronymus:Bosch_StopAcquisition()+0x3fc (section 1 @ 0xBA8)
Not quite that simple unfortunately. I added a NULL pointer check in CloseElf() but it just lead to a different DSI (again caused by a NULL pointer access) in a different function, but still from a call to IElf->CloseElfTags() from hieronymus:Bosch_StopAcquisition().
@kas1e Thanks for the news. Here the file still does not appear in the recent files.
@walkero I hope you will enjoy this new version! You can change the default sampling duration with the option "duration", for example "duration=60" for one minute (and you have the possibility to stop in that time with CTRL-C). I may have to clarify the command "help" where the information appears. About debug option, it seems that -g is enough to access function symbols.
@salass0 I just see your old comment. I know more details about the crash in CloseElfTags(). I will create a test file and see if I can fix it or create a Bugzilla ticket.
@all This release comes after a huge rework of the tool. Reporting is improved and among new features, A1222/X5000 support is added (at the moment, limited to OS4 beta users due to a change in the kernel) ... and finally, after several approaches, I found a solution to make profiling working on X1000 (not the best implementation but seems to do the job though).
@Corto Probably you forgot a password you use with previous upload, so it can't be automatically overwritten. When i have such kind of problems i wrote directly to Orgin.
@corto Thank you so much for this release and for your reply. I was using the duration argument without the equals sign, i.e. duration 60 Yeah, a clarification in the help would be useful.
The old version had an argument to specify the binary. If I recall correctly that filtered the results to be only for the specified one? Do you plan to add something like that in the new version, or it is there and I missed it?
Quote:
Note that after 3 days, Hieronymus is still in the Uploads page
Check your email account (and spams) for a confirmation email.
@kas1e When I submitted it, that seemed to be good ... anyway, I sent a message to orgin.
@walkero Thanks for the idea about spam, I checked. About the option to track a single program, this feature was not in the previous version ... even if I think about it for a long time. Now that I started to move using the performance monitor by default, it should not be so difficult, the mechanism is already used for the gcc profiling. For other platforms, I suppose I will have to be creative, one more time (like for the current X1000 support). That's one of the feature for a future version, in my todo list.