No, I didn't use ibm_perflibs, because in dosbox almost all the time is spent in the core (decoding x86 instructions and doing heavy computation). I tried to reorganize some things to avoid cache misses (putting data that is accessed together on the same cache line, prefetching data), improve branch prediction, add ... but I am not sure the effect is visible. Only some peripheral parts can be optimized, which means a gain of a few percent.
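To make the kind of tweaks mentioned above concrete, here is a minimal sketch in C, assuming GCC (it uses the `__builtin_prefetch` and `__builtin_expect` builtins and the `aligned` attribute). The names `hot_state`, `sum_bytes`, `LIKELY`, `UNLIKELY` and the 32-byte `CACHE_LINE` are made up for illustration and are not taken from the dosbox sources; the real cache line size depends on the CPU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names throughout; nothing here comes from the dosbox sources. */

#define CACHE_LINE 32  /* assumed line size; check the actual CPU */

/* Fields that are read together on every emulated instruction are grouped
   in one struct aligned on a cache line, so one line fill brings them all in. */
struct hot_state {
    uint32_t regs[4];
    uint32_t ip;
    uint32_t flags;
} __attribute__((aligned(CACHE_LINE)));

/* Branch hints: tell the compiler which way a test usually goes. */
#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

/* Walks a buffer and asks for the next cache line before it is needed. */
uint32_t sum_bytes(const uint8_t *mem, size_t len)
{
    uint32_t sum = 0;

    assert(mem != NULL || len == 0);

    for (size_t i = 0; i < len; i++) {
        /* Once per line, prefetch the following line (read access,
           low temporal locality), but only if it is still inside the buffer. */
        if (UNLIKELY((i % CACHE_LINE) == 0) && i + CACHE_LINE < len)
            __builtin_prefetch(mem + i + CACHE_LINE, 0, 1);
        sum += mem[i];
    }
    return sum;
}
```

Whether any of this helps has to be measured with a profiler; on a simple sequential loop like this one the hardware and the compiler often do the job already.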
The big improvement would be to write a dynamic core (called a recompiler in a previous post) that produces PPC code. Do you know why dosbox is so slow on PPC? Not because PPC is bad; it is actually very good (I was impressed: sometimes I thought I had found something to optimize, and the compiler had already done it!).
It is slow compared to the x86 versions because on PPC it does not use JIT technology.
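To show why decoding on every pass costs so much, here is a toy sketch in C. It is not a JIT and has nothing to do with the real dosbox core: the guest instruction set (`OP_ADD`, `OP_SUB`, `OP_END`) and all function names are invented for the example. It only illustrates the "decode once, run many times" idea that a dynamic core pushes further by emitting native PPC instructions instead of a table of function pointers.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy guest code (hypothetical, not x86): opcode byte followed by operand byte. */
enum { OP_ADD = 0, OP_SUB = 1, OP_END = 2 };

/* Interpreter: every instruction is decoded again each time it is executed. */
int32_t run_interpreted(const uint8_t *code, int32_t acc)
{
    for (size_t pc = 0; ; pc += 2) {
        switch (code[pc]) {
        case OP_ADD: acc += code[pc + 1]; break;
        case OP_SUB: acc -= code[pc + 1]; break;
        default:     return acc;  /* OP_END or unknown opcode stops the run */
        }
    }
}

/* "Translated" form: the decoding work is done once and the result is kept. */
typedef int32_t (*handler_fn)(int32_t acc, uint8_t operand);

static int32_t do_add(int32_t acc, uint8_t v) { return acc + v; }
static int32_t do_sub(int32_t acc, uint8_t v) { return acc - v; }

struct decoded_op { handler_fn fn; uint8_t operand; };

#define MAX_OPS 64  /* arbitrary limit for the sketch; longer blocks are cut off */
struct block { struct decoded_op ops[MAX_OPS]; size_t count; };

/* Decode a block of guest code once, stopping at OP_END or an unknown opcode. */
void translate(const uint8_t *code, struct block *out)
{
    out->count = 0;
    for (size_t pc = 0; out->count < MAX_OPS; pc += 2) {
        handler_fn fn;
        if (code[pc] == OP_ADD)      fn = do_add;
        else if (code[pc] == OP_SUB) fn = do_sub;
        else break;
        out->ops[out->count].fn = fn;
        out->ops[out->count].operand = code[pc + 1];
        out->count++;
    }
}

/* Run the cached block: no opcode decoding left in the loop. */
int32_t run_translated(const struct block *b, int32_t acc)
{
    for (size_t i = 0; i < b->count; i++)
        acc = b->ops[i].fn(acc, b->ops[i].operand);
    return acc;
}
```

A block translated once with `translate()` can be run again and again with `run_translated()`, which is where a loop in the guest program would gain. In this toy the function-pointer calls may well be no faster than the `switch`; the real gain of a dynamic core comes from generating native code.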
I will look at the code again to see how hard it would be, but nothing is documented and the code is a mess, even if in the end the program is very good ...
Elwood: About Hieronymus, I didn't port it to OS4, I created it.

And yes, again with dosbox, I can confirm it lists the same slow functions as oprofile on Linux.
kas1e: About Hugi and x-ray, I tried of course, and even on Linux PPC there is a problem with colors. I checked the SDL surfaces and their masks, and they are good. I think these programs access the hardware in a different way ... I don't know if we could fix that.