I've been "relaxing" today by playing around with MAME - it's so slow on a Sam because of the SDL graphics routines (even in OpenGL, because MiniGL is rather inefficient).
Incredibly, I actually have a game displaying now!
I was wondering, though - what is the fastest way to display the bitmap?
MAME has a list of primitives which it iterates through every render. Therefore it's building up the image bit by bit.
I'm currently doing a p96LockBitMap() on my previously p96AllocBitMap()'d bitmap, then walking the render list. After that, I p96UnlockBitMap() and do a CompositeTags() call "borrowed" from the boing3 example.
What is the most efficient way of doing this? Are there any particular parameters that would be useful in the allocation of the bitmap, or in the call to CompositeTags()? I'm currently using (COMPTAG_Flags, COMPFLAG_SrcAlphaOverride) as my flags field, as MAME renders to the RGBA32 bitmap with a zero alpha.
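For anyone following along, the flow described above looks roughly like this. This is only a sketch: the function and tag names are as I remember them from the Picasso96 and composite autodocs, render_mame_primitives() is a hypothetical stand-in for MAME's primitive walk, and error handling is omitted.

```c
/* Sketch of the lock -> render -> unlock -> composite flow.
 * Check exact signatures against the Picasso96 and graphics.library
 * autodocs before relying on this. */
#include <proto/Picasso96API.h>
#include <proto/graphics.h>
#include <graphics/composite.h>

/* Hypothetical stand-in for walking MAME's primitive list. */
extern void render_mame_primitives(APTR pixels, uint32 bytes_per_row);

void render_frame(struct BitMap *game_bm, struct BitMap *screen_bm,
                  uint32 w, uint32 h)
{
    struct RenderInfo ri;
    uint32 lock = p96LockBitMap(game_bm, (UBYTE *)&ri, sizeof(ri));
    if (lock) {
        /* ri.Memory / ri.BytesPerRow now describe the pixel buffer.
         * Keep the lock as short as possible: extended locks stall
         * the rest of the graphics system. */
        render_mame_primitives(ri.Memory, ri.BytesPerRow);
        p96UnlockBitMap(game_bm, lock);
    }

    /* MAME leaves alpha at zero, so override the source alpha. */
    CompositeTags(COMPOSITE_Src, game_bm, screen_bm,
                  COMPTAG_SrcWidth,  w,
                  COMPTAG_SrcHeight, h,
                  COMPTAG_Flags,     COMPFLAG_SrcAlphaOverride,
                  TAG_DONE);
}
```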
Any help gratefully appreciated!
Thanks!
Edit: I am now allocating an off-card bitmap as well as the on-card bitmap. I render to the off-card bitmap, do a BltBitMap() to the on-card bitmap when the data is ready, and then do a CompositeTags(). This gave a major speed-up on my Sam440ep, but not much difference on my A1XE G4. I presume this is to do with the DMA of the Sam's PCI bus...
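Sketched out, the two-bitmap pipeline looks like this. Assumptions to check against the autodocs: 0xC0 is the straight-copy minterm for BltBitMap(), and the composite tag names are from memory rather than the headers.

```c
/* Two-bitmap pipeline: software-render off-card, push the finished
 * frame to VRAM with BltBitMap() (which can use DMA), then composite
 * to the screen with scaling. */
#include <proto/graphics.h>
#include <graphics/composite.h>

struct BitMap *off_bm; /* allocated off-card ("user private") */
struct BitMap *on_bm;  /* allocated displayable, in VRAM */

void present_frame(struct BitMap *screen_bm, uint32 w, uint32 h)
{
    /* ... render the MAME frame into off_bm via p96LockBitMap() ... */

    /* 0xC0 = plain copy. The RAM-to-VRAM blit is where the
     * Sam440ep's DMA speed-up comes from. */
    BltBitMap(off_bm, 0, 0, on_bm, 0, 0, w, h, 0xC0, 0xFF, NULL);

    CompositeTags(COMPOSITE_Src, on_bm, screen_bm,
                  COMPTAG_Flags, COMPFLAG_SrcAlphaOverride,
                  TAG_DONE);
}
```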
Good! I was about to scream about those extended bitmap locks! They play havoc with USB input and other things that affect the GUI. If you are rendering directly to a bitmap by manipulating its memory then of course you do need to lock it, but extended locks cause problems.
Why do you need the CompositeTags? Why not blit direct to the final destination? Scaling?
A more advanced technique for speed would be to store your graphics elements on the graphics card and then build the target with blits (blits between displayable friend bitmaps are usually hardware accelerated) or composites.
Of course this puts more demands on available video memory; you might need to implement some caching to keep the most-used bitmaps in video RAM and the less-used ones off-card.
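A least-used eviction scheme along those lines could be as simple as this portable sketch. The integer ids stand in for bitmaps; a real version would BltBitMap() an evicted bitmap's pixels back in from its off-card copy on a miss.

```c
#define CACHE_SLOTS 4

/* Hypothetical VRAM bitmap cache: each slot tracks a bitmap id and a
 * use counter. On a miss the least-used slot is evicted. */
typedef struct {
    int id;        /* bitmap identifier, -1 = empty */
    unsigned uses; /* hit counter */
} CacheSlot;

typedef struct {
    CacheSlot slot[CACHE_SLOTS];
} BitmapCache;

void cache_init(BitmapCache *c)
{
    for (int i = 0; i < CACHE_SLOTS; i++) {
        c->slot[i].id = -1;
        c->slot[i].uses = 0;
    }
}

/* Returns the slot index holding `id`, evicting the least-used slot
 * on a miss (the caller would re-upload that bitmap to VRAM). */
int cache_touch(BitmapCache *c, int id)
{
    int victim = 0;
    for (int i = 0; i < CACHE_SLOTS; i++) {
        if (c->slot[i].id == id) {
            c->slot[i].uses++;
            return i;
        }
        if (c->slot[i].uses < c->slot[victim].uses)
            victim = i;
    }
    c->slot[victim].id = id;
    c->slot[victim].uses = 1;
    return victim;
}
```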
I may make it do that eventually, but one step at a time :) MAME is a huge, complicated project, and I'm already very chuffed that my little Sam440ep 600MHz has gone from 38% speed on "Circus" at a low screen size, to 100% speed (no frame skip!) at a high screen size!
That's why I'm compositing, by the way - for scaling.
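For the record, composite-style scaling APIs generally take their scale factors in 16.16 fixed point (where 0x10000 is 1.0); computing the factor from the source and destination sizes is just a shift and a divide. This is a portable sketch - the assumption that the composite scale tags want 16.16 should be double-checked against the headers.

```c
#include <stdint.h>

/* 16.16 fixed-point scale factor: dst/src. A 384-wide MAME frame
 * scaled to a 768-wide window gives 2.0, i.e. 0x20000. */
uint32_t scale_fix16(uint32_t src, uint32_t dst)
{
    return (uint32_t)(((uint64_t)dst << 16) / src);
}
```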
My first effort at code was literally my first effort - doing it the most basic way. I knew it was wrong, that was the purpose of this post :)
I'm a little puzzled by the A1XE's not being so much faster. I'm going to try an Altivec build and see if that helps.
Is it possible to do a build for a particular machine? I want to do a specialised build for my Sam440ep, my A1XE G4, my CS-PPC and my X1000 and see how they benefit. I believe particularly the X1000 may benefit...
No can do. MAME uses all sorts of weird screen sizes, from very small to huge.
Anyway, compositing is so cheap it seems silly not to use it. :) I'm actually very impressed by the speed of my little Sam now it's being used correctly with hardware DMA (BltBitMap) and compositing - made a massive difference.
Edit: I am now allocating an off-card bitmap as well as the on-card bitmap. I render to the off-card bitmap, do a BltBitMap() to the on-card bitmap when the data is ready, and then do a CompositeTags(). This gave a major speed-up on my Sam440ep, but not much difference on my A1XE G4. I presume this is to do with the DMA of the Sam's PCI bus...
Yes, it's due to use of DMA. I hope that other developers are paying attention, because this is the best way to stream data to VRAM, whether it's video frames or software rendered frames. With a few rare exceptions (such as drawing a line or two), you should avoid directly writing to bitmaps in VRAM.
For developers who aren't familiar with the AmigaOS graphics system: a "user private" bitmap is one that is always off-card.
Thanks for your input! It wouldn't help in this case to just put it in SDL, because MAME software-renders to the whole window. I forced it to make a window of the correct machine size (e.g. 384x224) and then used that as the source bitmap, with a scaling factor up to the actual desired monitor size. Just putting compositing into SDL would still mean rendering in software at the full window size.
Once I'm happy with my build, we won't need OpenGL for it at all (which is much less efficient).
I know this is going to sound like a completely insane idea but you might try PM'ing or emailing Karlos directly instead of posting a message to some web forum and hoping he might read it some day.
Yes, it's due to use of DMA. I hope that other developers are paying attention, because this is the best way to stream data to VRAM, whether it's video frames or software rendered frames. With a few rare exceptions (such as drawing a line or two), you should avoid directly writing to bitmaps in VRAM.
It works the other way round too (as you would know, of course).
Fiddling with my highly hacked local version of SRec, I got a 10% increase in speed by extracting the bitmap from video RAM with ReadPixelArray() first - though only when I chose a 32-bit destination format. Changing from 32-bit to 24-bit with just that function lost the speed gain.
I find that reading line by line, rather than the whole bitmap at once, reduces jitter in USB mouse movement (critical when trying smooth painting operations during video capture).