His original code is here:
- Code: Select all
void ultraDraw(byte data[], char x, char y){
uint8_t* buf = gb.display.getBuffer();
for(char i = 0; i < 8; i++){
if (y>=0)
*(buf + i+x + (y/8) * 84) |= data[i] << (y%8);
*(buf + i+x + ((y+8)/8) * 84) |= data[i] >> (8-y%8);
}
}
Let's see if you will learn something from the optimizations...
First I noticed that he divided by 8, while 8 is a power of two. Such divisions can be sped up by using bitshifts instead, as 8 = 2^3 we need to shift three times to the right, so instead of /8 you do >>3. Testing did show that it rendered quicker!
Next up I wanted to try a little something if the c++ compiler didn't optimize that.....by putting the whole bracket and multiplying part to the front, so that the operation order is actually from left-to-right (due to the way the brackets are), it speeds things up again!
So this is where we are at right now:
- Code: Select all
void ultraDraw(byte data[], char x, char y){
uint8_t* buf = gb.display.getBuffer();
for(char i = 0; i < 8; i++){
if (y>=0)
*((y>>3) * 84 + buf + i+x ) |= data[i] << (y%8);
*(((y+8)>>3) * 84 + buf + i+x ) |= data[i] >> (8-y%8);
}
}
When drawing a sprite 100000 times it took on my gamebuino 6114ms.
But you can go quicker. You notice the multiplying with a constant? There are usually ways to speed that up, too! For that we need to know the prime factorials of 84, so I quickly factorized it (more like, used an online tool to do it for me), and it turned out that 2*2*3*7 = 84.
That is good, there are powers of two, so 2^2 * 21 = 84, we can optimize that 2^2 again with a bitshift, as we multiply this time to the left direction.
So instead of num*84 we could do (num<<2)*21
but this isn't where to end, as the AVR cpu can already multiply that wouldn't really speed things up, we need to see it in our context, we'd have now
((y>>3)<<2)*84
And there we are! We shift three times to the right and then twice to the left. we could optimize that by only shifting once left, but we'd need to clear byte 0, 1 and 2 for that, but that is no problem!
A simple &0xF8 will do the trick as it does a logical-and with 0b11111000 so the last two bits drop out, so now we have ((y&F8)>>1) * 21
And yes, this again performs better, where we are at now is this:
- Code: Select all
void ultraDraw2(byte data[], char x, char y){
uint8_t* buf = gb.display.getBuffer() + x;
for(char i = 0; i < 8; i++){
if (y>=0)
*((((y)&0xF8)>>1) * 21 + buf) |= data[i] << (y%8);
*((((y + 8)&0xF8)>>1) * 21 + buf) |= data[i] >> (8-y%8);
buf++;
}
}
When drawing a sprite 100000 times it took on my gamebuino 5897ms.
Still not good enough, we can get faster!
Note how it re-defines the buffer offset every single time it needs to render a byte, while it actually just increases a known amount down the buffer?
It has a catch, though, if you render a byte from the sprite between two bytes of the buffer (not-aligned to the 8x8 grid) you need to shift it, and due to how the buffer is designed that is 84 bytes off, then. As I didn't want to have to add 84 and subtract 84 at times (which probably wouldn't have been too bad) I decided to make two pointers instead, one to the buffer and one to the buffer + 84. Now, when drawing I only need to increase them and the pointer is updated correctly!
This is my code for that:
- Code: Select all
void ultraDraw3(byte data[], char x, char y){
uint8_t* buf = ((y&0xF8)>>1) * 21 + gb.display.getBuffer() + x;
uint8_t* buf2 = buf + 84;
for(char i = 0; i < 8;i++){
if(y>=0){
*(buf) |= data[i] << (y%8);
buf++;
}
*(buf2) |= data[i] >> (8-y%8);
buf2++;
}
}
When drawing a sprite 100000 times it took on my gamebuino 5633ms.
Just for fun and comparison reasons I made an inline assembly routine which takes over as soon as the buffer offset is calculated (as that would have been a PITA in assembly), and, well, I was blown away by just HOW much better it performed!
This is the routine:
- Code: Select all
void ultraDraw4(byte data[], char x, char y){
uint8_t* buf = ((y&0xF8)>>1) * 21 + gb.display.getBuffer() + x;
asm volatile(
"ldi R16,8\n\t"
"cpi %[rotnumber],0\n\t"
"breq LoopAligned\n"
"LoopStart:\n\t"
"ld R17,Z+\n\t"
"eor R18,R18\n\t"
"mov R19,%[rotnumber]\n\t"
"LoopShift:\n\t" // carry is still reset from the cpi instruction or from the dec
"rol R17\n\t"
"rol R18\n\t"
"dec R19\n\t"
"brne LoopShift\n\t"
"ld R19,X\n\t"
"or R19,R17\n\t"
"st X+,R19\n\t"
"ld R19,Y\n\t"
"or R19,R18\n\t"
"st Y+,R19\n\t"
"dec R16\n\t"
"brne LoopStart\n\t"
"rjmp End\n"
"LoopAligned:\n\t"
"ld R17,Z+\n\t"
"ld R18,X\n\t"
"eor R18,R17\n\t"
"st X+,R18\n\t"
"dec R16\n\t"
"brne LoopAligned\n"
"End:\n"
::"x" (buf),"y" (buf + 84),"z" (data),[rotnumber] "r" (y%8):"r16","r17","r18","r19");
}
When drawing a sprite 100000 times it took on my gamebuino 2472ms.
That is a heck of a lot faster.
To sum it all up:
Always drawing the sprite 100000 times:
1st: simple math operation optimizations taking advantage of powers of two - 6114 ms
2nd: more complex math optimizations which are specific for each individual case - 5897 ms
3rd: try not to re-compute values often but use shortcuts like only increasing - 5633 ms
4th: inline assembly - 2472 ms
Well, I guess inline assembly wins, but then again, it is that which takes the longest to write + learn + understand, so yeah
I hope you had fun reading through this wall of text and might have even learned something while reading it!
If you have any questions as to why I did a certain optimization (this includes the inline asm) or something was unclear to you, feel free to ask me :3
-------------------------------------------------------------------------------------
Why custom-coded assembly usually wins is actually quite simple. When writing your arduino sketch the compiler converts your code into the machine language, while assembly is directly machine language. With that you have direct control over the output and thus can optimize way more.
For such things like this sprite routines ram access is actually relatively slow, compared what you can do easily with your hand-coded optimizations. The CPU has internal storage called registers, the gamebuino (AVR) has 26 8-bit registers, access to them is very fast! You need these registers for virtually everything, as that is the only way you can add/subtract/multiply numbers.
Now, if your compiler compiles code it usually reserves some space in ram for the variable and when doing math on it it loads those into the registers, does some math and then stores them back into ram. While if you hand-code assembly you have direct control over those registers, you know which ones you are not using right now and can switch to others for your math stuff, that way you'll require way less ram and way less access. Remember how I said that ram is actually slow? Just look at the numbers to compare - more than twice as fast with fewer ram access!
In fact, for this sprite routine, the only time I access the ram is when storing a byte to a buffer or fetching a byte from the sprite data.