Having fun optimizing a sprite routine

Libraries, utilities, bootloaders...

Having fun optimizing a sprite routine

Postby Sorunome » Sat Aug 08, 2015 12:15 am

Sooo, Valden made a little fast 8x8 sprite routine for his upcoming game, which only takes 8x8 sprites in a different format, think of the bytes being sideways (as that is how the actual screen buffer is), and draws them way faster than the default gamebuino library.
His original code is here:
Code: Select all
void ultraDraw(byte data[], char x, char y){
 uint8_t* buf = gb.display.getBuffer();
 for(char i = 0; i < 8; i++){
   if (y>=0)
     *(buf + i+x + (y/8) * 84) |= data[i] << (y%8);
   *(buf + i+x + ((y+8)/8) * 84) |= data[i] >> (8-y%8);
 }
}
It does work fine and all, but I thought, "why not optimize the heck out of it?"
Let's see if you will learn something from the optimizations...

First I noticed that he divided by 8, while 8 is a power of two. Such divisions can be sped up by using bitshifts instead, as 8 = 2^3 we need to shift three times to the right, so instead of /8 you do >>3. Testing did show that it rendered quicker!

Next up I wanted to try a little something if the c++ compiler didn't optimize that.....by putting the whole bracket and multiplying part to the front, so that the operation order is actually from left-to-right (due to the way the brackets are), it speeds things up again!

So this is where we are at right now:
Code: Select all
void ultraDraw(byte data[], char x, char y){
 uint8_t* buf = gb.display.getBuffer();
 for(char i = 0; i < 8; i++){
   if (y>=0)
     *((y>>3) * 84 + buf + i+x ) |= data[i] << (y%8);
   *(((y+8)>>3) * 84 + buf + i+x ) |= data[i] >> (8-y%8);
 }
}

When drawing a sprite 100000 times it took on my gamebuino 6114ms.


But you can go quicker. You notice the multiplying with a constant? There are usually ways to speed that up, too! For that we need to know the prime factorials of 84, so I quickly factorized it (more like, used an online tool to do it for me), and it turned out that 2*2*3*7 = 84.
That is good, there are powers of two, so 2^2 * 21 = 84, we can optimize that 2^2 again with a bitshift, as we multiply this time to the left direction.
So instead of num*84 we could do (num<<2)*21
but this isn't where to end, as the AVR cpu can already multiply that wouldn't really speed things up, we need to see it in our context, we'd have now
((y>>3)<<2)*84
And there we are! We shift three times to the right and then twice to the left. we could optimize that by only shifting once left, but we'd need to clear byte 0, 1 and 2 for that, but that is no problem!
A simple &0xF8 will do the trick as it does a logical-and with 0b11111000 so the last two bits drop out, so now we have ((y&F8)>>1) * 21
And yes, this again performs better, where we are at now is this:
Code: Select all
void ultraDraw2(byte data[], char x, char y){
  uint8_t* buf = gb.display.getBuffer() + x;
  for(char i = 0; i < 8; i++){
    if (y>=0)
      *((((y)&0xF8)>>1) * 21 + buf) |= data[i] << (y%8);
    *((((y + 8)&0xF8)>>1) * 21 + buf) |= data[i] >> (8-y%8);
    buf++;
  }
}

When drawing a sprite 100000 times it took on my gamebuino 5897ms.

Still not good enough, we can get faster!
Note how it re-defines the buffer offset every single time it needs to render a byte, while it actually just increases a known amount down the buffer?
It has a catch, though, if you render a byte from the sprite between two bytes of the buffer (not-aligned to the 8x8 grid) you need to shift it, and due to how the buffer is designed that is 84 bytes off, then. As I didn't want to have to add 84 and subtract 84 at times (which probably wouldn't have been too bad) I decided to make two pointers instead, one to the buffer and one to the buffer + 84. Now, when drawing I only need to increase them and the pointer is updated correctly!
This is my code for that:
Code: Select all
void ultraDraw3(byte data[], char x, char y){
        uint8_t* buf = ((y&0xF8)>>1) * 21 + gb.display.getBuffer() + x;
        uint8_t* buf2 = buf + 84;
        for(char i = 0; i < 8;i++){
          if(y>=0){
            *(buf) |= data[i] << (y%8);
            buf++;
          }
          *(buf2) |= data[i] >> (8-y%8);
          buf2++;
        }
}

When drawing a sprite 100000 times it took on my gamebuino 5633ms.


Just for fun and comparison reasons I made an inline assembly routine which takes over as soon as the buffer offset is calculated (as that would have been a PITA in assembly), and, well, I was blown away by just HOW much better it performed!
This is the routine:
Code: Select all
void ultraDraw4(byte data[], char x, char y){
        uint8_t* buf = ((y&0xF8)>>1) * 21 + gb.display.getBuffer() + x;
        asm volatile(
        "ldi R16,8\n\t"
        "cpi %[rotnumber],0\n\t"
        "breq LoopAligned\n"
        "LoopStart:\n\t"

          "ld R17,Z+\n\t"
          "eor R18,R18\n\t"
          "mov R19,%[rotnumber]\n\t"
          "LoopShift:\n\t" // carry is still reset from the cpi instruction or from the dec
            "rol R17\n\t"
            "rol R18\n\t"
            "dec R19\n\t"
            "brne LoopShift\n\t"

          "ld R19,X\n\t"
          "or R19,R17\n\t"
          "st X+,R19\n\t"
         
          "ld R19,Y\n\t"
          "or R19,R18\n\t"
          "st Y+,R19\n\t"
         
          "dec R16\n\t"
          "brne LoopStart\n\t"
        "rjmp End\n"
        "LoopAligned:\n\t"
          "ld R17,Z+\n\t"
          "ld R18,X\n\t"
          "eor R18,R17\n\t"
          "st X+,R18\n\t"
          "dec R16\n\t"
          "brne LoopAligned\n"
        "End:\n"
        ::"x" (buf),"y" (buf + 84),"z" (data),[rotnumber] "r" (y%8):"r16","r17","r18","r19");
}

When drawing a sprite 100000 times it took on my gamebuino 2472ms.

That is a heck of a lot faster.


To sum it all up:
Always drawing the sprite 100000 times:
1st: simple math operation optimizations taking advantage of powers of two - 6114 ms
2nd: more complex math optimizations which are specific for each individual case - 5897 ms
3rd: try not to re-compute values often but use shortcuts like only increasing - 5633 ms
4th: inline assembly - 2472 ms


Well, I guess inline assembly wins, but then again, it is that which takes the longest to write + learn + understand, so yeah :P
I hope you had fun reading through this wall of text and might have even learned something while reading it!

If you have any questions as to why I did a certain optimization (this includes the inline asm) or something was unclear to you, feel free to ask me :3



-------------------------------------------------------------------------------------

Why custom-coded assembly usually wins is actually quite simple. When writing your arduino sketch the compiler converts your code into the machine language, while assembly is directly machine language. With that you have direct control over the output and thus can optimize way more.
For such things like this sprite routines ram access is actually relatively slow, compared what you can do easily with your hand-coded optimizations. The CPU has internal storage called registers, the gamebuino (AVR) has 26 8-bit registers, access to them is very fast! You need these registers for virtually everything, as that is the only way you can add/subtract/multiply numbers.
Now, if your compiler compiles code it usually reserves some space in ram for the variable and when doing math on it it loads those into the registers, does some math and then stores them back into ram. While if you hand-code assembly you have direct control over those registers, you know which ones you are not using right now and can switch to others for your math stuff, that way you'll require way less ram and way less access. Remember how I said that ram is actually slow? Just look at the numbers to compare - more than twice as fast with fewer ram access!
In fact, for this sprite routine, the only time I access the ram is when storing a byte to a buffer or fetching a byte from the sprite data.
Last edited by Sorunome on Sat Aug 08, 2015 12:16 pm, edited 3 times in total.
User avatar
Sorunome
 
Posts: 629
Joined: Sun Mar 01, 2015 1:58 pm

Re: Having fun optimizing a sprite routine

Postby Drakker » Sat Aug 08, 2015 12:38 am

Great job!

Quick thing you might want to test. With the default library, the more black pixels there are in the sprite, the longer it takes to draw it (not really surprising). So you might want to test your timings with an 8x8 black square and other mixes of transparency/white and black. That will give you a more accurate window for the time required to draw a sprite using your improved drawing function.
User avatar
Drakker
 
Posts: 297
Joined: Sun Mar 30, 2014 2:54 am
Location: Québec, Canada

Re: Having fun optimizing a sprite routine

Postby Sorunome » Sat Aug 08, 2015 6:48 am

Drakker wrote:Great job!

Quick thing you might want to test. With the default library, the more black pixels there are in the sprite, the longer it takes to draw it (not really surprising). So you might want to test your timings with an 8x8 black square and other mixes of transparency/white and black. That will give you a more accurate window for the time required to draw a sprite using your improved drawing function.

That would make no difference here at all as the whole thinking of drawing the sprite is different, the data is even aligned differently.
All it does is rotating bytes and OR-ing them and things like that, in theory it takes longest if it has to rotate 7 times, so if the y-pos is 7, 15, 23, 31, 39 etc., then again, that difference is only ever so tiny.

Nevertheless, I gotta try out if y&7 performs greater than y%8 so that I could gain some speed there.
User avatar
Sorunome
 
Posts: 629
Joined: Sun Mar 01, 2015 1:58 pm

Re: Having fun optimizing a sprite routine

Postby Valden » Sat Aug 08, 2015 9:27 am

Dayum, that's some impressive performance there. Well done ! :)
User avatar
Valden
 
Posts: 31
Joined: Mon Jun 08, 2015 5:33 pm

Re: Having fun optimizing a sprite routine

Postby Sutchig » Sat Aug 08, 2015 10:41 am

That's impressive! Thank you for the explanation. If I do bitshifts I always have to mask afterwards... maybe I got this not right :( it feels like a ror.
Did you watch the memory usage (flash)? I guess, ASM wins,too ;)
Sutchig
 
Posts: 67
Joined: Sat May 23, 2015 3:48 pm

Re: Having fun optimizing a sprite routine

Postby Drakker » Sat Aug 08, 2015 12:13 pm

Wow, that's nice. Your function is a lot more predictable then. It was kind of bugging me at times that drawing more black pixels would take longer, as I sometimes get slowdowns with certain sprites and not others. I'll sure consider it if I ever have time to work on my roguelike game.
User avatar
Drakker
 
Posts: 297
Joined: Sun Mar 30, 2014 2:54 am
Location: Québec, Canada

Re: Having fun optimizing a sprite routine

Postby jonnection » Sat Aug 08, 2015 7:46 pm

@sorunome

Default optimization setting in Arduino IDE for avr-gcc is -Os (optimize for size). You can find it in platforms.txt in compiler flags... if I remeber it right. -O2 or -O3 will give bigger but faster code.

Edit: this means how c to asm is optimized. It will not affect your inline asm, for the sake of clarification to all. And your asm will be quicker anyhow. In any case, the difference between the compiler output and inline asm should not be 2x. The compiler is doing a lousy job if that is the case.
User avatar
jonnection
 
Posts: 317
Joined: Sun May 04, 2014 8:21 pm

Re: Having fun optimizing a sprite routine

Postby Sorunome » Sat Aug 08, 2015 8:40 pm

jonnection wrote:@sorunome

Default optimization setting in Arduino IDE for avr-gcc is -Os (optimize for size). You can find it in platforms.txt in compiler flags... if I remeber it right. -O2 or -O3 will give bigger but faster code.

Edit: this means how c to asm is optimized. It will not affect your inline asm, for the sake of clarification to all. And your asm will be quicker anyhow. In any case, the difference between the compiler output and inline asm should not be 2x. The compiler is doing a lousy job if that is the case.

I know of the optimization compiler flags, thanks, the thing is, though, that if you tell the whole thing to O2 or even O3 your output hex will be way larger, not only your sprite routine ;)
User avatar
Sorunome
 
Posts: 629
Joined: Sun Mar 01, 2015 1:58 pm

Re: Having fun optimizing a sprite routine

Postby jonnection » Sun Aug 09, 2015 5:03 am

Yeah. I just thought I'd bring up the topic since optimization is whats being discussed here. At least its useful to know that Arduino IDE default setting is -Os (optimize for size).

You are 100% correct -O2 will blow up the program size.

Setting optimization flags can also be done for only for 1 funtion:

Code: Select all
void __attribute__((optimize("O2"))) whateverfunction(unsigned char data) {
    // Function code
}


The point is that the C compiler should be clever enough to put (edit: and keep!!!) variables in same registers during the function, just as you have done with asm.
User avatar
jonnection
 
Posts: 317
Joined: Sun May 04, 2014 8:21 pm

Re: Having fun optimizing a sprite routine

Postby Sorunome » Sun Aug 09, 2015 9:31 am

jonnection wrote:Yeah. I just thought I'd bring up the topic since optimization is whats being discussed here. At least its useful to know that Arduino IDE default setting is -Os (optimize for size).

You are 100% correct -O2 will blow up the program size.

Setting optimization flags can also be done for only for 1 funtion:

Code: Select all
void __attribute__((optimize("O2"))) whateverfunction(unsigned char data) {
    // Function code
}


The point is that the C compiler should be clever enough to put (edit: and keep!!!) variables in same registers during the function, just as you have done with asm.

Oooooh, I didn't know you could apply that to only a certain function! That sure opens more possibilities, thanks!
User avatar
Sorunome
 
Posts: 629
Joined: Sun Mar 01, 2015 1:58 pm

Next

Return to Software Development

Who is online

Users browsing this forum: No registered users and 26 guests

cron