- Code: Select all
C:/arduino/hardware/tools/avr/bin/avr-objdump -h -S C:\Users\YOUR_USER_NAME_\AppData\Local\Temp\build12345678_or_something.tmp/yourapp.cpp.elf > yourapp.lst
This shows you the assembly locations of all the functions in your code, so you can load your HEX into Simbuino, place breakpoints at relevant points and time the exact number of cycles your code takes to run. To get accurate timing you need to disable interrupts while your code is running:
- Code: Select all
noInterrupts();
ultraDraw3(face, x, y);
interrupts();
A quick search of the LST file shows where this is located in memory:
- Code: Select all
3be: f8 94 cli
3c0: 4d 2f mov r20, r29
3c2: 6c 2f mov r22, r28
3c4: 8a e0 ldi r24, 0x0A ; 10
3c6: 91 e0 ldi r25, 0x01 ; 1
3c8: 0e 94 93 01 call 0x326 ; 0x326 <_Z10ultraDraw3Phcc>
3cc: 78 94 sei
As for optimization, a few things stick out immediately:
- the test for "if(y>=0)" is being done inside the loop, it only needs to be done at the start.
- this loop body is only a few instructions long, so loop mechanics are going to have significant overhead. given that the loop executes exactly 8 times you'll be better off unrolling it.
- keeping separate pointers for buf and buf2 isn't as useful on AVR as it is on other architectures due to the limited number of CPU registers that can be used for memory access. given that buf2 is always exactly 84 bytes ahead of buf you're better off using a single pointer with constant indexing.
- arbitrary shifts on the AVRs are slooooow!
This last point is significant, look at the assembly shift statement for "*(buf2) |= data[i] >> (8-y%8);":
- Code: Select all
394: 8d 91 ld r24, X+ ; load data[i] into r24
396: 90 e0 ldi r25, 0x00 ; signed-extend to 16-bits
398: 02 2e mov r0, r18 ; load r0 with (8-y%8)
39a: 02 c0 rjmp .+4 ; skip next instruction (WHY??? pseudo-optimization perhaps?)
39c: 95 95 asr r25 ; shift r25:r24 right by one bit
39e: 87 95 ror r24
3a0: 0a 94 dec r0 ; we done with the loop yet
3a2: e2 f7 brpl .-8 ; if not then keep going
AVRs can only shift one bit at a time, so the code has to use a loop to shift by (8-y%8). Secondly, the compiler is extending "data[i]" to 16-bits and doing the shift on that even though the top byte will always be 0. On AVRs you're much better off doing shifts with a multiplication instruction which takes only 2 cycles, and given that there are only 8 possible values we may as well store it in a look-up table:
- Code: Select all
const static byte rol_lut[] = {1<<0, 1<<1, 1<<2, 1<<3, 1<<4, 1<<5, 1<<6, 1<<7}; // store this somewhere globally
....
buf[0] |= *data * rol_lut[y & 7]; // do this once per pixel
This can also be used for right shifts, although in this case the result will be 16-bit and we need the upper 8 bits, which the compiler is smart enough to implement with a byte swap:
- Code: Select all
const static unsigned ror_lut[] = {0x100>>0, 0x100>>1, 0x100>>2, 0x100>>3, 0x100>>4, 0x100>>5, 0x100>>6, 0x100>>7};
....
buf[84+0] |= (data[0] * ror_lut[y & 7]) >> 8; // do this once per pixel
So put this all together and here's a new version:
- Code: Select all
void ultraDraw5(register byte * data, const char x, register char y) {
if(y>=0) {
register uint8_t* buf = ((y&0xF8)>>1) * 21 + gb.display.getBuffer() + x;
y &= 7;
register uint8_t scale1 = ror_lut[y];
register uint16_t scale2 = ror_lut[y];
buf[0] |= data[0] * scale1;
buf[84+0] |= (data[0] * scale2) >> 8;
buf[1] |= data[1] * scale1;
buf[84+1] |= (data[1] * scale2) >> 8;
buf[2] |= data[2] * scale1;
buf[84+2] |= (data[2] * scale2) >> 8;
buf[3] |= data[3] * scale1;
buf[84+3] |= (data[3] * scale2) >> 8;
buf[4] |= data[4] * scale1;
buf[84+4] |= (data[4] * scale2) >> 8;
buf[5] |= data[5] * scale1;
buf[84+5] |= (data[5] * scale2) >> 8;
buf[6] |= data[6] * scale1;
buf[84+6] |= (data[6] * scale2) >> 8;
buf[7] |= data[7] * scale1;
buf[84+7] |= (data[7] * scale2) >> 8;
}
}
According to Simbuino the original ultraDraw3 function executes in 621 cycles. With these changes alone the function now executes in 313.