You would run each frame from 15 to 0 to perform a fadeout, for example.
Right now, I emulate this using a 1MB lookup table. That's probably bad for the cache and a waste of memory, so I want to replace it with fast runtime math instead. Below is what I have so far.
Note that the below code only adjusts one pixel. In the actual emulator, I plan to only cast ltable / use the switch statement once, and implement the actual for loop through all of the pixels inside of each switch.
This is just for an example :)
Code:
r = bgr555_color;
#ifdef LUT
extern uint16 light_table[16][32768]; //16 levels * 32768 colors * 2 bytes = 1MB
uint16 *ltable = light_table[b];
r = ltable[r];
#else
#define p1 ((r & 0x4210) >> 4)
#define p2 ((r & 0x6318) >> 3)
#define p4 ((r & 0x739c) >> 2)
#define p8 ((r & 0x7bde) >> 1)
//0 10000 10000 10000 = 4210
//0 11000 11000 11000 = 6318
//0 11100 11100 11100 = 739c
//0 11110 11110 11110 = 7bde
switch(b) {
case 0: r = 0; break;
case 1: r = p1; break;
case 2: r = p2; break;
case 3: r = p2 + p1; break;
case 4: r = p4; break;
case 5: r = p4 + p1; break;
case 6: r = p4 + p2; break;
case 7: r = p4 + p2 + p1; break;
case 8: r = p8; break;
case 9: r = p8 + p1; break;
case 10: r = p8 + p2; break;
case 11: r = p8 + p2 + p1; break;
case 12: r = p8 + p4; break;
case 13: r = p8 + p4 + p1; break;
case 14: r = p8 + p4 + p2; break;
default: break; //15 = full brightness, r is left unchanged
}
#undef p1
#undef p2
#undef p4
#undef p8
#endif
I believe most of these cases are fairly quick. The only ones I am really concerned about are 7, 11, 13 and 14. Those each require three ANDs, three shifts and two adds, plus the final assignment.
I'm specifically looking for improved algorithms for these. But perhaps the two-term cases (3, 5, 6, 9, 10 and 12) could be sped up as well somehow?
Unfortunately, I'm only aware of how to do simplistic "divide by powers of two" tricks, so I'm pretty limited on what I can improve here :/
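One direction that might be worth trying is a single packed multiply instead of summing shifted terms. This is only a sketch under some assumptions: `adjust_mul` is a made-up name, `b` is the 0-15 brightness level, and 15 is treated as full brightness like the `default` case above. The idea is to move the middle channel into the upper half of a 32-bit word so every 5-bit field has spare bits above it, multiply all three channels by `b` at once, then shift right by 4 and fold the word back together. Note that it truncates once per channel rather than once per term, so results can differ slightly from the switch (for a channel value of 31 at level 14, this gives 27 where p8 + p4 + p2 gives 25).

```c
#include <stdint.h>

/* Scale a BGR555 pixel to b/16 brightness for b in 0..14.
   b >= 15 returns the pixel unchanged (full brightness). */
static uint16_t adjust_mul(uint16_t c, unsigned b)
{
    if (b >= 15)
        return c;

    /* Spread the channels out: bits 5-9 move up to bits 21-25, so each
       5-bit field has at least 5 free bits above it. */
    uint32_t x = (c | ((uint32_t)c << 16)) & 0x03e07c1fu;

    /* One multiply scales all three channels; 31 * 14 fits in 9 bits,
       so no field can spill into its neighbour. */
    x = (x * b) >> 4;

    /* Drop the fractional bits left between fields, then fold the
       upper half back down into bits 5-9. */
    x &= 0x03e07c1fu;
    return (uint16_t)(x | (x >> 16));
}
```

In a real loop the `b >= 15` test would presumably be hoisted out, the same way the switch is meant to be evaluated only once per frame.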
Also, one problem is that with only 0-15 as sixteenths, we're technically missing one step (either 15/16 or 16/16 has to go). What I'd want is, e.g.:
0 = 0% brightness
8 = 50% brightness
12 = 75% brightness
15 = 100% brightness
With my lookup table, that's much easier to account for. I can just use 1 / 15 for steps instead of 1 / 16. That won't be quite as easy for this.
So... perhaps the lookup table really is the best solution for this puzzle? Or maybe split the table in two, increase the math requirements but decrease mem / cache requirements?
ex: p = light_table[p >> 10] | light_table_gr[p & 0x3ff];
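To make the split-table idea a bit more concrete, here is a rough sketch. All names in it (`light_b`, `light_gr`, `scale5`, `init_light_tables`, `adjust_split`) are made up for illustration, and it assumes each channel is scaled independently, which is what makes the split possible at all. It uses the 1 / 15 step, rounded to nearest, so level 15 comes out as exactly 100%. The two tables together take 16 * (32 + 1024) * 2 = 33,792 bytes, versus 1MB for the full table.

```c
#include <stdint.h>

/* Hypothetical tables, indexed by brightness level first. */
static uint16_t light_b[16][32];     /* bits 10-14, stored pre-shifted */
static uint16_t light_gr[16][1024];  /* bits 0-9, both low channels    */

/* Scale one 5-bit channel to level/15, rounding to nearest. */
static uint16_t scale5(unsigned c, unsigned level)
{
    return (uint16_t)((c * level + 7) / 15);
}

/* Fill both tables once at startup. */
static void init_light_tables(void)
{
    for (unsigned level = 0; level < 16; level++) {
        for (unsigned c = 0; c < 32; c++)
            light_b[level][c] = (uint16_t)(scale5(c, level) << 10);
        for (unsigned p = 0; p < 1024; p++)
            light_gr[level][p] = (uint16_t)((scale5(p >> 5, level) << 5)
                                            | scale5(p & 31, level));
    }
}

/* Two small lookups and an OR per pixel; bit 15 is ignored. */
static uint16_t adjust_split(uint16_t p, unsigned level)
{
    p &= 0x7fff;
    return (uint16_t)(light_b[level][p >> 10] | light_gr[level][p & 0x3ff]);
}
```

Whether two lookups out of about 33KB actually beat one lookup out of 1MB would need measuring on the target machine, but for a fixed level only 2,112 bytes of table get touched per frame.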