Optimal RGB Mix/Clamped Add/Sub

blargg · Post by **blargg** » Tue Feb 28, 2006 2:49 am

I'm working on some tutorials about efficient algorithms for packed RGB mixing and add/subtract with clamping, which might be helpful for implementing the graphics operations in a SNES emulator. So far I've written one for mixing:

Mixing Packed RGB Pixels Efficiently

Code: Select all

low_bits = (X ^ Y) & 0x0421;
result   = (X + Y - low_bits) >> 1;

I'm working on another for clamped add/subtract. It's a fun diversion to find subtle combinations of bit operations that give the desired result. I still don't have as good an intuitive grasp on subtraction as I do on addition.

Post if you want more explanation of how this one works, otherwise I'll post the clamped add/subtract shortly, which is sure to take some head-scratching to grasp.

EDIT: Reproduced final code from web page.

creaothceann · Post by **creaothceann** » Fri Mar 03, 2006 1:48 pm

I'd certainly be interested in the add/sub algorithm.

I'm using 32-bit values, but it should be easy to adapt.

byuu · Post by **byuu** » Fri Mar 03, 2006 2:17 pm

The best I can come up with for clamped add/sub is 32 bitwise operations and no branches, or much less bitwise operations with three branches.

The only way I can turn parts from 32-63 to 31 is with r = (x+y)|(31-(x+y)&32)&31. And all that per channel. Otherwise r = (x+y); r = r >= 32 ? 31 : r;
The fast add doesn't work because you have to round off the low bit since you're expecting to divide anyway, which isn't the case in normal add/sub effects.

The fast 50% mix is genius though :D

creaothceann · Post by **creaothceann** » Fri Mar 03, 2006 3:00 pm

Yeah... SNES does this when color add/sub is enabled:

(optional) X = X + Y or X = X - Y
(optional) X / 2
if X < 0 then X = 0, else if X > 31 then X = 31

sinamas · Post by **sinamas** » Fri Mar 03, 2006 6:13 pm

byuu wrote:The only way I can turn parts from 32-63 to 31 is with r = (x+y)|(31-(x+y)&32)&31. And all that per channel. Otherwise r = (x+y); r = r >= 32 ? 31 : r;

The ternary would probably translate to a cmp and a cmov (no branch) if you compile for x86 >= i686.

If you absolutely insist on avoiding anything that remotely resembles a branch, I suppose you could always do r=x+y; r+=(0x1F-r)*(r>>5); (where x and y are 5-bit numbers), or r|=0x1F*(r>>5); r&=0x1F;. For subtraction you could simply do r=x-y; r-=r*(r>>31); (using unsigned 32-bit ints). Since compares don't necessarily mean branching you could probably even do r+=(0x1F-r)*(r>0x1F); and r-=r*(r<0); It would even be feasible to use a look-up table for such limited additions

sinamas · Post by **sinamas** » Fri Mar 03, 2006 8:01 pm

What do you think of this for an rgb15 clamped add?

Code: Select all

unsigned int sum=pixel1+pixel2;
unsigned int tmp=(sum-((pixel1^pixel2)&0x0420))&0x8420;
sum-=tmp;
tmp|=tmp>>1;
tmp|=tmp>>2;
tmp|=tmp>>1;
tmp>>=1;
sum|=tmp;

blargg · Post by **blargg** » Fri Mar 03, 2006 9:26 pm

Wow, as I was going to write the page describing clamped add, I came up with a new, much simpler algorithm that's 3 operations less than before. It came out similar to the RGB mixing algorithm (and sinamas' version above):

Adding RGB Pixels With Clamping

Code: Select all

sum      = x + y;
low_bits = (x ^ y) & 0x0421;
carries  = (sum - low_bits) & 0x8420;
modulo   = sum - carries;
clamp    = carries - (carries >> 5);
result   = modulo | clamp;

Only 9 operations. Discuss.

sinamas · Post by **sinamas** » Fri Mar 03, 2006 9:42 pm

clamp = carries - (carries >> 5);

Birlliant. I had a feeling it could be optimized further.

byuu · Post by **byuu** » Sat Mar 04, 2006 12:11 am

clamp = carries - (carries >> 5);

...I literally spent the last two days trying to come up with that exact line of code. blargg, you are a god among men o.O

Now to work color halve into there, heh.
The halve is done after adding, so 1+1 becomes (1+1)>>1=1, and not (1>>1+1>>1)=0.

And subtraction needs the same halve trick, and then we have all four as optimized as possible! I'd bet even ZSNES could benefit from these algorithms, as they are extremely non-obvious.

blargg · Post by **blargg** » Sat Mar 04, 2006 12:27 am

Now to work color halve into there, heh.

Eh? If you halve the intermediate sum, it's impossible for it to overflow.

I can see how it would be necessary for the subtract clamped operation. Once I post that we can work that out.

And is it just the BBTech skin/my browser, or is it lame for hyperlinks to have no visual distinction whatsoever until you happen to pass your mouse pointer over them? I'd feel kind of stupid saying "This is a link:" when the notation has already been worked out over ten years ago: underline it. This takes me back to those graphical adventure games where you had to try clicking on everything in order to find the hidden objects you could interact with.

-_pentium5.1_- · Post by **-_pentium5.1_-** » Sat Mar 04, 2006 12:43 am

Good work. Are the prospects good for this color blending code to be eventually implemented in ZSNES?

blargg wrote:And is it just the BBTech skin/my browser, or is it lame for hyperlinks to have no visual distinction whatsoever until you happen to pass your mouse pointer over them? I'd feel kind of stupid saying "This is a link:" when the notation has already been worked out over ten years ago: underline it. This takes me back to those graphical adventure games where you had to try clicking on everything in order to find the hidden objects you could interact with.

That should be posted in the "Forum" section of the board (which is listed under the "Site" heading).

byuu · Post by **byuu** » Sat Mar 04, 2006 12:45 am

Eh? If you halve the intermediate sum, it's impossible for it to overflow.

Not following, but I wrote a quickie app to run through all 4 million combinations of x and y to test, so I'm sure I can figure out where to right shift the inputs to get the desired result.
I was thinking it might work better to simplify the equation some by taking the halve into account somewhere in there, or something :/
Edit: oh, duh. You mean like use your old 50% blend code and be done with it. Ok, heh.

And is it just the BBTech skin/my browser, or is it lame for hyperlinks to have no visual distinction whatsoever until you happen to pass your mouse pointer over them?

I do that on my site too :(
I at least have a better contrast difference between the two. You're right though, it looks terrible. I just don't like seeing underlined text anymore. Probably because of "Intellitext". Maybe one could bold the links and darken them. So long as they're very visually distinctive from plain text, and consistent amongst the entire site, it's fine in my opinion.

kode54 · Post by **kode54** » Sat Mar 04, 2006 12:53 am

byuu wrote:
blargg wrote:
byuu wrote:Now to work color halve into there, heh.
Eh? If you halve the intermediate sum, it's impossible for it to overflow.
Not following, but I wrote a quickie app to run through all 4 million combinations of x and y to test, so I'm sure I can figure out where to right shift the inputs to get the desired result.

Doesn't the original post and linked article already cover add/halve?

sinamas · Post by **sinamas** » Sat Mar 04, 2006 2:25 am

I got bored again

Code: Select all

unsigned int sub=px1-px2;
const unsigned int borrows=(~(sub+((~(px1^px2))&0x8420)))&0x8420;
sub+=borrows;
const unsigned int clamp=borrows-(borrows>>5);
sub&=~clamp;

grrr... stupid "nots"

blargg · Post by **blargg** » Sat Mar 04, 2006 3:00 am

sinamas, I've enjoyed your solutions since I can see we're both thinking along the same lines, encountering the same issues. I also had to wrestle with negations when handling the subtraction case. I think I even had the same intermediate solution as you posted before I was able to eliminate operations.

This has got me thinking of whether addition and subtraction are symmetrical, and whether the general set of bit operations all have mirror images. My conclusion is that subtraction is not a complement (just the fact that it's not associative should be a clue), so the subtraction case can't be done with as few operations (mine took 10 operations as compared to 9 for add clamped).

One interesting thing to note is that x ^ ~y is the EQV operation, basically the reverse of XOR: if both bits are the same, the result bit is 1. The PowerPC actually has this instruction, eqv, so when I write z = x ^ ~y it generates eqv z,x,y. I decided not to count this as a single operation since it's not a commonly implemented instruction.

byuu · Post by **byuu** » Sat Mar 04, 2006 3:24 am

Hm, in terms of logic gates this operation is usually called XNOR.
NAND and NOR are other ones that should be single opcodes on modern processors, but usually aren't.

Only ten operations for your subtract? Can you share that version with us, please? :)

That makes three of four SNES color add/sub algorithms fully optimized. Heh, this isn't even a bottleneck, but I admit it's fascinating what you two are coming up with.

byuu · Post by **byuu** » Sat Mar 04, 2006 4:18 am

clock() ticks on an Athlon 3500+:

Code: Select all

 add_old = 21343
 add_new = 10157
hadd_old = 15765
hadd_new =  6781
 sub_old = 13454
 sub_new = 12796
hsub_old = 13532
hsub_new = 13203

New code:

Code: Select all

uint16 add_new(uint32 x, uint32 y) {
uint sum     = x + y;
uint carries = (sum - ((x ^ y) & 0x0421)) & 0x8420;
  return (sum - carries) | (carries - (carries >> 5));
}

uint16 hadd_new(uint32 x, uint32 y) {
  return (x + y - ((x ^ y) & 0x0421)) >> 1;
}

uint16 sub_new(uint32 x, uint32 y) {
uint sub = x - y;
uint borrows = (~(sub + ((~(x ^ y)) & 0x8420))) & 0x8420;
  sub += borrows;
  return sub & ~(borrows - (borrows >> 5));
}

uint16 hsub_new(uint32 x, uint32 y) {
uint sub = x - y;
uint borrows = (~(sub + ((~(x ^ y)) & 0x8420))) & 0x8420;
  sub += borrows;
  return ((sub & ~(borrows - (borrows >> 5))) & 0x7bde) >> 1;
}

Old code:

Code: Select all

uint16 add_old(uint32 x, uint32 y) {
  x = ((x << 16) | x) & 0x03e07c1f;
  y = ((y << 16) | y) & 0x03e07c1f;
  x +=y;

  if(x & 0x04000000)x |= 0x03e00000;
  if(x & 0x00008000)x |= 0x00007c00;
  if(x & 0x00000020)x |= 0x0000001f;

  x &= 0x03e07c1f;
  return (x >> 16) | x;
}

uint16 hadd_old(uint32 x, uint32 y) {
  x = ((x << 16) | x) & 0x03e07c1f;
  y = ((y << 16) | y) & 0x03e07c1f;
  x +=y;

  x >>= 1;

  x &= 0x03e07c1f;
  return (x >> 16) | x;
}

uint16 sub_old(uint32 x, uint32 y) {
  if((x & 0x7c00) < (y & 0x7c00))x &= ~0x7c00;
  else x -= (y & 0x7c00);
  if((x & 0x03e0) < (y & 0x03e0))x &= ~0x03e0;
  else x -= (y & 0x03e0);
  if((x & 0x001f) < (y & 0x001f))x &= ~0x001f;
  else x -= (y & 0x001f);
  return x;
}

uint16 hsub_old(uint32 x, uint32 y) {
  if((x & 0x7c00) < (y & 0x7c00))x &= ~0x7c00;
  else x -= (y & 0x7c00);
  if((x & 0x03e0) < (y & 0x03e0))x &= ~0x03e0;
  else x -= (y & 0x03e0);
  if((x & 0x001f) < (y & 0x001f))x &= ~0x001f;
  else x -= (y & 0x001f);
  return (x & 0x7bde) >> 1;
}

Not bad at all, especially for addition! I'm kind of surprised at how close my subtraction code was speed-wise to the non-branching code :/

I ran all four through all 32768*32768 iterations and they are all 100% identical.

blargg · Post by **blargg** » Sat Mar 04, 2006 5:17 am

Not bad at all, especially for addition! I'm kind of surprised at how close my subtraction code was speed-wise to the non-branching code :/

Twice the performance if you process two pixels simultaneously. Probably more if you process two pairs in parallel, due to the pipeline stalling in the non-parallel version (most operations depend on the result of the previous). Your timing method may also be including the (fixed) overhead of the timing loop. In other words, if you were to time a no-op function, the proper result would be infinite iterations per second. You could partially compensate for this by timing such a function and adjusting your results. Timing is such a complex topic. And of course the real performance that matters is how it affects the speed of bsnes, which is probably less than 1%. Reality always ruins the fun of micro-optimization.

NAND and NOR are other ones that should be single opcodes on modern processors, but usually aren't.

PowerPC even has these, which I think is pretty odd since they are rarely useful.

Anyway, here's the 10-operation clamped subtract:

http://www.slack.net/~ant/info/rgb_clamped_sub.html

Code: Select all

diff     = x - y + 0x8420;
low_bits = (x ^ y) & 0x8420;
borrows  = (diff - low_bits) & 0x8420;
modulo   = diff - borrows;
clamp    = borrows - (borrows >> 5);
result   = modulo & clamp;

creaothceann · Post by **creaothceann** » Sat Mar 04, 2006 10:13 am

blargg wrote:And is it just the BBTech skin/my browser, or is it lame for hyperlinks to have no visual distinction whatsoever until you happen to pass your mouse pointer over them? I'd feel kind of stupid saying "This is a link:" when the notation has already been worked out over ten years ago: underline it. This takes me back to those graphical adventure games where you had to try clicking on everything in order to find the hidden objects you could interact with.

You can select subSilver in your profile.

Post by **pagefault** » Sat Mar 04, 2006 4:24 pm

Wow this is some really nice work. I can finally understand how colour sub/add works now that I can read some legable code

. I hope now that I understand how it works I can figure out where the problem is in ZSNES's code.

Deathlike2 · Post by **Deathlike2** » Sat Mar 04, 2006 6:40 pm

Nice to see programmers looking over something even if it wasn't asked... it would be good to see what this code affects/fixes

byuu · Post by **byuu** » Sat Mar 04, 2006 10:31 pm

Thanks blargg, your subtraction is ~20% faster and makes the addition and subtraction run at roughly the same speed. Of course, my timing method sucks, but eh.

I can finally understand how colour sub/add works now that I can read some legable code

Heh, wow, that code is incredibly complex compared to what I was using before. I still don't understand the subtraction code, but I'll read over blargg's doc again now that I've gotten some sleep :)
As far as I've noticed, ZSNES has no problems with the actual add/sub adjustment algorithms, I've tested lots of things against it while initially developing bsnes. I believe the problems in ZSNES relate to it deciding when (and not) to use add/sub effects. And even then, there aren't many things I've noticed there that didn't match my copier.
These algorithms should speed things up for you guys though, which is always a good thing.

So real quick and slightly offtopic, blargg... can you optimize one more subtraction thing for me please? :)
In the HQ2x filter, the major bottleneck is the diff() function, everything else is fairly fast.
What the filter does is load the current pixel and its eight surrounding pixels in 16-bit RGB, cast them into a YUV lookup table and pull out a YUV888 value into a 32-bit value.
It then tests to see if the central pixel is "significantly" different from each pixel around it. However, this diff function is awfully slow...

Code: Select all

const  int   Ymask = 0x00FF0000;
const  int   Umask = 0x0000FF00;
const  int   Vmask = 0x000000FF;
const  int   trY   = 0x00300000;
const  int   trU   = 0x00000700;
const  int   trV   = 0x00000006;

inline bool Diff(unsigned int w1, unsigned int w2)
{
  YUV1 = RGBtoYUV[w1];
  YUV2 = RGBtoYUV[w2];
  return ( ( abs((YUV1 & Ymask) - (YUV2 & Ymask)) > trY ) ||
           ( abs((YUV1 & Umask) - (YUV2 & Umask)) > trU ) ||
           ( abs((YUV1 & Vmask) - (YUV2 & Vmask)) > trV ) );
}

I can optimize things to not read from the lookup tables so often and such, but that diff function alludes me. So far, I've tried making the YUV table a mixed value (e.g. YYYYUUUUVVVV->YYYUVUVYUUVV) and get the correct result ~80% of the time, and making the table as RGBtoYUV = y * trY + u * trU + v * trV; and getting the correct result ~70% of the time...

I could modify your 16-bit subtraction to work on these two values, but that won't wrap around like this function is via the abs call :/
And even once the subtraction is done... how to quickly test for (value & 0xff0000) > trY || (value & 0xff00) > trU || (value & 0xff) > trV?

If you're busy it's no big deal, I can probably figure something better out.

blargg · Post by **blargg** » Sat Mar 04, 2006 11:28 pm

I love a challenge. I'll look at it later when I get home. One technique I've developed is ANDing/ORing the differences of values. For example, to find out if A < B and C < D:

Code: Select all

if ( ((A - B) & (C - D)) < 0 )
    ...

It just ANDs the sign bits together. I don't know how much faster it would be, though. You could eliminate the branch by then using the sign bit to form a mask, i.e.

Code: Select all

((A - B) & (C - D)) >> 31

Here the mask would be ~0 only when A < B && C < D, 0 otherwise. It relies on a two's complement implementation of signed numbers (i.e. -1 = ~0), which is very common.

EDIT: A few more tricks, but I think I've got the diff function nailed to about four operations (about to implement it and find out).

Code: Select all

int abs( int n )
{
    int neg = n >> 31; // assumes int is 32 bits and >> fills with sign bit
    return (n ^ neg) + (neg & 1);
}

bool in_range( int n, int min, int max )
{
    return (unsigned) (n - min) <= (unsigned) (max - min);
}

For abs, you can probably eliminate the "+ (neg & 1)", since it only affects the result by 1; abs( -2 ) would return 1, for example. You'd probably be calling in_range with constants for min and max, so it reduces to just add and compare.

blargg · Post by **blargg** » Sun Mar 05, 2006 5:55 am

I was wrong; it only took three operations:

Code: Select all

/* non-zero result if colors differ considerably */
unsigned diff( unsigned x, unsigned y )
{
    unsigned offset = (0x440 << 21) + (0x207 << 11) + 0x407;
    unsigned mask   = (0x380 << 21) + (0x1F0 << 11) + 0x3F0;
    return (x - y + offset) & mask;
}

It requires that the RGB to YUV table be in a slightly different format, but this shouldn't matter.

Technique: You can manipulate multiple signed values in a 32-bit integer, as long as you convert to offset-binary when examining the result. When one of the lower values goes negative, it will effectively subtract one from the next component to the left. Adding offset later will cancel this. Consider a decimal example where the input values range from 0 to 10. The difference of any two values is in the range -10 to 10. If you add 100 to the result, it will always be in the range 90 to 110. Since that range has no negative values, it could be used in the low bits of a number without affecting the upper bits.

Technique: You can check whether a value is outside the range -n to (n - 1), where n is a power of two, by adding n and masking with ~(n * 2 - 1). For example, to find if a value is outside the range -16 to 15, add 16 and mask with ~31. The original range will be converted to 0 to 31. A value below zero will set all the bits above bit 4, and a value above 31 will set at least one bit above bit 4. This technique allows multiple range checks on values packed into a single integer.

Putting these together yields the solution. The function in pure form is

Code: Select all

abs( y1 - y2 ) > 0x30 || abs( u1 - u2 ) > 7 || abs( v1 - v2 ) > 6

The first step is to rescale the components so that the ranges are a power of two. So rescale Y by 0x3F / 0x30 and V by 7 / 6. Ideally you'd do the scaling to the Y and V values while they're still floating-point, otherwise you'll get some rounding errors. The thresholds are probably not absolutely critical, so these small errors should be acceptable.

Next, pack the Y U V values in the table with some padding between them. Y and V will have a slightly higher range, so give them each 11 bits, and U the remaining 10 bits.

Code: Select all

packed = (Y << 21) | (U << 11) | V;

That covers table initialization. The final function finds the difference between the two Y/U/V vectors and adds an offset to convert to offset binary. The offset also incorporates the value for the range check technique. The only step left is to mask off the upper bits of each component (except the top bit, which is normally set, rather than clear) and return this. If any of the retained bits are non-zero, one or more of the Y, U, or V component differences were beyond the threshold.

You'll note that the top bit of each component wasn't used. If they covered a larger range (or you wanted more accuracy), this might be necessary, in which case you'd include it in the final mask, then XOR with a value having the top bit set for each component.

byuu · Post by **byuu** » Sun Mar 05, 2006 7:48 am

Genius as always. Thanks a million!

Code: Select all

<current>
matches = 1070858994
misses  =    2882830

<always different>
matches = 1058421684
misses  =   15320140

Not bad. A ~0.27% margin of error. The differences are visible, but you really have to look to see them. It's worth it for the speed increase, and the differences aren't always bad. The formula is actually always kind of sketchy and this just kind of changes what gets wrongly detected as edges.

Comparison screenshots with your new diff function:

Can you tell which one is the original and which one uses the new diff function? "If you can't tell, why should I?" :P
Note that I anticipated that some people may think the obvious in that the left is the original picture (e.g. for before -> after order), but does that mean I swapped them or not? :)

Originally, for the first set of screenshots I was getting 50.5fps, with your diff function I pushed that to 54.5fps, with my own optimizations I pushed it up to 57.5fps.
On the second screenshot: 48.5fps original, 52.5fps + blargg's diff, 56.0fps + my optimizations.

Not bad at all. I get ~80fps with no filter whatsoever, and you have to remember that I'm transferring four times the video data to the card.

Source code is here: http://byuu.cinnamonpirate.com/hq2x/filter_hq2x.cpp

I'm not done improving it either. I think I can turn the c[] and w[] buffers into buffers that span the entire line so there isn't six needless copies per pixel for each buffer.