Optimal RGB Mix/Clamped Add/Sub

Strictly for discussing ZSNES development and for submitting code. You can also join us on IRC at irc.libera.chat in #zsnes.
Please, no requests here.

Moderator: ZSNES Mods

blargg
Regular
Posts: 327
Joined: Thu Jun 30, 2005 1:54 pm
Location: USA
Contact:

Optimal RGB Mix/Clamped Add/Sub

Post by blargg »

I'm working on some tutorials about efficient algorithms for packed RGB mixing and add/subtract with clamping, which might be helpful for implementing the graphics operations in a SNES emulator. So far I've written one for mixing:

Mixing Packed RGB Pixels Efficiently

Code: Select all

low_bits = (X ^ Y) & 0x0421;
result   = (X + Y - low_bits) >> 1;
I'm working on another for clamped add/subtract. It's a fun diversion to find subtle combinations of bit operations that give the desired result. I still don't have as good an intuitive grasp on subtraction as I do on addition.

Post if you want more explanation of how this one works, otherwise I'll post the clamped add/subtract shortly, which is sure to take some head-scratching to grasp. :)

EDIT: Reproduced final code from web page.
Last edited by blargg on Fri Mar 03, 2006 9:29 pm, edited 1 time in total.
creaothceann
Seen it all
Posts: 2302
Joined: Mon Jan 03, 2005 5:04 pm
Location: Germany
Contact:

Post by creaothceann »

I'd certainly be interested in the add/sub algorithm. :)
I'm using 32-bit values, but it should be easy to adapt.
vSNES | Delphi 10 BPLs
bsnes launcher with recent files list
byuu

Post by byuu »

The best I can come up with for clamped add/sub is 32 bitwise operations and no branches, or much less bitwise operations with three branches.

The only way I can turn parts from 32-63 to 31 is with r = (x+y)|(31-(x+y)&32)&31. And all that per channel. Otherwise r = (x+y); r = r >= 32 ? 31 : r;
The fast add doesn't work because you have to round off the low bit since you're expecting to divide anyway, which isn't the case in normal add/sub effects.

The fast 50% mix is genius though :D
creaothceann
Seen it all
Posts: 2302
Joined: Mon Jan 03, 2005 5:04 pm
Location: Germany
Contact:

Post by creaothceann »

Yeah... SNES does this when color add/sub is enabled:

(optional) X = X + Y or X = X - Y
(optional) X / 2
if X < 0 then X = 0, else if X > 31 then X = 31
vSNES | Delphi 10 BPLs
bsnes launcher with recent files list
sinamas
Gambatte Developer
Gambatte Developer
Posts: 157
Joined: Fri Oct 21, 2005 4:03 pm
Location: Norway

Post by sinamas »

byuu wrote:The only way I can turn parts from 32-63 to 31 is with r = (x+y)|(31-(x+y)&32)&31. And all that per channel. Otherwise r = (x+y); r = r >= 32 ? 31 : r;
The ternary would probably translate to a cmp and a cmov (no branch) if you compile for x86 >= i686.

If you absolutely insist on avoiding anything that remotely resembles a branch, I suppose you could always do r=x+y; r+=(0x1F-r)*(r>>5); (where x and y are 5-bit numbers), or r|=0x1F*(r>>5); r&=0x1F;. For subtraction you could simply do r=x-y; r-=r*(r>>31); (using unsigned 32-bit ints). Since compares don't necessarily mean branching you could probably even do r+=(0x1F-r)*(r>0x1F); and r-=r*(r<0); It would even be feasible to use a look-up table for such limited additions :P
sinamas
Gambatte Developer
Gambatte Developer
Posts: 157
Joined: Fri Oct 21, 2005 4:03 pm
Location: Norway

Post by sinamas »

What do you think of this for an rgb15 clamped add?

Code: Select all

unsigned int sum=pixel1+pixel2;
unsigned int tmp=(sum-((pixel1^pixel2)&0x0420))&0x8420;
sum-=tmp;
tmp|=tmp>>1;
tmp|=tmp>>2;
tmp|=tmp>>1;
tmp>>=1;
sum|=tmp;
blargg
Regular
Posts: 327
Joined: Thu Jun 30, 2005 1:54 pm
Location: USA
Contact:

Post by blargg »

Wow, as I was going to write the page describing clamped add, I came up with a new, much simpler algorithm that's 3 operations less than before. It came out similar to the RGB mixing algorithm (and sinamas' version above):

Adding RGB Pixels With Clamping

Code: Select all

sum      = x + y;
low_bits = (x ^ y) & 0x0421;
carries  = (sum - low_bits) & 0x8420;
modulo   = sum - carries;
clamp    = carries - (carries >> 5);
result   = modulo | clamp;
Only 9 operations. Discuss. :)
sinamas
Gambatte Developer
Gambatte Developer
Posts: 157
Joined: Fri Oct 21, 2005 4:03 pm
Location: Norway

Post by sinamas »

clamp = carries - (carries >> 5);
Birlliant. I had a feeling it could be optimized further.
byuu

Post by byuu »

clamp = carries - (carries >> 5);
...I literally spent the last two days trying to come up with that exact line of code. blargg, you are a god among men o.O

Now to work color halve into there, heh.
The halve is done after adding, so 1+1 becomes (1+1)>>1=1, and not (1>>1+1>>1)=0.

And subtraction needs the same halve trick, and then we have all four as optimized as possible! I'd bet even ZSNES could benefit from these algorithms, as they are extremely non-obvious.
blargg
Regular
Posts: 327
Joined: Thu Jun 30, 2005 1:54 pm
Location: USA
Contact:

Post by blargg »

Now to work color halve into there, heh.
Eh? If you halve the intermediate sum, it's impossible for it to overflow. :)

I can see how it would be necessary for the subtract clamped operation. Once I post that we can work that out.

And is it just the BBTech skin/my browser, or is it lame for hyperlinks to have no visual distinction whatsoever until you happen to pass your mouse pointer over them? I'd feel kind of stupid saying "This is a link:" when the notation has already been worked out over ten years ago: underline it. This takes me back to those graphical adventure games where you had to try clicking on everything in order to find the hidden objects you could interact with.
-_pentium5.1_-
Lurker
Posts: 110
Joined: Sat Sep 04, 2004 7:55 pm
Location: USA

Post by -_pentium5.1_- »

Good work. Are the prospects good for this color blending code to be eventually implemented in ZSNES?
blargg wrote:And is it just the BBTech skin/my browser, or is it lame for hyperlinks to have no visual distinction whatsoever until you happen to pass your mouse pointer over them? I'd feel kind of stupid saying "This is a link:" when the notation has already been worked out over ten years ago: underline it. This takes me back to those graphical adventure games where you had to try clicking on everything in order to find the hidden objects you could interact with.
That should be posted in the "Forum" section of the board (which is listed under the "Site" heading).
This signature intentionally contains no text other than this sentence.
byuu

Post by byuu »

Eh? If you halve the intermediate sum, it's impossible for it to overflow.
Not following, but I wrote a quickie app to run through all 4 million combinations of x and y to test, so I'm sure I can figure out where to right shift the inputs to get the desired result.
I was thinking it might work better to simplify the equation some by taking the halve into account somewhere in there, or something :/
Edit: oh, duh. You mean like use your old 50% blend code and be done with it. Ok, heh.
And is it just the BBTech skin/my browser, or is it lame for hyperlinks to have no visual distinction whatsoever until you happen to pass your mouse pointer over them?
I do that on my site too :(
I at least have a better contrast difference between the two. You're right though, it looks terrible. I just don't like seeing underlined text anymore. Probably because of "Intellitext". Maybe one could bold the links and darken them. So long as they're very visually distinctive from plain text, and consistent amongst the entire site, it's fine in my opinion.
kode54
Zealot
Posts: 1140
Joined: Wed Jul 28, 2004 3:31 am
Contact:

Post by kode54 »

byuu wrote:
blargg wrote:
byuu wrote:Now to work color halve into there, heh.
Eh? If you halve the intermediate sum, it's impossible for it to overflow.
Not following, but I wrote a quickie app to run through all 4 million combinations of x and y to test, so I'm sure I can figure out where to right shift the inputs to get the desired result.
Doesn't the original post and linked article already cover add/halve?
sinamas
Gambatte Developer
Gambatte Developer
Posts: 157
Joined: Fri Oct 21, 2005 4:03 pm
Location: Norway

Post by sinamas »

I got bored again

Code: Select all

unsigned int sub=px1-px2;
const unsigned int borrows=(~(sub+((~(px1^px2))&0x8420)))&0x8420;
sub+=borrows;
const unsigned int clamp=borrows-(borrows>>5);
sub&=~clamp;
grrr... stupid "nots"
blargg
Regular
Posts: 327
Joined: Thu Jun 30, 2005 1:54 pm
Location: USA
Contact:

Post by blargg »

sinamas, I've enjoyed your solutions since I can see we're both thinking along the same lines, encountering the same issues. I also had to wrestle with negations when handling the subtraction case. I think I even had the same intermediate solution as you posted before I was able to eliminate operations.

This has got me thinking of whether addition and subtraction are symmetrical, and whether the general set of bit operations all have mirror images. My conclusion is that subtraction is not a complement (just the fact that it's not associative should be a clue), so the subtraction case can't be done with as few operations (mine took 10 operations as compared to 9 for add clamped).

One interesting thing to note is that x ^ ~y is the EQV operation, basically the reverse of XOR: if both bits are the same, the result bit is 1. The PowerPC actually has this instruction, eqv, so when I write z = x ^ ~y it generates eqv z,x,y. I decided not to count this as a single operation since it's not a commonly implemented instruction.
byuu

Post by byuu »

Hm, in terms of logic gates this operation is usually called XNOR.
NAND and NOR are other ones that should be single opcodes on modern processors, but usually aren't.

Only ten operations for your subtract? Can you share that version with us, please? :)

That makes three of four SNES color add/sub algorithms fully optimized. Heh, this isn't even a bottleneck, but I admit it's fascinating what you two are coming up with.
byuu

Post by byuu »

clock() ticks on an Athlon 3500+:

Code: Select all

 add_old = 21343
 add_new = 10157
hadd_old = 15765
hadd_new =  6781
 sub_old = 13454
 sub_new = 12796
hsub_old = 13532
hsub_new = 13203
New code:

Code: Select all

uint16 add_new(uint32 x, uint32 y) {
uint sum     = x + y;
uint carries = (sum - ((x ^ y) & 0x0421)) & 0x8420;
  return (sum - carries) | (carries - (carries >> 5));
}

uint16 hadd_new(uint32 x, uint32 y) {
  return (x + y - ((x ^ y) & 0x0421)) >> 1;
}

uint16 sub_new(uint32 x, uint32 y) {
uint sub = x - y;
uint borrows = (~(sub + ((~(x ^ y)) & 0x8420))) & 0x8420;
  sub += borrows;
  return sub & ~(borrows - (borrows >> 5));
}

uint16 hsub_new(uint32 x, uint32 y) {
uint sub = x - y;
uint borrows = (~(sub + ((~(x ^ y)) & 0x8420))) & 0x8420;
  sub += borrows;
  return ((sub & ~(borrows - (borrows >> 5))) & 0x7bde) >> 1;
}
Old code:

Code: Select all

uint16 add_old(uint32 x, uint32 y) {
  x = ((x << 16) | x) & 0x03e07c1f;
  y = ((y << 16) | y) & 0x03e07c1f;
  x +=y;

  if(x & 0x04000000)x |= 0x03e00000;
  if(x & 0x00008000)x |= 0x00007c00;
  if(x & 0x00000020)x |= 0x0000001f;

  x &= 0x03e07c1f;
  return (x >> 16) | x;
}

uint16 hadd_old(uint32 x, uint32 y) {
  x = ((x << 16) | x) & 0x03e07c1f;
  y = ((y << 16) | y) & 0x03e07c1f;
  x +=y;

  x >>= 1;

  x &= 0x03e07c1f;
  return (x >> 16) | x;
}

uint16 sub_old(uint32 x, uint32 y) {
  if((x & 0x7c00) < (y & 0x7c00))x &= ~0x7c00;
  else x -= (y & 0x7c00);
  if((x & 0x03e0) < (y & 0x03e0))x &= ~0x03e0;
  else x -= (y & 0x03e0);
  if((x & 0x001f) < (y & 0x001f))x &= ~0x001f;
  else x -= (y & 0x001f);
  return x;
}

uint16 hsub_old(uint32 x, uint32 y) {
  if((x & 0x7c00) < (y & 0x7c00))x &= ~0x7c00;
  else x -= (y & 0x7c00);
  if((x & 0x03e0) < (y & 0x03e0))x &= ~0x03e0;
  else x -= (y & 0x03e0);
  if((x & 0x001f) < (y & 0x001f))x &= ~0x001f;
  else x -= (y & 0x001f);
  return (x & 0x7bde) >> 1;
}
Not bad at all, especially for addition! I'm kind of surprised at how close my subtraction code was speed-wise to the non-branching code :/

I ran all four through all 32768*32768 iterations and they are all 100% identical.
blargg
Regular
Posts: 327
Joined: Thu Jun 30, 2005 1:54 pm
Location: USA
Contact:

Post by blargg »

Not bad at all, especially for addition! I'm kind of surprised at how close my subtraction code was speed-wise to the non-branching code :/
Twice the performance if you process two pixels simultaneously. Probably more if you process two pairs in parallel, due to the pipeline stalling in the non-parallel version (most operations depend on the result of the previous). Your timing method may also be including the (fixed) overhead of the timing loop. In other words, if you were to time a no-op function, the proper result would be infinite iterations per second. You could partially compensate for this by timing such a function and adjusting your results. Timing is such a complex topic. And of course the real performance that matters is how it affects the speed of bsnes, which is probably less than 1%. Reality always ruins the fun of micro-optimization. :)
NAND and NOR are other ones that should be single opcodes on modern processors, but usually aren't.
PowerPC even has these, which I think is pretty odd since they are rarely useful.

Anyway, here's the 10-operation clamped subtract:

http://www.slack.net/~ant/info/rgb_clamped_sub.html

Code: Select all

diff     = x - y + 0x8420;
low_bits = (x ^ y) & 0x8420;
borrows  = (diff - low_bits) & 0x8420;
modulo   = diff - borrows;
clamp    = borrows - (borrows >> 5);
result   = modulo & clamp;
creaothceann
Seen it all
Posts: 2302
Joined: Mon Jan 03, 2005 5:04 pm
Location: Germany
Contact:

Post by creaothceann »

blargg wrote:And is it just the BBTech skin/my browser, or is it lame for hyperlinks to have no visual distinction whatsoever until you happen to pass your mouse pointer over them? I'd feel kind of stupid saying "This is a link:" when the notation has already been worked out over ten years ago: underline it. This takes me back to those graphical adventure games where you had to try clicking on everything in order to find the hidden objects you could interact with.
You can select subSilver in your profile.
vSNES | Delphi 10 BPLs
bsnes launcher with recent files list
pagefault
ZSNES Developer
ZSNES Developer
Posts: 812
Joined: Tue Aug 17, 2004 5:24 am
Location: In your garden

Post by pagefault »

Wow this is some really nice work. I can finally understand how colour sub/add works now that I can read some legable code :). I hope now that I understand how it works I can figure out where the problem is in ZSNES's code.
Deathlike2
ZSNES Developer
ZSNES Developer
Posts: 6747
Joined: Tue Dec 28, 2004 6:47 am

Post by Deathlike2 »

Nice to see programmers looking over something even if it wasn't asked... it would be good to see what this code affects/fixes
Continuing [url=http://slickproductions.org/forum/index.php?board=13.0]FF4[/url] Research...
byuu

Post by byuu »

Thanks blargg, your subtraction is ~20% faster and makes the addition and subtraction run at roughly the same speed. Of course, my timing method sucks, but eh.
I can finally understand how colour sub/add works now that I can read some legable code
Heh, wow, that code is incredibly complex compared to what I was using before. I still don't understand the subtraction code, but I'll read over blargg's doc again now that I've gotten some sleep :)
As far as I've noticed, ZSNES has no problems with the actual add/sub adjustment algorithms, I've tested lots of things against it while initially developing bsnes. I believe the problems in ZSNES relate to it deciding when (and not) to use add/sub effects. And even then, there aren't many things I've noticed there that didn't match my copier.
These algorithms should speed things up for you guys though, which is always a good thing.

So real quick and slightly offtopic, blargg... can you optimize one more subtraction thing for me please? :)
In the HQ2x filter, the major bottleneck is the diff() function, everything else is fairly fast.
What the filter does is load the current pixel and its eight surrounding pixels in 16-bit RGB, cast them into a YUV lookup table and pull out a YUV888 value into a 32-bit value.
It then tests to see if the central pixel is "significantly" different from each pixel around it. However, this diff function is awfully slow...

Code: Select all

const  int   Ymask = 0x00FF0000;
const  int   Umask = 0x0000FF00;
const  int   Vmask = 0x000000FF;
const  int   trY   = 0x00300000;
const  int   trU   = 0x00000700;
const  int   trV   = 0x00000006;

inline bool Diff(unsigned int w1, unsigned int w2)
{
  YUV1 = RGBtoYUV[w1];
  YUV2 = RGBtoYUV[w2];
  return ( ( abs((YUV1 & Ymask) - (YUV2 & Ymask)) > trY ) ||
           ( abs((YUV1 & Umask) - (YUV2 & Umask)) > trU ) ||
           ( abs((YUV1 & Vmask) - (YUV2 & Vmask)) > trV ) );
}
I can optimize things to not read from the lookup tables so often and such, but that diff function alludes me. So far, I've tried making the YUV table a mixed value (e.g. YYYYUUUUVVVV->YYYUVUVYUUVV) and get the correct result ~80% of the time, and making the table as RGBtoYUV = y * trY + u * trU + v * trV; and getting the correct result ~70% of the time...

I could modify your 16-bit subtraction to work on these two values, but that won't wrap around like this function is via the abs call :/
And even once the subtraction is done... how to quickly test for (value & 0xff0000) > trY || (value & 0xff00) > trU || (value & 0xff) > trV?

If you're busy it's no big deal, I can probably figure something better out.
blargg
Regular
Posts: 327
Joined: Thu Jun 30, 2005 1:54 pm
Location: USA
Contact:

Post by blargg »

I love a challenge. I'll look at it later when I get home. One technique I've developed is ANDing/ORing the differences of values. For example, to find out if A < B and C < D:

Code: Select all

if ( ((A - B) & (C - D)) < 0 )
    ...
It just ANDs the sign bits together. I don't know how much faster it would be, though. You could eliminate the branch by then using the sign bit to form a mask, i.e.

Code: Select all

((A - B) & (C - D)) >> 31
Here the mask would be ~0 only when A < B && C < D, 0 otherwise. It relies on a two's complement implementation of signed numbers (i.e. -1 = ~0), which is very common.

EDIT: A few more tricks, but I think I've got the diff function nailed to about four operations (about to implement it and find out).

Code: Select all

int abs( int n )
{
    int neg = n >> 31; // assumes int is 32 bits and >> fills with sign bit
    return (n ^ neg) + (neg & 1);
}

bool in_range( int n, int min, int max )
{
    return (unsigned) (n - min) <= (unsigned) (max - min);
}
For abs, you can probably eliminate the "+ (neg & 1)", since it only affects the result by 1; abs( -2 ) would return 1, for example. You'd probably be calling in_range with constants for min and max, so it reduces to just add and compare.
blargg
Regular
Posts: 327
Joined: Thu Jun 30, 2005 1:54 pm
Location: USA
Contact:

Post by blargg »

I was wrong; it only took three operations:

Code: Select all

/* non-zero result if colors differ considerably */
unsigned diff( unsigned x, unsigned y )
{
    unsigned offset = (0x440 << 21) + (0x207 << 11) + 0x407;
    unsigned mask   = (0x380 << 21) + (0x1F0 << 11) + 0x3F0;
    return (x - y + offset) & mask;
}
It requires that the RGB to YUV table be in a slightly different format, but this shouldn't matter.

Technique: You can manipulate multiple signed values in a 32-bit integer, as long as you convert to offset-binary when examining the result. When one of the lower values goes negative, it will effectively subtract one from the next component to the left. Adding offset later will cancel this. Consider a decimal example where the input values range from 0 to 10. The difference of any two values is in the range -10 to 10. If you add 100 to the result, it will always be in the range 90 to 110. Since that range has no negative values, it could be used in the low bits of a number without affecting the upper bits.

Technique: You can check whether a value is outside the range -n to (n - 1), where n is a power of two, by adding n and masking with ~(n * 2 - 1). For example, to find if a value is outside the range -16 to 15, add 16 and mask with ~31. The original range will be converted to 0 to 31. A value below zero will set all the bits above bit 4, and a value above 31 will set at least one bit above bit 4. This technique allows multiple range checks on values packed into a single integer.

Putting these together yields the solution. The function in pure form is

Code: Select all

abs( y1 - y2 ) > 0x30 || abs( u1 - u2 ) > 7 || abs( v1 - v2 ) > 6
The first step is to rescale the components so that the ranges are a power of two. So rescale Y by 0x3F / 0x30 and V by 7 / 6. Ideally you'd do the scaling to the Y and V values while they're still floating-point, otherwise you'll get some rounding errors. The thresholds are probably not absolutely critical, so these small errors should be acceptable.

Next, pack the Y U V values in the table with some padding between them. Y and V will have a slightly higher range, so give them each 11 bits, and U the remaining 10 bits.

Code: Select all

packed = (Y << 21) | (U << 11) | V;
That covers table initialization. The final function finds the difference between the two Y/U/V vectors and adds an offset to convert to offset binary. The offset also incorporates the value for the range check technique. The only step left is to mask off the upper bits of each component (except the top bit, which is normally set, rather than clear) and return this. If any of the retained bits are non-zero, one or more of the Y, U, or V component differences were beyond the threshold.

You'll note that the top bit of each component wasn't used. If they covered a larger range (or you wanted more accuracy), this might be necessary, in which case you'd include it in the final mask, then XOR with a value having the top bit set for each component.
byuu

Post by byuu »

Genius as always. Thanks a million!

Code: Select all

<current>
matches = 1070858994
misses  =    2882830

<always different>
matches = 1058421684
misses  =   15320140
Not bad. A ~0.27% margin of error. The differences are visible, but you really have to look to see them. It's worth it for the speed increase, and the differences aren't always bad. The formula is actually always kind of sketchy and this just kind of changes what gets wrongly detected as edges.

Comparison screenshots with your new diff function:

Image Image

Image Image

Can you tell which one is the original and which one uses the new diff function? "If you can't tell, why should I?" :P
Note that I anticipated that some people may think the obvious in that the left is the original picture (e.g. for before -> after order), but does that mean I swapped them or not? :)

Originally, for the first set of screenshots I was getting 50.5fps, with your diff function I pushed that to 54.5fps, with my own optimizations I pushed it up to 57.5fps.
On the second screenshot: 48.5fps original, 52.5fps + blargg's diff, 56.0fps + my optimizations.

Not bad at all. I get ~80fps with no filter whatsoever, and you have to remember that I'm transferring four times the video data to the card.

Source code is here: http://byuu.cinnamonpirate.com/hq2x/filter_hq2x.cpp

I'm not done improving it either. I think I can turn the c[] and w[] buffers into buffers that span the entire line so there isn't six needless copies per pixel for each buffer.
Post Reply