Optimal RGB Mix/Clamped Add/Sub


Deathlike2
ZSNES Developer
Posts: 6747
Joined: Tue Dec 28, 2004 6:47 am

Post by Deathlike2 »

The worst part is I can't tell a difference... :P :twisted:
blargg
Regular
Posts: 327
Joined: Thu Jun 30, 2005 1:54 pm
Location: USA

Post by blargg »

I found HQ2X rendering code and tried my optimized diff and it worked. I also optimized the calculation of the pattern index, which might give an even bigger win since it's used so often. Tomorrow I'll see if I can initialize the RGB to YUV table so that it produces the exact same results as the original filter (currently the thresholds are slightly different), and explain more fully how this works. Here are the relevant changes:

Code: Select all

unsigned const diff_offset = (0x440 << 21) | (0x207 << 11) | 0x407;
unsigned const diff_mask   = (0x380 << 21) | (0x1F0 << 11) | 0x3F0;

// you'd use a table in place of this, and do rescaling
// of y and v before you had reduced to 1/256 precision
unsigned to_yuv( unsigned rgb )
{
    int r = rgb >> 16 & 0xF8;
    int g = rgb >>  8 & 0xFC;
    int b = rgb >>  0 & 0xF8;

    int y = (r + g + b) >> 2;
    int u = ((r - b) >> 2) + 128;
    int v = ((g * 2 - r - b) >> 3) + 128;
    
    // these are the changes 
    y = y * 0x3F / 0x30;
    v = v * 7 / 6;
    return (y << 21) + (u << 11) + v;
}

// non-zero if pixels differ enough
unsigned diff( unsigned x, unsigned y )
{
    x = to_yuv( x );
    y = to_yuv( y );
    return (x - y + diff_offset) & diff_mask;
}

// calculation of pattern index inside blitter

// add the offset now instead of 8 times below
unsigned middle = to_yuv( w [5] ) + diff_offset;

int pattern;

// negation of the masked result sets bit 31 when pixels differ (diff_mask never reaches bit 31);
// a right shift by (31 - n) then moves that bit to position n
pattern  = -((middle - to_yuv( w [1] )) & diff_mask) >> (31 - 0);
pattern |= -((middle - to_yuv( w [2] )) & diff_mask) >> (31 - 1) & (1 << 1);
pattern |= -((middle - to_yuv( w [3] )) & diff_mask) >> (31 - 2) & (1 << 2);
pattern |= -((middle - to_yuv( w [4] )) & diff_mask) >> (31 - 3) & (1 << 3);
pattern |= -((middle - to_yuv( w [6] )) & diff_mask) >> (31 - 4) & (1 << 4);
pattern |= -((middle - to_yuv( w [7] )) & diff_mask) >> (31 - 5) & (1 << 5);
pattern |= -((middle - to_yuv( w [8] )) & diff_mask) >> (31 - 6) & (1 << 6);
pattern |= -((middle - to_yuv( w [9] )) & diff_mask) >> (31 - 7) & (1 << 7);

switch ( pattern )
...
byuu

Post by byuu »

The pattern thing doesn't quite work right (or maybe I'm doing something wrong, but I copied the code almost verbatim), however I was able to modify it with no speed loss.

Code: Select all

    int pattern;
    uint32 yx = rgbtoyuv[w[5]] + diff_offset;
      pattern  = ydiff(yx, w[1]);
      pattern |= ydiff(yx, w[2]) << 1;
      pattern |= ydiff(yx, w[3]) << 2;
      pattern |= ydiff(yx, w[4]) << 3;
      pattern |= ydiff(yx, w[6]) << 4;
      pattern |= ydiff(yx, w[7]) << 5;
      pattern |= ydiff(yx, w[8]) << 6;
      pattern |= ydiff(yx, w[9]) << 7;
This brings the framerate to 61 and 60, respectively. A 20% speed increase, given the emulation itself takes most of that speed, means we've probably more than doubled the speed of the original routine by now.

I was thinking, how about instead of having the w[] and c[] buffers like we currently do, we instead move to making w and c the size of the entire screen (512*480 each), and use indexes into each buffer (or a pointer to the buffer that slides along with the x/y increments)?

Ah, and here's my YCbCr code.

Code: Select all

double kr = 0.2126, kb = 0.0722, kg = (1.0 - kr - kb);
  y  = double(r) * kr + double(g) * kg + double(b) * kb;
  cb = 128.0 + (double(b) - y) / (2.0 - 2.0 * kb);
  cr = 128.0 + (double(r) - y) / (2.0 - 2.0 * kr);
Unfortunately, it doesn't work too well as a drop-in replacement. Probably need to recalibrate the tolerance thresholds since this is extremely different than his current model.
Or better yet, I need to figure out the evil that is the pattern table (I understand what it does, just not how he decided which pattern values should blend in which ways) and remake it to work better with the ycbcr table.

I almost want to think there's an analytical approach to building that pattern table (as opposed to guessing until it looks good) that can be automated somehow...
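For anyone who wants to try the YCbCr idea in a lookup table, here is a minimal sketch that wraps the three formulas above in a function. The struct, the function names and the rounding helper `to_byte` are assumptions for illustration and not part of byuu's code; only the coefficients and formulas come from the snippet.

```c
/* Result of one conversion; field names are illustrative. */
typedef struct { double y, cb, cr; } YCbCr;

/* Same weights and formulas as the snippet above (kr, kb, kg). */
static YCbCr rgb_to_ycbcr(int r, int g, int b)
{
    const double kr = 0.2126, kb = 0.0722, kg = 1.0 - kr - kb;
    YCbCr out;
    out.y  = r * kr + g * kg + b * kb;
    out.cb = 128.0 + (b - out.y) / (2.0 - 2.0 * kb);
    out.cr = 128.0 + (r - out.y) / (2.0 - 2.0 * kr);
    return out;
}

/* Hypothetical helper: round to nearest and clamp to 0..255 so the
   value fits a byte-wide table entry (inputs here are never negative). */
static int to_byte(double v)
{
    int n = (int)(v + 0.5);
    return n < 0 ? 0 : n > 255 ? 255 : n;
}
```

With 8-bit inputs, cb and cr can reach 255.5 for pure blue and pure red, which is why the clamp is there.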
sinamas
Gambatte Developer
Posts: 157
Joined: Fri Oct 21, 2005 4:03 pm
Location: Norway

Post by sinamas »

You may want to make the blend and diff functions static inlines, or inlines in an anonymous namespace, as some compilers may care (and to me it seems more appropriate).

I haven't looked closely at the code, but I think the w and c buffers may actually be helpful for cache locality. And, do you think it would be feasible to do most of the colorspace conversion in one bunch, before anything else is done? I think it might prevent cache misses, as those tables are quite large. Do the blend functions even need to require 32-bit input?
byuu

Post by byuu »

Colorspace conversion is done beforehand. I have lookup tables for the rgb15->30 and rgb15->yuv24.

The blend functions do not have to take 32-bit input, but so far I only have an algorithm for mixing two RGB555 values to 50% each.
I would need to be able to mix them in steps as fine as sixteenths (e.g. blend10 = (color1*14+color2+color3)/16), and I doubt that'd be easy.
Not sure what would be faster, a 10-entry converted color buffer you have to constantly move around, or a bunch of bit manipulation for each color blend.
sinamas
Gambatte Developer
Posts: 157
Joined: Fri Oct 21, 2005 4:03 pm
Location: Norway

Post by sinamas »

I'm well aware of the lookup tables. I meant that it may be beneficial to do all the lookups earlier, and use the data sequentially instead of converting along the way.

edit: I didn't realise you'd cut down the 15to32 lookups to only 3 per pixel. It would be simple to make blend functions that use 15bit input, but some of them would use 3-4 more instructions than the current ones.
blargg
Regular
Posts: 327
Joined: Thu Jun 30, 2005 1:54 pm
Location: USA

Post by blargg »

I've optimized byuu's filter code quite a bit. The base performance on my machine was 27 frames per second for a 256x223 source image. After optimizations, it ran at 50 frames per second. I had to make the blend functions outline since it otherwise took several minutes to compile the file with maximum optimization. I think that this actually helps since making them inline results in larger code, but perhaps PowerPC's much faster function invocation helped here (parameters are passed in registers and calling a leaf function doesn't require any stack access at all, since the link register holds the return address).

bsnes_filter_hq2x_opt.zip

I also wrote a normal modular version that compiles in C, if anyone's interested.

byuu, the reason you had to calculate the pattern index differently than how I did is that you were having ydiff return a bool, so the compiler was internally converting all non-zero values to 1. If you have it return an unsigned, you'll avoid this extra conversion and then can use my negate-shift method. On the other hand, the IA-32 might have a single instruction to convert any non-zero value to 1 that's faster than my portable negate-shift method, in which case your bool method is probably best.
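The two ways of turning a masked difference into pattern bit n can be put side by side. This is a minimal sketch, assuming 32-bit unsigned arithmetic and a masked value below 2^31 (true for the diff_mask posted earlier, whose highest bit is bit 30). Both function names are made up for illustration.

```c
#include <stdint.h>

/* Negate-shift: negating a non-zero value below 2^31 always sets bit 31,
   so shifting right by (31 - n) and masking leaves bit n set iff
   masked != 0. No comparison or branch is needed. */
static uint32_t flag_negate_shift(uint32_t masked, unsigned n)
{
    return (((uint32_t)0 - masked) >> (31 - n)) & (UINT32_C(1) << n);
}

/* Bool method: let the compiler convert non-zero to 1, then shift. */
static uint32_t flag_bool(uint32_t masked, unsigned n)
{
    return (uint32_t)(masked != 0) << n;
}
```

Both return the same value for any n from 0 to 7; which one is faster depends on the compiler and CPU, as discussed above.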

As for the RGB->YUV conversion, does the quality matter? It's only used to find the visual "distance" between two colors, so its quality isn't very critical. Of course the original version was optimized for use directly; if a lookup table had been used, the conversion might have been done a bit more accurately. But I wonder if the algorithm itself actually depends on some of the roughness of the fast algorithm.

I eliminated the W and C arrays, since having to build/shift them every pixel was a big speed hit. The main purpose of them was to handle special cases at the edges; instead of that, I just require that the input buffer have the edge pixels doubled to form an extra border around the outside. I also require that the input buffer be a fixed internal width, allowing all array indices to be constants (so previous line is in [-width]).
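The doubled-edge requirement could be met with a small copy step before filtering. This is only a sketch under assumptions: 16-bit pixels, a caller-provided destination of (w + 2) * (h + 2) pixels, and a hypothetical function name `pad_edges`. It is not the code from the zip.

```c
#include <stdint.h>
#include <string.h>

/* Copy a w*h image into dst, duplicating the outermost pixels to form
   a one-pixel border. dst must hold (w + 2) * (h + 2) pixels. */
static void pad_edges(const uint16_t *src, int w, int h, uint16_t *dst)
{
    const int pw = w + 2;                      /* padded row width */
    for (int y = 0; y < h; y++) {
        uint16_t *row = dst + (y + 1) * pw;
        memcpy(row + 1, src + y * w, (size_t)w * sizeof *src);
        row[0]     = row[1];                   /* left border  */
        row[w + 1] = row[w];                   /* right border */
    }
    /* top and bottom borders are copies of the first and last padded rows */
    memcpy(dst, dst + pw, (size_t)pw * sizeof *dst);
    memcpy(dst + (h + 1) * pw, dst + h * pw, (size_t)pw * sizeof *dst);
}
```

With the border in place, a pointer p to any interior pixel can read p[-pw - 1] through p[pw + 1] without bounds checks, which is what makes the constant array indices possible.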

The blend functions are now outline and take the raw 15-bit RGB values and unpack them internally as necessary, eliminating the c array. They are also static functions, since they don't need to access the YUV table. This eliminates the unnecessary "this" parameter. I profiled the relative frequency of the blend functions and blend2 was used much more often, so that's the only one I focused optimization on. I was able to optimize it down to two 15-bit RGB mixes, with no unpacking required.

I considered keeping the recent YUV lookups in a local array, but decided that it probably wouldn't help much due to the complexity it would add. The calculation of each bit of the pattern index can be done in parallel, so a compiler can schedule the loads ahead of when they're needed.

The following should help clear up how the thresholds are efficiently checked. Say we have X and Y and want to know if Y is outside the range X - 0x0F to X + 0x10. If we add 0x10 to X, the range for Y becomes X - 0x1F to X (relative to the adjusted X). Subtracting Y from this adjusted X will yield a negative value if Y is above the range, and a value of 0x20 or greater if Y is below the range. Adding 0x200 to this moves the result to 0x1FF or less and 0x220 or more, respectively, while in-range values land in 0x200 to 0x21F, so masking with 0x1E0 gives zero only when Y is in range. The two adjustments (0x10 and 0x200) can be combined since X + 0x10 - Y + 0x200 is the same as X + 0x210 - Y. The added 0x200 keeps the result always positive, allowing operation on multiple packed values without the lower ones interfering with the upper ones. Here are some examples at the boundaries:

Code: Select all

 X     Adjust   Y     Raw     Mask    Result
----------------------------------------------
0x40 + 0x210 - 0x30 = 0x220 & 0x1E0 = non-zero
0x40 + 0x210 - 0x31 = 0x21F & 0x1E0 = zero
0x40 + 0x210 - 0x50 = 0x200 & 0x1E0 = zero
0x40 + 0x210 - 0x51 = 0x1FF & 0x1E0 = non-zero
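The boundary rows in the table can be reproduced with a one-line function. This is a minimal single-channel sketch, assuming 8-bit X and Y; the function names are illustrative, and the branching version is included only as a reference to compare against.

```c
/* Non-zero iff y lies outside x - 0x0F .. x + 0x10, for x and y in 0..255.
   The 0x10 recenters the range and the 0x200 keeps the difference positive. */
static unsigned out_of_range(unsigned x, unsigned y)
{
    return (x + 0x210 - y) & 0x1E0;
}

/* Straightforward reference with two comparisons. */
static int out_of_range_ref(int x, int y)
{
    return y < x - 0x0F || y > x + 0x10;
}
```

For 8-bit inputs the raw value stays between 0x111 and 0x30F, so the only way for bits 5 to 8 to all be clear is for it to fall in 0x200 to 0x21F.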
sinamas
Gambatte Developer
Posts: 157
Joined: Fri Oct 21, 2005 4:03 pm
Location: Norway

Post by sinamas »

Spewing out some 15bit blend code I threw together before blargg's update... I didn't use much time on optimization but I'm posting it in case it's useful.

Code: Select all

static inline unsigned blend1(unsigned c1, unsigned c2) {
	const unsigned lowbits=(((c1<<1)&0x0842)+(c1&0x0C63)+(c2&0x0C63))&0x0C63;
	return ((c1*3+c2) - lowbits) >> 2;
}

static inline unsigned blend2(unsigned c1, unsigned c2, unsigned c3) {
	c1<<=1;
	const unsigned lowbits=((c1&0x0842)+(c2&0x0C63)+(c3&0x0C63))&0x0C63;
	return ((c1+c2+c3) - lowbits) >> 2;
}

static inline unsigned blend5(unsigned c1, unsigned c2) {
	return ( c1+c2 - ((c1^c2)&0x421) ) >> 1;
}

static inline unsigned blend6(unsigned c1, unsigned c2, unsigned c3) {
	c2<<=1;
	const unsigned lowbits=( ((c1<<2)&0x1084)+(c1&0x1CE7)+(c2&0x18C6)+(c3&0x1CE7) ) & 0x1CE7;
	return ((c1*5+c2+c3) - lowbits) >> 3;
}

static inline unsigned blend7(unsigned c1, unsigned c2, unsigned c3) {
	const unsigned lowbits=(((((c1<<1)&0x0842)+(c1&0x0C63))<<1)+(c2&0x1CE7)+(c3&0x1CE7))&0x1CE7;
	return ((c1*6+c2+c3) - lowbits) >> 3;
}

static inline unsigned blend9(unsigned c1, unsigned c2, unsigned c3) {
	c1<<=1;
	const unsigned rb=(c1&0xF83E)+((c2&0x7C1F)+(c3&0x7C1F))*3;
	const unsigned g=(c1&0x07C0)+((c2&0x03E0)+(c3&0x03E0))*3;
	return ((rb&0x3E0F8)|(g&0x01F00))>>3;
}

static inline unsigned blend10(unsigned c1, unsigned c2, unsigned c3) {
	const unsigned rb=(c1&0x7C1F)*14+(c2&0x7C1F)+(c3&0x7C1F);
	const unsigned g=(c1&0x03E0)*14+(c2&0x03E0)+(c3&0x03E0);
	return ((rb&0x7C1F0)|(g&0x03E00))>>4;
}
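One way to gain confidence in the packed blends is to compare them with a slow per-channel reference. The sketch below copies blend1 and blend5 from above and checks them against `mix555_ref`, a hypothetical helper that unpacks each 5-bit channel of an RGB555 pixel; the sampling strides in the loop are arbitrary.

```c
/* Per-channel reference: (a*w1 + b*w2) >> shift for each 5-bit channel. */
static unsigned mix555_ref(unsigned c1, unsigned w1,
                           unsigned c2, unsigned w2, unsigned shift)
{
    unsigned out = 0;
    for (unsigned s = 0; s < 15; s += 5) {
        unsigned a = c1 >> s & 0x1F, b = c2 >> s & 0x1F;
        out |= ((a * w1 + b * w2) >> shift) << s;
    }
    return out;
}

/* (c1*3 + c2) / 4, copied from the post above */
static unsigned blend1(unsigned c1, unsigned c2) {
	const unsigned lowbits=(((c1<<1)&0x0842)+(c1&0x0C63)+(c2&0x0C63))&0x0C63;
	return ((c1*3+c2) - lowbits) >> 2;
}

/* (c1 + c2) / 2, copied from the post above */
static unsigned blend5(unsigned c1, unsigned c2) {
	return ( c1+c2 - ((c1^c2)&0x421) ) >> 1;
}

/* Returns 1 if the packed blends agree with the reference on a sample
   of pixel pairs, 0 at the first mismatch. */
static int blends_match_reference(void)
{
    for (unsigned c1 = 0; c1 < 0x8000; c1 += 37)
        for (unsigned c2 = 0; c2 < 0x8000; c2 += 41) {
            if (blend1(c1, c2) != mix555_ref(c1, 3, c2, 1, 2)) return 0;
            if (blend5(c1, c2) != mix555_ref(c1, 1, c2, 1, 1)) return 0;
        }
    return 1;
}
```

The trick in both cases is that "lowbits" holds each channel's sum modulo the divisor, so subtracting it first means the final shift cannot drag bits from one channel into the next.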
blargg
Regular
Posts: 327
Joined: Thu Jun 30, 2005 1:54 pm
Location: USA

Post by blargg »

Coolness, you've generalized the RGB mix to do the fixup for multiple low bits! It seems this takes one extra operation for the blend2 case (11 versus 10 operations), but it won't result in as many pipeline stalls, since more things can be calculated in parallel. It helps a bit on my machine (57 versus 60 with your RGB mixing code for blend2 only).

I was browsing the hq2x author's website and found the filter quite impressive for the example images. I looked at the optimized assembly versions and have a feeling that my optimized C version will outperform it at this point. :)
sinamas
Gambatte Developer
Posts: 157
Joined: Fri Oct 21, 2005 4:03 pm
Location: Norway

Post by sinamas »

Haha! I guess you should mail him once you feel done with this.

I noticed that inline functions were slower on my p4 northwood too, by more than 30%. Then I added back the w[] buffer; this gave a performance loss of ~3% when using outline functions. With inline functions, however, it became 70% faster than with no buffer.

outline, no buffer: ~230 fps
inline, no buffer: ~170 fps
outline, with w[] buffer: ~224 fps
inline, with w[] buffer: ~294 fps

weird huh? blame gcc?

edit: I redid the test with g++ 3.4 rather than 4.0:

outline, no buffer: ~285 fps
inline, no buffer: ~285 fps
outline, with w[] buffer: ~253 fps
inline, with w[] buffer: ~378 fps
byuu

Post by byuu »

blargg, with your code I get 65 and 64fps respectively, up from 61 and 60 with my code. Scale2x, completely unoptimized, is still pushing 78 and 76fps. So this is definitely an intensive algorithm :)
I can use the tricks in HQ2x on Scale2x, so I can probably push that up to 80-85 fps.

I can't currently meet your requirement of the width and height being 2 pixels wider, though, so currently it's just reading outside of the buffered area. I know, not good.
Also, the in_width thing might be a problem. I currently use a trick for interlaced and hires video.
Basically, I make the height always 480, but double the pitch when interlace is disabled. The width is always 512, and I only write the first half of the scanline for non-hires modes. So if you go from hires to lores, pixels 256-511 will be from the hires mode. I hate adding hackery code in there for things like clearing the edge pixels when hires is off and it was on for the last frame :/
I'm only planning on sizing all video up to 512x448 anyway. No idea what I'll do for 256x448 and 512x224 yet. I need it to basically blend in only one direction, and the filter isn't really designed with that in mind.
You also forgot to set result_width and result_height for the filter :P

When I add sinamas' blend1 function, I lose 5fps over the unpack/pack code by blargg. Makes me wonder if blend2 would benefit from using the pack/unpack macros, at least on this box...
sinamas
Gambatte Developer
Posts: 157
Joined: Fri Oct 21, 2005 4:03 pm
Location: Norway

Post by sinamas »

You could try changing blend1 to this, but I doubt it'll make any difference.

Code: Select all

static inline unsigned blend1(unsigned c1, unsigned c2) {
	const unsigned tmp=c1<<1;
	const unsigned lowbits=((tmp&0x0842)+(c1&0x0C63)+(c2&0x0C63))&0x0C63;
	return ((tmp+c1+c2) - lowbits) >> 2;
}
byuu

Post by byuu »

On the subject of micro optimizations, I came up with another minor one today.

So for each pixel, the difference is calculated eight times. Each calculation is one memory read, two subtractions (if you count the negation as one, I don't know how that would compile), a mask, an or, and a bitshift.

So, ascii art lameness aside, take the following table. The X's are meant to represent links in both diagonal directions:

Code: Select all

A-B-C-D
|X|X|X|
E-F-G-H
|X|X|X|
I-J-K-L
|X|X|X|
M-N-O-P

123
4.5
678
Now let's say we're calculating pattern for F.
A's result for 8 is the same as F's result for 1. B's 7 matches F's 2, and E's 5 matches F's 4. Now, given we can only pull one previously calculated result from each previous pattern, and even then we can only get one result from the previous pattern with no memory fetches, there's still the point that we're calculating the pattern result bits with a repetition of 37.5%.
It would probably thrash the cache, but we could maybe keep a buffer of (width x height) to hold the results of each pattern for each pixel, and then for three of each eight pixels, we could replace the calculation with a memory load, bit mask, and shift. Which, if it doesn't ruin the cache, would be faster than recalculating the pixel.

Otherwise, we could keep going left to right, top to bottom, and change the pattern calculation for just copying E's bit over. e.g.

Code: Select all

// option 1: reuse only E's bit (the previous pixel's result is still in pattern)
pattern  = (pattern & 0x10) >> 1;
pattern |= -((middle - to_yuv( w [1] )) & diff_mask) >> (31 - 0);
pattern |= -((middle - to_yuv( w [2] )) & diff_mask) >> (31 - 1) & (1 << 1);
pattern |= -((middle - to_yuv( w [3] )) & diff_mask) >> (31 - 2) & (1 << 2);
//pattern |= -((middle - to_yuv( w [4] )) & diff_mask) >> (31 - 3) & (1 << 3);
pattern |= -((middle - to_yuv( w [6] )) & diff_mask) >> (31 - 4) & (1 << 4);
pattern |= -((middle - to_yuv( w [7] )) & diff_mask) >> (31 - 5) & (1 << 5);
pattern |= -((middle - to_yuv( w [8] )) & diff_mask) >> (31 - 6) & (1 << 6);
pattern |= -((middle - to_yuv( w [9] )) & diff_mask) >> (31 - 7) & (1 << 7);

// option 2: also reuse A's and B's bits from a stored pattern table
pattern  = (pattern & 0x10) >> 1;
//omit -width as well?
pattern |= ptable[-width-1] >> 7; //no need to mask
pattern |= (ptable[-width] & 0x40) >> 5;
//pattern |= -((middle - to_yuv( w [1] )) & diff_mask) >> (31 - 0);
//pattern |= -((middle - to_yuv( w [2] )) & diff_mask) >> (31 - 1) & (1 << 1);
pattern |= -((middle - to_yuv( w [3] )) & diff_mask) >> (31 - 2) & (1 << 2);
//pattern |= -((middle - to_yuv( w [4] )) & diff_mask) >> (31 - 3) & (1 << 3);
pattern |= -((middle - to_yuv( w [6] )) & diff_mask) >> (31 - 4) & (1 << 4);
pattern |= -((middle - to_yuv( w [7] )) & diff_mask) >> (31 - 5) & (1 << 5);
pattern |= -((middle - to_yuv( w [8] )) & diff_mask) >> (31 - 6) & (1 << 6);
pattern |= -((middle - to_yuv( w [9] )) & diff_mask) >> (31 - 7) & (1 << 7);
So, which do you think would be faster? Obviously the edges would have to be handled; I say just add a quick setup to the start of each for(x=...) loop or whatever.
I'm planning to render the HQ2x screen by ignoring a 1x1 pixel border around the entire image, to simplify things and not require an oddly-sized internal buffer with black pixels around the edges being mandatory.

EDIT: actually, the pattern table would probably only need to be the width of one scanline (maybe two?). Just write the new result for each pixel from the current line on top of the old result for the previous line.
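The one-scanline idea can be tried in isolation. The sketch below is an illustration under stated assumptions rather than the filter's code: `differ` is a symmetric stand-in for the YUV threshold test (reuse only gives identical results if differ(a, b) equals differ(b, a), which may not hold exactly at the threshold boundaries of the packed test), the image is a small int array, and all names are hypothetical. It shows that one scanline plus a single saved value for the upper-left neighbor is enough.

```c
#include <stdlib.h>

enum { W = 16, H = 12 };

/* Symmetric stand-in for the real threshold test (an assumption). */
static int differ(int a, int b) { return abs(a - b) > 3; }

/* Neighbor offsets for pattern bits 0..7 (neighbors 1..8 in the diagram). */
static const int DX[8] = { -1, 0, 1, -1, 1, -1, 0, 1 };
static const int DY[8] = { -1, -1, -1, 0, 0, 1, 1, 1 };

/* Compute only the pattern bits selected in 'which'. */
static unsigned pattern_bits(int img[H][W], int x, int y, unsigned which)
{
    unsigned p = 0;
    for (int n = 0; n < 8; n++)
        if (which >> n & 1)
            p |= (unsigned)differ(img[y][x], img[y + DY[n]][x + DX[n]]) << n;
    return p;
}

/* Returns 1 if reusing A's, B's and E's bits reproduces the full pattern. */
static int reuse_matches_full(int img[H][W])
{
    unsigned ptable[W] = { 0 };               /* one scanline of patterns */
    for (int y = 1; y < H - 1; y++) {
        unsigned above_left = 0, left = 0;
        for (int x = 1; x < W - 1; x++) {
            unsigned above = ptable[x];       /* still holds pixel (x, y-1) */
            unsigned p;
            if (x == 1 || y == 1) {
                p = pattern_bits(img, x, y, 0xFF);   /* nothing cached yet */
            } else {
                p  = above_left >> 7;                /* A's bit 7 -> bit 0 */
                p |= (above & 0x40) >> 5;            /* B's bit 6 -> bit 1 */
                p |= (left & 0x10) >> 1;             /* E's bit 4 -> bit 3 */
                p |= pattern_bits(img, x, y, 0xF4);  /* bits 2 and 4..7   */
            }
            if (p != pattern_bits(img, x, y, 0xFF)) return 0;
            ptable[x] = p;        /* overwrite the previous line's entry   */
            above_left = above;   /* keep it one more step for the next x  */
            left = p;
        }
    }
    return 1;
}

/* Fill a test image with small pseudo-random values and run the check. */
static int demo_reuse(void)
{
    int img[H][W];
    unsigned seed = 1;
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++) {
            seed = seed * 1103515245u + 12345u;
            img[y][x] = (int)(seed >> 16 & 15);
        }
    return reuse_matches_full(img);
}
```

The extra `above_left` variable is needed because ptable[x-1] has already been overwritten with the current line's result by the time pixel x is processed.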
grinvader
ZSNES Shake Shake Prinny
Posts: 5632
Joined: Wed Jul 28, 2004 4:15 pm
Location: PAL50, dood !

Post by grinvader »

Yes, I already thought about that kind of system for a 3x filter of my own. I haven't gotten very far and it probably won't look as good as hq*x, but using a 32bpp buffer - 8 bits for R, G, B - the 8 remaining bits store a bool used for filtering in each of the directions.
When I calculate that bool I assign it to the current pixel and its counterpart if any, and it's simple enough not to recalculate anything with some position tests.

When I see computers with 2GB ram nowadays, you can just make a buffer for the colours and another one for the differences. -_-
Leave that as an option for the user (cache hq diffs yes/no), and if the system doesn't have enough memory just force recalculation.
byuu

Post by byuu »

It's more a question of what's faster than if there's enough RAM for this.
That said, just sliding the result for W5 over to W4 for each pixel gives a 1fps boost to 66fps on this PC, so it's probably more significant when you consider the emulation overhead.
I'm thinking adding W1/W2 copying would just end up adding enough overhead to cancel out the speed gain, and make things more complicated as well.
DMV27
New Member
Posts: 9
Joined: Thu Jan 27, 2005 5:03 pm

Post by DMV27 »

This code can be used to average two colors. It does not need a carry bit, so it can be used with RGB565.

Code: Select all

avg = (x & y) + (((x ^ y) >> 1) & 0x7F7F7F7F);
This code is exactly the same as the MMX op "paddusb". It can also be adjusted to work with field sizes other than 8-bits.

Code: Select all

ls = (x & 0x7F7F7F7F) + (y & 0x7F7F7F7F);
hs = (x ^ y) & 0x80808080;
hc = (x & y) & 0x80808080;
s = ls ^ hs;
c = (ls & hs) | hc;
mask = ((c >> 7) + 0x7F7F7F7F) ^ 0x7F7F7F7F;
dest = s | mask;
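Both snippets can be wrapped in functions for experimenting. This is a minimal sketch, assuming four 8-bit fields packed into a 32-bit word; the function names are made up here.

```c
#include <stdint.h>

/* Per-byte average, rounding down: a + b == 2*(a & b) + (a ^ b). */
static uint32_t avg4x8(uint32_t x, uint32_t y)
{
    return (x & y) + (((x ^ y) >> 1) & 0x7F7F7F7Fu);
}

/* Per-byte unsigned saturating add, following the steps posted above. */
static uint32_t addsat4x8(uint32_t x, uint32_t y)
{
    uint32_t ls = (x & 0x7F7F7F7Fu) + (y & 0x7F7F7F7Fu); /* low 7 bits + carry into bit 7 */
    uint32_t hs = (x ^ y) & 0x80808080u;                 /* sum of the top bits           */
    uint32_t hc = (x & y) & 0x80808080u;                 /* carry from both top bits set  */
    uint32_t s  = ls ^ hs;                               /* wrapped per-byte sum          */
    uint32_t c  = (ls & hs) | hc;                        /* carry out of each byte        */
    uint32_t mask = ((c >> 7) + 0x7F7F7F7Fu) ^ 0x7F7F7F7Fu; /* 0xFF where a byte overflowed */
    return s | mask;
}
```

For other field widths the masks would change (for example, the top bit of each field instead of 0x80808080), but the structure stays the same.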
blargg
Regular
Posts: 327
Joined: Thu Jun 30, 2005 1:54 pm
Location: USA

Post by blargg »

Code: Select all

avg = (x & y) + (((x ^ y) >> 1) & 0x7F7F7F7F);
Nifty. X & Y yields the carries from each bit and X ^ Y yields the individual sums of each bit (without carries).
DMV27 wrote:This code is exactly the same as the MMX op "paddusb". It can also be adjusted to work with field sizes other than 8-bits.
The mask can be generated without the large constants, which might generate less code on some machines and allow more parallelism (the two shifts could be done simultaneously). On machines without three-operand instructions (a = b OP c) it might be worse though.

Code: Select all

mask = (c << 1) - (c >> 7);
A while back I came up with a version similar to this using two fewer operations, but yours allows much more parallelism.
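The two ways of building the saturation mask can be checked against each other. A minimal sketch, assuming c only ever has bit 7 of each byte set (as in the saturating add above) and 32-bit wraparound arithmetic; the function names are illustrative.

```c
#include <stdint.h>

/* Mask from the original code: 0xFF in every byte whose carry bit is set. */
static uint32_t mask_const(uint32_t c)
{
    return ((c >> 7) + 0x7F7F7F7Fu) ^ 0x7F7F7F7Fu;
}

/* Variant without the large constants: per byte, 0x100 - 0x01 == 0xFF.
   For the top byte the 0x100 falls off the end, and the subtraction
   wraps modulo 2^32 to give the same result. */
static uint32_t mask_shift(uint32_t c)
{
    return (c << 1) - (c >> 7);
}

/* Returns 1 if both agree for all 16 combinations of carry bits. */
static int masks_agree(void)
{
    for (unsigned bits = 0; bits < 16; bits++) {
        uint32_t c = 0;
        for (unsigned i = 0; i < 4; i++)
            if (bits >> i & 1)
                c |= UINT32_C(0x80) << (8 * i);
        if (mask_const(c) != mask_shift(c))
            return 0;
    }
    return 1;
}
```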
MaxSt
ZSNES Developer
Posts: 113
Joined: Wed Jul 28, 2004 7:07 am
Location: USA

Post by MaxSt »

blargg wrote:I was browsing the hq2x author's website and found the filter quite impressive for the example images. I looked at the optimized assembly versions and have a feeling that my optimized C version will outperform it at this point. :)
I don't think so. :)

If you're serious about optimizations, you have to switch to MMX at some point.

MaxSt.
blargg
Regular
Posts: 327
Joined: Thu Jun 30, 2005 1:54 pm
Location: USA

Post by blargg »

How fast was my optimized C version compared to your assembly version? I figured the elimination of all the branches for the threshold calculation would make a big difference.
pSXAuthor
New Member
Posts: 7
Joined: Tue Jan 31, 2006 6:15 pm

Post by pSXAuthor »

DMV27 wrote:This code is exactly the same as the MMX op "paddusb". It can also be adjusted to work with field sizes other than 8-bits.

Code: Select all

ls = (x & 0x7F7F7F7F) + (y & 0x7F7F7F7F);
hs = (x ^ y) & 0x80808080;
hc = (x & y) & 0x80808080;
s = ls ^ hs;
c = (ls & hs) | hc;
mask = ((c >> 7) + 0x7F7F7F7F) ^ 0x7F7F7F7F;
dest = s | mask;
Here is a faster one that can be used when adding 24bit values (note: upper 8bits=0):

Code: Select all

dest=(x&0xfefeff)+(y&0xfefeff);
tmp=dest&0x1010100;
tmp=tmp-(tmp>>8);
dest|=tmp;
As with the code above, this can be adapted for 555 or 565.
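Wrapped in a function, the shorter version looks like the sketch below. As far as I can tell from working through it, the speed comes at a small cost: the lowest bit of the upper two channels is masked off before the add, so those channels can be off by a small amount, and bit 24 can be left set when the top channel overflows. The function name is hypothetical.

```c
#include <stdint.h>

/* Approximate saturating add of two 24-bit pixels (upper 8 bits zero).
   Clearing bits 8 and 16 leaves room for the carries out of the two
   lower channels; the carry out of the top channel lands in bit 24. */
static uint32_t addsat24_approx(uint32_t x, uint32_t y)
{
    uint32_t dest = (x & 0xFEFEFFu) + (y & 0xFEFEFFu);
    uint32_t tmp  = dest & 0x1010100u;   /* one carry bit per channel      */
    tmp -= tmp >> 8;                     /* each carry becomes 0xFF below it */
    return dest | tmp;
}
```

The lowest channel is exact. A caller that needs a clean 24-bit result would presumably mask the return value with 0xFFFFFF.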
MaxSt
ZSNES Developer
Posts: 113
Joined: Wed Jul 28, 2004 7:07 am
Location: USA
Contact:

Post by MaxSt »

I never really tested the C version for performance. I provided it only as a reference, to better understand the algorithm. Never expected it to be used "as is" in emulators. I only recommend MMX code.

After all, that's what MMX was designed for in the first place... So you have the freedom to do all kinds of calculations with color channels in 16 bit precision, everything in parallel.

Just look how short it is:

Code: Select all

%macro TestDiff 2
    xor     ecx,ecx
    mov     edx,[%1]
    cmp     edx,[%2]
    je      %%fin
    mov     ecx,_RGBtoYUV
    movd    mm1,[ecx+edx*4]
    movq    mm5,mm1
    mov     edx,[%2]
    movd    mm2,[ecx+edx*4]
    psubusb mm1,mm2
    psubusb mm2,mm5
    por     mm1,mm2
    psubusb mm1,[threshold]
    movd    ecx,mm1
%%fin:
%endmacro
byuu

Post by byuu »

Yeah, MMX/SSE/2 is good. I'm just not big on using processor-specific code, though. My code compiles on PowerPC, and blargg is actually using a G(3?) himself.

Would you be interested in hosting the optimized C version? I'll modify my current version back to a standard-C library if you want. Would probably be good for OS X emulators and such.
blargg
Regular
Posts: 327
Joined: Thu Jun 30, 2005 1:54 pm
Location: USA

Post by blargg »

I hadn't noticed that the MMX threshold function was so short; maybe I was looking at some other code. Still, I wonder if the C code "(x_yuv - y_yuv + offset) & mask" beats this, since it's also pretty damn simple. And you could use this C version as a basis for an SIMD version that calculated two or four thresholds simultaneously.

If the MMX version is many times faster, then the same would likely be the case for an AltiVec version for PowerPC. It probably wouldn't be written in assembly, but it'd still use proprietary intrinsics from C so it wouldn't be portable (assembly gives very little advantage on modern RISC processors). At some point I'm going to have a Mac with a G4 processor, so I could try writing an AltiVec version of the algorithm.

byuu, any reason you aren't using MaxSt's MMX-optimized version for your x86 builds?
byuu

Post by byuu »

blargg wrote:Still, I wonder if the C code "(x_yuv - y_yuv + offset) & mask" beats this, since it's also pretty damn simple.
Definitely, especially since we removed the offset addition from eight of (at most) ten diff checks anyway, it's only two bitwise operations and one memory fetch per diff. Unless Intel makes an opcode to do the entire diff at once, I doubt your method can be beaten.
MaxSt's is way more flexible, though. He has finer precision and adjusting the y/u/v tolerance levels is trivial. I would actually prefer his method if it could be done in C.
blargg wrote:byuu, any reason you aren't using MaxSt's MMX-optimized version for your x86 builds?
1) His MMX code is GPL, which I cannot use, and his C code is LGPL, which I hopefully can.
2) I have absolutely zero platform-specific code in bsnes, and I'd like to keep it that way. I'm aware I can use both an asm-optimized version and a c-version and allow compilation anywhere. I just honestly don't want assembly code in there, period. And yes, I'm very fluent in x86 assembler, so it isn't a matter of not understanding it. I actually did use x86 for my video blitter in the past, but decided to remove it. I make illogical, baseless decisions like this all the time ;)
MaxSt
ZSNES Developer
Posts: 113
Joined: Wed Jul 28, 2004 7:07 am
Location: USA

Post by MaxSt »

byuu wrote:Yeah, MMX/SSE/2 is good. I'm just not big on using processor-specific code, though. My code compiles on PowerPC, and blargg is actually using a G(3?) himself.
Sometimes processor-specific code is a very good thing.