Progress on reducing HQ2x to fit in L1 cache

Strictly for discussing ZSNES development and for submitting code. You can also join us on IRC at irc.libera.chat in #zsnes.
Please, no requests here.

Moderator: ZSNES Mods

WolfWings
Hazed
Posts: 52
Joined: Wed Nov 02, 2005 1:31 pm

Post by WolfWings »

mozz wrote:One good thing about this mapping is I didn't need to manually fix up any entries. If you want to get rid of the manual fixups for Y, I suggest maybe giving Y*21/8 a try instead of Y*8/3.
I wasn't able to fully combine your mapping with the 128-entry (instead of 192-entry) tinytable trick I figured out.

It did allow me to remove the adjustYUV tables though, so the math is simpler.

I'd still say just store the tinytables directly in the code though, unless we want to allow people to change the thresholds, in which case I'll have to integrate my automated adjusting code to match the 'perfect' comparison approach.
[img]http://wolfwings.us/sigs/WolfWings.jpg[/img]
mozz
Hazed
Posts: 56
Joined: Mon Oct 10, 2005 3:12 pm
Location: Montreal, QC

How to scale YUV components to fit power-of-two masks

Post by mozz »

There was a bug in the test program I used to check the scaling factors in my previous post... I'm no longer convinced that they work in all situations! Test thoroughly.

However, I had more time to think about what we are really trying to achieve with the scaling, and I think I can now explain it clearly in terms of fixed-point numbers. I haven't had time to test any of this yet, so if you spot any bugs lemme know.

Scaling to a convenient range

First, pretend we have real numbers of infinite precision to work with. Suppose we scale the values of each component (Y, U or V) so that original values which are "inside" the threshold for that component (i.e. |y| <= trY ) map to values in the [-0.5..+0.5) range, and original values which are "outside" the threshold (i.e. |y| > trY ) map to values outside the [-0.5..+0.5) range. (Note that this range is inclusive at the bottom only. This will be important later because we'll be trying to test an integer range like [0..64) or something, with a bitmask like 63.)

So any scaling factor K such that (0.5/(trY+1)) < K < (0.5/trY) is OK at this point. It must be strictly within that range because using either boundary value would cause 0.5 and -0.5 to be included in the set of scaled values. -0.5 is inside our range and 0.5 is outside! We don't want that.

Now suppose you add a positive bias of 0.5 to every result value. Values "inside" the threshold are now in the [0..+1) range. Values "outside" the threshold are either negative, or greater than 1 (note that it can't be exactly 1 because that would require K to be equal to (0.5/(trY+1)) and we chose K to be larger than that).
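
For a quick sanity check of that argument, here's a throwaway test in doubles (trY = 48 is just a made-up example threshold, and K is taken at the midpoint of the allowed range):

Code: Select all

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const int trY = 48;                 /* hypothetical threshold */
    /* any K strictly between 0.5/(trY+1) and 0.5/trY; take the midpoint */
    const double K = 0.5 * (0.5 / (trY + 1) + 0.5 / trY);
    for (int y = -255; y <= 255; y++) {
        double scaled = K * y + 0.5;    /* scale, then add the +0.5 bias */
        int inside   = (scaled >= 0.0 && scaled < 1.0); /* the [0..1) test */
        int expected = (abs(y) <= trY); /* what the threshold says */
        if (inside != expected)
            printf("mismatch at y=%d\n", y);
    }
    return 0;
}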

Getting more practical

Now in reality, we don't have unlimited precision---we are going to have to represent these values as bitfields, and we need to fit 3 of them in a 32-bit word. So now what do we do?

Suppose we represent the values of one component (let's use Y for example) in an N+M bit fixed-point number, where N is the number of bits in the whole part, and M is the number of bits in the fractional part. (Ignore how we choose N and M for now). In order to test if a value is in the [0..+1) range, we simply AND it with a bitmask in which the whole bits are set, and the fractional bits are clear!
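
In code, that test is just one AND against a constant. A minimal sketch, with hypothetical sizes N=4 and M=6:

Code: Select all

/* hypothetical field sizes: N=4 whole bits, M=6 fraction bits */
#define M_BITS 6
#define N_BITS 4
#define WHOLE_MASK (((1u << N_BITS) - 1) << M_BITS) /* whole bits set, fraction clear */

/* a fixed-point value is in [0..+1) exactly when its whole part is zero */
static int inside_threshold(unsigned v) {
    return (v & WHOLE_MASK) == 0;
}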

When we round each component up or down to make it fit in N+M bits, we potentially introduce some error: |e|<=0.5/2^M. Since we need to do 3 table lookups and add together three values for the component, the cumulative error can actually be 3 times greater than that. We require enough precision in the fraction field so that the cumulative error can not affect the result. Specifically, the error must not cause a carry or borrow in the fractional part which changes the whole part from 0 to nonzero, or vice versa (because that would change the result of a threshold test on this component).

Choosing a scaling factor and bit size

At this point it should be clear how we want to choose the scaling factor K. Consider that K*(trY) < 0.5 < K*(trY+1); if cumulative error pushed a scaled value across that 0.5 boundary, it would screw up the final result. So we want to choose K such that those two values are as far as possible from 0.5, to make the "safety margin" (i.e. the magnitude of cumulative error that it would take to corrupt the result) as big as possible. In other words, we would like to split the difference with K = (0.5/(trY+0.5)) so that K*(trY+0.5) = 0.5. (I haven't actually tried this, so maybe (0.5/(trY+0.6)) or (0.5/(trY+0.4)) would work better, but this is a reasonable starting point at least).
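
To make the midpoint choice concrete, a tiny sketch (again with a made-up trY) showing that the two margins come out equal:

Code: Select all

#include <stdio.h>

int main(void) {
    const double trY = 48.0;                /* hypothetical threshold */
    const double K = 0.5 / (trY + 0.5);     /* so K*(trY+0.5) == 0.5 exactly */
    double inside  = K * trY;               /* largest "inside" scaled value */
    double outside = K * (trY + 1.0);       /* smallest "outside" scaled value */
    /* both margins equal 0.25/(trY+0.5), the best we can do symmetrically */
    printf("margin below 0.5: %f, above 0.5: %f\n", 0.5 - inside, outside - 0.5);
    return 0;
}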

Okay, now how many bits (N) do we need for the whole part of a given component (Y for example)? Well, find out the maximum value (oldY1-oldY2) can take in the original algorithm and call that maxOldY. N needs to be large enough to accommodate the largest positive scaled Y value, which is floor((K*maxOldY)+0.5). It also needs to be able to accommodate the largest negative scaled Y value, which is floor((-1*K*maxOldY)+0.5). However, it's OK if the negative values wrap around into the range for positive values (we don't care, as long as the whole part does not become zero when that happens).

Okay, now how many bits (M) do we need for the fractional part? As mentioned above, we need enough bits to make sure the cumulative error does not affect the result, and no more. :wink: We could work this out mathematically, but it's easier to just choose any M and then use brute force checking to figure out if the cumulative error was too high or not. Start with M=8 or something, and keep decrementing it and testing all possible input (redDelta, greenDelta, blueDelta) triples until the cumulative error becomes a problem, and then use the last working value. Notice that you shouldn't need any "manual adjustment" of scaled component values using this method. However, it might be possible to have a field 1 bit smaller with manual adjustment (I have no idea). But in general we shouldn't need it.
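
A sketch of that brute-force search, modeling only the Y component; everything specific here is hypothetical (trY = 48, K at the midpoint, and the +0.5 bias split evenly across the three lookups rather than living in one table). Link with -lm:

Code: Select all

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

static const double trY = 48.0;             /* hypothetical threshold */

static long field(int delta, int M) {       /* one rounded table-field value */
    double K = 0.5 / (trY + 0.5);
    return lround((K * delta + 0.5 / 3.0) * (double)(1 << M));
}

static int M_works(int M) {
    /* test all (redDelta, greenDelta, blueDelta) triples; Y = r + g + b */
    for (int r = -31; r <= 31; r++)
    for (int g = -31; g <= 31; g++)
    for (int b = -31; b <= 31; b++) {
        long sum = field(r, M) + field(g, M) + field(b, M);
        int inside_exact = (abs(r + g + b) <= (int)trY);
        int inside_fixed = (sum >= 0 && sum < (1L << M));
        if (inside_exact != inside_fixed)
            return 0;                       /* cumulative error bit us */
    }
    return 1;
}

int main(void) {
    int best = -1;
    for (int M = 12; M >= 1 && M_works(M); M--)
        best = M;                           /* keep shrinking while it works */
    if (best > 0) printf("smallest workable M: %d\n", best);
    else          printf("nothing worked; revisit K or the rounding\n");
    return 0;
}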

Packing three components into a 32-bit word

Assume the N+M sizes of the three components add up to less than 32 bits (otherwise all this was a waste of time! heh). How do we pack them into the table words?

There are two parts to this that need describing. One is: where does the positive bias of 0.5 get added in? You can either add 0.5 to each field value in the green table, or 0.25 to the field value in the redBlue table. HOWEVER: one of the components (is it V?) is negated during the Blue lookup, so you can't put any bias in the redBlue table for that field or it will cancel out! The simplest thing is probably to add the bias for all three fields into the green table.

The other question is, do we need guard bits or anything to protect us from negative numbers?

I am pretty sure the answer is no. Consider this: we are testing to see if ALL THREE components are "inside" the thresholds, and if they are the result of the diff test is "they are the same" (false). If any of them is "outside" the thresholds, we know the result of the diff test is "they are different" (true) and we don't care about the other two components. In other words, we are doing a logical OR of the three individual results to get one combined result. Suppose the Y field is in the highest bits, U is in the middle bits and V is in the lowest bits. If V is negative, the result from the V test is true and the final result is going to be true no matter what (so it doesn't matter if a borrow from V into U corrupts the result of U). Similarly, if U is negative, the result of the U test is true and the final result will be true--a borrow from U into Y might corrupt the Y result but that can't possibly change the final result.

Okay, so we don't need any guard bits and we can just add table entries together with glee. The table entries should just be a linear combination of the three fields, i.e. something like:

Code: Select all

redBlue[index] = (1<<22)*scaledY + (1<<11)*scaledU + scaledV;
In other words, if the scaled value for a particular field in a particular table entry is NEGATIVE, it must just borrow from the next field up (or be negative at the top). When three entries are added together, either the results of all three fields will be positive (hence each borrow was cancelled out by a carry), or the outcome is going to be "they are different" anyways and a borrow or two will not affect it in any way.

Conclusion

If everything I've described above works, then:
(1) we can create table entries without any manual adjustment of the field values,
(2) no guard bits are needed between fields, and
(3) no additional constants need to be combined with the 32-bit YUV word before it can be ANDed with the mask to get the final result!

Property (3) is especially nice. The only constants needed are the 0.5 bias values which are baked into the tables. If you are using a redBlue table you still need to XOR some bits of the Blue lookup result, but that's it.

Also (4) If the table entries end up needing less than 32 bits, I think you could use a few highest bits of each table entry for something else! The Diff algorithm would not be affected by them in any way. This is probably useless, but you never know =)

[Edit: with the formulation above, it may be possible to replace the green table with a multiply-add. The add is required to put the +0.5 bias in each field. I'm not sure if the bias could be baked into the redBlue table without causing problems on that Blue lookup. On the multiply, I'm not sure if unrounded remainders from one field can corrupt the high bits of a field below it in a way that changes the final result. If so, maybe it is possible to choose K very carefully so as to limit the corruption. Because the "safety margin" would then be smaller and/or to limit the effect on lower fields, we might need to increase the bit size of some fields. And in the end, it might not be any faster than just using the table. But the idea is interesting.]

Edit: I forgot that U=0 in the green table (at least if there's no constant baked into it). I'm 99% sure it could be replaced by a multiply, AND, add combo. It may suck on old Pentiums, but newer machines could do it pretty fast. I only care about this because my green table is twice as big as yours (due to supporting 6-bit green values).
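
To sketch what that multiply, AND, add combo might look like with the field layout above (Y at bit 22, U at bit 11, V at bit 0): the multiplier and bias constants below are placeholders only, since as noted the unrounded remainders may corrupt lower fields, so real constants would need the same exhaustive verification as the tables.

Code: Select all

/* hypothetical sketch; the per-field scales in GREEN_MUL are NOT verified */
static unsigned green_via_mul(int delta)    /* delta in [-32, 31] */
{
    const unsigned GREEN_MUL  = (3u << 22) | 5u;  /* placeholder Y and V scales */
    const unsigned GREEN_BIAS = (64u << 22) | (16u << 11) | 16u; /* placeholder 0.5 biases */
    unsigned v = (unsigned)delta * GREEN_MUL;  /* one multiply covers both fields */
    v &= 0xFFC007FFu;                          /* AND keeps Y and V, clears U */
    return v + GREEN_BIAS;                     /* ADD bakes the biases back in */
}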

Another possibility which I may have already mentioned (can't remember) is to put U in the top bits, and use 16-bit entries for the green table and movsx from memory. Only works if the Y and V values in the green table will fit into 16 bits (which seems unlikely).
Last edited by mozz on Tue Aug 08, 2006 6:27 pm, edited 1 time in total.
WolfWings
Hazed
Posts: 52
Joined: Wed Nov 02, 2005 1:31 pm

Post by WolfWings »

First off... um, you branched into theoretical number-churning code at a level I never touch. Most of it didn't make much sense to me, except that it looks like a way to calculate arbitrary bit-field sizes and scaling values in one shot, instead of simply trying a couple of times from a quick estimate and fine-tuning until one gets it right, like I choose to do. :-)

I'm not much for complicated equations when simple guess-and-check is easy to do. And I perform exhaustive tests regardless. I don't trust 'magic equations' when I have to hand-verify things work identically anyways.

And, um... why are you supporting 6 bits of green? The SNES only calculated things at 5 bits of green. :-) And blue and red were reversed from normal PC format.

Also, the entire purpose of this is to speed up the HQ code on OLDER machines. Specifically things like the Pentium MMX and Pentium 2. Newer machines will benefit from the smaller code-size, but the older machines need the most help. And a large part of that is by supporting zSNES native pixel format and outputting native PC pixel format directly. :-)

EDIT Also, note that I am already putting the entire bias into the Green table, and the 'trick' I'm using for combining the red/blue tables is internally bit-negating (SLIGHTLY different than properly negating) the r-b value. I also adjusted the mask already: since we're only adding three numbers together we can never 'toggle' the guard-bit twice, so I avoided one bit-mask at the tail of the code. That puts me down to 11 instructions, of which all but 3 (one at the beginning and one at the end) pair on Pentium MMX/Pentium Pro/Pentium 2.

The current tables I'm using:

Code: Select all

#define RB(y,u,v) (((y)&0x3FF)<<22)|(((u)&0x3FF)<<11)|((v)&0x3FF)
#define GO(y,v) ((((y)+64)&0x3FF)<<22)|(16<<11)|(((v)+16)&0x3FF)
unsigned int DiffTable[128] = {
RB(-85, 160,  80), RB(-83, 155,  77), RB(-80, 150,  75), RB(-77, 145,  72),
RB(-75, 140,  70), RB(-72, 135,  67), RB(-69, 130,  65), RB(-67, 125,  62),
RB(-64, 120,  60), RB(-61, 115,  57), RB(-59, 110,  55), RB(-56, 105,  52),
RB(-53, 100,  50), RB(-51,  95,  47), RB(-48,  90,  45), RB(-45,  85,  42),
RB(-43,  80,  40), RB(-40,  75,  37), RB(-37,  70,  35), RB(-35,  65,  32),
RB(-32,  60,  30), RB(-29,  55,  27), RB(-27,  50,  25), RB(-24,  45,  22),
RB(-21,  40,  20), RB(-19,  35,  17), RB(-16,  30,  15), RB(-13,  25,  12),
RB(-11,  20,  10), RB( -8,  15,   7), RB( -5,  10,   5), RB( -3,   5,   2),
RB(  0,   0,   0), RB(  2,  -5,  -2), RB(  5, -10,  -5), RB(  7, -15,  -7),
RB( 10, -20, -10), RB( 13, -25, -12), RB( 15, -30, -15), RB( 18, -35, -17),
RB( 21, -40, -20), RB( 23, -45, -22), RB( 26, -50, -25), RB( 29, -55, -27),
RB( 31, -60, -30), RB( 34, -65, -32), RB( 37, -70, -35), RB( 39, -75, -37),
RB( 42, -80, -40), RB( 45, -85, -42), RB( 47, -90, -45), RB( 50, -95, -47),
RB( 53,-100, -50), RB( 55,-105, -52), RB( 58,-110, -55), RB( 61,-115, -57),
RB( 63,-120, -60), RB( 66,-125, -62), RB( 69,-130, -65), RB( 71,-135, -67),
RB( 74,-140, -70), RB( 77,-145, -72), RB( 79,-150, -75), RB( 82,-155, -77),
GO(-84,     -160), GO(-81,     -155), GO(-78,     -150), GO(-76,     -145),
GO(-73,     -140), GO(-70,     -135), GO(-68,     -130), GO(-65,     -125),
GO(-63,     -120), GO(-60,     -115), GO(-57,     -110), GO(-55,     -105),
GO(-52,     -100), GO(-49,      -95), GO(-47,      -90), GO(-44,      -85),
GO(-42,      -80), GO(-39,      -75), GO(-36,      -70), GO(-34,      -65),
GO(-31,      -60), GO(-28,      -55), GO(-26,      -50), GO(-23,      -45),
GO(-21,      -40), GO(-18,      -35), GO(-15,      -30), GO(-13,      -25),
GO(-10,      -20), GO( -7,      -15), GO( -5,      -10), GO( -2,       -5),
GO(  0,        0), GO(  2,        4), GO(  5,        9), GO(  7,       14),
GO( 10,       19), GO( 13,       24), GO( 15,       29), GO( 18,       34),
GO( 21,       39), GO( 23,       44), GO( 26,       49), GO( 28,       54),
GO( 31,       59), GO( 34,       64), GO( 36,       69), GO( 39,       74),
GO( 42,       79), GO( 44,       84), GO( 47,       89), GO( 49,       94),
GO( 52,       99), GO( 55,      104), GO( 57,      109), GO( 60,      114),
GO( 63,      119), GO( 65,      124), GO( 68,      129), GO( 70,      134),
GO( 73,      139), GO( 76,      144), GO( 78,      149), GO( 81,      154),
};

int ExpandedDiff(unsigned int c0, unsigned int c1) {
	unsigned int acum, diff, slot;
	diff   = c0;
	diff  -= c1;
	slot   = (long int)((  signed char)diff);
	diff  += 0x8020;
	diff >>= 10;
	acum   = DiffTable[slot +  32];
	slot   = (long int)((unsigned char)diff);
	acum  ^= (0x3FF << 11);
	diff >>= 11;
	acum  += DiffTable[slot +   0];
	slot   = (long int)((  signed char)diff);
	acum  += DiffTable[slot +  96];
	return acum; /* BITMASK = 0xE01F03E0 */
}
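For clarity, here's how a caller would consume that return value using the bitmask from the comment (a hypothetical wrapper, not part of the code above):

Code: Select all

/* nonzero result means some component broke its threshold, i.e. "different" */
static int PixelsDiffer(unsigned int c0, unsigned int c1) {
    return (ExpandedDiff(c0, c1) & 0xE01F03E0) != 0;
}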
The table-verification program:

Code: Select all

#include <stdio.h>
#include <stdlib.h>
#include <strings.h>
#include <math.h>

unsigned int DiffTable[64*2];

void InitDiffTable(void) {
	unsigned int l;
	signed int j, k;
	for (j = -32; j < 32; j++) {
		k = ((j *  8) - 1) / 3; l  = ((k & 0x3FF) << 22);
		k =  (j * -5)         ; l |= ((k & 0x3FF) << 11);
		k =  (j * -5)      / 2; l |=  (k & 0x3FF)       ;
		DiffTable[j +  32] = l; /* Blue/Red */
		k =  (j * 21)      / 8; l  = (((k + 64) & 0x3FF) << 22);
		                        l |= ((     16  & 0x3FF) << 11);
		k = ((j * 10) - 1) / 2; l |=  ((k + 16) & 0x3FF)       ;
		DiffTable[j +  96] = l; /* Green */
	}
}

static __inline__ unsigned int RGBUnpack(unsigned int x) {
	return ((x << 16) | x) & 0x03E07C1F;
}

int ExpandedDiff(unsigned int c0, unsigned int c1) {
	unsigned int r, g;
	g   = c0;
//	g  -= c1;
//	g  += 0x4008020;
	r   = DiffTable[(long int)((unsigned char)g) +   0];
	r  ^= (0x3FF << 11);
	g >>= 10;
	r  += DiffTable[(long int)((unsigned char)g) +   0];
	g >>= 11;
	r  += DiffTable[(long int)((unsigned char)g) +  64];
//	r  &= 0xE01F03E0;
	return r;
}

int OldDiff(unsigned int c0) {
	signed int r, g, b;
	signed int y, u, v;
	int l;

	r = ((c0 & 63)) - 32;
	b = (((c0 >> 10) & 63)) - 32;
	g = (((c0 >> 21) & 63)) - 32;

	y = (r + g + b);
	u = (r - b);
	v = ((2 * g) - r - b);
	l = 0;
	if (y >=  25) l |= (0x080 << 22);
	if (y <= -25) l |= (0x380 << 22);
	if (u >=   4) l |= (0x020 << 11);
	if (u <=  -4) l |= (0x3E0 << 11);
	if (v >=   7) l |= (0x020 <<  0);
	if (v <=  -7) l |= (0x3E0 <<  0);
	return l;
}

int main(int argc, char **argv) {
	int r, g, b, j, k;
	int old, new, block;

	block = 0;

	InitDiffTable();

	for (r = 0; r < 64; r += 1) {
		for (g = 0; g < 64; g += 1) {
			for (b = 0; b < 64; b += 1) {
				j = (g << 21) | (b << 10) | r;
				old = OldDiff(j);
				new = ExpandedDiff(j, 0);
				if (!old ^ !(new & 0xE01F03E0)) {
					k = (new & 0xE01F03E0);
					block = 1;
					printf("%3i:%3i:%3i\t", r-32, b-32, g-32);
					j = 0;
					if (!((old >> 22) & 0x3FF) ^ !((k >> 22) & 0x3FF)) j = 0x3FF;
					j = (((old >> 22) & 0x3FF) - ((new >> 22) & 0x3FF)) & j;
					printf(":%4i", (j ^ 0x200) - 0x200);
					j = 0;
					if (!((old >> 11) & 0x3FF) ^ !((k >> 11) & 0x3FF)) j = 0x3FF;
					j = (((old >> 11) & 0x3FF) - ((new >> 11) & 0x3FF)) & j;
					printf(":%4i", (j ^ 0x200) - 0x200);
					j = 0;
					if (!((old >>  0) & 0x3FF) ^ !((k >>  0) & 0x3FF)) j = 0x3FF;
					j = (((old >>  0) & 0x3FF) - ((new >>  0) & 0x3FF)) & j;
					printf(":%4i", (j ^ 0x200) - 0x200);
					new &= 0xE01F03E0;
					printf("\t:%08X\t", old);
					printf(":%03X", (new>>22) & 0x3FF);
					printf(":%03X", (new>>11) & 0x3FF);
					printf(":%03X\n", (new>> 0) & 0x3FF);
				}
			}
		}
	}
	if (!block) {
		printf("#define RB(y,u,v) (((y)&0x3FF)<<22)|(((u)&0x3FF)<<11)|((v)&0x3FF)\n");
		printf("#define GO(y,v) ((((y)+64)&0x3FF)<<22)|(16<<11)|(((v)+16)&0x3FF)\n");
		printf("unsigned int DiffTable[128] = {\n");
		for (r =  0; r <  64; r++) {
			printf("RB(%3i,%4i,%4i),%s",
			       (((DiffTable[r]>>22)&0x3FF)^0x200)-0x200,
			       (((DiffTable[r]>>11)&0x3FF)^0x200)-0x200,
			       (( DiffTable[r]     &0x3FF)^0x200)-0x200,
			       ((r & 3) == 3) ? "\n" : " ");
		}
		for (r = 64; r < 128; r++) {
			printf("GO(%3i,     %4i),%s",
			       (((DiffTable[r]>>22)&0x3FF)^0x200)-0x200-64,
			       (( DiffTable[r]     &0x3FF)^0x200)-0x200-16,
			       ((r & 3) == 3) ? "\n" : " ");
		}
		printf("};\n");
	}
	return 0;
}
[img]http://wolfwings.us/sigs/WolfWings.jpg[/img]
mozz
Hazed
Posts: 56
Joined: Mon Oct 10, 2005 3:12 pm
Location: Montreal, QC

Post by mozz »

WolfWings wrote:why are you supporting 6 bits of green? The SNES only calculated things at 5 bits of green. :-) And blue and red were reversed from normal PC format.
Hmm, good point. But the PPU code of an emulator can produce output in whichever format is easier. It makes sense to produce it in a 565 format which is supported directly by most video cards. I would like to have an algorithm that works on input from other types of emulator which might use all 6 bits of green. But maybe there is no point.
WolfWings wrote:Also, the entire purpose of this is to speed up the HQ code on OLDER machines. Specifically things like the Pentium MMX and Pentium 2.
My motivations may be different from yours. I want to make a very small, accurate, reasonably fast emulator for multiple different old consoles. I'd rather have a 1K implementation of HQ3X than a 200K implementation even if the 1K implementation was a little bit slower (though I expect we can be faster than the original implementations AND be really small at the same time).
WolfWings wrote:Newer machines will benefit from the smaller code-size, but the older machines need the most help. And a large part of that is by supporting zSNES native pixel format and outputting native PC pixel format directly. :-)
True that older machines need the most help. I thought the "zSNES native pixel format", as you put it, is also the format that the HQxx code used in zsnes accepts as input. I don't think this is all that significant anyway. The real wins here come from making Diff blazing fast, reducing the number of Diffs needed, getting rid of the large tables and the bloated switch statement code, and replacing them with small tables and code that fit in the L1 cache.
WolfWings
Hazed
Posts: 52
Joined: Wed Nov 02, 2005 1:31 pm

Post by WolfWings »

mozz wrote:
WolfWings wrote:why are you supporting 6 bits of green? The SNES only calculated things at 5 bits of green. :-) And blue and red were reversed from normal PC format.
Hmm, good point. But the PPU code of an emulator can produce output in whichever format is easier. It makes sense to produce it in a 565 format which is supported directly by most video cards. I would like to have an algorithm that works on input from other types of emulator which might use all 6 bits of green. But maybe there is no point.
PC video cards also support 555 format, though reversed R/B from SNES format.
mozz wrote:
WolfWings wrote:Also, the entire purpose of this is to speed up the HQ code on OLDER machines. Specifically things like the Pentium MMX and Pentium 2.
My motivations may be different from yours. I want to make a very small, accurate, reasonably fast emulator for multiple different old consoles. I'd rather have a 1K implementation of HQ3X than a 200K implementation even if the 1K implementation was a little bit slower (though I expect we can be faster than the original implementations AND be really small at the same time).

Then yes, I have little further to discuss with you. I'm here to improve zSNES by simplifying the HQ?? code. If someone can use my simplified HQ?? framework to improve other emulators, that's fine. But HQ?? will always be complex enough that it will need to be tuned to individual applications in emulators (input pixel formats, for example) before it can be sped up properly.
mozz wrote:
WolfWings wrote:Newer machines will benefit from the smaller code-size, but the older machines need the most help. And a large part of that is by supporting zSNES native pixel format and outputting native PC pixel format directly. :-)
True that older machines need the most help. I thought the "zSNES native pixel format" as you put it, is also the format that the HQXX used in zsnes accepts as input. I don't think this is all that significant anyway. The real wins here come from making Diff blazing fast, reducing the number of Diffs needed, getting rid of the large tables and the bloated switch statement code, and replacing them with small tables and code that fit in the L1 cache.
No, the real wins are for the only machines that currently can't run HQ??, which are older machines. Newer 1GHz+ machines can run HQ2x with the current code just fine, and 2GHz+ machines can run HQ4x easily. That's my reasoning for trying to focus the optimization nearly entirely on older sub-1GHz machines that currently can't run even HQ2x.
[img]http://wolfwings.us/sigs/WolfWings.jpg[/img]
pagefault
ZSNES Developer
ZSNES Developer
Posts: 812
Joined: Tue Aug 17, 2004 5:24 am
Location: In your garden

Post by pagefault »

Does anyone know where I can get a non-GPL version of this that is closed-source friendly?
Watering ur plants.
byuu

Post by byuu »

Necro post :P

If you want, I have an LGPL version that was heavily optimized by blargg. Don't know about "fitting in L1 cache", but it's about as fast as you will ever get with HQ2x.

Can't really do better than LGPL unless you want to ask MaxSt for permission.
WolfWings
Hazed
Posts: 52
Joined: Wed Nov 02, 2005 1:31 pm

Not to commit too much thread necromancy...

Post by WolfWings »

I feel really bad, I never posted my finalized code, which does HQ2x in (on average) 126-131 clock cycles per original source pixel on my Core 2 Duo, so around 32 clock cycles per output pixel. And, yes, it does fit entirely into L1 cache even back on a 486: 692 bytes of read-only data and ~279 bytes of code as a stand-alone function for the inner components of the HQ2x code now.

On a side-note, it actually runs the fastest when compiled for i386, using GCC 4.3.x:
gcc -std=c99 -march=i386 -O3 -s -Wall -pedantic

I haven't tested this extensively (just dusting off my old code and tidying it up to post, though I'll probably start working on HQ3x/HQ4x tables), but I believe it's because targeting 'my' arch causes the code and what-not to get a lot of alignment-spacers forced on it, which actually ends up hurting performance dramatically. And yes, the test-harness doesn't handle edge-of-screen; it was just a quick test-wrapper so I could get some comparison images vis-a-vis the stock HQ2x code.

Testing image:
Image

Results:
Image

Code removed. See this post instead.
Last edited by WolfWings on Sun Jan 04, 2009 11:50 pm, edited 1 time in total.
[img]http://wolfwings.us/sigs/WolfWings.jpg[/img]
blargg
Regular
Posts: 327
Joined: Thu Jun 30, 2005 1:54 pm
Location: USA
Contact:

Post by blargg »

Minor comments for HQ2x_00 function:

- Shorter way to mix two pixels, same result: accum = (accum + l - ((accum ^ l) & 0x0421)) >> 1; (a quick verification sketch follows these notes)

- Do you need to use uint16_t for local variables/return values, or would unsigned int suffice? Sometimes compilers generate extra masking when you use shorter types.

- Make static arrays const if possible. Allows compiler more options of where to store the data. Making that local to the function might help a compiler even more.
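
Here's a quick self-contained check of that mixing one-liner from the first note, assuming 5-5-5 pixels (0x0421 covers the low bit of each field); just a throwaway sketch:

Code: Select all

#include <stdio.h>

/* per-channel average of two 5-5-5 pixels, the slow reference way */
static unsigned avg_reference(unsigned a, unsigned b) {
    unsigned r = (((a      ) & 31) + ((b      ) & 31)) >> 1;
    unsigned g = (((a >>  5) & 31) + ((b >>  5) & 31)) >> 1;
    unsigned h = (((a >> 10) & 31) + ((b >> 10) & 31)) >> 1;
    return (h << 10) | (g << 5) | r;
}

/* subtracting the low bit of each field where the operands differ makes
   every per-field sum even, so the shift can't borrow across fields */
static unsigned avg_mixed(unsigned a, unsigned b) {
    return (a + b - ((a ^ b) & 0x0421)) >> 1;
}

int main(void) {
    for (unsigned a = 0; a < 0x8000; a += 7)        /* sparse sweep for speed */
        for (unsigned b = 0; b < 0x8000; b += 13)
            if (avg_reference(a, b) != avg_mixed(a, b)) {
                printf("mismatch: %04X %04X\n", a, b);
                return 1;
            }
    printf("all sampled pairs match\n");
    return 0;
}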

Looking over the algorithm and resulting weights, would this be any faster? There are two main changes: avoiding calculation of packIndex by using a full table, and avoiding the microcode by simply having a table of the resulting pixel weights and multiplying pixels with RGB components separated, then recombining. Note that this mixing will yield slightly different results, because it preserves more fractional precision during mixing. I'm guessing this might be slower, since by the looks of your code you struck a specific balance between code size and table size.

I am curious why you did the odd XORs when checking for identical pixels, rather than plain compares.

Code: Select all

#include <stdint.h>   /* for uint16_t used below */

static unsigned char weight_idxs [0x800];

static void init_weight_idxs()
{
    for ( int pattern = 0; pattern < 0x800; ++pattern )
    {
        int packIndex = hq2x[pattern & 15];
        if (packIndex) {
            unsigned char patternMask = (pattern >> 4);
            packIndex = hq2x[(patternMask & hq2x[packIndex]) + packIndex + 1];
            if (packIndex & 0xF0)
                packIndex = hq2x[((patternMask >> 4) & hq2x[packIndex]) + packIndex + 1];
        }

        weight_idxs [pattern] = packIndex * 4;
    }
}

static inline unsigned HQ2x_00(int pattern, const uint16_t *p)
{
    /* this might even slow things down now */
    if ( p[0] == p[2] && p[1] == p[3] && p[0] == p[1] )
        return p[0];
   
    /* weights, in sixteenths */
    static const unsigned char weights [12 * 4] =
    {
         8, 4, 4, 0,
         8, 4, 0, 4,
         8, 0, 4, 4,
        16, 0, 0, 0,
        12, 4, 0, 0,
        12, 0, 4, 0,
        12, 0, 0, 4,
         4, 6, 6, 0,
        12, 2, 2, 0,
        14, 1, 1, 0,
        10, 4, 2, 0,
        10, 2, 4, 0
    };
   
    const unsigned char* weight = &weights [weight_idxs [pattern]];
    unsigned t = ((p [0] * 0x10001) & 0x03E07C1F) * weight [0] +
        ((p [1] * 0x10001) & 0x03E07C1F) * weight [1] +
        ((p [2] * 0x10001) & 0x03E07C1F) * weight [2] +
        ((p [3] * 0x10001) & 0x03E07C1F) * weight [3];
    t = (t >> 4) & 0x03E07C1F;
    return (t >> 16) | t;
}
WolfWings
Hazed
Posts: 52
Joined: Wed Nov 02, 2005 1:31 pm

Post by WolfWings »

blargg wrote:- Shorter way to mix two pixels, same result: accum = (accum + l - ((accum ^ l) & 0x0421)) >> 1;
Didn't know that was any faster, I'll give it a try.
blargg wrote:- Do you need to use uint16_t for local variables/return values, or would unsigned int suffice? Sometimes compilers generate extra masking when you use shorter types.
I'm still going through and cleaning the code from head to tail at this point, so that sort of polish will definitely occur. This was very much a 'get this working on the compiler on my current computer, then post it in a hurry' when I realized I never had posted this finished version before. :-)
blargg wrote:- Make static arrays const if possible. Allows compiler more options of where to store the data. Making that local to the function might help a compiler even more.
Another GCC 4.3-ism I've gotten too used to: it was already moving everything into the .rodata section, since they were static and only ever referenced read-only.
blargg wrote:I am curious why you did the odd XORs when checking for identical pixels, rather than plain compares.
Old habit; XOR+zero-test is the same as a CMP+zero-test, though it only mangles the sign-flag. Better would honestly have been tiered XORs OR'ed together and zero-tested once, since those can be parallelized after the memory-read down to only 3-4 clock cycles on really new archs. The other theory was that since I was testing 16-bit operands, things were unchained enough (in theory at least) that the 0-2/1-3 test could be done using 32-bit reads, 0/1 versus 2/3 as one XOR. Load 2/3, XOR against 0/1, load 1, XOR against 0, no stalls, though I bolloxed that up by using ||'s instead of |'ing the results together into one test.
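
Something like this OR'ed form, against the p[] layout from blargg's HQ2x_00 above (just a sketch of the idea, not the code I shipped):

Code: Select all

#include <stdint.h>

/* merge the three XORs with | and zero-test once, instead of three
   short-circuiting compares; same truth value as the three-way == test */
static int all_four_equal(const uint16_t *p) {
    return ((p[0] ^ p[1]) | (p[0] ^ p[2]) | (p[1] ^ p[3])) == 0;
}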
blargg wrote:Looking over the algorithm and resulting weights, would this be any faster? There are two main changes: avoiding calculation of packIndex by using a full table, and avoiding the microcode by simply having a table of the resulting pixel weights and multiplying pixels with RGB components separated, then recombining. Note that this mixing will yield slightly different results, because it preserves more fractional precision during mixing. I'm guessing this might be slower, since by the looks of your code you struck a specific balance between code size and table size.
The microcoding was an attempt to code an alternative to MMX/IMUL, based around the 'bitwise average' technique from Aggregate.org, which turned out to pair remarkably well on most CPUs' pipelines. I went that route after finding large penalties converting to/from expanded form for the other techniques, and surprisingly high latencies and clock-cycle counts for multiplies.

The microcoding was an idea that, while being VERY effective at reducing code and data size, may not be the best approach. The code posted is, more-or-less, the smallest I believe you can get HQ2x. From here, it's far easier to unpack some things for speed; but right now this code is still gobs faster than the stock HQ2x code, just because even on my Penryn-era Core 2 Duo I only have 32k each of L1 data and instruction cache per core, while stock HQ2x takes up over 50k just for its massive switch statement in C form compiled for size, and the pure-assembly version in current ZSNES still takes up over 65k for the HQ2x16 code alone, again larger than even the infamous large-L1-cache Athlon64s can hold in their instruction cache.
blargg wrote:

Code: Select all

static unsigned char weight_idxs [0x800];

static void init_weight_idxs()
{
    for ( int pattern = 0; pattern < 0x800; ++pattern )
    {
        int packIndex = hq2x[pattern & 15];
        if (packIndex) {
            unsigned char patternMask = (pattern >> 4);
            packIndex = hq2x[(patternMask & hq2x[packIndex]) + packIndex + 1];
            if (packIndex & 0xF0)
                packIndex = hq2x[((patternMask >> 4) & hq2x[packIndex]) + packIndex + 1];
        }

        weight_idxs [pattern] = packIndex * 4;
    }
}
Long-term, this is likely the best approach. Especially since supporting 4x uses the exact same pattern-tree as 2x, just with more specific blends chosen. Supporting 3x may require a different pattern-tree; I'm still cross-checking the data tables to figure that out and be sure, so hopefully the code can be generalized enough to be initialized at a somewhat-arbitrary scaling factor. At first I was worried I'd need separate pattern-trees for each pixel in a given HQ set, so at 4x that would become 8k of data just for the pattern-tables, which is too much for some CPUs' on-die tiny caches. If it turns out to be a relatively fixed 2k table? Makes a lot more sense to just have a single-byte lookup.
blargg wrote:

Code: Select all

static inline unsigned HQ2x_00(int pattern, const uint16_t *p)
{
    /* this might even slow things down now */
    if ( p[0] == p[2] && p[1] == p[3] && p[0] == p[1] )
        return p[0];
   
    /* weights, in sixteenths */
    static const unsigned char weights [12 * 4] =
    {
         8, 4, 4, 0,
         8, 4, 0, 4,
         8, 0, 4, 4,
        16, 0, 0, 0,
        12, 4, 0, 0,
        12, 0, 4, 0,
        12, 0, 0, 4,
         4, 6, 6, 0,
        12, 2, 2, 0,
        14, 1, 1, 0,
        10, 4, 2, 0,
        10, 2, 4, 0
    };
   
    const unsigned char* weight = &weights [weight_idxs [pattern]];
    unsigned t = ((p [0] * 0x10001) & 0x03E07C1F) * weight [0] +
        ((p [1] * 0x10001) & 0x03E07C1F) * weight [1] +
        ((p [2] * 0x10001) & 0x03E07C1F) * weight [2] +
        ((p [3] * 0x10001) & 0x03E07C1F) * weight [3];
    t = (t >> 4) & 0x03E07C1F;
    return (t >> 16) | t;
}
I'd be happy to give this approach a spin, and for larger scaling factors I'm quite sure this, with MMX/SSE, would be vastly faster. I over-compressed the data and code in my posting, though I'll need to verify this doesn't bog down due to the huge reliance on MULs from memory-operands and what-not.
[img]http://wolfwings.us/sigs/WolfWings.jpg[/img]
WolfWings
Hazed
Posts: 52
Joined: Wed Nov 02, 2005 1:31 pm

Post by WolfWings »

Post deleted; see here for the re-written HQ-code website.
Last edited by WolfWings on Tue Jan 06, 2009 10:00 am, edited 1 time in total.
[img]http://wolfwings.us/sigs/WolfWings.jpg[/img]
Squall_Leonhart
Trooper
Posts: 369
Joined: Tue Jun 10, 2008 6:19 am
Location: Australia
Contact:

Post by Squall_Leonhart »

This is good work, WolfWings; you think it'll work as a Kega plugin?
[img]http://img.photobucket.com/albums/v253/squall_leonhart69r/Final_Fantasy_8/squall_sig1.gif[/img]
[url=http://vba-m.com/]VBA-M Forum[/url], [url=http://www.ngohq.com]NGOHQ[/url]
Nach
ZSNES Developer
ZSNES Developer
Posts: 3904
Joined: Tue Jul 27, 2004 10:54 pm
Location: Solar powered park bench
Contact:

Post by Nach »

WolfWings:
If the function works best with it optimized as i386, for GCC, add: __attribute__((optimize("i386"))) to the function definition.

If it's a block of code, place around it:
#pragma GCC optimize("i386")

Note that some of these won't take effect until GCC 4.4, but it pays to get ready.
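
For concreteness, here's the shape of both forms (hq2x_inner and friends are hypothetical names, and whether GCC accepts "i386" as an argument there is exactly what the references below will confirm):

Code: Select all

/* per-function form */
__attribute__((optimize("i386")))
static void hq2x_inner(void)
{
    /* ... the hot filtering loop ... */
}

/* block form, covering everything after the pragma */
#pragma GCC optimize("i386")
static void hq2x_row(void)  { /* ... */ }
static void hq2x_tile(void) { /* ... */ }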

References:
http://gcc.gnu.org/onlinedocs/gcc/Funct ... butes.html
http://gcc.gnu.org/onlinedocs/gcc/Funct ... agmas.html
http://gcc.gnu.org/gcc-4.4/changes.html
May 9 2007 - NSRT 3.4, now with lots of hashing and even more accurate information! Go download it.
_____________
Insane Coding
WolfWings
Hazed
Posts: 52
Joined: Wed Nov 02, 2005 1:31 pm

Post by WolfWings »

Squall_Leonhart wrote:This is good work, WolfWings; you think it'll work as a Kega plugin?
Don't know anything about Kega plugins, but I've never been that interested in most Genesis games, to be honest, so I have little enthusiasm to convert it for Kega use. I'll assist someone else, but it's unlikely I'll do any heavy lifting myself to convert it.
Nach wrote:If the function works best with it optimized as i386, for GCC...
Some more testing with PGO is letting me correct the issue more directly, and has actually let me get things down to 21-22 clock cycles per output pixel regardless of processor target. I haven't integrated all the #pragmas to remove the need to run a PGO pass over the code yet, so nothing new to post.
[img]http://wolfwings.us/sigs/WolfWings.jpg[/img]
Squall_Leonhart
Trooper
Posts: 369
Joined: Tue Jun 10, 2008 6:19 am
Location: Australia
Contact:

Post by Squall_Leonhart »

Well, it's more the fact VBA-M supports Kega plugins; it was either make it a plugin format so it works in both Kega and VBA-M, or add it directly into VBA-M, which I'm not sure the licenses would allow...
[img]http://img.photobucket.com/albums/v253/squall_leonhart69r/Final_Fantasy_8/squall_sig1.gif[/img]
[url=http://vba-m.com/]VBA-M Forum[/url], [url=http://www.ngohq.com]NGOHQ[/url]
WolfWings
Hazed
Posts: 52
Joined: Wed Nov 02, 2005 1:31 pm

Post by WolfWings »

Squall_Leonhart wrote:Well, it's more the fact VBA-M supports Kega plugins; it was either make it a plugin format so it works in both Kega and VBA-M, or add it directly into VBA-M, which I'm not sure the licenses would allow...
Um... my version's not GPL. It's ISC, as close to public domain as you can get. Add it to projects, relicense, mangle, print and use as toilet paper: it's fine with me and legal. This version is a ground-up re-implementation of a general-purpose filtering suite (I could make tables for 2xSaI or Eagle quite easily, or even bilinear filtering) with 2kb of tables providing an equivalent effect to HQ, with the values only tuned for HQ2x equivalence at the moment.
[img]http://wolfwings.us/sigs/WolfWings.jpg[/img]
Squall_Leonhart
Trooper
Posts: 369
Joined: Tue Jun 10, 2008 6:19 am
Location: Australia
Contact:

Post by Squall_Leonhart »

Ah, ok that's kool, this serves to improve the performance of the HQ2x filter, right?
[img]http://img.photobucket.com/albums/v253/squall_leonhart69r/Final_Fantasy_8/squall_sig1.gif[/img]
[url=http://vba-m.com/]VBA-M Forum[/url], [url=http://www.ngohq.com]NGOHQ[/url]
h4tred

Post by h4tred »

Yes, this is an optimized version of HQ2x.
Don't know anything about Kega plugins, but I've never been that interested in most Genesis games, to be honest, so I have little enthusiasm to convert it for Kega use. I'll assist someone else, but it's unlikely I'll do any heavy lifting myself to convert it.
All that's needed really is a simple function that does the processing on an image buffer. Like what the old HQxx and 2xSaI functions do, as well as blargg's nes_ntsc implementations. :)
WolfWings
Hazed
Posts: 52
Joined: Wed Nov 02, 2005 1:31 pm

Post by WolfWings »

h4tred wrote:All thats needed really is a simple function that does the processing to a image buffer. Like what the old HQXX and 2xSai functions do, as well as blargg's nes_ntsc implementations. :)
That's easily provided, and it nudged me to more properly modularize the HQ system now.

WolfWings HQ Library WIP v0.1

I need to rebuild the Makefile entirely and much more properly; right now it's a hack to make my testing easier, but it's there and it works with GCC 4.3 at least. I get around an average of 25 ticks per output pixel with the unprofiled executable, 21 ticks per output with the profiled executable; I'd love some others out there to post their results.

It also doesn't (yet/still) handle edge-of-buffer at all. I know what I need to do, I just haven't bothered to make that hunk of code yet.
[img]http://wolfwings.us/sigs/WolfWings.jpg[/img]
h4tred

Post by h4tred »

File's 403'd o.o
WolfWings
Hazed
Posts: 52
Joined: Wed Nov 02, 2005 1:31 pm

Post by WolfWings »

h4tred wrote:File's 403'd o.o
I was a dumbass and forgot to set the file permissions right when I uploaded the zip file. Fixed now!
[img]http://wolfwings.us/sigs/WolfWings.jpg[/img]
h4tred

Post by h4tred »

Looking at the source, I guess this lib is tied to SNES dimensions?

Just wondering, since Kega Fusion filters are done with arbitrary resolutions in mind (and the original HQ2X() function had this in mind....)
WolfWings
Hazed
Posts: 52
Joined: Wed Nov 02, 2005 1:31 pm

Post by WolfWings »

h4tred wrote:Looking at the source, I guess this lib is tied to SNES dimensions?

Just wondering, since Kega Fusion filters are done with arbitrary resolutions in mind (and the original HQ2X() function had this in mind....)
At the moment, yes, it's tied to 256x224. And as I said, it doesn't process edge-of-screen pixels yet either; they don't even get copied over to the output buffer yet. Nor does it support any input or output buffer format except 15-bit (not 16-bit) RGB/BGR yet, though the macros that need to be adjusted to support other formats are listed in the source. Supporting more options now would make the inner-loop logic get duplicated several times, and while I think the inner loop won't change at this point, I'd rather wait until it's confirmed as working as-is, with a version for HQ4x at least, before I expand the code to support arbitrary resolutions.

Remember, this is still deeply WiP code. Ergo the 0.1 version number on the link, and just a handful of raw files inside. It's meant to be easy to adjust and test the inner loop with various arguments, but with an eye towards a structure that will allow for handling quite a bit more later on.
[img]http://wolfwings.us/sigs/WolfWings.jpg[/img]
WolfWings
Hazed
Posts: 52
Joined: Wed Nov 02, 2005 1:31 pm

Post by WolfWings »

Fixed the code, renamed the library, etc.

It supports any single fixed resolution when compiled, and properly processes the entire input framebuffer without any speed hit that I've found so far. About half-way done rigging up HQ4x, though I may not finish before I have to go in to work 7 12-hour days in a row starting on Saturday. I'll try to do what I can though!
[img]http://wolfwings.us/sigs/WolfWings.jpg[/img]
WolfWings
Hazed
Posts: 52
Joined: Wed Nov 02, 2005 1:31 pm

Post by WolfWings »

Woot, progress!

Sneaking on between work and bed, but I updated the website with the v0.2 code. It has a mostly-finished HQ4x implementation (I need to hand-edit the table still to move some 7's to 12's before it's pixel-identical to factory HQ4x), and I've started working out the HQ3x table, which will need quite a few more blending entries. Then, adding support for 2xSaI and Eagle and all the rest looks like it might actually end up being more-or-less free. Scale4x is the only 'maybe not' in that pack, since it's actually Scale2x applied twice, which doesn't seem to expand easily. I'll just end up brute-force verifying that though, since even worst-case it's only 400-million combinations to test, as the numbers might mesh up.

As a side-note, HQ4x is only 10.75 clock cycles per output pixel, so quadruple the output pixels for only double the cost on newer machines. And all this is still without any MMX or other speedups yet; those are quite possible in various places still, though profiling is showing the weakness is still the difference-testing code at the moment. I'll be trying some approaches to cut that from 7 tests per input pixel to 4, and see how much that helps.
[img]http://wolfwings.us/sigs/WolfWings.jpg[/img]