Progress on reducing HQ2x to fit in L1 cache

AamirM · Post by **AamirM** » Mon Jan 12, 2009 6:20 am

Can't access. File "403"ed.

WolfWings · Post by **WolfWings** » Mon Jan 12, 2009 7:17 am

Fixed the permissions on the file, and I believe I corrected all the errant '7' entries in the main lookup table that needed to be 12's so it should return identical results to HQ4x as well as HQ2x now.

I also shoe-horned in the technique to reduce it down to only 4 'full' pixel-difference tests per input pixel, caching even more aggressively from the previous line as well as the previous pixel now. It made a measurable but comparitively small dent, getting it down to 20.5/10.25 clock cycles per output pixel on 64-bit but having very little effect on 32-bit binaries.

blargg · Post by **blargg** » Mon Jan 12, 2009 12:38 pm

I notice you're not using the simpler pixel comparison that only involves two table lookups and a few bit operations. Was that slower or not precise enough?

WolfWings · Post by **WolfWings** » Mon Jan 12, 2009 3:22 pm

blargg wrote:I notice you're not using the simpler pixel comparison that only involves two table lookups and a few bit operations. Was that slower or not precise enough?

That version still requires a conversion from RGB to YUV, which is a 64kb table AFAIK. The technique I'm using currently only takes 512 bytes total for the difference-table, though I might test with a 764-byte version to remove the xor with 0x3FF in my difference-equation. I'll have access to a Pentium 4 later this week, though I still lack anything but a simulator for calculating older CPU-clock-counts unfortunately.

blargg · Post by **blargg** » Mon Jan 12, 2009 4:31 pm

It looks like you can eliminate the *4 by rearranging the blend table. Instead of (simplified example)

Code: Select all

uint8_t table [4 * 2] =
{
    0, 1,
    2, 3,
    4, 5,
    6, 7
};

int x = table [index*4    ];
int y = table [index*4 + 1];

transpose to

Code: Select all

uint8_t table [4 * 2] =
{
    0, 2, 4, 6,
    1, 3, 5, 7
};

int x = table [index    ];
int y = table [index + 4];

On the other hand, since you're reading index out of another table, you could just multiply the entries in the other table by 4 (or maybe not, since you're using the table for multiple filters, some using a different factor, like 16).

I wonder if you can help the compiler optimize the blend+RGBPack operation. You have something like

Code: Select all

rgb = blended >> 4;
rgb &= 0x03E07C1F;
rgb |= (rgb >> 16);

I wonder if moving that right shift later could allow more parallelism

Code: Select all

rgb &= 0x03E07C1F << 4;
rgb = (rgb >> 4) | (rgb >> 20);

Looking at the data flows, the second allows the two shifts to be done in parallel:

Code: Select all

rgb -> >> -> & --------> | -> out
               \-> >> --/

rgb -> & --> >> -> | -> out
         \-> >> --/

And I notice the pixels array is never indexed except with constants. You could help the compiler by using a bunch of local variables. GCC might optimize both the same, but some compilers might be more apt to put them in registers if they aren't in an array (more relevant on RISC machines with lots of registers).

Heh, and finally, make use of restrict pointers (or perhaps __restrict, if you're not coding in C99), so that the compiler can know nothing aliases. Will help more on RISC machines where the compiler wants to move memory loads as early as possible to minimize latency, and it must know that a load can be moved before a store without affecting results.

WolfWings · Post by **WolfWings** » Mon Jan 12, 2009 4:57 pm

blargg wrote:Oh yeah, forgot about cache usage. Though for common images like from the SNES, there will only be around 256 distinct colors. If these are spread evenly, then it'll only use 256*cache_line_size bytes of cache. With 32-byte lines, that's 8K. I think it might be higher, like 64 or 128 bytes per line, though.

Considering I'm aiming for a total of 8K for both code and data at a maximum... yeah, even that would be painful. And the cache-thrash if an add/sub gradient blend or per-line palette-effect kicked in (Super Metroid anyone?) would be extra-horrible then.

For something using 64 colors or less total on a frame, a flat 'load, add, load' in 64-bit MMX would be best by far, I admit and concur. But the SNES just isn't safe to treat as anything less than a 32k-color machine a lot of the time. :-/ Especially with the reduction to only 4 YUV-comparisons at this point, compared to 8-12 per pixel in the stock HQ system or even 7 in my original implementation before I added the previous-line buffer, I'm gonna polish up the actual blending code next, especially since that table is only 224 bytes right now so I have quite a bit of budget to try larger data-formats there to speed things up with MMX.

WolfWings · Post by **WolfWings** » Mon Jan 12, 2009 5:09 pm

blargg wrote:On the other hand, since you're reading index out of another table, you could just multiply the entries in the other table by 4 (or maybe not, since you're using the table for multiple filters, some using a different factor, like 16).

I'm spoiled by x86 in that regard, power-of-two shifts of indexes into an array up to 8 are free. I'll look into that during the deep-optimization stage once I've got all the blending array's in place for the various filtering techniques. I believe I can afford to unfold the HQ2x table to HQ4x width actually without penalty right now, and may try that shortly.

blargg wrote:I wonder if you can help the compiler optimize the blend+RGBPack operation.

I'll try that, good thinking on that regard. And one of the main reasons I'm posting all my progress up here for others to stare at.

blargg wrote:And I notice the pixels array is never indexed except with constants. You could help the compiler by using a bunch of local variables. GCC might optimize both the same, but some compilers might be more apt to put them in registers if they aren't in an array (more relevant on RISC machines with lots of registers).

Another good idea I'll try, I was using the array mostly for readability to be honest. GCC on AMD64 (yay for 16+16+8 registers) tended to shuffle between registers a lot, few or no memory loads already, but as I said at the get-go my code is nearly reliant on being run through the profile-guided optimization pass before it performs well.

Edit: Helped quite a bit on HQ2x (18 3/8 clock cycles per output pixel), almost no impact on HQ4x (10 clock cycles even per output pixel) so yeah, definately need to pound on HQ4x's blending math at least. Going to try unrolling the math a bit to serialize the writes better.

blargg wrote:Heh, and finally, make use of restrict pointers (or perhaps __restrict, if you're not coding in C99), so that the compiler can know nothing aliases. Will help more on RISC machines where the compiler wants to move memory loads as early as possible to minimize latency, and it must know that a load can be moved before a store without affecting results.

Restrict pointers? Never heard of them before now, but I'll give them a whirl, as they make a lot of sense since I'm basically making the filters as a giant stream-processor that periodically 'hiccups' it's read stream at regular offsets. Adding proper SIMD mneumonics will provide the next major stage of speed boosts I think.

Squall_Leonhart · Post by **Squall_Leonhart** » Tue Jan 13, 2009 4:51 pm

I wonder, if this will benefit LQ2x at all, there are some games where HQ, just doesn't look right...

WolfWings · Post by **WolfWings** » Tue Jan 13, 2009 4:52 pm

And I had a chance to test my current C code against the existing MMX HQ4x pure-assembly code.

I'm not faster.

But that's not actually surprising. It clocks in (when handed a slowly mutating input image to vaguely simulate what an average SNES game may throw at it) at 7-8 clock cycles per output pixel with 200kB of code on my Core 2 Duo. My existing and relatively naive but extra-compact implementation of HQ4x clocks in at around 10-12 clock cycles. Now, if the image is fully static, the pure-assembly HQ4x drops to only 4-5 clock cycles per output pixel since the only 'active' codepaths fit in L1 cache at that point. My code doesn't improve. But at the same time, my code isn't penalized if the input goes slowly-randomized, or even crazy-chaotic-white-noise random where I basically keep up with the pure-assembly HQ4x code in ZSNES now; it clocks in at high 9's to low 10's, I'm around the same.

I'm going to try a bit of a paradigm shift for HQ, as it gives me the best bet at making an easy to MMX version of the code after some aborting attempts to implement one right now: Shifting it from testing 3x3 pixels and spitting back four 2x2-pixel blocks for HQ4x, to testing 4x4 pixels and spitting back a flat 4x4 pixel block based on the innermost 2x2 pixels only.

WolfWings · Post by **WolfWings** » Tue Jan 13, 2009 5:43 pm

Squall_Leonhart wrote:I wonder, if this will benefit LQ2x at all, there are some games where HQ, just doesn't look right...

It may, I haven't looked at LQ at all yet since I'm focussing on the 'hardest to implement fast' filter first.

mozz · Post by **mozz** » Sun Feb 22, 2009 9:34 pm

I know I'm a little late, but its nice to see this thread on the first page again!

Edit: about this code from hq.cpp:

Code: Select all

			newpattern = 0;
			if (pattern & DIFF26) newpattern |= DIFF51;
			if (pattern & DIFF56) newpattern |= DIFF54;
			if (pattern & DIFF86) newpattern |= DIFF57;
			if (pattern & DIFF53) newpattern |= DIFF24;
			if (pattern & DIFF59) newpattern |= DIFF84;

I only looked briefly at the code, so forgive me if I'm missing something obvious. But why not assign the bits so that a single shift and mask can accomplish this, rather than all those branchy-looking tests?
i.e. something like this:

Code: Select all

#define DIFF26 0x001
#define DIFF51 0x002
#define DIFF56 0x004
#define DIFF54 0x008
#define DIFF86 0x010
#define DIFF57 0x020
#define DIFF53 0x040
#define DIFF24 0x080
#define DIFF59 0x100
#define DIFF84 0x200
//...the others..

do {
    newpattern = (pattern << 1) & (DIFF51|DIFF54|DIFF57|DIFF24|DIFF84);
    ...

Maybe some of the other bits (e.g. the lastLineDiffs ones) could be worked in similarly.

----------

Edit2: Okay, I see now the line pattern = RotatePattern(pattern); and I'm guessing thats why the bits were chosen as they are.

I can see I'm going to have to set aside some of my copious free time in order to take another look at these HQ?X algorithms and try and make a *really* small+fast implementation! =P

Edit (monday): Last night, I was again trying to find cheaper represenations for the mapping from diff results to blend numbers. I stumbled across the fact that one of the 12 bits is totally unused (the DIFF86 bit, I suppose). Then I looked in Wolfwings' code and noticed that he put DIFF86 in bit 11 and cut the size of the tree_hq table from 4096 to 2048. Good call =)

wah_wah_69 · Post by **wah_wah_69** » Wed Mar 04, 2009 8:23 pm

¡Greetings!

I also played a bit with the code weeks ago and noticed that the highest value in the tree_hq table was 11 and thus you could pack two values in one byte and made a quick and dirty C program to generate the new values. It also tested against the original table to check if it was generating correct values, it was.

The thing is there's a cost for accesing each nibble which causes this to be slower in my Athlon machine.

I also changed staments as:

if (ExpandedDiff(pixels[5], pixels[3])) pattern |= DIFF53;

To

pattern |= (DIFF53* (ExpandedDiff(pixels[5], pixels[3])));

Where I thought it didn't cause any trouble.
I did managed to spare some cycles with these edits.

Note: I made several edits on this file and have different versions lying around in my machine, I remember removing some edit that caused it to return wrong values, I think the one posted is correct but I haven't checked.

I used to check the md5 of the output image against the one produced by the original code, and maybe I got good results because the test image wasn't using any colour that triggers any error caused by my editing.

Also note that I don't have a deep understanding of how the hq2x algorithm works, as the tables obscure its working in this implementation.

Code: Select all

/* This source code is Copyright (C) 2006-2009 by WolfWings. */
/*      It is, however, released under the ISC license.      */
/*             en.wikipedia.org/wiki/ISC_Licence             */
/* --------------------------------------------------------- */
/* It is a formula-level rederiviation of the HQ2x technique */
/* and is, thus, not a derivative work of the original code, */
/* only the original equations behind the code.              */

#include "hqlib.h"

static const uint8_t blends_2x[12*4] __attribute__ ((section (".rodata.hq2x"))) = {
	 8, 4, 4, 0,
	 8, 4, 0, 4,
	 8, 0, 4, 4,
	16, 0, 0, 0,
	12, 4, 0, 0,
	12, 0, 4, 0,
	12, 0, 0, 4,
	 4, 6, 6, 0,
	12, 2, 2, 0,
	14, 1, 1, 0,
	10, 4, 2, 0,
	10, 2, 4, 0
}; 	

static const uint8_t tree_hq[0x400] __attribute__ ((section (".rodata.hqlib"))) = {
0,34,17,0,0,34,17,8, 
0,34,17,118,0,34,17,136, 
0,90,68,0,0,85,180,0, 
0,90,68,112,0,85,180,0, 
0,34,17,120,0,34,17,104, 
0,34,17,134,0,34,17,102, 
0,90,68,112,0,85,180,0, 
0,90,68,153,0,90,187,153, 
0,34,17,8,0,34,17,136, 
0,34,17,120,0,34,17,134, 
0,85,68,0,0,85,68,0, 
0,85,68,112,0,85,68,112, 
0,34,17,120,0,34,17,134, 
0,34,17,134,0,34,17,102, 
0,85,68,119,0,85,68,0, 
0,85,68,153,0,85,68,153, 
0,34,17,0,0,34,17,8, 
0,34,17,118,0,34,17,136, 
0,85,68,0,0,85,180,0, 
0,85,68,112,0,85,180,0, 
0,34,17,120,0,34,17,104, 
0,34,17,134,0,34,17,102, 
0,85,68,112,0,85,180,0, 
0,85,68,153,0,85,187,153, 
0,34,17,8,0,34,17,136, 
0,34,17,120,0,34,17,134, 
0,85,68,0,0,85,68,0, 
0,85,68,112,0,85,68,112, 
0,34,17,120,0,34,17,134, 
0,34,17,134,0,34,17,102, 
0,85,68,119,0,85,68,0, 
0,85,68,153,0,85,68,153, 
0,34,17,99,0,34,17,54, 
0,34,17,102,0,34,17,102, 
0,90,68,51,0,85,180,51, 
0,90,68,51,0,85,180,51, 
0,34,17,102,0,34,17,102, 
0,34,17,102,0,34,17,102, 
0,90,68,51,0,85,180,51, 
0,90,68,51,0,90,187,51, 
0,34,17,102,0,34,17,102, 
0,34,17,102,0,34,17,102, 
0,85,68,51,0,85,68,51, 
0,85,68,51,0,85,68,51, 
0,34,17,102,0,34,17,102, 
0,34,17,102,0,34,17,102, 
0,85,68,51,0,85,68,51, 
0,85,68,51,0,85,68,51, 
0,34,17,99,0,34,17,54, 
0,34,17,102,0,34,17,102, 
0,85,68,51,0,85,180,51, 
0,85,68,51,0,85,180,51, 
0,34,17,102,0,34,17,102, 
0,34,17,102,0,34,17,102, 
0,85,68,51,0,85,180,51, 
0,85,68,51,0,85,187,51, 
0,34,17,102,0,34,17,102, 
0,34,17,102,0,34,17,102, 
0,85,68,51,0,85,68,51, 
0,85,68,51,0,85,68,51, 
0,34,17,102,0,34,17,102, 
0,34,17,102,0,34,17,102, 
0,85,68,51,0,85,68,51, 
0,85,68,51,0,85,68,51, 
0,34,17,0,0,34,17,8, 
0,34,17,118,0,34,17,136, 
0,90,68,0,0,85,68,0, 
0,90,68,112,0,85,68,0, 
0,34,17,120,0,34,17,104, 
0,34,17,134,0,34,17,102, 
0,90,68,112,0,85,68,0, 
0,90,68,153,0,90,68,153, 
0,34,17,8,0,34,17,136, 
0,34,17,120,0,34,17,134, 
0,85,68,0,0,85,68,0, 
0,85,68,112,0,85,68,112, 
0,34,17,120,0,34,17,134, 
0,34,17,134,0,34,17,102, 
0,85,68,119,0,85,68,0, 
0,85,68,153,0,85,68,153, 
0,34,17,0,0,34,17,8, 
0,34,17,118,0,34,17,136, 
0,85,68,0,0,85,68,0, 
0,85,68,112,0,85,68,0, 
0,34,17,120,0,34,17,104, 
0,34,17,134,0,34,17,102, 
0,85,68,112,0,85,68,0, 
0,85,68,153,0,85,68,153, 
0,34,17,8,0,34,17,136, 
0,34,17,120,0,34,17,134, 
0,85,68,0,0,85,68,0, 
0,85,68,112,0,85,68,112, 
0,34,17,120,0,34,17,134, 
0,34,17,134,0,34,17,102, 
0,85,68,119,0,85,68,0, 
0,85,68,153,0,85,68,153, 
0,34,17,99,0,34,17,54, 
0,34,17,102,0,34,17,102, 
0,90,68,51,0,85,68,51, 
0,90,68,51,0,85,68,51, 
0,34,17,102,0,34,17,102, 
0,34,17,102,0,34,17,102, 
0,90,68,51,0,85,68,51, 
0,90,68,51,0,90,68,51, 
0,34,17,102,0,34,17,102, 
0,34,17,102,0,34,17,102, 
0,85,68,51,0,85,68,51, 
0,85,68,51,0,85,68,51, 
0,34,17,102,0,34,17,102, 
0,34,17,102,0,34,17,102, 
0,85,68,51,0,85,68,51, 
0,85,68,51,0,85,68,51, 
0,34,17,99,0,34,17,54, 
0,34,17,102,0,34,17,102, 
0,85,68,51,0,85,68,51, 
0,85,68,51,0,85,68,51, 
0,34,17,102,0,34,17,102, 
0,34,17,102,0,34,17,102, 
0,85,68,51,0,85,68,51, 
0,85,68,51,0,85,68,51, 
0,34,17,102,0,34,17,102, 
0,34,17,102,0,34,17,102, 
0,85,68,51,0,85,68,51, 
0,85,68,51,0,85,68,51, 
0,34,17,102,0,34,17,102, 
0,34,17,102,0,34,17,102, 
0,85,68,51,0,85,68,51, 
0,85,68,51,0,85,68,51, };

#define RB(y,u,v) (((y)&0x3FF)<<22)|(((u)&0x3FF)<<11)|((v)&0x3FF)
#define GO(y,v) ((((y)+64)&0x3FF)<<22)|(16<<11)|(((v)+16)&0x3FF)
static const uint32_t DiffTable[128] __attribute__ ((section (".rodata.diff"))) = {
RB(-85, 160,  80), RB(-83, 155,  77), RB(-80, 150,  75), RB(-77, 145,  72),
RB(-75, 140,  70), RB(-72, 135,  67), RB(-69, 130,  65), RB(-67, 125,  62),
RB(-64, 120,  60), RB(-61, 115,  57), RB(-59, 110,  55), RB(-56, 105,  52),
RB(-53, 100,  50), RB(-51,  95,  47), RB(-48,  90,  45), RB(-45,  85,  42),
RB(-43,  80,  40), RB(-40,  75,  37), RB(-37,  70,  35), RB(-35,  65,  32),
RB(-32,  60,  30), RB(-29,  55,  27), RB(-27,  50,  25), RB(-24,  45,  22),
RB(-21,  40,  20), RB(-19,  35,  17), RB(-16,  30,  15), RB(-13,  25,  12),
RB(-11,  20,  10), RB( -8,  15,   7), RB( -5,  10,   5), RB( -3,   5,   2),
RB(  0,   0,   0), RB(  2,  -5,  -2), RB(  5, -10,  -5), RB(  7, -15,  -7),
RB( 10, -20, -10), RB( 13, -25, -12), RB( 15, -30, -15), RB( 18, -35, -17),
RB( 21, -40, -20), RB( 23, -45, -22), RB( 26, -50, -25), RB( 29, -55, -27),
RB( 31, -60, -30), RB( 34, -65, -32), RB( 37, -70, -35), RB( 39, -75, -37),
RB( 42, -80, -40), RB( 45, -85, -42), RB( 47, -90, -45), RB( 50, -95, -47),
RB( 53,-100, -50), RB( 55,-105, -52), RB( 58,-110, -55), RB( 61,-115, -57),
RB( 63,-120, -60), RB( 66,-125, -62), RB( 69,-130, -65), RB( 71,-135, -67),
RB( 74,-140, -70), RB( 77,-145, -72), RB( 79,-150, -75), RB( 82,-155, -77),
GO(-84,     -160), GO(-81,     -155), GO(-78,     -150), GO(-76,     -145),
GO(-73,     -140), GO(-70,     -135), GO(-68,     -130), GO(-65,     -125),
GO(-63,     -120), GO(-60,     -115), GO(-57,     -110), GO(-55,     -105),
GO(-52,     -100), GO(-49,      -95), GO(-47,      -90), GO(-44,      -85),
GO(-42,      -80), GO(-39,      -75), GO(-36,      -70), GO(-34,      -65),
GO(-31,      -60), GO(-28,      -55), GO(-26,      -50), GO(-23,      -45),
GO(-21,      -40), GO(-18,      -35), GO(-15,      -30), GO(-13,      -25),
GO(-10,      -20), GO( -7,      -15), GO( -5,      -10), GO( -2,       -5),
GO(  0,        0), GO(  2,        4), GO(  5,        9), GO(  7,       14),
GO( 10,       19), GO( 13,       24), GO( 15,       29), GO( 18,       34),
GO( 21,       39), GO( 23,       44), GO( 26,       49), GO( 28,       54),
GO( 31,       59), GO( 34,       64), GO( 36,       69), GO( 39,       74),
GO( 42,       79), GO( 44,       84), GO( 47,       89), GO( 49,       94),
GO( 52,       99), GO( 55,      104), GO( 57,      109), GO( 60,      114),
GO( 63,      119), GO( 65,      124), GO( 68,      129), GO( 70,      134),
GO( 73,      139), GO( 76,      144), GO( 78,      149), GO( 81,      154)};

__inline__ static int ExpandedDiff (uint32_t c0, uint32_t c1) {
   uint32_t r, g;
   g   = c0;
   g  += 0x4008020;
   g  -= c1;
   r   = DiffTable[(int)((unsigned char)g) +   0];
   //r  ^= (0x3FF << 11);  //2095104
   r  ^= 0x1FF800;
   g >>= 10;
   r  += DiffTable[(int)((unsigned char)g) +   0];
   g >>= 11;
   r  += DiffTable[(int)((unsigned char)g) +  64];
   r  &= 0xE01F03E0;
   return r;
}

#define DIFF56 0x001
#define DIFF52 0x002
#define DIFF54 0x004
#define DIFF58 0x008
#define DIFF53 0x010
#define DIFF51 0x020
#define DIFF57 0x040
#define DIFF59 0x080
#define DIFF26 0x100
#define DIFF24 0x200
#define DIFF84 0x400
#define DIFF86 0x800

#define RotatePattern(x) ((((x) & 0x777) << 1) | (((x) & 0x888) >> 3))

void ProcessHQ2x(const pixel *inbuffer, pixel *outbuffer) {
	unsigned int y, x;
	unsigned int pattern, newpattern;
	uint32_t pixels[10];
	
	for (y = 1; y < 224 - 1; y++) {
		pixels[2] = RGBUnpack(inbuffer[(y*256)+0-256  ]); /* Pixel 2 */
		pixels[3] = RGBUnpack(inbuffer[(y*256)+0-256+1]); /* Pixel 3 */

		pixels[5] = RGBUnpack(inbuffer[(y*256)+0      ]); /* Pixel 5 */
		pixels[6] = RGBUnpack(inbuffer[(y*256)+0    +1]); /* Pixel 6 */

		pixels[8] = RGBUnpack(inbuffer[(y*256)+0+256  ]); /* Pixel 8 */
		pixels[9] = RGBUnpack(inbuffer[(y*256)+0+256+1]); /* Pixel 9 */

		pattern = 0;
		/*
		if (ExpandedDiff(pixels[5], pixels[3])) pattern |= DIFF53;
		if (ExpandedDiff(pixels[5], pixels[6])) pattern |= DIFF56;
		if (ExpandedDiff(pixels[5], pixels[9])) pattern |= DIFF59;
		if (ExpandedDiff(pixels[2], pixels[6])) pattern |= DIFF26;
		if (ExpandedDiff(pixels[8], pixels[6])) pattern |= DIFF86;
		*/
		
		
		pattern |= (DIFF53* (ExpandedDiff(pixels[5], pixels[3])));
		pattern |= (DIFF56* (ExpandedDiff(pixels[5], pixels[6])));
		pattern |= (DIFF59* (ExpandedDiff(pixels[5], pixels[9])));
		pattern |= (DIFF26* (ExpandedDiff(pixels[2], pixels[6])));
		pattern |= (DIFF86* (ExpandedDiff(pixels[8], pixels[6])));
		
		for (x = 1; x < 256 - 1; x++) {
			pixels[1] = pixels[2];
			pixels[2] = pixels[3];
			pixels[3] = RGBUnpack(inbuffer[(y*256)+x-256+1]); /* Pixel 3 */

			pixels[4] = pixels[5];
			pixels[5] = pixels[6];
			pixels[6] = RGBUnpack(inbuffer[(y*256)+x    +1]); /* Pixel 6 */

			pixels[7] = pixels[8];
			pixels[8] = pixels[9];
			pixels[9] = RGBUnpack(inbuffer[(y*256)+x+256+1]); /* Pixel 9 */
	
			newpattern  =      0;
			
			/*if (pattern & DIFF26) newpattern |= DIFF51;
			if (pattern & DIFF56) newpattern |= DIFF54;
			if (pattern & DIFF86) newpattern |= DIFF57;
			if (pattern & DIFF53) newpattern |= DIFF24;
			if (pattern & DIFF59) newpattern |= DIFF84;
			*/
			
			newpattern |= (DIFF51 * ((pattern & DIFF26)>0));
			newpattern |= (DIFF54 * ((pattern & DIFF56)));
			newpattern |= (DIFF57 * ((pattern & DIFF86)>0));
			newpattern |= (DIFF24 * ((pattern & DIFF53)>0));
			newpattern |= (DIFF84 * ((pattern & DIFF59)>0)) ;
			
			                      pattern = newpattern;

			if (ExpandedDiff(pixels[5], pixels[2]))    pattern |= DIFF52;
			if (ExpandedDiff(pixels[5], pixels[3]))    pattern |= DIFF53;
			if (ExpandedDiff(pixels[5], pixels[6]))    pattern |= DIFF56;
			if (ExpandedDiff(pixels[5], pixels[8]))    pattern |= DIFF58;
			if (ExpandedDiff(pixels[5], pixels[9]))    pattern |= DIFF59;
			if (ExpandedDiff(pixels[2], pixels[6]))    pattern |= DIFF26;
			if (ExpandedDiff(pixels[8], pixels[6]))    pattern |= DIFF86;
			
			
		newpattern = ((tree_hq[(pattern& 0x7FF)>>1] >>(4*(!(pattern&1))))&0x0F);
		
			
			pattern = RotatePattern(pattern);
			outbuffer[(y*1024)    +(x*2)  ] =
				RGBPack(((pixels[5] * blends_2x[(newpattern * 4) + 0]) +
				         (pixels[2] * blends_2x[(newpattern * 4) + 1]) +
				         (pixels[4] * blends_2x[(newpattern * 4) + 2]) +
				         (pixels[1] * blends_2x[(newpattern * 4) + 3])) / 16);

		newpattern = (tree_hq[(pattern& 0x7FF)>>1] >>(4*(!(pattern&1))))&0x0F;
			
			pattern = RotatePattern(pattern);
			outbuffer[(y*1024)    +(x*2)+1] =
				RGBPack(((pixels[5] * blends_2x[(newpattern * 4) + 0]) +
				         (pixels[6] * blends_2x[(newpattern * 4) + 1]) +
				         (pixels[2] * blends_2x[(newpattern * 4) + 2]) +
				         (pixels[3] * blends_2x[(newpattern * 4) + 3])) / 16);

		newpattern = (tree_hq[(pattern& 0x7FF)>>1] >>(4*(!(pattern&1))))&0x0F;
		
			pattern = RotatePattern(pattern);
			outbuffer[(y*1024)+512+(x*2)+1] =
				RGBPack(((pixels[5] * blends_2x[(newpattern * 4) + 0]) +
				         (pixels[8] * blends_2x[(newpattern * 4) + 1]) +
				         (pixels[6] * blends_2x[(newpattern * 4) + 2]) +
				         (pixels[9] * blends_2x[(newpattern * 4) + 3])) / 16);

		newpattern = (tree_hq[(pattern& 0x7FF)>>1] >>(4*(!(pattern&1))))&0x0F;
		
			pattern = RotatePattern(pattern);
			outbuffer[(y*1024)+512+(x*2)  ] =
				RGBPack(((pixels[5] * blends_2x[(newpattern * 4) + 0]) +
				         (pixels[4] * blends_2x[(newpattern * 4) + 1]) +
				         (pixels[8] * blends_2x[(newpattern * 4) + 2]) +
				         (pixels[7] * blends_2x[(newpattern * 4) + 3])) / 16);
		}
	}
}

mozz · Post by **mozz** » Sat Mar 07, 2009 3:07 am

wah_wah_69 wrote:I also changed staments as:

if (ExpandedDiff(pixels[5], pixels[3])) pattern |= DIFF53;

To

pattern |= (DIFF53* (ExpandedDiff(pixels[5], pixels[3])));

Where I thought it didn't cause any trouble.
I did managed to spare some cycles with these edits.

Making all of those things branchless is probably a good idea. When there are several of them in a row, you could try writing it with temp variables to break the dependence chain. I'm not sure if it would help though for this kind of code. (Even if it does, the compiler might figure that transformation out on its own, since pattern is a local with no intervening reads).

An update on my equivalent table to tree_hq... the smallest practical encoding I've found for it is somewhere in the ballpark of 256 bytes. If anyone cares, I can spend some time to sort out the details and post it. It requires a different ordering of the DIFF bits from that used in Wolfwings' code (but the bits are still grouped for cheap rotation). Accessing it requires two byte accesses, an AND instruction and a dependent byte access. There's no branch, so its a lot better than my suggestion from a couple years ago (and the table is ~35 bytes smaller too).

But for sheer performance probably nothing can beat that 2KB table: there just aren't enough cache misses to warm up that 2KB table to pay for the cost of my more complicated lookup thingy. I still had some fun deriving it though.

wah_wah_69 · Post by **wah_wah_69** » Sun Mar 08, 2009 3:08 pm

mozz wrote:
wah_wah_69 wrote:I also changed staments as:

if (ExpandedDiff(pixels[5], pixels[3])) pattern |= DIFF53;

To

pattern |= (DIFF53* (ExpandedDiff(pixels[5], pixels[3])));

Where I thought it didn't cause any trouble.
I did managed to spare some cycles with these edits.
Making all of those things branchless is probably a good idea. When there are several of them in a row, you could try writing it with temp variables to break the dependence chain. I'm not sure if it would help though for this kind of code. (Even if it does, the compiler might figure that transformation out on its own, since pattern is a local with no intervening reads).

An update on my equivalent table to tree_hq... the smallest practical encoding I've found for it is somewhere in the ballpark of 256 bytes. If anyone cares, I can spend some time to sort out the details and post it. It requires a different ordering of the DIFF bits from that used in Wolfwings' code (but the bits are still grouped for cheap rotation). Accessing it requires two byte accesses, an AND instruction and a dependent byte access. There's no branch, so its a lot better than my suggestion from a couple years ago (and the table is ~35 bytes smaller too).

But for sheer performance probably nothing can beat that 2KB table: there just aren't enough cache misses to warm up that 2KB table to pay for the cost of my more complicated lookup thingy. I still had some fun deriving it though.

It sounds interesting, please if you have time to spare post your code or at least explain how did you get to shrink the table so much .

Even after the "packing" I did the table still has a lot of repeated values, I thought about using different methods for accesing it, so a smaller table could be used but didn't come up with anything usefull.

mozz · Post by **mozz** » Tue Mar 10, 2009 12:42 am

wah_wah_69 wrote:It sounds interesting, please if you have time to spare post your code or at least explain how did you get to shrink the table so much.

Here is the compressed table I ended up with -- it worked out to 240 bytes. Slightly smaller ones are probably possible, but maybe not worth the effort.

If you want a bit of a puzzle, try and figure it out how it works without any further description...

Code: Select all

// This lookup table REQUIRES input patterns to have the following bit ordering.
#define DIFF86 0x001
#define DIFF26 0x002
#define DIFF42 0x004
#define DIFF84 0x008
#define DIFF57 0x010
#define DIFF59 0x020
#define DIFF53 0x040
#define DIFF51 0x080
#define DIFF54 0x100
#define DIFF58 0x200
#define DIFF56 0x400
#define DIFF52 0x800

// This ordering rotates in the same direction as the ordering used in WolfWings'
// code, so his RotatePattern macro should still work.

const U8 g_CrazyTable[128+112] = {
     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,20,20, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,28,
     0, 0, 0, 0,10,10,10,10, 0, 0, 0, 0,26,10, 2,26,17,17,17,17,26,27,26,10,25, 1, 0,25,26,27, 2,27,
     5, 5, 5, 5,13,13, 2, 2, 5, 5, 5, 5,13,13,10,10, 5, 5, 5, 5,13,13, 2, 2, 5, 5, 5, 5,13,13, 2, 2,
     1, 1, 4, 4,33,29,97,52, 1, 1, 4, 4,80,69,36,36, 7, 7,11,11,45,84,81,32, 0, 0, 4, 3,53,68,36,48,
     7, 1, 4,15, 2, 0, 4, 7, 1, 1,14, 8, 2, 3, 4, 0, 1, 1, 4, 2, 2, 0, 4,19, 1, 1, 4, 2, 2, 6, 4,17,
     9, 5,21,17, 5,12,21,17,13,11,21,17,13, 9,21,21, 9, 9,21,21, 6,12,21,17,18,18,21,21,13,12,21,17,
     5, 9,22,21,16,12,10,17,13,13,21,21,10,17,10,17, 5, 5,21,21,10,17,10,17,17, 5,17,21,10,17,10,17,
    12, 5,17,21,16,12,10,17,12,11,17,21,20,17,20,17
};

inline U8 LookupBlend(int pattern)
{
	int row = (pattern >> 6);
	U8 mask = g_CrazyTable[row];              // byte load
	U8 offset = g_CrazyTable[64+row];         // byte load
	pattern >>= 1;
	pattern &= mask;
	return g_CrazyTable[128+offset+pattern];  // dependent byte load
}

In order to use this table, you would also need to know the identities of the blends (all bytes after offset 128 in g_CrazyTable are blend numbers from this list):

Code: Select all

//     |  In HQ2X:  |        In HQ3X:       |                   In HQ4X:
// ----+------------+-----------------------+-----------------------------------------------
//  0  | PIXEL00_20 | PIXEL00_2,  PIXEL01_1 | PIXEL00_20, PIXEL01_60, PIXEL10_60, PIXEL11_70
//  1  | PIXEL00_22 | PIXEL00_1M, PIXEL01_C | PIXEL00_80, PIXEL01_10, PIXEL10_61, PIXEL11_30
//  2  | PIXEL00_11 | PIXEL00_1L, PIXEL01_C | PIXEL00_81, PIXEL01_31, PIXEL10_81, PIXEL11_31
//  3  | PIXEL00_21 | PIXEL00_1M, PIXEL01_1 | PIXEL00_80, PIXEL01_61, PIXEL10_10, PIXEL11_30
//  4  | PIXEL00_12 | PIXEL00_1U, PIXEL01_1 | PIXEL00_82, PIXEL01_82, PIXEL10_32, PIXEL11_32
//  5  | PIXEL00_20 | PIXEL00_4,  PIXEL01_3 | PIXEL00_50, PIXEL01_50, PIXEL10_50, PIXEL11_0
//  6  | PIXEL00_90 | PIXEL00_5,  PIXEL01_6 | PIXEL00_50, PIXEL01_83, PIXEL10_21, PIXEL11_70
//  7  | PIXEL00_22 | PIXEL00_1M, PIXEL01_3 | PIXEL00_80, PIXEL01_10, PIXEL10_61, PIXEL11_30
//  8  | PIXEL00_60 | PIXEL00_2,  PIXEL01_6 | PIXEL00_12, PIXEL01_14, PIXEL10_81, PIXEL11_31
//  9  | PIXEL00_20 | PIXEL00_4,  PIXEL01_C | PIXEL00_50, PIXEL01_50, PIXEL10_50, PIXEL11_0
// 10  | PIXEL00_10 | PIXEL00_1M, PIXEL01_3 | PIXEL00_80, PIXEL01_10, PIXEL10_10, PIXEL11_30
// 11  | PIXEL00_90 | PIXEL00_5,  PIXEL01_1 | PIXEL00_50, PIXEL01_21, PIXEL10_83, PIXEL11_70
// 12  | PIXEL00_70 | PIXEL00_2,  PIXEL01_C | PIXEL00_20, PIXEL01_12, PIXEL10_11, PIXEL11_0
// 13  | PIXEL00_100| PIXEL00_2,  PIXEL01_C | PIXEL00_20, PIXEL01_0,  PIXEL10_0,  PIXEL11_0
// 14  | PIXEL00_61 | PIXEL00_2,  PIXEL01_1 | PIXEL00_11, PIXEL01_82, PIXEL10_13, PIXEL11_32
// 15  | PIXEL00_11 | PIXEL00_1L, PIXEL01_3 | PIXEL00_81, PIXEL01_31, PIXEL10_81, PIXEL11_31
// 16  | PIXEL00_70 | PIXEL00_2,  PIXEL01_3 | PIXEL00_20, PIXEL01_12, PIXEL10_11, PIXEL11_0
// 17  | PIXEL00_10 | PIXEL00_1M, PIXEL01_C | PIXEL00_80, PIXEL01_10, PIXEL10_10, PIXEL11_30
// 18  | PIXEL00_100| PIXEL00_2,  PIXEL01_3 | PIXEL00_20, PIXEL01_0,  PIXEL10_0,  PIXEL11_0
// 19  | PIXEL00_22 | PIXEL00_1M, PIXEL01_1 | PIXEL00_80, PIXEL01_10, PIXEL10_61, PIXEL11_30
// 20  | PIXEL00_10 | PIXEL00_1M, PIXEL01_1 | PIXEL00_80, PIXEL01_10, PIXEL10_10, PIXEL11_30
// 21  | PIXEL00_ 0 | PIXEL00_C,  PIXEL01_C | PIXEL00_0,  PIXEL01_0,  PIXEL10_0,  PIXEL11_0
// 22  | PIXEL00_ 0 | PIXEL00_C,  PIXEL01_3 | PIXEL00_0,  PIXEL01_0,  PIXEL10_0,  PIXEL11_0

I used a program to perform the first few steps on 768 different permutations of the pattern bits, and for various line sizes, before settling on this particular permutation and a line size of 32 (after folding out the DIFF86 bit which has no effect on the result).

Here is some source code with a test routine for the table.

In the comments, it includes a step-by-step explanation of how this table was derived.

wah_wah_69 · Post by **wah_wah_69** » Fri Mar 13, 2009 3:54 pm

Awesome work, I think I get most of it but I have yet to give it a deeper look.

KaOSoFt · Post by **KaOSoFt** » Sun Jul 12, 2009 4:06 pm

What has happened to this project? It was progressing smoothly.

Post by **grinvader** » Sun Jul 12, 2009 6:32 pm

It looks pretty much complete, don't you think ?

creaothceann · Post by **creaothceann** » Mon Jul 13, 2009 3:47 am

See also this thread - the filter is now in bsnes.

byuu · Post by **byuu** » Mon Jul 13, 2009 7:52 pm

I used a slightly different approach.

First, I wasn't worried about L1 so I have a 128kb LUT for the YUV difference function, courtesy of blargg. I'm not convinced L1 would make up for the speed hit of doing that with pure math.

Code: Select all

bool diff(uint32_t x, uint16_t y) { return ((x - yuvTable[y]) & diff_mask); }

Second, I modified four rules slightly to create a perfectly circular algorithm. The results are infinitesimally different in mine (~6 pixels per frame difference, of those who could even tell, they preferred mine anyway.)

That allowed me to reduce the algorithm otherwise to a 256-byte table and a blend function. It could be packed further for more of a speed hit, or expanded for less.

On the plus side, mine seems to get roughly the same framerate. Identical on 64-bit Linux, ~4fps slower on 32-bit Windows.

But yeah, different goals. I just hated how long it took HQnX to compile from source. The original switch table was serious overkill.

KaOSoFt · Post by **KaOSoFt** » Tue Jul 14, 2009 2:51 am

creaothceann wrote:See also this thread - the filter is now in bsnes.

Is it? I see no mention of this algorithm in there, but it's a good reference, nevertheless. Thank you.

EDIT: Nevermind, after I posted I saw byuu's message.

ZSNES board

Progress on reducing HQ2x to fit in L1 cache

Q

Re: Q