Progress on reducing HQ2x to fit in L1 cache

WolfWings · Post by **WolfWings** » Thu Jul 27, 2006 9:49 pm

mozz wrote:My thought for HQ3X was that it looks like most of the handlers could be divided into "two halves" (of varying size). E.g. if there are a bunch of handlers that set the top, left, and upper left pixels to the same thing, that would be done in one "half" and the other 5 pixels could be done in the other "half" (the center pixel is always the same in HQ3X).

Even simpler: The top and upper-left pixels are the only ones that are 'unique' to HQ3x. (Had a chance to run over the data today.) Rotate them, don't mirror them, and they match the other six pixels if you rotate the 'pattern' as well.

Again, the entire HQ system is rotationally symetric, this is one of the reasons it works so well at detecting 1:2 ratio lines, because when one rotation of the filter will detect it, the other side usually won't, so it ends up smoothly detecting the 1:2 ratio lines.

And I've already 'unfolded' those accursed "if Diff()" statements into the main switch-statement-equivilant, because those difference-tests have to be done eventually for the next pixel horizontally regardless. So instead of testing it once possibly, then testing it for sure on the very next pixel I just test it once every single time and bit-shift the test for the next pixel.

So each pixel I perform 7 Diff() tests, and bit-shift 5 older Diff() tests into position for a re-formatted layout of the pattern variable as I call it, with tests layed out in 5<->2, 5<->4, 5<->8, 5<->6 order in the low bits, and continuing on to the higher bits with the diagonals, then the cross-diagonals. This lets me move from one 'pixel' I'm calculating to the next with a simple bit-shuffle.

mozz · Post by **mozz** » Fri Jul 28, 2006 12:36 am

WolfWings wrote:And I've already 'unfolded' those accursed "if Diff()" statements into the main switch-statement-equivilant, because those difference-tests have to be done eventually for the next pixel horizontally regardless.

Wow! I did not realize that (I haven't made a real effort to understand the filter patterns yet). I bet that makes it a lot easier to make a data-driven implementation.

..though the variety of different blends being used still seems easier to express through code. Have you tried to compare the performance of your data-driven version to the original unrolled code versions?

WolfWings · Post by **WolfWings** » Fri Jul 28, 2006 2:30 am

mozz wrote:
WolfWings wrote:And I've already 'unfolded' those accursed "if Diff()" statements into the main switch-statement-equivilant, because those difference-tests have to be done eventually for the next pixel horizontally regardless.
Wow! I did not realize that (I haven't made a real effort to understand the filter patterns yet). I bet that makes it a lot easier to make a data-driven implementation.

..though the variety of different blends being used still seems easier to express through code. Have you tried to compare the performance of your data-driven version to the original unrolled code versions?

I've actually mostly abandoned the microcoded blending system for unrolled versions. I didn't originally think I could pack the 'pattern detection' tables down far enough, and developed a way to express all the possible blends used in the entire HQ?? system as a single 16-bit-integer value per pixel, so I could avoid having a 'jump prediction' error anywhere in the pattern-detection code, just a memory-read and fixed code-path.

Basically, I managed to make a system where I believe I could pack HQ2x at least in a non-MMX implementation down to under 0.5k total. That was when I realized I had a LOT more breathing room than I initially thought, and have been optimizing based on Pentium MMX/Pentium 3/Athlon XP series targets since then. The existing MMX blends are hard to beat with MMX code, though I can actually beat them on Athlon 64 and Pentium 4 with non-MMX versions using ((A+B)/2) equivilant sequences since a lot of MMX operations take 1 or 2 cycles to finish, but on AMD64 and P4 basic math instructions can take between 1/2 and 1/4 of a cycle per.

franpa · Post by **franpa** » Fri Jul 28, 2006 3:43 am

you seem to have ended your last paragraph a little early.

WolfWings · Post by **WolfWings** » Fri Jul 28, 2006 6:47 am

franpa wrote:you seem to have ended your last paragraph a little early.

How so?

I'm comparing the clock-cycle-latency of MMX instructions on P4/Athlon 64 systems to basic ALU instructions on the same chips. I state that MMX instructions take 1-2 cycles to finish, but basic math instructions take between 1/2 and 1/4 of a cycle per. I honestly don't know how I can make that last sentence any more verbose. On P4/AMD64, between two and four simple add, subtract, bit-shift, or moves can be performed at the same time, this the 1/2-1/4 of a cycle per instruction that I'm referencing.

franpa · Post by **franpa** » Fri Jul 28, 2006 8:37 am

can take between 1/2 and 1/4 of a cycle per.

that part at your end.... isnt there meant to be a word after the word per?

mozz · Post by **mozz** » Fri Jul 28, 2006 3:54 pm

So I wrote programs to extract the data from the reference C implementations of HQ2X/HQ3X/HQ4X, and merged the results into one table. The 12 digits on the left identify each combination of diff outcomes, and the rest of the columns are the blends used for each output pixel:

Code: Select all

"123467894286|   HQ2X    |         HQ3X          |                     HQ4X                       ",
"555555552648|00 01 10 11|00 01 02|10 12|20 21 22|00 01 02 03|10 11 12 13|20 21 22 23|30 31 32 33 ",
"------------+-----+-----+--------+-----+--------+-----------+-----------+-----------+------------",
"000000000000|20 20|20 20| 2  1  2| 1  1| 2  1  2|20 60 60 20|60 70 70 60|60 70 70 60|20 60 60 20 ",
"100000000000|20 20|20 20| 2  1  2| 1  1| 2  1  2|20 60 60 20|60 70 70 60|60 70 70 60|20 60 60 20 ",
"010000000000|22 21|20 20|1M  C 1M| 1  1| 2  1  2|80 10 10 80|61 30 30 61|60 70 70 60|20 60 60 20 ",
"110000000000|11 21|20 20|1L  C 1M| 1  1| 2  1  2|81 31 10 80|81 31 30 61|60 70 70 60|20 60 60 20 ",
"001000000000|20 20|20 20| 2  1  2| 1  1| 2  1  2|20 60 60 20|60 70 70 60|60 70 70 60|20 60 60 20 ",
"101000000000|20 20|20 20| 2  1  2| 1  1| 2  1  2|20 60 60 20|60 70 70 60|60 70 70 60|20 60 60 20 ",
"011000000000|22 12|20 20|1M  C 1R| 1  1| 2  1  2|80 10 32 82|61 30 32 82|60 70 70 60|20 60 60 20 ",
"111000000000|11 12|20 20|1L  C 1R| 1  1| 2  1  2|81 31 32 82|81 31 32 82|60 70 70 60|20 60 60 20 ",
...4088 more lines...

Since it was incidental to what I was doing anyway, I wrote a little Java program that uses this table to prove WolfWings' assertion that the algorithms are all rotationally symmetric (which they are, of course). I am now experimenting with different ways to represent all this data in a smallish table. WolfWings has already solved this problem elegantly for HQ2X in his microcoded version, but I want to try and derive something interesting on my own (maybe borrowing an idea or two

). I haven't seen his microcoded solution for HQ3X yet, so I can experiment on that without being influenced much by his version.

[Edit: by "represent all this data" I really meant only one quarter of it: 00 for HQ2X, 00/01 for HQ3X and 00/01/10/11 for HQ4X.]

WolfWings wrote:developed a way to express all the possible blends used in the entire HQ?? system as a single 16-bit-integer value per pixel

Interesting. That may be overkill, though -- there are only 30 unique blends used by the HQ?? algorithms (HQ2X uses 12 of them and HQ3X uses only 9... HQ4X uses 25 of them though). A 4- or 5-bit value can uniquely identify the blend being used, and you could have a table of 30 or so entries in whatever shape is most convenient (packed MMX data with the coefficients for each blend, code address for a custom routine, whatever) and it wouldn't take much space. I have no doubt you've given this stuff a lot of thought, however =)

WolfWings · Post by **WolfWings** » Sat Jul 29, 2006 5:22 am

franpa wrote:
can take between 1/2 and 1/4 of a cycle per.
that part at your end.... isnt there meant to be a word after the word per?

Um... for the third time, no. The entire sentence is discussing how long most instructions take in clock cycles. Including that part. It's valid english, I've checked with two books at this point and an English teacher. I'm sorry I don't keep my English to an 8th-grade reading level like most books recommend.

mozz wrote:
WolfWings wrote:developed a way to express all the possible blends used in the entire HQ?? system as a single 16-bit-integer value per pixel
Interesting. That may be overkill, though -- there are only 30 unique blends used by the HQ?? algorithms (HQ2X uses 12 of them and HQ3X uses only 9... HQ4X uses 25 of them though). A 4- or 5-bit value can uniquely identify the blend being used, and you could have a table of 30 or so entries in whatever shape is most convenient (packed MMX data with the coefficients for each blend, code address for a custom routine, whatever) and it wouldn't take much space. I have no doubt you've given this stuff a lot of thought, however =)

Like I said, I determined it was overkill. I was using every 'trick' I knew of to try to pack things down to determine a 'smallest possible' size before I started trading larger size for speed. It's how I usually optimize things, compact it as far as possible, THEN unpack it one section at a time to speed it up. It's easier than selectively compacting it I've found because it's easy to see memory-using ways to 'unfold' code, it can be harder to figure out memory-saving ways to compress code.

And the attempt was to remove extra memory references from the code by directly storing the 16-bit value that contained the full blend operation code inside for the pixel. I.E. The actual value was broken into 2-bit 'opcodes' that began with 'load pixel 0-3' and the rest were mostly 'average with pixel 0-2' and a special-case opcode to handle 'done' and 'store and average with pixel 0' and 'average with stored value' specifically for my implementation. It only had one jump per loop unless a special-case instruction was used, though it did have one memory load per loop as well. For most blends, it only had to reference the relevant pixels for input once each so it did remove redundant memory references completely. And that handled well over 90% of real-world activity for HQ2x.

It ended up being fairly effecient, but with the breathing room I've got now I can do better.

Unfortunately, I've tried for a'while now to do use 'unpacked MMX blending values' but I couldn't find an acceptable way to chew up that much space usually, or handle the code complexity it added having to unpack pixels to two seperate formats. Easier to keep things as simple adds and shifts, then mask and 're-pack' pixels at the very end since I already have to unpack all the pixels for my 'Diff' subroutine approach.

BTW, shuffle the bits for the difference-pair orderings you're using: 51, 52, and 54 by themselves determine the 00 pixel for HQ2x or the 00/01/10/11 pixels for HQ4x more than 60% of the time by themselves.

Deathlike2 · Post by **Deathlike2** » Sat Jul 29, 2006 5:34 am

WolfWings wrote:
franpa wrote:
can take between 1/2 and 1/4 of a cycle per.
that part at your end.... isnt there meant to be a word after the word per?
Um... for the third time, no. The entire sentence is discussing how long most instructions take in clock cycles. Including that part. It's valid english, I've checked with two books at this point and an English teacher. I'm sorry I don't keep my English to an 8th-grade reading level like most books recommend.

Free free to ignore franpa, reading comprehension is... lacking at times.

WolfWings · Post by **WolfWings** » Sat Jul 29, 2006 6:10 am

Deathlike2 wrote:Free free to ignore franpa, reading comprehension is... lacking at times.

Fair enough. I didn't want to exactly outright blow him off the first time. But the second time... well, I tried being nice.

Now to get back to 'work' and get the HQ code up to speed.

franpa · Post by **franpa** » Sat Jul 29, 2006 6:46 am

Deathlike2 wrote:
WolfWings wrote:
franpa wrote:
can take between 1/2 and 1/4 of a cycle per.
that part at your end.... isnt there meant to be a word after the word per?
Um... for the third time, no. The entire sentence is discussing how long most instructions take in clock cycles. Including that part. It's valid english, I've checked with two books at this point and an English teacher. I'm sorry I don't keep my English to an 8th-grade reading level like most books recommend.
Free free to ignore franpa, reading comprehension is... lacking at times.

lol... free free and sorry wolf wings

WolfWings · Post by **WolfWings** » Sat Jul 29, 2006 8:27 am

EDIT: Removed uselessly-slower MMX code. Proof that MMX is not a Silver Bullet for high-speed graphics code.

mozz · Post by **mozz** » Sat Jul 29, 2006 8:40 am

I pursued the "use a small, unique blend number" approach for a while. I managed to come up with a 35+256 = 291 byte table that can turn a 12-bit diff index into a 5-bit blend number, using the following algorithm:

Code: Select all

int lookupBlend(int diffIndex) {
    int result = rowIndex[diffIndex>>4];  // rowIndex is 256 bytes
    if (result >= 8) {
        int mask = 8-(result>>5);  // TODO: dumb subtraction here..
        result = rowData[(result&31) + (diffIndex & mask)];
    }
    return result;
}

One upside of this approach is that the same 291-byte table and lookup routine can be shared by all the algorithms: HQ2X, HQ3X, HQ4X.

However, once I have the 5-bit blend number, I still have to do something with it. Each algorithm will probably need a table of 23 entries (the number of blends) which are either code addresses, or data to directly control the blend. Another idea: I could replace the 35-byte table with a 35-dword or 35-qword table, and stick something larger like blend coefficients in there. I'd have to tack on a few more entries to handle the rowIndex[] values that are returned directly because they are less than 8. Still, the reduction from 4096 indices to essentially 35+6, is very satisfying.

[Edit: you are right about the order of the diff bits, too... to use the table I describe the 12-bit index must be constructed in a particular AAAABBBBCCCC order, like this:

Code: Select all

    // Diff(c[2],c[4]) is bit 0
    "24","26","68","48", U-L, U-R, D-R, D-L  - cross-diagonals
    "15","35","59","57", C-UL, C-UR, C-DR, C-DL  - center with corners
    "45","25","56","58", C-L, C-U, C-R, C-D  - center with sides

This is by far the most compressible ordering I've found, because the cross-diagonals [the ones from the if-tests in the original alg] rarely affect the blend number (and usually only one or two of them when they are), so with the aid of the 3-bit mask value in the rowIndex[] bytes, the 16-entry rows are very cheap to represent in rowData[], needing 2 to 4 bytes each:

Code: Select all

	// Mask Ofs     0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
	//           +---+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
	// 0010   0  | 17,    0,                                      | // 17,17, 0, 0,17,17, 0, 0,17,17, 0, 0,17,17, 0, 0,
	// 0010   1  |    18,    4,                                   | // 18,18, 4, 4,18,18, 4, 4,18,18, 4, 4,18,18, 4, 4,
	// 0001   4  |              5, 6,                             | //  5, 6, 5, 6, 5, 6, 5, 6, 5, 6, 5, 6, 5, 6, 5, 6,
	// 0001   6  |                   11, 6,                       | // etc.
	// 0001   8  |                         12, 6,                 |
	// 0001  10  |                               13, 6,           |
	// 0001  12  |                                     14, 6,     |
	// 0001  14  |                                           14, 7|
	//           +------------------------------------------------+
	// Mask Ofs    16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31|32 33 34
	//           +---+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
	// 0011  16  | 22,16,10, 7,                                   |        | // 22,16,10,7,22,16,10,7,22,16,10,7,22,16,10,7,
	// 0010  17  |    16,    7,                                   |        | // 16,16, 7,7,16,16, 7,7,16,16, 7,7,16,16, 7,7,
	// 0001  18  |       10, 7,                                   |        |
	// 0001  20  |             11, 7,                             |        |
	// 0011  22  |                   14, 9, 8, 6,                 |        |
	// 0001  24  |                          8, 6,                 |        |
	// 1000  26  |                               21,..............|...... 2| // 21,21,21,21,21,21,21,21,2,2,2,2,2,2,2,2,
	// 0010  27  |                                  15,    7,     |        |
	// 0001  28  |                                      5, 7,     |        |
	// 0010  30  |                                           19,  | 4,     |
	// 0010  31  |                                              20|    0,  |
	//           +---------------------------------------------------------+

Edit here's a collapsed version of the table:

Code: Select all

	public static final int[] rowData = {
		17,18, 0, 4, 5, 6,11, 6,12, 6,13, 6,14, 6,14, 7,
		22,16,10, 7,11, 7,14, 9, 8, 6,21,15, 5, 7,19,20,
		 4, 0, 2
	};

And I guess it would be kind of hard to use the stuff above without knowing which blends are supposed to be which:

Code: Select all

    HQ2X  HQ3X       HQ4X
     00   00 01   00 01 10 11
   +----+-------+-------------+
 0 | 11 | 1L  C | 81 31 81 31 | // here's the rowIndex table:
 1 | 20 |  2  1 | 20 60 60 70 |  public static final int[] rowIndex = {                              
 2 | 12 | 1U  1 | 82 82 32 32 |      1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,                                
 3 | 21 | 1M  1 | 80 61 10 30 |      3,2,3,2,3,2,3,2,3,2,3,2,3,2,3,2,                                
 4 | 22 | 1M  C | 80 10 61 30 |      4,0,4,0,4,0,4,0,4,0,4,0,4,0,4,0,                                
 5 | 90 |  5  6 | 50 83 21 70 |      238,236,252,228,238,236,252,228,244,230,242,234,244,230,242,234,
 6 |  0 |  C  C |  0  0  0  0 |      1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,                                
 7 | 10 | 1M  C | 80 10 10 30 |      3,2,3,2,3,2,3,2,3,2,3,2,3,2,3,2,                                
 8 | 20 |  4  C | 50 50 50  0 |      222,223,222,223,193,0,193,0,222,223,222,223,193,0,193,0,        
 9 |  0 |  C  3 |  0  0  0  0 |      248,236,209,248,242,236,176,248,242,236,209,234,242,230,219,234,
10 | 70 |  2  C | 20 12 11  0 |      1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,                                
11 | 90 |  5  1 | 50 21 83 70 |      3,26,3,26,3,2,3,2,3,26,3,26,3,2,3,2,                            
12 | TX |  2  3 | 20  0  0  0 |      4,0,4,0,4,0,4,0,4,0,4,0,4,0,4,0,                                
13 | TX |  2  C | 20  0  0  0 |      236,236,242,236,242,236,242,228,7,236,7,234,242,236,7,234,      
14 | 20 |  4  3 | 50 50 50  0 |      1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,                                
15 | 10 | 1M  1 | 80 10 10 30 |      3,2,3,2,3,2,3,2,3,2,3,26,3,2,3,2,                               
16 | 10 | 1M  3 | 80 10 10 30 |      222,0,222,192,4,0,222,0,4,0,222,223,4,0,222,0,                  
17 | 11 | 1L  3 | 81 31 81 31 |      242,236,176,248,242,236,209,182,242,236,209,232,7,236,209,234,  
18 | 22 | 1M  1 | 80 10 61 30 |  };                                                                  
19 | 22 | 1M  3 | 80 10 61 30 |
20 | 60 |  2  6 | 12 14 81 31 |
21 | 61 |  2  1 | 11 82 13 32 | NOTE: TX is my abbreviation for 
22 | 70 |  2  3 | 20 12 11  0 |  the blend '100' in the original
   +----+-------+-------------+

Edit: just to clarify, across all the HQ?X algorithms there are a total of 30 different blends used to produce a single pixel---HQ2X only uses 9 of them, HQ4X uses 25 of them. However, the blends appear only in 23 combinations which is what I am referring to as a 'blend' in the stuff above. IIRC, there are 16 unique combinations used by HQ3X and 14 by HQ4X. Which means if you write one blending routine per combination to compute all the pixels in one "corner" of the HQ?X algorithms, and then run 4 routines per source pixel, you will need to have 9 of these routines for HQ2X, 16 for HQ3X and 14 for HQ4X. Using the shared tables described above, I would then dispatch to those routines through a 23-entry table.

mozz · Post by **mozz** » Sat Jul 29, 2006 3:15 pm

WolfWings wrote:My only real limit is scheduling the bit-shifts for the old P1/PMMX chips as everything after that has out-of-order execution so as long as I leave dependancies alone long enough the CPU's can unstall pipes themselves safely.

The other thing to keep in mind is the 4-1-1 decoder template on PPro through P3 architectures. They might support out-of-order execution, but instruction decoding can be a bottleneck if you throw a bunch of instructions at it that it can only decode one at a time. It looks like your code is mostly 1-uOp instructions so it should be no problem in this case.

DataPath · Post by **DataPath** » Sat Jul 29, 2006 6:08 pm

WolfWings wrote:
franpa wrote:
can take between 1/2 and 1/4 of a cycle per.
that part at your end.... isnt there meant to be a word after the word per?
Um... for the third time, no. The entire sentence is discussing how long most instructions take in clock cycles. Including that part. It's valid english, I've checked with two books at this point and an English teacher. I'm sorry I don't keep my English to an 8th-grade reading level like most books recommend.

Elliptical clauses were 8th grade english for me, so you still win.

An elliptical clause is a clause in which some words have been left out. Because of the pattern or logic of the entire sentence, it is clear what the missing words are.

An elliptical clause may be either independent or subordinate.

Example: Jessica had five dollars; Monica, three.
(The verb had was dropped from the second clause, but the meaning is still clear.)

WolfWings · Post by **WolfWings** » Sat Jul 29, 2006 8:56 pm

mozz wrote:
WolfWings wrote:My only real limit is scheduling the bit-shifts for the old P1/PMMX chips as everything after that has out-of-order execution so as long as I leave dependancies alone long enough the CPU's can unstall pipes themselves safely.
The other thing to keep in mind is the 4-1-1 decoder template on PPro through P3 architectures. They might support out-of-order execution, but instruction decoding can be a bottleneck if you throw a bunch of instructions at it that it can only decode one at a time. It looks like your code is mostly 1-uOp instructions so it should be no problem in this case.

That's also somewhat my point, I'm removing the multiplies from the code for the same reason to avoid using anything but 1-uOp codes anywhere. This should allow it to run four instructions per cycle on Pentium 4 and Athlon 64 both since there's enough simple additions to take advantage of the LEA to stuff some of the math into the AGU instead of the ALU to get a 'fourth' computational unit on the Athlon 64. So it should be a win on all platforms. Also, note the above code has a minor flaw, I had a badly transcribed table that didn't list bit-shifts as being limited to the SAME pipe as multiplies, so I had things organized badly there. I'll be taking the shift/multiply-only-in-pipe-0 limit into account on the next version I'm working on today.

And finally... might as well share my 'super-compressed pattern->blend' code for HQ2x at least. Note how I take advantage of inverting the bit-ordering you're using to make the bit-mask and offset directly addressable. This isn't my most recent code, it's just what I have on my web-server linked to from an early post: http://wolfwings.us/hq/00micro.c

Basic idea was to load the four pixels that are actually used for this test into the pixel[0-3] array before triggering the comparison. Not the best solution looking back, but, again, compression at the cost of everything else.

Now I'd say just jump to fixed routines for each pixel from a unified 'pattern matching' table, with the table of actual routine offsets being per-pixel.

Code: Select all

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define DIFF56 0x001
#define DIFF52 0x002
#define DIFF54 0x004
#define DIFF58 0x008
#define DIFF53 0x010
#define DIFF51 0x020
#define DIFF57 0x040
#define DIFF59 0x080
#define DIFF26 0x100
#define DIFF24 0x200
#define DIFF84 0x400
#define DIFF86 0x800

#define PIXEL1 3
#define PIXEL2 1
#define PIXEL3 3
#define PIXEL4 2
#define PIXEL5 0
#define PIXEL6 2
#define PIXEL7 3
#define PIXEL8 1
#define PIXEL9 3

#define RotatePattern(x) ((((x) & 0x777) << 1) | (((x) & 0x888) >> 3))

/* pixels[3] pixels[1] pixels[3] */
/* pixels[2] pixels[0] pixels[2] */
/* pixels[3] pixels[1] pixels[3] */
/* The specific corners used depend on the rotational step of the process being used. */

/*
 * The interlinking in this array is somewhat complicated.
 * Specifically, it masks off the lowest 4 bits of the pattern,
 *	and loads that byte initially.
 * If that 'loaded' byte is 0, it short-circuits out of this loop
 * Next, mask off (pattern >> 4) with the pointed-to byte in the
 *	array. This masking-off value allows redundant patterns
 *	to be stored in smaller areas instead of taking up a full
 *	16 bytes per pattern-step.
 * It then loads the pointed-to byte AFTER the masking byte. If
 *	this byte is above 15, is does the above one more time
 *	though starting with (pattern >> 8) to drop the first 2
 *	pseudo-tupples from the sequence now.
 */
static unsigned char hq2x[] = {
            0,  0, 84, 76, 88, 88, 67, 33,  0,  0, 84,109, 80, 92, 50, 16,
/* 16*/	 15,148,148,140,140,148,  6,140,152,148,  6,140,140,  6,  6,140,152,
/* 33*/	 15,140,  6,140,140,148,  6,140,152,148,148,140,140,148,  6,144,152,
/* 50*/	 15,140,148,140,140,  6,  6,140,152,148,148,140,144,148,  6,140,152,
/* 67*/	  7,132,136,140,144,136,148,144,152,
/* 76*/	 10,  2, ~0,128,
/* 80*/	                 10,  1, ~0,126,
/* 84*/	                                  2,  2, ~0,  5,
/* 88*/	                                                  2,  1, ~0,  4,
/* 92*/	 15,  1,  1,  4,  4,  1,  1,  4,126,  1,  1,  4,  4,  1,  1,  4,  4,
/*109*/	 15,  2,  2,  5,  5,  2,  2,  5,128,  2,  2,  5,  5,  2,  2,  5,  5,
/*126*/	  4, 11,
/*128*/	          1, 10,  5,  4,
/*132*/	  2,  0, ~0,  6,
/*136*/	  2,  7, ~0,  6,
/*140*/	  2,  0, ~0,  3,
/*144*/	  2,  7, ~0,  3,
/*148*/	  2,  8, ~0,  6,
/*152*/	  2,  9, ~0,  3,
};

/*
 * Microcoded blending script format
 * The first tupple is the p[] value loaded into the
 *	accumulator. Note this does NOT subtract 1
 *	from the value of which p[] index to load,
 *	and is the ONLY way to read the p[3] slot.
 *	This is by design.
 * All further tupples prefix 0 as an 'extra action'
 *	indicator.
 * 0,0 means 'return accumulator'
 * 0,1 means 'return accumulator'
 * 0,2 means 'average accumulator with memory'
 * 0,3 means 'store accumulator to memory
 *            THEN
 *            average accumulator with p[0]'
 * 1   means 'average accumulator with p[0]'
 * 2   means 'average accumulator with p[1]'
 * 3   means 'average accumulator with p[2]'
 * This allows for up to eight opcodes per short int,
 *	and allows for all 'special' checks to occur
 *	as a side-effect of the existing bit-masking
 *	on x86 at least, reducing code-code and also
 *	reducing instruction-count.
 */
#define blendPack(a,b,c,d,e,f,g,h) (a|(b<<2)|(c<<4)|(d<<6)|(e<<8)|(f<<10)|(g<<12)|(h<<14))
static unsigned short int blends[] = {
	blendPack(2,2,1,0,0,0,0,0),
	blendPack(3,2,1,0,0,0,0,0),
	blendPack(3,3,1,0,0,0,0,0),
	blendPack(0,0,0,0,0,0,0,0),
	blendPack(1,1,1,0,0,0,0,0),
	blendPack(2,1,1,0,0,0,0,0),
	blendPack(3,1,1,0,0,0,0,0),
	blendPack(2,2,0,3,0,2,0,0),
	blendPack(2,2,1,1,0,0,0,0),
	blendPack(2,2,1,1,1,0,0,0),
	blendPack(0,3,2,1,0,0,0,0),
	blendPack(0,2,3,1,0,0,0,0),
};

unsigned int HQ2x_00(int pattern, unsigned int *p) {

	unsigned int blendPointer; /* The only variable that needs to live between the two code-blocks */

	{ /* Descend the hq2x[] database to determine which blend to use */
		int packIndex = hq2x[pattern & 15];
		if (packIndex) {
			unsigned char patternMask = (pattern >> 4);
			packIndex = hq2x[(patternMask & hq2x[packIndex]) + packIndex + 1];
			if (packIndex & 0xF0)
				packIndex = hq2x[((patternMask >> 4) & hq2x[packIndex]) + packIndex + 1];
		}
		blendPointer = blends[packIndex];
	}

	{ /* Run the blending microcode */
		unsigned int accum, memory;
		unsigned int l;

		accum = p[blendPointer & 3];
		blendPointer >>= 2;
		do {
			l = blendPointer;
			blendPointer >>= 2;
			if (!(l &= 3)) {
				if (!(blendPointer & 2))
					return accum;
				l = blendPointer;
				blendPointer >>= 2;
				if (!(l &= 1)) {
					l = memory;
					goto MergeAvg;
				}
				memory = accum;
				/* l already equals 1, short-circuit the code to save a tupple */
			}
			l = p[l - 1];
			/* zSNES internal pixel format is 0RRRRRGGGGGBBBBB */
			/* Bitmask created to prevent inter-field math errors */
			/* Equation sourced from "Magic Algorithms" at www.aggregate.org */
MergeAvg:		accum = (((accum & l) << 1) + ((accum ^ l) & 0x7BDF)) >> 1;
		} while (1);
	}
}

mozz · Post by **mozz** » Sun Jul 30, 2006 1:02 pm

Assume we are computing '*', and previous pixels are arranged like this:

Code: Select all

             J K L
             X *

+-----------------------------------------+-------------------------------+
|      o#o       ooo       ooo       oo#  |       ooo                     |
|  *24 #oo  =K57 o#o  =J68 oo#  =X35 o#o  |   *59 o#o  this one is new    |
|      ooo       #oo       o#o       ooo  |       oo#                     |
+-----------------------------------------+-------------------------------+
|      o#o       ooo       ooo            |       ooo       ooo           |
|  *26 oo#  =K59 o#o  =L48 #oo            |   *57 o#o  =X68 oo#           |
|      ooo       oo#       o#o            |       #oo       o#o           |
+-----------------------------------------+-------------------------------+
|      ooo                                |       ooo       ooo           |
|  *68 oo#  this one is new               |   *45 ##o  =X56 o##           |
|      o#o                                |       ooo       ooo           |
+-----------------------------------------+-------------------------------+
|      ooo       ooo                      |       o#o       ooo           |
|  *48 #oo  =X59 o#o                      |   *25 o#o  =K58 o#o           |
|      o#o       oo#                      |       ooo       o#o           |
+-----------------------------------------+-------------------------------+
|      #oo       ooo       ooo       o#o  |       ooo                     |
|  *15 o#o  =K48 #oo  =J59 o#o  =X26 oo#  |   *56 o##  this one is new    |
|      ooo       o#o       oo#       ooo  |       ooo                     |
+-----------------------------------------+-------------------------------+
|      oo#       ooo       ooo            |       ooo                     |
|  *35 o#o  =K68 oo#  =L57 o#o            |   *58 o#o  this one is new    |
|      ooo       o#o       #oo            |       o#o                     |
+-----------------------------------------+-------------------------------+

If you only have the results from X available, it can only supply 5 bits and you end up calculating 7 diffs for this pixel, 3 of which you have calculated before (in the previous row).

Suppose instead you had the results from X (to your left) and K (directly above you). Now you are in a position to re-use 8 previous results, and have to calculate only 4 new diffs for this pixel. The cost is, you must maintain a buffer of bytes the size of one scanline, to hold the old diff results from the line above. Each byte in this buffer must store at least the results *59, *68 and *58 for that pixel, so that K59, K68 and K58 will be available when we get to the next line (*57 and *48 could also be useful but those results can also be gotten later from the X pixel). Depending on the pixels above you may also affect the order in which you handle edges (if you require the source bitmap to have duplicate pixels around the edges it should be fine, but I would prefer not to do that, so I will probably have extra loop(s) that fill in the edge pixels).

Edit: ...on the other hand, it sounds like your diff tests are becoming pretty lightweight. Is it worth it to try and re-use the 3 extra previous results?

WolfWings · Post by **WolfWings** » Sun Jul 30, 2006 9:10 pm

mozz wrote:Edit: ...on the other hand, it sounds like your diff tests are becoming pretty lightweight. Is it worth it to try and re-use the 3 extra previous results?

Well right now since I've "ditched" MMX for the diff test (cyclic logic at it's finest due to sudden realization) I've got things down to: (Pseudo-code.)

diff=x-y+constant (working in unpacked math, which non-MMX blends can as well)
RVal = tinytable[lowbyte(diff) + 32]
GVal = tinytable[lowbyte(diff >> 10) + 96]
BVal = tinytable[lowbyte(diff >> 21) + 160]
diff = (((RVal + GVal) & constant) + BVal) & constant

I'd been trying too hard to keep all the values byte-aligned to work with MMX, before I realized I could re-use the 'guard bits' trick to pack three 10-bit values into one 32-bit register. This ends up only being 15 distinct operations, all of which are one-cycle on P1, PMMX, PPro, P2, and P3. And the only pieces that don't "pair" in either pipe are the two bit-shifts. Also for the final 'true diff' calculation, you can add the *Val's in any order, the constant indicated there is just resetting the 'guard bits' to allow for full 10-bit precision.

And after testing, trying to avoid the memory-lookup for the tinytable (768 bytes) turned out to not be worth it generally. At best, it was slightly ahead on newer machines that could chew through 3 or 4 operations at a time. At worst (P1/PMMX/PPro/P2/P3) it would be one clock cycle behind and chew up both pipes instead of only one for the aligned memory load.

VietBitter · Post by **VietBitter** » Sun Jul 30, 2006 10:29 pm

A little late but here goes..

Pentium 3 800

Ticks: 16737511473
Ticks: 16970093899
Ticks: 17022090973
Ticks: 18204778139
Ticks: 17354570600
Ticks: 16366045831
Ticks: 17567403476
Ticks: 17431249919
Ticks: 16267264478
Ticks: 16421056128
Ticks: 16927345830
Ticks: 19831238183
Ticks: 17160059838
Ticks: 17193770300
Ticks: 17017544974
Ticks: 16566450418

WolfWings · Post by **WolfWings** » Mon Jul 31, 2006 8:08 am

Code: Select all

#define RB(y,u,v) (((y)&0x3FF)<<22)|(((u)&0x3FF)<<11)|((v)&0x3FF)
#define GO(y,v) ((((y)+64)&0x3FF)<<22)|(16<<11)|(((v)+16)&0x3FF)
unsigned int DiffTable[128] = {
RB(-85, 160,  85), RB(-82, 155,  82), RB(-80, 150,  80), RB(-77, 145,  77),
RB(-74, 140,  74), RB(-72, 135,  72), RB(-69, 130,  69), RB(-66, 125,  66),
RB(-64, 120,  64), RB(-61, 115,  61), RB(-58, 110,  58), RB(-56, 105,  56),
RB(-53, 100,  53), RB(-50,  95,  50), RB(-48,  90,  48), RB(-45,  85,  45),
RB(-42,  80,  42), RB(-40,  75,  40), RB(-37,  70,  37), RB(-34,  65,  34),
RB(-32,  60,  32), RB(-29,  55,  29), RB(-26,  50,  24), RB(-24,  45,  22),
RB(-21,  40,  19), RB(-18,  35,  17), RB(-16,  30,  15), RB(-13,  25,  12),
RB(-10,  20,  10), RB( -8,  15,   7), RB( -5,  10,   5), RB( -2,   5,   2),
RB(  0,   0,   0), RB(  2,  -5,  -2), RB(  5, -10,  -5), RB(  8, -15,  -8),
RB( 10, -20, -10), RB( 13, -25, -13), RB( 16, -30, -15), RB( 18, -35, -17),
RB( 21, -40, -20), RB( 24, -45, -23), RB( 26, -50, -25), RB( 29, -55, -29),
RB( 32, -60, -32), RB( 34, -65, -34), RB( 37, -70, -37), RB( 40, -75, -40),
RB( 42, -80, -42), RB( 45, -85, -45), RB( 48, -90, -48), RB( 50, -95, -50),
RB( 53,-100, -53), RB( 56,-105, -56), RB( 58,-110, -58), RB( 61,-115, -61),
RB( 64,-120, -64), RB( 66,-125, -66), RB( 69,-130, -69), RB( 72,-135, -72),
RB( 74,-140, -74), RB( 77,-145, -77), RB( 80,-150, -80), RB( 82,-155, -82),
GO(-85,     -160), GO(-82,     -155), GO(-80,     -150), GO(-77,     -145),
GO(-74,     -140), GO(-72,     -135), GO(-69,     -130), GO(-66,     -125),
GO(-64,     -120), GO(-61,     -115), GO(-58,     -110), GO(-56,     -105),
GO(-53,     -100), GO(-50,      -95), GO(-48,      -90), GO(-45,      -85),
GO(-42,      -80), GO(-40,      -75), GO(-37,      -70), GO(-34,      -65),
GO(-32,      -60), GO(-29,      -55), GO(-26,      -50), GO(-24,      -45),
GO(-21,      -40), GO(-18,      -35), GO(-16,      -30), GO(-13,      -25),
GO(-10,      -20), GO( -8,      -15), GO( -5,      -10), GO( -2,       -5),
GO(  0,        0), GO(  2,        5), GO(  5,       10), GO(  8,       15),
GO( 10,       19), GO( 13,       25), GO( 15,       30), GO( 18,       35),
GO( 21,       40), GO( 23,       45), GO( 26,       49), GO( 29,       55),
GO( 32,       60), GO( 34,       65), GO( 37,       70), GO( 40,       75),
GO( 42,       80), GO( 45,       85), GO( 48,       90), GO( 50,       95),
GO( 53,      100), GO( 56,      105), GO( 58,      110), GO( 61,      115),
GO( 64,      120), GO( 66,      125), GO( 69,      130), GO( 72,      135),
GO( 74,      140), GO( 77,      145), GO( 80,      150), GO( 82,      155),
};

unsigned int RGBUnpack(unsigned int x) {
	return (x * 0x10001) & 0x03E07C1F;
}

/* Alternate construction skips the 0x8020 constant, but requires */
/* adc slot,0 before slot can be used each time. Possible to gain */
/* speed due to increased pairing possibilities on newer machines */
/* but almost impossible to represent in C code.                  */
/* This is possible because MOVZX/MOVSX don't affect any flags, & */
/* the >>= punts the last bit into the carry slot. Very unobvious */
/* optimization, and one of the few that still applies on the new */
/* CPU's unlike the convoluted AA-related stuff of yore.          */
int ExpandedDiff(unsigned int c0, unsigned int c1) {
	unsigned int diff, acum, slot;
	                                /* Pentium */
	diff   = c0                   ; /* U/p0    */
	diff  -= c1                   ; /* U/p0    */
	slot   = ((  signed char)diff); /* U/p0    */
	diff  += 0x8020               ; /* V/p1    */
	diff >>= 10                   ; /* U/p0    */
	acum   = DiffTable[slot +  32]; /* V/p2    */
	slot   = ((unsigned char)diff); /* U/p0    */
	acum  ^= (0x3FF << 11)        ; /* V/p1    */
	diff >>= 11                   ; /* U/p0    */
	acum  += DiffTable[slot      ]; /* V/p1+p2 */
	slot   = ((  signed char)diff); /* U/p0    */
	acum  += DiffTable[slot +  96]; /* U/p0+p2 */
	return acum & RB(-128,-32,-32); /* bitmask */
}

WolfWings · Post by **WolfWings** » Thu Aug 03, 2006 8:29 pm

...I've possibly figured out a way to remove that 'Guard Bit' masking constant from the math. I haven't finished re-tuning the tables yet to verify I can get identical results to the original code yet, but the basic math works, and it removes one non-pairing instruction from the finale-sequence of the code. Final non-MMX version with a single 512-byte lookup table, two shifts and four add/subs will be 12 instructions long for the core math with 'testing' the value being a simple bitmask so the normal 'test' instruction will work fine.

WolfWings · Post by **WolfWings** » Fri Aug 04, 2006 7:20 am

New technique ended up being able to re-use the previous tables with a couple minor adjustments. Updated previous post with newest Diff() replacement code.

Final estimated clock-cycle-count for P1 through P3 is 8 cycles per difference, with three non-paired instructions (one with a memory load) so the code will interleave well with sequential tests. Marked up the code to show the Pipe (P1/MMX) or individual execution port (PPro/P2/P3) that the code will run through.

Newer platforms like PM, P4, and AMD64 have higher latencies so the code takes more clock cycles though inlining two or more of the functions should help a lot when the C compiler can interleave two Diff's at once on newer CPU's. Hell, on AMD64 all five difference tests can be run in parallel at the same time, solving all latency issues entirely. Yay for having 16 registers.

mozz · Post by **mozz** » Fri Aug 04, 2006 2:42 pm

The way your DiffTable is constructed is pretty neat. I've been trying to independently derive your method (or something similar) from the reference C implementation of Diff() in hq2x.cpp, in incremental steps.

First I worked out the following RGBtoYUV function which is bit-identical to the 64K table in hq2x.cpp:

Code: Select all

inline int RGBtoYUV(unsigned short w1) {
    int i = (w1>>11)&31;  int r = i << 3;
    int j = (w1>>5)&63;   int g = j << 2;
    int k = w1 & 31;      int b = k << 3;

    int Y,u,V;
    /* stepA */ Y = r>>2; u = 128+(r>>2); v = 128+ (-r >> 3);
    /* stepB */ Y += g>>2;                v += (2*g) >> 3;
    /* stepC */ Y += b>>2; u += (-b >> 2); v += (-b >> 3);
    
    return (Y<<16) + (u<<8) + v;
}

This calculation could be done by a table of size similar to yours, because stepA, stepB and stepC each reference one color only. So in each step you use its particular color value as an index and look up a word containing deltas for Y, u and v for that step and add them to your running total. Of course the layout of Y, u and v within the output words can be adjusted to whatever bit ranges you want and you could add enough positive bias to each field in the entries for one of the steps, to ensure that each subfield has a positive value by the end (so it doesn't end up borrowing bits from other subfields). So far so good.

The next step is to take in two RGB colors and compute the difference of their Y, u, v results instead of converting them separately. Basically you start by subtracting the RGB components of the two colors, i.e. the i,j,k and r,g,b values can now be positive or negative. Double the index ranges of each section of the table, and remove constants (like 128) from the calculation above because they cancel out, and adjust the bias values to keep everything positive again.

...But now I run into difficulty. Once I have some offset-shifted values for Y, u and v I have to compare their absolute values to the thresholds. And when you include zero, there is an odd number of values that are 'within' the threshold. And even if it was not an odd number, it wouldn't be a power of two, would it?

What you want to be able to do is apply a single bitmask test, such that any Yuv result outside any of the three thresholds will contain a bit where it isn't supposed to, but every Yuv result within all three thresholds will not contain any bits in those places.

And as far as I can tell, you solved this by using tables that have non-contiguous values in them such that the range you need to test actually IS a power of two and fits into a mask and so on.

How did you come up with the calculation for that? A lot of experimentation? Was getting the power of two ranges your goal, or did you stumble on it while looking for something else?

WolfWings · Post by **WolfWings** » Fri Aug 04, 2006 9:21 pm

mozz wrote:The way your DiffTable is constructed is pretty neat. I've been trying to independently derive your method (or something similar) from the reference C implementation of Diff() in hq2x.cpp, in incremental steps.

Cool, I'll explain the differences in my approach to yours as we go then.

mozz wrote:First I worked out the following RGBtoYUV function which is bit-identical to the 64K table in hq2x.cpp:

Code: Select all

inline int RGBtoYUV(unsigned short w1) {
    int i = (w1>>11)&31;  int r = i << 3;
    int j = (w1>>5)&63;   int g = j << 2;
    int k = w1 & 31;      int b = k << 3;

    int Y,u,V;
    /* stepA */ Y  = r>>2; u = 128+(r>>2); v = 128+ (-r >> 3);
    /* stepB */ Y += g>>2;                 v += (2*g) >> 3;
    /* stepC */ Y += b>>2; u += (-b >> 2); v += (-b >> 3);
    
    return (Y<<16) + (u<<8) + v;
}

Here's a simplification, and the real key to my approach because it dramatically reduces the range of values required. zSNES pixels are 0-31 for R, G, and B first off. This means the lowest 3 bits are ALWAYS zero for R, G, and B. This allows for this reduction:

Code: Select all

inline int RGBtoYUV(unsigned short w1) {
    int i = (w1>>11)&31;  int r = i;
    int j = (w1>>6)&31;   int g = j;
    int k = w1 & 31;      int b = k;

    int Y,u,V;
    /* stepA */ Y  = r; u  = 128+(r); v  = 128+(-r);
    /* stepB */ Y += g;               v += g<<1;
    /* stepC */ Y += b; u +=    (-b); v += (-b);

    return (Y<<17) + (u<<9) + v;
}

Note I didn't remove constants here. Once you do that, you'll have values between -124 and 124 (for v), -62 and 62 (u), and -93 and 93 (Y) for the final results. These fit inside of Signed Bytes, ergo why I tried to implement an MMX version using byte-aligned tables. It ended up being a false lead.

mozz wrote:This calculation could be done by a table of size similar to yours, because stepA, stepB and stepC each reference one color only. So in each step you use its particular color value as an index and look up a word containing deltas for Y, u and v for that step and add them to your running total. Of course the layout of Y, u and v within the output words can be adjusted to whatever bit ranges you want and you could add enough positive bias to each field in the entries for one of the steps, to ensure that each subfield has a positive value by the end (so it doesn't end up borrowing bits from other subfields). So far so good.

You've figured out most of my approach so far. The real 'trick' is that I expanded each calculation segment to 11 bits, a couple of the high bits of which are ignored. So I have a range of -1024-1023 to perform the math inside of, allow for floating-point multipliers of up to 8.0 to expand the range to one where the final tests will fall on powers of two.

mozz wrote:The next step is to take in two RGB colors and compute the difference of their Y, u, v results instead of converting them separately. Basically you start by subtracting the RGB components of the two colors, i.e. the i,j,k and r,g,b values can now be positive or negative. Double the index ranges of each section of the table, and remove constants (like 128) from the calculation above because they cancel out, and adjust the bias values to keep everything positive again.

...But now I run into difficulty. Once I have some offset-shifted values for Y, u and v I have to compare their absolute values to the thresholds. And when you include zero, there is an odd number of values that are 'within' the threshold. And even if it was not an odd number, it wouldn't be a power of two, would it?

This is where exhaustive testing comes in. I tried ignoring the 'odd result' that was available, and tuned the lookup tables to return correct results compared to the original code.

mozz wrote:What you want to be able to do is apply a single bitmask test, such that any Yuv result outside any of the three thresholds will contain a bit where it isn't supposed to, but every Yuv result within all three thresholds will not contain any bits in those places.

And as far as I can tell, you solved this by using tables that have non-contiguous values in them such that the range you need to test actually IS a power of two and fits into a mask and so on.

How did you come up with the calculation for that? A lot of experimentation? Was getting the power of two ranges your goal, or did you stumble on it while looking for something else?

Well the basic 'math' for the multiplier was simple: The test for Y was (for example) 24 out of 96. Ignoring the maximum value obtainable except to make sure my multiplier keeps it under 1024, (X*4)/3 (x1.333) converts the test into 32 out of 128. Increasing this by four, I get 128 out of 512, a very convenient value range to test for. The final lookup tables still require minor hand-tuning, but the hand-tuning is very straight-forward with the code I've built because it shows if a value is too high or low, and in fact I could completely automate the process but it takes all of three minutes with partial automation.

And as you probably figured out by now, yes, the entire goal of this was to expand the comparisons to a power-of-two, using 'fuzzy' math to give back equivilant results.

mozz · Post by **mozz** » Sat Aug 05, 2006 1:25 am

I'm on the road this weekend. In the car I realized that a bitmask of -8 would catch any value outside the -7..+7 range (for example). Rather than biasing all values to be positive, leave them centered around zero and the masking is easy.

Given that, it should be no problem to scale the ranges up to *something* that is close enough to a power of two. E.g. for x in the range -31..+31 and the threshold |x|>6, you can just multiply by 10. The values are then in the -310..+310 range, and 6*10 & -64 is 0, and 7*10 & -64 is nonzero. Etc. [Edit or multiply by 5 instead of 10..]

Then there are further tricks you mentioned such as discarding low bits which aren't actually needed to get correct answers.

[Edit: a mask of-64 would hit any negative value. Oops. But you can use a positive bias of +64 and a mask of 64+128 for the expected result. The positive bias does not cause difficulties as long as you don't mess with any of the low bits---the "zero value" should have zeros in the low bits. Since you know how many lookups are done in each half of the table (1 or 2), you can build (all of, or half of) the positive bias into each entry in that region.

Given the number of bits that are available and what the thresholds are, you could allocate more bits to the two fields that need to be multiplied the most (by 5 and 21, or whatever) and fewer bits to the one with a threshold of |x|>7. You might be able to avoid the need for hand-tuning ala adjustY that way, I'm not sure.

One thing I would like is for my version to support all 6-bit values of green, not just the top 5 bits. I want a bit-perfect implementation that works for any RRRGGGGGGBBBBB source.]

Edit: currently I am using this mapping to get Y,u,v values that cross a power-of-two boundary at the right places for the threshold check:

Code: Select all

newY = (oldY*21 + 1024)>>2;  // 11-bit field, mask=0x600, Ymin=-727, Ymax=1231
newU = (oldU*17 + 128)>>3;   // 8-bit field, mask=0xE0, Umin=-116, Umax=146
newV = (oldV*5 + 32)>>1;     // 9-bit field, mask=0x1E0, Vmin=-455, Vmax=480

I assume whichever two are packed highest in the table entry will need guard bits below them (i.e. I didn't actually check but I don't think any of these 3 mappings can tolerate a borrow from a field below it).
But it will fit in a 32-bit word with some room to spare (11+1+8+1+9 = 30).

One good thing about this mapping is I didn't need to manually fix up any entries. If you want to get rid of the manual fixups for Y, I suggest suggest maybe giving Y*21/8 a try instead of Y*8/3.

ZSNES board

Progress on reducing HQ2x to fit in L1 cache

Okay...

Calculating only 4 new diffs instead of 7

Re: Calculating only 4 new diffs instead of 7

Newest Diff code...

Further progress, albeit minor now...