Progress on reducing HQ2x to fit in L1 cache

WolfWings · Post by **WolfWings** » Wed Jul 19, 2006 12:03 am

I've been researching alternate methods of implementing the HQ system, initially the HQ2x variant specifically. The goal of this research has been to reduce the total footprint for code and data for the filter to be inside of 8 kilobytes so as to keep the lookup tables and code paths completely inside of L1 cache on all MMX and above processers.

So far this research has born fruit, in the form of a pixel art scaling toolkit I had posted on-line for general use briefly though I've since taken it down. 7 difference-checks and 4 arbitrary blends take a total of 86 clock cycles without any MMX used yet, allowing for 21 clock cycles per output pixel while only using ~4.5kb of code and data total.

Noxious Ninja · Post by **Noxious Ninja** » Wed Jul 19, 2006 1:13 am

WolfWings wrote:Also, if people can run the MMX Difference code on their own computers on as many different systems as possible, I'd appreciate it. I only have an Athlon 64 to test on, and that can be heavilly lopsided for testing purposes. Especially I'd appreciate results from earlier CPU's.

What are you compiling with? MinGW-GCC 3.4.5 gives me

Code: Select all

C:\src>gcc -Os MMXdiff.c -o MMXdiff.exe
MMXdiff.c: In function `ExpandedDiff':
MMXdiff.c:55: error: incompatible types in assignment
MMXdiff.c:56: error: incompatible types in assignment
MMXdiff.c:57: error: incompatible types in assignment
MMXdiff.c:58: error: incompatible types in assignment

And 4.1.1

Code: Select all

C:\src>gcc -Os -march=pentium-m MMXdiff.c -o MMXdiff.exe
C:\DOCUME~1\Mitchell\LOCALS~1\Temp/ccu0baaa.o:MMXdiff.c:(.text+0xe8): undefined reference to `random'
C:\DOCUME~1\Mitchell\LOCALS~1\Temp/ccu0baaa.o:MMXdiff.c:(.text+0xf8): undefined reference to `random'
C:\DOCUME~1\Mitchell\LOCALS~1\Temp/ccu0baaa.o:MMXdiff.c:(.text+0x134): undefined reference to `random'
C:\DOCUME~1\Mitchell\LOCALS~1\Temp/ccu0baaa.o:MMXdiff.c:(.text+0x144): undefined reference to `random'

Post by **Nach** » Wed Jul 19, 2006 2:20 am

random() is a BSD thing, you'll need *nix to compile it, not Windows.

WolfWings · Post by **WolfWings** » Wed Jul 19, 2006 2:23 am

I was mistaken, I thought random() was pretty cross-platform. Modified the code on the web-server, replacing it with rand() runs equivilant code, just might not be as random. Though since I'm not seeding the random-number generator with the clock or something similair it's relatively moot.

Also, yeah, only testing with GCC 4.x (4.1.1 specifically) since 3.x has really poor MMX intrinsic support by comparison. 4.x allows for much cleaner-looking code to compile. I'd need to go through all sorts of hoops that would make the code easier to write in raw assembly to begin with in 3.x, compared to the relatively neat and readable code I can use in 4.x right now aside from the lack of MOVD support to avoid redundant stack usage.

Noxious Ninja · Post by **Noxious Ninja** » Thu Jul 20, 2006 12:28 am

Alright. Pentium M 730 (Dothan, 1.6 GHz) w/ 32/32 KB L1 data/code cache. Compiled with MinGW-GCC 4.1.1.

gcc -Os -march=pentium-m MMXdiff.c -o MMXdiff.exe

Code: Select all

Ticks:      -1124683956 Right:       1073741824 Wrong:                0
Ticks:      -1271181139 Right:       1073741824 Wrong:                0
Ticks:      -2101298327 Right:       1073741824 Wrong:                0
Ticks:        407186843 Right:       1073741824 Wrong:                0
Ticks:       1018030989 Right:       1073741824 Wrong:                0
Ticks:       1161470693 Right:       1073741824 Wrong:                0
Ticks:       -636712411 Right:       1073741824 Wrong:                0
Ticks:       2092338431 Right:       1073741824 Wrong:                0
Ticks:       1054051752 Right:       1073741824 Wrong:                0
Ticks:       -884377318 Right:       1073741824 Wrong:                0
Ticks:        561283507 Right:       1073741824 Wrong:                0
Ticks:      -1176038292 Right:       1073741824 Wrong:                0
Ticks:       1677575927 Right:       1073741824 Wrong:                0
Ticks:       -580191728 Right:       1073741824 Wrong:                0
Ticks:       1777804351 Right:       1073741824 Wrong:                0
Ticks:        192927273 Right:       1073741824 Wrong:                0

I've got an Athlon XP and a Pentium 3 at home, I'll run it on those.

WolfWings · Post by **WolfWings** » Thu Jul 20, 2006 4:07 am

...bug in the blasted code there. Thanks for the test result, installed a full 32-bit Linux install to track down the bug that I wasn't seeing initially running the code in 32-bit mode on my 64-bit OS install. It wasn't displaying the results properly on non-64bit machines due to a mistake in my code. I uploaded a fixed version of the code that also shows a running average estimated clock-cycle count a couple of minutes ago. I also moved the timing checks inside the calculation subroutine so it will get a much more accurate reading of the clock cycles taken for the actual 'subtraction' and MMX ops.

Poobah · Post by **Poobah** » Thu Jul 20, 2006 11:40 am

I compiled it using MinGW with GCC 4.1.1. I use a Pentium 4 "Nocona" CPU @ 3.0 GHz. I only had time for the first nine "laps".

Code: Select all

Ticks:        635047103  Right: 1073741824  Wrong:          0  Average:    -3
Ticks:       2109085812  Right: 1073741824  Wrong:          0  Average:    -2
Ticks:      -1853395371  Right: 1073741824  Wrong:          0  Average:    -1
Ticks:       1536322843  Right: 1073741824  Wrong:          0  Average:    -2
Ticks:        969834710  Right: 1073741824  Wrong:          0  Average:    -3
Ticks:        725194335  Right: 1073741824  Wrong:          0  Average:    -3
Ticks:       1200218382  Right: 1073741824  Wrong:          0  Average:    -2
Ticks:        807501412  Right: 1073741824  Wrong:          0  Average:    -3
Ticks:        696416873  Right: 1073741824  Wrong:          0  Average:    -3

byuu · Post by **byuu** » Thu Jul 20, 2006 7:07 pm

Offtopic note: seeding is usually done with srand(time());

But there's plenty of other ways to do it too, of course :)

WolfWings · Post by **WolfWings** » Thu Jul 20, 2006 7:43 pm

Poobah wrote:I compiled it using MinGW with GCC 4.1.1. I use a Pentium 4 "Nocona" CPU @ 3.0 GHz. I only had time for the first nine "laps".

Code: Select all

Ticks:        635047103  Right: 1073741824  Wrong:          0  Average:    -3
Ticks:       2109085812  Right: 1073741824  Wrong:          0  Average:    -2
Ticks:      -1853395371  Right: 1073741824  Wrong:          0  Average:    -1
Ticks:       1536322843  Right: 1073741824  Wrong:          0  Average:    -2
Ticks:        969834710  Right: 1073741824  Wrong:          0  Average:    -3
Ticks:        725194335  Right: 1073741824  Wrong:          0  Average:    -3
Ticks:       1200218382  Right: 1073741824  Wrong:          0  Average:    -2
Ticks:        807501412  Right: 1073741824  Wrong:          0  Average:    -3
Ticks:        696416873  Right: 1073741824  Wrong:          0  Average:    -3

That... shouldn't happen. Meh. Okay, try this version instead. http://wolfwings.us/hq/timing.c Skips all the fancy tests, I've accepted the math works, so it removes the right/wrong stuff and just tries to record the ticks accurately over a huge block of the code being run repeatedly.

Poobah · Post by **Poobah** » Thu Jul 20, 2006 10:31 pm

Code: Select all

Ticks:       1932806547
Ticks:       1899956382
Ticks:       1929284420
Ticks:       2080034802
Ticks:       2074461807
Ticks:       2109035840
Ticks:       2098081857
Ticks:       1883020288
Ticks:       1882983830
Ticks:      -2132344564
Ticks:       1996767124
Ticks:      -2098670569
Ticks:       2038388383
Ticks:       1910984104
Ticks:       1929814685
Ticks:       1883466837

WolfWings · Post by **WolfWings** » Fri Jul 21, 2006 5:37 am

byuu wrote:Offtopic note: seeding is usually done with srand(time());

But there's plenty of other ways to do it too, of course

In this case, truly randomized sequences weren't really needed, the random numbers were only being used to try to 'poison' read-ahead in caches to rudely simulate the chaotic 'all over the place' nature of most video-game images since they don't do lots of smooth gradiants.

And I still can't figure out why the frelling hell the code's not getting proper 64-bit RDTSC values. =-.-= It should not physically be possibly to get a negative result like it's turning out unless it's only doing the math in 32-bit mode for some reason, and I haven't been able to duplicate that situation with the timing.c code. :-/

Fuck it, just throw 'time ./timing' if you're on a *NIX and lemme know what that says, that'll at least give a reasonable response regardless of the mis-gauged clock-ticks.

paulguy · Post by **paulguy** » Sun Jul 23, 2006 5:31 am

if someone can supply a windows binary i'd be glad to test it on this garbage amd k6-2 500Mhz processor which would be a stress test for it i guess.

Poobah · Post by **Poobah** » Sun Jul 23, 2006 11:42 am

Here you go. Compiled with -march=k6-2 for your convenience.

paulguy · Post by **paulguy** » Mon Jul 24, 2006 12:26 am

thianks but I'll have to wait until the weekend when i can get to that ubercrap of a computer.

kevman · Post by **kevman** » Mon Jul 24, 2006 1:59 am

Northwood Celeron, 3020Mhz.

Ticks: -219943712
Ticks: 706787624
Ticks: 1760269224
Ticks: 239852700
Ticks: 23413652
Ticks: 106703928
Ticks: 31305052
Ticks: -179077004
Ticks: -140565660
Ticks: 617559996
Ticks: 1016521984
Ticks: 1007000948
Ticks: 735528072
Ticks: 324340236
Ticks: 314756832
Ticks: 267699984

Yay?

DOLLS (J) [!] · Post by **DOLLS (J) [!]** » Mon Jul 24, 2006 9:19 am

Edit: Rerun using DMV27's binary.
Athlon XP Barton Core @ 2.0 GHz

Code: Select all

Ticks:      12274106949
Ticks:      12472828512
Ticks:      11549340469
Ticks:      11431195209
Ticks:      11411786656
Ticks:      11417993042
Ticks:      11419195704
Ticks:      11385457613
Ticks:      11341841106
Ticks:      11392726554
Ticks:      11387325977
Ticks:      11463281467
Ticks:      11411806859
Ticks:      11387981304
Ticks:      11488935477
Ticks:      11370416010

DMV27 · Post by **DMV27** » Tue Jul 25, 2006 5:51 am

WolfWings wrote:And I still can't figure out why the frelling hell the code's not getting proper 64-bit RDTSC values. =-.-= It should not physically be possibly to get a negative result like it's turning out unless it's only doing the math in 32-bit mode for some reason, and I haven't been able to duplicate that situation with the timing.c code. :-/

The RDTSC code is correct, but the printf "%16lli" code does not work properly on Win32. I fixed it by using a double instead of a uint64. The code and a Win32 EXE compiled with GCC 3.4.5 are at http://rapidshare.de/files/26915721/timing.zip.html. I also added some CPUID opcodes for more accurate timing.

Duron @ 850 MHz

Code: Select all

Ticks:      14898415422
Ticks:      14530683119
Ticks:      14513379681
Ticks:      14531306276
Ticks:      14576906166
Ticks:      14495807632
Ticks:      14548245187

Lord Alpha · Post by **Lord Alpha** » Tue Jul 25, 2006 5:07 pm

Athlon X2 3800+ @ 2.0 GHz

Code: Select all

Ticks:       9025792898
Ticks:       8986723978
Ticks:       9025712758
Ticks:       8529643670
Ticks:       9046469446
Ticks:       9539386198
Ticks:       9525954193
Ticks:       9566102128
Ticks:       9573992158
Ticks:       9552645045
Ticks:       9537216631
Ticks:       9485480727
Ticks:       9480342099
Ticks:       9504240861
Ticks:       9499390360
Ticks:       9026290844

byuu · Post by **byuu** » Tue Jul 25, 2006 6:14 pm

DMV27 wrote:The RDTSC code is correct, but the printf "%16lli" code does not work properly on Win32.

Ah, this is another Windowism. You need to use %I64d for 64-bit integers passed to a printf statement. However, using a double would work as well for this case since the va_args list stack push would truncate the double down to 64-bits for you anyway.

Good to know %16lli is the unix-equivalent of %I64d, though.

stale · Post by **stale** » Tue Jul 25, 2006 10:27 pm

Pentium 4 Prescott 3.0 ghz
Ticks: 17102538877
Ticks: 17268107992
Ticks: 17137126762
Ticks: 17117920807
Ticks: 17259204945
Ticks: 17141682488
Ticks: 17048534033
Ticks: 17029717709
Ticks: 17041291192
Ticks: 17047171650
Ticks: 17222334128
Ticks: 17394300292

paulguy · Post by **paulguy** » Tue Jul 25, 2006 11:21 pm

Ticks: 11279449977
Ticks: 11332359781
Ticks: 11269373812
Ticks: 11257814392
Ticks: 11295460487
Ticks: 11242154574
Ticks: 11315805258
Ticks: 11280371652
Ticks: 11275821014
Ticks: 11289480215
Ticks: 11247295621
Ticks: 11295770396
Ticks: 11251983033
Ticks: 11283265198
Ticks: 11260864251
Ticks: 11262269531

AMD Athlon XP 2500+ 1.98Ghz

WolfWings · Post by **WolfWings** » Wed Jul 26, 2006 2:01 am

Okay, the results are valid and stable at least now. I was away from the internet for a couple of days, and had my car break down in Arizona from this heat (it's not getting repaired) and I'm back in California now.

Looks like Athlon's are averaging around 12 cycles per lap with the loop overhead and RGBUnpack overhead included, Athlon-64's/Turions/X2's are down around 9 cycles per lap, and the Pentium 4 is up around 18 though I'm not entirely surprised since it's such an infamously difficult case to optimize code fully for the P4. Fully P4-optimized code though, as the VirtualDub author can attest, chews up bytes at a frightening rate.

I've actually worked out a table-free version at this point that surprisingly ends up about the same size, code-wise (100-120 bytes depending on optimization settings), as this tiny-table version. And it removes the table. As it ends up doing more parallel codepaths, it pipelines better from what I can see, though it obviously is doing more math (12 opcodes being the longest 'branch' of the parallel code paths though I may be able to reduce that by shuffling YUV columns around) so it may not be any faster. I also figured out how to short-circuit for the 'identical pixels' case, so that particular case kicks back after nothing more than the equivilant of a compare (a subtraction), and requisite jump. Moved the guard-bits math outside of the comparison after reading and understanding the "Optimal RGB Mixing" articles better. I'll upload the source code for this version later today if folks don't mind running another test program?

EDIT: After reading through various CPU optimization and instruction-pairing documents, I've learned something amusing: It's often faster to store a 32-bit register to memory and perform an MMX operation directly FROM that memory, than to use MOVD as I am right now to copy to an MMX register and manipulate from that. Starting to chew up registers more than before, but still not badly.

mozz · Post by **mozz** » Wed Jul 26, 2006 1:02 pm

I'll just toss out an idea that could be used to shrink the assembly versions of the "giant switch statement" in HQ2X/HQ3X/HQ4X (this does not apply to WolfWings' microcoded version though):

A lot of the "switch statement handlers" have common tails which could be merged together to save a lot of code space.

For example, you could optimize something like this: [example from HQ3X.asm]

Code: Select all

..@flag89
    PIXEL00_1U
    PIXEL01_1
    PIXEL02_1M
    PIXEL10_C
    PIXEL11
    PIXEL12_C
    DiffOrNot w8,w4,PIXEL20_1M,PIXEL20_2
    PIXEL21_C
    DiffOrNot w6,w8,PIXEL22_1M,PIXEL22_2
    jmp .loopx_end
..@flag90
    DiffOrNot w4,w2,PIXEL00_1M,PIXEL00_2
    PIXEL01_C
    DiffOrNot w2,w6,PIXEL02_1M,PIXEL02_2
    PIXEL10_C
    PIXEL11
    PIXEL12_C
    DiffOrNot w8,w4,PIXEL20_1M,PIXEL20_2
    PIXEL21_C
    DiffOrNot w6,w8,PIXEL22_1M,PIXEL22_2
    jmp .loopx_end

to read like this instead:

Code: Select all

..@flag89
    PIXEL00_1U
    PIXEL01_1
    PIXEL02_1M
..@flag89_tail:
    PIXEL10_C
    PIXEL11
    PIXEL12_C
    DiffOrNot w8,w4,PIXEL20_1M,PIXEL20_2
    PIXEL21_C
    DiffOrNot w6,w8,PIXEL22_1M,PIXEL22_2
    jmp .loopx_end
..@flag90
    DiffOrNot w4,w2,PIXEL00_1M,PIXEL00_2
    PIXEL01_C
    DiffOrNot w2,w6,PIXEL02_1M,PIXEL02_2
    jmp ..@flag89_tail

Given how much code is emitted by those macros, I suspect you could make the whole thing some 30-40% smaller this way with essentially no performance loss.

@MaxSt: I don't know how this code was written but it sure smells like it was generated by a program from some sort of source table. Would it make sense to modify your generator program so that it can identify common tails and automatically combine them as above?

[Edit: You could apply this optimization to the C++ code also, using gotos. An omniscient C++ compiler could do it automatically, but I'd be surprised if real compilers even tried an optimization like that. For readability of the C++ version, you might want to keep the original tail code there after the goto and just comment it out.]

Edit: I had another dumb idea, which could reduce code size especially for HQ3X and HQ4X: un-inline the contents of the PIXEL_nn macros (i.e. each one a call instruction). That would shrink code size drastically but probably slow it down a bit on anything older than a Pentium 4. My guess is that the Pentium 4 would perform about the same on that code; I have not tried it though, and I have no intuition about AMD here.

It might be better to just un-inline some of the PIXEL_nn macros---the ones that are not called very often at runtime (though I imagine those are the least-frequently-occurring in the source code too... oh well). In HQ3X for example, suppose you budget around 3 extra call instructions per flag handler (i.e. around 6 of the pixels for each are still inlined). If you pick the right ones you could probably reduce code size by 30% without slowing it down TOO much.

Maybe that trick is better applied to HQ4X only. You could start with the assumption that only people with a decently fast computer are going to enable HQ4X anyway, and not worry if its 7% slower on older machines...

WolfWings · Post by **WolfWings** » Thu Jul 27, 2006 10:16 am

mozz wrote:I'll just toss out an idea that could be used to shrink the assembly versions of the "giant switch statement" in HQ2X/HQ3X/HQ4X (this does not apply to WolfWings' microcoded version though):

Actually, this very concept is the basis for my work. Try this experiment instead: Unwrap the switch statement for each individual pixel. Now compare entire switch statements, you'll suddenly see a LOT of relations between them. And I mean hard, solid simplifications once you break all the pixels apart. Well... once you break things into quarters specifically, breaking them down further doesn't noticably improve HQ4x, though HQ3x you'd end up with the 'overlapping' middles to deal with.

mozz · Post by **mozz** » Thu Jul 27, 2006 1:20 pm

WolfWings wrote:
mozz wrote:I'll just toss out an idea that could be used to shrink the assembly versions of the "giant switch statement" in HQ2X/HQ3X/HQ4X (this does not apply to WolfWings' microcoded version though):
Actually, this very concept is the basis for my work. Try this experiment instead: Unwrap the switch statement for each individual pixel. Now compare entire switch statements, you'll suddenly see a LOT of relations between them. And I mean hard, solid simplifications once you break all the pixels apart. Well... once you break things into quarters specifically, breaking them down further doesn't noticably improve HQ4x, though HQ3x you'd end up with the 'overlapping' middles to deal with.

My thought for HQ3X was that it looks like most of the handlers could be divided into "two halves" (of varying size). E.g. if there are a bunch of handlers that set the top, left, and upper left pixels to the same thing, that would be done in one "half" and the other 5 pixels could be done in the other "half" (the center pixel is always the same in HQ3X).

You then construct a dispatch table with two entries for each index (pack them into one 32-bit word by making them 16-bit offsets into the code, and as you load them into a register, add a base address to them). During dispatch, load the second one into a register and leave it there. By the time you've run the first handler (which should generally be the shorter of the two, unless its less than 40 cycles long), the jump through a register to the second handler ought to be cheap or free.

ZSNES board

Progress on reducing HQ2x to fit in L1 cache

Progress on reducing HQ2x to fit in L1 cache

Re: Progress on reducing HQ2x to fit in L1 cache

Meh...