Optimal RGB Mix/Clamped Add/Sub


MaxSt
ZSNES Developer
Posts: 113
Joined: Wed Jul 28, 2004 7:07 am
Location: USA

Post by MaxSt »

blargg wrote: If the MMX version is many times faster, then the same would likely be the case for an AltiVec version for PowerPC.
Yes, I know that some emulators on Mac already use AltiVec-optimized versions of HQ code.
blargg
Regular
Posts: 327
Joined: Thu Jun 30, 2005 1:54 pm
Location: USA

Post by blargg »

pSXAuthor wrote: Here is a faster one that can be used when adding 24-bit values (note: upper 8 bits = 0):

Code:

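dest=(x&0xfefeff)+(y&0xfefeff); /* clear the low bit of the upper two components so carries land there */
tmp=dest&0x1010100;             /* isolate the carry bit of each component */
tmp=tmp-(tmp>>8);               /* expand each carry into a 0xff mask over its component */
dest|=tmp;                      /* saturate the components that overflowed */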
As with the code above, this can be adapted for 555 or 565.
That's a different algorithm, as it loses the low bit of each component except the right-most one. This means it would be quite poor for use on 16/15-bit RGB, where each component has only 5 or 6 bits to begin with.
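To make the low-bit loss concrete, here is a minimal C sketch that transcribes the quoted snippet next to a plain per-channel reference. The function names `add_clamped_ref` and `add_clamped_quoted` are made up for illustration, and both assume pixels packed as 0x00RRGGBB with the upper 8 bits zero.

```c
#include <stdint.h>

/* Reference: per-channel saturating add of two 0x00RRGGBB values. */
static uint32_t add_clamped_ref(uint32_t x, uint32_t y)
{
    uint32_t out = 0;
    for (int shift = 0; shift < 24; shift += 8) {
        uint32_t sum = ((x >> shift) & 0xFF) + ((y >> shift) & 0xFF);
        if (sum > 0xFF)
            sum = 0xFF;               /* clamp this channel */
        out |= sum << shift;
    }
    return out;
}

/* The quoted method, transcribed unchanged apart from declarations. */
static uint32_t add_clamped_quoted(uint32_t x, uint32_t y)
{
    uint32_t dest = (x & 0xfefeff) + (y & 0xfefeff);
    uint32_t tmp = dest & 0x1010100;  /* carry bits of B, G and R */
    tmp = tmp - (tmp >> 8);           /* 0xFF mask per overflowed channel */
    dest |= tmp;
    return dest;
}
```

With these, adding 0x000100 and 0x000000 gives 0x000100 from the reference but 0x000000 from the quoted method, since the low green bit is masked off before the add. Also note that, as transcribed, the carry bit itself stays set in the result: 0x0000FF plus 0x000001 yields 0x0001FF, so blue is clamped correctly but green's low bit is disturbed.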
MaxSt wrote: Sometimes processor-specific code is a very good thing.
Especially when using it in a few bottleneck routines that can drastically ease the speed requirements on other code, allowing better maintainability overall, even considering the need to maintain a portable alternative to the architecture-specific code. These days the x86 architecture is the only one that really needs assembly anyway. Modern RISC architectures with gobs of registers can easily be optimized for by tuning the C code.
WolfWings
Hazed
Posts: 52
Joined: Wed Nov 02, 2005 1:31 pm

Post by WolfWings »

Hopefully this isn't thread necromancy, but some work I've been doing lately seemed most appropriate to post in this particular thread instead of starting an entirely new one. Specifically, I've been rebuilding HQ2x to replace that super-massive 'switch' statement. While analyzing the branches, I noticed that in the HQ2x and HQ4x cases (I haven't analyzed the HQ3x case yet) each quarter is calculated independently of all the other quarters, and the tests that determine which quarter gets which pattern are rotationally identical.

I.e., if you strip out all but the PIXEL_00 entries and simplify the switch statement, the 'inner technique' for HQ2x becomes MUCH clearer. Based on this, some research papers I've read at The Aggregate, and some simple code-size and data-size tests, I was able to rebuild the 'core' HQ2x system down to 156+24 bytes of lookup tables and another ~300 bytes of actual code. That includes a replacement for the 'diff/hdiff' code that removes those lookup tables entirely, thanks to a new approach I worked out for calculating the YUV differences on the fly.
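The rotational symmetry described above can be sketched in C. This is only an illustration, not the code from the post: the function name `rotate_cw` and the bit layout (bit 0 = top-left neighbour, reading left to right, top to bottom, skipping the centre) are assumptions. The idea is that if the pattern can be rotated in 90-degree steps, a single table written for one quarter can serve all four.

```c
#include <stdint.h>

/* Assumed bit layout of the 8-neighbour pattern (centre pixel excluded):
 *   bit0 bit1 bit2
 *   bit3  .   bit4
 *   bit5 bit6 bit7
 */
static uint8_t rotate_cw(uint8_t pattern)
{
    /* src_bit[i] is the old bit that lands on bit i after the
     * neighbourhood is turned 90 degrees clockwise. */
    static const uint8_t src_bit[8] = { 5, 3, 0, 6, 1, 7, 4, 2 };
    uint8_t out = 0;

    for (int i = 0; i < 8; ++i)
        out |= (uint8_t)(((pattern >> src_bit[i]) & 1u) << i);
    return out;
}
```

For example, a pattern with only the top-left bit set (0x01) becomes one with only the top-right bit set (0x04), and four rotations return the original pattern. In a real implementation the loop would likely be replaced by a 256-entry table or a few shifts and masks.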

The new 'switch replacement' code was available before I removed it from my website, though I ended up figuring out a different (but equivalent in all ways) approach to the 'optimal mixing' based on some of the 'SWAR Bit Twiddles' reference on The Aggregate. Either approach ends up as the same number of instructions, though I think the approach posted earlier in this thread is more readable, so I'll be converting my code to use it.

I'm still optimizing the diff/hdiff replacement code. Right now it's down to 4 instructions to 'prep' each side of a diff, which pair and interleave completely on newer machines. 2 instructions are needed to get the difference values for RGB, then 8 instructions (3 mov, shr, 3 add/sub, shr, 3 add/sub) are used to convert that into YUV. Finally, 9 instructions are used to add the final test-bit onto the end of the pattern-buffer (zero, cmp, adc, cmp, adc, cmp, adc, cmp, rcl).
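As a rough C illustration of computing the YUV difference on the fly instead of through lookup tables, the sketch below takes the RGB differences first and converts only the difference. It is not the assembly described in the post. The function name `colors_differ`, the 0x00RRGGBB packing, the conversion weights, and the thresholds (0x30, 0x07, 0x06, as recalled from the reference HQ2x source) are all assumptions, and rounding may differ slightly from a table-based version.

```c
#include <stdint.h>
#include <stdlib.h>

/* Assumed thresholds for the Y, U and V differences. */
enum { TR_Y = 0x30, TR_U = 0x07, TR_V = 0x06 };

/* Returns 1 if two 0x00RRGGBB colours count as "different", else 0. */
static int colors_differ(uint32_t a, uint32_t b)
{
    /* Per-channel RGB differences. */
    int dr = (int)((a >> 16) & 0xFF) - (int)((b >> 16) & 0xFF);
    int dg = (int)((a >> 8) & 0xFF) - (int)((b >> 8) & 0xFF);
    int db = (int)(a & 0xFF) - (int)(b & 0xFF);

    /* The conversion is linear, so it can be applied to the differences.
     * Division is used instead of >> to stay well-defined for negatives. */
    int dy = (dr + dg + db) / 4;
    int du = (dr - db) / 4;
    int dv = (-dr + 2 * dg - db) / 8;

    return abs(dy) > TR_Y || abs(du) > TR_U || abs(dv) > TR_V;
}
```

Each call yields one test bit, which a caller would shift into the pattern buffer for the pixel being examined.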

This is primarily an 'older machine' optimization with a goal of making the entire HQ2x subroutine fit in L1 cache, but as those are the machines that most need help running HQ2x acceptably, it seems useful to pursue. The optimization may improve performance even on newer machines, though, as even the newest CPUs still have very limited amounts of high-speed L1 cache (64kb code, 8x8kb data for Athlon 64, for instance), and the HQ systems blew right past those limits before.