Another LUT->bit transformation request x.x

Strictly for discussing ZSNES development and for submitting code. You can also join us on IRC at irc.libera.chat in #zsnes.
Please, no requests here.

Moderator: ZSNES Mods

byuu

Another LUT->bit transformation request x.x

Post by byuu »

Ok, this one is for ROM memory speed. We need to determine the speed of every memory access made, both read and write.

We know the speed of each region follows this table (from anomie):

Code: Select all

00-3f:0000-1fff=  8
00-3f:2000-3fff=  6
00-3f:4000-41ff= 12
00-3f:4200-5fff=  6
00-3f:6000-ffff=  8
40-7f:0000-ffff=  8
80-bf:0000-1fff=  8
80-bf:2000-3fff=  6
80-bf:4000-41ff= 12
80-bf:4200-5fff=  6
80-bf:6000-7fff=  8
80-bf:8000-ffff=6,8
c0-ff:0000-ffff=6,8
The 6,8 means 6 when fastrom is enabled and 8 when it is not. We can store this number in a variable, so the fastest code we can make with a lookup table is:

Code: Select all

inline uint speed(uint addr) { return speedlut[addr >> 9]; }
Where speedlut is a pointer to speedtableSlow or speedtableFast, set when FastRom mode is toggled on or off.

Result:

Code: Select all

mov ebx,addr ; shr ebx,9 ; mov edx,[speedlut] ; mov al,[ebx+edx]
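For reference, the two tables could be filled once at startup by classifying each 512-byte page against anomie's region table above (a sketch; the helper `region_speed` and `build_speed_tables` names are mine, not from any actual source):

```cpp
#include <cstdint>

// One entry per 512-byte page of the 24-bit bus (16MB / 512 = 32768 entries).
static uint8_t speedtableSlow[0x8000], speedtableFast[0x8000];
static uint8_t *speedlut = speedtableSlow;  // swapped when FastROM is toggled

// Classify one address per anomie's region table.
static uint8_t region_speed(uint32_t addr, bool fastrom) {
  uint8_t fast = fastrom ? 6 : 8;
  uint32_t bank = addr >> 16, offs = addr & 0xffff;
  if(bank >= 0x40 && bank <= 0x7f) return 8;        // 40-7f:0000-ffff
  if(bank >= 0xc0) return fast;                     // c0-ff:0000-ffff
  if(bank >= 0x80 && offs >= 0x8000) return fast;   // 80-bf:8000-ffff
  if(offs < 0x2000) return 8;                       // :0000-1fff
  if(offs < 0x4000) return 6;                       // :2000-3fff
  if(offs < 0x4200) return 12;                      // :4000-41ff
  if(offs < 0x6000) return 6;                       // :4200-5fff
  return 8;                                         // :6000-ffff / :6000-7fff
}

static void build_speed_tables() {
  for(uint32_t page = 0; page < 0x8000; page++) {
    uint32_t addr = page << 9;  // every 512-byte page has a uniform speed
    speedtableSlow[page] = region_speed(addr, false);
    speedtableFast[page] = region_speed(addr, true);
  }
}
```

Note that the 4000-41ff region is exactly one 512-byte page, which is why a 9-bit shift is the coarsest granularity that still resolves every boundary.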
Now, the fastest we can make this without the LUT is:

anomie's method:

Code: Select all

uint speed(uint addr) {
  if((addr & 0xc00000) == 0x400000) return 8;
  if((addr & 0x808000) == 0x808000) return fast;
  if((addr & 0xc00000) == 0xc00000) return fast;
  if((addr & 0xe000) == 0x2000) return 6;
  if((addr & 0xfe00) == 0x4000) return 12;
  if((addr & 0xe000) == 0x4000) return 6;
  return 8;
}
TRAC's method:

Code: Select all

uint speed(uint addr) {
  addr |= ((addr & 0x8000) << 7);
  if ((addr & 0xc00000) == 0xc00000) return fast;
  addr |= ((addr - 0x2000) & 0x4000) << 8;
  if ((addr & 0xc00000) == 0x400000) return 8;
  if ((addr & 0x7e00) != 0x4000) return 6;
  return 12;
}
Where fast is set when fastrom is toggled as:
fast=(fastrom)?6:8;

Quite a bit more code and lots of conditional branches to ruin branch prediction.

Now, what I'm wondering is... is this worth trying to come up with more complex bit arithmetic tricks to get rid of the LUT, or is it just simply faster to use the two 32kb lookup tables?

For what it's worth, access patterns are below for typical ROMs:

Code: Select all

00-3f:0000-1fff=common
00-3f:2000-3fff=common
00-3f:4000-41ff=very uncommon
00-3f:4200-5fff=common
00-3f:6000-7fff=uncommon
00-3f:8000-ffff=common
40-7f:0000-ffff=uncommon
80-bf:0000-1fff=common
80-bf:2000-3fff=common
80-bf:4000-41ff=very uncommon
80-bf:4200-5fff=common
80-bf:6000-7fff=uncommon
80-bf:8000-ffff=very common
c0-ff:0000-ffff=very common
Nach
ZSNES Developer
Posts: 3904
Joined: Tue Jul 27, 2004 10:54 pm
Location: Solar powered park bench
Contact:

Post by Nach »

One lookup table; populate it appropriately at start depending on whether it's fast or slow.

mov eax,[addr]
shr eax,9
add eax,[speedlut]
mov al,[eax]
May 9 2007 - NSRT 3.4, now with lots of hashing and even more accurate information! Go download it.
_____________
Insane Coding
byuu

Post by byuu »

And if $420d.d0 is toggled while the game is running?
Nach
ZSNES Developer
Posts: 3904
Joined: Tue Jul 27, 2004 10:54 pm
Location: Solar powered park bench
Contact:

Post by Nach »

byuu wrote:And if $420d.d0 is toggled while the game is running?
Which is?
I'll assume that toggles fastrom, which I wasn't aware was possible or part of the equation.
creaothceann
Seen it all
Posts: 2302
Joined: Mon Jan 03, 2005 5:04 pm
Location: Germany
Contact:

Re: Another LUT->bit transformation request x.x

Post by creaothceann »

byuu wrote:The 6,8 means 6 when fastrom is enabled and 8 when it is not. We can store this number in a variable, so the fastest code we can make with a lookup table is:

Code: Select all

inline uint speed(uint addr) { return speedlut[addr >> 9]; }
Where speedlut is a pointer to speedtableSlow or speedtableFast, set when FastRom mode is toggled on or off.

Result:

Code: Select all

mov ebx,addr ; shr ebx,9 ; mov edx,[speedlut] ; mov al,[ebx+edx]
You can use only one table by modifying its values when FastROM is toggled.
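A sketch of that single-table approach (assuming the table is indexed by `addr >> 9` as above; only the FastROM-variable pages, 80-bf:8000-ffff and c0-ff:0000-ffff, need rewriting on toggle — the function name `set_fastrom` is mine):

```cpp
#include <cstdint>

static uint8_t speedlut[0x8000];  // one entry per 512-byte page

// Rewrite only the pages whose speed depends on the FastROM bit.
static void set_fastrom(bool fastrom) {
  uint8_t fast = fastrom ? 6 : 8;
  // 80-bf:8000-ffff -- upper half of each bank
  for(uint32_t bank = 0x80; bank <= 0xbf; bank++)
    for(uint32_t page = (bank << 16 | 0x8000) >> 9; page < ((bank + 1) << 16) >> 9; page++)
      speedlut[page] = fast;
  // c0-ff:0000-ffff -- the whole top quarter of the bus
  for(uint32_t page = 0xc00000 >> 9; page < 0x8000; page++)
    speedlut[page] = fast;
}
```

The toggle touches 12288 of the 32768 entries, so it's not free, but $420d writes are rare compared to memory accesses.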
vSNES | Delphi 10 BPLs
bsnes launcher with recent files list
mozz
Hazed
Posts: 56
Joined: Mon Oct 10, 2005 3:12 pm
Location: Montreal, QC

Post by mozz »

Depending on what table(s) you already have for handling aliasing etc, you may be able to re-use a few bits from the entries of those.

If you can borrow two bits from the top (or bottom) of each table entry, you can encode all the possible speeds. For example, if you have a table of pointers to memory blocks (or a table of offsets to add to the address), you could make sure they are all 4-byte aligned and that gives you the two bits. Either use 6+2*n to encode the speed directly, or have a constant that you shift by 4*n or something.

With 3 or 4 bits, more direct encodings are possible.
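A minimal sketch of the two-bit trick mozz describes, assuming a hypothetical table of 4-byte-aligned block pointers (all names here are invented for illustration):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical page-map entry: a pointer to the backing memory block with the
// access speed packed into the two low (alignment) bits.
// Speeds 6/8/12 encode as (speed - 6) / 2 -> 0/1/3, decoded as 6 + 2*bits.
static uintptr_t pack_entry(void *block, unsigned speed) {
  uintptr_t p = reinterpret_cast<uintptr_t>(block);
  assert((p & 3) == 0);  // requires 4-byte-aligned blocks
  return p | ((speed - 6) / 2);
}

static void *entry_block(uintptr_t e) { return reinterpret_cast<void *>(e & ~uintptr_t(3)); }
static unsigned entry_speed(uintptr_t e) { return 6 + 2 * unsigned(e & 3); }
```

Decoding costs one AND plus a shift-free add, and the pointer fetch was already being paid for.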
byuu

Post by byuu »

Bugfix:

Code: Select all

uint speed(uint addr) {
  addr |= ((addr & 0x8000) << 7);
  if((addr & 0xc00000) == 0xc00000) { return fast; }
  addr |= ((addr - 0x2000) & 0x4000) << 8;
  if(addr & 0x400000) { return 8; }
  if((addr & 0x7e00) != 0x4000) { return 6; }
  return 12;
}
TRAC's method wins. It's equal to or faster than the memory lookup, and does not consume 64k memory / 32k cache.

Identical results in all possible addresses compared to anomie's method.
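Since the bus is only 24 bits, the equivalence is cheap to confirm by brute force (a sketch of such a check; `verify` and the function names are mine):

```cpp
#include <cstdint>

static unsigned fast = 6;  // 6 when FastROM enabled, 8 otherwise

// anomie's method, as posted above.
static unsigned speed_anomie(uint32_t addr) {
  if((addr & 0xc00000) == 0x400000) return 8;
  if((addr & 0x808000) == 0x808000) return fast;
  if((addr & 0xc00000) == 0xc00000) return fast;
  if((addr & 0xe000) == 0x2000) return 6;
  if((addr & 0xfe00) == 0x4000) return 12;
  if((addr & 0xe000) == 0x4000) return 6;
  return 8;
}

// TRAC's method, bugfixed version.
static unsigned speed_trac(uint32_t addr) {
  addr |= ((addr & 0x8000) << 7);
  if((addr & 0xc00000) == 0xc00000) return fast;
  addr |= ((addr - 0x2000) & 0x4000) << 8;
  if(addr & 0x400000) return 8;
  if((addr & 0x7e00) != 0x4000) return 6;
  return 12;
}

// Compare every 24-bit address; true when both methods agree everywhere.
static bool verify() {
  for(uint32_t addr = 0; addr < 0x1000000; addr++)
    if(speed_anomie(addr) != speed_trac(addr)) return false;
  return true;
}
```

Running it with `fast` set to both 6 and 8 covers both FastROM states.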
mozz
Hazed
Posts: 56
Joined: Mon Oct 10, 2005 3:12 pm
Location: Montreal, QC

Post by mozz »

...I should add that if you encode speeds in a few bits of a lookup table you already have, then you have already paid for the cache miss. That might make it preferable to TRAC's method (I don't know).

Another possibility is to borrow just one bit from your lookup table entries and use it to differentiate common from uncommon cases. I doubt that would be better than TRAC's method though.
WolfWings
Hazed
Posts: 52
Joined: Wed Nov 02, 2005 1:31 pm

Post by WolfWings »

byuu wrote:Bugfix:

Code: Select all

uint speed(uint addr) {
  addr |= ((addr & 0x8000) << 7);
  if((addr & 0xc00000) == 0xc00000) { return fast; }
  addr |= ((addr - 0x2000) & 0x4000) << 8;
  if(addr & 0x400000) { return 8; }
  if((addr & 0x7e00) != 0x4000) { return 6; }
  return 12;
}
TRAC's method wins. It's equal to or faster than the memory lookup, and does not consume 64k memory / 32k cache.

Identical results in all possible addresses compared to anomie's method.
Actually, if you rebuild it, everything after the fast-check should compile down to a series of CMOVs and be quite fast. Have you tried this construction instead?

Code: Select all

uint speed(uint addr) {
  addr |= ((addr & 0x8000) << 7);
  if((addr & 0xc00000) == 0xc00000) { return fast; }
  {
    int a = 12;
    addr |= ((addr - 0x2000) & 0x4000) << 8;
    if((addr & 0x7e00) != 0x4000) { a = 6; }
    if(addr & 0x400000) { a = 8; }
    return a;
  }
}
Or if you want nothing but CMOVs at all...

Code: Select all

uint speed(uint addr) {
  int a = 12;
  addr |= ((addr & 0x8000) << 7);
  addr |= ((addr - 0x2000) & 0x4000) << 8;
  if((addr & 0x7e00) != 0x4000) { a = 6; }
  if(addr & 0x400000) { a = 8; }
  if((addr & 0xc00000) == 0xc00000) { a = fast; }
  return a;
}
I think all that code is correct.
TRAC
SNEeSe Developer
Posts: 25
Joined: Sun Nov 14, 2004 12:46 pm
Contact:

Post by TRAC »

Mozz wrote:...I should add that if you encode speeds in a few bits of a lookup table you already have, then you have already paid for the cache miss. That might make it preferable to TRAC's method (I don't know).
Even if that is possible, you may very well end up slowing down the use of the piggybacked LUT, which could make things worse...


Moreover, I wouldn't be so sure about CMOVs being faster than branches in this case. As long as the branches are reused (i.e., preferably not inlined - it's not a small code block by any means, anyhow), branch prediction should function exceedingly well.

CMOVs, on the other hand, might have trouble resolving the result quickly due to the dependency chains; second, the CMOV paths have to evaluate more cases before a result is reached; and lastly, there's no guarantee that the compiler won't optimize it to use branches instead of CMOVs anyway, in part because of the second reason.
byuu

Post by byuu »

TRAC wrote:Moreover, I wouldn't be so sure about CMOVs being faster than branches in this case. As long as the branches are reused (i.e., preferably not inlined - it's not a small code block by any means, anyhow), branch prediction should function exceedingly well.
Interesting point... in my code, sCPU::op_read(addr) and sCPU::op_write(addr, data) call MemBus::speed(addr). MemBus is an object, but it's the base class, so at least there's no vtable lookup. It still has to switch out the class context pointer (ecx) between sCPU and MemBus and perform a long function call. I did declare the function inline, assuming that since it's only used in two functions it would be better to avoid the function overhead, but I can see your point: in the most common cases (a "- lda $4212 : bpl -" loop, for instance), sharing the function would quickly result in 99.9% accurate branch predictions for long periods of time. I doubt the compiler was listening when I asked it to inline the function anyway.

As a side note, one of the things that really bothers me right now is the way C++ compilers refuse to inline large functions even when they are only used a single time. I have a highly critical loop that I try to manage by having it call small "chunks" of the function at a time, e.g.:

Code: Select all

void sCPU::tick() {
  if(status.nmi_counter) { nmi_tick(); }
  if(status.irq_counter) { irq_tick(); }

  status.hclock += 2;

  if(status.dram_refreshed == false) {
    if(status.hclock >= 530) {
      status.dram_refreshed = true;
      add_clocks(40);
      return;
    }
  }

  poll_interrupts();
}
I have to use Visual C++'s __forceinline to get nmi_tick, irq_tick and poll_interrupts to inline. These functions are used nowhere else, and not inlining them here causes a very real performance drop. This is why I'm so skeptical whenever I hear "the compiler knows best" in regards to optimizations and inlining :/

Of course, I could always go the evil #define route and enjoy backslash hell, but I'd rather not have to do that.
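One middle ground between __forceinline and #define hell is wrapping the MSVC keyword and GCC's equivalent attribute behind one macro (a sketch; the macro name ALWAYS_INLINE and the demo function are mine):

```cpp
// Portable forced-inline macro: bypasses the compiler's inlining cost model
// for functions that are only ever called from one hot loop.
#if defined(_MSC_VER)
  #define ALWAYS_INLINE __forceinline
#elif defined(__GNUC__)
  #define ALWAYS_INLINE inline __attribute__((always_inline))
#else
  #define ALWAYS_INLINE inline  // best effort on other compilers
#endif

// Trivial demo function marked for forced inlining.
static ALWAYS_INLINE unsigned add_one(unsigned x) { return x + 1; }
```

This keeps the function bodies as real functions (debuggable, type-checked) while still demanding inlining from both major compilers.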
TRAC wrote:and lastly, there's no guarantee that the compiler won't optimize it to use branches instead of CMOVs anyway, in part because of the second reason.
This is one of the reasons I'm highly critical of designing C++ code in ways that optimize for a specific processor. The results are iffy at best when not hardcoded in ASM, and in this case can easily cause large performance penalties on non-x86 architectures. bsnes does compile for the G5 processor (albeit with the older CPU/SMP cores), so I can't ignore that platform in my optimizations.
byuu

Post by byuu »

Double post because this is interesting x.x

I came up with a new algorithm that should (hopefully) be faster in most cases. It's speed2 below:

Code: Select all

//anomie -- most conditions, no math
uint speed0(uint addr) {
  if((addr & 0xc00000) == 0x400000) return 8;
  if((addr & 0x808000) == 0x808000) return fast;
  if((addr & 0xc00000) == 0xc00000) return fast;
  if((addr & 0xe000) == 0x2000) return 6;
  if((addr & 0xfe00) == 0x4000) return 12;
  if((addr & 0xe000) == 0x4000) return 6;
  return 8;
}

//TRAC -- lots of math, 3 conditions (FastROM in 1)
uint speed1(uint addr) {
  addr |= ((addr & 0x8000) << 7);
  if(addr >= 0xc00000) { return fast; }
  addr |= ((addr - 0x2000) & 0x4000) << 8;
  if(addr & 0x400000) { return 8; }
  if((addr & 0x7e00) != 0x4000) { return 6; }
  return 12;
}

//byuu -- little math, 4 conditions (FastROM in 2)
uint speed2(uint addr) {
  if(addr & 0x408000) { return (addr & 0x800000) ? fast : 8; }
  return ((addr + 0x6000) & 0x4000) ? 8 : ((addr - 0x4000) & 0x7e00) ? 6 : 12;
}

//LUT -- no conditions, cache killer
uint speed3(uint addr) {
  return speed_table[addr >> 9];
}

//byuu+TRAC -- only one condition (impossible to avoid), absurd amount of math
uint speed4(uint addr) {
  addr |= (addr & 0x8000) << 7;
  if(addr >= 0xc00000) { return fast; }
  addr |= ((addr - 0x2000) & 0x4000) << 8;
  return 6 + ((addr & 0x400000) >> 21) + ((addr & 0x7e00) != 0x4000) * 6;
}
Clock ticks (in ms) to test the entire 24-bit address range 64 times on a 1.7GHz P4:

speed0: 14844
speed1: 21750
speed2: 13953
speed3: 12172
speed4: 30890

I've no explanation for why speed0 is performing so well x.x

Anyone care to verify how the code performs on an AMD Athlon before I get home to test for myself?

http://cpp.sourceforge.net/?show=19679

Not sure how long that link will last.

Also, the speed3 tests are misleading. The 32k lookup table will easily stay entirely in L1 cache in such a short loop, whereas it will constantly be purged from cache in an emulator with lots of other code running alongside it. So in this case, speed2 appears to be the best, at least on a P4 processor.
tetsuo55
Regular
Posts: 307
Joined: Sat Mar 04, 2006 3:17 pm

Post by tetsuo55 »

Pressing compile at the bottom of that link just gives me loads of errors, so I cannot test.
byuu

Post by byuu »

Nevermind, thanks anyway. I can test it in an hour.

Kind of strange: if I replace my ternary conditions with more traditional if() {} brackets (e.g. unroll the code a little), I get a small performance gain:

Code: Select all

uint speed2(uint addr) {
  if(addr & 0x408000) {
    if(addr & 0x800000) { return fast; }
    return 8;
  }
  if((addr + 0x6000) & 0x4000) { return 8; }
  if((addr - 0x4000) & 0x7e00) { return 6; }
  return 12;
}

uint speed3(uint addr) {
  return speed_table[addr >> 9];
}
speed2 gets 12812, and speed3 gets 12407. For such a poor synthetic test, I think this is the best performance I'm going to get.
blargg
Regular
Posts: 327
Joined: Thu Jun 30, 2005 1:54 pm
Location: USA
Contact:

Post by blargg »

When testing the lookup function, shouldn't you use a data set that has the same distribution as you'd commonly find in normal code? In other words, some addresses and ranges are more often accessed than others.
byuu

Post by byuu »

Kind of tough; we haven't profiled many games, and it'd vary based on which mapper was used and what the ROM memory speed was.

I'd put the distribution at something like this:
80-bf:8000-ffff, c0-ff:0000-ffff = most common, FastROM regions
00-3f:8000-ffff, 40-7d:0000-ffff = very common, SlowROM regions
7e-7f:0000-ffff = fairly common, WRAM
00-3f|80-bf:0000-1fff = common, first 8k WRAM
00-3f|80-bf:2000-5fff [sans 4000-41ff] = semi-common
00-3f|80-bf:4000-41ff = very uncommon, XSlowROM
00-3f|80-bf:6000-7fff = uncommon, SRAM, DSP-n, etc map here

Both TRAC's routine and mine are optimized to provide the quickest return for the most common regions.

However, the important difference is that TRAC uses more precalculation, so the more common FastROM regions require only one conditional; mine has less math overhead, but requires two branch conditionals. So, which matters more?

It's hard to say just how much overhead the conditionals really have... in my extremely synthetic linear timing tests, the predictions are far faster than the math; but that could easily not be the case when used inside an emulator...
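As a rough way to act on blargg's earlier suggestion, one could pre-generate addresses following a distribution like the one above and time the routines against that stream instead of a linear sweep (a sketch; the weight numbers are illustrative guesses, not profiled data, and the function name is mine):

```cpp
#include <cstdint>
#include <cstdlib>

// Regions from the distribution above, with illustrative (non-profiled) weights.
struct Region { unsigned bank_lo, bank_hi, offs_lo, offs_hi, weight; };
static const Region regions[] = {
  { 0x80, 0xbf, 0x8000, 0xffff, 30 },  // FastROM, most common
  { 0xc0, 0xff, 0x0000, 0xffff, 30 },  // FastROM, most common
  { 0x00, 0x3f, 0x8000, 0xffff, 15 },  // SlowROM, very common
  { 0x7e, 0x7f, 0x0000, 0xffff, 10 },  // WRAM, fairly common
  { 0x00, 0x3f, 0x0000, 0x1fff,  8 },  // first 8k WRAM mirror, common
  { 0x00, 0x3f, 0x4200, 0x5fff,  5 },  // semi-common
  { 0x00, 0x3f, 0x6000, 0x7fff,  1 },  // uncommon
  { 0x00, 0x3f, 0x4000, 0x41ff,  1 },  // XSlowROM, very uncommon
};

// Draw one 24-bit address according to the region weights.
static uint32_t random_addr() {
  unsigned total = 0;
  for(const Region &r : regions) total += r.weight;
  unsigned pick = unsigned(rand()) % total;
  for(const Region &r : regions) {
    if(pick < r.weight) {
      uint32_t bank = r.bank_lo + unsigned(rand()) % (r.bank_hi - r.bank_lo + 1);
      uint32_t offs = r.offs_lo + unsigned(rand()) % (r.offs_hi - r.offs_lo + 1);
      return bank << 16 | offs;
    }
    pick -= r.weight;
  }
  return 0;  // unreachable: pick always falls within [0, total)
}
```

Filling a large buffer with these addresses up front keeps the RNG cost out of the timed loop; the buffer itself should also be large enough to keep the LUT from sitting untouched in L1 the whole time.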