Another LUT->bit transformation request x.x

Strictly for discussing ZSNES development and for submitting code. You can also join us on IRC at irc.libera.chat in #zsnes.
Please, no requests here.

Moderator: ZSNES Mods

byuu

Another LUT->bit transformation request x.x

Post by byuu »

Ok, this one is for ROM memory speed. We need to determine the speed of every memory access made, both read and write.

We know the speed of each region follows this table (from anomie):

Code: Select all

00-3f:0000-1fff=  8
00-3f:2000-3fff=  6
00-3f:4000-41ff= 12
00-3f:4200-5fff=  6
00-3f:6000-ffff=  8
40-7f:0000-ffff=  8
80-bf:0000-1fff=  8
80-bf:2000-3fff=  6
80-bf:4000-41ff= 12
80-bf:4200-5fff=  6
80-bf:6000-7fff=  8
80-bf:8000-ffff=6,8
c0-ff:0000-ffff=6,8
The 6,8 means 6 when fastrom is enabled and 8 when it is not. We can store this number in a variable, so the fastest code we can make with a lookup table is:

Code: Select all

inline uint speed(uint addr) { return speedlut[addr >> 9]; }
Where speedlut is a pointer to speedtableSlow or speedtableFast, set when FastRom mode is toggled on or off.

Result:

Code: Select all

mov ebx,addr ; shr ebx,9 ; mov edx,[speedlut] ; mov al,[ebx+edx]
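For reference, the two tables could be filled once at startup by classifying each 512-byte page against anomie's region table above (a sketch; the helper `region_speed` and `build_speed_tables` names are mine, not from any actual source):

```cpp
#include <cstdint>

// One entry per 512-byte page of the 24-bit bus (16MB / 512 = 32768 entries).
static uint8_t speedtableSlow[0x8000], speedtableFast[0x8000];
static uint8_t *speedlut = speedtableSlow;  // swapped when FastROM is toggled

// Classify one address per anomie's region table.
static uint8_t region_speed(uint32_t addr, bool fastrom) {
  uint8_t fast = fastrom ? 6 : 8;
  uint32_t bank = addr >> 16, offs = addr & 0xffff;
  if(bank >= 0x40 && bank <= 0x7f) return 8;        // 40-7f:0000-ffff
  if(bank >= 0xc0) return fast;                     // c0-ff:0000-ffff
  if(bank >= 0x80 && offs >= 0x8000) return fast;   // 80-bf:8000-ffff
  if(offs < 0x2000) return 8;                       // :0000-1fff
  if(offs < 0x4000) return 6;                       // :2000-3fff
  if(offs < 0x4200) return 12;                      // :4000-41ff
  if(offs < 0x6000) return 6;                       // :4200-5fff
  return 8;                                         // :6000-ffff / :6000-7fff
}

static void build_speed_tables() {
  for(uint32_t page = 0; page < 0x8000; page++) {
    uint32_t addr = page << 9;  // every 512-byte page has a uniform speed
    speedtableSlow[page] = region_speed(addr, false);
    speedtableFast[page] = region_speed(addr, true);
  }
}
```

Note that the 4000-41ff region is exactly one 512-byte page, which is why a 9-bit shift is the coarsest granularity that still resolves every boundary.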
Now, the fastest we can make this without the LUT is:

anomie's method:

Code: Select all

uint speed(uint addr) {
  if((addr & 0xc00000) == 0x400000) return 8;
  if((addr & 0x808000) == 0x808000) return fast;
  if((addr & 0xc00000) == 0xc00000) return fast;
  if((addr & 0xe000) == 0x2000) return 6;
  if((addr & 0xfe00) == 0x4000) return 12;
  if((addr & 0xe000) == 0x4000) return 6;
  return 8;
}
TRAC's method:

Code: Select all

uint speed(uint addr) {
  addr |= ((addr & 0x8000) << 7);
  if ((addr & 0xc00000) == 0xc00000) return fast;
  addr |= ((addr - 0x2000) & 0x4000) << 8;
  if ((addr & 0xc00000) == 0x400000) return 8;
  if ((addr & 0x7e00) != 0x4000) return 6;
  return 12;
}
Where fast is set when fastrom is toggled as:
fast=(fastrom)?6:8;

Quite a bit more code and lots of conditional branches to ruin branch prediction.

Now, what I'm wondering is... is this worth trying to come up with more complex bit arithmetic tricks to get rid of the LUT, or is it just simply faster to use the two 32kb lookup tables?

For what it's worth, access patterns are below for typical ROMs:

Code: Select all

00-3f:0000-1fff=common
00-3f:2000-3fff=common
00-3f:4000-41ff=very uncommon
00-3f:4200-5fff=common
00-3f:6000-7fff=uncommon
00-3f:8000-ffff=common
40-7f:0000-ffff=uncommon
80-bf:0000-1fff=common
80-bf:2000-3fff=common
80-bf:4000-41ff=very uncommon
80-bf:4200-5fff=common
80-bf:6000-7fff=uncommon
80-bf:8000-ffff=very common
c0-ff:0000-ffff=very common
Nach
ZSNES Developer
Posts: 3904
Joined: Tue Jul 27, 2004 10:54 pm
Location: Solar powered park bench
Contact:

Post by Nach »

One lookup table; populate it appropriately at start depending on whether it's fast or slow.

mov eax,[addr]
shr eax,9
add eax,[speedlut]
mov al,[eax]
May 9 2007 - NSRT 3.4, now with lots of hashing and even more accurate information! Go download it.
_____________
Insane Coding
byuu

Post by byuu »

And if $420d.d0 is toggled while the game is running?
Nach
ZSNES Developer
Posts: 3904
Joined: Tue Jul 27, 2004 10:54 pm
Location: Solar powered park bench
Contact:

Post by Nach »

byuu wrote:And if $420d.d0 is toggled while the game is running?
Which is?
I'll assume that toggles fastrom, which I wasn't aware was possible or part of the equation.
creaothceann
Seen it all
Posts: 2302
Joined: Mon Jan 03, 2005 5:04 pm
Location: Germany
Contact:

Re: Another LUT->bit transformation request x.x

Post by creaothceann »

byuu wrote:The 6,8 means 6 when fastrom is enabled and 8 when it is not. We can store this number in a variable, so the fastest code we can make with a lookup table is:

Code: Select all

inline uint speed(uint addr) { return speedlut[addr >> 9]; }
Where speedlut is a pointer to speedtableSlow or speedtableFast, set when FastRom mode is toggled on or off.

Result:

Code: Select all

mov ebx,addr ; shr ebx,9 ; mov edx,[speedlut] ; mov al,[ebx+edx]
You can use only one table by modifying its values when FastROM is toggled.
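A sketch of that single-table approach (assuming the table is indexed by `addr >> 9` as above; only the FastROM-variable pages, 80-bf:8000-ffff and c0-ff:0000-ffff, need rewriting on toggle — the function name `set_fastrom` is mine):

```cpp
#include <cstdint>

static uint8_t speedlut[0x8000];  // one entry per 512-byte page

// Rewrite only the pages whose speed depends on the FastROM bit.
static void set_fastrom(bool fastrom) {
  uint8_t fast = fastrom ? 6 : 8;
  // 80-bf:8000-ffff -- upper half of each bank
  for(uint32_t bank = 0x80; bank <= 0xbf; bank++)
    for(uint32_t page = (bank << 16 | 0x8000) >> 9; page < ((bank + 1) << 16) >> 9; page++)
      speedlut[page] = fast;
  // c0-ff:0000-ffff -- the whole top quarter of the bus
  for(uint32_t page = 0xc00000 >> 9; page < 0x8000; page++)
    speedlut[page] = fast;
}
```

The toggle touches 12288 of the 32768 entries, so it's not free, but $420d writes are rare compared to memory accesses.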
vSNES | Delphi 10 BPLs
bsnes launcher with recent files list
mozz
Hazed
Posts: 56
Joined: Mon Oct 10, 2005 3:12 pm
Location: Montreal, QC

Post by mozz »

Depending on what table(s) you already have for handling aliasing etc, you may be able to re-use a few bits from the entries of those.

If you can borrow two bits from the top (or bottom) of each table entry, you can encode all the possible speeds. For example, if you have a table of pointers to memory blocks (or a table of offsets to add to the address), you could make sure they are all 4-byte aligned and that gives you the two bits. Either use 6+2*n to encode the speed directly, or have a constant that you shift by 4*n or something.

With 3 or 4 bits, more direct encodings are possible.
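A minimal sketch of the two-bit trick mozz describes, assuming a hypothetical table of 4-byte-aligned block pointers (all names here are invented for illustration):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical page-map entry: a pointer to the backing memory block with the
// access speed packed into the two low (alignment) bits.
// Speeds 6/8/12 encode as (speed - 6) / 2 -> 0/1/3, decoded as 6 + 2*bits.
static uintptr_t pack_entry(void *block, unsigned speed) {
  uintptr_t p = reinterpret_cast<uintptr_t>(block);
  assert((p & 3) == 0);  // requires 4-byte-aligned blocks
  return p | ((speed - 6) / 2);
}

static void *entry_block(uintptr_t e) { return reinterpret_cast<void *>(e & ~uintptr_t(3)); }
static unsigned entry_speed(uintptr_t e) { return 6 + 2 * unsigned(e & 3); }
```

Decoding costs one AND plus a shift-free add, and the pointer fetch was already being paid for.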
byuu

Post by byuu »

Bugfix:

Code: Select all

uint speed(uint addr) {
  addr |= ((addr & 0x8000) << 7);
  if((addr & 0xc00000) == 0xc00000) { return fast; }
  addr |= ((addr - 0x2000) & 0x4000) << 8;
  if(addr & 0x400000) { return 8; }
  if((addr & 0x7e00) != 0x4000) { return 6; }
  return 12;
}
TRAC's method wins. It's equal to or faster than the memory lookup, and does not consume 64k memory / 32k cache.

Identical results in all possible addresses compared to anomie's method.
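Since the bus is only 24 bits, the equivalence is cheap to confirm by brute force (a sketch of such a check; `verify` and the function names are mine):

```cpp
#include <cstdint>

static unsigned fast = 6;  // 6 when FastROM enabled, 8 otherwise

// anomie's method, as posted above.
static unsigned speed_anomie(uint32_t addr) {
  if((addr & 0xc00000) == 0x400000) return 8;
  if((addr & 0x808000) == 0x808000) return fast;
  if((addr & 0xc00000) == 0xc00000) return fast;
  if((addr & 0xe000) == 0x2000) return 6;
  if((addr & 0xfe00) == 0x4000) return 12;
  if((addr & 0xe000) == 0x4000) return 6;
  return 8;
}

// TRAC's method, bugfixed version.
static unsigned speed_trac(uint32_t addr) {
  addr |= ((addr & 0x8000) << 7);
  if((addr & 0xc00000) == 0xc00000) return fast;
  addr |= ((addr - 0x2000) & 0x4000) << 8;
  if(addr & 0x400000) return 8;
  if((addr & 0x7e00) != 0x4000) return 6;
  return 12;
}

// Compare every 24-bit address; true when both methods agree everywhere.
static bool verify() {
  for(uint32_t addr = 0; addr < 0x1000000; addr++)
    if(speed_anomie(addr) != speed_trac(addr)) return false;
  return true;
}
```

Running it with `fast` set to both 6 and 8 covers both FastROM states.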
mozz
Hazed
Posts: 56
Joined: Mon Oct 10, 2005 3:12 pm
Location: Montreal, QC

Post by mozz »

...I should add that if you encode speeds in a few bits of a lookup table you already have, then you have already paid for the cache miss. That might make it preferable to TRAC's method (I don't know).

Another possibility is to borrow just one bit from your lookup table entries and use it to differentiate common from uncommon cases. I doubt that would be better than TRAC's method though.
WolfWings
Hazed
Posts: 52
Joined: Wed Nov 02, 2005 1:31 pm

Post by WolfWings »

byuu wrote:Bugfix:

Code: Select all

uint speed(uint addr) {
  addr |= ((addr & 0x8000) << 7);
  if((addr & 0xc00000) == 0xc00000) { return fast; }
  addr |= ((addr - 0x2000) & 0x4000) << 8;
  if(addr & 0x400000) { return 8; }
  if((addr & 0x7e00) != 0x4000) { return 6; }
  return 12;
}
TRAC's method wins. It's equal to or faster than the memory lookup, and does not consume 64k memory / 32k cache.

Identical results in all possible addresses compared to anomie's method.
Actually, if you rebuild it, everything after the fast-check should compile down to a series of CMOVs and be quite fast. Have you tried this construction instead?

Code: Select all

uint speed(uint addr) {
  addr |= ((addr & 0x8000) << 7);
  if((addr & 0xc00000) == 0xc00000) { return fast; }
  {
    int a = 12;
    addr |= ((addr - 0x2000) & 0x4000) << 8;
    if((addr & 0x7e00) != 0x4000) { a = 6; }
    if(addr & 0x400000) { a = 8; }
    return a;
  }
}
Or if you want nothing but CMOVs at all...

Code: Select all

uint speed(uint addr) {
  int a = 12;
  addr |= ((addr & 0x8000) << 7);
  addr |= ((addr - 0x2000) & 0x4000) << 8;
  if((addr & 0x7e00) != 0x4000) { a = 6; }
  if(addr & 0x400000) { a = 8; }
  if((addr & 0xc00000) == 0xc00000) { a = fast; }
  return a;
}
I think all that code is correct.
TRAC
SNEeSe Developer
Posts: 25
Joined: Sun Nov 14, 2004 12:46 pm
Contact:

Post by TRAC »

Mozz wrote:...I should add that if you encode speeds in a few bits of a lookup table you already have, then you have already paid for the cache miss. That might make it preferable to TRAC's method (I don't know).
Even if that is possible, you may very well end up slowing down the use of the piggybacked LUT, which could make things worse...


Moreover, I wouldn't be so sure about CMOVs being faster than branches in this case. As long as the branches are reused (i.e., preferably not inlined - it's not a small code block by any means, anyhow), branch prediction should function exceedingly well.

CMOVs, on the other hand, might have trouble resolving the result quickly due to the dependency chains; second, the CMOV paths have to evaluate more cases before a result is reached; and lastly, there's no guarantee that the compiler won't optimize it to use branches instead of CMOVs anyway, in part because of the second reason.
byuu

Post by byuu »

TRAC wrote:Moreover, I wouldn't be so sure about CMOVs being faster than branches in this case. As long as the branches are reused (i.e., preferably not inlined - it's not a small code block by any means, anyhow), branch prediction should function exceedingly well.
Interesting point... in my code, sCPU::op_read(addr) and sCPU::op_write(addr, data) call MemBus::speed(addr). MemBus is an object, but it's the base class, so at least there's no vtable lookup. It still has to switch out the class context pointer (ecx) between sCPU and MemBus and perform a long function call. I did declare the function inline, assuming that since it's only used in two functions it would be better to avoid the function overhead, but I can see your point: in the most common cases (a "- lda $4212 : bpl -" loop, for instance), sharing the function would quickly result in 99.9% accurate branch predictions for long periods of time. I doubt the compiler was listening when I asked it to inline the function anyway.

As a side note, one of the things that really bothers me right now is the way C++ compilers refuse to inline large functions even when they are only used a single time. I have a highly critical loop that I try to manage by having it call small "chunks" of the function at a time, e.g.:

Code: Select all

void sCPU::tick() {
  if(status.nmi_counter) { nmi_tick(); }
  if(status.irq_counter) { irq_tick(); }

  status.hclock += 2;

  if(status.dram_refreshed == false) {
    if(status.hclock >= 530) {
      status.dram_refreshed = true;
      add_clocks(40);
      return;
    }
  }

  poll_interrupts();
}
I have to use Visual C++'s __forceinline to get nmi_tick, irq_tick and poll_interrupts to inline. These functions are used nowhere else, and not inlining them here causes a very real performance drop. This is why I'm so skeptical whenever I hear "the compiler knows best" in regards to optimizations and inlining :/

Of course, I could always go the evil #define route and enjoy backslash hell, but I'd rather not have to do that.
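One middle ground between __forceinline and #define hell is wrapping the MSVC keyword and GCC's equivalent attribute behind one macro (a sketch; the macro name ALWAYS_INLINE and the demo function are mine):

```cpp
// Portable forced-inline macro: bypasses the compiler's inlining cost model
// for functions that are only ever called from one hot loop.
#if defined(_MSC_VER)
  #define ALWAYS_INLINE __forceinline
#elif defined(__GNUC__)
  #define ALWAYS_INLINE inline __attribute__((always_inline))
#else
  #define ALWAYS_INLINE inline  // best effort on other compilers
#endif

// Trivial demo function marked for forced inlining.
static ALWAYS_INLINE unsigned add_one(unsigned x) { return x + 1; }
```

This keeps the function bodies as real functions (debuggable, type-checked) while still demanding inlining from both major compilers.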
TRAC wrote:and lastly, there's no guarantee that the compiler won't optimize it to use branches instead of CMOVs anyway, in part because of the second reason.
This is one of the reasons I'm highly critical of designing C++ code in ways that optimize for a specific processor. The results are iffy at best when not hardcoded in ASM, and in this case can easily cause large performance penalties on non-x86 architectures. bsnes does compile for the G5 processor (albeit with the older CPU/SMP cores), so I can't ignore that platform in my optimizations.
byuu

Post by byuu »

Double post because this is interesting x.x

I came up with a new algorithm that should (hopefully) be faster in most cases. It's speed2 below:

Code: Select all

//anomie -- most conditions, no math
uint speed0(uint addr) {
  if((addr & 0xc00000) == 0x400000) return 8;
  if((addr & 0x808000) == 0x808000) return fast;
  if((addr & 0xc00000) == 0xc00000) return fast;
  if((addr & 0xe000) == 0x2000) return 6;
  if((addr & 0xfe00) == 0x4000) return 12;
  if((addr & 0xe000) == 0x4000) return 6;
  return 8;
}

//TRAC -- lots of math, 3 conditions (FastROM in 1)
uint speed1(uint addr) {
  addr |= ((addr & 0x8000) << 7);
  if(addr >= 0xc00000) { return fast; }
  addr |= ((addr - 0x2000) & 0x4000) << 8;
  if(addr & 0x400000) { return 8; }
  if((addr & 0x7e00) != 0x4000) { return 6; }
  return 12;
}

//byuu -- little math, 4 conditions (FastROM in 2)
uint speed2(uint addr) {
  if(addr & 0x408000) { return (addr & 0x800000) ? fast : 8; }
  return ((addr + 0x6000) & 0x4000) ? 8 : ((addr - 0x4000) & 0x7e00) ? 6 : 12;
}

//LUT -- no conditions, cache killer
uint speed3(uint addr) {
  return speed_table[addr >> 9];
}

//byuu+TRAC -- only one condition (impossible to avoid), absurd amount of math
uint speed4(uint addr) {
  addr |= (addr & 0x8000) << 7;
  if(addr >= 0xc00000) { return fast; }
  addr |= ((addr - 0x2000) & 0x4000) << 8;
  return 6 + ((addr & 0x400000) >> 21) + ((addr & 0x7e00) != 0x4000) * 6;
}
Clock ticks (in ms) to test the entire 24-bit address range 64 times on a 1.7GHz P4:

speed0: 14844
speed1: 21750
speed2: 13953
speed3: 12172
speed4: 30890

I've no explanation for why speed0 is performing so well x.x

Anyone care to verify how the code performs on an AMD Athlon before I get home to test for myself?

http://cpp.sourceforge.net/?show=19679

Not sure how long that link will last.

Also, the speed3 tests are misleading. The 32k lookup table will easily stay entirely in L1 cache in such a short loop, whereas it will constantly be purged from cache in an emulator with lots of other code running alongside it. So in this case, speed2 appears to be the best, at least on a P4 processor.
tetsuo55
Regular
Posts: 307
Joined: Sat Mar 04, 2006 3:17 pm

Post by tetsuo55 »

Pressing compile at the bottom of that link just gives me loads of errors, so I cannot test.
byuu

Post by byuu »

Nevermind, thanks anyway. I can test it in an hour.

Kind of strange: if I replace my ternary conditions with more traditional if() {} brackets (e.g. unroll the code a little), I get a small performance gain:

Code: Select all

uint speed2(uint addr) {
  if(addr & 0x408000) {
    if(addr & 0x800000) { return fast; }
    return 8;
  }
  if((addr + 0x6000) & 0x4000) { return 8; }
  if((addr - 0x4000) & 0x7e00) { return 6; }
  return 12;
}

uint speed3(uint addr) {
  return speed_table[addr >> 9];
}
speed2 gets 12812, and speed3 gets 12407. For such a poor synthetic test, I think this is the best performance I'm going to get.
blargg
Regular
Posts: 327
Joined: Thu Jun 30, 2005 1:54 pm
Location: USA
Contact:

Post by blargg »

When testing the lookup function, shouldn't you use a data set that has the same distribution as you'd commonly find in normal code? In other words, some addresses and ranges are more often accessed than others.
byuu

Post by byuu »

Kind of tough; we haven't profiled many games, and it'd vary based on which mapper was used and what the ROM memory speed was.

I'd put the distribution at something like this:
80-bf:8000-ffff, c0-ff:0000-ffff = most common, FastROM regions
00-3f:8000-ffff, 40-7d:0000-ffff = very common, SlowROM regions
7e-7f:0000-ffff = fairly common, WRAM
00-3f|80-bf:0000-1fff = common, first 8k WRAM
00-3f|80-bf:2000-5fff [sans 4000-41ff] = semi-common
00-3f|80-bf:4000-41ff = very uncommon, XSlowROM
00-3f|80-bf:6000-7fff = uncommon, SRAM, DSP-n, etc map here

Both TRAC's routine and mine are optimized to provide the quickest return for the most common regions.

However, the important difference is that TRAC uses more precalculation, so the more common FastROM regions require only one conditional; mine has less math overhead, but requires two branch conditionals. So, which matters more?

It's hard to say just how much overhead the conditionals really have... in my extremely synthetic linear timing tests, the predictions are far faster than the math; but that could easily not be the case when used inside an emulator...
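As a rough way to act on blargg's earlier suggestion, one could pre-generate addresses following a distribution like the one above and time the routines against that stream instead of a linear sweep (a sketch; the weight numbers are illustrative guesses, not profiled data, and the function name is mine):

```cpp
#include <cstdint>
#include <cstdlib>

// Regions from the distribution above, with illustrative (non-profiled) weights.
struct Region { unsigned bank_lo, bank_hi, offs_lo, offs_hi, weight; };
static const Region regions[] = {
  { 0x80, 0xbf, 0x8000, 0xffff, 30 },  // FastROM, most common
  { 0xc0, 0xff, 0x0000, 0xffff, 30 },  // FastROM, most common
  { 0x00, 0x3f, 0x8000, 0xffff, 15 },  // SlowROM, very common
  { 0x7e, 0x7f, 0x0000, 0xffff, 10 },  // WRAM, fairly common
  { 0x00, 0x3f, 0x0000, 0x1fff,  8 },  // first 8k WRAM mirror, common
  { 0x00, 0x3f, 0x4200, 0x5fff,  5 },  // semi-common
  { 0x00, 0x3f, 0x6000, 0x7fff,  1 },  // uncommon
  { 0x00, 0x3f, 0x4000, 0x41ff,  1 },  // XSlowROM, very uncommon
};

// Draw one 24-bit address according to the region weights.
static uint32_t random_addr() {
  unsigned total = 0;
  for(const Region &r : regions) total += r.weight;
  unsigned pick = unsigned(rand()) % total;
  for(const Region &r : regions) {
    if(pick < r.weight) {
      uint32_t bank = r.bank_lo + unsigned(rand()) % (r.bank_hi - r.bank_lo + 1);
      uint32_t offs = r.offs_lo + unsigned(rand()) % (r.offs_hi - r.offs_lo + 1);
      return bank << 16 | offs;
    }
    pick -= r.weight;
  }
  return 0;  // unreachable: pick always falls within [0, total)
}
```

Filling a large buffer with these addresses up front keeps the RNG cost out of the timed loop; the buffer itself should also be large enough to keep the LUT from sitting untouched in L1 the whole time.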