Taking dynarecs one step further: Just-in-time assembly?

tcaudilllg2 · Post by **tcaudilllg2** » Mon Mar 24, 2008 7:30 am

Declarations list for the produced code:


var equal = false;
var notEqual = false;
var higherOrSame = false;
var lower = false;
var negative = false;
var positive = false;
var overflow = false;
var noOverflow = false;
var higher = false;
var lowerOrSame = false;
var greaterThanOrEqual = false;
var lesserThan = false;
var greaterThan = false;
var lesserThanOrEqual = false;

Actually I need to stop and think about this. It's not as simple as I thought it would be, what with those flags being tied to the state of the operation.

In fact, it seems to me that you need to check for situations under which the flag is set for every instruction that requires them, nor do I think there is a way around it.

Everything done by the processor must be captured with a control structure somehow.

When processing "add with carry", I need to add in a little test as to whether the carry flag has been set or not by the operation -- and make it explicit in the produced code UNLESS the target platform has its own carry bit for the corresponding situation.

Given THAT situation... one wonders which will be faster: the translation.. or the emulation.

tcaudilllg2 · Post by **tcaudilllg2** » Mon Mar 24, 2008 9:41 am

Now for the main event... the various instruction modes.

Let's start with Add...

Code: Select all


  if (instruction == add) {

    [divvy up code[codeIndex] into its three seperate bytes]
   
    if (Byte1 == Register3) {
      Destination = "r3";
    }

    if (Byte2 == Register2) {
      2ndOperand = "r2";
    }

    if (Byte3 == Register3) {
      3rdOperand ="r3";
    }

    translation = Destination + "=" + 2ndOperand + "+" + 3rdOperand;
  }

That's a waste. Instead, let's divvy it up between the three instruction categories.

Code: Select all


// determine the mode of the processor

  modeMask = 2^27 + 2^26 + 2^25 + 2^24;
  processorMode = Byte1 && modeMask;

// determine the mode of the instruction

  operandTypeMask = 2^24;
  instructionMode = processorMode && 2^24;

  if (processorMode == registerOperandMode || processorMode == registerImmediateMode) {

// Based on this, we can avoid testing all of the instructions

    if (instruction == add || instruction == subtract) {

// I don't think I've got my math off....

      op[0] = Byte3 && (2^20 + 2^19 + 2^18 +2^17);
      op[1] = Byte2 && (2^16 + 2^15 + 2^14 +2^13);

      index = 0;
      while (index < 2) {
        switch (op[index]) {
          case Register0: term[index] = "r0"; break;
          case Register1: term[index] = "r1"; break;
          case Register2: term[index] = "r2"; break;
          case Register3: term[index]= "r3"; break;
          case Register4: term[index]= "r4"; break;
          case Register5: term[index]= "r5"; break;
          case Register6: term[index]= "r6"; break;
          case Register7: term[index]= "r7"; break;
          case Register8: term[index]= "r8"; break;
          case Register9: term[index]= "r9"; break;
          case Register10: term[index]= "r10"; break;
          case Register11: term[index]= "r11"; break;
          case Register12: term[index]= "r12"; break;
          case Register13: term[index]= "r13"; break;
          case Register14: term[index]= "r14"; break;
     //  case Register15: term[index] = "r15"; 
          default: 
        }
       index++
      }

      if (instructionMode == immediate) {
        term[3] = byte4;
      }
      else {
        switch (Op1) {
          case Register0: term[3] = "r0"; break;
          case Register1: term[3] = "r1"; break;
          case Register2: term[3] = "r2"; break;
          case Register3: term[3] = "r3"; break;
          case Register4: term[3] = "r4"; break;
          case Register5: term[3] = "r5"; break;
          case Register6: term[3] = "r6"; break;
          case Register7: term[3] = "r7"; break;
          case Register8: term[3] = "r8"; break;
          case Register9: term[3] = "r9"; break;
          case Register10: term[3] = "r10"; break;
          case Register11: term[3] = "r11"; break;
          case Register12: term[3] = "r12"; break;
          case Register13: term[3] = "r13"; break;
          case Register14: term[3] = "r14"; break;
  //  case Register15: term[3] = "r15"; 
          default: 
        }
      }

      if (instruction == add)
        translation = term[1] + "=" term[2] + "+" + term[3];
      if (instruction == subtract)
        translation = term[1] + "=" term[2] + "-" + term[3];
    }
    if (instruction == multiply || instruction == divide) {

      op[0] = Byte3 && (2^20 + 2^19 + 2^18 +2^17);
      op[1] = Byte2 && (2^16 + 2^15 + 2^14 +2^13);
      op[2] = Byte2 && (2^16 + 2^15 + 2^14 +2^13);

      index = 0;
      while (index < 3) {
        switch (op[index]) {
          case Register0: term[index] = "r0"; break;
          case Register1: term[index] = "r1"; break;
          case Register2: term[index] = "r2"; break;
          case Register3: term[index]= "r3"; break;
          case Register4: term[index]= "r4"; break;
          case Register5: term[index]= "r5"; break;
          case Register6: term[index]= "r6"; break;
          case Register7: term[index]= "r7"; break;
          case Register8: term[index]= "r8"; break;
          case Register9: term[index]= "r9"; break;
          case Register10: term[index]= "r10"; break;
          case Register11: term[index]= "r11"; break;
          case Register12: term[index]= "r12"; break;
          case Register13: term[index]= "r13"; break;
          case Register14: term[index]= "r14"; break;
     //  case Register15: term[index] = "r15"; 
          default: 
        }
       index++
      }

      if (instructionMode == immediate) {
        term[3] = byte4;
      }
      else {
        switch (Op1) {
          case Register0: term[3] = "r0"; break;
          case Register1: term[3] = "r1"; break;
          case Register2: term[3] = "r2"; break;
          case Register3: term[3] = "r3"; break;
          case Register4: term[3] = "r4"; break;
          case Register5: term[3] = "r5"; break;
          case Register6: term[3] = "r6"; break;
          case Register7: term[3] = "r7"; break;
          case Register8: term[3] = "r8"; break;
          case Register9: term[3] = "r9"; break;
          case Register10: term[3] = "r10"; break;
          case Register11: term[3] = "r11"; break;
          case Register12: term[3] = "r12"; break;
          case Register13: term[3] = "r13"; break;
          case Register14: term[3] = "r14"; break;
  //  case Register15: term[3] = "r15"; 
          default: 
        }
      }

      if (instruction == multiply)
        translation = term[1] + "=" term[2] + "*" + term[3];
      if (instruction == divide)
        translation = term[1] + "=" term[2] + "/" + term[3];
    }

  }
  else
  if (processorMode == branch) {

    branchAddress = (byte2 * 2^16) + (byte3 * 2^8) + byte4;

    for (index = 0; index < subroutineList.length && indexFound == false; index++) {
  // has the subroutine already been decoded?
      if (subroutineList[index].address == branchAddress) {
  // if it has, jump to it
        programCounter = branchAddress;
      }
  // otherwise we've got to decode it
    subroutineList[subroutinelist.length].address = branchAddress;
    subroutineList[subroutinelist.length].name = "BRANCH_" + branchAddress;
    subroutineList[subroutinelist.length].source = subroutineList[subroutinelist.length].name + ":\n";
    subroutineList[subroutinelist.length + 1] = new subroutine();
    programCounter = branchAddress;
    // what next?
  }

I'm somewhat confused here. How many different forms of the registers need I account for, do you think?

Edit 3/30/08: added subtraction (not a big change because it behaves just like addition)

Edit 3/27/08: began adding branching.

Edit:
ARM says there are immediate and register forms of the last operand in the instruction. The marker bit is 24. The processor itself has eight modes. I've fitted the ADD instruction appropriately, and all that is left, I suspect, to implement it with reasonable accuracy is the institution of its flag consequences. (which do not even come into account unless the S bit is set).

Because my goal is a speed comparision between automatic translation and interpretation (by means of JS's onTimer function), I see myself as having two remaining tasks ahead:
- implement a branch instruction by which means continuous addition may be employed
- (not sure if the flags are important or not to the test... seems like it's situation dependent).

On the other hand, I've yet to prove that this technique can be used to access memory, either.

blargg · Post by **blargg** » Mon Mar 24, 2008 10:14 pm

Wow, I feel like I stepped into a coding asylum. Am I the only one?

Deathlike2 · Post by **Deathlike2** » Mon Mar 24, 2008 10:21 pm

blargg wrote:Wow, I feel like I stepped into a coding asylum. Am I the only one?

Sometimes, the inmates run the asylum.

DataPath · Post by **DataPath** » Mon Mar 24, 2008 10:46 pm

Hey, tcaudilllg2, maybe you should take this discussion to email or #zsnes, rather than using the forums as a dumping ground for thinking "out loud".

tcaudilllg2 · Post by **tcaudilllg2** » Tue Mar 25, 2008 12:32 pm

DataPath wrote:Hey, tcaudilllg2, maybe you should take this discussion to email or #zsnes, rather than using the forums as a dumping ground for thinking "out loud".

Eh, I didn't know there was a protocol for this kinda thing.

Do you or do you not want faster emulation?

Anyway, I think that the question of whether or not this technique is faster than interpretation can be solved by creating two emulators, one that uses this technique and another that uses intepretation (the traditional method), and comparing their performance with internal stopwatch algorithms. I suspect it is, simply because string matching is terribly slow, accounting as it does for an entire loop sequence in itself.

Resources to quality ARM docs (the best you can find) would be appreciated.

Post by **grinvader** » Tue Mar 25, 2008 6:31 pm

DataPath wrote:#zsnes

No.

tcaudilllg2 wrote:string matching is terribly slow, accounting as it does for an entire loop sequence in itself.

tcaudilllg2, meet enums. Enums, meet tcaudilllg2.
Have fun.

blargg · Post by **blargg** » Tue Mar 25, 2008 8:48 pm

tcaudilllg2 wrote:Do you or do you not want faster emulation?

Profiler tells where most time is spent. There are other things than the CPU to be emulated: PPU, SPC-700. PPU is probably most ripe for optimization in any SNES emulator.

tcaudilllg2 · Post by **tcaudilllg2** » Tue Mar 25, 2008 11:35 pm

grinvader wrote:
DataPath wrote:#zsnes
No.

tcaudilllg2 wrote:string matching is terribly slow, accounting as it does for an entire loop sequence in itself.
tcaudilllg2, meet enums. Enums, meet tcaudilllg2.
Have fun.

How does ZSNES do it?

I found an ARM doc. According to it, the first three bits determine the mode of the instruction, implying that there are eight such modes. (although not all are used by every instruction).

byuu · Post by **byuu** » Tue Mar 25, 2008 11:45 pm

How does ZSNES do it?

Somewhat like this:

Code: Select all

mov edx,[cpu_reg_pc]
mov eax,[edx]
and eax,0xff
jmp [optable+eax*4]

optable:
  dd op00_brk
  ...
  dd opff_sbc_longx

tcaudilllg2 · Post by **tcaudilllg2** » Wed Mar 26, 2008 12:34 am

blargg wrote:
tcaudilllg2 wrote:Do you or do you not want faster emulation?
Profiler tells where most time is spent. There are other things than the CPU to be emulated: PPU, SPC-700. PPU is probably most ripe for optimization in any SNES emulator.

What do you mean?

@byuu:
my, that is fast. But if it's that fast, then shouldn't it have good speed on say, even a low 486? I once ran it on a DX4-100 on 8-bit, 11Khz and the speed was pretty bad... I remember the marker flag in OgreBattle waving for fifteen seconds after I set it. I understand though that sound requires some several MIPS to pull off.

Deathlike2 · Post by **Deathlike2** » Wed Mar 26, 2008 12:41 am

tcaudilllg2 wrote:
blargg wrote:
tcaudilllg2 wrote:Do you or do you not want faster emulation?
Profiler tells where most time is spent. There are other things than the CPU to be emulated: PPU, SPC-700. PPU is probably most ripe for optimization in any SNES emulator.
What do you mean?

He meant exactly what he said.

If you want to legitmately speed up the emu, you work on portions of the emulation that are speed critical.

tcaudilllg2 · Post by **tcaudilllg2** » Wed Mar 26, 2008 2:09 am

I'd like to know more about how ZSNES works (from an architectural standpoint) so I'll start a thread about the PPU. I am still working on JIT, though.

Deathlike2 · Post by **Deathlike2** » Wed Mar 26, 2008 11:25 am

How about you look at the source if you are so interested?

tcaudilllg2 · Post by **tcaudilllg2** » Wed Mar 26, 2008 12:27 pm

Deathlike2 wrote:How about you look at the source if you are so interested?

Because you already know and it's less energy to ask you. You can put your knowledge into context. I mean, would it take more than a few minutes to describe the process, which you've never done before on this forum, that I can plainly see.

I consulted the "documentation" project, but it seems to say very little about the state of the actual source or how it was conceived. (correction, it says nothing at all about those things).

I've seen the source and it left me with more questions than answers.

I'll take another look at it, but a blow-by-blow of the internals would be preferred.

I'm not interesting in the workings of ZSNES as I am in the workings of emulators in general. (of systems whose capabilities are similar to those of the SNES).

tcaudilllg2 · Post by **tcaudilllg2** » Wed Mar 26, 2008 2:40 pm

I looked up snes9x's code. The PPU is remarkably sound and efficient. (although I hear that every switch in a control block does indeed count a clock. I wonder if using if statements when there are say, 100 consecutive cases, would cut down on cpu time).

The CPU overhead, however, is considerable compared to JIT techniques. JIT can correlate between similar functions of CPUs, whereas interpretation must account for every descrepancy and emulate the entire machine in software.

DataPath · Post by **DataPath** » Wed Mar 26, 2008 3:07 pm

a cascading sequence of if/else if/else is a O(n) operation, while a switch statement is a O(1) operation.

A switch statement, when an option, is 99.9% of the time faster and (when dealing with more than 3 alternatives) more readable than cascading if/else statements.

If you really do plan on writing an emulator, though, I'd recommend coding it up to be readable first, and not optimize it until you have something that works.

creaothceann · Post by **creaothceann** » Wed Mar 26, 2008 4:39 pm

And after using a profiler.

Post by **grinvader** » Wed Mar 26, 2008 6:37 pm

byuu wrote:

How does ZSNES do it?

Somewhat like this:

Code: Select all

[/quote]
I think we call instead of jump, and also the table is manually allocated instead of dd'ed (see cpu/table.asm).

Edit: ayup
[code]cpu/execute.asm:568:     call dword near [edi+ebx*4]

blargg · Post by **blargg** » Wed Mar 26, 2008 7:15 pm

tcaudilllg2 wrote:I looked up snes9x's code. The PPU is remarkably sound and efficient. [...] The CPU overhead, however, is considerable compared to JIT techniques.

It doesn't matter how inefficient a piece of code is if most time is spent in other code. The only way to know where time is spent is with a profiler. You couldn't tell by looking, even if you were an experienced programmer. So stop focusing on CPU emulation and start reading up on performance profiling.

kode54 · Post by **kode54** » Thu Mar 27, 2008 1:33 am

Code: Select all

                                     entry  static       dynamic     %     run
Function Name                        count   instr         instr  total   total
SingleLine32                       1929312      27   11867198112   13.9   13.9
_S9xMainLoop                         17205     117    8874905232   10.4   24.3
MixStereo                          4396207     470    7942227625    9.3   33.6
DrawTile16_Normal1x1              19578834     599    7218263518    8.5   42.1
_S9xAPUExecute                   264932479      95    5856425395    6.9   49.0
DrawBackdrop16_Normal1x1             78792      44    5193199140    6.1   55.0
DrawBackdrop16Add_Normal1x1           9317      83    3829078184    4.5   59.5
DecodeBlock                        2965075     148    2469980301    2.9   62.4
DrawTile16AddS1_2_Normal1x1        1414424    3104    1801920913    2.1   64.5
S9xGetByte                        75050735      95    1566329006    1.8   66.4
_S9xMixSamples                     4396301     705    1553350797    1.8   68.2
DrawTile16Add_Normal1x1            3025889    1804    1515681532    1.8   70.0
DrawBackgroundOffset                  6595     455    1365314169    1.6   71.6
Immediate8                       133981212      10    1329396614    1.6   73.1
fx_run                              120818      29    1292973934    1.5   74.6
_S9xSetupOBJ                          9458     354     920981353    1.1   75.7
DrawBackground                      164219     509     775799405    0.9   76.6
std::_Tree<std::_Tmap_traits<unsigned int,s9xcommand_t,std::less<unsigned int>,s
td::allocator<std::pair<unsigned int const ,s9xcommand_t> >,0> >::_Lbound
                                   9979264      19     730968577    0.9   77.5
_S9xDoHEventProcessing            27666157     224     729700319    0.9   78.3
DrawMode7BG1Add_Normal1x1              396     336     719795408    0.8   79.2
DrawOBJS                             55487     260     624140843    0.7   79.9
S9xSetWord                        15235126     407     570901760    0.7   80.6
Immediate16                       63577364       9     557307954    0.7   81.2
Direct                            47572393      12     523468503    0.6   81.9
Relative                          73032425       7     511226975    0.6   82.5
std::less<unsigned int>::operator()
                                  87355527       7     477488109    0.6   83.0
OpF0E0                            31845267      36     459955811    0.5   83.6

Looks like the most expensive functions are PPU and APU. Although, it does appear that one of the CPU opcodes crept in there at the bottom of the sample listing. Wait, I almost missed that APU execution function. I suppose that could be over optimized, but it looks like the PPU should come first.

tcaudilllg2 · Post by **tcaudilllg2** » Thu Mar 27, 2008 1:35 am

Mmm nope. Now that I think about it, all of those switches and ifs which in an emulator control the hardware could be hardcoded, granting the effect of the switches while losing next to nothing to overhead.

The more accurate the emulator, the more gains to be made from direct translation, in my view. However, certainly the emulator precedes the translator.

Thinking translation wise is an altogether different way of thinking from thinking emulation-wise. Emulation says, "instruct the host hardware to act like the other hardware, which will execute the program by default." Translation says, "draw correspondences between the functioning of the host hardware and the hardware the program is made for, and match them, essentially altering the mechanistic form of the program if not its character. To the degree there are no correspondences, simulate them in relation to the correspondences that do exist." In substance, an automatic port of the program to the host platform is acheived.

Now on the issue of branch jumping, in ASM one only jumps to an address. How to implement this at high level?
- produce a label bearing the name of the target address. (for example "label_9F24")
- instruct the program to "goto" the address.
- have the translator keep record of this address and its code location. Test every new jump address against this table. Once translated, the routine may be rereferenced whereever it is called for in the ASM. (if the jump is unpredictable, interpretation will doubtless be required.)
- have the translator create a seperate code sequence for every jump, unless the code for the jump has already been translated as part of the main program. The idea is to keep the subroutines and the main program seperate.

tcaudilllg2 · Post by **tcaudilllg2** » Thu Mar 27, 2008 3:07 am

@kode64: which profiling software do you use?

kode54 · Post by **kode54** » Thu Mar 27, 2008 3:12 am

Visual Studio 2005.

tcaudilllg2 · Post by **tcaudilllg2** » Thu Mar 27, 2008 9:55 am

I'm busy asking myself how exactly one would bring IRQs into all of this. I hadn't thought it a problem at the outset, but after reading up on byuu's work I now understand that it is synchronity problems between the components which produces many emulation errors.

Actually I just had an idea. The reason we experience these timing errors, I think, is because near the end of the SNES' lifespan, programmers realized that they could save cycles by calculating the moments of interface between the CPU and other components, thus doing away with polling loops. Polling loops are standard practice for hardware interfacing, thus we see calcuated synchronity used very, very rarely outside of performance critical (and low-end) systems. (observe for example, GBA emulator compatibility is far higher compared to the SNES).

I don't think IRQs will be a genuine concern then.

Also, I suspect SNES emulators might improve performance by not interpreting device reads faithfully: treating every device read as a polling loop would probably eliminate the need for synchronization of the system. (though it's more complex even than that).