Taking dynarecs one step further: Just-in-time assembly?

tcaudilllg2 · Post by **tcaudilllg2** » Fri Mar 21, 2008 8:07 am

Might one convert the bytecode of say, a 6502 program to something equivalent in x86 bytecode, and run it natively on the host device? Would one call this emulation at all?

byuu · Post by **byuu** » Fri Mar 21, 2008 8:37 am

It would be called dynamic recompilation.

If you meant to say "convert it to x86 code, and save the result to a native executable that never needs to be recompiled again", then you would call that static recompilation.

Unfortunately, that idea is only good on paper. In the real world, many things are impossible with pure static recompilation. This includes self-modifying code, and code that contains indirect memory accesses (eg indirect jumps.) The only thing you can do there is keep trying to play through the game to log all possible code paths. And how likely is it that you'll succeed there?

The only solution is to implement a dynarec backend to handle these cases, which pretty much means you're duplicating a lot of effort since now you need two separate recompilers.

If it provided enormous speedups, it might be worth it. But for the most part, dynarecs can be optimized highly enough that recompilation of "hot" code blocks only needs to occur once. You're better off optimizing your dynarec than adding yet another source of potential failure to your emulator.

If you just want a native platform executable to run a game "sans emulator", then you can just attach the original code to the native emulator, so that it appears to the end user to be one native application. So long as it's fast (and any 6502 program would be), who cares how it works internally, right?

tcaudilllg2 · Post by **tcaudilllg2** » Fri Mar 21, 2008 8:49 am

byuu wrote:It would be called dynamic recompilation.

If you meant to say "convert it to x86 code, and save the result to a native executable that never needs to be recompiled again", then you would call that static recompilation.

Good luck if your code is self-modifying. You'll need a dynarec backend to handle that. And if you're going to include the dynarec anyway, then it better offer fantastic speedups -- which it usually doesn't. For a 6502, there's very little point. You may as well just stick the 6502 program inside an emulator, thus creating a standalone EXE, ala NES Lord.

Wikipedia disagrees with you. Here's the article for JIT:
http://en.wikipedia.org/wiki/Just-in-time_compilation

And here's the one for dynarec:
http://en.wikipedia.org/wiki/Dynamic_recompilation

Actually I was meaning to suggest performing the translation, and then storing the translation itself in memory. When the translating program closed, it would flush the cached translation.

The point is this: a 6502 program should, in theory, be capable of running on an 8086 at full speed if it were translated into the 8086's native code and the code pointer for the translator itself repositioned to the 6502 program's code. The translation program becomes the 6502 program, so to speak.

tcaudilllg2 · Post by **tcaudilllg2** » Fri Mar 21, 2008 9:04 am

This includes self-modifying code, and code that contains indirect memory accesses (eg indirect jumps.)

Can you give an example? This seems to me something rather infrequently experienced. And I think you could effectively premodel the memory.

If it works in Java I don't see why it wouldn't work with a 6502.

I'm pretty sure this isn't as complicated as it sounds. I do intend to attempt it, in fact. I know a lot of theory, but my experience is poor. I remember doing something like this in QB a long time ago, with CALL ABSOLUTE. It would be best to try this on a low power computer sim first. (like DOSBox)

byuu · Post by **byuu** » Fri Mar 21, 2008 9:04 am

Meh, quoting entire posts is annoying. I forgot about indirect memory accesses, so I just updated my post above.

I really don't care what Wikipedia says. They think bsnes was the first emulator to add Sufami Turbo support, and that Africa's elephant population has soared in the last few years.

Further, I don't see how it disagrees anyway. You'll have to explain.

Much of the "heavy lifting" of parsing the original source code and performing basic optimization is often handled at compile time, prior to deployment: compilation from bytecode to machine code is much faster than compiling from source. The deployed bytecode is portable, unlike native code. Since the runtime has control over the compilation, like interpreted bytecode, it can run in a secure sandbox. Compilers from bytecode to machine code are easier to write, because the portable bytecode compiler has already done much of the work.

JIT is really a different ballgame altogether. When you are converting to bytecode, you have full access to the source, so it's a lot easier to create output that can easily be recompiled where necessary.

Actually I was meaning to suggest performing the translation, and then storing the translation itself in memory. When the translating program closed, it would flush the cached translation.

So, if you're flushing the translation when the program closes anyway ... what's the difference between your idea and dynamic recompilation? Are you meaning to recompile the entire application as a whole upon program startup? Need more information ...

The point is this: a 6502 program should, in theory, be capable of running on an 8086 at full speed if it were translated into the 8086's native code and the code pointer for the translator itself repositioned to the 6502 program's code. The translation program becomes the 6502 program, so to speak.

With the right recompilation and enough optimization, sure, you could get pretty close to 1:1. But a 6502 by itself isn't very useful. You'll also need to handle all of the interactions with the other processors and such. Not impossible, but much, much harder.

byuu · Post by **byuu** » Fri Mar 21, 2008 9:11 am

Can you give an example? This seems to me something rather infrequently experienced. And I think you could effectively premodel the memory.

If it works in Java I don't see why it wouldn't work with a 6502.

Does Java support indirect memory jumps from pointers? I hear a lot of hoopla about how these managed languages lack pointers.

Further, JIT is not a static only recompiler, as far as I know. It can also recompile code while the application is running. A dynamic recompiler can handle indirect memory jumps, because it will be aware of what is in memory at any given point, whereas a static recompiler will not.

As for an example ...

Code:

...
mov dx,0x0384 ;port number of the hardware peripheral
in eax,dx ;read address from hardware peripheral
jmp eax ;jump to the address it returned

How will you know what address we are jumping to, if you don't know what value the external hardware peripheral is going to give you? Thus, you can't recompile this code without first encountering it through emulation. Further, even with emulation, you can't be sure that you'll always get that same value from the hardware peripheral, every single time. For all you know, port 0x0384 is a PRNG.

A dynarec would handle this by inserting a function between in and jmp to find out where we are jumping to, and start recompiling the code there if it wasn't already. Static recompilation would have no idea where this code would end up, and thusly it couldn't find the code that was jumped to, which would be needed in order to recompile it.

If "random hardware peripheral" is too bizarre an example, then pretend it's reading a player's health, and jumping into a table of function pointers to routines that perform certain actions (such as drawing the life bars red when near death, etc.) How would recompiling all code at startup (static recompilation) know what possible range of health you could have? Just reading from a 32-bit memory address, it wouldn't know if the jump table contained 10 valid entries, or 4 billion valid entries.
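A rough sketch of the dynarec behaviour described above, in JavaScript. Everything in it is hypothetical: the toy guest instruction set (`readPort`, `jmpIndirect`, `set`, `halt`), `translateBlock`, and `run` are made up for illustration and are not taken from any real emulator. It only shows that the jump target is resolved at run time, and that a block gets translated the first time execution actually reaches it.

```javascript
// Toy guest program: address -> instruction (hypothetical format).
const guest = {
  0:  { op: "readPort", port: 0x0384 },  // value is only known at run time
  1:  { op: "jmpIndirect" },             // jump to whatever was read
  10: { op: "set", value: "low health" },
  11: { op: "halt" },
  20: { op: "set", value: "full health" },
  21: { op: "halt" },
};

const blockCache = new Map();  // guest address -> translated host function
let translations = 0;          // how many blocks have been translated so far

// "Translate" a straight-line run of guest instructions into one host closure.
// A block ends at the first instruction that changes control flow.
function translateBlock(start) {
  translations++;
  const steps = [];
  let pc = start;
  for (;;) {
    const ins = guest[pc++];
    if (ins.op === "readPort") {
      steps.push((cpu, io) => { cpu.acc = io(ins.port); });
    } else if (ins.op === "set") {
      steps.push((cpu) => { cpu.result = ins.value; });
    } else if (ins.op === "jmpIndirect") {
      // Target unknown at translation time: hand it back to the dispatcher.
      steps.push((cpu) => cpu.acc);
      break;
    } else if (ins.op === "halt") {
      steps.push(() => null);
      break;
    }
  }
  return (cpu, io) => {
    let next = null;
    for (const step of steps) next = step(cpu, io);
    return next;  // next guest address, or null to stop
  };
}

// Dispatcher: translate a block only when execution actually reaches it.
function run(io) {
  const cpu = { acc: 0, result: null };
  let pc = 0;
  while (pc !== null) {
    if (!blockCache.has(pc)) blockCache.set(pc, translateBlock(pc));
    pc = blockCache.get(pc)(cpu, io);
  }
  return cpu.result;
}

console.log(run(() => 10));  // the port returns 10 this time
console.log(run(() => 20));  // a different value leads to a different block
```

A purely static pass over this toy program would have no way to learn that addresses 10 and 20 hold code, since nothing in the program text names them.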

tcaudilllg2 · Post by **tcaudilllg2** » Fri Mar 21, 2008 9:21 am

I'm not particular about the disagreement, just pointing out that there appear to be conflicting definitions. (makes for problems in the Google box).

I'm pretty sure the 8086 supports indirect jumps. I'm not seeing the problem: just say "if X wants to jump to Y supplied by Z, then write A as jumping to B supplied from C".

I think you're reading into this too literally. It's conceptually simple, and I may not be advocating a purely static approach at all, but something in between static and dynamic.

EDIT:
Now you would have to get all the resources used by the program and reorganize them on the basis of some principle. Then you could say, "although the original program said get X from Y, I'm telling you to get X from Z instead."

I'll make an example for clarification.

byuu · Post by **byuu** » Fri Mar 21, 2008 9:26 am

You still haven't told me what you want to do.

If you want to recompile the entire application at startup (static recompilation), so that you don't have to bog the application down with recompiling steps, then the point is that:

"When X wants to jump to Y supplied by Z, then A won't (can't) know what C is, so it can't recompile the code at B. It never knows that there is code at B. Thus, the application will die as soon as the code jumps to B."

If you don't mind recompiling code at runtime (dynamic recompilation), then you can obviously handle this case.

EDIT:

I think you're reading into this too literally. It's conceptually simple, and I may not be advocating a purely static approach at all, but something in between static and dynamic.

That's what I needed, thank you. In that case, that sounds fine. It's a lot more work, but it's hard to say whether it's worth it until you try it.

Pretty much any static recompiler has to have a dynamic recompiler backend anyway. So yes, I recommend you research static recompilation techniques, then. Corn is the only emulator I can think of offhand that claimed to use this technique. I don't think it was open source (could be wrong), so that may not be too helpful.

Deathlike2 · Post by **Deathlike2** » Fri Mar 21, 2008 9:33 am

tcaudilllg, you are back for more?

You gotta stop getting into theory and discuss more about practice. After the last "theory" of some random idea, it didn't go well the last time.. you have been warned (not as a zboard warning, more of a "we will bitchslap you for being a retard" kind of hint).

tcaudilllg2 · Post by **tcaudilllg2** » Fri Mar 21, 2008 9:52 am

Deathlike2 wrote:tcaudilllg, you are back for more?

You gotta stop getting into theory and discuss more about practice. After the last "theory" of some random idea, it didn't go well the last time.. you have been warned (not as a zboard warning, more of a "we will bitchslap you for being a retard" kind of hint).

I've heeded the warning.

No I think this is pretty damn sound. If you know what you're dealing with and where something is, then I'd think you could work with it. In the case of a ROM you have all the image data available within its small, confined space: you know exactly where it is. You'd have to set off a chunk of memory for the size of the ROM, and then have the translator basically treat that memory as a sort of storage from which the program could access its resources. (probably more complex than it sounds).

I'm aware that it probably wouldn't work in all cases, but any cases wherein it did work would be worth gunning for.

As it is I wonder somewhat how emulation will continue to progress if CPU speeds do not continue to climb. (A 16-core 4.0 GHz CPU seems not up to the task when you're faced with emulating a 3.0 GHz machine.) Somewhere along the line we're gonna have to cut the entropy.

EDIT:
Looked it up. Corn indeed made it happen. I think it would be damn worth it to try to tackle those challenges. The performance gains would be tremendous in some cases.

funkyass · Post by **funkyass** » Sat Mar 22, 2008 6:55 am

a CPU need not have higher frequencies to be higher performing.

that being said, dynamic recompilation has to be evaluated on a platform-by-platform basis, because timing is an issue that can't be compiled away dynamically.

tcaudilllg2 · Post by **tcaudilllg2** » Sat Mar 22, 2008 7:23 am

funkyass wrote:a CPU need not have higher frequencies to be higher performing.

that being said, dynamic recompilation has to be evaluated on a platform-by-platform basis, because timing is an issue that can't be compiled away dynamically.

That's a good point, the IPS rule holds pretty solid with low performance chips. So long as the MIPS rate continues to climb, there will be progress. However, the throughput does seem to be leveling off per core, and line-by-line interpretation is a one-process deal.

Now the timing is where I suspect I'd have the most trouble. My solution would be to break it up by the framerate, doing so many instructions in 1/60 sec and doing so many the next 1/60, etc. This would probably end up shielding the time discrepancy from the user's PoV. I'm aware that the problems the accurate emulation scene tries to tackle would remain problematic, but for the majority of programs there would be no difference.

I'm thinking about performing a low level to high level translation first. (as a stepping stone) I'm undecided as to which chip to use though... the fewer the opcodes, the better. I've got other responsibilities and I don't want to get bogged down on this.

Is anyone else interested in this? The more people who participated the more enjoyable it would be.

funkyass · Post by **funkyass** » Sat Mar 22, 2008 7:48 am

you know the difference between RISC and CISC right?

tcaudilllg2 · Post by **tcaudilllg2** » Sat Mar 22, 2008 7:53 am

funkyass wrote:you know the difference between RISC and CISC right?

Somewhat. I've read that RISC uses some sort of free access register system where the registers can access each other (something CISC reserves to RAM access), meaning fewer opcodes are needed.

I think I could cut down on development time and code by ignoring the cycles completely and concentrating completely on instruction processing rates. For a 6502, the rate would be ~300,000 (according to this), 1/60 of which would be 5,000 instructions per frame. Now how to split those up without something overlooking them?

funkyass · Post by **funkyass** » Sat Mar 22, 2008 8:10 am

consider this: all modern CPU's operate as RISC processors - meaning that they all instructions(ops) in the next cycle.

the x86 instruction set is CISC - which means instructions can take variable cycles to complete.

all x86 processors translate CISC ops into RISC ops.

the generic issue about timing is that not all consoles used an external source for timing, they just relied on the number of cycles of x ops - so how are you going to account for the fact that all ops in the dynamically recompiled program will return in one cycle?

tcaudilllg2 · Post by **tcaudilllg2** » Sat Mar 22, 2008 8:38 am

I think there's a typo in your post. I didn't get the message. "all modern CPU's operate as RISC processors - meaning that they all [sic] instructions(ops) in the next cycle."

If I've got 5,000 instructions to process in a given frame, then I have to process 5,000 instructions per frame and then wait for vBlank before I process any more. The problem is that I -am- the program in this instance, and cannot regulate my own speed. Something outside me (the program) must do it for me.

So for that reason alone it's not possible to do it without having something in control of the CPU.

Then how was it done with Corn? If you read the program from memory at every frame, while self modifying to keep the vBlank/memory copy in check, would it not be possible?

Like this:

[read entire code base into memory]
[set aside a special code buffer with room for 1 frame of instructions]
[copy the following to the buffer]:
- 5,000 instructions
- the vBlank test code
- the code to copy over the next instruction list...

The problem is looping. You can't know in advance how many instructions there will be in a given loop... unless you run an instruction counter in between every operation, and test that. That would be the key: you'd put in the instruction max test between every instruction. In practice, it'd look like the N64's "multiply by two" bug. (or is that a myth?)

But it would be faster than any form of interpretation. Yes that would work, I'm sure of it.

You could just run through a program's instruction code and put up a vBlank test instruction after each instruction. Actually, given that the vBlank is a non-maskable interrupt, the institution of one test after every opcode effectively synthesizes the phenomenon of the non-maskable interrupt.

In pseudocode:

A = 4 // program instruction

programCounter++ // inserted instructions
if programCounter > 5000 GOTO vBlank

B = 2 // program instruction

programCounter++ // inserted instructions
if programCounter > 5000 GOTO vBlank

C = Sqrt(B ^ 2 + C ^ 2) // program instruction

programCounter++ // inserted instructions
if programCounter > 5000 GOTO vBlank

Not obvious because we're taught not to think that way -- we're taught to optimize, not squander -- , but it works.
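A minimal sketch of that per-frame instruction budget, in JavaScript. The names (`INSTRUCTIONS_PER_FRAME`, `program`, `runFrame`) are hypothetical, and each array entry merely stands in for one translated instruction. One simplification to note: the counter check sits in the loop that calls the instructions rather than being emitted inline after each one, which keeps the example short but has the same effect of stopping after a fixed number of instructions.

```javascript
const INSTRUCTIONS_PER_FRAME = 5000;  // the estimate above: ~300,000 / 60

// Hypothetical "translated program": each entry stands in for one translated
// guest instruction and returns the index of the next instruction to run.
const program = [
  (s) => { s.a = 4; return 1; },
  (s) => { s.b = 2; return 2; },
  (s) => { s.c = Math.sqrt(s.b ** 2 + s.c ** 2); return 0; },  // jumps back: an endless loop
];

// Runs guest instructions until the frame's budget is spent, then returns.
// Returning is the point where the caller would wait for vBlank.
function runFrame(state) {
  let executed = 0;
  while (executed < INSTRUCTIONS_PER_FRAME) {
    state.pc = program[state.pc](state);
    executed++;  // the inserted instruction counter
  }
  return executed;
}

const state = { pc: 0, a: 0, b: 0, c: 0 };
const ran = runFrame(state);
console.log(ran, state.pc);  // prints "5000 2"
```

Because the position is kept in `state.pc`, the next call to `runFrame` resumes in the middle of the guest loop, which is how this sketch copes with loops whose length is not known in advance.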

DataPath · Post by **DataPath** » Sat Mar 22, 2008 4:33 pm

Actually, I think the cores of Intel and AMD processors, at least, are hybrid RISC/CISC/VLIW.

The x86 ISA can be translated down to microops, or translated up in macro-op fusion. Or executed as-is. Both Intel and AMD do things like this.

And SIMD instructions (SSE#) are, IIRC, all VLIW.

funkyass · Post by **funkyass** » Sat Mar 22, 2008 8:36 pm

all ops on a RISC processor return in one cycle (and are simpler operations, better suited to pipelining and execution among multiple execution units)

Corn got away with it because of another factor you aren't considering - programming methodology.

the n64 was probably among the first consoles to start using something similar to gaming APIs - you can translate the video calls to opengl and the audio calls to whatever - directsound for example. Because of that, dynamic compilation would work better for more recent consoles than older ones, where you'd have to handle the other processing units with their own unique way of talking to the program.

tcaudilllg2 · Post by **tcaudilllg2** » Sun Mar 23, 2008 12:22 am

I checked it out. I'm thinking ARM is probably the best choice.

This is the methodology I will use:
- read the instruction
- look up the name of the instruction in a lookup table at the index of the instruction
- look up the corresponding instruction string in a table at the same index as the name
- modify the string to suit the conditions of the instruction
- tack it onto the rebuilt program string

I've not yet decided on a testing platform. Until then I'll pseudocode it JS-style.

Code:


for (codeIndex = 0; codeIndex < code.length; codeIndex++) {

  opcodeIndex = code[codeIndex];
  instruction = instructionCorrespondence[opcodeIndex];
  translation = translationCorrespondence[opcodeIndex];
  
  translatedCode = translatedCode + Translate(translation);
}

Some of the translations are easily conducted. For example:

Code:


// instruction = ADD
// translation = "R3 = R1 + R2"

function Translate (translation) {

  // extract the specifics of the instruction  
  // modify the translation to suit the instruction
  // return

}

The trick is to find equivalent processes between platforms.
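The table-driven idea above can be sketched in runnable form as follows. This is only an illustration under assumptions: the four-byte toy encoding (`[opcode, destination, first source, second source]`), the `translationTable` contents, and the `{d}`/`{n}`/`{m}` placeholder syntax are all invented here and do not correspond to any real instruction set.

```javascript
// Hypothetical toy encoding: every instruction is four bytes,
// [opcode, destination register, first source, second source].
const translationTable = [
  { name: "ADD", template: "R{d} = R{n} + R{m}" },
  { name: "SUB", template: "R{d} = R{n} - R{m}" },
  { name: "MOV", template: "R{d} = R{n}" },
];

// Fill the register numbers into a translation template.
function fillTemplate(template, d, n, m) {
  return template.replace("{d}", d).replace("{n}", n).replace("{m}", m);
}

// Walk the code four bytes at a time, look each opcode up in the table,
// and append the filled-in translation to the rebuilt program.
function translateProgram(code) {
  const lines = [];
  for (let i = 0; i + 3 < code.length; i += 4) {
    const [opcode, d, n, m] = code.slice(i, i + 4);
    const entry = translationTable[opcode];
    if (entry === undefined) {
      throw new Error(`unknown opcode ${opcode} at byte ${i}`);
    }
    lines.push(fillTemplate(entry.template, d, n, m));
  }
  return lines.join("\n");
}

const toyCode = [0, 3, 1, 2,  1, 4, 3, 1,  2, 0, 4, 0];
console.log(translateProgram(toyCode));
```

Keeping the name and the template in one table entry avoids having two parallel arrays that must stay in step.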

tcaudilllg2 · Post by **tcaudilllg2** » Sun Mar 23, 2008 2:44 am

ARM-specific algorithm:

Code:


instructionCorrespondence = new Array (16);

codeIndex = 0;
for (index = 0; index < [program size]; index += 4) {
  operator[codeIndex] = [memory at first byte];
  code[codeIndex] = [memory at next three bytes];
  codeIndex++;
}


for (codeIndex = 0; codeIndex < code.length; codeIndex ++) {

  // code only retains the opcodes; the actual data has already been separated in this routine

  opcodeIndex = operator[codeIndex];

  // this code extracts the conditional bits (31 thru 28)

  subtractor = 16;
  while (opcodeIndex > 15) {
    opcodeIndex = opcodeIndex - subtractor;
    subtractor * 2;
    }

  instruction = instructionCorrespondence[opcodeIndex];
   translatedCode = translatedCode + Translate();
}

As for translation:

Code:


function Translate () {

  if (instruction == "ADD") {

    [divvy up code[codeIndex] into its three separate bytes]
    
    if (Byte1 == Register3) {
      Destination = "R3";
    }

    if (Byte2 == Register2) {
      secondOperand = "R2";
    }

    if (Byte3 == Register3) {
      thirdOperand = "R3";
    }

    translation = Destination + "=" + secondOperand + "+" + thirdOperand;
  }

  return translation;
}

That's one form of one instruction.
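For comparison, a sketch of pulling the fields out of a 32-bit ARM data-processing word with shifts and masks instead of comparing bytes one by one. The field positions follow the classic 32-bit ARM encoding (condition in bits 31-28, opcode in bits 24-21, Rn in bits 19-16, Rd in bits 15-12, Rm in bits 3-0). The function names and the output format are hypothetical, and only the plain register form with no shift and no immediate is handled.

```javascript
// Condition and opcode names, indexed by their 4-bit field values.
const CONDITIONS = [
  "equal", "notEqual", "higherOrSame", "lower", "negative", "positive",
  "overflow", "noOverflow", "higher", "lowerOrSame", "greaterThanOrEqual",
  "lesserThan", "greaterThan", "lesserThanOrEqual", "always", "reserved",
];
const OPCODES = [
  "AND", "EOR", "SUB", "RSB", "ADD", "ADC", "SBC", "RSC",
  "TST", "TEQ", "CMP", "CMN", "ORR", "MOV", "BIC", "MVN",
];

// Split a data-processing instruction word into its fields.
function decodeDataProcessing(word) {
  if (((word >>> 25) & 0x7) !== 0 || ((word >>> 4) & 0xff) !== 0) {
    throw new Error("only unshifted register operands are handled in this sketch");
  }
  return {
    condition: CONDITIONS[word >>> 28],
    opcode: OPCODES[(word >>> 21) & 0xf],
    rn: (word >>> 16) & 0xf,  // first operand register
    rd: (word >>> 12) & 0xf,  // destination register
    rm: word & 0xf,           // second operand register
  };
}

// Translate one ADD into the thread's "rD = rN + rM" string form.
function translateAdd(word) {
  const f = decodeDataProcessing(word);
  if (f.opcode !== "ADD") throw new Error(`not an ADD: ${f.opcode}`);
  let line = `r${f.rd} = r${f.rn} + r${f.rm}`;
  if (f.condition !== "always") line = `if (${f.condition}) ` + line;
  return line;
}

console.log(translateAdd(0xE0813002));  // ADD r3, r1, r2
console.log(translateAdd(0x00813002));  // ADDEQ r3, r1, r2
```

Because the register numbers come straight out of the word, one code path covers every register combination, rather than needing one `if` per register.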

Am I the only one who thinks this isn't crazy?

Edit: adding more instructions...

Code:


function Translate () {

  if (instruction == add) {

    [divvy up code[codeIndex] into its three separate bytes]
    
    if (Byte1 == Register3) {
      Destination = "r3";
    }

    if (Byte2 == Register2) {
      secondOperand = "r2";
    }

    if (Byte3 == Register3) {
      thirdOperand = "r3";
    }

    translation = Destination + "=" + secondOperand + "+" + thirdOperand;
  }
  else
  if (instruction == addWithCarry) {
    
  }
  else
  if (instruction == subtract) {
  }
  else
  if (instruction == subtractWithCarry) {
  }
  else
  if (instruction == subtractReverse) {
  }
  else
  if (instruction == subtractWithCarryReverse) {
  }
  else
  if (instruction == multiply) {
  }
  else
  if (instruction == multiplyAccumulate) {
  }
  else
  if (instruction == divide) {
  }
  else
  if (instruction == move) {
  }
  else
  if (instruction == moveNot) {
  }
  else
  if (instruction == compareDifference) {
  }
  else
  if (instruction == compareSum) {
  }
  else
  if (instruction == compareAND) {
  }
  else
  if (instruction == compareXOR) {
  }
  else
  if (instruction == logicalAND) {
  }
  else
  if (instruction == logicalXOR) {
  }
  else
  if (instruction == logicalOR) {
  }
  else
  if (instruction == bitClear) {
  }
  else
  if (instruction == load) {
  }
  else
  if (instruction == store) {
  }
  else
  if (instruction == loadBlock) {
  }
  else
  if (instruction == storeBlock) {
  }
  else
  if (instruction == swap) {
  }
  else
  if (instruction == softwareInterrupt) {
  }
  else
  if (instruction == MRS) {
  }
  else
  if (instruction == MSR) {
  }
  else
  if (instruction == coprocDataInstruction) {
  }
  else
  if (instruction == moveToCoprocFromReg) {
  }
  else
  if (instruction == moveToRegFromCoproc) {
  }
  else
  if (instruction == loadToCoproc) {
  }
  else
  if (instruction == storeToCoproc) {
  }

/* Barrel shift set - for later

  else
  if (instruction == shiftLeft) {
  }
  else
  if (instruction == shiftRight) {
  }
  else
  if (instruction == RotateShiftRight) {
  } */

// if there was a conditional suffix flag, append it as a one-line conditional.

  switch (condition) {
    case equal: translation = "if (equal) " + translation; break;
    case notEqual: translation = "if (notEqual) " + translation;  break;
    case higherOrSame: translation = "if (higherOrSame) " + translation;  break;
    case lower: translation = "if (lower) " + translation;  break;
    case negative: translation = "if (negative) " + translation;  break;
    case positive: translation = "if (positive) " + translation;  break;
    case overflow: translation = "if (overflow) " + translation;  break;
    case noOverflow: translation = "if (noOverflow) " + translation;  break;
    case higher: translation = "if (higher) " + translation;  break;
    case lowerOrSame:  translation = "if (lowerOrSame) " + translation;  break;
    case greaterThanOrEqual: translation = "if (greaterThanOrEqual) " + translation;  break;
    case lesserThan: translation = "if (lesserThan) " + translation;  break;
    case greaterThan: translation = "if (greaterThan) " + translation;  break;
    case lesserThanOrEqual: translation = "if (lesserThanOrEqual) " + translation; break;
  }

/*  else
  if (condition == always) {
  }
  else
  if (condition == reserved) {
  } // don't think these really matter */

  return translation;
}

Yeah that was pretty tiny. However, the variety of modes is very much a complicating factor.

Edit 10/23/08: added conditional appendages

grinvader · Post by **grinvader** » Sun Mar 23, 2008 7:13 pm

You must learn about enums and switches. The sooner the better.

tcaudilllg2 · Post by **tcaudilllg2** » Sun Mar 23, 2008 9:23 pm

grinvader wrote:You must learn about enums and switches. The sooner the better.

I know about those, but I don't see why they would be any more efficient. (or are you referring to the ARM instructions?)

From my standpoint it's easier to think of logical constructs as propositions.

Deathlike2 · Post by **Deathlike2** » Sun Mar 23, 2008 9:31 pm

tcaudilllg2 wrote:
grinvader wrote:You must learn about enums and switches. The sooner the better.
I know about those, but I don't see why they would be any more efficient. (or are you referring to the ARM instructions?)

From my standpoint it's easier to think of logical constructs as propositions.

It has nothing to do with efficiency. It has everything to do with looking at better written code.

creaothceann · Post by **creaothceann** » Sun Mar 23, 2008 11:53 pm

tcaudilllg2 wrote:I don't see why they would be any more efficient.

Your code uses loads of If instructions; afaik a compiler might optimize a Switch into a jump table.
Besides, it'd be much easier to read and maintain.
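To make the suggestion concrete, here is a small sketch of the enum-plus-switch shape in JavaScript. JavaScript has no built-in enum, so a frozen object stands in for one. The `Op` names, the `translateOp` function, and the output strings are hypothetical and cover only three instructions.

```javascript
// A frozen object used as a stand-in for an enum of instruction kinds.
const Op = Object.freeze({ ADD: 0, SUB: 1, MOV: 2 });

// One switch replaces the chain of if/else tests; each case returns
// the translated line for that instruction.
function translateOp(op, d, n, m) {
  switch (op) {
    case Op.ADD:
      return `r${d} = r${n} + r${m}`;
    case Op.SUB:
      return `r${d} = r${n} - r${m}`;
    case Op.MOV:
      return `r${d} = r${m}`;
    default:
      throw new Error(`unhandled instruction ${op}`);
  }
}

console.log(translateOp(Op.ADD, 3, 1, 2));
console.log(translateOp(Op.MOV, 0, 0, 4));
```

The `default` branch makes a forgotten instruction fail loudly instead of silently producing no translation, which an if/else chain with empty bodies does not do.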

tcaudilllg2 · Post by **tcaudilllg2** » Mon Mar 24, 2008 1:57 am

creaothceann wrote:
tcaudilllg2 wrote:I don't see why they would be any more efficient.
Your code uses loads of If instructions; afaik a compiler might optimize a Switch into a jump table.
Besides, it'd be much easier to read and maintain.

If it were BASIC-style, I'd agree; but I don't see how break commands are more elegant than closing braces.

It's just preference. I'll keep it in mind though for the final draft. As it is I have no problem with copy-paste abuse. (which is largely how I made that list).

I'll go back and add the flag registers now.

OK I think I see where you're coming from now: the switch statement implies that you indent at each level, which is far more elegant than the if statement over a large span of tests.
