Two-stage pipeline emulation

Archived bsnes development news, feature requests and bug reports. Forum is now located at http://board.byuu.org/
byuu

Post by byuu »

Alright, one thing I've been kicking around a lot lately has been emulation of a two-stage pipeline.

It's more an educational interest than a requirement. It really would be a requirement if the pipeline were longer, though.

Basically, the S-CPU has two stages: a fetch and decode stage, and an execution stage.

Analogy: say you want to unload a truck and put items onto a shelf. Working alone, you spend 1 minute getting an item from the truck, and another minute putting that item on the appropriate shelf. With a pipeline, imagine you have two people: one takes an item off the truck and hands it to the other, who puts it on the shelf while the first goes back for the next item. Once the first item has been handed over, this process takes 1 minute per item instead of 2 minutes per item.
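
To put rough numbers on the analogy, here is a minimal sketch. The function names are made up for illustration; only the one-minute step times come from the analogy itself.

```cpp
#include <cassert>

//Hypothetical helpers; the one-minute step times come from the analogy.

//One person does both steps for every item: 2 minutes per item.
inline int serial_minutes(int items) {
  assert(items >= 0);
  return items * 2;
}

//Two people overlap their steps. The first item still needs 2 minutes to
//reach the shelf; each later item arrives 1 minute after the previous one.
inline int pipelined_minutes(int items) {
  assert(items >= 0);
  return items == 0 ? 0 : items + 1;
}
```

For 10 items that works out to 20 minutes alone versus 11 minutes with two people, so "1 minute per item" is the rate once the pipeline is full.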

Well, that's how the S-CPU works. Things like I/O cycles, IRQ tests one cycle before the end of the opcode, cli : rti tricks, etc. suddenly make a lot more sense once you understand how this pipeline works.

An example of a linear process:

Code:

inc $12
- read 0xe6
- read 0x12
- read [0x12]
- i/o
- increment read value
- IRQ test
- write [0x12]
cli
- read 0x58
- IRQ test
- i/o
- p.i = 0

And an example of this process pipelined:

Code:

Legend:
/ work cycle
\ bus cycle
(both happen at the exact same time)

/ <empty>
\ read 0xe6

/ idle
\ read 0x12

/ idle
\ read [0x12]

/ increment read value
\ i/o

/ idle
\ write [0x12]

- IRQ test

/ idle
\ read 0x58

/ idle
\ i/o

- IRQ test

/ p.i = 0
\ read next opcode

Suddenly, things make a lot more sense. You can see why last_cycle() is needed ... IRQs are tested at the start of each new bus cycle, but the last work cycle of CLI is still enqueued, so you end up testing for IRQs with the state of P.i prior to the last work cycle that changes P.i to zero.
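
The CLI case can be modeled in a few lines. This is only a toy sketch of the timing described above, not bsnes code: the Cpu struct, its members and the nop() opcode are all invented for illustration, and each opcode is collapsed to "run the enqueued work cycle, then do the IRQ test".

```cpp
#include <cassert>

//Hypothetical miniature model; none of these names come from bsnes.
struct Cpu {
  bool p_i = true;       //P.i set: IRQs are masked
  bool irq_line = true;  //an IRQ is pending the whole time
  void (*pending_work)(Cpu&) = nullptr;  //last work cycle of the previous opcode

  //The enqueued work cycle runs alongside the next opcode fetch.
  void run_pending_work() {
    if(pending_work) { pending_work(*this); pending_work = nullptr; }
  }

  bool irq_test() const { return irq_line && !p_i; }

  //Each opcode returns the result of its own IRQ test.
  bool cli() {
    run_pending_work();     //overlaps "read 0x58"
    bool irq = irq_test();  //sampled while "p.i = 0" is still enqueued
    pending_work = [](Cpu& cpu) { cpu.p_i = false; };
    return irq;
  }

  bool nop() {
    run_pending_work();     //CLI's "p.i = 0" finally happens here
    return irq_test();
  }
};

inline void cli_delay_demo() {
  Cpu cpu;
  assert(cpu.cli() == false);  //P.i still reads as set during CLI's test
  assert(cpu.nop() == true);   //the opcode after CLI is the first to see it clear
}
```

In this model the IRQ can only be taken after the opcode that follows CLI, which is the effect last_cycle() reproduces in the linear implementation.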

Now, the only problem is that I have no idea how to implement such a multi-tasking process efficiently in a single-threaded environment in C++ :/

Again, the simulation used currently with last_cycle() works just fine and covers every known edge case. This is just an academic interest.
byuu

Post by byuu »

Here's what I have so far:

Code:

#include <stdio.h>

#define echo(n) printf(n "\n")

bool irq_execute;

bool irq_test() {
  echo("irq test");
  return irq_execute = true; //IRQ should occur
}

void irq_run() {
  echo("irq run");
  echo("irq.bus1");
  echo("irq.wrk1");
  echo("irq.bus2");
  echo("irq.wrk2");
}

void op00() {
  echo("op00 bus.1");
  echo("op00 wrk.1");
  echo("op00 bus.2");
  irq_test();
  echo("op00 wrk.2");
}

void op01() {
  echo("op01 bus.1");
  echo("op01 wrk.1");
  echo("op01 bus.2");
  irq_test();
  echo("op01 wrk.2");
}

void (*work_fp)();

void work_fp_nop() {
  echo("<work pipeline inactive>");
}

void irq_wrk2() { echo("irq.wrk2"); }

void irq_run_pipeline() {
  if(irq_test() == false) return;

  work_fp();
  echo("irq run");
  echo("irq.bus1");
  echo("irq.wrk1");
  echo("irq.bus2");
  work_fp = &irq_wrk2;
}

void op00_wrk2() { echo("op00 wrk.2"); }

void op00_pipeline() {
  work_fp();
  echo("op00 bus.1");
  echo("op00 wrk.1");
  echo("op00 bus.2");
  work_fp = &op00_wrk2;
}

void op01_wrk2() { echo("op01 wrk.2"); }

void op01_pipeline() {
  work_fp();
  echo("op01 bus.1");
  echo("op01 wrk.1");
  echo("op01 bus.2");
  work_fp = &op01_wrk2;
}

int main() {
  echo("linear:");
  op00();
  if(irq_execute) irq_run();
  op01();
  if(irq_execute) irq_run();
  echo("");

  work_fp = &work_fp_nop; //needed once every reset / pipeline flush
  echo("pipeline:");
  op00_pipeline();
  irq_run_pipeline();
  op01_pipeline();
  irq_run_pipeline();
  work_fp();
  echo("");

  echo("done");
  getchar(); //wait for Enter; portable stand-in for getch()
  return 0;
}

The basic idea is to use function pointers to delay one cycle as needed. The idea can actually be nested to create N-stage pipelines. As an optimization, there's no need to use function pointers inside actual opcodes, except on the very last cycle.
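
One possible way to generalize the single work_fp pointer to a longer delay is a small ring of function pointers. This is just a sketch under my own assumptions: DelayLine, step() and the std::string log (standing in for printf so the result can be checked) are hypothetical names, not part of the code above.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>

//Hypothetical sketch: work handed to step() runs Delay calls later.
template<std::size_t Delay>
struct DelayLine {
  static_assert(Delay > 0, "a delay line needs at least one slot");
  using Work = void (*)(std::string& log);

  std::array<Work, Delay> slots{};  //all null: the pipeline starts empty
  std::size_t head = 0;

  //Run whatever was queued Delay calls ago, then queue `next` in its place.
  void step(Work next, std::string& log) {
    if(slots[head]) slots[head](log);
    slots[head] = next;
    head = (head + 1) % Delay;
  }
};

inline void delay_line_demo() {
  std::string log;
  DelayLine<2> line;
  line.step([](std::string& l) { l += "a"; }, log);
  line.step([](std::string& l) { l += "b"; }, log);
  assert(log.empty());  //nothing has come out the far end yet
  line.step(nullptr, log);
  assert(log == "a");
  line.step(nullptr, log);
  assert(log == "ab");
}
```

With Delay set to 1 this behaves like the work_fp pattern: each call first runs the work left over from the previous call, then stores the new one.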

That last cycle gets tricky because of the rollover to the next opcode. You can't implement the last cycle of a previous opcode at the start of a new opcode, so you have to split it off into another function that the new opcode will execute first.

And obviously, the next optimization is to move IRQ run in between the last bus and work cycles, so that no function pointers are needed at all.