I believe I've finally got a decent working model for CPU<>PPU sync, with only a minimal compromise.
PPU cannot run ahead of CPU:
We know that many registers can be written mid-scanline with immediate effect; these include at least the screen and BG enable registers. One can also modify OAM and CGRAM on a per-cycle basis.
CPU cannot ordinarily run ahead of PPU:
The CPU polls the V/H pins of the PPU every clock tick. The states are fairly predictable, but the lengths can change based on interlace and overscan settings.
Extreme 1:
Forcefully sync both processors every single clock tick, causing ~20M context switches a second. Too slow: the switching alone eats up ~50% of each second, never mind the actual emulation overhead.
Extreme 2:
Give each core its own completely independent PPU latch counter, and forcefully sync whenever interlace or overscan settings change.
Compromise:
Keep only one PPU counter system, a class inside of the PPU itself. Use a tick/tock system and a ring buffer history, like so:
Code:
void CPU::tick() {
  ppu.counter.tick();
  //real-time counter positions; used for V/H pin polling, IRQs, etc.
  uint16_t v = ppu.counter.vtime();
  uint16_t h = ppu.counter.htime();
}

void PPU::tick() {
  counter.tock();
  //historical counter positions, from when the CPU was at this same point in time
  uint16_t v = counter.vhist();
  uint16_t h = counter.hhist();
}
//called by CPU
void PPU::Counter::tick() {
  //this will deplete the ring buffer completely ...
  if(ring_buffer.is_full()) scheduler.sync_ppu_up_to_cpu();
  //save current value for the PPU later on ...
  ring_buffer.push(current_vcounter, current_hcounter);
  //here, we use ppu.interlace(), ppu.overscan() and ppu.field()
  //as needed, to properly increment the counter positions by one cycle.
}
//called by PPU
void PPU::Counter::tock() {
  //this will add at least one new entry to the buffer ...
  if(ring_buffer.is_empty()) scheduler.sync_cpu_up_to_ppu();
  //slide our FIFO buffer forward by one time position
  ring_buffer.pop();
}
uint16_t PPU::Counter::vtime() {
  //or to get really hard-core, we could return the blanking states
  //as booleans, instead. though I see no advantage in doing this ...
  return current_vcounter;
}

uint16_t PPU::Counter::htime() {
  return current_hcounter;
}

uint16_t PPU::Counter::vhist() {
  return ring_buffer.vcounter();
}

uint16_t PPU::Counter::hhist() {
  return ring_buffer.hcounter();
}
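The ring buffer itself can stay minimal. Here's a sketch of one possible shape — the `RingBuffer` name, the member layout, and the one-scanline capacity are all my assumptions, not the actual implementation:

```cpp
#include <cstddef>
#include <cstdint>

//hypothetical fixed-size FIFO of (v,h) counter pairs;
//capacity of roughly one scanline's worth of entries is a guess
struct RingBuffer {
  static constexpr std::size_t capacity = 1364;
  std::uint16_t v[capacity], h[capacity];
  std::size_t read = 0, write = 0, count = 0;
  std::uint16_t current_v = 0, current_h = 0;

  bool is_full()  const { return count == capacity; }
  bool is_empty() const { return count == 0; }

  //CPU side: record the counter pair for this clock tick
  void push(std::uint16_t vcounter_, std::uint16_t hcounter_) {
    v[write] = vcounter_; h[write] = hcounter_;
    write = (write + 1) % capacity;
    count++;
  }

  //PPU side: consume the oldest recorded time position
  void pop() {
    current_v = v[read]; current_h = h[read];
    read = (read + 1) % capacity;
    count--;
  }

  //history values visible to the PPU after the last pop()
  std::uint16_t vcounter() const { return current_v; }
  std::uint16_t hcounter() const { return current_h; }
};
```

The important property is just FIFO order: the PPU always observes the counter pair from the exact tick the CPU recorded it, however far behind it's running.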
Whenever the CPU writes to $[00-3f]:[2100-213f], we call scheduler.sync_ppu_up_to_cpu() so that the PPU will use the new values just-in-time. This would work exactly the same as the current CPU<>SMP sync on writes to $[00-3f]:[2140-217f].
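In bus terms, that write-side sync could look something like this — the `mmio_write` dispatcher and the sync-counting `Scheduler` here are illustrative assumptions, not the real interfaces:

```cpp
#include <cstdint>

//hypothetical scheduler exposing the sync primitive described above
struct Scheduler {
  unsigned ppu_syncs = 0;
  void sync_ppu_up_to_cpu() { ppu_syncs++; }  //would run the PPU up to CPU time
};
Scheduler scheduler;

//hypothetical CPU-side MMIO write dispatcher
void mmio_write(std::uint32_t addr, std::uint8_t data) {
  std::uint16_t port = addr & 0xffff;
  if(port >= 0x2100 && port <= 0x213f) {
    //PPU register: catch the PPU up first, so it sees the write just-in-time
    scheduler.sync_ppu_up_to_cpu();
    //ppu.mmio_write(port, data);  //then apply the write to the PPU
  }
  //$2140-$217f (SMP ports) would get the same treatment for CPU<>SMP sync
  (void)data;
}
```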
The only big difference is that the PPU cannot run ahead.
But, that should not cause any noticeable speed penalty. For each time the PPU syncs to the CPU, the CPU should easily be able to execute at least one full scanline before needing to switch control back to the PPU.
That would drop ~20M context switches a second to ~(262*60*2), or ~32k switches a second (assuming, of course, the app doesn't hammer the $21xx registers non-stop; I know of no SNES games that do.)
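The arithmetic behind that estimate (NTSC: 262 scanlines per frame, ~60 frames per second, one round trip — two context switches — per scanline):

```cpp
//rough count of scheduler context switches per second under the
//scanline-granularity model (one CPU->PPU + one PPU->CPU per line)
long switches_per_second() {
  return 262L * 60 * 2;  //= 31,440, i.e. the ~32k ballpark figure
}
```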
The nicest part of this approach is that 100% of the PPU counter implementation resides inside of the PPU, rather than having the CPU try and emulate the PPU timing.
So if, in theory, we were to throw in a different PPU core that ran at 640x480, with 525 scanlines per frame, the CPU would need absolutely no adjustments ... even IRQs would work as expected. True modularity, allowing e.g. the bsnes CPU core to be plugged directly into an Apple IIGS emulator, if desired.
And there's only one counter. No need to spawn two instances. Since the PPU never goes ahead of the CPU, and the CPU syncs the PPU on writes to the interlace and overscan registers, it's 100% safe to use cached counter values. We never have to worry about an edge case causing the two counters to desync (which would permanently break IRQ timing if they did.)
If we're really lucky, the counter values won't even be needed by an unrolled, cycle-based PPU loop, except at V=240 and at the V=261->262 (or wrap to 0) transition. Much like how the cycle-based DSP runs in a fixed 32-cycle loop.
I'm thinking it will be best to shoehorn the PPU1 and PPU2 chips into a single class, as I do now. Just instead of having struct regs {}, have struct ppu1 {} and struct ppu2 {}, and clearly denote which chip is doing what on each cycle.
Now to see if I can work out a model to do this same thing with the current scanline-renderer ...
I'd really like to make this the primary target for work in 2009. Enough talk, time to get started on this already.