NMIB / IRQB timing (applies to NES as well)

Archived bsnes development news, feature requests and bug reports. Forum is now located at http://board.byuu.org/
byuu


Post by byuu »

I found what was causing my IRQ timings to be off by up to 2 clocks, but I don't know how to fix it ...

Whenever an NMI or IRQ triggers, some basic code executes to enter the interrupt (backup PC/P, load vector, etc).

The W65C816S datasheet says the first two cycles are I/O. We know that can't be right as stated, because an I/O cycle takes 6 clocks; two of them would be 12.

We've always observed it to be 14 clocks, until today. Here's the basic routine:

Code: Select all

void sCPU::op_irq() {
  op_read(regs.pc.d);
  op_io();
  if(!regs.e) op_writestack(regs.pc.b);
  op_writestack(regs.pc.h);
  op_writestack(regs.pc.l);
  op_writestack(regs.e ? (regs.p & ~0x10) : regs.p);
  rd.l = op_read(event.irq_vector + 0);
  regs.pc.b = 0x00;
  regs.p.i  = 1;
  regs.p.d  = 0;
  rd.h = op_read(event.irq_vector + 1);
  regs.pc.w = rd.w;
}
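For concreteness, here's the clock accounting behind that 14-clock observation as a tiny standalone sketch (the constants and names are mine, for illustration; this is not bsnes code):

```cpp
#include <cassert>

// Master-clock costs per bus cycle, as used in the discussion above:
// a SlowROM access is 8 clocks, an internal I/O cycle is 6.
constexpr int kSlowAccess = 8;
constexpr int kIOCycle    = 6;

// What the W65C816S datasheet implies for the first two interrupt
// cycles (two I/O cycles) versus what is actually observed when
// executing from a SlowROM region (one bus read at PC, then one I/O).
constexpr int datasheet_clocks = kIOCycle + kIOCycle;     // 12
constexpr int observed_clocks  = kSlowAccess + kIOCycle;  // 14
```

The 2-clock gap between those two numbers is exactly the discrepancy the rest of this thread chases.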
My theory was that one of the cycles was repeating the opcode fetch for the next opcode. Presumably that would be some sort of quirk of the 2-stage pipeline being flushed.

However, when writing up some starter tests for Mecarobot, I hit some edge cases where the first two cycles of IRQ were taking 16 clocks instead of 14.

I ruled out all that I could ... it's not related to the FastROM setting, it isn't related to the speed of the most recent address on the bus, etc.

Here's my notes so far:

Code: Select all

nop = 14 cycles [opfetch + i/o]
bit = 30 cycles [opfetch + aal + aah + read]
bpl = 20 cycles [opfetch + relfetch + i/o]

Cycle  VDA  VPA  RWB  AddrBus   DataBus
nop.0    1    1    1  PBR,PC    Opcode
nop.1    0    0    1  PBR,PC+1  I/O
-----
bit.0    1    1    1  PBR,PC    Opcode
bit.1    0    1    1  PBR,PC+1  AAL
bit.2    0    1    1  PBR,PC+2  AAH
bit.3    1    0  1/0  DBR,AA    Data Low
-----
bpl.0    1    1    1  PBR,PC    Opcode
bpl.1    0    1    1  PBR,PC+1  Offset
bpl.2    0    0    1  PBR,PC+1  I/O

HTIME latching results:

VTIME  Emulation  Hardware
   20 = 24/6f     24/70 Slow
   21 = 23/6e     23/6f Slow
   22 = 29/75     29/75 Fast
   23 = 28/74     28/74 Fast
   24 = 27/73     27/73 Fast
   25 = 26/72     26/72 Fast
   26 = 25/71     25/71 Fast
   27 = 24/70     24/70 Fast
   28 = 23/6f     23/6f Fast
   29 = 22/6e     22/6e Fast
   2a = 27/72     27/72 Fast
   2b = 26/71     26/71 Fast
   2c = 25/70     25/70 Fast
   2d = 24/6f     24/6f Fast
   2e = 23/6e     23/6e Fast
   2f = 25/71     26/71 Slow

20 = slow
008104 bpl $8100     [$008100] A:0020 X:8000 Y:0000 S:01ff D:0000 DB:00 nVMxdiZC V: 31 H:1354
1354 = opfetch
1362 = rel fetch
   6 = i/o
* IRQ transition @  32,  10
008100 nop                     A:0020 X:8000 Y:0000 S:01ff D:0000 DB:00 nVMxdiZC V: 32 H:  12
  12 = opfetch
  20 = i/o
* IRQ @  32,  26

21 = slow
008100 nop                     A:0021 X:8000 Y:0000 S:01ff D:0000 DB:00 nVMxdiZc V: 33 H:   8
   8 = opfetch
  16 = i/o
* IRQ transition @  33,  10
* IRQ @  33,  22

22 = fast
008100 nop                     A:0022 X:8000 Y:0000 S:01ff D:0000 DB:00 nVMxdizc V: 34 H:   4
   4 = opfetch
  12 = i/o
* IRQ transition @  34,  10
008101 bit $4212     [$004212] A:0022 X:8000 Y:0000 S:01ff D:0000 DB:00 nVMxdizc V: 34 H:  18
  18 = opfetch
  26 = $12 fetch
  34 = $42 fetch
  42 = $4212 fetch
* IRQ @  34,  48

23 = fast
008100 nop                     A:0023 X:8000 Y:0000 S:01ff D:0000 DB:00 nVMxdizc V: 35 H:   0
   0 = opfetch
   8 = i/o
* IRQ transition @  35,  10
008101 bit $4212     [$004212] A:0023 X:8000 Y:0000 S:01ff D:0000 DB:00 nVMxdizc V: 35 H:  14
  14 = opfetch
  22 = $12 fetch
  30 = $42 fetch
  38 = $4212 read
* IRQ @  35,  44

2a = fast
008101 bit $4212     [$004212] A:002a X:8000 Y:0000 S:01ff D:0000 DB:00 nVMxdizc V: 41 H:1350
1350 = opfetch
1358 = $12 fetch
   2 = $42 fetch
  10 = $4212 read
* IRQ transition @  42,  10
008104 bpl $8100     [$008100] A:002a X:8000 Y:0000 S:01ff D:0000 DB:00 nvMxdizc V: 42 H:  16
  16 = opfetch
  24 = rel fetch
  32 = i/o
* IRQ @  42,  38

2f = slow
008104 bpl $8100     [$008100] A:002f X:8000 Y:0000 S:01ff D:0000 DB:00 nVMxdizc V: 46 H:1360
1360 = opfetch
   4 = rel fetch
  12 = i/o
* IRQ transition @  47,  10
008100 nop                     A:002f X:8000 Y:0000 S:01ff D:0000 DB:00 nVMxdizc V: 47 H:  18
  18 = opfetch
  26 = i/o
* IRQ @  47,  32

VTIME  Transition     Trigger  TriggerAddr  TransOp Delay
   20  i/o->opfetch       i/o      $008100  nop     8+8 (16)
   21  opfetch->i/o       i/o      $008100  nop     8+8 (16)
   22  opfetch->i/o      read      $004212  bit     8+6 (14)
   23  i/o->opfetch      read      $004212  bit     8+6 (14)
   2a  r4212->opfetch     i/o      $008105  bpl     8+6 (14)
   2f  relfetch->i/o      i/o      $008100  nop     8+8 (16)
Basically, the only pattern I have been able to find is that if the actual IRQ itself (not the pin transition) occurs immediately after an NOP instruction, then the first two IRQ cycles consume 16 clocks of time -- as if it's reading from a slow memory region (8 clocks per) twice.

However, whenever the IRQ begins after either BPL or BIT, then it takes 14 clocks of time.

This doesn't make a damn bit of sense, even taking the pipeline into consideration.

If we assumed it was related to the second-to-last cycle, NOP's would indeed be 8 clocks, so we could get 8+8 that way. But the second-to-last cycles of both BIT and BPL are slow memory region fetches as well, so that doesn't work.

If we look at the very last cycle, NOP is an I/O cycle, but so is BPL; and BIT's last cycle reads from a fast memory region ($4212.) Both take 6 clock cycles.

Note that I'm fairly confident it's the last opcode before entering the IRQ that is causing this issue. Both cli and nop end up with 16 clocks (I used cli in my Mecarobot test and it was throwing the timing off -- not enough to be the cause of that bug, but it's a start), and I have other tests. All of my other IRQ latch timing tests use - lda $4212 : bpl - or - bit $4212 : bpl -, which is why I always got perfect timing with them: I went on the assumption that the delay was always 14 clocks, and with those loops it was.

So, anyone have any ideas why NOP/CLI take 16 clocks inside the IRQ, yet BIT/BPL take 14 clocks? Any low-level NES emu authors work with this issue before?
byuu

Post by byuu »

Addenda:

If the IRQ is triggered immediately after WDM or XBA, this also consumes 14 clocks in the IRQB handler.

WDM takes 16 clocks, or 8 (opfetch) + 8 (signature fetch).
XBA takes 20 clocks, or 8 (opfetch) + 6 (i/o) + 6 (i/o).

Code: Select all

[SlowROM timings]
Opcode  Cycles  Clocks  Delay
   cli       2    14     16
   nop       2    14     16
   bit       4    30     14
   bpl       3    20     14
   wdm       2    16     14
   xba       3    20     14

[FastROM timings]
Opcode  Cycles  Clocks  Delay
   nop       2    12     12
   bit       4    24     12
   bpl       3    18     12
FastROM timings all fit into my current model, as I use op_read(regs.pc) + op_io(), which would be 12 in this case.

I guess I should clarify: the observed range for the first two cycles of IRQB / NMIB is now 12-16, whereas it was only 12-14 before.

We can also rule out any kind of lower clock divisor synchronization, eg IRQ waits until (total_clocks%N)==0 ... given the large number of tests I've run (> 16,384) we would have observed a misaligned clock at least once that would result in inconsistent timing behavior based on the opcode before NMIB / IRQB. It's definitely the opcode ...

---

Ah yes, the basic loop, in case anyone thinks it might have something to do with the previous opcode as well ...

I use any of these:

Code: Select all

- bit $4212 : bpl -

Code: Select all

- nop : bit $4212 : bpl -

Code: Select all

- wdm #$00 : bit $4212 : bpl -

Code: Select all

  rep nop 65536
And get consistent results every time. Hence why I don't think it's related to the previous opcode in any way.
bobthebuilder
Hazed
Posts: 76
Joined: Sat Jan 28, 2006 7:21 am

Post by bobthebuilder »

Well if it affects the nes too, then post it here http://nesdev.parodius.com/bbs/ . They are some of the most dedicated people I have seen :P .
creaothceann
Seen it all
Posts: 2302
Joined: Mon Jan 03, 2005 5:04 pm
Location: Germany
Contact:

Post by creaothceann »

bobthebuilder wrote:Well if it affects the nes too, then post it here http://nesdev.parodius.com/bbs/ . They are some of the most dedicated people I have seen :P .
I had posted the same message, but luckily you can delete posts here. :wink:
vSNES | Delphi 10 BPLs
bsnes launcher with recent files list
bobthebuilder
Hazed
Posts: 76
Joined: Sat Jan 28, 2006 7:21 am

Post by bobthebuilder »

I forgot they had a snes forum :oops:
dvdmth
New Member
Posts: 6
Joined: Thu Feb 14, 2008 9:12 pm

Post by dvdmth »

How does this also apply to the NES? All CPU cycles are the same length on that console (12 clocks for NTSC, 16 for PAL). Also, a memory access (read or write) occurs on every CPU cycle, including the "internal operation" cycles.
byuu

Post by byuu »

Aw, didn't realize that. You guys have it easy :(
I knew you had the same pseudo-IRQ op, so I was hoping it had some interesting quirk to it that would help.

In that case, nobody's left to help. I guess I'll just have to do it the really, really hard way. Giant lookup table for all 256 opcodes to indicate the IRQ delay lengths for each opcode. I'll just have to make assumptions for the ones that are too tricky to trigger an IRQ after (BRK, etc.)
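The lookup-table fallback could look something like this minimal sketch. The entries are only the handful of SlowROM delays measured in this thread; defaulting every other opcode to 14 is an assumption, not a measured result:

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical per-opcode table of IRQ-entry clocks (the first two
// IRQB cycles) in SlowROM mode. Only a few measured opcodes are
// filled in; the 14-clock default for the rest is an assumption.
std::array<uint8_t, 256> make_irq_delay_table() {
  std::array<uint8_t, 256> table{};
  table.fill(14);
  table[0xea] = 16;  // nop
  table[0x18] = 16;  // clc
  table[0xaa] = 16;  // tax
  return table;
}
```

An emulator would index this with the opcode that completed immediately before the interrupt was taken.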
Verdauga Greeneyes
Regular
Posts: 347
Joined: Tue Mar 07, 2006 10:32 am
Location: The Netherlands

Post by Verdauga Greeneyes »

byuu wrote:I guess I'll just have to do it the really, really hard way. Giant lookup table for all 256 opcodes to indicate the IRQ delay lengths for each opcode. I'll just have to make assumptions for the ones that are too tricky to trigger an IRQ after (BRK, etc.)
There's no discernible (even without an understood basis in hardware) pattern to this that you can use?
byuu

Post by byuu »

Shouldn't have used cli, as it delays interrupt recognition by one opcode. In my case, I had cli : nop #16, so it was just using nop timing. Removed it from the table.

Updated tables:

Code: Select all

[SlowROM timings]
Opcode  Cycles  Clocks  Delay
   asl       2    14     16
   bit       4    30     14
   bpl       3    20     14
   brl       4    30     14
   clc       2    14     16
   nop       2    14     16
   tax       2    14     16
   wdm       2    16     14
   xba       3    20     14
There's no discernible (even without an understood basis in hardware) pattern to this that you can use?
The only pattern I see is that opcodes which take < 16 clocks to complete end up costing 16 clocks in the first two IRQB cycles. Those that take longer only cost 14 clocks in IRQB.

That makes absolutely no sense at all from a hardware perspective. There's no way in hell the processor is counting the total clock count for entire instructions to insert some sort of arbitrary delay. And then to add the delay only in SlowROM mode, on top of that.
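As a purely descriptive rule (no claim whatsoever about the hardware mechanism), that pattern from the table fits in one line; the function name and framing are mine:

```cpp
#include <cassert>

// Observed SlowROM pattern from the table above: opcodes totalling
// fewer than 16 clocks are followed by a 16-clock IRQ entry; longer
// opcodes by a 14-clock one. Descriptive only -- not a hardware model.
constexpr int irq_entry_delay_slowrom(int opcode_total_clocks) {
  return opcode_total_clocks < 16 ? 16 : 14;
}
```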

Sigh ...
blargg
Regular
Posts: 327
Joined: Thu Jun 30, 2005 1:54 pm
Location: USA
Contact:

Post by blargg »

I've only skimmed, but could this be due to some opcodes doing dummy accesses to the program counter, and others to other addresses, causing different clock usage due to the various bus speeds?
byuu

Post by byuu »

blargg, if you want to speak a bit quicker, I'm on IRC. If not, we can bounce ideas here, too.

Thanks a million for looking at this, even though it turns out to not be related to the NES much.
I've only skimmed, but could this be due to some opcodes doing dummy accesses to the program counter, and others to other addresses, causing different clock usage due to the various bus speeds?
Yes, that's possible. I was actually just trying that out. If it's getting speeds of 6 and 8, then it must be reading from somewhere. Well, that's something I can test with some black magic, eg:

Code: Select all

;IRQ vector
org $00f800
  lda $4211

  ;read counter, will return results of most recent latch prior to IRQ
  lda $213c : sta $2180
  lda $213c : and #$01 : sta $2180
  lda $213d : sta $2180
  lda $213d : and #$01 : sta $2180

  ;read counter twice to align half-dot
  lda $2137
  lda $213c : sta $2180
  lda $213c : and #$01 : sta $2180
  lda $213d : sta $2180
  lda $213d : and #$01 : sta $2180

  lda $2137
  lda $213c : sta $2180
  lda $213c : and #$01 : sta $2180
  lda $213d : sta $2180
  lda $213d : and #$01 : sta $2180

  ;cannot rely upon rti due to black magic tricks
  ;simulate rti, jump to stored address at $00+
  stz $4200
  pla : pla : pla : pla
  cli
  jmp [$0000]

...

test1() {
  lda.b #test1_end     : sta $00
  lda.b #test1_end>>8  : sta $01
  lda.b #test1_end>>16 : sta $02
  jsl seek_frame
  lda $2137 ;latch counters to V=0,Hdot=6
  lda #$01 : sta $4209 : stz $420a
  lda #$20 : sta $4200

  nop #84
  ;IRQ will trigger immediately after jmp $2137 completes, PC will be $002137
  ;IRQ handler's first cycle should read from this address, which will latch
  ;counters. As a result, reading counters inside IRQ should return V=1,H=7
  ;IRQ routine will reroute return so that execution does not resume at $002137
  jmp $2137
}

test1_end() {
  ;verify that test1 passed
  lda $7f0000 : cmp #$07 : beq + : jmp fail
+ lda $7f0001 : cmp #$00 : beq + : jmp fail
+ lda $7f0002 : cmp #$01 : beq + : jmp fail
+ lda $7f0003 : cmp #$00 : beq + : jmp fail
+ lda $7f0004 : cmp #$68 : beq + : jmp fail
+ lda $7f0008 : cmp #$bd : beq + : jmp fail
+
}
This test pretty much verifies that the very first cycle of IRQB is beyond a shadow of a doubt reading from PC. It's the only way it would latch the counters like that.

IRQB looks like this so far:

Code: Select all

void sCPU::op_irq() {
  op_read(regs.pc.d);
  op_io();
  if(!regs.e) op_writestack(regs.pc.b);
  op_writestack(regs.pc.h);
  op_writestack(regs.pc.l);
  op_writestack(regs.e ? (regs.p & ~0x10) : regs.p);
  rd.l = op_read(event.irq_vector + 0);
  regs.pc.b = 0x00;
  regs.p.i  = 1;
  regs.p.d  = 0;
  rd.h = op_read(event.irq_vector + 1);
  regs.pc.w = rd.w;
}
Note that I don't handle the edge case where an NMI triggers during an IRQ, or a BRK / COP during NMI / IRQ. That's going to be absolute pain.

We absolutely know that the first cycle, op_read(PC), is correct. Dummying it out as just an 8-clock delay fails the test (the counter does not latch again), and swapping that cycle with the op_io() below it also causes the test to fail (the counter latch is too far ahead.) The problem is the second cycle. It's most likely reading from somewhere ... but from where? :/

It's not from the last address on the bus, and it's not from the ABus values from the W65C816S document. It's going to be really tricky to figure out where it's reading from, but if we do, this problem should be solved ...

Now, given that FastROM mode always seems to result in 12-clock total delays, that's six for op_read(PC), and six for op_read(???). So it looks like it's sometimes reading from PC, and sometimes reading from some other address.
byuu

Post by byuu »

Narrowing down the possibilities.

Code: Select all

test2() {
  lda.b #test2_end     : sta $00
  lda.b #test2_end>>8  : sta $01
  lda.b #test2_end>>16 : sta $02
  jsl seek_frame
  lda $2137 ;latch counters to V=0,Hdot=6
  lda #$01 : sta $4209 : stz $420a
  lda #$20 : sta $4200

  nop #84
  ;read should occur from $2136, but not from $2137
  ;this means irqb() second cycle is not op_read(PC + 1)
  jmp $2136
}

test2_end() {
  lda $7f000c : cmp #$06 : beq + : jmp fail
+ lda $7f000d : cmp #$00 : beq + : jmp fail
+ lda $7f000e : cmp #$00 : beq + : jmp fail
+ lda $7f000f : cmp #$00 : beq + : jmp fail
+ lda $7f0010 : cmp #$68 : beq + : jmp fail
+ lda $7f0014 : cmp #$bd : beq + : jmp fail
+
}

test3() {
  lda.b #test3_end     : sta $00
  lda.b #test3_end>>8  : sta $01
  lda.b #test3_end>>16 : sta $02
  jsl seek_frame
  lda $2137 ;latch counters to V=0,Hdot=6
  lda #$01 : sta $4209 : stz $420a
  lda #$20 : sta $4200

  nop #84
  ;lda opcode itself will read $2137, but irqb() will not
  ;this means irqb() second cycle is not op_read(prev)
  ;eg it is not the last address fetched by the processor prior to irqb()
  lda $2137 ;latch counters to V=1,Hdot=7
}

test3_end() {
  lda $7f0018 : cmp #$07 : beq + : jmp fail
+ lda $7f0019 : cmp #$00 : beq + : jmp fail
+ lda $7f001a : cmp #$01 : beq + : jmp fail
+ lda $7f001b : cmp #$00 : beq + : jmp fail
+ lda $7f001c : cmp #$6a : beq + : jmp fail
+ lda $7f0020 : cmp #$bf : beq + : jmp fail
+
}

test4() {
  lda.b #test4_end     : sta $00
  lda.b #test4_end>>8  : sta $01
  lda.b #test4_end>>16 : sta $02
  jsl seek_frame
  lda $2137 ;latch counters to V=0,Hdot=6
  ldx #$4203 : txs
  lda #$01 : sta $4209 : stz $420a
  lda #$20 : sta $4200

  nop #80
  ;this test will write PC value below to $4203,$4202,$4201.
  ;$4201 write of #$00 will latch counters to V=1,Hdot=14
  ;this verifies that fifth cycle of irqb() is op_writestack(PC.l)
  ;which pretty much verifies the positioning of the entire
  ;sequence from the third cycle to the last cycle, eg:
  ;#1 = op_read(PC)
  ;#2 = ???
  ;#3 = if(!E) op_writestack(PC.b)
  ;#4 = op_writestack(PC.h)
  ;#5 = op_writestack(PC.l)
  ;#6 = op_writestack(P)
  ;#7 = op_read(IRQV.l)
  ;#8 = op_read(IRQV.h)
  jml $000000
}

test4_end() {
  lda $7f0024 : cmp #$14 : beq + : jmp fail
+ lda $7f0025 : cmp #$00 : beq + : jmp fail
+ lda $7f0026 : cmp #$01 : beq + : jmp fail
+ lda $7f0027 : cmp #$00 : beq + : jmp fail
+ lda $7f0028 : cmp #$7e : beq + : jmp fail
+ lda $7f002c : cmp #$d4 : beq + : jmp fail
+
}
Recap:

The first test tells us that IRQB cycle 1 is op_read(PC), and that IRQB cycle 2 is not op_read(PC).

The second test tells us that IRQB cycle 2 is not op_read(PC+1).

The third test tells us that IRQB cycle 2 is not op_read(Abus), where Abus is the last address the processor read (eg the last cycle of the opcode leading into the IRQ.)

The fourth test tells us that IRQB cycle 5 is op_write(SP--, PC.L).
We don't really need to test the other locations; the pattern of PC.B, PC.H, PC.L, P is well known and observed in WRAM, so we know those four are together. And the IRQV.L, IRQV.H fetches can only fit at the bottom, so this pretty much gives us the entire structure of IRQB / NMIB, with the exception of cycle #2.

I'm about out of ideas. If it's reading from somewhere, I don't know where. The fact that setting FastROM on makes it read fast, and setting FastROM off makes it read slow, suggests it's reading from $[80-FF]:8000 | NNNN when the last opcode is NOP.

But it's reading from somewhere else that is always FastROM when the last opcode is XBA, which also ends with two I/O cycles. Or when it's basically any opcode that is not OpFetch + I/O (eg 12 FastROM / 14 SlowROM.)

This makes absolutely no sense at all. A table is looking like the only way to emulate this. The only observable pattern is that if we're in SlowROM mode and execute a two-cycle opcode, it ends up at 14 clocks, which is less than two SlowROM accesses. Perhaps it's a synchronization thing, where the IRQ needs to push that to 16 clocks before proceeding (eg two slow memory accesses.)
byuu

Post by byuu »

Well, this test turned out to be a dead end, but I figured I'd post it, given how evil it is.

I was thinking, why not try executing an 'nop' immediately before reading $2137? Then the two cycle delay will either manifest itself before or after the latch read. It's almost certainly after, but meh. Running out of ideas.

Obvious problem was that $21xx is a fast memory region, and the delay doesn't occur there. But it was still fun figuring out how to do it.

Basically, to have PC = $002137, we have to execute a one-byte opcode out of $002136. Reading $002136 returns the high 8 bits of the signed 24-bit multiplication result of 16-bit M7A * 8-bit M7B. So nop was out -- you can't get a value of 0xea with that. So go for clc: 0x18 is reachable, since 0x6000 * 0x40 = 0x180000.

Once that was setup, use some clock-perfect timing to get the jmp to seek to $002136, read PPU1 MDR / multiplication result of 0x18 (clc), which executes and then triggers the IRQ routine, which will then read from $002137.

I guess the only cool part about this test is that I got the same results on hardware as under emulation. Cool stuff :)

Code: Select all

test5() {
  lda.b #test5_end     : sta $00
  lda.b #test5_end>>8  : sta $01
  lda.b #test5_end>>16 : sta $02
  jsl seek_frame
  lda $2137 ;latch counters to V=0,Hdot=6
  ldx #$4203 : txs
  lda #$01 : sta $4209 : stz $420a
  lda #$20 : sta $4200

  ;set PPU1 MDR to 0x18 (clc)
  lda #$00 : sta $211b
  lda #$60 : sta $211b
  lda #$00 : sta $211c
  lda #$40 : sta $211c
  ;#$6000 * #$40 = #$180000 >> 16 = #$18 (clc)

  nop #67
  jmp $2136
}
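The multiplication behind the trick can be sanity-checked in a few lines. This is just a standalone check of the arithmetic, not SNES code; $2136 (MPYH) returns the high byte of the signed 24-bit mode 7 product:

```cpp
#include <cassert>
#include <cstdint>

// High byte of the signed 24-bit mode 7 product, as read back from
// $2136 (MPYH): 16-bit signed M7A times 8-bit signed M7B.
uint8_t mpyh(int16_t m7a, int8_t m7b) {
  int32_t product = int32_t(m7a) * int32_t(m7b);
  return uint8_t((product >> 16) & 0xff);
}
```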
dvdmth
New Member
Posts: 6
Joined: Thu Feb 14, 2008 9:12 pm

Post by dvdmth »

I'm not convinced it is the second cycle of the interrupt handler that's being extended by 2 clocks. I've put some thought into it, and I'm beginning to think it's the last cycle of the preceding instruction that's being extended. In the list of opcodes above, all opcodes that cause the delay are single-byte, 2-cycle instructions. On the NES's CPU, the second cycle of an instruction is always the same (read second byte of instruction), except that single-byte instructions don't increment the PC. If the instruction is two cycles long, then the decision to invoke the interrupt happens at the start of the second cycle. Maybe the decision to invoke the interrupt at this second cycle is somehow causing the normal IO behavior to change to the old-style "always read second byte" behavior, which would cause the last cycle of the instruction to take two more clocks in SlowROM areas.
byuu

Post by byuu »

I'm not convinced it is the second cycle of the interrupt handler that's being extended by 2 clocks.
I'm almost certain that it is.

The reason? If the opcode prior to IRQB is jmp $2137, then PC = $2137, and that latches the counters. The counters are always accurate to emulation here. Given the counters are clock time >> 2 instead of clock time, I would need to test each half-dot to be 100% certain.

I've also ruled out that it's just something like the bus cycle being one cycle ahead of the work cycle (eg the two-stage pipeline.) If VDA=1 and VPA=1, then we're fetching an opcode. If those were to stay set after the first cycle of nop and into IRQB, then we'd fetch the same value again.

However, I had a test that would execute clc (same as nop) from $2136, so that PC would become $2137. If it were fetching PC twice, it would latch the counters twice. The timing reflected that it was latching the counters only once.

We could see if it's fetching PC+1 -- make clc execute out of $2135, and then another clc out of $2136 (which would be the first cycle read in IRQB). If it reads from $2137 (and thus latches the counters), then we know it's fetching PC+1.

But ... if that were the case, then all opcodes in SlowROM mode would result in 16 clock delays, because it would always be two SlowROM accesses. So it's kind of pointless to test even that.

Maybe I should just give up. I've been testing this for the past four days with zero success. It's similar to math mul / div delays, and I gave up with those. Maybe it's best to just stick to emulating what's feasibly possible and move on. It really hurts, given this is the last thing needed for perfect IRQ timing ... eh, but what are you going to do, right?

I could just stick this in an errata section or something. Offer a $50 bounty for the solution for posterity. Sigh ... why can't the SNES scene be as active as NESdev? ;_;

We are light years behind you guys.
grinvader
ZSNES Shake Shake Prinny
Posts: 5632
Joined: Wed Jul 28, 2004 4:15 pm
Location: PAL50, dood !

Post by grinvader »

byuu wrote:It's similar to math mul / div delays, and I gave up with those.
I think you're getting warm. The giving up might have to be revised.
Everyone shut up and follow me!!

Code: Select all

<jmr> bsnes has the most accurate wiki page but it takes forever to load (or something)
Pantheon: Gideon Zhi | CaitSith2 | Nach | kode54
byuu

Post by byuu »

New angle.

So we know wdm #$00 takes 16 clocks and has a 6-clock delay in IRQB cycle 2, whereas nop in SlowROM region takes 14 clocks and has an 8-clock delay. But an nop in FastROM region takes 12 clocks and also has a 6-clock delay.

I've thought of a way to determine whether the delay is due to the total cycle count being 14, or by the second cycle being I/O in a SlowROM region. Or another way to look at it -- I can determine which cycle being fast results in the second IRQB cycle being fast -- the first or the second.

I can use HiROM memory mapping, and use SRAM to put an lda #$nn (0xa9) instruction at 0xa07fff with FastROM enabled. If I trigger an IRQ immediately after the LDA, it will read the operand from fast memory, rather than slow memory.

Similarly, I can execute nop from 0x80ffff, and it will wrap around to a SlowROM region for the I/O value.

I'll post the results when I have them.
Verdauga Greeneyes
Regular
Posts: 347
Joined: Tue Mar 07, 2006 10:32 am
Location: The Netherlands

Post by Verdauga Greeneyes »

Cool, good to see you still focusing on this - looking forward to seeing those results. Keep it up!
byuu

Post by byuu »

Tests finished. Used 80ffff->800000 for fast->slow and 001fff->002000 for slow->fast.

Code: Select all

008423 jml $80ffff   [$80ffff] A:0027 X:01ff Y:0000 S:01ff D:0000 DB:00 nvMxdizC V:  0 H:1344
* IRQ edge transition @   1,  10
80ffff nop                     A:0027 X:01ff Y:0000 S:01ff D:0000 DB:00 nvMxdizC V:  1 H:  12
* IRQ @   1,  24
* IRQ.0 @   1,  24
* IRQ.1 @   1,  32
* IRQ.2 @   1,  38 (should be 40)
00f800 lda $4211     [$004211] A:0027 X:01ff Y:0000 S:01fb D:0000 DB:00 nvMxdIzC V:  1 H:  86 (should be 88)

Will execute IRQ, first cycle reads from 800000 (slow memory region.)
Result: IRQ first two cycles consume 16 clocks; second cycle is slow.

---

008423 jml $80ffff   [$80ffff] A:0027 X:01ff Y:0000 S:01ff D:0000 DB:00 nvMxdizC V:  0 H:1344
* IRQ edge transition @   1,  10
80ffff lda #$27                A:0027 X:01ff Y:0000 S:01ff D:0000 DB:00 nvMxdizC V:  1 H:  12
* IRQ @   1,  26
* IRQ.0 @   1,  26
* IRQ.1 @   1,  34
* IRQ.2 @   1,  40
00f800 lda $4211     [$004211] A:0027 X:01ff Y:0000 S:01fb D:0000 DB:00 nvMxdIzC V:  1 H:  88

Will execute IRQ, first cycle reads from 800001 (slow memory region.)
Result: IRQ first two cycles consume 14 clocks; second cycle is fast.

---

008423 jml $80ffff   [$80ffff] A:0027 X:01ff Y:0000 S:01ff D:0000 DB:00 nvMxdizC V:  0 H:1344
* IRQ edge transition @   1,  10
80ffff xba                     A:0027 X:01ff Y:0000 S:01ff D:0000 DB:00 nvMxdizC V:  1 H:  12
* IRQ @   1,  30
* IRQ.0 @   1,  30
* IRQ.1 @   1,  38
* IRQ.2 @   1,  44
00f800 lda $4211     [$004211] A:2700 X:01ff Y:0000 S:01fb D:0000 DB:00 nvMxdIZC V:  1 H:  92

Will execute IRQ, first cycle reads from 800000 (slow memory region.)
Result: IRQ first two cycles consume 14 clocks; second cycle is fast.

---

008426 jml $001fff   [$001fff] A:00ea X:01ff Y:0000 S:01ff D:0000 DB:00 NvMxdizC V:  0 H:1344
* IRQ edge transition @   1,  10
001fff nop                     A:00ea X:01ff Y:0000 S:01ff D:0000 DB:00 NvMxdizC V:  1 H:  12
* IRQ @   1,  26
* IRQ.0 @   1,  26
* IRQ.1 @   1,  32
* IRQ.2 @   1,  38
00f800 lda $4211     [$004211] A:00ea X:01ff Y:0000 S:01fb D:0000 DB:00 NvMxdIzC V:  1 H:  86

Will execute IRQ, first cycle reads from 002000 (fast memory region.)
Result: IRQ first two cycles consume 12 clocks; second cycle is fast.

---

008426 jml $001fff   [$001fff] A:00a9 X:01ff Y:0000 S:01ff D:0000 DB:00 NvMxdizC V:  0 H:1344
* IRQ edge transition @   1,  10
001fff lda #$00                A:00a9 X:01ff Y:0000 S:01ff D:0000 DB:00 NvMxdizC V:  1 H:  12
* IRQ @   1,  26
* IRQ.0 @   1,  26
* IRQ.1 @   1,  32
* IRQ.2 @   1,  38
00f800 lda $4211     [$004211] A:00a9 X:01ff Y:0000 S:01fb D:0000 DB:00 NvMxdIzC V:  1 H:  86

Will execute IRQ, first cycle reads from 002001 (fast memory region.)
Result: IRQ first two cycles consume 12 clocks; second cycle is fast.

---

008426 jml $001fff   [$001fff] A:00eb X:01ff Y:0000 S:01ff D:0000 DB:00 NvMxdizC V:  0 H:1344
* IRQ edge transition @   1,  10
001fff xba                     A:00eb X:01ff Y:0000 S:01ff D:0000 DB:00 NvMxdizC V:  1 H:  12
* IRQ @   1,  32
* IRQ.0 @   1,  32
* IRQ.1 @   1,  38
* IRQ.2 @   1,  44
00f800 lda $4211     [$004211] A:eb00 X:01ff Y:0000 S:01fb D:0000 DB:00 nvMxdIZC V:  1 H:  92

Will execute IRQ, first cycle reads from 002000 (fast memory region.)
Result: IRQ first two cycles consume 12 clocks; second cycle is fast.
Essentially, this shows that the second IRQB cycle consumes 8 clock cycles only when PC inside the IRQB points to a SlowROM region. The speed of the actual nop opcode bus read (nop's first cycle) doesn't matter.

Or, on a larger scale, it means that the total speed of the opcode really doesn't make a difference. Eg there is no 14<>16 clock alignment or something bizarre like that. This is shown by the second test (and others), where lda #$nn consumes 14 clocks by way of making the first cycle slow, and second cycle fast. Yet even at 14 clocks, you don't get the odd 8-clock second cycle of IRQB.

I would think that somehow these two-cycle single-fetch opcodes were causing VDA, VPA to be set in such a way that the IRQ routine would try and read from PC twice. But that isn't the case at all -- the PPU1 MDR test from before would have latched the counters twice. If that happened, the time would not match emulation where it was latched only once.

So basically, I know enough to completely emulate the behavior, but I have absolutely zero understanding of how or why this behavior occurs. Worse yet is that it pretty much flat out requires a lookup table.
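Putting the measurements together, the behavior (whatever its hardware cause turns out to be) can be sketched as a rule. speed_of() here is a deliberately simplified stand-in for the real memory map, and classifying the previous opcode by "opfetch + I/O" shape is my assumption based on the opcodes tested:

```cpp
#include <cassert>
#include <cstdint>

enum { kFastClocks = 6, kSlowClocks = 8 };

// Grossly simplified region-speed stand-in -- NOT the real SNES map.
// Here only offsets $2000-$5fff (in any bank) are treated as fast.
int speed_of(uint32_t addr) {
  uint16_t offset = addr & 0xffff;
  return (offset >= 0x2000 && offset < 0x6000) ? kFastClocks : kSlowClocks;
}

// Rule fitted to the measurements above: IRQB cycle 2 stretches to
// 8 clocks only when the interrupted opcode was a two-cycle
// opfetch + I/O instruction (nop, clc, ...) AND PC sits in a slow
// region; otherwise it is a normal 6-clock I/O cycle.
int irqb_cycle2_clocks(bool prev_opcode_was_opfetch_io, uint32_t pc) {
  if (prev_opcode_was_opfetch_io && speed_of(pc) == kSlowClocks)
    return kSlowClocks;
  return kFastClocks;
}
```

Against the six tests above: nop into $800000 gives 8+8 = 16, nop into $002000 gives 6+6 = 12, and lda/xba always get the 6-clock second cycle regardless of region.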
looking forward to seeing those results
... why? :/
Verdauga Greeneyes
Regular
Posts: 347
Joined: Tue Mar 07, 2006 10:32 am
Location: The Netherlands

Post by Verdauga Greeneyes »

byuu wrote:
looking forward to seeing those results
... why? :/
The results and your interpretation of them; I can understand maybe half of the actual results, but it's enough to feel at least somewhat involved in the process; besides, my post would've been terribly short without it, and I didn't want it to feel like 'SPAM!' when it was meant as a show of interest ;)
byuu

Post by byuu »

How embarrassing. My PPU1 MDR test earlier was flawed.

dvdmth mentioned that he believed it was the last cycle of NOP / CLC that was extended, rather than the second cycle of IRQ.

I believed this to not be the case, because I had written a test which would set $2136 to read back CLC. This would cause PC to become $2137 (SLHV), which would cause the first cycle of IRQB to latch the counters.

My emulated results showed the same time as hardware, even when accounting for half-dot misalignment (you can get the same counter value on clock and clock+2, so long as (clock&2)==0, because the counter result is divided by four to show the dot rather than clock position.)
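The half-dot ambiguity is just integer division: the latched H counter reports the master-clock position divided by four, so clock and clock+2 can read back as the same dot. A tiny illustration (the helper name is mine):

```cpp
#include <cassert>

// Latched H counter value: horizontal master-clock position / 4.
int latched_hdot(int hclock) { return hclock >> 2; }
```

For example, H clocks 28 and 30 both read back as dot 7 (since 28&2 == 0), while 26 reads back as dot 6.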

But this test was flawed -- I can't reproduce the slow cycle when working from a purely fast speed memory region. That is, there was no slow cycle in the entire process. CLC opfetch was fast, CLC I/O was fast, IRQB $2137 read was fast, and IRQB cycle 2 was fast.

Further -- I knew this part, but the latch didn't actually matter. If CLC's second cycle were to read from $2137, the IRQB cycle 1 read from $2137 would update the counters again before I had a chance to read them back anyway. The only thing I was able to verify successfully was that IRQB cycle 2 was definitely not reading from $2137.

Code: Select all

008441 jmp $2136     [$002136] A:0040 X:01ff Y:0000 S:01ff D:0000 DB:00 nvMxdizC V:  0 H:1354
* IRQ edge transition @   1,  10
002136 clc                     A:0040 X:01ff Y:0000 S:01ff D:0000 DB:00 nvMxdizC V:  1 H:  14
* IRQ @   1,  26
* IRQ.0 @   1,  26
* Latched counters @   1,  28 =   1,  7
* IRQ.1 @   1,  32
* IRQ.2 @   1,  38
00f800 lda $4211     [$004211] A:0040 X:01ff Y:0000 S:01fb D:0000 DB:00 nvMxdIzc V:  1 H:  86

<>

008441 jmp $2136     [$002136] A:0040 X:01ff Y:0000 S:01ff D:0000 DB:00 nvMxdizC V:  0 H:1354
* IRQ edge transition @   1,  10
002136 clc                     A:0040 X:01ff Y:0000 S:01ff D:0000 DB:00 nvMxdizC V:  1 H:  14
* Latched counters @   1,  22 =   1,  5
* IRQ @   1,  26
* IRQ.0 @   1,  26
* Latched counters @   1,  28 =   1,  7
* IRQ.1 @   1,  32
* IRQ.2 @   1,  38
00f800 lda $4211     [$004211] A:0040 X:01ff Y:0000 S:01fb D:0000 DB:00 nvMxdIzc V:  1 H:  86
I have one more test to run now. I'll post the results shortly.
byuu

Post by byuu »

Final test complete. We nailed it!

Code: Select all

test5() {
  lda.b #test5_end     : sta $00
  lda.b #test5_end>>8  : sta $01
  lda.b #test5_end>>16 : sta $02
  jsl seek_frame
  lda $2137 ;latch counters to V=0,Hdot=6
  lda #$01 : sta $4209 : stz $420a
  lda #$20 : sta $4200

  nop #83
  jmp $217f
}

test5_end() {
  lda $7f0030 : cmp #$00 : beq + : jmp fail
+ lda $7f0031 : cmp #$00 : beq + : jmp fail
+ lda $7f0032 : cmp #$06 : beq + : jmp fail
+ lda $7f0033 : cmp #$00 : beq + : jmp fail
+ lda $7f0034 : cmp #$00 : beq + : jmp fail
+ lda $7f0035 : cmp #$00 : beq + : jmp fail
+ lda $7f0036 : cmp #$7a : beq + : jmp fail
+ lda $7f003a : cmp #$cf : beq + : jmp fail
+
}

Code: Select all

00843e jmp $217f     [$00217f] A:0020 X:01ff Y:0000 S:01ff D:0000 DB:00 nvMxdizC V:  0 H:1354
* IRQ edge transition @   1,  10
00217f brk #$00                A:0020 X:01ff Y:0000 S:01ff D:0000 DB:00 nvMxdizC V:  1 H:  14
* 2180 read @   1,  22
* IRQ @   1,  26
* IRQ.0 @   1,  26
* 2180 read @   1,  28
* IRQ.1 @   1,  32
* IRQ.2 @   1,  38
00f800 lda $4211     [$004211] A:0020 X:01ff Y:0000 S:01fb D:0000 DB:00 nvMxdIzc V:  1 H:  86
This test is nothing short of miraculous. It's built upon years of reverse engineering and knowledge ... I'm going to write an article about it.

But to summarize here: I was never able to get IRQB cycle 2 to read from any memory address -- I couldn't get it to latch SHLV, increment WMDATA, or trigger a fast->slow or slow->fast ROM region speed change. All signs pointed to it being an I/O cycle, and in the end, that turns out to be exactly what it was all along.

The test above ... where to begin?

Okay, it uploads a small SPC700 program:

Code: Select all

main:
  mov $f4,#$18
  mov $f5,#$18
  mov $f6,#$18
  mov $f7,#$18

  bra main
This makes $2140-$2143 (mirrored throughout $2140-$217f) return #$18, the CLC opcode. Next, I use my seek_frame counter trick to align the counters to V=0,Hclock=0. From there, I execute just enough instructions and then jump program execution right into the memory-mapped I/O registers.

The first time, CLC is fetched from $217f (which is S-SMP port $f7 above.) It immediately increments PC to $2180, which is WMDATA. After this cycle, the bus points to the start of a new opcode, so the IRQ test is performed, and it succeeds. This means IRQB is triggered instead of the next opcode.

Now, if dvdmth's theory is correct, the I/O cycle should see that an IRQ is pending and perform a read instead. But read from where? The most logical guess was $2180 (i.e. PC+1), the same address that IRQB cycle 1 reads.

We know for certain that IRQB cycle 1 would attempt to read from $2180, and that IRQB cycle 2 would not, due to earlier tests trying to get it to do that with $2137.

Background: WMDATA, or $2180, is a special register for writing to WRAM. You set the address in $2181-$2183, and then whenever you read or write to it, it will automatically increment the address.

So basically, the WRAM write address would be incremented only once if only IRQB read from PC+1, but it would be incremented twice if both IRQB cycle 1 and CLC cycle 2 read from PC+1.

As it turns out, WMDATA is indeed incremented twice!! So this pretty much verifies the entire IRQB routine, as well as a special new behavior of all immediate-mode opcodes.

And it only took a week of 6-8 hour days of testing to figure it out. The coolest part is that it would have been absolutely fucking impossible to figure this out if not for quite literally all of the previous work we've done in the past. That's what I want to write about in an article.

Anyway, the best part about this is how easy it is to emulate. There's no need for cycle counting, lookup tables, or any nonsense like that.

Instead, we just change the opcodes themselves.

CLC changes from:

Code: Select all

void op_clc() {
  last_cycle();
  op_io();
  regs.p.c = 0;
}
To:

Code: Select all

void op_clc() {
  last_cycle();
  if(event.irq) op_read(regs.pc.d);
  else op_io();
  regs.p.c = 0;
}
Do this for each of the roughly twenty immediate-mode opcodes, and the effect is fully emulated. Best of all, this isn't just a theory made to match observed results: hundreds of tests have narrowed the behavior down and ruled out every other possibility.

So yeah, case closed. Thank you everyone for the help.
Last edited by byuu on Tue Apr 08, 2008 4:08 pm, edited 1 time in total.
franpa
Gecko snack
Posts: 2374
Joined: Sun Aug 21, 2005 11:06 am
Location: Australia, QLD
Contact:

Post by franpa »

Does this fix resolve any game issues? Or are you still checking that out?
Core i7 920 @ 2.66GHZ | ASUS P6T Motherboard | 8GB DDR3 1600 RAM | Gigabyte Geforce 760 4GB | Windows 10 Pro x64
Verdauga Greeneyes
Regular
Posts: 347
Joined: Tue Mar 07, 2006 10:32 am
Location: The Netherlands

Post by Verdauga Greeneyes »

byuu wrote:And it only took a week of 6-8 hour days of testing to figure it out. The coolest part is that it would have been absolutely fucking impossible to figure this out if not for quite literally all of the previous work we've done in the past. That's what I want to write about in an article.
Six to eight hours a day? Wew! How do you have time/energy for anything else? :) Either way, congratulations! That's a major unknown-cause bug off your errata page just a day or two after you created it ;) By the way, will the test you invented to obtain this result give you any sort of edge for figuring out the NMI-related edge case problem below it? (not that I'm suggesting anything - you deserve a break!)
Jipcy
Veteran
Posts: 768
Joined: Thu Feb 03, 2005 8:18 pm
Contact:

Post by Jipcy »

Glad to see you finally achieved success on that one! I imagine it feels good to be doing some real emulation fixes again.

How did you stumble upon this bug? Is it related to Mecarobot?

I really do enjoy watching the whole process, even if I only vaguely understand it.
[url=http://zsnes-docs.sf.net]Official ZSNES Docs[/url] | [url=http://zsnes-docs.sf.net/nsrt]NSRT Guide[/url] | [url=http://endoftransmission.net/phpBB3/viewtopic.php?t=394]Using a Wiimote w/ emulators[/url]