Latch timing

DMV27 · Post by **DMV27** » Sat Feb 05, 2005 4:05 pm

byuusan wrote: Assume: m=0/x=1 {
ldy #$00 : lda $20ff,y
} -> This will *not* trigger cycle 3a. The first read does not cross the page boundary, and even though the second read does (reads from $2100), it doesn't count. This should be expected, given cycle 3a couldn't occur after cycle 4, but it's still weird. Why is cycle 3a even needed, then? If it's able to cross the page boundary on the second read with no overhead, why is there overhead during the first cycle?

Cycle 3a is an optimization done by the cpu pipelining. If AddrLow + X does not cause a carry, then cycle 3a will not occur. If it does carry, then the cpu will have to wait until 3a for AddrHigh to finish being read before it can add in the carry. The second read does not have any overhead because the address is fully incremented during the first read. From page 57 of the w65816 datasheet:

When VDA or VPA are high and during all write cycles, the Address Bus is always valid. VDA and VPA should be used to qualify all memory cycles. Note that when VDA and VPA are both low, invalid addresses may be generated. The Page and Bank addresses could also be invalid. This will be due to low byte addition only. The cycle when only low byte addition occurs is an optional cycle for instructions which read memory when the Index Register consists of 8 bits. This optional cycle becomes a standard cycle for the Store instruction, all instructions using the 16-bit Index Register mode, and the Read-Modify-Write instruction when using 8- or 16-bit Index Register modes.
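The rule in the datasheet passage above can be sketched in C as follows. This is only an illustration of the quoted rule, not code from any emulator; the names `access_kind` and `indexed_extra_cycle` are made up for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Kind of access performed by an absolute-indexed instruction. */
typedef enum { ACCESS_READ, ACCESS_WRITE, ACCESS_RMW } access_kind;

/*
 * Returns true when the extra internal cycle (3a) is taken.
 *   base    : 16-bit operand address, e.g. 0x20ff
 *   index   : value of X or Y
 *   index_8 : true when x=1 (8-bit index registers)
 */
bool indexed_extra_cycle(uint16_t base, uint16_t index, bool index_8,
                         access_kind kind) {
  /* Stores, read-modify-write and 16-bit index modes always take it. */
  if (kind != ACCESS_READ || !index_8) return true;
  /* 8-bit index read: only when the low-byte add carries out. */
  return ((base & 0xff) + (index & 0xff)) > 0xff;
}
```

With x=1, `lda $20ff,y` and Y=$00 gives no carry and therefore no cycle 3a, while `lda $2138,y` with Y=$FF does carry.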

Here is the example from page 39 of the Programming Manual:

Consider, as an example, the loading of the accumulator using absolute indexed addressing (two lines for a cycle indicate simultaneous operations due to pipelining):
Code:
Cycle 1: Fetch the instruction opcode, LDA.
Cycle 2: Fetch the operand byte, the low byte of an array base.
         Interpret the opcode to be LDA absolute indexed.
Cycle 3: Fetch the second operand byte, the high array base byte.
         Add the contents of the index register to the low byte.
Cycle 4: Add the carry from the low address add to the high byte.
Cycle 5: Fetch the byte at the new effective memory address.
(NOTE: The 6502 also does a fetch during Cycle 4, before it checks to see if there was any carry; if there is no carry into the high byte of the address, as is often true, then the address fetched from was correct and there is no cycle five; the operation is a four-cycle operation in this case. Absolute indexed writes, however, always require five cycles.)

anomie · Post by **anomie** » Sat Feb 05, 2005 4:09 pm

byuusan wrote: In order to get DRAM refresh matching my SNES with the new cycle-by-cycle core, I actually had to make it occur mid-opcode.

We figured that out recently, IIRC.

I tried to do the test again tonight, but unfortunately, it does not work after resetting the SNES. I just get a blank screen. I tried it ~10 times, and tried resetting immediately when the program started and whatnot. Nothing seems to work :/ The program works fine when I reset it in my emulator.

I've noticed that bug too, but i haven't had a chance to track it down. I forget if I fixed it in that one, but my NMI routine is buggy as well (maybe that's the same bug?).

I made a test program that executed:
clc : xce
rep #$20 : sep #$20 ;... repeat these two 32 times for a total of 64 opcodes
lda $2137

And I got the latch value 0001:0051. To match this through emulation, I had to remove cycle 2a from rep/sep. Therefore, either the document is wrong, or this is a quirk specific to the SNES version of the CPU. But rep/sep is only two cycles, one opcode fetch and one operand fetch.

Odd. I just modified the standard test to do REP #$80 or SEP #$80 rather than LDA $00. Either way, 0 => $0044, 1 => $0049, and 2 => $004f. All consistent with 22 master cycles per SEP/REP. Same results with SEP #$20.

If that's 51 hex (not 51 decimal), that also seems to me to match with 22 master cycles: your 188 init, 14 CLC, 14 XCE, 22*64 for the SEPs and REPs, and your 24 for the LDA $2137 before latch. Grand total 1648 master cycles into the frame. First scanline takes off 1324, leaving 324 cycles into scanline 1 => 0x51 dots.
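The arithmetic above can be checked with a few lines of C. Every constant is a figure quoted in the post; the only thing added is the 4 master cycles per dot implied by 324 cycles => 0x51 dots. The function names are made up for the example.

```c
/* All values are the figures quoted in the post, in master cycles. */
enum {
  INIT_CYCLES       = 188,  /* position of the first CPU cycle after reset */
  CLC_CYCLES        = 14,
  XCE_CYCLES        = 14,
  REP_SEP_CYCLES    = 22,   /* per REP or SEP, including the i/o cycle */
  REP_SEP_COUNT     = 64,
  LDA_BEFORE_LATCH  = 24,   /* part of LDA $2137 that runs before the latch */
  FIRST_LINE_CYCLES = 1324, /* what the first scanline "takes off" */
  CYCLES_PER_DOT    = 4
};

/* Master cycles elapsed since the start of the frame at the latch. */
unsigned cycles_into_frame(void) {
  return INIT_CYCLES + CLC_CYCLES + XCE_CYCLES +
         REP_SEP_CYCLES * REP_SEP_COUNT + LDA_BEFORE_LATCH;
}

/* Horizontal counter (in dots) on scanline 1 at the latch. */
unsigned latched_hcounter(void) {
  return (cycles_into_frame() - FIRST_LINE_CYCLES) / CYCLES_PER_DOT;
}
```

`cycles_into_frame()` comes to 1648 and `latched_hcounter()` to 81, i.e. 0x51, which matches the latched value 0001:0051.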

Now for the weird one.
Assume: m=0/x=1 {
ldy #$00 : lda $20ff,y
} -> This will *not* trigger cycle 3a.

I would expect that. The cycle 3a is because of the pipelining: if Z=Addr+Y doesn't carry, we can load immediately on the 8-bit add. The CPU speculatively does this read. If there was a carry, then we have to add the carry and re-do the read with the new (correct) address. OTOH, reading the second byte for a word read just reads from Z++, which it can apparently just do for a 24-bit value without worrying.

Here's an interesting question for you: Set Y=$FF, and do LDA $2138,Y. Does it latch? Or set up an IRQ, SEI, WAI, and LDA $4212,Y. Does it clear the pending IRQ?

And if you feel like it, set Y=$01 and DB=$7e, and verify if LDA $FFFF,Y will load from $7F:0000 or $7E:0000 (and ditto for Y=$00 and the high byte of a word read).

[later]

Ok, verified that writing 0 to $4201 latches 4 master cycles later than reading $2137, and that when the bit is 0 reading $213f doesn't reset the flag. The flag applies to any method of latching, BTW. And the PPU Speed Test seems to work if i fix my NMI routine bug. I should email you the fixed version...

byuu · Post by **byuu** » Sat Feb 05, 2005 4:29 pm

If that's 51 hex (not 51 decimal), that also seems to me to match with 22 master cycles: your 188 init, 14 CLC, 14 XCE, 22*64 for the SEPs and REPs, and your 24 for the LDA $2137 before latch. Grand total 1648 master cycles into the frame. First scanline takes off 1324, leaving 324 cycles into scanline 1 => 0x51 dots.

Oh my god do I need sleep, bad. I'm sorry for the misinformation (again), I didn't even bother to work it out. I put *s next to the cycles I don't have implemented, saw the * in rep/sep and so I assumed it wasn't there.
The actual opcode function has the i/o cycle in there.

Code:

void g65816_op_rep(void) {
  gx816->op.r.b = gx816->read_operand(1); //1,2 [op fetch]
  snes_time->add_cpu_icycles(1);          //3 [i/o]
  gx816->regs.p &= ~gx816->op.r.b;
  g65816_incpc(2);
  if(gx816->regs.e == true) {
    gx816->regs.p |= 0x30;
  }
}

Man... I really gotta start going over stuff more before posting here >_>

I would expect that. The cycle 3a is because of the pipelining: if Z=Addr+Y doesn't carry, we can load immediately on the 8-bit add. The CPU speculatively does this read.

Ok, that leads to a lot of questions.
Let's say Y=$FF, LDA $4120,Y. Will the speculative read cycle end up costing 12 master cycles for $41xx region instead of 6 for $42xx region?
I'll try the Y=$FF, LDA $2138,Y one and see if it latches tonight.

anomie · Post by **anomie** » Sat Feb 05, 2005 4:37 pm

byuusan wrote: Ok, that leads to a lot of questions.
Let's say Y=$FF, LDA $4120,Y. Will the speculative read cycle end up costing 12 master cycles for $41xx region instead of 6 for $42xx region?

Assuming VDA and VPA are both output as 0 for the cycle, I'm thinking that it'll only be 6 cycles no matter what... Wouldn't hurt to test it though.

[later]

Verified my theory on the NMI/IRQ delay, LDX #$2000 / STX $00 / NOP / LDA ($00) / NOP does a Fast/Fast delay between LDA ($00) and either of the NOPs, only STX $00 / NOP does Slow/Fast.

Also, i think i've verified the pipelining theory of IRQ triggering. I set up an IRQ / SEI / WAI / STZ $00 / LDA #$01 / CLI / LDA #$42, with the IRQ routine storing A to $00 as well. And $00 ends up as 0x42, not 0 or 1: the IRQ doesn't trigger until after the instruction after the CLI. Trading the CLI for an RTI (properly set up to return to the very next instruction) gives 1, not 0x42: the IRQ triggers immediately after the RTI.

Overload · Post by **Overload** » Sun Feb 06, 2005 3:28 am

byuusan wrote:
Something else that might be worth mentioning. Clearing bit 7 of $4201 also disables the ability to clear the latch signal when you read $213f. If you set bit 7 of $4201, then clear it, the PPU2 will latch and bit 6 of $213f is set. Because latching is now disabled, no matter how many times you read $213f the latch signal will not clear until you set bit 7 of $4201.
Does this apply to $2137, too? Like say $4201 bit 7 is set, and $213f bit 6 is clear. When I read $2137, I take it the next read from $213f will have bit 6 set, and all subsequent reads will have $213f bit 6 cleared?

That is correct.

byuusan wrote: By the way Overload, do you have any knowledge on the Super UFO 8 line of copiers, or ideas of how/why open bus would not be mapped for me? I'd really need to get that working on my copier in order for me to emulate it properly, and I really can't wait to get started on this open bus thing.

I own a Game Doctor 3 (NTSC) and a Super Wildcard DX2 (PAL). I don't know much about the Super UFO line of copiers. I have looked at the Super UFO 8 BIOS before and I know the copier has registers on the B-Bus. Are you saying that either bus fails to give open bus or just the B-Bus?

Easiest way to detect Open Bus:

Code:

lda.l $002000
cmp.l $102000
bne OpenBusDetected

byuu · Post by **byuu** » Sun Feb 06, 2005 3:57 am

Are you saying that either bus fails to give open bus or just the B-Bus?

I've only tested one.

PEA $21c2-$ff
PLD
LDX $ff ; will read $21c2

That returns 0x00 instead of 0xff. I'll read over the notes and try the other bus, and maybe those PPU1/2 ones. I'll also give your code a try.

I have looked at the Super UFO 8 BIOS before and I know the copier has registers on the B-Bus.

Hmmm... hopefully I don't discover one by accident during my testing.

byuu · Post by **byuu** » Sun Feb 06, 2005 7:09 am

Ok, verified that writing 0 to $4201 latches 4 master cycles later than reading $2137, and that when the bit is 0 reading $213f doesn't reset the flag. The flag applies to any method of latching, BTW. And the PPU Speed Test seems to work if i fix my NMI routine bug. I should email you the fixed version...

Awesome. I'll add the 4 cycle delay tonight, then. I wonder if it's because of $21xx vs $42xx region, or read vs. write, and if it's latch specific or not. Writing to $2118 and reading from $2139 could determine this.
I got your new test and it works on my UFO, thanks for fixing it. Unfortunately, the SRAM doesn't match through emulation yet :(
But they are extremely close, off by no more than 2 on every cycle.
I blame the difference on bad NMI/IRQ timing, which my two demos don't use. It ended up with dots 321/325 being the long ones again, by the way.

Here's an interesting question for you: Set Y=$FF, and do LDA $2138,Y. Does it latch?

No. VDA=0/VPA=0 for this cycle so you were right to assume it's always an I/O cycle. No read is performed here ever, and it is always 6 master cycles long. I tested this by doing:
lda $2137
ldy #$ff
lda $2138,y

The latch position was the usual $0000:$0035, meaning only the first one latched the counter. By the way, do you know if the latch counters are reset to zero upon resetting the SNES? Or do they stay the same as the last time they were latched before the reset?

Overload's test fails on the UFO8:
lda.l $002000
cmp.l $102000
bne OpenBusDetected

The branch is not taken as $002000 and $102000 are identical (both 00).
The PPU1/2 Open Bus stuff is too vague for me to follow. I'll put that off until I get a new copier as well.

anomie: Wanted to mention something about your ob-wrap.txt file. It almost sounds like you're saying jmp (addr) will end up jumping to $00:(addr). Even though it's a jmp and not a jml so it should be obvious it ends up in the same pbr, you may want to add a little note as such. The jmp (addr,x) note was what initially confused me. Great document, btw. I was wrapping within the bank on word reads from DBR:absolute_addr before reading it.

More info on $213f bit 6.
Take the following example:

Code:

  clc : xce

  lda #$00 : sta $4201
  lda $213f : sta $7ec000 ;bit 6 is set
  lda $213f : sta $7ec001 ;bit 6 is set
  lda #$ff : sta $4201
  lda $213f : sta $7ec002 ;bit 6 is set
  lda $213f : sta $7ec003 ;bit 6 is clear

Basically, when you disable latching by writing 0 to bit 7 of $4201, bit 6 is set in $213f. Every time you read from $213f, bit 6 will be set. Once you write a 1 back to bit 7 of $4201, the next read from $213f will still have bit 6 set. Subsequent reads will then have it clear. Implementation:

Code:

byte mmio_r213f(void) {
  byte r = 0x00;
  //set bits 7, 5, 4
  if(!(ppu.io4201 & 0x80)) {
    r |= 1 << 6;
  } else if(ppu.counter_latched == true) {
    r |= 1 << 6;
    ppu.counter_latched = false;
  }
  return r;
}

void mmio_w4201(byte value) {
  if((ppu.io4201 & 0x80) && !(value & 0x80)) {
    //latch counters
    ppu.counter_latched = true;
  }
  ppu.io4201 = value;
}

byte mmio_r2137(void) {
  if(ppu.io4201 & 0x80) {
    //latch counters
    ppu.counter_latched = true;
  }
  return 0x00; //open bus
}

ppu.counter_latched should be set to false at power on/reset. In other words, reading from $213f as the first opcode will result in bit 6 being clear.

anomie · Post by **anomie** » Sun Feb 06, 2005 1:41 pm

byuusan wrote: No. VDA=0/VPA=0 for this cycle so you were right to assume it's always an I/O cycle. No read is performed here ever, and it is always 6 master cycles long.

I wasn't sure about whether the read would go on the bus or not, it's good to know it doesn't.

By the way, do you know if the latch counters are reset to zero upon resetting the SNES? Or do they stay the same as the last time they were latched before the reset?

No idea.

The PPU1/2 Open Bus stuff is too vague for me to follow. I'll put that off until I get a new copier as well.

A simple test for each: PPU1, read $2134, then read $2104 and see if you get the same thing back. You can easily set $2134 with the multiplication thing. PPU2, read $213d (get the low byte of V), then $213c (low byte of H), then $213d again (high bit of V) and look at the garbage bits 1-7 and see if they match the $213c value.

anomie: Wanted to mention something about your ob-wrap.txt file. It almost sounds like you're saying jmp (addr) will end up jumping to $00:(addr). Even though it's a jmp and not a jml so it should be obvious it ends up in the same pbr, you may want to add a little note as such. The jmp (addr,x) note was what initially confused me.

Hmmm... You're probably right.

Great document, btw.

Thanks! BTW, if you feel like verifying any of it that would be helpful, just in case I made any weird mistakes.

Basically, when you disable latching by writing 0 to bit 7 of $4201, bit 6 is set in $213f. Every time you read from $213f, bit 6 will be set. Once you write a 1 back to bit 7 of $4201, the next read from $213f will still have bit 6 set. Subsequent reads will then have it clear. Implementation:

I would probably do this instead:

Code:

byte mmio_r213f(void) {
  byte r = 0x00;
  //set bits 7, 5, 4
  if(ppu.counter_latched == true) {
    r |= 1 << 6;
    if(ppu.io4201 & 0x80) ppu.counter_latched = false;
  }
  return r;
}
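As a sanity check, the shorter $213f handler can be replayed against the four-read test from the earlier post. This is a minimal, self-contained sketch: the `ppu` struct is a stand-in for the real emulator state, and starting with `io4201` = $ff is an assumption about the power-on value, not something verified in this thread.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint8_t byte;

/* Stand-in state; io4201 = 0xff at power-on is an assumption. */
static struct {
  byte io4201;
  bool counter_latched;
} ppu = { 0xff, false };

void mmio_w4201(byte value) {
  /* A 1->0 transition on bit 7 latches the counters. */
  if ((ppu.io4201 & 0x80) && !(value & 0x80)) ppu.counter_latched = true;
  ppu.io4201 = value;
}

byte mmio_r213f(void) {
  byte r = 0x00;
  if (ppu.counter_latched) {
    r |= 1 << 6;
    /* The flag only clears while bit 7 of $4201 is set. */
    if (ppu.io4201 & 0x80) ppu.counter_latched = false;
  }
  return r;
}

static unsigned bit6(byte v) { return (v >> 6) & 1u; }

/* Replays the test; bit n of the result is set when read n saw bit 6 set. */
unsigned replay_213f_test(void) {
  unsigned seen = 0;
  ppu.io4201 = 0xff;
  ppu.counter_latched = false;
  mmio_w4201(0x00);
  seen |= bit6(mmio_r213f()) << 0;
  seen |= bit6(mmio_r213f()) << 1;
  mmio_w4201(0xff);
  seen |= bit6(mmio_r213f()) << 2;
  seen |= bit6(mmio_r213f()) << 3;
  return seen;
}
```

`replay_213f_test()` returns 0x7: the first three reads have bit 6 set and the fourth has it clear, the same pattern as the hardware results stored at $7ec000-$7ec003.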

byuu · Post by **byuu** » Sun Feb 06, 2005 2:44 pm

Thanks, I'll try out the PPU1/2 tests next time I drag the UFO out.

Thanks! BTW, if you feel like verifying any of it that would be helpful, just in case I made any weird mistakes.

I did verify most of them as I was going through. Specifically the absolute and absolute-indexed stuff. The indirect stuff was the same as I had, but I didn't test it, nor did I play with the emulation mode stack page boundaries (pea/pld and such). What I tried worked great.
BTW, I'm only verifying the information because I feel the more times something is confirmed, the better. I certainly trust your notes, and they've always been right in the past (well, at least a lot more than mine) :)

I would probably do this instead:

Man, you're good at this.
I've always considered myself fairly proficient with writing optimized c++, but you always seem to do better than me. Especially evident with the longer dots thing a few pages back (which, by the way, I moved into a nice table to support the $4201 4-cycle delay). Any secrets you care to share? :)

anomie · Post by **anomie** » Sun Feb 06, 2005 9:22 pm

byuusan wrote: BTW, I'm only verifying the information because I feel the more times something is confirmed, the better.

That's one of the major reasons I want you to verify everything.

Man, you're good at this.
I've always considered myself fairly proficient with writing optimized c++, but you always seem to do better than me. Especially evident with the longer dots thing a few pages back (which, by the way, I moved into a nice table to support the $4201 4-cycle delay). Any secrets you care to share? :)

Thanks! I don't have any secrets though, i just look at things and somehow come up with code that's usually not too terrible.

byuu · Post by **byuu** » Wed Feb 09, 2005 3:00 pm

Ok, I ran some tests on NMI.

The $4210 bit 7 seems to be set and cleared at exactly the same time as $4212 bit 7 (set = 0.5x, 225/240y ; clear = 0.5x, 0y). Your tests indicate that vblank clear/set is at 1.0x, so if that's right, NMI would also be 1.0x.
$4210 bit 7 is set/cleared regardless if $4200 bit 7 (nmi enable) is set or not.

Now, take the following code:

Code:

nmi:
  pha
  lda $4210
  pla
  rti

reset:
  lda #$80 : sta $4200
- lda $4210 : bpl -

Both the reset routine and the nmi routine read from $4210. Reading from $4210 returns bit 7 set only once; every subsequent read returns it clear. So if the NMI routine always cleared bit 7, this loop would never end. Chrono Trigger's intro does this right before the first battle screen (I think it's part of the battle engine).
The trick is that the NMI flag can be raised during the opcode, whereas the NMI routine doesn't trigger until after the opcode completes. So that means that if anywhere in the first 3 cycles of lda $4210 NMI is reached, then the main loop will break. Otherwise, the NMI routine will get the $4210 bit 7 as being set.
There's another part to this, though, which is where things get interesting. I believe $4210 bit 7's status and the actual NMI trigger are two completely different functions of the SNES. I believe that the NMI actually triggers later on, thus giving the chance for the reset loop code above a greater window of time to break out of the loop.
Specifically, I believe that the actual NMI only triggers after an opcode completes, and even then, at the very earliest: 3.0x,225/240y. Or cycle 12 of the NMI scanline, 10 cycles after $4210 bit 7 is set.

I did run a number of tests to get the above info, but the latter one (where the NMI actually occurs) is a bit sketchy. I would really appreciate it if someone could confirm this for me.

One interesting thing to note is that I actually had the correct cycle position (12) for when the NMI occurs in my code already. But I don't remember how I got the value to begin with. Note that I still verified that was the correct position anyway.

Also, I had to get the NMI start cycle delay correct to verify some of my information. I had this implemented incorrectly before, so I thought I'd share the correct way of doing it.

Code:

/***********
 *** IRQ ***
 ***********
cycles:
  [1] pbr,pc ; io   -> 6/8 cycles (based on PBR:PC memory speed)
  [2] pbr,pc ; io   -> 6 cycles
  [3] 0,s    ; pbr  -> 8 cycles (stack push, always at 00:xxxx)
  [4] 0,s-1  ; pch  -> 8 cycles
  [5] 0,s-2  ; pcl  -> 8 cycles
  [6] 0,s-3  ; p    -> 8 cycles
  [7] 0,va   ; aavl -> 8 cycles (memory read, always at 00:xxxx)
  [8] 0,va+1 ; aavh -> 8 cycles
*/

I only have this verified for Native mode, and even then, only for NMIs. The 65816 doc says it's the same for reset/etc. though.
I don't know if pbr is pushed in emulation mode. I also don't know if the return address pushed is offset by 1 like how jsr opcodes are.

The weird thing is the very first cycle. No actual instruction is fetched here, according to the value on the data bus (i/o), however VDA = 1, VPA = 1, which is an opcode fetch cycle. Sure enough, by implementing cycles 1 and 2 as i/o cycles, and the rest as memory cycles, I was off by 2 master cycles. By changing the first cycle to a program cycle (opcode fetch / memory access), I got the correct position with $2137.

I'm guessing that's just part of the pipelining as well.
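For reference, the total cost of the sequence in the table can be summed up like this. It is a sketch that only covers the native-mode case listed above, with cycle [1] charged at the speed of PBR:PC and everything else as in the table; the function name is made up.

```c
#include <stdbool.h>

enum { FAST = 6, SLOW = 8 }; /* master cycles per CPU cycle */

/* Master cycles for the native-mode interrupt sequence in the table. */
unsigned native_interrupt_cycles(bool pc_is_slow) {
  unsigned total = 0;
  total += pc_is_slow ? SLOW : FAST; /* [1] pbr,pc: speed of PBR:PC memory */
  total += FAST;                     /* [2] internal i/o cycle */
  total += 4 * SLOW;                 /* [3]-[6] push pbr, pch, pcl, p */
  total += 2 * SLOW;                 /* [7]-[8] read vector low/high bytes */
  return total;
}
```

That works out to 60 master cycles when PBR:PC is in fast memory and 62 when it is in slow memory, which is where the 2-cycle discrepancy shows up if cycle [1] is treated as plain i/o.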

Even after verifying all of this info, we still need to see what all happens in emulation mode, what happens when you toggle overscan around the NMI region, what happens when you enable/disable NMI around the start/end of NMI, and what happens if you don't read $4210 in your NMI routine.

anomie · Post by **anomie** » Wed Feb 09, 2005 10:09 pm

byuusan wrote: Ok, I ran some tests on NMI.

The $4210 bit 7 seems to be set and cleared at exactly the same time as $4212 bit 7 (set = 0.5x, 225/240y ; clear = 0.5x, 0y). Your tests indicate that vblank clear/set is at 1.0x, so if that's right, NMI would also be 1.0x.

Hrm... The only thing is, my tests indicate that $4212 bit 7 gets set .5 dots earlier than $4210 bit 7. Both are cleared at the same time though, [0,0].

Now, take the following code:
[...]
The trick is that the NMI flag can be raised during the opcode, whereas the NMI routine doesn't trigger until after the opcode completes. So that means that if anywhere in the first 3 cycles of lda $4210 NMI is reached, then the main loop will break. Otherwise, the NMI routine will get the $4210 bit 7 as being set.

Of course.

There's another part to this, though, which is where things get interesting. I believe $4210 bit 7's status and the actual NMI trigger are two completely different functions of the SNES. I believe that the NMI actually triggers later on, thus giving the chance for the reset loop code above a greater window of time to break out of the loop.
Specifically, I believe that the actual NMI only triggers after an opcode completes, and even then, at the very earliest: 3.0x,225/240y. Or cycle 12 of the NMI scanline, 10 cycles after $4210 bit 7 is set.

Yes, and the exact delay depends on the speed of the last CPU cycle of the "previous" opcode and the first CPU cycle (the opcode fetch) of the next opcode. To get 10 cycles, you're using SlowROM instructions? ;)

I don't know if pbr is pushed in emulation mode. I also don't know if the return address pushed is offset by 1 like how jsr opcodes are.

PBR is not pushed. I've tested RTI in emulation mode and it requires no PBR on the stack, so the IRQ and NMI pseudo-opcodes cannot be pushing it. And RTI returns to the exact address on the stack, not addr+1 like RTS/RTL.

The weird thing is the very first cycle. No actual instruction is fetched here, according to the value on the data bus (i/o), however VDA = 1, VPA = 1, which is an opcode fetch cycle. Sure enough, by implementing cycles 1 and 2 as i/o cycles, and the rest as memory cycles, I was off by 2 master cycles. By changing the first cycle to a program cycle (opcode fetch / memory access), I got the correct position with $2137.

I'm guessing that's just part of the pipelining as well.

I would have to agree.

what happens when you toggle overscan around the NMI region,

IIRC, if you turn it off, the NMI happens at the usual dots of the next available scanline. Exactly how late you can be before pushing things off until the next line I don't know (e.g. if you toggle at [0,$e4], will it go ahead with $4210 bit 7 at [0.5,$e4] or wait for [0.5, $e5]?).

what happens when you enable/disable NMI around the start/end of NMI

I know for IRQs you can enable them as soon as .5 dot before it's due and it'll still go off. And preliminary results indicate that enabling NMI late will trigger NMI more-or-less right away as long as $4210 bit 7 is still 1. I don't know if disabling NMI will clear $4210 bit 7 or not.

and what happens if you don't read $4210 in your NMI routine.

Nothing happens, it's edge triggered. Although... if disabling NMI doesn't clear $4210 bit 7, it should be possible to trigger a second NMI with "LDA #$80 / STZ $4200 / STA $4200" as long as you never read $4210.

IRQs on the other hand are level-sensitive, so not reading $4211 will set off another IRQ as soon as you clear I.

byuu · Post by **byuu** » Wed Feb 09, 2005 11:16 pm

Hrm... The only thing is, my tests indicate that $4212 bit 7 gets set .5 dots earlier than $4210 bit 7. Both are cleared at the same time though, [0,0].

Ok, I'm going to make some test ROMs to determine when $4210/$4212 bits get set/cleared, and then send them to you. I was thinking maybe something like an SRAM-based one: read the values 64k times and output to SRAM. Since the SNES starts at a known time-position at reset, we should either have identical SRAM files, or see the difference in our timings in them.
Maybe I can somehow make the test so that it determines the exact cycle position on its own.
I'm betting this will end up to be another minor difference between our SNES units, though; like dot 321/325 vs 323/327.

Yes, and the exact delay depends on the speed of the last CPU cycle of the "previous" opcode and the first CPU cycle (the opcode fetch) of the next opcode. To get 10 cycles, you're using SlowROM instructions? ;)

Yes, exactly. How did you get 10 cycles from that, though?
The shortest possible CPU cycle is 6, so that would mean FastROM would be 12 or more cycles minimum.

PBR is not pushed. I've tested RTI in emulation mode and it requires no PBR on the stack, so the IRQ and NMI pseudo-opcodes cannot be pushing it. And RTI returns to the exact address on the stack, not addr+1 like RTS/RTL.

So what happens if you're executing code in bank 01?
e.g.
$01:8000: nop
-> Interrupt trigger (push $8001, p)
$00:c000: rti
<- End interrupt (pull p, $8001)
$00:8001: ...
Does emulation mode force you to only use one bank of code? Or perhaps the vector will be triggered at the current PBR, making it possible to have a vector in every single bank?

anomie · Post by **anomie** » Thu Feb 10, 2005 2:51 am

Yes, exactly. How did you get 10 cycles from that, though?
The shortest possible CPU cycle is 6, so that would mean FastROM would be 12 or more cycles minimum.

Those are the results. If both are Fast (FastROM or IO), the delay is only 6 master cycles. If one is Fast and one is Slow, the delay is 8. If both are Slow, the delay is 10. Fast+XSlow gives 12, and Slow+XSlow gives 14. Theoretically, XSlow+XSlow would give 18. I have no idea why the numbers are as they are.
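The figures above happen to fit a simple formula: the two cycle lengths added together, minus 6. A sketch in C, taking Fast/Slow/XSlow as 6/8/12 master cycles; this is only a curve fit to the reported numbers and says nothing about why the hardware behaves this way. The names are made up for the example.

```c
/* CPU cycle lengths in master cycles. */
typedef enum { SPEED_FAST = 6, SPEED_SLOW = 8, SPEED_XSLOW = 12 } cycle_speed;

/*
 * Delay in master cycles, given the speed of the last cycle of the
 * previous opcode (prev) and of the next opcode fetch (next).
 * prev + next - 6 reproduces every figure listed in the post.
 */
unsigned nmi_delay(cycle_speed prev, cycle_speed next) {
  return (unsigned)prev + (unsigned)next - 6u;
}
```

This gives 6, 8, 10, 12 and 14 for the five measured combinations, and 18 for the untested XSlow+XSlow case.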

So what happens if you're executing code in bank 01?
e.g.
$01:8000: nop
-> Interrupt trigger (push $8001, p)
$00:c000: rti
<- End interrupt (pull p, $8001)
$00:8001: ...

Exactly.

Does emulation mode force you to only use one bank of code? Or perhaps the vector will be triggered at the current PBR, making it possible to have a vector in every single bank?

Vector is always in Bank 0, and always jumps to Bank 0. But you can use multiple banks (JML, JSL, RTL, and such still work), as long as you ensure no NMI/IRQ/BRK/COPs when you're executing in a different bank. Or don't use RTI to return from the interrupt.

byuu · Post by **byuu** » Thu Feb 10, 2005 5:45 pm

Hell. Ok, if you DMA transfer and you cross past the NMI trigger point, the NMI will execute once the DMA transfer finishes, regardless of where the DMA ends at.
So if you're at x: 0, y: 224, and you DMA to x: 0, y: 227, or to x: 0, y: 63, or cross two frames and end up at x: 0, y: 230, it still executes the NMI.
My theory is that the NMI pin transition at the start of a frame from 0->1 does not occur when a DMA is active. This would allow it to avoid executing the NMI routine twice. I confirmed it will only execute it once, regardless of the number of frames you cross.
The NMI doesn't execute right away, though. It waits a certain (presently unknown) number of cycles. I've seen it execute either one or two of the next instructions before actually jumping to the NMI vector.
If this is not emulated correctly, the NMI routine will occur too soon. Latching inside the NMI routine will return incorrect results. The emulator will correct itself and end up with the right cycle positions after the NMI routine finishes, and the PC reaches the point where the NMI would occur on a real SNES. I still want to know how it works so that it can be emulated properly.

Next up is more fun with DMA transfers. DMA transfers require 8 master cycles per byte transferred, as well as an initialization delay.
The delay is the same regardless of how many active channels there are (from 1 to 8). However, the actual delay varies depending on when the DMA transfer begins.
So far, I've seen 24 master cycles near the top of the frame, and 32 master cycles around the middle of the frame. I have no idea how it knows which to use. It isn't related to the previous/next opcode when you write to $420b.
And since this happens with regular DMA, I'm sure it applies to HDMA as well.

43xx reads:

Code:

43[0-7][8-f]
           8,  9,  a,  b,  c,  d,  e,  f
input   { 01, 02, 03, 04, 05, 06, 07, 08 }
output {
  snes  { 01, 02, 03, 08, 43, 43, 43, 08 }
}

You can write a value to 43x0-43xf where x is 0-7. You can read it back and get the same value. I don't know how the values change exactly when you perform a DMA/HDMA transfer.
43[0-7]b is mirrored with 43[0-7]f
---
12->430b
34->430f
430b->34
430f->34
---
12->430b
34->430f
56->430b
430b->56
430f->56
---
43[0-7][c-e] appear to be open bus. The values read are usually $43, $00, and other odd things. 4380-43ff is probably open bus as well.
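A minimal sketch of register handlers that would reproduce the read-back results above. The handler names, the `dma_reg` array and the `open_bus` parameter are all hypothetical; treating $43xc-$43xe and $4380-$43ff as open bus follows the guesses in this post and is not confirmed. How the values change during a DMA/HDMA transfer is not modelled.

```c
#include <stdint.h>

/* Hypothetical storage: 8 channels, offsets $0-$b backed by latches. */
static uint8_t dma_reg[8][12];

/* addr is the low byte of a $43xx address. */
void mmio_w43xx(uint8_t addr, uint8_t value) {
  unsigned ch = (addr >> 4) & 7u, reg = addr & 15u;
  if (addr & 0x80) return;                  /* $4380-$43ff: assumed unmapped */
  if (reg == 0xf) reg = 0xb;                /* $43xf mirrors $43xb */
  if (reg <= 0xb) dma_reg[ch][reg] = value; /* $43xc-$43xe: write dropped */
}

/* open_bus is whatever value the caller's bus model currently holds. */
uint8_t mmio_r43xx(uint8_t addr, uint8_t open_bus) {
  unsigned ch = (addr >> 4) & 7u, reg = addr & 15u;
  if (addr & 0x80) return open_bus;
  if (reg == 0xf) reg = 0xb;
  return reg <= 0xb ? dma_reg[ch][reg] : open_bus;
}
```

Writing 01..08 to $4308-$430f and reading back with an open bus value of $43 yields 01, 02, 03, 08, 43, 43, 43, 08, the same as the snes row in the table.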

byuu · Post by **byuu** » Thu Feb 10, 2005 7:20 pm

My UFO is starting to break :(
The floppy drive doesn't eject very well anymore because of how many times I've loaded disks into it...

BUT! I was able to figure out the DMA delay, at last.

Behold, the horror: http://setsuna.the2d.com/files/dma_delay.txt

Basically, the delay is based upon which cycle the transfer begins on. The type of transfer, the number of bytes in the transfer, and the number of active channels do not affect the delay at all. Since the smallest possible cycle increment is 2:
cycle 0 = 32 cycle delay
cycle 2 = 24 cycle delay
cycle 4 = 24 cycle delay
cycle 6 = 24 cycle delay
cycle 8 = 32 cycle delay
cycle 10 = 24 cycle delay
cycle 12 = 24 cycle delay
cycle 14 = 24 cycle delay

The pattern should be fairly obvious: { 32, 24, 24, 24 }

It doesn't start on cycle 0, though. What I did was print out the exact hcycle pos during the CPU cycle that wrote to $420b. By this, I am assuming that the DMA transfer begins immediately right here, and not at the end of the write cycle/opcode. With that in mind, I believe that cycle 188, the very first possible cycle upon SNES reset, has a DMA delay of 8 dots / 32 cycles.

Logic:
h: 140 (8)
h: 148 (8)
h: 156 (8)
h: 164 (8)
h: 172 (8)
h: 180 (8)
h: 188 (8)

* h = 1204
0136:0000
013e:0000
8dot delay

Since I can confirm that h = 1204 is a 32 cycle delay position, I can subtract 8 backwards to reach 188.
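Put as code, the pattern looks roughly like this. It is a sketch under the assumptions stated in this post: the delay depends only on the hcycle position of the write to $420b, the { 32, 24, 24, 24 } pattern repeats every 8 master cycles, and it is anchored so that h = 188 gets the 32-cycle delay. The function name is made up.

```c
/* hcycle position of the very first CPU cycle after reset, per the post. */
enum { RESET_HCYCLE = 188 };

/*
 * DMA initialization delay in master cycles for a transfer that begins
 * at hcycle position h (assumed even). One position in every 8 cycles,
 * aligned with h = 188, costs 32; the other three cost 24.
 */
unsigned dma_init_delay(unsigned h) {
  /* Adding 8 keeps the subtraction non-negative for small h. */
  unsigned phase = (h + 8u - (RESET_HCYCLE % 8u)) % 8u;
  return phase == 0u ? 32u : 24u;
}
```

Both h = 188 and h = 1204 land on the 32-cycle slot, and the three positions after each of them (for example 190, 192, 194) get 24.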

We still don't know if there's a 4 cycle delay to $42xx accesses, to writes, or not at all. The $2137/$4201 latches indicate that there's a 4 cycle delay somewhere in-between the read/write CPU cycle that actually touches the register.

Ultimately, it shouldn't matter how you handle the 4-cycle read/write delay. As long as you test your emulator against the real SNES and sync the 1 in 4 cycle positions that has a delay of 32 cycles, you should be fine.

anomie · Post by **anomie** » Fri Feb 11, 2005 1:27 am

byuusan wrote: Hell. Ok, if you DMA transfer and you cross past the NMI trigger point, the NMI will execute once the DMA transfer finishes, regardless of where the DMA ends at.
So if you're at x: 0, y: 224, and you DMA to x: 0, y: 227, or to x: 0, y: 63, or cross two frames and end up at x: 0, y: 230, it still executes the NMI.

Going to [0,63] really triggers NMI? Hrm, i would have thought it wouldn't.

byuusan wrote:
My theory is that the NMI pin transition at the start of a frame from 0->1 does not occur when a DMA is active. This would allow it to avoid executing the NMI routine twice. I confirmed it will only execute it once, regardless of the number of frames you cross.

Have you checked $4210? And (especially if bit 7 is still set at [0,63]) does the normal NMI at [0,225] occur, or is it skipped until the next frame?

byuusan wrote:
The NMI doesn't execute right away, though. It waits a certain (presently unknown) number of cycles. I've seen it execute either one or two of the next instructions before actually jumping to the NMI vector.
If this is not emulated correctly, the NMI routine will occur too soon. Latching inside the NMI routine will return incorrect results. The emulator will correct itself and end up with the right cycle positions after the NMI routine finishes, and the PC reaches the point where the NMI would occur on a real SNES. I still want to know how it works so that it can be emulated properly.

Next up is more fun with DMA transfers. DMA transfers require 8 master cycles per byte transferred, as well as an initialization delay.
The delay is the same regardless of how many active channels there are (from 1 to 8). However, the actual delay varies depending on when the DMA transfer begins.
So far, I've seen 24 master cycles near the top of the frame, and 32 master cycles around the middle of the frame. I have no idea how it knows which to use. It isn't related to the previous/next opcode when you write to $420b.
And since this happens with regular DMA, I'm sure it applies to HDMA as well.

There certainly is a dependence on the cycle during which HDMA begins, but it's nothing so simple as you've discovered. It does depend on the mode, the number of channels, and seemingly on the instruction speeds in the general vicinity.

byuusan wrote:
You can write a value to 43x0-43xf where x is 0-7. You can read it back and get the same value.

I've confirmed this in the past.

byuusan wrote:
I don't know how the values change exactly when you perform a DMA/HDMA transfer.

Supposedly, the count registers ($43x5-6) end up 0 after DMA, and $43x2-3 are incremented/decremented as appropriate for what the transfer did. For HDMA, $43x5-6 are (supposedly) copied from the HDMA table and incremented as the transfer progresses. $43x8-9 are supposedly copied from $43x2-3 and incremented. And $43xA is again supposedly copied from the HDMA table and decremented. And yes, you can write the registers in between HDMA transfers and have an effect ($43x2-3 probably have no effect until next frame, but $43x7-A definitely do).

All this could use verification, of course.

byuusan wrote:
43[0-7]b is mirrored with 43[0-7]f

And that's all we know about those.

byuusan wrote:
43[0-7][c-e] appear to be open bus. The values read are usually $43, $00, and other odd things. 4380-43ff is probably open bus as well.

Confirmed.

byuusan wrote:
By this, I am assuming that the DMA transfer begins immediately right here, and not at the end of the write cycle/opcode.

We know that the actual DMA on a "STA $420B" occurs just after the next instruction's opcode fetch. Interesting would be to try "STX $420B" (X=$00xx), and maybe even a BRK from an appropriate bank with the stack set so PBR gets 'pushed' to $420B (and arrange so PB, PCh, PCl, P, AAVL, AAVH, and the first opcode of the BRK handler are all different). In all cases, DMA from Open Bus into WRAM so we can 'see' by the Open Bus value just where the DMA occurred.

byuusan wrote:
We still don't know if there's a 4 cycle delay to $42xx accesses, to writes, or not at all. The $2137/$4201 latches indicate that there's a 4 cycle delay somewhere in-between the read/write CPU cycle that actually touches the register.

I suspect it's that the read can be recognized as soon as $37 gets put on Address Bus B and /RD goes, while write needs to wait until the data bus value stabilizes and then has to go out the parallel port before getting to the CPU.

byuu · Post by **byuu** » Fri Feb 11, 2005 1:54 am

anomie wrote:
Going to [0,63] really triggers NMI? Hrm, I would have thought it wouldn't.

Yeah. Same here.

anomie wrote:
Have you checked $4210? And (especially if bit 7 is still set at [0,63]) does the normal NMI at [0,225] occur, or is it skipped until the next frame?

I have not checked $4210. That should be trivial to check, though. The normal NMI doesn't execute until one or two opcodes after the DMA finishes, no matter when the DMA finishes (I checked this by printing the value at $02,s in the NMI to the screen), so long as an NMI would have occurred during that DMA transfer, had it been all CPU cycles instead.
I'm going to try and get the cycle delay after a DMA transfer tonight by using various opcodes after the STA $420B and checking the PC location where the NMI was triggered. It may be possible to kill the pending NMI with something like: STA $420B : STZ $4200

anomie wrote:
There certainly is a dependence on the cycle during which HDMA begins, but it's nothing so simple as you've discovered. It does depend on the mode, the number of channels, and seemingly on the instruction speeds in the general vicinity.

Well, I meant that more in regard to DMA. But you have initialization delays for each active channel in HDMA, I was thinking those could probably vary like DMA does, in addition to all the other complexities of HDMA. But DMA initialization doesn't care what modes you use, how much you transfer, nor how many active channels there are. It's always 24 or 32 cycles. I could definitely use verification of this, if possible.
Your timing.txt says there is an 8 cycle initialization per channel, and snes9x.com's old forum says there's a phantom byte transfer or something on every channel. I didn't notice either of these happening.

anomie wrote:
We know that the actual DMA on a "STA $420B" occurs just after the next instruction's opcode fetch.

I'm curious how much that's going to come back to haunt me. Right now, I don't emulate the opcode prefetches. I emulate the prefetch within the actual opcode. This would obviously never affect $2137/$4201 latch positions since they actually happen mid-opcode, but I would guess that it could throw off DMA transfer timing, especially in cases where the DMA transfer exceeds vblank and data is lost in the transfer. Gah, I really need open bus for this...

anomie wrote:
I suspect it's that the read can be recognized as soon as $37 gets put on Address Bus B and /RD goes, while write needs to wait until the data bus value stabilizes and then has to go out the parallel port before getting to the CPU.

So you believe that anything that goes on address bus B happens 4 cycles before things that go through the parallel port? What exactly is the parallel port?

anomie · Post by **anomie** » Sat Feb 19, 2005 4:48 pm

byuusan wrote:But you have initialization delays for each active channel in HDMA, I was thinking those could probably vary like DMA does, in addition to all the other complexities of HDMA.

The current numbers are "18" master cycles per scanline if any channels are still active, plus 16-56 cycles per channel for the actual transfer. The "18" is presumably where the variance happens.
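
As a rough sketch of what those numbers imply per scanline, here is the arithmetic in Python. The function name `hdma_line_overhead` is hypothetical, and the code simply takes the "18" and the 16-56 range at face value; it makes no attempt to model where the variance in the "18" comes from.

```python
def hdma_line_overhead(active_channels: int) -> tuple[int, int]:
    """Return (min, max) master cycles HDMA would cost on one scanline."""
    if not 0 <= active_channels <= 8:
        raise ValueError("there are only 8 DMA channels")
    if active_channels == 0:
        return (0, 0)                  # no active channels, no overhead
    fixed = 18                         # per-scanline cost if any channel is active
    return (fixed + 16 * active_channels,   # cheapest transfer on every channel
            fixed + 56 * active_channels)   # most expensive on every channel


for n in (1, 4, 8):
    print(n, hdma_line_overhead(n))
```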

byuusan wrote:
Your timing.txt says there is an 8 cycle initialization per channel, and snes9x.com's old forum says there's a phantom byte transfer or something on every channel. I didn't notice either of these happening.

I got that number from some others' test, I haven't gotten around to testing it myself yet.

byuusan wrote:
So you believe that anything that goes on address bus B happens 4 cycles before things that go through the parallel port? What exactly is the parallel port?

No, I just think there's a longer path accessing the PPU Latch via $4201 than via $2137. Let's try again... Note that these T values have no correspondence with anything real, and are probably nonlinear. At time Tr=0, the CPU begins the read cycle accessing $2137. At Tr=1, the read actually goes out Address Bus B. At Tr=2, the PPU sees the read and latches. OTOH, at Tw=0 the CPU begins the write cycle accessing $4201. At Tw=1, the write actually goes out. At Tw=2, the internal CPU register gets the written value. At Tw=3, it outputs the value to the IO Port. At Tw=4 the PPU sees the new value, and latches at Tw=5.

"parallel port" == "IO port" == the pins accessed by $4201 and $4213.

This morning's results: WAI and IRQ. If an IRQ is already pending when WAI executes, the WAI instruction takes 1 read and 2 IO cycles (the same as XBA). Otherwise, the WAI instruction adds IO cycles until an interrupt triggers, then does 2 more IO cycles to complete the opcode. WAI seems to recognize the pending interrupt immediately on /IRQ or /NMI transition.
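
The WAI behaviour described above could be modelled roughly like this. It is a sketch only: `wai_cycles` and its `idle_io_cycles` parameter are hypothetical names, and it assumes the non-pending case also starts with the same opcode read as the pending case, which the description implies but does not spell out.

```python
def wai_cycles(irq_pending: bool, idle_io_cycles: int = 0) -> list[str]:
    """List the CPU cycles one WAI would take under the description above.

    idle_io_cycles (hypothetical) is the number of IO cycles spent waiting
    before /IRQ or /NMI transitions; it is ignored if an IRQ is pending.
    """
    cycles = ["read"]                       # opcode fetch
    if not irq_pending:
        cycles += ["io"] * idle_io_cycles   # keep adding IO cycles while waiting
    cycles += ["io", "io"]                  # two IO cycles to complete the opcode
    return cycles


print(wai_cycles(True))        # interrupt already pending
print(wai_cycles(False, 3))    # waited three IO cycles first
```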

Tests were done with an IRQ set for [0,1] and the following code:

Code: Select all

    ; FastROM here
    wai
    lda $2137  ;; Latch A
    ; [... read $213c-d and continue ROM ...]

IRQFunc:
    sep #$20
    pha
    lda.w $2100  ;; delay
    lda $2137  ;; Latch B
    ; [... read $213c, $4211 ...]
    rti

With the I flag set, Latch A latches between [11.5,1] and [12.5,1], depending on the alignment of the WAI with the IRQ trigger point. That's cycle 1364+11.5*4 = 1410. /IRQ goes low at cycle 1374 (if my previous results are correct), leaving 36 cycles between /IRQ and the latch. FastROM 'LDA $2137' accounts for 24 cycles, leaving 12 for finishing up the WAI. ... If you latch at the beginning of the $2137 rather than the end, I'm not sure how the numbers work out mostly because I'm not sure how that changes the /IRQ timing.

With I clear, Latch B latches between [46.5,1] and [47.5,1] (47.0 and 48.0 if we change the delay to "lda.w $0000"). That's 1364+46.5*4=1550, or 176 cycles after /IRQ. Minus 60 for the IRQ handler, 22 for SEP, 22 for PHA, 30 for the delay, and 30 for the SlowROM LDA leaves 12 for the WAI again.
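
To make the arithmetic in the last two paragraphs easier to follow, here it is spelled out in Python. All the numbers come from the text above; only the names (`latch_cycle`, `IRQ_LOW` and so on) are invented for this sketch, and 1374 is the /IRQ position "if my previous results are correct".

```python
CYCLES_PER_LINE = 1364   # master cycles per scanline
CYCLES_PER_DOT = 4       # master cycles per dot
IRQ_LOW = 1374           # master cycle at which /IRQ goes low (earlier result)


def latch_cycle(dot: float, line: int) -> float:
    """Master cycles since the start of line 0 for latch position [dot, line]."""
    return line * CYCLES_PER_LINE + dot * CYCLES_PER_DOT


# I flag set: Latch A at [11.5,1], reached via a FastROM LDA $2137 (24 cycles).
latch_a = latch_cycle(11.5, 1)
wai_a = latch_a - IRQ_LOW - 24

# I flag clear: Latch B at [46.5,1], reached via the IRQ handler.
latch_b = latch_cycle(46.5, 1)
handler_costs = [60, 22, 22, 30, 30]   # IRQ entry, SEP, PHA, delay, SlowROM LDA
wai_b = latch_b - IRQ_LOW - sum(handler_costs)

print(latch_a, wai_a)   # 1410.0 12.0
print(latch_b, wai_b)   # 1550.0 12.0
```

Both paths leave the same 12 cycles for finishing the WAI, which is the point of the comparison.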

BTW, if IRQ/NMI does a memory access instead of an IO for that first cycle, that changes the delay between /IRQ or /NMI and the earliest the interrupt can trigger. And it sort of makes sense now! For the interrupt to be recognized after the current opcode it must be pending at the start of the final CPU cycle of the opcode. That's it. And if that's when it does the check, there's no need for elaborate "flags get changed during the next opcode fetch" theories, it's just that PLP, SEI, CLI, SEP, and REP update the flags during their final CPU cycle, after the IRQ check has happened.

[later]

Some DMA tests, now. The test is the same as always: wait a variable number of cycles, execute DMA, and latch. DMA does have 8 master cycles per channel overhead, and 8 cycles per byte. And there's an overhead for the whole DMA transfer, of somewhere between 12 and 24 cycles. This varies depending on just when the transfer begins in a cycle 4 steps long (e.g. 14-20-20-14, 14-18-18-18, or 16-22-16-14, we have to guess the half-dots anyway). The 4-step pattern varies based on the number of bytes transferred and the number of channels, and on FastROM/SlowROM.
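
Taken at face value, those figures give a cost range like the following. This is a sketch in Python with a hypothetical function name `dma_cycles`; it uses only the 8-per-channel, 8-per-byte and 12-to-24 numbers from this post, and does not try to pick the exact overhead within the range.

```python
def dma_cycles(bytes_per_channel: list[int]) -> tuple[int, int]:
    """Return (min, max) master cycles for one DMA, one list entry per channel."""
    per_channel = 8 * len(bytes_per_channel)   # 8 cycles overhead per channel
    per_byte = 8 * sum(bytes_per_channel)      # 8 cycles per byte transferred
    # Whole-transfer overhead is somewhere between 12 and 24 cycles.
    return (12 + per_channel + per_byte, 24 + per_channel + per_byte)


print(dma_cycles([1]))      # 1 channel, 1 byte
print(dma_cycles([2, 2]))   # 2 channels, 2 bytes each
```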

Some numbers, after correcting for the per-channel and per-byte costs.
Constant overhead would give steadily increasing results along the lines of 5 5 6 6 7 7 8 8, although the exact starting point would of course differ. All these are FastROM.
1 channel, 1 byte: 5 6 5 5 7 8 7 7
2 channels, 1 byte each: 6 5 5 6 8 7 7 8
3 channels, 1 byte each: 5 5 6 5 7 7 8 7
4 channels, 1 byte each: 5 6 5 5 7 8 7 7
1 channel, 2 bytes: 5 5 6 5 7 7 8 7
1 channel, 3 bytes: 6 5 5 6 8 7 7 8
1 channel, 4 bytes: 5 6 5 5 7 8 7 7
2 channels, 2 bytes each: 5 5 6 5 7 7 8 7
2 channels, 3 bytes each: 5 6 5 5 7 8 7 7
2 channels, 4 bytes each: 6 5 5 6 8 7 7 8
3 channels, 2 bytes each: 5 5 6 5 7 7 8 7
3 channels, 3 bytes each: 5 5 6 5 7 7 8 7
4 channels, 2 bytes each: 5 5 6 5 7 7 8 7
4 channels, 3 bytes each: 6 5 5 6 8 7 7 8

If we name the patterns A (5 6 5 5), B (5 5 6 5), and C (6 5 5 6), we get

Code: Select all

            bytes per channel
 channels   1 2 3 4 5 6 7 8 9
    1       A B C A B C A B C
    2       C B A C B A C B A
    3       B B B B B B B B B
    4       A B C A B C A B C
    5       C B A C B A C B A
    6       B B B B B B B B B
    7       A B C A B C A B C
    8       C B A C B A C B A

SlowROM gives pattern C for everything. Weird.
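
For anyone who wants to play with the FastROM table, here is a small Python sketch that classifies the measured rows and checks them against the table. The `guessed_pattern` closed form is purely a fit to the posted table, not something measured or claimed in this thread, so treat it as an assumption; all names are made up for illustration.

```python
# FastROM measurements as posted, keyed by (channels, bytes per channel).
MEASURED = {
    (1, 1): "5 6 5 5 7 8 7 7", (2, 1): "6 5 5 6 8 7 7 8",
    (3, 1): "5 5 6 5 7 7 8 7", (4, 1): "5 6 5 5 7 8 7 7",
    (1, 2): "5 5 6 5 7 7 8 7", (1, 3): "6 5 5 6 8 7 7 8",
    (1, 4): "5 6 5 5 7 8 7 7", (2, 2): "5 5 6 5 7 7 8 7",
    (2, 3): "5 6 5 5 7 8 7 7", (2, 4): "6 5 5 6 8 7 7 8",
    (3, 2): "5 5 6 5 7 7 8 7", (3, 3): "5 5 6 5 7 7 8 7",
    (4, 2): "5 5 6 5 7 7 8 7", (4, 3): "6 5 5 6 8 7 7 8",
}

# The pattern table as posted: one string per channel count (1-8),
# one letter per byte count (1-9).
TABLE = [
    "ABCABCABC", "CBACBACBA", "BBBBBBBBB", "ABCABCABC",
    "CBACBACBA", "BBBBBBBBB", "ABCABCABC", "CBACBACBA",
]

PATTERNS = {(5, 6, 5, 5): "A", (5, 5, 6, 5): "B", (6, 5, 5, 6): "C"}


def classify(row: str) -> str:
    """Name the 4-step pattern that starts a measured row."""
    values = tuple(int(v) for v in row.split())
    return PATTERNS[values[:4]]


def guessed_pattern(channels: int, bytes_each: int) -> str:
    """Hypothetical closed form fitted to TABLE; not from the measurements."""
    return "ABC"[(channels * (bytes_each + 1) + 1) % 3]


# Every measured row agrees with its table entry.
for (channels, bytes_each), row in MEASURED.items():
    assert classify(row) == TABLE[channels - 1][bytes_each - 1]

# Compare the guessed closed form against the whole table.
mismatches = [
    (c, b)
    for c in range(1, 9)
    for b in range(1, 10)
    if guessed_pattern(c, b) != TABLE[c - 1][b - 1]
]
print(mismatches)   # []
```

The fit depends only on channels * (bytes + 1), which is proportional to the per-channel plus per-byte cost (8 cycles each). That might hint that the pattern tracks the total length of the transfer, but that is speculation.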

byuu · Post by **byuu** » Sat Feb 19, 2005 10:04 pm

Unfortunately, I can't say I totally follow your IRQ tests, but that's ok. I'm sure it'll make more sense when I get past NMI and HDMA and start working on IRQ.

anomie wrote:
DMA does have 8 master cycles per channel overhead, and 8 cycles per byte. And there's an overhead for the whole DMA transfer, of somewhere between 12 and 24 cycles.

I guess our SNES units have another timing difference, then... I think I'll start on that timing documentation really soon, then.
I'm also hard pressed to come up with good ideas for timing tests to [dis]prove our other timing differences.

Reznor007 · Post by **Reznor007** » Sun Feb 20, 2005 5:57 am

byuusan wrote:Unfortunately, I can't say I totally follow your IRQ tests, but that's ok. I'm sure it'll make more sense when I get past NMI and HDMA and start working on IRQ.

DMA does have 8 master cycles per channel overhead, and 8 cycles per byte. And there's an overhead for the whole DMA transfer, of somewhere between 12 and 24 cycles.
I guess our SNES units have another timing difference, then... I think I'll start on that timing documentation really soon, then.
I'm also hard pressed to come up with good ideas for timing tests to [dis]prove our other timing differences.

Timing differences between 2 of the same machines may just be the deviation from the rated spec of the oscillators/crystals in the systems. The timing crystals aren't 100% perfect, so some may be right at the proper value, some may be higher, and some may be slower.

As you can see at http://www.alpha-ii.com/Info/snes-spdif.html, his SNES DSP is operating slightly faster than the standard 32 kHz (though it is 100% normal).

While experimenting with all the timing as you guys are, maybe running these tests on many SNES units, then averaging the values and using that in emulators would be the best bet.

Overload · Post by **Overload** » Sun Feb 20, 2005 1:28 pm

Reznor007 wrote:
Timing differences between 2 of the same machines may just be the deviation from the rated spec of the oscillators/crystals in the systems. The timing crystals aren't 100% perfect, so some may be right at the proper value, some may be higher, and some may be slower.

I don't think timing crystals have any relevance in this discussion. Clock cycles determine timing and theoretically all systems with the same cpu version should give the same results.

byuu · Post by **byuu** » Sun Feb 20, 2005 3:29 pm

The only way the crystal would come into play would be if the PPU dot clock ran independently from the CPU cycle clock (both would need their own crystals, really). Otherwise, even if the crystal was faster/slower, the dot clock positions would still be the same.
There's probably some internal 21 MHz counter that is used by both the CPU and PPU to calculate the cycle/dot counters. But that's pure speculation.
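
To illustrate the argument, here is a tiny Python sketch. It assumes a single shared master clock, treats every dot as 4 cycles (so it ignores the longer dots), and the two frequencies are illustrative values I picked, not measurements of anyone's console. All names are hypothetical.

```python
CYCLES_PER_LINE = 1364  # master cycles per scanline, as used in the latch math
CYCLES_PER_DOT = 4      # simplification: ignores the longer dots


def dot_position(master_cycles: int) -> tuple[int, int]:
    """(dot, line) reached after a given number of master cycles."""
    line, rest = divmod(master_cycles, CYCLES_PER_LINE)
    return (rest // CYCLES_PER_DOT, line)


def elapsed_seconds(master_cycles: int, crystal_hz: float) -> float:
    """Wall-clock time those cycles take at a given crystal frequency."""
    return master_cycles / crystal_hz


NOMINAL_HZ = 21_477_270.0       # illustrative nominal value, an assumption
FAST_HZ = NOMINAL_HZ * 1.0001   # hypothetical unit running 100 ppm fast

cycles = 1410
position = dot_position(cycles)                     # depends on cycle count only
time_nominal = elapsed_seconds(cycles, NOMINAL_HZ)
time_fast = elapsed_seconds(cycles, FAST_HZ)
print(position, time_nominal, time_fast)
```

The latched position depends only on the cycle count, so it is identical on both units; only the wall-clock time differs.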

We have already proven via anomie's test that the longer dot position is different on our systems. We have the same CPU/PPU1/PPU2 (3/1/2), as well. This could also be a result of the copiers we are using, though.

anewuser · Post by **anewuser** » Sun Feb 20, 2005 4:23 pm

If that's the case, then I'll give you guys a pat on the back (if I had money or something more worthwhile to contribute, I would help you more). You are doing quite a bit of work, and even more if you are going to test different copiers.

Good luck on that.

Reznor007 · Post by **Reznor007** » Sun Feb 20, 2005 5:31 pm

Overload wrote:
I don't think timing crystals have any relevance in this discussion. Clock cycles determine timing and theoretically all systems with the same cpu version should give the same results.

But if, because of crystal differences, one SNES is running at 3.57 MHz and another at 3.59 MHz, it could alter such tiny timing measurements.

On a related note, what is the actual speed of the SNES timing crystal? Specs for the system show it as 3.58 MHz, but usually a crystal around that speed is actually 3.579545 MHz.