byuusan wrote: But you have initialization delays for each active channel in HDMA, I was thinking those could probably vary like DMA does, in addition to all the other complexities of HDMA.
The current numbers are "18" master cycles per scanline if any channels are still active, plus 16-56 cycles per channel for the actual transfer. The "18" is presumably where the variance happens.
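To make those numbers concrete, here is a minimal Python sketch of the per-scanline HDMA cost as described above. The function name and the idea of passing the per-channel transfer cost in as an input are my own assumptions for illustration; the post only gives the 18-cycle figure and the 16-56 range, not how a channel's cost is derived.

```python
HDMA_LINE_OVERHEAD = 18      # master cycles per scanline if any channel is active
CHANNEL_COST_RANGE = (16, 56)  # master cycles per channel for the transfer itself


def hdma_line_cycles(channel_costs):
    """Total HDMA master cycles for one scanline.

    channel_costs: one transfer cost per still-active channel (hypothetical
    input; each value must fall in the 16-56 range quoted above).
    """
    if not channel_costs:
        return 0  # no active channels, so no per-line overhead either
    low, high = CHANNEL_COST_RANGE
    for cost in channel_costs:
        if not low <= cost <= high:
            raise ValueError(f"channel cost {cost} outside {low}-{high}")
    return HDMA_LINE_OVERHEAD + sum(channel_costs)
```

For example, two active channels at the cheapest and most expensive ends of the range would cost 18 + 16 + 56 = 90 master cycles on that line. Any variance hiding in the "18" is not modeled here.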
Your timing.txt says there is an 8 cycle initialization per channel, and snes9x.com's old forum says there's a phantom byte transfer or something on every channel. I didn't notice either of these happening.
I got that number from some others' test; I haven't gotten around to testing it myself yet.
So you believe that anything that goes on address bus B happens 4 cycles before things that go through the parallel port? What exactly is the parallel port?
No, I just think there's a longer path accessing the PPU Latch via $4201 than via $2137. Let's try again... Note that these T values have no correspondence with anything real, and are probably nonlinear. At time Tr=0, the CPU begins the read cycle accessing $2137. At Tr=1, the read actually goes out Address Bus B. At Tr=2, the PPU sees the read and latches. OTOH, at Tw=0 the CPU begins the write cycle accessing $4201. At Tw=1, the write actually goes out. At Tw=2, the internal CPU register gets the written value. At Tw=3, it outputs the value to the IO Port. At Tw=4 the PPU sees the new value, and latches at Tw=5.
"parallel port" == "IO port" == the pins accessed by $4201 and $4213.
This morning's results: WAI and IRQ. If an IRQ is already pending when WAI executes, the WAI instruction takes 1 read and 2 IO cycles (the same as XBA). Otherwise, the WAI instruction adds IO cycles until an interrupt triggers, then does 2 more IO cycles to complete the opcode. WAI seems to recognize the pending interrupt immediately on /IRQ or /NMI transition.
Tests were done with an IRQ set for [0,1] and the following code:
; FastROM here
wai
lda $2137 ;; Latch A
; [... read $213c-d and continue ROM ...]
IRQFunc:
sep #$20
pha
lda.w $2100 ;; delay
lda $2137 ;; Latch B
; [... read $213c, $4211 ...]
rti
With the I flag set, Latch A latches between [11.5,1] and [12.5,1], depending on the alignment of the WAI with the IRQ trigger point. That's cycle 1364+11.5*4 = 1410. /IRQ goes low at cycle 1374 (if my previous results are correct), leaving 36 cycles between /IRQ and the latch. FastROM 'LDA $2137' accounts for 24 cycles, leaving 12 for finishing up the WAI. ... If you latch at the beginning of the $2137 read rather than the end, I'm not sure how the numbers work out, mostly because I'm not sure how that changes the /IRQ timing.
With I clear, Latch B latches between [46.5,1] and [47.5,1] (47.0 and 48.0 if we change the delay to "lda.w $0000"). That's 1364+46.5*4=1550, or 176 cycles after /IRQ. Minus 60 for the IRQ handler, 22 for SEP, 22 for PHA, 30 for the delay, and 30 for the SlowROM LDA leaves 12 for the WAI again.
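The arithmetic for both latches can be re-checked with a few lines of Python. This is only a sketch of the bookkeeping above: the helper names are made up, and converting a [dot,line] position to a cycle count as line*1364 + dot*4 is an assumption that simply mirrors the "1364+11.5*4" calculation in the text.

```python
CYCLES_PER_LINE = 1364  # master cycles per scanline, as used in the text
CYCLES_PER_DOT = 4      # master cycles per dot
IRQ_LOW_CYCLE = 1374    # cycle at which /IRQ goes low (from the earlier results)


def latch_cycle(dot, line):
    """Master cycle of a latched [dot, line] position (assumed conversion)."""
    return line * CYCLES_PER_LINE + dot * CYCLES_PER_DOT


def wai_tail(dot, line, known_costs):
    """Cycles left for finishing WAI after subtracting the known costs."""
    return latch_cycle(dot, line) - IRQ_LOW_CYCLE - sum(known_costs)


# Latch A (I set): only the FastROM 'LDA $2137' sits between WAI and the latch.
latch_a_tail = wai_tail(11.5, 1, [24])

# Latch B (I clear): IRQ handler entry, SEP, PHA, the delay LDA, SlowROM LDA.
latch_b_tail = wai_tail(46.5, 1, [60, 22, 22, 30, 30])

print(latch_a_tail, latch_b_tail)
```

Both cases leave the same 12-cycle remainder, which is the point of the comparison.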
BTW, if IRQ/NMI does a memory access instead of an IO for that first cycle, that changes the delay between /IRQ or /NMI and the earliest the interrupt can trigger. And it sort of makes sense now! For the interrupt to be recognized after the current opcode, it must be pending at the start of the final CPU cycle of the opcode. That's it. And if that's when it does the check, there's no need for elaborate "flags get changed during the next opcode fetch" theories: it's just that PLP, SEI, CLI, SEP, and REP update the flags during their final CPU cycle, after the IRQ check has happened.
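Here is a toy Python model of that hypothesis, just to show what it predicts. It is not an emulator: the function name and the opcode strings are invented for illustration, only CLI and SEI are modeled (PLP, SEP, and REP would need operand handling), and /IRQ is assumed to be held low the whole time.

```python
def irq_taken_after(program, i_flag, irq_pending=True):
    """Index of the opcode after which the IRQ is serviced, or None.

    Hypothesis being modeled: the interrupt check happens at the START of
    each opcode's final CPU cycle, and flag-changing opcodes update the
    I flag DURING that final cycle, i.e. after the check.
    """
    for index, opcode in enumerate(program):
        # Start of the final cycle: sample /IRQ against the current I flag.
        take_irq = irq_pending and not i_flag

        # During the final cycle: the opcode's flag update lands.
        if opcode == "cli":
            i_flag = False
        elif opcode == "sei":
            i_flag = True

        if take_irq:
            return index
    return None


# I set, IRQ pending: CLI itself cannot take the IRQ; the opcode after it does.
print(irq_taken_after(["cli", "nop", "nop"], i_flag=True))

# I clear, IRQ pending: the IRQ is still taken right after SEI.
print(irq_taken_after(["sei", "nop"], i_flag=False))
```

Under this model the one-opcode delay after CLI and the "IRQ slips in after SEI" behaviour both fall out of a single rule, with no special handling of the next opcode fetch.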
[later]
Some DMA tests, now. The test is the same as always: wait a variable number of cycles, execute DMA, and latch. DMA does have 8 master cycles of overhead per channel, and 8 cycles per byte. And there's an overhead for the whole DMA transfer, of somewhere between 12 and 24 cycles. This varies depending on just when the transfer begins in a cycle 4 steps long (e.g. 14-20-20-14, 14-18-18-18, or 16-22-16-14; we have to guess the half-dots anyway). The 4-step pattern varies based on the number of bytes transferred and the number of channels, and on FastROM/SlowROM.
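As a rough sketch of what those costs add up to, the following Python function returns the range of total master cycles for a transfer. The function name and the choice to return a (min, max) pair are assumptions for illustration; where in the 12-24 range a given transfer lands depends on the alignment pattern, which this does not model.

```python
PER_CHANNEL_OVERHEAD = 8              # master cycles per active channel
PER_BYTE_COST = 8                     # master cycles per byte transferred
WHOLE_TRANSFER_OVERHEAD = (12, 24)    # observed range for the whole transfer


def dma_cycle_range(bytes_per_channel):
    """(min, max) master cycles for a DMA, given byte counts per channel."""
    fixed = sum(PER_CHANNEL_OVERHEAD + PER_BYTE_COST * n
                for n in bytes_per_channel)
    low, high = WHOLE_TRANSFER_OVERHEAD
    return fixed + low, fixed + high


print(dma_cycle_range([1]))      # 1 channel, 1 byte
print(dma_cycle_range([2, 2]))   # 2 channels, 2 bytes each
```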
Some numbers, after correcting for the per-channel and per-byte costs.
Constant overhead would give steadily increasing results along the lines of 5 5 6 6 7 7 8 8, although the exact starting point would of course differ. All these are FastROM.
1 channel, 1 byte: 5 6 5 5 7 8 7 7
2 channels, 1 byte each: 6 5 5 6 8 7 7 8
3 channels, 1 byte each: 5 5 6 5 7 7 8 7
4 channels, 1 byte each: 5 6 5 5 7 8 7 7
1 channel, 2 bytes: 5 5 6 5 7 7 8 7
1 channel, 3 bytes: 6 5 5 6 8 7 7 8
1 channel, 4 bytes: 5 6 5 5 7 8 7 7
2 channels, 2 bytes each: 5 5 6 5 7 7 8 7
2 channels, 3 bytes each: 5 6 5 5 7 8 7 7
2 channels, 4 bytes each: 6 5 5 6 8 7 7 8
3 channels, 2 bytes each: 5 5 6 5 7 7 8 7
3 channels, 3 bytes each: 5 5 6 5 7 7 8 7
4 channels, 2 bytes each: 5 5 6 5 7 7 8 7
4 channels, 3 bytes each: 6 5 5 6 8 7 7 8
If we name the patterns A (5 6 5 5), B (5 5 6 5), and C (6 5 5 6), we get
         bytes per channel
channels 123456789
1        ABCABCABC
2        CBACBACBA
3        BBBBBBBBB
4        ABCABCABC
5        CBACBACBA
6        BBBBBBBBB
7        ABCABCABC
8        CBACBACBA
SlowROM gives pattern C for everything. Weird.
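For the FastROM results, this Python sketch classifies the raw rows into A/B/C and checks them against the table. It also tries one closed-form expression, `(channels * (bytes + 1) - 1) % 3`, which happens to reproduce every cell of the table as posted. That expression is purely my own curve fit to these numbers, not a claim about what the hardware is doing, and the names below are made up.

```python
# First four values of each pattern, as named in the text.
PATTERNS = {(5, 6, 5, 5): "A", (5, 5, 6, 5): "B", (6, 5, 5, 6): "C"}

# Raw FastROM rows quoted above: (channels, bytes per channel) -> 8 values.
MEASURED = {
    (1, 1): (5, 6, 5, 5, 7, 8, 7, 7), (2, 1): (6, 5, 5, 6, 8, 7, 7, 8),
    (3, 1): (5, 5, 6, 5, 7, 7, 8, 7), (4, 1): (5, 6, 5, 5, 7, 8, 7, 7),
    (1, 2): (5, 5, 6, 5, 7, 7, 8, 7), (1, 3): (6, 5, 5, 6, 8, 7, 7, 8),
    (1, 4): (5, 6, 5, 5, 7, 8, 7, 7), (2, 2): (5, 5, 6, 5, 7, 7, 8, 7),
    (2, 3): (5, 6, 5, 5, 7, 8, 7, 7), (2, 4): (6, 5, 5, 6, 8, 7, 7, 8),
    (3, 2): (5, 5, 6, 5, 7, 7, 8, 7), (3, 3): (5, 5, 6, 5, 7, 7, 8, 7),
    (4, 2): (5, 5, 6, 5, 7, 7, 8, 7), (4, 3): (6, 5, 5, 6, 8, 7, 7, 8),
}

# The pattern table: row = channels (1-8), column = bytes per channel (1-9).
TABLE = [
    "ABCABCABC", "CBACBACBA", "BBBBBBBBB", "ABCABCABC",
    "CBACBACBA", "BBBBBBBBB", "ABCABCABC", "CBACBACBA",
]


def classify(values):
    """Name a measured row by its first four values."""
    return PATTERNS[tuple(values[:4])]


def fitted_pattern(channels, nbytes):
    """Empirical fit to the table (an assumption, not a known mechanism)."""
    return "CAB"[(channels * (nbytes + 1) - 1) % 3]


# Raw rows that disagree with the table (expected: none).
row_mismatches = [key for key, row in MEASURED.items()
                  if classify(row) != TABLE[key[0] - 1][key[1] - 1]]

# Table cells that disagree with the fitted expression (expected: none).
fit_mismatches = [(c, b) for c in range(1, 9) for b in range(1, 10)
                  if fitted_pattern(c, b) != TABLE[c - 1][b - 1]]

print(row_mismatches, fit_mismatches)
```

Since channels * (bytes + 1) is also the per-channel plus per-byte cost divided by 8, the fit at least hints that the pattern depends on the transfer length modulo 3 of something, but that is speculation. SlowROM giving C everywhere does not fit this expression at all.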