DMA timing (again)

byuu · Post by **byuu** » Tue Jul 12, 2005 10:53 am

I was poking around with it again, and since we lost our old thread, I can't seem to get things to match up again :/

Now, anomie: you say that we start off with the DMA clock with a range of 0-7 that ticks once per master clock cycle. So we're running the CPU, and then a write to $420b appears. The CPU completes this write, and then completes the very next cycle (which is usually the first cycle of the next opcode) for whatever reason. Now we read the DMA clock.
int init = dma_clock() & 6; //dma_clock = 0-7
if(!init)init = 8; //0,2,4,6 -> 8,2,4,6
add_cc1_cycles(init);
This aligns the CPU and DMA clocks, presumably so the data exposed between the two chips stays in sync always. Now, we transfer all of the data, each byte taking 8 cycles. Now, once we're finished, you say that the CPU sees how many master clock cycles the next CPU cycle will be, and somehow gets a value 2-6 (fast), 2-8 (slow), or 2-12 (xslow), and adds that at the end. How do you determine how long this delay will be, exactly?

First problem is that my measured DMA delays ranged from 12 to 24 cycles. I can't get that range from 2-8 (start) + 2-12 (end), which gives 4-20. So I thought, maybe add in 8 somewhere: that gives 12-28, which is too high. But since my tests can't use xslow mode, the end delay tops out at 8, and that gives me exactly the 12-24 range. So then we should add 8 cycles, presumably at the start or end of DMA as well. Maybe this is the extra read byte everyone always used to talk about with DMA. There's probably a clever way to test this...

Now the next problem is that I really don't think it's possible for the CPU to 'predict' the next opcode cycle length. e.g. say we have
sta $420b
clc : xce
In this case, the first delay occurs after clc's first cycle (opcode fetch), but then the second delay depends on the second cycle of clc. This is always an i/o cycle (6). But how would the CPU know that until it executed it? I don't believe the CPU is peeking ahead and analyzing how long it will take to complete the next cycle before it happens, and if it does... how could we emulate this behavior? Perhaps the CPU pauses itself after this cycle executed, for some odd reason? I don't see why the CPU would need to sync itself again anyway. The CPU should not have to worry about being aligned with DMA timing when DMA is inactive... that's what the first alignment before DMA is for :/

Hmm, I'm thinking that extra 8 cycles in the middle is just the CPU reading the first byte to transfer. Like so:
src = read();
for(i=0;i<r43x5;i++) {
*dest++ = src; //the DMA does this
src = read(); //while the CPU does this, hence why the sync is needed
}
So then it probably occurs directly after the first delay to sync with DMA.

anomie · Post by **anomie** » Tue Jul 12, 2005 2:07 pm

byuusan wrote: presumably so the data exposed between the two chips stays in sync always.

Two portions of the same chip, actually...

Now, we transfer all of the data, each byte taking 8 cycles.

Plus 8 cycles overhead per channel, plus one more 8 cycles for the whole DMA.

Now, once we're finished, you say that the CPU sees how many master clock cycles the next CPU cycle will be, and somehow gets a value 2-6 (fast), 2-8 (slow), or 2-12 (xslow), and adds that at the end. How do you determine how long this delay will be, exactly?

Look at the speed of the next CPU cycle to be executed after the DMA to get 6, 8, or 12. The CPU must be paused for a multiple of this number.

Now the next problem is that I really don't think it's possible for the CPU to 'predict' the next opcode cycle length. [...] I don't believe the CPU is peeking ahead and analyzing how long it will take to complete the next cycle before it happens, and if it does... how could we emulate this behavior?

And yet it does. How best to emulate it, i don't know yet.

Perhaps the CPU pauses itself after this cycle executed, for some odd reason?

i don't think this fits with our timing or DMA Open Bus results.

I don't see why the CPU would need to sync itself again anyway. The CPU should not have to worry about being aligned with DMA timing when DMA is inactive... that's what the first alignment before DMA is for :/

It's not the CPU aligning itself with the DMA in the first place. It's the SNES pausing the CPU to make way for the DMA to activate, then restarting the CPU after DMA is complete. Both CPU and DMA can only pause for a full CPU/DMA Clock cycle, which gives rise to these interesting 2-8 master cycle pauses at either end.

Hmm, I'm thinking that extra 8 cycles in the middle is just the CPU reading the first byte to transfer. Like so:
src = read();
for(i=0;i<r43x5;i++) {
*dest++ = src; //the DMA does this
src = read(); //while the CPU does this, hence why the sync is needed
}
So then it probably occurs directly after the first delay to sync with DMA.

Except that wouldn't work, you'd need two data busses for that and the SNES has only one. Current theory is that the DMA logic sets AAddress on Bus A, BAddress on Bus B, and tells one bus to write and the other to read. The shared data bus handles moving the byte between the two.

byuu · Post by **byuu** » Tue Jul 12, 2005 9:41 pm

Plus 8 cycles overhead per channel, plus one more 8 cycles for the whole DMA.

I can't find one of these, then. I've never gotten a delay >24, ever. With hundreds of tests... 8+8 is already 16, then you factor in the 2-8 (24), and 2-8 again (32, or 2-12 -> 36)...
My tests were mostly with one channel, but even still that's 16 cycles/overhead for the DMA alone, with no alignment cycles.
I've also tried doing 8 DMA transfers (sta $420b = #$ff) and my latch counter was still not off by more than 24 cycles to the SNES' counter.

Look at the speed of the next CPU cycle to be executed after the DMA to get 6, 8, or 12. The CPU must be paused for a multiple of this number.

Still don't understand this. I have the 6, 8, or 12. Now how do I turn that into 2-12? What do I subtract from the 6/8/12 to get that?
Do I take the delay, and find a multiple?
Like, start delay = 6, DMA channel overhead = 8, total = 14.
The next cycle will take 8, so then we need 8+8 = 16, 16-14 = 2?
Or does it use the entire DMA transfer, including all cycles from all the bytes transferred?
Does the per-channel or total DMA overhead come into play in this total?
Either way, I take it if that total = 16, and our next cycle length = 8, then the end delay needs to be 8 (8+8=16, but we can't have zero, so we need 8+8+8=24, 24-16=8)...

i don't think this fits with our timing or DMA Open Bus results.

I can't say anything on open bus, but it would sync our timing values. It would be impossible to latch the counters within two cycles of writing to $420b, so our results would end up being correct. But it wouldn't be correct emulation, supposedly.
How are you polling open bus during and immediately after a DMA transfer, anyway?

How best to emulate it, i don't know yet.

I'm going to give it a try... I guess it's time to really make the CPU/APU cycle based instead of opcode based... somehow, I'm going to need a flag that runs the next cycle just to get the count, and not actually do anything like access memory or modify the CPU state in any way.
It would probably be best to write two CPU cores for this. Joy.

http://byuu.org/files/cyclecpu.txt
Here's what I have so far, as a lousy mockup for running cycle-by-cycle and getting the cycle count for the next CPU cycle...

anomie · Post by **anomie** » Tue Jul 12, 2005 11:34 pm

Still don't understand this. I have the 6, 8, or 12. Now how do I turn that into 2-12? What do I subtract from the 6/8/12 to get that?

You start the DMA transfer with 2-8 cycles, then add your overheads and your byte costs. Call the total of all this T. Then add 2-6, 2-8, or 2-12 cycles so T+x is a multiple of 6, 8, or 12 as appropriate.
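
The rule above can be written down as a short calculation. This is only a sketch: it assumes the overheads listed earlier in the thread (8 for the whole DMA, 8 per active channel, 8 per byte), takes the 2-8 cycle start delay as an input instead of deriving it from the DMA clock, and the names `end_sync` and `dma_pause` are made up for illustration.

```cpp
// Cycles to append after a DMA so that t + x lands on a multiple of the
// upcoming CPU cycle length (6, 8 or 12). The result is never 0: if t is
// already a multiple, a full next_cycle is added.
int end_sync(int t, int next_cycle) {
  return next_cycle - (t % next_cycle);
}

// Total master cycles the CPU stays paused for one DMA (hypothetical helper).
// start_sync: the 2-8 cycle wait at the start of the transfer.
// channels:   number of active channels.
// bytes:      total bytes moved across all channels.
int dma_pause(int start_sync, int channels, int bytes, int next_cycle) {
  const int t = start_sync + 8 + 8 * channels + 8 * bytes;
  return t + end_sync(t, next_cycle);
}
```

With the numbers from the question, a running total of 14 followed by an 8-cycle CPU cycle needs 2 more cycles, and a total of 16 needs a full 8 because the end delay cannot be zero.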

I can't say anything on open bus, but it would sync our timing values. It would be impossible to latch the counters within two cycles of writing to $420b, so our results would end up being correct. But it wouldn't be correct emulation, supposedly.
How are you polling open bus during and immediately after a DMA transfer, anyway?

DMA from Open Bus into $2180, or from an Open Bus register (e.g. $2190) into RAM. The Open Bus value read after a "STA $420b" is always the next opcode, never the next byte (yes, NOP wouldn't show a difference, but try LDA #$ab).

somehow, I'm going to need a flag that runs the next cycle just to get the count, and not actually do anything like access memory or modify the CPU state in any way.

It's not that bad. At any point, you know what the next cycle will be: memory fetch (and you know the addr) or IO. IO is always 6, and memory is a fairly simple calculation based off the address.
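
As a rough illustration of that "fairly simple calculation", here is a sketch of an address-to-speed lookup. The region boundaries and the FastROM switch (`fast_rom`, mirroring bit 0 of $420d) are not given anywhere in this thread; they follow the commonly documented SNES memory map and should be treated as an assumption to check against a hardware reference. The function name `mem_speed` is hypothetical.

```cpp
#include <cstdint>

// Master cycles for one CPU memory access at a 24-bit address.
// Returns 6 (fast), 8 (slow) or 12 (xslow).
int mem_speed(uint32_t addr, bool fast_rom = false) {
  const uint32_t bank   = (addr >> 16) & 0xff;
  const uint32_t offset = addr & 0xffff;
  const bool     upper  = bank >= 0x80;   // banks $80-$ff
  const uint32_t b      = bank & 0x7f;

  // Banks $40-$7f are always slow; $c0-$ff follow the FastROM setting.
  if (b >= 0x40) return (upper && fast_rom) ? 6 : 8;

  // Banks $00-$3f and $80-$bf: split by offset.
  if (offset >= 0x8000) return (upper && fast_rom) ? 6 : 8;  // ROM
  if (offset <  0x2000) return 8;    // WRAM mirror
  if (offset <  0x4000) return 6;    // $2000-$3fff, includes $21xx
  if (offset <  0x4200) return 12;   // $4000-$41ff, old-style joypad
  if (offset <  0x6000) return 6;    // $4200-$5fff, includes $420b, $43xx
  return 8;                          // $6000-$7fff
}
```

Internal (I/O) cycles never reach this function; they are always 6.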

byuu · Post by **byuu** » Sat Jul 16, 2005 10:57 am

Ok, I understand it now (again). Thank you.

It's not that bad. At any point, you know what the next cycle will be: memory fetch (and you know the addr) or IO. IO is always 6, and memory is a fairly simple calculation based off the address.

True, but the problem there is when you're in the middle of an opcode, the address relies on addresses read before it (e.g. for an indirect address). It would be non-trivial to just jump right in the middle of an opcode to get an address.
The easy solution seemed to be to create some sort of pseudo-code to add all the red tape that a cycle-based emulator would need, along with a separate time() function for each opcode to tell you how long the next cycle would be.
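
To make the idea concrete, here is a sketch of what one cycle-stepped opcode with a side-effect-free timing query could look like. It is not the generated code linked below: the types `NextCycle` and `LdaAbsolute`, the function `time_next`, and the callback signatures are all invented for illustration, and it covers only LDA absolute with an 8-bit accumulator (opcode fetch, operand low, operand high, data read).

```cpp
#include <cstdint>
#include <functional>

// Description of the cycle an opcode would run next, without running it.
struct NextCycle {
  bool     is_io;  // true: internal I/O cycle (always 6 master cycles)
  uint32_t addr;   // 24-bit bus address for a memory cycle
};

// Hypothetical cycle-stepped LDA absolute, 8-bit accumulator, 4 cycles.
struct LdaAbsolute {
  uint32_t pc;                     // 24-bit address of the opcode byte
  uint8_t  dbr;                    // data bank register
  uint8_t  lo = 0, hi = 0, a = 0;  // operand bytes read so far, result
  int      step = 0;               // cycles completed so far (0-4)

  // Program fetches wrap inside the 64 KiB program bank.
  uint32_t code_addr(int n) const {
    return (pc & 0xff0000) | ((pc + n) & 0xffff);
  }

  // Reports the next cycle; touches neither memory nor CPU state.
  NextCycle peek() const {
    if (step < 3) return {false, code_addr(step)};
    return {false, (uint32_t(dbr) << 16) | (uint32_t(hi) << 8) | lo};
  }

  // Executes exactly one cycle through the supplied bus read.
  void run_cycle(const std::function<uint8_t(uint32_t)>& read) {
    const uint8_t value = read(peek().addr);
    if (step == 1) lo = value;
    if (step == 2) hi = value;
    if (step == 3) a  = value;
    ++step;
  }
};

// Length in master cycles of the cycle that would run next.
int time_next(const LdaAbsolute& op,
              const std::function<int(uint32_t)>& mem_speed) {
  const NextCycle next = op.peek();
  return next.is_io ? 6 : mem_speed(next.addr);
}
```

Because the opcode is stepped one cycle at a time, the operand bytes it needs for the final address are already sitting in `lo` and `hi` when the fourth cycle is queried, which sidesteps the "jump into the middle of an opcode" problem. An opcode such as NOP would return `is_io = true` for its second cycle.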

I ended up with a little ~4k parser. The input looks like this:
http://byuu.cinnamonpirate.com/files/op_read_b.txt
And the output looks like this:
http://byuu.cinnamonpirate.com/files/op_read_cpp.txt

The bottom stuff is auto-generated formalities for the classes. All of the functions used inside the opcodes are inlined. Seems I can implement half of the CPU with <10kb of code, and switch between cycle and opcode-based timing with a single line edit in the parser, or just have it generate both sets of opcode types. I gain the advantage of having all like-type opcodes paired together to avoid errors, and I can probably add a lot more optimizations to the code generator in the future if need be (obviously from that sample code, my focus isn't speed). The op_time() function in the time() opcodes should allow me to handle special cases like the 4-cycle delay before $4201 writes latch the counter.

Does this (using a preparser) seem like a good idea or a bad idea from a programming standpoint? Obviously I could just include the generated .cpp files with any source code releases, even though the parser is ansi-c++ as well.

Anyway, I'm halfway done with this, and it seems to work ok, so hopefully in a few more days I'll be able to verify your DMA timing notes through emulation.

byuu · Post by **byuu** » Mon Jul 18, 2005 10:00 am

Wow, anomie: you're a genius. Your DMA timing notes are amazing. I was able to get exact emulation results to my SNES with all four possible DMA counter positions, with either a 6 or 8 cycle "next" opcode cycle time (nop vs lda $2137, second cycle being 6 or 8), and with any number of channels active / bytes being transferred. Wonderful.

I really have no idea how you figured out those results, but awesome job. There is but one minor issue with your previous notes that you may want to correct:

"(starting from Reset, presumably. If you latch $2137 4 master
cycles into the 6-cycle read, the clock starts at 0 i think)."

The DMA clock starts at 2 on reset if you latch $2137 0 master cycles into the 6-cycle read. Thus, it would start at 0 if you latched 2 cycles in, and at 6 if you latched 4 cycles in.

I really think that this counter starts at zero, however. I think that $2137 reads latch the counter two cycles in, and $4201 writes latch the counter six cycles in. This will give us correct results for $2137 reads, $4201 writes, and DMA transfers. And the number zero gives a hint of legitimacy.
Obviously the exact delays probably vary depending on the exact read and write address, and whether it's a read or a write, but I think we can figure out all of the important cases ($21xx, $4xxx ranges) with a few simple tests anyway.

Here's my notes confirming the above:
http://byuu.org/files/dma_results.txt

Kinda gibberish-y, but I'm sure you're used to that from me by now. I also ran a lot more tests using a modified version of my optiming program (I'm starting to love that thing), and all matched up perfectly. I plan to write up a much cleaner version of those notes here very shortly.

Now, one major issue stands up at me: you say that HDMA uses the exact same delays as DMA, but what happens when HDMA triggers while DMA is running? I take it I'm not going to like the answer... :/

---

Edit: Hmm, by disallowing HDMA while DMA is transferring, nothing breaks. Could've sworn that used to break a lot of games on me.

I went ahead and tried changing all MMIO reads to require 2 master cycles before actually reading, and 4 master cycles after; and all MMIO writes to 6/0. This cleans up my hblank/vblank code quite a lot, but I also noticed something strange with real hardware.

hblank begin : $4212 bit 6 varies on and off when hcycles == 1096
hblank end : $4212 bit 6 varies on and off when hcycles == 4
vblank begin : $4212 bit 7 varies on and off when vcycles == 225 && hcycles == 0
vblank end : $4212 bit 7 varies on and off when vcycles == 0 && hcycles == 0

Much like 323/327, there seems to be a magic tolerance of half a dot on the start/end flags, so the actual value (bits 7 and 6) read from $4212 changes every time you reset the console. I'm betting it's because the dot clock's half dot isn't reset when the SNES resets itself, or maybe something similar. That would cause all of the behavior I've seen on my SNES. This would also explain why your dot timing test never reflected the 321/325 long dots I noticed before.
If anyone wants to test out a couple ROMs for me on hardware to see if they get the same flickering values, I'd appreciate it. Let me know and I'll post the ROMs here.

The new $4212 code looks like this, does this match your results now?
if(hcycles <= 2 || hcycles >= 1096)result |= 0x40; //dots 274-339+0
if(vcounter >= (visible_scanlines + 1))result |= 0x80; //not set for first dot of 0 or clear for first dot of (visible_scanlines + 1) anymore

Next thing, with making $2137 reads latch 2 cycles later, and $4201 6 cycles later, I had to update my reset cc1 counter position from 188 to 186. I didn't have to update the dma counter position though, which seemed kind of weird at first.
186 & 7 = 2. Now, since DMA has its "own" timer in emulation (yeah, it's shared with cc1 on real hardware), it isn't affected by whatever the cc1 clock starts at. And it also isn't affected by the 2-cycle MMIO read delay, or 6-cycle write delay, since DMA only calculates timings on complete cycles. But of course, $4210/$4212/$2137/$4201 are affected by both. I'm not quite sure I understand how, but for whatever reason: my old timing worked (188/2 at reset with 0/6 reads), and so does my new timing (186/2 with 2/4 reads), so... meh.
That part I quoted you on above is probably irrelevant now in hindsight.

anomie · Post by **anomie** » Tue Jul 19, 2005 9:51 pm

byuusan wrote: Wow, anomie: you're a genius. Your DMA timing notes are amazing. I was able to get exact emulation results to my SNES with all four possible DMA counter positions, with either a 6 or 8 cycle "next" opcode cycle time (nop vs lda $2137, second cycle being 6 or 8), and with any number of channels active / bytes being transferred. Wonderful.

I really have no idea how you figured out those results, but awesome job.

I just looked at the results, and did the best i could to explain them...

I really think that this counter starts at zero, however. I think that $2137 reads latch the counter two cycles in, and $4201 writes latch the counter six cycles in. This will give us correct results for $2137 reads, $4201 writes, and DMA transfers. And the number zero gives a hint of legitimacy.

That does sound likely.

Now, one major issue stands up at me: you say that HDMA uses the exact same delays as DMA, but what happens when HDMA triggers while DMA is running? I take it I'm not going to like the answer... :/

Do you like "i don't know"? I don't remember actually checking that.

Edit: Hmm, by disallowing HDMA while DMA is transferring, nothing breaks. Could've sworn that used to break a lot of games on me.

Really? Hrm... OTOH, it could be the games you tested are just not doing such a thing. A test: set a HDMA to draw something interesting to the screen (e.g. change color 0 from black to white). Then do a very long garbage DMA (to $2190 or something) and see if it delays the HDMA or what.

byuu · Post by **byuu** » Thu Jul 21, 2005 10:33 am

Really? Hrm... OTOH, it could be the games you tested are just not doing such a thing. A test: set a HDMA to draw something interesting to the screen (e.g. change color 0 from black to white). Then do a very long garbage DMA (to $2190 or something) and see if it delays the HDMA or what.

It does not :(
Changed it back to give HDMA priority. I set up a vertical gradient from red to black, then halfway down the screen I transferred 0x2000 bytes (40 lines or so timewise) from $213e to $7f0000, moved that to SRAM and looked at it. I was hoping the HDMA would stop the DMA transfer process, but no such luck. All 0x2000 bytes were 0x51 in the SRAM file, and the HDMA had no missing lines, nor were they shifted down any.
So, HDMA is probably pausing DMA and performing the transfer, and then resuming the DMA transfer. I don't know if HDMA needs to sync with the CPU again (probably does for the channel init overhead), or if the DMA has to again once the HDMA is finished (again, probably does).
I guess our best bet would be to get HDMA timing perfect, and then start playing around with overlapping the two.

byuu · Post by **byuu** » Thu Jul 21, 2005 11:47 am

Well then, allow me to start things off.

HDMA initialization (which happens at the start of every frame) isn't constant.

First, my test code:

;execute test code
  lda #$08              : sta $4300 : sta $4310 : sta $4320 : sta $4330 : sta $4340 : sta $4350 : sta $4360 : sta $4370
  lda #$3e              : sta $4301 : sta $4311 : sta $4321 : sta $4331 : sta $4341 : sta $4351 : sta $4361 : sta $4371
  lda.b #hdma_table     : sta $4302 : sta $4312 : sta $4322 : sta $4332 : sta $4342 : sta $4352 : sta $4362 : sta $4372
  lda.b #hdma_table>>8  : sta $4303 : sta $4313 : sta $4323 : sta $4333 : sta $4343 : sta $4353 : sta $4363 : sta $4373
  lda.b #hdma_table>>16 : sta $4304 : sta $4314 : sta $4324 : sta $4334 : sta $4344 : sta $4354 : sta $4364 : sta $4374
- lda $4212 : bpl -
  lda.b #%00000001 : sta $420c
- lda $4212 : bmi -
  lda #$00 : sta $420c

  jsl $7f0000

With that code, I can simply change the lda.b #%nnnnnnnn to test any number of channels being initialized.

These are the latch values I got with the above:

[8] 75.5->79.0->82.5 %11111111 + 6
[7] 74.0->77.5->81.0 %01111111 + 6
[6] 72.5->76.0->79.5 %00111111 +12
[5] 69.5->73.0->76.5 %00011111 + 6
[4] 68.0->71.5->75.0 %00001111 + 6
[3] 66.5->70.0->73.5 %00000111 +12
[2] 63.5->67.0->70.5 %00000011 + 6
[1] 62.0->65.5->69.0 %00000001 +24
[0] 56.0->59.5->63.0 %00000000
per-channel: (6+6+12)/3 = 8
base: 16 cycles
---
[6] 72.5->76.0->79.5 %11100111
[5] 69.5->73.0->76.5 %11010110

The [n] represents the number of active channels. The three latch numbers are taken with 0 NOPs at $7f0000, then with 1, then with 2. They're only used to get the half-dot latch position.
The last two just confirm the pattern doesn't care which channels are active, just how many.

Now, let's try the same thing with a nop at the TOP of this code (which will offset the DMA counter by -2):

[8] 79.5->83.0->86.5 %11111111 + 8
[7] 77.5->81.0->84.5 %01111111 + 8
[6] 75.5->79.0->82.5 %00111111 + 8
[5] 73.5->77.0->80.5 %00011111
per-channel: 8

I stopped at 5 because the pattern was pretty obvious, and because I realized the nop was poorly placed. I can't predict the DMA counter position after the - lda $4212 : bpl - line. I want to make sure that I'm only offsetting the DMA counter by two before I run hundreds of tests.

Here's what I plan to do next:
1) Find out exactly when HDMA initialization occurs, so that we can see what the DMA counter is currently at. It will either be at an exact half-dot position every time, or a sliding half-dot position every time; after a cycle completes, or after an opcode completes.
2) Write a new demo program to automate as much of the hundreds of tests that this is invariably going to take without having to keep uploading new programs to a copier.
3) Once we have base initialization out of the way, then we can worry about per-scanline timing.
4) Once we have per-scanline timing taken care of, then we can worry about DMA/HDMA conflict timing.

[Later:]
Should've known. Nintendo is going to make this as difficult as possible. By removing the wait for start of frame code from the above code, I manually punched in different opcode counts to get the exact half-dot where the HDMA init began. This is what I got:

0 frames*
6.0- [750/821]
6.5+ [749/822] -> 12
12, 13, 15, 14, 14, 15, 17, 16, 16, 17, 19, 18,
18, 19, 21, 20, ...

1 frame*
5.0- [742/829]
5.5+ [741/830] -> 11
11, 12, 14, 13, 13, 14, 16, 15, 15, 16, 18, 17,
17, 18, 20, 19, ...

2 frames*
5.0- [740/830]
5.5+ [739/831] -> 11
11, 12, 14, 13, 13, 14, 16, 15, 15, 16, 18, 17,
17, 18, 20, 19, ...

3 frames*
6.0- [728/842]
6.5+ [727/843]

4 frames*
6.0- [732/839]
6.5+ [731/840]

5 frames*
5.0- [734/836]
5.5+ [733/837]

6 frames*
5.0- [738/833]
5.5+ [737/834]

7 frames
6.0- [736/834]
6.5+ [735/835]

8 frames
6.0- [740/831]
6.5+ [739/832] -> 12
12, 13, 15, 14, 14, 15, 17, 16, 16, 17, 19, 18,
18, 19, 21, 20, ...

To explain that: the first line of each group is the number of frames I skipped [non-interlace mode] before ever enabling HDMA (the first test is performed on the second frame, because I wait for vblank before running any code). The next line is the very last dot that doesn't trigger HDMA init, and the line after that is the very first dot that does. The best pattern I can notice is 6.5, 6.5 -- 5.5, 5.5, etc.
However -- look what happens when I start incrementing the dot clock by 0.5 dots at a time! I only tried it on four of the pairs, but it's definitely an insane pattern of some sort. Whatever the hell kind of pattern that is, it's consistent between the two different HDMA init start times (the pattern is the same when HDMA init requires 6.5 time, and when it required 5.5 time).
The dot clock is completely distorted while HDMA is active. I have no clue what HDMA is trying to do here.
The only thing I was able to figure out is that HDMA init occurs on a certain cycle, and not a certain opcode. Otherwise, the HDMA init times would be more sporadic, like DRAM refresh was.

anomie · Post by **anomie** » Thu Jul 21, 2005 1:54 pm

We always knew HDMA didn't terminate the DMA transfer, it just pauses it while the HDMA occurs.

HDMA Init isn't supposed to be constant. It works much like DMA: it waits until the edge of a CPU cycle (but can be mid-opcode), 'syncs' to the DMA clock, does some work (8 per channel, plus another 16 per channel set for indirect addressing), then 'syncs' back to the CPU clock.

The HDMA transfer is the same with the syncing. Each active channel (whether or not a transfer occurs this line) has 8 cycles, presumably to decrement or load $43xA. Then the channel gets 16 cycles if a new indirect address is loaded and 8 cycles per byte actually transferred.

If we figured out the exact dots those want to trigger on, I forget the numbers.

byuu · Post by **byuu** » Thu Jul 21, 2005 2:46 pm

HDMA Init isn't supposed to be constant. It works much like DMA: it waits until the edge of a CPU cycle (but can be mid-opcode), 'syncs' to the DMA clock, does some work (8 per channel, plus another 16 per channel set for indirect addressing), then 'syncs' back to the CPU clock.

So what's up with the horrible dot-clock distortion long after HDMA init, and before the first HDMA transfer? And joy, now we get to account for HDMA init overlapping a DMA transfer. That actually probably happens quite a lot in games that overshoot vblank time.
I'll try and mirror the DMA code into HDMA init and see what happens.

Edit: Ok, I don't understand the weird dot-clock distortion... but after performing a DMA-style init on the HDMA init, I was able to match the same distortion pattern and results I got on my copier.

Keeping in mind that HDMA init can trigger on the edge of any cycle, here's the code:

  if(status.hdmainit_triggered == false) {
    if(time.v != 0 || time.hc >= 16) {
      status.hdmainit_triggered = true;
      c = 0;
      for(z=0;z<8;z++) {
        if(dma->channel[z].hdma_active == true)c++;
      }
      if(c == 0)return;
    //only perform HDMA init when HDMA is active...
      status.hdma_cpu_state = status.cpu_state;
      status.cpu_state  = CPUSTATE_HDMA;
      status.hdma_state = IHDMASTATE_INIT;
      return;
    }
  }

  if(status.cpu_state == CPUSTATE_HDMA) {
    switch(status.hdma_state) {
    case IHDMASTATE_INIT:
      exec_cycle();
      status.hdma_state = IHDMASTATE_DMASYNC;
      break;
    case IHDMASTATE_DMASYNC:
      n = 8 - clock->dma_counter();
      add_cycles(n);
      status.hdma_cycle_count = n;
      add_cycles(8);
      status.hdma_cycle_count += 8;
      status.hdma_state = IHDMASTATE_RUN;
      break;
    case IHDMASTATE_RUN:
      dma->hdma_initialize();
      for(z=0;z<8;z++) {
        if(!dma->channel[z].hdma_active)continue;
        c = (dma->channel[z].hdma_indirect == true)?24:8;
        add_cycles(c);
        status.hdma_cycle_count += c;
      }
      status.hdma_state = IHDMASTATE_CPUSYNC;
      break;
    case IHDMASTATE_CPUSYNC:
      c = time_cycle();
      z = 0;
      while(z <= status.hdma_cycle_count)z += c;
      z -= status.hdma_cycle_count;
      add_cycles(z);
      status.cpu_state = status.hdma_cpu_state;
      break;
    }
    return;
  }
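
The DMASYNC, RUN and CPUSYNC arithmetic in the snippet above can be pulled out into a standalone calculation to see where the uneven per-channel costs come from. This is a sketch only: `hdma_init_pause` and `per_channel_deltas` are hypothetical names, indirect channels are left out, and the start sync of 4 and the 6-cycle resume cycle used in the note below are values picked for illustration, not measurements.

```cpp
#include <array>

// Total master cycles the CPU is paused by HDMA init.
// start_sync: 8 - dma_counter at the trigger point.
// channels:   number of active, non-indirect HDMA channels.
// next_cycle: length of the CPU cycle that resumes afterwards.
int hdma_init_pause(int start_sync, int channels, int next_cycle) {
  if (channels == 0) return 0;         // init is skipped when HDMA is off
  const int count = start_sync + 8 + 8 * channels;
  int z = 0;
  while (z <= count) z += next_cycle;  // first multiple strictly above count
  return z;
}

// Extra pause contributed by each additional channel, for 1..8 channels.
std::array<int, 8> per_channel_deltas(int start_sync, int next_cycle) {
  std::array<int, 8> deltas{};
  int previous = 0;
  for (int k = 1; k <= 8; ++k) {
    const int pause = hdma_init_pause(start_sync, k, next_cycle);
    deltas[k - 1] = pause - previous;
    previous = pause;
  }
  return deltas;
}
```

Under those assumptions, `per_channel_deltas(4, 6)` yields 24, 6, 12, 6, 6, 12, 6, 6, the same sequence as the "+" column of the first latch table: each channel costs 8, but the pause is rounded up to a multiple of 6, so it shows up as 6+6+12 per three channels.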

There's still a problem with when exactly HDMA init begins. It starts at the beginning of dot 4 for two successive frames, and then the beginning of dot 3 for two more. I don't know what's causing that yet, and of course -- even being off by half a dot with this timing totally breaks the results.

Theory: The distortion is in non-interlace mode, so take the following:
262*1364 =357368 cycles/frame
262*1364-4=357364 cycles/frame
We start the system with dma_counter = 0...

Going to frame 1... 0 + 357368 = 357368 & 7 = 0
Going to frame 2... 357368 + 357364 = 714732 & 7 = 4
Going to frame 3... 714732 + 357368 = 1072100 & 7 = 4
Going to frame 4... 1072100 + 357364 = 1429464 & 7 = 0
Going to frame 5... 1429464 + 357368 = 1786832 & 7 = 0

So then, the cycle position where HDMA triggers would be 12 + ((cycles & 7) ^ 4)... interesting though, as DMA is not affected by this at all. It must be because HDMA is trying to align itself with new frames, and its 8-cycle clock is getting out of sync with the number of cycles in a frame.

We simply need to keep track of what dma_counter was at the very start of the frame. 8 - frame_start_dma_counter + 8 = exact cycle start of HDMA init for the frame. Man, (H)DMA sure does love the number 8.
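
Both pieces of this theory fit in a few lines. The sketch below assumes non-interlace frames alternating between 262*1364 and 262*1364-4 master cycles, starting with the long one as in the table above; `frame_start_counters` and `hdma_init_start` are names invented for the example.

```cpp
#include <cstdint>
#include <vector>

// Value of the 8-cycle DMA counter at the start of frames 1..frames,
// assuming the counter is 0 at reset.
std::vector<int> frame_start_counters(int frames) {
  const uint32_t long_frame  = 262 * 1364;      // 357368 master cycles
  const uint32_t short_frame = long_frame - 4;  // 357364 master cycles
  std::vector<int> counters;
  uint32_t total = 0;
  for (int f = 1; f <= frames; ++f) {
    total += (f % 2 == 1) ? long_frame : short_frame;
    counters.push_back(static_cast<int>(total & 7));
  }
  return counters;
}

// Cycle within the frame at which HDMA init starts, per the formula
// 8 - frame_start_dma_counter + 8.
int hdma_init_start(int frame_start_dma_counter) {
  return 8 - frame_start_dma_counter + 8;
}
```

For the two counter values that actually occur (0 and 4), this gives 16 and 12, which agrees with the earlier form 12 + ((cycles & 7) ^ 4).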

[Later:] Yep, that got my results matching real hardware perfectly. Hooray. Now onto per-scanline HDMA...
One thing to note is that the dma frame counter starts at 4 and not zero... this could mean that the SNES sleeps for a full frame (two non-interlace frames) after resetting the console... probably easier to just set the dma counter to zero and the dma frame counter to four. What I do is at the start of each frame, add the number of cycles the previous frame had to an 8-bit counter and mask it by 6 (same as 7, really). You obviously would have to set the number of cycles for the first frame (262*1364) at reset.

I was incorrect about the test above. I forgot that I actually sleep for two frames before running the test, so the first test is really on the third frame.

anomie · Post by **anomie** » Fri Jul 22, 2005 12:55 am

byuusan wrote: So what's up with the horrible dot-clock distortion long after HDMA init, and before the first HDMA transfer?

Where are your NOPs getting inserted, before or after the Init?

It must be because HDMA is trying to align itself with new frames, and its 8-cycle clock is getting out of sync with the number of cycles in a frame.

That would be my guess.

One thing to note is that the dma frame counter starts at 4 and not zero... this could mean that the SNES sleeps for a full frame (two non-interlace frames) after resetting the console...

Most likely, it's just an artifact of however the HDMA logic detects the start of the frame....

byuu · Post by **byuu** » Fri Jul 22, 2005 4:42 am

Where are your NOPs getting inserted, before or after the Init?

I would think after.

The code is
- lda $4212 : bpl - ;now we're in vblank
lda #$01 : sta $420c
jsl $7f0000 ; run a bunch of NOPs and other opcodes to get us past HDMA init
lda $2137

I basically keep changing the lda $002100/lda $008000 counts to offset the counter by .5 each time. So I would assume that after hitting HDMA init, all the future opcodes would just push that clock forward slightly, but I guess not. The distortion lasts for >8 dots, too, which is longer than the slowest opcode inside the jsl.

You're probably right about the HDMA counter. Maybe it can be simplified by checking to see if V==0, H >= n, and dma_counter == 0, where n could be anything. I'll look into it.
