New HDMA emulation findings

Archived bsnes development news, feature requests and bug reports. Forum is now located at http://board.byuu.org/
byuu

New HDMA emulation findings

Post by byuu »

Posting in a new thread as this is relevant to all SNES emu authors.

First up, we know that all three DMA states (DMA, HDMA and HDMA init) wait one cycle after acknowledgment before beginning. With DMA, it would be exceedingly difficult to disable all channels immediately after enabling them (though I'm sure it's possible.)

However, with HDMA and HDMA init, it's quite easy to do. So essentially, if HDMA triggers and you write zero to $420c during the DMA synchronization cycle, it will prevent the HDMA from occurring at all.

I blocked the channels when this happened before, but there's one small catch I missed: it prevents the synchronization delays as well! It's a complete abort of the operation.
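
In emulator terms, here's a minimal sketch of that abort (the handler and flag names are hypothetical, not actual bsnes internals):

Code: Select all

//sketch: a $420c write handler honoring the mid-sync abort described above.
//status.in_dma_sync_cycle and the other names here are hypothetical.
void w420c(uint8_t data) {
  status.hdma_enabled_channels = data;
  //writing zero during the one-cycle synchronization window aborts the
  //pending HDMA outright: no transfer, and no synchronization delay either
  if(status.in_dma_sync_cycle && data == 0x00) {
    status.hdma_pending = false;
  }
}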

Next up, I've noticed something odd with the DMA sync timing for pure HDMA only: we previously understood the DMA sync overhead to be "8 - dma_clock() + 8" cycles. However, it seems that sometimes (most of the time from my observations), the last part (+8) is omitted. I'm guessing it is related to complexity, as I set up eight identical channels that do nothing but reload NTRLx -- with one channel active, the +8 is not there; but with eight channels active, it is there. I will have to investigate this further. But for now, it seems safer to omit the +8 cycles, as it seems more common and games are more finicky about running too slowly than too fast (Jumbo Ozaki no Hole in One, Breath of Fire II German, etc.)

And lastly, the most important finding of all.

Previously, during HDMA transfers, we were adding 8 clocks for each active channel, and 8 clocks for each byte transferred (1-4 bytes/channel.) If the line counter then reached zero, we would add eight more to reload that. If this were an indirect HDMA transfer, we would add 16 more clocks to fetch this.
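
In code form, that model looked roughly like this (a sketch with illustrative names; as explained below, the first item turns out to be wrong):

Code: Select all

//previous model: clock accounting per active channel during an HDMA transfer
for(unsigned i = 0; i < 8; i++) {
  if(hdma_active(i) == false) continue;
  dma_add_clocks(8);                                         //per-channel overhead
  dma_add_clocks(8 * transfer_length[channel[i].xfermode]);  //1-4 bytes, 8 clocks each
  if(--channel[i].hdma_line_counter == 0) {
    dma_add_clocks(8);                                       //line counter reload
    if(channel[i].hdma_indirect) dma_add_clocks(16);         //indirect address fetch
  }
}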

It seems all of this was right, except for the very first part. I set up eight HDMA channels and ran HDMA for two scanlines. The result was emulation was exactly 128 clocks ahead of real hardware. (8*8)*2 = 128. Note that there was no DMA sync / CPU sync offset because 128 is divisible by the modulo-8 DMA counter, and the CPU cycle immediately after HDMA was obviously the same for both of my tests.

This one seems rather hard to believe, but 8-64 clocks per HDMA transfer is pretty significant, so it definitely needs more investigation to verify my findings.

For the time being, here's a test ROM for everyone to try out themselves:
http://byuu.cinnamonpirate.com/temp/tes ... isable.zip

Pardon the name of the archive; I was testing the first finding when I named it. Anyway, it writes the latched counter values to SRAM after HDMA occurs on the second scanline.

Real hardware will return V=2,H=0. One thing to note is that I currently get V=1,H=336. This is because I don't add the +8 during HDMA synchronization as I mentioned above, and the rest of the difference is due to the syncs throwing off the exact numbers. If I add the +8, I can match hardware.

But now try the test with adding the eight-clock HDMA channel overhead and suddenly we're 128 clocks ahead!

I know there's a zero percent chance of someone actually doing this, but I'll ask anyway :P
Does anyone want to try and help locate what exactly forces the +8 cycle overhead during HDMA?
grinvader
ZSNES Shake Shake Prinny
Posts: 5632
Joined: Wed Jul 28, 2004 4:15 pm
Location: PAL50, dood !

Post by grinvader »

I'd also like to have these timings logically explained. It's getting a bit full of exceptions, which is a good hint that the base assumptions are incorrect.
Everyone shut up and follow me!!

Code: Select all

<jmr> bsnes has the most accurate wiki page but it takes forever to load (or something)
Pantheon: Gideon Zhi | CaitSith2 | Nach | kode54
byuu

Post by byuu »

grinvader wrote:I'd also like to have these timings logically explained. It's getting a bit full of exceptions, which is a good hint that the base assumptions are incorrect.
I've written a logical explanation of the DMA sync process before, actually, but the diagrams sucked, so I don't think I ever posted it. Definitely need to rewrite that one day.

It does actually make sense, with the exception of how it "knows" how long the next clock cycle will be. It can probably do that because it times the bus cycle before the work cycle executes.

As for the rest of it, I won't be so arrogant as to say the assumptions are incorrect; but I do hope that those with concerns run their own tests. I really, really would like an extra person looking over my findings.

The tough part is that it's impossible to query every possible combination of every action. But it's quite easy to get a specific test working right. What I do then is store the results of that test into a test ROM and continue on. Anytime I make changes to a given component, I make sure all of the old tests still work. So, I can at least say that my new findings do not break any known behavior from before.

Some matters also lack logical explanations. For instance, if an interrupt triggers during the I/O cycle of a fetch+I/O opcode (nop, clc, etc), the I/O cycle ends up becoming a read from PBR,PC. I've proven it with multiple tests, but it still makes absolutely no sense.

But I should note, the SNES really is this complex. There really are hundreds of inane edge cases that you can either support or ignore. But I will say this: I have games breaking just by adding eight extra clock cycles in an HDMA sync. It seems this process is more important to timing than I once thought -- I suspect simply adding 18 clocks will no longer suffice if you want to get 100% compatibility.

---

EDIT:

Looked at the two games that break with the extra 8-cycle DMA sync / setup overhead for HDMA: Jumbo Ozaki only has one active channel then, and Breath of Fire 2 German has two. My tests showed the delay wasn't there for one channel, and that it was for eight. So it seems like the extra delay is probably tacked on once enough channels are active. Possibly 5-8 or something.

I'll modify my test to strobe all 256 possible values for HDMA channels to see what the pattern is.

I should also check if the 8-cycle per-channel overhead is still there when neither a transfer nor line fetch occurs. My test was always fetching a new line counter value, and it seems Mecarobot Golf was as well.
FitzRoy
Veteran
Posts: 861
Joined: Wed Aug 04, 2004 5:43 pm
Location: Sloop

Post by FitzRoy »

Bump for great importance, and to let byuu know that he should fix the date on his latest website entry or people might think he went back in time.
byuu

Post by byuu »

Okay, wasted another day on this.

http://byuu.cinnamonpirate.com/temp/test_hdmasync.zip

The test runs with every combination of every channel enabled/disabled (eg 256 passes), hitting two of the DMA counter positions. It then shifts the timing phase by 2 and repeats, giving a total of 512 HDMA runs. It latches the counters twice with a phase 2 shift between them to verify the HDOT position is perfect.

Blue screen indicates pass, red screen fail, black screen epic fail. I hate posting these test_ ROMs, because they're really difficult to pass. I wouldn't recommend trying to pass this test ROM in another emulator until it supports absolutely perfect cycle timing, bus hold delays, DMA, HDMA and HDMA init sync delays, color burst scanline timing, long dots, DRAM refresh and all timing differences between the two CPU revisions. Otherwise you'll just drive yourself mad.

Also, if you run it on a copier, note that the DMA counter may be misaligned (50% chance), causing the test to fail. I could detect this, but it would be semi-annoying, so I took the easy way out: just reset the game immediately after loading it, so that the DMA counter will be reset to zero. It'll pass on every future reset, as well.

Looks like I was wrong about the +8/+0 HDMA overhead issue. It always has the DMA sync overhead of: 8 - dma_counter() + 8. The same for DMA and HDMA init. That's good, less complicated that way.
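
As a sketch, the single rule for all three cases (function name illustrative):

Code: Select all

//DMA synchronization overhead, identical for DMA, HDMA and HDMA init:
//align to the next 8-clock DMA counter boundary, plus a fixed 8 clocks
unsigned dma_sync_overhead() {
  return (8 - dma_counter()) + 8;
}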

What was actually throwing me off was my HDMA trigger position, which in turn threw off the counter values of my limited test by exactly 2. It was due to the forced V,H counter alignment that I didn't notice. Tricky stuff.

Anyway, I've previously verified that HDMA init occurs at V=0,H=12+8-dma_counter() on CPU revision 1, and V=0,H=12+dma_counter() on CPU revision 2. Yeah, no idea why they changed that, but they did.

As it turns out, HDMA does not begin at H=1106, it actually begins at: H=1100+dma_counter() at the start of the scanline. What's neat is it's the same for both CPU revisions. I wasn't expecting that.

For those out of the loop, the DMA counter is basically incremented once every eight clock cycles, and when I mention it above, I'm meaning its value mod 8. Again since there's no single cycle stepping, that means HDMA occurs at either 1100, 1102, 1104 or 1106.

Further, since scanlines are always either 1360 or 1364 cycles long, and both are divisible by four, this means HDMA always occurs at either 1100 or 1104. The exact trigger point essentially alternates every scanline except for non-interlace even field line 240, which is 1360 cycles (evenly divisible by 8.)

It's the same for DRAM refresh on CPU revision 2, actually.

Also note that by trigger position, I'm obviously meaning where the process begins. You still end up executing one more CPU cycle (where HDMA can be aborted with no overhead if writing zero to $420c there), so an emulator that doesn't support the cycle delay would want to use H=~1112-1114 to start the transfer.
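
In code form, the trigger points measured so far (a sketch; helper names are illustrative):

Code: Select all

//HDMA init trigger point at V=0 differs between the two CPU revisions
unsigned hdmainit_trigger_position() {
  if(cpu_revision == 1) return 12 + 8 - dma_counter();
  return 12 + dma_counter();  //CPU revision 2
}

//HDMA trigger point, the same on both revisions; dma_counter() is 0/2/4/6
unsigned hdma_trigger_position() {
  return 1100 + dma_counter();
}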

---

The good news in (closer to) plain English:

I can now get proper timing (and thus latch results) for all three DMA modes, so now I can start messing with triggering HDMA during DMA, and figuring out what happens then. Not looking forward to that ... but I need to figure it out before I can rewrite the DMA state machine code.

Not planning to mess around much with the CPU crashing HDMA conflict bug just yet.
tetsuo55
Regular
Posts: 307
Joined: Sat Mar 04, 2006 3:17 pm

Post by tetsuo55 »

Awesomely done Byuu! Sorry I can't help you with this at all.

You should be proud to post these test programs, they raise the accuracy bar yet again. Also they should help other programmers make their emulators more accurate.

What you're probably afraid of is that people will start harassing other devs to fix their emulators so the test passes.
byuu

Post by byuu »

And one more day ... hopefully this will be enough so that I can take a break from all this.

http://byuu.cinnamonpirate.com/temp/test_hdma.zip

Two tests this time: test_hdmasync.smc is the same as before; test_hdmatiming.smc verifies cycle overhead for most/all combinations of byte transfer / no byte transfer, line transfer / no line transfer, and indirect address fetch / no indirect address fetch.

Unfortunately, I've invalidated my previous findings yet again, which seems to be a common theme, but eh -- that's progress.

With each new finding, yet another regression test ROM is created, so that previously observed results are still met with new findings. So even if new info is wrong, it's still always an improvement over old theory.

That said, here's what now seems to be the case:

HDMA trigger position is always H=1104. Because my previous HDMA sync test ROM only had two actual HDMA trigger points (due to V=0,H=0 sync function call), and due to some extreme luck (or lack thereof), it passed with H=1100+DMA_counter() [1100 or 1104, alternating]. I went with trying DMA_counter() because HDMA init sync definitely does use this (I have a test ROM to verify this; it won't pass with a fixed value on either CPU revision.) I discovered this when testing with the new ROM in the above archive, test_hdmatiming.smc, which gave an additional six unique points to hit HDMA transfers.

So we were only off by 2 clocks before.

Next, looks like I was also wrong about HDMA per-channel overhead. So long as the HDMA completed flag is false and the channel is enabled, you always get 8 cycles/channel overhead. But the thing that's really interesting is that this overhead is shared with the line counter fetch. My current theory is that the line fetch happens no matter what, but $43x8/43x9/43xa are only updated if --linecounter & 0x7f == 0, i.e. we've actually reached the end of the channel.
It's probably just grouping together the bus access + work cycles to save time.
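
A sketch of that theory (the update helper is hypothetical):

Code: Select all

//per active channel: the line counter fetch cycle always occurs, folded
//into the 8-clock channel overhead, but the registers only update when
//the channel actually reaches the end of its current line table entry
channel[i].hdma_line_counter--;
if((channel[i].hdma_line_counter & 0x7f) == 0) {
  hdma_update_registers(i);  //hypothetical: reloads $43x8/$43x9/$43xa
}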

Tracking down when each action inside the HDMA occurs will be tricky, as line counter / indirect address fetches only read from the A-bus, and reads from $21xx/~$40-43xx are invalid / return 0x00. So no counter latching for that. But I can get a counter latch by using the actual DMA / HDMA transfers to access $2137/$4201, so we can at least determine the timing of the only accesses that can possibly be observed by software.

Hopefully I've got the above info right this time ...
ZH/Franky

Post by ZH/Franky »

Hey byuu, I read on your errata page about "CPU revision 1 DMA<>HDMA conflict".

If you manage to get this HDMA thing perfect, will you be able to emulate the "CPU revision 1 DMA<>HDMA conflict"?

Not that I want the emulator I'm using to freeze up, but I'm just asking.
byuu

Post by byuu »

Updated the test_hdma.zip archive with a few new additions to test_hdmatiming.

I started inspecting what happens inside the HDMA run event. Very interesting: it actually performs the transfer for each channel first, and only then handles line counter reloading and indirect address fetches. Makes sense if you think about it ... let's say HDMA starts at H=1104+12=1116. The max transfer is 4 bytes/channel. So 1116+8*8*4=1372. Just over the end of the active scanline, but well before the actual rendering starts at H=~22. Had they interleaved the line counter and indirect address loads with the transfers, the last channel wouldn't have written its HDMA data until well into the display of the frame if the HDMA transfer were complex enough (H = 1116+(4+1+2)*8*8 = 1564; 1564-1364 = 200; 200-16 = 184, so the last write would land at H=184.)

Surprisingly, I haven't noticed any real difference in any games after fixing this.

Next, I've written up four quick test runs to get some rough data on the overhead of HDMA during DMA. Looks to be quite nasty. I'm seeing additional overhead of 6-12 cycles. The big thing to note about those values is that they are not multiples of 8. I don't really expect HDMA to misalign the DMA counter, but the numbers are quite bizarre. Obviously the DMA end's CPU sync could create these differences, but I'm having trouble matching it. Maybe it's a very subtle bug in my state machine, not sure.

But I was at least able to verify one thing ... the sync overhead I was using was 8 - dma_counter() + 8, and I was blocking this when an HDMA event occurred during a DMA transfer. Seems the +8 should be moved inside the HDMA run / HDMA init functions as a base overhead no matter what. Without it, I was 14-20 clocks ahead of hardware.

Also, inside of HDMA transfers ... I'm not entirely sure how the bus interleaving works. Obviously 8 cycles/transfer makes for an interesting problem: you don't have the read value immediately; there's some bus overhead before the read value is returned. But then wouldn't that cut into the bus write time? I was thinking maybe it just chained things like <read N, write N-1, read N+1, write N, read N+2, write N+1, ...>, but if that were the case, the final write wouldn't be possible.

I can't test this because I can't latch the counters with a write. Only $4201 with d7 clear can do that, and a B->A transfer with A=$4201 is invalid: the write will not go through.

The only thing I can think of is to use A->B and write to $2118 with the display enabled, and try and get the write right next to active display where writes will be blocked, to try and determine where exactly the write occurs. But that's going to be seriously painful because of all of the DMA sync / CPU sync stuff throwing things off ever so subtly.

For now, it looks like the read bus delay is four clock cycles regardless of source. But again, due to the sync stuff throwing off timing, I could be wrong. Both 4 and 6 will match the timings on hardware, and 4 seems the more logical of the two.

...

We're getting pretty close, though. Get this HDMA during DMA sync right, and refine the bus delays just a tad, and it should be mostly hardware perfect.

As for the HDMA glitch on CPUr1, I don't have any plans to emulate it just yet, but perhaps in the future. I'm now defaulting to CPUr2, so that if and when I do add it, nobody will experience the crash unless they set the CPU to r1 and run some homebrew that triggers it.
neviksti
Lurker
Posts: 122
Joined: Thu Jul 29, 2004 6:15 am

Post by neviksti »

byuu wrote:Also, inside of HDMA transfers ... I'm not entirely sure how the bus interleaving works. Obviously 8 cycles/transfer makes for an interesting problem: you don't have the read value immediately, you have some bus overhead before the read value is returned. But then wouldn't that cut into the bus write time? I was thinking maybe it just chained things like <read N, write N-1, read N+1, write N, read N+2, write N+1, ...>, but if that were the case, the final write wouldn't be possible.
I may be misunderstanding, but are you asking what the Bus A vs Bus B accesses look like? Remember that they share a data bus, so it's not like you need to read, then write the value. You read it, it is there, so you latch it. Writes usually require setup time, but little if any hold time.

Actually, when checking things before making the SPC7110 FIFO mod, I got to see a DMA on the real hardware on a scope. It just holds the /RD line low the whole time for that bus. I guess that makes sense, but I wasn't expecting that for some reason.
byuu

Post by byuu »

Remember that they share a data bus, so it's not like you need to read, then write the value. You read it, it is there, so you latch it. Writes usually require setup time, but little if any hold time.
I really don't understand the hardware level very well, sadly.
Well, it's not like writes could have been delayed a full cycle anyway; the hardware timing won't allow for that possibility. But good to know that's not the case.

My only observations thus far have been that reads from $2137 (FastROM region, 6 cycles) latch the counters 2 clock cycles into the read cycle, and writes to $4201 (FastROM region, 6 cycles) latch the counters 6 clock cycles into the write cycle.

From that, I've extrapolated that SlowROM (8) should be 4 cycles in for read, 8 cycles in for write. And XSlowROM (12) should be 8 cycles in for read, 12 cycles in for write. And thus, DMA running at 8 cycles should be 4 of 8 for read, 8 of 8 for write. It just seems to chain the read and write together into one cycle.

I was kind of surprised that it took 4 cycles for reads from what is otherwise a FastROM region, but DMA is odd in that it makes all accesses 8 clock cycles. Looking at it now, that's probably because it needs all 8 to chain the read+write together.
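
So the extrapolated rule, as a sketch (inferred from only the two FastROM data points above):

Code: Select all

//reads latch the counters 4 clocks before the end of the bus cycle;
//writes latch at the very end of it
unsigned read_latch_offset (unsigned cycle_length) { return cycle_length - 4; }
unsigned write_latch_offset(unsigned cycle_length) { return cycle_length;     }
//FastROM (6):       read at 2, write at 6  (observed)
//SlowROM / DMA (8): read at 4, write at 8  (extrapolated)
//XSlowROM (12):     read at 8, write at 12 (extrapolated)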
byuu

Post by byuu »

So, let's work on HDMA sync during DMA transfer ...

Assume that at V=0,H=100 we have:
DMA channel 0 active, transferring 512 bytes
HDMA channel 1 active, will fetch one line counter with no transfer

V=0,H=100:
- DMASYNC event, DMA counter = 4 -> 8 - 4 = DMA clock align of 4
V=0,H=104:
- DMA base overhead +8
V=0,H=112:
- DMA ch0 setup +8
V=0,H=120:
- DMA ch0 run for 118 bytes / 944 cycles. 120+944=1064
- DRAM refresh in the middle +40, puts us at H=1104
V=0,H=1104:
- HDMA triggers here
- HDMA base overhead +8
V=0,H=1112:
- HDMA ch1 line counter load +8
V=0,H=1120:
- DMA ch0 run for 512-118 (394) bytes / 3152 cycles, two DRAM refreshes occur during this time.
That means DMA counter = 4+8+8+944+8+8+3152=4132 clocks

Now for CPUSYNC, the next cycle will be a nop I/O cycle (6 clocks), so:
6 - (4132 % 6) = 6 - 4 = 2
I don't believe DRAM refresh adds to DMA counter, but even if it did, we'd have 3 lines, 3 DRAM refreshes, 120 cycles, and:
6 - (4252 % 6) = 2 as well. Same result.

But DRAM refresh time does count toward the total, so we end up consuming 4252+2 CPUSYNC cycles, or 4254 cycles.
Add our start offset of H=100, and we get 4354.
V=4354/1364=3
H=4354%1364=262

And that's what I get under emulation. Problem is, on hardware, the correct answer should be:
V=3,H=274

We're 12 clock cycles too short.
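
For reference, the CPUSYNC step used in these breakdowns, as a sketch (function name illustrative):

Code: Select all

//CPU re-synchronization after DMA: pad the DMA clock total up to the next
//multiple of the upcoming CPU cycle's length (6 for a nop I/O cycle).
//Note it pads a full cycle when the total is already aligned.
unsigned cpu_sync(unsigned dma_clocks, unsigned next_cycle_length) {
  return next_cycle_length - (dma_clocks % next_cycle_length);
}
//example from above: cpu_sync(4132, 6) == 2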

Another example:

Assume V=0,H=114, same as before:

V=0,H=114:
- DMA counter = 6
- DMASYNC = 8-6=2
V=0,H=116:
- DMA base +8
V=0,H=124:
- DMA ch0 +8
V=0,H=132:
- Transfer 117 bytes / 936 cycles
- 132+936+40 DRAM refresh=1108
V=0,H=1108:
- HDMA begins
- HDMA setup +8
V=0,H=1116:
- HDMA ch1 line fetch +8
V=0,H=1124:
- Transfer 512-117 (395) bytes / 3160 cycles
- Two DRAM refreshes occur during this time

So then we have:
2+8+8+936+8+8+3160=4130

Same thing with adding DRAM refresh to DMA clock count, doesn't matter:
6-(4130%6)=6-2=4
6-(4250%6)=6-2=4

So then, 4130 + 120 (3x DRAM refresh) + 4 (CPUSYNC) = 4254. Add +114 for the start, and we get 4368.

V=4368/1364=3
H=4368%1364=276

V=3,H=276.

Unfortunately, on hardware we get:
V=3,H=282.

We are six clock cycles too fast in this case.

Lastly, assume V=0,H=128 and V=0,H=144. Same as the last test, only the DMA counter shifts to 0 and 2, respectively. Our CPU sync cycle is still NOP I/O (6 cycles) as before.

The result: emulation is still 6 clocks too fast in both of these cases.

V=3,H=296 and V=3,H=310 for emulation;
V=3,H=302 and V=3,H=316 for hardware.

I can break down these two transfers as above if it will help.

...

Sigh, I didn't want to have to use this, but I don't know of anyone else who can help, and I can't seem to figure it out myself ... so ... sorry in advance :/

[image: a call-for-help signal]
byuu

Post by byuu »

I think I have it. It looks to be a logic bug somewhere in the emulator that was breaking things when I put the dma_add_clocks(8) call inside the HDMA-during-DMA-run call. I'll have to trace that back further.

Observe for the first example above:

DMA overhead = 4132 + CPUSYNC of 2. But to match hardware, we need 4146. 4134 was off by 12. 4132 by 14.

Let's drop CPUSYNC and try adding some DMA clocks:

Code: Select all

+ 2: 6-(4134%6)=6-0=6;  2+6= 8
+ 4: 6-(4136%6)=6-2=4;  4+4= 8
+ 6: 6-(4138%6)=6-4=2;  6+2= 8
+ 8: 6-(4140%6)=6-0=6;  8+6=14*
+10: 6-(4142%6)=6-2=4; 10+4=14*
+12: 6-(4144%6)=6-4=2; 12+2=14*
+14: 6-(4146%6)=6-0=6; 14+6=20
4132+14=4146, matching hardware perfectly.

Now for the second example:

We have DMA overhead = 4130 + CPUSYNC 4. What we need is 4140. We are six off with CPUSYNC added, we are ten off without.

Let's try adding some cycles again.

Code: Select all

+ 2: 6-(4132%6)=6-4=2;  2+2= 4
+ 4: 6-(4134%6)=6-0=6;  4+6=10*
+ 6: 6-(4136%6)=6-2=4;  6+4=10*
+ 8: 6-(4138%6)=6-4=2;  8+2=10*
+10: 6-(4140%6)=6-0=6; 10+6=16
4130 + 10 = 4140. Another hardware match.

Only +8 will match both cases.

And indeed, adding +8 cycles for this edge case matches the other two recorded tests I have, as well. Obviously, we'd want to run a lot more tests: have it hit cases with CPUSYNC cycle length = 8, have it hit with alternating start DMA counters for the DMA and HDMA transfers, and all of that fun stuff ... but this looks pretty solid.

And it makes sense: the DMA unit works in 8-cycle intervals, so any other values would misalign things.

So, cancel the signal I suppose. Now to find out where the emulator logic bug is ... :/
tetsuo55
Regular
Posts: 307
Joined: Sat Mar 04, 2006 3:17 pm

Post by tetsuo55 »

So basically your DMA/HDMA emulation is perfect (as far as you were able to test), but the bug is in another part of the emulation?
byuu

Post by byuu »

Was.

I had:

Code: Select all

cycle_edge() {
  ...
  if(status.hdma_pending) { hdma_run(); status.hdma_pending = false; }
  if(status.dma_pending)  { dma_run();  status.dma_pending  = false; }
  ...
}
dma_run() would call cycle_edge() after every 8 clock cycles. So when you'd hit HDMA during DMA, it would run. But in this case, because I set dma_pending = false; after dma_run(), it ended up calling dma_run() twice.

I'm surprised I was able to get the timing working at all like that. New routine:

Code: Select all

//w420b() will set status.dma_pending to true, and will call cycle_edge() after the write completes.

void cycle_edge() {
  if(status.hdmainit_triggered == false) {
    if(status.hcounter >= status.hdmainit_trigger_position || status.vcounter) {
      status.hdmainit_triggered = true;
      hdma_init_reset();
      if(hdma_enabled_channels()) {
        status.hdma_pending = true;
        status.hdma_mode = 0;
      }
    }
  }

  if(status.hdma_triggered == false) {
    if(status.hcounter >= 1104) {
      status.hdma_triggered = true;
      if(hdma_active_channels()) {
        status.hdma_pending = true;
        status.hdma_mode = 1;
      }
    }
  }

  if(status.dma_state == DMA_Run) {
    if(status.hdma_pending) {
      status.hdma_pending = false;
      if(hdma_enabled_channels()) {
        dma_add_clocks(8 - dma_counter()); //DMA sync
        status.hdma_mode == 0 ? hdma_init() : hdma_run();
        status.dma_state = DMA_CPUsync;
      }
    }

    if(status.dma_pending) {
      status.dma_pending = false;
      if(dma_enabled_channels()) {
        dma_add_clocks(8 - dma_counter()); //DMA sync
        dma_run();
        status.dma_state = DMA_CPUsync;
      }
    }
  }

  if(status.dma_state == DMA_Inactive) {
    if(status.dma_pending || status.hdma_pending) {
      status.dma_clocks = 0;
      status.dma_state = DMA_Run;
    }
  }
}
And for good measure:

Code: Select all

void sCPU::hdma_run() {
  dma_add_clocks(8);

  for(unsigned i = 0; i < 8; i++) {
    if(hdma_active(i) == false) continue;
    channel[i].dma_enabled = false; //HDMA run during DMA will stop DMA mid-transfer

    if(channel[i].hdma_do_transfer) {
      static const unsigned transfer_length[8] = { 1, 2, 2, 4, 4, 4, 2, 4 };
      unsigned length = transfer_length[channel[i].xfermode];
      for(unsigned index = 0; index < length; index++) {
        unsigned addr = !channel[i].hdma_indirect ? hdma_addr(i) : hdma_iaddr(i);
        dma_transfer(channel[i].direction, dma_bbus(i, index), addr);
      }
    }
  }

  for(unsigned i = 0; i < 8; i++) {
    if(hdma_active(i) == false) continue;

    channel[i].hdma_line_counter--;
    channel[i].hdma_do_transfer = bool(channel[i].hdma_line_counter & 0x80);
    if((channel[i].hdma_line_counter & 0x7f) == 0) {
      hdma_update(i);
    } else {
      dma_add_clocks(8);
    }
  }

  counter.set(counter.irq_delay, 2);
}
FitzRoy
Veteran
Posts: 861
Joined: Wed Aug 04, 2004 5:43 pm
Location: Sloop

Post by FitzRoy »

So... now it's perfect? Goody.
franpa
Gecko snack
Posts: 2374
Joined: Sun Aug 21, 2005 11:06 am
Location: Australia, QLD
Contact:

Post by franpa »

Perfect, until byuu finds another error :P
Core i7 920 @ 2.66GHZ | ASUS P6T Motherboard | 8GB DDR3 1600 RAM | Gigabyte Geforce 760 4GB | Windows 10 Pro x64
King Of Chaos
Trooper
Posts: 394
Joined: Mon Feb 20, 2006 3:11 am
Location: Space

Post by King Of Chaos »

franpa wrote:Perfect, until byuu finds another error :P
Gee, thanks for casting a black cloud over the good news. Good work as always Byuu. :)
Kega Fusion Supporter (http://www.eidolons-inn.net/tiki-index.php?page=Kega) | bsnes Supporter (http://byuu.cinnamonpirate.com/) | Regen Supporter (http://aamirm.hacking-cult.org/)
Verdauga Greeneyes
Regular
Posts: 347
Joined: Tue Mar 07, 2006 10:32 am
Location: The Netherlands

Post by Verdauga Greeneyes »

Excellent work byuu, it's been a joy to watch your progress as usual :)
To make franpa's statement into a somewhat more reasonable question: do you have anything left to verify about this - or is there anything you're not sure about because you don't have the means to check it out?
krom
Rookie
Posts: 13
Joined: Sat Sep 29, 2007 4:08 am
Contact:

Well done

Post by krom »

Hi byuu,
Well done in ironing out that Mecarobot bug; I am really impressed with the new WIP of bsnes, it is definitely releasable in my view :)
Cheers to the rest of the community, especially neviksti, for helping make bsnes even better.
byuu

Post by byuu »

Let's see ...

The edge cases where DMA ends right as HDMA begins and vice versa will work as expected on CPUr2. When one ends, the state transitions to CPUSYNC. And the bus-cycle edge will be called before the next H/DMA pending flag can be set, so it will be inactive and waiting, and everything will work fine.

But obviously it's not supposed to work on CPUr1. That will be tricky to even detect, and I really don't think anyone wants the emulator to die completely when this happens :P

I should also write a quick test to latch counters during a regular DMA for channels 0 and 7 with all eight channels active, just to definitively say where the per-channel overhead is located.

After that, not really. That should be it for H/DMA.

The major units left are the ALU (prevent reading mul/div registers early), auto joypad-polling ("prevent" reading $4016/$4017 controller port bits while the SNES is doing it in the background -- in truth, it just throws everything off), and NMI during IRQ edge cases. The first two will require noticeable speed penalties to support.

It'd be nice to come up with a better explanation for that one-cycle delay thing that happens after H/DMA before an IRQ can trigger, but it's supported now at any rate.
mozz
Hazed
Posts: 56
Joined: Mon Oct 10, 2005 3:12 pm
Location: Montreal, QC

Post by mozz »

byuu wrote:The major units left are the ALU (prevent reading mul/div registers early), auto joypad-polling ("prevent" reading $4016/$4017 controller port bits while the SNES is doing it in the background -- in truth, it just throws everything off), and NMI during IRQ edge cases. The first two will require noticeable speed penalties to support.
Is the hit for that really that big? I thought you would just need to store the current cycle counter when you start a mul/div or auto-joypad-poll, then when reading the affected registers, compare it with the current counter to make sure enough time has passed. If it hasn't, maybe you have to use some slow-but-precise algorithm to emulate exactly what the result would be. But if enough time has passed, you can just return the proper final value.

Though perhaps it is actually more complicated than this for some reason (I don't know, you are a lot more familiar with SNES hardware than I am).
byuu

Post by byuu »

What I did last time was store the H/V counters when the mul / div began, and each time either the result regs were read, or a full scanline had been emulated, I'd adjust the counters there.

It was still somewhat noticeable, by ~2% or so. Same as I get with the same setup for my DMA counter. Those loops are just so sensitive to any changes.

I can't do the same for the joypad polling. If it polls every 8 clocks and is done automatically, I will need something like if(auto_joypad_poll_active == true) inside the main add_clocks() loop. That will be much more painful.

I was thinking of just merging it all with something like if(anything_active == true) and inside that, I can check all the counters for math, joypad, etc. That should make the speed hit for both joypadpoll + ALU the same (both are rarely active), and keep the code for running them as flexible as possible.
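
Roughly like this (a sketch; the flags and helpers are hypothetical):

Code: Select all

void sCPU::add_clocks(unsigned clocks) {
  //... existing clock stepping and event checks ...
  if(status.anything_active == true) {  //rarely true: only while ALU math or auto joypad polling runs
    if(status.alu_active)    alu_step(clocks);     //hypothetical: advance mul/div state
    if(status.joypad_active) joypad_step(clocks);  //hypothetical: advance $4016/$4017 polling
  }
}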

---

Nach seems really interested too, maybe I should try getting some logs again. The biggest problem is I don't even know where to begin: does the ALU tick per clock cycle, per opcode cycle, every N clock cycles, every opcode ... ? And I also cannot read arbitrary numbers of cycles after each execution. I'll be limited to numbers like +26, +34, +38, +40, +44, +48, (now I can do +2 steps every time) due to there being no way to immediately read after a write (and no, DMA cannot touch the mul / div regs.)
grinvader
ZSNES Shake Shake Prinny
Posts: 5632
Joined: Wed Jul 28, 2004 4:15 pm
Location: PAL50, dood !

Post by grinvader »

That granularity should be fine to learn more about mul/div "intermediate" results and get a good idea how long it takes.
Just remember to test with various multiplier values (to indicate early-out algo).
Everyone shut up and follow me!!

Code: Select all

<jmr> bsnes has the most accurate wiki page but it takes forever to load (or something)
Pantheon: Gideon Zhi | CaitSith2 | Nach | kode54
mozz
Hazed
Posts: 56
Joined: Mon Oct 10, 2005 3:12 pm
Location: Montreal, QC

Post by mozz »

byuu wrote:What I did last time was store the H/V counters when the mul / div began, and each time either the result regs were read, or a full scanline had been emulated, I'd adjust the counters there.

It was still somewhat noticeable, by ~2% or so. Same as I get with the same setup for my DMA counter. Those loops are just so sensitive to any changes.

I can't do the same for the joypad polling. If it polls every 8 clocks and is done automatically, I will need something like if(auto_joypad_poll_active == true) inside the main add_clocks() loop. That will be much more painful.

I was thinking of just merging it all with something like if(anything_active == true) and inside that, I can check all the counters for math, joypad, etc. That should make the speed hit for both joypadpoll + ALU the same (both are rarely active), and keep the code for running them as flexible as possible.

---

Nach seems really interested too, maybe I should try getting some logs again. The biggest problem is I don't even know where to begin: does the ALU tick per clock cycle, per opcode cycle, every N clock cycles, every opcode ... ? And I also cannot read arbitrary numbers of cycles after each execution. I'll be limited to numbers like +26, +34, +38, +40, +44, +48, (now I can do +2 steps every time) due to there being no way to immediately read after a write (and no, DMA cannot touch the mul / div regs.)
It sounds as if you intend to emulate these things as a state machine, running as part of the main CPU loop? That sounds like it might be marginally simpler than doing a "catch up" sort of thing, but also a lot slower.

Do a "catch up" implementation for mul/div. First the game writes some registers to start the mul, and you store the cycle counter at that time. Later, when the game tries to read a register to get the mul result, then you check the cycle counter, and only then simulate what the mul/div hardware was doing and produce the desired intermediate (or final) result. 99+% of the time the game is not going to read the result before its ready anyways, so you can produce the "correct" answer like you do now. Only if they read before the result is ready, do you need to do any per-cycle intermediate stuff.

You only need to "catch up" the simulation of something when its state is visible to other parts of the system (e.g. when they are trying to read from it / interact with it). The mul/div hardware is essentially invisible except when the program reads from, or writes to, its registers. So it doesn't even need to be simulated except at those times. I know you know all this already; it sounds like you are intending to do it in a slower way for some reason. :lol:
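
Something like this, roughly (a sketch; all names are hypothetical, and the 48-cycle figure is a placeholder rather than a measured value):

Code: Select all

#include <cstdint>

struct MulState {
  unsigned a, b;    //operands written to $4202/$4203
  uint64_t start;   //cycle counter at the time of the $4203 write
  bool busy;
};

unsigned mul_partial(const MulState &m, uint64_t elapsed);  //slow path: per-cycle simulation

//on the $4203 write: just record the operands and the current cycle counter
void mul_start(MulState &m, unsigned a, unsigned b, uint64_t now) {
  m.a = a; m.b = b; m.start = now; m.busy = true;
}

//on a $4216/$4217 read: catch up only if the game reads early
unsigned mul_read(MulState &m, uint64_t now) {
  const uint64_t kMulCycles = 48;  //placeholder; the real figure is what needs measuring
  if(m.busy == false || now - m.start >= kMulCycles) {
    m.busy = false;
    return m.a * m.b;              //final result: the common, fast case
  }
  return mul_partial(m, now - m.start);  //rare: simulate the intermediate value
}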