wip05 discussion

Archived bsnes development news, feature requests and bug reports. Forum is now located at http://board.byuu.org/
Locked
byuu

wip05 discussion

Post by byuu »

<edit: removed outdated WIP>

Delta queue support

First up, I've added a binary min-heap delta queue. I converted all events except IRQ/NMI test and hold. If we can convert those to use the delta queue, there should be a speedup of 30-40% or so -- pretty much the biggest low-hanging fruit there is. It's also the thing that plagued me for 12-18 months before the major speed hit in v0.018, when I gave up and went with testing IRQ/NMI on every single clock tick.

But it won't be easy: the delta queue works by adding an event when you know it's going to trigger. But we cannot know whether an IRQ or NMI will trigger until we're at the current time. One can literally disable or change these interrupts two clocks before they occur, which would leave a stale trigger event in our queue.

IRQ/NMI hold also needs to be scheduled exactly four clocks after the IRQ/NMI trigger. Unless we queue these at least ~16 clocks in advance of the trigger, we may not be able to trip them exactly when needed.

Since the test and hold run in the same inner loop, either before or after the delta queue time update, we can't just enqueue the hold and not the test.

So, in the WIP I've included my insanely rigorous test ROMs for IRQ, NMI and HDMA timing, and I'm asking for help. If anyone could please help merge sCPU::add_clocks() IRQ testing into the delta queue, I'd be greatly in their debt.

Relevant code is at src/cpu/scpu/timing/[timing.cpp, irq.cpp] and src/cpu/scpu/deltaqueue.cpp.

I'll be working on it as well, of course.

Note: removing events not at the top of the heap is not supported. If this is needed, it would probably be best to do an O(n) search for the event and overwrite its event code with 0 (meaning ignored), rather than try to pull the event out and renormalize the heap. IRQ/NMI hold edge cases are very rare, so O(n) time shouldn't hurt speed.
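As a rough sketch of that invalidation approach (the identifiers here are illustrative stand-ins, not the actual bsnes names):

```cpp
enum : unsigned { EventNone = 0, EventIrqHold = 1, EventDramRefresh = 2 };

struct Node { unsigned counter; unsigned event; };

//illustrative heap storage; the real layout may differ
Node heap[32];
unsigned heapsize = 0;

//O(n) scan: blank out the event code in place instead of removing the
//node and re-heapifying. dequeue() then simply ignores EventNone entries.
void invalidate(unsigned event) {
  for(unsigned i = 0; i < heapsize; i++) {
    if(heap[i].event == event) heap[i].event = EventNone;
  }
}
```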

ALU delay

Since there's no speed hit anymore, I added back hardware ALU (mul / div) delays. While we still don't emulate the proper partial calculation results, we should at least return 0 when reading too soon.

The exact delay varies based upon the calculation, however. We ran into problems with Taz-Mania in the past. So for this WIP only, I've added settings to the advanced panel:
"temp.alu_mul_delay" + "temp.alu_div_delay"

The value has to be a multiple of 2 (2, 4, ... 32, 34, ...), and the goal is to find the highest possible value that will not cause any bugs in games.

What I'm asking is for people to just set the value to something and test a few games. If you spot a bug that's not in v038, try lowering the value until it goes away. Then post the values here. We'll keep lowering the current number until we find the best setting for future releases.

Let's start with really high values that will definitely cause bugs:
ALU mul delay = 104
ALU div delay = 208

For example, pick any game ... say Zelda 3. Note how the triforces won't render now. Lower the value until it works, post what numbers you needed here plus the game name. Then everyone will use those values and test other games. Rinse and repeat.

Important note: you have to reset the game after changing these values in the GUI for them to take effect.

Fullscreen on startup

I've added "video.start_in_fullscreen_mode". Because there's no way to exit other than a keyboard shortcut, I've unhidden the "Exit" option for now. We can discuss the UI design in the main v038 talk thread; in this thread, please stick to mentioning any bugs you hit with it.

Thanks to all in advance for any help here!
Last edited by byuu on Sat Jan 03, 2009 3:04 pm, edited 1 time in total.
Verdauga Greeneyes
Regular
Posts: 347
Joined: Tue Mar 07, 2006 10:32 am
Location: The Netherlands

Re: v0.038 wip5 public release -- help needed

Post by Verdauga Greeneyes »

Without looking at the code (sorry), if the IRQ/NMI test and hold events can be disabled, why not just add a flag to the queue to determine whether a queued event should actually happen?

Edit: I think you just added a note to cover this :P I guess a linear search would be a pain..

Edit2: alternatively, to keep track of events that can be disabled, queue them with an optional pointer to a boolean. Check if the pointer is non-zero and if so, check its value to see if it should be executed. That way whatever can disable the IRQ/NMI can directly influence it without needing to do a linear search.
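A minimal sketch of that idea (names hypothetical): each queued event carries an optional pointer to its enable flag, which is only checked at dispatch time:

```cpp
#include <cstddef>

struct Event {
  unsigned counter;
  unsigned id;
  const bool *enabled;  //NULL means the event is unconditional
};

//Dispatch-time check: whatever toggles the IRQ/NMI enable bit writes the
//bool directly; the queue never needs a linear search to find the event.
inline bool should_fire(const Event &e) {
  return e.enabled == NULL || *e.enabled;
}
```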
tetsuo55
Regular
Posts: 307
Joined: Sat Mar 04, 2006 3:17 pm

Post by tetsuo55 »

How do the tests work?

Does a blue screen mean the test was successful? What about a black screen?

It could be me, but I have the feeling that the controls are a lot more responsive.
byuu

Post by byuu »

blue = pass; red = fail; black = epic fail

They all pass with the WIP. I only included the tests to check for regressions if anyone attempts delta queue interrupts. Note that some tests will take a long time (maybe 2-3 minutes), notably irq.smc, nmi.smc and test_hdmasync.smc.

Controls shouldn't be affected.
Edit: I think you just added a note to cover this :P I guess a linear search would be a pain..
Maybe not. I don't expect IRQ / NMI to be toggled too often, and there won't be more than 1-5 events in queue at any point anyway.

I can handle the IRQ / NMI trigger easily enough ... just go through and invalidate any pending requests whenever you change the interrupt enable or counter position registers. It's the 4-clock hold immediately after that makes things especially tricky.
ecst
Rookie
Posts: 13
Joined: Fri Nov 14, 2008 1:01 am

Post by ecst »

Just a quick note on the priority queue (actually, there is nothing left of any deltas) implementation (I know this is anything but final): you need to use the sliding compare not just in the tick function, but everywhere you compare absolute cycle counts ("heap[child].counter >= heap[parent].counter", "heap[child + 1].counter < heap[child].counter", "heap[parent].counter <= heap[child].counter"). You would also want to use the more rigorous method mentioned by blargg; there have been some real-world cases affected by this.

Also, I do not understand the significance of allocating a double-sized ("queuesize << 1") array and setting unused counter values to ~0, but I guess it is just work in progress, so please ignore this ...
byuu

Post by byuu »

Damnit, seems irq.smc and nmi.smc are failing. Could've sworn they were working before I finished last night. Ah well.

Really not looking good for queuing the interrupt events. It's getting crazy complicated with a half-dozen edge cases failing with my initial attempts. Trying to queue just the hold release to simplify the poll_interrupts() phase requires single-stepping the delta queue increment along with the interrupt polling loop, which ends up slowing things down rather than speeding them up.

Exceptionally annoying is that the copy of MinGW4 at work builds binaries from the exact same source base that are ~20% slower than those built by the copy I have at home.
Just a quick note on the priority queue (actually, there is nothing left of any deltas) implementation
Damn naming conventions :(
Delta sounded cooler.
You need to use the sliding compare not just in the tick function, but everywhere you compare absolute cycle counts
Bah, screw that. I'll just use a normalize function instead. Call it once a second or so.

Code: Select all

void normalize() {
  for(unsigned i = 0; i < heapsize; i++) heap[i].counter -= counter;
  counter = 0;
}
Also, I do not understand the significance of allocating a doubly sized ("queuesize << 1") array and setting unused counter values to ~0
A cheap trick. Instead of testing every position during each step of dequeue() to make sure it's valid, I just allow it to test both spots and go down one extra layer. It'll hit ~0, which will always be greater than everything else and thus never get swapped.

Then again, both with normalize and without it, this would eventually cause issues. I'll work on it, then.
tetsuo55
Regular
Posts: 307
Joined: Sat Mar 04, 2006 3:17 pm

Post by tetsuo55 »

test_hdmasync also fails with a black screen
Jonas Quinn
ZSNES Developer
Posts: 115
Joined: Thu Jul 29, 2004 9:51 pm
Location: Germany

Post by Jonas Quinn »

tetsuo55 wrote:test_hdmasync also fails with a black screen
You just have to wait a bit longer.
tetsuo55
Regular
Posts: 307
Joined: Sat Mar 04, 2006 3:17 pm

Post by tetsuo55 »

Jonas Quinn wrote:
tetsuo55 wrote:test_hdmasync also fails with a black screen
You just have to wait a bit longer.
I thought so. I waited even longer than before, and I saw blue.
byuu

Post by byuu »

Ah, fixed irq.smc and nmi.smc ... wasn't queueing the very first DRAM refresh upon power-on.

src/cpu/scpu/timing/timing.cpp:timing_reset():

Code: Select all

  status.dram_refresh_position = (cpu_version == 1) ? 530 : 538;
  delta.enqueue(EventDramRefresh, status.dram_refresh_position);  //add this
tetsuo55
Regular
Posts: 307
Joined: Sat Mar 04, 2006 3:17 pm

Post by tetsuo55 »

byuu wrote:Ah, fixed irq.smc and nmi.smc ... wasn't queueing the very first DRAM refresh upon power-on.

src/cpu/scpu/timing/timing.cpp:timing_reset():

Code: Select all

  status.dram_refresh_position = (cpu_version == 1) ? 530 : 538;
  delta.enqueue(EventDramRefresh, status.dram_refresh_position);  //add this
Did you update the wip?
ecst
Rookie
Posts: 13
Joined: Fri Nov 14, 2008 1:01 am

Post by ecst »

byuu wrote: Bah, screw that. I'll just use a normalize function instead. Call it once a second or so.

Code: Select all

void normalize() {
  for(unsigned i = 0; i < heapsize; i++) heap[i].counter -= counter;
  counter = 0;
}
Wait, do not be so unidealistic! That normalizing is just plainly ugly. Why not inline the function blargg proposed:

Code: Select all

alwaysinline bool less_than(unsigned x, unsigned y) {
  return x - y > UINT_MAX / 2;
}
I tested with gcc and, as expected, the only difference on x86 compared to the naive x < y is the use of js instead of jb.
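For anyone following along, here is the wraparound behavior this compare buys you (with the usual caveat that two live events must never sit more than half the counter range apart):

```cpp
#include <climits>

//blargg's sliding compare: x sorts before y iff the unsigned difference
//wraps into the upper half of the range.
inline bool less_than(unsigned x, unsigned y) {
  return x - y > UINT_MAX / 2;
}

//Ordering survives the raw counter wrapping past UINT_MAX:
//  unsigned a = UINT_MAX - 10;
//  unsigned b = a + 20;     //wraps around to 9
//  less_than(a, b)          //true:  a still sorts before b
//  less_than(b, a)          //false
```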
lscharen
New Member
Posts: 7
Joined: Tue Dec 30, 2008 11:37 pm

Post by lscharen »

ecst wrote:
byuu wrote: Bah, screw that. I'll just use a normalize function instead. Call it once a second or so.

Code: Select all

void normalize() {
  for(unsigned i = 0; i < heapsize; i++) heap[i].counter -= counter;
  counter = 0;
}
Wait, do not be so unidealistic! That normalizing is just plainly ugly. Why not inline the function blargg proposed:

Code: Select all

alwaysinline bool less_than(unsigned x, unsigned y) {
  return x - y > UINT_MAX / 2;
}
I tested with gcc and, as expected, the only difference on x86 compared to the naive x < y is the use of js instead of jb.
Also, blargg's code had a nice optimization that replaced swaps in the enqueue() and dequeue() methods with a single assignment. This would cut the number of memory accesses in half.
byuu

Post by byuu »

Dropped the double queue size and ~0 stuff. Looks kind of messy but it works.

Code: Select all

  void dequeue() {
    swap(0, --heapsize);  //yes, yes; doesn't have to be a swap. i'll optimize it before release

    unsigned parent = 0;
    while(true) {
      unsigned child = (parent << 1) + 1;
      if(child >= heapsize) break;
      if(child + 1 < heapsize && heap[child + 1].counter < heap[child].counter) child++;
      if(heap[parent].counter <= heap[child].counter) break;
      swap(parent, child);
      parent = child;
    }
  }
Wait, do not be so unidealistic! That normalizing is just plainly ugly.
What's ugly about it? "Premature optimization is the root of all evil." If the function is called once a second on an array that never has more than 10 elements, I'd be surprised if it consumed more than 0.001% CPU time. And now you don't have to call less_than(x, y) where it's not immediately obvious whether you're testing x < y or y < x, or rely on 2's complement (not that anything modern doesn't use that) ... so does it really matter?
Also, blargg's code had a nice optimization that replaced swaps in the enqueue() and dequeue() methods with a single assignment. This would cut the number of memory accesses in half.
Ah, keep one node local and then write it out at the very end. Very clever :)
lscharen
New Member
Posts: 7
Joined: Tue Dec 30, 2008 11:37 pm

Post by lscharen »

byuu wrote:
Also, blargg's code had a nice optimization that replaced swaps in the enqueue() and dequeue() methods with a single assignment.
I don't see how that's possible, but I only had a cursory glance. I'll go look again.
I have some comments in the code I posted explaining it. Basically, you just keep a copy of the item you're moving up or down the heap and insert it once at the end of the loop, rather than moving it one level at a time by swapping. It's the difference between these two ways of moving the first item of an array to the end.

Code: Select all

void move_by_swapping(int *array, unsigned int n)
{
   unsigned int i;

   for ( i = 0; i < n - 1; i++ )
      swap( &array[i], &array[i+1] );
}

void move_by_insertion(int *array, unsigned int n)
{
   unsigned int i;
   int temp = array[0];

   for ( i = 0; i < n - 1; i++ )
      array[i] = array[i+1];

   array[n-1] = temp;
}
Edit: But I see you figured it out already. :)

Edit2: If the priority queue is a real bottleneck for you, the C code I posted in the other thread implements the heap using indexes to (event, counter) records. This lets you update the heap by just copying integers rather than (event, counter) structs....but you could probably pack both of those values into a single integer anyway. Your way is probably more cache-friendly, too.
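A hedged sketch of that packing idea (assuming the event id fits in 8 bits and counters fit in the remaining 24): put the counter in the high bits, so a plain integer compare still orders the heap by time first:

```cpp
#include <cstdint>

//Pack (counter, event) into one 32-bit word: counter in the high 24 bits,
//event id in the low 8. Comparing packed words orders by counter first.
inline uint32_t pack(uint32_t counter, uint32_t event) {
  return (counter << 8) | (event & 0xff);
}

inline uint32_t packed_counter(uint32_t p) { return p >> 8; }
inline uint32_t packed_event(uint32_t p) { return p & 0xff; }
```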
ecst
Rookie
Posts: 13
Joined: Fri Nov 14, 2008 1:01 am

Post by ecst »

byuu wrote: What's ugly about it? "Premature optimization is the root of all evil." If the function is called once a second on an array that never has more than 10 elements, I'd be surprised if it consumed more than 0.001% CPU time.
I was not speaking of speed but rather of conceptual elegance ... but granted, this is a matter of taste. The comment about x86 originated from my mistaken assumption that speed was the hindering issue.
byuu wrote: or rely on 2's complement (not that anything modern doesn't use that)
It does not rely on it.
byuu wrote: And now you don't have to call less_than(x, y) where it's not immediately obvious whether you're testing x < y or y < x
Ah, you rotten child! Spoiled by the syntactic sugar of infix operators.
byuu wrote: ... so does it really matter?
Of course not. It is just fun wasting tremendous amounts of time arguing about abstract nonsense.
Verdauga Greeneyes
Regular
Posts: 347
Joined: Tue Mar 07, 2006 10:32 am
Location: The Netherlands

Post by Verdauga Greeneyes »

ecst wrote:It does not rely on it.

Code: Select all

(int) (x - y) < 0; 
relies on it ...

Code: Select all

x - y > UINT_MAX / 2; 
does not.
blargg
Regular
Posts: 327
Joined: Thu Jun 30, 2005 1:54 pm
Location: USA
Contact:

Post by blargg »

ecst wrote:
byuu wrote: What's ugly about it? "Premature optimization is the root of all evil." If the function is called once a second on an array that never has more than 10 elements, I'd be surprised if it consumed more than 0.001% CPU time.
I was not speaking of speed but rather of conceptual elegance
To elaborate on this, using a special compare eliminates the need for a separate mechanism to periodically normalize times. In some ways, the special compare is less complex. But unless you're using a class that overloads < and -, it's more error-prone since you could easily forget to use the special compare/subtract.
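A minimal sketch of such a class (the name is illustrative), so the sliding compare can't be forgotten at any call site:

```cpp
#include <climits>

//Wraps the cycle counter so every comparison and subtraction goes through
//the sliding compare automatically.
struct Time {
  unsigned t;
  bool operator<(Time other) const { return t - other.t > UINT_MAX / 2; }
  unsigned operator-(Time other) const { return t - other.t; }
};
```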
byuu wrote:or rely on 2's complement (not that anything modern doesn't use that)
ecst wrote:It does not rely on it.
Verdauga Greeneyes wrote:

Code: Select all

(int) (x - y) < 0; 
relies on it ...

Code: Select all

x - y > UINT_MAX / 2; 
does not.
In fact, '(int) (x - y)' relies on the compiler's implementation-defined conversion (C++03 §4.7 para 3) from an unsigned value greater than INT_MAX to a signed type.
byuu

Post by byuu »

blargg wrote:To elaborate on this, using a special compare eliminates the need for a separate mechanism to periodically normalize times. In some ways, the special compare is less complex. But unless you're using a class that overloads < and -, it's more error-prone since you could easily forget to use the special compare/subtract.
One could implement the normalize function semi-transparently. So long as your app doesn't look at the queue itself, you can add an internal event upon startup that will call normalize() and then re-add itself.
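A rough sketch of that semi-transparent approach (all identifiers here are illustrative stand-ins for the real queue internals): the queue enqueues a private event at startup which, when dispatched, normalizes and then re-enqueues itself:

```cpp
//Sketch only: EventNormalize, NormalizePeriod, heap, heapsize, counter and
//enqueue() stand in for the real queue internals.
enum : unsigned { EventNormalize = 0xffffffff };
const unsigned NormalizePeriod = 21477272;  //~1 second of SNES master clocks (assumed rate)

struct Node { unsigned counter; unsigned event; };
Node heap[32];
unsigned heapsize = 0;
unsigned counter = 0;  //current absolute time

//Trivial append; the real enqueue() would heapify.
void enqueue(unsigned event, unsigned offset) {
  heap[heapsize].counter = counter + offset;
  heap[heapsize].event = event;
  heapsize++;
}

void dispatch(unsigned event) {
  if(event == EventNormalize) {
    for(unsigned i = 0; i < heapsize; i++) heap[i].counter -= counter;
    counter = 0;
    enqueue(EventNormalize, NormalizePeriod);  //re-adds itself
    return;
  }
  //...hand real events to the client callback here...
}
```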

Of course, your method is really the best for ease of use; and I'd add it immediately if I were making a generic delta / priority / whatever-the-hell-it-is class.

The problem I have with a fully generic one is implementing the callback for when events trigger. Damn C++ templates won't let you pass a generic member-function pointer as a template parameter (for obvious reasons), and without that we'd take a speed hit from calling a functor for each event.

EDIT: well, I suppose I can do something like:

Code: Select all

template<typename T> class queue {
  T *object;
  void (T::*callback)(unsigned);
public:
  queue(T *object_, void (T::*callback_)(unsigned)) : object(object_), callback(callback_) {}
};

class CPU {
  queue<CPU> q;
  void callback(unsigned);
public:
  CPU() : q(this, &CPU::callback) {}
};
Still a slight speed hit compared to direct function calls w/possible inlining.
blargg wrote:In fact, '(int) (x - y)' relies on the compiler's implementation-defined conversion (C++03 $4.3 para 3) from an unsigned to signed that's greater than INT_MAX.
I really wish they would make a subset of C++ that fills in all the undefined-behavior parts like that. E.g. ">> is always arithmetic, numbers are always 2's complement, char is always exactly 8 bits", etc.

Better to take the speed hit for non-native work than to have your app break entirely, or to try to cater to PDP-11s x.x
Locked