Delta queue design

Archived bsnes development news, feature requests and bug reports. Forum is now located at http://board.byuu.org/
byuu

Post by byuu »

DRAM refresh, IRQ and NMI pins can trigger at any clock cycle, even in the middle of an opcode cycle read.

DMA, HDMA and HDMA init can trigger on the edge of an opcode cycle (tested every ~8 master cycles).

IRQ and NMI can generate an actual interrupt on the edge of a full opcode (every ~32 master cycles).

I do not know how often the ALU and joypad polling tick, as we don't emulate those properly yet.

My thought was to have the per-clock stuff perform whatever event it needs immediately, and to have the rest set flags in a variable, one bit per event type; so for cycle_edge we'd have:

while(event_cycleedge) {
  switch(event_cycleedge & -event_cycleedge) {  //dispatch the lowest pending event bit
    ...
  }
  event_cycleedge &= event_cycleedge - 1;  //clear the lowest set bit
}
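Fleshed out as a standalone sketch (the flag names here are hypothetical, not bsnes code), the bit-loop dispatches each pending edge event exactly once, lowest bit first:

```cpp
#include <cassert>
#include <vector>

// Standalone sketch of the cycle_edge dispatch loop above; the flag values
// are made up for illustration. Each set bit marks one pending event, and
// the two bit tricks walk them from lowest to highest.
enum : unsigned { EventDma = 1, EventHdma = 2, EventHdmaInit = 4 };

std::vector<unsigned> run_cycle_edge(unsigned event_cycleedge) {
  std::vector<unsigned> handled;
  while(event_cycleedge) {
    switch(event_cycleedge & -event_cycleedge) {  //isolate lowest set bit
      case EventDma:      handled.push_back(EventDma);      break;
      case EventHdma:     handled.push_back(EventHdma);     break;
      case EventHdmaInit: handled.push_back(EventHdmaInit); break;
    }
    event_cycleedge &= event_cycleedge - 1;       //clear lowest set bit
  }
  return handled;
}
```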

Then there's just the final edge case of handling events that occur N clocks after a previous event.

For instance, when IRQ trips on a clock boundary, it holds the pin high for 6 clocks. If the program reads the "test and clear pending IRQ" register at +0 or +2 clocks into that event, the read returns 1 and cancels the IRQ. At +4 clocks in, it still returns 1, but the IRQ triggers anyway and the bit stays set. From +6 on, the IRQ triggers and the next read returns 0, as expected.

I would need a way to determine exactly where an event occurred, not just that it had, to schedule the future IRQ pin events. Or maybe I can just insert an event for IRQ+0, IRQ+2, IRQ+6 and IRQ+8 to manipulate the behavior.
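That second idea can be sketched with a relative-time queue. This is a minimal illustration only: the queue type and function name are mine, while the offsets are the IRQ+0/+2/+6/+8 values mentioned above.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <vector>

// Hypothetical sketch of the "insert one event per offset" idea: when the
// IRQ condition trips at 'now', enqueue a pin event at each offset; the
// handlers would then raise, hold, and release the pin accordingly.
// std::priority_queue is a stand-in, not bsnes code.
using MinHeap =
  std::priority_queue<unsigned, std::vector<unsigned>, std::greater<unsigned>>;

void schedule_irq_pin_events(MinHeap& queue, unsigned now) {
  for(unsigned offset : {0u, 2u, 6u, 8u})  //IRQ+0, IRQ+2, IRQ+6, IRQ+8
    queue.push(now + offset);
}
```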
Verdauga Greeneyes
Regular
Posts: 347
Joined: Tue Mar 07, 2006 10:32 am
Location: The Netherlands

Post by Verdauga Greeneyes »

Assuming a uniform distribution and a bidirectional list, the following should be a bit faster (it got a bit long, so I compacted it into for loops; I hope you don't mind):

Code:

void Delta_Queue::insert( int time )
{
    events_t::iterator p;
    if ( time >= last )
    {
        // new latest event: append its delta from the previous last event
        events.push_back( time - last );
        last = time;
    }
    else
    {
        if ( time < last / 2 )
        {
            // scan forward from the front, consuming deltas as we pass them
            for(p = events.begin(); time > *p; time -= *p++);
        }
        else
        {
            // scan backward from the back, accumulating deltas until we pass time
            int temp = last;
            for(p = events.end() - 1; time < temp; temp -= *p--);
            p++;
            time -= temp;
        }
        *p -= time;  // shrink the following delta so later times stay intact
        events.insert( p, time );
    }
}
Edit: note this does not guarantee the order of simultaneous events, though it could with minimal modifications. It just stops as soon as possible.
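As a complement, here is a minimal sketch (mine, not from the post above) of how such a delta list is consumed: because each entry stores only the gap to its predecessor, advancing time never touches anything but the head.

```cpp
#include <cassert>
#include <list>

// Minimal consumer for a delta list like the one above: each entry is the
// gap to the previous event, so advancing the clock only pops the head.
// (This sketch is mine, not part of the post it follows.)
struct DeltaList {
  std::list<int> events;  //deltas between consecutive events
  int fired = 0;          //how many events have come due

  //advance the clock by 'ticks', counting every event that comes due
  void tick(int ticks) {
    while(!events.empty() && events.front() <= ticks) {
      ticks -= events.front();
      events.pop_front();
      fired++;
    }
    if(!events.empty()) events.front() -= ticks;
  }
};
```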
blargg
Regular
Posts: 327
Joined: Thu Jun 30, 2005 1:54 pm
Location: USA
Contact:

Post by blargg »

Verdauga Greeneyes, nifty. You can write the backwards loop a tad more compactly (and avoid events.end() - 1, which requires a random-access iterator; mine didn't have one, since it actually used a list):

Code:

int temp = last;
for(p = events.end(); time < temp; temp -= *--p);
time -= temp;
Here's a version using an array-based heap and absolute times, with no need to periodically adjust them, as overflow doesn't cause any problems. Full code + exerciser: Delta_Queue_2.cpp

Absolute times seem superior to me. The demo includes less-than and subtraction functions that handle the case where one time has wrapped around but the other hasn't, so you have all the operations. You could wrap them in a class and overload < and - so that they are completely transparent to use.
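The idea can be sketched as follows (the function names are mine; the real versions live in Delta_Queue_2.cpp): two times compare correctly as long as they are less than half the counter range apart, because the sign bit of the wrapped difference tells you which came first.

```cpp
#include <cassert>
#include <cstdint>

// Sketch of wrap-safe time comparison and subtraction (names are mine; see
// Delta_Queue_2.cpp for the real versions). The unsigned difference is
// computed first, then its sign bit is read; the cast to int32_t is
// implementation-defined before C++20 but two's complement in practice.
bool time_less(uint32_t a, uint32_t b) {
  return (int32_t)(a - b) < 0;  //sign test on the wrapped difference
}

uint32_t time_sub(uint32_t a, uint32_t b) {
  return a - b;  //unsigned wraparound yields the true distance
}
```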
lscharen
New Member
Posts: 7
Joined: Tue Dec 30, 2008 11:37 pm

Post by lscharen »

blargg wrote: Here's a version using an array-based heap, absolute times, and no need to periodically adjust them, as overflow doesn't cause any problems. Full code + exerciser: Delta_Queue_2.cpp
I threw together a C implementation of the linked-list and heap-based versions. The heap does not coalesce events by default, but it could easily be modified to do so. Code is minimally tested.

Heap and Linked List Delta Queues.

Enjoy!

Edit:

I updated my code to incorporate the optimizations in the code blargg posted -- which are very nice, btw. Credit given where credit is due.
Last edited by lscharen on Fri Jan 02, 2009 7:14 pm, edited 1 time in total.
Verdauga Greeneyes
Regular
Posts: 347
Joined: Tue Mar 07, 2006 10:32 am
Location: The Netherlands

Post by Verdauga Greeneyes »

blargg wrote:Verdauga Greeneyes, nifty. You can write the backwards loop a tad more compactly (and avoid events.end() - 1, which requires that it be a random access iterator, which it wasn't in mine which actually used a list)
Thanks. I wasn't sure whether the '- 1' would work, although prefix -- surely would have considering it works on p. But in the end your version looks much better.
byuu

Post by byuu »

Okay, here's my attempt at a generic priority queue. There's no real reason for it to be non-copyable; I'm just being lazy. I felt std::vector's overhead would be too much, so I opted for a fixed array size instead. You'd really want to use a pointer for complex type_t objects with high copying costs. I could probably do that transparently with type traits, but that's really something the user should be doing.

Code:

#ifndef NALL_PRIORITYQUEUE_HPP
#define NALL_PRIORITYQUEUE_HPP

#include <limits>
#include <nall/function.hpp>
#include <nall/utility.hpp>

namespace nall {

template<typename type_t> void priorityqueue_nocallback(type_t) {}

//priority queue implementation using binary min-heap array;
//does not require normalize() function.
//O(1)     find   (tick)
//O(log n) insert (enqueue)
//O(log n) remove (dequeue)
template<typename type_t> class priorityqueue : noncopyable {
public:
  inline void tick(unsigned ticks) {
    basecounter += ticks;
    while(heapsize && gte(basecounter, heap[0].counter)) callback(dequeue());
  }

  //counter is relative to current time (eg enqueue(64, ...) fires in 64 ticks);
  //counter cannot exceed std::numeric_limits<unsigned>::max() >> 1.
  void enqueue(unsigned counter, type_t event) {
    unsigned child = heapsize++;
    counter += basecounter;

    while(child) {
      unsigned parent = (child - 1) >> 1;
      if(gte(counter, heap[parent].counter)) break;

      heap[child].counter = heap[parent].counter;
      heap[child].event = heap[parent].event;
      child = parent;
    }

    heap[child].counter = counter;
    heap[child].event = event;
  }

  type_t dequeue() {
    type_t event(heap[0].event);
    unsigned parent = 0;
    unsigned counter = heap[--heapsize].counter;

    while(true) {
      unsigned child = (parent << 1) + 1;
      if(child >= heapsize) break;
      if(child + 1 < heapsize && gte(heap[child].counter, heap[child + 1].counter)) child++;
      if(gte(heap[child].counter, counter)) break;

      heap[parent].counter = heap[child].counter;
      heap[parent].event = heap[child].event;
      parent = child;
    }

    heap[parent].counter = counter;
    heap[parent].event = heap[heapsize].event;
    return event;
  }

  void reset() {
    basecounter = 0;
    heapsize = 0;
  }

  priorityqueue(unsigned size, function<void (type_t)> callback_ = &priorityqueue_nocallback<type_t>)
  : callback(callback_) {
    heap = new heap_t[size];
    reset();
  }

  ~priorityqueue() {
    delete[] heap;
  }

private:
  function<void (type_t)> callback;
  unsigned basecounter;
  unsigned heapsize;
  struct heap_t {
    unsigned counter;
    type_t event;
  } *heap;

  //return true if x is greater than or equal to y
  inline bool gte(unsigned x, unsigned y) {
    return x - y < (std::numeric_limits<unsigned>::max() >> 1);
  }
};

}

#endif
Hopefully my greater-than-equal function works right. Seems to in my test cases, at least.

Any optimization improvements welcome.
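For reference, here is a usage sketch of the interface above, with std::priority_queue standing in for the hand-rolled heap. The event names and tick counts are made up, and this stand-in skips the gte() wraparound handling a real port would need.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Usage sketch of the enqueue/tick interface above. std::priority_queue
// replaces the hand-rolled heap; event names and tick counts are invented.
enum EventId { DramRefresh, HdmaRun, IrqPoll };

struct QueueDemo {
  using Entry = std::pair<unsigned, EventId>;  //(absolute tick, event)
  std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> queue;
  unsigned basecounter = 0;
  std::vector<EventId> fired;

  //counter is relative to the current time, as in enqueue(64, ...)
  void enqueue(unsigned counter, EventId event) {
    queue.push({basecounter + counter, event});
  }

  //advance time, dispatching every event that has come due
  void tick(unsigned ticks) {
    basecounter += ticks;
    while(!queue.empty() && queue.top().first <= basecounter) {
      fired.push_back(queue.top().second);  //stands in for callback(dequeue())
      queue.pop();
    }
  }
};
```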
ecst
Rookie
Posts: 13
Joined: Fri Nov 14, 2008 1:01 am

Post by ecst »

In the other thread, byuu wrote:Took a ~1% speed hit or so by using functors for the callback and using the signed math trick to avoid the need for a normalize() function.
Replace

Code:

return x - y < (std::numeric_limits<unsigned>::max() >> 1);
with

Code:

return x - y <= (std::numeric_limits<unsigned>::max() >> 1); 
to reverse the speed hit for signed math.
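For what it's worth, my reading of why the <= form can recover signed-math codegen (this gloss is mine, not ecst's wording): x - y <= UINT_MAX/2 holds exactly when the sign bit of the wrapped difference is clear, so the compiler can emit a single signed sign test. A quick equivalence check:

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Check that the <= comparison is the same predicate as a plain sign test
// on the wrapped difference. (Casting a value with the top bit set to
// int32_t is implementation-defined before C++20, but two's complement on
// the platforms bsnes targets.)
bool gte_le(uint32_t x, uint32_t y) {
  return x - y <= (std::numeric_limits<uint32_t>::max() >> 1);
}

bool gte_sign(uint32_t x, uint32_t y) {
  return (int32_t)(x - y) >= 0;
}
```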
lscharen
New Member
Posts: 7
Joined: Tue Dec 30, 2008 11:37 pm

Post by lscharen »

May as well throw out some data structure prOn for everyone. This paper was published in 2005.

Ladder Queues
byuu

Post by byuu »

ecst wrote:Replace

Code:

return x - y < (std::numeric_limits<unsigned>::max() >> 1);
with

Code:

return x - y <= (std::numeric_limits<unsigned>::max() >> 1); 
to reverse the speed hit for signed math.
This actually caused a loss of 2-3 fps >_<
I know it's not that the comparison itself is slower; more likely it's some sort of alignment issue with the generated code. I moved a variable inside a class below a two-entry struct and lost ~10fps. Really hate these damn compilers.
lscharen wrote:Ladder Queues
It's documents like that which reaffirm to me that I'm not a computer scientist, but a computer amateur x.x
Still, 10-100 million entry systems ... somehow I doubt they tested that with 4-5 entry systems :P

O(1) insertion is really clever, though the cost is that get-first is only amortized O(1); I prefer a guaranteed O(1) get-first. Really, O(log n) scales perfectly well to any size, in my opinion. But I guess with 10 million events, you need every bit of performance you can get. Sheesh.

Wonder how it compares to a more traditional Fibonacci heap or red-black tree. Never even heard of an SCQ.
ecst
Rookie
Posts: 13
Joined: Fri Nov 14, 2008 1:01 am

Post by ecst »

byuu wrote:This actually caused a loss of 2-3 fps >_<
Strange. Would you mind sharing the WIP, so that I can take a look at the produced binary code?
byuu wrote: I know it's not because it's slower, more likely some sort of alignment issue with the generated code. I moved a variable inside a class below a two-entry struct and lost ~10fps. Really hate these damn compilers.
Might also have something to do with cache hits, but in this case I do not see how.
blargg
Regular
Posts: 327
Joined: Thu Jun 30, 2005 1:54 pm
Location: USA
Contact:

Post by blargg »

Remember: for a limited data size, an O(1) algorithm A can be slower than an O(log n) algorithm B. Complexity measures describe how an algorithm behaves as the data grows, and they are critical when the data can grow greatly. For this case, what really matters is absolute average performance, so there's no substitute for implementing and benchmarking.
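To make that concrete for this thread, one might benchmark the two structures discussed above at realistic queue depths. These helpers (mine, for illustration) extract each structure's dispatch order; wrap each in the timer of your choice, and assert that the two orders agree as a sanity check.

```cpp
#include <cassert>
#include <functional>
#include <list>
#include <queue>
#include <vector>

// Benchmark helpers comparing the two structures from this thread at small,
// emulator-like queue depths. Each returns the dispatch order it produces,
// so a timing harness can also verify both structures agree.

// O(n)-insert sorted linked list (the Delta_Queue approach, absolute times)
std::vector<unsigned> via_sorted_list(const std::vector<unsigned>& input) {
  std::list<unsigned> events;
  for(unsigned v : input) {
    auto p = events.begin();
    while(p != events.end() && *p < v) ++p;  //linear scan to insertion point
    events.insert(p, v);
  }
  return {events.begin(), events.end()};
}

// O(log n)-insert binary min-heap (the priorityqueue approach)
std::vector<unsigned> via_heap(const std::vector<unsigned>& input) {
  std::priority_queue<unsigned, std::vector<unsigned>,
                      std::greater<unsigned>> heap;
  for(unsigned v : input) heap.push(v);
  std::vector<unsigned> order;
  while(!heap.empty()) { order.push_back(heap.top()); heap.pop(); }
  return order;
}
```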