This topic is locked, you cannot edit posts or make further replies.  [ 36 posts ]  Go to page Previous  1, 2
Delta queue design 
DRAM refresh, IRQ and NMI pins can trigger at any clock cycle, even in the middle of an opcode cycle read.

DMA, HDMA and HDMA init can trigger on the edge of an opcode cycle (tested every ~8 master cycles).

IRQ and NMI can generate an actual interrupt on the edge of a full opcode (every ~32 master cycles).

I do not know how often the ALU and joypad polling tick, as we don't emulate those properly yet.

My thought was to have the per-clock stuff perform its event immediately, and for the rest to set a flag bit in a variable for each event, so for cycle_edge we'd have:

while(event_cycleedge) {
  switch(event_cycleedge & -event_cycleedge) {  //isolate the lowest set flag bit
  ...
  }
  event_cycleedge &= event_cycleedge - 1;  //clear that bit
}
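The loop relies on two standard bit tricks: x & -x isolates the lowest set bit, and x &= x - 1 clears it, so flags are dispatched lowest bit first. A self-contained sketch of the pattern (the event names here are made up for illustration, not actual emulator events):

```cpp
#include <cassert>
#include <string>

//hypothetical event flags; the names are illustrative only
enum : unsigned {
  EVENT_DMA       = 1u << 0,
  EVENT_HDMA      = 1u << 1,
  EVENT_HDMA_INIT = 1u << 2,
};

std::string dispatch(unsigned pending) {
  std::string log;
  while(pending) {
    switch(pending & (~pending + 1)) {  //~x + 1 == -x: isolate lowest set bit
      case EVENT_DMA:       log += "dma;";       break;
      case EVENT_HDMA:      log += "hdma;";      break;
      case EVENT_HDMA_INIT: log += "hdma_init;"; break;
    }
    pending &= pending - 1;  //clear the lowest set bit
  }
  return log;
}
```

Since lower bits fire first, the order you assign the flag bits doubles as a fixed priority among simultaneous events.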

Then there's just the final edge case of handling events that occur N clocks after a previous event.

For instance, when IRQ trips on a clock boundary, it holds the value high for 6 clocks. If the program reads the "test and clear pending IRQ" register at +0 or +2 clocks into that event, it returns 1 and cancels the IRQ. At +4 clocks in, it returns 1, but the IRQ still triggers and the bit stays set. From +6 on, the IRQ triggers and the next read returns 0, as expected.

I would need a way to determine exactly when an event occurred, not just that it had, in order to schedule the future IRQ pin events. Or maybe I can just insert events for IRQ+0, IRQ+2, IRQ+6 and IRQ+8 to reproduce the behavior.
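The second idea might look like this sketch, assuming a generic scheduler with a relative enqueue(delay, callback); the Scheduler here is a hypothetical stand-in (a multimap keyed on absolute time), not any emulator's actual implementation:

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>
#include <string>

std::string event_log;

//hypothetical minimal scheduler: multimap keyed on absolute time
struct Scheduler {
  uint64_t now = 0;
  std::multimap<uint64_t, std::function<void()>> events;

  //enqueue relative to the current time
  void enqueue(uint64_t delay, std::function<void()> fn) {
    events.emplace(now + delay, std::move(fn));
  }

  //fire all events up to and including time t, in order
  void run_until(uint64_t t) {
    while(!events.empty() && events.begin()->first <= t) {
      auto it = events.begin();
      now = it->first;
      auto fn = std::move(it->second);
      events.erase(it);
      fn();
    }
    now = t;
  }
};

//when the IRQ pin trips, queue one sub-event per offset from the post
void trip_irq(Scheduler &s) {
  for(uint64_t off : {0, 2, 6, 8})
    s.enqueue(off, [off] { event_log += "irq+" + std::to_string(off) + ";"; });
}
```

Each sub-event then only has to flip the pin state or re-check the pending bit at its own fixed offset; no event ever needs to know exactly where inside a cycle the original trip happened.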


Fri Jan 02, 2009 3:20 am
Regular

Joined: Tue Mar 07, 2006 10:32 am
Posts: 347
Location: The Netherlands
Assuming a uniform distribution of insertion times and a bidirectional list, the following should be a bit faster (it got a bit long, so I compacted it into for loops; I hope you don't mind):

Code:
void Delta_Queue::insert( int time )
{
    events_t::iterator p;
    if ( time >= last )
    {
        // later than everything queued: append at the end
        events.push_back( time - last );
        last = time;
    }
    else
    {
        if ( time < last / 2 )
        {
            // probably near the front: scan forward, consuming deltas
            for ( p = events.begin(); time > *p; time -= *p++ );
        }
        else
        {
            // probably near the back: scan backward from the end
            int temp = last;
            for ( p = events.end() - 1; time < temp; temp -= *p-- );
            p++;
            time -= temp;
        }
        // split the delta at p around the new event
        *p -= time;
        events.insert( p, time );
    }
}

Edit: note this does not guarantee the order of simultaneous events, though it could with minimal modifications. It just stops as soon as possible.
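For reference, here is a minimal self-contained rendition of the same idea, assuming events_t is a std::list<int> and last is the largest queued time. It only does the forward scan (dropping the bidirectional optimization for brevity), and drain() just demonstrates that the stored deltas reconstruct the absolute times in firing order:

```cpp
#include <cassert>
#include <list>
#include <vector>

struct Delta_Queue {
  std::list<int> events;  //each entry is a delta from the previous event
  int last = 0;           //absolute time of the final queued event

  void insert(int time) {
    if(time >= last) {
      events.push_back(time - last);  //later than everything: append
      last = time;
    } else {
      auto p = events.begin();
      while(time > *p) time -= *p++;  //forward scan only, for brevity
      *p -= time;                     //split the delta around the new event
      events.insert(p, time);
    }
  }

  //reconstruct the absolute times, in firing order
  std::vector<int> drain() const {
    std::vector<int> out;
    int t = 0;
    for(int d : events) out.push_back(t += d);
    return out;
  }
};
```

Like the original, this assumes non-negative times, so the queue is never empty when the time < last branch is taken (last starts at 0).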


Fri Jan 02, 2009 3:21 am
Regular

Joined: Thu Jun 30, 2005 1:54 pm
Posts: 327
Location: USA
Verdauga Greeneyes, nifty. You can write the backwards loop a tad more compactly (and avoid events.end() - 1, which requires a random-access iterator; mine didn't have one, since it actually used a list):
Code:
int temp = last;
for(p = events.end(); time < temp; temp -= *--p);
time -= temp;


Here's a version using an array-based heap, absolute times, and no need to periodically adjust them, as overflow doesn't cause any problems. Full code + exerciser: Delta_Queue_2.cpp

Absolute times seem superior to me. The demo includes less-than and subtraction functions that handle the case where one time has wrapped around but the other hasn't, which gives you all the operations you need. You could wrap them in a class and overload < and - so they're completely transparent to use.
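Such a wrapper class might look like this sketch (assuming a 32-bit counter, two's-complement conversion, and that two live times are never more than half the counter range apart; the name WrapTime is made up):

```cpp
#include <cassert>
#include <cstdint>

struct WrapTime {
  uint32_t t;

  //valid while the two times are within half the counter range of each other;
  //relies on two's-complement conversion of the modular difference
  bool operator<(WrapTime o) const {
    return static_cast<int32_t>(t - o.t) < 0;
  }

  //modular subtraction yields the true distance even across a wrap
  uint32_t operator-(WrapTime o) const {
    return t - o.t;
  }
};
```

With this in place, heap comparisons and delta computations read exactly like the non-wrapping versions, and no periodic normalization pass is ever needed.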


Fri Jan 02, 2009 4:26 am
New Member

Joined: Tue Dec 30, 2008 11:37 pm
Posts: 7
blargg wrote:
Here's a version using an array-based heap, absolute times, and no need to periodically adjust them, as overflow doesn't cause any problems. Full code + exerciser: Delta_Queue_2.cpp


I threw together a C implementation of the linked-list and heap-based versions. The heap does not coalesce events by default, but it could easily be modified to do so. Code is minimally tested.

Heap and Linked List Delta Queues.

Enjoy!

Edit:

I updated my code to incorporate the optimizations in the code blargg posted -- which are very nice, btw. Credit given where credit is due.


Last edited by lscharen on Fri Jan 02, 2009 7:14 pm, edited 1 time in total.



Fri Jan 02, 2009 4:51 am
Regular

Joined: Tue Mar 07, 2006 10:32 am
Posts: 347
Location: The Netherlands
blargg wrote:
Verdauga Greeneyes, nifty. You can write the backwards loop a tad more compactly (and avoid events.end() - 1, which requires that it be a random access iterator, which it wasn't in mine which actually used a list)

Thanks. I wasn't sure whether the '- 1' would work, although prefix -- surely would have considering it works on p. But in the end your version looks much better.


Fri Jan 02, 2009 5:33 am
Okay, here's my attempt at a generic priority queue. There's no real reason for it to be non-copyable; I'm just being lazy. I felt std::vector's overhead would be too much, so I opted for a fixed array size instead. You'd really want to use a pointer for complex type_t objects with high copying costs. I could probably do that transparently with type traits, but that's really something the user should be doing.

Code:
#ifndef NALL_PRIORITYQUEUE_HPP
#define NALL_PRIORITYQUEUE_HPP

#include <limits>
#include <nall/function.hpp>
#include <nall/utility.hpp>

namespace nall {

template<typename type_t> void priorityqueue_nocallback(type_t) {}

//priority queue implementation using binary min-heap array;
//does not require normalize() function.
//O(1)     find   (tick)
//O(log n) insert (enqueue)
//O(log n) remove (dequeue)
template<typename type_t> class priorityqueue : noncopyable {
public:
  inline void tick(unsigned ticks) {
    basecounter += ticks;
    while(heapsize && gte(basecounter, heap[0].counter)) callback(dequeue());
  }

  //counter is relative to current time (eg enqueue(64, ...) fires in 64 ticks);
  //counter cannot exceed std::numeric_limits<unsigned>::max() >> 1.
  void enqueue(unsigned counter, type_t event) {
    unsigned child = heapsize++;
    counter += basecounter;

    while(child) {
      unsigned parent = (child - 1) >> 1;
      if(gte(counter, heap[parent].counter)) break;

      heap[child].counter = heap[parent].counter;
      heap[child].event = heap[parent].event;
      child = parent;
    }

    heap[child].counter = counter;
    heap[child].event = event;
  }

  type_t dequeue() {
    type_t event(heap[0].event);
    unsigned parent = 0;
    unsigned counter = heap[--heapsize].counter;

    while(true) {
      unsigned child = (parent << 1) + 1;
      if(child >= heapsize) break;
      if(child + 1 < heapsize && gte(heap[child].counter, heap[child + 1].counter)) child++;
      if(gte(heap[child].counter, counter)) break;

      heap[parent].counter = heap[child].counter;
      heap[parent].event = heap[child].event;
      parent = child;
    }

    heap[parent].counter = counter;
    heap[parent].event = heap[heapsize].event;
    return event;
  }

  void reset() {
    basecounter = 0;
    heapsize = 0;
  }

  priorityqueue(unsigned size, function<void (type_t)> callback_ = &priorityqueue_nocallback<type_t>)
  : callback(callback_) {
    heap = new heap_t[size];
    reset();
  }

  ~priorityqueue() {
    delete[] heap;
  }

private:
  function<void (type_t)> callback;
  unsigned basecounter;
  unsigned heapsize;
  struct heap_t {
    unsigned counter;
    type_t event;
  } *heap;

  //return true if x is greater than or equal to y
  inline bool gte(unsigned x, unsigned y) {
    return x - y < (std::numeric_limits<unsigned>::max() >> 1);
  }
};

}

#endif


Hopefully my greater-than-or-equal function works right. It seems to in my test cases, at least.
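For what it's worth, the function can be exercised standalone against its intended meaning, i.e. x >= y on a wrapping counter, valid while the two values are within half the range of each other:

```cpp
#include <cassert>
#include <limits>

//copy of the gte() above: x >= y on a wrapping unsigned counter
inline bool gte(unsigned x, unsigned y) {
  return x - y < (std::numeric_limits<unsigned>::max() >> 1);
}
```

The key case is a pair of times straddling the wrap point: the unsigned subtraction wraps too, so the comparison still answers "which came first" correctly.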

Any optimization improvements welcome.


Sun Jan 04, 2009 3:53 pm
Rookie

Joined: Fri Nov 14, 2008 1:01 am
Posts: 13
In the other thread, byuu wrote:
Took a ~1% speed hit or so by using functors for the callback and using the signed math trick to avoid the need for a normalize() function.


Replace

Code:
return x - y < (std::numeric_limits<unsigned>::max() >> 1);


with

Code:
return x - y <= (std::numeric_limits<unsigned>::max() >> 1);


to reverse the speed hit for signed math.
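The reason this helps, as far as I can tell: with <=, the right-hand constant is exactly the largest value whose top bit is clear, so the whole test reduces to a sign check on the difference, i.e. one signed compare against zero. A quick equivalence check (the static_cast relies on two's-complement conversion):

```cpp
#include <cassert>
#include <climits>

//the <= form suggested above
inline bool gte_le(unsigned x, unsigned y) {
  return x - y <= (UINT_MAX >> 1);
}

//what the <= form can compile down to: test the sign bit of the difference
inline bool gte_signed(unsigned x, unsigned y) {
  return static_cast<int>(x - y) >= 0;
}
```

Both are true exactly when the top bit of x - y is clear, so the compiler is free to emit the signed comparison for either spelling once the constant is the full half-range.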


Sun Jan 04, 2009 6:29 pm
New Member

Joined: Tue Dec 30, 2008 11:37 pm
Posts: 7
May as well throw out some data structure prOn for everyone. This paper was published in 2005.

Ladder Queues


Sun Jan 04, 2009 7:53 pm
ecst wrote:
Replace

Code:
return x - y < (std::numeric_limits<unsigned>::max() >> 1);


with

Code:
return x - y <= (std::numeric_limits<unsigned>::max() >> 1);


to reverse the speed hit for signed math.


This actually caused a loss of 2-3 fps >_<
I know it's not because the comparison is slower; more likely it's some sort of alignment issue with the generated code. I once moved a variable inside a class below a two-entry struct and lost ~10fps. I really hate these damn compilers.

lscharen wrote:
Ladder Queues


It's documents like that which reaffirm to me that I'm not a computer scientist, but a computer amateur x.x
Still, 10-100 million entry systems ... somehow I doubt they tested that with 4-5 entry systems :P

O(1) insertion is really clever, though it comes at the cost of only amortized O(1) get-first. I prefer guaranteed O(1) for get-first. Really, O(log n) scales perfectly well, in my opinion, to any size. But I guess with 10 million events, you need every bit of performance you can get. Sheesh.

Wonder how it compares to a more traditional Fibonacci heap or red-black tree. Never even heard of an SCQ.


Mon Jan 05, 2009 12:51 pm
Rookie

Joined: Fri Nov 14, 2008 1:01 am
Posts: 13
byuu wrote:
This actually caused a loss of 2-3 fps >_<


Strange. Would you mind sharing the WIP, so that I can take a look at the produced binary code?

byuu wrote:
I know it's not because it's slower, more likely some sort of alignment issue with the generated code. I moved a variable inside a class below a two-entry struct and lost ~10fps. Really hate these damn compilers.


Might also have something to do with cache hits, but in this case I do not see how.


Mon Jan 05, 2009 1:24 pm
Regular

Joined: Thu Jun 30, 2005 1:54 pm
Posts: 327
Location: USA
Remember that for small data sizes, an O(1) algorithm A can be slower than an O(log n) algorithm B. Complexity measures describe how algorithms behave as the data grows, and they matter most when the data can grow very large. For this case, what really matters is absolute average performance, so there's no substitute for implementing both and benchmarking.
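A sketch of such a benchmark, comparing a binary heap against a plain sorted vector at emulator-sized queue depths (the type names and the pseudo-random input are made up; absolute timings will vary by machine):

```cpp
#include <algorithm>
#include <chrono>
#include <functional>
#include <queue>
#include <vector>

//O(log n) push / O(log n) pop: standard binary min-heap
struct HeapQueue {
  std::priority_queue<unsigned, std::vector<unsigned>, std::greater<unsigned>> pq;
  void push(unsigned t) { pq.push(t); }
  unsigned pop() { unsigned t = pq.top(); pq.pop(); return t; }
};

//O(n) push / O(1) pop: vector kept sorted descending, pop from the back
struct VecQueue {
  std::vector<unsigned> v;
  void push(unsigned t) {
    v.insert(std::upper_bound(v.begin(), v.end(), t, std::greater<unsigned>()), t);
  }
  unsigned pop() { unsigned t = v.back(); v.pop_back(); return t; }
};

//time `iters` rounds of pushing and popping n pseudo-random event times
template<typename Q> double bench(unsigned n, unsigned iters) {
  auto t0 = std::chrono::steady_clock::now();
  for(unsigned i = 0; i < iters; i++) {
    Q q;
    for(unsigned j = 0; j < n; j++) q.push((j * 2654435761u) % 997u);
    for(unsigned j = 0; j < n; j++) q.pop();
  }
  return std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
}
```

At the 4-5 entry sizes discussed above, the asymptotically "worse" O(n) structure may well win on constant factors and cache behavior; only measuring both tells you where the crossover is.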


Mon Jan 05, 2009 4:46 pm