
Author 
Message 
byuu

Delta queue design
Delta Queue
So, planning to try geting things like ALU and auto joypad reads into a delta queue. If it works well, then do the same for NMI+IRQ. Not entirely sure how to design such a class.

Background info:
There's probably no more than ~10 events: DRAM refresh, DMA trigger, HDMA trigger, HDMA init trigger, NMI trigger, IRQ trigger, ALU step, Joypad poll step, etc. They don't stack, so we can make our queue a fixed size, no need for a resizable vector.

The only way I can think to implement it is like so:
List is sorted in linear order, next event to fire comes first. Each event tells you that it triggers +n clock cycles from the previous, eg:
queue {
{ 26, NMI }, //NMI trigger in 26 clocks ahead of current time
{ 14, ALU }, //ALU step in 42 clocks (26+14)
};
add_clocks(n) would be something like:
 subtract n from the top of the queue
loop:
 if queue(0) > 0, return
 trigger queue(0) event
 zero out queue(0) so it won't fire if queue wraps around? or mark the enum as inactive / empty?
 add the nownegative queue(0) to queue(1); advance queue one place
add_event and remove_event would be a major pain. I'd essentially have to rebuild the queue from scratch. At best, an O(log n) search to find where it should be inserted, and rebuild everything after that to avoid throwing off the relative timestamps. I would need to add and remove events at least 23x per scanline, minimum ... but then there's only ~10 of them or so. Maybe it won't matter much? What kind of complexity level does adding and removing from such a queue usually entail?
My table would just be a time index+enum. Triggering an event can switch on the enum to invoke the function call. So copying an enum for rebuilding shouldn't be too bad.
Advancing the queue would be easy enough at O(1) by using a ringbuffer index into it.

Is there a better way to implement this, possibly?

Tue Dec 30, 2008 8:08 pm 


grinvader
ZSNES Shake Shake Prinny
Joined: Wed Jul 28, 2004 4:15 pm Posts: 5622 Location: PAL50, dood !

Re: Delta queue design
Circular buffer structure.
(is what I decided on when faced with a similar issue)
_________________ 皆黙って俺について来い！！
Pantheon: Gideon Zhi  CaitSith2  Nach  kode54

Tue Dec 30, 2008 9:40 pm 


byuu

Re: Delta queue design
I'm using a ring buffer to advance the queue for push/pop.
But each element relies on the previous element, as each timestamp is relative. In order to find where to insert, I'd sum the times of all elements from the beginning until they exceed the new element. Then I'd have to insert the new element in the middle and move all the old ones back.
A linked list wouldn't really help, as inserting a new timed element would throw off all the elements after it, so I'd have to recompute all of those anyway.
Just worried that inserting / deleting might add just as much overhead as maintaining a linear list and checking each one individually.

Tue Dec 30, 2008 10:05 pm 


sinamas
Gambatte Developer
Joined: Fri Oct 21, 2005 4:03 pm Posts: 157 Location: Norway

I use either a heap with absolute times or, if there is a very low number of fixed events, I just statically select the closest event linearly on changes, avoiding looping/indirection overhead (the low constant factor benefit outweighs the O(n) penalty for low n) but having to write the selection function per instance.

Tue Dec 30, 2008 10:35 pm 


Verdauga Greeneyes
Regular
Joined: Tue Mar 07, 2006 10:32 am Posts: 347 Location: The Netherlands

Are there any properties of the delays you can take advantage of? For instance if the events never happen less than 4 clocks apart and the last event is always less than 256 clocks ahead, you could use a 64entry array and do something like (startingIndex + delay / 4) % 64 to find the next spot for insertion (storing delay % 4 as the offset). You'll be walking through a somewhat sparse array, but at least you won't have to reorder the entries.

Tue Dec 30, 2008 10:51 pm 


byuu

Won't that require decrementing and checking every item in the list? That's pretty much what I do now, just hardcoded for extra speed. How do you know which event is the closest without checking them all? Are you keeping your list sorted, but still using absolute times? That could work well with a linked list for O(log n) search + insertion + deletion. You'd need to keep track of how many clocks you took off the first entry, and once it trips, take that counter off all the other ones in turn. Could be tricky, but workable.
Not that I'm aware of. They're all very different types of timing events with varying levels of granularity.

Tue Dec 30, 2008 11:21 pm 


Verdauga Greeneyes
Regular
Joined: Tue Mar 07, 2006 10:32 am Posts: 347 Location: The Netherlands

And the maximum delay between the first and last events in the queue is unreasonably large?

Tue Dec 30, 2008 11:57 pm 


lscharen
New Member
Joined: Tue Dec 30, 2008 11:37 pm Posts: 7

Re: Delta queue design
    byuu wrote: Background info: There's probably no more than ~10 events: DRAM refresh, DMA trigger, HDMA trigger, HDMA init trigger, NMI trigger, IRQ trigger, ALU step, Joypad poll step, etc. They don't stack, so we can make our queue a fixed size, no need for a resizable vector.
The only way I can think to implement it is like so: List is sorted in linear order, next event to fire comes first. Each event tells you that it triggers +n clock cycles from the previous, eg: queue { { 26, NMI }, //NMI trigger in 26 clocks ahead of current time { 14, ALU }, //ALU step in 42 clocks (26+14) };
add_clocks(n) would be something like:  subtract n from the top of the queue loop:  if queue(0) > 0, return  trigger queue(0) event  zero out queue(0) so it won't fire if queue wraps around? or mark the enum as inactive / empty?  add the nownegative queue(0) to queue(1); advance queue one place
add_event and remove_event would be a major pain.
Advancing the queue would be easy enough at O(1) by using a ringbuffer index into it.
Is there a better way to implement this, possibly?
    
You definitely want to use a linkedlist to do the insertions in order to preserve the O(1) amortized time. It may seem like a binary search would be faster, but you lose the savings over a straight linkedlist insertion when you have to move all the array elements to insert a new event.
The O(1) amortized time idea goes like this:
1. Insert a new event at time t + n
2. Start walking the linked list from the head
3. For each linked like node with an event time less than t + n, remove one credit from the amortization pool
4. Insert the new even in the linked list
5. For each call to next_event, return the head of the linked list, advance to the next node and add one credit from the amortization pool.
You can see that the credit pool will always be equal to zero when it becomes empty, so everything takes amortized constant time.
If you use an array with a binary search, then your insertions will take O(log n) + O(n) time to find and insert the new event.
In terms of implementation, you can implement linked lists using a set of fixedlength arrays for speed. This can be optimized further using standard linkedlist optimizations. This code is not tested and is just for discussion.
    Code: // Arrays to implement linked list // // delta : time in the future to fire event // event : type of event (ALU, NMI, etc.) // next : index of next linked list node // head : index of list head // free_node : list of free array slots // free_len : number of free slots
unsigned int delta[MAX_LEN]; unsigned int event[MAX_LEN]; int next[MAX_LEN]; unsigned int free_node[MAX_LEN] = {0, 1, 2, ..., MAX_LEN1}; int head = 1; unsigned int free_idx = MAX_LEN  1;
void add_event( unsigned int evnt, unsigned int dlta ) { int prev, curr; unsigned int delta_sum = 0;
// special case inserting to an empty list (assume free_idx > 0) if ( head == 1 ) { head = free_node[free_idx]; delta[head] = dlta; event[head] = evnt; next[head] = 1; return; }
// otherwise scan for the first empty slot curr = head; prev = 1;
while ( curr >= 0 ) { delta_sum += delta[curr];
if ( delta_sum >= dlta ) break;
prev = curr; curr = next[curr]; }
// If the times are the same, merge the events if ( delta_sum == dlta ) { event[curr] = evnt; return; }
// Otherwise insert a new event. After an event is inserted, // we need to check if there is another event after the insertion // point and decrement its delta value
// Special case an insertion before the head if ( prev == 1 ) { head = free_node[free_idx]; delta[head] = dlta; event[head] = evnt; next[head] = curr; } else { // General insertion. Point the previous node to a new one next[prev] = free_node[free_idx];
// Advance to the new node and set it's link and data fields // The delta field calculation is a bit ugly since it depends on // how we terminated the search loop, e.g. endoflist or // delta_sum > dlta. This could probably be cleaned up to // avoid the branch. prev = next[prev];
next[prev] = curr; event[prev] = evnt; if ( curr >= 0 ) delta[prev] = dlta  (delta_sum  delta[curr]); else delta[prev] = dlta  delta_sum; }
// Adjust the value of delta[curr] if needed if ( curr >= 0 ) delta[curr] = delta_sum  dlta; }
void add_clocks( unsigned int dlta ) { unsigned int delta_sum = 0;
// scan the linked list until delta_sum > dlta int curr = head;
while ( curr >= 0 ) { // Advance the delta counter delta_sum += delta[curr]; if ( delta_sum > dlta ) break;
// process the event do_event( event[curr] );
// delete the linked list node free_node[++free_idx] = curr; curr = next[curr]; }
// reset the start of the list head = curr;
// optionally reset the next delta if ( head >= 0 ) delta[head] = delta_sum  dlta; }
    
That should be a start. Edit:Updated example code to actually sum the relative deltas and make the implementation a bit more realistic. Also, if you are querying the delta queue on singlecycle intervals, then the add_clocks() function is greatly simplified because dlta = 1, e.g.
Last edited by lscharen on Wed Dec 31, 2008 5:41 am, edited 2 times in total.

Wed Dec 31, 2008 1:02 am 


DataPath
Lurker
Joined: Wed Jul 28, 2004 1:35 am Posts: 128

agreed  linked list.
{ delta_time_to_expire, event, next*}
insertions are cheap, and you can't DO a binary search on a (classical) delta queue for insertions because you don't know the accumulated time delay on any particular event  only the time delta from the prior event.
If you expected your queue to have to get very long (rule #1 of choosing an algorithm: Know your domain), you could do an array of linked lists (somewhat like a hash table), and you could take advantage of precomputed "block" (referring to the linked lists) deltas to further reduce the insertion time. If you *really* wanted to do binary searches, you could reduce the cost of keeping the absolute times current by keeping semiabsolute times in each block (absolute times relative to the first nonactive block), which would give you the ability to do a binary search. Then when the active block expires, you recompute absolute times only for the blocks, instead of each and every element in the queue.
But that's getting into a very high level of complexity that's generally not warranted for a delta queue.
edit: the biggest advantage that lscharen's implementation has over a classical linkedlist implementation is locality. For the size of linkedlist that byuu is talking about, there shouldn't be more than 2 cache misses in walking the list. A classical implementation would have many many more cache misses.
Last edited by DataPath on Wed Dec 31, 2008 1:35 am, edited 1 time in total.

Wed Dec 31, 2008 1:23 am 


Nach
ZSNES Developer
Joined: Tue Jul 27, 2004 10:54 pm Posts: 3902 Location: Solar powered park bench

I must be misunderstanding the case here, since it seems too easy to me.
Can't you keep the same queue you have, just have an extra integer alongside it to contain the offset to use when tallying up to where you need to be in the queue?
_________________ May 9 2007  NSRT 3.4, now with lots of hashing and even more accurate information! Go download it. _____________ Insane Coding

Wed Dec 31, 2008 1:32 am 


sinamas
Gambatte Developer
Joined: Fri Oct 21, 2005 4:03 pm Posts: 157 Location: Norway

No, I keep a single cycle counter that is incremented and compared to the head. The event time is the count at which the event is supposed to occur. EDIT: what do you mean by "the list"? A binary heap is usually implemented with an array.
No, I mean it's a hardcoded linear search, but because of the low constant cost factor it's faster than O(log N) alternatives for low N. It only needs to be done every time an event is executed or scheduled.

Wed Dec 31, 2008 2:07 am 


byuu

lscharen, thank you very much for the example code.
Not following what you mean by "amortization pool credits", but I get the general idea of your code. The part I missed was that you don't have to recompute all elements after your insertion part, just one.
Example ... say we have the following queue:
[0] = { NMI, +26 }
[1] = { IRQ, +14 } //= +42 total
[2] = { DRAM, +10 } //= +52 total
We want to insert ALU at +32, so we do an O(n) summation and find that we insert at [0]. So first link ALU between [0] and [2] to get:
[0] = { NMI, +26 }
[1] = { ALU, ??? }
[2] = { IRQ, +14 }
[3] = { DRAM, +10 }
Then fix the times:
[1] = clocks(32)  [0] = +6
[2] = [0] + [2]  [1] = +10
[3] is still correct [+26, +6, +10, +10 = +52], no need to update it.
The O(n) search to find the insertion point sucks, but like DataPath mentioned (and I missed in the first post), the relative values means we can't perform a binary search here. The linked list also rules out that possibility, anyway.
Not entirely following what you mean, sorry. Ah, that brings up two problems for me: 1) you have to test each and every entry in the list when adding time, or keep the list sorted somehow. 2) exact cycle count timestamps that grow forever will eventually overflow, unless you subtract them all by the smallest value in the list periodically. Do you do something like that after each frame or so? Perhaps it would be better to keep a sorted exact cycle position list rather than an exact relative cycle position list. Don't think it'll matter with a linked list approach, though. I'm referring to an array, yes. But from lscharen's observation above, I'm leaning toward using a linked list now.
That's what I'm worried about now. I don't have many events, so I don't know if this queue scheme will be worth the extra complication compared to hardcoded tests in the appropriate places.

Wed Dec 31, 2008 11:33 pm 


sinamas
Gambatte Developer
Joined: Fri Oct 21, 2005 4:03 pm Posts: 157 Location: Norway

No idea what you're talking about here. Are you perhaps confusing a heap with something else? With a binary heap adding cycles would be O(1), finding the head is O(1), adjusting or scheduling an event is O(log N), and removing the head is O(log N). Heaps are typically used for priority queues. Yes, I do subtract all counters periodically to avoid overflow. Using a linked list for a binary heap would be silly (slow).
Note that by hardcoded linear search I'm referring to something very different from what you're currently doing.

Thu Jan 01, 2009 12:12 am 


byuu

Ah yes, I was thinking you had a linear array. Can't say I've ever used a binary heap before, so I had to read up on it. The 2n+1, 2n+2 trick to index into the heap is cute. ... because they're absolute times, so you're using a minheap to trigger that, removing the element, then repeating the compare. Got it. Always heap[0], yes. Hmm. I'll have to work that out on paper. I'm visualizing a tree that branches in each direction, so you could add to one side indefinitely rather than both evenly. Probably just diagram>actual implementation confusion. Obviously what you say about big O timing is correct. Yes, subtract the current time from everything, since that will always be the lowest value; assuming you've polled for events after your last cycle add.
I'll take your word for that, have enough here to work with at the moment ;)
Thank you very much for your help. Now I'm not sure which approach I should use ... linked list + relative timing, or binary heap + absolute timing + periodic subtraction to prevent overflow. Meh.

Thu Jan 01, 2009 1:51 am 


ecst
Rookie
Joined: Fri Nov 14, 2008 1:01 am Posts: 13

No need there. When comparing time stamps, instead of a < b always do (signed) a  b < 0. On most processor models, this amounts to checking the negative flag instead of the carry flag, so there should be no difference in performance.

Thu Jan 01, 2009 2:11 am 


blargg
Regular
Joined: Thu Jun 30, 2005 1:54 pm Posts: 327 Location: USA

byuu, I found this helpful illustration of arraybased heap operations. I had never studied this structure before. Arraybased heap lecture notes.pdf
Ecst's point is that if you ensure that all your times fall within 0 to UINT_MAX/2 of each other, you can entirely eliminate periodic adjustments of all times. Unsigned values wrap around by definition, so addition and subtraction work without change. Only ordered comparison needs special treatment:
On a two's complement machine where conversion from unsigned to signed doesn't do any range check, which is the case on almost all machines these days, that can be optimized to the following (and I wouldn't be surprised if gcc does this automatically): Don't try to use signed int everywhere and eliminate less_than; even though overflow of signed integers doesn't usually generate a trap on most machines these days, a compiler's optimizer might take advantage of that fact and rewrite the second "if" below into the first:

Thu Jan 01, 2009 4:09 am 


sinamas
Gambatte Developer
Joined: Fri Oct 21, 2005 4:03 pm Posts: 157 Location: Norway

Good point, ecst. I use that in my gettime/sleep abstraction code for instance, but haven't applied it to this for various reasons.

Thu Jan 01, 2009 5:41 am 


DataPath
Lurker
Joined: Wed Jul 28, 2004 1:35 am Posts: 128

Really, all this angst over refining the process should be dumped in favor of the most straightforward readable implementation, and optimized if it comes up in performance profiling.

Thu Jan 01, 2009 4:00 pm 


lscharen
New Member
Joined: Tue Dec 30, 2008 11:37 pm Posts: 7

Yes, I was trying to get across the idea that, even though the worstcase insertion time is O(n), a sequence of insert/remove operations have amortized constant time. See http://en.wikipedia.org/wiki/Amortized_analysis for more information.     Quote: Example ... say we have the following queue: [0] = { NMI, +26 } [1] = { IRQ, +14 } //= +42 total [2] = { DRAM, +10 } //= +52 total We want to insert ALU at +32, so we do an O(n) summation and find that we insert at [0]. So first link ALU between [0] and [2] to get: [0] = { NMI, +26 } [1] = { ALU, ??? } [2] = { IRQ, +14 } [3] = { DRAM, +10 } Then fix the times: [1] = clocks(32)  [0] = +6 [2] = [0] + [2]  [1] = +10
[3] is still correct [+26, +6, +10, +10 = +52], no need to update it.
The O(n) search to find the insertion point sucks, but like DataPath mentioned (and I missed in the first post), the relative values means we can't perform a binary search here. The linked list also rules out that possibility, anyway.
    
Right. You only have to update the delta time of the event you are inserting and the delta of the next event. If you're not convinced of the O(1) expected performance of the queue, implement it and try it out. You may be correct that it would be faster to keep an array sorted if the number of events is bounded by a small number, e.g.

Thu Jan 01, 2009 6:38 pm 


sinamas
Gambatte Developer
Joined: Fri Oct 21, 2005 4:03 pm Posts: 157 Location: Norway

Suppose you have N events, and that every time an event is executed it is rescheduled to execute after all the other events (but we don't know this, so we can't just stick it to the back of a FIFO). Then with a delta queue approach each event execution+rescheduling has O(N) complexity. Amortizing over the number of event executions like that looks like a fallacy to me, since you're not considering that all the events already in the queue need to be scheduled too.

Thu Jan 01, 2009 7:34 pm 


blargg
Regular
Joined: Thu Jun 30, 2005 1:54 pm Posts: 327 Location: USA

lscharen, excellent point about only having to adjust the next delta, not all of them, when inserting before the end. sinamas, if you keep track of the absolute time of the last event, inserting an event past all the others can done in constant time. I wrote a simple delta queue class and exercise program, which demonstrates both of the above points. Excerpt of the key parts shown below. Note that this demo doesn't associate anything with the events, just the times of the events themselves. Delta_Queue.cpp

Fri Jan 02, 2009 1:41 am 


sinamas
Gambatte Developer
Joined: Fri Oct 21, 2005 4:03 pm Posts: 157 Location: Norway

True, but we still get linear complexity on average if the common case position is before the end and has a mean proportional to N.

Fri Jan 02, 2009 2:20 am 


lscharen
New Member
Joined: Tue Dec 30, 2008 11:37 pm Posts: 7

I have to agree with sinamas here. I took the javadoc comment from the page byuu linked to on its face, but I can't figure out a proof of amortized constant time for a linkedlist delta queue. I even looked at the Java source code and it's basically the same implementation that I proposed earlier.
http://avrora.cvs.sourceforge.net/viewvc/avrora/avrora/src/avrora/sim/clock/DeltaQueue.java?revision=1.1.1.1&view=markup
In terms of efficient implementation, I'm starting to think that a minheap using the 2scomplement trick might very well be faster. The lack of having to update a next field and guaranteed O(log n) insertion might make up for O(log n) removal.

Fri Jan 02, 2009 2:56 am 


bobthebuilder
Hazed
Joined: Sat Jan 28, 2006 7:21 am Posts: 76

This is the kind of talk that keeps me coming back. I always learn something new and realize how much more my skills need to improve

Fri Jan 02, 2009 3:05 am 


ecst
Rookie
Joined: Fri Nov 14, 2008 1:01 am Posts: 13

blargg, good job in elegantly implementing a delta queue (and also in providing commentary on standardscompliant C/C++ implementation of "sliding" compares). Note that you can get rid of the subtractions in "*p = time;" and "time = *p++", and also the "last" variable if you use absolute time stamps instead of relative deltas (paradoxical as it may sound).
The misunderstanding regarding complexity seems to be the following: The O(1) for delta queues takes the number of events as constant, not depending on it as a variable, rather giving it the meaning of O(1) per event and time unit.
Facilitating comparability, let us consider the number n of events, the number t of cycles of the considered time interval, the frequencies (average number of occurences per cycle) f_1, ..., f_n of the events, and the average delta queue insertion points (how many list items we have to iterate over when inserting the event) a_1, ..., a_n of each event. Note that a_i * f_i <= 1 (this is the key point of delta queues).
The binary heap is O(log(n) * (f_1 + ... + f_n) * t).
The delta queue is O((a_1 * f_1 + ... + a_n * f_n) * t). Using a_i * f_i <= 1, we can give the simpler, but less sharp estimate O(n * t) (this is the O(1) per event and time unit from above).
As a case study, further simplifying, assume that each event respawns as soon as it got killed and has constant delay, that f_1 > ... > f_n, that no events occur simultaneously, and that the time interval is sufficiently large. After some calculations, we find that a_1 * f_1 + ... + a_n * f_n = 2 * (0 * f_1 + 1 * f_2 + ... + (n  1) * f_n). So it all boils down to log(n) * (f_1 + ... + f_n) vs. 0 * f_1 + 1 * f_2 + ... + (n  1) * f_n. Just for fun, let f_n = 1 / n^q with q > 0 some real parameter. Interestingly, the binary heap wins for 0 < q < 2, while the delta queue wins for q > 2, with a draw for q = 2.
Informally, these heuristic results would mean the delta queue is best suited for situations where the event delays cover all different kinds of magnitudes, while the binary heap should be used in the remaining cases, and when guaranteed optimal (and good) worst case complexity is important. Note that (albeit not obvious, and maybe slowed by a constant factor) binary heaps can also feature logarithmic removal/rescheduling of events (e. g., in case IRQ settings change).
Interesting to consider would be a synergistic approach, with the at most logarithmic insertion of the binary heap, but cheaper insertion for events sooner to occurs. Maybe a kind of "unbalanced" binary search tree with a bias for the right (with larger values).
Nevertheless, all these considerations will probably be in vain if the number of events is low enough for efficient hardcoded solutions. Still, a scalable approach would be interesting (e. g., if you, byuu, plan on potentially adding many future events, for whatever reason).
Last edited by ecst on Fri Jan 02, 2009 4:18 am, edited 2 times in total.

Fri Jan 02, 2009 3:11 am 


Who is online 
Users browsing this forum: No registered users and 0 guests 

You cannot post new topics in this forum You cannot reply to topics in this forum You cannot edit your posts in this forum You cannot delete your posts in this forum



