Delta queue design
So, planning to try getting things like ALU and auto joypad reads into a delta queue. If it works well, then do the same for NMI+IRQ. Not entirely sure how to design such a class.
---
Background info:
There's probably no more than ~10 events: DRAM refresh, DMA trigger, HDMA trigger, HDMA init trigger, NMI trigger, IRQ trigger, ALU step, Joypad poll step, etc. They don't stack, so we can make our queue a fixed size, no need for a resizable vector.
---
The only way I can think to implement it is like so:
List is sorted in linear order, next event to fire comes first. Each event tells you that it triggers +n clock cycles from the previous, eg:
queue {
{ 26, NMI }, //NMI trigger in 26 clocks ahead of current time
{ 14, ALU }, //ALU step in 40 clocks (26+14)
};
add_clocks(n) would be something like:
- subtract n from the top of the queue
loop:
- if queue(0) > 0, return
- trigger queue(0) event
- zero out queue(0) so it won't fire if queue wraps around? or mark the enum as inactive / empty?
- add the now-negative queue(0) to queue(1); advance queue one place
add_event and remove_event would be a major pain. I'd essentially have to rebuild the queue from scratch. At best, an O(log n) search to find where it should be inserted, and rebuild everything after that to avoid throwing off the relative timestamps. I would need to add and remove events at least 2-3x per scanline, minimum ... but then there's only ~10 of them or so. Maybe it won't matter much? What kind of complexity level does adding and removing from such a queue usually entail?
My table would just be a time index+enum. Triggering an event can switch on the enum to invoke the function call. So copying an enum for rebuilding shouldn't be too bad.
Advancing the queue would be easy enough at O(1) by using a ring-buffer index into it.
---
Is there a better way to implement this, possibly?
Re: Delta queue design
byuu wrote:add_event and remove_event would be a major pain. I'd essentially have to rebuild the queue from scratch. At best, an O(log n) search to find where it should be inserted, and rebuild everything after that to avoid throwing off the relative timestamps. I would need to add and remove events at least 2-3x per scanline, minimum ... but then there's only ~10 of them or so. Maybe it won't matter much? What kind of complexity level does adding and removing from such a queue usually entail?
Circular buffer structure.
(is what I decided on when faced with a similar issue)
Re: Delta queue design
grinvader wrote:Circular buffer structure.
I'm using a ring buffer to advance the queue for push/pop.
But each element relies on the previous element, as each timestamp is relative. In order to find where to insert, I'd sum the times of all elements from the beginning until they exceed the new element. Then I'd have to insert the new element in the middle and move all the old ones back.
A linked list wouldn't really help, as inserting a new timed element would throw off all the elements after it, so I'd have to recompute all of those anyway.
Just worried that inserting / deleting might add just as much overhead as maintaining a linear list and checking each one individually.
I use either a heap with absolute times or, if there is a very low number of fixed events, I statically select the closest event linearly on changes. That avoids looping/indirection overhead (the low constant factor outweighs the O(n) penalty for low n), at the cost of writing the selection function per instance.
Are there any properties of the delays you can take advantage of? For instance if the events never happen less than 4 clocks apart and the last event is always less than 256 clocks ahead, you could use a 64-entry array and do something like (startingIndex + delay / 4) % 64 to find the next spot for insertion (storing delay % 4 as the offset). You'll be walking through a somewhat sparse array, but at least you won't have to reorder the entries.
I use either a heap with absolute times
Won't that require decrementing and checking every item in the list? That's pretty much what I do now, just hard-coded for extra speed.
if there is a very low number of fixed events, I just statically select the closest event linearly on changes
How do you know which event is the closest without checking them all? Are you keeping your list sorted, but still using absolute times? That could work well with a linked list for O(log n) search + insertion + deletion.
You'd need to keep track of how many clocks you took off the first entry, and once it trips, take that counter off all the other ones in turn. Could be tricky, but workable.
For instance if the events never happen less than 4 clocks apart and the last event is always less than 256 clocks ahead
Not that I'm aware of. They're all very different types of timing events with varying levels of granularity.
Re: Delta queue design
You definitely want to use a linked-list to do the insertions in order to preserve the O(1) amortized time. It may seem like a binary search would be faster, but you lose the savings over a straight linked-list insertion when you have to move all the array elements to insert a new event.
byuu wrote:
Background info:
There's probably no more than ~10 events: DRAM refresh, DMA trigger, HDMA trigger, HDMA init trigger, NMI trigger, IRQ trigger, ALU step, Joypad poll step, etc. They don't stack, so we can make our queue a fixed size, no need for a resizable vector.
The only way I can think to implement it is like so:
List is sorted in linear order, next event to fire comes first. Each event tells you that it triggers +n clock cycles from the previous, eg:
queue {
{ 26, NMI }, //NMI trigger in 26 clocks ahead of current time
{ 14, ALU }, //ALU step in 40 clocks (26+14)
};
add_clocks(n) would be something like:
- subtract n from the top of the queue
loop:
- if queue(0) > 0, return
- trigger queue(0) event
- zero out queue(0) so it won't fire if queue wraps around? or mark the enum as inactive / empty?
- add the now-negative queue(0) to queue(1); advance queue one place
add_event and remove_event would be a major pain.
Advancing the queue would be easy enough at O(1) by using a ring-buffer index into it.
Is there a better way to implement this, possibly?
The O(1) amortized time idea goes like this:
1. Insert a new event at time t + n
2. Start walking the linked list from the head
3. For each linked list node with an event time less than t + n, remove one credit from the amortization pool
4. Insert the new event in the linked list
5. For each call to next_event, return the head of the linked list, advance to the next node and add one credit to the amortization pool.
You can see that the credit pool never goes negative, so everything takes amortized constant time.
If you use an array with a binary search, then your insertions will take O(log n) + O(n) time to find and insert the new event.
In terms of implementation, you can implement linked lists using a set of fixed-length arrays for speed. This can be optimized further using standard linked-list optimizations. This code is not tested and is just for discussion.
Code: Select all
// Arrays to implement linked list
//
// delta : time in the future to fire event
// event : type of event (ALU, NMI, etc.)
// next : index of next linked list node
// head : index of list head
// free_node : stack of free array slots
// free_idx : index of the top free slot
unsigned int delta[MAX_LEN];
unsigned int event[MAX_LEN];
int next[MAX_LEN];
unsigned int free_node[MAX_LEN] = {0, 1, 2, ..., MAX_LEN-1};
int head = -1;
unsigned int free_idx = MAX_LEN - 1;
void add_event( unsigned int evnt, unsigned int dlta )
{
int prev, curr;
unsigned int delta_sum = 0;
// special case inserting to an empty list (assume free_idx > 0)
if ( head == -1 )
{
head = free_node[free_idx--];
delta[head] = dlta;
event[head] = evnt;
next[head] = -1;
return;
}
// otherwise scan for the insertion point
curr = head;
prev = -1;
while ( curr >= 0 )
{
delta_sum += delta[curr];
if ( delta_sum >= dlta )
break;
prev = curr;
curr = next[curr];
}
// If the times are the same, merge the events
if ( delta_sum == dlta )
{
event[curr] |= evnt;
return;
}
// Otherwise insert a new event. After an event is inserted,
// we need to check if there is another event after the insertion
// point and decrement its delta value
// Special case an insertion before the head
if ( prev == -1 )
{
head = free_node[free_idx--];
delta[head] = dlta;
event[head] = evnt;
next[head] = curr;
}
else
{
// General insertion. Point the previous node to a new one
next[prev] = free_node[free_idx--];
// Advance to the new node and set its link and data fields
// The delta field calculation is a bit ugly since it depends on
// how we terminated the search loop, e.g. end-of-list or
// delta_sum > dlta. This could probably be cleaned up to
// avoid the branch.
prev = next[prev];
next[prev] = curr;
event[prev] = evnt;
if ( curr >= 0 )
delta[prev] = dlta - (delta_sum - delta[curr]);
else
delta[prev] = dlta - delta_sum;
}
// Adjust the value of delta[curr] if needed
if ( curr >= 0 )
delta[curr] = delta_sum - dlta;
}
void add_clocks( unsigned int dlta )
{
unsigned int delta_sum = 0;
// scan the linked list until delta_sum > dlta
int curr = head;
while ( curr >= 0 )
{
// Advance the delta counter
delta_sum += delta[curr];
if ( delta_sum > dlta )
break;
// process the event
do_event( event[curr] );
// delete the linked list node
free_node[++free_idx] = curr;
curr = next[curr];
}
// reset the start of the list
head = curr;
// optionally reset the next delta
if ( head >= 0 )
delta[head] = delta_sum - dlta;
}
Edit:
Updated example code to actually sum the relative deltas and make the implementation a bit more realistic.
Also, if you are querying the delta queue on single-cycle intervals, then the add_clocks() function is greatly simplified because dlta = 1, e.g.
Code: Select all
void add_clock(void)
{
// Just check and see if the head of the list is
// present with a delta value equal to one
if ( head < 0 )
return;
if ( --delta[head] == 0 )
{
// process the event
do_event( event[head] );
// delete the linked list node
free_node[++free_idx] = head;
head = next[head];
}
}
Last edited by lscharen on Wed Dec 31, 2008 5:41 am, edited 2 times in total.
agreed - linked list.
{ delta_time_to_expire, event, next*}
insertions are cheap, and you can't DO a binary search on a (classical) delta queue for insertions because you don't know the accumulated time delay on any particular event - only the time delta from the prior event.
If you expected your queue to have to get very long (rule #1 of choosing an algorithm: Know your domain), you could do an array of linked lists (somewhat like a hash table), and you could take advantage of precomputed "block" (referring to the linked lists) deltas to further reduce the insertion time. If you *really* wanted to do binary searches, you could reduce the cost of keeping the absolute times current by keeping semi-absolute times in each block (absolute times relative to the first non-active block), which would give you the ability to do a binary search. Then when the active block expires, you recompute absolute times only for the blocks, instead of each and every element in the queue.
But that's getting into a very high level of complexity that's generally not warranted for a delta queue.
edit: the biggest advantage that lscharen's implementation has over a classical linked-list implementation is locality. For the size of linked-list that byuu is talking about, there shouldn't be more than 2 cache misses in walking the list. A classical implementation would have many many more cache misses.
Last edited by DataPath on Wed Dec 31, 2008 1:35 am, edited 1 time in total.
I must be misunderstanding the case here, since it seems too easy to me.
Can't you keep the same queue you have, just have an extra integer alongside it to contain the offset to use when tallying up to where you need to be in the queue?
I use either a heap with absolute times
byuu wrote:Won't that require decrementing and checking every item in the list?
No, I keep a single cycle counter that is incremented and compared to the head. The event time is the count at which the event is supposed to occur.
EDIT: what do you mean by "the list"? A binary heap is usually implemented with an array.
if there is a very low number of fixed events, I just statically select the closest event linearly on changes
byuu wrote:How do you know which event is the closest without checking them all? Are you keeping your list sorted, but still using absolute times?
No, I mean it's a hard-coded linear search, but because of the low constant cost factor it's faster than O(log N) alternatives for low N. It only needs to be done every time an event is executed or scheduled.
lscharen, thank you very much for the example code.
Not following what you mean by "amortization pool credits", but I get the general idea of your code. The part I missed was that you don't have to recompute all elements after your insertion part, just one.
Example ... say we have the following queue:
[0] = { NMI, +26 }
[1] = { IRQ, +16 } //= +42 total
[2] = { DRAM, +10 } //= +52 total
We want to insert ALU at +32, so we do an O(n) summation and find that we insert after [0]. So first link ALU between [0] and [2] to get:
[0] = { NMI, +26 }
[1] = { ALU, ??? }
[2] = { IRQ, +16 }
[3] = { DRAM, +10 }
Then fix the times:
[1] = clocks(32) - [0] = 32 - 26 = +6
[2] = old IRQ delta - [1] = 16 - 6 = +10
[3] is still correct [+26, +6, +10, +10 = +52], no need to update it.
The O(n) search to find the insertion point sucks, but like DataPath mentioned (and I missed in the first post), the relative values mean we can't perform a binary search here. The linked list also rules out that possibility, anyway.
Can't you keep the same queue you have, just have an extra integer alongside it to contain the offset to use when tallying up to where you need to be in the queue?
Not entirely following what you mean, sorry.
No, I keep a single cycle counter that is incremented and compared to the head. The event time is the count at which the event is supposed to occur.
Ah, that brings up two problems for me:
1) you have to test each and every entry in the list when adding time, or keep the list sorted somehow.
2) exact cycle count timestamps that grow forever will eventually overflow, unless you subtract them all by the smallest value in the list periodically. Do you do something like that after each frame or so?
Perhaps it would be better to keep a sorted exact cycle position list rather than an exact relative cycle position list. Don't think it'll matter with a linked list approach, though.
EDIT: what do you mean by "the list"? A binary heap is usually implemented with an array.
I'm referring to an array, yes. But from lscharen's observation above, I'm leaning toward using a linked list now.
No, I mean it's a hard-coded linear search, but because of the low constant cost factor it's faster than O(log N) alternatives for low N. It only needs to be done every time an event is executed or scheduled.
That's what I'm worried about now. I don't have many events, so I don't know if this queue scheme will be worth the extra complication compared to hard-coded tests in the appropriate places.
Ah, that brings up two problems for me:
1) you have to test each and every entry in the list when adding time (at least when it's possible for those events to fire.)
No idea what you're talking about here. Are you perhaps confusing a heap with something else? With a binary heap adding cycles would be O(1), finding the head is O(1), adjusting or scheduling an event is O(log N), and removing the head is O(log N). Heaps are typically used for priority queues.
Yes, I do subtract all counters periodically to avoid overflow.
Using a linked list for a binary heap would be silly (slow).
That's what I'm worried about now. I don't have many events, so I don't know if this queue scheme will be worth the extra complication compared to hard-coded tests in the appropriate places.
Note that by hard-coded linear search I'm referring to something very different from what you're currently doing.
No idea what you're talking about here. Are you perhaps confusing a heap with something else?
Ah yes, I was thinking you had a linear array. Can't say I've ever used a binary heap before, so I had to read up on it.
The 2n+1, 2n+2 trick to index into the heap is cute.
With a binary heap adding cycles would be O(1)
... because they're absolute times, so you're using a min-heap to trigger that, removing the element, then repeating the compare. Got it.
finding the head is O(1)
Always heap[0], yes.
adjusting or scheduling an event is O(log N), and removing the head is O(log N)
Hmm. I'll have to work that out on paper. I'm visualizing a tree that branches in each direction, so you could add to one side indefinitely rather than both evenly. Probably just diagram->actual implementation confusion.
Obviously what you say about big O timing is correct.
Yes, I do subtract all counters periodically to avoid overflow.
Yes, subtract the current time from everything, since that will always be the lowest value; assuming you've polled for events after your last cycle add.
Note that by hard-coded linear search I'm referring to something very different from what you're currently doing.
I'll take your word for that, have enough here to work with at the moment ;)
Thank you very much for your help. Now I'm not sure which approach I should use ... linked list + relative timing, or binary heap + absolute timing + periodic subtraction to prevent overflow. Meh.
byuu wrote:2) exact cycle count timestamps that grow forever will eventually overflow, unless you subtract them all by the smallest value in the list periodically. Do you do something like that after each frame or so?
No need there. When comparing time stamps, instead of a < b always do (signed) a - b < 0. On most processor models, this amounts to checking the negative flag instead of the carry flag, so there should be no difference in performance.
byuu, I found this helpful illustration of array-based heap operations. I had never studied this structure before. Array-based heap lecture notes.pdf
Ecst's point is that if you ensure that all your times fall within 0 to UINT_MAX/2 of each other, you can entirely eliminate periodic adjustments of all times. Unsigned values wrap around by definition, so addition and subtraction work without change. Only ordered comparison needs special treatment:
Code: Select all
bool less_than( unsigned x, unsigned y )
{
return x - y > UINT_MAX / 2;
}
On a two's complement machine where conversion from unsigned to signed doesn't do any range check, which is the case on almost all machines these days, that can be optimized to the following (and I wouldn't be surprised if gcc does this automatically):
Code: Select all
bool less_than( unsigned x, unsigned y )
{
return (int) (x - y) < 0;
}
Don't try to use signed int everywhere and eliminate less_than; even though overflow of signed integers doesn't usually generate a trap on most machines these days, a compiler's optimizer might take advantage of that fact and rewrite the second "if" below into the first:
Code: Select all
int time_x, time_y;
if ( time_x < time_y ) // won't work if y has overflowed, but x hasn't
if ( time_x - time_y < 0 ) // optimizer might rewrite this as above
byuu wrote:lscharen, thank you very much for the example code.
Not following what you mean by "amortization pool credits", but I get the general idea of your code. The part I missed was that you don't have to recompute all elements after your insertion part, just one.
Yes, I was trying to get across the idea that, even though the worst-case insertion time is O(n), a sequence of insert/remove operations has amortized constant time. See http://en.wikipedia.org/wiki/Amortized_analysis for more information.
byuu wrote:Example ... say we have the following queue:
[0] = { NMI, +26 }
[1] = { IRQ, +16 } //= +42 total
[2] = { DRAM, +10 } //= +52 total
We want to insert ALU at +32, so we do an O(n) summation and find that we insert after [0]. So first link ALU between [0] and [2] to get:
[0] = { NMI, +26 }
[1] = { ALU, ??? }
[2] = { IRQ, +16 }
[3] = { DRAM, +10 }
Then fix the times:
[1] = clocks(32) - [0] = 32 - 26 = +6
[2] = old IRQ delta - [1] = 16 - 6 = +10
[3] is still correct [+26, +6, +10, +10 = +52], no need to update it.
The O(n) search to find the insertion point sucks, but like DataPath mentioned (and I missed in the first post), the relative values mean we can't perform a binary search here. The linked list also rules out that possibility, anyway.
Right. You only have to update the delta time of the event you are inserting and the delta of the next event. If you're not convinced of the O(1) expected performance of the queue, implement it and try it out. You may be correct that it would be faster to keep an array sorted if the number of events is bounded by a small number, e.g.
Code: Select all
void array_queue_insert( unsigned int time, unsigned int event )
{
// find the insertion point in the time-sorted array
int idx = binary_search( array, time );
// move the elements from idx to array_len back one slot
memmove( &array[idx + 1], &array[idx], (array_len - idx) * sizeof(array[0]) );
// insert the new event; each element holds its time and event type
array[idx].time = time;
array[idx].event = event;
array_len++;
}
Suppose you have N events, and that every time an event is executed it is rescheduled to execute after all the other events (but we don't know this, so we can't just stick it to the back of a FIFO). Then with a delta queue approach each event execution+rescheduling has O(N) complexity. Amortizing over the number of event executions like that looks like a fallacy to me, since you're not considering that all the events already in the queue need to be scheduled too.
lscharen, excellent point about only having to adjust the next delta, not all of them, when inserting before the end. sinamas, if you keep track of the absolute time of the last event, inserting an event past all the others can be done in constant time. I wrote a simple delta queue class and exercise program, which demonstrates both of the above points. Excerpt of the key parts shown below. Note that this demo doesn't associate anything with the events, just the times of the events themselves. Delta_Queue.cpp
Code: Select all
void Delta_Queue::insert( int time )
{
if ( time >= last )
{
events.push_back( time - last );
last = time;
}
else
{
events_t::iterator p = events.begin();
while ( p != events.end() )
{
if ( time <= *p )
{
*p -= time;
break;
}
time -= *p++;
}
events.insert( p, time );
}
}
void Delta_Queue::pop_front()
{
last -= front();
events.pop_front();
}
blargg wrote:sinamas, if you keep track of the absolute time of the last event, inserting an event past all the others can be done in constant time.
sinamas wrote:True, but we still get linear complexity on average if the common case position is before the end and has a mean proportional to N.
I have to agree with sinamas here. I took the javadoc comment from the page byuu linked to on its face, but I can't figure out a proof of amortized constant time for a linked-list delta queue. I even looked at the Java source code and it's basically the same implementation that I proposed earlier.
http://avrora.cvs.sourceforge.net/viewv ... iew=markup
In terms of efficient implementation, I'm starting to think that a min-heap using the 2s-complement trick might very well be faster. The lack of having to update a next field and guaranteed O(log n) insertion might make up for O(log n) removal.
blargg, good job in elegantly implementing a delta queue (and also in providing commentary on standards-compliant C/C++ implementation of "sliding" compares). Note that you can get rid of the subtractions in "*p -= time;" and "time -= *p++", and also the "last" variable if you use absolute time stamps instead of relative deltas (paradoxical as it may sound).
The misunderstanding regarding complexity seems to be the following: The O(1) for delta queues takes the number of events as constant, not depending on it as a variable, rather giving it the meaning of O(1) per event and time unit.
Facilitating comparability, let us consider the number n of events, the number t of cycles of the considered time interval, the frequencies (average number of occurrences per cycle) f_1, ..., f_n of the events, and the average delta queue insertion points (how many list items we have to iterate over when inserting the event) a_1, ..., a_n of each event. Note that a_i * f_i <= 1 (this is the key point of delta queues).
The binary heap is O(log(n) * (f_1 + ... + f_n) * t).
The delta queue is O((a_1 * f_1 + ... + a_n * f_n) * t). Using a_i * f_i <= 1, we can give the simpler, but less sharp estimate O(n * t) (this is the O(1) per event and time unit from above).
As a case study, further simplifying, assume that each event respawns as soon as it got killed and has constant delay, that f_1 > ... > f_n, that no events occur simultaneously, and that the time interval is sufficiently large. After some calculations, we find that a_1 * f_1 + ... + a_n * f_n = 2 * (0 * f_1 + 1 * f_2 + ... + (n - 1) * f_n). So it all boils down to log(n) * (f_1 + ... + f_n) vs. 0 * f_1 + 1 * f_2 + ... + (n - 1) * f_n. Just for fun, let f_i = 1 / i^q with q > 0 some real parameter. Interestingly, the binary heap wins for 0 < q < 2, while the delta queue wins for q > 2, with a draw for q = 2.
Informally, these heuristic results would mean the delta queue is best suited for situations where the event delays cover all different kinds of magnitudes, while the binary heap should be used in the remaining cases, and when guaranteed optimal (and good) worst case complexity is important. Note that (albeit not obvious, and maybe slowed by a constant factor) binary heaps can also feature logarithmic removal/rescheduling of events (e. g., in case IRQ settings change).
Interesting to consider would be a synergistic approach, with the at most logarithmic insertion of the binary heap, but cheaper insertion for events sooner to occur. Maybe a kind of "unbalanced" binary search tree with a bias for the right (with larger values).
Nevertheless, all these considerations will probably be in vain if the number of events is low enough for efficient hard-coded solutions. Still, a scalable approach would be interesting (e. g., if you, byuu, plan on potentially adding many future events, for whatever reason).
Last edited by ecst on Fri Jan 02, 2009 4:18 am, edited 2 times in total.