Delta queue design
DRAM refresh, IRQ and NMI pins can trigger at any clock cycle, even in the middle of an opcode cycle read.
DMA, HDMA and HDMA init can trigger on the edge of an opcode cycle (tested every ~8 master cycles).
IRQ and NMI can generate an actual interrupt on the edge of a full opcode (every ~32 master cycles).
I do not know how often the ALU and joypad polling tick as we don't emulate those properly yet.
My thought was to have the per-clock stuff just do whatever it needs to immediately, and to have the rest set a flag bit in a variable for each event, so for cycle_edge we'd have:
while(event_cycleedge) {
  switch(event_cycleedge & -event_cycleedge) {
    ...
  }
  event_cycleedge &= event_cycleedge - 1;
}
Then there's just the final edge case of handling events that occur N clocks after a previous event.
For instance, when IRQ trips on a clock boundary, it holds the value high for 6 clocks. If the program reads the "test and clear pending IRQ" register at +0 or +2 clocks into that event, it returns 1 and cancels the IRQ. At +4 clocks in, it will return 1, but trigger the IRQ and leave the bit set. At +6 and on, the IRQ triggers and the next read will be 0 as expected.
I would need a way to determine exactly where an event occurred, not just that it had, to schedule the future IRQ pin events. Or maybe I can just insert an event for IRQ+0, IRQ+2, IRQ+6 and IRQ+8 to manipulate the behavior.
-
- Regular
- Posts: 347
- Joined: Tue Mar 07, 2006 10:32 am
- Location: The Netherlands
Assuming a uniform distribution and a bidirectional list, the following should be a bit faster: (it got a bit long, so I compacted it into for loops; I hope you don't mind)
Edit: note this does not guarantee the order of simultaneous events, though it could with minimal modifications. It just stops as soon as possible.
Code: Select all
void Delta_Queue::insert( int time )
{
  events_t::iterator p;
  if ( time >= last )
  {
    events.push_back( time - last );
    last = time;
  }
  else
  {
    if ( time < last / 2 )
    {
      for ( p = events.begin(); time > *p; time -= *p++ );
    }
    else
    {
      int temp = last;
      for ( p = events.end() - 1; time < temp; temp -= *p-- );
      p++;
      time -= temp;
    }
    *p -= time;
    events.insert( p, time );
  }
}
Verdauga Greeneyes, nifty. You can write the backwards loop a tad more compactly (and avoid events.end() - 1, which requires that it be a random access iterator, which it wasn't in mine which actually used a list):
Code: Select all
int temp = last;
for(p = events.end(); time < temp; temp -= *--p);
time -= temp;
Here's a version using an array-based heap, absolute times, and no need to periodically adjust them, as overflow doesn't cause any problems. Full code + exerciser: Delta_Queue_2.cpp
Absolute times seem superior to me. The demo includes functions that implement less-than and subtraction that handle when one of them has wrapped around but the other hasn't, so you have all operations. You could wrap them in a class and overload < and - so that they are completely transparent to use.
blargg wrote: Here's a version using an array-based heap, absolute times, and no need to periodically adjust them, as overflow doesn't cause any problems. Full code + exerciser: Delta_Queue_2.cpp
I threw together a C implementation of the linked-list and heap-based versions. The heap does not coalesce events by default, but it could easily be modified to do so. Code is minimally tested.
Heap and Linked List Delta Queues.
Enjoy!
Edit:
I updated my code to incorporate the optimizations in the code blargg posted -- which are very nice, btw. Credit given where credit is due.
Last edited by lscharen on Fri Jan 02, 2009 7:14 pm, edited 1 time in total.
blargg wrote: Verdauga Greeneyes, nifty. You can write the backwards loop a tad more compactly (and avoid events.end() - 1, which requires that it be a random access iterator, which it wasn't in mine which actually used a list)
Thanks. I wasn't sure whether the '- 1' would work, although prefix -- surely would have considering it works on p. But in the end your version looks much better.
Okay, here's my attempt at a generic priority queue. No real reason for it to be non-copyable, I'm just being lazy. I felt std::vector overhead would be too much, and opted for a fixed array size instead. You'd really want to use a pointer for complex type_t objects with high copying costs. I could probably do that transparently with type traits, but that's really something the user should be doing.
Hopefully my greater-than-or-equal function works right. It seems to in my test cases, at least.
Any optimization improvements welcome.
Code: Select all
#ifndef NALL_PRIORITYQUEUE_HPP
#define NALL_PRIORITYQUEUE_HPP

#include <limits>
#include <nall/function.hpp>
#include <nall/utility.hpp>

namespace nall {

template<typename type_t> void priorityqueue_nocallback(type_t) {}

//priority queue implementation using binary min-heap array;
//does not require normalize() function.
//O(1) find (tick)
//O(log n) insert (enqueue)
//O(log n) remove (dequeue)
template<typename type_t> class priorityqueue : noncopyable {
public:
  inline void tick(unsigned ticks) {
    basecounter += ticks;
    while(heapsize && gte(basecounter, heap[0].counter)) callback(dequeue());
  }

  //counter is relative to current time (eg enqueue(64, ...) fires in 64 ticks);
  //counter cannot exceed std::numeric_limits<unsigned>::max() >> 1.
  void enqueue(unsigned counter, type_t event) {
    unsigned child = heapsize++;
    counter += basecounter;
    while(child) {
      unsigned parent = (child - 1) >> 1;
      if(gte(counter, heap[parent].counter)) break;
      heap[child].counter = heap[parent].counter;
      heap[child].event = heap[parent].event;
      child = parent;
    }
    heap[child].counter = counter;
    heap[child].event = event;
  }

  type_t dequeue() {
    type_t event(heap[0].event);
    unsigned parent = 0;
    unsigned counter = heap[--heapsize].counter;
    while(true) {
      unsigned child = (parent << 1) + 1;
      if(child >= heapsize) break;
      if(child + 1 < heapsize && gte(heap[child].counter, heap[child + 1].counter)) child++;
      if(gte(heap[child].counter, counter)) break;
      heap[parent].counter = heap[child].counter;
      heap[parent].event = heap[child].event;
      parent = child;
    }
    heap[parent].counter = counter;
    heap[parent].event = heap[heapsize].event;
    return event;
  }

  void reset() {
    basecounter = 0;
    heapsize = 0;
  }

  priorityqueue(unsigned size, function<void (type_t)> callback_ = &priorityqueue_nocallback<type_t>)
  : callback(callback_) {
    heap = new heap_t[size];
    reset();
  }

  ~priorityqueue() {
    delete[] heap;
  }

private:
  function<void (type_t)> callback;
  unsigned basecounter;
  unsigned heapsize;

  struct heap_t {
    unsigned counter;
    type_t event;
  } *heap;

  //return true if x is greater than or equal to y
  inline bool gte(unsigned x, unsigned y) {
    return x - y < (std::numeric_limits<unsigned>::max() >> 1);
  }
};

}

#endif
In the other thread, byuu wrote: Took a ~1% speed hit or so by using functors for the callback and using the signed math trick to avoid the need for a normalize() function.
Replace
Code: Select all
return x - y < (std::numeric_limits<unsigned>::max() >> 1);
with
Code: Select all
return x - y <= (std::numeric_limits<unsigned>::max() >> 1);
to reverse the speed hit for signed math.
May as well throw out some data structure prOn for everyone. This paper was published in 2005.
Ladder Queues
ecst wrote: Replace
Code: Select all
return x - y < (std::numeric_limits<unsigned>::max() >> 1);
with
Code: Select all
return x - y <= (std::numeric_limits<unsigned>::max() >> 1);
to reverse the speed hit for signed math.
This actually caused a loss of 2-3 fps >_<
I know it's not because it's slower, more likely some sort of alignment issue with the generated code. I moved a variable inside a class below a two-entry struct and lost ~10fps. Really hate these damn compilers.
lscharen wrote: Ladder Queues
It's documents like that which reaffirm to me that I'm not a computer scientist, but a computer amateur x.x
Still, 10-100 million entry systems ... somehow I doubt they tested that with 4-5 entry systems :P
O(1) insertion is really clever, though it comes at a cost of amortized O(1) get-first. I like guaranteed O(1) first. Really, O(log n) is perfectly scalable in my opinion to any size. But I guess with 10m events, you need every bit of performance you can get. Sheesh.
Wonder how it compares to a more traditional Fibonacci heap or red-black tree. Never even heard of an SCQ.
byuu wrote: This actually caused a loss of 2-3 fps >_<
Strange. Would you mind sharing the WIP, so that I can take a look at the produced binary code?
byuu wrote: I know it's not because it's slower, more likely some sort of alignment issue with the generated code. I moved a variable inside a class below a two-entry struct and lost ~10fps. Really hate these damn compilers.
Might also have something to do with cache hits, but in this case I do not see how.
Remember, for a limited data size, O(1) algorithm A can be slower than O(log n) algorithm B. The point of complexity measures is how the algorithms work as the data grows, and they are critical when the data can grow greatly. For this case, what really matters is the absolute average performance, thus there's no substitute for implementing and benchmarking.