"New" S-DSP core

byuu · Post by **byuu** » Sat Mar 22, 2008 3:37 pm

Okay, it's just blargg's. I hope he doesn't mind ...

I rewrote his S-DSP emulator in pure C++. Only took me seven hours, not bad. anomie's took a few days.

Now, given, it's extremely similar of course. First, the algorithms are going to be mostly the same regardless of who writes the code. Second, I really didn't see a reason to waste too much time on this reverse engineering a bunch of stuff myself, so I pretty much just took the code and "rewrote" (read: copied) it in my unique style, and changed a few things here and there. Code flow, variable names, tables, exact algorithms, etc were blatant, direct copies.

Things I did change:
- counter rate 0 is now hardcoded to not ever hit zero
- counter read is now boolean instead of unsigned short
- a lot of multiplication was converted to shifts
- broke up the program into ~9 source files
- no more global functions anywhere, all in one class
- removed the hooks for things like external channel muting -- will re-add if I ever add an option like that to bsnes
- modified VREG to not need the voice regs handle passed to it
- all voice functions take a reference instead of pointer to the voice structs now
- packed 32-line timing table expanded to multi-line
- left everything in their own small chunk functions ... kind of torn on whether I want to merge that with the main timing function. I like the encapsulation, but it would remove the need to keep so many struct-based state variables
- added a few more comments on parts that confused me at first
- removed assignment inside conditional stuff; even though I do that myself on occasion in other code I write, heh
- yadda yadda, more minor stuff like that

Going to keep working at it -- wanted to get it working now, so that finding regressions will be easier. I want to remove the double writes for the ring buffer, make a decision on whether I want to rely on sign extension, or use sclip<> for that, implement a compile-time option to bypass libco (will save 2.048 million co_switch calls a second) since the S-DSP's entire operation fits into a single switch table quite easily, convert a lot of the mul / div stuff to shifts, convert those clever split up branches in the envelope and BRR decoding routines to switch / case tables, remove the shift tables from the BRR decoding, and try and figure out what's going on with some of the code so that I can try and document it :)

I'll see if I can contribute something back, too. Perhaps I can look into what happens when you enable mute or something.

New WIP up which has the new core enabled by default. For those without WIP access, I've posted the new source for reference. Comments welcome.

Code: Select all

byuu.org/files/bsnes_dsp.zip

... man, feels weird posting a new topic.

creaothceann · Post by **creaothceann** » Sat Mar 22, 2008 4:49 pm

byuu wrote:- a lot of multiplication was converted to shifts

Are there still compilers that don't do that automatically?

Turambar · Post by **Turambar** » Sat Mar 22, 2008 7:18 pm

I don't know about compilers that much, but if modern compilers really convert multiplication and stuff into shifts automatically, I think that for the sake of readability these pieces of code should be left intact.

Good work anyway, OOP is the way to go.

Verdauga Greeneyes · Post by **Verdauga Greeneyes** » Sat Mar 22, 2008 8:48 pm

byuu wrote:Going to keep working at it -- wanted to get it working now, so that finding regressions will be easier. I want to remove the double writes for the ring buffer, make a decision on whether I want to rely on sign extension, or use sclip<> for that, implement a compile-time option to bypass libco (will save 2.048 million co_switch calls a second) since the S-DSP's entire operation fits into a single switch table quite easily, convert a lot of the mul / div stuff to shifts, convert those clever split up branches in the envelope and BRR decoding routines to switch / case tables, remove the shift tables from the BRR decoding, and try and figure out what's going on with some of the code so that I can try and document it

I guess this is best left to the experts

Wish I could help, but without understanding the processes involved.. I'll have a look at it when you've done all you want to do, I think, and after I read that article.

byuu · Post by **byuu** » Sat Mar 22, 2008 11:13 pm

Are there still compilers that don't do that automatically? :?

It wasn't meant to be a speedup, it was meant to look more like the actual hardware would. It's very unlikely that little DSP chip performed many multiplications.

Besides, anyone working on that should know their powers of two anyway.

creaothceann · Post by **creaothceann** » Sat Mar 22, 2008 11:14 pm

byuu wrote:
Are there still compilers that don't do that automatically?
It wasn't meant to be a speedup, it was meant to look more like the actual hardware would. It's very unlikely that little DSP chip performed many multiplications.

Right, makes sense.

Post by **pagefault** » Sun Mar 23, 2008 5:37 pm

It sounds good but you also might want to put the fomulae there for readability.

byuu · Post by **byuu** » Mon Mar 24, 2008 3:16 am

Okay, updated the BRR decoding to eliminate the shift table, improve the readability of the very clever 4->16->32-bit nybble sign extension, and use a switch table for the filter type.

Unfortunately, I'm having trouble removing the ring buffer. I'm not sure why the brr buffer size is 12 samples (requires modulation instead of bit-masking), and even with it, I'm failing a lot of tests. So, ring buffers stay in for now.

Pulling download link since there's an official release with it in place now.

It sounds good but you also might want to put the fomulae there for readability.

Yeah, that stays :)

blargg · Post by **blargg** » Mon Mar 24, 2008 5:43 am

It's 12 samples in the emulator because it's 12 samples in the DSP. Not sure why you're having problems eliminating the second copy in the ring buffer. Maybe you're not properly handling the case when a new sample is started, and it's using the last two samples from the 12-sample buffer for some BRR filtering modes? If you did this relative to the last pointer, you'd fail the tests, since it always uses the last 2 samples regardless of where it was.

Also, I'd think you'd just make a ring buffer class that encapsulates the second copy business, since that's the true OO way. Then it could keep the second copy without exposing this to clients. Otherwise you have a costly %12 or a conditional.

byuu · Post by **byuu** » Mon Mar 24, 2008 4:56 pm

If you did this relative to the last pointer, you'd fail the tests, since it always uses the last 2 samples regardless of where it was.

Correct, this is what I was doing.

Your code used pos[brr_buffer_size - 1], which I assumed was relative to the previous sample since pos is incremented after each sample is written, and also because anomie's core did the same thing.

To be honest, it's hard to understand a lot of your optimizations, as they're above my skill level. Took me a while to see what you were doing with the 4-bit scalar sign extend from 16-bit integer in the BRR decoder, and I'm still at a loss with the interp_pos fractional stuff.

Also, I'd think you'd just make a ring buffer class that encapsulates the second copy business, since that's the true OO way. Then it could keep the second copy without exposing this to clients. Otherwise you have a costly %12 or a conditional.

Thanks, that's a very good idea. I've been meaning to design a ring buffer class anyway, this will be the perfect occasion. Maybe use write-twice for non-powers of two, and bit-masking for powers of two. Throw in a few functions for poking and relative seeking, and that should be good.

byuu · Post by **byuu** » Tue Mar 25, 2008 11:48 pm

Okay, OO ring buffer attempt. Doesn't optimize when size is a power of two yet.

Code: Select all

template<int size> struct ring_buffer {
  uint8_t array[size * 2];
  int offset;

  uint8_t read() {
    uint8_t r = array[offset];
    if(++offset >= size) offset = 0;
    return r;
  }

  void write(uint8_t data) {
    array[offset] = data;
    array[offset + size] = data;
    if(++offset >= size) offset = 0;
  }

  uint8_t& operator[](int index) {
    int p = offset + index;
    return array[p + (p < 0) * size];
  }

  ring_buffer() : offset(0) {
    memset(&array, 0, size * 2);
  }
};

Works only when -size <= index[n] <= +size, which is good enough for me. Look good?

Verdauga Greeneyes · Post by **Verdauga Greeneyes** » Wed Mar 26, 2008 1:34 am

Just thought of a crazy way to write read():

Code: Select all

uint8_t read() { return array[(offset %= size)++]; }

Of course this has the caveat that the other functions would have to account for the case when offset == size, and it wouldn't surprise me if this version is slower anyway. Just had to share though

Edit: 'cuz I'm bored, here's a version of write() that would work with it:

Code: Select all

void write(uint8_t data) {
  array[offset %= size] = data;
  array[offset++ + size] = data;
}

byuu · Post by **byuu** » Wed Mar 26, 2008 1:48 am

Modulus (division) is very slow.

We can basically cache the data once to simplify fetching to one compare and multiply, or twice to be a direct index.

It's basically a 50/50 trade-off. Double the work of a flat array to read, double the work to write. Though given the nature of ring buffers (usually read at least as much as written), it may not matter much. Perhaps a triple buffer would be best for simplicity's sake.

Verdauga Greeneyes · Post by **Verdauga Greeneyes** » Wed Mar 26, 2008 1:57 am

byuu wrote:Modulus (division) is very slow.

*nod* Just wanted to add, for powers of two you could use

Code: Select all

offset &= ~size

(which should be fast) if you really wanted to

blargg · Post by **blargg** » Wed Mar 26, 2008 6:40 am

Yes, for power of two, it's trivial. But BRR the ring buffers are 12 samples and used much more often than the power-of-two echo buffer.

byuu, the read doesn't have to handle negative indicies. The whole point of making the second copy was to make reading require nothing special:

Code: Select all

uint8_t operator[](int index) const {
    assert( 0 <= index && index < size*2 );
    return array[index];
}

The assertion documents the valid range. Given index i which is 0 <= i < size, to get element i-1, you do i+size-1 (this is how BRR decoding gets the previous two samples). Note also that operator [] shouldn't return a reference to non-const, since writes must modify more than one element.

It makes more sense for you to abstract a "modulo" array rather than the ring buffer, since the caller wants to keep track of the position in it. For BRR decoding, this must be done so the caller can get previous samples (or read must accept an offset). I was thinking of something more like operator [] for reading, as above, and a write( index ) which just did array [index] = array [index+size] = data.

byuu · Post by **byuu** » Wed Mar 26, 2008 7:05 am

byuu, the read doesn't have to handle negative indicies.

The reason I did that was so that you could seek forward or backward based on the current position.

So, instead of n = array[pos-1], you would just say array[-1]. All of the S-DSP code always indexes by v.t_brr_pos, so it seemed like that would work.

But yes, if we made it request the literal address of the value, we need only add size to put us in the pivotal center of the array. From here we can add or subtract up to "size" and get the correct value.

Maybe I can move "size" inside the index, eg return array[index + size], and then access with array or array[i + 1] ...

Note also that operator [] shouldn't return a reference to non-const, since writes must modify more than one element.

Oops, good catch.

It makes more sense for you to abstract a "modulo" array rather than the ring buffer, since the caller wants to keep track of the position in it.

Hmm ... not as versatile, but it probably would be simpler that way.

---

EDIT: okay, I see your point. Here's what I came up with:

Code: Select all

#ifndef NALL_MODULO_HPP
#define NALL_MODULO_HPP

#include <nall/new.hpp>

namespace nall {

template<typename T, int size>
class modulo_array {
public:
  inline T operator[](int index) const {
    return buffer[size + index];
  }

  inline T read(int index) const {
    return buffer[size + index];
  }

  inline void write(unsigned index, const T value) {
    buffer[index] =
    buffer[index + size] =
    buffer[index + size + size] = value;
  }

  modulo_array() {
    buffer = new(zeromemory) T[size * 3];
  }

  ~modulo_array() {
    delete[] buffer;
  }

private:
  T *buffer;
};

} //namespace nall

#endif //ifndef NALL_MODULO_HPP

I added a third write so that you can wrap around once in each direction without a crash. I know it isn't needed (gaussian seeks +0 to +3, while brr_pos &3 always == 0), but it makes the modulo_array class more versatile in general.

Not really sure if I should take a write param of const T value or const T& value, and an operator[] return of T or const T& ... that always confuses me. The former will be faster with integral types, but the latter faster with class objects, right? Perhaps I should use type traits to select the type?

Eg, typedef typename static_if< is_fundamental<T>::value, T, T& >::type type;

blargg · Post by **blargg** » Wed Mar 26, 2008 9:21 am

Not really sure if I should take a write param of const T value or const T& value, and an operator[] return of T or const T& ... that always confuses me. The former will be faster with integral types, but the latter faster with class objects, right? Perhaps I should use type traits to select the type?

If it's inline, the const T& can yield the best code on a decently fast compiler. If T=int, then the compiler should generate the same code as passing by value. Otherwise, if you were to pass by value, I don't think the compiler could optimize the copy out if the copy constructor for T were non-trivial or outline.

But, unnecessary generality often makes code worse, not better. Here I think you've gone off the deep end, adding just-in-case features that will never be used. Take the extreme programming approach: code for today, and just make sure you can easily modify for tomorrow, rather than trying to code for tomorrow today.

ZH/Franky · Post by **ZH/Franky** » Wed Mar 26, 2008 5:11 pm

Byuu, what was wrong with the old core you had?

franpa · Post by **franpa** » Thu Mar 27, 2008 12:34 am

most likely more accurate and more configurable.

byuu · Post by **byuu** » Thu Mar 27, 2008 12:52 am

Here I think you've gone off the deep end, adding just-in-case features that will never be used.

Meh, I quite enjoy my libraries. They've come in handy a lot.

Code: Select all

    const int p1 = v.buffer[v.buf_pos - 1];
    const int p2 = v.buffer[v.buf_pos - 2] >> 1;
    ...
    s = sclamp<16>(s);
    s = (int16)(s << 1);
    v.buffer.write(v.buf_pos++, s);
    if(v.buf_pos >= brr_buf_size) v.buf_pos = 0;

Code: Select all

  offset = v.buf_pos + (v.interp_pos >> 12);
  int output;
  output  = (fwd[  0] * v.buffer[offset + 0]) >> 11;
  output += (fwd[256] * v.buffer[offset + 1]) >> 11;
  output += (rev[256] * v.buffer[offset + 2]) >> 11;
  output  = (int16)output;
  output += (rev[  0] * v.buffer[offset + 3]) >> 11;
  return sclamp<16>(output) & ~1;

Code: Select all

int sDSP::calc_fir(int i, bool channel) {
  int s = state.echo_hist[channel][state.echo_hist_pos + i + 1];
  return (s * (int8)REG(fir + i * 0x10)) >> 6;
}

Still working on it, but I like it so far :/

Byuu, what was wrong with the old core you had?

Absolutely nothing. I want to understand the S-DSP better. Just plugging in another's library as a black box and being content with that is a meaningless endeavor to me.

By doing this, I almost understand the new code as well as I did anomie's -- which is to say, not all that well, but good enough for a basic understanding. blargg's code is actually very clear, I'm just trying to make it a tiny bit clearer by replacing faster speed optimizations with slower, easier to read code; the other half is that I've worked with anomie's code for two years, so obviously I am more familiar with that.

Eventually, I'd like to write up documentation, run my own tests, etc.

most likely more accurate and more configurable.

Jipcy · Post by **Jipcy** » Thu Mar 27, 2008 5:35 am

byuu wrote:
most likely more accurate and more configurable.

I agree with him that it would be the most likely reason, even though it wasn't the actual reason.

creaothceann · Post by **creaothceann** » Thu Mar 27, 2008 12:33 pm

Franky wrote:Byuu, what was wrong with the old core you had?

franpa wrote:most likely more accurate and more configurable.

Strange idea of "wrong"...

franpa · Post by **franpa** » Thu Mar 27, 2008 1:12 pm

yea i wrote 'less' but then changed them to 'more' for some reason.

Post by **grinvader** » Thu Mar 27, 2008 6:41 pm

franpa wrote:for some reason.

INTELLIGENCE

(following the negating caps rule, obviously)

ZSNES board

"New" S-DSP core

"New" S-DSP core

Re: "New" S-DSP core

Re: "New" S-DSP core

Re: "New" S-DSP core

Re: "New" S-DSP core