"New" S-DSP core

Archived bsnes development news, feature requests and bug reports. Forum is now located at http://board.byuu.org/
Locked
byuu

"New" S-DSP core

Post by byuu »

Okay, it's just blargg's. I hope he doesn't mind ...

I rewrote his S-DSP emulator in pure C++. Only took me seven hours, not bad. anomie's took a few days.

Now, given, it's extremely similar of course. First, the algorithms are going to be mostly the same regardless of who writes the code. Second, I really didn't see a reason to waste too much time on this reverse engineering a bunch of stuff myself, so I pretty much just took the code and "rewrote" (read: copied) it in my unique style, and changed a few things here and there. Code flow, variable names, tables, exact algorithms, etc were blatant, direct copies.

Things I did change:
- counter rate 0 is now hardcoded to not ever hit zero
- counter read is now boolean instead of unsigned short
- a lot of multiplication was converted to shifts
- broke up the program into ~9 source files
- no more global functions anywhere, all in one class
- removed the hooks for things like external channel muting -- will re-add if I ever add an option like that to bsnes
- modified VREG to not need the voice regs handle passed to it
- all voice functions take a reference instead of pointer to the voice structs now
- packed 32-line timing table expanded to multi-line
- left everything in their own small chunk functions ... kind of torn on whether I want to merge that with the main timing function. I like the encapsulation, but it would remove the need to keep so many struct-based state variables
- added a few more comments on parts that confused me at first
- removed assignment inside conditional stuff; even though I do that myself on occasion in other code I write, heh
- yadda yadda, more minor stuff like that

Going to keep working at it -- wanted to get it working now, so that finding regressions will be easier. I want to remove the double writes for the ring buffer, make a decision on whether I want to rely on sign extension, or use sclip<> for that, implement a compile-time option to bypass libco (will save 2.048 million co_switch calls a second) since the S-DSP's entire operation fits into a single switch table quite easily, convert a lot of the mul / div stuff to shifts, convert those clever split up branches in the envelope and BRR decoding routines to switch / case tables, remove the shift tables from the BRR decoding, and try and figure out what's going on with some of the code so that I can try and document it :)

I'll see if I can contribute something back, too. Perhaps I can look into what happens when you enable mute or something.

New WIP up which has the new core enabled by default. For those without WIP access, I've posted the new source for reference. Comments welcome.

Code: Select all

byuu.org/files/bsnes_dsp.zip
... man, feels weird posting a new topic.
creaothceann
Seen it all
Posts: 2302
Joined: Mon Jan 03, 2005 5:04 pm
Location: Germany
Contact:

Re: "New" S-DSP core

Post by creaothceann »

byuu wrote:- a lot of multiplication was converted to shifts
Are there still compilers that don't do that automatically? :?
vSNES | Delphi 10 BPLs
bsnes launcher with recent files list
Turambar
Rookie
Posts: 12
Joined: Mon Jun 04, 2007 5:56 pm

Post by Turambar »

I don't know about compilers that much, but if modern compilers really convert multiplication and stuff into shifts automatically, I think that for the sake of readability these pieces of code should be left intact.

Good work anyway, OOP is the way to go.
Verdauga Greeneyes
Regular
Posts: 347
Joined: Tue Mar 07, 2006 10:32 am
Location: The Netherlands

Re: "New" S-DSP core

Post by Verdauga Greeneyes »

byuu wrote:Going to keep working at it -- wanted to get it working now, so that finding regressions will be easier. I want to remove the double writes for the ring buffer, make a decision on whether I want to rely on sign extension, or use sclip<> for that, implement a compile-time option to bypass libco (will save 2.048 million co_switch calls a second) since the S-DSP's entire operation fits into a single switch table quite easily, convert a lot of the mul / div stuff to shifts, convert those clever split up branches in the envelope and BRR decoding routines to switch / case tables, remove the shift tables from the BRR decoding, and try and figure out what's going on with some of the code so that I can try and document it :)
I guess this is best left to the experts :) Wish I could help, but without understanding the processes involved.. I'll have a look at it when you've done all you want to do, I think, and after I read that article.
byuu

Re: "New" S-DSP core

Post by byuu »

Are there still compilers that don't do that automatically? :?
It wasn't meant to be a speedup, it was meant to look more like the actual hardware would. It's very unlikely that little DSP chip performed many multiplications.

Besides, anyone working on that should know their powers of two anyway.
creaothceann
Seen it all
Posts: 2302
Joined: Mon Jan 03, 2005 5:04 pm
Location: Germany
Contact:

Re: "New" S-DSP core

Post by creaothceann »

byuu wrote:
Are there still compilers that don't do that automatically? :?
It wasn't meant to be a speedup, it was meant to look more like the actual hardware would. It's very unlikely that little DSP chip performed many multiplications.
Right, makes sense.
vSNES | Delphi 10 BPLs
bsnes launcher with recent files list
pagefault
ZSNES Developer
ZSNES Developer
Posts: 812
Joined: Tue Aug 17, 2004 5:24 am
Location: In your garden

Post by pagefault »

It sounds good but you also might want to put the fomulae there for readability.
Watering ur plants.
byuu

Post by byuu »

Okay, updated the BRR decoding to eliminate the shift table, improve the readability of the very clever 4->16->32-bit nybble sign extension, and use a switch table for the filter type.

Unfortunately, I'm having trouble removing the ring buffer. I'm not sure why the brr buffer size is 12 samples (requires modulation instead of bit-masking), and even with it, I'm failing a lot of tests. So, ring buffers stay in for now.

Pulling download link since there's an official release with it in place now.
It sounds good but you also might want to put the fomulae there for readability.
Yeah, that stays :)
blargg
Regular
Posts: 327
Joined: Thu Jun 30, 2005 1:54 pm
Location: USA
Contact:

Post by blargg »

It's 12 samples in the emulator because it's 12 samples in the DSP. Not sure why you're having problems eliminating the second copy in the ring buffer. Maybe you're not properly handling the case when a new sample is started, and it's using the last two samples from the 12-sample buffer for some BRR filtering modes? If you did this relative to the last pointer, you'd fail the tests, since it always uses the last 2 samples regardless of where it was.

Also, I'd think you'd just make a ring buffer class that encapsulates the second copy business, since that's the true OO way. Then it could keep the second copy without exposing this to clients. Otherwise you have a costly %12 or a conditional.
byuu

Post by byuu »

If you did this relative to the last pointer, you'd fail the tests, since it always uses the last 2 samples regardless of where it was.
Correct, this is what I was doing.

Your code used pos[brr_buffer_size - 1], which I assumed was relative to the previous sample since pos is incremented after each sample is written, and also because anomie's core did the same thing.

To be honest, it's hard to understand a lot of your optimizations, as they're above my skill level. Took me a while to see what you were doing with the 4-bit scalar sign extend from 16-bit integer in the BRR decoder, and I'm still at a loss with the interp_pos fractional stuff.
Also, I'd think you'd just make a ring buffer class that encapsulates the second copy business, since that's the true OO way. Then it could keep the second copy without exposing this to clients. Otherwise you have a costly %12 or a conditional.
Thanks, that's a very good idea. I've been meaning to design a ring buffer class anyway, this will be the perfect occasion. Maybe use write-twice for non-powers of two, and bit-masking for powers of two. Throw in a few functions for poking and relative seeking, and that should be good.
byuu

Post by byuu »

Okay, OO ring buffer attempt. Doesn't optimize when size is a power of two yet.

Code: Select all

template<int size> struct ring_buffer {
  uint8_t array[size * 2];
  int offset;

  uint8_t read() {
    uint8_t r = array[offset];
    if(++offset >= size) offset = 0;
    return r;
  }

  void write(uint8_t data) {
    array[offset] = data;
    array[offset + size] = data;
    if(++offset >= size) offset = 0;
  }

  uint8_t& operator[](int index) {
    int p = offset + index;
    return array[p + (p < 0) * size];
  }

  ring_buffer() : offset(0) {
    memset(&array, 0, size * 2);
  }
};
Works only when -size <= index[n] <= +size, which is good enough for me. Look good?
Verdauga Greeneyes
Regular
Posts: 347
Joined: Tue Mar 07, 2006 10:32 am
Location: The Netherlands

Post by Verdauga Greeneyes »

Just thought of a crazy way to write read():

Code: Select all

uint8_t read() { return array[(offset %= size)++]; }
Of course this has the caveat that the other functions would have to account for the case when offset == size, and it wouldn't surprise me if this version is slower anyway. Just had to share though :)
Edit: 'cuz I'm bored, here's a version of write() that would work with it:

Code: Select all

void write(uint8_t data) {
  array[offset %= size] = data;
  array[offset++ + size] = data;
}
byuu

Post by byuu »

Modulus (division) is very slow.

We can basically cache the data once to simplify fetching to one compare and multiply, or twice to be a direct index.

It's basically a 50/50 trade-off. Double the work of a flat array to read, double the work to write. Though given the nature of ring buffers (usually read at least as much as written), it may not matter much. Perhaps a triple buffer would be best for simplicity's sake.
Verdauga Greeneyes
Regular
Posts: 347
Joined: Tue Mar 07, 2006 10:32 am
Location: The Netherlands

Post by Verdauga Greeneyes »

byuu wrote:Modulus (division) is very slow.
*nod* Just wanted to add, for powers of two you could use

Code: Select all

offset &= ~size
(which should be fast) if you really wanted to :P
blargg
Regular
Posts: 327
Joined: Thu Jun 30, 2005 1:54 pm
Location: USA
Contact:

Post by blargg »

Yes, for power of two, it's trivial. But BRR the ring buffers are 12 samples and used much more often than the power-of-two echo buffer.

byuu, the read doesn't have to handle negative indicies. The whole point of making the second copy was to make reading require nothing special:

Code: Select all

uint8_t operator[](int index) const {
    assert( 0 <= index && index < size*2 );
    return array[index];
}
The assertion documents the valid range. Given index i which is 0 <= i < size, to get element i-1, you do i+size-1 (this is how BRR decoding gets the previous two samples). Note also that operator [] shouldn't return a reference to non-const, since writes must modify more than one element.

It makes more sense for you to abstract a "modulo" array rather than the ring buffer, since the caller wants to keep track of the position in it. For BRR decoding, this must be done so the caller can get previous samples (or read must accept an offset). I was thinking of something more like operator [] for reading, as above, and a write( index ) which just did array [index] = array [index+size] = data.
byuu

Post by byuu »

byuu, the read doesn't have to handle negative indicies.
The reason I did that was so that you could seek forward or backward based on the current position.

So, instead of n = array[pos-1], you would just say array[-1]. All of the S-DSP code always indexes by v.t_brr_pos, so it seemed like that would work.

But yes, if we made it request the literal address of the value, we need only add size to put us in the pivotal center of the array. From here we can add or subtract up to "size" and get the correct value.

Maybe I can move "size" inside the index, eg return array[index + size], and then access with array or array[i + 1] ...
Note also that operator [] shouldn't return a reference to non-const, since writes must modify more than one element.


Oops, good catch.

It makes more sense for you to abstract a "modulo" array rather than the ring buffer, since the caller wants to keep track of the position in it.


Hmm ... not as versatile, but it probably would be simpler that way.

---

EDIT: okay, I see your point. Here's what I came up with:

Code: Select all

#ifndef NALL_MODULO_HPP
#define NALL_MODULO_HPP

#include <nall/new.hpp>

namespace nall {

template<typename T, int size>
class modulo_array {
public:
  inline T operator[](int index) const {
    return buffer[size + index];
  }

  inline T read(int index) const {
    return buffer[size + index];
  }

  inline void write(unsigned index, const T value) {
    buffer[index] =
    buffer[index + size] =
    buffer[index + size + size] = value;
  }

  modulo_array() {
    buffer = new(zeromemory) T[size * 3];
  }

  ~modulo_array() {
    delete[] buffer;
  }

private:
  T *buffer;
};

} //namespace nall

#endif //ifndef NALL_MODULO_HPP
I added a third write so that you can wrap around once in each direction without a crash. I know it isn't needed (gaussian seeks +0 to +3, while brr_pos &3 always == 0), but it makes the modulo_array class more versatile in general.

Not really sure if I should take a write param of const T value or const T& value, and an operator[] return of T or const T& ... that always confuses me. The former will be faster with integral types, but the latter faster with class objects, right? Perhaps I should use type traits to select the type?

Eg, typedef typename static_if< is_fundamental<T>::value, T, T& >::type type;
blargg
Regular
Posts: 327
Joined: Thu Jun 30, 2005 1:54 pm
Location: USA
Contact:

Post by blargg »

Not really sure if I should take a write param of const T value or const T& value, and an operator[] return of T or const T& ... that always confuses me. The former will be faster with integral types, but the latter faster with class objects, right? Perhaps I should use type traits to select the type?
If it's inline, the const T& can yield the best code on a decently fast compiler. If T=int, then the compiler should generate the same code as passing by value. Otherwise, if you were to pass by value, I don't think the compiler could optimize the copy out if the copy constructor for T were non-trivial or outline.

But, unnecessary generality often makes code worse, not better. Here I think you've gone off the deep end, adding just-in-case features that will never be used. Take the extreme programming approach: code for today, and just make sure you can easily modify for tomorrow, rather than trying to code for tomorrow today.
ZH/Franky

Post by ZH/Franky »

Byuu, what was wrong with the old core you had?
franpa
Gecko snack
Posts: 2374
Joined: Sun Aug 21, 2005 11:06 am
Location: Australia, QLD
Contact:

Post by franpa »

most likely more accurate and more configurable.
Core i7 920 @ 2.66GHZ | ASUS P6T Motherboard | 8GB DDR3 1600 RAM | Gigabyte Geforce 760 4GB | Windows 10 Pro x64
byuu

Post by byuu »

Here I think you've gone off the deep end, adding just-in-case features that will never be used.
Meh, I quite enjoy my libraries. They've come in handy a lot.

Code: Select all

    const int p1 = v.buffer[v.buf_pos - 1];
    const int p2 = v.buffer[v.buf_pos - 2] >> 1;
    ...
    s = sclamp<16>(s);
    s = (int16)(s << 1);
    v.buffer.write(v.buf_pos++, s);
    if(v.buf_pos >= brr_buf_size) v.buf_pos = 0;

Code: Select all

  offset = v.buf_pos + (v.interp_pos >> 12);
  int output;
  output  = (fwd[  0] * v.buffer[offset + 0]) >> 11;
  output += (fwd[256] * v.buffer[offset + 1]) >> 11;
  output += (rev[256] * v.buffer[offset + 2]) >> 11;
  output  = (int16)output;
  output += (rev[  0] * v.buffer[offset + 3]) >> 11;
  return sclamp<16>(output) & ~1;

Code: Select all

int sDSP::calc_fir(int i, bool channel) {
  int s = state.echo_hist[channel][state.echo_hist_pos + i + 1];
  return (s * (int8)REG(fir + i * 0x10)) >> 6;
}
Still working on it, but I like it so far :/
Byuu, what was wrong with the old core you had?
Absolutely nothing. I want to understand the S-DSP better. Just plugging in another's library as a black box and being content with that is a meaningless endeavor to me.

By doing this, I almost understand the new code as well as I did anomie's -- which is to say, not all that well, but good enough for a basic understanding. blargg's code is actually very clear, I'm just trying to make it a tiny bit clearer by replacing faster speed optimizations with slower, easier to read code; the other half is that I've worked with anomie's code for two years, so obviously I am more familiar with that.

Eventually, I'd like to write up documentation, run my own tests, etc.
most likely more accurate and more configurable.
Image
Jipcy
Veteran
Posts: 768
Joined: Thu Feb 03, 2005 8:18 pm
Contact:

Post by Jipcy »

byuu wrote:
most likely more accurate and more configurable.
Image
I agree with him that it would be the most likely reason, even though it wasn't the actual reason.
[url=http://zsnes-docs.sf.net]Official ZSNES Docs[/url] | [url=http://zsnes-docs.sf.net/nsrt]NSRT Guide[/url] | [url=http://endoftransmission.net/phpBB3/viewtopic.php?t=394]Using a Wiimote w/ emulators[/url]
creaothceann
Seen it all
Posts: 2302
Joined: Mon Jan 03, 2005 5:04 pm
Location: Germany
Contact:

Post by creaothceann »

Franky wrote:Byuu, what was wrong with the old core you had?
franpa wrote:most likely more accurate and more configurable.
Strange idea of "wrong"...
vSNES | Delphi 10 BPLs
bsnes launcher with recent files list
franpa
Gecko snack
Posts: 2374
Joined: Sun Aug 21, 2005 11:06 am
Location: Australia, QLD
Contact:

Post by franpa »

yea i wrote 'less' but then changed them to 'more' for some reason.
Core i7 920 @ 2.66GHZ | ASUS P6T Motherboard | 8GB DDR3 1600 RAM | Gigabyte Geforce 760 4GB | Windows 10 Pro x64
grinvader
ZSNES Shake Shake Prinny
Posts: 5632
Joined: Wed Jul 28, 2004 4:15 pm
Location: PAL50, dood !

Post by grinvader »

franpa wrote:for some reason.
INTELLIGENCE


(following the negating caps rule, obviously)
皆黙って俺について来い!!

Code: Select all

<jmr> bsnes has the most accurate wiki page but it takes forever to load (or something)
Pantheon: Gideon Zhi | CaitSith2 | Nach | kode54
Locked