SNES emulation speed

Announce new emulators, discuss which games run best under each emulator, and much much more.


spiller
JSNES Developer
Posts: 43
Joined: Sun Mar 15, 2009 11:09 pm
Location: Ireland

SNES emulation speed

Post by spiller »

Hi everyone! This is my first post!

This past week, fascinated by ZSNES, I've become quietly engrossed in my own attempt at a SNES emulator.

I dived in head-first not knowing the first thing about SNES internals and got very lost, but after several days of backtracking and reading the docs I downloaded, I've actually got an original, half-decent CPU implementation working. (Which is remarkable for somebody as clueless and distractible as me.)

After much optimization effort, I got the core CPU running at about 13.6 million instructions per second. My benchmark for this involves tens of thousands of iterations of the first 500-odd instructions of Donkey Kong Country (first game I ever played, now it's the first game in my emulator! :D) (It can't get beyond that point because the game seems to loop waiting for a result in a register. Since I haven't implemented registers yet, it gets stuck... but anyway... )
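A benchmark like the one described might be sketched as follows. This is a minimal, hypothetical harness; stepOneInstruction stands in for the real CPU core and is not spiller's actual code:

```java
// Hypothetical sketch of an instructions-per-second benchmark: run a fixed
// number of emulated instructions and divide by wall-clock time.
public class IpsBenchmark {
    static int counter = 0;

    // Stand-in for executing one emulated instruction.
    static void stepOneInstruction() { counter++; }

    static double measureIps(int instructions) {
        long start = System.nanoTime();
        for (int i = 0; i < instructions; i++) stepOneInstruction();
        long elapsed = System.nanoTime() - start;
        return instructions / (elapsed / 1e9); // instructions per second
    }

    public static void main(String[] args) {
        System.out.printf("~%.0f instructions/second%n", measureIps(10_000_000));
    }
}
```

In practice you'd want to run the loop several times and discard the first runs so the JIT has warmed up, or the figure will be badly understated.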

At first I was very proud of that speed, especially as it's written in Java. But the gigantic elephant in the source is that my so-called emulator is very raw at the moment, totally lacking any kind of cycle counting, synchronization, PPU, APU, etc. Register reads and writes are silently ignored, and the only interface it has is the command line, which lets me trace what it's doing.

So here's my question that I'm looking for a little advice on: What kind of overhead do those things have? Is it more than what's needed by the core CPU? Do you think 13.6 million is okay for real-time emulation? I didn't find mention of this in any docs, and I don't even know if it's the right order of magnitude. Is it enough? More than enough? Not nearly enough?

By the way, are the theoretical speed limits of the CPUs of ZSNES, Snes9x, or bsnes known? Those would give me some great benchmarks to compare with.
byuu

Post by byuu »

The biggest bottleneck, especially for an opcode-level emulator (that doesn't break instructions down to cycles), is the PPU. Even in bsnes, the super-optimized, scanline-based, caching PPU consumes ~40% processing time.

As for the CPU, the trickiest part is testing for interrupts (IRQ and NMI), though if you don't mind giving up compatibility with a half-dozen games, you can test that once per scanline for very little cost.

After that, synchronization between components is your next biggest enemy, especially when you do so at the cycle level. I'd say just the raw opcode implementations aren't all that important. Mine are very unoptimized (no ZN-flag tricks, no taking advantage of native processor flags, no optimizations / custom opcodes based on 8-bit/16-bit register modes, very very slow multi-level memory read/write subsystem using virtual functions and class pointers), and in total that only consumes about ~10% total execution time.

For me, it's 40% PPU, 30% IRQ testing (I test every single dot tick, ~5.25 million times a second), 10% CPU, 12% DSP (@ 1.024MHz, much faster @ 32KHz), 8% SMP.

Hope that's not too discouraging. I'd love to have another active SNES dev working on core SNES emulation, so I wish you the best and hope to hear more from your emulator in the future :D
spiller

Post by spiller »

Thanks for your reply, byuu!
byuu wrote: Mine are very unoptimized (no ZN-flag tricks, no taking advantage of native processor flags, no optimizations / custom opcodes based on 8-bit/16-bit register modes, very very slow multi-level memory read/write subsystem using virtual functions and class pointers)
You raise lots of interesting points there. ZN-flag tricks? Does that mean that what I thought was an original idea (storing the ZN value and only testing it at a branch or push-status op) has already been thought of? Ah well.

My CPU does take advantage of everything like that if it can, including having separate 8-bit X/M forms and caching memory bank lookups (my memory banks are in separate objects, subclassed to reduce the amount of logic required per byte; e.g., each Bank_Mode21_C0_FF object knows that it just needs to read from ROM, and already knows its own offset into that ROM, rather than having to do any slow testing of the bank and address, etc.). It's thanks to all that sort of stuff that I got the speed raised from an initial 2-ish million IPS.
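A stripped-down sketch of the bank-object idea described above (class and method names here are invented for illustration, not spiller's actual code):

```java
// Each bank object knows how to service reads for its own address range, so
// the hot path is one table index and one virtual call, with no per-read
// address decoding.
public class BankMap {
    interface Bank { int readByte(int addr); }

    // E.g. a bank that reads straight from ROM at a fixed, precomputed offset.
    static final class RomBank implements Bank {
        private final byte[] rom;
        private final int base;
        RomBank(byte[] rom, int base) { this.rom = rom; this.base = base; }
        public int readByte(int addr) { return rom[base + (addr & 0xFFFF)] & 0xFF; }
    }

    public static void main(String[] args) {
        byte[] rom = new byte[0x20000];
        rom[0x10000] = (byte) 0xEA;              // NOP at ROM offset 0x10000
        Bank[] banks = new Bank[256];
        banks[0xC1] = new RomBank(rom, 0x10000); // bank $C1 -> second 64K of ROM
        int value = banks[0xC1].readByte(0xC10000);
        System.out.printf("$C1:0000 -> $%02X%n", value); // prints $C1:0000 -> $EA
    }
}
```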

Despite my best efforts though, I'm slightly limited by Java's datatypes. Its lack of unsigned integer types (even byte arithmetic is signed!) is infuriating, as it means that 16-bit unsigned arithmetic must be done in 32-bit signed ints and then bitmasked. There's bitmasking everywhere! It also can't use unions to access the low/high bytes of words separately. But I don't want to use C/C++ for this; it's already been done so well in other emulators that there seems to be little left to achieve in that area.
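The masking dance being described might look like this minimal sketch (method names are illustrative):

```java
// 16-bit unsigned arithmetic done in a signed int with explicit masking, and
// explicit low/high byte access in place of a union.
public class Mask16 {
    static int add16(int a, int b)       { return (a + b) & 0xFFFF; }
    static int lowByte(int w)            { return w & 0xFF; }
    static int highByte(int w)           { return (w >>> 8) & 0xFF; }
    static int setLowByte(int w, int b)  { return (w & 0xFF00) | (b & 0xFF); }

    public static void main(String[] args) {
        System.out.println(add16(0xFFFF, 1));         // 0 (wraps like a 16-bit register)
        System.out.println(highByte(0x12FE));         // 18 (0x12)
        System.out.println(setLowByte(0x1234, 0xFF)); // 4863 (0x12FF)
    }
}
```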
byuu wrote: [...] for an opcode-level emulator (that doesn't break instructions down to cycles) [...]
That brings up a separate issue that's been worrying me greatly. I first wondered about this when I read the W65C816S datasheet and learned about the ABORTB pin, which can interrupt part-way through an opcode. I hoped there was no SNES hardware that required that. Then, searching for docs the other day, I found the page on your website where you show the difference between a "state machine" and a "cothreaded" CPU implementation. I'm not very sure of either concept, but I'm perturbed by the cycle-level code... If I don't use cothreading, do I have to implement separate code blocks for each cycle of each opcode? That sounds like a veritable nightmare.

I guess what I'm asking is, what's the finest granularity that the CPU must be able to handle? Can I keep my opcodes atomic? Will it compromise compatibility?

I've been putting off writing the rest of the CPU ops because of this quandary.

byuu wrote: For me, it's 40% PPU, 30% IRQ testing (I test every single dot tick, ~5.25 million times a second)
Wow! I never imagined IRQs were going to be such a pain. I shall have to read up on those.
byuu wrote: so I wish you the best and hope to hear more from your emulator in the future :D
Thank you very much! :)
blargg
Regular
Posts: 327
Joined: Thu Jun 30, 2005 1:54 pm
Location: USA
Contact:

Post by blargg »

I've written several CPU emulators in Java and I settled on using int for everything except the arrays representing memory. For those, I use byte, and mask with 0xFF everywhere (and have obscure bugs in those few places I forget to...). There's little reason to use shorter types to restrict the number of bits, even in C and C++, because int is the natural and efficient size to operate on.

The biggest omission I noticed was the lack of goto, which in C++ was great for letting the CPU emulators share common instruction suffixes in the big opcode switch statement. For example, all the instructions which store a byte using absolute,X addressing could goto a common implementation, reducing code size. I did a fairly mechanical conversion from my goto-loving C++ implementations, basically having multiple opcode switch statements, where the first handled the source operand, the second the operation, and the third the destination. That way instructions sharing a particular step could use the same code.
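The multi-switch structure described could be sketched roughly like this (the opcode encodings and operations here are invented for illustration, not real 65xx encodings):

```java
// Split execution into operand-fetch, operate, and write-back stages so that
// instructions share each stage, instead of goto-chaining into common tails.
public class SplitDispatch {
    static int memory = 0x43;   // single stand-in memory cell
    static int a = 0x05;        // accumulator

    static int exec(int opcode) {
        int operand;
        switch (opcode & 0x0F) {        // stage 1: addressing / operand fetch
            case 0x01: operand = memory; break;  // "absolute"
            default:   operand = 0;      break;  // "immediate zero"
        }
        int result;
        switch (opcode & 0xF0) {        // stage 2: operation
            case 0x10: result = a + operand; break; // "ADD"
            case 0x20: result = a ^ operand; break; // "EOR"
            default:   result = operand;     break; // "LOAD"
        }
        return result & 0xFF;           // stage 3 would be the write-back switch
    }

    public static void main(String[] args) {
        System.out.println(exec(0x11)); // 72 (0x05 + 0x43)
        System.out.println(exec(0x21)); // 70 (0x05 ^ 0x43)
    }
}
```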

Heh, and yeah, the "defer flag calculation until needed" has been rediscovered several times. It really helps on some other processors like the Z-80, where there are the half-carry and parity flags. You might like my documentation of it, along with some other emulation ideas.

If you don't get a full SNES emulator working, you could always turn it into a SNSF music file player. I've written a Java-based SPC-700 emulator (SPC player), which would just need your CPU to be a SNSF player...
byuu

Post by byuu »

ZN-flag tricks? Does that mean that what I thought was an original idea, to store the ZN value and only test it at a branch or push status op has already been thought of?
Yeah. Store the value result, calculate the flags only when another opcode uses it.
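A minimal sketch of that deferred-flag idea (names illustrative; a real core would also fold the sign bit in the way blargg shows further down):

```java
// Keep the last ALU result in one field and derive Z and N only when a branch
// or a PHP-style status push actually needs them.
public class LazyZN {
    static int nz; // last 8-bit result, set by ALU ops

    static void lda(int value) { nz = value & 0xFF; } // just store, no flag math

    static boolean zeroFlag()     { return (nz & 0xFF) == 0; }
    static boolean negativeFlag() { return (nz & 0x80) != 0; }

    public static void main(String[] args) {
        lda(0x00);
        System.out.println(zeroFlag() + " " + negativeFlag()); // true false
        lda(0x80);
        System.out.println(zeroFlag() + " " + negativeFlag()); // false true
    }
}
```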
e.g., each Bank_Mode21_C0_FF object knows that it just needs to read from ROM
Mine holds class pointers for every page (every 256 bytes); you need that granularity to map the SuperFX GSU region.

Whenever memory is read, I call:
return memory_table[addr >> 8](addr);

The table holds base classes with virtual functions, and I store derived classes that do things like read from ROM with an offset against the address, or call MMIO register functions, etc.

I figured a single memory function that did your typical if(addr & 0x800000) { ... } etc would be faster, but a lot less flexible. I use my memory mapper to dynamically rewrite the map while the game runs for bank-swapping games like Star Ocean and Far East of Eden Zero, or for the BS-X.
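A Java analog of that page-table approach might look like the following sketch (the Page interface and handler shapes are assumptions for illustration, not byuu's actual code):

```java
// One handler object per 256-byte page: a read is a table index plus one
// virtual call, and the map can be rewritten at runtime for bank-swapping.
public class PageTable {
    interface Page { int read(int addr); }

    static final Page[] map = new Page[1 << 16]; // 2^24 addresses / 256-byte pages

    static int readByte(int addr) { return map[addr >>> 8].read(addr); }

    public static void main(String[] args) {
        byte[] rom = new byte[0x8000];
        rom[0x1234] = 0x42;
        Page romPage = a -> rom[a & 0x7FFF] & 0xFF;
        java.util.Arrays.fill(map, romPage);  // map everything to ROM
        map[0x002100 >>> 8] = a -> 0xFF;      // remap one MMIO page on the fly
        System.out.println(readByte(0x1234)); // 66 (0x42, from ROM)
        System.out.println(readByte(0x2100)); // 255 (from the MMIO handler)
    }
}
```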
But I don't want to use C/++ for this, as it's already been done so well in other emulators there seems to be little left to achieve in that area.
It's up to you, but I'd rather see another C++ one than a Java one :/
Then again I'm not a fan of Java at all, so I won't share my grievances as they'd just be biased to no end ...

To entice you, I wrote two classes that let me make signed or unsigned integers of any bit-length, and all overflow+masking is transparent. They only take a ~10% speed hit over native math, due to masking overhead.

typedef int_t<24> int24_t;
typedef uint_t<48> uint48_t;

uint48_t foo = 0xffffffffffff;
foo++;
foo == 0;

If I didn't care about speed, I'd use them for the 24-bit address bus stuff. It'd be very nice to avoid all the masking and the ambiguity regarding "do I have to ignore the top 8 bits or are they already clear?" in the code.
I hoped there was no SNES hardware that required that.
ABORTB was for operating system control, e.g. on the Apple IIGS.
I know of no SNES hardware that can utilize it. Not even sure if the CPU pin is connected to anything. At any rate, you don't have to worry about it.
If I don't use cothreading, do I have to implement separate code blocks for each cycle of each opcode? That sounds a veritable nightmare.
Well, it's Java so you can't use cothreading even if you want to.

But you shouldn't need this for your CPU. As long as the SMP is enslaved to the CPU, you only need to make a cycle-based state machine out of the much-simpler SMP, and have your clock / memory read-write functions run the SMP when needed (eg when the CPU accesses $2140-217f.)
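The enslaved-SMP scheme could be sketched like this (grossly simplified: the real SMP runs on its own clock ratio and has actual state to execute; all names here are illustrative):

```java
// The CPU just advances a master clock; the SMP is only caught up when
// something can observe it, e.g. a CPU access to the $2140-$217F ports.
public class CatchUpSmp {
    static long cpuClock = 0;   // master time, advanced by the CPU
    static long smpClock = 0;   // how far the SMP has been emulated
    static int  smpSteps = 0;   // count of SMP cycles actually executed

    static void cpuTick(int clocks) { cpuClock += clocks; }

    static void catchUpSmp() {
        while (smpClock < cpuClock) { // run the SMP state machine to "now"
            smpClock += 1;
            smpSteps += 1;
        }
    }

    static int readApuPort(int addr) {
        catchUpSmp();   // only synchronize when the CPU can observe the SMP
        return 0;       // the port value would come from real SMP state
    }

    public static void main(String[] args) {
        cpuTick(100);                 // CPU runs ahead freely...
        readApuPort(0x2140);          // ...and the SMP catches up on access
        System.out.println(smpSteps); // 100
    }
}
```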

It'll get trickier with the SuperFX / SA-1. The latter is another 65816 (and then some), so you'll have to cycle step one or the other, or share the fun problems other emulators do in Super Mario RPG and such.
I guess what I'm asking is, what's the finest granularity that the CPU must be able to handle?
It depends on what compatibility you want. Is 95% okay? Just opcode-based is fine. Want 99%? Going to need cycle based. Want 100%? Don't we all :P You'll need to handle things like bus-hold delays and such (a memory read doesn't happen immediately at the start of a cycle, but ~33% of the way through it.)

Opcode-based, you'll not (easily) be able to emulate some games that do really wild stuff to affect IRQ/NMI/DMA states in the middle of opcodes. Eg what happens if DMA triggers in the middle of your opcode, and that DMA transfers 10 full emulated seconds worth of data?

If your opcode is its own function, and it calls the DMA function right in the middle, you can't return from your CPU to service the SMP / DSP / user-interface for ten full seconds, unless your DMA is considered the master of your entire program and calls a function to update all that stuff. Which is ... a nasty design, to say the least. Even if you hide it well.

And then trying to emulate the DMASYNC + CPUSYNC states of DMA / HDMA with an opcode-based core ... good luck :/

But you don't need any of that stuff to get almost all games working, if that's what you're going for.
Wow! I never imagined IRQs were going to be such a pain.
Oh ... there are so many unbelievable edge cases ... you have no idea :(
spiller

Post by spiller »

Thanks blargg and byuu. I'm awfully new to all these concepts -- no doubt I've still got hundreds of mistakes to make yet.

Blargg, that page of yours is very interesting. I hadn't considered using an opcode switch statement. I've been using classes, you see, like this:

Code: Select all

	private abstract class Op {
		public String mnemonic;
		public int address_mode = AM_IMPLIED;
		public int length = 1;
		
		
		// Ops should override this
		public void exec() {
		}
		
		
		// Rest of these are general op utility functions
		
		// for reading the instruction
		protected final int readOpByte() {
			int tmp = PB.readByte(IR);
			IR = (IR + 1) & 0xFFFF;
			return tmp;
		}
		
		protected final int readOpShort() {
			int tmp = PB.readShort(IR);
			IR = (IR + 2) & 0xFFFF;
			return tmp;
		}
		
		
		// for calculating addresses
		protected final int calcAddressAbsolute() {
			return readOpShort();
		}
		
		protected final int calcAddressDirect() {
			int offset = readOpByte();
			int addr = (D + offset) & 0xFFFF;
			return addr;
		}
		
		
		// for stack manipulation
		protected final void pushByte(int value) {
			ZB.writeByte(S, value);
			S = (S - 1) & 0xFFFF;
		}
		
		protected final void pushShort(int value) {
			ZB.writeShort((S - 1) & 0xFFFF, value);
			S = (S - 2) & 0xFFFF;
		}
		
		protected final int popByte() {
			S = (S + 1) & 0xFFFF;
			return ZB.readByte(S);
		}
		
		protected final int popShort() {
			S = (S + 2) & 0xFFFF;
			return ZB.readShort((S - 1) & 0xFFFF);
		}
		
		
		// for comparisons
		protected final void compareByte(int a, int b) {
			int cmp = a - b;
			setZeroNegativeByte(cmp);
			Carry = (cmp >= 0);
		}
		
		protected final void compareShort(int a, int b) {
			int cmp = a - b;
			setZeroNegativeShort(cmp);
			Carry = (cmp >= 0);
		}
		
	}
Then there are a few hundred (mostly still unimplemented) op classes subclassed like so:

Code: Select all

	// LDX --- Load Index X
	private class Op_A2 extends Op {
		{
			mnemonic = "LDX";
			address_mode = AM_IMMEDIATE;
			length = 3;
		}
		
		public void exec() {
			int tmp = readOpShort();
			X = tmp;
			setZeroNegativeShort(tmp);
		}
	}
	
	private class Op_A2_X extends Op_A2 {
		{
			length = 2;
		}
		
		public void exec() {
			int tmp = readOpByte();
			X = (X & 0xFF00) | tmp;
			setZeroNegativeByte(tmp);
		}
	}
I assumed that was the Java way to do it. It gives a nice syntax (e.g., Op op = decode(IR); op.exec(); ) and, like function pointers, enables separate op tables to be generated based on the states of the M/X bits. Am I crazy to use classes? Who knows. This is the first time I ever downloaded, learnt, or used Java for anything, and I'm rapidly discovering that it shares superficial syntax with C++, but it's a very different beast inside.

I plan to stick with my classes for now and finish the CPU ops. If I later decide I want to use switch statements I'll get a script to go over the code and rewrite/regenerate it.
byuu wrote: I figured a single memory function that did your typical if(addr & 0x800000) { ... } etc would be faster, but a lot less flexible.
That's what I had at first. But at least in Java, changing to having each bank as a separate object yielded an execution speed improvement of at least 20%, probably much more, because I was able to add extra optimizations based on it later. It lets me put all the logic to handle different ROM map modes, etc., into a single loader function, and at runtime just read pointers.
byuu wrote: ABORTB was for operating system control. Eg for the Apple II GS.
I know of no SNES hardware that can utilize it.
Thank goodness.
byuu wrote: Well, it's Java so you can't use cothreading even if you want to.
So I noticed, when I looked at your cothreading library to figure out what you meant by the term. Stack manipulation? That is very clever, and no doubt very efficient compared to a jump to ring 0 for a kernel/CPU-level task switch.
byuu wrote: It depends on what compatibility you want. Is 95% okay? Just opcode-based is fine.
Argh! My first design sacrifice! Such a painful moment, but if I don't make some compromises I haven't even the foggiest hope of managing to continue this. I get distracted by new projects every ~two weeks. I have to hurry to get something on the screen so that I don't get disheartened with this and move on, leaving me with yet another abandoned project. So, opcode-based it is.

Thanks for all your advice. Does anyone know the answer to my original quandary, about how many instructions the SNES CPU can execute per second?
creaothceann
Seen it all
Posts: 2302
Joined: Mon Jan 03, 2005 5:04 pm
Location: Germany
Contact:

Post by creaothceann »

spiller wrote:Does anyone know the answer to my original quandary, about how many instructions the SNES CPU can execute per second?
Well, depends on what instructions, and what memory is accessed... see the memory speeds here.
vSNES | Delphi 10 BPLs
bsnes launcher with recent files list
byuu

Post by byuu »

Does anyone know the answer to my original quandary, about how many instructions the SNES CPU can execute per second?
Hard to say exactly. The core clock runs at 21.477MHz.

Each cycle can take 6, 8 or 12 clocks, let's assume 8 on average (12 is very rare.)

Each opcode takes 2-6 cycles, so let's say 4 on average.

21,477,272/32=~671,164 opcodes/second.
grinvader
ZSNES Shake Shake Prinny
Posts: 5632
Joined: Wed Jul 28, 2004 4:15 pm
Location: PAL50, dood !

Post by grinvader »

byuu wrote:Which is ... a nasty design, to say the least.
You're one to talk... ^^
皆黙って俺について来い!!

Code: Select all

<jmr> bsnes has the most accurate wiki page but it takes forever to load (or something)
Pantheon: Gideon Zhi | CaitSith2 | Nach | kode54
byuu

Post by byuu »

grinvader wrote:
byuu wrote:Which is ... a nasty design, to say the least.
You're one to talk... ^^
With the exception of half the ZSNES team, I've gotten nothing but praise for my code design (I admit it's far from perfect). We can certainly disagree on design approaches, fine. But I'm sure we can both agree that an S-CPU DMA run function shouldn't be calling a function to service user interface menubar messages.
whicker
Trooper
Posts: 479
Joined: Sat Nov 27, 2004 4:33 am

Post by whicker »

byuu (or whoever), what even triggers the IRQ?
grinvader

Post by grinvader »

You can set up IRQs for any PPU H/V position: any V, any H, or both at the same time.
NMI fires just after vblank starts if you enable it.

Auto joypad read kicks in about then too (if enabled).
spiller

Post by spiller »

Thanks, creaothceann. RomHacking.net blocks direct links, but I found the file.

byuu thanks for the info. I hadn't realised there was a difference between a clock and a cycle. I shall have to reinterpret a load of docs again with that new information. :-/

I feel pretty amateur at all this, but my CPU is still coming along. I've finally learnt the meanings of, and implemented functions for, all of the absurdly complicated addressing modes, so *yeay*. Anomie's wrapping doc was a Godsend :)
grinvader

Post by grinvader »

But but but, direct indirect long indexed is NOT absurd !1!! ^^
h4tred

Post by h4tred »

But I'm sure we can both agree that an S-CPU DMA run function shouldn't be calling a function to service user interface menubar messages.
I think you mean the other way around..... :shock:

Srsly, if you have a better way to work around limitations in OpenGL with multithreading in pure Win32, I would love to hear it. (had zero issues with using Direct3D or DirectDraw)
Exophase
Hazed
Posts: 77
Joined: Mon Apr 28, 2008 10:54 pm

Post by Exophase »

Look out for the deferred flag calculation trick - storing a result is usually enough for Z and N flag emulation but it can break if what you're emulating sets these flags manually because on the CPU it's possible to have both Z and N set at the same time. On the other hand, it's impossible to have a numeric value that's both negative and zero.

Deferring Z/N calculation is actually only a relatively small win (one instruction instead of two-ish), so people will instead go for doing it for carry, overflow, and others. This can be more trouble than it's worth, because now you have to store not just the result but the type of operation and the operands. Carry is fast to calculate, but overflow can take a bit more - I've been able to implement all four flags in 6 instructions on MIPS, but your mileage will vary depending on what you're emulating and what you're emulating it on.

I generally prefer using pointer tables rather than function tables for memory emulation, although you can combine the two (but probably not very well in Java). It's usually faster to load a pointer, test if it's valid, branch off to another handler if it isn't (perhaps derived from the pointer if you're using the combined approach), then load/store relative to the pointer. This is more true in languages with real pointers (C/C++), since you can bias them by the physical address base of the emulated region (and don't have to AND it off). This is compared to function pointers, which incur an indirect branch and a function context switch - the former can especially be a big hit on a platform that isn't branch predicting it and has a deep pipeline. In Java you can use null pointer exceptions to test if the memory access is not "ordinary."

However, the above only holds if the memory accesses "hit" actual arrays enough of the time, which is only true if the machine doesn't do a lot of I/O. I don't know how fast NULL pointer exception catching is in Java, but if it hooks an OS exception handler then it isn't going to be very fast at all (so I'd explicitly test unless the platform does very little I/O).
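In Java terms, the pointer-table idea with an explicit null test (rather than exception catching) might be sketched as follows (names illustrative):

```java
// Store a byte[] page per table entry; null means "not ordinary memory" and
// falls through to a slow handler, while plain RAM/ROM is one array load.
public class PointerTable {
    static final byte[][] readMap = new byte[1 << 16][]; // one entry per 256-byte page

    static int slowRead(int addr) { return 0xFF; }        // stand-in for I/O handling

    static int readByte(int addr) {
        byte[] page = readMap[addr >>> 8];
        if (page == null) return slowRead(addr);          // I/O or unmapped: slow path
        return page[addr & 0xFF] & 0xFF;                  // ordinary memory: fast path
    }

    public static void main(String[] args) {
        byte[] ramPage = new byte[256];
        ramPage[0x34] = 0x7B;
        readMap[0x0012] = ramPage;              // map one page of "RAM"
        System.out.println(readByte(0x001234)); // 123 (hits the array directly)
        System.out.println(readByte(0x002100)); // 255 (falls to the slow path)
    }
}
```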

Some tips for making 2D scanline based tile/sprite renderers fast (the vets probably already know all these):

- Cache which sprites are on which scanlines and only update this when the sprite state changes (which will usually only happen during vsync). This way you don't have to check this every line. This also makes it faster/easier to limit how many sprites are drawn per line.
- Avoid using intermediate buffers more than one scanline large or you'll unnecessarily thrash cache. Also minimize the number of buffers/reuse them if possible.
- Where possible, avoid the need for buffers at all by incrementally writing to the screen.
- For platforms with sprite ordering that's independent of layer ordering, you can avoid having to do Z tests by sorting the sprites by layer and drawing them in order - however this will not emulate the "masking" aspect that old consoles tend to have. Since games rarely rely on it, you can detect it with a good degree of accuracy and special case it when it comes up.
- If you need more resources to emulate certain effects, try to see if they're at all possible for each scanline and make different renderers depending on if they are or not.
- If there are no sprites behind a given layer (you can check this per scanline, and the sprite binning I mentioned helps for this too) then you can draw a "base" layer without having to do transparency checking and instead drawing the background color straight. This is opposed to filling the background color then drawing on top of it conditionally.
- If the platform has a bunch of transparency colors then you can propagate the palette with the background each time it's changed, instead of having to check the sub-palette index for transparency. This goes in conjunction with the top suggestion (more worthwhile if there's only one or two layers).
- For paletted graphics (that's most of them), convert the color to the native space when it's written to the palette, not after you're done rendering it. So you should have a converted palette that you use to draw your graphics. You can do whatever other conversions here as well, just bear in mind that if the emulated platform has effects that use the raw value of the color then you'll either not be able to do this or you'll have to emulate those on the converted format (which can be less accurate, if at all viable). But, you can still choose to use the converted palette or not.
- Similarly, cache tiles to a native format that's easier to parse - usually 4bpp packed instead of planar (only applies if the original format isn't easy to parse). You might want to pad it out to a bigger size to make de-interleaving easier as well, although this hurts cache performance.
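The first tip, sprite-to-scanline binning, might be sketched like this (geometry is simplified to a y position and a fixed 8-pixel height; all names are illustrative):

```java
// Bin sprites by scanline once when OAM changes, so per-line rendering only
// walks the sprites that actually intersect that line. This also makes it
// easy to enforce a per-line sprite limit.
import java.util.ArrayList;
import java.util.List;

public class SpriteBins {
    static final int LINES = 224, SPRITE_H = 8;

    // spriteY[i] = y position of sprite i
    static List<List<Integer>> binSprites(int[] spriteY) {
        List<List<Integer>> bins = new ArrayList<>();
        for (int l = 0; l < LINES; l++) bins.add(new ArrayList<>());
        for (int i = 0; i < spriteY.length; i++)
            for (int l = spriteY[i]; l < spriteY[i] + SPRITE_H && l < LINES; l++)
                if (l >= 0) bins.get(l).add(i);
        return bins;
    }

    public static void main(String[] args) {
        List<List<Integer>> bins = binSprites(new int[] { 0, 4, 100 });
        System.out.println(bins.get(5));   // [0, 1] (sprites 0 and 1 cover line 5)
        System.out.println(bins.get(103)); // [2]    (only sprite 2 covers line 103)
    }
}
```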
blargg

Post by blargg »

Overflow isn't worth it, since the instructions which modify it are relatively rare. Here's how my Java-based 6502 emulator handles them. It properly handles any of the four combinations of N and Z by having two bits for N, which are ORed together to find the real N; the second copy is 8 bits shifted from the real one. I suppose carry would be a bit more involved on the 65816, due to varying register widths.

Code: Select all

// SEC
c = ~0;

// CMP
nz = a - operand;
c  = ~nz;
nz &= 0xFF;

// ADC
int t = operand ^ a;
nz = operand + a + (c >> 8 & 1);
c  = nz;
a  = nz & 0xFF;
p &= ~0x40;
p |= ((t ^ nz) + 0x80) >> 2 & 0x40;

// INY
nz = y + 1;
y  = nz & 0xFF;

// BIT
p = (p & ~0x40) | (operand & 0x40);
nz = operand;
if ( (a & operand) == 0 )
    nz <<= 8; // result must be zero, even if N bit is set

// BNE
if ( (nz & 0xFF) != 0 )
    branch_taken();

// BMI
if ( (nz & 0x8080) != 0 )
    branch_taken();

// BCS
if ( (c & 0x100) != 0 )
    branch_taken();

// PHP
int t = p | 0x30 | (c >> 8 & 1) | (((nz >> 8) | nz) & 0x80);
if ( (nz & 0xFF) == 0 )
    t |= 0x02;
push( t );

// PLP
int t = pop();
p  = t & 0x4C;
c  = t << 8;
nz = c | (~t & 0x02);
byuu

Post by byuu »

I think you mean the other way around.....
No, if the S-CPU DMA run function needs to emulate ten full frames because it can't return (eg it's not a state machine or its own thread), then it will have to call a function that will update literally everything else and return back to the DMA proc.

In effect, the DMA run proc will become the master of your entire emulator; because nothing else can similarly call the DMA proc if it runs too long without a potential stack overflow.
- Similarly, cache tiles to a native format that's easier to parse - usually 4bpp packed instead of planar (only applies if the original format isn't easy to parse). You might want to pad it out to a bigger size to make de-interleaving easier as well, although this hurts cache performance.
Love that one. So easy to implement and use.

But oddly enough, I never saw much of a speedup. Maybe ~2%, tops. And that's with a PPU that consumes ~45% of emulation time.

Also kind of tricky. Both ZSNES and bsnes ran into problems upon power cycle with not updating the dirty states of these cached tiles ;)
My personal experience is that each speed-up always adds a degree of subtle complexity that can come back to haunt you. Which just gives me more respect for the people who have the patience and skill to pull them off successfully.
Exophase

Post by Exophase »

byuu, I've looked at your render code a little - would you mind a few optimization suggestions? Of course, it'd probably help to know which functions are taking the most time so I don't completely miss the mark, but my money's on bPPU::render_line_bg.

blargg; That's a good approach on the NZ for platforms where you have more bits than what you're emulating. In my 6820 emulator I use NZ sort of similarly, except it uses bit 31 for N and bottom 8 bits for Z - this is as fast to set on ARM with its folded barrel shifter (the interpreter is for ARM). If you can pull that off then the N checks should be a little faster, but I doubt anything will translate to that from Java code, regardless of JVM arch.

By the way, I just wanted to mention how cool it is that you've done emulators that are both really accurate and really efficient, relatively speaking. And for information on GB noise that I used a while ago in GBA emulation ;D
byuu

Post by byuu »

Yeah, ~40% CPU time is spent in render_line_bg. I optimized that routine as much as I could, eg:

Code: Select all

if((opt_x >> 3) != (prev_optx >> 3)) {

Code: Select all

if((hoffset >> 3) != prev_x || (voffset >> 3) != prev_y) {
Tries to cache tile decoding as much as possible.

Code: Select all

      if(bg_td_state[tile_num] == 1) {
        render_bg_tile(color_depth, tile_num);
      }
Caches planar->linear decode.

Code: Select all

void bPPU::update_bg_info()
Tries to cache all the complicated processing so it's only done once per scanline, rather than once per BG layer.

One thing that's important to me is keeping the BG renderer in a single function. I don't like the way other emulators have it split up into a bunch of nested macros to try and 'build' the function. Too hard to read, too hard to debug.

---

Hmm, this helps from 118.5 -> 119.5fps:

Code: Select all

  const bool use_direct_color = (regs.direct_color == true && bg == BG1 && (regs.bg_mode == 3 || regs.bg_mode == 4));
      ...
      if(use_direct_color) {
        col = get_direct_color(pal_num, col);
      } else {
        col = get_palette(col + pal_index);
      }
These don't help at all:

Code: Select all

//kill xpos, ypos
    if(t & 0x8000)voffset ^= 7; //invert y tile pos
    tile_ptr = bg_td + (tile_num * 64) + ((voffset & 7) * 8);
    if(t & 0x4000)hoffset ^= 7; //invert x tile pos
    col = *(tile_ptr + (hoffset & 7));

Code: Select all

//kill mirror_x, mirror_y
    //blargg would be proud ;)
    tile_num += (((hoffset ^ (t >> 11)) & 8 & (tile_width  << 1)) >> 3)
             +  (((voffset ^ (t >> 12)) & 8 & (tile_height << 1)) << 1);
    //was:
      mirror_y = !!(t & 0x8000);
      mirror_x = !!(t & 0x4000);
      if(tile_width  == 4) { //16x16 horizontal tile mirroring
        if(!!(hoffset & 8) != mirror_x)tile_num++;
      }

      if(tile_height == 4) { //16x16 vertical tile mirroring
        if(!!(voffset & 8) != mirror_y)tile_num += 16;
      }
Exophase

Post by Exophase »

Okay, this is at least some of the things I'm seeing... I don't really know how much of this the compiler can be dealing with already, but definitely not all of it:

- You have a lot of conditionals inside inner loops over values that don't change for the duration of that loop. The optimization to improve this is known as "loop hoisting"; the drawback is that the number of loop variants grows exponentially with the number of booleans. However, since the same paths will be taken over and over again, it's not a big hit on the icache and it's well worth the bloat (within reason).

Unfortunately this tends to lead to the "constructed functions" you don't seem to like, but if you use templates and parameterize the conditionals out instead you can keep it to a single body, with a switched dispatch that handles all of the combinations. It wouldn't really read much worse this way, at least in my opinion.

This would probably be the biggest win, since it'd remove a ton of conditionals from your inner loops.
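A minimal sketch of the template approach described above, with invented names (render_line, the hires/mosaic flags) standing in for whatever per-scanline state an actual renderer would hoist. The per-pixel branches compile away in each instantiation, and one switch per scanline picks the variant:

```cpp
#include <cstdint>
#include <cstddef>

// Hypothetical per-scanline flags hoisted into template parameters.
// Inside each instantiation the dead branch is eliminated entirely.
template<bool hires, bool mosaic>
static void render_line(uint8_t* out, const uint8_t* src, size_t n) {
    for (size_t x = 0; x < n; x++) {
        uint8_t col = src[x];
        if (mosaic) col = src[x & ~(size_t)7];  // compiled out when mosaic == false
        if (hires)  col >>= 1;                  // compiled out when hires == false
        out[x] = col;
    }
}

// One switched dispatch per scanline instead of two branches per pixel.
static void render_line_dispatch(bool hires, bool mosaic,
                                 uint8_t* out, const uint8_t* src, size_t n) {
    switch ((hires << 1) | (int)mosaic) {
        case 0: render_line<false, false>(out, src, n); break;
        case 1: render_line<false, true >(out, src, n); break;
        case 2: render_line<true,  false>(out, src, n); break;
        case 3: render_line<true,  true >(out, src, n); break;
    }
}
```

Each added boolean doubles the number of cases in the switch, which is the dispatcher growth discussed later in the thread.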

- Instead of checking every pixel against both windows, you should break the line up into spans where the windows are in the same state and render each span separately.

- Since mosaic is usually only used sporadically, you can special-case it (another if in the sequence, parameterized out by templates).

- This applies even more to offset-per-tile mode, which as I understand it is very rarely used in SNES games. With OPT not in use, nothing can change voffset, so you can put if(is_opt_mode) around that check (if going with the template tree, it will again optimize out for the cases where it's not used), around the voffset check, and probably some other things.

- You should be able to render 8 pixels in a row without checking if 8 have passed like you are doing. Instead you can have a loop of 8, which will reduce the number of conditional checks on the inner-most loop.

- You can apply the same optimizations that I listed in the first point to the above inner loop, such that you have a different version for each of the two horizontal flip states. Then you have 8 pointer dereferences from fixed offsets for the pixel loads, which will unroll nicely (the compiler should do this much).
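As a sketch of that fixed-stride inner loop, here the horizontal flip state is a template parameter (draw_tile_row is an invented name), so each instantiation is eight loads at constant offsets that a compiler can fully unroll:

```cpp
#include <cstdint>

// Hypothetical 8-pixel tile-row copy. With hflip as a template parameter,
// the ternary is resolved at compile time and the loop unrolls to eight
// fixed-offset dereferences.
template<bool hflip>
static void draw_tile_row(uint8_t* out, const uint8_t* tile_row) {
    for (int i = 0; i < 8; i++)
        out[i] = tile_row[hflip ? 7 - i : i];
}
```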

- Instead of using a palette offset you can get a pointer straight into the palette, saving a bit on lookup since you don't have to do *(pointer_base + pointer_offset + color) for each palette lookup.

- For modes where both the subscreen and mainscreen are being drawn to you can interleave the resultant colors in the scanline buffer. This will result in better cache locality of reference and fewer pointers/offsets that need to be kept in registers.

- You can simplify the z-combine strategy a lot. Instead of doing a z-check for every BG pixel, you can render the BGs in z-order to one buffer, then render the sprites in z-order to another buffer, then z-combine them. To get around per-pixel priority you need to render the BGs in two passes, handling low priority in one and high priority in the other. This adds some overhead, but you can minimize it by caching the priority state to a bitmask, which you can then quickly test per scanline to see if it's all or nothing (I imagine that games will often stick to one priority for at least some layers, for a lot of the screen).
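A rough sketch of that final combine pass, assuming both layers were already rendered in z-order into their own line buffers. The Pixel layout, priority encoding (0 = transparent), and the sprite-wins-on-tie rule are all invented for illustration:

```cpp
#include <cstdint>
#include <cstddef>

struct Pixel { uint8_t color; uint8_t priority; };  // priority 0 = transparent

// Combine one scanline: the sprite pixel wins when it is opaque and its
// priority is at least the BG pixel's; otherwise fall back to BG, then backdrop.
static void z_combine(const Pixel* bg, const Pixel* spr,
                      uint8_t* out, size_t n, uint8_t backdrop) {
    for (size_t x = 0; x < n; x++) {
        if (spr[x].priority != 0 && spr[x].priority >= bg[x].priority)
            out[x] = spr[x].color;
        else if (bg[x].priority != 0)
            out[x] = bg[x].color;
        else
            out[x] = backdrop;  // neither layer drew here
    }
}
```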

- You can pack your pixel cache into a much denser format, which will reduce the number of writes you need to make for each pixel; since most of the values don't change per scanline, the packed word can hopefully stay cached in a single register.
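One possible packing, purely illustrative: color, palette, and priority share a single uint16_t so each pixel costs one store. The field widths here are assumptions, not the S-PPU's actual layout:

```cpp
#include <cstdint>

// Hypothetical packed pixel-cache entry: 8-bit color, 4-bit palette,
// 4-bit priority in one 16-bit word.
static inline uint16_t pack_pixel(uint8_t color, uint8_t pal, uint8_t pri) {
    return (uint16_t)(color | (pal << 8) | (pri << 12));
}
static inline uint8_t pixel_color(uint16_t p) { return p & 0xff; }
static inline uint8_t pixel_pal(uint16_t p)   { return (p >> 8) & 0x0f; }
static inline uint8_t pixel_pri(uint16_t p)   { return (p >> 12) & 0x0f; }
```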

- I would investigate exactly what your compiler is outputting for the !! values, to see if it's as good as a shift followed by an AND with 1. It might be costing another conditional instead. This might be a sticking point on readability for you, but personally I find the !! rather confusing and would replace it with a macro anyway (ie, get_bit or some such). Shift + AND might also be a little better on some platforms where the immediate size is crippled (and gcc is awful at dealing with caching immediates).
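The two idioms side by side, for reference; GET_BIT is the kind of macro suggested here (the name is made up), and both forms yield exactly 0 or 1:

```cpp
#include <cstdint>

// Shift-and-mask replacement for the !! trick; both return 0 or 1.
#define GET_BIT(v, n) (((v) >> (n)) & 1)

static unsigned mirror_y_bang(uint16_t t)  { return !!(t & 0x8000); }
static unsigned mirror_y_shift(uint16_t t) { return GET_BIT(t, 15); }
```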

I'll let you know if I think of any more; please tell me if any of this sounds useful to you or if you need clarification. I understand that you probably won't want to do anything that results in a substantial amount of extra code due to readability issues.
byuu

Post by byuu »

if you use templates and parameterize the conditionals out instead you can keep it to a single body, with a switched dispatch that handles all of the combinations
Would really be nice if C++ blurred the static / run-time system a bit more, so that you could pass boolean variables as template parameters, rather than requiring a dispatcher that doubles in size (2^n combinations) for each added parameter.

It almost seems like a good idea to use code generation for something like this.
This would probably be the biggest win, since it'd remove a ton of conditionals from your inner loops.
I honestly think you're over-estimating how big of a win it will be. I killed off three or four conditionals with the above optimizations and by outright disabling rarely used portions like is_opt_mode, and didn't get more than a ~1% speedup.
Instead of checking every pixel against both windows you should break up the line into areas where the windows are in the same state and render it piecemeal for this.
This is a lot easier said than done. There are two windows and they cross over each other. There are also four mask logic modes for how the cross-over combines (OR, AND, XOR, and XNOR), as well as some other thing I'm forgetting. Doable, but I don't want to be the one to do it.
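For concreteness, a sketch of the per-pixel window resolution being discussed, assuming the four combine modes are OR, AND, XOR, and XNOR; the struct layout and function names are invented, and a span-based renderer would evaluate this only at window edges rather than per pixel:

```cpp
#include <cstdint>

// Hypothetical model of the two-window test. A window contributes "inside"
// when x lies in [left, right]; an enabled window may also be inverted.
enum MaskLogic { MASK_OR, MASK_AND, MASK_XOR, MASK_XNOR };

struct Window { int left, right; bool enabled, inverted; };

static bool window_test(const Window& w, int x) {
    bool inside = (x >= w.left && x <= w.right);
    return w.inverted ? !inside : inside;
}

static bool combined_mask(const Window& w1, const Window& w2,
                          MaskLogic logic, int x) {
    if (w1.enabled && !w2.enabled) return window_test(w1, x);
    if (!w1.enabled && w2.enabled) return window_test(w2, x);
    if (!w1.enabled && !w2.enabled) return false;
    bool a = window_test(w1, x), b = window_test(w2, x);
    switch (logic) {
        case MASK_OR:  return a | b;
        case MASK_AND: return a & b;
        case MASK_XOR: return a ^ b;
        default:       return !(a ^ b);  // XNOR
    }
}
```

Since the result can only change where a window edge falls, a line splits into at most a handful of constant-state spans.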
Since mosaic is usually only used sporadically you can special case it (another if in the sequence parameterized out by templates)
Yeah, the memory lookup is probably worse than a conditional.
You should be able to render 8 pixels in a row without checking if 8 have passed like you are doing. Instead you can have a loop of 8, which will reduce the number of conditional checks on the inner-most loop.
Again, this is much easier said than done. Scrolling throws it off, offset-per-tile mode throws it off, and mosaic mode really throws it off. And that's just off the top of my head.
- You can simplify the z-combine strategy a lot. Instead of doing a z-check for every BG pixel, you can render the BGs in z-order to one buffer, then render the sprites in z-order to another buffer, then z-combine them.
I initially went with z-combine on my first implementation, and found it to be slower, as well as more difficult, for a number of reasons I've forgotten in the four and a half years since I wrote said code.

There's probably much better z-combine algorithms than I used, of course.
I would investigate exactly what your compiler is outputting for the !! values
That was a really old habit. I use implicit boolean casting now, eg:
bool mirror_x = (tilemap & 0x8000);

If my destination isn't boolean:
unsigned value = (bool)(tilemap & 0x8000) + foobar;
I'll let you know if I think of any more; please tell me if any of this sounds useful to you or if you need clarification. I understand that you probably won't want to do anything that results in a substantial amount of extra code due to readability issues.
Honestly, I don't really want to mess with the old renderer too much. I want to work on a cycle-level renderer for the sake of preservation and understanding of the S-PPU at a lower level of detail.

What I'd really like is a team member who could write a newer PPU that uses all of the optimizations you mention above and then some. Hell, I don't care if it's all done in SSSE3 assembler.

And it still wouldn't be a major win: the S-PPU is ~40% of emulated time, so even doubling its speed (probably the best you can do without breaking any edge-case games) would only speed up the emulator by ~20%, tops.

It's not that I don't want that speedup, it's just that with the limited time I have, it seems like a bad investment to work on that. At least for now.
Exophase
Hazed
Posts: 77
Joined: Mon Apr 28, 2008 10:54 pm

Post by Exophase »

byuu wrote:
if you use templates and parameterize the conditionals out instead you can keep it to a single body, with a switched dispatch that handles all of the combinations
Would really be nice if C++ blurred the static / run-time system a bit more, so that you could pass boolean variables as template parameters, rather than requiring a dispatcher that doubles in size (2^n combinations) for each added parameter.

It almost seems like a good idea to use code generation for something like this.
Maybe a preprocessor. For this level of expansion the actual amount of code generated would still be manageable, and who knows which paths you'd eventually need.
byuu wrote:I honestly think you're over-estimating how big of a win it will be. I killed off three or four with above optimizations and outright disabling rarely used portions like is_opt_mode, and didn't get more than a ~1% speedup.
I don't really see how you were killing them, particularly the innermost ones. The is_opt_mode body itself is not a big culprit; it's the later work it would let you remove (the things you mention further down).
byuu wrote:This is a lot easier said than done. There are two windows and they cross over each other. There's also four mixing modes for how the cross-over works (OR, AND, XOR, and XNOR), as well as some other thing I'm forgetting. Doable, but I don't want to be the one to do it.
I do it this way with GBA rendering and that includes object windows too, and can turn off color effects per window and other complications. It doesn't have programmable rules for determining overlap (uses priority), but this shouldn't complicate things with regards to my suggestion because the value of the span still resolves to a boolean externally.

For a lot of these suggestions I am speaking from experience, if that means anything. GBA doesn't have all the features and eccentricities SNES does (probably Nintendo knew better by then, since almost no one used a lot of them on SNES) but it has a lot of the basic structure and some new/different features.
byuu wrote:Yeah, the memory lookup is probably worse than a conditional.
They're probably not that different, but this was again in reference to the template trees, where a conditional in this sense is punted outside the loop.
byuu wrote:Again, this is much easier said than done. Scrolling throws it off, offset-per-tile mode throws it off, and mosaic mode really throws it off. And that's just off the top of my head.
That could be a case where OPT mode rarely being used in practice makes this a win, but I don't see how it or scrolling throws it off. I do it this way in GBA and PC-Engine emulation, and scrolling certainly isn't an issue if you handle the edges separately or if you have an intermediate buffer which you can run over (you have such an intermediate buffer; it just needs 16 pixels of padding on both edges).

For mosaic I would say that you're better off pushing it into an intermediate buffer and post-processing it down. Another approach is to have a counter which holds the previous output and reloads only when the counter underflows. Mosaic, being applied in the output domain, shouldn't be affected by the reloads; of course this counter would have to span the per-tile loops and would incur a check per pixel.
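A sketch of the post-processing variant: render the scanline normally, then smear each block of `size` pixels with its first pixel. The function name and in-place operation are assumptions:

```cpp
#include <cstdint>
#include <cstddef>

// Hypothetical output-domain mosaic pass: every block of `size` pixels
// takes the value of its first pixel. size <= 1 means mosaic is off.
static void apply_mosaic(uint8_t* line, size_t n, unsigned size) {
    if (size <= 1) return;
    for (size_t x = 0; x < n; x += size) {
        uint8_t first = line[x];
        for (size_t i = 1; i < size && x + i < n; i++)
            line[x + i] = first;
    }
}
```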
byuu wrote:I initially went with z-combine on my first implementation, and found it to be slower, as well as more difficult for a number of reasons I've forgotten since the four and a half years it's been since I wrote said code.
Eh? You're using z-combine now, at least that's how I would describe it. I don't think there's much way it could end up being much slower with what I'm envisioning, when you're reducing the aggregate work per pixel tremendously (not to mention a branch that might not be all too predictable). Splitting the per-tile priority into two passes might seem counter-intuitive, but I think that notaz can vouch for its effectiveness since he does this in Picodrive (the platforms I emulate don't have per-tile priority)

Even keeping everything the same, just packing the values in the pixel buffer more aggressively could make a difference since you are doing a lot of writes that aren't changing.
byuu wrote:Honestly, I don't really want to mess with the old renderer too much. I want to work on a cycle-level renderer for the sake of preservation and understanding of the S-PPU at a lower level of detail.

What I'd really like is a team member who could write a newer PPU that uses all of the optimizations you mention above and then some. Hell, I don't care if it's all done in SSSE3 assembler.

And it still wouldn't be a major win, the S-PPU is ~40% emulated time, so even doubling the speed (probably the best you can do without breaking any edge-case games) would only speed up the emulator by ~20%, tops.

It's not that I don't want that speedup, it's just that with the limited time I have, it seems like a bad investment to work on that. At least for now.
The thing with those edge cases is that a lot of them can be detected with pretty high granularity, so that they only stress things to the necessary point. I also think that the "it can only be at best twice as fast" remark doesn't sound like much more than a guess. It's not impossible that it could be made even faster than that, even for the worst cases. If you want to start looking at where the theoretical boundaries might lie, maybe we should start looking at what's being done per pixel and what strictly needs to be done. Now granted, with other things in your emulation taking up huge mandatory amounts of CPU time (which can't be skipped like frame rendering can be), I can see why one might not bother. But you've said yourself that you could probably reduce the IRQ checking further; if that 30% were cut down significantly as well then you'd start to see major wins (maybe more realistically targeting current or near-future netbooks).

But I understand why you don't want to mess with it when it works, and I understand why you want to keep things neat and simple. The real reason why I say all of these things is more to let others (ie, spiller) know that there are lots of things that can be done. Although I do agree that for typical SNES emulators PPU emulation will be the biggest strain, I was actually rather surprised to hear it was the case with yours because I don't think that it needs to be the case.
blargg
Regular
Posts: 327
Joined: Thu Jun 30, 2005 1:54 pm
Location: USA
Contact:

Post by blargg »

byuu wrote:Would really be nice if C++ blurred the static / run-time system a bit more, so that you could pass boolean variables as template parameters, rather than requiring a dispatcher that doubles in size (n^2) for each added parameter.
You could generate the function pointer table below automatically, if desired. Either way, the compiler instantiates the templates.

Code: Select all

template<bool x,bool y>
void f() { ... }

typedef void (*fp)();

fp funcs [2] [2] = {
    f<false,false>,
    f<false,true >,
    f<true ,false>,
    f<true ,true >
};

void call_f( bool x, bool y )
{
    funcs [x] [y]();
}
Last edited by blargg on Mon Mar 30, 2009 2:58 am, edited 1 time in total.
Exophase
Hazed
Posts: 77
Joined: Mon Apr 28, 2008 10:54 pm

Post by Exophase »

That's not generating it automatically, you're still explicitly instantiating all the combinations in the table declaration.

However, with some additional setup you might be able to nest the instantiations at compile time and make the growth linear instead of exponential. If you can't do it cleanly with templates then you can probably do it with macros over templates.
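One way to read that suggestion, as a sketch: peel one runtime boolean into a template parameter per dispatch layer, so the hand-written dispatch code grows by a constant amount per added flag instead of doubling the flat table (the compiler still instantiates every combination of f). All names here are invented:

```cpp
// Stand-in for the real templated render function.
template<bool X, bool Y>
static int f() { return (X ? 2 : 0) + (Y ? 1 : 0); }

// Each layer converts one runtime bool into a template parameter.
template<bool X>
static int dispatch_y(bool y) { return y ? f<X, true>() : f<X, false>(); }

static int dispatch(bool x, bool y) {
    return x ? dispatch_y<true>(y) : dispatch_y<false>(y);
}
```

Adding a third flag means one more two-line dispatch layer rather than rewriting a table twice the size.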