Multithreaded emulator state saving

byuu · Post by **byuu** » Wed Jan 30, 2008 8:37 pm

EDIT: seems I've misquoted henke37.

This post was requesting henke37 to demonstrate a smaller example case of his idea.

mozz · Post by **mozz** » Wed Jan 30, 2008 9:12 pm

Jipcy wrote:Would it be possible, then, that there would be certain conditions that would cause an infinite loop? As in, each time you advanced A or B or C to a safe point, it affected another processor, going round in a round in a never-ending circle?

I don't think it can. Or rather, it doesn't matter because you only save each processor once, and thereafter you log any reads or writes it does and just append the logs to the end of the savestate.

I'll try and summarize why implementing savestates in bsnes is hard:

In a cothreaded emulator, the CPU runs in its own cothread, as does the APU, the PPU, etc. Each cothread can suspend at some spot (e.g. an instruction boundary) just because it wants to, but it might also be forced to suspend at some spot (e.g. memory read or write in some instruction) because it needs to communicate with a different processor. E.g. if the CPU tries to read a value from the APU, we may have to suspend the CPU cothread and simulate the APU for a while until it "catches up" to the CPU. And since the CPU simulation is in the middle of doing something (i.e. its partway through the simulation of a particular instruction), this is a bad time to try and save the CPU state directly. We need to capture the whole context in which the cothread suspended (i.e. "I was simulating this particular instruction, and I got to the fourth cycle of it, and I've already done half of the instruction's internal effects but not the other half, and now I'm trying to read from a certain address"). Doing this in a portable fashion is pretty hard. Doing it in a non-portable fashion would be very hackish and brittle (your old savestates would probably not work anytime there was a minor change to bsnes).

So what we would like to do is somehow save enough information to recreate the exact context of each cothread, but we would like this information to be in a portable format which is unlikely to change over time. "Log replay" depends on an ordered log of side effects (i.e. reads and writes) to capture that context. Hopefully it will be portable between versions of bsnes (and possibly even across different emulators?) because the order of the reads and writes is dictated by the SNES hardware being emulated, and should not change between emulator versions (unless they had horrible bugs in them, I guess).

Its hard or impossible in general, to get all cothreads to a "good" spot at the same time. (Sure, it might work some of the time--but there's no way to know how long it could take. Even if it would probably work in 99% of the cases, there may theoretically be cases where they would never line up, and your save would never be able to occur).

What "log replay" tries to do, is save each state at a "good" spot, but still allow it to advance to a "bad" spot afterwards, if that happens to be necessary in order to get one of the other cothreads to its own "good" spot. So first you get CPU to a "good" spot and save it, then you get APU to a "good" spot and save that, then you get PPU to a "good" spot and save that. But after each one is saved, you start logging anything further that happens to it, and you also save those logs (as well as being careful about side effects, e.g. probably save the CPU memory at the end, after the logging is done). So on loading, you're loading each cothread it in its "good" state, then feeding it the log entries to get it back into the final state it was in at the end of the saving process. While a cothread is being fed log entries, it is sort of "in a bubble"--it can't directly affect, or be affected by, the other cothreads. Only the log entries. Because the writes it is simulating have already happened (during the save process), and it would be a mistake to let them take effect again. E.g. the CPU writing to a certain PPU register, has already affected the PPU state. When the CPU again performs that write, we just check that it matches in the log and don't do anything else with the value. Once each cothread has used up its log entries, you get rid of the "bubble" and they start talking to memory and to each other in the normal fashion, and you resume running normally.

I'm lousy at explaining this, partly because its been over a year since I thought hard about it, and partly because the idea itself is kind of complicated. I wish there were a simpler way that was reliable and theoretically sound.

[Edit: The emphasis is on "reads" and "writes" because these are the only ways that the CPU can affect other processors. In other words, it doesn't matter what else happens inside the CPU simulation, or if we lost a bit of that and had to re-simulate it on the load, or whatever. The reads and writes are the externally-visible side effects caused by the execution of instructions on the CPU. We need to make sure that across a load and a save, we have exactly the right side effects becoming visible to other cothreads at exactly the right times. If there are other kinds of side effect---such as IRQ signals?---then they may need to be handled similarly.]

henke37 · Post by **henke37** » Wed Jan 30, 2008 10:35 pm

byuu wrote:
The using the debug info parts means that the structure of the stack frame is to be looked up in the debugging info that the compiler generates.

Also, my idea is something that can not be done fully in c++. It needs to be done in assembler, because it is going to have to read and write the raw bytes that makes up a stack frame. True, it will be very simple for the bsnes only code, since it is just two functions to call, CothreadLoad and CothreadSave.

Implementation taking 4 lines max is what I call it being trivial.
Okay, please write the "4 lines max" "in assembler" that takes "the structure of the stack frame ... in the debugging info that the compiler generates" to implement "CothreadLoad" and "CothreadSave". I'll do the rest.

Hell, don't even worry about the "4 lines max" part. Write an application that parses "the structure of the stack frame ... in the debugging info that the compiler generates" and I'll be happy. We won't even worry about all of the other inherent issues, such as pointers in the stack frame to allocated memory that are dynamically assigned at runtime and such. I just want to see you parse the "the debugging info that the compiler generates" in an actual application.

Nice misquotations there. I never said that the parsing would take 4 lines.
I said that the mapping of function return addresses to and from the location in the source code is trivialy.

But anyway, I honestly don't have time to implement such a parser.

And I addressed the pointers issue already, don't store the pointers, store IDs of things to point at, as in my example file.

byuu · Post by **byuu** » Wed Jan 30, 2008 11:25 pm

Nice misquotations there. I never said that the parsing would take 4 lines.

I'm sorry, this is why I was careful to only quote exactly what you've said in the past, completely unmodified. The impression you gave me was that it would be trivial to implement in four lines of code. I wasn't at all trying to misquote you.

I'll remove the quotes from my post then, feel free to do the same or not with yours.

But anyway, I honestly don't have time to implement such a parser.

Then please stop suggesting your idea as the solution to this problem, as only you seem to even think it's plausible. In fact, I'm not even sure if anyone else here even thinks it is possible.

On the bright side, this will save you some more time.

Jipcy · Post by **Jipcy** » Thu Jan 31, 2008 7:05 am

mozz, you actually explained it quite well. I'm pretty sure I understand the concept now, not that I can do anything to help.

I have to say that, in general, save states of ANY sort are more useful than none at all. Save states that only work with a single compile of an emulator obviously have the least usefulness, but even just emulator-specific save states are extremely useful.

I'll stop uselessly replying to this thread now.

henke37 · Post by **henke37** » Thu Jan 31, 2008 8:03 am

byuu wrote:
Nice misquotations there. I never said that the parsing would take 4 lines.
I wasn't at all trying to misquote you.

Don't worry, I don't think you tried either.

byuu wrote:
But anyway, I honestly don't have time to implement such a parser.
Then please stop suggesting your idea as the solution to this problem, as only you seem to even think it's plausible. In fact, I'm not even sure if anyone else here even thinks it is possible.

Well, to end my opinion about it, it is just as plausible as your emulator using cooperative threading. It's insane to anyone who first hears about it, but dam does it work like a charm once implemented.

blargg · Post by **blargg** » Thu Jan 31, 2008 10:15 am

henke37's solution is definitely possible, just non-trivial to implement. As he says, the debugger is able to take the program stopped at (almost) any point and find all the stack frames and locals in the function call chain, using compiler-generated meta-information. The addition is to then to make this persistent, which adds the problem of pointers.

As he says, these pointers must be converted. You can have pointers to other functions, objects on the stack, objects on the freestore, and OS objects. Each requires different handling since they must be reconstituted. How are you going to handle the latter types? For objects of a class type, you need to be able to get the compiler-generated vtable pointer. For freestore objects you also need to know the size of the block and how it was allocated. And for OS objects, you can only reconstitute them the same way they were originally created and modified; you can't just restore their bits. And then you have other problems too, because the restored program might not be exactly the same as the saved, in terms of relative addresses of blocks on the freestore. Sometimes one sorts a container by object pointer, which could fail. OK, so you restore containers specially. I imagine there is a long list of special cases that would be necessary.

All-encompassing persistence systems like this need to be designed at the language level to be practical to implement. Since that is indeed what is being described, maybe someone else has solved all this in C++ on Windows and Unix.

henke37 · Post by **henke37** » Thu Jan 31, 2008 5:44 pm

Well, feel free to search for such a software, because, I don't have a good keyword to Google.

mozz · Post by **mozz** » Thu Jan 31, 2008 9:56 pm

blargg wrote:henke37's solution is definitely possible, just non-trivial to implement. As he says, the debugger is able to take the program stopped at (almost) any point and find all the stack frames and locals in the function call chain, using compiler-generated meta-information. The addition is to then to make this persistent, which adds the problem of pointers.

One way to get that meta-information is using tools that parse the debug info generated by the compiler. The debug formats are platform- and sometimes compiler-specific, so you need a variant of the tool for each one. Using that you generate some sort of RTTI data structures, except they contain this sort of binary layout information that you can't get from C++'s built-in RTTI mechanism. Maybe something like this:
http://citeseer.ist.psu.edu/kakkad96portable.html

I think it would be pretty complicated to get it all to work, considering you still end up with a non-portable save format. You'd have to write a certain amount of platform-specific code for each port, and probably use custom tools as part of your emu build process. Better to spend that effort on a complex-but-portable format =)

Deathlike2 · Post by **Deathlike2** » Thu Jan 31, 2008 11:27 pm

blargg wrote:henke37's solution is definitely possible, just non-trivial to implement.

Amen.

byuu · Post by **byuu** » Fri Feb 01, 2008 12:30 am

Yes, I believe non-trivial to be a severe understatement in this case. I was hesitant to suggest anything might not be possible, but when an idea becomes as implausible as this, the distinction becomes merely a technicality.

Who knows, perhaps when I have all special chips supported, the PPU emulated right, a nice debugger and the core fully separated from the UI, I'll give mozz's idea a shot. So, look forward to news on that in 2026 or so :)

It's that, or we try it in a smaller scale application first, such as a tiny NES or SMS emulator.

If there are other kinds of side effect---such as IRQ signals?---then they may need to be handled similarly.

Almost missed this. The PPU outputs hblank and vblank pins that the CPU polls every clock tick. Easy enough, output two bits to the log file for every clock tick, as though it were a read from the CPU (it's technically just that). Right now though, the PPU is enslaved to the CPU, so this wouldn't even be necessary at the moment.