This is the second part of a series of articles that I’m writing based on my experiences with my GameBoy emulator written in .NET Core; Retro.Net. This time I’ll be having a look at the core of the emulator, which implements a high level form of dynamic recompilation to emulate the Z80 derived GameBoy CPU.
At it’s lowest level, a CPU can be represented by a finite state machine, where state is handled by registers and domain by the address space. This model is pretty simple to implement in software as all you need is a mechanism for tracking state and a definition of all possible state transitions. Luckily the GameBoy CPU, a Sharp LR35902, is derived from the popular and very well documented Zilog Z80 - A microprocessor that is unbelievably still in production today, over 40 years after it’s introduction.
The Z80 is an 8-bit microprocessor, meaning that each operation is natively performed on a single byte. The instruction set does have some 16-bit operations but these are just executed as multiple cycles of 8-bit logic. The Z80 has a 16-bit wide address bus, which logically represents a 64K memory map. Data is transferred to the CPU over an 8-bit wide data bus but this is irrelevant to simulating the system at state machine level. The Z80 and the Intel 8080 that it derives from have 256 I/O ports for accessing external peripherals but the GameBoy CPU has none - favouring memory mapped I/O instead. Since I found an abundance of Z80 documentation, including non-official, or undocumented behaviours that were discovered by ZX Spectrum developers, I opted to implement an emulator compatible with the full Z80 specification. To achieve Intel 8080 and GameBoy compatibility, which aren’t strict logical subsets of the Z80, I simply provide a platform configuration mechanism that can toggle features.
Firstly we need some registers. The Intel 8080 and GameBoy CPU have six 8-bit general purpose registers, an accumulator, flags, stack pointer and program counter. 16-bit access is also provided to each general purpose register and the accumulator and flags registers in sequential pairs. Additionally, the Z80 has two more 16-bit index registers, an alternative set of each general purpose, accumulator and flags registers and a few more bits and pieces.
|AF||16-bit or two 8-bit||Accumulator (A) and flag bits (F)|
|BC||16-bit or two 8-bit||Data/address|
|DE||16-bit or two 8-bit||Data/address|
|HL||16-bit or two 8-bit||Accumulator/address|
The Z80 defines alternative/banked versions of
HL that are accessed via the exchange opcodes and also has some more specialized registers.
|IX||16-bit or two 8-bit||Displacement offset base|
|IY||16-bit or two 8-bit||Displacement offset base|
|I||8-bit||Interrupt vector base register|
|R||8-bit||DRAM refresh counter|
The flags register is a single byte that contains a bit-mask set according to the last result. Notice that the GameBoy flags register only uses the most significant 4-bits and does not implement the sign or parity/overflow flag. The least significant bits of the GameBoy flags register are always 0.
|8080/Z80 Bit||GameBoy Bit||Name|
The flags registers are implemented slightly differently on each platform so we’ll hide them away behind an interface.
For example, we might implement the GameBoy flags register as follows.
General purpose registers
For defining our register classes we could do with some help for converting between the 8 and 16-bit representations of the registers.
Let’s start by defining the general purpose register set.
Accumulator and flags
The accumulator and flags must be separate from the general purpose registers due to the Z80’s banked register support.
To simplify the rest of the system, we will access registers via an interface conforming to the full Z80 specification.
We’ll have two implementations of this interface; one for the Z80 and one shared by the GameBoy and Intel 8080. The Z80 implementation will have full functionality whereas the Intel 8080/GameBoy implementation will throw when calling methods or properties that are not supported by the platform. It will be the responsibility of the CPU core implementation to configure the correct level of register support for each platform.
Next we need to implement the memory management unit (MMU) to broker access to the address space. The relationship between CPU, MMU, memory and memory mapped IO devices looks something like the following.
An MMU should support reading and writing data in various lengths across the entire address space, whilst abstracting away the hardware that is physically attached to each location in the space.
We can implement an MMU in a platform agnostic way by introducing a concept of segments. A segment has a location and length so that the MMU can correctly position it in address space and will provide implementation specific data access operations. For example, most GameBoy cartridges have a microcontroller acting as a memory bank controller (MBC) over multiple banks of read only memory (ROM). Read requests for data in an MBC address space will be forwarded to a configured page of ROM, whereas write requests will change which page is configured. For this reason we really need different interfaces for readable and writeable segments.
For example, a very simple segment that represents random access memory (RAM) could be implemented using a byte array:
It is the responsibility of the platform to wire up the correct segments and their implementations.
A CPU runs on a fetch-decode-execute cycle, called the machine cycle or m-cycle. The CPU will initially fetch a byte, whose location in the address space is pointed to by the program counter register (PC), decode it as an instruction (opcode) and execute it, or contextually use it as a literal for a previous cycle. Opcodes not related to absolute program flow, such as jumps or calls, will end a cycle by incrementing the program counter to point at the next byte in the address space. Opcode length is variable and whilst some operations run in a single cycle, others require multiple fetch-decode-execute cycles to run. Opcodes are specified by the official Zilog Z80 documentation but you must also consider the differences in implementation on the GameBoy CPU. Here’s an example of running three simple opcodes on a Z80:
We’re not really concerned with this low level cycle as software cannot control it, but we do need to keep track of how many have occurred so that we have a mechanism to match (read: approximate) platform timing. Our higher level cycle will be based on a concept of an operation, which can be represented by one or more opcodes and optional literals.
Each operation cycle will:
- Fetch the next opcode.
- Decode the fetched opcode.
- Fetch any extra data required to resolve the operation including extra opcodes and literals.
- Record all m-cycles consumed in the operation so that we can block later to adjust our timings.
- Execute the opcode.
A naive CPU core
Naively this can be implemented as a while loop with a massive switch statement to handle the opcodes.
Arithmetic logic unit
Notice that we have abstracted CPU arithmetic behind an arithmetic logic unit (ALU). The interface for this would look something like this.
We spend the majority of the
Add function dealing with flags. This is a common theme of ALU methods so it makes sense that a complete ALU implementation would group flag manipulations into logical groups abstracted away behind methods such as
A dynamic CPU core
We can do much better than the naive Z80 implementation above. The main issue I have with it is that we are decoding each opcode on every cycle, even if we’ve already decoded the exact same block of memory in a previous cycle. Since we’re all software developers here, we know that the basic building blocks of software are control flow structures such as loops and functions. Software written for the Z80 platform is no different, it also consists of control flow structures, which means that the same code is run many times but against different data or CPU states. An ideal solution to this problem is to recompile Z80 opcodes and static data into a structure suitable for caching. In this case we would run a decode step only once per block of memory, compiling and caching it with enough meta data to recall later. When the CPU core attempts to decode the same block in the future, it will first check the cache. A hit will mean it can skip the entire decode-compile step resulting in a massive boost to performance. We will define block boundaries where an operation changes the program counter directly i.e. program flow opcodes such as jumps, calls, resets and returns. Since we’re using .NET we have the expression tree API; a brilliant abstraction over emitting IL. Here’s an example of how we might improve the original naive example to use cached expression trees.
The examples presented here have obviously been bastardised to fit onto a blog post, demonstrating a concept or technique rather than providing a complete implementation. For example, the
ExpressionTreeZ80 above fails to consider timing or self modifying code, misses out on some simple optimizations such as a pre-fetch queue and fails to consider unit testability by not separating the decode and execute steps. For the full implementation please visit Retro.Net on GitHub.
A UK midlands based software developer, mathematician, husband, runner, cyclist, cat lover, coffee drinker and retro gamer.