FPGA Development Deep Dive

We are excited to share a deep technical look at the M64's FPGA development, written by our core development partner Robert Peip (aka FPGAzumSpass). The M64 FPGA, memory subsystem, and interfaces were designed with accuracy in mind from the start. Even though we still have work to do (FPGA work takes a LONG time), we are excited by our hardware architecture's full potential. Picking PSRAM over DDR was an early decision, and one we know was right. We firmly believe that M64's hardware architecture is superior.

We're also excited to see how others unlock M64's potential. As a proof-of-concept, we ported a different core from MiSTer to M64's hardware to prove it was possible, and only time will tell which others are brought over. M64's design is a statement about shedding the hardware limitations faced by the open-source retro-gaming community: it provides a powerful base with a large and fast FPGA, four fast, low-latency PSRAMs, and 4K-capable video output. Seeing other cores run on M64 hardware is only possible if we open source our design, and that's exactly what we're committed to doing.

What Makes Hardware Emulation Hard

By Robert Peip (aka FPGAzumSpass)

Accuracy in both software and FPGA emulation is often discussed as a hot topic. There are several misunderstandings and wrong assumptions around it and I want to take a deeper dive into this topic. First we'll look at what accuracy is and what FPGAs can do for that. Then we'll look at examples in games and tests, finishing with the current state of the M64.

Accuracy vs Cycle Accuracy

When replicating the behavior of a classic gaming console, ideally all the internal hardware components of this console are replicated. These can be processors, memory modules, or special-purpose chips. A clock is what typically drives all of these electronic parts.

The N64 CPU is clocked at 93.75 MHz, which means its clock ticks 93,750,000 times per second.

While simple calculations (instructions) only take a single cycle inside the CPU, others can take much longer. For example, it might cost ~70 clock cycles to do a 64-bit division or 6 cycles to do a floating point multiplication with N64's CPU.

When emulating the CPU, care has to be taken that these timings are correct, otherwise the execution speed is wrong. You can easily see that if some emulator executes this multiplication in one clock cycle instead of six and a game is using plenty of them, it would make the game run too fast.
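The effect of a wrong instruction timing can be sketched with simple cycle counting. This is a minimal illustration, not how any real core is written: the cost table only reuses the figures quoted above, the instruction names and the sample program are made up, and pipelining is ignored entirely.

```python
# Illustrative cycle costs, reusing the figures quoted in the text.
# The mnemonics are placeholders, not a real instruction decoder.
CYCLE_COST = {"add": 1, "fmul": 6, "ddiv": 70}

def total_cycles(program, cost_table):
    """Sum the cycle cost of every instruction, ignoring pipelining."""
    return sum(cost_table[op] for op in program)

# A made-up program: 1000 simple adds and 1000 floating point multiplies.
program = ["add"] * 1000 + ["fmul"] * 1000

accurate = total_cycles(program, CYCLE_COST)
# An emulator that wrongly treats the multiply as a single-cycle instruction.
sloppy = total_cycles(program, {**CYCLE_COST, "fmul": 1})

print(accurate, sloppy, accurate / sloppy)
```

With these assumed numbers, the sloppy timing finishes the same work in 2,000 cycles instead of 7,000, so this stretch of code would run 3.5 times too fast.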

This is sometimes easier said than done with such pipelined CPUs, as they execute multiple instructions in parallel to some degree. Maybe the next instruction is not in cache and needs to be fetched from RAM while the multiplication is running in the background and due to the RAM read taking a while, the multiplication is already done when the RAM read is finished.

What makes things even worse is that multiple chips communicate with each other and run in parallel. This can lead to interlock dependencies and increase the effect of inaccuracies even further, or cancel them out by luck.

Why Bother Using an FPGA?

To make things short: A fully accurate emulator is easier to write in software than to build in an FPGA. The catch is execution speed, not correctness.

Emulation has relatively deterministic behavior in the end, and this can of course be done in software. In fact, every digital FPGA behavior can be simulated in software just fine. This is done during emulator development all the time.

Why even bother with FPGAs then?

Well, the more you want to emulate the parallel behavior of chips or functionalities inside a chip, the tighter the timing requirements are. The N64 does hundreds of things in parallel each clock cycle and to fully replicate that in software with full accuracy would mean to execute them all and sync up the different modules after each clock cycle.
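To get a feeling for the cost of syncing after each clock cycle, here is a minimal lockstep loop. It is only a sketch: the `Counter` class is a stand-in for a hardware block, the count of 100 blocks is an assumed value for "hundreds of things", and it pretends that every block runs at the CPU clock.

```python
CPU_HZ = 93_750_000  # N64 CPU clock from the text

class Counter:
    """Stand-in for one hardware block; a real block holds far more state."""
    def __init__(self):
        self.ticks = 0

    def tick(self):
        self.ticks += 1

def run_lockstep(modules, cycles):
    """Advance every module by one cycle before moving to the next cycle."""
    steps = 0
    for _ in range(cycles):
        for module in modules:
            module.tick()
            steps += 1
    return steps

modules = [Counter() for _ in range(100)]  # assumed number of parallel blocks
steps = run_lockstep(modules, cycles=1000)
steps_per_emulated_second = len(modules) * CPU_HZ

print(steps, steps_per_emulated_second)
```

Under these assumptions a software emulator would need over nine billion module updates per emulated second, each one strictly ordered after the previous cycle, which is why the work cannot simply be spread across more CPU cores.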

This leads to the issue that to emulate the N64 fully and accurately, you would need a PC with a CPU so fast that it doesn't exist today. Because of that, software emulators have to optimize and sometimes cut corners on functionality or timing that is probably not important to the game. This is often required to keep playable execution speeds on common hardware. Ares is a popular and extremely accurate N64 software emulator, but even it has to resort to some clever tricks to get close.

FPGAs on the other hand can do all these things with true parallelism, like the original hardware did. They open up the possibility to even reach the "perfect" accuracy we all eagerly seek, assuming there's enough margin to perform all of the calculations and meet any constraints associated with them. This does not, however, mean that an FPGA implementation is automatically accurate. Our understanding of the N64 hardware and its internal processors continues to expand, as do our implementations when we learn something new. If we didn't do this, we would fall behind.

Perfect Accuracy is Impossible

Without getting too philosophical, there is a natural limit to how accurately a console can be matched, and that limit is set by its clock speed. This isn't a barrier just for recreations like M64 or A3D; it's sometimes a part of the original hardware too.

Remember the concept of clocks from earlier in this post? Well, almost all internal logic is based on clocks that are generated by a clock generator chip on the printed circuit board. These clock generators have variations, resulting in them being slightly slower or faster for each manufactured console. This is a natural phenomenon which engineers and scientists seek to tighten, but never truly eliminate.

The clock generated for NTSC or PAL video out on N64 has a variation of 30 ppm (parts per million), which means a nominal clock rate of 1 MHz could be anywhere between 999,970 and 1,000,030 Hz. In other words: two N64s at opposite ends of that tolerance, running side by side, could drift apart by up to one second roughly every five hours, or about 5 seconds per day.
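The drift figures follow from straightforward arithmetic. The sketch below assumes the worst case, where one console runs at +30 ppm and the other at -30 ppm, so the two differ by 60 ppm in total; the function names are only for illustration.

```python
PPM = 30  # clock tolerance quoted in the text

def frequency_bounds(nominal_hz, ppm=PPM):
    """Lowest and highest frequency allowed by the tolerance."""
    delta = nominal_hz * ppm / 1_000_000
    return nominal_hz - delta, nominal_hz + delta

def worst_case_drift_seconds(elapsed_seconds, ppm=PPM):
    """Drift between two consoles at opposite ends of the tolerance."""
    return elapsed_seconds * 2 * ppm / 1_000_000

low, high = frequency_bounds(1_000_000)
drift_5h = worst_case_drift_seconds(5 * 3600)
drift_day = worst_case_drift_seconds(24 * 3600)

print(low, high)   # bounds for a nominal 1 MHz clock
print(drift_5h)    # seconds of drift after five hours
print(drift_day)   # seconds of drift after one day
```

This gives 1.08 seconds after five hours and 5.184 seconds after a day, which matches the rounded figures above.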

The same happens inside the console itself, as CPU and other components run from a different clock generator than the Video Out and these two drift against each other. In rare cases it could be that one N64 might render one extra frame that another N64 would not have rendered in the same situation, because the CPU had slightly more time between two Video Out interrupts.

Honestly though, this effect is unimportant when playing games, but it still has a big impact on emulation. You can never rely on two executions of the exact same scenario to result in the same perfect output on screen. Turning on the console and not pressing any button at all, just letting some intro run, could result in completely different images on the screen after some hours with the exact same console if things all stack up right. It may be imperceptible to us when we play, but when it comes to emulating original hardware, it's still a thing we have to consider.

Donkey Kong 64 Case

Let us look at one popular example for accuracy on the N64.

In the intro sequence of Donkey Kong 64, there is a section that requires a specific timing for the events to happen. Donkey Kong will climb up a hill and then do some jumps over to another platform using three vines. With the timing of the N64, he will always jump and grab the vine successfully, reaching the other platform. If an emulator's timing is "wrong", DK will miss the vine and land in the water, as the jump is done at a position where there is no vine.

Now why is that even the case?

This sequence in the intro, like many other in-game sequences in plenty of games, is programmed via frame counting. Specific actions like turning around or jumping are done after a certain number of frames. The game's speed, however, is determined by time. Let me explain.

These games have the same execution speed, no matter how fast the content can be rendered. Donkey Kong will move forward by the same distance in one second, no matter if the console can render the scene at 27 or 30 frames per second. However, if the A button is pressed after 30 frames, it will be later in time in case the framerate drops.
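The interaction between time-based movement and frame-counted events can be sketched in a few lines. The movement speed and the function name are assumptions chosen for illustration; only the 27 and 30 frames per second come from the example above.

```python
SPEED = 2.0  # hypothetical movement speed in distance units per second

def position_at_frame(frame_count, fps, speed=SPEED):
    """Distance covered when a frame-counted event fires.

    Movement is driven by elapsed time, but the event is driven by the
    number of rendered frames, so a lower framerate delays the event.
    """
    elapsed_seconds = frame_count / fps
    return speed * elapsed_seconds

on_target = position_at_frame(30, fps=30)  # event fires after 1.0 s
dropped = position_at_frame(30, fps=27)    # event fires after about 1.11 s

print(on_target, dropped)
```

The scripted button press happens at the same frame in both cases, but at 27 frames per second the character has already moved further by then, so the jump starts from a different position.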

What does that mean for missing the vines in this example?

The jumps are late in the sequence, so already hundreds of frames have passed before they should have happened. To make the jump correct, there is a rather narrow window of how many frames must have been rendered until then. To make things worse, the N64 has severe issues holding the target framerate of 30 during that scene, so the average framerate might be around 25.

What does it tell about accuracy when a software emulator or FPGA device makes Donkey Kong grab the vine?

Actually, not much. An emulator could fulfill these requirements by rendering 30 frames per second and stall for some frames at the right position. While this is a very extreme example, it still passes the "vine test" but would be wrong for numerous reasons.

A consistent 25 frames per second could also pass it, but in practice the N64's framerate actually fluctuates somewhere between 15 and 30. So even if this feels good while playing and passes the test, it's still not authentic.

As a result, passing it only means that on average the system runs at about the same speed, but certain situations can still be very far off.
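To illustrate why passing the test says so little, the sketch below builds three very different frame-time profiles that all reach the same scripted frame at the same point in time. The jump frame of 300 and the exact mix of frame times are invented for the example, and the stall is modeled as one very long frame.

```python
from fractions import Fraction

JUMP_FRAME = 300  # hypothetical frame number at which the jump is scripted

def elapsed_at_frame(frame_durations):
    """Total time in seconds needed to render the given frames."""
    return sum(frame_durations, Fraction(0))

# Profile 1: perfectly steady 25 frames per second.
constant_25 = [Fraction(1, 25)] * JUMP_FRAME

# Profile 2: 30 frames per second with one artificial two-second stall.
stalled_30 = [Fraction(1, 30)] * (JUMP_FRAME - 1) + [Fraction(1, 30) + 2]

# Profile 3: framerate switching between 15 and 30 frames per second.
mixed = [Fraction(1, 15)] * 60 + [Fraction(1, 30)] * 240

for profile in (constant_25, stalled_30, mixed):
    print(elapsed_at_frame(profile))
```

All three profiles take exactly 12 seconds to reach frame 300, so a check that only looks at where the jump happens cannot tell them apart, even though the frame pacing is completely different.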

RAM Access Matters Too

While DDR memory is designed for high throughput, it really struggles with low-latency access. Most of that bandwidth comes from handling sequential, aligned workloads that burst data in neat, orderly sequences.

The problem hits when you bombard it with small, random accesses where the system gets penalized with significant overhead. To try and fix this, developers often have to use complex caching strategies to avoid hitting the DDR memory directly. On top of that, DDR controllers rarely guarantee consistent read or write latency.

While this is fine for typical tasks, it kills determinism in hardware emulation. We can only guess if this is what A3D has to work around (since it uses DDR memory), but it's a fundamental issue we wanted to avoid.

That's why we chose Pseudo-Static RAM (PSRAM) for M64. PSRAM excels at low-latency, random access, even if the total throughput takes a hit. Although its peak bandwidth is lower than DDR, its latency for those small, unpredictable accesses is far more consistent.

So, how does this choice help N64 emulation?

For the N64, especially the RDP, latency is what truly matters. The speed at which each individual memory request is serviced is crucial, as the RDP is notoriously sensitive to timing. If memory fails to respond quickly, the RDP stalls unexpectedly. By using PSRAM's more uniform and predictable access times, we can significantly cut down on these unexpected, performance-killing stalls. The goal is to match the timing of the stalls that are expected on original hardware, not to avoid them.

We should be transparent: M64's PSRAM implementation still needs optimization. We are currently running at around 50% of our architectural potential. This wasn't intentional, and it's a drawback we are actively working on well before launch.

Synthetic Tests

Judging the accuracy of an emulator or FPGA implementation at the scale of whole games is often not useful, because there are too many ways in which the different chips can be too fast or too slow. As a developer, you would just not know where to look.

Instead each component can be measured on its own and transaction times between the components can be measured as well. Let's look at some examples to give you an idea.

All measurements were taken in May 2026 using M64 v1.14.3 and A3D v1.3.0.

Establishing a Baseline

A starting point for any tests is measuring that your measurement tools are correct. Typically a timer is used to measure how long operations take, but this timer itself must work just like in the original system.

In this example here you can see the min and max time it took for different scenarios on M64 and A3D, versus N64. N64's results are shown on the right side of each screenshot. Here, we are examining the cycles N64 needs to fulfill each subtest versus M64 and A3D.

These kinds of tests are very reproducible. Min and max are always the same on N64, and the FPGA implementations match original hardware, as you would expect for such a basic test.
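The general shape of such a min/max measurement can be sketched as follows. This is not the actual test ROM: `measure`, `FakeClock`, and the 27-cycle test body are hypothetical stand-ins that only show how repeated runs are reduced to a minimum and a maximum.

```python
def measure(test, read_counter, runs=100):
    """Run a test repeatedly and return the (min, max) elapsed cycles."""
    samples = []
    for _ in range(runs):
        start = read_counter()
        test()
        samples.append(read_counter() - start)
    return min(samples), max(samples)

class FakeClock:
    """Hypothetical cycle counter standing in for the console's timer."""
    def __init__(self):
        self.now = 0

    def read(self):
        return self.now

    def advance(self, cycles):
        self.now += cycles

clock = FakeClock()
# A made-up test body that always costs exactly 27 cycles.
result = measure(lambda: clock.advance(27), clock.read)

print(result)
```

With a perfectly deterministic timer, min and max come out identical. If the timer itself were off, every later measurement built on it would be off as well, which is why this baseline is checked first.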

Note: This is not the case for all software emulators, because emulating these internal timers is already a taxing task for them as it needs to be done on-time for every emulated clock cycle.

M64

A3D

Instruction Fetch for CPU Cache

If we want to measure inter-module handling, we can examine the memory transactions from the CPU. You can see from the N64 measurements that min and max mostly match, but not always. For example, the OPS 3 test sometimes takes 27 cycles and sometimes 28 on N64.

The reason for this is the clock ratio of the CPU and the Memory interface. While the CPU runs at 93.75 MHz, the Memory interface runs at 62.5 MHz. That means that for 2 cycles on the memory side, there are 3 cycles on the CPU side. Depending on where in this constellation a test is executed, the result may differ when the same test is executed multiple times.
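How the 3:2 clock ratio produces two different results for the same test can be shown with a small model. It places both clocks on a shared base tick (2 ticks per CPU cycle, 3 ticks per memory cycle) and assumes that a request waits for the next memory clock edge and that the answer is picked up on the next CPU clock edge. The transaction length of 18 memory cycles is an assumed value chosen so the numbers land near the OPS 3 example, not a measured figure.

```python
CPU_PERIOD = 2  # base ticks per CPU cycle (93.75 MHz)
MEM_PERIOD = 3  # base ticks per memory cycle (62.5 MHz)

def round_up(value, step):
    """Round value up to the next multiple of step."""
    return -(-value // step) * step

def latency_in_cpu_cycles(cpu_cycle, mem_cycles):
    """CPU cycles between issuing a request and seeing the result."""
    start = cpu_cycle * CPU_PERIOD
    mem_start = round_up(start, MEM_PERIOD)        # wait for a memory edge
    mem_done = mem_start + mem_cycles * MEM_PERIOD
    cpu_done = round_up(mem_done, CPU_PERIOD)      # wait for a CPU edge
    return (cpu_done - start) // CPU_PERIOD

# Issue the same request at six consecutive CPU cycles.
latencies = [latency_in_cpu_cycles(c, mem_cycles=18) for c in range(6)]

print(latencies)
```

In this model the identical request costs either 27 or 28 CPU cycles depending only on the cycle at which it is issued, and the pattern repeats every three CPU cycles.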

Nonetheless, you can see that it is still very reproducible on N64 and M64 gets very close to that.

What happens on an A3D? There are sometimes up to 5 additional cycles required, making the operation slower than on N64. It's only speculation, but the reason could be the DDR memory on the A3D requiring additional time for refresh. PSRAM refresh is internally managed and lower-impact than DDR rank-blocking tRFC.

M64

A3D

Storing Data in RAM

Loading data seemed simple, but what about writing to RAM?

In addition to the clock difference between CPU and Memory, there is another module involved in this transaction: the write queue.

Whenever the CPU writes to the RAM, it doesn't have to wait for data to be written, it can continue its work and let the write happen in the background. These write requests are added to a queue and are fulfilled in order. The queue has enough depth for multiple words, but if the RAM is busy, even the queue can fill up and things have to wait.
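The behavior of such a queue can be sketched with a small cycle-stepped simulation. All parameters here (queue depth, cycles per RAM write, one store issued per CPU cycle) are assumptions for illustration and do not describe the real N64 write queue.

```python
from collections import deque

def cycles_to_issue_stores(n_words, queue_depth, drain_cycles):
    """Count CPU cycles until the CPU has handed off n_words stores.

    Each cycle the RAM may take the oldest queued word if it is idle,
    then the CPU queues one new word if there is room, else it stalls.
    """
    queue = deque()
    ram_ready_at = 0
    cycle = 0
    issued = 0
    while issued < n_words:
        # RAM side: start writing the oldest queued word when idle.
        if queue and cycle >= ram_ready_at:
            queue.popleft()
            ram_ready_at = cycle + drain_cycles
        # CPU side: hand off one store per cycle unless the queue is full.
        if len(queue) < queue_depth:
            queue.append(issued)
            issued += 1
        cycle += 1
    return cycle

deep = cycles_to_issue_stores(5, queue_depth=8, drain_cycles=8)
shallow = cycles_to_issue_stores(5, queue_depth=1, drain_cycles=8)

print(deep, shallow)
```

With a deep queue the CPU hands off five words in 5 cycles, while a one-entry queue forces it to wait for the RAM and takes 26 cycles in this model. The numbers are not N64 measurements, but they show how strongly the queue depth and the state of the RAM shape the result.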

So running this test depends on plenty of conditions and getting it correct is a very hard task.

You can see that even on N64, the time it takes to write 5 words (SW 5) can vary a lot, from 5 to 42 cycles.

You can see that the implementation in M64 struggles to stay within the boundaries of the N64 in some parts, sometimes being faster or slower.

A3D (in its current version) seems to either not have the write queue at all or to have one that is too shallow: writes of 3 to 5 words take significantly longer than they should, up to 5 times longer than on N64. This probably needs some rework.

M64

A3D

CPU Instruction Timing

Unfortunately I have to mention where accuracy is lacking currently in M64.

This test checks the timing of some CPU instructions, and you can see that while A3D is passing these tests, the M64 still has floating point instruction timings wrong. This is inherited from the MiSTer core running on the slow FPGA of the DE10-Nano. Implementing the floating point ADD instruction to execute as fast as required with the given speed of that FPGA was not really viable at the time of development.

This is something that still has to be improved on the M64 side. The capabilities are there to do it, but we need to finish the work.

M64

A3D

Summary

As shown above, no perfect solution exists today. Things can still improve and if you look closely, you can find the flaws and imperfections. Most of them average out more or less over time, while others might have more influence. Of course, perfection is always a goal of ours, but perfection takes time and is a step-by-step process.

Open Points for Accuracy on M64

We have already seen the floating point instruction timing above to be left as an open point in the grand scheme of accuracy. But there are a few more things.

The most critical one is the Reality Signal Processor (RSP) dual pipeline. The RSP of the N64 is a Coprocessor capable of doing vector instructions as well as regular CPU instructions. You can think of it like having another full CPU (running at 62.5 MHz) with additional vector instructions, often used for graphics and sound calculations. It's very powerful, very complex, and very cool.

On the N64, the RSP can execute a vector instruction and a regular instruction in the same clock cycle. In M64's current core design, these two instructions are executed one after the other, potentially leading to lower performance in games that depend on it. As with the FPU, all the parts are there, but it must be implemented.

The third one is also inherited from the MiSTer core. The Translation Lookaside Buffer (TLB) implementation on MiSTer has taken some shortcuts to make it possible to implement on the slow FPGA at all. These shortcuts need to be removed to reach full performance with games that use virtual memory address mappings like GoldenEye or Conker.
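The performance difference between the two approaches can be estimated with a toy cycle counter. It is heavily simplified: instructions are just labeled "V" (vector) or "S" (regular), every instruction costs one cycle, and any adjacent pair of different kinds is assumed to be issued together. The pairing rules of the real hardware are not modeled here.

```python
def count_cycles(stream, dual_issue):
    """Count cycles for a stream of 'V' (vector) and 'S' (regular) ops."""
    cycles = 0
    i = 0
    while i < len(stream):
        can_pair = (
            dual_issue
            and i + 1 < len(stream)
            and stream[i] != stream[i + 1]
        )
        i += 2 if can_pair else 1  # a pair shares a single cycle
        cycles += 1
    return cycles

interleaved = "VSVSVS"  # made-up, ideally interleaved code
serial = count_cycles(interleaved, dual_issue=False)
paired = count_cycles(interleaved, dual_issue=True)

print(serial, paired)
```

For perfectly interleaved code this toy model needs 6 cycles without dual issue and 3 with it. Less well interleaved code such as "VVSS" only drops from 4 cycles to 3, so the gain depends heavily on how the code is written.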

These three are the big blocks that will be tackled as soon as possible to ensure the best experience. As mentioned above, there are always some differences you can measure, but those will have small effects in comparison.

Overall there is still work to be done, but the path is clear and it will be done.

So when will Donkey Kong grab the vine? Maybe it happens by accident in the next update if our timing profile changes. Maybe he will miss in a later version after that.

Does that mean the M64 got less accurate? No, it just means that the special condition wasn't hit by chance anymore.

Let me assure you that the decisions made when designing the M64 hardware were made with accuracy in mind. A very fast FPGA like the one in M64 will not have any issues fulfilling the tasks. The memory architecture in M64 is able to handle all situations without compromises. The remaining issues will be cleaned up as M64 matures. In the meantime, I hope you can continue to enjoy these classic games and bring back those memories.
