Notes on a novel in-game CPU: the dcpu-16

April 9th, 2012

The hacker behind the Minecraft phenomena, Notch, is working on his next game, most likely another hit. This one is interesting in that it includes an in-game 16-bit processor called the dcpu-16. Details are sparse, but it seems as though gamers will use this processor to control spacecraft and play in-game games. The dcpu-16 spec is currently available at http://0x10c.com/doc/dcpu-16.txt, and in the few days since its release there are already many community produced assemblers and emulators.

Like moxie, it’s a load-store architecture with variable width instructions (16 to 48 bits long). But the dcpu-16′s 16-bitness is pervasive. There are 8 16-bit registers, and the smallest addressable unit of memory is a 16-bit word. There are only about 16 unique opcodes in the 16-bit instruction, which means there’s room for 2 6-bit operands. With only 8 registers, a 6-bit operand can encode multiple addressing modes (direct, indirect, offset, etc) and still have room for small literal values.

If you poke around github you’ll find the start of a llvm backend as well as a tcc port. I haven’t looked into these compilers, but a C ABI for the dcpu-16 would certainly be unusual to most developers. You would likely have a 32-bit long, but char, short and int would all be 16 bits.

As far as GNU tools go, a binutils port would be pretty straight forward. I created a branch in moxiedev to try my hand at a dcpu-16 binutils port. It’s not very well tested, but gas, ld, objdump, etc all appear to work as advertised. All instructions with immediate operands, whether literal values or computed by linker relocations, are encoded in their long form. Taking advantage of the smaller encodings will require linker relaxation work. It’s not rocket science, but more work than the couple of hours I was willing to put into it. There appears to be one bug in GNU ld related to handling relocations for ELF targets where the smallest addressable memory value is 16 bits vs 8. I worked around it by making one small non-portable change to the target independent linker code.

I think GDB should be fairly straight forward as well. For most real targets GDB will want to insert breakpoint instructions in the text of a program, and it wants that instruction to be the same size as the smallest instruction available on the target. Alas, the dcpu-16 has no breakpoint instruction, 16-bit or othwerwise, so the simulator will have to include special hardware breakpoint emulation logic. My suggestion is to repurpose some of the 16-bit illegal instruction encodings. For instance, the ISA allows for nonsensical instruction like this:


  SET 5, 6

This means set the literal value 5 to 6. Setting a literal value makes no sense, and the spec currently says that these instructions are silently ignored. Rather than ignore them, you could use this class of instruction as special software interrupt/breakpoint/trap instructions like moxie’s swi.

A GCC port would be more challenging. It’s definitely possible, but would stretch GCC outside of its comfort zone. You’d end up excercising bits of the compiler that aren’t normally tested, and I imagine would end up spending a lot of time debugging some of the darker recesses of the compiler code. Best of luck to the brave soul who tries this!

I’m very curious to see how this all plays out. Given the massive success of Minecraft, I wouldn’t be surprised if we see an app store for in-game dcpu-16 based games. Good luck to Notch and the team at Mojang.

Using the Altera USB-Blaster on Fedora

October 28th, 2011

Altera’s Quartus tools include some special software to download bitstreams to their devices over USB (a DE-2 eval board, in my case). They require some tricky work to set up properly on Fedora – my dev host of choice. But you’re in luck! I’ve packaged up an RPM that takes care of this extra work for you. It creates a udev rule to set up the USB-Blaster properly when you plug in your USB JTAG connection. It also provides a service wrapper for Altera’s jtagd daemon and moves some of Altera’s data files around so things just work. The sources are in moxiedev, but I’ve posted the binary and source RPMs here for convenience: http://moxielogic.org/tools/. Be sure to read the docs in /usr/share/doc/moxie-quartus-altera-1 once installed. It assumes that you’ve already installed Quartus II on your box (they don’t package in RPM format, unfortunately).

I’m told that the open source alternative UrJTAG may work for this device as well, but I haven’t had a chance to look into it. Any experience worth sharing out there?

Fake RAM, load/store and push

October 11th, 2011

Progress report time….

I need RAM in order to implement/test most instructions. To that end, I’ve implemented a fake data cache that is always accessed within a single cycle during the WRITE pipeline stage. Eventually this will have to be replaced with a real data cache that reads/writes to real memory over the wishbone bus while the processor pipeline stalls.

The push instruction was easy enough to implement. It’s the first one that writes to both memory and a register (to update the stack pointer). This meant reworking the interface between the EXECUTE and WRITE stages. Pop is a little more tricky because we need to update two registers: the stack pointer and the register we’re loading memory into. I’m going to work this out tomorrow night, but I can see now how making it work in a single cycle will require a little more logic than splitting it up into two cycles. It will be interesting to experiment with changes like that once everything is working.

Also, I reorganized the HDL source to cleanly separate the moxie core from the muskoka SoC and related firmwares and cores. As usual, everything is in github.

GTKWave Tip #2: Translate Filter Files

October 6th, 2011

Tip #1 was about using an external process to perform dynamic translations of signal display values. GTKWave can also perform simple static translations of data. In the example below, for instance, moxie’s execute unit is receiving a 4-bit signal identifying “register A” (riA_i) for whatever operation is about to happen. This waveform would normally just show “0″, “1″, “2″, etc. We can replace those values by using a simple filter file that maps those string representations to alternate values. Here’s the one I use for register names: gtkwave-regs.txt. After applying the filter we see the register name “$r1″ instead of its register file index value of “2″.. a tremendous readability improvement!

gtkwave with moxie register name translation filter file

gtkwave with moxie register name translation filter file

GTKWave Tip #1

October 4th, 2011

GTKWave is a new tool for me, so I’ll use this space to post useful tips as I discover them.

The first tip comes from Tony Bybell, author of GTKWave, who pointed me at some helpful functionality in a recent blog comment. You can enhance GTKWave’s waveform display by replacing the normal data presentation with something of your own design. So, for instance, instead of looking at an opcode of “8033″, you can see the disassenbled moxie instruction “inc $fp 3″. You do this by implementing a program that reads raw values from stdin and writes translations to stdout, and then attaching this program as a filter to one of your signals.

I wrote a quick moxie disassembler in lisp (gtkwave-opcodes.lisp), and it is proving to be very handy!

gtkwave with moxie disassembly

gtkwave with moxie disassembly

Branch delays

September 28th, 2011

I’ve coded up logic for more arithmetic instructions, register moves, as well as direct and indirect jumps. For jumps, I simply pass a branch signal from the execute stage back to the fetch stage, as well as the computed target address. Here’s some code that works now:

.text
	xor	$r0, $r0 # Zero out $r0
	mov	$r1, $r0
	mov	$r2, $r0
	mov	$r3, $r0
	mov	$r4, $r0
loop:	inc	$r0, 0x1 # Increment $r0
	inc	$r1, 0x1
	inc	$r2, 0x1
	inc	$r3, 0x1
	inc	$r4, 0x1
	jmpa	loop+0x1000 # Offset hack
	nop
	nop

A couple of things to note… Boot ROM is mapped into the address space at 0×1000, which explains the offset hack above. A linker script is probably the right way to do this. Using “.org 0×1000″ at the start of the source appears to pad the resulting object with 0×1000 bytes of nothing, which means it doesn’t fit into the small space I’ve allocated for bootrom.

Also note that I’ve got to deal with branch delay slots. I’m not exactly sure what I want to do yet. Due to the nature of the variable width instruction encoding, it looks like you can have either 1 or 2 delay slots to fill, depending on the instruction sizes. I don’t like this at all. I’ll probably end up limiting it to one delay slot.

This is one of those ugly areas where implementation informs design. I’d rather not have delay slots at all, but it’s hard to ignore the performance gain.

A simulation milestone for the Muskoka SoC!

September 27th, 2011

A moxie-based SoC had it’s first successful simulation run today….

gtkwave display of first code run

gtkwave display of first code run

Pretty exciting! So, here’s what’s happening…

The SoC, code named “Muskoka”, has three main components: the moxie core, a wishbone switch and a ROM device. The switch was easy to implement, as I just have a single bus master (moxie), and a single slave device (bootrom) for now. The boot ROM is populated with bits generated by assembling the code below like so:

moxie-elf-as -o bootrom.x bootrom.s
moxie-elf-objcopy -O verilog bootrom.x bootrom.vh

.text
	xor	$r0, $r0  # Zero out $r0
	xor	$r1, $r1
	xor	$r2, $r2
	xor	$r3, $r3
	xor	$r4, $r4
	inc	$r0, 0x1  # Increment $r0 by 1
	inc	$r1, 0x1
	inc	$r2, 0x1
	inc	$r3, 0x1
	inc	$r4, 0x1
	inc	$r0, 0x1
	inc	$r1, 0x1
	inc	$r2, 0x1
	inc	$r3, 0x1
	inc	$r4, 0x1
	inc	$r0, 0x1
	inc	$r1, 0x1
	inc	$r2, 0x1
	inc	$r3, 0x1
	inc	$r4, 0x1

The moxie core itself is has 4 pipeline stages: Fetch, Decode, Execute and Write. The Fetch stage is probably the most complicated at this point. Remember that moxie has both 16- and 48-bit instructions. However, we’re fetching instruction memory from the ROM device 32-bits at a time. The fetch logic feeds these 32-bits into an instruction queue and pulls out the right number of bits at the other end, depending on the instruction about to be decoded.

So far only NOP, XOR and INC have been implemented, and there’s no pipeline hazard detection logic, but careful analysis of the simulation dumps in gtkwave show that everything is finally starting to work. Baby steps…

As usual, everything has been committed to moxiedev on github.

RTEMS status update

June 15th, 2011

The RTEMS port in moxiedev is looking pretty good right now. Here’s a test of the RTEMS network loopback device running on the moxie gdb simulator.  The first two client connections fail in this test.  It’s supposed to fail in the first case, but I’m not sure about the second case.   It’s possible that this is just a result of there being no pre-emptive context switching in the gdb simulator (no timers!).

$ moxie-elf-run loopback.exe
"Network" initializing!
"Network" initialized!
Try running client with no server present.
Should fail with `connection refused'.
Connect to server.
Can't connect to server: Connection refused
Client closing connection.

Start server.

Try running client with server present.
Connect to server.
Can't connect to server: Connection refused
Client closing connection.
Client task terminating.
Create socket.
Bind socket.

Try running two clients.
Connect to server.
Connect to server.
ACCEPTED:7F000001
ACCEPTED:7F000001
Write 22-byte message to server.
Write 22-byte message to server.
Read 43 from server: Server received 22 (Hi there, server (3).)
Read 43 from server: Server received 22 (Hi there, server (2).)
Client closing connection.
Client task terminating.
Client closing connection.
Client task terminating.
Worker task terminating.
Worker task terminating.

Try running three clients.
Connect to server.
Connect to server.
Connect to server.
ACCEPTED:7F000001
ACCEPTED:7F000001
ACCEPTED:7F000001
Write 22-byte message to server.
Write 22-byte message to server.
Write 22-byte message to server.
Read 43 from server: Server received 22 (Hi there, server (5).)
Read 43 from server: Server received 22 (Hi there, server (6).)
Read 43 from server: Server received 22 (Hi there, server (4).)
Client closing connection.
Client task terminating.
Client closing connection.
Client task terminating.
Client closing connection.
Client task terminating.
Worker task terminating.
Worker task terminating.
Worker task terminating.
*** END OF LOOPBACK TEST ***

This really needs to be tested on qemu, but I think it’s time to get back into verilog.

Bisecting GCC

June 6th, 2011

The thing about GCC is that things break when you take your eye off the ball. And this is what happened during my months long hiatus from the moxie project. Somewhere between early March and today, the moxie GCC port lost the ability to compile non-trivial code, notably libgcc. Firing up gdb on a core file may have been illuminating to somebody who lived in GCC sources every day but, to the occasional hacker, it’s difficult to see where things went wrong if you don’t know what you’re looking for. Enter git bisect

The git bisect tool automates finger pointing by binary searching through your source history for offending patches. It needs three things to work:

  1. An older known working version of the sources.
  2. A newer known broken version of the sources.
  3. A test executable (typically a shell script) that will tell whether a given version of the source code is broken or not.

Given all this, git bisect will start a binary search through the git history for your code, looking for the exact commit that caused the test to fail.

The test case I used was to build moxie’s C compiler and try to compile one of the libgcc sources that fails. If the compiler doesn’t report an error, we’re good, otherwise we know we still have the bug. Here’s the script I used as the git bisect test:

#!/bin/sh

# My git clone of the gcc tree
GCCSRC=~/bisect/gcc

# My pre-processed test case
TESTSRC=~/bisect/test.i

cd ~/bisect

rm -rf build
mkdir build

(cd build;
 $GCCSRC/configure --target=moxie-elf --enable-languages=c;
 make -j8 all-gcc)

if test -f build/gcc/cc1; then
  # build my test case
  build/gcc/cc1 -O2 $TESTSRC;
  # cc1 returns exit codes outside of git's acceptable range, so...
  if test "$?" -ne "0"; then
    exit 1;
  fi;
  exit 0;
else
  exit 1;
fi

Note that GCC is maintained in a subversion tree, but there’s an official git mirror that makes all of this possible. You need to clone it locally before you can do anything.

There were over 1000 commits between my last known working version and today’s GCC sources. My first thought was… “this is going to take hours”. I was wrong.

Running “git bisect run ~/bisect/test.sh” took all of 35 minutes.

The smartest thing I did here was work on a large amazon ec2 instance. It’s a cloud-hosted virtual server similar to a dual-core system with 7GB RAM and ample fast storage all for about 34 cents an hour. I’ve taken to doing development in the cloud and, relative to my standard setup, it is blazingly fast! I created a Fedora 15 image, yum installed all my tools (don’t forget ccache!), git cloned moxiedev, gcc and my emacs config files, and I was bisecting in no time.

Git bisect told me that on Monday, March 21, my old colleague Richard Sandiford committed some improvements to GCC that were tripping up the moxie port. A few minutes later I caught up with Richard on IRC, where he explained the patch to me. Shortly after this I’m testing a fix. Amazing.

On-chip communications

October 6th, 2010

I need to build real SoC infrastructure around my developing core in order to test it on real hardware. For the most part, this means a memory controller and IO devices. I’ve decided to implement a shared-bus wishbone-style interconnect for these devices. Wishbone is an open source on chip bus architecture that is popular with many open core developers. While not perfect, wishbone is a good choice for this first SoC due its simplicity and the ample supply of sample implementations.

The main complaints about wishbone are the lack of efficient semaphore operations (the bus remains locked for an entire read/modify/write operation) and the lack of pipelined reads/writes. This doesn’t bother me for the moment. I just need a simple interconnect so I can focus on debugging the moxie core part of the SoC.

By the way, there’s a terrific book on this subject called “On-Chip Communication Architectures”. Chapters 2 and 3 are great introductions to on-chip interconnects, and it looks like Google Books has the entire chapter 2 online here…