Using the Altera USB-Blaster on Fedora

October 28th, 2011

Altera’s Quartus tools include some special software to download bitstreams to their devices over USB (a DE-2 eval board, in my case). They require some tricky work to set up properly on Fedora – my dev host of choice. But you’re in luck! I’ve packaged up an RPM that takes care of this extra work for you. It creates a udev rule to set up the USB-Blaster properly when you plug in your USB JTAG connection. It also provides a service wrapper for Altera’s jtagd daemon and moves some of Altera’s data files around so things just work. The sources are in moxiedev, but I’ve posted the binary and source RPMs here for convenience: http://moxielogic.org/tools/. Be sure to read the docs in /usr/share/doc/moxie-quartus-altera-1 once installed. It assumes that you’ve already installed Quartus II on your box (they don’t package in RPM format, unfortunately).

I’m told that the open source alternative UrJTAG may work for this device as well, but I haven’t had a chance to look into it. Any experience worth sharing out there?

Fake RAM, load/store and push

October 11th, 2011

Progress report time….

I need RAM in order to implement/test most instructions. To that end, I’ve implemented a fake data cache that is always accessed within a single cycle during the WRITE pipeline stage. Eventually this will have to be replaced with a real data cache that reads/writes to real memory over the wishbone bus while the processor pipeline stalls.

The push instruction was easy enough to implement. It’s the first one that writes to both memory and a register (to update the stack pointer). This meant reworking the interface between the EXECUTE and WRITE stages. Pop is a little more tricky because we need to update two registers: the stack pointer and the register we’re loading memory into. I’m going to work this out tomorrow night, but I can see now how making it work in a single cycle will require a little more logic than splitting it up into two cycles. It will be interesting to experiment with changes like that once everything is working.

Also, I reorganized the HDL source to cleanly separate the moxie core from the muskoka SoC and related firmwares and cores. As usual, everything is in github.

GTKWave Tip #2: Translate Filter Files

October 6th, 2011

Tip #1 was about using an external process to perform dynamic translations of signal display values. GTKWave can also perform simple static translations of data. In the example below, for instance, moxie’s execute unit is receiving a 4-bit signal identifying “register A” (riA_i) for whatever operation is about to happen. This waveform would normally just show “0″, “1″, “2″, etc. We can replace those values by using a simple filter file that maps those string representations to alternate values. Here’s the one I use for register names: gtkwave-regs.txt. After applying the filter we see the register name “$r1″ instead of its register file index value of “2″.. a tremendous readability improvement!

gtkwave with moxie register name translation filter file

gtkwave with moxie register name translation filter file

GTKWave Tip #1

October 4th, 2011

GTKWave is a new tool for me, so I’ll use this space to post useful tips as I discover them.

The first tip comes from Tony Bybell, author of GTKWave, who pointed me at some helpful functionality in a recent blog comment. You can enhance GTKWave’s waveform display by replacing the normal data presentation with something of your own design. So, for instance, instead of looking at an opcode of “8033″, you can see the disassenbled moxie instruction “inc $fp 3″. You do this by implementing a program that reads raw values from stdin and writes translations to stdout, and then attaching this program as a filter to one of your signals.

I wrote a quick moxie disassembler in lisp (gtkwave-opcodes.lisp), and it is proving to be very handy!

gtkwave with moxie disassembly

gtkwave with moxie disassembly

Branch delays

September 28th, 2011

I’ve coded up logic for more arithmetic instructions, register moves, as well as direct and indirect jumps. For jumps, I simply pass a branch signal from the execute stage back to the fetch stage, as well as the computed target address. Here’s some code that works now:

.text
	xor	$r0, $r0 # Zero out $r0
	mov	$r1, $r0
	mov	$r2, $r0
	mov	$r3, $r0
	mov	$r4, $r0
loop:	inc	$r0, 0x1 # Increment $r0
	inc	$r1, 0x1
	inc	$r2, 0x1
	inc	$r3, 0x1
	inc	$r4, 0x1
	jmpa	loop+0x1000 # Offset hack
	nop
	nop

A couple of things to note… Boot ROM is mapped into the address space at 0×1000, which explains the offset hack above. A linker script is probably the right way to do this. Using “.org 0×1000″ at the start of the source appears to pad the resulting object with 0×1000 bytes of nothing, which means it doesn’t fit into the small space I’ve allocated for bootrom.

Also note that I’ve got to deal with branch delay slots. I’m not exactly sure what I want to do yet. Due to the nature of the variable width instruction encoding, it looks like you can have either 1 or 2 delay slots to fill, depending on the instruction sizes. I don’t like this at all. I’ll probably end up limiting it to one delay slot.

This is one of those ugly areas where implementation informs design. I’d rather not have delay slots at all, but it’s hard to ignore the performance gain.

A simulation milestone for the Muskoka SoC!

September 27th, 2011

A moxie-based SoC had it’s first successful simulation run today….

gtkwave display of first code run

gtkwave display of first code run

Pretty exciting! So, here’s what’s happening…

The SoC, code named “Muskoka”, has three main components: the moxie core, a wishbone switch and a ROM device. The switch was easy to implement, as I just have a single bus master (moxie), and a single slave device (bootrom) for now. The boot ROM is populated with bits generated by assembling the code below like so:

moxie-elf-as -o bootrom.x bootrom.s
moxie-elf-objcopy -O verilog bootrom.x bootrom.vh

.text
	xor	$r0, $r0  # Zero out $r0
	xor	$r1, $r1
	xor	$r2, $r2
	xor	$r3, $r3
	xor	$r4, $r4
	inc	$r0, 0x1  # Increment $r0 by 1
	inc	$r1, 0x1
	inc	$r2, 0x1
	inc	$r3, 0x1
	inc	$r4, 0x1
	inc	$r0, 0x1
	inc	$r1, 0x1
	inc	$r2, 0x1
	inc	$r3, 0x1
	inc	$r4, 0x1
	inc	$r0, 0x1
	inc	$r1, 0x1
	inc	$r2, 0x1
	inc	$r3, 0x1
	inc	$r4, 0x1

The moxie core itself is has 4 pipeline stages: Fetch, Decode, Execute and Write. The Fetch stage is probably the most complicated at this point. Remember that moxie has both 16- and 48-bit instructions. However, we’re fetching instruction memory from the ROM device 32-bits at a time. The fetch logic feeds these 32-bits into an instruction queue and pulls out the right number of bits at the other end, depending on the instruction about to be decoded.

So far only NOP, XOR and INC have been implemented, and there’s no pipeline hazard detection logic, but careful analysis of the simulation dumps in gtkwave show that everything is finally starting to work. Baby steps…

As usual, everything has been committed to moxiedev on github.

RTEMS status update

June 15th, 2011

The RTEMS port in moxiedev is looking pretty good right now. Here’s a test of the RTEMS network loopback device running on the moxie gdb simulator.  The first two client connections fail in this test.  It’s supposed to fail in the first case, but I’m not sure about the second case.   It’s possible that this is just a result of there being no pre-emptive context switching in the gdb simulator (no timers!).

$ moxie-elf-run loopback.exe
"Network" initializing!
"Network" initialized!
Try running client with no server present.
Should fail with `connection refused'.
Connect to server.
Can't connect to server: Connection refused
Client closing connection.

Start server.

Try running client with server present.
Connect to server.
Can't connect to server: Connection refused
Client closing connection.
Client task terminating.
Create socket.
Bind socket.

Try running two clients.
Connect to server.
Connect to server.
ACCEPTED:7F000001
ACCEPTED:7F000001
Write 22-byte message to server.
Write 22-byte message to server.
Read 43 from server: Server received 22 (Hi there, server (3).)
Read 43 from server: Server received 22 (Hi there, server (2).)
Client closing connection.
Client task terminating.
Client closing connection.
Client task terminating.
Worker task terminating.
Worker task terminating.

Try running three clients.
Connect to server.
Connect to server.
Connect to server.
ACCEPTED:7F000001
ACCEPTED:7F000001
ACCEPTED:7F000001
Write 22-byte message to server.
Write 22-byte message to server.
Write 22-byte message to server.
Read 43 from server: Server received 22 (Hi there, server (5).)
Read 43 from server: Server received 22 (Hi there, server (6).)
Read 43 from server: Server received 22 (Hi there, server (4).)
Client closing connection.
Client task terminating.
Client closing connection.
Client task terminating.
Client closing connection.
Client task terminating.
Worker task terminating.
Worker task terminating.
Worker task terminating.
*** END OF LOOPBACK TEST ***

This really needs to be tested on qemu, but I think it’s time to get back into verilog.

Bisecting GCC

June 6th, 2011

The thing about GCC is that things break when you take your eye off the ball. And this is what happened during my months long hiatus from the moxie project. Somewhere between early March and today, the moxie GCC port lost the ability to compile non-trivial code, notably libgcc. Firing up gdb on a core file may have been illuminating to somebody who lived in GCC sources every day but, to the occasional hacker, it’s difficult to see where things went wrong if you don’t know what you’re looking for. Enter git bisect

The git bisect tool automates finger pointing by binary searching through your source history for offending patches. It needs three things to work:

  1. An older known working version of the sources.
  2. A newer known broken version of the sources.
  3. A test executable (typically a shell script) that will tell whether a given version of the source code is broken or not.

Given all this, git bisect will start a binary search through the git history for your code, looking for the exact commit that caused the test to fail.

The test case I used was to build moxie’s C compiler and try to compile one of the libgcc sources that fails. If the compiler doesn’t report an error, we’re good, otherwise we know we still have the bug. Here’s the script I used as the git bisect test:

#!/bin/sh

# My git clone of the gcc tree
GCCSRC=~/bisect/gcc

# My pre-processed test case
TESTSRC=~/bisect/test.i

cd ~/bisect

rm -rf build
mkdir build

(cd build;
 $GCCSRC/configure --target=moxie-elf --enable-languages=c;
 make -j8 all-gcc)

if test -f build/gcc/cc1; then
  # build my test case
  build/gcc/cc1 -O2 $TESTSRC;
  # cc1 returns exit codes outside of git's acceptable range, so...
  if test "$?" -ne "0"; then
    exit 1;
  fi;
  exit 0;
else
  exit 1;
fi

Note that GCC is maintained in a subversion tree, but there’s an official git mirror that makes all of this possible. You need to clone it locally before you can do anything.

There were over 1000 commits between my last known working version and today’s GCC sources. My first thought was… “this is going to take hours”. I was wrong.

Running “git bisect run ~/bisect/test.sh” took all of 35 minutes.

The smartest thing I did here was work on a large amazon ec2 instance. It’s a cloud-hosted virtual server similar to a dual-core system with 7GB RAM and ample fast storage all for about 34 cents an hour. I’ve taken to doing development in the cloud and, relative to my standard setup, it is blazingly fast! I created a Fedora 15 image, yum installed all my tools (don’t forget ccache!), git cloned moxiedev, gcc and my emacs config files, and I was bisecting in no time.

Git bisect told me that on Monday, March 21, my old colleague Richard Sandiford committed some improvements to GCC that were tripping up the moxie port. A few minutes later I caught up with Richard on IRC, where he explained the patch to me. Shortly after this I’m testing a fix. Amazing.

On-chip communications

October 6th, 2010

I need to build real SoC infrastructure around my developing core in order to test it on real hardware. For the most part, this means a memory controller and IO devices. I’ve decided to implement a shared-bus wishbone-style interconnect for these devices. Wishbone is an open source on chip bus architecture that is popular with many open core developers. While not perfect, wishbone is a good choice for this first SoC due its simplicity and the ample supply of sample implementations.

The main complaints about wishbone are the lack of efficient semaphore operations (the bus remains locked for an entire read/modify/write operation) and the lack of pipelined reads/writes. This doesn’t bother me for the moment. I just need a simple interconnect so I can focus on debugging the moxie core part of the SoC.

By the way, there’s a terrific book on this subject called “On-Chip Communication Architectures”. Chapters 2 and 3 are great introductions to on-chip interconnects, and it looks like Google Books has the entire chapter 2 online here…

Quartus II and the Cloud: Not There Yet…

September 20th, 2010

Altera’s Quartus II software is now running on my Fedora box, but I was really hoping to load it into a grid of Amazon EC2 instances so I could batch out jobs for synthesis, place and route. Unfortunately the free-beer Quartus software requires a license file that is generated based on your MAC address. Bah! Amazon doesn’t provide persistent MAC addresses AFAICT.

Are they trying to prevent this kind of use, or are they just behind the times? Altera – please make your software cloud ready! I’d really like to try out condor’s ec2 support to speed up this kind of work.

It’s a shame that there are no credible Free Software alternatives to these tools.

Maybe there’s some way to make this work. Please comment below if you have the clue that I’m missing…