Tag Archives: gcc

Moxie ports and hardware developments

It’s been a while since my last update… let me bring you up to speed.

A couple of libffi releases got in the way of moxie hacking (although libffi 3.0.13 now includes full moxie support!), but things are picking up speed again.

On the software side of things, the moxie RTEMS and QEMU ports have both been accepted upstream. So now it’s possible to build, run and debug RTEMS applications on QEMU purely with upstream project sources. You may notice that I’m doing much less work in the moxiedev repository these days. This was mostly just a staging area for moxie software support (tools, OS), and there’s little use for it now that most everything is upstream. All of the moxie HDL work now happens in the moxie-cores git tree.

As for the hardware side of things, here are some of the recent changes:

  • The MoxieLite core now supports ssr and gsr instructions, along with a bank of 16 special registers. The special register uses are defined here: http://moxielogic.org/wiki/index.php/Architecture
  • And now that the special register support is in place, exceptions and the swi (software interrupt) instruction are working in hardware. Semantics are defined here: http://moxielogic.org/wiki/index.php/Exceptions
  • bad (illegal) instructions now cause an illegal instruction exception
  • A simple interrupt controller has been added to the marin SoC. I have the Nexys3 momentary switches hooked up as interrupt sources, so I can trigger interrupts and handle them in software by pressing those buttons.
  • A trivial timer has been hooked up to the interrupt controller, so I can now generate ‘tick’ interrupts for RTEMS in support of preemptive multitasking (everything was cooperative up ’til now).
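The dispatch path for those tick interrupts might look roughly like this in C. The register layout, bit assignments and names here are illustrative guesses for the sake of the sketch, not the actual moxie-cores register map (volatile accesses are omitted so the logic can run off-target):

```c
#include <stdint.h>

/* Hypothetical register layout for Marin's interrupt controller; the
   real one lives in the moxie-cores tree, so treat these names and
   fields as illustrative only. */
struct intc {
    uint32_t pending;  /* one bit per interrupt source */
    uint32_t mask;     /* 1 = source enabled */
};

#define TIMER_IRQ (1u << 0)  /* assumed bit position for the tick timer */

unsigned long ticks;  /* incremented on every timer interrupt */

/* Called from the exception entry path on each interrupt: check the
   enabled pending sources, bump the tick count, and acknowledge. */
void irq_dispatch(struct intc *ic)
{
    uint32_t active = ic->pending & ic->mask;
    if (active & TIMER_IRQ) {
        ticks++;                    /* an RTOS would run its clock tick here */
        ic->pending &= ~TIMER_IRQ;  /* acknowledge the timer source */
    }
}
```

In the RTEMS case the `ticks++` line is where the kernel's clock-tick service would be invoked, which is what makes preemptive scheduling possible.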

I’m actually just debugging the timer ticks right now, but it’s very close.

And on a final note… while RTEMS is a great little embedded RTOS, it’s clear from this EE Times embedded survey that I’m going to have to implement FreeRTOS support next: http://www.eetimes.com/electronics-news/4407897/Android–FreeRTOS-top-EE-Times–2013-embedded-survey. I think that’s what I’ll tackle after I get RTEMS running preemptively.

Running a C Program on the Marin SoC

I’ve just committed the bits required to run a C program on the Marin SoC.

Rather than hook up the Nexys3 external RAM module, I’m using extra space on the FPGA itself for RAM. Most of the hard work was sorting out the linker script magic required to generate an appropriate image.

I’ve also added a UART with 1k hardware FIFO transmit and receive buffers. The 1k is probably overkill, so I’ll likely shrink them once everything else is working.

I’ve moved all memory mapped IO devices up to 0xF0000000. So, for instance, the 7-segment display LED is at 0xF0000000, and the UART transmit register is at 0xF0000004. I’ll just keep going from there.
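As a sketch of how firmware might talk to those memory-mapped registers (the two addresses are from the post; the macro and function names are made up for illustration):

```c
#include <stdint.h>

/* Marin memory map from the post: the 7-segment LED lives at
   0xF0000000 and the UART transmit register at 0xF0000004. */
#define MARIN_LED_ADDR     0xF0000000u
#define MARIN_UART_TX_ADDR 0xF0000004u

/* Write a string one character at a time to a transmit register.
   Taking the register as a parameter keeps this testable off-target;
   on the SoC you would pass (volatile uint32_t *)MARIN_UART_TX_ADDR. */
void uart_puts(volatile uint32_t *tx, const char *s)
{
    while (*s)
        *tx = (uint32_t)*s++;  /* the hardware FIFO absorbs back-to-back writes */
}
```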

Next comes libgloss hacking to map stdout/stdin to the UART (which I talk to with minicom on my Linux box). We’re very close to “Hello World” now!

MoxieLite in Action

Brad Robinson just sent me this awesome shot of MoxieLite in action. His Xilinx Spartan-6 FPGA based SoC features a moxie core handling VGA video, keyboard and FAT-on-flash filesystem duties using custom firmware written in C. This is all in support of a second z80-based core on the same FPGA used to emulate an ’80s era computer called the MicroBee. Those files in the listing above are actually audio cassette contents used to load the MicroBee software. The moxie core is essentially a peripheral emulator for his final product.

Keep up the great work, Brad!

The most recent compiler patch was the addition of -mno-crt0, which tells the compiler not to include the default C runtime startup object at link time. This is common practice for many embedded projects, where some system-specific housekeeping is often required before C programs can start running. For instance, you may need to copy the program’s .data section from ROM into RAM before jumping to main().
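That ROM-to-RAM copy can be sketched in a couple of lines of C. In a real crt0 replacement the three pointers would come from linker-script symbols (something like _data_load, _data_start and _data_end; those names are illustrative, not moxie's actual ones):

```c
/* Copy the program's .data section from its ROM load address to its
   RAM runtime address before main() runs.  The pointers stand in for
   linker-script symbols marking the load address and runtime bounds. */
void copy_data_section(char *ram_start, char *ram_end, const char *rom_load)
{
    while (ram_start < ram_end)
        *ram_start++ = *rom_load++;
}
```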

I’m going back to my pipelined moxie implementation. Last I looked I had to move memory reads further up the pipeline…

It’s Alive!

There’s a working hardware implementation of moxie in the wild!

Intrepid hacker Brad Robinson created this moxie-compatible core as a peripheral controller for his SoC. He had been using a simple 8-bit core, but needed to address more memory than was possible with the 8-bit part. Moxie is a nice alternative because it has a compact instruction encoding, a supported GNU toolchain and a full 32-bit address space. FPGA space was a real concern, so he started with a non-pipelined VHDL implementation, and by all accounts it is running code and flashing LEDs on a Nexys3 board!

The one major “ask” was that there be a little-endian moxie architecture and toolchain in addition to the default big-endian design. I had somewhat arbitrarily selected big-endian for moxie, noting that this is the natural byte order for TCP. In Brad’s design, however, the moxie core will be handling FAT filesystem duties, which is largely a little-endian task. At low clock speeds every cycle counts, so I agreed to produce a bi-endian toolchain and, for the most part, it’s all committed in the upstream FSF repositories (with the exception of gdb and the simulator). moxie-elf-gcc is big-endian by default, but compile with -mel and you’ll end up with little-endian binaries.

Brad also suggested several other useful tweaks to the architecture, including changing the PC-relative offset encodings for branches. They had originally been encoded relative to the start of the branch instruction. Brad noted, however, that changing them to be relative to the end of the branch instruction saved an adder in his design. I made this change throughout the toolchain and (*cough*) documentation.

I’ll write more about this as it develops… Have to run now.

Oh. Here’s the VHDL on github: http://github.com/toptensoftware/MoxieLite. Go Brad!


Notes on a novel in-game CPU: the dcpu-16

The hacker behind the Minecraft phenomenon, Notch, is working on his next game, most likely another hit. This one is interesting in that it includes an in-game 16-bit processor called the dcpu-16. Details are sparse, but it seems as though gamers will use this processor to control spacecraft and play in-game games. The dcpu-16 spec is currently available at http://0x10c.com/doc/dcpu-16.txt, and in the few days since its release there are already many community-produced assemblers and emulators.

Like moxie, it’s a load-store architecture with variable-width instructions (16 to 48 bits long). But the dcpu-16’s 16-bitness is pervasive. There are eight 16-bit registers, and the smallest addressable unit of memory is a 16-bit word. There are only about 16 unique opcodes, which leaves room for two 6-bit operands in the 16-bit instruction word. With only eight registers, a 6-bit operand can encode multiple addressing modes (direct, indirect, offset, etc.) and still have room for small literal values.
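That field split is easy to sketch in C. The bbbbbbaaaaaaoooo layout (4-bit opcode in the low bits, two 6-bit operands above it) is taken from the draft spec linked above:

```c
#include <stdint.h>

/* Field split of a basic dcpu-16 instruction word, following the
   draft spec's bbbbbbaaaaaaoooo layout. */
struct dcpu_insn {
    uint16_t op;  /* bits 0-3   */
    uint16_t a;   /* bits 4-9   */
    uint16_t b;   /* bits 10-15 */
};

struct dcpu_insn dcpu_decode(uint16_t word)
{
    struct dcpu_insn i;
    i.op = word & 0xF;
    i.a  = (word >> 4) & 0x3F;
    i.b  = (word >> 10) & 0x3F;
    return i;
}
```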

If you poke around github you’ll find the start of a llvm backend as well as a tcc port. I haven’t looked into these compilers, but a C ABI for the dcpu-16 would certainly be unusual to most developers. You would likely have a 32-bit long, but char, short and int would all be 16 bits.

As far as GNU tools go, a binutils port would be pretty straightforward. I created a branch in moxiedev to try my hand at a dcpu-16 binutils port. It’s not very well tested, but gas, ld, objdump, etc. all appear to work as advertised. All instructions with immediate operands, whether literal values or computed by linker relocations, are encoded in their long form. Taking advantage of the smaller encodings will require linker relaxation work. It’s not rocket science, but more work than the couple of hours I was willing to put into it. There appears to be one bug in GNU ld related to handling relocations for ELF targets where the smallest addressable memory value is 16 bits vs. 8. I worked around it by making one small non-portable change to the target-independent linker code.

I think GDB should be fairly straightforward as well. For most real targets GDB will want to insert breakpoint instructions in the text of a program, and it wants that instruction to be the same size as the smallest instruction available on the target. Alas, the dcpu-16 has no breakpoint instruction, 16-bit or otherwise, so the simulator will have to include special hardware breakpoint emulation logic. My suggestion is to repurpose some of the 16-bit illegal instruction encodings. For instance, the ISA allows for nonsensical instructions like this:

  SET 5, 6

This means set the literal value 5 to 6. Setting a literal value makes no sense, and the spec currently says that these instructions are silently ignored. Rather than ignore them, you could use this class of instruction as special software interrupt/breakpoint/trap instructions like moxie’s swi.
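An emulator's breakpoint check along those lines might look like this. The SET opcode number (0x1) and the literal-operand range (0x20..0x3f) follow the draft spec, but the trap policy itself is the suggestion above, not anything the spec mandates:

```c
#include <stdint.h>

/* In the draft spec, operand values 0x20..0x3f encode small literals,
   and SET is opcode 0x1.  A SET whose destination (the 'a' field) is
   a literal is one of the "silently ignored" encodings, so an
   emulator could trap on it as a breakpoint instead. */
#define DCPU_OP_SET 0x1

int dcpu_is_breakpoint(uint16_t word)
{
    uint16_t op = word & 0xF;
    uint16_t a  = (word >> 4) & 0x3F;  /* destination operand */
    /* The 6-bit mask guarantees a <= 0x3F, so one compare suffices. */
    return op == DCPU_OP_SET && a >= 0x20;
}
```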

A GCC port would be more challenging. It’s definitely possible, but would stretch GCC outside of its comfort zone. You’d end up exercising bits of the compiler that aren’t normally tested, and I imagine you’d end up spending a lot of time debugging some of the darker recesses of the compiler code. Best of luck to the brave soul who tries this!

I’m very curious to see how this all plays out. Given the massive success of Minecraft, I wouldn’t be surprised if we see an app store for in-game dcpu-16 based games. Good luck to Notch and the team at Mojang.

Bisecting GCC

The thing about GCC is that things break when you take your eye off the ball. And this is what happened during my months-long hiatus from the moxie project. Somewhere between early March and today, the moxie GCC port lost the ability to compile non-trivial code, notably libgcc. Firing up gdb on a core file may have been illuminating to somebody who lived in GCC sources every day but, to the occasional hacker, it’s difficult to see where things went wrong if you don’t know what you’re looking for. Enter git bisect.

The git bisect tool automates finger pointing by binary searching through your source history for offending patches. It needs three things to work:

  1. An older known working version of the sources.
  2. A newer known broken version of the sources.
  3. A test executable (typically a shell script) that will tell whether a given version of the source code is broken or not.

Given all this, git bisect will start a binary search through the git history for your code, looking for the exact commit that caused the test to fail.

The test case I used was to build moxie’s C compiler and try to compile one of the libgcc sources that fails. If the compiler doesn’t report an error, we’re good, otherwise we know we still have the bug. Here’s the script I used as the git bisect test:


#!/bin/sh

# My git clone of the gcc tree (path illustrative)
GCCSRC=$HOME/gcc

# My pre-processed test case (path illustrative)
TESTSRC=$HOME/bisect/testcase.i

cd ~/bisect

rm -rf build
mkdir build

(cd build;
 $GCCSRC/configure --target=moxie-elf --enable-languages=c;
 make -j8 all-gcc)

if test -f build/gcc/cc1; then
  # build my test case
  build/gcc/cc1 -O2 $TESTSRC;
  # cc1 returns exit codes outside of git's acceptable range, so...
  if test "$?" -ne "0"; then
    exit 1;
  fi
  exit 0;
fi
exit 1;

Note that GCC is maintained in a subversion tree, but there’s an official git mirror that makes all of this possible. You need to clone it locally before you can do anything.

There were over 1000 commits between my last known working version and today’s GCC sources. My first thought was… “this is going to take hours”. I was wrong.

Running “git bisect run ~/bisect/test.sh” took all of 35 minutes.

The smartest thing I did here was work on a large amazon ec2 instance. It’s a cloud-hosted virtual server similar to a dual-core system with 7GB RAM and ample fast storage all for about 34 cents an hour. I’ve taken to doing development in the cloud and, relative to my standard setup, it is blazingly fast! I created a Fedora 15 image, yum installed all my tools (don’t forget ccache!), git cloned moxiedev, gcc and my emacs config files, and I was bisecting in no time.

Git bisect told me that on Monday, March 21, my old colleague Richard Sandiford committed some improvements to GCC that were tripping up the moxie port. A few minutes later I caught up with Richard on IRC, where he explained the patch to me. Shortly after this I’m testing a fix. Amazing.

Summer is over, so put away the white pants and start submitting patches!

It’s been a while since my last update. What can I say… summer was nice.

But now, back to business! I’ve just committed some long overdue patches to the upstream GNU tools.

This gets us to booting the kernel, loading BusyBox, running some shell code and… crashing on the first fork. No problemo. Nothing a small matter of programming can’t fix. However, there are some other distractions…

Verilog is lots of fun! It looks like regular programming, but it feels more like building a kinetic sculpture.

There’s also the small matter of not having an interrupt controller! So there’s some work here to design an interrupt controller, implement it in Verilog, simulate it in QEMU (and possibly the gdb sim), and port the kernel over to using it. This should be interesting…

Speed bumps on the road to moxie userland

Sooo….. it turns out there’s lots to take care of before userland apps like BusyBox can run.

  • The root filesystem. This one is easy. I just built a short Hello World application in C with moxie-uclinux-gcc. This produces an executable in BFLT format which I call ‘init’. The kernel build machinery takes this and produces a compressed root filesystem image linked to the vmlinux binary. The good news is that the kernel is able to boot, detect this initramfs, decompress it and load the init executable (which involves fixing up all of init’s relocations). My Hello World doesn’t actually use the C library or any system calls. It just writes Hello through direct communication with the simulator via our software interrupt (swi) instruction. I thought this would let me avoid dealing with system calls for now. I was wrong…
  • System calls. This one is harder. Obviously (in retrospect!) the kernel creates the init process via the execve system call. Implementing system call support involves lots of platform dependent stuff. For instance, how do we invoke system calls? How are parameters passed? How do we switch back and forth between userland and the kernel? The first question is easy: I’ll use our trusty software interrupt (swi) instruction to invoke system calls. This means creating an exception handler and installing it as described in this old post.
    As an aside, the swi instruction takes a 32-bit immediate operand. We currently use this to identify calls to the simulator via libgloss. This works well for escaping to the simulator, but isn’t the best way to identify system calls to the kernel. The Linux kernel is going to ignore this operand, and we’ll pass the system call ID in a register instead. This saves us from doing complex instruction decoding in the exception handler (which would also thrash any future data cache). Libgloss and the sim only need a small number of IDs, so I’m going to chop the swi instruction down from 48 bits to 16 bits in a future build of the tools.
    Passing arguments to the system calls was also interesting to sort out…
  • System call argument passing. The moxie ABI currently only has two registers being used to hold function arguments. The remaining arguments must live on the stack. This decision goes back to when we only had 8 registers to play with. It turns out that Linux kernel system calls can have a maximum of 5 arguments. In order to avoid tricky argument marshaling, I’ve decided to try changing the general ABI accordingly, so that up to 5 registers may be used to hold function arguments. This involves changes to the compiler, debugger and a smattering of assembly language in libgloss.
    The great thing about having integrated benchmarks into the moxiedev environment is that you can easily compare before and after performance for ABI changes like this. Running “ant benchmark” runs through the MiBench benchmark suite and saves a nice report for easy comparison. It turns out that switching from 2 to 5 register arguments is almost universally a win in terms of both code size and instruction trace length (an approximation of run time). The consumer jpeg benchmarks were slightly larger and slower, but only by less than 1%. Every other benchmark result was slightly better. The one outlier was the “network_dijkstra” benchmark which ended up 44% “faster” (44% fewer instructions being executed).
  • The first real moxie compiler bug. Sometimes things just don’t work! This is especially true when you’re tracking the bleeding edge from upstream. I won’t go into the details, but I discovered a rare bug in the compiler where it would assume that compare results could live across function calls. Fortunately I was able to track down the guilty compilation pass and disable it with -fno-rerun-cse-after-loop. I know that some people have brought up kernels without the benefit of a nice debugger, but I just don’t see how that is possible. The simulator, and a solid gdb port with reverse debugging capabilities have proven to be invaluable!

There’s still lots to figure out and implement in the system call space, but it’s clear that we’re getting very close to running our first Linux program!

Everything is relative (finally!)

The Moxie ISA still needs quite a bit of tuning. Take branches, for instance. A beq instruction is currently encoded like so:

00001111xxxxxxxx iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii

…where the “x“s represent “don’t care” bits, and “i“s are a 32-bit absolute branch target. That’s right — branch targets are not PC relative! This is hugely wasteful.

I’ve finally got around to fixing this. Here’s how I did it…

  1. I recoded all branch instructions as “Form 3” instructions, and tweaked the as-of-yet unused Form 3 encodings so they look like this:

      FORM 3 instructions start with the bits "11"...

        0                   F
        [11oooovvvvvvvvvv]

        oooo       - form 3 opcode number
        vvvvvvvvvv - 10-bit immediate value

    This gives us 16 opcodes with a 10-bit immediate value. There are only 9 branch instructions, so we have a bit of room left in the Form 3 opcode space.

  2. I introduced a new 10-bit PC-relative Moxie relocation in BFD. This tells the linker and friends how to process PC-relative relocations.
  3. I hacked the assembler to generate these new relocations instead of simply emitting a 32-bit absolute address.
  4. I hacked the disassembler to print the new Form 3 instructions out nicely.
  5. Finally, I taught the compiler how to emit valid branch instructions. It’s not that they look any different now; it’s just that you need to worry about branch targets that exceed our 10-bit range. Actually, we have an 11-bit range because we know that all instructions are 16-bit aligned. This lets us drop the bottom bit from the encoding since we know it will always be 0.
    An 11-bit range lets us branch about 1k backwards to 1k forwards. If the compiler detects that a branch target is out of range, we want it to do something like the following transformation…

        beq    .FAR_TARGET

    …becomes…

        bne    . + 8
        jmpa   .FAR_TARGET

    The “bne .+8” line means branch forward 8 bytes from the current PC. This would skip the unconditional jump to .FAR_TARGET (a 6-byte instruction + 2-bytes for the branch = 8). Note that we have to reverse the logic from “beq” to “bne” for this to make sense.

    This is only possible if GCC can tell how far away the branch targets are. Fortunately, we’re able to annotate instructions in the machine description file (moxie.md) with their length; currently either 2 or 6 bytes long. GCC then processes these annotations to determine branch distances.

    Now that we know branch distances at compile time, the compiler can do smart instruction selection to deal with out-of-range branches. The changes were quite simple and limited to the .md file in the backend.
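The offset arithmetic above can be sketched in C. The "11" marker and field positions follow the encoding diagram earlier in this post; treating the 10-bit field as signed is an assumption inferred from the stated 1k-backwards/1k-forwards range:

```c
#include <stdint.h>

/* A branch target must be even (instructions are 16-bit aligned) and
   within the roughly -1024..+1022 byte range a signed 10-bit word
   offset can reach once the low bit is dropped. */
int branch_offset_in_range(int32_t off)
{
    return (off & 1) == 0 && off >= -1024 && off <= 1022;
}

uint16_t encode_branch(uint16_t opcode4, int32_t off)
{
    /* 0b11 marker | 4-bit opcode | 10-bit field holding off / 2 */
    return (uint16_t)(0xC000 | ((opcode4 & 0xF) << 10) | ((off / 2) & 0x3FF));
}

int32_t decode_branch_offset(uint16_t insn)
{
    int32_t v = insn & 0x3FF;
    if (v & 0x200)    /* sign-extend the 10-bit field */
        v -= 0x400;
    return v * 2;     /* restore the dropped alignment bit */
}
```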

The savings after this ISA change are substantial. For instance, the consumer_jpeg_c benchmark in MoxieDev is more than 15% smaller when we use PC-relative branches! The u-boot binary, on the other hand, is “only” 7% smaller.

I hope to commit these changes to SRC and GCC once the GCC port is merged upstream. Fingers crossed…