Posts Tagged ‘architecture’

Fetching instructions

Tuesday, September 7th, 2010

Moxie requires some interesting instruction fetch logic.

For my initial implementation I’m assuming a 32-bit path to instruction memory. But moxie has both 16- and 48-bit instructions, so it’s not like simple RISC cores that can feed the pipeline on every cycle. My solution is to feed 32-bit instruction memory words into a Instruction FIFO. 16- and 48-bit instructions pop out of the other end of the FIFO on every cycle (or a NOP bubble when we’re waiting for the last 16 bits of a 48-bit instruction). My initial Instruction FIFO is 64-bits long. From my simple testing it looks like this does a reasonable job of keeping the instruction memory path busy, and issuing instructions as often as possible (I’m just eyeballing the gtkwave output, reproduced below). I can experiment with a longer Instruction FIFO later.

This image shows a few signals from the Instruction FIFO. valid_o tells us that we’re popping off a valid instruction from the FIFO, whereas full_o tells us not to write any data to the FIFO because it’s full. So far, so good – decoupling the fetching of instruction memory from the rest of the pipeline is obviously the right thing to do.

One more complication that I’m going to punt on for now is PC tracking. Eventually we’ll want to pass the PC down the pipeline so we get accurate exception addresses. Tracking the PC through the Instruction FIFO is just one more little complication that I’ll tackle after I make more progress on the rest of the microarchitecture.

I’ve only done some behavioral simulation so far, but I believe the code is synthesizable. The code is in github here: http://bit.ly/9yVQ7U. Running make should build everything, then just run “a.out”.

Note that I’m using magic instruction memory: an array populated with a hello world app built like so…
$ moxie-elf-gcc -o hello.x -O2 hello.c -Tsim.ld
$ moxie-elf-objcopy -O verilog hello.x hello.vh

And the verilog simulator reads hello.vh directly. Pretty cool!

(I just realized I wrote about fetching instructions almost 18 months ago – that took too long!)

The Moxielyzer

Wednesday, October 14th, 2009

I just committed a little binary analysis tool to moxiedev. You can use it to perform simple static analysis on moxie binaries. The kinds of things I’m looking for are compiler bugs (because I know there’s still one there that is triggered by -frerun-cse-after-loop), and instruction statistics. For instance, which registers are used as load offsets, and how often? The tool uses a primitive plugin architecture that should make it easy to add new analysis tools in the future. It’s called the moxielyzer, and here is the initial commit. Run it with no arguments to get a list of plugins. Run it with just a plugin name, and it will describe the plugin. Run it with a plugin name as well as an ELF moxie executable filename, and the analysis will be performed.

I had written a similar tool for ggx back in the bad old days. Another option was to hack this stuff into gas, but I prefer to keep gas “clean” (translation: I want the freedom to maintain hacky analysis code).

BTW – I’m also rolling out a new libffi in a few weeks. You can keep track of the release candidate test results on the wiki here.

Speed bumps on the road to moxie userland

Thursday, July 30th, 2009

Sooo….. it turns out there’s lots to take care of before userland apps like BusyBox can run.

  • The root filesystem. This one is easy. I just built a short Hello World application in C with moxie-uclinux-gcc. This produces an executable in BFLT format which I call ‘init’. The kernel build machinery takes this and produces a compressed root filesystem image linked to the vmlinux binary. The good news is that the kernel is able to boot, detect this initramfs, decompress it and load the init executable (which involves fixing up all of init’s relocations). My Hello World doesn’t actually use the C library or any system calls. It just writes Hello through direct communication with the simulator via our software interrupt (swi) instruction. I thought this would let me avoid dealing with system calls for now. I was wrong…
  • System calls. This one is harder. Obviously (in retrospect!) the kernel creates the init process via the execve system call. Implementing system call support involves lots of platform dependent stuff. For instance, how do we invoke system calls? How are parameters passed? How do we switch back and forth between userland and the kernel? The first question is easy: I’ll use our trusty software interrupt (swi) instruction to invoke system calls. This means creating an exception handler and installing it as described in this old post.
    As an aside, the swi instruction takes a 32-bit immediate operand. We currently use this to identify calls to the simulator via libgloss. This works well for escaping to the simulator, but isn’t the best way to identify system calls to the kernel. The Linux kernel is going to ignore this operand, and we’ll pass the system call ID in a register instead. This avoids us having to do complex instruction decoding in the exception handler processing the interrupt (also trashing any future data cache). Libgloss and the sim only need a small number of IDs, so I’m going to chop the swi instruction down from 48-bits to 16-bits in a future build of the tools.
    Passing arguments to the system calls was also interesting to sort out…
  • System call argument passing. The moxie ABI currently only has two registers being used to hold function arguments. The remaining arguments must live on the stack. This decision goes back to when we only had 8 registers to play with. It turns out that Linux kernel system calls can have a maximum of 5 arguments. In order to avoid tricky argument marshaling, I’ve decided to try changing the general ABI accordingly, so that up to 5 registers may be used to hold function arguments. This involves changes to the compiler, debugger and a smattering of assembly language in libgloss.
    The great thing about having integrated benchmarks into the moxiedev environment is that you can easily compare before and after performance for ABI changes like this. Running “ant benchmark” runs through the MiBench benchmark suite and saves a nice report for easy comparison. It turns out that switching from 2 to 5 register arguments is almost universally a win in terms of both code size and instruction trace length (an approximation of run time). The consumer jpeg benchmarks were slightly larger and slower, but only by less than 1%. Every other benchmark result was slightly better. The one outlier was the “network_dijkstra” benchmark which ended up 44% “faster” (44% fewer instructions being executed).
  • The first real moxie compiler bug. Sometimes things just don’t work! This is especially true when you’re tracking the bleeding edge from upstream. I won’t go into the details, but I discovered a rare bug in the compiler where it would assume that compare results could live across function calls. Fortunately I was able to track down the guilty compilation pass and disable it with -fno-rerun-cse-after-loop. I know that some people have brought up kernels without the benefit of a nice debugger, but I just don’t see how that is possible. The simulator, and a solid gdb port with reverse debugging capabilities have proven to be invaluable!

There’s still lots to figure out and implement in the system call space, but it’s clear that we’re getting very close to running our first Linux program!

ISA improvements

Thursday, June 11th, 2009

I’ve committed the PC-relative branch instruction changes upstream. But this is just one of many ISA improvements that need to happen. Here are a handful of other ideas off the top of my head. None of these projects should be particularly difficult.

  • Shorten load/store offsets to 16-bits. They are currently 32-bits, but for all of the benchmarks I’ve looked at the upper 16-bits are always 0×0000 or 0xffff. If the compiler ever really wants to use an offset > 16-bits, it should revert to computing the target address in registers. I don’t expect that much code would require this.
  • Introduce shift instructions with immediate operands. There’s plenty of opcode space for us to add 16-bit shift instructions that include a 5-bit immediate shift value (so we can shift up to 32-bits in either direction). Right now we load a 32-bit immediate shift value into a register which burns that register as well as wastes 32-bits of code space per shift.
  • Get the compiler to generate 16-bit immediate loads. All immediates are 32-bits right now, but the vast majority of these constants are < 16 bits long.
  • Push/pop multiple registers to the stack with one instruction. Although we have 16-registers, the ABI doesn’t have us pushing all 16 to the stack on function entry. We should be able to have a single 16-bit instruction that pushes/pops all of the relevant registers in one go. The instruction would include a bitmap identifying the registers we need to push/pop. ARM has something like this. The only drawback I can think of is that it could increase interrupt latencies as we’d probably have to retire the entire instruction (~10 memory writes/reads) before servicing an interrupt.
  • Many register rich ISAs include one register that is hardwired to zero. We could try this to see if it makes a difference, but I doubt it would be a win. Another idea would be to create a cmpz instruction to compare a register to zero so we don’t have to burn a register for this common operation. Maybe cmp1 might even make sense. This is easy to measure.

Those are some of the obvious ones, and all I have time to write about now.

Everything is relative (finally!)

Sunday, June 7th, 2009

The Moxie ISA still needs quite a bit of tuning. Take branches, for instance. A beq instruction currently encoded like so

00001111xxxxxxxx iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii

…where the “x“s represent “don’t care” bits, and “i“s are a 32-bit absolute branch target. That’s right — branch targets are not PC relative! This is hugely wasteful.

I’ve finally got around to fixing this. Here’s how I did it…

  1. I recoded all branch instructions as “Form 3″ instructions, and tweaked the as-of-yet unused Form 3 encodings so they look like this:
    
      FORM 3 instructions start with a bits "11"...                                 
    
        11oooovvvvvvvvvv
        0              F                                                            
    
       oooo         - form 3 opcode number
       vvvvvvvvvv   - 10-bit immediate value.
    

    This gives us 16 opcodes with a 10-bit immediate value. There are only 9 branch instructions, so we have a bit of room left in the Form 3 opcode space.

  2. I introduced a new 10-bit PC-relative Moxie relocation in BFD. This tells the linker and friends how to process PC-relative relocations.
  3. I hacked the assembler to generate these new relocations instead of simply emitting a 32-bit absolute address.
  4. I hacked the disassembler to print the new Form 3 instructions out nicely.
  5. Finally, I taught the compiler how to emit valid branch instructions. It’s not that they look any different now; it’s just that you need to worry about branch targets that exceed our 10-bit range. Actually, we have an 11-bit range because we know that all instructions are 16-bit aligned. This lets us drop the bottom bit from the encoding since we know it will always be 0.
    An 11-bit range lets us branch about 1k backwards to 1k forwards. If the compiler detects that a branch target is out of range, we want it to do something like the following transformation…

        beq    .FAR_TARGET

    …becomes…

        bne    . + 8
        jmpa   .FAR_TARGET

    The “bne .+8” line means branch forward 8 bytes from the current PC. This would skip the unconditional jump to .FAR_TARGET (a 6-byte instruction + 2-bytes for the branch = 8). Note that we have to reverse the logic from “beq” to “bne” for this to make sense.

    This is only possible if GCC can tell how far away the branch targets are. Fortunately, we’re able to annotate instructions in the machine description file (moxie.md) with their length; currently either 2 or 6 bytes long. GCC then processes these annotations to determine branch distances.

    Now that we know branch distances at compile time, the compiler can do smart instruction selection to deal with out-of-range branches. The changes were quite simple and limited to the .md file in the backend.

The savings after this ISA change are substantial. For instance, the consumer_jpeg_c benchmark in MoxieDev is more than 15% smaller when we use PC-relative branches! The u-boot binary, on the other hand, is “only” 7% smaller.

I hope to commit these changes to SRC and GCC once the GCC port is merged upstream. Fingers crossed…

Processor Exceptions

Thursday, April 2nd, 2009

My first go at exceptions is working well. The basic idea is that moxie will have a single exception handling routine whose address lives in special register 1. You set the exception handler like so:

void install_handler(void (*handler)(void))
{
  printf ("Installing handler 0x%x\n", (unsigned) handler);
  asm("ssr %0, 1" : : "r" (handler));
}

When the processor hits an exception, it performs a standard function call to the handler. We return from the handler just like it was any old function, since it currently uses the standard C ABI. The exception type will be found in special register 2. The current exception types are as follows:

#define MOXIE_EX_DIV0 0 /* Divide by zero */
#define MOXIE_EX_BAD  1 /* Illegal instruction */
#define MOXIE_EX_IRQ  2 /* Interrupt request */
#define MOXIE_EX_SWI  3 /* Software interrupt */

In the case of IRQ and SWI exceptions, the interrupt number will be found in special register 3. I don’t have instructions yet to disable or enable interrupts, but those are an obvious next step. Here’s a sample exception handler:

void handler (void)
{
  int et;

  /* Get the exception handler from special register 2.  */
  asm("gsr %0, 2" : "=r"(et) : "0"(et) );

  switch (et)
    {
    case MOXIE_EX_DIV0:
      printf("DIVIDE BY ZERO EXCEPTION\n");
      break;
    case MOXIE_EX_BAD:
      printf("ILLEGAL INSTRUCTION EXCEPTION\n");
      break;
    case MOXIE_EX_IRQ:
      {
        int irq;
        asm("gsr %0, 3" : "=r"(irq) : "0"(irq) );
        printf("INTERRUPT REQUEST %d\n", irq);
      }
      break;
    case MOXIE_EX_SWI:
      {
        int swi;
        asm("gsr %0, 3" : "=r"(swi) : "0"(swi) );
        printf("SOFTWARE INTERRUPT REQUEST %d\n", swi);
      }
      break;
    default:
      printf("UNKNOWN EXCEPTION 0x%x\n", et);
      break;
    }
}

The handler for DIV0 and SWI may also want to know where the exception happened. This can be determined by pulling the return address off of the stack and subtracting the appropriate instruction length (2 for div and 6 for swi).

I’ve implemented support for this in the qemu port, and the test directory in moxiedev contains a simple program to exercise this new functionality. I think we’ll want to hook up some peripherals in qemu to the IRQ system soon.