nelhage debugs shit

lightly-edited tales of debugging systems

Things I learned writing a JIT in Go

I was at Gophercon last week, and the last day, Saturday, was a hack day where people all sat down and just worked on projects, largely in Go. I decided on a whim to play with doing runtime code generation in Go. I’ve done some toy JIT work before in C and C++, so I’m pretty familiar with the space, and it seemed like something fun I hadn’t heard anyone playing with in Go.

After a few days of hacking, I produced some working code, complete with a PoC brainfuck JIT, entirely written in Go. I figured I’d write up some of the many things I learned in the process.

Go/plan9’s assembler is weird

Go has its own assembler, inherited from Plan 9. Here’s a sample, defining a function to add two numbers:

// add(x,y) -> x+y
TEXT ·add(SB),0,$0-24
        MOVQ x+0(FP), AX
        ADDQ y+8(FP), AX
        MOVQ AX, rv+16(FP)
        RET

Go assembly is not translated directly to x86 machine code, as you may be used to. Instead, the intermediate form between the compiler and the linker (or between the assembler and the linker) is an “incompletely defined instruction set” which the linker then does instruction selection and code generation over. So, while many instructions in a Go assembly file will map directly to x86 opcodes, many others are pseudo-instructions that may be turned into one of several concrete instruction sequences at link time.

The linker also does more aggressive code transformations. For example, Go’s variable-size and/or split stacks are implemented nearly entirely in the linker. The $0-24 above says that this function uses 0 bytes of stack, but has 24 bytes of arguments (which live on the caller’s frame). The linker takes this information and inserts stack-expansion preambles as needed. Because the linker sees the whole program, it can even do interprocedural analysis, computing the stack needed by entire call chains of leaf functions and optimizing accordingly.

Notice also that the function being defined is ·add. Go symbols end up in the resulting objects under their fully-qualified names. However, since / and . are punctuation characters in the C and assembly syntax, the tools have been extended to accept U+00B7 MIDDLE DOT (·) and U+2215 DIVISION SLASH (∕) in the input, which get converted to “normal” dots and slashes, as Russ Cox explains on the mailing list.

Note the references to the SB and FP registers, which don’t exist on x86. SB is the “static base” pseudo-register, which refers to the base of memory — all references are required to be relative to a register, and SB is how you specify absolute addresses. The linker will select an appropriate x86 addressing mode. Similarly, FP is the pseudo-register pointing to the base of our stack frame; typically this turns into an access relative to %rsp, with the offset adjusted appropriately.

Go and the plan9 C compilers have their own ABI

Go, and C code compiled by 6c, don’t use the “normal” SysV ABI and calling convention, but have their own. It’s striking in its simplicity compared to what you may be used to:

  • All registers are caller-saved
  • All parameters are passed on the stack
  • Return values are also returned on the stack, in space reserved below (stack-wise; higher addresses on amd64) the arguments.

In general, this seems to be consistent with a pattern among the plan9 tools of preferring simplicity and consistency over marginal performance gains. It’s really hard to say that they’re wrong for making that choice.

You can see this at work in the ·add example above, where we pull arguments from 0(FP) and 8(FP), and return the sum by writing back into 16(FP). This also explains why we had a 24-byte argument frame, despite only accepting two eight-byte arguments — the return value also gets a slot in the argument frame.

Go funcs are represented as a pointer to a C function pointer

You can read Russ’s writeup for more information, but Go (as of 1.1) represents funcs as a pointer to a C function pointer. On function entry, a well-known “context” register (%rdx on amd64, or DX as Go calls it) holds a pointer to that function pointer. This is used to implement closures — the C function pointer is followed by context information, and the pointed-to assembly code knows how to extract that information from DX and then proceed onwards.

The same representation lets you handle the case where you have an r io.Reader and capture f := r.Read — in that case, f will be a pointer to a block of memory like so:

f --> | io.Reader.Read·fm
      | { r ... }

where io.Reader.Read·fm (note the middle dot — because of the above-mentioned translation, it is impossible to refer to this symbol from human-written code, even in C or assembly) is a compiler-generated stub that knows how to extract r out of DX and invoke Read on it.

You can see how this works in great detail by reading the compiler output.

Static linking has many implications

Because Go is always statically linked, the linker can see the entire program at link time, and the toolchain takes advantage of that fact. The interprocedural analysis on stack depth I mentioned above is one example. Go also assumes the entire symbol table is available at link time, and embeds a copy in the binary, which is used by, among other things, the garbage collector when tracing memory!

When working on my JIT, I had code that generated calls from JITed code back into Go. I found Go occasionally printing errors like

runtime: unexpected return pc for io.Writer.Writer·fm called from 0x......

I dug into the runtime to see if I could somehow inform the runtime of the address of my JITed code, but nope: That error comes from code that does a single binary search over the symbol table and bails if the caller PC isn’t there.

Once Go gets dynamic linking (I understand it’s on the way!), that will probably also expand the range of things I can safely do from a JIT :)

cgo is slow

I’d heard this, but I got a chance to observe it first-hand. In an effort to make my JIT safer, I co-opted the cgo runtime to run JITed code on a C stack via cgo. The overhead difference is huge (for the trivial go -> jit call):

BenchmarkEmptyCall          500000000               3.43 ns/op
BenchmarkEmptyCgoCall       50000000                60.6 ns/op

The overhead of calling back into Go from cgo is even greater (this benchmarks go -> jit -> go call chains):

BenchmarkGoCall     500000000              5.07 ns/op
BenchmarkCgoCall    10000000               250 ns/op

It’s bad enough that, if you use the cgo version, my brainfuck JIT is actually slower than a straight-up interpreter on many simple programs (. and , are implemented via calls back into Go, so the jit -> go overhead comes into play):

BenchmarkCompiledHello        500000              3405 ns/op
BenchmarkCompiledHelloCgo     500000              5861 ns/op
BenchmarkInterpretHello       500000              5679 ns/op

Go’s testing and benchmarking tools are really fun

Just see the above section for the benchmarking tools!

For the JIT, I ported a simple C++ x86 assembler I’d written for another project. While I never actually wrote a real test suite for the C++ one — it seemed far too annoying — the Go one actually has decent test coverage, and I found many bugs in the C++ version, because Go made it so easy to test.

In conclusion

This was a lot of fun! I learned a lot about Go’s runtime and toolchain, and also a bit about x86!