A Better VM

06 Oct 2016

For the last couple of years I’ve been working with Clojure, a Lisp which runs on top of the JVM. My reservations with Clojure itself, and Clojure’s maintainership, are at this point fairly well established. However I’d be lying if I said that after thinking long and hard about the way I want to develop software I’ve come up with anything incrementally achievable and better. Clojure’s syntax is convenient. Its datastructures are clever. Its immutable defaults are sane with respect to any other language. Its integration with the JVM, while fatal to its own semantics, ensures unmatched leverage. In short, I don’t know if it’s possible to do better atop the JVM.

But why use the JVM?

The JVM itself is a wonder of engineering, compiler technology and language research. But the JVM standard library is deeply mutable and makes assumptions about the way that we can and should program that aren’t true anymore. While the JVM itself may be awesome, I’m just not convinced that the object/class library it comes with is something you’d actually want around as a language designer. Especially as the designer of a functionally oriented language, or a language with a different typing/dispatch model than Java’s.

The conclusion I personally came to was that, faced with what already exists on the JVM, I couldn’t muster the wherewithal to work in that brownfield. I needed a greenfield to work with. Somewhere I could explore, have fun exploring and make mistakes.

DirtVM

I honestly don’t remember why I chose this name, but it’s stuck in my head and adorns the git repo where all of this lives. Dirt isn’t something I’m gonna be releasing on GitHub, although you can totally browse the code on git.arrdem.com. Simply, Dirt isn’t intended for anyone else to use or contribute to now or in the foreseeable future. It’s my experiment, and I think that my total control over the project is important to, well, finishing it sometime.

So what’s the goal?

Versioning is something I think is really important both at the package and compilation unit level. I previously wrote about artifact versioning, and experimented with versioned namespaces in my fork of Clojure. Unfortunately, as a user experience versioned namespaces didn’t pan out. There were too many corner cases, and too many ways that partial recompilation could occur and generate disruptive, useless warnings. So versioning is one major factor in Dirt’s design.

Another is functional programming. After working with the JVM, I’m pretty much convinced that design by inheritance is just flat out wrong. While I was working with Dr Perry, he shared an awesome paper with me: Object-Oriented Programs and Testing. The point of the paper essentially is that testing inheritance structured programs adequately is really hard and everybody does it wrong. Testing mutable programs is already hard enough, and single inheritance poses so many design difficulties that it just doesn’t seem worthwhile. In my time using Clojure, I found that I never actually needed inheritance. Interfaces sufficed, and functions which operated against interfaces were better captured as functions in namespaces than as inherited static functions packaged away in some base class or helper package.

The failure mode of interfaces as presented by Java however is that they are closed. Only the implementer of a type can make it participate in an interface. This is far too restrictive. Haskell style typeclasses get much closer to my idea of what interfaces should look like, in terms of being open to extension/implementation over other types and carrying contracts.

Which brings me to my final goal: contracts or at least annotations. I like types. I like static dispatch. While I can work in dynlangs, I find that the instant my code stops being monomorphic, at least with respect to some existential type constructed of interfaces/protocols, I start tearing my hair out. Types are great for dispatch, but there’s lots and lots of other static metadata about programs which one could potentially do dataflow analysis with. For all that I think Shen is batshit crazy and useless, the fact that it provides a user extensible type system which can express annotations is super interesting. Basically I think that if programmers are given better-than-Clojure program metadata, and tools for interacting with that metadata, it would be possible to push back against @Jonathan_Blow’s excellent observation that comments and documentation always go stale and become worthless.

So what’s the architecture?

DirtVM so far is just a collection of garbage collectors and logging interfaces I’ve built. All of the rest of this is hot air I hope to get to some day.

The fundamental data architecture is something like this:

Module:
  Name:
    Group, Package
  License:
    Owner, 
    Date, 
    Terms: (GPL1 | GPL2 | EPL | MIT | Other)
  Version:
    Major,
    Minor, 
    Patch
  Metadata: 
    (bag of attributes)
  Dependencies:
    List[(Name, Version)]
  Exports:
    List[Namespace]

Namespace:
  Name
  Metadata
  Imports: List[Import]
  Exports:
    Types, Ifaces, Fns

Import:
  Namespace, Alias
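
As a rough sketch only, the outline above could be written down as Python dataclasses. None of this is Dirt code: the field types, the string-valued date, the split of namespace exports into three lists, and the example values (org.example, demo.core and so on) are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Tuple


class Terms(Enum):
    GPL1 = "GPL1"
    GPL2 = "GPL2"
    EPL = "EPL"
    MIT = "MIT"
    OTHER = "Other"


@dataclass(frozen=True)
class Name:
    group: str
    package: str


@dataclass(frozen=True)
class Version:
    major: int
    minor: int
    patch: int


@dataclass(frozen=True)
class License:
    owner: str
    date: str  # assumption: the outline does not say how dates are stored
    terms: Terms


@dataclass(frozen=True)
class Import:
    namespace: str
    alias: str


@dataclass
class Namespace:
    name: str
    metadata: Dict[str, str] = field(default_factory=dict)
    imports: List[Import] = field(default_factory=list)
    # Exports, split by kind as in the outline (names only, for brevity).
    types: List[str] = field(default_factory=list)
    ifaces: List[str] = field(default_factory=list)
    fns: List[str] = field(default_factory=list)


@dataclass
class Module:
    name: Name
    license: License
    version: Version
    metadata: Dict[str, str] = field(default_factory=dict)
    dependencies: List[Tuple[Name, Version]] = field(default_factory=list)
    exports: List[Namespace] = field(default_factory=list)


# Hypothetical example module with one namespace and one dependency.
core = Namespace(
    name="demo.core",
    imports=[Import(namespace="util.strings", alias="s")],
    fns=["main"],
)
demo = Module(
    name=Name("org.example", "demo"),
    license=License("someone", "2016", Terms.MIT),
    version=Version(0, 1, 0),
    dependencies=[(Name("org.example", "util"), Version(1, 2, 3))],
    exports=[core],
)
```

The only point of the sketch is that every dependency is a (Name, Version) pair, so the version is part of what a module asks for rather than something resolved behind its back.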

The essential idea is that because versioning is so hard, it’s easier to fix the runtime to allow co-hosting of multiple versions of artifacts than to somehow try and solve the many, many difficulties of software versioning and artifact development. Java 9 modules look really good and come close to being an appropriate solution, but the Java team have abandoned the idea of versioned modules. In Dirt when code is compiled within a namespace it has access only to what has been explicitly imported by that namespace. Imports are restricted to the contents of the module and the module’s direct dependencies. It is not possible for a namespace to import from a module it depends on only transitively. This means that at all times a user is in direct control of what version of a function or a type they are interacting with. There is no uncertainty.
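
To make the import rule concrete, here is a minimal sketch in Python, assuming a registry keyed by (name, version). The names ModuleId, Module, ImportNotVisible and resolve are hypothetical, as are the modules a, b, c and d. The lookup order (own module first, then direct dependencies) is also an assumption, since nothing above says what happens when two direct dependencies export the same namespace.

```python
from typing import Dict, FrozenSet, NamedTuple, Tuple


class ModuleId(NamedTuple):
    name: str
    version: Tuple[int, int, int]


class Module(NamedTuple):
    ident: ModuleId
    dependencies: FrozenSet[ModuleId]  # direct dependencies only
    namespaces: FrozenSet[str]         # namespaces this module exports


class ImportNotVisible(Exception):
    """Raised when a namespace is not in the importer's own module
    or in one of its direct dependencies."""


def resolve(registry: Dict[ModuleId, Module],
            importer: ModuleId, namespace: str) -> ModuleId:
    """Return the module that provides `namespace` as seen from `importer`.

    Only the importer itself and its direct dependencies are searched;
    transitive dependencies are never consulted.
    """
    me = registry[importer]
    for ident in [me.ident, *sorted(me.dependencies)]:
        if namespace in registry[ident].namespaces:
            return ident
    raise ImportNotVisible(f"{namespace} is not visible from {importer.name}")


# Two versions of c are co-hosted in the same runtime.
C1 = ModuleId("c", (1, 0, 0))
C2 = ModuleId("c", (2, 0, 0))
B = ModuleId("b", (1, 0, 0))
A = ModuleId("a", (1, 0, 0))
D = ModuleId("d", (1, 0, 0))

registry = {
    C1: Module(C1, frozenset(), frozenset({"c.core"})),
    C2: Module(C2, frozenset(), frozenset({"c.core"})),
    B: Module(B, frozenset({C1}), frozenset({"b.core"})),
    A: Module(A, frozenset({B, C2}), frozenset({"a.core"})),
    D: Module(D, frozenset({B}), frozenset({"d.core"})),
}
```

In this setup a sees c.core from version 2 and b sees c.core from version 1, side by side. Module d depends only on b, so asking for c.core from d raises ImportNotVisible even though c is loaded.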

This gets a little sticky for datastructures. If module A depends directly on B and B depends directly on C, it’s possible that B will return into A a data structure, function or closure which comes from the module C. This turns out to work fine. Within a single module, protocol/interface dispatch is done only against the implementation(s) visible in that module scope. Because A has no knowledge at compile time of any such type from C, it can’t do anything with such a type except hand it back to B which can use it.

Types and interfaces are very Haskell style. Mutability will be supported, but avoided in the standard library wherever possible. Interfaces will be typeclass style pattern matching dispatch, not call target determined. This makes real contracts like Eq possible and extensible rather than being totally insane like non-commutative object equality. Types are just records, and will be internally named and distinct by package, version, namespace and name. This makes it possible to have multiple consistent implementations of the same interface against versions of the same type co-hosted. Much in the Haskell style, the hope is that for the most part interface dispatch can be statically resolved.
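
A small Python sketch of what "distinct by package, version, namespace and name" buys you, under stated assumptions: TypeId, Record, Interface and the Point types are invented for this example, and the dictionary lookup at call time merely stands in for dispatch that Dirt would hopefully resolve statically.

```python
from typing import Any, Callable, Dict, NamedTuple, Tuple


class TypeId(NamedTuple):
    package: str
    version: Tuple[int, int, int]
    namespace: str
    name: str


class Record(NamedTuple):
    type_id: TypeId
    fields: Dict[str, Any]


class Interface:
    """An open interface: anyone may add an implementation for any type,
    not just the author of that type."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._impls: Dict[TypeId, Callable[..., Any]] = {}

    def implement(self, type_id: TypeId, fn: Callable[..., Any]) -> None:
        self._impls[type_id] = fn

    def __call__(self, value: Record, *args: Any) -> Any:
        impl = self._impls.get(value.type_id)
        if impl is None:
            raise TypeError(f"{self.name} has no implementation for {value.type_id}")
        return impl(value, *args)


# Two versions of the "same" type are distinct types with distinct fields.
POINT_V1 = TypeId("geometry", (1, 0, 0), "geometry.core", "Point")
POINT_V2 = TypeId("geometry", (2, 0, 0), "geometry.core", "Point")

show = Interface("Show")
show.implement(POINT_V1, lambda p: f"({p.fields['x']}, {p.fields['y']})")
show.implement(
    POINT_V2,
    lambda p: f"({p.fields['x']}, {p.fields['y']}, {p.fields['z']})",
)

p1 = Record(POINT_V1, {"x": 1, "y": 2})
p2 = Record(POINT_V2, {"x": 1, "y": 2, "z": 3})
```

Here show(p1) and show(p2) each pick the implementation written for their own version of Point, so both versions can live in one process without either implementation having to guess which field layout it was handed.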

Why record types instead of a Lua or Python style dynamic object dispatch system? Because after working with Clojure for a while now it’s become clear to me that whatever advantages dynamic typing may offer in the small are entirely outweighed by static typing’s advantages in the large, and that packaging functions into objects and out of namespaces buys you nothing. While dynamic typing and instance dispatch can enable open dispatch, they also defeat case based reasoning when reliability is required. Frankly my most reliable Clojure code would have translated directly to Haskell or OCaml. Refactoring matters, especially as projects grow. Being able to induct someone else into your code matters. Being able to produce meaningful errors someone understands and can trace to a root cause requires information equivalent to types. Dynamic typing just obscures static contracts and enables violations to inadvertently occur, leaning on exhaustive test coverage. Dynamic typing introduces concrete runtime costs, and slows down program evolution because building tools is simply harder. Tools matter, so static typing ho.

In addition to interfaces/typeclasses, there are also fns (fn in Clojure) which are statically typed, non-extensible, single arity procedures. Despite impurity, the term function is used for these in keeping with industry convention.

The namespaces are very much Clojure style, because I’ve been really happy with the way that Clojure’s namespaces work out for the most part, and I want to support a language which isn’t that distant from Clojure, at least syntactically and in the namespace system. Import renaming is awful, but qualified imports are fine, which is why imports support aliases.

The ultimate goal of this project is to be able to present a virtual machine interface which is itself versioned. Imagine if you could write software which used dependencies themselves targeting incompatible/old versions of the standard library! That would solve the whole problem of language and library evolution being held to the lowest common denominator.

Dirt itself will be a garbage collected, mostly statically typed bytecode VM much like the JVM. It’s probably gonna get an SSA/dataflow bytecode level representation rather than a stack machine structure. But that level of detail I’ll figure out when I get to it. For now I’m having fun with writing C and garbage collectors. The next step will probably be some pre-dirt language to help me generate C effectively.

Here’s to log cabins and projects of passion!