Skip to content

ZetaVM, my new compiler project

April 29, 2017

Like many of you I’m sure, I’ve wanted to create my own programming language for a long time. I think it’s a common aspiration for many programmers, to create a language with all the features we love and none of the ones we hate. Our own “ultimate programming language”, created in our own image. What I’ve come to realize over time, however, is that I’m actually not quite sure what should go into my ultimate programming language. I have a lot of ideas, but they don’t necessarily fit into a coherent whole. It’s also the case that my ideas about what the “ultimate programming language” should be like keep changing as I gain more programming experience and get exposed to new ideas.

My PhD was in compiler design, and this is something I truly enjoy playing with. As such, I’ve decided that my next programming language project wouldn’t be a programming language per-se, but actually a compiler, a platform to create new programming languages. I’m doing this in part because I enjoy it, and it’s something I feel confident I’m good at, but also because I think I can build a platform that will make it much easier for myself and others to do programming language design and experimentation. ZetaVM is going to have, as one of its main design goals, to make creating new programming languages very accessible. It will make it possible for anyone who’s mastered a language such as Python to create a language of their own, in less than 2000 lines of code. Not only that, but ZetaVM will instantly make your language decently fast. It will have JIT optimizations suitable to languages such as Python/Lua/JavaScript, and will instantly give you fast objects, type-specialization, etc.

ZetaVM is a virtual machine for dynamic programming languages. It will provide native support dynamic typing and most common data types found in Python/Lua/JS/Ruby, such as strings, extensible objects, extensible arrays. What makes it particularly easy to get your own language running on this VM is that Zeta’s Intermediate Representation (IR) is representable as a textual format similar to JSON. This makes it fairly trivial for you to write, say, a Python parser for your new language, and generate Zeta IR in a textual format at the output. You don’t have to worry about implementing dynamic typing, or register allocation, or garbage collection, or arrays and objects, all of that is done for you. I’ve created a simple language called Plush (JS and Lua’s bastard child), which demonstrates how this can be done, and serves to help me bootstrap and test the system.

Beyond making it easy for myself and others to create programming languages, Zeta will be a platform for me to try some bold new ideas in the design space of programming languages and JIT compilers. I would like to try and tackle one of the biggest issues plaguing programming languages today, which is that of code rot. My goal is to eventually freeze the IR and APIs provided by Zeta, so that code that runs on Zeta today might have a chance of still working in 20 years, without any changes. This goal is ambitious, but I have some ideas which I believe might make it work.

Finally, one big disclaimer I should give is that Zeta is still a young and immature project. In its current state, Zeta is experimental and will have many breaking changes, as most new languages/platforms do. Zeta also currently only has a naive interpreter which walks the object-based IR and is dog slow, about 200K instructions per second. I’m currently working on an interpreter that will compile the object-based IR into a lower-level internal IR. This interpreter will use Basic Block Versioning (BBV) and self-modifying code. I believe it should realistically able to reach speeds of 100MIPS within the coming months. My plan after that is to build a lightweight JIT which will sit on top of the optimizing interpreter and compile the internal IR to machine code.

Advertisements
41 Comments
  1. Brian permalink

    Z – ok that is being a bit pretentious!

    My ideal language would be one that just does it, you just describe what you want to do (voice input if needs be).

    Oops we are all out of a job!

    • > Z – ok that is being a bit pretentious!

      The name was inspired by Turing’s zeta-function machine (http://www.turing.org.uk/sources/zetamachine.html). It’s also one of the few short names that wasn’t taken.

      > My ideal language would be one that just does it, you just describe what you want to do (voice input if needs be). Oops we are all out of a job!

      Fortunately, Zeta is currently aimed mostly at people who program for enjoyment rather than at commercial users :)

  2. > What makes it particularly easy to get your own language running on this VM is that Zeta’s Intermediate Representation (IR) is representable as a textual format similar to JSON.

    This is something I really wish many languages would do. If interpreters/compilers accepted a simple-to-generate format, like s-expressions (or JSON, or even XML!), then lots of redundant busy-work could be avoided (custom parsers for each interpreter, compiler, linter, static analyser, type checker, syntax highlighter, etc.)

    Unfortunately any mention of such things tends to be misunderstood as using Lisp (which is like claiming that Python is a form of C, since they both represent code as ASCII text!), then dismissed as ‘too many parentheses’ (despite the fact it only has to be an internal format!)

    • Zeta will at least eliminate a lot of the work that does into code generation and performance optimization. It will be easier to use than LLVM, and provide more building block for dynamic languages.

      > Unfortunately any mention of such things tends to be misunderstood as using Lisp

      Semi-related: Zeta will allow programs to introspect their own IR, and to generate new IR on the fly. If you generate objects with the right set of properties, you can call that code as a normal function.

    • Most languages (the ones that I know of, anyway) have their syntax described by a context-free grammar. The grammar’s terminals are the keywords in the language and a few other things (identifiers, constants, …). So we already have a simple way of describing a language’s syntax. This can be exploited to create tools that generate parsers, syntax highlighters, etc. from scratch, given the language’s grammar. For instance, Lex+Yacc accept a CFG and generate a parser for it (that’s only half-true; Some CFGs can be problematic). But from what I’ve seen, most compilers/interpreters use a custom, hand-written parser, even though general tools such as Yacc are already available. Why? I’m not sure. Performance, maybe? Maybe Maxime can elaborate on that.

      • > Most languages (the ones that I know of, anyway) have their syntax described by a context-free grammar.

        That’s true *in theory* (although indentation-sensitivity can complicate things); in practice many languages/implementations progress faster than their associated documentation/standards. Maintaining and validating grammars, generating parsers, etc. is also just busy-work, since that job’s already been solved more robustly by the compilers/interpreters, if only they’d give us access to the results of their parse (not necessarily an *abstract* syntax tree, since that may be very implementation-specific).

        One problem I’ve hit when trying to analyse Haskell code in particular is widespread use of preprocessors, like cpp and co. Since these perform arbitrary text substitution, they break parsing pretty badly, and figuring out how to handle them is usually buried somewhere in a language implementation (GHC, Cabal and Stack, in the case of Haskell) :(

      • These tools are relatively painful to use. Lex/Yacc in particular is very limited: it only recognizes LR(k) languages. And, which is worse (because LR(k) isn’t that restrictive), you have to express your language in an LR(k) grammar. That can be painful (shift-reduce conflicts anyone?).

        Fortunately, there are better alternatives, you can find a big list here: https://github.com/norswap/whimsy/blob/master/doc/autumn/notes/parsing-tools.md
        (I now realize that GLR parsers, mostly Bison and Elkhound from the top of my head, are conspicuously absent from this list… Gotta add them.)

        Another reason is that most parsers do not handle state. So if your language is a bit contextual (e.g. depends on whitespace layout, has semicolon insertion, herestrings, yadda yadda), you can’t use those tools (shameless plug: you can my Autumn tool though).

        A final and under-appreciated reason: maybe the right tool isn’t available in your implementation language. This can be particularly a problem if you want to bootstrap your compiler: why bother with a parsing library or parser generator if you are going to have to rewrite the parser in your new language, that obviously doesn’t have those kind of tools.

        Actually… one more reason: it’s more fun to write by hand. Use a tool and you’re gonna have frustrating experiences. One possible reason is that you do not understand the underlying model enough (and documentation tends not be a strong suit). Another reason is that grammar are just hard to debug. I’ve looked around, but have never found a tool that makes it really easy to debug grammars. (And we’re not even talking about generating nice parser errors here.) The best way (and the one I advocate) is just relentless bottom-up testing: does my integer literal parse correctly, identifiers, … ? Okay, then do my function calls parse correctly? My types? My methods? etc, etc… Not to say that writing your own parser isn’t going to have its own frustrations, but heh, at least it’s your own mess.

  3. That’s a great project :)

    Of course, I would think so because my (ongoing) PhD thesis is pretty much on the same subject.
    (Some code: https://github.com/norswap/whimsy)

    Differences: I have a front-end and am in the process of building a middle-end, with no plan for a back-end during the thesis. So maybe there are opportunities for collaboration.

    At any rate, I’d love to discuss some design ideas. Is there an email where I could reach you?
    You can contact me at norswap@gmail.com.

  4. Any inspiration taken from Racket? How does it compare to your ideas?

    As far as I understand it one of the challenges often discussed by Racket developers is how to get different languages that run on it’s VM to cooperate. Might be something worth nipping in the bud for Zeta. Any thoughts?

    • Some. The idea of a multi-language VM is obviously not new. I will have a package manager working within a few months, and people will hopefully be able to import packages written in other languages fairly seamlessly.

  5. Would you mind contrasting ZetaVM with Lua-Terra: http://terralang.org/ ? One of the nice things about Lua-Terra is that it JITs all the way down through LLVM to machine code (both x86 for CPUs and PTX for GPUs), which is really important if you’re trying to custom generate code for heterogeneous architectures. Will ZetaVM have similar functionality? (Asking as a prospective user. :)

    • Seems like very different platforms and use cases, but one of the things I would have to do is transparent compilation to GPUs. Have kind of a parallel for construct where a given function can automatically be compiled for and run on GPUs if available.

  6. Antony Riakiotakis permalink

    Will this be released with source code on a public repository or a tarball?

  7. Do you think your project is similar to what Graal-Truffle is trying to achieve? Is the main difference the focus on dynamic languages?

    • There are a lot of differences, starting with the fact that Zeta won’t be based on the JVM, and will be aimed at dynamic languages. I think Zeta may also be easier to use for people with less language development experience.

  8. Kashyap Chatamballi permalink

    The concept sounds awesome! I was a little surprised to see the implementation – https://github.com/maximecb/zetavm – in cpp though! … To me CPP immediately spells hard to compile on new platforms :(

    • Why would you say C++ is hard to compile on new platforms?

      • For one, I have not had a successful experience with compiling LLVM on windows; Even on mac, the experience has been less than pleasant. Again, all this is from a couple of years ago and all this is pretty subjective. In your case, I was almost expecting “D” :) – hence the surprise. I’d love to understand the options you considered and your reasoning for zeroing on C++.

        • The D community has been very nice to me, but unfortunately, using D has limited the amount of adoption I could get on the Higgs project. C++ has changed in recent years and become a nicer language.

  9. Why not instead reuse Gambit’s GVM and build upon it?

    • Because it wouldn’t do what I want/need. Because I like implementing VMs and wish to go in a more experimental direction. Because I don’t feel comfortable modifying and maintaining a huge codebase I’m not familiar with.

  10. Any plans to support WebAssembly?

    • You mean as an input or as a target? It should be feasible to write a webassembly parser/front-end. As an output target, maybe when WASM is more mature.

  11. Are you familiar with Paul Phillips’ talk at PNW Scala 2013 [1] ? Your post reminded me of many of the themes in the talk.

    By the way, one of the links (“a textual format”) is broken.

    [1] https://youtu.be/TS1lpKBMkgg

  12. PeterO permalink

    I read about your project with interest. But, nowhere do I see any reference to Smalltalk. Smalltalk has inspired so many languages and productivity ideas for IDEs. Many new language developments and features being added to IDEs have been pioneered in Smalltalk. If you want to make your ZetaVM the best it can be, you need to ensure you can do the following:

    #1. Support Smalltalk and a rich meta data model that’s integrated with the IDE.
    #2. Support multi-threading so your VM can be run efficiently on today’s multi-CPU machines.

    Smalltalk is more than 2x more productive than Python, Ruby and PHP.
    Smalltalk is more than 3x more productive than JavaScript.

    There are many ways to measure productivity and code quality. Ultimately, total effort per function point is a useful one. On page 46 of http://namcookanalytics.com/wp-content/uploads/2013/07/Function-Points-as-a-Universal-Software-Metric2013.pdf a table sets out number of months to implement a 1,000 function point program. Here’s a few metrics for dynamics and static languagues:

    Smalltalk 21 coding months
    Ruby 46
    C# 51
    Python 53
    C++ 53
    Java 53
    PHP 53
    JavaScript 71
    C 128
    Assembly 213

    There are too many good features in Smalltalk to ignore in a project such as ZetaVM.

  13. Good luck with the language.

  14. This is relevant to my interests :)

    Is this the best place for questions? I don’t wanna clog up your GH issues for trivial things without asking first.

    Anyway, my current question is about the textual format, I can’t seem to see any images in the tests dir that import symbols, only exporting, and seemingly implicits for $true and $false. Is this not done yet? Could you give a quick example of how you think it would work?

  15. Mike Brown permalink

    I too was surprised you’ve opted for C++. There are some noted challenges programming a VM in a language without GC. Though I can appreciate the comment D has limited the acceptance of Higgs.

    I am curious to why you’ve gone the raw pointer, and pointer arithmetic route with Zeta?

    • Pointer arithmetic is necessary in a JIT compiler. Raw pointers, only because I haven’t taken the time to think about memory management. I’ve been focused on getting an MVP up and running.

      • Mike Brown permalink

        I wasn’t able to run the benchmarks, though im on a 32bit system, is this 64bit only?

        > Pointer arithmetic is necessary in a JIT compiler

        Very interested in hearing more about this and the key internals of a JIT? Maybe even a blog post if you get stuck for post ideas :)

  16. I just wanted to make sure that you have heard of LES (http://loyc.net/les/) and Loyc trees (http://loyc.net/loyc-trees) which might be relevant when choosing text formats for IR or for the input language.

  17. Edward de Jong permalink

    Your intermediate form is terrific, a huge improvement over previous work. However, it would be way faster to execute if you stopped using strings for your verbs, and just assumed a set of integer constants for the opcodes, so instead of using “push” use PUSH which is some constant like 3, or if you insist OP_PUSH, so that effectively all the opcodes are integers then you can dispatch them much faster. If you imagine your data structure in RAM, as JSON in JS or AS3 or any other object oriented language, storing a string to represent one of perhaps 200 opcodes is terribly wasteful in terms of CPU cycles. If you store as a 32 bit integer, a single instruction can compare one to another, while if a string, you have more than 10x the clock cycles to set up the string comparison. Remember that intel chips, which are 99% of all desktops and server computers have a very awkward string compare, where you gotta load up the registers, clear the direction flag, etc., It is heavily disfavored in the intel architecture to use strings to store 200 different items. In fact, a single byte could probably hold your opcodes, but hey go ahead and use a 32 bit integer; if you make them ascending and consecutive, you can dispatch on your opcodes with an array of function pointers which will interpret blindingly fast. Strings are so prevalent in the wasteful world of JS, and JSON, but they are a crappy way to do it. their only advantage is that you can extend your opcodes without limit; but realistically, your opcodes will be finite and not likely to expand that often. I strongly object to the common and wasteful practice of using character strings to represent a small set of alternatives. In Modula-2 Wirth offered an enumerated type, which was implemented as an integer, and allowed you to have sets and arrays of enumerated types, which eliminated errors. It is such a shame that the JS mindset refuses to embrace inventions from 1972. After all, Javascript is nothing more than a hastily constructed clone of ActionScript 2, and inherited its weaknesses, which have gradually been remediated by careful tweaking. But still the omission of constants and enumerated types shouldn’t be carried forward by your Zeta intermediate language.

    • A few things. The first is that using string interning, string comparisons can be simple pointer comparisons. There is no need to even read the string in order to compare it with another string. The second thing is that I’ve already written an interpreter which compiles the bytecode representation into a flat memory layout, with 16-bit integer opcodes. It compiles this bytecode into a form that is less focused on user-friendliness, more optimized for compactness and fast execution.

      And lastly, the reason to use strings for opcodes in the textual bytecode is that it’s simpler and more user-friendly. Strings are immutable, they act as constants. If I were using some constant integers, then there’s a question of where those definitions come from. I wouldn’t want every single bytecode file to have to define PUSH=10,POP=11, etc. I don’t love the idea of having some hidden constant definitions that pollute the global name space either.

  18. Nicolas permalink

    For a new plateform, make it an easy to mix differente langage together is a way to ease migration a new technology, to rewrite a code piece by piece.

    Lex/Yacc technology are very bad to generate good error message. Speed of parsing is a none sense today, we want immediat feedback on the code, and good and precise error message.

    UTF8 is a must, and is rarely done right.

    Multiprocessor and message passing is a must like in golang or in http://www.seastar-project.org/ (ultra fast c++ lib).

    Don’t forget SIMD code, it’s the fastest way to crunch number in today cpu.

  19. kenneth permalink

    Though I can appreciate the comment D has limited the acceptance of Higgs.
    A final and under-appreciated reason: maybe the right tool isn’t available in your implementation language.

Trackbacks & Pingbacks

  1. Zeta’s JITterpreter | Pointers Gone Wild

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: