Programming without Text Files

July 20th, 2013

When I was an undergrad, I remember losing marks on a programming assignment because the teacher didn't like that my source code spanned more than 80 columns, which didn't play nicely with his Emacs setup. I also remember having arguments with people about the use of tabs vs spaces for indentation, or whether curly braces should be placed on their own line or not. It's somewhat amusing to think that in 2013, decades after the first high-level programming languages (FORTRAN and LISP) were invented, we're still programming by writing source code in ASCII (or UTF-8) text files. In some ways, programming hasn't changed all that much since the days of the PDP-11.

This is somewhat strange to me, because as a compiler designer, I can tell you that the first thing any compiler typically does is to parse your source code into something known as an Abstract Syntax Tree, or AST for short. This step is necessary for the compiler to be able to extract any meaning out of your program. It isn't just compilers though. Static analyzers, documentation generators and lint tools need to do the same thing: textual source can't be manipulated directly. Working with text-based source has advantages in terms of simplicity and interoperability, but I believe that it will become advantageous to move to designs where source code is represented and edited directly as a hierarchical data structure instead. What I'm proposing is that programming languages of the future shouldn't be based on text files. I believe that working with streams of characters imposes limitations on language design, limits expressiveness and ends up hurting tool support and interoperability in the long run.
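As a small illustration of that parsing step, here is a sketch using Python's standard `ast` module. Python is only a convenient stand-in here, and the variable names in the sample statement are made up for the example.

```python
import ast

source = "total = price * (1 + tax_rate)"

# Parse the text into an abstract syntax tree.
tree = ast.parse(source)

# The single statement is an assignment whose value is a multiplication node.
assign = tree.body[0]
print(type(assign).__name__)           # Assign
print(type(assign.value).__name__)     # BinOp
print(type(assign.value.op).__name__)  # Mult

# Tools walk the tree, not the text: collect every identifier that appears.
names = sorted({node.id for node in ast.walk(tree) if isinstance(node, ast.Name)})
print(names)  # ['price', 'tax_rate', 'total']
```

Note that the parentheses from the text are gone: grouping is encoded by the shape of the tree itself.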

In terms of language design, working with linear streams of characters means we have to deal with silly things such as:

Designing a grammar that avoids ambiguity
Delimiters and operators not based on formal notation, but rather on ASCII characters
Limiting the length of keywords to save screen real-estate
Ambiguity in lookups and the names of symbols, name collisions, variable shadowing
Fixed-width fonts and inconsistent source code layouts
Significant vs insignificant whitespace
Tacked-on documentation and source annotations through special comments

As for expressiveness, consider one of the most powerful ideas behind LISP: that of macros. The C language has "macros", but they're based on text substitution. It's generally agreed that C's macro system is weak and error-prone, and its usage is often discouraged. In contrast, LISP macros are based on the idea that you have macro functions that are executed at compilation time. These functions generate an AST (new programming code) to be inserted into your program. The LISP macro system is very powerful because it allows you to create new language constructs that integrate relatively seamlessly into the language. These new constructs can form a domain-specific language (DSL) which you can then use to better express solutions to the problem you're trying to solve. Why don't more programming languages have a macro system like LISP's then? The problem is in large part that it's not practical to implement such a system in other languages, because they're text-based.
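To make the difference between text substitution and tree construction concrete, here is a rough sketch in Python (3.9 or later, for `ast.unparse`). The helpers `square_textual` and `square_tree` are hypothetical names invented for this example: the first mimics a C macro such as `#define SQUARE(x) x * x`, the second builds the product from the argument's parsed tree. It is not how LISP implements macros, only an imitation of the idea.

```python
import ast

def square_textual(arg: str) -> str:
    # Mimics C's `#define SQUARE(x) x * x`: paste the argument text in place.
    return "x * x".replace("x", arg)

def square_tree(arg: str) -> str:
    # Parse the argument once, then build a multiplication node around it.
    operand = ast.parse(arg, mode="eval").body
    product = ast.BinOp(left=operand, op=ast.Mult(), right=operand)
    return ast.unparse(ast.Expression(body=product))

textual = square_textual("a + 1")
structural = square_tree("a + 1")

print(textual)     # a + 1 * a + 1
print(structural)  # (a + 1) * (a + 1)

# With a = 2 the pasted text computes 2 + 1*2 + 1, not (2 + 1) squared.
print(eval(textual, {"a": 2}))     # 5
print(eval(structural, {"a": 2}))  # 9
```

The textual version silently changes meaning because operator precedence is decided only after the paste, whereas the tree version cannot regroup its operands.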

LISP source code is also text, but its parenthesized format is very regular: it represents an AST in the form of nested linked lists. Programmers can generate and manipulate code in the same nested representation internally, and even execute generated source code on-the-fly using the `eval` function. One of the main issues people have with LISP-like languages is that of readability. I believe it should be possible to improve upon LISP by bypassing the parenthesized representation and programming by editing the underlying data structures directly instead. At this point I should clarify that when I speak of programming languages that aren't text-based, I'm not suggesting visual languages you edit using a giant touch interface as seen in Minority Report, nor do I mean languages based on boxes and arrows that require two dozen mouse clicks to implement one addition. I'm thinking about enhanced Integrated Development Environments (IDEs) you can use with a good old keyboard, in a workflow fairly similar to what you're already used to.
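The "code as nested lists" idea can be imitated in a few lines of Python. This is a toy sketch, not a LISP implementation: the `evaluate` and `power` functions and the three-operator table are assumptions made up for the example.

```python
import operator

# Hypothetical operator table for this toy evaluator.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def evaluate(expr):
    """Evaluate a program written as nested lists, e.g. ["+", 1, ["*", 2, 3]]."""
    if isinstance(expr, (int, float)):
        return expr
    op, *args = expr
    values = [evaluate(arg) for arg in args]
    result = values[0]
    for value in values[1:]:
        result = OPS[op](result, value)
    return result

def power(base, n):
    """A code generator: returns a new program that multiplies `base` n times."""
    return ["*"] + [base] * n

program = ["+", 1, ["*", 2, 3]]
print(evaluate(program))  # 7

generated = power(["+", 1, 1], 3)
print(generated)            # ['*', ['+', 1, 1], ['+', 1, 1], ['+', 1, 1]]
print(evaluate(generated))  # 8
```

Because the program is ordinary data, the generator needs no string formatting or parsing: it builds a list, and the evaluator runs it.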

What would this buy us, besides having a language that is potentially more LISP-like? For one, if such a language were to have macros, they could have an even more seamless integration into the language than what LISP allows. These constructs could have their own customized visual representations. They could also have additional semantic information attached for better integration with your IDE and compiler. Imagine being able to design a calculus-oriented DSL which is visually represented using mathematical notation, has associated error-detection and autocompletion in your IDE for quick editing and is automatically optimized based on rules you've made known to the compiler. Imagine being able to request the AST for a function so that you can transform and instrument it on the fly. Now imagine being able to visually format source code any way you want, without having to worry about tabs, spaces, curly braces, commenting styles or the width of other people's displays.
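Instrumenting a function through its AST can already be approximated today, though only by re-parsing text first. The sketch below uses Python's `ast.NodeTransformer`; the `area` function, the `RecordCalls` class and the `calls` list are hypothetical names chosen for the example.

```python
import ast

source = """
def area(w, h):
    return w * h
"""

class RecordCalls(ast.NodeTransformer):
    """Insert `calls.append(<function name>)` at the top of every function."""

    def visit_FunctionDef(self, node):
        self.generic_visit(node)
        record = ast.parse(f"calls.append({node.name!r})").body[0]
        node.body.insert(0, record)
        return node

# Transform the tree, then compile and run the instrumented version.
tree = ast.fix_missing_locations(RecordCalls().visit(ast.parse(source)))
calls = []
namespace = {"calls": calls}
exec(compile(tree, "<instrumented>", "exec"), namespace)

print(namespace["area"](3, 4))  # 12
print(calls)                    # ['area']
```

In an environment where the tree is the primary representation, the parse step at the start would simply disappear.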

In my mind, it's quite clear that there is much to gain by editing and storing source code using a data format that more directly represents its actual semantics. One of the bigger obstacles for this to work, however, is that of designing suitable editing software. People are relatively comfortable writing source code as text, but editing ASTs seems like it would be more tedious at first glance. It should be possible to make the editing easier and faster by having the editor produce accurate auto-completion suggestions. Since the editor has direct access to the AST, it would become possible for it to make better auto-completion suggestions based on knowledge of language semantics. Another possibility that opens up for speeding up editing is that of building your own pattern-based shortcuts. For example, you could have your own collection of macros (code generator functions or patterns of source trees with holes in them) to be expanded at the touch of a few keys.
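A "pattern of source trees with holes" might look something like the following sketch, again in Python (3.9 or later). The `expand` function, the `FillHoles` class and the upper-case hole names are assumptions invented for this illustration, not a description of any existing editor.

```python
import ast

class FillHoles(ast.NodeTransformer):
    """Replace placeholder names in a template tree with bound subtrees."""

    def __init__(self, bindings):
        self.bindings = bindings

    def visit_Name(self, node):
        # A name that matches a hole is swapped for the subtree bound to it.
        return self.bindings.get(node.id, node)

def expand(template: str, **holes: str) -> str:
    bindings = {
        name: ast.parse(code, mode="eval").body for name, code in holes.items()
    }
    tree = FillHoles(bindings).visit(ast.parse(template))
    return ast.unparse(tree)

# A reusable pattern with four holes.
CLAMP = "RESULT = max(LOW, min(HIGH, VALUE))"

expanded = expand(CLAMP, RESULT="volume", LOW="0", HIGH="10", VALUE="knob + offset")
print(expanded)  # volume = max(0, min(10, knob + offset))
```

An editor working directly on trees could bind such a pattern to a keystroke and then move the cursor from hole to hole.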

What about tool support? As it stands, many programming languages, even among the most popular ones, have rather poor tool support. Part of the difficulty comes from the fact that tools need to parse and generate source code. If you take languages like C and C++, for example, it can actually be difficult to find a parser that supports the complete grammar. If you want to write a tool that does code transformations, you also have to worry about issues like preserving comments, which is sometimes difficult, especially if the parser you chose has no support for this. If source code were represented in a more uniform data structure, it might actually be easier to write tools that parse, analyze, and generate code. It would also become easier to add metadata to existing source code. In the end, I believe that transitioning to this kind of programming model would enable significant gains that would largely outweigh the disadvantages.
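The comment-preservation problem is easy to reproduce. In this sketch, Python's standard `ast` module (with `ast.unparse`, Python 3.9 or later) parses a line and regenerates it, and the comment does not survive the round trip. The comment only exists at the token level, which the `tokenize` module exposes separately.

```python
import ast
import io
import tokenize

source = "x = 1  # initial value\n"

# Round-trip through the AST: the comment is not part of the tree.
roundtrip = ast.unparse(ast.parse(source))
print(roundtrip)  # x = 1

# The comment is still visible to the tokenizer, one level below the parser.
comments = [
    tok.string
    for tok in tokenize.generate_tokens(io.StringIO(source).readline)
    if tok.type == tokenize.COMMENT
]
print(comments)  # ['# initial value']
```

A transformation tool built on this parser would have to recover comments from the token stream and reattach them itself, which is exactly the kind of bookkeeping a richer source representation could make unnecessary.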