Code that Doesn’t Rot

One of the most frustrating experiences, both as an end user and as a programmer, is to try to run a piece of software that used to work perfectly not that long ago, only to find that it’s now broken: the software won’t run anymore, and it’s not clear why. The software itself didn’t change, yet it broke, seemingly for no reason. This is often due to the phenomenon known as “code rot”, or “bit rot”.
I used to work in a deep learning research lab, and I’ve seen cases where researchers had published code just six months earlier, but one or more of the dependencies they relied on had since made breaking changes. That put us in the unfortunate position of having to troubleshoot someone else’s code. Sometimes we were lucky, and the problem was just that one of the Python packages their code needed had made breaking changes, which could be fixed by simply editing the project manifest. Sometimes people imported packages they didn’t really use, and we could remove the dependency entirely. Other times we had to resolve conflicts between the multiple Python package managers that could be installed on a given system (pip2, pip3, Conda, and Ubuntu’s apt).
Editing a list of dependencies or wrangling with package managers doesn’t sound too bad, but PyTorch, the deep learning framework that was key to a lot of this work, had the irritating property that each release required a specific version of the Nvidia GPU driver to run. Installing an older GPU driver required root access, which we often didn’t have when working on remote compute clusters. Even on a local machine where you did have root access, installing a new GPU driver was slow and tedious. Thanks to Nvidia’s unfriendly driver installers, the process couldn’t be automated, and once you were done, you might succeed in getting a specific release of PyTorch to work, only to find that your other projects couldn’t run anymore, because they needed the latest release instead.
Researchers are encouraged to publish their code for the sake of replicability, but there’s not much point if nobody else is able to run said code a few months down the line. As such, we started encouraging those wishing to publish their code to rely on containerization using either Docker or Singularity. That fixed issues such as broken Python packages, incompatible versions of PyTorch or TensorFlow, and missing libraries on the host system. However, there was another problem: much deep learning code can’t run fast enough on a CPU alone to be usable, and most of the code we saw required GPU acceleration. The way to enable that was nvidia-docker, a special version of Docker which allows code running inside a container to access the host’s Nvidia GPU drivers. This, however, reintroduced the problem that the code inside the container needed a specific version of the GPU drivers installed on the host machine in order to run correctly. Nvidia’s solution for giving people access to GPUs inside a Docker container was to breach the container and expose details of the host system in the process.
How much time do we collectively waste, every year, fixing bugs due to broken dependencies? How many millions of hours of productivity are lost every single day? How much time do we spend rewriting software that worked just fine before it was broken? In my opinion, code rot is a problem that we ought to try to fix, or at least alleviate. Because of fundamental reasons I’ll discuss later in this post, code rot probably can’t ever be fully eliminated, but I think the situation could be made a lot better through more principled and forward-thinking software engineering. At the very least, the situation would be improved if we acknowledged that code rot is a problem, and understood what causes it to happen in the first place.
How can we avoid code breakage? Linus Torvalds seems to think that, in general, it’s better to statically link libraries when compiling software if you can: very few shared libraries are actually version-safe, and with dynamic linking you’re always adding complexity and exposing yourself to the risk that the system your software gets installed on won’t have a compatible version of the libraries you need.
In my opinion, for software that relies on a package manager, it’s better to pin dependency versions where possible, that is, to specify directly in your package’s manifest which version of each dependency to use. The reason is that you unfortunately can’t trust newer versions of your dependencies not to make breaking changes, and often, one broken dependency is all it takes to render your software broken. In some cases, programmers avoid pinning versions because package managers such as Python’s pip don’t support having multiple versions of a given package installed at the same time, which means the version you request could clash with other software already installed on a given system. This is a shortcoming that needs to be addressed by package managers if we want to build reliable software.
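To make this concrete, here’s a minimal sketch of what pinning could look like in a Python package manifest; the package name and version numbers below are made up for illustration.

    # Hypothetical setup.py sketch: every dependency is pinned to an exact version,
    # so a future release of numpy or torch cannot silently change what gets installed.
    from setuptools import setup

    setup(
        name="my_experiment",        # hypothetical package name
        version="0.1.0",
        install_requires=[
            "numpy==1.21.2",         # exact pin rather than "numpy>=1.21"
            "torch==1.9.1",          # exact pin
        ],
    )

The same idea applies to a requirements.txt file or any other manifest format: spell out the exact versions you tested against, rather than whatever happens to be newest on the day someone installs your software.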
There’s another problem though. Part of Python’s appeal is that it makes it easy to link with C code through its Foreign Function Interface (FFI). This is part of what made Python so popular: anyone can write a package that interfaces with commonly available C libraries and benefits from the strength of the C ecosystem. However, this convenience comes at a cost. The FFI is essentially a trap door through which software gets access to external dependencies that can’t be controlled by the package manager, which vastly increases the risk of code breakage. Third-party package managers such as Conda try to address this by managing the installation of external libraries as well as Python code, but this can introduce conflicts with Python packages installed through other means.
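To show how small that trap door is, here’s a tiny sketch using Python’s standard ctypes module; it loads whatever C math library happens to be present on the host, entirely outside the package manager’s control.

    # Minimal ctypes sketch: the shared C library is resolved on the host system
    # at run time, so the package manager has no say over which version is loaded.
    import ctypes
    import ctypes.util

    libm_name = ctypes.util.find_library("m")   # e.g. "libm.so.6" on Linux
    libm = ctypes.CDLL(libm_name)

    libm.cos.restype = ctypes.c_double
    libm.cos.argtypes = [ctypes.c_double]

    print(libm.cos(0.0))   # 1.0, provided the host's libm behaves as expected

Nothing in a requirements file can pin the version of that library; whether the call keeps working depends entirely on what the host system provides.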
In my opinion, the most practical way to manage these problems in the real world is to take a conservative and minimalistic approach to software design. Purposefully minimize dependencies where possible. Don’t add a new dependency unless the value added is really worth the cost of potential code breakage down the line. Avoid external dependencies as much as it’s reasonable to do so, and when you do choose to rely on external packages and libraries, pick ones with a long history of being stable, well-maintained, backwards compatible and easy to install. Remember that it can take just one broken dependency for your software to fail to run on a user’s system, and if your software breaks, they might not even tell you it’s broken.
Good software engineering practices can go a long way towards minimizing the risk of code rot, but I think it’s also valuable to ask what causes code rot in the first place. Could we possibly build software that doesn’t rot? One interesting observation is that such software does exist, in a sense. People still write software for retro gaming platforms such as the Super NES and the Atari 2600. These platforms are essentially frozen in time, with a fixed set of devices and limited I/O capabilities. The fixedness of the platform, its simplicity, and the impossibility of relying on external software packages mean that any software you write for it is unlikely to be broken by changes in the platform itself.
The root cause of code rot is change. The world itself is changing, and so is software. As such, the only way to be completely protected from code rot is to target a platform that never changes: not just the platform itself, but every interface the platform has to the outside world, every device, file format and networking protocol. We can’t stop the world from changing, but we can try to build software on more stable foundations. Like San Francisco’s Millennium Tower, modern software is built on soft soil that keeps shifting under our feet, but it doesn’t have to be that way.
Although the world does change, many elements of the computing world remain fairly stable. Computer keyboards have been around since the 1950s. Color displays have been around since the 1980s. Touch devices have been commonplace for over a decade. IPv4 was introduced in 1981 and IPv6 in 1995. If I’m writing a program that only needs to read input from a keyboard and pointer device, and display pixels on a screen, there’s no real reason why that program ever needs to break. The APIs for getting input from a keyboard and rendering frames to a display can be extremely simple. The world will keep changing and new I/O devices will be invented, but even 200 years from now, the concept of a keyboard and a color display should be easy to understand and interface with.
Recently, I’ve been thinking about the design of virtual machines. If we want to create software that doesn’t break, maybe what we need is some kind of executable code archival format: a minimalistic virtual machine with a small set of I/O devices that are interfaced through a small API surface, a small RISC instruction set, and an overall design that’s kept as simple and stable as possible. Something like the modern equivalent of a Commodore 64, with a high-resolution color display and the performance of a modern machine. I don’t think something like that would be suitable for every use case, but I surmise that much of the software we use only really needs to interface with the outside world in fairly simple ways: it needs to get input from the user in the form of mouse clicks or touches, it needs to draw pixels to display a user interface, and it might also need to read and write files and maybe access the network. Such software doesn’t fundamentally need to link against any external libraries; everything can be statically linked, and it just needs simple, stable interfaces to access the outside world.
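As a thought experiment, here’s roughly how small the host-facing API of such a machine could be. Everything below is hypothetical; it’s only meant to illustrate the size of the surface, not to propose a real specification.

    # Hypothetical sketch of a minimal host interface: a grid of pixels out,
    # keyboard/pointer events and byte blobs in. None of these names are real APIs.
    from dataclasses import dataclass
    from typing import Optional, Union

    @dataclass
    class KeyEvent:
        key_code: int        # which key
        pressed: bool        # True on press, False on release

    @dataclass
    class PointerEvent:
        x: int
        y: int
        button: int          # 0 = move only, 1 = left button, 2 = right button
        pressed: bool

    class MinimalHost:
        """The entire outside world as seen by a guest program."""

        def present_frame(self, width: int, height: int, rgb: bytes) -> None:
            """Display a width x height frame of packed 24-bit RGB pixels."""

        def poll_event(self) -> Optional[Union[KeyEvent, PointerEvent]]:
            """Return the next input event, or None if the queue is empty."""

        def read_file(self, name: str) -> bytes:
            """Read a named blob of bytes from the host."""

        def write_file(self, name: str, data: bytes) -> None:
            """Write a named blob of bytes to the host."""

A handful of entry points like these, plus perhaps a socket-like primitive for networking, would cover a surprising amount of everyday software.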
The idea of a Virtual Machine (VM) isn’t new. The Java VM tried to achieve this but largely failed. Sun Microsystems coined the famous “write once, run anywhere” slogan, which people began to jokingly mock as “write once, debug everywhere”. In my opinion, the area where most VM designers go wrong is that they tend to expose too many APIs, and each of them has an API surface that is much too large. When an API has a large surface, it’s very easy for subtle bugs and corner cases to creep in. You almost inevitably end up with different implementations of the API behaving in different ways. The Web Audio and Canvas HTML APIs are good examples of this. APIs with a large surface are created because VM designers believe this is more convenient for programmers, and will yield better performance. However, this convenience has a cost, in that it makes code breakage much more likely. Fundamentally, to output audio, one should be able to output a simple list of floating-point samples, and to draw pixels, one should be able to output a grid of pixels. It doesn’t have to be much more complex than that, and if the APIs were kept simpler, they would be much less likely to break.
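To illustrate how little is fundamentally needed on the audio side, here’s a toy sketch that builds one second of a 440 Hz tone as a plain list of floating-point samples; the fixed sample rate and the idea of handing the list to a host device are assumptions of the sketch, not part of any real API.

    # Toy sketch: audio output reduced to "a list of floating-point samples".
    # How the host turns the list into sound is not the guest's concern.
    import math

    SAMPLE_RATE = 44100  # samples per second, assumed fixed by the platform

    def sine_tone(freq_hz, seconds):
        """Return a list of float samples in [-1, 1] for a sine wave."""
        n = int(SAMPLE_RATE * seconds)
        return [math.sin(2.0 * math.pi * freq_hz * i / SAMPLE_RATE) for i in range(n)]

    samples = sine_tone(440.0, 1.0)   # one second of A440, ready to hand to the host

An interface this small leaves very little room for independent implementations to diverge, which is exactly the property that keeps code from breaking.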
API boundaries are not the only thing to think about when trying to minimize the risk of code breakage. If your software interfaces with the outside world, you also need to think about file formats and networking protocols, and here, choosing mature, stable, open standards is generally preferable. I don’t know if my idea of a code archival format, or of a virtual machine designed for maximum API stability, will ever see the light of day. For the moment, it remains a thought experiment, but in the meantime, I encourage every developer to think about purposefully minimizing dependencies, designing smaller, more stable APIs, and packaging their software in a way that maximizes its useful shelf life. Engineering more robust and durable software is one way you can improve the lives of a large number of people in small ways and help reduce electronic waste.
I think WASM+WASI is turning into something close to what you propose. Slowly though :-)
Yeah I think WASM + WASI are interesting. Though the danger here is that a lot of WASM code could be very reliant on the HTML DOM. Also proposals like this seem unlikely to be stable across time and platforms: https://github.com/WebAssembly/wasi-nn
DOM is stable, though.
Not sure if it’s what you’re saying here, but I definitely see it show up a lot elsewhere: there’s a widespread misconception that the Web platform is brittle. In fact, browser standards are among the most stable things that exist. It’s the component- and framework-writers in the “Modern Web” world that are the source of the sense of churn people have, along with, of course, reliance on non-standard features, aka undefined behavior.
The Web browser is actually probably the best candidate that we have for realizing Lorie’s UVC.
The problem isn’t necessarily with the way the DOM specifies those APIs. It’s that when you have APIs with a very large surface, it’s much harder for independent implementations to always behave the same. In the last 2 months, I found 3 bugs in Safari’s implementation of web audio and pointer capture.
The debate between static and dynamic linking is interesting. In addition to the point raised by Linus Torvalds, the prominent (well, prominent in my view) open source software developer Drew DeVault makes the point that, outside of libc, most software uses less than 5% of the exported symbols in the libraries it links against. So in many cases people think dynamically linking three programs against one 500k shared library is efficient, but if they statically linked their three programs it might only add, for the sake of argument, 30-100k per program. That makes it more space efficient, as well as allowing you to upgrade the programs independently.
As Zellyn Hunter mentioned, I think WASM + WASI might become what we need.
Last, I think the essential tension between “Don’t Reinvent The Wheel” on one side and “Keep It Simple, Stupid” / “You Ain’t Gonna Need It” on the other is an endless source of debate and conflict. At work my main responsibility is a Java web application with 100 dependencies. There are some it’s completely impractical to remove, like a library to read and write Excel files or the security libraries. But we have a 600k library we use only for parsing configuration files, a 275k library we use only for reading log files in reverse, a 700k library we use for a few collections methods, and so forth. I think we could replace 40MB of dependencies with 4k lines of our own code and speed application startup by 30% (automated processes decompress and scan everything at application start, so less to decompress and scan gives faster startup). But one of my teammates is adamant about using tried, tested, public code whenever we can and avoiding writing our own.
It strikes me that there is some overlap with what you’re talking about here and the problems of “digital preservation” in general. Of course, in the case you describe above, you want to preserve artifacts but also be able to continue to build and develop on them. And you probably have more stringent performance requirements than would basic document preservation. Anyway, your writing tickled some memories, and I managed to find some references to work that may be of interest as you think about these things:
https://www.technologyreview.com/2002/10/01/234717/data-extinction/
rj10338.pdf
https://en.wikipedia.org/wiki/UVC-based_preservation
Just dropping in to talk about the use of virtual machines throughout history a bit.
The very first bytecode interpreter I’m aware of dates to Dr. Martin Richards’ first BCPL implementation, which puts it around 1967 or so. Later, the Tripos operating system was written in BCPL; it is so named because it was considered a TRIvial Portable Operating System. I think this was around the mid-70s. Metacomco bought Tripos and re-implemented it to run natively on a number of platforms, eventually culminating in AmigaDOS. BCPL’s virtual machine roots seemed to have died with Metacomco’s involvement. But Dr. Martin Richards resurrected BCPL and has rewritten much of Tripos from scratch, again in BCPL and again running in a VM (namely Cintpos). One of his more recent publications describes using BCPL and Tripos as a suitable platform for long-lived software applications, where long-lived is defined as 20 to 50 years. See https://www.cl.cam.ac.uk/teaching/2003/CompProgLangs/control.pdf if you are interested in that paper.
And then, of course, there is the most refined development of the virtual machine concept, the IBM AS/400 or iSeries. That history is much too rich to go into here, but it’s highly recommended as a source of interesting research. The fact that modern iSeries hardware, built upon 64-bit POWER-architecture processors, can still run 48-bit software from the IBM System/38 line-up is simultaneously incredible and yet entirely obvious that it should be able to.
Speaking of emulating I/O interfaces and preserving data formats, the IBM mainframe line-up is a perfect example here. While the CPUs are still capable of running most System/360 and System/370 instructions (the rest are emulated in software), the I/O architecture is almost entirely replaced with emulations. Each I/O “channel” is now its own POWER-based computer which is capable of emulating not just the DMA facilities inherent in a channel, but also controllers for older disk drives like IBM 3390s and such. As a result, the software thinks it’s still running on CCKD equipment, when in reality, it’s probably running on top-of-the-line, modern SATA or NVMe devices.
There are also those programs which emulate both the processor and the I/O interfaces. Obvious examples are Commodore 64 and NES emulators, mentioned in the blog. But a more contemporary example is the Uxn platform. If you’re interested in this approach, I highly recommend researching Uxn.
The fact that Java has basically failed at its preferred mission (but not at its original mission, which is an entirely different topic), combined with the success of the IBM mainframes, iSeries hardware, and even homebrew designs like Uxn, suggests to me that virtualization of the I/O interfaces is significantly more important than virtualization of the processor. But, if you can manage to do both, you’ve got an *extremely* trustworthy system on which to build your applications.
Every program that I write targets a minimal VM (~100 LOC). The same ROM works on every device that has an emulator ported to it, which makes writing portable graphical applications very easy. I’m excited to see you also venture into this sort of idea.
The idea was inspired by the game Another World:
https://fabiensanglard.net/another_world_polygons/
Would be happy to meet up and discuss VM design some time :)
Stumbled on this today; I think you might enjoy this paper at the intersection of long-term thinking and computing.
The Cuneiform Tablets of 2015
tr2015004_cuneiform.pdf
Nice find!
The most valuable thing about your VM (https://wiki.xxiivv.com/site/uxn.html) is that it constrains what programs can do. Once people are willing to work within limitations, things become much easier. This fits in with the point about minimizing dependencies. But if you want to run, say, a journalled file system, I don’t think there’s any VM out there that’s going to reduce bitrot in the stack.
The way to reduce bitrot is to subtract, not add.
It’s true. I see a lot of folks approaching uxn like “I want to build an OS for this”, and they don’t really consider why they even want to do so, or why they think an OS will help them reach their goals better than working on a single task at a time.
The need for multi-tasking might be the problem that’s worth fixing.
How does the archival format help you with things like PyTorch, though? I assume you’re using PyTorch because of its high performance. Wouldn’t the archival VM slow things down too much?
The PyTorch/docker example is just an egregious example of code breakage.
In order to make deep learning code portable, you’d need VM support for tensor operations IMO. That is, specialized instructions that act on tensors.
It’s tough to standardize deep learning because the field is still in rapid flux. Might be best to wait until things stabilize more, and it becomes more obvious what the most useful primitives are IMO.
There’s a pretty good candidate for “A minimalistic virtual machine with a small set of I/O devices that are interfaced through a small API surface, a small RISC instruction set, and an overall design that’s generally kept as simple and stable as possible.” Even better, it’s actively maintained and it’s been ported everywhere.
QEMU running a RISC-V CPU and all VirtIO drivers with a light POSIX OS on top of it.
Perhaps I picked wrong when I chose an instruction set for https://github.com/akkartik/mu
I see where you’re coming from but basing this on real hardware adds extra complexity IMO. Among other things, you need to carefully define the properties of your RISC-V CPU, because they’re all a little bit different. The RISC-V instruction set has many extensions, and most CPUs don’t implement all of them. You’d also have to pick how many cores this virtual CPU has, and if it has more than one, then that opens a whole other can of worms with nondeterministic execution.
I used to think similarly and want to get everything just right. But there’s value in going with something that already exists and has momentum. Not to mention OS drivers you don’t have to implement, having tried to do that myself.
If the goal is just to avoid bitrot and something gets us that, a little extra complexity may be ok. Somebody only has to pick the CPU features once, put them into a file. Now it’s easily shared, saved in version control, etc.
I like the simpler Uxn approach better :)
That’s a nice idea. But choosing a new format as an archival format doesn’t sound like the best way to get longevity. I would use something that has already outlived most other tools, something like Lisp/Scheme or Fortran.
It’s one reason why I nowadays write in Scheme, using Guile: https://gnu.org/s/guile