jeffa.io

I recently wrote a command-line shell and programming language called whale. I did so for fun and because I wanted to work out some frustrations with life on the command line by trying to build solutions. The nushell project was a big inspiration and a practical help (I borrowed their readline library), and it was probably the first to make me wish for structured data.

The String Problem

Traditional shells (e.g. sh, bash, fish and zsh) differ in meaningful ways: control flow, variable assignment, how they read startup scripts and so on. But fundamentally, they all represent data the same way: as text strings. This is a problem because, unless the user entered it, a string is rarely the most meaningful way to represent data.

String-based shells work well when you are invoking commands with flags and arguments because those are all user-entered strings. But when you fetch some JSON data from the web or export a database as CSV, you are suddenly left with an unwieldy wall of text that the shell can’t handle on its own. This is not actually a design flaw. These shells were designed to solve problems related to text because they ran on systems that were entirely text-based.

But structured data has come a long way since those shells were designed. Three of the most popular shells (sh, bash and zsh) are older than JSON and older than any formal definition of CSV. The CSV format was formally defined in 2005 (RFC 4180), the year that fish was released. JSON was first specified in an RFC in 2006 and did not become an Internet Standard until 2017 (RFC 8259). These shells use a single ambiguous data type (the string) because it is universal, and because ad-hoc parsing through sed and awk was the best that could be done at the time.

The Unix Tool Problem

The whole point of a command-line shell, in my estimation, is to allow the user to pass data to and between single-purpose programs. The Unix tool philosophy is to do one thing and do it well. One writes programs that process input into output. This is a simple yet monumental computing and programming paradigm.

Is it possible to disambiguate data on the command line by using more than just strings without violating this philosophy? In this paradigm, it may seem that the shell should only recognize data in the form of an ambiguous type like the string because the meaningful work is supposed to happen inside of programs, not between them.

But a shell that implements structured data correctly can work within the Unix tool philosophy and can even complement its features.

Solving a Problem with Strings

Suppose we keep notes in a plain text file, one per line, each with a date followed by the note:

Thu Aug 17 01:00:00 AM EDT 2023 | buy milk
Thu Aug 17 02:00:00 AM EDT 2023 | take out garbage

In order to process this file, we have to start by treating the entire thing as a single string. Then a program like awk, sed or grep may divide it by line before searching and interpolating positional values. This will work as long as the strings are uniform enough to pose as structured data, but it is a feeble framework that can easily be compromised by something as simple as two programs parsing dates differently.
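To make that fragility concrete, here is a rough Python sketch of the positional parsing described above. It is only an illustration: `parse_notes` and the extra ISO-formatted line are hypothetical names and data I made up for the example, not anything a real shell or tool does for you.

```python
# Ad-hoc parsing of "date | note" lines, the way a text-only pipeline sees them.
NOTES = """\
Thu Aug 17 01:00:00 AM EDT 2023 | buy milk
Thu Aug 17 02:00:00 AM EDT 2023 | take out garbage
"""


def parse_notes(text):
    """Split the text into (date_string, note) pairs by position alone."""
    notes = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Everything before the first "|" is assumed to be the date.
        date_part, _, note = line.partition("|")
        notes.append((date_part.strip(), note.strip()))
    return notes


notes = parse_notes(NOTES)
print(notes[0])  # ('Thu Aug 17 01:00:00 AM EDT 2023', 'buy milk')

# A second program appends a line with a different date format.
# Nothing fails: the "date" column now silently holds two formats.
mixed = parse_notes(NOTES + "2023-08-17T03:00 | call mom\n")
print([date for date, _ in mixed])
```

The parser never complains, because to it a date is just whatever text sits left of the delimiter. The disagreement only surfaces later, when something tries to sort or compare those strings.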

Solving the Same Problem with Structured Data

Here’s how the still-experimental shell whale handles the same job of defining a function that saves a note to a file.

Notice that we first declare variables with dot notation to group them. Then we read our existing notes from a TOML file into a variable. This variable is a list, a data structure that is a simple contiguous sequence. We can then call push on the list to add a new item to the end. Finally, we write the file using our new list of notes.
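For readers more at home in a general-purpose language, the same read, push, write sequence can be sketched in Python. This is not whale code. The `save_note` function and the file name are hypothetical, and JSON stands in for TOML here because Python's standard library can read TOML but cannot write it.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def save_note(path, text, now=None):
    """Read the list of notes, push a new one, and write the list back."""
    now = now or datetime.now(timezone.utc)
    # Read existing notes into a list (empty if the file does not exist yet).
    notes = json.loads(path.read_text()) if path.exists() else []
    # Push the new item onto the end of the list.
    notes.append({"time": now.isoformat(), "text": text})
    # Write the whole list back out.
    path.write_text(json.dumps(notes, indent=2))
    return notes


# Hypothetical file location, created in a scratch directory for the demo.
notes_file = Path(tempfile.mkdtemp()) / "notes.json"
save_note(notes_file, "buy milk")
save_note(notes_file, "take out garbage")
print([n["text"] for n in json.loads(notes_file.read_text())])
# ['buy milk', 'take out garbage']
```

The point of the sketch is that the program never splits lines or counts columns. The file is parsed into a list once, and every later step works on that list.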

TOML, unlike JSON, has a time data type that makes it a good choice when saving dates and times. That solves the problem of different programs processing dates differently. from_toml and to_toml are the shell’s universal means of conversion for this format. There are also conversion macros for CSV and JSON. These are infallible: they create only valid output and will convert any valid input. This improvement is virtually without downsides: any structured data can still be treated as a string and structured data can be converted between formats, which improves interoperability.

Writing Unix Tools for Structured Data

Rust users are lucky to have serde, a library that can create in-memory values from data formats and vice versa. If you are compiling a program to be used as a command-line tool, I would urge you to think about how you might use structured data as input and output. Pandoc, for example, can use a YAML file as an alternative to command-line arguments but I would prefer it to have an additional option to pass that YAML to standard input. In the meantime, it is easy enough to implement pandoc’s features in whale by translating a data structure into command-line arguments. It is not a perfect solution but it goes a long way to prove that the command-line shell is still the gold standard of computing.
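The translation from a data structure to arguments is simple enough to sketch in a few lines of Python. This is not how whale does it. `to_args` is a hypothetical helper, the file names are made up, and the only pandoc options assumed are the long-standing `--from`, `--to`, `--standalone`, `--toc` and `--output`.

```python
def to_args(options):
    """Translate a mapping of option names to values into a flat argument list."""
    args = []
    for name, value in options.items():
        flag = "--" + name.replace("_", "-")
        if value is True:
            args.append(flag)  # boolean switch, e.g. --standalone
        elif value is False or value is None:
            continue  # switched off: leave it out entirely
        elif isinstance(value, (list, tuple)):
            for item in value:  # repeatable option, one flag per item
                args.append(f"{flag}={item}")
        else:
            args.append(f"{flag}={value}")
    return args


options = {
    "from": "markdown",
    "to": "html",
    "standalone": True,
    "toc": False,
    "output": "notes.html",
}
command = ["pandoc", "notes.md", *to_args(options)]
print(command)
# ['pandoc', 'notes.md', '--from=markdown', '--to=html', '--standalone', '--output=notes.html']
```

The resulting list could be handed to something like `subprocess.run`. The options themselves stay a plain data structure that can be read from TOML, JSON or YAML before being flattened.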

If you are interested in structured data on the command line, whale is in its experimental phase and I’m happy to talk more about it. Nushell is a more mature example of a command-line shell with structured data and jq is an excellent resource for querying JSON.