f : ⊥ x ⊥ → ⊥ | All posts

Platform changes and Bazel rebuilds

2024-01-11T15:00:00+00:00

Bazel is a build system from Google which uses a strong change detection model to solve a number of build correctness problems make-like systems struggle with. While it handles most cases of rebuilds correctly out of the box, one recurrent gap is that if glibc changes bazel doesn’t notice and may produce broken results. I’d like to talk about how we hacked around this problem at work, since there aren’t a lot of well documented solutions out there.

A bit of background

In make and similar build systems, a build product is “up to date” if it is newer than all the input files by which that product is defined. The logic is seemingly elegant. If a depends on b and b depends on c, when c changes then c will be newer than b. Thus b must be rebuilt, which will make b newer than a and cause a to rebuild. Easy.

One problem that make struggles with is incomplete dependency specifications. What if a also depends on d, but this dependency isn’t listed as part of the build definition? In this case make a will fail to update a when d changes.

make and friends are vulnerable to this class problem because they run build actions directly in your shell. Any file in your working directory is visible to build actions and could in reality be a dependency of your build.

make also struggles with build configuration. To make, any build step (action) is an opaque shell script which will execute if the file dependencies aren’t up to date. It has no understanding of build configurations, and requires user discipline to convey all such options as input files rather than as environment variables. Hence in part the conventional ./configure scripts.

bazel solves these problems in two related ways. At its core, bazel is a tool for building a plan of actions (basically fork() calls) including input files, input command line and input shell environment, along with other configuration. Each one of these actions is fingerprinted using a content hash of the input files and all the input configuration. The action fingerprint serves as a repeatable identifier for (theoretically) the unique build output defined by all the provided inputs.

One neat property this model provides is that it allows bazel to cache and recycle build products. If a previously built artifact is requested for rebuild, bazel can just fetch the previous artifact from a cache – even a remote cache! – and save the actual build work. While this isn’t always valuable in building small applications, it can be profitable when caching the results of running a test suite or caching the results of enormous builds (such as a web browser).

The fundamental risk of this model is that – as with make and file timestamps – if the build depends on something which Bazel doesn’t know is a build input, then Bazel can’t detect change. Should that input change, since Bazel is unaware of it the action(s) and their fingerprint(s) for the build won’t change and you may get an incorrect rebuild because of the incomplete dependencies.

To help ensure that build actions produce repeatable results, Bazel executes actions in (somewhat) isolated chroot-like environments called sandboxes. Sandboxes contain a view of the source code, narrowed to the explicit dependencies of the current build action. This makes it hard for sandboxed actions to use dependencies which are not stated.

The default sandboxing isn’t perfect, but it’s pretty good and even trying to use sandboxes provides a lot of soundness pressure make and friends lack. You can even opt into running your entire build in a Docker sandbox, or on an entirely different machine via remote execution if you want to get paranoid about sandboxing!

$ bazel build \
    --spawn_strategy=docker \
    --experimental_docker_image=ubuntu@sha256:... \
    //your:target

Now there’s a big huge large asterisk on the default sandboxing machinery, and that’s the system on which you’re doing builds. For instance if your build uses clang, which clang are you getting? When your build links against a library using ldd, what version of the library are you linking against?

The answer is that you’re usually doing a non-hermetic build using the host C compiler and linking against the host glibc.

So really your entire OS install state is a dependency of the build, but not one which is an explicit input. If you apt-get upgrade, suddenly your cc and glibc could change without Bazel noticing. Such a change should force rebuilds, but by default won’t since system files aren’t inputs to your build.

Using a workstation you’d probably never notice that you upgraded glibc and Bazel didn’t do rebuilds. glibc is good at forwards-compatibility, so the old build artifact will keep working fine for a long time.

But when working with Docker containers it’s easy to run into old glibcs and get backwards-incompatibility version issues. A /lib64/libc.so.6: version 'GLIBC_2.39' not found error because your Bazel cache contains entries from a too-new build can ruin your whole day.

Unfortunately, Bazel doesn’t natively understand glibc or have a feature well suited to doing so. There’s a whole mess of GH issues associated with this. https://github.com/bazelbuild/bazel/issues/16976 and https://github.com/bazelbuild/bazel/issues/8766 to name two of the most recently active.

How could we work around this?

Building in known contexts

We already talked about Dockerized execution, which is certainly one way to fully lock down the build context and make changes to that context explicit.

Another option is to take an entire sysroot as a dependency, so you’re always using a fixed compiler and a glibc out of a sysroot. Essentially this is doing builds inside an explicitly managed container to ensure that no build dependencies can accidentally change.

I’m given to understand this is somewhat like how builds work at Google, which is part of why there isn’t a better story for glibc “in the box”. Google’s internal build architecture apparently “has no such thing as an external dependency”, in which sysroots appear to be part of the story. I’m sure that works if you can fund a team to maintain a sysroot build matrix. Must be nice.

The Aspect folks provide a hermetic GCC toolchain which uses basically the same sysroot strategy.

A parallel fix would be using Nix environment/shell definitions to stabilize the tools used when building. For instance by wrapping Bazel with a tool that boots a Nix defined shell before calling out to Bazel, or by using rules_nixos to try and embed Nix within Bazel. This doesn’t directly solve the caching issue since the Nix environment isn’t strictly a build input, but it will at least get you reproducible builds after a manner.

The advantage of these strategies is that they’re reproducible, and some can work with Bazel’s remote execution capabilities. The downside is that since you’re doing essentially containerized builds, they’re heavyweight.

Workspace status?

Taking a step back, there are two possible solutions to the glibc versioning problem. One is to make sure that the build as configured always runs in the same environment. Nix, Docker and sysroot images are all ways to achieve that.

The other would be to make the glibc version an explicit build input so that if it changes that is visible to Bazel. This sounds a lot like Bazel’s workspace status feature.

Workspace status allows Bazel to capture information about the host and repository state for the build. This state is divided into “stable” keys which are expected to change infrequently and cause build artifacts to invalidate, and “volatile” keys which are expected to change and don’t cause rebuilds.

These keys are generated by running an arbitrary shell script (or other executable) before each build action. This seems to align nicely with wanting to inspect the glibc version, since we can just run a shell script to inspect that and other details before every action. Right?

Unfortunately repository status is only a build input of “stamped” (--stamp) builds, and then only an input to executable rules (*_binary) and will not cause intermediate library products to be rebuilt for stamping. So that doesn’t actually align with the semantics we want which is that if glibc changes the entire cache gets busted.

It’d be nice if there were a status command/state for semi-permanent build inputs such as the OS and glibc version which did have these global cache busting semantics, but there isn’t so we have to look elsewhere.

Salting the build

Since we can’t make the glibc version a build input via the workspace status machinery, are there other channels we could use to achieve the same result?

The primary thing Bazel considers as a build input is files, but environment variables and global cc flags are also supported. For instance bazel build --copt=<something> would apply a compiler option to all sub-builds, and a change to this flag is a build input change which triggers rebuilds. bazel build --action_env=VAR=val specifies an environment variable value for any build action which depends on the variable VAR as an input.

If our goal is simply to ensure that Bazel performs rebuilds on context changes, we could wrap Bazel with a script such as --copt=-DPLATFORM_FINGERPRINT=$(ldd --version | sha256sum). This allows us to create a build input with global scope which will cause any cc task to rebuild if it changes.

In a sense we’re just salting the fingerprints of all actions in build by adding inputs that happen to change when the external environment changes.

#!/bin/sh

# A sketch at a Bazel wrapper

next=$(
    (
        which -a bazelisk
        which -a bazel
    ) | grep -v "$(realpath $0)"
)
shift

function platform_args {
    glibc=$(ldd --version | awk '{print $4; exit}')
    echo "--action_env=PLATFORM_GLIBC_REV=$glibc"
    echo "--copt=-DPLATFORM_GLIBC_REV=$glibc"
}

# HACK: this doesn't quite work because Bazel accepts startup args before the command
command="$0"
shift

exec "$next" "$command" $(platform_args) "$@"

Rather than just passing through the glibc version, one could imagine a more generalized platform fingerprint hash value incorporating factors like the libc version, the OS release, an actual salt counter for busting the cache and anything else your build may be conditional on.

Some adjustments to Bazel WORKSPACE rules may be required to make this tack work, as --action_env values don’t flow into workspace actions by default. Likewise rules wrapping non-cc compilers such as go would require minor adjustments to make the PLATFORM_GLIBC_REV a global environment input.

Thankfully, Bazel’s git_repo support for pulling in rulesets makes forking or vendoring rules easy, so the prospect of running a patched rules_go and rules_python isn’t that daunting.

A last word

The best solution would be for sysroot images under Bazel to be as easy as other kinds of 3rd party dependencies. Since a sysroot image is “just” a normal build input file, they play nicely with all of Bazel’s change detection machinery and with remote execution.

rules_nixos achieves this by using Nix definitions to create the required context. I wish them great success and want to kick the tires one of these days, because that seems like a good general strategy for locking down the build context without having to do a ton of extra work.

However, given that the alternatives (Nix, sysroots) are heavyweight and there isn’t a way to use workspace status to achieve what we want build salting may be an acceptable solution. It isn’t a perfect solution because from Bazel’s perspective the implication of --action_env=PLATFORM_GLIBC_REV=2.21 is that as long as the option is set you’ll get the same build results.

Really it’d be better to be able to express glibc as a platform constraint that could inform remote execution worker selection but Bazel doesn’t support that. And we haven’t deployed remote execution yet anyway so I guess I’ll be using a salting shim script for a while.

Tentacles

2023-10-04T15:00:00+00:00

About a year ago, I decided to get into 3d printing. I had a specific project in mind (which of course I still haven’t done) for building a fairly large interior structure for a pelican case, and so at the advice of a friend I decided to pick up a Creality CR-10v3 for its large build volume. Fast forward many months of building custom firmware, cursing furiously at slicer settings and hand-tuning the printer to get acceptable precision and I finally started making practical prints. But … printing is kinda slow. A single part can routinely take 3-4h to bake. Printing multiple parts at a time is possible, but printing time is mostly a factor of tool-head movement time. More movement for multiple parts means more time than printing the parts serially, and more time means more risk of bed ahesion or something else failing. So better to make printing multiple objects sequentially efficient, or even to print in parallel. Enter print farming, and my second CR-10v3.

Printing in parallel across multiple machines allows for total wall clock time-to-print to be reduced, and reduces the blast radius of part failures to a single part. But it requires a different approach to managing 3d printers than off the shelf solutions provide.

While printer vendors are beginning to roll out “cloud connected” printers managed through their own proprietary management platforms (sigh), printers started as just AVR microcontrollers reading streamed G-CODE instructions over serial connections. OctoPrint is a piece of software which implements exactly that streaming of G-CODE from a “real” computer (often a Raspberry Pi via OctoPi or some other toy computer). Unfortunately for me however, OctoPrint is designed to control exactly one physical printer.

Internally, OctoPrint’s model consists of model files which can be sliced into G-CODE, G-CODE files which can be “selected” for printing, and the “selected” file which can be “printing” (streaming G-CODE to the physical printer) actively. OctoPrint is mostly concerned with managing and parsing the serial connection protocol between itself and the physical printer, and doesn’t make much of any effort to present a job queue or any other high-level constructs.

For a single printer setup, it’s quite convenient to run an OctoPrint instance because many slicers (such as PrusaSlicer) have native support for “Send to OctoPrint” when slicing a model. This means that when you’re iterating on a model you essentially have a “slice and go” button. No need to generate a G-CODE script file and copy it around. Just load up a model, slice it, click print and the slicer tool will HTTP POST the sliced script to a configured OctoPrint API for you, and start the print running.

But, OctoPrint being a single-printer piece of software that doesn’t work when you have multiple machines in the loop. It also doesn’t work when you want to run off multiple copies of a given file, because OctoPrint can’t represent that request for say six copies. It only knows that there’s a file selected for printing. Once that print is done, something needs to request a new print – usually you. This usually means logging into the web UI for OctoPrint and clicking print again once you’ve turned the machine over between tasks. For two copies, that may be OK. But if you want to print many copies, it rapidly gets onerous. And forget trying to manually push files and print jobs out to multiple OctoPrint instances.

Surely there is a better way!

In theory OctoFarm should provide that better way, but among other defects it doesn’t actually have a scheduler.

So I built Tentacles.

Tentacles is a solution for fronting multiple OctoPrint instances with a job queue, and presenting them as if it was one OctoPrint … with multiple tentacles. In this screenshot, we can see that OctoPrint is currently configured to drive two printers, both CR-10s and that there’s a job currently scheduled to P1.

While the UI could use some love, the job queue UI shows that the job is currently running, and allows for duplication of the job, or cancellation.

We can also see that each of the printers is configured with a number of details – nozzle size, machine limits, machine type and loaded filament type. These details serve as scheduling constraints, allowing Tentacles to ensure that it doesn’t accidentally upload a PLA print to a machine loaded with ABS which needs much higher working temperatures. It also allows scheduling of jobs to printers which are the “right size”. Were I to add a smaller printer such as a Prusa Mini to the fleet, physical size constraints would prevent jobs which need a large working bed from running on the small printer. Users don’t need to specify the requirements of their jobs – it can be extracted either through simulating the G-CODE or by parsing the metadata PrusaSlicer helpfully includes.

Okay, so Tentacles can receive jobs, enqueue them and map them out … how do we determine when a printer is ready to receive work? OctoPrint provides some state bits reflecting whether a job is currently running or cancelling, but that doesn’t handle machine turnover. When a print finishes, or when there’s no print in progress, we need a way to decide whether the print bed is clear and the printer ready to do work. Enter the bedready plugin. By the standards of CV it’s super primitive and just does image-to-image comparisons, but if you use a printer’s webcam to take a reference image of an empty bed in a default “reset” position, that can be enough. As long as OctoPrint runs a script which returns the print bed to the “reset” position when a job finishes or the printer resets, you can then compare the webcam pictures to detect say finished prints that still haven’t been removed from the print bed. Or stray tools. Or support material that didn’t get cleaned up, any of which indicates that the printer is not ready to accept jobs.

Just for fun, Tentacles is also a fully multi-tenant solution! While it doesn’t feature job priorities, quotas or chargeback (yet), it does have a user signup, verification and approval flow. Once approved, users can directly request jobs on Tentacles as if it were a simple OctoPrint instance! To date, two friends have successfully printed jobs through Tentacles without ever coming over to my shop.

I could say more, but this is probably the most interesting stuff. The code is available, licensed under the anticapitalist.software license. No support or releases are provided – if you want to try and use it you’re gonna have to bazel build it yourself.

This is, after all, hobby software for my hobby print farm. But it is awful handy for running off fleets of benchies.

Farewell StrangeLoop

2023-10-03T05:57:00+00:00

In 2009 when Alex ran the first StrangeLoop, I was a sophomore in High School. When I started hanging out in Clojure circles around 2013, StrangeLoop was the place to be. Rich had spoken there, Zach had spoken there, as would Joe, SPJ and many many more industry leaders.

I finally managed to join the fun in 2016, and to say I got to meet my entire Twitter feed in one place would be an understatement. I finally got to meet Chas, Daniel S. G., KF, fell in with the Papers We Love crew, saw some mind-bending talks and left beyond stoked to be back the next year. Since then, the annual trek to St. Louis (and Salt + Smoke) has been the highlight of the calendar socially and intellectually.

On more than one occasion I’ve quipped – and I stand by this –

Every year we get together in St. Louis to remember that computers are incredible and that we do enjoy using them

Particular standouts were Jose’s talk on the natural numbers which inspired me to start banging out a lazy thunk implementation of numerics that very night

Lea’s discourse on nitting

Marlow’s talk on Haxel which made me finally understand transaction logs / monads and on which I’ve shamelessly based … approximately everything I’ve built since

Janelle’s absolutely hysterical talk on playing with ML

Sarah’s incredible talk on DIYing up her own pancreas

Felienne’s fascinating experience report of trying to design a language which isn’t simple in any sense but is approachable and the constraints that imposed

Just to pick a handful of talks which I had the pleasure of attending.

And now, the løøp as one chat group long called it is over. It will be sorely missed.

Thank you Alex, Crystal, Nick, Ryan, Mario, and everyone else who helped make the venue as special and magical as it was. Whether you knew it or not this project and its community defined my career in more ways than I could count.

And every year reminded me that computers can be fun, and there is art in what we do.

macaw.social

2022-11-06T00:00:00+00:00

This post serves as cross-proof that I’m @arrdem@macaw.social in the fediverse.

macaw.social is a small instance run by me and some other ex-Twitter SREs. The instance is named ironically for a family of services you hopefully never heard of named Macaw.

It’s entirely possible that as the fediverse stabilizes I deploy my own instance (I did just pick up paren.town and paren.space to go along with paren.party), but for now this’ll do.

The bird is dead. Long live the bird.

A eulogy

2022-10-27T00:00:00+00:00

Originally on cohost

It sucks that Twitter’s leadership never figured out how to monetize what they had to such a degree that the company and by extension the product, network and relationships captured on it have remained an acquisition target. And now have been enclosed by someone with nothing better to do.

It’s not surprising. Twitter’s always been a hot mess internally and externally.

But it is disappointing because Twitter does (did) a good job of forming communities of interest and helping folks find new adjacencies and perspectives. You create an account, you follow some people, you post about stuff and you find people who post about the same things. You follow (and unfollow) people and get a sense of them. Not just an Instagram facade, but a fairly raw braindump of their life. Their struggles, successes and vibes. You can fall into programming languages twitter and find everyone’s trans and presents that as part of who they are. You see peoples sports teams and their local politics.

Amidst the attention and outrage machine there are people to be found and relationships to be formed. Not just brands and politicians being messy at each other and using yet another platform to retrench their microphone.

Twitter’s success has always been in elevating voices you usually wouldn’t hear. People can get their five minutes of fame and be a one hit wonder for that one time they dunked on a politician or posted cell phone footage of what just happened. Giving access to eyewitness media and accounts as events unfolded presented a challenge to established media organizations and arbiters of truth.

Maybe it was an impossible dream to monetize that chaos well enough to insulate it from enclosure. Twitter had to shut down the 3rdparty client interfaces because of one play to enclose it from the outside in, and that was certainly a death nail moment after which Twitter was on the fearful defensive.

Facebook succeeded as a business - for a while at least - because they captured structured data about their users that could be directly fed into ads targeting. “what’s happening” and media upload doesn’t let a company target nearly as well; privacy concerns notwithstanding.

The thing that big platforms like Twitter and TikTok succeed at is bridging cultural boundaries. Maybe retreat into a dark forest of forums was inevitable from a moderation and culture war perspective. But loss of the big microphone and the wide platform on which to find and make new connections is a shame. Even if it was always a clown car in a gold mine.

Cram; a new dotfile manager

2022-08-31T01:43:00+00:00

Ah dotfiles. Love ‘em or hate ‘em we’ve got to live with ‘em. While Rob Pike has words about Unix hidden files, we (almost all) work on computers and with software whose behavior is determined in large part by hidden files in our home directories. There’s probably a .bashrc or .zshrc and a whole .ssh/ and .config/ directories kicking around on your workstation full of stuff that matters a fair bit to your day-to-day and standing up a new work machine is probably a wasted day of trying to remember homebrew incantations and source software.

A traditional answer to this is a git repo, and some sort of installer. GNU Stow is a well-trod solution, providing the ability to install dotfiles, but you probably need to wrap it in some script to install software. And Stow doesn’t do well at uninstalling files. As with Ansible it’s effectively imperative, not a declarative solution to managing the state of your configs.

Puppet or NixOS could solve this problem, but they’re suuuuper heavyweight for just managing your dotfiles in a portable way and they create bootstrapping problems since you can’t count on them being available.

I’d like to present what started life many years ago as my custom script, and has become a fair bit more - cram (repo, v0.2.0 release).

Cram is a single-file Python 3.6+ zip app, designed to be something you can just check in with your dotfiles and run anywhere using a system python interpreter. Cram provides a package abstraction like Stow with the addition of packages which can exec. Most importantly, Cram hews to immutable infrastructure principles with an execution log, dry-run/diff capabilities and supports automatic removal of installed resources.

Let’s take a quick tour!

Let’s get cramming

you can clone the repo here git clone https://git.arrdem.com/arrdem/cram-demo.git and follow along

Cram doesn’t know anything about $XDG_CONFIG_DIR or your OS’s package manager or even dotfiles at all. What cram does know about is packages, profiles and a state log.

To Cram, a package is a directory under packages.d (this is hardcoded). Such a directory may contain a pkg.toml file, which as with other package file formats may describe dependencies, preparation, installation and post-install steps. An example of such a file is as follows -

[cram]
version = 1

[package]
# The package.require list names dependencies
[[package.require]]
name = "packages.d/some-other-package"

# (optional) The package.build list enumerates either
# inline scripts or script files. These exec as a
# package is 'built', before it is installed.
[[package.build]]
run = "some-build-command"

# (optional) Hook script(s) which occur before installation.
[[package.pre_install]]
run = "some-hook"

# (optional) Override installation behavior.
# By default, everthing under the package directory
# (the `pkg.toml` excepted) treated is as a file to be
# stowed using symlinks.
[[package.install]]
run = "some-install-command"

# (optional) Hook script(s) which occur after installation.
[[package.post_install]]
run = "some-other-hook"

Managing files with cram

Cram is used to “apply” changes to a directory under management. The conventional incantation for this is ./cram apply ~/conf ~/, for managing a home directory or dotfiles.

Let’s look at the usage -

$ ./cram apply --help
Usage: __main__.py apply [OPTIONS] CONFDIR DESTDIR

  The entry point of cram.

Options:
  --execute / --dry-run
  --force / --no-force
  --state-file PATH
  --optimize / --no-optimize
  --require TEXT
  --exec-idempotent / --exec-always
  --help                          Show this message and exit.

By default, Cram will “require” the following packages:

profiles.d/default
profiles.d/$HOSTNAME

But you can override this by passing --require. For the purposes of this demo, we will just install a fake package. Don’t worry, we aren’t going to actually install anything here.

$ ./cram apply --dry-run --require packages.d/fake . ~/
2022-07-28 22:25:26,521 - __main__ - WARNING - No previous statefile .cram.log
- mkdir ~/.config
- chmod ~/.config 16877
- mkdir ~/.config/fake
- chmod ~/.config/fake 16877
- link ./packages.d/fake/.config/fake/b.conf ~/.config/fake/b.conf
- link ./packages.d/fake/.config/fake/a.conf ~/.config/fake/a.conf

--dry-run (which is also the default behavior) instructed Cram to figure out what to do, but not to do anything. This is a changelog of commands which Cram is proposing to execute against your filesystem. All of these commands are generated by the default stow style installer, and produce an installed state. Were you to use apply --execute, Cram would go ahead and make these changes.

That No previous statefile warning is the secret sauce of Cram. Cram works in terms not just of this log of what changes it will make, but in terms of a persisted log of what changes it has made. This allows Cram to optimize repeated executions to remove installation steps that haven’t changed, while still retaining a precise log of how to get where you are now from an empty slate. This also allows Cram to clean up after itself.

Let’s do a real demo of this.

$ ./cram apply --execute --require packages.d/fake . ~/
- mkdir ~/.config
- chmod ~/.config 16877
- mkdir ~/.config/fake
- chmod ~/.config/fake 16877
- link ./packages.d/fake/.config/fake/a.conf ~/.config/fake/a.conf
- link ./packages.d/fake/.config/fake/b.conf ~/.config/fake/b.conf

So now we’ve got two files and a couple directories on the filesystem we may or may not want. We can see the record of this state as follows -

$ ./cram state .
- mkdir ~/.config
- chmod ~/.config 16877
- mkdir ~/.config/fake
- chmod ~/.config/fake 16877
- link ./packages.d/fake/.config/fake/a.conf ~/.config/fake/a.conf
- link ./packages.d/fake/.config/fake/b.conf ~/.config/fake/b.conf

Were you to delete a file, say rm packages.d/fake/.config/a.conf and then inspect changes -

$ rm packages.d/fake/.config/a.conf
$ ./cram apply --require packages.d/fake . ~/
- unlink ~/.config/fake/a.conf

What Cram did here was compute what the install steps for the current state would be, compare that with the PREVIOUSLY EXECUTED steps, identify a file that is no longer to be installed, and include removing that file in the new plan.

And if we apply --execute our changes, note that the state file DOES NOT include the unlink cleanup instruction. It only contains the steps required to produce the now-current state

$ ./cram apply --execute --require packages.d/fake . ~/
- unlink ~/.config/fake/a.conf
$ ./cram state .
- mkdir ~/.config
- chmod ~/.config 16877
- mkdir ~/.config/fake
- chmod ~/.config/fake 16877
- link ./packages.d/fake/.config/fake/b.conf ~/.config/fake/b.conf

Managing software with Cram

We can also manage the software that consumes our dotfiles with Cram! Let’s look at how homebrew would be installed -

$ ./cram list . packages.d/homebrew
packages.d/homebrew: (PackageV1)
requires:
log:
  - exec /tmp ('/bin/sh', PosixPath('/tmp/stow/e5d3a54761ee43023832d565e11ec4661b84f4ec66629042674b6658993e8cb8.sh'))

Not super helpful - let’s take a look at the pkg.toml

$ cat packages.d/homebrew/pkg.toml

[cram]
version = 1

[package]
require = []

[[package.install]]
run = "[ ! -e /opt/homebrew/bin/brew ] && /bin/bash -c \"$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)\""

Cram has a big escape hatch for letting you run scripts under /bin/sh, and it’s providing run= directives for the package installer (or other hooks). So that Cram can determine when an install script or other hook changes, these scripts get extracted and content-hashed. If the content-hash of a script hasn’t changed, Cram won’t run it when an apply reoccurs. We consider exec to be idempotent (--exec-idempotent) although you can override this default (--exec-always).

Installing homebrew this way lets us write other packages which depend on homebrew, for instance zsh

[cram]
version = 1

[[package.require]]
name = "profiles.d/macos/homebrew"

[[package.install]]
run = "brew install zsh"

This works a treat, but can get repetitive for which we don’t have a good story. Another concern is that, because we use a unique subprocess per script $PATH gets discarded so there isn’t a good pattern for ensuring that say /opt/homebrew/bin stays on $PATH between dependent tasks.

Managing larger configurations with Cram

I’ve talked a couple times at the package.require feature of Cram packages. But that doesn’t tell you too much about how to organize larger configurations.

A profile is a directory under profiles.d or hosts.d which may but need not have a pkg.toml specifying requirements. Where a package is fundamentally a set of installation directives, a profile is a group of packages. For instance profiles.d/emacs/doom-emacs is a package of configurations specific to the Emacs package.

The tricky bit is that a profile IMPLICITLY requires all its subpackages. This is useful for profile and host specific packages - you don’t have to have a bunch of macos-foo packages running around, they could live in profiles.d/macos/* and then a given MacOS host can depend on profiles.d/macos to grab all the relevant configuration.

The hosts.d/demo host provides an example of this pattern by depending on the macos and work profiles as meta-packages.

To see how the demo host would be installed, ./cram apply --require profiles.d/default --require hosts.d/demo . ~/ would do the trick.

Limitations of Cram

One of the problems in software is that authors don’t lay out what problems they are and aren’t trying to solve. So without further ado, here’s what Cram does, doesn’t and will never do.

Cram manages files and scripts to be run when setting up an environment. Cram tries to help you make this all idempotent. Cram is designed to work for smallish amounts of configuration.

While Cram may 1.0.0 at some point, as it stands Cram is already the product of years of incrementally optimizing and thinking about my dotfiles managers when the need to bootstrap a new machine arises.

It is tempting to teach Cram how to read Starlark or some other config format and try to make it more of an ur-Nix, but there’s no reason to. The present incarnation of Cram satisfies my needs, and my Cram configs can zero-touch deploy either a new work macbook or my personal machines. Besides some small tweaks to enable other folks to adopt it, I don’t see major changes coming down the pike.

Cram won’t do sandboxing. Scripts are scripts. No safeguards.
Cram won’t have or provide templating.
Cram won’t have and doesn’t integrate with a secret manager.
Cram won’t have an inventory or data system.
Cram won’t have a conditional dependency system.
Cram isn’t a package manager and doesn’t support remote recipes.
Cram can’t abstract over where you’re going to get emacs from, it only knows that you have a package with a run = directive.
Cram won’t have a real proper extension or provider interface.
The concept of a profile is somewhat backwards - a real metapackage or group would be better.

Happy cramming!

Superficial Simplicity

2022-07-04T22:09:00+00:00

For the last decade I’ve chased and wrestled with the ideal of “simple” software, but I’ve come to see it as a false summit and want to spend some ink on why in the hope that it can lead to a better understanding of simplicity and more intelligent conversations about complexity.

Those of you who’ve orbited around Clojure will recognize the scare quoted word from Hickey’s “Simple Made Easy” (transcript).

To summarize, Hickey differentiates between things which are “simple” in that they do one thing, those which are “complex” (he uses “complected”) which do more than one or “too many”, and those which are “easy” in that they are convenient for a task. The core thesis of the talk is that software which is “simple” is intrinsically better, easier to build and higher quality than software which is “complex”, and that we can build up “simple” solutions by “composing” solutions to “simple” subproblems.

Like grugbrain.dev, Hickey offers an ad-hoc intuitive appeal to what “simplicity” and “complexity” are. Many of the things which Hickey explicitly presents as complex have or are forms of nonlocal effects and nonlocal data dependencies, but some are just large tools. And in subsequent talks and years Hickey has leaned on the idea of “decomplecting” as having much of the same meaning as approaching problems by decomposition and trying to build tools which do one thing.

I think this is broadly useful commentary, and while imprecise it played a critical role in growing my thinking about software.

Okay so we’ve got a senses of what constitutes simplicity – doing one thing or being more focused – let’s consider what happens when we do that.

Into the T-shirt tarpits

There’s this brainworm in programming language developer circles of making “kernel” languages. A “kernel” is a minimal language or environment which can implement itself. They orbit the Turing Tarpit, and have a lot of “Give me a place to stand and with a lever I will move the whole world” energy. Aspirations to better future computing grounded on simplicity aren’t uncommon.

Kernel languages are a neat hat trick for a language author. They’re self-satisfying because they concretely demonstrate the power and utility of the language; after all, just look at it, it’s self-hosting!

Best of all these tools are simple in the sense that they have few parts and do little. Few of these languages feature generic types and fewer still feature inference – these features requiring lots of supporting machinery. More often we see interpreted languages with a small ‘kernel’ of special forms which the implementation must provide. Norvig’s “One Page Lisp” is an example of this particular school of thought, as are the examples Michael Fogus discusses in what he terms The German School of Lisp.

I don’t mean to dismiss kernel lisps as being merely intellectually pleasing toys for language developers. Simple languages have arguable benefits.

Having a simple core model for the system lets users learn the entire model. A challenge users face with large systems or systems with large specifications is that it can become difficult to learn them. As a student one must find an entry point or a thread to begin pulling on. Niklaus Wirth’s work on Pascal, Modula and ultimately Oberon is one of the realest forms of simple languages and is perhaps best understood in the context of these tools as tools for pedagogy – not perhaps as tools for software development. “The School of Niklaus Wirth” relates an infamous story of how Wirth eliminated a hash table contributed by a student to the compiler because it added complexity, which makes the most sense in this light. Simplicity in the literal school where Wirth teaches to this day doesn’t just serve an aesthetic purpose, it facilitates students.

Brodie’s “Thinking Forth” contains this incredible graphic (fig. 4.7) which is used to emphasize his claim that forth is approachable because it is simple in the sense that the reader/compiler/interpreter does only a few things.

from "Thinking Forth" by Brodie pic.twitter.com/vOGx5dGKz8
— arrdem#4301 (@arrdem) May 29, 2022

“The Cuneiform Tablets of 2015” considers one vision of computing Kay has forwarded which tries to be “simpler” in the sense of having a smaller specification. Particularly, it is a pursuit of the same idea of a kernel which kernel language pursue; one which Kay has repeatedly phrased as ‘the T-shirt computer’ with the challenge “what are Maxwell’s laws of computing?”

To “The Cuneiform Tablets of 2015”, the virtue of simplicity is as an aid to communication and re-implementation. Similarly to Shen, Kay et. all suggest that a “T-shirt computer” could be used as a vehicle for long-term experience and information conveyance. Rather than trying to design an image, sound or video format which is directly readable hundreds or thousands of years hence, they suggest it could be more practical to define and convey a “simple” computer, and then to convey media for that computer.

It’s critical that, to Kay, this computer is simple because it is of small definition or implementation. This is perhaps related to the intuitive notion of complexity which Hickey and Grug appeal to. Size of implementation is perhaps a kind of dependency, and certainly to a programmer who must conceptualize of the program it is perhaps a view of the rough dependency tree or “number of things” Hickey describe.

SectorLISP, or JonesFORTH, or SectorFORTH are really incredible examples of this train of thought of small implementations. Here you have multiple implementations of entire viable abstract computers in a tiny amount of code. What could be simpler?

Building up, not shrinking down

Steele’s talk “Growing A Language” (transcript (PDF)) begins to get at what I think is the crux of a refutation of this claim. While the presentation is dated and some of the productions Steel uses don’t pass muster it’s a really phenomenal example of what it’s like to live in one of these simple-and-yet-not environments. I’ve always enjoyed watching his face in the first nine minutes; trying not laugh on stage as he produces obvious definitions one after the other until he can explain what he’s doing.

As Steele put it so eloquently

If you want to get far at all with a small language, you must first add to the language to make a language which is more large. In some cases, we will find it convenient to add “er” to the end of a word to mean more. Thus we could take “larger” to mean “more large” or “better” to mean “more good”.

The joke and also point being that he would have used the word “better” but hadn’t yet defined the required rule of meaning and so couldn’t.

To summarize, Steele posits that a language must BOTH be “large enough” to be “useful”, and yet be “small enough” to learn. The conceit of the talk and core premise being that being of “small” size and being “simple” conflate.

Steele suggests the trick in balancing size against utility is that a language must empower users to extend (if not change) the language by adding words and maybe by adding new rules of meaning. The hope then is that the language need only grow a little bit. Users can grow a language where and when they judge best. Meanwhile the core language need only change to further enable users to grow the language, if at all. The core language then is but a common set of meanings and a framework for defining rules of meaning which allow users to begin to speak together.

Throughout the talk, Steele begs a question which the kernel language projects highlight and which reflects on the general question of defining let alone managing complexity.

How can a small (simple) language (or tool) truly be better by dint of being small if one must grow it before one can use it?

Just as Steele does not distinguish between words defined by a language and words defined by a user, neither does this question. A language which users must extend in order to use effectively is as suspect as a library so incomplete that it must be extended or wrapped to be useful. One can even extend this argument to whole programs such as UNIX tools, or software systems.

This nicely refutes the simple-future-computing-through-simplicity aspirations I’ve long held, and that kernel projects evince.

Simplicity in one component, is not enough. Simplicity as discussed so far is a local property. This specification is small and thus simple; that implementation is small and thus simple. Calling tools of small size and focused functionality simple is self-fulfilling and fails to provide a meaningful theory for whether or not things built with simple tools are themselves simple.

Perhaps we should instead be considering how languages or computers let their users say what they wish to say, and compute what they wish to compute without imposing on them undue costs.

Convenience? Or simplicity through design?

Clojure itself is an interesting case study in this trade-off between what one could perhaps call internal vs exposed or demanded complexity. Clojure is not self-hosting, leaning instead on a Java compiler, core data structures written in Java and pair of large Java classes that provide bridges from Clojure to Java. Clojure’s implementation is reasonably complex, but it uses that complexity budget to try to paper over intricacies of the JVM and present users with a cleaner slate. For instance despite the conceptual simplicity of data that doesn’t change, the concrete implementation of fast immutable data structures is considerable.

The authors of Clojure don’t think you should be using classes to represent or encapsulate data. As a result, there are good literals for writing immutable maps, sets, vectors, strings, numbers, symbols and composites of these data types without interacting with their Java underpinnings. Meanwhile it would be difficult and far more verbose to write out comparable structures of objects as one would in Java.

Limiting the state space of data through immutability helps Clojure users manage complexity. Having convenient notation for complex forms of data helps Clojure users make the “right choice” by default. Having an interactive environment where users can explore transformations on data helps users build up their programs incrementally in a functional and compositional style.

Would making the tool itself simpler (do less) enhance these properties? After a lot of experimentation, I don’t think so.

Another interesting case study comes from the other side of the “better than Java” debate – Kotlin. Kotlin’s extension functions are an incredible hack which provides enormous ergonomic leverage without introducing fancy language features. In a LISP, nobody would bat an eye at defining a new WITH-FOO macro that sets some foo context and runs a body. Take for instance a closing macro which, which runs a body “with” an closable resource.

;; ~  enunciated 'unquote' is for inserting an expression
;; ~@ enunciated 'unquote-splicing' is for inserting many expressions
;; & is Clojure syntax for accepting zero or more extra arguments
(defmacro closing [[name init-expr] & body]
   (let [~name ~init-expr]           ; evaluate the init-expr (eg. open)
      (try ~@body                    ; do the body with `name` bound
         (finally (.close ~name))))) ; close

(closing [f (open "~/scratch.txt")]
   (.write f "hello, world"))

WITH-FOO macros are incredibly convenient because, as with with x: in Python or try (File f = ...) { } try-with-resources in Java, or even let statements they create local, lexically mapped scopes for resources or state. There’s nothing a (well-behaved) WITH-FOO macro does that couldn’t be implemented by properly balanced push and pop operations, but creating a syntactic pattern for setting and unsetting context, or using and disposing of resources, is incredibly powerful.

In Kotlin, this is “just” a function which happens to take a callback (which looks like a normal {} block body) and calls it.

inline fun <T> AutoClosable.use (body: (AutoClosable) -> T): T {
    try {
        return body(this)
    } finally {
        this.close()
    }
}

// Note that {} is syntax sugar for a lambda function
// Note that .f {} is syntax sugar for .f({})
File("~/scratch.txt").use { f ->
   f.write("hello, world")
}

A tiny bit of syntax sugar has effectively opened up all much of the same design space as a macro to users who want to make their language “more large” without at the same time making the language fully self-modifying or meaningfully more difficult to analyze. It’s a bit complex to understand that .use {} is a fancy function call, but the ergonomics benefits are enormous and an IDE or compiler can still analyze it because it’s just a function not a true macro with its own rules of meaning. Furthermore this general pattern of being able to add structured, type-dispatched behavior to a type, interface or intersection of such enables a controlled form of code injection that’s generally useful and has predictable scope in comparison to some other forms of behavior injection.

The risk of giving users access to real macros is that users can and will build their own entire languages. Take the infamous Common Lisp LOOP macro as a somewhat extreme example.

Being able to write

(loop for x in '(1 2 3)
      do (print x))

is perhaps well and good, but even “simple” examples begin to show the monster lurking in the LOOP DSL

 (loop repeat 10
       for x = (random 100)
       if (evenp x)
          collect x into evens
          and do (format t "~a is even!~%" x)
       else
          collect x into odds
          and count t into n-odds
       finally (return (values evens odds n-odds)))

This laughably complex macro is no fault of Common Lisp the language. Well. Maybe the fault of the language standards committee who should have known better than to include it in the language. But the tools for building such macros are fundamental to the language, and even were this macro not standardized users could and did write their own.

Perhaps this macro is “easy” in terms of the notation being familiar to programmers who have used other languages. It certainly seems like much of a traditional Pascal has been crammed into this macro. But in no sense is it “simple” or consistent with other notation within the framework of Lisp. This macro is unlike anything else, and has an incredibly complex interpretation defined by a large macro.

By relying on user-defined language extensions even for what a larger language would consider the standard library or core features, small languages can actually be far more difficult to build tooling around. Larger languages such as Java or Kotlin are able to lean on well understood syntax and either limited or no space for syntactic extensions to be precise about what is and isn’t valid.

The fact that one can build an async/await engine as a userland Lisp library defeats the kind of analysis required to enable many helpful compiler errors. Perhaps one could make errors less bad in the presence of macros and many have tried, but the user experiences remain poor. Simplicity of implementation can be counterproductive to managing the complexity users experience.

Building on Steele’s deliberate blurring of the line between a “language” and a “library”, I suggest this train of thought applies to libraries as well. Libraries – language extensions really – help us say and do more but can fail to help us manage complexity or impose costs on their users in exactly the same ways as language features.

I’d suggest that ORMs alone provide all the evidence of this one could ever want. Fundamentally, ORMs exist to try to automatically build bridges between however your program runs and how SQL (or some other database language) runs over there in a different interpreter on the database. This is an incredibly hard problem, and implementing this well requires getting an incredible number of design and ergonomic tradeoffs right. It may even depend on having a sufficiently malleable base language to do the job well going by the nearly unique success ActiveRecord has achieved by (ab)using metaclass hacking.

A large enough stack of T-shirts

All of this brings me and I hope you to the counter-intuitive conclusion that simple tools do not necessarily do better at helping users manage complexity than more complex tools. If anything, simple tools seem to do worse because by being locally simple they push more concerns out to the user to manage rather than participating in managing them.

A language or tool which prioritizes its own implementation or specification over the interface it presents to users will never be easy or enable its users to achieve simplicity as they must wrangle the remainder of complexity from the incomplete tool. Such a tool is at best superficially simple.

The real question – the unanswered question – is what tools effectively help users manage “complexity”, how and why.

FanExpo Denver '22

2022-07-03T00:00:00+00:00

A buddy happened to have an extra ticket to FanExpo Denver, so I got to swing through and check it out.

Despite being squarely in the target audience for events like this, somehow I’d never been to a ComiCon or such before, and FanExpo was definitely a cultural experience a couple of things stuck out from.

Vendors and booths

Weird mix of brands advertising something (gaming stuff mostly), smaller vendors with their own wares, artists with originals and ahem wholesalers. Several booths had literally identical products. The wholesalers were kinda fascinating because there were at least two booths with identical selections of prop firearms and swords. Now I appreciate that relative to say paper art prints those products require a lot of tooling, but they clearly came undifferentiated from the same supplier. There was also an xbox themed minifridge that occurred in multiple vendors’ selections.

It was also fascinating how the artists skewed. Of the artists, most had their own original work but much of it as presented was homages both in content and in style. I found (and bought some) excellent original art from a couple of the vendors, but the average booth was homages to either DC, Marvel or Star Wars characters in fairly traditional styles. There were a few fun stylebends – but they all skewed ur-japanese/anime? Eg. DC characters in vaporwave color schemes or stylebent to samurai armor were really the only style variations to be had.

But all the really fun art was originals or small studio comics, and even there it was hard to walk the line of original art and characters/themes vs homages.

Fandoms

I’d say overwhelmingly star wars. Almost zero trekkie presence. Maybe two trekkie cosplayers and a handful of vendors with trek inspired product. Meanwhile, Star Wars characters probably accounted for half the art I went through. It certainly felt like anime/manga plus traditional comics was still less by volume, with manga styles being in the minority. One Piece got probably the most representation, but I think I only found one piece of Akira art for instance and it wasn’t really art – just a replica of Kaneda’s jacket.

Cosplay and gender roles/effort level

Lots of mando and even original series characters and product, the good o’ 504th was out in force. Definitely a multivariant age split.

Younger folks’ costumes were almost uniformly anime. Soul Eater, One Piece, Kingdom Hearts all out in force in marked contrast to the over 40 set.

Adults were DC/Marvel characters. Couple of spider men, one whole spider family, couple Harleys, but again skewing away from Star Wars.

And then, bless them, you’ve got the 504th with set-grade R2D2 and Imperial Pilot builds. Easily age 40+ with money and time to throw at this.

Some outstanding semi-pro cosplays too, and fun talks on builds from that crowd. I need to play with resin casting and multi-step manufacturing processes from the printer to larger or repeated cast objects. Also really need to look into finishing techniques for printed objects.

Most folks who dressed up were femme or crossdressing-to. Probably 3:1 ratio. Dudes wore armor (504th, couple SPARTAN builds, several mandos, batman) chicks wore good anime outfits and the occasional Gwen Stacy. The common ground was absolutely body suits, and it was cool to see folks being able to get into costumes easily. While there were a few shirtless Inoske cosplays running around, there’s really something to be said about gender roles and perceived acceptability of costumes. When the mean dude cosplay is a suit of armor and the mean chick cosplay is an anime dress if not a Harley outfit, that says something about relative failure to perceive guys as attractive and presentable.

Closing thoughts

I’d be curious what the mix looks like at COAF, and I definitely got some fun art out of the exhibition floor but that was a … fascinating cultural experience.

It’s curious to me how much representation Star Wars got. It feels like a lot of the product present was Star Wars, while relatively little of the art was, and much of the cosplay (504th excepted) was anime. And this is at a “comics” event, not at an “anime” event. We’ve got one of those coming up!

It feels a bit like the con pulled a weird mix of old school tabletop/dnd folks (fairly small), computer gamers (even smaller), “fans” broadly construed of a lot of media, fans-who-cosplay and then lol 504th. I’d probably go again next year as a cultural experiment but unless I find something useful in the talk tracks I’d have a hard time hanging out at the con. Because it sure felt more like a garage sale of fandom products than a … fan event. And I don’t love my interest in things being so brazenly reduced to “CONSUME”.

Software working conditions

2022-05-17T06:40:00+00:00

If you’ve spent any time around a traditional workshop or machine shop, you’ve probably seen signs about how safety is everyone’s responsibility and about keeping the shared space clean. In an environment with sparks, unrated flammables left around are a risk. In an environment with rotating tools like lathes loose clothing that can wrap or snag and pull can lead to injury. Less extreme examples like sweeping up the shop and keeping the fridge clean all fill different parts of the same shared obligation to the other users of a space.

Another important question in a shared space is the space and its ergonomics. Are the tables the right height to be comfortable when in use? Are appropriate assistive tools like lifts and jacks available as needed? Is there working volume around all the tools so that multiple people can move parts safely and at once?

A feedback loop that’s well developed in physical shops and under-discussed in software shops is the relationship between workers and these ergonomic considerations of the space. In a physical space, it’s common and indeed easy to tweak a shop. Tables can be adjusted. Chairs that don’t come up to the appropriate height replaced. Drawers, vises and other working assists can be added. Toolchests moved or replaced with movable cabinets that the workers judge more convenient.

Particularly when working with metal or wood, the distinction between the manufactured goods used in the shop and the goods made in the shop becomes somewhat forced. If your business is making furniture, making your own is no great leap. If you work metal, welding up a shop cart or putting taller legs on a stool similarly is no great effort. The shop is itself as much a product of craft and capability as much as anything produced in it, and the quality of the shop enables (or limits) the quality of work done therein.

Software is somewhat similar. The only meaningful difference between a program you wrote and one you pull from a package manager is how long you wait for compilers. Once it gets fork()ed, it’s all the same.

Much like these other classes of workers, the tools of our craft are the same tools used to make our tools. And yet often times we don’t see the tools we use day-to-day as things we own or mold to fit our needs better.

I’m not talking about individuals editor configurations or shell preferences. Like personal toolboxes, these are personal choices are somewhat beyond critique. Everyone loves to argue about the best handplanes or has their favorite screwdrivers. They represent our personal working conditions, not our shared working environment.

I’m thinking here much more of the infrastructure that supports development processes. What’s your code review flow? What’s your acceptance criteria? How much testing do you do? How much testing can you do or could you do? Do the codebases you share with peers enable collaboration? Or do they have meaningful barriers to understanding? When changes are made, how is the balance between delivery and architecture struck? How is that reviewed and owned within the group?

In every sense, these tooling and process considerations represent the working conditions of a programming role but rarely do I see them discussed as such. Conceiving of these things as working conditions is valuable, because it shifts the frame from these things being arbitrary business decisions to acknowledging if not demanding agency over them.

Don’t like a tool? Why? What problem is it solving or not solving? Can you make it better? Is there an alternative? Can you make the workspace better for everyone by changing it?

Does your patch respect the current structure of the code? If it revises it, does it fit within the existing lines or does it establish new lines? If you’re abusing existing structure, how badly? Is it worth leaving a mess in the workspace for the next person?

Did you remember to sweep up and power down the tools before you turn off the lights?

Techdebt Tornado

2022-03-31T00:00:00+00:00

Techdebt tornado, adj.; pejorative

One who successfully delivers changes with limited or counterproductive regard for architecture.

One who produces work, especially feature work, at the cost of existing architecture and without developing a successor architecture.

The Thirty Million Line Problem

2021-04-27T16:00:00+00:00

This blog post is a review and musing on a talk of the same title “The Thirty Million Line Problem” (2015) (~1h lecture + 40m q&a).

To summarize the talk which badly needs it, Casey Muratori argues the following:

That software used to be simple not general purpose. The example of interest is game operating systems as for the Amiga and such. It was possible for developers to be in total command of a machine and its resources.
That software “complexity” (measured in lines of code) has exploded in the last decades.
That much of the “complexity” of modern software stacks is in the operating system not applications.
That “complexity” in operating systems is driven by attempting to support enormous variety of hardware and hardware abstraction layers which paper over differences between devices.
The example of USB is given as an instance where hardware designers gave device designers leeway to do whatever they want over a standardized bus, which has created more dependencies on vendor-provided drivers.

Casey posits that general purpose kernel+driver complexity will grow unchecked unless standardized device interfaces are set for device manufacturers, and hardware is standardized into an “ISA” which specifies primitive driverless device interfaces. He posits that this is not an unreasonable proposal; historical game consoles had known and documented hardware interfaces, and modern SoCs are arguably “standardized” computers. Both of these enabled simplification of drivers and in the kernel such as Casey’s seeking.

There’s a couple themes to Casey’s talk I want to poke at.

The first is what software complexity is. I’ve taken a crack at this in the past and want to again. Casey doesn’t make it clear if his position is that complexity is strictly line count of software, or if it’s something different. I think the question of how we measure complexity is really interesting and not one we’ve spent enough ink on. Casey’s intuitive argument that complexity is at least sketched by code line count seems to be a common one in the industry, and one worth exploring.

Second is whether hardware’s really the problem here. To give the bit away I agree that it is, but the why is interesting and there’s good reason to wonder whether the complexity is incidental or essential.

Casey’s definitely on to something that much of the complexity driving the reuse of general purpose operating systems lies in the device driver library that comes with each OS. Implementing even one device driver let alone support for entire product families is an enormous engineering burden and it’s no surprise that often driver support is provided by the manufacturers. For instance Intel and Nvidia both have dedicated Linux development resources.

Hardware abstraction layers are another enormous source of complexity, not just because they’re hard to implement but because they’re fundamentally faulty abstractions. The purpose of a HAL is to provide “predictable” behavior across a variety of hardware implementations. This means using hardware features where they’re available and trying to provide efficient software bridges where they aren’t. There’s a tension here between providing “transparent” access to the underlying hardware with an inconsistent interface and predictable (hardware determined) performance, and providing a “consistent” interface which masks the underlying hardware and may provide very different performance across different devices due to needing software implementations of what are hardware features elsewhere.

A “transparent” abstraction is really no abstraction whatsoever. It’s just an extra step in taking on a hard dependency. An abstraction which can’t provide consist enough performance won’t be useful because in order to get acceptable performance it must be bypassed or otherwise “seen through”.

There is of course a Dan tweet for this, but it remains an incredibly important point in software engineering it feels like we skate over routinely.

I'm increasingly thinking that every functioning system has two forms: The abstraction that outsiders are led to believe, and the reality that insiders actually and carefully operate.

You don't incrementally learn a system. You eventually unlearn its necessary lies.
— Dan Kaminsky (@dakami) January 17, 2018

Third and finally is whether Casey’s project of standardizing the hardware <-> software interfaces to eliminate the complexity of drivers and HALs would solve the problem it sets out to.

In some spaces for which the technology is stable I think it could and that standardization has already been achieved. Disk drives and other storage technologies already have good established driver interfaces. Keyboards, mice and other human input devices likewise have standard interfaces.

I think the problem with Casey’s proposal lies in the technology he cares most about - accelerators. The purpose of an accelerator is to provide the maximum of performance. This means that - at least to some level - a hardware dependent abstraction is presented. An at least translucent abstraction.

As hardware performance shifts, eventually the abstraction will to. For instance it’s one thing for a new board to expose a faster multiply operation, it’s another entirely to expose matrix multiplication or vector operations. The operational semantics of “multiply” stretch if you will to “fast multiply”. They don’t stretch to a fundamentally different interface. If a user desires maximum performance, they have to adopt the different operational model somehow somewhere. This seems to preclude the idea of stability and we’re back to challenging the notion of whether durable abstractions are even available.

In Casey’s world, that new interface would have to follow a new industry standard for what it would look like, so at least vendor churn of interfaces on the open market would be constrained and software would be able to target that standard interface. I think this is the best we could do, and it rings a lot of The Cuniform Tablets of 2015 in some regards of trying to present a stable if not preservationist minded programming target. Stable much in the same sense of the early computers Casey calls back to repeatedly - a fully defined architecture which never gets to change.

It’s interesting to muse on Casey’s point that a Raspberry Pi, no longer nearly the toys they were when this talk was new, arguably presents such a platform and whether Apple enjoys a comparable advantage with their relatively small hardware support matrix; especially on the new M1 hardware.

A Pi cluster parts list

2020-11-28T18:00:00+00:00

Previously, I talked about some limitations of building RPi clusters generally.

This time, I’m gonna cut to the chase and present my currently partially complete build.

First layer of foam seated and everything powers. Fan has been effectively removed from the design entirely. pic.twitter.com/SofIg2cZQD
— r"Re(e+|i?)d" (@arrdem) March 27, 2020

My basic design unit for the build was a W6.5” x L4.5” block consisting of six Pis bolted together using 11mm M2.5 standoffs and a USB power bar. The underlying hardware costs almost nothing and is available at your hardware store of choice, and there’s plenty of variations of acrylic sleds which fit into such stacks to be had.

The Pis themselves are all model 3 B+s, sourced from wherever you can find cheap pis and SD cards. In price shopping I found that Amazon’s listings for Pis were more expensive than those available on other resellers. I wound up going with CanaKit for mine.

For the case, I used a Nanuk 915. Nanuk is a lower-pricepoint Pelican alternative, and the 915’s internal dimensions (L13.8” x W9.3” x H6.2”) happen to fit two of these 7.5”x4.5” blocks side-by-side with a little room to spare.

For power, I’m using a single consolidated and switched 12v rail. I fabricated it myself using some basic terminal hardware (switch and barrel jack socket embedded in lexan fronting a terminal block) but there’s really no surprises there. The transformer I’m using is a 12V @ 20A / 240W monster monster spec’d to run potentially 10 Pis and switch and the display at once.

first off adding a real power jack so the power supply isn't hard solderer in. pic.twitter.com/qTYeeWYnQd
— r"Re(e+|i?)d" (@arrdem) March 27, 2020

And hook the new switch plate up to a new distribution rail pic.twitter.com/gzXOQudUO8
— r"Re(e+|i?)d" (@arrdem) March 27, 2020

For USB hubs to power the Pis, I’m using a Sabrent 60W USB Hub, which runs off of a 12v supply. This is important, because it let me standardize the entire case on a single 12v source rail shared between the currently one USB (and future second) hub, the networking switch and the display.

I will note that it’s important to use short USB cables so that the hub packs well to the “vine” of 5 Pis. I managed to find some 6” micro USB cables which worked fine, but I think you could get that down to about 4”. Or just give up on the USB hub entirely and go with a backplane, which is what I’d probably do were I to build all this again.

For the switch, I went with a NETGEAR 16-Port managed switch (GS116E). I specifically chose the cheapest switch I could find with support for VLAN trunking, and which ran off of a 12v source again so I could get the entire case down to a single transformer.

The real problem with choosing the switch is I wasn’t able to find one shallow enough to fit in the 6.2” depth of the 915 case, let alone when a standard barrel jack is hanging out the back. My “solution” to this was to shell the switch and run it as a bare board, replacing the barrel jack with a soldered pigtail to the rail.

Because I’m working with about a quarter of an inch to spare in this case, all the network cables were hand-made and hand-trunked to the switch.

The last addition to the case was honestly the cheapest screen I could find. The downside to this particular display turned out to be that its wiring is right-hand sided (I would have preferred left) and, critically, that all its I/O buttons are rear facing. This meant not only did I have to drill a VESA mount into the case, but I had to manufacture a stand-off plate so the buttons weren’t all permanently depressed and put pass-through holes in the case as well, to which I added some wire snips as button extensions so the controls were still usable when the display was mounted.

punched some through holes in my portable unit so that the back mounted controls on the bargain bin monitor I'm using are externally usable without unmounting pic.twitter.com/VI2obwmVY5
— r"Re(e+|i?)d" (@arrdem) June 19, 2020

All told, my cost on the build is about $600 on the build so far. If I add the other five Pis, that goes up by about $200. Considering I previously wrote about spending about $800 apiece for three AMD boxes, having a whole portable twelve host network for the price of a single server isn’t shabby at all.

Really the only unsolved problem with this case is cooling. Finding low profile 12v fans has so far proved troublesome since most motherboards run on 3v or 5v, and the 915’s mere 6” of depth doesn’t leave a ton of space for fans underneath the Pi stack.

I will also note that, installed in a dense bolted unit, it’s difficult to get individual Pis in and out which is a rather needed operation when setting the whole thing up. Were I to do this again, I’d definitely consider how to make a sled based design in which removing single Pis was easy work.

But this is what I’ve got and I rather like it! Fingers crossed we get some Intel (compatible) hardware that fits the Pi form factor so I can run a hardware mix one of these days.

Thanks to @krainboltgreene for reminding me that I never actually wrote any of this up.

Notes from building Raspberry Pi clusters

2020-11-28T17:00:00+00:00

A while ago I got it into my head to put a Raspberry Pi cluster together.

So I uh got a pelican case. And a ton of new pis. And am gonna try to build a portable unit. pic.twitter.com/2mxuYC1aLl
— r"Re(e+|i?)d" (@arrdem) February 19, 2020

As with the other builds floating around the internet, the intent of mine was produce a low-cost and in my case portable platform for developing and ultimately demonstrating cluster operations technology.

We’ve all seen Pi clusters around, and probably seen dozens of blog posts about doing this or that or deploying a cluster management technology on the Pis.

The first thing I’m gonna note is that the TuringPi exists, and that a compute module backplane or other module based minicluster is gonna be cheaper though more limited than any solution built around integrating multiple full size Pis.

That said if you’re heartset on a pile of full size Pis like I used, let’s get to it. I’m gonna skip the basics which you could google for easily and focus on some details that make doing a good cluster build hard. Namely, power, networking, mechanicals, netboot and I’ll offer some closing thoughts on the Pi platform in my application(s).

Power

The various models of Pis have different peak draw requirements. You weren’t going to be able to run something truly CPU intensive on the Pis as a platform anyway, but sometimes Zookeeper or Docker or what have you will peg the cores. And this means you’ll need to have appropriate power available to back it up, unless you want to see undervolting and soft locks.

Model	Power connection	Max draw (A)	… (W)
B	Micro USB, GPIO (5v)	1.2 A	6 W
A	Micro USB, GPIO (5v)	1.2 A	6 W
B+	…	1.2 A	6 W
A+	…	1.2 A	6 W
2B	…	400 mA	2 W
3B	…	730 mA	3.7 W
3A+	…	2.5 A	12.5 W
3B+	…	1 A	5 W
4B	USB-C, GPIO (5v)	1.3 A	6.5 W

It’s important to note here that max current draw is approximate, as it really depends on what other peripherals are hooked up to the Raspberry Pi. These numbers are conservative (high), reflect using USB peripherals on a given Pi in addition to running the Pi itself. The Raspberry Pi foundation quotes higher power usage numbers which assume USB port peak load not CPU peak load.

Power over USB

If you’re going to run several Pis together in a cluster these power numbers matter because you’ll need to ensure that whatever wall voltage to USB hub or other power source you’re using is appropriately provisioned. For instance if you were to build out a 5 RPI 3 B+ cluster, your max power draw is somewhere around 25W. Considerably higher than most 5-port USB power supplies.

The main drawback of going down the USB power road is that USB hubs typically aren’t individually switched, let alone with software control. While the Raspberry Pi is an incredibly stable platform, it lacks the remote management capabilities which can be expected of server hardware.

In a datacenter, if a computer gets real borked, you can usually remotely power cycle it using IPMI in a fancy deployment or in simpler setups by just … unplugging it and plugging it back in again with a remotely managed power distribution unit. Entire management systems such as Open19 rely on being able to do this variously. But a typical USB hub won’t deliver anything you could automate around like this.

Power over backplane

All the Pis have the same GPIO header layout, and it’s possible to power the Pi models directly by supplying a 5v power source. This is how the Power Over Ethernet modules for the Pi work. They provide physical negotiation of PoE delivery, convert voltage as needed and deliver power directly to the Pi’s GPIO. The main drawback of the PoE modules for the Raspberry Pi is cost. A PoE hat for the Pi can cost $30, and less integrated PoE splitter solutions can be down to $12, you’re adding a considerably bulky component to every Pi which can complicate a mechanical layout.

The main advantage of PoE is when deploying devices remotely from the power source, where it’s convenient to deliver power over ethernet rather than separately supplying power. That’s far less relevant in the context of building an integrated cluster, but since some PoE delivery [network] switches offer the ability to turn off PoE delivery per switch port, which would be another way to get remote switching capabilities.

Backplane solutions such as Bitscope’s Blade or better yet ClusterCTL Stack can also make powering groups of Pis extremely easy. Particularly, ClusterCTL provides software defined power switching per-pi which can be used to implement the sort of remote hard reset discussed above.

Networking

It’s also important to note on the networking front that the Pis really are a … limited platform. Commodity compute hardware has offered full GiB/s throughput for years. The Pis however, don’t.

Model	Networking
B	10/100
A
B+
A+
2B
3B
3A+
3B+	10/100/1000 (3Mb/s)
4B	10/100/1000 (full)

The Pis are relatively low performance so just about any off the shelf managed or unmanaged switch will be able to keep up with them. You will need to be able to dedicate a switch port per Pi, but the main consideration in network design for your cluster is how you want to structure DHCP and manage egress.

Networking considerations

I’ll say more of this in a bit when it comes to booting the Pis, but the Raspberry Pi has some … interesting ideas about how netbooting occurs with respect to more conventional platforms. For now, I’ll just say having routing separation between your Pi cluster and any other networks you may run will be convenient because you’ll probably want to run a separate DHCP server not use an embedded one.

In my setup, I accomplish this by runing an unmanaged switch which I connect to an isolated (separate VLAN) upstream switch port. This leaves me at liberty to run my own DHCP server on the unmanaged switched network, and makes the network viable when disconnected from any upstream router(s).

If you want to use a Pi cluster as a testbed for real networking problems or distributed systems, you’ll almost certainly want to run a more sophisticated piece of routing hardware than just a generic Netgear unmanaged switch. Otherwise you’ll have a hard time simulating or causing link failures, packet loss, lag and such.

Laying out a cluster

There’s a ton of mechanical layout options. My original build followed the traditional “pis over power and switch” layout, but because I was using an appropriately provisioned beefier power supply it wound up looking a bit different.

The @tirefireind pi cluster got rebuilt and is looking mighty fine now! Only one of the pis seemed to need re-imaging, but probably gonna spend some time thinking about how to do PXE and roll them all just because pic.twitter.com/CFRlt5iuLg
— r"Re(e+|i?)d" (@arrdem) January 19, 2020

There’s any number (1, 2, 3, 4, 5, 6, 7, …) of 3d printable cluster configurations to be had and a quick google search for "rasbperry pi rack" turns up a number of vendors who would be delighted to sell you something packaged.

Cases

Things get tricker and there are fewer examples of fitting Pis into common hard cases such as Nanuk or Pelican products. It can most certainly be done and done well, it’s just unusual.

I documented most of rebuilding my case -

tore everything out. let's revisit pic.twitter.com/pXx9080t6h
— r"Re(e+|i?)d" (@arrdem) March 27, 2020

And my friend Matt built a comparable thing using a 2u sled design -

pic.twitter.com/zdfkvJdn43
— Matt Getty (@aspen) November 18, 2020

A surprising number of cases can be had which are properly internally dimensioned for a 19” wide rack mount unit, and as the Pis aren’t particularly deep (65mm or 2 9/16” on the longest side) so going with a 2u sled for packing pis like Matt did is a pretty good strategy. I went with manual packing of a 5 pi block into a Nanuk 915, which it turns out fits two such blocks although I haven’t felt the need to expand my case yet.

The big downside of the packed case design I went with is that appropriate cooling is really hard. This isn’t a huge problem for me since I’m not looking to run workloads, just demonstrate provisioning technologies. But the Pis do run plenty toasty when pushed, and most of the Pi “rack” solutions do incorporate fans to push air through the Pis for a reason.

I could probably do a better job with my case layout if I were to design and 3d print up a 5 Pi carrier which integrated with a fan and bolted into the box, but right now everything’s hand-packed with foam. C’est la vie.

Booting the Raspberry Pi

[Network] Booting the Pis is, politely, a mess.

The brief version is that all Pi versions will first try to boot from their SD card, and will then try to boot from a USB device. If you’re willing to individually image SD cards, go crazy. That’s a well trodden path that totally works, although it doesn’t dovetail particularly well with any sort of remote cluster management technology like Puppet or inventory discovery or what have you.

In a real production environment, you’d boot new hardware into some sort of “discovery” phase using either your DHCP server or netbooting hardware to an “OS” which collects host metadata, reports it back and reboots the machine so a different decision can be made metadata in hand. Usually in a production environment you’d use a HTTP server to emit PXE menus (2008), and play games in your webserver of choice to control the generated PXE menu.

The good news is some Pi models (the 3s and later) also support a rudimentary form of network booting. I won’t spend too many words on it here, it’s reasonably documented, but the short and very bad news is that the Pis don’t use PXE booting.

Instead of performing a PXE boot, they implement a form of TFTP booting. This is what booting a server used to look prior to about ‘99.

The Raspberry Pi’s firmware knows how to make a DHCP request, extract a next-server and boot-file from the DHCP response, will fetch that file and boot it. Typically, this will be a bootloader, which separately identifies itself and requests more files, eventually loading a .txt file specifying a kernel initrd and command line to boot. This works great, and totally works for implementing locally stateless boot of Raspberry Pis.

The really bit caveat is that netbooting is (unless you have a Pi 4 whose boot order can easily be changed) the dead last thing a Pi will try when it turns on. This means that if you write a (seemingly) viable image to a local SD card, that bootable SD card will always win out in the future over a bootable network. This can present a remote management challenge if you want to be able to recover wedged or corrupted hosts without manually re-imaging SD cards, as in a classical production PXE environment you’d PXE boot every time so remote management would be able to recover a “wedged” host.

iPXE

PiPXE is a build of the iPXE PXE implementation for the Rapsberry Pi platform. Leveraging the (shiny new!) EFI support backported to the Pi 3 series and present in the 4s, it’s possible to use iPXE as a chainloader in a much more conventional PXE menu based boot process than the TFTP based process which the Pi firmware provides.

The ENORMOUS CAVIAT with this is that it isn’t possible (As of this writing, November ‘20) to TFTP boot a Pi to the iPXE chainloader. The short version is that the Pi’s firmware has a “filesystem” abstraction layer which treats TFTP roots and SD cards the same. The EFI implementation has no comparable support for the Pi’s network card. This means in order to do PiPXE booting, the PiPXE boot configuration must be present on a local SD card although I believe it works fine beyond that.

In review

If you want to build a 5-node (or smaller) cluster with which to demonstrate some piece of software, the Pis are a pretty reasonable platform for that. You’ll buy some SD cards, flash each one by hand, give each node a name, maybe use some Puppet or Ansible to configure them after you’ve done some hand setup and it’ll work great. Throw docker-swarm or k3s or something on it and treat it like a cloud, at least until something goes wrong.

For myself, having built a Pi cluster with the intent of using it to demonstrate more traditional large-scale remote management tools which depend on PXE, the Pi has proven to be a limiting substrate due to its lack of proper PXE support. Were one willing to implement a custom TFTP server with support for variable content or do some serious firmware development it would be possible to implement something resembling a traditional PXE provisioning flow but that’s a pretty heavy lift for a hobby project.

Thanks to @krainboltgreene for reminding me that while bits and pieces of this have been tweeted, I’ve never codified it.

More precious than silver

2019-11-04T16:00:00+00:00

I’ve been doing a lot of reflecting lately on the last project I shipped - what went well, and what didn’t. A while back I tweeted out some halfbaked thoughts. One of which was a reflection that while the entire engineering organization beyond my team was using a tremendously powerful toolset, we still got bogged down.

My group (> 30 engineers) just delivered a fairly major set of systems, and lemme tell you no amount of leverage to make change within our components was able to save us from the overall complexity of the problem.
— Reid (@arrdem) November 17, 2018

Why? How? The entire reason I got started caring about software engineering and tooling in the first place was trying to find longer levers with which to move more. With which to take on and ship otherwise intractable projects. Have I just been barking up entirely the wrong tree this whole time?

While not particularly well structured or even cohesive, the thread seems to have hit some chords. Particularly Dimitri came in with this observation -

My team is over 30 devs, but we break into sub-teams of < 5. The reason has nothing to do with technology, but rather communication overhead. More people working directly together means more interactions, emails, meetings, and so on. I've yet to see an effective team over 10.
— Dmitri Sotnikov ⚛ (@yogthos) November 17, 2018

And Tim as well -

I want to say you're wrong, but I'm not sure you are. The fact is, most of the time when a project grows to needing 6+ devs, there's a push to break it up into more services, or the project boggs down. So, microservices that communicate via pre-defined protocols <cont>
— Timothy Baldridge (@timbaldridge) November 17, 2018

And somewhat to my surprise Zach -

That could be (generously) interpreted as something I’ve often argued, which is that everyone should have independent ownership of some part of the codebase, and the trick is to figure out who gets what. Obviously you want overlapping knowledge, but not overlapping authority.
— lambda the proximate (@ztellman) November 17, 2018

The question for me started out as how do I, as a software engineer, optimize myself and my ability to ship code. Okay so go out buy a keyboard learn a power editor, learn a language and ecosystem with leverage… and you’re a super hacker, right?

Six years heavily invested in Clojure and other things widely regarded as power tools later, well maybe not.

You can improve a single person’s output, but only to a point. There’s only so much sleep you can loose, so much coffee and pizza can do before you remember there are only 24 hours in the day and do you really want to be spending all of them with a keyboard? Of course not. Maybe there’s an age bracket for that - but even I’m a bit of a whippersnapper and I’ve found it fleeting at best.

The conclusion is that, well, college me miss-stated the problem. Surprise.

The question isn’t how to maximize single developer throughput as it is how to maximize team throughput. You and I don’t scale. We have fixed bounds - and better things to do I hope.

We however, maybe we scale. And the more of us that can be brought to bear on a problem, the more we can get done. So perhaps a better stated problem is this -

How do we make people more effective as engineers building software systems and using software to solve problems?

This is a big and thorny question which has been and must be attacked from many sides.

Coding?

The first and most obvious - the mistake I made - is to optimize the simple act of coding.

Traditionally the act of coding has been regarded something like the act of sculpting clay. One starts with a formless mass and from that creates the world through grand vision. This is somewhat apt. Most Software begin life as an empty set of formless buffers and directories on which the programmer must impose their will. This is a narrative which leaves no room for false starts, rework and learning. It falls afoul of the narrative fallacy, wherein past events correctly viewed presage their future when perhaps they did not.

In fact Parnas 86 “A Rational Design Process” provides a compelling argument that such a perfect forming of software into the world is impossible. In short - the act of creating software let alone deploying it into the world changes the world. No matter how perfectly suited the software was for the world into which it was deployed, the act of authoring is an educational one. We as program authors develop greater understanding of problem spaces when we build software. Furthermore anyone exposed to the initial product will refine their understanding of what their needs are. Both of these changes at least partially invalidate the criteria under which the software was to be evaluated and against which it was design, demanding re-design and iteration.

This suggests that, while coding may be an essential part of the software development process, it coding is more than simply building the right thing. It has exploratory aspects, and maybe that’s more the value add.

OODA?

Those of you perhaps more acquainted with fighter jets may be aware of an (arguably overused) term - the OODA loop.

The OODA loop (Observe, Orient, Decide, Act) loop is a bit of jargon used to refer to the albeit obvious structure of any decision making process. It was popularized by Boyd - particularly the idea that you can “get inside” another entity’s OODA loop and “beat them” that is be more effective at decision making just by being able to iterate faster.

A pilot who can react faster has more time to respond. A command and control system which can detect a possible “bolt from the blue” strike is better able to take measure to protect itself.

Or to choose a more mundane example, a programmer who is able to rapidly explore their intuitions and explore program behavior will be able to at the least test more theories and will probably learn more and certainly more easily. They will be able to develop confidence in their tools and their application through experimentation. They will be able to build up a metis (craft, cunning, skill) with their tools born of familiarity as opposed to techne (skill, technique) of rote understanding.

This is not a new idea. Our ability to think with tools depends in large part upon our ability to understand our tools as extensions of ourselves through rapid feedback. Read, Eval, Print, Loop (REPL) workflows and ideas about interactive programming exist to try and tighten the feedback loops between programmer and machine. And there’s an entire field of study devoted to characterizing the “acceptable” latency of machine interactions in terms of how quickly a human is able to process that feedback.

For instance learning Jenkins will be hard when it takes an hour to update a job because the JJB builder sits behind a slow CI system and you need reviews to make changes. Learning Python is easier - it sits on your computer and responds relatively quickly to inputs.

Likewise there are a ton of opportunities with blue/green deploys, red/green tests, α/β testing, traffic generation, traffic sampling and event based systems to choose architectures which allow for rapid, safe experimentation and tightening the OODA loop of “pushing to prod”. Assuming we can’t make local development sufficiently prod-like to offer similar feedback wins.

Planning?

Communication and coordination processes like Agile exist to try and optimize the OODA loop at the organizational level. They provide a framework for the communications of requirements gathering, prioritization, scoping and delivery. At the end, it gives an objective for delivered value which can be re-evaluated and iterated upon.

The OODA decision is just utilitarian evaluation on a short timescale - evaluation on a long one is too hard.

Unfortunately this leaves Agile open to the usual attacks against incrementalism and utilitarianism. Its limited scope of evaluation blinds it to long-term costs or yields.

This workflow excels where the problems to be solved can be incrementally delivered, and where decisions are not expected to have long-term or irreversible consequences. For instance a web application which simply provides views to some backing data store can easily be delivered incrementally. It can be reworked incrementally, and choices in its design can be cheap to re-visit because the application is merely a client with no carried state that must be managed. Re-planning or pivoting is cheap.

This is clearly not a general property of problems, and leaves out may problems which due to integrity constraints, fixed investment in data storage or other relatively immovables are facts of life. However as from Parnas, no matter how good our plans may be we will find fault in them and they must be adjusted. Fully gathering requirements and planning is impossible, as is developing “the right thing”, but neither can we blindly A/B test ourselves to product/market fit. Design and plans are required.

Communication?

Designing systems to minimize state, or even be stateless makes it easy to discover implementations of desired data flows and for engineers to build intuitions about the system. Unfortunately however, no matter how easy it is to build up intuitions and insight, they are personal. Even with exploratory tools we need to be able to communicate insights to other people if we really want to be able to scale out the development process. Furthermore in order to plan development effectively we absolutely have to be able to communicate lest we leave out understood factors.

One solution to this problem is simply to retain smarter people and have mostly single ownership as mentioned by Zach above. This works - until it doesn’t. Single ownership effectively optimizes to maximize context, it mitigates some sources of deadlock by assigning a leader for components of a project, while still enabling elective collaboration. Because collaboration is elective not the norm however, it’s easy for owners to become single points of failure who are unable to say get off on-call or go on vacation without impacting timelines. This will produce long-run org degradation. You won’t be able to have an on-call rotation. That person won’t be able to go on PTO without undue impact. Even on the most healthy team(s) people get tired and need change, and when they choose to do the next thing you’ll be bereft of the only person who was a domain expert on the component. A rewrite will be the likely result.

Single ownership however still seems like it could be an efficient thing to do because communication is hard and teaching is slower and harder. Communication costs increase with the square of the number of people involved - this is just an obvious corollary to Metcalfe’s law. So what’s the dynamic space here?

We all know Brooks’ The Mythical Man Month, if only for its famous statement that adding more engineers to a late project will make it later not speed it up. Brooks points out that adding more heads both increases the coordination burden on the team, and that the new heads have to be trained before they become productive! Obvious perhaps, but a mistake I’m sad to see I’ve seen folks make.

Okay so we know that coordination costs increase exponentially with headcount, and there’s a cost to bringing new folks up to speed which together produce Brooks’ slowdown. But how can we put numbers on the ideal size of org?

Well a straw poll works in a pinch, but the sample size is pretty small.

Follow up - what was the real duration?
— Reid D. M. (@arrdem) October 5, 2019

It turns out that no small amount of effort has been spent over the centuries optimizing the size of a combat unit, and perhaps infantry tactics being an applied exercise in group psychology and communication costs will be somewhat reflective of software dynamics.

The history is fascinating, but you’ll forgive me for not recounting all of it here. Powell ‘18 gives a good treatment of the US’s explorations, as does the previous Hughes monograph on which it is built.

The short version is that the US has experimented with a number of squad configurations consisting of several teams under a single leader - occasionally with a delegate. While extremes of two-person pairs and ten persons units under a leader have been tried current infantry doctrine appears based on two five-person teams with two overall leaders, for a squad headcount of twelve. The advantage of this organization fundamentally is that it - as the minimum size for ether team is three - both teams have resiliency built in as does the leadership. furthermore by being organized into teams, decisions can be devolved and the group overall is flexible. Other configurations such as three teams of three about as good, but have less resiliency.

It’s interesting to note that this conclusion aligns nicely with the vignettes about software team size from Demitri and Tim, as well as with my straw poll. This should be unsurprising - we’re trying to measure what should be a rough psychological constant - but it is a check to show we’re in the right ballpark.

Planning again? Or is it still communication?

So if the maximum team size is about five, and the maximum effective organization size under one leader is about 13, doesn’t that limit the scope of our designs? A group of 13 can only do so much. There are only 24 hours in the day, of which we only work 8, 5 days a week, 48±4 weeks a year or so. Furthermore, a group of 13 (really 10 or so engineers) can only have so much skillset coverage.

Perhaps of most importance, how do you coordinate design efforts across groups? Conway’s Law (solution architecture will mirror the structure of the organization) is real, and must be resisted. But if you can’t really get more than five minds on a problem, then at best you can get representatives of five 13 person teams in one place? Which puts the effective limit of the scope of an engineering project is 65 minds - partitioned into units no larger than the scope of 5 minds.

Maybe you could build arbitrarily large log₅ trees of engineers, but doing so presupposes that your chosen 1 of 5 representative AT EVERY LEVEL is able to capture ALL the context of their four peers. That is these representatives themselves must be Sufficiently Capable Domain Experts. Fundamentally this is the Scrum of Scrums concept.

If it isn’t safe to assume that context can be losslessly compartmentalized in this manner, then you’re faced with decision making costs which grow at least linearly on the size of the organization. The delegate being unable to capture all context and answer all questions, questions must at least be devolved and we’re again exploring the space between O(log₅ N) and O(N²) going back to Metcalfe.

All of this drives me to the conclusion that, well, some things just take a while. Both architecture and software development take time, and the “hammock driven development” meme is more valid than we may suppose. Some things just take time, and even once a rough design is settled upon, they still take time.

In general, problems don’t parallelize, and designing a problem based decomposition requires context and coordination, which limits scope and is serial. Large scale attempts to produce a decomposition will fall into the Conway Trap. Maybe you can even “Reverse Conway” engineer your organization to create a structure which will drive a more acceptable solution architecture - but you still have to come up with that design and this presumes your organization is politically flexible enough for such an outcome to be practicable.

So either we must do things achievable usable only with our present tools, or we must steel ourselves for the journey.

I guess this is a disappointing conclusion as it merely affirms that some hard things are hard and cannot be made easier. Maybe knowing saves our futile seeking and worrying; and that is more precious than silver.

Thanks to Matt ‘@arachnocapital2’ for reading a draft of this essay and contributing some managerial feedback.

[2016] Breaking changes considered essential

2019-10-28T16:00:00+00:00

Version numbers are hard. We as programmers are awful about tracking what constitutes a breaking change compared to what constitutes an invisible bugfix. From a formal semantics perspective, there is no such thing as a bug and every “bugfix” is strictly speaking a breaking change. And yet this notion of “fixing” behavior “broken” by shoddy implementations is central to modern notions of libraries, dependency management and deployment. I’ve previously written about versioning, its semantics and some theories about how we could do it better.

This time I’m flogging a different horse: the unavoidable necessity of change in software.

Bugs are the obvious motivator for change in software artifacts, especially in this open source world. Users or “the community” finds bugs, maintainers find bugs, some combination of these conspire to fix bugs found and in theory release those fixes back into the wild. We as users all want to take advantage of these changes. SemVer, Elm’s versioning, repeatable builds and much other well established practice helps to ensure that taking advantage of “incremental” changes and fixups is tractable.

But bugfixes aren’t the only changes we may want to encourage.

Why do we want libraries which persist unchanged forever? I once had a conversation with a senior engineer in which he held forth as a good thing that here today in 2019 we’re running FORTRAN code unmodified since the mid 80s via whatever $NEWLANG to libc to libfortran monstrosity we need to cook up to do so.

There is an interesting argument to be made, implied by Steele’s ‘98 OOPSLA keynote, that a language is not simply a compiler, interpreter and other runtime infrastructure. Rather, it is somehow the semantics defined by the runtime, the semantics and critically the style of the standard libraries and those of libraries which users choose to write and which gain adoption.

This may sound obvious, but think about it for a minute. The semantics of the language and its standard library clearly dictate how code can be written in some language. The less obvious part is that the style of the standard library and the culture around the language dictate how code is written. Consider Ruby and Python. The language cores are immensely similar. They’re both object oriented VMs with modular imports and object oriented standard libraries. In both languages object introspection and metaprogramming are possible. Yet techniques like metaclass hacking dominate in Ruby when they are pretty unusual in Python. Designs which would be considered idiomatic in Ruby would be strange in Python and vice versa.

Worse, as languages and communities age, change occurs. The performance characteristics of the language may change. Libraries are invented. Features may be added. Blog posts on style written. Talks and experience reports on patterns given. Programs which once were too slow to write may be be fast enough now to be commonly used. In short, better patterns of program design appear as the domain to which a language is applied becomes well explored. Designs once considered excellent will age badly or ossify, their limitations recognized at least by some. However in the process of this learning, the community will write a bunch of code whose limitations won’t initially be appreciated.

Considering the microcosm of Clojure, early Clojure libraries stand out clearly from more recent work. They are essentially thin layers around Java APIs which perhaps didn’t need wrapping. They may not compose well, they may make pervasive use of mutable state, they may not offer good facilities for debugging, introspection, static understanding and so forth. More recent work is characterized by using reified effects, state values, reliance on functions rather than on macros, avoidance of kwargs and other design choices too numerous to list.

So if we take this conception of a language as the semantics of its standard library and of its libraries, we can clearly say that the semantics of Clojure have changed over the years as everyone has gotten better at writing Clojure and as the Clojure community absorbed more cross pollination from the pure (and purer) functional languages. Can we then say that Clojure code circa 2007 is the same Clojure that we’re writing today?

Arguably it is, as of Clojure 1.3 when the clojure.contrib namespace was deprecated and removed, not much has changed. Many things have been added such as the arrow macros, transducers, EDN and still more, but all the old code probably works.

And we shouldn’t be using it!

The changes in style that have occurred are I argue more than simple changes in fashion. Perhaps one could argue that if naming style changed (which perhaps it has) migrating from one library to another which has more stars or a nicer webpage or better names is a waste of time. The problem with this view is that it minimizes the many cases when APIs that seemed like a good idea at the time turn out to be badly designed in the long run and should be replaced.

Clearly there is great value in the Java approach of maintaining compatibility for all time. This means that the library base can grow to be enormous and most tasks become exercises in evaluating and plumbing together libraries.

However as many of the older Java APIs are evidence, it is not the case that one API no matter how stable or well defined initially remains appropriate for all time. The Java collections API is simply too big and assumes too much mutability to apply to new collections such as Clojure’s which are immutable. APIs which use Enumerable instead of its successor Iterable are one example of this. Another is the decision for various Java core classes to be final such as Pattern, precluding other tools which may target the JVM from offering compatible functionality.

The critical aspect of these design choices is that they all seemed reasonable at the time and only later as the language and community evolved were they thought better of. Call this conceptual debt if you will. For all the advantages of compatibility, I claim that it is a structural misvaluation of engineer time going forwards to claim that the price of upgrading across changed APIs is always lost.

In the presence of pervasive immutability and purity, APIs can be safely abandoned in place. They may be inadequate or map poorly to potential use cases, but their continued presence has no carrying cost. This isn’t true for mutative APIs - the semantics with which we manage physical memory, process protections and other resource management(s) cannot be so abandoned. There the continued presence of the old APIs and their mismatches the desired semantics continually undermine the desired semantics.

Two solutions come to my mind.

The first is to undertake the mighty project of ensuring compatibility cannot be breached at any point in the future. Unison is a fascinating attempt to design a language which provides precisely this property - all code remains valid for all time. While I applaud the effort, it’s not clear to me that this is possible or even desirable. Software seems to be the business of inventing words, attempting to presage all their contexts, meanings and uses. Choosing ground rules for a programming system which enforce infinite forwards-compatibility creates rather odd cases of undesired consistency and makes the escape hatch perhaps a bigger hammer than we wished. In short, semantic breakage remains possible if not needed and it’s not clear that being able to travel back in time to the old ‘stable’ code is a valuable property given those limitations.

The second and perhaps more obvious one is to relax the constraint of unending support for all previously valid programs, and admit that we’re in the business of planning and managing change. The breaking changes between Python 2 and 3 is a perfect example of such a thing. In order to deal with structural issues in the language, the decision was made to undertake breaking changes. I think that many see this as a bad thing because it served to fragment the community and because Python 3 was not adopted as rapidly as anyone involved would have wished it to be. But it’s an interesting case study in such a thing and the effects which it has on the community ecosystem.

This is not to say that I think breaking changes should be undertaken lightly or frequently. Unless undertaken carefully and with due notice, breaking changes only serve to tire out users and library maintainers. If only for his conception of users finite willingness/ability to learn, I think that Brian Goetz’s talk at Clojure/Conj 2014 was worthwhile.

Brian Gotez and I are fundamentally at odds here, as he admits that he’s at odds with Rich. Brian quotes Nikos Kazantzakis:

You have your brush, you have your colors, you paint the paradise, then in you go.

I think the folly in Brian’s argument is in the concept that, at least for software, there is another approach. From a formal specification standpoint, most changes even “patches” and “bugfixes” are breaking on a formal semantics level. As DeWayne Perry is fond of saying

We are in the position of minor gods, able to build rocks which we ourselves cannot move again

Now there is an argument to be made that some software is truly “finished” and need never be changed. Old numerics code in FORTRAN is the usual case study of this. Untold metric grad student souls have been poured into ensuring the correctness and performance of this software. Breaking let alone even replacing these programs is a simple waste of effort.

To this argument I have no explicit counter. I have a utilitarian argument in that there are exceedingly few such libraries, as they cover only well understood domains. Newtonian physics for instance is well understood and there is little need for improvement. Likewise numerics libraries. Has the definition of matrix multiplication changed? However the overwhelming majority of the tools which we use are neither of such vintage nor of such quality. Database drivers come and go. HTTP clients are a dime a dozen. If cleaning the slate of scratch work comes at the price of repeating foundational formulas occasionally that is the price of progress.

In my research for this article, I came across a quote on StackOverflow (source)

For every evangelical programmer/blogger there are 1000 avid blog readers that immediately re-invent themselves and adopt the latest techniques. For every one of those there are 10,000 programmers out there with their nose to the grind stone getting a days work done and getting product out the door. Those guys are using tried and trusted techniques that have worked for them for years. They wait until new techniques are widely adopted and show actual benefits before taking them up. Don’t call them stupid, and they’re anything but lazy, call them “busy” instead.

This is more I think the unseen enemy. The argument Brian makes is that it takes “a certain kind of hubris to say that the code one wrote shouldn’t be written that way anymore”.

The central tenant of tool, library and language development is that we do not have tools which are appropriate to our present needs, regardless of how appropriate they may have been to our previous needs. If we had said that writing assembly was good enough and that’s the way all programmers should program for all time because it’s the way that programmers already knew how to program then why do we have the incredible diversity of tools available today? We’d all be better served banging registers together by hand like cro-magons with rocks.

If this is not a patent argument for stagnation, I don’t know what is. The thesis of this argument is that the incremental costs of teaching programmers (or rather of programmers learning whether personally or corporately) to use new tools, new libraries, new styles does not justify the returns in productivity and defect rate.

So where does this leave us.

Unless the domain is well understood, library lifetime is a misplaced priority.
We don’t really understand (or at least don’t correctly abstract) most domains.
Use (especially continued use) of poor abstractions in ignorance does not somehow defray their costs or escape their weaknesses. Poor patterns have carrying costs no matter how familiar.
Burning bridges needs to be done carefully. It has costs in time, effort and motivation.
If we don’t burn bridges, nothing happens.

Clojure code survival graph, illustrating an additive approach to growth. #clojure pic.twitter.com/gOyP8Ruc5U
— Alex Miller (@puredanger) August 15, 2019

So what to do? Light it all on fire. Eventually. Tastefully. When it’s clear that it must burn.

Thanks to Angus ‘goose’ Fletcher for reading a draft of this essay.

Test post

2019-10-26T16:00:00+00:00

sorry goose

this is testing the machinery to thank you for the next one ;)

Automated Publishing with Jekyll

2019-09-17T21:45:00+00:00

This post is somewhat meta - as it concerns a whole bunch of the automation by which I go about writing on this blog.

Alright nerds. You asked for natural numbers and zookeeper but you're gonna get trash world ruby first. Strap in.
— Reid; Yak Hunter (@arrdem) September 17, 2019

I’d like to be able to write more. At present, I’m running a CI/CD setup on one of my servers which - when I push to the blog repo - causes a deploy. Great! No touch deploys! Honestly, it’s served me well for a really long time now.

The not so great part is the semantics of Jekyll under a setup like this. Jekyll is a pretty tried-and-true static site generator, used for among other things Github’s Pages feature. Ruby ain’t my personal cup of tea, but it does a good enough job of taking Markdown, applying some CSS and emitting HTML which is all I need.

But static is the operative word here. Jekyll only exists at $ jekyll build time.

Now Jekyll has a feature - date - which lets you tag a post you’re authoring with an effective date. If this date is in the future, Jekyll will ignore it unless you’re rendering in --future mode. See where I’m going with this?

The problem is I may be up at 2-3 a.m. writing or mucking with the lab because I never entirely kicked those habits. Y’all mostly aren’t, and by posting “late” at “night” I’m largely self-defeating. Y’all won’t see it until the morning, and by then it’ll have been pushed back because it’s “old” relative to the morning tweet flood.

What I’d like to do is to start using a real CMS style workflow for authoring content. I write a post, schedule it for it for 9-10 a.m. the next day and forget about it.

So let’s go throw some more script(s) at this. I’m gonna want some automation for rendering the site, and I’ll need more for announcing changes to the site.

Autopublish

Let’s start with the autodeploys. I’ll want a wrapper script. It’d be nice to be able to include the gem dependencies of my blog setup in the blog itself - and run a gem install when they change.

Because I’m doing this with Ansible as usual, all this is gonna be parameterized on the precise blog name. It’d be nice after all to run jaunt’s blog (dead) or ox’s blog (not alive yet) off of the same infra.

role/git-jekyll-domain/templates/jekyll-build.j2

#!/bin/bash
# Usage
#  bash jekyll-build <blogname>

set -ex

[[ $(whoami) -eq "" ]]

echo "[autodeploy] starting build for $1"

# Set by git when executing post-receive, would stop git from
# detecting it's in a repo
unset GIT_DIR

# Go to the argument site to build
cd "$HOME/$1"

# Not technically race safe but close enough for now
if [ -e build.lock ]; then
    echo "Another build is in progress - soft aborting"
  exit 0
else
  touch build.lock
fi

before=$(git rev-parse HEAD)
git pull origin master && echo "[autodeploy] repo updated"
git checkout -f && echo "[autodeploy] reset complete"
after=$(git rev-parse HEAD)

# If the gemfile has changed, install changes before rendering
if git log --name-only $before..$after | grep "Gemfile"; then
    echo "[autodeploy] dependency changes detected, installing"
    gem install --file Gemfile
    echo "[autodeploy] gem update completed"
fi

echo "[autodeploy] attempting to render"
## FIXME: this is a garbage path hack
JEKYLL=$(find ~/.gem -type f -name jekyll | sort | head -n 1)

## FIXME: how to do an atomic-mv cutover here instead of killing the
## file tree in place?
rm -rf _site
"${JEKYLL}" build && echo "[autodeploy] done rendering!"

echo "[autodeploy] done!"

rm build.lock

Okay so that’s not bad - now we just need to lay down a couple other things. The git hook for instance. Git’s hooks are just shell scripts which get run after some event occurs. In this case I’m leveraging the post-receive hook which runs after objects have been pushed and refs have been updated. This means that the state I’ve pushed is fully in the repo, and the above build script will be able to pull it.

role/git-jekyll-domain/templates/post-receive.j2

#!/bin/bash

sudo -u {{ distribution_nginx_user }}\
    /srv/http/jekyll-build "{{ domain }}"

But I really don’t want to just grant my git user sudo, that’d be nuts. So let’s have a sudoers.d file that’ll allow this one command.

role/git-jekyll-domain/templates/10-jekyll-build.j2

# Grant the git user the right to the static site rebuild script as the http user
git ALL=({{ distribution_nginx_user}}:ALL) NOPASSWD: /srv/http/jekyll-build

Bolting all this together with an Ansible role doesn’t take too much more doing -

role/git-jekyll-domain/tasks/main.yml

# Expected parameters:
#   {{repo}} - the absolute path to the source repo
#   {{domains}} - a list of domains to serve
#   {{domain}} - (default {{domains[0]}} the name of the domain to serve, also the name of its template
#   {{ssl}} - whether this is a "normal" domain or an SSL enabled domain
#   {{cron}} - whether to run the build on a 5min cron.
---
- name: Install system packages
  package: name={{ item }} state=present
  with_items:
    - python-pygments
    - ruby
    - rubygems
    - git

- name: Install ruby-dev
  when: "ansible_distribution == 'Ubuntu'"
  package: name={{ item }} state=present
  with_items:
    - make
    - build-essential
    - ruby-dev

- name: Clone site
  git:
    repo: "{{ repo }}"
    version: master
    dest: "/srv/http/{{ domain }}"
  become: yes
  become_user: "{{ distribution_nginx_user }}"

- name: check if Gemfile exists
  stat: 
    path: "/srv/http/{{ domain }}/Gemfile"
  register: gemfile

- name: Install gems
  when: gemfile.stat.exists == True
  become_user: "{{ distribution_nginx_user }}"
  command: "sudo -u {{ distribution_nginx_user }} gem install -g /srv/http/{{ domain }}/Gemfile"

- name: Install post-receive
  template:
    src: post-receive
    dest: "{{ repo }}/hooks/post-receive"

- name: Set executable bit
  file:
    dest: "{{ repo }}/hooks/post-receive"
    mode: "u+x"
    owner: git
    group: git

- name: Install sudoers entry
  template:
    src: 10-git-http-jekyll.j2
    dest: /etc/sudoers.d/10-git-http-jekyll

- name: Install build script
  template:
    src: jekyll-build.j2
    dest: /srv/http/jekyll-build

- name: Set the executable bit
  file:
    dest: /srv/http/jekyll-build
    mode: "u+x"
    owner: "{{ distribution_nginx_user }}"

- name: Initial site build
  command: "/srv/http/jekyll-build {{ domain }}"
  become: yes
  become_user: "{{ distribution_nginx_user }}"

- name: Create cron entry
  when: cron is defined
  cron:
    name: "Rebuild {{ domain }}"
    job: "sudo -u {{ distribution_nginx_user }} /srv/http/jekyll-build {{ domain }}"
    # FIXME (arrdem 2019-09-17):
    #   FFS pull these as real parameters
    minute: "*/5"

- name: Install nginx domain
  include_role:
    name: nginx-domain
  vars:
    body: |
      root /srv/http/{{ domain }}/_site;
      index index.html
      charset utf-8;

      location ~* \.(css|js|gif|jpe?g|png)$ {
        expires 168h;
        add_header Pragma public;
        add_header Cache-Control "public, must-revalidate, proxy-revalidate";
      }

Alright awesome. Now with a simple playbook I can lay down all these files and get on with it.

play.yml

---
- hosts:
    - apartment_www
  vars_files:
    - "vars/{{ ansible_distribution }}.yml"
    - "vars/default.yml"
  roles:
    - role: git-jekyll-domain
      repo: /srv/git/arrdem/arrdem.com.git
      domains:
        - arrdem.com
        - arrdem.me
        - www.arrdem.com
        - www.arrdem.me
      ssl: true
      cron: true

And that’s all it takes for autodeploys!

We aren’t quite done yet however.

The other big feature that a CMS offers is automated announcement and of newly posted material. If I just let this cronjob run, posts will go up and unless you’re subscribed to the Atom feed you’ll never notice it. And come on this is 2019 nobody uses Atom anymore and I work for Twitter. I need tweets!

Autoannounce

So let’s build some announcement machinery!

One of Jekyll’s features is hooks. You can write Ruby code which will be executed at certain points in the lifecycle of your blog’s rendering. We’re gonna need two.

I don’t want to check the public and private keys for my Twitter account into git where y’all can see them. Sorry. So I’m gonna need a secret storage story, and then a way to post tweets so y’all see ‘em when the blog finally publishes.

Jekyll just loads whatever code it finds in the _plugins directory, so all we’re gonna have to do here is add a gem "twitter" line to the blog’s Gemfile and away we go.

Let’s do secrets first since it’s easy. This plugin attaches to the :after_init hook, and just tries to load up another file I’ve gitignored and chosen to manage by hand as if it were part of the site’s normal config.

_plugins/secrets.rb

# A way to load secrets from a pair to _config.yml

require 'yaml'

Jekyll::Hooks.register :site, :after_init do |site|
  if File.file?('_secret.yml') then
    site.config.update(YAML.safe_load(File.read('_secret.yml')))
  else
    STDOUT.print("Warning, no _secret.yml found! secrets not loaded.")
  end
end

Now, we need to do tweets. Tweets is tricky because well we’re gonna be using the filesystem to store state between builds. In fact, some of y’all saw me fuck this up and spam about 30 tweets in half a second before I got ratelimited.

me: [[cursing loudly in the apartment ]]

y'all: pic.twitter.com/QBmo7WymFt
— Reid; Yak Hunter (@arrdem) September 17, 2019

Shout out to those of y’all who found some comedy in my testing on main.

So what we’re gonna do is maintain a _tweets.yml file, which maps the URL of a post to the URL of a tweet. When we see a “new” post - one which isn’t in the mapping - we’ll tweet it out and create the requisite map entry.

_plugins/announce.rb

require 'twitter'
require 'yaml'

client = nil
post_to_tweets = {}

# Load the tweet DB and create the client
Jekyll::Hooks.register :site, :pre_render do |site|
  if File.file?('_tweets.yml') then
    post_to_tweets = YAML.safe_load(File.read('_tweets.yml'))
  else
    STDOUT.print("Warning: no tweets database was found!\n")
  end

  client = Twitter::REST::Client.new(site.config['twitter'])

  # So there's an escape hatch for development
  if not site.config['twitter'].fetch("enabled", false) then
    STDOUT.print("Warning: Twitter publishing has been disabled\n")
  end
end

# For each post, if there's a tweet in the DB or in the YAML prefix
# use that with the YAML prefix winning. Otherwise create one and
# update the tweet database either way.
Jekyll::Hooks.register :posts, :pre_render do |post|
  site = post.site
  if post.data["layout"] == "post" then
    full_post_url = site.config["url"] + post.url
    tweet = post.data.fetch("twitter", post_to_tweets.fetch(post.url, nil))

    if tweet == nil and site.config['twitter'].fetch("enabled", false) then
      # Post a new tweet and compute its URL
      STDOUT.print("Found an unpublished tweet - publishing...\n")

      # convert all my tags to hashtags
      tags = post.data["tags"].map { |str| "#" + str.downcase }.join(" ")
      # make the tweet text
      tweet_text = "New blog post! - " + post.data["title"] + " " + full_post_url + " " + tags
      # lob it out and grab the URL
      tweet = client.update(tweet_text).url.to_s
      STDOUT.print("Published as " + tweet + "\n")
    end

    # Write the tweet back so that it can be used in rendering
    if tweet != "skipped" then
      post.data["twitter"] = tweet
    end

    post_to_tweets[post.url] = tweet
  end
end

# Dump the tweet DB back
Jekyll::Hooks.register :site, :post_render do |site|
  File.open('_tweets.yml', 'w') { |file| file.write(post_to_tweets.to_yaml) }
end

This totally works once it gets to a steady state. The problem is initialization. Some of my more recent blog posts had the twitter: entry in their heading fontmatter, and I didn’t want to re-post those tweets or pretend like they didn’t exist. Telling the difference between a really old blog post and a new blog post would be impossible here without bringing the date into consideration, and chuck that.

Instead the bootstrapping process (which I messed up) was to check in TWO versions of this plugin. The first version (should have) had the .fetch("enabled", false) snipped replaced with .fetch("enabled", "skipped"). This will make Jekyll lay down a database of all the existing posts flagged so that they’ll be ignored in future. ‘course I didn’t do that and totally spammed my Twitter account, but I trust y’all to learn from my mistakes.

Then, swap that .fetch default value back to nil so that future new posts (like this one!) will be recognized as missing and automatically posted.

And that’s “all” it takes! To prove the point, this post - when it airs - will have been published using this precise machinery. Check the git log if you don’t believe me!

Homelab: Inventory

2019-07-14T08:00:00+00:00

Previously, I talked at some length about the slightly heinous yet industrial grade monitoring and PDU automation solution I deployed to keep my three so called modes - ethos, logos and pathos - from locking up for good by simply hard resetting them when remote monitoring detected an incident. That post (somewhat deliberately I admit) had some pretty gaping holes around configuration management for the restart script. The restart script is handwritten and hand-configured. It has no awareness of the Ansible inventory I introduced in my fist Ansible post - which captures much of the same information. Why the duplication?

The answer is simply that I think the question of how you manage inventory and configuration as it relates to inventory is a deeply interesting question.

Let’s do a quick refresher on Ansible’s notion of inventory first. In Ansible, there are hosts, and groups. Groups contain either other groups (as children) or hosts (also as children), and may have vars (key/value pairs). Hosts exist, and also have vars. When Ansible executes against a host, the host is “materialized” by merging all the vars set on the host, or on any group of which that host is a member and using that set of bindings.

By way of a quick demo -

---
# This file is ./demo-inventory
group_a_b:
  vars:
    a: b
   hosts:
    foo.demo:

group_c_d:
  vars:
    c: d
  hosts:
    foo.demo:

Here, we’re creating two groups, each of which apply a key/value pair to what happens to be a single host.

And if we run it, we’ll see that the vars are in fact merged -

$ ansible-inventory -i demo-inventory --list | jq .
{
  "_meta": {
    "hostvars": {
      "foo.demo": {
        "a": "b",
        "c": "d"
      }
    }
  },
  "all": {
    "children": [
      "group_a_b",
      "group_c_d",
      "ungrouped"
    ]
  },
  "group_a_b": {
    "hosts": [
      "foo.demo"
    ]
  },
  "group_c_d": {
    "hosts": [
      "foo.demo"
    ]
  }
}

Other data sources

Recall, that Ansible features both host_vars and group_vars as shall we say tack on sources of data. Respectively, these directories may contain YAML files named for hosts or files named for a groups, providing vars as an alternative to writing those vars out in the hosts (inventory) file.

vars_files

Another trick you can play is telling Ansible to bolt on yet more vars using vars_files.

---
# snipped from play.yml
- hosts:
    - ...
  vars_files:
    - "vars/defaults.yml"
    - - "vars/_.yml"
      - "vars/.yml"
  roles:
    - ...

vars_files as with so many Ansible features is under-documented. The parameter itself is a sequence of additional files to be loaded and from which to source additional host level vars.

The trick here is that nested lists of files are also supported - in which case the first file which exists will be loaded. Here, I’m using this trick to load either a distribution release specific vars file, or a distribution specific vars file. The distribution specific file is a fallback with respect to the more specific file, but the defaults are always applied.

This is a good trick, which provides one way to go bolt on your own sources of data.

Custom inventory

The real trick is that you can write your own inventory scripts. Hate YAML? Got your own data source? Want some other model? You can just build it yourself and bolt that onto Ansible! No need to go mucking around with Ansible’s opinions about stuff and things.

I can’t commend the Python API for defining inventory scripts whatsoever. I think it’s a tremendously undocumented, complicated, and generally more bother than it’s worth - but there exist some examples of using it.

On the other hand, the JSON API for interfacing with arbitrary external inventory sources is tremendously clear cut. Simply, it’s the same JSON format I’ve been using to show you what’s “really” going on - with the addition of vars mappings being supported on groups so you don’t have to materialize all the host vars into _meta yourself. Technically you don’t even have to do the _meta dance, Ansible will run your script a bunch of times with --host if you don’t.

All a script has to do is accept either --list and dump everything (same as I’ve been using ansible-inventory --list) or --host=<hostname> to just get (all of!) one host’s vars when all the groups are applied.

This gives us an out, if you want to define some other model (or use other data sources) and map it into the Ansible host/group model. Ansible itself includes a huge collection of “contrib” inventory scripts, for sourcing hosts and vars from any and every compute utility tool you can name - like DigitalOcean’s for instance. There’s even another whole collection of more official inventory scripts, like k8s.

The only other tool for injecting vars into the an Ansible play is to define custom facts.

Custom facts

When Ansible runs the first thing it does is run the setup module. The setup module executes any executables in /etc/ansible/facts.d/ expecting that they produce JSON output, and loads any non-executable files in that directory again as JSON. Each of the JSON blobs generated by or read from /etc/ansible/facts.d is keyed by the name of the file or script it came from, and that entire map from filenames to data is made available in Ansible as the ansible_local var.

To take an example, if ethos had the JSON blobs /etc/ansible/facts.d/{foo,bar}_facts, the ansible_local var would look something like (in YAML)

---
foo_facts:
  foo: bar

bar_facts:
  bar: baz

Facts are a way to pull data down from the hosts themselves - device IDs, uptime or other unique machine specific state (possibly even “facts”) unsuitable for vars maintained in inventory.

Modeling with inventory

Okay so we’ve got some tools. We have hosts, groups and vars which can easily be generated by some user defined software, and technically also facts which - aren’t second class but require much more forethought. My gut is that group and host vars as files is probably better treated as an accident of the implementation rather than an essential tool. Likewise the relative difficulty of deploying and gathering facts suggests that facts aren’t a fantastic modeling tool unless you have really specific needs.

This leaves the inventory structure, with its groups and inline vars potentially derived from other sources. That we can do a lot with.

So let’s think about my homelab for a bit. The thin black lines are network connectivity, the thick red lines are power, and the one green line is the battery’s USB read port.

One entire concern in inventory which is worth modeling potentially with groups is failure domains.

Failure domains

Simplifying somewhat, my lab has a single public internet uplink. That one uplink has a USG router and a 16 port switch inbetween it and pretty much everything else. Then all my various computers are hung either directly off the switch, off wireless APs hung off the switch or one daisy chained unmanaged switch out in the living room for the PS4.

So really from the perspective of my networking gear, I have an entire stack of single devices. A failure of either my uplink, or my router, or my switch would destroy a meaningful amount of my internal connectivity. So all of that has to be working.

A crosscutting concern is that I have a single power source - although home solar would be neat - and a single backup battery. That one backup battery has some devices hanging off of it, but the majority of the compute resources hang off the networked PDU I wrote about last time. So again, I have another daisy chain of devices on the power front. Sure the backup battery gives me a fairly surprising amount of reliability in the face of thunderstorms or me deciding to unplug the whole rack and move it, but if that battery goes everything goes.

In the compute-as-a-utility world, this would be described as being a single failure domain. The notion of a failure domain is simply that it’s a group of hardware which due to shared network infrastructure and power infrastructure will fail all at once.

There’s a pretty decent Azure blog post (2015) which presents a diagram of the datacenter’s network architecture.

In this diagram, TOR is an abbreviation for “Top of Rack (switch)”, for rack presumably being a standard frame usually having 48 “units” or 1-inch mounting slots. A common design is to put a small (1 “unit”) switch at the top of the rack, wire everything in the rack to that switch, and then connect the switch to whatever the broader datacenter network topology may be with only a single uplink from the switch.

You’ll note that in this diagram, TOR switches seem to be connected to multiple “spine” routers. Interestingly, Microsoft’s diagram actually shows a full mesh where every TOR is connected to every spine - where a spine is presumably some sort of intermediary router. This is an unusual design, but having multiple uplinks from a rack to shared routers with some sort of routing mesh above that adds resiliency against single shared router failures. Multiple paths to any given rack mean multiple failures are required to take the rack down.

Most compute-as-a-utility vendors go even farther than this, and offer many failure domains (sometimes called availability zones or simply zones). Some of the bigger vendors also offer groups of availability zones called regions across which shared load balancing and and instance scheduling are offered. Google for instance provides an exhaustive list of all the zones and regions in which Google is happy to rent you resources.

To finish providing a model - having hosts be members of racks, and racks members of rack groups, which in turn live under “shared networking” groups and within power groups and finally sites which are comprised of shared power and networking groups live within regions. The really nice thing is that if you want to extend this model, you just add another tree of groups.

Placing a host into a service cluster can be a declarative thing. For instance I’ve toyed with wiring up service groups ala service_<servicename> to generate DNS A-record round robins automatically so that all I have to do is run a playbook like

---
# Install the service
- hosts:
    - service_myservice
  roles:
    - role: myservice
# Update the DNS record(s) -
#   Inspect all groups for "service_" prefixed
#   groups when generating a zone, and assume
#   that every service_ group gets an A record
#   with the IPs of its members.
- hosts:
    - service_dns
  roles:
    - role: dns-zones

In a shared infrastructure environment, hardware is provisioned in batches or requests to customers, who presumably use it to run services. You could try to model a hardware allocation workflow by creating a group for the results of every request. This lets you keep provisioning metadata around at the unit of provisioning - some sort of batch request.

Heck - Ansible supports using directories as sources of inventory. So you could even put the provisioning history / state data in a different file or directory and lock that file down to some combination of automation and the blessed provisioning administrators.

I’m not sure how far this group model goes, but it’s interesting to consider how far you can stretch the consistency wins of preferring groups and managing set theoretic memberships to managing key-value storage.

A case study

As an exercise, let’s refactor my inventory to reflect this structure. I’m gonna deliberately avoid putting vars on hosts wherever I can, and instead attempt to put vars on groups so that hosts almost strictly inherit vars. I think this helps make state manageable in the long term, because it groups related sets of vars into one place and makes it harder to make local changes.

Note I’m gonna use the doc: key, which is ignored by Ansible’s inventory language to write notes into this inventory.

WARNING It should also be noted that assigning attributes to hosts via groups is an extension to the Ansible inventory system that’s specific to the YAML inventory notation. Sadly - I think it’s a great feature. See the source.

---
# My general pattern for naming groups is that
# groups are {groupname}_{value} and define
# var: {groupname}: {value} along with any
# other key/values. This makes it possible to
# go back from a random host to the group(s)
# of which it is a member.

region_na:
  doc: |
    This exists entirely to specify what
    assets I have in North America. Somewhat
    silly, but I'm trying to lay out a pattern.
  vars:
    region: na
  children:
    geo_apartment:

geo_apartment:
  doc: |
    This group exists to map the physical
    facility of my apartment to a collection
    of failure domains.
  vars:
    geo: apartment
  children:
    # I don't have any redundancy, one AZ
    az_apartment0:

uplink_comcast0:
  doc: |
    This group describes a single network
    uplink to a provider. Ideally a high
    reliability site would have redundant
    uplinks.
  vars:
    uplink: comcast0
  children:
    az_apartment0:

power_xcel0:
  doc: |
    This group describes a single power
    substation, and the hardware attached to
    it. A highly reliable site would ideally
    have multiple power supplies - backup
    batteries excluded - and in some extreme
    cases such as telecom systems may
    be connected to multiple regional power
    generators not just redundant substations.
  vars:
    power: xcel0
    substation: gunbarrel
  children:
    az_apartment0:

az_apartment0:
  doc: |
    This group packages "racks" or just chunks
    of my hardware deployment into a failure
    domain which is associated with a network
    uplink and with a power supply. You could
    extend this same model to talk about
    cooling for instance.
  vars:
    az: apartment0
  children:
    apartment_rack0:
    apartment_rack1:
    apartment_rack2:

pdu_ups850_0:
  doc: |
    This group contains devices which pull
    power directly from the UPS. In which
    sense it is a "power dist. unit"
  vars:
    pdu: ups850_0
  hosts:
    sucker.apartment.arrdem.com:

pdu_sucker:
  doc: |
    This group defines a collection of hosts
    which are all wired to the same PDU, the
    connection details for the PDU (or some of
    them) and the mapping of PDU port(s) to
    devices because well that's defined at the
    PDU level by physical connections.
  vars:
    pdu: sucker
    pdu_uri: sucker.apartment.arrdem.com:23
  hosts:
    logos.apartment.arrdem.com:
      pdu_socket: 2
    ethos.apartment.arrdem.com:
      pdu_socket: 3
    pathos.apartment.arrdem.com:
      pdu_socket: 4
    hieroglyph.apartment.arrdem.com:
      pdu_socket: ...

pdu_us16150w_0:
  doc: |
    This group contains devices which pull
    power over PoE from my PoE switch.
  vars:
    pdu: us16150w_0
  children:
    # Racks 2 and 3 are PoE'd RPis
    rack_apartment_2:
    rack_apartment_3:

hw_ryzen0:
  doc: |
    The hw groups just define groupings by
    hardware type. It so happens that I'm
    using racks as chunks of one hardware
    platform - typically three of a kind.

    It so happens that rack '1' is all Ryzens

    In other hardware deployments, this could
    be far less trivial with mixed host
    platforms per rack being common even.
  vars:
    hw: ryzen0
  children:
    rack_apartment_1:

hw_rpi3_bp:
  vars:
    hw: rpi3_bp
  children:
    rack_apartment_2:

hw_rpi3_bp:
  vars:
    hw: rpi3_bp
  children:
    rack_apartment_3:

# Racks will be named rack_{geo}_{rack}
rack_apartment_0:
  doc: |
    A bunch of random devices.
    Hand-IP'd and mostly not Ansible managed.
  vars:
    rack: 0
    cidr: 10.0.0.0/26
  hosts:
    sucker.apartment.arrdem.com:
      ansible_host: 10.0.0.16

rack_apartment_1:
  doc: |
    The hosts I initially built out.

    I'm gonna assign IP blocks per rack, so
    IP assignments are mapped at the rack
    level. If a host moves between racks it
    should be re-IP'd.
  vars:
    rack: 1
    cidr: 10.0.0.64/29
  hosts:
    # The modes
    logos.apartment.arrdem.com:
      ansible_host: 10.0.0.64
    ethos.apartment.arrdem.com:
      ansible_host: 10.0.0.65
    pathos.apartment.arrdem.com:
      ansible_host: 10.0.0.66

rack_apartment_2:
  doc: |
    The raspberry Pi B+ "rack"
  vars:
    rack: 2
    cidr: 10.0.0.72/29
  children:
    rikis-hopuuj.apartment.arrdem.com:
      ansible_host: 10.0.0.72
    fidut-vimib.apartment.arrdem.com:
      ansible_host: 10.0.0.73
    #kipov-rupuh.apartment.arrdem.com:
    #  ansible_host: 10.0.0.74

rack_apartment_3:
  doc: |
    The Raspberry Pi B "rack"
    Thanks @AndySayler!
    (anyone have a 3rd B?)
  vars:
    rack: 3
    cidr: 10.0.0.80/29
  children:
    ...

service_apartment_zookeeper:
  doc: |
    A little config for and membership of my
    ZK cluster. More on this later.
  hosts:
    logos.apartment.arrdem.com:
      zookeeper_id: 1
    ethos.apartment.arrdem.com:
      zookeeper_id: 2
    pathos.apartment.arrdem.com:
      zookeeper_id: 3
    rikis-hopuuj.apartment.arrdem.com:
      zookeeper_id: 4
    fidut-vimib.apartment.arrdem.com:
      zookeeper_id: 5

It’s a lot, and the power groups especially are pretty messy because power for me doesn’t break down cleanly to rack groups as it would in a “real” datacenter - my power is hand wired and a bit of a rat’s nest.

The really slick thing we can do here is leverage Ansible’s Patterns to make either regex based or set theoretic selections of hosts.

The pattern geo_apartment or hw_rpi3_bp would simply select all hosts in my apartment, or all the Pi3B+ units respectively. Where this gets interesting is that queries can be multipart unions, intersections and subtractions. For instance if I had multiple sites, geo_apartment:&hw_rpi3_bp would select all the hosts which are in my apartment and on the Pi3B+ chassis. Likewise if I had Pi3B+ units deployed in a couple racks across AZs, az_apartment0:&hw_rpi3_bp becomes a relevant query.

Patterns with negation like geo_apartment:&!service_apartment_zookeeper eg. all hosts in the group not providing zookeeper become relevant for some purposes. Taking down all of ZK by accident would be bad.

Limitations of inventory

Unfortunately the ansible-inventory tool doesn’t provide a way to test patterns. Patterns seem to be implemented in the playbook machinery - so you have to work them by hand or build a thing which can compute them. This is such an obvious oversight that I may yet run off and build such a thing, but that it isn’t in the box is pretty silly.

Out of the box as it were, there aren’t fantastic tools in the Ansible ecosystem for plugging other programs into Ansible’s inventory. The only obvious pattern is to use Ansible to push out updated inventory information all the time which doesn’t scale nicely even to the handful of hosts I have, and blows up your playbook runtimes. It’d be far nicer if there were a standard API for manipulating inventory, and a standard way to publish inventory data so clients on the network can fetch it.

There isn’t an obvious pattern for how to make radical changes like re-hostnaming or re-IPing devices both of which fly in the face of Ansible’s ideas of host identity. This is hard, because Ansible’s inventory construct isn’t aware of its own history. If a device changes IP, Ansible doesn’t know what the old IP was or that the new `` IP is aspirational. That state management has to be built outside Ansible somehow.

While putting hosts in groups ala service_apartment_zookeeper is a great way to accrete state - add a new service and all you have to deal with is adding that one new service - but it doesn’t give you a tool for “garbage collecting” state because again Ansible can’t tell what’s there. It just barely figures out what new work you want done and leaves everything else alone.

If you want to use Ansible in this way - to implement what’s really an infrastructure-as-code workflow, like other infrastructure-as-code solutions you need a way to retain the “current” state so you can clean up. Other infra as code systems like Terraform maintain a parallel “current state” file which can be used to explain the consequences of changes to inventory and resources, and to then take a patch-based approach to applying changes.

Most crucially, even with full infra-as-code, the history of that infra isn’t transparent to other sources. Infrastructure changes aren’t something you can pub-sub style listen to. Ideally, we should be able to drive live-configured systems (like monitoring!) off of controlled changes to static inventory - thus getting the best of both worlds.

I don’t have a solution to these problems - nor do there seem to be good off-the-shelf tools. Google has their internal machine database aka mdb of which little is said publicly because it’s an unsexy problem but this is a pretty essential set of problems that need to be solved by anyone who wants to run infrastructure. Even at the scale of my homelab.

Homelab: A PDU arrives

2019-07-07T04:37:00+00:00

In my first homelab post, I mentioned that I chose the AMD Ryzen 5 1600 processor to run my compute nodes. Unfortunately, the Ryzen series is vulnerable to random soft locks which doesn’t seem to have a workaround other than “don’t let them idle”, and I neglected to do my do-diligence when I purchase this hardware platform because I trusted @cemerick who owns a whole stack of nearly identical boxes.

I started noticing because, having cut over to my internal DNS after a couple days of leaving the lab alone suddenly my internet connection seemed really slow or didn’t work at all because all my DNS resolvers were non-responsive.

Everyone who’s ever worked in operations has horror stories of hosts with the longest uptime. Machines which ran for years without maintenance. Machines which were forgotten altogether. Ironically, I have the opposite problem. Because my Ryzen boxes lock up at random on a timescale of about every two days, I can’t even get to the point of running “reliable” computations or servers by leaning on the underlying host to be reliable.

There’s a really interesting meditation to be had here on how absurdly reliable the hardware we take for granted is. Even commodity hardware can reasonably be expected to stay on for years, adding and executing instructions correctly, without meaningful interruptions. Only on the scale of large datacenter deployments do cosmic ray strikes, hardware failures and other “rare” events become common. An average programmer takes for granted that the network will mostly work, and that 2 + 2 will always be 4 with no need to cross-check results. Because that’s how good hardware is. Gone are the days of the Intel floating point bug(s) and dealing with write errors on shitty disks.

To put this another way - computers are reliable enough that for many applications you don’t have to go wide. Scaling vertically (just buying a bigger computer) really works for a long time and likely buys you all the reliability non-real-time, non-streaming workloads actually require. Furthermore Google’s offering of live migration of client VMs means in the cloud you can achieve truly insane application uptimes. Or go buy the biggest POWER 9 box you can get your hands on.

In an environment where hardware failures are common (like mine!) you have the opposite problem. In the omnipresence of hardware failures, everything has to be designed to die randomly at any point and recovery has to become automatic. In short, all the problems which usually show up only “at scale” come calling, which I think is really interesting because it means you can’t make the huge mistake of being able to lean on an incredibly reliable piece of hardware.

A real problem facing software systems is that vertical scaling works. To a point. And when that line in the sand is crossed, the software architectures and patterns for achieving reliability are completely different. Either you wind up paying IBM or Oracle an absolutely unbelievable amount of money for a magical sufficiently reliable machine, or you have to likely redesign your entire system to operate in a different less reliable environment. Neither is a good or easy outcome. The only serious choice seems to be designing for distributed reliability by default, because the cut-over can be so painful.

Okay. So back to the lab. I’ve got machines which get wedged, and don’t respond to commands. What the hell am I gonna do. I sure don’t want to come home and push power buttons every day. What happens when I go on vacation? The lab’s just gonna die?!

Well the Industry Grade™ solution to this problem is to use the BMC controller - a microcontroller built onto server class motherboards which provides “baseboard management” such as power control over a separate software stack and sometimes network link - to remotely power off and power back on wedged boxes. It’s not pretty, but it sure does work.

As Rich Hickey and Joe Armstrong have both eluded, the most consistent problem in computing is not being able to reason about the state of the machine. Resetting that state to nothing and allowing it to recover into known states is the ultimate big hammer of problem solving.

Okay but motherboards with BMC controllers and IPMI are uh not features on the commodity motherboards I went with. So what’s plan B? Enter the networked Power Distribution Unit (PDU).

Your common-or-garden surge protector is an example of a PDU. It just provides distribution of power to a bunch of sockets.

A networked PDU (sometimes called a managed PDU) goes a step farther and provides per-socket software switching. Expensive models even provide per-socket power usage metering, allowing datacenter operators to implement per-host power chargeback. For my usage, I settled on an APC 7900 unit which I scored used on ebay at a meaningful discount decommissioned from someone’s datacenter.

Using a my PDU (named sucker because this is a sucky solution to a sucky problem), I can configure my three hosts’ BIOS to automatically power on after power is restored, wire my hosts power through the PDU and given appropriate automation power cycle the hosts when they get wedged. Implementing this is remarkably easy. The PDU can be configured to expose a management console over the telnet protocol, so all one has to do is know what ports on the PDU each host are plugged into.

Adding the PDU to my internal DNS and telnetting to it, we’re welcomed by a screen which enables exactly my use case -

All one has to do is punch in reboot 2\r in order to reboot logos. Implementing this reboot in Python is pretty forwards. Really you just need to use the telnetlib and on recent versions of Python you’re golden. Maybe something like this -

from telnetlib import Telnet

CONFIG = {
  # APC PDU credentials
  "pdu_username": "apc",
  "pdu_password": "[REDACTED] -c",

  # Hosts recover in about 40s,
  # But only stop responding to pings for about 6-8s.
  "debounce": 40,

  # Once a host is up, 5s of no ping is indicative.
  "threshold": 5,

  # (hostname: PDU port) pairs
  "hosts": {
    "logos": "2",
    "ethos": "3",
    "pathos": "4",
    # "ketos": "5",  # the final mode
  }
}


def do_reboot(port: str):
  """telnet to sucker, reset the port and log out."""

  def apc_login(conn):
    conn.read_until(b"User Name")
    conn.write(f"{CONFIG['pdu_username']}\r".encode("utf-8"))
    conn.read_until(b"Password")
    conn.write(f"{CONFIG['pdu_password']}\r".encode("utf-8"))

  def apc_command(conn, cmd):
    conn.read_until(b"APC>")
    conn.write(f"{cmd}\r".encode("utf-8"))

  # To ensure only one process logs into the PDU at once
  with Telnet('sucker', 23) as conn:
    apc_login(conn)
    apc_command(conn, f"reboot {port}")
    apc_command(conn, "quit")

Here, I’m using the telnetlib’s nicest feature - expect-like patterns. telnetlib supports waiting, reading from a connection, until it gets a sequence like a shell prompt back. This helps you write clients which interact with command protocols - like my APC shell - which dump a bunch of data then prompt and may not gracefully handle inputs before the prompt is sent.

We could clean this code up somewhat by introducing a “real” expect function -

def expect(conn, text):
  offset, match, data = conn.expect([bytes(text)], timeout=1)
  if offset is None:
    raise Exception("Unable to match pattern {} in conn {}".format(text, conn))
  else:
    return

I don’t need perfect - for now I just need done so I can stop manually remediating these softlocking boxes.

So what criteria do I want to use for my host(s) becoming nonresponsive? Well one criteria is not returning ping responses. It’s not an ideal criteria because ping isn’t actually a service I care about and moreover some hardware supports offloading ping responses to the network card so a host could return a ping response while being stuck.

All my boxes are DNS resolvers, which means they should accept a connection on port 53. All my boxes run SSH, so they should also accept a connection on port 22, and print an SSH banner.

So we can bolt some fragments together -

from subprocess import check_call, CalledProcessError, DEVNULL
from telnet import Telnet


def ping(hostname: str,
         count: int = 2,
         timeout: int = 1):
  """Send count packets to a hostname, with a timeout of timeout"""

  try:
    return check_call(["ping", "-q",
                       "-c", str(count),
                       "-W", str(timeout),
                       hostname],
                      stderr=DEVNULL,
                      stdout=DEVNULL) == 0
  except CalledProcessError:
    return False


def check_port(hostname: str,
               timeout: int = 1,
               port: int = 22,
               banner: bytes = b""):
  """Knock on the given port, expecting a banner (which may be b'')."""

  try:
    conn = Telnet(hostname, port)
    offset, match, data = conn.expect([banner], timeout=timeout)
    conn.close()
    return match is not None
  except ConnectionRefusedError:
    return False


def knock_ssh(hostname: str):
   return check_port(hostname, port=22, banner=b'SSH')


def knock_dns(hostname: str):
   return check_port(hostname, port=53, banner=b'')

Okay so now we’ve got the machinery for doing a reboot and for checking a host’s “health” in hand. We just need to wire it up into some threads.

import logging as log
from datetime import datetime, timedelta
from threading import Thread
from time import sleep


def zdec(i):
  """Decrement, witha  floor at zero."""
  return max(i - 1, 0)


def monitor(hostname: str, port: str):
  log.info("Monitoring {hostname}".format(hostname=hostname))

  threshold = CONFIG["threshold"]
  debounce = timedelta(seconds=CONFIG["debounce"])

  # Outer loop - never exits just restores state
  start = datetime.today()
  counter = 0
  while True:
    now = datetime.today()
    delta = now - start

    # Debounce - provide a pause inbetween interventions to allow the host to stabilize
    if delta < debounce:
      pass

    elif counter >= threshold:
      # Bounce the box, wait for it to become healthy again
      uptime = delta.total_seconds() - counter
      log.critical("{hostname} detected unhealthy for {counter}s after {uptime}s up, forcing reboot!".format(**locals()))
      do_reboot(port)
      start = datetime.today()
      counter = 0

    elif not ping(hostname) or not knock_ssh(hostname) or not knock_dns(hostname):
      # If the hostname is unhealthy, we increment its "bad" score
      log.warning("{hostname} detected unhealthy ({counter} of {threshold})".format(**locals()))
      counter += 1
    else:
      # Otherwise we zdec the score.
      counter = zdec(counter)

      # delta > debounce implied by if ordering
      if delta.total_seconds() % (60 * 5) // 1 == 0:
        log.info("{} healthy for {}s".format(hostname, delta.total_seconds()))

      sleep(5)


if __name__ == "__main__":
  for hostname, port in CONFIG["hosts"].items():
    t = Thread(target=monitor, args=(hostname, port))
    t.start()

We’ll make a thread for each host in the config, and that thread will sit in an infinite loop running health checks every 5s. If the host is unhealthy for 25s, we’ll use the PDU to force it to reboot. My hosts take about 40s to come back up when hard rebooted this way, so if we assume that detection is accurate I’m taking an outage of about 65s maybe every day or so. 65s over 24h is 99.92% uptime! Three nines! Webscale!

By simply running an instance of this script on each node, I can make just my three node cluster watchdog itself. That’d enable my cluster to detect and recover from a single or double fault. Which would be a huge improvement in my cluster’s reliability! My odds of having all three boxes lock up at once given recoveries of single and double host failures are really slim - about 0.00000004% if I’m doing my math right.

Of more concern is that there’s no coordination in this script! If two hosts are good and one host is bad the bad host will likely get power cycled twice. Worse, the race condition between two un-coordinated hosts trying to reset the same box at once could easily generate exceptions telnetting since my PDU only accepts one connection at a time.

Either I could use a distributed consensus system like Apache Zookeeper, or I could figure out how to only run one instance “vertically scaled” to enough nines like I slagged on before.

It just so happens I’ve had a Raspberry Pi B+ sitting around waiting for a rainy day. In fact I’ve got a whole stack of them at this point.

#homelab now with a bunch of raspberry pis from @AndySayler pic.twitter.com/LL2ulHT7Gt
— arrdem (@arrdem) July 4, 2019

Enter rikis-hopuuj.

The name is deliberately meaningless. In fact, it’s a proquint - a word in a constructed language designed to make 16 bit chunks enunciable. Two 16 bit segments joined on - to make the 32 bit value 12187675381 to be precise. I’m not sure proquints themselves are The Right Thing - but being able to generate random identifiers and make them somewhat more tractable by humans (unlike UUIDs which are relatively intractable and unenunciable) is an interesting concept.

Astute python programmers following along so far may have noted that I’m making extensive use of Python’s String.format method, and doing so taking locals() (the map of all local variables to their bindings) as keyword arguments. This is weird python at best. It’d be better to use Python’s f-strings which I must admit are one of the best new features besides type syntax in the 3.X series. Sadly f-strings are only supported in Python 3.6 and later, and my poor little raspberry pi runs Rasbian (Debian) which packages Python 3.4 so I have to make due.

However with a Power Over Ethernet (POE) dongle to power rikis-hopuuj off of my fancy switch and a simple systemd unit -

$ cat /lib/systemd/system/pdu-monitor.service
[Unit]
Description=Monitor the network, restarting boxes

[Service]
ExecStart=/usr/bin/python3 /root/monitor.py
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

I’m all set!

PDU demo pic.twitter.com/KvFL2d44rP
— arrdem (@arrdem) July 7, 2019

Next time I’ll look a bit at solving the “vertically scaled” monolithic monitoring problem, and starting to play with Zookeeper.

Homelab: Internal DNS

2019-07-06T02:43:00+00:00

Previously, I looked at using Ansible and Ansible’s inventory capabilities to begin managing services and configuration on my homelab.

A core defect in the setup I presented was that I hand-coded the mapping of hostnames to IP addresses in my Ansible inventory because well I didn’t have DNS set up yet.

But hang on a second. What is DNS and why do I care?

When you type in http://foo.com/bar to your browser, that’s a URL (Uniform Resource Locator) which is comprised of a couple segments. It has a scheme - in this case http which describes the protocol by which we’ll go and fetch the resource. It also has an authority part - in this case foo.com - a hostname to go and fetch the resource from. The authority part can have other details like a username and port as well. For instance arrdem@foo.com:443 would provide a username, hostname and port. A simple IPv4 or IPv6 address is also legal as a hostname. A URL may also have a path - in this case /bar - which says what to request from foo.com when you get there.

Making a request from an IP address and port is pretty easy - if you know how to speak the protocol. You just make a TCP connection to that (host, port) pair and away you go. But how do you find IP addresses? I don’t want to commit ethos (10.0.0.64), logos (10.0.0.65) and pathos (10.0.0.66) to memory or build out anything which really depends on those address assignments if I can avoid it.

Enter DNS - the traditional solution to this problem. DNS ([the] Domain Name System) was created to provide a protocol for mapping names memorable to humans (like ethos, logos and pathos!) to IP addresses which machines actually use. DNS is a host discovery system - its core purpose is to map a domain name to one or more IP addresses presumed to identify machines somewhere. It does not implement service discovery. Services (programs listening to ports on a machine) are identified by convention. For instance “the” program which speaks HTTP if any is listening on port 80, “the” program if any which speaks SSH is listening on port 22 and soforth. These conventions worked fine before the advent of modern shared infrastructure or “cloud” hosting and now pose some challenges I’ll talk about later.

So how does DNS work? DNS consists of a hierarchy of servers - known as resolvers - which speak the DNS query language. Each DNS client connects to a few (typically 3 or fewer) resolvers provided as IP addresses. For instance 1.1.1.1 is a DNS resolver made public by CloudFlare, and 8.8.8.8 is a DNS resolver made public by Google. When you make a request of the resolver, you do by requesting an address (called a domain) of the resolver. If the resolver has data it will serve a response, otherwise it may have to (potentially recursively!) inquire of other resolvers for the data you wanted.

What kind of record(s) live in DNS? The most basic record is a A record - just an IP address. We can search DNS for records using the dig tool, as such -

$ dig www.arrdem.com

; <<>> DiG 9.14.3 <<>> www.arrdem.com A
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 17422
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;www.arrdem.com.      IN  A

;; ANSWER SECTION:
www.arrdem.com.    300  IN  A  67.166.32.93

;; Query time: 66 msec
;; SERVER: 75.75.75.75#53(75.75.75.75)
;; WHEN: Fri Jul 05 17:32:05 PDT 2019
;; MSG SIZE  rcvd: 59

In this response we can see the ANSWER section, which says cryptically

www.arrdem.com. 300  IN A 67.166.32.93

The first element here - www.arrdem.com. is the full canonical name of the requested record. The second element - 300 - is the TTL of this record in seconds. This tells resolvers which have to recursively query to get this data how long they may cache it for. The third element - IN A denotes the record type. Finally we actually have the value - 67.166.32.93 being the current IP address for my homelab.

An interesting property of DNS is that most records need not be singular. That is, you could dig and get a couple IP addresses back.

Twitter for instance presents two public IPs.

$ dig twitter.com A

; <<>> DiG 9.14.3 <<>> twitter.com A
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 11939
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;twitter.com.      IN  A

;; ANSWER SECTION:
twitter.com.    563  IN  A  104.244.42.65
twitter.com.    563  IN  A  104.244.42.1

;; Query time: 32 msec
;; SERVER: 75.75.75.75#53(75.75.75.75)
;; WHEN: Fri Jul 05 17:34:08 PDT 2019
;; MSG SIZE  rcvd: 72

That is there is not one but two public addresses either of which could be used to access the service known by the domain name twitter.com if the other fails or is overloaded. So if you go and connect to http://twitter.com, you’ll be connecting to one of those two IP addresses. This can be used to build client-side load balancing to distribute requests randomly over many hosts as clients are expected to choose which host to connect to in “round robin” order. For instance a fleet of tens or more puppet servers all of which provide the same data could live behind a single A record “round robin”.

There’s a lot of really interesting stuff you can do with DNS, but for now lets get it up and running in the lab.

The obvious first step would be to reconfigure my router to push the IP addresses of my three nodes as DNS resolvers. Doing so before the resolver(s) are set up however would nuke my ability to talk to the outside world! (looking at you stackoverflow) so lets hold off on that.

Instead we’ll take advantage of the dig tool’s ability to target a specific resolver eg. dig <address> @<resolver> to test the resolvers I’m building out before we cut over to them.

Okay. Let’s do this.

BIND setup

There’s a number of DNS servers - but I’m gonna go with good old bind. Bind (aka named) uses a three part configuration. /etc/named.conf tells named what to do - for which the general pattern is include configurations for domains (called zones) out of /etc/named/data/.conf. While bind can do a lot of stuff, all I’m gonna use it for initially is to serve handwritten domain files (AKA zonefiles) out of /etc/named/master/.

Writing this Ansible role is pretty easy -

roles/dns-resolver/tasks/main.yml

---
- name: Install bind
  package:
    name: bind
    state: present
  notify:
    - named enable

- name: Create directories
  when: installed.changed
  file:
    path: "{{ item }}"
    state: directory
    owner: root
    group: root
  with_items:
    - /etc/named/data
    - /etc/named/master

- name: Deploy named.conf
  when: installed.changed
  template:
    src: named.conf.j2
    dest: /etc/named.conf

Of slightly more interest is the actual named config template I’m deploying -

roles/dns-resolver/templates/named.conf.j2

acl "subnet" {
  10.0.0.0/24;
  localhost;
  localnets;
};

options {
  directory "/var/named";
  pid-file "/run/named/named.pid";

  listen-on { any; };

  allow-recursion { subnet; localhost; };
  allow-query { subnet; localhost; };
  allow-query-cache { subnet; localhost; };

  forwarders {
{% for node in upstream_dns_resolvers %}
    {{ node }};
{% endfor %}
  };
};

zone "localhost" IN {
  type master;
  file "localhost.zone";
};

This configuration defines an Access Control List (ACL) for my local subnet. It then allows only hosts in the subnet - or the local host - to make queries of this server. We also set up forwarders - hosts which each bind instance will query if the bind instance doesn’t have master data. Elsewhere in Ansible variables, I’m defining

# Everywhere we use the same upstream DNS resolvers.
# Local DNS resolvers are configured per-geo as 
dns_resolvers_upstream:
  - 1.1.1.1
  - 8.8.8.8
  - 8.8.4.4

Let’s create a new Ansible inventory group for the sake of hygiene which will contain our resolvers.

hosts

---
apartment_modes:
  hosts:
    ethos.apartment.arrdem.com:
      vars:
        ansible_host: 10.0.0.64

    logos.apartment.arrdem.com:
      vars:
        ansible_host: 10.0.0.65

    pathos.apartment.arrdem.com:
      vars:
        ansible_host: 10.0.0.66

apartment_resolvers:
  children:
    apartment_modes:

With just this configuration, we can run it against my modes using a really simple playbook

play.yml

---
- hosts:
    - apartment_resolvers
  roles:
    - role: dns-resolver

And run that -

$ ansible-playbook -i hosts play.yml

PLAY [apartment_resolvers] **************************************************************************************

TASK [Gathering Facts] ******************************************************************************************
ok: [ethos.apartment.arrdem.com]
ok: [pathos.apartment.arrdem.com]
ok: [logos.apartment.arrdem.com]

TASK [dns-resolver : Install bind] ******************************************************************************
ok: [ethos.apartment.arrdem.com]
ok: [pathos.apartment.arrdem.com]
ok: [logos.apartment.arrdem.com]

TASK [dns-resolver : Create directories] ************************************************************************
ok: [ethos.apartment.arrdem.com] => (item=/etc/named/data)
ok: [pathos.apartment.arrdem.com] => (item=/etc/named/data)
ok: [logos.apartment.arrdem.com] => (item=/etc/named/data)
ok: [pathos.apartment.arrdem.com] => (item=/etc/named/master)
ok: [logos.apartment.arrdem.com] => (item=/etc/named/master)
ok: [ethos.apartment.arrdem.com] => (item=/etc/named/master)

TASK [dns-resolver : Deploy named.service] **********************************************************************
ok: [ethos.apartment.arrdem.com]
ok: [logos.apartment.arrdem.com]
ok: [pathos.apartment.arrdem.com]

TASK [dns-resolver : Deploy named.conf] *************************************************************************
ok: [logos.apartment.arrdem.com]
ok: [ethos.apartment.arrdem.com]
ok: [pathos.apartment.arrdem.com]

PLAY RECAP ******************************************************************************************************
ethos.apartment.arrdem.com : ok=5    changed=0    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0
logos.apartment.arrdem.com : ok=5    changed=0    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0
pathos.apartment.arrdem.com : ok=5    changed=0    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0

Cool! So now we should be able to run some test DNS queries against these servers. Most important is my ability to do recursive queries, so lets check twitter.com first.

$ dig twitter.com @10.0.0.64

; <<>> DiG 9.14.3 <<>> twitter.com @10.0.0.64
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 27850
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: 0f82e839818e3b9cecebcec75d200ee12a4d47ef5312925b (good)
;; QUESTION SECTION:
;twitter.com.      IN  A

;; ANSWER SECTION:
twitter.com.    1022  IN  A  104.244.42.193
twitter.com.    1022  IN  A  104.244.42.129

;; Query time: 5 msec
;; SERVER: 10.0.0.64#53(10.0.0.64)
;; WHEN: Fri Jul 05 20:00:45 PDT 2019
;; MSG SIZE  rcvd: 100

Heck yeah. Recursive lookups are working.

A first zone

Now let’s do what we’re here to do - creating the apartment.arrdem.com zone. To keep things simple, I’m gonna handwrite my first zonefile.

roles/dns-zone/templates/apartment.arrdem.com.j2

$ORIGIN apartment.arrdem.com.
$TTL 7200
apartment.arrdem.com. IN SOA ns.apartment.arrdem.com. mail.apartment.arrdem.com. (
    2019070442
    43200
    180
    1209600
    10800
)
;;;  NS  section
@ NS ns.apartment.arrdem.com.
ns IN A 10.0.0.65
ns IN A 10.0.0.66
ns IN A 10.0.0.64

;;; HOSTS
ethos  IN A 10.0.0.65
logos  IN A 10.0.0.66
pathos IN A 10.0.0.64

The ns record is a convention for all the nameservers (resolves) in the domain. And I’ve got an A record for each of my currently three machines.

We’ll also need a small template to configure named for each zone -

roles/dns-zone/templates/zone-data.j2

zone "{{ item }}" {
  type master;
  file "/etc/named/master/{{ item }}";
  allow-transfer {none;};
  allow-update {none;};
};

This config just tells named to prohibit dynamic updates or transfers of the domain. We’ve already set global ACLs for querying. As a template, it presumes we’re rendering it from inside a loop over zone names.

All it takes to get this deployed is a pretty simple role -

roles/dns-zone/tasks/main.yml

---
- name: Deploy zonefiles
  with_items: "{{ zones }}"
  template:
    src: "{{ item }}.zone"
    dest: "/etc/named/master/{{ item }}"
  notify:
    - named reload

- name: Deploy zone data
  with_items: "{{ zones }}"
  template:
    src: zone-data.j2
    dest: "/etc/named/data/{{ item }}.conf"

- name: Add zone config
  with_items: "{{ zones }}"
  lineinfile:
    path: /etc/named.conf
    state: present
    line: "include \"/etc/named/data/{{ item }}.conf\";"

That is, we’ll apply this role with a list of zones as the variable zones, for each one rendering a template to produce the zonefile, rendering our config template for each zone and using the lineinfile module to monkeypatch our main /etc/named.conf to make named include the new zone’s config.

Patching our playbook a tiny bit -

play.yml

---
- hosts:
    - apartment_resolvers
  vars_files:
    - "vars/.yml"
  roles:
    - role: dns-resolver
    - role: dns-zone
      zones:
        - apartment.arrdem.com

And running it -

$ ansible-playbook -i hosts play.yml

PLAY [apartment_resolvers] **************************************************************************************

TASK [Gathering Facts] ******************************************************************************************
ok: [logos.apartment.arrdem.com]
ok: [pathos.apartment.arrdem.com]
ok: [ethos.apartment.arrdem.com]

TASK [dns-resolver : Install bind] ******************************************************************************
ok: [pathos.apartment.arrdem.com]
ok: [logos.apartment.arrdem.com]
ok: [ethos.apartment.arrdem.com]

TASK [dns-resolver : Create directories] ************************************************************************
ok: [pathos.apartment.arrdem.com] => (item=/etc/named/data)
ok: [ethos.apartment.arrdem.com] => (item=/etc/named/data)
ok: [logos.apartment.arrdem.com] => (item=/etc/named/data)
ok: [pathos.apartment.arrdem.com] => (item=/etc/named/master)
ok: [ethos.apartment.arrdem.com] => (item=/etc/named/master)
ok: [logos.apartment.arrdem.com] => (item=/etc/named/master)

TASK [dns-resolver : Deploy named.service] **********************************************************************
ok: [ethos.apartment.arrdem.com]
ok: [logos.apartment.arrdem.com]
ok: [pathos.apartment.arrdem.com]

TASK [dns-resolver : Deploy named.conf] *************************************************************************
ok: [logos.apartment.arrdem.com]
ok: [pathos.apartment.arrdem.com]
ok: [ethos.apartment.arrdem.com]

TASK [dns-zone : Deploy zonefiles] ******************************************************************************
changed: [pathos.apartment.arrdem.com] => (item=apartment.arrdem.com)
changed: [logos.apartment.arrdem.com] => (item=apartment.arrdem.com)
changed: [ethos.apartment.arrdem.com] => (item=apartment.arrdem.com)

TASK [dns-zone : Deploy zone data] ******************************************************************************
ok: [ethos.apartment.arrdem.com] => (item=apartment.arrdem.com)
ok: [logos.apartment.arrdem.com] => (item=apartment.arrdem.com)
ok: [pathos.apartment.arrdem.com] => (item=apartment.arrdem.com)

TASK [dns-zone : Add zone config] *******************************************************************************
changed: [ethos.apartment.arrdem.com] => (item=apartment.arrdem.com)
changed: [pathos.apartment.arrdem.com] => (item=apartment.arrdem.com)
changed: [logos.apartment.arrdem.com] => (item=apartment.arrdem.com)

RUNNING HANDLER [dns-zone : named reload] ***********************************************************************
changed: [pathos.apartment.arrdem.com]
changed: [ethos.apartment.arrdem.com]
changed: [logos.apartment.arrdem.com]

PLAY RECAP ******************************************************************************************************
ethos.apartment.arrdem.com : ok=9    changed=3    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0
logos.apartment.arrdem.com : ok=9    changed=3    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0
pathos.apartment.arrdem.com : ok=9    changed=3    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0

we should be able to dig ethos, logos and pathos out of DNS!

$ for h in ethos logos pathos; do dig +short ${h}.apartment.arrdem.com @10.0.0.64; done
0.0.65
0.0.66
0.0.64

Heck yeah.

Now if I go into my router, tell it to use my three nodes as DNS resolvers and reconnect my device so that it get a fresh resolver config, I’ll see my resolvers configured in /etc/resolv.conf

# Generated by resolvconf
domain apartment.arrdem.com
search apartment.arrdem.com arrdem.com
nameserver 10.0.0.64
nameserver 10.0.0.65
nameserver 10.0.0.66

Now, I can ssh using DNS names not IP addresses!

$ ssh arrdem@pathos echo '$(hostname -f)] Hello, world!'
pathos.apartment.arrdem.com] Hello, world!

Metaprogramming zones

While the above zonefile for apartment.arrdem.com strictly works - it’s also one more thing to update by hand whenever I bring up a new node or service. I’m gonna be spending a lot of quality time working on the service discovery problem - but let’s start with this. Ansible still has (as ansible_host) the IP address for every device I configure. So at the very least, one could write this zone -

roles/dns-zone/templates/apartment.arrdem.com.j2

$ORIGIN apartment.arrdem.com.
$TTL 7200
apartment.arrdem.com. IN SOA ns.apartment.arrdem.com. mail.apartment.arrdem.com. (
    {{ansible_date_time.year}}{{ansible_date_time.month}}{{ansible_date_time.day}}42
    43200
    180
    1209600
    10800
)
;;;  NS  section
@ NS ns.apartment.arrdem.com.
{% for node in groups[geo + '_resolvers'] %}
ns IN A {{ hostvars[node]['ansible_host'] }}
{% endfor %}

;;; HOSTS
{% for node in groups['geo_apartment'] %}
{{ node | shortname | format("{0: <16}") }} IN A {{ hostvars[node]['ansible_host'] }}
{% endfor %}

This template will generate a SOA version by concatenating the date to day precision, along with a counter I bump by hand. Leveraging the fact that there’s an apartment_resolvers group in Ansible’s inventory, we can introspect it if there’s a geo variable set. We can also play the same game to get all the hosts in the geo_apartment group!

So if I tweak my inventory a tiny bit -

hosts

---
apartment_modes:
  hosts:
    ethos.apartment.arrdem.com:
      vars:
        ansible_host: 10.0.0.64

    logos.apartment.arrdem.com:
      vars:
        ansible_host: 10.0.0.65

    pathos.apartment.arrdem.com:
      vars:
        ansible_host: 10.0.0.66

apartment_resolvers:
  children:
    apartment_modes:

geo_apartment:
  vars:
    geo: apartment
  children:
    apartment_modes:

Now if I want to add a half-dozen raspberry pis all of a sudden, all I have to do is add them to my Ansible inventory and they’ll automatically be added to DNS! To really see that this works, check out ansible-inventory -i hosts --list with this hosts file.