Shelving; Building a Datalog for fun! and profit?

This post is my speaker notes from my May 3 SF Clojure talk (video) on building Shelving, a toy Datalog implementation.


Databases?

  • 1960s; the first data stores. These early systems were very tightly coupled to the physical representation of data on tape, making them difficult to use, develop, query and evolve.
    • CODASYL, a set of COBOL patterns for building data stores; essentially using doubly linked lists
    • IBM's IMS which had a notion of hierarchical (nested) records and transactions
  • 1969; E. F. Codd presents the "relational data model"

Relational data

Consider a bunch of data

[{:type ::developer
  :name {:given "Reid"
         :family "McKenzie"
         :middle "douglas"
         :signature "Reid Douglas McKenzie"}
  :nicknames ["arrdem" "wayde"]}
 {:type ::developer
  :name {:given "Edsger"
         :family "Djikstra"}
  :nicknames ["ewd"]}]

In a traditional (pre-relational) data model, you could imagine laying out a C-style struct in memory, where the name structure is mashed into the developer structure at known byte offsets from the start of the record. Or perhaps the developer structure references a name by its tape offset and has a length tagged array of nicknames trailing behind it.

The core insight of the relational data model is that we can define "joins" between data structures. But we need to take a couple steps here first.

Remember that maps are sequences of keys and values. So to take one of the examples above,

{:type ::developer
 :name {:given "Edsger"
        :family "Djikstra"}
 :nicknames ["ewd"]}

;; <=> under maps are k/v sequences

[[:type ::developer]
 [:name [[:given "Edsger"]
         [:family "Djikstra"]]]
 [:nicknames ["ewd"]]]

;; <=> under kv -> relational tuple decomp.

[[_0 :type ::developer]
 [_0 :name _1]
 [_0 :nickname "ewd"]
 [_1 :given "Edsger"]
 [_1 :family "Djikstra"]]

We can also project maps to tagged tuples and back if we have some agreement on the order of the fields.

{:type ::demo1
 :foo 1
 :bar 2}

;; <=>

[::demo1 1 2] ;; under {0 :type 1 :foo 2 :bar}

Finally, having projected maps (records) to tuples, we can display many tuples as a table where columns are tuple entries and rows are whole tuples. I mention this mostly for completeness, since rows and columns are the common terms of use.

foo bar
1 2
3 4

Okay so we've got some data isomorphisms. What of it?

Well the relational algebra is defined in terms of ordered, untagged tuples.

Traditionally, data stores didn't include field identifiers in their storage representation - an obvious space optimization.

That's it. That's the relational data model - projecting flat structures to relatable tuple units.

Operating with Tuples

The relational algebra defines a handful of operations on tuples - or, to be more precise, on sets of tuples. There are the obvious set theoretic operators - union, intersection and difference - and there are three more.

cartesian product

Let R, S be tuple sets. Then ∀r∈R, ∀s∈S: r + s ∈ R×S (where + is tuple concatenation).

Ex. {(1,) (2,)} × {(3,) (4,)} => {(1, 3) (1, 4) (2, 3) (2, 4)}
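
In Clojure, with tuples as vectors, this is a one-liner. The name cartesian-product is just illustrative, not anything from Shelving:

(defn cartesian-product [R S]
  (set (for [r R, s S] (into r s))))

(cartesian-product #{[1] [2]} #{[3] [4]})
;; => #{[1 3] [1 4] [2 3] [2 4]}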

projection (select keys)

Projection is better known as select-keys. It's an operator for selecting some subset of fields (columns) from every tuple in a tuple space. For instance if we have R defined as

A B C
a b c
d a f
c b d

π₍A,B₎(R) would be the space of tuples from R excluding the C column -

A B
a b
d a
c b
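
If we represent tuples as maps keyed by column name, projection really is just select-keys mapped over the relation. A quick sketch - the names project and R are mine, for illustration only:

(defn project [R ks]
  (set (map #(select-keys % ks) R)))

(def R #{{:A "a" :B "b" :C "c"}
         {:A "d" :B "a" :C "f"}
         {:A "c" :B "b" :C "d"}})

(project R [:A :B])
;; => #{{:A "a" :B "b"} {:A "d" :B "a"} {:A "c" :B "b"}}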

selection

Where projection selects elements from tuples, selection selects tuples from sets of tuples. I dislike the naming here, but I'm going with the original.

To recycle the example R from above,

A B C
a b c
d a f
c b d

σ₍B=b₎(R) - select where B=b over R would be

A B C
a b c
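
In the same map-per-tuple representation, selection is just filter. Another illustrative sketch:

;; the same R as in the projection sketch above
(def R #{{:A "a" :B "b" :C "c"}
         {:A "d" :B "a" :C "f"}
         {:A "c" :B "b" :C "d"}})

(defn select [pred R]
  (set (filter pred R)))

(select #(= (:B %) "b") R)
;; => #{{:A "a" :B "b" :C "c"} {:A "c" :B "b" :C "d"}}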

Joins

Finally given the above operators, we can define the most famous one(s), join and semijoin.

join (R⋈S)

The (natural) join of two tuple sets is the subset of the set RxS where any fields COMMON to both r∈R and s∈S are "equal".

Consider some tables, R

A B C
a b c
d e f

and S,

A D E
a 1 3
d 2 3

We then have R⋈S to be

A B C D E
a b c 1 3
d e f 2 3
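
Sticking with maps as tuples, a naive natural join considers every pair of tuples and keeps the merged pair whenever the two agree on all shared columns. clojure.set/join already does this for keyword-keyed maps; the hand-rolled sketch below (natural-join is my name, not Shelving's) just shows the mechanics:

(require '[clojure.set :as set])

(def R #{{:A "a" :B "b" :C "c"}
         {:A "d" :B "e" :C "f"}})

(def S #{{:A "a" :D 1 :E 3}
         {:A "d" :D 2 :E 3}})

(defn natural-join [R S]
  (set (for [r R, s S
             :let [shared (filter (set (keys r)) (keys s))]
             :when (= (select-keys r shared) (select-keys s shared))]
         (merge r s))))

(natural-join R S)  ;; equivalently (set/join R S)
;; => #{{:A "a" :B "b" :C "c" :D 1 :E 3}
;;      {:A "d" :B "e" :C "f" :D 2 :E 3}}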

semijoin

This is a slightly restricted form of join - you can think of it as a join on some particular column. If R and S have several overlapping columns, the (natural) join operation joins by all of them - but we may want to have several relations between two tables, and consequently to leave open the possibility of several different joins.

In general when talking about joins for the rest of this presentation I'll be talking about natural joins over tables designed with only one overlapping field, so the natural join and the semijoin collapse.

Enter Datalog

Codd's relational algebra, as we've gone through, is a formulation of how to view data and data storage in terms of timeless, placeless algebraic operations. Like the lambda calculus - or "Maxwell's equations of software," as Kay has described the original Lisp formulation - it provides a convenient, generic substrate for building up operational abstractions. It's the basis for entire families of query systems which have come since.

Which is precisely what makes Datalog interesting! Datalog is almost a direct implementation of the relational calculus, along with some insights from logic programming. Unfortunately, this also means that it's difficult to give a precise definition of what Datalog is. Like Lisp, it's simple enough that there are decades worth of implementations, re-implementations, experimental features and papers.

Traditionally, Datalog and Prolog share a fair bit of notation so we'll start there.

In traditional Datalog, as in Prolog, "facts" are declared with a notation like this. This particular code is in Souffle, a Datalog dialect which happened to have an Emacs mode. This is the example I'll be trying to focus on going forwards.

State("Alaska")
State("Arizona")
State("Arkansas")

City("Juneau", "Alaska")
City("Phoenix", "Arizona")
City("Little Rock", "Arkansas")

Population("Juneau", 2018, 32756)
Population("Pheonix", 2018, 1.615e6)
Population("Little Rock", 2018, 198541)

Capital("Juneau")
Capital("Phoenix")
Capital("Little Rock")

Each one of these lines defines a tuple in the datalog "database". The notation is recognizable from Prolog, and is mostly agreed upon.

Datalog also has rules, also recognizable from logic programming. Rules describe sets of tuples in terms of either other rules or sets of tuples. For instance

CapitalOf(?city, ?state) :- State(?state), City(?city, ?state), Capital(?city).

This is a rule which defines the CapitalOf relation in terms of the State, City and Capital tuple sets. The CapitalOf rule can itself be directly evaluated to produce a set of "solutions" as we'd expect.

?city and ?state are logic variables, the ?-prefix convention being taken from Datomic.

That's really all there is to "common" datalog. Rules with set intersection/join semantics.
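
To make the rule's join semantics concrete, here's a sketch of evaluating CapitalOf over plain in-memory relations with clojure.set/join. The relation names and map keys below are my own encoding of the facts above, not Shelving or Souffle machinery:

(require '[clojure.set :as set])

(def State   #{{:state "Alaska"} {:state "Arizona"} {:state "Arkansas"}})

(def City    #{{:city "Juneau"      :state "Alaska"}
               {:city "Phoenix"     :state "Arizona"}
               {:city "Little Rock" :state "Arkansas"}})

(def Capital #{{:city "Juneau"} {:city "Phoenix"} {:city "Little Rock"}})

;; CapitalOf(?city, ?state) :- State(?state), City(?city, ?state), Capital(?city).
(def CapitalOf
  (-> State
      (set/join City)      ;; joins on the shared :state column
      (set/join Capital))) ;; joins on the shared :city column
;; => #{{:state "Alaska"   :city "Juneau"}
;;      {:state "Arizona"  :city "Phoenix"}
;;      {:state "Arkansas" :city "Little Rock"}}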

Extensions

Because Datalog is so minimal (which makes it attractive to implement) it's not particularly useful. Like Scheme, it can be a bit of a hair shirt. Most Datalog implementations have several extensions to the fundamental tuple and rule system.

Recursive rules!

Support for recursive rules is one very interesting extension. Given recursive rules, we could use a recursive Datalog to model network connectivity graphs (1)

Reachable(?s, ?d) :- Link(?s, ?d).
Reachable(?s, ?d) :- Link(?s, ?z), Reachable(?z, ?d).

This rule defines reachability in terms of either there existing a link between two points in a graph, or there existing a link between the source point and some intermediate Z which is recursively reachable to the destination point.

The trouble is that implementing recursive rules efficiently is difficult although possible. Lots of fun research material here!
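
The textbook way to evaluate such a rule is fixed-point iteration: start with the base facts and keep applying the rules until no new tuples appear. A naive sketch follows - Link here is a toy relation of my own, and a real engine would use semi-naive evaluation rather than recomputing everything each pass:

(require '[clojure.set :as set])

(def Link #{{:s :a :d :b} {:s :b :d :c} {:s :c :d :d}})

(defn step [reachable]
  (set/union
   reachable
   ;; Reachable(?s, ?d) :- Link(?s, ?z), Reachable(?z, ?d).
   (set (for [{s :s z :d}  Link
              {z' :s d :d} reachable
              :when (= z z')]
          {:s s :d d}))))

;; Reachable(?s, ?d) :- Link(?s, ?d) seeds the iteration with Link itself.
(def Reachable
  (->> (iterate step Link)
       (partition 2 1)
       (drop-while (fn [[prev curr]] (not= prev curr)))
       ffirst))
;; => Link plus {:s :a :d :c}, {:s :a :d :d} and {:s :b :d :d}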

Negation!

You'll notice that basic Datalog doesn't support negation of any kind, unless "positively" stated in the form of some kind of "not" rule.

TwoHopLink(?s, ?d) :- Link(?s, ?z), Link(?z, ?d), ! Link(?s, ?d).

It's quite common for databases to make the closed world assumption - that is, to assume that all relevant data exists within the database. Negation more or less demands it: if your tuple database is merely a subset of the tuples in the world, all it takes is one counter-example - a negated tuple suddenly becoming visible - to invalidate your query response.
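
With the same in-memory encoding as before, negation under the closed world assumption just becomes a set difference against the tuples we do have. Again a toy sketch, not Shelving code:

(require '[clojure.set :as set])

(def Link #{{:s :a :d :b} {:s :b :d :c}})

;; TwoHopLink(?s, ?d) :- Link(?s, ?z), Link(?z, ?d), ! Link(?s, ?d).
(def TwoHopLink
  (set/difference
   (set (for [{s :s z :d}  Link
              {z' :s d :d} Link
              :when (= z z')]
          {:s s :d d}))
   Link))
;; => #{{:s :a :d :c}}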

Incremental queries / differentiability!

Datalog is set-oriented! It doesn't have a concept of deletion, or any aggregation operators such as ordering which require realizing an entire result set. This means that it's possible to "differentiate" a Datalog query and evaluate it over a stream of incoming tuples, because no possible new tuple (without negation at least) will invalidate the previous result(s).

This creates the possibility of using Datalog to do things like describe application views over incremental update streams.

Eventual consistency / distributed data storage!

Sets form a monoid under union (merge) - no information can ever be lost. This creates the possibility of building distributed data storage and query answering systems which are naturally consistent and don't have the locking / transaction ordering problems of traditional place-oriented data stores.

The Yak

Okay. So I went and built a Datalog.

Why? Because I wanted to store documentation, and other data.

95 Theses

Who's ready for my resident malcontent bit?

Grimoire

Grimoire has a custom backing data store - lib-grimoire - which provides a pretty good model for talking about Clojure and ClojureScript's code structure and documentation.

https://github.com/clojure-grimoire/lib-grimoire#things

lib-grimoire was originally designed to abstract over concrete storage implementations, making it possible to build tools which generate or consume Grimoire data stores. That purpose it has served admirably for me. Unfortunately, looking at my experiences onboarding contributors, it's clearly been a stumbling block, and the current Grimoire codebase doesn't respect the storage layer abstraction; there are lots of places where Grimoire makes assumptions about how the backing store is structured because I've only ever had one.

Grenada

https://github.com/clj-grenada/grenada-spec

In 2015 I helped mentor Richard Moehn on his Grenada project. The idea was to take a broad view of the Clojure ecosystem and try to develop a "documentation as data" convention which could be used to package documentation, examples and other content separately from source code - and particularly to enable third-party documenters like myself to create packages for artifacts we don't control (core, the contrib libraries). The data format Richard came up with never caught on, I think because the scope of the project was just the data format, not a suite of tools to consume it.

What was interesting about Grenada is that it tried to talk about schemas, and provide a general framework for talking about the annotations provided in a single piece of metadata rather than relying on a hard-coded schema the way Grimoire did.

cljdoc

https://github.com/martinklepsch/cljdoc

In talking to Martin about cljdoc and some other next generation tools, the concept of docs as data has resurfaced. Core's documentation remains utterly atrocious, and is a consistent gripe in the community survey, year over year.

Documentation for core sees a higher hit rate than documentation for any other single library, so documenting core and some parts of contrib is a good way to get traction and add value for a new tool or suite thereof.

Prior art

You can bolt persistence à la carte onto most of the above with Transit, or just use EDN, but then your serialization isn't incremental at all.

Building things is fun!

Design goals

  • Must lend itself to some sort of "merge" of many stores
    • Point reads
    • Keyspace scans
  • Must have a self-descriptive schema which is sane under merges / overlays
  • Must be built atop a meaningful storage abstraction
  • Design for embedding inside applications first, no server

Building a Datalog

Storage models!

Okay, let's settle on an example that we can get right and refine some.

Take a step back - Datalog is really all about sets, and relating a set of sets of tuples to itself. What's the simplest possible implementation of a set that can work? An append only write log!

[[:state "Alaska"]
 [:state "Arizona"]
 [:state "Arkansas"]
 [:city "Juneau" "Alaska"]
 [:city "Pheonix" "Arizona"]
 ...]

Scans are easy - you just iterate the entire thing.

Writes are easy - you just append to one end of the entire thing.

Upserts don't exist, because we have set semantics: either you insert a straight duplicate, which doesn't violate set semantics, or you add a new element.

Reads are a bit of a mess, because you have to do a whole scan, but that's tolerable. Correct is more important than performant for a first pass!
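
Here's roughly what that naive store looks like as code - a vector of tuples, dumb scans and all. This is a sketch of the idea, not Shelving's actual write log:

(defn put-tuple [log tuple]
  (if (some #{tuple} log)
    log                 ;; set semantics: straight duplicates are no-ops
    (conj log tuple)))

(defn scan [log pred]
  (filter pred log))

(def log
  (reduce put-tuple []
          [[:state "Alaska"]
           [:city "Juneau" "Alaska"]
           [:city "Juneau" "Alaska"]])) ;; duplicate, silently dropped

(scan log #(= (first %) :city))
;; => ([:city "Juneau" "Alaska"])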

Schemas!

So this sort of "sequence of tuples" thing is how core.logic.pldb works. It maintains a map of sequences of tuples, keyed by the tuple "name" so that scans can at least be restricted to single tuple "spaces".

Anyone here think that truly unstructured data is a good thing?

Yeah I didn't think so.

Years ago I did a project - spitfire - based on pldb. It was a sketch at a game engine which would load data files describing the pieces of the Warmachine tabletop game and provide a rules quick reference and, ultimately I hoped, a full simulation to play against.

As with most tabletop war games, play proceeds by executing a clock and repeatedly consulting tables of properties describing each model - which we recognize as database queries.

Spitfire used pldb to try and solve the data query problem, and I found that it was quite awkward to write against, in large part because it was really easy to mess up the tuples you put into pldb. There was no schema system to save you if you messed up your column count somewhere. I built one, but its ergonomics weren't great.

Since then, we've gotten clojure.spec(.alpha), which enables us to talk about the shape of and requirements on data structures. Spec is designed for talking about data in a forwards-compatible way that enables evolution, unlike traditional type systems which intentionally introduce brittleness.

While this may or may not be an appropriate trade-off for application development, it's a pretty great trade-off for persisted data and schemas on persisted, iterated data!

https://github.com/arrdem/shelving#schemas

(s/def :demo/name string?)
(s/def :demo.state/type #{:demo/state})
(s/def :demo/state
  (s/keys :req-un [:demo/name
                   :demo.state/type]))

(defn ->state [name]
  {:type :demo/state, :name name})

(s/def :demo.city/state string?)
(s/def :demo.city/type #{:demo/city})
(s/def :demo/city
  (s/keys :req-un [:demo.city/type
                   :demo/name
                   :demo.city/state]))

(defn ->city [state name]
  {:type :demo/city, :name name, :state state})

(s/def :demo/name string?)
(s/def :demo.capital/type #{:demo/capital})
(s/def :demo/capital
  (s/keys :req-un [:demo.capital/type
                   :demo/name]))

(defn ->capital [name]
  {:type :demo/capital, :name name})

(def *schema
  (-> sh/empty-schema
      (sh/value-spec :demo/state)
      (sh/value-spec :demo/city)
      (sh/value-spec :demo/capital)
      (sh/automatic-rels true))) ;; lazy demo

Writing!

#'shelving.core/put!

  • Recursively walk spec structure
    • depth first
    • spec s/conform equivalent
  • Generate content hashes for every tuple (a sketch of the idea follows this list)
  • Recursively insert every tuple (skipping dupes)
  • Insert the topmost parent record with either a content hash ID or a generated ID, depending on record/value semantics.
  • Create schema entries in the db if automatic schemas are on and the target schema/spec doesn't exist in the db.
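
Not Shelving's actual implementation, but a sketch of what content-hashed value IDs can look like: hash a printed representation of the value down to a stable UUID. A real version needs a canonical serialization - pr-str's key ordering isn't guaranteed for equal maps.

(import 'java.security.MessageDigest
        'java.nio.ByteBuffer
        'java.util.UUID)

(defn value->id
  "Content-hash a value to a deterministic UUID."
  [v]
  (let [digest (.digest (MessageDigest/getInstance "SHA-256")
                        (.getBytes (pr-str v) "UTF-8"))
        buf    (ByteBuffer/wrap digest)]
    (UUID. (.getLong buf) (.getLong buf))))

(value->id {:type :demo/state, :name "Alaska"})
;; => the same #uuid every time for an equal value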

Okay, so let's throw some data in -

(def *conn
  (sh/open
   (->MapShelf *schema "/tmp/demo.edn"
               :load false
               :flush-after-write false)))
;; => #'*conn

(let [% *conn]
  (doseq [c [(->city "Alaska" "Juneau")
             (->city "Arizona" "Pheonix")
             (->city "Arkansas" "Little Rock")]]
    (sh/put-spec % :demo/city c))

  (doseq [c [(->capital "Juneau")
             (->capital "Pheonix")
             (->capital "Little Rock")]]
    (sh/put-spec % :demo/capital c))

  (doseq [s [(->state "Alaska")
             (->state "Arizona")
             (->state "Arkansas")]]
    (sh/put-spec % :demo/state s))

  nil)
;; => nil

Schema migrations!

Can be supported automatically, if we're just adding more stuff!

  • Let the user compute the proposed new schema
  • Check compatibility
  • Insert into the backing store if there are no problems

Query parsing!

Shelving does the same thing as most of the other Clojure datalogs and rips off Datomic's datalog DSL.

(sh/q *conn
  '[:find ?state
    :in ?city
    :where [?_0 [:demo/city :demo/name] ?city]
           [?_0 [:demo/city :demo.city/state] ?state]
           [?_1 [:demo/capital :demo/name] ?city]])

This is defined to have the same "meaning" (query evaluation) as

(sh/q *conn
      '{:find  [?state]
        :in    [?city]
        :where [[?_0 [:demo/city :demo/name] ?city]
                [?_0 [:demo/city :demo.city/state] ?state]
                [?_1 [:demo/capital :demo/name] ?city]]})

How can we achieve this? Let alone test it reasonably?

Spec to the rescue once again! src/test/clj/shelving/parsertest.clj does conform/unform "normal form" round-trip testing!

Spec's normal form can also be used as the "parser" for the query compiler!
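
Not the actual grammar Shelving uses, but a minimal sketch of the trick: spec's regex ops both validate the vector query form and conform it into a labelled map you can compile from. All the spec names here (::lvar, ::clause, ::query) are illustrative:

(require '[clojure.spec.alpha :as s])

(s/def ::lvar
  (s/and symbol? #(.startsWith (name %) "?")))

(s/def ::clause
  (s/coll-of any? :kind vector? :count 3))

(s/def ::query
  (s/cat :find  (s/cat :kw #{:find}  :lvars   (s/+ ::lvar))
         :in    (s/? (s/cat :kw #{:in} :lvars  (s/+ ::lvar)))
         :where (s/cat :kw #{:where} :clauses (s/+ ::clause))))

(s/conform ::query
           '[:find ?state
             :in ?city
             :where [?_0 [:demo/city :demo/name] ?city]
                    [?_0 [:demo/city :demo.city/state] ?state]])
;; => {:find  {:kw :find,  :lvars [?state]}
;;     :in    {:kw :in,    :lvars [?city]}
;;     :where {:kw :where, :clauses [[?_0 [:demo/city :demo/name] ?city]
;;                                   [?_0 [:demo/city :demo.city/state] ?state]]}}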

Query planning!

Traditional SQL query planning is based around optimizing disk I/O, typically by trying to do windowed scans or range order scans which respect the I/O characteristics of spinning disks.

This is below Shelving's level of abstraction!

Keys are (abstractly) unsorted, and all we have to program against is a write log anyway! For a purely naive implementation we really can't do anything interesting - we're stuck doing O(lvars) full scans.

Let's say we added indices - maps from IDs of values of one spec to IDs of the values of other specs they relate to. Suddenly query planning becomes interesting. We still have to do scans of relations, but we can restrict ourselves to subscans based on relates-to information.

  • Take all lvars
  • Infer spec information from annotations & rels
  • Topsort lvars
  • Emit state-map -> [state-map] transducers & filters (see the sketch below)
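
As a rough sketch of that last step - this is not Shelving's planner, and the flat tuple layout and every name below are simplified stand-ins - each :where clause can be compiled to a function from a partial binding map to the seq of extended binding maps it permits, and the whole query is just those functions chained with mapcat:

(defn lvar? [x]
  (and (symbol? x) (.startsWith (name x) "?")))

(defn match-clause
  "Extend bindings so pattern matches tuple, or return nil."
  [bindings pattern tuple]
  (when (= (count pattern) (count tuple))
    (reduce (fn [b [p v]]
              (cond
                (not (lvar? p))  (if (= p v) b (reduced nil))
                (contains? b p)  (if (= (b p) v) b (reduced nil))
                :else            (assoc b p v)))
            bindings
            (map vector pattern tuple))))

(defn clause->fn [db pattern]
  (fn [bindings]
    (keep #(match-clause bindings pattern %) db)))

(defn q [db patterns]
  (reduce (fn [states f] (mapcat f states))
          [{}]                          ;; start with one empty binding map
          (map #(clause->fn db %) patterns)))

(def db #{[:city "Juneau" "Alaska"]
          [:capital "Juneau"]
          [:state "Alaska"]})

(q db '[[:city ?city ?state]
        [:capital ?city]])
;; => ({?city "Juneau", ?state "Alaska"})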

TODO:

  • Planning using spec cardinality information
  • Simultaneous scans (rank-sort vs topsort)
  • Blocked scans for cache locality

Goodies

API management!

Documentation generation!

Covered previously on the blog - I wrote a custom markdown generator and updater to help me keep my docstrings as the documentation source of truth, and update the markdown files in the repo by inserting the appropriate content from docstrings when they change.

More fun still to be had

What makes Datalog really interesting is the wealth of extensions which have been proposed - and there are several I haven't gotten to yet.

Negation!

  • Really easy to bolt onto the parser, or enable as a query language flag
  • Doesn't invalidate any of the current stream/filter semantics
  • Closed world assumption, which most databases happily make

Recursive rules!

More backends!

Transactions!

  • Local write logs as views of the post-transaction state
  • transact! writes an entire local write log all-or-nothing
  • server? optimistic locking? consensus / consistency issues

Ergonomics!

The query DSL wound up super verbose unless you really leverage the inferencer :c

Actually replacing Grimoire…

  • Should have just used SQL ;) but this has been educational

Docs Sketches

At present I'm working on developing Shelving, a clojure.spec(.alpha) driven storage layer with a Datalog subset query language. Ultimately, I want it to be an appropriate storage and query system for other projects of mine, like Stacks and another major version of Grimoire.

In building out this tool, I've been trying to be mindful of what I've built and how I'm communicating it. What documentation have I written? Is there a clear narrative for this component? Am I capturing the reason I'm making these decisions and the examples that motivated me to come here?

Writing good docs is really hard, doubly so when you're trying to write documentation largely in docstrings. Docs you write in the docstrings drift rapidly from docs written elsewhere. The only surefire way I've come up with to keep the two in sync is to consider your docstrings to be authoritative and export them to other media. So I wanted a tool for that.

Another major problem is documentation organization. In Clojure, namespaces have long-known limitations as the unit of structure for your API. Clojure's lack of support for circular references between namespaces and the need to use explicit forward declarations lead to large namespaces whose order reflects the dependency graph of the artifact, not necessarily the intended API or the relative importance of its members. In my first iterations of Shelving, I was trying to just use a large .core namespace rather than maintain separate namespaces. I wanted a tool that would let me take a large API namespace and break its documentation up into small subsections which had conceptual relationships.

The result of these two needs is the following script, which I used for a while to generate the docs/ tree of the shelving repo. To work around the issues of having a large, non-logical .core namespace, I annotated vars with ^:categories #{} metadata. This allowed me to build up and maintain a partitioning of my large namespace into logical components.

Documentation files are generated by opening an existing file, truncating any previously generated API docs in that category, and appending newly generated docs to it. This allows me to have e.g. a Markdown title, some commentary, examples and what have you, then automatically splat the somewhat formatted and source-linked docs for the relevant category after that header.

I won't say the results are good. But they beat the heck out of not having any docs at all. Mainly they succeeded in being easier to browse, cross-check and consequently made copy editing docstrings much easier.

With the introduction of distinct "user" and "implementer" APIs to Shelving which happen to share some vars, this categorization is no longer so clear cut. Some docs now belong in one place, some in multiple places. This is an inadequate tool for capturing those concerns. Also it still sorts symbols by line numbers when generating docs 😞

So I'll be building another doodle, but thought that this one was worth capturing before I do.

The script itself -

(ns compile-docs
  "A quick hack for building the doc tree based on `^:category` data."
  (:require [shelving.core :as sh]
            [clojure.java.io :as io]
            [clojure.string :as str]))

(def category-map
  {::sh/basic  (io/file "docs/basic.md")
   ::sh/schema (io/file "docs/schema.md")
   ::sh/rel    (io/file "docs/rel.md")
   ::sh/util   (io/file "docs/helpers.md")
   ::sh/query  (io/file "docs/queries.md")
   ::sh/spec   (io/file "docs/spec.md")
   ::sh/walk   (io/file "docs/walk.md")})

(defn ensure-trailing-newline [s]
  (if-not (.endsWith s "\n")
    (str s "\n") s))

(defn relativize-path [p]
  (str/replace p (.getCanonicalPath (io/file ".")) ""))

(defn compile-docs [category-map nss]
  (let [vars (for [ns              nss
                   :let            [_ (require ns :reload)
                                    ns (if-not (instance? clojure.lang.Namespace ns)
                                         (the-ns ns) ns)]
                   [sym maybe-var] (ns-publics ns)
                   :when           (instance? clojure.lang.Var maybe-var)]
               maybe-var)

        groupings (group-by #(-> % meta :categories first) vars)]
    (println groupings)
    (doseq [[category vars] groupings
            :let            [category-file (get category-map category)]
            :when           category-file
            v               (sort-by #(-> % meta :line) vars) ;; FIXME: better scoring heuristic?
            :let            [{:keys [categories arglists doc stability line file]
                              :as   var-meta} (meta v)]]
      (println v)
      (with-open [w (io/writer category-file :append true)]
        (binding [*out* w]
          (printf "## [%s/%s](%s#L%s)\n"
                  (ns-name (.ns v)) (.sym v)
                  (relativize-path file) line)
          (doseq [params arglists]
            (printf " - `%s`\n" (cons (.sym v) params)))
          (printf "\n")
          (when (= stability :stability/unstable)
            (printf "**UNSTABLE**: This API will probably change in the future\n\n"))
          (printf (ensure-trailing-newline
                   (-> doc
                       (str/replace #"(?<!\n)\n[\s&&[^\n\r]]+" " ")
                       (str/replace #"\n\n[\s&&[^\n\r]]+" "\n\n"))))
          (printf "\n"))))))

(defn recompile-docs [category-map nss]
  (doseq [[_cat f] category-map]
    (let [buff      (slurp f)
          truncated (str/replace buff #"(?s)\n+##.++" "\n\n")]
      (spit f truncated)))

  (compile-docs category-map nss))

(defn recompile-docs!
  "Entry point suitable for a lein alias. Usable for automating doc rebuilding."
  [& args]
  (recompile-docs category-map
                  '[shelving.core
                    shelving.query
                    shelving.spec
                    shelving.spec.walk]))

;; In lein:
;; :profiles {:dev {:aliases {"build-docs" ["run" "-m" "compile-docs/recompile-docs!"]}}}
;;
;; Works best with lein-auto to keep your doctree fresh

^d

Arch Package Transaction Log Fun

Those of you following me on twitter have probably seen me kibitzing about my development environment having been utterly destroyed by an Arch Linux package upgrade in the last ~24h.

It's been really fun.

Arch Linux unfortunately doesn't provide a tool for resetting your computer's package state to whatever it was previously following a bad update. However, the Arch package manager pacman does helpfully write a log file which records everything it's done as a good old-fashioned append-only commit log.

This means that getting my computer back to a sane state was obnoxious, but a pretty trivial hack: write a program that takes a date stamp, scans the transaction log for everything that happened at or after that date stamp, and rolls it back.

While my well polished emacs install was unusable, due to the aforementioned awesomeness, I still remember enough vim to be dangerous.

And so, the hack:

#!/usr/bin/env python3

import os
import re
import sys


def main(opts, args):
    """Usage: python rollback.py date

    Parse /var/log/pacman.log, enumerating package transactions since the
    specified date and building a plan for restoring the state of your system to
    what it was at the specified date.

    Assumes:
    - /var/log/pacman.log has not been truncated

    - /var/cache/pacman/pkg has not been flushed and still contains all required
      packages

    - The above paths are Arch default and have not been customized

    - That it is not necessary to remove any "installed" packages

    Note: no attempt is made to inspect the dependency graph of packages to be
    downgraded to detect when a package is already transitively listed for
    downgrading. This can create some annoying errors where eg. systemd will be
    downgraded, meaning libsystemd will also be downgraded, but pacman considers
    explicitly listing the downgrade of libsystemd when it will already be
    transitively downgraded an error.

    """

    date, = args

    print("Attempting to roll back package state to that of {0}...\n"
          .format(date))

    # These patterns can't be collapsed because we want to select different
    # version identifying strings depending on which case we're in. Not ideal,
    # but it works.

    # Ex. [2017-04-01 09:51] [ALPM] upgraded filesystem (2016.12-2 -> 2017.03-2)
    upgraded_pattern = re.compile(
        ".*? upgraded (?P<name>\w+) \((?P<from>[^ ]+) -> (?P<to>[^\)]+)\)")

    # Ex: [2018-02-23 21:18] [ALPM] downgraded emacs (25.3-3 -> 25.3-2)
    downgraded_pattern = re.compile(
        ".*? downgraded (?P<name>\w+) \((?P<to>[^ ]+) -> (?P<from>[^\)]+)\)")

    # Ex: [2017-03-31 07:05] [ALPM] removed gdm (3.22.3-1)
    removed_pattern = re.compile(
        ".*? removed (?P<name>\w+) \((?P<from>[^ ]+)\)")

    checkpoint = {}
    flag = False

    with open("/var/log/pacman.log") as logfile:
        for line in logfile:
            if date in line:
                flag = True
            elif not flag:
                continue

            match = re.match(upgraded_pattern, line)\
                or re.match(downgraded_pattern, line)\
                or re.match(removed_pattern, line)

            if match:
                package = match.group("name")
                from_rev = match.group("from")
                if package not in checkpoint:
                    checkpoint[package] = from_rev
                continue

    print("Checkpoint state:")
    for k in checkpoint.keys():
        print("{0} -> {1}".format(k, checkpoint[k]))

    pkgcache = "/var/cache/pacman/pkg"
    pkgs = os.listdir(pkgcache)
    pkgnames = ["{0}-{1}".format(k, v) for k, v in checkpoint.items()]

    selected_pkgs = [os.path.join(pkgcache, p)
                     for n in pkgnames
                     for p in pkgs
                     if n in p]

    print("\n\nSuggested incantation:\nsudo pacman -U {}"
          .format("\\\n  ".join(selected_pkgs)))


if __name__ == "__main__":
    main(None, sys.argv[1:])

Over and out from my recovered emacs setup 😊

^d

Architecture and Microarchitecture

From a recent paper review I wrote -

So @ztellman is writing a book The Elements of Clojure which would better be titled The Elements of Software. It’s almost entirely about exploring ideas about abstraction and how abstraction and engineering interact. Maybe worth perusing since this paper seems to hit a bunch of related ideas. To crib some of those:

Science is about dealing with timeless things. The lambda calculus has no engineering constraints. It’s a great substrate for proof, and takes some head standing to become an adequate substrate for doing “real” work because doing “real” work is characterized by trying to fit within a finite machine and finite time.

Engineering is about shoveling the constants around to optimize for how many transistors we can fab at what density and how many we can drive / cool at what power without melting the whole thing. That is, it’s entirely an exercise in factors which are far from timeless and thus not suitable for long lasting abstractions. Architectures like the Itanium, MIPS and TTAs you allude to are the final form of this kind of thinking. It may be momentarily convenient say for performance to expose, as in the MIPS, a pipeline without interlocks or the full physical register set, but as the paper conveys well this constitutes a fragile API which will prove limiting in the future when the engineering constraints which motivated those decisions change.

The bit you don't seem to hit on yet is what architecture is, and how do we do it? I suggest, still cribbing from Zach, that architecture both in software and in hardware is all about trying to find the perfect layer of indirection between the engineering constraints of the present (time, memory, chip area, heat, etc.) and the universals/existentials that represent a stable target. That is, you're trying to find what set of cost trade-offs you can presume are constant-ish for as long as you hope your design will be relevant. Will we still be monocore and care about single cache density? Will the NUMA gap continue to increase? Will we continue to be able to rely on speculation for performance wins, or will we switch to more communication based models of computation where privilege level context switches must become more expensive and communication must be efficient?

To cherry-pick your CAR / CDR example of an architecture feature which has lasted far too long: while the names are absurd and now long irrelevant, the notion of the first/next pair has proven to be a general, timeless representation for arbitrary graph structures - one which has been repeated time and time again, silly names included, even in the face of memory architecture [1] changes which render the original conception less and less appropriate.

Straight stack, thunk or SECD VMs follow hot on the heels of really trying not to assume anything. Maybe that's fine. Maybe as you hint at the end, an abstract register machine with some support for generic operations in the style of the Cray-1 which may be specialized or implemented with accelerators later is appropriate.

The central idea that this paper seems to be toeing is that as implementers with our eyes on what we’ve built our natural tendency when “specifying” is to write down the specific thing we’ve envisioned. From a cognitive laziness perspective it’s the natural thing to do. Taking a step back and realizing what’s essential versus what’s incidental is really goddamn difficult because their (our) job is to have our heads so far down in the microarchitecture that we’re counting clock cycles and cache lines in our heads before we even have a simulator. This sort of clock cycle counting - engineering and the failure of architecture - is what leads to the microarchitecture-as-architecture trap sketched above.

Dunno how much of this is useful without turning it into a philosophy paper, but it does seem like there’s plenty of room to hit on these issues a little more directly and be concrete about the possible advantages of abstract structures like the admittedly absurd x86 floating point stack unit.

牛: How static is enough?

Ox has been and will continue to be a long-haul project in part because it's a dumping ground of ideas for me, and in part because I don't desperately need it. There's plenty of bad in the software industry, but I'm able to swing the hammers that already exist. Ox isn't so much a hammer as it is an exercise in the ongoing search for what a better hammer would be.

Part of this search comes from my judgments of the tools I use on the daily and their failures. The most significant part, however, is how the decisions I've made in the prototypes I've built have shaken out.

It would have been easy enough to crank out an RⁿRS interpreter and call it Ox. It just wouldn't have added anything to the software ecosystem or solved any problems I actually had. In fact most of the problems I've tried in one way or another to approach in Ox are directed by my programming experience.

So lets start with the motivation and what I was trying to get at in the last cut I took at Ox.

Clojure is a nice language which does a good job at trying to provide sane defaults. Its data structures are immutable by default, it features destructuring - macro sugar for binding values in structures to locals, it interoperates nicely with the JVM and it has a namespace system which makes code organization fairly predictable.

Because Clojure uses a single-pass compiler and its compilation unit is a single form rather than a file/module or group thereof, it can be difficult to write Clojure whose module structure reflects its logical structure. Cyclic dependencies between modules/namespaces are difficult to achieve and have poor ergonomics. In fact, cyclic dependencies within a single namespace are - if not difficult to achieve - then at least enough effort that it tends to prohibit their use.

This has a couple of consequences, well exemplified by the language implementation itself. The namespace clojure.core is a huge namespace containing many groups of related functions which would do just as well if not better in their own namespaces. Big namespaces which don't have a unifying concept are difficult to approach as a new user and difficult to document. Furthermore, they create artificial contention over the "good" names, and tend to lead to the use of prefixes.

Perhaps most interesting are the consequences the lack of circular dependency support has for the language's own implementation. The first several thousand lines of clojure.core serve to bootstrap the language using mainly JVM inter-operation to Java interfaces and the Java implementation of Clojure's various built-ins. Some of this bootstrap code is unavoidable, but parts of it could be avoided if the implementation was able to make forward use of symbols which weren't defined yet.

With explicit forward declarations in some cases this could be made possible today within a single namespace such as clojure.core. The most interesting use cases of circular dependencies however require circular dependencies across module boundaries. For instance, what if we could have a clojure.core.seqs namespace containing all the operations related to the sequence abstraction, and so forth? A large number of small namespaces would be more modular and easier to break out of the standard library into library packages where possible.

And after all, Ox is a pie in the sky project so what would it take?

Well - circular dependencies between functions are pretty easy. They don't have a data dependency on each other at compile time - okay, they do, but letrec is easy to write, and rewriting modules as letrec to support mutual function recursion is well understood and a traditional Scheme trick.

For instance if I had the nonsensical module

(define (f a b)
  (c a))

(define (g a b)
  (c b))

(define (c x)
  (+ x 1))

This could be implemented instead by re-writing it to the letrec form

(letrec ((f (lambda (a b) (c a)))
         (g (lambda (a b) (c b)))
         (c (lambda (x) (+ x 1))))
  ...)

Implementing circular dependencies Scheme style via letrec works because the top level special forms (define etc.) are well understood and can be re-written into letrec bindings as above.

The idea I had at the time was that REALLY the core problem here is that the evaluator can't figure out the dependency graph between functions and macros because we have an indeterminable number of un-expanded macros. A solution to this problem is to simply make the dependency graph statically visible.

In Scheme, the above letrec conversion is trivial because the top level form define is well understood by the implementation and so can be rewritten into a form where circular references will function as desired. The take on this in #29 is to do so by restricting top level forms to a small set of forms which explicitly state what definitions they provide. For instance def, defn, deftype and so forth all explicitly name the definitions they create.

(ns example)

;; This provides the definition example/foo. `def` is well know, so
;; this form can be understood as a definition at read time without
;; any effort to evaluate the body or docstring just like `define`.
(def foo
  "Some docs"
  1)

;; Same as above, `defn` is well known and can be statically parsed
;; at least to the degree we need to understand that this defines
;; the symbol example/qux for dependency analysis
(defn qux
  "Some docs"
  ([x] "arity 1")
  ([x y] "arity 2")
  ([x y z & more] "arity 3+"))

;; `defmacro` is just `defn` with the macro flag and some tweaks to
;; accept implicit arguments like `&form` and `&env`.
(defmacro deftype
  ;; ...
  )

;; deftype, definterface and friends all nicely fit this pattern to
;; the extent one may choose to implement them as special forms.

This all works pretty well. The problem is what happens when a macro occurs at the top level? Doing macros this way doesn't work, especially when there are sequential dependencies between the macros and the functions in the module. While the macros can be re-written as letrec bindings, the semantics of macro expansion don't fit that way because the macros have to be expanded before the functions (or the other macros!) using them become executable.

Consider something like this...

(deftype Foo "A foo record." [a b c])
;; We now have the functions `->Foo`, `Foo?` and some friends

(defn foo-or-bar [a]
  (or (Foo? a)   ;; Where does `Foo?` come from? Nothing at the top
                 ;; level trivially provides `Foo?`, but we have to
                 ;; find it in order to compile `foo-or-bar`.
      (Bar? a)))

I thought about this for a long time because I think there's a clear case for supporting mutually recursive definitions with the most generality possible. And the first solution I came up with is the one implemented in ox-lang/ox#29.

By defining a new top-level special form, providing, users can just annotate their top level macros with the symbols they're expected to produce! Problem solved! This lets Ox determine which top level forms need to be evaluated to see if they provide macros, and perform lazy loading by doing functionally free dependency analysis before any evaluation.

So for instance

(providing [->Foo, Foo?]
  (deftype Foo "A foo record." [a b c]))

(defn foo-or-bar [a]
  (or (Foo? a) ;; The reader understands `providing` and can determine
               ;; that to get the needed `Foo?` definition the
               ;; previous form needs to be evaluated.
      (Bar? a)))

The problem with this approach - and the reason I call it a mistake - is that it greatly complicates load time and neuters definition writing macros. As noted earlier, there's plenty of good reason to want to write macros that write definitions. However a good bit of the syntactic power of such macros is lost if you have to provide a list by hand of definitions the macros will create for you.

Frankly, this made implementing code loading so complicated that - what with trying to bootstrap Ox in Java - I couldn't do it. This decision proved absolutely fatal to that particular run at the Ox project.

There are, now that I've seen the failures of this approach and had some hammock time, other ways to achieve this same end. For instance if you don't allow macros to write macros, you can play games where macro definitions are letrec rewritten as above, and the definitions of "normal" functions or expressions are delayed until forced by a macro expansion. This lets you play games whereby functions and macros may mutually recur so long as you can ground out a macroexpansion to some concrete form or function of forms that can be evaluated without unbounded mutual recursion between functions and macros. Which, honestly, is the best I think you can do.

The advantage of Clojure's single pass approach here is a huge complexity win which avoids all these considerations, and the fact that the Clojure read/evaluate/print loop has exactly the same structure as code loading.

Anyway. Languages are hard. That's what makes them fun.

More when I've got better ideas.

^d