Shaking up GHC

As part of my 6-month research secondment to Microsoft Research in Cambridge, I am taking up the challenge of migrating the current GHC build system, which is based on standard make, to a new and (hopefully) better one based on Shake. If you are curious about the project, you can find more details on the wiki page.

This week I’ve been trying to wrap my head around the current build system, and so far I’ve felt rather intimidated by the scale of the undertaking. Yet I am optimistic, because several much smarter people are helping me 🙂 If you are a Haskell fan and would like to help make GHC better, please join this team effort!

To give you an idea of typical dependency chains, below is a representative one:

  • File compiler/prelude/primops.txt.pp describes GHC’s primitive operations and types. It uses C preprocessor directives like #include and #if and therefore needs to be preprocessed. The result goes into compiler/stage1/build/primops.txt. If one of the included files changes (e.g., ghc_boot_platform.h), the result must be rebuilt.
  • The resulting primops.txt is then processed by genprimopcode --out-of-line to generate primop-out-of-line.hs-incl.
  • genprimopcode itself needs to be built. It lives in utils/genprimopcode and is a Haskell program, so it is first built with the bootstrap (stage 0) compiler. Read more about compiler stages here.
  • Finally, a bunch of .hs-incl files are included (by a C preprocessor) into compiler/prelude/PrimOps.lhs, which then participates in the rest of the build in a more or less standard way.
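
To make this chain concrete, below is a minimal sketch of how its first two links might be expressed as Shake rules. The file names are taken from the list above, but the genprimopcode location, the header path, and the preprocessor invocation are illustrative assumptions, not the actual GHC build code:

```haskell
import Development.Shake

main :: IO ()
main = shakeArgs shakeOptions $ do
    want ["compiler/stage1/build/primop-out-of-line.hs-incl"]

    -- Preprocess primops.txt.pp; since the header is declared as a
    -- dependency, a change to ghc_boot_platform.h makes Shake rerun
    -- this rule and rebuild the preprocessed file.
    "compiler/stage1/build/primops.txt" %> \out -> do
        let src = "compiler/prelude/primops.txt.pp"
        need [src, "compiler/stage1/ghc_boot_platform.h"]
        cmd_ "cpp" "-P" "-Icompiler/stage1" src "-o" out

    -- Run genprimopcode (built beforehand with the stage 0 compiler)
    -- over the preprocessed file to produce the .hs-incl output.
    "compiler/stage1/build/primop-out-of-line.hs-incl" %> \out -> do
        let tool  = "utils/genprimopcode/genprimopcode"
            input = "compiler/stage1/build/primops.txt"
        need [tool, input]
        cmd_ (FileStdin input) (FileStdout out) tool "--out-of-line"
```

Each rule declares its inputs with need, which is what lets Shake derive the dependency chain and rebuild exactly the files that are out of date.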

A detour into graph families and parallelism

Since my recent research has been focused on graphs, I find build dependency ‘graphs’ quite interesting. I put graphs in quotes because they are actually graph families, where each constituent graph corresponds to a partial order on the set of build actions: this is what gets executed when you run a build tool once (partial orders give parallelism and termination). There is some additional conditional structure, which enables or disables parts of the family depending on file freshness criteria, the contents of generated files, and the final build target; the conditions select a particular graph in the family, so each build run is unambiguous. I think one can use the algebra of parameterised graphs to capture and reason about build systems, for example, to transform them in a provably correct way. The theory of parameterised graphs was developed specifically for reasoning about graph families; you can have a look at my recent paper about this here.
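
For a flavour of what this might look like, here is a toy Haskell encoding of a parameterised graph; the constructor names are mine, chosen for illustration, and the paper gives the actual algebra and its laws:

```haskell
-- A toy parameterised graph: 'a' labels build actions and 'b' is the
-- type of conditions (e.g. build flags, targets, freshness tests).
data PG b a
    = Empty
    | Vertex a                   -- a single build action
    | Overlay (PG b a) (PG b a)  -- two graphs side by side (parallelism)
    | Connect (PG b a) (PG b a)  -- edges from left to right (ordering)
    | Condition b (PG b a)       -- keep the subgraph only when b holds

-- Selecting one member of the graph family: resolve every condition
-- with a predicate, dropping the disabled subgraphs.
resolve :: (b -> Bool) -> PG b a -> PG b a
resolve _ Empty           = Empty
resolve _ (Vertex a)      = Vertex a
resolve p (Overlay x y)   = Overlay (resolve p x) (resolve p y)
resolve p (Connect x y)   = Connect (resolve p x) (resolve p y)
resolve p (Condition b x) = if p b then resolve p x else Empty
```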

And a final comment about parallelism and randomness. One often wants to control the degree of parallelism during a build run, e.g., by passing an argument like -j8 to make or Shake (I believe such parallelism control is called a policy in concurrency theory). When parallelism is limited, there may be several possible executions of a partial order, and this is where randomness comes in handy, as nicely explained in the Shake paper (section 4.3.2). Briefly, different build actions may require different resources (CPU, disk, memory, etc.), and scheduling them deterministically may lead to resource contention (imagine always running all compilers before all linkers). Randomised scheduling of build actions avoids this worst-case scenario and is reported to give up to a 20% speed-up in real build systems.

I am curious to see what speed-up randomness can achieve in the GHC case. It may also be worth researching a clever scheduling strategy that minimises resource contention intelligently (rather than by randomness), putting build actions into parallel bundles according to their resource requirements. One would then need to annotate build actions with their resource requirements, of course, but this seems not very difficult to do, as there are only a few different types of build actions.
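
As a side note, Shake exposes exactly this knob: the shakeThreads field of ShakeOptions plays the role of make’s -j8. A minimal sketch:

```haskell
import Development.Shake

-- Allow at most 8 build actions to run in parallel, analogous to
-- make -j8. With shakeArgs the same limit can also be set from the
-- command line via -j8 or --jobs=8.
main :: IO ()
main = shake shakeOptions{shakeThreads = 8} $
    action $ putInfo "running with up to 8 parallel build actions"
```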

10 thoughts on “Shaking up GHC”

  1. Although I am new to Haskell, I have been using Maven for many years and think that its concepts would apply well to Haskell build tools.

    The whole point about program building is control and sharing. Control is about making design decisions explicit and visible, and their changes detectable.

    When building a system in the large, you need to be able to:
    a) choose the libraries and their versions (your dependency tree), and automatically obtain the dependencies/versions you did not explicitly choose. Your choice takes priority over the “correct” one.
    b) quickly spot changes in your dependency tree (e.g. version-control diff your pom.xml), which are introduced by programmers who try to solve compilation problems by modifying the system’s architecture.
    c) share the dependencies and automate the download/build process: everybody in the team should get exactly the same dependencies and build/generate exactly the same components/documentation/…

    The great part of Maven is that you get a reproducible build environment and programmers are productive within 30′ for mid-sized projects (1. install Maven, 2. pull the latest sources, including the project structure descriptors (pom.xml), 3. build (the first library downloads take most of the 30′)).

    1. @Yves I haven’t used Maven, but as far as I know it requires the JDK, which is a huge dependency.

      Shake, on the other hand, is written in Haskell and therefore depends on GHC, as noted in the comment below. Since GHC already depends on GHC, Shake doesn’t bring any new dependencies.

    2. I don’t think Maven would be appropriate since it’s really about expressing metadata, and most of the rules to execute that metadata are baked in. There are a handful of examples of people defining new Maven rules, but given the GHC build system is mostly rules, I don’t think Maven fits. I might be wrong though, I’d love to see a Maven entry for the Build System Shootout (https://github.com/ndmitchell/build-shootout), as many of the examples there are exactly the kind of things GHC needs to do in its build system.

  2. Building GHC with Shake introduces a circular dependency: Shake itself needs a Haskell compiler in order to be built.

    Circular dependencies in compilers make it hard to build things on new architectures, where no previous version of the compiler exists.

    It’s also an annoyance for distributions like Debian or Fedora that prefer (or require) to build all their packages from source. They would (AFAIK) be forced to use a pre-built GHC to bootstrap the build: compile Shake with the pre-built GHC, compile the new GHC with Shake, then rebuild Shake with the new GHC.

    One way to avoid such circular dependencies is to keep a make-based build option that is capable of building the minimum set of things necessary to build the rest of the build tools.

    1. @Thomas But circular dependencies are already there: GHC itself is written in Haskell, so it requires a bootstrapping compiler in order to be built. Adding yet another circular dependency does not seem to change much.

      Any change comes with a price, of course, and in this case potential benefits seem to outweigh it. Let’s see how these benefits materialise (we never know until we try).

      1. I think the key is that adding Shake adds a few links, but doesn’t add any new cycles. GHC is required to build GHC. Afterwards, GHC will be required to build GHC. Hopefully that will be one extra line of shell script, and could even be wrapped into a Makefile to keep distros happy.

  3. Hi Andrey. I was playing with some similar stuff to figure out a good substrate for build systems and task runners and wish I had read your research before starting. I ended up with a “graph with additional conditional structure” sort of thing, although of course there is none of the rigorous mathematical analysis. The result of my experiments are here, in case you’re interested: https://github.com/masaeedu/transact

    1. Hi Asad! “Graph with additional conditional structure” sounds familiar: this was the topic of my PhD thesis a long time ago, in a different world of hardware design 🙂 Alas, I couldn’t follow the code you linked. Could you perhaps add some comments, or simply explain what it does and how? I guess SituationGraph is the “graph with additional conditional structure”, but I don’t understand the types.

      By the way, I’ve done some further research on build systems after this blog post. Here is the latest paper which provides a good overview and comparison of many build systems and their algorithms: https://dl.acm.org/citation.cfm?id=3236774.

      1. Yes, I’ve been delving more into it, including various talks by you and the coauthors. Very enlightening stuff!

        I apologize for the state of the code; I was still very new to Haskell when I wrote this. I’ll clean it up a little bit this weekend. Some handwavy explanation follows.

        The graph consists of some labeled “situations” and “transitions” between these situations.

        A situation `e (Maybe x)` represents some possible state of the world, where we can effectfully compute whether we are in the situation (and if so, obtain some information `x` associated with the situation).

        A transition into a situation consists of the labels of some prerequisite situations `c k`, and an effectful function `c x -> e x` which consumes information available in prerequisite situations and effectfully brings the world into conformity with the desired situation. The `c` here is some abstract container (e.g. a map/list/record).
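
        In types, it is roughly this (a simplified sketch, not the exact code in the repo):

        ```haskell
        -- 'k' labels situations, 'e' is the effect in which we can test
        -- and change the world, 'c' is an abstract container.
        data Transition e c k x = Transition
            { prerequisites :: c k         -- labels of prerequisite situations
            , enact         :: c x -> e x  -- establish the situation
            }

        data Situation e c k x = Situation
            { check       :: e (Maybe x)   -- are we already in this situation?
            , transitions :: [Transition e c k x]
            }
        ```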

        Given a situation graph that we perhaps declaratively read from config, and some desired situation, we can first check whether we’re already in that situation. If we’re not, we can look at all the transitions that bring us into the desired situation and examine *their* prerequisite situations (possibly using additional metadata to weight the transitions, preferring transitions for which most prerequisites are already satisfied). If no transition has all of its prerequisites satisfied, we recursively look at transitions that might bring us to the prerequisites (and so on).
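
        Concretely, the search could look something like this naive sketch (specialised to lists and IO, and simply taking the first available transition instead of weighting them):

        ```haskell
        import qualified Data.Map as Map
        import Data.Map (Map)

        type Label = String

        data Situation x = Situation
            { check       :: IO (Maybe x)              -- already satisfied?
            , transitions :: [([Label], [x] -> IO x)]  -- (prerequisites, enact)
            }

        -- Establish the labelled situation, recursively satisfying the
        -- prerequisites of the first available transition.
        satisfy :: Map Label (Situation x) -> Label -> IO x
        satisfy graph label =
            case Map.lookup label graph of
                Nothing -> fail ("unknown situation: " ++ label)
                Just s  -> do
                    status <- check s
                    case status of
                        Just x  -> return x  -- nothing to do
                        Nothing -> case transitions s of
                            []            -> fail ("no transition into " ++ label)
                            (pre, go) : _ -> mapM (satisfy graph) pre >>= go
        ```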

        A small demonstration of a situation graph for creating some folder structure (and cleaning it) is in the file.

        At any rate, now that I’ve read about the task abstraction and your synthesis of build tools into this conceptual framework, I think I have a much nicer basis to start working from. The old model was a useful idea, but it was much too complicated and specific.

        1. Many thanks for the explanation! It’s very helpful, I think I can more or less understand your code now 🙂

          I’m also glad you found the task/build abstractions from our paper useful! If you find a way to improve anything or introduce something new, feel free to open an issue/PR in the repository.
