As part of my 6-month research secondment to Microsoft Research in Cambridge, I am taking up the challenge of migrating the current GHC build system, based on standard make, into a new and (hopefully) better one based on Shake. If you are curious about the project, you can find more details on the wiki page.
During this week I’ve been trying to wrap my head around the current build system and so far I’ve felt rather intimidated by the scale of this undertaking. Yet I am optimistic because I’m helped by several much smarter people 🙂 If you are a Haskell fan and would like to help make GHC better, please join this team effort!
To give you an idea about typical dependency chains, below is a representative one:
- compiler/prelude/primops.txt.pp describes GHC's primitive operations and types. It uses C preprocessor directives like #if and therefore needs to be preprocessed. The result goes into compiler/stage1/build/primops.txt. If one of the included files changes (e.g., ghc_boot_platform.h), the result must be rebuilt.
- primops.txt is processed by genprimopcode --out-of-line to generate .hs-incl files.
- genprimopcode itself needs to be built. It lives in utils/genprimopcode and is a Haskell program, so it is first built with the bootstrap (stage 0) compiler. Read more about compiler stages here.
- Finally, a bunch of .hs-incl files are included (by a C preprocessor) into compiler/prelude/PrimOps.lhs, which then participates in the rest of the build in a more or less standard way.
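To make the dependency chain concrete, here is a small sketch in plain Haskell (using only the containers library). The file names come from the chain above, but the encoding itself is mine, not GHC's: it records the dependencies and derives one valid build order from them.

```haskell
import Data.Graph (graphFromEdges, topSort)

-- The primops dependency chain above, encoded as
-- (node, key, dependencies) triples. Names are as in the post;
-- "hs-incl" stands for the generated .hs-incl files.
deps :: [(String, String, [String])]
deps =
  [ ("primops.txt.pp",      "primops.txt.pp",      [])
  , ("ghc_boot_platform.h", "ghc_boot_platform.h", [])
  , ("genprimopcode",       "genprimopcode",       [])
  , ("primops.txt",         "primops.txt",         ["primops.txt.pp", "ghc_boot_platform.h"])
  , ("hs-incl",             "hs-incl",             ["primops.txt", "genprimopcode"])
  , ("PrimOps.lhs",         "PrimOps.lhs",         ["hs-incl"])
  ]

-- One valid build schedule: reverse topological order, so every node
-- is built after all of its dependencies. Any linearisation of the
-- partial order would do; independent nodes (e.g. building
-- genprimopcode and preprocessing primops.txt.pp) can run in parallel.
buildOrder :: [String]
buildOrder = map name (reverse (topSort graph))
  where
    (graph, fromVertex, _) = graphFromEdges deps
    name v = case fromVertex v of (n, _, _) -> n

main :: IO ()
main = mapM_ putStrLn buildOrder
```

Note that the encoding only captures the partial order, not the build actions themselves; that is precisely the part a tool like Shake executes (and parallelises) for us.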
A detour into graph families and parallelism
Since my recent research has been focused on graphs, I find build dependency ‘graphs’ quite interesting. I put graphs in quotes, because they are actually graph families, where each constituent graph corresponds to a partial order on the set of build actions: this is what gets executed when you run a build tool once (partial orders give parallelism and termination). There is some additional conditional structure, which enables/disables parts of the family depending on certain file freshness criteria, contents of generated files, and on the final build target; conditions select a particular graph in the family, so each build run is unambiguous. I think that one can use the algebra of parameterised graphs to capture and reason about build systems, for example, transform them in a provably correct way. The theory of parameterised graphs has been developed specifically to reason about graph families; you can have a look at my recent paper about this here.
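As a toy illustration of the graph-family idea (my own minimal encoding, not the exact algebra from the paper), one can represent a family as an expression built from vertices, overlay, connect, and condition-guarded subexpressions; fixing the parameters then specialises the family to one concrete graph:

```haskell
-- A minimal sketch of parameterised graph expressions (a toy
-- encoding for illustration): a graph family built from vertices,
-- overlay, connect, and condition-guarded parts.
data Graph a p
  = Empty
  | Vertex a
  | Overlay (Graph a p) (Graph a p)   -- union of two subfamilies
  | Connect (Graph a p) (Graph a p)   -- plus edges from left to right
  | Condition p (Graph a p)           -- subfamily enabled by parameter p

-- Fixing the parameters selects one concrete graph from the family.
specialise :: (p -> Bool) -> Graph a p -> Graph a p
specialise f g = case g of
  Condition c h -> if f c then specialise f h else Empty
  Overlay x y   -> Overlay (specialise f x) (specialise f y)
  Connect x y   -> Connect (specialise f x) (specialise f y)
  _             -> g

vertices :: Graph a p -> [a]
vertices g = case g of
  Empty         -> []
  Vertex x      -> [x]
  Overlay x y   -> vertices x ++ vertices y
  Connect x y   -> vertices x ++ vertices y
  Condition _ x -> vertices x

-- Hypothetical family: preprocessing always precedes compilation,
-- while a "stage1"-only action is part of the family but appears
-- in the concrete graph only when that parameter is enabled.
example :: Graph String String
example = Connect (Vertex "preprocess")
                  (Overlay (Vertex "compile")
                           (Condition "stage1" (Vertex "genprimopcode")))

main :: IO ()
main = print (vertices (specialise (const True) example))
```

Each build run corresponds to evaluating the conditions (file freshness, generated contents, final target) and then executing a linearisation of the resulting concrete graph.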
And a final comment about parallelism and randomness. One often wants to control the degree of parallelism during a build run, e.g., by passing an argument that caps the number of parallel jobs to Shake (I believe such parallelism control is called a policy in concurrency theory). When parallelism is limited, there may be several possible executions of a partial order, and this is where randomness comes in handy, as nicely explained in the Shake paper (section 4.3.2). Briefly, different build actions may require different resources (CPU, disk, memory, etc.), and scheduling them deterministically may lead to resource contention (imagine always running all compilers before all linkers). Randomised scheduling of build actions avoids this worst-case scenario and is reported to give up to 20% speed-up in real build systems. I am curious to see what speed-up randomness can achieve in the GHC case. It may be worth researching a clever scheduling strategy that minimises resource contention intelligently (rather than by randomness) by putting build actions into parallel bundles according to their resource requirements. One would then need to annotate build actions with their resource requirements, of course, but this seems not very difficult to do, as there are just a few different types of build actions.
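A resource-aware scheduler along these lines might bundle actions as in the following toy sketch (the action names, the Resource type, and the round-robin policy are all invented for illustration): annotate each action with its dominant resource, then interleave the resource classes so that no parallel wave consists entirely of compilers or entirely of linkers.

```haskell
import qualified Data.Map.Strict as Map

-- A toy sketch of resource-aware bundling: group actions by their
-- dominant resource, then take one action from each class per round.
data Resource = CPU | Disk | Memory deriving (Eq, Ord, Show)

schedule :: [(String, Resource)] -> [String]
schedule actions = interleave (Map.elems buckets)
  where
    -- group actions by resource, preserving their original order
    buckets = Map.fromListWith (flip (++)) [(r, [a]) | (a, r) <- actions]
    -- round-robin: one action from each non-empty class per wave
    interleave xss = case filter (not . null) xss of
      []  -> []
      yss -> map head yss ++ interleave (map tail yss)

main :: IO ()
main = mapM_ putStrLn (schedule [ ("compile A", CPU),  ("compile B", CPU)
                                , ("link A",    Disk), ("link B",    Disk) ])
```

For what it is worth, Shake already offers a related primitive, newResource and withResource, for limiting how many actions may hold a given resource at once; the sketch above goes a step further by using resource annotations to shape the schedule itself rather than merely to throttle it.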