Implementing the Immix Garbage Collection Algorithm on GHC

Abstract

The objective of this project is to implement the Immix garbage collection algorithm onto GHC, the Glasgow Haskell Compiler. I believe that Immix is the current state-of-the-art garbage collector described in the literature, and that GHC would greatly benefit from this more modern algorithm. The immediate benefit would be a boost on the performance of native code produced by GHC, as Immix has been shown to deliver gains of 5-to-27% when compared to more traditional garbage collection algorithms.

Background

Haskell is one of the most popular functional programming languages. It is widely taught in the academia as an example of the functional programming paradigm, and it has seen use in the industry. Haskell is developed as an open source project, enjoying the support of a very active community of developers.

GHC is the de facto standard Haskell compiler^. This is an open source native code compiler, whose project started at the late eighties. GHC has been mostly developed at the University of Glasgow; however, nowadays it contains patches sent by many different developers, spread all over the world. Currently this compiler is distributed together with the Haskell Platform.

Haskell provides developers with managed memory. This means that the running environment shelters the programmers from the difficult and error prone task of manually allocating and freeing memory. In order to deliver this abstraction to developers, Haskell uses a garbage collector (GC)[ Burrel07 ]. Garbage collection is an old field of study in compiler construction, and a good GC implementation is key to performance. In particular, the garbage collection has a great impact on the performance of the executables produced by GHC[ Burrel07 ].

Recently, a new garbage collection algorithm has been published in the Conference on Programming Languages, Design and Implementation (PLDI'08). This new algorithm, so called Immix, was shown to improve on previous state-of-the-art algorithms[ Blackburn08 ]. I believe that Immix is the most modern garbage collection algorithm available in the literature, and that GHC would benefit from one such implementation.

Why a new garbage collector for GHC?

GHC provides the running application with two garbage collection alternatives: copy collection and mark (sweep) compact collection. The choice of the GH is based on the application's relative memory usage: if it is using less than a certain amount of memory, 30%, copying is used^, otherwise GHC uses the mark-sweep approach.

Blackburn and McKinley have shown that the Immix GH outperforms both approaches. In particular, it surpasses the mark-sweep implementation in Java benchmarks, delivering improvements ranging on 5-21%. However, the benefits of Immix have yet to be tested on the running environment of a functional programming language. Contrary to Java, Haskell relies heavily on immutable data. I believe that, as this data tends to have shorter life-times than the mutable counter-part, the garbage collection suffers more pressure from the running program. I speculate that, in this new environment, the Immix garbage collector will be particularly useful, as it nicely handles the high turn over of data.

A brief description of Immix

The Immix garbage collector differs from other algorithms, such as mark-sweep and mark-compact because, contrary to these, Immix divides the memory in regions of the same size, and performs collection on the region level, instead of the object level. Each region, called block, is further divided into same-size regions, called lines. Objects are allocated per line. It is possible that the object will not fill the space in the line completely, yet experiments [ Blackburn08 ] show that this extra space is a small price to pay in exchange of an easier memory layout that must be managed.

The Immix algorithm is divided into three parts: allocation, tracing and reclamation.

  • The allocation module is responsible for granting to the application the space that it needs in memory. This module divides the space of available data into region blocks, and divides each block into lines of finer granularity. The memory allocator assigns small objects to lines, and uses many lines to store bigger objects. There are many policies to find space to objects: best fit, first fit, etc.
  • The tracing module implements an algorithm that marks every data that is reachable from the application roots in the running program. If an object is reachable, then it is called live, otherwise it is called dead.
  • The reclamation module is in charge of returning to the memory allocator those blocks that do not contain any live object, so that they can be reused to store other data. The algorithm goes over the list of blocks that are currently marked as used, performing two actions: first it frees lines that do not contain live objects. Second, it frees blocks that do not contain used lines. This module also does defragmentation, placing together data that is scattered around many partially used memory blocks.

During the implementation of Immix, there will be some aspects of the current implementation of GHC's GC that should be analyzed with special attention. GHC heap objects currently do not contain any space for extra bits in an object header, and the Immix algorithm uses such bits to mark objects as big. GHC already marks objects as big, but in a different place.

Timeline

The GHC allocator already implements an algorithm similar to Immix. For instance, like Immix, it has the concept of blocks. The main difference is that, inside each block, GHC stores whole objects, while Immix divides blocks in lines. Hence, much of this work will be related to adjusting parameters and measuring how much each change affects performance. Therefore, I will proceed as suggested in [ Schilling10 ] and work in an incremental way: doing small changes, testing/debugging each change, measuring its impact on performance. This process will happen for each change that I implement on the GHC garbage collector.

On the first week, from May 24 to May 28, I plan to study the GHC Garbage Collector in detail, using the wiki page and the source code. In this week I'll define precisely what needs to be changed in the next steps.

In June, I will implement the list of free lines, the line sweeping algorithm and the handling of large objects. These are the changes that the GHC garbage collector requires to support allocation in a line-level, an intermediate between handling objects directly, and handling blocks. By the end of June I plan to have these changes working, and I'll start with testing and measuring.

Until July 12, I plan to work on benchmarking these changes and comparing them with the numbers produced by the old Garbage Collector. In the Google mid-term evaluation I should already have the results of this performance evaluation to submit. In the second half of July I plan to test different policies for defragmentation, such as defragmenting the 10% older blocks and separating a headroom of free blocks for defragmentation, as proposed in the original Immix description [ Blackburn08 ].

In the first week of August I will test, debug and measure the performance of the new deframentation algorithm. In the second week of August I plan to write a final document about what I did in the project, and polish the patch to submit to GHC. This document will contain the performance results, which should motivate the patch to be accepted in the repository.

Biography

I am currently a last year undergrad student at Federal University of Minas Gerais, and working on my graduation project under orientation of professor Dr. Carlos Camarão de Figueiredo, a member of the Haskell 2011 Committee, and co-orientation of professor Dr. Fernando Pereira. I started to use Haskell in a research project about Urban Modelling in 2004, and never stopped since then. I contributed with patches to some widely used Haskell projects, such as XMonad and Darcs. I'm a Debian Developer and member of the Debian Haskell Group, maintaining a lot of Haskell libraries, including gtk2hs.

I believe I will be able to successfully work in this project for a number of reasons. First, I've been studying the GHC code recently, and helping Kari Pahula with his maintenance of the ghc6 package in Debian. I've written some of the patches that are applied in it. Second, I have been programming in Haskell for 5 years, and in C for 8 years, and I think I'll not have problems to read GHC code and GHC RTS code. Third, I have contacted Simon Marlow, which is one of the main developers of GHC, about this project, and he seems to be willing to cooperate with the project, since he proposed it in Haskell.org's Summer of Code project list[ Marlow09 ]. Finally, I'm really excited about working on GHC, which is a project to which I have the desire to contribute since I started using Haskell, because of my big interest in compilers. This is something I've been preparing myself for, and it's great to see an opportunity to make it happen.

Contact info

Emailmarcot@debian.org
IRC marcot@irc.oftc.net (#debian-haskell, #debconf, #debian-devel, #debian-br)
marcot@irc.freenode.net (#ghc, #linux-bh)
Jabbermarcot@jabber-br.org
Phone+55 31 33181580
Mobile+55 31 98116720
AddressRua Boaventura, 771/303-E Indaiá 31270-020 Belo Horizonte/MG Brasil

Bibliography

  1. [Blackburn08] Blackburn, S. M, Kathryn S. McKinley. Immix: A Mark-Region Garbage Collector withS pace Efficiency, Fast Collection, and Mutator Performance. ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2008), (Tucson, AZ, USA, June 7-13, 2008)
  1. [Burrel07] Burrel, M. Real-time Haskell? ^
  1. [Schilling09] Schilling, T. Immix garbage collector as GSoC ^
  1. [Marlow09] Marlow, S. Implement the Immix garbage collector in GHC ^
 
marco_soc.txt · Last modified: 2010/04/07 18:02 by marco