The Architecture of Open Source Applications

Introduction

Amy Brown and Greg Wilson

Carpentry is an exacting craft, and people can spend their entire lives learning how to do it well. But carpentry is not architecture: if we step back from pitch boards and miter joints, buildings as a whole must be designed, and doing that is as much an art as it is a craft or science.

Programming is also an exacting craft, and people can spend their entire lives learning how to do it well. But programming is not software architecture. Many programmers spend years thinking about (or wrestling with) larger design issues: Should this application be extensible? If so, should that be done by providing a scripting interface, through some sort of plugin mechanism, or in some other way entirely? What should be done by the client, what should be left to the server, and is "client-server" even a useful way to think about this application? These are not programming questions, any more than where to put the stairs is a question of carpentry.

Building architecture and software architecture have a lot in common, but there is one crucial difference. While architects study thousands of buildings in their training and during their careers, most software developers only ever get to know a handful of large programs well. And more often than not, those are programs they wrote themselves. They never get to see the great programs of history, or read critiques of those programs' designs written by experienced practitioners. As a result, they repeat one another's mistakes rather than building on one another's successes.

This book is our attempt to change that. Each chapter describes the architecture of an open source application: how it is structured, how its parts interact, why it's built that way, and what lessons have been learned that can be applied to other big design problems. The descriptions are written by the people who know the software best, people with years or decades of experience designing and re-designing complex applications. The applications themselves range in scale from simple drawing programs and web-based spreadsheets to compiler toolkits and multi-million line visualization packages. Some are only a few years old, while others are approaching their thirtieth anniversary. What they have in common is that their creators have thought long and hard about their design, and are willing to share those thoughts with you. We hope you enjoy what they have written.

Contributors

Eric P. Allman (Sendmail): Eric Allman is the original author of sendmail, syslog, and trek, and the co-founder of Sendmail, Inc. He has been writing open source software since before it had a name, much less became a "movement". He is a member of the ACM Queue Editorial Review Board and the Cal Performances Board of Trustees. His personal web site is http://www.neophilic.com/~eric.

Keith Bostic (Berkeley DB): Keith was a member of the University of California Berkeley Computer Systems Research Group, where he was the architect of the 2.10BSD release and a principal developer of 4.4BSD and related releases. He received the USENIX Lifetime Achievement Award ("The Flame"), which recognizes singular contributions to the Unix community, as well as a Distinguished Achievement Award from the University of California, Berkeley, for making the 4BSD release open source. Keith was the architect and one of the original developers of Berkeley DB, the open source embedded database system.

Amy Brown (editorial): Amy has a bachelor's degree in Mathematics from the University of Waterloo, and worked in the software industry for ten years. She now writes and edits books, sometimes about software. She lives in Toronto and has two children and a very old cat.

C. Titus Brown (Continuous Integration): Titus has worked in evolutionary modeling, physical meteorology, developmental biology, genomics, and bioinformatics. He is now an Assistant Professor at Michigan State University, where he has expanded his interests into several new areas, including reproducibility and maintainability of scientific software. He is also a member of the Python Software Foundation, and blogs at http://ivory.idyll.org.

Roy Bryant (Snowflock): In 20 years as a software architect and CTO, Roy designed systems including Electronics Workbench (now National Instruments' Multisim) and the Linkwalker Data Pipeline, which won Microsoft's worldwide Winning Customer Award for High-Performance Computing in 2006. After selling his latest startup, he returned to the University of Toronto to do graduate studies in Computer Science with a research focus on virtualization and cloud computing. Most recently, he published his Kaleidoscope extensions to Snowflock at ACM's Eurosys Conference in 2011. His personal web site is http://www.roybryant.net/.

Russell Bryant (Asterisk): Russell is the Engineering Manager for the Open Source Software team at Digium, Inc. He has been a core member of the Asterisk development team since the Fall of 2004. He has since contributed to almost all areas of Asterisk development, from project management to core architectural design and development. He blogs at http://www.russellbryant.net.

Rosangela Canino-Koning (Continuous Integration): After 13 years of slogging in the software industry trenches, Rosangela returned to university to pursue a Ph.D. in Computer Science and Evolutionary Biology at Michigan State University. In her copious spare time, she likes to read, hike, travel, and hack on open source bioinformatics software. She blogs at http://www.voidptr.net.

Francesco Cesarini (Riak): Francesco Cesarini has used Erlang on a daily basis since 1995, having worked in various turnkey projects at Ericsson, including the OTP R1 release. He is the founder of Erlang Solutions and co-author of O'Reilly's Erlang Programming. He currently works as Technical Director at Erlang Solutions, but still finds the time to teach graduates and undergraduates alike at Oxford University in the UK and the IT University of Gothenburg in Sweden.

Robert Chansler (HDFS): Robert is a Senior Manager for Software Development at Yahoo! After graduate studies in distributed systems at Carnegie Mellon University, he worked on compilers (Tartan Labs), printing and imaging systems (Adobe Systems), electronic commerce (Adobe Systems, Impresse), and storage area network management (SanNavigator, McDATA). Returning to distributed systems and HDFS, Rob found many familiar problems, but all of the numbers had two or three more zeros.

James Crook (Audacity): James is a contract software developer based in Dublin, Ireland. Currently he is working on tools for electronics design, though in a previous life he developed bioinformatics software. He has many audacious plans for Audacity, and he hopes some, at least, will see the light of day.

Chris Davis (Graphite): Chris is a software consultant and Google engineer who has been designing and building scalable monitoring and automation tools for over 12 years. Chris originally wrote Graphite in 2006 and has led the open source project ever since. When he's not writing code he enjoys cooking, making music, and doing research. His research interests include knowledge modeling, group theory, information theory, chaos theory, and complex systems.

Juliana Freire (VisTrails): Juliana is an Associate Professor of Computer Science at the University of Utah. Before that, she was a member of technical staff at the Database Systems Research Department at Bell Laboratories (Lucent Technologies) and an Assistant Professor at OGI/OHSU. Her research interests include provenance, scientific data management, information integration, and Web mining. She is a recipient of an NSF CAREER and an IBM Faculty award. Her research has been funded by the National Science Foundation, Department of Energy, National Institutes of Health, IBM, Microsoft and Yahoo!

Berk Geveci (VTK): Berk is the Director of Scientific Computing at Kitware. He is responsible for leading the development effort of ParaView, an award-winning visualization application based on VTK. His research interests include large scale parallel computing, computational dynamics, finite elements and visualization algorithms.

Andy Gross (Riak): Andy Gross is Principal Architect at Basho Technologies, managing the design and development of Basho's Open Source and Enterprise data storage systems. Andy started at Basho in December of 2007 with 10 years of software and distributed systems engineering experience. Prior to Basho, Andy held senior distributed systems engineering positions at Mochi Media, Apple, Inc., and Akamai Technologies.

Bill Hoffman (CMake): Bill is CTO and co-Founder of Kitware, Inc. He is a key developer of the CMake project, and has been working with large C++ systems for over 20 years.

Cay Horstmann (Violet): Cay is a professor of computer science at San Jose State University, but every so often he takes a leave of absence to work in industry or teach in a foreign country. He is the author of many books on programming languages and software design, and the original author of the Violet and GridWorld open-source programs.

Emil Ivov (Jitsi): Emil is the founder and project lead of the Jitsi project (previously SIP Communicator). He is also involved with other initiatives like the ice4j.org and JAIN SIP projects. Emil obtained his Ph.D. from the University of Strasbourg in early 2008, and has been focusing primarily on Jitsi-related activities ever since.

David Koop (VisTrails): David is a Ph.D. candidate in computer science at the University of Utah (finishing in the summer of 2011). His research interests include visualization, provenance, and scientific data management. He is a lead developer of the VisTrails system, and a senior software architect at VisTrails, Inc.

Hairong Kuang (HDFS) is a longtime contributor and committer to the Hadoop project, which she has worked on passionately, currently at Facebook and previously at Yahoo! Prior to working in industry, she was an Assistant Professor at California State Polytechnic University, Pomona. She received a Ph.D. in Computer Science from the University of California at Irvine. Her interests include cloud computing, mobile agents, parallel computing, and distributed systems.

H. Andrés Lagar-Cavilla (Snowflock): Andrés is a software systems researcher who does experimental work on virtualization, operating systems, security, cluster computing, and mobile computing. He has a B.A.Sc. from Argentina, and an M.Sc. and Ph.D. in Computer Science from University of Toronto, and can be found online at http://lagarcavilla.org.

Chris Lattner (LLVM): Chris is a software developer with a diverse range of interests and experiences, particularly in the area of compiler tool chains, operating systems, graphics and image rendering. He is the designer and lead architect of the Open Source LLVM Project. See http://nondot.org/~sabre/ for more about Chris and his projects.

Alan Laudicina (Thousand Parsec): Alan is an M.Sc. student in computer science at Wayne State University, where he studies distributed computing. In his spare time he codes, learns programming languages, and plays poker. You can find more about him at http://alanp.ca/.

Danielle Madeley (Telepathy): Danielle is an Australian software engineer working on Telepathy and other magic for Collabora Ltd. She has bachelor's degrees in electronic engineering and computer science. She also has an extensive collection of plush penguins. She blogs at http://blogs.gnome.org/danni/.

Adam Marcus (NoSQL): Adam is a Ph.D. student focused on the intersection of database systems and social computing at MIT's Computer Science and Artificial Intelligence Lab. His recent work ties traditional database systems to social streams such as Twitter and human computation platforms such as Mechanical Turk. He likes to build usable open source systems from his research prototypes, and prefers tracking open source storage systems to long walks on the beach. He blogs at http://blog.marcua.net.

Kenneth Martin (CMake): Ken is currently Chairman and CFO of Kitware, Inc., a research and development company based in the US. He co-founded Kitware in 1998 and since then has helped grow the company to its current position as a leading R&D provider with clients across many government and commercial sectors.

Aaron Mavrinac (Thousand Parsec): Aaron is a Ph.D. candidate in electrical and computer engineering at the University of Windsor, researching camera networks, computer vision, and robotics. When there is free time, he fills some of it working on Thousand Parsec and other free software, coding in Python and C, and doing too many other things to get good at any of them. His web site is http://www.mavrinac.com.

Kim Moir (Eclipse): Kim works at the IBM Rational Software lab in Ottawa as the Release Engineering lead for the Eclipse and Runtime Equinox projects and is a member of the Eclipse Architecture Council. Her interests lie in build optimization, Equinox and building component-based software. Outside of work she can be found hitting the pavement with her running mates, preparing for the next road race. She blogs at http://relengofthenerds.blogspot.com/.

Dirkjan Ochtman (Mercurial): Dirkjan graduated as a Master in CS in 2010, and has been working at a financial startup for 3 years. When not procrastinating in his free time, he hacks on Mercurial, Python, Gentoo Linux and a Python CouchDB library. He lives in the beautiful city of Amsterdam. His personal web site is http://dirkjan.ochtman.nl/.

Sanjay Radia (HDFS): Sanjay is the architect of the Hadoop project at Yahoo!, and a Hadoop committer and Project Management Committee member at the Apache Software Foundation. Previously he held senior engineering positions at Cassatt, Sun Microsystems and INRIA where he developed software for distributed systems and grid/utility computing infrastructures. Sanjay has a Ph.D. in Computer Science from University of Waterloo, Canada.

Chet Ramey (Bash): Chet has been involved with bash for more than twenty years, the past seventeen as primary developer. He is a longtime employee of Case Western Reserve University in Cleveland, Ohio, from which he received his B.Sc. and M.Sc. degrees. He lives near Cleveland with his family and pets, and can be found online at http://tiswww.cwru.edu/~chet.

Emanuele Santos (VisTrails): Emanuele is a research scientist at the University of Utah. Her research interests include scientific data management, visualization, and provenance. She received her Ph.D. in Computing from the University of Utah in 2010. She is also a lead developer of the VisTrails system.

Carlos Scheidegger (VisTrails): Carlos has a Ph.D. in Computing from the University of Utah, and is now a researcher at AT&T Labs–Research. Carlos has won best paper awards at IEEE Visualization in 2007, and Shape Modeling International in 2008. His research interests include data visualization and analysis, geometry processing and computer graphics.

Will Schroeder (VTK): Will is President and co-Founder of Kitware, Inc. He is a computational scientist by training and has been one of the key developers of VTK. He enjoys writing beautiful code, especially when it involves computational geometry or graphics.

Margo Seltzer (Berkeley DB): Margo is the Herchel Smith Professor of Computer Science at Harvard's School of Engineering and Applied Sciences and an Architect at Oracle Corporation. She was one of the principal designers of Berkeley DB and a co-founder of Sleepycat Software. Her research interests are in filesystems, database systems, transactional systems, and medical data mining. Her professional life is online at http://www.eecs.harvard.edu/~margo, and she blogs at http://mis-misinformation.blogspot.com/.

Justin Sheehy (Riak): Justin is the CTO of Basho Technologies, the company behind the creation of Webmachine and Riak. Most recently before Basho, he was a principal scientist at the MITRE Corporation and a senior architect for systems infrastructure at Akamai. At both of those companies he focused on multiple aspects of robust distributed systems, including scheduling algorithms, language-based formal models, and resilience.

Richard Shimooka (Battle for Wesnoth): Richard is a Research Associate at Queen's University's Defence Management Studies Program in Kingston, Ontario. He is also a Deputy Administrator and Secretary for the Battle for Wesnoth. Richard has written several works examining the organizational cultures of social groups, ranging from governments to open source projects.

Konstantin V. Shvachko (HDFS), a veteran HDFS developer, is a principal Hadoop architect at eBay. Konstantin specializes in efficient data structures and algorithms for large-scale distributed storage systems. He discovered a new type of balanced trees, S-trees, for optimal indexing of unstructured data, and was a primary developer of an S-tree-based Linux filesystem, treeFS, a prototype of reiserFS. Konstantin holds a Ph.D. in computer science from Moscow State University, Russia. He is also a member of the Project Management Committee for Apache Hadoop.

Claudio Silva (VisTrails): Claudio is a full professor of computer science at the University of Utah. His research interests are in visualization, geometric computing, computer graphics, and scientific data management. He received his Ph.D. in computer science from the State University of New York at Stony Brook in 1996. Later in 2011, he will be joining the Polytechnic Institute of New York University as a full professor of computer science and engineering.

Suresh Srinivas (HDFS): Suresh works on HDFS as a software architect at Yahoo! He is a Hadoop committer and PMC member at Apache Software Foundation. Prior to Yahoo!, he worked at Sylantro Systems, developing scalable infrastructure for hosted communication services. Suresh has a bachelor's degree in Electronics and Communication from National Institute of Technology Karnataka, India.

Simon Stewart (Selenium): Simon lives in London and works as a Software Engineer in Test at Google. He is a core contributor to the Selenium project, was the creator of WebDriver and is enthusiastic about open source. Simon enjoys beer and writing better software, sometimes at the same time. His personal home page is http://www.pubbitch.org/.

Audrey Tang (SocialCalc): Audrey is a self-educated programmer and translator based in Taiwan. She currently works at Socialtext, where her job title is "Untitled Page", as well as at Apple as a contractor for localization and release engineering. She previously designed and led the Pugs project, the first working Perl 6 implementation; she has also served in language design committees for Haskell, Perl 5, and Perl 6, and has made numerous contributions to CPAN and Hackage. She blogs at http://pugs.blogs.com/audreyt/.

Huy T. Vo (VisTrails): Huy is receiving his Ph.D. from the University of Utah in May 2011. His research interests include visualization, dataflow architecture and scientific data management. He is a senior developer at VisTrails, Inc. He also holds a Research Assistant Professor appointment with the Polytechnic Institute of New York University.

David White (Battle for Wesnoth): David is the founder and lead developer of Battle for Wesnoth. David has been involved with several Open Source video game projects, including Frogatto, which he also co-founded. David is a performance engineer at Sabre Holdings, a leader in travel technology.

Greg Wilson (editorial): Greg has worked over the past 25 years in high-performance scientific computing, data visualization, and computer security, and is the author or editor of several computing books (including the 2008 Jolt Award winner Beautiful Code) and two books for children. Greg received a Ph.D. in Computer Science from the University of Edinburgh in 1993. He blogs at http://third-bit.com and http://software-carpentry.org.

Tarek Ziadé (Python Packaging): Tarek lives in Burgundy, France. He's a Senior Software Engineer at Mozilla, building servers in Python. In his spare time, he leads the packaging effort in Python.

Acknowledgments

We would like to thank our reviewers:

Eric Aderhold, Muhammad Ali, Lillian Angel,
Robert Beghian, Taavi Burns, Luis Pedro Coelho,
David Cooper, Mauricio de Simone, Jonathan Deber,
Patrick Dubroy, Igor Foox, Alecia Fowler,
Marcus Hanwell, Johan Harjono, Vivek Lakshmanan,
Greg Lapouchnian, Laurie MacDougall Sookraj, Josh McCarthy,
Jason Montojo, Colin Morris, Christian Muise,
Victor Ng, Nikita Pchelin, Andrew Petersen,
Andrey Petrov, Tom Plaskon, Pascal Rapicault,
Todd Ritchie, Samar Sabie, Misa Sakamoto,
David Scannell, Clara Severino, Tim Smith,
Kyle Spaans, Sana Tapal, Tony Targonski,
Miles Thibault, David Wright, Tina Yee

We would also like to thank Jackie Carter, who helped with the early stages of editing.

The cover image is a photograph by Peter Dutton of the 48 Free Street Mural by Chris Denison in Portland, Maine. The photograph is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 2.0 Generic license.

Dedication

Dedicated to Brian Kernighan,
who has taught us all so much;
and to prisoners of conscience everywhere.