Thursday, May 24, 2007

Nerd Food: On Maintenance

The many years I've spent working in the bespoke industry and using free software have finally made me understand the obvious: the single most important aspect in the whole of software development is maintenance. Yes, you heard it right. It's not the language, not the platform, not the methodologies, not the technologies involved, not even the pretty Gantt charts. All these tools are important, of course, but if one looks at the entire lifespan of a program, maintenance overshadows every other aspect by a wide margin. You may think I'm not saying anything new here, and with good reason. Classic texts like Bertrand Meyer's Object-Oriented Software Construction already pointed out that the highest cost in a software project is maintenance; Meyer was not, by far, the first to pick up on this. The problem was not with their diagnosis but rather with the cure they proposed. Allow me to expand on this.

The first thing one must realise is that the code is itself the only complete system specification there is. I'm not going to spend much time explaining this view of the world since I cannot possibly improve upon Jack Reeves' "What is Software Design?". Any experienced developer knows that the only way to really understand how a system works is by looking at the source. Let's face it: in the real world, manuals don't exist. Comments are sketchy and, more often than not, totally wrong. You may get developers to write good documentation in the early stages, but in all these years I'm yet to see a large five-year-old project properly documented. The only thing you can always rely on, the only thing that truly documents the behaviour of a program, is its source code. I know, I know, you'll bring up Knuth and literate programming. Unfortunately, I have no option but to checkmate you with real-world experience. The sad truth is, most people don't even know about Knuth. While Doxygen et al. are nice and make documenting much easier, very few people bother making sure the text matches the source when they are on a tight deadline, and the life of a bespoke developer is nothing but one tight deadline after another, ad infinitum. You can imagine your project manager's face when you explain that the deadline won't be met because you still need to finish off the commenting.
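To make the point concrete, here's a contrived little sketch (the function, the class and the figures are all invented): the docstring still describes the behaviour of three years ago, while the code has quietly moved on. Only the source tells you what actually happens.

    class Customer:
        def __init__(self, lifetime_value):
            self.lifetime_value = lifetime_value

    def apply_discount(price, customer):
        """Apply the standard 5% discount for premium customers."""
        # The docstring above was accurate once. The code has since grown
        # tiered discounts and VAT handling, but nobody updated the text;
        # only the code below documents what really happens.
        rate = 0.10 if customer.lifetime_value > 10000 else 0.05
        return price * (1 - rate) * 1.20  # VAT bolted on in a later change

    # Prints 108.0, not the 95.0 the docstring implies.
    print(apply_discount(100.0, Customer(20000)))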

Speaking in very empirical terms, most projects seem to have an average lifespan of around seven to ten years, with the caveat that the final stage can drag on for a very long time. The first two or three years are all about adding large amounts of new features, cramming in as much as possible in the shortest possible time. During this period, lurking in the shadows, there is a steady increase in complexity. If these things were easily quantifiable, I'd expect the data to display a high correlation between the number of added features and the increase in complexity (i.e. each feature dramatically raises the complexity bar). Thus, adding each feature (and fixing each bug) starts taking longer and longer over time. At some point the project reaches the "complexity barrier": the point at which adding new features (or fixing existing bugs) is so expensive that it's cheaper to create a new product from scratch, one which addresses all the "scalability" issues that the current system cannot. At this point the code-base is kept in "life-support" mode, with a bare minimum number of developers working on it to keep existing customers happy, but unable to make any fundamental changes to the project's core. If any new major features are required, they are implemented as extremely complex workarounds over existing architectural deficiencies. Eventually, the next-generation system leaves the sheltered greenhouse and is ready for deployment. Customers are moved over with varying degrees of grumbling, but with little choice in the matter. This describes pretty much every project I have worked on over the last decade, some of them in different stages, of course, but all of them tracing the exact same arc. First, let's make this clear: this methodology works. Companies are making ridiculous amounts of money by religiously following it, and at the end of the day, from a financial perspective, all that matters is the bottom line. However, this can't be The Right Way from an engineering perspective. I'm afraid you'll need your engineering hat on for the remainder of this article.

Let's step back for a second and reflect. Why do we throw away code-bases in the bespoke market so readily, when both commercial and open source shops do it a lot less frequently? It's all to do with the development process. The truth is, bespoke projects die _by design_; their environment is so entropic and hostile that they cannot but die. Software development changed fundamentally when the day-to-day running of a project was taken from the hands of programmers and handed over to professionals. In time, project management became a science in itself, complete with its own language of Gantt charts, milestones and deliverables. The entire development ecosystem in which we now live is geared towards delivering more and more features in ever smaller timescales by people who have less and less technical ability - i.e. people who think at ever higher levels of abstraction. The first victim in this quest for "time to market" is the code-base. When a developer is asked to implement a new feature, the key question asked by a good project manager is: can you "reuse" some of the existing infrastructure to do this? The project manager may not even know what reuse means technically, but he knows that "with reuse" the estimates are much lower than "without reuse". So "reuse" is good; writing from scratch or re-engineering is bad, really bad. The developer will most likely explain that the existing infrastructure was not designed with the new feature in mind, and so, given the current timescales, there is no option but to bend the code-base out of shape to shoehorn the functionality in (also known as a kludge). To the ears of a good project manager this translates to "yes, we can reuse the existing infrastructure; we'll sort the mess out later". Alas, later never comes. Eventually, after years of kludges to deliver features, the code-base becomes so unmaintainable - so complex - that it is cheaper to write a new system from scratch than to maintain the existing one. The complexity barrier has been reached, the dreaded point of no return.

What the project manager fails to grasp - or does not want to grasp - is that the code-base is itself a repository of knowledge of sorts: the summary of the experience of a large group of developers attempting, over a long period of time, to tame a given problem domain. To make an extreme analogy, this is akin to someone taking every single copy of every volume of The Art of Computer Programming, writing a few sketchy notes about it in fifty or so pages and then burning the books, happily thinking that all the important detail has been captured. You'd think that most software houses would understand the importance of the code-base as an asset; after all, ask to take a copy of the code home and you'll have the police breathing down your neck in seconds. However, this sort of behaviour is a bit like the attitude of the peasant who keeps his money under the mattress, not really knowing what it's worth but thinking that it must be really important. Companies don't really understand the value of the code-base. If they did, they would take _really_ good care of it. Instead, they treat it like any other perishable resource, a computer or a car, a trite commodity spewed out of a production line of developer drones. The decommissioning of a software system should always be seen as an immense tragedy, a great loss of knowledge. Management is simply not able to comprehend the amount of detail contained in a code-base, detail that cannot be transposed to a new system and will have to be rediscovered. The problem is, an existing code-base hasn't got an easily computable dollar value - man-years are a very bad way of estimating effort, and it is no easier to estimate the cost of a yet-to-be-developed system - so we're all in the dark. (Not being an expert, I'm not going to try to propose ways of valuing an existing code-base, but whatever methodology one comes up with is bound to produce some astonishingly high figures.)

Unconvinced, you may ask: what is so wrong with starting new projects? After all, many lessons can be learned, new technologies can be used, and the end result will be a faster, more featureful, more maintainable system. Before everyone starts chanting "oh, you luddite in disguise", it's important to bear in mind the following:

  • The failure rate for new projects is extremely high;
  • Incremental changes have a lower risk, whereas big changes are always highly risky;
  • It's much easier to estimate costs and timescales on an existing project which has been running for years than on a new one, for which baselines are yet to be created;
  • The second system effect forces architects and developers to create new projects that aim to boil the ocean and use every other new technology, adding even more variables to an already complex problem;
  • New systems introduce a host of new bugs; you're basically trading an existing set of bugs - either known, or unknown but also not known to seriously impact production - for an unknown (but almost always large) quantity;
  • You'll need to find new people or retrain existing people for the new skills required - particularly on the developer side, but quite often on the user side too;
  • Your system and component requirements will almost always miss vital features or important little bits of detail and ignore many of the lessons already learned simply because the latest crop of developers writing the specs is not aware of them;
  • Your project planning will almost always underestimate the complexity of implementing some or all of the existing features;
  • Your project managers may be excellent at managing an existing system but totally inexperienced at managing at this huge scale of uncertainty;
  • Your developers may be excellent maintainers of an aging code-base but terrible greenhouse developers, getting continuously lost in blue-skies approaches;
  • The architecture of the existing system may not transpose very well to the new technologies your developers insist on using, limiting reuse even at this fundamental level.

As you can see, replacing an established system is rather like gambling a million dollars in a casino over a few months. If you do win, you'll make a fortune - but the odds are heavily stacked against you. For all the reasons above - and probably many more which I failed to uncover - it is vital to try to keep an existing code-base running healthily, avoiding the complexity barrier at all costs. In order to do so, one must maintain the system properly. This entails:

  • Removing functionality which is no longer necessary, thereby reducing complexity;
  • Looking for opportunities to refactor existing code into separate modules, and replace existing modules with open source libraries if suitable ones exist;
  • Tracking and fixing _all_ reported bugs;
  • Ensuring the code compiles with no warnings at the maximum warning level;
  • Refactoring code when implementing new functionality that does not fit the existing infrastructure;
  • Continuously measuring system performance, ensuring it does not degrade over time;
  • Ensuring consistency with existing standards and conventions, avoiding in-house protocols;
  • Improving readability of existing code;
  • Regression testing the code-base after changes (see the sketch after this list);
  • Striving for platform independence;
  • Making continuous releases after changes to ensure there isn't a feature pile-up; in other words, release early, release often.
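On the regression testing point, even a very thin harness beats none at all. Here is a minimal sketch using Python's standard unittest module; the function under test and the bug number are invented for illustration:

    import unittest

    # Hypothetical function under maintenance; in a real code-base it
    # would be imported from the module you have just changed.
    def normalise_account_id(raw):
        """Strip whitespace and leading zeros from a raw account id."""
        return raw.strip().lstrip("0") or "0"

    class RegressionTests(unittest.TestCase):
        # Every fixed bug earns a test, so the kludge-of-the-day cannot
        # silently reintroduce it.
        def test_strips_whitespace(self):
            self.assertEqual(normalise_account_id("  42 "), "42")

        def test_all_zeros_does_not_vanish(self):
            # Hypothetical bug #1234: "000" used to normalise to "".
            self.assertEqual(normalise_account_id("000"), "0")

    if __name__ == "__main__":
        unittest.main()

Run it after every change; the point is not coverage perfection but making sure yesterday's fixes survive today's kludges.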

Yep, you've noticed it. These are all obvious tasks, pretty much the standard you'd expect from an average free software maintainer. Unfortunately, for the reasons outlined above, these tasks are rarely present in bespoke software houses' project plans. You may find that some commercial off-the-shelf shops actually take maintenance seriously, but most bespoke houses just can't afford to spend the required time on it. IMHO, herein lies the key, the biggest change needed: project managers have to start allocating slots for maintenance. They have to treat maintenance work as they treat enhancements, allocating adequate resources for it, asking developers to keep an up-to-date list of the top issues in the code-base, and making sure these are addressed.

The intrepid reader may reasonably ask: but what if the system is designed in such a way that a large new feature just cannot be implemented within its framework? To that I must counter that _no_ feature is so complex as to be unimplementable in _any_ existing system that has been well maintained. This is a fallacy in which I believed for many years but which, I think, has been comprehensively disproved by projects such as the linux kernel, GTK and Qt. Take the kernel. If a system that was designed to run only on x86, with no virtual memory, minimal driver support, minimal filesystem support and all sorts of other constraints, can be made to do what linux does today, then any project can do the same. I mean, the v2.6.x kernel has excellent portability, large SMP scalability, close-to-real-time scheduling, and many, many more features that look all but impossible from a v0.0.1 perspective. Linus feels quite strongly about the fact that the kernel was not originally designed to do any of these things, but _evolved_ solutions to these problems over time, and in many cases these solutions work extremely well. The question is not whether it is possible, but rather how much effort is required to get there. And any discussion about resource allocation must always take into account the huge benefits of keeping the same code-base.

The other important aspect of maintenance is code-base reduction, mentioned in the first two points above. Code-base reduction may appear counter-intuitive at first sight; after all, new features must surely require adding code. However, the best way to look at this is from a resource allocation perspective. There is a finite number of developers, call it d, working on a code-base of a given size, say s. Let's call c the ratio between the two, c = s/d. I have always dreamed of coming up with a law, and here, finally, is my chance: Craveiro's law states that the higher c is, the harder it is to maintain a code-base. Of course, this is a highly empirical law, but useful nonetheless. Now, there are two very straightforward ways of reducing c: either increase the number of developers until you run into Brooks' law, or decrease the size of the code-base until you start impacting required features. The latter is more interesting, very much reminiscent of Saint-Exupéry: a designer knows he has achieved perfection not when there is nothing left to add, but when there is nothing left to take away.
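To make the arithmetic concrete, a trivial sketch (the figures are invented for illustration):

    def craveiros_c(size_kloc, developers):
        """Craveiro's law ratio: c = s / d, in KLOC per developer."""
        return size_kloc / developers

    # The same hypothetical code-base under three scenarios.
    print(craveiros_c(500, 5))   # c = 100.0: painful to maintain
    print(craveiros_c(300, 5))   # c = 60.0: 200 KLOC stripped out, same team
    print(craveiros_c(500, 10))  # c = 50.0: more developers, until Brooks' law bites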

Since you can't literally start removing required functionality, the next best thing is to find other people who are willing to share the maintenance burden with you, reducing the individual maintenance cost (if not the overall cost). This is routinely done in open source projects, and it is incredibly successful. Basically, you want your developers to aggressively look at parts of your code-base which offer no discernible competitive advantage; once located, these are stripped out of the system and added to your company's portfolio of open source components. These have an important strategic value and should be managed very carefully (a community needs to be developed around them, the maintainer must listen to the community, etc.). The end result should be a significant reduction in your core code-base size.
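As a toy illustration of the extraction step (everything here - the names, the functions, the domain - is invented): a generic parsing routine buried inside the application is pulled out behind a small, documented interface, ready to live as a standalone open source component, while the core keeps only the genuinely proprietary part.

    # Candidate for extraction: nothing here is specific to the business
    # domain, so it could be published as a standalone component.
    def parse_delimited(text, delimiter=","):
        """Split delimited text into rows of fields, skipping blank lines."""
        return [line.split(delimiter) for line in text.splitlines() if line]

    # The application retains only the proprietary logic on top of it.
    def load_tariffs(text):
        rows = parse_delimited(text)
        return dict((name, float(price)) for name, price in rows)

    print(load_tariffs("peak,0.15\noffpeak,0.07"))  # {'peak': 0.15, 'offpeak': 0.07}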

I'll leave you with a couple of interesting corollaries from Craveiro's law:

  • Like many other "new" technologies, OOP by itself neither helps nor hinders the maintenance problem. Regardless of how elegantly your system is designed and implemented, if you are not maintaining it properly it will die. Conversely, a system written in perl that is actively and adequately maintained may prove extremely resilient to time. _However_, choosing a popular language will have an important indirect impact on maintenance because it defines the size of the developer pool you can tap.
  • Java and C# are incredibly useful programming languages, but not for the reasons you might expect: that is, not because of garbage collection, nice syntactic sugar, improved security or the VM. The one key element that distinguishes them from most other languages is their extensive and standardised class libraries, readily supplemented by huge amounts of open and closed source components. These reduce the footprint of your code-base dramatically. Why are these languages better than, say, Delphi or RogueWave's extensions to C++? Because they ensure vendor independence by standardising most of their interfaces.


VITORIA!!! VITOOORIA!!!

Incredible. After an exhausting season, Vitoria finally managed to stamp its ticket and remain in the superliga (the Portuguese equivalent of the English premier league). It's hard to believe we made it, since it was one of those rather improbable events: two teams had to lose or draw and we had to win away. One of those teams, Beira Mar, had a cushy job since they were playing at home against a reasonable but beatable Pacos de Ferreira. In the end, after much, much suffering, the boys managed to bring it home. There were hundreds of faithful Vitorians who made the trip all the way up north to cheer the team on, and I don't think we would have done it without them. The victory was so huge that it even caused some fans to slightly lose track of reality over at the forums, so let me just clarify this: no, we're not the champions; we just managed to stay in for yet another season of pain and excitement.

Vitoria does make things hard for itself. It was yet another of those soap opera seasons, with managers getting fired, money gone missing, wages unpaid, players threatening to leave, players leaving, new Presidents, new financial partners that never quite materialise, the whole nine yards. A pretty standard season as far as Vitoria is concerned but hell on earth for any other club I can think of. No wonder Carlos Cardoso does not want to take up the job next season. I secretly believe that he spends the first few months of the season recovering from the previous season and getting mentally ready to take over as caretaker manager... Let's see what the new boy is all about...