Monday, February 08, 2016

Nerd Food: Interesting…

Time to flush all those tabs again. Some interesting stuff I bumped into recently-ish.

Finance, Economics, Politics

Startups et al.

General Coding

Databases

  • Databases - the Long View: good presentation on databases and Postgres in particular, giving you a perspective of how things changed over time.

C++

  • Crow: New find. Simple library to write web services in C++. If you need to quickly expose some code as a web service, this may be easier than using Casablanca; see the sketch below.
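
For a flavour of the API, a minimal Crow service looks something along these lines - a sketch based on the project's examples rather than tested code, so treat the header name and port as assumptions:

#include "crow.h"

int main()
{
    crow::SimpleApp app;

    // expose a single endpoint returning plain text.
    CROW_ROUTE(app, "/hello")([]() {
        return "Hello, world!";
    });

    // listen on port 18080, using multiple threads.
    app.port(18080).multithreaded().run();
}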

Layperson Science

Other


Nerd Food: Tooling in Computational Neuroscience - Part III: Data

In God we trust; all others must bring data. -- W. Edwards Deming

Welcome to yet another instalment in our series of posts about tooling in Computational Neuroscience. Previously, we have discussed simulators - a popular one, in particular - and microscopes. We shall now talk about data in Computational Neuroscience, a seemingly broad and somewhat mundane topic but one which is central to any attempt in understanding the status quo of the discipline. The target audience remains as it was - the lay person - but I'm afraid things are getting increasingly technical.

More Data! We Need More Data!

Computational Neuroscience by itself is not particularly interesting if there are no inputs to the models we carefully craft, nor detailed outputs to tell us what the models are doing. Similarly, one needs to be able to use experimental data to inform modeling choices and to baseline expectations; if this data is not available, one cannot tell how close or how far models are from the real thing. As everywhere else, data is of crucial importance here; we need lots of it and of many different kinds.

Once you need data, you soon need to worry about data representation: how should information be encoded? Clearly, in order for the data to be useful in a general sense, it must be accompanied by a formal or informal specification or else users will not know how to interpret it. Furthermore, given the highly technical nature of the data in question, the specification must be very precise or the data becomes useless or even dangerous; "Was that in microns or nanometres?" is not the sort of question you want to be asking. In a world where producers and consumers of data can be anywhere geographically, the specification assumes an ever larger degree of importance.

In summary, it is just not practical to allow everyone to come up with their own data formats:

  • writing a clear and concise specification for data interchange is hard work, and requires a lot of experience in both the domain and the specification process in general. The first attempts would probably prove to be incomplete, inconsistent or impractical.
  • writing code to read and write files according to a specification and in multiple programming languages is also demanding engineering work.
  • writing code to convert from one data specification to another is even more complicated because it requires intimate knowledge of both.
  • some data is generated directly by hardware, making it impractical to adapt to different requirements.

Another aspect worth highlighting is the "big data" nature of a lot of the data sets used in this field. Anything to do with the brain gets pretty complex pretty quickly, and this manifests itself in the data dimension by having ever larger data sets with greater levels of detail. On the plus side, thanks to Moore's Law sigmoid, detailed information at all levels is allowing us to answer questions that were unanswerable not so long ago. The flip side is that all those details come at a cost: the data sets are becoming huge. For example, the resolution of the data coming out of microscopy is now so high that a single data set can take as much as 500 TB. And of course, not only are individual data sets getting larger and larger, but we are able to generate more of them at an ever increasing pace because the processes are more streamlined. It is a fire-hose of data.

All of these difficulties are not unique to Computational Neuroscience or even to Neuroscience as a whole, but the complexity of the domain has the effect of greatly exacerbating an already thorny problem.

Neuroinformatics to the Rescue

If you think we're exaggerating then think again. The management of data in Neuroscience is so complex it is a field in its own right, with the cool-sounding name of Neuroinformatics. Wikipedia tells us that:

Neuroinformatics is a research field concerned with the organization of neuroscience data by the application of computational models and analytical tools. These areas of research are important for the integration and analysis of increasingly large-volume, high-dimensional, and fine-grain experimental data. Neuroinformaticians provide computational tools, mathematical models, and create interoperable databases for clinicians and research scientists.

In layman's terms, Neuroinformatics concerns itself with Neuroscience data and the places where said data is to be stored. It is also implied that one has to deal with a variety of types of data, e.g.: data from experiments (of which there can be many kinds), model inputs, model outputs, the models themselves when viewed as data, etc. The classification of this data is in itself a Neuroinformatics task. Finally, Neuroinformatics is also responsible for the tooling necessary to acquire the data, manipulate it, analyse it, visualise it and so on. Given such a broad definition, one is forced to conclude that there is a big overlap between Computational Neuroscience - the modeling activity - and Neuroinformatics - the management of the data required by it. This lack of clarity is common in science, particularly as new fields develop; take for example Mathematics and Computer Science at the latter's inception.

In truth, such definitions and demarcations are only as useful as the tangible benefits they provide. It is perhaps more fruitful to think of Neuroinformatics as a hat you don as and when your Computational Neuroscience work requires it; the definition is there to make one aware of the separation between the analytic work of modeling and the data storage and retrieval work. For the purposes of this article, we'll continue to refer to the "Neuroinformatics Scientist" and "Computational Neuroscientist" personas, but bear in mind they may resolve to the same person in practice.1

Before we move on, I'd like to point out another interesting challenge Neuroinformatics has to address, one that is common to all Medical Sciences: the need to handle human-derived data very carefully. After all, making data sets widely available must not have implications for the original patients, so it's often a requirement that the data is de-identified; where the data is patient-sensitive, additional requirements may be imposed on users of the data to avoid leaking this information, such as requiring registration. This illustrates the peculiar nature of Neuroinformatics, with the constant tension between making data as widely available as possible whilst ensuring there are no side-effects of doing so. Presumably, Primum non nocere - first, do no harm.

Databases, Repositories and Archives

Thanks to the efforts of Neuroinformatics, there is now a wealth of Neuroscience data available to all on the Internet. The roots of this growth were sown in the nineties when labs started sharing research results online. Sharing always existed in one way or another, of course, but the rise of the Internet simply changed the magnitude of the process. It soon became apparent that there was a need to organise central repositories of data, and to ensure the consistency of the shared data. Papers with a distinct Neuroinformatics tone were written, such as An on-line archive of reconstructed hippocampal neurons (1999). Repositories grew, multiplied, morphed and in many cases died, as these things do, and the evolutionary process left us with the survivors. I'd like to highlight some of the ones I have bumped into so far (with descriptions in their own words):

  • ModelDB: "ModelDB provides an accessible location for storing and efficiently retrieving computational neuroscience models. ModelDB is tightly coupled with NeuronDB. Models can be coded in any language for any environment. Model code can be viewed before downloading and browsers can be set to auto-launch the models."
  • NeuronDB: "NeuronDB provides a dynamically searchable database of three types of neuronal properties: voltage gated conductances, neurotransmitter receptors, and neurotransmitter substances. It contains tools that provide for integration of these properties in a given type of neuron and compartment, and for comparison of properties across different types of neurons and compartments."
  • NeuroMorpho: "NeuroMorpho.Org is a centrally curated inventory of digitally reconstructed neurons associated with peer-reviewed publications. It contains contributions from over 100 laboratories worldwide and is continuously updated as new morphological reconstructions are collected, published, and shared. To date, NeuroMorpho.Org is the largest collection of publicly accessible 3D neuronal reconstructions and associated metadata."
  • Functional Connectomes Project: "Following the precedent of full unrestricted data sharing, which has become the norm in molecular genetics, the FCP entailed the aggregation and public release (via www.nitrc.org) of over 1200 resting state fMRI (R-fMRI) datasets collected from 33 sites around the world."
  • OpenfMRI: "[…] project dedicated to the free and open sharing of functional magnetic resonance imaging (fMRI) datasets, including raw data."
  • Open Source Brain: "resource for sharing and collaboratively developing computational models of neural systems."

As you can see from this small list - rather incomplete, I'm sure - there is a wealth of information out there, covering all sorts of aspects of the brain. We never had so much data as we do today. And, in many ways, this is fast becoming a problem. As an example, data from Neuroscience's plethora of divisions and sub-fields is not designed to interoperate: Electron Microscopy (EM) data is disconnected from data obtained by Magnetic Resonance Imaging (MRI), which is also totally separate from connectome information2 and so forth. In many cases, these sub-fields have evolved along fairly separate paths, and developed their own technical vocabulary in isolation and over long periods of time - an approach perfectly suitable for a "disconnected" world but less than ideal for one where multiple sources of data are required to make sense of complex phenomena. If one can't even agree on what to call things, how can one hope to explain them?

Thus, the early Neuroinformatics approach is best described as "evolutionary". It is not as if someone sat down and generated a well defined set of file formats for data interchange, covering all different aspects of the areas under study. Instead, what has been emerging is a multitude of file formats in each sub-field, all calling out for attention, and all of them designed for the immediate goal at hand rather than the greater good of Neuroscience.

Taming the Sea of Data

From a Software Engineering perspective, an evolutionary approach makes perfect sense; after all, the Real Programmers had said: "first make it work, then make it right, and, finally, make it fast." In many ways, we are reaching the "make it right" phase, with an increasing interest in efforts towards the creation of broad standards. There have been several papers and initiatives on the subject, such as the Neuroscience Information Framework, or NIF, described in a paper: The Neuroscience Information Framework: A Data and Knowledge Environment for Neuroscience. The paper outlined a lot of the problems that are hampering research, such as:

  • the need for specialised search engines that are domain aware, and advanced query tools too;
  • the need to aid integration and to provide connectivity across related data and findings;
  • a requirement to provide new and enhanced forms of analysing existing data, as data reuse is extremely important - new insights can be obtained on already existing data, often long after the data was generated, and by using it in ways that were not at all envisioned by the original authors;
  • the need to make contribution to online repositories easier; lowering the "contribution barrier" is important to increase data availability but must be done in ways that do not compromise the quality of the data;
  • a requirement to make all code open source such that any lab can make use of it, and the community as a whole can share the maintenance load;
  • a need for an online repository for all tooling, to avoid reinventing the wheel;
  • the need to create a multi-domain standard vocabulary.

There are many worthwhile points in this paper, and it is highly recommended to anyone interested in the subject matter. For instance, the section discussing the design of the NIF also covers the requirements for any specification that wishes to solve the problems outlined above. They are worth highlighting as - in my humble and lay opinion - they are very well thought out.

  • The design of such a framework must combine technical specifications choices and broad community support; "open data, access and exchange, via open source and platform, aid Framework-enabled open discover for Neuroscience."
  • A common framework would reduce costs and enhance benefits of data sharing and knowledge sharing; it would "reduce the cost/benefit ratio for data acquisition and utilization."
  • The framework must be designed by the broader community and with the needs of this broader community in mind, and it must build upon prior development in Neuroinformatics.
  • A focus on interoperability is crucial, and it is not a static target but one that must be looked after over time. In addition, there is also a need to keep in mind that different resources have very different interoperability potential. In order to maximise interoperability, we should aim to standardise as much as possible all aspects of the process such as user interfaces, terminologies, formats, etc.

To the untrained eye, the NIF initiative appears to be a great effort to solve fundamental problems in the field. It also seems to have spawned and/or helped popularise many useful and lasting resources such as NeuroMorpho. However, the impression one gets from the outside is that the NIF didn't quite fulfil all of its potential. Having said that, I am keenly looking for up-to-date documents that describe the current status across all of its many aspects - alas, I have not yet succeeded in finding any such document. If indeed it is the case that the initiative petered out, it did highlight a few potential problems for anyone working in this space:

  • large undertakings are hard to pull off; small, organic, incremental changes are easier to do, but of course, that is why we have the problems we currently have.
  • large initiatives require large amounts of funding; work is technical and very expensive.
  • it is not easy to understand NIF's deliverables from looking at their documentation and website. One can clearly see it was an ambitious project, and one which took on the brunt of the problem areas highlighted above, but perhaps it needed a slightly more self-contained view of its achievements rather than an all-or-nothing approach. That would have allowed some components to be preserved even whilst others failed to gain traction.

XML Strikes Back

Another interesting attempt to tackle these problems is what I call the "XML suite". These are basically a set of different XML-based standards that are able to interoperate and augment each other, a bit like a stack of building blocks. You can find more details in this paper: XML for Model Specification in Neuroscience. Some of the components of the XML Suite are (with descriptions in their own words, copied from the above paper, and a link for more details):

  • LEMS: "the Low Entropy Model Specification […] is being developed to provide a compact, minimally redundant, human-readable, human-writable, declarative way of expressing models of biological systems. It differs from other systems such as CellML or SBML in its requirement to be human writable and the inclusion of basic physical concepts such as dimensionality and physical nesting as part of the language."
  • NeuroML: "supports the use of declarative model specifications for neuroscience modeling efforts at different scales, from intracellular mechanisms to networks of reconstructed neurons."
  • MorphML: "provides a common format for exchange of neuronal morphology data. It can also be used to specify cell structure for modeling efforts as part of NeuroML."
  • BrainML: "application for representing time series data, spike trains, experimental protocols, and other data relevant to neurophysiology experiments."
  • SBML: "(Systems Biology Markup Language) is an application for specifying models of biochemical reaction networks such as metabolic networks, cell-signaling pathways and gene regulatory networks."
  • CellML: "is designed for the specification of biological models of cellular and sub-cellular processes such as calcium dynamics, metabolic pathways, signal transduction, and electrophysiology."
  • MathML: "provides the means for describing the structure and content of mathematical notation in order to serve, receive, and process mathematics on the web. Other XML applications often use MathML language elements for representing mathematical equations."

A positive aspect of the XML Suite is its "discrete" nature. Each of these file formats is free to evolve in isolation, and the nature of their cooperation is very loose in most cases. For example, MathML is not at all related to Neuroscience and has the support of the Maths community (to some extent). In addition, the "stacking" approach is also a very interesting one, allowing a good domain focus. For example, NeuroML is built on top of LEMS, so in theory each of these should cover different domains and there should be minimal redundancy.

The key challenge for the XML Suite is for each of its components to find a sustainable user base and sustainable funding to go along with it. This is a broader problem in Neuroinformatics: researchers do not want to spend time on work that does not contribute directly to their research, so the developer pool available to do fundamental work on the file formats is limited. Once the developer pool becomes too limited, the file format ends up with a small user base because it is not fit for purpose, and thus starts a downward spiral. This appears to have been the fate of projects such as BrainML.

Conclusion

This post provided an overview of the data landscape in Computational Neuroscience and introduced the sub-field of Neuroinformatics. We also looked at some of the available data stores and reviewed a few of the more popular initiatives to solve the fundamental data problems in the field.

Stay tuned for the next instalment!

Footnotes:

1

For a bit more detail on the two fields see What are Computational Neuroscience and Neuroinformatics?

2

"A connectome is a comprehensive map of neural connections in the brain, and may be thought of as its "wiring diagram". From this page.


Monday, January 18, 2016

Nerd Food: Interesting…

Time to flush all those tabs again. Some interesting stuff I bumped into recently-ish.

Finance, Economics, Politics

  • Why Big Oil Should Kill Itself: This is a really, really interesting article. The gist of it is that the entire logic around oil exploration is now a fallacy and it makes more economic sense to simply give up looking for oil, because all the oil that is left is just too expensive to commercialise. It also has a very interesting take on the valuation of oil companies (and sources of takeovers), but I won't spoil it for you. If you are into oil (or against it), it's a must-read.
  • Oil Goes Nonlinear: Short but thought provoking. I don't tend to agree with Krugman on a lot of things, but quite like this analysis.
  • Africa's Boom Is Over: And the bad news continues. Totally spot-on analysis of what will befall us.
  • American Spring: interesting take on the state of affairs of American politics. Not sure I agree with everything, but definitely food for thought. "Statistically speaking, what are the odds that the two most qualified candidates to be president out of 300 million people are siblings? Or married?" Indeed.
  • A Year of Sovereign Defaults?: Very good and very scary. This has to be on the cards, the only question is the timing.
  • Really rich people are suddenly paying quite a bit more in taxes: some good news on the equality front I guess. But not quite sure it makes much of a difference in the big scheme of US things.
  • Argentina's 'little trees' getting chopped down by new president: Seems like Argentina is going to go through yet another turbulent period, with some good and bad news coming out. Interesting take on the impact of the new policies on the less well off. The chap is certainly a doer, it seems: A fast start.

Startups et al.

General Coding

  • Feeding Graph databases - a third use-case for modern log management platforms: Very interesting ideas on how to use logging data in a graph database. Sounds extremely counter-intuitive, and then you start reading, at which point it's like "Damn, why didn't I think of that before!". Source: Hacker News
  • Moore's law hits the roof: Seems like the exponential function is revealing itself as a sigmoid, as everyone knew it would. Covers some of the cracks that are already present in Moore's law. Interesting to note that a transistor is now only a few silicon atoms wide - meaning we can't really make it much smaller. Source: Hacker News
  • No, I Don't Want To Configure Your App: Call to arms to get us all thinking on just how many configuration knobs you need to use something. Source: Hacker News
  • Your IDE Is Killing You: Somewhat preaching to the choir, since I am an Emacs user of old, but still a very cogent argument on why relying too much on IDEs is not a good thing. Source: Bruno Antunes (twitter)
  • Starters and Maintainers: The different personas around an open source project. Interesting; it's good to be aware of which hat you are wearing, and when.
  • I Moved to Linux and It's Even Better Than I Expected: A feel-good story about the Linux desktop. Given how slowly things are progressing on that front, we all need one of these sometimes to cheer us up - which is the main value of the article.

Databases

  • Encrypted databases with ZeroDB: I'm not exactly impressed with the technology itself, but more with the ideas one can extract from it. Briefly: what if the database only stores encrypted data, which only each client can decrypt? This is certainly a very useful thing for certain types of information and a PostgreSQL extension would be most useful. Source: Hacker News
  • Introduction to PostgreSQL physical storage: Great article on Postgres low-level details. One to read if you want to get serious about the Elephant but are not yet in the know.
  • Schema based versioning and deployment for PostgreSQL: Tips on how to manage versions for your stored procs, and also contains links for table management. For those of us not totally taken by NoSQL.

C++

Layman Science

  • Why String Theory Is Not A Scientific Theory: Doesn't say a lot of new things, but it's good to remind ourselves of what exactly we mean when we say "Science". This would save us from a lot of grief, such as considering Economics a Science.
  • The cold fusion horizon: … talking about Science, I was surprised to find out that people are still talking seriously about cold fusion. Interesting article, because it takes the flip side of the Science coin: nothing should be dismissed as unscientific as long as it follows the scientific method. Whilst up till now cold fusion has been more of a hoax, we should not discredit people who work on it provided they are following scientific principles. Who knows, they may be right in the end. Science is all about long-shots.

Other


Sunday, January 17, 2016

Nerd Food: On Product Backlog

Would be good to have a better bug-tracking setup? Yes. But I think it takes man-power, and it would take something *fundamentally* better than bugzilla. -- Linus

Many developers in large companies tend to be exposed to a strange variation of agile which I like to call "Enterprise Grade Agile", but I've also heard it called "Fragile" and, most aptly, "Cargo-Cult Agile". However you decide to name the phenomenon, the gist of it is that these setups contain nearly all of the ceremony of agile - including stand-ups, sprint planning, retrospectives and so on - but none of its spirit. Tweets such as this one are great at capturing the essence of the problem.

Once you start having that nagging feeling of doing things "because you are told to", and once your stand-ups become more of a status report to the "project manager" and/or "delivery manager" - the existence of which, in itself, is rather worrying - your Cargo Cult Agile alarm bells should start ringing. As I see it, agile is a toolbox with a number of tools, and they only start to add value once you've adapted them to your personal circumstances. The fitness function that determines if a tool should be used is how much value it adds to all (or at least most) of its users. If it does not, the tool must be further adapted or removed altogether. And, crucially, you learn about agile tools by using them and by reflecting on the lessons learned. There is no other way.

This post is one such exercise and the tool I'd like to reflect on is the Product Backlog. Now, before you read through the whole rant, it's probably worth saying that this post takes a slightly narrow and somewhat "advanced" view of agile, with a target audience of those already using it. If you require a more introductory approach, you are probably better off looking at other online resources such as How to learn Scrum in 10 minutes and clean your house in the process. Having said that, I'll try to define terms as best I can to make sure we are all on the same page.

Working Definition

Once your company has grokked the basics of agile and starts to move away from those lengthy specification documents - those that no one reads properly until implementation and those that never specified anything the customer wanted, but everything we thought the customer wanted and then some - you will start to use the product backlog in anger. And that's when you will realise that it is not quite as simple as memorising text books.

So what do the "text books" say? Let's take a fairly typical definition - this one from Scrum:

The agile product backlog in Scrum is a prioritized features list, containing short descriptions of all functionality desired in the product. When applying Scrum, it's not necessary to start a project with a lengthy, upfront effort to document all requirements. Typically, a Scrum team and its product owner begin by writing down everything they can think of for agile backlog prioritization. This agile product backlog is almost always more than enough for a first sprint. The Scrum product backlog is then allowed to grow and change as more is learned about the product and its customers.1

This is a good working definition, which will suffice for the purposes of this post. It is deceptively simple. However, as always, one must remember Yogi Berra: "In theory, there is no difference between theory and practice. But in practice, there is."

Potemkin Product Backlogs

Many teams finish reading one such definition, find it amazingly inspiring, install the "agile plug-in" on their bug-tracking software of choice and then furiously start typing in those tickets. But if you look closely, you'd be hard-pressed to find any difference between the bug tickets of old versus the "stories" in the new and improved "product backlog" that apparently you are now using.

This is a classic management disconnect, whereby a renaming exercise is applied and suddenly, Potemkin village-style, we are now in with the kool kids and our company becomes a modern and desirable place to work. But much like Potemkin villages were not designed for real people to live in, so "Potemkin Product Backlogs" are not designed to help you manage the lifecycle of a real product; they are there to give you the appearance of doing said management, for the purposes of reporting to the higher echelons and so that you can tell stakeholders that "their story has been added to the product backlog for prioritisation".

Alas, very soon you will find that the bulk of the "user stories" are nothing but glorified one-liners whose meaning no one seems able to recall, and the few elaborately detailed tickets end up rotting because they keep being deprioritised and now describe a world long gone. Soon enough you will find that your sprint planning meetings cover less and less of the product backlog - after all, who is able to prioritise this mess? Some stories don't even make any sense! The final act is when all stories worked on are stories raised directly on the sprint backlog, and the product backlog is nothing but the dumping ground for the stories that didn't make it into a given sprint. At this stage, the product backlog is in such a terrible mess that no one looks at it, other than for the occasional historic search for valuable details on how a bug was fixed. Eventually the product backlog is zeroed - maybe a dozen or so of the most recent stories make it through the cull - and the entire process begins anew. Alas, enlightenment is never achieved, so you are condemned to repeat this cycle for all eternity.

As expected, the Potemkin Product Backlog adds very little value - in fact it can be argued that it detracts value - but it must be kept because "agile requires a product backlog".

Bug-Trackers: Lessons From History

In order to understand the difficulties with a product backlog, we turn next to their logical predecessors: bug-tracking systems such as Bugzilla or Jira. This post starts with a quote from the kernel's Benevolent Dictator that illustrates the problem with these. Linus has long taken the approach that there is no need for a bug-tracker in kernel development, although he does not object if someone wants to use one for a subsystem. You may think this is a very primitive approach but in some ways it is also a very modern approach, very much in line with agile; if you have a bug-tracking system which is taking time away from developers without providing any value, you should remove the bug-tracking system. In kernel development, there simply is no space for ceremony - or, for that matter, for anything which slows things down2.

All of which begs the question: what makes bug-tracking systems so useless? From experience, there are a few factors:

  • they are a "fire and forget" capture system. Most users only care about entering new data, rather than worrying about the lifecycle of a ticket. Very few places have some kind of "ticket quality control" which ensures that the content of the ticket is vaguely sensible, and those who do suffer from another problem:
  • they require dedicated teams. By this I don't just mean running the bug-tracking software - which you will most likely have to do in a proprietary shop; I also mean the entire notion of QA and Testing as separate from development, with reams of people dedicated to setting "environments" up (and keeping them up!), organising database restores and other such activities that are incompatible with current best practices of software development.
  • they are temples of ceremony: a glance at the myriad of fields you need to fill in - and the rules and permutations required to get them exactly right - should be sufficient to put off even the most ardent believer in process. Most developers end up memorising some safe incantation that allows them to get on with life, without understanding the majority of the data they are entering.
  • as the underlying product ages, you will be faced with the sad graph of software death. The main problem is that resources get taken away from systems as they get older, a phenomenon that manifests itself as a growth in the delta between the number of open tickets and the number of closed tickets. This is actually a really useful metric, but one that is often ignored.3

And what of the newest iterations on this venerable concept such as GitHub Issues? Well, clearly they solve a number of the problems above - such as lowering the complexity and cost barriers - and certainly they do serve a very useful purpose: they allow the efficient management of user interactions. Every time I create an issue - such as this one - it never ceases to amaze me how easily the information flows within GitHub projects; one can initiate comms with the author(s) or other users with zero setup - something that previously required mailing list membership, opening an account on a bug-tracker and so forth. We now take all of this for granted, of course, but it is important to bear in mind that many open source projects would probably not even have any form of user interaction support, were it not for GitHub. After all, most of them are a one-person shop with very little disposable time, and it makes no sense to spend part of that time maintaining infrastructure for the odd person or two who may drop by to chat.

However, for all of its glory, it is also important to bear in mind that GitHub Issues is not a product backlog solution. What I mean by this is that the product backlog must be owned by the team that owns the product and, as we shall see, it must be carefully groomed if it is to be continually useful. This is at loggerheads with allowing free flow of information from users. Your Issues will eventually be filled up with user requests and questions which you may not want to address, or general discussions which may or may not have a story behind them. They are simply different tools for different jobs, albeit with an overlap in functionality.

So, history tells us what does not work. But is the product backlog even worth all this hassle?

Voyaging Through Strange Seas of Thought

One of the great things about agile is how much it reflects on itself; a strange loop of sorts. Presentations such as Kevlin Henney's The Architecture of Uncertainty are part of this continual process of discovery and understanding, and provide great insights about the fundamental nature of the development process. The product backlog plays - or should play - a crucial role exactly because of this uncertain nature of software development. We can explain this by way of a device.

Imagine that you start off by admitting that you know very little about what it is that you are intending to do and that the problem domain you are about to explore is vast and complex. In this scenario, the product backlog is the sum total of the knowledge gained whilst exploring this space that has yet not been transformed into source code. Think of it like the explorer's maps in the fifteen-hundreds. In those days, "users" knew that much of it was incorrect and a great part was sketchy and ill-defined, but it was all you had. Given that the odds of success were stacked against you, you'd hold that map pretty tightly while the storms were raging about you. Those that made it back would provide corrections and amendments and, over time, the maps eventually converged with the real geography.

The product backlog does something similar, but of course, the space you are exploring does not have a fixed geometry or topography and your knowledge of the problem domain can actively change the domain itself too - an unavoidable consequence of dealing with pure thought stuff. But the general principle applies. Thus, in the same way a code base is precious because it embodies the sum total knowledge of a domain - heck, in many ways it is the sum total knowledge of a domain! - so the product backlog is precious because it captures all the known knowledge of these yet-to-be-explored areas. In this light, you can understand statements such as this.

So, if the backlog is this important, how should one manage it?

Works For Me, Guv!

Up to this point - whilst we were delving into the problem space - we have been dealing with a fairly general argument, likely applicable to many. Now, as we enter the solution space, I'm afraid I will have to move from the general to the particular and talk only about the specific circumstances of my one-man-project Dogen. You can find Dogen's product backlog here.

This may sound like a bit of a cop-out, and not without reason: how on earth are you supposed to extrapolate conclusions from a one-person open source project to a team of N working on a commercial product? However, it is also important to take into account what I said at the start: agile is what you make of it. I personally think of it as a) the smallest amount of process required to make your development process work smoothly and b) the continual improvement of that process. Thus, there are no one-size-fits-all solutions; all one can do is to look at others for ideas. So, let's look at my findings4.

The first and most important thing I did to help me manage my product backlog was to use a simple text file in Org Mode notation. Clearly, this is not a setup that is workable for a development team much larger than a set of one, or one that doesn't use Emacs (or Vim). But for my particular circumstances it has worked wonders:

  • the product backlog is close to the code, so wherever you go, you take it with you. This means you can always search the product backlog and - most importantly - add to it wherever you are and whenever an idea happens to come by. I use this flexibility frequently.
  • the Org Mode interface makes it really easy to move stories up and down (order is taken to mean priority here) and to create "buckets" of stories according to whatever categorisation you decide to use, up to any level of nesting. At some point you end up converging to a reasonable level of nesting, of course. It is surprising how one can manage very large numbers of stories thanks to this flexible tree structure.
  • it's trivial to move stories in and out of a sprint, keeping track of all changes to a story - they are just text that can be copied, pasted and committed.
  • Org Mode provides a very capable tagging system. I started by overusing these, but when tagging got too fine-grained it became unmaintainable. Now we use too few - just epic and story - so this will have to change again in the near future. For example, it should be trivial to add tags for different components in the system or to mark stories as bugs or features, etc. Searching then allows you to see the subset of stories that match those labels. See the sketch below for a flavour of the overall structure.
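
To give an idea of what I mean, here is a sketch of the kind of structure I end up with. The epics and stories below are made up for illustration - they are not from Dogen's actual backlog - but the ordering-as-priority and the epic and story tags are as described above:

* Product Backlog
** Formatting                                                       :epic:
*** Add support for a hypothetical new formatter                   :story:
    Rough notes, links and bullet points accumulate here over time,
    until the story is fully formed and ready for a sprint backlog.
*** Tidy up the formatter interfaces                               :story:
** Documentation                                                    :epic:
*** Write a manual section on configuration                        :story: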

A second decision which has proven to be a very good one has been to groom the product backlog very often. And by this I don't just mean a cursory look, but a deep inspection of all stories, fixing them where required. Again, the choice of format has proved very helpful:

  • it is easy to mark all stories as "non-reviewed" or some other suitable tag in Org Mode, and then unmark them as one finishes the groom - thereby ensuring all stories get some attention. As the product backlog becomes larger, a full groom could take multiple sprints, but this is not an issue once you understand its value and the cost of having it rot.
  • because the product backlog is with the code, any downtime can be used for grooming; those idle weekends or that long wait at the airport are perfect candidates to get a few stories looked at. Time spent waiting for the build is also a good candidate.
  • you get an HTML representation of the Org Mode file for free in GitHub, meaning you can read your backlog from your phone. And with the new editing functionality, you can edit stories too.

Thirdly, I decided to take a "multi-pass" approach at managing the story lifecycle. These are some of the key aspects of this lifecycle management:

  • stories can only be captured if they are aligned with the vision. This filter saves me from adding all sorts of ideas which are just too "out of left field" to be of practical use, but keeps those that may sound crazy but are aligned with the vision.
  • stories can only be captured if there is no "prior art". I always perform a number of searches in the backlog to look for anything which covers similar ground. If found, I append to that.
  • new stories tend to start with very little content - just the minimum required to get me back to the idea I was trying to capture. Due to this, very little gets lost. At this point, we have a "proto-story".
  • as time progresses, I end up having more ideas in this space, and I update the story with those ideas - mainly bullet points with one-liners and links.
  • at some point the story begins to mature; there is enough on it that we can convert the "proto-story" into a full-blown story. After a number of grooms, the story becomes fully formed and is then a candidate to be moved to a sprint backlog for implementation. It may stay in this state ad infinitum, with periodic updates just to make sure it does not rot.
  • a candidate story can still get refined: trimmed in scope, re-targeted, or even cancelled because it no longer fits with the current architecture or even the vision. Cancelled stories are important because we may come back to them - it's just very unlikely that we do.
  • every sprint has a "sprint mission"5. When we start to move stories into the sprint backlog, we look for those which resonate with the sprint mission. Not all of them are fully formed, and the work on the sprint can entail the analysis required to create a full blown story. But many will be implementable directly off of the product backlog.
  • sometimes I end up finding related threads in multiple stories and decide to merge them. Merging of related stories is done by simply copying and pasting them into a single story; over time, with the multiple passes done in the grooms, we end up again with a single consistent story.

What all of this means is that a story can evolve over time in the product backlog, only to become the exact thing you need at a given sprint; at that point you benefit from the knowledge and insight gained over that long period of time. Some stories in Dogen's backlog have been there for years, and when I finally get to them, I find them extremely useful. Remember: they are a map to the unknown space you are exploring.

With all of this machinery in place, we've ended up with a very useful product backlog for Dogen - one that certainly adds a lot of value. Don't get me wrong, the cost of maintenance is high and I'd rather be coding instead of maintaining the product backlog, especially given the limited resources. But I keep it because I can see on a daily basis how much it improves the overall quality of the development process. It is a price I find worth paying, given what I get in return.

Final Thoughts

This post was an attempt to summarise some of the thoughts I've been having on the space of product backlogs. One of its main objectives was to try to convey the importance of this tool, and to provide ideas on how you can improve the management of your own product backlog by discussing the approach I have taken with Dogen.

If you have any suggestions or want to share your own tips on how to manage your product backlog, please reach me in the comments section - there is always space for improvement.

Footnotes:

1

Source: Scrum Product Backlog, Mountain Goat Software.

2

A topic which I covered some time ago here: On Evolutionary Methodology. It is also interesting to see how the kernel processes are organised for speed: How 4.4's patches got to the mainline.

3

Another topic which I also covered here some time ago: On Maintenance.

4

I am self-plagiarising a little bit here and rehashing some of the arguments I've used before in Lessons in Incremental Coding, mainly from section DVCS to the Core.

5

See the current sprint backlog for an example.


Tuesday, December 22, 2015

Nerd Food: Dogen: The Package Management Saga

We've just gone past Dogen's Sprint 75, so I guess it's time for one of those "reminiscing posts" - something along the lines of what we did for Sprint 50. This one is a bit more practical though; if you are only interested in the practical side, keep scrolling until you see "Conan".

So, package management. Like any other part-time C++ developer whose professional mainstay is C# and Java, I have keenly felt the need for a package manager when in C++-land. The problem is less visible when you are working with mature libraries and dealing with just Linux, due to the huge size of the package repositories and the great tooling built around them. However, things get messier when you start to go cross-platform, and messier still when you are coding on the bleeding edge of C++: either the package you need is not available in the distro's repos or even PPAs; or, when it is, it's rarely at the version you require.

Alas, for all our sins, that's exactly where we were when Dogen got started.

A Spoonful of Dogen History

Dogen sprung to life just a tad after C++-0x became C++-11, so we experienced first hand the highs of a quasi-new-language followed by the lows of feeling the brunt of the bleeding edge pain. For starters, nothing we ever wanted was available out of the box, on any of the platforms we were interested in. Even Debian testing was a bit behind - probably stalled due to a compiler transition or other, but I can't quite recall the details. In those days, Real Programmers were Real Programmers and mice were mice: we had to build and install the C++ compilers ourselves and, even then, C++-11 support was new, a bit flaky and limited. We then had to use those compilers to compile all of the dependencies in C++-11 mode.

The PFH Days

After doing this manually once or twice, it soon stopped being fun. And so we solved this problem by creating the PFH - the Private Filesystem Hierarchy - a gloriously over-ambitious name to describe a set of wrapper scripts that helped with the process of downloading tarballs, unpacking, building and finally installing them into well-defined locations. It worked well enough in the confines of its remit, but we were often outside those, having to apply out-of-tree patches, adding new dependencies and so on. We also didn't use Travis in those days - not even sure it existed, but if it did, the rigmarole of the bleeding edge experience would certainly put a stop to any ideas of using it. So we used a local install of CDash with a number of build agents on OSX, Windows (MinGW) and Linux (32-bit and 64-bit). Things worked beautifully when nothing changed and the setup was stable; but, every time a new version of a library - or god forbid, of a compiler - was released, one had that sense of dread: do I really need to upgrade?

Since one of the main objectives of Dogen was to learn about C++-11, one has to say that the pain was worth it. But all of the moving parts described above were not ideal and they were certainly not the thing you want to be wasting your precious time on when it is very scarce. They were certainly not scalable.

The Good Days and the Bad Days

Things improved slightly for a year or two when distros started to ship C++-11 compliant compilers and recent Boost versions. It was all so good we were able to move over to Travis and ditch almost all of our private infrastructure. For a while things looked really good. However, due to Travis' Ubuntu LTS policy, we were stuck with a rapidly ageing Boost version. At first PPAs were a good solution for this, but soon these became stale too. We also needed to get the latest CMake as there are a lot of developments on that front, but we certainly could not afford (time-wise) to revert back to the bad old days of the PFH. At the same time, it made no sense to freeze dependencies in time, providing a worse development experience. So the only route left was to break Travis and hope that some solution would appear. Some alternatives were tried, such as Drone.io, but nothing was successful.

There was nothing else for it; what was needed was a package manager to manage the development dependencies.

Nuget Hopes Dashed

Having used Nuget in anger for both C# and C++ projects, and given Microsoft's recent change of heart with regards to open source, I was secretly hoping that Nuget would get some traction in the wider C++ world. To recap, Nuget worked well enough in Mono; in addition, C++ support for Windows was added early on. It was somewhat limited and a bit quirky at the start, but it kept on getting better, to the point of usability. Trouble was, their focus was just Visual Studio.

Alas, nothing much ever came from my Nuget hopes. However, there have been a couple of recent announcements from Microsoft that make me think that they will eventually look into this space.

Surely the logical consequence is to be able to manage packages in a consistent way across platforms? We can but hope.

Biicode Comes to the Rescue?

Nuget did not pan out but what did happen was even more unlikely: some crazy-cool Spaniards decided to create a stand alone package manager. Being from the same peninsula, I felt compelled to use their wares, and was joyful as they went from strength to strength - including the success of their open source campaign. And I loved the fact that it integrated really well with CMake, and that CLion provided Biicode integration very early on.

However, my biggest problem with Biicode was that it was just too complicated. I don't mean to say the creators of the product didn't have very good reasons for their technical choices - lord knows creating a product is hard enough, so I have nothing but praise to anyone who tries. However, for me personally, I never had the time to understand why Biicode needed its own version of CMake, nor did I want to modify my CMake files too much in order to fit properly with Biicode and so on. Basically, I needed a solution that worked well and required minimal changes at my end. Having been brought up with Maven and Nuget, I just could not understand why there wasn't a simple "packages.xml" file that specified the dependencies and then some non-intrusive CMake support to expose those into the CMake files. As you can see from some of my posts, it just seemed it required "getting" Biicode in order to make use of it, which for me was not an option.

Another thing that annoyed me was the difficulty of knowing what the "real" version of a library was. I wrote, at the time:

One slightly confusing thing about the process of adding dependencies is that there may be more than one page for a given dependency and it is not clear which one is the "best" one. For RapidJson there are three options, presumably from three different Biicode users:

  • fenix: authored on 2015-Apr-28, v1.0.1.
  • hithwen: authored 2014-Jul-30
  • denis: authored 2014-Oct-09

The "fenix" option appeared to be the most up-to-date so I went with that one. However, this illustrates a deeper issue: how do you know you can trust a package? In the ideal setup, the project owners would add Biicode support and that would then be the one true version. However, like any other project, Biicode faces the initial adoption conundrum: people are not going to be willing to spend time adding support for Biicode if there aren't a lot of users of Biicode out there already, but without a large library of dependencies there is nothing to draw users in. In this light, one can understand that it makes sense for Biicode to allow anyone to add new packages as a way to bootstrap their user base; but sooner or later they will face the same issues as all distributions face.

A few features would be helpful in the mean time:

  • popularity/number of downloads
  • user ratings

These metrics would help in deciding which package to depend on.

For all these reasons, I never found the time to get Biicode set up and these stories lingered in Dogen's backlog. And the build continued to be red.

Sadly, Biicode the company didn't make it either. I feel very sad for the guys behind it, because their heart was in the right place.

Which brings us right up to date.

Enter Conan

When I was a kid, we were all big fans of Conan. No, not the barbarian, the Japanese anime Future Boy Conan. For me the name Conan will always bring back great memories of this show, which we watched in the original Japanese with Portuguese subtitles. So I was secretly pleased when I found conan.io, a new package management system for C++. The guy behind it seems to be one of the original Biicode developers, so a lot of lessons from Biicode were learned.

To cut a short story short, the great news is I managed to add Conan support to Dogen in roughly 3 hours and with very minimal knowledge about Conan. This to me was a litmus test of sorts, because I have very little interest in package management - creating my own product has proven to be challenging enough, so the last thing I need is to divert my energy further. The other interesting thing is that roughly half of that time was taken by trying to get Travis to behave, so it's not quite fair to impute it to Conan.

Setting Up Dogen for Conan

So, what changes did I make to get it all working? It was a very simple 3-step process. First I installed Conan using a Debian package from their site.
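
For reference, on a Debian-based box the install boils down to grabbing the deb and installing it with dpkg - these are the same commands used later in the Travis setup, and the version number will of course have moved on by the time you read this:

$ wget https://s3-eu-west-1.amazonaws.com/conanio-production/downloads/conan-ubuntu-64_0_5_0.deb -O conan.deb
$ sudo dpkg -i conan.deb
$ rm conan.deb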

I then created a conanfile.txt on my top-level directory:

[requires]
Boost/1.60.0@lasote/stable

[generators]
cmake

Finally I modified my top-level CMakeLists.txt:

# conan support
if(EXISTS "${CMAKE_BINARY_DIR}/conanbuildinfo.cmake")
    message(STATUS "Setting up Conan support.")
    include("${CMAKE_BINARY_DIR}/conanbuildinfo.cmake")
    CONAN_BASIC_SETUP()
else()
    message(STATUS "Conan build file not found, skipping include")
endif()

This means that it is entirely possible to build Dogen without Conan, but if it is present, it will be used. With these two changes, all that was left to do was to build:

$ cd dogen/build/output
$ mkdir gcc-5-conan
$ cd gcc-5-conan
$ conan install ../../..
$ cmake ../../..
$ make -j5 run_all_specs

Et voila, I had a brand spanking new build of Dogen using Conan. Well, actually, not quite. I've omitted a couple of problems that are a bit of a distraction on the Conan success story. Let's look at them now.

Problems and Their Solutions

The first problem was that the Conan package for Boost 1.59 does not appear to have an overridden FindBoost, which meant I was not able to link. I moved to Boost 1.60 - which I wanted to do anyway - and it worked out of the box.

The second problem was that Conan seems to get confused by Ninja, my build system of choice. For whatever reason, when I use the Ninja generator, it fails like so:

$ cmake ../../../ -G Ninja
$ ninja -j5
ninja: error: '~/.conan/data/Boost/1.60.0/lasote/stable/package/ebdc9c0c0164b54c29125127c75297f6607946c5/lib/libboost_system.so', needed by 'stage/bin/dogen_utility_spec', missing and no known rule to make it

This is very strange because boost system is clearly available in the Conan download folder. Using make solved this problem. I am going to open a ticket on the Conan GitHub project to investigate this.

The third problem is more Boost-related than anything else. Boost Graph has not been as well maintained as it should be, really. Thus users now find themselves carrying patches, all because no one seems to be able to apply them upstream. Dogen is in this situation, as we've hit the issue described here: Compile error with boost.graph 1.56.0 and g++ 4.6.4. Sadly this is still present in Boost 1.60; the patch exists in Trac but remains unapplied (#10382). This is a tad worrying, as we make a lot of use of Boost Graph and intend to increase that usage in the future.

At any rate, as you can see, none of the problems were showstoppers, nor can they all be attributed to Conan.

Getting Travis to Behave

Once I got Dogen building locally, I then went on a mission to convince Travis to use it. It was painful, but mainly because of the lag between commits and hitting an error. The core of the changes to my YML file were as follows:

install:
<snip>
  # conan
  - wget https://s3-eu-west-1.amazonaws.com/conanio-production/downloads/conan-ubuntu-64_0_5_0.deb -O conan.deb
  - sudo dpkg -i conan.deb
  - rm conan.deb
<snip>
script:
  - export GIT_REPO="`pwd`"
  - cd ${GIT_REPO}/build
  - mkdir output
  - cd output
  - conan install ${GIT_REPO}
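  # find the Boost package folder in the local Conan cache and apply the Boost Graph patch to its headers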
  - hash=`ls ~/.conan/data/Boost/1.60.0/lasote/stable/package/`
  - cd ~/.conan/data/Boost/1.60.0/lasote/stable/package/${hash}/include/
  - sudo patch -p0 < ${GIT_REPO}/patches/boost_1_59_graph.patch
  - cmake ${GIT_REPO} -DWITH_MINIMAL_PACKAGING=on
  - make -j2 run_all_specs
<snip>

I probably should have a bash script by now, given the size of the YML, but hey - if it works. The changes above deal with the installation of the package, applying the Boost patch and using Make instead of Ninja. Quite trivial in the end, even though it required a lot of iterations to get there.
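
If I ever do write that script, a first stab might look something like the following - a sketch only, lifted straight from the YML above, with an explicit cd back into the build directory before configuring:

#!/bin/bash
# Sketch of a CI build script mirroring the Travis YML above.
set -e

# assumes the script is run from the top level of the git repository.
GIT_REPO="`pwd`"

# create and enter the build directory.
cd "${GIT_REPO}/build"
mkdir -p output
cd output

# fetch dependencies via Conan.
conan install "${GIT_REPO}"

# apply the Boost Graph patch to the headers in the local Conan cache.
hash=`ls ~/.conan/data/Boost/1.60.0/lasote/stable/package/`
cd ~/.conan/data/Boost/1.60.0/lasote/stable/package/${hash}/include/
sudo patch -p0 < "${GIT_REPO}/patches/boost_1_59_graph.patch"

# configure and build with Make.
cd "${GIT_REPO}/build/output"
cmake "${GIT_REPO}" -DWITH_MINIMAL_PACKAGING=on
make -j2 run_all_specs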

Conclusions

Having a red build is a very distressing event for a developer, so you can imagine how painful it has been to have red builds for several months. So it was with unmitigated pleasure that I got to see build #628 in a shiny emerald green. As far as that goes, it has been an unmitigated success.

In a broader sense though, what can we say about Conan? There are many positives to take home, even at this early stage of Dogen usage:

  • it is a lot less intrusive than Biicode and easier to set up. Biicode was very well documented, but it was easy to stray from the beaten track and that then required reading a lot of different wiki pages. It seems easier to stay on the beaten track with Conan.
  • as with Biicode, it seems to provide solutions for Debug/Release configurations and multiple platforms and compilers. We shall be testing it on Windows soon and reporting back.
  • hopefully, since it started Open Source from the beginning, it will form a community of developers around the source with the know-how required to maintain it. It would also be great to see if a business forms around it, since someone will have to pay the cloud bill.

In terms of negatives:

  • I still believe the most scalable approach would have been to extend Nuget for the C++ Linux use case, since Microsoft is willing to take patches and since they foot the bill for the public repo. However, I can understand why one would prefer to have total control over the solution rather than depend on the whims of some middle-manager in order to commit.
  • it seems publishing packages requires getting down into Python. Haven't tried it yet, but I'm hoping it will be made as easy as importing packages with a simple text file. The more complexity around these flows the tool adds, the less likely they are to be used.
  • there still are no "official builds" from projects. As explained above, this is a chicken and egg problem, because people are only willing to dedicate time to it once there are enough users complaining. Having said that, since Conan is easy to setup, one hopes to see some adoption in the near future.
  • even when using a GitHub profile, one still has to define a Conan specific password. This was not required with Biicode. Minor pain, but still, if they want to increase traction, this is probably an unnecessary stumbling block. It was sufficient to make me think twice about setting up a login, for one.

In truth, these are all very minor negative points, but still worth making. All in all, I am quite pleased with Conan thus far.
