Sunday, January 17, 2016

Nerd Food: On Product Backlog

Would it be good to have a better bug-tracking setup? Yes. But I think it takes man-power, and it would take something *fundamentally* better than bugzilla. -- Linus

Many developers in large companies tend to be exposed to a strange variation of agile which I like to call "Enterprise Grade Agile", but I've also heard it called "Fragile" and, most aptly, "Cargo-Cult Agile". However you decide to name the phenomenon, the gist of it is that these setups contain nearly all of the ceremony of agile - including stand-ups, sprint planning, retrospectives and so on - but none of its spirit.

Once you start having that nagging feeling of doing things "because you are told to", and once your stand-ups become more of a status report to the "project manager" and/or "delivery manager" - the existence of which, in itself, is rather worrying - your Cargo Cult Agile alarm bells should start ringing. As I see it, agile is a toolbox with a number of tools, and they only start to add value once you've adapted them to your personal circumstances. The fitness function that determines if a tool should be used is how much value it adds to all (or at least most) of its users. If it does not, the tool must be further adapted or removed altogether. And, crucially, you learn about agile tools by using them and by reflecting on the lessons learned. There is no other way.

This post is one such exercise and the tool I'd like to reflect on is the Product Backlog. Now, before you read through the whole rant, it's probably worth saying that this post takes a slightly narrow and somewhat "advanced" view of agile, with a target audience of those already using it. If you require a more introductory approach, you are probably better off looking at other online resources such as How to learn Scrum in 10 minutes and clean your house in the process. Having said that, I'll try to define terms as best I can to make sure we are all on the same page.

Working Definition

Once your company has grokked the basics of agile and starts to move away from those lengthy specification documents - those that no one reads properly until implementation and those that never specified anything the customer wanted, but everything we thought the customer wanted and then some - you will start to use the product backlog in anger. And that's when you will realise that it is not quite as simple as memorising textbooks.

So what do the "textbooks" say? Let's take a fairly typical definition - this one from Scrum:

The agile product backlog in Scrum is a prioritized features list, containing short descriptions of all functionality desired in the product. When applying Scrum, it's not necessary to start a project with a lengthy, upfront effort to document all requirements. Typically, a Scrum team and its product owner begin by writing down everything they can think of for agile backlog prioritization. This agile product backlog is almost always more than enough for a first sprint. The Scrum product backlog is then allowed to grow and change as more is learned about the product and its customers.1

This is a good working definition, which will suffice for the purposes of this post. It is deceptively simple. However, as always, one must remember Yogi Berra: "In theory, there is no difference between theory and practice. But in practice, there is."

Potemkin Product Backlogs

Many teams finish reading one such definition, find it amazingly inspiring, install the "agile plug-in" on their bug-tracking software of choice and then furiously start typing in those tickets. But if you look closely, you'd be hard-pressed to find any difference between the bug tickets of old versus the "stories" in the new and improved "product backlog" that apparently you are now using.

This is a classic management disconnect, whereby a renaming exercise is applied and suddenly, Potemkin village-style, we are now in with the kool kids and our company suddenly becomes a modern and desirable place to work. But much like Potemkin villages were not designed for real people to live in, so "Potemkin Product Backlogs" are not designed to help you manage the lifecycle of a real product; they are there to give you the appearance of doing said management, for the purposes of reporting to the higher echelons and so that you can tell stakeholders that "their story has been added to the product backlog for prioritisation".

Alas, very soon you will find that the bulk of the "user stories" are nothing but glorified one-liners whose exact meaning no one seems to recall, and the few elaborately detailed tickets end up rotting because they keep being deprioritised and now describe a world long gone. Soon enough you will find that your sprint planning meetings will cover less and less of the product backlog - after all, who is able to prioritise this mess? Some stories don't even make any sense! The final act is when all stories worked on are stories raised directly on the sprint backlog, and the product backlog is nothing but the dumping ground for the stories that didn't make it into a given sprint. At this stage, the product backlog is in such a terrible mess that no one looks at it, other than for the occasional historic search for valuable details on how a bug was fixed. Eventually the product backlog is zeroed - maybe a dozen or so of the most recent stories make it through the cull - and the entire process begins anew. Alas, enlightenment is never achieved, so you are condemned to repeat this cycle for all eternity.

As expected, the Potemkin Product Backlog adds very little value - in fact it can be argued that it detracts value - but it must be kept because "agile requires a product backlog".

Bug-Trackers: Lessons From History

In order to understand the difficulties with a product backlog, we turn next to their logical predecessors: bug-tracking systems such as Bugzilla or Jira. This post starts with a quote from the kernel's Benevolent Dictator that illustrates the problem with these. Linus has long taken the approach that there is no need for a bug-tracker in kernel development, although he does not object if someone wants to use one for a subsystem. You may think this is a very primitive approach but in some ways it is also a very modern approach, very much in line with agile; if you have a bug-tracking system which is taking time away from developers without providing any value, you should remove the bug-tracking system. In kernel development, there simply is no space for ceremony - or, for that matter, for anything which slows things down2.

All of which raises the question: what makes bug-tracking systems so useless? From experience, there are a few factors:

  • they are a "fire and forget" capture system. Most users only care about entering new data, rather than worrying about the lifecycle of a ticket. Very few places have some kind of "ticket quality control" which ensures that the content of the ticket is vaguely sensible, and those who do suffer from another problem:
  • they require dedicated teams. By this I don't just mean running the bug-tracking software (which you will most likely have to do in a proprietary shop); I also mean the entire notion of QA and Testing as separate from development, with reams of people dedicated to setting "environments" up (and keeping them up!), organising database restores and other such activities that are incompatible with current best practices of software development.
  • they are temples of ceremony: a glance at the myriad of fields you need to fill in - and the rules and permutations required to get them exactly right - should be sufficient to put off even the most ardent believer in process. Most developers end up memorising some safe incantation that allows them to get on with life, without understanding the majority of the data they are entering.
  • as the underlying product ages, you will be faced with the sad graph of software death. The main problem is that resources get taken away from systems as they get older, a phenomenon that manifests itself as a growth in the delta between the number of open tickets and the number of closed tickets. This is actually a really useful metric, but one that is often ignored.3
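That open-versus-closed delta is trivial to compute. As a minimal sketch in Python - with weekly ticket figures invented purely for illustration, not taken from any real tracker:

```python
from itertools import accumulate

# Hypothetical weekly ticket counts for an ageing product: (opened, closed).
# These figures are made up for illustration only.
weekly = [(10, 9), (12, 8), (15, 7), (14, 5), (16, 4)]

# Cumulative totals, then the running delta between opened and closed.
opened = list(accumulate(o for o, _ in weekly))
closed = list(accumulate(c for _, c in weekly))
delta = [o - c for o, c in zip(opened, closed)]

print(delta)  # → [1, 5, 13, 22, 34]
```

Plotted over time, a monotonically growing delta is exactly the "sad graph" in question: tickets arrive faster than the shrinking team can close them.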

And what of the newest iterations on this venerable concept such as GitHub Issues? Well, clearly they solve a number of the problems above - such as lowering the complexity and cost barriers - and certainly they do serve a very useful purpose: they allow the efficient management of user interactions. Every time I create an issue - such as this one - it never ceases to amaze me how easily the information flows within GitHub projects; one can initiate comms with the author(s) or other users with zero setup - something that previously required mailing list membership, opening an account on a bug-tracker and so forth. We now take all of this for granted, of course, but it is important to bear in mind that many open source projects would probably not even have any form of user interaction support, were it not for GitHub. After all, most of them are a one-person shop with very little disposable time, and it makes no sense to spend part of that time maintaining infrastructure for the odd person or two who may drop by to chat.

However, for all of its glory, it is also important to bear in mind that GitHub Issues is not a product backlog solution. What I mean by this is that the product backlog must be owned by the team that owns the product and, as we shall see, it must be carefully groomed if it is to be continually useful. This is at loggerheads with allowing free flow of information from users. Your Issues will eventually be filled up with user requests and questions which you may not want to address, or general discussions which may or may not have a story behind them. They are simply different tools for different jobs, albeit with an overlap in functionality.

So, history tells us what does not work. But is the product backlog even worth all this hassle?

Voyaging Through Strange Seas of Thought

One of the great things about agile is how much it reflects on itself; a strange loop of sorts. Presentations such as Kevlin Henney's The Architecture of Uncertainty are part of this continual process of discovery and understanding, and provide great insights about the fundamental nature of the development process. The product backlog plays - or should play - a crucial role exactly because of this uncertain nature of software development. We can explain this by way of a device.

Imagine that you start off by admitting that you know very little about what it is that you are intending to do and that the problem domain you are about to explore is vast and complex. In this scenario, the product backlog is the sum total of the knowledge gained whilst exploring this space that has not yet been transformed into source code. Think of it like the explorers' maps in the fifteen-hundreds. In those days, "users" knew that much of it was incorrect and a great part was sketchy and ill-defined, but it was all you had. Given that the odds of success were stacked against you, you'd hold that map pretty tightly while the storms were raging about you. Those that made it back would provide corrections and amendments and, over time, the maps eventually converged with the real geography.

The product backlog does something similar, but of course, the space you are exploring does not have a fixed geometry or topography, and your knowledge of the problem domain can actively change the domain itself too - an unavoidable consequence of dealing with pure thought stuff. But the general principle applies. Thus, in the same way a code base is precious because it embodies the sum total knowledge of a domain - heck, in many ways it is the sum total knowledge of a domain! - so the product backlog is precious because it captures all the knowledge of these yet-to-be-explored areas.

So, if the backlog is this important, how should one manage it?

Works For Me, Guv!

Up to this point - whilst we were delving into the problem space - we have been dealing with a fairly general argument, likely applicable to many. Now, as we enter the solution space, I'm afraid I will have to move from the general to the particular and talk only about the specific circumstances of my one-man project Dogen. You can find Dogen's product backlog here.

This may sound like a bit of a cop out, and not without reason: how on earth are you supposed to extrapolate conclusions from a one-person open source project to a team of N working on a commercial product? However, it is also important to take into account what I said at the start: agile is what you make of it. I personally think of it as a) the smallest amount of process required to make your development work smoothly and b) the continual improvement of those processes. Thus, there are no one-size-fits-all solutions; all one can do is to look at others for ideas. So, let's look at my findings4.

The first and most important thing I did to help me manage my product backlog was to use a simple text file in Org Mode notation. Clearly, this is not a setup that is workable for a development team much larger than a set of one, or one that doesn't use Emacs (or Vim). But for my particular circumstances it has worked wonders:

  • the product backlog is close to the code, so wherever you go, you take it with you. This means you can always search the product backlog and - most importantly - add to it wherever you are and whenever an idea happens to come by. I use this flexibility frequently.
  • the Org Mode interface makes it really easy to move stories up and down (order is taken to mean priority here) and to create "buckets" of stories according to whatever categorisation you decide to use, up to any level of nesting. At some point you end up converging to a reasonable level of nesting, of course. It is surprising how one can manage very large amounts of stories thanks to this flexible tree structure.
  • it's trivial to move stories in and out of a sprint, keeping track of all changes to a story - they are just text that can be copy and pasted and committed.
  • Org Mode provides a very capable tagging system. I first started by overusing these, but when tagging got too fine grained it became unmaintainable. Now we use too few - just epic and story - so this will have to change again in the near future. For example, it should be trivial to add tags for different components in the system or to mark stories as bugs or features, etc. Searching then allows you to see a subset of the stories that match those labels.
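To make this concrete, here is a hypothetical fragment of such an Org Mode backlog - the epics and stories below are invented for illustration and are not taken from Dogen's actual file:

```org
* Product backlog
** Modelling                                                       :epic:
*** Add support for C# code generation                            :story:
    Bullet-point notes, links and one-liner ideas accumulate here
    over time, groom after groom.
*** Improve diagram parsing error messages                        :story:
** Infrastructure                                                  :epic:
*** Speed up the Travis build                                     :story:
```

Order within a level is priority, reordering a story is just moving a subtree, and a tag search gives you a filtered view of any slice of the tree.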

A second decision which has proven to be a very good one has been to groom the product backlog very often. And by this I don't just mean a cursory look, but a deep inspection of all stories, fixing them where required. Again, the choice of format has proved very helpful:

  • it is easy to mark all stories as "non-reviewed" or some other suitable tag in Org Mode, and then unmark them as one finishes the groom - thereby ensuring all stories get some attention. As the product backlog becomes larger, a full groom could take multiple sprints, but this is not an issue once you understand its value and the cost of having it rot.
  • because the product backlog is with the code, any downtime can be used for grooming; those idle weekends or that long wait at the airport are perfect candidates to get a few stories looked at. Time spent waiting for the build is also a good candidate.
  • you get an HTML representation of the Org Mode file for free in GitHub, meaning you can read your backlog from your phone. And with the new editing functionality, you can also edit stories too.

Thirdly, I decided to take a "multi-pass" approach at managing the story lifecycle. These are some of the key aspects of this lifecycle management:

  • stories can only be captured if they are aligned with the vision. This filter saves me from adding all sorts of ideas which are just too "out of left field" to be of practical use, but keeps those that may sound crazy but are aligned with the vision.
  • stories can only be captured if there is no "prior art". I always perform a number of searches in the backlog to look for anything which covers similar ground. If found, I append to that.
  • new stories tend to start with very little content - just the minimum required to allow resetting state back to the idea I was trying to capture. Due to this, very little gets lost. At this point, we have a "proto-story".
  • as time progresses, I end up having more ideas on this space, and I update the story with those ideas - mainly bullet points with one liners and links.
  • at some point the story begins to mature; there is enough on it that we can convert the "proto-story" to a full blown story. After a number of grooms, the story becomes fully formed and is then a candidate to be moved to a sprint backlog for implementation. It may stay in this state ad infinitum, with periodic updates just to make sure it does not rot.
  • a candidate story can still get refined: trimmed in scope, re-targeted, or even cancelled because it no longer fits with the current architecture or even the vision. Cancelled stories are important because we may come back to them - it's just very unlikely that we do.
  • every sprint has a "sprint mission"5. When we start to move stories into the sprint backlog, we look for those which resonate with the sprint mission. Not all of them are fully formed, and the work on the sprint can entail the analysis required to create a full blown story. But many will be implementable directly off of the product backlog.
  • sometimes I end up finding related threads in multiple stories and decide to merge them. Merging of related stories is done by simply copying and pasting them into a single story; over time, with the multiple passes done in the grooms, we end up with a single consistent story.
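Org Mode can also encode this lifecycle directly if you want to. A sketch using custom TODO keywords - the state names here are mine, purely illustrative, and not Dogen's actual convention:

```org
#+TODO: PROTO(p) STORY(s) CANDIDATE(c) | DONE(d) CANCELLED(x)

* PROTO Generate documentation from the model
  One-liner capture; just enough to reset state back to the idea.
* CANDIDATE Improve error reporting in the parser
  Fully formed after several grooms; awaiting a matching sprint mission.
```

The `|` separates active states from terminal ones, so cancelled stories stay searchable without cluttering the active backlog.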

What all of this means is that a story can evolve over time in the product backlog, only to become the exact thing you need at a given sprint; at that point you benefit from the knowledge and insight gained over that long period of time. Some stories in Dogen's backlog have been there for years, and when I finally get to them, I find them extremely useful. Remember: they are a map to the unknown space you are exploring.

With all of this machinery in place, we've ended up with a very useful product backlog for Dogen - one that certainly adds a lot of value. Don't get me wrong: the cost of maintenance is high and I'd rather be coding instead of maintaining the product backlog, especially given the limited resources. But I keep it because I can see on a daily basis how much it improves the overall quality of the development process. It is a price I find worth paying, given what I get in return.

Final Thoughts

This post was an attempt to summarise some of the thoughts I've been having on the space of product backlogs. One of its main objectives was to try to convey the importance of this tool, and to provide ideas on how you can improve the management of your own product backlog by discussing the approach I have taken with Dogen.

If you have any suggestions or want to share your own tips on how to manage your product backlog, please reach me in the comments section - there is always room for improvement.

Footnotes:

1

Source: Scrum Product Backlog, Mountain Goat Software.

2

A topic which I covered some time ago here: On Evolutionary Methodology. It is also interesting to see how the kernel processes are organised for speed: How 4.4's patches got to the mainline.

3

Another topic which I also covered here some time ago: On Maintenance.

4

I am self-plagiarising a little bit here and rehashing some of the arguments I've used before in Lessons in Incremental Coding, mainly from section DVCS to the Core.

5

See the current sprint backlog for an example.

Created: 2016-01-17 Sun 23:55

Emacs 24.5.1 (Org mode 8.2.10)


Tuesday, December 22, 2015

Nerd Food: Dogen: The Package Management Saga

We've just gone past Dogen's Sprint 75, so I guess it's time for one of those "reminiscing posts" - something along the lines of what we did for Sprint 50. This one is a bit more practical though; if you are only interested in the practical side, keep scrolling until you see "Conan".

So, package management. Like any other part-time C++ developer whose professional mainstay is C# and Java, I have keenly felt the need for a package manager when in C++-land. The problem is less visible when you are working with mature libraries and dealing with just Linux, due to the huge size of the package repositories and the great tooling built around them. However, things get messier when you start to go cross-platform, and messier still when you are coding on the bleeding edge of C++: either the package you need is not available in the distro's repos or even PPAs; or, when it is, it's rarely at the version you require.

Alas, for all our sins, that's exactly where we were when Dogen got started.

A Spoonful of Dogen History

Dogen sprang to life just a tad after C++0x became C++11, so we experienced first-hand the highs of a quasi-new language followed by the lows of feeling the brunt of the bleeding edge pain. For starters, nothing we ever wanted was available out of the box, on any of the platforms we were interested in. Even Debian testing was a bit behind - probably stalled due to a compiler transition or other, but I can't quite recall the details. In those days, Real Programmers were Real Programmers and mice were mice: we had to build and install the C++ compilers ourselves and, even then, C++11 support was new, a bit flaky and limited. We then had to use those compilers to compile all of the dependencies in C++11 mode.

The PFH Days

After doing this manually once or twice, it soon stopped being fun. And so we solved this problem by creating the PFH - the Private Filesystem Hierarchy - a gloriously over-ambitious name to describe a set of wrapper scripts that helped with the process of downloading tarballs, unpacking, building and finally installing them into well-defined locations. It worked well enough in the confines of its remit, but we were often outside those, having to apply out-of-tree patches, adding new dependencies and so on. We also didn't use Travis in those days - not even sure it existed, but if it did, the rigmarole of the bleeding edge experience would certainly have put a stop to any ideas of using it. So we used a local install of CDash with a number of build agents on OSX, Windows (MinGW) and Linux (32-bit and 64-bit). Things worked beautifully when nothing changed and the setup was stable, but every time a new version of a library - or god forbid, of a compiler - was released, one had that sense of dread: do I really need to upgrade?

Since one of the main objectives of Dogen was to learn about C++11, one has to say that the pain was worth it. But all of the moving parts described above were not ideal, and they were certainly not the thing you want to be wasting your precious time on when it is very scarce. They were certainly not scalable.

The Good Days and the Bad Days

Things improved slightly for a year or two when distros started to ship C++11-compliant compilers and recent Boost versions. It was all so good we were able to move over to Travis and ditch almost all of our private infrastructure. For a while things looked really good. However, due to Travis' Ubuntu LTS policy, we were stuck with a rapidly ageing Boost version. At first PPAs were a good solution for this, but soon these became stale too. We also needed the latest CMake, as there are a lot of developments on that front, but we certainly could not afford (time-wise) to revert to the bad old days of the PFH. At the same time, it made no sense to freeze dependencies in time, providing a worse development experience. So the only route left was to break Travis and hope that some solution would appear. Some alternatives were tried, such as Drone.io, but none was successful.

There was nothing else for it; what was needed was a package manager to manage the development dependencies.

Nuget Hopes Dashed

Having used Nuget in anger for both C# and C++ projects, and given Microsoft's recent change of heart with regards to open source, I was secretly hoping that Nuget would get some traction in the wider C++ world. To recap, Nuget worked well enough in Mono; in addition, C++ support for Windows was added early on. It was somewhat limited and a bit quirky at the start, but it kept on getting better, to the point of usability. Trouble was, their focus was just Visual Studio.

Alas, nothing much ever came from my Nuget hopes. However, there have been a couple of recent announcements from Microsoft that make me think they will eventually look into this space.

Surely the logical consequence is to be able to manage packages in a consistent way across platforms? We can but hope.

Biicode Comes to the Rescue?

Nuget did not pan out but what did happen was even more unlikely: some crazy-cool Spaniards decided to create a stand-alone package manager. Being from the same peninsula, I felt compelled to use their wares, and was joyful as they went from strength to strength - including the success of their open source campaign. And I loved the fact that it integrated really well with CMake, and that CLion provided Biicode integration very early on.

However, my biggest problem with Biicode was that it was just too complicated. I don't mean to say the creators of the product didn't have very good reasons for their technical choices - lord knows creating a product is hard enough, so I have nothing but praise to anyone who tries. However, for me personally, I never had the time to understand why Biicode needed its own version of CMake, nor did I want to modify my CMake files too much in order to fit properly with Biicode and so on. Basically, I needed a solution that worked well and required minimal changes at my end. Having been brought up with Maven and Nuget, I just could not understand why there wasn't a simple "packages.xml" file that specified the dependencies and then some non-intrusive CMake support to expose those into the CMake files. As you can see from some of my posts, it just seemed it required "getting" Biicode in order to make use of it, which for me was not an option.

Another thing that annoyed me was the difficulty of knowing what the "real" version of a library was. I wrote, at the time:

One slightly confusing thing about the process of adding dependencies is that there may be more than one page for a given dependency and it is not clear which one is the "best" one. For RapidJson there are three options, presumably from three different Biicode users:

  • fenix: authored 2015-Apr-28, v1.0.1.
  • hithwen: authored 2014-Jul-30.
  • denis: authored 2014-Oct-09.

The "fenix" option appeared to be the most up-to-date so I went with that one. However, this illustrates a deeper issue: how do you know you can trust a package? In the ideal setup, the project owners would add Biicode support and that would then be the one true version. However, like any other project, Biicode faces the initial adoption conundrum: people are not going to be willing to spend time adding support for Biicode if there aren't a lot of users of Biicode out there already, but without a large library of dependencies there is nothing to draw users in. In this light, one can understand that it makes sense for Biicode to allow anyone to add new packages as a way to bootstrap their user base; but sooner or later they will face the same issues as all distributions face.

A few features would be helpful in the mean time:

  • popularity/number of downloads
  • user ratings

These metrics would help in deciding which package to depend on.

For all these reasons, I never found the time to get Biicode setup and these stories lingered in Dogen's backlog. And the build continued to be red.

Sadly, Biicode the company didn't make it either. I feel very sad for the guys behind it, because their heart was in the right place.

Which brings us right up to date.

Enter Conan

When I was a kid, we were all big fans of Conan. No, not the barbarian, the Japanese anime Future Boy Conan. For me the name Conan will always bring back great memories of this show, which we watched in the original Japanese with Portuguese subtitles. So I was secretly pleased when I found conan.io, a new package management system for C++. The guy behind it seems to be one of the original Biicode developers, so a lot of lessons from Biicode were learned.

To cut a long story short, the great news is I managed to add Conan support to Dogen in roughly 3 hours, with very minimal knowledge of Conan. This to me was a litmus test of sorts, because I have very little interest in package management - creating my own product has proven to be challenging enough, so the last thing I need is to divert my energy further. The other interesting thing is that roughly half of that time was taken up by trying to get Travis to behave, so it's not quite fair to impute it all to Conan.

Setting Up Dogen for Conan

So, what changes did I do to get it all working? It was a very simple 3-step process. First I installed Conan using a Debian package from their site.

I then created a conanfile.txt on my top-level directory:

[requires]
Boost/1.60.0@lasote/stable

[generators]
cmake

Finally I modified my top-level CMakeLists.txt:

# conan support
if(EXISTS "${CMAKE_BINARY_DIR}/conanbuildinfo.cmake")
    message(STATUS "Setting up Conan support.")
    include("${CMAKE_BINARY_DIR}/conanbuildinfo.cmake")
    CONAN_BASIC_SETUP()
else()
    message(STATUS "Conan build file not found, skipping include")
endif()

This means that it is entirely possible to build Dogen without Conan, but if it is present, it will be used. With these changes, all that was left to do was to build:

$ cd dogen/build/output
$ mkdir gcc-5-conan
$ cd gcc-5-conan
$ conan install ../../..
$ cmake ../../..
$ make -j5 run_all_specs

Et voilà, I had a brand spanking new build of Dogen using Conan. Well, actually, not quite. I've omitted a couple of problems that are a bit of a distraction from the Conan success story. Let's look at them now.

Problems and Their Solutions

The first problem was that Boost 1.59 does not appear to have an overridden FindBoost, which meant that I was not able to link. I moved to Boost 1.60 - which I wanted to do anyway - and it worked out of the box.

The second problem was that Conan seems to get confused with Ninja, my build system of choice. For whatever reason, when I use the Ninja generator, it fails like so:

$ cmake ../../../ -G Ninja
$ ninja -j5
ninja: error: '~/.conan/data/Boost/1.60.0/lasote/stable/package/ebdc9c0c0164b54c29125127c75297f6607946c5/lib/libboost_system.so', needed by 'stage/bin/dogen_utility_spec', missing and no known rule to make it

This is very strange because boost system is clearly available in the Conan download folder. Using make solved this problem. I am going to open a ticket on the Conan GitHub project to investigate this.

The third problem is more Boost-related than anything else. Boost Graph has not been as well maintained as it should be, really. Thus users now find themselves carrying patches, all because no one seems to be able to apply them upstream. Dogen is in this situation, as we've hit the issue described here: Compile error with boost.graph 1.56.0 and g++ 4.6.4. Sadly this is still present in Boost 1.60; the patch exists in Trac but remains unapplied (#10382). This is a tad worrying, as we make a lot of use of Boost Graph and intend to increase that usage in the future.

At any rate, as you can see, none of the problems were showstoppers, nor can they all be attributed to Conan.

Getting Travis to Behave

Once I got Dogen building locally, I then went on a mission to convince Travis to use it. It was painful, but mainly because of the lag between commits and hitting an error. The core of the changes to my YML file were as follows:

install:
<snip>
  # conan
  - wget https://s3-eu-west-1.amazonaws.com/conanio-production/downloads/conan-ubuntu-64_0_5_0.deb -O conan.deb
  - sudo dpkg -i conan.deb
  - rm conan.deb
<snip>
script:
  - export GIT_REPO="`pwd`"
  - cd ${GIT_REPO}/build
  - mkdir output
  - cd output
  - conan install ${GIT_REPO}
  - hash=`ls ~/.conan/data/Boost/1.60.0/lasote/stable/package/`
  - cd ~/.conan/data/Boost/1.60.0/lasote/stable/package/${hash}/include/
  - sudo patch -p0 < ${GIT_REPO}/patches/boost_1_59_graph.patch
  - cmake ${GIT_REPO} -DWITH_MINIMAL_PACKAGING=on
  - make -j2 run_all_specs
<snip>

I probably should have a bash script by now, given the size of the YML, but hey - if it works. The changes above deal with installing the package, applying the Boost patch and using Make instead of Ninja. Quite trivial in the end, even if it took a lot of iterations to get there.

Conclusions

Having a red build is a very distressing event for a developer, so you can imagine how painful it has been to have red builds for several months. So it was with unmitigated pleasure that I got to see build #628 in a shiny emerald green. As far as that goes, Conan has been a success.

In a broader sense though, what can we say about Conan? There are many positives to take home, even at this early stage of Dogen usage:

  • it is a lot less intrusive than Biicode and easier to set up. Biicode was very well documented, but it was easy to stray from the beaten track, and doing so meant reading a lot of different wiki pages. It seems easier to stay on the beaten track with Conan.
  • as with Biicode, it seems to provide solutions for Debug/Release builds and for multiple platforms and compilers. We shall be testing it on Windows soon and will report back.
  • hopefully, since it has been Open Source from the beginning, a community of developers with the know-how required to maintain it will form around the source. It would also be great to see a business form around it, since someone will have to pay the cloud bill.

In terms of negatives:

  • I still believe the most scalable approach would have been to extend NuGet for the C++ Linux use case, since Microsoft is willing to take patches and since they foot the bill for the public repo. However, I can understand why one would prefer total control over the solution rather than depending on the whims of some middle-manager in order to commit.
  • it seems publishing packages requires getting down into Python. I haven't tried it yet, but I'm hoping it will be made as easy as importing packages is with a simple text file. The more complexity the tool adds around these flows, the less likely they are to be used.
  • there still are no "official builds" from projects. As explained above, this is a chicken and egg problem, because people are only willing to dedicate time to it once there are enough users complaining. Having said that, since Conan is easy to setup, one hopes to see some adoption in the near future.
  • even when using a GitHub profile, one still has to define a Conan-specific password. This was not required with Biicode. A minor pain, but still: if they want to increase traction, this is probably an unnecessary stumbling block. It was sufficient to make me think twice about setting up a login, for one.
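To give a flavour of the second point - what "getting down into Python" means in practice - here is a sketch of a Conan recipe (conanfile.py), loosely based on the examples in Conan's documentation at the time. The package name, version and library name are hypothetical, and the helper calls may differ between Conan versions:

```python
# Hypothetical sketch of a Conan recipe; names and versions are illustrative,
# not an actual published Dogen package.
from conans import ConanFile, CMake

class DogenConan(ConanFile):
    name = "dogen"        # hypothetical package name
    version = "0.1.0"     # hypothetical version
    settings = "os", "compiler", "build_type", "arch"
    requires = "Boost/1.60.0@lasote/stable"  # the Boost package used above

    def build(self):
        # drive the project's own CMake build, honouring the Conan settings
        cmake = CMake(self.settings)
        self.run("cmake . %s" % cmake.command_line)
        self.run("cmake --build . %s" % cmake.build_config)

    def package_info(self):
        self.cpp_info.libs = ["dogen"]  # hypothetical library name
```

Compare this with consuming packages, which only needs a simple text file - hence the hope that publishing becomes equally simple.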

In truth, these are all very minor negative points, but still worth making. All in all, I am quite pleased with Conan thus far.

Created: 2015-12-22 Tue 14:00

Emacs 24.5.1 (Org mode 8.2.10)


Monday, December 21, 2015

Nerd Food: Interesting...


Time to flush all those tabs again. Some interesting stuff I bumped into recently-ish.

Databases

  • What's new in PostgreSQL 9.5: The RCs are starting, and 9.5 looks to continue the trend of amazing Postgres releases. My only remaining wish is for native (and full) support for bitemporality, though to be fair Temporal Tables are probably enough for my needs.

C++

  • Optimizing software in C++: One to bookmark now but to digest later. A whole load of stuff on optimisation.
  • Support for Android CMake projects in Visual Studio: So, as if the latest Clang patches hadn't been enough, MS now decides to add support for CMake in Visual Studio. A bit embryonic, and a bit too Android-focused, but surely it should be extensible to more regular C++ use. What's going on at MS? This is all far too cool to be true.
  • Quickly Loading Things From Disk: interesting analysis about the state of affairs of serialisation in C++. I'll probably require a few passes to fully digest it.
  • Beyond ad-hoc automation: leveraging structured platforms: I've been consuming this presentation slowly but steadily. It deals with a lot of the questions we all have about the new world of containers and microservices, and it seems vital to learn from experience before one finds oneself in a much bigger mess than the monolith could ever get you into. Bridget Kromhout talks intelligently about the subject.


Created: 2015-12-21 Mon 23:31


Friday, December 11, 2015

Nerd Food: Pull Request Driven Development


Having been in this game for the best part of twenty years, I must confess that it's not often I find something that revolutionises my coding ways. I do tend to try a lot of things, but most of them end up revealing themselves as fads or are incompatible with my flow. For instance, I never managed to get BDD to work for me, try as I might. I will keep trying because it sounds really useful, but it hasn't clicked just yet.

Having said all of that, these moments of enlightenment do occasionally happen, and when they do, nothing beats that life-changing feeling. "Pull Request Driven Development" (or PRDD) is my latest find. I'll start by confessing that "PRDD" as a name was totally made up for this post, and hopefully you can see it's rather tongue-in-cheek. However, the benefits of this approach are very real. In fact, I've been using PRDD for a while now, but I had never really noticed its presence creeping in. Today, as I introduced a new developer to the process, I finally had the eureka moment and saw just how brilliant it has been thus far. It also made me realise that some people are not aware of this great tool in the developer's arsenal.

But first things first. In order to explain what I mean by PRDD, I need to provide a bit of context. Everyone is migrating to git these days, even those of us locked behind corporate walls; in our particular case, the migration path implied exposure to Git Stash. For those not in the know, picture it as an expensive and somewhat less featureful version of GitHub, but with most of the core functionality there. Of course, I'm sure GitHub is not that cheap for enterprises either, but hey, at least it's the tool everyone uses. Anyway - grumbling or not - we moved to Stash and all development started to revolve around Pull Requests (PRs), raised for each new feature.

Not long after PRs were introduced, a particularly interesting habit started to appear: developers began opening PRs earlier and earlier in the feature cycle, rather than waiting until the very end. Taking this approach to the limit, the idea is that when you start work on a new feature, you raise the ticket and the PR before you write any code at all. In practice - due to Stash's anachronisms - you need to push at least one commit, but the general notion is valid. This was never mandated anywhere, and there was no particular coordination. I guess one possible explanation for this behaviour is that one wants to get rid of the paperwork as quickly as possible to get to the coding. At any rate, the causes may be obscure but the emerging behaviour was not.

When you combine early PRs with the commit-early-and-commit-often approach - which you should be using anyway - the PR starts to become a living document; people see your development work as it progresses, and they start commenting on it and possibly even sending you patches as you go along. In a way, this is an enabler for a very efficient kind of pair programming - particularly if you have a tightly knit team - because it gives you maximum parallelism in a very subtle, unobtrusive way. The main author of the PR is coding as she normally would, but whenever there is a lull in development - those moments where you'd be browsing the web for five minutes or so - she can quickly check for comments on the PR and react to them. Similarly, other developers can carry on doing their own work and browse the PRs in their downtime; this allows them to provide feedback whenever it is convenient to them, and to choose the format of the feedback - lengthy or quick, as time permits.

Quick feedback is often invaluable in large code bases because everyone tends to know their own little corner of the code, and only a very few old hands know how it all hangs together. Thus, seemingly trivial one-liners such as "have you considered using API xyz instead of rolling your own" or "don't forget to do abc when you do that" can save you many hours of pain and enable knowledge to be transferred organically - something that no number of wiki pages could hope to achieve in a million years, because it's very difficult to find these pearls in a sea of uncurated content. And because you committed early and often, each commit is very small and very easy to parse in a small interval of time, so people are much more willing to review - as opposed to that several-Kb (or even Mb!) patch that they would have to allocate a day or two for. Further: if you take your commit messages seriously - as, again, you should - you will find that the number of reviewers grows rapidly, simply because developers are nosy and opinionated.

Note that this review process involves no vague meetings and no lengthy, unfocused email chains; it is very high-quality because it is (or can be) focused on specific lines of code; it causes no unwanted disruptions because you review where and when you choose to review; reviewers can provide examples and even fix things themselves if they so choose; it is totally inclusive because anyone who wants to participate can, but no one is forced to; and it equalises local and remote developers because they all have access to the same data (modulo some IRL conversations that always take place) - an important feature in this world of near-shoring, off-shoring and home-working. Most importantly, instead of finding out about some fundamental error of approach at the end of an intense period of coding, you now have timely feedback. This saves an enormous amount of time - an advantage that anyone who has been through lengthy code reviews and then spent a week or two reacting to the feedback can appreciate.

I am now a believer in PRDD. So much so that whenever I go back to work on legacy projects in svn, I find myself cringing all the way to the end of the feature. It just feels so nineties.

Update: As I finished penning this post and started reflecting on it, it suddenly dawned on me that a lot of the things we now take for granted are only possible because of git. And I don't mean DVCSs in general; I specifically mean git. For example, PRDD is made possible to a large extent because committing in git is a reversible process and history can be fluid if required. This means people are not afraid of committing, which in turn enables a lot of the goodness I described above. Many DVCSs didn't like this way of viewing history - and, to be fair, I know of very few people who liked the idea until they started using it. Once you figure out what it is good for (and not so good for), it suddenly becomes an amazing tool. Git is full of little decisions like this that at first sight look either straight insane or just not particularly useful, but then turn out to change entire development flows.

Created: 2015-12-11 Fri 13:12


Wednesday, December 09, 2015

Nerd Food: Interesting...


Time to flush all those tabs again. Some interesting stuff I bumped into recently-ish.


C++

  • New ELF Linker from the LLVM Project: LLVM keeps on delivering! Now a new ELF linker. To be totally honest, I haven't even started using Gold in anger - I get the feeling the LLVM linker is going to be transitioned in much quicker than Gold.
  • Clang with Microsoft CodeGen in VS 2015 Update 1: OMG, OMG how cool is this - MSFT decided to create a backend for Clang that is totally compatible with MSVC AND open source it! This is just insane. This means for example that you now can develop C++ on Windows without ever having to use MSVC and Visual Studio. It also means you can cross-compile from Linux into Windows with 100% certainty things will work. It means that projects like Wine and ReactOS can start thinking about a migration path into Clang (not quite as simple as it may sound but surely makes sense). CLion with Clang on Windows will rock. The possibilities are just endless. I never quite understood what C2 was all about until I read this announcement - suddenly it all makes sense. This is fantastic news.

Other

  • NoiseRV Live: Still discovering this Portuguese musician, but love his work. Great concert. He could do a little less talking between songs, but still - artist's prerogative and all that.
  • Warm Focus: Winging It: Interesting set of "intelligent dance music" as we used to call it back in the day.
  • Mosaic - The “First” Web Browser: Super-cool podcasts about internet history. It would be great to have something like this for UNIX!
  • Jackson C. Frank (1965): Tragic musician from the 60s. Great tunes.
  • Reason in common sense: Always wanted to read Santayana properly. Started, but I guess it will be a very long exercise. Interesting, if somewhat strange book.
  • Ceu - jazz baltica Live (2010): New find, Brazilian musician Ceu.

Created: 2015-12-09 Wed 12:49


Monday, November 30, 2015

Nerd Food: Tooling in Computational Neuroscience - Part II: Microscopy


Research is what I'm doing when I don't know what I'm doing.
Wernher von Braun

Welcome to the second instalment of our second series on Computational Neuroscience for lay people. You can find the first post of the previous series here, and the first post of the current series here. As you'd expect, this second series is slightly more advanced, and, as such, it is peppered with unavoidable technical jargon. Having said that, we shall continue to pursue our ambitious target of making things as easy to parse as possible (but no easier). If you read the first series, the second should hopefully make some sense.1

Our last post discussed Computational Neuroscience as a discipline, and the kind of things one may want to do in this field. We also spoke about models and their composition, and the desirable properties of a platform that runs simulations of said models. However, it occurred to me that we should probably build some kind of "end-to-end" understanding; that is, by starting with the simulations and models we are missing a vital link with the physical (i.e. non-computational) world. To put matters right, this part attempts to provide a high-level introduction on how data is acquired from the real world and can then be used - amongst other things - to inform the modeling process.

Macro and Micro Microworlds

For the purposes of this post, the data gathering process starts with the microscope. Of course, keep in mind that we are focusing only on the morphology at present - the shape and the structures that make up the neuron - so we are ignoring other important activities in the lab. For instance, one can conduct experiments to measure voltage in a neuron, and these measurements provide data for the functional aspects of the model. Alas, we will skip these for now, with the promise of returning to them at a later date2.

So, microscopes then. Microscopy is the technical name for the observation work done with the microscope. Because neurons are so small - some 4 to 100 microns in size - only certain types of microscopes are suitable for neuronal microscopy. To make matters worse, the sub-structures inside the neuron are an important area of study and they can be ridiculously small: a dendritic spine - one of the minute protrusions that come out of the dendrites - can be as tiny as 500 nanometres; the lipid bilayer itself is only 2 or 3 nanometres thick, so you can imagine how incredibly small ion channels and pumps are. Yet these are some of the things we want to observe and measure. Let's call this the "micro" work. On the other hand, we also want to understand connectivity and other larger structures, as well as observe the evolution of the cell and so on. Let's call this the "macro" work. These are not technical terms, by the by; they are just here so we can orient ourselves. So, how does one go about observing these differently sized microworlds?

Figure 1: Example of measurements one may want to perform on a dendrite. Source: Reversal of long-term dendritic spine alterations in Alzheimer disease models

Optical Microscopy

The "macro" work is usually done using the Optical "family" of microscopes, which is what most of us think of when hearing the word microscope. As it was with Van Leeuwenhoek's tool in the sixteen hundreds, so it is today: optical microscopes still rely on light and lenses to perform observations. Needless to say, things have evolved a fair bit since then, but standard optical microscopy has not completely thrown off the shackles of its limitations. These are of three kinds, as Wikipedia helpfully tells us: a) the objects we want to observe must be dark or strongly refracting - a problem, since the internal structures of the cell are transparent; b) visible light's diffraction limit means that we cannot go much lower than 200 nanometres - pretty impressive, but unfortunately not quite low enough for detailed sub-structure analysis; and c) out-of-focus light hampers image clarity.
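To put that 200 nanometre diffraction limit in perspective, we can check it against the structure sizes mentioned earlier. The sizes come from the text; the comparison itself is just arithmetic:

```python
# Which of the structures mentioned above can conventional optical
# microscopy resolve? Sizes in nanometres, as quoted in the text.
diffraction_limit_nm = 200

structures_nm = {
    "neuron soma (small end)": 4000,       # ~4 microns
    "dendritic spine (small end)": 500,
    "lipid bilayer": 3,
}

# largest first, so the output reads from "easy" to "impossible"
for name, size in sorted(structures_nm.items(), key=lambda kv: -kv[1]):
    verdict = "resolvable" if size >= diffraction_limit_nm else "below the limit"
    print("%s: %s" % (name, verdict))
```

Even a small spine sits just above the limit, while the bilayer and anything embedded in it are far beyond optical reach - hence the need for electron microscopy below.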

Workarounds to these limitations have been found in the guise of techniques that augment the abilities of standard optical microscopy. There are many such techniques: Confocal Microscopy3, which improves resolution and contrast; the Fluorescence Microscope, which uses a sub-diffraction technique to reconstruct some of the detail missing due to diffraction; or the incredible-looking movies produced by Multiphoton Microscopy. And of course, it is possible to combine multiple techniques in a single microscope, as is the case with the Multiphoton Fluorescence Microscopes (MTMs) and many others.

In fact, given all of these developments, it seems there is no sign of optical microscopy dying out. Presumably some of this is due to the relative lower cost of this approach as well as to the ease of use. In addition, optical microscopy is complementary to the other more expensive types of microscopes; it is the perfect tool for "macro" work that can then help to point out where to do "micro" work. For example, you can use an optical microscope to assess the larger structures and see how they evolve over time, and eventually decide on specific areas that require more detailed analysis. And when you do, you need a completely different kind of microscope.

Electron Microscopy

When you need really high resolution, there is only one tool to turn to: the Electron Microscope (EM). This crazy critter can provide insane levels of magnification by using a beam of electrons instead of visible light. Just how insane, you ask? Well, if you consider that an optical microscope lives in the range of 1500x to 2000x - that is, it can magnify a sample up to two thousand times - an EM can magnify as much as 10 million times, and provide sub-nanometre resolution4. It is mind-boggling. In fact, we've already seen images of atoms using EM in part II, but perhaps it wasn't easy to appreciate just how amazing a feat that is.

Of course, EM is itself a family - and a large one at that, with many and diverse members. As with optical microscopy, each member of the family specialises in a given technique or combination of techniques. For example, the Scanning Electron Microscope (SEM) performs a scan of the object under study, and has a resolution of 1 nanometre or better; the Scanning Confocal Electron Microscope (SCEM) uses the same confocal technique mentioned above to provide higher depth resolution; and Transmission Electron Microscopy (TEM) has the ability to penetrate inside the specimen during the imaging process, given samples with a thickness of 100 nanometres or less.

A couple of noteworthy points are required at this juncture. First, whilst some of these EM techniques may sound new and exciting, most have been around for a very long time; they just keep getting better as they mature. For example, TEM was used in the fifties to show that neurons communicate over synaptic junctions, but it's still wildly popular today. Secondly, it's important to understand that the entire imaging process is not at all trivial - certainly not for TEM, nor for EM in general, and probably not for Optical Microscopy either. It is a very labour-intensive and very specialised process - most likely done by an expert human neuroanatomist - and the difficulties range from the chemical preparation of the samples all the way up to creating the images. The end product may give the impression it was easy to produce, but easy it was not.

At any rate, whatever the technical details, the fact is that the imagery that results from all these advances is truly evocative - haunting, even. Take this image produced by SEM:

Personally, I think it is incredibly beautiful; simultaneously awe-inspiring and depressing because it really conveys the messiness and complexity of wetware. By way of contrast, look at the neatness of man-made micro-structures:

Figure 3: The BlueGene/Q chip. Source: IBM plants transactional memory in CPU

Stacks and Stacks of 'Em

Technically, pictures like the ones above are called micrographs. As you can see in the neuron micrograph, these images provide a great visual description of the topology of the object we are trying to study. You may also notice a slight coloration of the cell in that picture. This is most likely because the people doing the analysis stained the neuron to make it easier to image. Now, in practice - at least as far as I have seen, which is not very far at all, to be fair - researchers prefer 2D grayscale images to the nice, Public-Relations-friendly pictures like the one above; those appear to be more useful for magazine covers. The working micrographs are not quite as exciting to the untrained eye, but they are very useful to the professionals. Here's an example:

Figure 4: The left-hand side shows the original micrograph; the right-hand side shows the result of processing it with machine learning. Source: Deep Neural Networks Segment Neuronal Membranes in Electron Microscopy Images

Let's focus on the left-hand side of this image for the moment. It was taken using ssTEM - serial-section TEM, an evolutionary step in TEM. The ss part of ssTEM helps in creating stacks of images, which is why you see the little drawings on the left of the picture; they are there to give you the idea that the top-most image is one of 30 in a stack5. The process of producing the images above was as follows: the researchers started off with a neuronal tissue sample, which was prepared for observation. The sample was 1.5 micrometres thick and was then sectioned into 30 slices of 50 nanometres each. Each of these slices was imaged at a resolution of 4x4 nanometres per pixel.
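These numbers hang together nicely, and a little arithmetic makes them concrete. The sample thickness, slice thickness and pixel size come from the text; the 2x2 micron field of view is an assumption for illustration only:

```python
# Sanity-checking the ssTEM stack numbers quoted above. The sample
# thickness, slice thickness and pixel size come from the text; the
# 2x2 micron field of view is assumed purely for illustration.
sample_thickness_nm = 1500  # 1.5 micrometres
slice_thickness_nm = 50
pixel_size_nm = 4
field_of_view_nm = 2000     # assumed 2x2 micron field

slices = sample_thickness_nm // slice_thickness_nm
pixels_per_side = field_of_view_nm // pixel_size_nm

print(slices)           # 30 - matching the stack described in the text
print(pixels_per_side)  # 500 pixels per side under the assumed field of view
```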

As you can imagine, this work is extremely sensitive to measurement error. The trick is to ensure there is some kind of visual continuity between images so that you can recreate a 3D model from the 2D slices. This means, for instance, that if you are trying to figure out connectivity, you need some way to relate a dendrite to its soma and, say, to the axon of the neuron it connects to - and that's one of the reasons why the slices have to be so thin. It would be no good if the pictures missed this information out, as you would not be able to recreate the connectivity faithfully. This is actually really difficult to achieve in practice due to the minute sizes involved: a slight tremor that displaces the sample by a few nanometres causes shifts in alignment, and even with the high precision of the tools, you can imagine that there is always some movement in the sample's position as part of the slicing process.

Images in a stack are normally stored using traditional formats such as TIFF6. You can see an example of the raw images in a stack here. It's worth noticing that, even though the images are 2D grey-scale, since each pixel covers only a few nanometres (4x4 in this case), the full size of an image is very large. Indeed, the latest generation of microscopes produces stacks in the 500-terabyte range, making the processing of the images a "big data" challenge.
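A rough calculation shows how quickly the volumes add up. Grey-scale pixels are cheap (one byte each), but at a few nanometres per pixel a large field of tissue is enormous; the 1x1 mm field below is illustrative, not from any specific instrument:

```python
# Why image stacks become a "big data" problem: one byte per grey-scale
# pixel, a few nanometres per pixel, large fields of tissue. The 1x1 mm
# field is an assumption for illustration.
pixel_size_nm = 4
bytes_per_pixel = 1          # 8-bit grey-scale
field_nm = 1_000_000         # assume a 1x1 mm field of tissue

pixels_per_side = field_nm // pixel_size_nm
bytes_per_slice = pixels_per_side ** 2 * bytes_per_pixel
slices_for_500tb = int(500e12) // bytes_per_slice

print(bytes_per_slice // 10**9)  # 62 - roughly 62.5 GB per single slice
print(slices_for_500tb)          # 8000 slices already reaches 500 TB
```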

What To Do Once You Got the Images

But back to the task at hand. Once you have the stack, the next logical step is to figure out what's what: which objects are in the picture. This is called segmentation and labelling, presumably because you are breaking one big monolithic picture into discrete objects and giving them names. Historically, segmentation has been done manually, but it's a painful, slow and error-prone process. Because of this, there is a lot of interest in automation, which has only recently become feasible - what with the abundance of cheap computing resources as well as the advent of "useful" machine learning (rather than the theoretical variety). Cracking this puzzle is gaining traction amongst the programming herds, as you can see from the popularity of challenges such as this one: Segmentation of neuronal structures in EM stacks challenge - ISBI 2012. It is from this challenge that we sourced the stack and micrograph above; the right-hand side is the finished product after machine learning processing.
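To make the two terms concrete, here is a toy version of segmentation and labelling: threshold a tiny synthetic "micrograph" and give each connected bright region its own label. Real pipelines work on gigapixel EM images and increasingly use machine learning, so this is only a sketch of the underlying idea:

```python
# Toy segmentation and labelling: threshold a grid of intensities and
# flood-fill each above-threshold connected region with a unique label.
def label_components(image, threshold):
    """Return (labels, count): a grid where each above-threshold
    connected region gets its own positive integer label."""
    rows, cols = len(image), len(image[0])
    labels = [[0] * cols for _ in range(rows)]
    current = 0
    for r in range(rows):
        for c in range(cols):
            if image[r][c] >= threshold and labels[r][c] == 0:
                current += 1  # new object found: flood-fill it
                stack = [(r, c)]
                while stack:
                    y, x = stack.pop()
                    if 0 <= y < rows and 0 <= x < cols \
                            and image[y][x] >= threshold and labels[y][x] == 0:
                        labels[y][x] = current
                        stack += [(y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)]
    return labels, current

# A made-up 3x5 "micrograph": two bright structures on a dark background.
micrograph = [
    [0, 9, 9, 0, 0],
    [0, 9, 0, 0, 8],
    [0, 0, 0, 8, 8],
]
labels, count = label_components(micrograph, threshold=5)
print(count)  # 2 - two separate structures were segmented and labelled
```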

There are also open source packages to help with segmentation. A couple of notable contenders are Fiji and Ilastik. Below is a screenshot of Ilastik.

Figure 5: Source: Ilastik gallery.

An activity that naturally follows on from segmentation and labelling is reconstruction. The objective is to "reconstruct" the morphology given the images in the stack. This can involve inferring missing information by mathematical means, or any other kind of analysis that transforms the set of discrete objects spotted by segmentation into something looking more like a bunch of connected neurons.

Once we have a reconstructed model, we can start performing morphometric analysis. As Wikipedia tells us, morphometry is "the quantitative analysis of form"; as you can imagine, there are a lot of useful things one may want to measure in brain structures and sub-structures, such as lengths, volumes, surface areas and so on. Some of these measurements can of course be done in 2D, but life is easier if the model is available in 3D. One such tool is NeuroMorph, an open source extension written in Python for the popular open source 3D computer graphics software Blender.
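To make "quantitative analysis of form" concrete, here is morphometry in miniature: treating a reconstructed dendrite as a polyline of 3D points with a uniform radius, we can estimate its length and (cylinder-approximated) volume. The coordinates and radius below are made up for illustration:

```python
# Toy morphometric measurements on a reconstructed dendrite, modelled as
# a polyline of 3D points (in microns) with an assumed uniform radius.
import math

def polyline_length(points):
    """Total length of a polyline given as a list of 3D points."""
    return sum(math.dist(p, q) for p, q in zip(points, points[1:]))

def cylinder_volume(length, radius):
    """Volume of the dendrite, approximated as a single cylinder."""
    return math.pi * radius ** 2 * length

# Made-up reconstruction: two segments of length 5 and 12 microns.
dendrite = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (3.0, 4.0, 12.0)]
length = polyline_length(dendrite)
volume = cylinder_volume(length, 0.5)  # assumed 0.5 micron radius

print(round(length, 1))  # 17.0 microns of dendrite
```

Tools like NeuroMorph do this (and far more) on real meshes rather than polylines, but the principle - turning reconstructed geometry into numbers - is the same.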

Conclusion

This post was a bit of a whirlwind tour of some of the sources of real-world data for Computational Neuroscience. As I soon found out, each of these sections could easily have been ten times bigger and still not provide a proper overview of the landscape; having said that, I hope the post at least gives some impression of the terrain and its main features.

From a software engineering perspective, it's worth pointing out the lack of standardisation in information exchange. In an ideal world, one would want a pipeline with components performing each step of the complete process: from data acquisition off a microscope (either optical or EM), through segmentation, labelling and reconstruction, and finally to morphometric analysis. This would then be used as an input to the models. Alas, no such overarching standard appears to exist.

One final point in terms of Free and Open Source Software (FOSS). On the one hand, it is encouraging to see the large number of FOSS tools and programs being used. Unfortunately - at least for lovers of Free Software - some widely used tools are proprietary, such as NeuroLucida. Since the software is so specialised, the fear is that the better-funded commercial enterprises will take over more and more of the space in the future.

That's all for now. Don't forget to tune in for the next instalment!

Footnotes:

1

As it happens, what we are doing here is applying a well-established learning methodology called the Feynman Technique. I was blissfully unaware of its existence all this time, even though Feynman is one of my heroes and I had read a fair bit about the man. On this topic (and the reason why I came to know about the Feynman Technique), it's worth reading Richard Feynman: The Difference Between Knowing the Name of Something and Knowing Something, where Feynman discusses his disappointment with science education in Brazil. Unfortunately the Portuguese and Brazilian teaching systems have a lot in common - or at least they did when I was younger.

2

Nor is the microscope the only way to figure out what is happening inside the brain. For example, there are neuroimaging techniques which can provide data about both structure and function.

3

Patented by Marvin Minsky, no less - yes, he of Computer Science and AI fame!

4

And, to be fair, "sub-nanometre" doesn't quite capture just how low these things can go. For an example, read Electron microscopy at a sub-50 pm resolution.

5

For a more technical but yet short and understandable take, read Uniform Serial Sectioning for Transmission Electron Microscopy.

6

On the topic of formats: it's probably time we mentioned the Open Microscopy Environment (OME). The microscopy world is dominated by hardware, and as such it's the perfect environment for corporations, their proprietary formats and expensive software packages. The OME folks are trying to buck the trend by creating a suite of open source tools and protocols and, looking at some of their output, they seem to be doing alright.

Created: 2015-11-30 Mon 23:12
