Friday, June 17, 2016

Nerd Food: Interesting...

Time to flush all those tabs again. Some interesting stuff I bumped into recently-ish.

Finance, Economics, Politics

  • Understanding Growth, part 1: looks very promising although I've only started parsing it. Also pointed me to - Tomas Sedlacek and the Economics of Good and Evil. Bought the book, but still reading it. Seems very thoughtful.
  • Here’s How Electric Cars Will Cause the Next Oil Crisis: Extremely interesting take on the relationship between electric cars and the oil price. It's along the lines of articles posted in the past, to be fair, but still. Basically, it won't take a huge number of sales of electric cars to start knocking down the oil price. And with Model 3 coming out, this all seems quite ominous to the oil producing countries. Here we go again, Angola.
  • Red Hat becomes first $2b open-source company: I may not use their wares any more but Red Hat will always be one of my favourite companies. Really happy to see they are growing nicely and hopefully continuing all of their incredible investment in Linux.
  • The Amazon Tax: Really, really good article about Amazon and their strategy. If you read only one, read this. Amazon is amazing - and its dominance is very worrying because they are so good at executing! See also Bezos letter.
  • It’s a Tesla: Great article about Tesla. Some of the usual fanboyism we all know and love, of course, but still a lot of very good points. The core of the article is an interesting comparison between Tesla and Apple. By the by, not at all convinced about that dashboard and the launch ceremony itself was a bit sparse too! But, Model 3 looks great. I'm officially a Stratechery fanboy now.
  • Google’s Alphabet Transition Has Been Tougher Than A-B-C: Great article on the pains of moving from a single monolithic structure to something more distributed. In truth, what would one expect with such a seismic change? And, also, how come it took Google so long to make this shift? After all, programmers are supposedly taught how important separation of concerns is. The other very interesting point is the CEO difficulties. These guys were able founders (at least able enough to get bought out by Google) but seem to fail badly at the CEO'ing malarkey.

Startups et al.

General Coding

  • Water treatment plant hacked, chemical mix changed for tap supplies: this is a tad worrying. Can you imagine the number of systems out there with vulnerabilities, etc - many of which are connected to the internet?
  • On the Impending Crypto Monoculture: Talking about security, very worrying news from the crypto front. It seems our foundations are much less solid than expected - and after all the OpenSSL bugs, this is a surprising statement indeed. Very interesting email on the subject. The LWN article is a must read too.
  • Neural Networks Demystified - Part 1: Data and Architecture: just started browsing this in my spare time, but it looks very promising. For the layperson.
  • Microsoft deletes 'teen girl' AI after it became a Hitler-loving sex robot within 24 hours: friggin' hilarious in a funny-not-funny sort of way. This tweet said it best: "Tay" went from "humans are super cool" to full nazi in <24 hrs and I'm not at all concerned about the future of AI. – Gerry
  • Abandoning Gitflow and GitHub in favour of Gerrit: I've always wanted to know more about Gerrit but never seem to find the time. The article explains it to my required extent, contrasting it with the model I'm more familiar with - GitHub, forks and pull requests. I must say, still not convinced about Gerrit, but having said that, it seems there is definitely scope for some kind of hybrid between the two. A lot of the issues they mention in the article are definitely pain points for GitHub users.
  • Introducing DGit: OK this one is a puzzling post, from our friends at GitHub engineering. I'm not sure I get it at all, but seems amazing. Basically, they talk about all the hard work they've done to make git distributed. Fine, I'm jesting - but not totally. The part that leaves no doubts is that GitHub as a whole is a lot more reliable after this work and can handle a lot more traffic - without increasing its hardware requirements. Amazing stuff.

Databases

C++

  • Compiler Bugs Found When Porting Chromium to VC++ 2015: great tales from the frontline. Also good to hear that MS is really responsive to bug reports. Can't wait to be able to build my C++ 14 code on Windows…
  • EasyLambda: C++ 14 library for data processing. Based on MPI though. Still, seems like an interesting find.

Layperson Science

Other

Created: 2016-06-17 Fri 10:56

Emacs 24.5.1 (Org mode 8.2.10)

Thursday, June 16, 2016

Nerd Food: The Strange Case of the Undefined References

As a kid, I loved reading Sherlock Holmes and Poirot novels. Each book got me completely spellbound, totally immersed and pretty much unable to do anything else until I finally found out whodunnit. Somehow, the culprits were never the characters I suspected. Debugging and troubleshooting difficult software engineering problems is a lot like the plot of a crime novel: in both cases you are trying to form a mental picture of something that happened, with very incomplete information - the clues; in both cases, experience and attention to detail are crucial, with many a wrong path taken before the final eureka moment; and, in both cases too, there is this overwhelming sense of urgency in figuring out whodunnit. Of course, unlike a crime novel, we'd all prefer not having to deal with these kinds of "interesting" issues, but you don't choose the problems - they choose you.

I recently had to deal with one such problem, which annoyed me to no end until I finally fixed it. It was so annoying I decided it was worth blogging about - if nothing else, it may save other people from the same level of pain and misery.

A bit of context for those that are new here. Dogen is a pet project that I've been maintaining for a few years now. Like many other C++ projects, it relies on the foundational Boost libraries. To be fair, we rely on other stuff as well - libraries such as LibXML2 and so on - but Boost is our core C++ dependency and the only one where latest is greatest, so it tends to cause us the most problems. I've covered my past woes in terms of dependency management and how happy I was to find Conan. And so it was that life was bliss for a number of builds, until one day…

It All Started With a Warning

It was a rainy day and I must have been bored because I noticed a rather innocuous-looking warning on my Travis build, related to Conan:

CMake Warning (dev) in build/output/conanbuildinfo.cmake:
  Syntax Warning in cmake code at
    /home/travis/build/DomainDrivenConsulting/dogen/build/output/conanbuildinfo.cmake:142:88
  Argument not separated from preceding token by whitespace.
Call Stack (most recent call first):
  CMakeLists.txt:30 (include)
This warning is for project developers.  Use -Wno-dev to suppress it.

Little did I know that this simple discovery would lead to a sequence of troublesome events and to many a broken build. I decided to report the problem to the Conan developers who, with their usual promptness, rolled up their sleeves, quickly bounced ideas back and forth and then did a sterling job in spinning fixes until we got to the bottom of the issue. Some of the fixes were to Conan itself, whereas some others were related to rebuilding Boost. In the heat of the investigation, I bumped into some very troubling - and apparently unrelated - linking errors:

/home/travis/.conan/data/Boost/1.60.0/lasote/stable/package/ebdc9c0c0164b54c29125127c75297f6607946c5/lib/libboost_log.so: undefined reference to `std::invalid_argument::invalid_argument(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)@GLIBCXX_3.4.21'
/home/travis/.conan/data/Boost/1.60.0/lasote/stable/package/ebdc9c0c0164b54c29125127c75297f6607946c5/lib/libboost_log.so: undefined reference to `std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::find(char const*, unsigned long, unsigned long) const@GLIBCXX_3.4.21'

The build was littered with errors such as these. But the most puzzling thing was that I had changed nothing of consequence on my side and the Conan guys changed very little at their end too! What on earth was going on?

After quite a lot of thinking, Conan's memsharded came up with a startling conclusion: we've been hit by one of those rare-but-dreadful ABI-transitions! His comment is worth reading in full, but the crux of his findings is as follows (copied verbatim):

  • Boost packages, generated with travis use docker to manage different versions of gcc, as gcc 5.2 or gcc 5.3
  • Those docker images are using modern linux distros, e.g. > Ubuntu 15.10
  • By default, new modern linux distros have switched to the gcc > 5.1 new C++11 ABI, that is libstdc++ is built with gcc > 5.1, usually named libcxx11, as well as the rest of the system. The libcxx11 ABI is incompatible with the old gcc < 5.1 libcxx98 ABI.
  • Building in such environment links with the new libcxx11 by default.
  • Now, we move to our user, package consumer environment, which could be an Ubuntu 14.04, or a travis VM (12.04). Those distros use a libcxx98 libstdc++, as a lot of programs of those distros depends on the old libcxx98 ABI. It is not simple to replace it for the new one, requiring to rebuild or reinstall large part of the system and applications. Maybe it could be installed for dev only, and specified in the build, but I have not been able yet.

Reading the above may have given you that sad, sinking feeling: "what on earth is he on about, I just want to compile my code!", "Why oh why is C++ so damn complicated!" and so forth. So, for the benefit of those not in the know, let me try to provide the required background to fully grok memsharded's comment.

What's this ABI Malarkey Again?

This topic may sound oddly familiar to the faithful reader of Nerd Food and with good reason: we did cover ABIs in the distant past, at a slightly lower level. The post in question was On MinGW, Cygwin and Wine and it does provide some useful context to this discussion, but, if you want a TL;DR, it basically dealt with kernel space and user space and with things such as the C library. This time round we will turn our attention to the C++ Standard Library.

In addition to specifying the C++ language, the C++ Standard also defines the API of the C++ Standard Library - the classes and their methods, the functions and so on. The C++ Standard Library is responsible for providing a set of services for applications compiled with a C++ compiler. So far, so similar to the C Standard Library. Where things begin to differ is in the crucial matter of the ABI. But first, let's get a working definition for ABI, just so we are all on the same page. For this, we could do worse than using Linux System Programming:

Whereas an API defines a source interface, an ABI defines the low-level binary interface between two or more pieces of software on a particular architecture. It defines how an application interacts with itself, how an application interacts with the kernel, and how an application interacts with libraries. An ABI ensures binary compatibility, guaranteeing that a piece of object code will function on any system with the same ABI, without requiring recompilation.

ABIs are concerned with issues such as calling conventions, byte ordering, register use, system call invocation, linking, library behavior, and the binary object format. The calling convention, for example, defines how functions are invoked, how arguments are passed to functions, which registers are preserved and which are mangled, and how the caller retrieves the return value.

The second paragraph is especially crucial. You see, although both the C and the C++ Standards are somewhat silent on the matter of specifying an ABI, C tends to have a de facto standard for a given OS on a given architecture. This may not sound like much and you may be saying: "what, wait: the same OS on a different architecture has a different ABI?" Yep, that is indeed the case. If you think about it, it makes perfect sense; after all, C was carefully designed to be equivalent to "portable assembler"; in order to achieve maximum performance, one must not create artificial layers of indirection on top of the hardware but instead expose it as is. So, by the same token, two different C compilers working on the same architecture and OS will tend to agree on the ABI. The reason is that the OS will also follow the hardware where it must, for performance reasons; and where the OS can make choices, it more or less makes the choice for everybody else. For example, until recently, if you were on Windows, it did you no good to compile code into an ELF binary because the law of the land was PE. Things have now changed dramatically, but the general point remains: the OS and the hardware rule.

C++ inherits much of C's approach to efficiency, so at first blush you may be fooled into thinking it too would have a de facto ABI standard ("for a given OS, " etc. etc.). However, there are a few crucial differences that have grave consequences. Let me point out a few:

  • C++'s support for genericity - such as function overloading, templates, etc - is implemented by using name mangling; however, each compiler tends to have their own mangling scheme.
  • implementation details such as the memory layout of objects in the C++ Standard Library - in particular, as we shall see, std::string - are important.
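To make the mangling point a little more concrete, here is a minimal sketch that demangles two spellings of the std::invalid_argument constructor that showed up in the linker errors earlier. It assumes GCC or Clang, since abi::__cxa_demangle comes from <cxxabi.h> and is not available with MSVC. The helper names demangle and mentions_cxx11 are hypothetical, made up for this illustration, and the two symbol strings in the comments are, to the best of my knowledge, how libstdc++ spells that constructor under the old and the new ABI respectively.

```cpp
#include <cxxabi.h>  // abi::__cxa_demangle (GCC and Clang only; an assumption of this sketch)
#include <cstdlib>   // std::free
#include <string>

// Hypothetical helper: returns the human-readable form of a mangled symbol,
// or the input unchanged if the demangler rejects it.
std::string demangle(const std::string& mangled) {
    int status = 0;
    char* text = abi::__cxa_demangle(mangled.c_str(), nullptr, nullptr, &status);
    if (status != 0 || text == nullptr)
        return mangled;
    std::string result(text);
    std::free(text);  // the demangler allocates the buffer with malloc
    return result;
}

// Hypothetical helper: true when the demangled symbol refers to the
// std::__cxx11 namespace, i.e. it was built against the new ABI.
bool mentions_cxx11(const std::string& mangled) {
    return demangle(mangled).find("__cxx11") != std::string::npos;
}

// The same constructor, as I believe it is exported under each ABI:
//   old ABI: _ZNSt16invalid_argumentC1ERKSs
//   new ABI: _ZNSt16invalid_argumentC1ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE
```

Passing the second symbol to demangle should give back a signature that mentions std::__cxx11::basic_string, much like the one in the undefined reference errors above, whereas the first one does not. As far as the linker is concerned these are two entirely unrelated functions.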

In the past, compiler vendors tended to exacerbate differences such as these; as it was with the UNIX wars, so too during the "C++ wars" did it make sense to be as incompatible as possible in the never ending hunt for monetisation. Thus, ABI specifications were kept internal and were closely guarded secrets. But since then the world has changed. To a large extent, C++ lost the huge amounts of funding it once had during the nineties and part of the noughties, and many vendors either went under or greatly reduced their efforts in this space. Two compilers emerged as victors: MSVC on the Windows platform and - once the dust of the EGCS fork finally settled - GCC everywhere else. The excellent quality of GCC across a vast array of platforms and its strict standards adherence - coupled with a quick response to the standardisation efforts - resulted in total domination outside of Windows. So much so that only recently did it meet a true challenger in Clang. The brave new world in which we now find ourselves is one where C++ ABI standardisation is a real possibility - see Defining a Portable C++ ABI.

But pray forgive the old hand, I digress again. The main point is that, for a given OS on a given architecture, you normally had to compile all your code with a single compiler; if you did that, you were good to go. Granted, GCC never made any official promises to keep its releases ABI-compatible, but in practice we came to rely on the fact that new and old releases interoperated just fine since the days of 3.x. And so did Clang, respecting GCC's ABI so carefully it made us think of them as one happy family. Then, C++11 arrived.

Mixing and Matching

As described in GCC5 and the C++11 ABI, this pleasant state of affairs was too idyllic to last forever:

[…] [S]ome new complexity requirements in the C++11 standard require ABI changes to several standard library classes to satisfy, most notably to std::basic_string and std::list. And since std::basic_string is used widely, much of the standard library is affected.

In hindsight, the improvements in the std::string implementation are great; as a grasshopper, I recall spending hours on end debugging my code in the long forgotten days of EGCS 2.91, only to find out there was a weird bug in the COW implementation for my architecture. That was the first time - and as it happens, the last time too - I found a library bug, and it made a strong impression on me, at that young age. These people were not infallible.
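As a rough illustration of why the changes to std::basic_string and std::list quoted above cannot be made without breaking binary compatibility, the sketch below simply reports object sizes. LayoutReport, standard_library_layout and list_has_room_for_size are hypothetical names invented for this example. The sizes mentioned in the comments are what I would expect from libstdc++ on 64-bit Linux; treat them as an assumption rather than a guarantee.

```cpp
#include <cstddef>
#include <list>
#include <string>

// Object sizes are part of the ABI: code compiled against one layout cannot
// safely receive objects that were built with another.
struct LayoutReport {
    std::size_t pointer_size;
    std::size_t string_size;
    std::size_t list_size;
};

// Hypothetical helper: captures the layout this translation unit was built with.
LayoutReport standard_library_layout() {
    return LayoutReport{sizeof(void*), sizeof(std::string), sizeof(std::list<int>)};
}

// A list that must answer size() in constant time has to store its element
// count somewhere, so it needs more than the two link pointers of a bare
// sentinel node.
// Assumption (libstdc++, 64-bit): old ABI list is 16 bytes, new ABI list is 24;
// old ABI string is a single pointer (8 bytes), new ABI string is 32 bytes.
bool list_has_room_for_size(const LayoutReport& report) {
    return report.list_size > 2 * report.pointer_size;
}
```

Building this once with the new ABI and once with the old one, and comparing the two reports, is a quick way to see that the two flavours of std::string and std::list really are different types that merely share a name in the source code.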

These days I sit much higher up in the C++ stack. Like many, I didn't read the GCC 5 release notes that carefully when it came out, relying as usual on my distro to do the right thing. And, as usual, the distros largely did, even though, unbeknown to many, a stir was happening in their world 1. But hey, who reads distro blogs, right? Hidden comfortably under my Debian Testing lean-to, I was blissfully unaware of this transition since my code continued to compile just fine. Also, where things start to get hairy is when you need to mix and match compiler versions and build settings - and who in their right mind does that, right?

As it happens, this is a situation in which modern C++ users of Travis may easily find themselves, stuck as they are on either Ubuntu 12.04 (2012) or Ubuntu 14.04 (2014). Nick Sarten's blog post rams the point home in inimitable fashion:

Hold on, did I say GCC 4.6? Clang 3.4? WHAT YEAR IS IT?

Yes, what year is it indeed. So it is that most of us rely on PPA's to bring the C++ environment on Travis up to date, such as the Ubuntu Toolchain:

sudo add-apt-repository -y ppa:ubuntu-toolchain-r/test

This always seemed like an innocent thing to do but after my linking errors and memsharded's discoveries, one suddenly started to question everything: what settings did the PPA use to build? What settings were used to build the Boost Conan packages? With what compiler? In what distro? The nightmare was endless. It was clear this was going to lead to tears before bedtime.

The Long Road to a Solution

Whilst memsharded homed in on the problem pretty quickly - less than a couple of weeks - a complete solution to my woes was a lot more elusive. In truth, this is the kind of situation where you need long spells of concentrated effort, so working in your copious spare time does not help at all. I first tried the easiest approach: to pray that it would all go away by itself, given enough time. And, lo and behold, things did work again, for a little while! And then started to fail again; the Boost package in Conan got rebuilt and the build broke. And that way it stayed.

Once waiting was no longer an option, I had to take it seriously and started investigating in earnest. Trouble is, when you lose trust in the compilation settings you then need to methodically validate absolutely everything, until you bottom out the problem. And that takes time. Many things were tried, including:

  • rebuilding Boost locally, attempting to reproduce the issue - to no avail.
  • rebuilding the Conan Boost packages with the old ABI; a fail (#12).
  • reading up a variety of articles on the subject, most of them linked in this post.
  • building the Boost packages locally and exporting them into Travis using DropBox's public folders. Another fail, but DropBox was a win.
  • obtaining the exact same Ubuntu 14.04 image as Travis is using, replicating the problem locally in a VM, then building Boost with the compiler from the PPA and exporting it to Travis using DropBox. This worked.

Predictably, the final step is the one I should have tried first, but one is always lazy. Still, all of this got me wondering why things had been so complicated. Normally one would be able to ldd or nm -C the binary and figure out the dependencies, but in this case I seemed to always be pointing to libstdc++.so.6 regardless. Most puzzling. And then I found the Debian wiki page on GCC5, which states:

The good news is, that GCC 5 now provides a stable libcxx11 ABI, and stable support for C++11 (GCC version before 5 called this supported experimental). This required some changes in the libstdc++ ABI, and now libstdc++6 provides a dual ABI, the classic libcxx98 ABI, and the new libcxx11 (GCC 5 (<< 5.1.1-20) only provides the classic libcxx98 ABI). The bad news is that the (experimental) C++11 support in the classic libcxx98 ABI and the new stable libcxx11 ABIs are not compatible, and upstream doesn't provide an upgrade path except for rebuilding. Note that even in the past there were incompatibilities between g++ versions, but not as fundamental ones as found in the g++-5 update to stable C++11 support.

Using different libstdc++ ABIs in the same object or in the same library is allowed, as long as you don't try to pass std::list to something expecting std::__cxx11::list or vice versa. We should rebuild everything with g++-5 (once it is the default). Using g++-4.9 as a fallback won't be possible in many cases.

libstdc++ (>= 5.1.1-20) doesn't change the soname, provides a dual ABI. Existing C++98 binary packages will continue to work. Building these packages using g++-5 is expected to work after build failures are fixed.

The crux is, of course, all the stuff about a dual ABI. I had never bumped into the dual ABI beast before, and now that I did I'm not sure I am entirely pleased. It's probably great when it just works, but it's tricky to troubleshoot when it doesn't: are you linking against a libstdc++ with dual ABI disabled/unsupported? Or is it some other error you've introduced? Personally, having a completely different SO name like memsharded had suggested seems like a less surprising approach - e.g. call it libcxx11 instead of libstdc++. But, as always, one has to play with the cards that were dealt so there is no point in complaining.
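One way to start answering the first of those questions is to ask the library itself at compile time. The following is a minimal sketch, assuming libstdc++: _GLIBCXX_USE_CXX11_ABI is the macro libstdc++ uses to select between the two ABIs, while libstdcxx_abi_flag and string_lives_in_cxx11_namespace are hypothetical helper names made up for this example. On other standard libraries the macro is simply not defined, which the sketch reports as -1.

```cpp
#include <cstring>   // std::strstr
#include <string>
#include <typeinfo>

// Hypothetical helper: 1 = new (cxx11) ABI, 0 = old ABI,
// -1 = macro not defined (not libstdc++, or a libstdc++ without the dual ABI).
constexpr int libstdcxx_abi_flag() {
#if defined(_GLIBCXX_USE_CXX11_ABI)
    return _GLIBCXX_USE_CXX11_ABI;
#else
    return -1;
#endif
}

// Hypothetical helper: true when the mangled type name of std::string places
// it in the __cxx11 namespace, which is how the new ABI is kept apart from
// the old one inside a single libstdc++.so.6.
bool string_lives_in_cxx11_namespace() {
    return std::strstr(typeid(std::string).name(), "__cxx11") != nullptr;
}
```

Compiling the same file with -D_GLIBCXX_USE_CXX11_ABI=0 on a dual-ABI libstdc++ should flip both answers, which is also the switch one would reach for when a prebuilt dependency was compiled against the old ABI. It does not tell you how a third-party binary was built, though; for that you still need to inspect its symbols.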

Conclusion

The Ubuntu 14.04 build of Boost did get us a green build again, but for all the joyous celebrations, there is still a grey cloud hovering above since the mop-up exercise is not completed. I now need to figure out how to build Boost with Conan on 14.04 and upload this version into the package manager's repo. However, for now carpe diem. After so much unproductive time, there is a real need for a few weeks (months!) of proper coding - the reason why I have a spare time project in the first place. But some lessons were learned.

Firstly, one cannot but feel truly annoyed at ${COSMIC_DEITY} for having to deal with issues such as this. After all, one of the reasons I prefer C++ to the languages I use at work (C# and Java) is that it is usually very transparent; normally I can very quickly reproduce, diagnose and fix a problem in my code. Of course, lord knows this statement is not true of all C++ code, but at least it tends to be valid for most Modern C++ - and over the last five years that's all the C++ I dealt with in anger. It was indeed rather irritating to find out that the pain has not yet been removed from the language, and on occasion, even experienced developers get bitten. Hard.

A second point worthy of note is that in C++ - more so than in any other language - one cannot just blindly trust the package manager. There are just too many configuration knobs and buttons for that to be possible, and one can easily get bitten by assumptions. The sad truth is that even when using Conan, one should probably upload one's own packages built with a well understood configuration. True, this may cost time - but on the other hand, it will avoid wild goose chases such as this one.

Finally, it's also important to note that this whole episode illustrates the sterling job that package maintainers do in distributions. Paradoxically, their work is often so good that we tend to be blissfully unaware of its importance. Articles such as Maintainers Matter take on a heightened sense of urgency after an experience like this.

The road was narrow, long and troublesome. But, as with all Poirot novels, there is always that satisfying feeling of finally finding out whodunnit in the end.

Post Script

There is one final twist to this story, which adds insult to injury and further illustrates ${COSMIC_DEITY}'s sense of humour. When I finally attempted to restore our clang builds, I found out that LLVM has disabled their APT repo for an unspecified length of time:

> TL;DR: APT repo switched off due to excessive load / traffic

There are no alternatives at present to build with a recent clang. Sometimes one has the feeling that the universe does not want to play ball. Stiff upper lip and all that; mustn't grumble.

Footnotes:

1

For example, see The Case of GCC-5.1 and the Two C++ ABIs to understand Arch's pains.

Created: 2016-06-16 Thu 14:12

Monday, February 08, 2016

Nerd Food: Interesting...

Time to flush all those tabs again. Some interesting stuff I bumped into recently-ish.

Finance, Economics, Politics

Startups et al.

General Coding

Databases

  • Databases - the Long View: good presentation on databases and Postgres in particular, giving you a perspective of how things changed over time.

C++

  • Crow: New find. Simple library to write web services in C++. If you need to quickly expose some code as a web service, this may be easier than using Casablanca.

Layperson Science

Other

Created: 2016-02-08 Mon 22:29

Nerd Food: Tooling in Computational Neuroscience - Part III: Data

In God we trust; all others must bring data. -- W. Edwards Deming

Welcome to yet another instalment in our series of posts about tooling in Computational Neuroscience. Previously, we have discussed simulators - a popular one, in particular - and microscopes. We shall now talk about data in Computational Neuroscience, a seemingly broad and somewhat mundane topic but one which is central to any attempt in understanding the status quo of the discipline. The target audience remains as it was - the lay person - but I'm afraid things are getting increasingly technical.

More Data! We Need More Data!

Computational Neuroscience by itself is not particularly interesting if there are no inputs to the models we carefully craft nor detailed outputs to allow us to know what the models are doing. Similarly, one needs to be able to use experimental data to inform our modeling choices and in order to baseline expectations; if this data is not available, one cannot tell how close or how far models are from the real thing. As everywhere else, data is of crucial importance here; we need lots of it and of many different kinds.

Once you need data, you soon need to worry about data representation: how should information be encoded? Clearly, in order for the data to be useful in a general sense, it must be accompanied by a formal or informal specification or else users will not know how to interpret it. Furthermore, given the highly technical nature of the data in question, the specification must be very precise or the data becomes useless or even dangerous; "Was that in microns or nanometres?" is not the sort of question you want to be asking. In a world where producers and consumers of data can be anywhere geographically, the specification assumes an ever larger degree of importance.

In summary, it is just not practical to allow everyone to come up with their own data formats:

  • writing a clear and concise specification for data interchange is hard work, and requires a lot of experience in both the domain and the specification process in general. The first attempts would probably prove to be incomplete, inconsistent or impractical.
  • writing code to read and write files according to a specification and in multiple programming languages is also demanding engineering work.
  • writing code to convert from one data specification to another is even more complicated because it requires intimate knowledge of both.
  • some data is generated directly by hardware, making it impractical to adapt to different requirements.

Another aspect worth highlighting is the "big data" nature of a lot of the data sets used in this field. Anything to do with the brain gets pretty complex pretty quickly, and this manifests itself in the data dimension by having ever larger data sets with greater levels of detail. On the plus side, thanks to Moore's Law sigmoid, detailed information at all levels is allowing us to answer questions that were unanswerable not so long ago. The flip side is that all those details come at a cost: the data sets are becoming huge. For example, the resolution of the data coming out of microscopy is now so high that a single data set can take as much as 500 TB. And of course, not only are individual data sets getting larger and larger, but we are able to generate more of them at an ever increasing pace because the processes are more streamlined. It is a fire-hose of data.

All of these difficulties are not unique to Computational Neuroscience or even to Neuroscience as a whole, but the complexity of the domain has the effect of greatly exacerbating an already thorny problem.

Neuroinformatics to the Rescue

If you think we're exaggerating then think again. The management of data in Neuroscience is so complex it is a field in its own right, with the cool-sounding name of Neuroinformatics. Wikipedia tells us that:

Neuroinformatics is a research field concerned with the organization of neuroscience data by the application of computational models and analytical tools. These areas of research are important for the integration and analysis of increasingly large-volume, high-dimensional, and fine-grain experimental data. Neuroinformaticians provide computational tools, mathematical models, and create interoperable databases for clinicians and research scientists.

In layman's terms, Neuroinformatics concerns itself with Neuroscience data and the places where said data is to be stored. It is also implied that one has to deal with a variety of types of data, e.g.: data from experiments (of which there can be many kinds), model inputs, model outputs, the models themselves when viewed as data, etc. The classification of this data is in itself a Neuroinformatics task. Finally, Neuroinformatics also is responsible for the tooling necessary to acquire the data, manipulate it, analyse it, visualise it and so on. Given such a broad definition, one is forced to conclude that there is a big overlap between Computational Neuroscience - the modeling activity - and Neuroinformatics - the management of the data required by it. This lack of clarity is common in science, particularly as new fields develop; take for example Mathematics and Computer Science at its inception.

In truth, such definitions and demarcations are only as useful as the tangible benefits they provide. It is perhaps more fruitful to think of Neuroinformatics as a hat you don as and when your Computational Neuroscience work requires it; the definition is there then to allow one to be aware of the separation between the analytic work in modeling and the data storage / retrieval work. For the purposes of this article, we'll continue to refer to the "Neuroinformatics Scientist" and the Computational Neuroscientist personas, but bear in mind they may resolve to the same person in practice.1

Before we move on, I'd like to point out another interesting challenge Neuroinformatics has to address, and one that is common to all Medical Sciences: the need to handle human-derived data very carefully. After all, making data sets available widely must not have implications for the original patients, so it's often a requirement that the data is de-identified; in the cases where the data is patient sensitive, additional requirements may be made to users of the data to avoid leaking this information, such as requiring a registration, etc. This illustrates the peculiar nature of Neuroinformatics, with the constant tension between making data as widely available as possible but at the same time having to ensure there are no side-effects of doing so. Presumably, Primum non nocere - first, do no harm.

Databases, Repositories and Archives

Thanks to the efforts of Neuroinformatics, there is now a wealth of Neuroscience data available to all on the Internet. The seeds of this growth were sown in the nineties when labs started sharing research results online. Sharing always existed in one way or another, of course, but the rise of the Internet simply changed the magnitude of the process. It soon became apparent that there was a need to organise central repositories of data, and to ensure the consistency of the shared data. Papers with a distinct Neuroinformatics tone were written, such as An on-line archive of reconstructed hippocampal neurons (1999). Repositories grew, multiplied, morphed and in many cases died, as these things do, and the evolutionary process left us with the survivors. I'd like to highlight some of the ones I have bumped into so far (with descriptions in their own words):

  • ModelDB: "ModelDB provides an accessible location for storing and efficiently retrieving computational neuroscience models. ModelDB is tightly coupled with NeuronDB. Models can be coded in any language for any environment. Model code can be viewed before downloading and browsers can be set to auto-launch the models."
  • NeuronDB: "NeuronDB provides a dynamically searchable database of three types of neuronal properties: voltage gated conductances, neurotransmitter receptors, and neurotransmitter substances. It contains tools that provide for integration of these properties in a given type of neuron and compartment, and for comparison of properties across different types of neurons and compartments."
  • NeuroMorpho: "NeuroMorpho.Org is a centrally curated inventory of digitally reconstructed neurons associated with peer-reviewed publications. It contains contributions from over 100 laboratories worldwide and is continuously updated as new morphological reconstructions are collected, published, and shared. To date, NeuroMorpho.Org is the largest collection of publicly accessible 3D neuronal reconstructions and associated metadata."
  • Functional Connectomes Project: "Following the precedent of full unrestricted data sharing, which has become the norm in molecular genetics, the FCP entailed the aggregation and public release (via www.nitrc.org) of over 1200 resting state fMRI (R-fMRI) datasets collected from 33 sites around the world."
  • OpenfMRI: "[…] project dedicated to the free and open sharing of functional magnetic resonance imaging (fMRI) datasets, including raw data."
  • Open Source Brain: "resource for sharing and collaboratively developing computational models of neural systems."

As you can see from this small list - rather incomplete, I'm sure - there is a wealth of information out there, covering all sorts of aspects of the brain. We have never had as much data as we do today. And, in many ways, this is fast becoming a problem. As an example, the data sets from Neuroscience's plethora of divisions and sub-fields are not designed to talk to each other: Electron Microscopy (EM) data is disconnected from data obtained by Magnetic Resonance Imaging (MRI), which is also totally separate from connectome information2 and so forth. In many cases, these sub-fields have evolved along fairly separate paths, and developed their own technical vocabulary in isolation and over long periods of time - an approach perfectly suitable for a "disconnected" world but less than ideal for a world where multiple sources of data are required to make sense of complex phenomena. If one can't even agree on what to call things, how can one hope to explain them?

Thus, the early Neuroinformatics approach is best described as "evolutionary". It is not as if someone sat down and generated a well-defined set of file formats for data interchange, covering all the different aspects of the areas under study. Instead, what has been emerging is a multitude of file formats in each sub-field, all calling out for attention, and all of them designed for the immediate goal at hand rather than the greater good of Neuroscience.

Taming the Sea of Data

From a Software Engineering perspective, an evolutionary approach makes perfect sense; after all, the Real Programmers had said: "first make it work, then make it right, and, finally, make it fast." In many ways, we are reaching the "make it right" phase, with an increasing interest in efforts towards the creation of broad standards. There have been several papers and initiatives on the subject, such as the Neuroscience Information Framework, or NIF, described in a paper: The Neuroscience Information Framework: A Data and Knowledge Environment for Neuroscience. The paper outlined a lot of the problems that are hampering research, such as:

  • the need for specialised search engines that are domain aware, and advanced query tools too;
  • the need to aid integration and to provide connectivity across related data and findings;
  • a requirement to provide new and enhanced forms of analysing existing data, as data reuse is extremely important - new insights can be obtained on already existing data, often long after the data was generated, and by using it in ways that were not at all envisioned by the original authors;
  • the need to make contribution to online repositories easier; lowering the "contribution barrier" is important to increase data availability but must be done in ways that do not compromise the quality of the data;
  • a requirement to make all code open source such that any lab can make use of it, and the community as a whole can share the maintenance load;
  • a need for an online repository for all tooling, to avoid reinventing the wheel;
  • the need to create a multi-domain standard vocabulary.
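The last point, a shared vocabulary, is perhaps the easiest to illustrate. The sketch below is a toy in Python: the synonym table, the data set names and the tags are all invented for the example and are not taken from the NIF or from any real ontology. It only shows why agreeing on names matters for search across sub-fields.

```python
# Hypothetical synonym table: every term and mapping here is made up for
# illustration and is not taken from NIF or any real ontology.
SHARED_VOCABULARY = {
    "pyramidal cell": "pyramidal neuron",
    "pyramidal neurone": "pyramidal neuron",
    "ca1": "hippocampus CA1",
    "ca1 region": "hippocampus CA1",
}


def normalise(term):
    """Map a sub-field specific label onto the shared term, if one is known."""
    key = " ".join(term.lower().split())
    return SHARED_VOCABULARY.get(key, key)


def search(datasets, query):
    """Return names of data sets whose normalised tags include the query."""
    wanted = normalise(query)
    return [name for name, tags in datasets.items()
            if wanted in {normalise(t) for t in tags}]


datasets = {
    "em_stack_01": ["Pyramidal cell", "CA1"],
    "mri_series_07": ["CA1 region"],
    "model_42": ["pyramidal neurone"],
}
print(search(datasets, "pyramidal neuron"))  # ['em_stack_01', 'model_42']
print(search(datasets, "ca1"))               # ['em_stack_01', 'mri_series_07']
```

Without the normalisation step, a search for "pyramidal neuron" would find none of these data sets, even though two of them are about exactly that.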

There are many worthwhile points in this paper, and it is highly recommended to anyone interested in the subject matter. For instance, the section discussing the design of the NIF also covers the requirements for any specification that wishes to solve the problems outlined above. They are worth highlighting as - in my humble and lay opinion - they are very well thought out.

  • The design of such a framework must combine technical specification choices and broad community support; "open data, access and exchange, via open source and platform, aid Framework-enabled open discover for Neuroscience."
  • A common framework would reduce costs and enhance benefits of data sharing and knowledge sharing; it would "reduce the cost/benefit ratio for data acquisition and utilization."
  • The framework must be designed by the broader community and with the needs of this broader community in mind, and it must build upon prior development in Neuroinformatics.
  • A focus on interoperability is crucial, and it is not a static target but one that must be looked after over time. In addition, there is also a need to keep in mind that different resources have very different interoperability potential. In order to maximise interoperability, we should aim to standardise as much as possible all aspects of the process such as user interfaces, terminologies, formats, etc.

To the untrained eye, the NIF initiative appears to be a great effort to solve fundamental problems in the field. It also seems to have spawned and/or helped popularise many useful and lasting resources such as NeuroMorpho. However, the impression one gets from the outside is that the NIF didn't quite fulfil all of its potential. Having said that, I am keenly looking for up-to-date documents that describe the current status across all of its many aspects - alas, I have not yet succeeded in finding any such document. If indeed it is the case that the initiative petered out, it did highlight a few potential problems for anyone working in this space:

  • large undertakings are hard to pull off; small, organic, incremental changes are easier to do, but of course, that is why we have the problems we currently have.
  • large initiatives require large amounts of funding; work is technical and very expensive.
  • it is not easy to understand NIF's deliverables from looking at their documentation and website. One can clearly see it was an ambitious project, and one which took on the brunt of the problem areas highlighted above, but perhaps it needed a slightly more self-contained view of its achievements rather than an all-or-nothing approach. This would allow some components to be preserved even whilst others fail to gain traction.

XML strikes back

Another interesting attempt to tackle these problems is what I call the "XML Suite". This is basically a set of different XML-based standards that are able to interoperate and augment each other, a bit like a stack of building blocks. You can find more details in this paper: XML for Model Specification in Neuroscience. Some of the components of the XML Suite are (with descriptions in their own words, copied from the above paper, and a link for more details):

  • LEMS: "the Low Entropy Model Specification […] is being developed to provide a compact, minimally redundant, human-readable, human-writable, declarative way of expressing models of biological systems. It differs from other systems such as CellML or SBML in its requirement to be human writable and the inclusion of basic physical concepts such as dimensionality and physical nesting as part of the language."
  • NeuroML: "supports the use of declarative model specifications for neuroscience modeling efforts at different scales, from intracellular mechanisms to networks of reconstructed neurons."
  • MorphML: "provides a common format for exchange of neuronal morphology data. It can also be used to specify cell structure for modeling efforts as part of NeuroML."
  • BrainML: "application for representing time series data, spike trains, experimental protocols, and other data relevant to neurophysiology experiments."
  • SBML: "(Systems Biology Markup Language) is an application for specifying models of biochemical reaction networks such as metabolic networks, cell-signaling pathways and gene regulatory networks."
  • CellML: "is designed for the specification of biological models of cellular and sub-cellular processes such as calcium dynamics, metabolic pathways, signal transduction, and electrophysiology."
  • MathML: "provides the means for describing the structure and content of mathematical notation in order to serve, receive, and process mathematics on the web. Other XML applications often use MathML language elements for representing mathematical equations."

A positive aspect of the XML Suite is its "discrete" nature. Each of these file formats is free to evolve in isolation, and the nature of their cooperation is very loose in most cases. For example, MathML is not at all related to Neuroscience and has the support of the Maths community (to some extent). In addition, the "stacking" approach is also a very interesting one, allowing a good domain focus. For example, NeuroML is built on top of LEMS, so in theory each of these should cover different domains and there should be minimal redundancy.
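As a rough illustration of this building-block idea, the Python sketch below embeds a small MathML fragment inside a made-up outer format. The toyModel and channel elements are purely hypothetical (they are not NeuroML or LEMS); only the MathML namespace and the apply, times and ci elements are real. The point is merely that XML namespaces let one format reuse another without either needing to know much about the other.

```python
import xml.etree.ElementTree as ET

MATHML_NS = "http://www.w3.org/1998/Math/MathML"

# A made-up outer format ("toyModel") embedding a MathML fragment for g * v.
DOCUMENT = """
<toyModel xmlns:m="http://www.w3.org/1998/Math/MathML">
  <channel name="leak">
    <m:math>
      <m:apply>
        <m:times/>
        <m:ci>g</m:ci>
        <m:ci>v</m:ci>
      </m:apply>
    </m:math>
  </channel>
</toyModel>
"""


def variables(element):
    """Collect the MathML <ci> identifiers found anywhere under element."""
    return [ci.text for ci in element.iter("{%s}ci" % MATHML_NS)]


root = ET.fromstring(DOCUMENT)
for channel in root.findall("channel"):
    # The outer format supplies the channel name, MathML supplies the maths.
    print(channel.get("name"), variables(channel))
```

A tool that only understands MathML could process the inner fragment untouched, while a tool for the outer format could ignore it entirely.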

The key challenge for the XML Suite is for each of its components to find a sustainable user base and sustainable funding to go along with it. This is a broader problem of Neuroinformatics: researchers do not want to spend time on work that does not contribute directly to their research, so the pool of developers available to do fundamental work on the file formats is limited. Once the developer pool becomes too small, the file format stops being fit for purpose and loses users, and thus starts a downward spiral. This appears to have been the fate of projects such as BrainML.

Conclusion

This post provided an overview of the data landscape in Computational Neuroscience and introduced the sub-field of Neuroinformatics. We also looked at some of the available data stores and reviewed a few of the more popular initiatives to solve the fundamental data problems in the field.

Stay tuned for the next instalment!

Footnotes:

1

For a bit more detail on the two fields see What are Computational Neuroscience and Neuroinformatics?

2

"A connectome is a comprehensive map of neural connections in the brain, and may be thought of as its 'wiring diagram'." From this page.

Created: 2016-02-08 Mon 21:41

Emacs 24.5.1 (Org mode 8.2.10)


Monday, January 18, 2016

Nerd Food: Interesting...


Time to flush all those tabs again. Some interesting stuff I bumped into recently-ish.

Finance, Economics, Politics

  • Why Big Oil Should Kill Itself: This is a really, really interesting article. The gist of it is that the entire logic around oil exploration is now a fallacy and it makes more economic sense to simply give up looking for oil because all the oil that is left is just too expensive to commercialise. It also has a very interesting take on the valuation of oil companies (and sources of takeovers) but I won't spoil it for you. If you are into oil (or against it), it's a must-read.
  • Oil Goes Nonlinear: Short but thought provoking. I don't tend to agree with Krugman on a lot of things, but quite like this analysis.
  • Africa’s Boom Is Over: And the bad news continues. Totally spot-on analysis of what will befall us.
  • American Spring: interesting take on the state of affairs of American politics. Not sure I agree with everything, but definitely food for thought. "Statistically speaking, what are the odds that the two most qualified candidates to be president out of 300 million people are siblings? Or married?" Indeed.
  • A Year of Sovereign Defaults?: Very good and very scary. This has to be on the cards, the only question is the timing.
  • Really rich people are suddenly paying quite a bit more in taxes: some good news on the equality front I guess. But not quite sure it makes much of a difference in the big scheme of US things.
  • Argentina's 'little trees' getting chopped down by new president: Seems like Argentina is going to go through yet another turbulent period, with some good and bad news coming out. Interesting take on the impact to the less well off of the new policies. The chap is certainly a doer, it seems: A fast start.

Startups et al.

General Coding

  • Feeding Graph databases - a third use-case for modern log management platforms: Very interesting ideas on how to use logging data in a graph database. Sounds extremely counter-intuitive, and then you start reading, at which point it's like "Damn, why didn't I think of that before!". Source: Hacker News
  • Moores law hits the roof: Seems like the exponential function is revealing itself as a sigmoid, as everyone knew it would. The article covers some of the cracks that are already present in Moore's law. Interesting to note that a transistor is now only a few silicon atoms wide - meaning we can't really make it much smaller. Source: Hacker News
  • No, I Don't Want To Configure Your App: Call to arms to get us all thinking on just how many configuration knobs you need to use something. Source: Hacker News
  • Your IDE Is Killing You: Somewhat preaching to the choir, since I am an Emacs user of old, but still a very cogent argument on why relying too much on IDEs is not a good thing. Source: Bruno Antunes (twitter)
  • Starters and Maintainers: The different personas around an open source project. Interesting; it's good to be aware of which hat you are wearing when.
  • I Moved to Linux and It’s Even Better Than I Expected: A feel-good story about the Linux desktop. Given how slowly things are progressing on that front, we all need one of these sometimes to cheer us up. That is the main value of the article, though.

Databases

  • Encrypted databases with ZeroDB: I'm not exactly impressed with the technology itself, but more with the ideas one can extract from it. Briefly: what if the database only stores encrypted data, which only each client can decrypt? This is certainly a very useful thing for certain types of information and a PostgreSQL extension would be most useful. Source: Hacker News
  • Introduction to PostgreSQL physical storage: Great article on Postgres low-level details. One to read if you want to get serious about the Elephant but are not yet in the know.
  • Schema based versioning and deployment for PostgreSQL: Tips on how to manage versions for your stored procs, and also contains links for table management. For those of us not totally taken by NoSQL.

C++

Layman Science

  • Why String Theory Is Not A Scientific Theory: Doesn't say a lot of new things, but it's good to remind ourselves of what exactly we mean when we say "Science". This would save us from a lot of grief, such as considering Economics as a Science.
  • The cold fusion horizon: … talking about Science, I was surprised to find out that people are still talking seriously about cold fusion. Interesting article, because it takes the flip side of the Science coin: nothing should be dismissed as non-science unless it fails to use the scientific method. Whilst up until now cold fusion has been more of a hoax, we should not discredit people who work on it provided they are following scientific principles. Who knows, they may be right in the end. Science is all about long-shots.

Other

Created: 2016-01-18 Mon 12:49
