In God we trust; all others must bring data. -- W. Edwards Deming
Welcome to yet another instalment in our series of posts about tooling in Computational Neuroscience. Previously, we have discussed simulators - a popular one, in particular - and microscopes. We shall now talk about data in Computational Neuroscience, a seemingly broad and somewhat mundane topic but one which is central to any attempt at understanding the status quo of the discipline. The target audience remains as it was - the lay person - but I'm afraid things are getting increasingly technical.
More Data! We Need More Data!
Computational Neuroscience by itself is not particularly interesting if there are no inputs to the models we carefully craft, nor detailed outputs to tell us what those models are doing. Similarly, we need experimental data to inform our modeling choices and to baseline expectations; if this data is not available, one cannot tell how close or how far models are from the real thing. As everywhere else, data is of crucial importance here; we need lots of it, and of many different kinds.
Once you need data, you soon need to worry about data representation: how should information be encoded? Clearly, in order for the data to be useful in a general sense, it must be accompanied by a formal or informal specification, or else users will not know how to interpret it. Furthermore, given the highly technical nature of the data in question, the specification must be very precise or the data becomes useless or even dangerous; "Was that in microns or nanometres?" is not the sort of question you want to be asking. In a world where producers and consumers of data can be anywhere geographically, the specification becomes ever more important.
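To make this concrete, here is a minimal sketch (in Python, with made-up names) of one way a specification can force units to travel with the values they qualify, so the microns-versus-nanometres question answers itself:

```python
from dataclasses import dataclass

# A hypothetical, minimal way of making units part of the data itself.
@dataclass(frozen=True)
class Quantity:
    value: float
    unit: str  # e.g. "um" (micrometres) or "nm" (nanometres)

    def to(self, unit: str) -> "Quantity":
        """Convert between the two length units this sketch knows about."""
        if unit == self.unit:
            return self
        factors = {("um", "nm"): 1_000.0, ("nm", "um"): 0.001}
        return Quantity(self.value * factors[(self.unit, unit)], unit)

soma_diameter = Quantity(20.0, "um")
print(soma_diameter.to("nm"))  # Quantity(value=20000.0, unit='nm')
```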
In summary, it is just not practical to allow everyone to come up with their own data formats:
- writing a clear and concise specification for data interchange is hard work, and requires a lot of experience in both the domain and the specification process in general. The first attempts would probably prove to be incomplete, inconsistent or impractical.
- writing code to read and write files according to a specification and in multiple programming languages is also demanding engineering work.
- writing code to convert from one data specification to another is even more complicated because it requires intimate knowledge of both (see the sketch after this list).
- some data is generated directly by hardware, making it impractical to adapt to different requirements.
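To see why the conversion point bites, consider a toy converter between two hypothetical formats - call them A and B - that disagree on both units and conventions; this is a sketch, not any real Neuroscience format:

```python
# Format A (hypothetical) stores positions in micrometres plus a diameter;
# format B (also hypothetical) stores positions in nanometres plus a radius.
# Get either assumption wrong and the data is silently corrupted.
def a_point_to_b(point_a: dict) -> dict:
    return {
        "x_nm": point_a["x_um"] * 1_000.0,
        "y_nm": point_a["y_um"] * 1_000.0,
        "z_nm": point_a["z_um"] * 1_000.0,
        "radius_nm": (point_a["diameter_um"] / 2.0) * 1_000.0,
    }

print(a_point_to_b({"x_um": 1.5, "y_um": 0.0, "z_um": 2.0, "diameter_um": 4.0}))
```

And this is the trivial case: real conversions also have to reconcile metadata, identifiers and structural differences between the two specifications.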
Another aspect worth highlighting is the "big data" nature of many of the data sets used in this field. Anything to do with the brain gets pretty complex pretty quickly, and this manifests itself in the data dimension: ever larger data sets with ever greater levels of detail. On the plus side, thanks to the Moore's Law sigmoid, detailed information at all levels is allowing us to answer questions that were unanswerable not so long ago. The flip side is that all those details come at a cost: the data sets are becoming huge. For example, the resolution of the data coming out of microscopy is now so high that a single data set can take up as much as 500 TB. And not only are individual data sets getting larger and larger, but we are able to generate more of them at an ever increasing pace because the processes are more streamlined. It is a fire-hose of data.
None of these difficulties is unique to Computational Neuroscience, or even to Neuroscience as a whole, but the complexity of the domain greatly exacerbates an already thorny problem.
Neuroinformatics to the Rescue
If you think we're exaggerating, then think again. The management of data in Neuroscience is so complex that it is a field in its own right, with the cool-sounding name of Neuroinformatics. Wikipedia tells us that:
Neuroinformatics is a research field concerned with the organization of neuroscience data by the application of computational models and analytical tools. These areas of research are important for the integration and analysis of increasingly large-volume, high-dimensional, and fine-grain experimental data. Neuroinformaticians provide computational tools, mathematical models, and create interoperable databases for clinicians and research scientists.
In layman's terms, Neuroinformatics concerns itself with Neuroscience data and the places where said data is to be stored. Implicit in this is a variety of types of data: data from experiments (of which there can be many kinds), model inputs, model outputs, the models themselves when viewed as data, and so on. The classification of this data is in itself a Neuroinformatics task. Finally, Neuroinformatics is also responsible for the tooling necessary to acquire the data, manipulate it, analyse it, visualise it and so forth. Given such a broad definition, one is forced to conclude that there is a big overlap between Computational Neuroscience - the modeling activity - and Neuroinformatics - the management of the data required by it. This lack of clarity is common in science, particularly as new fields develop; take, for example, Mathematics and Computer Science at the latter's inception.
In truth, such definitions and demarcations are only as useful as the tangible benefits they provide. It is perhaps more fruitful to think of Neuroinformatics as a hat you don as and when your Computational Neuroscience work requires; the definition is there to make one aware of the separation between the analytic work of modeling and the data storage / retrieval work. For the purposes of this article, we'll continue to refer to the "Neuroinformatics Scientist" and the "Computational Neuroscientist" personas, but bear in mind they may resolve to the same person in practice.1
Before we move on, I'd like to point out another interesting challenge Neuroinformatics has to address, one that is common to all Medical Sciences: the need to handle human-derived data very carefully. After all, making data sets widely available must not have implications for the original patients, so it is often a requirement that the data be de-identified; in cases where the data is patient-sensitive, additional requirements may be imposed on users of the data to avoid leaking this information, such as mandatory registration. This illustrates the peculiar nature of Neuroinformatics, with its constant tension between making data as widely available as possible and having to ensure there are no harmful side-effects of doing so. Presumably, Primum non nocere - first, do no harm.
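As an illustration only - real de-identification is governed by regulation and is far more involved than this - a de-identification pass might look something like the sketch below; all field names here are made up for the example:

```python
import hashlib

# Fields that directly identify the subject and must never be shared.
DIRECT_IDENTIFIERS = {"name", "address", "date_of_birth"}

def deidentify(record: dict, salt: str) -> dict:
    """Drop direct identifiers and pseudonymise the subject ID."""
    clean = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    # A salted hash lets records from the same subject be linked across
    # data sets without revealing who the subject actually is.
    digest = hashlib.sha256((salt + record["subject_id"]).encode()).hexdigest()
    clean["subject_id"] = digest[:12]
    return clean

record = {"subject_id": "S-042", "name": "Jane Doe", "age": 34}
print(deidentify(record, salt="per-study-secret"))
```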
Databases, Repositories and Archives
Thanks to the efforts of Neuroinformatics, there is now a wealth of Neuroscience data available to all on the Internet. The roots of this growth were sown in the nineties, when labs started sharing research results online. Sharing always existed in one way or another, of course, but the rise of the Internet changed the magnitude of the process. It soon became apparent that there was a need to organise central repositories of data, and to ensure the consistency of the shared data. Papers with a distinct Neuroinformatics tone were written, such as An on-line archive of reconstructed hippocampal neurons (1999). Repositories grew, multiplied, morphed and in many cases died, as these things do, and the evolutionary process left us with the survivors. I'd like to highlight some of the ones I have bumped into so far (with descriptions in their own words):
- ModelDB: "ModelDB provides an accessible location for storing and efficiently retrieving computational neuroscience models. ModelDB is tightly coupled with NeuronDB. Models can be coded in any language for any environment. Model code can be viewed before downloading and browsers can be set to auto-launch the models."
- NeuronDB: "NeuronDB provides a dynamically searchable database of three types of neuronal properties: voltage gated conductances, neurotransmitter receptors, and neurotransmitter substances. It contains tools that provide for integration of these properties in a given type of neuron and compartment, and for comparison of properties across different types of neurons and compartments."
- NeuroMorpho: "NeuroMorpho.Org is a centrally curated inventory of digitally reconstructed neurons associated with peer-reviewed publications. It contains contributions from over 100 laboratories worldwide and is continuously updated as new morphological reconstructions are collected, published, and shared. To date, NeuroMorpho.Org is the largest collection of publicly accessible 3D neuronal reconstructions and associated metadata." (See the SWC sketch after this list.)
- Functional Connectomes Project: "Following the precedent of full unrestricted data sharing, which has become the norm in molecular genetics, the FCP entailed the aggregation and public release (via www.nitrc.org) of over 1200 resting state fMRI (R-fMRI) datasets collected from 33 sites around the world."
- OpenfMRI: "[…] project dedicated to the free and open sharing of functional magnetic resonance imaging (fMRI) datasets, including raw data."
- Open Source Brain: "resource for sharing and collaboratively developing computational models of neural systems."
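As a taste of what these repositories actually serve, NeuroMorpho's reconstructions are typically distributed in the plain-text SWC format, where each sample point is one line of seven fields. A minimal reader - ignoring the format's edge cases - might look like this:

```python
# Each non-comment SWC line has seven fields: sample id, structure type
# (1 = soma, 2 = axon, 3 = basal dendrite, 4 = apical dendrite), the x, y, z
# coordinates in micrometres, the radius, and the parent sample id (-1 = root).
def read_swc(path: str) -> list[dict]:
    samples = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and header comments
            i, t, x, y, z, r, parent = line.split()
            samples.append({
                "id": int(i), "type": int(t),
                "x": float(x), "y": float(y), "z": float(z),
                "radius": float(r), "parent": int(parent),
            })
    return samples
```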
As you can see from this small list - rather incomplete, I'm sure - there is a wealth of information out there, covering all sorts of aspects of the brain. We have never had as much data as we do today. And, in many ways, this is fast becoming a problem. For example, data from Neuroscience's plethora of divisions and sub-fields is not designed to interoperate: Electron Microscopy (EM) data is disconnected from data obtained by Magnetic Resonance Imaging (MRI), which is in turn totally separate from connectome information2, and so forth. In many cases, these sub-fields have evolved along fairly separate paths, developing their own technical vocabularies in isolation and over long periods of time - an approach perfectly suitable for a "disconnected" world but less than ideal for one where multiple sources of data are required to make sense of complex phenomena. If one can't even agree on what to call things, how can one hope to explain them?
Thus, the early Neuroinformatics approach is best described as "evolutionary". It is not as if someone sat down and generated a well defined set of file formats for data interchange, covering all different aspects of the areas under study. Instead, what has been emerging is a multitude of file formats in each sub-field, all calling out for attention, and all of them designed for the immediate goal at hand rather than the greater good of Neuroscience.
Taming the Sea of Data
From a Software Engineering perspective, an evolutionary approach makes perfect sense; after all, as the Real Programmers say: "first make it work, then make it right, and, finally, make it fast." In many ways, we are reaching the "make it right" phase, with increasing interest in the creation of broad standards. There have been several papers and initiatives on the subject, such as the Neuroscience Information Framework, or NIF, described in the paper The Neuroscience Information Framework: A Data and Knowledge Environment for Neuroscience. The paper outlines many of the problems hampering research, such as:
- the need for specialised, domain-aware search engines and advanced query tools;
- the need to aid integration and to provide connectivity across related data and findings;
- a requirement to provide new and enhanced ways of analysing existing data; data reuse is extremely important, as new insights can be obtained from existing data, often long after it was generated, by using it in ways the original authors never envisioned;
- the need to make contributing to online repositories easier; lowering the "contribution barrier" is important to increase data availability, but must be done in ways that do not compromise the quality of the data;
- a requirement to make all code open source such that any lab can make use of it, and the community as a whole can share the maintenance load;
- a need for an online repository for all tooling, to avoid reinventing the wheel;
- the need to create a multi-domain standard vocabulary.
There are many worthwhile points in this paper, and it is highly recommended to anyone interested in the subject matter. For instance, the section discussing the design of the NIF also covers the requirements for any specification that wishes to solve the problems outlined above. They are worth highlighting as - in my humble and lay opinion - they are very well thought out.
- The design of such a framework must combine technical specification choices with broad community support; "open data, access and exchange, via open source and platform, aid Framework-enabled open discovery for Neuroscience."
- A common framework would reduce the costs and enhance the benefits of data and knowledge sharing; it would "reduce the cost/benefit ratio for data acquisition and utilization."
- The framework must be designed by the broader community and with the needs of this broader community in mind, and it must build upon prior development in Neuroinformatics.
- A focus on interoperability is crucial; it is not a static target but one that must be looked after over time. In addition, one must keep in mind that different resources have very different interoperability potential. To maximise interoperability, we should aim to standardise as many aspects of the process as possible: user interfaces, terminologies, formats, and so on.
To the untrained eye, the NIF initiative appears to be a great effort to solve fundamental problems in the field. It also seems to have spawned and/or helped popularise many useful and lasting resources such as NeuroMorpho. However, the impression one gets from the outside is that the NIF didn't quite fulfil all of its potential. Having said that, I am keen to find up-to-date documents describing its current status across all of its many aspects - alas, I have not yet succeeded in finding any. If the initiative has indeed petered out, it did highlight a few potential problems for anyone working in this space:
- large undertakings are hard to pull off; small, organic, incremental changes are easier to do, but of course, that is why we have the problems we currently have.
- large initiatives require large amounts of funding; the work is technical and very expensive.
- it is not easy to understand NIF's deliverables from looking at its documentation and website. One can clearly see it was an ambitious project, and one which took on the brunt of the problem areas highlighted above, but perhaps it needed a slightly more self-contained view of its achievements rather than an all-or-nothing approach; that would have allowed some components to be preserved even as others failed to gain traction.
XML Strikes Back
Another interesting attempt to tackle these problems is what I call the "XML Suite". This is basically a set of different XML-based standards that interoperate and augment each other, a bit like a stack of building blocks. You can find more details in this paper: XML for Model Specification in Neuroscience. Some of the components of the XML Suite are (with descriptions in their own words, copied from the above paper, and a link for more details):
- LEMS: "the Low Entropy Model Specification […] is being developed to provide a compact, minimally redundant, human-readable, human-writable, declarative way of expressing models of biological systems. It differs from other systems such as CellML or SBML in its requirement to be human writable and the inclusion of basic physical concepts such as dimensionality and physical nesting as part of the language."
- NeuroML: "supports the use of declarative model specifications for neuroscience modeling efforts at different scales, from intracellular mechanisms to networks of reconstructed neurons."
- MorphML: "provides a common format for exchange of neuronal morphology data. It can also be used to specify cell structure for modeling efforts as part of NeuroML."
- BrainML: "application for representing time series data, spike trains, experimental protocols, and other data relevant to neurophysiology experiments."
- SBML: "(Systems Biology Markup Language) is an application for specifying models of biochemical reaction networks such as metabolic networks, cell-signaling pathways and gene regulatory networks."
- CellML: "is designed for the specification of biological models of cellular and sub-cellular processes such as calcium dynamics, metabolic pathways, signal transduction, and electrophysiology."
- MathML: "provides the means for describing the structure and content of mathematical notation in order to serve, receive, and process mathematics on the web. Other XML applications often use MathML language elements for representing mathematical equations."
A positive aspect of the XML Suite is its "discrete" nature. Each of these file formats is free to evolve in isolation, and the nature of their cooperation is very loose in most cases. For example, MathML is not at all specific to Neuroscience and has the support (to some extent) of the Maths community. In addition, the "stacking" approach is also a very interesting one, allowing a good domain focus. For example, NeuroML is built on top of LEMS, so in theory each layer should cover a different domain, with minimal redundancy.
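To give a flavour of the declarative approach, here is a sketch of reading a NeuroML-style document in Python. The snippet is modelled on published NeuroML 2 examples (an Izhikevich point neuron defined purely by its parameters, with no simulator code in sight); treat the exact tag and attribute names as illustrative rather than normative:

```python
import xml.etree.ElementTree as ET

# A cell described as pure data: the file says *what* the model is,
# not *how* to run it - that is the simulator's job.
doc = """
<neuroml xmlns="http://www.neuroml.org/schema/neuroml2" id="ExampleCell">
  <izhikevich2007Cell id="iz0" C="100pF" v0="-60mV" k="0.7nS_per_mV"
                      vr="-60mV" vt="-40mV" vpeak="35mV"
                      a="0.03per_ms" b="-2nS" c="-50mV" d="100pA"/>
</neuroml>
"""

root = ET.fromstring(doc)
cell = root[0]
# Strip the namespace from the tag and dump the declared parameters.
print(cell.tag.split("}")[-1], dict(cell.attrib))
```

Because the specification is declarative, any conforming tool - a simulator, a validator, a visualiser - can consume the same file.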
The key challenge for the XML Suite is for each of its components to find a sustainable user base, and sustainable funding to go along with it. This is a broader problem in Neuroinformatics: researchers do not want to spend time on work that does not contribute directly to their research, so the pool of developers available to do fundamental work on the file formats is limited. Once the developer pool becomes too small, the file format ends up with a small user base because it is not fit for purpose, and thus begins a downward spiral. This appears to have been the fate of projects such as BrainML.
This post provided an overview of the data landscape in Computational Neuroscience and introduced the sub-field of Neuroinformatics. We also looked at some of the available data stores and reviewed a few of the more popular initiatives to solve the fundamental data problems in the field.
Stay tuned for the next instalment!
For a bit more detail on the two fields, see What are Computational Neuroscience and Neuroinformatics?