Marco Craveiro

Nerd Food: Notes on Computational Finance, Part I: Introduction

2020-04-06T11:59:00.001-07:00

Nerd Food: Computational Finance, Part I: Introduction

Welcome to the first part of what we hope will be a long series of articles exploring QuantLib, Open Source Risk Engine (ORE), libbitcoin and other interesting finance-related FOSS (Free and Open Source) projects. However, I'm afraid this will be a bit of a dull first post, as we need to clarify our objectives before we can jump into the fray.

About

Even though I've been a software engineer in the financial sector for over fifteen years, I've always felt I lacked a deep understanding of the foundational concepts that make up the domain. As a self-confessed reductionist, I find this state of affairs extremely uncomfortable, akin to hearing a continuous loop of David Hilbert's words: wir müssen wissen, wir werden wissen¹. The situation had to be remedied, somehow, and the material you are now reading is the proposed fix for my ailments. As to the methodology: given I've had some success in applying the Feynman Technique² to other complex domains, it seemed only natural to try to use it for this endeavour as well. Past experience also demonstrated writing is an adequate replacement for in vivo communication, which is just as well in this brave new world of social distancing.

So that's that for the why and the how. But, just what exactly are we researching?

Scope

These posts shall largely amble where our fancy takes us, within the porous boundaries of finance. Alas, we can hardly keep calling our target domain "trading, accounting, crypto and a bit of quantitative finance, when viewed through the lens of FOSS" - descriptive though that might sound. We are Software Engineers after all, and if there is one thing we do, it is to name things, especially when we lack the competence to do so³. In this vein, I decided to call this motley domain of ours "Computational Finance". Should the name have merit, I'm afraid I have little claim to it, as it was shamelessly stolen from this passage in Wolfram's writings:

Doctors, lawyers, teachers, farmers, whatever. The future of all these professions will be full of computational thinking. Whether it’s sensor-based medicine, computational contracts, education analytics or computational agriculture - success is going to rely on being able to do computational thinking well.

I’ve noticed an interesting trend. Pick any field X, from archeology to zoology. There either is now a “computational X” or there soon will be. And it’s widely viewed as the future of the field.

These seemed like inspiring words for anyone embarking on a long and uncertain journey, so we made them our own and, in turn, they gave us a name to rally around. But what of its boundaries? One of the biggest challenges facing the reductionist is that, in the limit, everything is interconnected with everything else, for there is no natural halting function. Thus, if you are not careful, all paths will eventually lead you into the realm of quarks and particle physics, regardless of your starting point. Now, that would not be an entirely useful outcome. I have never found a completely satisfactory answer to this question in any of my previous meanderings, but in general I tend to follow an empirical approach and let taste be my guide⁴. Granted, it's probably not the scientific solution you were expecting, but it seems that there are "intuitive" boundaries in subjects, and when we hit one of those we shall stop⁵. As an example, for our purposes we need not look in detail at legal frameworks when trying to understand financial concepts, though the two disciplines are deeply intertwined.

Structure

An issue closely related to the previous one is how to strike a balance between computational exploration and domain definitions. Too much exploration and you proceed full steam ahead without knowing the meaning of things; too many boring definitions and they become just words, shedding no light on the subject under scrutiny. The sweet spot lies somewhere in the middle.

Our approach can be described as follows. We shall try to progress very slowly and methodically through the concepts in the domain, building them up as we climb the abstraction ladder but without making them too dense and technical. We'll make extensive use of Wikipedia definitions, where possible, but keep these focused only on the point at hand rather than exploring the myriad of possibilities around a theme.

Finally, we shall try to marry domain concepts with our chosen implementations - the computational experiments part - in order to illustrate their purpose and get a better understanding of what it is that they are trying to do. So, each post will focus on one fairly narrow subject area, start with a bunch of definitions which are hopefully self-explanatory and then proceed to explore the available implementations on that topic, or code that we write ourselves.

Audience

The target audience for this material is the fabled homo developus, that non-existent "standard developer" - in this particular case, one moderately competent in C++ but unfamiliar with computational finance. On the "finance" part, if you are already familiar with the domain, you will no doubt find the content very slow going. I'm afraid this is by design: the objective is to try to build the concepts on a solid foundation for those not in the know, so slowness is unavoidable⁶.

With regards to the computational part: the astute reader will likely point out that there are a great many tutorials on QuantLib, ORE and other libraries of a similar ilk, and that many books have been written on quantitative finance. One could be forgiven for wondering if there is a need to pile more literature onto a seemingly already crowded subject.

In our defence, we have yet to find work that directly targets "plain" software developers and provides them with a sufficiently broad view of the domain. In addition, most of the existing material is aimed at either those with strong mathematical abilities but no domain knowledge, or its converse, leaving many gaps in the understanding. What we are instead aiming for is to target those with strong programming abilities but no particular knowledge of either computational finance or mathematics. And this leads us nicely to our next topic.

Mathematics

Our assumption is that you, dear reader, are not able to attain deep levels of understanding by staring at page after page of complex mathematical formulae. I, for one, certainly cannot. Unfortunately, non-trivial mathematics is difficult to avoid when covering a subject matter of this nature so, as a counterweight, we shall strive to use it sparingly and only from a software engineering application perspective. Note that this approach is clearly not suitable for the mathematically savvy amongst us, as they will find it unnecessarily laboured; then again, our focus lies elsewhere.

Our core belief is that an average reader (like me) should be able to attain a software engineer's intuition of how things work just by fooling around with software models of formulae. The reason I am very confident in this regard is that this is how developers learn: by coding and seeing what happens. In fact, it is this very tight feedback loop between having an idea and experimenting with it that got many of us hooked on programming in the first place, so it's a very powerful tool in the motivational arsenal. And, as it turns out, these ideas are related to Wolfram's concept of Experimental Mathematics. Ultimately, our aspiration is to copy the approach taken by Klein in Coding the Matrix, though perhaps that sets the bar a tad too high. Well, at least you get the spirit of the approach.

Cryptos

Another rather peculiar idea we pursued is the use of cryptocurrencies throughout, to the exclusion of everything else. Whilst very popular in the media, where they are known as cryptos, in truth cryptocurrencies still have a limited presence in the "real" world of finance, and nowhere more so than in derivatives - i.e., the bulk of our analysis. So at first blush, this is a most puzzling choice. We have decided to do so for three reasons.

Firstly, just because I wanted to learn more about cryptos. Secondly, because there is a need to bridge the knowledge gap between these two distinct worlds of finance; to blend the old with the new if you will. Personally, I think it will be interesting to see what the proliferation of derivatives will do to cryptos - but for that we need to disseminate financial knowledge. Finally, and most important of all, because in order to properly illustrate all of the concepts we shall cover, and to drive the open source libraries to their potential, one needs vast amounts of data of the right kind. Let's elaborate further on this point.

One of the biggest problems with any material in quantitative finance is obtaining data sets which are sufficiently rich to cover all of the concepts being explained. This, in my opinion, is one of the key shortcomings of most tutorials: they either assume users can source the data themselves, or provide a small data set to prove a very specific point but which is insufficient for wider exploration⁷. This document takes a slightly different approach. We will base ourselves on a simulated world - a parallel reality if you'd like, thinly anchored to our reality by freely available data taken from the crypto markets. We shall then generate all of the remaining data, to the level of precision, richness and consistency required both to drive the code samples and to allow for "immersive" exploration. In fact, the very processes for data generation will be used as a pathway for domain exploration.

Of course, generated data is not perfect - i.e., realistic it is not, by definition - but our point is to understand the concepts, not to create new quant models that trade in the real world, so it is adequate for our needs. In addition, the data sets and code samples, as well as the means used to derive them shall be part of a git repository under an open source licence, so they can be extended and improved over time.
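To make the idea of "generated data anchored to reality" a little more tangible, here is a minimal sketch. Everything in it is an assumption made purely for illustration: the function name generate_prices, its parameters, and the choice of a simple multiplicative random walk are ours, and are not necessarily how the data for this series will actually be produced. The only "real" input is a single observed price; the rest of the path is synthetic.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Generates a synthetic price path that starts at an observed "anchor" price.
// Hypothetical helper: the random-walk model and all parameter names are
// assumptions made for illustration only.
std::vector<double> generate_prices(double anchor_price, std::size_t steps,
                                    double volatility, unsigned seed) {
    assert(anchor_price > 0.0);
    assert(volatility > 0.0);

    std::mt19937 engine(seed);  // fixed seed, so a run can be repeated
    std::normal_distribution<double> shock(0.0, volatility);

    std::vector<double> prices;
    prices.reserve(steps + 1);
    prices.push_back(anchor_price);  // the only "real" data point

    for (std::size_t i = 0; i < steps; ++i) {
        // Multiplying by exp(shock) keeps every generated price positive.
        prices.push_back(prices.back() * std::exp(shock(engine)));
    }
    return prices;
}
```

Calling generate_prices with a made-up anchor, say generate_prices(100.0, 50, 0.02, 42u), yields the anchor followed by fifty generated prices. Because the seed is fixed, the same toolchain will reproduce the same path, which is handy when code samples need to agree with the accompanying text.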

If you are not familiar with cryptos, don't worry. For starters, we can to a large extent take the intricate mechanistic details as given - the blockchain and so forth - and introduce key concepts as required. We need not concern ourselves with this because there is plenty of freely available material covering it in excruciating detail, and designed specifically for software engineers. Instead, we shall treat cryptos as if they were regular currencies, except where they are just too different - in which case we'll point out the differences. It's a bit of a strange approach, but hopefully it will produce the desired results.

Non Goals

If you are trying to learn techniques on how to trade, this is not the material for you. Even when we discuss trading strategies and other similar topics, our focus is always on trying to understand how the machinery works rather than on how to make money with it. Similarly, if you are a quant or are trying to become one, you are probably better off reading the traditional books such as Hull or Wilmott rather than these posts, as our treatment of mathematics will be far too basic for your requirements. However, if you are an expert in this subject area, or if you find any mistakes, please do point them out.

Legalese

As with anything to do with finance, we need to set out the standard disclaimers. To make sure these are seen, we shall add them to each post.

Legal Disclaimer

All of the content, including source code, is either written by the author of the posts, or obtained from freely available sites on the internet, with suitable software licences. All content sources shall be clearly identified at the point of use. No proprietary information of any kind - including, but not limited to, source code, text, market data or mathematical models - shall be used within this material.

All of the views expressed here represent exclusively myself and are not those of any corporation I may be engaged in commercial activities with.

The information available in these blog posts is for your general information and use and is not intended to address your particular requirements. In particular, the information does not constitute any form of financial advice or recommendation and is not intended to be relied upon by users in making (or refraining from making) any investment decisions.⁸

All software written by the author for these posts is licensed under the GNU GPL v3. As per the licence, it is "distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose. See the GNU General Public License for more details."

With all of the preliminaries out of the way, we can move on to the meat of the subject. In Part II we shall discuss our first real topic, and it could not be much more fundamental: Money.

Footnotes:

¹

"We must know, we will know". As per Wikipedia:

The epitaph on his tombstone in Göttingen consists of the famous lines he spoke at the conclusion of his retirement address to the Society of German Scientists and Physicians on 8 September 1930. The words were given in response to the Latin maxim: "Ignoramus et ignorabimus" or "We do not know, we shall not know".

²

The Feynman Technique is a well-established learning methodology. For more details, see Richard Feynman: The Difference Between Knowing the Name of Something and Knowing Something.

³

There are no circumstances under which I have seen software developers lacking confidence. I feel that the motto of our profession should be the Latin translation of Make up with confidence that which you lack for in competence.

⁴

An idea that was most likely inspired by Linus' views on good taste. For details see Applying the Linus Torvalds “Good Taste” Coding Requirement.

⁵

Of course, your intuition is not my intuition. I'm afraid you will have to take my taste as a given, even where you disagree. Feel free to make your views heard though.

⁶

As they say in my home country of Angola, malembe malembe. The expression can be loosely translated to English as "slowly but surely", or "slowly does it".

⁷

As an example, the latter approach is taken by a library I respect very much, the Open Source Risk Engine (ORE).

⁸

This paragraph was obtained from the Truly Independent Ltd and modified to suit our needs.

Nerd Food: The Refactoring Quagmire

2018-01-03T07:45:00.000-08:00


The latest Dogen sprint turned out to be a really long and tortuous one, which is all the more perplexing given the long list of hard sprints that preceded it. Clearly, the slope of the curve is steepening unrelentingly. Experience teaches that whenever you find yourself wandering over such terrains, it is time to stop and gather your thoughts; more likely than not, you are going the wrong way - fast.

Thus, for Dogen, this is a post of reflection. To the casual reader - if nothing else - it will hopefully serve as a cautionary tale.

Not Even Wrong

If you are one of the lucky few internauts who avidly follows our release notes, you may recall that the previous sprint had produced a moment of enlightenment where we finally understood yarn as the core of Dogen. At the time, it felt like one of those rare eureka moments, and "the one last great change to the architecture"; afterwards, all would be light. "Famous last words", you may have said then and, of course, if you did, you were right. But given the historical context, the optimism wasn't entirely unjustified. To understand why, we need to quickly recap how the architecture has evolved over time.

Dogen started out divided into three very distinct parts: the frontends (Dia, JSON), the middle-end (yarn) and the backends (C++, C#). The "pipeline" metaphor guided our design because we saw Dogen very much like a compiler, with its frontend, middle-end and backend stages. This was very handy as it meant we could test all stages of the pipeline in isolation. Composition was done by orchestrating frontend, middle-end and backends, at a higher level. This architecture had very good properties when it came to testability and debuggability: we'd start by running the entire pipeline and locating the problem; then, one could easily isolate the issue to a specific component either by looking at the log file, or by dumping the inputs and outputs of the different stages and sifting through them. As a result, bug reproduction was very straightforward since we just needed to record the inputs and create the test at the right level. Whilst the names of the models and their responsibilities changed over time, the overall pipeline architecture had remained in place since the very early days of Dogen.
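For readers who prefer code to prose, the pipeline idea can be sketched in a few lines. This is a toy illustration only: the types and functions below (frontend_model, run_middle_end and so on) are hypothetical names invented for this example and bear no relation to Dogen's real models. The point is merely that each stage is a plain function from input to output, so it can be exercised on its own, and that orchestration sits above the stages.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-ins for the data passed between stages; real models
// would be far richer than this.
struct frontend_model { std::vector<std::string> raw_names; };
struct middle_model { std::vector<std::string> class_names; };

// Frontend stage: turns some input (here, plain text lines) into a model.
frontend_model run_frontend(const std::vector<std::string>& lines) {
    return frontend_model{lines};
}

// Middle-end stage: drops blank entries; knows nothing about file formats.
middle_model run_middle_end(const frontend_model& fm) {
    middle_model r;
    for (const auto& n : fm.raw_names)
        if (!n.empty())
            r.class_names.push_back(n);
    return r;
}

// Backend stage: emits one C++ class declaration per element.
std::vector<std::string> run_backend(const middle_model& mm) {
    std::vector<std::string> r;
    for (const auto& n : mm.class_names)
        r.push_back("class " + n + " {};");
    return r;
}

// Orchestration lives above the stages and simply composes them.
std::vector<std::string> run_pipeline(const std::vector<std::string>& lines) {
    return run_backend(run_middle_end(run_frontend(lines)));
}
```

With a layout like this, reproducing a backend bug only requires recording a middle_model and feeding it to run_backend; neither the frontend nor the middle-end needs to be involved.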

In parallel to this, a second trend had emerged over the last ten sprints or so: we moved more and more functionality from the frontends and backends to the middle-end. The key objective here was DRY: we soon found a lot of commonalities between frontends, driving us to create a simple frontend intermediate format so that the work was carried out only once. Not long after, we discovered that backends suffered from precisely the same malaise, so the same cure began to be applied there too. So far so good, as we were following Roberts and Johnson's sage advice:

People develop abstractions by generalizing from concrete examples. Every attempt to determine the correct abstractions on paper without actually developing a running system is doomed to failure. No one is that smart. A framework is a reusable design, so you develop it by looking at the things it is supposed to be a design of. The more examples you look at, the more general your framework will be.

The literature was with us and the wind was in our sails: the concrete code in the frontends and backends was slowly cleaned up, made general and moved across to the middle-end. As this process took hold, the middle-end grew and grew in size and responsibilities, just as everybody else shed them. Before long, we ended up with one big model, a couple of medium-sized models and lots of very small models: "modelets", we named them. These were models with very little responsibility other than gluing together one or two things. The overhead of maintaining a physical component (e.g. static or dynamic library) for the sake of one or two classes seemed a tad too high.

As we began to extrapolate the trend somewhat, a vision suddenly appeared: why not centralise everything in the middle-end? That is:

place all meta-models and transforms in one single central location, together with their orchestration; call it the "core model";
orchestration becomes either helper code or a transform in its own right;
within this "core model", provide interfaces that backends and frontends implement, injecting them dynamically;
make these new interfaces appear as transform chains themselves (mostly).

In this elegant and clean brave new world, we would no longer have "ends" as such but something more akin to "plugins", dynamically glued into the "middle-end" via the magic of dependency injection; the "middle-end" itself would no longer be a "middle" but the center of everything. Backends and frontends had to merely implement the interfaces supplied by the core and the system would just magically sort itself out. The idea seemed amazing and we quickly moved to implementation.
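The shape of this "core plus plugins" design can be sketched as follows. Again, this is an illustration under our own assumptions rather than Dogen's actual code: backend_interface, core and cpp_backend are hypothetical names, and the injection is done by hand instead of through any particular dependency injection mechanism.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

// --- Would live in the "core model" (hypothetical names throughout) ---

// Interface supplied by the core; every backend must implement it.
class backend_interface {
public:
    virtual ~backend_interface() = default;
    virtual std::string generate(const std::string& element) const = 0;
};

// The core only ever sees the interface, never a concrete backend.
class core {
public:
    void inject(std::unique_ptr<backend_interface> b) {
        backends_.push_back(std::move(b));
    }

    // Runs every injected backend over one model element.
    std::vector<std::string> transform(const std::string& element) const {
        std::vector<std::string> r;
        for (const auto& b : backends_)
            r.push_back(b->generate(element));
        return r;
    }

private:
    std::vector<std::unique_ptr<backend_interface>> backends_;
};

// --- Would live in a separate backend component ---

// The backend has to include the core's interface in order to implement it.
class cpp_backend final : public backend_interface {
public:
    std::string generate(const std::string& element) const override {
        return "class " + element + " {};";
    }
};
```

Note that in this sketch the core never names cpp_backend, yet nothing useful happens until a backend has been injected into it, and the backend in turn cannot be built without the core's interface. Everything now flows through the one central class.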

Alas, in our haste to jump into the fray, we had forgotten to heed Mencken:

[T]here is always a well-known solution to every human problem — neat, plausible, and wrong.

The Strange Loop

One of the biggest downsides of working alone and in your spare time is the lack of feedback from other developers. And it's not even just that other developers will teach you lots of new things. No, more often than not, they'll simply drag you away from the echo chambers and tunnels of self-reinforcement you carefully craft and curate for yourself. You are your own intellectual jailer.

In the cold light of day, any developer will tell you that creating cycles is not a good idea, and should not be done without a great deal of thought. Yet, we managed to create "circular" dependencies between all components of the system by centralising all responsibilities into yarn. Now, you may say that these are not "canonically circular" - and this is probably why the problem was not picked up in the first place - because yarn provides interfaces for other models to implement. Well, Lakos is very helpful here in explaining what is going on: our logical design had no cycles - because yarn does not explicitly call any frontends or backends - but the physical design did have them. And these came at a cost.

For starters, it screwed up reasonability. Even though frontends and backends still had their own models, the net result was that we jumbled up all of the elements of the pipeline into a single model, making it really hard to tell what's what. Explaining the system to a new developer now required saying things such as "ah, don't worry about that part for now, it belongs to the middle-end, but here we are dealing only with the backends" - a clear code smell. Once a property of the architecture, reasonability now had to be conveyed in lossy natural language. Testability and debuggability got screwed up too because now everything went through one single central model; if you needed to test a frontend fix, you still had to build the backends and the middle-end and initialise them too. Our pursuit of clarity muddied the waters.

To make matters worse, an even more pertinent question arose: just when exactly should you stop refactoring? In my two decades of professional development, I had never encountered this problem. In the real world, you are fortunate if you get a tiny amount of time allocated to refactoring - most of the time you need to somehow sneak it into some overall estimate and hope no one notices. Like sharks, Project Managers (PM) are bred to smell refactoring efforts from a mile away and know how to trim estimates down to the bone. Even when you are in a greenfield project or just lucky enough to have an enlightened PM who will bat for you, you still need to contend with the realities of corporate development: you need to ship, now. No one gets away with endless refactoring. No one, that is, other than the Free and Open Source Software Developer.

Like many a spare-time project, Dogen is my test bed of ideas around coding and coding processes; a general sandbox to have fun outside of work. As such - and very much by design - the traditional feedback loops that exist in the real world need not apply. I wanted to see what would happen if you coded without any constraints and, in the end, what I found out was that if you do not self-impose some kind of halting machinery, you will refactor forever. In practice, physics still applies, so your project will eventually die out because its energy will dissipate across the many refactoring fronts and entropy will, as always, triumph. But if you really want to keep it at bay, at least for a little while, you need to preserve energy by having one single, consistent vision - "wrong" as it may be according to some metric or other. For, as Voltaire said and we often forget, "le mieux est l'ennemi du bien".

The trouble is that refactoring is made up of a set of engineering trade-offs, and when you optimise for one thing you'll inevitably make something else worse. So, first and foremost, you need to make sure you understand what your trade-offs are, and prioritise accordingly. Secondly, looking for a global minimum in such a gigantic multidimensional space is impossible, so you need to make do with a local minimum. But how do you know you have reached a "good enough" point in that space? You need some kind of conceptual cost function.

Descending the Gradient

So it was that we started by defining the key dimensions across which we were trying to optimise. This can be phrased slightly differently: given what we now know about the domain and its implementation, what are the most important characteristics of an idealised physical and logical design?

After some thinking, the final answer was deceptively simple:

the entities of the logical design (models, namespaces, classes, methods and the like) should reflect what one reads in the literature of Model Driven Engineering (MDE). That is, a person competent in the field should find a code base that talks his or her language.
logical and physical design should promote reasonability and isolation, and thus orchestration should be performed via composition rather than by circular physical dependencies.

For now, these are the two fundamental pillars guiding the Dogen architecture; any engineering trade-offs to be made must ensure these dimensions take precedence. In other words, we can only optimise away any "modelets" if doing so does not negatively impact either of these two dimensions. If it does, then we must discard this refactoring option. More generally, it is now possible to "cost" all refactoring activity - a conceptual refactoring gradient descent if you'd like; it either brings us closer to the local minimum or takes us further away. It gave us a sieve with which to filter the product and sprint backlogs.

To cut a rather long story short, we ended up with a "final" - ha, ha - set of changes to the architecture to get us closer to the local minimum:

move away from sewing terms: from the beginning we had used terms such as knitter, yarn and so forth. These were… colourful, but did not add any value and distracted us from the first dimension. This was a painful decision but clearly required if one is to comply with point one above: we need to replace all sewing terms with domain-specific vocabulary.
reorganise the models into a pipeline: however, instead of simply going back to the "modelets" of the past, we need to have a deep think as to what responsibilities belong at what stage of the pipeline. Perhaps the "modelets" were warning us of design failures.

Conclusion

It's never a great feeling when you end a long and arduous sprint only to figure out you were going in the wrong direction in design space. In fact, it is rather frustrating. We have many stories in the product backlog which are really exciting and which will add real value to the end users - well, at this point, just us really but hey - yet we seemed to be lost in some kind of refactoring Groundhog Day, with no end in sight. However, the main point of Dogen is to teach, and learn we undoubtedly did.

As with anything in the physical world, nothing in software engineering exists in splendid perfection like some kind of platonic solid. Perfection belongs to the realm of maths. In engineering, something can only be described as "fit for purpose", and doing so requires first determining, as best we can, what that purpose might be. So, before you wander into a refactoring quagmire of your own making, be sure to have a very firm idea of what your trade-offs are.

Created: 2018-01-03 Wed 15:55


Nerd Food: Northwind, or Using Dogen with ODB - Part IV

2017-03-25T12:34:00.000-07:00


So, dear reader, we meet again for the fourth and final instalment of our series of posts on using Dogen with ODB! And if you missed an episode - unlikely as it may be - well, fear not for you can always catch up! Here are the links: Part I, Part II and Part III. But, if you are too lazy and need a summary: all we've done thus far is to install and set up an Oracle Express database, populate it with a schema (and data) and finally code-generate an ORM model with Dogen and ODB.

I guess it would not be entirely unfair to describe our adventure thus far as a prelude; if nothing else, it was a character building experience. But now we can finally enjoy the code.

Building Zango

Assuming you have checked out zango as described in Part III and you are sitting in its containing directory, you can "configure" the project fairly simply:

$ . /u01/app/oracle/product/11.2.0/xe/bin/oracle_env.sh
$ cd zango
$ git pull origin master
$ cd build
$ mkdir output
$ cd output
$ CMAKE_INCLUDE_PATH=/full/path/to/local/include CMAKE_LIBRARY_PATH=/full/path/to/local/lib cmake ../.. -G Ninja
-- The C compiler identification is GNU 6.3.0
<lots of CMake output>
-- Generating done
-- Build files have been written to: /path/to/zango/build/output

As always, do not forget to replace /full/path/to/local with your path to the directory containing the ODB libraries. If all has gone according to plan, CMake should have found ODB, Boost, Dogen and all other dependencies we have carefully and painstakingly set up in the previous three parts.

Once the configuration is done, you can fire up Ninja to build:

$ ninja -j5
[1/100] Building CXX object projects/northwind/src/CMakeFiles/northwind.dir/io/category_id_io.cpp.o
<lots of Ninja output>
[98/100] Linking CXX static library projects/northwind/src/libzango.northwind.a
[99/100] Building CXX object CMakeFiles/application.dir/projects/application/main.cpp.o
[100/100] Linking CXX executable application

That was easy! But what exactly have we just built?

The "Application"

We've created a really simple application to test drive the northwind model. Of course, this is really not what your production code should look like, but it'll do just fine for our purposes. We start by reading a password from the command line and then use it to instantiate our Oracle database:

    const std::string password(argv[1]);
    using odb::oracle::database;
    std::unique_ptr<database>
        db(new database("northwind", password, "XE", "localhost", 1521));

We then use this database to read all available customers:

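std::list<zango::northwind::customers>
load_customers(odb::oracle::database& db) {
    // ODB database operations must run inside a transaction.
    odb::oracle::transaction t(db.begin());

    // Copy every row of the result set into a plain list.
    std::list<zango::northwind::customers> r;
    auto rs(db.query<zango::northwind::customers>());
    for (auto i(rs.begin()); i != rs.end(); ++i)
        r.push_back(*i);

    // Finish the transaction explicitly rather than relying on the
    // implicit rollback when t goes out of scope.
    t.commit();
    return r;
}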

Please note that this is a straightforward use of the ODB API, but barely scratches the surface of what ODB can do. ODB supports all sorts of weird and wonderful things, including fairly complex queries and other great features. If you'd like more details on how to use ODB, you should read its manual: C++ Object Persistence with ODB. It's extremely comprehensive and very well written.

Once we have the customers in memory, we can start to do things with them. We can for example serialise them to a Boost serialisation binary archive and read them back out:

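    boost::filesystem::path file("a_file.bin");
    {
        // Binary archives need the stream opened in binary mode.
        boost::filesystem::ofstream os(file, std::ios::binary);
        boost::archive::binary_oarchive oa(os);
        oa << customers;
    } // archive and stream are closed at the end of this scope

    std::cout << "Wrote customers to file: "
              << file.generic_string() << std::endl;

    std::list<zango::northwind::customers> customers_from_file;
    {
        // Read back with the same binary mode used for writing.
        boost::filesystem::ifstream is(file, std::ios::binary);
        boost::archive::binary_iarchive ia(is);
        ia >> customers_from_file;
    }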

This is where hopefully you should start to see the advantages of Dogen: without writing any code, we have full serialisation support to all classes in the model - in addition to ODB support, of course.

Another very useful feature is to dump objects into a stream:

    for (const auto& c : customers_from_file)
        std::cout << "Customer: " << c << std::endl;

The objects are written in JSON, making it easy to post-process the output with JSON tools such as JQ, resulting in a nicely formatted string:

{
  "__type__": "zango::northwind::customers",
  "customer_id": {
    "__type__": "zango::northwind::customer_id",
    "value": 90
  },
  "customer_code": "WILMK",
  "company_name": "Wilman Kala",
  "contact_name": "Matti Karttunen",
  "contact_title": "Owner/Marketing Assistant",
  "address": "Keskuskatu 45",
  "city": "Helsinki",
  "region": "",
  "postal_code": "21240",
  "country": "Finland",
  "phone": "90-224 8858",
  "fax": "90-224 8858"
}

Dogen supports dumping arbitrarily-nested graphs, so it's great for logging program state as you go along. We make extensive use of this in Dogen, since - of course - we use Dogen to develop Dogen. Whilst this has proven invaluable, we have also hit some limits. For example, sometimes you may bump into really large and complex objects and JQ just won't cut it. But the great thing is that you can always dump the JSON into PostgreSQL - very easily indeed, given the ODB support - and then run queries on the object using the power of JSONB. With a tiny bit more bother you can also dump the objects into MongoDB.

However, with all of this said, it is also important to notice that we do not support proper JSON serialisation in Dogen at the moment. This will be added Real-Soon-Now, as we have a real need for it in production, but it's not there yet. At present all you have is this debug-dumping of objects into streams which happens to be JSON. It is not real JSON serialisation. Real JSON support is very high on our priority list though, so expect it to land in the next few sprints.

Another useful Dogen feature is test data generation. This can be handy for performance testing, for example. Let's say we want to generate ~10K customers and see how Oracle fares:

std::vector<zango::northwind::customers> generate_customers() {
    std::vector<zango::northwind::customers> r;
    const auto total(10 * 1000);
    r.reserve(total);

    zango::northwind::customers_generator g;
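    for (int i = 0; i < total; ++i) {
        // Always advance the generator, but only keep the customers
        // generated after the first hundred or so.
        const auto c(g());
        if (i > 100)
            r.push_back(c);
    }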

    return r;
}

Note that we skipped the first hundred customers just to avoid clashes with the customer_id primary key. Now, thanks to the magic of ODB we can easily push this data into the database:

void save_customers(odb::oracle::database& db,
    const std::vector<zango::northwind::customers>& customers) {

    odb::transaction t(db.begin());
    for (const auto& c : customers)
        db.persist(c);
    t.commit();
}

Et voilà, we have lots of customers in the database now:

SQL> select count(1) from customers;

  COUNT(1)
----------
      9990

To be totally honest, this exercise revealed a shortcoming in Dogen: since it does not know the size of the columns in the database, the generated test data may in some cases be too big to fit them:

Saving customers...
terminate called after throwing an instance of 'odb::oracle::database_exception'
  what():  12899: ORA-12899: value too large for column "NORTHWIND"."CUSTOMERS"."CUSTOMER_CODE" (actual: 6, maximum: 5)

I solved this problem with a quick hack for this article (by removing the prefix used in the test data) but a proper fix is now sitting in Dogen's product backlog for implementation in the near future.
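
For the curious, one possible shape for a size-aware generator is sketched below. This is purely hypothetical and is not how Dogen's generators work: the code_generator class, the "cust_" prefix and the max_length parameter are all made up for illustration. The idea is simply to trim the front of the generated value so that the varying digits at the end survive.

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical generator of string test data that respects a maximum
// column width supplied by the caller.
class code_generator {
public:
    explicit code_generator(const std::size_t max_length)
        : max_length_(max_length), position_(0) {}

    std::string operator()() {
        std::ostringstream os;
        os << "cust_" << position_++; // made-up prefix plus a sequence number
        std::string r(os.str());

        // Drop characters from the front so that the sequence number at
        // the end is preserved.
        if (r.size() > max_length_)
            r.erase(0, r.size() - max_length_);
        return r;
    }

private:
    const std::size_t max_length_;
    unsigned int position_;
};
```

With a maximum of 5 characters, "cust_0" (6 characters) is trimmed down to "ust_0", which would fit a column such as CUSTOMER_CODE.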

Finally, just for giggles, I decided to push the data we read from Oracle into Redis, an in-memory cache that seems to be all the rage amongst the Cool-Kid community. To keep things simple, I used the C API provided by hiredis. Of course, if this were the real world, I would have used one of the many C++ clients for Redis such as redis-cplusplus-client or cpp redis. As it was, I could not find any Debian packages for them, so I'll just have to pretend I know C. Since I'm not much of a C programmer, I decided to do a very bad copy and paste job from this Stack Overflow article. The result was this beauty (forgive me in advance, C programmers):

    redisContext *c;
    redisReply *reply;
    const char *hostname = "localhost";
    int port = 6379;
    struct timeval timeout = { 1, 500000 }; // 1.5 seconds
    c = redisConnectWithTimeout(hostname, port, timeout);
    if (c == NULL || c->err) {
        if (c) {
            std::cerr << "Connection error: " << c->errstr << std::endl;
            redisFree(c);
        } else {
            std::cerr << "Connection error: can't allocate redis context"
                      << std::endl;
        }
        return 1;
    }

    std::ostringstream os;
    boost::archive::binary_oarchive oa(os);
    oa << customers;
    const auto value(os.str());
    const std::string key("customers");
    reply = (redisReply*)redisCommand(c, "SET %b %b", key.c_str(),
        (size_t) key.size(), value.c_str(), (size_t) value.size());
    if (!reply)
        return REDIS_ERR;
    freeReplyObject(reply);

    reply = (redisReply*)redisCommand(c, "GET %b", key.c_str(),
        (size_t) key.size());
    if (!reply)
        return REDIS_ERR;

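    if ( reply->type != REDIS_REPLY_STRING ) {
        // str is not set for every reply type, so guard against NULL.
        std::cerr << "ERROR: "
                  << (reply->str ? reply->str : "unexpected reply type")
                  << std::endl;
        freeReplyObject(reply);
        redisFree(c);
        return 1;
    }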

    const std::string redis_value(reply->str, reply->len);
    std::istringstream is(redis_value);
    std::list<zango::northwind::customers> customers_from_redis;
    boost::archive::binary_iarchive ia(is);
    ia >> customers_from_redis;
    std::cout << "Read from redis: " << customers_from_redis.size()
              << std::endl;
    std::cout << "Front customer (redis): "
              << customers_from_redis.front() << std::endl;
    freeReplyObject(reply);
    redisFree(c); // release the connection once we are done with it

And it actually works. Here's the output, with manual formatting of JSON:

Read from redis: 91
Front customer (redis):  {
  "__type__": "zango::northwind::customers",
  "customer_id": {
    "__type__": "zango::northwind::customer_id",
    "value": 1
  },
  "customer_code": "ALFKI",
  "company_name": "Alfreds Futterkiste",
  "contact_name": "Maria Anders",
  "contact_title": "Sales Representative",
  "address": "Obere Str. 57",
  "city": "Berlin",
  "region": "",
  "postal_code": "12209",
  "country": "Germany",
  "phone": "030-0074321",
  "fax": "030-0076545"
}

As you can hopefully see, in very few lines of code we managed to connect to an RDBMS, read some data, dump it into a stream, write it to and read it from Boost Serialization archives, and push it into and out of Redis. All this in fairly efficient C++ code (and some very dodgy C code, but we'll keep that one quiet).

A final note on the CMake targets. Zango comes with a couple of targets for Dogen and ODB:

knit_northwind generates the Dogen code from the model.
odb_northwind runs ODB against the Dogen model, generating the ODB sources.

The ODB target is added automatically by Dogen. The Dogen target was added manually by yours truly, and it is considered good practice to have one such target when you use Dogen so that other Dogen users know how to generate your models. You can, of course, name it what you like, but in the interest of making everyone's life easier it's best if you follow the convention.

Oracle and Bulk Fetching

Whilst I was playing around with ODB and Oracle, I noticed a slight problem: there is no bulk fetch support in the ODB Oracle wrappers at present; it works for other scenarios, but not for selects. I reported this to the main ODB mailing list here. By the by, the ODB community is very friendly and their mailing list is a very responsive place to chat about ODB issues.

Anyway, so you can have an idea of this problem, here's a fetch of our generated customers without prefetch support:

<snip>
Generating customers...
Generated customers. Size: 9899
Saving customers...
Saved customers.
Read generated customers. Size: 9990 time (ms): 263.449
<snip>
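
Timings like the one in this output can be collected with std::chrono. The helper below is only a sketch of the general approach, not necessarily how zango measures things; elapsed_ms is a made-up name.

```cpp
#include <chrono>

// Runs the callable once and returns the wall-clock time it took, in
// milliseconds. steady_clock is used because it never jumps backwards.
template <typename Callable>
double elapsed_ms(Callable&& f) {
    const auto start(std::chrono::steady_clock::now());
    f();
    const auto end(std::chrono::steady_clock::now());
    return std::chrono::duration<double, std::milli>(end - start).count();
}
```

You would wrap the call you want to measure in a lambda, say the one loading the customers, and print the returned value.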

Remember the 263.449 for a moment. Now say you delete all rows we generated:

delete from customers where customer_id > 100;

Then, say you apply to libodb-oracle the hastily-hacked patch I mentioned in that mailing list thread. Of course, I am hand-waving here greatly, as you need to rebuild the library, install the binaries, rebuild zango, etc, but you get the gist. At any rate, here's the patch, hard-coding an unscientifically-obtained-prefetch of 5K rows:

--- original_statement.txt 2017-02-09 15:45:56.585765500 +0000
+++ statement.cxx        2017-02-13 10:18:28.447916100 +0000
@@ -1574,18 +1574,29 @@
       OCIError* err (conn_.error_handle ());
+      const int prefetchSize(5000);
+      sword r = OCIAttrSet (stmt_,
+          OCI_HTYPE_STMT,
+          (void*)&prefetchSize,
+          sizeof(int),
+          OCI_ATTR_PREFETCH_ROWS,
+          err);
+
+      if (r == OCI_ERROR || r == OCI_INVALID_HANDLE)
+          translate_error (err, r);
+
       // @@ Retrieve a single row into the already bound output buffers as an
       // optimization? This will avoid multiple server round-trips in the case
       // of a single object load.
       //
-      sword r (OCIStmtExecute (conn_.handle (),
+      r = OCIStmtExecute (conn_.handle (),
                                stmt_,
                                err,
                                0,
                                0,
                                0,
                                0,
-                               OCI_DEFAULT));
+                               OCI_DEFAULT);
       if (r == OCI_ERROR || r == OCI_INVALID_HANDLE)
         translate_error (conn_, r);

And now re-run the command:

Generated customers. Size: 9899
Saving customers...
Saved customers.
Read generated customers. Size: 9990 time (ms): 40.85

Magic! We're down to 40.85. Now that I have a proper setup, I am going to start working on upstreaming this patch, so that ODB can expose the prefetch configuration for selects in a similar manner to what it already does for other purposes. If you are interested in the gory technical details, have a look at Boris' reply.
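
A back-of-the-envelope model helps explain the improvement, which works out at roughly 6.4 times faster (263.449 / 40.85). The sketch below assumes that each round trip to the server brings back at most the prefetch count of rows; this is a simplification of what OCI really does, and round_trips is a made-up helper.

```cpp
#include <cstddef>

// Number of server round trips needed to fetch `rows` rows when each
// trip returns at most `prefetch` rows. A prefetch of zero is treated
// as one row per trip.
constexpr std::size_t round_trips(const std::size_t rows,
    const std::size_t prefetch) {
    return (rows + (prefetch == 0 ? 1 : prefetch) - 1) /
        (prefetch == 0 ? 1 : prefetch);
}
```

Under this model, reading our 9990 rows one at a time takes 9990 round trips, whereas a prefetch of 5000 rows brings that down to 2. The measured gain is much smaller than that ratio, presumably because the rest of the work (materialising the objects, for instance) does not go away.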

Conclusion

Hopefully this concluding part gave you an idea of why you might want to use Dogen with ODB for your modeling needs. Sadly, it's not easy to frame the discussion adequately, so that you have all the required context in order to place these two tools in the continuum of tooling; but I'm hoping this series of articles was useful to at least help you set up Oracle Express in Debian and get an idea of what you can do with these two tools.

Created: 2017-03-25 Sat 20:26

Nerd Food: Northwind, or Using Dogen with ODB - Part III

2017-03-20T04:50:00.000-07:00

Optimism is an occupational hazard of programming; feedback is the treatment. -- Kent Beck

Welcome to the third part of a series of N blog posts on using Dogen with ODB against an Oracle database. If you want more than the TL;DR, please read Part I and Part II. Otherwise, the story so far can be quickly summarised as follows: we got our Oracle Express database installed and set up by adding the required users; we then built the ODB libraries and installed the ODB compiler.

After this rather grand build up, we shall finally get to look at Dogen - just about. It now seems clear this series will have to be extended by at least one or two additional instalments in order to provide a vaguely sensible treatment of the material I had initially planned to cover. I wasn't expecting N to become so large, but - like every good software project - I'm now realising you can only estimate the size of the series properly once you've actually finished it. And to rub salt into the wounds, before we can proceed we must start by addressing some of the instructions in the previous posts which were not quite right.

Est Humanum Errare?

The first and foremost point in the errata agenda is concerned with the additional Oracle packages we downloaded in Part I. When I had originally checked my Oracle XE install, I did not find an include directory, which led me to conclude that a separate download was required for driver libraries and header files. I did find this state of affairs somewhat unusual - but then again, it is Oracle we're talking about here, so "unusual" is the default behaviour. As it turns out, I was wrong; the header files are indeed part of the Oracle XE install, just placed under a rather… shall we say, creative, location: /u01/app/oracle/product/11.2.0/xe/rdbms/public. The libraries are there too, under the slightly more conventionally named lib directory.

This is quite an important find because the downloaded OCI driver has moved on to v12 whereas XE is still on v11. There is backwards compatibility, of course - and everything should work fine connecting a v12 client against a v11 database - but it does introduce an extra layer of complexity: you now need to make sure you do not simultaneously have both v11 and v12 shared objects in the path when linking and running or else you will start to get some strange warnings. As usual, we try our best to confuse only one issue at a time, so we need to make sure we are making use of v11 and purge all references to v12; this entails recompiling ODB's Oracle support.

If you followed the instructions in Part II and you have already installed the ODB Oracle library, you'll need to remove it first:

rm -r /full/path/to/local/lib/libodb-oracle* /full/path/to/local/include/odb/oracle

Remember to replace /full/path/to/local with the path to your local directory. Then, you can build by following the instructions as per the previous post, but with one crucial difference at configure time: point to the Oracle XE directories instead of the external OCI driver directories:

. /u01/app/oracle/product/11.2.0/xe/bin/oracle_env.sh
LD_LIBRARY_PATH=/u01/app/oracle/product/11.2.0/xe/lib CPPFLAGS="-I/full/path/to/local/include -I/u01/app/oracle/product/11.2.0/xe/rdbms/public" LDFLAGS="-L/full/path/to/local/lib -L/u01/app/oracle/product/11.2.0/xe/lib" ./configure --prefix=/full/path/to/local

Again, replace the paths accordingly. If all goes well, the end result should be an ODB Oracle library that uses the OCI driver from Oracle XE. You then just need to make sure you have executed oracle_env.sh before running your binary, but don't worry too much because I'll remind you later on. Whilst we're on the subject of Oracle packages, it's worth mentioning that I did a minor update to Part I: you didn't need to download SQLPlus separately either, as it is also included in the XE package. So, in conclusion, after a lot of faffing, it turns out you can get away with just downloading XE and nothing else.

The other minor alteration to what was laid out in the original posts is that I removed the need for the basic database schema. In truth, the entities placed in that schema were not adding a lot of value; their use cases are already covered by the northwind schema, so I removed the need for two schemas and collapsed them into one.

A final note - not quite an erratum per se but still, something worth mentioning. We didn't do a "proper" Oracle setup, so when you reboot your box you will find that the service is no longer running. You can easily restart it from the shell, logged in as root:

# cd /etc/init.d/
# ./oracle-xe start
Starting oracle-xe (via systemctl): oracle-xe.service.

Notice that Debian is actually clever enough to integrate the Oracle scripts with systemd, so you can use the usual tools to find out more about this service:

# systemctl status oracle-xe
● oracle-xe.service - SYSV: This is a program that is responsible for taking care of
   Loaded: loaded (/etc/init.d/oracle-xe; generated; vendor preset: enabled)
   Active: active (exited) since Sun 2017-03-12 15:10:47 GMT; 6s ago
     Docs: man:systemd-sysv-generator(8)
  Process: 16761 ExecStart=/etc/init.d/oracle-xe start (code=exited, status=0/SUCCESS)

Mar 12 15:10:37 lorenz systemd[1]: Starting SYSV: This is a program that is responsible for taking c…e of...
Mar 12 15:10:37 lorenz oracle-xe[16761]: Starting Oracle Net Listener.
Mar 12 15:10:37 lorenz su[16772]: Successful su for oracle by root
Mar 12 15:10:37 lorenz su[16772]: + ??? root:oracle
Mar 12 15:10:37 lorenz su[16772]: pam_unix(su:session): session opened for user oracle by (uid=0)
Mar 12 15:10:39 lorenz oracle-xe[16761]: Starting Oracle Database 11g Express Edition instance.
Mar 12 15:10:39 lorenz su[16800]: Successful su for oracle by root
Mar 12 15:10:39 lorenz su[16800]: + ??? root:oracle
Mar 12 15:10:39 lorenz su[16800]: pam_unix(su:session): session opened for user oracle by (uid=0)
Mar 12 15:10:47 lorenz systemd[1]: Started SYSV: This is a program that is responsible for taking care of.
Hint: Some lines were ellipsized, use -l to show in full.

With all of this said, let's resume from where we left off.

Installing the Remaining Packages

We still have a number of packages to install, but fortunately the installation steps are easy enough so we'll cover them quickly in this section. Let's start with Dogen.

Dogen

Installing Dogen is fairly straightforward: we can just grab the latest release from BinTray:

dogen 0.99.0 amd64-applications.deb

As it happens, we must install v0.99 or above because we did a number of fixes to Dogen as a result of this series of articles; previous releases had shortcomings with their ODB support.

As expected, the setup is pretty standard-fare Debian:

$ wget https://dl.bintray.com/domaindrivenconsulting/Dogen/0.99.0/dogen_0.99.0_amd64-applications.deb -O dogen_0.99.0_amd64-applications.deb
$ sudo dpkg -i dogen_0.99.0_amd64-applications.deb
[sudo] password for USER:
Selecting previously unselected package dogen-applications.
(Reading database ... 551550 files and directories currently installed.)
Preparing to unpack dogen_0.99.0_amd64-applications.deb ...
Unpacking dogen-applications (0.99.0) ...
Setting up dogen-applications (0.99.0) ...

If all has gone according to plan, you should see something along the lines of:

$ dogen.knitter --version
Dogen Knitter v0.99.0
Copyright (C) 2015-2017 Domain Driven Consulting Plc.
Copyright (C) 2012-2015 Marco Craveiro.
License: GPLv3 - GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>.

Dia

Dogen has multiple frontends - at the time of writing, JSON and Dia. We'll stick with Dia because of its visual nature, but keep in mind that what you can do with Dia you can also do with JSON.

A quick word on Dia for those not in the know, copied verbatim from its home page:

Dia is a GTK+ based diagram creation program for GNU/Linux, MacOS X, Unix, and Windows, and is released under the GPL license.

Dia is roughly inspired by the commercial Windows program 'Visio,' though more geared towards informal diagrams for casual use. It can be used to draw many different kinds of diagrams. It currently has special objects to help draw entity relationship diagrams, UML diagrams, flowcharts, network diagrams, and many other diagrams.

Dia does not change very often, which means any old version will do. You should be able to install Dia straight from the package manager:

apt-get install dia

Other Dependencies

I had previously assumed Boost to be installed in Part II but - if nothing else, purely for the sake of completeness - here are the instructions to set it up, as well as CMake and Ninja. We will need these in order to build our application, but we won't dwell on them too much or else this series of posts would go on forever. Pretty much any recent version of Boost and CMake will do, so again we'll just stick to the vanilla package manager:

# apt-get install cmake ninja-build libboost-all-dev

Mind you, you don't actually need the entirety of Boost for this exercise, but it's just easier this way.

Emacs and SQL Plus

Finally, a couple of loose notes which I might as well add here. If you wish to use SQLPlus from within Emacs - and you should, since the SQLi mode is just simply glorious - you can configure it to use our Oracle Express database quite easily:

(add-to-list 'exec-path "/u01/app/oracle/product/11.2.0/xe/bin")
(setenv "PATH" (concat (getenv "PATH") ":/u01/app/oracle/product/11.2.0/xe/bin"))
(setenv "ORACLE_HOME" "/u01/app/oracle/product/11.2.0/xe")

After this you will be able to start SQL Plus from Emacs with the usual sql-oracle command. I recommend you do at least a minimal setup of SQL Plus too, to make it usable:

SQL> set linesize 8192
SQL> set pagesize 50000

Introducing Zango

After this excruciatingly long setup process, we can at long last start to create our very "simple" project. Simple in quotes because it ended up being a tad more complex than what was originally envisioned, so it was easier to create a GitHub repository for it. It would have been preferable to describe it from first principles, but then the commentary would literally go on for ever. A compromise had to be made.

In order to follow the remainder of this post please clone zango from GitHub:

git clone git@github.com:DomainDrivenConsulting/zango.git

Zango is a very small Dogen project that builds with CMake. Here are some notes on the folder structure to help you navigate:

build/cmake: additional CMake modules that are not part of the standard CMake distribution. We need this for ODB, Oracle and Dogen.
data: some application data that we will use to populate our database.
projects: where all the code lives.
projects/input_models: location of the Dogen models - in this case, we just have one. You could, of course, place it anywhere you'd like, but traditionally this is where they live.
projects/northwind: code output of the Dogen model. This is the key project of zango.
projects/application: our little command line driver for the application.

Now, before we get to look at the code I'd like to first talk about Northwind and about the relationship between Dogen and ODB.

Northwind Schema

Microsoft makes the venerable Northwind database available in CodePlex, at this location. I found a useful description of the Northwind database here, which I quote:

Northwind Traders Access database is a sample database that shipped with Microsoft Office suite. The Northwind database contains the sales data for a fictitious company called Northwind Traders, which imports and exports specialty foods from around the world. You can use and experiment with Access with Northwind database while you're learning and develop ideas for Access.

If you really want a thorough introduction to Northwind, you could do worse than reading this paper: Adapting the Access Northwind Database to Support a Database Course. Having said that, for the purposes of this series we don't really need to dig that deep. In fact, I'll just present CodePlex's diagram with the tables and their relationships to give you an idea of the schema - without any further commentary - and that's more or less all that needs to be said about it:

Northwind Schema (C) Microsoft.

Now, in theory, we could use this image to manually extract all the required information to create a Dia diagram that follows Dogen's conventions, code-generate that and Bob's your Uncle. However, in practice we have a problem: the CodePlex project only contains the SQL statements for Microsoft SQL Server. Part of the point of this exercise is to show that we can load real data from Oracle, rather than just generate random data, so it would be nice to load up the "real" Northwind data from their own tables. This would be more of an "end-to-end" test, as opposed to using ODB to generate the tables, and Dogen to generate random data which we can push to the database.

However, it's not entirely trivial to convert T-SQL into Oracle SQL, and since this is supposed to be a "quick" project on the side - focusing on ODB and Dogen - I was keen on not spending time on unrelated activities such as SQL conversions. Fortunately, I found exactly what I was looking for: a series of posts from GeeksEngine entitled "Convert MS Access Northwind database to Oracle". For reference, these are as follows:

If you don't care too much about the details, you can just look at the Oracle SQL statements, available here and copied across into the Zango project. I guess it's still worthwhile mentioning that GeeksEngine has considerably reduced the number of entities in the schema - for which they provide a rationale. Before we start an in-depth discussion on the merits of normalisation and de-normalisation and other DBA level topics, I have to stop you in your tracks. Please do not get too hung-up on the "quality" of the database schema of Northwind - either the Microsoft or the GeeksEngine one. The purpose of this exercise is merely to demonstrate how Dogen and ODB work together to provide an ORM solution. From this perspective, any vaguely realistic database schema is adequate - provided it allows us to test-drive all the features we're interested in. Whether you agree or not with the decisions the original creators of this schema made is a completely different matter, which is well beyond the scope of this series of posts.

Right, so now we need to set up our Northwind schema and populate it with data. For this you can open a SQL Plus session with user Northwind as explained previously and then run the SQL script:

@/path/to/zango/data/Oracle-Northwind.sql

Replace /path/to with the full path to your Zango checkout. This executes the GeeksEngine script against your local Oracle XE database. If all has gone well, you should now have a whole load of tables and data. You can sanity-check the setup by running the following SQL:

SQL> select table_name from all_tables where owner = 'NORTHWIND';

TABLE_NAME
------------------------------
ORDER_DETAILS
CATEGORIES
CUSTOMERS
EMPLOYEES
SUPPLIERS
SHIPPERS
PRODUCTS
ORDERS

8 rows selected.

SQL> select employee_id, firstname, lastname from employees where rownum <3;

EMPLOYEE_ID FIRSTNAME  LASTNAME
----------- ---------- --------------------
      1 Nancy      Davolio
      2 Andrew     Fuller

Now then, let's model these entities in Dogen.

The Dogen Model for Northwind

Before we proceed, I'm afraid I must make yet another disclaimer: a proper explanation of how to use Dia (and UML in general) is outside the scope of these articles, so you'll see me hand-waving quite a lot. Hopefully the diagrams are sufficiently self-explanatory for you to get the idea.

The process of modeling is simply to take the entities of the GeeksEngine SQL schema and to model them in Dia, following Dogen's conventions: each SQL type is converted to what we deemed to be the closest C++ type. You can open the diagram from the folder projects/input_models/northwind.dia, but if you haven't got it handy, here's a screenshot of most of the UML model:

Dogen Northwind model.

The first point of note in that diagram is - if you pardon the pun - the UML note.

Figure 1: UML Note from northwind model.

This configuration is quite important so we'll discuss it in a bit more detail. All lines starting with #DOGEN are an extension mechanism used to supply meta-data into Dogen. First, let's have a very quick look at the model's more "general settings":

yarn.dia.comment: this is a special command that tells Dogen to use this UML note as the source code comments for the namespace of the model (i.e. northwind). Thus the text "The Northwind model is a…" will become part of a doxygen comment for the namespace.
yarn.dia.external_modules: this places all types into the top-level namespace zango, which is why the generated classes have names such as zango::northwind::customers.
yarn.input_language: the notation for types used in this model is C++. We won't delve on this too much, but just keep in mind that Dogen supports both C++ and C#.
quilt.cpp.enabled: as we are using C++, we must enable it.
quilt.cpp.hash.enabled: we do not require this feature for the purposes of this exercise.
quilt.csharp.enabled: As this is a C++-only model, we will disable C#.
annotations.profile: Do not worry too much about this knob, it just sets a lot of default options for this project such as copyright notices and so forth.

As promised, you won't fail to notice we hand-waved quite a lot on the description of these settings. It is very difficult to explain them properly without giving the reader an immense amount of context about Dogen. This, of course, needs to be done - particularly since we haven't really spent the required time updating the manual. However, in the interest of keeping this series of posts somewhat focused on ODB and ORM, we'll just leave it at that, with a promise to create Dogen-specific posts on them.

Talking about ORM, the next batch of settings is exactly related to that.

yarn.orm.database_system: here, we're stating that we're interested in both oracle and postgresql databases.
yarn.orm.letter_case: this sets the "case" to use for all identifiers; either upper_case or lower_case. So if you choose upper_case, all your table names will be in upper case and vice-versa. This applies to all columns and object names on the entirety of this model (e.g. customers becomes CUSTOMERS and so forth).
yarn.orm.schema_name: finally we set the schema name to northwind. Remember that we are in upper case, so the name becomes NORTHWIND.
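
To make the effect of the letter case and schema settings concrete, here is a small sketch of the name mangling they imply. It is an illustration only, not Dogen's implementation; letter_case, apply_case and qualified_name are names invented for the example.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

enum class letter_case { upper_case, lower_case };

// Converts an identifier to the requested case.
std::string apply_case(std::string s, const letter_case lc) {
    std::transform(s.begin(), s.end(), s.begin(), [lc](unsigned char c) {
        return static_cast<char>(lc == letter_case::upper_case ?
            std::toupper(c) : std::tolower(c));
    });
    return s;
}

// Builds the schema-qualified table name, with both parts in the
// requested case.
std::string qualified_name(const std::string& schema,
    const std::string& table, const letter_case lc) {
    return apply_case(schema, lc) + "." + apply_case(table, lc);
}
```

With upper_case, the class customers in the schema northwind ends up as NORTHWIND.CUSTOMERS, which is the form Oracle reports in its error messages.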

In addition to the meta-data, the second point worth noticing is that there is a large overlap between C++ classes and the entities in the original diagram. For example, we have customers, suppliers, employees and so forth - the Object-Relational Mapping is very "linear". This is a characteristic of the Dogen approach to ORM, but you do not necessarily need to use ODB in this manner; we discuss this in the next section.

If one is to look at the properties of a few attributes in more detail, one can see additional Dogen meta-data. Take customer_id in the customers class:

Figure 2: Properties of customer_id in the customer class.

The setting yarn.orm.is_primary_key tells Dogen that this attribute is the primary key of the table. Note that we did not use an int as the type of customer_id but instead made use of a Dogen feature called "primitives". Primitives are simple wrappers around builtins and "core" types such as std::string, intended to have little or no overhead after the compiler is done with them. They are useful when you want to use domain concepts to clarify intent. For example, primitives help make it obvious when you try to use a customer_id when a supplier_id was called for. It's also worth noticing that customer_id makes use of yarn.orm.is_nullable - settable to true or false. It results in Dogen telling ODB if a column can be NULL or not.
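
The idea behind primitives can be shown with a couple of hand-written wrappers. This is a sketch of the concept rather than the code Dogen generates; the explicit constructors and the lookup_key function are assumptions made for the example.

```cpp
#include <type_traits>

// Two wrappers around int. Structurally identical, but distinct types
// as far as the compiler is concerned.
struct customer_id {
    explicit customer_id(const int v) : value(v) {}
    int value;
};

struct supplier_id {
    explicit supplier_id(const int v) : value(v) {}
    int value;
};

// Hypothetical function that only makes sense for customers.
int lookup_key(const customer_id& id) {
    return id.value;
}

// Passing a supplier_id, or a bare int, where a customer_id is called
// for is rejected at compile time.
static_assert(!std::is_convertible<supplier_id, customer_id>::value,
    "supplier ids must not be usable as customer ids");
static_assert(!std::is_convertible<int, customer_id>::value,
    "bare ints must be wrapped explicitly");
```

So lookup_key(customer_id(90)) compiles, whereas lookup_key(supplier_id(90)) and lookup_key(90) do not.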

As we stated, each of the attributes of these classes has the closest C++ type we could find that maps to the SQL type used in the database schema. Of course, different developers can make different choices for these types. For example, were we to store the picture data rather than a path to the picture as GeeksEngine decided to do, we would use a std::vector<char> instead of a std::string. In that case, we'd have to perform some additional mapping too:

#DOGEN yarn.orm.type_override=postgresql,BYTEA
#DOGEN yarn.orm.type_override=oracle,BLOB

This tells Dogen about the mapping of the attribute's type to the SQL type. Dogen then conveys this information to ODB.

Dogen's ORM support is still quite young - literally a couple of sprints old - so there will be cases where you may need to perform some customisation which is not yet available in its meta-model. In these cases, you can bypass Dogen and make use of ODB pragmas directly. As an example, the GeeksEngine Oracle schema named a few columns in Employees without underscores, such as FIRSTNAME and LASTNAME. We want the C++ classes to have the correct names (e.g. first_name, last_name, etc) so we simply tell ODB that these columns have different names in the database. Take last name for example:

Figure 3: Properties of last name in the employee class.
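
At the ODB level, a column rename of this kind boils down to a column pragma. The sketch below is hand-written and only approximates what ends up in the generated code: the shape of the employees struct and its member names are assumptions, and the pragmas are wrapped in ODB_COMPILER so that a regular C++ compiler ignores them.

```cpp
#include <string>

// Hand-written stand-in for the employees class.
struct employees {
    int employee_id;
    std::string first_name;
    std::string last_name;
};

// Only the ODB compiler sees these; they map the C++ members onto the
// column names used by the database schema.
#ifdef ODB_COMPILER
#pragma db object(employees) table("EMPLOYEES")
#pragma db member(employees::employee_id) id column("EMPLOYEE_ID")
#pragma db member(employees::first_name) column("FIRSTNAME")
#pragma db member(employees::last_name) column("LASTNAME")
#endif
```

The C++ side keeps its readable names, and only the mapping knows that the database spells them without underscores.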

A final note on composite keys. Predictably, Dogen follows the ODB approach - in that primary keys that have more than one column must be expressed as a class in its own right. In northwind, we use the postfix _key for these class names in order to make them easier to identify - e.g. order_details_key. You won't fail to notice that this class has the flag yarn.orm.is_value set. It tells Dogen - and, by extension, ODB - that it is not really a full-blown type, which would map it to a table, but instead should be treated like other value types such as std::string.

Interlude: Dogen with ODB vs Plain ODB

"The technical minutiae are all well and good", the inquisitive reader will say, "but why Dogen and ODB? Why add yet another layer of indirection when one can just use ODB?" Indeed, it may be puzzling for there to be a need for a code-generator which generates code for another code-generator. "Turtles all the way down" and "We over-engineered it yet again", the crowd chants from the terraces.

Let me attempt to address some of these concerns.

First, it is important to understand the argument we're trying to make here: Dogen models benefit greatly from ODB, but it's not necessarily the case that all ODB users benefit from Dogen. Let's start with a classic ODB use case, which is to take an existing code base and add ORM support to it. In this scenario it makes no sense to introduce Dogen; after all, ODB requires only small changes to the original source code and has the ability to parse very complex C++. And, of course, using ODB in this manner also allows one to deal with impedance mismatches between the relational model and the object model of your domain.

Dogen, on the other hand, exists mainly to support Model Driven Software Development (MDSD), so the modeling process is the driver. This means that one is expected to start with a Dogen model, and to use the traditional MDSD techniques for the management of the life-cycle of your model - and eventually for the generation of entire software product lines. Of course, you need not buy into the whole MDSD sales pitch in order to make use of Dogen, but you should at least understand it in this context. At a bare minimum, it requires you to think in terms of Domain Models - as Domain Driven Design defines them - and then in terms of "classes of features" required by the elements of your domain. These we call "facets" in Dogen parlance. There are many such facets, such as io, which is the ability to dump an object's state into a C++ stream - at present using JSON notation - or serialization, which is the ability to serialise an object using Boost serialisation. It is in this context that ODB enters the Dogen world. We could, of course, generate ORM mappings (and SQL) directly from Dogen. But given what we've seen from ODB, it seems this would be a very large project - or, conversely, we'd have very poor support, not dealing with a great number of corner cases. By generating the very minimal and very non-intrusive code that ODB needs, we benefit from the years of experience accumulated in this tool whilst at the same time making life easier for Dogen users.

Of course, as with all engineering trade-offs, this one is not without its disadvantages. When things do go wrong you now have more moving parts to go through when root-causing: was it an error in the diagram, or was it Dogen, or was it the mapping between Dogen and ODB or was it ODB? Fortunately, I found that this situation is minimised by the way in which you end up using Dogen. For instance, all generated code can be version-controlled, so you can look at the ODB input files generated by Dogen and observe how they change with changes in the Dogen model. The Dogen ODB files should also look very much like regular hand-crafted ODB files - making use of pragmas and so forth - and you are also required to run ODB manually against them. Thus, in practice, I have found troubleshooting straightforward enough that the layers of indirection end up not constituting a real problem.

Finally, it's worth pointing out that the Domain Models Dogen generates have a fairly straightforward shape to them, making the ODB mapping a lot more trivial than it would be for "general" C++ code. It is because of this that we have orm parameters in Dogen, which can expand to multiple ODB pragmas - the user should not need to worry about that expansion.
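To give a feel for what "expanding to ODB pragmas" could mean, here is a toy sketch. It is emphatically not Dogen's implementation: the orm_settings struct, the to_odb_pragmas function and the member names passed in are all hypothetical. The only things borrowed from ODB are the pragma spellings themselves (id, null, not_null and column).

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, much simplified view of the ORM settings on one attribute.
struct orm_settings {
    bool is_primary_key;
    bool is_nullable;
    std::string column_name; // empty means "use the default column name"
};

// Toy expansion of the settings into ODB member pragmas, one string per
// pragma. The qualified member name (e.g. "employee::last_name_") is
// supplied by the caller.
inline std::vector<std::string> to_odb_pragmas(
    const std::string& qualified_member, const orm_settings& s) {
    const std::string prefix("#pragma db member(" + qualified_member + ") ");

    std::vector<std::string> r;
    if (s.is_primary_key)
        r.push_back(prefix + "id");

    r.push_back(prefix + (s.is_nullable ? "null" : "not_null"));

    if (!s.column_name.empty())
        r.push_back(prefix + "column(\"" + s.column_name + "\")");

    return r;
}

// Small usage example: a nullable attribute with an overridden column name.
inline void pragma_expansion_example() {
    const auto pragmas = to_odb_pragmas("employee::last_name_",
        orm_settings{false, true, "LASTNAME"});
    assert(pragmas.size() == 2);
}
```

Even in this toy form, a single attribute can fan out into several pragmas, which is exactly the kind of detail the model author should not need to think about.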

Conclusion

This part is already becoming quite large, so I'm afraid we need to stop here and continue in Part IV. However, we have managed to address a few of the mistakes of the Oracle setup of previous parts, introduced the remaining applications that we need to install and then discussed Northwind - both in terms of its original intent and also in terms of the Dogen objectives. Finally, we provided an explanation of how Dogen and ODB fit together in a tooling ecosystem.

Created: 2017-03-19 Sun 23:07


Nerd Food: Northwind, or Using Dogen with ODB - Part II

2017-02-24T04:18:00.000-08:00


In Part I of this series, we got our Oracle Express database up and running against Debian Testing. It involved quite a bit of fiddling but we seemed to get there in the end. In Part II we shall now finish the configuration of the Oracle database and set up the application dependencies. In Part III we will finally get to the Dogen model, and start to make use of ODB.

What's in a Schema?

The first thing we need to do to our database is add the "application users". This is a common approach in most server-side apps, where we tend to have "service users" that log in to the database and act upon user requests on their behalf. We can then use audit tables to stamp the user actions so we can monitor them. We can also have application level permissions that stop users from doing silly things. This is of course a step up from the applications in the nineties, where one would have one database account for each user - allowing all sorts of weird and wonderful things such as users connecting directly to databases via ODBC and Excel or Access. I guess nowadays developers don't even know someone thought this to be a good idea at one point.

When I say "database user", most developers exposed to RDBMSs immediately associate this with a user account. This is of course how most databases work, but obviously not so with Oracle. In Oracle, "users" and "schemas" are conflated, so much so that it's hard to tell if there is any difference between them. For the purist RDBMS user, a schema is a schema - a collection of tables and other database objects, effectively a namespace - and a user is a user - a person (real or otherwise) that owns database objects. In Oracle these two more or less map to the same concept. So when you create a user, you have created a schema and you can start adding tables to it; and when you refer to database objects, you prefix them with the user name just as you would if they belonged to a schema. And, of course, you can have users that have no database objects of their own, but which were granted permission to access database objects from other users.

So our first task is to create two schemas; these are required by the Dogen model which we will use as our "application". They are:

basic
northwind

As I mentioned before, I had created some fairly basic tests for ODB support in Dogen. Those entities were placed in the aptly named schema basic. I then decided to extend the schema with something a bit more meaty, which is where northwind comes in.

For the oldest readers, especially those with a Microsoft background, Northwind is bound to conjure memories. Many of us learned Microsoft Access at some point in the nineties, and in those days the samples were pure gold. I was lucky enough to learn about relational databases in my high-school days, using Clipper and dBASE IV, so the transition to Microsoft Access was more of an exercise in mapping than learning proper. And that's where Northwind came in. It was a "large" database, with forms and queries and tables and all sorts of weird and wonderful things; every time you needed something done to your database you'd check first to see how Northwind had done it.

Now that we are much older, of course, we can see the flaws of Northwind and even call for its abolition. But you must remember that in the nineties there was no Internet for most of us - even dial-up was pretty rare where I was - and up-to-date IT books were almost as scarce, so samples were like gold dust. So for all of these historic reasons and as an homage to my olden days, I decided to implement the Northwind schema in Dogen and ODB; it may not cover all corner cases, but it is certainly a step up on my previous basic tests.

Enough about history and motivations. Returning to our SQL Plus session from Part I, where we were logged in as SYSTEM, we start first by creating a table space and then the users which will make use of that table space:

SQL> create tablespace tbs_01 datafile 'tbs_f01.dbf' size 200M online;

Tablespace created.

SQL> create user basic identified by "PASSWORD" default tablespace tbs_01 quota 100M on tbs_01;

User created.

SQL> create user northwind identified by "PASSWORD" default tablespace tbs_01 quota 100M on tbs_01;

User created.

Remember to replace PASSWORD with your own passwords. This is of course a very simple setup; in the real world you would have to take great care setting the users and table spaces up, including thinking about temporary table spaces and so forth. But for our simplistic purposes this suffices. Now we need to grant these users a couple of useful privileges - again, for a real setup, you'd need quite a bit more:

SQL> GRANT create session TO basic;

Grant succeeded.

SQL> GRANT create table TO basic;

Grant succeeded.

SQL> GRANT create session TO northwind;

Grant succeeded.

SQL> GRANT create table TO northwind;

Grant succeeded.

If all went well, we should now be able to exit the SYSTEM session, start a new one with one of these users, and play with a test table:

$ sqlplus northwind@XE

SQL*Plus: Release 11.2.0.2.0 Production on Fri Feb 24 10:20:10 2017

Copyright (c) 1982, 2011, Oracle.  All rights reserved.

Enter password:

Connected to:
Oracle Database 11g Express Edition Release 11.2.0.2.0 - 64bit Production

SQL> create table test ( name varchar(10) );

Table created.

SQL> insert into test(name) values ('kianda');

1 row created.

SQL> select * from test;

NAME
----------
kianda

SQL> grant select on test to basic;

Grant succeeded.

SQL> Disconnected from Oracle Database 11g Express Edition Release 11.2.0.2.0 - 64bit Production
$ sqlplus basic@XE

SQL*Plus: Release 11.2.0.2.0 Production on Fri Feb 24 10:23:04 2017

Copyright (c) 1982, 2011, Oracle.  All rights reserved.

Enter password:

Connected to:
Oracle Database 11g Express Edition Release 11.2.0.2.0 - 64bit Production

SQL> select * from northwind.test;

NAME
----------
kianda

This all looks quite promising. To recap, we logged in with user northwind, created a table, inserted some random data and selected it back; all looked ok. Then for good measure, we granted the rights to see this test table to user basic; logged in as that user and selected the test table, with the expected results.

At this point we consider our Oracle setup completed and we're ready to enter the application world.

Enter ODB

Setting up ODB is fairly easy, especially if you are on Debian: you can simply obtain it via apt-get or synaptic. The only slight snag is that I could not find the Oracle dependencies (i.e. libodb-oracle). Likely this is because they depend on OCI, which is non-free, so Debian either does not bother to package it at all or you need some kind of special (non-free) repo for it. As it was, instead of losing myself on wild goose chases, I thought it easier to build from source. And since I had to build one from source, I might as well build all (or almost all) to demonstrate the whole process from scratch, as it is pretty straightforward, really.

Before we proceed, one warning: best if you either use your package manager or build from source. You should probably only mix-and-match if you really know what you are doing; if you do and things get tangled up, it may take you a long while to figure out the source of your woes.

So, the manual approach. I first started by revisiting my previous notes on building ODB; as it happens, I had covered installing ODB from source previously here for version 2.2. However, those instructions have largely bit-rotted at the Dogen end and things have changed slightly since that post, so a revisit is worthwhile.

As usual, we start by grabbing all of the packages from the main ODB website:

odb_2.4.0-1_amd64.deb: the ODB compiler itself.
libodb-2.4.0: the main ODB library, required by all backends.
libodb-pgsql-2.4.0: the PostgreSQL backend. We don't need it today, of course, but since PostgreSQL is my DB of choice I always install it.
libodb-oracle-2.4.0: the Oracle backend. We will need this one.
libodb-boost-2.4.0: the ODB boost profile. This allows using boost types in your Dogen model and having ODB do the right thing in terms of ORM mapping. Our Northwind model does not use boost at present, but I intend to change it as soon as possible as this is a very important feature for customers.

Of course, if you are too lazy to click on buttons, just use wget:

$ mkdir odb
$ cd odb
$ wget http://www.codesynthesis.com/download/odb/2.4/odb_2.4.0-1_amd64.deb -O odb_2.4.0-1_amd64.deb
$ wget http://www.codesynthesis.com/download/odb/2.4/libodb-2.4.0.tar.gz -O libodb-2.4.0.tar.gz
$ wget http://www.codesynthesis.com/download/odb/2.4/libodb-pgsql-2.4.0.tar.gz -O libodb-pgsql-2.4.0.tar.gz
$ wget http://www.codesynthesis.com/download/odb/2.4/libodb-oracle-2.4.0.tar.gz -O libodb-oracle-2.4.0.tar.gz
$ wget http://www.codesynthesis.com/download/odb/2.4/libodb-boost-2.4.0.tar.gz -O libodb-boost-2.4.0.tar.gz

We start with the DEB, as simple as always:

# dpkg -i odb_2.4.0-1_amd64.deb
Selecting previously unselected package odb.
(Reading database ... 549841 files and directories currently installed.)
Preparing to unpack odb_2.4.0-1_amd64.deb ...
Unpacking odb (2.4.0-1) ...
Setting up odb (2.4.0-1) ...
Processing triggers for man-db (2.7.6.1-2) ...

I tend to store locally built software under my home directory, so that's where we'll place the libraries:

$ mkdir ~/local
$ tar -xaf libodb-2.4.0.tar.gz
$ cd libodb-2.4.0/
$ ./configure --prefix=/full/path/to/local
<snip>
$ make
<snip>
make[1]: Leaving directory '/path/to/build/directory/odb/2.4/libodb-2.4.0'
$ make install
<snip>
make[1]: Leaving directory '/path/to/build/directory/odb/2.4/libodb-2.4.0'

Remember to replace /full/path/to/local with your installation directory. The process is similar for the other three packages, with one crucial difference: you need to ensure all required dependencies are in the include and link paths. This is achieved via the venerable environment variables CPPFLAGS and LDFLAGS (and LD_LIBRARY_PATH as we shall see). You may bump into the --with-libodb configure option. However, be careful; the documentation states:

If these libraries are not installed and you would like to use their build directories instead, you can use the --with-libodb, and --with-boost configure options to specify their locations, for example:

./configure --with-boost=/tmp/boost

So if you did make install, you need the environment variables instead.

Without further ado, here are the shell commands. First boost; do note I am relying on the presence of Debian's system boost; if you have a local build of boost, which is not in the flags below, you will also need to add a path to it.

$ cd ..
$ tar -xaf libodb-boost-2.4.0.tar.gz
$ cd libodb-boost-2.4.0/
$ CPPFLAGS=-I/full/path/to/local/include LDFLAGS=-L/full/path/to/local/lib ./configure --prefix=/full/path/to/local
<snip>
config.status: executing libtool-rpath-patch commands
$ make -j5
<snip>
make[1]: Leaving directory '/path/to/build/directory/odb/2.4/libodb-boost-2.4.0'
$ make install
make[1]: Leaving directory '/path/to/build/directory/odb/2.4/libodb-boost-2.4.0'

For PostgreSQL again I am relying on the header files installed in Debian. The commands are:

$ cd ..
$ tar -xaf libodb-pgsql-2.4.0.tar.gz
$ cd libodb-pgsql-2.4.0/
$ CPPFLAGS=-I/full/path/to/local/include LDFLAGS=-L/full/path/to/local/lib ./configure --prefix=/full/path/to/local
<snip>
config.status: executing libtool-rpath-patch commands
$ make -j5
<snip>
make[1]: Leaving directory '/path/to/build/directory/odb/2.4/libodb-pgsql-2.4.0'
$ make install
<snip>
make[1]: Leaving directory '/path/to/build/directory/odb/2.4/libodb-pgsql-2.4.0'

Finally, Oracle. For this we need to supply the locations of the downloaded drivers or else ODB will not find the Oracle headers and libraries. If you recall from the previous post, they are located in /usr/include/oracle/12.1/client64 and /usr/lib/oracle/12.1/client64/lib, so we must augment the flags with those two paths. In addition, I found configure was failing with errors finding shared objects, so I added LD_LIBRARY_PATH for good measure. The end result was as follows:

$ cd ..
$ tar -xaf libodb-oracle-2.4.0.tar.gz
$ cd libodb-oracle-2.4.0
$ LD_LIBRARY_PATH=/usr/lib/oracle/12.1/client64/lib CPPFLAGS="-I/full/path/to/local/include -I/usr/include/oracle/12.1/client64" LDFLAGS="-L/full/path/to/local/lib -L/usr/lib/oracle/12.1/client64/lib" ./configure --prefix=/full/path/to/local
<snip>
config.status: executing libtool-rpath-patch commands
$ make -j5
<snip>
make[1]: Leaving directory '/path/to/build/directory/odb/2.4/libodb-oracle-2.4.0'
$ make install
<snip>
make[1]: Leaving directory '/path/to/build/directory/odb/2.4/libodb-oracle-2.4.0'

And there you are; all libraries built and installed into our local directory, ready to be used.

Conclusion

In this part we've configured the Oracle Express database with the application users, and we sanity checked the configuration. Once that was out of the way, we built and installed all of the ODB libraries required by application code.

In Part III we will finally start making use of this setup and attempt to connect to the Oracle database. Stay tuned!

Created: 2017-02-24 Fri 12:32


Nerd Food: Northwind, or Using Dogen with ODB - Part I

2017-02-23T15:37:00.000-08:00


Thanks to my first Dogen paying customer, I finally got a chance to work with ODB - Code Synthesis' amazingly interesting C++ Object-Relational Mapping tool, built on the back of the GCC plugin system. I've personally always been in awe of what Boris Kolpackov has achieved, and, of course, it being a South African company made me all the more keen to use their wares. More importantly: the product just seems to rock in terms of polish, features and documentation.

Astute readers of this blog will point out that Dogen has been supporting ODB for quite some time. That is indeed true, but since I hadn't used this feature in anger, I wasn't sure how good the support really was; our fairly trivial database model (Dia) explored only a small fraction of what is possible. Now that I finally had a chance to use it in production, I needed to expand the tests and try to replicate the customer's scenario as closely as possible. As always in these situations, there was a snag: instead of using PostgreSQL - the RDBMS I had originally used in my Dogen tests - they were using Oracle. So my first task was to set up Oracle locally on my beloved Debian Linux.

Never one to miss an opportunity, I decided this adventure was worthy of a quick blog post; it soon turned out to be a series of posts, if I was to do any justice to this strange and wild experiment, through all of its twists and turns. But hopefully it is worth the effort, as it also demonstrates what you can do with Dogen and ODB. And so, here we are.

The first part of the series deals with just trying to convince Oracle to run on Debian Testing - something that obviously Oracle does not support out of the box.

Before we proceed, a word to the wise: this is a highly experimental Oracle setup, which I embarked upon just because I could; follow it at your own peril, and do not use it anywhere near production. More generally, if you want to stick to the beaten track, use Oracle on RHEL, CentOS or - god forbid - even Windows. All of that said, if like me, you are a Debian person, well, there's not much for it other than to fire off a VM and start looking for those damned faint tracks in the jungle.

Alien Worlds

The very first stumbling block was Oracle itself. After all, for all the many years of using this RDBMS at work - more than I care to admit in polite company - I suddenly realised I had actually never used it at home. Of course, Oracle has supported Linux for a little while now; and the good news is they have a "free" version available: Oracle Database Express Edition (XE). A quick glance at the Oracle website revealed RPMs for 64-bit (Intel only, of course). So before anything else, I decided to brush up my knowledge of Alien.

Alien is a Debian package that converts RPMs into DEBs. I've used it in the past for another (lovely) Oracle technology: Java. It had worked wonderfully well then so I thought I'd give it a try. The Ubuntu Alien HowTo is pretty straightforward, and so is Debian's. Basically, obtain Alien:

sudo apt-get install alien

And then apply it to the RPM in question. So the next quest was obtaining those darn RPMs.

Of course, once you move away from the easy world of Free and Open Source Software, things start to get a bit more complicated. Those lovely links you can easily Google for don't actually work until you sign up for an Oracle developer account, which asks for all sorts of personal information. Sadly, even listening to Tesla earnings conferences requires registering these days. Undaunted, I filled in all required fields, obtained my developer account and returned to download my loot. For Oracle Express it's rather straightforward: there is a grand total of one package for Linux 64-bit, so you can't really go wrong. Here's the link, just in case:

Oracle Express: download the 64-bit Linux RPM oracle-xe-11.2.0-1.0.x86_64.rpm.zip.

It is interesting that they decided to zip the RPM, but you can easily unzip it with the unzip tool. The contents are the RPM Alien expects, plus a few Oracle-specific files which I decided to ignore for now:

$ unzip oracle-xe-11.2.0-1.0.x86_64.rpm.zip
Archive:  oracle-xe-11.2.0-1.0.x86_64.rpm.zip
   creating: Disk1/
   creating: Disk1/upgrade/
  inflating: Disk1/upgrade/gen_inst.sql
   creating: Disk1/response/
  inflating: Disk1/response/xe.rsp
  inflating: Disk1/oracle-xe-11.2.0-1.0.x86_64.rpm

From a quick glance at the instructions, it appeared the Oracle Express package contained just the database server - that meant it did not include a command line client, or the APIs to build applications that talk to the database. To be fair, this is not an entirely uncommon approach; Debian also packages the PostgreSQL server separately from the development libraries. But behind apt-get and synaptic, installation of packages is all so trivial. Not so when you have to go through lots of detailed explanations of different packages and variations. But onwards! In the Instant Client page, I settled on downloading the following:

Basic: client shared libraries. Package: oracle-instantclient12.1-basic-12.1.0.2.0-1.x86_64.rpm
SQL Plus: command-line client. Package: oracle-instantclient12.1-sqlplus-12.1.0.2.0-1.x86_64.rpm
SDK: header files to compile code. Package: oracle-instantclient12.1-devel-12.1.0.2.0-1.x86_64.rpm

Update: As it turns out, I was wrong in my original expectations, and you don't really need the SQL Plus package - it's already included with Oracle Express. But I only figured it out much later, so I'll leave the steps as I originally followed them.

With all of these packages in hand, I swiftly got busy with Alien, only to also rather swiftly hit an issue:

$ cd Disk1
$ alien --scripts oracle-xe-11.2.0-1.0.x86_64.rpm
Must run as root to convert to deb format (or you may use fakeroot).

Yes, sadly you cannot run alien directly as an unprivileged user. I did not wish to start reading up on FakeRoot - seems straightforward enough, to be fair, but hey - so I took the easy way out and ran all the Alien commands as root. Note also the --scripts to ensure the scripts will also get converted across. This will bring us some other… interesting issues, shall we say, but seems worthwhile doing.

Quite a few seconds later (hey, it was a 300 MB RPM!), a nice looking DEB was generated:

# alien --scripts oracle-xe-11.2.0-1.0.x86_64.rpm
oracle-xe_11.2.0-2_amd64.deb generated

A rather promising start. For good measure, I repeated the process with all RPMs, all with similar results:

# alien oracle-instantclient12.1-basic-12.1.0.2.0-1.x86_64.rpm
oracle-instantclient12.1-basic_12.1.0.2.0-2_amd64.deb generated

# alien oracle-instantclient12.1-sqlplus-12.1.0.2.0-1.x86_64.rpm
oracle-instantclient12.1-sqlplus_12.1.0.2.0-2_amd64.deb generated

# alien oracle-instantclient12.1-devel-12.1.0.2.0-1.x86_64.rpm
oracle-instantclient12.1-devel_12.1.0.2.0-2_amd64.deb generated

Voilà, all DEBs generated. Of course, as the English love to say, the proof is in the pudding - whatever that means, exactly. So before we can celebrate, we should try to install the generated packages. That can be easily done with our old trusty dpkg:

# dpkg -i oracle-xe_11.2.0-2_amd64.deb
Selecting previously unselected package oracle-xe.
(Reading database ... 564824 files and directories currently installed.)
Preparing to unpack oracle-xe_11.2.0-2_amd64.deb ...
Unpacking oracle-xe (11.2.0-2) ...
Setting up oracle-xe (11.2.0-2) ...
Executing post-install steps...
/var/lib/dpkg/info/oracle-xe.postinst: line 114: /sbin/chkconfig: No such file or directory
You must run '/etc/init.d/oracle-xe configure' as the root user to configure the database.

Processing triggers for libc-bin (2.24-8) ...
Processing triggers for systemd (232-8) ...
Processing triggers for desktop-file-utils (0.23-1) ...
Processing triggers for gnome-menus (3.13.3-8) ...
Processing triggers for mime-support (3.60) ...
Processing triggers for mime-support (3.60) ...

As it turns out, it seems the error for chkconfig is related to setting up the service to autostart. Since this was not a key requirement for my purposes, I decided to ignore it. The remaining RPMs - or should I say DEBs - installed beautifully:

# dpkg -i oracle-instantclient12.1-basic_12.1.0.2.0-2_amd64.deb
Selecting previously unselected package oracle-instantclient12.1-basic.
(Reading database ... 564801 files and directories currently installed.)
Preparing to unpack oracle-instantclient12.1-basic_12.1.0.2.0-2_amd64.deb ...
Unpacking oracle-instantclient12.1-basic (12.1.0.2.0-2) ...
Setting up oracle-instantclient12.1-basic (12.1.0.2.0-2) ...
Processing triggers for libc-bin (2.24-8) ...

# dpkg -i oracle-instantclient12.1-sqlplus_12.1.0.2.0-2_amd64.deb
Selecting previously unselected package oracle-instantclient12.1-sqlplus.
(Reading database ... 567895 files and directories currently installed.)
Preparing to unpack oracle-instantclient12.1-sqlplus_12.1.0.2.0-2_amd64.deb ...
Unpacking oracle-instantclient12.1-sqlplus (12.1.0.2.0-2) ...
Setting up oracle-instantclient12.1-sqlplus (12.1.0.2.0-2) ...

# dpkg -i oracle-instantclient12.1-devel_12.1.0.2.0-2_amd64.deb
Selecting previously unselected package oracle-instantclient12.1-devel.
(Reading database ... 567903 files and directories currently installed.)
Preparing to unpack oracle-instantclient12.1-devel_12.1.0.2.0-2_amd64.deb ...
Unpacking oracle-instantclient12.1-devel (12.1.0.2.0-2) ...
Setting up oracle-instantclient12.1-devel (12.1.0.2.0-2) ...

Talking to the Oracle

So, at this point in time we have a bunch of stuff installed in all sorts of random (read: Oracle-like) locations. The database itself is under /u01/app/oracle/product/11.2.0/, and all the other packages seemed to have gone into /usr/lib/oracle/12.1/client64/ and /usr/include/oracle/12.1/client64/. The first task is now to start the database server. For this we can rely on the scripts we installed earlier on. However, before we proceed, one little spoiler: we need to ensure the scripts can find awk at /bin/awk (these days it lives in /usr/bin/awk). For this we can do a swift (and brutal) hack:

# ln -s /usr/bin/awk /bin/awk

Now we can configure it. I accepted all of the defaults, and set up a suitably sensible password:

# cd /etc/init.d/
# /etc/init.d/oracle-xe configure

Oracle Database 11g Express Edition Configuration
-------------------------------------------------
This will configure on-boot properties of Oracle Database 11g Express
Edition.  The following questions will determine whether the database should
be starting upon system boot, the ports it will use, and the passwords that
will be used for database accounts.  Press <Enter> to accept the defaults.
Ctrl-C will abort.

Specify the HTTP port that will be used for Oracle Application Express [8080]:

Specify a port that will be used for the database listener [1521]:

Specify a password to be used for database accounts.  Note that the same
password will be used for SYS and SYSTEM.  Oracle recommends the use of
different passwords for each database account.  This can be done after
initial configuration:

Confirm the password:


Do you want Oracle Database 11g Express Edition to be started on boot (y/n) [y]:y

Starting Oracle Net Listener...Done
Configuring database...
Starting Oracle Database 11g Express Edition instance...Done
Installation completed successfully.

Notice how your port 8080 has been hogged. If you are using it for other work, you may need to move the Oracle Application Express server to some other port. At any rate, after this I could indeed see a whole load of Oracle processes running:

$ ps -ef | grep oracle
oracle   20228     1  0 22:35 ?        00:00:00 /u01/app/oracle/product/11.2.0/xe/bin/tnslsnr LISTENER -inhe
oracle   21251     1  0 22:36 ?        00:00:00 xe_pmon_XE
oracle   21253     1  0 22:36 ?        00:00:00 xe_psp0_XE
oracle   21257     1  0 22:36 ?        00:00:00 xe_vktm_XE
oracle   21261     1  0 22:36 ?        00:00:00 xe_gen0_XE
oracle   21263     1  0 22:36 ?        00:00:00 xe_diag_XE
oracle   21265     1  0 22:36 ?        00:00:00 xe_dbrm_XE
oracle   21267     1  0 22:36 ?        00:00:00 xe_dia0_XE
oracle   21269     1  0 22:36 ?        00:00:00 xe_mman_XE
oracle   21271     1  0 22:36 ?        00:00:00 xe_dbw0_XE
oracle   21273     1  0 22:36 ?        00:00:00 xe_lgwr_XE
...

To the untrained eye, this seems like a healthy start; but for more details, there are also a bunch of useful logs under the Oracle directories:

# ls -l /u01/app/oracle/product/11.2.0/xe/config/log
total 20
-rw-r--r-- 1 oracle dba 1369 Feb 23 22:36 CloneRmanRestore.log
-rw-r--r-- 1 oracle dba 7377 Feb 23 22:36 cloneDBCreation.log
-rw-r--r-- 1 oracle dba 1278 Feb 23 22:36 postDBCreation.log
-rw-r--r-- 1 oracle dba  227 Feb 23 22:36 postScripts.log

Now, at this point in time, if all had gone according to plan we should be able to connect to our new instance. A typical trick in Oracle is to use tnsping to validate the setup. For this we need to know what to ping, and that is where TNS Names comes in handy:

$ cat /u01/app/oracle/product/11.2.0/xe/network/admin/tnsnames.ora
# tnsnames.ora Network Configuration File:

XE =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = lorenz)(PORT = 1521))
    (CONNECT_DATA =
      (SERVER = DEDICATED)
      (SERVICE_NAME = XE)
    )
  )
...

The magic word is XE (the net service name, i.e. what we will be connecting against). Now we can simply do:

$ . /u01/app/oracle/product/11.2.0/xe/bin/oracle_env.sh
$ tnsping XE

TNS Ping Utility for Linux: Version 11.2.0.2.0 - Production on 23-FEB-2017 22:52:04

Copyright (c) 1997, 2011, Oracle.  All rights reserved.

Used parameter files:


Used TNSNAMES adapter to resolve the alias
Attempting to contact (DESCRIPTION = (ADDRESS = (PROTOCOL = TCP)(HOST = lorenz)(PORT = 1521)) (CONNECT_DATA = (SERVER = DEDICATED) (SERVICE_NAME = XE)))
OK (0 msec)

Success! Note that the first step was to source oracle_env.sh to bring in all the required environment variables of our Oracle setup.
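As an aside, the XE alias is just a shorthand for the connect descriptor that tnsping printed. The sketch below shows how an application could assemble that descriptor itself from a host, a port and a service name, plus the shorter Easy Connect form. The net_service struct and both function names are made up for illustration; they are not part of any Oracle API.

```cpp
#include <cassert>
#include <string>

// Hypothetical description of the database we want to reach; the values
// used in the example mirror the XE entry in tnsnames.ora.
struct net_service {
    std::string host;
    unsigned int port;
    std::string service_name;
};

// Builds the connect descriptor that the XE alias resolves to.
inline std::string make_connect_descriptor(const net_service& s) {
    return "(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=" + s.host +
        ")(PORT=" + std::to_string(s.port) +
        "))(CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=" +
        s.service_name + ")))";
}

// Builds the equivalent Easy Connect string: //host:port/service_name.
inline std::string make_easy_connect(const net_service& s) {
    return "//" + s.host + ":" + std::to_string(s.port) + "/" +
        s.service_name;
}

// Small usage example, using the values from the tnsnames.ora entry.
inline void connect_string_example() {
    const net_service xe{"lorenz", 1521, "XE"};
    assert(make_easy_connect(xe) == "//lorenz:1521/XE");
}
```

Either string can stand in for the alias when no tnsnames.ora is available, for instance sqlplus SYSTEM@//lorenz:1521/XE, assuming the listener is reachable under that host name.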

The final test at this stage is to ensure we can connect with SQL Plus. For this we will just rely on the SYSTEM user.

$ sqlplus SYSTEM@XE

SQL*Plus: Release 11.2.0.2.0 Production on Thu Feb 23 22:56:31 2017

Copyright (c) 1982, 2011, Oracle.  All rights reserved.

Enter password:

Connected to:
Oracle Database 11g Express Edition Release 11.2.0.2.0 - 64bit Production

SQL> select table_name from all_tables where rownum < 4;

TABLE_NAME
------------------------------
ICOL$
CON$
UNDO$

And there you go. We have an absolutely minimal, bare-bones setup of Oracle Express running on Debian Linux. Worth bearing in mind that if you want to make use of SQL Plus from within emacs you must make sure you start emacs from a shell that has all the variables defined in oracle_env.sh.

Conclusions

In this first part we simply set up Oracle Express and the client libraries. We also managed to prove that the setup is vaguely working by connecting to it first at a low level via TNS ping and then at a proper client level using SQL Plus. The next part will wrap things up with the Oracle setup and then move on to ODB.

Created: 2017-03-25 Sat 19:54

Emacs 25.1.1 (Org mode 8.2.10)


Nerd Food: Interesting...

2016-06-17T02:58:00.000-07:00

Nerd Food: Interesting…

Time to flush all those tabs again. Some interesting stuff I bumped into recently-ish.

Finance, Economics, Politics

Understanding Growth, part 1: looks very promising although I've only started parsing it. Also pointed me to - Tomas Sedlacek and the Economics of Good and Evil. Bought the book, but still reading it. Seems very thoughtful.
Here’s How Electric Cars Will Cause the Next Oil Crisis: Extremely interesting take on the relationship between electric cars and the oil price. It's along the lines of articles posted in the past, to be fair, but still. Basically, it won't take a huge number of sales of electric cars to start knocking down the oil price. And with Model 3 coming out, this all seems quite ominous to the oil producing countries. Here we go again, Angola.
Red Hat becomes first $2b open-source company: I may not use their wares any more but RedHat will always be one of my favourite companies. Really happy to see they are growing nicely and hopefully continuing all of their incredible investment on Linux.
The Amazon Tax: Really, really good article about Amazon and their strategy. If you read only one, read this. Amazon is amazing - and its dominance is very worrying because they are so good at executing! See also Bezos letter.
It’s a Tesla: Great article about Tesla. Some of the usual Fanboyism we all know and love, of course, but still a lot of very good points. The core of the article is an interesting comparison between Tesla and Apple. By the by, not at all convinced about that dashboard and the launch ceremony itself was a bit sparse too! But, Model 3 looks great. I'm officially a Stratechery fanboy now.
Google’s Alphabet Transition Has Been Tougher Than A-B-C: Great article on the pains of moving from a single monolithic structure to something more distributed. In truth, what would one expect with such a seismic change? And, also, how come it took Google so long to make this shift? After all, programmers are supposedly taught how important separation of concerns is. The other very interesting point is the CEO difficulties. These guys were able founders (at least able enough to get bought out by Google) but seem to fail badly at the CEO'ing malarkey.

Startups et al.

Venture capital and the internet’s impact: From the same guys as the Amazon post, this is also a very interesting take on VCs and the internet. Highly recommended.
Believe me, you do not want to quit your banking job for a tech unicorn: Stories from the trenches on how Unicorns are not always rosy. Of course, given it comes from "eFinancialCareers", one must assume they are talking their book. Cautionary tale, nonetheless.
Sir Clive Sinclair Revives the ZX Spectrum: so the Spectrum is back! I know I shouldn't - there isn't a single logical reason to back it up - but I just feel like I need to get me one of these…

General Coding

Water treatment plant hacked, chemical mix changed for tap supplies: this is a tad worrying. Can you imagine the number of systems out there with vulnerabilities, many of which are connected to the internet?
On the Impending Crypto Monoculture: Talking about security, very worrying news from the crypto front. It seems our foundations are much less solid than expected - and after all the OpenSSL bugs, this is a surprising statement indeed. Very interesting email on the subject. The LWN article is a must read too.
Neural Networks Demystified - Part 1: Data and Architecture: just started browsing this in my spare time, but it looks very promising. For the layperson.
Microsoft deletes 'teen girl' AI after it became a Hitler-loving sex robot within 24 hours: friggin' hilarious in a funny-not-funny sort of way. This tweet said it best: "Tay" went from "humans are super cool" to full nazi in <24 hrs and I'm not at all concerned about the future of AI. – Gerry
Abandoning Gitflow and GitHub in favour of Gerrit: I've always wanted to know more about Gerrit but never seem to find the time. The article explains it to my required extent, contrasting it with the model I'm more familiar with - GitHub, forks and pull requests. I must say, still not convinced about Gerrit, but having said that, it seems there is definitely scope for some kind of hybrid between the two. A lot of the issues they mention in the article are definitely pain points for GitHub users.
Introducing DGit: OK this one is a puzzling post, from our friends at GitHub engineering. I'm not sure I get it at all, but it seems amazing. Basically, they talk about all the hard work they've done to make git distributed. Fine, I'm jesting - but not totally. The part that leaves no doubts is that GitHub as a whole is a lot more reliable after this work and can handle a lot more traffic - without increasing its hardware requirements. Amazing stuff.

Databases

Citus Unforks From PostgreSQL, Goes Open Source: Great news everyone! Sharding in Postgres just became easier with the open sourcing of Citus! Also worth watching / reading: Interactive Analytics on GitHub Data using PostgreSQL with Citus. This explains in a very understandable way how you will use Citus to shard.
Parallel Aggregate – Getting the most out of your CPUs: The elephant just keeps getting better and better. Improved scaling on multi-CPU for a few scenarios is coming on 9.6.

C++

Compiler Bugs Found When Porting Chromium to VC++ 2015: great tales from the frontline. Also good to hear that MS is really responsive to bug reports. Can't wait to be able to build my C++ 14 code on Windows…
EasyLambda: C++ 14 library for data processing. Based on MPI though. Still, seems like an interesting find.

Layperson Science

The Open Publishing Revolution, Now Behind A Billion-Dollar Paywall: this is very sad news. How science has regressed yet again, now that Mendeley has been bought out. This saga gets worse and worse. On the slightly more positive side: From Crowdfunding To Open Access, Startups Are Experimenting With Academic Research. But will they succeed?
AI & The Future Of Civilization: Very interesting chat with Wolfram. Absurdly long but worth a read.
What is the best way to explain the concept of manifold to a novice?: Bumped into this in Quora. If only we had more of these. We need an entire book of "mathematics for lay people".
Why we’re living in an era of neuroscience hype: One that everyone interested in the field should read. Interesting take on the wave of progress on the neuroscience front.

Other

How a TV Sitcom Triggered the Downfall of Western Civilization: OK, I've got to say that with a click bait title as bad as this, I almost immediately ignored this article. Somehow I went back to it. It's very long and a bit crazy but it's actually very interesting. Friends (the sitcom) as the signal of the end.

Created: 2016-06-17 Fri 10:56

Emacs 24.5.1 (Org mode 8.2.10)


Nerd Food: The Strange Case of the Undefined References

2016-06-16T06:13:00.000-07:00

Nerd Food: The Strange Case of the Undefined References

As a kid, I loved reading Sherlock Holmes and Poirot novels. Each book got me completely spellbound, totally immersed and pretty much unable to do anything else until I finally found out whodunnit. Somehow, the culprits were never the characters I suspected. Debugging and troubleshooting difficult software engineering problems is a lot like the plot of a crime novel: in both cases you are trying to form a mental picture of something that happened, with very incomplete information - the clues; in both cases, experience and attention to detail are crucial, with many a wrong path taken before the final eureka moment; and, in both cases too, there is this overwhelming sense of urgency in figuring out whodunnit. Of course, unlike a crime novel, we'd all prefer not having to deal with these kinds of "interesting" issues, but you don't choose the problems - they choose you.

I recently had to deal with one such problem, which annoyed me to no end until I finally fixed it. It was so annoying I decided it was worth blogging about - if nothing else, it may save other people from the same level of pain and misery.

A bit of context for those that are new here. Dogen is a pet project that I've been maintaining for a few years now. Like many other C++ projects, it relies on the foundational Boost libraries. To be fair, we rely on other stuff as well - libraries such as LibXML2 and so on - but Boost is our core C++ dependency and the only one where latest is greatest, so it tends to cause us the most problems. I've covered my past woes in terms of dependency management and how happy I was to find Conan. And so it was that life was bliss for a number of builds, until one day…

It All Started With a Warning

It was a rainy day and I must have been bored because I noticed a rather innocuous-looking warning on my Travis build, related to Conan:

CMake Warning (dev) in build/output/conanbuildinfo.cmake:
  Syntax Warning in cmake code at
    /home/travis/build/DomainDrivenConsulting/dogen/build/output/conanbuildinfo.cmake:142:88
  Argument not separated from preceding token by whitespace.
Call Stack (most recent call first):
  CMakeLists.txt:30 (include)
This warning is for project developers.  Use -Wno-dev to suppress it.

Little did I know that this simple discovery would lead to a sequence of troublesome events and to many a broken build. I decided to report the problem to the Conan developers who, with their usual promptness, rolled up their sleeves, quickly bounced ideas back and forth and then did a sterling job in spinning fixes until we got to the bottom of the issue. Some of the fixes were to Conan itself, whereas some others were related to rebuilding Boost. In the heat of the investigation, I bumped into some very troubling - and apparently unrelated - linking errors:

/home/travis/.conan/data/Boost/1.60.0/lasote/stable/package/ebdc9c0c0164b54c29125127c75297f6607946c5/lib/libboost_log.so: undefined reference to `std::invalid_argument::invalid_argument(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)@GLIBCXX_3.4.21'
/home/travis/.conan/data/Boost/1.60.0/lasote/stable/package/ebdc9c0c0164b54c29125127c75297f6607946c5/lib/libboost_log.so: undefined reference to `std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::find(char const*, unsigned long, unsigned long) const@GLIBCXX_3.4.21'

The build was littered with errors such as these. But the most puzzling thing was that I had changed nothing of consequence on my side and the Conan guys changed very little at their end too! What on earth was going on?

After quite a lot of thinking, Conan's memsharded came up with a startling conclusion: we've been hit by one of those rare-but-dreadful ABI-transitions! His comment is worth reading in full, but the crux of his findings is as follows (copied verbatim):

Boost packages, generated with travis use docker to manage different versions of gcc, as gcc 5.2 or gcc 5.3

Those docker images are using modern linux distros, e.g. > Ubuntu 15.10

By default, new modern linux distros have switched to the gcc > 5.1 new C++11 ABI, that is libstdc++ is built with gcc > 5.1, usually named libcxx11, as well as the rest of the system. The libcxx11 ABI is incompatible with the old gcc < 5.1 libcxx98 ABI.

Building in such environment links with the new libcxx11 by default.

Now, we move to our user, package consumer environment, which could be an Ubuntu 14.04, or a travis VM (12.04). Those distros use a libcxx98 libstdc++, as a lot of programs of those distros depends on the old libcxx98 ABI. It is not simple to replace it for the new one, requiring to rebuild or reinstall large part of the system and applications. Maybe it could be installed for dev only, and specified in the build, but I have not been able yet.

Reading the above may have given you that sad, sinking feeling: "what on earth is he on about, I just want to compile my code!", "Why oh why is C++ so damn complicated!" and so forth. So, for the benefit of those not in the know, let me try to provide the required background to fully grok memsharded's comment.

What's this ABI Malarkey Again?

This topic may sound oddly familiar to the faithful reader of Nerd Food and with good reason: we did cover ABIs in the distant past, at a slightly lower level. The post in question was On MinGW, Cygwin and Wine and it does provide some useful context to this discussion, but, if you want a TL;DR, it basically dealt with kernel space and user space and with things such as the C library. This time round we will turn our attention to the C++ Standard Library.

In addition to specifying the C++ language, the C++ Standard also defines the API of the C++ Standard Library - the classes and their methods, the functions and so on. The C++ Standard Library is responsible for providing a set of services for applications compiled with a C++ compiler. So far, so similar to the C Standard Library. Where things begin to differ is in the crucial matter of the ABI. But first, let's get a working definition for ABI, just so we are all on the same page. For this, we can do worse than using Linux System Programming:

Whereas an API defines a source interface, an ABI defines the low-level binary interface between two or more pieces of software on a particular architecture. It defines how an application interacts with itself, how an application interacts with the kernel, and how an application interacts with libraries. An ABI ensures binary compatibility, guaranteeing that a piece of object code will function on any system with the same ABI, without requiring recompilation.

ABIs are concerned with issues such as calling conventions, byte ordering, register use, system call invocation, linking, library behavior, and the binary object format. The calling convention, for example, defines how functions are invoked, how arguments are passed to functions, which registers are preserved and which are mangled, and how the caller retrieves the return value.

The second paragraph is especially crucial. You see, although both the C and the C++ Standards are somewhat silent on the matter of specifying an ABI, C tends to have a de facto standard for a given OS on a given architecture. This may not sound like much and you may be saying: "what, wait: the same OS on a different architecture has a different ABI?" Yep, that is indeed the case. If you think about it, it makes perfect sense; after all, C was carefully designed to be equivalent to "portable assembler"; in order to achieve maximum performance, one must not create artificial layers of indirection on top of the hardware but instead expose it as is. So, by the same token, two different C compilers working on the same architecture and OS will tend to agree on the ABI. The reason is that the OS will also follow the hardware where it must, for performance reasons; and where the OS can make choices, it more or less makes the choice for everybody else. For example, until recently, if you were on Windows, it did you no good to compile code into an ELF binary because the law of the land was PE. Things have now changed dramatically, but the general point remains: the OS and the hardware rule.
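To make the "OS and hardware rule" point a little more concrete, here is a minimal sketch that collects a few facts the language standards leave open and the platform ABI pins down. The `sample` struct and the function names are made up for illustration; nothing here comes from Dogen or Conan.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <sstream>
#include <string>

// A struct whose padding and total size are decided by the platform ABI,
// not by the C or C++ standards.
struct sample {
    char tag;
    long value;
};

// Returns true when the least significant byte of an integer is stored first.
bool is_little_endian() {
    const std::uint32_t probe = 1;
    unsigned char first_byte = 0;
    std::memcpy(&first_byte, &probe, 1);
    return first_byte == 1;
}

// Collects a few ABI-determined facts into one human-readable line.
std::string abi_fingerprint() {
    // Whatever the padding rules are, the member must fit inside the struct.
    assert(offsetof(sample, value) + sizeof(long) <= sizeof(sample));

    std::ostringstream os;
    os << "bits-per-char=" << CHAR_BIT
       << " sizeof(long)=" << sizeof(long)
       << " sizeof(void*)=" << sizeof(void*)
       << " sizeof(sample)=" << sizeof(sample)
       << " offsetof(sample,value)=" << offsetof(sample, value)
       << " byte-order=" << (is_little_endian() ? "little" : "big");
    return os.str();
}
```

Printing the result of `abi_fingerprint()` from a `main` on 64-bit Linux and on 64-bit Windows should show the effect: the former uses 8 bytes for `long`, the latter 4, even when both run on the very same hardware.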

C++ inherits much of C's approach to efficiency, so at first blush you may be fooled into thinking it too would have a de facto ABI standard ("for a given OS, " etc. etc.). However, there are a few crucial differences that have grave consequences. Let me point out a few:

C++'s support for genericity - such as function overloading, templates, etc - is implemented by using name mangling; however, each compiler tends to have their own mangling scheme.
implementation details such as the memory layout of objects in the C++ Standard Library - in particular, as we shall see, std::string - are important.
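As a rough illustration of the name mangling point, the sketch below asks the compiler for its encoded name of two overload signatures and, where possible, turns it back into something readable. It assumes GCC or Clang for the demangling step (`abi::__cxa_demangle` from `<cxxabi.h>`); on other compilers it simply returns the encoded name unchanged. The `area` overloads and the helper names are hypothetical. Note that `typeid(...).name()` encodes only the type; the symbol the linker sees also carries the function name, so under the Itanium scheme used by GCC and Clang the two overloads end up as `_Z4areai` and `_Z4areadd`.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>
#include <typeinfo>

#if defined(__GNUG__)
#include <cxxabi.h>  // abi::__cxa_demangle, shipped with the GCC and Clang runtimes
#endif

// Two overloads: same source-level name, different parameter lists.
int area(int side) { return side * side; }
double area(double width, double height) { return width * height; }

// Implementation-defined encoding of a type, as chosen by the compiler.
template <typename T>
std::string encoded_name() {
    return typeid(T).name();
}

// Human-readable form of an encoded name; falls back to the input when no
// demangler is available or the name cannot be decoded.
std::string readable_name(const std::string& encoded) {
#if defined(__GNUG__)
    int status = 0;
    char* text = abi::__cxa_demangle(encoded.c_str(), nullptr, nullptr, &status);
    if (text != nullptr) {
        std::string result(text);
        std::free(text);  // the demangler allocates with malloc
        return result;
    }
#endif
    return encoded;
}
```

Feeding `encoded_name<int(int)>()` and `encoded_name<double(double, double)>()` through `readable_name` shows that overloads are told apart purely by how their signatures are encoded, which is exactly the part each compiler was historically free to do its own way.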

In the past, compiler vendors tended to exacerbate differences such as these; as it was with the UNIX wars, so too during the "C++ wars" did it make sense to be as incompatible as possible in the never ending hunt for monetisation. Thus, ABI specifications were kept internal and were closely guarded secrets. But since then the world has changed. To a large extent, C++ lost the huge amounts of funding it once had during the nineties and part of the noughties, and many vendors either went under or greatly reduced their efforts in this space. Two compilers emerged as victors: MSVC on the Windows platform and - once the dust of the EGCS fork finally settled - GCC everywhere else. The excellent quality of GCC across a vast array of platforms and its strict standards adherence - coupled with a quick response to the standardisation efforts - resulted in total domination outside of Windows. So much so that only recently did it meet a true challenger in Clang. The brave new world in which we now find ourselves is one where C++ ABI standardisation is a real possibility - see Defining a Portable C++ ABI.

But pray forgive the old hand, I digress again. The main point is that, for a given OS on a given architecture, you normally had to compile all your code with a single compiler; if you did that, you were good to go. Granted, GCC never made any official promises to keep its releases ABI-compatible, but in practice we came to rely on the fact that new and old releases interoperated just fine since the days of 3.x. And so did Clang, respecting GCC's ABI so carefully it made us think of them as one happy family. Then, C++11 arrived.

Mixing and Matching

As described in GCC5 and the C++11 ABI, this pleasant state of affairs was too idyllic to last forever:

[…] [S]ome new complexity requirements in the C++11 standard require ABI changes to several standard library classes to satisfy, most notably to std::basic_string and std::list. And since std::basic_string is used widely, much of the standard library is affected.

In hindsight, the improvements in the std::string implementation are great; as a grasshopper, I recall spending hours on end debugging my code in the long forgotten days of EGCS 2.91, only to find out there was a weird bug in the COW implementation for my architecture. That was the first time - and as it happens, the last time too - I found a library bug, and it made a strong impression on me, at that young age. These people were not infallible.

These days I sit much higher up in the C++ stack. Like many, I didn't read the GCC 5 release notes that carefully when they came out, relying as usual on my distro to do the right thing. And, as usual, the distros largely did, even though, unbeknown to many, a stir was happening in their world ¹. But hey, who reads distro blogs, right? Hidden comfortably under my Debian Testing lean-to, I was blissfully unaware of this transition since my code continued to compile just fine. Also, where things start to get hairy is when you need to mix and match compiler versions and build settings - and who in their right mind does that, right?

As it happens, this is a situation in which modern C++ users of Travis may easily find themselves, stuck as they are on either Ubuntu 12.04 (2012) or Ubuntu 14.04 (2014). Nick Sarten's blog post rams the point home in inimitable fashion:

Hold on, did I say GCC 4.6? Clang 3.4? WHAT YEAR IS IT?

Yes, what year is it indeed. So it is that most of us rely on PPAs to bring the C++ environment on Travis up to date, such as the Ubuntu Toolchain:

sudo add-apt-repository -y ppa:ubuntu-toolchain-r/test

This always seemed like an innocent thing to do but after my linking errors and memsharded's discoveries, one suddenly started to question everything: what settings did the PPA use to build? What settings were used to build the Boost Conan packages? With what compiler? In what distro? The nightmare was endless. It was clear this was going to lead to tears before bedtime.

The Long Road to a Solution

Whilst memsharded homed in on the problem pretty quickly - less than a couple of weeks - a complete solution to my woes was a lot more elusive. In truth, this is the kind of situation where you need long spells of concentrated effort, so working in your copious spare time does not help at all. I first tried the easiest approach: to pray that it would all go away by itself, given enough time. And, lo and behold, things did work again, for a little while! And then they started to fail again; the Boost package in Conan got rebuilt and the build broke. And that way it stayed.

Once waiting was no longer an option, I had to take it seriously and started investigating in earnest. Trouble is, when you lose trust in the compilation settings you then need to methodically validate absolutely everything, until you bottom out the problem. And that takes time. Many things were tried, including:

rebuilding Boost locally, attempting to reproduce the issue - to no avail.
rebuilding the Conan Boost packages with the old ABI; a fail (#12).
reading up a variety of articles on the subject, most of them linked in this post.
building the Boost packages locally and exporting them into Travis using DropBox's public folders. Another fail, but DropBox was a win.
obtaining the exact same Ubuntu 14.04 image that Travis uses, building with the compiler from the PPA, exporting Boost to Travis using DropBox and replicating the problem locally in a VM. This worked.

Predictably, the final step is the one I should have tried first, but one is always lazy. Still, all of this got me wondering why things had been so complicated. Normally one would be able to ldd or nm -C the binary and figure out the dependencies, but in this case I seemed to always be pointing to libstdc++.so.6 regardless. Most puzzling. And then I found the Debian wiki page on GCC5, which states:

The good news is, that GCC 5 now provides a stable libcxx11 ABI, and stable support for C++11 (GCC version before 5 called this supported experimental). This required some changes in the libstdc++ ABI, and now libstdc++6 provides a dual ABI, the classic libcxx98 ABI, and the new libcxx11 (GCC 5 (<< 5.1.1-20) only provides the classic libcxx98 ABI). The bad news is that the (experimental) C++11 support in the classic libcxx98 ABI and the new stable libcxx11 ABIs are not compatible, and upstream doesn't provide an upgrade path except for rebuilding. Note that even in the past there were incompatibilities between g++ versions, but not as fundamental ones as found in the g++-5 update to stable C++11 support.

Using different libstdc++ ABIs in the same object or in the same library is allowed, as long as you don't try to pass std::list to something expecting std::__cxx11::list or vice versa. We should rebuild everything with g++-5 (once it is the default). Using g++-4.9 as a fallback won't be possible in many cases.

libstdc++ (>= 5.1.1-20) doesn't change the soname, provides a dual ABI. Existing C++98 binary packages will continue to work. Building these packages using g++-5 is expected to work after build failures are fixed.

The crux is, of course, all the stuff about a dual ABI. I had never bumped into the dual ABI beast before, and now that I did I'm not sure I am entirely pleased. It's probably great when it just works, but it's tricky to troubleshoot when it doesn't: are you linking against a libstdc++ with dual ABI disabled/unsupported? Or is it some other error you've introduced? Personally, having a completely different SO name like memsharded had suggested seems like a less surprising approach - e.g. call it libcxx11 instead of libstdc++. But, as always, one has to play with the cards that were dealt so there is no point in complaining.
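For what it's worth, here is a small sketch of how one might ask the code itself which of the two ABIs a translation unit was built with. `_GLIBCXX_USE_CXX11_ABI` is libstdc++'s own macro; the enum and the function names are mine and purely illustrative, and the second check relies on GCC and Clang embedding the `__cxx11` namespace in the encoded type name, which is an implementation detail rather than a guarantee.

```cpp
#include <cassert>
#include <string>
#include <typeinfo>

// Which flavour of the libstdc++ ABI this translation unit was compiled for.
enum class stdlib_abi { cxx11, legacy, unknown };

// Any libstdc++ header (here <string>) defines _GLIBCXX_USE_CXX11_ABI on
// dual-ABI releases; other standard libraries do not define it at all.
stdlib_abi compiled_abi() {
#if defined(_GLIBCXX_USE_CXX11_ABI)
#  if _GLIBCXX_USE_CXX11_ABI
    return stdlib_abi::cxx11;
#  else
    return stdlib_abi::legacy;
#  endif
#else
    return stdlib_abi::unknown;
#endif
}

// True when std::string's encoded type name mentions the __cxx11 namespace,
// the same marker that shows up in the undefined references above.
bool string_lives_in_cxx11_namespace() {
    const std::string encoded = typeid(std::string).name();
    return encoded.find("__cxx11") != std::string::npos;
}
```

Building the same file with -D_GLIBCXX_USE_CXX11_ABI=0 should flip both answers, provided the libstdc++ in use was configured with dual ABI support; if it was not, the macro has no effect, which is rather the situation one ends up in on older distros.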

Conclusion

The Ubuntu 14.04 build of Boost did get us a green build again, but for all the joyous celebrations, there is still a grey cloud hovering above since the mop-up exercise is not completed. I now need to figure out how to build Boost with Conan on 14.04 and upload this version into the package manager's repo. However, for now carpe diem. After so much unproductive time, there is a real need for a few weeks (months!) of proper coding - the reason why I have a spare time project in the first place. But some lessons were learned.

Firstly, one cannot but feel truly annoyed at ${COSMIC_DEITY} for having to deal with issues such as this. After all, one of the reasons I prefer C++ to the languages I use at work (C# and Java) is that it is usually very transparent; normally I can very quickly reproduce, diagnose and fix a problem in my code. Of course, lord knows this statement is not true of all C++ code, but at least it tends to be valid for most Modern C++ - and over the last five years that's all the C++ I dealt with in anger. It was indeed rather irritating to find out that the pain has not yet been removed from the language, and on occasion, even experienced developers get bitten. Hard.

A second point worthy of note is that in C++ - more so than in any other language - one cannot just blindly trust the package manager. There are just too many configuration knobs and buttons for that to be possible, and one can easily get bitten by assumptions. The sad truth is that even when using Conan, one should probably upload one's own packages built with a well understood configuration. True, this may cost time - but on the other hand, it will avoid wild goose chases such as this one.

Finally, it's also important to note that this whole episode illustrates the sterling job that package maintainers do in distributions. Paradoxically, their work is often so good that we tend to be blissfully unaware of its importance. Articles such as Maintainers Matter take on a heightened sense of urgency after an experience like this.

The road was narrow, long and troublesome. But, as with all Poirot novels, there is always that satisfying feeling of finally finding out whodunnit in the end.

Post Script

There is one final twist to this story, which adds insult to injury and further illustrates ${COSMIC_DEITY}'s sense of humour. When I finally attempted to restore our clang builds, I found out that LLVM has disabled their APT repo for an unspecified length of time:

> TL;DR: APT repo switched off due to excessive load / traffic

There are no alternatives at present to build with a recent clang. Sometimes one has the feeling that the universe does not want to play ball. Stiff upper lip and all that; mustn't grumble.

Footnotes:

¹ For example, see The Case of GCC-5.1 and the Two C++ ABIs to understand Arch's pains.

Created: 2016-06-16 Thu 14:12

Emacs 24.5.1 (Org mode 8.2.10)


Nerd Food: Interesting...

2016-02-08T14:30:00.000-08:00

Nerd Food: Northwind, or Using Dogen with ODB - Part II

What's in a Schema?

When I say "database user", most developers exposed to RDBMSs immediately associate this with a user account. This is of course how most databases work, but obviously not so with Oracle. In Oracle, "users" and "schemas" are conflated, so much so it's hard to tell if there is any difference between them. For the purist RDBMS user, a schema is a schema - a collection of tables and other database objects, effectively a namespace - and a user is a user - a person (real or otherwise) that owns database objects. In Oracle these two more or less map to the same concept. So when you create a user, you have created a schema and you can start adding tables to it; and when you refer to database objects, you prefix them by the user name just as you would if they belonged to a schema. And, of course, you can have users that have no database objects for themselves, but which were granted permission to access database objects from other users.

So our first task is to create two schemas; these are required by the Dogen model which we will use as our "application". They are:

basic
northwind

SQL> create tablespace tbs_01 datafile 'tbs_f01.dbf' size 200M online;

Tablespace created.

SQL> create user basic identified by "PASSWORD" default tablespace tbs_01 quota 100M on tbs_01;

User created.

SQL> create user northwind identified by "PASSWORD" default tablespace tbs_01 quota 100M on tbs_01;

User created.

SQL> GRANT create session TO basic;

Grant succeeded.

SQL> GRANT create table TO basic;

Grant succeeded.

SQL> GRANT create session TO northwind;

Grant succeeded.

SQL> GRANT create table TO northwind;

Grant succeeded.

If all went well, we should now be able to exit the SYSTEM session, start a new one with one of these users, and play with a test table:

$ sqlplus northwind@XE

SQL*Plus: Release 11.2.0.2.0 Production on Fri Feb 24 10:20:10 2017

Copyright (c) 1982, 2011, Oracle.  All rights reserved.

Enter password:

Connected to:
Oracle Database 11g Express Edition Release 11.2.0.2.0 - 64bit Production

SQL> create table test ( name varchar(10) );

Table created.

SQL> insert into test(name) values ('kianda');

1 row created.

SQL> select * from test;

NAME
----------
kianda

SQL> grant select on test to basic;

Grant succeeded.

SQL> Disconnected from Oracle Database 11g Express Edition Release 11.2.0.2.0 - 64bit Production
$ sqlplus basic@XE

SQL*Plus: Release 11.2.0.2.0 Production on Fri Feb 24 10:23:04 2017

Copyright (c) 1982, 2011, Oracle.  All rights reserved.

Enter password:

Connected to:
Oracle Database 11g Express Edition Release 11.2.0.2.0 - 64bit Production

SQL> select * from northwind.test;

NAME
----------
kianda

At this point we consider our Oracle setup completed and we're ready to enter the application world.

Enter ODB

Before we proceed, one warning: when it comes to the libraries, best if you either use your package manager or build from source. You should probably only mix-and-match if you really know what you are doing; if you do and things get tangled up, it may take you a long while to figure out the source of your woes. Note also that this warning applies to the support libraries but not to the ODB compiler itself.

As usual, we start by grabbing all of the packages from the main ODB website:

odb_2.4.0-1_amd64.deb: the ODB compiler itself.
libodb-2.4.0: the main ODB library, required by all backends.
libodb-pgsql-2.4.0: the PostgreSQL backend. We don't need it today, of course, but since PostgreSQL is my DB of choice I always install it.
libodb-oracle-2.4.0: the Oracle backend. We will need this one.
libodb-boost-2.4.0: the ODB boost profile. This allows using boost types in your Dogen model and having ODB do the right thing in terms of ORM mapping. Our Northwind model does not use boost at present, but I intend to change it as soon as possible as this is a very important feature for customers.

Of course, if you are too lazy to click on links, just use wget:

$ mkdir odb
$ cd odb
$ wget http://www.codesynthesis.com/download/odb/2.4/odb_2.4.0-1_amd64.deb -O odb_2.4.0-1_amd64.deb
$ wget http://www.codesynthesis.com/download/odb/2.4/libodb-2.4.0.tar.gz -O libodb-2.4.0.tar.gz
$ wget http://www.codesynthesis.com/download/odb/2.4/libodb-pgsql-2.4.0.tar.gz -O libodb-pgsql-2.4.0.tar.gz
$ wget http://www.codesynthesis.com/download/odb/2.4/libodb-oracle-2.4.0.tar.gz -O libodb-oracle-2.4.0.tar.gz
$ wget http://www.codesynthesis.com/download/odb/2.4/libodb-boost-2.4.0.tar.gz -O libodb-boost-2.4.0.tar.gz

We start with the DEB, as simple as always:

# dpkg -i odb_2.4.0-1_amd64.deb
Selecting previously unselected package odb.
(Reading database ... 549841 files and directories currently installed.)
Preparing to unpack odb_2.4.0-1_amd64.deb ...
Unpacking odb (2.4.0-1) ...
Setting up odb (2.4.0-1) ...
Processing triggers for man-db (2.7.6.1-2) ...

I tend to store locally built software under my home directory, so that's where we'll place the libraries:

$ mkdir ~/local
$ tar -xaf libodb-2.4.0.tar.gz
$ cd libodb-2.4.0/
$ ./configure --prefix=/full/path/to/local
<snip>
$ make -j5
<snip>
make[1]: Leaving directory '/path/to/build/directory/odb/2.4/libodb-2.4.0'
$ make install
<snip>
make[1]: Leaving directory '/path/to/build/directory/odb/2.4/libodb-2.4.0'

If the libraries a profile or backend depends on (libodb and Boost, in the case of libodb-boost) are not installed and you would like to use their build directories instead, you can use the --with-libodb and --with-boost configure options to specify their locations, for example:

./configure --with-boost=/tmp/boost

Since we did run make install on libodb, we instead point configure at our local prefix via the CPPFLAGS and LDFLAGS environment variables:

$ cd ..
$ tar -xaf libodb-boost-2.4.0.tar.gz
$ cd libodb-boost-2.4.0/
$ CPPFLAGS=-I/full/path/to/local/include LDFLAGS=-L/full/path/to/local/lib ./configure --prefix=/full/path/to/local
<snip>
config.status: executing libtool-rpath-patch commands
$ make -j5
<snip>
make[1]: Leaving directory '/path/to/build/directory/odb/2.4/libodb-boost-2.4.0'
$ make install
make[1]: Leaving directory '/path/to/build/directory/odb/2.4/libodb-boost-2.4.0'

For PostgreSQL again I am relying on the header files installed in Debian. The commands are:

$ cd ..
$ tar -xaf libodb-pgsql-2.4.0.tar.gz
$ cd libodb-pgsql-2.4.0/
$ CPPFLAGS=-I/full/path/to/local/include LDFLAGS=-L/full/path/to/local/lib ./configure --prefix=/full/path/to/local
<snip>
config.status: executing libtool-rpath-patch commands
$ make -j5
<snip>
make[1]: Leaving directory '/path/to/build/directory/odb/2.4/libodb-pgsql-2.4.0'
$ make install
<snip>
make[1]: Leaving directory '/path/to/build/directory/odb/2.4/libodb-pgsql-2.4.0'

Finally, the Oracle backend, which also needs to be told where the Oracle client headers and libraries are:

$ cd ..
$ tar -xaf libodb-oracle-2.4.0.tar.gz
$ cd libodb-oracle-2.4.0
$ LD_LIBRARY_PATH=/usr/lib/oracle/12.1/client64/lib CPPFLAGS="-I/full/path/to/local/include -I/usr/include/oracle/12.1/client64" LDFLAGS="-L/full/path/to/local/lib -L/usr/lib/oracle/12.1/client64/lib" ./configure --prefix=/full/path/to/local
<snip>
config.status: executing libtool-rpath-patch commands
$ make -j5
<snip>
make[1]: Leaving directory '/path/to/build/directory/odb/2.4/libodb-oracle-2.4.0'
$ make install
<snip>
make[1]: Leaving directory '/path/to/build/directory/odb/2.4/libodb-oracle-2.4.0'

And there you are; all libraries built and installed into our local directory, ready to be used.

Conclusion

In Part III we will finally start making use of this setup and attempt to connect to the Oracle database. Stay tuned!

Created: 2017-02-24 Fri 12:37


Nerd Food: Tooling in Computational Neuroscience - Part III: Data

2016-02-08T13:28:00.000-08:00

Nerd Food: Tooling in Computational Neuroscience - Part III: Data

In God we trust; all others must bring data. -- W. Edwards Deming

Welcome to yet another instalment in our series of posts about tooling in Computational Neuroscience. Previously, we have discussed simulators - a popular one, in particular - and microscopes. We shall now talk about data in Computational Neuroscience, a seemingly broad and somewhat mundane topic but one which is central to any attempt at understanding the status quo of the discipline. The target audience remains as it was - the lay person - but I'm afraid things are getting increasingly technical.

More Data! We Need More Data!

Computational Neuroscience by itself is not particularly interesting if there are no inputs to the models we carefully craft nor detailed outputs to allow us to know what the models are doing. Similarly, one needs to be able to use experimental data to inform our modeling choices and in order to baseline expectations; if this data is not available, one cannot tell how close or how far models are from the real thing. As everywhere else, data is of crucial importance here; we need lots of it and of many different kinds.

Once you need data, you soon need to worry about data representation: how should information be encoded? Clearly, in order for the data to be useful in a general sense, it must be accompanied by a formal or informal specification or else users will not know how to interpret it. Furthermore, given the highly technical nature of the data in question, the specification must be very precise or the data becomes useless or even dangerous; "Was that in microns or nanometres?" is not the sort of question you want to be asking. In a world where producers and consumers of data can be anywhere geographically, the specification assumes an ever larger degree of importance.
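To make the point about units a little more concrete, here is a minimal sketch in Python of a value that refuses to exist without its unit. The Length type, its field names and the list of supported units are all assumptions made for illustration; none of this comes from an actual Neuroscience specification.

```python
from dataclasses import dataclass

# Size of each supported unit in nanometres. The set of units is an
# assumption made for this example, not part of any real specification.
_NANOMETRES_PER_UNIT = {"nm": 1, "um": 1_000, "mm": 1_000_000}


@dataclass(frozen=True)
class Length:
    """A measurement that cannot be created without stating its unit."""
    value: float
    unit: str

    def __post_init__(self):
        # Reject anything the (hypothetical) specification does not define.
        if self.unit not in _NANOMETRES_PER_UNIT:
            raise ValueError(f"unknown unit: {self.unit!r}")

    def to(self, unit: str) -> "Length":
        """Return the same length expressed in another supported unit."""
        if unit not in _NANOMETRES_PER_UNIT:
            raise ValueError(f"unknown unit: {unit!r}")
        nanometres = self.value * _NANOMETRES_PER_UNIT[self.unit]
        return Length(nanometres / _NANOMETRES_PER_UNIT[unit], unit)


soma_diameter = Length(20.0, "um")   # hypothetical measurement, in microns
in_nm = soma_diameter.to("nm")
print(in_nm)
```

With something along these lines the consumer never has to guess: the unit travels with the number, and an unknown unit fails loudly at the point where the data is read rather than silently later on.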

In summary, it is just not practical to allow everyone to come up with their own data formats:

writing a clear and concise specification for data interchange is hard work, and requires a lot of experience in both the domain and the specification process in general. The first attempts would probably prove to be incomplete, inconsistent or impractical.
writing code to read and write files according to a specification and in multiple programming languages is also demanding engineering work.
writing code to convert from one data specification to another is even more complicated because it requires intimate knowledge of both.
some data is generated directly by hardware, making it impractical to adapt to different requirements.

Another aspect worth highlighting is the "big data" nature of a lot of the data sets used in this field. Anything to do with the brain gets pretty complex pretty quickly, and this manifests itself in the data dimension by having ever larger data sets with greater levels of detail. On the plus side, thanks to Moore's Law sigmoid, detailed information at all levels is allowing us to answer questions that were unanswerable not so long ago. The flip side is that all those details come at a cost: the data sets are becoming huge. For example, the resolution of the data coming out of microscopy is now so high that a single data set can take as much as 500 TB. And of course, not only are individual data sets getting larger and larger, but we are able to generate more of them at an ever increasing pace because the processes are more streamlined. It is a fire-hose of data.
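To get a feel for what 500 TB means in practice, here is a back-of-the-envelope calculation in Python. The link speed is purely an assumption (a fully utilised 1 Gbit/s connection), as is the use of decimal terabytes; only the 500 TB figure comes from the text above.

```python
DATASET_BYTES = 500 * 10**12    # 500 TB, taken as decimal terabytes
LINK_BITS_PER_SECOND = 10**9    # assumed: a fully utilised 1 Gbit/s link

# Time to move the data set once, ignoring protocol overhead and retries.
seconds = DATASET_BYTES * 8 / LINK_BITS_PER_SECOND
days = seconds / 86_400

print(f"{days:.1f} days")  # prints "46.3 days"
```

In other words, under these assumptions merely copying one such data set from a producer to a consumer takes well over a month, before anyone has analysed a single byte.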

All of these difficulties are not unique to Computational Neuroscience or even to Neuroscience as a whole, but the complexity of the domain has the effect of greatly exacerbating an already thorny problem.

Neuroinformatics to the Rescue

If you think we're exaggerating then think again. The management of data in Neuroscience is so complex it is a field in its own right, with the cool-sounding name of Neuroinformatics. Wikipedia tells us that:

Neuroinformatics is a research field concerned with the organization of neuroscience data by the application of computational models and analytical tools. These areas of research are important for the integration and analysis of increasingly large-volume, high-dimensional, and fine-grain experimental data. Neuroinformaticians provide computational tools, mathematical models, and create interoperable databases for clinicians and research scientists.

In layman's terms, Neuroinformatics concerns itself with Neuroscience data and the places where said data is to be stored. It is also implied that one has to deal with a variety of types of data, e.g.: data from experiments (of which there can be many kinds), model inputs, model outputs, the models themselves when viewed as data, etc. The classification of this data is in itself a Neuroinformatics task. Finally, Neuroinformatics is also responsible for the tooling necessary to acquire the data, manipulate it, analyse it, visualise it and so on. Given such a broad definition, one is forced to conclude that there is a big overlap between Computational Neuroscience - the modeling activity - and Neuroinformatics - the management of the data required by it. This lack of clarity is common in science, particularly as new fields develop; take for example Mathematics and Computer Science at the latter's inception.

In truth, such definitions and demarcations are only as useful as the tangible benefits they provide. It is perhaps more fruitful to think of Neuroinformatics as a hat you don as and when your Computational Neuroscience work requires; the definition is there to make one aware of the separation between the analytic work in modeling and the data storage / retrieval work. For the purposes of this article, we'll continue to refer to the "Neuroinformatics Scientist" and the "Computational Neuroscientist" personas, but bear in mind they may resolve to the same person in practice.¹

Before we move on, I'd like to point out another interesting challenge Neuroinformatics has to address, and one that is common to all Medical Sciences: the need to handle human-derived data very carefully. After all, making data sets widely available must not have implications for the original patients, so it's often a requirement that the data is de-identified; in the cases where the data is patient sensitive, additional requirements may be placed on users of the data to avoid leaking this information, such as requiring registration, etc. This illustrates the peculiar nature of Neuroinformatics, with the constant tension between making data as widely available as possible but at the same time having to ensure there are no side-effects of doing so. Presumably, Primum non nocere - first, do no harm.
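As a rough illustration of what de-identification can look like at its very simplest, here is a sketch in Python. The field names, the list of identifying fields and the deidentify function itself are all hypothetical; real data sets need far more care than this, as removing direct identifiers is rarely sufficient on its own.

```python
import hashlib
import hmac

# Fields treated as directly identifying. This list is an assumption made
# for the example; a real policy would be defined by the data custodians.
IDENTIFYING_FIELDS = {"name", "date_of_birth", "address"}


def deidentify(record: dict, secret_key: bytes) -> dict:
    """Return a copy of record without identifying fields.

    The original patient_id is replaced by a keyed hash, so records of the
    same patient can still be linked together, but only the holder of
    secret_key can recompute the mapping.
    """
    cleaned = {
        key: value
        for key, value in record.items()
        if key not in IDENTIFYING_FIELDS and key != "patient_id"
    }
    digest = hmac.new(
        secret_key, str(record["patient_id"]).encode("utf-8"), hashlib.sha256
    )
    cleaned["subject"] = digest.hexdigest()[:16]
    return cleaned


record = {
    "patient_id": 42,                 # hypothetical values throughout
    "name": "Jane Doe",
    "date_of_birth": "1970-01-01",
    "address": "1 Some Street",
    "scan": "fmri-001",
}
print(deidentify(record, secret_key=b"kept-by-the-lab"))
```

The keyed hash is one possible design choice: it keeps the data useful for longitudinal work whilst leaving the link back to the patient in the hands of whoever holds the key.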

Databases, Repositories and Archives

Thanks to the efforts of Neuroinformatics, there is now a wealth of Neuroscience data available to all on the Internet. The seeds of this growth were sown in the nineties when labs started sharing research results online. Sharing always existed in one way or another, of course, but the rise of the Internet simply changed the magnitude of the process. It soon became apparent that there was a need to organise central repositories of data, and to ensure the consistency of the shared data. Papers with a distinct Neuroinformatics tone were written, such as An on-line archive of reconstructed hippocampal neurons (1999). Repositories grew, multiplied, morphed and in many cases died, as these things do, and the evolutionary process left us with the survivors. Some of the ones I have bumped into so far are (with descriptions in their own words):

ModelDB: "ModelDB provides an accessible location for storing and efficiently retrieving computational neuroscience models. ModelDB is tightly coupled with NeuronDB. Models can be coded in any language for any environment. Model code can be viewed before downloading and browsers can be set to auto-launch the models."
NeuronDB: "NeuronDB provides a dynamically searchable database of three types of neuronal properties: voltage gated conductances, neurotransmitter receptors, and neurotransmitter substances. It contains tools that provide for integration of these properties in a given type of neuron and compartment, and for comparison of properties across different types of neurons and compartments."
NeuroMorpho: "NeuroMorpho.Org is a centrally curated inventory of digitally reconstructed neurons associated with peer-reviewed publications. It contains contributions from over 100 laboratories worldwide and is continuously updated as new morphological reconstructions are collected, published, and shared. To date, NeuroMorpho.Org is the largest collection of publicly accessible 3D neuronal reconstructions and associated metadata."
Functional Connectomes Project: "Following the precedent of full unrestricted data sharing, which has become the norm in molecular genetics, the FCP entailed the aggregation and public release (via www.nitrc.org) of over 1200 resting state fMRI (R-fMRI) datasets collected from 33 sites around the world."
OpenfMRI: "[…] project dedicated to the free and open sharing of functional magnetic resonance imaging (fMRI) datasets, including raw data."
Open Source Brain: "resource for sharing and collaboratively developing computational models of neural systems."

As you can see from this small list - rather incomplete, I'm sure - there is a wealth of information out there, covering all sorts of aspects of the brain. We have never had so much data as we do today. And, in many ways, this is fast becoming a problem. As an example, the data sets from Neuroscience's plethora of divisions and sub-fields are not designed to talk to each other: Electron Microscopy (EM) data is disconnected from data obtained by Magnetic Resonance Imaging (MRI), which is also totally separate from connectome information² and so forth. In many cases, these sub-fields have evolved along fairly separate paths, and developed their own technical vocabulary in isolation and over long periods of time - an approach perfectly suitable for a "disconnected" world but less than ideal for a world where multiple sources of data are required to make sense of complex phenomena. If one can't even agree on what to call things, how can one be able to explain them?

Thus, the early Neuroinformatics approach is best described as "evolutionary". It is not as if someone sat down and generated a well defined set of file formats for data interchange, covering all different aspects of the areas under study. Instead, what has been emerging is a multitude of file formats in each sub-field, all calling out for attention, and all of them designed for the immediate goal at hand rather than the greater good of Neuroscience.

Taming the Sea of Data

From a Software Engineering perspective, an evolutionary approach makes perfect sense; after all, the Real Programmers had said: "first make it work, then make it right, and, finally, make it fast." In many ways, we are reaching the "make it right" phase, with an increasing interest in efforts towards the creation of broad standards. There have been several papers and initiatives on the subject, such as the Neuroscience Information Framework, or NIF, described in a paper: The Neuroscience Information Framework: A Data and Knowledge Environment for Neuroscience. The paper outlined a lot of the problems that are hampering research, such as:

the need for specialised search engines that are domain aware, and advanced query tools too;
the need to aid integration and to provide connectivity across related data and findings;
a requirement to provide new and enhanced forms of analysing existing data, as data reuse is extremely important - new insights can be obtained on already existing data, often long after the data was generated, and by using it in ways that were not at all envisioned by the original authors;
the need to make contribution to online repositories easier; lowering the "contribution barrier" is important to increase data availability but must be done in ways that do not compromise the quality of the data;
a requirement to make all code open source such that any lab can make use of it, and the community as a whole can share the maintenance load;
a need for an online repository for all tooling, to avoid reinventing the wheel;
the need to create a multi-domain standard vocabulary.

There are many worthwhile points in this paper, and it is highly recommended to anyone interested in the subject matter. For instance, the section discussing the design of the NIF also covers the requirements for any specification that wishes to solve the problems outlined above. They are worth highlighting as - in my humble and lay opinion - they are very well thought out.

The design of such a framework must combine technical specification choices and broad community support; "open data, access and exchange, via open source and platform, aid Framework-enabled open discover for Neuroscience."
A common framework would reduce costs and enhance benefits of data sharing and knowledge sharing; it would "reduce the cost/benefit ratio for data acquisition and utilization."
The framework must be designed by the broader community and with the needs of this broader community in mind, and it must build upon prior development in Neuroinformatics.
A focus on interoperability is crucial, and it is not a static target but one that must be looked after over time. In addition, there is also a need to keep in mind that different resources have very different interoperability potential. In order to maximise interoperability, we should aim to standardise as much as possible all aspects of the process such as user interfaces, terminologies, formats, etc.

To the untrained eye, the NIF initiative appears to be a great effort to solve fundamental problems in the field. It also seems to have spawned and/or helped popularise many useful and lasting resources such as NeuroMorpho. However, the impression one gets from the outside is that the NIF didn't quite fulfil all of its potential. Having said that, I am keenly looking for up-to-date documents that describe the current status across all of its many aspects - alas, I have not yet succeeded in finding any such document. If indeed it is the case that the initiative petered out, it did highlight a few potential problems for anyone working in this space:

large undertakings are hard to pull off; small, organic, incremental changes are easier to do, but of course, that is why we have the problems we currently have.
large initiatives require large amounts of funding; work is technical and very expensive.
it is not easy to understand NIF's deliverables from looking at their documentation and website. One can clearly see it was an ambitious project, and one which took on the brunt of the problem areas highlighted above, but perhaps it needed a slightly more self-contained view of its achievements rather than an all-or-nothing approach. This would allow preserving some components even whilst others are failing to gain traction.

XML strikes back

Another interesting attempt to tackle these problems is what I call the "XML Suite". This is basically a set of different XML-based standards that are able to interoperate and augment each other, a bit like a stack of building blocks. You can find more details in this paper: XML for Model Specification in Neuroscience. Some of the components of the XML Suite are (with descriptions in their own words, copied from the above paper and a link for more details):

LEMS: "the Low Entropy Model Specification […] is being developed to provide a compact, minimally redundant, human-readable, human-writable, declarative way of expressing models of biological systems. It differs from other systems such as CellML or SBML in its requirement to be human writable and the inclusion of basic physical concepts such as dimensionality and physical nesting as part of the language."
NeuroML: "supports the use of declarative model specifications for neuroscience modeling efforts at different scales, from intracellular mechanisms to networks of reconstructed neurons."
MorphML: "provides a common format for exchange of neuronal morphology data. It can also be used to specify cell structure for modeling efforts as part of NeuroML."
BrainML: "application for representing time series data, spike trains, experimental protocols, and other data relevant to neurophysiology experiments."
SBML: "(Systems Biology Markup Language) is an application for specifying models of biochemical reaction networks such as metabolic networks, cell-signaling pathways and gene regulatory networks."
CellML: "is designed for the specification of biological models of cellular and sub-cellular processes such as calcium dynamics, metabolic pathways, signal transduction, and electrophysiology."
MathML: "provides the means for describing the structure and content of mathematical notation in order to serve, receive, and process mathematics on the web. Other XML applications often use MathML language elements for representing mathematical equations."

A positive aspect of the XML Suite is its "discrete" nature. Each of these file formats is free to evolve in isolation, and the nature of their cooperation is very loose in most cases. For example, MathML is not at all related to Neuroscience and has the support of the Maths community (to some extent). In addition, the "stacking" approach is also a very interesting one, allowing a good domain focus. For example, NeuroML is built on top of LEMS, so in theory each of these should cover different domains and there should be minimal redundancy.
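The "stacking" idea is easier to see with a toy example. The sketch below, in Python, parses a made-up model format that borrows MathML for its equations via XML namespaces. To be clear, the toy-model namespace, its element names and its attributes are all invented for illustration and are not NeuroML or LEMS; only the MathML namespace URI and element names are real.

```python
import xml.etree.ElementTree as ET

MATHML_NS = "http://www.w3.org/1998/Math/MathML"   # the real MathML namespace
MODEL_NS = "http://example.org/toy-model"          # hypothetical format

# A toy document: the outer vocabulary describes the model, whilst the
# equation itself is delegated to MathML (i_leak = g * v).
DOCUMENT = f"""
<model xmlns="{MODEL_NS}" xmlns:m="{MATHML_NS}" name="leak">
  <equation variable="i_leak">
    <m:math>
      <m:apply><m:times/><m:ci>g</m:ci><m:ci>v</m:ci></m:apply>
    </m:math>
  </equation>
</model>
""".strip()

root = ET.fromstring(DOCUMENT)
namespaces = {"toy": MODEL_NS, "m": MATHML_NS}

# Collect, for each equation, the identifiers used in its MathML body.
found = []
for equation in root.findall("toy:equation", namespaces):
    identifiers = [ci.text for ci in equation.findall(".//m:ci", namespaces)]
    found.append((equation.get("variable"), identifiers))

print(found)  # prints [('i_leak', ['g', 'v'])]
```

The outer format never needs to define what an equation is; it simply defers to the layer that already does so, which is the loose kind of cooperation described above.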

The key challenge for the XML Suite is for each of its components to find a sustainable user base and sustainable funding to go along with it. This is a broader problem of Neuroinformatics: researchers do not want to spend time on work that is not contributing directly to their research and so the developer pool to do fundamental work on the file formats is limited. Once the developer pool becomes too limited, the file format ends up with a small user base because it is not fit for purpose, and thus starts a downward spiral. This appears to have been the fate of projects such as BrainML.

Conclusion

This post provided an overview of the data landscape in Computational Neuroscience and introduced the sub-field of Neuroinformatics. We also looked at some of the available data stores and reviewed a few of the more popular initiatives to solve the fundamental data problems in the field.

Stay tuned for the next instalment!

Footnotes:

¹ For a bit more detail on the two fields see What are Computational Neuroscience and Neuroinformatics?

² "A connectome is a comprehensive map of neural connections in the brain, and may be thought of as its 'wiring diagram'." From this page.

Created: 2016-02-08 Mon 21:41


Nerd Food: Interesting...

2016-01-18T04:52:00.000-08:00

Nerd Food: Interesting…

Time to flush all those tabs again. Some interesting stuff I bumped into recently-ish.

Finance, Economics, Politics

Why Big Oil Should Kill Itself: This is a really, really interesting article. The gist of it is that the entire logic around oil exploration is now a fallacy and it makes more economic sense to simply give up looking for oil because all the oil that is left is just too expensive to commercialise. It also has a very interesting take on the valuation of oil companies (and sources of takeovers) but I won't spoil it for you. If you are into oil (or against it), it's a must-read.
Oil Goes Nonlinear: Short but thought provoking. I don't tend to agree with Krugman on a lot of things, but quite like this analysis.
Africa’s Boom Is Over: And the bad news continues. Totally spot-on analysis of what will befall us.
American Spring: interesting take on the state of affairs of American politics. Not sure I agree with everything, but definitely food for thought. "Statistically speaking, what are the odds that the two most qualified candidates to be president out of 300 million people are siblings? Or married?" Indeed.
A Year of Sovereign Defaults?: Very good and very scary. This has to be on the cards, the only question is the timing.
Really rich people are suddenly paying quite a bit more in taxes: some good news on the equality front I guess. But not quite sure it makes much of a difference in the big scheme of US things.
Argentina's 'little trees' getting chopped down by new president: Seems like Argentina is going to go through yet another turbulent period, with some good and bad news coming out. Interesting take on the impact to the less well off of the new policies. The chap is certainly a doer, it seems: A fast start.

Startups et al.

The WTF Economy: Tim O'Reilly (of publishing house fame) is setting up a conference on the future of work. Sounds extremely interesting. Hopefully, they will have a section dedicated to the developing world. Source: Tim O'Reilly (Twitter)
Bitcoin is Being Hot-Wired for Settlement: Garzik is at it again. Interesting news on cryptos on the settlement front. Source: Jeff Garzik (Twitter)
Elon Musk’s Billion-Dollar AI Plan Is About Far More Than Saving the World: So it seems Musk and Altman want to ensure AI plays nice. Not quite sure he's right on this one. Steven Levy's version is available here, more of an interview with Musk.
License to (Not) Drive: Levy gets to try the Google self-driving cars. Very interesting.
Hire Literally Anyone: Extremely interesting. I always thought the existing hiring practices are not very well thought out, but this article makes me realise that the flaws are deeper than I expected. While we are on this topic, this ain't too bad either: How to Hire.
The resolution of the Bitcoin experiment: Great - nay - insanely great analysis on the state of affairs in the BTC world. Spoken with authority.
On the dangers of a blockchain monoculture: Pours more petrol on the raging BTC fire. Very interesting. Never saw BTC as a monoculture, but actually it so is.
The Final Days of the Bitcoin Foundation?: And yet some more on BTC's impending doom. I just gotta stop reading about it now, the whole saga is far too depressing. Let's hope the technology survives where humans failed.
IBM Talks Open Ledger Project, Bright Future for Blockchain: Still trying to catch my breath on all the BTC articles coming out, and lo-and-behold, I missed the whole Open Ledger thing.
Apploitation in a city of instaserfs: Scary. Very scary. Reminiscent of the older Mirani article The secret to the Uber economy is wealth inequality. San Francisco is becoming more like Mumbai and that is not good news.

General Coding

Feeding Graph databases - a third use-case for modern log management platforms: Very interesting ideas on how to use logging data in a graph database. Sounds extremely counter-intuitive, and then you start reading, at which point it's like "Damn, why didn't I think of that before!". Source: Hacker News
Moores law hits the roof: Seems like the exponential function is revealing itself as a sigmoid, as everyone knew it would. Covers some of the cracks that are already present in Moore's law. Interesting to note that a transistor is now only a few silicon atoms wide - meaning we can't really make it much smaller. Source: Hacker News
No, I Don't Want To Configure Your App: Call to arms to get us all thinking on just how many configuration knobs you need to use something. Source: Hacker News
Your IDE Is Killing You: Somewhat preaching to the choir, since I am an Emacs user of old, but still a very cogent argument on why relying too much on IDEs is not a good thing. Source: Bruno Antunes (twitter)
Starters and Maintainers: The different personas around an open source project. Interesting; it's good to be aware of which hat you are wearing when.
I Moved to Linux and It’s Even Better Than I Expected: A feel-good story about the Linux desktop. Given how slowly things are progressing on that front, we all need one of these sometimes to cheer us up. That is the main value of the article, though.

Databases

Encrypted databases with ZeroDB: I'm not exactly impressed with the technology itself, but more with the ideas one can extract from it. Briefly: what if the database only stores encrypted data, which only each client can decrypt? This is certainly a very useful thing for certain types of information and a PostgreSQL extension would be most useful. Source: Hacker News
Introduction to PostgreSQL physical storage: Great article on Postgres low-level details. One to read if you want to get serious about the Elephant but are not yet in the know.
Schema based versioning and deployment for PostgreSQL: Tips on how to manage versions for your stored procs, and also contains links for table management. For those of us not totally taken by NoSQL.

C++

Lessons learnt from 10+ years with actors in C++: The voice of experience talks about what they learned from using Actors over more than a decade. Worth reading if you are into that pattern.
Automating a C++ program from a Node.js web app: If you are considering exposing your C++ code into JS, this is a series of posts to read.
Starting a tech startup with C++: lots of libraries I never heard of and an insight on the performance differences between Python and C++.
Writing modern C++ servers using Wangle: The follow up to the previous post, explaining how to write servers with Facebook technologies.
I want my pony! Or why you cannot have C++ exceptions with a stack trace: very interesting. Since I started using Boost.Exception I never missed the stack traces either. Source: Hacker News

Layman Science

Why String Theory Is Not A Scientific Theory: Doesn't say a lot of new things, but it's good to remind ourselves of what exactly we mean when we say "Science". This would save us from a lot of grief, such as considering Economics as a Science.
The cold fusion horizon: … talking about Science, I was surprised to find out that people are still talking seriously about cold fusion. Interesting article, because it takes the flip side of the Science coin: nothing should be excluded from science as long as it is using the scientific method. Whilst up until now cold fusion has been more of a hoax, we should not discredit people who work on it provided they are following scientific principles. Who knows, they may be right in the end. Science is all about long-shots.

Other

Exit Sandman: Neil Gaiman goes in-depth with Overture, one of 2015's best comics: For the Sandman fans, the new (and last) Sandman book is all the rage. A great interview by the man himself.
Dear Zachary: bumped into this via Wait But Why, and, as usual, great tip. Fantastic documentary.
Solaris: Always wanted to watch this Tarkovsky movie and now it seems it is available online! This is part of an initiative described by Open Culture here. Source: Bruno Antunes (twitter)
Wittgenstein: A Wonderful Life: Found a Wittgenstein documentary, but sadly haven't had time to watch it just yet. In my watch list though.
Je ne suis pas Charlie: Haven't yet watched it but seems thought-provoking. Watch listed.
Tulipa Ruiz - Efêmera - Album Completo: New musical find in the Brazilian space (Portuguese).

Created: 2016-01-18 Mon 12:49


Nerd Food: On Product Backlog

2016-01-17T15:38:00.000-08:00

Nerd Food: On Product Backlog

Would it be good to have a better bug-tracking setup? Yes. But I think it takes man-power, and it would take something *fundamentally* better than bugzilla. -- Linus

Many developers in large companies tend to be exposed to a strange variation of agile which I like to call "Enterprise Grade Agile", but I've also heard it called "Fragile" and, most aptly, "Cargo-Cult Agile". However you decide to name the phenomenon, the gist of it is that these setups contain nearly all of the ceremony of agile - including stand-ups, sprint planning, retrospectives and so on - but none of its spirit. Tweets such as this are great at capturing the essence of the problem:

Top tip: if you need to bring a notepad to the daily stand up to tell us what you did yesterday that's too many details
— Fran Buontempo (@fbuontempo) January 12, 2016

Once you start having that nagging feeling of doing things "because you are told to", and once your stand-ups become more of a status report to the "project manager" and/or "delivery manager" - the existence of which, in itself, is rather worrying - your Cargo Cult Agile alarm bells should start ringing. As I see it, agile is a toolbox with a number of tools, and they only start to add value once you've adapted them to your personal circumstances. The fitness function that determines if a tool should be used is how much value it adds to all (or at least most) of its users. If it does not, the tool must be further adapted or removed altogether. And, crucially, you learn about agile tools by using them and by reflecting on the lessons learned. There is no other way.

This post is one such exercise and the tool I'd like to reflect on is the Product Backlog. Now, before you read through the whole rant, it's probably worth saying that this post takes a slightly narrow and somewhat "advanced" view of agile, with a target audience of those already using it. If you require a more introductory approach, you are probably better off looking at other online resources such as How to learn Scrum in 10 minutes and clean your house in the process. Having said that, I'll try to define terms as best I can to make sure we are all on the same page.

Working Definition

Once your company has grokked the basics of agile and starts to move away from those lengthy specification documents - those that no one reads properly until implementation and those that never specified anything the customer wanted, but everything we thought the customer wanted and then some - you will start to use the product backlog in anger. And that's when you will realise that it is not quite as simple as memorising text books.

So what do the "text books" say? Let's take a fairly typical definition - this one from Scrum:

The agile product backlog in Scrum is a prioritized features list, containing short descriptions of all functionality desired in the product. When applying Scrum, it's not necessary to start a project with a lengthy, upfront effort to document all requirements. Typically, a Scrum team and its product owner begin by writing down everything they can think of for agile backlog prioritization. This agile product backlog is almost always more than enough for a first sprint. The Scrum product backlog is then allowed to grow and change as more is learned about the product and its customers.¹

This is a good working definition, which will suffice for the purposes of this post. It is deceptively simple. However, as always, one must remember Yogi Berra: "In theory, there is no difference between theory and practice. But in practice, there is."

Potemkin Product Backlogs

Many teams finish reading one such definition, find it amazingly inspiring, install the "agile plug-in" on their bug-tracking software of choice and then furiously start typing in those tickets. But if you look closely, you'd be hard-pressed to find any difference between the bug tickets of old versus the "stories" in the new and improved "product backlog" that apparently you are now using.

This is a classic management disconnect, whereby a renaming exercise is applied and suddenly, Potemkin village-style, we are now in with the kool kids and our company suddenly becomes a modern and desirable place to work. But much like Potemkin villages were not designed for real people to live in, so "Potemkin Product Backlogs" are not designed to help you manage the lifecycle of a real product; they are there to give you the appearance of doing said management, for the purposes of reporting to the higher echelons and so that you can tell stakeholders that "their story has been added to the product backlog for prioritisation".

Alas, very soon you will find that the bulk of the "user stories" are nothing but glorified one-liners whose meaning no one can quite recall, and those few elaborately detailed tickets end up rotting because they keep being deprioritised and now describe a world long gone. Soon enough you will find that your sprint planning meetings will cover less and less of the product backlog - after all, who is able to prioritise this mess? Some stories don't even make any sense! The final act is when all stories worked on are stories raised directly on the sprint backlog, and the product backlog is nothing but the dumping ground for the stories that didn't make it on a given sprint. At this stage, the product backlog is in such a terrible mess that no one looks at it, other than for the occasional historic search for valuable details on how a bug was fixed. Eventually the product backlog is zeroed - maybe a dozen or so of the most recent stories make it through the cull - and the entire process begins anew. Alas, enlightenment is never achieved, so you are condemned to repeat this cycle for all eternity.

As expected, the Potemkin Product Backlog adds very little value - in fact it can be argued that it detracts value - but it must be kept because "agile requires a product backlog".

Bug-Trackers: Lessons From History

In order to understand the difficulties with a product backlog, we turn next to their logical predecessors: bug-tracking systems such as Bugzilla or Jira. This post starts with a quote from the kernel's Benevolent Dictator that illustrates the problem with these. Linus has long taken the approach that there is no need for a bug-tracker in kernel development, although he does not object if someone wants to use one for a subsystem. You may think this is a very primitive approach but in some ways it is also a very modern approach, very much in line with agile; if you have a bug-tracking system which is taking time away from developers without providing any value, you should remove the bug-tracking system. In kernel development, there simply is no space for ceremony - or, for that matter, for anything which slows things down².

All of which raises the question: what makes bug-tracking systems so useless? From experience, there are a few factors:

they are a "fire and forget" capture system. Most users only care about entering new data, rather than worrying about the lifecycle of a ticket. Very few places have some kind of "ticket quality control" which ensures that the content of the ticket is vaguely sensible, and those who do suffer from another problem:
they require dedicated teams. By this I don't just mean running the bug-tracking software - which you will most likely have to do in a proprietary shop; I also mean the entire notion of QA and Testing as separate from development, with reams of people dedicated to setting "environments" up (and keeping them up!), organising database restores and other such activities that are incompatible with current best practices of software development.
they are temples of ceremony: a glance at the myriad of fields you need to fill in - and the rules and permutations required to get them exactly right - should be sufficient to put off even the most ardent believer in process. Most developers end up memorising some safe incantation that allows them to get on with life, without understanding the majority of the data they are entering.
as the underlying product ages, you will be faced with the sad graph of software death. The main problem is that resources get taken away from systems as they get older, a phenomenon that manifests itself as a growth in the delta between the number of open tickets and the number of closed tickets. This is actually a really useful metric but one that is often ignored³.
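
The open-versus-closed delta from the last point is cheap to compute. A minimal sketch, assuming a hypothetical export with one line per month in the form `month opened closed`; the function name and the figures are made up purely for illustration:

```shell
#!/usr/bin/env bash
# Sketch: running delta between tickets opened and tickets closed.
# The input format (month opened closed) and the figures are hypothetical.

ticket_delta() {
    # Reads "month opened closed" lines on stdin and prints the running
    # backlog, i.e. cumulative opened minus cumulative closed.
    awk '{ backlog += $2 - $3; print $1, backlog }'
}

ticket_delta <<'EOF'
2015-10 12 11
2015-11 15 9
2015-12 14 6
EOF
```

For the sample data the running backlog goes from 1 to 7 to 15; a second column that keeps growing like this is the sad graph in numeric form.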

And what of the newest iterations on this venerable concept such as GitHub Issues? Well, clearly they solve a number of the problems above - such as lowering the complexity and cost barriers - and certainly they do serve a very useful purpose: they allow the efficient management of user interactions. Every time I create an issue - such as this one - it never ceases to amaze me how easily the information flows within GitHub projects; one can initiate comms with the author(s) or other users with zero setup - something that previously required mailing list membership, opening an account on a bug-tracker and so forth. We now take all of this for granted, of course, but it is important to bear in mind that many open source projects would probably not even have any form of user interaction support, were it not for GitHub. After all, most of them are a one-person shop with very little disposable time, and it makes no sense to spend part of that time maintaining infrastructure for the odd person or two who may drop by to chat.

However, for all of its glory, it is also important to bear in mind that GitHub Issues is not a product backlog solution. What I mean by this is that the product backlog must be owned by the team that owns the product and, as we shall see, it must be carefully groomed if it is to be continually useful. This is at loggerheads with allowing free flow of information from users. Your Issues will eventually be filled up with user requests and questions which you may not want to address, or general discussions which may or may not have a story behind them. They are simply different tools for different jobs, albeit with an overlap in functionality.

So, history tells us what does not work. But is the product backlog even worth all this hassle?

Voyaging Through Strange Seas of Thought

One of the great things about agile is how much it reflects on itself; a strange loop of sorts. Presentations such as Kevlin Henney's The Architecture of Uncertainty are part of this continual process of discovery and understanding, and provide great insights about the fundamental nature of the development process. The product backlog plays - or should play - a crucial role exactly because of this uncertain nature of software development. We can explain this by way of a device.

Imagine that you start off by admitting that you know very little about what it is that you are intending to do and that the problem domain you are about to explore is vast and complex. In this scenario, the product backlog is the sum total of the knowledge gained whilst exploring this space that has not yet been transformed into source code. Think of it like the explorer's maps in the fifteen-hundreds. In those days, "users" knew that much of it was incorrect and a great part was sketchy and ill-defined, but it was all you had. Given that the odds of success were stacked against you, you'd hold that map pretty tightly while the storms were raging about you. Those that made it back would provide corrections and amendments and, over time, the maps eventually converged with the real geography.

The product backlog does something similar, but of course, the space you are exploring does not have a fixed geometry or topography and your knowledge of the problem domain can actively change the domain itself too - an unavoidable consequence of dealing with pure thought stuff. But the general principle applies. Thus, in the same way a code base is precious because it embodies the sum total knowledge of a domain - heck, in many ways it is the sum total knowledge of a domain! - so the product backlog is precious because it captures all the known knowledge of these yet-to-be-explored areas. In this light, you can understand statements such as this:

When your product backlog is empty, your product is dead - @KevlinHenney #agileotb
— Marc Johnson (@marcjohnson) September 4, 2014

So, if the backlog is this important, how should one manage it?

Works For Me, Guv!

Up to this point - whilst we were delving into the problem space - we have been dealing with a fairly general argument, likely applicable to many. Now, as we enter the solution space, I'm afraid I will have to move from the general to the particular and talk only about the specific circumstances of my one-man-project Dogen. You can find Dogen's product backlog here.

This may sound like a bit of a cop out, you may say, and not without reason: how on earth are you supposed to extrapolate conclusions from a one-person open source project to a team of N working on a commercial product? However, it is also important to take into account what I said at the start: agile is what you make of it. I personally think of it as a) the smallest amount of processes required to make your development process work smoothly and b) the continual improvement of those processes. Thus, there are no one-size-fits-all solutions; all one can do is to look at others for ideas. So, let's look at my findings⁴.

The first and most important thing I did to help me manage my product backlog was to use a simple text file in Org Mode notation. Clearly, this is not a setup that is workable for a development team much larger than a set of one, or one that doesn't use Emacs (or Vim). But for my particular circumstances it has worked wonders:

the product backlog is close to the code, so wherever you go, you take it with you. This means you can always search the product backlog and - most importantly - add to it wherever you are and whenever an idea happens to come by. I use this flexibility frequently.
the Org Mode interface makes it really easy to move stories up and down (order is taken to mean priority here) and to create "buckets" of stories according to whatever categorisation you decide to use, up to any level of nesting. At some point you end up converging to a reasonable level of nesting, of course. It is surprising how one can manage very large amounts of stories thanks to this flexible tree structure.
it's trivial to move stories in and out of a sprint, keeping track of all changes to a story - they are just text that can be copy and pasted and committed.
Org Mode provides a very capable tagging system. I first started by overusing these, but when tagging got too fine grained it became unmaintainable. Now we use too few - just epic and story - so this will have to change again in the near future. For example, it should be trivial to add tags for different components in the system or to mark stories as bugs or features, etc. Searching then allows you to see a subset of the stories that match those labels.
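
Because the backlog is plain text, tag searches do not even need Emacs. A minimal sketch, assuming stories are Org headlines with trailing tags such as `:epic:` or `:story:`; the function name, the story titles and the `bug` tag are hypothetical and not taken from Dogen's backlog:

```shell
#!/usr/bin/env bash
# Sketch: list backlog headlines carrying a given Org Mode tag.
# Function name, story titles and the "bug" tag are hypothetical.

stories_with_tag() {
    # $1: Org file, $2: tag name. Org tags trail the headline as :a:b:.
    grep -E "^\*+ .*[[:space:]]:([^[:space:]:]+:)*$2:([^[:space:]:]+:)*[[:space:]]*\$" "$1"
}

backlog="$(mktemp)"
cat > "$backlog" <<'EOF'
* Code generation                                    :epic:
** Add support for composite keys                    :story:
** Fix crash on empty model                          :story:bug:
* Notes without tags
EOF

# Prints only the "Fix crash on empty model" headline.
stories_with_tag "$backlog" bug
rm -f "$backlog"
```

Inside Emacs the same query is available through Org Mode's own tag search, so the script is only useful when Emacs is not at hand.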

A second decision which has proven to be a very good one has been to groom the product backlog very often. And by this I don't just mean a cursory look, but a deep inspection of all stories, fixing them where required. Again, the choice of format has proved very helpful:

it is easy to mark all stories as "non-reviewed" or some other suitable tag in Org Mode, and then unmark them as one finishes the groom - thereby ensuring all stories get some attention. As the product backlog becomes larger, a full groom could take multiple sprints, but this is not an issue once you understand its value and the cost of having it rot.
because the product backlog is with the code, any downtime can be used for grooming; those idle weekends or that long wait at the airport are perfect candidates to get a few stories looked at. Time spent waiting for the build is also a good candidate.
you get an HTML representation of the Org Mode file for free in GitHub, meaning you can read your backlog from your phone. And with the new editing functionality, you can edit stories too.
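
The "mark everything, then unmark as you go" grooming pass can be sketched outside Emacs as well. This is only an illustration: the `unreviewed` tag name, the function names and the sample stories are assumptions, and in practice the same effect is achieved with Org Mode's tagging commands.

```shell
#!/usr/bin/env bash
# Sketch: start a grooming pass by tagging every headline as unreviewed.
# The "unreviewed" tag name and the sample stories are hypothetical.

mark_unreviewed() {
    # Prints the Org file named by $1 with :unreviewed: added to each
    # headline that lacks it, merging into an existing tag group if any.
    awk '
        /^\*+ / && $0 !~ /:unreviewed:/ {
            sub(/[ \t]+$/, "")
            if ($0 ~ /[ \t]:([^ \t:]+:)+$/)
                $0 = $0 "unreviewed:"
            else
                $0 = $0 " :unreviewed:"
        }
        { print }
    ' "$1"
}

remaining() {
    # Counts headlines in $1 still carrying the unreviewed tag.
    grep -c -E '^\*+ .*:unreviewed:' "$1"
}

backlog="$(mktemp)"
cat > "$backlog" <<'EOF'
* Code generation                                    :epic:
** Add support for composite keys                    :story:
** Investigate build times
EOF

mark_unreviewed "$backlog" > "$backlog.new" && mv "$backlog.new" "$backlog"
remaining "$backlog"
rm -f "$backlog"
```

As each story is groomed, deleting `unreviewed` from its tag group takes it off the list, and `remaining` shows how much of the pass is left.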

Thirdly, I decided to take a "multi-pass" approach at managing the story lifecycle. These are some of the key aspects of this lifecycle management:

stories can only be captured if they are aligned with the vision. This filter saves me from adding all sorts of ideas which are just too "out of left field" to be of practical use, but keeps those that may sound crazy but are aligned with the vision.
stories can only be captured if there is no "prior art". I always perform a number of searches in the backlog to look for anything which covers similar ground. If found, I append to that.
new stories tend to start with very little content - just the minimum required to allow resetting state back to the idea I was trying to capture. Due to this, very little gets lost. At this point, we have a "proto-story".
as time progresses, I end up having more ideas on this space, and I update the story with those ideas - mainly bullet points with one liners and links.
at some point the story begins to mature; there is enough on it that we can convert the "proto-story" to a full blown story. After a number of grooms, the story becomes fully formed and is then a candidate to be moved to a sprint backlog for implementation. It may stay in this state ad-infinitum, with periodic updates just to make sure it does not rot.
a candidate story can still get refined: trimmed in scope, re-targeted, or even cancelled because it no longer fits with the current architecture or even the vision. Cancelled stories are important because we may come back to them - it's just very unlikely that we do.
every sprint has a "sprint mission"⁵. When we start to move stories into the sprint backlog, we look for those which resonate with the sprint mission. Not all of them are fully formed, and the work on the sprint can entail the analysis required to create a full blown story. But many will be implementable directly off of the product backlog.
sometimes I end up finding related threads in multiple stories and decide to merge them. Merging of related stories is done by simply copying and pasting them into a single story; over time, with the multiple passes done in the grooms, we end up again with a single consistent story.

What all of this means is that a story can evolve over time in the product backlog, only to become the exact thing you need at a given sprint; at that point you benefit from the knowledge and insight gained over that long period of time. Some stories in Dogen's backlog have been there for years, and when I finally get to them, I find them extremely useful. Remember: they are a map to the unknown space you are exploring.

With all of this machinery in place, we've ended up with a very useful product backlog for Dogen - one that certainly adds a lot of value. Don't get me wrong, the cost of maintenance is high and I'd rather be coding instead of maintaining the product backlog, especially given the limited resources. But I keep it because I can see on a daily basis how much it improves the overall quality of the development process. It is a price I find worth paying, given what I get in return.

Final Thoughts

This post was an attempt to summarise some of the thoughts I've been having on the space of product backlogs. One of its main objectives was to try to convey the importance of this tool, and to provide ideas on how you can improve the management of your own product backlog by discussing the approach I have taken with Dogen.

If you have any suggestions or want to share your own tips on how to manage your product backlog, please reach me in the comments section - there is always space for improvement.

Footnotes:

¹ Source: Scrum Product Backlog, Mountain Goat Software.

² A topic which I covered some time ago here: On Evolutionary Methodology. It is also interesting to see how the kernel processes are organised for speed: How 4.4's patches got to the mainline.

³ Another topic which I also covered here some time ago: On Maintenance.

⁴ I am self-plagiarising a little bit here and rehashing some of the arguments I've used before in Lessons in Incremental Coding, mainly from section DVCS to the Core.

⁵ See the current sprint backlog for an example.

Created: 2016-01-17 Sun 23:55

Emacs 24.5.1 (Org mode 8.2.10)

Nerd Food: Dogen: The Package Management Saga

2015-12-22T06:01:00.000-08:00

We've just gone past Dogen's Sprint 75, so I guess it's time for one of those "reminiscing posts" - something along the lines of what we did for Sprint 50. This one is a bit more practical though; if you are only interested in the practical side, keep scrolling until you see "Conan".

So, package management. Like any other part-time C++ developer whose professional mainstay is C# and Java, I have keenly felt the need for a package manager when in C++-land. The problem is less visible when you are working with mature libraries and dealing with just Linux, due to the huge size of the package repositories and the great tooling built around them. However, things get messier when you start to go cross-platform, and messier still when you are coding on the bleeding edge of C++: either the package you need is not available in the distro's repos or even PPAs; or, when it is, it's rarely at the version you require.

Alas, for all our sins, that's exactly where we were when Dogen got started.

A Spoonful of Dogen History

Dogen sprang to life just a tad after C++-0x became C++-11, so we experienced first hand the highs of a quasi-new-language followed by the lows of feeling the brunt of the bleeding edge pain. For starters, nothing we ever wanted was available out of the box, on any of the platforms we were interested in. Even Debian testing was a bit behind - probably stalled due to a compiler transition or other, but I can't quite recall the details. In those days, Real Programmers were Real Programmers and mice were mice: we had to build and install the C++ compilers ourselves and, even then, C++-11 support was new, a bit flaky and limited. We then had to use those compilers to compile all of the dependencies in C++-11 mode.

The PFH Days

After doing this manually once or twice, it soon stopped being fun. And so we solved this problem by creating the PFH - the Private Filesystem Hierarchy - a gloriously over-ambitious name to describe a set of wrapper scripts that helped with the process of downloading tarballs, unpacking, building and finally installing them into well-defined locations. It worked well enough in the confines of its remit, but we were often outside those, having to apply out-of-tree patches, adding new dependencies and so on. We also didn't use Travis in those days - not even sure it existed, but if it did, the rigmarole of the bleeding edge experience would certainly put a stop to any ideas of using it. So we used a local install of CDash with a number of build agents on OSX, Windows (MinGW) and Linux (32-bit and 64-bit). Things worked beautifully when nothing changed and the setup was stable; but, every time a new version of a library - or god forbid, of a compiler - was released, one had that sense of dread: do I really need to upgrade?

Since one of the main objectives of Dogen was to learn about C++-11, one has to say that the pain was worth it. But all of the moving parts described above were not ideal and they were certainly not the thing you want to be wasting your precious time on when it is very scarce. They were certainly not scalable.

The Good Days and the Bad Days

Things improved slightly for a year or two when distros started to ship C++-11 compliant compilers and recent Boost versions. It was all so good we were able to move over to Travis and ditch almost all of our private infrastructure. For a while things looked really good. However, due to Travis' Ubuntu LTS policy, we were stuck with a rapidly ageing Boost version. At first PPAs were a good solution for this, but soon these became stale too. We also needed to get the latest CMake as there are a lot of developments on that front, but we certainly could not afford (time-wise) to revert to the bad old days of the PFH. At the same time, it made no sense to freeze dependencies in time, providing a worse development experience. So the only route left was to break Travis and hope that some solution would appear. Some alternatives were tried such as Drone.io but nothing was successful.

There was nothing else for it; what was needed was a package manager to manage the development dependencies.

Nuget Hopes Dashed

Having used Nuget in anger for both C# and C++ projects, and given Microsoft's recent change of heart with regards to open source, I was secretly hoping that Nuget would get some traction in the wider C++ world. To recap, Nuget worked well enough in Mono; in addition, C++ support for Windows was added early on. It was somewhat limited and a bit quirky at the start, but it kept on getting better, to the point of usability. Trouble was, their focus was just Visual Studio.

Alas, nothing much ever came from my Nuget hopes. However, there have been a couple of recent announcements from Microsoft that make me think that they will eventually look into this space:

Surely the logical consequence is to be able to manage packages in a consistent way across platforms? We can but hope.

Biicode Comes to the Rescue?

Nuget did not pan out but what did happen was even more unlikely: some crazy-cool Spaniards decided to create a standalone package manager. Being from the same peninsula, I felt compelled to use their wares, and was joyful as they went from strength to strength - including the success of their open source campaign. And I loved the fact that it integrated really well with CMake, and that CLion provided Biicode integration very early on.

However, my biggest problem with Biicode was that it was just too complicated. I don't mean to say the creators of the product didn't have very good reasons for their technical choices - lord knows creating a product is hard enough, so I have nothing but praise for anyone who tries. However, for me personally, I never had the time to understand why Biicode needed its own version of CMake, nor did I want to modify my CMake files too much in order to fit properly with Biicode and so on. Basically, I needed a solution that worked well and required minimal changes at my end. Having been brought up with Maven and Nuget, I just could not understand why there wasn't a simple "packages.xml" file that specified the dependencies and then some non-intrusive CMake support to expose those into the CMake files. As you can see from some of my posts, it just seemed it required "getting" Biicode in order to make use of it, which for me was not an option.

Another thing that annoyed me was the difficulty of knowing what the "real" version of a library was. I wrote, at the time:

One slightly confusing thing about the process of adding dependencies is that there may be more than one page for a given dependency and it is not clear which one is the "best" one. For RapidJson there are three options, presumably from three different Biicode users:

fenix: authored on 2015-Apr-28, v1.0.1.

hithwen: authored 2014-Jul-30

denis: authored 2014-Oct-09

The "fenix" option appeared to be the most up-to-date so I went with that one. However, this illustrates a deeper issue: how do you know you can trust a package? In the ideal setup, the project owners would add Biicode support and that would then be the one true version. However, like any other project, Biicode faces the initial adoption conundrum: people are not going to be willing to spend time adding support for Biicode if there aren't a lot of users of Biicode out there already, but without a large library of dependencies there is nothing to draw users in. In this light, one can understand that it makes sense for Biicode to allow anyone to add new packages as a way to bootstrap their user base; but sooner or later they will face the same issues as all distributions face.

A few features would be helpful in the mean time:

popularity/number of downloads

user ratings

These metrics would help in deciding which package to depend on.

For all these reasons, I never found the time to get Biicode set up and these stories lingered in Dogen's backlog. And the build continued to be red.

Sadly Biicode the company didn't make it either. I feel very sad for the guys behind it, because their heart was in the right place.

Which brings us right up to date.

Enter Conan

When I was a kid, we were all big fans of Conan. No, not the barbarian, the Japanese anime Future Boy Conan. For me the name Conan will always bring back great memories of this show, which we watched in the original Japanese with Portuguese subtitles. So I was secretly pleased when I found conan.io, a new package management system for C++. The guy behind it seems to be one of the original Biicode developers, so a lot of lessons from Biicode were learned.

To cut a short story short, the great news is I managed to add Conan support to Dogen in roughly 3 hours and with very minimal knowledge about Conan. This to me was a litmus test of sorts, because I have very little interest in package management - creating my own product has proven to be challenging enough, so the last thing I need is to divert my energy further. The other interesting thing is that roughly half of that time was taken by trying to get Travis to behave, so it's not quite fair to impute it to Conan.

Setting Up Dogen for Conan

So, what changes did I make to get it all working? It was a very simple 3-step process. First I installed Conan using a Debian package from their site.

I then created a conanfile.txt on my top-level directory:

[requires]
Boost/1.60.0@lasote/stable

[generators]
cmake

Finally I modified my top-level CMakeLists.txt:

# conan support
if(EXISTS "${CMAKE_BINARY_DIR}/conanbuildinfo.cmake")
    message(STATUS "Setting up Conan support.")
    include("${CMAKE_BINARY_DIR}/conanbuildinfo.cmake")
    CONAN_BASIC_SETUP()
else()
    message(STATUS "Conan build file not found, skipping include")
endif()

This means that it is entirely possible to build Dogen without Conan, but if it is present, it will be used. With these two changes, all that was left to do was to build:

$ cd dogen/build/output
$ mkdir gcc-5-conan
$ cd gcc-5-conan
$ conan install ../../..
$ cmake ../../..
$ make -j5 run_all_specs

Et voila, I had a brand spanking new build of Dogen using Conan. Well, actually, not quite. I've omitted a couple of problems that are a bit of a distraction on the Conan success story. Let's look at them now.

Problems and Their Solutions

The first problem was that Boost 1.59 does not appear to have an overridden FindBoost, which means that I was not able to link. I moved to Boost 1.60 - which I wanted to do any way - and it worked out of the box.

The second problem was that Conan seems to get confused with Ninja, my build system of choice. For whatever reason, when I use the Ninja generator, it fails like so:

$ cmake ../../../ -G Ninja
$ ninja -j5
ninja: error: '~/.conan/data/Boost/1.60.0/lasote/stable/package/ebdc9c0c0164b54c29125127c75297f6607946c5/lib/libboost_system.so', needed by 'stage/bin/dogen_utility_spec', missing and no known rule to make it

This is very strange because boost system is clearly available in the Conan download folder. Using make solved this problem. I am going to open a ticket on the Conan GitHub project to investigate this.

The third problem is more Boost related than anything else. Boost Graph has not been as well maintained as it should be, really. Thus users now find themselves carrying patches, and all because no one seems to be able to apply them upstream. Dogen is in this situation as we've hit the issue described here: Compile error with boost.graph 1.56.0 and g++ 4.6.4. Sadly this is still present on Boost 1.60; the patch exists in Trac but remains unapplied (#10382). This is a tad worrying as we make a lot of use of Boost Graph and intend to increase the usage in the future.

At any rate, as you can see, none of the problems were showstoppers, nor can they all be attributed to Conan.

Getting Travis to Behave

Once I got Dogen building locally, I then went on a mission to convince Travis to use it. It was painful, but mainly because of the lag between commits and hitting an error. The core of the changes to my YML file were as follows:

install:
<snip>
  # conan
  - wget https://s3-eu-west-1.amazonaws.com/conanio-production/downloads/conan-ubuntu-64_0_5_0.deb -O conan.deb
  - sudo dpkg -i conan.deb
  - rm conan.deb
<snip>
script:
  - export GIT_REPO="`pwd`"
  - cd ${GIT_REPO}/build
  - mkdir output
  - cd output
  - conan install ${GIT_REPO}
  - hash=`ls ~/.conan/data/Boost/1.60.0/lasote/stable/package/`
  - cd ~/.conan/data/Boost/1.60.0/lasote/stable/package/${hash}/include/
  - sudo patch -p0 < ${GIT_REPO}/patches/boost_1_59_graph.patch
  - cmake ${GIT_REPO} -DWITH_MINIMAL_PACKAGING=on
  - make -j2 run_all_specs
<snip>

I probably should have a bash script by now, given the size of the YML, but hey - if it works. The changes above deal with installation of the package, applying the Boost patch and using Make instead of Ninja. Quite trivial in the end, even though it required a lot of iterations to get there.
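
For what it's worth, such a bash script could look roughly like the sketch below. It only mirrors the `script` section of the YML above and has not been run on Travis: the `run` helper, the `DRY_RUN` switch and the function name are illustrative additions, `patch -i` stands in for the shell redirect, and running the patch step in a subshell assumes that `cmake` is meant to be invoked from `build/output`.

```shell
#!/usr/bin/env bash
# Sketch: the Travis "script" steps above gathered into one bash function.
# run(), DRY_RUN and the function name are illustrative additions.
set -eu

run() {
    # Echo the command, and execute it unless DRY_RUN=1.
    echo "+ $*"
    if [ "${DRY_RUN:-0}" != "1" ]; then
        "$@"
    fi
}

build_with_conan() {
    local git_repo="$1"
    local pkg_root="${HOME}/.conan/data/Boost/1.60.0/lasote/stable/package"
    local hash="<hash>"   # placeholder, only ever shown in dry-run output

    run mkdir -p "${git_repo}/build/output"
    run cd "${git_repo}/build/output"
    run conan install "${git_repo}"

    # Look up the package hash directory, as the YML does with ls.
    if [ "${DRY_RUN:-0}" != "1" ]; then
        hash="$(ls "${pkg_root}")"
    fi

    # Patch Boost Graph in a subshell so the working directory is restored.
    (
        run cd "${pkg_root}/${hash}/include"
        run sudo patch -p0 -i "${git_repo}/patches/boost_1_59_graph.patch"
    )

    run cmake "${git_repo}" -DWITH_MINIMAL_PACKAGING=on
    run make -j2 run_all_specs
}

# Only build when a repository path is given, e.g. from the Travis YML.
if [ -n "${1:-}" ]; then
    build_with_conan "$1"
fi
```

Calling the function with `DRY_RUN=1` prints the commands without executing them, which is a cheap way of checking the sequence locally before paying the commit-to-error lag on Travis.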

Conclusions

Having a red build is very distressing for a developer, so you can imagine how painful it has been to have red builds for several months. So it is with unmitigated pleasure that I got to see build #628 in a shiny emerald green. As far as that goes, it has been a complete success.

In a broader sense though, what can we say about Conan? There are many positives to take home, even at this early stage of Dogen usage:

it is a lot less intrusive than Biicode and easier to set up. Biicode was very well documented, but it was easy to stray from the beaten track and that then required reading a lot of different wiki pages. It seems easier to stay on the beaten track with Conan.
as with Biicode, it seems to provide solutions to Debug/Release and to multiple platforms and compilers. We shall be testing it on Windows soon and reporting back.
hopefully, since it started Open Source from the beginning, it will form a community of developers around the source with the know-how required to maintain it. It would also be great to see if a business forms around it, since someone will have to pay the cloud bill.

In terms of negatives:

I still believe the most scalable approach would have been to extend Nuget for the C++ Linux use case, since Microsoft is willing to take patches and since they foot the bill for the public repo. However, I can understand why one would prefer to have total control over the solution rather than depend on the whims of some middle-manager in order to commit.
it seems publishing packages requires getting down into Python. Haven't tried it yet, but I'm hoping it will be made as easy as importing packages with a simple text file. The more complexity around these flows the tool adds, the less likely they are to be used.
there still are no "official builds" from projects. As explained above, this is a chicken and egg problem, because people are only willing to dedicate time to it once there are enough users complaining. Having said that, since Conan is easy to set up, one hopes to see some adoption in the near future.
even when using a GitHub profile, one still has to define a Conan specific password. This was not required with Biicode. Minor pain, but still, if they want to increase traction, this is probably an unnecessary stumbling block. It was sufficient to make me think twice about setting up a login, for one.

In truth, these are all very minor negative points, but still worth making. All in all, I am quite pleased with Conan thus far.

Created: 2015-12-22 Tue 14:00

Emacs 24.5.1 (Org mode 8.2.10)

Nerd Food: Interesting...

2015-12-21T15:32:00.000-08:00

Nerd Food: Interesting…

Time to flush all those tabs again. Some interesting stuff I bumped into recently-ish.

Finance, Economics, Politics

A (not so) brief history of the fall and fall of the Nigerian naira: Very good read for Angolans; if nothing else, it makes us understand that our precious Kwanza behaves in many ways like any other petro-currency. Source: King Alfred (twitter).
#ThisIsACoup - Episode 1- "Angela, suck our balls": A rather political take on the recent-ish financial mess in Greece. In a similar vein, BBC's A Greek Drama is worth a listen.

Startups et al.

Brazilian Judge Shuts Down WhatsApp And Brazil’s Congress Wants To Shut Down The Social Web Next: One of the most enlightened internet countries decides to shut it all down. Sad day for the Internet and for all Portuguese speakers. Source: Hacker News (twitter)
Bitcoin’s Creator Satoshi Nakamoto Is Probably This Unknown Australian Genius: So they found Satoshi (again). I hesitated to add this link, to be totally honest - there have been far too many fakes to recount and the whole process is such a media circus that it's best avoided altogether. But after reading it - questionable media behaviour notwithstanding - it does appear to provide some insights into the early days of bitcoin. Useful to anyone who likes BTC. There is also the Gizmodo report, with additional evidence. This is all getting a bit too much for my liking though.
Jolla is back in business!: Good to hear Jolla is still going. Now that my Firefox OS phone is no longer supported, I am keen on getting a Jolla. Source: Hacker News (twitter)
Tech and Banking Giants Join Forces with the Linux Foundation to Create New Open Source Blockchain 'Hyperledger': In truth, hard not to be sceptical - even though it's coming from the Linux Foundation. I guess - in this world of scalability wars - this must come as good news. However, I still think there is a lot of misunderstanding around Bitcoin and the Blockchain, and there are far too many "AOLs" out there trying to create their gated communities, failing to understand history (again). Not quite sure on which side of the fence to place this initiative but, alas, I'm more inclined towards the AOL side.

General Coding

Yahoo’s Engineers Move to Coding Without a Net: How removing a testing team can help reduce the bug count and ramp up productivity. Source: Hacker News (twitter)
Move Fast and Fix Things: An incredible tale of real engineering from the GitHub guys with lots of takeaways - Scientist is a pretty neat idea, for one. Worth a read and a re-read. Logically related to the previous article. Source: Hacker News (twitter)
The Jacob’s Ladder of coding: Reminiscences on our beloved profession of coding. Long and deep, so still parsing.

Databases

What's new in PostgreSQL 9.5: The RCs are starting and 9.5 looks to continue the trend of amazing Postgres releases. The one wish still missing is native (and full) support for bitemporality, though to be fair Temporal Tables is probably enough for my needs.

C++

Optimizing software in C++: One to bookmark now but to digest later. A whole load of stuff on optimisation.
Support for Android CMake projects in Visual Studio: So, as if the latest patches to Clang hadn't been enough, MS now decides to add support for CMake in Visual Studio. A bit embryonic, and a bit too Android-focused, but surely it should be extensible for more regular C++ use. What's going on at MS? This is all far too cool to be true.
Quickly Loading Things From Disk: interesting analysis about the state of affairs of serialisation in C++. I'll probably require a few passes to fully digest it.
Beyond ad-hoc automation: leveraging structured platforms: I've been consuming this presentation slowly but steadily. It deals with a lot of the questions we all have about the new world of containers and microservices, and it seems vital to learn from experience before one finds oneself in a much bigger mess than the monolith could ever get you into. Bridget Kromhout talks intelligently about the subject.

Layman Science

The Church of D-Wave: So is D-Wave a quantum computer or not? It appears the verdict is "not", even with the 2X and the Google paper.
Intelligence and the Brain: Oldish but still very good and relevant. Another high-level introduction to HTM.
NASA probe shows how solar burps may have stripped Mars of water: How the sun could be responsible for stripping water away from the red planet.
Artificial Intelligence Through Hierarchical Temporal Memory: Continuing my adventures in the HTM space, Dr. Paul Cottrell is my latest find. I'm still not totally sure I understand all the concepts in this video but what I do understand - assuming they have succeeded in doing what he describes - seems mondo-cool. Basically, it's all about the application of HTM to Finance and trading. He also introduces the idea of adding sub-cortical machinery to HTM (which is just cortical); a most puzzling concept. Once I finish parsing this video, I intend to move to Neuroscience Foundation For Artificial Intelligence.

Other

Benjamin Clementine - Le Ring - Live: Haven't totally made up my mind about Benjamin Clementine, but certainly a very interesting performance.

Created: 2015-12-21 Mon 23:31

Nerd Food: Pull Request Driven Development

2015-12-11T05:13:00.000-08:00

Nerd Food: Pull Request Driven Development

Having been in this game for the best part of twenty years, I must confess that it's not often I find something that revolutionises my coding ways. I do tend to try a lot of things, but most of them end up revealing themselves as fads or are incompatible with my flow. For instance, I never managed to get BDD to work for me, try as I might. I will keep trying because it sounds really useful, but it hasn't clicked just yet.

Having said all of that, these moments of enlightenment do occasionally happen, and when they do, nothing beats that life-changing feeling. "Pull Request Driven Development" (or PRDD) is my latest find. I'll start by confessing that "PRDD" as a name was totally made up for this post and hopefully you can see it's rather tongue in cheek. However, the benefits of this approach are very real. In fact, I've been using PRDD for a while now but I just never really noticed its presence creeping in. Today, as I introduced a new developer to the process, I finally had the eureka moment and saw just how brilliant it has been thus far. It also made me realise that some people are not aware of this great tool in the developer's arsenal.

But first things first. In order to explain what I mean by PRDD, I need to provide a bit of context. Everyone is migrating to git these days, even those of us locked behind corporate walls; in our particular case, the migration path implied exposure to Git Stash. For those not in the know, picture it as an expensive and somewhat less featureful version of GitHub, but with most of the core functionality there. Of course, I'm sure GitHub is not that cheap for enterprises either, but hey, at least it's the tool everyone uses. Anyway - grumbling or not - we moved to Stash and all development started to revolve around Pull Requests (PRs), raised for each new feature.

Not long after PRs were introduced, a particularly interesting habit started to appear: developers began opening the PRs earlier and earlier during the feature cycle rather than waiting until the very end. Taking this approach to the limit, the idea is that when you start to work on a new feature, you raise the ticket and the PR before you write any code at all. In practice - due to Stash's anachronisms - you need to push at least one commit, but the general notion is valid. This was never mandated anywhere, and there was no particular coordination. I guess one possible explanation for this behaviour is that one wants to get rid of the paperwork as quickly as possible to get to the coding. At any rate, the causes may be obscure but the emerging behaviour was not.

When you combine early PRs with the commit early and commit often approach - which you should be using anyway - the PR starts to become a living document; people see your development work as it progresses and they start commenting on it and possibly even sending you patches as you go along. In a way, this is an enabler for a very efficient kind of peer programming - particularly if you have a tightly knit team - because it gives you maximum parallelism but in a very subtle, non-noticeable way. The main author of the PR is coding as she would normally be, but whenever there is a lull in development - those moments where you'd be browsing the web for five minutes or so - you can quickly check for any comments on your PR and react to those. Similarly, other developers can carry on doing their own work and browse the PRs on their downtime; this allows them to provide feedback whenever it is convenient to them, and to choose the format of the feedback - lengthy or quick, as time permits.

Quick feedback is many a time invaluable in large code bases because everyone tends to know their own little corner of the code and only very few old hands know how it all hangs together. Thus, seemingly trivial one liners such as "have you considered using API xyz instead of rolling your own" or "don't forget to do abc when you do that" could save you many hours of pain and enable knowledge to be transferred organically - something that no number of wiki pages could hope to achieve in a million years because it's very difficult to find these pearls in a sea of uncurated content. And because you committed early and often, each commit is very small and very easy to parse in a small interval of time, so people are much more willing to review - as opposed to that several KB (or even MB!) patch that you will have to allocate a day or two for. Further: if you take your commit messages seriously - as, again, you should - you will find that the number of reviewers will grow rapidly simply because developers are nosy and opinionated.

Note that this review process involves no vague meetings and no lengthy and unfocused email chains; it is very high-quality because it is (or can be) focused on specific lines of code; it causes no unwanted disruptions because you review where and when you choose to review; reviewers can provide examples and even fix things themselves if they so choose; it is totally inclusive because anyone who wants to participate can, but no one is forced to; and it equalises local and remote developers because they all have access to the same data (modulo some IRL conversations that always take place) - an important feature in this world of near-shoring, off-shoring and home-working. Most importantly, instead of finding out some fundamental errors of approach at the end of an intense period of coding, you now have timely feedback. This saves an enormous amount of time - an advantage that anyone who has been through lengthy code reviews and then spent a week or two reacting to that feedback can appreciate.

I am now a believer in PRDD. So much so that whenever I go back to work on legacy projects in svn, I find myself cringing all the way to the end of the feature. It just feels so nineties.

Update: As I finished penning this post and started reflecting on it, it suddenly dawned on me that a lot of things we now take for granted are only possible because of git. And I don't mean DVCSs, I specifically mean git. For example, PRDD is made possible to a large extent because committing in git is a reversible process and history can be fluid if required. This means that people are not afraid of committing, which in turn enables a lot of the goodness I described above. Many DVCSs didn't like this way of viewing history - and to be fair, I know of very few people that liked the idea until they started using it. Once you figure out what it is good for (and not so good for), it suddenly becomes an amazing tool. Git is full of little decisions like this that at first sight look either straight insane or just not particularly useful but then turn out to change entire development flows.

Created: 2015-12-11 Fri 13:12

Nerd Food: Interesting...

2015-12-09T04:50:00.000-08:00

Nerd Food: Interesting…

Time to flush all those tabs again. Some interesting stuff I bumped into recently-ish.

Finance, Economics, Politics

The Color of Debt: How Collection Suits Squeeze Black Neighborhoods: Another great example of how markets are not so efficient for certain things and not exactly fair.
Pulled back in: The Economist's take on how the emerging markets credit bubble will play out. Not sure if I agree with their analysis, but it's certainly very worrying to see so much EM debt piling up in such a volatile world.
ISIL: Who’s Calling the Shots?: Interesting analysis, much better than the usual superficial take one is used to from mass-media.
The Other France: As with the previous article, I cannot help but be surprised - even more so with this one. Truly, an amazing, in-depth job. Surprisingly good coming from the American media. If you have watched La Haine, read this. If you have read this, watch La Haine.
Elon Musk talks Climate Change and Carbon Tax at the Sorbone (12.2.15): Musk raises some interesting points, as usual, such as why we need to tax carbon or else.
'My father had one job in his life, I've had six in mine, my kids will have six at the same time': The Guardian's take in this new world of job displacement and "job context switching". Very interesting.
Ship it! QuantLib, IPython Notebook, and Docker: QuantLib conference is over, and sadly there are very few videos. This one bucks the trend. The ever informative Luigi talks about how QuantLib is moving with the times.
Morgan Stanley axes 400 bankers as bond-trading income dives: The contraction of the traditional banking industry continues, even as cryptos are growing insanely.

Startups et al.

Tesla is copying Apple's business model: very interesting comparison between Tesla and Apple's businesses. I don't fully agree with the article, but to be fair it does raise a number of interesting points. I definitely think that when Tesla can deliver mass-market quantities they will dominate sales in a similar fashion to the iPhone.
ARM: Britain's most successful tech company you've never heard of: Short history of ARM. It would be great to have a book about these guys!
DoorDash Wants to Own the Last Mile: interesting story of a startup that focuses on "last mile" delivery.
BitPesa: cool African start-up in the BitCoin / MPesa space.
LulaLend: Another cool African start-up that is doing well in the payments space.
Elon Musk and Y Combinator President on Thinking for the Future: Altman and Musk discuss the future. Shame the presenter is not a bit geekier or it could have been one of the best.
Elon Musk with his Brother Kimbal Musk on a panel: Since we're doing the Musk fanboy thing, here's a great panel with Elon and his brother. A more personal view of his achievements.
Jeff Bezos vs. Elon Musk: A Thrilling, New Space Race: More Musk fanboying; let's go all the way and read up on the latest about the space race. Very interesting.
Tesla Shareholders Meeting June 2015: Final Musk fanboying. I think Tesla is one of the few companies where non-shareholders tune in just to listen and get inspiration. Elon, nerdy and awkward but great and inspiring as always. Choice quote: "I'd expect SpaceX to go public once we get regular flights to mars." - very few people could get away with a statement like that.

General Coding

Gene Amdahl, Pioneer of Mainframe Computing, Dies at 92: I've heard the name a lot but never really read about the man.
Why you should understand (a little) about TCP: The new generation discovers the joys of understanding low-level protocols. And Nagle (yes, he of Nagle Algorithm fame) replies on that thread.
systemd.conf: Videos from the conference. Have watched a couple, seemed like a lively conference. Hard to imagine an init system with its own conference though!

Databases

When are we going to contribute BDR to PostgreSQL?: For those (like me) who keep moaning about the lack of BDR in Postgres, a great explanation of how the patchset is being merged. Great work by the 2nd Quadrant guys.

C++

New ELF Linker from the LLVM Project: LLVM keeps on delivering! Now a new ELF linker. To be totally honest, I haven't even started using Gold in anger - I get the feeling the LLVM linker is going to be transitioned in much quicker than Gold.
Clang with Microsoft CodeGen in VS 2015 Update 1: OMG, OMG how cool is this - MSFT decided to create a backend for Clang that is totally compatible with MSVC AND open source it! This is just insane. This means for example that you now can develop C++ on Windows without ever having to use MSVC and Visual Studio. It also means you can cross-compile from Linux into Windows with 100% certainty things will work. It means that projects like Wine and ReactOS can start thinking about a migration path into Clang (not quite as simple as it may sound but surely makes sense). CLion with Clang on Windows will rock. The possibilities are just endless. I never quite understood what C2 was all about until I read this announcement - suddenly it all makes sense. This is fantastic news.

Layman Science

Jeff Hawkins on Firing Up the Silicon Brain: OK, let me be totally honest: I love Jeff Hawkins. I read On Intelligence far too many times to count and would be lying if I didn't admit that it had a little bit to do with my forays into Computational Neuroscience. So as you can imagine, I'm rather excited about HTM and Numenta's latest developments. This article is a good catch-up, if slightly high-level. If you want something slightly more technical but still very approachable, Principles of Hierarchical Temporal Memory (HTM): Foundations of Machine Intelligence is a must watch.

Other

NoiseRV Live: Still discovering this Portuguese musician, but love his work. Great concert. Could do with a little less talking between songs, but still - artist's prerogative and all that.
Warm Focus: Winging It: Interesting set of "intelligent dance music" as we used to call it back in the day.
Mosaic - The “First” Web Browser: Super-cool podcasts about internet history. It would be great to have something like this for UNIX!
Jackson C. Frank (1965): Tragic musician from the 60s. Great tunes.
Reason in common sense: Always wanted to read Santayana properly. Started, but I guess it will be a very long exercise. Interesting, if somewhat strange book.
Ceu - jazz baltica Live (2010): New find, Brazilian musician Ceu.

Created: 2015-12-09 Wed 12:49

Nerd Food: Tooling in Computational Neuroscience - Part II: Microscopy

2015-11-30T15:12:00.000-08:00

Nerd Food: Tooling in Computational Neuroscience - Part II: Microscopy

Research is what I'm doing when I don't know what I'm doing.
Wernher von Braun

Welcome to the second instalment of our second series on Computational Neuroscience for lay people. You can find the first post of the previous series here, and the first post of the current series here. As you'd expect, this second series is slightly more advanced, and, as such, it is peppered with unavoidable technical jargon. Having said that, we shall continue to pursue our ambitious target of making things as easy to parse as possible (but no easier). If you read the first series, the second should hopefully make some sense.¹

Our last post discussed Computational Neuroscience as a discipline, and the kind of things one may want to do in this field. We also spoke about models and their composition, and the desirable properties of a platform that runs simulations of said models. However, it occurred to me that we should probably build some kind of "end-to-end" understanding; that is, by starting with the simulations and models we are missing a vital link with the physical (i.e. non-computational) world. To put matters right, this part attempts to provide a high-level introduction on how data is acquired from the real world and can then be used - amongst other things - to inform the modeling process.

Macro and Micro Microworlds

For the purposes of this post, the data gathering process starts with the microscope. Of course, keep in mind that we are focusing only on the morphology at present - the shape and the structures that make up the neuron - so we are ignoring other important activities in the lab. For instance, one can conduct experiments to measure voltage in a neuron, and these measurements provide data for the functional aspects of the model. Alas, we will skip these for now, with the promise of returning to them at a later date².

So, microscopes then. Microscopy is the technical name for the observation work done with the microscope. Because neurons are so small - some 4 to 100 microns in size - only certain types of microscopes are suitable to perform neuronal microscopy. To make matters worse, the sub-structures inside the neuron are an important area of study and they can be ridiculously small: dendritic spines - the minute protrusions that come out of the dendrites - can be as tiny as 500 nanometres; the lipid bilayer itself is only 2 or 3 nanometres thick, so you can imagine how incredibly small ion channels and pumps are. Yet these are some of the things we want to observe and measure. Let's call this the "micro" work. On the other hand, we also want to understand connectivity and other larger structures, as well as perform observations of the evolution of the cell and so on. Let's call this the "macro" work. These are not technical terms, by the by, just so we can orient ourselves. So, how does one go about observing these differently sized microworlds?

Figure 1: Example of measurements one may want to perform on a dendrite. Source: Reversal of long-term dendritic spine alterations in Alzheimer disease models

Optical Microscopy

The "macro" work is usually done using the Optical "family" of microscopes, which is what most of us think of when hearing the word microscope. As it was with Van Leeuwenhoek's tool in the sixteen hundreds, so it is that today's optical microscopes still rely on light and lenses to perform observations. Needless to say, things did evolve a fair bit since then, but standard optical microscopy has not completely removed the shackles of its limitations. These are of three kinds, as Wikipedia helpfully tells us: a) the objects we want to observe must be dark or strongly refracting - a problem, since the internal structures of the cell are transparent; b) visible light's diffraction limit means that we cannot go much lower than 200 nanometres - pretty impressive, but unfortunately not quite low enough for detailed sub-structure analysis; and c) out of focus light hampers image clarity.

Workarounds to these limitations have been found in the guise of techniques, with the aim of augmenting the abilities of standard optical microscopy. There are many of these techniques. There is Confocal Microscopy³, which improves resolution and contrast; the Fluorescence microscope, which uses a sub-diffraction technique to reconstruct some of the detail that is missing due to diffraction; or the incredible-looking movies produced by Multiphoton Microscopy. And of course, it is possible to combine multiple techniques in a single microscope, as is the case with the Multiphoton Fluorescence Microscopes (MTMs) and many others.

In fact, given all of these developments, it seems there is no sign of optical microscopy dying out. Presumably some of this is due to the relatively lower cost of this approach as well as to its ease of use. In addition, optical microscopy is complementary to the other more expensive types of microscopes; it is the perfect tool for "macro" work that can then help to point out where to do "micro" work. For example, you can use an optical microscope to assess the larger structures and see how they evolve over time, and eventually decide on specific areas that require more detailed analysis. And when you do, you need a completely different kind of microscope.

Electron Microscopy

When you need really high resolution, there is only one tool to turn to: the Electron Microscope (EM). This crazy critter can provide insane levels of magnification by using a beam of electrons instead of visible light. Just how insane, you ask? Well, if you think that an optical microscope lives in the range of 1500x to 2000x - that is, can magnify a sample up to two thousand times - an EM can magnify as much as 10 million times, and provide a sub-nanometre resolution⁴. It is mind boggling. In fact, we've already seen images of atoms using EM in part II, but perhaps it wasn't easy to appreciate just how amazing a feat that is.
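To make the orders of magnitude a little more tangible, here is a minimal sketch that compares the structure sizes quoted earlier in this post against rough resolution limits. The 200 nanometre figure is the optical diffraction limit mentioned above; the 0.5 nanometre value is just an assumed stand-in for "sub-nanometre", and the function name is made up for illustration.

```python
# Sizes quoted in the post, converted to nanometres.
STRUCTURES_NM = {
    "small neuron": 4_000,      # "some 4 to 100 microns"
    "dendritic spine": 500,     # "as tiny as 500 nanometres"
    "lipid bilayer": 2,         # "2 or 3 nanometres thick"
}

# Approximate resolution limits. The optical figure is the diffraction
# limit from the post; 0.5 nm is an assumed stand-in for "sub-nanometre".
LIMITS_NM = {
    "optical": 200,
    "electron": 0.5,
}


def instruments_for(size_nm):
    """Return the instruments whose resolution limit is at or below size_nm."""
    return [name for name, limit in LIMITS_NM.items() if limit <= size_nm]


if __name__ == "__main__":
    for structure, size in STRUCTURES_NM.items():
        print(f"{structure} ({size} nm): {', '.join(instruments_for(size))}")
```

This is a crude rule of thumb, of course: seeing that a structure exists and measuring it accurately are different things, so in practice you would want a limit well below the size of the object.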

Of course, EM is itself a family - and a large one at that, with many and diverse members. As with optical microscopy, each member of the family specialises in a given technique or combination of techniques. For example, the Scanning Electron Microscope (SEM) performs a scan of the object under study, and has a resolution of 1 nanometre or better; the Scanning Confocal Electron Microscope (SCEM) uses the same confocal technique mentioned above to provide higher depth resolution; and Transmission Electron Microscopy (TEM) has the ability to penetrate inside the specimen during the imaging process, given samples with a thickness of 100 nanometres or less.

A couple of noteworthy points are required at this juncture. First, whilst some of these EM techniques may sound new and exciting, most have been around for a very long time; it just seems they keep getting better and better as they mature. For example, TEM was used in the fifties to show that neurons communicate over synaptic junctions but it's still wildly popular today. Secondly, it's important to understand that the entire imaging process is not at all trivial - certainly not for TEM, nor EM in general and probably not for Optical Microscopy either. It just is a very labour intensive and very specialised process - most likely done by an expert human neuroanatomist - and the difficulties range from the chemical preparation of the samples all the way up to creating the images. The end product may give the impression it was easy to produce, but easy it was not.

At any rate, whatever the technical details, the fact is that the imagery that results from all these advances is truly evocative - haunting, even. Take this image produced by SEM:

Figure 2: Human neuron. Source: New Reprogramming Method Makes Better Stem Cells

Personally, I think it is incredibly beautiful; simultaneously awe-inspiring and depressing because it really conveys the messiness and complexity of wetware. By way of contrast, look at the neatness of man-made micro-structures:

Figure 3: The BlueGene/Q chip. Source: IBM plants transactional memory in CPU

Stacks and Stacks of 'Em

Technically, pictures like the ones above are called micrographs. As you can see in the neuron micrograph, these images provide a great visual description of the topology of the object we are trying to study. You also may notice a slight coloration of the cell in that picture. This is most likely due to the fact that the people doing the analysis stain the neuron to make it easier to image. Now, in practice - at least as far as I have seen, which is not very far at all, to be fair - 2D grayscale images are preferred by researchers to the nice, Public Relations friendly pictures like the one above; those appear to be more useful for magazine covers. The working micrographs are not quite as exciting to the untrained eye but very useful to the professionals. Here's an example:

Figure 4: The left-hand side shows the original micrograph. On the right-hand side it shows the result of processing it with machine learning. Source: Deep Neural Networks Segment Neuronal Membranes in Electron Microscopy Images

Let's focus on the left-hand side of this image for the moment. It was taken using ssTEM - serial-section TEM, an evolutionary step in TEM. The ss part of ssTEM is helpful in creating stacks of images, which is why you see the little drawings on the left of the picture; they are there to give you the idea that the top-most image is one of 30 in a stack⁵. The process of producing the images above was as follows: they started off with a neuronal tissue sample, which was prepared for observation. The sample was 1.5 micrometres thick and was then sectioned into 30 slices of 50 nanometres each. Each of these slices was imaged, at a resolution of 4x4 nanometres per pixel.
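The numbers in that description hang together nicely, and it is worth checking the arithmetic. The sketch below only uses the figures quoted above (1.5 micrometres of tissue, 50 nanometre slices); the function names are made up for illustration.

```python
SAMPLE_THICKNESS_NM = 1_500   # 1.5 micrometres, as quoted above
SLICE_THICKNESS_NM = 50       # thickness of each section


def slice_count(sample_nm, slice_nm):
    """Number of whole slices that fit in the sample."""
    return sample_nm // slice_nm


def slice_depths(sample_nm, slice_nm):
    """Depth (in nm) at which each slice starts, top of the sample first."""
    return [i * slice_nm for i in range(slice_count(sample_nm, slice_nm))]


if __name__ == "__main__":
    depths = slice_depths(SAMPLE_THICKNESS_NM, SLICE_THICKNESS_NM)
    print(len(depths), "slices; the last one starts at", depths[-1], "nm")
```

So the 30 images in the stack are simply 1500 / 50, with the deepest slice starting 1450 nanometres into the sample.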

As you can imagine, this work is extremely sensitive to measurement error. The trick is to ensure there is some kind of visual continuity between images so that you can recreate a 3D model from the 2D slices. This means for instance that if you are trying to figure out connectivity, you need some way to relate a dendrite to its soma and, say, to the axon of the neuron it connects to - and that's one of the reasons why the slices have to be so thin. It would be no good if the pictures missed this information out, as you would not be able to recreate the connectivity faithfully. This is actually really difficult to achieve in practice due to the minute sizes involved; a slight tremor that displaces the sample by some nanometres would cause shifts in alignment; even with the high precision the tools have, you can imagine that there is always some kind of movement in the sample's position as part of the slicing process.
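To get a feel for what correcting such shifts involves, here is a toy sketch that estimates how far one slice has moved relative to its neighbour by brute force: it tries every small offset and keeps the one with the lowest mean absolute difference. This is an illustration under simplifying assumptions (pure translation, tiny images as nested lists, no rotation or warping), not a description of how real alignment tools work, and `best_shift` is a made-up name.

```python
def best_shift(reference, moved, max_shift=2):
    """Return the (dy, dx) offset that best maps reference onto moved.

    Every offset in [-max_shift, max_shift] is scored by the mean absolute
    pixel difference over the region where the two images overlap.
    """
    rows, cols = len(reference), len(reference[0])
    best = None
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            total, count = 0, 0
            for y in range(rows):
                for x in range(cols):
                    my, mx = y + dy, x + dx
                    if 0 <= my < rows and 0 <= mx < cols:
                        total += abs(reference[y][x] - moved[my][mx])
                        count += 1
            score = total / count
            if best is None or score < best[0]:
                best = (score, dy, dx)
    return best[1], best[2]


# A 5x5 toy slice with two bright pixels, and a copy displaced one row down.
REFERENCE = [[0] * 5 for _ in range(5)]
REFERENCE[1][1], REFERENCE[2][2] = 200, 100
MOVED = [[0] * 5 for _ in range(5)]
MOVED[2][1], MOVED[3][2] = 200, 100

if __name__ == "__main__":
    print(best_shift(REFERENCE, MOVED))
```

With real micrographs the search space is vastly larger and the neighbouring slices are never identical, which is part of why this is such a hard problem.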

Images in a stack are normally stored using traditional formats such as TIFF⁶. You can see an example of the raw images in a stack here. It's worth noticing that, even though the images are 2D grey-scale, since each pixel covers only a few nanometres (4x4 in this case), the full size of an image is very large. Indeed, the latest generation of microscopes produce stacks in the 500 terabyte range, making the processing of the images a "big-data" challenge.
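A back-of-the-envelope calculation shows why the sizes escalate so quickly. The sketch below assumes uncompressed 8-bit greyscale (one byte per pixel), 4x4 nanometre pixels and 50 nanometre slices; the two example volumes are hypothetical and chosen only to illustrate the scaling.

```python
def stack_bytes(width_nm, height_nm, depth_nm,
                pixel_nm=4, slice_nm=50, bytes_per_pixel=1):
    """Uncompressed size of a stack covering the given volume, in bytes."""
    pixels_x = width_nm // pixel_nm
    pixels_y = height_nm // pixel_nm
    slices = depth_nm // slice_nm
    return pixels_x * pixels_y * slices * bytes_per_pixel


# Hypothetical volumes: a tiny 2 x 2 x 1.5 micrometre block,
# and a full cubic millimetre of tissue.
SMALL_BLOCK = stack_bytes(2_000, 2_000, 1_500)
CUBIC_MILLIMETRE = stack_bytes(1_000_000, 1_000_000, 1_000_000)

if __name__ == "__main__":
    print(f"small block:      {SMALL_BLOCK / 1e6:.1f} MB")
    print(f"cubic millimetre: {CUBIC_MILLIMETRE / 1e15:.2f} PB")
```

Under these assumptions the small block is a manageable 7.5 megabytes, but a single cubic millimetre already comes to more than a petabyte.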

What To Do Once You Got the Images

But back to the task at hand. Once you have the stack, the next logical step is to try to figure out what's what: which objects are in the picture. This is called segmentation and labelling, presumably because you are breaking the one big monolithic picture into discrete objects and giving them names. Historically, segmentation has been done manually, but it's a painful, slow and error-prone process. Due to this, there is a lot of interest in automation, and it has recently become feasible to do so - what with the abundance of cheap computing resources as well as the advent of "useful" machine learning (rather than the theoretical variety). Cracking this puzzle is gaining traction amongst the programming herds, as you can see by the popularity of challenges such as this one: Segmentation of neuronal structures in EM stacks challenge - ISBI 2012. It is from this challenge we sourced the stack and micrograph above; the right-hand side is the finished product after machine learning processing.
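
To make "segmentation and labelling" a little more concrete, below is a deliberately naive sketch: threshold the grey levels, then give every connected blob of bright pixels its own number. This is not what the challenge entries do (they rely on machine learning), and the function name, the toy image and the threshold are all invented for the example.

```python
from collections import deque


def segment_and_label(image, threshold):
    """Threshold a 2D grey-scale image and label 4-connected bright regions.

    Returns (labels, count); labels holds 0 for background and 1..count
    for each object found.
    """
    rows, cols = len(image), len(image[0])
    labels = [[0] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if image[r][c] < threshold or labels[r][c]:
                continue  # background, or already part of an object
            count += 1
            labels[r][c] = count
            queue = deque([(r, c)])
            while queue:  # breadth-first flood fill of this object
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and image[ny][nx] >= threshold
                            and not labels[ny][nx]):
                        labels[ny][nx] = count
                        queue.append((ny, nx))
    return labels, count


toy = [
    [200, 210, 10, 10],
    [190, 20, 10, 220],
    [10, 10, 10, 230],
]
labels, count = segment_and_label(toy, threshold=128)
print(count)   # 2
print(labels)  # [[1, 1, 0, 0], [1, 0, 0, 2], [0, 0, 0, 2]]
```

The "segmentation" part is the threshold test and the "labelling" part is the counter; real micrographs are far too noisy for a single threshold, which is exactly why the problem is interesting.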

There are also open source packages to help with segmentation. A couple of notable contenders are Fiji and Ilastik. Below is a screenshot of Ilastik.

Figure 5: Source: Ilastik gallery.

An activity that naturally follows on from segmentation and labelling is reconstruction. The objective of reconstruction is to try to "reconstruct" morphology given the images in the stack. It could involve inferring the missing bits of information by mathematical means or any other kind of analysis which transforms the set of discrete objects spotted by segmentation into something looking more like a bunch of connected neurons.

Once we have a reconstructed model, we can start performing morphometric analysis. As Wikipedia tells us, morphometry is "the quantitative analysis of form"; as you can imagine, there are a lot of useful things one may want to measure in the brain structures and sub-structures, such as lengths, volumes, surface areas and so on. Some of these measurements can of course be done in 2D, but life is made easier if the model is available in 3D. One tool that works in 3D is NeuroMorph. It is an open source extension written in Python for the popular open source 3D computer graphics software Blender.

Figure 6: Source: Segmented anisotropic ssTEM dataset of neural tissue
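
As a small taste of what a morphometric measurement looks like, here is a sketch that measures the length of a reconstructed process traced as a sequence of 3D points. The function name and the coordinates are made up for illustration and this is not how NeuroMorph does it; a real tool works on full meshes rather than a handful of points.

```python
import math


def path_length(points):
    """Total length of a polyline given as (x, y, z) tuples."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))


# Hypothetical dendrite trace, coordinates in micrometres.
dendrite = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (3.0, 4.0, 12.0)]
print(path_length(dendrite))  # 17.0
```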

Conclusion

This post was a bit of a whirlwind tour of some of the sources of real-world data for Computational Neuroscience. As I soon found out, each of these sections could have easily been ten times bigger and still not provide you with a proper overview of the landscape; having said that, I hope that the post at least gives some impression of the terrain and its main features.

From a software engineering perspective, it's worth pointing out the lack of standardisation in information exchange. In an ideal world, one would want a pipeline with components to perform each of the steps of the complete process, from data acquisition off a microscope (either optical or EM), to segmentation, labelling, reconstruction and finally morphometric analysis. This would then be used as an input to the models. Alas, no such overarching standard appears to exist.

One final point in terms of Free and Open Source Software (FOSS). On one hand, it is encouraging to see the large number of FOSS tools and programs being used. Unfortunately - at least for the lovers of Free Software - there are also some proprietary tools that are widely used such as NeuroLucida. Since the software is so specialised, the fear is that in the future, the better funded commercial enterprises will take over more and more of the space.

That's all for now. Don't forget to tune in for the next instalment!

Footnotes:

¹

As it happens, what we are doing here is to apply a well-established learning methodology called the Feynman Technique. I was blissfully unaware of its existence all this time, even though Feynman is one of my heroes and even though I had read a fair bit about the man. On this topic (and the reason why I came to know about the Feynman Technique), it's worth reading Richard Feynman: The Difference Between Knowing the Name of Something and Knowing Something, where Feynman discusses his disappointment with science education in Brazil. Unfortunately the Portuguese and the Brazilian teaching systems have a lot in common - or at least they did when I was younger.

²

Nor is the microscope the only way to figure out what is happening inside the brain. For example, there are neuroimaging techniques which can provide data about both structure and function.

³

Patented by Marvin Minsky, no less - yes, he of Computer Science and AI fame!

⁴

And, to be fair, sub-nanometre just doesn't quite capture just how low these things can go. For an example, read Electron microscopy at a sub-50 pm resolution.

⁵

For a more technical but yet short and understandable take, read Uniform Serial Sectioning for Transmission Electron Microscopy.

⁶

On the topic of formats: it's probably time we mention the Open Microscopy Environment (OME). The microscopy world is dominated by hardware and as such it's the perfect environment for corporations, their proprietary formats and expensive software packages. The OME guys are trying to buck the trend by creating a suite of open source tools and protocols, and by looking at some of their stuff, they seem to be doing alright.

Created: 2015-11-30 Mon 23:12

Emacs 24.5.1 (Org mode 8.2.10)


Nerd Food: Tooling in Computational Neuroscience - Part I: NEURON

2015-11-11T10:01:00.000-08:00

Nerd Food: Tooling in Computational Neuroscience - Part I

In the previous series of posts we built up some theory - right up to the point where we were just about able to make sense of Integrate and Fire - one of the simpler families of neuron models. The series used a reductionist approach - or bottom up, if you prefer¹. We are now starting a new series with the opposite take, this time coming at it from the top. The objective is to provide a (very) high-level overview - in layman's terms, still - of a few of the "platforms" used in computational neuroscience. As this is a rather large topic, we'll try to tackle a couple of platforms each post, discussing a little bit of their history, purpose and limitations - whilst trying to maintain a focus on file formats or DSLs. "File formats" may not sound particularly exciting at first glance, but it is important to keep in mind that these are instances of meta-models of the problem domain in question, and as such, their expressiveness is very important. Understand those and you've understood a great deal about the domain and about the engineering choices of those involved.

But first, let's introduce Computational Neuroscience.

Computers and the Brain

Part V of our previous series discussed some of the reasons why one would want to model neurons (section Brief Context on Modeling). What we did not mention is that there is a whole scientific discipline dedicated to this endeavour, called Computational Neuroscience. Wikipedia has a pretty good working definition, which we will take wholesale. It states:

Computational neuroscience […] is the study of brain function in terms of the information processing properties of the structures that make up the nervous system. It is an interdisciplinary science that links the diverse fields of neuroscience, cognitive science, and psychology with electrical engineering, computer science, mathematics, and physics.

Computational neuroscience is distinct from psychological connectionism and from learning theories of disciplines such as machine learning, neural networks, and computational learning theory in that it emphasizes descriptions of functional and biologically realistic neurons (and neural systems) and their physiology and dynamics. These models capture the essential features of the biological system at multiple spatial-temporal scales, from membrane currents, proteins, and chemical coupling to network oscillations, columnar and topographic architecture, and learning and memory.

These computational models are used to frame hypotheses that can be directly tested by biological or psychological experiments.

Lots of big words, of course, but hopefully they make some sense after the previous posts. If not, don't despair; what they all hint at is an "interdisciplinary" effort to create biologically plausible models, and to use these to provide insights on how the brain is performing certain functions. Think of the Computational Neuroscientist as the right-hand person of the Neuroscientist - the "computer guy" to the "business guy", if you like. The Neuroscientist (particularly the experimental Neuroscientist) gets his or her hands messy with wetware and experiments, which end up providing data and a better biological understanding; the Computational Neuroscientist takes these and uses them to make improved computer models, which are used to test hypotheses or to make new ones, which can then be validated by experiments and so on, in a virtuous feedback loop.² Where the "interdisciplinary" part comes in is that many of the people doing the role of "computer guys" are actually not computer scientists but instead come from a variety of backgrounds such as biology, physics, chemistry and so on. This variety adds a lot of value to the discipline because the brain is such a complex organ; understanding it requires all kinds of skills - and then some.

It's Models All the Way Down

At the core, then, the work of the Computational Neuroscientist is to create models. Of course, as we have already seen, one does not just walk straight into Mordor and start creating the "most biologically plausible" model of the brain possible; all models must have a scope as narrow as possible, if they are to become a) understandable and b) computationally feasible. Thus engineering trade-offs are crucial to the discipline.

Also, it is important to understand that creating a model does not always imply writing things from scratch. Instead, most practitioners rely on a wealth of software available, all with different advantages and disadvantages.

At this juncture you are probably wondering just what exactly are these "models" we speak so much of. Are they just equations like IAF? Well, yes and no. As it happens, all models have roughly the following structure:

a morphology definition: we've already spoken a bit about morphology; think of it as the definition of the entities that exist in your model, their characteristics and relationships. This is actually closer to what we computer scientists think the word modeling means. For example, the morphology defines how many neurons you have, how many axons and dendrites, connectivity, spatial positioning and so on.
a functional, mathematical or physical definition: I've heard it named in many ways, but fundamentally, what it boils down to is the definition of the equations that your model requires. For example, are you modeling electrical properties or reaction/diffusion?

For the simpler models, the morphology gets somewhat obscured - after all, in LIF, there is very little information about a neuron because all we are interested in are the spikes. For other models, a lot of morphological details are required.
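
For the programmers in the audience, the two halves of a model can be pictured as a data structure plus a function. The sketch below is only an illustration of that split: every name in it (Section, Model, perfect_integrator) is invented here and does not correspond to any particular platform.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Section:
    """Morphology: one named cylinder, optionally attached to a parent."""
    name: str
    length_um: float
    diameter_um: float
    parent: Optional[str] = None


@dataclass
class Model:
    """A model pairs the entities (morphology) with the equations (dynamics)."""
    morphology: List[Section]
    dynamics: Callable[[float, float], float]  # returns dV/dt given (V, I)


def perfect_integrator(v, i, c_m=1.0):
    """Functional definition: dV/dt = I / C_m, ignoring the voltage itself."""
    return i / c_m


model = Model(
    morphology=[
        Section("soma", length_um=10.0, diameter_um=10.0),
        Section("dendrite", length_um=100.0, diameter_um=1.0, parent="soma"),
    ],
    dynamics=perfect_integrator,
)

# Connectivity is a morphological question...
print([s.name for s in model.morphology if s.parent == "soma"])  # ['dendrite']
# ...whereas the rate of change of the voltage is a functional one.
print(model.dynamics(0.0, 2.0))  # 2.0
```

In a spike-only model the morphology list would shrink to almost nothing, whereas detailed models would carry hundreds of sections.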

The Tooling Landscape

Idealised…

It is important to keep in mind that these models are to be used in a simulation; that is, we are going to run the program for a period of time (hours or days) and observe different aspects of its behaviour. Thus the functional definition of the model provides the equations that describe the dynamics of the system being simulated and the morphology will provide some of the inputs for those equations.

From here one can start to sketch the requirements for a system for the Computational Neuroscientist:

a platform of some kind to provide simulation control: starting, stopping, re-running, storing the results and so on. As the simulations can take a long time to run, the data sets can be quite large - in the hundreds of gigabytes range - so efficient handling of the output data is a must.
some kind of DSL that provides a user-friendly way to define their models, ideally with a graphical user interface that helps author the DSL. The DSL must cover the two aspects we mentioned above.
efficient libraries of numerical routines to help solve the equations. The libraries must be exposed in some way to the DSL so that users can make use of these when defining the functional aspects of the model.

Architecturally, the ability to use a cluster or GPUs would of course be very useful, but we shall ignore those aspects for now. Given this idealised platform, we can now make a bit more sense of what actually exists in the wild.

… vs Actual

The multidisciplinary nature of Computational Neuroscience poses some challenges when it comes to software development: as mentioned, many of the practitioners in the field do not have a Software Engineering background; of those that do, most tend not to have strong biology and neuroscience backgrounds. As a result, the landscape is fragmented and the quality is uneven. On the one hand, most of the software is open source, making reuse a lot less of a problem. On the other hand, things such as continuous integration, version control, portability, user interface guidelines, organised releases, packaging and so on are still lagging behind most "regular" Free and Open Source projects³.

In some ways, to enter Computational Neuroscience is a bit like travelling in time to an era before git, before GitHub, before Travis and all other things we take for granted. Not everywhere, of course, but still in quite a few places, particularly with the older and more popular projects. One cannot help but get the feeling that the field could do with some of the general energy we have in the FOSS community, but the technical barriers to contributing tend to be large since the domain is so complex.

So after all of this boring introductory material, we can finally look at our first system.

NEURON

Having to choose, one feels compelled to start with NEURON - the most venerable of the lot, with roots in the 80s⁴. NEURON is a simulation environment with great depth of functionality and a comprehensive user manual published as a (non-free) book. For the less wealthy, an overview paper is available, as are many other online resources. The software itself is fully open source, with a public Mercurial repo.

As with many of the older tools in this field, NEURON development has not quite kept up the pace with the latest and greatest. For instance, it still has a Motif'esque look to its UI but, alas, do not be fooled - it's not Motif but InterViews - a technology I had never heard of, but which seems to have been popular in the 80s and early 90s. One fears that NEURON may just be the last widely used program relying on InterViews - and the fact that they carry their own fork of it does not make me hopeful.

Figure 1: Source: NEURON Cell Builder

However, once one goes past these layers of legacy, the domain functionality of the tool is very impressive. This goes some way to explain why so many people rely on it daily and why so many papers have been written using it - over 600 papers at the last count.

Whilst NEURON is vast, we are particularly interested in only two aspects of it: hoc and mod (in its many incarnations). These are the files that can be used to define models.

Hoc

Hoc has a fascinating history and a pedigree to match. It is actually the creation of Kernighan and Pike, two UNIX luminaries, and has as contenders tools such as bc and dc. NEURON took hoc and extended it both in terms of syntax as well as the number of available functions; NEURON Hoc is now an interpreted object oriented language, albeit with some limitations such as lack of inheritance. Programs written in hoc execute in an interpreter called oc. There are a few variations of this interpreter, with different kinds of libraries made available to the user (UI, neuron modeling specific functionality, etc) but the gist of it is the same, and the strong point is the interactive development with rapid feedback. On the GUI versions of the interpreter, the script can specify its UI elements including input widgets for parameters and widgets to display the output. Hoc is then used as a mix between model/view logic and morphological definition language.

To get a feel for the language, here's a very simple sample from the manual:

create soma    // model topology
access soma    // default section = soma

soma {
   diam = 10   // soma dimensions in um
   L = 10/PI   //   surface area = 100 um^2
}
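
The comments in that sample hide a small piece of arithmetic which is worth spelling out. Assuming the section is treated as a cylinder and that only its side wall counts as membrane (an assumption on my part, the sample itself does not say), the choice of L = 10/PI is what makes the area come out as a round number. A quick check, with a made-up function name:

```python
import math


def cylinder_side_area(diam_um, length_um):
    """Lateral surface area of a cylinder (end caps excluded), in um^2."""
    return math.pi * diam_um * length_um


# Same numbers as the hoc sample: diam = 10, L = 10/PI.
area = cylinder_side_area(diam_um=10.0, length_um=10.0 / math.pi)
print(round(area, 6))  # 100.0
```

A round 100 um^2 is handy because many membrane quantities are expressed per unit area, which makes the conversions easy to do in your head.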

NMODL

The second language supported by NEURON is NMODL - The NEURON extended MODL (Model Description Language). NMODL is used to specify a physical model in terms of equations such as simultaneous nonlinear algebraic equations, differential equations and so on. In practice, there are actually different versions of NMODL for different NEURON versions, but to keep things simple I'll just abstract these complexities and refer to them as one entity⁵.

As intimated above, NMODL is a descendant of MODL. As with Hoc, the history of MODL is quite interesting; it was a language defined by the National Biomedical Simulation Resource to specify models for use with SCoP - the Simulation Control Program⁶. From what I can gather of SCoP, its main purpose was to make life easier when creating simulations, providing an environment where users could focus on what they were trying to simulate rather than nitty-gritty implementation specific details.

NMODL took MODL syntax and extended it with the primitives required by its domain; for instance, it added the NEURON block to the language, which allows multiple instances of "entities". As with MODL, NMODL is translated into efficient C code and linked against supporting libraries that provide the numerics; the NMODL translator to C also had to take into account the requirement of linking against NEURON libraries rather than SCoP.

Below is a snippet of NMODL code, copied from the NEURON book (chapter 9, listing 9.1):

NEURON {
  SUFFIX leak
  NONSPECIFIC_CURRENT i
  RANGE i, e, g
}

PARAMETER {
  g = 0.001  (siemens/cm2)  < 0, 1e9 >
  e = -65    (millivolt)
}

ASSIGNED {
  i  (milliamp/cm2)
  v  (millivolt)
}
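
The snippet declares the variables and their units but stops short of the equation that ties them together. Assuming the mechanism is a plain ohmic leak, i = g (v - e), which is my reading of the names rather than something shown in the snippet, the physics amounts to a one-liner. The Python below is only a sketch of that assumption, with the defaults borrowed from the PARAMETER block.

```python
def leak_current(v_mv, g=0.001, e_mv=-65.0):
    """Assumed ohmic leak: current density in mA/cm2.

    g is in siemens/cm2 and the voltages in millivolts, so the product
    g * (v - e) is already in milliamp/cm2 with no conversion factor.
    """
    return g * (v_mv - e_mv)


for v in (-75.0, -65.0, -55.0):
    print(v, leak_current(v))
```

At v = e the current is zero, and it changes sign on either side of it; with the defaults above a 10 mV displacement gives a magnitude of roughly 0.01 mA/cm2.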

NMODL and hoc are used together to form a model; hoc to provide the UI, parameters and morphology and NMODL to provide the physical modeling. The website ModelDB provides a database of models in a variety of platforms with the main objective of making research reproducible. Here you can see an example of a production NEURON model in its full glory, with a mix of hoc and NMODL files - as well as a few others such as session files, which we can ignore for our purposes.

Thoughts

NEURON is more or less a standard in Computational Neuroscience - together with a few other tools such as GENESIS, which we shall cover later. Embedded deeply in its source code is the domain logic learned painstakingly over several decades. Whilst software engineering-wise it is creaking at the seams, finding a next generation heir will be a non-trivial task given the features of the system, the number of models that exist out there, and the knowledge and large community that uses it.

Due to this, a solution that a lot of next-generation tools have developed is to use NEURON as a backend, providing a shiny modern frontend and then generating the appropriate hoc and NMODL required by NEURON. This is then executed in a NEURON environment and the results are sent back to the user for visualisation and processing using modern tools. Le Roi Est Mort, Vive Le Roi!

Conclusions

In this first part we've outlined what Computational Neuroscience is all about, what we mean by a model in this context and what services one can expect from a platform in this domain. We also covered the first of such platforms. Tune in for the next instalment where we'll cover more platforms.

Footnotes:

¹

I still owe you the final post of that series, coming out soon, hopefully.

²

Of course, once you scratch the surface, things get a bit murkier. Erik De Schutter states:

[…] The term is often used to denote theoretical approaches in neuroscience, focusing on how the brain computes information. Examples are the search for “the neural code”, using experimental, analytical, and (to a limited degree) modeling methods, or theoretical analysis of constraints on brain architecture and function. This theoretical approach is closely linked to systems neuroscience, which studies neural circuit function, most commonly in awake, behaving intact animals, and has no relation at all to systems biology. […] Alternatively, computational neuroscience is about the use of computational approaches to investigate the properties of nervous systems at different levels of detail. Strictly speaking, this implies simulation of numerical models on computers, but usually analytical models are also included […], and experimental verification of models is an important issue. Sometimes this modeling is quite data driven and may involve cycling back and forth between experimental and computational methods.

³

This is a problem that has not gone unnoticed; for instance, this paper provides an interesting and thorough review of the state of the union in Computational Neuroscience: Current practice in software development for computational neuroscience and how to improve it. In particular, it explains the dilemmas faced by the maintainers of neuroscience packages.

⁴

The early story of NEURON is available here; see also the scholarpedia page.

⁵

See the NMODL page for details, in the history section.

⁶

As far as I can see, in the SCoP days MODL was just called the SCoP Language, but as the related paper is behind a paywall I can't prove it either way. Paper: SCoP: An interactive simulation control program for micro- and minicomputers, from Springer.

Created: 2015-11-11 Wed 17:59


Nerd Food: Interesting...

2015-11-09T15:58:00.001-08:00

Nerd Food: Interesting…

Time to flush all those tabs again. Some interesting stuff I bumped into recently-ish.

Finance

There’s a blockchain for that!: The rise and rise of the blockchain…
Crypto 2.0–And Other Misconceptions: … but maybe we are getting a bit ahead of ourselves, and bitcoin is really what matters!
How the Bitcoin protocol actually works: Gory details of bitcoin's internals. Must say that the Bitcoin book already provides a pretty good explanation, but interesting nonetheless.
BitBeat: Bitcoin Surges Past $400 on Back of the New ‘Shining Star’: Bitcoin is up again! Rollercoaster ride?
ModVal: How cool is that, a site for financial models.

Startups et al.

The Shape of Things to Come: Jonathan Ive seems like a cool fellow, actually. Even though I'm not much of an Apple fan.
Y Combinator Posthaven: thoughts from the combinator. 1000 companies funded!
The Future of Firms. Is There an App for That?: What does it mean to be a company in this world of change?

General Coding

What I’ve learned so far about software development: Tales from the trenches.
Git from the inside out: Gory details about git. And I mean really gory and really detailed. As with bitcoin, the git book was already pretty detailed but interesting nonetheless.
How to Write a Git Commit Message: jeez, each to their own! Here's a true geek of git commit messages. But very useful though.
Elements of Scale: Composing and Scaling Data Platforms: Interesting take on data, should help navigate the SQL/NoSQL debate.
Do you really know why you prefer REST over RPC?: Title says it all, interesting take on the RESTification of the world.
Making The Case For Building Scalable Stateful Services In The Modern Era: Instead of knee-jerk reactions about statefulness, think deeply before you decide. With presentation: "Building Scalable Stateful Services" by Caitie McCaffrey
"Apache Kafka and the Next 700 Stream Processing Systems" by Jay Kreps: Improved my understanding of Kafka somewhat. And the title made me curious, so I ended up reading The Next 700 Programming Languages.
The Room Where the Internet Was Born: A quest for understanding the cloud. Rather long but great for those that like computer history.
A Criticism of Scrum: quite well thought out actually. I love agile but I must say, I agree with many of the points made. I guess in the end, the key is not to fall in love with ceremony.

Databases

Why Zalando trusts in PostgreSQL: Great to see how the elephant is used in anger. Picked up a few tips.

C++

Futures for C++11 at Facebook: futures are all the rage…
C++ Futures at Instagram: … everywhere!
Livestream: WOT, C++ live coding? Man this is a weird notion!
CppCon 2015: Chandler Carruth "Tuning C++: Benchmarks, and CPUs, and Compilers! Oh My!": Chandler at his best - hilarious but extremely informative.
Comparison in C++: A weird but weirdly useful paper. Deep thinking about comparisons. For good measure, you should also watch the presentation: CppCon 2015: Lawrence Crowl "Comparison is not simple, but it can be simpler"
CppCon 2015: Eric Niebler "Ranges for the Standard Library": Ranges, ranges, ranges!! Can't wait! For good measure, you can also read C++ Ranges are Pure Monadic Goodness.
Crypto coding rules: haven't parsed the entire document, but already found a few very useful points.
RapidCheck: Didn't know of QuickCheck or RapidCheck, but this may come in handy at some point…

Layman Science

Linear Algebra: What matrices actually are: an attempt to make matrices accessible.
A Gentle Introduction To Learning Calculus: Helpful when trying to remember all the stuff one did all those years ago…
A Visual, Intuitive Guide to Imaginary Numbers: … continued.
New theories reveal the nature of numbers: A Ramanujan movie is due out soon, and these are the guys doing the maths behind the scenes. Can't wait for the movie!
Richard Feynman on education in Brazil: Hilarious, but somewhat reminiscent of my own education, in a different country but culturally very similar and certainly with a very similar approach to science.
Using neural nets to recognize handwritten digits: nice introduction to neural nets with a good example.
Your Brain Is On the Brink of Chaos: Very interesting. This is particularly interesting because I have been reading up on the delicate balance between inhibitory and excitatory neurons, but the literature always gives you this static impression of balance. If you assume chaos on the other hand…

Other

Extreme City: Luanda, my beloved, just keeps on getting crazier and crazier. Interesting - if somewhat expat-oriented - take on the city.
This Florida Teenager Knows What Ahmed Mohamed Is Going Through. It Happened to Her in 2013: sad, really. Whatever the real truth was about Ahmed.
Will You Ever Be Able to Upload Your Brain?: er, spoiler alert - not really. Interesting though.
Dealing with “power laws” with upper (lower) bound: most certainly not for laypeople. Taleb is back at it. Would be great to have this translated to layman's maths.
Lieke Boon: Unconscious Bias: we're all guilty: How to be a bit more aware of your own biases is my take on it.

Created: 2015-11-09 Mon 23:57


Nerd Food: Neurons for Computer Geeks - Part VI: LIF At Long Last!

2015-09-16T10:06:00.000-07:00

Nerd Food: Neurons for Computer Geeks - Part VI: Integrate and Fire!

Welcome to part VI of a multi-part series on modeling neurons. In part V we added a tad more theory to link electricity with neurons, and also tried to give an idea of just how complex neurons are. Looking back on that post, I cannot help but notice I skipped one bit that is rather important to understanding Integrate-and-Fire (IAF) models. So let's look at that first and then return to our trail.

Resting Potential and Action Potential

We have spoken before about the membrane potential and the resting membrane potential, but we did so with such a high degree of hand-waving it now warrants revisiting. When we are talking about the resting membrane potential we mean just that - the value for the membrane potential when nothing much is happening. That is the magical circa -65 mV we discussed before - with all of the related explanations on conventions around negative voltages. However, time does not stand still and things happen. The cell receives input from other neurons, and this varies over time. Some kinds of inputs can cause events to trigger on the receiving neuron: active ion channels may get opened or shut, ions move around, concentrations change and so forth, and thus, the cell will change its membrane potential in response. When these changes result in a higher voltage - such as moving to -60 mV - we say a depolarisation is taking place. Conversely, when the voltage becomes more negative, we say hyperpolarisation is occurring.

Now, it may just happen that there is a short-lived but "strong" burst of depolarisation, followed by equally rapid hyperpolarisation - as a result of which the axon's terminal decides to release neurotransmitters into the synapse (well, into the synaptic gap or synaptic cleft to be precise). This is called an action potential, and it is also known by many other names such as "nerve impulses" or "spikes". When you hear that "a neuron has fired" this means that an action potential has just been emitted. If you record the neuron's behaviour over time you will see a spike train - a plot of the voltage over time, clearly showing the spikes. Taking a fairly random example:

Figure 1: Source: Wikipedia, Neural oscillation

One way of picturing this is as a kind of "chain-reaction" whereby something triggers the voltage of the neuron to rise, which triggers a number of gates to open, which then trigger the voltage to rise and so on, until some kind of magic voltage threshold is reached where the inverse occurs: the gates that were causing the voltage to rise shut and some other gates that cause the voltage to decrease open, and so on, until we fall back down to the resting membrane potential. The process feeds back on itself, first as a positive feedback and then as a negative feedback. In the case of the picture above, something else triggers us again and again, until we finally come to rest.

This spiking or firing behaviour is what we are trying to model.

Historical Background

As it happens, we are not the first ones to try to do so. A couple of years after Einstein's annus mirabilis, a French chap called Louis Lapicque was also going through his own personal moment of inspiration, the output of which was the seminal Recherches quantitatives sur l'excitation électrique des nerfs traitée comme une polarisation. It is summarised here in fairly accessible English by Abbott.

Lapicque had the insight of imagining the neuron as an RC circuit, with the membrane potential's behaviour explained as the interplay between capacitor and resistor; the action potential is then the capacitor reaching a threshold followed by a discharge. Even with our faint understanding of the subject matter, one cannot but appreciate the brilliance Lapicque showed in reaching these conclusions in 1907. Of course, he also had to rely on the work of many others to get there, let's not forget.

This model is still considered a useful model today, even though we know so much more about neurons now - a great example of what we mentioned before in terms of the choices of the level of detail when modeling. Each model is designed for a specific purpose and it should be as simple as possible for the stated end (but no simpler). As Abbott says:

While Lapicque, because of the limited knowledge of his time, had no choice but to model the action potential in a simple manner, the stereotypical character of action potentials allows us, even today, to use the same approximation to avoid computation of the voltage trajectory during an action potential. This allows us to focus both intellectual and computation resources on the issues likely to be most relevant in neural computation, without expending time and energy on modeling a phenomenon, the generation of action potentials, that is already well understood.

The IAF Family

Integrate-and-Fire is actually a family of models - related because all of them follow Lapicque's original insights. Over time, people have addressed shortcomings in the model by adding more parameters and modifying it slightly and from this other models were born.

In general, models in the IAF family are single neuron models with a number of important properties (as per Izhikevich):

The spikes are all or none; that is, we either spike or we don't. This is a byproduct of the way spikes are added to the model, as we shall see later. This also means all spikes are identical because they are all created the same way.
The threshold for the spike is well defined and there is no ambiguity as to whether the neuron will fire or not.
It is possible to add a refractory period, similarly to how we add the spike. The refractory period is a time during which the neuron is less excitable (e.g. ignores inputs) and occurs right after the spike.
Positive currents are used as excitatory inputs and negative currents as inhibitory inputs.

But what do the members of this family look like? We will take a few examples from Wikipedia to make a family portrait and then focus on LIF.

IAF: Integrate-and-Fire

This is the Lapicque model. It is also called a "perfect" or "non-leaky" neuron. The formula is as follows:

\begin{align} I(t) = C_m \frac{dV_m(t)}{dt} \end{align}

The m's are there to signify membrane, nothing else. Note that it's the job of the user to determine θ - that is, the point at which the neuron spikes - and then to reset everything to zero and start again. If you are wondering why it's called "integrate", that's because the differential equation must be integrated before we can compare the resulting voltage to a threshold and then, if we're past it, well - fire! Hence Integrate-and-Fire.

Wikipedia states this in a classier way, of course:

[This formula] is just the time derivative of the law of capacitance, Q = CV. When an input current is applied, the membrane voltage increases with time until it reaches a constant threshold Vth, at which point a delta function spike occurs and the voltage is reset to its resting potential, after which the model continues to run. The firing frequency of the model thus increases linearly without bound as input current increases.
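The linear relationship between input current and firing frequency is easy to see in code. What follows is only a minimal sketch, assuming a constant input current, a reset to zero and a fixed time step; the function name `count_iaf_spikes` and all parameter values are made up for illustration and do not come from any library.

```cpp
#include <cassert>

// Perfect (non-leaky) integrate-and-fire neuron driven by a constant current.
// Integrates dV/dt = I / C with a fixed step and counts threshold crossings.
// All names and values here are illustrative assumptions.
int count_iaf_spikes(double current, double capacitance, double threshold,
                     double duration, double dt) {
    assert(capacitance > 0.0 && dt > 0.0);
    double v = 0.0;  // membrane potential, starting at the reset value
    int spikes = 0;
    const long steps = static_cast<long>(duration / dt + 0.5);
    for (long n = 0; n < steps; ++n) {
        v += dt * current / capacitance;  // integrate
        if (v >= threshold) {             // fire
            ++spikes;
            v = 0.0;                      // reset and start again
        }
    }
    return spikes;
}
```

With a capacitance and threshold of 1 and a step of 1/1024, doubling the current from 1 to 2 doubles the number of spikes seen over the same duration; nothing in the model ever stops this from continuing.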

Integrate-and-Fire with Refractory Period

It is possible to extend IAF to take the refractory period into account. This is done by adding a period of time t_ref during which the neuron does not fire.
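To see what the refractory period does to the firing rate, here is a small worked relation. It is a sketch that assumes a constant input current I, a reset to zero after each spike and a threshold V_th. Without a refractory period, the neuron needs a time of C_m V_th / I to charge up to the threshold, so the firing frequency is:

\begin{align} f(I) = \frac{I}{C_m V_{th}} \end{align}

With a refractory period, each cycle takes t_ref longer:

\begin{align} f(I) = \frac{1}{t_{ref} + \frac{C_m V_{th}}{I}} = \frac{I}{C_m V_{th} + t_{ref} I} \end{align}

As I grows, this approaches 1 / t_ref rather than growing without bound, so the firing rate now has a ceiling.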

LIF: Leaky Integrate-and-Fire

One of the problems of IAF is that it will "remember" stimuli, regardless of the time that elapses between them. By way of example: if a neuron gets some input below the firing threshold at some time (say ta), then nothing for a long period of time, and then a subsequent stimulus at, say, tb, this will cause the neuron to fire (assuming the two inputs together are above the threshold). In the real world, neurons "forget" about below-threshold stimuli after a certain amount of time has elapsed. This problem is solved in LIF by adding a leak term to IAF. Wikipedia's formula is like so:

\begin{align} I_m(t) - \frac{V_m(t)}{R_m} = C_m \frac{dV_m(t)}{dt} \end{align}

We will discuss it in detail later on.
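In the meantime, the "forgetting" can be made concrete with a rough sketch. It feeds two identical, brief pulses of current into the model, separated by a gap, and reports the highest voltage reached; the update rule is a simple fixed-step one (the Euler method, covered further down). Spiking is deliberately left out, and the function name and every number in it are assumptions chosen purely for illustration.

```cpp
#include <cassert>

// Peak membrane potential after two brief, identical input pulses separated
// by `gap` time units. With leak == false this is the perfect integrator
// (C dV/dt = I); with leak == true a leak term -V/R is added (LIF).
// Spiking is left out so that the sub-threshold voltage stays visible.
// Names and parameter values are illustrative assumptions.
double peak_after_two_pulses(bool leak, double gap) {
    assert(gap >= 0.0);
    const double C = 1.0, R = 1.0;   // capacitance and resistance
    const double dt = 0.001;         // integration step
    const double pulse_len = 0.1;    // duration of each pulse
    const double pulse_amp = 6.0;    // current during a pulse
    const double second_start = pulse_len + gap;
    const double end = second_start + pulse_len;
    const long steps = static_cast<long>(end / dt + 0.5);
    double v = 0.0, peak = 0.0;
    for (long n = 0; n < steps; ++n) {
        const double t = n * dt;
        const bool in_pulse = t < pulse_len || t >= second_start;
        const double input = in_pulse ? pulse_amp : 0.0;
        const double leak_current = leak ? v / R : 0.0;
        v += dt * (input - leak_current) / C;
        if (v > peak) peak = v;
    }
    return peak;
}
```

Against an assumed threshold of 1, the non-leaky neuron ends up above the threshold no matter how long the gap is, because each pulse adds roughly 0.6 and nothing is ever lost. The leaky one only gets there when the pulses arrive back to back; with a long gap the first pulse has leaked away by the time the second one shows up.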

Interlude: Leaky Integrators and Low-Pass Filters

Update: this section got moved here from an earlier post.

Minor detour into the world of "Leaky Integrators". As it turns out, mathematicians even have a name to describe functions like the one above: they are called Leaky Integrators. A leaky integrator is something that takes an input and "integrates" it - that is, sums it over a range - but, whilst doing so, "leaks" some of the accumulated value out. In other words, a regular sum of positive values over a range should just result in an ever-growing output. With a leaky integrator, we keep adding the input but the sum is continuously leaking away, so that once the input stops, the value drifts back to where we started off.

It turns out these kinds of functions have great utility. For example, imagine that you have an input signal made up of slow variations mixed with fast ones. When you supply this input to a leaky integrator, it can be used to "filter out" the fast variations: changes quicker than a certain cut-off frequency are heavily attenuated in the output, whereas slower ones pass through largely untouched. This is known as a low-pass filter. One can conceive of a function that acts in the opposite way - a high-pass filter.
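A discrete version makes the filtering behaviour easier to poke at. The sketch below is one common way of writing a leaky integrator in code, not the only one: at every step the state moves a fraction `alpha` of the way towards the input. The function name `output_swing`, the sine-wave input and the parameter values are all assumptions made for the sake of the example.

```cpp
#include <cassert>
#include <cmath>

// Discrete leaky integrator: each step the state moves a fraction `alpha`
// of the way towards the input, so old input gradually "leaks" away.
// Feeds in a sine wave of amplitude 1 with the given period (in samples)
// and returns the peak-to-peak swing of the output over the last period.
// Names and values are illustrative assumptions.
double output_swing(double alpha, int period, int cycles) {
    assert(alpha > 0.0 && alpha <= 1.0 && period > 0 && cycles > 1);
    const double two_pi = 6.283185307179586;
    double y = 0.0;
    double lo = 0.0, hi = 0.0;
    const int total = period * cycles;
    for (int n = 0; n < total; ++n) {
        const double x = std::sin(two_pi * n / period);
        y += alpha * (x - y);                      // leaky integration step
        if (n == total - period) { lo = hi = y; }  // start measuring
        if (n > total - period) {
            lo = std::fmin(lo, y);
            hi = std::fmax(hi, y);
        }
    }
    return hi - lo;
}
```

With `alpha` set to 0.05, a slow input (one cycle every 1000 samples) comes out with nearly its full swing of 2, whereas a fast input (one cycle every 4 samples) is squashed to a small fraction of that: the low frequencies pass, the high ones do not.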

Exponential Integrate-and-Fire

In this model, spike generation is exponential:

\begin{align} \frac{dX}{dt} = \Delta_T \exp\left(\frac{X - X_T}{\Delta_T}\right) \end{align}

Wikipedia explains it as follows:

where X is the membrane potential, X_T is the membrane potential threshold, and Δ_T is the sharpness of action potential initiation, usually around 1 mV for cortical pyramidal neurons. Once the membrane potential crosses X_T, it diverges to infinity in finite time.
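The "finite time" claim can be checked by hand. The sketch below takes the equation exactly as written above, assumes a starting potential X_0 at t = 0 and substitutes u = (X - X_T) / Δ_T:

\begin{align} \frac{du}{dt} = e^{u} \quad \Rightarrow \quad e^{-u(t)} = e^{-u_0} - t \quad \Rightarrow \quad u(t) = -\ln\left(e^{-u_0} - t\right) \end{align}

The logarithm blows up when its argument reaches zero, that is, at:

\begin{align} t^{*} = e^{-u_0} = \exp\left(-\frac{X_0 - X_T}{\Delta_T}\right) \end{align}

So a neuron starting exactly at the threshold (X_0 = X_T) diverges after one unit of time, and the further above the threshold it starts, the sooner the spike arrives.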

Others

We could continue and look into other IAF models, but you get the point. Each model has limitations, and as people work through those limitations - e.g. try to make the spike trains generated by the model closer to those observed in reality - they make changes to the model and create new members of the IAF family.

Explaining the LIF Formula

Let's look at a slightly more familiar formulation of LIF:

\begin{align} \tau_m \frac{dv}{dt} = -v(t) + RI(t) \end{align}

By now this should make vague sense, but let's do a step-by-step breakdown just to make sure we are all on the same page. First, we know that the current of the RC circuit is defined like so:

\begin{align} I(t) = I_R + I_C \end{align}

From Ohm's Law we also know that:

\begin{align} I_R = \frac {v}{R} \end{align}

And from the rigmarole of the capacitor we also know that:

\begin{align} I_C = C \frac{dv}{dt} \end{align}

Thus it's not much of a leap to say:

\begin{align} I(t) = \frac {v(t)}{R} + C \frac{dv}{dt} \end{align}

Now, if we multiply both sides by R, we get:

\begin{align} RI(t) = v(t) + RC \frac{dv}{dt} \end{align}

Remember that RC is τ, the RC time constant; in this case, we are dealing with the membrane, hence the m. With that, the rest of the rearranging to the original formula should be fairly obvious.
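For completeness, here is that last step spelled out, using only the symbols already introduced. Substituting τ_m for RC gives:

\begin{align} RI(t) = v(t) + \tau_m \frac{dv}{dt} \end{align}

Moving v(t) to the other side leaves the derivative on its own:

\begin{align} \tau_m \frac{dv}{dt} = -v(t) + RI(t) \end{align}

which is the formula we started this section with.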

Also, if you recall, we mentioned Leaky Integrators before. You should hopefully be able to see the resemblance between these and our first formula.

Note that we did not model spikes explicitly with this formula. However, when it comes to implementing it, all that is required is to look for a threshold value for the membrane potential - called the spiking threshold; when that value is reached, we need to reset the membrane potential back to a lower value - the reset potential.

And with that we have enough to start thinking about code…

Method in our Madness

.. Or so you may think. First, a quick detour on discretisation. As it happens, computers are much fonder of discrete things than of the continuous entities that inhabit the world of calculus. Computers are very much of the same opinion as the priest who said:

And what are these same evanescent Increments? They are neither finite Quantities nor Quantities infinitely small, nor yet nothing. May we not call them the Ghosts of departed Quantities?

So we cannot directly represent differential equations in the computer - not even the simpler ordinary differential equations (ODEs), with their single independent variable. Instead, we need to approximate them with a method for numerical integration of the ODE. Remember: when we say integration we just mean "summing".

Once we enter the world of methods and numerical analysis we are much closer to our ancestral home of Software Engineering. The job of numerical analysis is to look for ways in which one can make discrete approximations of the problems in mathematical analysis - like, say, calculus. The little recipes they come up with are called numerical methods. A method is nothing more than an algorithm, a set of steps used iteratively. One such method is the Euler Method: "[a] numerical procedure for solving ordinary differential equations (ODEs) with a given initial value", as Wikipedia tells us, and as it happens that is exactly what we are trying to do.

So how does the Euler method work? Very simply. First you know that:

\begin{align} y(t_0) = y_0 \\ y'(t) = f(t, y(t)) \end{align}

That is, at the beginning of time we have a known value. Then, for all other t's, we use the current value in f in order to be able to compute the next value. Let's imagine that our steps - how much we are moving forwards by - are of a size h. You can then say:

\begin{align} t_{n+1} = t_n + h \\ y_{n+1} = y_n + h * f(t_n, y_n) \end{align}

And that's it. You just need to know where you are right now, by how much you need to scale the function - i.e. the step size - and then apply the function to the current values of t and y.

In code:

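template<typename F>
double euler(F f, double y0, double start, double end, double h) {
    double y = y0;  // current approximation, starting from the initial value
    for (auto t(start); t < end; t += h) {
        y += h * f(t, y, h);  // step forward along the slope at (t, y)
    }
    return y;  // approximation of y at the end of the interval
}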

We are passing h to the function F because it needs to know about the step size, but other than that it should be a pretty clean mapping from the maths above.

This method is also known as Forward Euler or Explicit Euler.
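To get a feel for how good the approximation is, we can try the method on an equation whose answer we already know. The sketch below is a variation on the function above rather than the same code: `euler_solve` and `decay_at_one` are hypothetical helpers written for this example, the step count is passed in instead of the step size, and f takes only t and y. The test equation is y' = -y with y(0) = 1, whose exact solution is exp(-t).

```cpp
#include <cassert>
#include <cmath>

// Forward Euler with a fixed number of steps; returns the final value of y.
// `euler_solve` is a hypothetical helper written for this example.
template<typename F>
double euler_solve(F f, double y0, double start, double end, int steps) {
    assert(steps > 0);
    const double h = (end - start) / steps;  // step size
    double y = y0;
    for (int n = 0; n < steps; ++n) {
        const double t = start + n * h;
        y += h * f(t, y);  // y_{n+1} = y_n + h * f(t_n, y_n)
    }
    return y;
}

// Approximates y(1) for y' = -y, y(0) = 1; the exact answer is exp(-1).
double decay_at_one(int steps) {
    return euler_solve([](double, double y) { return -y; },
                       1.0, 0.0, 1.0, steps);
}
```

Comparing `decay_at_one(10)`, `decay_at_one(100)` and `decay_at_one(1000)` against `std::exp(-1.0)` shows the error shrinking roughly in proportion to the step size, which is what one would expect from such a simple method: smaller steps buy accuracy at the cost of more iterations.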

What next?

And yet again, we run out of time before we can get into serious coding. In the next instalment we shall cover the implementation of the LIF model.

Created: 2015-09-16 Wed 18:05


Nerd Food: Neurons for Computer Geeks - Part V: Yet More Theory

2015-09-07T09:13:00.000-07:00

Welcome to part V of a multi-part series on modeling neurons. In part IV we introduced the RC Circuit by making use of the foundations we painstakingly laid in previous posts. In truth, we could now move on to code and start looking at the Leaky Integrate-and-Fire (LIF) model, since we've already covered most required concepts. However, we are going to do just a little bit more theory before we get to that.

The main reason for this detour is that I do not want to give you the impression neurons are easy; if there is one thing that they are not, it is easy. So we're going to resume our morphological and electrical exploits to try to provide a better account of the complexity inside the neuron, hopefully supplying enough context to appreciate the simplifications done in LIF.

The content of this post is heavily inspired by Principles of Computational Modelling in Neuroscience, a book that is a must-read introduction if you decide to become serious about this subject. If so, you may also want to check the Gerstner videos: Neural networks and biological modeling.

But you need not worry, casual reader. Our feet are firmly set in layman's land and we'll remain so until the end of the series.

Brief Context on Modeling

Before we get into the subject matter proper, I'd like us to ponder a few "meta-questions" in terms of modeling.

Why Model?

A layperson may think that we model neurons because we want to build a "computer brain": one that is similar to a real brain, with its amazing ability to learn, and one which at some point may even think and be conscious. Hopefully, after you finish this series of posts, you will appreciate the difficulty of the problem and see that it's not very likely we'll be able to make a "realistic" "computer brain" any time soon - for sensible values of "realistic", "computer brain" and "any time soon".

Whilst we have good models that explain part of the behaviour of the neuron and good models for neural networks too, it is not the case that we can put all of these together to form some kind of "unified neuron model", multiply it by 80 billion, add a few quadrillion synapses and away we go: artificial consciousness. Given what we know at the moment, this approach is far too computationally demanding to be feasible. Things would change if there was a massive leap in computational power, of course, but not if they stay at present projections - even with Moore's Law.

So if we are not just trying to build a computer brain, then why bother? Well, if you set your sights a little lower, computational models are actually amazingly useful:

one can use code to explore a small portion of the problem domain, making and validating predictions using computer models, and then test those predictions in the lab with real wetware. The iterative process is orders of magnitude faster.
computer models are now becoming quite sophisticated, so in some cases they are good representations of biological processes. This tends to be the case for small things such as individual cells or smaller. As computers get faster and faster according to Moore's Law, the power and scope of these models grows too.
distributing work with Free and Open Source Software licences means it is much easier for researchers to reproduce each other's work, as well as for them to explore avenues not taken by those who did the work originally, speeding things up considerably. Standing on the shoulders of giants and all that.

What Tools Do We Model With?

The focus of these posts is on writing models from scratch, but that's not how most research is conducted. In the real world, people try their best to reuse existing infrastructure - of which there is plenty. For example there is NEURON, PyNN, Brian and much more. Tools and processes have evolved around these ecosystems, and there is a push to try to standardise around the more successful frameworks.

There is also a push to find some kind of standard "language" to describe models so that we can all share information freely without having to learn the particulars of each other's representations. The world is not quite there yet, but initiatives such as NeuroML are making inroads in this direction.

However, the purpose of this series is simplification, so we will swerve around all of this. Perhaps material for another series.

At What Level Should One Model?

A related question to the previous ones - and one that is not normally raised in traditional software engineering, but is very relevant in biology - is the level of detail at which one should model.

Software Engineers tend to believe there is a model for a problem, and once you understand enough about the problem domain you will come up with it and all will be light. Agile and sprints are just a way to converge to it, to the perfection that exists somewhere in the platonic cloud. Eric Evans with DDD started to challenge that assumption somewhat by making us reflect on just what it is that we mean by "model" and "modeling", but, in general, we have such an ingrained belief in this idea that it is very hard to shake it off or to even realise the belief is there in the first place. Most of us still think of the code representation of the domain model as the model - rather than accept it is one of a multitude of possible representations, each suitable for a given purpose.

Alas, all of this becomes incredibly obvious when you are faced with a problem like modeling a neuron or a network of neurons. Here, there is just no such thing as the "right model"; only a set of models at different perspectives, each with a different set of trade-offs, and each of them only makes sense in the context of what one is trying to study. It may make sense to model neurons like networks, ignoring the finer details of each one and looking at their behaviour as a group, or it may make sense to model individual bits of the neuron as an entity. What makes it "right" or "wrong" is what it is that we are using the model for and how much computational power one has at one's disposal.

Having said all of that, let's resume our morphology adventures.

Electricity and Neurons

We started off with an overview of the neuron and then moved over to lots and lots of electricity; now it's time to see how those two fit together.

As we explained in part I, there is an electric potential difference between the inside of the cell and the outside, called the membrane potential. The convention to compute this potential is to subtract the potential outside the cell from the potential inside the cell; current is positive when there is a flow of positive charge from the inside to the outside and negative otherwise. Taking these definitions into account, one should be able to make sense of the resting membrane potential: it is around -65 mV. But how does this potential change?

Ion Channels

Earlier, we spoke about ions - atoms that either lost or gained electrons and so are positively or negatively charged. We also said that, in general, the cell's membrane is impermeable, but there are tiny gaps in the membrane which allow things in and out of the cell. Now we can expand a bit further. Ion channels are one such gap, and they have that name because they let ions through. There are many kinds of ion channels. One way of naming them is to use the ion they are most permeable to - but of course, this being biology, the ion channels don't necessarily always have a major ion they are permeable to.

Another useful categorisation distinguishes between passive and active ion channels. Active channels are those that change their permeability depending on external factors such as the membrane potential, the concentration of certain ions, and so on. For certain values they are open - i.e. permeable - whereas for other values they are closed, not allowing any ions through. Passive channels are simpler: they just have a fixed permeability.

There are also ionic pumps. These are called pumps because they take one kind of ion out, exchanging it for another kind. For instance, the sodium-potassium pump pushes potassium into the cell and expels sodium out. A pump has a stoichiometry, which is a fancy word to describe the ratio of ions being pumped in and out.

Complexity Starts To Emerge

As you can imagine, the key to understanding electric behaviour is understanding how these pesky ions move around. Very simplistically, ions tend to move for two reasons: because there is a potential difference between the inside and the outside of the cell, or because of the concentration gradient of said ion. The concentration gradient just means that, left to its own devices, the concentration of an ion becomes uniform over time. For example, if you drop some ink in a glass of water, you will start by seeing the ink quite clearly; given enough time, the ink will diffuse in the water, making it all uniformly coloured. The same principle applies to ions - they want to be uniformly concentrated.

It should be fairly straightforward to work out that a phenomenal number of permutations is possible here. Not only do we have a great number of channels, all with different properties - some switching on and off as properties change around the cell - but we also have the natural flow of ions being affected by the membrane's potential and the concentration gradient, all of which are changing over time. To make matters worse, factors interact with each other such that even if you have simple models to explain each aspect individually, the overall behaviour is still incredibly complex.

Now imagine more than 50 thousand such ion channels - of over one hundred (known) types - in just a single neuron and you are starting to get an idea of the magnitude of the task.

Equivalent Circuit for a Patch of Membrane

But let's return to simplicity. Some very clever people determined that it is possible to model the behaviour of ions and their electric effects by thinking of the membrane as an electric circuit. Taking a patch of membrane as an example, it can be visualised as an electric circuit like so:

Figure 1: Source: Wikipedia, Membrane Potential

What this diagram tells us is that the membrane itself acts as a capacitor, with its capacitance determined by the properties of the lipid bilayer. We didn't really discuss the lipid bilayer before so perhaps a short introduction is in order. The membrane is made up of two sheets of lipids (think fatty acids), which when layered so, have interesting properties: the outside of the sheets are impermeable to most things such as water molecules and ions. The membrane itself is pretty thin, at around 5nm.

The membrane capacitance is considered constant. We then have a series of ion channels: sodium, potassium, chloride, calcium. Each of these can be thought of as a pairing of a resistor with variable conductance coupled with a battery. Note that the resistor and the battery are in series, but the ion channels themselves form a parallel circuit. The voltages for each pathway are determined by the different concentrations of the ion inside and outside the cell.
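The standard way of turning "different concentrations inside and outside" into a battery voltage is the Nernst equation, which this post does not otherwise cover; the sketch below is included only to make the idea tangible. The function name is made up, and the concentrations mentioned afterwards are round illustrative numbers, not measurements.

```cpp
#include <cassert>
#include <cmath>

// Nernst (equilibrium) potential in volts for a single ion species:
//   E = (R T / (z F)) * ln(outside / inside)
// `valence` is the ion's charge z; the two concentrations can be in any
// unit as long as it is the same one. The function name is hypothetical.
double nernst_potential(int valence, double outside, double inside,
                        double kelvin) {
    assert(valence != 0 && outside > 0.0 && inside > 0.0 && kelvin > 0.0);
    const double gas_constant = 8.314462618;  // J / (mol K)
    const double faraday = 96485.33212;       // C / mol
    return gas_constant * kelvin / (valence * faraday)
           * std::log(outside / inside);
}
```

For an ion with a single positive charge that is much more concentrated inside the cell than outside (say 140 versus 5, at 310 K), the result is negative, at roughly -89 mV; flip the concentrations around and the sign flips too. Each ion pathway in the circuit gets its own battery of this kind.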

If we further assume fixed ion concentrations and passive ion channels, we can perform an additional simplification on the circuit above and we finally end up with an RC Circuit:

Figure 2: Source: Wikipedia, Membrane Potential

The circuit now has one resistance, which we call the membrane resistance, and a membrane battery.

What next?

Hopefully you can start to see both the complexity around modeling neurons and the necessity to create simpler models to make them computationally feasible - just look at the amount of simplification that was required for us to get to an RC Circuit!

But at least we can now look forward to implementing LIF.

Created: 2015-09-07 Mon 17:12


Nerd Food: Neurons for Computer Geeks - Part IV: More Electricity

2015-09-05T10:57:00.000-07:00

Part I of this series looked at a neuron from above; Part II attempted to give us the fundamental building blocks in electricity required to get us on the road to modeling neurons. We did a quick interlude with a bit of coding in part III but now, sadly, we must return to boring theory once more.

Now that we grok the basics of electricity, we need to turn our attention to the RC circuit. As we shall see, this circuit is of particular interest when modeling neurons. The RC circuit is so called because it is a circuit, and it is composed of a Resistor and a Capacitor. We've already got some vague understanding of circuits and resistors, so let's start by having a look at this new crazy critter, the capacitor.

Capacitors

Just as the battery is a source of current, one can think of the capacitor as a temporary store of charge. If you plug a capacitor into a circuit with just a battery, it will start to "accumulate" charge over time, up to a "maximum limit". But how exactly does this process work?

In simple terms, the capacitor is made up of two metal plates, one of which will connect to the positive end of the battery and another which connects to the negative end. At the positive end, the metal plate will start to lose negative charges because these are attracted to the positive end of the battery. This will make this metal plate positively charged. Similarly, at the negative end, the plate will start to accumulate negative charges. This happens because the electrons are repelled by the negative end of the battery. Whilst this process is taking place, the capacitor is charging.

At some point, the process reaches a kind of equilibrium, whereby the electrons in the positively charged plate are attracted equally to the plate as they are to the positive end of the battery, and thus stop flowing. At this point we say the capacitor is charged. It is interesting to note that both plates of the capacitor end up with the same "total" charge but different signs (i.e. -q and +q).

Capacitance

We mentioned a "maximum limit". A few things control this limit: how big the plates are, how much space there is between them and the kind of material we place between them, if any. The bigger the plates and the closer they are - without touching - the more you can store in the capacitor. The material used for the plates is, of course, of great importance too - it must be some kind of metal good at conducting.

In a more technical language, this notion of a limit is captured by the concept of capacitance, and is given by the following formula:

\begin{align} C = \frac{q}{V} \end{align}

Let's break it down into its components to see what the formula is trying to tell us. The role of V is to inform us about the potential difference between the two plates. This much is easy to grasp; since one plate is positively charged and the other negatively charged, it is straightforward to imagine that a charge will have a different electric potential in each plate, and thus that there will be an electric potential difference between them. q tells us about the magnitude of the charges that we placed on the plates - i.e. ignoring the sign. It wouldn't be too great a leap to conceive that plates with a larger surface area would probably have more "space" for charges and so a larger q - and vice-versa.

Capacitance is then the ratio between these two things; a measure of how much electric charge one can store for a given potential difference. It may not be very obvious from this formula, but capacitance is constant. That is to say, a given capacitor has a capacitance, influenced by the properties described above. This formula does not describe the discharging or charging process - but of course, capacitance is used in the formulas that describe those.

Capacitance is measured in SI units of Farads, denoted by the letter F. A farad is 1 coulomb over 1 volt:

\begin{align} 1\,\mathrm{F} = \frac{1\,\mathrm{C}}{1\,\mathrm{V}} \end{align}
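A quick worked example, with made-up but plausible numbers: a 100 μF capacitor connected to a 1.5V battery ends up holding

\begin{align} q = CV = 100 \times 10^{-6}\,\mathrm{F} \times 1.5\,\mathrm{V} = 1.5 \times 10^{-4}\,\mathrm{C} \end{align}

that is, 150 μC on each plate (with opposite signs). Doubling the voltage to 3V would double the charge to 300 μC; the ratio q / V, the capacitance, stays put at 100 μF.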

Capacitors and Current

After charging a capacitor, one may be tempted to discharge it. For that one could construct a simple circuit with just the capacitor. Once the circuit is closed, the negative charges will start to flow to the positively charged plate, at full speed - minus the resistance of the material. Soon enough both plates would be made neutral. At first glance, this may appear to be very similar to our previous circuit with a battery. However, there is one crucial difference: the battery circuit had a constant voltage and a constant current (for a theoretical battery) whereas a circuit with a discharging capacitor has voltage and current that decay over time. By "decaying", all we really mean is that we start at some arbitrarily high value and we move towards zero over a period of time. This makes intuitive sense: you cannot discharge the capacitor forever; and, as you discharge it, the voltage starts to decrease - for there are fewer charges in the plates and so less potential difference - and similarly, so does the current - for there is less "pressure" to make the charges flow.

This intuition is formally captured by the following equation:

\begin{align} I(t) = C \frac{dV(t)}{dt} \end{align}

I'm rather afraid that, at this juncture, we have no choice but to introduce Calculus. A proper explanation of Calculus is a tad outside the remit of these posts, so instead we will have to make do with some common-sense but extremely hand-waved interpretations of the ideas behind it. If you are interested in a light-hearted but still comprehensive treatment of the subject, perhaps A Gentle Introduction To Learning Calculus may be to your liking.

Let's start by taking a slightly different representation of the formula above and then compare these two formulas.

\begin{align} i = C \frac{dv}{dt} \end{align}

In the first case we are talking about the current I, which normally is some kind of average current over some unspecified period. Up to now, time didn't really matter - so we got away with just talking about I in these general terms. This was the case with Ohm's Law in part II. However, as we've seen, it is not so with capacitors - so we need to make the current specific to a point in time. For that we supply an "argument" to I - I(t); here, a mathematician would say that I is a function of time. In the second case, we make use of i, which is the instantaneous current through the capacitor. The idea is that, somehow, we are able to know - for any point in time - what the instantaneous current is.

How we achieve that is via the magic of Calculus. The expression dv/dt in the second formula provides us with the instantaneous rate of change of the voltage over time. The same notion can be applied to V, as per first formula.

These formulas may sound awfully complicated, but what they are trying to tell us is that the capacitor's current has the following properties:

it varies as a "function" of time; that is to say, different time points have different currents. Well, that's pretty consistent with our simplistic notion of a decaying current.
it is "scaled" by the capacitor's capacitance C; "bigger" capacitors can hold on to higher currents for longer when compared to "smaller" capacitors.
the change in electric potential difference varies as a function of time. This is subtle but also makes sense: we imagined some kind of decay for our voltage, but there was nothing to say the decay would remain constant until we reached zero. This formula tells us it does not; voltage may decrease faster or slower at different points in time.
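If Calculus feels too magical, the same idea can be approximated with plain subtraction. The sketch below assumes we have measured the capacitor's voltage at regular intervals and uses the difference between neighbouring samples as a stand-in for dv/dt; the function name and the sample values mentioned afterwards are invented for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Approximates the capacitor current i = C * dv/dt from voltage samples
// taken every `dt` seconds, using the difference between neighbouring
// samples in place of the derivative. Returns one current per interval.
// The function name is a hypothetical one chosen for this example.
std::vector<double> capacitor_current(const std::vector<double>& volts,
                                      double capacitance, double dt) {
    assert(dt > 0.0);
    std::vector<double> amps;
    for (std::size_t n = 1; n < volts.size(); ++n) {
        const double dv = volts[n] - volts[n - 1];  // change in voltage
        amps.push_back(capacitance * dv / dt);      // scaled by C
    }
    return amps;
}
```

Feeding it a decaying voltage such as 8, 4, 2, 1 (one sample per second, with a capacitance of 0.5) produces currents of -2, -1 and -0.5: the sign is negative because the voltage is falling, and the magnitude shrinks as the decay slows down, which matches all three properties listed above.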

Circuits: Parallel and Series

The RC circuit can appear in a parallel or series form, so it's a good time to introduce these concepts. One way we can connect circuits is in series; that is, all components are connected along a single path, such that the current flows through all of them, one after the other. If any component fails, the flow will cease.

This is best understood by way of example. Let's imagine the canonical example of a battery - our old friend the 1.5V AA battery - and three small light bulbs. A circuit that connects them in series would be made up of a cable segment plugged onto one of the battery's terminals - say +, then connected to the first light bulb. A second cable segment would then connect this light bulb to another light bulb, followed by another segment and another light bulb. Finally, a cable segment would connect the last light bulb to the other battery terminal - say -. Graphically - and pardoning my inability to use Dia to create circuit diagrams - it would look more or less like this:

Figure 1: Series circuit. Source: Author

This circuit has a few interesting properties. First, if any of the light bulbs fail, all of them will stop working because the circuit is no longer closed. Second, if one were to add more and more light bulbs, the brightness of each light bulb will start to decrease. This is because each light bulb is in effect a resistor - the light shining being a byproduct of said resistance - and so they are each decreasing the current. So it is that in a series circuit the total resistance is given by the sum of all individual resistances, and the current is the same for all elements.

Parallel circuits are a bit different. The idea is that two or more components are connected to the circuit in parallel, i.e. there are two or more paths along which the current can flow at the same time. So we'd have to modify our example to have a path to each of the light bulbs which exists in parallel to the main path - quite literally a segment of cable that connects the other segments of cable, more or less like so:

Figure 2: Parallel circuit. Source: Author

Here you can see that if a bulb fails, there is still a closed loop in which current can flow, so the other bulbs should be unaffected. This also means that the voltage is the same for all components in the circuit. Current and resistance are now "relative" to each component, and it is possible to compute the overall current for the circuit via Kirchhoff's Current Law. Simplifying it, it means that the current for the circuit is the sum of all currents flowing through each component.
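The difference between the two arrangements is easy to quantify. The sketch below treats each bulb as a plain resistor and the battery as ideal; the rule for parallel resistances (reciprocals add up) is not derived in this post but follows from the current and voltage properties just described. Function names and resistance values are assumptions made for the example.

```cpp
#include <cassert>
#include <vector>

// Total resistance of resistors connected one after the other (series).
double series_resistance(const std::vector<double>& ohms) {
    double total = 0.0;
    for (double r : ohms) total += r;
    return total;
}

// Total resistance of resistors connected side by side (parallel):
// the reciprocals add up, so each extra path lowers the total.
double parallel_resistance(const std::vector<double>& ohms) {
    assert(!ohms.empty());
    double reciprocal_sum = 0.0;
    for (double r : ohms) {
        assert(r > 0.0);
        reciprocal_sum += 1.0 / r;
    }
    return 1.0 / reciprocal_sum;
}

// Ohm's Law: current drawn from an ideal battery by a given resistance.
double current(double volts, double ohms) { return volts / ohms; }
```

Taking three bulbs of 3 ohms each and our 1.5V battery: in series the total resistance is 9 ohms and a single current of about 0.17A flows through every bulb. In parallel the total resistance drops to 1 ohm, each bulb sees the full 1.5V and draws 0.5A, and the battery supplies the sum of the three, 1.5A.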

This will become significant later on when we finally return to the world of neurons.

The RC Circuit

With all of this we can now move to the RC circuit. In its simplest form, the circuit has a source of current with a resistor and a capacitor:

Figure 3: Source: Wikipedia, RC circuit

Let's try to understand how the capacitor's voltage will behave over time. This circuit is rather similar to the one we analysed when discussing capacitance, with the exception that we now have a resistor as well. But in order to understand this, we must return to Kirchhoff's current law, which we hand-waved a few paragraphs ago. Wikipedia tells us that:

The algebraic sum of currents in a network of conductors meeting at a point is zero.

One way to understand this statement is to think that the total quantity of current entering a junction point must be identical to the total quantity leaving that junction point. If we consider entering to be positive and leaving to be negative, that means that adding the two together must yield zero.
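The sign convention can be sketched in a few lines of C++. This is only an illustration: the function names and the tolerance are assumptions of mine, with the tolerance there purely to cope with floating point rounding.

```cpp
#include <vector>

// Sum of the signed currents meeting at a junction, in amperes.
// Currents entering the junction are positive, currents leaving are negative.
double junction_sum(const std::vector<double>& amps) {
  double sum = 0.0;
  for (const double i : amps)
    sum += i;
  return sum;
}

// True when the currents balance out, as Kirchhoff's Current Law demands.
bool satisfies_kcl(const std::vector<double>& amps, double tolerance = 1e-9) {
  const double s = junction_sum(amps);
  return s > -tolerance && s < tolerance;
}
```

So a junction with 0.3A coming in and two branches taking away 0.1A and 0.2A is balanced, whereas 0.3A in and only 0.1A out is not.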

Because of Kirchhoff's law, we can state that, for the positive terminal of the capacitor:

\begin{align} i_c(t) + i_r(t) = 0 \end{align}

That is: at any particular point in time t, the current flowing through the capacitor added to the current flowing through the resistor must sum to zero. However, we can now make use of the previous formulas; after all, our section on capacitance taught us that:

\begin{align} i_c(t) = C \frac{dv(t)}{dt} \end{align}

And making use of Ohm's Law we can also say that:

\begin{align} i_r(t) = \frac{v(t)}{R} \end{align}

So we can expand the original formula to:

\begin{align} C \frac{dv(t)}{dt} + \frac{v(t)}{R} = 0 \end{align}

Or:

\begin{align} C \frac{dV}{dt} + \frac{V}{R} = 0 \end{align}

I'm not actually going to follow the remaining steps to compute V, but you can see them here and they are fairly straightforward, or at least as straightforward as calculus gets. The key point is, when you solve the differential equation for V, you get:

\begin{align} V(t) = V_0e^{-\frac{t}{RC}} \end{align}

With V0 being voltage when time is zero. This is called the circuit's natural response. This equation is very important. Note that we are now able to describe the behaviour of voltage over time with just a few inputs: the starting voltage, the time, the resistance and the capacitance.
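As a sanity check, one can code up the natural response and compare it against a crude numerical integration of the differential equation itself. The sketch below is just that, a sketch: the function names are invented, and forward Euler is my choice of integrator for its simplicity rather than anything this series commits to.

```cpp
#include <cmath>

// Closed-form natural response: V(t) = V0 * exp(-t / (R * C)).
double natural_response(double v0, double r, double c, double t) {
  return v0 * std::exp(-t / (r * c));
}

// Forward Euler integration of C dV/dt + V/R = 0, rearranged as
// dV/dt = -V / (R * C), from time zero up to t in the given number of steps.
double euler_response(double v0, double r, double c, double t, int steps) {
  const double dt = t / steps;
  double v = v0;
  for (int i = 0; i < steps; ++i)
    v += dt * (-v / (r * c));
  return v;
}
```

For a hypothetical 1.5V starting voltage, a 1000 ohm resistor and a 0.001 farad capacitor, both functions land very close to 0.552V after one second, provided the Euler version is given enough steps.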

A second thing falls out of this equation: the RC time constant, or τ. It is given by:

\begin{align} \tau = RC \end{align}

The Time Constant is described in a very useful way in this page, so I'll just quote them and their chart here:

The time required to charge a capacitor to 63 percent (actually 63.2 percent) of full charge or to discharge it to 37 percent (actually 36.8 percent) of its initial voltage is known as the TIME CONSTANT (TC) of the circuit.

Figure 4: The RC Time constant. Source: Concepts of alternating current
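To see where the 36.8 and 63.2 percent figures come from, one can plug t = τ into the natural response. The charging formula used in the second line is the standard textbook counterpart for a capacitor charged through a resistor from a source voltage V_s; we did not derive it here, so treat it as an assumption:

\begin{align} V(\tau) = V_0e^{-\frac{RC}{RC}} = V_0e^{-1} \approx 0.368 V_0 \end{align}

\begin{align} V_{charge}(\tau) = V_s\left(1 - e^{-1}\right) \approx 0.632 V_s \end{align}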

What next?

Now that we understand the basic behaviour of the RC Circuit, together with a vague understanding of the maths that describe it, we need to return to the neuron's morphology. Stay tuned.

Created: 2015-09-05 Sat 18:56

Emacs 24.5.1 (Org mode 8.2.10)


Nerd Food: Neurons for Computer Geeks - Part III: Coding Interlude

2015-09-04T07:56:00.000-07:00


If you are anything like me, the first two parts of this series have already bored you silly with theory (Part I, Part II) and you are now hankering for some code - any code - to take away the pain. So part III is here to do exactly that. However, let me preface that grandiose statement by saying this is not the best code you will ever see. Rather, it's just a quick hack to introduce a few of the technologies we will make use of for the remainder of this series, namely:

CMake and Ninja: this is how we will build our code.
Wt: provides a quick way to knock-up a web frontend for C++ code.
Boost: in particular Boost Units and later on Boost OdeInt. Provides us with the foundations for our numeric work.

What I mean by a "quick hack" is: there is no validation, no unit tests, no "sound architecture" and none of the things you'd expect from production code. But it should serve as an introduction to modeling in C++.

All the code is available on GitHub under neurite. Let's have a quick look at the project structure.

CMake

We just took a slimmed down version of the Dogen build system to build this code. We could have gotten away with a much simpler CMake setup, but I intend to use it for the remainder of this series so that's why it's a bit more complex than you'd expect. It is made up of the following files:

Top-level CMakeLists.txt: ensures all of the dependencies can be found and configured for building, sets up the version number and debug/release builds.
build/cmake: any Find* scripts that are not supplied with the CMake distribution. We Googled for these and copied them here.
projects/CMakeLists.txt: sets up all of the compiler and linker flags we need to build the project. Uses pretty aggressive flags such as -Wall and -Werror.
projects/ohms_law/src/CMakeLists.txt: our actual project, the bit that matters for this article.

`ohms_law` Project

The project is made up of two classes, in files calculator.[hc]pp and view.[hc]pp. The names are fairly arbitrary but they try to separate View from Model: the user interface is in view and the "number crunching" is in calculator.

The View

Let's have a quick look at view. In the header file we simply define a Wt application with a few widgets:

class view : public Wt::WApplication {
public:
  view(const Wt::WEnvironment& env);

private:
  Wt::WLineEdit* current_;
  Wt::WLineEdit* resistance_;
  Wt::WText* result_;
};

It is implemented in an equally trivial manner. We just set up the widgets and hook them together. Finally, we create a trivial event handler that performs the "computations" when the button is clicked.

view::view(const Wt::WEnvironment& env) : Wt::WApplication(env) {
  setTitle("Ohm's Law Calculator");

  root()->addWidget(new Wt::WText("Current: "));
  current_ = new Wt::WLineEdit(root());
  current_->setValidator(new Wt::WDoubleValidator());
  current_->setFocus();

  root()->addWidget(new Wt::WText("Resistance: "));
  resistance_ = new Wt::WLineEdit(root());
  resistance_->setValidator(new Wt::WDoubleValidator());

  Wt::WPushButton* button = new Wt::WPushButton("Calculate!", root());
  button->setMargin(5, Wt::Left);
  root()->addWidget(new Wt::WBreak());
  result_ = new Wt::WText(root());

  button->clicked().connect([&](Wt::WMouseEvent&) {
      const auto current(boost::lexical_cast<double>(current_->text()));
      const auto resistance(boost::lexical_cast<double>(resistance_->text()));

      calculator c;
      const auto voltage(c.voltage(resistance, current));
      const auto s(boost::lexical_cast<std::string>(voltage));
      result_->setText("Voltage: " + s);
    });
}

The Model

The model is as simple as the view. It is made up of a single class, calculator, whose job is to compute the voltage using Ohm's Law. It does this by making use of Boost Units. This is obviously not necessary, but we wanted to take the opportunity to explore this library as part of this series of articles.

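#include <boost/units/quantity.hpp>
#include <boost/units/systems/si/current.hpp>
#include <boost/units/systems/si/resistance.hpp>

double calculator::
voltage(const double resistance, const double current) const {
  // Attach SI units to the raw inputs so Boost Units can check dimensions.
  boost::units::quantity<boost::units::si::resistance>
    R(resistance * boost::units::si::ohms);
  boost::units::quantity<boost::units::si::current>
    I(current * boost::units::si::amperes);
  // Ohm's Law: the product of resistance and current is a voltage.
  auto V(R * I);
  // Strip the units again and hand the raw number back to the caller.
  return V.value();
}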

Compiling and Running

If you are on a Debian-based distribution, you can do the following steps to get the code up and running. First install the dependencies:

$ sudo apt-get install libboost-all-dev witty-dev ninja-build cmake clang-3.5

Then obtain the source code from GitHub:

$ git clone https://github.com/mcraveiro/neurite.git

Now you can build it:

$ cd neurite
$ mkdir output
$ cd output
$ cmake ../ -G Ninja
$ ninja -j5

If all went according to plan, you should be able to run it:

$ stage/bin/neurite_ohms_law --docroot . --http-address 0.0.0.0 --http-port 8080

Now using a web browser such as Chrome, connect to http://127.0.0.1:8080 and you should see a "shiny" Ohm's Law calculator! Sorry, just had to be done to take away the boredom a little bit. Let's proceed with the more serious matters at hand, with the promise that the real code will come later on.

Created: 2015-09-04 Fri 17:16


Nerd Food: Neurons for Computer Geeks - Part II: The Shocking Complexity of Electricity

2015-08-31T11:38:00.000-07:00


In part I we started to describe the basic morphology of the neuron. In order to continue, we now need to take a detour around the world of electricity. If you are an electricity nerd, I apologise in advance; this is what happens when a computer scientist escapes into your realm, I'm afraid.

"Honor the charge they made!"

First and foremost, we need to understand the concept of charge. It is almost a tautology that atoms are made up of "sub-atomic" particles. These are the proton, the neutron and the electron. The neutron is not particularly interesting right now; however the electron and the proton are, and all because they have a magical property called charge. For our purposes, it suffices to know that "charge" means that certain sub-atomic particles attract or repel each other, according to a well-defined set of rules.

You can think of a charge as a property attached to the sub-atomic particle, very much like a person has a weight or height, but with a side-effect; it is as if this property makes people push or hug each other when they are in close proximity, and they do so with the same strength when at the same distance. This "strength" is the electric force. How they decide whether to hug or push the next guy is based on the "sign" of the charge - that is, positive or negative - with respect to their own charge "sign". Positives push positives away but hug negatives and vice-versa.

For whatever historical reasons, very clever people decided that an electron has one negative unit of charge and a proton has one positive unit of charge. The sign is, of course, rather arbitrary. We could have just as well said that protons are red and electrons are blue or some other suitably binary-like convention to represent these alternatives. Just because protons and electrons have the same amount of charge, it does not follow that they are similar in other respects. In fact, they are very different creatures. For example, the electron is very "small" when compared to the proton - almost 2000 times "smaller". The relevance of this "size" difference will become apparent later on. Physicists call this "size" mass, by the by.

As it happens, all of these sub-atomic crazy critters are rather minute entities. So small in fact that it would be really cumbersome if we had to talk about charges in terms of the charge of an electron; the numbers would just be too big and unwieldy. So, the very clever people came up with a sensible way to bundle up the charges of the sub-atomic particles in bigger numbers, much like we don't talk about millimetres when measuring the distance to the Moon. However, unlike the nice and logical metric system, with its neat use of the decimal system, physicists came up instead with the Coulomb, or C, one definition of which is:

1 Coulomb (1C) = 6.241 x 10¹⁸ protons
-1 Coulomb (-1C) = 6.241 x 10¹⁸ electrons

This may sound like a very odd choice - hey, why not just 1 x 10²⁰ or some other "round" number? - but just like a kilobyte is 1024 bytes rather than 1000, this wasn't done by accident either. In fact, all related SI units were carefully designed to work together and make calculations as easy as possible.

Anyway, whenever you see q or Q in formulas it normally refers to a charge in Coulombs.
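For the coders in the audience, the bundling can be sketched as a simple conversion. The snippet below is illustrative only: the names are mine, and it uses the rounded figure of 6.241 x 10¹⁸ charges per Coulomb quoted above rather than a more precise constant.

```cpp
// Number of elementary charges bundled into one Coulomb (rounded figure).
const double charges_per_coulomb = 6.241e18;

// Net charge in Coulombs of a lump of protons and electrons: protons count
// as positive units of charge, electrons as negative ones.
double charge_in_coulombs(double protons, double electrons) {
  return (protons - electrons) / charges_per_coulomb;
}

// How many elementary charges make up a charge of the given size.
double charges_from_coulombs(double coulombs) {
  return coulombs * charges_per_coulomb;
}
```

So a lump of 6.241 x 10¹⁸ protons comes out as 1C, the same number of electrons as -1C, and an equal mix of the two as no net charge at all.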

Units, Dimensions, Measures, Oh My!

Since we are on the subject of SI, this is probably a good point to talk about units, dimensions, measurements, magnitudes, conversions and other such exciting topics. Unfortunately, these are important to understand how it all hangs together.

A number such as 1A makes use of the SI unit of measure "Ampere" and it exists in a dimension: the dimension of all units which can talk about electric current. This is very much in the same way we can talk about time in seconds or minutes - we are describing points in the time dimension, but using different units of measure - or just units, because we're lazy. A measurement is the recording of a quantity with a unit in a dimension. Of course, it would be too simple to call it a "quantity", so instead physicists, mathematicians and the like call it magnitude. But for the lay person, it's not too bad an approximation to replace "magnitude" with "quantity".

Finally, it is entirely possible to have compound dimensional units; that is, one can have a unit of measure that refers to more than one dimension, such as say "10 kilometres per second".

I won't discuss conversions just now, but you can easily imagine that formulas that contain multiple units may provide ways to convert from one unit to another. This will become relevant later on.

Go With the Flow

Now we have a way of talking about charge, and now we know these things can move - since they attract and repel each other - the next logical thing is to start to imagine current. The name sounds magical, but in reality it is akin to a current in a river: you are just trying to figure out how much water is coming past you every second (or in some other suitable unit in the time dimension). The exact same exercise could be repeated for the number of cars going past in a motorway or the number of runners across some imaginary point in a track. For our electric purposes, current tells you how many charges have zipped past over a period of time.

In terms of SI units, current is measured in Amperes, which have the symbol A; an Ampere tells us how many Coulombs have flowed past in a second. Whenever you see I in formulas it normally refers to current.

Now let's see how these two things - Coulombs and Amperes - could work together. Let's imagine an arbitrary "pipe" between two imaginary locations, one side of which with a pile of positive charges and, on the other side, a pile of negative charges - both measured in Coulombs, naturally. In this extraordinarily simplified and non-existing world, the negative charges would "flow" down the pipe, attracted by the positive charges. Because the positive charges are so huge they won't budge, but the negative charges - the lighter electrons - would zip across to meet them. The number of charges you see going past in a time tick is the current.
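In code, counting charges going past in a time tick boils down to a division. A minimal sketch follows, with invented function names and using the rounded 6.241 x 10¹⁸ charges per Coulomb figure from earlier on:

```cpp
// Current in Amperes: Coulombs of charge passing a point, per second.
double current_amperes(double coulombs, double seconds) {
  return coulombs / seconds;
}

// Electrons passing the point each second for a given current, using the
// rounded figure of 6.241e18 elementary charges per Coulomb.
double electrons_per_second(double amperes) {
  const double electrons_per_coulomb = 6.241e18;
  return amperes * electrons_per_coulomb;
}
```

If 3C of charge zip past in 2 seconds, that is a current of 1.5A, or roughly 9.36 x 10¹⁸ electrons every second.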

Resist!

Going back to our example of current in a river, one can imagine that some surfaces are better at allowing water to flow than others; for example, a river out in the open is a lot less "efficient" at flowing than say a plastic pipe designed for that purpose. One reason is that the river has to deal with twists and turns as it finds a path over the landscape whereas the pipe could be laid out as straight as possible; but it is also that the rocks and other elements of the landscape slow down water, whereas a nice flat pipe would have no such impediments. If one were to take these two extremes - a plastic pipe designed for maximum water flow versus a landscape - one could see that they affect flow differently; and one could be tempted to name the property of "slowing down the flow" resistance, because it describes how much "resistance" these things are offering to the water. If you put up a barrier to avoid flooding, you probably would want it to "resist" water quite a lot rather than allow it to flow; and you can easily imagine that sand and sandbags "resist" water in very different ways.

Resistance is a fundamental concept in the electrical world. The gist of it is similar to the contrived examples above, in that not all materials behave the same way with regards to allowing charges to flow. Some allow them to flow freely nearly at maximum speed whereas others do not allow them to flow at all.

Since we are dealing with physics, it is of course possible to measure resistance. We do so in SI units of Ohms, denoted by the Greek letter upper-case Ω.

As we shall see, not all materials are nicely behaved when it comes to resistance.

You've Got Potential Baby!

Let's return to our non-existing "pipe that allows charges to flow" scenario, and take it one step further. Imagine that for whatever reason our pipe becomes clogged up with a blockage somewhere in the middle. Nothing could actually flow due to this blockage so our current drops to zero.

According to the highly simplified rules that we have learned thus far, we do know that - were there to be no blockage - there would be movement (current). That is, the setup of the two bundles in space is such that, given the right conditions, we would start to see things flowing. But, alas, we do not have the right conditions because the pipe is blocked; hence no flow. You could say this setup has "the potential" to get some flow going, if only we could fix the blockage.

In the world of electricity, this idea is captured by a few related concepts. If we highly simplify them, they amount to this:

electric potential: the idea that depending on where you place a charge in space, it may have different "potential" to generate energy. We'll define energy a bit better later on, but for now a layman's idea of it suffices. By way of an example: if you place a positive charge next to a lump of positive charges and let it go, it will move a certain distance away from the lump. Before you let the charge go, you know the charge has potential to move away. You can also see that the charge will move by different amounts depending on how close you place it to the lump; the closer you place it, the more it will move. When we are thinking of electric potential, we think of just one charge.
electric potential energy: clearly it would be possible to move two or three charges too, as we did for the one; and clearly they should produce more energy than a single charge. So one simple way of understanding electric potential energy is to think of it as the case of electric potential that deals with the total number of charges we're interested in, rather than just one.

Another way of imagining these two concepts is to think that electric potential is a good way to measure things when you don't particularly care about the number of charges involved; it is as if you scaled everything to just one unit of charge. Electric potential energy is more for when you are thinking of a system with an actual number of charges. But both concepts deal with the notion that placing a charge at different points in space may have an impact on the energy you can get out of it.

Having said all of that we can now start to think about electric potential difference. It uses the same approach as electric potential, in that everything is scaled to just one unit of charge, but as the name implies, it provides a measurement of the difference between the electric potential of two points. Electric potential difference is more commonly known as voltage. Interestingly, it is also known as electric pressure, and this may be the most meaningful of its names; this is because when there is an electric potential difference, it applies "pressure" on charges which forces them to move.

The SI unit Volt is used to measure electric potential and electric potential difference, amongst other things. Electric potential energy, being an energy, is measured in Joules instead; one Volt is simply one Joule per Coulomb. Sharing a unit between potential and potential difference may sound a bit weird at first, but it is just because one is unfamiliar with these concepts. Take time, for example: we use minutes as a unit of measure of all sorts of things (duration of a football game, time it takes for the moon to go around the earth, etc.). We did not invent a new unit for each phenomenon because we recognised - at some point - that we were dealing with points in the same dimension.
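In symbols, and with the caveat that the letters are my own choice rather than anything introduced above (U for electric potential energy in Joules, q for charge in Coulombs, and a and b for two points in space), the "scaled to one unit of charge" idea looks roughly like this:

\begin{align} V = \frac{U}{q} \end{align}

\begin{align} \Delta V = V_b - V_a \end{align}

As a quick worked example under those assumptions: moving 2C of charge across the 1.5V difference between the terminals of an AA battery involves 2 x 1.5 = 3 Joules of energy.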

Quick Conceptual Mop-Up

Before we move over to the formulas, it may be best to tie up a few loose ends. These are not strictly necessary, but they make the picture a bit more complete and move us to a more realistic model - if still very simplistic.

First, we should start with atoms; we mentioned charges but skipped them. Atoms are (mostly) a stable arrangement of charges, placed in such a way that the atoms themselves are neutral - i.e. contain exactly the same amount of negative and positive charges. We mentioned before that protons and electrons really get along, and neutrons are kind of just there, hanging around. In truth, neutrons and protons also really get along, via the aptly named nuclear force; this is what binds them together in the nucleus of the atom. Electrons are attracted to protons and live their existences in a "cloud" around the nucleus. Note that the nucleus is more than 99% of the mass of the atom, which gives you an idea of just how small electrons are.

The materials we will deal with in our examples are made of atoms, as are, well, quite a few things in the universe. These materials are themselves stable arrangements of atoms, just like atoms are stable arrangements of protons, neutrons and electrons. As you can see in the picture, these look like lattices of some kind.

Figure 1: Microscopic View of Carbon Atoms. Source: Quantum Physics: The Brink of Knowing Something Wonderful

In practice, copper wires are made up of a great many things rather than just atoms of copper. One such "kind of thing" is the unbound electrons - or free-moving electrons; basically electrons that are not trapped in an atom. As we mentioned before, electrons are the ones doing most of the moving. Left to their own devices, electrons in a conducting material will just move around, bumping into atoms in a fairly random way. However, let's say you take one end of a copper wire and plug it to the + side of a regular AA battery and then take the other end and plug it to the - side of the battery. According to all we've just learned, it's easy to imagine what will happen: the electrons stored in the - side will zip across the copper to meet their proton friends at the other end. This elemental construction, with its circular path, is called a circuit. What you've done is to upset the neutral balance of the copper wire and got all the electrons to move in a coordinated way (rather than random) from the - side to the + side.

It is at this juncture that we must introduce the concept of ions. An ion is basically an atom that is no longer neutral - either because it has more protons than electrons (called a cation) or more electrons than protons (called an anion). In either case, this comes about because the atom has gained or lost some electrons. Ions will become of great interest when we return to the neuron.

One final word on resistance and its sister concept of conductance:

Resistance is in effect a byproduct of the way the electrons are arranged in the electron cloud and is related to the ionisation mentioned above; certain arrangements just don't allow electrons to flow across.
Conductance is the inverse of resistance. When you talk about resistance you are focusing on the material's ability to impair movement of charges; when you talk about conductance you are focusing on the material's ability to let charge flow through.

The reason we choose copper or other metals for our examples is because they are good at conducting these pesky electrons.

Ohm's Law

We have now introduced all the main actors required for one of the main parts in the play: Ohm's Law. It can be stated very easily:

V = R x I

And here's a picture to aid intuition.

Figure 2: Source: Could someone intuitively explain to me Ohm's law?

The best way to understand this law is to create a simple circuit.

Figure 3: Simple electrical circuit. Source: Wikipedia, Electrical network

On the left we have a voltage source, which could be our 1.5V AA battery. On the right of the diagram we have a resistor - an electric component that is designed specifically to "control" the flow of the electric current. Without the resistor, we would be limited by how much current the battery can pump out and how much "natural" resistance the copper wire has, which is not a lot since it is very good at conducting. The resistor gives us a way to limit current flow from these theoretical maximum limitations.

Even if you are not particularly mathematically oriented, you can easily see that Ohm's Law gives us a nice way to find any of these three variables, given the other two. That is to say:

R = V / I
I = V / R

These tell us many interesting things such as: for the same resistance, current increases as the voltage increases. For good measure, we can also find out the conductance too:

G = I / V = 1 / R
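The three rearrangements, plus conductance, translate almost verbatim into code. Here is a minimal sketch using plain doubles; the function names are hypothetical, and it assumes an ohmic conductor and non-zero divisors.

```cpp
// Ohm's Law solved for each of its three variables.
// v is in Volts, r in Ohms and i in Amperes.
double ohms_voltage(double r, double i) { return r * i; }
double ohms_resistance(double v, double i) { return v / i; }
double ohms_current(double v, double r) { return v / r; }

// Conductance is the inverse of resistance: G = I / V = 1 / R.
double conductance(double i, double v) { return i / v; }
```

For our 1.5V battery and a hypothetical 3 ohm resistor, ohms_current(1.5, 3.0) gives 0.5A; feeding that current back into ohms_voltage(3.0, 0.5) returns the original 1.5V.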

It is important to notice that not everything obeys Ohm's Law - i.e. behaves in a straight line. The conductors that obey this law are called ohmic conductors. Those that do not are called non-ohmic conductors. There are also things that obey Ohm's Law, for the most part. These are called quasi-ohmic.

What next?

We have already run out of time for this instalment but there are still some more fundamental electrical concepts we need to discuss. The next part will finish these and start to link them back to the neuron.

Created: 2015-08-31 Mon 19:27


Nerd Food: Neurons for Computer Geeks - Part I: A Neuron From Up On High

2015-08-31T09:26:00.000-07:00


As any computer geek would tell you, computer science is great in and of itself and many of us could live long and contented lives inside that box. But things certainly tend to become interesting when there is a whole problem domain to model, and doubly so when that domain is outside of our comfort zone. As it happens, I have managed to step outside said zone - rather, quite far outside - so it seemed like a good idea to chronicle these adventures here.

The journey we are about to embark on starts with a deceptively simple mission: to understand how one can use computers to model neurons. The intended audience of these posts is anyone who loves coding but has no idea about electricity, circuits, cells and so on - basically someone very much like me. We shall try to explain, at least to a degree, all of the required core concepts in order to start coding. As it turns out, there are quite a few.

But hey, as they say, "If you can't explain something to a six year-old, you really don't understand it yourself". So let's see if I got it or not.

I'm a Cell, Get Me Out Of Here!

A neuron is a cell, so it makes sense to start with cells. Cells are a basic building block in biology and can be considered as the smallest unit of a living organism - at least for our purposes, if nothing else. The key idea behind a cell is as obvious as you'd like: there is the inside, the outside, and the thing that separates both.

Of course, this being biology, we need to give it complicated names. Accordingly, the inside of the cell is the cytoplasm and the thing that separates the cell from the outside world is the membrane. You can think of it as a tiny roundy-box-like thing, with some gooey stuff inside. The material of the box is the membrane. The gooey stuff is the cytoplasm. When we start describing the different cellular structures - as we are doing here - we are talking about the cell's morphology.

Living beings are made up of many, many cells - according to some estimates, a human body would have several trillion - and cells themselves come in many, many kinds. Fortunately, we are interested in just one kind: the neuron.

The Neuron Cell

The neuron is a nerve cell. Of course, there are many, many kinds of neurons - nature just seems to love complexity - but they all share things in common, and those things define their neuron-ness.

Unlike the "typical" cell we described above (i.e. "roundy-box-like thing"), the neuron is more like a roundy-box-like thing with some branches coming out of it. The box-like thing is the cell body and is called soma. There are two types of branches: axons and dendrites. A dendrite tends to be short, and it branches like a tree with a very small trunk. The axon tends to be long and it also branches off like a tree, but with a very long trunk. As we said, there are many kinds of neurons, but a fair generalisation is that they tend to have few axons (one or maybe a couple) and many dendrites (in the thousands).

Figure 1: Source: What is a Neuron?

This very basic morphology is already sufficient to allow us to start to think of a neuron as a "computing device" - a strange kind of device where the dendrites provide inputs and the axon outputs. The neuron receives all these inputs, performs some kind of computation over them, and produces an output.

The next logical question for a computer scientist is, then: "where do the inputs come from and where do the outputs go?". Imagining an idealised neuron, the dendrites would be "connecting" to other dendrites or to axons. At this juncture (pun not intended), we need to expand on what exactly these "connections" are. In truth, it's not that the axon binds directly to the dendrite; there is always a gap between them. But this gap is a special kind of gap, first because it is a very small gap and second because it is one over which things can travel, from the axon into the dendrite. This kind of connectivity between neurons is called a synapse.

From this it is an easy leap to imagine that these sets of neurons connected to other neurons begin to form "networks" of connectivity, and these networks will also have computational-device-like properties, just like a neuron. These are called neural networks. Our brain happens to be one of these "neural networks", and a pretty large one at that: it can have as many as 80-100 billion neurons, connected over some 1 quadrillion synapses. In these days of financial billions and trillions, it is easy to be fooled into thinking 100 billion is not a very large number, so to get a sense of perspective let's compare it to another large network. The biggest and fastest growing human-made network is the Internet, estimated to have some 5 billion connected devices but less than 600k connections in its core - and yet we are already creaking at the seams.

The Need To Go Lower

Alas, we must dig deeper before we start to understand how these things behave in groups. Our skimpy first pass at the neuron morphology left a lot of details out, which are required to understand how they behave. As we explained, neurons have axons and dendrites, and these are responsible for hooking them together. However, what is interesting is what they talk about once they are hooked.

A neuron can be thought of as an electrical device, and much of its power (sorry!) stems from this. In general, as computer scientists, we don't like to get too close to the physical messiness of the world of hardware; we deem it sufficient to understand some high-level properties, but rarely do we want to concern ourselves with transistors or even - regrettably - registers or pipelines in the CPU. With neurons, we can't get away with it. We need to understand the hardware - or better, the wetware - and for that we have to go very low-level.

We started off by saying cells have a membrane that separates the outside world from the cytoplasm. That was a tad of an oversimplification; after all, if the membrane did not allow anything in, how would the cell continue to exist - or even come about in the first place? In practice these membranes are permeable - or to be precise, semi-permeable. This just means that it allows some stuff in and some stuff out, under controlled circumstances. This is how a cell gets energy in to do its thing and how it expels its unwanted content out. Once things start to move in and out selectively, something very interesting can start to happen: the build up of "electric potential". However, rather unfortunately, in order to understand what we mean by this, we need to cover the fundamentals of electricity.

Onward and downwards we march. Stay tuned for Part II.

Created: 2015-08-31 Mon 17:25


Nerd Food: On Product Backlog