Posts about software development for the Open Tree of Life

Latest

Architecture changes coming

A few of the software folks met at KU in Lawrence last week to talk about (among other things) the architecture around how we store & edit studies. A study is a publication that contains one or more phylogenies that may be imported into the graph database (and therefore make it into the synthetic draft tree). We currently store studies in a relational database, export them to NexSON (a JSON-ified NeXML format) and import that into the graph database. What we lack is a round-trip of information (warnings, errors) from the graph database to NexSON and back to the RDBMS, where curators could make the necessary changes. We would also like the data model to be more flexible for new information we might want to store about trees, and we want our study data to be publicly available for anyone who wants to browse, download or re-use it.

We are proposing moving to a document-based store where a collection of NexSON files is our canonical source of study information for all opentree components. Since we are now talking about a collection of text files, putting them in a git repo on github makes a lot of sense. Git on github gives us versioning, diffs and public access. Dumping the JSON files into CouchDB would also be easy if we wanted to write views across the whole datastore. So, we are now looking into various github hooks for interacting with the repo, and into modifying our tools to read and write the NexSONs via github.

We should be reporting back on our experiments in the next week or so.

Installing taxonomies with the treemachine

Treemachine is what I have been calling the backend tools for manipulating and adding to the OpenTreeOfLife. It is hosted on github and currently has some tools for installing taxonomies and the start of some synthesis and adding of phylogenies. For details on the architecture check out the discussions on this google doc. The general structure that was planned is below. This is likely to change, but should give an idea of why I talk about taxonomies separate from the graph of the phylogenies. We are allowing multiple conflicting taxonomies to live in the same graph so that we can synthesize (and add taxa from other taxonomies).

So, to use treemachine to import a taxonomy, here are the steps.

  1. First, make sure that git is installed on your machine.
  2. In the terminal, run git clone git://github.com/OpenTreeOfLife/opentree-treemachine.git
  3. Make sure that you have mvn installed. It is easy on linux and I suspect it is already on macs, but please comment and let me know if you have trouble.
  4. Go inside the opentree-treemachine directory in the terminal and type sh mvn_cmdline.sh. This installs the neo4j libraries if you don’t have them and compiles treemachine.
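The steps above can be sketched as a pair of shell functions. This is only a rough sketch: require_tools and build_treemachine are names made up for this example, and nothing runs until you call them. The clone URL and build script are the ones from the steps; note that GitHub has since turned off the unauthenticated git:// protocol, so the URL may need adjusting.

```shell
#!/bin/sh
# Sketch only: require_tools and build_treemachine are hypothetical helper
# names, not part of treemachine itself.

# Return 0 only if every named tool is on PATH; report the missing ones.
require_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Steps 1-4: check prerequisites, clone the repository, then build it.
build_treemachine() {
    require_tools git mvn java || return 1
    git clone git://github.com/OpenTreeOfLife/opentree-treemachine.git || return 1
    # Subshell so the caller's working directory is left alone.
    ( cd opentree-treemachine && sh mvn_cmdline.sh )
}
```

Calling build_treemachine from the directory where you want the checkout should leave the .jar under opentree-treemachine/target, assuming the build succeeds.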

Within the target directory is now a .jar file that is the treemachine. You can run it by doing java -jar treemachine-0.0.1-SNAPSHOT-jar-with-dependencies.jar. This gives some basic output on some of the commands.

To load a taxonomy the command is basically (if you are in the opentree-treemachine directory)

java -jar target/treemachine-0.0.1-SNAPSHOT-jar-with-dependencies.jar inittax filename nameforsource whereyouwantthedatabase

So, for example, if we had the ncbi taxonomy in a file called ncbi.txt and we wanted the graph folder to be called graph.db, we would do

java -jar target/treemachine-0.0.1-SNAPSHOT-jar-with-dependencies.jar inittax ncbi.txt ncbi graph.db
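If you load taxonomies often, the inittax call above can be wrapped in a small function that checks its arguments first. This is a sketch under assumptions: load_taxonomy, TREEMACHINE_JAR and DRY_RUN are names invented here, and the only treemachine behavior relied on is the command line shown above.

```shell
#!/bin/sh
# Sketch only: load_taxonomy, TREEMACHINE_JAR and DRY_RUN are hypothetical
# names for this example. The jar path assumes you are in the
# opentree-treemachine directory, as in the post.
TREEMACHINE_JAR="${TREEMACHINE_JAR:-target/treemachine-0.0.1-SNAPSHOT-jar-with-dependencies.jar}"

load_taxonomy() {
    if [ "$#" -ne 3 ]; then
        echo "usage: load_taxonomy filename nameforsource whereyouwantthedatabase" >&2
        return 2
    fi
    if [ ! -r "$1" ]; then
        echo "cannot read taxonomy file: $1" >&2
        return 1
    fi
    # With DRY_RUN set, only print the command that would be run.
    if [ -n "${DRY_RUN:-}" ]; then
        echo java -jar "$TREEMACHINE_JAR" inittax "$1" "$2" "$3"
        return 0
    fi
    java -jar "$TREEMACHINE_JAR" inittax "$1" "$2" "$3"
}
```

For the example above you would call load_taxonomy ncbi.txt ncbi graph.db, or set DRY_RUN=1 first to see the command without running it.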

Then, if we want to look at it in our server (see this post), we need to stop the server (inside the neo4j directory, bin/neo4j stop), delete the graph.db in the data directory (rm -r data/graph.db), replace it with the one we just made (cp -r graph.db neo4jfolder/data/), restart the server (bin/neo4j start) and go to http://localhost:7474/webadmin/ . Your taxonomy is now up. You can go to the Data Browser tab and search by name by entering the search node:index:taxNamedNodes:name:THENAMEYOUWANTTOSEARCH. That will give you a list of matching nodes. If you click the graph icon on the right, you will get a graph that you can click through. You can also change the look of the nodes by clicking styles, then new style, and changing {id} to {name} to show names instead of id numbers.
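The stop, delete, copy and restart sequence above is easy to get wrong by hand, so here is a minimal sketch of it as one function. The name install_graph_db and its two arguments are assumptions made for this example; it only uses bin/neo4j stop and start and the data/graph.db location described in the post. Stopping early when bin/neo4j stop reports failure is a cautious choice of the sketch, not something neo4j requires.

```shell
#!/bin/sh
# Sketch only: install_graph_db is a hypothetical helper.
# Usage: install_graph_db <neo4j install directory> <database folder to publish>
install_graph_db() {
    neo4j_home=$1
    new_db=$2
    if [ ! -x "$neo4j_home/bin/neo4j" ]; then
        echo "no bin/neo4j under $neo4j_home" >&2
        return 1
    fi
    if [ ! -d "$new_db" ]; then
        echo "not a directory: $new_db" >&2
        return 1
    fi
    # Stop the server from inside the neo4j directory, as in the post.
    ( cd "$neo4j_home" && bin/neo4j stop ) || return 1
    # Remove the old database and copy the new one in under the name graph.db.
    rm -rf "$neo4j_home/data/graph.db"
    cp -r "$new_db" "$neo4j_home/data/graph.db" || return 1
    ( cd "$neo4j_home" && bin/neo4j start )
}
```

Because the copy always lands at data/graph.db, the source folder can have any name.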

Hope this helps some and please post questions in the comments here!

This is very experimental code and is constantly being updated, so be warned! When major updates occur with new functionality, they will be explained and posted here. You can get the most up to date code by going into the opentree-treemachine directory and running git pull, then sh mvn_cmdline.sh.

Installing neo4j

Neo4j is a graph database that forms the back end of a lot of our work on the Open Tree of Life. A database with neo4j consists of a folder with many files. This folder can be moved around and can be interacted with in a couple of ways.

One way is as an embedded database. This is where you (or your programs and scripts) connect directly to the folder (you have to give the location of the folder). There is no server running, so nothing is sitting there waiting for your queries. This is simple and good for testing or loading tons of data. I often create our databases like this and load most of the starting data before moving it to our server.

Another way to interact with a neo4j database is through the server. Behind the server is still a folder that is identical to the folder you would use if you were connecting to an embedded database. The major difference is that in your program or script, instead of giving the location of the folder, you give the location of the server. The server also needs to be running. Of course, the benefits are that you can connect remotely, you can take advantage of some of the tools that come with neo4j, and you can write procedures that will be used over and over again and provide them as web services through REST calls. More on that in another post.

So how do you install neo4j? Well, if you are using something like treemachine that only interacts with the embedded database, it gets installed when you first run treemachine. However, in general and for the server, you go to this site and download the community edition. I would probably take the latest milestone, but you can just grab the latest edition if you like.

Once this downloads, you can unarchive it and that is basically it. Move it where you like. Everything is in there. You can start the server by going in that folder in a terminal and typing

bin/neo4j start

Then if you go to http://localhost:7474/webadmin/ you will see some info, including that you have 0 nodes, etc. That is it. You have the server running and you can connect by REST calls to the empty database. To stop the server, run bin/neo4j stop. Inside the data folder, there is a folder called graph.db. This is the database. If you make a database with another process or program and you want to view it through the server tools, you can stop the server, delete the graph.db in the data folder and replace it with your other folder (it is easiest to rename it to graph.db as well). Then restart and you should be good to go. This is what I have been doing with our server.

I will post shortly on how to work with treemachine to get taxonomies uploaded.
