First Duke University meeting report, December 2017

Since both phyloreferencing postdocs work at the Florida Museum of Natural History with project PI Nico Cellinese, our project proposal calls for us to spend a few weeks every year working closely with our other PI, Hilmar Lapp, at Duke University. We drove up to Durham the week before the Phenoscape Knowledgebase Datafest and Interoperability Hackathon last month, and spent five days working together to improve our development practices, review our software development plan, discuss how to make the value of phyloreferencing clear to potential users, and decide how we would match Phyloreferencing specifiers with nodes on a phylogeny. These discussions built on the goals we laid out in our second face-to-face meeting, and helped scope out what we will be working on through the first half of 2018.

Improving our development practices

Since early in our project, we have coordinated work through frequent e-mail updates and an hour-long videoconference every two weeks. This served us well while we were developing tools to convert phylogenies into ontologies or working on the Phyloreferencing specification, but we need better project and software management tools as we start building larger and more complex software products, such as the Curation Workflow and the Curation Tool. To this end, we have already centralized all our code into the Phyloreferencing organization on GitHub and set up a single organizational repository to store project-wide issues and documents. Our website is itself stored in a GitHub repository as a Jekyll website. All our software development progresses through pull requests, allowing other project members to review both newly written source code and any updates to our website, such as new blog posts; for example, here is the pull request for the blog post you are reading right now! Despite taking these steps, we found we were falling short in two respects: (1) we lacked a high-level view of our software development efforts, and (2) too many of our documents were created and discussed on platforms like Google Docs or Dropbox Paper without being permanently archived on GitHub or shared publicly through our blog. This makes it hard for project members to keep track of our current progress, and for people in relevant scientific communities who are, or become, interested in what we have built and what we are working on right now.

To improve these aspects of project management, we decided to make greater use of GitHub's built-in project and document management tools.

A review of our software development plan

One of the most valuable parts of our Durham meeting was the opportunity to discuss our software development plan with Dan Leehr, scientific applications architect and senior developer at the Duke Center for Genomic and Computational Biology, and Jim Balhoff, senior research scientist at the Renaissance Computing Institute. Dan and Jim were extremely patient while I explained what we were building and how long I thought it would take to build, based on the software development plan I had just written. They recommended that I schedule more time for some of the more complex software components, which is now reflected in our six-month plan.

We spent some time discussing whether the Curation Tool would be more useful as a desktop or a web application, given that our main challenge is packaging an OWL reasoner with the application. Many OWL reasoners (e.g. HermiT, Pellet, JFact) are designed to be used through the OWL API on Java, while others are written in C++ (FaCT++, RDFox); neither platform is trivially easy to distribute. Jim pointed us to Arachne, an RDF rule engine he had written in Scala. Since Scala can be compiled to JavaScript using Scala.js, this may allow us to create a distributable desktop application using Electron. We also considered writing a Protégé plugin, which could leverage the reasoning and visualization features already available in that software. Desktop applications have the advantage of no ongoing costs: once written, an application can be archived and reused for as long as its runtime environment remains available, whereas web services may run out of funding and be shut down as a result. The best option for now is the one that gives us the most flexibility going forward: building a server application that can resolve phyloreferences, and then a simple web-based frontend to it. Applications we develop in the future could either be standalone applications with a built-in OWL reasoner or use the server backend to resolve phyloreferences, whether locally on the desktop or remotely over the internet. To improve archiving, we could ensure that the server backend is easy to install locally on any computer.
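
To make that architecture a little more concrete, here is a minimal sketch (in Python, using Flask) of what such a server backend might look like: a single endpoint that accepts a phyloreference and a phylogeny and returns the nodes it resolves to. The endpoint path, payload fields and the resolve_phyloreference() helper are hypothetical placeholders, not part of any existing Phyloreferencing code.

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

def resolve_phyloreference(phyloref, phylogeny):
    # Placeholder for the actual reasoning step; a real implementation would
    # hand both documents to an OWL or RDF reasoner and collect the matching nodes.
    return []

@app.route("/resolve", methods=["POST"])
def resolve():
    # Accept a phyloreference and a phylogeny as JSON and return the resolved nodes.
    payload = request.get_json()
    nodes = resolve_phyloreference(payload["phyloreference"], payload["phylogeny"])
    return jsonify({"resolved_nodes": nodes})

if __name__ == "__main__":
    app.run(port=8080)
```

A backend this small could be run locally on a desktop machine or deployed behind a web frontend, which is exactly the flexibility we are after.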

Another challenge we discussed was the most efficient way of carrying out the OWL reasoning we need to resolve phyloreferences. Right now, we can reason over 27 phyloreferences from two case studies in just under a thousand seconds (around 37 seconds per phyloreference), but we expect larger phylogenies or more complex phyloreferences to take longer to resolve. Jim suggested RDF-based approaches that might be faster than OWL-based approaches, such as using SPARQL, Cypher (the Neo4j query language) or SWRL (the Semantic Web Rule Language). These could be combined with a fast graph database such as Blazegraph, which could store large phylogenies such as the Open Tree of Life's synthetic tree. We might also be able to model phyloreferences in OWL 2 RL, which would allow rule-based OWL reasoners, such as Jim's Arachne, to process them more efficiently than full OWL 2 DL reasoners; the OWL 2 EL profile, supported by reasoners like ELK, cannot fully represent the phyloreferencing model we currently use. Restricting our model to the OWL 2 RL or OWL 2 EL profiles might require some counter-intuitive modeling decisions, such as modeling phylogenetic nodes as classes rather than individuals, but could give us speed improvements and allow us to use faster reasoners. We will revisit these options as we curate more phyloreferences in the future.
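
As a rough illustration of the RDF-based direction, the sketch below (in Python, using rdflib) queries a phylogeny graph directly with a SPARQL 1.1 property path rather than asking an OWL reasoner to classify nodes. The phylo: vocabulary, node URI and file name are placeholders, not the terms our ontologies actually use.

```python
from rdflib import Graph

g = Graph()
g.parse("phylogeny.ttl", format="turtle")  # a phylogeny already converted to RDF

# Find every node descended from a given node using a SPARQL 1.1 property path
# ("+" means one or more has_child steps), with no OWL classification pass.
query = """
PREFIX phylo: <http://example.org/phylo#>
SELECT ?descendant WHERE {
    phylo:node42 phylo:has_child+ ?descendant .
}
"""

for row in g.query(query):
    print(row.descendant)
```

Queries like this are the kind of thing a triple store such as Blazegraph could execute quickly even over very large trees.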

We also had a short discussion on whether it would be possible to infer relationships between phyloreferences themselves: for example, could we determine whether one phyloreference will always resolve to a subclade of another just by comparing their definitions? While we can do this in simple cases, or when resolving phyloreferences on the same phylogeny, solving it in general is likely impossible given the wide variety of phylogenies we might encounter. Another short discussion concerned the limitations of OWL for the kind of reasoning we need: OWL cannot measure and compare path lengths within a phylogeny, and so it cannot natively represent the concept of a “most recent common ancestor”. A rule-based approach may be able to calculate such relationships before the model is handed to a reasoner, simplifying some of the OWL class expressions we currently use to represent phyloreferences.
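
As a simple illustration of that kind of pre-computation, here is a minimal Python sketch that finds the most recent common ancestor of two nodes given a plain child-to-parent mapping; the data structure and node labels are hypothetical and are not part of our OWL model.

```python
def ancestors(node, parent_of):
    """Return the path from a node up to the root, starting with the node itself."""
    path = [node]
    while node in parent_of:
        node = parent_of[node]
        path.append(node)
    return path

def most_recent_common_ancestor(a, b, parent_of):
    """Return the first ancestor of b that is also an ancestor of a, or None."""
    ancestors_of_a = set(ancestors(a, parent_of))
    for node in ancestors(b, parent_of):
        if node in ancestors_of_a:
            return node
    return None

# Example: a tiny rooted tree expressed as a child -> parent mapping.
parent_of = {
    "Homo sapiens": "Homininae",
    "Pan troglodytes": "Homininae",
    "Homininae": "Hominidae",
    "Pongo pygmaeus": "Hominidae",
}
print(most_recent_common_ancestor("Homo sapiens", "Pongo pygmaeus", parent_of))
# -> Hominidae
```

Results computed this way could then be asserted into the model before it is handed to a reasoner, sidestepping OWL's inability to compare path lengths.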

Making the value of phyloreferencing clear to potential users

2018 is the year we hope to show off the technologies and ontologies we have been working on through 2017. To do this, we need to do more than demonstrate how phyloreferences work or how reliable they are: we need to demonstrate that they can be used to generate biological insights not easily obtainable through other means. We plan to do this in cooperation with other scientists attempting to reconcile data using phylogenies, such as those involved in the NSF-funded Genealogy of Life projects. In our discussions on identifying and developing these use cases, we decided that it was important not to build the tool we think scientists will want, but the tools they actually need: we must be careful not to decide in advance how to tackle data integration challenges in the absence of concrete use cases or motivating user stories. A small number of good use cases that clearly demonstrate the benefits of phyloreferencing will be more helpful to us than a large number of use cases that demonstrate only marginal benefits over current approaches.

While we work to identify these use cases by reaching out to potential collaborators, we need a tool that allows the basic use of phyloreferences to be demonstrated. The first step towards this goal will be to build the Curation Tool, which will be our first user-accessible tool for writing and executing phyloreferences. This will allow project members to play with the current phyloreferencing model in practice and identify gaps that might not be obvious just from reading the specification. It will be the first tool we can show to potential collaborators, and any custom tools necessary for their use can be based on it. The creation of the Curation Tool is therefore our immediate software development goal.

We also reiterated the value of building a list of curated phyloreferences. These can provide information on how people currently refer to clades, whether formally as phylogenetic clade definitions or informally within publications, such as through the use of synapomorphies or free-text descriptions. It will also allow us to ask deeper questions, such as considering what might make a phyloreference “good” or “bad” in terms of precision, reusability and biological importance.

How to match Phyloreferencing specifiers with nodes on a phylogeny

My most important goal for this meeting was to finalize a model we could use to match Phyloreferencing specifiers. This is the first step in resolving phyloreferences, in which “specifiers” — the parts of a phyloreference that refer to biological entities, such as “Homo sapiens” or “CM 9380” — are matched to nodes on a phylogeny that refer to those entities. This matching process can be exceptionally complex, so we quickly decided not to carry it out purely in OWL. Carrying out the matching in a programming language or in SPARQL instead will make it easier to write, maintain and extend the necessary logic. We are currently updating the Phyloreferencing specification to reflect this decision, and will shortly publish a blog post detailing the specifier model we have settled on.
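
To give a flavour of what moving the matching logic out of OWL might look like, here is a deliberately simplified Python sketch that matches a specifier string against node labels. The real specifier model distinguishes taxon names, specimen identifiers and other specifier types, so this only shows where such logic would live; the data structures and labels are hypothetical.

```python
def match_specifier(specifier, phylogeny_nodes):
    """Return the node ids whose labels match a specifier (exact match after normalization)."""
    normalized = specifier.strip().lower()
    return [
        node_id for node_id, label in phylogeny_nodes.items()
        if label.strip().lower() == normalized
    ]

# Hypothetical node-id -> label mapping for a small phylogeny.
phylogeny_nodes = {
    "node1": "Homo sapiens",
    "node2": "Pan troglodytes",
    "node3": "CM 9380",
}

print(match_specifier("Homo sapiens", phylogeny_nodes))  # ['node1']
print(match_specifier("cm 9380", phylogeny_nodes))       # ['node3']
```

Writing the matching this way makes it far easier to add fuzzier rules later, such as handling synonyms or partial specimen identifiers, than encoding the same logic as OWL class expressions.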

Post-meeting discussions on project communication

Having returned to Gainesville from the meeting, we thought further about how we could improve our development and communication processes. We all agreed that the meeting had been a success, and plan to meet more often: if possible, twice a year at Duke University and three times at the University of Florida. We decided that the postdocs should keep everybody informed of our progress by maintaining a list of what we accomplished in the previous week and what we plan to work on in the week ahead. We also decided to learn more about GitHub's collaboration features and to commit to using them more often, even if there is an initial learning curve to overcome. This includes pushing early drafts of new blog posts directly to GitHub instead of distributing them as shared documents on other platforms for review. We're working on all of this now, and hope to have better project management in 2018 as we build the Curation Tool!