Currently employed by HowardHughesMedicalInstitute (working for BerkeleyDrosophilaGenomeProject) on databases, ontologies and tools for biological research. I am interested in GenomeAnnotation and relating knowledge of model organism biology (in particular PhenotypeData) to understanding human diseases. I believe that our understanding of these things can be advanced through the use of techniques from ComputerScience, including advanced database and modeling techniques such as FormalOntology. I also believe that RelationalDatabase technology is not being used to its fullest potential in BioInformatics. This wiki page is a way for me to informally riff on some of these things.
My previous work at Berkeley includes GadFly (database of the fly) and the GadFlyPipeline. GadFly was the precursor of the ChadoDatabase (part of the GenericModelOrganismDatabase project). At the same time I developed the GeneOntologyDatabase and associated Perl modules and API. Currently I am doing a lot of research on enhancing the GeneOntology, and working with MarkYandell on intron evolution.
GenomeAnnotation is essential for making sense of the flood of sequence data. This means both quality data and the correct representational formalisms for capturing that data. There is a lot more to a genome than protein coding genes, and there is a lot more to the standard protein coding gene than the central dogma (DNA makes RNA makes protein, end of story). Biology is full of surprises that confound our rigid data models.
I had the benefit of working with some extremely knowledgeable biologist curators at Berkeley during the annotation of the fruitfly genome. Amongst other things, this taught me that it will be a long time before automated methods can approximate the correctness and detail of human annotation, and that current automated methods can be improved tremendously to aid manual annotation.
There is a plethora of DataModels for representing GenomeAnnotations. Most lack formal rigour, and there is little in the way of declarative mappings between them. There are many black-box parsers that purport to convert between these, but their inner workings are known only to their programmers. Many formats such as GFF are lossy, which is not problematic for some of today's applications but makes them unsuitable as a general-purpose solution. It's all a bit of a mess. Thankfully the SequenceOntology will go some way to addressing issues of interoperability, although SO still allows multiple representations and differing conceptualisations of things such as locations.
I have contributed to this plethora both with the ChadoDatabase and the semantically equivalent ChaosXML and ChadoXML DataModels. I hope that the use of formal principles underlying these models will help manage the crisis of GenomeAnnotation.
I curate the OBORelationshipOntology.
[TODO: Obol, AristotelianDefinitions]