Data Structures for Genome Representation

The availability of fast and affordable high-throughput DNA and RNA sequencing has transformed biology and medicine into research areas of data science. After initial targeted sequencing of individuals of a specific species of interest, such as humans, the scope has widened dramatically over the past years to the analysis of population-scale cohorts. Such projects include the study of cancer patients in the International Cancer Genome Consortium, the health studies of the UK10K effort, and the research projects assessing tens of thousands of new species, such as the whole metagenome studies of the Human Microbiome Project or the work of the MetaHIT and MetaSUB consortia. The revolution in sequencing technology has given scientists the power to collect genomic variation at an unprecedented depth and scale.

Building a compendium of sequence information

Yet, despite the data availability, a central problem remains that can be stated as a simple question: Given a short DNA sequence arising from an experimental measurement, what information about this sequence can be derived or inferred from the knowledge currently available in the domain? Our research centers on answering this question in a timely manner while using a minimal amount of compute and storage resources.

To this end, we are working on a compressed, distributed storage system for reference genomes and DNA sequencing data that dynamically scales over multiple compute entities and that can be adapted to the individual needs of specific research projects. We employ succinct data structures, compression techniques and concepts from distributed computing. We carry out this work in close collaboration with experts from the biomedical domain and with other researchers from the ETH Computer Science Department.
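To make the query posed above more concrete, the following is a minimal sketch of a labeled k-mer index: every k-mer is packed into an integer using two bits per base and mapped to the set of labels (for example, genome or sample names) under which it was observed. This is a toy in-memory illustration only, not the compressed, distributed system itself. The names `encode_kmer` and `KmerIndex`, the choice of k, and the labels are all hypothetical and chosen for the example.

```python
from collections import defaultdict

# Two bits per nucleotide: a simple compact encoding (assumption for illustration).
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_kmer(kmer: str) -> int:
    """Pack a DNA k-mer into an integer using 2 bits per base."""
    value = 0
    for base in kmer:
        value = (value << 2) | BASE_CODE[base]
    return value


class KmerIndex:
    """Toy index mapping each k-mer to the labels of the sequences containing it."""

    def __init__(self, k: int) -> None:
        self.k = k
        self.labels = defaultdict(set)  # encoded k-mer -> set of labels

    def add(self, label: str, sequence: str) -> None:
        """Register every k-mer of `sequence` under `label`."""
        for i in range(len(sequence) - self.k + 1):
            self.labels[encode_kmer(sequence[i:i + self.k])].add(label)

    def query(self, read: str) -> dict:
        """Return, per label, the fraction of the read's k-mers seen under that label."""
        n = len(read) - self.k + 1
        if n <= 0:
            return {}
        hits = defaultdict(int)
        for i in range(n):
            code = encode_kmer(read[i:i + self.k])
            # .get avoids creating empty entries for unseen k-mers
            for label in self.labels.get(code, ()):
                hits[label] += 1
        return {label: count / n for label, count in hits.items()}


if __name__ == "__main__":
    index = KmerIndex(k=3)
    index.add("genome_a", "ACGTAC")
    index.add("genome_b", "TTTACG")
    print(index.query("ACGT"))
```

A production system would replace the Python dictionary with succinct, compressed structures and distribute them over several machines; the sketch only shows the shape of the lookup from a short sequence to its annotations.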

As a first field of application, we have chosen metagenome sequencing. This approach, the acquisition of whole-genome DNA sequences from a mixed sample containing a community of bacteria, viruses and fungi (microbiota) sampled from a specific environment, has become increasingly relevant. The compositional and functional analysis of such samples not only aids in understanding the ecological roles that microbial communities play in our environment or in an industrial setting, but also helps to elucidate the interplay of human health and the microbiome of the individual.

Gene prediction on pan-genome representations

In the classical setting, gene prediction algorithms are applied to a single genome sequence, integrating RNA-Seq evidence with sequence signals and conservation information. In a multi-species setting, state-of-the-art approaches additionally employ multiple-sequence alignments (MSA) and phylogenetic information to glean information from (closely) related species. We are interested in using a pan-genome graph as a backbone data structure and in carrying out prediction tasks directly on the graph.
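As a rough illustration of what a pan-genome graph can look like, the sketch below builds a small de Bruijn-style graph from several genomes: nodes are k-mers, edges connect k-mers that follow each other in some genome, and each node is labeled with the genomes that contain it. The text above does not specify which graph model is used, so the de Bruijn construction is an assumption made for this example, and the function names `build_pangenome_graph`, `branch_points` and `shared_kmers` are hypothetical.

```python
from collections import defaultdict


def build_pangenome_graph(genomes: dict, k: int):
    """Build a labeled de Bruijn-style graph from several genomes.

    Returns (edges, labels):
      edges  maps a k-mer to the set of k-mers that follow it in any genome,
      labels maps a k-mer to the set of genome names containing it.
    """
    edges = defaultdict(set)
    labels = defaultdict(set)
    for name, seq in genomes.items():
        kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
        for kmer in kmers:
            labels[kmer].add(name)
        for left, right in zip(kmers, kmers[1:]):
            edges[left].add(right)
    return edges, labels


def branch_points(edges) -> list:
    """k-mers with more than one successor, i.e. where paths through the graph diverge."""
    return sorted(kmer for kmer, successors in edges.items() if len(successors) > 1)


def shared_kmers(labels, genomes) -> list:
    """k-mers present in every input genome."""
    return sorted(kmer for kmer, names in labels.items() if len(names) == len(genomes))


if __name__ == "__main__":
    genomes = {"g1": "ACGTA", "g2": "ACGGA"}
    edges, labels = build_pangenome_graph(genomes, k=3)
    print(branch_points(edges))
    print(shared_kmers(labels, genomes))
```

A prediction method operating on such a graph would walk its paths instead of a single linear sequence, so that evidence shared between genomes is considered once while variation appears as alternative branches.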