The MetaGraph Project

Fast and affordable high-throughput DNA and RNA sequencing has become a commodity in the biomedical domain. After an initial focus on targeted sequencing, the scope has widened dramatically over the past decade towards the analysis of population-scale cohorts. Such projects include the study of cancer patients in The Cancer Genome Atlas project or the International Cancer Genome Consortium and the health studies of the UK10K effort. Even research projects assessing tens of thousands of new species are common, such as the whole metagenome studies of the Human Microbiome Project or the work of the MetaHIT and MetaSUB consortia. The revolution in sequencing technology has given scientists the power to collect and study genomic variation at an unprecedented depth and scale.

Yet, despite the massive availability of data, a central problem remains that can be stated as a simple question: Given a short DNA sequence arising from experimental measurement, have we ever seen this sequence before? A second question follows closely: What information about this sequence can be derived or inferred from the knowledge currently available in the domain? The MetaGraph project aims not only to answer these questions but also to do so promptly while using minimal computing and storage resources. In addition, we aim to retain all relevant information from the input data in compressed form, so that sequence analysis tasks can be executed directly on the compressed index. One such analysis task is the differential assembly of sequences, where we search for sets of sequences that occur (or are enriched) in a given foreground set but do not occur (or are depleted) in a background set of sequences.
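To make the two questions and the differential task concrete, the following is a minimal sketch in plain Python that works on uncompressed k-mer sets. It is not MetaGraph's implementation or API: the function names (`kmers`, `seen_before`, `differential_kmers`), the choice of k, and the thresholds `min_fg` and `max_bg` are illustrative assumptions, and a real index would use compressed data structures rather than Python sets.

```python
from collections import Counter


def kmers(seq, k):
    """Return the set of all length-k substrings (k-mers) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def seen_before(query, index, k):
    """Fraction of the query's k-mers that are present in the index (a set of k-mers)."""
    q = kmers(query, k)
    if not q:
        return 0.0
    return len(q & index) / len(q)


def differential_kmers(foreground, background, k, min_fg=1.0, max_bg=0.0):
    """k-mers found in at least min_fg of the foreground samples
    and in at most max_bg of the background samples (hypothetical thresholds)."""
    # kmers() returns a set per sample, so each count is a number of samples.
    fg_counts = Counter(km for s in foreground for km in kmers(s, k))
    bg_counts = Counter(km for s in background for km in kmers(s, k))
    return {
        km for km, c in fg_counts.items()
        if c / len(foreground) >= min_fg
        and bg_counts[km] / max(len(background), 1) <= max_bg
    }
```

With the default thresholds, `differential_kmers` keeps only k-mers shared by every foreground sample and absent from all background samples; relaxing `min_fg` and `max_bg` corresponds to the "enriched" and "depleted" variants mentioned above. Assembling the selected k-mers into longer sequences is left out of this sketch.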

The MetaGraph framework [1] is a compressed, distributed storage and analysis system for reference genomes and DNA sequencing data that scales dynamically over multiple compute entities and can be adapted to the individual needs of specific research projects. Our research focuses on the indexing of text data and the efficient compression and decompression of colour/label information on a given sequence graph [2, 3]. We employ succinct data structures, compression techniques, and concepts from distributed computing.
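The idea of label information attached to a sequence graph can be illustrated with a small, uncompressed sketch. It assumes a de Bruijn-style graph whose nodes are k-mers, and it stores labels in plain Python dictionaries and sets instead of the succinct, compressed structures mentioned above. The function names (`build_annotated_graph`, `successors`, `labels_for`) are hypothetical and do not correspond to MetaGraph's API.

```python
from collections import defaultdict


def build_annotated_graph(samples, k):
    """Map each k-mer (graph node) to the set of sample labels that contain it."""
    annotation = defaultdict(set)
    for label, seq in samples.items():
        for i in range(len(seq) - k + 1):
            annotation[seq[i:i + k]].add(label)
    return dict(annotation)


def successors(annotation, node):
    """Nodes reachable by one edge: indexed k-mers that overlap node by k-1 characters."""
    return sorted(node[1:] + c for c in "ACGT" if node[1:] + c in annotation)


def labels_for(annotation, query, k):
    """Labels that cover every k-mer of the query sequence."""
    hits = [annotation.get(query[i:i + k], set())
            for i in range(len(query) - k + 1)]
    return set.intersection(*hits) if hits else set()
```

For example, after `graph = build_annotated_graph({"s1": "ACGTA", "s2": "CGTAC"}, 3)`, the call `labels_for(graph, "CGTA", 3)` returns both labels, because every 3-mer of the query occurs in both samples. The cost of storing one label set per node is what makes compressing this annotation worthwhile at scale.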

In collaboration with other groups, MetaGraph has already been applied to several large-scale cohorts and provides the backend for a number of genome-sequencing services. Examples are the metagenomics data of the MetaSUB consortium [4] and the TARA Oceans project [5]. In addition to these raw data sets, we provide publicly available pre-computed indexes for a large part of the public genome sequencing projects in NCBI's Sequence Read Archive (SRA).

Involved group members: Harun Mustafa, Mikhail Karasikov, André Kahles, Gunnar Rätsch, Marc Zimmermann (alumnus), Daniel Danciu (alumnus)

[Project page] [Download] [Documentation]

[1] Karasikov, Mikhail, Harun Mustafa, Daniel Danciu, Marc Zimmermann, Christopher Barber, Gunnar Rätsch, and André Kahles. "MetaGraph: Indexing and analyzing nucleotide archives at petabase-scale." bioRxiv (2020).
[2] Karasikov, Mikhail, Harun Mustafa, Gunnar Rätsch, and André Kahles. "Lossless indexing with counting de Bruijn graphs." In International Conference on Research in Computational Molecular Biology, pp. 374-376. Springer, Cham, 2022.
[3] Danciu, Daniel, Mikhail Karasikov, Harun Mustafa, André Kahles, and Gunnar Rätsch. "Topology-based sparsification of graph annotations." Bioinformatics 37, no. Supplement_1 (2021): i169-i176.
[4] Danko, David, Daniela Bezdan, Evan E. Afshin, Sofia Ahsanuddin, Chandrima Bhattacharya, Daniel J. Butler, Kern Rei Chng, et al. "A global metagenomic map of urban microbiomes and antimicrobial resistance." Cell 184, no. 13 (2021): 3376-3393.
[5] Paoli, Lucas, Hans-Joachim Ruscheweyh, Clarissa C. Forneris, Florian Hubrich, Satria Kautsar, Agneya Bhushan, Alessandro Lotti et al. "Biosynthetic potential of the global ocean microbiome." Nature 607, no. 7917 (2022): 111-118.