We are proud that the SIB committee has selected and recognized the work of our lab members, Mikhail Karasikov and Harun Mustafa, on “Lossless indexing with counting de Bruijn graphs”!
This work is especially relevant in the age of rapid growth of public sequence repositories (such as NCBI SRA and ENA), for which there are currently no efficient methods for interactive exploration and scalable search. Although de Bruijn graphs (DBGs) have seen widespread use in bioinformatics for genome assembly and sequence set indexing, they are inherently lossy. In particular, they only represent a set of k-mers (nodes) and their overlaps (edges), so it is generally impossible to recover from the graph the original sequences used to construct it.
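To make the loss of information concrete, here is a minimal Python sketch (a toy illustration, not code from the MetaGraph toolkit). The helper names `kmers` and `build_dbg` and the example sequences are our own assumptions for illustration. It builds a plain node-centric DBG from two different input sets and shows that both yield exactly the same graph.

```python
def kmers(seq, k):
    """Return all k-mers of seq, in order of occurrence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_dbg(seqs, k):
    """Build a toy de Bruijn graph: nodes are k-mers, edges are (k-1)-overlaps."""
    nodes = set()
    for s in seqs:
        nodes.update(kmers(s, k))
    # Connect u -> v whenever the suffix of u equals the prefix of v.
    edges = {(u, v) for u in nodes for v in nodes if u[1:] == v[:-1]}
    return nodes, edges


# Two different input sets (hypothetical example data).
graph_a = build_dbg(["ACGT", "CGTA"], k=3)
graph_b = build_dbg(["ACGTA"], k=3)

# Both inputs collapse to the same nodes and edges, so the graph alone
# cannot tell us which set of sequences it was built from.
same_graph = graph_a == graph_b
print(same_graph)
```

Because the two graphs are identical, neither the number of input sequences nor their boundaries can be read back from the graph.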
In this work, we propose Counting DBGs, a data structure generalizing DBGs by supplementing each node with one or more attributes. Counting DBGs can succinctly represent quantitative information such as k-mer counts (for gene expression) and graph walks (for lossless representation of the input sequences).
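The idea of attaching attributes to nodes can be sketched in a few lines of Python. This is only a toy model under our own assumptions: it uses plain dictionaries rather than the succinct representations of the actual toolkit, it stores each k-mer's (sequence, position) occurrences as a simple stand-in for graph walks, and it assumes every input sequence is at least k characters long. The function names are hypothetical.

```python
from collections import Counter, defaultdict


def kmers(seq, k):
    """Return all k-mers of seq, in order of occurrence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_counting_dbg(seqs, k):
    """Annotate each k-mer (node) with a count and its occurrences."""
    counts = Counter()          # node -> number of occurrences
    coords = defaultdict(list)  # node -> list of (sequence id, position)
    for sid, s in enumerate(seqs):
        for pos, km in enumerate(kmers(s, k)):
            counts[km] += 1
            coords[km].append((sid, pos))
    return counts, coords


def reconstruct(coords):
    """Recover the input sequences from the per-node occurrences."""
    placed = defaultdict(dict)  # sequence id -> {position: k-mer}
    for km, occurrences in coords.items():
        for sid, pos in occurrences:
            placed[sid][pos] = km
    seqs = []
    for sid in sorted(placed):
        walk = [placed[sid][p] for p in sorted(placed[sid])]
        # The first k-mer in full, then the last character of each next one.
        seqs.append(walk[0] + "".join(km[-1] for km in walk[1:]))
    return seqs


inputs = ["ACGT", "CGTA"]  # hypothetical example data
counts, coords = build_counting_dbg(inputs, k=3)
recovered = reconstruct(coords)
print(counts["CGT"], recovered)
```

With these extra attributes, the same node set that was ambiguous in a plain DBG is enough to recover the original inputs exactly, and the counts are available for quantitative queries.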
Alongside our open-source toolkit and experimental workflows, we have made our Counting DBG indexes queryable by the wider bioinformatics community via our web-based MetaGraph search service.
What the SIB awards committee said about the work: “By allowing DNA databases to be indexed without losing information and fully searchable, once-daunting data sets can now become powerful resources for biomedical research. A major step to making DNA sequencing data accessible to wider audiences.”
A huge thank you to SIB for recognizing and supporting our research through its platform and activities!