Harun Mustafa, MSc ETH UZH in Computational Biology and Bioinformatics

Luke: "You want the impossible... I don't... I don't believe it!" — Yoda: "That is why you fail."

PhD Student

+41 43 254 0225
Biomedical Informatics Group
Schmelzbergstrasse 26
SHM 26 B 5
8006 Zürich

My main research interests are in the development of data structures and algorithms to allow for efficient searching and annotation of high-throughput genome and metagenome sequencing data.

I completed my honours B.Sc. with high distinction at the University of Toronto, dual majoring in computational biology and mathematics. Under the supervision of Michael Brudno, I developed methods for assembling the sequences of novel Alu insertions detected in second-generation sequencing data. I completed my M.Sc. in computational biology at the ETH Zürich, where I developed a classification method for determining internal sites in proteins permissive to tag insertion under the joint supervision of Sven Panke and Jörg Stelling. I joined the Biomedical Informatics Group in 2017 as a Ph.D. student.

Abstract The Jaccard Similarity index is an important measure of the overlap of two sets, widely used in machine learning, computational genomics, information retrieval, and many other areas. However, little effort has been made to develop a scalable and high-performance scheme for computing the Jaccard Similarity for today's large data sets. To address this issue, we design and implement SimilarityAtScale, the first communication-efficient distributed algorithm for computing the Jaccard Similarity. The key idea is to express the problem algebraically, as a sequence of matrix operations, and implement these operations with communication-avoiding distributed routines to minimize the amount of transferred data and ensure both high scalability and low latency. We then apply our algorithm to the problem of obtaining distances between whole-genome sequencing samples, a key part of modern metagenomics analysis and an ever-growing need due to the increasing availability of high-throughput DNA sequencing data. The resulting scheme is the first to enable accurate Jaccard distance derivations for massive datasets, using large-scale distributed-memory systems. We package our routines in a tool, called GenomeAtScale, that combines the proposed algorithm with tools for processing input sequences. Our evaluation on real data illustrates that one can use GenomeAtScale to effectively employ tens of thousands of processors to reach new frontiers in large-scale genomic and metagenomic analysis. While GenomeAtScale can be used to foster DNA research, the more general underlying SimilarityAtScale algorithm may be used for high-performance distributed similarity computations in other data analytics application domains.

Authors Maciej Besta, Raghavendra Kanakagiri, Harun Mustafa, Mikhail Karasikov, Gunnar Rätsch, Torsten Hoefler, Edgar Solomonik

Submitted arXiv Preprints
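
The algebraic formulation mentioned in the abstract above can be illustrated on a single machine. The following is a minimal sketch, not the SimilarityAtScale implementation: it assumes a binary indicator matrix A in which A[i][j] = 1 when item i (for example a k-mer) occurs in sample j, forms the intersection counts B = AᵀA, and derives each similarity as B[j][k] / (B[j][j] + B[k][k] - B[j][k]). The function name `jaccard_matrix` and the toy data are hypothetical.

```python
def jaccard_matrix(indicator):
    """Pairwise Jaccard similarity from a binary indicator matrix.

    indicator[i][j] is 1 if item i (e.g. a k-mer) occurs in sample j, else 0.
    """
    n_items = len(indicator)
    n_samples = len(indicator[0]) if n_items else 0

    # shared = A^T A: shared[j][k] counts the items common to samples j and k.
    shared = [[0] * n_samples for _ in range(n_samples)]
    for row in indicator:
        present = [j for j, bit in enumerate(row) if bit]
        for j in present:
            for k in present:
                shared[j][k] += 1

    # The diagonal of A^T A holds the set sizes |X_j|.
    sizes = [shared[j][j] for j in range(n_samples)]

    similarity = [[0.0] * n_samples for _ in range(n_samples)]
    for j in range(n_samples):
        for k in range(n_samples):
            union = sizes[j] + sizes[k] - shared[j][k]
            # Assumed convention: two empty sets have similarity 1.
            similarity[j][k] = shared[j][k] / union if union else 1.0
    return similarity


# Toy input: five items (rows) across three samples (columns).
A = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 1],
    [0, 0, 1],
]
S = jaccard_matrix(A)
```

Here samples 0 and 1 share two of four distinct items, so `S[0][1]` is 0.5. The Jaccard distance used for comparing samples is then 1 - S[j][k].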

Abstract Metagenomic studies have increasingly utilized sequencing technologies in order to analyze DNA fragments found in environmental samples. Such studies can provide useful insights into the interactions between hosts and microbes, infectious disease proliferation, and novel species discovery. One important step in this analysis is the taxonomic classification of those DNA fragments. Of particular interest is the determination of the distribution of the taxa of microbes in metagenomic samples. Recent attempts using deep learning focus on architectures that classify single DNA reads independently from each other. In this work, we attempt to solve the task of directly predicting the distribution over the taxa of whole metagenomic read sets. We formulate this task as a Multiple Instance Learning (MIL) problem. We extend architectures used in single-read taxonomic classification with two different types of permutation-invariant MIL pooling layers: a) deepsets and b) attention-based pooling. We illustrate that our architecture can exploit the co-occurrence of species in metagenomic read sets and outperforms the single-read architectures in predicting the distribution over the taxa at higher taxonomic ranks.

Authors Andreas Georgiou, Vincent Fortuin, Harun Mustafa, Gunnar Rätsch

Submitted arXiv Preprints
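
The two permutation-invariant pooling layers named in the abstract above can be sketched in a few lines. This is a simplified illustration rather than the architecture from the paper: it assumes each read has already been embedded as a fixed-length vector, uses a plain mean as the DeepSets-style aggregation, and uses softmax weights from scores w · tanh(V h) for the attention-based variant. The parameters `V` and `w` and all values below are hypothetical, not trained weights.

```python
import math


def mean_pool(embeddings):
    """DeepSets-style aggregation: average the per-read embeddings."""
    dim = len(embeddings[0])
    return [sum(e[d] for e in embeddings) / len(embeddings) for d in range(dim)]


def attention_pool(embeddings, V, w):
    """Attention-based MIL pooling: a softmax-weighted sum of embeddings."""
    # One scalar score per read: w . tanh(V h).
    scores = []
    for h in embeddings:
        hidden = [math.tanh(sum(v * x for v, x in zip(row, h))) for row in V]
        scores.append(sum(a * z for a, z in zip(w, hidden)))

    # Numerically stable softmax over the reads in the set.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    dim = len(embeddings[0])
    return [sum(a * h[d] for a, h in zip(weights, embeddings)) for d in range(dim)]


# Hypothetical 2-dimensional embeddings for a set of three reads.
reads = [[1.0, 2.0], [3.0, 4.0], [5.0, 0.0]]
V = [[0.5, -0.25], [0.1, 0.3]]  # illustrative projection matrix
w = [1.0, -1.0]                 # illustrative scoring vector

pooled_mean = mean_pool(reads)
pooled_attention = attention_pool(reads, V, w)
pooled_attention_shuffled = attention_pool(list(reversed(reads)), V, w)
```

Both functions return the same vector (up to floating-point rounding) regardless of the order of the reads, which is the property that lets a model treat a read set as an unordered bag.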


Abstract High-throughput DNA sequencing data is accumulating in public repositories, and efficient approaches for storing and indexing such data are in high demand. In recent research, several graph data structures have been proposed to represent large sets of sequencing data and to allow for efficient querying of sequences. In particular, the concept of labeled de Bruijn graphs has been explored by several groups. While there has been good progress towards representing the sequence graph in small space, methods for storing a set of labels on top of such graphs are still not sufficiently explored. It is also currently not clear how characteristics of the input data, such as the sparsity and correlations of labels, can help to inform the choice of method to compress the graph labeling. In this work, we present a new compression approach, Multi-BRWT, which is adaptive to different kinds of input data. We show an up to 29% improvement in compression performance over the basic BRWT method, and up to a 68% improvement over the current state-of-the-art for de Bruijn graph label compression. To put our results into perspective, we present a systematic analysis of five different state-of-the-art annotation compression schemes, evaluate key metrics on both artificial and real-world data and discuss how different data characteristics influence the compression performance. We show that the improvements of our new method can be robustly reproduced for different representative real-world datasets.

Authors Mikhail Karasikov, Harun Mustafa, Amir Joudaki, Sara Javadzadeh-No, Gunnar Rätsch, Andre Kahles

Submitted RECOMB 2019

Link DOI

Abstract Motivation: Technological advancements in high-throughput DNA sequencing have led to an exponential growth of sequencing data being produced and stored as a byproduct of biomedical research. Despite its public availability, a majority of this data remains hard to query for the research community due to a lack of efficient data representation and indexing solutions. One of the available techniques to represent read data is a condensed form as an assembly graph. Such a representation contains all sequence information but does not store contextual information and metadata. Results: We present two new approaches for a compressed representation of a graph coloring: a lossless compression scheme based on a novel application of wavelet tries as well as a highly accurate lossy compression based on a set of Bloom filters. Both strategies retain a coloring even when adding to the underlying graph topology. We present construction and merge procedures for both methods and evaluate their performance on a wide range of different datasets. By dropping the requirement of a fully lossless compression and using the topological information of the underlying graph, we can reduce memory requirements by up to three orders of magnitude. Representing individual colors as independently stored modules, our approaches can be efficiently parallelized and provide strategies for dynamic use. These properties allow for an easy upscaling to the problem sizes common to the biomedical domain. Availability: We provide prototype implementations in C++, summaries of our experiments as well as links to all datasets publicly at https://github.com/ratschlab/graph_annotation.

Authors Harun Mustafa, Ingo Schilken, Mikhail Karasikov, Carsten Eickhoff, Gunnar Rätsch, Andre Kahles

Submitted Bioinformatics

Link DOI

Abstract Technological advancements in high-throughput DNA sequencing have led to an exponential growth of sequencing data being produced and stored as a byproduct of biomedical research. Despite its public availability, a majority of this data remains inaccessible to the research community through a lack of efficient data representation and indexing solutions. One of the available techniques to represent read data on a more abstract level is its transformation into an assembly graph. Although the sequence information is now accessible, any contextual annotation and metadata is lost. We present a new approach for a compressed representation of a graph coloring based on a set of Bloom filters. By dropping the requirement of a fully lossless compression and using the topological information of the underlying graph to decide on false positives, we can reduce the memory requirements for a given set of colors per edge by three orders of magnitude. As insertion and query on a Bloom filter are constant time operations, the complexity to compress and decompress an edge color is linear in the number of color bits. Representing individual colors as independent filters, our approach is fully dynamic and can be easily parallelized. These properties allow for an easy upscaling to the problem sizes common in the biomedical domain. A prototype implementation of our method is available in Java.

Authors Ingo Schilken, Harun Mustafa, Gunnar Rätsch, Carsten Eickhoff, Andre Kahles

Submitted bioRxiv

Link DOI
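
The idea of storing each color as an independent Bloom filter, described in the abstract above, can be sketched as follows. This is an illustrative sketch in Python, not the authors' Java prototype: the class names, the SHA-256-based hashing, and the filter sizes are assumptions, and the topology-based correction of false positives from the paper is not implemented.

```python
import hashlib


class BloomFilter:
    """A basic Bloom filter over string keys."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive num_hashes bit positions by salting the key with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all((self.bits[pos // 8] >> (pos % 8)) & 1
                   for pos in self._positions(key))


class BloomColoring:
    """One independent Bloom filter per color (hypothetical helper class)."""

    def __init__(self, num_colors, num_bits=1024, num_hashes=3):
        self.filters = [BloomFilter(num_bits, num_hashes)
                        for _ in range(num_colors)]

    def annotate(self, edge, colors):
        for color in colors:
            self.filters[color].add(edge)

    def colors_of(self, edge):
        # May contain false positives, but never misses an annotated color.
        return {c for c, f in enumerate(self.filters) if edge in f}


coloring = BloomColoring(num_colors=4)
coloring.annotate("ACGTA", [0, 2])
coloring.annotate("CGTAC", [1])
recovered = coloring.colors_of("ACGTA")
```

Because each color has its own filter, filters can be built, queried, or extended independently, which is what makes the scheme straightforward to parallelize. A query returns a superset of the true colors: the lossy part of the representation is the possibility of false positives.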

Abstract Much of the DNA and RNA sequencing data available is in the form of high-throughput sequencing (HTS) reads and is currently unindexed by established sequence search databases. Recent succinct data structures for indexing both reference sequences and HTS data, along with associated metadata, have been based on either hashing or graph models, but many of these structures are static in nature, and thus, not well-suited as backends for dynamic databases. We propose a parallel construction method for and novel application of the wavelet trie as a dynamic data structure for compressing and indexing graph metadata. By developing an algorithm for merging wavelet tries, we are able to construct large tries in parallel by merging smaller tries constructed concurrently from batches of data. When compared against general compression algorithms and those developed specifically for graph colors (VARI and Rainbowfish), our method achieves compression ratios superior to gzip and VARI, converging to compression ratios of 6.5% to 2% on data sets constructed from over 600 virus genomes. While marginally worse than compression by bzip2 or Rainbowfish, this structure allows for both fast extension and query. We also found that additionally encoding graph topology metadata improved compression ratios, particularly on data sets consisting of several mutually-exclusive reference genomes. It was also observed that the compression ratio of wavelet tries grew sublinearly with the density of the annotation matrices. This work is a significant step towards implementing a dynamic data structure for indexing large annotated sequence data sets that supports fast query and update operations. At the time of writing, no established standard tool has filled this niche.

Authors Harun Mustafa, Andre Kahles, Mikhail Karasikov, Gunnar Rätsch

Submitted bioRxiv

Link DOI

Abstract Background: Internal tagging of proteins by inserting small functional peptides into surface accessible permissive sites has proven to be an indispensable tool for basic and applied science. Permissive sites are typically identified by transposon mutagenesis on a case-by-case basis, limiting scalability and their exploitation as a system-wide protein engineering tool. Methods: We developed an approach for predicting permissive stretches (PSs) in proteins based on the identification of length-variable regions (regions containing indels) in homologous proteins. Results: We verify that a protein's primary structure information alone is sufficient to identify PSs. Identified PSs are predicted to be predominantly surface accessible; hence, the position of inserted peptides is likely suitable for diverse applications. We demonstrate the viability of this approach by inserting a Tobacco etch virus protease recognition site (TEV-tag) into several PSs in a wide range of proteins, from small monomeric enzymes (adenylate kinase) to large multi-subunit molecular machines (ATP synthase) and verify their functionality after insertion. We apply this method to engineer conditional protein knockdowns directly in the Escherichia coli chromosome and generate a cell-free platform with enhanced nucleotide stability. Conclusions: Functional internally tagged proteins can be rationally designed and directly chromosomally implemented. Critical for the successful design of protein knockdowns was the incorporation of surface accessibility and secondary structure predictions, as well as the design of an improved TEV-tag that enables efficient hydrolysis when inserted into the middle of a protein. This versatile and portable approach can likely be adapted for other applications, and broadly adopted. We provide guidelines for the design of internally tagged proteins in order to empower scientists with little or no protein engineering expertise to internally tag their target proteins.

Authors Sabine Oesterle, Tania Michelle Roberts, Lukas Andreas Widmer, Harun Mustafa, Sven Panke, Sonja Billerbeck

Submitted BMC Biology

Link DOI

Abstract Repetitive elements generally, and Alu inserts specifically, are large contributors to the recent evolution of the human genome. By assembling the sequences of novel Alu inserts using their respective subfamily consensus sequences as references, we found an exponential decay in the Alu subfamily call enrichment with increased number of sequence variants (Pearson correlation r=−0.68, p<0.0039). By mapping the sequences of these inserts to a human reference genome, we infer the reference Alu sources of a subset of the novel Alus, of which 85% were previously shown to be active. We also evaluate relationships between the loci of the novel inserts and their inferred sources.

Authors Harun Mustafa, Matei David, Michael Brudno

Submitted Mobile Genetic Elements

Link DOI

Abstract High-throughput sequencing technologies have allowed for the cataloguing of variation in personal human genomes. In this manuscript, we present alu-detect, a tool that combines read-pair and split-read information to detect novel Alus and their precise breakpoints directly from either whole-genome or whole-exome sequencing data while also identifying insertions directly in the vicinity of existing Alus. To set the parameters of our method, we use simulation of a faux reference, which allows us to compute the precision and recall of various parameter settings using real sequencing data. Applying our method to 100 bp paired Illumina data from seven individuals, including two trios, we detected on average 1519 novel Alus per sample. Based on the faux-reference simulation, we estimate that our method has 97% precision and 85% recall. We identify 808 novel Alus not previously described in other studies. We also demonstrate the use of alu-detect to study the local sequence and global location preferences for novel Alu insertions.

Authors Matei David, Harun Mustafa, Michael Brudno

Submitted Nucleic Acids Research

Link DOI