Harun Mustafa, MSc ETH UZH in Computational Biology and Bioinformatics

Luke: "You want the impossible... I don't... I don't believe it!" — Yoda: "That is why you fail."

PhD Student

E-Mail
harun.mustafa@get-your-addresses-elsewhere.inf.ethz.ch
Phone
+41 43 254 0225
Address
Biomedical Informatics Group
Schmelzbergstrasse 26
SHM 26 B 5
8006 Zürich
Room
SHM 26 B 5
twitter
@gxr

My main research interests are in the development of data structures and algorithms to allow for efficient searching and annotation of high-throughput genome and metagenome sequencing data.

I completed my honours B.Sc. with high distinction at the University of Toronto, with a double major in computational biology and mathematics. Under the supervision of Michael Brudno, I developed methods for assembling the sequences of novel Alu insertions detected in second-generation sequencing data. I completed my M.Sc. in computational biology at ETH Zürich, where, under the joint supervision of Sven Panke and Jörg Stelling, I developed a classification method for identifying internal sites in proteins that are permissive to tag insertion. I joined the Biomedical Informatics Group in 2017 as a Ph.D. student.

Abstract Background: Internal tagging of proteins by inserting small functional peptides into surface-accessible permissive sites has proven to be an indispensable tool for basic and applied science. Permissive sites are typically identified by transposon mutagenesis on a case-by-case basis, limiting scalability and their exploitation as a system-wide protein engineering tool. Methods: We developed an approach for predicting permissive stretches (PSs) in proteins based on the identification of length-variable regions (regions containing indels) in homologous proteins. Results: We verify that a protein's primary structure information alone is sufficient to identify PSs. Identified PSs are predicted to be predominantly surface accessible; hence, the position of inserted peptides is likely suitable for diverse applications. We demonstrate the viability of this approach by inserting a Tobacco etch virus protease recognition site (TEV-tag) into several PSs in a wide range of proteins, from small monomeric enzymes (adenylate kinase) to large multi-subunit molecular machines (ATP synthase), and verify their functionality after insertion. We apply this method to engineer conditional protein knockdowns directly in the Escherichia coli chromosome and generate a cell-free platform with enhanced nucleotide stability. Conclusions: Functional internally tagged proteins can be rationally designed and directly chromosomally implemented. Critical for the successful design of protein knockdowns was the incorporation of surface accessibility and secondary structure predictions, as well as the design of an improved TEV-tag that enables efficient hydrolysis when inserted into the middle of a protein. This versatile and portable approach can likely be adapted for other applications and broadly adopted. We provide guidelines for the design of internally tagged proteins in order to empower scientists with little or no protein engineering expertise to internally tag their target proteins.

Authors Sabine Oesterle, Tania Michelle Roberts, Lukas Andreas Widmer, Harun Mustafa, Sven Panke, Sonja Billerbeck

Submitted BMC Biology

Link DOI
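
The abstract above describes predicting permissive stretches from length-variable regions (regions containing indels) in homologous proteins. As a rough illustration of that idea only, and not the published method, the sketch below scans a pre-computed multiple sequence alignment and reports stretches of the reference protein where at least one homolog has an insertion or deletion. The function name, the input format (equal-length aligned strings with "-" for gaps), and the toy sequences are all assumptions made for this example; the surface accessibility and secondary structure predictions mentioned in the abstract are not modelled.

```python
def length_variable_regions(reference, homologs):
    """Return (start, end) intervals of reference residues (0-based, inclusive)
    that coincide with alignment columns containing both gaps and residues.

    Hypothetical helper: `reference` and `homologs` are rows of one
    multiple sequence alignment, using "-" as the gap character.
    """
    rows = [reference, *homologs]
    if len({len(row) for row in rows}) != 1:
        raise ValueError("all aligned sequences must have the same length")

    flagged = set()
    ref_pos = -1  # index of the most recent reference residue seen so far
    for col, ref_char in enumerate(reference):
        if ref_char != "-":
            ref_pos += 1
        column = [row[col] for row in rows]
        # A column is length-variable if it mixes gaps and residues.
        if "-" in column and any(c != "-" for c in column):
            # Insertions relative to the reference are attributed to the
            # residue just before them (or to residue 0 at the N-terminus).
            flagged.add(max(ref_pos, 0))

    # Merge consecutive flagged residues into intervals.
    regions = []
    for pos in sorted(flagged):
        if regions and pos == regions[-1][1] + 1:
            regions[-1][1] = pos
        else:
            regions.append([pos, pos])
    return [tuple(region) for region in regions]


# Toy alignment, invented for illustration.
reference = "MKT-AYIAKQR"
homologs = ["MKTGAYIA--R", "MKT-AYIAKQR"]
print(length_variable_regions(reference, homologs))
```

In this toy alignment, one homolog carries an extra residue after the third reference residue and lacks two residues near the C-terminus, so both locations are reported as candidate stretches.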

Abstract Repetitive elements generally, and Alu inserts specifically, are a large contributor to the recent evolution of the human genome. By assembling the sequences of novel Alu inserts using their respective subfamily consensus sequences as references, we found an exponential decay in the Alu subfamily call enrichment with increased number of sequence variants (Pearson correlation r=−0.68, p<0.0039). By mapping the sequences of these inserts to a human reference genome, we infer the reference Alu sources of a subset of the novel Alus, of which 85% were previously shown to be active. We also evaluate relationships between the loci of the novel inserts and their inferred sources.

Authors Harun Mustafa, Matei David, Michael Brudno

Submitted Mobile Genetic Elements

Link DOI

Abstract High-throughput sequencing technologies have allowed for the cataloguing of variation in personal human genomes. In this manuscript, we present alu-detect, a tool that combines read-pair and split-read information to detect novel Alus and their precise breakpoints directly from either whole-genome or whole-exome sequencing data while also identifying insertions directly in the vicinity of existing Alus. To set the parameters of our method, we use simulation of a faux reference, which allows us to compute the precision and recall of various parameter settings using real sequencing data. Applying our method to 100 bp paired Illumina data from seven individuals, including two trios, we detected on average 1519 novel Alus per sample. Based on the faux-reference simulation, we estimate that our method has 97% precision and 85% recall. We identify 808 novel Alus not previously described in other studies. We also demonstrate the use of alu-detect to study the local sequence and global location preferences for novel Alu insertions.

Authors Matei David, Harun Mustafa, Michael Brudno

Submitted Nucleic Acids Research

Link DOI
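
The alu-detect abstract above reports precision and recall estimated from a faux-reference simulation. The sketch below shows, under stated assumptions, how such figures could be computed once a set of known insertion sites and a set of called sites are available. It is not taken from alu-detect: the function name, the (chromosome, position) tuple format, and the matching tolerance of 10 bp are hypothetical choices for this example.

```python
def precision_recall(truth, calls, tolerance=10):
    """Compute (precision, recall) for called insertion sites.

    Hypothetical helper: `truth` and `calls` are lists of
    (chromosome, position) tuples. A call counts as a true positive if it
    lies within `tolerance` bp of a not-yet-matched true site on the same
    chromosome; each true site can be matched at most once.
    """
    matched = set()
    true_positives = 0
    for chrom, pos in calls:
        hit = next(
            (
                site
                for site in truth
                if site not in matched
                and site[0] == chrom
                and abs(site[1] - pos) <= tolerance
            ),
            None,
        )
        if hit is not None:
            matched.add(hit)
            true_positives += 1

    precision = true_positives / len(calls) if calls else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Invented coordinates, for illustration only.
truth = [("chr1", 100), ("chr1", 5000), ("chr2", 42), ("chr3", 7)]
calls = [("chr1", 103), ("chr2", 40), ("chr2", 900)]
print(precision_recall(truth, calls))
```

Here two of the three calls match a known site and two of the four known sites are recovered, so precision is 2/3 and recall is 1/2.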