Harun Mustafa, Dr. sc. ETH Zürich

Luke: "You want the impossible... I don't... I don't believe it!" — Yoda: "That is why you fail."

Post Doc

E-Mail: harun.mustafa@get-your-addresses-elsewhere.inf.ethz.ch
Phone: +41 43 254 0225
Address: Biomedical Informatics Group
Schmelzbergstrasse 26
SHM 26 C 3
8006 Zürich
Room: SHM 26 C 3
twitter: @HarunMustafa416

My main research interests are in the development of data structures and algorithms to allow for efficient searching and annotation of high-throughput genome and metagenome sequencing data.

I completed my honours B.Sc. with high distinction at the University of Toronto, dual majoring in computational biology and mathematics. Under the supervision of Michael Brudno, I developed methods for assembling the sequences of novel Alu insertions detected in second-generation sequencing data. I completed my M.Sc. in computational biology at the ETH Zürich, where I developed a classification method for determining internal sites in proteins permissive to tag insertion under the joint supervision of Sven Panke and Jörg Stelling. I joined the Biomedical Informatics Group in 2017 as a Ph.D. student and completed my Ph.D. in 2022.

Nika Mansouri Ghiasi, Talu Güloglu, Harun Mustafa, Can Firtina, Konstantina Koliogeorgi, Konstantinos Kanellopoulos, Haiyu Mao, Rakesh Nadig, Mohammad Sadrosadati, Jisung Park, Onur Mutlu SAGe: A Lightweight Algorithm-Architecture Co-Design for Mitigating the Data Preparation Bottleneck in Large-Scale Genome Analysis arXiv

Abstract Given the exponentially growing volumes of genomic data, there are extensive efforts to accelerate genome analysis. We demonstrate a major bottleneck that greatly limits and diminishes the benefits of state-of-the-art genome analysis accelerators: the data preparation bottleneck, where genomic data is stored in compressed form and needs to be decompressed and formatted first before an accelerator can operate on it. To mitigate this bottleneck, we propose SAGe, an algorithm-architecture co-design for highly-compressed storage and high-performance access of large-scale genomic data. SAGe overcomes the challenges of mitigating the data preparation bottleneck while maintaining high compression ratios (comparable to genomic-specific compression algorithms) at low hardware cost. This is enabled by leveraging key features of genomic datasets to co-design (i) a new (de)compression algorithm, (ii) hardware, (iii) storage data layout, and (iv) interface commands to access storage. SAGe stores data in structures that can be rapidly interpreted and decompressed by efficient streaming accesses and lightweight hardware. To achieve high compression ratios using only these lightweight structures, SAGe exploits unique features of genomic data. We show that SAGe can be seamlessly integrated with a broad range of genome analysis hardware accelerators to mitigate their data preparation bottlenecks. Our results demonstrate that SAGe improves the average end-to-end performance and energy efficiency of two state-of-the-art genome analysis accelerators by 3.0x-32.1x and 18.8x-49.6x, respectively, compared to when the accelerators rely on state-of-the-art decompression tools.

Authors Nika Mansouri Ghiasi, Talu Güloglu, Harun Mustafa, Can Firtina, Konstantina Koliogeorgi, Konstantinos Kanellopoulos, Haiyu Mao, Rakesh Nadig, Mohammad Sadrosadati, Jisung Park, Onur Mutlu

Submitted arXiv

Harun Mustafa, Dr. sc. ETH Zürich

Post Doc

Latest Publications