Amir Joudaki

“The mind is its own place, and in itself can make a heaven of hell, a hell of heaven.” ― John Milton, Paradise Lost

PhD Student

+41 44 632 23 74
ETH Zürich
Department of Computer Science
Biomedical Informatics Group
Universitätsstrasse 6
CAB F52.1
8092 Zürich

I am currently a PhD candidate in Prof. Gunnar Rätsch's lab, interested in applying statistical machine learning methods to understand genomics data.

I completed my BSc in computer science/engineering at Sharif University of Technology, Tehran, Iran, and my master's in cognitive neuroscience at SISSA, Trieste, Italy.

Besides my studies, I worked on ways to make well-known algorithms, such as k-nearest neighbour search and non-linear dimensionality reduction methods, applicable to large-scale datasets.

During my PhD, I would like to work at the border between theory and real-world problems, particularly those that involve large amounts of data.



Abstract High-throughput DNA sequencing data is accumulating in public repositories, and efficient approaches for storing and indexing such data are in high demand. In recent research, several graph data structures have been proposed to represent large sets of sequencing data and allow for efficient query of sequences. In particular, the concept of colored de Bruijn graphs has been explored by several groups. While there has been good progress towards representing the sequence graph in small space, methods for storing a set of labels on top of such graphs are still not sufficiently explored. It is also currently not clear how characteristics of the input data, such as the sparsity and correlations of labels, can help to inform the choice of method to compress the labels. In this work, we present a systematic analysis of five different state-of-the-art annotation compression schemes that evaluates key metrics on both artificial and real-world data and discusses how different data characteristics influence the compression performance. In addition, we present a new approach, Multi-BRWT, that shows an up to 50% improvement in compression performance over the current state-of-the-art and is adaptive to different kinds of input data. Using our comprehensive test datasets, we show that this improvement can be robustly reproduced for different representative real-world datasets.

Authors Mikhail Karasikov, Harun Mustafa, Amir Joudaki, Sara Javadzadeh, Gunnar Rätsch, Andre Kahles

Submitted bioRxiv

Link DOI