“Simplicity is the ultimate sophistication.” – Leonardo da Vinci
- amir.joudaki@ inf.ethz.ch
- +41 44 632 65 24
Department of Computer Science
Biomedical Informatics Group
- CAB F39
mathematical foundations of AI
Broad research vision: We stand at the dawn of a technological and scientific revolution, with AI as its center. Despite its potential, our understanding of advanced AI systems is still very limited, as revealed by the following fact: the human brain, the inspiration behind modern neural networks, operates on a mere 12 watts, in stark contrast to the megawatts consumed by advanced neural networks. Therefore, state-of-the-art neural networks burn millions of times more power and data than a biological brain while coming short of simple reasoning and common sense tasks. Perhaps more importantly, the collective carbon footprint of AI has been increasing exponentially, rendering it unsustainable in the long term. This observation points to an urgent need for a deeper, fundamental understanding of neural networks to build models that are not just efficient but also sustainably intelligent. As a doctoral researcher, my broad research vision is to move towards such a fundamental understanding of neural networks to pave the way for their wider adoption, including their biomedical applications.
Bio: Before joining ETH, I earned a BSc in Computer Engineering from Sharif University, an MPhil in Cognitive Neuroscience at SISSA, Italy, and an MSc in Computer Science from ETH Zurich. You can find my CV here
I'm very open to collaborative research efforts and mentoring future MSc students. Please reach out if you are interested in cutting-edge machine learning projects or want to engage in meaningful research discussions. Here is a list of available projects for MSc theses. Please send me an email with your CV and transcripts if you're interested :)
- GNN+genomics: Learning metagenome graphs by GNNs
- Neural Net theory+Transformers: Can a transformer learn sequence alignment?
- Neural Net theory: Replacing normalization with initialization
- Neural Net theory: Mystery of activation in neural nets
- Neural Net theory+Transformers: "Understanding pretraining speed in transformers."
- Neural Net theory: Zero-shot neural architecture search:
- Active Learning+medical: Doctor in the loop,
Abstract In this paper, we explore the structure of the penultimate Gram matrix in deep neural networks, which contains the pairwise inner products of outputs corresponding to a batch of inputs. In several architectures it has been observed that this Gram matrix becomes degenerate with depth at initialization, which dramatically slows training. Normalization layers, such as batch or layer normalization, play a pivotal role in preventing the rank collapse issue. Despite promising advances, the existing theoretical results (i) do not extend to layer normalization, which is widely used in transformers, (ii) can not characterize the bias of normalization quantitatively at finite depth. To bridge this gap, we provide a proof that layer normalization, in conjunction with activation layers, biases the Gram matrix of a multilayer perceptron towards isometry at an exponential rate with depth at initialization. We quantify this rate using the Hermite expansion of the activation function, highlighting the importance of higher order (≥2) Hermite coefficients in the bias towards isometry.
Authors Amir Joudaki, Hadi Daneshmand, Francis Bach
Submitted NeurIPS 2023 (poster)
Abstract Mean-field theory is widely used in theoretical studies of neural networks. In this paper, we analyze the role of depth in the concentration of mean-field predictions for Gram matrices of hidden representations in deep multilayer perceptron (MLP) with batch normalization (BN) at initialization. It is postulated that the mean-field predictions suffer from layer-wise errors that amplify with depth. We demonstrate that BN avoids this error amplification with depth. When the chain of hidden representations is rapidly mixing, we establish a concentration bound for a mean-field model of Gram matrices. To our knowledge, this is the first concentration bound that does not become vacuous with depth for standard MLPs with a finite width.
Authors Amir Joudaki, Hadi Daneshmand, Francis Bach
Submitted ICML 2023 (poster)
Abstract Sequence-to-graph alignment is crucial for applications such as variant genotyping, read error correction, and genome assembly. We propose a novel seeding approach that relies on long inexact matches rather than short exact matches, and demonstrate that it yields a better time-accuracy trade-off in settings with up to a 25% mutation rate. We use sketches of a subset of graph nodes, which are more robust to indels, and store them in a k-nearest neighbor index to avoid the curse of dimensionality. Our approach contrasts with existing methods and highlights the important role that sketching into vector space can play in bioinformatics applications. We show that our method scales to graphs with 1 billion nodes and has quasi-logarithmic query time for queries with an edit distance of 25%. For such queries, longer sketch-based seeds yield a 4× increase in recall compared to exact seeds. Our approach can be incorporated into other aligners, providing a novel direction for sequence-to-graph alignment.
Authors Amir Joudaki, Alexandru Meterez, Harun Mustafa, Ragnar Groot Koerkamp, André Kahles1 and Gunnar Rätsch
Submitted Genome Research, RECOMB 2023
Abstract This paper underlines an elegant property of batch-normalization (BN): Successive batch normalizations with random linear updates make samples increasingly orthogonal. We establish a non-asymptotic characterization of the interplay between depth, width, and the orthogonality of deep representations. More precisely, we prove, under a mild assumption, the deviation of the representations from orthogonality rapidly decays with depth up to a term inversely proportional to the network width. This result has two main theoretical and practical implications: 1) Theoretically, as the depth grows, the distribution of the outputs contracts to a Wasserstein-2 ball around an isotropic normal distribution. Furthermore, the radius of this Wasserstein ball shrinks with the width of the network. 2) Practically, the orthogonality of the representations directly influences the performance of stochastic gradient descent (SGD). When representations are initially aligned, we observe SGD wastes many iterations to disentangle representations before the classification. Nevertheless, we experimentally show that starting optimization from orthogonal representations is sufficient to accelerate SGD, with no need for BN.
Authors Hadi Daneshmand, Amir Joudaki, Francis Bach
Submitted NeurIPS 2021 (Spotlight)
Abstract The sharp increase in next-generation sequencing technologies’ capacity has created a demand for algorithms capable of quickly searching a large corpus of biological sequences. The complexity of biological variability and the magnitude of existing data sets have impeded finding algorithms with guaranteed accuracy that efficiently run in practice. Our main contribution is the Tensor Sketch method that efficiently and accurately estimates edit distances. In our experiments, Tensor Sketch had 0.88 Spearman’s rank correlation with the exact edit distance, almost doubling the 0.466 correlation of the closest competitor while running 8.8 times faster. Finally, all sketches can be updated dynamically if the input is a sequence stream, making it appealing for large-scale applications where data cannot fit into memory. Conceptually, our approach has three steps: 1) represent sequences as tensors over their sub-sequences, 2) apply tensor sketching that preserves tensor inner products, 3) implicitly compute the sketch. The sub-sequences, which are not necessarily contiguous pieces of the sequence, allow us to outperform fc-mer-based methods, such as min-hash sketching over a set of k-mers. Typically, the number of sub-sequences grows exponentially with the sub-sequence length, introducing both memory and time overheads. We directly address this problem in steps 2 and 3 of our method. While the sketching of rank-1 or super-symmetric tensors is known to admit efficient sketching, the sub-sequence tensor does not satisfy either of these properties. Hence, we propose a new sketching scheme that completely avoids the need for constructing the ambient space. Our tensor-sketching technique’s main advantages are three-fold: 1) Tensor Sketch has higher accuracy than any of the other assessed sketching methods used in practice. 2) All sketches can be computed in a streaming fashion, leading to significant time and memory savings when there is overlap between input sequences. 3) It is straightforward to extend tensor sketching to different settings leading to efficient methods for related sequence analysis tasks. We view tensor sketching as a framework to tackle a wide range of relevant bioinformatics problems, and we are confident that it can bring significant improvements for applications based on edit distance estimation.
Authors Amir Joudaki, Gunnar Rätsch, André Kahles
Submitted RECOMB 2021
Abstract High-throughput DNA sequencing data are accumulating in public repositories, and efficient approaches for storing and indexing such data are in high demand. In recent research, several graph data structures have been proposed to represent large sets of sequencing data and to allow for efficient querying of sequences. In particular, the concept of labeled de Bruijn graphs has been explored by several groups. Although there has been good progress toward representing the sequence graph in small space, methods for storing a set of labels on top of such graphs are still not sufficiently explored. It is also currently not clear how characteristics of the input data, such as the sparsity and correlations of labels, can help to inform the choice of method to compress the graph labeling. In this study, we present a new compression approach, Multi-binary relation wavelet tree (BRWT), which is adaptive to different kinds of input data. We show an up to 29% improvement in compression performance over the basic BRWT method, and up to a 68% improvement over the current state-of-the-art for de Bruijn graph label compression. To put our results into perspective, we present a systematic analysis of five different state-of-the-art annotation compression schemes, evaluate key metrics on both artificial and real-world data, and discuss how different data characteristics influence the compression performance. We show that the improvements of our new method can be robustly reproduced for different representative real-world data sets.
Authors Mikhail Karasikov, Harun Mustafa, Amir Joudaki, Sara Javadzadeh-No, Gunnar Rätsch, André Kahles
Submitted Journal of Computational Biology
Abstract High-throughput DNA sequencing data is accumulating in public repositories, and efficient approaches for storing and indexing such data are in high demand. In recent research, several graph data structures have been proposed to represent large sets of sequencing data and to allow for efficient querying of sequences. In particular, the concept of labeled de Bruijn graphs has been explored by several groups. While there has been good progress towards representing the sequence graph in small space, methods for storing a set of labels on top of such graphs are still not sufficiently explored. It is also currently not clear how characteristics of the input data, such as the sparsity and correlations of labels, can help to inform the choice of method to compress the graph labeling. In this work, we present a new compression approach, Multi-BRWT, which is adaptive to different kinds of input data. We show an up to 29% improvement in compression performance over the basic BRWT method, and up to a 68% improvement over the current state-of-the-art for de Bruijn graph label compression. To put our results into perspective, we present a systematic analysis of five different state-of-the-art annotation compression schemes, evaluate key metrics on both artificial and real-world data and discuss how different data characteristics influence the compression performance. We show that the improvements of our new method can be robustly reproduced for different representative real-world datasets.
Authors Mikhail Karasikov, Harun Mustafa, Amir Joudaki, Sara Javadzadeh-No, Gunnar Rätsch, Andre Kahles
Submitted RECOMB 2019