The volume of genomic sequence data available in public repositories is growing at an unprecedented rate and now exceeds 100 petabytes, a scale comparable to all text data on the browsable internet. These datasets contain DNA, RNA, and protein sequences from all forms of life and form the foundation for much of today’s biological and medical research. However, despite their immense value, searching across this distributed, heterogeneous, and rapidly expanding body of data remains a major computational challenge.
Addressing the challenge of large-scale sequence search
Our research addresses this challenge by introducing MetaGraph, a software framework that enables efficient, cost-effective full-text search over massive collections of biological sequences. MetaGraph builds on recent advances in succinct data structures and sequence graph representations to store and index genomic data in a way that is both highly compact and scalable. Using this approach, we achieve an average compression factor of 380, allowing the entirety of public DNA, RNA, and protein sequences to fit on just a few standard hard drives, while maintaining full searchability at base-level resolution.
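To give a rough sense of why a graph representation can shrink redundant sequence data, the following Python sketch collects the distinct k-mers of a set of overlapping reads, as the node set of a de Bruijn graph would. It is an illustration under stated assumptions, not MetaGraph's implementation: the reads, the choice of k = 4, and the function names are hypothetical, and a plain Python set stands in for the succinct data structures mentioned above.

```python
def kmers(seq, k):
    """Yield every length-k substring (k-mer) of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_graph_nodes(reads, k):
    """Collect the distinct k-mers of all reads (de Bruijn graph nodes)."""
    nodes = set()
    for read in reads:
        nodes.update(kmers(read, k))
    return nodes


# Hypothetical input: overlapping 6-base reads from one short sequence,
# each read observed three times to mimic redundant sequencing coverage.
genome = "ACGTACGGTCA"
reads = [genome[i:i + 6] for i in range(len(genome) - 5)] * 3

nodes = build_graph_nodes(reads, k=4)
total_bases = sum(len(read) for read in reads)
print(f"{total_bases} input bases -> {len(nodes)} distinct 4-mers")
```

Because overlapping and repeated reads contribute the same k-mers, the number of stored nodes grows with the amount of distinct sequence rather than with the raw input size.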
Turning global sequence data into a searchable digital library
What makes MetaGraph distinctive is its ability to index and query the full diversity of publicly available biological sequences. This transforms previously fragmented databases into a unified, searchable “digital library of life,” enabling researchers to determine whether a sequence of interest, such as a variant associated with a disease, an antibiotic resistance gene, or a viral genome fragment, has been observed before and in which datasets it appears.
This capability enables a range of downstream applications, including:
- Genomic surveillance of pathogens and variants in real time.
 
- Comparative genomics across large-scale datasets without redundant storage.
 
- Rapid annotation of genetic elements through direct sequence lookup.
 
- Accelerated biomedical discovery through the integration of global data resources.
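The lookup described before this list, checking whether a sequence has been observed and in which datasets it appears, can be sketched as an index that maps each k-mer to the labels of the datasets containing it. This is a toy model under assumptions rather than MetaGraph's API: the dataset labels, sequences, function names, and the `min_fraction` parameter are all hypothetical, and ordinary dictionaries replace the compressed structures a real index would need.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield every length-k substring (k-mer) of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(datasets, k):
    """Map each k-mer to the set of dataset labels in which it occurs."""
    index = defaultdict(set)
    for label, sequences in datasets.items():
        for seq in sequences:
            for kmer in kmers(seq, k):
                index[kmer].add(label)
    return index


def query(index, seq, k, min_fraction=1.0):
    """Return {label: fraction of query k-mers found} for labels that
    reach min_fraction. Note: matching all k-mers suggests, but does not
    prove, that the full query occurs contiguously in that dataset."""
    query_kmers = list(kmers(seq, k))
    if not query_kmers:
        return {}
    hits = defaultdict(int)
    for kmer in query_kmers:
        for label in index.get(kmer, ()):
            hits[label] += 1
    total = len(query_kmers)
    return {label: n / total for label, n in hits.items()
            if n / total >= min_fraction}


# Hypothetical datasets, for illustration only.
datasets = {
    "sample_A": ["ACGTACGGTCA"],
    "sample_B": ["TTGACGGTCAAC"],
}
index = build_index(datasets, k=4)

print(query(index, "ACGTAC", k=4))   # fragment present only in sample_A
print(query(index, "ACGGTCA", k=4))  # fragment shared by both samples
print(query(index, "GGGGGG", k=4))   # fragment never observed
```

Lowering `min_fraction` turns the exact lookup into an approximate one that tolerates a few mismatching k-mers, which is one simple way to report near matches.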
 
Impact and future directions
By making large-scale sequence data both searchable and accessible, MetaGraph lays the foundation for what we envision as a “Google for DNA,” a global search infrastructure for genomics. Beyond immediate applications in pathogen monitoring and disease genetics, MetaGraph provides a technical basis for building cross-domain data indexing systems that connect genomics with transcriptomics, proteomics, and metagenomics.
MetaGraph demonstrates that, with the right computational framework, the world’s biological sequence data can be made fully searchable, reproducible, and reusable. This development paves the way for faster and more collaborative discovery in the life sciences.
Read the full paper in Nature: https://www.nature.com/articles/s41586-025-09603-w
Further coverage:
- Nature News Feature
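- ETH Zurich News: Eine Suchmaschine für DNA (A search engine for DNA)
- Netzwoche: Eine Suchmaschine für DNA (A search engine for DNA)
- Medinside: Wie Google - nur für Gene (Like Google, only for genes)
- Inuit: ETH Zürich entwickelt Suchmaschine für DNA (ETH Zurich develops a search engine for DNA)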
- SIB: A fast, accurate ‘sequence search engine’
- The Times of India: “Google for DNA”: Scientists create the world’s first and fastest genetic search engine
- Scientific American: New DNA Search Engine Brings Order to Biology’s Big Data