MetaGraph: Making the World’s Genomic Data Searchable

Date 08 October 2025   Categories Featured, News

Our group has developed MetaGraph, a scalable framework that enables full-text search across the world’s genomic data, transforming petabytes of sequence information into a compact, accessible resource for biomedical research.

The volume of genomic sequence data available in public repositories is growing at an unprecedented rate and now exceeds 100 petabytes, a scale comparable to all text data in the browsable internet. These datasets contain DNA, RNA, and protein sequences from all forms of life and form the foundation for much of today’s biological and medical research. However, despite their immense value, searching across this distributed, heterogeneous, and rapidly expanding body of data remains a major computational challenge.

Addressing the challenge of large-scale sequence search

Our research addresses this challenge by introducing MetaGraph, a software framework that enables efficient, cost-effective, and full-text search over massive collections of biological sequences. MetaGraph builds on recent advances in succinct data structures and sequence graph representations to store and index genomic data in a way that is both highly compact and scalable. Using this approach, we achieve compression factors of 380 times on average, allowing the entirety of public DNA, RNA, and protein sequences to fit on just a few standard hard drives, while maintaining full searchability at base-level resolution.

Turning global sequence data into a searchable digital library

What makes MetaGraph distinctive is its ability to index and query the full diversity of publicly available biological sequences. This transforms previously fragmented databases into a unified, searchable “digital library of life,” enabling researchers to determine whether a sequence of interest, such as a variant associated with a disease, an antibiotic resistance gene, or a viral genome fragment, has been observed before and in which datasets it appears.

This capability enables a range of downstream applications, including:

  • Genomic surveillance of pathogens and variants in real time.
     
  • Comparative genomics across large-scale datasets without redundant storage.
     
  • Rapid annotation of genetic elements through direct sequence lookup.
     
  • Accelerated biomedical discovery through the integration of global data resources.
     

Impact and future directions

By making large-scale sequence data both searchable and accessible, MetaGraph lays the foundation for what we envision as a “Google for DNA,” a global search infrastructure for genomics. Beyond immediate applications in pathogen monitoring and disease genetics, MetaGraph provides a technical basis for building cross-domain data indexing systems that connect genomics with transcriptomics, proteomics, and metagenomics.

MetaGraph demonstrates that with the proper computational framework, the world’s biological sequence data can be made fully searchable, reproducible, and reusable. This development paves the way for faster and more collaborative discovery in the life sciences.

Read the full paper in Nature: https://www.nature.com/articles/s41586-025-09603-w

Further coverage: