How AI can help us understand how cells work—and help cure diseases

As the smallest living units, cells are key to understanding disease—and yet so much about them remains unknown. We do not know, for example, how billions of biomolecules—like DNA, proteins, and lipids—come together to act as one cell. Nor do we know how our many types of cells interact within our bodies. We have limited understanding of how cells, tissues, and organs become diseased and what it takes for them to be healthy.

AI can help us answer these questions and apply that knowledge to improve health and well-being worldwide—if researchers can access and harness these powerful new technologies.

Imagine if we had a way to represent every cell state and cell type using AI models. A “virtual cell” could simulate the appearance and known characteristics of any cell type in our body—from the rods and cones that detect light in our retinas to the cardiomyocytes that keep our hearts beating.

Scientists could use such a simulator to predict how cells might respond to specific conditions and stimuli: how an immune cell responds to an infection, what happens at the cellular level when a child is born with a rare disease, or even how a patient’s body will respond to a new medication. Scientific discovery, patient diagnosis, and treatment decisions would all become faster, safer, and more efficient.

At the Chan Zuckerberg Initiative, we’re helping to generate the scientific data and build out the computing infrastructure to make this a reality—and give scientists the tools they need to take advantage of new advances in AI to help end disease.

The data

Advances in AI, coupled with large volumes of scientific data, have already made it possible to predict the structure of nearly all known proteins. DeepMind trained AlphaFold on 50 years’ worth of carefully collected data, and in just five years, they solved the mystery of protein structure. ESM, another AI system developed at Meta, is a protein language model trained not on words but on more than 60 million protein sequences. It is used for a wide range of applications, such as predicting protein structures and the effects of mutations from single sequences.

A virtual cell modeling system will also require large amounts of data. Since 2016, CZI has supported researchers globally in efforts to generate and annotate data about cells and their components, built tools to integrate these large data sets, and made them widely available for researchers to learn from and build upon.

A global consortium of researchers has been building a reference map of every cell type in the body, and our San Francisco Biohub is creating whole-organism cell atlases. Together, these data sets are yielding the first draft of the open-source Human Cell Atlas, which will chart cell types in the body from development to adulthood. Our SF Biohub and the Chan Zuckerberg Imaging Institute are partnering on OpenCell, which maps the locations of different proteins in our cells.

Researchers are also using machine-learning models like Geneformer and scGPT to explore large amounts of data about genes and cells, including data generated from CELLxGENE, the open-source software platform that CZI’s science and technology teams created to speed up single-cell research. Similarly, with a new prototype data portal for cryo-electron tomography, our Imaging Institute and our science and technology teams are engaging machine-learning experts to develop automated annotations of microscopy data. This will cut data processing time from months or even years to just weeks.

We are making the data as representative as possible to make sure scientific breakthroughs benefit everyone. This includes incorporating pediatric data into the Human Cell Atlas, filling gaps in our knowledge about the cellular mechanisms of diseases that arise in childhood. With our Ancestry Networks grants, we are also supporting researchers generating reference data about cells based on tissue samples from Black, Latino, Southeast Asian, and Indigenous people, among others from understudied racial, ethnic, and ancestral backgrounds.

Already, research teams have made discoveries using these well-curated data sets. One discovered that the broken gene linked to cystic fibrosis is expressed by a type of cell scientists had never come across before, while another identified the respiratory cells that are most vulnerable to SARS-CoV-2. Others are using the data to discover new options for splicing genes to potentially correct disease-causing mutations in specific cells.

These discoveries are the first step in developing treatments for diseases, and we believe that AI can significantly accelerate the pace of discovery going forward.

The compute

To create a virtual cell, we’re building a high-performance computing cluster with more than 1,000 H100 GPUs that will enable us to develop new AI models trained on various large data sets about cells and biomolecules, including those generated by our scientific institutes. Over time, we hope this will enable scientists to simulate every cell type in both healthy and diseased states, and to query those simulations to see how elusive biological phenomena likely play out, including how cells come into being, how they interact across the body, and how exactly disease-causing changes affect them.

Our computing cluster won’t be as large as those used in the private sector for commercial products, but once it’s up and running, it will be one of the world’s largest AI clusters for nonprofit scientific research. This will be an important resource for academic teams that are ready to use data sets in new ways but are held back by the prohibitive cost of accessing the latest AI technology. Like our other tools, these digital cell models, and their associated data and applications, will be openly accessible to researchers worldwide.

The people

Generating these data sets, building this computing cluster, and using AI for biology together make up the kind of multidisciplinary, collaborative effort that defines our work.

Our Biohub Network has brought together experts from different disciplines and institutions to tackle some of science’s biggest and riskiest challenges, which couldn’t be solved in traditional academic settings. Through projects like CELLxGENE, researchers around the world have helped build a single-cell data corpus—a testament to how effectively a shared resource for open science can grow with more collaborators contributing resources and brainpower.

When CZI first launched our science work in 2016, we committed to a big goal: to help the scientific community cure, prevent, or manage all disease by the end of this century. We believe this goal is possible and will be significantly advanced if leading scientists and technologists work together to make the most of the opportunities created by AI. We can start by unlocking the mysteries of our cells, and that can lead to work that helps end many diseases as we know them.

Priscilla Chan is cofounder and co-CEO of the Chan Zuckerberg Initiative. Priscilla’s work with patients and students in communities across the Bay Area as a pediatrician and teacher has informed her desire to make learning more personalized, find new paths to manage and cure disease, and expand opportunity for more people. Priscilla earned her BA in biology at Harvard University and her MD at UC San Francisco (UCSF).

Mark Zuckerberg is cofounder and co-CEO of the Chan Zuckerberg Initiative. As the founder, chairman, and chief executive officer of Meta, Mark brings deep technical experience and a commitment to empowering people and building communities to CZI’s work. Mark studied computer science at Harvard University before moving to Palo Alto, California, in 2004.