Automated Annotation of Single Cell Data (EXPERIMENTAL)
CZ CELLxGENE Annotate enables automatic cell type annotation of newly generated single cell datasets using pre-trained models. Annotation augments your dataset with the following new information:
- transferred cell type labels (a.k.a. the most probable cell type)
- uncertainty scores (a measure of confidence in the predicted label, ranging from 0 [confident] to 1 [uncertain])
- a projection of the query dataset into the reference embedding, and a newly computed UMAP based on that projection
After you complete the automatic annotation workflow, you can explore and refine your predictions using standard features in CELLxGENE such as:
- visualizing clusters by predicted cell type
- plotting marker genes
- performing differential expression
- refining cell type labels by creating custom annotations
Disclaimer: The Annotation feature in CELLxGENE is new and experimental. We intend for the results generated by the workflow to be a starting point for annotating your single cell dataset. If you run into issues completing this workflow, we would love to hear about your experience. Feel free to reach out to cellxgene@chanzuckerberg.com to get help running the workflow or to forward constructive feedback. In the meantime, in order to obtain the best results, we make the following recommendations:
- use this predictive annotation workflow with data generated by 10X platforms
- use the tissue model which is most closely related to the tissue profiled in your single cell experiment (currently available models are: lung, pan-tissue immune)
Installation of CELLxGENE[annotate]
You can install the latest experimental release of CELLxGENE Annotate to gain access to the automatic annotation subcommand. It is highly recommended to perform installation in a clean Python environment using Conda or venv with Python version 3.9+. Find examples of virtual environment setup using venv and conda below.
CELLxGENE[annotate] is currently only tested on OS X and Ubuntu operating systems.
Option 1: conda setup
To install conda, download and install from one of the distributions:
conda create -n cellxgene 'python=3.9'
conda activate cellxgene
pip install 'cellxgene[annotate]'
To test the success of your installation, run:
cellxgene annotate --help
Option 2: venv setup
For an overview of venv, refer to the link in the above section.
python3 -m venv cxg_annotate # create a virtual environment in the current directory
source cxg_annotate/bin/activate
pip install 'cellxgene[annotate]'
If conda is not installed on your system, install these additional packages:
pip install scvi-tools==0.16.2 'scanpy[leiden]==1.9.3' xgboost==1.6.1 torch==1.11.0
If this fails, we recommend that you use a Conda environment instead (see the conda setup section above).
To test the success of your installation, run:
cellxgene annotate --help
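If the help command fails inside a venv, one thing worth checking is whether the pinned packages from the pip command above are actually installed at those versions. The following is a minimal sketch using only the Python standard library; the helper names (`installed_version`, `find_mismatches`) are illustrative and not part of CELLxGENE.

```python
from importlib import metadata

# Pins copied from the pip install command in the venv setup section.
PINNED = {
    "scvi-tools": "0.16.2",
    "scanpy": "1.9.3",
    "xgboost": "1.6.1",
    "torch": "1.11.0",
}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pinned, lookup=installed_version):
    """Map each package that is missing or off-pin to (expected, found)."""
    mismatches = {}
    for package, expected in pinned.items():
        found = lookup(package)
        # Ignore local build suffixes such as "+cpu" when comparing.
        base = found.split("+")[0] if found else None
        if base != expected:
            mismatches[package] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for package, (expected, found) in find_mismatches(PINNED).items():
        print(f"{package}: expected {expected}, found {found or 'not installed'}")
```

Run it with the virtual environment activated; no output means every pinned package was found at the expected version.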
CELLxGENE Annotation Workflow
Use the quickstart guide to check that you can run the pipeline end to end, or see the sections that follow for detailed descriptions of each step of the workflow. Both assume that you have set up a clean virtual environment to work from.
Quickstart
In this quickstart, we will use our lung model (trained on the Integrated Human Lung Cell Atlas) to annotate a subset of another human lung dataset, LungMap. You can download the entire LungMap dataset here. Depending on your environment manager, execute one of the following sets of quickstart commands:
conda
# create a directory to download data, run annotation pipeline and store annotated object
mkdir annotations
# navigate to newly created directory
cd annotations
# retrieve sample dataset (alternatively, you can use curl to retrieve the file)
# ex: curl https://cellxgene-annotation-public.s3.us-west-2.amazonaws.com/cell_type/tutorial/minilung.h5ad > minilung.h5ad
wget https://cellxgene-annotation-public.s3.us-west-2.amazonaws.com/cell_type/tutorial/minilung.h5ad
# generate annotations
cellxgene annotate minilung.h5ad -m https://cellxgene-annotation-public.s3.us-west-2.amazonaws.com/cell_type/models/hlca_20220920223732.zip -o minilungAnnotated.h5ad --mlflow-env-manager conda --no-use-gpu --gene-column-name feature_name
# launch cellxgene explorer
cellxgene launch minilungAnnotated.h5ad
venv
# create a directory to download data, run annotation pipeline and store annotated object
mkdir annotations
# navigate to newly created directory
cd annotations
# retrieve sample dataset (alternatively, you can use curl to retrieve the file)
# ex: curl https://cellxgene-annotation-public.s3.us-west-2.amazonaws.com/cell_type/tutorial/minilung.h5ad > minilung.h5ad
wget https://cellxgene-annotation-public.s3.us-west-2.amazonaws.com/cell_type/tutorial/minilung.h5ad
# generate annotations
cellxgene annotate minilung.h5ad -m https://cellxgene-annotation-public.s3.us-west-2.amazonaws.com/cell_type/models/hlca_20220920223732.zip -o minilungAnnotated.h5ad --mlflow-env-manager local --no-use-gpu --gene-column-name feature_name
# launch cellxgene explorer
cellxgene launch minilungAnnotated.h5ad
By default, you can find annotations in the obs dataframe under the names cxg_cell_type_predicted for cell type labels and cxg_cell_type_predicted_uncertainty for the corresponding uncertainty scores.
You can find an already annotated version of the LungMap subset here.
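As a first look at the quickstart output, it can help to count cells per predicted label and see how many of them carry a high uncertainty score. The sketch below is one way to do that; the 0.5 cutoff and the helper names are assumptions chosen for illustration, not values recommended by CELLxGENE. The `anndata` import is deferred so the counting helper can be used on plain sequences.

```python
from collections import Counter

LABEL_COLUMN = "cxg_cell_type_predicted"
UNCERTAINTY_COLUMN = "cxg_cell_type_predicted_uncertainty"


def summarize_predictions(labels, uncertainties, max_uncertainty=0.5):
    """Return {label: (total cells, cells above the uncertainty cutoff)}.

    max_uncertainty is an illustrative threshold, not an official default.
    """
    totals = Counter()
    uncertain = Counter()
    for label, score in zip(labels, uncertainties):
        totals[label] += 1
        if score > max_uncertainty:
            uncertain[label] += 1
    return {label: (totals[label], uncertain[label]) for label in totals}


def summarize_h5ad(path, max_uncertainty=0.5):
    """Load an annotated h5ad file and summarize its default annotation columns."""
    import anndata  # imported lazily; only needed when reading a file

    obs = anndata.read_h5ad(path).obs
    return summarize_predictions(
        obs[LABEL_COLUMN], obs[UNCERTAINTY_COLUMN], max_uncertainty
    )
```

For the quickstart, `summarize_h5ad("minilungAnnotated.h5ad")` would return one `(total, uncertain)` pair per predicted cell type, assuming the default column names were not changed.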
Step 0: Annotation workflow inputs
CELLxGENE's automated annotation has two essential inputs:
- a query dataset to be annotated (structured as an anndata object; read more about the anndata object here)
- a model that will be used to generate the annotations
Query data
You should create an anndata object (written to an h5ad file) that is compatible with the annotate subcommand:
- contains raw counts in adata.X, adata.raw.X, or in a specified layer stored in adata.layers
- meets the additional requirements for launching and visualizing a dataset with CELLxGENE desktop
- note that the input object does not require a visualization embedding and that one will be generated as a part of running the annotation pipeline
- the embedding that results from running the annotation pipeline will be stored as adata.obsm['cxg_cell_type_umap']; however, you can use the command line options --annotation-prefix, --annotation-type, and --run-name to manipulate the name of the resulting embedding
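Because the query data must contain raw counts, a quick sanity check before running the workflow can save a failed run. The sketch below treats a matrix as raw counts when every value is a non-negative whole number; this is a heuristic assumed here for illustration, not the check that cellxgene annotate itself performs, and the helper names are hypothetical.

```python
def looks_like_raw_counts(values):
    """True when every value is a non-negative whole number (e.g. 0, 3, 12.0).

    Note: an empty input also returns True, since there is nothing to reject.
    """
    return all(v >= 0 and float(v).is_integer() for v in values)


def find_raw_counts(candidates):
    """Return the name of the first candidate that looks like raw counts, else None.

    candidates maps a location name (such as "X" or "raw.X") to a flat
    iterable of that matrix's values.
    """
    for name, values in candidates.items():
        if looks_like_raw_counts(values):
            return name
    return None
```

With an anndata object loaded, the values could be taken from `adata.X.data` when `adata.X` is a SciPy sparse matrix, or from `adata.X.ravel()` when it is a dense NumPy array. On large datasets, checking a subset of cells is usually enough.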
Generating annotations from 10X cellranger outputs
10X cellranger is one of the standard pipelines for quantifying gene expression values (generating the count matrix). While the annotate subcommand does not currently offer the ability to read cellranger outputs directly, it is relatively easy to create a compatible anndata object using scanpy. Here are some resources to get you started:
Once you have constructed your object, it is important to perform initial QC filtering to remove low-quality cells from the dataset. See this link for more details (specifically, the section entitled 'Basic Filtering').
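The conversion and filtering described above can be sketched with scanpy as follows. This is a minimal example under stated assumptions: the function name is hypothetical, the input is assumed to be a cellranger filtered feature-barcode matrix directory, and the `min_genes=200` cutoff is borrowed from the scanpy basic filtering tutorial rather than prescribed by CELLxGENE. Choose thresholds that suit your own data.

```python
def build_query_h5ad(matrix_dir, out_path, min_genes=200):
    """Read a cellranger matrix directory, drop low-quality cells, write an h5ad file.

    matrix_dir: directory containing matrix.mtx, barcodes.tsv and the
                features/genes file produced by cellranger.
    out_path:   where to write the resulting h5ad file.
    min_genes:  cells expressing fewer genes than this are removed
                (illustrative default, not a CELLxGENE requirement).
    """
    import scanpy as sc  # imported lazily; requires scanpy to be installed

    # The count matrix is read into adata.X as raw counts.
    adata = sc.read_10x_mtx(matrix_dir, var_names="gene_symbols")

    # Basic QC: remove cells with very few detected genes (modifies adata in place).
    sc.pp.filter_cells(adata, min_genes=min_genes)

    adata.write_h5ad(out_path)
    return adata
```

The counts are left unnormalized on purpose, since the annotate subcommand expects raw counts in the query object.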
Prediction model
You can download models to perform prediction for the following tissues:
Reference Mapping Models:
Logistic Regression Models:
Step 1: Running cellxgene annotate
After installing the latest version of CELLxGENE and the annotate subcommand, you can run a standard call to the command like so (see cellxgene annotate --help for a full list of available options):
cellxgene annotate ./pathTo/data.h5ad --model-url modelURL --output-h5ad-file ./pathTo/annotatedData.h5ad
In the call to the annotate subcommand above, we specify three arguments:
- the local path to your query dataset (in anndata format)
- the URL of the tissue model that you would like to use to perform reference mapping (for a list of available reference models, refer to the section above)
- the local path where cellxgene annotate will write a new anndata object that includes the predicted annotations
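If you want to drive the subcommand from a script, for example to annotate several files with the same model, the call can be assembled as an argument list. The sketch below uses only flags that appear in this guide; the function names and keyword arguments are hypothetical conveniences, not part of CELLxGENE.

```python
import shlex
import subprocess


def build_annotate_command(query_h5ad, model_url, output_h5ad,
                           env_manager=None, use_gpu=True, gene_column=None):
    """Assemble a `cellxgene annotate` call as a list of arguments."""
    cmd = [
        "cellxgene", "annotate", query_h5ad,
        "--model-url", model_url,
        "--output-h5ad-file", output_h5ad,
    ]
    if env_manager:  # "conda" or "local", as in the quickstart
        cmd += ["--mlflow-env-manager", env_manager]
    if not use_gpu:
        cmd.append("--no-use-gpu")
    if gene_column:  # e.g. "feature_name", as in the quickstart
        cmd += ["--gene-column-name", gene_column]
    return cmd


def run_annotate(cmd):
    """Print the command, then run it; raises CalledProcessError on failure."""
    print(shlex.join(cmd))
    subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting problems with paths that contain spaces.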
Note: the first use of a given model will:
- trigger a one-time download of the model, which is stored locally for repeated use (in .model_cache by default)
- take some extra time to set up an (internal) Python environment in which to run the model
Once your call to cellxgene annotate completes, you will have the following new information in your anndata object:
- transferred annotations/labels [categorical] (i.e. cell type) - stored under adata.obs['cxg_cell_type_predicted'] by default
- uncertainty scores [numeric] - stored under adata.obs['cxg_cell_type_predicted_uncertainty'] by default
- a mapping into the reference embedding space - stored under adata.obsm['cxg_cell_type'] by default
- a UMAP layout for the query data in this reference embedding space - stored under adata.obsm['cxg_cell_type_umap'] by default
- metadata about the model used - stored under adata.uns['cxg_cell_type_predictions']['model_url'] by default
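To confirm that a run produced everything listed above, you can check the output file for the default keys. The sketch below assumes the default names were not changed through command line options; the helper names are illustrative only, and the `anndata` import is deferred so the key check can be used on its own.

```python
# Default output locations listed above.
EXPECTED = {
    "obs": ["cxg_cell_type_predicted", "cxg_cell_type_predicted_uncertainty"],
    "obsm": ["cxg_cell_type", "cxg_cell_type_umap"],
    "uns": ["cxg_cell_type_predictions"],
}


def missing_outputs(present):
    """List expected outputs that are absent.

    present maps "obs", "obsm" and "uns" to the collection of keys found there.
    """
    return [
        f"{slot}[{key!r}]"
        for slot, keys in EXPECTED.items()
        for key in keys
        if key not in present.get(slot, ())
    ]


def missing_outputs_in_h5ad(path):
    """Load an annotated h5ad file and report any missing default outputs."""
    import anndata  # imported lazily; only needed when reading a file

    adata = anndata.read_h5ad(path)
    return missing_outputs({
        "obs": adata.obs.columns,
        "obsm": adata.obsm.keys(),
        "uns": adata.uns.keys(),
    })
```

An empty list from `missing_outputs_in_h5ad("./pathTo/annotatedData.h5ad")` indicates that all default outputs are present.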
The cellxgene annotate subcommand offers a number of options to customize how the workflow is run and to adapt to different scenarios. You can display the full list of available options like so:
cellxgene annotate -h
Step 2: Explore and Refine
After you have successfully completed the annotation workflow, you are ready to explore your results in CELLxGENE Annotate!
You can launch your annotated anndata object in CELLxGENE Annotate like so:
cellxgene launch ./pathTo/annotatedData.h5ad
You will now have access to the standard set of CELLxGENE Annotate features to explore your single cell dataset. Here are some initial steps that you might take to start exploring your annotated dataset:
- color by predicted cell type
- color by prediction confidence score
- calculate differential expression between two annotated populations to see which genes distinguish them from each other
- create custom annotations to further refine or correct the predicted annotations
For a reference on how one might refine predicted annotations, you can refer to this Nature Tutorial.