Automated Annotation of Single Cell Data (EXPERIMENTAL)

CZ CELLxGENE Annotate enables automatic cell type annotation of newly generated single cell datasets using pre-trained models. Annotation augments your dataset with the following new information:

  • transferred cell type labels (a.k.a. the most probable cell type)
  • uncertainty scores (a measure of confidence in the predicted label, ranging from 0 [confident] to 1 [uncertain])
  • a projection of the query dataset into the reference embedding, and a newly computed UMAP based on that projection

After you complete the automatic annotation workflow, you can explore and refine your predictions using standard features in CELLxGENE such as:

  • visualizing clusters by predicted cell type
  • plotting marker genes
  • performing differential expression
  • refining cell type labels by creating custom annotations

Disclaimer: The Annotation feature in CELLxGENE is new and experimental. We intend for the results generated by the workflow to be a starting point for annotating your single cell dataset. If you run into issues completing this workflow, we would love to hear about your experience. Feel free to reach out to cellxgene@chanzuckerberg.com to get help running the workflow or to forward constructive feedback. In the meantime, in order to obtain the best results, we make the following recommendations:

  • use this predictive annotation workflow with data generated by 10X platforms
  • use the tissue model which is most closely related to the tissue profiled in your single cell experiment (currently available models are: lung, pan-tissue immune)

Installation of CELLxGENE[annotate]

You can install the latest experimental release of CELLxGENE Annotate to gain access to the automatic annotation subcommand. It is highly recommended to perform installation in a clean Python environment using Conda or venv with Python version 3.9+. Find examples of virtual environment setup using venv and conda below.

CELLxGENE[annotate] is currently only tested on OS X and Ubuntu operating systems.

Option 1: conda setup

If conda is not already installed, download and install one of its distributions. Then create and activate a new environment and install CELLxGENE Annotate:

conda create -n cellxgene 'python=3.9'

conda activate cellxgene

pip install 'cellxgene[annotate]'

To test the success of your installation, run:

cellxgene annotate --help

Option 2: venv setup

For an overview of venv, refer to the link in the above section.

python3 -m venv cxg_annotate  # create a virtual environment in the current directory

source cxg_annotate/bin/activate

pip install 'cellxgene[annotate]'

If conda is not installed on your system, install these additional packages:

pip install scvi-tools==0.16.2 'scanpy[leiden]==1.9.3' xgboost==1.6.1 torch==1.11.0

If this fails, we recommend that you use a Conda environment (see the conda setup section above).

To test the success of your installation, run:

cellxgene annotate --help

CELLxGENE Annotation Workflow

Use the quickstart guide to check that you can run the pipeline end to end, or read the sections below for a detailed description of each step of the workflow. Both assume that you have set up a clean virtual environment to work from.

Quickstart

In this quickstart, we use our lung model (trained on the Integrated Human Lung Cell Atlas) to annotate a subset of another human lung dataset, LungMap. You can download the entire LungMap dataset here. Depending on your environment manager, execute one of the following sets of quickstart commands:

conda

# create a directory to download data, run annotation pipeline and store annotated object
mkdir annotations

# navigate to newly created directory
cd annotations

# retrieve sample dataset (alternatively, you can use curl to retrieve the file)
# ex: curl https://cellxgene-annotation-public.s3.us-west-2.amazonaws.com/cell_type/tutorial/minilung.h5ad > minilung.h5ad
wget https://cellxgene-annotation-public.s3.us-west-2.amazonaws.com/cell_type/tutorial/minilung.h5ad


# generate annotations
cellxgene annotate minilung.h5ad -m https://cellxgene-annotation-public.s3.us-west-2.amazonaws.com/cell_type/models/hlca_20220920223732.zip -o minilungAnnotated.h5ad --mlflow-env-manager conda --no-use-gpu --gene-column-name feature_name

# launch cellxgene explorer
cellxgene launch minilungAnnotated.h5ad

venv

# create a directory to download data, run annotation pipeline and store annotated object
mkdir annotations

# navigate to newly created directory
cd annotations

# retrieve sample dataset (alternatively, you can use curl to retrieve the file)
# ex: curl https://cellxgene-annotation-public.s3.us-west-2.amazonaws.com/cell_type/tutorial/minilung.h5ad > minilung.h5ad
wget https://cellxgene-annotation-public.s3.us-west-2.amazonaws.com/cell_type/tutorial/minilung.h5ad


# generate annotations
cellxgene annotate minilung.h5ad -m https://cellxgene-annotation-public.s3.us-west-2.amazonaws.com/cell_type/models/hlca_20220920223732.zip -o minilungAnnotated.h5ad --mlflow-env-manager local --no-use-gpu --gene-column-name feature_name

# launch cellxgene explorer
cellxgene launch minilungAnnotated.h5ad

By default, you can find annotations in the obs dataframe underneath the names: cxg_cell_type_predicted for cell type labels and cxg_cell_type_predicted_uncertainty for corresponding uncertainty scores.
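
As a quick sanity check on these two columns, the following minimal sketch summarizes the predictions per cell type. It assumes pandas is installed; the helper name summarize_predictions is hypothetical and not part of CELLxGENE. Only the two default column names come from the text above.

```python
import pandas as pd


def summarize_predictions(
    obs,
    label_col="cxg_cell_type_predicted",
    uncertainty_col="cxg_cell_type_predicted_uncertainty",
):
    """Return one row per predicted label with its cell count and mean uncertainty.

    `obs` is a pandas DataFrame such as `adata.obs`. This helper is a
    hypothetical illustration, not a CELLxGENE function.
    """
    return (
        obs.groupby(label_col, observed=True)[uncertainty_col]
        .agg(n_cells="count", mean_uncertainty="mean")
        .sort_values("n_cells", ascending=False)
    )


# Hypothetical usage with the quickstart output (requires the anndata package):
#   import anndata
#   adata = anndata.read_h5ad("minilungAnnotated.h5ad")
#   print(summarize_predictions(adata.obs))
```

Labels with a high mean uncertainty are natural candidates for closer inspection in the explorer.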

You can find an already annotated version of the LungMap subset here.

Step 0: Annotation workflow inputs

CELLxGENE's automated annotation has two essential inputs:

  • a query dataset to be annotated, structured as an anndata object (read more about the anndata object here)
  • a model that will be used to generate the annotations

Query data

You should create an anndata object (written to an h5ad file) that is compatible with the annotate subcommand:

  • contains raw counts in the adata.X, adata.raw.X, or in a specified layer stored in adata.layers
  • meets the additional requirements for launching and visualizing a dataset with CELLxGENE desktop
    • note that the input object does not require a visualization embedding and that one will be generated as a part of running the annotation pipeline
    • the embedding that results from running the annotation pipeline will be stored as adata.obsm['cxg_cell_type_umap'], however you can use the command line options --annotation-prefix, --annotation-type, and --run-name to manipulate the name of the resulting embedding
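
Before running the annotation, it can help to confirm that the matrix you plan to use really holds raw counts. The sketch below is a heuristic only: it assumes numpy is installed and treats a matrix as raw counts when every value is non-negative and integer-valued. The function name looks_like_raw_counts is hypothetical and is not something the annotate subcommand provides.

```python
import numpy as np


def looks_like_raw_counts(values):
    """Heuristic check: raw counts are non-negative and integer-valued.

    `values` can be any array-like of numbers. Normalized or log-transformed
    data usually contains fractional values and fails the check.
    """
    values = np.asarray(values, dtype=float)
    if values.size == 0:
        return False
    non_negative = np.all(values >= 0)
    integer_valued = np.all(values == np.round(values))
    return bool(non_negative and integer_valued)


print(looks_like_raw_counts([[0, 3], [12, 1]]))         # integer counts
print(looks_like_raw_counts([[0.0, 1.38], [2.2, 0.0]]))  # log-normalized values
```

For a scipy sparse matrix in CSR or CSC format, passing its `.data` attribute (the stored non-zero values) avoids converting the whole matrix to a dense array.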

Generating annotations from 10X cellranger outputs

10X cellranger is one of the standard pipelines for quantifying gene expression values (generating the count matrix). While the annotate subcommand does not currently offer the ability to read cellranger outputs directly, it is relatively easy to create a compatible anndata object using scanpy. Here are some resources to get you started:

Once you have constructed your object, it is important to perform initial QC filtering to remove low quality cells from the dataset. See this link for more details (specifically, the section entitled 'Basic Filtering').
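
To illustrate what basic filtering does, here is a minimal numpy sketch that keeps cells with enough detected genes and enough total counts. The thresholds and the function name basic_qc_mask are assumptions for illustration, not recommended values. In practice scanpy's `sc.pp.filter_cells` offers this kind of filtering directly on an anndata object.

```python
import numpy as np


def basic_qc_mask(counts, min_genes=200, min_counts=500):
    """Return a boolean mask of cells (rows) that pass basic QC.

    `counts` is a cells x genes array of raw counts. A cell passes when it
    has at least `min_genes` genes with a non-zero count and at least
    `min_counts` total counts. Both thresholds are illustrative assumptions.
    """
    counts = np.asarray(counts)
    genes_per_cell = (counts > 0).sum(axis=1)
    counts_per_cell = counts.sum(axis=1)
    return (genes_per_cell >= min_genes) & (counts_per_cell >= min_counts)


# Tiny example with low thresholds so the effect is visible.
toy_counts = np.array([[5, 0, 1], [0, 0, 1], [2, 2, 2]])
print(basic_qc_mask(toy_counts, min_genes=2, min_counts=3))
```

With an anndata object and a dense matrix, a mask like this could be applied as `adata[mask].copy()` to obtain the filtered object.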

Prediction model

You can download models to perform prediction for the following tissues:

Reference Mapping Models:

Logistic Regression Models:

Step 1: Running cellxgene annotate

After installing the latest version of CELLxGENE and the annotate subcommand, you can run a standard call to the command like so (see cellxgene annotate --help for a full list of available options):

cellxgene annotate ./pathTo/data.h5ad --model-url modelURL --output-h5ad-file ./pathTo/annotatedData.h5ad

In the call to the annotate subcommand above, we specify three arguments:

  • the local path to your query dataset (in anndata format)
  • the url of the tissue model that you would like to use to perform reference mapping (for a list of available reference models, refer to the section above)
  • the local path where cellxgene annotate will write a new anndata object that includes the predicted annotations
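
If you want to run the same call from a Python script, for example to annotate several files in a loop, the three arguments can be assembled as shown in this sketch. It uses only the standard library; the helper build_annotate_command is hypothetical, and the only options it relies on are the ones shown in this document (--model-url, --output-h5ad-file, --no-use-gpu).

```python
import shlex


def build_annotate_command(input_h5ad, model_url, output_h5ad, extra_args=()):
    """Build the argument list for a `cellxgene annotate` call.

    Hypothetical helper: it only assembles the command and does not run it.
    `extra_args` is passed through unchanged.
    """
    return [
        "cellxgene",
        "annotate",
        str(input_h5ad),
        "--model-url",
        model_url,
        "--output-h5ad-file",
        str(output_h5ad),
        *extra_args,
    ]


cmd = build_annotate_command(
    "./pathTo/data.h5ad",
    "modelURL",
    "./pathTo/annotatedData.h5ad",
    extra_args=["--no-use-gpu"],
)
print(shlex.join(cmd))  # shows the command as it would be typed in a shell

# To actually run it (requires cellxgene[annotate] to be installed):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single string avoids shell quoting problems when paths contain spaces.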

Note: the first use of a given model will:

  • trigger a one-time download of the model, which is stored locally for repeated use (in .model_cache by default)
  • take some extra time to set up an (internal) python env in which to run the model

Once your call to cellxgene annotate completes, you will have the following new information in your anndata object:

  • transferred annotations/labels [categorical] (i.e. cell type) - these will be stored under adata.obs['cxg_cell_type_predicted'] by default
  • uncertainty scores [numeric] - stored under adata.obs['cxg_cell_type_predicted_uncertainty'] by default
  • a mapping into the reference embedding space - stored under adata.obsm['cxg_cell_type'], by default
  • a UMAP layout for the query data in this reference embedding space - stored under adata.obsm['cxg_cell_type_umap'] by default
  • metadata about the model used - stored under adata.uns['cxg_cell_type_predictions']['model_url'] by default
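
To confirm that a run produced all of the fields listed above, a small check like the following sketch can be used. It uses the default names from the list and plain Python collections, so it does not depend on anndata itself. The function name missing_annotation_fields is hypothetical, and the check would need adjusting if you changed the default names through command line options.

```python
# Default field names taken from the list above.
EXPECTED_OBS = ("cxg_cell_type_predicted", "cxg_cell_type_predicted_uncertainty")
EXPECTED_OBSM = ("cxg_cell_type", "cxg_cell_type_umap")
EXPECTED_UNS = ("cxg_cell_type_predictions",)


def missing_annotation_fields(obs_columns, obsm_keys, uns_keys):
    """Return a list describing which default annotation fields are absent.

    Each argument is an iterable of names, e.g. `adata.obs.columns`,
    `adata.obsm.keys()`, and `adata.uns.keys()`. An empty list means all
    default fields were found.
    """
    present = {
        "obs": set(obs_columns),
        "obsm": set(obsm_keys),
        "uns": set(uns_keys),
    }
    expected = {"obs": EXPECTED_OBS, "obsm": EXPECTED_OBSM, "uns": EXPECTED_UNS}
    return [
        f"{slot}['{name}']"
        for slot, names in expected.items()
        for name in names
        if name not in present[slot]
    ]


# Hypothetical usage (requires the anndata package):
#   import anndata
#   adata = anndata.read_h5ad("./pathTo/annotatedData.h5ad")
#   print(missing_annotation_fields(adata.obs.columns, adata.obsm.keys(), adata.uns.keys()))
```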

The cellxgene annotate subcommand offers a number of options to customize how the workflow is run and to adapt to different scenarios. You can display the full list of available options like so:

cellxgene annotate -h

Step 2: Explore and Refine

After you have successfully completed the annotation workflow, you are ready to explore your results in CELLxGENE Annotate!

You can launch your annotated anndata object in CELLxGENE Annotate like so:

cellxgene launch ./pathTo/annotatedData.h5ad

You will now have access to the standard set of CELLxGENE Annotate features to explore your single cell dataset. Here are some initial steps that you might take to start exploring your annotated dataset:

  • color by predicted cell type
  • color by prediction uncertainty score
  • calculate differential expression between two annotated populations to see which genes distinguish them from each other
  • create custom annotations to further refine or correct the predicted annotations
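
One possible starting point for refinement is to separate confident predictions from those that deserve a manual look. The sketch below assumes pandas is installed and replaces labels whose uncertainty exceeds a threshold with a placeholder. The 0.5 cutoff, the "needs_review" placeholder, and the function name flag_uncertain_predictions are all assumptions for illustration, not CELLxGENE defaults.

```python
import pandas as pd


def flag_uncertain_predictions(
    obs,
    max_uncertainty=0.5,
    label_col="cxg_cell_type_predicted",
    uncertainty_col="cxg_cell_type_predicted_uncertainty",
    placeholder="needs_review",
):
    """Return labels with high-uncertainty predictions replaced by a placeholder.

    `obs` is a pandas DataFrame such as `adata.obs`. A prediction is kept when
    its uncertainty is at most `max_uncertainty` (an illustrative cutoff).
    """
    labels = obs[label_col].astype(str)
    uncertain = obs[uncertainty_col] > max_uncertainty
    return labels.where(~uncertain, placeholder)


# Small example: the second cell exceeds the cutoff and is flagged.
example_obs = pd.DataFrame(
    {
        "cxg_cell_type_predicted": ["AT2", "AT1", "AT2"],
        "cxg_cell_type_predicted_uncertainty": [0.1, 0.9, 0.5],
    }
)
print(flag_uncertain_predictions(example_obs).tolist())
```

The returned Series could be stored as a new column in adata.obs before writing the file, so that the flagged cells can be colored and relabeled in the explorer.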

For a reference on how one might refine predicted annotations, you can refer to this Nature Tutorial.

Useful Links