- CellxGene
- Find Published Data
- Contribute and Publish Data
- Download Published Data
- Analyze Public Data
- Get Started
- Hosted Tutorials
- Gene Expression Documentation
- Annotate and Analyze Your Data
- Join the CellxGene User Community
- Cite cellxgene in your publications
- Frequently Asked Questions
- Learn About Single Cell Data Analysis
Contributing Data
CELLxGENE supports a rapidly growing single cell data corpus because of generous contributions from researchers like you! If you have single cell data that meet our requirements, please reach out to our curation team at cellxgene@chanzuckerberg.com.
Publishing Process
The process for submission to the portal is:
- You confirm we can support your submission by reaching out to our curation team at cellxgene@chanzuckerberg.com with a description of the data that you'd like to contribute.
- We confirm that we will accept your data.
- You prepare your data according to our submission requirements and send us your files.
- We upload to a private Collection where you can review.
- You prepare revised data and send us your revised files, as needed.
- We publish the data when you tell us to.
Dataset Requirements
Data Eligibility
CELLxGENE is focused on supporting the global community attempting to create references of human cells and tissues. As a result, there are a few kinds of datasets that we are not likely to accept at this time:
- drug screens
- cell lines
- organisms other than mouse or human
- assays not on the CELLxGENE Census accepted assays list
- These additional assays are pending Census acceptance and will be accepted:
- Visium (non-HD)
- Slide-seq
- Expression measurements of multi-modal assays (e.g. 10x multiome, mCT-seq)
- These additional assays are pending Census acceptance and will be accepted:
Scale Constraints
CELLxGENE Discover sets the maximum dataset size for submissions to 50GB. Additionally, CELLxGENE Explorer sets the maximum number of cells to 4.6 million for exploration of datasets.
Formatting Requirements
We need the following Collection metadata (i.e. details associated with your publication or study)
- Collection information:
- Title
- Description
- Contact: a single name and email
- Publication/preprint DOI: optional can be added later
- URLs: optional can be added later. Any links to related data or resources, such as GEO or protocols.io
- Consortia: optional can be added later. Can be one or more of those listed here
The full schema is documented here but is summarized below. Each dataset needs the following information added to a single h5ad (AnnData 0.10) format file:
- Dataset-level metadata in uns:
- title
- title of the individual dataset
- batch_condition optional
- list of obs fields that define “batches” that a normalization or integration algorithm should be aware of
- default_embedding optional
- the obsm key associated with the embeddings you would like to be displayed in CELLxGENE by default
- title
- Data in .X and raw.X:
- raw counts are required
- normalized counts are strongly recommended
- raw counts should be in raw.X if normalized counts are in .X
- if there is no normalized matrix, raw counts should be in .X
- Cell metadata in obs (for ontology term IDs, the values MUST be the most specific term available from the specified ontology):
- organism_ontology_term_id
- NCBITaxon (
NCBITaxon:9606
for human,NCBITaxon:10090
for mouse)
- NCBITaxon (
- donor_id
- free-text identifier that distinguishes the unique individual that data were derived from. It is encouraged to be something not likely to be used in other studies (e.g. donor_1 is likely to not be unique in the data corpus)
- development_stage_ontology_term_id
- sex_ontology_term_id
PATO:0000384
for male,PATO:0000383
for female, orunknown
if unavailable
- self_reported_ethnicity_ontology_term_id
- HANCESTRO multiple comma-separated terms may be used if more than one ethnicity is reported. If human and information unavailable, use
unknown
. Usena
if non-human
- HANCESTRO multiple comma-separated terms may be used if more than one ethnicity is reported. If human and information unavailable, use
- disease_ontology_term_id
- MONDO or
PATO:0000461
for 'normal' - Any known disease that is thought to, or is being tested to, have an impact on the measurement being taken in this experiment. Not necessarily any known disease of the donor
- MONDO or
- tissue_type
tissue
,organoid
, orcell culture
- tissue_ontology_term_id
- cell_type_ontology_term_id
- assay_ontology_term_id
- suspension_type
cell
,nucleus
, orna
, as corresponding to assay. Use this table defined in the data schema for guidance. If the assay does not appear in this table, the most appropriate value MUST be selected and the curation team informed during submission so that the assay can be added to the table
- organism_ontology_term_id
- Embeddings in obsm:
- One or more two-dimensional embeddings, prefixed with 'X_'
- Features in var & raw.var (if present):
- index is Ensembl ID
- preference is that gene have not been filtered in order to maximize future data integration efforts
- Additional standards for single-capture area Visium datasets (largely aligns with scanpy’s model, this notebook may be helpful to curate from Space Ranger outputs):
- Empty spots must be included (should be 4992 observations)
- obsm['spatial']
- obs['array_row']
- obs['array_col']
- obs['in_tissue']
- uns['spatial'][library_id]['images']['fullres'] preferred fullres image
- uns['spatial'][library_id]['images']['hires'] hires image
- uns['spatial'][library_id]['scalefactors']['spot_diameter_fullres']
- uns['spatial'][library_id]['scalefactors']['tissue_hires_scalef']
- Additional standards for single-puck Slide-seq datasets:
- obsm['spatial']
Data Submission Policy
I give CZI permission to display, distribute, and create derivative works (e.g. visualizations) of this data for purposes of offering CELLxGENE Discover, and I have the authority to give this permission. It is my responsibility to ensure that this data is not identifiable. In particular, I commit that I will remove any direct personal identifiers in the metadata portions of the data, and that CZI may further contact me if it believes more work is needed to de-identify it. If I choose to publish this data publicly on CELLxGENE Discover, I understand that (1) anyone will be able to access it subject to a CC-BY 4.0 license, meaning they can download, share, and use the data without restriction beyond providing attribution to the original data contributor(s) and (2) the Collection details (including Collection name, description, my name, and the contact information for the datasets in this Collection) will be made public on CELLxGENE Discover as well. I understand that I have the ability to delete the data that I have published from CELLxGENE Discover if I later choose to. This however will not undo any prior downloads or shares of such data.