Contributing Data

CELLxGENE supports a rapidly growing single cell data corpus because of generous contributions from researchers like you! If you have single cell data that meet our requirements, please reach out to our curation team at cellxgene@chanzuckerberg.com.

Publishing Process

The process for submission to the portal is:

  • You confirm we can support your submission by reaching out to our curation team at cellxgene@chanzuckerberg.com with a description of the data that you'd like to contribute.
  • We confirm that we will accept your data.
  • You prepare your data according to our submission requirements and send us your files.
  • We upload to a private collection where you can review.
  • You prepare revised data and send us your revised files, as needed.
  • We publish the data when you tell us to.

Dataset Requirements

Data Eligibility

CELLxGENE is focused on supporting the global community attempting to create references of human cells and tissues. As a result, there are a few kinds of datasets that we are not likely to accept at this time:

Scale Constraints

CELLxGENE Discover sets the maximum dataset size for submissions to 50GB. Additionally, CELLxGENE Explorer sets the maximum number of cells to 4.6 million for exploration of datasets.

Formatting Requirements

We need the following collection metadata (i.e. details associated with your publication or study)

  • Collection information:
    • Title
    • Description
    • Contact: name and email
    • Publication/preprint DOI: can be added later
    • URLs: any additional URLs for related data or resources, such as GEO or protocols.io - can be added later
    • Consortia: optional, and can be added later. Can be one or more of those listed here.

Each dataset needs the following information added to a single h5ad (AnnData 0.10) format file:

  • Dataset-level metadata in uns:
    • title: title of the individual dataset
    • optional: batch_condition: list of obs fields that define “batches” that a normalization or integration algorithm should be aware of
  • Data in .X and raw.X:
    • raw counts are required
    • normalized counts are strongly recommended
    • raw counts should be in raw.X if normalized counts are in .X
    • if there is no normalized matrix, raw counts should be in .X
  • Cell metadata in obs (for ontology term IDs, the values MUST be the most specific term available from the specified ontology):
    • organism_ontology_term_id: NCBITaxon (NCBITaxon:9606 for human, NCBITaxon:10090 for mouse)
    • donor_id: free-text identifier that distinguishes the unique individual that data were derived from. It is encouraged to be something not likely to be used in other studies (e.g. donor_1 is likely to not be unique in the data corpus)
    • development_stage_ontology_term_id: HsapDv if human, MmusDv if mouse, unknown if information unavailable
    • sex_ontology_term_id: PATO:0000384 for male, PATO:0000383 for female, or unknown if unavailable
    • self_reported_ethnicity_ontology_term_id: HANCESTRO multiple comma-separated terms may be used if more than one ethnicity is reported. If human and information unavailable, use unknown. Use na if non-human.
    • disease_ontology_term_id: MONDO or PATO:0000461 for 'normal'
    • tissue_type: tissue, organoid, or cell culture
    • tissue_ontology_term_id: UBERON
    • cell_type_ontology_term_id: CL
    • assay_ontology_term_id: EFO
    • suspension_type: cell, nucleus, or na, as corresponding to assay. Use this table defined in the data schema for guidance. If the assay does not appear in this table, the most appropriate value MUST be selected and the curation team informed during submission so that the assay can be added to the table.
  • Embeddings in obsm:
    • One or more two-dimensional embeddings, prefixed with 'X_'
  • Features in var & raw.var (if present):
    • index is Ensembl ID
    • preference is that gene have not been filtered in order to maximize future data integration efforts
  • Additional standards for single-capture area Visium datasets (largely aligns with scanpy’s model, this notebook may be helpful to curate from Space Ranger outputs):
    • Empty spots must be included (should be 4992 observations)
    • obsm['spatial']
    • obs['array_row']
    • obs['array_col']
    • obs['in_tissue']
    • uns['spatial'][library_id]['images']['fullres'] fullres image (preferred)
    • uns['spatial'][library_id]['images']['hires'] hires image
    • uns['spatial'][library_id]['scalefactors']['spot_diameter_fullres']
    • uns['spatial'][library_id]['scalefactors']['tissue_hires_scalef']
  • Additional standards for single-puck Slide-seq datasets:
    • obsm['spatial']

Data Submission Policy

I give CZI permission to display, distribute, and create derivative works (e.g. visualizations) of this data for purposes of offering CELLxGENE Discover, and I have the authority to give this permission. It is my responsibility to ensure that this data is not identifiable. In particular, I commit that I will remove any direct personal identifiers in the metadata portions of the data, and that CZI may further contact me if it believes more work is needed to de-identify it. If I choose to publish this data publicly on CELLxGENE Discover, I understand that (1) anyone will be able to access it subject to a CC-BY 4.0 license, meaning they can download, share, and use the data without restriction beyond providing attribution to the original data contributor(s) and (2) the Collection details (including collection name, description, my name, and the contact information for the datasets in this Collection) will be made public on CELLxGENE Discover as well. I understand that I have the ability to delete the data that I have published from CELLxGENE Discover if I later choose to. This however will not undo any prior downloads or shares of such data.