Contributing Data

CELLxGENE supports a rapidly growing single-cell data corpus because of generous contributions from researchers like you!

Submission and Publication Process

1. Review the Data Eligibility criteria to ensure your data complies with these requirements.
2. Contact us with a description of the data that you'd like to contribute so that we can confirm we will accept it.
3. Once confirmed, send us files prepared according to the submission Requirements.
4. We upload the files to a private Collection where you can review them.
5. The submission can be revised as needed.
6. The data are made openly available when you are ready.

Dataset Requirements

Data Eligibility

CELLxGENE supports most single-cell RNA-seq and ATAC-seq data, but a few types of data are not accepted at this time:

- drug screens
- cell lines
- species not on the supported list
- assays not on the supported list
  - these additional assays are accepted:
    - expression data from paired (i.e. multi-modal) assays (e.g. 10x multiome, mCT-seq)
    - unpaired scATAC-seq gene activity matrices with a fragments file

CELLxGENE continues to expand support for additional species and assays so please contact us if you are interested in submitting data not currently covered by the supported lists.

Scale Constraints

CELLxGENE Discover sets the maximum per-dataset file size for submissions to 50 GB. Additionally, datasets with more than 4.3 million cells can be submitted but will not be visualized in CELLxGENE Explorer.
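
As a quick pre-submission check, the two limits above can be expressed as a small helper. This is only a sketch: the function name `check_scale` is hypothetical, and reading "50 GB" as decimal gigabytes (10^9 bytes) is an assumption, not something the requirements state.

```python
# Assumption: "50 GB" is read as decimal gigabytes; the requirements do not say.
MAX_FILE_BYTES = 50 * 10**9
# Datasets above this cell count can be submitted but are not visualized.
MAX_EXPLORER_CELLS = 4_300_000


def check_scale(file_size_bytes, n_cells):
    """Report whether a dataset fits the file size limit and the Explorer limit."""
    return {
        "submittable": file_size_bytes <= MAX_FILE_BYTES,
        "explorer_visualization": n_cells <= MAX_EXPLORER_CELLS,
    }


# A 12 GB file with 5 million cells: accepted, but not shown in Explorer.
print(check_scale(file_size_bytes=12 * 10**9, n_cells=5_000_000))
```
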

Formatting Requirements

Include the following Collection metadata in your email to describe your publication or study. All of it can be edited later as titles, abstracts, and other details change:

Collection information:
- Title
- Description
- Contact: a single name and email
- Publication/preprint DOI: optional
- URLs: optional
  - any links to the corresponding raw sequence data, protocols, and other related data or resources
- Consortia: optional
  - one or more of those listed here

The full schema is documented here but is summarized below. Each dataset needs the following information added to a single h5ad (AnnData 0.10) format file:

Dataset-level metadata in uns:
- title
- organism_ontology_term_id
  - NCBITaxon; see the schema for specific values
- batch_condition optional
  - list of obs fields that define “batches” that a normalization or integration algorithm should be aware of
- default_embedding optional
  - the obsm key associated with the embeddings you would like to be displayed in CELLxGENE Explorer by default
Data in .X and raw.X:
- raw counts are required
- normalized counts are strongly recommended
- raw counts should be in raw.X if normalized counts are in .X
- if there is no normalized matrix, raw counts should be in .X
- the matrix in .X is the one displayed in CELLxGENE Explorer
  - it is highly recommended that .X is not filtered to enable exploration of all genes
  - if only raw counts are provided, gene expression visualizations will not be normalized and might appear different than expected
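
The placement rules for raw and normalized counts above can be summarized as a small decision function. This is a minimal sketch; `plan_matrix_layout` is a hypothetical helper that only names where each matrix should go and does not touch an AnnData object.

```python
def plan_matrix_layout(has_raw, has_normalized):
    """Return which matrix belongs in which slot, following the rules above."""
    if not has_raw:
        # Raw counts are required for every dataset.
        raise ValueError("raw counts are required")
    if has_normalized:
        # Normalized counts go in .X, so raw counts move to raw.X.
        return {"X": "normalized", "raw.X": "raw"}
    # With no normalized matrix, raw counts go in .X and raw.X stays unset.
    return {"X": "raw"}


print(plan_matrix_layout(has_raw=True, has_normalized=True))
print(plan_matrix_layout(has_raw=True, has_normalized=False))
```
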
Cell metadata in obs (ontology term IDs MUST be the most specific term available from the specified ontology):
- donor_id
  - free-text identifier that distinguishes the unique individual that data were derived from
  - na for cell line
- development_stage_ontology_term_id
  - human HsapDv
  - mouse MmusDv
  - roundworm WBls
  - zebrafish ZFS
  - fruit fly FBdv descendent of development stage or age
  - other organisms UBERON
  - unknown if information unavailable
  - na for cell line
- sex_ontology_term_id
  - PATO:0000384 for male, PATO:0000383 for female, PATO:0001340 for hermaphrodite, or unknown if unavailable
  - na for cell line
- self_reported_ethnicity_ontology_term_id
  - human HANCESTRO
  - multiple || -delimited terms may be used if more than one ethnicity is reported
  - unknown if information unavailable
  - other organisms na
  - na for cell line
- disease_ontology_term_id
  - should describe any known disease thought to, or being tested to, have an impact on the measurement being taken, not necessarily any known disease of the donor
  - MONDO or PATO:0000461 for normal
  - multiple || -delimited terms may be used if appropriate
- tissue_type
  - tissue, organoid, primary cell culture or cell line
- tissue_ontology_term_id
  - should describe the sample used as input to the experiment, not analysis-derived annotations
  - roundworm WBbt or UBERON
  - zebrafish ZFA or UBERON
  - fruit fly FBbt or UBERON
  - other organisms UBERON
  - cell line CVCL
  - primary cell culture CL
- cell_type_ontology_term_id
  - should describe analysis-derived cell annotations
  - roundworm WBbt or CL
  - zebrafish ZFA or CL
  - fruit fly FBbt or CL
  - other organisms CL
  - cell line may be all na for studies that do not involve cell label annotation
- assay_ontology_term_id
  - EFO
- suspension_type
  - cell, nucleus, or na
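
Some of the obs rules above lend themselves to simple format checks before submission. The sketch below covers only the ontology prefix for development stage, the allowed sex values, and splitting `||`-delimited terms. All function names are hypothetical, the organism keys are labels chosen for this example, and the term IDs in the usage lines are placeholders shown only for their format. It does not verify that a term exists or that it is the most specific term available.

```python
# Ontology prefix expected for development_stage_ontology_term_id, by organism.
DEV_STAGE_PREFIX = {
    "human": "HsapDv",
    "mouse": "MmusDv",
    "roundworm": "WBls",
    "zebrafish": "ZFS",
    "fruit fly": "FBdv",
}
SEX_TERMS = {"PATO:0000384", "PATO:0000383", "PATO:0001340", "unknown"}


def check_development_stage(term, organism, is_cell_line=False):
    """Check only the prefix of a development stage term (not its validity)."""
    if is_cell_line:
        return term == "na"
    if term == "unknown":
        return True
    # Organisms without a dedicated ontology fall back to UBERON.
    prefix = DEV_STAGE_PREFIX.get(organism, "UBERON")
    return term.startswith(prefix + ":")


def check_sex(term, is_cell_line=False):
    """Check sex_ontology_term_id against the values listed above."""
    return term == "na" if is_cell_line else term in SEX_TERMS


def split_terms(value):
    """Split a '||'-delimited value into individual term IDs."""
    return [part.strip() for part in value.split("||")]


# Placeholder IDs, used here only to show the expected format.
print(check_development_stage("HsapDv:0000001", "human"))
print(check_sex("PATO:0000383"))
print(split_terms("HANCESTRO:0000001 || HANCESTRO:0000002"))
```
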
Embeddings in obsm:
- one or more two-dimensional embeddings, prefixed with 'X_'
Features in var & raw.var (if present):
- index is Ensembl gene ID
- recommendation is that genes have not been filtered in order to maximize future data integration efforts
Additional standards for single-capture area Visium datasets (these largely align with scanpy’s model; this notebook may be helpful for curating from Space Ranger outputs):
- empty spots must be included
  - 4992 total observations for 6.5 mm capture areas
  - 14336 total observations for 11 mm capture areas
- obsm['spatial']
- obs['array_row']
- obs['array_col']
- obs['in_tissue']
- uns['spatial'][library_id]['images']['fullres'] recommended
  - fullres image that is input to Space Ranger
- uns['spatial'][library_id]['images']['hires']
  - hires image that is output from Space Ranger
- uns['spatial'][library_id]['scalefactors']['spot_diameter_fullres']
- uns['spatial'][library_id]['scalefactors']['tissue_hires_scalef']
- multiple-capture area Visium datasets are permitted if each capture area is also submitted individually
  - the additional Visium standards do not apply to these
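
Part of the single-capture area Visium checklist above can be verified mechanically. The sketch below checks the observation count and the keys under `uns['spatial'][library_id]`, treating `uns['spatial']` as a plain nested dict. `check_visium` is a hypothetical helper. It does not cover `obsm['spatial']` or the obs columns, and it ignores the recommended `fullres` image.

```python
# Total observations (including empty spots) per capture area size in mm.
EXPECTED_SPOTS = {6.5: 4992, 11: 14336}
REQUIRED_SCALEFACTORS = ["spot_diameter_fullres", "tissue_hires_scalef"]


def check_visium(n_obs, capture_area_mm, spatial_uns, library_id):
    """Return a list of problems found; an empty list means these checks pass."""
    problems = []
    expected = EXPECTED_SPOTS.get(capture_area_mm)
    if expected is None:
        problems.append(f"unrecognized capture area: {capture_area_mm} mm")
    elif n_obs != expected:
        problems.append(f"expected {expected} observations, found {n_obs}")

    library = spatial_uns.get(library_id, {})
    if "hires" not in library.get("images", {}):
        problems.append("missing images['hires']")
    for key in REQUIRED_SCALEFACTORS:
        if key not in library.get("scalefactors", {}):
            problems.append(f"missing scalefactors['{key}']")
    return problems


# Stand-in for adata.uns['spatial']; the values are illustrative only.
spatial = {
    "lib1": {
        "images": {"hires": "image array goes here"},
        "scalefactors": {"spot_diameter_fullres": 1.0, "tissue_hires_scalef": 0.1},
    }
}
print(check_visium(4992, 6.5, spatial, "lib1"))
```
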
Additional standards for single-puck Slide-seq datasets:
- obsm['spatial']
- multiple-puck Slide-seq datasets are permitted if each puck is also submitted individually
  - obsm['spatial'] is not required for these
Additional ATAC-seq standards:
- A fragments file is required for each unpaired scATAC-seq submission and recommended for each multi-modal submission
- The fragments file follows the 5-column tabular format produced by Cell Ranger
- Barcode values must match the values in the obs index of the corresponding AnnData object
- This notebook may be helpful to curate from Cell Ranger outputs
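
The fragments file rules above can be spot-checked with a short script. This sketch assumes tab-separated rows whose fourth column holds the barcode and that lines starting with `#` are comments. That column order is an assumption about the Cell Ranger output, not something stated here. `fragment_barcode_problems` is a hypothetical helper, and it only checks that every fragment barcode appears in the obs index.

```python
import csv


def fragment_barcode_problems(lines, obs_index):
    """Check each row has 5 columns and a barcode present in the obs index."""
    known = set(obs_index)
    problems = []
    for line_no, row in enumerate(csv.reader(lines, delimiter="\t"), start=1):
        # Skip blank lines and comment lines.
        if not row or row[0].startswith("#"):
            continue
        if len(row) != 5:
            problems.append(f"line {line_no}: expected 5 columns, found {len(row)}")
            continue
        barcode = row[3]  # assumption: barcode is the fourth column
        if barcode not in known:
            problems.append(f"line {line_no}: barcode {barcode} not in obs index")
    return problems


# Made-up rows for illustration; a real file would be read with gzip.open(path, "rt").
example_lines = [
    "# comment line\n",
    "chr1\t10\t50\tAAAC-1\t2\n",
    "chr1\t60\t90\tTTTG-1\t1\n",
    "chr1\t5\t9\n",
]
for problem in fragment_barcode_problems(example_lines, obs_index=["AAAC-1"]):
    print(problem)
```
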

Data Submission Policy

I give CZI permission to display, distribute, and create derivative works (e.g. visualizations) of this data for purposes of offering CELLxGENE Discover, and I have the authority to give this permission. It is my responsibility to ensure that this data is not identifiable. In particular, I commit that I will remove any direct personal identifiers in the metadata portions of the data, and that CZI may further contact me if it believes more work is needed to de-identify it. If I choose to publish this data publicly on CELLxGENE Discover, I understand that (1) anyone will be able to access it subject to a CC-BY 4.0 license, meaning they can download, share, and use the data without restriction beyond providing attribution to the original data contributor(s) and (2) the Collection details (including Collection name, description, my name, and the contact information for the datasets in this Collection) will be made public on CELLxGENE Discover as well. I understand that I have the ability to delete the data that I have published from CELLxGENE Discover if I later choose to. This however will not undo any prior downloads or shares of such data.