This data portal is designed for scientists to explore data from the most comprehensive studies of ALS.
It is organized as follows:
This data portal is written with scientists in mind. For a general introduction to our effort, visit the AnswerALS home page.
The development of this portal is generously supported by AnswerALS partners and donors.
Our mission is to build the most comprehensive clinical, genetic, molecular & biochemical assessment of ALS, while openly sharing the results with the global research community.
— more at answerals.org
The AnswerALS consortium is an ongoing effort to collect comprehensive datasets characterizing the biology of 1000 ALS patients. Find the status of our progress here.
Clinical data for patients enrolled by AnswerALS were collected via NeuroBANK, which provides uniform and comprehensive "deep phenotyping" for each patient. These records are de-identified and pre-processed by the NeuroBANK team, then further trimmed for inclusion in this portal.
All patients are identified by a Participant_ID prefixed by either "CASE" or "CTRL".
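As a minimal sketch of how this prefix convention might be used when working with a downloaded table, the Python snippet below sorts Participant_IDs into cases and controls. The function name and the example IDs are hypothetical placeholders; only the "CASE" and "CTRL" prefixes come from the description above.

```python
def participant_group(participant_id: str) -> str:
    """Return "case" or "control" based on the Participant_ID prefix."""
    if participant_id.startswith("CASE"):
        return "case"
    if participant_id.startswith("CTRL"):
        return "control"
    raise ValueError(f"Unrecognized Participant_ID prefix: {participant_id!r}")


# Hypothetical IDs for illustration only; real IDs follow the portal's own format.
example_ids = ["CASE-EXAMPLE1", "CTRL-EXAMPLE2", "CASE-EXAMPLE3"]

groups = {"case": [], "control": []}
for pid in example_ids:
    groups[participant_group(pid)].append(pid)

print(len(groups["case"]), "cases;", len(groups["control"]), "controls")
```

Raising an error on an unknown prefix, rather than silently assigning a default group, makes malformed or unexpected identifiers visible early.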
Induced Pluripotent Stem Cells (iPSCs) are generated by the iPSC team at Cedars-Sinai for each patient recruited by the consortium.
Our iPSC technology is broadly described here and further details can be found here. Each iPSC line is identified by a Cedars ID (prefixed CS) which is associated with a patient's Participant_ID.
The New York Genome Center's Center for Genomics of Neurodegenerative Disease (NYGC-CGND) performs whole-genome sequencing for all AnswerALS patients. These datasets are designated by a CGND ID (prefixed CGND) which is associated with a patient's Participant_ID.
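Because each Cedars ID and CGND ID is associated with a Participant_ID, datasets from different assays can be linked back to the same patient. The Python sketch below shows one way such a cross-reference could be represented. The class, the function, and all example IDs are hypothetical; the portal's actual mapping files may be structured differently.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class SampleIds:
    """Identifiers associated with one patient (all values here are made up)."""
    participant_id: str              # prefixed CASE or CTRL
    cedars_id: Optional[str] = None  # iPSC line, prefixed CS
    cgnd_id: Optional[str] = None    # WGS dataset, prefixed CGND


def build_index(records: Iterable[SampleIds]) -> Dict[str, str]:
    """Map every known identifier back to its Participant_ID."""
    index: Dict[str, str] = {}
    for rec in records:
        index[rec.participant_id] = rec.participant_id
        if rec.cedars_id:
            index[rec.cedars_id] = rec.participant_id
        if rec.cgnd_id:
            index[rec.cgnd_id] = rec.participant_id
    return index


# Placeholder records for illustration only.
records = [
    SampleIds("CASE-EXAMPLE1", cedars_id="CS-EXAMPLE1", cgnd_id="CGND-EXAMPLE1"),
    SampleIds("CTRL-EXAMPLE2", cgnd_id="CGND-EXAMPLE2"),
]
index = build_index(records)
print(index["CS-EXAMPLE1"])
```

With an index like this, a file named after a Cedars ID or a CGND ID can be resolved to the patient it belongs to before joining it with clinical data.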
Several 'omics assays are performed to characterize each patient's biology. Our multi-omic approach is described here.
Sequencing is performed at NYGC on Illumina HiSeqX10s, mostly from whole-blood samples. Reads are aligned to hg38 via BWA-MEM, and variants are called via GATK 3.5 best practices. Details about the WGS workflow are available here.
ATAC-Seq is performed to assess the chromatin accessibility of the iPSC-derived motor neurons at the Fraenkel Lab at MIT.
Bulk RNA-Seq is performed to assess the transcriptional profiles of the iPSC-derived motor neurons at the Thompson Lab at UCI.
Sequencing is performed at UCI on Illumina NovaSeq 6000s. Reads are aligned to hg38 via HISAT2, and reads falling within genes are counted by featureCounts. The RNA-Seq bioinformatic workflow is available in our docker container.
Datasets are provided by this portal at multiple stages in processing, which are designated as "data levels". A definition of these "data levels" can be found here.
Level 1 data is raw, immutable data coming off an instrument (e.g. a sequencer).
Level 2 data is raw data mapped against the appropriate reference.
|Assay||Level 1||Level 2|
Level 3 data is the most processed form of patient-specific data.
|Assay||Level 1||Level 2||Level 3|
|Genomics||.fastq||.cram||.raw.vcf (i.e. gVCF)|
|Transcriptomics||.fastq||.bam||.tsv (genes by samples)|
|Proteomics||.wiff||.mzML||.tsv (proteins by samples)|
Level 4 data is obtained by joining the level 3 data of a cohort of patients for a particular assay.
|Assay||Level 1||Level 2||Level 3||Level 4|
|Genomics||.fastq||.cram||.raw.vcf||.vcf (joint call)|
|Epigenomics||.fastq||.bam||.narrowPeaks||.bed (diff. regions)|
|Transcriptomics||.fastq||.bam||.tsv||.tsv (diff. genes)|
|Proteomics||.wiff||.mzML||.tsv||.tsv (diff. proteins)|
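The table above can also be expressed as a small lookup, which may be convenient when scripting downloads. The Python sketch below transcribes the file formats listed for levels 1 through 4; the dictionary and function names are hypothetical and are not part of the portal.

```python
# File formats by assay and data level, transcribed from the table above.
FILE_FORMATS = {
    "Genomics":        {1: ".fastq", 2: ".cram", 3: ".raw.vcf",     4: ".vcf"},
    "Epigenomics":     {1: ".fastq", 2: ".bam",  3: ".narrowPeaks", 4: ".bed"},
    "Transcriptomics": {1: ".fastq", 2: ".bam",  3: ".tsv",         4: ".tsv"},
    "Proteomics":      {1: ".wiff",  2: ".mzML", 3: ".tsv",         4: ".tsv"},
}


def expected_extension(assay: str, level: int) -> str:
    """Return the file extension listed for an assay at a given data level."""
    try:
        return FILE_FORMATS[assay][level]
    except KeyError:
        raise ValueError(f"No format listed for {assay!r} at level {level}") from None


print(expected_extension("Genomics", 2))
print(expected_extension("Epigenomics", 3))
```

Note that the lookup only works from assay and level to extension. The reverse is ambiguous, since `.tsv` appears at both level 3 and level 4 for transcriptomics and proteomics.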
Level 5 data is attained from the integration of level 4 datasets across omics assays. Level 5 data represents the knowledge ultimately resulting from the experiment.