AnswerALS Data Portal


Release 2.1 (April 2020):

  1. We have added the following Transcriptomics data:
    • 86 Indexed BAM files
    • 86 Counts files
  2. Download the current release metadata package here.


Release 2.0 (March 2020):

  1. We have now released 391 WGS samples, 85 Epigenomics samples (FASTQ, BAM, Peaks) and 86 Transcriptomics samples (FASTQ, BAM).
  2. Full release metadata package:
    • We’ve added complete Clinical metadata, Participant/Inventory metadata and Portal metadata.
    • Download the current release metadata package here.
    • Download the previous release metadata package here.
  3. All metadata tables are keyed by "Participant_ID" (Condition + GUID) to allow for easy joining of tables.
  4. Removed NeuroLINCS data and "Analyze" tab that previously used this data.
  5. Users are now alerted to the size of data within the download script.
  6. All sample data follows new naming and organization conventions as outlined in this example diagram.
  7. Sex Effects in RNA-Seq data:
    • The standard step after obtaining level 3 data (counts) is statistical inference of systematic changes between conditions (e.g. ALS and CTR) by modeling gene expression data with a binary variable with two levels (design ~ condition, where condition = 0 for CTR and 1 for ALS). In the presence of confounding factors, a more complex design (e.g. design ~ batch + condition) may be needed to extract the disease-relevant signal while controlling for the confounders.
    • In this data release, one of the dominant contributors to the gene expression changes observed in our initial principal component analysis (PCA) is the inherent sex effect. Genes that contribute to the sex effect include both sex-chromosome-linked genes and autosomal genes. This sex-specific gene expression has also been reported in post-mortem human brains, as well as in iPSCs. To extract the disease-relevant signal, we recommend controlling for the sex effect in your downstream analysis, for example by excluding X- and Y-linked genes and adding a binary variable to the design to account for sex.
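
The sex-effect recommendation above can be sketched in plain Python. This is only an illustration under stated assumptions: the field names (`sex`, `condition`), the 0/1 encodings for sex, the helper names, and the toy counts are all hypothetical and do not reflect the portal's actual file schema. In practice the design would be passed to a differential expression tool rather than built by hand.

```python
# Hypothetical sketch: remove sex-chromosome genes and build a design with a
# sex covariate (design ~ sex + condition). All names and values are toy data.
SEX_CHROMOSOMES = {"chrX", "chrY"}


def drop_sex_linked(gene_chrom, counts):
    """Keep only genes that are not annotated to chrX or chrY."""
    return {
        gene: row
        for gene, row in counts.items()
        if gene_chrom.get(gene) not in SEX_CHROMOSOMES
    }


def design_matrix(samples):
    """Build one row per sample: [intercept, sex, condition].

    Assumed encoding: sex 0 = female, 1 = male; condition 0 = CTR, 1 = ALS.
    """
    sex_code = {"F": 0, "M": 1}
    condition_code = {"CTR": 0, "ALS": 1}
    return [
        [1, sex_code[s["sex"]], condition_code[s["condition"]]]
        for s in samples
    ]


# Toy annotation and counts (genes by samples); values are made up.
gene_chrom = {"XIST": "chrX", "UTY": "chrY", "SOD1": "chr21"}
counts = {"XIST": [50, 0], "UTY": [0, 12], "SOD1": [30, 28]}
samples = [
    {"sex": "F", "condition": "ALS"},
    {"sex": "M", "condition": "CTR"},
]

filtered = drop_sex_linked(gene_chrom, counts)
design = design_matrix(samples)
print(sorted(filtered))  # ['SOD1']
print(design)            # [[1, 0, 1], [1, 1, 0]]
```

Note that filtering sex-chromosome genes alone does not remove the autosomal part of the sex effect, which is why the sex column is kept in the design as well.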

This data portal is designed for scientists to explore data from the most comprehensive studies of ALS.
It is organized as follows:

  • The Home page (this page) provides a description of the entities involved in the studies.
    Please take the time to read this page in its entirety to understand the data contained here.
  • The Search page provides a means to browse the available data, filter by various criteria, and download data or order iPSC lines.

This data portal is written with scientists in mind. For a general introduction to our effort, visit the AnswerALS home page.

The development of this portal is generously supported by AnswerALS partners and donors.


Consortia

Our mission is to build the most comprehensive clinical, genetic, molecular & biochemical assessment of ALS, while openly sharing the results with the global research community.
— more at answerals.org

The AnswerALS consortium is an ongoing effort to collect comprehensive datasets characterizing the biology of 1000 ALS patients. Find the status of our progress here.


Patients

AnswerALS patients were recruited at clinics across the United States by the AnswerALS team of clinicians & scientists.

Clinical data for patients enrolled by AnswerALS were collected via NeuroBANK, which provides uniform and comprehensive "deep phenotyping" for each patient. These records are de-identified and pre-processed by the NeuroBANK team, and further trimmed for inclusion in this portal.

All patients are identified by a Participant_ID prefixed by either "CASE" or "CTRL".
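
Because every metadata table is keyed by Participant_ID, tables can be joined on that column. The sketch below illustrates the idea in plain Python. The example IDs, the separator between the prefix and the rest of the ID, the column names other than `Participant_ID`, and the helper names are all assumptions made for illustration; check the release metadata package for the real layout.

```python
# Hypothetical sketch: split a Participant_ID into its condition prefix and
# remainder, and inner-join two metadata tables on Participant_ID.
def split_participant_id(participant_id):
    """Return (prefix, remainder) for an ID starting with CASE or CTRL."""
    for prefix in ("CASE", "CTRL"):
        if participant_id.startswith(prefix):
            # Assumption: an optional "-" or "_" separates prefix and GUID.
            return prefix, participant_id[len(prefix):].lstrip("-_")
    raise ValueError(f"Unexpected Participant_ID: {participant_id!r}")


def join_on_participant(left_rows, right_rows, key="Participant_ID"):
    """Inner-join two lists of dict rows on the shared key column."""
    right_by_id = {row[key]: row for row in right_rows}
    return [
        {**row, **right_by_id[row[key]]}
        for row in left_rows
        if row[key] in right_by_id
    ]


# Toy tables; IDs and columns are invented for this example.
clinical = [
    {"Participant_ID": "CASE-EXAMPLE001", "Sex": "F"},
    {"Participant_ID": "CTRL-EXAMPLE002", "Sex": "M"},
]
inventory = [
    {"Participant_ID": "CASE-EXAMPLE001", "Sample": "sample_A"},
]

joined = join_on_participant(clinical, inventory)
print(split_participant_id("CASE-EXAMPLE001"))  # ('CASE', 'EXAMPLE001')
print(joined)
```

Only participants present in both tables survive the inner join, so a row count smaller than expected usually means that one table covers fewer participants.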


Samples

iPSCs

The iPSC team at Cedars-Sinai generates induced pluripotent stem cells (iPSCs) for each patient recruited by each consortium.
Our iPSC technology is broadly described here and further details can be found here. Each iPSC line is identified by a Cedars ID (prefixed CS) which is associated with a patient's Participant_ID.

Whole Genomes

The New York Genome Center's Center for Genomics of Neurodegenerative Disease (NYGC-CGND) performs whole-genome sequencing for all AnswerALS patients. These datasets are designated by a CGND ID (prefixed CGND) which is associated with a patient's Participant_ID.


Assays

Several 'omics assays are performed to characterize each patient's biology. Our multi-omic approach is described here.

Whole Genome Sequencing (WGS) is performed by the New York Genome Center's Center for Genomics of Neurodegenerative Disease (NYGC-CGND).

Sequencing is performed at NYGC on Illumina HiSeq X Ten sequencers, mostly from whole-blood samples. Reads are aligned to hg38 via BWA-MEM, and variants are called following GATK 3.5 best practices. Details about the WGS workflow are available here.

ATAC-Seq is performed to assess the chromatin accessibility of the iPSC-derived motor neurons at the Fraenkel Lab at MIT.

Sequencing is performed at MIT on Illumina NextSeq sequencers. Reads are aligned to hg38 via Bowtie2, and peaks are called with MACS2. Details about the ATAC-Seq workflow are available here.

Bulk RNA-Seq is performed to assess the transcriptional profiles of the iPSC-derived motor neurons at the Thompson Lab at UCI.

Sequencing is performed at UCI on Illumina NovaSeq 6000s. Reads are aligned to hg38 via HISAT2, and reads falling within genes are counted by featureCounts. The RNA-Seq bioinformatic workflow is available in our docker container.

Data-Independent Acquisition (DIA) Mass Spectrometry (in particular, SWATH-MS) is performed to assess the proteomes of the iPSC-derived motor neurons at the Van Eyk Lab at Cedars-Sinai.

SWATH-MS is performed at Cedars-Sinai.


Data Levels

Datasets are provided by this portal at multiple stages in processing, which are designated as "data levels". A definition of these "data levels" can be found here.

Level 1 data is raw, immutable data coming off an instrument (e.g. a sequencer).

Assay            Level 1
Genomics         .fastq
Epigenomics      .fastq
Transcriptomics  .fastq
Proteomics       .wiff

Level 2 data is raw data mapped against the appropriate reference.

Assay            Level 1  Level 2
Genomics         .fastq   .cram
Epigenomics      .fastq   .bam
Transcriptomics  .fastq   .bam
Proteomics       .wiff    .mzML

Level 3 data is the most processed form of patient-specific data.

Assay            Level 1  Level 2  Level 3
Genomics         .fastq   .cram    .raw.vcf (i.e. gVCF)
Epigenomics      .fastq   .bam     .narrowPeaks
Transcriptomics  .fastq   .bam     .tsv (genes by samples)
Proteomics       .wiff    .mzML    .tsv (proteins by samples)

Level 4 data is obtained by joining the level 3 data of a cohort of patients for a particular assay.

Assay            Level 1  Level 2  Level 3                     Level 4
Genomics         .fastq   .cram    .raw.vcf (i.e. gVCF)        .vcf (joint call)
Epigenomics      .fastq   .bam     .narrowPeaks                .bed (diff. regions)
Transcriptomics  .fastq   .bam     .tsv (genes by samples)     .tsv (diff. genes)
Proteomics       .wiff    .mzML    .tsv (proteins by samples)  .tsv (diff. proteins)

Level 5 data is obtained by integrating level 4 datasets across omics assays. Level 5 data represents the knowledge ultimately resulting from the experiment.
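
For scripted access, the level tables above can be captured as a small lookup. The mapping below is transcribed from those tables; the names `DATA_LEVEL_FORMATS` and `expected_format` are hypothetical and not part of the portal. Level 5 is left out because no file format is listed for it.

```python
# File formats per assay and data level, transcribed from the tables above.
DATA_LEVEL_FORMATS = {
    "Genomics": {1: ".fastq", 2: ".cram", 3: ".raw.vcf", 4: ".vcf"},
    "Epigenomics": {1: ".fastq", 2: ".bam", 3: ".narrowPeaks", 4: ".bed"},
    "Transcriptomics": {1: ".fastq", 2: ".bam", 3: ".tsv", 4: ".tsv"},
    "Proteomics": {1: ".wiff", 2: ".mzML", 3: ".tsv", 4: ".tsv"},
}


def expected_format(assay, level):
    """Return the file format listed for an assay at a given data level."""
    try:
        return DATA_LEVEL_FORMATS[assay][level]
    except KeyError:
        raise ValueError(
            f"No format listed for {assay!r} at level {level}"
        ) from None


print(expected_format("Genomics", 2))     # .cram
print(expected_format("Epigenomics", 3))  # .narrowPeaks
```

Several assays share an extension across levels (for example `.tsv` at levels 3 and 4), so the extension alone does not identify the data level; the lookup is keyed by assay and level for that reason.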