ALFA: Allele Frequency Aggregator
ALFA at a glance:
- The goal is to make allele frequency data from over 1 million dbGaP subjects available as open access, in accordance with the FAIR Data Principles (Findable, Accessible, Interoperable, and Reusable).
- The dbGaP studies include chip array, exome, and genomic sequencing data with subjects from 12 diverse populations including European, African, Asian, Latin American, and others.
- The data will be integrated into regular dbSNP build releases, with RS accessions assigned to variants, and will be accessible via web, FTP, API, and TrackHub.
Background
The NCBI database of Genotypes and Phenotypes (dbGaP) contains the findings of over 2,000 studies on the interaction of genotype and phenotype. The database holds over three million subjects and hundreds of millions of variants, along with thousands of phenotypes and molecular assay data. This unprecedented volume and diversity of data offers enormous potential for identifying genetic factors that influence health and disease. The National Institutes of Health (NIH) has recently lifted the restriction on Genomic Summary Results (GSR) access for responsible data sharing and use.
To comply with the updated GSR policy and to encourage research aimed at identifying genetic variants that contribute to health and disease, NCBI created the Allele Frequency Aggregator (ALFA) pipeline, which computes allele frequencies for variants in dbGaP across approved unrestricted studies and makes the data available to the public via dbSNP. The ALFA project's goal is to make frequency data from over 1M dbGaP subjects open-access to aid in the discovery and interpretation of common and rare variants that have biological implications or cause disease. Nearly 1M subjects with genotype data have been analyzed using GRAF-pop as ALFA project candidates, pending study approval and processing.
Build Summary
| Release | Version | Date |
|---|---|---|
| 1 | | March 10, 2020 |
| 2 | | January 6, 2021 |
Data Generation
Data from selected studies are harmonized and normalized. Using existing dbSNP and dbGaP curation and semi-automatic pipelines, data from either GWAS chip array genotyping or direct sequencing of exomes and whole genomes undergo QA/QC and are transformed to standard VCF format. The VCF files are the input to a pipeline that converts variants to SPDI notation and normalizes them using VOCA so that they can be aggregated, remapped, and clustered to existing dbSNP RS accessions or assigned new ones (Holmes et al.). Allele frequencies are then computed.
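The final step, computing allele frequency across studies, can be illustrated with a minimal sketch. It assumes that frequencies are obtained by pooling per-study allele counts for a variant and dividing by the total allele number; the source does not spell out the exact method, and the function name and the counts below are hypothetical.

```python
from collections import Counter


def aggregate_allele_frequency(study_counts):
    """Pool per-study allele counts for one variant.

    study_counts: iterable of dicts mapping allele -> observed allele count.
    Returns (frequencies, total_allele_number).
    Assumption: simple count pooling, not necessarily what ALFA does.
    """
    pooled = Counter()
    for counts in study_counts:
        pooled.update(counts)  # adds counts allele by allele
    total = sum(pooled.values())
    if total == 0:
        return {}, 0
    return {allele: n / total for allele, n in pooled.items()}, total


# Hypothetical allele counts for one variant in two studies.
study_a = {"A": 180, "G": 20}
study_b = {"A": 270, "G": 30}
freqs, allele_number = aggregate_allele_frequency([study_a, study_b])
print(freqs, allele_number)
```

Pooling counts rather than averaging per-study frequencies weights each study by its sample size, which is one reasonable design choice among several.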
Populations
Sample ancestries are validated using GRAF-pop (Updated Sept 2021) and assigned to 12 major populations including European, Hispanic, African, Asian, and others (Jin et al., 2019).
Data QC
We do our best to ensure that the data released is of the highest quality, complete, accurate, and useful. However, because we did not generate the original submitted data from dbGaP that were used as input for this project, and because the processing required to make the data useful is complex, we cannot be liable for omissions or inaccuracies. Please see the release summary with QC report (coming soon) for more details.
Data Excluded by QC:
- Variants with call rate < 95%
- Subjects with call rate < 95%
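The two call-rate rules above can be sketched as follows. This is an illustration only: it assumes a small in-memory genotype table where `None` marks a missing call, and it evaluates both rules against the unfiltered data, since the source does not state the order in which the filters are applied. The function names and sample data are hypothetical.

```python
def call_rate(calls):
    """Fraction of non-missing calls; None marks a missing genotype."""
    if not calls:
        return 0.0
    return sum(c is not None for c in calls) / len(calls)


def filter_by_call_rate(genotypes, min_rate=0.95):
    """genotypes: dict subject_id -> list of calls, one per variant, same order.

    Returns (kept_subject_ids, kept_variant_indexes); anything with a
    call rate below min_rate is excluded.
    """
    subjects = list(genotypes)
    n_variants = len(genotypes[subjects[0]]) if subjects else 0
    kept_subjects = [s for s in subjects if call_rate(genotypes[s]) >= min_rate]
    kept_variants = [
        j for j in range(n_variants)
        if call_rate([genotypes[s][j] for s in subjects]) >= min_rate
    ]
    return kept_subjects, kept_variants


# Hypothetical data: subject s2 is missing calls at the first two variants.
genotypes = {
    "s1": ["AA"] * 20,
    "s2": [None, None] + ["AG"] * 18,
    "s3": ["GG"] * 20,
}
kept_subjects, kept_variants = filter_by_call_rate(genotypes)
print(kept_subjects, kept_variants)
```

Here `s2` has a call rate of 90% and is dropped, and the first two variants are dropped because only two of three subjects were called at each.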
Data Excluded by QC and Awaiting Fixes from the Original dbGaP Submitters (may be included in future releases):
- Array datasets with conflicting subjects or markers between the marker manifest and reported genotype
- Datasets with incorrect or flipped allele orientation
- Datasets where the allele frequencies of the Ancestry Informative Markers (AIMs) tested are inconsistent with 1000 Genomes, either for the whole study or for a particular population. A dataset is excluded if the percentage of outlier AIMs (markers whose allele frequency differs by more than 0.15 in absolute value) exceeds 0.3% for the whole study or 0.1% for a population (see details).
- Datasets where polymorphic SNPs are recorded as monomorphic
- Datasets suspected of having errors due to chip array design
- Datasets with various systemic errors that do not appear random
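The AIMs consistency rule in the list above lends itself to a short worked example. The sketch below assumes that per-marker allele frequencies are available as plain dictionaries keyed by marker ID and that only markers present in both sets are compared; the function names, marker IDs, and frequencies are hypothetical, and only the 0.15, 0.3%, and 0.1% thresholds come from the text.

```python
def aim_outlier_fraction(study_freqs, reference_freqs, max_diff=0.15):
    """Fraction of shared AIMs whose frequency differs from the reference by more than max_diff."""
    shared = study_freqs.keys() & reference_freqs.keys()
    if not shared:
        raise ValueError("no markers shared between study and reference")
    outliers = sum(
        abs(study_freqs[m] - reference_freqs[m]) > max_diff for m in shared
    )
    return outliers / len(shared)


def fails_aim_check(study_freqs, reference_freqs, scope="study"):
    """True if the outlier fraction exceeds 0.3% (whole study) or 0.1% (one population)."""
    limits = {"study": 0.003, "population": 0.001}
    return aim_outlier_fraction(study_freqs, reference_freqs) > limits[scope]


# Hypothetical data: 1000 markers, two of which deviate strongly from the reference.
reference = {f"aim{i}": 0.5 for i in range(1000)}
study = dict(reference)
study["aim0"] = 0.9
study["aim1"] = 0.9

print(fails_aim_check(study, reference, "study"))       # 0.2% outliers vs 0.3% limit
print(fails_aim_check(study, reference, "population"))  # 0.2% outliers vs 0.1% limit
```

With two outliers in 1000 markers (0.2%), the same data would pass the whole-study threshold but fail the stricter per-population threshold.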