Arrays: General Information

[Image: lab machine. Photo by Tameka U Shelford]

With Illumina Infinium and Affymetrix Axiom chemistries, investigators have a wealth of genotyping study design options available to them. We currently offer production-scale linkage, GWAS and custom SNP genotyping services that use LIMS tracking, robotic automation and strict QC standards.

Depending on the access pathway, we include at no additional cost:

  • Sample pretesting with option to resolve problems, which allows us to:
    • Determine SNP genotypes for sample tracking
    • Identify file and/or aliquoting errors (primarily sex and/or Mendel discrepancies)
    • Identify unexpected first-degree relatives among subjects and confirm expected relationships
    • Identify samples that perform poorly
    • Identify unexpected duplicate samples and confirm expected duplicates
    • Note: pretesting is not included for mouse or methylation studies
  • Plate Map Design
  • Repeats of failed assays
  • Rigorous laboratory quality assurance
  • Extensive quality evaluation of study, sample and variant level data
  • Inclusion of study duplicates and positive controls
  • Customizable data release and one year project archive
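
As an illustration of the duplicate check in the pretesting list above, here is a minimal Python sketch that compares every pair of samples by genotype concordance and reports highly concordant pairs that are not on a list of expected duplicates. The function names, the data layout (a dict of sample ID to a list of genotype calls, with None for a missing call) and the 0.99 concordance cutoff are assumptions made for this example; they are not a description of the laboratory's actual pipeline.

```python
from itertools import combinations
from typing import Dict, Iterable, List, Optional, Tuple

Genotype = Optional[str]  # e.g. "AA", "AB", "BB"; None means no call


def concordance(a: List[Genotype], b: List[Genotype]) -> float:
    """Fraction of markers called in both samples where the calls agree."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not pairs:
        return 0.0
    return sum(1 for x, y in pairs if x == y) / len(pairs)


def find_unexpected_duplicates(
    samples: Dict[str, List[Genotype]],
    expected_pairs: Iterable[Tuple[str, str]],
    min_concordance: float = 0.99,  # illustrative cutoff, not a documented value
) -> List[Tuple[str, str]]:
    """Return sample pairs that look like duplicates but were not expected."""
    expected = {frozenset(pair) for pair in expected_pairs}
    hits = []
    for s1, s2 in combinations(sorted(samples), 2):
        if frozenset((s1, s2)) in expected:
            continue  # known study duplicate, nothing to report
        if concordance(samples[s1], samples[s2]) >= min_concordance:
            hits.append((s1, s2))
    return hits
```

In this sketch an expected duplicate pair is simply skipped; confirming it would mean checking that its concordance also meets the cutoff.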

Clustering and Calling Genotype Data

For Illumina SNP genotyping services, genotype cluster definitions are determined using the Illumina GenTrain algorithm version 1.0 contained in GenomeStudio software. For Affymetrix genotyping services, Axiom Analysis Suite is used. We first use the software to determine cluster boundaries from a project's own samples. Sample call rates and quality metrics are then evaluated, and a small portion of samples is marked for exclusion from the project release because of poor data quality (for genomic DNAs, generally a call rate below 97-98%). For Illumina projects, the clustering algorithm is run again after poor-quality experiments are excluded, to determine the final cluster positions. Accurate clustering depends on including only high-quality raw data.
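
The sample call rate check described above can be sketched in a few lines of Python. This is only an illustration: the function names and data layout are hypothetical, the 0.98 default is taken from the upper end of the 97-98% range mentioned in the text, and in practice the call rate would come from GenomeStudio or Axiom Analysis Suite rather than being recomputed by hand.

```python
from typing import Dict, List, Optional

Genotype = Optional[str]  # e.g. "AB"; None means the marker was not called


def sample_call_rate(genotypes: List[Genotype]) -> float:
    """Fraction of markers with a non-missing genotype call."""
    if not genotypes:
        return 0.0
    called = sum(1 for g in genotypes if g is not None)
    return called / len(genotypes)


def flag_low_call_rate(
    samples: Dict[str, List[Genotype]],
    threshold: float = 0.98,  # assumed cutoff for this example
) -> List[str]:
    """Return sorted IDs of samples whose call rate falls below the threshold."""
    return sorted(
        sample_id
        for sample_id, calls in samples.items()
        if sample_call_rate(calls) < threshold
    )
```

Samples returned by flag_low_call_rate would be the candidates to exclude before clustering is run a second time.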

Linkage Studies

For linkage studies, the linkage marker panel is chosen from the Illumina QC Array marker set, and that subset is manually reviewed. Clusters under manual review are adjusted as necessary, using HapMap replicate and relationship status as a guide. Intensity data are released for all SNPs on the array.

GWAS Studies

GWAS cluster definitions are determined using the same procedures, with some modifications: a lower genotyping quality score is tolerated, manual review is done only for XY, Y and mitochondrial SNPs, and a SNP “technical filter” is applied to the GWAS data. The filter is designed to remove genotypes only for markers that are complete assay failures. CIDR performs additional manual review of some SNPs based on flags obtained from zCall.* For dbGaP posting purposes, the aim is to post the data in a very raw form, so aggressive genotype “dropping” is not performed.

————————

* zCall: a rare variant caller for array-based genotyping: genetics and population analysis. Goldstein JI, Crenshaw A, Carey J, Grant GB, Maguire J, Fromer M, O'Dushlaine C, Moran JL, Chambert K, Stevens C; Swedish Schizophrenia Consortium; ARRA Autism Sequencing Consortium, Sklar P, Hultman CM, Purcell S, McCarroll SA, Sullivan PF, Daly MJ, Neale BM.

Released Genotyping Data

SNP genotyping data released back to our investigators includes:

  • Raw data files (.idat or .cel files)
  • Genotypes for forward, A/B, design and top alleles
  • Quality scores and intensity values
  • SNP and sample summary tables including quality flags and comments
  • SNP cluster definition files
  • ANNOVAR-annotated SNP manifest
  • PLINK files

GWAS Data Cleaning

For many GWAS-level studies, we also assist the PI with post-release data processing: data cleaning, posting of datasets to dbGaP, and imputation to HRC or the most current applicable reference.

The major goals of GWAS data cleaning are to QC and impute the genetic data and to assist with dbGaP posting. It is conducted by CIDR’s genetic analysis group after the lab completes molecular data production. Regular conference calls are held to communicate the results to the SI.

The data cleaning process first focuses on resolving any remaining sample quality or identity issues (sex discrepancies, unexpected relatedness, etc.). Final sample decisions are usually made after discussion with the SI, based on the study design and the type of issue. For instance, samples can be kept for QC and dbGaP posting but removed from association tests. Batch effects (among samples processed together) are examined, though they are usually not a concern because of the plate-map design done before production. PCA is used to identify ethnic outliers and to calculate eigenvectors that adjust for population stratification in subsequent association analyses.

A number of SNP-level quality filters are applied to flag problematic SNPs beyond those that failed technically. These include filters on missing data, duplicate and Mendelian errors, minor allele frequency, and Hardy-Weinberg equilibrium. Lastly, a relatively simple association (“pre-compute”) analysis is performed to determine whether the data cleaning is thorough and whether there is significant genomic inflation that could lead to spurious results. The pre-compute also lets investigators who access the data verify, by reproducing the pre-compute results, that they downloaded the data, merged the genotype and phenotype datasets, and applied the filters correctly.

Finally, a QC report describing the dataset and the results of the data cleaning process is prepared, along with a number of supporting files, for inclusion on dbGaP. In addition, the cleaned genetic data are uploaded to the cloud for imputation on the Michigan Imputation Server. The imputed results, along with an imputation QC report, are reviewed by the SI before being posted to dbGaP.
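
To make the SNP-level filters and the genomic inflation check described above more concrete, the following Python sketch computes, for one biallelic SNP, the missing-call rate, the minor allele frequency and a one-degree-of-freedom Hardy-Weinberg chi-square statistic, and then estimates the genomic inflation factor as the median association chi-square divided by the median of a 1-df chi-square distribution (about 0.4549). The function names, the A/B genotype coding and the default thresholds are illustrative assumptions only; the thresholds actually applied to a study are not stated in this document, and production analyses would normally rely on established tools such as PLINK.

```python
import statistics
from collections import Counter
from typing import Dict, List, Optional, Sequence

Genotype = Optional[str]  # "AA", "AB", "BB"; None means no call

CHI2_1DF_MEDIAN = 0.4549364  # median of a chi-square distribution with 1 df


def snp_metrics(calls: Sequence[Genotype]) -> Dict[str, float]:
    """Missing rate, minor allele frequency and HWE chi-square for one SNP."""
    counts = Counter(c for c in calls if c is not None)
    n = counts["AA"] + counts["AB"] + counts["BB"]
    missing_rate = 1 - n / len(calls) if calls else 1.0
    if n == 0:
        return {"missing_rate": missing_rate, "maf": 0.0, "hwe_chi2": 0.0}
    p = (2 * counts["AA"] + counts["AB"]) / (2 * n)  # frequency of allele A
    q = 1 - p
    expected = {"AA": p * p * n, "AB": 2 * p * q * n, "BB": q * q * n}
    chi2 = sum(
        (counts[g] - e) ** 2 / e for g, e in expected.items() if e > 0
    )
    return {"missing_rate": missing_rate, "maf": min(p, q), "hwe_chi2": chi2}


def snp_flags(
    metrics: Dict[str, float],
    max_missing: float = 0.02,    # illustrative threshold
    min_maf: float = 0.01,        # illustrative threshold
    max_hwe_chi2: float = 19.51,  # illustrative; roughly p = 1e-5 at 1 df
) -> List[str]:
    """Names of the quality filters this SNP fails."""
    flags = []
    if metrics["missing_rate"] > max_missing:
        flags.append("missing")
    if metrics["maf"] < min_maf:
        flags.append("maf")
    if metrics["hwe_chi2"] > max_hwe_chi2:
        flags.append("hwe")
    return flags


def genomic_inflation(chi2_stats: Sequence[float]) -> float:
    """Lambda: median observed 1-df chi-square over its expected median."""
    return statistics.median(chi2_stats) / CHI2_1DF_MEDIAN
```

A lambda close to 1 from genomic_inflation suggests little inflation of the pre-compute test statistics, while values well above 1 would point to residual stratification or data quality problems.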