SHAPEIT is a fast and accurate haplotype inference software with several notable features:
- Linear complexity with the number of SNPs/individuals in the sample.
- Linear complexity with the number of conditioning haplotypes used in each update step.
- A whole chromosome from a GWAS-scale dataset can be phased in a single run.
- Mixed samples of trios, duos and unrelated individuals are handled.
- Phasing is multi-threaded to decrease computational time on multi-core computers.
SHAPEIT was developed in C++ by Olivier Delaneau (firstname.lastname@example.org) under the supervision of Jean-Francois Zagury. Additional versions are being developed with the co-supervision of Jonathan Marchini.
The paper to cite if SHAPEIT is used:
O. Delaneau, J. Marchini, JF. Zagury. A linear complexity phasing method for thousands of genomes. Nature Methods 2011 (To appear).
To run shapeit, first load the environment using 'module load shapeit', and then enter the SHAPEIT commands as in the example below. A set of sample data can be copied from the system area for testing, e.g.
mkdir /data/$USER/shapeit_example
cp -r /usr/local/apps/shapeit/2.r644/example /data/$USER/shapeit_example
The example below uses the sample files that are provided with the package.
helix% module load shapeit
helix% shapeit -B chr20.unphased -M chr20.gmap.gz -O chr20.phased -T 4

Segmented HAPlotype Estimation & Imputation Tool
  * Authors : Olivier DELANEAU, Jean-François ZAGURY & Jonathan MARCHINI
  * Contact : email@example.com
  * Webpage : http:://www.shapeit.fr
  * Version : v2.r644
  * Date : 01/02/2013 15:13:12
  * LOGfile : [shapeit_01022013_15h13m12s_f625dc43-acea-44ba-913d-b03cdac59a7e.log]

MODE -phase : PHASING GENOTYPE DATA
  * Autosome (chr1 ... chr22)
  * Window-based model (SHAPEIT v2)
  * MCMC iteration

Parameters :
  * Seed : 1359749592
  * Parallelisation: 4 threads
  * MCMC: 35 iterations [7 B + 8 P + 20 M]
  * Model: 100 states per window [100 H + 0 PM + 0 R] / Windows of ~2.0 Mb / Ne = 15000

Reading SNPs in [chr20.unphased.bim] in Plink BIM format
  * 565 SNPs included

Reading individuals in [chr20.unphased.fam] in Plink FAM format
  * 1049 individuals included
  * 603 unrelateds / 28 duos / 130 trios in 761 different families

Reading genotypes in [chr20.unphased.bed] in Plink BED format
  * Plink binary file SNP-major mode

Reading genetic map in [chr20.gmap.gz]
  * 774 genetic positions found
  * #set=194 / #interpolated=371
  * Physical map [35.00 Mb -> 36.00 Mb] / Genetic map [54.41 cM -> 54.98 cM]

Checking missingness and MAF...
  * 0 individuals with high rates of missing data (>5%)
  * 3 SNPs with high rates of missing data (>5%)
  * 50 monomorphic SNPs
  * 112 missing genotypes automatically imputed at monomorphic SNPs
  * 9 singletons SNPs

Checking Mendel errors...
  * Low level of Mendel error in all trios and duos
  * 0 SNPs with high Mendel error rate (> 5%)

Building graphs [761/761]
  * 761 graphs / 16269 segments / ~26 SNPs per segment / 536191 transitions
  * 0 haploids / 603 unrelateds / 28 duos / 130 trios
  * 1810 founder haplotypes

Sampling haplotypes [761/761]
Burn-in iteration [1/7] [761/761]
Burn-in iteration [2/7] [761/761]
[...]
Normalising graphs [761/761]
Solving haplotypes [761/761]

Writing sample information in [chr20.phased.sample] in IMPUTE2 format
Writing haplotypes in [chr20.phased.haps] in IMPUTE format

Running time: 133 seconds
helix%
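
After a run like the one above, it can be useful to sanity-check the haplotype file before passing it on to other tools. The following is a minimal sketch, not part of SHAPEIT itself: the function name haps_summary is hypothetical, and it assumes the .haps file is space-separated with five leading columns (chromosome or SNP identifier, SNP ID, position, first allele, second allele) followed by two 0/1 haplotype columns per individual.

```shell
#!/bin/sh
# haps_summary FILE
# Hypothetical helper (not shipped with SHAPEIT): summarises a .haps file.
# Assumption: each row is one SNP with 5 leading columns, then two
# haplotype columns per individual.
haps_summary() {
    awk '
        # Take the haplotype count from the first row.
        NR == 1 { nhap = NF - 5 }
        # Count rows whose haplotype count differs from the first row.
        NF - 5 != nhap { bad++ }
        END {
            printf "snps=%d haplotypes=%d individuals=%d inconsistent_rows=%d\n",
                   NR, nhap, nhap / 2, bad
        }
    ' "$1"
}
```

For the example run this would be called as 'haps_summary chr20.phased.haps'. If the layout assumption holds, the reported number of individuals should match the number of individuals listed in chr20.phased.sample, and inconsistent_rows should be 0.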