Input file formats
SNP genotype files
Genotype files must be tabular, with the samples as columns and the SNPs as rows. They can also be zipped or gzipped.
Affymetrix (example [Chip: Mapping50K_Hind240])
Instead of AA/AB/BB/NoCall, the 'number format' (0, 1, 2, -1) can also be used.
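As a minimal sketch of how the two notations relate, the conversion could be written as below. The mapping AA = 0, AB = 1, BB = 2, NoCall = -1 is an assumption inferred from the order in which the two notations are listed above, and the function name to_number_format is hypothetical.

```python
# Assumed mapping, inferred from the order in which the codes are listed above.
CALL_TO_NUMBER = {"AA": 0, "AB": 1, "BB": 2, "NoCall": -1}


def to_number_format(calls):
    """Convert a list of AA/AB/BB/NoCall calls to the number format."""
    return [CALL_TO_NUMBER[call] for call in calls]
```

An unknown call raises a KeyError, so a malformed genotype is reported rather than silently converted.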
The following columns will be ignored and do not have to be removed from the file:
VCF file (Next Generation Sequencing genotypes)
The VCF file must have the following columns:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1 Sample2 Sample3 (...)
chr1 14930 . A G . . . GT:DP 1/1:31 0/1:30 0/0:23
The content of the columns 'ID', 'QUAL', 'FILTER' and 'INFO' is ignored. The FORMAT
column is used to determine which part of each sample's entry is the genotype and
which part is the coverage. Please note that the DP flag must be included
in the FORMAT string (not only in INFO!), unless you set the minimum coverage
value in the upload interface to 0. Without the DP flag in FORMAT it is impossible to exclude
genotypes with low coverage, because the DP information in INFO aggregates the coverage over all samples!
Sites at which the genotype is uncertain (two alt alleles) are skipped.
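The rules above can be illustrated with a short Python sketch. It is not the tool's actual implementation. The function name parse_vcf_line is hypothetical, "two alt alleles" is read as a comma in the ALT column, and genotypes below the minimum coverage are assumed to be reported as None. Columns are expected to be tab-separated, as in standard VCF files.

```python
def parse_vcf_line(line, min_coverage=0):
    """Parse one VCF data line.

    Returns (chrom, pos, genotypes), or None if the site is skipped.
    Genotypes below min_coverage are replaced by None (an assumption).
    """
    fields = line.rstrip("\n").split("\t")
    chrom, pos, alt, fmt = fields[0], fields[1], fields[4], fields[8]

    # Assumed meaning of "two alt alleles": several comma-separated ALT entries.
    if "," in alt:
        return None

    # FORMAT tells us where GT and DP sit inside each sample entry.
    keys = fmt.split(":")
    gt_idx = keys.index("GT")
    dp_idx = keys.index("DP") if "DP" in keys else None
    if dp_idx is None and min_coverage > 0:
        raise ValueError("DP is missing from FORMAT; set the minimum coverage to 0")

    genotypes = []
    for sample in fields[9:]:
        parts = sample.split(":")
        genotype = parts[gt_idx]
        if dp_idx is not None and min_coverage > 0:
            depth = parts[dp_idx] if dp_idx < len(parts) else "."
            # A missing or non-numeric depth is treated as insufficient coverage.
            if not depth.isdigit() or int(depth) < min_coverage:
                genotype = None
        genotypes.append(genotype)
    return chrom, int(pos), genotypes
```

With the example line shown earlier and a minimum coverage of 25, the third sample (coverage 23) would be excluded while the first two genotypes are kept.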
Here is a sample file.
# all BAM files in the same directory
samtools mpileup -D -gf /path/to/genome.fa *.bam | bcftools view -c -g - > filename.vcf
# BAM files in different directories
samtools mpileup -D -gf /path/to/genome.fa /path/to/bam1.bam /path/to/bam2.bam | bcftools view -c -g - > filename.vcf
# reference genome: /path/to/genome.fa
# output file: filename.vcf
GATK offers a similar option.
Please read the manuals of SAMtools / bcftools to find the appropriate settings for your data.