Finding SNPs with lots of variation

Strategy for identifying SNPs with lots of variation among ecotypes

The approach that seems to work well for more than two ecotypes is to compute the product of two columns in the SNP spreadsheet, MatchedVariant_Count × MatchedReference_Count, and store the result in a new column.

Next, sort the spreadsheet by the values in the new column, largest first.
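The two steps above (product column, then sort) can be sketched in Python for cases where the spreadsheet has been exported. This is only a minimal illustration: the rows, the `SNP` column, the `Count_Product` column name, and the `rank_by_count_product` helper are all made up for the example; only `MatchedVariant_Count` and `MatchedReference_Count` come from the spreadsheet described here.

```python
# Hypothetical rows standing in for the SNP spreadsheet. Only the two
# count column names are taken from the text; the data is invented.
rows = [
    {"SNP": "snp_a", "MatchedVariant_Count": 3, "MatchedReference_Count": 0},
    {"SNP": "snp_b", "MatchedVariant_Count": 1, "MatchedReference_Count": 2},
    {"SNP": "snp_c", "MatchedVariant_Count": 0, "MatchedReference_Count": 0},
    {"SNP": "snp_d", "MatchedVariant_Count": 2, "MatchedReference_Count": 2},
]


def rank_by_count_product(rows):
    """Add a Count_Product column and sort rows by it, largest first."""
    ranked = []
    for row in rows:
        scored = dict(row)  # copy so the input rows are left untouched
        scored["Count_Product"] = (
            row["MatchedVariant_Count"] * row["MatchedReference_Count"]
        )
        ranked.append(scored)
    # Python's sort is stable, so rows with equal products keep their order.
    ranked.sort(key=lambda r: r["Count_Product"], reverse=True)
    return ranked


ranked = rank_by_count_product(rows)
for r in ranked:
    print(r["SNP"], r["Count_Product"])
```

SNPs where either count is zero end up at the bottom of the list, and those supported by both alleles rise to the top.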

For a given number of samples, the product is largest for SNPs with allele frequencies near 0.5 and drops to zero where there is no support for the SNP being discriminative (e.g. if no samples match the reference, whether because of lack of coverage, because all samples appear to share the same allele, or because the reference simply appears to be incorrect at the given position). For three genotypes, the sort therefore brings to the top the sites where the allele frequency is demonstrably 2:1 in the given data.
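To see why the product behaves this way, it may help to list every possible split of samples between the two counts. The sketch below assumes each sample is counted as matching either the variant or the reference (i.e. no missing coverage); `count_products` is a hypothetical helper written for this illustration.

```python
def count_products(n_samples):
    """Map each (variant, reference) split of n_samples to its product."""
    return {
        (v, n_samples - v): v * (n_samples - v)
        for v in range(n_samples + 1)
    }


# With three samples, only the 1:2 and 2:1 splits give a non-zero product.
products = count_products(3)
for (variant, reference), product in products.items():
    print(variant, reference, product)
```

With more samples the same pattern holds: the product peaks at the most even split and is zero whenever all samples fall on one side.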
