Finding SNPs with lots of variation
Strategy for identifying SNPs with lots of variation among ecotypes
An approach that seems to work well for more than two ecotypes is to compute the product of two columns in the SNP spreadsheet, MatchedVariant_Count x MatchedReference_Count, and store the result in a new column.
Next, sort the spreadsheet by the values in the new column, largest first.
For a given number of matched samples, the product is largest for SNPs with allele frequencies near 0.5. It is smallest (zero) where there is no support for the SNP being discriminative: for example, when no samples match the reference (because of lack of coverage, or because the reference simply appears to be incorrect at the given position), or when all samples appear to share the same allele. For three genotypes, the sort will essentially bring to the top the sites where the alleles are demonstrably split 2:1 in the given data.
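
For anyone who would rather script this than do it by hand in the spreadsheet, the product-and-sort step might look roughly like the following. This is only a minimal sketch: it assumes the spreadsheet has been exported as CSV with the MatchedVariant_Count and MatchedReference_Count columns, and the SNP_ID column, the Count_Product column name, the rank_snps function, and the example rows are all made up for illustration.

```python
import csv
import io


def rank_snps(rows):
    """Add the product column and return the rows sorted by it, largest first."""
    ranked = []
    for row in rows:
        variant = int(row["MatchedVariant_Count"])
        reference = int(row["MatchedReference_Count"])
        new_row = dict(row)
        # Hypothetical name for the new column holding the product.
        new_row["Count_Product"] = variant * reference
        ranked.append(new_row)
    # Highest product first; rows with equal products keep their original order.
    ranked.sort(key=lambda r: r["Count_Product"], reverse=True)
    return ranked


# Made-up example data standing in for a CSV export of the SNP spreadsheet.
example_csv = """SNP_ID,MatchedVariant_Count,MatchedReference_Count
snp_a,3,3
snp_b,6,0
snp_c,1,5
snp_d,0,0
"""

rows = list(csv.DictReader(io.StringIO(example_csv)))
ranked = rank_snps(rows)
for r in ranked:
    print(r["SNP_ID"], r["Count_Product"])
```

In this made-up data, the evenly split site (3 x 3 = 9) sorts above the 1:5 site (1 x 5 = 5), and both the site where every sample carries the variant (6 x 0) and the site with no coverage (0 x 0) drop to the bottom with a product of zero.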