Understanding the variant table
Making sense of the SNP data
What are all those columns?
Transcript: Overlapping 46 bp Solexa sequences were assembled into contigs (short for contiguous) and assigned a number. The 'ctg' tells you you're looking at a contig. The 'code' switches in some tables, so if you see a 'cf' id number, just think contig.
Variant: As each variant was sequentially identified, it was assigned a number. This information is not particularly relevant to your research questions.
Class: 'S' means that the type of variant reported is a SNP. It is also possible to have INDEL (insertion or deletion) data, but that is not reported here.
Position: This is an important column. The number is the position of the SNP, counting from the left (5' end) of the contig. So how do you get the sequence for the contig? Scroll down the "Variation among ecotypes" strategy page until you get to "Protein predictions." There is a link called "solexa_cdna" that will give you the sequences associated with each contig. There are also some tips in this section that will let you evaluate whether or not a particular SNP is changing a protein.
RefAllele: This is the base at the specified position that occurs most frequently in the data. It is one of the two SNP alleles. In most, but not all cases, the RefAllele ends up being from the MN population.
VarAllele: This is the other SNP allele at the specified position.
BothStrands: Is there a SNP on both strands of the DNA? This column is filled with 0s because we sequenced single stranded cDNA. If you were looking at double stranded DNA, this column would be more informative.
AvgQual vs MaxQual: How consistent are the base pair reads in the multiple sequences of 46 bp that were assembled into the contig? If the average quality score for a contig is low compared to the maximum possible, you can't be very confident in that data and should move on to other contigs.
NumReadsWithAllele vs UniqAlns: Number of sequences that had the SNP vs number of unique sequences that had the SNP in the overall set of sequences. Sometimes there are fewer unique alignments because there are multiple copies of the same sequence. The same is true in the columns for the numbered ecotypes. The denominator in the columns labeled "1, 2, or 3" gives you the total number of reads, while the UniqAlns tells you how many unique sequences were found for that contig in that population of plants.
3 vs Freq: Each population (3=OK, 2=KS, 1=MN) is represented by a column with its number above it. In the numbered column you find a fraction: number of contigs with RefAllele/total number of contigs with the SNP. The next column, Freq, gives that fraction as a decimal frequency.
FreqDiff: The very last column of the table gives you a numerical measure of how differently each SNP is distributed among the populations. The closer the value is to 1, the greater the difference. The spreadsheet is ranked from top to bottom, from most to least diversity in the SNP allele.
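If you would rather check a row by script than by eye, the Position, numbered population, and Freq columns described above can be worked through as in the short Python sketch below. It is only an illustration: the row values and the contig sequence are made up, the function names are hypothetical, and Position is assumed to start at 1 at the 5' end. The "spread" computed at the end (largest frequency minus smallest) is one simple way to compare populations; it is not necessarily how FreqDiff was calculated for the table.

```python
def allele_frequency(cell):
    """Convert a fraction cell such as '3/4' into a decimal frequency."""
    numerator, denominator = (int(part) for part in cell.split("/"))
    if denominator == 0:
        return None  # nothing was counted for this population
    return numerator / denominator


def base_at(contig_sequence, position):
    """Return the base at a position counted from the 5' end.

    Assumption: the Position column starts counting at 1, not 0.
    """
    return contig_sequence[position - 1]


# Made-up example row; the keys follow the column names in the table.
row = {
    "Position": 5,
    "RefAllele": "A",
    "VarAllele": "G",
    "1": "3/4",  # MN
    "2": "2/4",  # KS
    "3": "1/4",  # OK
}
contig = "CCGTAGGT"  # made-up contig sequence, 5' end on the left

# Fraction -> decimal frequency for each numbered population column.
freqs = {pop: allele_frequency(row[pop]) for pop in ("1", "2", "3")}

# Hypothetical comparison measure: largest minus smallest frequency.
spread = max(freqs.values()) - min(freqs.values())

# Check that the contig really carries the RefAllele at the SNP position.
matches_ref = base_at(contig, row["Position"]) == row["RefAllele"]

print(freqs, spread, matches_ref)
```

If matches_ref comes out False for a real row, the first thing to check is whether the position should be counted from 0 instead of 1, or whether the contig sequence you downloaded carries the VarAllele at that position.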