Whole Genome Data



This section is awaiting the arrival of genome sequence data from Cofactor Genomics.



Transcriptome sequencing gives biologists a quick but incomplete inventory of the protein-coding genes, because it captures only the genes that happen to be expressed under the conditions sampled. A "whole genome" sequence, on the other hand, is in principle a complete inventory of the protein-coding genes, as well as the non-coding elements: promoters, tRNAs, small RNAs, genomic repeats, etc. The challenges with whole genome sequences lie in 1) the assembly -- correctly piecing together the genome from millions or billions of small reads and 2) the annotation -- correctly predicting, for example, the location and structure of the genes.


For Aiptasia, the first questions to address might be practical ones about how well the sequencing effort has covered the genome:
  • the depth at which the genome has been sequenced (the more reads that we have for each small region of the genome, the easier it is to reconstruct the entire genome sequence)
  • the extent of small DNA repeats that might complicate the assembly
  • the number of genes that might be represented by the genome sequence data
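
The first of these questions, sequencing depth, can be estimated before any assembly is attempted. The sketch below uses the standard relation C = N × L / G (number of reads times read length, divided by genome size) and the Poisson approximation that a fraction of about e^(-C) of the genome receives no reads at all. The function names are made up for illustration, and the read count, read length, and genome size in the usage example are hypothetical placeholders, not Aiptasia values.

```python
import math


def expected_coverage(num_reads, read_length, genome_size):
    """Average sequencing depth C = N * L / G.

    num_reads   -- total number of reads (N)
    read_length -- average read length in bases (L)
    genome_size -- estimated haploid genome size in bases (G)
    """
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return num_reads * read_length / genome_size


def fraction_uncovered(coverage):
    """Expected fraction of bases with zero reads, assuming reads land
    uniformly at random (Poisson approximation): exp(-C)."""
    return math.exp(-coverage)


if __name__ == "__main__":
    # Hypothetical numbers for illustration only.
    depth = expected_coverage(num_reads=1_000_000,
                              read_length=100,
                              genome_size=10_000_000)
    print(f"expected depth: {depth:.1f}x")
    print(f"expected uncovered fraction: {fraction_uncovered(depth):.2e}")
```

The uniform-sampling assumption is optimistic: repeats and GC-biased regions usually make real coverage less even than this estimate suggests, so the result is best read as an upper bound on how well the genome is covered.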


CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes


Data
Dataset of core genes