
Modeling Molecular Evolution

Jodi Schwarz, Vassar College
Marc L. Smith, Vassar College



Biology and Computer Science majors collaborated to design and implement algorithms to model a genomic-level process: the evolution of protein-coding genes. Students had to combine their respective expertise in biology and computer science in order to successfully model this process.

We used a pedagogical approach that starts with a biological question requiring a computational approach to answer. The instructors provided students with partial "starter code" that implemented part of the algorithm. Students then refined, elaborated, or simplified the code to model the process more accurately. Finally, they compared the output of their models to empirical data (an alignment of HSP70 protein sequences) to determine which model most closely matched the data.

Learning Goals

  • CS students: To apply their knowledge of data structures and algorithms to a biological/bioinformatics domain
  • Biology students: To apply their knowledge of biology to the design of algorithms
  • For the collaboration: To become familiar with modeling a biological process, which requires "abstracting away" features of the biology so that a simple model can be constructed and tested first.

Context for Use

This is appropriate for use in an upper level bioinformatics course in which the students have already become familiar with:
  1. basic algorithm design
  2. use of a scripting language such as Perl to implement an algorithm
  3. Central Dogma, especially concepts of codons and protein translation
  4. DNA single point mutations

Description and Teaching Materials

  1. Perl for Exploring DNA by LeBlanc and Dyer (Oxford University Press)
  2. Understanding Bioinformatics by Zvelebil and Baum (Garland)
Week 1: Randomness, Sequence Alignment
Instructors provide background information (lectures and reading) on
  1. Theory and practice of sequence alignment: A lecture on Sequence Alignment (PowerPoint 1.4MB Jun5 09)
  2. Concept of randomness in the physical world as well as simulated by the computer: A lecture on Randomness (PowerPoint 139kB Jun5 09)
Students are given an assignment to acquire protein sequences from several genomic databases, then conduct sequence alignments using Clustal.
  1. Practice performing alignments using BLAST and Clustal (Acrobat (PDF) 69kB Jun5 09)
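
The lecture's point that a computer only simulates randomness can be illustrated with a short sketch. It is written in Python rather than the course's Perl, and the function name `random_dna` and the seed values are illustrative assumptions, not part of the course materials.

```python
import random

def random_dna(length, seed):
    """Return a pseudo-random DNA string; the same seed always gives the same string."""
    rng = random.Random(seed)  # private generator, so global random state is untouched
    return "".join(rng.choice("ACGT") for _ in range(length))

first = random_dna(30, seed=42)
second = random_dna(30, seed=42)  # same seed: identical "random" sequence
third = random_dna(30, seed=43)   # different seed: a different stream of choices

print(first)
print(third)
print(first == second)  # True, because the generator is deterministic
```

Fixing the seed is also what makes a simulation repeatable, which helps when Bio/CS pairs need to compare runs of the same model.
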
Week 2: Creating models of DNA mutation
Discussion of the biology: Students discuss what they know about how DNA and protein sequences change over time, and the class discusses the process of DNA mutation, which the textbook presents as a primarily random process. The biology students, however, should be able to point out that the pattern of nucleotide substitution is probably not random: because altering an amino acid might affect the function of a protein, the nucleotides most likely to vary are those in the third position of codons. The class discusses how the process of mutation may be random, while selection prevents most amino-acid-altering substitutions from persisting.

Discussion of the algorithm: How might we model protein evolution under two different scenarios, random vs. targeted nucleotide substitution? The Bio/CS pairs are given "starter code" that they must modify and develop further to model the effect of random vs. targeted substitution on the human Heat Shock Protein 70 (HSP70).
Assignment: Compare the effects of substitutions that occur at random positions vs. substitutions targeted to the 3rd position of codons:
  1. Model nucleotide substitution assignment
    1. Assignment: Model random vs. targeted substitutions (Microsoft Word 25kB Jun5 09)
    2. Substitution starter code ( 7kB Jun5 09)
    3. Substitution data file (Text File 2kB Jun5 09)
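
As a rough illustration of the two scenarios in this assignment, the sketch below applies substitutions either anywhere in a coding sequence or only at third codon positions, then counts how many amino acids changed. It is not the course's starter code: it is written in Python rather than Perl, the function names (`translate`, `mutate`, `amino_acid_changes`) are hypothetical, and the short `gene` string is a toy sequence standing in for the real data file.

```python
import itertools
import random

# Standard genetic code, listed with bases in T, C, A, G order at each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(itertools.product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna):
    """Translate a coding sequence, ignoring any trailing partial codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))

def mutate(dna, n_substitutions, rng, third_only=False):
    """Apply point substitutions at random sites, or only at third codon positions."""
    seq = list(dna)
    sites = [i for i in range(len(seq)) if not third_only or i % 3 == 2]
    for _ in range(n_substitutions):
        pos = rng.choice(sites)  # a site may be hit more than once
        seq[pos] = rng.choice([b for b in BASES if b != seq[pos]])
    return "".join(seq)

def amino_acid_changes(original, mutated):
    """Count positions where the translated proteins differ."""
    return sum(a != b for a, b in zip(translate(original), translate(mutated)))

rng = random.Random(1)  # fixed seed so the run is repeatable
gene = "ATGGCCAAAGCCGCGGCGATCGGCATCGACCTGGGCACCACCTACTCC"  # toy sequence, assumed for illustration
random_model = mutate(gene, 10, rng)
targeted_model = mutate(gene, 10, rng, third_only=True)

print("random:  ", amino_acid_changes(gene, random_model), "amino acids changed")
print("targeted:", amino_acid_changes(gene, targeted_model), "amino acids changed")
```

Third-position substitutions are not always silent (for example, ATC to ATG changes isoleucine to methionine), so the targeted model usually reduces amino acid changes rather than eliminating them.
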
Week 3: Using empirical data to test the models

The goal: use extant protein-coding genes to provide evidence to support or refute the students' attempts to model the evolution of protein-coding genes.
    1. Construct a DNA sequence alignment of 5 animal homologs of the HSP70 gene
    2. Write an algorithm to calculate the diversity of nucleotides in each position of each codon
      1. Starter Code for calculating codon diversity statistics ( 6kB Jun7 09)
      2. Multiple fasta file of HSP70 DNA sequences (Text File 12kB Jun7 09)
      3. Multiple fasta file of HSP70 amino acid sequences (Text File 4kB Jun7 09)
    3. Compare the diversity statistics with the two models to assess whether one model more closely mirrors the evolution of the HSP70 gene.
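
One simple way to summarize diversity by codon position is sketched below in Python (the course itself used Perl). The statistic chosen here, the fraction of ungapped alignment columns that vary at each codon position, is an assumption made for illustration; the function name `codon_position_diversity` and the three-sequence toy alignment are likewise hypothetical and are not the HSP70 data.

```python
def codon_position_diversity(aligned_seqs):
    """Fraction of ungapped alignment columns that vary, per codon position (1, 2, 3).

    Assumes the alignment starts in frame at column 0 and that gaps
    do not shift the reading frame.
    """
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("aligned sequences must all have the same length")

    variable = {1: 0, 2: 0, 3: 0}
    total = {1: 0, 2: 0, 3: 0}
    for col in range(length):
        column = {s[col] for s in aligned_seqs}
        if "-" in column:
            continue  # skip columns that contain a gap
        position = col % 3 + 1
        total[position] += 1
        if len(column) > 1:
            variable[position] += 1
    return {p: (variable[p] / total[p] if total[p] else 0.0) for p in (1, 2, 3)}

# Toy in-frame alignment of three sequences (illustrative only).
alignment = [
    "ATGGCCAAAGCC",
    "ATGGCTAAAGCA",
    "ATGGCGAAGGCC",
]
diversity = codon_position_diversity(alignment)
print(diversity)  # {1: 0.0, 2: 0.0, 3: 0.75}
```

A result concentrated in position 3, as in this toy case, is the pattern the targeted model predicts; a roughly even spread across positions would favor the random model.
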

Teaching Notes and Tips

Collaborative teaching:

This teaching activity was developed in the context of an upper level team-taught course (computer science and biology).

Pedagogical template:

  1. Define a biological question or problem for the students to address. In this case, the problem to address was evolution of protein-coding DNA sequences.
  2. We provided the students with "starter code" that either did part of the job or was a first attempt at doing the whole job.
  3. The students had to run and study the code and either improve the code or refine their original biological question or, usually, both.
  4. To be successful, they had to collaborate, because they needed expertise in both biology and computer science. The whole was greater than the sum of its parts.

Important Class discussions:

Week 1: Randomness and alignments
  • random: class interactive discussion, playing around with randomness
    • selecting M&Ms
    • pseudo-random
    • can a computer really implement a truly random function?
  • alignments:
    • How is a single alignment chosen from among all the possible ways to line up two or more sequences?
      • Concept of a scoring scheme (PAM, Dayhoff, BLOSUM matrices)
      • What are gaps and what do they represent?
    • What does it mean biologically if a particular amino acid is present at the same position in all the sequences?
      • an alignment is an evolutionary hypothesis: a column with identical amino acids represents an amino acid that was present in the ancestral sequence and all descendant sequences
      • gaps represent indels since the common ancestor
      • amino acid substitutions reflect DNA substitutions in different lineages
    • if we compare alignments of highly conserved homologs vs poorly conserved homologs
      • easy to see homology in the highly conserved homologs
      • more variability in poorly conserved homologs makes homology harder to see
        • a few highly conserved "domains" within an overall poorly conserved sequence vs low level of conservation across the entire sequence
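
The question of how one alignment is chosen over another comes down to a scoring scheme. The Python sketch below scores two candidate alignments of the same pair of sequences and picks the higher-scoring one. The match, mismatch, and gap values are arbitrary assumptions chosen for readability; real protein alignments use substitution matrices such as PAM or BLOSUM, and the function name `alignment_score` is hypothetical.

```python
def alignment_score(row_a, row_b, match=1, mismatch=-1, gap=-2):
    """Score a pairwise alignment column by column under a simple scoring scheme."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            score += gap        # a gap column (a hypothesized indel)
        elif a == b:
            score += match      # identical residues
        else:
            score += mismatch   # a hypothesized substitution
    return score

# Two ways to line up GATTACA with GATACA (toy sequences).
candidates = [
    ("GATTACA", "GAT-ACA"),
    ("GATTACA", "-GATACA"),
]
for row_a, row_b in candidates:
    print(row_a, row_b, alignment_score(row_a, row_b))

best = max(candidates, key=lambda rows: alignment_score(*rows))
print("preferred:", best)  # ('GATTACA', 'GAT-ACA'), which scores 4 versus 0
```

Changing the gap penalty changes which alignment wins, which is a useful prompt for discussing what gaps represent biologically.
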

Week 2: Mutation experiments
  • What do we mean by "random"?
    • Do we select one of the 4 nucleotides randomly?
      • what about if we replace a nucleotide with itself?
    • selecting the location at which the mutation will occur randomly
    • should we model insertions and deletions?
      • They would disrupt the reading frame, so they are too complicated to model for now
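
The question about replacing a nucleotide with itself can be made concrete with a small sketch. It is in Python rather than the course's Perl, and the function name `substitute` and its `allow_self` flag are hypothetical. If the replacement is drawn from all four nucleotides, about a quarter of "mutations" leave the base unchanged.

```python
import random

def substitute(base, rng, allow_self=False):
    """Pick a replacement nucleotide, optionally allowing the original base."""
    choices = "ACGT" if allow_self else "ACGT".replace(base, "")
    return rng.choice(choices)

rng = random.Random(0)  # fixed seed so the run is repeatable
trials = 10_000

# Drawing from all four nucleotides: some substitutions change nothing.
unchanged = sum(substitute("A", rng, allow_self=True) == "A" for _ in range(trials))
print(unchanged / trials)  # expected to be near 0.25

# Drawing from the other three nucleotides: every substitution is a real change.
always_changed = all(substitute("A", rng) != "A" for _ in range(trials))
print(always_changed)  # True
```

Which choice is appropriate depends on whether the model counts mutation events or observable changes, a distinction students should state explicitly in their algorithm.
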
Week 3: Assessing the models by comparing substitutions in homologs of HSP70 in metazoa
  • discussion about how to deal with gaps
    • must occur in sets of three to maintain the reading frame
    • Does Clustal ensure that the reading frame is not thrown off? No! Why?
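
A quick check for the gap problem discussed above can be sketched as follows, in Python rather than the course's Perl. The function name `frame_breaking_gaps` and the example sequence are hypothetical. The check only tests whether each run of gaps is a multiple of three long; a three-base gap that starts in the middle of a codon would still pass, so this is a necessary condition, not a sufficient one.

```python
import re

def frame_breaking_gaps(aligned_seq):
    """Return (start, length) for each gap run whose length is not a multiple of three."""
    return [
        (m.start(), len(m.group()))
        for m in re.finditer(r"-+", aligned_seq)
        if len(m.group()) % 3 != 0
    ]

# Toy aligned DNA sequence: one codon-sized gap, then gaps of length 1 and 2.
row = "ATG---GCC-AAA--TAA"
print(frame_breaking_gaps(row))  # [(9, 1), (13, 2)]
```

Running this on each row of a DNA alignment shows students where a nucleotide-level aligner has placed gaps that would be impossible to interpret codon by codon.
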


Assessment

  • Pre/post survey to assess level of familiarity with genomics and bioinformatics: Pre/post Attitude Survey (Microsoft Word 31kB Jun29 09)
  • Completion of a "pre-assignment" that tested the ability to acquire sequences from the databases, and familiarity with and interpretation of protein sequence alignments: Alignment Assignment (Acrobat (PDF) 69kB Jun5 09)
  • Completion of an algorithm and Perl script that modeled the evolution of DNA under random vs. targeted scenarios
  • Completion of an algorithm and Perl script that calculated the real nucleotide diversity at each position, using a dataset of 5 genes:
Example student results (diversity) (Text File 559bytes Jun29 09)
  • Instructor observation on the following aspects:
    • collaboration and communication
    • ability to identify and describe problems associated with the development and implementation of algorithms
    • ability to interpret results after implementation and to discover problems that require re-thinking the approach
    • increased sophistication in ability to understand molecular evolution (mutation occurs at the DNA level, but selection occurs at the protein level)

References and Resources


  1. Perl for Exploring DNA by LeBlanc and Dyer (Oxford University Press)
  2. Understanding Bioinformatics by Zvelebil and Baum (Garland)

Computing resources

Bioinformatics computing cluster (16-core Linux, Sun Grid Engine) with a suite of open source bioinformatics tools (via iNquiry).

However, this module could be done entirely with open source web-based tools (such as ClustalW) and any system that can run the code (such as Linux or DOS).