Introduction to strings and DNA/protein sequence alignments

Benjamin Bratton
Princeton University, Molecular Biology


This is a problem set that helps solidify concepts of computational processing (accessing data, parsing data, visualizing data, using preconstructed tools) and sequence matching, specifically in the context of DNA/protein sequences (exact match, BLAST, multiple alignment, conserved motifs). While there is some domain-specific knowledge in the activity, it focuses more on helping students learn to use computation to solve problems and on exposing them to various computational tools.

Learning Goals

Activity specific goals:

After completing this activity, all students will be able to

  • load generic text files into MATLAB and search them for specific strings.
  • utilize pre-built MATLAB tools such as fastaread for faster import of specifically formatted text.
  • evaluate the efficiency and scaling of algorithms by benchmarking and tracking execution time.
  • retrieve data from online databases such as NCBI.
For more advanced students, the activity also provides a guide to
  • compare MATLAB-based algorithm implementations with algorithms in other languages.
  • consider the scaling of execution time and storage needs for big data.
  • apply string searches to contemporary problems in molecular biology research.

Scientific computing and problem solving goals:

After completing this activity, students will be able to

  • access data stored in online databases or in files provided to them.
  • generate plots with multiple panels to best
  • synthesize pre-built tools with additional MATLAB code to solve specific problems.
  • reuse code snippets and single-purpose functions.
  • quickly develop code by taking a complex problem and breaking it down into smaller pieces.
  • appreciate that the skillsets necessary for success in modern scientific computing require both domain-specific knowledge and algorithm development.

Domain specific goals (Molecular Biology/Bioinformatics):

  • The primary structure of DNA and proteins can be represented by an ordered series of letters. The language of DNA requires only 4 letters and that of proteins roughly 20.
  • Bacterial genomes are a few million base pairs long. To uniquely define a location in the genome, one must use a sequence of roughly 10-15 base pairs. For example, this is relevant for designing site-specific oligonucleotides for genome amplification or genome editing.
  • Enzymes are proteins that can catalyze specific reactions and are often a few hundred amino acid residues long, roughly the length of a sentence. To achieve this chemistry, they often have specific residues in an active site that are required for this function. Other residues (letters) in the sentence are not as highly conserved between closely related organisms. These active site residues can sometimes be found by looking for conservation in a multiple sequence alignment across proteins from different organisms.
  • Immune cells make antibodies with unique protein sequences by mixing and matching sequences from V, D, and J sites. This makes sequence matching difficult because the template is split and rejoined in different ways, creating a vast repertoire of antibodies from a relatively small starting pool of DNA sequences.
  • Big data will continue to be an issue for molecular biology in the post-genomic era. In RNAseq experiments, the mRNA of a sample is matched back to the genome of the organism to provide a quantitative measure of the number of transcripts a certain gene has. While the final data (counts per gene) may only be a few megabytes in size, the text search algorithms need to be able to handle inputs of hundreds of gigabytes of data for each sample. Additionally, these searches need to allow mismatches, as the reference genome or an individual read may be incomplete or incorrect.
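
The "roughly 10-15 base pairs" figure above can be motivated with a back-of-the-envelope estimate. The sketch below is in Python rather than MATLAB, and it assumes a random genome of N bases, which is a simplification. Under that assumption a given k-mer is expected to occur about N / 4^k times, so a unique site needs roughly 4^k > N. The function name `min_unique_length` and the 4,000,000 base pair genome size are illustrative assumptions.

```python
def min_unique_length(genome_length, alphabet_size=4):
    """Smallest k with alphabet_size**k > genome_length (random-sequence estimate)."""
    k = 1
    while alphabet_size ** k <= genome_length:
        k += 1
    return k


# Illustrative genome of 4 million base pairs (an assumed round number).
print(min_unique_length(4_000_000))       # one strand
print(min_unique_length(2 * 4_000_000))   # both strands considered
```

This is only a lower bound. Real genomes contain repeats and biased base composition, so practical designs tend to need somewhat longer sequences than the estimate suggests.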

Context for Use

Domain specific background

Two of the major classes of biomolecules, nucleic acids and proteins, encode specific molecular information in their ordered sequence. For DNA, there are only four types of bases, A/T/C/G, and any specific piece of DNA can be written as a word or sentence of these letters. In turn, the cell reads these nucleic acids and translates them into sequences of amino acids, of which there are 20. This flow of information from nucleic acid to proteins is called the central dogma of molecular biology. Because both DNA and proteins are naturally written as sequences, they are a good way to introduce students to the computational concepts of string processing and string matching.

For some molecular biology applications, finding a specific sequence is the most important step. That is, given a genome in which one wants to make a precise modification (CRISPR, genome editing, etc.), it can be difficult to determine the shortest sequence of nucleic acid that defines that locus. Due to a variety of considerations, including the cost of DNA synthesis, short sequences are much more experimentally tractable than long ones. This leads naturally to a string matching problem: finding a string that matches one location and nowhere else.
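
One simple way to approach this problem is to grow a candidate string from the locus one base at a time until it occurs exactly once. The sketch below is in Python rather than MATLAB, uses a made-up toy genome, and ignores the reverse-complement strand. The names `count_overlapping` and `shortest_unique_at` are hypothetical helpers for this example, not part of the activity's materials.

```python
def count_overlapping(text, pattern):
    """Count occurrences of pattern in text, including overlapping ones."""
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def shortest_unique_at(genome, start):
    """Shortest substring beginning at `start` that occurs only once in genome.

    Returns None if even the longest substring from `start` is not unique.
    """
    for end in range(start + 1, len(genome) + 1):
        probe = genome[start:end]
        if count_overlapping(genome, probe) == 1:
            return probe
    return None


toy_genome = "ACGTACGGACGT"  # invented sequence for illustration
print(shortest_unique_at(toy_genome, 4))
```

Each extension rescans the whole genome, so this brute-force search is slow on real data. It is meant as a starting point that students can benchmark and improve.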

In other situations, the string search is more complicated. Many organisms share protein sequences that are very similar, especially in the conserved residues that make up the active sites of enzymes. In a multiple sequence alignment, these conserved residues line up at the same positions across the proteins. This is an inexact matching problem: the goal is to best line up multiple sequences that are similar but not the same.
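
Once an alignment exists, spotting fully conserved positions is a column-by-column comparison. The sketch below is in Python rather than MATLAB and assumes the sequences have already been aligned by some multiple alignment tool, with gaps written as "-". The three short sequences are invented for illustration, and `conserved_columns` is a hypothetical helper name.

```python
def conserved_columns(aligned, gap="-"):
    """Return 0-based column indices where every sequence has the same residue.

    `aligned` is a list of equal-length strings from an existing alignment.
    Columns containing a gap are never reported as conserved.
    """
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("aligned sequences must be non-empty and of equal length")
    conserved = []
    for index, column in enumerate(zip(*aligned)):
        residues = set(column)
        if len(residues) == 1 and gap not in residues:
            conserved.append(index)
    return conserved


# Invented toy alignment; real alignments come from an alignment tool.
alignment = [
    "MKT-HGS",
    "MRTAHGS",
    "MKS-HGT",
]
print(conserved_columns(alignment))
```

Requiring identical residues in every sequence is the strictest possible definition of conservation. A softer score, such as the fraction of sequences sharing the most common residue, is a natural extension.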

A final example of a common, contemporary molecular biology problem involving string matching is the RNAseq experiment. In an RNAseq experiment, the mRNA of a specific sample is isolated and sequenced by next-generation sequencing methods, which read off the sequence of individual pieces of nucleic acid. Each of the millions of resulting reads is mapped back to the genome of the organism to calculate how many RNA molecules are produced from each gene. This generates an expression profile for the sample and can be used to measure stress response, development or how effective a drug is at eliciting some desired response. RNAseq presents both a problem of matching short strings back to a reference genome and a big data problem of storing, accessing and transferring hundreds of gigabytes of data per sample.
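
The read-mapping idea can be sketched at toy scale. The Python example below (not MATLAB) places each read at the position with the fewest mismatches, allowing up to a fixed number of them, and then counts reads per gene. The reference string, gene coordinates and reads are all invented, and assigning a read to a gene by its start position alone is a simplification. This is brute force, not how production aligners work.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def map_read(reference, read, max_mismatches=1):
    """Start of the best placement with at most max_mismatches, else None."""
    best_pos, best_dist = None, max_mismatches + 1
    for pos in range(len(reference) - len(read) + 1):
        dist = hamming(reference[pos:pos + len(read)], read)
        if dist < best_dist:  # keeps the earliest position among ties
            best_pos, best_dist = pos, dist
    return best_pos


def count_reads_per_gene(reference, genes, reads, max_mismatches=1):
    """Count reads whose mapped start falls in each gene's [start, end) range."""
    counts = {name: 0 for name in genes}
    for read in reads:
        pos = map_read(reference, read, max_mismatches)
        if pos is None:
            continue  # read could not be placed within the mismatch budget
        for name, (start, end) in genes.items():
            if start <= pos < end:
                counts[name] += 1
    return counts


# Invented toy data; gene coordinates are hypothetical.
reference = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
genes = {"geneA": (0, 20), "geneB": (20, 39)}
reads = ["GCCATTGT", "GAAAGGGT", "GAAAGCGT", "TTTTTTTT"]
print(count_reads_per_gene(reference, genes, reads))
```

The scan costs on the order of (reference length) x (read length) per read. That is hopeless for millions of reads against a full genome, which is why practical tools build an index of the reference first.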

Instructional context/format

  • This problem set is designed for students who have some basic understanding of the MATLAB environment and how to interact with scripts. The students should also have a basic understanding of molecular biology. In particular, students should know before starting the exercise that both DNA and proteins can be written as strings of letters and that those letters dictate the function of the molecules.
  • This problem set will likely take students roughly 3-5 hours to complete. My students have done different parts of this before, but not the complete version.
  • This activity is situated near the beginning of a course that requires students to interact with a variety of types of data. In that sense, it is an introductory exercise that helps teach them mainly computation skills using an interesting biological problem to get them motivated.
  • This exercise can be done easily as individuals, or groups of 2-3. When groups get to be larger than 3, we've found that the students don't engage as well in the physical typing of code and therefore don't gain the benefits of learning the computational skills.

Description and Teaching Materials

- This exercise is explicitly designed to teach functionality in MATLAB and additionally requires the Bioinformatics Toolbox and an internet connection. That being said, it could be reworked to use standalone tools for performing multiple sequence alignment and/or pre-downloaded data. The NCBI repositories and FASTA formatted data that are provided here are easily processed using tools built in other languages such as Python.

- A PDF of the exercises that can be used as a student handout is provided, along with the required data and a MATLAB Live Script file with worked code for the examples.
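
To illustrate the point above about processing FASTA data in other languages, here is a minimal FASTA parser in plain Python. It relies only on the basic FASTA convention: a header line beginning with ">" followed by one or more lines of sequence. The function name `parse_fasta` and the example records are assumptions made for this sketch; it is not a replacement for a full-featured reader.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Invented two-record example; a real file would be read with open(path).read().
example = ">seq1 toy record\nACGT\nACGT\n>seq2\nGGCC\n"
for name, sequence in parse_fasta(example):
    print(name, len(sequence))
```

Joining the wrapped sequence lines into one string per record is what makes later string searches straightforward.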

Introduction to strings and sequence alignments - Student Handout (Acrobat (PDF) 127kB Aug16 18)

Plain text version of the declaration of independence (Text File 8kB Aug16 18)

retreiveFastaSeqs_helicases.m (Matlab File 1kB Aug16 18)

Vibrio cholerae genome (3.9MB Aug16 18)

instructor_livescript_for_DNAsequences (MATLAB Live Script 787kB Jan11 19)

Teaching Notes and Tips

1) Although accession codes don't change frequently, it would be good to make sure that the retrieval script is working before using it.
2) If your students need extra assistance in making figures or have never made figures before, they may find it difficult to make multi-axis plots or change scales from log to linear and back.
3) If you want to focus more on benchmarking and algorithms, you could ask the students to design more than one approach and compare them.
4) If you want to focus more on molecular biology concepts, you can add additional discussion about amino acids, their chemistry and why "functional conservation" makes more sense in the context of proteins than DNA.
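
For the benchmarking variation in tip 3, one possible shape for a comparison is sketched below in Python rather than MATLAB. It times two ways of counting a pattern in random DNA of increasing length: an explicit position-by-position loop and repeated calls to the built-in `str.find`. The function names, the pattern and the sizes are assumptions made for illustration, and timings will differ from machine to machine.

```python
import random
import time


def count_naive(text, pattern):
    """Count matches by comparing a slice at every position."""
    m = len(pattern)
    return sum(1 for i in range(len(text) - m + 1) if text[i:i + m] == pattern)


def count_builtin(text, pattern):
    """Count (overlapping) matches using repeated str.find calls."""
    count, start = 0, text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def benchmark(sizes, pattern="GATTACA", seed=0):
    """Return (size, naive_seconds, builtin_seconds) for random DNA of each size."""
    rng = random.Random(seed)
    rows = []
    for n in sizes:
        text = "".join(rng.choice("ACGT") for _ in range(n))
        t0 = time.perf_counter()
        naive = count_naive(text, pattern)
        t1 = time.perf_counter()
        builtin = count_builtin(text, pattern)
        t2 = time.perf_counter()
        assert naive == builtin  # both approaches must agree
        rows.append((n, t1 - t0, t2 - t1))
    return rows


for n, t_naive, t_builtin in benchmark([1_000, 10_000, 100_000]):
    print(f"{n:>7d}  naive {t_naive:.5f} s  builtin {t_builtin:.5f} s")
```

Plotting the two timing columns against input size on log-log axes makes the scaling behavior easier to compare than reading the raw numbers.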


Assessment

- Students will produce a few short scripts that can be used to process simple text documents.
- Students will answer a few short answer questions as well as produce some figures demonstrating their ability to process text documents.
- Students will have an understanding of sequence alignment and should be able to answer questions about the types of experiments and analysis that can be done with sequence alignment.

References and Resources

The following references may be helpful for understanding some of the topics in this activity that are more specific to molecular biology and bioinformatics.

  • Tilak Raj, Nikhil Sharma and T. C. Bhalla, "Bacterial serine proteases: Computational and statistical approach to understand temperature adaptability," Journal of Proteomics & Bioinformatics, vol. 10, no. 12, pp. 329-334, 2017.
  • B. S. Wendel, C. He, M. Qu, D. Wu, S. M. Hernandez, K.-Y. Ma, E. W. Liu, J. Xiao, P. D. Crompton, S. K. Pierce, P. Ren, K. Chen, and N. Jiang, "Accurate immune repertoire sequencing reveals malaria infection driven antibody lineage diversification in young children," Nature Communications, vol. 8, p. 531, Sept. 2017.
  • Z. Sethna, Y. Elhanati, C. R. Dudgeon, C. G. Callan, A. J. Levine, T. Mora, and A. M. Walczak, "Insights into immune system development and function from mouse T-cell repertoires," PNAS, vol. 114, pp. 2253-2258, Feb. 2017.
  • H. Li, "Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM," ArXiv e-prints, Mar. 2013.
  • B. Langmead and S. L. Salzberg, "Fast gapped-read alignment with Bowtie 2," Nature Methods, vol. 9, pp. 357-359, Apr. 2012.