Resequencing of Commercial Microorganisms

Jessica Kaufman, Endicott College

Abstract

Many people take probiotics to improve their health. While competing products claim to contain the same species that have shown health effects, it is difficult to manufacture live products without any contamination. The sensitivity of next-gen sequencing can be used to describe the strain variation in commercial microorganisms and to detect DNA from other microorganisms that may be present in some products. This investigation uses next-gen sequencing and bioinformatics analysis to verify the probiotic contents of a product whose species composition is listed on the label. A probiotic pill or product is chosen for the resequencing analysis if it contains a species and strain of bacteria with a high-quality reference genome sequence available in the National Center for Biotechnology Information (NCBI) Nucleotide database. After DNA isolation and library preparation, student samples are pooled for next-gen sequencing on an Illumina MiSeq. Students learn the bioinformatics needed to analyze the FASTQ files from the MiSeq run using publicly available computing resources from the Galaxy Project (usegalaxy.org) and R Shiny websites built for the final formatting of sequence files. Students complete the investigation with a public submission of the FASTA file to the NCBI Nucleotide database.

Student Goals

  1. Troubleshoot next-gen sequencing protocol in response to quantification and quality control steps
  2. Critically describe limitations and assumptions of the experimental design
  3. Communicate about research progress orally and in writing

Research Goals

  1. Upload novel sequence data to NCBI
  2. Identify sources of contamination in the manufacture of probiotics

Context

All Biology and Biotechnology majors are required to take a two-course sequence in their junior year with the same instructor. Bioengineering majors can take the fall genetics course as an elective, but are required to take the spring semester Bioinformatics course. To perform next-gen library preparation, students must be comfortable with biotechnology and lab skills from previous lab courses. Students who have successfully performed PCR and other techniques with pipetting and heat treatment steps are well prepared for library preparation. Students are not expected to have any programming experience before this two-course sequence.

In the Fall semester, students take "Genes and Genomes" lecture and lab. The lab meets for 2 hours once a week.

In the Spring semester, students take a required "Bioinformatics" lecture course, which is taught as an online 3-credit course. The first bioinformatics project is the completion of this CURE. For this first project, they run applications through the browser-based Galaxy project and use a web-based application created by the instructor. The project is broken down into the analysis steps listed under CURE Design.

Student Enrollment:  20-30 students

Target Audience:  Major, Upper Division

CURE Duration:  Two semesters

CURE Design

The theme of this CURE is genome sequencing and resequencing analysis. The project trains students in all aspects of the sequencing process and the associated bioinformatics tools needed to analyze the raw FASTQ files and form a consensus sequence for an organism with an existing, high-quality reference genome. For both the wet lab and in silico lab activities, students are given structured documents that list protocols as planned and are asked to edit them to reflect the steps as executed. Each student is assigned a single gene within the bacterial genome for their project. Depending on the quality of the DNA library, students must adjust their analysis to produce a single consensus sequence for their assigned gene. Students use iteration to find the best bioinformatics workflow and set of options for each of the applications on usegalaxy.org.

The lab course breaks up the sequencing process into 6 separate lab activities:

  1. DNA Extraction using Invitrogen PureLink Microbiome Purification Kit
  2. DNA Quantification using Agilent TapeStation
  3. End Repair and Adapter Ligation using NEBNext Library Prep Kit for Illumina
  4. Indexing PCR using NEBNext Library Prep Kit for Illumina
  5. Library Quantification and Normalization using Agilent TapeStation
  6. Library Pooling, Denaturation, and Illumina MiSeq run
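
The normalization in step 5 typically rests on converting a mass concentration into molarity and then diluting to a common target. The following is a minimal Python sketch of that arithmetic, assuming double-stranded DNA at an average of 660 g/mol per base pair. The function names, the example readings, and the 4 nM target are illustrative assumptions, not values taken from this course's protocol.

```python
def library_molarity_nm(conc_ng_per_ul: float, mean_fragment_bp: float) -> float:
    """Convert a dsDNA library concentration (ng/uL) to nanomolar.

    Assumes an average mass of 660 g/mol per base pair of dsDNA.
    """
    if conc_ng_per_ul < 0 or mean_fragment_bp <= 0:
        raise ValueError("concentration must be >= 0 and fragment size > 0")
    return conc_ng_per_ul * 1e6 / (660.0 * mean_fragment_bp)


def dilution_volumes_ul(stock_nm: float, target_nm: float, final_volume_ul: float):
    """Return (library_ul, diluent_ul) needed to reach target_nm, using C1*V1 = C2*V2."""
    if stock_nm < target_nm:
        raise ValueError("library is already below the target concentration")
    library_ul = target_nm * final_volume_ul / stock_nm
    return library_ul, final_volume_ul - library_ul


if __name__ == "__main__":
    # Hypothetical readings: 2.64 ng/uL at a mean fragment size of 400 bp.
    stock = library_molarity_nm(2.64, 400)
    lib_ul, diluent_ul = dilution_volumes_ul(stock, target_nm=4.0, final_volume_ul=20.0)
    print(f"{stock:.1f} nM stock: mix {lib_ul:.1f} uL library with {diluent_ul:.1f} uL diluent")
```

A library that falls below the target concentration raises an error here, which mirrors the decision point students face when a weak library cannot be normalized and has to be left out of the pool.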

The computational course breaks up the analysis process into 6 separate lab activities:

  1. FASTQ files are aligned to reference genome with default trimming settings
  2. Trimming is improved through iteration to retain at least 50% of data while keeping average quality as high as possible
  3. More than one alignment application is used to align to reference genome
  4. SAMtools are run after each alignment, and the best aligner and alignment settings are chosen
  5. Variant calling is iterated towards best output that reflects a single consensus sequence for student's assigned gene
  6. FASTA sequence file uploaded to NCBI
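
The trimming criterion in step 2 (retain at least 50% of the data while keeping average quality as high as possible) can be sketched in Python. This is a simplified illustration, not the Galaxy tools the students run: it assumes Phred+33 quality strings and a plain 3'-end trimmer rather than a sliding-window one, and all function names and toy reads are hypothetical.

```python
def phred_scores(quality_line: str, offset: int = 33) -> list[int]:
    """Decode a FASTQ quality string (Phred+33 assumed) into integer scores."""
    return [ord(ch) - offset for ch in quality_line]


def trim_3prime(scores: list[int], min_q: int) -> list[int]:
    """Drop bases from the 3' end until one meets min_q (a simplified trimmer)."""
    end = len(scores)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return scores[:end]


def summarize(reads: list[list[int]], min_q: int) -> tuple[float, float]:
    """Return (fraction of bases retained, mean quality of retained bases)."""
    total = sum(len(r) for r in reads)
    kept = [q for r in reads for q in trim_3prime(r, min_q)]
    retained = len(kept) / total if total else 0.0
    mean_q = sum(kept) / len(kept) if kept else 0.0
    return retained, mean_q


def best_threshold(reads, thresholds, min_retained=0.5):
    """Among thresholds retaining at least min_retained of the bases,
    return the one with the highest mean quality (ties go to the stricter one)."""
    candidates = []
    for t in thresholds:
        retained, mean_q = summarize(reads, t)
        if retained >= min_retained:
            candidates.append((mean_q, t))
    return max(candidates)[1] if candidates else None


if __name__ == "__main__":
    # Two toy reads: 'I' = Q40, '5' = Q20, '#' = Q2.
    reads = [phred_scores("IIII55##"), phred_scores("II55####")]
    for t in (10, 20, 30):
        retained, mean_q = summarize(reads, t)
        print(f"min_q={t}: retained {retained:.1%}, mean quality {mean_q:.1f}")
    print("chosen threshold:", best_threshold(reads, [10, 20, 30]))
```

The loop over candidate thresholds stands in for the manual iteration students perform: each setting is scored on both retention and quality before one is chosen.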

Core Competencies: Analyzing and interpreting data, Planning and carrying out investigations, Using mathematics and computational thinking
Nature of Research: Informatics/Computational Research, Wet Lab/Bench Research

Tasks that Align Student and Research Goals

Student Goal 1: Troubleshoot next-gen sequencing protocol in response to quantification and quality control steps

Research Goal 1: Upload novel sequence data to NCBI

- Quantify DNA
- Quantify and normalize library
- Carry out library pooling, denaturation, and MiSeq runs

Research Goal 2: Identify sources of contamination in the manufacture of probiotics

- Filter and trim FASTQ files
- Use BLAST search to identify unaligned reads

Student Goal 2: Critically describe limitations and assumptions of the experimental design

Research Goal 1: Upload novel sequence data to NCBI

- Develop variant calling workflow on usegalaxy.org
- Discuss lab reports

Research Goal 2: Identify sources of contamination in the manufacture of probiotics

- Perform Shannon diversity analysis on raw data
- Compare any contaminants between years (from lab and students)

Student Goal 3: Communicate about research progress orally and in writing

Research Goal 1: Upload novel sequence data to NCBI

- Write formal lab reports
- Submit sequence data to NCBI
- Present during 5-10 minute "lab meetings"

Research Goal 2: Identify sources of contamination in the manufacture of probiotics

- Discuss lab reports
- Carry out second project in bioinformatics class on taxonomic classification, which builds on skills developed in variant calling project

Instructional Materials

ELN Template (Acrobat (PDF) 264kB Jun5 19)
Unit 1 Project Assignment (Acrobat (PDF) 475kB Jun5 19)

Electronic Lab Notebook with Google Docs for Sequencing Labs

Unit 1 Project Outline with Analysis Steps

Web-based app for creating variant FASTA files from genome FASTA file and vcf file: https://ecbio340.shinyapps.io/R-VCF2FA-master
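
The general idea of building a variant FASTA from a reference sequence and a VCF file can be sketched in Python. This is not the code behind the R Shiny app above. It is a simplified illustration that assumes a single contig, takes the first ALT allele of every record, ignores genotype and filter fields, and assumes variants do not overlap. The function names and the toy sequence are hypothetical.

```python
def apply_variants(reference: str, vcf_lines: list[str]) -> str:
    """Substitute VCF variants (SNPs and simple indels) into a reference sequence."""
    variants = []
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and VCF header lines
        fields = line.rstrip("\n").split("\t")
        pos, ref = int(fields[1]), fields[3]
        alt = fields[4].split(",")[0]  # first ALT allele only
        variants.append((pos, ref, alt))

    seq = reference
    # Work from the 3' end so earlier coordinates stay valid after indels.
    for pos, ref, alt in sorted(variants, reverse=True):
        start = pos - 1  # VCF positions are 1-based
        if seq[start:start + len(ref)].upper() != ref.upper():
            raise ValueError(f"REF allele {ref!r} does not match reference at position {pos}")
        seq = seq[:start] + alt + seq[start + len(ref):]
    return seq


def format_fasta(header: str, sequence: str, width: int = 60) -> str:
    """Return a FASTA record with the sequence wrapped at a fixed line width."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + header + "\n" + "\n".join(lines) + "\n"


if __name__ == "__main__":
    reference = "ATGCGTACGTTAG"  # toy sequence, not real data
    vcf = [
        "##fileformat=VCFv4.2",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
        "gene1\t4\t.\tC\tT\t50\tPASS\t.",   # SNP
        "gene1\t7\t.\tAC\tA\t50\tPASS\t.",  # one-base deletion
    ]
    print(format_fasta("gene1_consensus", apply_variants(reference, vcf)), end="")
```

Checking each REF allele against the reference before substituting catches the common mistake of pairing a VCF with the wrong reference file or the wrong gene coordinates.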

Web-based app for calculating Shannon Diversity metrics from Kraken2 tabular output: https://ecbio340.shinyapps.io/ShannonDiversityKraken2
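
The Shannon diversity calculation itself is short enough to sketch in Python. This is not the code behind the R Shiny app above. It assumes the six-column, tab-separated Kraken2 report layout (percent, clade reads, direct reads, rank code, taxid, name) with no header row, and it computes the index from species-level rows only. The function names and the example rows are made up for illustration.

```python
import math


def species_counts(report_lines):
    """Collect clade read counts for species-level rows of a Kraken2-style report.

    Assumed columns (tab-separated, no header): percent, clade reads,
    direct reads, rank code, taxid, indented name.
    """
    counts = {}
    for line in report_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        clade_reads = int(fields[1])
        if fields[3] == "S" and clade_reads > 0:
            counts[fields[5].strip()] = clade_reads
    return counts


def shannon_index(counts):
    """Shannon diversity H = -sum(p_i * ln(p_i)) over nonzero counts."""
    counts = [c for c in counts if c > 0]
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts)


if __name__ == "__main__":
    # Made-up report rows: taxids and names are placeholders, not real data.
    report = [
        "5.00\t50\t50\tU\t0\tunclassified",
        "95.00\t950\t0\tG\t10\t  Genus A",
        "90.00\t900\t900\tS\t11\t    Species A1",
        "5.00\t50\t50\tS\t12\t    Species A2",
    ]
    counts = species_counts(report)
    print(counts)
    print(f"Shannon index: {shannon_index(counts.values()):.3f}")
```

A product dominated by the labeled species gives an index near zero, and the value rises as reads spread across additional species, which is what makes it a convenient summary when comparing contamination between years.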

Assessment

ELN Rubric (Microsoft Word 2007 (.docx) 13kB Jun5 19)
Lab Report Rubric (Microsoft Word 2007 (.docx) 22kB Jun5 19)

Instructional Staffing

One full-time faculty member teaches the Fall Genes and Genomes lecture, two sections of lab, and two spring sections of Bioinformatics lecture, which together make up approximately half of her teaching load. No other staff are involved in this CURE, but it should be possible to adapt it to other staffing models.

Author Experience

Jessica Kaufman, Endicott College

This CURE is designed to engage students in next-gen resequencing of a commercial microorganism and a simple variant calling workflow. Through this process, students develop skills they can apply to other types of library prep (RNA-Seq, ChIP-Seq) and learn new bioinformatics workflows beyond variant calling. The laboratory steps and bioinformatics analysis could be combined into a single course depending on program needs.


Advice for Implementation

Only high quality libraries should be pooled and a small flow cell should be used. Usually, if 30 students perform DNA extraction, there will be 1-3 high quality libraries. For the bioinformatics project to be successful, it is helpful to have at least 10X coverage of each probiotic genome.
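
The 10X coverage guideline can be checked with a back-of-the-envelope estimate: expected coverage is the number of reads times the read length, divided by the genome size. A minimal Python sketch follows. The genome size, read length, and read counts are hypothetical, and the estimate assumes every read aligns to the target genome, which overstates coverage when contaminant DNA is present.

```python
import math


def expected_coverage(n_reads: int, read_length_bp: int, genome_size_bp: int) -> float:
    """Average depth of coverage: total sequenced bases divided by genome size."""
    if genome_size_bp <= 0:
        raise ValueError("genome size must be positive")
    return n_reads * read_length_bp / genome_size_bp


def reads_needed(target_coverage: float, read_length_bp: int, genome_size_bp: int) -> int:
    """Smallest whole number of reads that reaches the target average coverage."""
    if read_length_bp <= 0:
        raise ValueError("read length must be positive")
    return math.ceil(target_coverage * genome_size_bp / read_length_bp)


if __name__ == "__main__":
    genome_bp = 2_000_000  # hypothetical bacterial genome size
    read_bp = 150          # hypothetical read length; count both mates of a pair
    print("coverage from 200,000 reads:", expected_coverage(200_000, read_bp, genome_bp))
    print("reads needed for 10X:", reads_needed(10, read_bp, genome_bp))
```

Because the result is an average, individual genes can still fall well below it, so a student's assigned gene may need a closer look even when the genome-wide figure clears 10X.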

I made extensive use of existing tutorials from usegalaxy.org and my own tutorials to show students how to run each application with an example data set prior to students rerunning the tool on their assigned region of the bacterial genome. I found that video instruction worked best for learning how to use these tools. Written, numbered steps did not work as well because a major roadblock to navigating the usegalaxy.org website is finding which portion of the page to interact with.

Galaxy Video Library

YouTube Playlist for Variant Calling Project

All lab observations were recorded in a shared Google Doc that served as an electronic lab notebook. Students opened the Electronic Lab Notebook Template and used File > Make a copy to get their own version of the document. Each week, students were responsible for timely entries to this document and analysis of the data collected in lab. I used Revision History in Google Docs to monitor when students made changes to the notebook, and 10% of the lab notebook grade was based on the timing of entries.

In Spring 2020, I surveyed students to gauge their impressions of this experience and summarized their responses for a presentation to the American Society for Engineering Education. The survey showed that students recognized elements of a CURE in both the wet-lab and in silico lab tasks. There was no significant difference between students who only participated in the bioinformatics part of the CURE and students who had taken the full-year sequence. One question I added to the student survey was "Was this project [bioinformatics analysis] similar to research in laboratory classes? How was it similar? How could this project better connect to open-ended research questions?" There were mixed responses to this question. Some students did not feel the computational work was similar. A typical comment was "This project was not similar to research in laboratory classes. It is unlike any class I have ever taken." Other students saw the connection between computational work in biology and lab work more clearly. A comment reflecting this view was: "I loved this project! Although it was definitely a steep learning curve at first, I was able to relate my previous lab report experiences to this kind of methodology. The scientific process was iterated in this project." After teaching bioinformatics for over a decade and seeing former students apply these skills after graduating, it is clear to me that computational biology research is an essential skill for our graduates. These data showed me that I need to do more to introduce the research that is possible with bioinformatics before starting the first project.

Iteration

Prior to library prep, students do a shorter lab using PCR for genotyping. In the library preparation process, students repeat many steps that are similar to PCR (End Prep, Adapter Ligation, and Indexing). At the end of the lab course, students prepare a formal lab report on the six-week activity. Similarly, students write up a lab report at the end of the bioinformatics analysis. The first lab report focuses on the quality of data collected. The second lab report focuses on the iterative process used to select applications and application settings to produce a consensus sequence for their assigned gene.

Using CURE Data

Students are encouraged to submit completed FASTA files for one gene in the resequenced genome to NCBI.

Example student submission from the 2019-2020 academic year: https://www.ncbi.nlm.nih.gov/nuccore/MT066060

Resources

** = Articles for Students

Library Prep and Sequencing 

1. Barney BT, Munkholm C, Walt DR, Palumbi SR. Highly localized divergence within supergenes in Atlantic cod (Gadus morhua) within the Gulf of Maine. BMC Genomics. 2017;18(1). **
2. Head, S. R., Komori, H. K., LaMere, S. A., Whisenant, T., Van Nieuwerburgh, F., Salomon, D. R., & Ordoukhanian, P. (2014). Library construction for next-generation sequencing: overviews and challenges. BioTechniques, 56(2), 61