Bioinformatics Resources and Tools
A database is a structured collection of records stored in a computer system. Genomic databases typically store DNA or protein sequences as well as annotated information about those sequences. Many databases also provide bioinformatics tools, such as BLAST, for finding specific sequences or annotations. There are hundreds of genomics databases: some are comprehensive but not carefully curated (GenBank), while others are carefully curated but narrow (FlyBase).
Bioinformatic tools are computer programs that analyze one or more sequences. There is a dizzying array of bioinformatic tools: some analyze sequences to find protein domains (Pfam), some search through databases of millions of sequences to find similar ones (BLAST), and some find potential protein-coding regions (ORF-Finder). Many are freely available over the web. It can be overwhelming to find and use bioinformatic tools because you need to know 1) what type of analysis you want performed, 2) what type of tool to use, and 3) where to find the tool.
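To make the idea of "finding potential protein-coding regions" concrete, here is a minimal sketch of an open reading frame (ORF) search. It is an illustration only and is not the algorithm that ORF-Finder uses: it assumes the standard stop codons, scans only the forward strand, and the function name `find_orfs` and the `min_codons` parameter are made up for this example.

```python
# Minimal ORF search sketch: forward strand only, standard stop codons assumed.
# This is an illustration, not the method used by ORF-Finder.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return (start, end, frame) for each ATG...stop ORF on the forward strand.

    start is 0-based; end is exclusive and includes the stop codon.
    min_codons counts codons from ATG up to, but not including, the stop.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Step through the sequence one codon at a time in this reading frame.
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon since the last stop
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs


if __name__ == "__main__":
    print(find_orfs("CCATGAAATTTGGGTAACC"))
```

Real tools also scan the reverse complement and let you choose a genetic code, which is why their option pages can look intimidating.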
For this class, we have made a collection of common databases and tools that should be useful for your research. But you should also feel free to use Google to find other databases and tools. If you want information on a particular species, you can search for databases that contain DNA sequences for that species (try searching for "Medicago truncatula genome"). If you want to identify possible functional domains contained within a set of sequences, you can search for annotation tools.
Approaching bioinformatic research can be daunting: there are many different bioinformatic tools, it is often difficult to figure out exactly what each one does, no two web interfaces are the same, and the number of options that you can set is overwhelming. You may ask why it needs to be so confusing.
Consider:
Most bioinformatic tools were created by researchers to do the job, not to be easy for others to use: The majority of bioinformatic tools were not designed by companies trying to make the tools easy for consumers; rather, they have been designed and written by researchers in labs all over the world for use in their own research. Their focus was on creating a good bioinformatic tool that would help them address their own research questions, not on making it intuitive for others to use.
Few standards have been adopted: The world of bioinformatics and genomics is still rapidly evolving, and the community of researchers has not settled on many standard ways of organizing data or bioinformatics interfaces. This is good because it allows for innovation. But it is difficult to start out as a genomicist, because you must acclimate yourself to a culture of having to just figure stuff out as you go along.
Most researchers adhere to an open source policy: Most researchers and bioinformatics organizations (NCBI, EMBL) believe in an open source policy, in which they freely make tools available to others, usually via their websites.
Tools/Databases evolve: Because it is such a rapidly developing field, the labs creating the tools and databases are constantly updating them, creating new versions, and changing their websites. You may visit a familiar bioinformatics website only to find that it has changed. This happens even with NCBI; recently they completely altered their BLAST interface.
There are many redundant tools: Because it is such a new field, many different people have created bioinformatic tools and databases that do very similar things. For example, Pfam and SMART are both databases of conserved protein domains, but they look very different. Check out this one website that lists dozens of tools, all for analyzing protein sequences: ExPASy tools
HOW YOU CAN DEAL WITH THESE ISSUES:
The proliferation of open source bioinformatics is wonderful because everyone has access to sequence data and tools to analyze it (for example, see just one list of tools at ). But it is challenging because every site has a different interface. Some sites are not intuitive to use. The best mentality for you to develop is a "spend time exploring, reading, and trying" mentality.
Rules for visiting bioinformatics websites:
- Try to generally determine the purpose of the database/tool (if the group has published a paper to publicize the tool, they will post it on their website; read their paper)
- Find the help page and read through it
- Find the "Sequence Submission" page, and read through it.
- PLAY AROUND and try things out. It may take several attempts to get it right.
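One practical way to follow these rules is to check your input before pasting it into a submission form. The sketch below assumes the site accepts FASTA-formatted text, which many sequence submission pages do, but always confirm on the help page. The function names `parse_fasta` and `unexpected_dna_symbols` and the allowed alphabet `ACGTN` are choices made for this example, not part of any particular tool.

```python
# Minimal sketch: parse FASTA text and flag characters a DNA tool may reject.
# Assumes FASTA input and a plain ACGTN alphabet; check each site's help page.

def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header line")
        else:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def unexpected_dna_symbols(sequence, allowed="ACGTN"):
    """Return the sorted list of characters that are not in the allowed alphabet."""
    return sorted(set(sequence) - set(allowed))


if __name__ == "__main__":
    example = ">seq1 example\nacgt\nNNAC\n>seq2\nACGU\n"
    for name, seq in parse_fasta(example):
        print(name, len(seq), unexpected_dna_symbols(seq))
```

A check like this can save one or two of the "several attempts" by catching problems such as pasting an RNA or protein sequence into a DNA-only form.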
Tools for getting started
Searching with BLAST
BLAST known sequences against the Chamaecrista transcriptome
JMP Genomics Tools
JMP Genomics is a statistical software package that lets you look for gene expression patterns. The two links that follow provide the directions and data needed to visualize patterns in gene expression among the different Chamaecrista tissue types. Work through this exercise before you go on to plan your own strategy for working with the expression data. The software is available for all faculty and students in the Biology Department. It is currently on the machines in CMC 109 and the Biology Computer Lab. As a student in the class, you can download your own copy from the COLLAB server folder (Departments/Biology) onto a Windows machine.
JMP Genomics Gene Expression Exercise (Microsoft Word 2007 (.docx) 38kB Feb7 11)
Data File for JMP Genomics Gene Expression Exercise (Text File 5.5MB Jan9 11)
Paper on Analyzing Gene Expression (Acrobat (PDF) 4.7MB Jan11 10): the figures may be helpful as you reflect on the results of your JMP analysis.
Links that will help you relate genes to the biology of organisms:
Annotation tools:
Pfam is a database of evolutionarily conserved protein families, and annotations about the functions of those families.
KAAS (KEGG Automatic Annotation Server) provides functional annotation of genes by BLAST comparisons against the manually curated KEGG GENES database and mapping them onto known biochemical/metabolic pathways.
Alignment, Phylogeny, and Evolutionary Analysis Tools:
Clustal will align multiple DNA or protein sequences. It is available on many different sites; here we provide a link to EMBL's ClustalW server.
CIPRES is a portal for the inference of large phylogenetic trees.
MEGA is a program for constructing alignments and phylogenetic trees. Important: this software does NOT run from the web. You must download it onto a PC and run it locally. The Biology computer lab computers have this software installed on them.
Primer Design Tools
Primer 3 is an excellent tool for designing primers. You will find this helpful in your functional analysis.
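Before you submit candidate primers to a design tool, it can help to understand two properties such tools report: GC content and melting temperature (Tm). The sketch below uses the simple Wallace rule, Tm = 2(A+T) + 4(G+C) in degrees Celsius, which is only a rough estimate for short primers. It is not the calculation Primer 3 performs, and the function names are invented for this example.

```python
# Rough primer property sketch using the Wallace rule.
# This is a quick estimate for short oligos, not what Primer 3 computes.

def gc_content(primer):
    """Return the percentage of G and C bases in the primer."""
    primer = primer.upper()
    return 100.0 * (primer.count("G") + primer.count("C")) / len(primer)


def wallace_tm(primer):
    """Estimate melting temperature (deg C) as 2*(A+T) + 4*(G+C)."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


def reverse_complement(dna):
    """Return the reverse complement, useful when writing a reverse primer."""
    pairs = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(pairs[base] for base in reversed(dna.upper()))


if __name__ == "__main__":
    primer = "ATGCATGCATGCATGCATGC"
    print(gc_content(primer), wallace_tm(primer), reverse_complement("ATGC"))
```

If the numbers from a sketch like this differ from what the design tool reports, trust the tool; it takes more factors into account.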
Sequence Analysis Tools
There are a number of tools available to analyze sequence data, including 4Peaks, Student Interface to Biology Workbench, and LaserGene. Click here (Microsoft Word 2007 (.docx) 129kB Feb5 10) for detailed instructions on how these programs can be helpful to you.


