BCB 590
Lab 3
Database Searching with BLAST
Instructions:
Objectives
1. Use BLAST programs to retrieve similar sequences from a database
2. Understand the significance of the BLAST results
3. Use PSI-BLAST to detect distantly related sequences
Introduction
This lab gives you practice running BLAST searches and interpreting the results. You will find similar sequences in a database and assess whether the sequence "hits" you find are relevant. If you are a biologist, you have likely already used many of the programs introduced in this lab without knowing exactly what they are doing.
Exercises
Our first task is to read through the BLAST program selection guide. Even if you have used BLAST before, you will probably learn something new, because the guide contains a lot of good information about the available options. Once you know which BLAST program to use and which database to search against, you need to know how to analyze your search results. Read the Statistics of Sequence Similarity Scores page to get an idea of what all the numbers and options mean.
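To make the numbers on the statistics page more concrete, here is a minimal Python sketch of the relationship between a normalized bit score S' and the E-value, E = m * n * 2^(-S'), where m is the query length and n is the total database length. The function names and the example lengths are illustrative assumptions, not part of any NCBI tool, and the sketch ignores the length corrections that BLAST itself applies.

```python
import math


def expected_hits(bit_score, query_len, db_len):
    """E-value for a normalized bit score: E = m * n * 2^(-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)


def p_value(e_value):
    """Probability of seeing at least one chance hit: P = 1 - exp(-E)."""
    return 1.0 - math.exp(-e_value)


if __name__ == "__main__":
    # Hypothetical search: 450-letter query against a 1e9-letter database.
    for score in (30.0, 40.0, 50.0):
        e = expected_hits(score, query_len=450, db_len=1_000_000_000)
        print(f"bit score {score:5.1f}  E = {e:.3g}  P = {p_value(e):.3g}")
```

Running it shows how quickly the E-value falls as the bit score rises, and how it grows with the size of the database being searched.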
1. What does an E-value of 2 mean?
Note: This portion of the exercise and the questions related to it come from an exercise by Susan Cates: Cates, S. (2007, September 4). Introduction to Sequence Alignment. Retrieved from the Connexions Web site: http://cnx.org/content/m11026/2.12/
We will begin our BLAST searches with the following nucleotide sequence:
TCAAGCAGATCACTGTCCTTCTGCCATGGCCCTGT
GGATGCGCCTCCTGCCCCTGCTGGCGCTGCTGGCC
CTCTGGGGACCTGACCCAGCCGCAGCCTTTGTGAA
CCAACACCTGTGCGGCTCACACCTGGTGGAAGCTC
TCTACCTAGTGTGCGGGGAACGAGGCTTCTTCTAC
ACACCCAAGACCCGCCGGGAGGCAGAGGACCTGCA
GGTGGGGCAGGTGGAGCTGGGCGGGGGCCCTGGTG
CAGGCAGCCTGCAGCCCTTGGCCCTGGAGGGGTCC
CTGCAGAAGCGTGGCATTGTGGAACAATGCTGTAC
CAGCATCTGCTCCCTCTACCAGCTGGAGAACTACT
GCAACTAGACGCAGCCCGCATGCAGNCCCCCACCC
GCCGNCTTCTGCACCGAGAGAGATGGAATTAAACC
CTTGAACCCAGCANANAAAAAAAAGAAATAAAA
Go to the NCBI BLAST page and select nucleotide blast. Enter the sequence in the text box at the top of the page. Then change the database to search against to the “Nucleotide collection (nr/nt)” database. Choose the option for “Highly similar sequences (megablast).” Our search is now set up, but just for fun, click on the “Algorithm parameters” link at the bottom of the page. A new set of options appears where we can specify more search parameters. We will leave them as they are for our first search. Remember, the best way to start is with the simplest search, because we are likely to find what we are looking for anyway. Click on BLAST to begin your search.
2. How long is the top alignment? What is the E-value? What is the percent identity?
3. What organism is the most common source of the sequences in the first 5 hits?
4. What protein is most commonly identified in the description column of the alignments?
Return to the NCBI BLAST page and select blastx to search for protein hits to the six-frame translation of our nucleotide sequence. Paste in the same sequence as above, choose the “Non-redundant protein sequences (nr)” database, and click on BLAST to start the search.
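The translation that blastx searches with can be sketched in a few lines of Python. This is only an illustration of the idea, assuming the standard genetic code and writing any codon that contains an ambiguous base (such as N) as X; the function names are made up for this example and it is not how BLAST itself is implemented.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(seq):
    """Return a dict mapping frame names (+1..+3, -1..-3) to protein strings."""
    seq = "".join(seq.split()).upper()  # drop line breaks and spaces
    frames = {}
    for sign, strand in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{sign}{offset + 1}"] = translate(strand[offset:])
    return frames


if __name__ == "__main__":
    # First line of the lab's nucleotide query, used here as a short example.
    query = "TCAAGCAGATCACTGTCCTTCTGCCATGGCCCTGT"
    for frame, protein in six_frame_translation(query).items():
        print(frame, protein)
```

Each of the six protein strings is a candidate reading of the query, which is why blastx reports the frame of every alignment it finds.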
5. What protein is listed in the vast majority of the returned matches?
Look at the first alignment, which should have 100% identity between the aligned subsequences. Note that a particular alignment may have more than one sequence associated with it. Therefore, you must look at the actual alignments, not the list of scores, to answer the following questions.
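The "Identities" figure reported with each alignment is the number of identical aligned positions divided by the alignment length. A minimal sketch of that calculation, assuming two already-aligned strings of equal length with "-" for gaps (the function name and the toy alignment are hypothetical):

```python
def percent_identity(aligned_query, aligned_subject):
    """Return (identical positions, alignment length, percent identity)."""
    if len(aligned_query) != len(aligned_subject) or not aligned_query:
        raise ValueError("aligned sequences must be non-empty and equal in length")
    identical = sum(
        1
        for q, s in zip(aligned_query, aligned_subject)
        if q == s and q != "-"  # a gap never counts as a match
    )
    length = len(aligned_query)  # gap columns count toward the length
    return identical, length, 100.0 * identical / length


if __name__ == "__main__":
    # Toy alignment, not taken from any real BLAST output.
    identical, length, pct = percent_identity("MKV-LA", "MKVTLG")
    print(f"Identities = {identical}/{length} ({pct:.0f}%)")
```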
6. How many different species are listed as sources where the aligned subsequence has a ratio of identical amino acids to total aligned amino acids of 86/86, that is, 100% identity?
In lab 2 we had a mystery protein sequence that we submitted to PROSITE and SuperFam to find conserved motifs/domains. Today, we will use BLAST to see if we can find out exactly what this protein is. Here is the mystery protein sequence:
>unknown protein
MAAAKAEMQLMSPLQISDPFGSFPHSPTMDNYPKLEEMMLLSNGAPQFLGAAGTPEGSGGNSSSSTSSGGGGGGGSNSGSSAFNPQGEPSEQPYEHLTTESFSDIALNNEKAMVETSYPSQTTRLPPITYTGRFSLEPAPNSGNTLWPEPLFSLVSGLVSMTNPPTSSSSAPSPAASSSSSASQSPPLSCAVPSNDSSPIYSAAPTFPTPNTDIFPEPQSQAFPGSAGTALQYPPPAYPATKGGFQVPMIPDYLFPQQQGDLSLGTPDQKPFQGLENRTQQPSLTPLSTIKAFATQSGSQDLKALNTTYQSQLIKPSRMRKYPNRPSKTPPHERPYACPVESCDRRFSRSDELTRHIRIHTGQKPFQCRICMRNFSRSDHLTTHIRTHTGEKPFACDICGRKFARSDERKRHTKIHLRQKDKKADKSVVASPAASSLSSYPSPVATSYPSPATTSFPSPVPTSYSSPGSSTYPSPAHSGFPSPSVATTFASVPPAFPTQVSSFPSAGVSSSFSTSTGLSDMTATFSPRTIEIC
Return to the NCBI BLAST page and select protein blast. Paste in our mystery sequence, choose the “Non-redundant protein sequences (nr)” database, and select the blastp program. Click on “Algorithm parameters” and change the matrix to BLOSUM80, because I expect us to find a very similar sequence in the database (remember, higher BLOSUM numbers are suited to finding highly similar sequences). Click on BLAST to start the search.
7. What is our mystery protein?
8. Many of the hits have an E-value of 0.0. How many different species do these hits come from?
One of the problems with BLAST these days is that it is just too darn good. The databases contain so many sequences that your BLAST results are often just a huge collection of identical, or nearly identical, sequences. This exercise has been designed to challenge BLAST with a difficult problem: we will try to find a bacterial match for the following nucleotide sequence:
>gi|76828014|gb|BC107078.1| Homo sapiens G protein-coupled receptor, family C, group 5, member D, mRNA (cDNA clone MGC:129714 IMAGE:40027066), complete cds
ATGTACAAGGACTGCATCGAGTCCACTGGAGACTATTTTCTTCTCTGTGACGCCGAGGGGCCATGGGGCA
TCATTCTGGAGTCCCTGGCCATACTTGGCATCGTGGTCACAATTCTGCTACTCTTAGCATTTCTCTTCCT
CATGCGAAAGATCCAAGACTGCAGCCAGTGGAATGTCCTCCCCACCCAGCTCCTCTTCCTCCTGAGTGTC
CTGGGGCTCTTCGGACTCGCTTTTGCCTTCATCATCGAGCTCAATCAACAAACTGCCCCCGTACGCTACT
TTCTCTTTGGGGTTCTCTTTGCTCTCTGTTTCTCATGCCTCTTAGCTCATGCCTCCAATCTAGTGAAGCT
GGTTCGGGGTTGTGTCTCCTTCTCCTGGACGACAATTCTGTGCATTGCTATTGGTTGCAGTCTGTTGCAA
ATCATTATTGCCACTGAGTATGTGACTCTCATCATGACCAGAGGTATGATGTTTGTGAATATGACACCCT
GCCAGCTCAATGTGGACTTTGTTGTACTCCTGGTCTATGTCCTCTTCCTGATGGCCCTCACATTCTTCGT
CTCCAAAGCCACCTTCTGTGGCCCGTGTGAGAACTGGAAGCAGCATGGAAGGCTCATCTTTATCACTGTG
CTCTTCTCCATCATCATCTGGGTGGTGTGGATCTCCATGCTCCTGAGAGGCAACCCGCAGTTCCAGCGAC
AGCCCCAGTGGGACGACCCGGTCGTCTGCATTGCTCTGGTCACCAACGCATGGGTTTTCCTGCTGCTGTA
CATCGTCCCTGAGCTCTGCATTCTCTACAGATCGTGTAGACAGGAGTGCCCTTTACAAGGCAATGCCTGC
CCCGTCACAGCCTACCAACACAGCTTCCAAGTGGAGAACCAGGAGCTCTCCAGAGCCCGAGACAGTGATG
GAGCTGAGGAGGATGTAGCATTAACTTCATATGGTACTCCCATTCAGCCGCAGACTGTTGATCCCACACA
AGAGTGTTTCATCCCACAGGCTAAACTAAGCCCCCAGCAAGATGCAGGAGGAGTATAA
Go to the NCBI BLAST page and choose nucleotide blast. Paste the sequence into the search text box. In the Database section, click the radio button for Others and select the Nucleotide collection (nr/nt) database. In the Organism box, type in Bacteria. Under Program Selection, use megablast. Then click on BLAST to start your search.
4. How many hits did you get? How many of them are significant, with an E-value below 0.1?
Let’s try using discontiguous megablast this time and see if the results change. Enter all of the same parameters as in the first search, but click on the radio button for discontiguous megablast, then click BLAST to run the search.
5. How many hits did you get? How many of them are significant, with an E-value below 0.1?
For our third search, let’s try blastn. Use the same parameters again, but click on the radio button for blastn, then click BLAST to run the search.
6. How many hits did you get? How many of them are significant, with an E-value below 0.1?
Results table
| BLAST flavor | Number of hits | Number of hits with E-value < 0.1 |
| --- | --- | --- |
| megablast | | |
| discontiguous megablast | | |
| blastn | | |
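If a search returns too many hits to count by eye, the two numbers for the results table can be tallied from a downloaded hit table. The sketch below assumes the standard 12-column tab-separated layout (query id, subject id, percent identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, E-value, bit score) with comment lines starting with "#". The function name and the example rows are invented for illustration. Because such a table has one row per alignment, the sketch counts distinct subject sequences.

```python
import io


def summarize_hit_table(handle, threshold=0.1):
    """Return (number of distinct subjects, number with any E-value < threshold)."""
    all_subjects = set()
    significant = set()
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split("\t")
        subject_id = fields[1]      # column 2: subject sequence id
        evalue = float(fields[10])  # column 11: E-value
        all_subjects.add(subject_id)
        if evalue < threshold:
            significant.add(subject_id)
    return len(all_subjects), len(significant)


if __name__ == "__main__":
    # Hypothetical rows in the assumed layout, not real search output.
    example = (
        "# Fields: query id, subject id, ...\n"
        "q1\tsubjA\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-20\t150\n"
        "q1\tsubjA\t80.0\t50\t10\t0\t1\t50\t1\t50\t0.5\t30\n"
        "q1\tsubjB\t70.0\t40\t12\t0\t1\t40\t1\t40\t3.2\t25\n"
    )
    hits, sig = summarize_hit_table(io.StringIO(example))
    print(f"hits: {hits}, hits with E-value < 0.1: {sig}")
```

To use it on a saved file, pass an open file object instead of the `io.StringIO` example.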
7. What explanation can you give for the different results from using megablast, discontiguous megablast, and blastn?
Here is the protein sequence that corresponds to the nucleotide sequence we have been using:
>gi|76828015|gb|AAI07079.1| G protein-coupled receptor, family C, group 5, member D [Homo sapiens]
MYKDCIESTGDYFLLCDAEGPWGIILESLAILGIVVTILLLLAFLFLMRKIQDCSQWNVLPTQLLFLLSV
LGLFGLAFAFIIELNQQTAPVRYFLFGVLFALCFSCLLAHASNLVKLVRGCVSFSWTTILCIAIGCSLLQ
IIIATEYVTLIMTRGMMFVNMTPCQLNVDFVVLLVYVLFLMALTFFVSKATFCGPCENWKQHGRLIFITV
LFSIIIWVVWISMLLRGNPQFQRQPQWDDPVVCIALVTNAWVFLLLYIVPELCILYRSCRQECPLQGNAC
PVTAYQHSFQVENQELSRARDSDGAEEDVALTSYGTPIQPQTVDPTQECFIPQAKLSPQQDAGGV
Use this sequence to run a PSI-BLAST search. Run the first round of PSI-BLAST with these parameters: limit the search to bacteria, and use a word size of 2, the BLOSUM45 matrix, and gap costs of existence 10 and extension 3. Then click BLAST.
8. How many hits did you get? How many of them are significant, with an E-value below 0.1?
Select the checkboxes next to all of the sequences with an E-value of less than 2 and click on Run PSI-BLAST Iteration 2.
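The sequences selected for the next iteration are used to build a position-specific scoring matrix, which replaces the general substitution matrix in the following round. The Python sketch below shows the idea in a heavily simplified form. It assumes an ungapped toy alignment, a uniform background frequency of 1/20, and a fixed pseudocount. Real PSI-BLAST uses sequence weighting and more careful pseudocounts, so treat every name and number here as an assumption made for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_log_odds(column, pseudocount=1.0):
    """Log2-odds score of each amino acid in one ungapped alignment column."""
    counts = Counter(column)
    total = len(column) + pseudocount * len(AMINO_ACIDS)
    background = 1.0 / len(AMINO_ACIDS)  # simplifying assumption: uniform
    return {
        aa: math.log2(((counts.get(aa, 0) + pseudocount) / total) / background)
        for aa in AMINO_ACIDS
    }


def build_profile(aligned_sequences):
    """Return one score dictionary per alignment column."""
    return [column_log_odds("".join(col)) for col in zip(*aligned_sequences)]


if __name__ == "__main__":
    # Toy ungapped alignment of three equal-length sequences.
    profile = build_profile(["MKV", "MRV", "MKI"])
    for position, scores in enumerate(profile, start=1):
        best = max(scores, key=scores.get)
        print(f"column {position}: best residue {best} ({scores[best]:.2f} bits)")
```

Residues seen often in a column score above zero and unseen residues score below zero, which is how a profile can reward conserved positions more strongly than a single general matrix can.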
9. How many hits did you get? How many of them are significant, with an E-value below 0.1?
10. What types of proteins do you get from the second PSI-BLAST iteration? Do you believe that our query sequence is related to the results we found? Why or why not?