BCB 590

Lab 3

Database Searching with BLAST

 

Instructions:

  1. Complete the lab exercise
  2. Answer the questions in red
  3. Email your answers to terrible@iastate.edu

 

Objectives

1. Use BLAST programs to retrieve similar sequences from a database

2. Understand the significance of the BLAST results

3. Use PSI-BLAST to detect distantly related sequences

 

Introduction

 

This lab gives you practice running BLAST searches and interpreting their results.  You will find similar sequences in a database and assess whether the sequence "hits" are relevant.  If you are a biologist, you have likely already used many of the programs introduced in this lab without knowing exactly what they are doing.

 

Exercises

 

Our first task is to read through the BLAST program selection guide.  Even if you have used BLAST before, you will probably learn something new, because the guide contains a lot of good information about the available options.  Once you know which BLAST program to use and which database to search against, you need to know how to analyze your search results.  Read the Statistics of Sequence Similarity Scores page to get an idea of what the numbers and options mean.
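
As a rough illustration of the statistics described on that page, the sketch below uses the commonly cited relation E = m x n x 2^(-S'), where S' is the bit score, m is the query length, and n is the total length of the database.  The function name and the example numbers are made up for illustration, and real BLAST reports also apply corrections (such as effective lengths) that are ignored here.

```python
def expect_value(bit_score, query_length, database_length):
    """Approximate E-value from a bit score: E = m * n * 2**(-S')."""
    return query_length * database_length * 2.0 ** (-bit_score)


if __name__ == "__main__":
    m = 453            # hypothetical query length (residues)
    n = 1_000_000_000  # hypothetical database size (residues)
    for bits in (30, 40, 50, 60):
        print(f"bit score {bits}: E ~ {expect_value(bits, m, n):.3g}")
```

In this simplified form, raising the bit score by one halves E, and doubling the database size doubles it.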

 

1. What does an E-value of 2 mean?

 

Note:  This portion of the exercise and the questions related to it come from an exercise by Susan Cates: Cates, S. (2007, September 4). Introduction to Sequence Alignment. Retrieved from the Connexions Web site: http://cnx.org/content/m11026/2.12/

 

We will begin our BLAST searches with the following nucleotide sequence:

 

TCAAGCAGATCACTGTCCTTCTGCCATGGCCCTGT

GGATGCGCCTCCTGCCCCTGCTGGCGCTGCTGGCC

CTCTGGGGACCTGACCCAGCCGCAGCCTTTGTGAA

CCAACACCTGTGCGGCTCACACCTGGTGGAAGCTC

TCTACCTAGTGTGCGGGGAACGAGGCTTCTTCTAC

ACACCCAAGACCCGCCGGGAGGCAGAGGACCTGCA

GGTGGGGCAGGTGGAGCTGGGCGGGGGCCCTGGTG

CAGGCAGCCTGCAGCCCTTGGCCCTGGAGGGGTCC

CTGCAGAAGCGTGGCATTGTGGAACAATGCTGTAC

CAGCATCTGCTCCCTCTACCAGCTGGAGAACTACT

GCAACTAGACGCAGCCCGCATGCAGNCCCCCACCC

GCCGNCTTCTGCACCGAGAGAGATGGAATTAAACC

CTTGAACCCAGCANANAAAAAAAAGAAATAAAA

 

 

 

Go to the NCBI BLAST page and select nucleotide blast.  Enter the sequence in the text box at the top of the page.  Next, set the database to “Nucleotide collection (nr/nt)”.  Choose the option for “Highly similar sequences (megablast).”  Our search is now set up, but just for fun, click on the “Algorithm parameters” link at the bottom of the page.  A new set of options appears where we can specify more search parameters.  We will leave them as they are for our first search.  Remember, the best way to start is with the simplest search, because we are likely to find what we are looking for anyway.  Click on BLAST to begin your search.

 

2. How long is the top alignment?  What is the E-value?  What is the percent identity?

3. What organism is the most common source of the sequences in the first 5 hits?

4. What protein is most commonly identified in the description column of the alignments?

 

Return to the NCBI BLAST page and select blastx to search for protein hits to the six-frame translation of our nucleotide sequence.  Paste in the same sequence as above, choose the “Non-redundant protein sequences (nr)” database, and click on BLAST to start the search.
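
To make the idea of a six-frame translation concrete, here is a minimal Python sketch of what blastx does conceptually before comparing against protein sequences: it translates the query in three reading frames on each strand.  The sketch assumes the standard genetic code and writes any codon containing an ambiguous base (such as N) as X.  The function names are illustrative, and this is not BLAST's own code.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def translate(seq):
    """Translate complete codons; unknown or ambiguous codons become X."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frames(seq):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to translations."""
    seq = "".join(seq.split()).upper()
    frames = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(s[offset:])
    return frames


if __name__ == "__main__":
    for label, protein in six_frames("TCAAGCAGATCACTGTCCTTCTGCCATGGCCCTGT").items():
        print(label, protein)
```

Stop codons appear as * in the output, so a frame with a long stretch free of * is a candidate open reading frame.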

 

5. What protein is listed in the vast majority of the returned matches?

 

Look at the first alignment, which should have 100% identity between the aligned subsequences.  Note that a particular alignment may have more than one sequence associated with it.  Therefore, you must look at the actual alignments, not the list of scores, to answer the following questions.
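
The "Identities" line shown with each alignment can be reproduced by hand.  The sketch below is an illustration with made-up sequences, not BLAST output: it counts the alignment columns in which both sequences carry the same residue and divides by the alignment length, gap columns included.

```python
def identities(aligned_query, aligned_subject):
    """Return (matches, alignment_length, percent_identity) for two aligned strings."""
    if len(aligned_query) != len(aligned_subject) or not aligned_query:
        raise ValueError("aligned sequences must be non-empty and of equal length")
    length = len(aligned_query)
    # A column counts as an identity only if both residues match and are not gaps.
    matches = sum(1 for q, s in zip(aligned_query, aligned_subject)
                  if q == s and q != "-")
    return matches, length, 100.0 * matches / length


if __name__ == "__main__":
    # Hypothetical aligned fragments; "-" marks a gap.
    print(identities("MKT-A", "MKTLA"))
```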

 

6. How many different species are listed as sources for which the aligned subsequence shows 86/86 identical amino acids (100% identity)?

 

In lab 2 we had a mystery protein sequence that we submitted to PROSITE and SuperFam to find conserved motifs/domains.  Today, we will use BLAST to see if we can find out exactly what this protein is.  Here is the mystery protein sequence:

 

>unknown protein
MAAAKAEMQLMSPLQISDPFGSFPHSPTMDNYPKLEEMMLLSNGAPQFLGAAGTPEGSGGNSSSSTSSGG
GGGGGSNSGSSAFNPQGEPSEQPYEHLTTESFSDIALNNEKAMVETSYPSQTTRLPPITYTGRFSLEPAP
NSGNTLWPEPLFSLVSGLVSMTNPPTSSSSAPSPAASSSSSASQSPPLSCAVPSNDSSPIYSAAPTFPTP
NTDIFPEPQSQAFPGSAGTALQYPPPAYPATKGGFQVPMIPDYLFPQQQGDLSLGTPDQKPFQGLENRTQ
QPSLTPLSTIKAFATQSGSQDLKALNTTYQSQLIKPSRMRKYPNRPSKTPPHERPYACPVESCDRRFSRS
DELTRHIRIHTGQKPFQCRICMRNFSRSDHLTTHIRTHTGEKPFACDICGRKFARSDERKRHTKIHLRQK
DKKADKSVVASPAASSLSSYPSPVATSYPSPATTSFPSPVPTSYSSPGSSTYPSPAHSGFPSPSVATTFA
SVPPAFPTQVSSFPSAGVSSSFSTSTGLSDMTATFSPRTIEIC

 

Return to the NCBI BLAST page and select protein blast.  Paste in our mystery sequence, choose the “Non-redundant protein sequences (nr)” database, and select the blastp program.  Click on “Algorithm parameters” and change the matrix to BLOSUM80 because I expect us to find a very similar sequence in the database (remember, higher BLOSUM numbers mean we expect/want to find highly similar sequences).  Click on BLAST to start the search.

 

7. What is our mystery protein?

8. Many of the hits have an E-value of 0.0.  How many different species do these hits come from?

 

One of the problems with BLAST these days is that it is just too darn good.  The databases contain so many sequences that your BLAST results are often just a huge collection of identical, or nearly identical, sequences.  This exercise has been designed to challenge BLAST with a difficult problem: we will try to find a bacterial match for the following nucleotide sequence:

 

>gi|76828014|gb|BC107078.1| Homo sapiens G protein-coupled receptor, family C, group 5, member D, mRNA (cDNA clone MGC:129714 IMAGE:40027066), complete cds

ATGTACAAGGACTGCATCGAGTCCACTGGAGACTATTTTCTTCTCTGTGACGCCGAGGGGCCATGGGGCA

TCATTCTGGAGTCCCTGGCCATACTTGGCATCGTGGTCACAATTCTGCTACTCTTAGCATTTCTCTTCCT

CATGCGAAAGATCCAAGACTGCAGCCAGTGGAATGTCCTCCCCACCCAGCTCCTCTTCCTCCTGAGTGTC

CTGGGGCTCTTCGGACTCGCTTTTGCCTTCATCATCGAGCTCAATCAACAAACTGCCCCCGTACGCTACT

TTCTCTTTGGGGTTCTCTTTGCTCTCTGTTTCTCATGCCTCTTAGCTCATGCCTCCAATCTAGTGAAGCT

GGTTCGGGGTTGTGTCTCCTTCTCCTGGACGACAATTCTGTGCATTGCTATTGGTTGCAGTCTGTTGCAA

ATCATTATTGCCACTGAGTATGTGACTCTCATCATGACCAGAGGTATGATGTTTGTGAATATGACACCCT

GCCAGCTCAATGTGGACTTTGTTGTACTCCTGGTCTATGTCCTCTTCCTGATGGCCCTCACATTCTTCGT

CTCCAAAGCCACCTTCTGTGGCCCGTGTGAGAACTGGAAGCAGCATGGAAGGCTCATCTTTATCACTGTG

CTCTTCTCCATCATCATCTGGGTGGTGTGGATCTCCATGCTCCTGAGAGGCAACCCGCAGTTCCAGCGAC

AGCCCCAGTGGGACGACCCGGTCGTCTGCATTGCTCTGGTCACCAACGCATGGGTTTTCCTGCTGCTGTA

CATCGTCCCTGAGCTCTGCATTCTCTACAGATCGTGTAGACAGGAGTGCCCTTTACAAGGCAATGCCTGC

CCCGTCACAGCCTACCAACACAGCTTCCAAGTGGAGAACCAGGAGCTCTCCAGAGCCCGAGACAGTGATG

GAGCTGAGGAGGATGTAGCATTAACTTCATATGGTACTCCCATTCAGCCGCAGACTGTTGATCCCACACA

AGAGTGTTTCATCCCACAGGCTAAACTAAGCCCCCAGCAAGATGCAGGAGGAGTATAA

 

Go to the NCBI BLAST page and choose nucleotide blast.  Paste the sequence into the search text box.  In the Database section, click the radio button for Others and select the Nucleotide collection (nr/nt) database.  In the Organism box, type in Bacteria.  Under Program Selection, use megablast.  Then click on BLAST to start your search.

 

4. How many hits did you get?   How many of them are significant, with an E-value below 0.1?

 

Let’s try using discontiguous megablast this time and see if the results change.  Enter all of the same parameters as in the first search, but click on the radio button for discontiguous megablast, then click BLAST to run the search.

 

5. How many hits did you get?   How many of them are significant, with an E-value below 0.1?

 

For our third search, let’s try blastn.  Same parameters again, just click on the radio button for blastn and click BLAST to run the search.
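
One difference between these three programs is the length of the exact-match "word" used to seed an alignment.  As commonly documented, megablast defaults to a word size of 28 and blastn to 11, while the third megablast variant uses a template that tolerates mismatches inside the word.  The toy sketch below is not BLAST's algorithm, and the helper names and the artificially mutated copy are made up for illustration.  It simply counts how many query words of a given size also occur in a subject sequence.

```python
def shared_words(query, subject, word_size):
    """Count query positions whose word of length word_size occurs in subject."""
    subject_words = {subject[i:i + word_size]
                     for i in range(len(subject) - word_size + 1)}
    return sum(1 for i in range(len(query) - word_size + 1)
               if query[i:i + word_size] in subject_words)


def mutate_every(seq, step):
    """Return a copy of seq with every step-th base swapped (A<->C, G<->T)."""
    swap = {"A": "C", "C": "A", "G": "T", "T": "G"}
    return "".join(swap.get(b, b) if (i + 1) % step == 0 else b
                   for i, b in enumerate(seq))


if __name__ == "__main__":
    # First line of the query used in this exercise.
    query = "ATGTACAAGGACTGCATCGAGTCCACTGGAGACTATTTTCTTCTCTGTGACGCCGAGGGGCCATGGGGCA"
    diverged = mutate_every(query, 12)  # artificial divergence, one change per 12 bases
    for word_size in (28, 11):
        print(word_size, shared_words(query, diverged, word_size))
```

Try different values of the mutation step and the word size to see when the seeds disappear.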

 

6. How many hits did you get?   How many of them are significant, with an E-value below 0.1?

 

Results table

BLAST flavor               Number of hits    Number of hits with E-value < 0.1
megablast                  ______________    ______________
discontiguous megablast    ______________    ______________
blastn                     ______________    ______________

 

 

7. What explanation can you give for the different results from using megablast, discontiguous megablast, and blastn?

 

Here is the protein sequence that corresponds to the nucleotide sequence we have been using:

 

>gi|76828015|gb|AAI07079.1| G protein-coupled receptor, family C, group 5, member D [Homo sapiens]

MYKDCIESTGDYFLLCDAEGPWGIILESLAILGIVVTILLLLAFLFLMRKIQDCSQWNVLPTQLLFLLSV

LGLFGLAFAFIIELNQQTAPVRYFLFGVLFALCFSCLLAHASNLVKLVRGCVSFSWTTILCIAIGCSLLQ

IIIATEYVTLIMTRGMMFVNMTPCQLNVDFVVLLVYVLFLMALTFFVSKATFCGPCENWKQHGRLIFITV

LFSIIIWVVWISMLLRGNPQFQRQPQWDDPVVCIALVTNAWVFLLLYIVPELCILYRSCRQECPLQGNAC

PVTAYQHSFQVENQELSRARDSDGAEEDVALTSYGTPIQPQTVDPTQECFIPQAKLSPQQDAGGV

 

Use this sequence to run a PSI-BLAST search.  Run the first round of PSI-BLAST with these parameters: limit the search to bacteria, and use a word size of 2, the BLOSUM45 matrix, a gap existence cost of 10, and a gap extension cost of 3.  Then click BLAST.

 

8. How many hits did you get?   How many of them are significant, with an E-value below 0.1?

 

Select the checkboxes next to all of the sequences with an E-value of less than 2 and click on Run PSI-BLAST Iteration 2.
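
Between iterations, PSI-BLAST turns the selected hits into a position-specific scoring matrix (PSSM) and searches again with that matrix instead of the original query.  The sketch below is a heavily simplified illustration of the idea, assuming ungapped aligned fragments, a uniform background frequency of 1/20, and a flat pseudocount.  PSI-BLAST's real sequence weighting and pseudocount scheme are more elaborate, and the function names and the example fragments are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned_sequences, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column."""
    background = 1.0 / len(ALPHABET)  # assumption: uniform background
    pssm = []
    for column in zip(*aligned_sequences):
        counts = Counter(c for c in column if c in ALPHABET)
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        pssm.append({aa: math.log2(((counts[aa] + pseudocount) / total) / background)
                     for aa in ALPHABET})
    return pssm


def score_segment(pssm, segment):
    """Sum the column scores for a segment aligned to the start of the PSSM."""
    return sum(column.get(aa, 0.0) for column, aa in zip(pssm, segment))


if __name__ == "__main__":
    hits = ["CPVE", "CPIE", "CAVE"]  # hypothetical aligned fragments
    pssm = build_pssm(hits)
    for candidate in ("CPVE", "WWWW"):
        print(candidate, round(score_segment(pssm, candidate), 2))
```

Residues that are conserved in a column earn positive scores, so later iterations can reward a distant relative for matching the family's conserved positions even when it resembles the original query only weakly.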

 

9. How many hits did you get?  How many of them are significant, with an E-value below 0.1? 

 

10. What types of proteins do you get from the second PSI-BLAST iteration?  Do you believe that our query sequence is related to the results we found?   Why or why not?