BCB 590
Lab 2
Making Sense of DNA and Protein Sequences
Instructions:
Introduction
In this lab you will be using a variety of sequence-based approaches to make sense of DNA and protein sequences. We will start out by refreshing our memories about how to retrieve sequences from NCBI, and see some new search tricks along the way.
The second portion of the lab starts with a DNA sequence: we will scan it for open reading frames and use a Hidden Markov Model to predict genes. Once we know where the genes are, we would like to know something about how they are regulated. Gene regulation also depends largely on signals in the DNA sequence. Differential regulation of genes is essential to cell differentiation and response. This regulation is achieved partly via transcription factors, which often recognize DNA motifs upstream of genes and initiate transcription of groups of genes. Understanding gene regulation is essential to understanding genetic diseases such as cancer and is a focus of the systems biology branch of bioinformatics. We will analyze the upstream sequences of genes identified as potentially co-regulated in a microarray experiment, looking for shared promoter motifs.
The third portion of the lab deals with protein sequences. We will take a mystery protein sequence and use three different tools to find conserved motifs or domains. From this we can try to determine the function of the mystery protein.
Lab Exercise
Part I – Some new search tricks
This problem practices using the Entrez Search program at
the
Go to the Entrez Web site (http://www.ncbi.nlm.nih.gov/entrez/). Return to the Entrez main page and choose Protein from the search drop-down box in the upper left. Enter the term “heat shock factor” (with the quotes) to identify all entries with this phrase somewhere in their text.
1. How many matches are found?
Now limit the search by clicking the mouse on the Preview/Index tab, go to add terms, choose organism in the first box, type human in the second, then click AND to limit the search to just human proteins, and then click Preview. The history will now show the results of a search for database entries with the term “heat shock factor” AND originating from humans as the organism.
2. How many hits are there now?
We can limit the hits to matches in RefSeq, NCBI's curated reference sequence database, to give a best representative sequence entry for each protein.
Click the mouse on Limits. In the Limited To section of the page ignore the boxes on the left and choose RefSeq in the right-most box. Then click the GO button and then the History tab. Now we have all human heat shock factors in RefSeq.
The gene of interest is HSF1. Click on the Preview/Index tab and select Gene Name in the drop-down menu. Type HSF1 in the text box, then click the AND button and then the Preview button. There should now be one entry left in History. Clicking on the number 1 provides the sequence.
There are other ways of arriving at this final sequence. As another example, pull out all human protein sequences in RefSeq and all HSF1 sequences in all organisms and then select the human one using another Boolean search feature of Entrez. Reload the Entrez page and choose Protein in the upper left box. Enter human in the text box at the top, click Limits and then in the Limit To area, choose Organism in the upper left box and RefSeq in the right box. Click GO and then History. Now we have a complete list of all human proteins in RefSeq.
3. How many are there?
Now replace human with HSF1 in the upper text field, click Limits, and in the Limited To area, choose gene name in the upper left box and RefSeq in the right box. Click GO and then History.
4. How many proteins result from this query?
Finally, note the numbers at the beginning of the two lines that start with a pound sign (#) in History that were found by the last two searches. Go to the upper text box and type <#1 AND #2> (assuming the numbers are 1 and 2), omitting the angle brackets. This creates a new search that matches only protein sequences that are from humans and that come from the HSF1 gene, i.e., the new search is an intersection of the previous two. Again, 1 protein should be left.
Click on the 1 in the result column. Note the RefSeq accession number starting with “NP”. Click on NP, the link to display the detailed information for the query item. Use the display menu to display the sequence in FASTA format. “NP” identifies the sequence as a curated protein sequence.
While on the page with the target sequence, click on Links on the right side of the page and choose the Nucleotide option. Now the mRNA and genome sequence corresponding to the protein should become available.
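The Boolean searches in Part I can also be written as a single query string. The sketch below only builds the text and a URL for NCBI's E-utilities ESearch service; it does not contact NCBI. The field tags ([Organism], [Gene Name], [Properties]) and the srcdb_refseq property are written from memory of Entrez query syntax and are assumptions to check against the Entrez help pages; build_query is a hypothetical helper, not part of any NCBI tool.

```python
from urllib.parse import urlencode

# Base address of the E-utilities ESearch service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_query(*terms):
    """Join search terms with AND, parenthesizing each one (hypothetical helper)."""
    return " AND ".join(f"({term})" for term in terms)


# Counterparts of the two History entries (#1 and #2) from the exercise.
# The field tags below are assumptions about Entrez syntax.
human_refseq = build_query("human[Organism]", "srcdb_refseq[Properties]")
hsf1_refseq = build_query("HSF1[Gene Name]", "srcdb_refseq[Properties]")

# "#1 AND #2": the intersection of the two earlier searches.
combined = build_query(human_refseq, hsf1_refseq)
url = ESEARCH + "?" + urlencode({"db": "protein", "term": combined})

print(combined)
print(url)
```

Pasting the printed query text into the Entrez search box should behave like the <#1 AND #2> search, provided the field tags are valid for the Protein database.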
Part II – DNA Sequence Analysis
II A.) ORF Finder
We’ll start our gene prediction journey with the simple task of scanning a genome for open reading frames. An open reading frame, or ORF, is the DNA sequence located between a start codon and the next in-frame stop codon. Open reading frames represent potential genes. Since each codon consists of three nucleotides, each ORF must have a length divisible by three. By the same token, there are three possible reading frames for a strand of DNA; a fourth reading frame would be in phase with the first. Since DNA is double stranded, we have a total of 6 reading frames (3 on each strand).
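The six-frame idea can be sketched in a few lines of Python. This is a minimal illustration, not the algorithm ORF Finder itself uses: it assumes ORFs start at ATG, end at the first in-frame stop codon, and that the input contains only A, C, G and T. The function names and the toy sequence are made up for the example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_len=100):
    """Return (strand, frame, start, end, length) for each ATG..stop ORF.

    start and end are 0-based, end-exclusive offsets on the strand being
    read. length includes the stop codon, so it is divisible by three.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            # Walk this frame one codon at a time.
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    length = i + 3 - start
                    if length >= min_len:
                        orfs.append((strand, frame + 1, start, i + 3, length))
                    start = None
    return orfs


toy = "CCATGAAATTTGGGTAACC"
orfs = find_orfs(toy, min_len=9)
print(orfs)  # [('+', 3, 2, 17, 15)]
```

In the toy sequence the only ORF is ATG AAA TTT GGG TAA, read in the third frame of the forward strand.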


Let’s use ORF Finder at NCBI (http://www.ncbi.nlm.nih.gov/gorf/gorf.html) to locate all the possible open reading frames in the Human Immunodeficiency Virus (HIV). Its GenBank accession number is NC_001802. We can either enter the accession number or find the sequence ourselves and enter it directly. For now, enter the accession number and click the “OrfFind” button. The default settings show us all open reading frames greater than 100 bases in length.
5. What is the longest open reading frame detected in HIV?
6. Assuming this is a valid ORF with no introns, how long would the protein produced by this ORF be?
Change the minimum displayed ORF length by changing the drop down box from 100 to 50 and then clicking the “Redraw” button.
7. Do you think most of these new ORFs are translated into proteins? Why or why not?
Let’s assume we think the longest ORF is a real gene. Click either on the highlighted ORF in the 2D chromosomal picture or on the colored box next to the ORF in the ORF list. We can either accept this ORF as an actual gene or select the “Alternative Initiation Codons” button. Although most open reading frames start with ATG, there are always exceptions in biology.
8. Select the “Alternative Initiation Codons” button. The gene now starts with what earlier codon?
II B.) GeneMark
ORFs go a long way toward helping us identify potential genes. However, not all ORFs have the appropriate transcription factor binding sites to drive their expression. Also, prediction gets much more challenging when we consider alternative start codons, introns, and alternative splicing. Thankfully, there are plenty more tricks we can pull out of our bag.
Nearly all eukaryotic introns begin with GT and end with AG; this is known as the GT-AG rule. We can also take advantage of something known as codon bias. There are 20 amino acids and 64 possible codons, so most amino acids are encoded by more than one codon. However, different species use these synonymous codons at different frequencies, reflecting their specific cellular machinery. If we know how often a given triplet is used in known genes for that organism, we can assign a probability to each codon. We can then combine these probabilities across an entire ORF to assess whether it is likely to be a gene in this organism. We can also expand our search to scan for transcription factor binding sites upstream of the ORF. By combining all of these methods and several other techniques, we can do a much better job of predicting genes than by simply looking at all ORFs.
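As a rough illustration of codon bias scoring, the sketch below estimates codon frequencies from a set of known coding sequences and scores a candidate ORF by its mean log2 ratio against a uniform 1/64 background. This is a simplified stand-in for what gene finders do, not GeneMark's method; the training sequences, pseudocount and function names are invented for the example, and sequences are assumed to contain only A, C, G and T.

```python
import math
from collections import Counter


def codons(seq):
    """Split an in-frame coding sequence into complete codons."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def codon_frequencies(known_genes, pseudocount=1.0):
    """Estimate codon frequencies from in-frame coding sequences."""
    counts = Counter()
    for gene in known_genes:
        counts.update(codons(gene))
    all_codons = [a + b + c for a in "ACGT" for b in "ACGT" for c in "ACGT"]
    total = sum(counts.values()) + pseudocount * len(all_codons)
    # The pseudocount keeps unseen codons from getting probability zero.
    return {c: (counts[c] + pseudocount) / total for c in all_codons}


def log_odds_score(orf, freqs):
    """Mean log2 ratio of codon frequency to a uniform 1/64 background."""
    cs = codons(orf)
    return sum(math.log2(freqs[c] * 64) for c in cs) / len(cs)


# Toy "known genes" that always encode alanine as GCC, never as GCG.
training = ["ATGGCCGCCGCCTAA", "ATGGCCAAGGCCTGA"]
freqs = codon_frequencies(training)

preferred = log_odds_score("GCCGCCGCC", freqs)  # uses the favored codon
rare = log_odds_score("GCGGCGGCG", freqs)       # same amino acids, unseen codon
print(preferred > rare)  # True
```

Both candidate sequences encode three alanines; only the codon choice differs, which is exactly the signal codon bias exploits.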
There are many programs out there, taking advantage of different combinations of this information.
One such program, Gene Seqer, was developed by
In this lab, we will be using the GeneMark program. Go to http://www.ncbi.nlm.nih.gov and click on the “Genomic Biology” link on the left side of the screen. Here we can access whole genome sequences. On the right side of the screen under Organism-Specific, select the G for Genome Resources next to Mouse. There should be a figure with the mouse chromosomes in the upper right-hand corner. Click on chromosome 1 to go to a map of that chromosome. Click on the “Download/View Sequence/Evidence” link in the upper right corner. The chromosome is very large and is broken into several files. Click on the “Display” link for the first contig. We just want ~50,000 bases of the chromosome for this lab. Change the range values to display the sequence from 50000 to 100000, then click the “Refresh” button. A new view showing the desired region should appear below. Now copy this shortened FASTA sequence, go to http://exon.gatech.edu/GeneMark/eukhmm.cgi, and paste in the copied sequence. Under running options, change the species to M. musculus. Under output options, check “Generate PDF graphics (screen)”. Click the “Start GeneMark.hmm” button, then click on “View PDF Graphical Output”.
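If you already have a contig saved as FASTA text, the same range selection can be done offline. The sketch below is a minimal example that assumes a single-record FASTA and 1-based, inclusive coordinates (so 50000 to 100000 would return 50,001 bases); fasta_subrange and the toy record are hypothetical and not part of any NCBI or GeneMark tool.

```python
def fasta_subrange(fasta_text, start, end, width=70):
    """Return a FASTA record holding bases start..end (1-based, inclusive)."""
    lines = [ln.strip() for ln in fasta_text.strip().splitlines()]
    header, seq = lines[0], "".join(lines[1:])
    if not header.startswith(">"):
        raise ValueError("expected a FASTA header line starting with '>'")
    sub = seq[start - 1:end]
    # Re-wrap the extracted sequence to a fixed line width.
    wrapped = [sub[i:i + width] for i in range(0, len(sub), width)]
    return "\n".join([f"{header} range={start}-{end}"] + wrapped)


toy_record = ">toy\nACGTAC\nGTACGT\n"
print(fasta_subrange(toy_record, 3, 8, width=4))
```

For the toy record this prints the header ">toy range=3-8" followed by the lines GTAC and GT.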
The output shows the predicted genes as thick black lines along the number line. The thick grey regions are “regions of interest” and the thin black lines are ORFs in each reading frame. The output shows the predictions on the sequence directly as well as on the complementary strand.
9. Where is the first predicted gene?
10. In what reading frame is the first predicted gene?
II C.) Obtaining upstream sequences for a gene list
In a recent article in the Journal of Clinical Pathology, Kaushik et al. used microarray analysis to identify a list of genes differentially expressed in patients with chronic fatigue syndrome. In an attempt to corroborate their claim and hopefully learn more about the immediate pathway interactions, we are going to look for conserved promoter sequences in the upstream regions of these genes.
We will use the UCSC genome browser (http://genome.ucsc.edu/) to assemble upstream sequences for our gene list. Select the “Table” menu from the menu bar. Be sure the clade is set to “Vertebrate” and the genome is set to “human” and that you are using the most recent assembly. Also check that the table is set to “knownGene” and that “genome” is selected as the region. Use the “paste list” button to paste a copy of the list of accession numbers we obtained from the journal article. Select “sequence” from the output format drop-down menu. Check that the output will be returned as “plain text” and click “get output.” On the following screen select “genomic” and click “submit.” Now deselect all options except “Promoter/Upstream”, “One FASTA record per gene.”, and “Exons in uppercase, everything else in lower case” (note: the returned upstream regions should not contain any exons). Type 1000 in the “Promoter/Upstream” option and click “get sequence.”
Several of the entries are duplicated because alternatively spliced genes are listed separately even though they share the same upstream region. Use the following FASTA file for Part II. It should be identical to your output, except that duplicate upstream regions have been removed.
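Removing duplicate upstream regions is easy to script. The sketch below is one way to do it, assuming records count as duplicates when their sequences are identical ignoring case; it is not necessarily how the provided file was prepared, and the record names are invented.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def drop_duplicate_sequences(records):
    """Keep the first record for each distinct sequence (case-insensitive)."""
    seen = set()
    unique = []
    for header, seq in records:
        key = seq.upper()
        if key not in seen:
            seen.add(key)
            unique.append((header, seq))
    return unique


toy_fasta = """>isoform_a
acgtacgt
ttga
>isoform_b
ACGTACGTTTGA
>other_gene
ggccggcc
"""
unique = drop_duplicate_sequences(read_fasta(toy_fasta))
print([header for header, _ in unique])  # ['isoform_a', 'other_gene']
```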
II D.) Analyzing sequences for conserved motifs
In this section we will use web servers to identify potential conserved sequence motifs in our upstream sequences. We will start with MEME/MAST (http://meme.sdsc.edu/meme/). Select option 1 to “discover motifs (highly conserved regions) in groups of related DNA or protein sequences.” Enter your email address and paste in the upstream regions obtained in Part II C. Submit using the default options. You may have to wait more than an hour to obtain results. Proceed to the next section of the lab exercise and return when you have received your results via email. (Note: Use the job status link to see if the job is finished. If the job is finished, then follow the job output link.) Select the link to the file similar to “meme.html”.
11. How many motifs did the server find?
Next we will use TESS to check for any known transcription factor binding sites in the upstream region of the fatigue-associated gene below.
>hg18_knownGene_uc003sbs.1 range=chr6_qbl_hap2:3029931-3030930 5'pad=0 3'pad=0 revComp=FALSE strand=+ repeatMasking=none
tagtgtctatccctctccacaccgcagattcctaggccgcactccctttcccccgcttcccagttaccccgcctcccccttaccccgccttccccgcctccccatttccccgacaggccgcactcccttcccccgcctcccccattctggctgctccgaccaatcaatctgaagccatcttagctttccccaagtgctcctcctacccggatcagccaacgcccacatacctcaggcttaaaccaactagggaactttccagtactttcccaaacaaggacctactgagcctttcaggttcacaatcaatcagatccctactggctcacctagtctcccgacgccttcgcttcagtttggaaacgtccagattacgcagccccagcgagtaggtgggggctccctcaatatcaaactgcacaaccggggtccccccaccccccaccccgtccctccctgcaaatttgagacggctccaactcagtaatctttttccaaactggcccatgaggtcagagacagtatctccattgtaacgtggccgggcggtgtcaacacaaacgcccccaccctcccctggacgcgcgtaacccgctccccgcaccagccccctgcccacaactgcgcaggcccagcaagcccccacaattaaaagcccagcgccgacccttcctgtcaattaggcgctgaagcgcaggcggtcagcatcgccatggagaccaacacccttcccaccgccactcccccttcctctcagggtccctgtcccctccagtgaatcccagaagactctggagagttctgagcagggggcggcactctggcctctgattggtccaaggaaggctggggggcaggacgggaggcgaaacccctggaatattcccgacctggcagcctcatcgagctcggtgattggctcagaagggaaaaggcgggtctccgtgacgacttataaaagcccaggggcaagcggtccggataacggctagcctgagga
Halfway down the right half of the page, paste the above sequence into the “Try a quick search with the default parameters” section. Enter a job title and your email address. (It may take several minutes to receive a result.) On the results page, use the Result navigation at the top.
14. How many transcription factor binding sites were found?
Finally, we would like to compare the results from MEME and TESS. In the MEME results, find the motif that is included in the uc003sbs sequence submitted to TESS. Then identify the starting nucleotide of the motif in the uc003sbs gene. To compare this location in the uc003sbs gene with the TESS results, click on “Tabular Results” at the top of the TESS results page. Then click on “Beg” to sort the TESS results by starting location. Go to the region of the gene found by MEME to compare the results.
15. Are the motifs discovered in our MEME search identified as known binding motifs by TESS?
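The comparison between MEME and TESS comes down to checking which TESS hits overlap the interval occupied by the MEME motif. The sketch below shows that interval test with 1-based, inclusive coordinates. The hit list, factor names and motif position are entirely hypothetical placeholders, not real TESS or MEME output; substitute the values from your own results.

```python
def overlapping_hits(motif_start, motif_width, hits):
    """Return the hits whose [beg, end] interval overlaps the motif.

    All coordinates are 1-based and inclusive.
    """
    motif_end = motif_start + motif_width - 1
    return [h for h in hits if h["beg"] <= motif_end and h["end"] >= motif_start]


# Hypothetical rows copied from a tabular result sorted by "Beg".
hypothetical_hits = [
    {"factor": "TF-A", "beg": 10, "end": 17},
    {"factor": "TF-B", "beg": 118, "end": 125},
    {"factor": "TF-C", "beg": 130, "end": 139},
    {"factor": "TF-D", "beg": 400, "end": 405},
]

# Hypothetical motif: starts at position 120 and is 12 nucleotides wide.
matches = overlapping_hits(120, 12, hypothetical_hits)
print([h["factor"] for h in matches])  # ['TF-B', 'TF-C']
```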
Part III – Protein Sequence Analysis
We will use this mystery protein sequence for this entire section of the exercise:
>unknown protein
MAAAKAEMQLMSPLQISDPFGSFPHSPTMDNYPKLEEMMLLSNGAPQFLGAAGTPEGSGGNSSSSTSSGGGGGGGSNSGSSAFNPQGEPSEQPYEHLTTESFSDIALNNEKAMVETSYPSQTTRLPPITYTGRFSLEPAPNSGNTLWPEPLFSLVSGLVSMTNPPTSSSSAPSPAASSSSSASQSPPLSCAVPSNDSSPIYSAAPTFPTPNTDIFPEPQSQAFPGSAGTALQYPPPAYPATKGGFQVPMIPDYLFPQQQGDLSLGTPDQKPFQGLENRTQQPSLTPLSTIKAFATQSGSQDLKALNTTYQSQLIKPSRMRKYPNRPSKTPPHERPYACPVESCDRRFSRSDELTRHIRIHTGQKPFQCRICMRNFSRSDHLTTHIRTHTGEKPFACDICGRKFARSDERKRHTKIHLRQKDKKADKSVVASPAASSLSSYPSPVATSYPSPATTSFPSPVPTSYSSPGSSTYPSPAHSGFPSPSVATTFASVPPAFPTQVSSFPSAGVSSSFSTSTGLSDMTATFSPRTIEIC
III A.) PROSITE – Motifs & Weight Matrices
One feature of protein sequences is that they often contain protein motifs: recurring patterns of amino acids that denote a characteristic such as a structure or function. Identifying protein function by scanning for known protein motifs can prove more reliable than sequence alignments. Go to the PROSITE website at http://ca.expasy.org/prosite/. Read the PROSITE User Manual Introduction under Documents. Under the Tools for PROSITE section, submit the mystery protein sequence given above.
16. How many hits were found for this sequence and what was the motif found?
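PROSITE patterns are close cousins of regular expressions, which makes it easy to see what a pattern scan does. The sketch below converts a pattern into a Python regex and reports every match position. It handles only the common syntax elements (x, [..], {..}, repeat counts and terminal anchors) and is not the scanner PROSITE itself uses. The example pattern is the N-glycosylation site N-{P}-[ST]-{P} (PROSITE PS00001); the peptide is a made-up toy sequence, not the mystery protein.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE pattern such as N-{P}-[ST]-{P} into a Python regex."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("{", "[^").replace("}", "]")  # {P} -> [^P]
        element = element.replace("(", "{").replace(")", "}")    # x(2,4) -> x{2,4}
        element = element.replace("x", ".")                      # x = any residue
        element = element.replace("<", "^").replace(">", "$")    # terminal anchors
        parts.append(element)
    return "".join(parts)


def scan(pattern, protein):
    """Return (1-based position, matched text) for every match, overlaps included."""
    # The lookahead lets matches that overlap each other all be reported.
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(protein)]


toy_peptide = "MKNGSAANPSLLNATQ"
hits = scan("N-{P}-[ST]-{P}", toy_peptide)
print(hits)  # [(3, 'NGSA'), (13, 'NATQ')]
```

The NPSL stretch at position 8 is not reported because the pattern forbids proline right after the asparagine.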
III B.) SUPERFAMILY – Hidden Markov Models (HMM)
Go to the SUPERFAMILY website at http://supfam.org/SUPERFAMILY/hmm.html. This website uses pre-trained HMMs to determine what family a protein falls into. Again, paste the FASTA sequence above into the sequence window. Change the notification to Browser and click submit. When “YOUR RUN IS COMPLETE” appears, click “here” to see the output.
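To give a feel for how an HMM can assign a sequence to a family, the toy sketch below scores a sequence under two small HMMs with the forward algorithm and picks the model with the higher likelihood. Everything here is an illustrative assumption: the two-letter alphabet (h for hydrophobic, p for polar), the two "families" and all probabilities are invented. Real SUPERFAMILY models are profile HMMs over the 20 amino acids with match, insert and delete states.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def forward_log_likelihood(seq, model):
    """Forward algorithm: log P(seq | model), summed over all state paths."""
    states = list(model["start"])
    alpha = {s: math.log(model["start"][s] * model["emit"][s][seq[0]]) for s in states}
    for symbol in seq[1:]:
        alpha = {
            s: math.log(model["emit"][s][symbol])
            + log_sum_exp([alpha[r] + math.log(model["trans"][r][s]) for r in states])
            for s in states
        }
    return log_sum_exp(list(alpha.values()))


def make_model(stay):
    """Two hidden states: S1 mostly emits 'h', S2 mostly emits 'p'."""
    return {
        "start": {"S1": 0.5, "S2": 0.5},
        "trans": {"S1": {"S1": stay, "S2": 1 - stay},
                  "S2": {"S1": 1 - stay, "S2": stay}},
        "emit": {"S1": {"h": 0.9, "p": 0.1}, "S2": {"h": 0.1, "p": 0.9}},
    }


# Two made-up families: one favors long runs, the other strict alternation.
FAMILIES = {"runs": make_model(stay=0.9), "alternating": make_model(stay=0.1)}


def classify(seq):
    """Return the toy family whose HMM gives seq the highest likelihood."""
    return max(FAMILIES, key=lambda name: forward_log_likelihood(seq, FAMILIES[name]))


print(classify("hhhhpppp"))  # runs
print(classify("hphphphp"))  # alternating
```

The two models share the same emission probabilities and differ only in their transition probabilities, so the classification depends on the order of residues rather than on composition alone.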
17. What domains does SUPERFAMILY identify? Does this agree with the PROSITE motif?
18. Based on all the motif-finding results, what can you say about our mystery protein? (What do we know and what do we not know?)