Virology

Home | Index | Syllabus | Schedule | Study aids | Computing | Links | Interactive

 


Computer Exercise 2: Bioinformatics Searching

Finding sequences and working with them

Stev 2055 PC lab

Introduction:

What does evolution have to do with learning to search for sequences? What can playing with sequences teach us about evolution? One answer to the first question might be that since you are "more evolved" than any other primate and "technological evolution" has taken you to the point of being able to use a computer, searching for viral sequences is a good way to occupy your time while learning something about viruses. I suppose there are other answers as well.

The possible answers to the second question are quite varied. As time goes on, the list of answers should continue to grow. A few examples should illustrate the point. By comparing gene sequences or protein sequences, it is possible to compare the degree of relatedness of the source organisms. It is possible to build a phylogenetic tree based on the differences and similarities found. In some cases, it is possible to trace the divergence of a gene into a gene family, represented by many different protein products. In other cases, it is possible to trace the origin of an individual gene found in one organism as being from a very unrelated one. After doing these exercises, you will be able to add to this list. But be prepared. For every question you raise and attempt to answer, you will generate even more questions. It is possible to get sucked in and lose focus as you attempt to follow multiple leads at once.

Be prepared to spend some time on this exercise set. At first it may seem overwhelming, but it gets better with practice. Note questions in your log. Work on small chunks at a time. Take breaks. For those with little or no experience in bioinformatics, and for others wanting a review or to learn new things about NCBI resources, do the Pre-lab activity. The tutorials will give you an overview of some of the components involved in bioinformatics.

Reminder: Using a log can help you backtrack to things you'd like to check out later. Saving sequences, along with pertinent accession numbers and identifiers, will become extremely useful, especially when trying to use multiple analytical applications on the same data set.

Collect the information requested below [marked with asterisks (*)] and answer the summary questions at the end of each part. Read them through before you start browsing. You can answer these as you go, or answer them after browsing the following sites. Points = 15. Due in two weeks [3/5]. 

The last thing to do is to go to the Fora on WebCT. See the end of the exercise for topic focus.

 

Key sites used in this exercise:

1. Tutorial lessons at NCBI:

http://www.ncbi.nlm.nih.gov/Education

2. Searching for taxonomic organization, specific sequences, and performing BLAST analyses:

http://www.ncbi.nlm.nih.gov

Sub-sites of interest: [there are others, as well]

  • Taxonomy
  • Entrez- nucleotides, proteins, etc.
  • BLAST

3. Performing multiple sequence analyses [MSA]:

http://www2.ebi.ac.uk/clustalw/ or http://www.ebi.ac.uk/clustalw/

Part 0: Pre-Lab Activity- Tutorials

1. To access NCBI tutorials directly, go to http://www.ncbi.nlm.nih.gov/Education.

a. To become familiar with the size and layout of NCBI, click on the site map. For future use, this site map will allow you to navigate to different sections quickly, or to find something that you saw at one point but can't seem to find again.

b. From the main Education page, you can get an overview of bioinformatics and some key terms.

c. Next, try the tutorial Nucleotides, then try BLAST. Even if you have used BLAST before, this is very helpful. There are also new features which can help in doing BLAST searches more effectively.

d. To better understand BLAST statistics and the results of a search, browse the statistics tutorial, available from the left-hand menu bar. [In future, you can also access both this tutorial and the main BLAST tutorial from the BLAST page.]

e. An interesting way to learn about different aspects of bioinformatics and some specifics on applications is to go to Coffee Break. This is a series of short essays on a variety of topics with specific tutorials embedded. For example, the newest posting is 22 October 01, "Finding Fanconi- The hunt for the cause of autosomal dominant renal Fanconi syndrome." Click on Archive to access the list of topics. Try out a few now. Come back for more later.

2. To learn more about bioinformatics in general, and have fun doing it, try playing Origin: Unknown:

http://www.nbif.org/

There are different modules and different levels of difficulty. Let me know what you think about it.


Part 1: Finding a sequence (or two, or more...)

Exercise:

A: Getting started

1. We start with a detour, to explore a cool site you should visit often as we make our way through the viral taxa. Go to:

http://www.ncbi.nlm.nih.gov

2. Click on Taxonomy on the menu bar.

a. Click on Genetic Codes on the left-hand bar. Here you can discover that AUG does not always mean methionine, among other things. [Don't linger long; you can return later for a more thorough browse.]

b. Return to Taxonomy; then click on the tree button when you get to the site; then select viruses. [This is another site you'll want to browse in more detail later.]

1) Page down through the taxon groups [currently groups are listed in the order dsDNA, dsRNA, retroids, ssDNA, ssRNA, unassigned] until you reach ssRNA positive strand.

2) Select Picornaviridae; then select Human Rhinovirus 16.

3) Retrieve protein sequences by selecting protein, then retrieving.

3. Now you are in another part of the NCBI site. You should have your list of 19 proteins. Click on Q82122; then explore the different reports available. In graphical, you will see the genomic RNA and the location of the sequences for different protein products. [Try the graphical view for some of the other sequences, and you will find that not all provide the same amount of information.]

*Use your log to save an example of each kind of report, along with a label and explanatory comments as to source, Web page type [NCBI search], and anything else you deem useful for future use.

4. Select one of the 19 proteins, and go to FASTA report.

*Copy the entire report, beginning at ">gi|xxx ...." and ending at the end of the sequence. Paste this in your log as a backup and for future use. [What, you didn't open a log yet?? You can do it now.] After pasting, click back to your Web page. [You may also want to try NCBI's clipboard feature. Just be aware that the clipboard is useful during an active session, but will not save your selections.]
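If you keep your log as a plain text file, a few lines of Python can pull the saved FASTA reports back apart for later use. This is only a sketch: the function name `parse_fasta` and the example record are made up for illustration, and the record is not a real database entry.

```python
def parse_fasta(text):
    """Split FASTA-formatted text into (header, sequence) pairs."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]  # drop the leading ">"
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence lines are joined into one string
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Made-up record for illustration only; not a real entry.
example = """>gi|0000000|example hypothetical protein
MGAQVSTQKT
GAHETGLNAS
"""

records = parse_fasta(example)
for header, sequence in records:
    print(header, len(sequence))
```

Keeping the header with the sequence means the accession number and identifiers travel along with the data when you move it to another application.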

Congratulations! You now have a sequence, which you can use in the next section. Now go to Part 2. [Come back on your own to do B below.]

 

B: For later, on your own.

1. Try entering descriptive terms, such as "poliovirus", "rhinovirus serotype 3", etc., directly in the search line of NCBI's home page, or in Entrez. Select the type of search you want: GenBank, for nucleotides; Proteins; Structure; and so on. If you get hundreds or thousands of hits, you can refine your search by adding specific words to your search string. You can use the advanced search field to refine it further. Try using the same search term to search for both nucleotides and proteins. Compare your results in terms of items retrieved and how they relate to each other.

2. Try saving some files of interest. Then use them for other analyses, such as visualizing in Chime/RasMol, if a Structure search was done; or run BLAST or MSA.

*a) Use your log to save examples of the different types of searches, along with useful commentary.

*b) Be sure to save at least one nucleotide sequence of interest in FASTA format, for use in running a BLAST search.

3. Check out the "newsy" information on the main pages of Entrez, such as Nucleotides, Proteins, etc. There are some interesting developments.

4. Find a group of related proteins. This can be done in a variety of ways, in different locations. A brief introduction to two approaches is given here.

a. Scan the right side of a page of results from one of your searches above. Note that some entries have Related Sequences in blue type. Click on one to see what you get back. Another way is to check if Protein Neighbors shows in the Display menu for your selected protein(s). Selecting it will give you a list as well. Be sure to check the boxes before clicking on "Display".

b. You can find pre-grouped sets of sequences by going to Popset at Entrez. Try a couple of your successful search terms used above. If you strike out, try modifying your word string. Since this is still a fairly new site, the data set of groups is still somewhat limited. However, there is more than enough here to play with and to get a good idea how multiple sequence alignments [MSA] work. [I readily found three sets of influenza sequences.] Note that you will retrieve nucleotide sequence sets here. You can work with these or if you want to look at the proteins, you can select Protein on the right side. This will take you to a summary list of protein sequences as you had in A above.


Part 1 Summary Questions:

Try to limit your answers to one typed page [12 pt font] for this part. Examples may be given in 10 pt font. [You need not retype the questions as part of your responses.]

1. Give an example of a successful basic search you did. Give the following information:

a. Word or word string used
b. Give examples from your log of different types of reports you retrieved
c. Give one example of a sequence in FASTA format

2. How did you use this sequence after you saved it?


Part 2: Find homologous sequences by using BLAST

BLAST is a search tool for finding sequences homologous to a target sequence. The query is: "What in all the crowd of database entries matches this selected sequence?" This is fundamentally different from the approach taken in multiple sequence alignments [MSA], which compare selected sequences against each other. [See Part 3.]

Exercise:

A: Finding homologous protein sequences by using BLAST.

1. The shuttle to return to NCBI departs here:

http://www.ncbi.nlm.nih.gov/

If you are at the NCBI home page, click on the BLAST button. If you are at the Entrez home page, click on BLAST on the left-hand menu panel. On some formats of results pages, BLAST can be accessed by clicking on it at the top of the page. Read the overview; and check out the tutorials if you haven't seen them already. They introduce the different types of BLAST searches and the statistical analytical tools used.

2. Select blastp in the first menu window. In the database window, leave the setting at nr [non-redundant]. Go to your log, select the FASTA-formatted sequence of a protein of interest, and copy it. Go back to the BLAST page. Scroll down to the large text window, click the cursor inside, then click paste. [The FASTA sequence should still be in the buffer. If not, go to your log and copy/paste it over.] Be sure you eliminate any spaces to the left of the ">gi|xxx" line and any spaces to the left of the sequence lines. Do not touch the right side of any lines.
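If you would rather not hunt for stray left-hand spaces by eye, the cleanup can be sketched in Python. The function name `strip_left_whitespace` and the pasted text are hypothetical; the sketch removes spaces and tabs only at the start of each line and leaves the right side alone, as the step above asks.

```python
def strip_left_whitespace(fasta_text):
    """Remove spaces and tabs at the start of every line; keep the right side as is."""
    cleaned_lines = [line.lstrip(" \t") for line in fasta_text.split("\n")]
    return "\n".join(cleaned_lines)


# Made-up text with the kind of indentation a copy/paste can introduce.
pasted = "  >gi|0000000|example\n   MGAQVSTQKT\nGAHETGLNAS  \n"

cleaned = strip_left_whitespace(pasted)
print(cleaned)
```

Run your copied text through this before pasting it into the query window, then check that the first character of the result is ">".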

3. Below the text window are other menu selections, such as one which says pairwise. For your first search, use the defaults. To retrieve your BLAST report, click "Format results". Browse the report. Note that the graph is active, meaning you can navigate the report by mousing-over the bars and clicking on them to get to the alignments. The window above the graph tells you the identity of each sequence retrieved. You can also scroll through the report. For details of interpreting the report, 1) click on the line above the graph, "Distribution of xx Blast Hits...", and/or 2) click on "FAQ".

  • For long polyprotein sequences, % homology scores will be high, even with a reasonable number of non-match residues and gaps, since calculations are based on the total number of residues. For single proteins and fragments, homology scores will often be lower, because each difference makes a greater numerical impact on the calculations. "Expect" or "E" values are useful: an E value estimates how many matches scoring at least this well you would expect to find by chance in a database of this size. A value of 0.0 indicates extremely good homology or an exact match. A value of 10 or more generally indicates random chance and will go unreported. Values in between give a sense of how significant the match is.
  • For a pairwise Blast-p, the middle line between query and subject gives a letter if an identical match, a + if similar in character, or a blank if no match. [In Blast-n, vertical lines and blanks are used between the query and subject lines.] In both cases, hyphens in the query or subject lines indicate gaps.
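The two points above can be tried out on a toy alignment. This is a rough sketch, not BLAST's own method: BLAST decides which residue pairs earn a "+" from its scoring matrix, whereas the `SIMILAR_GROUPS` list below is a hypothetical stand-in, and the sequences and counts are invented.

```python
# Hypothetical similarity groups, for illustration only. BLAST itself uses a
# scoring matrix to decide which pairs count as similar ("+").
SIMILAR_GROUPS = [set("ILVM"), set("DE"), set("KR"), set("ST"), set("FYW"), set("NQ")]


def middle_line(query, subject):
    """Build the line shown between query and subject in a pairwise protein report."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    marks = []
    for q, s in zip(query, subject):
        if q == "-" or s == "-":
            marks.append(" ")   # gap: nothing to compare
        elif q == s:
            marks.append(q)     # identical match: repeat the letter
        elif any(q in group and s in group for group in SIMILAR_GROUPS):
            marks.append("+")   # similar in character
        else:
            marks.append(" ")   # no match
    return "".join(marks)


def percent_identity(query, subject):
    """Percent of aligned columns holding the same residue (gaps never count)."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    identical = sum(1 for q, s in zip(query, subject) if q == s and q != "-")
    return 100.0 * identical / len(query)


def identity_after_differences(length, differences):
    """Percent identity when `differences` columns out of `length` do not match."""
    return 100.0 * (length - differences) / length


query = "MKV-LDE"
subject = "MRVALEG"
print(query)
print(middle_line(query, subject))
print(subject)

# The same ten differences weigh very differently by length:
print(identity_after_differences(2000, 10))  # long polyprotein
print(identity_after_differences(50, 10))    # short fragment
```

Ten mismatches barely dent a 2000-residue polyprotein but cost a 50-residue fragment a fifth of its score, which is why you should look at the E value as well as the percentage.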

4. Go back to the BLAST query page. In the window below the text box, change from pairwise to flat query-anchored. Run and browse the report. Compare the results to your first report.
 

5. For another search, go back to the BLAST query page or open a new one. Select blastp again. Set database to pdb. In the window above the text box, change to accession number. [If you're using the old page, first dump the FASTA sequence in the text window.] In the text window, type in one of your saved PDB accession numbers or use QJY1. [This is one of the chains of QJY, the rhinovirus coat protein which we used in the first exercise.] Run BLAST either in pairwise or flat query-anchored. You might just want to try query-anchored to compare the difference in the report appearance.

 

B: For later- Finding homologous nucleotide sequences by using BLAST.

This does not need to be done now, except by those eager to try it.

1. [If you already have a FASTA nucleotide sequence in your log, skip to #2.] Go to either the NCBI or Entrez home page, select GenBank, and enter your choice of search terms. [If you're drawing a blank and just want something to try now, you can enter L24917 or D00625.1 in the search line. L24917 is the human rhinovirus 16 polyprotein gene for the Q82122 protein; D00625.1 is from poliovirus 2.] Retrieve your target sequence in FASTA format.
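For those who like to script things, retrieving a FASTA report by accession number can also be done without the Web page. The sketch below only builds the request address; it assumes NCBI's E-utilities `efetch` service with the `db`, `id`, `rettype`, and `retmode` parameters, none of which are part of this exercise, so check NCBI's current documentation before relying on it. The function name `fasta_url` is made up.

```python
from urllib.parse import urlencode

# Assumption: NCBI's E-utilities efetch endpoint. Not part of the original
# exercise; verify against NCBI's documentation before use.
EFETCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def fasta_url(accession, db="nuccore"):
    """Build an address that asks for one record as plain-text FASTA."""
    params = {
        "db": db,             # "nuccore" for nucleotides, "protein" for proteins
        "id": accession,      # accession number, e.g. one saved in your log
        "rettype": "fasta",   # ask for the FASTA report
        "retmode": "text",    # as plain text
    }
    return EFETCH_BASE + "?" + urlencode(params)


url = fasta_url("D00625.1")
print(url)
```

Pasting the printed address into a browser should return the same kind of report as the FASTA display, beginning with a ">" header line. Log the address along with the sequence so you can retrace your steps.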

2. Go to BLAST. Select blastn and leave database nr. Paste your sequence and run your choice of pairwise, query-anchored, and flat query-anchored searches. Compare the similarities and differences found between running blastp and blastn. Note the informational advantages of each.

 

C: Optional.

Explore the uses of Psi-BLAST and BLAST-x. Find out what they do and how they can be used. Depending on your project, these may be of some benefit.


Part 2 Summary Question:

Try to limit your answer to one typed page [12 pt font] for this part. [You need not retype the questions as part of your responses.] 

1. Summarize one of your protein BLAST search results. Give the following information:

a. What was your query sequence, which you entered in FASTA or as a PDB number?

b. How many closely related sequences did you retrieve? What was the range of "E" scores? What does this mean?

c. Did you retrieve any matches of sequences outside of the virus family of your query sequence? What is the significance of these matches? [Depending on your search, this may or may not apply.]

d. What information did you get from the query-anchored report, which you didn't see in the first report?

e. Which style of BLAST report did you prefer? Why?

2. Summarize one of your nucleotide BLAST search results. Follow parts a-e above.


Part 3: Running Multiple Sequence Alignments [MSA]

Exercise:

A: MSA of protein sequences.

1. By this time you should have one or more sets of related proteins stored in your log in FASTA format. If so, jump to 2 below. If not, go back to Part 1, section B4 and follow directions on retrieving a set of related proteins.

2. In your log, remove any spaces to the left of ">gi|xxx..." on the header line of each entry. [Don't disturb the sequence lines.] This is necessary when running MSA, because any extra spaces will terminate the alignment for all entries beyond those spaces.

3. Go to:

http://www2.ebi.ac.uk/clustalw/ or http://www.ebi.ac.uk/clustalw/

Explore the site. You can read about the windows by clicking on them.

4. Paste your grouped FASTA sequences into the text box. For your first run, use the defaults. The alignments will take a few minutes. You may want to enter your e-mail to retrieve the report. If your run fails, the first place to check is the FASTA format and the left-hand spaces. If a run seems to take too long, try "off-hours", keeping in mind that this is a European site, or try making your alignment request smaller. You can do this by selecting only 4-6 sequences. Alternatively, you may want to focus on just one region or domain of your sequences. In that case, you can select portions of the FASTA reports.
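Trimming a request down to a few sequences, or to one region of each, is easy to get wrong by hand. Here is one way it might be sketched in Python, assuming you already hold your entries as (header, sequence) pairs. All names and sequences below are invented for illustration, and positions are counted from 1, as in the sequence reports.

```python
def select_records(records, max_count=6):
    """Keep only the first few records to make the alignment request smaller."""
    return records[:max_count]


def slice_region(records, start, end):
    """Cut residues start..end (1-based, inclusive) out of every sequence."""
    sliced = []
    for header, sequence in records:
        # Note the region in the header so the log shows what was cut.
        new_header = "%s region %d-%d" % (header, start, end)
        sliced.append((new_header, sequence[start - 1:end]))
    return sliced


def to_fasta(records, width=60):
    """Write records back out as FASTA text with no left-hand spaces."""
    lines = []
    for header, sequence in records:
        lines.append(">" + header)
        for i in range(0, len(sequence), width):
            lines.append(sequence[i:i + width])
    return "\n".join(lines) + "\n"


# Made-up sequences for illustration only.
records = [
    ("seqA", "MGAQVSTQKTGAHE"),
    ("seqB", "MGAQVSRQNVGTHS"),
    ("seqC", "MGSQVSTQRSGSHE"),
]

subset = slice_region(select_records(records, max_count=2), 3, 8)
print(to_fasta(subset))
```

Keep in mind that cutting the same residue numbers from every sequence only lines up the same region if the sequences are of similar length with no large insertions before it; otherwise choose the portions by eye from the reports.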

5. Once you have the report, browse to see what you have. Click on Jalview for a graphical display. Wait for the calculations and color assignment to be complete before trying to navigate. For your convenience, consensus notations and colors used in Jalview are assigned as follows:

Consensus line notations:
* = identical or conserved residues in all sequences in the alignment
: = indicates conserved substitutions
. = indicates semi-conserved substitutions.

Color      Characteristics                       Amino acids
red        small & hydrophobic R groups          AVFPMILW
blue       acidic                                DE
magenta    basic                                 RK
green      hydroxyl + sulfhydryl + amine + G     STYHCNGQ
gray       other

Compare the results given here to the BLAST results. If you used a group of sequences containing the query sequence you used for a BLAST search, you should see a similar alignment.
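To see where the "*" marks in the consensus line come from, you can recompute them from any set of aligned sequences. The sketch below handles only the "*" notation (identical residue in every sequence); the ":" and "." marks depend on ClustalW's own substitution groups, which are not reproduced here. The function name `identity_line` and the aligned sequences are invented.

```python
def identity_line(aligned):
    """Return a line with "*" under every column where all residues are identical."""
    if len({len(sequence) for sequence in aligned}) != 1:
        raise ValueError("need at least one sequence, all of the same aligned length")
    marks = []
    for column in zip(*aligned):
        first = column[0]
        # A column of gaps is not a conserved residue, so "-" never earns a "*".
        if first != "-" and all(residue == first for residue in column):
            marks.append("*")
        else:
            marks.append(" ")
    return "".join(marks)


# Made-up aligned sequences for illustration only; "-" marks a gap.
aligned = [
    "MGAQVS-TQ",
    "MGSQVSATQ",
    "MGAQISATQ",
]

for sequence in aligned:
    print(sequence)
print(identity_line(aligned))
```

Try it on a few rows copied from your own report and compare the result with the consensus line you were given.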

6. Try repeating the MSA after changing selected defaults. [It is good to play with a limited number of sequences when doing this, due to the computing load it creates. Four sequences of reasonable length should give you enough to see the effects.] How are the reports affected? Record your results in your log for future reference.

B: Optional.

You can also run MSAs on a selected set of nucleotide sequences, such as from Popset or from one of your own searches.


Part 3 Summary Questions:

Try to limit your answers to one typed page [12 pt font] for this part. [You need not retype the questions as part of your responses.] 

1. Summarize one of your MSA results on some selected proteins. Give the following information:

a. What group of proteins did you select?

b. What regions are highly conserved? What regions are not? [Consider the structure and function of the proteins.]

c. If you were to compare the nucleotide sequences of these same proteins, would the conserved regions of the nucleotides be as conserved as the conserved regions of the proteins? Why or why not?

2. If you get alignments by running BLAST, what is the advantage of using MSA?

3. How can MSA results be used in examining phylogeny?

 

Interactive discussion

Make one original posting to "Search Strategies", or respond to an existing posting. You can copy/paste sequences as part of your posting and you can provide active links as well. Check back and read new postings. They may contain information or answers which you'll find useful.

 


 Updated 1/5/02 by thatcher@sonoma.edu