Immunology

Index | Syllabus | Schedule | Study aids | Computing | Links


 Computer Exercise 3:
Searching for Protein Sequences and Beyond

Discussion Section 1: 9/28
Discussion Section 2: 9/30

Objectives:

1. Gain an understanding of the basics of bioinformatics.

  • Explore some key sites containing databases and application software.
  • Become familiar with some of the uses of bioinformatics.
  • Appreciate the dynamic nature of this rapidly growing field.

2. Become familiar with some uses of molecular databases.

  • Be able to query databases and search for molecular sequences.
  • Be able to find homologous sequences to a known sequence using BLAST.
  • Be able to align multiple sequences using ClustalW.
  • Know how to evaluate the quality of information obtained and analyze the results.

3. By using these tools, gain a better understanding of some of the key proteins involved in immune responses. This may include some or all of the following:

  • Know the range of model species used and appreciate the depth [and sometimes lack thereof] of data.
  • Have a better understanding of sequence motifs, molecular behavior, and molecular structure.
  • Have an increased understanding of the phylogeny of some of the proteins examined.

Introduction: 

In this exercise, you will get a taste of searching for specific protein sequences and then using them in some applications intended to increase your appreciation of molecular immunology. You will be able to search for related sequences and to compare selected sequences to each other. After playing with them for a while, you may wish to try your hand at asking some basic questions and looking for ways to answer them. For example, your questions may focus on the structural aspects of receptor-ligand interactions of antigens and antibodies or T cell receptors. Or you may explore the phylogeny of the immunoglobulin supergene family. Or you may want to look for conservation in transmembrane motifs. Curiosity and imagination are great tools to use along with what you'll learn here. [This is basically an advertisement for the next exercise.]

Before you can begin searching for answers to your questions, you need to learn how to search for the necessary data and information. Once that is done, then the analysis and interpretation can begin. Both parts are important and require learning specific strategies and applications. The cool thing is that now you can access a wide variety of tools on-line for free, including support in learning how to use the tools.

For those of you with experience in doing nucleotide sequence searches, and for those of you wishing to do so, be patient. You will have an opportunity to expand when designing your project. Based on feedback from last year, and on the demands of trying to give clear direction, limiting the scope seems to be appropriate. Since we are looking at molecular structure of immunoglobulins before examining the underlying gene sequences, it makes sense to start with proteins.

Time management: First of all, don't try to do this in one sitting. Breaks between small chunks will actually help the learning process and cut the frustration down to a reasonable level. The more you play with different sections, the more they begin to fit together. The more they fit together, the more understanding you will gain. Don't expect to "get it" after one round of search and analysis. [I've been poking around for seven years, and I'm still somewhere on the left-hand side of the learning curve.]

Reminder: Using a log can help you backtrack to things you'd like to check out later. Saving sequences, along with pertinent accession numbers and identifiers, will become extremely useful, especially when trying to use multiple analytical applications on the same data set.

There are summary questions at the end of this section. Read them through before you start browsing. You can answer these as you go, or answer them after browsing the following sites. Points = 10. Due in two weeks [10/8 & 10/10]. 

The last thing to do is to go to the Fora on WebCT. See the end of the exercise for topic focus.


Key sites used in this exercise:

1. Tutorial lessons: For a pre-lab activity, browse the following two tutorial sites to see what they have to offer.

a. Try using the Darwin 2000 modules. If you haven't tried these tutorials yet, DO. You'll get a better idea of what you are doing and seeing and why. If you have done them in the past, check back through them for practice and specific pointers. 

http://www.rickhershberger.com/darwin2000/

Explore the following sub-sites:

  • Darwin 2000- for an overview of evolution & computing.
  • Finding sequences- for an introduction and guided tour of NCBI search strategies and understanding some details of the results.
  • BLAST- "one against all" homology/alignment searches: an introduction and basic tour.
  • Multiple Sequence Alignment- "many against each other" homology/alignment searches: an introduction and basic tour.

b. Explore the tutorials at NCBI. Besides the basic introductory lessons, browse the "Coffee Break" tutorials to see what is there. [If you would like a guided activity to follow, jump to Virology's Pre-lab activity.]

http://www.ncbi.nlm.nih.gov/Education

2. Searching for specific sequences and performing BLAST analyses. 

http://www.ncbi.nlm.nih.gov 

Sub-sites of interest: [there are others, as well]

  • BLAST
  • Entrez- nucleotides, proteins, etc. 
  • Structure [you've been here already]

3. Performing multiple sequence analyses [MSA].

http://www2.ebi.ac.uk/clustalw/

http://www.ebi.ac.uk/clustalw/ : Alternative site if the above site can't be accessed.


Exercise:

Part 1: Basic Search Strategies

For warm-up, especially if you haven't done searches before, go to Darwin 2000 and use the module "Finding Sequences":

 http://www.rickhershberger.com/darwin2000/

This exercise will guide you through finding a gene sequence in GenBank. That is fine, because the principles are the same as for finding a protein sequence, and later you can look for nucleotide sequences in GenBank and Entrez. Work through the "Guided Activity" to get a feel for searching and to get descriptions of the various parts of the reports, saving sequences, and so forth. Bookmark this site, because you will probably want to refer to it when working on your target proteins.

A. Finding an "immuno" protein sequence or two. 

1. Go to NCBI:

http://www.ncbi.nlm.nih.gov/

Either select proteins in the pulldown menu window on the left or select Entrez and then Proteins on the top menu bar. Enter one of the following keywords or phrases from the following list:

  • immunoglobulin
  • Fab
  • T cell receptor
  • MHC

You can refine your search if you get too many hits. For example, you can narrow down to a given species or to a single chain of a multi-chain protein. You can add to your word string or go to "Limits" under your search window for advanced search options. Use the "All Fields" pulldown menu to specify a field. The Boolean operators AND, OR, and NOT must be in upper case. Help on using limits is available.
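If it helps to see how a refined word string is put together, here is a minimal sketch in Python. The function name `build_entrez_query` and its parameters are made up for illustration; the bracketed tags `[All Fields]` and `[Organism]` are the field names as they appear in the Entrez pulldown menu. The only thing it does is assemble text that you could paste into the search window.

```python
def _quote(term):
    """Wrap multi-word terms in double quotes so they are searched as a phrase."""
    return f'"{term}"' if " " in term else term


def build_entrez_query(keyword, organism=None, exclude=None):
    """Assemble a search string with upper-case Boolean operators.

    keyword  -- main search term, e.g. "immunoglobulin"
    organism -- optional species name used to narrow the hits
    exclude  -- optional term to drop from the results
    (All three parameter names are hypothetical, chosen for this sketch.)
    """
    parts = [f"{_quote(keyword)}[All Fields]"]
    if organism:
        parts.append(f"AND {_quote(organism)}[Organism]")
    if exclude:
        parts.append(f"NOT {_quote(exclude)}[All Fields]")
    return " ".join(parts)


if __name__ == "__main__":
    print(build_entrez_query("T cell receptor", organism="Mus musculus"))
```

Typing the same string by hand works just as well. The sketch only shows where the operators and field tags go.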

2. To retrieve a single protein, click on its accession number highlighted in blue. To retrieve several proteins at once, select them by checking the boxes to the left. Then choose which type of display you want, such as GenPept, then click "Display". Review the type of information available. Try some of the other display formats & links. Use your log to save an example of each kind of report, along with a label and explanatory comments as to source, Web page type [NCBI search], and anything else you deem useful for future use.

3. Try displaying your selected proteins in FASTA format by selecting it in the display menu. For saving FASTA reports to your log, copy the entire report, beginning at ">gi|xxx ...." and ending at the end of the sequence. Paste this in your log as a backup and for future use. [What, you didn't open a log yet?? You can do it now.] After pasting, click back to your Web page. Congratulations! You now have a sequence or sequences, which you can use in the next section.
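For those who keep their log as a plain text file, the FASTA layout described above [a ">" header line followed by sequence lines] can be read back with a few lines of Python. This is only a sketch: `parse_fasta` is a name invented here, and the example record is made up, not a real database entry.

```python
def parse_fasta(text):
    """Split FASTA text into a list of (header, sequence) pairs.

    A record starts at a line beginning with ">" and runs until the
    next ">" line or the end of the text.
    """
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Made-up record for illustration only; paste your own saved report instead.
example = """>gi|0000000|example hypothetical record
MKTAYIAKQR
QISFVKSHFS
"""

if __name__ == "__main__":
    for header, sequence in parse_fasta(example):
        print(header, len(sequence))
```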

Note: You can also save search results and FASTA reports to the on-line clipboard. Beware, however, that there is a time limit for material remaining on the clipboard.


B. Finding a group of related proteins.

This can be done in a variety of ways, in different locations. A brief introduction to some of these is given here.

1. Scan the right side of a page of results from your search in A above. Note that some entries have Related Sequences in blue type. Click on one to see what you get back. Another way is to check if Protein Neighbors shows in the Display menu for your selected protein(s). Selecting it will give you a list as well. Be sure to check the boxes before clicking on "Display".

2. You can find grouped sets of sequences by going to Popset at Entrez. Try a couple of your successful search terms used above. If you strike out, try modifying your word string. This is still a fairly new site and the data set of groups is still somewhat limited. However, there is more than enough here to play with and to get a good idea how multiple sequence alignments [MSA] work. Note that you will retrieve nucleotide sequence sets here. You can work with these or if you want to look at the proteins, you can select "Protein" on the right side. This will take you to a summary list of protein sequences as you had in A above.

3. If you want a selected set of related structures, the easiest way is to start a search under "Structure" at Entrez. On the Structure Summary page of a given entry, look for a listing Structure Neighbors. Click on one, and you may be able to retrieve a list of proteins with related structures. Entries which are relatively new [under a year] are often still in a queue to be calculated into retrievable lists. Therefore, look for "old entries".

The above strategies have their uses, but as is, they are not that interesting, with the exception of the Popset groups. These groups are displayed in aligned blocks, which give you a fair amount of information to start digesting. For further analysis following any of these search strategies, sequences of interest need to be retrieved and saved in FASTA format. If you are interested in structure, be sure to save MMDB and PDB accession numbers as well, for future use.


Part 2: BLAST Searches

Time to go back to Darwin 2000 for another tutorial:

http://www.rickhershberger.com/darwin2000/

This time select the module "BLAST" and work through the "Guided Activity". By starting with one of your selected sequences, BLAST allows you to perform "one against all" homology/alignment searches against the available databases. After reviewing this tutorial, either bookmark this site or keep a window open, so you can shuttle back for help.

A: Finding homologous protein sequences by using BLAST.

1. The shuttle to return to NCBI departs here:

http://www.ncbi.nlm.nih.gov/

If you are at the NCBI home page, click on the BLAST button. If you are at the Entrez home page, click on BLAST on the left-hand menu panel. On some formats of results pages, BLAST can be accessed by clicking on it at the top of the page. Read the overview. If you are interested in the statistical tools used, their levels of reliability, and more about running BLAST searches, check out the "course" and "tutorial" links on the left-hand menu. For help while in BLAST, just click on any linked label to get an explanation.

2. Select blastp on the first menu page. On the next page, use defaults for your first try. For the database window, leave it at "nr" [non-redundant], as this will give you the broadest search of linked databases without duplication of individual sequences. Go to your log and select the FASTA formatted sequence of a protein of interest and copy it. Go back to the BLAST page. Click the cursor inside the top window, then paste. [Your FASTA sequence should still be in the buffer. If not, go to your log and copy/paste it over.] Be sure you eliminate any spaces to the left of the ">gi|xxx" line and any spaces to the left of the sequence lines. Do not touch the right side of any lines.
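If your log is a text file and you would rather not hunt for stray spaces by hand, the clean-up described above can be sketched as follows. The function name `strip_left_whitespace` is invented for this example. It removes only what sits to the left of each line and leaves the right side alone.

```python
def strip_left_whitespace(fasta_text):
    """Remove spaces and tabs at the start of every line.

    The right-hand side of each line is left exactly as it was.
    """
    return "\n".join(line.lstrip() for line in fasta_text.splitlines())


if __name__ == "__main__":
    # Made-up record with accidental indentation from a copy/paste.
    pasted = "   >gi|0000000|example hypothetical record\n  MKTAYIAKQR\n QISFVKSHFS"
    print(strip_left_whitespace(pasted))
```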

3. Below the search window are other menu selections, such as one which says "pairwise" in the "Format" section next to "alignment view". For your first search, use the defaults. To retrieve your BLAST report, click "Format results". Browse the report. Note that the graph is active, meaning you can navigate the report by mousing-over the bars and clicking on them to get to the alignments. The window above the graph tells you the identity of each sequence retrieved. You can also scroll through the report. For details of interpreting the report, 1) click on the line above the graph, "Distribution of xx Blast Hits...", and/or 2) click on "FAQ". You can also toggle over to Darwin 2000, if you have kept a window open.

  • For long polyprotein sequences, % identity scores [often loosely called % homology] will be high, even with a reasonable number of non-matching residues and gaps, since calculations are based on the total number of residues. For single proteins and fragments, scores will often be lower, because each difference makes a greater numerical impact on the calculations. Expect, or E, values are useful. The E value estimates how many matches this good you would expect to find by chance in a database of this size. A value of 0.0 indicates extremely good homology or an exact match. Matches with values above the default cutoff of 10 are no better than random chance and go unreported. Values in between give a sense of the relative significance of the match.
  • For a pairwise Blast-p, the middle line between query and subject gives a letter if an identical match, a + if similar in character, or a blank if no match. [In Blast-n, vertical lines and blanks are used between the query and subject lines.] In both cases, hyphens in the query or subject lines indicate gaps.
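To see why the same number of differences weighs more heavily on a short fragment than on a long polyprotein, here is a small sketch. It assumes two sequences that are already aligned [same length, hyphens for gaps], and the function names are invented for the example. The middle line it builds shows identical residues only. The "+" for similar residues needs a substitution matrix, which is left out here.

```python
def match_line(query, subject):
    """Return the middle line of a pairwise alignment: the letter where
    the two residues are identical, a blank otherwise (gaps never match)."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    return "".join(
        q if q == s and q != "-" else " " for q, s in zip(query, subject)
    )


def percent_identity(query, subject):
    """Identical positions as a percentage of the alignment length."""
    identical = sum(1 for c in match_line(query, subject) if c != " ")
    return 100.0 * identical / len(query)


if __name__ == "__main__":
    # Five mismatches in 500 aligned positions versus five in 25.
    print(percent_identity("A" * 500, "A" * 495 + "G" * 5))
    print(percent_identity("A" * 25, "A" * 20 + "G" * 5))
```

The toy sequences are artificial, of course, but the arithmetic is the same as for a real report: five differences barely dent a 500-residue alignment and take a large bite out of a 25-residue one.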

4. Go back to the BLAST query page. Change from pairwise in the format section below to flat query anchored and browse the report. Compare the results to your first report.
 

5. For another search, go back to the BLAST homepage, scroll to the bottom, and select the JavaScript-free BLAST link. [This will allow you more choice in database selection.] Select "blastp" again. On the query page, set database to "pdb". [This will allow you to limit your search to proteins for which structural analysis is complete and which you could view in Chime, RasMol, or Cn3D.] In the search window enter a PDB accession number. Run BLAST either in "pairwise" or "flat query-anchored". You might just want to try "query-anchored" to compare the difference in the report appearance.


B: Finding homologous nucleotide sequences by using BLAST.

This does not need to be done now, except by those eager to try it. This just seemed the obvious place to put the instructions.

1. Go to either NCBI or Entrez home page, select GenBank, and enter your choice of search terms. Retrieve your target sequence in FASTA format.

2. Go to BLAST. Select blastn and leave database "nr". Paste your sequence and run your choice of "pairwise", "query-anchored", and "flat query-anchored" searches. Compare the similarities and differences found between running "blastp" and "blastn". Note the informational advantages of each.

C: Optional.

Explore the uses of PSI-BLAST and blastx. Find out what they do and how they can be used. Depending on your project, these may be of some benefit.

Part 3: Running multiple sequence alignments [MSA]

All aboard!

http://www.rickhershberger.com/darwin2000/

Back to Darwin 2000, this time to learn about ClustalW, the application which takes a group of selected sequences and makes a multiple sequence alignment. Click on the module "Multiple sequence alignments" and work through the guided activity to learn what this can do for you. Warning: Some patience is required. Pay attention to details, or you may be frustrated by the lack of results.

1. By this time you should have one or more sets of related proteins stored in your log in FASTA format. If so, jump to 2 below. If not, go back to Part 1, section B and follow directions on retrieving a set of related proteins.

2. In your log, you need to remove the spaces on the line preceding ">gi|xxx..." for each entry. [Don't disturb the sequence lines.] This is necessary when running MSA, because any extra spaces will terminate the alignment for all entries beyond those spaces.

3. Go to:

http://www2.ebi.ac.uk/clustalw/ or http://www.ebi.ac.uk/clustalw/

Explore the site. You can read about the windows by clicking on them.

4. Paste your grouped FASTA sequences into the text box. For your first run, use the defaults. The alignments will take a few minutes. You may want to enter your e-mail to retrieve the report. If your run fails, the first things to check are the FASTA format and the left-hand spaces. If a run seems to take too long, try "off-hours", keeping in mind that this is a European site, or try making your alignment request smaller. You can do this by selecting only 4-6 sequences. Alternatively, you may want to focus on just one region or domain of your sequences. In that case, you can select portions of the FASTA reports.
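Making a request smaller can also be done outside the browser. The sketch below assumes your sequences are already held as (header, sequence) pairs. Both function names are invented here, and `start` and `end` are ordinary Python slice positions counted from 0, not residue numbers from a database report.

```python
def trim_alignment_input(records, max_sequences=6, start=None, end=None):
    """Keep the first few sequences and, optionally, only one region of each."""
    return [(header, seq[start:end]) for header, seq in records[:max_sequences]]


def to_fasta(records, width=60):
    """Write (header, sequence) pairs back out as FASTA text."""
    lines = []
    for header, seq in records:
        lines.append(">" + header)
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Made-up sequences for illustration only.
    group = [("a", "ABCDEFGH"), ("b", "IJKLMNOP"), ("c", "QRSTUVWX")]
    print(to_fasta(trim_alignment_input(group, max_sequences=2, start=2, end=6)))
```

The output can be pasted straight into the text box. Note in your log which region you kept, so you can match the alignment back to the full sequences later.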

5. Once you have the report, browse to see what you have. Click on Jalview for a graphical display. Wait for the calculations and color assignment to be complete before trying to navigate. For your convenience, consensus notations and colors used in Jalview are assigned as follows:

Consensus line notations:
* = identical or conserved residues in all sequences in the alignment
: = indicates conserved substitutions
. = indicates semi-conserved substitutions.

Color | Characteristics | Amino acids
red | small & hydrophobic R groups | AVFPMILW
blue | acidic | DE
magenta | basic | RHK
green | hydroxyl + X | STYHCNGQ
gray | other |

Symbols for amino acids

Compare the results given here to the BLAST results. If you used a group of sequences containing the query sequence you used for a BLAST search, you should see a similar alignment.
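If you want to check a consensus line by hand, the "*" notation is easy to reproduce. The sketch below assumes the sequences are already aligned to the same length with hyphens for gaps, and it marks only the fully identical columns. The ":" and "." notations depend on ClustalW's own substitution groups, which are not reproduced here. The function name `star_line` is invented for the example.

```python
def star_line(aligned):
    """Return a consensus line with "*" under every column in which all
    sequences carry the same residue, and a blank everywhere else."""
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    marks = []
    for column in zip(*aligned):
        residues = set(column)
        identical = len(residues) == 1 and "-" not in residues
        marks.append("*" if identical else " ")
    return "".join(marks)


if __name__ == "__main__":
    # Made-up alignment for illustration only.
    alignment = ["MKT-A", "MRTQA", "MKTQA"]
    for seq in alignment:
        print(seq)
    print(star_line(alignment))
```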

 

6. Try repeating the MSA after changing selected defaults. [It is good to play with a limited number of sequences when doing this, due to the computing load it creates. Four sequences of reasonable length should give you enough to see the effects.] How are the reports affected? Record your results for future reference.

Note: You can also run MSAs on nucleotide sequences.


Summary Questions:

Try to limit your answers to two typed pages [12 pt font]. Examples may be given in 10 pt font. You need not retype the questions.

1. Give an example of a successful basic search you did. Give the following information:

a. Word or word string used
b. Give examples from your log of different types of reports you retrieved
c. Give one example of a sequence in FASTA format

2. Briefly summarize your search for a group of related proteins. Give the following information:

a. Word or word string used

b. Compare two methods used to find related sequences. How many sequences were in your group(s)?

c. How did you use these sequences after you saved them?

3. Summarize one of your BLAST search results. Give the following information:

a. What was your query sequence, which you entered in FASTA or as a PDB number?

b. How many closely related sequences did you retrieve? What was the range of "E" scores? What does this mean?

c. What other matches did the Blast search find? What is the significance of these matches? [Depending on your search, this may or may not apply.]

d. What information did you get from the query-anchored report, which you didn't see in the first report?

e. Which style of BLAST report did you prefer? Why?

4. Summarize one of your MSA results. Give the following information:

a. What group of proteins did you select?

b. What regions are highly conserved? What regions are not? [Consider the structure and function of these proteins.]

c. If you were to compare the nucleotide sequences of these same proteins, would the conserved regions of the nucleotides be as conserved as the conserved regions of the proteins? Why or why not?

5. What is a supergene family? The immunoglobulin supergene family is an example. Answer one of the following questions:

a. How would you demonstrate that β-2-microglobulin is related to immunoglobulins even though they serve different functions?

b. How would you demonstrate that TCRs and immunoglobulins are homologues and not paralogues?

[You need not actually do the problem. The idea here is to explore how you would approach solving the problem using bioinformatics applications.]

 

Interactive discussion forum:

Make one original posting to "Search Strategies", or respond to an existing posting. Check back and read new postings. They may contain information or answers which you'll find useful.



 Updated 8/27/04 by thatcher@sonoma.edu