Computer Exercise 3:
Searching for Protein Sequences and Beyond
Discussion Section 1: 9/28
Discussion Section 2: 9/30
Objectives:
1. Gain an understanding of the basics of
bioinformatics.
- Explore some key sites containing databases and
application software.
- Become familiar with some of the uses of
bioinformatics.
- Appreciate the dynamic nature of this rapidly growing
field.
2. Become familiar with some uses of molecular
databases.
- Be able to query databases and search for molecular
sequences.
- Be able to find homologous sequences to a known
sequence using BLAST.
- Be able to align multiple sequences using
ClustalW.
- Know how to evaluate the quality of information
obtained and analyze the results.
3. By using these tools, gain a better understanding of
some of the key proteins involved in immune responses. This
may include some or all of the following:
- Know the range of model species used and appreciate
the depth [and sometimes lack thereof] of
data.
- Have a better understanding of sequence motifs,
molecular behavior, and molecular structure.
- Have an increased understanding of the phylogeny of
some of the proteins examined.
Introduction:
In this exercise, you will get a taste of searching for
specific protein sequences and then using them in some
applications intended to increase your appreciation of
molecular immunology. You will be able to search for related
sequences and to compare selected sequences to each other.
After playing with them some, you may wish to try your hand
at asking some basic questions and looking for ways to
answer them. For example, your questions may focus on the
structural aspects of receptor-ligand interactions of
antigens and antibodies or T cell receptors. Or perhaps you
may explore the phylogeny of the immunoglobulin supergene
family. Or you may want to look for conservation in
transmembrane motifs. Curiosity and imagination are great
tools to use along with what you'll learn here. [This is
basically an advertisement for the next exercise.]
Before you can begin searching for answers to your
questions, you need to learn how to search for the necessary
data and information. Once that is done, then the analysis
and interpretation can begin. Both parts are important and
require learning specific strategies and applications. The
cool thing is that now you can access a wide variety of
tools on-line for free, including support in learning how to
use the tools.
For those of you with experience in doing nucleotide
sequence searches, and for those of you wishing to do so, be
patient. You will have an opportunity to expand when
designing your project. Based on feedback from last year,
and on the demands of trying to give clear direction,
limiting the scope seems to be appropriate. Since we are
looking at molecular structure of immunoglobulins before
examining the underlying gene sequences, it makes sense to
start with proteins.
Time management: First of all,
don't try to do this in one sitting. Breaks between small
chunks will actually help the learning process and cut the
frustration down to a reasonable level. The more you play
with different sections, the more they begin to fit
together. The more they fit together, the more understanding
you will gain. Don't expect to "get it" after one round of
search and analysis. [I've been poking around for seven
years, and I'm still somewhere on the left-hand side of the
learning curve.]
Reminder: Using a log
can help you backtrack to things you'd like to check out
later. Saving sequences, along with pertinent accession
numbers and identifiers, will become extremely useful,
especially when trying to use multiple analytical
applications on the same data set.
There are summary questions at the end of this
section. Read them through before you start browsing. You
can answer these as you go, or answer them after browsing
the following sites. Points = 10. Due in two weeks
[10/8 & 10/10].
The last thing to do is to go to the Fora on WebCT. See
the end of the exercise for topic focus.
Key sites used in this
exercise:
1. Tutorial lessons: For a pre-lab
activity, browse the following two tutorial sites to see
what they have to offer.
a. Try using the Darwin 2000 modules. If
you haven't tried these tutorials yet, DO. You'll get a
better idea of what you are doing and seeing and why. If
you have done them in the past, check back through them
for practice and specific pointers.
http://www.rickhershberger.com/darwin2000/
Explore the following sub-sites:
- Darwin 2000- for an overview of evolution &
computing.
- Finding sequences- for an introduction and guided
tour of NCBI search strategies and understanding some
details of the results.
- BLAST- "one against all" homology/alignment
searches: an introduction and basic tour.
- Multiple Sequence Alignment- "many against each
other" homology/alignment searches: an introduction
and basic tour.
b. Explore the tutorials at NCBI.
Besides the basic introductory lessons, browse the
"Coffee Break" tutorials to see what is there. [If
you would like a guided activity to follow, jump to
Virology's Pre-lab
activity.]
http://www.ncbi.nlm.nih.gov/Education
2. Searching for specific sequences and performing
BLAST analyses.
http://www.ncbi.nlm.nih.gov
Sub-sites of interest: [there are others, as
well]
- BLAST
- Entrez- nucleotides, proteins, etc.
- Structure [you've been here already]
3. Performing multiple sequence analyses
[MSA].
http://www2.ebi.ac.uk/clustalw/
http://www.ebi.ac.uk/clustalw/ [alternative site if the
above site can't be accessed]
Exercise:
Part 1: Basic Search
Strategies
For warm-up, especially if you haven't done
searches before, go to Darwin 2000 and use the module
"Finding Sequences":
http://www.rickhershberger.com/darwin2000/
This exercise will guide you through finding a gene
sequence in GenBank, which is fine because the principles
are the same as for finding a protein sequence and later you
can look for nucleotide sequences in GenBank and Entrez.
Work through the "Guided Activity" to get a feel for
searching and to get descriptions of the various parts of
the reports, saving sequences, and so forth. Bookmark this
site, because you will probably want to refer to it when
working on your target proteins.
A. Finding an "immuno" protein sequence or
two.
1. Go to NCBI:
http://www.ncbi.nlm.nih.gov/
Either select Proteins in the pulldown menu window
on the left, or select Entrez and then Proteins
on the top menu bar. Enter one of the following keywords or
phrases:
- immunoglobulin
- Fab
- T cell receptor
- MHC
You can refine your search if you get too many hits. For
example, you can narrow down to a given species or to a
single chain of a multi-chain protein. You can add to your
word string or go to "Limits" under your search window for
advanced search options. Use the "All Fields" pulldown menu
to specify a field. The Boolean operators AND, OR, and NOT
must be in upper case. Help on using limits is
available.
2. To retrieve a single protein, click on its
accession number highlighted in blue. To retrieve
several proteins at once, select them by checking the boxes
to the left. Then choose which type of display you want,
such as GenPept, then click "Display". Review the
type of information available. Try some of the other display
formats & links. Use your log to save an example of each
kind of report, along with a label and explanatory comments
as to source, Web page type [NCBI search], and
anything else you deem useful for future use.
3. Try displaying your selected proteins in
FASTA format by selecting it in the display menu. For
saving FASTA reports to your log, copy the entire report,
beginning at ">gi|xxx ...." and ending at the end of the
sequence. Paste this in your log as a backup and for future
use. [What, you didn't open a log
yet?? You can do it now.] After pasting, click back to
your Web page. Congratulations! You now have a sequence or
sequences, which you can use in the next section.
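If you like to see what is inside a saved FASTA report, here is a minimal Python sketch that splits pasted FASTA text into (header, sequence) pairs and writes it back out cleanly. The names `parse_fasta` and `format_fasta` are hypothetical helpers, and the two sample records are made up for illustration; they are not real database entries. Spaces at either end of each line are stripped, since stray spaces to the left of a line can cause trouble when the text is pasted into other applications.

```python
def parse_fasta(text):
    """Split FASTA-formatted text into a list of (header, sequence) pairs."""
    records = []
    header = None
    chunks = []
    for raw_line in text.splitlines():
        line = raw_line.strip()          # drop stray spaces on both sides
        if not line:
            continue                     # skip blank lines
        if line.startswith(">"):
            if header is not None:       # close the previous record
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)          # sequence lines are joined together
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def format_fasta(records, width=60):
    """Write (header, sequence) pairs back out as clean FASTA text."""
    lines = []
    for header, seq in records:
        lines.append(">" + header)
        for i in range(0, len(seq), width):
            lines.append(seq[i:i + width])
    return "\n".join(lines) + "\n"


# Made-up sample text, including the kind of stray left-hand spaces
# that creep in when copying from a Web page.
sample = """
  >gi|0000000|example heavy chain fragment (made up)
  EVQLVESGGGLVQPGG
  SLRLSCAAS
>gi|0000001|example light chain fragment (made up)
DIQMTQSPSSLSASVG
"""

records = parse_fasta(sample)
for header, seq in records:
    print(header, len(seq))
```

Running `format_fasta(records)` on the parsed list gives text with no left-hand spaces, which is the form the later sections expect.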
Note: You can also save search results and FASTA reports
to the on-line clipboard. Beware, however, that there is a
time limit for material remaining on the clipboard.
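The search refinements described in step 1 above (adding words, specifying a field, combining terms with upper-case Boolean operators) can be sketched as a small string builder. This is an illustration only: `build_query` is a hypothetical helper, not an NCBI tool, and the bracketed field tag `[Organism]` follows the Entrez convention of appending a field name to a term. Check the Entrez help pages for the fields actually available in the protein database.

```python
def build_query(terms, operator="AND"):
    """Join (text, field) pairs into one Entrez-style query string.

    terms    -- list of (text, field) tuples; field is None for "All Fields"
    operator -- Boolean operator; it is forced to upper case
    """
    op = operator.upper()
    if op not in ("AND", "OR", "NOT"):
        raise ValueError("operator must be AND, OR, or NOT")
    parts = []
    for text, field in terms:
        # Quote multi-word phrases so they are searched as a unit.
        phrase = f'"{text}"' if " " in text else text
        parts.append(f"{phrase}[{field}]" if field else phrase)
    return f" {op} ".join(parts)


# Narrow a keyword from the list above to a single species.
query = build_query([("T cell receptor", None), ("Homo sapiens", "Organism")])
print(query)
```

The resulting string can be pasted into the search window as is.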
B. Finding a group of related proteins.
This can be done in a variety of ways, in different
locations. A brief introduction to some of these is given
here.
1. Scan the right side of a page of results from
your search in A above. Note that some entries
have Related
Sequences in blue type. Click on one to see
what you get back. Another way is to check if Protein
Neighbors shows in the Display menu for your
selected protein(s). Selecting it will give you a list as
well. Be sure to check the boxes before clicking on
"Display".
2. You can find grouped sets of sequences by going
to Popset at Entrez. Try a couple of your successful
search terms used above. If you strike out, try modifying
your word string. This is still a fairly new site and the
data set of groups is still somewhat limited. However, there
is more than enough here to play with and to get a good idea
how multiple sequence alignments [MSA] work.
Note that you will retrieve nucleotide sequence sets here.
You can work with these or if you want to look at the
proteins, you can select "Protein" on the right side. This
will take you to a summary list of protein sequences as you
had in A above.
3. If you want a selected set of related
structures, the easiest way is to start a search under
"Structure" at Entrez. On the Structure Summary page of a
given entry, look for a listing Structure
Neighbors. Click on one, and you may be able to
retrieve a list of proteins with related structures. Entries
which are relatively new [under a year] often are
still in a queue to be calculated into retrievable lists.
Therefore, look for "old entries".
The above strategies have their uses, but as is
they are not that interesting, with the exception of the
Popset groups. These groups are displayed in aligned blocks,
which give you a fair amount of information to start
digesting. For further analysis following any of these
search strategies, sequences of interest need to be
retrieved and saved in FASTA format. If you are interested
in structure, be sure to save MMDB and PDB accession
numbers, as well, for future use.
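Since the same sequences get reused across several applications, it can help to keep their identifiers in a structured log. The sketch below assumes header lines in the pipe-delimited style seen in FASTA reports (">gi|xxx ..."), where a database tag is followed by its identifier. Real headers vary, so treat the pairing rule as an assumption; `parse_header` is a hypothetical helper and the sample header is made up.

```python
def parse_header(header):
    """Break a pipe-delimited FASTA header into identifiers and a description."""
    header = header.lstrip(">").strip()
    id_part, _, description = header.partition(" ")
    tokens = id_part.split("|")
    ids = {}
    # Pair up tokens: (database tag, identifier), (database tag, identifier), ...
    for i in range(0, len(tokens) - 1, 2):
        ids[tokens[i]] = tokens[i + 1]
    # An odd token left over at the end (e.g. a chain letter or entry name)
    # is kept separately rather than guessed at.
    extra = tokens[-1] if len(tokens) % 2 == 1 else ""
    return {"ids": ids, "extra": extra, "description": description.strip()}


# Made-up header for illustration; "0XXX" is not a real PDB entry.
entry = parse_header(">gi|0000000|pdb|0XXX|A Example Fab fragment (made up)")

# A log entry might add where the record came from and a comment.
log_entry = dict(entry, source="NCBI search", note="saved for MSA later")
print(log_entry["ids"], log_entry["extra"], log_entry["description"])
```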
Part 2: BLAST Searches
Time to go back to Darwin 2000 for another tutorial:
http://www.rickhershberger.com/darwin2000/
This time select the module "BLAST" and work through the
"Guided Activity". By starting with one of your selected
sequences, BLAST allows you to perform "one against all"
homology/alignment searches against the available databases.
After reviewing this tutorial, either bookmark this site or
keep a window open, so you can shuttle back for help.
A: Finding homologous protein sequences by
using BLAST.
1. The shuttle to return to NCBI departs here:
http://www.ncbi.nlm.nih.gov/
If you are at the NCBI home page, click on the BLAST
button. If you are at the Entrez home page, click on BLAST
on the left-hand menu panel. On some formats of results
pages, BLAST can be accessed by clicking on it at the top of
the page. Read the overview, and check out the "course" and
"tutorial" links on the left-hand menu if you are interested
in the statistical tools used, their levels of reliability,
and more about running BLAST searches. For help while in
BLAST, just click on any linked label to get an
explanation.
2. Select blastp on the first menu
page. On the next page, use defaults for your first try. For
database window, leave it "nr" [non-redundant] as
this will give you the broadest search of linked databases
without duplication of individual sequences. Go to your log
and select the FASTA formatted sequence of a protein
of interest and copy it. Go back to the BLAST page.
Click the cursor inside the top window, then paste.
[Your FASTA sequence should still be in the buffer. If
not, go to your log and copy/paste it over.] Be sure you
eliminate any spaces to the left of the ">gi|xxx"
line and any spaces to the left of the sequence lines. Do
not touch the right side of any lines.
3. Below the search window are other menu
selections, such as one which says "pairwise" in the
"Format" section next to "alignment view". For your first
search, use the defaults. To retrieve your BLAST
report, click "Format results". Browse the report. Note that
the graph is active, meaning you can navigate the
report by mousing-over the bars and clicking on them to get
to the alignments. The window above the graph tells you the
identity of each sequence retrieved. You can also scroll
through the report. For details of interpreting the report,
1) click on the line above the graph, "Distribution of xx
Blast Hits...", and/or 2) click on "FAQ". You can also
toggle over to Darwin 2000, if you have kept a window
open.
- For long polyprotein sequences, % homology
scores will be high, even with a reasonable number of
non-match residues and gaps, since calculations are based
on the total number of residues. For single proteins and
fragments, homology scores will often be lower, because
each difference will make a greater numerical impact on
the calculations. Expect or E values are
useful. An E value estimates how many matches this good
would be expected by chance alone in a database of this
size, so smaller is better. A value of 0.0 indicates
extremely good homology or an exact match. A value of 10
or more is what random chance would produce, and such
hits go unreported under the default settings. Values
in between give a sense of the relative significance of
the match.
- For a pairwise Blast-p, the middle line
between query and subject gives a letter if an identical
match, a + if similar in character, or a blank if no
match. [In Blast-n, vertical lines and blanks
are used between the query and subject lines.] In
both cases, hyphens in the query or subject lines
indicate gaps.
4. Go back to the BLAST query page. Change from
pairwise in the format section below to
flat query anchored and browse the report.
Compare the results to your first report.
5. For another search go back to the BLAST
homepage, scroll to the bottom and select the JavaScript
free BLAST link. [This will allow you more choice
in database selection.] Select "blastp" again. On
the query page, set database to "pdb". [This will allow
you to limit your search for proteins for which structural
analysis is complete and which you could view in Chime,
RasMol, or CN3D.] In the search window enter a PDB
accession number. Run BLAST either in "pairwise" or "flat
query-anchored". You might just want to try "query anchored"
to compare the difference in the report appearance.
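The pairwise notation described in step 3 above (a letter for an identical match, a + for a similar residue, a blank for no match, hyphens for gaps) can be tallied directly. The sketch below counts identities, positives, and gaps from the three lines of one alignment block and expresses identity as a percentage of the alignment length. The three strings are an invented example, not real BLAST output, and `alignment_stats` is a hypothetical helper.

```python
def alignment_stats(query, midline, subject):
    """Count identities, positives, and gaps in one pairwise alignment block."""
    if not (len(query) == len(midline) == len(subject)):
        raise ValueError("all three lines must be the same length")
    length = len(query)
    # A letter on the middle line marks an identical residue.
    identities = sum(1 for m in midline if m.isalpha())
    # A "+" marks a similar residue; positives include identities.
    positives = identities + midline.count("+")
    # A hyphen in either sequence line marks a gap.
    gaps = sum(1 for q, s in zip(query, subject) if q == "-" or s == "-")
    return {
        "length": length,
        "identities": identities,
        "positives": positives,
        "gaps": gaps,
        "percent_identity": 100.0 * identities / length,
    }


# Invented alignment block for illustration.
query   = "MKTAYIAK-QRQ"
midline = "MK+AYIA+ +RQ"
subject = "MKSAYIARLERQ"

stats = alignment_stats(query, midline, subject)
print(stats)
```

Because the percentage is taken over the alignment length, a single mismatch costs more in a short alignment than in a long one, which is the effect noted for fragments versus long polyproteins.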
B: Finding homologous nucleotide sequences by using
BLAST.
This does not need to be done now, except by those eager
to try it. This just seemed the obvious place to put the
instructions.
1. Go to either NCBI or Entrez home page, select
GenBank, and enter your choice of search terms.
Retrieve your target sequence in FASTA format.
2. Go to BLAST. Select blastn and leave
database "nr". Paste your sequence and run your choice of
"pairwise", "query-anchored", and "flat query-anchored"
searches. Compare the similarities and differences found
between running "blastp" and "blastn". Note the
informational advantages of each.
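One thing to keep in mind when comparing protein and nucleotide searches: several codons encode the same amino acid, so two nucleotide sequences can differ while their protein translations are identical. The sketch below uses the standard genetic code to show this. The two short DNA strings are invented for illustration, and `translate` is a hypothetical helper, not a BLAST function.

```python
from itertools import product

# Standard genetic code, listed in TCAG order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA string in its first reading frame ('*' marks a stop)."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3       # ignore an incomplete last codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


# Two invented sequences that differ at several nucleotide positions.
dna_a = "ATGGCTCTGAAA"
dna_b = "ATGGCCTTAAAG"

mismatches = sum(1 for a, b in zip(dna_a, dna_b) if a != b)
print(translate(dna_a), translate(dna_b), mismatches)
```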
C: Optional.
Explore the uses of Psi-BLAST and BLAST-x.
Find out what they do and how they can be used. Depending on
your project, these may be of some benefit.
Part 3: Running multiple sequence
alignments [MSA]
All aboard!
http://www.rickhershberger.com/darwin2000/
Back to Darwin 2000, this time to learn about ClustalW,
the application which takes a group of selected sequences
and makes a multiple sequence alignment. Click on the module
"Multiple sequence alignments" and work through the guided
activity to learn what this can do for you. Warning: Some
patience is required. Pay attention to details, or you may
be frustrated by the lack of results.
1. By this time you should have one or more sets
of related proteins stored in your log in FASTA format. If
so, jump to 2 below. If not, go back to Part
1, section B and follow directions on retrieving a set of
related proteins.
2. In your log, you need to remove any spaces
to the left of ">gi|xxx..." on the header line of each
entry. [Don't disturb the sequence lines.] This is
necessary when running MSA, because any extra spaces will
terminate the alignment for all entries beyond those
spaces.
3. Go to:
http://www2.ebi.ac.uk/clustalw/
or http://www.ebi.ac.uk/clustalw/
Explore the site. You can read about the windows by
clicking on them.
4. Paste your grouped FASTA sequences into the
text box. For your first run, use the defaults. The
alignments will take a few minutes. You may want to enter
your e-mail to retrieve the report. If your run fails, the
first things to check are the FASTA format and the left-hand
spaces. If a run seems to take too long, try "off-hours",
keeping in mind that this is a European site, or try making
your alignment request smaller. You can do this by selecting
only 4-6 sequences. Alternatively, you may want to focus on
just one region or domain of your sequences. In that case,
you can select portions of the FASTA reports.
5. Once you have the report, browse to see what
you have. Click on Jalview for a graphical display.
Wait for the calculations and color assignment to be
complete before trying to navigate. For your convenience,
consensus notations and colors used in Jalview are assigned
as follows:
Consensus line notations:
* = identical or conserved residues in all sequences in the alignment
: = indicates conserved substitutions
. = indicates semi-conserved substitutions
Characteristics | Amino acids
red: small & hydrophobic R groups | AVFPMILW
blue: acidic | DE
magenta: basic | RHK
green: hydroxyl + X | STYHCNGQ
gray: other |
Symbols for amino acids
Compare the results given here to the BLAST results. If
you used a group of sequences containing the query sequence
you used for a BLAST search, you should see a similar
alignment.
6. Try repeating the MSA after changing selected
defaults. [It is good to play with a limited number of
sequences when doing this, due to the computing load it
creates. Four sequences of reasonable length should give you
enough to see the effects.] How are the reports
affected? Record your results for future reference.
Note: You can also run MSAs on nucleotide sequences.
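To make the consensus notation from step 5 concrete, the sketch below takes a set of already-aligned sequences and marks with "*" each column in which every sequence carries the same residue. It does not reproduce the ":" and "." marks, which depend on the substitution groups ClustalW uses, and it is not ClustalW itself. The aligned sequences are invented, and `identity_line` is a hypothetical helper.

```python
def identity_line(aligned):
    """Return a line with '*' under columns identical in all sequences."""
    if not aligned:
        raise ValueError("need at least one sequence")
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must all be the same length")
    marks = []
    for column in zip(*aligned):           # walk the alignment column by column
        first = column[0]
        same = first != "-" and all(res == first for res in column)
        marks.append("*" if same else " ")
    return "".join(marks)


# Invented aligned sequences; hyphens are gaps.
aligned = [
    "MKTAYIAK-QRQ",
    "MKSAYIARLERQ",
    "MRTAYLAKLQRQ",
]

for seq in aligned:
    print(seq)
print(identity_line(aligned))
```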
Summary Questions:
Try to limit your answers to two typed pages [12 pt
font]. Examples may be given in 10 pt font. You need not
retype the questions.
1. Give an example of a successful basic search
you did. Give the following information:
a. Word or word string used
b. Give examples from your log of different types
of reports you retrieved
c. Give one example of a sequence in FASTA
format
2. Briefly summarize your search for a group of
related proteins. Give the following information:
a. Word or word string used
b. Compare two methods used to find related
sequences. How many sequences were in your group(s)?
c. How did you use these sequences after you
saved them?
3. Summarize one of your BLAST search results.
Give the following information:
a. What was your query sequence, which
you entered in FASTA or as a PDB number?
b. How many closely related sequences did you
retrieve? What was the range of "E" scores? What does
this mean?
c. What other matches did the Blast search
find? What is the significance of these matches?
[Depending on your search, this may or may not
apply.]
d. What information did you get from the
query-anchored report, which you didn't see in the first
report?
e. Which style of BLAST report did you prefer?
Why?
4. Summarize one of your MSA results. Give the
following information:
a. What group of proteins did you select?
b. What regions are highly conserved? What
regions are not? [Consider the structure and function
of these proteins.]
c. If you were to compare the nucleotide
sequences of these same proteins, would the conserved
regions of the nucleotides be as conserved as the
conserved regions of the proteins? Why or why not?
5. What is a supergene family? The immunoglobulin
supergene family is an example. Answer one of the
following questions:
a. How would you demonstrate β-2-microglobulin
is related to immunoglobulins even though they serve
different functions?
b. How would you demonstrate that TCRs and
immunoglobulins are homologues and not paralogues?
[You need not actually do the problem. The idea here
is to explore how you would approach solving the problem
using bioinformatics applications.]
Interactive discussion forum:
Make one original posting to "Search Strategies", or
respond to an existing posting. Check back and read new
postings. They may contain information or answers which
you'll find useful.