Computer Exercise 2: Bioinformatics Searching
Finding sequences and working with them
Stev 2055 PC lab
Introduction:
What does evolution have to do with learning to search
for sequences? What can playing with sequences teach us
about evolution? One answer to the first question might be
that since you are "more evolved" than any other primate and
"technological evolution" has taken you to the point of
being able to use a computer, searching for viral sequences
is a good way to occupy your time while learning something
about viruses. I suppose there are other answers as
well.
The possible answers to the second question are quite
varied. As time goes on, the list of answers should continue
to grow. A few examples should illustrate the point. By
comparing gene sequences or protein sequences, it is
possible to compare the degree of relatedness of the source
organisms. It is possible to build a phylogenetic tree based
on the differences and similarities found. In some cases, it
is possible to trace the divergence of a gene into a gene
family, represented by many different protein products. In
other cases, it is possible to trace the origin of an
individual gene found in one organism as being from a very
unrelated one. After doing these exercises, you will be able
to add to this list. But be prepared. For every question you
raise and attempt to answer, you will generate even more
questions. It is possible to get sucked in and lose focus as
you attempt to follow multiple leads at once.
Be prepared to spend some time on this exercise set. At
first it may seem overwhelming, but it gets better with
practice. Note questions in your log. Work on small chunks
at a time. Take breaks. For those with little or no
experience in bioinformatics, and for others wanting a
review or to learn new things about NCBI resources, do the
Pre-lab activity. The tutorials will give
you an overview of some of the components involved in
bioinformatics.
Reminder: Using a log can help you
backtrack to things you'd like to check out later. Saving
sequences, along with pertinent accession numbers and
identifiers, will become extremely useful, especially when
trying to use multiple analytical applications on the same
data set.
Collect the information requested below [marked with
asterisks (*)] and answer the summary questions at the end
of each part. Read them through before you start browsing.
You can answer these as you go, or answer them after
browsing the following sites. Points = 15. Due in two weeks
[3/5].
The last thing to do is to go to the Fora on WebCT. See
the end of the exercise for topic focus.
Key sites used in this exercise:
1. Tutorial lessons at NCBI:
http://www.ncbi.nlm.nih.gov/Education
2. Searching for taxonomic organization, specific
sequences, and performing BLAST analyses:
http://www.ncbi.nlm.nih.gov
Sub-sites of interest: [there are others, as
well]
- Taxonomy
- Entrez- nucleotides, proteins, etc.
- BLAST
3. Performing multiple sequence analyses
[MSA]:
http://www2.ebi.ac.uk/clustalw/ or
http://www.ebi.ac.uk/clustalw/
Part 0: Pre-Lab Activity- Tutorials
1. To access NCBI tutorials directly, go to
http://www.ncbi.nlm.nih.gov/Education.
a. To become familiar with the size and
layout of NCBI, click on the site map. For future
use, this site map will allow you to navigate to
different sections quickly, or to find something that you
saw at one point but can't seem to find again.
b. From the main Education page, you can
get an overview of bioinformatics and some key terms.
c. Next, try the tutorial
Nucleotides, then try BLAST.
Even if you have used BLAST before, this is very helpful.
There are also new features which can help in doing BLAST
searches more effectively.
d. To better understand BLAST statistics
and the results of a search, browse the statistics
tutorial, available from the left-hand menu bar.
[In future, you can also access both this tutorial
and the main BLAST tutorial from the BLAST
page.]
e. An interesting way to learn about different
aspects of bioinformatics and some specifics on
applications is to go to Coffee Break. This is a
series of short essays on a variety of topics with
specific tutorials embedded. For example, the newest
posting is 22 October 01, "Finding Fanconi- The hunt for
the cause of autosomal dominant renal Fanconi syndrome."
Click on Archive to access the list of
topics. Try out a few now. Come back for more later.
2. To learn more about bioinformatics in general, and
have fun doing it, try playing Origin:
Unknown:
http://www.nbif.org/
There are different modules and different levels of
difficulty. Let me know what you think about it.
Part 1: Finding a sequence (or two, or
more...)
Exercise:
A: Getting started
1. For starters, this will be by way of a detour,
to explore a cool site you should visit often as we make our
way through the viral taxa. Go to:
http://www.ncbi.nlm.nih.gov
2. Click on Taxonomy on the menu bar.
a. Click on Genetic Codes on the
left-hand bar. Here you can discover that AUG does
not always mean methionine, among other
things. [Don't linger long; you can return later for
a more thorough browse.]
b. Return to Taxonomy; then click on the
tree button when you get to the site; then
select viruses. [This is another site
you'll want to browse in more detail later.]
1) Page down through the taxon groups
[currently groups are listed in the order dsDNA,
dsRNA, retroids, ssDNA, ssRNA, unassigned] until
you reach ssRNA positive strand.
2) Select Picornaviridae; then select
Human Rhinovirus 16.
3) Retrieve protein sequences by selecting
protein, then retrieving.
3. Now you are in another part of the NCBI site.
You should have your list of 19 proteins. Click on
Q82122; then explore the different reports available.
In graphical, you will see the genomic RNA and the
location of the sequences for different protein products.
[Try the graphical view for some of the other sequences,
and you will find that not all provide the same amount of
information.]
*Use your log to save an example of each kind of report,
along with a label and explanatory comments as to source,
Web page type [NCBI search], and anything else
you deem useful for future use.
4. Select one of the 19 proteins, and go to
FASTA report.
*Copy the entire report, beginning at ">gi|xxx ...."
and ending at the end of the sequence. Paste this
in your log as a backup and for future use.
[What, you didn't open a log yet?? You can do it
now.] After pasting, click back to your Web page.
[You may also want to try NCBI's clipboard feature.
Just be aware that the clipboard is useful during an
active session, but will not save your selections.]
Congratulations! You now have a sequence, which
you can use in the next section. Now go to Part 2.
[Come back on your own to do B
below.]
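If you want to check that what you pasted into your log really is a complete FASTA report, the format is simple enough to read with a few lines of Python. This is only a minimal sketch: the function name `parse_fasta` and the example record are made up for illustration (the gi number and residues are not a real entry), and it assumes each record starts with a ">" header line followed by one or more sequence lines.

```python
def parse_fasta(text):
    """Split FASTA text into a list of (header, sequence) pairs."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence lines are joined into one string
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical record: the gi number and residues are invented for illustration.
example = """>gi|0000000|example protein [made-up record]
MGAQVSTQKS
GSHENQNILT
"""

for header, sequence in parse_fasta(example):
    print(header, len(sequence))
```

A record with a header but an empty sequence usually means the copy stopped short of the end of the report.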
B. For later, on your own.
1. Try entering descriptive terms, such as
"poliovirus", "rhinovirus serotype 3", etc. directly in the
search line of NCBI's home page, or in Entrez. Select
the type of search you want: GenBank, for nucleotides;
Proteins; Structure; and so on. You can refine your search
if you get hundreds or thousands of hits by adding specific
words to your search string. You can use the advanced search
field to further refine your search. Try using the same search
term to search for both nucleotides and proteins. Compare
your results in terms of items retrieved and how they relate
to each other.
2. Try saving some files of interest. Then use
them for other analyses, such as visualizing in
Chime/RasMol, if a Structure search was done; or run BLAST
or MSA.
*a) Use your log to save examples of the different types
of searches, along with useful commentary.
*b) Be sure to save at least one nucleotide sequence of
interest in FASTA format, for use in running a BLAST
search.
3. Check out the "newsy" information on the main
pages of Entrez, such as Nucleotides, Proteins, etc. There
are some interesting developments.
4. Find a group of related
proteins. This can be done in a variety of ways, in
different locations. A brief introduction to two approaches
is given here.
a. Scan the right side of a page of
results from one of your searches above. Note that some
entries have Related
Sequences in blue type. Click on one to
see what you get back. Another way is to check if
Protein Neighbors shows in the Display menu for
your selected protein(s). Selecting it will give you a
list as well. Be sure to check the boxes before clicking
on "Display".
b. You can find pre-grouped sets of sequences
by going to Popset at Entrez. Try a couple of your
successful search terms used above. If you strike out,
try modifying your word string. Since this is still a
fairly new site, the data set of groups is still somewhat
limited. However, there is more than enough here to play
with and to get a good idea how multiple sequence
alignments [MSA] work. [I readily found three
sets of influenza sequences.] Note that you will
retrieve nucleotide sequence sets here. You can work with
these or if you want to look at the proteins, you can
select Protein
on the right side. This will take you to a summary list
of protein sequences as you had in A
above.
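The search refinement described in B1 (adding words to narrow a search, and running the same term against both nucleotides and proteins) can also be written down as a query string. The sketch below assumes NCBI's E-utilities `esearch` endpoint and its `db`, `term`, and `retmax` parameters, none of which are part of this exercise. The helper name `build_search_url` is made up. The code only builds the URLs and does not contact NCBI.

```python
from urllib.parse import urlencode

# Assumed endpoint: NCBI E-utilities esearch (not used elsewhere in this exercise).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_search_url(database, *terms, max_hits=20):
    """Build a search URL; joining terms with AND narrows the result set."""
    query = " AND ".join(terms)
    params = {"db": database, "term": query, "retmax": max_hits}
    return ESEARCH_URL + "?" + urlencode(params)


# Same idea as typing one word, then adding a second word to cut down the hits.
broad = build_search_url("protein", "rhinovirus")
narrow = build_search_url("protein", "rhinovirus", "polyprotein")

# Same search string against the nucleotide database, for comparison.
nucleotide = build_search_url("nuccore", "rhinovirus", "polyprotein")

print(broad)
print(narrow)
print(nucleotide)
```

Keeping the exact strings you searched with in your log makes it much easier to repeat a search later.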
Part 1 Summary
Questions:
Try to limit your answers to one typed page [12 pt
font] for this part. Examples may be given in 10 pt
font. [You need not retype the questions as part of your
responses.]
1. Give an example of a successful basic search
you did. Give the following information:
a. Word or word string used
b. Give examples from your log of different types
of reports you retrieved
c. Give one example of a sequence in FASTA format
2. How did you use this sequence after you saved
it?
Part 2: Find homologous sequences by
using BLAST
BLAST is a search tool for finding sequences homologous
to a target sequence. The query is: "What, in the whole
crowd of database entries, matches this selected
sequence?" This is fundamentally different from the
approach taken in doing multiple sequence alignments
[MSA], which compare selected sequences against
each other. [See Part 3.]
Exercise:
A: Finding homologous protein sequences
by using BLAST.
1. The shuttle to return to NCBI departs here:
http://www.ncbi.nlm.nih.gov/
If you are at the NCBI home page, click on the BLAST
button. If you are at the Entrez home page, click on BLAST
on the left-hand menu panel. On some formats of results
pages, BLAST can be accessed by clicking on it at the top of
the page. Read the overview; and check out the tutorials if
you haven't seen them already. They introduce the different
types of BLAST searches and the statistical analytical tools
used.
2. Select blastp in the first menu
window. For database window, leave it nr
[non-redundant]. Go to your log and select
the FASTA formatted sequence of a protein of interest
and copy it. Go back to the BLAST page. Scroll down
to the large text window, click the cursor inside,
then click paste. [FASTA sequence should still be
in the buffer. If not, go to your log and copy/paste it
over.] Be sure you eliminate any spaces to the left
of the ">gi|xxx" line and any spaces to the left of
the sequence lines. Do not touch the right side of any
lines.
3. Below the text window are other menu
selections, such as one which says pairwise.
For your first search, use the defaults. To
retrieve your BLAST report, click "Format results".
Browse the report. Note that the graph is active,
meaning you can navigate the report by mousing-over the bars
and clicking on them to get to the alignments. The window
above the graph tells you the identity of each sequence
retrieved. You can also scroll through the report. For
details of interpreting the report, 1) click on the line
above the graph, "Distribution of xx Blast Hits...", and/or
2) click on "FAQ".
- For long polyprotein sequences, percent identity scores
will be high, even with a reasonable number of non-match
residues and gaps, since calculations are based on the
total number of residues. For single proteins and
fragments, identity scores will often be lower, because
each difference will make a greater numerical impact on
the calculations. "Expect" or "E" values
are useful: E estimates how many matches this good you
would expect to find by chance alone. A value of 0.0
indicates an extremely significant or exact match. A value
of 10 or more generally indicates random chance; with the
default settings such hits go unreported. Values in-between
give a sense of how significant the match is.
- For a pairwise Blast-p, the middle line between query
and subject gives a letter if an identical match, a + if
similar in character, or a blank if no match. [In
Blast-n, vertical lines and blanks are used between the
query and subject lines.] In both cases, hyphens in
the query or subject lines indicate gaps.
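To make the two notes above concrete, here is a minimal sketch that builds the middle line of a pairwise protein alignment and computes percent identity over the alignment length. It is an illustration, not BLAST's actual code. The set `SIMILAR_PAIRS` is a small hand-picked list for demonstration, whereas BLAST decides "similar" from a full scoring matrix. The function names and both sequences are invented.

```python
# Small illustrative set of residue pairs treated as "similar".
# This is an assumption for the example, NOT a complete scoring matrix.
SIMILAR_PAIRS = {frozenset(p) for p in ("IL", "IV", "LV", "LM", "DE", "KR", "ST", "FY")}


def protein_midline(query, subject):
    """Letter for an identical match, '+' for a similar pair, blank otherwise."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have equal length")
    marks = []
    for q, s in zip(query, subject):
        if q == "-" or s == "-":
            marks.append(" ")  # a hyphen marks a gap, so nothing is matched
        elif q == s:
            marks.append(q)
        elif frozenset((q, s)) in SIMILAR_PAIRS:
            marks.append("+")
        else:
            marks.append(" ")
    return "".join(marks)


def percent_identity(query, subject):
    """Identical positions as a percentage of the alignment length (gaps included)."""
    matches = sum(1 for q, s in zip(query, subject) if q == s and q != "-")
    return 100.0 * matches / len(query)


# Made-up aligned pair, with one gap in the query.
query = "MKVLDE-RT"
subject = "MRILEEAKT"

print(query)
print(protein_midline(query, subject))
print(subject)
print(round(percent_identity(query, subject), 1))
```

The same arithmetic shows why length matters: five mismatches lower the identity of a 2000-residue alignment by only 0.25 percentage points, but they lower a 50-residue alignment by 10 points.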
4. Go back to the BLAST query page. Change from
pairwise in the window below the text box to
flat query anchored. Run and browse the
report. Compare the results to your first report.
5. For another search go back to the BLAST query
page or open a new one. Select blastp again.
Set database to pdb. In the window above the
text box, change to accession number. [If you're
using the old page, first dump the FASTA sequence in the
text window.] In the text window, type in one of your
saved PDB accession numbers or use QJY1. [This is
one of the chains of QJY, the rhinovirus coat protein which
we used in the first exercise.] Run BLAST either in
pairwise or flat query-anchored. You might
just want to try query anchored to compare the
difference in the report appearance.
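The cleanup asked for in step 2 (no spaces to the left of the ">gi|xxx" line or the sequence lines, right side left alone) is easy to get wrong by hand when you have many entries. A minimal sketch of doing it in Python follows. The function name `strip_left_margin` and the pasted record are hypothetical. The same cleanup is needed before running MSA in Part 3.

```python
def strip_left_margin(fasta_text):
    """Remove leading spaces and tabs from every line; leave the right side untouched."""
    cleaned = [line.lstrip(" \t") for line in fasta_text.split("\n")]
    return "\n".join(cleaned)


# Made-up record as it might look after a copy/paste that picked up indentation.
pasted = "  >gi|0000000|example [made-up]\n   MGAQVSTQKS\n GSHENQNILT\n"

print(strip_left_margin(pasted))
```

Only the left margin is changed, so the header text and the residues come out exactly as they were retrieved.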
B: For later- Finding homologous nucleotide
sequences by using BLAST.
This does not need to be done now, except by those eager
to try it.
1. [If you already have a FASTA nucleotide
sequence in your log, skip to #2.] Go to either NCBI or
Entrez home page, select GenBank, and enter your choice of
search terms. [If you're drawing a blank and just want
something to try now, you can enter L24917 or
D00625.1 in the search line. L24917 is the human
rhinovirus 16 polyprotein gene for Q82122 protein; D00625.1
is from poliovirus 2.] Retrieve your target sequence in
FASTA format.
2. Go to BLAST. Select blastn and
leave database nr. Paste your sequence and run
your choice of pairwise, query-anchored, and
flat query-anchored searches. Compare the
similarities and differences found between running
blastp and blastn. Note the informational
advantages of each.
C: Optional.
Explore the uses of Psi-BLAST and BLAST-x.
Find out what they do and how they can be used. Depending on
your project, these may be of some benefit.
Part 2 Summary Question:
Try to limit your answer to one typed page [12 pt
font] for this part. [You need not retype the
questions as part of your responses.]
1. Summarize one of your protein BLAST search
results. Give the following information:
a. What was your query sequence, which
you entered in FASTA or as a PDB number?
b. How many closely related sequences did you
retrieve? What was the range of "E" scores? What does
this mean?
c. Did you retrieve any matches of sequences
outside of the virus family of your query sequence? What
is the significance of these matches? [Depending on
your search, this may or may not apply.]
d. What information did you get from the
query-anchored report, which you didn't see in the first
report?
e. Which style of BLAST report did you prefer?
Why?
2. Summarize one of your nucleotide BLAST search
results. Follow parts a-e above.
Part 3: Running Multiple Sequence
Alignments [MSA]
Exercise:
A: MSA of protein sequences.
1. By this time you should have one or more sets
of related proteins stored in your log in FASTA format. If
so, jump to 2 below. If not, go back to
Part 1, section B4 and follow directions
on retrieving a set of related proteins.
2. In your log, remove any spaces that come before
">gi|xxx..." on the header line of each entry.
[Don't disturb the sequence lines.] This is
necessary when running MSA, because any extra spaces
will terminate the alignment for all entries beyond those
spaces.
3. Go to:
http://www2.ebi.ac.uk/clustalw/
or http://www.ebi.ac.uk/clustalw/
Explore the site. You can read about the windows by
clicking on them.
4. Paste your grouped FASTA sequences into the
text box. For your first run, use the defaults. The
alignments will take a few minutes. You may want to enter
your e-mail to retrieve the report. If your run fails, the
first place to check is the FASTA format and the left-hand
spaces. If a run seems to take too long, try "off-hours",
keeping in mind that this is a European site, or try making
your alignment request smaller. You can do this by selecting
only 4-6 sequences. Alternatively, you may want to focus on
just one region or domain of your sequences. In that case,
you can select portions of the FASTA reports.
5. Once you have the report, browse to see
what you have. Click on Jalview for a graphical display.
Wait for the calculations and color assignment to be
complete before trying to navigate. For your convenience,
consensus notations and colors used in Jalview are assigned
as follows:
Consensus line notations:
* = identical or conserved residues in all sequences in the alignment
: = conserved substitutions
. = semi-conserved substitutions

Colors:
Characteristics                        | Amino acids
red: small & hydrophobic R groups      | AVFPMILW
blue: acidic                           | DE
magenta: basic                         | RHK
green: hydroxyl + X                    | STYHCNGQ
gray: other                            |
Compare the results given here to the BLAST results. If
you used a group of sequences containing the query sequence
you used for a BLAST search, you should see a similar
alignment.
6. Try repeating the MSA after changing selected
defaults. [It is good to play with a limited number of
sequences when doing this, due to the computing load it
creates. Four sequences of reasonable length should give you
enough to see the effects.] How are the reports
affected? Record your results in your log for future
reference.
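The consensus line described in step 5 can be reproduced by hand for the simplest case. The sketch below marks only columns where every sequence has the same residue ("*"). It does not attempt the ":" and "." marks, since those depend on the substitution groups the alignment program uses, which are not listed in this exercise. The function name `identity_line` and the three short sequences are made up.

```python
def identity_line(aligned):
    """Return '*' under each column where all sequences agree, ' ' elsewhere."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must have equal length")
    marks = []
    for column in zip(*aligned):
        first = column[0]
        # A column of gaps is not counted as conserved.
        if first != "-" and all(residue == first for residue in column):
            marks.append("*")
        else:
            marks.append(" ")
    return "".join(marks)


# Made-up aligned sequences, for illustration only.
alignment = ["MKVLDE", "MRVLEE", "MKILDE"]

for sequence in alignment:
    print(sequence)
print(identity_line(alignment))
```

Running this on a small region of your own alignment is a quick way to check that you are reading the consensus line correctly.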
B: Optional.
You can also run MSAs on a selected set of nucleotide
sequences, such as from Popset or from one of your own
searches.
Part 3 Summary
Questions:
Try to limit your answers to one typed page [12 pt
font] for this part. [You need not retype the
questions as part of your responses.]
1. Summarize one of your MSA results on some
selected proteins. Give the following information:
a. What group of proteins did you select?
b. What regions are highly conserved? What
regions are not? [Consider the structure and function
of the proteins.]
c. If you were to compare the nucleotide
sequences of these same proteins, would the conserved
regions of the nucleotides be as conserved as the
conserved regions of the proteins? Why or why not?
2. If you get alignments by running BLAST, what is
the advantage of using MSA?
3. How can MSA results be used in examining
phylogeny?
Interactive discussion
Make one original posting to "Search Strategies", or
respond to an existing posting. You can copy/paste sequences
as part of your posting and you can provide active links as
well. Check back and read new postings. They may contain
information or answers which you'll find useful.