The NCBI BLAST family of programs includes:
The Databases available on our local server are:
>gi|532319|pir|TVFV2E|TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT
QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC
HFPSNWKGAWKEVKEEIVNLPKERYRGTNDPKRIFFQRQWGDPETANLWFNCHGEFFYCK
MDWFLNYLNNLTVDADHNECKNTSGTKSGNKRAPGPCVQRTYVACHIRSVIIWLETISKK
TYAPPREGHLECTSTVTGMTVELNYIPKNRTNVTLSPQIESIWAAELDRYKLVEITPIGF
APTEVRRYTGGHERQKRVPFVXXXXXXXXXXXXXXXXXXXXXXVQSQHLLAGILQQQKNL
LAAVEAQQQMLKLTIWGVK
A --> adenosine           M --> A C (amino)
C --> cytidine            S --> G C (strong)
G --> guanosine           W --> A T (weak)
T --> thymidine           B --> G T C
U --> uridine             D --> G A T
R --> G A (purine)        H --> A C T
Y --> T C (pyrimidine)    V --> G C A
K --> G T (keto)          N --> A G C T (any)
- gap of indeterminate length
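The ambiguity codes in the table above can be read as sets of bases. The following is a minimal sketch of that reading, not part of BLAST itself; the names `IUPAC_NUCLEOTIDES`, `expand_code` and `count_concrete_sequences` are hypothetical, and skipping the gap character is an assumption made for the example.

```python
# Nucleotide codes from the table above, mapped to the bases they stand for.
IUPAC_NUCLEOTIDES = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "M": "AC",    # amino
    "S": "GC",    # strong
    "W": "AT",    # weak
    "R": "GA",    # purine
    "Y": "TC",    # pyrimidine
    "K": "GT",    # keto
    "B": "GTC",
    "D": "GAT",
    "H": "ACT",
    "V": "GCA",
    "N": "AGCT",  # any
}


def expand_code(code):
    """Return the set of bases represented by a single nucleotide code."""
    return set(IUPAC_NUCLEOTIDES[code.upper()])


def count_concrete_sequences(query):
    """Count the unambiguous sequences a query could stand for.

    Gap characters ("-") are skipped; that is a simplification for this sketch.
    """
    total = 1
    for symbol in query:
        if symbol == "-":
            continue
        total *= len(expand_code(symbol))
    return total


print(sorted(expand_code("R")))
print(count_concrete_sequences("ACGTN"))
```

A query made only of A, C, G and T stands for exactly one sequence, while each ambiguity code multiplies the count by the number of bases it covers.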
For those programs that use amino acid query sequences (BLASTP
and TBLASTN), the accepted amino acid codes are:
A  alanine                   P  proline
B  aspartate or asparagine   Q  glutamine
C  cysteine                  R  arginine
D  aspartate                 S  serine
E  glutamate                 T  threonine
F  phenylalanine             U  selenocysteine
G  glycine                   V  valine
H  histidine                 W  tryptophan
I  isoleucine                Y  tyrosine
K  lysine                    Z  glutamate or glutamine
L  leucine                   X  any
M  methionine                *  translation stop
N  asparagine                -  gap of indeterminate length
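A protein query can be checked against the accepted codes before it is submitted. The sketch below is only an illustration: the helper name `find_invalid_residues` is hypothetical, and treating lower-case letters as equivalent to upper-case is an assumption, not something stated above.

```python
# Accepted amino acid codes, taken from the table above.
ACCEPTED_AMINO_ACID_CODES = set("ABCDEFGHIKLMNPQRSTUVWYZX*-")


def find_invalid_residues(sequence):
    """Return (position, character) pairs for symbols not in the accepted set.

    Positions are 0-based. Case is ignored (an assumption of this sketch).
    The sequence is expected to contain no whitespace or FASTA header.
    """
    return [
        (index, symbol)
        for index, symbol in enumerate(sequence)
        if symbol.upper() not in ACCEPTED_AMINO_ACID_CODES
    ]


print(find_invalid_residues("ELRLRYCAPAGFALLKCND"))
print(find_invalid_residues("MKJO"))
```

An empty list means every symbol in the query appears in the table.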
This function masks off segments of the query sequence that have low compositional complexity, as determined by the SEG program of Wootton and Federhen (Computers and Chemistry, 1993) or, for BLASTN, by the DUST program of Tatusov and Lipman. Filtering can eliminate statistically significant but biologically uninteresting reports from the BLAST output (e.g., hits against common acidic-, basic- or proline-rich regions), leaving the more biologically interesting regions of the query sequence available for specific matching against database sequences.
Filtering is only applied to the query sequence (or its translation products), not to database sequences. Default filtering is DUST for BLASTN, SEG for other programs.
It is not unusual for nothing at all to be masked by SEG when it is applied to sequences in SWISS-PROT or RefSeq, so filtering should not be expected to always yield an effect. Furthermore, in some cases sequences are masked in their entirety, indicating that the statistical significance of any matches reported against the unfiltered query sequence should be suspect. A query that is masked in its entirety will also cause the search to fail with an error when the default filter setting is used.
Filter (Mask for lookup table only)
BLAST searches consist of two phases: finding hits based upon a lookup table, and then extending them. This option masks only for the purpose of constructing the lookup table used by BLAST, so that no hits are found based upon low-complexity sequence or repeats (if the repeat filter is checked). The BLAST extensions are performed without masking, so they can extend through low-complexity sequence.
A key element in evaluating the quality of a pairwise sequence alignment is the "substitution matrix", which assigns a score for aligning any possible pair of residues. The theory of amino acid substitution matrices is described in [1], and applied to DNA sequence comparison in [2]. In general, different substitution matrices are tailored to detecting similarities among sequences that are diverged by differing degrees [1-3]. A single matrix may nevertheless be reasonably efficient over a relatively broad range of evolutionary change [1-3]. Experimentation has shown that the BLOSUM-62 matrix [4] is among the best for detecting most weak protein similarities. For particularly long and weak alignments, the BLOSUM-45 matrix may prove superior. A detailed statistical theory for gapped alignments has not been developed, and the best gap costs to use with a given substitution matrix are determined empirically.
Short alignments need to be relatively strong (i.e. have a higher percentage of matching residues) to rise above background noise. Such short but strong alignments are more easily detected using a matrix with a higher "relative entropy" [1] than that of BLOSUM-62. In particular, short query sequences can only produce short alignments, and therefore database searches with short queries should use an appropriately tailored matrix. The BLOSUM series does not include any matrices with relative entropies suitable for the shortest queries, so the older PAM matrices [5,6] may be used instead.
For proteins, a provisional table of recommended substitution matrices and gap costs for various query lengths is:
Query length     Substitution matrix     Gap costs
------------     -------------------     ---------
<35              PAM-30                  ( 9,1)
35-50            PAM-70                  (10,1)
50-85            BLOSUM-80               (10,1)
>85              BLOSUM-62               (11,1)
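The table can be expressed as a simple lookup. This is a sketch only: the function name `recommend_scoring` is hypothetical, and because the table's ranges share the endpoints 50 and 85, the sketch assumes those boundary lengths belong to the shorter range.

```python
def recommend_scoring(query_length):
    """Return (matrix, gap_existence, gap_extension) for a protein query length.

    Follows the provisional table above. The boundary lengths 50 and 85 are
    assigned to the shorter range here; the table itself leaves this open.
    """
    if query_length < 1:
        raise ValueError("query length must be positive")
    if query_length < 35:
        return ("PAM-30", 9, 1)
    if query_length <= 50:
        return ("PAM-70", 10, 1)
    if query_length <= 85:
        return ("BLOSUM-80", 10, 1)
    return ("BLOSUM-62", 11, 1)


for length in (20, 40, 70, 300):
    print(length, recommend_scoring(length))
```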
The raw score of an alignment is the sum of the scores for aligning pairs of residues and the scores for gaps. Gapped BLAST and PSI-BLAST use "affine gap costs" which charge the score -a for the existence of a gap, and the score -b for each residue in the gap. Thus a gap of k residues receives a total score of -(a+bk); specifically, a gap of length 1 receives the score -(a+b).
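As a worked illustration of the affine gap formula, the sketch below computes -(a+bk) and adds it to a list of residue-pair scores. The function names and the pair scores in the usage lines are hypothetical and are not taken from any real alignment.

```python
def gap_score(length, existence, extension):
    """Score of one gap of `length` residues under affine costs: -(a + b*k)."""
    if length < 1:
        raise ValueError("a gap must span at least one residue")
    return -(existence + extension * length)


def raw_score(pair_scores, gap_lengths, existence, extension):
    """Raw alignment score: aligned-pair scores plus the scores of all gaps."""
    gaps = sum(gap_score(k, existence, extension) for k in gap_lengths)
    return sum(pair_scores) + gaps


# With existence cost a = 11 and extension cost b = 1:
print(gap_score(1, 11, 1))   # a gap of length 1 scores -(11 + 1) = -12
print(gap_score(3, 11, 1))   # a gap of length 3 scores -(11 + 3) = -14

# Hypothetical pair scores 4, 5 and 6 with one gap of length 2.
print(raw_score([4, 5, 6], [2], 11, 1))
```

Because the existence cost is charged once per gap, one long gap is penalized less than several short gaps covering the same number of residues.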
To convert a raw score S into a normalized score S' expressed in bits, one uses the formula S' = (lambda*S - ln K)/(ln 2), where lambda and K are parameters dependent upon the scoring system (substitution matrix and gap costs) employed [7-9]. For determining S', the more important of these parameters is lambda. The "lambda ratio" quoted here is the ratio of the lambda for the given scoring system to that for one using the same substitution scores, but with infinite gap costs [8]. This ratio indicates what proportion of information in an ungapped alignment must be sacrificed in the hope of improving its score through extension using gaps. We have found empirically that the most effective gap costs tend to be those with lambda ratios in the range 0.8 to 0.9.
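The conversion to bits and the lambda ratio can be sketched as follows. The function names are hypothetical, and the lambda and K values in the usage lines are placeholders chosen for illustration; real values depend on the substitution matrix and gap costs in use.

```python
import math


def bit_score(raw, lam, k):
    """Normalized score in bits: S' = (lambda*S - ln K) / ln 2."""
    if lam <= 0 or k <= 0:
        raise ValueError("lambda and K must be positive")
    return (lam * raw - math.log(k)) / math.log(2)


def lambda_ratio(lam_gapped, lam_ungapped):
    """Ratio of lambda with the chosen gap costs to lambda with infinite gap costs."""
    return lam_gapped / lam_ungapped


# Placeholder parameters, for illustration only.
example_lambda = 0.27
example_k = 0.04
print(round(bit_score(100, example_lambda, example_k), 2))

# Placeholder gapped and ungapped lambdas; the text above reports that
# ratios between 0.8 and 0.9 tend to be most effective.
print(round(lambda_ratio(0.27, 0.32), 3))
```

Since lambda multiplies the raw score while K enters only through a logarithm, lambda dominates the conversion for all but very low scores.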
Some popular groups are: Archaea, Bacteria, Eukaryota, Embryophyta (higher plants), Fungi, Metazoa (multicellular animals), Vertebrata, Mammalia, Rodentia, Primates.
(TBO - traditional blast output)
0: 3 nucleotides missing - gap (TBO notation "-")
OOF alignment with DNAP:
DTRGGDTPQKSVFSRAQNTLWGERGDTQKRGGAQRGDIFSLWGG-GVLCV
| | | | | | | | | | | | | | | | |
D G T K F A T G G Q G Q D S G K V V
TBO:
DGTKFATGGQGQDSG-VV
DGTKFATGGQGQDSG VV
DGTKFATGGQGQDSGKVV
1: 2 nucleotides missing - "frameshift -2" (TBO notation "\\")
OOF alignment with DNAP:
DTRGGDTPQKSVFSRAQNTLWGERGDTQKRGGAQRGDIFSLWGGGGVLCV
| | | | | | | | | | | | | | |/ | |
D G T K F A T G G Q G Q D S GK V V
TBO:
DGTKFATGGQGQDSG\\GVV
DGTKFATGGQGQDSG VV
DGTKFATGGQGQDSG KVV
2: 1 nucleotide missing - "frameshift -1" (TBO notation "\")
OOF alignment with DNAP:
DTRGGDTPQKSVFSRAQNTLWGERGDTQKRGGAQRGDIFSLWGGERGV
| | | | | | | | | | | | | | / | |
D G T K F A T G G Q G Q D S G K V
TBO:
DGTKFATGGQGQDS\GEV
DGTKFATGGQGQDS G V
DGTKFATGGQGQDS GKV
3: Complete match
OOF alignment with DNAP:
DTRGGDTPQKSVFSRAQNTLWGERGDTQKRGGAQRGDIFSLWGGEKRGV
| | | | | | | | | | | | | | | | |
D G T K F A T G G Q G Q D S G K V
TBO:
DGTKFATGGQGQDSGKV
DGTKFATGGQGQDSGKV
DGTKFATGGQGQDSGKV
4: 1 nucleotide insertion - "frameshift +1" (TBO notation "/")
OOF alignment with DNAP:
DTRGGDTPQKSVFSRAQNTLWGERGDTQKRGGAQRGDIFSLWGGVEKRGV
| | | | | | | | | | | | | | | \
D G T K F A T G G Q G Q D S G K V
TBO:
DGTKFATGGQGQDSG/KV
DGTKFATGGQGQDSG KV
DGTKFATGGQGQDSG KV
5: 2 nucleotide insertion - "frameshift +2" (TBO notation "//")
OOF alignment with DNAP:
DTRGGDTPQKSVFSRAQNTLWGERGDTQKRGGAQRGDIFSLFLWGGEKRGV
| | | | | | | | | | | | | | \ | |
D G T K F A T G G Q G Q D S G K V
TBO:
DGTKFATGGQGQDS//GKV
DGTKFATGGQGQDS GKV
DGTKFATGGQGQDS GKV
-G Cost to open a gap [Integer]
default = 5
-E Cost to extend a gap [Integer]
default = 2
-q Penalty for a mismatch in the blast portion of run [Integer]
default = -3
-r Reward for a match in the blast portion of run [Integer]
default = 1
-e Expectation value (E) [Real]
default = 10.0
-W Word size, default is 11 for blastn, 3 for other programs.
-v Number of one-line descriptions (V) [Integer]
default = 100
-b Number of alignments to show (B) [Integer]
default = 100
-G Cost to open a gap [Integer]
default = 11
-E Cost to extend a gap [Integer]
default = 1
-e Expectation value (E) [Real]
default = 10.0
-W Word size, default is 11 for blastn, 3 for other programs.
-v Number of one-line descriptions (V) [Integer]
default = 100
-b Number of alignments to show (B) [Integer]
default = 100
Limited values for gap existence and extension are supported for these three programs.
Some supported and suggested values are:
Existence Extension
10 1
10 2
11 1
8 2
9 2