Splice sites are the key signal sequences that determine the boundaries of exons. A method for splice site
detection should ideally be based on a thorough understanding of the complex eukaryotic splicing process. We trained
a feedforward neural network with one layer of hidden units, using backpropagation, to recognize 5' and 3' splice sites,
using a representative data set (Drosophila melanogaster data set). We only consider genes that have
constrained consensus splice sites, i.e., 'GT' for the 5' and
'AG' for the 3' splice site. The output of the
network is a score between 0 and 1 for a potential splice site.
The neural network method is described in detail in the References and the Abstract.
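The scoring step described above can be sketched as follows. This is a minimal illustration, not the trained network: the 15-base window (positions -7 to +8 around a candidate GT) is an assumption borrowed from the 5' consensus listing on this page, and the number of hidden units, the weights, and the example sequence are hypothetical placeholders.

```python
import math
import random

BASES = "ACGT"

def one_hot(window):
    """Encode a DNA window as a flat list of 0/1 values, four per base."""
    vec = []
    for base in window.upper():
        vec.extend(1.0 if base == b else 0.0 for b in BASES)
    return vec

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def score_window(window, w_hidden, b_hidden, w_out, b_out):
    """Forward pass: input -> one hidden layer -> one output in (0, 1)."""
    x = one_hot(window)
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w_hidden, b_hidden)]
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)

def donor_candidates(seq, upstream=7, downstream=8):
    """Yield (index, window) for every GT with enough flanking sequence."""
    seq = seq.upper()
    for i in range(upstream, len(seq) - downstream + 1):
        if seq[i:i + 2] == "GT":
            yield i, seq[i - upstream:i + downstream]

# Untrained placeholder weights (assumption): a real network would be
# fitted by backpropagation on annotated splice sites.
rng = random.Random(0)
WINDOW = 15    # assumed input window, positions -7..+8
N_HIDDEN = 5   # hypothetical number of hidden units
w_hidden = [[rng.uniform(-0.1, 0.1) for _ in range(4 * WINDOW)]
            for _ in range(N_HIDDEN)]
b_hidden = [0.0] * N_HIDDEN
w_out = [rng.uniform(-0.1, 0.1) for _ in range(N_HIDDEN)]
b_out = 0.0

seq = "ACGTTCAAGGTAAGTCCTGACCAGGTGAGCATTT"  # made-up example sequence
scores = [(pos, score_window(win, w_hidden, b_hidden, w_out, b_out))
          for pos, win in donor_candidates(seq)]
```

Only positions carrying the constrained GT dinucleotide are scored, which mirrors the restriction to consensus splice sites stated above; a 3' version would look for AG with its own window.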
An independent, randomly chosen test set of 43 human genes (/sequence/human-datasets.html), containing no sequences related to the training set, gave the following results:
#### Human 5' Splice Site prediction:
+-----------+--------------------+------------------------+------------------------------+
| threshold | % sites recognized | % false positive sites | correlation coefficient (CC) |
+-----------+--------------------+------------------------+------------------------------+
| 0.99      | 26.0%              | 0.1%                   | 0.46                         |
| 0.95      | 50.4%              | 0.7%                   | 0.65                         |
| 0.90      | 64.1%              | 1.1%                   | 0.73                         |
| 0.85      | 72.7%              | 1.4%                   | 0.78                         |
| 0.80      | 74.4%              | 1.9%                   | 0.78                         |
| 0.75      | 77.8%              | 1.9%                   | 0.81                         |
| 0.70      | 81.6%              | 2.7%                   | 0.81                         |
| 0.65      | 85.0%              | 3.2%                   | 0.83                         |
| 0.60      | 88.0%              | 3.5%                   | 0.84                         |
| 0.55      | 89.3%              | 3.7%                   | 0.84                         |
| 0.50      | 91.5%              | 4.2%                   | 0.85                         |
| 0.45      | 93.2%              | 4.7%                   | 0.85                         |
| 0.40      | 93.2%              | 5.2%                   | 0.84                         |
| 0.35      | 93.6%              | 5.3%                   | 0.84                         |
| 0.30      | 94.9%              | 5.8%                   | 0.84                         |
| 0.25      | 95.3%              | 6.2%                   | 0.84                         |
| 0.20      | 96.2%              | 6.7%                   | 0.83                         |
| 0.15      | 96.6%              | 8.2%                   | 0.81                         |
| 0.10      | 97.9%              | 9.1%                   | 0.81                         |
| 0.05      | 98.3%              | 11.1%                  | 0.78                         |
+-----------+--------------------+------------------------+------------------------------+
These percentages are defined by:
$$\text{sites recognized} = \frac{\text{correctly predicted sites}}{\text{all observed sites}} = \frac{TP}{TP + FN}$$
$$\text{false positive sites} = \frac{\text{non-sites predicted as sites}}{\text{all observed non-sites}} = \frac{FP}{FP + TN}$$
$$\text{correlation coefficient (CC)} = \frac{(TP \times TN) - (FN \times FP)}{\sqrt{(TP + FN) \times (TN + FP) \times (TP + FP) \times (TN + FN)}}$$
where TP = true positives (observed sites predicted as sites), TN = true negatives (observed non-sites predicted as non-sites), FP = false positives (observed non-sites predicted as sites), and FN = false negatives (observed sites predicted as non-sites).
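As a worked illustration of these definitions, the following sketch computes the three quantities from the four counts. The function name and the example counts are hypothetical and are not taken from the test sets reported here.

```python
import math

def splice_site_metrics(tp, tn, fp, fn):
    """Return (sites recognized, false positive sites, CC) as fractions.

    A value is None when its denominator is zero, for example CC when
    nothing at all is predicted as a site (TP + FP = 0).
    """
    sites_recognized = tp / (tp + fn) if (tp + fn) else None
    false_positive = fp / (fp + tn) if (fp + tn) else None
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    cc = ((tp * tn) - (fn * fp)) / denom if denom else None
    return sites_recognized, false_positive, cc

# Hypothetical counts, chosen only to exercise the formulas.
rec, fpr, cc = splice_site_metrics(tp=80, tn=950, fp=50, fn=20)
print(f"sites recognized: {rec:.1%}, false positives: {fpr:.1%}, CC: {cc:.2f}")
```

When a threshold is strict enough that no position is predicted as a site, the CC denominator is zero and the coefficient is undefined, which would be consistent with a "-" entry in a results table.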
#### Human 3' Splice Site prediction:
+-----------+--------------------+------------------------+------------------------------+
| threshold | % sites recognized | % false positive sites | correlation coefficient (CC) |
+-----------+--------------------+------------------------+------------------------------+
| 0.99      | 7.3%               | 0.0%                   | 0.25                         |
| 0.95      | 33.3%              | 0.4%                   | 0.52                         |
| 0.90      | 47.9%              | 0.5%                   | 0.64                         |
| 0.85      | 57.7%              | 0.6%                   | 0.70                         |
| 0.80      | 61.2%              | 0.9%                   | 0.72                         |
| 0.75      | 65.4%              | 1.1%                   | 0.74                         |
| 0.70      | 69.7%              | 1.3%                   | 0.77                         |
| 0.65      | 73.5%              | 1.5%                   | 0.79                         |
| 0.60      | 76.5%              | 1.8%                   | 0.80                         |
| 0.55      | 79.1%              | 2.0%                   | 0.81                         |
| 0.50      | 80.8%              | 2.4%                   | 0.81                         |
| 0.45      | 82.5%              | 2.9%                   | 0.81                         |
| 0.40      | 83.8%              | 3.1%                   | 0.81                         |
| 0.35      | 86.8%              | 3.7%                   | 0.82                         |
| 0.30      | 88.5%              | 4.0%                   | 0.82                         |
| 0.25      | 88.5%              | 4.5%                   | 0.81                         |
| 0.20      | 90.2%              | 4.8%                   | 0.82                         |
| 0.15      | 91.0%              | 6.0%                   | 0.80                         |
| 0.10      | 92.3%              | 7.9%                   | 0.77                         |
| 0.05      | 94.9%              | 10.4%                  | 0.74                         |
+-----------+--------------------+------------------------+------------------------------+
Neural network based "consensus" sequences: Extensive analysis of the perceptron neural network weight matrices has revealed the following "refined" 5' and 3' splice site consensus and non-consensus sequences:
5' Splice Site:
position: -7 -6 -5 -4 -3 -2 -1 / +1 +2 +3 +4 +5 +6 +7 +8
consensus: a a a A C|a A G / G T A A G T - c
non-consensus: g g g G G|T G|T A|T - - C|t g|t - - t -
3' Splice Site:
position: -21 -20 -19 -18 -17 -16 -15 -14 -13 -12 -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1
consensus: - T T T|c T T|C T|C T|c T|c T|c T|c T|c T|c T|C T|c T|C T|c A T|C A G
non-consensus: G
position: +1 +2 +3 +4 +5 +6 +7 +8 +9 +10 +11 +12 +13 +14 +15 +16 +17 +18 +19 +20
consensus: G T c - - - g g - g g|a c g a a a|c a g - -
non-consensus: c|t t g|t
Capital letters indicate strong weights and lower-case letters weaker weights; "|" means "or"; "-" means no significant weight; "/" marks the exon/intron boundary at the 5' splice site; "non-consensus" indicates bases that are very unlikely to appear at that position.
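To make the notation concrete, the following sketch parses the 5' consensus line into a per-position lookup and lists which bases of a candidate window agree with it. The function names and the example window are hypothetical; the sketch only reads the notation and does not reproduce the network's weights or scores.

```python
FIVE_PRIME_POSITIONS = [-7, -6, -5, -4, -3, -2, -1, 1, 2, 3, 4, 5, 6, 7, 8]
FIVE_PRIME_CONSENSUS = "a a a A C|a A G / G T A A G T - c"

def parse_consensus(text, positions):
    """Map each position to {base: 'strong' | 'weak'} from the notation."""
    tokens = [t for t in text.split() if t != "/"]  # "/" is the boundary marker
    if len(tokens) != len(positions):
        raise ValueError("consensus length does not match positions")
    table = {}
    for pos, tok in zip(positions, tokens):
        if tok == "-":              # no significant weight at this position
            table[pos] = {}
        else:                       # "|" separates alternative bases
            table[pos] = {b.upper(): ("strong" if b.isupper() else "weak")
                          for b in tok.split("|")}
    return table

def match_consensus(window, table, positions):
    """Return (position, base, strength) for every base that matches."""
    hits = []
    for pos, base in zip(positions, window.upper()):
        strength = table[pos].get(base)
        if strength:
            hits.append((pos, base, strength))
    return hits

table = parse_consensus(FIVE_PRIME_CONSENSUS, FIVE_PRIME_POSITIONS)
window = "AAAACAGGTAAGTAC"  # made-up 15-base window, GT at positions +1/+2
hits = match_consensus(window, table, FIVE_PRIME_POSITIONS)
strong = [h for h in hits if h[2] == "strong"]
```

The same parser applies to the 3' consensus lines, given the matching position lists.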
An independent, randomly chosen test set of 41 genes (Drosophila melanogaster gene set), containing no sequences related to the training set, gave the following results:
#### _Drosophila melanogaster_ 5' Splice Site prediction:
+-----------+--------------------+------------------------+------------------------------+
| threshold | % sites recognized | % false positive sites | correlation coefficient (CC) |
+-----------+--------------------+------------------------+------------------------------+
| 0.99      | 0.0%               | 0.0%                   | -                            |
| 0.95      | 22.9%              | 0.0%                   | 0.44                         |
| 0.90      | 53.3%              | 0.0%                   | 0.69                         |
| 0.85      | 61.9%              | 0.0%                   | 0.75                         |
| 0.80      | 66.7%              | 0.0%                   | 0.78                         |
| 0.75      | 69.5%              | 0.8%                   | 0.78                         |
| 0.70      | 77.1%              | 0.8%                   | 0.83                         |
| 0.65      | 78.1%              | 1.0%                   | 0.83                         |
| 0.60      | 81.9%              | 1.0%                   | 0.86                         |
| 0.55      | 82.9%              | 1.0%                   | 0.86                         |
| 0.50      | 88.6%              | 1.8%                   | 0.88                         |
| 0.45      | 90.5%              | 2.5%                   | 0.88                         |
| 0.40      | 91.4%              | 3.0%                   | 0.88                         |
| 0.35      | 91.4%              | 4.0%                   | 0.85                         |
| 0.30      | 94.3%              | 4.8%                   | 0.86                         |
| 0.25      | 96.2%              | 5.3%                   | 0.86                         |
| 0.20      | 97.1%              | 5.8%                   | 0.86                         |
| 0.15      | 97.1%              | 8.0%                   | 0.82                         |
| 0.10      | 99.1%              | 10.3%                  | 0.80                         |
| 0.05      | 99.1%              | 15.1%                  | 0.73                         |
+-----------+--------------------+------------------------+------------------------------+
#### _Drosophila melanogaster_ 3' Splice Site prediction:
+-----------+--------------------+------------------------+------------------------------+
| threshold | % sites recognized | % false positive sites | correlation coefficient (CC) |
+-----------+--------------------+------------------------+------------------------------+
| 0.99      | 1.9%               | 0.0%                   | 0.12                         |
| 0.95      | 11.4%              | 0.0%                   | 0.30                         |
| 0.90      | 28.6%              | 0.6%                   | 0.46                         |
| 0.85      | 44.8%              | 0.6%                   | 0.60                         |
| 0.80      | 53.3%              | 1.1%                   | 0.65                         |
| 0.75      | 60.1%              | 2.0%                   | 0.69                         |
| 0.70      | 69.5%              | 2.3%                   | 0.74                         |
| 0.65      | 73.3%              | 2.5%                   | 0.76                         |
| 0.60      | 76.2%              | 3.1%                   | 0.77                         |
| 0.55      | 79.0%              | 4.2%                   | 0.77                         |
| 0.50      | 83.8%              | 5.4%                   | 0.78                         |
| 0.45      | 87.6%              | 5.9%                   | 0.80                         |
| 0.40      | 90.5%              | 6.5%                   | 0.81                         |
| 0.35      | 92.4%              | 7.0%                   | 0.81                         |
| 0.30      | 94.3%              | 9.0%                   | 0.79                         |
| 0.25      | 94.3%              | 10.7%                  | 0.77                         |
| 0.20      | 96.2%              | 13.0%                  | 0.75                         |
| 0.15      | 96.2%              | 14.7%                  | 0.73                         |
| 0.10      | 96.2%              | 17.5%                  | 0.69                         |
| 0.05      | 97.1%              | 30.7%                  | 0.56                         |
+-----------+--------------------+------------------------+------------------------------+