|
||
TUTORIAL RESEARCH ARTICLE |
Correspondence to Anthony Fodor: anthony.fodor{at}gmail.com
|
|
|---|
R. Aldrich's present address is Section of Neurobiology, University of Texas at Austin, Austin, TX 78712.
Abbreviations used in this paper: BK, large conductance calcium-activated potassium; HMM, hidden Markov model; RCK, regulator of potassium conductance; SCOP, structural classification of proteins.
| INTRODUCTION |
|---|
|
|
|---|
In the literature, the degree of sequence similarity between two proteins is often measured by percent identity. If the percent identity is above
35%, it can reliably be asserted that two proteins share a common fold (Rost, 1999
). It is, however, an unfortunate feature of proteins that many have percent identities below 30% yet still share a common fold. Indeed, many proteins that have common folds, and many that do not, have percent identities between 20 and 30%(Brenner et al., 1998
; Rost, 1999
). This ambiguous 2030% sequence identity range is often referred to as the "twilight zone."
This problem of resolving proteins in the "twilight zone" is particularly acute in the study of ion channels. Even despite considerable recent progress (Long et al., 2005a
,b
), electrophysiological experiments are usually performed on eukaryotic channels, while the majority of solved channel structures are from prokaryotes. The great evolutionary distance between prokaryotes and eukaryotes means that comparisons between bacterial and vertebrate channels rarely hits the 30% identity mark. Nonetheless, it is widely assumed in the literature that eukaryotic and prokaryotic channels share common folds. In this paper, we exploit recent progress in structural genomics to quantify the reliability of these sorts of assumptions. We find that for a diversity of eukaryotic channels, folds can indeed be assigned with very low false positive rates. We also explore cases in the channel literature where there appears to be little sequence-based evidence for domains that have been proposed based on sequence similarity.
| MATERIALS AND METHODS |
|---|
|
|
|---|
In compiling our results for estimating false positive rates, we only counted hits where both the query sequence and the seed sequence were present in the ASTRAL database filtered at 40% identity. That is, no protein in either the query set or seed sequence set had >40% sequence identity to any other protein within the query or seed sequence set. In this way, we guarantee that, for our calculation of false positive rates in Fig. 2, large numbers of similar sequences do not distort our results on either the query or the target side. When attempting to find ion channel domains, however, we wished to perform as inclusive a search as possible. Therefore, in creating Figs. 38![]()
![]()
![]()
![]()
, we allowed hits against any Superfamily profile.
We scored a hit as "correct" when the query and the seed sequence belonged to the same SCOP superfamily. We scored a hit as "incorrect" when the query sequence and the seed sequence belonged to different SCOP superfamilies and different SCOP folds. We ignored hits in our false positive estimates in which the query sequence and seed sequence belonged to different SCOP superfamilies but the same SCOP fold. By these criteria, we generated 107,425 hits of which 67,158 were marked correct and 40,267 were incorrect.
We note that our metric of false positive rate is more discriminating than the "e-value" scores that the HMMer program generates. An e-value is a commonly used metric that is defined as the number of hits that one would expect if the search were performed using random sequences. Even when our false positive rates were
80%, we still see e-values of <0.0001 (see online supplemental material, Tables S1S6, available at http://www.jgp.org/cgi/content/full/jgp.200509419/DC1). This discrepancy may represent HMMer over-stating the significance of its searches or be a feature of the construction of the Superfamily models. Alternatively, it may reflect that proteins with significant sequence similarity due to common ancestry may no longer share a common fold. Resolving this question is beyond the scope of this paper.
Code wrapped around the HMMer distribution was used to perform these analyses and generate all the figures in this paper. Java code and instructions for reproducing Figs. 28![]()
![]()
![]()
![]()
![]()
in the paper are available upon request.
Online Supplemental Material
The online supplemental material (available at http://www.jgp.org/cgi/content/full/jgp.200509419/DC1) shows the results of running ion channel query sequences against the 9,939 Superfamily profile HMMs representing protein domains of known structure (Tables S1S6).
| RESULTS |
|---|
|
|
|---|
To compute the answer to this question, we turn to three preestablished databases (see Fig. 1).
Our first requirement is classification of each structure in the PDB. The SCOP database, which was created with a combination of manual and automated curation, describes each domain in the PDB with a controlled vocabulary(Murzin et al., 1995
; Hubbard et al., 1997
; Hubbard et al., 1998
, 1999
; Lo Conte et al., 2000
, 2002
; Andreeva et al., 2004
). The SCOP database defines proteins with a common "fold" as having the same pattern of major secondary structures. This would appear to be the classification level that we are interested in. However, classification efforts in the literature are often at the "superfamily" level (Gough et al., 2001
; Madera et al., 2004
). The SCOP database defines a superfamily as a set of proteins that, based on structure, have a probable common evolutionary origin. Because the great majority of folds in SCOP (768 of 887) contain only one superfamily, it makes little practical difference whether we work at the superfamily or fold level. Since there is a fairly developed literature on the problem of predicting SCOP superfamily from domain sequence (Gough et al., 2001
; Madera et al., 2004
; Wistrand and Sonnhammer, 2005
), we choose for the rest of this paper to work at the superfamily level.
|
The Superfamily database is a collection of searchable profiles that represent all proteins of known structure (Gough et al., 2001
). The construction of the Superfamily database starts with sequences from the PDB filtered at 95% sequence identity to remove close duplicates (Fig. 1; for details of the Superfamily technique see Gough et al., 2001
). A sequence from each PDB domain, representing a known SCOP superfamily, is used as a "seed sequence" to build a multiple sequence alignment by searching all the proteins in Genbank/EMBL/DDBJ for related sequences. Next, a hidden Markov model (HMM) is generated that describes the probability of finding each residue at each position in the alignment. This model, called a "profile HMM," is then used to again search all the proteins in Genbank/EMBL/DDBJ for matching proteins. This leads to a new alignment with an increased number of sequences. The new alignment is used to generate a new profile HMM, which in turn is used to build a larger alignment. This recursive search is repeated a set number of times (see Gough et al., 2001
, for details). The final profile HMM derived from this process can be used to estimate how well a given query sequence matches the SCOP superfamily of the seed sequence. Given a sequence S, and a Superfamily profile model M, we can use the HMMer software package (http://hmmer.wustl.edu/) to produce a log-odds score indicating how well M fits S. The log-odds score is defined (see HHMer user's guide distributed as part of the HMMer package) as
![]() |
Our study begins with a demonstration of the relationship between log-odds score and the probability of making an erroneous assignment, defined as assigning a query sequence to an incorrect SCOP superfamily. We start with the ASTRAL database, which provides easily parsable text files that contain sequence and SCOP assignments for every protein domain in the PDB (Fig. 1; see Materials and Methods). We then use the HMMer package to run each sequence in our ASTRAL query set over the 9,939 Superfamily domain models, which represent all known protein structures. For each query sequence matched to each model, we note the log-odds score of the hits (if any) and whether the superfamily of the query sequence is, in fact, the same as the superfamily of the seed sequence. (We ignore cases where the query sequence is identical to the seed sequence.) If the superfamilies are, in fact, different, we mark the hit between the query sequence and the profile model as being in error.
Fig. 2 A shows the results of the 107,425 hits generated by running the ASTRAL query sequence set against the Superfamily database.
Each point in this graph shows the number of true positives or false positives captured when we only include hits that are greater than or equal to the log-odds score given on the x-axis. As we move from right to left on the x-axis, the cutoff score is lowered and we include more hits in our analysis. This increases the number of both false positives and true positives captured. If we take, for example, a cutoff of
11.5 (dashed line), we would capture 49,192 true positives and 2,604 false positives, representing an error rate of
5%. There are, however, 67,158 total true positives in our dataset. We have, therefore, at a 5% error rate, only captured
73% (49,192/67,158) of the true positives in our dataset. This reflects a well known, but unfortunate, difficulty of working with protein sequences: sometimes proteins with little in common, and hence low log-odds scores, nonetheless have the same fold. We therefore need to set a low score cutoff to capture all the proteins with the same fold. A low cutoff, however, also includes many spurious hits of proteins that are not in the same superfamily, which is reflected in Fig. 2 A by the rapid increase in the number of false positives generated when cutoffs below
11 are used.
|
10 there is a rapid degradation in the quality of the generated predictions. At a log-odds score of <
1, the prediction that the query sequence is in the same superfamily as the seed sequence is wrong
80% of the time. The data in Fig. 2 B represent the error rates generated by comparing thousands of protein domains of known structure. We can, therefore, use Fig. 2 B as a guide to ask the following: in general, how much confidence do we have that two proteins share the same fold given some level of sequence similarity?
Predicted Results for Kv1.2
To get a sense of how well the methods described above are able to assign probabilities to domain assignments in ion channels, we can compare the predictions of our bioinformatics techniques with the structures of channels that have been directly determined by experiment. We start with the rat Kv1.2 potassium channel, which recently became the first eukaryotic ion channel to have a structure solved for the S1S6 core region (Long et al., 2005a
,b
). The version of the Superfamily database we used (1.67) was constructed, however, before this recently solved Kv1.2 channel structure was put into the PDB. This allows us to see how well our methods work in a case where we know, but our database does not, what the correct answer will be. Fig. 3 shows the results of running the Kv1.2 sequence against the 9,939 profiles in the Superfamily database.
Each line in Fig. 3 is a hit in which part of the Kv1.2 query sequence matched to a Superfamily profile. The log-odds score of each hit (given in Table S1, available at http://www.jgp.org/cgi/content/full/jgp.200509419/DC1) has been translated to false positive rate using the data in Fig. 2 B. As we move from top to bottom in Fig. 3, the hits against each profile become weaker, and an assertion that Kv1.2 shares a common structure with the seed sequence used to generate that profile is more likely to be in error.
|
The purple lines in Fig. 3 represent hits against profiles whose seed sequence belongs to the "POZ domain" superfamily. This superfamily includes members whose SCOP family is the "tetramerization domain of potassium channels." This structure, known as the T1 domain, has been solved in a number of ion channels, including the Shaker potassium channel (Kreusch et al., 1998
) and, as an isolated domain, the Kv1.2 channel itself (Minor et al., 2000
). We see that a number of these T1 structures map with high confidence to the NH2-terminal region. It may seem at first surprising that the Kv1.2 channel, when used as a query sequence, maps to a profile that used Kv1.2 as a seed sequence with a false positive rate >0. This can happen because the seed sequence is used to build a profile that consists of many different related sequences. The log-odds score that is generated is to the entire profile and not just to the seed sequence. Query sequences will therefore not necessarily map with extremely high scores to profiles in which the query sequence was also used as the seed sequence.
The assertions of the presence of the SCOP superfamilies "POZ domain" and "voltage-gated potassium channels" are the only two predictions that we would make for Kv1.2 with a false positive rate of <0.05 (Fig. 3). These results give us some confidence in the ability of these techniques to discriminate true and spurious hits.
Predicted Results for the HCN Channel
Another example of a eukaryotic ion channel for which we have direct structural evidence is the HCN2 channel, which is gated by both voltage and cyclic nucleotides. In the case of this channel, there is a recent crystal structure of the isolated cyclic nucleotide binding domain (Zagotta et al., 2003). The cyclic nucleotide-binding domain in the HCN channel occurs after the S6 region at the end of the channel. When cAMP binds to this region of the channel, the channel is more likely to open (DiFrancesco and Tortora, 1991
). Fig. 4 shows the results of running the HCN channel sequence against the Superfamily database.
The blue lines in Fig. 4 represent hits against the SCOP superfamily "cAMP-binding domain-like." As we see, we would have predicted the presence of a cyclic-nucleotide binding domain in this channel based on sequence similarity with a fair number of other previously solved cyclic-nucleotide binding domains (see also Table S2). Because the cyclic-nucleotide binding domain of this channel has, in fact, been solved (Zagotta et al., 2003) and been shown to share a common fold with other cyclic-nucleotide binding domains, we know this prediction is correct. Since the KvAP structure also maps to the HCN channel with a low false positive rate, we can assert with confidence that we know the structure of several key domains of this channel.
|
|
We also see evidence for two CBS domains at the COOH-terminal end of the CLC-0 sequence. The CBS (cystathionine beta synthase) domain has been shown to bind to the adenosyl portion of molecules such as ATP (Scott et al., 2004
). Even though none of the individual hits against profiles seeded with CBS domains map with a false positive rate <0.05, there are many CBS domain hits that map to two distinct regions in the COOH terminus with a moderate false positive rate (Fig. 5, green lines). This gives us good confidence that both CBS domains are present in the channel.
Predicting EF-hands in Sodium and Calcium Channels
The potassium and cyclic-nucleotide gated channel genes that we have so far examined create tetrameric channels with four subunits, each one of which consists of a copy of the channel gene. By contrast, sodium and calcium channel genes consist of four repeats of the channel motif contained within a single gene. If sodium and calcium channels share a common core domain architecture with potassium and cyclic-nucleotide gated channels, we would expect the solved potassium channel domains to map four times to these genes. This is exactly what we see with a low false positive rate (<0.01) for both the L-type calcium channel gene (Fig. 6) and the human cardiac sodium channel gene (Fig. 7).
|
|
Evaluation of Literature-proposed Domains in the BK Channel
In the HCN (Fig. 4), calcium (Fig. 6), and sodium (Fig. 7) channel sequences that we have analyzed, there was strong support for the presence of the S1S6 "core" regions based on homology to structures of prokaryotic potassium channels. Moreover, in these channels, we were also able to find clear support for proposed cyclic-nucleotide or calcium-binding regulatory domains in the COOH-terminal regions. Everything that we have observed so far is in good agreement with assertions made in the literature.
We now turn to a protein for which the interpretation of literature assignments will not prove as straightforward. The large conductance calcium-activated potassium (BK) channel is gated by both calcium and voltage. A controversy that has surrounded the BK channel concerns the location within the channel sequence of the calcium sensor. The BK channel has a long
800 amino acid tail after the S6 transmembrane domain. The tail of the BK channel is highly conserved between species, for example 95% identical between mouse and human, and does not, using pairwise metrics of sequence similarity, have any immediately obvious homology to any other known protein domain. This domain, highly conserved among, and unique to, BK channels has been the subject of a good deal of interest, much of it regarding whether this domain could harbor the calcium sensor of the channel. A number of different schemes whereby calcium could bind to the BK tail have been proposed (Schreiber and Salkoff, 1997
; Jiang et al., 2001
; Jiang et al., 2002
; Bao et al., 2004
). These have included a proposed novel calcium binding domain that has been called the "calcium bowl" (Schreiber and Salkoff, 1997
). It has recently been proposed that the calcium bowl region is within a domain that resembles an EF-hand motif (Braun and Sy, 2001
; Sheng et al., 2005
). The location of the proposed EF-hand domain, based on the published alignment (Sheng et al., 2005
), is shown in the top part of Fig. 8.
|
There is evidence beyond sequence analysis to support the EF-hand motif hypothesis. Electrophysiological support for this hypothesis comes from experiments that find changed calcium sensitivity in channels with mutations at residues proposed to correspond to important residues in an EF-hand motif (Braun and Sy, 2001
). The degree to which this evidence supports an EF-hand motif in the absence of sequence support is an open question, since it is always a possibility that this region of the channel is in fact an EF-hand domain with a sequence that does not closely resemble other EF-hand domains. We have seen that often domains can belong to the same superfamily yet share little by way of sequence similarity (Fig. 2 A). We point out, however, that there are a large number of different structural models that could explain a given set of biochemical data. This is the primary difficulty in making structural arguments based on mutagenesis. The BK channel could very well have a calcium-sensing fold with a novel structure that could, nonetheless, be consistent with experimental data that appears to support an EF-hand model. This is especially true given that electrophysiological measurements can only report the apparent affinity of ligand for a channel. Mutations that change the underlying energetics of channel gating can cause a shift in the apparent affinity of a ligand without necessarily being near the actual binding site of the ligand. The absence of strong sequence support for proposed domains, therefore, renders interpretation of biochemical and electrophysiological data that much more difficult.
The difficulty of interpreting electrophysiological data in the absence of compelling sequence support can be further illustrated by considering what is essentially a competing hypothesis about the structure of the BK channel. Based on the sensitivity of the channel to serine proteinase inhibitors, and the results of a sequence analysis, it has been proposed that the COOH terminus of the BK channel "structurally resembles serine proteinases" (Moss et al., 1996a
; Moss et al., 1996b
). The top part of Fig. 8 shows this proposed "serine proteinase-like" domain mapped to the channel sequence based on the published alignment (Moss et al., 1996a
). We see that this prediction overlaps the predicted EF-hand domain. The purple lines in the bottom part of Fig. 8 show hits against profiles whose seed sequence belonged to the "trypsin-like serine proteinases" superfamily. As was the case for EF-hands, we see that assignments to this SCOP superfamily do not occur at any reasonable false positive rate (see Table S6). And, yet, it seems inarguable that molecules that inhibit serine proteinases affect the channel (Moss et al., 1996a
,b
; Favre et al., 2000
). Unless this region of the channel can adopt radically different conformations, a possibility that seems unlikely, the EF-hand hypothesis and the serine proteinase-like domain hypothesis are mutually exclusive, despite the existence of supporting biochemical evidence for both hypotheses. The absence of compelling sequence support for either hypothesis makes it difficult to choose between them.
The EF-hand and "calcium bowl" hypotheses have not been the only proposed mechanisms whereby the COOH terminus of the BK channel can sense calcium. In 2001, the MacKinnon lab solved a crystal structure of the COOH-terminal of an Escherichia coli potassium channel (Jiang et al., 2001
). This region of the channel formed a common structure called a Rossman fold. On the basis of a recursive profile search of Genbank/EMBL/DDBJ, it was proposed that this domain, dubbed RCK or "regulator of potassium conductance" was also present in the BK channel. The position of this RCK1 domain, based on the published alignment, is shown in Fig.8. We see that, in fact, the E. coli RCK structure does match the BK channel with a good false positive rate of
0.03. The assertion that these regions of the two channels share a Rossman fold is therefore highly reasonable. In 2002, however, the MacKinnon lab published a structure of an MthK potassium channel (Jiang et al., 2002
). The structure of this channel included a "gating ring" consisting of eight RCK domains, which appeared to be in a position to coordinate calcium and "perform mechanical work to open the pore." Based on sequence analysis, and the fact that potassium channels are tetramers, while the MthK crystal structure showed eight RCK domains apparently coordinating calcium, it was suggested that a second RCK domain existed in the BK channel. Although the exact position of the second RCK domain within the channel structure was not indicated in the MthK paper, it presumably occurs soon after the initial RCK domain and is approximately as long. We have indicated this approximate position as dashed lines in Fig. 8, where we see that, in fact, the second Rossman fold maps to the BK channel with a high false positive rate of >75%.
It is intuitively pleasing to think that the MthK and BK channels work in the same way and share a conserved common mechanism of calcium binding. And the fact that both the MthK and the E. coli Rossman fold domains map to the second RCK domain, albeit with high false positive rates (Fig. 8), provide some support for the existence of this domain within BK. In addition, there is some supporting electrophysiological evidence suggesting this region of the channel may be important for sensitivity to calcium (Qian et al., 2002
). Nonetheless, we have a great deal more confidence in the existence of the first RCK domain in BK than in the second. In the absence of strong sequence-based evidence, electrophysiological and biochemical evidence for the existence of a second RCK domain must be especially compelling.
| DISCUSSION |
|---|
|
|
|---|
The central limitation of our approach is that two sequences may have no discernable sequence similarity and yet may share the same fold (Fig. 2 A). We cannot, therefore, say with certainty that an assertion of common structure is false, even if there appears to be little by way of sequence support for that assertion. One day, for example, we may have a structure of the BK COOH terminus, and it may very well contain some of the domains that appear in Fig. 8 with high false positive rates. If that turns out to be the case, the combination of intuition, biochemistry, and manual sequence analysis employed by good scientists will have trumped the kinds of automated sequence analyses we perform here. Nonetheless, the initial arguments for the presence of the RCK 2, serine proteinase-like, and EF-hand domains in BK were based in part or in whole on sequence, so it is fair to evaluate the strength of that sequence evidence. In the absence of direct structural data, acceptance of the hypotheses that these domains exist in the channel will require a great deal more experimental work than, for example, confirmation of the first RCK domain in the BK channel, which maps to the channel with a much lower false positive rate (Fig. 8).
The false positive rates that we calculate are dependent on a large number of assumptions and heuristics. We assume, for example, that the PDB is large enough to produce stable results. That is, that the probabilities we calculate won't significantly change as more structures are added. We assume that false positive rates generated from all PDB structures are relevant when applied to ion channels, which, of course, are not well represented in the PDB. We assume that our technique based on the Superfamily database is representative of all possible reasonable techniques. If we had used a different methodology, we might have obtained different results. For example, we could have used a protein threading approach (McGuffin et al., 2004
) or a profileprofile (Wang and Dunbrack, 2004
) approach rather than running a single channel sequence against a profile HMM to classify domains within our proteins. We could have used a protein classification database other than SCOP or a profile database other than Superfamily. We could have used a program other than HMMer or restricted our analysis to only membrane proteins. While the technique we have used here gives reasonable performance given the current state of the art in detecting structure from sequence (Gough et al., 2001
; Madera and Gough, 2002
; Madera et al., 2004
; Wistrand and Sonnhammer, 2005
), we can imagine rational approaches to this problem that would use other methods. Despite the inherent assumptions, we argue that our metric of measuring false positive rates is preferable to the alternative of asserting a common fold between a channel of interest and a channel of known structure based primarily on a visual inspection of a multiple sequence alignment with little assessment as to the statistical merit of that assertion.
One day we might have crystal structures for every protein that we care about, and there will be no need for the kind of bioinformatic estimates we have discussed here. In the meantime, by explicitly considering estimates of error rates in assertions of common structure, we can focus our experimental efforts on the most probable structural models for the proteins we study.
| ACKNOWLEDGMENTS |
|---|
This work was supported by the Mathers Foundation.
Olaf S. Andersen served as editor.
Submitted: 30 September 2005
Accepted: 15 May 2006
| REFERENCES |
|---|
|
|
|---|
This article has been cited by other articles:
![]() |
J. Yan, J. V. Olsen, K.-S. Park, W. Li, W. Bildl, U. Schulte, R. W. Aldrich, B. Fakler, and J. S. Trimmer Profiling the Phospho-status of the BKCa Channel {alpha} Subunit in Rat Brain Reveals Unexpected Patterns and Complexity Mol. Cell. Proteomics, November 1, 2008; 7(11): 2188 - 2198. [Abstract] [Full Text] [PDF] |
||||
![]() |
T. Yusifov, N. Savalli, C. S. Gandhi, M. Ottolia, and R. Olcese The RCK2 domain of the human BKCa channel is a calcium sensor PNAS, January 8, 2008; 105(1): 376 - 381. [Abstract] [Full Text] [PDF] |
||||
![]() |
H. Yang, L. Hu, J. Shi, K. Delaloye, F. T. Horrigan, and J. Cui Mg2+ mediates interaction between the voltage sensor and cytosolic domain to activate BK channels PNAS, November 13, 2007; 104(46): 18270 - 18275. [Abstract] [Full Text] [PDF] |
||||
![]() |
C. J. Lingle Gating Rings Formed by RCK Domains: Keys to Gate Opening J. Gen. Physiol., February 2, 2007; 129(2): 101 - 107. [Full Text] [PDF] |
||||
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|