CATH functional families predict functional sites in proteins

Motivation: Identification of functional sites in proteins is essential for functional characterization, variant interpretation and drug design. Several methods are available for predicting either a generic functional site, or specific types of functional site. Here, we present FunSite, a machine learning predictor that identifies catalytic, ligand-binding and protein-protein interaction functional sites using features derived from protein sequence and structure, and evolutionary data from CATH functional families (FunFams). Results: FunSite’s prediction performance was rigorously benchmarked using cross-validation and a holdout dataset. FunSite outperformed other publicly-available functional site prediction methods. We show that conserved residues in FunFams are enriched in functional sites. We found FunSite’s performance depends greatly on the quality of functional site annotations and the information content of FunFams in the training data. Finally, we analyse which structural and evolutionary features are most predictive for functional sites. Availability


Introduction
Protein functional sites are amino acid residues, or groups of residues, that perform functional roles in proteins. Examples of functional sites include catalytic sites in enzymes, ligand-binding sites for small molecules, metal ions, nucleic acids and other proteins, and proteinprotein interaction sites. Characterization of functional sites is crucial for understanding the molecular basis of life, interpreting the impact of mutations, guiding targeted experiments, protein-engineering and drug design. Computational approaches to functional site prediction is essential since < 1% of all known proteins have any experimentallycharacterised or curator-assigned functional site information (UniProtKB, Jan 2019).
The key challenge for predicting functional sites using machine learning is to identify general properties of sites that distinguish one type of site from the others. The features most frequently used by machine learning predictors include sequence conservation information, physicochemical properties of amino acids, solvent accessibility, secondary structure, pocket information and crystallographic B-factors (Aumentado-Armstrong et al., 2015;Brylinski & Feinstein, 2013;Sankararaman, Sha, Kirsch, Jordan, & Sjölander, 2010;Sun, Wang, Xiong, Hu, & Liu, 2016;Xue et al., 2015). Sequence conservation information is captured using scores derived from multiple-sequence alignments (MSAs) of homologous proteins--based on the assumption that homologous proteins are likely to share functions and, therefore, have similar functional sites. However, this is a non-trivial task, as homologous proteins with similar functions can have diverse functional sites (Brown & Babbitt, 2014;Dessailly, Dawson, Mizuguchi, & Orengo, 2013;Furnham, Dawson, Rahman, Thornton, & Orengo, 2016;Taylor Ringia et al., 2004).
Most functional site predictors use PSI-BLAST (Altschul et al., 1997), which can detect distant homologs, to obtain sequence conservation information. Sequence conservation features can be calculated by an entropy analysis on the resulting position-specific scoring matrix (PSSM), which may include knowledge of positions that are conserved in distantly related proteins. Whilst sequence conservation features calculated this way have been found to predict catalytic and ligandbinding residues well (Bartlett, Porter, Borkakoti, & Thornton, 2002;Capra et al., 2009, Tan et al, 2013, they are not predictive of proteinprotein interfaces (Aumentado-Armstrong et al., 2015).
One way to overcome this is to calculate conservation and other features from protein families where the family grouping brings together sequences that are likely to have similar functional sites. A large number of computational methods utilize protein family information to predict functions or functional sites (Ashkenazy, Erez, Martz, Pupko, & Ben-Tal, 2010;Capra et al., 2009;del Sol Mesa, Pazos, & Valencia, 2003;Innis, Anand, & Sowdhamini, 2004;P. Jones et al., 2014;Lichtarge, Bourne, & Cohen, 1996). However, the quality of protein families greatly affects the performance of prediction, therefore using functionally coherent protein families is crucial.
Functional families (or FunFams) are subfamilies of evolutionary superfamilies in CATH (Sillitoe et al., 2019) which have been generated by a functional classification protocol that is based on recognizing sequence patterns that are conserved in the FunFam but differ between FunFams and, in particular, sequence patterns that reflect specificity determining positions . FunFam relatives have been found to be more functionally similar than other domain-based resources (see Supplementary Section S1; . Function prediction pipelines based on FunFams have been consistently ranked among the best function prediction methods by the international CAFA competition (Jiang et al., 2016;Radivojac et al., 2013) and more recently have been amongst the top 5 best performing methods (CAFA3; (Zhou et al., 2019)).
Patterns of conserved residues in FunFam alignments can be used to identify functionally important residues. In fact, conserved residues in FunFam alignments have been found to be highly enriched in known functional sites, including catalytic, ligand-binding, protein interfaces, nucleic acid-binding sites and allosteric sites (see Supplementary Section S1; ). However, conserved residues may also include residues important for the folding and packing of the protein. These tend to be conserved across the whole superfamily whereas functional determinants (also known as specificity-determining positions) tend to be differentially conserved in structurally equivalent positions between FunFams, thereby providing insights into the molecular mechanisms of functionally distinct residues in the superfamily Lee et al., 2016). Therefore, in principle, the finer classification of functional relatives in FunFams, compared to more generic PSI-BLAST approaches, could aid in detecting specific conservation features. These features, combined with other sequence and structure-based features, could be used in an integrated pipeline to predict different types of protein functional sites.
Despite the progress and availability of a wide range of functional site prediction methods, to our knowledge, none of the existing studies have systematically compared the characteristics of the different types of functional sites or attempted to predict multiple functional sites for a given query protein. Moreover, a large number of the prediction methods are not easily accessible or easy to interpret.
In this article, we present a new method (FunSite predictor) for predicting functional sites by using information from a protein functional family classification to predict three types of functional sites: catalytic, ligand-binding and protein-protein interaction sites. This method makes use of features derived from protein sequence, structure and CATH FunFams. Whilst we used the same set of features and machine learning model to train predictors for the three types of functional sites, we used different training sets for each type of site. We assessed the performance of FunSite by comparing against all publicly available functional site prediction methods that could be assessed using our comprehensive benchmark. FunSite predictors outperform all the other methods against which they were tested. Our combined approach allows the detection of sites which may have multiple functional roles. The performance of the predictors is highly dependent on the quality of the functional site data used for training.

Datasets
Three types of functional sites were considered: catalytic sites (CS), ligand-binding sites (LIG) and protein-protein interaction (PPI) sites.

Dataset generation
The general criteria used to generate all the datasets are described in Supplementary Methods (Section S2).

Catalytic Site (CS) domain datasets
A dataset of 667 domains was generated for catalytic sites from Mechanism and Catalytic Site Atlas (M-CSA; (Ribeiro et al., 2017). Only 'literature' type entries were used. 102 of the domains could be mapped to two datasets -the EF_Family dataset (Youn, Peters, Radivojac, & Mooney, 2007) and T124 dataset (Zhang et al., 2008)previously used to test other predictors (Sun et al., 2016;Zhang et al., 2008). These 102 domains were used to construct a holdout dataset for benchmarking. The remaining 565 domains were used as the training set.

Ligand-Binding (LIG) domain datasets
A dataset of 2826 domains was generated for ligand-binding sites for small-molecules (including metals) from BioLip (Yang, Roy, & Zhang, 2013). Although BioLip excludes artefacts resulting from the crystallization buffer, it does not distinguish between cognate and noncognate ligands (Kobren & Singh, 2019;Das & Orengo,2018). Ligandbinding site predictions generated by ConCavity (Capra et al., 2009) were available for 800 domains in this dataset. These 800 domains were used to construct a holdout dataset for benchmarking against ConCavity Downloaded from https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa937/5949022 by University College London user on 10 November 2020 (Capra et al., 2009) and the remaining 2026 domains were used for training.
Subsets of the LIG training and holdout datasets, comprising 675 and 240 domains respectively, gave training and benchmark datasets that only contained metal-binding sites.

Protein-Protein Interface (PPI) domain datasets
A dataset of 2247 domains was generated for interchain PPI sites from Inferred Biomolecular Interactions Server (IBIS) (Shoemaker et al., 2012). Only annotations present in experimentally-determined structures were extracted from IBIS. No inferred annotations were used. PPI site predictions were extracted from the meta-PPISP (Qin & Zhou, 2007) server, a meta-predictor of multiple methods, for 600 randomly selected domains in our constructed dataset, of which 599 domains returned results. These 599 domains were used as the holdout dataset for benchmarking against meta-PPISP.

Combined dataset of CS, LIG and PPI domains
A dataset of 175 enzyme domains was generated from the CS, LIG and PPI datasets that have at least one annotation in each domain, for all the three types of sites.
More information about the datasets is provided in Supplementary  Table S1. As the number of functional site residues in a protein domain is generally much lower than the number of non-site residues, the datasets have large class imbalance discussed further below together with strategies for addressing this. (Supplementary Figure S1).

FunSite predictors 2.2.1 Implementation and Training
To build the FunSite predictors we used gradient boosted decision trees (Friedman, 2001), applying the XGBoost (eXtreme Gradient Boosting) implementation (T. Chen & Guestrin, 2016). Gradient boosted trees are ensembles of weakly learning decision trees. During training, the XGBoost algorithm generates an ordered ensemble of decision trees in serial that are each fit to the residual error generated by all preceding trees in the ensemble. As such, gradient boosted trees are able to achieve high performance using only weakly learning decision trees.
The scikit-learn (Pedregosa et al., 2011) library in Python (www.python.org) was used to train and evaluate predictors using XGBoost's scikit-learn API. The number of decision trees in the ensemble was set to 1000. XGBoost hyperparameters were optimised using grid search with 5-fold cross validation (GridSearchCV function in scikit-learn). The hyperparameters and hyperparameter ranges used for the different FunSite predictors are listed in Supplementary Table S2.
To ensure the prediction error was not underestimated, the GroupKfold function in scikit-learn was used to select non-overlapping fold groups during cross-validation, such that residues from the same domain do not appear in the sets used for training and testing.
The ratio of site residues (positive samples) to non-site residues (negative samples) was very low for CS (1:63) and LIG (1:23) datasets (see Supplementary Table S1, Supplementary Figure S2). In order to reduce the class imbalance during training, the site to non-site ratio for the CS and LIG predictors was set to 1:6 by randomly selecting non-site residues. This is similar to class distributions used in previous studies (Sun et al., 2016). No changes were made to the non-site set for PPI as the ratio of site to non-site residues in the PPI training set was 1:4.

Feature generation
For each residue, FunSite generates features based on sequence, structure and FunFams. The sequence-and structure-based features generated are listed in Supplementary Methods (Section S2). The novel FunFam-based protein family features are described below:

Protein family features
To generate protein family features, each query protein sequence was scanned against the library of CATH FunFam HMM models using HMMER3 (Eddy, 2010). The protocol cath-resolve-hits (Lewis, Sillitoe, & Lees, 2018) was used to determine the optimal assignment of the sequence into domain regions based on the HMM matches. Regions of the query sequence were assigned to a FunFam if the HMM match Evalue was smaller than the model inclusion threshold, defined as the maximum E-value obtained by scanning each sequence within a particular FunFam against its HMM. The query sequence was then aligned with the sequences in the FunFam alignment using the mafft-add option in MAFFT (Katoh & Standley, 2013). Following alignment of domain sequences to CATH FunFams the following features were calculated: i.
Evolutionary conservation scores: The evolutionary conservation score for each residue was calculated from the FunFam multiple sequence alignment using Scorecons (Valdar, 2002). This is only possible for FunFams with sufficient information content in their alignment (see Supplementary Section S1). Scorecons scores range from 0 to 1 and residues having scores ≥ 0.7 are considered to be highly conserved . ii.
PSSM and weighted observed percentages (WOP) features: PSSM and weighted observed percentages (WOP) were calculated for the FunFam alignment using PSI-BLAST. 20dimensional PSSM and 20-dimensional WOP vectors are obtained for each residue.

iii. Conservation scores and predicted functional determinant (FD) scores from Structural Clusters of FunFams:
For each CATH superfamily, FunFam alignments with at least one domain structure, were merged to form Structural Clusters (SC5), such that structural relatives from the merged FunFams can be superimposed within a 5Å RMSD threshold. The Structural Cluster (SC5) MSAs were generated by merging FunFam MSAs guided by the structural alignment of domain structures from each constituent FunFam. Conservation scores for SC5 alignments were calculated by Scorecons. Functional determinant (FD) scores were predicted using GroupSim (Capra & Singh, 2008). GroupSim scores range from 0 to 1 where a higher score indicates higher probability for a column in an alignment to be a FD (i.e. to be associated with different conservation characteristics between FunFams within the structural cluster).

FunSite prediction protocol
The FunSite prediction protocol is shown in Figure 1. For each query protein, the residue feature vector (see Section 2.2.2) used by the FunSite predictors was generated for all residues. The three FunSite predictors (FunSite-CS, FunSite-LIG and FunSite-PPI) were run separately on the residue vectors of each query protein to generate separate prediction scores for each residue to be a catalytic, ligandbinding or protein-protein interaction site residue. The three predictors were run separately, instead of combining them into a single model, because the functional plasticity of residues (Dessailly et al., 2013) means that, depending on its context, a residue can perform multiple functions. For example, some proteins often use the same or overlapping set of surface residues to bind small molecules and other proteins (Davis & Sali, 2010;Mohamed, Degac, & Helms, 2015). Moreover, an increasing number of proteins have been found to perform multiple molecular functions using different, overlapping or sometimes the same set of residues (Das, Khan, Kihara, & Orengo, 2017). Therefore, the aim is to predict probabilities for all three types of site. Furthermore, to remove false positives, residues that are positively predicted to be a particular functional type are filtered out if there are no other positively predicted site residues of the same type in their structural neighborhood (5Å). This is based on the assumption that functional sites generally comprise two or more residues. This filtering step is applied only when making combined predictions (i.e. all types of functional sites) but not for the individual predictors. This was done so as not to inflate the prediction of the individual predictors for comparison to other predictors which did not apply these filters.

Comparison with other predictors
Performance of the FunSite predictors was benchmarked against the following functional site prediction methods:

Publicly available predictors
The following prediction methods were used, available at the time of writing.
Catalytic site predictors: CSmetaPred was used for benchmarking the predicted CS sites by FunSite-CS predictor. This is a consensus metapredictor that also provides predictions for its three component predictors-CRpred (Zhang et al., 2008), DISCERN (Sankararaman et al., 2010) and EXIA2 (Lu, Yu, Chien, & Huang, 2014), all of which exploit machine learning.
Ligand-binding site predictor: ConCavity (Capra et al., 2009) was used for benchmarking the predicted LIG sites by FunSite-LIG as its predictions were easily available.

Generic predictors
Generic predictors were also constructed for benchmarking the performance of the FunSite predictors. This was done by using the same training and test datasets, model implementation and parameters for model training. However, the generic predictors did not include any FunFam-based features. Instead, they include similar conservation-based features derived from PSSMs generated by PSI-BLAST. These features were combined with the other sequence-and structure-based features used by FunSite and which are frequently used by other functional site predictors, such as amino acid properties, pocket predictions, solvent accessibility, secondary structure information and B-factor. The full list of features used for training the generic predictors is provided in Supplementary Table S3.
The generic predictors provide an unbiased benchmark that is not affected by the size of the training dataset, the sampling strategy used, the machine learning model, or their different implementations. As a result, any improvement of performance shown by a FunSite predictor over its generic equivalent can be assumed to be largely due to the power of the FunFam-based features.
The evaluation methods used to compare performance with other predictors are described in Supplementary Methods (Section S2). Figure 2. Density plots showing the distribution of sequence conservation scores for catalytic site (CS), ligand-binding site (LIG) and protein-protein interaction site (PPI) residues in CATH FunFams compared to non-site residues that are buried (buried_NS) and those that are on the surface (surface_NS). The sequence conservation scores were generated by Scorecons (ranging from 0-1) and the residues with scores ≥ 0.7 are considered to be conserved (shaded gray in the figure).

Analysis of conservation properties for different functional site types
Residue conservation analysis of CATH FunFam alignments with high information content showed that experimentally characterised catalytic and ligand-binding site residues are generally highly conserved, i.e. they have Scorecons conservation scores ≥ 0.7. Figure 2 shows a comparison of the distribution of conservation scores for different types of known functional site residues identified in CATH FunFams and nonsite residues that are either buried or present on the protein surface. The majority of known catalytic site residues correspond to highly conserved residues in CATH FunFams. In a few cases, catalytic residues have low conservation scores due to the presence of mutated residues in those sites in some of the domains. A large proportion of the known ligand-binding sites were also found to be highly conserved in CATH FunFams. Comparatively, only a small number of known protein-protein interaction sites were found to be conserved in CATH FunFams. This is consistent with the findings of previous studies suggesting that few residues are conserved in protein interfaces (Caffrey et al., 2004;David & Sternberg, 2015;Humphris & Kortemme, 2007). The conservation (calculated using scorecons) for the protein-protein interaction (PPI) site residues and non-site surface residues (surface_NS) however differed significantly with a p-value of 2.2e -16 as calculated using Mann-Whitney U-test.
Buried non-site residues were also observed to be under evolutionary constraints because of their role in maintaining protein folding and stability. Non-site residues present on the surface were found to be the least conserved, although it is important to note that some of these may include sites that have not yet been annotated and other types of functional sites not considered in this analysis, such as allosteric and phosphorylation sites.

Selection of discriminating FunSite features for sites
Knowledge of the general characteristics of functional site residues was used to select a feature set for training the predictors to distinguish functional site residues from non-functional site residues. Some of these features included traditional characteristics such as sequence conservation scores (see Supplementary Figure S2a), solvent accessibility values and predicted pocket residues that are known to capture discriminating characteristics of CS, LIG and PPI residues. Additionally, a large number of other features were included in the FunSite predictors that were derived from combining sequence conservation information with other structural characteristics. For example, the number of residues surrounding conserved pocket residues (i.e. at a distance < 5Å) was found to be significantly different between site and non-site residues for CS and LIG (see Supplementary Figure  S2b) and the average number of surface residues within a 5Å radius (see Supplementary Figure S2c) was found to differentiate PPI residues from other residues.

Performance evaluation of FunSite predictors
The performance of the FunSite predictors for CS, LIG and PPI sites was evaluated and compared to generic predictors and other publicly available functional site predictors. The Precision-Recall (PR) curve was obtained by varying the positive classification threshold between 0 and 1.

FunSite-CS
The precision-recall (PR) curves in Figure 3a show the performance of FunSite-CS in predicting catalytic sites, compared to the generic predictor. Both FunSite and generic predictors were generated from 5fold cross-validation of the CS training dataset. FunSite-CS performs better with respect to both precision and recall and gives a higher area under the PR curve (PR AUC). Other evaluation metrics calculated on the CS training set are shown in Supplementary Table S4. Figure 3b shows the performance of FunSite-CS compared to other predictors on the CS holdout set of 106 domains. The FunSite-CS predictor performs competitively with the meta-predictor CSmetapred and performs better than the generic CS predictor and the individual catalytic site predictors (CRpred, EXIA2 and DISCERN). The distribution of prediction probabilities by FunSite-CS predictor on the holdout set is shown in Supplementary Figure S3a. It is important to note that the generic predictor also outperforms the publicly available predictors. This is most likely due to the use of a larger training set for the generic predictor and increase in the quality of catalytic site annotations (Ribeiro et al. 2017). (e) (f) Figure 3c shows the performance of the FunSite-LIG predictor from 5-fold cross-validation of the LIG training set compared to the generic LIG predictor. Neither predictor perform well, as shown by the shape and area under the precision-recall curves (AUC of 0.563 for FunSite-LIG predictor and 0.497 for FunSite-LIG predictor on the hold out test set). However, when the LIG metal dataset (a subset of the LIG dataset that only contains metal-binding sites) was used in a similar manner, to predict metal-binding sites, a large improvement was observed on the performance of both the FunSite-LIG metal and generic LIG metal predictors (Supplementary Figure S4). Other evaluation metrics calculated on the LIG test dataset are shown in Supplementary table S4.

FunSite-LIG
The comparatively lower performance of the predictors for the LIG set is probably due to variations in the characteristics of the ligand binding set. For example, this could include (i) differences in the types of ligands represented in the dataset (e.g. cognate and non-cognate) (Kobren & Singh, 2019;Das & Orengo,2018) and (ii) differences in the characteristics of ligand binding between enzymes and non-enzymatic proteins. For example, it is known that non-cognate ligands can bind to different regions in a protein to those bound by the cognate ligands, reflecting different binding characteristics (Tyzack, Fernando, Ribeiro, Borkakoti, & Thornton, 2018). Figure 3d shows the performance of the LIG predictors on the LIG hold-out test set compared to the ConCavity ligand-binding site prediction method. The FunSite-LIG predictor shows the highest PR AUC, followed by the generic LIG predictor and then ConCavity. The FunSite-LIG metal predictor also performs better than the generic predictor on the LIG metal test set (Supplementary Figure S4). Figure 3e shows the performance of the FunSite-PPI and generic-PPI predictors derived from 5-fold cross-validation of the PPI training dataset. FunSite-PPI outperforms the generic predictor to a greater extent than both FunSite-CS and FunSite-LIG. This shows that the FunFam based conservation-based features and other structural and sequence features of the FunSite-PPI predictor make a more powerful contribution towards distinguishing between protein interfaces and other residues, than the PSI-BLAST PSSM conservation feature and structure features captured by the generic predictor (an improvement in AUC of 0.112 for FunSite-PPI compared to generic-PPI, while FunSite-CS and FunSite-LIG had only an improvement of 0.033 and 0.084 respectively on hold out test sets).

FunSite-PPI
On the PPI holdout set, the FunSite predictor also outperformed the generic predictor and the meta-predictor meta-PPISP along with the individual PPI predictors (cons-PPISP, PINUP and PROMATE) ( Figure  3f). Other evaluation metrics calculated on the PPI test dataset are shown in Supplementary table S4.

Prediction using only FunFam and PSI-BLAST based features
FunFam derived features were both sequence and structure based while PSI-BLAST derived features were only sequence based (Section 2.2.2.1 and Table S3). We built 2 types of predictors using the FunFam features (one using only sequence information, another containing both FunFam derived sequence and structure information). We also built a predictor using only sequence derived PSI-BLAST features. The results showed that the predictors built using FunFam sequence features had similar performance compared to the predictor built using PSI-BLAST features ( Figure S8). However, when we included the additional structure information from FunFams, the predictor outperformed the PSI-BLAST predictor ( Figure S9). However, the performance was lower than the FunSite and generic predictors, suggesting that the additional features employed in those predictors are important i.e. the addition of other non-FunFam based sequence and structure-based features helped to improve performance.

Most influential features for FunSite predictions
The most important features for the FunSite predictions were analysed using Tree SHAP (Lundberg & Lee, 2017). SHAP combines a number of approaches to identify the influence of individual features on the prediction performance of tree ensemble methods. Feature influences are reported as SHAP values which are more consistent than classic feature importance metrics (Lundberg, Erion, & Lee, 2018). Higher SHAP  Table S5.

Prediction of different functional sites using CATH protein family features
values for features correspond to higher log-odds ratios that a residue is a functional site. Figure 4 shows the top 25 features for the FunSite-CS, -LIG and -PPI predictors ranked by SHAP values (see Supplementary Figure S5 for FunSite-LIG metal ). The top features are conservation scores, hydration potential, pocket information for CS sites; conservation scores, pocket information, solvent accessibility for LIG sites; and solvent accessibility, residue curvature for PPI sites.

FunSite predictions for the combined dataset
The full FunSite prediction protocol (Figure 1) was run on the combined dataset of 175 enzyme domains that have CSA, LIG and PPI functional site annotations. Prediction scores for each FunSite predictor were generated for all residues. To do this the three separate FunSite predictors were independently trained on the CS, LIG and PPI training datasets, respectively, from which the 175 query domains had been removed. As in the previous tests, positively predicted residues that had no other surrounding predicted residue of the same type were filtered out.
For each domain in the dataset, residues were assigned three ranks, based on the prediction scores from each of the three FunSite predictors. Only positively predicted residues, with predicted probability ≥ 0.5, for each site were ranked. The top ranked residues, up to a maximum of 20 residues, for each type of functional site were generated as FunSite predictions. As examples, Figure 5, Figure S10 and Figure S11 shows the FunSite predictions obtained for the12asA00, 4mdhA02 and 1itqA00 domains respectively. The final FunSite predictions (i.e. top 20 predicted residues for each predictor) were found to be significantly more enriched (Table 1, Supplementary Figure S6) in known functional sites compared to FunSite predictions which hadn't been filtered by removing likely false positives. They were also enriched compared with sets obtained by simply predicting that the conserved sites in FunFams are functional sites. The filtering step was most valuable for removing false positives in the prediction of PPI sites (Table 1).

FunSite-CS
Supplementary Figure S7 shows the pairwise plot of the prediction scores for the three different types of predicted sites in the 175 domains of the combined dataset. It can be seen that the prediction scores for CS sites are highly correlated with LIG sites. PPI sites do not show any positive correlation with either CS or LIG sites, although some sites have a high PPI score and either a high CS or high LIG score. This will occur in proteins that have an active site located between two or more domains (Ali, 2005;Der, 2012). There are also cases where LIG sites overlap with PPI sites (Mohamed, Degac, & Helms, 2015).

Conclusions
We have developed predictors for three types of functional site residues in proteins: catalytic, ligand-binding and protein interface residues. Our prediction methods exploit gradient boosted decision trees and a range of features based on sequence, structure and protein families. Features include conservation scores for residues, measured using entropy-based analysis of the multiple sequence alignment (MSA) of the functional family to which the protein domain has been assigned. Conservation analysis of residues known to be functional sites showed that catalytic sites were associated with high conservation scores. Ligand binding sites also exhibit high conservation scores, although not as high as catalytic residues. Protein interface residues, whilst showing some degree of conservation, exhibit the lowest level of conservation out of all three functional types.
Our method is focused at the domain level since FunFams are classified at the domain level and domains often have a particular functional role that is independent of the multi-domain context (Bashton & Chothia, 2007). In the context of catalytic residues, studies of enzyme families have shown that the majority of catalytic residues typically lie in one of the domains in a multi-domain protein (Furnham et al., 2016). In addition to using conservation features obtained from entropy analysis of FunFam MSAs, we also explored a range of other structure-and sequence-based features typically used in the prediction of protein functional sites. SHAP analysis revealed that conservation scores figured highly for catalytic and ligand binding residues, together with pocket information, whilst accessibility and curvature were more important than conservation for protein interface residues, although conservation was a contributing feature. However, we observed that the difference in performance between the FunSite predictor and the generic predictor was greatest for protein interface residues, suggesting that the FunSite based features were better able to distinguish interface/non-interface than those used by the generic predictor. The SHAP plots indicate that both FunFam and non-FunFam based sequence and structure features are important for the prediction of functional sites.
A major challenge in assessing the performance of site predictors is the difficulty in obtaining appropriately annotated functional sites. Recent studies of ligand-binding data in the PDB revealed that in many cases the bound ligands were not cognate (Tyzack et al., 2018). In addition, databases can employ different definitions of these sites. This will introduce both false positives (i.e. from non-cognate ligand-binding residues) and false negatives (i.e. missing cognate ligand-binding residues) confounding performance measures. The improvement observed in prediction performance for the metal ligand binding predictor is quite likely to be partially attributable to the greater accuracy of the positive annotations in this case. Higher quality annotations of functional sites will improve the reported performance of prediction methods, as demonstrated by the performance of our generic predictors.
Despite reasonable performances for CSA and LIG prediction, predicting protein interface residues is challenging. This could be aided  Table 1. Results from one-sided Fisher's exact tests comparing the precision/enrichment of the prediction of functional sites using FunSite predictor with filtering, compared to no filtering, and compared to simply using conserved sites as predicted functional sites. by including features associated with co-varying residues (De Juan, Pazos, & Valencia, 2013) . However, these predictions require knowledge of the protein partner and exploiting that information was outside the scope of this analysis. Future work will explore the enhancements in performance which can be obtained by using this type of information.
The main novelty of our approach lies in the fact that we are able to consider the likelihood of residues in a query protein belonging to three different types of functional site and we allow for the possibility that a residue may have more than one functional role, such as contributing to both the binding of a domain or protein partner and to the binding of the substrate. We do not recommend any cut off on the confidence measure when using our predictor. We suggest that the user consider the top ranked predictions even if they are relatively low confidence, but they should be guided by their own judgement for predictions with low confidence.
Some methods (Gligorijevic et al., 2020), exploit large scale metagenome data to make function prediction. Such approaches rely on highly expanded sequence families (e.g. bacterial or universal families). Our method does not require large scale metagenomic data and can therefore be used for function prediction across a wider range of superfamilies.
Since our approach has the advantage over other publicly available methods of exploiting recent, more comprehensive and possibly more accurate data for training the predictors, we also compared our approach against a generic predictor which uses the same features as our predictor, except for the conserved residue features. These were obtained from multiple sequence alignments obtained by PSI-BLAST rather than FunFams. This allowed us to assess the benefits of deriving conservation data from more functionally pure sets of relatives. We found that our FunFam based predictors outperformed the generic predictor for all types of functional sites.
Future work involves extending our pipeline to other types of functional sites and applying our predictor to identify putative functional sites for all FunFams in CATH superfamilies with high-quality conservation data from high information content from sequence diverse relatives.