
Protein PTM site prediction




Is there any in silico analysis method to predict post-translational modification sites on a given protein?


There are many such tools available; I have used some of the ones listed below. ExPASy and the Center for Biological Sequence Analysis also maintain extensive lists of other services in this field.


Incorporating convolutional neural networks and sequence graph transform for identifying multilabel protein Lysine PTM sites

A computational method for identifying multiple Lysine posttranslational modification sites with high performance.

Learning features are extracted by using graph transformation from protein sequences.

Optimizing hyper-parameters for deep convolutional neural networks.

Compared with the state-of-the-art methods, our method had a significant improvement in all of the measurement metrics.

A basis for further research that can improve the protein function predictions using graph transformation and deep learning.


Annals of Proteomics and Bioinformatics

Md. Mehedi Hasan 1* and Mst. Shamima Khatun 2

1 Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, Fukuoka 820-8502, Japan
2 Laboratory of Bioinformatics, Department of Statistics, University of Rajshahi, Rajshahi, Bangladesh

*Address for Correspondence: Md. Mehedi Hasan, Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, Fukuoka 820-8502, Japan, Email: [email protected]

Dates: Submitted: 27 February 2018 Approved: 01 March 2018 Published: 02 March 2018

How to cite this article: Hasan MM, Khatun MS. Prediction of protein Post-Translational Modification sites: An overview. Ann Proteom Bioinform. 2018 2: 049-057. DOI: 10.29328/journal.apb.1001005

Copyright: © 2018 Hasan MM, et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


NetPhos 3.1 Server

ATM, CKI, CKII, CaM-II, DNAPK, EGFR, GSK3, INSR, PKA, PKB, PKC, PKG, RSK, SRC, cdc2, cdk5 and p38MAPK.


NOTE: the online service at http://www.cbs.dtu.dk/services/NetPhosK is currently off-line;
for kinase-specific predictions, this service should be used instead.

CITATIONS

For publication of results, please cite:

Sequence- and structure-based prediction of eukaryotic protein phosphorylation sites.
Blom, N., Gammeltoft, S., and Brunak, S.
Journal of Molecular Biology: 294(5): 1351-1362, 1999.

Kinase specific predictions:

Prediction of post-translational glycosylation and phosphorylation of proteins from the amino acid sequence.
Blom N, Sicheritz-Ponten T, Gupta R, Gammeltoft S, Brunak S.
Proteomics: 4(6): 1633-1649, 2004 (review).


Methods

PTMselect overview

PTMselect determines the optimal set of proteases to improve global coverage of protein modification discovery by MS analysis by simulating parallel digestions with all possible combinations of proteases. Four types of optimizations can be performed with PTMselect:

Global modified site coverage discovery for at least one protein: all modified sites are considered to have equal importance and PTMselect calculates the best digestion settings to obtain the largest number of modifications.

Predicted modified site coverage discovery for at least one protein: modified sites with the highest probability to be modified receive the highest scores. PTMselect computes the digestion setting to match the largest number of sites with a high probability to be modified.

Targeted modified site discovery for at least one protein: a list of target modification positions is given by the user for each protein. PTMselect optimizes the discovery of the largest number of modified sites in the lists or the total number of targeted proteins, i.e. the proteins with at least one target modification.

The last possibility is to combine global, predicted and targeted optimization for any number of proteins and any modifications.

PTMselect selects or rejects the modified peptides of a digestion setting according to their length. Indeed, a mismatch between the in silico tryptic peptide length distribution and the optimal peptide length for successful mass spectrometry is always observed 5 . By default, PTMselect performs simulations with a peptide length of 7 to 40 amino acids, which is a good initial setting for human cell analysis by MS in our and others' experience 5 . This range can be adjusted by the user.

PTMselect usability

PTMselect has been developed with usability and speed in mind.

The PTMselect basic tutorial (supplemental video PhosphoSelect_Basic_Tutorial_and_Install_v3.mp4) shows that PTMselect can be installed within minutes on MS Windows. The main task of the user is to download the protein Fasta files, then start PTMselect and enter the number of parallel digestions to simulate.

The PTMselect advanced tutorial (supplemental video VideoTutorial2_TCRpathway_v3.mp4) shows that the simulation of the best digestion settings for detection of the phosphorylations regulating an entire signaling pathway is easy too. All protein fasta files are copied in the fasta directory, and the target phosphosite files are simple text files with the positions of the targets sites in the protein sequence. Results are obtained within seconds.

The amino acids carrying the PTM can be easily changed. An unlimited number of amino acids can be targeted, which also allows the detection of many modification sites with multiple modifications to be optimized simultaneously.

PTMselect algorithm

PTMselect input

PTMselect processes protein sequences in FASTA format (Fig. 1a). Two additional types of files can optionally be loaded and processed by PTMselect:

Prediction tables with modified positions and their prediction scores. These tables can be obtained from any prediction tool. PTMselect is compatible by default with PhosphoPICK 11 . For each phosphosite of a given peptide, PTMselect sums the phosphosite “combined-score” of PhosphoPICK to calculate the global predicted score of the peptide.

Lists of target modification site positions. These lists are text files containing known modification site positions mandatory for the biologist’s project, for example the phosphosites involved in a signaling pathway (Fig. 1c).

In silico protein digestion and peptide filtering

PTMselect asks the user to enter the maximum number n of parallel digestions to simulate. PTMselect begins by calculating all combinations of up to n proteases, starting from one ([1], [2], ..., [1, 2], [1, 3], ...). Then, for each combination, it performs in silico parallel digestions of the protein. By default, PTMselect uses 8 proteases and CNBr. This list can be reduced or extended if necessary. It then removes peptides that contain no modification site or fall outside the peptide length range.
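The enumeration and filtering step can be illustrated with a minimal Python sketch. It is not PTMselect's code (which is written in Julia). The cleavage rules are simplified assumptions (cut after the listed residues, no missed cleavages, no proline exception), and the names CUT_AFTER, digest, covered_sites and rank_settings are hypothetical.

```python
from itertools import combinations

# Simplified, assumed cleavage rules: each "protease" cuts after these residues.
CUT_AFTER = {
    "Trypsin-like": "KR",
    "Lys-C-like": "K",
    "Arg-C-like": "R",
    "CNBr-like": "M",
}

def digest(seq, cut_after):
    """Return 1-based inclusive (start, end) peptide spans, no missed cleavages."""
    spans, start = [], 0
    for i, aa in enumerate(seq):
        if aa in cut_after or i == len(seq) - 1:
            spans.append((start + 1, i + 1))
            start = i + 1
    return spans

def covered_sites(seq, sites, proteases, min_len=7, max_len=40):
    """Sites found in at least one peptide of acceptable length
    across the parallel digestions of one protease combination."""
    covered = set()
    for name in proteases:
        for start, end in digest(seq, CUT_AFTER[name]):
            if min_len <= end - start + 1 <= max_len:
                covered.update(s for s in sites if start <= s <= end)
    return covered

def rank_settings(seq, sites, n):
    """Score every combination of 1..n proteases, best coverage first."""
    results = []
    for k in range(1, n + 1):
        for combo in combinations(sorted(CUT_AFTER), k):
            results.append((len(covered_sites(seq, sites, combo)), combo))
    return sorted(results, key=lambda r: (-r[0], len(r[1]), r[1]))
```

Calling rank_settings(sequence, site_positions, n) returns (coverage, combination) tuples; ties are broken in favour of fewer proteases, which is a design choice of this sketch rather than something stated in the text.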

Score calculation

PTMselect calculates five scores: maximal, electron transfer dissociation (ETD), collision induced dissociation (CID), predicted matched and predicted unmatched.

The maximal score is the total number of modified sites in the protein.

The ETD score is the total number of modified sites in the peptides after digestion and filtering. Indeed, any labile modified site can be attributed unambiguously by electron transfer dissociation 19,20 .

The CID score. Labile modified sites cannot always be attributed unambiguously when modified peptides are analyzed by collision induced dissociation 19,20 because spectra are often dominated by large neutral loss peaks compromising reliable site-specific identification 21 . That is why PTMselect gives more weight to mono modified peptides in the CID score calculation.

The CID score of a peptide with n modified sites is:

The score of the entire protein with k modified peptides is:

The predicted matched score is the sum of each individual modified site score predicted by a prediction software for all peptides selected after digestion and filtering.

The predicted unmatched score is the sum of each individual modified site score predicted by a prediction software for all peptides rejected after digestion and filtering.
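Four of the five scores follow directly from the definitions above and can be sketched in Python. The CID score is left out because its formula is not reproduced in this text. The data layout (peptides as dictionaries with "sites" and "kept" keys) and the function name ptm_scores are hypothetical, and counting each site only once in the ETD score is an assumption.

```python
def ptm_scores(all_sites, peptides, predicted):
    """all_sites: modified positions in the protein.
    peptides: list of {"sites": [...], "kept": bool} after digestion and filtering.
    predicted: position -> prediction score from an external predictor."""
    kept = [p for p in peptides if p["kept"]]
    rejected = [p for p in peptides if not p["kept"]]
    return {
        # Total number of modified sites in the protein.
        "maximal": len(set(all_sites)),
        # Modified sites present in retained peptides (unique sites: assumption).
        "etd": len({s for p in kept for s in p["sites"]}),
        # Sum of per-site prediction scores over retained peptides.
        "predicted_matched": sum(
            predicted.get(s, 0.0) for p in kept for s in p["sites"]),
        # Sum of per-site prediction scores over rejected peptides.
        "predicted_unmatched": sum(
            predicted.get(s, 0.0) for p in rejected for s in p["sites"]),
    }
```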

Results output

The five modification scores (maximal, ETD, CID, predicted matched and unmatched) for each protease combination are exported in a table. PTMselect also calculates the number of mono-modified peptides, the number of target peptides accessible or not accessible and the corresponding lists of target site positions. A graphical map representing modified peptides and modification site positions is generated for each protease combination (Fig. 1d). The details of the modified sites in each peptide sequence and in the entire protein sequence are exported in a text file. PTMselect includes a summarizer able to process an unlimited number of score tables to calculate the sum of all the scores. When target modification sites are used, the summarizer builds a table with one target site by column. Thus, it is very easy to see which target sites are identifiable or not by a set of proteases.

PTMselect Benchmarks

Simulation time depends on the number of proteases and the size of the protein (cf. Supplemental Fig. S2). On a Linux 64 bit workstation with one CORE i7 processor the simulation time for 5 digestion settings out of 14, i.e. the simulation of 2379 protease combinations, was <6 sec for Lamin and <12 sec for Citron-kinase.

Simulations of parallel protease digestions

Protein sequences

We used six publicly available protein sequences to evaluate PTMselect (see Supplementary files). PD-1, p53, Huntingtin, Citron-kinase, Cortactin, and Lamin were chosen for their high phosphorylation level, size range, and biological relevance. Their fasta sequences were obtained from UniProt database 22 .

PTMselect simulations for six proteins

Parallel protease digestions were simulated for p53, PD-1, Huntingtin, Citron-kinase, Cortactin and Lamin using the default proteases list provided with PTMselect (8 proteases + CNBr). Up to five parallel digestions were simulated with a peptide size range from 7 to 40 amino acids (supplemental files).

PTMselect targeted analysis of the TCR signaling pathway

Fasta sequences of the proteins of the TCR signaling pathway were downloaded from the UniProt database 22 . Phosphosite positions for the proteins in this pathway were obtained from reference 17 and the PhosphoSitePlus website 23 . For each protein, a text file containing the target site positions was created and used as input in PTMselect (Fig. 1a). Fasta files and target sites were processed together in PTMselect to produce a score table for each protein. Each score table lists the target phosphosites that are identifiable and unidentifiable with each digestion setting. The PTMselect summarizer (Fig. 1a) automatically combined all score tables into a summary table. The summary table was then sorted by the number of identifiable target phosphosites in decreasing order, to identify the best digestion settings for MS analysis of the entire TCR pathway (supplemental files).

PTMselect prediction of multiple PTMs in a cross-talk example

The Fasta sequence of protein H3.1 (Mus musculus) was obtained from the UniProt database 22 . The N-terminal methionine was removed from the sequence. Parallel protease digestions were simulated for H3.1 using the default protease list provided with PTMselect (8 proteases + CNBr). To be able to analyse the cross-talk of K9 and K14 acetylation in the same peptide, we set the number of missed cleavages to 3 for Lys-C, Lys-N and Trypsin. The number of missed cleavages was 2 for Chymotrypsin, one for V8 and zero for Arg-C and Asp-N (Supplemental Fig. S5). To retain only peptides that contain both K9 and K14 and do not end at K14 (we assumed that lysine acetylation induces a missed cleavage at the modified lysine), peptides were filtered with the regular expression “KSTGGK.”. The dot after KSTGGK means that the peptide must not only contain the KSTGGK sequence but also extend beyond K14.
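The effect of the trailing dot can be shown with a short Python sketch, assuming the filter is the pattern KSTGGK followed by a dot, as the description implies. The peptide list is illustrative only and the name keep_crosstalk_peptides is hypothetical.

```python
import re

# "." requires at least one residue after K14, so peptides ending at K14 fail.
PATTERN = re.compile(r"KSTGGK.")

def keep_crosstalk_peptides(peptides):
    """Keep peptides that contain KSTGGK and continue past its last lysine."""
    return [p for p in peptides if PATTERN.search(p)]

peptides = ["KSTGGK", "KSTGGKAPR", "TKQTARKSTGGKAPRKQLATK", "KQLATK"]
kept = keep_crosstalk_peptides(peptides)
```

Here "KSTGGK" is rejected because nothing follows the final lysine, and "KQLATK" is rejected because it lacks the motif.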

Code availability

PTMselect was developed using the high performance cross-platform Julia language 24 for numerical computing. Files can be accessed at (https://sites.google.com/site/fredsoftwares/products/ptm-select). A manual for using PTMselect to perform phosphosites basic and advanced search can be found in supplemental files. Peptides alignment tool, PepAlign, and lists comparison tool, nwCompare 25 , used to calculate the PTMs concordance are freely available at (https://sites.google.com/site/fredsoftwares/products/pepalign) and (https://sites.google.com/site/fredsoftwares/products/nwcompare-julia).


3 Results

In this study, we manually compiled 199 inter-protein PTM cross-talk pairs from 82 protein pairs across 86 human proteins (see details in Section 2 and Supplementary Table S1). When counting the number of PTM cross-talk events in which each protein is involved (Supplementary Table S2), we found that a few proteins have many more events than the majority (median 4 events), notably CDC25C with 26 events, CDK1 with 22 events and AKT1 with 16 events. A few protein pairs also have more PTM cross-talk events than others (Supplementary Table S3); the largest number, 17 events, occurs between CDC25C and CDK1. We further mapped the PTM cross-talk onto a protein interaction network (Supplementary Fig. S1) and found that 47 of the 86 proteins form a sub-graph, suggesting an important role for PTM cross-talk in cell signaling and regulatory networks.

3.1 Sequence co-evolution at residue level and motif level

Sequence co-evolution is widely used to study the functional association between two amino acids, as it presents a conservation interdependence across species in complex ecological networks ( de Juan et al., 2013). Here, we explore the sequence co-evolution of inter-protein PTM cross-talk at both a single residue level and a 7-mer motif level.

We first used the NHD to measure how frequently two residues are conserved or mutated jointly across around 50 vertebrates. Figure 1A shows an example of AKT1 and prohibitin (PHB) across 20 vertebrates, with a cross-talk event between S473 on AKT1 and Y114 on PHB. As described in Section 2, only the species shared by both proteins are taken into account; thus Carlito syrichta is discarded because it is missing for PHB. Of the remaining 19 shared species, 17 have the same conservation state for both PTM residues (16 co-conserved and 1 co-mutated), which gives a residue co-evolution score of 17/19 for this example. Residue co-evolution scores were further calculated for 168 of the 199 cross-talk pairs and 8574 of the 11 585 control pairs. The remaining 31 cross-talk and 3011 control pairs lack this feature, either because one of the proteins does not have an MSA or because the amino acid of the input PTM does not match the MSA even when a shift of one or two positions is allowed. Comparing the available samples in these two datasets, we found that the cross-talk PTM pairs have significantly higher residue co-evolution than the control PTM pairs (mean: 0.807 versus 0.704, P < 10^-5 by permutation test, Fig. 1B).
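The 17/19 example can be reproduced with a small Python sketch. It assumes the score is simply the fraction of shared species in which both PTM residues have the same conservation state; the function name residue_coevolution and the species labels are hypothetical, and the exact NHD formulation is given in Section 2 of the original paper, not here.

```python
def residue_coevolution(states_a, states_b):
    """states_x maps species -> True if the PTM residue matches the human
    reference (conserved), False if mutated. Returns None without shared species."""
    shared = set(states_a) & set(states_b)
    if not shared:
        return None
    same = sum(states_a[s] == states_b[s] for s in shared)
    return same / len(shared)

# Toy data mirroring the counts in the text: 16 co-conserved, 1 co-mutated,
# 2 discordant species, plus one species present for protein A only.
a = {f"sp{i}": True for i in range(16)}
b = dict(a)
a["sp16"], b["sp16"] = False, False   # co-mutated
a["sp17"], b["sp17"] = True, False    # discordant
a["sp18"], b["sp18"] = False, True    # discordant
a["only_in_a"] = True                 # dropped: not shared
score = residue_coevolution(a, b)
```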

Based on the same MSA data, we extended the sequence co-evolution analysis from the residue level to the sequence motif level. For the same example of AKT1 and PHB (Fig. 2A), we first extracted the ±3 amino acids surrounding each PTM site as a 7-mer motif. For S473 on AKT1, the residues at positions −1 and 0 in Dipodomys ordii differ from the human reference, so the motif conservation for this species is 5/7 = 0.714. Similarly, we obtain motif conservation scores for all species shared by the two proteins, forming two motif conservation vectors. The motif co-evolution score is then calculated as the dot product of these two vectors, normalized by the number of common species. For the same sets of samples as at the residue level, i.e. 168 cross-talk pairs and 8574 control pairs, cross-talk PTM pairs also have significantly higher motif co-evolution than the control set (mean: 0.754 versus 0.679, P < 10^-5 by permutation test, Fig. 2B). Together, the two results suggest that sequence co-evolution at both the PTM residue level and the motif level can be a good indicator of PTM cross-talk between proteins.
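A minimal Python sketch of the motif-level calculation, under the assumption that motif conservation is the fraction of identical positions in the 7-mer and that the co-evolution score is the dot product divided by the number of shared species. The function names and the placeholder motifs are hypothetical and are not the real AKT1 or PHB sequences.

```python
def motif_conservation(ref_motif, motif):
    """Fraction of positions at which a species' 7-mer matches the human 7-mer."""
    return sum(r == m for r, m in zip(ref_motif, motif)) / len(ref_motif)

def motif_coevolution(cons_a, cons_b):
    """cons_x maps species -> motif conservation for protein x.
    Dot product over shared species, normalized by their number."""
    shared = sorted(set(cons_a) & set(cons_b))
    if not shared:
        return None
    return sum(cons_a[s] * cons_b[s] for s in shared) / len(shared)

# Placeholder motifs: positions -1 and 0 (indices 2 and 3) differ.
example = motif_conservation("AAASAAA", "AAGTAAA")
```

With two differing positions out of seven, example evaluates to 5/7, matching the Dipodomys ordii case in the text.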

3.2 Co-modification across different species and different conditions in human

Protein sequence conservation is probably effective for analyzing the functional importance of PTMs because it approximates the PTM conservation status across species. Thus, experimentally verified PTM status across multiple species can be very informative for studying the functions of PTMs and their interplay (Beltrao et al., 2012; Landry et al., 2009). Indeed, in our previous study (Huang et al., 2015), we showed that co-conservation of modifications among three species is potentially linked to the functional interplay between two PTMs within a protein and can be used to predict intra-protein PTM cross-talk. Here, we apply co-modification across Homo sapiens, Mus musculus and Rattus norvegicus to measure modification co-conservation. As in Huang et al. (2015), the co-modification score is the proportion of the three species in which the two PTMs are simultaneously conserved on the reference residues. Figure 3A shows an example of the modification status of two PTM pairs on the proteins AKT1 and PHB in the three species. The cross-talk pair between S473 on AKT1 and Y114 on PHB is co-modified in human and mouse, giving a co-modification score of 2/3, while the non-cross-talk pair, S475 on AKT1 and S121 on PHB, is co-modified only in human, scoring 1/3. Even though both PTM pairs have fully co-conserved residues across the three species, the co-modification levels differ and may imply different functional dependence. For fairness, we removed the 13 PTM cross-talk samples in which one or both PTMs are not included in the human PTM set in PhosphoSitePlus, leaving 186 cross-talk pairs and 11 585 control pairs for further analysis. Comparing these two sample sets, we found that the co-modification score across species is significantly higher in cross-talk pairs than in control pairs (mean: 0.507 versus 0.429, P < 10^-5 by permutation test, Fig. 3B).
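The 2/3 and 1/3 scores above can be sketched in Python, assuming the score is the fraction of the three species in which both sites are annotated as modified. The function name and the per-species flags are hypothetical; the flags are chosen only to be consistent with the scores quoted in the text and are not taken from a database.

```python
SPECIES = ("human", "mouse", "rat")

def co_modification_across_species(mod_a, mod_b, species=SPECIES):
    """mod_x maps species -> True if the site is annotated as modified there."""
    both = sum(mod_a.get(s, False) and mod_b.get(s, False) for s in species)
    return both / len(species)

# Assumed flags: the pair is co-modified in human and mouse, but not in rat.
crosstalk = co_modification_across_species(
    {"human": True, "mouse": True, "rat": True},
    {"human": True, "mouse": True, "rat": False},
)
# Assumed flags: the pair is co-modified in human only.
control = co_modification_across_species(
    {"human": True, "mouse": False, "rat": False},
    {"human": True, "mouse": True, "rat": False},
)
```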

Co-modification across species analysis of cross-talk PTMs. (A) Demonstration of co-modification across species with sequence alignments across human, mouse and rat. (B) Comparison of co-modification across species scores between cross-talk set (positive) and control set (negative)


Besides evolutionary conservation, the correlation of modification status across different conditions within one species can also suggest functional associations. In a previous study, we proposed a co-occurrence method to explore functional connections between PTM sites by calculating their tendency to be modified simultaneously across 88 different conditions in human (Li et al., 2017). Here, the same proteome-wide human phosphorylation dataset is used to measure the co-modification across conditions for inter-protein PTM pairs (see Section 2 for more details). Figure 4A shows two examples of co-modification across the 88 conditions: a cross-talk sample between Y412 on protein FGR (tyrosine-protein kinase Fgr) and Y281 on SLAF1 (signaling lymphocytic activation molecule), and a control sample between S132 on SHIP2 and Y281 on SLAF1. Their phosphorylation status (red: on, blue: off) across the 88 conditions is shown in the heatmap, from which we calculate the co-modification scores, i.e. −log10(p) in Fisher's exact test, obtaining 12.549 for the cross-talk sample and 0.397 for the control sample. As this feature is only available for phosphorylation-phosphorylation pairs, we only have co-modification scores for 87 of 199 cross-talk and 3040 of 11 585 control PTM pairs. Still, the cross-talk pairs show clearly higher co-modification across multiple conditions than the control pairs (mean: 2.111 versus 1.044, P < 10^-5 by permutation test, Fig. 4B), indicating that cross-talk PTM pairs are much more likely than random PTM pairs to reject the null hypothesis of independence. Together, the above two analyses reveal that co-modification across different species and different conditions can be predictive features for identifying inter-protein cross-talk pairs.
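The −log10(p) score can be sketched in Python with the standard library only. This version uses a one-sided Fisher's exact test (enrichment of conditions where both sites are "on"); whether the original analysis used a one-sided or two-sided test is not stated here, so that choice is an assumption, as are the function names.

```python
from math import comb, log10

def fisher_greater(a, b, c, d):
    """One-sided Fisher's exact p-value P(X >= a) for the table [[a, b], [c, d]]."""
    n, row1, col1 = a + b + c + d, a + b, a + c
    upper = min(row1, col1)
    tail = sum(comb(row1, k) * comb(n - row1, col1 - k)
               for k in range(a, upper + 1))
    return tail / comb(n, col1)

def co_modification_score(on_a, on_b):
    """on_x: per-condition booleans (True = phosphorylated). Returns -log10(p)."""
    a = sum(x and y for x, y in zip(on_a, on_b))          # both on
    b = sum(x and not y for x, y in zip(on_a, on_b))      # only site A on
    c = sum(y and not x for x, y in zip(on_a, on_b))      # only site B on
    d = sum(not x and not y for x, y in zip(on_a, on_b))  # both off
    return -log10(fisher_greater(a, b, c, d))
```

Two sites that switch on and off together across conditions yield a small p-value and therefore a high score, while sites that vary independently score near zero.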

Co-modification across different conditions analysis of cross-talk PTMs. (A) Demonstration of co-modification across 88 conditions for two PTM pairs (all phosphorylations; cross-talk: Y412 on FGR and Y281 on SLAF1; control: S132 on SHIP2 and Y281 on SLAF1; scores of 12.549 and 0.017, respectively). The specific information on the 88 conditions is listed in Supplementary Table S2. (B) Comparison of co-modification across different conditions scores between cross-talk set (positive) and control set (negative)


3.3 Integrative prediction of PTM cross-talk between proteins

As demonstrated above, the inter-protein PTM cross-talk pairs display evolutionary correlations at both the sequence level and the modification level. We therefore asked whether these four properties can be used to predict PTM cross-talk between proteins. First, we tested the discriminative power of each of the four features by 10-fold cross-validation. The area under the curve (AUC) values in Figure 5A show that sequence co-evolution at the PTM residue is the most discriminative feature (AUC = 0.785), and it also has a relatively low no-call rate: only 31 of 199 cross-talk and 3011 of 11 585 control pairs lack the residue co-evolution measure. The next best features are sequence motif co-evolution (168 cross-talk samples, AUC = 0.685) and co-modification across conditions (87 cross-talk samples, AUC = 0.654). By contrast, the performance of co-modification across species was relatively poor (186 cross-talk samples, AUC = 0.558), partly due to the incompleteness of PTM data in mouse and rat. We then asked whether integrating these four features improves prediction compared with using a single feature. For fairness, we used only the 76 cross-talk samples and 2593 control samples that have all four features to compare the single-feature models with the integrative model. Unsurprisingly, the performance of each single feature decreases slightly on this smaller dataset compared with using all available samples (see single features in Fig. 5A and B). However, the integration of three predictive features, i.e. the two sequence co-evolution features and co-modification across conditions, performs best and increases the AUC to 0.814 from 0.756 for the best single feature (residue co-evolution). Owing to its limited predictive power, co-modification across species fails to improve the integrative model when added, so we omit this feature from the integrative model.

Evaluating the performance of predicting PTM cross-talk using different feature combinations. Results of 10-fold cross-validation repeated 100 times are pooled to generate an overall ROC curve. (A) Evaluation is performed on all available samples for each feature (combination); the number of cross-talk samples is given in brackets. (B) Evaluation is performed on 76 cross-talk and 2593 control fully featured samples. Abbreviations: sequence residue co-evolution (Seq_residue), sequence motif co-evolution (Seq_motif), co-modification across species (PTM_species), co-modification across different conditions (PTM_conditions), both sequence co-evolution (Seq both), both co-modification (PTM both)


Although co-modification across conditions contributes substantially to the integrative model, a large number of samples lack this attribute. We therefore also recommend using only the two sequence co-evolution features for most PTM pair candidates. The sequence feature combination also provides more than double the cross-talk sample size compared with the combination that includes co-modification across conditions (168 versus 76). Additionally, Figure 5B suggests that in this small sample set, integrating residue and motif co-evolution performs better than either feature alone, although the improvement is marginal and needs to be examined more extensively.

3.4 Influence of PTM type bias on prediction performance

Among the 199 inter-protein PTM cross-talk pairs, 150 are cross-talk events between two phosphorylation sites (Table 1). In other words, the compiled cross-talk set is biased toward phosphorylation-phosphorylation pairs. It is not clear whether the prediction model can be used for PTM types that are absent from or underrepresented in the training set. To test the influence of PTM types, we trained MBRF models with only phosphorylation-phosphorylation cross-talk pairs (150 cross-talk pairs and 7312 control pairs) and tested the prediction performance on the remaining PTM types (49 cross-talk pairs and 4273 control pairs). Figure 6 shows that the phosphorylation-phosphorylation dataset is predictive for other PTM types (AUC = 0.777), even though only the two sequence co-evolution features are available. With a threshold of 0.65, the false positive rate can be as low as 9.7% with a true positive rate of 38.5%. This prediction is equivalent to an independent test, supporting the power of our method in predicting inter-protein PTM cross-talk and its robustness to PTM type bias.

Evaluating the robustness of the prediction model using biased training sets (phosphorylation-phosphorylation dataset). The ROC curves of the MBRF classifier using the phosphorylation-phosphorylation dataset as training set and the rest as testing set. The false positive rate and true positive rate are given in brackets following the corresponding thresholds of 0.35, 0.5 and 0.65


3.5 PTM-X online server

Combining this approach with our previous intra-protein prediction method, we provide a web server named PTM-X for the prediction of intra- and inter-protein PTM cross-talk (http://bioinfo.bjmu.edu.cn/ptm-x/). The MBRF prediction model on the web site was trained with all human cross-talk and control pairs, for two types of feature combinations: (i) residue and motif sequence co-evolution and (ii) the addition of co-modification across conditions. Users can input candidate PTM pairs by specifying the protein UniProt accession number and the PTM positions on the protein sequences. The PTM-X server then gives a final prediction result for each PTM pair using the same feature combinations, displayed on the web page with a download link to a text file (see example in Supplementary Fig. S3). The input PTM pairs can be taken as potential cross-talk pairs if their prediction scores are higher than a given threshold. Generally, a strict threshold gives a lower false positive rate but more false negatives, while a more lenient threshold can be used to obtain more sensitive predictions. We provide an interface to facilitate this choice: if users click on a prediction score on the web page, the ROC curve from the 10-fold cross-validation appears and displays the false positive and true positive rates obtained when that prediction score is used as the threshold (Supplementary Fig. S3).
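The trade-off between strict and lenient thresholds can be made concrete with a short Python sketch that computes the false positive and true positive rates at a chosen cut-off. The function name and the toy scores are hypothetical; this is not the PTM-X server's code, and the strict "greater than" comparison follows the wording above.

```python
def rates_at_threshold(scores, labels, threshold):
    """Return (false positive rate, true positive rate) when pairs with
    score > threshold are called cross-talk. labels: 1 = cross-talk, 0 = control."""
    tp = sum(s > threshold and y == 1 for s, y in zip(scores, labels))
    fp = sum(s > threshold and y == 0 for s, y in zip(scores, labels))
    positives = sum(y == 1 for y in labels)
    negatives = sum(y == 0 for y in labels)
    return fp / negatives, tp / positives

# Toy prediction scores and labels, for illustration only.
scores = [0.9, 0.7, 0.4, 0.2, 0.8, 0.1]
labels = [1, 1, 0, 0, 0, 1]
strict = rates_at_threshold(scores, labels, 0.65)
lenient = rates_at_threshold(scores, labels, 0.05)
```

In this toy set, raising the threshold from 0.05 to 0.65 lowers the false positive rate at the cost of missing one true cross-talk pair.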


Database resources

MS and other experimental techniques have produced a large amount of PTM annotation data ( Figure 1), which are shared globally through databases. Each database has its own characteristics, with a different focus regarding the species type, from viruses to humans. Owing to the complexity and particularity of different PTMs, no database can provide a complete and comprehensive resource for PTM study [ 11]. For example, some databases contain data on a specific type of PTM, such as PhosphoBase [ 12] and O-glycobase [ 13], while others contain data on a variety of PTMs, such as UniProt [ 14] and HPRD [ 15]. Other commonly used databases [ 16–31] are shown in Table 1.



a Saw Swee Hock School of Public Health, National University of Singapore, Singapore
E-mail: [email protected]

b Center for Genomics and Systems Biology, Department of Biology, New York University, New York, NY 10003, USA

c Institute of Molecular and Cell Biology, Agency for Science, Technology, and Research, Singapore

Abstract

While tandem mass spectrometry can detect post-translational modifications (PTMs) at the proteome scale, reported PTM sites are often incomplete and include false positives. Computational approaches can complement these datasets with additional predictions, but most available tools use prediction models pre-trained by the developers for a single PTM type, and it remains difficult to perform large-scale batch prediction for multiple PTMs with flexible user control, including the choice of training data. We developed an R package called PTMscape, which predicts PTM sites across the proteome based on a unified and comprehensive set of descriptors of the physico-chemical microenvironment of modified sites, with additional downstream analysis modules to test enrichment of individual PTMs or pairs of PTMs in protein domains. PTMscape can process any major modification, such as phosphorylation and ubiquitination, while achieving sensitivity and specificity comparable to single-PTM methods and outperforming other multi-PTM tools. Applying this framework, we expanded proteome-wide coverage of five major PTMs affecting different residues by prediction, especially for lysine and arginine modifications. Using a combination of experimentally acquired sites (PSP) and newly predicted sites, we discovered that crosstalk among multiple PTMs occurs more frequently than expected by chance in key protein domains such as histone, protein kinase, and RNA recognition motifs, spanning various biological processes such as RNA processing, DNA damage response, signal transduction, and regulation of the cell cycle. These results provide a proteome-scale analysis of crosstalk among major PTMs, and the approach can be easily extended to other types of PTM.


Prediction of S-Sulfenylation Sites Using Statistical Moments Based Features via CHOU’S 5-Step Rule

Cysteine S-sulfenylation is a post-translational modification (PTM) of proteins that is important in cellular biology. It plays a significant role in protein function, cell signaling and transcriptional regulation. Predicting cysteine S-sulfenylation sites is therefore crucial for interpreting the molecular mechanisms of S-sulfenylation. In this study, a methodology based on statistical moments is proposed for cysteine S-sulfenylation site prediction. In tenfold cross-validation and independent tests, the proposed system achieved accuracy far better than current state-of-the-art methods. These outcomes indicate that features based on statistical moments can produce more efficient and effective results. To make the work accessible to the scientific community, we have developed a GitHub repository for the cysteine S-sulfenylation site prediction system, which is freely accessible at https://www.github.com/ahmad-umt/S-Sulfenylation.
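The idea of moment-based features can be illustrated with a short Python sketch that computes raw and central moments of a numerically encoded peptide. The integer encoding, the function names and the choice of moment orders are assumptions made here for illustration; the abstract does not describe the actual feature set of the published method.

```python
from typing import Dict, List

# Hypothetical encoding: each standard amino acid gets an arbitrary index 1..20.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
ENCODING: Dict[str, int] = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}


def encode(peptide: str) -> List[int]:
    """Map a peptide to integers; unknown residues (e.g. 'X') become 0."""
    return [ENCODING.get(aa, 0) for aa in peptide.upper()]


def raw_moment(values: List[int], order: int) -> float:
    """Raw moment of the given order, with all positions weighted equally."""
    return sum(v ** order for v in values) / len(values)


def central_moment(values: List[int], order: int) -> float:
    """Central moment of the given order (moment about the mean)."""
    mean = raw_moment(values, 1)
    return sum((v - mean) ** order for v in values) / len(values)


def moment_features(peptide: str, max_order: int = 3) -> List[float]:
    """Raw moments of order 1..max_order followed by central moments 2..max_order."""
    if not peptide:
        raise ValueError("peptide must not be empty")
    values = encode(peptide)
    raw = [raw_moment(values, k) for k in range(1, max_order + 1)]
    central = [central_moment(values, k) for k in range(2, max_order + 1)]
    return raw + central


features = moment_features("ACD")  # encoded as [1, 2, 3]
```

In a full pipeline, a fixed-length vector of this kind would be computed for the window around each candidate cysteine and passed to a classifier.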



Introduction

Post-translational modifications (PTMs) are alterations of the primary protein structure, including both new covalent links and cleavage events. Almost every protein in the cell undergoes modification during its lifetime [1] and more than 600 different amino acid modifications are catalogued in UniProtKB [2]. PTMs provide a way to expand the spectrum of protein functions as well as an additional layer for pathway regulation [3]. They are catalyzed by enzymes that identify a specific site in the substrate protein, with a plurality of PTM motifs residing in intrinsically disordered regions in order to facilitate enzyme accessibility [4]. Over the last few years, a deluge of methods has been proposed to predict PTM sites from sequence (for a recent review see, e.g., [5]). The reasons for this popularity are broadly twofold. First, given the paucity of experimental data for PTMs and their relevance for cellular regulation, there is a legitimate expectation that computational methods should fill the experimental void and serve as hypothesis generators for the effective design of PTM experiments. Second, their implementation is straightforward due to the sequence specificity and peculiar physico-chemical properties of PTM motifs. This simplicity makes PTM prediction from sequence easily accessible to machine learning methods, but also presents several potential pitfalls [6].

To be useful for experimentalists, PTM predictors should perform well and be robust. Performance should be high enough to keep false positives to a minimum while still yielding a sufficient number of correct predictions (true positives). Perhaps more importantly, a method should be robust enough to maintain its performance across a range of different datasets, as it is often unclear which experimental conditions may introduce biases. On both counts, PTM predictors may be problematic, as they are rarely assessed by independent third parties. Indeed, their ability to identify new modification sites has been questioned [7] and effective results have been obtained only for a few PTM types [5]. The problem of validating machine learning methods has already been raised and best practices have been proposed [6]. Self-reported accuracy may be overestimated, with PTM predictors overfitting and performing no better than random when the wrong training strategy is adopted [7]. Generalizing models for PTM site recognition is difficult because the number of experimental observations is low and many new types of motifs are still poorly characterized.
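To make the trade-off between false positives and true positives concrete, the following Python sketch computes common summary measures from confusion-matrix counts. The function name and the selection of metrics are illustrative assumptions, not the evaluation protocol of any predictor cited here.

```python
import math
from typing import Dict


def site_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Summary measures for a binary site predictor.

    tp/fp/tn/fn are counts of true/false positives and negatives.
    Ratios with a zero denominator are reported as 0.0 by convention.
    """
    def ratio(numerator: float, denominator: float) -> float:
        return numerator / denominator if denominator else 0.0

    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": ratio(tp, tp + fn),  # fraction of real sites recovered
        "specificity": ratio(tn, tn + fp),  # fraction of non-sites rejected
        "precision": ratio(tp, tp + fp),    # fraction of predictions that are real
        "mcc": ratio(tp * tn - fp * fn, mcc_denominator),
    }


# Toy counts: 10 real sites among 100 candidates.
good = site_metrics(tp=8, fp=2, tn=88, fn=2)
chance_like = site_metrics(tp=5, fp=45, tn=45, fn=5)
```

With these toy counts, the second predictor recovers half of the real sites but has a Matthews correlation coefficient of zero, which is the behaviour expected from random guessing. Because modified sites are usually far rarer than unmodified ones, precision and MCC tend to be more informative than plain accuracy.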

In this work, proline hydroxylation is taken as a case study to ask how useful PTM predictors, especially those trained on small datasets, are for designing experiments. Hydroxylation is one of the most abundant PTMs in the cell [8]. However, despite improvements in mass-spectrometry (MS) techniques, it is likely that only a small fraction of all hydroxylated sites has been experimentally detected so far.

Proline hydroxylation (PH) is a PTM carried out by prolyl hydroxylases, which catalyze the addition of a hydroxyl group to the sidechain pyrrolidine ring at the gamma position. This modification is crucial for correct folding of the collagen triple-helix, which contains the conserved xPG motif. PH also plays a crucial role in signaling, in particular in oxygen sensing pathways, angiogenesis [9] and tumor cell proliferation [10, 11]. An example is HIF-1α, the main target of the von Hippel-Lindau (pVHL) E3 ubiquitin ligase complex [12]. In normoxia, the prolyl hydroxylase domain-containing enzymes (PHDs) hydroxylate HIF-1α, promoting its degradation through pVHL binding [13]. Under low oxygen concentration, the PHDs are inactivated and HIF-1α translocates into the nucleus to activate vascular proliferation and angiogenesis genes [14].

The first hydroxylation predictor [15] was trained to predict only collagen modifications. Several further PH predictors exist as web servers: HydPred [16], PredHydroxy [17], RF-Hydroxysite [18], iHyd-PseAAC [19] and iHyd-PseCp [20]. The latter was not considered in our analysis because the server proved unstable, with frequent freezes. The stand-alone PH software packages OH-Pred [21], ModPred [4] and AMS3 [1] are also available. All are potential tools for large-scale analysis, taking only the protein sequence as input. Implementations include standard machine learning algorithms such as support vector machines, artificial neural networks and random forests, as well as alternative techniques such as logistic regression and probabilistic classifiers. All methods were trained on SwissProt [22] annotation, with varying strategies to define positive and negative examples and different approaches to evaluate model quality. None of the PH predictors used a truly independent dataset for validation, i.e. one unaffected by SwissProt biases.
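As a rough illustration of how sequence-only training examples can be assembled, the Python sketch below extracts a fixed-width window around every proline and labels it from a set of annotated positions. The function name, the window size, the padding character and the rule that unannotated prolines count as negatives are all assumptions made for this sketch; each predictor above uses its own strategy.

```python
from typing import List, Set, Tuple


def proline_windows(sequence: str, annotated: Set[int], flank: int = 3,
                    pad: str = "-") -> List[Tuple[int, str, int]]:
    """Return (position, window, label) for every proline in `sequence`.

    Positions are 1-based. `annotated` holds positions reported as
    hydroxylated. Label is 1 for annotated prolines and 0 otherwise
    (an assumption: unannotated sites may simply be undetected).
    """
    sequence = sequence.upper()
    padded = pad * flank + sequence + pad * flank
    examples = []
    for index, residue in enumerate(sequence):
        if residue != "P":
            continue
        # The residue at `index` sits at `index + flank` in the padded string,
        # so this slice is centred on the proline.
        window = padded[index:index + 2 * flank + 1]
        label = 1 if (index + 1) in annotated else 0
        examples.append((index + 1, window, label))
    return examples


# Toy collagen-like fragment with one annotated site at position 3.
examples = proline_windows("GPPGAPK", annotated={3}, flank=2)
```

Treating every unannotated proline as a negative is the simplest choice but adds label noise, which is one reason why differences in how negatives are defined make self-reported accuracies hard to compare.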

Here, we evaluate PH methods by considering collagen and signalling examples separately, as well as single-protein studies versus high-throughput mass-spectrometry (MS) experiments. The majority of new hydroxylated prolines (Hyp) come from two recently published MS experiments, one on HeLa cells and another from a large experiment involving multiple tissues and samples [23–25]. These datasets are unseen by the PH predictors being tested, as they were not yet available in public databases when the predictors were trained. The number of MS hydroxylated sites is comparable to that of the entire SwissProt database, and the new datasets allowed us to perform an unbiased blind test. A naïve HMM predictor trained with MS data included has also been implemented to simulate the effect of integrating new examples. The analysis presented here provides a starting point for a critical discussion of the problem of reliably predicting new PTMs.

