VHSE-Based Prediction of Proteasomal Cleavage Sites

Xie J, Xu Z, Zhou S, Pan X, Cai S, Yang L, et al. (2013) The VHSE-Based Prediction of Proteasomal Cleavage Sites. PLoS ONE 8(9): e74506. doi:10.1371/journal.pone.0074506

Abstract: "Prediction of proteasomal cleavage sites has been a focus of computational biology. Up to date, the predictive methods are mostly based on nonlinear classifiers and variables with little physicochemical meanings. In this paper, the physicochemical properties of 14 residues both upstream and downstream of a cleavage site are characterized by VHSE (principal component score vector of hydrophobic, steric, and electronic properties) descriptors. Then, the resulting VHSE descriptors are employed to construct prediction models by support vector machine (SVM). For both in vivo and in vitro datasets, the performance of VHSE-based method is comparatively better than that of the well-known PAProC, MAPPP, and NetChop methods. The results reveal that the hydrophobic property of 10 residues both upstream and downstream of the cleavage site is a dominant factor affecting in vivo and in vitro cleavage specificities, followed by residue’s electronic and steric properties. Furthermore, the difference in hydrophobic potential between residues flanking the cleavage site is proposed to favor substrate cleavages. Overall, the interpretable VHSE-based method provides a preferable way to predict proteasomal cleavage sites."

Peptide Formation

Peptide Formation

Image by GYassineMrabet (own work) [Public domain], via Wikimedia Commons

Hydrophobic, Steric, and Electronic Properties

Amino Acids

Amino acids grouped by electrically charged, polar uncharged, hydrophobic, and special case sidechains.

Each amino acid has a single letter designation.

Image by Dancojocari [CC BY-SA 3.0 or GFDL], via Wikimedia Commons

Protein Representation (FASTA)

Bradykinin is an inflammatory mediator. It is a peptide that causes blood vessels to dilate (enlarge) and therefore causes blood pressure to fall. A class of drugs called ACE inhibitors, which are used to lower blood pressure, increase bradykinin (by inhibiting its degradation), further lowering blood pressure.

>sp|P01042|KNG1_HUMAN Kininogen-1 OS=Homo sapiens GN=KNG1 PE=1 SV=2
MKLITILFLCSRLLLSLTQESQSEEIDCNDKDLFKAVDAALKKYNSQNQSNNQFVLYRIT
EATKTVGSDTFYSFKYEIKEGDCPVQSGKTWQDCEYKDAAKAATGECTATVGKRSSTKFS
VATQTCQITPAEGPVVTAQYDCLGCVHPISTQSPDLEPILRHGIQYFNNNTQHSSLFMLN
EVKRAQRQVVAGLNFRITYSIVQTNCSKENFLFLTPDCKSLWNGDTGECTDNAYIDIQLR
IASFSQNCDIYPGKDFVQPPTKICVGCPRDIPTNSPELEETLTHTITKLNAENNATFYFK
IDNVKKARVQVVAGKKYFIDFVARETTCSKESNEELTESCETKKLGQSLDCNAEVYVVPW
EKKIYPTVNCQPLGMISLMKRPPGFSPFRSSRIGEIKEETTVSPPHTSMAPAQDEERDSG
KEQGHTRRHDWGHEKQRKHNLGHGHKHERDQGHGHQRGHGLGHGHEQQHGLGHGHKFKLD
DDLEHQGGHVLDHGHKHKHGHGHGKHKNKGKKNGKHNGWKTEHLASSSEDSTTPSAQTQE
KTEGPTPIPSLAKPGVTVTFSDFQDSDLIATMMPPISPAPIQSDDDWIPDIQIDPNGLSF
NPISDFPDTTSPKCPGRPWKSVSEINPTTQMKESYYFDLTDGLS

Bradykinin Structure

Bradykinin

By Yikrazuul (Own work) [Public domain], via Wikimedia Commons

MHC Class I Processing

The proteasome digests polypeptides into smaller peptides 5–25 amino acids in length and is the major protease responsible for generating peptide C termini.

Transporter associated with Antigen Processing (TAP) binds to peptides 9–20 amino acids in length and transports them into the endoplasmic reticulum (ER).

Image by Scray (own work), CC BY-SA 3.0

The principal component score vector of hydrophobic, steric, and electronic properties (VHSE) is a set of amino acid descriptors introduced by Mei et al. in "A new set of amino acid descriptors and its application in peptide QSARs":

  • VHSE1 and VHSE2 are related to hydrophobic (H) properties,
  • VHSE3 and VHSE4 to steric (S) properties, and
  • VHSE5 to VHSE8 to electronic (E) properties.
In [2]:
# (3-letter, VHSE1, VHSE2, VHSE3, VHSE4, VHSE5, VHSE6, VHSE7, VHSE8)
vhse = {
"A": ("Ala", 0.15, -1.11, -1.35, -0.92, 0.02, -0.91, 0.36, -0.48),
"R": ("Arg", -1.47, 1.45, 1.24, 1.27, 1.55, 1.47, 1.30, 0.83),
"N": ("Asn", -0.99, 0.00, -0.37, 0.69, -0.55, 0.85, 0.73, -0.80),
"D": ("Asp", -1.15, 0.67, -0.41, -0.01, -2.68, 1.31, 0.03, 0.56),
"C": ("Cys", 0.18, -1.67, -0.46, -0.21, 0.00, 1.20, -1.61, -0.19),
"Q": ("Gln", -0.96, 0.12, 0.18, 0.16, 0.09, 0.42, -0.20, -0.41),
"E": ("Glu", -1.18, 0.40, 0.10, 0.36, -2.16, -0.17, 0.91, 0.02),
"G": ("Gly", -0.20, -1.53, -2.63, 2.28, -0.53, -1.18, 2.01, -1.34),
"H": ("His", -0.43, -0.25, 0.37, 0.19, 0.51, 1.28, 0.93, 0.65),
"I": ("Ile", 1.27, -0.14, 0.30, -1.80, 0.30, -1.61, -0.16, -0.13),
"L": ("Leu", 1.36, 0.07, 0.26, -0.80, 0.22, -1.37, 0.08, -0.62),
"K": ("Lys", -1.17, 0.70, 0.70, 0.80, 1.64, 0.67, 1.63, 0.13),
"M": ("Met", 1.01, -0.53, 0.43, 0.00, 0.23, 0.10, -0.86, -0.68),
"F": ("Phe", 1.52, 0.61, 0.96, -0.16, 0.25, 0.28, -1.33, -0.20),
"P": ("Pro", 0.22, -0.17, -0.50, 0.05, -0.01, -1.34, -0.19, 3.56),
"S": ("Ser", -0.67, -0.86, -1.07, -0.41, -0.32, 0.27, -0.64, 0.11),
"T": ("Thr", -0.34, -0.51, -0.55, -1.06, 0.01, -0.01, -0.79, 0.39),
"W": ("Trp", 1.50, 2.06, 1.79, 0.75, 0.75, -0.13, -1.06, -0.85),
"Y": ("Tyr", 0.61, 1.60, 1.17, 0.73, 0.53, 0.25, -0.96, -0.52),
"V": ("Val", 0.76, -0.92, 0.17, -1.91, 0.22, -1.40, -0.24, -0.03)}

Creating the In Vivo Data

To create the in vivo training set, the authors

  • Queried the AntiJen database (7,324 MHC-I ligands)
  • Removed ligands with an unknown source protein in ExPASy/SWISS-PROT (6,036 MHC-I ligands)
  • Removed duplicate ligands (3,148 ligands)
  • Removed the 231 ligands used as test samples by Saxova et al. (2,917 ligands)
  • Removed sequences shorter than 28 residues (2,607 ligands) to create the cleavage sample set
  • Assigned non-cleavage sites and removed sequences shorter than 28 residues (2,480 ligands) to create the non-cleavage sample set

This process created 5,087 training samples: 2,607 cleavage and 2,480 non-cleavage samples.

Creating Samples from Ligands and Proteins

The C-terminus of the ligand is assumed to be a cleavage site and the midpoint between the N-terminus and C-terminus is assumed to not be a cleavage site.

Both the cleavage and non-cleavage sites are at the center position of each sample.
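
The paper's extraction code is not shown; a minimal sketch of this windowing step, assuming a hypothetical source-protein string and the ligand's start/end indices within it, might look like:

def extract_samples(source, ligand_start, ligand_end, flank=14):
    """Return a (cleavage, non_cleavage) pair of 28-residue windows, with None
    where the source protein cannot supply a full window around the site."""
    def window(site):
        # 'site' indexes the bond after residue site-1; take 'flank' residues
        # on each side so the site sits at the center of the sample.
        if site < flank or site + flank > len(source):
            return None
        return source[site - flank:site + flank]
    cleavage_site = ligand_end                            # bond after the ligand's C-terminal residue
    non_cleavage_site = (ligand_start + ligand_end) // 2  # midpoint, assumed not cleaved
    return window(cleavage_site), window(non_cleavage_site)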

Format of Training Data

  • Each Sequence is 28 residues long; however, the authors found the best performance using 20 residues.
  • The Activity is 1 for cleavage and -1 for no cleavage.
  • There are 28 * 8 = 224 features in the raw training set (columns are named VHSE<position><descriptor>, so VHSE11 is descriptor 1 at position 1 and VHSE288 is descriptor 8 at position 28); a sketch of this encoding follows the list.
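
The dataset_to_X_y helper used below is presumably defined in an earlier, unshown cell; a minimal sketch of the per-sequence encoding it performs (with encode_sequence as a hypothetical stand-in) is:

import numpy as np

def encode_sequence(seq, sequence_len=20):
    # Keep sequence_len residues centered on the (non-)cleavage site, then map
    # each residue to its 8 VHSE descriptors from the vhse table above.
    trim = (len(seq) - sequence_len) // 2
    trimmed = seq[trim:len(seq) - trim] if trim > 0 else seq
    return np.array([vhse[aa][1:] for aa in trimmed]).ravel()

# e.g. encode_sequence("AAAFCSAMYVGDLCGSVELVGQLFTFSP").shape == (160,)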
In [5]:
training_set = pd.DataFrame.from_csv("data/proteasomal_cleavage/s2_in_vivo_mhc_1_antijen_swiss_prot_dataset.csv")
print training_set.head(3)
                       Sequence  Name  Activity  VHSE11  VHSE12  VHSE13  \
0  AAAFCSAMYVGDLCGSVELVGQLFTFSP  cv-1         1    0.15   -1.11   -1.35   
1  AACHKCIDFYSRIRELRHYSDSVYGDTL  cv-2         1    0.15   -1.11   -1.35   
2  AACRAAGLQDCTMLVNGDDLVVICESAG  cv-3         1    0.15   -1.11   -1.35   

   VHSE14  VHSE15  VHSE16  VHSE17   ...     VHSE277  VHSE278  VHSE281  \
0   -0.92    0.02   -0.91    0.36   ...       -0.64     0.11     0.22   
1   -0.92    0.02   -0.91    0.36   ...       -0.79     0.39     1.36   
2   -0.92    0.02   -0.91    0.36   ...        0.36    -0.48    -0.20   

   VHSE282  VHSE283  VHSE284  VHSE285  VHSE286  VHSE287  VHSE288  
0    -0.17    -0.50     0.05    -0.01    -1.34    -0.19     3.56  
1     0.07     0.26    -0.80     0.22    -1.37     0.08    -0.62  
2    -1.53    -2.63     2.28    -0.53    -1.18     2.01    -1.34  

[3 rows x 227 columns]

Creating the Linear SVM Model

The authors compared linear, polynomial, radial basis, and sigmoid kernels and found no significant difference in performance. The linear kernel was chosen for its simplicity and interpretability. The authors did not provide the C value used in their linear model, so I used GridSearchCV to find the best value.

In [6]:
from sklearn.model_selection import GridSearchCV
def create_linear_svc_model(parameters, sequence_len = 20, use_vhse = True):
    scaler = MinMaxScaler()
    (X_train_unscaled, y_train) = dataset_to_X_y(training_set, \
                                             "Sequence", "Activity", \
                                             sequence_len = sequence_len, \
                                             use_vhse = use_vhse)
    X_train = pd.DataFrame(scaler.fit_transform(X_train_unscaled))
    svr = svm.LinearSVC()
    clf = GridSearchCV(svr, parameters, cv=10, scoring='accuracy')
    clf.fit(X_train, y_train)
    print("The best parameters are %s with a score of %0.2f" \
          % (clf.best_params_, clf.best_score_))
    return (scaler, clf)

(vhse_scaler, vhse_model) = create_linear_svc_model(parameters = {'C': [0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30, 100]})
The best parameters are {'C': 1} with a score of 0.77

Testing In Vivo SVM Model

In [7]:
def test_linear_svc_model(scaler, model, sequence_len = 20, use_vhse = True):
    testing_set = pd.DataFrame.from_csv("data/proteasomal_cleavage/s6_in_vivo_mhc_1_ligands_dataset.csv")
    (X_test_prescaled, y_test) = dataset_to_X_y(testing_set, \
                                                "Sequences", "Activity", \
                                                sequence_len = sequence_len,\
                                                use_vhse = use_vhse)
    X_test = pd.DataFrame(scaler.transform(X_test_prescaled))
    y_predicted = model.predict(X_test)
    accuracy = 100.0 * metrics.accuracy_score(y_test, y_predicted)
    ((tn, fp), (fn, tp)) = metrics.confusion_matrix(y_test, y_predicted, labels=[-1, 1])
    sensitivity = 100.0 * tp/(tp + fn)
    specificity = 100.0 * tn/(tn + fp)
    mcc = metrics.matthews_corrcoef(y_test, y_predicted)
    print "Authors reported performance"
    print "Acc: 73.5, Sen: 82.3, Spe: 64.8, MCC: 0.48"
    print "Notebook performance (sequence_len=%d, use_vhse=%s)" % (sequence_len, use_vhse)
    print "Acc: %.1f, Sen: %.1f, Spe: %.1f, MCC: %.2f" \
    %(accuracy, sensitivity, specificity, mcc)
    
test_linear_svc_model(vhse_scaler, vhse_model)
Authors reported performance
Acc: 73.5, Sen: 82.3, Spe: 64.8, MCC: 0.48
Notebook performance (sequence_len=20, use_vhse=True)
Acc: 72.7, Sen: 82.2, Spe: 63.2, MCC: 0.46

Comparing Linear SVM to PAProC, FragPredict, and NetChop

Performance

Interpreting Model Weights

The VHSE1 variable at the P1 position has the largest positive weight coefficient (10.49), in line with research showing that C-terminal residues are usually hydrophobic to aid in ER transfer and binding to the MHC molecule.

The coefficients are mostly positive upstream of the cleavage site and mostly negative downstream of it. This difference in hydrophobic potential across the cleavage site appears to be conducive to cleavage.
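
With scikit-learn the weights can be read straight off the fitted model. A sketch, assuming the vhse_model GridSearchCV object from above and the position-major feature ordering suggested by the column names earlier (note these weights are for the min-max-scaled features, so they will not match the paper's raw values exactly):

import numpy as np

# Weights of the best linear SVC, reshaped to (position, VHSE descriptor).
coefs = vhse_model.best_estimator_.coef_.ravel().reshape(20, -1)
vhse1_weights = coefs[:, 0]   # hydrophobic (VHSE1) weight at each position
print(np.round(vhse1_weights, 2))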

PCA vs. full matrices

  • Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components (Wikipedia).
  • Mei et al., creators of the VHSE, applied PCA to 18 hydrophobic, 17 steric, and 15 electronic properties. The first 2, 2, and 4 principal components account for 74.33%, 78.68%, and 77.9% of the variability in the original matrices, respectively (a small sklearn sketch follows the table below).
  • The authors of this paper used only the first principal component from each of the hydrophobic, steric, and electronic matrices.
  • What performance would the authors have found if they had used the full matrices instead of the PCA features?
Matrix   Features     Sensitivity   Specificity   MCC
VHSE     3x20=60      82.2          63.2          0.46
Full     50x20=1000   81.2          64.1          0.46
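
To illustrate where those explained-variance figures come from, here is a minimal sklearn sketch; the property matrix is a random placeholder, not Mei et al.'s actual data:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Placeholder for the real (20 amino acids x 18 hydrophobic properties) matrix.
hydrophobic_props = np.random.rand(20, 18)

# Standardize the properties, then check how much variance the first two
# principal components (i.e. VHSE1 and VHSE2) would capture.
pca = PCA(n_components=2)
pca.fit(StandardScaler().fit_transform(hydrophobic_props))
print(pca.explained_variance_ratio_.sum())  # ~0.74 on the real hydrophobic matrix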

Chipper Implementation

The Chipper notebook auto-generates training sets; e.g., v0.3.0 was trained with 26,166 samples vs. the 5,087 used in this study.

Chipper Performance

Chipper uses the full set of 50 properties per residue (18 hydrophobic, 17 steric, and 15 electronic) instead of the VHSE descriptors.

Method               Sensitivity   Specificity   MCC      Notes
PAProC               45.6          30.0          -0.25
FragPredict          83.5          16.5          0.00
NetChop 1.0          39.8          46.3          -0.14
NetChop 2.0          73.6          42.4          0.16
NetChop 3.0          81.0          48.0          0.31
VHSE-based SVC       82.3          64.8          0.48     3x20=60 features
chipper VHSE (SVC)   79.3          72.6          0.520    3x20=60 features
chipper VHSE (LR)    83.2          69.2          0.529    3x20=60 features
chipper (SVC)        79.3          77.9          0.572    50x20=1000 features
chipper (LR)         87.0          74.5          0.620    50x20=1000 features
xgboost              87.0          74.5          0.620    50x20=1000 features
FANN                 82.7          78.8          0.616    50x20=1000 features

Chipper LR Performance

LR ROC Curve