Network-based protein structural classification

 

Contact: Tijana Milenkovic, tmilenko AT nd DOT edu

Introduction: Experimental determination of protein function is resource-consuming. As an alternative, computational prediction of protein function has received attention. In this context, protein structural classification (PSC) can help: it determines the structural classes of currently unclassified proteins from their features, and then relies on the fact that proteins with similar structures have similar functions. Existing PSC approaches rely on sequence-based or direct 3-dimensional (3D) structure-based protein features. In contrast, we first model 3D structures of proteins as protein structure networks (PSNs). Then, we use network-based features for PSC. We propose the use of graphlets, state-of-the-art features in many research areas of network science, in the task of PSC. Moreover, because graphlets can deal only with unweighted PSNs, and because accounting for edge weights when constructing PSNs could improve PSC accuracy, we also propose a deep learning framework that automatically learns network features from weighted PSNs. When evaluated on a large set of ~9,400 CATH and ~12,800 SCOP protein domains (spanning 36 PSN sets), the best of our proposed approaches are superior to existing PSC approaches in terms of accuracy, with comparable running time.
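To make the PSN idea concrete, here is a minimal sketch; it is not the construction used in the paper. It assumes that each residue is represented by a single 3D point, that two residues are linked when their Euclidean distance is below a cutoff, and that the weighted variant stores that distance as the edge weight. The function name build_psn, the cutoff of 6.0, and the toy coordinates are illustrative assumptions only.

```python
import math


def build_psn(coords, cutoff=6.0):
    """Return (unweighted, weighted) adjacency matrices for one protein.

    coords: list of (x, y, z) points, one per residue (assumption of this sketch).
    cutoff: hypothetical distance threshold, not a value taken from the paper.
    """
    n = len(coords)
    unweighted = [[0] * n for _ in range(n)]
    weighted = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Euclidean distance between the two residue points.
            d = math.sqrt(sum((a - b) ** 2 for a, b in zip(coords[i], coords[j])))
            if d < cutoff:
                # Unweighted PSN: record only that a contact exists.
                unweighted[i][j] = unweighted[j][i] = 1
                # Weighted PSN: keep the distance as the edge weight.
                weighted[i][j] = weighted[j][i] = d
    return unweighted, weighted


# Toy example: only the first two points are closer than the cutoff.
toy_coords = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (10.0, 0.0, 0.0)]
unweighted, weighted = build_psn(toy_coords)
```

In this sketch the unweighted matrix is the kind of input graphlet-based features can work with, while the weighted matrix retains the edge information that the deep learning framework is meant to exploit.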

Reference: Khalique Newaz, Mahboobeh Ghalehnovi+, Arash Rahnama+, Panos J. Antsaklis, and Tijana Milenkovic (2018), + Equal contribution, Network-based protein structural classification, submitted.

Software: We implement two different protein classification procedures: logistic regression-based and deep learning-based.

Logistic regression-based classifier: We provide a Python implementation, which can be downloaded from here. In addition, the data (i.e., features) used in our paper can be downloaded from this link. The code works with the scikit-learn 0.22.1 Python package. To run the code, place the code and the data folder (named "data") in the same folder.

Usage: "RunLR.py" is the main script for running the classifier on the features. The command to run the script is "python RunLR.py". The script needs to be run for each of the data sets separately. Inside the script, the following parameter needs to be set.

  • Dataset: Path to the data set. For example, to run the code for the CathP (i.e., CATH primary) dataset, the value for this parameter should be set to the name of the folder containing the features for that dataset (here, "CathP").
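Before launching either script, it can help to confirm that the folder layout matches the description above. The following is a minimal sketch, not part of the distributed code. It assumes that each dataset folder (for example, "CathP") sits directly inside the "data" folder; the helper name resolve_dataset_dir is hypothetical.

```python
import os
import tempfile


def resolve_dataset_dir(base_dir, dataset):
    """Return the dataset folder path, assumed to be base_dir/data/<dataset>."""
    path = os.path.join(base_dir, "data", dataset)
    if not os.path.isdir(path):
        raise FileNotFoundError(
            "Expected dataset folder not found: " + path
        )
    return path


# Demonstration on a throwaway folder that mimics the assumed layout.
with tempfile.TemporaryDirectory() as base:
    os.makedirs(os.path.join(base, "data", "CathP"))
    found_ok = os.path.isdir(resolve_dataset_dir(base, "CathP"))
    try:
        resolve_dataset_dir(base, "NoSuchSet")
        missing_raises = False
    except FileNotFoundError:
        missing_raises = True
```

In practice, base_dir would be the folder that holds the script, and dataset would be the same value that is set for the Dataset parameter.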

Deep learning-based classifier: We provide a Python implementation, which can be downloaded from here. The code works with the tensorflow 1.8 and skimage 0.14.1 Python packages. The code should be placed in the same folder as the "data" folder. This link contains a sample of the data (i.e., weighted adjacency matrices of proteins) from our paper to test the code on. Note that for the deep learning framework, we could not share all of the data used in our paper: the networks are stored as weighted adjacency matrices, and because there are so many networks, the entire data set is too large (over 4GB) to be stored on our web server. We are happy to share all of the data in an alternative way upon email request.
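The storage cost of adjacency matrices can be illustrated with simple arithmetic: a dense matrix for a network with n nodes has n × n entries, so its size grows quadratically with n. The sketch below is illustrative only. The on-disk format of the actual data is not described here, so the 8 bytes per entry and the example node counts are assumptions, and the function name dense_matrix_bytes is hypothetical.

```python
def dense_matrix_bytes(num_nodes, bytes_per_entry=8):
    """Bytes needed to hold one dense num_nodes x num_nodes matrix."""
    return num_nodes * num_nodes * bytes_per_entry


# Hypothetical network sizes (node counts), chosen only for illustration.
example_sizes = [100, 250, 500]
per_network = [dense_matrix_bytes(n) for n in example_sizes]
total_bytes = sum(per_network)
```

Under these assumptions, a 500-node network alone takes 2,000,000 bytes, which is 25 times the 80,000 bytes of a 100-node network, even though it has only 5 times as many nodes.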

Usage: "RunDL.py" is the main script. The command to run the script is "python RunDL.py" (this script can be run on CPU or GPU). The script needs to be run for each of the data sets separately. Inside the script, the following parameter needs to be set.

  • Dataset: Path to the data set. For example, to run the code for the "Cath3.20.20" dataset that has been provided here, the value for this parameter should be set to the name of the folder containing the features for that dataset (here, "Cath3.20.20").