GRAFENE: Graphlet-based alignment-free network approach integrates 3D structural and sequence (residue order) data to improve protein structural comparison

 

Contact:

Tijana Milenkovic, tmilenko AT nd DOT edu

Introduction: Initial protein structural comparisons were sequence-based. Since amino acids that are distant in the sequence can be close in the 3-dimensional (3D) structure, 3D contact approaches can complement sequence approaches. Traditional 3D contact approaches study 3D structures directly and are alignment-based. Instead, 3D structures can be modeled as protein structure networks (PSNs). Then, network approaches can compare proteins by comparing their PSNs. These can be alignment-based or alignment-free. We focus on the latter. Existing network alignment-free approaches have drawbacks: 1) They rely on naive measures of network topology. 2) They are not robust to PSN size. They cannot integrate 3) multiple PSN measures or 4) PSN data with sequence data, although this could improve comparison because the different data types capture complementary aspects of the protein structure. We address this by: 1) exploiting well-established graphlet measures via a new network alignment-free approach, 2) introducing normalized graphlet measures to remove the bias of PSN size, 3) allowing for integrating multiple PSN measures, and 4) using ordered graphlets to combine the complementary PSN data and sequence (specifically, residue order) data. We compare synthetic networks and real-world PSNs more accurately and faster than existing network (alignment-free and alignment-based), 3D contact, or sequence approaches.

Reference: Fazle E. Faisal, Khalique Newaz, Julie L. Chaney, Jun Li, Scott J. Emrich, Patricia L. Clark, and Tijana Milenkovic (2017), GRAFENE: Graphlet-based alignment-free network approach integrates 3D structural and sequence (residue order) data to improve protein structural comparison, submitted.

Software: Our Unix implementation for performing protein structural comparison is available here.

Usage: ./psn-classify [contact-map-pdb-dir] [protein-annotation-file] [psn-approach] [output-dir] -c [cut-off] -v [node-threshold] -d [max-distance-threshold] -o [component-threshold] -s [group-elements-threshold] -r [variation-threshold] -k [long-range]

  • [contact-map-pdb-dir] is a directory containing a set of files, where each file stores the contact map information for one protein (with respect to three distance cut-offs: 4 Å, 5 Å, and 6 Å). Protein contact maps used in our study are available here.

  • [protein-annotation-file] is the file that contains the annotation information for each protein in [contact-map-pdb-dir]. Protein annotation files used in our study are available here.

  • [psn-approach] is the protein structural comparison approach. Our software supports 17 approaches: a) Graphlet-3-4, b) Graphlet-3-5, c) OrderedGraphlet-3, d) OrderedGraphlet-3-4, e) NormGraphlet-3-4, f) NormGraphlet-3-5, g) NormOrderedGraphlet-3, h) NormOrderedGraphlet-3-4, i) Average-degree, j) Average-distance, k) Maximum-distance, l) Average-closeness-centrality, m) Average-clustering-coefficient, n) Intra-hub-connectivity, o) Assortativity, p) Existing-all, and q) Sequence.

  • [output-dir] is the directory that contains the output of the comparison.

  • [cut-off] is the distance cut-off. The software uses a distance cut-off of 4A (4 Å) by default. The other options for the cut-off are 5A and 6A.

  • [node-threshold] is the minimum number of nodes that a PSN must have to be considered for comparison. The software uses a threshold of 0 by default. Our suggested value for this parameter is 101.

  • [max-distance-threshold] is the minimum value that the maximum distance in a PSN must reach for the PSN to be considered for comparison. The software uses a threshold of 0 by default. Our suggested value for this parameter is 6.

  • [component-threshold] is the maximum number of components that a PSN may have to be considered for comparison. The software allows an unlimited number of components in a PSN by default. Our suggested value for this parameter is 1.

  • [group-elements-threshold] is the minimum number of PSNs that must belong to a class (or group), according to the [protein-annotation-file], for that class to be considered. Our suggested value for this parameter is 30.

  • [variation-threshold] is the minimum fraction of the variation in the input data that must be captured by the first r principal components resulting from principal component analysis. Our suggested value for this parameter is 0.90.

  • [long-range] is the "long-range(K)" constraint to be applied while counting ordered graphlets. The software uses a default value of 1 for this parameter.
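
The filtering thresholds above (-v, -d, -o, and -s) can be illustrated with a minimal Python sketch. This is not the psn-classify implementation. It assumes that a PSN is already available as an adjacency dictionary (node to list of neighbours), and it assumes that "maximum distance" means the largest shortest-path length between any two connected nodes. All function names are hypothetical.

```python
from collections import Counter, deque


def components(adj):
    """Return the connected components of an undirected graph as node sets."""
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, queue = {start}, deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in comp:
                    comp.add(v)
                    queue.append(v)
        seen |= comp
        comps.append(comp)
    return comps


def max_distance(adj):
    """Largest shortest-path length over all connected node pairs (assumed meaning)."""
    best = 0
    for start in adj:
        dist, queue = {start: 0}, deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        best = max(best, max(dist.values()))
    return best


def passes_filters(adj, min_nodes=0, min_max_distance=0, max_components=None):
    """Mimic -v, -d, and -o; the defaults impose no restriction."""
    if len(adj) < min_nodes:
        return False
    if max_components is not None and len(components(adj)) > max_components:
        return False
    return max_distance(adj) >= min_max_distance


def keep_large_classes(labels, min_group_size):
    """Mimic -s: keep only proteins whose class has at least min_group_size members."""
    counts = Counter(labels.values())
    return {p: c for p, c in labels.items() if counts[c] >= min_group_size}
```

With the suggested values, a PSN would be kept by passes_filters(adj, min_nodes=101, min_max_distance=6, max_components=1), and the annotation would then be reduced with keep_large_classes(labels, 30).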

Example: Given contact maps of proteins in the directory "ContactAllPDB" and domain annotations in the file "cath-primary.txt", protein structural comparison can be performed with the approach "NormOrderedGraphlet-3-4" using the following command.

  • ./psn-classify ContactAllPDB cath-primary.txt NormOrderedGraphlet-3-4 output-dir -c 4A -v 101 -d 6 -o 1 -s 30 -r 0.90 -k 2

  • The command performs protein structural comparison using NormOrderedGraphlet-3-4 on the PSNs that are annotated with CATH primary classes. Only PSNs with at least 101 nodes, a maximum distance of at least 6, and at most one component are considered for comparison. A CATH primary class (or group) is considered only if it contains at least 30 PSNs. Additionally, only those graphlets are counted in which every pair of interacting nodes (amino acids) is at a distance of at least 2 in the protein sequence. The comparison is done by performing principal component analysis and selecting the first r of the resulting principal components, which together account for at least 90% of the variation in the input data.

  • The output file output-dir/pr-roc-normorderedgraphlet-3-4.txt contains the comparison outcomes in terms of precision, recall, area under the precision-recall curve (AUPR), and area under the receiver operating characteristic curve (AUROC).
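
The role of -r in the example above can be made concrete with a short Python sketch of how the number r of principal components might be chosen. It is an illustration, not the software's code. It assumes that the per-component variances (the eigenvalues from principal component analysis) have already been computed, and the function name is hypothetical.

```python
def num_components_for_variation(variances, threshold=0.90):
    """Smallest r such that the r largest components explain at least
    `threshold` of the total variance."""
    total = sum(variances)
    if total <= 0:
        raise ValueError("total variance must be positive")
    cumulative = 0.0
    # Principal components are conventionally ordered by decreasing variance.
    for r, var in enumerate(sorted(variances, reverse=True), start=1):
        cumulative += var
        if cumulative / total >= threshold:
            return r
    return len(variances)


# Hypothetical variances: the first two components explain 80%, the first three 95%.
r = num_components_for_variation([5.0, 3.0, 1.5, 0.5], threshold=0.90)
```

In this toy case r is 3, because two components fall short of the 90% threshold and three exceed it.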

Data:

  • Protein contact maps used in our study are available here.

  • Protein annotation files used in our study are available here.