Название: Computational Prediction of Protein Complexes from Protein Interaction Networks
Автор: Sriganesh Srihari
Издательство: Ingram
Жанр: Программы
Серия: ACM Books
isbn: 9781970001549
isbn:
Limitations of Computationally Predicted Protein Interactions. Despite their promise in enhancing experimentally curated interaction datasets, computational methods have their own limitations. For example, methods that rely on (high-throughput) experimental datasets to predict new interactions—e.g., PPI topology-based methods—carry an inherent bias in their predictions toward proteins already accounted for or enriched in these datasets. Moreover, these predictions are also affected by biological and technical noise (spurious interactions) in experimental datasets. The methods that are based on genomic distances, fusion of genes, and co-transcription of (operonic) genes are applicable only to a selected subset of species (prokaryotes), since most eurkaryotic systems do not exhibit these properties so strongly. Knowledge-based methods—e.g., those based on the GO graph, 3D structures, and literature abstracts—are restricted by the amount and kind of available information, and are again applicable to only a subset of proteins. As a result of these limitations, accurate prediction of physical interactions between proteins is a challenging problem, and most methods in fact predict only functional associations instead of direct physical interactions between proteins. Depending on the application, these functional associations can turn out to be more useful or less useful.
However, despite these limitations, computational techniques are orthogonal to and can effectively complement experimental techniques. Computational predictions can be used to enhance the topologies of PPI networks—for example, by “densifying” sparse network regions or piecing together disconnected proteins or regions of the network—which in turn can improve downstream applications of PPI networks including protein complex prediction.
3
Computational Methods for Protein Complex Prediction from PPI Networks
All models are wrong, but some models are useful.
—George Box (1919–2013), British statistician (as quoted in Box [1979])
The process of identifying protein complexes from high-throughput interaction datasets involves the following steps [Spirin and Mirny 2003, Srihari and Leong 2012b, Srihari et al. 2015a] (Figure 1.1):
1. integrating high-throughput datasets from multiple sources and assessing the confidence of interactions;
2. constructing a reliable PPI network using only the high-confidence interactions;
3. identifying modular subnetworks from the network to generate a candidate list of complexes; and
4. evaluating the identified complexes against bona fide complexes, and validating and assigning roles to novel complexes.
In this chapter, we review some of the representative methods developed to date for computational prediction of protein complexes from PPI networks.
3.1 Basic Definitions and Terminologies
A protein-protein interaction (PPI) network is modeled as an undirected graph G = 〈V, E〉, where V is the set of proteins, and E ⊆ V × V is the set of interactions between the proteins. We also use V(G) and E(G) to refer to the set of proteins and interactions of a (sub-)network G, respectively. For a protein v ∈ V, N(v) or Nv is the set of immediate neighbors of v, deg(v) = |N(v)| = |Nv| is the degree of v, and E(v) is the set of interactions in the immediate neighborhood of v. The interaction density of G (or its subnetwork) is defined as:
which gives a real number in [0, 1] quantifying the “richness of interactions” within G: 0 for a network without any interactions and 1 for a fully connected network.
The clustering coefficient CC(v) measures the “cliquishness” for the neighborhood of v, and is given by:
If the interactions of the network are scored (weighted), i.e., G = 〈V, E, w〉, then the weighted versions—weighted degree, weighted interaction density, and weighted clustering coefficients—are given as follows:
There are other variants for these definitions proposed in the literature; see Kalna and Higham [2007] and Newman [2010] for a survey.
3.2 Taxonomy of Methods for Protein Complex Prediction
Although at a generic level most methods make the basic assumption that protein complexes are embedded among densely connected subsets of proteins within the PPI network, these methods vary considerably in their algorithmic strategies or in the biological information that they employ to detect complexes. Accordingly, we classify protein complex detection methods into the following two categories [Srihari and Leong 2012b, Srihari et al. 2015a, Li et al. 2010]:
1. methods based solely on network clustering; and
2. methods based on network clustering and additional biological information.
The biological information can be in the form of functional, structural, organizational, or evolutionary evidence on complexes or their constituent proteins. Wang et al. [2010] and Chen et al. [2014] present alternative classifications based on topological (interaction density, betweenness centrality, and clustering coefficient), and dynamical properties of PPI networks capitalized by computational methods, respectively.
We present this classification of methods for protein complex prediction as two snapshots—methodology-based and chronology-based—as shown in Figures 3.1 and 3.2, respectively. The methodology-based classification (Figure 3.1) is based on algorithmic methodologies employed by the methods. At level 1 of the tree, we divide methods into those based solely on network clustering, and those employing additional biological information. At subsequent levels, we further subdivide the methods based on their algorithmic strategies into: (i) methods employing merging or growing of clusters from the network (agglomerative); and (ii) methods employing repeated partitioning of the network (divisive). Agglomerative methods go bottom-up, i.e., by beginning with small “seeds” (e.g., triangles or cliques) and repeatedly adding proteins or merging clusters of proteins based on certain similarity criteria to arrive at predicted set of complexes. On the other hand, divisive methods go top-down, i.e., by repeatedly partitioning the network into multiple dense subnetworks based on certain dissimilarity criteria to arrive at predicted complexes. In the chronology-based classification (Figure 3.2), we bin methods based on the time (year) when these were developed, and for each bin we stack the methods based on the kind of biological information they employ for the prediction.
Figures 3.1 and 3.2 and Table 3.1 show only the methods that are discussed in this chapter. Other methods that employ diverse СКАЧАТЬ