Название: Data Analytics in Bioinformatics
Автор: Группа авторов
Издательство: John Wiley & Sons Limited
Жанр: Программы
isbn: 9781119785606
isbn:
2.3.3.4 Clustering Large Applications (CLARA)
Rather than considering and calculating medoids for the entire dataset CLARA considers a subset from the dataset with a constant size upon which PAM(partitioning around medoids) algorithm is applied. PAM algorithm aims at reducing the degree of similarity by calculating the distance between the cluster medoids and the elements in the subset space. As a result an optimal set of medoids are obtained which requires less store space when compared to other algorithms. In contrast to other partitioning algorithms this measures the degree of dissimilarity amongst the elements and the medoid of the cluster. This algorithm is applied on huge datasets producing accurate results with less computation time [17].
Drawback—Cannot deal with higher dimensional data and closely connected clusters which leaves this algorithm a second choice in gene expression data.
2.3.4 Hierarchical Clustering Algorithms
Hierarchical clustering technique is again divided into two categories
Agglomerative clustering—This follows a bottom up approach as depicted in Figure 2.4(a). Initially, every other object in the dataset is assumed as a single cluster. Distance measures are computed upon each object and based upon the similarities they are merged to form clusters.
Figure 2.4 (a) Agglomerative clustering, (b) divisive clustering.
Divisive Clustering—This follows top down approach as depicted in Figure 2.4(b). Initially, it considers all the data points as single cluster or model. The model separation is performed recursively until required criteria is met based upon the model.
Various algorithms of hierarchical clustering are described below.
2.3.4.1 AGNES (Agglomerative Nesting)
In this algorithm subsequent merging operations are performed on the dataset to arrange them into clusters. Dissimilarity coefficients are acquired using either Euclidean or Manhattan distance metrics [17]. These measured dissimilarities forms a matrix using unweighted pair group method with arithmetic mean [18].
Drawbacks of this method are that the objects once merged cannot be separated and different computation metrics leads to varying results.
2.3.4.2 DIANA (Divisive Analysis)
This follows divisive clustering approach in which the whole dataset is divided into 2 sub parts (clusters) and these clusters are further separated into sub clusters until no splitting can be performed further [17].
Drawback—once separated clusters cannot be merged, this cannot handle huge dataset [19]. This may not be applicable for gene expression data as it cannot handle missing data [17].
2.3.4.3 CURE (Clustering Using Representatives)
This algorithm operates on defining stable scatter points which are able to detect the shape and extent of a cluster. These scatter points based upon the similarities moves towards centroid in order to form a cluster. The scatter points capture the shape and extent of clusters which enables to detect non-spherical clusters. This algorithm is applied to gene expression data which produces the clusters efficiently [20].
2.3.4.4 CHAMELEON
This algorithm is based on dynamic modeling which is executed in two stages. In the first stage graph partitioning algorithm is applied resulting into smaller clusters. As graph partitioning technique is applied the elements in the clusters are strongly connected. This establishes a quality separation among the clusters by discarding the noisy data. The later stage is executed with the application of agglomerative hierarchical clustering combining all the smaller clusters iteratively. This algorithm doesn’t fall short of incorrect merging because of its 2 stage execution. And also CHAMELEON overcomes all the shortcomings of AGNES and DIANA in which the data once split cannot be merged [21].
2.3.4.5 BRICH (Balanced Iterative Reducing and Clustering Using Hierarchies)
In this algorithmic approach entire data in the dataset is maintained in the form of a tree structure called as CF (clustering Feature) which is based on clustering feature. The CF tree also holds the information of clustering. As the outliers are eliminated the sub clusters data is made as leaf nodes. With iterative updating of CF BRICH provides the potential to deal with huge datasets. This works effectively for gene expression data [22].
Drawback—This algorithm gets skewed with the presence of nonspherical clusters [22].
2.3.5 Density-Based Approach
Density based clustering algorithmic approach is composed by splitting the low point density regions and high point density regions in a given data space. The low point density region is observed as outliers [26].
2.3.5.1 DBSCAN
The DBSCAN algorithm proposed by Ref. [27] is based on the following criterion:
Eps(€)—this depicts the space around the data point, if the difference of distance between 2 data points is greater than or equal to € then they are declared as neighbors, if the (€) value is very small then they are termed as outliers, if the (€) value is high then that data point is merged to form cluster.
Minpts—this represents the minimum number of data points within the radius (€) for the larger datasets like microarrays Minpts are obtained from the dimensions of the dataset.
From the following definition of data points clusters and outliers are identified
Core point—presence of higher Minpts within radius of (€)
Border point—presence of less Minpts within radius of (€) but close to core point
Outlier—point which is neither core nor border point.
2.3.6 Model-Based Approach
This clustering algorithm handles the shortcomings of K-Means and Fuzzy K-Means. In this approach a probabilistic model is developed to which the data from the dataset is fit into [23]. Self-Organizing Maps implement the model based approach technique.
2.3.6.1 SOM (Self-Organizing Maps)
This algorithm is popularly known in gene expression data clustering in which the expression values are closely related. SOM adapts a neural network template with a single layer in which data is split over neurons as shown in Figure 2.5. As every neuron is connected to every other neuron and is associated with a reference vector, the data objects are plotted to the nearest reference vectors СКАЧАТЬ