Data Analytics in Bioinformatics. Группа авторов
Чтение книги онлайн.

Читать онлайн книгу Data Analytics in Bioinformatics - Группа авторов страница 24

Название: Data Analytics in Bioinformatics

Автор: Группа авторов

Издательство: John Wiley & Sons Limited

Жанр: Программы

Серия:

isbn: 9781119785606

isbn:

СКАЧАТЬ [16].

       2.3.3.4 Clustering Large Applications (CLARA)

      Rather than considering and calculating medoids for the entire dataset CLARA considers a subset from the dataset with a constant size upon which PAM(partitioning around medoids) algorithm is applied. PAM algorithm aims at reducing the degree of similarity by calculating the distance between the cluster medoids and the elements in the subset space. As a result an optimal set of medoids are obtained which requires less store space when compared to other algorithms. In contrast to other partitioning algorithms this measures the degree of dissimilarity amongst the elements and the medoid of the cluster. This algorithm is applied on huge datasets producing accurate results with less computation time [17].

      Drawback—Cannot deal with higher dimensional data and closely connected clusters which leaves this algorithm a second choice in gene expression data.

       2.3.4 Hierarchical Clustering Algorithms

      Hierarchical clustering technique is again divided into two categories

Schematic illustration of the agglomerative clustering and divisive clustering.

      Figure 2.4 (a) Agglomerative clustering, (b) divisive clustering.

      Divisive Clustering—This follows top down approach as depicted in Figure 2.4(b). Initially, it considers all the data points as single cluster or model. The model separation is performed recursively until required criteria is met based upon the model.

      Various algorithms of hierarchical clustering are described below.

       2.3.4.1 AGNES (Agglomerative Nesting)

      In this algorithm subsequent merging operations are performed on the dataset to arrange them into clusters. Dissimilarity coefficients are acquired using either Euclidean or Manhattan distance metrics [17]. These measured dissimilarities forms a matrix using unweighted pair group method with arithmetic mean [18].

      Drawbacks of this method are that the objects once merged cannot be separated and different computation metrics leads to varying results.

       2.3.4.2 DIANA (Divisive Analysis)

      This follows divisive clustering approach in which the whole dataset is divided into 2 sub parts (clusters) and these clusters are further separated into sub clusters until no splitting can be performed further [17].

      Drawback—once separated clusters cannot be merged, this cannot handle huge dataset [19]. This may not be applicable for gene expression data as it cannot handle missing data [17].

       2.3.4.3 CURE (Clustering Using Representatives)

      This algorithm operates on defining stable scatter points which are able to detect the shape and extent of a cluster. These scatter points based upon the similarities moves towards centroid in order to form a cluster. The scatter points capture the shape and extent of clusters which enables to detect non-spherical clusters. This algorithm is applied to gene expression data which produces the clusters efficiently [20].

       2.3.4.4 CHAMELEON

       2.3.4.5 BRICH (Balanced Iterative Reducing and Clustering Using Hierarchies)

      In this algorithmic approach entire data in the dataset is maintained in the form of a tree structure called as CF (clustering Feature) which is based on clustering feature. The CF tree also holds the information of clustering. As the outliers are eliminated the sub clusters data is made as leaf nodes. With iterative updating of CF BRICH provides the potential to deal with huge datasets. This works effectively for gene expression data [22].

      Drawback—This algorithm gets skewed with the presence of nonspherical clusters [22].

       2.3.5 Density-Based Approach

      Density based clustering algorithmic approach is composed by splitting the low point density regions and high point density regions in a given data space. The low point density region is observed as outliers [26].

       2.3.5.1 DBSCAN

      The DBSCAN algorithm proposed by Ref. [27] is based on the following criterion:

      Eps(€)—this depicts the space around the data point, if the difference of distance between 2 data points is greater than or equal to € then they are declared as neighbors, if the (€) value is very small then they are termed as outliers, if the (€) value is high then that data point is merged to form cluster.

      Minpts—this represents the minimum number of data points within the radius (€) for the larger datasets like microarrays Minpts are obtained from the dimensions of the dataset.

      From the following definition of data points clusters and outliers are identified

      Core point—presence of higher Minpts within radius of (€)

      Outlier—point which is neither core nor border point.

       2.3.6 Model-Based Approach

      This clustering algorithm handles the shortcomings of K-Means and Fuzzy K-Means. In this approach a probabilistic model is developed to which the data from the dataset is fit into [23]. Self-Organizing Maps implement the model based approach technique.

       2.3.6.1 SOM (Self-Organizing Maps)

      This algorithm is popularly known in gene expression data clustering in which the expression values are closely related. SOM adapts a neural network template with a single layer in which data is split over neurons as shown in Figure 2.5. As every neuron is connected to every other neuron and is associated with a reference vector, the data objects are plotted to the nearest reference vectors СКАЧАТЬ