Title: Machine Learning Techniques and Analytics for Cloud Security
Author: Group of authors
Publisher: John Wiley & Sons Limited
Genre: Software
isbn: 9781119764090
2.3.1 Description of Datasets
Influenza sequences (the glycan dataset) were taken from the National Center for Biotechnology Information (NCBI). First, the Basic Local Alignment Search Tool (BLAST) was used to search the H1N1-infected human datasets Influenza A/447/08 at Oklahoma, Influenza A/1138/08 at Oklahoma, and Influenza A/447/08 at Oklahoma, and the non-infected (normal) human dataset Influenza A/California/04/2009-4C. The H1N1 datasets contain glycan data from Oklahoma City, and the normal human dataset contains glycan data from California City. Each dataset consists of 442 different glycans, with linkers such as sp0, sp8, sp9, and sp12. The columns of each dataset give the glycan number, the glycan structure, the RFU, the standard deviation (STDEV), and the standard error of the mean (SEM).
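As a minimal sketch of the column layout described above, one dataset row could be modelled as follows. The class name `GlycanRow`, the helper `summarize`, the placeholder structure string, and the replicate readings are all hypothetical and are not taken from the dataset. The sketch also assumes that RFU is the mean of the replicate readings and that SEM is STDEV divided by the square root of the number of replicates.

```python
from dataclasses import dataclass
from math import sqrt
from statistics import mean, stdev


@dataclass
class GlycanRow:
    glycan_number: int
    structure: str
    rfu: float    # assumed: mean reading across replicates
    stdev: float  # sample standard deviation of the replicates
    sem: float    # standard error of the mean = stdev / sqrt(n)


def summarize(glycan_number, structure, replicates):
    """Build one row from raw replicate readings (hypothetical helper)."""
    sd = stdev(replicates)
    return GlycanRow(
        glycan_number=glycan_number,
        structure=structure,
        rfu=mean(replicates),
        stdev=sd,
        sem=sd / sqrt(len(replicates)),
    )


# Made-up readings for illustration only; not values from the dataset.
row = summarize(1, "placeholder-structure-sp8", [1200.0, 1100.0, 1300.0])
print(row)
```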
2.3.2 Analysis of Result
In this work, unsupervised machine learning methods, namely the k-means, hierarchical, and fuzzy c-means algorithms, are shown to perform well and have been successfully applied to the analysis of the H1N1-infected and non-infected datasets. First, the k-means clustering algorithm is applied to the H1N1-infected datasets Influenza A/447/08 at Oklahoma, Influenza A/1138/08 at Oklahoma, and Influenza A/447/08 at Oklahoma, and to the non-infected dataset Influenza A/California/04/2009-4C; the results are shown in Figures 2.2 to 2.5. The same process is repeated with the hierarchical clustering algorithm (Figures 2.6 to 2.8) and with the fuzzy c-means algorithm (Figures 2.9 to 2.11). After the cluster analysis, we collected those glycan structures whose RFU, STDEV, and SEM values changed significantly between the normal and infected states.
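The k-means step can be illustrated with a small, dependency-free sketch of Lloyd's algorithm. It assumes that each glycan is represented by the feature vector (RFU, STDEV, SEM); the text does not state which features were actually clustered. The function name `kmeans`, the evenly spaced choice of initial centroids, and the sample points are illustrative assumptions, not the authors' implementation or data.

```python
from math import dist


def kmeans(points, k, iterations=100):
    """Plain Lloyd's algorithm (illustrative sketch).

    Initial centroids are k points taken at even spacing through the
    input, which is an assumption made to keep the run deterministic.
    """
    step = len(points) // k
    centroids = [points[i * step] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return labels, centroids


# Made-up (RFU, STDEV, SEM) vectors for illustration only.
points = [
    (50.0, 5.0, 3.0), (60.0, 6.0, 3.5), (55.0, 4.0, 2.3),
    (1000.0, 90.0, 52.0), (1100.0, 100.0, 58.0), (1050.0, 95.0, 55.0),
    (5000.0, 400.0, 231.0), (5200.0, 420.0, 242.0), (5100.0, 410.0, 237.0),
]
labels, centroids = kmeans(points, k=3)
print(labels)
```

Running the same routine on an infected and a normal dataset and comparing the cluster assignment of each glycan is one possible way to flag glycans whose values have shifted; the selection rule used in the study is not specified in the text.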
Figure 2.2 K-means cluster analysis of Influenza A (H1N1) non-infected human.
Figure 2.3 K-means cluster analysis of Influenza A (H1N1) infected human.
Figure 2.4 K-means cluster analysis of Influenza A (H1N1) infected human.
Figure 2.5 K-means cluster analysis of Influenza A (H1N1) infected human.
2.3.3 Validation of Results
2.3.3.1 T-Test (Statistical Validation)
The t-test has been applied to compare the means of two samples (infected and normal), even if they contain different numbers of glycans. The t-test is carried out in the following steps:
a) List the H1N1-infected datasets as sample 1.
b) List the normal dataset as sample 2.
c) Record the number of replicates n for each sample (in this dataset n = 3, so n1 = 3 for sample 1 and n2 = 3 for sample 2).
d) Compute the mean of each sample (x1’, x2’). [mean = total/n]
e) Compute the standard deviation (σ) for each sample (σ1, σ2), where σ² = ∑d²/(n − 1) and d is the deviation of each replicate from its sample mean.
f) Compute the variance of the difference between the two means (σb²), where σb² = σ1²/n1 + σ2²/n2.
g) Compute σb (the square root of σb²).
h) Compute the t value, from which the p value is obtained, as follows: t = (x1’ − x2’)/σb.
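The steps above can be sketched in a few lines of Python. This is a hedged illustration, not the authors' code: it assumes the variance of the difference between the means takes the standard form σ1²/n1 + σ2²/n2, the function name `t_statistic` is hypothetical, and the replicate readings are made up. Converting the resulting t value into a p value requires Student's t distribution and is not done here.

```python
from math import sqrt


def t_statistic(sample1, sample2):
    """Two-sample t value following steps (c) to (h) (illustrative sketch)."""
    # (c) number of replicates in each sample
    n1, n2 = len(sample1), len(sample2)
    # (d) mean of each sample
    mean1 = sum(sample1) / n1
    mean2 = sum(sample2) / n2
    # (e) sample variance: sum of squared deviations over (n - 1)
    var1 = sum((x - mean1) ** 2 for x in sample1) / (n1 - 1)
    var2 = sum((x - mean2) ** 2 for x in sample2) / (n2 - 1)
    # (f) variance of the difference between the two means (assumed form)
    var_diff = var1 / n1 + var2 / n2
    # (g) its square root
    sd_diff = sqrt(var_diff)
    # (h) t value
    return (mean1 - mean2) / sd_diff


# Made-up replicate readings (n = 3 each) for illustration only.
infected = [1200.0, 1100.0, 1300.0]
normal = [300.0, 250.0, 350.0]
t = t_statistic(infected, normal)
print(round(t, 4))
```

A large absolute t value indicates that the difference between the two means is large relative to the replicate-to-replicate variation, which is the criterion the validation relies on.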
Figure 2.6 Hierarchical cluster analysis of Influenza A (H1N1) infected human.
Figure 2.7 Hierarchical cluster analysis of Influenza A (H1N1) infected human.
Figure 2.8 Hierarchical cluster analysis of Influenza A (H1N1) infected human.
Figure 2.9 Fuzzy c-means cluster analysis of Influenza A (H1N1) infected human.
2.3.3.2 Statistical Validation
In this work, the k-means algorithm was first applied to both datasets with k = 3. Second, the hierarchical algorithm was applied to the same datasets. Finally, the fuzzy c-means algorithm was applied to the same datasets with three clusters. All datasets contain the same 442 glycans. These 442 glycans are displayed on host cell surfaces, where they act as sensory receptors that recognize the glycoproteins of the viral surface. For example, glycan structures among these 442 that terminate in α2,3- or α2,6-linked sialic acid (N-acetylneuraminic acid) act as receptors for H1N1. The human upper respiratory surface mainly displays sialylated glycan receptors terminated with α2,6-linked sialic acid. Moreover, various types of glycan receptors are responsible for recognizing the hemagglutinin (HA) glycoprotein on the outermost part of influenza A viruses. In this way humans become infected, and H1N1 viruses are transmitted between humans via respiratory droplets. After the three clustering algorithms were applied, nineteen of the 442 glycans were found to be differentially expressed. The t-test was then applied to the infected and non-infected (normal) datasets. Table 2.1 lists the nineteen differentially expressed glycans found after t-test validation.
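For the fuzzy c-means step with three clusters, a dependency-free sketch is given below. It assumes the common fuzzifier m = 2 and (RFU, STDEV, SEM) feature vectors, neither of which is stated in the text. The function name `fuzzy_c_means`, the initial centers, and the sample points are hypothetical.

```python
from math import dist


def fuzzy_c_means(points, centers, m=2.0, iterations=100, tol=1e-6):
    """Standard fuzzy c-means updates (illustrative sketch, m assumed)."""
    c = len(centers)
    dims = len(points[0])
    centers = [tuple(ctr) for ctr in centers]
    memberships = []
    for _ in range(iterations):
        # Membership update: u_j = 1 / sum_k (d_j / d_k)^(2 / (m - 1))
        memberships = []
        for p in points:
            # Clamp distances so a point lying on a center does not divide by zero.
            d = [max(dist(p, ctr), 1e-12) for ctr in centers]
            row = [
                1.0 / sum((d[j] / d[k]) ** (2.0 / (m - 1.0)) for k in range(c))
                for j in range(c)
            ]
            memberships.append(row)
        # Center update: mean of the points weighted by u^m.
        new_centers = []
        for j in range(c):
            weights = [row[j] ** m for row in memberships]
            total = sum(weights)
            new_centers.append(tuple(
                sum(w * p[dim] for w, p in zip(weights, points)) / total
                for dim in range(dims)
            ))
        shift = max(dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break  # centers have stopped moving
    return memberships, centers


# Made-up (RFU, STDEV, SEM) vectors for illustration only.
points = [
    (50.0, 5.0, 3.0), (60.0, 6.0, 3.5), (55.0, 4.0, 2.3),
    (1000.0, 90.0, 52.0), (1100.0, 100.0, 58.0), (1050.0, 95.0, 55.0),
    (5000.0, 400.0, 231.0), (5200.0, 420.0, 242.0), (5100.0, 410.0, 237.0),
]
memberships, centers = fuzzy_c_means(points, [points[0], points[3], points[6]])
# Harden the soft memberships by taking the most likely cluster per point.
hard_labels = [row.index(max(row)) for row in memberships]
print(hard_labels)
```

Unlike k-means, each glycan receives a degree of membership in every cluster, so borderline glycans can be identified by memberships that are split between clusters rather than concentrated in one.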