Data Analytics in Bioinformatics. Group of authors

Publisher: John Wiley & Sons Limited

Genre: Software

ISBN: 9781119785606

... down the levels of blood pressure and cholesterol. If the levels do not go down, the patient will consult the doctor again, and further tests will be considered to lower the parameters used to evaluate the patient's heart.

      Classification is an ML task that assigns a class label to an observation from the problem domain. It is a sub-group of supervised ML. The traditional approach to classification is attributed to the Swedish botanist Carl Linnaeus (Carl von Linné) and is described in Ref. [33]. When the desired output is computed in supervised learning, classification is the more effective choice if the attribute to be predicted is discrete. Classification helps the user make decisions by providing classified conclusions from the observed data values, as discussed in Refs. [34–36]. Figure 1.7 presents a classification graph built from the data of different persons who either do or do not suffer from heart disease.

      Figure 1.7 Concept of classification.

      In Figure 1.7, the patients who suffer from heart disease are represented by triangle symbols, and those who do not are represented by rectangle symbols. The hyperplane (partition) line separates these two classes. In general, there are four types of classification techniques. They are:

       Binary Classification: It covers classification tasks with two class labels, where one class represents the normal state and the other the abnormal state [37].

       Imbalanced Classification: It covers classification tasks in which the examples are unequally distributed across the classes [38].

       Multi-label Classification: It covers classification tasks with two or more class labels, where one or more class labels may be predicted for each example [39].

       Multi-Class Classification: It covers classification tasks with more than two class labels [40].
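
      The difference between these task types shows up mainly in the shape of the labels. The following is a minimal Python sketch, not taken from the book: the label values and the helper names task_type and imbalance_ratio are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical labels, invented for illustration only.
binary_labels = ["disease", "healthy", "healthy", "disease"]    # two classes
multiclass_labels = ["low", "medium", "high", "medium"]         # more than two classes
multilabel_labels = [{"hypertension"},                          # one or more labels
                     {"hypertension", "diabetes"},              # per example
                     {"diabetes"}]


def task_type(labels):
    """Guess the classification task type from the shape of the labels."""
    if all(isinstance(y, (set, frozenset)) for y in labels):
        return "multi-label"
    n_classes = len(set(labels))
    if n_classes == 2:
        return "binary"
    if n_classes > 2:
        return "multi-class"
    raise ValueError("need at least two distinct class labels")


def imbalance_ratio(labels):
    """Ratio of the most frequent class count to the least frequent one."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


print(task_type(binary_labels), task_type(multiclass_labels),
      task_type(multilabel_labels))
print(imbalance_ratio(["healthy", "healthy", "healthy", "disease"]))
```

      A ratio well above 1 suggests an imbalanced task. Imbalance is a property of the class distribution, so it can occur together with any of the other three types.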

      Figure 1.8 Classification based on gender.

      To illustrate the classification approach more precisely, a heart disease dataset [41] has been used that comprises a total of 1,025 people, of whom 312 are female and 713 are male. This dataset was chosen because heart disease remains widespread, driven by factors such as excessive alcohol consumption, oily and fast food, and the inhalation of dangerous gases due to pollution. The classification by gender is shown in Figure 1.8.
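
      The gender split can be checked with a few lines of Python. This is a small sketch that uses only the counts quoted in the text; the helper name class_shares is hypothetical.

```python
# Counts quoted in the text: 312 females and 713 males.
counts = {"female": 312, "male": 713}
total = sum(counts.values())


def class_shares(counts):
    """Return each class's share of the total as a fraction between 0 and 1."""
    n = sum(counts.values())
    return {label: c / n for label, c in counts.items()}


shares = class_shares(counts)
for label, share in shares.items():
    print(f"{label}: {counts[label]} of {total} ({share:.1%})")
```

      The two counts add up to the stated 1,025, with roughly 30% female and 70% male records, so the split by gender is noticeably unequal.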

      Regression is a powerful type of statistical analysis. It is used to find the strength and character of the relationship between one dependent variable and a series of independent variables [42–44]. This analysis indicates whether any future update to the product is possible. Regression also lets a researcher identify the parameters best suited to an analysis, as well as the parameters that should not be used.

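      The simple and multiple linear regression models can be written as

      \[ B = n + qA + i \]

      \[ B = n + q_1 A_1 + q_2 A_2 + \dots + q_k A_k + i \]

      where,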

       B is the dependent variable

       A, or A_j (j = 1, ..., k), are the independent variables

       n is the intercept

       q, or q_j (j = 1, ..., k), are the slope coefficients

       i is the regression residual

       k is any natural number.
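
      Reading the definitions above as the usual linear model B = n + q_1 A_1 + ... + q_k A_k + i, the sketch below evaluates the model and fits the single-variable case by ordinary least squares. It is an illustration under that assumption, not the book's implementation: the function names predict and fit_simple and the age and cholesterol values are hypothetical.

```python
def predict(n, q, a):
    """Evaluate n + sum_j q_j * A_j for one observation (residual i omitted)."""
    if len(q) != len(a):
        raise ValueError("need exactly one slope per independent variable")
    return n + sum(qj * aj for qj, aj in zip(q, a))


def fit_simple(a_values, b_values):
    """Ordinary least squares for B = n + q*A; returns the pair (n, q)."""
    m = len(a_values)
    mean_a = sum(a_values) / m
    mean_b = sum(b_values) / m
    sxx = sum((a - mean_a) ** 2 for a in a_values)
    sxy = sum((a - mean_a) * (b - mean_b)
              for a, b in zip(a_values, b_values))
    q = sxy / sxx              # slope
    n = mean_b - q * mean_a    # intercept
    return n, q


# Hypothetical, perfectly linear data, invented for illustration only.
ages = [30, 40, 50, 60]
cholesterol = [180, 200, 220, 240]

n, q = fit_simple(ages, cholesterol)
# The residual i is whatever the fitted line leaves unexplained.
residuals = [b - predict(n, [q], [a]) for a, b in zip(ages, cholesterol)]
print(n, q, residuals)
```

      Because the invented data lie exactly on a line, the residuals here are zero; with real measurements they would not be.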

      For easier understanding, a case study on heart disease is discussed below. In this case study, the regression approach was used to predict whether a person has heart disease. Here, the dependent variable is heart disease and the independent variables are cholesterol level, blood pressure, etc. After analyzing the data, it was found that the patient has a heart problem, which is presented on a 2D plane in Figure 1.9.

      The steps required for regression analysis are [50]:

       Select the dependent & independent variables.

       Explore the correlation matrix along with the scatter plot.

       Perform the Linear or Multiple Regression Operation.

       Deal with the outliers and the multicollinearity.

       Perform the t-test.

       Handle the insignificant variables.
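
      Some of the steps above can be sketched in plain Python for a single independent variable: the correlation check, the linear fit, and the t-test on the slope. This is an illustration only. The data and the names pearson_r and fit_and_t_test are hypothetical, the critical value assumes a two-sided 5% test, and the handling of outliers and multicollinearity is not covered.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    m = len(x)
    mean_x, mean_y = sum(x) / m, sum(y) / m
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fit_and_t_test(x, y):
    """Fit y = n + q*x by least squares; return (n, q, t) with t testing q = 0."""
    m = len(x)
    mean_x, mean_y = sum(x) / m, sum(y) / m
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    q = sxy / sxx
    n = mean_y - q * mean_x
    # Residual sum of squares, then the standard error of the slope.
    sse = sum((b - (n + q * a)) ** 2 for a, b in zip(x, y))
    se_q = math.sqrt(sse / (m - 2) / sxx)
    return n, q, q / se_q


# Hypothetical data, invented for illustration only.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

r = pearson_r(x, y)                      # correlation check
n_hat, q_hat, t_stat = fit_and_t_test(x, y)

# Assumption: two-sided 5% critical value of Student's t for m - 2 = 3
# degrees of freedom.
T_CRIT = 3.182
significant = abs(t_stat) > T_CRIT       # an insignificant variable would be dropped
print(r, n_hat, q_hat, t_stat, significant)
```

      With several independent variables, the same t-test would be applied to each slope in turn, and the variables whose slopes are not significant would be candidates for removal.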

      Figure 1.9 Regression. The x-axis represents the cholesterol level and the y-axis indicates whether the person is a heart patient.

      Figure 1.10 Cholesterol line fit plot.

      A regression was performed on the heart disease dataset with respect to age and cholesterol; the results are shown in Figure 1.10.

      Figure 1.10 shows a line fit plot that depicts the line of best fit, also known as the trend line. This trend line is based on a linear equation and aims to present the typical cholesterol level of a person with respect to age. The plot has two axes: the vertical axis depicts the age and the horizontal axis depicts the cholesterol values. The trend line could be linear, polynomial, or exponential