Title: Data Mining and Machine Learning Applications
Author: Group of authors
Publisher: John Wiley & Sons Limited
Genre: Databases
ISBN: 9781119792505
Library of Congress Cataloging-in-Publication Data
ISBN 978-1-119-79178-2
Cover image: Pixabay.Com
Cover design by Russell Richardson
Set in size of 11pt and Minion Pro by Manila Typesetting Company, Makati, Philippines
Printed in the USA
10 9 8 7 6 5 4 3 2 1
Preface
Data, the latest currency of today’s world, is the new gold. In this new form of gold, the most beautiful jewels are data analytics and machine learning. Data mining and machine learning are considered interdisciplinary fields. Data mining is a subset of data analytics, and machine learning involves the use of algorithms that improve automatically through experience based on data. However, the term data mining is a misnomer because it suggests mining data rather than extracting knowledge. A more apt term would be “knowledge discovery from data,” since it is the practice of examining large pre-existing databases to generate new information. Data mining algorithms are currently being investigated and applied worldwide.
Massive datasets can be classified and clustered to obtain accurate results; classification and clustering are the most commonly used techniques. Accuracy and error rates are calculated for regression, classification, and clustering to assess the results of algorithms such as support vector machines and neural networks with forward and backward propagation. Applications include fraud detection, image processing, medical diagnosis, weather prediction, e-commerce, and so forth. Data mining algorithms are even used for sentiment analysis. These applications have been increasing across different areas and fields. Web mining and text mining have also established themselves as concrete fields within data mining.
This book is intended for industrial and academic researchers, and scientists and engineers in the information technology, data science and machine and deep learning domains. Featured in the book are:
A review of the state-of-the-art in data mining and machine learning,
A review and description of the learning methods in human-computer interaction,
Implementation strategies and future research directions for meeting the long-standing design and application requirements of several modern and real-time applications,
The scope and implementation of a majority of data mining and machine learning strategies, and
A discussion of real-time problems.
Most other books available on the market were published a long time ago and hence seldom address the current needs of data mining and machine learning, which makes this book a better choice. It is our hope that this book will promote mutual understanding among researchers in different disciplines and facilitate future research, development, and collaboration.
We want to express our appreciation to all of the contributing authors who helped us tremendously with their contributions, time, critical thoughts, and suggestions to put together this peer-reviewed edited volume. The editors are also thankful to Scrivener Publishing and its team members for the opportunity to publish this volume. Lastly, we thank our family members for their love, support, encouragement, and patience during the entire period of this work.
Rohit Raja
Kapil Kumar Nagwanshi
Sandeep Kumar
K. Ramya Laxmi
November 2021
1
Introduction to Data Mining
Santosh R. Durugkar1, Rohit Raja2, Kapil Kumar Nagwanshi3* and Sandeep Kumar4
1 Amity University Rajasthan, Jaipur, India
2 IT Department, GGV Bilaspur Central University, Bilaspur, India
3 ASET, Amity University Rajasthan, Jaipur, India
4 Computer Science and Engineering Department, Koneru Lakshmaiah Education Foundation, Vaddeswaram, Andhra Pradesh, India
Abstract
Data mining, as the word “mining” suggests, is the extraction of desired, meaningful information from datasets. Its methods and algorithms help researchers and students develop numerous applications for end-users. Its presence in the healthcare industry, marketing, scientific applications, etc., enables end-users to extract the required, meaningful information from a data collection. In this chapter, we give a brief introduction to data mining. In the initial section, we discuss knowledge discovery in databases (KDD) and its phases: data cleaning, data integration, data selection and transformation, and representation. A comparative discussion of classification and clustering helps the end-user distinguish these techniques. We also discuss applications and algorithms. An introduction to basic clustering algorithms (K-means clustering, hierarchical clustering, fuzzy clustering, and density-based clustering) will help the end-user select a specific algorithm for a given application. In the last section of this chapter, we introduce various data mining tools, such as Python, Rapid Miner, and KNIME, that the user can apply to extract the required information.
Keywords: Data mining, KDD, clustering, classification, Python, KNIME
1.1 Introduction
1.1.1 Data Mining
‘Mining’ refers to extracting meaningful information from databases. This method helps researchers, students, and other IT professionals extract exactly the significant details they need and develop the desired applications [1, 2]. It is also known as knowledge discovery in databases (KDD). Applications of KDD include medicine and hospitals, marketing, educational systems, scientific applications, e-commerce, retail, biological analysis, counterterrorism, data warehousing, decision making in the energy sector, spatial data mining, and logistics [4–6].
1.2 Knowledge Discovery in Databases (KDD)
KDD helps detect previously unknown patterns, i.e., it extracts hidden patterns from massive volumes of data [3, 6]. Figure 1.1 gives an overview of knowledge discovery in databases (KDD), which consists of the following phases:
Data cleaning: This step removes irrelevant data, i.e., unwanted records are dropped. The collected data may also contain missing values; such records must either be removed or have their missing values imputed [7].
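The data cleaning step can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and does not come from the chapter: the sample records, the field names, the rule that a record with no numeric value at all counts as irrelevant, and the choice of mean imputation for the remaining gaps.

```python
from statistics import mean

# Hypothetical records; None marks a missing value.
records = [
    {"id": 1, "age": 34, "glucose": 5.4},
    {"id": 2, "age": None, "glucose": 6.1},
    {"id": 3, "age": 51, "glucose": None},
    {"id": 4, "age": None, "glucose": None},
    {"id": 5, "age": 47, "glucose": 7.2},
]

NUMERIC_FIELDS = ("age", "glucose")


def drop_irrelevant(rows):
    """Remove records that carry no usable numeric value at all (assumed rule)."""
    return [r for r in rows if any(r[f] is not None for f in NUMERIC_FIELDS)]


def impute_mean(rows):
    """Replace each missing numeric value with the mean of the observed values."""
    means = {
        f: mean(r[f] for r in rows if r[f] is not None) for f in NUMERIC_FIELDS
    }
    return [
        {**r, **{f: means[f] if r[f] is None else r[f] for f in NUMERIC_FIELDS}}
        for r in rows
    ]


# Cleaning pipeline: first remove unusable records, then fill remaining gaps.
cleaned = impute_mean(drop_irrelevant(records))
for row in cleaned:
    print(row)
```

Dropping records before imputing keeps the empty record from being turned into a row made up entirely of averages. Whether to remove or impute in a real application depends on how much data is missing and why.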