Multiblock Data Fusion in Statistics and Machine Learning. Tormod Næs

Title: Multiblock Data Fusion in Statistics and Machine Learning

Author: Tormod Næs

Publisher: John Wiley & Sons Limited

Genre: Chemistry

ISBN: 9781119600992

... to write a few introductory words about Multiblock Data Fusion in Statistics and Machine Learning. The book is maybe not timely! The subject has been around in chemometrics since the late 1980s, usually under the term multiblock analysis.

      Let me take that back immediately: the book is definitely timely. Even though this subject has been discussed for decades, it has taken off dramatically lately, and not only in chemometrics but in a variety of fields. There are many diverse and interesting developments; in fact, it is quite difficult to follow what is going on and to filter, or even just understand, the literature from so many sources. Each field has its own internal jargon and background. This may be the biggest obstacle right now. It is evident that there are many interesting developments, but grasping them is next to impossible. This book fixes that. Beyond that, it provides a comprehensive overview across fields and adds perspective and new research where needed. I would argue that this is the place to go if you want to understand data fusion comprehensively.

      That is, if you want to understand how to apply data fusion; or you want to develop new data fusion models; or learn how the algorithms and models work; or maybe you want to understand what the shortcomings of different approaches are. If you have questions like these or you simply want to know what is happening in this area of data science, then reading this book will be a nice and fulfilling experience.

      To write a comprehensive book about such an enormous field requires special people. And indeed, there are three very competent persons behind this book. They have all worked within the area for many years and have each provided important research on both the theoretical and the application sides of things. And they represent both the experience of the old-timers and the visions of the coming generations. I can say that without insulting anyone (I hope), as I am in the same age group as the more distinguished part of the authors.

      I have the deepest respect and the highest admiration for the three authors. I have learned so many things from their individual contributions over the years. Reading this joint work is not a disappointment. Please do enjoy!

       Rasmus Bro

      Combining information from two or possibly several blocks of data is gaining increased attention and importance in several areas of science and industry. Typical examples can be found in chemistry, spectroscopy, metabolomics, genomics, systems biology, and sensory science. Many methods and procedures have been proposed and used in practice. The area goes under different names: data integration, data fusion, multiblock analyses, multiset analyses, and others.

      This book is an attempt to provide an up-to-date treatment of the most used and important methods within an important branch of the area; namely methods based on so-called components or latent variables. These methods have already obtained enormous attention in, for instance, chemometrics, bioinformatics, machine learning, and sensometrics and have proved to be important both for prediction and interpretation.

      The book is primarily a description of methodologies, but most of the methods will be illustrated by examples from the above-mentioned areas. The book is written such that both users of the methods as well as method developers will hopefully find sections of interest. At the end of the book there is a description of a software package developed particularly for the book. This package is freely available in R and covers many of the methods discussed.

      To distinguish the different types of methods from each other, the book is divided into five parts. Part I is an introduction and description of preliminary concepts. Part II is the core of the book containing the main unsupervised and supervised methods. Part III deals with more complex structures and, finally, Part IV presents alternative unsupervised and supervised methods. The book ends with Part V discussing the available software.

       Age Smilde, Utrecht, The Netherlands

       Tormod Næs, Ås, Norway

       Kristian Hovde Liland, Ås, Norway

      March 2022

      Figure 1.1 High-level, mid-level, and low-level fusion for two input blocks. The Z’s represent the combined information from the two blocks, which is used for making the predictions. The upper figure represents high-level fusion, where the results from two separate analyses are combined. The figure in the middle is an illustration of mid-level fusion, where components from the two data blocks are combined before further analysis. The lower figure illustrates low-level fusion, where the data blocks are simply combined into one data block before further analysis takes place.

      Figure 1.2 Idea of dimension reduction and components. The scores T summarise the relationships between samples; the loadings P summarise the relationships between variables. Sometimes weights W are used to define the scores.

      Figure 1.3 Design of the plant experiment. Numbers in the top row refer to light levels (in μE m⁻² sec⁻¹); numbers in the first column are degrees centigrade. Legend: D = dark, LL = low light, L = light and HL = high light.

      Figure 1.4 Scores on the first two principal components of a PCA on the plant data (a) and scores on the first ASCA interaction component (b). Legend: D = dark, LL = low light, L = light and HL = high light.

      Figure 1.5 Idea of copy number variation (a), methylation (b), and mutation (c) of the DNA. For (a) and (c): Source: Adapted from Koch et al., 2012.

      Figure 1.6 Plot of the Raman spectra used in predicting the fat content. The dashed lines show the split of the data set into multiple blocks.

      Figure 1.7 L-shape data of consumer liking studies.

      Figure 1.8 Phylogeny of some multiblock methods and relations to basic data analysis methods used in this book.

      Figure 1.9 The idea of common and distinct components. Legend: blue is common variation; dark yellow and dark red are distinct variation and shaded areas are noise (unsystematic variation).

      Figure 2.1 Idea of dimension reduction and components. Sometimes W is used to define the ...