Multiblock Data Fusion in Statistics and Machine Learning. Tormod Næs

Publisher: John Wiley & Sons Limited

Genre: Chemistry

ISBN: 9781119600992

      Figure 1.8 Phylogeny of some multiblock methods and relations to basic data analysis methods used in this book.

      1.7 Fundamental Choices

      In any multiblock data analysis, choices have to be made, such as which method to use and what kind of pre-processing to apply. Two fundamental questions that should always be considered (and dealt with) are highlighted below.

      Variation explained: Do we only want to explain variation between blocks or also within blocks?

      Fairness: Should all blocks play a role in the final solution or can we allow some of the blocks to be dominant in this respect?

      1.8 Common and Distinct Components

      Figure 1.9 The idea of common and distinct components. Legend: blue is common variation; dark yellow and dark red are distinct variation and shaded areas are noise (unsystematic variation).

      Suppose there are two data blocks X1 and X2 sharing the same samples, i.e., different variables are measured on the same set of samples (see Chapter 3). Then these two blocks can have variation in common (the blue part). This common variation spans a subspace and the common components are then a basis for this subspace.

      There is also a part in each block that still contains systematic variation (the dark yellow and dark red parts). These parts have nothing in common and are, therefore, called distinct parts. They also represent subspaces, and the distinct components (two sets; one set for each block) are the bases for these subspaces. What is left in the matrices is unsystematic variation or noise (shaded parts).

      The division of each data block into common, distinct, and unsystematic variation should not be read in terms of individual variables being common or distinct, but in terms of subspaces. Hence, a part of the variation of a variable in block 1 may be in common with the variation of some variables in block 2, whereas the other part of that variable's variation may be distinct; see Elaboration 1.8.
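
      The subspace reading of common and distinct variation can be illustrated numerically. The following is a minimal sketch, not one of the methods treated in this book: it simulates two blocks that share one score vector and each carry one distinct score vector, and then uses principal angles between the two systematic column spaces (a generic linear algebra tool) to recover the shared direction. All names (`t_common`, `systematic_basis`, the block sizes, the noise level) are assumptions made for illustration, and the number of systematic components per block (two) is taken as known.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50  # number of samples shared by both blocks (assumed)

# Three mutually orthogonal score vectors: one common, one distinct per block.
T, _ = np.linalg.qr(rng.standard_normal((n, 3)))
t_common, t_dist1, t_dist2 = T[:, 0], T[:, 1], T[:, 2]


def orthonormal_loadings(n_vars):
    """Return a (2, n_vars) loading matrix with orthonormal rows."""
    Q, _ = np.linalg.qr(rng.standard_normal((n_vars, 2)))
    return Q.T


P1 = orthonormal_loadings(6)  # block 1 has 6 variables (assumed)
P2 = orthonormal_loadings(8)  # block 2 has 8 variables (assumed)

noise = 0.002  # small unsystematic variation (assumed level)
X1 = np.column_stack([t_common, t_dist1]) @ P1 + noise * rng.standard_normal((n, 6))
X2 = np.column_stack([t_common, t_dist2]) @ P2 + noise * rng.standard_normal((n, 8))


def systematic_basis(X, rank):
    """Orthonormal basis for the dominant column space of X (truncated SVD)."""
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :rank]


U1 = systematic_basis(X1, 2)
U2 = systematic_basis(X2, 2)

# Cosines of the principal angles between the two systematic subspaces.
# A cosine near 1 marks a shared direction; a cosine near 0 marks
# directions that the blocks do not share.
A, cosines, _ = np.linalg.svd(U1.T @ U2)
common_estimate = U1 @ A[:, 0]  # estimated common score vector (unit length)

print("cosines of principal angles:", np.round(cosines, 3))
print("|correlation with true common scores|:", abs(common_estimate @ t_common))
```

      In this simulation every variable in a block loads on both the common and the distinct score vector, so no single variable is purely common or purely distinct; the split only becomes visible at the level of subspaces.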

      ELABORATION 1.8

      Common and distinct in spectroscopy

      Suppose that the same set of samples is measured in the UV-Vis regime (block X1) and with near-infrared (NIR, block X2). Also assume that this set of samples contains three chemical components (A, B, C): A absorbs in both UV-Vis and NIR, B absorbs only in the UV-Vis regime, and C absorbs only in NIR. Then the common part is the absorption of A in both data blocks; the distinct parts are B in block 1 and C in block 2. However, at a particular wavelength in the NIR region there may be a contribution from both A and C. Hence, this wavelength, i.e., variable, has a common and a distinct part. The same can happen in block 1.
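
      The spectroscopic example can be mimicked with a small noise-free simulation. This is only a sketch under stated assumptions: the wavelength grids, the Gaussian band positions and widths, and the uniformly drawn concentrations are all hypothetical and not taken from real spectra. Each block is built as concentrations times pure spectra, and the NIR variable at 1600 nm is placed between the bands of A and C so that it receives a contribution from both.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 30  # number of samples (assumed)

# Hypothetical concentrations of components A, B and C in each sample.
c_A, c_B, c_C = rng.uniform(0.1, 1.0, size=(3, n))


def band(axis, centre, width):
    """Gaussian absorption band evaluated on a wavelength axis (illustrative)."""
    return np.exp(-0.5 * ((axis - centre) / width) ** 2)


uv = np.linspace(200, 800, 61)     # UV-Vis grid in nm (assumed)
nir = np.linspace(1000, 2500, 76)  # NIR grid in nm (assumed)

# Block 1 (UV-Vis): A and B absorb. Block 2 (NIR): A and C absorb.
X1 = np.outer(c_A, band(uv, 350, 40)) + np.outer(c_B, band(uv, 600, 40))
X2_common = np.outer(c_A, band(nir, 1500, 120))    # part driven by A
X2_distinct = np.outer(c_C, band(nir, 1700, 120))  # part driven by C
X2 = X2_common + X2_distinct

# One NIR variable lying between the A band and the C band.
j = int(np.argmin(np.abs(nir - 1600)))
print("variance at 1600 nm due to A:", X2_common[:, j].var())
print("variance at 1600 nm due to C:", X2_distinct[:, j].var())

# Subspace view: each block spans two directions in sample space, but
# together they span only three, so one direction (that of A) is shared.
rank_1 = np.linalg.matrix_rank(X1)
rank_2 = np.linalg.matrix_rank(X2)
rank_joint = np.linalg.matrix_rank(np.hstack([X1, X2]))
print("ranks:", rank_1, rank_2, rank_joint)
```

      The variable at 1600 nm is neither common nor distinct as a whole: its variation is the sum of a part driven by A and a part driven by C.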

      1.9 Overview and Links

      1 A method for unsupervised (U), supervised (S) or complex (C) data structures.

      2 The method can deal with heterogeneous data (HET, i.e., different measurement scales) or can only deal with homogeneous data (HOM).

      3 A method that uses a sequential (SEQ) or simultaneous (SIM) approach.

      4 The method is defined in terms of a model (MOD) or in terms of an algorithm (ALG).

      5 A method for finding common (C); common and distinct (CD); or finding common, local and distinct components (CLD).

      6 Estimation of the model parameters is based on least squares (LS), maximum likelihood (ML), eigenvalue decompositions (ED) or maximising covariance or correlations (MC).
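
      For readers who want to keep track of methods along these six properties, the list above can be encoded as a small data structure. The sketch below is purely illustrative: the class names and the two catalogue entries are hypothetical placeholders and do not describe any actual method from this book.

```python
from dataclasses import dataclass
from enum import Enum


class Task(Enum):
    U = "unsupervised"
    S = "supervised"
    C = "complex"


class Scale(Enum):
    HET = "heterogeneous"
    HOM = "homogeneous"


class Approach(Enum):
    SEQ = "sequential"
    SIM = "simultaneous"


class Definition(Enum):
    MOD = "model"
    ALG = "algorithm"


class Components(Enum):
    C = "common"
    CD = "common and distinct"
    CLD = "common, local and distinct"


class Estimation(Enum):
    LS = "least squares"
    ML = "maximum likelihood"
    ED = "eigenvalue decomposition"
    MC = "maximising covariance or correlation"


@dataclass(frozen=True)
class MethodProfile:
    """One record per method, with one field per property in the list."""
    name: str
    task: Task
    scale: Scale
    approach: Approach
    definition: Definition
    components: Components
    estimation: Estimation


# Hypothetical entries, for illustration only.
catalogue = [
    MethodProfile("hypothetical method 1", Task.U, Scale.HOM, Approach.SIM,
                  Definition.MOD, Components.CD, Estimation.LS),
    MethodProfile("hypothetical method 2", Task.S, Scale.HET, Approach.SEQ,
                  Definition.ALG, Components.C, Estimation.MC),
]

# Example query: which catalogued methods separate common and distinct components?
cd_methods = [m.name for m in catalogue if m.components is Components.CD]
print(cd_methods)
```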

      The first item (1) is used to organise the different chapters. Some methods can deal with data of different measurement scales (heterogeneous data) and some methods can only handle homogeneous data. The difference between the simultaneous and sequential methods is explained in more detail in Chapter 2. Some methods are defined by a clear model and some methods are based on an algorithm. The already discussed topic of common and distinct variation is also a distinguishing and important feature of the methods, and the sections in some of the chapters are organised according to this principle. Finally, there are different ways of estimating the parameters (weights, scores, loadings, etc.) of the multiblock