Statistical Approaches for Hidden Variables in Ecology. Nathalie Peyrard

Title: Statistical Approaches for Hidden Variables in Ecology

Author: Nathalie Peyrard

Publisher: John Wiley & Sons Limited

Genre: Sociology

ISBN: 9781119902782

also be discussed at length. As we shall see, while the statistical treatment of these variables may be complex, their inclusion in models is essential in providing us with a better understanding of ecological systems.

      The term “hidden variable”, widely used in ecology, corresponds to the more general notion of latent variables in statistical modeling. This notion encompasses several situations and goes beyond the idea of unobservable physical variables alone. In statistics, a latent variable is generally defined as a variable of interest that is not observable, does not necessarily have a physical meaning, and whose value must be deduced from observations. More precisely, latent variables are characterized by the following two specificities: (i) their number is comparable to the number of data items, whereas parameters are fewer in number. Consider, for example, the case of a hidden Markov chain, where the number of observed variables and the number of latent variables are both equal to the number of observation time steps; (ii) if their values were known, then model parameter estimation would be easier. For example, consider the estimation of the parameters of a mixture model where the groups of individuals are known.
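
      Point (ii) can be illustrated with a minimal sketch. It assumes a two-group Gaussian mixture with unit variance; the group means, proportions, sample size and function names are hypothetical choices made for illustration, not taken from the book. When the group labels (the latent variables) are known, maximum likelihood estimation reduces to simple per-group averages:

```python
import random
from statistics import mean

def simulate_mixture(n, means, weights, seed=0):
    """Draw n observations from a Gaussian mixture with unit variance.

    Returns (labels, values); the labels play the role of latent variables.
    All numerical settings are illustrative assumptions.
    """
    rng = random.Random(seed)
    labels = rng.choices(range(len(means)), weights=weights, k=n)
    values = [rng.gauss(means[z], 1.0) for z in labels]
    return labels, values

def estimate_with_known_labels(labels, values, n_groups):
    """Maximum likelihood estimates when the group labels are observed."""
    n = len(values)
    proportions, group_means = [], []
    for g in range(n_groups):
        members = [y for z, y in zip(labels, values) if z == g]
        proportions.append(len(members) / n)   # estimated group proportion
        group_means.append(mean(members))      # estimated group mean
    return proportions, group_means

labels, values = simulate_mixture(2000, means=[0.0, 4.0], weights=[0.3, 0.7])
proportions, group_means = estimate_with_known_labels(labels, values, 2)
```

      When the labels are not observed, these closed-form averages are no longer available, and the group memberships have to be inferred together with the parameters.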

      In practice, if a latent variable has a physical reality but cannot be observed in the field (e.g. the precise trajectory of an animal, or the abundance of a seedbank), it is often referred to as a hidden variable (although both terms are often used interchangeably). In other cases, the latent variable naturally plays a role in the description of a given process or system, but has no physical existence. This is the case, for example, of latent variables corresponding to a classification of observations into different groups. We will refer to them as fictitious variables. Finally, latent variables may also play an instrumental role in describing a source of variability in observations that cannot be explained by known covariates, or in establishing a concise description of a dependency structure. They may result from a dimension reduction operation applied to a group of explanatory variables in the context of regression, as we see in the case of the principal components of a principal component analysis.

      The notion of latent variables is connected to that of hierarchical models: if they are not parameters, the elements in the higher levels of the model are latent variables. It is important to note that the notion of latent variables may be extended to cover the case of deterministic quantities (represented by a constant in a model). For example, this holds true in cases where the latent variable is the trajectory of an ordinary differential equation (ODE) for which only noisy observations are available.
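
      As a rough illustration of the ODE case, the sketch below assumes a logistic growth equation, an explicit Euler scheme and additive Gaussian observation noise. The equation, parameter values and function names are all hypothetical and chosen only to make the idea concrete. The trajectory is deterministic once the parameters are fixed, yet it is latent because only the noisy values are available:

```python
import random

def logistic_trajectory(n0, r, k, n_steps, dt=0.1):
    """Euler approximation of dN/dt = r * N * (1 - N / K).

    The returned trajectory is deterministic given (n0, r, k).
    """
    traj = [n0]
    for _ in range(n_steps):
        n = traj[-1]
        traj.append(n + dt * r * n * (1.0 - n / k))
    return traj

def observe(traj, noise_sd, seed=0):
    """Noisy observations of the latent trajectory (Gaussian errors assumed)."""
    rng = random.Random(seed)
    return [n + rng.gauss(0.0, noise_sd) for n in traj]

# Hypothetical settings: initial abundance 5, growth rate 0.8, capacity 100.
latent = logistic_trajectory(n0=5.0, r=0.8, k=100.0, n_steps=150)
observed = observe(latent, noise_sd=3.0)
```

      In an inference setting, only `observed` would be available, and both the parameters and the trajectory `latent` would have to be recovered from it.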

      Some of the most common examples of statistical models featuring latent variables are described here.

      Mixture models are used to define a small number of groups into which a set of observations may be sorted. In this case, the latent variables are discrete variables indicating which group each observation belongs to. Stochastic block models (SBMs) or latent block models (LBMs, or bipartite SBMs) are specific forms of mixture models used in cases where the observations take the form of a network.

      Hidden Markov models (HMMs) are often used to analyze data collected over a period of time (such as the trajectory of an animal, observed over a series of dates) and take account of an underlying process (such as the activity of the tracked animal: sleep, movement, hunting, etc.), which affects observations (the animal’s position or trajectory). In this case, the latent variables are discrete and represent the activity of the animal at every instant. In other models, the hidden process itself may be continuous.

      Mixed (generalized) linear models are one of the key tools used in ecology to describe the effects of a set of conditions (environmental or otherwise) on a population or community. These models include random effects which are, in essence, latent variables, used to account for higher than expected dispersions or dependency relationships between variables. In most cases, these latent variables are continuous and essentially instrumental in nature.

      Joint species distribution models (JSDMs) are a multidimensional version of generalized linear models, used to describe the composition of a community as a function of both environmental variables and of the interactions between constituent species. Many JSDMs use a multidimensional (e.g. Gaussian) latent variable, the dependency structure of which is used to describe inter-species interactions.
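
      To make the hidden Markov model described above more concrete, here is a minimal sketch of the forward recursion, which computes the likelihood of a sequence of observations by summing over all hidden activity paths. The two activities, the discretized step lengths and every probability below are hypothetical values chosen for illustration:

```python
from itertools import product

STATES = ("resting", "moving")            # hypothetical hidden activities
INITIAL = {"resting": 0.6, "moving": 0.4}
TRANSITION = {                            # P(next activity | current activity)
    "resting": {"resting": 0.8, "moving": 0.2},
    "moving": {"resting": 0.3, "moving": 0.7},
}
EMISSION = {                              # P(observed step length | activity)
    "resting": {"short": 0.9, "long": 0.1},
    "moving": {"short": 0.2, "long": 0.8},
}

def forward_likelihood(observations):
    """Likelihood of the observations via the forward recursion."""
    alpha = {s: INITIAL[s] * EMISSION[s][observations[0]] for s in STATES}
    for obs in observations[1:]:
        alpha = {
            s: EMISSION[s][obs] * sum(alpha[p] * TRANSITION[p][s] for p in STATES)
            for s in STATES
        }
    return sum(alpha.values())

def brute_force_likelihood(observations):
    """Same quantity by explicit enumeration of every hidden path."""
    total = 0.0
    for path in product(STATES, repeat=len(observations)):
        prob = INITIAL[path[0]] * EMISSION[path[0]][observations[0]]
        for t in range(1, len(observations)):
            prob *= TRANSITION[path[t - 1]][path[t]]
            prob *= EMISSION[path[t]][observations[t]]
        total += prob
    return total

steps = ["short", "short", "long", "long", "short"]
likelihood = forward_likelihood(steps)
```

      The brute-force version enumerates a number of paths that grows exponentially with the length of the series, whereas the forward recursion has a cost that is linear in the number of time steps. It is included here only as a check on the recursion.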

      In ecology, models are often used to describe the effect of experimental conditions or environmental variables on the response or behavior of one or more species. Explanatory variables of this kind are often known as covariates. These effects are typically accounted for using a regression term, as in the case of generalized linear models. A regression term of this type may also be used in latent variable models, in which case the distribution of the response variable in question is considered to depend on both the observed covariates and non-observable latent variables.

      The estimation problem is even more striking in the context of Bayesian inference, as a conditional distribution must be established not only for the latent variables, but also for parameters. Once again, except in very specific circumstances, precise determination of this joint conditional law (latent variables and parameters) is usually impossible.

      The inference methods used in models with a non-calculable conditional law fall into two broad categories: sampling methods and approximation methods. Sampling methods draw a sample from the non-calculable law and use it to estimate all relevant quantities. This category includes Monte Carlo, Markov chain Monte Carlo (MCMC) and sequential Monte Carlo (SMC) methods. These algorithms are inherently random, and are notably used in Bayesian inference. Methods in the second category are used to determine an approximation of the conditional law of the latent variables (and, in the Bayesian case, of parameters) given the observations. This category includes variational methods and their extensions. These approaches vary in terms of the measure of proximity between the approximated law and the actual conditional law, and in terms of the distribution family used when searching for the approximation.
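
      As an illustration of the sampling category, the following sketch runs a random-walk Metropolis algorithm, one simple MCMC scheme. The Poisson model, Gamma prior, counts and tuning values are hypothetical; the example targets the posterior law of a single parameter rather than a full latent variable model, which keeps it short and allows the result to be compared with a known exact answer:

```python
import math
import random

def log_posterior(lam, counts, shape=2.0, rate=1.0):
    """Unnormalized log posterior of a Poisson rate, Gamma(shape, rate) prior."""
    if lam <= 0.0:
        return -math.inf                     # zero density outside the support
    log_prior = (shape - 1.0) * math.log(lam) - rate * lam
    log_lik = sum(y * math.log(lam) - lam for y in counts)
    return log_prior + log_lik

def metropolis(counts, n_iter=20000, step=0.5, seed=0):
    """Random-walk Metropolis sampler with a symmetric Gaussian proposal."""
    rng = random.Random(seed)
    lam = 1.0
    current = log_posterior(lam, counts)
    samples = []
    for _ in range(n_iter):
        proposal = lam + rng.gauss(0.0, step)
        candidate = log_posterior(proposal, counts)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            lam, current = proposal, candidate
        samples.append(lam)
    return samples

counts = [3, 5, 4, 6, 2, 4, 5, 3]            # hypothetical abundance counts
samples = metropolis(counts)
burned = samples[2000:]                      # discard an initial burn-in
posterior_mean = sum(burned) / len(burned)
```

      In this conjugate setting the exact posterior law is a Gamma distribution with mean (2 + 32) / (1 + 8), so the sampled mean can be checked against it. In latent variable models no such closed form is usually available, which is why sampling is needed.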

      This book provides an overview of recent work on statistical modeling and estimation in latent variable models for ecology. The different chapters illustrate the main principles described above. In some cases, they present statistical methods based on classical models and algorithms; in others, the focus is on developments from recent research. Each chapter addresses a specific ecological issue and a modeling approach to solving the problem, illustrated using one or more case studies.

      Readers may also access the R code in order to make use of the tools presented here, applied to their own data.

      Most of the questions associated with the case studies presented here relate to the comprehension or description of systems. While the issue of forecasting and prediction is touched upon in some chapters, this subject lies outside the main scope of our work. The issue of missing data (i.e. values not observed in samples) is not addressed either. Finally, note that this work is not an exhaustive summary of latent variable models, or of the inference methods and algorithms used with these models. Each chapter touches on the question of