Multivariate

I. Bartomeus

Multivariate data

Multivariate != multiple regression

Multivariate means we have two or more response variables

We are interested in learning about the common patterns or modes of variation among those multiple response variables

Multivariate data require special statistical methods

Typical case in ecology

Species composition per site.

We want to know how the abundances of multiple species change, e.g. along a gradient.

Packages in R

vegan -> Very complete and actively updated. US tradition.

ade4 -> Quite complete. French tradition.

Summarizing community data

Diversity indexes (see diversity(BCI) in vegan)
Beta diversity indexes (betadiver() in vegan)
Rarefaction (rarecurve(BCI, sample = min(rs)) in vegan, where rs = rowSums(BCI))
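
A minimal sketch of the three summaries above, assuming vegan is installed and using its bundled BCI tree counts. The choice of Shannon for alpha diversity and Whittaker's "w" for beta diversity is only illustrative.

```r
library(vegan)
data(BCI)  # tree counts on 50 plots, shipped with vegan

# Alpha diversity per site: Shannon index (the default) and species richness
shannon  <- diversity(BCI, index = "shannon")
richness <- specnumber(BCI)

# Beta diversity: pairwise index between sites ("w" is Whittaker's index)
beta_w <- betadiver(BCI, method = "w")

# Rarefaction: expected richness if every site had the smallest sample size
rs <- rowSums(BCI)
rare_rich <- rarefy(BCI, sample = min(rs))
rarecurve(BCI, sample = min(rs), step = 20)  # draws one curve per site
```

Rarefied richness can never exceed observed richness, which makes a quick sanity check.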

Different questions, different methods

Clustering

#clustering
hclust() #uses dissimilarity indexes (see below)

#K-means
kmeans() #uses raw data
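
A short sketch contrasting the two calls, assuming vegan's dune data. The number of groups (3), the average linkage and the Hellinger transformation before k-means are illustrative choices, not recommendations from the slides.

```r
library(vegan)
data(dune)  # 20 sites x 30 species

# Hierarchical clustering works on a dissimilarity matrix
d  <- vegdist(dune, method = "bray")
hc <- hclust(d, method = "average")
groups_hc <- cutree(hc, k = 3)

# k-means works on the site-by-species table itself (here Hellinger-transformed)
set.seed(1)
km <- kmeans(decostand(dune, "hellinger"), centers = 3, nstart = 25)

# Cross-tabulate the two groupings to see how far they agree
table(hclust = groups_hc, kmeans = km$cluster)
```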

Mantel test (or Procrustes)

mantel() #low power, better use procrustes
procrustes()
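
A hedged example of both approaches, assuming vegan's varespec and varechem data. Note that procrustes() only rotates and fits; the permutation test is done with vegan's protest(). The transformations used here are illustrative.

```r
library(vegan)
data(varespec)
data(varechem)

# Mantel: correlation between two distance matrices
d_spe <- vegdist(varespec, method = "bray")
d_env <- dist(scale(varechem))
set.seed(1)
m <- mantel(d_spe, d_env, permutations = 999)

# Procrustes: rotate one ordination onto another, then test the fit
ord_spe <- rda(decostand(varespec, "hellinger"))
ord_env <- rda(varechem, scale = TRUE)
pro <- procrustes(ord_spe, ord_env)
pt  <- protest(ord_spe, ord_env, permutations = 999)

m$statistic  # Mantel r
pt$t0        # Procrustes correlation
pt$signif    # permutation p-value
```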

Different questions, different methods

Ordination (unconstrained)
- Principal Components Analysis (PCA)
- Correspondence Analysis (CA)
- Principal Coordinates Analysis (PCO or PCoA)
- Non-metric Multidimensional Scaling (NMDS)

Ordination

Principal Components Analysis (PCA) is a linear method — most useful for environmental data or sometimes with species data and short gradients

Correspondence Analysis (CA) is a unimodal method — most useful for species data, especially where non-linear responses are observed

Principal Coordinates Analysis (PCO) and Non-metric Multidimensional Scaling (NMDS) — can be used for any kind of data

PCA

Unconstrained = No explanatory variables

Summarizes a correlation (or covariance) matrix

Creates as many axes as variables. Each subsequent axis is uncorrelated with the previous ones (they are orthogonal) and explains a smaller share of the variance.
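
As a sketch, PCA can be run in vegan with rda() and no explanatory variables. This assumes the bundled varechem soil data (14 variables); scale = TRUE is what makes it a PCA of the correlation matrix.

```r
library(vegan)
data(varechem)  # 24 sites x 14 soil chemistry variables

# scale = TRUE standardizes each variable, so the correlation matrix is analysed
pca <- rda(varechem, scale = TRUE)

# One eigenvalue per axis, in decreasing order
eig  <- as.numeric(eigenvals(pca))
prop <- eig / sum(eig)          # proportion of variance per axis
round(cumsum(prop)[1:3], 2)     # cumulative proportion, first three axes

# Site scores on the first two axes are uncorrelated with each other
site_scores <- scores(pca, display = "sites", choices = 1:2)
cor(site_scores[, 1], site_scores[, 2])
```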

Correspondence analysis

(CA) is in principle very similar to PCA: a weighted form of PCA.

Used when you have species abundance data across sites.

Distance-based methods are more commonly used (see below)
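
A minimal sketch, assuming vegan's dune data: calling cca() without explanatory variables gives a plain CA. The last lines illustrate the weighting, since the total inertia of a CA equals the chi-square statistic of the table divided by its grand total.

```r
library(vegan)
data(dune)

# cca() with no constraints is an unconstrained correspondence analysis
ca <- cca(dune)
ca_eig <- as.numeric(eigenvals(ca))

# Total inertia of CA = chi-square of the site-by-species table / grand total
chi2 <- suppressWarnings(chisq.test(as.matrix(dune))$statistic)
c(total_inertia = ca$tot.chi, chi2_over_n = as.numeric(chi2) / sum(dune))
```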

Principal Coordinates Analysis

Distance based method.

NMDS is more flexible (see below)
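
As an illustration, PCoA can be computed with base R's cmdscale() on any dissimilarity matrix. The sketch assumes vegan's dune data and Bray-Curtis dissimilarities, both arbitrary choices.

```r
library(vegan)
data(dune)

# Any dissimilarity works; Bray-Curtis is used here only as an example
d <- vegdist(dune, method = "bray")

# eig = TRUE also returns the eigenvalues and goodness-of-fit
pcoa <- cmdscale(d, k = 2, eig = TRUE)

head(pcoa$points)  # site coordinates on the first two axes
pcoa$GOF           # goodness-of-fit of the 2D solution
```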

NMDS

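Maps a dissimilarity matrix into a few dimensions (typically 2D), preserving the rank order of the dissimilarities.
Stress measures how much the map distorts them (lower is better).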
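
A hedged sketch with vegan's metaMDS(), assuming the dune data and Bray-Curtis dissimilarities. metaMDS() uses random starts, so a seed is set to make the run repeatable.

```r
library(vegan)
data(dune)

set.seed(42)
nmds <- metaMDS(dune, distance = "bray", k = 2, trymax = 50, trace = FALSE)

nmds$stress       # lower is better
stressplot(nmds)  # ordination distances against observed dissimilarities

site_xy <- scores(nmds, display = "sites")
```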

It’s all about your dissimilarity metric

Binary:
- Jaccard: Mathematically simple and intuitive for expressing overlap as a percentage.
- Sorensen: Places more weight on shared species.

Dissimilarity metrics

Quantitative:
- Euclidean: Simple distance, good for e.g. distance between sites.
- Bray-Curtis [0-1]: Ignores cases in which the species is absent in both community samples, and it is dominated by the abundant species so that rare species add very little to the value of the coefficient.

Dissimilarity metrics

- Morisita [0-1]: Almost completely independent of sample size and species diversity, but extremely sensitive to dominant species because it uses squared abundance terms.
- Morisita-Horn [0-1]: A more stable version of the index for quantitative community overlap. Useful for comparing communities when sampling effort or richness varies strongly between sites.
- Kulczynski: Gives more weight to rare species.
- Gower: Allows mixing categorical and quantitative variables.

In R

Index	R Function Call	Notes
Jaccard	`vegdist(x, method = "jaccard", binary = TRUE)`	Classic binary index for similarity.
Sørensen	`vegdist(x, method = "bray", binary = TRUE)`	The binary version of Bray-Curtis is mathematically equivalent to Sørensen.

In R

Index	R Function Call	Notes
Bray-Curtis	`vegdist(x, method = "bray")`	Typical for abundance data in ecology.
Morisita	`vegdist(x, method = "morisita")`	Varying sample sizes; only works with integer counts.
Morisita-Horn	`vegdist(x, method = "horn")`	Handles non-integer/standardized data.
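
The calls in the tables can be tried side by side. This sketch assumes vegan's BCI counts, restricted to five sites to keep the output small. The last line checks the known relation between the two binary indices: if S is the Sørensen dissimilarity, the Jaccard dissimilarity is 2S/(1+S).

```r
library(vegan)
data(BCI)
x <- BCI[1:5, ]  # five sites, integer counts

# Binary (presence/absence) indices
d_jac <- vegdist(x, method = "jaccard", binary = TRUE)
d_sor <- vegdist(x, method = "bray", binary = TRUE)

# Quantitative indices
d_bray <- vegdist(x, method = "bray")
d_mor  <- vegdist(x, method = "morisita")  # needs integer counts
d_horn <- vegdist(x, method = "horn")

# Jaccard and Sørensen rank site pairs identically: J = 2S / (1 + S)
all.equal(as.vector(d_jac), as.vector(2 * d_sor / (1 + d_sor)))
```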

Constrained methods

“Constrained ordination relates the response data (species) directly to explanatory data (environmental variables). It only displays the variation in the species data that can be explained by the provided environmental variables.”

RDA (constrained version of PCA)
CCA (constrained version of CA)

Note that the axes are “constrained” to be linear combinations of the environmental variables. Any variation in the species data that is not related to those variables is “thrown away” or moved to residual (unconstrained) axes.

If you want to know “What are the biggest patterns in my data?”, use unconstrained. If you want to know “How much of my data is explained by these specific environmental factors?”, use constrained.
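
A minimal sketch of both constrained methods, assuming vegan's varespec and varechem data. The three soil variables (N, P, K) and the Hellinger transformation before RDA are illustrative assumptions.

```r
library(vegan)
data(varespec)
data(varechem)

# RDA: constrained PCA (species Hellinger-transformed first)
spe_hel <- decostand(varespec, "hellinger")
rda_fit <- rda(spe_hel ~ N + P + K, data = varechem)

# CCA: constrained CA on the raw abundances
cca_fit <- cca(varespec ~ N + P + K, data = varechem)

# Share of the total variation explained by the constraints
explained <- rda_fit$CCA$tot.chi / rda_fit$tot.chi
RsquareAdj(rda_fit)  # same R2 plus its adjusted version

# Permutation test of the whole model
set.seed(1)
anova(rda_fit, permutations = 999)
```

Three constraints give three constrained axes; the rest of the variation sits on the residual (unconstrained) axes.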

PERMANOVA

MANOVA is the multivariate form of ANOVA

Decompose variation in the responses into

variation within groups
variation between groups

PERMANOVA uses permutation tests to assess the importance of fitted models: the data are shuffled in some way and the model is refitted to derive a null distribution under some hypothesis of no effect.

PERMANOVA

vegan has four different ways to do essentially this kind of analysis

adonis() — implements Anderson (2001) - (deprecated)
adonis2() — implements McArdle & Anderson (2001)
dbrda() — implementation based on McArdle & Anderson (2001) - inherits from rda() and cca()
capscale() — implements Legendre & Anderson (1999)
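
As a sketch, a one-factor PERMANOVA with adonis2(), assuming vegan's dune and dune.env data with Management as the grouping factor (an illustrative choice).

```r
library(vegan)
data(dune)
data(dune.env)

set.seed(1)
perm <- adonis2(dune ~ Management, data = dune.env,
                method = "bray", permutations = 999)
perm  # table with Df, SumOfSqs, R2, F and the permutation p-value
```

The first row's sum of squares is the between-group variation, the residual row is the within-group variation, and the two add up to the total.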

The dispersion problem

Anderson (2001) noted that PERMANOVA could confound location & dispersion effects

If one or more groups are more variable — dispersed around the centroid — than the others, this can result in a false detection of a difference of means — a location effect.

betadisper()
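
A hedged example of checking dispersion before interpreting a PERMANOVA, assuming the same dune data and Management groups as an illustration. betadisper() computes each site's distance to its group centroid and permutest() tests whether those distances differ among groups.

```r
library(vegan)
data(dune)
data(dune.env)

d    <- vegdist(dune, method = "bray")
disp <- betadisper(d, group = dune.env$Management)

# Permutation test for homogeneity of multivariate dispersion
set.seed(1)
permutest(disp, permutations = 999)

# Distances to the group centroid, one value per site
boxplot(disp)
```

A significant result here means that a significant PERMANOVA may reflect differences in spread rather than in location.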

A new approach to multivariate analysis

Generalized Linear Latent Variable Models (GLLVMs)

“bring tools and capabilities from classic (mixed-effects) regression models to multivariate community analysis”

Flexibility of Generalized Linear Models (GLMs) combined with dimensionality reduction.

Axes (components) = Latent variables
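
A rough sketch of fitting such a model, assuming the gllvm package is installed. The count data are simulated here purely for illustration (a hypothetical gradient driving six hypothetical species), and the Poisson family and two latent variables are arbitrary choices.

```r
library(gllvm)

# Simulated counts (hypothetical): 30 sites, 6 species responding to one gradient
set.seed(1)
n <- 30
p <- 6
gradient <- rnorm(n)
slopes   <- seq(-1, 1, length.out = p)
y <- matrix(rpois(n * p, lambda = exp(1 + outer(gradient, slopes))), n, p)
colnames(y) <- paste0("sp", seq_len(p))

# GLM-type response distribution plus two latent variables (the "axes")
fit <- gllvm(y = y, family = poisson(), num.lv = 2)

# Site scores on the latent variables, comparable to ordination axes
lv <- getLV(fit)
head(lv)
```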
