# Correlation

*Using r^{2} in the Earth Sciences*

## Introducing correlation

What percent of the variation in sea level rise can be explained by the relationship between greenhouse gases in Earth's atmosphere and sea level? Are changes in land use in a watershed correlated with changes in water quality? How is the maximum clast size of a conglomerate associated with bed thickness? Earth scientists can investigate the strength and directionality of these relationships by calculating the correlation between environmental variables. **Correlation** is the statistical relationship between two variables. But what does this mean?

Broadly speaking, correlations are broken down into three different categories:

- **Positive correlation:** Variables that are positively correlated (also known as directly related) change in the same direction: as one variable increases, the other variable increases. In a scatter plot, this creates a positive slope, with the scatter plot points slanting upwards as you go to the right (Fig. 1). For example, carbon dioxide and temperature data measured at Mauna Loa, Hawaii over the past several decades (each data point represents measurements taken at the same time) are positively correlated.
- **Negative correlation:** Variables that are negatively correlated (also known as inversely related) change in opposite directions. Increased values of one variable are associated with decreased values of the other variable. In a scatter plot, this creates a negative slope, with the scatter plot points slanting downwards as you go to the right (Fig. 1). For example, some studies show that human population is negatively correlated with ecological species richness measurements.
- **No correlation:** As one variable increases, the other variable does not tend to either increase or decrease, meaning there is no relationship between the variables. In a scatter plot, there would be no discernible slant upwards or downwards (Fig. 1). For example, earthquake and wildfire data do not show a correlation.

## What is the difference between r and r^{2}?

One of the most commonly calculated correlation coefficients is **r**, known as **Pearson's correlation coefficient**. This measures the strength of a *linear* relationship between two variables, with r-values ranging from -1 (perfect negative correlation) to 1 (perfect positive correlation). There are several types of correlation, some focusing on non-linear relationships (not included here!), but all types are interpreted in the same way: the magnitude of the correlation coefficient represents the strength of the relationship (closer to 1 is strong, closer to 0 is weak) and the sign of the correlation coefficient represents the direction of the relationship (in a **direct** relationship the variables change in the same direction and the coefficient is positive; in an **inverse** relationship the variables change in opposite directions and the coefficient is negative).
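To make the definition of r concrete, here is a minimal sketch that computes Pearson's correlation coefficient by hand: the sum of the products of each variable's deviations from its mean, divided by the square root of the product of the summed squared deviations. The function name `pearson_r` and the five data points are made up for illustration and are not part of this module's dataset.

```python
import math

def pearson_r(x, y):
    """Pearson's correlation coefficient for two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of the products of deviations from the mean (numerator)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations for each variable (denominator)
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when one variable does not vary")
    return sxy / math.sqrt(sxx * syy)

# Made-up illustrative values, not real measurements
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)
print(f"r = {r:.3f}")  # positive sign: the variables tend to rise together
```

A value near 1 or -1 would indicate a strong linear relationship; the sign alone tells you the direction.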

A related measure is the **coefficient of determination, r^{2}**, which is calculated by squaring the correlation coefficient r value. This calculation again uses the *linear* relationship between two variables, but values range between 0 (no correlation) and 1 (perfect correlation) and there is no indication of the direction of the relationship. The r^{2} value is related to linear regression, in which you use a method called **least-squares regression** to plot a line that best represents the relationship between your variables. The **least squares method** is a way to find the line that best fits the data by minimizing the sum of the squared vertical distances between each point and that line.
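The least-squares idea can be sketched in a few lines. The example below assumes ordinary least squares, where "distance" means the vertical distance (residual) between each point and the line, and uses the standard closed-form solution for the slope and intercept. The function names and data values are illustrative assumptions.

```python
def least_squares_line(x, y):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    # The least-squares line always passes through (mean_x, mean_y)
    intercept = mean_y - slope * mean_x
    return slope, intercept

def sum_squared_residuals(x, y, slope, intercept):
    """Sum of squared vertical distances from each point to the line."""
    return sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y))

# Made-up illustrative values, not real measurements
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

slope, intercept = least_squares_line(x, y)
best = sum_squared_residuals(x, y, slope, intercept)
# Any other line, such as one with a slightly steeper slope, fits worse
steeper = sum_squared_residuals(x, y, slope + 0.1, intercept)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
print(f"squared residuals: fitted line {best:.2f}, steeper line {steeper:.2f}")
```

Comparing the two residual sums shows what "best fit" means here: the fitted line has the smallest sum of squared residuals of any straight line through these points.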

Once you have your linear regression, the r^{2}-value indicates how much of the variation in your y variable is explained by the least-squares regression on your x variable. In other words, how successfully does your line represent the relationship between the two variables?
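One way to see "variation explained" is to compare the scatter left over around the fitted line with the total scatter of y around its mean. The sketch below computes r^{2} as 1 minus that ratio, which for a simple least-squares line gives the same number as squaring r. The function name and data values are assumptions made for illustration.

```python
def r_squared_from_fit(x, y):
    """r^2 as the share of the variation in y explained by the fitted line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Variation left over after fitting the line
    ss_residual = sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y))
    # Total variation of y around its own mean
    ss_total = sum((b - mean_y) ** 2 for b in y)
    return 1 - ss_residual / ss_total

# Made-up illustrative values, not real measurements
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r2 = r_squared_from_fit(x, y)
print(f"The line explains {r2:.0%} of the variation in y")
```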

## When do you use r and r^{2} values?

You'll want to calculate an r value to test how strongly two variables are related to one another, as well as the direction of their relationship (positive or negative). You'll want to use the r^{2} value to determine the amount of variation in one variable that can be explained by the linear regression on the other variable. In either case, **a scatterplot of your data will help you visualize the relationship between the two variables**.

Often, both r and r^{2} are termed 'correlation coefficients', with r and r^{2} used interchangeably as metrics to describe the relationship between two environmental variables. However, mathematically, r and r^{2} are two different calculations and also describe slightly different relationships between the two variables. In this module, we will cover both the correlation coefficient (r) and the coefficient of determination (r^{2}).
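A short sketch shows why the two numbers should not be swapped. Squaring r discards the sign, so r = 0.9 and r = -0.9 give the same r^{2}, and it shrinks moderate values, so r = 0.5 corresponds to only a quarter of the variation explained. The helper name `describe` is hypothetical.

```python
def describe(r):
    """Return the direction implied by r and the corresponding r^2."""
    if not -1 <= r <= 1:
        raise ValueError("r must lie between -1 and 1")
    if r > 0:
        direction = "positive"
    elif r < 0:
        direction = "negative"
    else:
        direction = "none"
    return direction, r * r

for r in (-0.9, -0.5, 0.0, 0.5, 0.9):
    direction, r2 = describe(r)
    print(f"r = {r:+.1f}  direction: {direction:8s}  r^2 = {r2:.2f}")
```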

## How do I calculate r^{2}?

Most of the steps involved in calculating (and visualizing!) r^{2} can be done using Excel (or similar graphing software), so let's run through this process using a real-world Earth science example.


Glaciers around the world are changing dramatically, with glacier retreat being linked to climate warming, but what is the relationship between glacier change and different environmental variables? To answer this question, we can directly compare the change in glacier **terminus position** (the terminus is the end of a glacier, also called the snout or toe) to another environmental variable, such as air temperature, ocean temperature, or precipitation. Water plays a critical role in glacier behavior. Not only can it decrease the friction between ice and the bedrock, which can increase glacier speed, but surface water can also pool in surface cracks and crevasses and expand those cracks through the full thickness of the glacier, creating a direct pathway for water to go from the surface to the glacier-bedrock interface.

In this example, we will answer the question 'How strong is the relationship (r^{2} value) between glacier terminus change and the number of rainy days?' using a time series of terminus position from Jakobshavn Isbræ, the fastest moving glacier in Greenland (Figure 4), and a weather station record from the adjacent town of Ilulissat. We will be looking at the change in terminus position (km) across a summer, so a positive number indicates an advance of the glacier terminus, and a negative number indicates a retreat of the glacier terminus.

Data file: LP_Glacier_ex_data.csv (Comma Separated Values, 182 bytes, Jun 21 24)

Table 1. Number of rainy days and glacier terminus change by year.


**Step 1.** Identify two variables you want to compare from a dataset to determine whether they are related.

**Step 2.** Assign environmental variables to x (independent variable) and y (dependent variable).

The **independent variable** is always plotted on the x-axis (when graphing) and is "independent", i.e., other variables have no effect on these values (but these values will affect other variables). The **dependent variable** is always plotted on the y-axis (when graphing) and "depends" on the other variable.

When calculating the r^{2}, it's important to not only identify the two variables you want to compare, but also which is the independent variable and which is the dependent variable.
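If you are working outside Excel, Steps 1 and 2 amount to reading two columns from the data table and deciding which one is x and which is y. The sketch below is only an illustration: the column names `rainy_days` and `terminus_change_km` and every number in the placeholder table are hypothetical, since the actual layout of LP_Glacier_ex_data.csv is not reproduced here. The number of rainy days is treated as the independent variable (x) and terminus change as the dependent variable (y).

```python
import csv
import io

# Placeholder rows standing in for the real data file. The column names and
# all values below are hypothetical and are NOT the module's measurements.
PLACEHOLDER_CSV = """year,rainy_days,terminus_change_km
1,10,-0.2
2,14,-0.5
3,8,0.1
4,17,-0.9
"""

def load_xy(csv_file, x_column, y_column):
    """Read two named columns from an open CSV file as lists of floats."""
    x, y = [], []
    for row in csv.DictReader(csv_file):
        x.append(float(row[x_column]))
        y.append(float(row[y_column]))
    return x, y

# x = independent variable (rainy days), y = dependent variable (terminus change)
x, y = load_xy(io.StringIO(PLACEHOLDER_CSV), "rainy_days", "terminus_change_km")
print("x (rainy days):", x)
print("y (terminus change, km):", y)
```

To use a real file instead, open it with `open(path, newline="")`, pass the file object to `load_xy`, and replace the column names with the ones in the file's header row.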

**Step 3.** Create a scatter plot of the data and visually identify the strength and direction of the relationship between the variables.

Examine your data - how do the variables appear to be related? Are they positively, negatively, or not correlated? How would you describe the strength of the correlation? Does the correlation appear to be linear or follow some other pattern?

**Step 4.** Add a linear regression line and r^{2} value to your plot using Excel.

**Step 5.** Describe what the r^{2} value means in its geological context.

Now, with your linear regression line and r^{2} value added, do any of your answers change from above? How do the variables appear to be related? Are they positively, negatively, or not correlated? How would you describe the strength of the correlation? Does the correlation appear to be linear or follow some other pattern?

What does this r^{2} value mean in the context of your geological data?

What counts as a strong r^{2} value depends on the kind of data. Some datasets tend to produce r^{2} values close to 1, while in a field study of bedrock slope vs. glacier velocity the data sets may have more sources of variation and generally lower r^{2} values, even if the two variables are correlated. But the general principle is true that values closer to 1 indicate a stronger relationship relative to values closer to 0, for any given dataset.

## What can't the r^{2} value tell me?

As noted above, a high r^{2} value still does not imply causation: just because the number of rainy days explains some variation in glacier position, it does not prove that changes in the abundance of rainy days will cause glaciers to advance or retreat. There *may* be a causal relationship, or there could be another variable (or several variables!) we haven't analyzed yet that causes variation in one or both of our analyzed variables. There are other statistical methods to examine causality; see 'More help' below for some ideas.

It's also important to note that both r and r^{2} values can be affected by data points that are very different from the rest of the data, i.e., **outliers**. Trends in variation can also affect these values (i.e., if your data becomes less linear or more scattered as x increases or decreases). Finally, many datasets exhibit nonlinear relationships and a linear analysis would not be appropriate. Therefore, it is important to visually examine your data in a scatter plot to make sure that a linear regression is appropriate and to determine if any values or trends may influence your interpretations of correlation.
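Both cautions can be illustrated with small made-up datasets. In the sketch below, adding a single outlying point to five otherwise well-behaved points sharply lowers r^{2} and even flips the sign of r, while a perfectly curved (y = x^{2}) relationship yields an r of zero even though y is completely determined by x. The function name and all values are illustrative assumptions.

```python
import math

def pearson_r(x, y):
    """Pearson's correlation coefficient for two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Made-up illustrative values, not real measurements
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r_clean = pearson_r(x, y)
# Same data plus one point far from the trend of the others
r_outlier = pearson_r(x + [6], y + [0])

# A perfectly curved relationship that a straight line cannot describe
curve_x = [-2, -1, 0, 1, 2]
curve_y = [v ** 2 for v in curve_x]
r_curve = pearson_r(curve_x, curve_y)

print(f"without outlier: r = {r_clean:+.2f}, r^2 = {r_clean ** 2:.2f}")
print(f"with outlier:    r = {r_outlier:+.2f}, r^2 = {r_outlier ** 2:.2f}")
print(f"curved data:     r = {r_curve:+.2f}, r^2 = {r_curve ** 2:.2f}")
```

None of these problems is visible from the r or r^{2} value alone, which is why plotting the data first matters.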

## Where do you use correlation coefficients in Earth science?

- Environmental science - Determining the relationship between percent impervious surface in a watershed and different water quality metrics.
- Ecology - Determining the relationship between climate variables and forest biodiversity.
- Geochemistry - Determining the relationships among different components of magma or igneous rocks.
- Hydrology - Determining the relationship between permeability and porosity of an aquifer.
- Geophysics - Determining the relationship between earthquake depth and ground shaking.
- Atmospheric science - Determining the relationship between cloud cover and precipitation.


## More help (resources for students)

- We don't cover calculating a linear regression between two variables in detail here; see the linear regression page for that.
- Correlation vs. causation: https://sites.monroecc.edu/mofsowitz/psychology/correlationcausation/ and https://www.youtube.com/watch?v=ROpbdO-gRUo
- Additional information on correlation, causality, and more:
- Correlation Coefficient: https://mathworld.wolfram.com/CorrelationCoefficient.html
- Correlation (Khan Academy lesson): https://www.khanacademy.org/math/ap-statistics/bivariate-data-ap/correlation-coefficient-r

*Pages written by Laura C. Reynolds (Worcester State University) and Kristin M. Schild (University of Maine).*