Correlation
Using r² in the Earth Sciences

Initial Publication Date: August 16, 2024

This module is undergoing classroom implementation with the Math Your Earth Science Majors Need project. The module is available for public use, but it will likely be revised after classroom testing.

Kama River, Russia
Provenance: Kama River near of Pyskor, Perm Krai Wikimedia Commons Author: Niklitov
Reuse: This item is offered under a Creative Commons Attribution-NonCommercial-ShareAlike license http://creativecommons.org/licenses/by-nc-sa/3.0/ You may reuse this item for non-commercial purposes as long as you provide attribution and offer any derivative works under a similar license.

Introducing correlation

What percent of the variation in sea level rise can be explained by the relationship between greenhouse gases in Earth's atmosphere and sea level? Are changes in land use in a watershed correlated with changes in water quality? How is the maximum clast size of a conglomerate associated with bed thickness? Earth scientists can investigate the strength and directionality of these relationships by calculating the correlation between environmental variables. Correlation is the statistical relationship between two variables. But what does this mean?

Broadly speaking, correlations are broken down into three different categories:

Positive correlation: Variables that are positively correlated (also known as directly related) change in the same direction, i.e., as one variable increases, the other variable also increases. In a scatter plot, this creates a positive slope, with the points slanting upwards as you go to the right (Fig. 1). For example, carbon dioxide and temperature data measured at Mauna Loa, Hawaii over the past several decades (each data point represents measurements taken at the same time) are positively correlated.
Negative correlation: Variables that are negatively correlated (also known as inversely related) change in opposite directions. Increased values of one variable are associated with decreased values of the other variable. In a scatter plot, this creates a negative slope, with the points slanting downwards as you go to the right (Fig. 1). For example, some studies show that human population is negatively correlated with ecological species richness measurements.
No correlation: As one variable increases, the other variable does not tend to either increase or decrease, meaning there is no relationship between the variables. In a scatter plot, there would be no discernible slant upwards or downwards (Fig. 1). For example, earthquake and wildfire data do not show a correlation.

What is the difference between r and r²?

One of the most commonly calculated correlation coefficients is r, known as Pearson's correlation coefficient. It measures the strength of a linear relationship between two variables, with r values ranging from -1 (perfect negative correlation) to 1 (perfect positive correlation). There are several types of correlation coefficient, some focusing on non-linear relationships (not included here!), but all types are interpreted in the same way. The magnitude of the coefficient represents the strength of the relationship (a magnitude closer to 1 is strong, closer to 0 is weak), and the sign represents the direction of the relationship (in a direct relationship, the variables change in the same direction and the coefficient is positive; in an inverse relationship, the variables change in opposite directions and the coefficient is negative).

Another correlation metric is the coefficient of determination, r², which is calculated by squaring the correlation coefficient r. This calculation again uses the linear relationship between two variables, but values range between 0 (no correlation) and 1 (perfect correlation), and there is no indication of the direction of the relationship. The r² value is related to linear regression, in which you use a method called least squares to plot a line that best represents the relationship between your variables.

Once you have your linear regression, the r² value indicates how much of the variation in your y variable is explained by the least-squares regression on your x variable. In other words, how successfully does your line represent the relationship between the two variables?

When do you use r and r² values?

You'll want to calculate an r value to test how strongly two variables are related to one another, as well as the direction of their relationship (positive or negative). You'll want to use the r² value to determine the amount of variation in one variable that can be explained by the linear regression on the other variable. In either case, a scatterplot of your data will help you visualize the relationship between the two variables.

In the Earth sciences, it is common practice for both r and r² to be termed 'correlation coefficients', with r and r² used interchangeably as metrics to describe the relationship between two environmental variables. However, mathematically, r and r² are two different calculations and describe slightly different aspects of the relationship between the two variables. In this module, we will cover both the correlation coefficient (r) and the coefficient of determination (r²).

Test yourself! Which of the following statements can be said about the relationship between river width and river depth? Check all that apply.

The relationship between river width and depth is positively correlated

Yes. As river width increases, so does river depth, so the two are positively correlated. 1 of 2 correct choices.

The relationship between river width and depth is negatively correlated

As river width increases (x axis), river depth also increases (y axis). Since both increase together, producing an upward-slanting line, the correlation is positive, not negative.

The relationship between river width and depth is not correlated.

Since river depth increases as river width increases, there is a positive correlation between the two variables. This can also be determined from the upward-slanting line as you move right.

The relationship between river width and depth is strongly correlated.

Yes. An r² value of 0.9 represents a strong correlation. 1 of 2 correct choices.

The relationship between river width and depth is weakly correlated

Values closer to 0 are considered weak, while values closer to 1 are considered strong. Therefore, an r² value of 0.9 would be considered a strong correlation.

How do I calculate r²?

Most of the steps involved in calculating (and visualizing!) r² can be done using Excel (or similar graphing software), so let's run through this process using a real-world Earth science example.

Glaciers around the world are changing dramatically, with glacier retreat being linked to climate warming, but what is the relationship between glacier change and different environmental variables? To answer this question, we can directly compare the change in glacier terminus position (the terminus is the end of a glacier, also called the snout or toe) to another environmental variable, such as air temperature, ocean temperature, or precipitation. Water plays a critical role in glacier behavior. Not only can it decrease the friction between ice and the bedrock, which can increase glacier speed, but surface water can also pool in surface cracks and crevasses and expand those cracks through the full thickness of the glacier, creating a direct pathway for water to go from the surface to the glacier-bedrock interface. In this example, we will answer the question 'How strong is the relationship (r² value) between glacier terminus change and the number of rainy days?' using a time series of terminus position from Jakobshavn Isbræ, the fastest moving glacier in Greenland (Figure 4), and a weather station record from the adjacent town of Ilulissat. We will be looking at the change in terminus position (km) across a summer, so a positive number indicates an advance of the glacier terminus, and a negative number indicates a retreat of the glacier terminus.

LP_Glacier_ex_data.csv (Comma Separated Values, 182 bytes, Jun 21, 2024)
Table 1. Number of rainy days and glacier terminus change by year.

Step 1. Identify two variables you want to compare from a dataset to determine whether they are related.

In the following dataset, we have three columns: time (year), number of days with rain each year, and glacier terminus change (km) during the summer each year. We might hypothesize that the more rain that falls, the more the glacier will retreat (terminus change will be negative); therefore, the two variables we want to compare would be the glacier terminus change and the total number of rainy days.

Step 2. Assign environmental variables to x (independent variable) and y (dependent variable).

You want to identify an x and y value for each of your n individual measurements. A measurement can represent many different things (an individual sample, a time of observation, etc.) depending on the problem you are working on. In this example, each measurement is a year of observation, and we want to see if glacier terminus position is influenced by rainy days, so we will designate the number of rainy days as the independent variable (x) and the glacier terminus change as the dependent variable (y).

The independent variable is always plotted on the x-axis (when graphing) and is "independent", i.e., other variables have no effect on these values (but these values will affect other variables). The dependent variable is always plotted on the y-axis (when graphing) and "depends" on the other variable.

When calculating r², it's important to identify not only the two variables you want to compare, but also which is the independent variable and which is the dependent variable.
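
If you prefer to work outside Excel, Steps 1 and 2 might look like the following Python sketch. The column names and all values below are placeholders invented for illustration; they are not the contents of LP_Glacier_ex_data.csv, whose actual headers may differ.

```python
import csv
import io

# Hypothetical stand-in for LP_Glacier_ex_data.csv. The column names and
# the values are placeholders for this sketch, NOT the module's real data.
csv_text = """year,rainy_days,terminus_change_km
2001,10,0.5
2002,14,-0.2
2003,20,-1.1
"""

rows = list(csv.DictReader(io.StringIO(csv_text)))

# Independent variable (x): number of rainy days.
x = [float(row["rainy_days"]) for row in rows]
# Dependent variable (y): summer terminus change in km (negative = retreat).
y = [float(row["terminus_change_km"]) for row in rows]

# Each year contributes exactly one (x, y) pair, so n is the number of years.
n = len(rows)
print(f"n = {n} paired measurements")
for row, xi, yi in zip(rows, x, y):
    print(row["year"], xi, yi)
```

Keeping x and y as two lists of equal length, in the same year order, is what lets each rainy-day count stay paired with the terminus change from the same summer.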

Step 3. Create a scatter plot of the data and visually identify the strength and direction of the relationship between the variables.

Show me how to create a scatter plot in Excel

In Excel, enter the spreadsheet information as you see it in Figure 5. To create a scatter plot of your data, highlight only the cells of your x and y data (you can do this by selecting the upper left cell and dragging your cursor to the lower right). Once the cells are highlighted, select Insert (tab at top) → scatter icon (x and y axes with just dots). A scatter plot should now appear!

Examine your data: how do the variables appear to be related? Are they positively, negatively, or not correlated? How would you describe the strength of the correlation? Does the correlation appear to be linear or follow some other pattern?

Step 4. Add a linear regression line and r² value to your plot using Excel.

Show me how to create a regression line

First, make sure that your scatterplot is selected by clicking on the plot. Next, we want to add a linear regression line. To do this, click on Chart Design (tab at top) → Add Chart Element (drop-down arrow) → Trendline → Linear. Behind the scenes, Excel uses the least-squares method to calculate the line of best fit to your data.
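
For reference, the least-squares line minimizes the sum of the squared vertical distances between the data points and the line. Writing the line as ŷ = a + bx (the symbols a, b, x̄, and ȳ are introduced here for illustration and are not used elsewhere in this module), the standard expressions for the slope b and the intercept a are:

```latex
b = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
a = \bar{y} - b\,\bar{x}
```

Here x̄ and ȳ are the means of the x and y values, and n is the number of paired measurements.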

Show me how to add the r² value to our plot

Again, make sure that your scatterplot is selected. Under Chart Design (tab at top), click Add Chart Element (drop-down menu) → Trendline → More Trendline Options. In the right-hand panel, make sure the bar graph icon (tab) is selected and that Linear is still selected, then check "Display Equation on chart" and "Display R-squared value on chart" (you can also add the linear trendline here and skip the prior step). The equation that represents your line, as well as your r² value, should now appear on your scatterplot next to your linear trendline! If you want to learn more about the equation for the linear regression, see the linear regression module.

What if I want to calculate the correlation coefficient r instead?

Show me how to calculate the correlation coefficient r

The correlation coefficient, r, can be used to describe the strength and direction of correlation between two variables and can supplement your visual estimation of these properties from looking at the scatter plot. You can calculate this value in several different ways. One way would be by hand, using the equation listed below.

Show me how r is calculated
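
A standard textbook form of Pearson's correlation coefficient for n paired measurements (x_i, y_i) is given here. The notation is an assumption: x̄ and ȳ denote the means of the x and y values.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\;
          \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The numerator is positive when x and y tend to be above (or below) their means together and negative when one tends to be above its mean while the other is below, which is what gives r its sign.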

However, Excel can help us again.

Show me how to calculate r in Excel

To calculate the correlation coefficient in Excel, select an empty cell and type =CORREL( . Inside the parentheses, highlight your x values, type a comma, and then highlight your y values. Close the parentheses and press Enter, and you should see your r value. Check that it makes sense: it should be between -1 and 1 and should match the strength you estimated visually.
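
As a cross-check on a spreadsheet result, the same by-hand calculation can be sketched in Python. The function name `pearson_r` and the data values are hypothetical (the values are not the module's glacier data), and the sketch assumes neither variable is constant, since that would make the denominator zero.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson's r for paired lists x and y, computed from deviations."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must be paired and have at least 2 values")
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    # Numerator: how x and y vary together.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    # Denominator: how much x and y each vary on their own.
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)


# Made-up values for illustration only (not the module's glacier data).
rainy_days = [8, 12, 15, 19, 23]
terminus_change_km = [0.4, -0.1, 0.2, -0.6, -0.9]

r = pearson_r(rainy_days, terminus_change_km)

# Sanity check suggested in the text: r must lie between -1 and 1.
assert -1 <= r <= 1
print(f"r = {r:.2f}")
```

Given the same x and y ranges, this should agree with Excel's CORREL result up to rounding.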

Step 5. Describe what the r² value means in its geological context.

Now, with your linear regression line and r² value added, do any of your answers change from above? How do the variables appear to be related? Are they positively, negatively, or not correlated? How would you describe the strength of the correlation? Does the correlation appear to be linear or follow some other pattern?

Hopefully all of your answers from above still hold true: the variables are negatively correlated, meaning that as the number of rainy days increases, the glacier terminus retreats more (the terminus change becomes more negative). Since the r² value is 0.41, this is a moderately strong correlation, especially for field data, and is supported by the data points falling relatively close to the trendline. Because the data points follow the straight trendline without an obvious curve, the relationship appears to be linear.

What does this r² value mean in the context of your geological data?

In the previous step, did you notice that the r² value is positive even though the relationship is negative? This is because r² does not indicate the direction of the relationship (that is given by the sign of r), but instead tells us how much of the variability in your y values can be explained by the regression on your x values. So, in this example, 0.41 (out of 1.0, or 41%) of the variation in glacier terminus change can be explained by the regression on the number of rainy days. If we know that glacier terminus change is correlated with rainy days, can we say that increasing rainy days will cause glacier retreat? No! We are looking at a very small data set, and both variables are part of a larger system and are influenced by other factors. For example, how much a glacier changes is influenced by factors such as the thickness of the ice, the shape of the underlying bedrock, and the temperature of the ocean, and precipitation is influenced by temperature and atmospheric patterns. However, we can say that this relationship suggests that rain likely contributes to changes in glacier position. Correlation does not equal causation.
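
As a worked check on the numbers in this example: because r² = 0.41 and the trendline slopes downward, r itself is about -0.64. The short Python sketch below does that arithmetic; the variable names are illustrative.

```python
from math import sqrt

r_squared = 0.41                 # value reported for this example
relationship_is_negative = True  # read from the downward-sloping trendline

# Squaring discards the sign, so the direction must be supplied separately
# (from the scatter plot or the slope of the trendline).
r_magnitude = sqrt(r_squared)
r = -r_magnitude if relationship_is_negative else r_magnitude

print(f"r = {r:.2f}")                       # r = -0.64
print(f"explained: {r_squared:.0%}")        # explained: 41%
print(f"unexplained: {1 - r_squared:.0%}")  # unexplained: 59%
```

The remaining 59% of the variation in terminus change is not accounted for by the regression on rainy days, which is consistent with other factors also influencing the glacier.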

What counts as 'strong' vs 'weak' correlation is defined differently in different fields and different studies. For example, in a tightly constrained laboratory experiment examining the effects of slope on glacier velocity, we may expect to have r² values close to 1, while in a field study of bedrock slope vs. glacier velocity the data sets may have more sources of variation and generally lower r² values, even if the two variables are correlated. But the general principle is true that values closer to 1 indicate a stronger relationship relative to values closer to 0, for any given dataset.

What can't the r² value tell me?

As noted above, a high r² value still does not imply causation. Just because the number of rainy days explains some variation in glacier position, it does not prove that changes in the abundance of rainy days will cause glaciers to advance or retreat. There may be a causal relationship, or there could be another variable (or several variables!) we haven't analyzed yet that causes variation in one or both of our analyzed variables. There are other statistical methods to examine causality; see 'more help' below for some ideas.

It's also important to note that both r and r² values can be affected by data points that are very different from the rest of the data, i.e., outliers. Trends in variation can also affect these values (for example, if your data become less linear or more scattered as x increases or decreases). Finally, many datasets exhibit nonlinear relationships, and a linear analysis would not be appropriate. Therefore, it is important to visually examine your data in a scatter plot to make sure that a linear regression is appropriate and to determine if any values or trends may influence your interpretations of correlation.

Where do you use correlation coefficients in Earth science?

Environmental science - Determining the relationship between percent impervious surface in a watershed and different water quality metrics.
Ecology - Determining the relationship between climate variables and forest biodiversity.
Geochemistry - Determining the relationships among different components of magma or igneous rocks.
Hydrology - Determining the relationship between permeability and porosity of an aquifer.
Geophysics - Determining the relationship between earthquake depth and ground shaking.
Atmospheric science - Determining the relationship between cloud cover and precipitation.