Initial Publication Date: August 16, 2024

Correlation
Using r2 in the Earth Sciences

This module is undergoing classroom implementation with the Math Your Earth Science Majors Need project. The module is available for public use, but it will likely be revised after classroom testing.

Introducing correlation

What percent of the variation in sea level rise can be explained by the relationship between greenhouse gases in Earth's atmosphere and sea level? Are changes in land use in a watershed correlated with changes in water quality? How is the maximum clast size of a conglomerate associated with bed thickness? Earth scientists can investigate the strength and directionality of these relationships by calculating the correlation between environmental variables. Correlation is the statistical relationship between two variables. But what does this mean?

Broadly speaking, correlations are broken down into three different categories:

  • Positive correlation: Variables that are positively correlated (also known as directly related) change in the same direction; i.e., as one variable increases, the other variable also increases. In a scatter plot, this creates a positive slope, with the scatter plot points slanting upwards as you go to the right (Fig. 1). For example, carbon dioxide and temperature data measured at Mauna Loa, Hawaii over the past several decades (each data point represents measurements taken at the same time) are positively correlated.
  • Negative correlation: Variables that are negatively correlated (also known as inversely related) change in opposite directions. Increased values of one variable are associated with decreased values of the other variable. In a scatter plot, this creates a negative slope, with the scatter plot points slanting downwards as you go to the right (Fig. 1). For example, some studies show that human population is negatively correlated with ecological species richness measurements.
  • No correlation: As one variable increases, the other variable does not tend to either increase or decrease, meaning there is no relationship between the variables. In a scatter plot, there would be no discernible slant upwards or downwards (Fig. 1). For example, earthquake and wildfire data do not show a correlation.


What is the difference between r and r2?

One of the most commonly calculated correlation coefficients is r, known as Pearson's correlation coefficient. This measures the strength of a linear relationship between two variables, with r-values ranging from -1 (perfect negative correlation) to 1 (perfect positive correlation). There are several types of correlation, some focusing on non-linear relationships (not included here!), but all types are interpreted in the same way: the magnitude of the correlation coefficient represents the strength of the relationship (closer to 1 is strong, closer to 0 is weak), and the sign of the correlation coefficient represents the direction of the relationship (direct: variables change in the same direction and the relationship is positive; inverse: variables change in opposite directions and the relationship is negative).
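To make the definition concrete, here is a minimal Python sketch of Pearson's r computed from its standard formula (the sum of products of deviations, divided by the square root of the product of the summed squared deviations). The function name `pearson_r` and the three small datasets are made up for illustration; they are not measurements from this module.

```python
import math

def pearson_r(x, y):
    """Pearson's correlation coefficient for two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of deviations from the means (numerator)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    # Sums of squared deviations for each variable (denominator)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Illustrative, made-up values (not real measurements)
x = [1, 2, 3, 4, 5]
y_positive = [2.1, 3.9, 6.2, 7.8, 10.1]   # rises as x rises
y_negative = [9.8, 8.1, 6.0, 4.2, 1.9]    # falls as x rises
y_none = [5, 3, 6, 3, 5]                  # no consistent trend

r_pos = pearson_r(x, y_positive)
r_neg = pearson_r(x, y_negative)
r_none = pearson_r(x, y_none)
print(round(r_pos, 3), round(r_neg, 3), round(r_none, 3))
```

The sign of each result gives the direction of the relationship and the magnitude gives its strength, matching the three categories described earlier.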

Another correlation metric is the coefficient of determination, r2, which is calculated by squaring the correlation coefficient r. This calculation again uses the linear relationship between two variables, but values range between 0 (no linear correlation) and 1 (perfect correlation), and there is no indication of the direction of the relationship. The r2 value is related to linear regression, in which you use a method called least-squares regression to plot a line that best represents the relationship between your variables. (The least-squares method finds the line that best fits the data by minimizing the sum of the squared vertical distances from each point to that line.)

Once you have your linear regression, the r2-value indicates how much of the variation in your y variable is explained by the least-squares regression on your x variable. In other words, it tells you how successfully your line represents the relationship between the two variables.
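Written out, "variation explained" has a standard definition. The symbols below are notation introduced here for illustration, not taken from the module: y_i is an observed value, \hat{y}_i is the value the regression line predicts for the same x, and \bar{y} is the mean of the observed y values.

```latex
r^2 = 1 - \frac{\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}
               {\sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2}
```

The numerator is the scatter left over around the regression line, and the denominator is the total scatter of y about its mean. For a least-squares line fit to a single x variable, this quantity equals the square of Pearson's r.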

When do you use r and r2 values?  

You'll want to calculate an r value to test how strongly two variables are related to one another, as well as the direction of their relationship (positive or negative). You'll want to use the r2 value to determine the amount of variation in one variable that can be explained by the linear regression on the other variable. In either case, a scatterplot of your data will help you visualize the relationship between the two variables.

In the Earth Sciences, it is common practice for both r and r2 to be termed 'correlation coefficients', with r and r2 used interchangeably as metrics to describe the relationship between two environmental variables. However, mathematically, r and r2 are two different calculations, and they describe slightly different aspects of the relationship between the two variables. In this module, we will cover both the correlation coefficient (r) and the coefficient of determination (r2).

Test yourself! Which of the following statements can be said about the relationship between river width and river depth? Check all that apply.

[CORRECT] Yes. As river width increases, so does river depth, so the two are positively correlated. 1 of 2 correct choices.
[INCORRECT] As river width increases (x axis), river depth also increases (y axis). Since both increase together and lead to an upward slanting line, the correlation would be positive, not negative.

[INCORRECT] Since river depth increases as river width increases, there is a positive correlation between the two variables. This can also be determined by the upward slanting line as you move right.

[CORRECT] Yes. An r2 value of 0.9 represents a strong correlation. 1 of 2 correct choices.  

[INCORRECT] Values closer to 0 are considered weak, while values closer to 1 are considered strong. Therefore, an r2 value of 0.9 would be considered a strong correlation.  

  

How do I calculate r2?

Most of the steps involved in calculating (and visualizing!) r2 can be done using Excel (or similar graphing software), so let's run through this process using a real-world Earth science example.

Glaciers around the world are changing dramatically, with glacier retreat being linked to climate warming, but what is the relationship between glacier change and different environmental variables? To answer this question, we can directly compare the change in glacier terminus position (the terminus is the end of a glacier, also called the snout or toe) to another environmental variable, such as air temperature, ocean temperature, or precipitation. Water plays a critical role in glacier behavior. Not only can it decrease the friction between ice and the bedrock, which can increase glacier speed, but surface water can also pool in surface cracks and crevasses and expand those cracks through the full thickness of the glacier, creating a direct pathway for water to go from the surface to the glacier-bedrock interface. In this example, we will answer the question 'How strong is the relationship (r2 value) between glacier terminus change and the number of rainy days?' using a time series of terminus position from Jakobshavn Isbræ, the fastest moving glacier in Greenland (Figure 4), and a weather station record from the adjacent town of Ilulissat. We will be looking at the change in terminus position (km) across a summer, so a positive number indicates an advance of the glacier terminus, and a negative number indicates a retreat of the glacier terminus.

LP_Glacier_ex_data.csv (Comma Separated Values 182bytes Jun21 24)

Table 1. Number of rainy days and glacier terminus change by year.

 

Step 1. Identify two variables you want to compare from a dataset to determine whether they are related.


Step 2. Assign environmental variables to x (independent variable) and y (dependent variable).  

The independent variable is, by convention, plotted on the x-axis (when graphing) and is "independent", i.e., other variables have no effect on these values (but these values will affect other variables). The dependent variable is plotted on the y-axis (when graphing) and "depends" on the other variable.

When calculating the r2, it's important to not only identify the two variables you want to compare, but also which is the independent variable and which is the dependent variable.  
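Steps 1 and 2 might look like this in Python. This is only a sketch: the column names (`year`, `rainy_days`, `terminus_change_km`) and all of the numbers are hypothetical placeholders, not the contents of LP_Glacier_ex_data.csv, so substitute the real header names and values from Table 1.

```python
import csv
import io

# Hypothetical stand-in for the CSV file: made-up column names and values
csv_text = """year,rainy_days,terminus_change_km
2001,10,-0.2
2002,14,-0.5
2003,9,0.1
2004,18,-0.9
"""

rows = list(csv.DictReader(io.StringIO(csv_text)))

# Step 2: the variable we think does the influencing goes on x,
# and the variable we think responds goes on y
x = [float(row["rainy_days"]) for row in rows]           # independent
y = [float(row["terminus_change_km"]) for row in rows]   # dependent

print(list(zip(x, y)))
```

To read an actual file instead of the inline text, `io.StringIO(csv_text)` could be replaced with `open("LP_Glacier_ex_data.csv", newline="")`, assuming the file sits in the working directory.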

Step 3. Create a scatter plot of the data and visually identify the strength and direction of the relationship between the variables.


Examine your data - how do the variables appear to be related? Are they positively, negatively, or not correlated? How would you describe the strength of the correlation? Does the correlation appear to be linear or follow some other pattern?


Step 4. Add a linear regression line and r2 value to your plot using Excel.
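For readers who want to see what a spreadsheet trendline is doing behind the scenes, here is a minimal Python sketch of a least-squares fit and its r2 value. The function name `linear_fit` is hypothetical and the data values are made up for illustration; they are not the Jakobshavn Isbræ measurements.

```python
def linear_fit(x, y):
    """Least-squares line y = slope * x + intercept, plus its r2 value."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # r2 = 1 - (scatter around the line) / (total scatter in y)
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return slope, intercept, 1 - ss_res / ss_tot

# Made-up example values (not the module's dataset)
rainy_days = [9, 10, 14, 18, 21]
terminus_change_km = [0.1, -0.2, -0.5, -0.9, -1.0]

slope, intercept, r2 = linear_fit(rainy_days, terminus_change_km)
print(f"y = {slope:.3f}x + {intercept:.3f}, r2 = {r2:.3f}")
```

The printed equation and r2 correspond to the trendline equation and r2 label that graphing software displays on the plot.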


What if I want to calculate the correlation coefficient r instead?
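One possible route, assuming you already have the trendline's slope and r2 value from Step 4: take the square root of r2 and give it the same sign as the slope. This shortcut holds only for a straight-line fit with a single x variable. The function name `r_from_r2` is hypothetical and the input numbers are made up. (In Excel, the built-in CORREL function returns r directly from two columns of data.)

```python
import math

def r_from_r2(r2, slope):
    """Recover Pearson's r from r2 and the sign of the regression slope."""
    # sqrt gives the magnitude of r; copysign attaches the slope's sign
    return math.copysign(math.sqrt(r2), slope)

# Made-up trendline results for illustration
print(r_from_r2(0.81, -0.05))   # downward-sloping line: negative r
print(r_from_r2(0.81, 0.05))    # upward-sloping line: positive r
```

The two calls share the same r2 but give r values of opposite sign, which is why r2 alone cannot tell you the direction of a relationship.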

 


Step 5. Describe what the r2 value means in its geological context.

Now, with your linear regression line and r2 value added, do any of your answers change from above? How do the variables appear to be related? Are they positively, negatively, or not correlated? How would you describe the strength of the correlation? Does the correlation appear to be linear or follow some other pattern?

What does this r2 value mean in the context of your geological data?

What counts as 'strong' vs 'weak' correlation is defined differently in different fields and different studies. For example, in a tightly constrained laboratory experiment examining the effects of slope on glacier velocity, we may expect to have r2 values close to 1, while in a field study of bedrock slope vs. glacier velocity the data sets may have more sources of variation and generally lower r2 values, even if the two variables are correlated. But the general principle is true that values closer to 1 indicate a stronger relationship relative to values closer to 0, for any given dataset.

What can't the r2 value tell me?

As noted above, a high r2 value still does not imply causation: the fact that the number of rainy days explains some variation in glacier position does not prove that changes in the number of rainy days will cause glaciers to advance or retreat. There may be a causal relationship, or there could be another variable (or several variables!) we haven't analyzed yet that causes variation in one or both of our analyzed variables. There are other statistical methods to examine causality; see 'More help' below for some ideas.

It's also important to note that both r and r2 values can be affected by data points that are very different from the rest of the data, i.e., outliers. Trends in variation can also affect these values (e.g., if your data become less linear or more scattered as x increases or decreases). Finally, many datasets exhibit nonlinear relationships, for which a linear analysis would not be appropriate. Therefore, it is important to visually examine your data in a scatter plot to make sure that a linear regression is appropriate and to determine if any values or trends may influence your interpretations of correlation.
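Both pitfalls can be illustrated with a short Python sketch. All of the numbers are made up, and `pearson_r` is a hypothetical helper written out here so the example is self-contained.

```python
import math

def pearson_r(x, y):
    """Pearson's correlation coefficient for two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Made-up data: five points with no trend
x = [1, 2, 3, 4, 5]
y = [5, 3, 6, 3, 5]
r2_without = pearson_r(x, y) ** 2

# The same five points plus one extreme point (an outlier)
r2_with = pearson_r(x + [20], y + [30]) ** 2

# A perfect but nonlinear (U-shaped) relationship: y = x squared
x_u = [-2, -1, 0, 1, 2]
y_u = [xi ** 2 for xi in x_u]
r2_u = pearson_r(x_u, y_u) ** 2

print(round(r2_without, 3), round(r2_with, 3), round(r2_u, 3))
```

In this sketch a single outlier turns an essentially uncorrelated dataset into one with a high r2, while the U-shaped data, in which y is completely determined by x, produces an r2 of zero because the relationship is not linear. A scatter plot would reveal both situations at a glance.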

Where do you use correlation coefficients in Earth science?

  • Environmental science - Determining the relationship between percent impervious surface in a watershed and different water quality metrics.
  • Ecology - Determining the relationship between climate variables and forest biodiversity.
  • Geochemistry - Determining the relationships among different components of magma or igneous rocks.
  • Hydrology - Determining the relationship between permeability and porosity of an aquifer.
  • Geophysics - Determining the relationship between earthquake depth and ground shaking.
  • Atmospheric science - Determining the relationship between cloud cover and precipitation.


Next steps

I am ready to PRACTICE!

If you think you have a handle on the steps above, click on this bar to try practice problems with worked answers.
Or, if you want even more practice, see 'More help' below.

More help (resources for students)

Pages written by Laura C. Reynolds (Worcester State University) and Kristin M. Schild (University of Maine).

