Correlation
Using r2 in the Earth Sciences
This module is undergoing classroom implementation with the Math Your Earth Science Majors Need project. The module is available for public use, but it will likely be revised after classroom testing.
Kama River, Russia
Provenance: Kama River near of Pyskor, Perm Krai Wikimedia Commons Author: Niklitov
Reuse: This item is offered under a Creative Commons Attribution-NonCommercial-ShareAlike license http://creativecommons.org/licenses/by-nc-sa/3.0/ You may reuse this item for non-commercial purposes as long as you provide attribution and offer any derivative works under a similar license.
Introducing correlation
What percent of the variation in sea level rise can be explained by the relationship between greenhouse gases in Earth's atmosphere and sea level? Are changes in land use in a watershed correlated with changes in water quality? How is the maximum clast size of a conglomerate associated with bed thickness? Earth scientists can investigate the strength and directionality of these relationships by calculating the correlation between environmental variables. Correlation is the statistical relationship between two variables. But what does this mean?
Broadly speaking, correlations are broken down into three different categories:
- Positive correlation: Variables that are positively correlated (also known as directly related) will change in the same direction, i.e., as one variable increases, the other variable also increases. In a scatter plot, this creates a positive slope, with the scatter plot points slanting upwards as you go to the right (Fig. 1). For example, carbon dioxide and temperature data measured at Mauna Loa, Hawaii over the past several decades (each data point represents measurements taken at the same time) are positively correlated.
- Negative correlation: Variables that are negatively correlated (also known as inversely related) will change in opposite directions. Increased values of one variable are associated with decreased values of the other variable. In a scatter plot, this creates a negative slope, with the scatter plot points slanting downwards as you go to the right (Fig. 1). For example, some studies show that human population is negatively correlated with ecological species richness measurements.
- No correlation: As one variable increases, the other variable does not tend to either increase or decrease, meaning there is no linear relationship between the variables. In a scatter plot, there would be no discernible slant upwards or downwards (Fig. 1). For example, earthquake and wildfire data do not show a correlation.
Fig. 1. Strong to weak correlation scatter plots
Provenance: Rory McFadden, Carleton College
What is the difference between r and r2?
One of the most commonly calculated correlation coefficients is r, known as Pearson's correlation coefficient. This measures the strength of a linear relationship between two variables, with r-values ranging from -1 (perfect negative correlation) to 1 (perfect positive correlation). There are several types of correlation, some focusing on non-linear relationships (not included here!), but all types are interpreted in the same way: the magnitude of the correlation coefficient represents the strength of the relationship (closer to 1 is strong, closer to 0 is weak) and the sign of the correlation coefficient represents the direction of the relationship (a direct relationship, in which the variables change in the same direction, is positive; an inverse relationship, in which the variables change in opposite directions, is negative).
Fig. 2. Scatter plot of crabs vs clams remaining with regression line
Provenance: Rory McFadden, Carleton College
Another correlation metric is the coefficient of determination, r2, which is calculated by squaring the correlation coefficient r. This calculation again uses the linear relationship between two variables, but values range between 0 (no correlation) and 1 (perfect correlation) and there is no indication of the direction of the relationship. The r2 value is related to linear regression, in which you use a method called least-squares regression to plot a line that best represents the relationship between your variables. The least-squares method finds the line that best fits the data by minimizing the sum of the squared vertical distances between each point and that line.
Once you have your linear regression, the r2 value indicates how much of the variation in your y variable is explained by the least-squares regression on your x variable. In other words, it tells you how successfully your line represents the relationship between the two variables.
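The idea of "variation explained" can be written as a worked equation. This is the standard definition for a least-squares line with an intercept. The symbols are notation introduced here rather than taken from the module: ŷ_i (written \hat{y}_i) is the value the regression line predicts at x_i, and ȳ (written \bar{y}) is the mean of the y values.

```latex
r^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}},
\qquad
SS_{\text{res}} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2,
\qquad
SS_{\text{tot}} = \sum_{i=1}^{n} (y_i - \bar{y})^2
```

If the line passes through every point, the residual sum is 0 and r2 equals 1. If the line predicts no better than the mean of y, the two sums are equal and r2 equals 0.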
When do you use r and r2 values?
You'll want to calculate an r value to test how strongly two variables are related to one another, as well as the direction of their relationship (positive or negative). You'll want to use the r2 value to determine the amount of variation in one variable that can be explained by the linear regression on the other variable. In either case, a scatterplot of your data will help you visualize the relationship between the two variables.
In Earth Sciences, it is common practice for both r and r2 to be termed 'correlation coefficients', with r and r2 used interchangeably as metrics to describe the relationship between two environmental variables. However, mathematically, r and r2 are two different calculations and describe slightly different aspects of the relationship between the two variables. In this module, we will cover both the correlation coefficient (r) and the coefficient of determination (r2).
How do I calculate r2?
Most of the steps involved in calculating (and visualizing!) r2 can be done using Excel (or similar graphing software), so let's run through this process using a real-world Earth science example.
Fig. 4. Satellite image of Jakobshavn Isbræ on 07/07/2001, with the terminus position between 1851 and 2006 noted
Provenance: The base image came from: NASA/Goddard Space Flight Center Scientific Visualization Studio Historic calving front locations courtesy of Anker Weidick and Ole Bennike
Glaciers around the world are changing dramatically, with glacier retreat being linked to climate warming, but what is the relationship between glacier change and different environmental variables? To answer this question, we can directly compare the change in position of the glacier terminus (the end of a glacier, also called the snout or toe) to another environmental variable, such as air temperature, ocean temperature, or precipitation. Water plays a critical role in glacier behavior. Not only can it decrease the friction between ice and the bedrock, which can increase glacier speed, but surface water can also pool in surface cracks and crevasses and expand those cracks through the full thickness of the glacier, creating a direct pathway for water to go from the surface to the glacier-bedrock interface. In this example, we will answer the question 'How strong is the relationship (r2 value) between glacier terminus change and the number of rainy days?' using a time series of terminus position from Jakobshavn Isbræ, the fastest moving glacier in Greenland (Figure 4), and a weather station record from the adjacent town of Ilulissat. We will be looking at the change in terminus position (km) across a summer, so a positive number indicates an advance of the glacier terminus, and a negative number indicates a retreat of the glacier terminus.
Table 1. Number of rainy days and glacier terminus change by year (data file: LP_Glacier_ex_data.csv).
Step 1. Identify two variables you want to compare from a dataset to determine whether they are related.
In the following dataset, we have three columns: time (year), number of days with rain each year, and glacier terminus change (km) during the summer each year. We might hypothesize that the more rain that falls, the more the glacier will retreat (terminus change will be negative); therefore, the two variables we want to compare would be the glacier terminus change and the total number of rainy days.
Step 2. Assign environmental variables to x (independent variable) and y (dependent variable).
You want to identify an x and y value for each of the 'n' individual measurements. Each measurement can represent many different things (an individual sample, a time of observation, etc.) depending on the problem you are working on. In this example, each measurement is a year of observation, and we want to see if glacier terminus change is influenced by rainy days, so we will designate the number of rainy days as the independent variable (x), and the glacier terminus change as the dependent variable (y).
The independent variable is always plotted on the x-axis (when graphing) and is "independent", i.e., other variables have no effect on these values (but these values will affect other variables). The dependent variable is always plotted on the y-axis (when graphing) and "depends" on the other variable.
When calculating the r2, it's important to not only identify the two variables you want to compare, but also which is the independent variable and which is the dependent variable.
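Steps 1 and 2 can be sketched in Python as reading a table and assigning one column to x and another to y. The column names (`year`, `rainy_days`, `terminus_change_km`) and every value below are hypothetical stand-ins invented for illustration; they are not the contents of the module's data file.

```python
import csv
import io

# Stand-in for a CSV file. Column names and values are HYPOTHETICAL,
# made up for illustration; they are not the real glacier dataset.
csv_text = """\
year,rainy_days,terminus_change_km
2001,10,-0.2
2002,12,-0.5
2003,15,-0.4
2004,18,-0.9
2005,20,-0.7
2006,25,-1.1
"""

rows = list(csv.DictReader(io.StringIO(csv_text)))

# Independent variable (x): number of rainy days each year.
x = [float(row["rainy_days"]) for row in rows]
# Dependent variable (y): summer terminus change in km (negative = retreat).
y = [float(row["terminus_change_km"]) for row in rows]

n = len(rows)  # number of paired measurements (one per year)
for row, xi, yi in zip(rows, x, y):
    print(row["year"], xi, yi)
```

With a real file, `io.StringIO(csv_text)` would be replaced by an opened file object, and the column names would need to match that file's header.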
Step 3. Create a scatter plot of the data and visually identify the strength and direction of the relationship between the variables.
In Excel, enter the spreadsheet information as you see it in Figure 5. To create a scatter plot of your data, highlight only the cells of your x and y data (you can do this by selecting the upper left cell and dragging your cursor to the lower right). Once highlighted, select Insert (tab at top) → scatter icon (x and y axis with just dots). A scatter plot should now appear!
Fig. 5. Glacier retreat scatter plot
Provenance: Kristin Schild, University of Maine
Examine your data - how do the variables appear to be related? Are they positively, negatively, or not correlated? How would you describe the strength of the correlation? Does the correlation appear to be linear or follow some other pattern?
The variables appear to be linearly correlated because they cluster together in a straight line pattern. They are negatively correlated because the higher the number of rainy days, the more the terminus retreats (the terminus change becomes more negative). The correlation appears to be fairly strong, since there is not a lot of scatter of the points.
Step 4. Add a linear regression line and r2 value to your plot using Excel.
First, make sure that your scatterplot is selected by clicking on the plot. Next, we want to add a linear regression line. To do this, click on Chart design (tab at top) → Add chart element, drop down arrow → trendline → linear. Behind the scenes, Excel uses the least-squares method to calculate the line of best fit to your data.
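For readers curious about what a least-squares fit involves, here is a minimal Python sketch of the textbook closed-form solution for a straight line. It illustrates the method only and is not a description of Excel's internal code. The function names and all data values are hypothetical, invented for this example.

```python
def least_squares_line(x, y):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def sum_squared_residuals(x, y, slope, intercept):
    """Sum of squared vertical distances between the points and a line."""
    return sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))


# Hypothetical values, invented for illustration (not the module's data).
rainy_days = [10, 12, 15, 18, 20, 25]
terminus_change_km = [-0.2, -0.5, -0.4, -0.9, -0.7, -1.1]

slope, intercept = least_squares_line(rainy_days, terminus_change_km)
best = sum_squared_residuals(rainy_days, terminus_change_km, slope, intercept)

# Tilting or shifting the fitted line should never lower the squared error.
tilted = sum_squared_residuals(rainy_days, terminus_change_km, slope + 0.01, intercept)
shifted = sum_squared_residuals(rainy_days, terminus_change_km, slope, intercept + 0.1)

print(f"slope = {slope:.4f}, intercept = {intercept:.4f}")
print(best <= tilted and best <= shifted)
```

The final comparison is a quick sanity check that the fitted line really does minimize the squared error relative to nearby lines.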
Fig. 6. Glacier retreat scatter plot with regression line
Provenance: Kristin Schild, University of Maine
Again, make sure that your scatterplot is selected, then go to Chart design (tab at top) → Add chart element, drop down menu → more trendline options. Now in the right-hand panel, make sure the bar graph icon (tab) is selected, check that linear is still selected, and then check "display equation on chart" and "display r-squared value on chart" (you can also add the linear trendline here and skip the prior step). Now the equation that represents your line, as well as your r2 value, should appear on your scatterplot next to your linear trendline! If you want to learn more about the equation for the linear regression, see the linear regression module.
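The two labels that appear on the chart (the line's equation and the r2 value) can be reproduced with a short Python sketch. It assumes an ordinary least-squares line with an intercept; the function name and the data values are hypothetical, made up for illustration.

```python
def fit_and_r_squared(x, y):
    """Fit a least-squares line and return (slope, intercept, r_squared)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Variation left over after the fit vs. total variation in y.
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return slope, intercept, 1 - ss_res / ss_tot


# Hypothetical values, invented for illustration (not the module's data).
rainy_days = [10, 12, 15, 18, 20, 25]
terminus_change_km = [-0.2, -0.5, -0.4, -0.9, -0.7, -1.1]

slope, intercept, r2 = fit_and_r_squared(rainy_days, terminus_change_km)

# Text similar to the equation and r-squared labels shown on a chart.
print(f"y = {slope:.4f}x {intercept:+.4f}")
print(f"R^2 = {r2:.4f}")
```

Because these values are invented, the printed r2 will not match the 0.41 obtained from the module's dataset.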
Fig. 7. Glacier retreat scatter plot with r^2 value
Provenance: Kristin Schild, University of Maine
What if I want to calculate the correlation coefficient r instead?
The correlation coefficient, r, can be used to describe the strength and direction of correlation between two variables and can supplement your visual estimation of these properties from looking at the scatter plot. You can calculate this value in several different ways. One way would be by hand, using the equation listed below.
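$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}$$
Here x̄ and ȳ (written \bar{x} and \bar{y}) are the means of the x and y values, and each sum runs over all n paired measurements.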
However, Excel can help us again.
To calculate the correlation coefficient in Excel, select an empty cell and type in =CORREL( . In the parentheses, highlight your x values, and then type a comma, and then highlight your y values. Close parentheses, hit enter and you should see your r value. Check to make sure it makes sense: it should be between -1 and 1 and should match the strength and direction you estimated visually.
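The by-hand equation for r can also be written out step by step in Python, which makes each term of the formula visible. This is a minimal sketch; the function name `pearson_r` and the data values are hypothetical, invented for illustration.

```python
import math


def pearson_r(x, y):
    """Pearson's correlation coefficient, following the by-hand formula."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n

    # Numerator: sum of products of the deviations from each mean.
    numerator = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

    # Denominator: square root of the product of the summed squared deviations.
    sum_sq_x = sum((xi - mean_x) ** 2 for xi in x)
    sum_sq_y = sum((yi - mean_y) ** 2 for yi in y)
    denominator = math.sqrt(sum_sq_x * sum_sq_y)

    return numerator / denominator


# Hypothetical values, invented for illustration (not the module's data).
rainy_days = [10, 12, 15, 18, 20, 25]
terminus_change_km = [-0.2, -0.5, -0.4, -0.9, -0.7, -1.1]

r = pearson_r(rainy_days, terminus_change_km)
print(f"r = {r:.3f}")
print(f"r squared = {r ** 2:.3f}")
```

The same sanity check applies here as in Excel: the result should fall between -1 and 1, and its sign should match the slant of the scatter plot.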
Fig. 8. Screenshot of Excel data table for correlation
Provenance: Kristin Schild, University of Maine
Step 5. Describe what the r2 value means in its geological context.
Now, with your linear regression line and r2 value added, do any of your answers change from above? How do the variables appear to be related? Are they positively, negatively, or not correlated? How would you describe the strength of the correlation? Does the correlation appear to be linear or follow some other pattern?
Hopefully all of your answers from above still hold true: the variables are negatively correlated, meaning that as the number of rainy days increases, the glacier terminus retreats more (the terminus change becomes more negative). Since the r2 value is 0.41, this is a moderately strong correlation, especially for field data, and is supported by the data points falling relatively close to the trendline. As the trendline is linear and the data points fall along the line, this relationship appears to be linear.
What does this r2 value mean in the context of your geological data?
In the previous step, did you notice that the r2 value is positive, even though the relationship is negative? This is because r2 does not indicate the direction of the relationship (as can be determined by the r-value), but instead tells us how much of the variability in your y values can be explained by the regression on your x values. So, in this example, 0.41 (out of 1.0, or 41%) of the variation in glacier terminus change can be explained by the regression of glacier terminus change on rainy days. Therefore, if we know that glacier terminus change is correlated with rainy days, can we say that increasing rainy days will cause glacier retreat? No! We are looking at a very small data set, and both variables are influenced by other factors, as both are part of a larger system. For example, how much a glacier changes is influenced by factors such as the thickness of the ice, the shape of the underlying bedrock, and the temperature of the ocean, and precipitation is influenced by temperature and atmospheric patterns. However, we can say that this relationship suggests that rain may contribute to changes in glacier position. Correlation does not equal causation.
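As a worked piece of arithmetic, the reported r2 value can be converted back to an approximate r value. Squaring removes the sign, so the sign has to be restored from the direction of the trendline. The result is rounded to two decimal places.

```latex
|r| = \sqrt{r^2} = \sqrt{0.41} \approx 0.64
\quad\Longrightarrow\quad
r \approx -0.64 \ \text{(negative, because the trendline slopes downward)}
```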
What counts as 'strong' vs 'weak' correlation is defined differently in different fields and different studies. For example, in a tightly constrained laboratory experiment examining the effects of slope on glacier velocity, we may expect to have r2 values close to 1, while in a field study of bedrock slope vs. glacier velocity the data sets may have more sources of variation and generally lower r2 values, even if the two variables are correlated. But the general principle is true that values closer to 1 indicate a stronger relationship relative to values closer to 0, for any given dataset.
What can't the r2 value tell me?
As noted above, a high r2 value still does not imply causation: just because the number of rainy days explains some variation in glacier position, it does not prove that changes in the abundance of rainy days will cause glaciers to advance or retreat. There may be a causal relationship, or there could be another variable (or several variables!) we haven't analyzed yet that causes variation in one or both of our analyzed variables. There are other statistical methods to examine causality; see 'More help' below for some ideas.
It's also important to note that both r and r2 values can be affected by data points that are very different from the rest of the data, i.e., outliers. Trends in variation can also affect these values (e.g., if your data become less linear or more scattered as x increases or decreases). Finally, many datasets exhibit nonlinear relationships, and a linear analysis would not be appropriate. Therefore, it is important to visually examine your data in a scatter plot to make sure that a linear regression is appropriate and to determine if any values or trends may influence your interpretations of correlation.
Where do you use correlation coefficients in Earth science?
- Environmental science - Determining the relationship between percent impervious surface in a watershed and different water quality metrics.
- Ecology - Determining the relationship between climate variables and forest biodiversity.
- Geochemistry - Determining the relationships among different components of magma or igneous rocks.
- Hydrology - Determining the relationship between permeability and porosity of an aquifer.
- Geophysics - Determining the relationship between earthquake depth and ground shaking.
- Atmospheric science - Determining the relationship between cloud cover and precipitation.
Next steps
I am ready to PRACTICE!
If you think you have a handle on the steps above, click on this bar to try practice problems with worked answers.
Or, if you want even more practice, see 'More help' below.
More help (resources for students)
Pages written by Laura C. Reynolds (Worcester State University) and Kristin M. Schild (University of Maine).