Initial Publication Date: August 16, 2024
Guiding students through Correlation
An instructor's guide to Correlation Coefficients
Laura Reynolds (Worcester State University)
Kristin M. Schild (University of Maine)
What should students get out of this module?
After completing this module, a student should be able to:
- Identify qualitative strength and direction of correlations given a scatter plot of data
- Calculate an r2 value (coefficient of determination) from a given dataset using Excel
- Describe what the r2 value means in its geological context
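For reference, the two quantities named in these outcomes are commonly defined as below. This is a sketch using standard textbook definitions rather than text from this module; the symbols x_i, y_i, n, the means \bar{x} and \bar{y}, and the fitted value \hat{y}_i from the least-squares line are our own notation.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \; \sum_{i=1}^{n} (y_i - \bar{y})^2}},
\qquad
r^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
```

For a straight-line least-squares fit with an intercept, the second expression equals the square of the first, which is why r2 carries the strength of the relationship but not its direction.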
Why are these math skills challenging to incorporate into courses?
Some common challenges for students related to correlation, r values, and r2 values in Earth Science courses:
- Loose colloquial wording around 'correlation'. Students often do not know that 'correlation' has a mathematical meaning, and think of it as the same as 'relationship'.
- r and r2 are used interchangeably in the Earth Sciences. Because the field itself is not unified on how to represent relationships between variables, it can appear that the correlation coefficient r and the coefficient of determination r2 convey the same information about the relationship between variables. Additionally, Excel displays the r2 value with a capital letter (R2), so r2 and R2 are often used interchangeably in the literature.
- Plotting independent vs. dependent variables. Students often have trouble identifying which variable is the independent variable and which is the dependent variable, especially in time series datasets (three variables: time plus the two variables of interest). In time series datasets, students will often plot the independent variable vs. time and the dependent variable vs. time on the same plot, and then do wiggle matching to show 'correlation', instead of plotting the variables directly against one another, performing a linear regression, and calculating the r2 value.
- Misinterpreting what the r or r2 value tells you about correlation between variables. Students may consider the sign and magnitude of the r value together, assigning 'no correlation' to -1 (the lowest value) and 'complete correlation' to 1 (the highest value) instead of separating the sign (direction of relationship) from the magnitude (strength of relationship). Students may also interpret positive and negative correlations as good vs. bad, rather than as the relative direction of change in the variables. Additionally, students may misinterpret r or r2 as being equivalent to the slope of a linear regression, especially because Excel displays the r2 value next to the linear trendline equation on a scatter plot and, like the slope, it is a real number.
- Misunderstanding the interpretation of the r2 value with regard to causation. After computing r2, students can conflate correlation and causation. It can be difficult for students to understand that r2 represents the fraction of variance in the y variable explained by the regression of y on x (CIT textbook) and does NOT represent how much variation in y is caused by variation in x.
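Several of the misconceptions above can be checked numerically. The following is a minimal sketch in Python rather than Excel, using made-up illustrative numbers (not real data) and helper names of our own choosing (pearson_r, fit_line, r_squared). It shows that flipping the direction of a relationship changes only the sign of r, that changing the units of y changes the slope but not r, and that r2 is the fraction of variance in y explained by the fitted line.

```python
import math


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def pearson_r(x, y):
    """Pearson correlation coefficient r between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fit_line(x, y):
    """Least-squares slope and intercept for the regression of y on x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx


def r_squared(x, y):
    """Fraction of variance in y explained by the line: 1 - SS_res / SS_tot."""
    slope, intercept = fit_line(x, y)
    my = mean(y)
    ss_res = sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return 1.0 - ss_res / ss_tot


# Hypothetical data: x in arbitrary units, y in kilometers (values invented).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y_km = [2.0, 4.5, 5.5, 8.5, 9.0, 12.5]

# 1) Sign is direction, magnitude is strength: negating y flips only the sign.
r_pos = pearson_r(x, y_km)
r_neg = pearson_r(x, [-b for b in y_km])

# 2) r is not the slope: converting y to meters scales the slope by 1000,
#    but leaves r unchanged.
y_m = [1000.0 * b for b in y_km]
slope_km, _ = fit_line(x, y_km)
slope_m, _ = fit_line(x, y_m)
r_m = pearson_r(x, y_m)

# 3) r2 is the explained fraction of variance and equals r squared here.
r2 = r_squared(x, y_km)

print("r (positive trend):", r_pos)
print("r (negative trend):", r_neg)
print("slope in km vs m:", slope_km, slope_m)
print("r in km vs m:", r_pos, r_m)
print("r2 and r squared:", r2, r_pos ** 2)
```

Note that nothing in this calculation says anything about causation: the same r2 would result whether x drives y, y drives x, or both respond to a third variable.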
What don't we include in this page?
- While we provide a fairly detailed explanation of how to calculate r and r2 values in Excel, both through formulas and within a scatter plot, we do not provide detailed step-by-step explanations of how linear regressions are performed or how to interpret the resulting linear regression equation (you can find detailed steps in this linear regression module).
- We give examples of calculations to use in Excel, but do not provide detailed instructions on basic Excel use, such as how to highlight data, use formulas, etc. These types of introductory Excel skills can be found in this intro to Excel video.
- While we caution students against conflating correlation and causation, we don't go into detail about other methods to test causation among variables.
- We focus on linear correlation in this module; no non-linear regression examples are discussed. We also don't discuss homoscedastic and heteroscedastic relationships, which can affect the interpretation of r and r2 values.
- We give examples only in Excel; however, other programs (R, Python, Google Sheets) could be used for the same purposes.
Instructor resources
Support for teaching this quantitative skill
- Project EDDIE (Environmental Data-Driven Inquiry and Exploration) statistical vignettes provide PowerPoint slides for instructors, covering topics such as Linear Regression and the Correlation Coefficient.
- Example of an interactive lecture that works through qualitative analysis of correlation between variables, and several examples of Pearson's r calculations: The Evolution of Pearson's Correlation Coefficient/Exploring Relationships between Two Quantitative Variables
Examples of activities that use this quantitative skill
- In this Find the Moho activity, students plot distance vs travel time from seismic data in Excel, and then calculate and interpret r-squared values for linear regressions: https://serc.carleton.edu/NAGTWorkshops/geophysics/activities/18914.html
- In the Wind and Ocean Ecosystems (EDDIE) module, students are asked to create scatter plots, run linear regressions, and calculate and interpret correlation coefficient values in Excel: https://serc.carleton.edu/eddie/teaching_materials/modules/wind_ocean_ecosystems.html
- In the Ships that pass in the night: Competition in the fossil record activity (advanced version), students are asked to create scatter plots of fossil diversity through time, as well as brachiopod vs bivalve diversity, make linear regressions, and calculate and interpret r and r2 values: https://serc.carleton.edu/teachearth/activities/204123.html