Linear Regression - Practice Problems
Solving Earth science problems with regression
This module is available for public use, but it is undergoing revision after classroom implementation with the Math Your Earth Science Majors Need project.
Working with ecological data
Ecology explores relationships between organisms and other living (biotic) things or nonliving (abiotic) components in their environment. Certain factors may impact the abundance, distribution, or physiology of organisms, including abiotic factors such as temperature, moisture, or sunlight, and biotic factors such as the presence of predators or competitors. Linear regressions can be used to quantify these relationships and predict organism responses to various levels of the abiotic or biotic factor.
green crab
Provenance: Photo by Oregon Sea Grant https://www.flickr.com/photos/oregonseagrant/8412563286
Reuse: This item is offered under a Creative Commons Attribution-NonCommercial-ShareAlike license http://creativecommons.org/licenses/by-nc-sa/3.0/ You may reuse this item for non-commercial purposes as long as you provide attribution and offer any derivative works under a similar license.
Problem 1: Ecologists tested the grazing pressure of green crabs on clams. They constructed `1 m^2` cages, planted 300 clams in each cage, and placed a chosen number of crabs in each one. Two days later they counted the remaining clams and recorded the results in the table below.
Number of Crabs | Number of Clams Remaining
2 | 137
4 | 70
2 | 184
5 | 0
4 | 35
0 | 297
3 | 122
5 | 1
1 | 253
3 | 150
Problem 1A: First, perform a linear regression using the step-by-step instructions for calculating `m` (slope) and `b` (intercept) of the regression line. What is the full equation for the regression line that you calculated?
Step 1. Decide which variable is the independent variable and which is the dependent variable.
In this example, the number of crabs is the independent variable (`x`), and the number of clams remaining is the dependent variable (`y`). This is because the researchers chose the number of crabs to place in each plot, and the resulting number of clams depends on how many crabs are present.
Step 2. Calculate `x^2` for every value of `x`, and `y^2` for every value of `y`. For each pair of values, calculate `x xx y`.
You should end up with ten values (the same as the number of values of `x`) for `x^2`, ten values for `y^2`, and ten values for `x xx y`.
`x^2` | `y^2` | `x xx y`
4 | 18,769 | 274
16 | 4,900 | 280
4 | 33,856 | 368
25 | 0 | 0
16 | 1,225 | 140
0 | 88,209 | 0
9 | 14,884 | 366
25 | 1 | 5
1 | 64,009 | 253
9 | 22,500 | 450
Step 3. Calculate `Sigmax`, `Sigmay`, `Sigma(x^2)`, `Sigma(y^2)`, and `Sigma(x xx y)`.
`Sigmax = 2 + 4 + 2 + 5 + 4 + 0 + 3 + 5 + 1 + 3 = 29`
`Sigmay = 137 + 70 + 184 + 0 + 35 + 297 + 122 + 1 + 253 + 150 = 1249`
`Sigma(x^2) = 4 + 16 + 4 + 25 + 16 + 0 + 9 + 25 + 1 + 9 = 109`
`Sigma(y^2) = 18769 + 4900 + 33856 + 0 + 1225 + 88209 + 14884 + 1 + 64009 + 22500 = 248353`
`Sigma(x xx y) = 274 + 280 + 368 + 0 + 140 + 0 + 366 + 5 + 253 + 450 = 2136`
`Sigmax` | `Sigmay` | `Sigma(x^2)` | `Sigma(y^2)` | `Sigma(x xx y)`
29 | 1249 | 109 | 248,353 | 2136
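The five sums from Step 3 can also be checked with a few lines of code. The following is a minimal Python sketch, not part of the original exercise; the function name `regression_sums` and the variable names are illustrative assumptions.

```python
# Data from the Problem 1 table: crabs per cage (x) and clams remaining (y).
crabs = [2, 4, 2, 5, 4, 0, 3, 5, 1, 3]
clams = [137, 70, 184, 0, 35, 297, 122, 1, 253, 150]


def regression_sums(x, y):
    """Return the five sums used in the hand calculation (hypothetical helper)."""
    return {
        "sum_x": sum(x),
        "sum_y": sum(y),
        "sum_x2": sum(xi ** 2 for xi in x),
        "sum_y2": sum(yi ** 2 for yi in y),
        "sum_xy": sum(xi * yi for xi, yi in zip(x, y)),
    }


sums = regression_sums(crabs, clams)
for name, value in sums.items():
    print(name, value)
```

Each printed value should agree with the corresponding entry in the summary table above.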
Step 4. Familiarize yourself with the equation of a line: `y=mx+b`.
Step 5. Calculate m (slope): `(n(Sigma(x xx y)) – (Sigmax)(Sigmay))/(n(Sigma(x^2)) – (Sigmax)^2)` .
To simplify this calculation, let's start with the numerator:
`(n(Sigma(x xx y)) – (Sigmax)(Sigmay))`
Remember n! In this example, `n = 10` because we have 10 values of `x` and 10 values of `y` .
numerator = `(10(2136) – (29)(1249))`
numerator = `(21360 – 36221)`
numerator = `-14861`
Now we can calculate the bottom half of the equation (this is called the denominator): `(n(Sigma(x^2)) – (Sigmax)^2)`
denominator = `(10(109) – (29)^2)`
denominator = `(1090 – 841)`
denominator = `249`
To finish the calculation, divide the numerator by the denominator:
`m = -14861/249 = -59.683`
Here, we have rounded the actual answer from a long decimal string (-59.68273092...) to -59.683.
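As a cross-check on the arithmetic in Step 5, the slope can be computed directly from the sums. This is a short Python sketch; the variable names are assumptions chosen to mirror the symbols used above.

```python
# Sums from Step 3 of Problem 1.
n = 10
sum_x, sum_y = 29, 1249
sum_x2, sum_xy = 109, 2136

# Top and bottom halves of the slope formula.
numerator = n * sum_xy - sum_x * sum_y    # 21360 - 36221
denominator = n * sum_x2 - sum_x ** 2     # 1090 - 841

m = numerator / denominator
print("numerator:", numerator)
print("denominator:", denominator)
print("slope m:", round(m, 3))
```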
Step 6. Calculate b (intercept): `(Sigmay – m(Sigmax))/(n)`
Again, we can start with the numerator (top half of the equation):
`(Sigmay – m(Sigmax))`
numerator = `(1249 – (-59.683)(29))`
numerator = `(1249 – (-1730.807))`
numerator = `2979.807`
Next, we would calculate the denominator (bottom half of the equation), but this is just `n` !
denominator = `10`
To finish the calculation, divide the numerator by the denominator:
`b = 2979.807/10 = 297.98`
Again, we have rounded the actual answer from a longer decimal (297.9807) to 297.98.
Step 7. Replace 'b' and 'm' in line equation: `y=mx+b`.
From the previous steps, we know:
`m = -59.683`
`b = 297.98`
So the final equation of the linear regression is: `y = -59.683x + 297.98`
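Steps 6 and 7 can be sketched the same way. The Python below is an illustration only: it reuses the rounded slope from Step 5, and the helper name `predict_clams` is an assumption introduced here to show how the finished equation would be used for prediction.

```python
# Sums from Step 3 and the rounded slope from Step 5.
n = 10
sum_x, sum_y = 29, 1249
m = -59.683

# Intercept: (sum of y minus slope times sum of x) divided by n.
b = (sum_y - m * sum_x) / n
b_rounded = round(b, 2)


def predict_clams(crabs):
    """Clams remaining predicted by the regression line (hypothetical helper)."""
    return m * crabs + b_rounded


print("intercept b:", b_rounded)
print("predicted clams with 3 crabs:", round(predict_clams(3), 1))
```

A prediction like this is only meaningful within the range of the data (0 to 5 crabs per cage).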
Problem 1B: Next, run the linear regression statistics using Excel's Data Analysis Toolpak. Do the values given for `m` and `b` match the values that you calculated in Problem 1A?
Step 1. Enter your data into an Excel spreadsheet in two columns.
Optional: to help visualize the relationship between the variables, create a scatterplot using the Excel graph feature and insert a trendline. A trendline shows you a best fit line for the data on the chart.
Provenance: Laura Treible, Savannah State University
Reuse: This item is in the public domain and may be reused freely without restriction.
Create the scatterplot: Select both columns of your data. In the Excel menu, click on "Insert" and then the chart feature and select the scatter option (with no connecting lines). In the resulting chart, click the chart area to display the Chart Design and Format tabs in the top menu. Update the chart title, axis labels, and more.
Add the trendline: Select the chart area again. Click on "Chart Design" then "Add Chart Element." Select Trendline --> Linear.
Step 2. Use the Regression tool on your data.
Provenance: Laura Treible, Savannah State University
Reuse: This item is in the public domain and may be reused freely without restriction.
Open the Data Analysis tool (on the Data tab in recent versions of Excel). In the pop-up menu, select "Regression." Click in the box "Input Y range," then highlight the `y` values in your spreadsheet. Click in the box "Input X range," then highlight the `x` values in your spreadsheet. Click the "Output Range" circle, then click in the box to the right of the words "Output Range." In your spreadsheet, click in the upper left corner of a blank area where you want to place the regression output. Finish by clicking "OK."
Step 3. Determine the line equation for your data. Write it in the form of `y=mx+b` and look for the `R^2` value.
Provenance: Laura Treible, Savannah State University
Reuse: This item is in the public domain and may be reused freely without restriction.
The Summary Output will display three tables of results: Regression Statistics, ANOVA, and a third unlabeled table. The slope, `m`, is in the third table, in the row labeled X Variable 1 and the Coefficients column. The `y` intercept is the value in the Intercept row under Coefficients. The `R^2` value is the second line in the Regression Statistics table (labeled R Square).
- The line equation is `y = -59.683x + 297.98`.
- `R^2 = 0.96`, which indicates a relatively strong correlation. Note that the relationship is negative (we can see this by graphing the data or from the negative slope), but since `R^2` is a squared value, it is never negative.
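The `R^2` value reported by Excel can be reproduced from the same sums used in Problem 1A. The sketch below, in Python, assumes the standard Pearson correlation formula; for a regression with one `x` variable, `R^2` is the square of that correlation coefficient `r`. The variable names are illustrative.

```python
import math

# Sums from Step 3 of Problem 1A.
n = 10
sum_x, sum_y = 29, 1249
sum_x2, sum_y2, sum_xy = 109, 248353, 2136

# Pearson correlation coefficient r, computed from the sums.
numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = numerator / denominator

# Squaring r removes the sign, so R^2 alone cannot show the direction.
r_squared = r ** 2

print("r:", round(r, 2))
print("R^2:", round(r_squared, 2))
```

The sign of `r` carries the direction of the relationship, which is why the slope (or a graph) is needed alongside `R^2`.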
Step 4. Check to see if this line equation matches what you calculated by the least squares method above. Note that the values of `m` and `b` may differ slightly due to different rounding within the calculations, but they should be close. Do they match? Hurray! You now know how to do a linear regression with two different methods.
Determining a standard (calibration) curve
Earth scientists often try to measure concentrations of chemicals in waters, soils, sediments, rocks, or biological things. Often what is measured must be compared to known samples, called standards. This is accomplished by creating a standard (calibration) curve. Standard curves are not usually curves! They are graphs of data points of measurements from an instrument (on the y axis) based on known concentrations of chemicals in various samples. The concentrations and measurements will ideally have a linear relationship, which can be determined by a linear regression. The line equation can then be used to figure out the concentrations of unknown samples that are analyzed by your instrument.
Problem 2A: You want to analyze some stream water samples for copper to see if an active mine is affecting the water quality. You can use an instrument called an atomic absorption spectrometer (AAS) with a light wavelength of 420 nm for this. You create 4 standards with known amounts of copper in them, plus a blank containing no copper. The AAS then measures how much light is absorbed by each standard. Beer's Law states that the amount of light absorbed (absorbance) is linearly related to the concentration of copper in each standard and sample.
Copper in solution is blue. Higher concentrations of copper make darker blue solutions.
Provenance: Leiem, CC. Creativecommons, via Wikimedia Commons
Reuse: This item is offered under a Creative Commons Attribution-NonCommercial-ShareAlike license http://creativecommons.org/licenses/by-nc-sa/3.0/ You may reuse this item for non-commercial purposes as long as you provide attribution and offer any derivative works under a similar license.
The table below shows the data from your AAS. Analyze the data with a linear regression to determine the line equation for your standard curve.
Concentration (mg/L) | Absorbance
0 | 0.003
0.2 | 0.033
0.4 | 0.065
0.6 | 0.098
0.8 | 0.125
Step 1. Copy the data into an Excel spreadsheet in two columns, then decide which variable is the independent variable and which is the dependent variable. Then insert a chart (graph) and add a trendline.
With standard curves, concentration is the independent variable (`x`), and the absorbance is the dependent variable (`y`). This is because the standards were created with known amounts of copper, and the absorbance measured by the spectrophotometer depends on how much copper is in the solution.
Your graph should look similar to the graph on the right.
Steps 2 and 3. The Data Analysis Toolpak should already be activated and ready for you to use the Regression analysis. Be sure to select the Absorbance column data for the box "Input Y range." Then select the data in the concentration column for the box "Input X range." Click the "Output Range" circle, then click in the box to the right of the words "output range." In your spreadsheet, click in the upper left corner of a blank area where you want to place the regression output. Finish by clicking "Ok."
Your Output Summary should look like the output to the right. The important values are highlighted.
Step 4. Extract the `m` and `b` values from the regression analysis to get your line equation in the form of `y = mx + b`.
Your line equation should be `y = 0.1545x + 0.003`.
Problem 2B: You analyzed two water samples for copper with your spectrophotometer. The absorbance for the Rabbit Run stream water is 0.114 and the absorbance for the Mill Creek stream water is 0.078. What are the copper concentrations in these samples?
You need to use the line equation you got in Problem 2A. You will solve for `x` by substituting the given absorbance values for `y`.
For Rabbit Run: `y = 0.1545x + 0.003`. So `0.114 = 0.1545x + 0.003`, so `x = (0.114-0.003)/0.1545`, resulting in `x` = 0.72 mg copper/L.
For Mill Creek: `y = 0.1545x + 0.003`. So `0.078 = 0.1545x + 0.003`, so `x = (0.078-0.003)/0.1545`, resulting in `x` = 0.49 mg copper/L.
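The whole standard-curve workflow, fitting the line to the standards and then solving for the unknown concentrations, can be sketched in Python. This is an illustration under stated assumptions: the helper names `fit_line` and `concentration_from_absorbance` are invented here, and the coefficients are computed from the table of standards rather than typed in.

```python
def fit_line(x, y):
    """Least-squares slope and intercept from the sums formulas (hypothetical helper)."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_x2 = sum(xi ** 2 for xi in x)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b = (sum_y - m * sum_x) / n
    return m, b


# Standards from the Problem 2A table.
concentration = [0, 0.2, 0.4, 0.6, 0.8]        # mg/L (x)
absorbance = [0.003, 0.033, 0.065, 0.098, 0.125]  # instrument reading (y)

m, b = fit_line(concentration, absorbance)


def concentration_from_absorbance(a):
    """Invert y = m*x + b to recover concentration from absorbance."""
    return (a - b) / m


# Unknown samples from Problem 2B.
samples = {"Rabbit Run": 0.114, "Mill Creek": 0.078}
results = {name: concentration_from_absorbance(a) for name, a in samples.items()}

print("slope:", round(m, 4), "intercept:", round(b, 4))
for name, value in results.items():
    print(name, round(value, 2), "mg copper/L")
```

Both sample absorbances fall inside the range covered by the standards, which is the situation a standard curve is designed for.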
Geochemical variation diagrams
Provenance: Zellmer, G. F., et al. "Geochemical evolution of the Soufriere Hills volcano, Montserrat, Lesser Antilles volcanic arc." Journal of Petrology 44.8 (2003): 1349-1374.
Reuse: This item is offered under a Creative Commons Attribution-NonCommercial-ShareAlike license http://creativecommons.org/licenses/by-nc-sa/3.0/ You may reuse this item for non-commercial purposes as long as you provide attribution and offer any derivative works under a similar license.
Harker diagrams are geochemical variation diagrams commonly used in Earth science to plot the chemical constituents of a rock against its silica `(SiO_(2))` content. Some of these relationships are linear and can be represented with linear regression.
Problem 3: The figure to the right shows some examples of Harker diagrams from Montserrat, Lesser Antilles volcanic arc. You are given some of the data for `CaO` and `SiO_(2)` in the table below. Units are weight percent (wt %).
SiO2 (wt %) | CaO (wt %)
52 | 8
69 | 2
56 | 7
53 | 8
62 | 5
74 | 1
60 | 5
53 | 9
47 | 12
55 | 9
Problem 3A: First, perform a linear regression using the step-by-step instructions for calculating `m` (slope) and `b` (intercept) of the regression line. What is the full equation for the regression line that you calculated?
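Step 1. Decide which variable is the independent variable and which is the dependent variable.
In this example, the wt % SiO2 is the independent variable (`x`), and the wt % CaO is the dependent variable (`y`). This follows the Harker diagram convention of plotting silica on the horizontal axis, and it is the assignment used in the calculations below.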
Step 2. Calculate `x^2` for every value of `x`, and `y^2` for every value of `y`. For each pair of values, calculate `x xx y`.
You should end up with ten values (the same as the number of values of `x`) for `x^2`, ten values for `y^2`, and ten values for `x xx y`.
`x^2` | `y^2` | `x xx y`
2704 | 64 | 416
4761 | 4 | 138
3136 | 49 | 392
2809 | 64 | 424
3844 | 25 | 310
5476 | 1 | 74
3600 | 25 | 300
2809 | 81 | 477
2209 | 144 | 564
3025 | 81 | 495
Step 3. Calculate `Sigmax`, `Sigmay`, `Sigma(x^2)`, `Sigma(y^2)`, and `Sigma(x xx y)`.
`Sigmax = 52 + 69 + 56 + 53 + 62 + 74 + 60 + 53 + 47 + 55 = 581`
`Sigmay = 8 + 2 + 7 + 8 + 5 + 1 + 5 + 9 + 12 + 9 = 66`
`Sigma(x^2) = 2704 + 4761 + 3136 + 2809 + 3844 + 5476 + 3600 + 2809 + 2209 + 3025 = 34373`
`Sigma(y^2) = 64 + 4 + 49 + 64 + 25 + 1 + 25 + 81 + 144 + 81 = 538`
`Sigma(x xx y) = 416 + 138 + 392 + 424 + 310 + 74 + 300 + 477 + 564 + 495 = 3590`
`Sigmax` | `Sigmay` | `Sigma(x^2)` | `Sigma(y^2)` | `Sigma(x xx y)`
581 | 66 | 34373 | 538 | 3590
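The per-row values and the column totals for Problem 3 can be rebuilt in one pass. The Python below is a minimal sketch with illustrative variable names; it takes SiO2 as `x` and CaO as `y`, which is the assignment implied by the totals in the table above.

```python
# Data from the Problem 3 table (wt %).
sio2 = [52, 69, 56, 53, 62, 74, 60, 53, 47, 55]  # x
cao = [8, 2, 7, 8, 5, 1, 5, 9, 12, 9]            # y

# One row per sample: x, y, x^2, y^2, x*y.
rows = [(x, y, x ** 2, y ** 2, x * y) for x, y in zip(sio2, cao)]

# Column totals, in the same order as the row entries.
totals = tuple(sum(column) for column in zip(*rows))

print("x", "y", "x^2", "y^2", "x*y")
for row in rows:
    print(*row)
print("totals:", *totals)
```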
Step 4. Familiarize yourself with the equation of a line: `y=mx+b`.
Step 5. Calculate m (slope): `(n(Sigma(x xx y)) – (Sigmax)(Sigmay))/(n(Sigma(x^2)) – (Sigmax)^2)` .
To simplify this calculation, let's start with the numerator:
`(n(Sigma(x xx y)) – (Sigmax)(Sigmay))`
Remember n! In this example, `n = 10` because we have 10 values of `x` and 10 values of `y` .
numerator = `(10(3590) – (581)(66))`
numerator = `(35900 – 38346)`
numerator = `-2446`
Now we can calculate the bottom half of the equation (this is called the denominator): `(n(Sigma(x^2)) – (Sigmax)^2)`
denominator = `(10(34373) – (581)^2)`
denominator = `(343730 – 337561)`
denominator = `6169`
To finish the calculation, divide the numerator by the denominator:
`m = -2446/6169 = -0.396`
Here, we have rounded the actual answer from a long decimal string (-0.39649862...) to -0.396.
Step 6. Calculate b (intercept): `(Sigmay – m(Sigmax))/(n)`
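Again, we can start with the numerator (top half of the equation):
`(Sigmay – m(Sigmax))`
Here we use the unrounded slope (-0.39649862) so that the rounding in Step 5 does not carry over into the intercept.
numerator = `(66 – (-0.39649862)(581))`
numerator = `(66 – (-230.366))`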
numerator = `296.366`
Next, we would calculate the denominator (bottom half of the equation), but this is just `n` !
denominator = `10`
To finish the calculation, divide the numerator by the denominator:
`b = 296.366/10 = 29.64`
Again, we have rounded the actual answer from a longer decimal (29.6365699) to 29.64.
Step 7. Replace 'b' and 'm' in line equation: `y=mx+b`.
From the previous steps, we know:
`m = -0.396`
`b = 29.64`
So the final equation of the linear regression is: `y = -0.396x + 29.64`
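Because `Sigmax` is large in this problem (581), the intercept is sensitive to how many decimal places of the slope are carried forward. The Python sketch below illustrates this by computing the intercept twice, once with the unrounded slope and once with the slope rounded to three decimals first. The variable names are assumptions made for this illustration.

```python
# Sums from Step 3 of Problem 3 (x = SiO2, y = CaO).
n = 10
sum_x, sum_y = 581, 66
sum_x2, sum_xy = 34373, 3590

# Slope from the least-squares formula.
m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)

# Intercept using the full-precision slope.
b_full = (sum_y - m * sum_x) / n

# Intercept if the slope is rounded to three decimals before use.
b_from_rounded_m = (sum_y - round(m, 3) * sum_x) / n

print("slope:", round(m, 3))
print("intercept (unrounded slope):", round(b_full, 2))
print("intercept (rounded slope):", round(b_from_rounded_m, 2))
```

The two intercepts differ in the second decimal place, which is one reason a hand calculation and a spreadsheet result may not match exactly.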
Problem 3B: Next, run the linear regression statistics using Excel's Data Analysis Toolpak. Do the values given for `m` and `b` match the values that you calculated in Problem 3A?
Step 1. Enter your data into an Excel spreadsheet in two columns.
Optional: to help visualize the relationship between the variables, create a scatterplot using the Excel graph feature and insert a trendline. A trendline shows you a best fit line for the data on the chart.
Regression scatter plot for Harker diagrams
Provenance: Rory McFadden, Carleton College
Reuse: This item is offered under a Creative Commons Attribution-NonCommercial-ShareAlike license http://creativecommons.org/licenses/by-nc-sa/3.0/ You may reuse this item for non-commercial purposes as long as you provide attribution and offer any derivative works under a similar license.
Create the scatterplot: Select both columns of your data. In the Excel menu, click on "Insert" and then the chart feature and select the scatter option (with no connecting lines). In the resulting chart, click the chart area to display the Chart Design and Format tabs in the top menu. Update the chart title, axis labels, and more.
Add the trendline: Select the chart area again. Click on "Chart Design" then "Add Chart Element." Select Trendline --> Linear.
Step 2. Use the Regression tool on your data.
Regression pop-up for Harker diagrams
Provenance: Rory McFadden, Carleton College
Reuse: This item is offered under a Creative Commons Attribution-NonCommercial-ShareAlike license http://creativecommons.org/licenses/by-nc-sa/3.0/ You may reuse this item for non-commercial purposes as long as you provide attribution and offer any derivative works under a similar license.
Open the Data Analysis tool (on the Data tab in recent versions of Excel). In the pop-up menu, select "Regression." Click in the box "Input Y range," then highlight the `y` values in your spreadsheet. Click in the box "Input X range," then highlight the `x` values in your spreadsheet. Click the "Output Range" circle, then click in the box to the right of the words "Output Range." In your spreadsheet, click in the upper left corner of a blank area where you want to place the regression output. Finish by clicking "OK."
Step 3. Determine the line equation for your data. Write it in the form of `y=mx+b` and look for the `R^2` value.
Regression output from Harker diagrams
Provenance: Rory McFadden, Carleton College
Reuse: This item is offered under a Creative Commons Attribution-NonCommercial-ShareAlike license http://creativecommons.org/licenses/by-nc-sa/3.0/ You may reuse this item for non-commercial purposes as long as you provide attribution and offer any derivative works under a similar license.
The Summary Output will display three tables of results: Regression Statistics, ANOVA, and a third unlabeled table. The slope, `m`, is in the third table, in the row labeled X Variable 1 and the Coefficients column. The `y` intercept is the value in the Intercept row under Coefficients. The `R^2` value is the second line in the Regression Statistics table (labeled R Square).
- The line equation is `y = -0.396x + 29.64`.
- `R^2 = 0.95`, which indicates a relatively strong correlation. Note that the relationship is negative (we can see this by graphing the data or from the negative slope), but since `R^2` is a squared value, it is never negative.
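Another way to see what `R^2` measures is to compute it from the residuals, the differences between each measured CaO value and the value predicted by the line. The Python sketch below assumes the common definition `R^2 = 1 - SS_res / SS_tot` and fits the line with the mean-centered form of the least-squares formulas, which is algebraically equivalent to the sums form used in Problem 3A. The variable names are illustrative.

```python
# Data from the Problem 3 table (wt %).
sio2 = [52, 69, 56, 53, 62, 74, 60, 53, 47, 55]  # x
cao = [8, 2, 7, 8, 5, 1, 5, 9, 12, 9]            # y

n = len(sio2)
x_mean = sum(sio2) / n
y_mean = sum(cao) / n

# Least-squares slope and intercept (mean-centered form).
m = sum((x - x_mean) * (y - y_mean) for x, y in zip(sio2, cao)) / sum(
    (x - x_mean) ** 2 for x in sio2
)
b = y_mean - m * x_mean

# Residual and total sums of squares.
predicted = [m * x + b for x in sio2]
ss_res = sum((y - p) ** 2 for y, p in zip(cao, predicted))
ss_tot = sum((y - y_mean) ** 2 for y in cao)

r_squared = 1 - ss_res / ss_tot
print("R^2:", round(r_squared, 2))
```

Read this way, `R^2` is the fraction of the variation in CaO that the line accounts for.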
Step 4. Check to see if this line equation matches what you calculated by the least squares method above. Note that the values of `m` and `b` may differ slightly due to different rounding within the calculations, but they should be close.
Changes over time
Glacial retreat in the Herens Valley, Switzerland, leaving behind bare valleys.
Provenance: Luca Bonacina (distributed via imaggeo.egu.eu)
Reuse: This item is offered under a Creative Commons Attribution-NonCommercial-ShareAlike license http://creativecommons.org/licenses/by-nc-sa/3.0/ You may reuse this item for non-commercial purposes as long as you provide attribution and offer any derivative works under a similar license.
Many things change over time: trees get taller, Earth's plates move, water evaporates. But are these changes linear? If a quantity changes by a constant amount in each time period, then it has a linear relationship with time as the independent variable. If it increases consistently over time, that's a positive correlation; if it decreases consistently over time, that's a negative correlation. Linear regression is used to model these changes over time and to help us make predictions about past or future time points.
Problem 4A: When glaciers retreat, they leave behind bare land with little to no soil. Often loose, unconsolidated material called till is left behind, and it slowly becomes soil. About 11,000 years ago, glaciers retreated from Wisconsin, and soil has been forming since then. In the year 2000 in one remote area, the soil thickness was measured to be 32 inches. Scientists used carbon-14 dating to estimate that the soil thickness after 5000 years was 13 inches and increased to 30 inches after 10,000 years. Using these four time points (call the glacial retreat time 0, when no soil had formed yet, and today 11,000 years), determine the line equation that relates time and soil thickness.
Step 1. Enter the data into an Excel spreadsheet in two columns, then insert a chart (graph) and add a trendline.
Your graph should look similar to this:
Steps 2 and 3. The Data Analysis Toolpak should already be activated and ready for you to use the Regression analysis. Select the Soil Depth column data for the box "Input Y range." Then select the data in the time column for the box "Input X range."
Your Output Summary should look like this one. The important values are highlighted:
Step 4. Extract the `m` and `b` values from the regression analysis to get your line equation in the form of `y = mx + b`.
Note that the `y` intercept, `b`, is listed as "Intercept," and the slope `m` is listed as "X Variable 1."
Your line equation should be `y = 0.003x - 0.623`.
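The same fit can be sketched in Python. Note the assumption built into the data below: the soil thickness at time 0 is taken to be 0 inches, since the glacier had just retreated; the other three points come from the problem statement. The helper name `fit_line` is invented for this illustration.

```python
def fit_line(x, y):
    """Least-squares slope and intercept from the sums formulas (hypothetical helper)."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_x2 = sum(xi ** 2 for xi in x)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b = (sum_y - m * sum_x) / n
    return m, b


# Years since glacial retreat (x) and soil thickness in inches (y).
# The first point (0, 0) is an assumption: no soil at the time of retreat.
years = [0, 5000, 10000, 11000]
thickness = [0, 13, 30, 32]

m, b = fit_line(years, thickness)
print("slope (inches per year):", round(m, 3))
print("intercept (inches):", round(b, 3))
```

A small negative intercept is not physically meaningful here; it simply reflects that the four points do not fall exactly on a line through the origin.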
Problem 4B: How long does this model predict it would take to form a new inch of soil in this area of Wisconsin?
Step 1. Determine your unknown and known values.
The known value is 1 inch of soil, which is our `y` value. We are solving for the number of years, our `x`.
Step 2. Put those values into the line equation from Problem 4A.
The line equation is `y = 0.003x - 0.623`, so `1 = 0.003x - 0.623`.
That means `x = (1 + 0.623)/0.003` and `x = 541` years.
It would take approximately 541 years for one inch of soil to form here.
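Solving the line equation for `x` can be written as a small function. The sketch below, in Python, uses the rounded coefficients from the line equation above; the function name `years_to_reach` is an assumption introduced for illustration.

```python
# Rounded coefficients from the Problem 4A line equation.
m = 0.003    # inches of soil per year
b = -0.623   # inches


def years_to_reach(thickness_in):
    """Invert y = m*x + b to get the time at which a given thickness is reached."""
    return (thickness_in - b) / m


years = years_to_reach(1)
print("years to reach 1 inch:", round(years))
```

Because the coefficients were rounded before inverting, the result would shift slightly if more decimal places of the slope were carried.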
Next steps
TAKE THE QUIZ!!
I think I'm competent with linear regression and I am ready to take the quiz! This link takes you to WAMAP. If your instructor has not given you instructions about WAMAP, you may not have to take the quiz. Or you can go back to the Linear Regression explanation page.