How do I calculate a linear regression?
Finding relationships between variables in the Earth sciences
This module is available for public use, but it is undergoing revision after classroom implementation with the Math Your Earth Science Majors Need project.
An introduction to linear regressions
Does lake depth affect water column stability? Is the age of a volcanic island related to its distance from a hot spot? Is the soil pH related to the amount of sulfides? Earth scientists can investigate the relationships between these things using linear regressions.
Linear regression provides a statistical analysis of the relationship between two variables. A linear change is when one factor changes by a constant amount with respect to a second factor. A line that models this relationship is usually calculated through the least squares method, resulting in the equation `y=mx+b`. If there is a linear relationship, the resulting equation then helps you predict the value of one variable (`x` or `y`) based on knowing the value of the other variable.
The slope, `m`, or gradient, equals the change in `y` divided by the change in `x`. The `y` intercept, `b`, is the point where the line crosses the `y` axis (`x` = 0).
The least squares method finds the line that best fits the data by minimizing the sum of the squared vertical distances between each point and the line.
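To make "minimizing the sum of the squared distances" concrete, here is a minimal Python sketch that scores two candidate lines against the Hawaiian hot spot data used later on this page. The function name `sum_squared_residuals` and the comparison line `y = 0.012x` are illustrative assumptions, not part of the module; the other line uses the slope and intercept that the regression on this page arrives at.

```python
# Distance from the hot spot (km) and island age (Ma), from the data table on this page.
distance_km = [0, 300, 1800, 2600, 3500, 4800]
age_ma = [0.25, 2, 20, 28, 42, 60]


def sum_squared_residuals(m, b, xs, ys):
    """Sum of squared vertical distances between each point and the line y = m*x + b."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))


# Line reported by the regression later on this page.
fitted = sum_squared_residuals(0.0125, -1.628, distance_km, age_ma)
# An arbitrary comparison line through the origin (illustrative assumption).
guess = sum_squared_residuals(0.012, 0.0, distance_km, age_ma)

print(f"fitted line:  {fitted:.2f}")
print(f"guessed line: {guess:.2f}")
print("fitted line is the better fit:", fitted < guess)
```

The smaller the sum, the closer the line sits to the points overall. Least squares regression finds the `m` and `b` that make this sum as small as possible.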
When do I need to calculate a linear regression?
Linear regressions are used when you want to determine if there is a linear relationship between two variables. You may need to first plot the data in a scatterplot (see 'How do I plot points on a graph?') to visually determine if there is a relationship between the two variables, and if that relationship appears linear. The equation of your calculated best-fitting line can then be used to predict values of one variable based on given or known values of the other variable.
How do I calculate a linear regression?
Here are the steps to take when using least squares analysis for a linear regression:
The active Kilauea Volcano lava flow and cross-section of the Pacific plate near Kilauea, Hawaii (USGS)
Provenance: Kilauea Eruption: Anthony Quintano, Creative Commons, via Wikimedia Commons. Pacific Plate Cross-section: Joel E. Robinson (US Geological Survey)
Reuse: This item is offered under a Creative Commons Attribution-NonCommercial-ShareAlike license http://creativecommons.org/licenses/by-nc-sa/3.0/ You may reuse this item for non-commercial purposes as long as you provide attribution and offer any derivative works under a similar license.
We will use a data set with paired variables (where every value for one variable has a corresponding value of the other variable) from the Hawaiian hot spot track to calculate a linear regression. Kilauea is an active volcano forming at the Hawaiian hot spot above a mantle plume. A chain of volcanic islands to the northwest of Kilauea shows the hot spot track over the last 80 million years. The Pacific plate moves northwest over the hot spot, displacing the volcanic islands to the northwest and allowing a new volcanic island to form at the mantle plume.
Can we use the distance from the hot spot to determine the age of the older volcanic islands? Data provided are the distance from the hot spot in kilometers (km) and the age of the volcanic islands in millions of years (Ma).
Distance (km) | Age (Ma) |
0 | 0.25 |
300 | 2 |
1,800 | 20 |
2,600 | 28 |
3,500 | 42 |
4,800 | 60 |
In all of these calculation steps, you can use a calculator or a spreadsheet program (like Excel) to do the actual calculations. When you move to examples with larger data sets, using a spreadsheet program to do calculations may be more efficient.
How to Calculate a Linear Regression with the Least Squares Method
The independent variable is always plotted on the x-axis (when graphing) and is "independent," i.e., other variables have no effect on these values (but these values will affect other variables).
The dependent variable is always plotted on the y-axis (when graphing) and "depends" on the other variable.
Rainfall amount could cause a change in hill slope stability, but it isn't possible for hill slope stability to cause a change in rainfall amount. Therefore, rainfall is the independent variable and hill slope stability is the dependent variable.
If you have trouble deciding which are the independent and dependent variables, it may be that one variable does not drive the relationship with the other. This is okay; simply assign one variable as `x` and the other as `y`.
Step 1. Decide which variable is the independent variable and which is the dependent variable. For graphing and calculating linear regressions, the independent variable is represented as `x`, and the dependent variable is represented as `y`.
In this example, we are going to use distance (km) as the `x` (independent variable), and age (Ma) as the `y` (dependent variable). This is because we want to estimate the age of each volcano from how far away it is from the hot spot.
Step 2. Calculate `x^2` for every value of `x`, and `y^2` for every value of `y`. For each pair of values, calculate `x xx y`.
You should end up with six values (the same as the number of `x` values) for `x^2`, six values for `y^2`, and six values for `x xx y`.
You can write these values as additional columns in your table to help keep everything organized, like this:
x: Distance (km) | y: Age (Ma) | `x^2` | `y^2` | `x xx y` |
0 | 0.25 | 0 | 0.0625 | 0 |
300 | 2 | 90,000 | 4 | 600 |
1,800 | 20 | 3,240,000 | 400 | 36,000 |
2,600 | 28 | 6,760,000 | 784 | 72,800 |
3,500 | 42 | 12,250,000 | 1,764 | 147,000 |
4,800 | 60 | 23,040,000 | 3,600 | 288,000 |
`x^2` | `y^2` | `x xx y` |
0² = 0 | 0.25² = 0.0625 | 0 × 0.25 = 0 |
300² = 90,000 | 2² = 4 | 300 × 2 = 600 |
1,800² = 3,240,000 | 20² = 400 | 1,800 × 20 = 36,000 |
2,600² = 6,760,000 | 28² = 784 | 2,600 × 28 = 72,800 |
3,500² = 12,250,000 | 42² = 1,764 | 3,500 × 42 = 147,000 |
4,800² = 23,040,000 | 60² = 3,600 | 4,800 × 60 = 288,000 |
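If you would rather script Step 2 than fill in the columns by hand, the following Python sketch builds the same three columns. The variable names are our own choices for illustration; only the data values come from this page.

```python
distance_km = [0, 300, 1800, 2600, 3500, 4800]   # x
age_ma = [0.25, 2, 20, 28, 42, 60]               # y

# One new value per data pair, exactly as in the table columns.
x_squared = [x ** 2 for x in distance_km]
y_squared = [y ** 2 for y in age_ma]
x_times_y = [x * y for x, y in zip(distance_km, age_ma)]

# Print one table row per data pair: x | y | x^2 | y^2 | x*y
for row in zip(distance_km, age_ma, x_squared, y_squared, x_times_y):
    print(*row, sep=" | ")
```

Each list has six entries, one per island, which is a quick way to check that no pair was skipped.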
Step 3. Now, you will add up some of these values you just calculated. The `Sigma` symbol means "sum of", so in this step, you are adding up all of the values for each of these groups.
Calculate `Sigmax`, `Sigmay`, `Sigma(x^2)`, `Sigma(y^2)`, and `Sigma(x xx y)`.
For example, to calculate `Sigmax`, you will add up all values of `x`, and to calculate `Sigma(y^2)`, you will add up all of the values of `y^2` that you calculated in Step 2.
`Sigmax = 0 + 300 + 1800 + 2600 + 3500 + 4800 = 13000`
`Sigmay = 0.25 + 2 + 20 + 28 + 42 + 60 = 152.25`
`Sigma(x^2) = 0 + 90000 + 3240000 + 6760000 + 12250000 + 23040000 = 45380000`
`Sigma(y^2) = 0.0625 + 4 + 400 + 784 + 1764 + 3600 = 6552.0625`
`Sigma(x xx y) = 0 + 600 + 36000 + 72800 + 147000 + 288000 = 544400`
You can write these values at the bottom of your table to help keep everything organized, like this:
x: Distance (km) | y: Age (Ma) | `x^2` | `y^2` | `x xx y` |
0 | 0.25 | 0 | 0.0625 | 0 |
300 | 2 | 90,000 | 4 | 600 |
1,800 | 20 | 3,240,000 | 400 | 36,000 |
2,600 | 28 | 6,760,000 | 784 | 72,800 |
3,500 | 42 | 12,250,000 | 1,764 | 147,000 |
4,800 | 60 | 23,040,000 | 3,600 | 288,000 |
`Sigmax` | `Sigmay` | `Sigma(x^2)` | `Sigma(y^2)` | `Sigma(x xx y)` |
13,000 | 152.25 | 45,380,000 | 6,552.0625 | 544,400 |
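The five sums in Step 3 can also be checked with a few lines of Python. This is a sketch with variable names of our own choosing (`sum_x`, `sum_x2`, and so on); Python's built-in `sum` plays the role of the `Sigma` symbol.

```python
distance_km = [0, 300, 1800, 2600, 3500, 4800]   # x
age_ma = [0.25, 2, 20, 28, 42, 60]               # y

n = len(distance_km)                              # number of data pairs
sum_x = sum(distance_km)                          # Sigma x
sum_y = sum(age_ma)                               # Sigma y
sum_x2 = sum(x ** 2 for x in distance_km)         # Sigma (x^2)
sum_y2 = sum(y ** 2 for y in age_ma)              # Sigma (y^2)
sum_xy = sum(x * y for x, y in zip(distance_km, age_ma))  # Sigma (x * y)

print(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy)
```

Note that `Sigma(x^2)` (square first, then add) is not the same as `(Sigmax)^2` (add first, then square); both appear in the slope formula.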
Step 4. Familiarize yourself with the equation of a line: `y=mx+b`.
The `x` and `y` in this equation represent data and will remain as `x` and `y` in the equation.
The `m` is the slope of the line, and the `b` is the y-intercept, or where the line crosses the y-axis. Performing linear regression will help you calculate the values for `m` (Step 5) and `b` (Step 6) to replace in the equation.
`n` is the number of data points, that is, the number of paired values.
In this example, there are 6 values for `x` and 6 values for `y`, so `n = 6`.
Step 5. Calculate `m` (slope): `m = (n(Sigma(x xx y)) – (Sigmax)(Sigmay))/(n(Sigma(x^2)) – (Sigmax)^2)`.
This may look complicated, but don't worry! You already calculated the values for each of these components in Step 3. For this calculation, simply plug in those values that you calculated in Step 3.
To simplify this calculation, let's start on the top half of the equation (this is called the numerator):
`(n(Sigma(x xx y)) – (Sigmax)(Sigmay))`
numerator = `(6(544400) – (13000)(152.25))`
numerator = `(3266400 – 1979250)`
numerator = `1287150`
Now we can calculate the bottom half of the equation (this is called the denominator): `(n(Sigma(x^2)) – (Sigmax)^2)`
denominator = `(6(45380000) – (13000)^2)`
denominator = `(272280000 – 169000000)`
denominator = `103280000`
To finish the calculation, divide the numerator by the denominator:
`m = 1287150/103280000 = 0.0125`
Here, we have rounded the actual answer from a long decimal string (0.0124627226955848...) to 0.0125.
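As a check on the arithmetic in Step 5, here is a short Python sketch that plugs the Step 3 sums into the slope formula. The variable names are illustrative; the numbers are the sums calculated above.

```python
# Sums from Step 3.
n = 6
sum_x = 13000
sum_y = 152.25
sum_x2 = 45380000
sum_xy = 544400

numerator = n * sum_xy - sum_x * sum_y      # top half of the slope formula
denominator = n * sum_x2 - sum_x ** 2       # bottom half of the slope formula
m = numerator / denominator

print("numerator:", numerator)
print("denominator:", denominator)
print("m (unrounded):", m)
print("m (rounded to 4 decimal places):", round(m, 4))
```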
Step 6. Calculate `b` (intercept): `b = (Sigmay – m(Sigmax))/(n)`
To simplify this calculation, again we can start on the numerator (top half of the equation):
`(Sigmay – m(Sigmax))`
numerator = `(152.25 – 0.0125(13000))`
numerator = `(152.25 – 162.5)`
numerator = `-10.25`
Next, we would calculate the denominator (bottom half of the equation), but this is just `n`!
denominator = `n`
denominator = `6`
To finish the calculation, divide the numerator by the denominator:
`b = -10.25/6 = -1.71`
Again, we have rounded the actual answer from a long decimal string (-1.708333333...) to -1.71.
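Because `m` is multiplied by a large number (`Sigmax` = 13,000), rounding `m` before Step 6 noticeably changes `b`. The Python sketch below, with a helper name `intercept` that is our own, computes the intercept both ways: once with the rounded slope used in the hand calculation and once with the unrounded slope.

```python
# Sums from Step 3.
n = 6
sum_x = 13000
sum_y = 152.25


def intercept(m):
    """Intercept formula from Step 6: b = (Sigma y - m * Sigma x) / n."""
    return (sum_y - m * sum_x) / n


m_unrounded = 1287150 / 103280000            # slope from Step 5 before rounding

b_from_rounded_m = intercept(0.0125)         # hand calculation on this page
b_from_unrounded_m = intercept(m_unrounded)  # carrying every decimal place

print("b using m = 0.0125:   ", round(b_from_rounded_m, 2))
print("b using unrounded m:  ", round(b_from_unrounded_m, 3))
```

The second value is the one a spreadsheet reports, since it does not round intermediate results. This is why the intercept found by hand and the one found with Excel later on this page differ slightly.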
Step 7. Replace 'b' and 'm' in the line equation: `y=mx+b`.
From the previous steps, we know:
`m = 0.0125`
`b = -1.71`
So the final equation of the linear regression is: `y = 0.0125x - 1.71`
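Steps 2 through 7 can be collected into one reusable function. The following is a minimal sketch in plain Python, not the module's official method; the function name `least_squares_line` and its error messages are assumptions made for illustration.

```python
def least_squares_line(xs, ys):
    """Return (m, b) for the least squares line y = m*x + b."""
    if len(xs) != len(ys):
        raise ValueError("x and y must contain the same number of values")

    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))

    denominator = n * sum_x2 - sum_x ** 2
    if denominator == 0:
        # Happens when every x is the same (or there is no data): no unique slope.
        raise ValueError("slope is undefined for these x values")

    m = (n * sum_xy - sum_x * sum_y) / denominator   # Step 5
    b = (sum_y - m * sum_x) / n                      # Step 6
    return m, b


m, b = least_squares_line(
    [0, 300, 1800, 2600, 3500, 4800],   # distance (km)
    [0.25, 2, 20, 28, 42, 60],          # age (Ma)
)
print(f"y = {m:.4f}x + ({b:.3f})")       # Step 7
```

The function never rounds `m` before computing `b`, so its intercept will be slightly different from the hand-calculated -1.71.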
How to Use Excel to Calculate a Linear Regression Using the Data Analysis Toolpak
Now let's use a statistical analysis tool in Excel to determine a linear regression for the Kilauea volcano data.
Step 1. Enter your data into an Excel spreadsheet in two columns. To help visualize the relationship between the variables, create a scatterplot using the Excel graph feature and insert a trendline. A trendline shows you a best fit line for the data on the chart; it is often calculated with the least squares method, but it can be calculated in several different ways.
Create the scatterplot: Select both columns of your data. In the Excel menu, click on "Insert" and then the chart feature and select the scatter option (with no connecting lines). In the resulting chart, click the chart area to display the Chart Design and Format tabs in the top menu. Update the chart title, axis labels, and more.
Add the trendline: Select the chart area again. Click on "Chart Design" then "Add Chart Element." Select Trendline --> Linear.
Step 2. Under the tab "Data," select "Data Analysis." If you do not see "Data Analysis" as an option, you must activate the Data Analysis Toolpak.
In Windows: File --> Options --> Add-Ins --> Manage Add-Ins. Then click on the "Go" button, check the box for "Analysis Toolpak" and click "Ok." Then go back to the Data tab.
With a Mac: Tools menu at the top of the screen --> Excel Add-Ins --> Choose from available add-ins --> Check the box for "Analysis Toolpak" and click "Ok."
Step 3. Now use the Regression tool on your data.
In the pop-up menu, select "Regression." Click in the box "Input Y range." Then highlight the `y` values in your spreadsheet. Then click in the box "Input X range." Then highlight the `x` values in your spreadsheet. Click the "Output Range" circle, then click in the box to the right of the words "output range." In your spreadsheet, click in the upper left corner of a blank area where you want to place the regression output. Finish by clicking "Ok."
Step 4. Determine the line equation for your data. Write it in the form of `y=mx+b` and look for the `R^2` value.
The Summary Output will display three tables of results: Regression Statistics, ANOVA, and a third unlabeled table. The slope, `m`, is in the third table, in the row labeled X Variable and the Coefficients column. The `y` intercept is the value in the Intercept row under Coefficients. The `R^2` value is the second line in the Regression Statistics table (labeled R square).
The accompanying coefficient of determination, usually written `R^2` and sometimes `r^2`, measures the strength of the linear relationship between `x` and `y`. It is the square of the correlation coefficient, `r`. The closer `R^2` is to 1.0, the stronger the linear relationship between `x` and `y`. Values near zero indicate no linear relationship between the variables. The desired value of `R^2` varies depending on the context and field of study.
- The Summary Output from our volcano data is shown in the figure.
- `R^2 = 0.994`, which indicates a strong correlation, meaning that the ages of the old volcanoes increase as distance from the active volcano increases.
- Our line equation is `y = 0.0125x - 1.628`. The intercept term is subtracted because the `y` intercept is negative (`b = -1.628`). The slope, `m`, is positive, which indicates a positive linear relationship; a negative slope would indicate a negative linear relationship.
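The `R^2` value can also be checked by hand from the Step 3 sums, which is where `Sigma(y^2)` finally gets used. The sketch below applies the standard Pearson formula for `r` and then squares it; this formula is not given on this page, so treat it as an outside addition. Variable names are illustrative.

```python
from math import sqrt

# Sums from Step 3.
n = 6
sum_x, sum_y = 13000, 152.25
sum_x2, sum_y2, sum_xy = 45380000, 6552.0625, 544400

# Pearson r: same numerator as the slope formula, scaled by the spread in both x and y.
r = (n * sum_xy - sum_x * sum_y) / sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)
r_squared = r ** 2

print("r:  ", round(r, 3))
print("R^2:", round(r_squared, 3))
```

For a regression with a single `x` variable, squaring `r` gives the same `R^2` that appears in the Regression Statistics table.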
Step 5. Check to see if this line equation matches what you calculated by hand above. Note that the values of `m` and `b` may differ slightly due to different rounding within the calculations, but they should be close. Do they match? Hurray! You now know how to do a linear regression with two different methods.
How do you use the line equation from the linear regression?
One of the most important things a scientist does with data is to discover if there is a relationship between two factors. A linear regression shows us if there is a strong positive or negative linear relationship, and that conclusion is often the goal of a study.
Another important use of linear regressions is as a tool to solve for an unknown. Once you have a line equation from your linear regression, you can solve for `x` if you have `y`, or solve for `y` if you have `x`. For example, your unknown may be the concentration of a salt or metal in a solution or what the temperature will be at a certain level of carbon dioxide.
You are studying an island in the Hawaiian hot spot chain approximately 1000 km from Kilauea. Based on your linear regression, how old are the rocks here?
Step 1:
- Distance is our independent variable, our `x`. So put 1000 km in for `x` and solve for `y`.
- If `y=mx+b`, that means `y = (0.0125 xx 1000) - 1.628`.
- Then `y = 12.5 - 1.628`. Calculate to show `y = 10.872` million years.
Step 2: What does this mean? It tells us that since this island is 1000 km away from Kilauea, its rocks are approximately 10.9 million years old!
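Once you have a line equation, solving for an unknown is a one-line calculation in either direction. The Python sketch below uses the Excel line equation from this page, `y = 0.0125x - 1.628`; the function names `age_from_distance` and `distance_from_age` are our own.

```python
M = 0.0125    # slope: million years of age per km of distance
B = -1.628    # y intercept, in million years (Ma)


def age_from_distance(distance_km):
    """Solve for y (age, Ma) when x (distance, km) is known."""
    return M * distance_km + B


def distance_from_age(age_ma):
    """Solve for x (distance, km) when y (age, Ma) is known."""
    return (age_ma - B) / M


age = age_from_distance(1000)
print(f"Estimated age at 1000 km: {age:.3f} Ma")
print(f"Distance for that age: {distance_from_age(age):.0f} km")
```

Predictions are most trustworthy for distances inside the range of the data (0 to 4,800 km here). Very close to the hot spot the equation returns negative ages because of the negative intercept, so use it with care there.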
Where do you use linear regressions in Earth science?
- Determination of the velocity of tectonic plates or seismic waves
- Determination of the concentration of an unknown solution based on various measurements of standards of known concentrations
- Relating environmental or ecological responses to a measured factor
- Determination of the relationship of temperature to gas concentrations or chemical components
Next steps
I am ready to PRACTICE!
If you think you have a handle on the steps above, click on this bar to try practice problems with worked answers.
Or, if you want even more practice, see 'More help' below.
More help (resources for students)
Pages written by Dr. Laura Treible (Savannah State University) and Dr. Melanie Szulczewski (University of Mary Washington).