How do I calculate a linear regression?
Finding relationships between variables in the Earth sciences

An introduction to linear regressions

Does lake depth affect water column stability? Is the age of a volcanic island related to its distance from a hot spot? Is the soil pH related to the amount of sulfides? Earth scientists can investigate the relationships between these things using linear regressions.
Linear regression provides a statistical analysis of the relationship between two variables. A linear change is when one factor changes by a constant amount with respect to a second factor. A line that models this relationship is usually calculated through the least squares method, resulting in the equation `y=mx+b`. If there is a linear relationship, the resulting equation then helps you predict the value of one variable (`x` or `y`) based on knowing the value of the other variable.

The slope, `m`, or gradient, equals the change in `y` divided by the change in `x`. The `y` intercept, `b`, is the point where the line crosses the `y` axis (`x` = 0).

The least squares method finds the line that best fits the data by minimizing the sum of the squared vertical distances between each data point and the line.
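As a minimal sketch of what "least squares" measures, the Python function below adds up the squared vertical distances between a set of points and a candidate line. The three points and the function name `sum_squared_residuals` are hypothetical, chosen only so the arithmetic is easy to check by hand.

```python
def sum_squared_residuals(xs, ys, m, b):
    """Sum of squared vertical distances between points and the line y = m*x + b."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))

# Hypothetical points that lie exactly on the line y = 2x + 1
xs = [0, 1, 2]
ys = [1, 3, 5]

perfect_fit = sum_squared_residuals(xs, ys, m=2, b=1)   # every distance is 0
shifted_line = sum_squared_residuals(xs, ys, m=2, b=2)  # line sits 1 unit too high

print(perfect_fit)   # 0
print(shifted_line)  # 3
```

The least squares line is the one that makes this sum as small as possible; any other line gives a larger total.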

When do I need to calculate a linear regression?

Linear regressions are used when you want to determine if there is a linear relationship between two variables. You may need to first plot the data in a scatterplot (see 'How do I plot points on a graph?') to visually determine if there is a relationship between the two variables, and if that relationship appears linear. The equation of your calculated best-fitting line can then be used to predict values of one variable based on given or known values of the other variable.

How do I calculate a linear regression?

The steps below use least squares analysis to calculate a linear regression for the following data set.

We will use a data set with paired variables (where every value for one variable has a corresponding value of the other variable) from the Hawaiian hot spot track to calculate a linear regression. Kilauea is an active volcano forming at the Hawaiian hot spot above a mantle plume. A chain of volcanic islands to the northwest of Kilauea shows the hot spot track over the last 80 million years. The Pacific plate moves northwest over the hot spot, displacing the volcanic islands to the northwest and allowing a new volcanic island to form at the mantle plume.
Can we use the distance from the hot spot to determine the age of the older volcanic islands? Data provided are the distance from the hot spot in kilometers (km) and the age of the volcanic islands in millions of years (Ma).

Distance (km)    Age (Ma)
0                0.25
300              2
1,800            20
2,600            28
3,500            42
4,800            60

In all of these calculation steps, you can use a calculator or a spreadsheet program (like Excel) to do the actual calculations. When you move to examples with larger data sets, using a spreadsheet program to do calculations may be more efficient.

How to Calculate a Linear Regression with the Least Squares Method

The independent variable is always plotted on the x-axis when graphing. It is "independent" because the other variable has no effect on its values, although its values may affect the other variable.

The dependent variable is always plotted on the y-axis when graphing. It "depends" on the independent variable.

Step 1. Decide which variable is the independent variable and which is the dependent variable. For graphing and calculating linear regressions, the independent variable is represented as `x`, and the dependent variable is represented as `y`.

 

Step 2. Calculate `x^2` for every value of `x`, and `y^2` for every value of `y`. For each pair of values, calculate `x xx y`.

Step 3. Now, you will add up some of these values you just calculated. The `Sigma` symbol means "sum of", so in this step, you are adding up all of the values for each of these groups.

Calculate `Sigmax`,  `Sigmay`,  `Sigma(x^2)`,  `Sigma(y^2)`,  and  `Sigma(x xx y)`.

For example, to calculate `Sigmax`, you will add up all values of `x`, and to calculate `Sigma(y^2)`, you will add up all of the values of `y^2` that you calculated in Step 2.
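Steps 2 and 3 can be sketched in Python as follows. This sketch assumes distance is the independent variable (`x`) and age is the dependent variable (`y`), since the question is whether distance can be used to determine age. The variable names are illustrative only.

```python
# Hawaiian hot spot data from the table above
distance_km = [0, 300, 1800, 2600, 3500, 4800]  # x (assumed independent variable)
age_ma = [0.25, 2, 20, 28, 42, 60]              # y (assumed dependent variable)

# Step 2: squares and products for each pair of values
x_squared = [x ** 2 for x in distance_km]
y_squared = [y ** 2 for y in age_ma]
x_times_y = [x * y for x, y in zip(distance_km, age_ma)]

# Step 3: add up each group of values
sum_x = sum(distance_km)
sum_y = sum(age_ma)
sum_x2 = sum(x_squared)
sum_y2 = sum(y_squared)
sum_xy = sum(x_times_y)

print(sum_x)   # 13000
print(sum_y)   # 152.25
print(sum_x2)  # 45380000
print(sum_y2)  # 6552.0625
print(sum_xy)  # 544400.0
```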

Step 4. Familiarize yourself with the equation of a line: `y=mx+b`
The `x` and `y` in this equation represent data and will remain as `x` and `y` in the equation. 
The `m` is the slope of the line, and the `b` is the y-intercept, or where the line crosses the y-axis. Performing a linear regression gives you the values of `m` (Step 5) and `b` (Step 6) to substitute into the equation.

`n` is the number of data points, that is, the number of paired `x` and `y` values.

In this example, there are 6 values for `x` and 6 values for `y`, so `n = 6`.


Step 5. Calculate `m` (slope): `m = (n(Sigma(x xx y)) - (Sigmax)(Sigmay))/(n(Sigma(x^2)) - (Sigmax)^2)`.

This may look complicated, but don't worry! You already calculated the values for each of these components in Step 3. For this calculation, simply plug in those values that you calculated in Step 3.
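As a hedged illustration of Step 5, the sketch below plugs the sums for the hot spot data into the slope formula. It assumes distance is `x` and age is `y`; the names `numerator` and `denominator` are just labels for the top and bottom of the fraction.

```python
# Hawaiian hot spot data (x = distance in km, y = age in Ma; assumed assignment)
distance_km = [0, 300, 1800, 2600, 3500, 4800]
age_ma = [0.25, 2, 20, 28, 42, 60]

n = len(distance_km)                                   # 6 data points
sum_x = sum(distance_km)                               # 13000
sum_y = sum(age_ma)                                    # 152.25
sum_x2 = sum(x ** 2 for x in distance_km)              # 45380000
sum_xy = sum(x * y for x, y in zip(distance_km, age_ma))  # 544400.0

# Step 5: slope formula
numerator = n * sum_xy - sum_x * sum_y    # 6(544400) - (13000)(152.25)
denominator = n * sum_x2 - sum_x ** 2     # 6(45380000) - (13000)^2
m = numerator / denominator

print(round(m, 4))  # 0.0125
```

With this assignment of variables, the slope has units of Ma per km.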

Step 6. Calculate `b` (intercept): `b = (Sigmay - m(Sigmax))/(n)`

Step 7. Substitute your values of `m` and `b` into the line equation: `y=mx+b`.
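The whole hand calculation can be collected into one short function. This is a sketch under the assumption that distance is `x` and age is `y`; the function name `least_squares_line` is hypothetical and not part of any library.

```python
def least_squares_line(xs, ys):
    """Return (m, b) for the least squares line y = m*x + b."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_x2 = sum(x ** 2 for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)  # slope
    b = (sum_y - m * sum_x) / n                                    # intercept
    return m, b

distance_km = [0, 300, 1800, 2600, 3500, 4800]
age_ma = [0.25, 2, 20, 28, 42, 60]

m, b = least_squares_line(distance_km, age_ma)

# Write the rounded values of m and b into the line equation y = mx + b
print(f"y = {m:.4f}x + ({b:.2f})")  # y = 0.0125x + (-1.63)
```

Rounding `m` and `b` for display is a choice made here for readability; keep more digits if you plan to use the equation for further calculations.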

How to Use Excel to Calculate a Linear Regression Using the Analysis ToolPak

Now let's use a statistical analysis tool in Excel to determine a linear regression for the Hawaiian hot spot data.

Step 1. Enter your data into an Excel spreadsheet in two columns. To help visualize the relationship between the variables, create a scatterplot using the Excel graph feature and insert a trendline. A trendline shows you a best-fit line for the data on the chart; it is often calculated with the least squares method but can be calculated in several different ways.

Step 2. Under the "Data" tab, select "Data Analysis." If you do not see "Data Analysis" as an option, you must first activate the Analysis ToolPak add-in.

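Step 3. Now use the Regression tool on your data. In the Data Analysis dialog box, select "Regression," then enter the range of your dependent variable as the Input Y Range and the range of your independent variable as the Input X Range.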

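Step 4. Determine the line equation for your data. In the regression output, the "Coefficients" column gives `b` in the "Intercept" row and `m` in the row for the x variable. Write the equation in the form of `y=mx+b` and look for the `R^2` value, which is labeled "R Square."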

Step 5. Check to see if this line equation matches what you calculated by hand above. Note that the values of `m` and `b` may differ slightly due to different rounding within the calculations, but they should be close. Do they match? Hurray! You now know how to do a linear regression with two different methods.

How do you use the line equation from the linear regression?

One of the most important things a scientist does with data is to discover if there is a relationship between two factors. A linear regression shows us if there is a strong positive or negative linear relationship, and that conclusion is often the goal of a study.

Another important use of linear regressions is as a tool to solve for an unknown. Once you have a line equation from your linear regression, you can solve for `x` if you have `y`, or solve for `y` if you have `x`. For example, your unknown may be the concentration of a salt or metal in a solution or what the temperature will be at a certain level of carbon dioxide.

You are studying an island in the Hawaiian hot spot chain approximately 1000 km from Kilauea. Based on your linear regression, how old are the rocks here?
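One way to work this question is sketched below: fit the line to the hot spot data, then substitute the distance for `x`. The helper names `fit_line`, `predict_age`, and `predict_distance` are hypothetical, and the sketch assumes distance is `x` and age is `y`.

```python
def fit_line(xs, ys):
    """Least squares slope and intercept for y = m*x + b."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x ** 2 for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b = (sum_y - m * sum_x) / n
    return m, b

distance_km = [0, 300, 1800, 2600, 3500, 4800]
age_ma = [0.25, 2, 20, 28, 42, 60]
m, b = fit_line(distance_km, age_ma)

def predict_age(distance):
    """Solve for y (age in Ma) given x (distance in km)."""
    return m * distance + b

def predict_distance(age):
    """Solve for x (distance in km) given y (age in Ma)."""
    return (age - b) / m

age_at_1000_km = predict_age(1000)
print(round(age_at_1000_km, 1))  # 10.8
```

Under these assumptions, the regression suggests rocks roughly 10.8 million years old at 1000 km from the hot spot. `predict_distance` shows the reverse use of the same equation, solving for `x` when `y` is known.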

Where do you use linear regressions in Earth science?

  • Determination of the velocity of tectonic plates or seismic waves
  • Determination of the concentration of an unknown solution based on various measurements of standards of known concentrations
  • Relating environmental or ecological responses to a measured factor
  • Determination of the relationship of temperature to gas concentrations or chemical components

Next steps


I am ready to PRACTICE!
If you think you have a handle on the steps above, click on this bar to try practice problems with worked answers.

Or, if you want even more practice, see More help below.

More help (resources for students)

Pages written by Dr. Laura Treible (Savannah State University) and Dr. Melanie Szulczewski (University of Mary Washington).

