Using Regression Models to Make Predictions

Michelle Isenhour, Naval Postgraduate School, Operations Research

Summary

This activity introduces students to prediction and confidence intervals for a simple linear regression model using a MATLAB Live Script. To draw a connection to confidence intervals for an unknown population mean, the activity begins by explaining that the true regression line is simply a line of average values. It then explains mathematically and illustrates graphically two sets of ideas: the point estimate and confidence interval for the mean response E[Y], and the prediction, prediction error, and prediction interval for a new observation Yi. At the conclusion of the activity, the student will understand the key differences between confidence intervals and prediction intervals for simple linear regression models.


Learning Goals

At the conclusion of this activity, the student will:

  • Understand that the true regression line is simply a line of average values.
  • Understand the graphical representation of the distribution of Y for different values of x.
  • Understand the concept of a point estimate and confidence interval for the mean response E[Y].
  • Understand the concept of a prediction, prediction error, and prediction interval for a new observation Yi.

The student will gain the following MATLAB skills:

  • Fit a regression line using fitlm(X,y)
  • Use predict(mdl,Xnew,'Prediction','curve') to add the 95% confidence interval for the true mean response E[Y]
  • Use predict(mdl,Xnew,'Prediction','observation') to add the 95% prediction interval for a new (future) value of Y

Context for Use

Although this activity was designed for graduate students, simple linear regression is a basic undergraduate statistics and data analysis topic, so the activity is also appropriate for an undergraduate statistics course that covers linear regression. The activity can be conducted individually, in groups of 3-4 students, or collectively as a class. In a classroom setting, students are given the entire class period (50 minutes) to work through the MATLAB Live Script and complete the Further Exploration exercises.

The students should have already completed standard instruction on parameter estimation, confidence intervals, and simple linear regression.

Description and Teaching Materials

The Simple Linear Model

The MATLAB Using Regression Models to Make Predictions Live Script (MATLAB Live Script 54kB Aug17 19) begins with a review of the simple linear regression model and demonstrates mathematically that the true regression line is a line of average values.
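
The "line of average values" idea can be written out in a couple of lines. The notation below (β0, β1, ε, σ) is the standard textbook notation and is an assumption here; the Live Script may use different symbols.

```latex
% Simple linear regression model with normally distributed errors
\[ Y = \beta_0 + \beta_1 x + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2) \]

% Taking the expected value at a fixed x removes the error term
\[ E[Y \mid x] = \beta_0 + \beta_1 x + E[\varepsilon] = \beta_0 + \beta_1 x \]
```

Each point on the true regression line is therefore the mean of the distribution of Y at that value of x.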

The Distribution of Y for Different Values of x

Next, the MATLAB Live Script walks the student through a demonstration of what it means for the error terms to be random variables that are normally distributed with mean 0 and variance equal to sigma squared. The Further Exploration activity asks the student to try different values for the parameters to determine their effect on the distribution and to try different numbers of samples to determine what happens as n approaches infinity.
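
The idea in this part of the activity can be sketched as follows. This is a minimal illustration, not the Live Script's code: the parameter values (beta0, beta1, sigma), the seed, and the chosen x values are hypothetical.

```matlab
% Hypothetical parameters (assumed for illustration, not the Live Script's values)
beta0 = 2; beta1 = 0.5; sigma = 1;
rng(1, 'twister');                 % fixed seed so the run is reproducible

xvals = [2 5 8];                   % fixed predictor values to examine
n = 5000;                          % simulated responses at each x value

% Y = beta0 + beta1*x + epsilon, with epsilon ~ N(0, sigma^2);
% column k holds the simulated responses at xvals(k)
Y = beta0 + beta1*xvals + sigma*randn(n, numel(xvals));

trueMeans   = beta0 + beta1*xvals; % points on the true regression line
sampleMeans = mean(Y, 1);
sampleSDs   = std(Y, 0, 1);
disp(table(xvals', trueMeans', sampleMeans', sampleSDs', ...
    'VariableNames', {'x', 'TrueMean', 'SampleMean', 'SampleSD'}));

% One histogram per x value: same spread, center shifted along the line
figure; hold on
for k = 1:numel(xvals)
    histogram(Y(:, k), 'Normalization', 'pdf');
end
hold off
legend('x = 2', 'x = 5', 'x = 8');
xlabel('y'); ylabel('Estimated density');
```

Changing sigma changes the spread of every histogram equally, while changing beta0 or beta1 only moves the centers. Increasing n brings the sample means closer to the true means.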

The Relationship between the True Regression Line and our Fitted Regression Line

The third exploratory activity generates a random sample of observations, uses fitlm(X,y) to fit a linear regression model and then compares the fitted model to the true regression line so that students can see the relationship. The Further Exploration activity asks the students to generate new random samples until they understand the relationship between the true regression line, the sample data, and the estimated (fitted) regression line.
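
A minimal sketch of this comparison is shown below. The true parameters, sample size, and seed are hypothetical choices for illustration; fitlm requires the Statistics and Machine Learning Toolbox.

```matlab
% Hypothetical true model (assumed for illustration)
beta0 = 2; beta1 = 0.5; sigma = 1;
rng(2, 'twister');                       % fixed seed; remove to get a new sample each run

n = 30;
x = 10*rand(n, 1);                       % predictor values on [0, 10]
y = beta0 + beta1*x + sigma*randn(n, 1); % responses scattered about the true line

mdl = fitlm(x, y);                       % fits y ~ 1 + x1 by least squares
b = mdl.Coefficients.Estimate;           % b(1) = intercept estimate, b(2) = slope estimate

xgrid = linspace(0, 10, 100)';
figure; hold on
scatter(x, y, 'filled')
plot(xgrid, beta0 + beta1*xgrid, 'k--', 'LineWidth', 1.5)
plot(xgrid, b(1) + b(2)*xgrid, 'r-', 'LineWidth', 1.5)
hold off
legend('Sample data', 'True regression line', 'Fitted regression line', ...
    'Location', 'northwest');
xlabel('x'); ylabel('y');
```

Rerunning the block without the rng line produces a different sample and a different fitted line each time, while the true regression line stays fixed.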

The Confidence Interval for a Simple Linear Regression Model

Building on the activity just completed, the MATLAB Live Script demonstrates how a confidence interval is constructed for a simple linear regression model. Just as students were able to derive a confidence interval around a sample mean as a method of inference about an unknown population mean, the MATLAB Live Script now motivates the idea of a confidence interval around the estimated regression line as a method of inference about the unknown true regression line. After generating 100 samples and 100 estimated regression lines, the predict(mdl,Xnew,'Prediction','curve') command is used to add the 95% confidence interval for the true mean response E[Y] that corresponds to the very last sample. The visual display helps explain the hourglass shape of the confidence interval, which is narrowest near the sample mean of the predictor and widens toward either end.
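
The construction described above might look roughly like the following sketch. The parameters, seed, and fixed design points are hypothetical, and the code is not taken from the Live Script. The 95% level comes from predict's default 'Alpha' value of 0.05.

```matlab
% Hypothetical true model and design (assumed for illustration)
beta0 = 2; beta1 = 0.5; sigma = 1;
rng(3, 'twister');
n = 30; nSamples = 100;
x = linspace(0, 10, n)';                 % same predictor values for every sample
Xnew = linspace(0, 10, 100)';            % grid on which to evaluate each fitted line

figure; hold on
for s = 1:nSamples
    y = beta0 + beta1*x + sigma*randn(n, 1);   % new random sample
    mdl = fitlm(x, y);                         % new fitted line
    plot(Xnew, predict(mdl, Xnew), 'Color', [0.8 0.8 0.8]);
end

% 95% confidence interval for E[Y], based on the last sample only
[ypred, yci] = predict(mdl, Xnew, 'Prediction', 'curve');
plot(Xnew, beta0 + beta1*Xnew, 'k--', 'LineWidth', 1.5)  % true regression line
plot(Xnew, ypred, 'r-', 'LineWidth', 1.5)                % last fitted line
plot(Xnew, yci, 'r:', 'LineWidth', 1.5)                  % lower and upper CI bounds
hold off
xlabel('x'); ylabel('y');
title('100 fitted lines with the 95% CI from the last sample');

ciWidth = yci(:, 2) - yci(:, 1);
[~, idx] = min(ciWidth);
fprintf('CI is narrowest near x = %.2f; mean of x = %.2f\n', Xnew(idx), mean(x));
```

The gray fitted lines fan out away from the mean of x, which is the same pattern the confidence bounds trace.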

The Prediction Interval for a Simple Linear Regression Model

The difference between a confidence interval and a prediction interval for a simple linear regression model is a difficult concept to grasp, so the next part of the MATLAB Live Script describes the difference mathematically and then uses predict(mdl,Xnew,'Prediction','observation') to add the 95% prediction interval for a new (future) value of Y. Students can see that the prediction interval is much wider than the confidence interval, because it accounts for the variability of an individual observation about its mean in addition to the uncertainty in the estimated regression line.
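
A minimal sketch of overlaying both intervals on one fitted model follows. The data-generating parameters and seed are hypothetical; only the fitlm and predict calls mirror the commands named in the text.

```matlab
% Hypothetical true model and sample (assumed for illustration)
beta0 = 2; beta1 = 0.5; sigma = 1;
rng(4, 'twister');
n = 30;
x = 10*rand(n, 1);
y = beta0 + beta1*x + sigma*randn(n, 1);

mdl = fitlm(x, y);
Xnew = linspace(min(x), max(x), 100)';

% Confidence interval for the mean response E[Y] at each Xnew
[ypred, yci] = predict(mdl, Xnew, 'Prediction', 'curve');
% Prediction interval for a single new observation Y at each Xnew
[~, ypi] = predict(mdl, Xnew, 'Prediction', 'observation');

figure; hold on
scatter(x, y, 'filled')
hFit = plot(Xnew, ypred, 'r-', 'LineWidth', 1.5);
hCI  = plot(Xnew, yci, 'b--', 'LineWidth', 1.2);   % two lines: lower, upper
hPI  = plot(Xnew, ypi, 'g:', 'LineWidth', 1.5);    % two lines: lower, upper
hold off
legend([hFit, hCI(1), hPI(1)], ...
    {'Fitted line', '95% confidence interval', '95% prediction interval'}, ...
    'Location', 'northwest');
xlabel('x'); ylabel('y');
```

Both intervals are centered on the same fitted line; only their widths differ.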

The Further Exploration activity asks the student to compare the width of the intervals near the mean value and then again near the maximum value of the predictor variable. After demonstrating how the width of the prediction interval relates to the sampling distributions of the individual yi values, the student is asked to create an array of values outside the observed range of the predictor variable and learn (through exploration) why we don't want to use a fitted regression model to make predictions far outside that range.
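
One way to carry out this comparison is sketched below. The parameters, seed, and the query point 20 units beyond the largest observed x are hypothetical choices for illustration. The widths grow because the standard error of the fitted value includes a term proportional to the squared distance between the query point and the sample mean of x.

```matlab
% Hypothetical true model and sample (assumed for illustration)
beta0 = 2; beta1 = 0.5; sigma = 1;
rng(5, 'twister');
n = 30;
x = 10*rand(n, 1);
y = beta0 + beta1*x + sigma*randn(n, 1);
mdl = fitlm(x, y);

% Query points: at the mean of x, at the largest observed x, and far outside the data
xq = [mean(x); max(x); max(x) + 20];

[~, ciQ] = predict(mdl, xq, 'Prediction', 'curve');
[~, piQ] = predict(mdl, xq, 'Prediction', 'observation');

ciWidth = ciQ(:, 2) - ciQ(:, 1);   % width of the 95% confidence interval
piWidth = piQ(:, 2) - piQ(:, 1);   % width of the 95% prediction interval

disp(table(xq, ciWidth, piWidth, ...
    'RowNames', {'At mean of x', 'At max of x', 'Far outside range'}));
```

The intervals only quantify sampling uncertainty under the assumption that the linear model still holds at the query point. Far outside the observed range, there is no data to check that assumption.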

Teaching Notes and Tips

The Using Regression Models to Make Predictions Live Script explores the concepts of confidence intervals and prediction intervals for simple linear regression models from a graphical perspective. It introduces the student to the fitlm(X,y) command, which creates a linear regression model, and the predict(mdl,Xnew,Name,Value) command, which predicts the response from that model. The parameters used to illustrate the concepts were arbitrarily chosen and can easily be modified for any situation where the true linear regression model is known.

For reproducibility in support of the in-class demonstrations, the Using Regression Models to Make Predictions Live Script uses the rng(seed, generator) command to control the random generation of the sample data. For the Further Exploration activities, the instructor or student will need to remove (or comment out) these lines in the code.
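
The effect of seeding can be seen in a short sketch. The seed value 42 and the 'twister' generator are hypothetical choices, not necessarily those used in the Live Script.

```matlab
% Seeding the generator makes the "random" sample repeatable
rng(42, 'twister');        % hypothetical seed and generator
sampleA = randn(5, 1);

rng(42, 'twister');        % reset to the same state
sampleB = randn(5, 1);

disp(isequal(sampleA, sampleB));   % displays 1 (true): the two samples are identical

% For Further Exploration, comment out the rng(...) lines above,
% or reseed from the clock so each run gives a new sample:
% rng('shuffle');
```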

The Using Regression Models to Make Predictions Live Script was intentionally designed to support individual exploration, as well as collective exploration. At the undergraduate level, I would recommend walking through the Live Script together as an in-class activity. At a post-graduate level, I encourage students to explore the Live Script individually prior to coming to class and then we collectively discuss the observations during class.

Assessment

As a stand-alone document, the Using Regression Models to Make Predictions Live Script is intended to serve as an exploratory in-class activity and is not directly assessed. However, students are expected to recall this information in order to answer conceptual questions about the relationship between confidence intervals and prediction intervals for simple linear regression on the final exam.

Additionally, on the computational portion of the final exam, students are expected to use bivariate data to fit a linear model, use the model to make predictions, and then describe the corresponding confidence interval and/or prediction interval based on whether they are predicting a mean response or a new observation, respectively.

References and Resources

Textbook: Probability and Statistics for Engineering and the Sciences, 9th Edition, (2016) by Jay Devore. Published by Cengage Learning, Boston.