Using Regression Models to Make Predictions
Summary
This activity uses a MATLAB Live Script to introduce students to confidence intervals and prediction intervals for a simple linear regression model. To draw a connection to confidence intervals for an unknown population mean, the activity begins by explaining that the true regression line is simply a line of average values. The point estimate and confidence interval for the mean response E[Y] are explained mathematically and illustrated graphically, as are the prediction, prediction error, and prediction interval for a new observation Yi. At the conclusion of the activity, the student will understand the key differences between confidence intervals and prediction intervals for simple linear regression models.
Learning Goals
At the conclusion of this activity, the student will:
- Understand that the true regression line is simply a line of average values.
- Understand the graphical representation of the distribution of Y for different values of x.
- Understand the concept of a point estimate and confidence interval for the mean response E[Y].
- Understand the concept of a prediction, prediction error, and prediction interval for a new observation Yi.
The student will gain the following MATLAB skills:
- Fit a regression line using fitlm(X,y)
- Use predict(mdl,Xnew,'Prediction','curve') to add the 95% confidence interval for the true mean response E[Y]
- Use predict(mdl,Xnew,'Prediction','observation') to add the 95% prediction interval for the prediction of a new (future) value of Y
Context for Use
Although this activity was designed for graduate-level students, simple linear regression is a basic (undergraduate) statistics and data analysis topic, so the activity would also be appropriate in an undergraduate-level statistics course that covers linear regression. The activity can be conducted individually, in groups of 3-4 students, or collectively as a class. In a classroom setting, students are given the entire class period (50 minutes) to work through the MATLAB Live Script and complete the exercises for further exploration.
The students should have already completed standard instruction on parameter estimation, confidence intervals, and simple linear regression.
Description and Teaching Materials
The Simple Linear Model
The MATLAB Using Regression Models to Make Predictions Live Script (MATLAB Live Script 54kB Aug17 19) begins with a review of the simple linear regression model and demonstrates mathematically that the true regression line is simply a line of average values.
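The argument can be written in two lines. The sketch below assumes the usual textbook notation (the symbols \beta_0, \beta_1, \varepsilon_i, and \sigma^2 are assumptions here, not necessarily those used in the Live Script):

```latex
\begin{align*}
Y_i &= \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2) \\
E[Y_i \mid x_i] &= \beta_0 + \beta_1 x_i + E[\varepsilon_i] = \beta_0 + \beta_1 x_i
\end{align*}
```

Because the error terms have mean 0, each point on the true regression line is the average of Y at that value of x.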
The Distribution of Y for Different Values of x
Next, the MATLAB Live Script walks the student through a demonstration of what it means for the error terms to be random variables that are normally distributed with mean 0 and variance equal to sigma squared. The Further Exploration activity asks the student to try different values for the parameters to determine their effect on the distribution and to try different numbers of samples to determine what happens as n approaches infinity.
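A demonstration of this kind might look like the following minimal sketch. The parameter values, the x values, and the number of samples are hypothetical choices for illustration and are not taken from the Live Script.

```matlab
% Hypothetical parameter values, chosen only for illustration
beta0 = 2;      % true intercept
beta1 = 0.5;    % true slope
sigma = 1;      % standard deviation of the error terms
n     = 1000;   % number of simulated responses per x value

xvals = [2 6 10];                 % x values at which Y is examined
figure; hold on
for k = 1:numel(xvals)
    mu = beta0 + beta1*xvals(k);  % mean response E[Y] at this x
    y  = mu + sigma*randn(n,1);   % simulated responses: mean mu, variance sigma^2
    histogram(y,'Normalization','pdf', ...
        'DisplayName',sprintf('x = %g',xvals(k)))
    fprintf('x = %g: E[Y] = %.2f, sample mean = %.3f, sample variance = %.3f\n', ...
        xvals(k), mu, mean(y), var(y));
end
hold off
xlabel('y'); ylabel('density')
legend show
```

Changing sigma widens or narrows each histogram, changing beta0 or beta1 shifts the centers, and increasing n brings the sample mean and variance closer to E[Y] and sigma^2.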
The Relationship between the True Regression Line and our Fitted Regression Line
The third exploratory activity generates a random sample of observations, uses fitlm(X,y) to fit a linear regression model and then compares the fitted model to the true regression line so that students can see the relationship. The Further Exploration activity asks the students to generate new random samples until they understand the relationship between the true regression line, the sample data, and the estimated (fitted) regression line.
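The comparison might be sketched as follows. The true parameters, the sample size, and the range of x are hypothetical; only the fitlm(X,y) call comes from the activity description.

```matlab
% Hypothetical true model and sample size (illustrative values only)
beta0 = 2; beta1 = 0.5; sigma = 1;
n = 30;

x = 10*rand(n,1);                        % predictor values on [0, 10]
y = beta0 + beta1*x + sigma*randn(n,1);  % responses scattered about the true line

mdl = fitlm(x,y);                        % least-squares fit
b   = mdl.Coefficients.Estimate;         % b(1) = intercept, b(2) = slope

xg = linspace(0,10,100)';
figure; hold on
scatter(x,y,'filled')
plot(xg, beta0 + beta1*xg, 'k--')        % true regression line
plot(xg, b(1) + b(2)*xg, 'r-')           % fitted regression line
hold off
legend('sample data','true line','fitted line','Location','northwest')
xlabel('x'); ylabel('y')
```

Rerunning the block draws a new sample each time, so the fitted line moves around the fixed true line.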
The Confidence Interval for a Simple Linear Regression Model
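As a minimal sketch of this step, the 95% confidence interval for the mean response E[Y] can be added to the fitted line with predict(mdl,Xnew,'Prediction','curve'), the call named in the learning goals. The true parameters, sample size, and plotting choices below are hypothetical and are not taken from the Live Script.

```matlab
% Hypothetical true model and sample (illustrative values only)
beta0 = 2; beta1 = 0.5; sigma = 1; n = 30;
x = 10*rand(n,1);
y = beta0 + beta1*x + sigma*randn(n,1);
mdl = fitlm(x,y);

Xnew = linspace(0,10,100)';
% 95% confidence interval for the mean response E[Y] at each Xnew
% (predict uses Alpha = 0.05 unless told otherwise)
[yhat, yci] = predict(mdl, Xnew, 'Prediction', 'curve');

figure; hold on
scatter(x,y,'filled')
plot(Xnew, yhat, 'r-')     % fitted line (point estimate of E[Y])
plot(Xnew, yci, 'b--')     % lower and upper confidence bounds
hold off
legend('sample data','fitted line','95% CI lower','95% CI upper', ...
    'Location','northwest')
xlabel('x'); ylabel('y')
```

The band is narrowest near the mean of the sampled x values and widens toward the ends of the range.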
The Prediction Interval for a Simple Linear Regression Model
The difference between a confidence interval and a prediction interval for a simple linear regression model is a very difficult concept to grasp, so the next part of the MATLAB Live Script describes the difference mathematically and then uses predict(mdl,Xnew,'Prediction','observation') to add the 95% prediction interval for the prediction of a new (future) value of Y. Students can see that the prediction interval is much wider than the confidence interval.
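One way to make the comparison concrete is to compute both intervals on the same grid and compare their widths. In this sketch the true parameters and sample size are hypothetical; the two predict calls are the ones named in the learning goals.

```matlab
% Hypothetical true model and sample (illustrative values only)
beta0 = 2; beta1 = 0.5; sigma = 1; n = 30;
x = 10*rand(n,1);
y = beta0 + beta1*x + sigma*randn(n,1);
mdl = fitlm(x,y);

Xnew = linspace(0,10,100)';
[yhat, yci] = predict(mdl, Xnew, 'Prediction', 'curve');        % interval for E[Y]
[~,    ypi] = predict(mdl, Xnew, 'Prediction', 'observation');  % interval for a new Y

figure; hold on
scatter(x,y,'filled')
plot(Xnew, yhat, 'r-')     % fitted line
plot(Xnew, yci, 'b--')     % confidence bounds for the mean response
plot(Xnew, ypi, 'k:')      % prediction bounds for a new observation
hold off
xlabel('x'); ylabel('y')

ciWidth = yci(:,2) - yci(:,1);
piWidth = ypi(:,2) - ypi(:,1);
fprintf('Average CI width: %.3f, average PI width: %.3f\n', ...
    mean(ciWidth), mean(piWidth));
```

The prediction interval is wider at every x because it accounts for the error term of the new observation in addition to the uncertainty in the estimated line.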
Teaching Notes and Tips
The Using Regression Models to Make Predictions Live Script explores confidence intervals and prediction intervals for simple linear regression models from a graphical perspective. Along the way, it introduces the student to the fitlm(X,y) command for creating a linear regression model and the predict(mdl,Xnew,Name,Value) command for predicting the response from that model. The parameters used to illustrate the concepts were chosen arbitrarily and can easily be modified for any situation where the true linear regression model is known.
For reproducibility in support of the in-class demonstrations, the Using Regression Models to Make Predictions Live Script uses the rng(seed, generator) command to control the random generation of the sample data. For the Further Exploration activities, the instructor/student will need to remove (or comment out) these lines in the code.
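The effect of seeding can be seen in a short sketch. The seed 1 and the 'twister' generator are hypothetical choices; the values used in the Live Script may differ.

```matlab
rng(1,'twister')     % hypothetical seed and generator; fixes the random stream
a = randn(3,1);

rng(1,'twister')     % resetting to the same seed reproduces the same draws
b = randn(3,1);

disp(isequal(a,b))   % displays 1 (true): both runs produced identical samples

% For the Further Exploration activities, comment out the rng calls
% (or use rng('shuffle')) so that each run produces a different sample.
```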
The Using Regression Models to Make Predictions Live Script was intentionally designed to support individual exploration, as well as collective exploration. At the undergraduate level, I would recommend walking through the Live Script together as an in-class activity. At a post-graduate level, I encourage students to explore the Live Script individually prior to coming to class and then we collectively discuss the observations during class.
Assessment
As a stand-alone document, the Using Regression Models to Make Predictions Live Script is intended to serve as an exploratory in-class activity and is not directly assessed. However, students are expected to recall this information in order to answer conceptual questions about the relationship between confidence intervals and prediction intervals for simple linear regression on the final exam.
Additionally, on the computational portion of the final exam, students are expected to use bivariate data to fit a linear model, use the model to make predictions, and then describe the corresponding confidence interval and/or prediction interval based on whether they are predicting a mean response or a new observation, respectively.