Predicting Coronary Heart Disease Using a Machine Learning Model
Summary
This activity introduces students to supervised learning methods and basic data science techniques using MATLAB. Students predict coronary heart disease (CHD) with logistic regression on the Framingham Heart Study dataset, working through data preprocessing, model development, evaluation, and visualization of results. The data is split into training and test sets, and accuracy is evaluated with a confusion matrix displayed as a heatmap with a custom colormap.
Learning Goals
Concepts and Content Learned:
Logistic regression: Students will learn how this algorithm predicts binary outcomes like disease presence.
Data preprocessing: Handling missing data, normalizing features, and splitting data for training and testing.
Model evaluation: Using accuracy metrics and interpreting confusion matrices to assess model performance.
Data visualization: Creating bar charts and heatmaps for clearer data insights.
MATLAB Utilization and Its Impact:
MATLAB simplifies tasks like data processing, model building, and visualization.
It helps students apply machine learning concepts practically, reinforcing their learning with real-world data.
Visualization tools allow students to immediately interpret results, improving their understanding of model performance.
Higher-Order Thinking Skills Developed:
Critical thinking: Deciding how to preprocess data and evaluate model accuracy.
Data analysis and problem-solving: Students interpret results and troubleshoot model issues.
Model development: Building and testing a machine learning model fosters deeper learning.
Other Skills Developed:
MATLAB proficiency: Enhancing computational skills through hands-on coding.
Collaboration: Working in pairs builds teamwork and communication skills.
Context for Use
This activity is designed for undergraduate students or beginners exploring machine learning. It works well with small groups of 1 to 2 students and can be done in a lab setting or during a classroom session.
Time required: The activity can be completed in 40 to 70 minutes, depending on students' skills.
Student grouping: Students can work individually or in pairs.
Technical skills: Basic familiarity with MATLAB, including arrays, tables, and basic coding, is needed. No prior machine learning experience is required.
Course context: This activity is an introductory exercise in a machine learning, data science, or programming course. It fits well early in the course when students are being introduced to applied machine learning concepts.
Adaptability: The activity is flexible and could be adapted for other programming environments like Python, making it easy to use in various settings.
Description and Teaching Materials
In this activity, students will learn how to use logistic regression to predict the 10-year risk of coronary heart disease (CHD). The activity focuses on implementing a machine learning model in MATLAB to analyze a medical dataset. Students will work through the key steps: data cleaning, a train-test split, model training, and evaluation with a confusion matrix and accuracy.
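As a reminder of what the model computes, logistic regression passes a linear combination of the predictors through the sigmoid function to obtain a probability. The notation below is the standard textbook form, not something taken from the activity files; the symbols and the 0.5 decision threshold are assumptions chosen for illustration.

```latex
P(\text{CHD} = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}},
\qquad
\hat{y} =
\begin{cases}
1 & \text{if } P(\text{CHD} = 1 \mid x) \ge 0.5 \\
0 & \text{otherwise}
\end{cases}
```

Here each x_i is one patient attribute (for example age or blood pressure) and the coefficients are estimated from the training data.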
Materials Needed:
1. Dataset: The dataset is from an ongoing cardiovascular study on the residents of Framingham, Massachusetts. It includes over 4,000 records and 15 attributes. It can be downloaded from Kaggle: [Framingham Heart Study dataset](https://www.kaggle.com/datasets/amanajmera1/framingham-heart-study-dataset).
2. Activity Files:
A MATLAB code file (shared as a .m file) that students will use to load, clean, and process the data, build the model, and evaluate results.
Pre-assigned readings on logistic regression and confusion matrices:
[IBM: Logistic Regression](https://www.ibm.com/topics/logistic-regression)
[GeeksforGeeks: Confusion Matrix](https://www.geeksforgeeks.org/confusion-matrix-machine-learning/)
MATLAB Usage:
MATLAB is used to:
Clean the dataset by removing missing values.
Normalize the data and split it into training and test sets.
Train the logistic regression model and evaluate it using a confusion matrix and accuracy metrics.
Visualize the results through bar charts and heatmaps for easier interpretation.
Although the activity could be performed using other software like Python or R, MATLAB is chosen due to its user-friendly syntax, built-in functions for model training, and robust visualization tools, making it an excellent choice for students learning machine learning basics.
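The steps listed above can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the activity's distributed .m file: it uses a small synthetic table (hypothetical columns `age`, `sysBP`, `chd`) so that it runs without the download, and it assumes an 80/20 hold-out split and a 0.5 threshold. With the real data, the synthetic block would be replaced by a `readtable` call on the downloaded CSV. The functions `fitglm`, `cvpartition`, and `confusionmat` require the Statistics and Machine Learning Toolbox.

```matlab
% Sketch: clean -> normalize -> split -> train -> evaluate.
% Synthetic stand-in data (illustrative only, not Framingham records).
rng(42);                                   % reproducible random numbers
n = 400;
age   = 40 + 30*rand(n,1);
sysBP = 100 + 60*rand(n,1);
p     = 1 ./ (1 + exp(-(0.08*(age-55) + 0.03*(sysBP-130))));
chd   = double(rand(n,1) < p);             % synthetic binary outcome
sysBP(1:10) = NaN;                         % inject some missing values
data = table(age, sysBP, chd);

% 1. Clean: drop every row that contains a missing value
data = rmmissing(data);

% 2. Normalize: z-score each predictor column
predictors = {'age','sysBP'};
data{:,predictors} = normalize(data{:,predictors});

% 3. Split: hold out 20% of the rows for testing (assumed ratio)
cv = cvpartition(height(data), 'HoldOut', 0.2);
trainData = data(training(cv), :);
testData  = data(test(cv), :);

% 4. Train: logistic regression is a GLM with a binomial distribution
mdl = fitglm(trainData, 'chd ~ age + sysBP', 'Distribution', 'binomial');

% 5. Evaluate: threshold predicted probabilities at 0.5 (assumed)
probs      = predict(mdl, testData);
predLabels = double(probs >= 0.5);
C = confusionmat(testData.chd, predLabels, 'Order', [0 1]);
accuracy = sum(diag(C)) / sum(C(:));
fprintf('Test accuracy: %.2f\n', accuracy);
```

The sketch normalizes before splitting, in the order the steps are listed. A stricter variant would compute the mean and standard deviation on the training rows only and apply them to the test rows, so that no test-set information influences training.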
Key Components of the Activity:
Data gathering and cleaning: Students will clean the dataset by removing missing values.
Model building and evaluation: Students will train and evaluate a logistic regression model using a confusion matrix and accuracy.
Extension (Bonus): Students can extend the code by allowing user input for risk prediction based on model attributes.
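One possible shape for the bonus extension is sketched here. It is a hedged illustration, not the expected solution: the model is trained on synthetic data, the column names `age`, `cigsPerDay`, and `TenYearCHD` are assumed to mirror the dataset's headers, and the user's values are hard-coded so the script runs non-interactively. In a live session they could be collected with `input`, as the comment shows.

```matlab
% Train a small stand-in model on synthetic data (illustrative only)
rng(1);
n = 300;
age        = 35 + 35*rand(n,1);
cigsPerDay = randi([0 30], n, 1);
p          = 1 ./ (1 + exp(-(-6 + 0.08*age + 0.05*cigsPerDay)));
TenYearCHD = double(rand(n,1) < p);
trainTbl   = table(age, cigsPerDay, TenYearCHD);
mdl = fitglm(trainTbl, 'TenYearCHD ~ age + cigsPerDay', ...
    'Distribution', 'binomial');

% In an interactive session these values could come from input(), e.g.
%   userAge = input('Enter age in years: ');
userAge  = 55;
userCigs = 10;

% The new observation must use the same variable names as the training table
newPatient = table(userAge, userCigs, 'VariableNames', {'age','cigsPerDay'});
risk = predict(mdl, newPatient);           % predicted probability of CHD
fprintf('Estimated 10-year CHD risk: %.1f%%\n', 100*risk);
```

If the model was trained on normalized features, the user's values need the same transformation (using the training mean and standard deviation) before calling `predict`; otherwise the predicted risk will be meaningless.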
This hands-on exercise provides students with practical experience in using MATLAB for machine learning tasks, developing skills in model development, data analysis, and prediction.
Matlab Code (Matlab File 2kB Oct10 24)
Activity Instruction Doc (Microsoft Word 2007 (.docx) 288kB Oct10 24)
Teaching Notes and Tips
Familiarize Students with MATLAB:
Encourage students to explore the MATLAB environment and its basic functionalities before starting the activity. This could include understanding how to navigate the interface, load data, and run scripts.
Provide resources for students to learn basic MATLAB commands, such as indexing arrays, using tables, and handling missing data.
Dataset Understanding:
Emphasize the importance of understanding the dataset. Discuss the meaning of each feature in the Framingham dataset and how it relates to coronary heart disease.
Discuss the significance of missing values and how they can impact model performance.
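A tiny example can make the effect of missing values concrete before students touch the full dataset. The table below is made up for illustration (the column names `age`, `totChol`, and `BMI` are assumed to resemble the dataset's headers); it shows how to count missing entries per column and how to see which rows `rmmissing` removes.

```matlab
% Small made-up table with two missing entries (NaN)
T = table([39; 46; 48; 61], [195; NaN; 245; 225], [26.9; 28.7; NaN; 28.6], ...
    'VariableNames', {'age','totChol','BMI'});

% Count missing values in each column
missingPerColumn = sum(ismissing(T));

% Second output is a logical vector flagging the rows that were removed
[Tclean, removedRows] = rmmissing(T);

disp(missingPerColumn);
fprintf('Rows before: %d, after: %d\n', height(T), height(Tclean));
disp(find(removedRows)');                  % indices of the removed rows
```

Comparing the row counts before and after cleaning is a quick way for students to check how much data was discarded, which matters when many rows each have only one missing field.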
Common Areas of Confusion:
Data Cleaning: Students may struggle with identifying and handling missing values. Reinforce the importance of cleaning the data and provide examples of how to interpret the results of rmmissing().
Model Evaluation: Clarify the interpretation of the confusion matrix, particularly what true positives, true negatives, false positives, and false negatives mean in the context of predicting heart disease.
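To ground the four confusion-matrix cells, a short worked example with made-up labels may help. The eight "patients" below are hypothetical; the point is only to show which cell of the `confusionmat` output corresponds to which outcome, and how accuracy, sensitivity, and specificity follow from the counts.

```matlab
% Made-up labels for eight patients: 1 = CHD within 10 years, 0 = no CHD
actual    = [1 0 0 1 0 1 0 0]';
predicted = [1 0 1 0 0 1 0 0]';

% Rows are actual classes, columns are predicted classes, ordered [0 1]
C = confusionmat(actual, predicted, 'Order', [0 1]);

TN = C(1,1);   % no CHD, predicted no CHD
FP = C(1,2);   % no CHD, predicted CHD (false alarm)
FN = C(2,1);   % CHD, predicted no CHD (missed case)
TP = C(2,2);   % CHD, predicted CHD

accuracy    = (TP + TN) / sum(C(:));   % 6 of 8 correct = 0.75
sensitivity = TP / (TP + FN);          % share of actual CHD cases detected
specificity = TN / (TN + FP);          % share of non-CHD cases correctly cleared
```

In a screening context a false negative (a missed CHD case) is usually the costlier error, so it is worth asking students to look at sensitivity and not only at overall accuracy.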
MATLAB Tips:
Encourage students to use comments in their code to document their thought process and improve readability. This will help them and others understand their code later.
Remind students to save their work frequently and utilize MATLAB's built-in functions for debugging if they encounter errors.
Extending the Activity:
After students complete the primary activity, encourage them to modify the code to experiment with different features or other models (e.g., decision trees) to see how the results change.
Suggest adding a function to the code that allows for user input, where they can enter their health attributes and receive a prediction about their risk of CHD.
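For the suggestion to try other models, a side-by-side comparison could look like the sketch below. It is an assumption-laden illustration: the data are synthetic, the columns `age`, `sysBP`, and `chd` are hypothetical, and `fitctree` is used with its default settings. Both models are trained on the same rows and scored on the same held-out rows.

```matlab
% Synthetic stand-in data (illustrative only)
rng(7);
n = 500;
age   = 35 + 35*rand(n,1);
sysBP = 100 + 80*rand(n,1);
p     = 1 ./ (1 + exp(-(-9 + 0.08*age + 0.03*sysBP)));
chd   = double(rand(n,1) < p);
data  = table(age, sysBP, chd);

% Same 80/20 hold-out split for both models
cv = cvpartition(n, 'HoldOut', 0.2);
trainData = data(training(cv), :);
testData  = data(test(cv), :);

% Logistic regression and a default decision tree
glmModel  = fitglm(trainData, 'chd ~ age + sysBP', 'Distribution', 'binomial');
treeModel = fitctree(trainData, 'chd');    % response given by variable name

glmPred  = double(predict(glmModel, testData) >= 0.5);  % threshold probabilities
treePred = predict(treeModel, testData);                % class labels directly

glmAcc  = mean(glmPred  == testData.chd);
treeAcc = mean(treePred == testData.chd);
fprintf('Logistic regression: %.2f, decision tree: %.2f\n', glmAcc, treeAcc);
```

Note that `predict` returns probabilities for the `fitglm` model but class labels for the tree, a difference that is easy to overlook when swapping models.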
Practical Considerations:
Discuss ethical considerations in using medical data and the importance of patient confidentiality and data protection.
If conducting this activity in a classroom or lab setting, ensure that students have access to MATLAB on their devices or provide a lab with MATLAB installed.
Assessment
Deliverables for this activity include:
1. A written report summarizing the activity performed and the compiled results.
2. A MATLAB script (M-file) that successfully loads and preprocesses the dataset, builds the logistic regression model, and visualizes the results.
3. A heatmap figure generated from the confusion matrix, with appropriate labels and annotations.
4. Extended code that accepts user inputs and predicts risk, along with output demonstrating that it runs.
Each deliverable is manually graded based on code functionality, accuracy, and the quality of explanations provided.
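For the heatmap deliverable, a labeled figure with a custom colormap might be produced along these lines. The counts in `C` are made up for illustration, and the white-to-red color ramp is just one possible choice; the activity's own code may differ.

```matlab
% Made-up confusion matrix counts (rows: actual, columns: predicted)
C = [620 12; 98 9];
classNames = {'No CHD', 'CHD'};

% heatmap(xvalues, yvalues, cdata) prints each count inside its cell
h = heatmap(classNames, classNames, C);
h.XLabel = 'Predicted class';
h.YLabel = 'Actual class';
h.Title  = 'Confusion matrix (test set)';

% Custom colormap: 64 RGB rows interpolated from white to dark red
nColors = 64;
h.Colormap = [linspace(1, 0.6, nColors)', ...
              linspace(1, 0.0, nColors)', ...
              linspace(1, 0.0, nColors)'];
```

A light-to-dark ramp keeps the cell counts readable and makes the largest cell stand out, which helps when grading the figure's labels and annotations.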
References and Resources
Framingham Heart Study Dataset (available online)
MATLAB documentation for fitglm and data visualization functions
Relevant chapters from the textbook: Essential MATLAB for Scientists and Engineers by Hahn and Valentine
[IBM: Logistic Regression](https://www.ibm.com/topics/logistic-regression)
[GeeksforGeeks: Confusion Matrix](https://www.geeksforgeeks.org/confusion-matrix-machine-learning/)