Detecting Phishing Emails Using a Machine Learning Model

Hongmei Chi, Florida Agricultural and Mechanical University
Initial Publication Date: October 14, 2024

Summary

In this hands-on lab, participants will explore the application of machine learning techniques to detect phishing emails using a dataset sourced from Kaggle. The lab will guide attendees through the entire process, from data preprocessing to model training and evaluation.


Learning Goals

Dataset Overview: Introduction to the Kaggle phishing email dataset, including its structure and key features.
Data Preprocessing: Techniques for cleaning and preparing the data, such as handling missing values, normalizing text data, and extracting features.
Model Selection: Discussion of machine learning algorithms suitable for classification tasks, including logistic regression, decision trees, and support vector machines.
Training and Testing: Step-by-step instructions for splitting the dataset into training and testing sets, followed by training the selected model.
Evaluation Metrics: Introduction to metrics such as accuracy, precision, recall, and F1-score for assessing model performance.
Visualization: Using MATLAB's visualization tools to display results and insights from the model's predictions.

By the end of the lab, participants will have hands-on experience in developing a machine learning model to classify phishing emails, equipping them with practical skills applicable in cybersecurity and data science.

Context for Use

This hands-on lab is intended for graduate students in cybersecurity or AI courses.

Description and Teaching Materials

[1] Kaggle dataset: https://www.kaggle.com/datasets/shashwatwork/phishing-dataset-for-machine-learning
[2] Tangkere, B. B. (2024). Analisis Performa Logistic Regression dan Support Vector Classification untuk Klasifikasi Email Phising. Jurnal Ekonomi Manajemen Sistem Informasi, 5(4), 442-450.
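
As a starting point for the workflow on dataset [1], the following MATLAB sketch walks through handling missing values, splitting, normalizing, and fitting a logistic regression model. It is a minimal illustration rather than the lab's reference solution. It runs on a small synthetic table so that it is self-contained, and the variable names (f1, f2, label), the 70/30 split, and the 0.5 decision threshold are all assumptions. It requires the Statistics and Machine Learning Toolbox. To work with the Kaggle data, replace the synthetic table with a readtable call on the downloaded CSV file and substitute the file's actual feature and label columns.

```matlab
% Minimal sketch: preprocess, split, and train a logistic regression model.
% Assumption: synthetic stand-in data; replace with readtable on the real CSV.
rng(42);                                   % reproducible random numbers
n = 200;
label = [zeros(n/2, 1); ones(n/2, 1)];     % 0 = legitimate, 1 = phishing
f1 = randn(n, 1) + 1.5 * label;            % hypothetical numeric feature
f2 = randn(n, 1) - 1.0 * label;            % hypothetical numeric feature
f1(5) = NaN;                               % inject one missing value
data = table(f1, f2, label);

% Handle missing values by dropping incomplete rows.
data = rmmissing(data);

% Hold out 30% of the rows for testing, stratified by class label.
cv = cvpartition(data.label, 'HoldOut', 0.3);
trainData = data(training(cv), :);
testData  = data(test(cv), :);

% Normalize features using training-set statistics only, so that no
% information from the test set leaks into the model.
featureNames = {'f1', 'f2'};
mu    = mean(trainData{:, featureNames});
sigma = std(trainData{:, featureNames});
trainData{:, featureNames} = (trainData{:, featureNames} - mu) ./ sigma;
testData{:, featureNames}  = (testData{:, featureNames} - mu) ./ sigma;

% Fit logistic regression (binomial GLM with the default logit link).
mdl = fitglm(trainData, 'label ~ f1 + f2', 'Distribution', 'binomial');

% Predicted phishing probabilities, thresholded at 0.5 (an assumption).
scores = predict(mdl, testData);
predicted = double(scores >= 0.5);
testAccuracy = mean(predicted == testData.label);
fprintf('Test accuracy: %.3f\n', testAccuracy);
```

Dropping incomplete rows is the simplest way to handle missing values; fillmissing is an alternative when too many rows would be lost.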




Teaching Notes and Tips

Logistic regression: Students will learn how this algorithm predicts binary outcomes, such as whether an email is phishing or legitimate.
Data preprocessing: Handling missing data, normalizing features, and splitting data for training and testing.
Model evaluation: Using accuracy metrics and interpreting confusion matrices to assess model performance.
Data visualization: Creating bar charts and heatmaps for clearer data insights.
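
To make the evaluation step concrete, the sketch below derives a confusion matrix and the four common metrics from true and predicted labels using base MATLAB only. The label vectors are toy data chosen for illustration, and the encoding (1 = phishing as the positive class, 0 = legitimate) is an assumption.

```matlab
% Minimal sketch: evaluation metrics from true and predicted labels.
% Assumption: 1 = phishing (positive class), 0 = legitimate; toy data only.
actual    = [1 1 1 1 0 0 0 0 0 0]';
predicted = [1 1 1 0 0 0 0 0 1 1]';

tp = sum(actual == 1 & predicted == 1);    % phishing correctly flagged
fn = sum(actual == 1 & predicted == 0);    % phishing missed
fp = sum(actual == 0 & predicted == 1);    % legitimate wrongly flagged
tn = sum(actual == 0 & predicted == 0);    % legitimate correctly passed

% Confusion matrix: rows are actual classes, columns are predicted classes,
% both ordered [legitimate, phishing].
confMat = [tn fp; fn tp];

accuracy  = (tp + tn) / numel(actual);
precision = tp / (tp + fp);
recall    = tp / (tp + fn);
f1        = 2 * precision * recall / (precision + recall);

fprintf('Accuracy %.2f, precision %.2f, recall %.2f, F1 %.2f\n', ...
    accuracy, precision, recall, f1);
```

With these toy labels the counts are TP = 3, FN = 1, FP = 2, and TN = 4, giving accuracy 0.70, precision 0.60, recall 0.75, and F1 of about 0.67. In phishing detection, recall is often watched closely because a missed phishing message (a false negative) is usually more costly than a false alarm.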
MATLAB simplifies tasks such as data processing, model building, and visualization.
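
As one possible illustration of the bar chart and heatmap mentioned above, the sketch below plots a set of metrics and a confusion matrix. The numbers are illustrative placeholders, not results from this lab, and the class names and row/column ordering are assumptions.

```matlab
% Minimal sketch: bar chart of metrics and heatmap of a confusion matrix.
% Assumption: the numbers below are illustrative, not results from the lab.
metricNames  = {'Accuracy', 'Precision', 'Recall', 'F1'};
metricValues = [0.70 0.60 0.75 0.67];
confMat      = [4 2; 1 3];                 % rows: actual, columns: predicted
classNames   = {'Legitimate', 'Phishing'};

% Bar chart with one bar per metric, on a fixed 0-to-1 scale.
figure;
bar(metricValues);
set(gca, 'XTick', 1:numel(metricNames), 'XTickLabel', metricNames);
ylim([0 1]);
ylabel('Score');
title('Classifier metrics (illustrative values)');

% Heatmap: x-values label the columns, y-values label the rows.
figure;
h = heatmap(classNames, classNames, confMat);
h.XLabel = 'Predicted class';
h.YLabel = 'Actual class';
h.Title  = 'Confusion matrix (illustrative values)';
```

Students with the Statistics and Machine Learning Toolbox can also pass true and predicted labels directly to confusionchart, which builds a similar display without computing the matrix by hand.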

Assessment

1. A written lab report presenting the results.
2. A MATLAB script (M-file) that successfully loads and preprocesses the dataset, builds the logistic regression model, and visualizes the results.
Each deliverable is manually graded based on code functionality, accuracy, and the quality of explanations provided.

References and Resources