Spam Filtering Using Machine Learning Techniques Implemented With Decision Tree Classifier

5 Chapters | 39 Pages | 5,850 Words

Spam filtering with machine learning, here using a Decision Tree Classifier, means building a system that distinguishes unsolicited from legitimate emails based on features extracted from email content and metadata. A decision tree classifies each email as spam or non-spam by hierarchically partitioning the feature space. Trained on a labeled dataset of spam and non-spam emails, the model learns patterns indicative of spam, such as specific keywords, email addresses, or structural characteristics. Through iterative refinement and tuning, the classifier becomes better at identifying and filtering out unwanted messages, which improves email security and user experience.
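To make "partitioning the feature space" concrete, the sketch below scores candidate keyword splits by information gain, the entropy reduction a split achieves. It is a minimal illustration, not the study's implementation: the toy emails, the keyword test, and the function names are all hypothetical. C4.5, the tree algorithm named in the abstract, ranks splits by gain ratio, a normalized form of this same quantity.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(emails, labels, keyword):
    """Entropy reduction from splitting on 'does the email contain keyword?'."""
    with_kw = [y for text, y in zip(emails, labels) if keyword in text.lower()]
    without_kw = [y for text, y in zip(emails, labels) if keyword not in text.lower()]
    total = len(labels)
    # Weighted entropy of the two branches; empty branches contribute nothing.
    remainder = sum(
        len(part) / total * entropy(part) for part in (with_kw, without_kw) if part
    )
    return entropy(labels) - remainder


# Hypothetical toy data, invented for illustration only.
emails = [
    "Win a free prize now",
    "Free money, click here",
    "Meeting moved to 3pm",
    "Lunch tomorrow?",
]
labels = ["spam", "spam", "ham", "ham"]

# A tree would split first on the candidate with the largest gain.
gains = {kw: information_gain(emails, labels, kw) for kw in ("free", "meeting")}
best_keyword = max(gains, key=gains.get)
print(gains, best_keyword)
```

On this toy data the keyword "free" separates the two classes perfectly, so it would be chosen as the first split; "meeting" leaves one branch mixed and scores lower.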

ABSTRACT

A spam filter is a program that detects unsolicited and unwanted email and prevents those messages from reaching a user’s inbox. E-mail spam, also known as unsolicited bulk email (UBE), junk mail, or unsolicited commercial email (UCE), is the practice of sending unwanted e-mail messages, frequently with commercial content, in large quantities to an indiscriminate set of recipients. Spam is prevalent on the Internet because the transaction cost of electronic communication is radically lower than that of any alternative form of communication. Many spam filters exist, using different approaches to identify an incoming message as spam: whitelists and blacklists, Bayesian analysis, keyword matching, mail header analysis, postage, legislation, and content scanning. Even so, we are still flooded with spam emails every day. This is not because the filters are not powerful enough; it is due to the swift adoption of new techniques by spammers and the inability of spam filters to adapt to the changes. In our work, we employed supervised machine learning techniques to filter spam email messages. Widely used supervised machine learning techniques, namely the C4.5 decision tree classifier, Multilayer Perceptron, and Naïve Bayes classifier, are used to learn the features of spam emails, and the model is built by training on known spam and legitimate emails. The results of the models are discussed.

TABLE OF CONTENTS

COVER PAGE
TITLE PAGE
APPROVAL PAGE
DEDICATION
ACKNOWLEDGEMENT
ABSTRACT

CHAPTER ONE
1.0 INTRODUCTION
1.1 BACKGROUND OF THE STUDY
1.2 PROBLEM STATEMENT
1.3 AIM AND OBJECTIVES OF THE STUDY
1.4 SCOPE OF THE STUDY
1.5 SIGNIFICANCE OF THE STUDY
1.6 MOTIVATION OF THE STUDY
1.7 BENEFIT OF THE STUDY

CHAPTER TWO
LITERATURE REVIEW
2.1 OVERVIEW OF SPAM FILTERING
2.2 TYPES OF SPAM FILTERS
2.3 REVIEW OF SPAM FILTERING METHODS
2.4 INBOUND AND OUTBOUND FILTERING
2.5 REVIEW OF RELATED STUDIES
2.6 SPAM FILTER ARCHITECTURE AND METHODS

CHAPTER THREE
METHODOLOGY
3.1 INTRODUCTION
3.2 SOURCES OF DATA
3.3 DATA COLLECTION PROCESS
3.4 TREE CLASSIFIER MODEL
3.5 FEATURE EXTRACTION

CHAPTER FOUR
4.1 EXPERIMENT AND RESULT

CHAPTER FIVE
5.1 CONCLUSION
5.2 RECOMMENDATION
REFERENCES

CHAPTER ONE

1.0 INTRODUCTION
1.1 BACKGROUND OF THE STUDY
The Internet has become an integral part of everyday life, and e-mail has become a powerful tool for information exchange. Along with the growth of the Internet and e-mail, there has been a dramatic growth in spam in recent years. Spam can originate from any location across the globe where Internet access is available. Despite the development of anti-spam services and technologies, the number of spam messages continues to increase rapidly (Gaurav et al., 2020). To address this growing problem, each organization must analyze the tools available to determine how best to counter spam in its environment. Tools such as the corporate e-mail system, e-mail filtering gateways, contracted anti-spam services, and end-user training provide an important arsenal for any organization. However, users cannot avoid the very serious problem of dealing with large amounts of spam on a regular basis (Ablel-Rheem et al., 2020). Without anti-spam measures, spam will inundate network systems, kill employee productivity, steal bandwidth, and still be there tomorrow.
It is impossible to tell exactly who first came upon the simple idea that if you send an advertisement to millions of people, at least one person will react to it, no matter what the proposal is. E-mail provides a perfect way to send these millions of advertisements at no cost to the sender, and this unfortunate fact is nowadays extensively exploited by several organizations. As a result, the mailboxes of millions of people get cluttered with this so-called unsolicited bulk e-mail, also known as “spam” or “junk mail”. Being incredibly cheap to send, spam causes a lot of trouble for the Internet community: large amounts of spam traffic between servers delay the delivery of legitimate e-mail, and people with dial-up Internet access have to spend bandwidth downloading junk mail (Ablel-Rheem et al., 2020). Sorting out the unwanted messages takes time and introduces a risk of deleting normal mail by mistake. Finally, there is a considerable amount of pornographic spam that should not be exposed to children.

1.2 PROBLEM STATEMENT
Spam email is a nuisance that clogs up inboxes and overloads servers. Spam is also dangerous: it is an entry point for serious attacks that could damage your computers, your computer network, your bottom line, and even your company’s reputation. That is why a spam filter is needed as the key first line of defense.
The spam problem is ongoing; in 2018, 14.5 billion spam e-mails were sent per day (Bauer, 2018). According to the Internet Security Threat Report released in 2019 by Symantec, spam levels for their customers increased in 2018 (Symantec Internet Security Threat Report, 2019). Notably, small enterprises were attacked more often than large companies, and e-mail malware reached stable levels. Therefore, even simple tools for detecting and filtering spam need to be tailored for all organizations. The increasing number of spam e-mails has created a strong need to develop more reliable and efficient anti-spam filters, including ones based on machine-learning tools, such as a machine-learning spam filter designed with a tree classifier (Dada et al., 2019). A tree classifier is used here because it is easy to implement and easy to understand. It provides satisfactory overall performance for spam mail detection.

1.3 AIM AND OBJECTIVES OF THE STUDY
The main aim of this work is to design a machine learning-based approach for spam filtering that is implemented with a tree classifier. The objectives of the work are:
i. To design the system model using a tree classifier.
ii. To detect unsolicited and unwanted email and prevent those messages from getting to a user’s inbox.
iii. To test a program that successfully detects and filters spam emails and improves Internet security.

1.4 SCOPE OF THE STUDY
The scope of this work covers fast and precise selection of machine-learning tree classifiers for spam filtering. The approach allows quick selection of relevant parameters, which is essential when working with large datasets.

1.5 SIGNIFICANCE OF THE STUDY
A business email system without spam filtering is highly vulnerable, if not unusable. It is important to stop as much spam as you can to protect your network from the many possible risks: viruses, phishing attacks, compromised web links, and other malicious content.
Spam filters also protect your servers from being overloaded with non-essential emails, and the worse problem of being infected with spam software that may turn them into spam servers themselves.
This study will provide a means of identifying the best-performing machine-learning-based classifiers and selecting the one with the leading parameters. The system allows quick analysis of high-dimensional data. This is especially important when large datasets are to be analyzed and the system must scale properly. The study will also show how to find a database for training a machine-learning model used for spam detection.

1.6 MOTIVATION OF THE STUDY
Common uses for mail filters include organizing incoming email and removal of spam and computer viruses. A less common use is to inspect outgoing email at some companies to ensure that employees comply with appropriate policies and laws. Users might also employ a mail filter to prioritize messages, and to sort them into folders based on subject matter or other criteria.

1.7 BENEFIT OF THE STUDY
Spam filters detect unsolicited, unwanted, and virus-infested email (called spam) and stop it from getting into email inboxes. Internet Service Providers (ISPs) use spam filters to make sure they aren’t distributing spam. Small- to medium-sized businesses (SMBs) also use spam filters to protect their employees and networks.
Spam filters are applied to both inbound email (email entering the network) and outbound email (email leaving the network). ISPs use both methods to protect their customers. SMBs typically focus on inbound filters.

MORE DESCRIPTION:

Here’s a brief guide on how you might implement spam filtering using a Decision Tree Classifier:

  1. Data Collection: Gather a dataset containing examples of both spam and non-spam (ham) emails. This dataset should be labeled, meaning each email is tagged as either spam or ham.
  2. Data Preprocessing: Clean and preprocess your data. This involves tasks like removing irrelevant characters, converting text to lowercase, and handling missing data.
  3. Feature Extraction: Transform your email data into a format suitable for machine learning. This often involves converting text into numerical vectors using techniques like TF-IDF (Term Frequency-Inverse Document Frequency).
  4. Split the Dataset: Divide your dataset into training and testing sets. The training set is used to train the model, while the testing set is used to evaluate its performance.
  5. Decision Tree Training: Train your Decision Tree Classifier on the training set. The model will learn patterns in the data that distinguish between spam and ham emails.
  6. Model Evaluation: Evaluate the performance of your model using the testing set. Common metrics include accuracy, precision, recall, and F1 score.
  7. Tuning and Optimization: Adjust hyperparameters and features to improve your model’s performance. This might involve pruning the tree, tweaking the tree depth, or trying different feature extraction techniques.
  8. Testing on New Data: Once you’re satisfied with your model’s performance, you can use it to classify new, unseen emails as spam or ham.
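
The steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's implementation: it assumes the scikit-learn library (which the text does not name), and the tiny labeled dataset is invented. Note also that scikit-learn's DecisionTreeClassifier is a CART-style tree rather than the C4.5 algorithm mentioned in the abstract.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Step 1: hypothetical labeled data, invented for illustration only.
emails = [
    "Win a free prize now", "Free money, click here",
    "Cheap loans approved instantly", "Claim your lottery winnings today",
    "Exclusive offer: buy one get one free", "You have won a free cruise",
    "Meeting moved to 3pm", "Lunch tomorrow?",
    "Please review the attached report", "Notes from today's project call",
    "Can you send me the invoice?", "Reminder: dentist appointment on Friday",
]
labels = ["spam"] * 6 + ["ham"] * 6

# Step 4: hold out a stratified test set.
X_train, X_test, y_train, y_test = train_test_split(
    emails, labels, test_size=0.25, stratify=labels, random_state=0
)

# Steps 2, 3 and 5: lowercase the text, build TF-IDF vectors, train the tree.
# Wrapping both in one pipeline fits the vectorizer on the training set only.
model = make_pipeline(
    TfidfVectorizer(lowercase=True, stop_words="english"),
    DecisionTreeClassifier(max_depth=5, random_state=0),
)
model.fit(X_train, y_train)

# Step 6: evaluate on the held-out emails, treating "spam" as the positive class.
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_test, y_pred, pos_label="spam", average="binary", zero_division=0
)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")

# Step 8: classify a new, unseen email.
new_prediction = model.predict(["Claim your free prize today"])[0]
print(new_prediction)
```

With only a dozen emails the scores carry no meaning; a real experiment would use a labeled corpus of thousands of messages.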

Remember that this is a high-level overview, and there are many details to consider at each step. Additionally, you might want to explore other machine learning algorithms or ensemble methods to see what works best for your specific dataset.
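
As one possible way to carry out the tuning step and to compare against another algorithm, the sketch below searches over tree depth and pruning strength with cross-validation, then scores a Naïve Bayes classifier on the same data. It again assumes scikit-learn and uses an invented toy dataset; the grid values are arbitrary starting points, not recommendations from the study.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Hypothetical labeled data, invented for illustration only.
emails = [
    "Win a free prize now", "Free money, click here",
    "Cheap loans approved instantly", "Claim your lottery winnings today",
    "Exclusive offer: buy one get one free", "You have won a free cruise",
    "Meeting moved to 3pm", "Lunch tomorrow?",
    "Please review the attached report", "Notes from today's project call",
    "Can you send me the invoice?", "Reminder: dentist appointment on Friday",
]
labels = ["spam"] * 6 + ["ham"] * 6

tree_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("tree", DecisionTreeClassifier(random_state=0)),
])

# Candidate settings: depth limits and cost-complexity pruning strengths.
param_grid = {
    "tree__max_depth": [2, 4, None],
    "tree__ccp_alpha": [0.0, 0.01],
}
search = GridSearchCV(tree_pipeline, param_grid, cv=3, scoring="f1_macro")
search.fit(emails, labels)

# Score an alternative classifier with the same folds and metric.
nb_pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("nb", MultinomialNB())])
nb_scores = cross_val_score(nb_pipeline, emails, labels, cv=3, scoring="f1_macro")

print("best tree params:", search.best_params_)
print("tree F1 (macro): %.2f" % search.best_score_)
print("naive Bayes F1 (macro): %.2f" % nb_scores.mean())
```

Using the same cross-validation folds and metric for both models keeps the comparison fair; which model wins depends entirely on the dataset.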