
Spam Filtering Using Machine Learning Techniques

Spam filtering, employing machine learning techniques, refers to the process of automatically identifying and segregating unwanted or unsolicited messages from legitimate ones within a digital communication system, such as emails or text messages. Leveraging algorithms and models trained on vast datasets, this approach employs a variety of features and patterns to classify incoming messages as either spam or not spam. By analyzing factors like content, sender information, and metadata, machine learning algorithms such as decision trees, support vector machines, or neural networks can discern subtle cues indicative of spam, continuously adapting and refining their filtering capabilities over time. This iterative process enhances the accuracy and efficiency of spam detection, mitigating the inundation of unwanted messages and preserving the integrity of communication channels.

ABSTRACT

A spam filter is a program that detects unsolicited and unwanted email and prevents those messages from reaching a user’s inbox. E-mail spam, also known as unsolicited bulk email (UBE), junk mail, or unsolicited commercial email (UCE), is the practice of sending unwanted e-mail messages, frequently with commercial content, in large quantities to an indiscriminate set of recipients. Spam is prevalent on the Internet because the transaction cost of electronic communication is radically lower than that of any alternative form of communication. Many spam filters exist, using different approaches to identify an incoming message as spam: white lists and black lists, Bayesian analysis, keyword matching, mail header analysis, postage, legislation, and content scanning. Even so, we are still flooded with spam emails every day. This is not because the filters are not powerful enough; it is due to the swift adoption of new techniques by spammers and the inability of spam filters to adapt to the changes. In our work, we employed supervised machine learning techniques to filter email spam. Three widely used supervised machine learning techniques, namely the C4.5 decision tree classifier, the Multilayer Perceptron, and the Naïve Bayes classifier, are used to learn the features of spam emails, and each model is built by training on known spam emails and legitimate emails. The results of the models are discussed.

TABLE OF CONTENTS

COVER PAGE

TITLE PAGE

APPROVAL PAGE

DEDICATION

ACKNOWLEDGEMENT

ABSTRACT

CHAPTER ONE

1.0 INTRODUCTION

BACKGROUND OF THE STUDY
PROBLEM STATEMENT
OBJECTIVE OF THE STUDY
SCOPE OF THE STUDY
SIGNIFICANCE OF THE STUDY
MOTIVATION OF THE STUDY
BENEFIT OF THE STUDY
TYPES OF SPAM FILTERS

CHAPTER TWO

LITERATURE REVIEW

OVERVIEW OF SPAM FILTERING
REVIEW OF SPAM FILTERING METHODS
INBOUND AND OUTBOUND FILTERING
OVERVIEW OF SPAM FILTERING
SPAM FILTER ARCHITECTURE AND METHODS

CHAPTER THREE

METHODOLOGY

INTRODUCTION
STUDY SELECTION
SOURCES OF DATA
DATA COLLECTION PROCESS
COMPARISON CRITERION
FEATURE EXTRACTION

CHAPTER FOUR

EXPERIMENT AND RESULT

CHAPTER FIVE

CONCLUSION
RECOMMENDATION
REFERENCES

CHAPTER ONE

1.0 INTRODUCTION

1.1 BACKGROUND OF THE STUDY

The Internet has become an integral part of everyday life, and e-mail has become a powerful tool for information exchange. Along with the growth of the Internet and e-mail, there has been a dramatic growth in spam in recent years. Spam can originate from any location across the globe where Internet access is available. Despite the development of anti-spam services and technologies, the number of spam messages continues to increase rapidly. To address the growing problem, each organization must analyze the tools available to determine how best to counter spam in its environment. Tools such as the corporate e-mail system, e-mail filtering gateways, contracted anti-spam services, and end-user training provide an important arsenal for any organization. Even so, users still face the very serious problem of dealing with large amounts of spam on a regular basis. Without anti-spam measures, spam will inundate network systems, kill employee productivity, steal bandwidth, and still be there tomorrow.

It is impossible to tell exactly who first came upon the simple idea that if you send an advertisement to millions of people, at least one person will react to it, no matter what is being offered. E-mail provides a perfect way to send these millions of advertisements at almost no cost to the sender, and this unfortunate fact is nowadays extensively exploited by several organizations. As a result, the mailboxes of millions of people get cluttered with this so-called unsolicited bulk e-mail, also known as “spam” or “junk mail”. Being incredibly cheap to send, spam causes a lot of trouble for the Internet community: large amounts of spam traffic between servers delay the delivery of legitimate e-mail, and people with dial-up Internet access have to spend bandwidth downloading junk mail. Sorting out the unwanted messages takes time and introduces a risk of deleting normal mail by mistake. Finally, there is a considerable amount of pornographic spam that should not be exposed to children.

1.2 PROBLEM STATEMENT

Spam email is a nuisance that will clog up your employees’ inboxes and overload your servers. Spam is also dangerous: it is the entry point for serious attacks that could damage your computers, your computer network, your bottom line, and even your company’s reputation. That is why there is a need for a spam filter solution as the key first line of defense.

1.3 AIM OF THE STUDY

A spam filter is a program that is used to detect unsolicited and unwanted email and prevent those messages from getting to a user’s inbox. The main aim of this work is to present an overview of the existing machine learning-based approaches for spam filtering.

1.4 SCOPE OF THE STUDY

It is estimated that 70 percent of all email sent globally is spam, and the volume of spam continues to grow because spam remains a lucrative business. Spammers get ever more sophisticated and creative in their tactics to get their messages into your inboxes and wreak their havoc. Spam filtering solutions must continually be updated to address this evolving threat.

Spam filters use “heuristic” methods, which means that each email message is checked against a large set of predefined rules (often thousands). Each rule that matches contributes a numerical score reflecting how likely the message is to be spam, and if the total score passes a certain threshold the email is flagged as spam and blocked from going further.
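The scoring scheme described above can be sketched in a few lines of Python. This is a minimal illustration only, not the implementation used in this study: the four rules, their weights, and the threshold of 3.0 are all hypothetical values chosen for the example, and a real filter would use far more rules.

```python
import re
from typing import Callable, List, Tuple

# A rule is (name, predicate, score). All rules and weights are hypothetical.
Rule = Tuple[str, Callable[[str], bool], float]

RULES: List[Rule] = [
    ("mentions_free_money", lambda m: "free money" in m.lower(), 2.5),
    ("many_exclamations", lambda m: m.count("!") >= 3, 1.0),
    ("all_caps_word", lambda m: re.search(r"\b[A-Z]{5,}\b", m) is not None, 1.5),
    ("contains_url", lambda m: re.search(r"https?://", m) is not None, 0.5),
]

THRESHOLD = 3.0  # assumed cut-off; real filters tune this value


def spam_score(message: str) -> float:
    """Sum the scores of every rule whose predicate matches the message."""
    return sum(score for _, matches, score in RULES if matches(message))


def is_spam(message: str) -> bool:
    """Flag the message once its total score reaches the threshold."""
    return spam_score(message) >= THRESHOLD


if __name__ == "__main__":
    for text in ("FREE MONEY!!! Click http://example.test now",
                 "Lunch at noon tomorrow?"):
        print(f"{spam_score(text):4.1f}  spam={is_spam(text)}  {text}")
```

Keeping each rule as a small independent predicate means rules can be added or re-weighted without touching the scoring logic, which matters because the rule set has to change as spammers change tactics.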

1.5 SIGNIFICANCE OF THE STUDY

A business email system without spam filtering is highly vulnerable, if not unusable. It is important to stop as much spam as you can, to protect your network from the many possible risks: viruses, phishing attacks, compromised web links and other malicious content.

Spam filters also protect your servers from being overloaded with non-essential emails, and the worse problem of being infected with spam software that may turn them into spam servers themselves.

By preventing spam email from reaching your employees’ mailboxes, spam filters give an additional layer of protection to your users, your network, and your business.

1.6 MOTIVATION OF THE STUDY

Common uses for mail filters include organizing incoming email and removal of spam and computer viruses. A less common use is to inspect outgoing email at some companies to ensure that employees comply with appropriate policies and laws. Users might also employ a mail filter to prioritize messages, and to sort them into folders based on subject matter or other criteria.

1.7 BENEFIT OF THE STUDY

Spam filters detect unsolicited, unwanted, and virus-infested email (called spam) and stop it from getting into email inboxes. Internet Service Providers (ISPs) use spam filters to make sure they aren’t distributing spam. Small- to medium- sized businesses (SMBs) also use spam filters to protect their employees and networks.

Spam filters are applied to both inbound email (email entering the network) and outbound email (email leaving the network). ISPs use both methods to protect their customers. SMBs typically focus on inbound filters.

1.8 TYPES OF SPAM FILTERS

There are different types of spam filters for different criteria:

Content filters – parse the content of messages, scanning for words that are commonly used in spam emails.

Header filters – examine the email header source to look for suspicious information.

Blacklist filters – stop emails that come from a blacklist of suspicious IP addresses. Some filters go further and check the IP reputation of the IP address.

Rules-based filters – apply customized rules designed by the organization to exclude emails from specific senders, or emails containing specific words in their subject line or body.
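The four filter types listed above can be illustrated with a short Python sketch. Everything specific in it is an assumption made for the example: the `Message` class, the word list, the blacklisted IP address (taken from a documentation-only range), the blocked sender and phrase, and the choice to treat a missing Message-ID header as suspicious.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Message:
    """Simplified stand-in for a parsed email (hypothetical structure)."""
    sender: str
    sender_ip: str
    subject: str
    body: str
    headers: Dict[str, str] = field(default_factory=dict)


# Illustrative data only; a real deployment would load these from configuration.
SPAM_WORDS = {"lottery", "winner", "prize"}
BLACKLISTED_IPS = {"203.0.113.7"}
BLOCKED_SENDERS = {"promo@bulk.example"}
BLOCKED_PHRASES = {"wire transfer"}


def content_filter(msg: Message) -> bool:
    """Content filter: scan the body for words common in spam."""
    words = set(re.findall(r"[a-z]+", msg.body.lower()))
    return bool(words & SPAM_WORDS)


def header_filter(msg: Message) -> bool:
    """Header filter: here, a missing Message-ID is assumed to be suspicious."""
    return "Message-ID" not in msg.headers


def blacklist_filter(msg: Message) -> bool:
    """Blacklist filter: stop mail from listed IP addresses."""
    return msg.sender_ip in BLACKLISTED_IPS


def rules_filter(msg: Message) -> bool:
    """Rules-based filter: organization-specific senders and phrases."""
    text = f"{msg.subject} {msg.body}".lower()
    return msg.sender in BLOCKED_SENDERS or any(p in text for p in BLOCKED_PHRASES)


FILTERS: Dict[str, Callable[[Message], bool]] = {
    "content": content_filter,
    "header": header_filter,
    "blacklist": blacklist_filter,
    "rules": rules_filter,
}


def triggered_filters(msg: Message) -> List[str]:
    """Return the names of all filters that flag the message."""
    return [name for name, check in FILTERS.items() if check(msg)]
```

In practice these filters are usually combined, so a message flagged by several of them can be treated with more suspicion than one flagged by a single check.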

CHAPTER TWO

2.0 LITERATURE REVIEW

2.1 OVERVIEW OF SPAM FILTERING

Email filtering is the processing of email to organize it according to specified criteria. The term can apply to the intervention of human intelligence, but most often refers to the automatic processing of messages with anti-spam techniques, applied to outgoing emails as well as those being received.

Email filtering software may reject an item at the initial SMTP connection stage [Zonk, 2011] or pass it through unchanged for delivery to the user’s mailbox. Alternatively, it may redirect the message for delivery elsewhere, quarantine it for further checking, or edit or ‘tag’ it in some way.
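The possible outcomes listed above can be modelled as a small set of actions. The Python sketch below is illustrative only: the mapping from a spam probability to an action, the threshold values, the `is_mailing_list` flag, and the `[SPAM?]` subject prefix are all assumptions, since the text does not say how a filter chooses between these outcomes.

```python
from enum import Enum


class Action(Enum):
    REJECT = "reject at the SMTP connection stage"
    DELIVER = "deliver unchanged to the mailbox"
    REDIRECT = "redirect for delivery elsewhere"
    QUARANTINE = "quarantine for further checking"
    TAG = "tag the message and deliver it"


def choose_action(spam_probability: float, is_mailing_list: bool = False) -> Action:
    """Pick an outcome from a spam probability in [0, 1] (assumed thresholds)."""
    if spam_probability >= 0.95:
        return Action.REJECT
    if spam_probability >= 0.80:
        return Action.QUARANTINE
    if spam_probability >= 0.50:
        return Action.TAG
    if is_mailing_list:
        # Hypothetical policy: route bulk but legitimate mail to another folder.
        return Action.REDIRECT
    return Action.DELIVER


def apply_tag(subject: str, prefix: str = "[SPAM?] ") -> str:
    """Mark the subject line once, leaving already tagged subjects unchanged."""
    return subject if subject.startswith(prefix) else prefix + subject
```

Tagging and quarantining keep a borderline message recoverable, which limits the cost of a false positive compared with rejecting it outright.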