Fall 2018 | Demo on 7th Dec 2018
Project By Darshan Kumar S Yaradoni [dsy27@cornell.edu] and Aman Tyagi [at789@cornell.edu]
Access to internet and social media has resulted in an exponential increase of digital marketing and targeted advertising. One of the ways targeted advertising is achieved is through e-mails. Most of the times, the advertisement e-mails are unsolicited and provide no useful information to the users. These are better classified as spam. E-mail spam filtering is becoming all the more important and relevant in today’s digital age, given the massive amount of targeted advertising that’s in place. Even though e-mail spam filtering isn’t a new domain per se, of late it’s being treated from the perspective of Artificial Intelligence, particularly natural language processing and machine learning. This project targets the domain of e-mail spam filtering using machine learning. A classifier is trained using supervised Machine Learning algorithm called Support Vector Machine (SVM) to filter e-mail as spam or not spam. Each e-mail is converted to a feature vector. The SVM is trained on Raspberry Pi and the result displayed on the piTFT screen. In addition to displaying whether the e-mail is spam or not, the display also gives the user information about potential reasons for why the e-mail has been classified as spam. The database used for training is a toned-down version of the SpamAssassin Public Corpus; only the body of the e-mail is used.
The design of this project included implementation of the following -
Fetching E-mail from gmail account
First, we setup a gmail account 'rpifall18@gmail.com'. Then we used fetchmail to receive e-mails on Raspberry Pi. Fetchmail is a client for IMAP and POP. We installed fetchmail using
sudo apt-get install fetchmail
Next, we created a file named .fetchmailrc in the home directory. This file has the contents -
poll imap.gmail.com
protocol IMAP
user "rpifall18@gmail.com" with password "*********" mda "/home/pi/myfetchmailparser.sh"
folder 'INBOX'
fetchlimit 1
keep
ssl
By default, fetchmail will pass the mail to port 25 on local host. When mda (mail delivery agent) is used, the mail will be passed to mda, which happens to be a script in our case. A limit of 1 has been set for fetchlimit. ‘keep’ ensures that the mail is not deleted after it has been read. Next, a file called myfetchmailparser.sh is created. This script writes the received email to a text file.
Filename=$(date +"%Y%m%d_%H%M%S_%N")
Outfile="/var/tmp/mail"$Filename
echo "" > $OutputFile
while read y
do
echo $y >> $Outfile
done
Since fetchmail doesn’t work on root, we set the user pi as the owner using the following commands -
sudo chown pi .fetchmailrc
sudo chown pi myfetchailparser.sh
Finally, the emails are read using –
fetchmail > /dev/null
When a new email arrives and fetchmail is executed, a filename with unique time stamp is created in /var/tmp directory, as shown below –
Figure 1: New email received using fetchmail
This file contains the email received. In addition to the body of the e-mail, fetchmail also extracts several other tags which are not needed for SVM model. These need to be removed. So, we executed the following commands to post-process the e-mail.
cd /var/tmp
filename=$(ls -lrt | tail -1 | awk '{print $9}')
cp $filename /home/pi/prj/forRPi/testmail.txt
cd /home/pi/prj/forRPi/
sudo chown pi testmail.txt
tail -n +76 testmail.txt > test76.txt
sudo chown pi test76.txt
head -n -6 test76.txt > currentmail.txt
sudo chown pi currentmail.txt
Using awk, ls, tail commands, the latest file written out in /var/tmp is identified. The name of the file is stored in $filename. This is copied to the project directory as testmail.txt. Then, using tail and head, the unwanted contents in testmail.txt are removed. Sample of the contents that are removed are shown below –
Figure 2: Content that needs to be removed
The file currentmail.txt is then passed to the python module for further pre-processing.
E-mail pre-processing using Natural Language Processing
Pre-processing of e-mails involves converting the e-mail text to a form that is suitable for feature extraction. For example, spam e-mails are likely to contain URLs, asking the user to click on malicious links. These URLs will be different in every spam mail. One way to take care of URLs is to normalize them all – i.e. all URL links in the body of e-mail will be replaced by a string “httpaddr”. Similarly, we carried out many pre-processing steps on the e-mail text.
Email address normalizing: All email addresses in the body of the text are replaced with the text “emailaddr”. This can be done using :
line = re.sub(r"[^\s]+@[^\s]+","emailaddr",line)
Here, line refers to the current line being scanned in the e-mail body. re.sub is a function in the Regular Expressions package that matches a regular expression and substitutes it with another string. Here, the e-mail address in the form string@string is replaced with “emailaddr”
Conversion to lower-case letters: This step is carried out to ensure capitalization is ignored. The words “Include” and “include” must be treated the same. Conversion to lower-case is done using:
line = line.lower()
Normalizing numbers: All numbers must be treated the same way. For example, 1000 must be treated the same way as 100 or any other number. It’s only necessary that the SVM classifier understands that a number exists in the e-mail content. Magnitude of the number does not make an impact in spam classification. Hence all numbers are replaced with the word “number” using:
line = re.sub(r"\d","number",line)
Word Stemming: Stemming in the context of natural language processing refers to stripping the word to its root form. For example the words “includes”, “include”, “including”, “included” must be converted to the stemmed form “include”. This can be done using –
from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()
stemmed = [porter.stem(word) for word in tokens]
Trimming of punctuations: All punctuations must be removed before extracting e-mail features. This can be done using natural language toolkit package in python –
from nltk.tokenize import word_tokenize
tokens = word_tokenize(text)
An example of actual e-mail and processed e-mail are shown in the figure below:
Figure 3: Example of actual e-mail and processed e-mail
Feature Extraction
E-mail feature extraction is a key step in spam filtering, since the features predominantly determine the outcome of SVM. First step in e-mail feature extraction involves taking a decision about which words will be used for classification. If all words are included for classification, there’s a high likelihood of SVM model overfitting the training data. To avoid overfitting, words that rarely occur should not be considered. A general practice in spam filtering is to consider words that occur at least 100 times in the spam corpus vocabulary list. This results in about 2000 words that should be considered for classifier. Given processed e-mail text, each word can be mapped to the 2000-word vocabulary list. The vocabulary list has a number listed against each word, as shown below:
Figure 4: Vocabulary list and mapping words to indices
Each word in the e-mail is then mapped to the number in the vocabulary list. For example, if the word “what” appears in the list, it is mapped to the number “1652”. This is done using Python dictionary. First, vocabList.txt – the file which has the words along with numbers – is converted to Python dictionary. Each word in the e-mail is then mapped to the number in the vocabulary list. For example, if the word “what” appears in the list, it is mapped to the number “1652”.
vocabDictionary = {}
with open("vocab.txt") as f:
for line in f:
(key, val) = line.split()
vocabDictionary[int(key)] = val
The dictionary is then used to map the words to the index number using this code snippet –
indices = [];
for key,value in vocabDictionary.items():
for each_word in stemmed:
if each_word==value:
indices.append(key)
After mapping the words to indices, a feature vector is created using these indices. The feature vector is a binary feature vector that indicates whether a particular word occurs in the e-mail. If word 'k' is present in e-mail, then feature_vector(k) = 1. An example of binary feature vector is shown below:
Figure 5: Example of binary feature vector
Training SVM
Support Vector Machine is a supervised machine learning algorithm used mostly for classification problems. When a Support Vector Machine is explicitly told to find a line or hyperplane which best segregates features into 2 classes, it does so by arriving at a line that is at a maximum distance from each of the points that "support" the vector. In the figure shown below, the line l3 has the highest “margin” i.e. the distance between the line and the closest point of either class. Hence this is the optimal line that separates the two classes.
Figure 6: Classification using SVM
We implemented SVM using python package and achieved train accuracy of 99.97%. The RPi takes less than 30 seconds to train the SVM.
from sklearn import svm
clf = svm.SVC(kernel='linear', C = 1.0)
clf.fit(X_train,y_train.ravel())
The steps involved in Design and testing are illustrated in the figure below:
Figure 7: Steps involved in this project
After training SVM, we tested the system using test vectors. The model achieved test accuracy of 97.8%
import scipy.io as sio
#load Spam Test Data
mat_contents = sio.loadmat('testvector.mat')
X_test=mat_contents['Xtest']
y_test=mat_contents['ytest']
#test data prediction
y_pred_test=clf.predict(X_test)
#calculate test accuracy
count=0.0
for i in range(len(y_test)):
if(y_pred_test[i]==y_test[i]):
count=count+1
accuracy = count/len(y_test)*100
print("Test Accuracy:" + str(accuracy))
We also tested the performance of the spam classifier on many e-mails. Two examples, one for spam and one for non spam e-mails are shown below -
Figure 8: Spam Classification on actual e-mails
We discuss here the issues we encountered and how we resolved them.
Fetching e-mail on Raspberry Pi using imaplib: We tried to use imaplib to fetch e-mails on the RPi. But gmail refused to provide access. So we switched to the fetchmail method outlined above.
Running shell commands as non-root inside sudo python script call: To get the display running on piTFT, we have to execute python code using ‘sudo’. But fetchmail, the email client that we used to receive e-mails, does not run as ‘root’. We spent several hours trying to figure out how to run a linux shell command as non-root when python script is called using sudo. We resolved the issue of running commands as non-root by executing ‘su – pi -c’ which runs a command in the shell as the user pi, instead of root.
Incorrect classification of very short e-mails: The model that we trained is not robust enough for very short e-mails. Sometimes, very short non-spam e-mails are classified as spam. This is a limitation of the current implementation and we are working on resolving this by exploring other SVM kernels.
Results of this project are as expected. When we started off with the initial plan, the goal was to build a stand-alone E-mail spam filter on Raspberry Pi using Machine Learning. We successfully implemented the goals we had in mind and also identified the shortcomings/limitations of our system as we progressed through different stages of the project. We were able to achieve a really good training accuracy and also a decent test accuracy. We tested the system on 6 e-mail samples - 4 of them being spam and 2 being non-spam. Our system was able to correctly classify all the six examples. In this project, we also implemented the User Interface to provide useful information to the user when training and testing the spam filter. Overall, we met all the goals and also completed all tasks on time. We kept track of our project using the timeline we prepared at the beginning of the project, shown below -
Figure 9:Project Timeline
The figure shown below summarizes the results.
Figure 10: Table showing train accuracy, test accuracy, training time and classification results
The User Interface implemented for this project is shown below:
Figure 11: UI on piTFT for E-mail Spam Filter
In conclusion, the e-mail spam filter on the raspberry pi can correctly classify many e-mails, with the exception of very short e-mails. Very short e-mails tend to result in extremely small set of features, which makes it difficult for classification. The user interface implemented in this project also worked seamlessly without any glitches. However, we have identified possible improvements for project. Scope for future work is outlined in the next section.
As of now, we have explored the linear SVM kernel. Using the non-linear kernel may give better results. Also, we haven't included the sender of the e-mail, e-mail subject into consideration when extracting features. These can also be included for better classification. Possible improvement to this project includes addressing incorrect classifications for very short e-mails.
Darshan worked on implementing SVM training, testing and User Interface implementation. Aman worked on implementing fetchmail and e-mail pre-processing.
We would like to thank Professor Joseph Skovira and TAs Rohit Krishnakumar, Joao Pedro Carvao, Xitang Zhao and Mei Yang for the inputs and guidance extended to us.
user_interface.py
#E-mail Spam Filtering on Raspberry Pi using Machine Learning
#Authors: Darshan Kumar S Yaradoni (NetID: dsy27) and Aman Tyagi (NetID: at789)
#testing trained model
#This function implements UI using pygame and calls train and test functions
import pygame
from pygame.locals import * # for event MOUSE variables
import os
import time
import math
import RPi.GPIO as GPIO
import commands
from call_train import train
from call_test import test
from top_5 import spam_words
#enable display on piTFT
os.putenv('SDL_VIDEODRIVER', 'fbcon')
os.putenv('SDL_FBDEV', '/dev/fb1')
#track mouse clicks on piTFT
os.putenv('SDL_MOUSEDRV', 'TSLIB')
os.putenv('SDL_MOUSEDEV', '/dev/input/touchscreen')
GPIO.setmode(GPIO.BCM)
GPIO.setup(27,GPIO.IN,pull_up_down=GPIO.PUD_UP)
pygame.init()
pygame.mouse.set_visible(False)
WHITE = 255, 255, 255
BLACK = 0,0,0
BLUE = 0,0,255
RED = 255,0,0
GREEN = 0,255,0
screen = pygame.display.set_mode((320, 240))
my_font = pygame.font.Font(None, 30)
my_buttons = {'Train':(80,200), 'Test':(220,200)}
screen.fill(BLACK) # Erase the Work space
size = width, height = 320, 240
black = 0, 0, 0
#header display on piTFT
heading = {'E-mail Spam Filter': (100,30)}
for my_text, text_pos in heading.items():
text_surface = my_font.render(my_text, True, WHITE)
rect = text_surface.get_rect(center=(160,30))
screen.blit(text_surface, rect)
pygame.display.flip()
#user option to exit
my_textfooter = 'Press GPIO 27 to exit'
my_fontfooter = pygame.font.Font(None,15)
text_surface = my_fontfooter.render(my_textfooter, True, WHITE)
rect = text_surface.get_rect(center=(160,230))
screen.blit(text_surface, rect)
pygame.display.flip()
#display training message when training in progress
train_message = {'Training SVM ...':(100,30)}
buttontrain=pygame.Rect(45,185,70,30)
pygame.draw.rect(screen,[255,0,0],buttontrain)
pygame.display.flip()
buttontest=pygame.Rect(185,185,70,30)
pygame.draw.rect(screen,[0,0,255],buttontest)
pygame.display.flip()
while True:
for my_text, text_pos in my_buttons.items():
text_surface = my_font.render(my_text, True, WHITE)
rect = text_surface.get_rect(center=text_pos)
screen.blit(text_surface, rect)
pygame.display.flip()
for event in pygame.event.get():
if(event.type is MOUSEBUTTONDOWN):
pos = pygame.mouse.get_pos()
elif(event.type is MOUSEBUTTONUP):
pos = pygame.mouse.get_pos()
x,y = pos
#implement state encodings depending on mouse event press
if y > 180:
if x < 160:
state = 1
else:
state = 2
if state==1:
train_done = 0
screen.fill(BLACK)
for my_text, text_pos in train_message.items():
text_surface = my_font.render(my_text, True, WHITE)
rect = text_surface.get_rect(center=(160,30))
screen.blit(text_surface, rect)
pygame.display.flip()
#function call to train SVM
train_done,train_accuracy,test_accuracy,train_time = train()
#if done with training, display train, test accuracy and train time
if train_done:
state=0
screen.fill(BLACK)
train_done_message = {'Training done':(100,80)}
for my_text, text_pos in train_done_message.items():
text_surface = my_font.render(my_text, True, WHITE)
rect = text_surface.get_rect(center=(160,30))
screen.blit(text_surface, rect)
pygame.display.flip()
my_font2 = pygame.font.Font(None,22)
#display train accuracy
my_text = ("Train Accuracy:"+" "+str(train_accuracy)+"%")
text_surface=my_font2.render(my_text,True,RED)
rect=text_surface.get_rect(center=[160,100])
screen.blit(text_surface,rect)
pygame.display.flip()
#display test accuracy
my_text = ("Test Accuracy:"+" "+str(test_accuracy)+"%")
text_surface=my_font2.render(my_text,True,GREEN)
rect=text_surface.get_rect(center=[160,125])
screen.blit(text_surface,rect)
#display train time
my_text = ("Train Time: "+str(round(train_time,2))+" seconds")
text_surface=my_font2.render(my_text,True,RED)
rect=text_surface.get_rect(center=[160,80])
screen.blit(text_surface,rect)
#rectangles around train and test buttons
pygame.display.flip()
pygame.draw.rect(screen,[255,0,0],buttontrain)
pygame.draw.rect(screen,[0,0,255],buttontest)
#footer message to tell user to press 27 to exit
mytextfooter='Press GPIO 27 to Exit'
myfontfooter = pygame.font.Font(None,15)
text_surface=myfontfooter.render(mytextfooter,True,WHITE)
rect=text_surface.get_rect(center=[160,230])
screen.blit(text_surface,rect)
pygame.display.flip()
print("Training done")
#if 'Test' button is pressed, test on saved model
if state==2:
test_done = 0
screen.fill(BLACK)
test_message = {"Testing on unread e-mail":(100,30)}
for my_text, text_pos in test_message.items():
text_surface = my_font.render(my_text, True, WHITE)
rect = text_surface.get_rect(center=(160,30))
screen.blit(text_surface, rect)
pygame.display.flip()
#use os.system to execute getemail.sh script in shell
os.system("/home/pi/getemail.sh")
#get the classification result
test_done,p,mystring=test()
if p==1:
result='Spam'
else:
result='Not Spam'
if test_done:
state=0
screen.fill(BLACK)
#display testing done message
test_done_message = {'Testing done':(100,80)}
for my_text, text_pos in test_done_message.items():
text_surface = my_font.render(my_text, True, WHITE)
rect = text_surface.get_rect(center=(160,30))
screen.blit(text_surface, rect)
pygame.display.flip()
#get the sender of email and display on piTFT
status, output = commands.getstatusoutput("awk '/From/{print}' /home/pi/testmail.txt")
my_text=output
my_font2=pygame.font.Font(None,18)
text_surface=my_font2.render(my_text,True,BLUE)
rect=text_surface.get_rect(center=[160,80])
screen.blit(text_surface,rect)
pygame.display.flip()
#get the subject of email and display on piTFT
status, output = commands.getstatusoutput("awk '/Subject/{print}' /home/pi/testmail.txt")
#if spam, call to test function returns p 1. If p is 1, display spam potentially due to:
if p==1:
string2 = "Spam potentially due to:"
my_text2 = string2
my_font2=pygame.font.Font(None,18)
text_surface=my_font2.render(my_text2,True,BLUE)
rect=text_surface.get_rect(center=[160,155])
screen.blit(text_surface,rect)
pygame.display.flip()
my_text3 = mystring
my_font2=pygame.font.Font(None,18)
text_surface=my_font2.render(my_text3,True,GREEN)
rect=text_surface.get_rect(center=[160,170])
screen.blit(text_surface,rect)
pygame.display.flip()
#output is the result of classificaiton
my_text = output
my_font2=pygame.font.Font(None,18)
text_surface=my_font2.render(my_text,True,BLUE)
rect=text_surface.get_rect(center=[160,105])
screen.blit(text_surface,rect)
pygame.display.flip()
#display test result on piTFT
my_text=("Test Result: "+str(result))
my_font2=pygame.font.Font(None,22)
text_surface=my_font2.render(my_text,True,RED)
rect=text_surface.get_rect(center=[160,135])
screen.blit(text_surface,rect)
#draw rectangles around Train and Test buttons
pygame.draw.rect(screen,[255,0,0],buttontrain)
pygame.draw.rect(screen,[0,0,255],buttontest)
#user message to exit on pressing GPIO 27
mytextfooter='Press GPIO 27 to Exit'
myfontfooter=pygame.font.Font(None,15)
text_surface=myfontfooter.render(mytextfooter,True,WHITE)
rect=text_surface.get_rect(center=[160,230])
screen.blit(text_surface,rect)
pygame.display.flip()
#exit if GPIO 27 is pressed
if GPIO.input(27) == GPIO.LOW:
print("Quitting now")
pygame.display.quit()
pygame.quit()
exit()
pygame.quit()
call_train.py
#E-mail Spam Filtering on Raspberry Pi using Machine Learning
#Authors: Darshan Kumar S Yaradoni (NetID: dsy27) and Aman Tyagi (NetID: at789)
#testing trained model
#This function returns 1 when training completes, train accuracy, test accuracy and training time
def train():
import config
# natural language kit
import nltk
import time
nltk.download('punkt')
nltk.download('stopwords')
import re
import numpy as np
f=open('/home/pi/emailSample1LowerCase.txt','w')
f.close()
with open('/home/pi/emailSample1.txt', 'r') as fileinput:
for line in fileinput:
#replace digits with string "number"
line = re.sub(r"[0-9][0-9][0-9]","number",line)
line = re.sub(r"[0-9][0-9]","number",line)
line = re.sub(r"[0-9]","number",line)
#replace URLs with "httpaddr"
line = re.sub(r"(http|https)://[^\s]*", "httpaddr",line)
line = re.sub(r"[^\s]+@[^\s]+","emailaddr",line)
line = re.sub(r"[$]+","dollar",line)
with open('/home/pi/emailSample1LowerCase.txt', 'a') as the_file:
the_file.write(line)
the_file.close()
filename = 'emailSample1LowerCase.txt'
file = open(filename, 'rt')
text = file.read()
file.close()
# split into words
from nltk.tokenize import word_tokenize
tokens = word_tokenize(text)
# convert to lower case
tokens = [w.lower() for w in tokens]
# remove punctuation from each word
import string
table = string.maketrans('', '')
stripped = [w.translate(table,string.punctuation) for w in tokens]
# remove remaining tokens that are not alphabetic
words = [word for word in stripped if word.isalpha()]
f=open('/home/pi/emailSampleProcessed','w')
f.close()
for each_word in words:
with open('/home/pi/emailSampleProcessed', 'a') as the_file:
the_file.write(each_word+" ")
f.close()
filename = '/home/pi/emailSampleProcessed'
file = open(filename, 'rt')
text = file.read()
from nltk.tokenize import word_tokenize
tokens = word_tokenize(text)
# stemming of words
from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()
stemmed = [porter.stem(word) for word in tokens]
#write to file for further processing
f=open('/home/pi/emailSampleProcessedFinal','w')
f.close()
for each_word in stemmed:
with open('/home/pi/emailSampleProcessedFinal', 'a') as the_file:
the_file.write(each_word+" ")
f.close()
vocabList = open("/home/pi/vocab.txt","r")
vocabDictionary = {}
with open("/home/pi/vocab.txt") as f:
for line in f:
(key, val) = line.split()
vocabDictionary[int(key)] = val
indices = [];
for key,value in vocabDictionary.items():
for each_word in stemmed:
if each_word==value:
indices.append(key)
#extract email Features
n=1899
s=(n,1)
#initialize with zeros
feature_vector=np.zeros(s)
for i in range(len(indices)):
feature_vector[indices[i]]=1
#load Spam Training Data
import scipy.io as sio
mat_contents = sio.loadmat('/home/pi/train.mat')
X_train=mat_contents['X']
y_train=mat_contents['y']
from sklearn import svm
clf = svm.SVC(kernel='linear', C = 1.0)
#timer for actual train time
start_time = 0.0
train_time = 0.0
start_time = time.time()
#train model
clf.fit(X_train,y_train.ravel())
train_time = time.time() - start_time
#save trained model
from sklearn.externals import joblib
filename = '/home/pi/emailClassifier.sav'
joblib.dump(clf, filename)
config.state_train=1
#test model
y_pred_train=clf.predict(X_train)
count=0.0
for i in range(len(y_train)):
if(y_pred_train[i]==y_train[i]):
count=count+1
train_accuracy=count/len(y_train)*100
mat_contents=sio.loadmat('/home/pi/spamTest.mat')
x_test=mat_contents['Xtest']
y_test=mat_contents['ytest']
y_pred_test=clf.predict(x_test)
count=0.0
for i in range(len(y_test)):
if(y_pred_test[i]==y_test[i]):
count=count+1
test_accuracy = count/len(y_test)*100
#return train,test accuracy and train time
return 1,train_accuracy,test_accuracy,train_time
call_test.py
#E-mail Spam Filtering on Raspberry Pi using Machine Learning
#Authors: Darshan Kumar S Yaradoni (NetID: dsy27) and Aman Tyagi (NetID: at789)
#testing trained model
#This function returns 1 when testing is done, returns result of test: Spam/Not Spam and the words which classifier identifies as spam
def test():
import nltk
#download required packages
nltk.download('punkt')
nltk.download('stopwords')
import re
import numpy as np
f=open('/home/pi/emailSample1LowerCase.txt','w')
f.close()
with open('/home/pi/currentmail.txt', 'r') as fileinput:
for line in fileinput:
#substitute actual numbers with string 'number'
line = re.sub(r"[0-9][0-9][0-9]","number",line)
line = re.sub(r"[0-9][0-9]","number",line)
line = re.sub(r"[0-9]","number",line)
#substitute urls with "httpaddr"
line = re.sub(r"(http|https)://[^\s]*", "httpaddr",line)
line = re.sub(r"[^\s]+@[^\s]+","emailaddr",line)
line = re.sub(r"[$]+","dollar",line)
with open('/home/pi/emailSample1LowerCase.txt', 'a') as the_file:
the_file.write(line)
the_file.close()
filename = '/home/pi/emailSample1LowerCase.txt'
file = open(filename, 'rt')
text = file.read()
file.close()
from nltk.tokenize import word_tokenize
tokens = word_tokenize(text)
#convert to lowercase
tokens = [w.lower() for w in tokens]
# remove punctuation from each word
import string
table = string.maketrans('', '')
stripped = [w.translate(table,string.punctuation) for w in tokens]
# remove remaining tokens that are not alphabetic
words = [word for word in stripped if word.isalpha()]
f=open('/home/pi/emailSampleProcessed','w')
f.close()
for each_word in words:
with open('/home/pi/emailSampleProcessed', 'a') as the_file:
the_file.write(each_word+" ")
f.close()
filename = '/home/pi/emailSampleProcessed'
file = open(filename, 'rt')
text = file.read()
from nltk.tokenize import word_tokenize
tokens = word_tokenize(text)
# stemming of words
from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()
stemmed = [porter.stem(word) for word in tokens]
f=open('/home/pi/emailSampleProcessedFinal','w')
f.close()
for each_word in stemmed:
with open('/home/pi/emailSampleProcessedFinal', 'a') as the_file:
the_file.write(each_word+" ")
#vocabList
vocabList = open("/home/pi/vocab.txt","r")
vocabDictionary = {}
with open("/home/pi/vocab.txt") as f:
for line in f:
(key, val) = line.split()
vocabDictionary[int(key)] = val
indices = [];
for key,value in vocabDictionary.items():
for each_word in stemmed:
if each_word==value:
indices.append(key)
#extract email Features
n=1899
s=(n,1)
#initialize with zeros
feature_vector=np.zeros(s)
from sklearn.externals import joblib
for i in range(len(indices)):
feature_vector[indices[i]]=1
#load saved trained model
clf = joblib.load('/home/pi/emailClassifier.sav')
#predict on extract features
p = clf.predict(feature_vector.T)
x = clf.coef_
y=x
sort_i=np.sort(y)
sort_t=sort_i.T
sort_i_index=np.argsort(y)
sort_t_index = sort_i_index.T
indices_new = list(set(indices))
w1 = vocabDictionary[indices_new[1]]
w2 = vocabDictionary[indices_new[2]]
w3 = vocabDictionary[indices_new[3]]
w4 = vocabDictionary[indices_new[4]]
w5 = vocabDictionary[indices_new[5]]
mystr=w1+" "+w2+" "+w3+" "+w4+" "+w5
return 1, int(p), mystr
get_email.sh
#E-mail Spam Filtering on Raspberry Pi using Machine Learning
#Authors: Darshan Kumar S Yaradoni (NetID: dsy27) and Aman Tyagi (NetID: at789)
#This function reads unread e-mail and extracts body of e-mail
su - pi -c "fetchmail > /dev/null"
cd /var/tmp
filename=$(ls -lrt mail* | tail -1 | awk '{print $9}')
cp $filename /home/pi/testmail.txt
sudo chown pi /home/pi/testmail.txt
tail -n +76 /home/pi/testmail.txt > /home/pi/test76.txt
sudo chown pi /home/pi/test76.txt
head -n -6 /home/pi/test76.txt > /home/pi/currentmail.txt
sudo chown pi /home/pi/currentmail.txt