E-mail Spam Filtering on Raspberry Pi using Machine Learning

Fall 2018 | Demo on 7^th Dec 2018
Project By Darshan Kumar S Yaradoni [dsy27@cornell.edu] and Aman Tyagi [at789@cornell.edu]

Demonstration Video

Introduction

Access to internet and social media has resulted in an exponential increase of digital marketing and targeted advertising. One of the ways targeted advertising is achieved is through e-mails. Most of the times, the advertisement e-mails are unsolicited and provide no useful information to the users. These are better classified as spam. E-mail spam filtering is becoming all the more important and relevant in today’s digital age, given the massive amount of targeted advertising that’s in place. Even though e-mail spam filtering isn’t a new domain per se, of late it’s being treated from the perspective of Artificial Intelligence, particularly natural language processing and machine learning. This project targets the domain of e-mail spam filtering using machine learning. A classifier is trained using supervised Machine Learning algorithm called Support Vector Machine (SVM) to filter e-mail as spam or not spam. Each e-mail is converted to a feature vector. The SVM is trained on Raspberry Pi and the result displayed on the piTFT screen. In addition to displaying whether the e-mail is spam or not, the display also gives the user information about potential reasons for why the e-mail has been classified as spam. The database used for training is a toned-down version of the SpamAssassin Public Corpus; only the body of the e-mail is used.

Project Objectives

Given an e-mail text, extract the features using natural language processing techniques and pass the features as input to SVM.
Train SVM on the Raspberry Pi and use the model to classify incoming mails as spam or no-spam.
Use the piTFT display system as User Interface to implement actions such as displaying spam or no-spam and providing information about why a particular e-mail has been classified as spam.
Configure the Raspberry Pi to detect when a new e-mail has arrived in a gmail account and read this email, followed by e-mail pre-processing and feature extraction.

Design

The design of this project included implementation of the following -

Fetching E-mail from a gmail account
E-mail pre-processing using Natural Language Processing (NLP)
Feature Extraction
Training SVM
Classification using trained model
Implementing User Interface (UI) for piTFT display

Fetching E-mail from gmail account

First, we setup a gmail account 'rpifall18@gmail.com'. Then we used fetchmail to receive e-mails on Raspberry Pi. Fetchmail is a client for IMAP and POP. We installed fetchmail using


                sudo apt-get install fetchmail

Next, we created a file named .fetchmailrc in the home directory. This file has the contents -


                poll imap.gmail.com
                protocol IMAP
                user "rpifall18@gmail.com" with password "*********" mda "/home/pi/myfetchmailparser.sh"
                folder 'INBOX'
                fetchlimit 1
                keep
                ssl

By default, fetchmail will pass the mail to port 25 on local host. When mda (mail delivery agent) is used, the mail will be passed to mda, which happens to be a script in our case. A limit of 1 has been set for fetchlimit. ‘keep’ ensures that the mail is not deleted after it has been read. Next, a file called myfetchmailparser.sh is created. This script writes the received email to a text file.


                Filename=$(date +"%Y%m%d_%H%M%S_%N")
                Outfile="/var/tmp/mail"$Filename
                echo "" > $OutputFile
                while read y
                do
                  echo $y >> $Outfile
                done

Since fetchmail doesn’t work on root, we set the user pi as the owner using the following commands -


                sudo chown pi .fetchmailrc
                sudo chown pi myfetchailparser.sh

Finally, the emails are read using –


                fetchmail > /dev/null

When a new email arrives and fetchmail is executed, a filename with unique time stamp is created in /var/tmp directory, as shown below –

Figure 1: New email received using fetchmail

This file contains the email received. In addition to the body of the e-mail, fetchmail also extracts several other tags which are not needed for SVM model. These need to be removed. So, we executed the following commands to post-process the e-mail.


                cd /var/tmp
                filename=$(ls -lrt | tail -1 | awk '{print $9}')
                cp $filename /home/pi/prj/forRPi/testmail.txt

                cd /home/pi/prj/forRPi/
                sudo chown pi testmail.txt
                tail -n +76 testmail.txt > test76.txt
                sudo chown pi test76.txt
                head -n -6 test76.txt > currentmail.txt
                sudo chown pi currentmail.txt

Using awk, ls, tail commands, the latest file written out in /var/tmp is identified. The name of the file is stored in $filename. This is copied to the project directory as testmail.txt. Then, using tail and head, the unwanted contents in testmail.txt are removed. Sample of the contents that are removed are shown below –

Figure 2: Content that needs to be removed

The file currentmail.txt is then passed to the python module for further pre-processing.

E-mail pre-processing using Natural Language Processing

Pre-processing of e-mails involves converting the e-mail text to a form that is suitable for feature extraction. For example, spam e-mails are likely to contain URLs, asking the user to click on malicious links. These URLs will be different in every spam mail. One way to take care of URLs is to normalize them all – i.e. all URL links in the body of e-mail will be replaced by a string “httpaddr”. Similarly, we carried out many pre-processing steps on the e-mail text.

Email address normalizing: All email addresses in the body of the text are replaced with the text “emailaddr”. This can be done using :


                line = re.sub(r"[^\s]+@[^\s]+","emailaddr",line)

Here, line refers to the current line being scanned in the e-mail body. re.sub is a function in the Regular Expressions package that matches a regular expression and substitutes it with another string. Here, the e-mail address in the form string@string is replaced with “emailaddr”

Conversion to lower-case letters: This step is carried out to ensure capitalization is ignored. The words “Include” and “include” must be treated the same. Conversion to lower-case is done using:


                line = line.lower()

Normalizing numbers: All numbers must be treated the same way. For example, 1000 must be treated the same way as 100 or any other number. It’s only necessary that the SVM classifier understands that a number exists in the e-mail content. Magnitude of the number does not make an impact in spam classification. Hence all numbers are replaced with the word “number” using:


                line = re.sub(r"\d","number",line)

Word Stemming: Stemming in the context of natural language processing refers to stripping the word to its root form. For example the words “includes”, “include”, “including”, “included” must be converted to the stemmed form “include”. This can be done using –


                from nltk.stem.porter import PorterStemmer
                porter = PorterStemmer()
                stemmed = [porter.stem(word) for word in tokens]

Trimming of punctuations: All punctuations must be removed before extracting e-mail features. This can be done using natural language toolkit package in python –


                from nltk.tokenize import word_tokenize
                tokens = word_tokenize(text)

An example of actual e-mail and processed e-mail are shown in the figure below:

Figure 3: Example of actual e-mail and processed e-mail

Feature Extraction

E-mail feature extraction is a key step in spam filtering, since the features predominantly determine the outcome of SVM. First step in e-mail feature extraction involves taking a decision about which words will be used for classification. If all words are included for classification, there’s a high likelihood of SVM model overfitting the training data. To avoid overfitting, words that rarely occur should not be considered. A general practice in spam filtering is to consider words that occur at least 100 times in the spam corpus vocabulary list. This results in about 2000 words that should be considered for classifier. Given processed e-mail text, each word can be mapped to the 2000-word vocabulary list. The vocabulary list has a number listed against each word, as shown below:

Figure 4: Vocabulary list and mapping words to indices

Each word in the e-mail is then mapped to the number in the vocabulary list. For example, if the word “what” appears in the list, it is mapped to the number “1652”. This is done using Python dictionary. First, vocabList.txt – the file which has the words along with numbers – is converted to Python dictionary. Each word in the e-mail is then mapped to the number in the vocabulary list. For example, if the word “what” appears in the list, it is mapped to the number “1652”.


                vocabDictionary = {}
                with open("vocab.txt") as f:
                for line in f:
                    (key, val) = line.split()
                    vocabDictionary[int(key)] = val

The dictionary is then used to map the words to the index number using this code snippet –


                indices = [];
                for key,value in vocabDictionary.items():
                      for each_word in stemmed:
                          if each_word==value:
                              indices.append(key)

After mapping the words to indices, a feature vector is created using these indices. The feature vector is a binary feature vector that indicates whether a particular word occurs in the e-mail. If word 'k' is present in e-mail, then feature_vector(k) = 1. An example of binary feature vector is shown below: Generic placeholder image

Figure 5: Example of binary feature vector

Training SVM

Support Vector Machine is a supervised machine learning algorithm used mostly for classification problems. When a Support Vector Machine is explicitly told to find a line or hyperplane which best segregates features into 2 classes, it does so by arriving at a line that is at a maximum distance from each of the points that "support" the vector. In the figure shown below, the line l₃ has the highest “margin” i.e. the distance between the line and the closest point of either class. Hence this is the optimal line that separates the two classes.

Figure 6: Classification using SVM

We implemented SVM using python package and achieved train accuracy of 99.97%. The RPi takes less than 30 seconds to train the SVM.


                from sklearn import svm
                clf = svm.SVC(kernel='linear', C = 1.0)
                clf.fit(X_train,y_train.ravel())

The steps involved in Design and testing are illustrated in the figure below:

Figure 7: Steps involved in this project

Testing

After training SVM, we tested the system using test vectors. The model achieved test accuracy of 97.8%


              import scipy.io as sio

              #load Spam Test Data
              mat_contents = sio.loadmat('testvector.mat')
              X_test=mat_contents['Xtest']
              y_test=mat_contents['ytest']
              #test data prediction
              y_pred_test=clf.predict(X_test)
              #calculate test accuracy
              count=0.0
              for i in range(len(y_test)):
                if(y_pred_test[i]==y_test[i]):
                    count=count+1
              accuracy = count/len(y_test)*100
              print("Test Accuracy:" + str(accuracy))

We also tested the performance of the spam classifier on many e-mails. Two examples, one for spam and one for non spam e-mails are shown below - Generic placeholder image

Figure 8: Spam Classification on actual e-mails

Issues encountered and their resolution

We discuss here the issues we encountered and how we resolved them.

Fetching e-mail on Raspberry Pi using imaplib: We tried to use imaplib to fetch e-mails on the RPi. But gmail refused to provide access. So we switched to the fetchmail method outlined above.

Running shell commands as non-root inside sudo python script call: To get the display running on piTFT, we have to execute python code using ‘sudo’. But fetchmail, the email client that we used to receive e-mails, does not run as ‘root’. We spent several hours trying to figure out how to run a linux shell command as non-root when python script is called using sudo. We resolved the issue of running commands as non-root by executing ‘su – pi -c’ which runs a command in the shell as the user pi, instead of root.

Incorrect classification of very short e-mails: The model that we trained is not robust enough for very short e-mails. Sometimes, very short non-spam e-mails are classified as spam. This is a limitation of the current implementation and we are working on resolving this by exploring other SVM kernels.

Results

Results of this project are as expected. When we started off with the initial plan, the goal was to build a stand-alone E-mail spam filter on Raspberry Pi using Machine Learning. We successfully implemented the goals we had in mind and also identified the shortcomings/limitations of our system as we progressed through different stages of the project. We were able to achieve a really good training accuracy and also a decent test accuracy. We tested the system on 6 e-mail samples - 4 of them being spam and 2 being non-spam. Our system was able to correctly classify all the six examples. In this project, we also implemented the User Interface to provide useful information to the user when training and testing the spam filter. Overall, we met all the goals and also completed all tasks on time. We kept track of our project using the timeline we prepared at the beginning of the project, shown below -

Figure 9:Project Timeline

The figure shown below summarizes the results.

Figure 10: Table showing train accuracy, test accuracy, training time and classification results

The User Interface implemented for this project is shown below:

Figure 11: UI on piTFT for E-mail Spam Filter

Conclusion

In conclusion, the e-mail spam filter on the raspberry pi can correctly classify many e-mails, with the exception of very short e-mails. Very short e-mails tend to result in extremely small set of features, which makes it difficult for classification. The user interface implemented in this project also worked seamlessly without any glitches. However, we have identified possible improvements for project. Scope for future work is outlined in the next section.

Future work

As of now, we have explored the linear SVM kernel. Using the non-linear kernel may give better results. Also, we haven't included the sender of the e-mail, e-mail subject into consideration when extracting features. These can also be included for better classification. Possible improvement to this project includes addressing incorrect classifications for very short e-mails.

Work Distribution

Darshan worked on implementing SVM training, testing and User Interface implementation. Aman worked on implementing fetchmail and e-mail pre-processing.

Project group picture: Aman Tyagi (left) and Darshan Kumar S Y (right)

Parts List

Raspberry Pi $35.00
piTFT Screen $30.00

Total: $65.00. Both the parts were provided in the lab and we did not incur any other expenses towards this project.

Acknowledgements

We would like to thank Professor Joseph Skovira and TAs Rohit Krishnakumar, Joao Pedro Carvao, Xitang Zhao and Mei Yang for the inputs and guidance extended to us.

References

Fetchmail
Machine Learning
Pygame
Python Natural Language Processing Toolkit

Code Appendix

user_interface.py


#E-mail Spam Filtering on Raspberry Pi using Machine Learning
#Authors: Darshan Kumar S Yaradoni (NetID: dsy27) and Aman Tyagi (NetID: at789)
#testing trained model
#This function implements UI using pygame and calls train and test functions              
import pygame
from pygame.locals import * # for event MOUSE variables
import os
import time
import math
import RPi.GPIO as GPIO
import commands
from call_train import train
from call_test import test
from top_5 import spam_words
#enable display on piTFT
os.putenv('SDL_VIDEODRIVER', 'fbcon')
os.putenv('SDL_FBDEV', '/dev/fb1')
#track mouse clicks on piTFT
os.putenv('SDL_MOUSEDRV', 'TSLIB')
os.putenv('SDL_MOUSEDEV', '/dev/input/touchscreen')
GPIO.setmode(GPIO.BCM)
GPIO.setup(27,GPIO.IN,pull_up_down=GPIO.PUD_UP)

pygame.init()
pygame.mouse.set_visible(False)
WHITE = 255, 255, 255
BLACK = 0,0,0
BLUE = 0,0,255
RED = 255,0,0
GREEN = 0,255,0
screen = pygame.display.set_mode((320, 240))
my_font = pygame.font.Font(None, 30)
my_buttons = {'Train':(80,200), 'Test':(220,200)}
screen.fill(BLACK) # Erase the Work space
size = width, height = 320, 240
black = 0, 0, 0

#header display on piTFT
heading = {'E-mail Spam Filter': (100,30)}

for my_text, text_pos in heading.items():
        text_surface = my_font.render(my_text, True, WHITE)
        rect = text_surface.get_rect(center=(160,30))
        screen.blit(text_surface, rect)
        pygame.display.flip()
#user option to exit
my_textfooter = 'Press GPIO 27 to exit'
my_fontfooter = pygame.font.Font(None,15)       
text_surface = my_fontfooter.render(my_textfooter, True, WHITE)
rect = text_surface.get_rect(center=(160,230))
screen.blit(text_surface, rect)
pygame.display.flip()

#display training message when training in progress
train_message =  {'Training SVM ...':(100,30)}  
buttontrain=pygame.Rect(45,185,70,30)
pygame.draw.rect(screen,[255,0,0],buttontrain)
pygame.display.flip()
buttontest=pygame.Rect(185,185,70,30)
pygame.draw.rect(screen,[0,0,255],buttontest)
pygame.display.flip()

while True:
      
      for my_text, text_pos in my_buttons.items():
        text_surface = my_font.render(my_text, True, WHITE)
        rect = text_surface.get_rect(center=text_pos)
        screen.blit(text_surface, rect)
        pygame.display.flip()
      
      for event in pygame.event.get():
        if(event.type is MOUSEBUTTONDOWN):
         pos = pygame.mouse.get_pos()
        elif(event.type is MOUSEBUTTONUP):
         pos = pygame.mouse.get_pos()
         x,y = pos
         
         #implement state encodings depending on mouse event press      
         if y > 180:
          if x < 160:
          
           state = 1
          else:
           state = 2
          
          if state==1:
              train_done = 0

              screen.fill(BLACK)
              for my_text, text_pos in train_message.items():
                 text_surface = my_font.render(my_text, True, WHITE)
                 rect = text_surface.get_rect(center=(160,30))
                 screen.blit(text_surface, rect)
                 pygame.display.flip()
              #function call to train SVM 
              train_done,train_accuracy,test_accuracy,train_time = train()
              
              #if done with training, display train, test accuracy and train time
              if train_done:
                 state=0
                 screen.fill(BLACK)
                 train_done_message = {'Training done':(100,80)}
                 for my_text, text_pos in train_done_message.items():
                     text_surface = my_font.render(my_text, True, WHITE)
                     rect = text_surface.get_rect(center=(160,30))
                     screen.blit(text_surface, rect)
                     pygame.display.flip()
                 my_font2 = pygame.font.Font(None,22)
                 #display train accuracy
                 my_text = ("Train Accuracy:"+" "+str(train_accuracy)+"%")
                 text_surface=my_font2.render(my_text,True,RED)
                 rect=text_surface.get_rect(center=[160,100])
                 screen.blit(text_surface,rect)
                 pygame.display.flip()
                 #display test accuracy
                 my_text = ("Test Accuracy:"+" "+str(test_accuracy)+"%")
                 text_surface=my_font2.render(my_text,True,GREEN)
                 rect=text_surface.get_rect(center=[160,125])
                 screen.blit(text_surface,rect)
                 #display train time
                 my_text = ("Train Time: "+str(round(train_time,2))+" seconds")
                 text_surface=my_font2.render(my_text,True,RED)
                 rect=text_surface.get_rect(center=[160,80])
                 screen.blit(text_surface,rect)
                 #rectangles around train and test buttons
                 pygame.display.flip()
                 pygame.draw.rect(screen,[255,0,0],buttontrain)
                 pygame.draw.rect(screen,[0,0,255],buttontest)
                 #footer message to tell user to press 27 to exit
                 mytextfooter='Press GPIO 27 to Exit'
                 myfontfooter = pygame.font.Font(None,15)
                 text_surface=myfontfooter.render(mytextfooter,True,WHITE)
                 rect=text_surface.get_rect(center=[160,230])
                 screen.blit(text_surface,rect)
                 pygame.display.flip()
                 print("Training done")
          #if 'Test' button is pressed, test on saved model
          if state==2:
              test_done = 0
              screen.fill(BLACK)
              test_message = {"Testing on unread e-mail":(100,30)}
              for my_text, text_pos in test_message.items():
                 text_surface = my_font.render(my_text, True, WHITE)
                 rect = text_surface.get_rect(center=(160,30))
                 screen.blit(text_surface, rect)
                 pygame.display.flip()
              #use os.system to execute getemail.sh script in shell
              os.system("/home/pi/getemail.sh")
              #get the classification result
              test_done,p,mystring=test()
              if p==1:
                  result='Spam'
              else:
                  result='Not Spam'
              if test_done:
                 state=0
                 screen.fill(BLACK)
                 #display testing done message
                 test_done_message = {'Testing done':(100,80)}
                 for my_text, text_pos in test_done_message.items():
                     text_surface = my_font.render(my_text, True, WHITE)
                     rect = text_surface.get_rect(center=(160,30))
                     screen.blit(text_surface, rect)
                     pygame.display.flip()
                 #get the sender of email and display on piTFT
                 status, output = commands.getstatusoutput("awk '/From/{print}' /home/pi/testmail.txt")
                 my_text=output
                 my_font2=pygame.font.Font(None,18)
                 text_surface=my_font2.render(my_text,True,BLUE)
                 rect=text_surface.get_rect(center=[160,80])
                 screen.blit(text_surface,rect)
                 pygame.display.flip()
                 #get the subject of email and display on piTFT
                 status, output = commands.getstatusoutput("awk '/Subject/{print}' /home/pi/testmail.txt")
                 #if spam, call to test function returns p 1. If p is 1, display spam potentially due to:
                 if p==1:
                     string2 = "Spam potentially due to:"
                     my_text2 = string2
                     my_font2=pygame.font.Font(None,18)
                     text_surface=my_font2.render(my_text2,True,BLUE)
                     rect=text_surface.get_rect(center=[160,155])
                     screen.blit(text_surface,rect)
                     pygame.display.flip()
                     my_text3 = mystring
                     my_font2=pygame.font.Font(None,18)
                     text_surface=my_font2.render(my_text3,True,GREEN)
                     rect=text_surface.get_rect(center=[160,170])
                     screen.blit(text_surface,rect)
                     pygame.display.flip()
                 #output is the result of classificaiton
                 my_text = output
                 my_font2=pygame.font.Font(None,18)
                 text_surface=my_font2.render(my_text,True,BLUE)
                 rect=text_surface.get_rect(center=[160,105])
                 screen.blit(text_surface,rect)
                 pygame.display.flip()
                 #display test result on piTFT
                 my_text=("Test Result: "+str(result))
                 my_font2=pygame.font.Font(None,22)
                 text_surface=my_font2.render(my_text,True,RED)
                 rect=text_surface.get_rect(center=[160,135])
                 screen.blit(text_surface,rect)
                 #draw rectangles around Train and Test buttons
                 pygame.draw.rect(screen,[255,0,0],buttontrain)
                 pygame.draw.rect(screen,[0,0,255],buttontest)
                 #user message to exit on pressing GPIO 27
                 mytextfooter='Press GPIO 27 to Exit'
                 myfontfooter=pygame.font.Font(None,15)
                 text_surface=myfontfooter.render(mytextfooter,True,WHITE)
                 rect=text_surface.get_rect(center=[160,230])
                 screen.blit(text_surface,rect)
                 pygame.display.flip()
      #exit if GPIO 27 is pressed                
      if GPIO.input(27) == GPIO.LOW:
        print("Quitting now")
        pygame.display.quit()
        pygame.quit()
        exit()
pygame.quit()

call_train.py


#E-mail Spam Filtering on Raspberry Pi using Machine Learning
#Authors: Darshan Kumar S Yaradoni (NetID: dsy27) and Aman Tyagi (NetID: at789)
#testing trained model
#This function returns 1 when training completes, train accuracy, test accuracy and training time
def train():
    import config
    # natural language kit
    import nltk
    import time
    nltk.download('punkt')
    nltk.download('stopwords')
    import re
    import numpy as np
    f=open('/home/pi/emailSample1LowerCase.txt','w')
    f.close()
    with open('/home/pi/emailSample1.txt', 'r') as fileinput:
      for line in fileinput:
        #replace digits with string "number"
        line = re.sub(r"[0-9][0-9][0-9]","number",line)
        line = re.sub(r"[0-9][0-9]","number",line)
        line = re.sub(r"[0-9]","number",line)
        #replace URLs with "httpaddr"
        line = re.sub(r"(http|https)://[^\s]*", "httpaddr",line)
        line = re.sub(r"[^\s]+@[^\s]+","emailaddr",line)
        line = re.sub(r"[$]+","dollar",line)
        with open('/home/pi/emailSample1LowerCase.txt', 'a') as the_file:
         the_file.write(line)
    the_file.close()

    filename = 'emailSample1LowerCase.txt'
    file = open(filename, 'rt')
    text = file.read()
    file.close()
    # split into words
    from nltk.tokenize import word_tokenize
    tokens = word_tokenize(text)
    # convert to lower case
    tokens = [w.lower() for w in tokens]
    # remove punctuation from each word
    import string
    table = string.maketrans('', '')
    stripped = [w.translate(table,string.punctuation) for w in tokens]
    # remove remaining tokens that are not alphabetic
    words = [word for word in stripped if word.isalpha()]
    f=open('/home/pi/emailSampleProcessed','w')
    f.close()
    for each_word in words:
      with open('/home/pi/emailSampleProcessed', 'a') as the_file:
        the_file.write(each_word+" ")
    f.close()

    filename = '/home/pi/emailSampleProcessed'
    file = open(filename, 'rt')
    text = file.read()

    from nltk.tokenize import word_tokenize
    tokens = word_tokenize(text)
    # stemming of words
    from nltk.stem.porter import PorterStemmer
    porter = PorterStemmer()
    stemmed = [porter.stem(word) for word in tokens]

    #write to file for further processing
    f=open('/home/pi/emailSampleProcessedFinal','w')
    f.close()
    for each_word in stemmed:
      with open('/home/pi/emailSampleProcessedFinal', 'a') as the_file:
            the_file.write(each_word+" ")
    f.close()

    vocabList = open("/home/pi/vocab.txt","r")

    vocabDictionary = {}
    with open("/home/pi/vocab.txt") as f:
      for line in f:
        (key, val) = line.split()
        vocabDictionary[int(key)] = val


    indices = [];


    for key,value in vocabDictionary.items():
      for each_word in stemmed:
        if each_word==value:
            indices.append(key)

    #extract email Features
    n=1899
    s=(n,1)
    #initialize with zeros
    feature_vector=np.zeros(s)

    for i in range(len(indices)):
      feature_vector[indices[i]]=1


    #load Spam Training Data
    import scipy.io as sio
    mat_contents = sio.loadmat('/home/pi/train.mat')
    X_train=mat_contents['X']
    y_train=mat_contents['y']
    from sklearn import svm
    clf = svm.SVC(kernel='linear', C = 1.0)
    #timer for actual train time
    start_time = 0.0
    train_time = 0.0
    start_time = time.time()
    #train model
    clf.fit(X_train,y_train.ravel())
    train_time = time.time() - start_time
    #save trained model
    from sklearn.externals import joblib
    filename = '/home/pi/emailClassifier.sav'
    joblib.dump(clf, filename)
    config.state_train=1
    #test model
    y_pred_train=clf.predict(X_train)
    count=0.0
    for i in range(len(y_train)):
        if(y_pred_train[i]==y_train[i]):
                count=count+1
    train_accuracy=count/len(y_train)*100
    
    mat_contents=sio.loadmat('/home/pi/spamTest.mat')
    x_test=mat_contents['Xtest']
    y_test=mat_contents['ytest']
    y_pred_test=clf.predict(x_test)
    count=0.0
    for i in range(len(y_test)):
       if(y_pred_test[i]==y_test[i]):
          count=count+1
    test_accuracy = count/len(y_test)*100
    #return train,test accuracy and train time
    return 1,train_accuracy,test_accuracy,train_time

call_test.py


#E-mail Spam Filtering on Raspberry Pi using Machine Learning
#Authors: Darshan Kumar S Yaradoni (NetID: dsy27) and Aman Tyagi (NetID: at789)
#testing trained model
#This function returns 1 when testing is done, returns result of test: Spam/Not Spam and the words which classifier identifies as spam
def test():
   import nltk
   #download required packages
   nltk.download('punkt')
   nltk.download('stopwords')
   import re
   import numpy as np
   f=open('/home/pi/emailSample1LowerCase.txt','w')
   f.close()
   with open('/home/pi/currentmail.txt', 'r') as fileinput:
    for line in fileinput:
     #substitute actual numbers with string 'number'
     line = re.sub(r"[0-9][0-9][0-9]","number",line)
     line = re.sub(r"[0-9][0-9]","number",line)
     line = re.sub(r"[0-9]","number",line)
     #substitute urls with "httpaddr"
     line = re.sub(r"(http|https)://[^\s]*", "httpaddr",line)
     line = re.sub(r"[^\s]+@[^\s]+","emailaddr",line)
     line = re.sub(r"[$]+","dollar",line)
     with open('/home/pi/emailSample1LowerCase.txt', 'a') as the_file:
       the_file.write(line)
   the_file.close()

   filename = '/home/pi/emailSample1LowerCase.txt'
   file = open(filename, 'rt')
   text = file.read()
   file.close()

   from nltk.tokenize import word_tokenize
   tokens = word_tokenize(text)
   #convert to lowercase
   tokens = [w.lower() for w in tokens]
   # remove punctuation from each word
   import string
   table = string.maketrans('', '')
   stripped = [w.translate(table,string.punctuation) for w in tokens]
   # remove remaining tokens that are not alphabetic
   words = [word for word in stripped if word.isalpha()]
   f=open('/home/pi/emailSampleProcessed','w')
   f.close()
   for each_word in words:
    with open('/home/pi/emailSampleProcessed', 'a') as the_file:
     the_file.write(each_word+" ")
   f.close()

   filename = '/home/pi/emailSampleProcessed'
   file = open(filename, 'rt')
   text = file.read()

   from nltk.tokenize import word_tokenize
   tokens = word_tokenize(text)
   # stemming of words
   from nltk.stem.porter import PorterStemmer
   porter = PorterStemmer()
   stemmed = [porter.stem(word) for word in tokens]
   f=open('/home/pi/emailSampleProcessedFinal','w')
   f.close()
   for each_word in stemmed:
    with open('/home/pi/emailSampleProcessedFinal', 'a') as the_file:
        the_file.write(each_word+" ")
   
   #vocabList
   vocabList = open("/home/pi/vocab.txt","r")
   
   vocabDictionary = {}
   with open("/home/pi/vocab.txt") as f:
    for line in f:
        (key, val) = line.split()
        vocabDictionary[int(key)] = val
                          
   indices = [];
   for key,value in vocabDictionary.items():
    for each_word in stemmed:
        if each_word==value:
            indices.append(key)


   #extract email Features
   n=1899
   s=(n,1)
   #initialize with zeros
   feature_vector=np.zeros(s)
   from sklearn.externals import joblib
   for i in range(len(indices)):
     feature_vector[indices[i]]=1
   #load saved trained model
   clf = joblib.load('/home/pi/emailClassifier.sav')
   #predict on extract features
   p = clf.predict(feature_vector.T)
   x = clf.coef_
   y=x
   sort_i=np.sort(y)
   sort_t=sort_i.T
   sort_i_index=np.argsort(y)
   sort_t_index = sort_i_index.T
   indices_new = list(set(indices))
   w1 = vocabDictionary[indices_new[1]]
   w2 = vocabDictionary[indices_new[2]]
   w3 = vocabDictionary[indices_new[3]]
   w4 = vocabDictionary[indices_new[4]]
   w5 = vocabDictionary[indices_new[5]]
   mystr=w1+"  "+w2+"  "+w3+"  "+w4+"  "+w5
   return 1, int(p), mystr

get_email.sh


#E-mail Spam Filtering on Raspberry Pi using Machine Learning
#Authors: Darshan Kumar S Yaradoni (NetID: dsy27) and Aman Tyagi (NetID: at789)
#This function reads unread e-mail and extracts body of e-mail
su - pi -c "fetchmail > /dev/null"
cd /var/tmp
filename=$(ls -lrt mail* | tail -1 | awk '{print $9}')
cp $filename /home/pi/testmail.txt

sudo chown pi /home/pi/testmail.txt
tail -n +76 /home/pi/testmail.txt > /home/pi/test76.txt
sudo chown pi /home/pi/test76.txt
head -n -6 /home/pi/test76.txt > /home/pi/currentmail.txt
sudo chown pi /home/pi/currentmail.txt