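The original post does not show its setup, so here is a minimal sketch of the packages the code below relies on, plus a hypothetical load step for sms_raw. The file name, separator, and column names are assumptions; adjust them to your copy of the SMS Spam Collection.

# packages used throughout this walkthrough (assumed; attach before running)
library(tm)            # corpus creation and cleaning
library(wordcloud)     # word cloud plots
library(RColorBrewer)  # brewer.pal() colour palettes
library(e1071)         # naiveBayes()
library(gmodels)       # CrossTable()
library(caret)         # confusionMatrix()

# hypothetical load step: the SMS Spam Collection is commonly distributed as a
# tab-separated file with no header row
sms_raw <- read.delim("sms_spam.txt", header = FALSE, sep = "\t",
                      quote = "", stringsAsFactors = FALSE,
                      col.names = c("type", "text"))
sms_raw$type <- factor(sms_raw$type)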

Data Preview:

head(sms_raw)


 

##   type text
## 1  ham Go until jurong point, crazy.. Available only in bugis n great world la e buffet… Cine there got amore wat…
## 2  ham Ok lar… Joking wif u oni…
## 3 spam Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
## 4  ham U dun say so early hor… U c already then say…
## 5  ham Nah I don't think he goes to usf, he lives around here though
## 6 spam FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv

  

str(sms_raw)

 

## 'data.frame':    5574 obs. of  2 variables:
##  $ type: Factor w/ 2 levels "ham","spam": 1 1 2 1 1 2 1 1 2 2 ...
##  $ text: chr  "Go until jurong point, crazy.. Available only in bugis n great world la e buffet… Cine there got amore wat…" "Ok lar… Joking wif u oni…" "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C"| __truncated__ "U dun say so early hor… U c already then say…" ...

 

Target Variable:

Counts and proportions:

table(sms_raw$type)

 

## 
##  ham spam 
## 4827  747

 

 

round(prop.table(table(sms_raw$type)), digits = 2)

 

## 
##  ham spam 
## 0.87 0.13

 

Now we will convert the dataset into a bag of words, a representation in which word order is discarded.
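
To make "no order" concrete, here is a minimal toy sketch (the two example messages are made up, not drawn from the dataset): two documents containing the same words in a different order produce identical rows in a document-term matrix.

# toy illustration: word order is discarded in a bag-of-words representation
toy <- VCorpus(VectorSource(c("free entry now", "now entry free")))
as.matrix(DocumentTermMatrix(toy))

##     Terms
## Docs entry free now
##    1     1    1   1
##    2     1    1   1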

Spam word cloud:

spam <- subset(sms_raw, type == "spam")
wordcloud(spam$text, max.words = 60, colors = brewer.pal(5, "Dark2"), random.order = FALSE)

Step 2 – Data Preprocessing

The first step revolves around the creation of a corpus, which is a collection of text documents. The data then needs to be standardized and cleansed; cleansing here means removing numbers, punctuation, and other noise from the data.

Corpus: this corpus consists of 5574 messages.

# Steps to creating a corpus
# Step 1: Prepare a vector source object using VectorSource
# Step 2: Supply the vector source to VCorpus, to import from sources
sms_corpus <- VCorpus(VectorSource(sms_raw$text))

# To view a message, use double brackets and as.character()
lapply(sms_corpus[1:2], as.character)

## $`1`
## [1] "Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..."
## 
## $`2`
## [1] "Ok lar... Joking wif u oni..."

Corpus cleaning: must be done before the separation into words, using the function tm_map().

# convert to lowercase
sms_corpus_clean <- tm_map(sms_corpus, content_transformer(tolower))

# remove numbers, as numbers are unique
sms_corpus_clean <- tm_map(sms_corpus_clean, content_transformer(removeNumbers))

# remove stop words (to, or, but, and, ...); stopwords() supplies the list of words we don't want
sms_corpus_clean <- tm_map(sms_corpus_clean, removeWords, stopwords())

# remove punctuation
sms_corpus_clean <- tm_map(sms_corpus_clean, removePunctuation)

# apply stemming, removing suffixes: (learns, learning, learned) --> learn
sms_corpus_clean <- tm_map(sms_corpus_clean, stemDocument)

# lastly, strip additional whitespace
sms_corpus_clean <- tm_map(sms_corpus_clean, stripWhitespace)

Post corpus cleansing, the following has been applied:

Punctuation removal
Number removal
Stop word removal
Stemming
Conversion to lowercase
Removal of additional whitespace

# view the first three cleaned messages
lapply(sms_corpus_clean[1:3], as.character)

Data Preparation: dividing messages into individual words.

# convert our corpus to a DTM
sms_dtm <- DocumentTermMatrix(sms_corpus_clean)

# dimensions of the DTM
dim(sms_dtm)

## [1] 5574 6617

# alternate way to do all the cleansing in one go, starting from the raw corpus
sms_dtm2 <- DocumentTermMatrix(sms_corpus, control =
                 list(tolower = TRUE,
                      removeNumbers = TRUE,
                      stopwords = TRUE,
                      removePunctuation = TRUE,
                      stemming = TRUE))

Corpus word cloud:

wordcloud(sms_corpus_clean, min.freq = 50, random.order = FALSE, colors = brewer.pal(8, "Dark2"))

Creating Training and Test Dataset: the dataset will be divided into two portions, training and test, in a 75/25 split.

Preparing the training and test sets:

# Training set
sms_dtm_train <- sms_dtm[1:4180, ]

# Test set
sms_dtm_test <- sms_dtm[4181:5574, ]

# Training labels
sms_train_labels <- sms_raw[1:4180, ]$type

# Test labels
sms_test_labels <- sms_raw[4181:5574, ]$type

To ensure the training and test sets are representative, both should have roughly the same proportion of spam and ham.

# Proportions for the training labels
prop.table(table(sms_train_labels))

## sms_train_labels
##       ham      spam 
## 0.8648325 0.1351675

# Proportions for the test labels
prop.table(table(sms_test_labels))

## sms_test_labels
##       ham      spam 
## 0.8694405 0.1305595

Creating Indicator Features:

# find words that appear at least 5 times
sms_freq_words <- findFreqTerms(sms_dtm_train, 5)

# preview of the most frequent words: 1166 terms with at least 5 occurrences
str(sms_freq_words)

##  chr [1:1166] "…" "wk" "…" …
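
A few of the 1166 frequent terms are non-ASCII (they appear mangled in the str() output above). As an optional aside, a small sketch for previewing the terms and, if desired, dropping the non-ASCII ones; the cleanup step is an assumption of this write-up, not part of the original pipeline:

# preview the frequent terms
head(sms_freq_words, 10)

# optional (an assumption, not in the original pipeline): keep only terms
# made of printable ASCII characters, dropping mangled/currency tokens
ascii_only <- !grepl("[^ -~]", sms_freq_words, useBytes = TRUE)
sms_freq_words_ascii <- sms_freq_words[ascii_only]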

 

# filter the DTM sparse matrices to only contain words with at least 5 occurrences,
# reducing the features in our DTM
sms_dtm_freq_train <- sms_dtm_train[, sms_freq_words]
sms_dtm_freq_test <- sms_dtm_test[, sms_freq_words]

Since Naive Bayes trains on categorical data, the numerical data must be converted to categorical: the counts in our two sparse matrices become "Yes"/"No" levels.

# create a function to convert zeros and non-zeros into "No" and "Yes"
convert_counts <- function(x) {
  x <- ifelse(x > 0, "Yes", "No")
}

# apply to the reduced train and test DTMs, column by column (MARGIN = 2)
sms_train <- apply(sms_dtm_freq_train, MARGIN = 2, convert_counts)
sms_test <- apply(sms_dtm_freq_test, MARGIN = 2, convert_counts)

# check the structure of both matrices
str(sms_train)

##  chr [1:4180, 1:1166] "No" "No" "No" "No" "No" "Yes" ...
##  - attr(*, "dimnames")=List of 2
##   ..$ Docs : chr [1:4180] "1" "2" "3" "4" ...
##   ..$ Terms: chr [1:1166] "…" "wk" "…" …

 

str(sms_test)

 

##  chr [1:1394, 1:1166] "No" "No" "No" "Yes" "No" "No" ...
##  - attr(*, "dimnames")=List of 2
##   ..$ Docs : chr [1:1394] "4181" "4182" "4183" "4184" ...
##   ..$ Terms: chr [1:1166] "…" "wk" "…" …
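
To see exactly what convert_counts does, here is a toy run on a tiny made-up counts matrix (the document and term names are illustrative only):

# rows are documents, columns are term counts
m <- matrix(c(0, 2, 1, 0), nrow = 2,
            dimnames = list(c("d1", "d2"), c("free", "win")))

# every zero becomes "No", every positive count becomes "Yes"
apply(m, MARGIN = 2, convert_counts)

##    free  win  
## d1 "No"  "Yes"
## d2 "Yes" "No" 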

Step 3 – Model training

Now we will finally apply the Naïve Bayes Algorithm.
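
In brief, naiveBayes() estimates the prior probability of each class and, for every term, the conditional probability of seeing it in a spam versus a ham message. A new message containing terms w1, ..., wn is then scored as P(spam | w1, ..., wn) ∝ P(spam) × P(w1 | spam) × ... × P(wn | spam), and analogously for ham, under the "naive" assumption that terms occur independently given the class; the class with the larger posterior is predicted.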

# applying Naive Bayes to the training set
sms_classifier <- naiveBayes(sms_train, sms_train_labels, laplace = 0)

# predicting on the test set
sms_test_pred <- predict(sms_classifier, sms_test)

# preview of the output
head(data.frame("actual" = sms_test_labels, "predicted" = sms_test_pred))

##   actual predicted
## 1    ham       ham
## 2    ham       ham
## 3    ham       ham
## 4   spam      spam
## 5    ham       ham
## 6    ham       ham

Step 4 – Model Evaluation

Using a cross table:

CrossTable(sms_test_pred, sms_test_labels, prop.chisq = FALSE, dnn = c("predicted", "actual"))

## 
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
## Total Observations in Table:  1394 
## 
##              | actual 
##    predicted |       ham |      spam | Row Total | 
## -------------|-----------|-----------|-----------|
##          ham |      1205 |        21 |      1226 | 
##              |     0.983 |     0.017 |     0.879 | 
##              |     0.994 |     0.115 |           | 
##              |     0.864 |     0.015 |           | 
## -------------|-----------|-----------|-----------|
##         spam |         7 |       161 |       168 | 
##              |     0.042 |     0.958 |     0.121 | 
##              |     0.006 |     0.885 |           | 
##              |     0.005 |     0.115 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |      1212 |       182 |      1394 | 
##              |     0.869 |     0.131 |           | 
## -------------|-----------|-----------|-----------|

Confusion matrix:

confusionMatrix(sms_test_pred, sms_test_labels, dnn = c("predicted", "actual"))

## Confusion Matrix and Statistics
## 
##          actual
## predicted  ham spam
##      ham  1205   21
##      spam    7  161
##                                           
##                Accuracy : 0.9799          
##                  95% CI : (0.9711, 0.9866)
##     No Information Rate : 0.8694          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.9085          
##  Mcnemar's Test P-Value : 0.01402         
##                                           
##             Sensitivity : 0.9942          
##             Specificity : 0.8846          
##          Pos Pred Value : 0.9829          
##          Neg Pred Value : 0.9583          
##              Prevalence : 0.8694          
##          Detection Rate : 0.8644          
##    Detection Prevalence : 0.8795          
##       Balanced Accuracy : 0.9394          
##                                           
##        'Positive' Class : ham             

From the two tables above, the model achieves a strong accuracy of nearly 98%; its errors are 21 spam messages misclassified as ham and 7 ham messages incorrectly flagged as spam.
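
A common follow-up, sketched here but not run as part of the original write-up, is to retrain with Laplace smoothing: with laplace = 0, any term that never appears in one class during training contributes a zero factor and can veto that class outright, whereas laplace = 1 gives every term/class combination a small non-zero probability. Whether it improves the numbers above would need to be verified:

# retrain with Laplace smoothing and re-evaluate (a sketch; results not shown)
sms_classifier2 <- naiveBayes(sms_train, sms_train_labels, laplace = 1)
sms_test_pred2 <- predict(sms_classifier2, sms_test)
CrossTable(sms_test_pred2, sms_test_labels, prop.chisq = FALSE,
           dnn = c("predicted", "actual"))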
