
Data Preview:

head(sms_raw)
##   type text
## 1  ham Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
## 2  ham Ok lar... Joking wif u oni...
## 3 spam Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
## 4  ham U dun say so early hor... U c already then say...
## 5  ham Nah I don't think he goes to usf, he lives around here though
## 6 spam FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv

str(sms_raw)
## 'data.frame':    5574 obs. of  2 variables:
##  $ type: Factor w/ 2 levels "ham","spam": 1 1 2 1 1 2 1 1 2 2 ...
##  $ text: chr  "Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..." "Ok lar... Joking wif u oni..." "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C"| __truncated__ "U dun say so early hor... U c already then say..." ...

Target Variable: counts and proportions

table(sms_raw$type)
## 
##  ham spam 
## 4827  747

round(prop.table(table(sms_raw$type)), digits = 2)
## 
##  ham spam 
## 0.87 0.13

Now we will convert the dataset into a bag of words, which has no word order. First, a word cloud of the spam messages:

spam <- subset(sms_raw, type == "spam")
wordcloud(spam$text, max.words = 60, colors = brewer.pal(5, "Dark2"), random.order = FALSE)

Step 2 - Data Preprocessing

The first step is the creation of a data corpus, which is a collection of text documents.

Next we need to standardize and cleanse the data. Data cleansing is the removal of numbers and punctuation from the data. Corpus: this corpus consists of 5,574 messages.

# Steps to creating a corpus
# Step 1: Prepare a vector source object using VectorSource
# Step 2: Supply the vector source to VCorpus, to import from sources
sms_corpus <- VCorpus(VectorSource(sms_raw$text))

# To view a message, use double brackets and as.character()
lapply(sms_corpus[1:2], as.character)
## $`1`
## [1] "Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..."
## 
## $`2`
## [1] "Ok lar... Joking wif u oni..."

Corpus Cleaning: must be done before separating messages into words.

Using the function tm_map():

# convert to lowercase
sms_corpus_clean <- tm_map(sms_corpus, content_transformer(tolower))

# remove numbers, as numbers tend to be unique to a message
sms_corpus_clean <- tm_map(sms_corpus_clean, removeNumbers)

# remove stop words, i.e. to, or, but, and; stopwords() supplies the list of words we don't want
sms_corpus_clean <- tm_map(sms_corpus_clean, removeWords, stopwords())

# remove punctuation, i.e. "", .., ', `
sms_corpus_clean <- tm_map(sms_corpus_clean, removePunctuation)

# apply stemming, removing suffixes: (learns, learning, learned) --> learn
sms_corpus_clean <- tm_map(sms_corpus_clean, stemDocument)

# lastly, strip additional whitespace
sms_corpus_clean <- tm_map(sms_corpus_clean, stripWhitespace)

Post corpus cleansing, this includes:
- Punctuation removal
- Number removal
- Stop word removal
- Applied stemming
- Conversion to lowercase
- Removal of additional whitespace

lapply(sms_corpus_clean[1:3], as.character)

Data Preparation: dividing messages into individual words.

# convert our corpus to a DTM
sms_dtm <- DocumentTermMatrix(sms_corpus_clean)

# dimensions of the DTM
dim(sms_dtm)
## [1] 5574 6617

# alternate way to do all the cleansing in one go, starting from the raw corpus
sms_dtm2 <- DocumentTermMatrix(sms_corpus,
                               control = list(tolower = TRUE,
                                              removeNumbers = TRUE,
                                              stopwords = TRUE,
                                              removePunctuation = TRUE,
                                              stemming = TRUE))

Corpus word cloud:

wordcloud(sms_corpus_clean, min.freq = 50, random.order = FALSE, colors = brewer.pal(8, "Dark2"))

Creating Training and Test Datasets: the dataset will be divided into two portions, training and test, in a 75 to 25 percentage split.

Preparing the training and test sets:

# Training set
sms_dtm_train <- sms_dtm[1:4180, ]

# Test set
sms_dtm_test <- sms_dtm[4181:5574, ]

# Training labels
sms_train_labels <- sms_raw[1:4180, ]$type

# Test labels
sms_test_labels <- sms_raw[4181:5574, ]$type

To ensure the train and test sets are representative, both should have roughly the same proportion of spam and ham.

# Proportions for the train labels
prop.table(table(sms_train_labels))
## sms_train_labels
##       ham      spam 
## 0.8648325 0.1351675

# Proportions for the test labels
prop.table(table(sms_test_labels))
## sms_test_labels
##       ham      spam 
## 0.8694405 0.1305595

Creating Indicator Features:

# find words that appear at least 5 times
sms_freq_words <- findFreqTerms(sms_dtm_train, 5)

# preview of the most frequent words: 1166 terms with at least 5 occurrences
str(sms_freq_words)
##  chr [1:1166] ...

# filter the sparse DTM to only contain words with at least 5 occurrences,
# reducing the features in our DTM
sms_dtm_freq_train <- sms_dtm_train[ , sms_freq_words]
sms_dtm_freq_test <- sms_dtm_test[ , sms_freq_words]

Since Naive Bayes trains on categorical data, the numerical data must be converted to categorical data.

We need to convert the counts in our two sparse matrices into Yes or No levels.

# create a function that converts zeros and non-zeros into "No" or "Yes"
convert_counts <- function(x){
  x <- ifelse(x > 0, "Yes", "No")
}

# apply to the reduced train and test DTMs, column by column
sms_train <- apply(sms_dtm_freq_train, MARGIN = 2, convert_counts)
sms_test <- apply(sms_dtm_freq_test, MARGIN = 2, convert_counts)

# check the structure of both matrices
str(sms_train)
##  chr [1:4180, 1:1166] "No" "No" "No" "No" "No" "Yes" ...
##  - attr(*, "dimnames")=List of 2
##   ..$ Docs : chr [1:4180] "1" "2" "3" "4" ...
##   ..$ Terms: chr [1:1166] ...

str(sms_test)
##  chr [1:1394, 1:1166] "No" "No" "No" "Yes" "No" "No" ...
##  - attr(*, "dimnames")=List of 2
##   ..$ Docs : chr [1:1394] "4181" "4182" "4183" "4184" ...
##   ..$ Terms: chr [1:1166] ...

Step 3 - Model Training

Now we will finally apply the Naïve Bayes algorithm.
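Conceptually, the classifier estimates a prior P(class) and a per-class likelihood P(word = "Yes" | class) from the training matrix, then picks the class with the highest posterior for each message. Below is a minimal, language-agnostic sketch in Python of that idea on Yes/No features, with Laplace smoothing to avoid zero probabilities. It is illustrative only: the function names (train_nb, predict_nb) and the toy data are hypothetical, not the e1071 implementation used in this tutorial.

```python
from math import log

def train_nb(X, y, laplace=1.0):
    """Estimate log-priors and per-class P(word = "Yes") with Laplace smoothing.
    X: list of dicts mapping word -> "Yes"/"No"; y: list of class labels."""
    classes = sorted(set(y))
    vocab = sorted({w for row in X for w in row})
    prior = {c: log(sum(1 for lbl in y if lbl == c) / len(y)) for c in classes}
    likelihood = {}
    for c in classes:
        rows = [row for row, lbl in zip(X, y) if lbl == c]
        for w in vocab:
            yes = sum(1 for row in rows if row.get(w) == "Yes")
            # Laplace smoothing: add `laplace` to each of the two outcomes (Yes/No)
            likelihood[(w, c)] = (yes + laplace) / (len(rows) + 2 * laplace)
    return classes, vocab, prior, likelihood

def predict_nb(msg, classes, vocab, prior, likelihood):
    """Return the class with the highest log-posterior for one message."""
    def score(c):
        s = prior[c]
        for w in vocab:
            p = likelihood[(w, c)]
            s += log(p) if msg.get(w) == "Yes" else log(1 - p)
        return s
    return max(classes, key=score)

# Toy training data: "free"/"win" mark spam, "meeting" marks ham
X = [{"free": "Yes", "win": "Yes", "meeting": "No"},
     {"free": "Yes", "win": "No",  "meeting": "No"},
     {"free": "No",  "win": "No",  "meeting": "Yes"},
     {"free": "No",  "win": "No",  "meeting": "Yes"}]
y = ["spam", "spam", "ham", "ham"]

model = train_nb(X, y)
print(predict_nb({"free": "Yes", "win": "Yes", "meeting": "No"}, *model))  # spam
print(predict_nb({"free": "No", "win": "No", "meeting": "Yes"}, *model))   # ham
```

Note that the R call below uses laplace = 0 (no smoothing); with larger real-world vocabularies, a small positive laplace value is the usual safeguard against words that never appear in one class.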

# train a Naive Bayes model on the training set
sms_classifier <- naiveBayes(sms_train, sms_train_labels, laplace = 0)

# apply it to the test set
sms_test_pred <- predict(sms_classifier, sms_test)

# preview of the output
head(data.frame("actual" = sms_test_labels, "predicted" = sms_test_pred))
##   actual predicted
## 1    ham       ham
## 2    ham       ham
## 3    ham       ham
## 4   spam      spam
## 5    ham       ham
## 6    ham       ham

Step 4 - Model Evaluation

Using a cross table:

CrossTable(sms_test_pred, sms_test_labels, prop.chisq = FALSE, dnn = c("predicted", "actual"))
## 
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
## Total Observations in Table:  1394 
## 
##              | actual 
##    predicted |       ham |      spam | Row Total | 
## -------------|-----------|-----------|-----------|
##          ham |      1205 |        21 |      1226 | 
##              |     0.983 |     0.017 |     0.879 | 
##              |     0.994 |     0.115 |           | 
##              |     0.864 |     0.015 |           | 
## -------------|-----------|-----------|-----------|
##         spam |         7 |       161 |       168 | 
##              |     0.042 |     0.958 |     0.121 | 
##              |     0.006 |     0.885 |           | 
##              |     0.005 |     0.115 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |      1212 |       182 |      1394 | 
##              |     0.869 |     0.131 |           | 
## -------------|-----------|-----------|-----------|

Confusion Matrix:

confusionMatrix(sms_test_pred, sms_test_labels, dnn = c("predicted", "actual"))
## Confusion Matrix and Statistics
## 
##          actual
## predicted  ham spam
##      ham  1205   21
##      spam    7  161
##                                           
##                Accuracy : 0.9799          
##                  95% CI : (0.9711, 0.9866)
##     No Information Rate : 0.8694          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.9085          
##  Mcnemar's Test P-Value : 0.01402         
##                                           
##             Sensitivity : 0.9942          
##             Specificity : 0.8846          
##          Pos Pred Value : 0.9829          
##          Neg Pred Value : 0.9583          
##              Prevalence : 0.8694          
##          Detection Rate : 0.8644          
##    Detection Prevalence : 0.8795          
##       Balanced Accuracy : 0.9394          
##                                           
##        'Positive' Class : ham             

From the two tables above, the model has a decent accuracy of nearly 98%, misclassifying 21 spam messages as ham and 7 ham messages as spam.
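The headline statistics in the confusionMatrix() output follow directly from the four cell counts. As a quick sanity check, this short Python snippet (counts taken from the table above, with "ham" as the positive class, matching the output) reproduces them:

```python
# Confusion matrix cells from the output above; "ham" is the positive class
tp = 1205  # predicted ham,  actual ham
fn = 7     # predicted spam, actual ham
fp = 21    # predicted ham,  actual spam
tn = 161   # predicted spam, actual spam
total = tp + fn + fp + tn  # 1394 test messages

accuracy = (tp + tn) / total
sensitivity = tp / (tp + fn)     # recall for ham
specificity = tn / (tn + fp)     # recall for spam
pos_pred_value = tp / (tp + fp)  # precision for ham

print(round(accuracy, 4))        # 0.9799
print(round(sensitivity, 4))     # 0.9942
print(round(specificity, 4))     # 0.8846
print(round(pos_pred_value, 4))  # 0.9829
```

These match the Accuracy, Sensitivity, Specificity, and Pos Pred Value rows reported by confusionMatrix().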

