A Quick Sentiment Analysis Example with the tidytext Package in R

Find the data here: https://charleshsliao.wordpress.com/2017/03/03/a-sms-spam-test-with-naive-bayes-in-r-with-text-processing/

If we wanted to, we could explore the sentiment of the ham and spam messages separately; here I chose not to filter by message type, but a quick sketch of that comparison is given below.
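As an aside, here is a minimal, hedged sketch of what that per-type comparison could look like, assuming rawtext has already been read in and its columns renamed to type and msg as in the code that follows. The Bing lexicon and the object name sentiment_by_type are my own choices for illustration, not part of the original walkthrough.

<pre>
library(dplyr)
library(tidytext)

# Tokenize while keeping the ham/spam label, then join against the Bing
# lexicon and count positive vs. negative words per message type
sentiment_by_type <- rawtext %>%
  unnest_tokens(word, msg) %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(type, sentiment)

sentiment_by_type
</pre>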

<pre>
# Read the labelled SMS data: V1 is the ham/spam label, V2 is the message text
rawtext <- read.csv("HamorSpam.csv", header = FALSE, sep = ",", stringsAsFactors = FALSE)
str(rawtext)

## 'data.frame':    5572 obs. of  2 variables:
##  $ V1: chr  "ham" "ham" "spam" "ham" ...
##  $ V2: chr  "Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..." "Ok lar... Joking wif u oni..." "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C"| __truncated__ "U dun say so early hor... U c already then say..." ...

library(dplyr)

##
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
##
##     filter, lag

## The following objects are masked from 'package:base':
##
##     intersect, setdiff, setequal, union

# Give the columns descriptive names
rawtext <- rename(rawtext, type = V1, msg = V2)
library(tidytext)
library(tm)

## Loading required package: NLP

# Put the messages into a one-column tibble so tidytext can tokenize them
msgdf <- tibble(msg = rawtext$msg)
# unnest_tokens() splits each message into one lowercase word per row
tidymsg <- unnest_tokens(msgdf, word, msg)
tidymsg

## # A tibble: 87,598 × 1
##         word
##        <chr>
## 1         go
## 2      until
## 3     jurong
## 4      point
## 5      crazy
## 6  available
## 7       only
## 8         in
## 9      bugis
## 10         n
## # ... with 87,588 more rows

# Build a tm corpus from the tokens (one word per document); note VectorSource()
# needs the character vector itself, not the whole tibble
msgcorpus <- VCorpus(VectorSource(tidymsg$word))
# Normalise encoding ('UTF-8-MAC' is macOS-specific; plain 'UTF-8' works elsewhere)
msgcorpus <- tm_map(msgcorpus, content_transformer(function(x)
  iconv(x, to = 'UTF-8-MAC', sub = 'byte')), mc.cores = 1)
library(SnowballC)
# Standard tm cleaning: lower-case, drop numbers, stop words and punctuation,
# stem, and collapse extra whitespace
msgclean <- tm_map(msgcorpus, content_transformer(tolower))
msgclean <- tm_map(msgclean, removeNumbers)
msgclean <- tm_map(msgclean, removeWords, stopwords())
msgclean <- tm_map(msgclean, removePunctuation)
msgclean <- tm_map(msgclean, stemDocument)
msgclean <- tm_map(msgclean, stripWhitespace)
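# Aside (not in the original post): stop words could also be removed directly in
# the tidy framework, without a tm corpus, e.g. with tidytext's stop_words table:
tidymsg_nostop <- anti_join(tidymsg, stop_words, by = "word")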
library(qdap) # loaded here, though no qdap functions are used in the rest of this example

## Loading required package: qdapDictionaries

## Loading required package: qdapRegex

##
## Attaching package: 'qdapRegex'

## The following objects are masked from 'package:dplyr':
##
##     escape, explain

## Loading required package: qdapTools

##
## Attaching package: 'qdapTools'

## The following object is masked from 'package:dplyr':
##
##     id

## Loading required package: RColorBrewer

##
## Attaching package: 'qdap'

## The following objects are masked from 'package:tm':
##
##     as.DocumentTermMatrix, as.TermDocumentMatrix

## The following object is masked from 'package:NLP':
##
##     ngrams

## The following object is masked from 'package:dplyr':
##
##     %>%

## The following object is masked from 'package:base':
##
##     Filter

# as.data.frame() has no method for a VCorpus, so tidytext's tidy() is used here
# instead; it returns the cleaned documents in a column named "text"
msgcleantidy <- tidy(msgclean)
msgtouse <- unnest_tokens(msgcleantidy, word, text)
# The NRC lexicon categorizes words in a binary fashion ("yes"/"no") into
# categories of positive, negative, anger, anticipation, disgust, fear, joy,
# sadness, surprise, and trust. (The sketch after this code block extends the
# count below to all ten categories.)
nrc_positive <- get_sentiments("nrc") %>% filter(sentiment == "positive")
msgpositive <- msgtouse %>% semi_join(nrc_positive) %>% count(word, sort = TRUE)

## Joining, by = "word"

# Proportion of all tokens that the NRC lexicon tags as positive
mpndf <- as.data.frame(msgpositive)
tmndf <- count(msgtouse, word, sort = TRUE)
positive_words_percentage <- sum(mpndf$n) / sum(tmndf$n)
positive_words_percentage

## [1] 0.07419504
</pre>
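The same idea extends to the other NRC categories. Below is a rough sketch (not from the original post; nrc_breakdown and negative_words_percentage are names introduced here for illustration) of how the full breakdown, and the corresponding negative share, could be computed. Note that in newer tidytext releases get_sentiments("nrc") fetches the lexicon through the textdata package.

<pre>
# Count how many tokens fall into each of the ten NRC categories
nrc_breakdown <- msgtouse %>%
  inner_join(get_sentiments("nrc"), by = "word") %>%
  count(sentiment, sort = TRUE)
nrc_breakdown

# The negative share, computed the same way as the positive share above
negative_n <- nrc_breakdown$n[nrc_breakdown$sentiment == "negative"]
negative_words_percentage <- negative_n / sum(count(msgtouse, word)$n)
negative_words_percentage
</pre>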