Tag Results
8 posts tagged R
8 posts tagged R
The last post talked about R in the cloud and made reference to how United Airlines was possibly influencing brand sentiment through a contest they are having for a trip to the Olympics. After a bit of hacking in R this is the recent sentiment I’ve seen after treating the campaign #FlyMetoLondon as neutral (as these are optionally generated after a user registers for a contest). I would say this is a very effective campaign for brand management.

Nearly 1200 of the most recent 1500 posts were coming from this contest, more than offsetting the otherwise neutral to negative sentiment from other tweets. I guess we’ll see how this campaign shifts the ambient sentiment when the campaign is over, as not scientifically meaningful, but previous assessment showed an approval rating of 27% (i.e. 73% neutral to negative).
Update
This is from sentiment140 run at 11pm EST using @united as the search query.

To clarify, last time I measured the sentiment, it was seemingly very negative overall (as roughly confirmed by the chart above), now factoring in the campaign, it is seemingly very positive, but what will be the sentiment when the campaign is over?
Here is the relevant R code for reference …
library(twitteR)
united.tweets = searchTwitter('@united', n=1500)
tweet = united.tweets[[1]]
class(tweet)
tweet$screenName
tweet$text
library(plyr)
united.text = laply(united.tweets, function(t) t$getText())
head(united.text, 5)
hu.liu.pos = scan('/users/kindageeky/Downloads/positive-words.txt', what='character', comment.char=';')
hu.liu.neg = scan('/users/kindageeky/Downloads/negative-words.txt', what='character', comment.char=';')
pos.words = c(hu.liu.pos, 'upgraded', 'early', 'safe')
neg.words = c(hu.liu.neg, 'wtf', 'delay', 'late', 'broke', 'lost', 'dated', 'damage', 'herd', 'epicfail', 'sitting', 'mech', 'tarmac', 'ugh' )
score.sentiment = function(sentences, pos.words, neg.words, .progress='none', brand)
{
require(plyr)
require(stringr)
# we got a vector of sentences. plyr will handle a list or a vector as an "l" for us
# we want a simple array of scores back, so we use "l" + "a" + "ply" = laply:
scores = laply(sentences, function(sentence, pos.words, neg.words) {
# clean up sentences with R's regex-driven global substitute, gsub():
sentence = gsub('[[:punct:]]', '', sentence)
sentence = gsub('[[:cntrl:]]', '', sentence)
sentence = gsub('\\d+', '', sentence)
# and convert to lower case, may blow up:
t 0)
{
#print('marketing')
return(0)
}
else {
# compare our words to the dictionaries of positive & negative terms
pos.matches = match(words, pos.words)
neg.matches = match(words, neg.words)
# match() returns the position of the matched term or NA
# we just want a TRUE/FALSE:
pos.matches = !is.na(pos.matches)
neg.matches = !is.na(neg.matches)
# and conveniently enough, TRUE/FALSE will be treated as 1/0 by sum():
score = sum(pos.matches) - sum(neg.matches)
}
return(score)
}, pos.words, neg.words, .progress=.progress )
scores.df = data.frame(score=scores, text=sentences)
return(scores.df)
}
united.scores = score.sentiment(united.text, pos.words, neg.words, .progress='text', 'flymetolondon')
library(ggplot2)
qplot(united.scores$score)
hist(united.scores$score)
Cloudstat is R in the cloud, the gallery app they have up so far inspects a twitter search, which reminds me of the clever presentation on twitter sentiment analysis with R by Jeffrey Breen. To amuse yourself, click on the title link and search @united. I hacked my way through the same presentation a few months back and at the time measured @united’s sentiment at 27%, maybe not surprising as airlines generally fare worse in approvals than the postal service. Anyway, seems like there is a lot more brand management going on now as @united mentions itself in a positive light anytime a user registers for a contest for a trip to london. Here is the topic cloud that cloudstat generated: 
Yet when you look at the posts minus the brand management, you get a very different picture, for example this one …
@United’s plane was 3 hours late getting in to Boston. They asked me to fill out a survey, which is 404. http://t.co/lnpeEUwB
This whole space around sentiment analysis is very interesting, and especially hard to solve with posts as short as twitter. I have yet to see any evidence that sarcasm is being detected in sentiment analysis services, though there appears to be some promising research.
Back to CloudStat, digging deeper there are various recipes (mostly against twitter) for doing analysis with R. Apparently, I could work on my politeness.
> Result(x)
@kindageeky is ok polite, only better than 59 percent of twitter users.
[1] "Ok."
>
This is the output of CloudStat’s politeness tracker. The notion of having an R language Cloud Robot is kinda cool, will have to look through the tutorials.
This is what Pier came up with using only Math … plan to whip up an R implementation soon … brilliant
Without programming:
\(a= 2mn; b= m^2 -n^2;\)
\(c= m^2 + n^2;\)
\(a + b + c = 1000;\)
\(2mn + (m^2 -n^2) + (m^2 + n^2) = 1000;\)
\(2mn + 2m^2 = 1000;\)
\(2m(m+n) = 1000;\)
\(m(m+n) = 500;\)
\(m>n;\)
//Simple R routine goes here …
\(m= 20; n= 5; a= 200; b= 375; c= 425;\)
more fun with R parsing, yick!
reverse <- function(x)
{
sapply(lapply(strsplit(x, NULL), rev), paste,collapse=”“)
}
isPalindrome <- function(x, y)
{
x<-toString(x)
y<-reverse(x)
a <- strtoi(x, 10)
b <- strtoi(y, 10)
if(a==b){ return(TRUE) } else {return(FALSE)}
}
maxPalindrome <- 0
number1 <- 0
number2 <- 0
for(i in 100:999)
{
for(j in 100:999)
{
k <- i*j
if(k>maxPalindrome)
{
if(isPalindrome(k))
{
maxPalindrome <- k
number1 <- i
number2 <- j
}
}
}
}
print(maxPalindrome)
print(number1)
print(number2)
what I learned tonight is that string parsing in R kinda sucks … overcame limitations with the stringr library … problem was finding the max product of a 5 digit sequence in a 1000 digit number.
productMax <- 0
for(i in 1:996)
{
j <- i+4
x <-str_sub(raw, i, j)
#print(x)
y <- as.integer(str_sub(x, 1, 1))
for(k in 2:5)
{
y <- c(y,as.integer(str_sub(x, k, k)))
}
#print(y)
z <- 1
for(m in 1:5)
{
z <- z*y[m];
if(z==0)
{
break
}
}
#print(z)
if(z>productMax) {productMax <- z}
}
print(productMax)
pretty simple …
x <- seq(1,100,1)
> x
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
[27] 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52
[53] 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78
[79] 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
> sumOfSquares <- 0
> for(i in x){sumOfSquares <- c(sumOfSquares, i^2)}
> sumOfSquares
[1] 0 1 4 9 16 25 36 49 64 81 100 121 144 169 196 225 256
[18] 289 324 361 400 441 484 529 576 625 676 729 784 841 900 961 1024 1089
[35] 1156 1225 1296 1369 1444 1521 1600 1681 1764 1849 1936 2025 2116 2209 2304 2401 2500
[52] 2601 2704 2809 2916 3025 3136 3249 3364 3481 3600 3721 3844 3969 4096 4225 4356 4489
[69] 4624 4761 4900 5041 5184 5329 5476 5625 5776 5929 6084 6241 6400 6561 6724 6889 7056
[86] 7225 7396 7569 7744 7921 8100 8281 8464 8649 8836 9025 9216 9409 9604 9801 10000
> sum(sumOfSquares)
[1] 338350
> sum(x)^2
[1] 25502500
> sum(x)^2 - sum(sumOfSquares)
[1] 25164150
this one was pretty straightforward, trick was starting and stepping by the smallest number evenly divisible by 1 thru 10 which was in the example, execution time 2 seconds.
isDivisible <- function(x)
{
for(i in 2:20)
{
if(x%%i==0){next}
else return(F);
}
return(T);
}
i <- 2520
> while(i < y){if(isDivisible(i)) {print(i); break}; i <- i+2520; if(i%%1000000==0) print(“…”)}
I was turned onto this project by Bobby at Auerilius who works on the Tinkerpop project. Basically project euler is a collection of math problems that demand computational solutions. Since i’m hobbying in math and trying to get acquainted with R, thought i’d start chipping away at this one. First couple problems were very easy, though some time was spent just figuring out R language constructs to program the solutions. The third was interesting, prime factorization to 600851475143. Below is my cleaned up R console solution … only perf tricks I used were only inspecting values to \(\sqrt600851475143\), testing for even divisibility for modulus before testing whether it was a prime, and short circuiting the prime test. Perf threshold for solutions on project euler is 1 minute, guessing I can improve upon the 15 seconds below. Lastly, in futzing about for prime factorization algorithms, came across this ancient algorithm - Sieve of Eratosthenes, which is pretty impractical for the >600B number above, but nonetheless very interesting for smaller numbers. Apparently sieving is very useful in cryptography.
> isPrime <- function(x)
+ {
+ y = sqrt(x)
+ for(i in 2:y)
+ {
+ if(x%%i==0) return(F)
+ }
+ return(T)
+ }
> for(i in 1:y){if(600851475143%%i==0 && isPrime(i)) {x<-c(x, i); print(i)}; if(i%%1000000==0) print(“…”)}
[1] 71
[1] 839
[1] 1471
[1] 6857
[1] “…”
[1] “…”
[1] “…”
[1] “…”
[1] “…”