I am working on a super-secret project for which I am harvesting a highly confidential source of data: Twitter :) The idea is to gather a small amount of Twitter data, but for a long time… maybe a year. I tried to use the twitteR package, but it can only grab up to a week of tweets… it’s not really good for a set-it-and-forget-it ongoing capture since it requires user-based authentication, which means (I guess) that a machine can’t authenticate for it. Tangibly, this means a human needs to start the process every time. So I could run the script weekly, but of course there are days you miss, or you run it at different times… plus it’s just plain annoying…
does no #rstats package use Twitter oauth 2.0? I can’t believe it… this means that automated tweet-scrapers in R aren’t possible? pic.twitter.com/eq9dnrbB6L
- amit (@VizMonkey) August 9, 2017
And then I remembered about streamR, which allows exactly for this ongoing scraping. This blog documents my experience setting this up on my server, using a linux service.
Let’s Go!
(Small meta-note: I’m experimenting with a new blogging style: showing more of my errors and my iterative approach to solving problems in order to counter the perception of the perfect analyst… something a bunch of people have been talking about recently. I was exposed to it by @JennyBryan and @hspter during EARL London, and it’s really got me thinking. Anyway, I do realize that it makes for a messier read full of tangents and dead ends. Love it? Hate it? Please let me know what you think in the comments!)
(metanote 2: The linux bash scripts are available in their own github repo)
So if you don’t have a linux server of your own, follow Dean Attali’s excellent guide to set one up on Digital Ocean… it’s cheap and totally worth it. Obviously, you’ll need to install streamR, and also ROAuth. I use other packages in the scripts here; it’s up to you whether you do it exactly how I do it or not. Also… remember that when you install R packages on Ubuntu, you have to do it as the superuser in linux, not from R (otherwise that package won’t be available for any other user, like shiny. If you don’t know what I’m talking about then you didn’t read Dean Attali’s guide like I said above… why are you still here?). Actually, it’s so annoying to have to remember how to correctly install R packages on linux that I created a little utility for it. Save the following into a file called “Rinstaller.sh”:
#!/bin/bash
# Ask the user what package to install
echo "what package should I grab?"
read varname
echo "I assume you mean CRAN, but to use github type \"g\""
read source
if [ "$source" = "g" ]; then
    echo "--------------------------------"
    echo "Installing $varname from GitHub"
    # Run R as root so the package is installed for all users
    sudo su - -c "R -e \"devtools::install_github('$varname')\""
else
    echo "--------------------------------"
    echo "Grabbin $varname from CRAN"
    sudo su - -c "R -e \"install.packages('$varname', repos='http://cran.rstudio.com/')\""
fi
This script asks for the package name and then asks whether to install it from CRAN or from github. For github, obviously you need to supply the user account and the package name (user/package). There! Now we don’t need to remember anything anymore! :) Oh, make sure you chmod 777 Rinstaller.sh (which lets anyone read, write, and execute the file), and then to run it: ./Rinstaller.sh
Anyway, I messed around with streamR for a while and figured out how I wanted to structure the files. I think I want 3 files… one to authenticate, one to capture tweets, and the third to do the supersecret analysis. Here they are:
Authenticator
## Auth
library(ROAuth)

requestURL <- "https://api.twitter.com/oauth/request_token"
accessURL <- "https://api.twitter.com/oauth/access_token"
authURL <- "https://api.twitter.com/oauth/authorize"
consumerKey <- "myKey"
consumerSecret <- "mySecret"

my_oauth <- OAuthFactory$new(consumerKey = consumerKey,
                             consumerSecret = consumerSecret,
                             requestURL = requestURL,
                             accessURL = accessURL,
                             authURL = authURL)

## Interactive step: asks you to authorize the app and paste back the code
my_oauth$handshake(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))

## Save the credentials so the other scripts can load them
save(my_oauth, file = "my_oauth.Rdata")
So we use this file to connect to the Twitter service. You will need to set yourself up with an API… it’s fairly painless. Go do that here and select “Create new app”.
(Small caveat: make sure the my_oauth file is saved in the working directory. You can make sure of it by creating a Project for these three files… actually, working w/ working directories in a scripted setting is a pain… more on this later.)
Tweet-Getter
library(streamR)
library(here)

## Get
load("/srv/shiny-server/SecretFolder/my_oauth.Rdata")

## Stream tweets matching the topic and append them to the json file
filterStream("/srv/shiny-server/SecretFolder/tweets.json",
             track = "SecretTopic",
             oauth = my_oauth)
OK, so we run the authenticator once, then we can run this file, which just goes out and gathers all tweets related to SecretTopic, and saves them to tweets.json. This works with my stream because it’s a relatively small number of tweets… but be careful, if your topic gets tons of hits, the file can grow VERY quickly. You might be interested in splitting up the output into multiple files. Check this to see how.
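Since the capture file only ever grows, a quick size check can tell you when it is time to split or archive it. Here is a minimal sketch in shell, assuming a made-up helper name `file_too_big` and a threshold you pick yourself (neither comes from streamR):

```shell
# Hypothetical helper (not part of streamR): succeeds when a file is
# larger than a given number of bytes.
file_too_big() {
  ftb_path="$1"
  ftb_max="$2"
  # wc -c prints the size in bytes; $(( )) strips any padding spaces
  ftb_size=$(wc -c < "$ftb_path") || return 2
  ftb_size=$((ftb_size))
  [ "$ftb_size" -gt "$ftb_max" ]
}

# Example use (the path is from this post, the ~1 GB limit is an assumption):
# if file_too_big /srv/shiny-server/SecretFolder/tweets.json 1000000000; then
#   echo "tweets.json is getting big, time to split it"
# fi
```

You could call something like this from a scheduled job and have it nag you, rather than finding out when the disk is full.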
On working directories: it’s super annoying. It’s typically bad practice to specify a direct path to files in your script; instead, it’s encouraged that you use tech to “know your path”… for example, we can use the here package, or use the Project folder. The problem that arises when running files from some kind of automated cron or scheduler is that it doesn’t know how to read .Rproj files, and therefore doesn’t know what folder to use. I asked this question on the RStudio Community site, which has sparked a large discussion… check it out! Anyway, the last script:
Tweet-analyzer
## read
library(streamR)

tweets.df <- parseTweets("tweets.json", verbose = FALSE)

## Do the Secret Stuff :)

Ok, so now we can authenticate, gather tweets, and analyze the resulting file!
OK cool! So let’s get the TweetGetter running! As long as it’s running, it will be appending tweets to the json file. We could run it on our laptop, but it’ll stop running when we close our laptop, so that’s a perfect candidate to run on a server. If you don’t know how to get your stuff up onto a linux server, I recommend saving your work locally, setting up git, git pushing it up to a private github remote (CAREFUL! This will have your private keys, so make sure you don’t use a public repo) and then git pulling it onto your server.
EDIT: As mentioned by @John in the comments, think deeply about security and see if you feel comfortable doing this. You can perfectly well skip this step and just recreate the credential file on the server; that way no private keys would live on github at all… up to you.
OK, set it up to run on the server
(CAVEAT!! I am not a linux expert… far from it! If anyone sees me doing something boneheaded (like the chmod 777 above), please leave a comment.)
The first time you run the script, it will ask you to authenticate… So I recommend running the Authenticator file from RStudio on the server, which will allow you to grab the auth code and paste it into the RStudio session. Once you’re done, you should be good to capture tweets on that server. The problem is, if you run the TweetGetter in RStudio… when you close that session, it stops the script.
Idea 2: Hrm… let’s try in the shell. So SSH into the server (on Windows use PuTTY), go to the Project folder and type:
Rscript TweetGetter.R
It runs, but when I close the SSH session it also stops the script :-\ . I guess that instance is tied to the SSH session? I don’t get it… but whatever, fine.
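The likely reason is that the script is a child of the login shell, so it receives a hangup signal (SIGHUP) when the SSH session ends. One common workaround is `nohup`. The sketch below is only an illustration: the wrapper name `run_detached` and the log path are made up, and this is not the approach the rest of this post ends up using.

```shell
# Hypothetical wrapper: start a command so that it ignores the hangup
# signal sent when the terminal (or SSH session) closes, and log its output.
run_detached() {
  rd_log="$1"
  shift
  # nohup ignores SIGHUP; & puts the job in the background;
  # >> appends stdout to the log and 2>&1 sends stderr to the same place.
  nohup "$@" >> "$rd_log" 2>&1 &
  echo "started PID $!"
}

# Example (paths are placeholders):
# run_detached "$HOME/tweetgetter.log" Rscript TweetGetter.R
```

This keeps the script alive after logout, but nothing restarts it if it crashes, which is the gap the later ideas try to close.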
Idea 3: set a cronjob to run it! In case you don’t know, cron jobs are the schedulers on linux. Run crontab -e to edit the jobs, and crontab -l to view what jobs you have scheduled. To understand the syntax of the crontabs, see this.
So the idea would be to start the task on a schedule… that way it’s not my session that started it… although of course, if it’s set on a schedule and the schedule dictates it’s time to start up again but the file is already running, I don’t want it to run twice… hrm…
Oh I know! I’ll create a small bash file (like a small executable) that CHECKS if the thingie is running, and if it isn’t then run it, if it is, then don’t do anything! This is what I came up with:
if pgrep -x "Rscript" > /dev/null
then
    echo "Running"
else
    echo "Stopped... restarting"
    Rscript "/srv/shiny-server/SecretFolder/newTweetGetter.R"
fi
WARNING! THIS IS WRONG.
What this is saying is “check if ‘Rscript’ is running on the server (I assumed I didn’t have any OTHER running R process at the time, a valid assumption in this case). If it is, then just say ‘Running’; if it’s not, then say ‘Stopped… restarting’ and re-run the file, using Rscript”. Then, we can put this file on the cron job to run hourly… so hourly I will check if the job is running or not, and if it’s stopped, restart. This is what the cron job looks like:
1 * * * * "/srv/shiny-server/SecretFolder/chek.sh"
In other words, run the file chek.sh during minute 1 of every hour, every day of the month, every month of the year, and every day of the week (i.e., every hour :))
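To make the five schedule fields concrete, here is a small shell sketch that labels each field of a crontab entry. The function name `explain_cron` is made up for illustration; it only splits the line on whitespace and does not validate the schedule.

```shell
# Hypothetical helper: print the five schedule fields of a crontab line
# with their meanings, followed by the command.
explain_cron() {
  # Turn off globbing so a literal * is not expanded to file names
  set -f
  # Word-split the entry into positional parameters
  set -- $1
  set +f
  echo "minute:       $1"
  echo "hour:         $2"
  echo "day of month: $3"
  echo "month:        $4"
  echo "day of week:  $5"
  shift 5
  echo "command:      $*"
}

explain_cron '1 * * * * "/srv/shiny-server/SecretFolder/chek.sh"'
```

A `*` in a field means “every value”, so only the minute is pinned down in this entry.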
OK…. Cool! So I’m good, right? Let me check if the json is getting tweets… hrm… no data in the past 10 minutes or so… has nobody tweeted or is it broken? Hrm2… how does one check the cronjob log file? Oh, there is none… but shouldn’t there be? ::google:: I guess there is supposed to be one… ::think:: Oh, it’s because I’m logged in with a user that doesn’t have admin rights, so when it tries to create a log file in a protected folder, it gets rejected… Well fine! I’ll pipe the output of the run to a file in a folder I know I can write to. (Another option is to set up the cron job as the root admin… i.e. instead of crontab -e you would say sudo crontab -e… but if there’s one thing I know about linux, it’s that I don’t know linux, and therefore I use admin commands as infrequently as I can get away with.) So how do I pipe run contents to a location I can see? Well… google says this is one way:
40 * * * * "/srv/shiny-server/SecretFolder/chek.sh" >> /home/amit/SecretTweets.log 2>&1
So what this is doing is running the file just as before, but the >> pushes the results to a log file in my home directory. Just a bit of Linux for you… > recreates the output file every time (i.e. overwrites), whereas >> appends to what was already there. The 2>&1 part means “grab standard output and errors”… if you wanna read more about why, geek out, but basically you’re saying “send any errors (stream 2) to wherever standard output (stream 1) is going, and then capture all of it”.
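A quick way to convince yourself of the difference is to try all three in a scratch folder. This is a minimal sketch that assumes nothing beyond a POSIX shell; the file names are throwaway placeholders.

```shell
demo_dir=$(mktemp -d)

# > truncates the file each time, so only the last line survives
echo "first"  > "$demo_dir/overwrite.log"
echo "second" > "$demo_dir/overwrite.log"

# >> appends, so both lines are kept
echo "first"  >> "$demo_dir/append.log"
echo "second" >> "$demo_dir/append.log"

# 2>&1 points stderr (2) at wherever stdout (1) is going, here the log file.
# Order matters: the file redirection has to come before 2>&1.
{ echo "normal output"; echo "an error" >&2; } >> "$demo_dir/both.log" 2>&1

cat "$demo_dir/overwrite.log"   # only the second line is left
cat "$demo_dir/append.log"      # both lines
cat "$demo_dir/both.log"        # the normal output and the error together
```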
OK, so after looking at the output, I saw what was happening… during every crontab run, the chek.sh file made it seem like newTweetGetter.R wasn’t running… so it would restart it, gather 1 tweet and then time out. What strange behaviour! Am I over some Twitter quota? No, it can’t be, it’s a streaming service, twitter will feed me whatever it wants, I’m not requesting any amount… so it can’t be that.
Here is where I threw my hands up and asked Richard, my local linux expert, for help.
Enter a very useful command: top. This command, and its slightly cooler version htop (which doesn’t come with Ubuntu by default but is easy to install… sudo apt install htop), quickly showed me that when you call an R file via Rscript, it doesn’t launch a process called Rscript; it launches a process called /usr/lib/R/bin/exec/R --slave --no-restore --file=/srv/shiny-server/SecretFolder/newTweetGetter.R. Which explains why chek.sh didn’t think it was running (when it was)… and when the second run tried to connect to the twitter stream, it got rejected (because the first script was already connected). So this is where Richard said “BTW, you should probably set this up as a service…”. And being who I am and Richard who he is, I said: “ok”. (Although I didn’t give up on the cron… see ENDNOTE 1).
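The mismatch can be reproduced without touching any real process. `pgrep -x` requires the pattern to match the whole process name, while `pgrep -f` searches the full command line. The sketch below only imitates those two behaviours with plain string tests on the command line reported above; the function names are invented for illustration.

```shell
# The command line that top showed for the running script (from this post)
cmdline='/usr/lib/R/bin/exec/R --slave --no-restore --file=/srv/shiny-server/SecretFolder/newTweetGetter.R'
# The process name is the basename of the executable, i.e. the first word
procname=$(basename "${cmdline%% *}")

# Imitates pgrep -x: the pattern must equal the process name
matches_exact_name() { [ "$procname" = "$1" ]; }

# Imitates pgrep -f: the pattern may appear anywhere in the command line
matches_cmdline() {
  case "$cmdline" in
    *"$1"*) return 0 ;;
    *)      return 1 ;;
  esac
}

matches_exact_name "Rscript" || echo "exact name 'Rscript': no match (name is $procname)"
matches_cmdline "newTweetGetter.R" && echo "command line contains newTweetGetter.R: match"
```

On a real server the second kind of check would be something like `pgrep -f newTweetGetter.R`.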
After a bit of playing around, we found that the shiny-server linux service was probably easy enough to manipulate and get functional (guidance here and here), so letās do it!
Setting up the service:
- First of all, go to the folder where all the services live.
cd /etc/systemd/system/
- Next, copy the shiny service into your new one, called SECRETtweets.service:
sudo cp shiny-server.service SECRETtweets.service
- Now edit the contents!
sudo nano SECRETtweets.service
and copy-paste the following code:
[Unit]
Description=SECRETTweets

[Service]
Type=simple
User=amit
ExecStart=/usr/bin/Rscript "/srv/shiny-server/SecretFolder/newTweetGetter.R"
Restart=always
WorkingDirectory=/srv/shiny-server/SecretFolder/
Environment="LANG=en_US.UTF-8"

[Install]
WantedBy=multi-user.target
- Reload the daemon so systemd picks up the new service file. Just do it:
sudo systemctl daemon-reload
- Now start the service!!
sudo systemctl start SECRETtweets
Now your service is running! You can check the status of it using: systemctl status SECRETtweets.service
In the service file, each part does this:
- **Description** is what the thingie does
- **Type** says how to run it, and “simple” is the default… but check the documentation if u wanna do something more fancy
- **User** defines what user is running the service. This is a bit of extra insurance, in case you installed a package as yourself and not as a superuser (which is the correct way)
- **ExecStart** is the command to run
- **Restart** by setting this to “always”, if the script ever goes down, it’ll automatically restart and start scraping again! :) Super cool, no? (WARNING: Not sure about whether this can cause trouble… if twitter is for some reason pissed off and doesn’t want to serve tweets to you anymore, not sure if CONSTANTLY restarting this could get you in trouble. If I get banned, I’ll letchu know… stay tuned)
- **WorkingDirectory** This part is where the magic happens. Remember earlier on we were worried and worried about HOW to pass the working directory to the R script? This is how!! Now we don’t have to worry about paths on the server anymore!
- **Environment** sets environment variables for the script; here it is the language
- **WantedBy** I have no idea what this does and don’t care because it works!
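On the worry about constant restarts: systemd has a `RestartSec=` setting that makes it wait between restart attempts. Here is a hedged sketch, written as a shell function so the result can be inspected in a scratch file before anything goes into /etc/systemd/system/. The function name `write_unit` and the 60-second delay are assumptions of mine, not part of the setup described in this post.

```shell
# Hypothetical helper: write a unit file like the one above, plus RestartSec,
# to a path of your choosing.
write_unit() {
  wu_dest="$1"; wu_user="$2"; wu_script="$3"; wu_workdir="$4"
  cat > "$wu_dest" <<EOF
[Unit]
Description=SECRETTweets

[Service]
Type=simple
User=$wu_user
ExecStart=/usr/bin/Rscript "$wu_script"
Restart=always
# Assumed value: wait 60 seconds before each restart attempt
RestartSec=60
WorkingDirectory=$wu_workdir
Environment="LANG=en_US.UTF-8"

[Install]
WantedBy=multi-user.target
EOF
}

# Write to a scratch file first and look at it before copying it anywhere
unit_path=$(mktemp)
write_unit "$unit_path" amit \
  /srv/shiny-server/SecretFolder/newTweetGetter.R \
  /srv/shiny-server/SecretFolder/
cat "$unit_path"
```

As far as I understand it, the `WantedBy=multi-user.target` line only matters once you run `sudo systemctl enable SECRETtweets`, which is what makes the service start again after a reboot.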
So there you go! This is the way to set up a proper service that you can monitor, and treat properly like any formal linux process! Enjoy!
ENDNOTE 1
Ok, it’s true… sometimes a service isn’t the right thing to do. If you have a job that runs for a certain amount of time, finishes, and then needs to run again as a separate run later, you should set it up as a cron job. So for those cases, here’s the correct script to check whether the script is running, even assigning a working directory.
if ps aux | grep "R_file_you_want_to_check.R" | grep -v grep > /dev/null
then
    echo "Running, all good!"
else
    echo "Not running... will restart:"
    cd /path_to_your_working_directory
    Rscript "R_file_you_want_to_check.R"
fi

Save that as chek.sh and assign it to the cron with the output to your home path, like:
40 * * * * "/srv/shiny-server/SecretFolder/chek.sh" >> /home/amit/SecretTweets.log 2>&1