This post explores the inter-relationships of StackOverflow Tags for R-related questions. So I grabbed all the questions tagged with “r”, took the other tags in each question and made some network charts that show how often each tag is seen with the other tags. The point is to see the empirical relationships that develop as people organically describe their problems with R. Full analysis on GitHub, as always.
<newbie> For the non-techies out there: StackOverflow.com is a question and answer website which many techies LOVE because in many cases it’s the best place to get answers when you’re stuck… I’ve used it a bunch of times. When you ask a question, you can tag it with (mostly) pre-defined “Tags” that help experts find your question. For example, I might ask a question: “How can I sum three numbers in Excel?”. In this case, I’d be smart to add the tags: Excel and Formula. This will help Excel and Formula experts to find my question and answer it quickly. Anyway, StackOverflow (or SO) is this whole thing, check it out… it’s awesome.
What I did was harvest all the questions regarding the stats program R, and then took all the other tags in that question and showed the relationships between these tags. </newbie>
Using the tremendously awesome SO Data Explorer which lets you query the entire SO question corpus, I found a query close enough to what I wanted and downloaded all the questions that had the tag “r”. A little manipulation and I’m ready to plot the relationships!
But plot what? I can imagine that the tag ggplot2 would often be related to the tag plot, so there should be a connection there… but should that count as much as a one-off random relationship? In order to answer this, we count how many times we saw the relationship, and call it the Link Strength (LS). So tags that are very frequently linked together will have a very high LS, and the one-off will have a low LS.
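To make the Link Strength idea concrete, here is a minimal sketch of the counting step. It is written in Python rather than the R of the actual script, and the sample questions, the function name link_strengths and the choice to drop the “r” tag before pairing are all assumptions for illustration, not the author's code.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sample: each inner list is the tag set of one question tagged "r".
questions = [
    ["r", "ggplot2", "plot"],
    ["r", "ggplot2", "plot", "legend"],
    ["r", "data.frame", "merge"],
    ["r", "ggplot2", "legend"],
]

def link_strengths(questions, root_tag="r"):
    """Count how often each unordered pair of non-root tags shares a question."""
    counts = Counter()
    for tags in questions:
        # Sort so that (a, b) and (b, a) collapse into the same key.
        others = sorted(set(tags) - {root_tag})
        counts.update(combinations(others, 2))
    return counts

ls = link_strengths(questions)
print(ls[("ggplot2", "plot")])      # pair seen together in two questions
print(ls[("data.frame", "merge")])  # a one-off pair
```

Sorting the tags before pairing is the design choice that matters here: it keeps the relationship undirected, so ggplot2/plot and plot/ggplot2 add up to one LS.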
Jumping right to it, BOOM! Here is LS=10 (this will only show tags as related if they were seen together more than 10 times) :
>>Play with the interactive version though<<, it’s WAAAAAAAAAAAAAAY funner (by the way, Ctrl+F works on it :)). Here it is in an ugly iframe:
Two problems with this one:
- It’s not possible to distinguish the strong links (with high LS) from the weak links
- Those “floaters” that you see in the peripheries might be related to the central network… the link might just be seen less than 10 times.
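The “floaters” problem can be reproduced on a toy example. The sketch below (Python, with made-up link strengths and a hypothetical helper called components) applies the “more than 10” cutoff and then groups the surviving tags into connected clusters. A weak link that falls under the cutoff is enough to split one network into two.

```python
from collections import defaultdict

# Hypothetical link strengths: (tag_a, tag_b) -> times seen together.
link_strength = {
    ("ggplot2", "plot"): 120,
    ("ggplot2", "legend"): 35,
    ("plot", "xts"): 4,   # weak bridge, dropped by the cutoff
    ("xts", "zoo"): 60,
}

def components(pairs):
    """Group tags into connected components of an undirected graph."""
    neighbours = defaultdict(set)
    for a, b in pairs:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, groups = set(), []
    for start in neighbours:
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:  # depth-first walk from the start tag
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(neighbours[node] - group)
        seen |= group
        groups.append(group)
    return groups

strong = [pair for pair, n in link_strength.items() if n > 10]
print(components(strong))                # xts/zoo floats apart from the rest
print(components(link_strength.keys()))  # one network once weak links are kept
```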
OK, so let’s take a step back and figure out the LS for all tag-pairs. Plotting that, we see that the link strength increases very slowly (see figure to the right-top, and a blowup of the “elbow” right-bottom). As can be seen, around 6000 of our 7000ish tag-pairs have an LS < 10.
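The “how many pairs sit below the cutoff” check is a one-liner once the link strengths exist. A minimal Python sketch follows; the eight pairs and their counts are invented for illustration (the real data has 7000ish pairs) and share_below is a hypothetical helper name.

```python
# Hypothetical link strengths for a handful of tag pairs (not the real data).
link_strength = {
    ("ggplot2", "plot"): 120,
    ("ggplot2", "legend"): 35,
    ("xts", "zoo"): 60,
    ("plot", "xts"): 4,
    ("knitr", "latex"): 9,
    ("regex", "string"): 2,
    ("merge", "csv"): 1,
    ("rcpp", "loops"): 1,
}

def share_below(strengths, cutoff):
    """Fraction of tag pairs whose link strength is strictly below the cutoff."""
    values = list(strengths.values())
    return sum(v < cutoff for v in values) / len(values)

# Sorted values are what an LS point chart would draw, lowest to highest.
ranked = sorted(link_strength.values())
print(ranked)
print(share_below(link_strength, 10))
```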
To investigate further, let’s plot charts at LS = 1, 10, 100, where the LS is the thickness of each link (thicker is higher Link Strength). To accomplish this I used the d3ForceNetwork rather than the d3SimpleNetwork function of the d3Network package (yes I know the new one is called networkD3 but I haven’t installed it yet, sue me). Oh, and these charts are zoom and scroll enabled, so enjoy the interactive versions here: LS = 1, LS = 10, LS = 100. They are way better to navigate. Hover your mouse over each node to see the tag name.
So, considering that the LS variability is mostly very low (which is what we saw on the LS point charts above anyway), I’m going to go out on a limb and say that the Link Strength per se is an interesting but perhaps unnecessary visualization element… it seems like the number of links to other nodes is more important. Therefore:
Conclusion 2: Tag popularity (as relates to R) is best predicted by how “central” a node is, not by the LS of its connections to other nodes. Here, centrality is used as a proxy to describe how many OTHER Tags it’s connected to. I could count them, but meh… you do it.
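For anyone who does want to count them: “how many OTHER tags a tag is connected to” is just the degree of the node. Here is a minimal Python sketch, assuming each tag pair appears once in the list; the pairs and the helper name degree are made up for illustration.

```python
from collections import Counter

# Hypothetical list of linked tag pairs, each unordered pair listed once.
pairs = [
    ("ggplot2", "plot"),
    ("ggplot2", "legend"),
    ("ggplot2", "data.frame"),
    ("plot", "legend"),
    ("xts", "zoo"),
]

def degree(pairs):
    """Number of distinct other tags each tag is linked to."""
    deg = Counter()
    for a, b in pairs:
        deg[a] += 1
        deg[b] += 1
    return deg

deg = degree(pairs)
print(deg.most_common(1))  # the most "central" tag in this toy network
```

Note that the LS values are ignored entirely here, which is the point of the conclusion above: a link counts as a link whether it was seen 11 times or 1100 times.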
I think that Network charts are a great way of exploring the relationships between tags. These relationships, when mapped together, somehow show how we all use our beloved R. For example, in the LS=10 chart I provided, you can see the following topical “arms”: machine learning, packages, knitr, xml, sql, shiny, rstudio, regex, etc, all with a bunch of tags within each arm. You’ve also got the messy internal cluster tags that are linked to by EVERYONE… these are the R staples.
Anyway, these network charts can also be used to investigate new tags that might be interesting to users that consider themselves specialists in a specific area.
It’s a bit tricky to figure out the best LS to visualize… I like 10… but feel free to play around. I also started playing around with a method of identifying specific tags to explore… it’s in the R-script… it’s not great but might be a good start… check it out if interested.
This analysis could be used for any tag. I chose “r”, but it’s easy to see how to change the query to get all the questions for any other tag too… check out the script.
A gift for everyone!
So this is just the tip of this analysis.
I’ve made a csv with just the Link Strengths for each pair of tags (oh, it’s 500 megs… extract it yourself from the R code)… it can be found in the GitHub repo. Of course, while you’re there you might find out that the initial query from SO has more than just the tag names for each question… go crazy internets!
A gift for rich people
OK fine you rich bastards… you have a computer that can handle big data and a 4k monitor? Enjoy the full_pawwah of the complete network (LS>1), plotted using the Simple technique which will have all teh names etc, at 4k rez. I hope you choke.
(edited by Laure Belotti)
After some discussions, I figured it would be more fun to think about what the R-Taggosphere looks like WITHOUT the top tags which muck up the picture by being too popular and having too many relationships. What’s left is an easier to read network of weaker connections, which do paint a very interesting albeit unexpected picture. Check out the interactive version! Or iframed below:
(Incidentally, the top tags are (from popularest to least): ggplot2|plot|data.frame|matrix|shiny|knitr|loops|function|for-loop|list|rstudio|time-series|statistics|data.table|python|subset|apply|xts|plyr|rmarkdown|regression|graph|dplyr|latex|vector|csv|merge|lapply|legend|date|dataframes|rcpp|regex|string|zoo)
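Stripping the top tags out of the network amounts to dropping every link that touches one of them. A minimal Python sketch, not the author's R code: the three tags are the first three from the list above, while the link strengths and the helper name drop_tags are invented for illustration.

```python
# First three entries of the top-tags list; extend with the rest as needed.
top_tags = {"ggplot2", "plot", "data.frame"}

# Hypothetical link strengths: (tag_a, tag_b) -> times seen together.
link_strength = {
    ("ggplot2", "plot"): 120,
    ("ggplot2", "legend"): 35,
    ("data.frame", "merge"): 40,
    ("xts", "zoo"): 60,
    ("knitr", "latex"): 9,
}

def drop_tags(strengths, unwanted):
    """Keep only the links where neither end is an unwanted tag."""
    return {
        pair: n
        for pair, n in strengths.items()
        if not unwanted.intersection(pair)
    }

remaining = drop_tags(link_strength, top_tags)
print(sorted(remaining))  # only links between the less popular tags survive
```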