Juking the stats – WordPress and social proof

“Unnecessary Math” – via slaya771 on reddit http://www.reddit.com/r/funny/comments/1d3zs9/unnecessary_math/
Everyone with a basic science education knows that you cannot add quantities whose units do not match; you cannot add population to elevation, for instance, as the picture shows.
This does not stop companies from doing something that’s arguably worse, as it’s harder to detect and call them on their BS.
Take WordPress.com. I use them as my blogging platform and I’m overall happy with them. WordPress allows you to customize your blog by inserting widgets. I have the “Follow Blog: Email Subscription” widget installed. Here is what it looks like to readers:
This number is a lie. In my stats page I can see the truth – there are really only 41 email subscribers. The rest are following me on Twitter. When I post on WordPress, it automatically sends a tweet with a link to the post.
WordPress adds my Twitter follower count to my email subscriber count, and then implies that all of them are following my blog via email. Read the wording again. “Join 220 other followers”, right above a text box for email address entry.
First, why would WordPress do this?
I see two main possibilities.
One, it’s an honest mistake. The backend system has some field for ‘followers’ which is always computed by summing up all the different follower types, and this field was inadvertently used rather than the email follower count. I tried to contact WordPress about this on Monday, August 25, 2014 but have not yet received a response.
The second possibility is that it’s deliberate. The subscriber count is a form of social proof, which lets readers gauge the quality of the site. My hypothesis is that WordPress has empirical evidence that a higher number of followers displayed in this widget leads to increased follow rate. You could imagine A/B experiments where some visitors see the true count, and the others see the value doubled, and measure the difference. Or conversely, take away the follower count from that text and see if the follow rate drops.
The second question is, why does it matter?
While it’s not as wrong as adding elevation to population, as the image that started this post shows, it’s still wrong. The units are right in the sense that you are adding counts of people to counts of people. But all followers are not created equal. People could follow me on Twitter for any number of reasons, while not caring at all about my blog. Conversely, people who choose to explicitly sign up for email notifications of new posts are showing a drastically different level of intent. To call them both followers and to insert them in a widget that purports to show email subscribers is disingenuous.
Fortunately the widget has an option to disable the follower count altogether, and from now on I am going to do just that.
Data Visualization – Size of NFL Football Players Over Time

Screenshot from http://noahveltman.com/nflplayers/ – 1920

Screenshot from http://noahveltman.com/nflplayers/ – 2014
I love Noah Veltman’s visualization of the changing height and weight distribution of professional football players. It uses animation to convey the incredible increase in size of the typical football player, and it does so with a minimal amount of chart junk. Let’s look at two aspects that make this effective.
It uses the appropriate visualization
There are 4 variables plotted on the graph – height, weight, density, and time. Two of the variables are encoded in the axes of the chart. The time dimension is controlled by the slider (or by hitting the play button). The density is represented by the color on the chart.
You could present this data as a table, but it would be much harder to see the pattern that the animation conveys so simply: not only are players getting bigger in terms of both height and weight, but the variance is increasing as well.
It makes good use of color
It uses color appropriately, by varying the saturation rather than the hue. I’ve blogged about this topic before when discussing the Wind Map. To repeat my favorite quote about this, Stephen Few states in his PDF “Practical Rules for Using Color in Charts”:
When using color to encode a sequential range of quantitative values, stick with a single hue (or a small set of closely related hues) and vary intensity from pale colors for low values to increasingly darker and brighter colors for high values.
Extensions
I could imagine extending this visualization in a few ways:
- Allow users to view the players that match a given height/weight combination (who exactly are the outliers?)
- Allow restricting the data to a given position (see how quarterbacks’ height/weight are distributed vs those of the offensive line)
- Compare against some other normalized metrics, such as rate of injury. Is there a correlation?
This is a great data visualization because it tells a story and it spurs the imagination towards additional areas of analysis and research.
“Everyone should be able to pull and analyze data”

“Data overload” by Islam Elsedoudi, via flickr – CC BY-SA 2.0 http://creativecommons.org/licenses/by-sa/2.0/
Everyone should be able to write spaghetti code, and everyone should be able to pull and analyze data. And I’m not just talking about business-folk here.
…
Look at what’s going on in the digital humanities. Now, even literature, history, and religious scholars can use data to shed new insight on old texts. How awesome is that? But you have to be able to actually analyze the data. That means being able to query and scrub; that means knowing a bit of probability and statistics. The difference between a median and mean would be a start. So yes, it’s no longer acceptable to say, “I suck at math!” and then ignore that part of the world.
I suck at physical exercise, but that doesn’t mean it’s OK for me to melt into a chair all day. We all need to work at the important stuff in life, and understanding data has become terribly important.
- John Foreman, chief data scientist at MailChimp. Read the full interview on chartio.com.
I agree with the overall sentiment of the quote, that more people should be able to do basic data scraping and analysis. Unfortunately, I don’t see it happening anytime soon, for two reasons: the tools for analyzing data are complicated for non-engineers, and most people receive no training in programming (to script and pull the data in the first place) or statistics (to crunch the data and draw valid insights).
Even if everyone had the skills and tools necessary to pull and analyze the data, there would still be a need for skilled analysts / data scientists. Executives and product managers often don’t have the time to do analysis themselves; it’s not efficient for them to do so. Analysts fulfill an important role by distilling raw data into products and insights.
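To make Foreman’s median-versus-mean point concrete, here is a tiny Python sketch with made-up daily page view counts, showing how a single outlier drags the mean upward while the median stays at a “typical” value:

from statistics import mean, median

# Hypothetical daily page views for a small blog: mostly quiet days,
# plus one post that went viral. These numbers are invented for illustration.
daily_views = [40, 35, 50, 45, 38, 42, 5000]

print(f"mean:   {mean(daily_views):.1f}")  # 750.0 -- distorted by the outlier
print(f"median: {median(daily_views)}")    # 42 -- the typical day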
How to download your WordPress.com stats in CSV, JSON, or XML format
I wanted raw data about the popularity of my various posts on this blog to better determine what sort of topics I should post about. WordPress.com provides some nice aggregate stats, but I wanted more. After stumbling around the Internet for a while, I cobbled together a way to download my blog data in CSV, XML, or JSON format.
There are three steps:
- Get an API key
- Get your blog URL
- Construct the URL to download the data
Get an API key
Akismet is WordPress.com’s anti-spam solution. Register for an Akismet API key at http://akismet.com/wordpress/ by clicking on “Get an Akismet API key”.
Sign up for an account. If you choose the personal blog option, you can drag the slider all the way to the left and register for free. If you value the service that Akismet provides, you can pay more. When you complete the signup flow, you will be provided with a 12-character API key. Copy this down.
Get your blog URL
Copy the full URL of your blog, minus the leading https://. For me this is developmentality.wordpress.com.
Construct the URL
There is a limited API for downloading your data at the following URL:
http://stats.wordpress.com/csv.php
View this in a browser to see what the API parameters are.
Then construct the full URL:
http://stats.wordpress.com/csv.php?api_key=<api_key>&blog_uri=<blog_uri>
View this URL in the browser (or via wget or curl) and you should see the views data.
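Alternatively, here is a minimal Python sketch of the same request, using the requests library. API_KEY and BLOG_URI are placeholders you would fill in with your own values:

import requests

API_KEY = "your12charkey"                    # placeholder: your Akismet API key
BLOG_URI = "developmentality.wordpress.com"  # placeholder: your blog, no leading https://

response = requests.get(
    "http://stats.wordpress.com/csv.php",
    params={"api_key": API_KEY, "blog_uri": BLOG_URI},
)
response.raise_for_status()
print(response.text)  # CSV rows of "date","views" by default

Requesting other tables and formats is just a matter of adding more entries to the params dict, as shown further below.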
There are multiple data sources. From the documentation:
table (String): one of views, postviews, referrers, referrers_grouped, searchterms, clicks, videoplays
Here is some sample data from each table. Change the format param from csv to json or xml to get the data in different formats.
views
CSV
"date","views"
"2013-12-31",118
JSON
[{"date":"2010-02-05","views": 46}]
XML
<views>
<day date="2014-01-01">112</day>
</views>
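If you fetch the CSV form programmatically, Python’s csv module parses it directly. A small sketch using the sample rows above; csv_text stands in for response.text from the earlier sketch:

import csv
import io

# Sample body from the views table, hard-coded here for illustration.
csv_text = '"date","views"\n"2013-12-31",118\n"2014-01-01",112\n'

reader = csv.DictReader(io.StringIO(csv_text))
views_by_date = {row["date"]: int(row["views"]) for row in reader}
print(views_by_date)  # {'2013-12-31': 118, '2014-01-01': 112}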
postviews
CSV
"date","post_id","post_title","post_permalink","views"
"2014-01-28",369876479,"Three ways of creating dictionaries in Python","https://developmentality.wordpress.com/2012/03/30/three-ways-of-creating-dictionaries-in-python/",46
JSON
[{"date":"2014-01-29","postviews":[{"post_id":369876479,"post_title":"Three ways of creating dictionaries in Python","permalink":"http:\/\/developmentality.wordpress.com\/2012\/03\/30\/three-ways-of-creating-dictionaries-in-python\/","views":22},{"post_id":369875635,"post_title":"R - Sorting a data frame by the contents of a column","permalink":"http:\/\/developmentality.wordpress.com\/2010\/02\/12\/r-sorting-a-data-frame-by-the-contents-of-a-column\/","views":16}]}]
XML
<postviews>
<day date="2014-01-30"></day>
<day date="2014-01-29">
<post id="369876479" title="Three ways of creating dictionaries in Python" url="https://developmentality.wordpress.com/2012/03/30/three-ways-of-creating-dictionaries-in-python/">54</post>
</day>
</postviews>
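Since my goal was to find which posts are most popular, the postviews table is the interesting one. Here is a sketch, using a trimmed version of the JSON sample above, that totals the views per post and ranks them:

import json
from collections import Counter

# A trimmed version of the sample postviews JSON above.
json_text = """[{"date":"2014-01-29","postviews":[
  {"post_id":369876479,"post_title":"Three ways of creating dictionaries in Python","views":22},
  {"post_id":369875635,"post_title":"R - Sorting a data frame by the contents of a column","views":16}]}]"""

# Sum views per post title across all days in the response.
totals = Counter()
for day in json.loads(json_text):
    for post in day["postviews"]:
        totals[post["post_title"]] += post["views"]

# Print posts from most to least viewed.
for title, views in totals.most_common():
    print(f"{views:5d}  {title}")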
referrers
CSV
"date","referrer","views"
"2014-01-28","http://www.google.com/",63
JSON
[{"date":"2014-01-30","referrers":[]},{"date":"2014-01-29","referrers":[{"referrer":"http:\/\/www.google.com\/","views":66},{"referrer":"www.google.com\/search","views":27},{"referrer":"www.google.co.uk","views":10}]}]
XML
<referrers>
<day date="2014-01-30"></day>
<day date="2014-01-29">
<referrer value="http://www.google.com/" count="" limit="100">66</referrer>
</day>
</referrers>
referrers_grouped
CSV
"date","group","group_name","referrer","views"
"-","Search Engines","Search Engines","http://www.google.com/",1256
JSON
[{"date":"-","referrers_grouped":[{"referrers_grouped":"Search Engines","views":{"http:\/\/www.google.com\/":1305}}]}]
XML
<referrers_grouped>
<day date="-">
<group domain="Search Engines" name="Search Engines">
<referrer value="http://www.google.com/">1305</referrer>
</group>
</day>
</referrers_grouped>
Dates aren’t included, so it’s the sum over the past N days, defaulting to 30. To change this, set the days URL parameter:
http://stats.wordpress.com/csv.php?api_key=<api_key>&blog_uri=<blog_uri>&table=referrers_grouped&days=<num_days>
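In the Python sketch from earlier, that amounts to adding table and days to the params dict. For example, to sum referrers over the past year (API_KEY and BLOG_URI are the same placeholders as before):

import requests

API_KEY = "your12charkey"                    # placeholder, as before
BLOG_URI = "developmentality.wordpress.com"  # placeholder, as before

response = requests.get(
    "http://stats.wordpress.com/csv.php",
    params={
        "api_key": API_KEY,
        "blog_uri": BLOG_URI,
        "table": "referrers_grouped",
        "days": 365,  # sum over the past year instead of the default 30 days
    },
)
response.raise_for_status()
print(response.text)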
searchterms
CSV
"date","searchterm","views"
"2014-01-28","encrypted_search_terms",190
JSON
[{"date":"2014-01-30","searchterms":[]},{"date":"2014-01-29","searchterms":[{"searchterm":"encrypted_search_terms","views":159},{"searchterm":"dynamically load property file in mule","views":2}]}]
XML
<searchterms>
<day date="2014-01-30"></day>
<day date="2014-01-29">
<searchterm value="encrypted_search_terms" count="" limit="100">159</searchterm>
<searchterm value="dynamically load property file in mule" count="" limit="100">2</searchterm>
</day>
</searchterms>
clicks
CSV
"date","click","views"
"2014-01-28","http://grab.by/grabs/b608b9c315119ca07a1f7083aabbb9c7.png",3
JSON
[{"date":"2014-01-30","clicks":[]},{"date":"2014-01-29","clicks":[{"click":"http:\/\/www.anddev.org\/extended_checkbox_list__extension_of_checkbox_text_list_tu-t5734.html","views":2},{"click":"http:\/\/android.amberfog.com\/?p=296","views":2}]}]
XML
<clicks>
<day date="2014-01-30"></day>
<day date="2014-01-29">
<click value="http://www.anddev.org/extended_checkbox_list__extension_of_checkbox_text_list_tu-t5734.html" count="" limit="100">2</click>
</day>
</clicks>
videoplays
I am not sure what this format looks like, as I have no video plays on my blog.
Conclusion
I hope you find this useful. I’ll write a follow-up post showing how to crunch this raw data and extract meaningful information from it.