Mapping Web Traffic: A Post on a Previous Post

I got lucky with my last post: a friend helped me get it published in Foreign Policy. To publish the interactive map, which was built in D3, I had to host it on my own server, which meant I had access to more web traffic data than this lowly blog ever gets.

My hosting service doesn’t give me much data, but there was enough for a pretty good sample of the hits the Foreign Policy article received. Below are two maps showing where people viewed the article from and on what kind of device, followed by some explanation of how the maps were made.

[Figure: ip_world_map (world map of visitor IP locations)]

The map above shows 1,000 IP locations for visitors to the article. The article is in English, from an American publication, which likely explains why most visits came from the US (56% of visitors) or from countries where English is widely spoken. There are still quite a few visitors from parts of Asia, but very few from the “global south,” which isn’t particularly surprising given that Foreign Policy is not widely read there. Just 0.6% of visitors were from mainland China, 2.2% from Hong Kong, and 0.8% from Taiwan. Maybe it’s best that home buyers in large Chinese cities aren’t reading another article about a potential real estate bubble.

[Figure: US_device_map (US visitor IP locations by device type)]

By looking at another dataset from my server, I was able to extract both IP locations and the type of device used to access the article. One reason this data is mildly interesting is that the interactive graphic in the Foreign Policy article was not viewable on mobile devices (or at least not on iPhones), so every iPhone and Android “dot” was probably confused when the article mentioned how to interact with the map. Luckily, only 28% of readers were using mobile devices.

About the maps and data: 

The shapefiles for the maps were downloaded from Natural Earth and loaded into R using the rgdal package.
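
For reference, here is a minimal sketch of that step (the folder and layer names are placeholders for Natural Earth’s 1:110m countries download, not necessarily the exact files used): readOGR() reads the shapefile, and fortify() flattens it into the long/lat/group data frame that geom_polygon() plots further down, presumably as the map object.

library(rgdal)
library(ggplot2)

# read the downloaded Natural Earth shapefile (dsn = folder, layer = file name without extension)
world_shp <- readOGR(dsn = "ne_110m_admin_0_countries",
   layer = "ne_110m_admin_0_countries")

# convert the polygons to a plain data frame with long, lat, and group columns
map <- fortify(world_shp)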

Data for the first map was collected from AWStats, a service provided by my host, Bluehost. The data was already in a table, so it was simply pasted into a CSV file; it covers visitors from 6:00am on July 27 to 6:00am on July 28. Linking the IP addresses to geographic locations was done with an R function created by Andrew Ziem. The map image was created with ggplot:


library(ggplot2)

# world map polygons with one semi-transparent red dot per IP location,
# sized by the number of hits from that location
ip_world_map <- ggplot() +
   geom_polygon(data = map, aes(long, lat, group = group)) +
   coord_equal() +
   geom_point(data = ip_df, aes(x = longitude, y = latitude, size = hits),
      color = "red", alpha = .15) +
   scale_size(range = c(.1, 3)) +
   ggtitle("Website Visitor IP Locations") +
   theme(plot.title = element_text(lineheight = .8),
      axis.ticks.y = element_blank(),   # hide the axis ticks, labels, and titles
      axis.text.y = element_blank(),
      axis.title.y = element_blank(),
      axis.ticks.x = element_blank(),
      axis.text.x = element_blank(),
      axis.title.x = element_blank())

Data for the second map, of IP locations and devices, was gathered by going to “Latest Visitors” in my server’s cPanel and viewing the page source. From there I copied the JSON into a text editor and read it into R using the rjson package. Each of the 385 objects in the JSON file originally looked like this (IP address scrubbed out):


{"localtime":"7\/27\/16 7:52 PM","protocol":"HTTP\/1.1","status":"200","ip":"##.###.##.###","httpdate":"27\/Jul\/2016:19:52:52","size":"36274","timestamp":1469670772,"agent":"Mozilla\/5.0 (iPad; CPU OS 9_3_2 like Mac OS X) AppleWebKit\/601.1.46 (KHTML, like Gecko) Mobile\/13F69","url":"\/chinarealestate\/Indexed_China_Housing.csv","tz":"-0600","method":"GET","referer":"http:\/\/dataspiked.com\/chinarealestate\/","line":999}
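
For what it’s worth, here is a rough sketch of that reading step (the file name is a placeholder, and I’m assuming the copied JSON is an array of records like the one above): fromJSON() returns a list with one element per visit, and the fields needed later can be stacked into a data frame.

library(rjson)

# read the JSON copied from the "Latest Visitors" page source
visits <- fromJSON(file = "latest_visitors.json")

# keep the fields used below: IP address, user agent string, and local time
visits_df <- do.call(rbind, lapply(visits, function(v) {
   data.frame(ip = v$ip,
      agent = v$agent,
      localtime = v$localtime,
      stringsAsFactors = FALSE)
}))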

This wasn’t very hard to turn into a nice, R-friendly data frame, but parsing the text to get the device was harder than expected because the device type was embedded in a string like this: “Mozilla\/5.0 (iPad; CPU OS 9_3_2 like Mac OS X) AppleWebKit\/601.1.46 (KHTML, like Gecko) Mobile\/13F69”. First I extracted the characters between the first set of parentheses, then took the first word of that string (a rough sketch of this is below). This was a useful resource for figuring out how to parse text. The same function used to find geographic locations for IP addresses in the first map was used to get the latitude and longitude of the IP addresses for this map, and then ggplot was used to create the map image. The data are from 3:00pm-8:00pm on July 27 and 6:00am-10:00am on July 28.
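
Here is that device-parsing step sketched out (the regular expressions are my reconstruction rather than the exact ones used):

agent <- "Mozilla/5.0 (iPad; CPU OS 9_3_2 like Mac OS X) AppleWebKit/601.1.46 (KHTML, like Gecko) Mobile/13F69"

# grab the text between the first pair of parentheses ...
inside <- sub("^[^(]*\\(([^)]*)\\).*$", "\\1", agent)   # "iPad; CPU OS 9_3_2 like Mac OS X"

# ... then keep only the first word, which names the device
device <- sub("^(\\w+).*$", "\\1", inside)   # "iPad"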

The R code can be found here on GitHub.