UPDATE: We’ve just done an update to this tutorial on our new blog using R Spatial’s new simple features package! It’s a much nicer workflow than the one below so head over to Multivariate Dot-Denisty Maps in R with sf and ggplot2 now to check it out. Cheers!

Background

I recently came across Eric Fisher’s brilliant collection of dot density maps that show racial and ethnic divisions within US cities. His work was inspired by Bill Rankin’s Map of Chicago that was made in 2009. Bill makes some salient points in this video about the limitations of choropleth mapping (where boundaries are filled with one colour based on one variable) and how it has a tendancy to “reinforce political ideals of national determination and ethinic homogeneity.”

Could we have a cartography without boundaries?

I loved this idea of breaking each boundary down from the sum of it’s parts to a more granular level to uncover the diversity within. So I thought I’d try to tap into the burgeoning world of #rstats and see if I could make a version of the ethnic dot map for my recent city of residence, London.

There’s already some excellent tools out there for exploring London Census data at a geographical level, Data Shine being a fine example. It allows you to map data from any section of the Census to Output Level region boundaries (more on these later). The only drawback to this, as previously outlined by our main man Bill Rankin, is that we can only view data for one variable at a time i.e. the percentage of residents in a specific output area belonging to one particular ethnic group.

Thus, I set out to build map that would provide a representative view of ethnicity in London at a single glance. A cartography without boundaries…


Searching for Guidance & Data

To get me started I invested in the expert guidance of data-visualiser-extraordinaire Nathan Yau, aka Flowing Data. Nathan has a whole host of tutorials on how to make really great visualisations in R (including a brand new course focused on mapping) and thankfully one of them deals with how to plot dot density using base R.

Now with a better understanding of the task at hand, I needed to find the required ethnicity data and shapefiles. I recently saw a video of Amelia McNamara’s great talk at the OpenVis Conference titled ‘How spatial polygons shape our world’. The .shp file really is a glorious thing and it seems that the spatial polygon makers are the unsung heros of the datavis world, so a big round of applause for all those guys is in order.

Anyway, I digress. Luckily for me, the good folks over at the London DataStore have a vast array of Shapefiles that go from Borough level all the way down to Super Output Area level. I’m going to use the Output Areas as the boundaries for the dots and the much broader Borough boundaries for ploting area labels and borders.

The ethnic group data itself was sourced from the Nomis website which has a handy 2011 Census table finder tool where you can easily download an Ethnic Group csv file for London output areas. Vamonos.


Data Prep

So here we go.

First lets load the required packages:

library(maptools)
library(rgeos)
library(tidyverse)
library(rgdal)
library(ggthemes)

Now read in both London Output Areas and London Boroughs shape files (saved in the working directory) and add the appropriate coordinate system references.

ldnoa <- readOGR(dsn = "ESRI/OA_2011_London_gen_MHW.shp") %>%
  spTransform(CRS("+proj=longlat +datum=WGS84"))

ldnb <- readOGR(dsn = "ESRI/London_Borough_Excluding_MHW.shp") %>%
  spTransform(CRS("+proj=longlat +datum=WGS84"))

As previously mentioned, London Output Areas are an “open source geospatial classification” designed to allow for more thorough profiling possibilities. With a total of 25053 LOAs, it is the smallest available geospatial classification in London. Let’s have a look at them…

par(mar = c(0,0,0,0))
plot(ldnoa)

Dense!

Now lets compare that to the Borough boundaries…

plot(ldnb)

Less dense.

Next we read in our London ethnic group census data and select only the columns we want, in the order we want to plot them (largest number of people to lowest). I’ve chosen only the top-level ethnic categories to limit the number of colours we’re going to be plotting. The data does have more sub-category ethnicity levels which we could use to dig even deeper. But… mañana.

ldneth <- read.csv("bulk.csv", stringsAsFactors = F, check.names = F) %>%
  select(geography, White, `Asian/Asian British`, `Black/African/Caribbean/Black British`,
         `Mixed/multiple ethnic groups`, `Other ethnic group`) %>%
  rename(Mixed = `Mixed/multiple ethnic groups`, Asian = `Asian/Asian British`,
         Black = `Black/African/Caribbean/Black British`, Other = `Other ethnic group`)

Now for a beautiful data (arranged) marriage. We merge the the output areas shapefile with the ethnic group data; possible because they both have a column of data populated with the LOA geo codes. From the merged data we create a dataframe of the final number of dots we’re going to plot in each output area. The number we divide by to get this becomes the number of people each dot represents - in our case 10.

ldndata <- merge(ldnoa@data, ldneth, by.x="OA11CD", by.y="geography", sort=FALSE)
num.dots <- select(ldndata, White:Other) / 10

Now for the all important allocation of dots in their respective polygons. Here we create a ‘SpatialPointsDataFrame’ for each ethnicity and store them in a list. This can take a while as we’re generating over 800k coordinates, so it’s a good idea to save your workspace after it is done to save you doing it again for the future.

sp.dfs <- lapply(names(num.dots), function(x) {
  dotsInPolys(ldnoa, as.integer(num.dots[, x]), f="random")
})

f=“random” means that the dots are randomly distributed within their output area boundary. This may sound strange - random allocation??? - but because the boundaries we are using are at the smallest possible level, our distribution should be as accurate as possible.


Plotting with R Base

Now if we want to plot the map using base R graphics then we’re already in a position to do so with our list of spatial point dataframes, hurrah. All we need to do is iterate through each spatial dataframe and plot the dots using a different colour of your choosing for each iteration. Let’s give it a bash and see what we get.

# Base Plot --------------------------------------------------------------

# define a palette
pal <- c("#8dd3c7", "#ffffb3", "#fb8072", "#bebada", "#80b1d3")

# first plot the london boroughs as a base layer with hidden borders to keep a blank canvas
# then plot the dots
par(mar = c(0,0,0,0))
plot(ldnb, lwd = 0.01, border = "white")
for (i in 1:length(sp.dfs)) {
  plot(sp.dfs[[i]], add = T, pch = 16, cex = 0.1, col = pal[i])
}

Voilaaa. I’ve never seen London look better.

Side note: for these maps to look as they do above you have to blow them up to a reasonably large size to show the definition between colours.

The ‘add = T’ code chunk of the dot plotting is muito importante, as it allows each new plot of dots to layered on top of our original shapefile plot.

Let’s see if we can add borough labels in the appropriate place. The gCentroid function in the rgeos package is our friend here:

# Add Boundary Labels -------------------------------------------------

# get the centre point of each borough (polygon)
centroids <- data.frame(gCentroid(ldnb, byid = TRUE))

# plot the name of the borough on the centre points
par(mar = c(0,0,0,0))
plot(ldnb, lwd = 0.01, border = "white")
for (i in 1:length(sp.dfs)) {
  plot(sp.dfs[[i]], add = T, pch = 16, cex = 0.1, col = pal[i])
}
text(x = centroids$x, y = centroids$y, labels = as.character(ldnb$NAME), 
     cex = .8, family = "sans", font = 2)

It’s also possible to zoom in on an area of your choosing using the xlim and ylim call. Let’s focus in closer to Culture of Insight’s HQ in Islington and have a look:

# Zoom to a specific location ------------------------------------------

xlim <- c(-0.145044, 0.004463)
ylim <- c(51.480955,51.55274)

par(mar = c(0,0,0,0))
plot(ldnb, lwd = 0.01, border = "white", xlim = xlim, ylim = ylim)
for (i in 1:length(sp.dfs)) {
  plot(sp.dfs[[i]], add = T, pch = 16, cex = 0.5, col = pal[i])
} # increase cex (dot size) to compensate for zoom
text(x = centroids$x, y = centroids$y, labels = as.character(ldnb$NAME), 
     cex = 1, family = "sans", font = 2) # also increase text text size


Utilising More Plotting Methods - ggplot2

R’s base graphics are great; it seems like you can throw just about anything into a plot() call and it will give you something back. But the ubiquitous admiration for ggplot2 cannot be ignored, nor should it. So let’s take a trip down the ggplot rabbit hole and see what we can come up.

To start plotting in ggplot2 will require a bit more data prep as ggplot2 will only take in a regular data.frame, not a spatial point df.

So, let’s convert our London Boroughs ‘SpatialPolygonsDataFrame’ to a standard df like so:

# this method retains any data contained in the .shp file
# tidy(ldnb) from the the broom package is also legit if you just want the polygons
ldnb@data$id <- row.names(ldnb@data)
ldnb.points <- fortify(ldnb, region = "id")
ldnb.df <- merge(ldnb.points, ldnb@data, by = "id")

Let’s see if it worked…

ggplot(ldnb.df, aes(long, lat, group = group)) +
  geom_path() +
  coord_map()

Looking good.

Next, we need to convert our list of ‘SpatialPointDataFrames’ to regular dataframes with longitude and latitude variables for each dot. Here’s how I went about it:

# for each sp.df, scrape out the coordinates of each dot and store as a regular dataframe
dfs <- lapply(sp.dfs, function(x) {
  data.frame(coordinates(x)[,1:2])
})

# we're going to bind these dataframes together but first we need to add an ethnicity
# variable to allow for categorising data by colour after binding
# the double square brackets [[]] are used to select the dataframes held within the list
ethnicities <- c("White", "Asian", "Black", "Mixed", "Other")
for (i in 1:length(ethnicities)) {
  dfs[[i]]$Ethnicity <- ethnicities[i]
}

# final bit of data prep: bind all dataframes into one then set factor levels
# the factor level will dictate the order in which dots are plotted
# we want the category with most dots to be plotted first and vice versa, 
# so that categories with the most dots don't mask categories with fewer dots
dots.final <- bind_rows(dfs)
dots.final$Ethnicity <- factor(dots.final$Ethnicity, levels = ethnicities)

All done! We’re now all set to plot our dots however we see fit. We can create similar plots to the previous base plot (adding a black background and Borough boundaries) like so:

ggplot(ldnb.df) +
  geom_point(data = dots.final, aes(x, y, colour = Ethnicity), size = 0.01) +
  geom_path(aes(long, lat, group = group), colour = "#d3d3d3") +
  scale_colour_manual(values = pal) +
  theme_map() +
  theme(plot.background = element_rect(fill = "black"), legend.position = "none") +
  coord_map()

We can now draw on a vast repertoire of tools to shape the final output of our map. I won’t go into great detail about some of the options I’ve used but merely provide a few of examples…


Zooming

The ‘deep zoom’ image at the top of the page was made using the ggmap package, which works in conjunction with ggplot2 to add a maptile as a baselayer to the plot. I then saved the plot as a big old png file, added the title, annotations and legend in Adobe Illustrator (couldn’t work out how to get ggplot titles/legends to scale to image size when saving as a large image) and used Microsoft deep zoom composer and OpenSeadragon (‘an open-source, web-based viewer for high-resolution zoomable images, implemented in pure JavaScript, for desktop and mobile’) to get the zoom functionality in the browser. FUN!


But what’s the point of anything without an animated gif?

Good question.

For this I followed Lena Groeger’s ‘multiple photo gif in Photoshop’ tutorial, but you could try out the gganimate package to try and do it all in R.

gif


Interactivity? Shiny?

I did try throwing all these dots into a leaflet map but it was causing a lot carnage. We’d really need to reduce the number of dots significantly for it to work smoothly on a computer with modest amounts of RAM. But with our tiny super output area polygons, making each dot represent say 100 people rather than 10 would result in a lot of people not being represented at all.

So I decided to implement a lot of the data prep techniques outlined in this tutorial into an interactive dot density map in shiny using UK constituency election data. You can read more about that in our previous blogpost or visit the app directly by clicking on the image below. There’s a link to the source code in the ‘About’ tab in you’re interested!

dotmap


That’s all for now. I hope this can inspire more dot density maps in R! Please get in touch via the comments section or my twitter if you have any questions or suggested improvements (I’m sure there’s plenty) on any of the above.

And of course, if you think you’re business/organisation could benefit from any data wrangling/visualisations like the above, get in touch here and let’s work together!

fin.