In this series of posts I will set out to recreate some of the visualizations from the book “Knowledge is Beautiful” by David McCandless in R.

David McCandless is author of two bestselling infographics books and gave a TED talk about data visualization. I bought his second book “Knowledge is Beautiful”, in 2015, which contains 196 beautiful infographics.

At that time, I was really into The Walking Dead, and his book inspired me to make my own infographic:

Recently, I was trying to think of some fun data visualization projects and decided to choose a couple from the book that could be recreated as close to possible in R.

The book is an excellent resource for those who like these sorts of exercises as every single visualization in the book is paired with an online dataset to explore at your interest!!!. I never knew how rich the datasets were until I tried to recreate my first visualization, “Best in Show”. The dataset for Best in Show alone, is an excel file with eight sheets!

McCandles says the whole book took him 15,832 hours over two-years and I don’t doubt it. Whipping up a quick EDA plotcan be fast and simple for a meeting if you only care about actionable results. However, if your publishing something it takes some time to create a stunning visual.

Best in Show


Best in Show is a scatter-plot of dog silhouettes, color-coded based on the category of dog, sized accordingly, and pointing either left-or-right, depending on their intelligence.

Let’s load the environment; I like the needs() package which makes it simple to install & load packages into R.

knitr::opts_chunk$set(echo = TRUE)
if (!require("needs")) {
  install.packages("needs", dependencies = TRUE)
  library(needs)
}

As mentioned above, the data is an excel file with eight sheets. We can read in the Excel file with read_excel() and specify the sheet we would like to access with the sheet = argument.

# install&Load readxl package
needs(here,
      readxl,
      stringr,
      dplyr,    # to use select()
      magrittr) # to use %<>% operators

path = here()

# Specify the worksheet by name
dog_excel <- read_excel("~/bestinshow.xlsx", sheet = "Best in show full sheet", range = cell_rows(3:91))

# Pick the desired variables
dog_excel %<>%
  select(1,3,5:6, "intelligence category", "size category")

# Rename the columns
colnames(dog_excel) <- c("breed", "category", "score", "popularity", "intelligence", "size")

# Remove first row (non-descript column names)
dog_excel <- dog_excel[-1,]

Take a look at the HTML table (supports filtering, pagination, and sorting).

needs(DT)
datatable(dog_excel, options = list(pageLength = 5))

Take a look at the intelligence levels in the dataset.

# What are the intelligence levels?
unique(dog_excel$intelligence)
## [1] "Brightest"     "Above average" "Excellent"     "Average"      
## [5] "Fair"          "Lowest"

McCandless’ classified dogs intelligence as either “dumb” or “clever”. But here we see there are categories in this clean dataset. So let’s assign the top three factors as clever and the other three as dumb

dog_excel$intelligence %<>% 
  str_replace(pattern = "Brightest", replace = "clever", .) %>%   
  str_replace(pattern = "Above average", replace = "clever", .) %>% 
  str_replace(pattern = "Excellent", replace = "clever", .) %>% 
  str_replace(pattern = "Average", replace = "dumb", .) %>% 
  str_replace(pattern = "Fair", replace = "dumb", .) %>% 
  str_replace(pattern = "Lowest", replace = "dumb", .)

There are 87 dogs in the original visualization. I found 24 silhouettes under an Creative Commons Attribution-Share Alike 4.0 License on SuperColoring.com. This means I can freely copy and redistribute in any medium or format as long as I give a link to the webpage and indicate the author’s name and the license. I’ve included that information in a .csv. and we will use this to subset McCandless’s data.

needs(readr)
dog_silhouettes <- read_csv("~/dog_silhouettes.csv")

dog_df <- dog_excel %>% 
  inner_join(dog_silhouettes, by = "breed")

# change popularity from string into numeric values
dog_df$popularity <- as.numeric(dog_df$popularity)

We need to use the magick package to make the white background surrounding the .svg silhouettes transparent, scale images to a common size (by width), and save them as .png’s.

needs(magick)

file.names <- dir(path, pattern = ".svg")

for (file in file.names){
  # read in the file
  img <- image_read_svg(file)
  # scale all images to a common scale
  img_scaled <- image_scale(img, 700)
  # make the background transparent
  img_trans <- image_transparent(img_scaled, 'white')
  # get rid of .svg ending
  file_name <- str_replace(file, pattern = ".svg", replace = "")
  # write out the file as a .png
  image_write(img_trans, paste0(file_name, ".png"))
}

Some of the dog silhouette’s are pointing their heads in the opposite direction. We need to use imag_flop() from the magick package so that all are facing the same direction. Later we can subset the data frame by intelligence so clever and dumb dogs face opposite directions.

path = here()

flop.images <- c("~/labrador-retriever-black-silhouette.png", "~/border-terrier-black-silhouette.png", "~/boxer-black-silhouette.png", "~/french-bulldog-black-silhouette.png", "~/german-shepherd-black-silhouette.png", "~/golden-retriever-black-silhouette.png", "~/greyhound-black-silhouette.png", "~/rottweiler-black-silhouette.png")

for(i in flop.images){
  i <- str_replace(i, pattern = "~/", replace = "")
  img <- image_read(i)
  img_flop <- image_flop(img)
  image_write(img_flop, i)
}

The next step is to color the dogs according to category.

Let’s write a function that can be given three arguments: 1) df a data frame, 2) category the type of dog it is, 3) color a specific color for each category.

The first step is to select only those dogs which belong to the category of interest. This can be done with filter() from the dplyr package. However, if you want to use dplyr functions, like filter(), you need to follow the instruction at this website; set with enquo.

img_color <- function(df, category, color){
  # Make filter a quosure
  category = enquo(category)
  # subset df on category
  new_df <- df  %>% 
    filter(category == !!category)
  # get directory paths of images for the for loop 
  category_names <- new_df$breed  %>%
    tolower() %>% 
    str_replace_all(" ", "-") %>% 
    paste0("-black-silhouette.png")
  for(name in category_names){
    img <- image_read(name)
    img_color <- image_fill(img, color, "+250+250")
    image_write(img_color, name)
  }
}

I wanted to replicate the figures as close as possible within R so to replicate the colors of the visualizations I scanned the book, saved the images, and then used this tool to get the html color codes.

Now let’s Use the created function above to color dog silhouettes according to their category.

# Herding color "#D59E7B"
img_color(dog_df, "herding", "#D59E7B")

# Hound color "#5E4D6C"
img_color(dog_df, "hound", "#5E4D6C")

# Non-sporting color "#6FA86C"
img_color(dog_df, "non-sporting", "#6FA86C")

# Sporting color "#B04946"
img_color(dog_df, "sporting", "#B04946")

# Terrier color "#A98B2D"
img_color(dog_df, "terrier", "#A98B2D")

# Toy color "#330000"
img_color(dog_df, "toy", "#330000")

# Working color "#415C55"
img_color(dog_df, "working", "#415C55")

Okay it’s finally time to make the visualization.

Usually one plots points with geom_point() from ggplot2 but in this case I want images for each of the breed’s instead. We can use the ggimage package, and with a little tweaking we can flop images based on intelligence. ggimage does not support color as an aesthetic like ggplot2 which is why I manually assigned colors & sizes earlier.

Since popularity scores range from 1 to 140 with 1 being the most popular we will need to reverse the y-axis with scale_y_reverse().

needs(ggplot2,
      ggimage)
# add "~/" to get filenames of images for plotting
dog_df$name <- paste0("~/", dog_df$name)

# create a ggplot/ggimage object
p <- ggplot(subset(dog_df, intelligence == "clever"), aes(x = score, y = popularity, image = name), alpha = 0.5) + geom_image(image_fun = image_flop) + geom_image(data=subset(dog_df, intelligence == "dumb")) +
  labs(title = "Best in Show", subtitle = "The ultimate datadog", caption = "Source: bit.ly/KIB_BestDogs") +
  labs(x = NULL, y = NULL) +
  theme(panel.background = element_blank(),
        legend.position = "top", 
        legend.box = "horizontal",
        plot.title = element_text(size = 13,
                                  # I'm not sure what font he chose so I'll pick something I think looks similar
                                 family = "AvantGarde",
                                 face = "bold", 
                                              lineheight = 1.2),
        plot.subtitle = element_text(size = 10,
                                     family = "AvantGarde"), 
        plot.caption = element_text(size = 5,
                                    hjust = 0.99),  
        axis.text = element_blank(), 
        axis.ticks = element_blank()) +
  scale_y_reverse()

The final step is to add text annotations underneath the dog breeds. Since I subset dog_df by intelligence if I try to annotate with geom_text() it will only annotate part of the data. We will need to the annotate() function instea since geome are not mapped from variables of a data frame, but are instead passed in as vectors.

# Add annotations
p + annotate("text", x=dog_df$score[1:24], y=((dog_df$popularity[1:24])+6), label = dog_df$breed[1:24], size = 2.0)

The visualization looks similar to the original and highlights most of the aesthetics that were included. One could always add the additional details in Adobe Illustrator or Inkscape to make it look more like the final visualization

Stay tuned for Part II!

 

A work by Matthew J. Oldach

moldach686@gmail.com