The Freeride World Tour (FWT) has been hosting extreme skiing and snowboarding events since 1996. With the 2018 season having just wrapped up in March, I did an analysis of rankings and past FWT winners using R.



If you haven’t heard of the FWT yet, it’s an exciting sport where riders choose gnarly-looking lines through cliff faces, cornices and nasty couloirs (like this line). There are no artificial jumps or half-pipes, just a gate at the top and one at the bottom. The judges use five criteria (Difficulty of Line, Control, Fluidity, Jumps and Technique), each scored from 0 to 100.

I took on this project mainly to practice some web scraping, access the Twitter API and reinforce my own understanding of the concepts applied within. Skiing and snowboarding are largely forgotten when it comes to sports analytics (I mean, even tennis has an R package), so I thought it would be a cool project.

***

Let’s prepare the R session.

# Not sure why my system is messing up from this
options(java.home="C:\\Program Files\\Java\\jre1.8.0_171\\")
library(rJava)
## I like to install/load all of the required packages at once with needs
if (!require("needs")) {
  install.packages("needs", dependencies = TRUE)
  library(needs)
}
## Loading required package: needs
needs(rvest,
      readr,
      tidyverse,
      magrittr,
      robotstxt,
      qdap,
      tm,
      twitteR,
      dismo,
      maps,
      ggplot2,
      data.table,
      plyr,
      rtweet,
      wordcloud,
      knitr,
      RColorBrewer,
      magick,
      ggthemr)

# color palette
rdbu11 <- RColorBrewer::brewer.pal(11, "RdBu")

#FWT18 on Twitter


First, I collected data from the Twitter API using the twitteR package. To do this, I needed to set up a developer account to authenticate the connection from R (a good tutorial on how to do this is here).

api_key <- "your api key"

api_secret <- "your api secret"

access_token <- "your access token"

access_token_secret <- "your access token secret"

setup_twitter_oauth(api_key, api_secret, access_token, access_token_secret)
## [1] "Using direct authentication"

Let’s quickly visualize the frequency of tweets mentioning #FWT18 with ts_plot().

# For some reason `searchTwitter` only returns 9 days?
searchTerm <- "#FWT18"
searchResults <- searchTwitter(searchTerm, n = 10000, since = "2018-04-23")  # Gather Tweets 
## Warning in doRppAPICall("search/tweets", n, params = params,
## retryOnRateLimit = retryOnRateLimit, : 10000 tweets were requested but the
## API can only return 57
tweetFrame <- twListToDF(searchResults) # Convert to a nice dF

ts_plot(tweetFrame) +
  ggplot2::theme_minimal() +
  ggplot2::theme(plot.title = ggplot2::element_text(face = "bold")) +
  ggplot2::labs(
    x = NULL, y = NULL,
    title = "Frequency of #FWT18 Twitter statuses in the past nine days",
    subtitle = "Twitter status (tweet) counts ",
    caption = "\nSource: Data collected from Twitter's API via twitteR"
  )

I get a warning that I asked the Twitter API for up to 10,000 tweets but it could only return 57 from the past nine days.

It would be nice to get a longer history, but the Twitter API only indexes a few days’ worth of tweets and does not allow historic queries (there’s a Python package that can, but I haven’t tried it out yet).

Where in the world are people discussing #FWT18?


## [1] "try 2 ..."
## [1] "try 3 ..."
## [1] "try 4 ..."
## [1] "try 5 ..."
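The code that produced the map didn’t make it into the post (the "try 2 ..." messages above are retry output from the geocoding step). As a rough, hypothetical sketch of one way to build such a map with rtweet — the functions below are real, but the exact approach the original used is an assumption:

```r
# Hypothetical sketch (the original geocoding code is not shown):
# pull tweets with rtweet, extract embedded coordinates, plot on a world map.
library(rtweet)
library(maps)

rt <- search_tweets("#FWT18", n = 1000, include_rts = FALSE)
rt <- lat_lng(rt)  # appends lat/lng columns where geo data is available

maps::map("world", lwd = 0.25)
with(rt, points(lng, lat, pch = 20, cex = 0.75, col = "red"))
```

Only the small minority of geo-tagged tweets will yield coordinates, so most rows come back with `NA` for `lat`/`lng` and are simply skipped by `points()`.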

It looks like most of the tweets originated from Europe and North America, although we see a few tweets from Japan, Australia and Africa.

Note that geo-tagged tweets are only available from users who have opted in via their settings, a mere 3% of Twitter users.

What are some of the top words coming up in tweets about #FWT18?


This time I included @FreerideWTour and the Twitter handles of a few of the riders, using a function that looks for mentions and hashtags (@ and #).

searchTerms <- list("FreerideWTour", "FWT18", "EvaWalkner", "HazelBirnbaum", "Jackiepaaso", "LorraineHuber", "drewtabke", "BimboesMickael", "leoslemett", "markus1eder", "Reine_Barkered") 

# function to get tweets in data.frame format for use in ldply
getTweets <- function(searchTerm, n = 1000) {
  library(twitteR)
  
  TS <- paste0("@", searchTerm, " OR ", "#", searchTerm)
  # get tweets
  tweets <- searchTwitter(TS, n = n, since = format(Sys.Date()-7), lang="en")
  # strip retweets
  if(length(tweets)>0) {
    tweets <- strip_retweets(tweets, strip_manual = TRUE, strip_mt = TRUE)
    # convert to data.frame
    tweetdf <- twListToDF(tweets)
    # add searchTerm and return
    out <- cbind(searchTerm, tweetdf)
  } else {
    out <- structure(list(searchTerm = structure(integer(0), .Label = c(searchTerm), class = "factor"),
                               text = character(0), 
                               favorited = logical(0), 
                               favoriteCount = numeric(0), 
                               replyToSN = character(0), 
                               created = structure(numeric(0), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
                               truncated = logical(0), replyToSID = character(0), 
                               id = character(0), 
                               replyToUID = character(0), 
                               statusSource = character(0), 
                               screenName = character(0), 
                               retweetCount = numeric(0), 
                               isRetweet = logical(0), 
                               retweeted = logical(0), 
                               longitude = character(0), 
                               latitude = character(0)), 
                          .Names = c("searchTerm", "text", "favorited", "favoriteCount", "replyToSN", "created",  "truncated", "replyToSID", "id", "replyToUID", "statusSource", "screenName", "retweetCount", "isRetweet", "retweeted", "longitude","latitude"), row.names = integer(0), class = "data.frame")
  }
  
  return(out)
}

Having defined a function that handles a single search term, I apply it to the whole list.

# use plyr to get tweets for all searchTerms in parallel
tweets_by_searchTerm <- ldply(searchTerms, function(searchTerm) getTweets(searchTerm, n=1000))

These tweets are messy, so I clean them first with tm_map() and then create a word cloud of the most popular terms mentioned alongside the #FWT18 tag.

# save the text
text <- tweets_by_searchTerm$text
text <- str_replace_all(text, "[^[:graph:]]", " ")

# create corpus
tweet_corpus <- VCorpus(VectorSource(text))

# clean up the corpus

tweet_corpus <- tm_map(tweet_corpus, content_transformer(replace_abbreviation))
tweet_corpus <- tm_map(tweet_corpus, content_transformer(tolower))
tweet_corpus <- tm_map(tweet_corpus, removePunctuation)
tweet_corpus <- tm_map(tweet_corpus, removeWords, c(stopwords("en"), "freeridewtour", "dropin", "fwt18", "gopro", "httpstcot13hya19ie", "httpstcot2olcfil2g", "markus1eder", "jackiepaaso"))

myDTM <- TermDocumentMatrix(tweet_corpus,
                            control = list(minWordLength = 1))

m <- as.matrix(myDTM)
m <- sort(rowSums(m), decreasing = TRUE)

wordcloud(names(m), m, scale = c(3, 0.5),
          min.freq = 2, max.words = 50,
          colors = brewer.pal(8, "RdYlBu"))

It looks like "crashes" and "weekend" come up often in the context of these search terms.

Exploring FWT18 rankings


Since the data is not available as a .txt or a .csv file on the website, nor do they provide an API, I needed to crawl for it. It’s worth mentioning that administrators may want to protect certain parts of their website for a number of reasons, “such as indexing of an unannounced site, traversal of parts of the site which require vast resources of the server, recursive traversal of an infinite URL space, etc.”

Therefore, one should always check whether they have permission. One way to do this is to use the robotstxt package to check whether your web bot has permission to access certain parts of a webpage (thanks to [@ma_salmon](https://twitter.com/ma_salmon) for that tip).

# check permission to crawl
paths_allowed("https://www.freerideworldtour.com/rider/")
## [1] TRUE

Okay, it looks like we have permission.

Unfortunately, the code for the FWT 2018 rankings page is “fancy,” meaning one needs to click the drop-down arrows to get a rider’s score for each event.

I think the data is being loaded with JavaScript, which means I would need a program that can programmatically click the button. I’ve heard splashr or RSelenium may accomplish this, but I’m new to web scraping and only familiar with rvest, so I came up with a (relatively) quick workaround.
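For the record, here is a rough, untested sketch of what the RSelenium route might look like. The URL path and the ".accordion-toggle" CSS selector are placeholders, not taken from the actual page:

```r
# Hypothetical sketch: drive a real browser so the JavaScript runs,
# click every drop-down arrow, then hand the rendered HTML to rvest.
library(RSelenium)
library(rvest)

rD <- rsDriver(browser = "firefox", port = 4567L)
remDr <- rD$client
remDr$navigate("https://www.freerideworldtour.com/rankings")  # placeholder URL

# ".accordion-toggle" is a placeholder selector for the drop-down arrows
arrows <- remDr$findElements(using = "css selector", ".accordion-toggle")
for (a in arrows) a$clickElement()

# the fully rendered page can now be parsed with the usual rvest tools
page <- read_html(remDr$getPageSource()[[1]])

remDr$close()
rD$server$stop()
```

The trade-off is that RSelenium requires a running browser and Selenium server, which is why a plain-rvest workaround was appealing.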

I placed the names from the 2018 roster into a dataset and loaded it as an object. I can automatically crawl every rider’s page by appending each name to https://www.freerideworldtour.com/rider/ and feeding the resulting URLs to rvest in a for loop.

roster <- read_csv("https://ndownloader.figshare.com/files/11173433")

# create a url prefix
url_base <- "https://www.freerideworldtour.com/rider/"
riders <- roster$name

# Assemble the dataset
output <- data_frame()
for (i in riders) { 
  temp <- read_html(paste0(url_base, i)) %>% 
    html_node("div") %>% 
    html_text() %>% 
    gsub("\\s*\\n+\\s*", ";", .) %>% 
    gsub("pts.", "\n", .) %>% 
    read.table(text = ., fill = T, sep = ";", row.names = NULL, 
               col.names = c("Drop", "Ranking", "FWT", "Events", "Points")) %>%
    subset(select = 2:5) %>% 
    dplyr::filter(
      !is.na(as.numeric(as.character(Ranking))) & 
        as.character(Points) != ""
    ) %>%
    dplyr::mutate(name = i)
  output <- bind_rows(output, temp)
}
## Error in read.table(text = ., fill = T, sep = ";", row.names = NULL, col.names = c("Drop", : more columns than column names

I was going to look at the overall standings for each category (skiing and snowboarding), broken down by how many points athletes earned at each FWT event in 2018; however, I noticed something odd going on.

How many riders did I search for?

# How many riders in the roster?
unique(roster) # there are 56
## # A tibble: 56 x 3
##    name                 sport sex  
##    <chr>                <chr> <chr>
##  1 aymar-navarro        ski   male 
##  2 berkeley-patterson   ski   male 
##  3 carl-regner-eriksson ski   male 
##  4 conor-pelton         ski   male 
##  5 craig-murray         ski   male 
##  6 drew-tabke           ski   male 
##  7 fabio-studer         ski   male 
##  8 felix-wiemers        ski   male 
##  9 george-rodney        ski   male 
## 10 grifen-moller        ski   male 
## # ... with 46 more rows

How many riders did I actually get information for?

# How many names in the output object?
unique(output$name) # there are only 37?
##  [1] "aymar-navarro"        "berkeley-patterson"   "carl-regner-eriksson"
##  [4] "conor-pelton"         "craig-murray"         "drew-tabke"          
##  [7] "fabio-studer"         "felix-wiemers"        "george-rodney"       
## [10] "grifen-moller"        "ivan-malakhov"        "kristofer-turdell"   
## [13] "leo-slemett"          "logan-pehota"         "loic-collomb-patton" 
## [16] "markus-eder"          "mickael-bimboes"      "reine-barkered"      
## [19] "ryan-faye"            "sam-lee"              "stefan-hausl"        
## [22] "taisuke-kusunoki"     "thomas-rich"          "trace-cooke"         
## [25] "yann-rausis"          "arianna-tricomi"      "elisabeth-gerritzen" 
## [28] "eva-walkner"          "hazel-birnbaum"       "jaclyn-paaso"        
## [31] "kylie-sivell"         "lorraine-huber"       "rachel-croft"        
## [34] "blake-hamm"           "christoffer-granbom"  "clement-bochatay"    
## [37] "davey-baird"

Apparently the loop I wrote is not doing exactly what I want it to. After a bit of messing around, I found that the rider elias-elhardt was the source of the trouble.

Since Elias only competed in the qualifiers, let’s remove him from the roster object and re-run the code.

roster <- read_csv("https://ndownloader.figshare.com/files/11173433")

# Remove Elias Elhardt (dropping by name is safer than a hard-coded row index)
roster <- roster[roster$name != "elias-elhardt", ]
riders <- roster$name

# Assemble the dataset
output <- data_frame()
for (i in riders) { 
  temp <- read_html(paste0(url_base, i)) %>% 
    html_node("div") %>% 
    html_text() %>% 
    gsub("\\s*\\n+\\s*", ";", .) %>% 
    gsub("pts.", "\n", .) %>% 
    read.table(text = ., fill = T, sep = ";", row.names = NULL, 
               col.names = c("Drop", "Ranking", "FWT", "Events", "Points")) %>%
    subset(select = 2:5) %>% 
    dplyr::filter(
      !is.na(as.numeric(as.character(Ranking))) & 
        as.character(Points) != ""
    ) %>%
    dplyr::mutate(name = i)
  output <- bind_rows(output, temp)
}

# Join with roster
fwt_2018 <- output %>% 
  left_join(roster)
fwt_2018 <- unique(fwt_2018)
fwt_2018$Points <- as.numeric(fwt_2018$Points)

# Set theme for the next few graphics
ggthemr("fresh")

# Female ski
fwt_2018 %>% 
  filter(FWT == "FWT", sex == "female", sport == "ski", !is.na(Points)) %>%
  ggplot(aes(x = name, y = Points, fill = Events)) +
  geom_col() +
  coord_flip() +
  labs(title = "Female Ski")

# Female snowboard
fwt_2018 %>% 
  filter(FWT == "FWT", sex == "female", sport == "snowboard", !is.na(Points)) %>%
  ggplot(aes(x = name, y = Points, fill = Events)) +
  geom_col() +
  coord_flip() +
  labs(title = "Female Snowboard")

# Male ski
fwt_2018 %>% 
  filter(FWT == "FWT", sex == "male", sport == "ski", !is.na(Points)) %>%
  ggplot(aes(x = name, y = Points, fill = Events)) +
  geom_col() +
  coord_flip() +
  labs(title = "Male Ski")

# Male snowboard
fwt_2018 %>% 
  filter(FWT == "FWT", sex == "male", sport == "snowboard", !is.na(Points)) %>%
  filter(name != "drew-tabke") %>% 
  ggplot(aes(x = name, y = Points, fill = Events)) +
  geom_col() +
  coord_flip() +
  labs(title = "Male Snowboard")

Historical FWT Winners


The FWT lists past event winners on their website. I gathered data on all winners from the 23 tours between 1996 and 2018 and included each winner’s age, taken from either the website or a quick web search. The dataset can be found on figshare.

# load the data
df <- read_csv("https://ndownloader.figshare.com/files/11300864")

Get summary statistics on age of winners

df %>% 
  summarize(median_age = median(age, na.rm = TRUE),
            max_age = max(age, na.rm = TRUE),
            min_age = min(age, na.rm = TRUE))
##   median_age max_age min_age
## 1         29      43      15

Find the minimum age of winners by sex and sport

df %>%
  group_by(sex, sport) %>% 
  slice(which.min(age)) %>% 
  dplyr::select(name, sex, sport, age)
## # A tibble: 4 x 4
## # Groups:   sex, sport [4]
##   name            sex    sport       age
##   <chr>           <chr>  <chr>     <int>
## 1 Arianna Tricomi female ski          23
## 2 Michelle Gmitro female snowboard    16
## 3 George Rodney   male   ski          21
## 4 Cyril Neri      male   snowboard    15

How many times have riders won FWT events?

df %>% 
  dplyr::select(year:age) %>% 
  add_count(name) %>% 
  dplyr::select(-year, -month, -age) %>% 
  unique() %>% 
  arrange(desc(n)) %>%
  add_count(n) %>% 
  # filter(n != 1) %>% 
  ggplot(aes(x = n, y = nn)) + 
  geom_col(fill = "blue", show.legend = FALSE) +
  labs(x = "Number of times rider won FWT events", y = "Number of riders who won FWT events") +
  scale_x_continuous(breaks = seq(1, 16, 1), labels = seq(1, 16, 1))

The large number of riders who won at least one FWT event dwarfs the few athletes who won a considerable number of events. Let’s have a look at only those who won more than once.

df %>% 
  dplyr::select(year:age) %>% 
  add_count(name) %>% 
  dplyr::select(-year, -month, -age) %>% 
  unique() %>% 
  arrange(desc(n)) %>%
  add_count(n) %>% 
  filter(n != 1) %>% 
  ggplot(aes(x = n, y = nn)) + 
  geom_col(fill = "blue", show.legend = FALSE) +
  labs(x = "Number of times crowned winner", y = "Number of riders",
       caption = "Histogram of the participants by the number of times they won FWT events") +
  scale_x_continuous(breaks = seq(2, 16, 1), labels = seq(2, 16, 1)) +
  theme(panel.grid = element_line(colour = "grey75", size = .25))