DailyR rvest web scraping

Dear Friends, extracting data from the web is an important skill in data science. R provides many packages for scraping data. In this post, I use the rvest package to scrape statistics on the top Premier League scorers from a BBC page.

I’m a huge Liverpool fan and want to check out how teams and players are doing. First, browse to the BBC page and note its URL. Then use your browser’s inspect feature to locate the data and find the appropriate XPath.

Major Steps

  • Use read_html and html_nodes to scrape the data
  • Use strsplit to separate the features of each player’s stats
  • Use data.table to organize the data
  • Use plot_ly to visualize the results

Scrape the Data

knitr::opts_chunk$set(echo = TRUE)
library(rvest)

url = "http://www.bbc.com/sport/football/premier-league/top-scorers" # website to scrape
x_path = '//*[@id="top-scorers"]/ol' # xpath 

website <-  read_html(url) 
top_scorers <- website %>%
  html_nodes(xpath = x_path) %>%
  html_text() # text scraped from website

substring(top_scorers, 1, 400) # inspect the first 400 characters
## [1] "     Mohamed Salah      Liverpool   148 mins per goal 3256 mins played  22 Goals scored 8 Assists   Shots on targetTotal  62%    64    104              Pierre-Emerick Aubameyang      Arsenal   124 mins per goal 2731 mins played  22 Goals scored 5 Assists   Shots on targetTotal  56%    40    72              Sadio Mané      Liverpool   140 mins per goal 3085 mins played  22 Goals scored 1 Assists   "
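
Because the live BBC page can change or be unavailable, it can help to try the same read_html and html_nodes pattern offline first. The sketch below is only an illustration: the HTML string, the span class names, and the player and team names are made up for the example and are not the BBC page's real markup. It also assumes a version of rvest in which html_node (singular) is available.

```r
library(rvest)

# Hypothetical, simplified markup; the real BBC page is more deeply nested
html <- '<html><body><div id="top-scorers"><ol>
  <li><span class="name">Player A</span> <span class="team">Team X</span> <span class="goals">22</span></li>
  <li><span class="name">Player B</span> <span class="team">Team Y</span> <span class="goals">18</span></li>
</ol></div></body></html>'

page <- read_html(html) # read_html also accepts a literal HTML string

# Same XPath as in the post, extended by /li to select one node per player
items <- html_nodes(page, xpath = '//*[@id="top-scorers"]/ol/li')

# html_node() returns exactly one match per item, so the columns stay aligned
scorers <- data.frame(
  name  = html_text(html_node(items, xpath = './/span[@class="name"]')),
  team  = html_text(html_node(items, xpath = './/span[@class="team"]')),
  goals = as.integer(html_text(html_node(items, xpath = './/span[@class="goals"]'))),
  stringsAsFactors = FALSE
)
scorers
```

Selecting one node per player, rather than the text of the whole list, avoids splitting on runs of spaces later, provided the page exposes each field in its own element.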

Place the Data in a Data.Table

The data.table package is a great tool for working with data. Check out my post here for further details. Let’s wrangle this data into something that makes sense and is easy to visualize.

library(data.table)
library(pander)

top_scorers <- strsplit(top_scorers, "              ") # Use the space marker to split the data near player names

top_scorers <- data.table(name = top_scorers[[1]]) # place the results in a data.table

top_scorers$team <- sapply(top_scorers$name, function(x) unlist(strsplit(x, "      "))[2]) # use the smaller space marker to split near team names

top_scorers$name <- sapply(top_scorers$name, function(x) unlist(strsplit(x, "      "))[1]) # cleans up name column, remove everything after the space marker

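digits <- lapply(top_scorers$team, function(x) {
  tokens <- unlist(strsplit(gsub("[^\\d ]+", " ", x, perl = TRUE), " ")) # replace non-digit text with spaces, then split on spaces
  as.numeric(tokens[tokens != ""]) # drop the empty tokens and convert the rest to numbers
}) # extract all the numerical data from the text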

digits <- unlist(digits) # turns the list of per-player vectors into a single vector
digits <- digits[!is.na(digits)] # removes NAs
dim(digits) <- c(7,25) # reshapes the single vector into a 7 x 25 matrix (7 stats for each of 25 players)
digits <- data.table(t(digits)) # convert the matrix into a data.table
colnames(digits) <- c("minutes_per_goal", "minutes_played", "goals_scored", "assists", "shots_on_target_percentage", "shots_on_target", "shot_attempts") # column headers

top_scorers$team <- sapply(top_scorers$team, function(x) unlist(strsplit(x, "   "))[1]) # clean up team column, remove everything after the space marker

top_scorers <- cbind(top_scorers, digits) # combine the data.tables

pander(top_scorers[1:5,]) # check out the first 5 players
name                       team       minutes_per_goal  minutes_played
Mohamed Salah              Liverpool  148               3256
Pierre-Emerick Aubameyang  Arsenal    124               2731
Sadio Mané                 Liverpool  140               3085
Sergio Agüero              Man City   118               2479
Jamie Vardy                Leicester  152               2728

Table continues below

goals_scored  assists  shots_on_target_percentage  shots_on_target  shot_attempts
22            8        62                          64               104
22            5        56                          40               72
22            1        55                          42               76
21            8        49                          43               87
18            4        58                          40               69
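
The wrangling above can also be followed on a couple of records in isolation. The sketch below is not the code used in this post but a simplified base R variant: it swaps the gsub and strsplit digit extraction for regmatches with gregexpr, and builds a plain data.frame instead of a data.table. The helper name parse_record is hypothetical, and the two records are copied from the scraped text shown earlier.

```r
# Two raw records copied from the scraped text shown earlier in this post
records <- c(
  "Mohamed Salah      Liverpool   148 mins per goal 3256 mins played  22 Goals scored 8 Assists   Shots on targetTotal  62%    64    104",
  "Pierre-Emerick Aubameyang      Arsenal   124 mins per goal 2731 mins played  22 Goals scored 5 Assists   Shots on targetTotal  56%    40    72"
)

stat_names <- c("minutes_per_goal", "minutes_played", "goals_scored", "assists",
                "shots_on_target_percentage", "shots_on_target", "shot_attempts")

# parse_record is a hypothetical helper, not part of any package
parse_record <- function(record) {
  # Runs of three or more spaces separate the fields; name and team come first
  fields <- strsplit(trimws(record), " {3,}")[[1]]
  # Pull out every run of digits, in order of appearance
  numbers <- as.numeric(regmatches(record, gregexpr("[0-9]+", record))[[1]])
  # Assumption: each record carries exactly seven numbers and no digits in names
  stopifnot(length(numbers) == length(stat_names))
  data.frame(name = fields[1], team = fields[2],
             as.list(setNames(numbers, stat_names)),
             stringsAsFactors = FALSE)
}

# One row per player, stacked into a single data.frame
players <- do.call(rbind, lapply(records, parse_record))
players
```

Parsing one record at a time keeps each player's numbers attached to that player, so no fixed matrix dimension has to be set by hand. The length check fails loudly if a record ever contains a different count of numbers.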

Plot the Data

OK, now that we have wrangled the data into a data.table, let’s take a brief look at it with a chart. Plotly is a great package that lets users interact with the chart. Let’s check it out.

library(plotly) # uber web-based interactive graphing tools

top_scorers$team <- as.factor(top_scorers$team) # make teams as.factor

p <- plot_ly(top_scorers, # the data.table
             x = ~ minutes_per_goal, 
             y = ~ goals_scored, 
             z = ~ assists, 
             color = ~ team) %>% # color the markers by team
  add_markers() %>%
  layout(scene = list(xaxis = list(title = 'Minutes per Goal'),
                     yaxis = list(title = 'Goals Scored'),
                     zaxis = list(title = 'Assists')))
p
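
Beyond the chart, the table can be sanity-checked and summarized with data.table's own syntax. The sketch below rebuilds only the five rows printed earlier in this post, so it runs without scraping. The object names top5 and goals_by_team and the column recomputed_pct are made up for the example. It assumes the percentage column is shots on target divided by shot attempts, which the five printed rows are consistent with.

```r
library(data.table)

# The five rows printed earlier in this post (a subset of the columns)
top5 <- data.table(
  name = c("Mohamed Salah", "Pierre-Emerick Aubameyang", "Sadio Mané",
           "Sergio Agüero", "Jamie Vardy"),
  team = c("Liverpool", "Arsenal", "Liverpool", "Man City", "Leicester"),
  goals_scored = c(22, 22, 22, 21, 18),
  shots_on_target_percentage = c(62, 56, 55, 49, 58),
  shots_on_target = c(64, 40, 42, 43, 40),
  shot_attempts = c(104, 72, 76, 87, 69)
)

# Recompute the percentage from the raw counts to confirm the columns were labeled correctly
top5[, recomputed_pct := round(100 * shots_on_target / shot_attempts)]
pct_matches <- all(top5$recomputed_pct == top5$shots_on_target_percentage)
pct_matches

# Total goals and number of listed players per team, highest total first
goals_by_team <- top5[, .(goals = sum(goals_scored), players = .N), by = team][order(-goals)]
goals_by_team
```

A check like this is a cheap way to catch a column mix-up after reshaping a long vector of numbers into a table.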