R for Hockey Analysis — Part 2: Tidyverse Basics
This is the second part of a series where I’ll show you some ways in which you can use R to analyze hockey data. If you haven’t seen the first tutorial, where I talk about R/RStudio installation and R basics, I think it’d be a good idea to check it out.
In this tutorial, we’ll play around with real 2018 NHL draft data, scraped from EliteProspects using my recently released R package.
But before we get started, we’re going to have to install the tidyverse.
Installing Packages
The lingo for installing packages in R is to run:
install.packages("name_of_the_package")
where you put the “name of the package” between quotes. Note that this works for packages housed on CRAN, which is a repository for R packages. My package, “elite”, isn’t on CRAN, so installation for “elite” and similar packages is slightly different, for whatever it’s worth.
Anyways, to install the tidyverse, we run:
install.packages("tidyverse")
You should see a lot of words in your console. That’s good! When that’s all finished, you should see something along the lines of:
package ‘tidyverse’ successfully unpacked...
If that’s what you see, great! If not, try again and possibly google-search an error message if R provided you with one.
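If you’re not sure whether the install worked, one way to check is to ask R directly. This is only a minimal sketch: is_installed() is a hypothetical helper name I made up for illustration, built on base R’s requireNamespace().

```r
# Hypothetical helper: TRUE if a package is installed, FALSE otherwise.
# requireNamespace() tries to load the package's namespace without attaching it.
is_installed <- function(pkg) {
  isTRUE(requireNamespace(pkg, quietly = TRUE))
}

is_installed("stats")       # the stats package ships with R
is_installed("tidyverse")   # should be TRUE once the install has finished
```

If the second line comes back FALSE, the install probably didn’t complete, and it’s worth running install.packages("tidyverse") again.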
Loading Packages
Once you install a package, you need to load it to actually use it.
Loading packages is nearly identical to installing packages. The lingo is:
library(name_of_the_package)
Where “name_of_the_package” is the name of the package you’re trying to load. Note that you don’t need quotes around the package name, but it would still work if you used quotes. Personally, I don’t use quotes around the package names for loading packages.
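If you want to convince yourself that the quotes are optional, here’s a small sketch using stats, a package that ships with R, so nothing needs to be installed first:

```r
library(stats)      # no quotes
library("stats")    # with quotes, same effect

# Either way, the package shows up on R's search path
"package:stats" %in% search()
```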
To load the tidyverse, you can run:
library(tidyverse)
And your output should look something like this:
> library(tidyverse)
-- Attaching packages --------------------------------------- tidyverse 1.2.1 --
v ggplot2 3.0.0     v purrr   0.2.5
v tibble  1.4.2     v dplyr   0.7.6
v tidyr   0.8.1     v stringr 1.3.1
v readr   1.1.1     v forcats 0.2.0
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()
Warning messages:
1: package ‘tidyverse’ was built under R version 3.4.4
2: package ‘ggplot2’ was built under R version 3.4.4
3: package ‘tidyr’ was built under R version 3.4.4
4: package ‘purrr’ was built under R version 3.4.4
5: package ‘dplyr’ was built under R version 3.4.4
6: package ‘stringr’ was built under R version 3.4.4
It’s possible that you’ll have some of the warning messages like I have above; if so, that’s completely fine. R is just warning you that the tidyverse packages were built under a newer version of R than the one you have installed.
Wait — what the h*ck is the tidyverse?
I’m glad you asked! The tidyverse is a collection of packages that make working with data in R a lot easier than it would otherwise be. Each of the packages works in conjunction with the others, so when you load tidyverse, you actually load eight different packages at once.
The eight packages are:
- ggplot2: for visualizations
- purrr: for working with functions
- tibble: for working with a type of data frame called the tibble
- dplyr: for data manipulation
- tidyr: for reshaping your data
- stringr: for working with strings
- readr: for loading data
- forcats: for working with a type of vector called “factors”
Awesome, so… what’s next?
Reading in Data
To use a dataset in R, you need to “read” it in some way. In this situation, “reading” means loading the dataset in R.
Bear in mind that, when you load a dataset in R, you’re never editing the actual dataset. By that, I mean that the original dataset will be unchanged no matter what you do in R.
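To see that idea in action, here’s a small sketch. It uses base R’s saveRDS() and readRDS() with a made-up three-row dataset in a temporary file, purely for illustration; none of this is the real draft data.

```r
# A tiny stand-in dataset, saved to a temporary file (hypothetical data)
path <- tempfile(fileext = ".rds")
saveRDS(data.frame(goals = c(16, 27, 25)), path)

# Read it in, then change the copy that lives in R
in_r <- readRDS(path)
in_r$goals <- in_r$goals * 2

# Reading the file again gives back the original values
on_disk <- readRDS(path)
on_disk$goals   # 16 27 25
in_r$goals      # 32 54 50
```

The file on disk only changes if you explicitly write over it.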
To read in data, I use functions from the readr package from the tidyverse. These functions begin with read_. For example, these functions include read_csv, read_delim, and read_rds, all for reading in their respective file formats. For this tutorial, we’ll be reading in a file of the “.rds” format. And don’t worry if any of this file formatting talk is confusing; this is only a quick introduction.
So, let’s load the data. I made the dataset available on my GitHub. Download it, and copy/drag the dataset to your working directory. If you forgot, the working directory, which you set in the last tutorial, is the folder in which R looks for files that you want to load.
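If you’re not sure whether the file ended up in the right place, a quick check along these lines can help. It’s only a sketch, and it assumes the downloaded file is named 2018_nhl_draft_data.rds:

```r
# Where is R currently looking for files?
getwd()

# Is the downloaded file in that folder? TRUE means R will be able to find it
file.exists("2018_nhl_draft_data.rds")
```

If the second line returns FALSE, either the file is in a different folder or the working directory isn’t set to where you think it is.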
Once you’ve copied the dataset to the working directory, we can load it into R with the read_rds() function. Let’s name this dataset draft_data so that it’s easier to manipulate thereafter. Remember, to name something in R, we can use the assignment operator (<-).
Here’s the code to run:
draft_data <- read_rds("2018_nhl_draft_data.rds")
Great! And to check on the look of the data, let’s run draft_data:
> draft_data
# A tibble: 217 x 16
   draft_league draft_year pick_number round draft_team name 
   <chr>        <chr>            <dbl> <dbl> <chr>      <chr>
 1 NHL Entry D~ 2018                 1     1 Buffalo S~ Rasm~
 2 NHL Entry D~ 2018                 2     1 Carolina ~ Andr~
 3 NHL Entry D~ 2018                 3     1 Montréal ~ Jesp~
 4 NHL Entry D~ 2018                 4     1 Ottawa Se~ Brad~
 5 NHL Entry D~ 2018                 5     1 Arizona C~ Barr~
 6 NHL Entry D~ 2018                 6     1 Detroit R~ Fili~
 7 NHL Entry D~ 2018                 7     1 Vancouver~ Quin~
 8 NHL Entry D~ 2018                 8     1 Chicago B~ Adam~
 9 NHL Entry D~ 2018                 9     1 New York ~ Vita~
10 NHL Entry D~ 2018                10     1 Edmonton ~ Evan~
# ... with 207 more rows, and 10 more variables:
#   position <chr>, shot_handedness <chr>, birth_place <chr>,
#   birth_country <chr>, birthday <chr>, height <dbl>,
#   weight <dbl>, age <dbl>, player_url <chr>,
#   player_statistics <list>
Whoa. Wait. What? What’s a tibble?
Tibble
Remember the last tutorial where I wrote about the data frame? If you forgot, a data frame is basically a spreadsheet of data; it’s a collection of vectors of equal length.
A tibble is a fancy type of data frame. The tibble has many advantages over the typical R data frame (created with data.frame()). In brief, the tibble is cleaner and more consistent than the typical R data frame is. From here on out, I’ll almost always be using tibbles rather than the typical R data frame.
Now, that’s great. But, what’s that <chr> and <dbl> near the top of the tibble?
Vector Classes
Do you remember, in the last tutorial, when I taught you how to create a vector? We used the function c() to do this.
In case you forgot, here’s a vector of “goals” scored by three Rangers last season. Try running this:
goals <- c(16, 27, 25)
Cool. So what?
In R, vectors have classes, and these classes have implications for what you can do with those vectors.
Last tutorial, we also had a vector called assists for those same players. Let’s run this:
assists <- c(37, 20, 19)
And we added those 2 vectors together to get points, right?
> points <- goals + assists
> points
[1] 53 47 44
That vector addition works because both goals and assists are of the same vector class, and it’s a class that supports arithmetic: they’re both numeric (also known as double) vectors.
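If you’re curious where the “numeric” and “double” names come from, this short sketch shows both. class() reports the vector class, while typeof() reports how R stores the values internally:

```r
goals   <- c(16, 27, 25)
assists <- c(37, 20, 19)

class(goals)         # "numeric"
typeof(goals)        # "double"
is.numeric(assists)  # TRUE
```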
There are many types of vector classes in R, but the 3 vector classes that you’ll see most often are numeric vectors, character vectors, and logical vectors.
Numeric Vectors
- Any numbers
- EX: 1, 1.0001, 0
Character Vectors
- Anything in quotes (" ")
- EX: "hockey", "NY rangers", "lias andersson > casey mittelstadt"
Logical Vectors
- Generally, either a TRUE or a FALSE
- EX: TRUE, FALSE
To see what vector class a vector is, you can use the function class(). Try these:
> class(c(1, 2, 3))
[1] "numeric"
> class(c(TRUE, FALSE))
[1] "logical"
> class(c("casey mittelstadt is not that amazing", "TRUE"))
[1] "character"
And, in the last tutorial, vector addition worked with the goals and assists vectors because both vectors are of class numeric. It wouldn’t work if one vector were of class character and the other of class numeric.
We can even test this:
> goals <- c(16, 27, 25)
> goals
[1] 16 27 25
> names <- c("Mats Zuccarello", "Mika Zibanejad", "Kevin Hayes")
> names
[1] "Mats Zuccarello" "Mika Zibanejad"
[3] "Kevin Hayes"
> goals + names
Error in goals + names : non-numeric argument to binary operator
While vector classes aren’t hugely important right now, they are often the cause of some very annoying errors in R — as you can see above — and it’s best to try and understand the concept of vector classes right away.
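A common way to run into this error is when numbers get stored as text. Here’s a minimal sketch of that situation and one way out of it, using base R’s as.numeric(). The goals_chr vector is a hypothetical example I made up for illustration:

```r
# Goals that were accidentally stored as text (note the quotes)
goals_chr <- c("16", "27", "25")
assists   <- c(37, 20, 19)

class(goals_chr)   # "character", so goals_chr + assists would fail

# Convert to numeric first, and the addition works again
goals_num <- as.numeric(goals_chr)
points <- goals_num + assists
points             # 53 47 44
```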
And let’s go back to the dataset (the tibble) that we loaded earlier, and see what it looks like:
> draft_data
# A tibble: 217 x 16
   draft_league draft_year pick_number round draft_team name 
   <chr>        <chr>            <dbl> <dbl> <chr>      <chr>
 1 NHL Entry D~ 2018                 1     1 Buffalo S~ Rasm~
 2 NHL Entry D~ 2018                 2     1 Carolina ~ Andr~
 3 NHL Entry D~ 2018                 3     1 Montréal ~ Jesp~
 4 NHL Entry D~ 2018                 4     1 Ottawa Se~ Brad~
 5 NHL Entry D~ 2018                 5     1 Arizona C~ Barr~
 6 NHL Entry D~ 2018                 6     1 Detroit R~ Fili~
 7 NHL Entry D~ 2018                 7     1 Vancouver~ Quin~
 8 NHL Entry D~ 2018                 8     1 Chicago B~ Adam~
 9 NHL Entry D~ 2018                 9     1 New York ~ Vita~
10 NHL Entry D~ 2018                10     1 Edmonton ~ Evan~
# ... with 207 more rows, and 10 more variables:
#   position <chr>, shot_handedness <chr>, birth_place <chr>,
#   birth_country <chr>, birthday <chr>, height <dbl>,
#   weight <dbl>, age <dbl>, player_url <chr>,
#   player_statistics <list>
Do you notice the <chr> and <dbl>? Those are the vector classes! That’s part of why the tibble is nice: it tells you the class of each column (each column being a vector) as it was loaded.
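You can also ask for those classes yourself instead of reading them off the printout. The sketch below builds a tiny stand-in data frame (toy_draft is a hypothetical name, with three columns that mimic draft_data) and checks every column at once with base R’s sapply():

```r
# A tiny stand-in for draft_data, for illustration only
toy_draft <- data.frame(
  draft_year  = c("2018", "2018", "2018"),
  pick_number = c(1, 2, 3),
  round       = c(1, 1, 1),
  stringsAsFactors = FALSE
)

# Apply class() to each column
sapply(toy_draft, class)
```

Here draft_year comes back as "character" and the other two as "numeric", which lines up with the <chr> and <dbl> labels in the tibble printout.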
Awesome. Vector classes and data frames are a bit complicated, so it’s totally fine if you don’t get it right away. But if you do, fantastic! Now let’s finally get onto the best parts of the tidyv…
Functions
erse… Sorry. Let’s talk about functions, what they are, and what you need to know about them at this moment. Sorry folks, we need to do this.
A function is a verb. It’s a way of saying “do this” or “get that” or “calculate these”. You supply an object — whether that be a data frame or a vector or something else — and the function does something.
There are literally thousands of functions in R. You can even create your own — I’ll show you how to do that in a later tutorial.
One function you’ve already used in this tutorial is c(), and that creates vectors. Another function you’ve used is class(). You’re basically an expert!
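To make the “function is a verb” idea a bit more concrete, here are a few other base R functions applied to the same goals vector from earlier. Each one takes the vector and does something different with it:

```r
goals <- c(16, 27, 25)

length(goals)   # how many values there are: 3
sum(goals)      # add them up: 68
max(goals)      # the largest value: 27
```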
Now, what goes into a function? How do you know what you should be typing inside of c() to get it to work?
What you type inside of a function is called an argument, and the arguments that you need to supply to a function depend on what the function is and what you want that function to do.
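As a small illustration of how arguments change what a function does, here’s a sketch using base R’s mean() and round(). The digits argument of round() is optional; leaving it out rounds to a whole number:

```r
goals <- c(16, 27, 25)

avg <- mean(goals)             # one argument: the vector to average

rounded_whole <- round(avg)              # no digits argument: 23
rounded_one   <- round(avg, digits = 1)  # one decimal place: 22.7

rounded_whole
rounded_one
```

Same function, same data, but a different result depending on the arguments you supply.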
Let’s take a quick look at the arguments needed for class(). We can do this by looking at the documentation for the function. Trust me, there’s a reason behind all of this, I promise!
To look at the documentation for a function, you can run ?function_name, where function_name is the name of the function. So, to view the documentation for class(), you can run:
?class
This is what should pop up in the viewer (the bottom right quadrant):