Monday, January 2, 2012

Example

Here is a little example of what I do. Learning R isn't easy, but R is very powerful and efficient once you get your feet wet. I intend for this example to whet your appetite, and it should take you less than 20 minutes. By the end, you will have made this graph:



Pretty, isn't it?

Go here: http://joelefkowitz.com/pitcher_card.php?pid=136880 and click "Download excel file." This is Roy Halladay's data from 2011. Open the file in Excel, click "Save As", and choose the CSV (comma delimited) file type so that the file is saved as "halladay.csv". Just renaming the file is not enough, because the contents also need to be comma separated. This will make it easier to import into R.

Go here, and download R. Then open R. To read in the file, we need to change our working directory. We can do this with the setwd() command. Mine looks like this:

setwd("C:/Users/Josh/baseball_stuff/PITCHRX")
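If you want to check that the change worked, here is a small sketch. It uses tempdir() as a stand-in for your own data folder (that part is my assumption, so swap in your own path), and it puts the working directory back where it started at the end.

```r
old_dir <- getwd()                       # remember where we started
setwd(tempdir())                         # tempdir() stands in for your own folder
getwd()                                  # prints the new working directory
has_file <- file.exists("halladay.csv")  # TRUE once the file is saved in this folder
has_file
setwd(old_dir)                           # go back to the original directory
```

If file.exists() says FALSE in your real folder, the file is somewhere else or has a different name.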


Now read in the data. Type:

pitcher = read.csv("halladay.csv")

This reads in the file and assigns it to an object called pitcher. To get a feel for the object, first type in

head(pitcher)

This shows you the first few rows of the object so that you know the import wasn't screwed up. Now type

str(pitcher)

While str() means convert to string in Python, in R it means structure. This will give you a feel for each column in the object, which we can see is of type data.frame (like a spreadsheet in Excel). Looks like everything went well, awesome.
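One way the import can go wrong is when the file is really tab delimited even though it is named .csv. The sketch below fakes that situation with a tiny made-up file (the values are invented for illustration, not Halladay's data) and shows what to look for and how to read it correctly.

```r
# Write a tiny tab-delimited file with made-up values to a temporary location.
tmp <- tempfile(fileext = ".csv")
writeLines(c("pitch_type\tpx\tpz",
             "FF\t0.5\t2.5",
             "CU\t-0.3\t1.8"), tmp)

# Default comma separator: every row is crammed into a single column.
wrong <- read.csv(tmp)
ncol(wrong)
names(wrong)   # one long name, with periods where the tabs were

# Telling read.csv() that the separator is a tab fixes it.
right <- read.csv(tmp, sep = "\t")
ncol(right)
str(right)
```

So if head(pitcher) shows a single column with "\t" inside the values, try read.csv("halladay.csv", sep = "\t") instead.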

Now I want a graph that shows me Halladay's pitch locations. I want it to be pretty, and I want it split up by pitch type. I also want smoothing and labeled axes. And to top it off, I want the axes limited to a fixed range. We need the ggplot2 package. To install it, type

install.packages("ggplot2")

Load it by typing

library("ggplot2")
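Installing is a one-time step, while loading has to happen in every new R session. Here is a rough sketch of how to check whether a package is already installed before deciding which step you need. The helper name is_installed is something I made up for illustration; requireNamespace() is the base R function doing the work.

```r
# Hypothetical helper: TRUE if the package is already installed on this machine.
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

is_installed("stats")   # a package that ships with R itself

if (!is_installed("ggplot2")) {
  message("Run install.packages(\"ggplot2\") once, then library(\"ggplot2\").")
}
```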

Now we can use it. But first, eliminate pitches that we don't care about, by typing:

pitcher = subset(pitcher, !(pitch_type %in% c("IN", "")))
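To see what that line does, here is the same filter on a tiny made-up data frame (the rows are invented for illustration). It keeps every row whose pitch_type is neither "IN" nor blank. The last line shows one other way to write the filter, using bracket indexing.

```r
# Toy data: five invented pitches, two of which we want to drop.
toy <- data.frame(
  pitch_type = c("FF", "IN", "", "CU", "FF"),
  px = c(0.2, 1.9, 0.0, -0.4, 0.6),
  stringsAsFactors = FALSE
)

# %in% asks, row by row, "is this pitch_type one of these values?"
# The ! flips the answer, so we keep the rows that are NOT in the list.
kept <- subset(toy, !(pitch_type %in% c("IN", "")))
kept$pitch_type
nrow(kept)

# The same filter written with bracket indexing.
same <- toy[toy$pitch_type != "IN" & toy$pitch_type != "", ]
nrow(same)
```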

Now plot away.


# Smoothed density of pitch locations, with one panel per pitch type
ggplot(data = pitcher, aes(x = px, y = pz)) +
  stat_density2d(geom = "tile", aes(fill = ..density..), contour = FALSE) +
  facet_wrap(~pitch_type) +
  scale_x_continuous("horizontal pitch location") +
  scale_y_continuous("vertical pitch location") +
  coord_cartesian(xlim = c(-2, 2), ylim = c(1, 4))


Boom, you just made one high-quality graph in less than 20 minutes. Of course I haven't explained why what we just did works, and it's pretty complicated, but that's why you'll keep reading my website (I hope). We will go over more things like this in the future, but for now I just wanted to post something quick and powerful.
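If you are curious about the smoothing part, here is a rough sketch of the idea without ggplot2. The ggplot2 documentation describes stat_density2d as a 2D kernel density estimate computed with kde2d() from the MASS package, so the sketch calls kde2d() directly. The locations are simulated (invented numbers, not Halladay's pitches), and the 50 by 50 grid is an arbitrary choice of mine.

```r
library(MASS)  # comes with standard R installations

set.seed(1)
px <- rnorm(500, mean = 0,   sd = 0.7)  # made-up horizontal locations
pz <- rnorm(500, mean = 2.5, sd = 0.6)  # made-up vertical locations

# Estimate the density on a 50 x 50 grid over the same ranges as the plot:
# x from -2 to 2, y from 1 to 4.
dens <- kde2d(px, pz, n = 50, lims = c(-2, 2, 1, 4))
dim(dens$z)   # one density value per grid cell

# Draw the grid as a heat map with base graphics.
image(dens, xlab = "horizontal pitch location", ylab = "vertical pitch location")
```

Each tile in the ggplot2 graph is one cell of a grid like this, colored by its density value.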

5 comments:

  1. I am an R beginner so this question might be simple. Do I only need to load ggplot2 package once or do I have to load it each time I open R?

  2. You only need to install it once, but you do need to load it every session. Thanks for visiting the site!

  3. I am now attempting to continue with the example and I am getting the following error message:

    Error in match(x, table, nomatch = 0L) : object 'pitch_type' not found

    after attempting this step:

    pitcher = subset(pitcher, !(pitch_type %in% c("IN", "")))

    I have done a little searching and reading but haven't found an answer (at least one that a beginner like me can understand). It looks to me like it is not recognizing the variable "pitch_type" from the halladay.csv file. Am I doing something wrong?

    Replies
    1. That's strange. Are you sure the data is not screwed up? To take a peek at the data, try

      head(pitcher) which will show you a few rows. There are also a few other ways to perform this step. Reading my post on data frames should help. Basically,


      pitcher = subset(pitcher, !(pitch_type %in% c("IN", "")))

      is equivalent to

      pitcher = pitcher[pitcher$pitch_type!='IN' & pitcher$pitch_type!='', ]

  4. Thanks for pointing me to your post on data frames. I understand that better now, and I figured out that the data were screwed up in R. When I used head(pitcher), the titles of the columns were separated by periods and each row of data had a "\t" between entries, leading me to believe the file was tab delimited. I fixed it using

    pitcher=read.csv("halladay.csv", sep="\t")

    Anyways, thanks again for this site and for your help tonight. I look forward to learning more as your site grows.
