Thursday, January 26, 2012

Hiatus

Due to an unexpected, but positive, turn of events, I will have to hold off on posting at pitchR/x for a few months. I was pretty excited about starting the website and I'm sorry that I won't be able to contribute more content right now. In the meantime, I will be pursuing an internship that forbids me from writing about baseball and baseball data analysis for a few months. You're still free to contact me, but I won't be contributing to FanGraphs or THT or this website for a while.

Thanks again for reading my site, and I apologize again for having to put this on hold.

Wednesday, January 11, 2012

Calculate umpire strikezone sizes

Head on over to THT, and you should see a post by yours truly about calculating the size of umpire strikezones. Here I will open source the code for how to do that. The hard part, though, is that in order to make use of the code, you need to have every single pitch from 2011 at your disposal. If you have that, then you can use the following code to generate umpire strikezone sizes. The script makes very heavy use of the "plyr" package (packages load libraries of extra functions and/or data into R), which may be my favorite R package. It was written by Hadley Wickham, who is a celebrity of sorts in the R community. The plyr package is great for data manipulation and group-wise summaries. You can install it with:

install.packages("plyr")

You also need the "mgcv" package for creating the model that is used in the calculations.

You can find the code below. Lines that start with # are comments. Hope it's readable:

#load library
library("plyr")

#initialize -- add necessary variables. Best way is with vectorized ifelse function.
pitcher$babipn = ifelse(pitcher$type=="X" & pitcher$event %in% c("Single", "Double", "Triple"), 1, 0)
pitcher$inplay = ifelse(pitcher$type=="X" & pitcher$event != "Home Run", 1, 0)
pitcher$swing = ifelse(pitcher$type=='X' | pitcher$des %in% c('Foul', 'Foul Tip', 'Foul (Runner Going)', 
            'Swinging Strike', 'Swinging Strike (Blocked)'), 1, 0)
pitcher$walk = ifelse(pitcher$des %in% c("Ball", "Ball In Dirt", "Intent Ball") 
            & pitcher$event %in% c("Intent Walk", "Walk") & pitcher$ball==3, 1, 0)
pitcher$strikeout = ifelse(pitcher$des %in% c("Called Strike", "Foul Tip", "Strike",
            "Swinging Strike", "Swinging Strike (Blocked)")
            & pitcher$event %in% c("Strikeout","Strikeout - DP") & pitcher$strike==2, 1, 0)
pitcher$homerun = ifelse(pitcher$des == "In play, run(s)" & pitcher$event == 'Home Run', 1, 0)
pitcher$full_name = paste(pitcher$first, pitcher$last)

#find number of plate appearances for each umpire, don't know why I named it atbats
#calculate kwERA, FIP, babip, and swing rate
pitcher = ddply(pitcher, .(ump_id), transform, atbats = length(unique(ab_id)))
pitcher = ddply(pitcher, .(ump_id), transform, fip = ((13 * sum(homerun)
              + 3 * sum(walk) - 1.93 * sum(strikeout)) / (.23 * atbats)) + 3.10,
              kwERA = 5.24 - (12 * (sum(strikeout) - sum(walk)) / atbats),
              babip = sum(babipn) / sum(inplay), swing = mean(swing))

#subset data to only look at pitches where umpire makes a call
#this won't affect the stats we just calculated
pitcher = subset(pitcher, des %in% c("Automatic Ball", "Ball","Ball In Dirt", "Called Strike",
                                   "Intent Ball", "Pitchout", "Strike"))
#add in variable to describe called strike
pitcher$cs = ifelse(pitcher$des == 'Called Strike', 1, 0)

#find number of pitches seen by each umpire
pitcher = ddply(pitcher, .(ump_id), transform, n = length(px))

#only look at umpires with greater than 3000 pitches
pitcher = subset(pitcher, n > 3000)

#define function to calculate strikezone area
csarea = function(x) {
  require('mgcv')
  #create model with mgcv package, use smoothing
  model <- gam(cs ~ s(px) + s(pz), data = x, family = binomial(link='logit'))
  mygrid <- expand.grid(x=seq(-4, 4, length=50), y=seq(-2, 6, length=50))
  #find predicted called strike rates for a bunch of tiny bins
  mygrid$z <- as.numeric(predict(model, data.frame(px=mygrid$x, pz=mygrid$y), type='response'))
  thresh <- 0.5
  #find number of tiny bins where predicted rate is at least .5
  nbins = length(mygrid$z[mygrid$z >= thresh])
  #multiply number of tiny bins by size of each bin
  areas = nbins * 0.02665556
  #add in fields to data.frame
  x$area <- rep(areas, length(x[,1]))
  x$nbins <- rep(nbins, length(x[,1]))
  return(x)
}

#call + evaluate csarea() on each subset of umpire data
pitcher=ddply(pitcher, .(ump_id), "csarea")

#create a summary dataset for each umpire, then sort by area
sum.pitch=ddply(pitcher, .(ump_id), summarize, area=mean(area), full_name=full_name[1],
                fip=mean(fip), kwERA=mean(kwERA), babip=mean(babip), swing=mean(swing))
sum.pitch=arrange(sum.pitch, area)
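
In case the constant 0.02665556 in csarea() looks like it came out of nowhere: as far as I can tell it is just the area of one grid cell, in the units of px and pz (feet in PITCHf/x data). Here is a small sketch of that arithmetic on a made-up "strike zone" instead of a fitted model. The box-shaped zone and the object names are invented for illustration, so treat this as a sanity check and not as part of the analysis.

```r
# Same grid as in csarea(): 50 points per axis over an 8-foot span
xs = seq(-4, 4, length.out = 50)
ys = seq(-2, 6, length.out = 50)
spacing = xs[2] - xs[1]      # 8 / 49, since 50 points leave 49 gaps
cell_area = spacing^2        # area of one grid cell

# Made-up "strike probability": 1 inside a 2 ft x 2 ft box, 0 outside
grid = expand.grid(x = xs, y = ys)
grid$z = ifelse(abs(grid$x) <= 1 & grid$y >= 1.5 & grid$y <= 3.5, 1, 0)

nbins = sum(grid$z >= 0.5)   # cells at or above the 0.5 threshold
area = nbins * cell_area     # roughly the true 4 square feet, limited by grid resolution

print(cell_area)
print(area)
```

The estimate comes out a little under 4 because cells are only counted when their grid point falls inside the box. A finer grid would tighten that up, at the cost of more predictions per umpire.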

Thursday, January 5, 2012

Working with data frames

R, just like other programming languages, has different types of objects: matrices, arrays, data frames, lists, vectors, tables, etc. But by far the most important for working with baseball data is the data frame.

I'm not sure of the level of experience of the people reading this, so the first part will be more beginner oriented. If that bores you, skip down to the advanced section.

In the example post (just scroll down), we looked at some Roy Halladay data. It is stored in a data frame. Just as a reminder, to get this data, do the following:

Go here: http://joelefkowitz.com/pitcher_card.php?pid=136880 and click "Download excel file." This is Roy Halladay's data from 2011. Open the file in Excel, click "Save As", and choose the CSV file type so that the file is saved as "halladay.csv". This will make it easier to import to R.

Go here, and download R. Then open R. To read in the file, we need to change our working directory. We can do this with the setwd() command. Mine looks like this:

setwd("C:/Users/Josh/baseball_stuff/PITCHRX")


Now read in the data. Type:

pitcher = read.csv("halladay.csv")

We can confirm that this object is a data frame by typing in

class(pitcher)

Say I no longer want all of this data. For some reason, I only want the first 10 rows. How do we get a new data frame and store it in a new object?

A data frame, much like a matrix (two-dimensional array), has two dimensions - rows and columns. To index a data frame (or a matrix), we use brackets in R next to the object, like so:

pitcher[i, j]

Here "i" denotes the rows that you want and "j" denotes the columns. To get the first ten rows we could do this a few ways. Let's assign the numbers 1-10 to an object:

i = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

The command "c" means concatenate, and does just that -- concatenate a bunch of things together. You can use it on many data types, not just numeric data. But typing out all those numbers? Inefficient. We can do the same thing much faster with:

i = 1:10

Now index:

newdf = pitcher[i, ]

Since we want all columns, we just leave in a blank space next to the comma. And yes, you need the comma so R knows which dimensions you are indexing.

But we don't even need to assign anything to i. We can just do:

newdf  = pitcher[1:10, ]
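
If you don't have the Halladay file handy, here is a minimal sketch of the same row indexing on a made-up data frame. The object `toy` and all of its values are invented for illustration; only the column names mimic the PITCHf/x data.

```r
# Toy data frame; values are invented, not real PITCHf/x data
toy = data.frame(pitch_type = c("FF", "CU", "FT", "CU", "FC", "FT"),
                 start_speed = c(92.1, 77.4, 91.8, 78.0, 90.5, 92.3))

first_three = toy[1:3, ]           # rows 1 to 3, all columns
same_three = toy[c(1, 2, 3), ]     # same rows, spelled out with c()
speeds = toy[1:3, "start_speed"]   # a single column comes back as a plain vector

print(dim(first_three))            # number of rows, then number of columns
print(speeds)
```

Note that asking for one column by name gives you a vector and not a data frame, which is easy to trip over the first time.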

But why would we ever do this? A more powerful method is to index a data frame based on a condition. Say we only want Roy Halladay's curveballs.

We can look at individual columns of a data frame with the "$" operator. Pitch type data is housed in the "pitch_type" column, which we can access with

pitcher$pitch_type

To find which rows were curveballs, we can use the equality comparison operator "==". This is used in many languages. Type:

pitcher$pitch_type == "CU"

You actually just get a bunch of boolean values (TRUE or FALSE), one for each row of the data frame. The bracketed numbers at the start of each printed line, like [1], only tell you the position of the first value on that line; they don't mark the rows where the condition is true. We can use these boolean vectors to index data frames.

curveballs = pitcher[pitcher$pitch_type == "CU", ]

This will index the pitcher data frame such that the new object, curveballs, will only have rows where the conditional is true. This is really, really powerful. You can use multiple conditionals to subset different combinations of rows and columns, such as:

df = pitcher[pitcher$pitch_type=="FT" & pitcher$start_speed > 91, c("pfx_x", "pfx_z", "start_speed")]

This would find all rows where the entire conditional is true (& means and), and only takes the data from those columns which we described with the "c" command. We also could have described which columns using numbers for their indices like so (start_speed is the 26th column, and so on):


df = pitcher[pitcher$pitch_type=="FT" & pitcher$start_speed > 91, c(26, 34, 35)]


You now have a new data frame with only the columns start_speed, pfx_x, and pfx_z in it, and only with two-seam fastballs with an initial velocity greater than 91.
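
Here is a minimal sketch of conditional indexing that you can run without any real data. The `toy` data frame and its values are made up, and its columns are in a different order than the real file, so the position numbers here are specific to this example.

```r
# Toy data frame; values are invented, not real PITCHf/x data
toy = data.frame(pitch_type = c("FF", "CU", "FT", "CU", "FC", "FT"),
                 start_speed = c(92.1, 77.4, 91.8, 78.0, 90.5, 92.3),
                 pfx_x = c(-4.2, 5.1, -8.3, 4.7, 1.0, -8.9))

is_curve = toy$pitch_type == "CU"   # logical vector, one value per row
curve_rows = which(is_curve)        # row numbers where the test is TRUE
n_curves = sum(is_curve)            # TRUE counts as 1, so this counts curveballs

# Two conditions joined with &, plus a selection of columns by name
fast_ft = toy[toy$pitch_type == "FT" & toy$start_speed > 91, c("pfx_x", "start_speed")]

# Columns picked by position come back in the order the positions are given
by_position = toy[toy$pitch_type == "FT" & toy$start_speed > 91, c(2, 3)]

print(curve_rows)
print(names(fast_ft))
print(names(by_position))
```

If you want the actual row numbers where a condition holds, which() is the function to reach for. Selecting columns by name is usually safer than by position, since positions change whenever the file layout does.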

Advanced discussion


Another way is to use the subset command. You can use the command like this:

df = subset(pitcher, pitch_type == "CU")

And this would be equivalent to:

df = pitcher[pitcher$pitch_type == 'CU', ]

But that's pretty fancy, isn't it? We didn't have to access the column in the subset command with the $ operator, we just named it. The evaluation of the subset command is controlled such that it occurs within the data frame. But what if we define an object in the global environment and try to use it in the subset function?


x = 'CU'


df = subset(pitcher, pitch_type == x)

What happens? Well, it actually works just like we would want it to. But that means that the subset function will also work with global objects in conditionals. This means that subset evaluates its condition differently than the normal indexing operations we looked at earlier. The conditional is first evaluated within the context of the data frame, which in this case is "pitcher". But it can still go back a generation to the environment that called subset (here, the global environment) if it needs to. This trick is usually called non-standard evaluation. It resembles dynamic scoping, because names are looked up where the function was called, and that is different from the lexical scoping that is normally used by the language.

We can basically replicate this function like so:

conditional = substitute(pitch_type == "CU")


df = pitcher[eval(conditional, pitcher), ]

The object conditional here is of type "call." Evaluating this call in the context of a data frame gives us a bunch of true/false values - another boolean vector. In fact, this is equivalent to:

pitcher$pitch_type=='CU'

So we can see how the subset command works, and it's pretty fancy, using some scoping and evaluation tricks -- it's pretty advanced stuff. The subset command is meant to save typing, but I'm not really sure it does unless you have a lot of conditionals. It also is a bit like voodoo magic if you don't know how it works.
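
To make the trick less like voodoo, here is a rough sketch of a function that imitates the row-filtering part of subset. The name `my_subset` is hypothetical and this is not the actual source of subset, just the same idea under my own assumptions: capture the condition with substitute, evaluate it inside the data frame, and fall back to the caller's environment for anything not found there.

```r
# Hypothetical helper that imitates the core of subset() for rows only
my_subset = function(df, condition) {
  cond_call = substitute(condition)            # capture the expression unevaluated
  keep = eval(cond_call, df, parent.frame())   # look in df first, then in the caller
  df[!is.na(keep) & keep, ]                    # drop rows where the test is NA
}

# Toy data frame; values are invented, not real PITCHf/x data
toy = data.frame(pitch_type = c("FF", "CU", "FT", "CU"),
                 start_speed = c(92.1, 77.4, 91.8, NA))
x = "CU"

a = my_subset(toy, pitch_type == x)     # pitch_type comes from toy, x from the global environment
b = my_subset(toy, start_speed > 90)    # the row with a missing speed is dropped
c_rows = toy[toy$start_speed > 90, ]    # plain indexing keeps an all-NA row instead

print(nrow(a))
print(nrow(b))
print(nrow(c_rows))
```

The last two lines show one practical difference: when the condition is NA for a row, this sketch throws the row away, while plain bracket indexing hands back a row full of NAs. That is worth remembering whenever a column has missing values.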






Monday, January 2, 2012

R resources

Earlier we spoke about PITCHfx resources, and now we will learn about R resources. Well, what is R? Straight from Wikipedia:

R is a programming language and software environment for statistical computing and graphics. The R language is widely used among statisticians for developing statistical software, and R is widely used for statistical software development and data analysis.

If you are familiar with proprietary statistical packages like SPSS, SAS, and Matlab, then R is like an open-source variant. It's particularly similar to Matlab. And it's nothing like Excel. In Excel, you have a nicely designed graphical user interface (GUI) with buttons that you click on to do analysis. Yeah, you can type in macros and things of that nature, but it's really quite different from R. And much less powerful.

While most of the work done in R is through a command console, there are GUIs for it that can be useful. I use two.

RStudio

RStudio is a nice integrated development environment (IDE) and it's light on memory usage. I use it whenever I'm working with massive sets of data, like 700,000 pitches. You don't need to fumble around with windows because everything is in one window. Download here: http://www.rstudio.org/

Deducer


Deducer is another GUI, and it's special because it gives you the power of the ggplot2 package in a GUI format. This means that for graphs, you don't need the command console. Of course it's still better to know the code as the GUI is kind of limiting, but it's not necessary here. Download here: http://www.deducer.org/pmwiki/pmwiki.php?n=Main.DeducerManual

When you're first starting off a nice friendly GUI can be very helpful, so I'd highly recommend downloading one of these (or both).

Now that you have it downloaded, you need to know what you're doing. A fantastic resource for learning R is The Art of R Programming, available by PDF here. Not only will this tell you a lot about R, but about the best ways to use R. You can view other tutorials here: http://www.statmethods.net/about/books.html. While we don't need to know all of the capabilities of R for baseball analysis, it's best to be familiar with the program before you start trying to make graphs and do analysis.

You also need to read Brian Mills' website. He had a great sab-R-metrics series going earlier this year, most of which can be found here. Brian's website also combines R knowledge with baseball data, so it is similar to this website. All of his posts are tremendous resources, and Brian really knows his stuff. I would definitely recommend reading through all of his R related posts.

If you read through all of these resources and the PITCHfx resources, you'll be a PITCHfx and R master in a reasonable amount of time.


 

PITCHfx resources

Most of the posts on this website will deal with PITCHfx data in some capacity, so here are some resources. These will teach you a lot no matter what level of experience you have. One great primer is this one, originally posted in a Hardball Times annual and posted later online, written by Mike Fast. A second primer that you should read can be found here, written by Josh Smolow. These two primers are very informative and should give you a very good grasp on how to responsibly leverage the data to learn about baseball. Here's one last piece that you should read, also by Mike Fast; it talks about common mistakes made with PITCHfx. As far as I'm concerned, these pieces are required reading for working with the data. 

The best way to play with the data is to get your own database though. Lots has already been written about that on the internet, so I won't rehash the various methods right now. If you can't get your own though, that's ok, but you'll need to rely on Joe Lefkowitz' site.

You will get the most out of my website if you read the articles that I have linked to. In a few minutes I will post something on R resources. 

Example

Here is a little example of what I do. While learning R isn't easy, it can be very powerful and efficient once you get your feet wet. I intend for this example to whet your appetite. This should take you less than 20 minutes. By the end, you will have made this graph:



Pretty, isn't it?

Go here: http://joelefkowitz.com/pitcher_card.php?pid=136880 and click "Download excel file." This is Roy Halladay's data from 2011. Open the file in Excel, click "Save As", and choose the CSV file type so that the file is saved as "halladay.csv". This will make it easier to import to R.

Go here, and download R. Then open R. To read in the file, we need to change our working directory. We can do this with the setwd() command. Mine looks like this:

setwd("C:/Users/Josh/baseball_stuff/PITCHRX")


Now read in the data. Type:

pitcher = read.csv("halladay.csv")

This reads in the file and assigns it to an object called pitcher. To get a feel for the object, first type in

head(pitcher)

This shows you the first few rows of the object so that you know the import wasn't screwed up. Now type

str(pitcher)

While str() means convert to string in Python, in R it means structure. This will give you a feel for each column in the object, which we can see is of type data.frame (like a spreadsheet in Excel). Looks like everything went well, awesome.

Now I want a graph that shows me Halladay's pitch locations. And I want it to be pretty, and to be split up by pitch type. I also want smoothing, and labeled axes. And to top it off, limited axis ranges. We need the ggplot2 package. To install it, type

install.packages("ggplot2")

Load it by typing

library("ggplot2")

Now we can use it. But first, eliminate pitches that we don't care about, by typing:

pitcher = subset(pitcher, !(pitch_type  %in%  c("IN",  "")))

Now plot away.


#newer versions of ggplot2 prefer after_stat(density) over ..density..
ggplot(data = pitcher) +
  stat_density2d(geom = "tile", aes(x = px, y = pz, fill = ..density..), contour = FALSE) +
  facet_wrap(~pitch_type) +
  scale_x_continuous("horizontal pitch location") +
  scale_y_continuous("vertical pitch location") +
  coord_cartesian(xlim = c(-2, 2), ylim = c(1, 4))


Boom, you just made one high quality graph in less than 20 minutes. Of course I haven't explained why what we just did works, and it's pretty complicated, but that's why you'll keep reading my website (I hope). We will go over more things like this in the future, but I just wanted to post something quick and powerful.

Welcome!

Welcome to pitchR/x. This is a website that will focus on PITCHf/x data and R. If you have stumbled on to this website, you are probably already familiar with these two things, in at least some capacity. My expectations for the website are something I'm still working on, but we'll figure it out as we go. The website will never have a large readership given the content, but that's ok.

I intend for this website to contribute to the sabermetric community, and to encourage the use of PITCHf/x and R, two tools which I use often in my writing, which can mostly be found at FanGraphs and The Hardball Times.

Thanks for visiting! Later today we will get rolling out of the gates with an example of what I do, and then we will continue this little introduction.

Side note: I love comments. Whenever I post an article somewhere I'm always disappointed if I don't get any feedback, and I freak out a little bit. Paranoia? Whatever, just comment away. Ask questions, tell me I'm an idiot (don't actually do that), I just want this place to encourage discussion.