install.packages("plyr")

You also need the "mgcv" package for creating the model that is used in the calculations.

The code is below, with comments marked by #. Hope it's readable:

#load library

library("plyr")

#initialize -- add necessary variables. Best way is with the vectorized ifelse function.

pitcher$babipn = ifelse(pitcher$type=="X" & pitcher$event %in% c("Single", "Double", "Triple"), 1, 0)

pitcher$inplay = ifelse(pitcher$type=="X" & pitcher$event != "Home Run", 1, 0)

pitcher$swing = ifelse(pitcher$type=='X' | pitcher$des %in% c('Foul', 'Foul Tip', 'Foul (Runner Going)',
                       'Swinging Strike', 'Swinging Strike (Blocked)'), 1, 0)

pitcher$walk = ifelse(pitcher$des %in% c("Ball", "Ball In Dirt", "Intent Ball")
                      & pitcher$event %in% c("Intent Walk", "Walk") & pitcher$ball==3, 1, 0)

pitcher$strikeout = ifelse(pitcher$des %in% c("Called Strike", "Foul Tip", "Strike",
                           "Swinging Strike", "Swinging Strike (Blocked)")
                           & pitcher$event %in% c("Strikeout","Strikeout - DP") & pitcher$strike==2, 1, 0)

pitcher$homerun = ifelse(pitcher$des == "In play, run(s)" & pitcher$event == 'Home Run', 1, 0)

pitcher$full_name = paste(pitcher$first, pitcher$last)
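As a toy illustration of the vectorized ifelse() pattern used above (made-up rows, not real PITCHf/x data):

```r
# toy data frame standing in for PITCHf/x pitch records (values are made up)
toy <- data.frame(type  = c("X", "S", "X", "B"),
                  event = c("Single", "Strikeout", "Home Run", "Walk"))

# same pattern as above: 1 when both conditions hold, 0 otherwise,
# evaluated element-wise over the whole column at once -- no loop needed
toy$babipn <- ifelse(toy$type == "X" & toy$event %in% c("Single", "Double", "Triple"), 1, 0)
toy$babipn  # 1 0 0 0
```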

#find number of plate appearances for each umpire, don't know why I named it atbats

#calculate kwERA, FIP, babip, and swing rate

pitcher = ddply(pitcher, .(ump_id), transform, atbats = length(unique(ab_id)))

pitcher = ddply(pitcher, .(ump_id), transform,
                fip = ((13 * sum(homerun) + 3 * sum(walk) - 1.93 * sum(strikeout)) / (.23 * atbats)) + 3.10,
                kwERA = 5.24 - (12 * (sum(strikeout) - sum(walk)) / atbats),
                babip = sum(babipn) / sum(inplay), swing = mean(swing))
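For readers new to plyr: ddply with transform splits the data frame by ump_id, computes the new columns within each piece, and attaches the result back to every row of that piece. The same group-wise idea sketched with base R's ave() on made-up numbers:

```r
# toy data: two groups ("umpires") with an indicator column
df <- data.frame(grp = c("a", "a", "b"), k = c(1, 1, 0))

# per-group sum attached back to every row, like ddply + transform
df$k_total <- ave(df$k, df$grp, FUN = sum)
df$k_total  # 2 2 0
```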

#subset data to only look at pitches where umpire makes a call

#this won't affect the stats we just calculated

pitcher = subset(pitcher, des %in% c("Automatic Ball", "Ball", "Ball In Dirt", "Called Strike",
                                     "Intent Ball", "Pitchout", "Strike"))

#add in variable to describe called strike

pitcher$cs = ifelse(pitcher$des == 'Called Strike', 1, 0)

#find number of pitches seen by each umpire

pitcher = ddply(pitcher, .(ump_id), transform, n = length(px))

#only look at umpires with greater than 3000 pitches

pitcher = subset(pitcher, n > 3000)

#define function to calculate strikezone area

csarea = function(x) {
  require('mgcv')
  #create model with mgcv package, use smoothing
  model <- gam(cs ~ s(px) + s(pz), data = x, family = binomial(link='logit'))
  mygrid <- expand.grid(x=seq(-4, 4, length=50), y=seq(-2, 6, length=50))
  #find predicted called strike rates for a bunch of tiny bins
  mygrid$z <- as.numeric(predict(model, data.frame(px=mygrid$x, pz=mygrid$y), type='response'))
  thresh <- 0.5
  #find number of tiny bins where predicted rate is at least .5
  nbins <- length(mygrid$z[mygrid$z >= thresh])
  #multiply number of bins by the size of each bin: the 8 ft x 8 ft grid has
  #49 steps per side, so each bin is (8/49)^2 = 0.02665556 sq ft
  areas <- nbins * 0.02665556
  #add in fields to data.frame
  x$area <- rep(areas, nrow(x))
  x$nbins <- rep(nbins, nrow(x))
  return(x)
}

#call + evaluate csarea() on each subset of umpire data

pitcher=ddply(pitcher, .(ump_id), "csarea")

#create a summary dataset for each umpire, then sort by area

sum.pitch = ddply(pitcher, .(ump_id), summarize, area=mean(area), full_name=full_name[1],
                  fip=mean(fip), kwERA=mean(kwERA), babip=mean(babip), swing=mean(swing))

sum.pitch=arrange(sum.pitch, area)
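To eyeball the sorted result (a sketch with made-up numbers; plyr's arrange() does the same thing as base R's order() indexing):

```r
# toy summary table standing in for sum.pitch (values are invented)
sum.toy <- data.frame(ump_id = c(1, 2, 3), area = c(3.4, 2.9, 3.1))

# base-R equivalent of arrange(sum.toy, area): smallest zones first
sum.toy <- sum.toy[order(sum.toy$area), ]
sum.toy$ump_id  # 2 3 1
```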

Hmm. Interesting. Here are a few suggestions:

1. I have found that a joint smooth is more appropriate than additive smooths for px and pz. UBRE score is lower for the former model than the latter (and the zone ends up being more interesting as well).
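The additive-vs-joint comparison can be sketched on simulated data (made-up locations and call probabilities, not real PITCHf/x; both gam fits expose their GCV/UBRE score as $gcv.ubre):

```r
library(mgcv)
set.seed(1)

# simulated pitch locations and a roughly zone-shaped call probability
n  <- 2000
px <- runif(n, -2, 2)
pz <- runif(n, 0, 5)
p  <- plogis(4 - 6 * (abs(px) > 0.9) - 6 * (pz < 1.5 | pz > 3.5))
cs <- rbinom(n, 1, p)

# additive smooths vs. a joint (2-D thin-plate) smooth of location
m_add   <- gam(cs ~ s(px) + s(pz), family = binomial(link = "logit"))
m_joint <- gam(cs ~ s(px, pz),     family = binomial(link = "logit"))

# compare the scores; for binomial models this is the UBRE criterion
m_add$gcv.ubre
m_joint$gcv.ubre
```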

2. If you're low on RAM, try using the "bam" version of "gam" in the mgcv package. This saves a lot of memory.
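A minimal sketch of swapping in bam(), again on simulated stand-in data (bam takes the same formula and family arguments as gam):

```r
library(mgcv)
set.seed(2)

# simulated stand-in for a large pitch dataset
n <- 5000
d <- data.frame(px = runif(n, -2, 2), pz = runif(n, 0, 5))
d$cs <- rbinom(n, 1, plogis(2 - 2 * abs(d$px)))

# bam() is the big-data version of gam(): same interface, lower memory use
fit <- bam(cs ~ s(px) + s(pz), data = d, family = binomial(link = "logit"))
```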

3. I actually created a "for loop" for my plots. You can do it for lots of functions...prolly inefficient but it helps with making lots of similar pictures saved as files. Just do:

myx <- matrix(data=seq(from=-2, to=2, length=100), nrow=100, ncol=100)
myz <- t(matrix(data=seq(from=0, to=5, length=100), nrow=100, ncol=100))
fitdata <- data.frame(px=as.vector(myx), pz=as.vector(myz))

for(i in unique(umpire_id)) {
  d <- subset(data, data$umpire_id==i)
  umpname <- paste(i, data$ump_FN, data$umpLN, sep="")
  filename <- paste(i, "Heat.png", sep="")
  png(file=filename, height=850, width=750)
  fit <- gam(strike_call ~ s(px,pz,k=51), method="GCV.Cp",
             family=binomial(link="logit"), data=d)
  mypredict <- predict(fit, fitdata, type="response")
  mypredict <- matrix(mypredict, nrow=100, ncol=100)
  contour(x=seq(from=-1.5, to=1.5, length=100),
          y=seq(from=1, to=4, length=100),
          z=mypredict, lwd=2, lty="dashed", axes=T, levels=.5,
          labels="", labcex=1.3, col="darkred",
          xlab="Horizontal Location (ft., Umpire's View)",
          ylab="Vertical Location (ft.)", main=umpname)
  dev.off()
}

4. Once I have my model for an umpire, I could also add the simple code to calculate area in inches, or just do it manually:

library(splancs)

###Area Contour
clines <- contourLines(x=seq(from=-1.5, to=1.5, length=100),
                       y=seq(from=1, to=4, length=100), mypredict)
ump_area <- round(with(clines[[5]], areapl(cbind(x,y)))*12, 3)
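If splancs isn't handy, the polygon that contourLines() returns can be integrated directly with the shoelace formula in base R (a sketch; contourLines() itself is in the built-in grDevices package, and `poly_area` is just an illustrative helper):

```r
# shoelace formula: area of a simple polygon from its vertex coordinates
poly_area <- function(x, y) {
  n <- length(x)
  j <- c(n, 1:(n - 1))  # index of the previous vertex, wrapping around
  abs(sum(x[j] * y - x * y[j])) / 2
}

# sanity check on a known shape: unit square has area 1
poly_area(c(0, 1, 1, 0), c(0, 0, 1, 1))  # 1

# usage with one contour polygon, e.g. cl <- contourLines(...)[[1]]:
#   poly_area(cl$x, cl$y)
```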

***

Anyway, good stuff. I'll have to play with "ddply". My code when doing crosstabs and stuff like that is really awful and inefficient. I have reached the limits of "tapply".

:-)

Also, sorry for the crappy code format through the comments. Ick!

Thanks for the info/comments, Millsy. And about the ddply function and plyr package, this PDF shows some ways to make use of it. http://www.jstatsoft.org/v40/i01/paper

After getting comfortable with the package, the number of for loops and lapply/split calls that I use has declined a lot.

Another way to generate the plots would be

pdf("filename.pdf")

d_ply(pitcher, .(ump_id), function (x) {code to make plot})

dev.off()

And the way you are finding area with the splancs library seems better than what I'm doing, so I will have to check that out.