Friday, October 3, 2008

Anatomy of a Baseball Stat: Explaining Win Probability

Since many of you might not read Bill James' Baseball Manifestos, I thought I would show you how I arrive at the numbers that I've been showing as the probability of victory for any given playoff team. We will draw upon two formulas from Bill James, one from his 1981 Baseball Abstract and the other from his 1984 Baseball Abstract.

First, let's look at his Pythagorean Win Expectancy "theorem". I use quotes because even though he called it that, he never thought that it quite stood the test of an actual mathematical theorem. Baseball, at its most basic, yet most accurate, is a game whose outcome is determined by how many runs the teams score. This comes down to four data points: runs scored by both teams and runs given up by both teams. The obvious: the team with more runs after 9 innings wins. So he borrowed the form of the Pythagorean theorem that you all probably still remember (and aren't you glad that's so? I'm sure you've used it recently to determine the long side of a right triangle given the two short sides!): a² + b² = c²

Now this has led to countless important discoveries, not the least of which, in Algebra and Trigonometry, couldn't have happened without it. James' basic idea was that there is some relationship between runs scored, runs given up, and wins. I don't recall the details now (I first came across this in the late '80s). But the formula that he came up with (which I've detailed elsewhere before) is:

P WP = R² / (R² + RA²)

P WP = Predicted Winning Percentage
R = Runs Scored (batting)
RA = Runs Given Up (pitching)

So as an example, let's compute this for the Cubs this season. Cubs scored 854 and gave up 668 runs.

P WP = 854² / (854² + 668²) = .620
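
If you'd rather let a computer do the arithmetic, here is a minimal Python sketch of the same calculation. The function name `pythagorean_wp` and the `exponent` parameter are my own labels for illustration, not anything from James; the exponent defaults to 2 to match the formula above.

```python
def pythagorean_wp(runs_scored: float, runs_allowed: float, exponent: float = 2.0) -> float:
    """Predicted winning percentage: R^k / (R^k + RA^k), with k = 2 as in the post."""
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    return rs / (rs + ra)


# Cubs season totals quoted above: 854 runs scored, 668 given up.
cubs_pwp = pythagorean_wp(854, 668)
print(f"{cubs_pwp:.3f}")  # 0.620
```

A team that scores exactly as many runs as it gives up comes out at .500, which is a quick sanity check on the formula.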

This is the number that was used to compute the predictive standings. But for the playoffs, I get more granular, since the series involves playing 3 games at home and 2 on the road. For that, I consult ESPN, but this time I pull their extended team stats and split runs scored and runs against by home and away.

So the Cubs then end up at:

Home: P WP = 454² / (454² + 339²) = .649
Away: P WP = 401² / (401² + 332²) = .563

So this is where we want to see what their chances of winning a home game and winning an away game are against their opponent, the LA Dodgers. This is when we bring in James' second discovery, which he called Log5 probability (I won't go into the details of the name, that's totally straining your staying power!). Log5 allows you to take the winning percentage of any team, combine it with that of their opponent, and come up with a probability of victory (or defeat, as the case might be). Now, because actual winning percentage can sometimes be a function of luck or other external factors, I use the Pythagorean or predicted winning percentage.

To do this, I also need the same information for the opposing team, in this case the Dodgers. I'll use the same method and come up with their home and away winning percentages, which are .637 and .454, respectively. So the formula will then look like this:

One Game Win Probability (Cubs home advantage)
 = (CHPWP - (CHPWP x LAAPWP)) / (CHPWP + LAAPWP - (2 x LAAPWP x CHPWP)), where
CHPWP = Cubs Home Predicted Winning Percentage
LAAPWP = Dodgers Away Predicted Winning Percentage

With the numbers just given, the formula and result look like this:
Cubs Win Probability = (.649 - (.649 x .454)) / (.649 + .454 - (2 x .454 x .649)) = .690 or 69%
LA Dodgers Win Probability = 1 - Cubs Win Probability = .310 or 31%
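
The same one-game calculation can be sketched in a few lines of Python. The function name `log5` and its argument names are my own choices for illustration; the inputs are the Cubs' home and Dodgers' away predicted winning percentages quoted above.

```python
def log5(p_a: float, p_b: float) -> float:
    """Chance that team A beats team B, given each team's winning percentage."""
    return (p_a - p_a * p_b) / (p_a + p_b - 2 * p_a * p_b)


# Cubs at home (.649) against the Dodgers on the road (.454).
cubs_home_win = log5(0.649, 0.454)
print(f"Cubs: {cubs_home_win:.3f}, Dodgers: {1 - cubs_home_win:.3f}")
# Cubs: 0.690, Dodgers: 0.310
```

Two evenly matched teams come out at exactly .500, and swapping the arguments gives the opponent's chance, so the two results always add up to 1.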

The next step is the one I outlined briefly earlier this week and referenced in the Diamond Mind article of Monday's post, where you assemble the probabilities of all the different scenarios in a 5-game series: sweep 3, lose the first and win the next 3, lose the first 2 and win the next 3, etc. Each one of these is associated with a probability. So, as an example, how do we calculate the probability of a 3-game sweep for these two teams? We simply use the multiplicative property of probabilities: the chance of several independent events all happening is the product of their individual chances. The Cubs have a probability of .690 of winning each of the first two home games. Their chance of winning an away game, calculated following the above Log5 method, is .454. The formula and result:

Probability of 3-game sweep for Cubs = .690 x .690 x .454 = 22%

For the Dodgers, the chance of a sweep was only 5% (.310 x .310 x .546). But since the Cubs lost the first two games at home, the chance of a sweep for the Dodgers went up:

Probability of Dodgers Sweep with two victories = 1.00 x 1.00 x .546 = 55%
(1.00 represents a probability of 100% for a game which we know they already won)
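
To assemble every scenario at once instead of one by hand, something like the following Python sketch would work. It is only an illustration under stated assumptions: the function name `series_outcomes` is made up, games are treated as independent, the per-game numbers are the .690 (home) and .454 (away) used above, and the home, home, away, away, home order of games is my assumption about the schedule.

```python
from itertools import product


def series_outcomes(game_probs, wins_needed=3):
    """Map each result string (e.g. 'WWW', 'LWWW') to its probability.

    game_probs[i] is the chance of winning game i+1; games are assumed independent.
    """
    outcomes = {}
    for results in product("WL", repeat=len(game_probs)):
        wins = losses = 0
        prob = 1.0
        played = ""
        for result, p in zip(results, game_probs):
            prob *= p if result == "W" else 1 - p
            played += result
            if result == "W":
                wins += 1
            else:
                losses += 1
            if wins == wins_needed or losses == wins_needed:
                break  # series decided; remaining games are not played
        # Sequences that end the same way share one key and one probability.
        outcomes[played] = prob
    return outcomes


# Cubs' per-game chances; schedule order is an assumption (H, H, A, A, H).
cubs = [0.690, 0.690, 0.454, 0.454, 0.690]
outcomes = series_outcomes(cubs)
print(f"Cubs sweep:    {outcomes['WWW']:.0%}")  # 22%
print(f"Dodgers sweep: {outcomes['LLL']:.0%}")  # 5%

# Once the Cubs have lost the first two games, those games become certainties.
after_two_losses = series_outcomes([0.0, 0.0, 0.454, 0.454, 0.690])
print(f"Dodgers sweep now: {after_two_losses['LLL']:.0%}")  # 55%

# Overall chance of winning the series: add up every scenario with 3 wins.
cubs_series = sum(p for seq, p in outcomes.items() if seq.count("W") == 3)
print(f"Cubs win series: {cubs_series:.3f}")
```

Setting a game's probability to 0.0 or 1.0 is the same trick as the 1.00 x 1.00 in the sweep calculation above: a game already played is no longer a matter of chance.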

Hopefully, this has been properly explained. Now to go back to the Red Sox vs. Angels game!
