Working with Minor League Similarity Scores

March 25, 2015

There are some people who see baseball players that way-each one is unique, absolutely not interchangeable with another. I don't deny the validity of that approach-but if you take that tack, then you can't turn around and argue that your player should be in the Hall of Fame because his numbers are just as good as this other player's. "Similarity" is a complex concept, and two players who are not statistically similar may be profoundly similar in some other way...players who have similar primary characteristics will tend to have similar secondary characteristics as well.

- Bill James in Whatever Happened to the Hall of Fame?

Similarity scores were created by Bill James to compare the careers of Hall of Fame eligible players. In the most basic sense, similarity scores use aggregated performance statistics to compare a player’s worth for induction into the Hall of Fame. Projection systems follow from this method: Steamer, PECOTA, Marcel, ZiPS, and others use some combination of a player’s recent performance, usually the last 3-4 seasons, to project future performance. Depending on the method, a player’s base statistics are then modified using typical aging curves, linear weights, regression, and numerous other factors. Notably, PECOTA uses three-year performance statistics of comparable players, found through nearest-neighbors analysis, to forecast a player’s future performance.

"The PECOTA similarity scores are based primarily on looking at a three-year window of a pitcher’s performance. Thus, we might look at what a pitcher did from ages 35-37, and compare that against the most similar age 35-37 performances, after adjusting for parks, league effects, and a whole host of other things. This is different from the similarity scores you might see at or in other places, which attempt to evaluate the totality of a player’s career up to a given age."

- Nate Silver

One of Silver’s first explanations of PECOTA’s forecasting method details the value of projecting a minor league player’s future career from the career performance of his comparable players. Teams would be remiss not to consider what a player’s future statistics might look like based on his previous performance. By relying on this comparable-player model, PECOTA has built a projection system that models minor league players better than its competitors. Let’s use our minor league database to investigate minor league similarity scores and create projections for a notable minor league player.
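To make the nearest-neighbors idea concrete, here is a minimal Python sketch of finding a player's closest comparables from a three-year window of rate stats. It is not PECOTA's actual algorithm: the z-score scaling, the Euclidean distance, the function names, and the player stat lines are all assumptions made up for illustration, and none of the park or league adjustments Silver mentions are applied.

```python
import math

# Hypothetical three-year-window rate stats; these players are invented.
WINDOWS = [
    {"name": "Target",   "BA": 0.300, "OBP": 0.400, "SLG": 0.550},
    {"name": "Player A", "BA": 0.301, "OBP": 0.398, "SLG": 0.552},
    {"name": "Player B", "BA": 0.250, "OBP": 0.310, "SLG": 0.380},
    {"name": "Player C", "BA": 0.295, "OBP": 0.405, "SLG": 0.540},
    {"name": "Player D", "BA": 0.270, "OBP": 0.340, "SLG": 0.450},
]
STATS = ["BA", "OBP", "SLG"]


def zscores(rows, stats):
    """Standardize each stat across all players so no single stat dominates."""
    scaled = {r["name"]: {} for r in rows}
    for s in stats:
        vals = [r[s] for r in rows]
        mu = sum(vals) / len(vals)
        # Fall back to 1.0 if every player has the same value for this stat.
        sd = math.sqrt(sum((v - mu) ** 2 for v in vals) / len(vals)) or 1.0
        for r in rows:
            scaled[r["name"]][s] = (r[s] - mu) / sd
    return scaled


def nearest_comparables(target, rows, stats, k=2):
    """Return the names of the k players closest to `target` in scaled stat space."""
    scaled = zscores(rows, stats)
    t = [scaled[target][s] for s in stats]
    dists = []
    for name, z in scaled.items():
        if name == target:
            continue
        dists.append((math.dist(t, [z[s] for s in stats]), name))
    return [name for _, name in sorted(dists)[:k]]


comps = nearest_comparables("Target", WINDOWS, STATS, k=2)
print(comps)  # ['Player A', 'Player C']
```

Scaling each stat before measuring distance is one simple way to keep a wide-ranging stat from swamping the others; a real system would weight the stats deliberately rather than treating them equally.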

Bill James Similarity Scores

Similarities - Career

James’ Similarity Score model was designed for major league careers, but let’s see how the model holds for minor league careers. The dataset includes minor league statistics from 2000-2014:

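| Player | Age |  | G | PA | AB | R | H | 2B | 3B | HR | RBI | SB | CS | BB | SO | BA | OBP | SLG | OPS | TB | Sim |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Kris Bryant | 22 | 1.58 | 174 | 740 | 620 | 140 | 203 | 48 | 3 | 52 | 142 | 16 | 4 | 97 | 197 | 0.327 | 0.428 | 0.666 | 1.094 | 413 | 1000 |
| Ryan Braun | 23 | 1.67 | 199 | 864 | 767 | 131 | 240 | 61 | 6 | 42 | 144 | 34 | 12 | 70 | 151 | 0.313 | 0.375 | 0.572 | 0.947 | 439 | 926 |
| Alex Gordon* | 26 | 1.89 | 235 | 1061 | 867 | 199 | 278 | 69 | 5 | 48 | 170 | 30 | 5 | 155 | 212 | 0.321 | 0.438 | 0.578 | 1.016 | 501 | 906 |
| Kelvin Diaz | 21 | 0 | 182 | 765 | 639 | 117 | 213 | 47 | 7 | 22 | 147 | 23 | 12 | 74 | 92 | 0.333 | 0.426 | 0.532 | 0.958 | 340 | 903 |
| Jake Lamb* | 23 | 1 | 244 | 1079 | 920 | 158 | 295 | 83 | 10 | 37 | 193 | 10 | 2 | 127 | 229 | 0.321 | 0.406 | 0.553 | 0.959 | 509 | 902 |
| D.J. Peterson | 22 | 1.25 | 178 | 777 | 703 | 119 | 210 | 42 | 2 | 44 | 158 | 8 | 2 | 65 | 158 | 0.299 | 0.362 | 0.552 | 0.914 | 388 | 896 |
| Matt Williams | 36 | 2.17 | 21 | 71 | 65 | 11 | 22 | 5 | 0 | 5 | 15 | 1 | 0 | 6 | 5 | 0.338 | 0.394 | 0.646 | 1.04 | 42 | 894 |
| Evan Longoria | 26 | 2.37 | 219 | 937 | 803 | 145 | 238 | 43 | 1 | 47 | 160 | 8 | 2 | 104 | 170 | 0.296 | 0.385 | 0.528 | 0.913 | 424 | 890 |
| Jose Fernandez | 26 | 3 | 255 | 1052 | 920 | 169 | 287 | 74 | 5 | 41 | 182 | 19 | 10 | 104 | 184 | 0.312 | 0.389 | 0.537 | 0.926 | 494 | 889 |
| Albert Pujols | 20 | 1.67 | 133 | 544 | 490 | 74 | 154 | 41 | 7 | 19 | 96 | 4 | 5 | 46 | 47 | 0.314 | 0.378 | 0.543 | 0.921 | 266 | 887 |
| Pedro Feliz | 36 | 2.5 | 156 | 646 | 606 | 96 | 174 | 39 | 2 | 38 | 119 | 1 | 2 | 31 | 110 | 0.287 | 0.321 | 0.546 | 0.867 | 331 | 883 |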

You might have heard of a few of those names. The issue with these similarities is that they encompass a player’s career minor league statistics; we’re more interested in the performance of Bryant’s same-aged peers.
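For readers who want to see how a James-style score is put together, here is a minimal Python sketch. It assumes the point costs from James' published batting similarity method (start at 1000 and subtract one point per fixed amount of difference in each stat), skips his positional adjustment, and allows fractional points. The stat lines are copied from the career rows for Bryant, Braun, and Gordon above. The scores it produces will not necessarily match the last column of the table, since the exact weights and inputs behind that column are not listed here.

```python
# One point is deducted per this much difference in each stat
# (point costs from Bill James' published batting similarity method).
JAMES_DIVISORS = {
    "G": 20, "AB": 75, "R": 10, "H": 15, "2B": 5, "3B": 4, "HR": 2,
    "RBI": 10, "BB": 25, "SO": 150, "SB": 20, "BA": 0.001, "SLG": 0.002,
}

# Career minor league lines taken from the table above.
bryant = {"G": 174, "AB": 620, "R": 140, "H": 203, "2B": 48, "3B": 3, "HR": 52,
          "RBI": 142, "BB": 97, "SO": 197, "SB": 16, "BA": 0.327, "SLG": 0.666}
braun = {"G": 199, "AB": 767, "R": 131, "H": 240, "2B": 61, "3B": 6, "HR": 42,
         "RBI": 144, "BB": 70, "SO": 151, "SB": 34, "BA": 0.313, "SLG": 0.572}
gordon = {"G": 235, "AB": 867, "R": 199, "H": 278, "2B": 69, "3B": 5, "HR": 48,
          "RBI": 170, "BB": 155, "SO": 212, "SB": 30, "BA": 0.321, "SLG": 0.578}


def similarity_score(a, b):
    """Start at 1000 and subtract a penalty for each statistical difference.

    Assumption: fractional points are kept and no positional adjustment is made.
    """
    penalty = sum(abs(a[stat] - b[stat]) / per_point
                  for stat, per_point in JAMES_DIVISORS.items())
    return 1000 - penalty


for name, line in [("Ryan Braun", braun), ("Alex Gordon", gordon)]:
    print(name, round(similarity_score(bryant, line)))
```

Because batting average and slugging are charged per .001 and .002 of difference, gaps in rate stats tend to dominate the penalty when the counting stats are as small as they are in a minor league career.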

Similarities - Age

Let’s see how Bryant compares to other 22-year-old players:

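| Player | Age |  | G | PA | AB | R | H | 2B | 3B | HR | RBI | SB | CS | BB | SO | BA | OBP | SLG | OPS | TB | Sim |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Kris Bryant | 22 | 2.5 | 174 | 740 | 620 | 140 | 203 | 48 | 3 | 52 | 142 | 16 | 4 | 97 | 197 | 0.327 | 0.428 | 0.666 | 1.095 | 413 | 1000 |
| Alex Gordon* | 22 | 2 | 130 | 576 | 486 | 111 | 158 | 39 | 1 | 29 | 101 | 22 | 3 | 72 | 113 | 0.325 | 0.427 | 0.588 | 1.016 | 286 | 932 |
| Corey Dickerson* | 22 | 1 | 175 | 743 | 659 | 132 | 204 | 49 | 14 | 45 | 148 | 21 | 12 | 67 | 150 | 0.31 | 0.38 | 0.631 | 1.011 | 416 | 924 |
| Nick Akins | 22 | 0 | 126 | 548 | 472 | 94 | 152 | 40 | 7 | 32 | 120 | 5 | 5 | 58 | 135 | 0.322 | 0.407 | 0.64 | 1.047 | 302 | 921 |
| Kevin Mench | 22 | 1 | 132 | 583 | 491 | 118 | 164 | 39 | 9 | 27 | 121 | 19 | 7 | 78 | 72 | 0.334 | 0.427 | 0.615 | 1.042 | 302 | 909 |
| Ryan Braun | 22 | 1.5 | 165 | 730 | 650 | 103 | 200 | 49 | 6 | 32 | 122 | 30 | 9 | 55 | 140 | 0.308 | 0.367 | 0.549 | 0.917 | 357 | 907 |
| Mark Teixeira# | 22 | 1.5 | 86 | 375 | 321 | 63 | 102 | 21 | 5 | 19 | 69 | 5 | 2 | 46 | 60 | 0.318 | 0.413 | 0.592 | 1.005 | 190 | 904 |
| Jake Lamb* | 22 | 0.5 | 136 | 619 | 528 | 95 | 167 | 44 | 5 | 22 | 109 | 8 | 2 | 74 | 126 | 0.316 | 0.405 | 0.544 | 0.949 | 287 | 902 |
| James Darnell | 22 | 1 | 142 | 630 | 524 | 89 | 167 | 41 | 5 | 22 | 96 | 9 | 7 | 98 | 101 | 0.319 | 0.428 | 0.542 | 0.97 | 284 | 901 |
| Jedd Gyorko | 22 | 1.5 | 208 | 945 | 844 | 154 | 273 | 64 | 2 | 32 | 155 | 14 | 4 | 92 | 171 | 0.323 | 0.392 | 0.518 | 0.909 | 437 | 900 |
| Hunter Pence | 22 | 1.0 | 172 | 737 | 652 | 119 | 207 | 40 | 5 | 39 | 127 | 12 | 10 | 79 | 120 | 0.317 | 0.391 | 0.574 | 0.964 | 374 | 898 |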

Still a very impressive list. Using these similar players, let’s take a page out of the PECOTA playbook and generate some basic projections without adjusting for outside effects (park factors, leagues, league-wide performance shifts, etc.). By simply calculating the mean of the top 10 comparable players for each statistical category, we can get a general idea of Bryant’s future performance.
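The averaging step just described can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the code behind this page: the function name and data layout are hypothetical, and the season lines are invented placeholders standing in for what each comparable player did in the seasons after age 22.

```python
from statistics import mean

# Hypothetical future seasons for each comparable, ordered by years after age 22.
# These numbers are placeholders, not real stat lines.
comparables = {
    "Comp A": [{"HR": 20, "BA": 0.280}, {"HR": 30, "BA": 0.300}],
    "Comp B": [{"HR": 10, "BA": 0.260}],
}


def project_from_comparables(comps, stats):
    """Average each stat across comparables, one projected season at a time."""
    horizon = max(len(seasons) for seasons in comps.values())
    projection = []
    for year in range(horizon):
        # Only comparables who actually played a season at this offset contribute.
        lines = [seasons[year] for seasons in comps.values() if len(seasons) > year]
        projection.append({s: mean(line[s] for line in lines) for s in stats})
    return projection


projection = project_from_comparables(comparables, ["HR", "BA"])
for year, line in enumerate(projection, start=1):
    print(year, line)
```

One design choice to be aware of: comparables whose careers end early simply drop out of later years, so the far end of a long projection rests on fewer players and skews toward those who lasted.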

10 Year Projection - Kris Bryant


Let’s take that table and visualize it!

Using rCharts, an interactive visualization package maintained by Ramnath Vaidyanathan, I created an R Markdown page with knitr to display the projection data.

Have a look!

Have feedback, questions, or want to see something else added? Check out the code I used to create this page or fork my repository to propose changes.