
My name is Ben Dominguez (@bendominguez011) and I'm the founder of Fantasy Data Pros. I am here to help you learn to code with Fantasy Baseball!

Fantasy Data Pros is a platform for people to turn their obsession with sports into programming and data science skills. We teach programming and data science fundamentals all through sports. Fantasy Data Pros offers a free tutorial series for each of the 5 major sports (we'll be coming out with more posts in the coming months) and a couple of online courses, including a Learn Python with Baseball course.

In the Learn Python with Baseball course, we offer 11 learning modules, 7 hours of video content, and an online community of 300+ other members learning to code.

Sign up to become an F6P All-Access Member to receive a promo code for $15 off access to Fantasy Data Pros tutorials today!

## Fantasy Data Pros - Learn to Code With Fantasy Baseball

### Analyzing Projected Batting Averages

In this post, we'll show you the power of Python and advanced data analytics by looking at projected batting averages from four different sources for the upcoming MLB season.

We're interested in seeing which players have the highest and lowest variance across their projected batting averages. This can be valuable information for season-long fantasy drafts, since variance gives us insight into how confident we can be in a player's projected production (in this case, batting average).

In general, lower variance among a player's projections suggests we can have higher confidence in those projections coming to fruition, while players with higher variance among their projections could be classified as riskier.

Let's get started by reading in our data and combining it into a pandas data frame for analysis.

    import pandas as pd
    import matplotlib.pyplot as plt
    import warnings
    warnings.simplefilter('ignore')

    # Read in projection sources
    atc = pd.read_csv("https://raw.githubusercontent.com/fantasydatapros/fangraph-data-f6p/main/atcBats.csv")
    batx = pd.read_csv("https://raw.githubusercontent.com/fantasydatapros/fangraph-data-f6p/main/batxBats.csv")
    steamer = pd.read_csv("https://raw.githubusercontent.com/fantasydatapros/fangraph-data-f6p/main/steamerBats.csv")
    zips = pd.read_csv("https://raw.githubusercontent.com/fantasydatapros/fangraph-data-f6p/main/zipsBats.csv")

    # Create a list of columns we want to look at that are related to batting average
    stats = ['AVG']

    # List of player descriptive columns
    player_info = ['Name', 'PlayerId', 'Team']

    # Filter out columns we won't be using
    atc = atc[player_info + stats]
    batx = batx[player_info + stats]
    steamer = steamer[player_info + stats]
    zips = zips[player_info + stats]

    # We'll need to combine the projection sources into one data frame, but
    # before we do that we'll need to give each stat projection a unique column name.
    # Loop through each stat and give it a new name that corresponds to the projection source
    for stat in stats:
        atc.rename({stat: f'atc_proj_{stat}'}, axis=1, inplace=True)
        batx.rename({stat: f'batx_proj_{stat}'}, axis=1, inplace=True)
        steamer.rename({stat: f'steamer_proj_{stat}'}, axis=1, inplace=True)
        zips.rename({stat: f'zips_proj_{stat}'}, axis=1, inplace=True)

    # Combine stats into one data frame
    df = atc.merge(
        batx, on=player_info, how='left'
    ).merge(
        steamer, on=player_info, how='left'
    ).merge(
        zips, on=player_info, how='left')

    # Determine mean projected batting average for each player
    df['avg_proj_AVG'] = df[['atc_proj_AVG', 'batx_proj_AVG', 'steamer_proj_AVG', 'zips_proj_AVG']].mean(axis=1)

    # Round batting averages to 3 decimal places
    df = df.round(3)

    # Determine variance among player projections
    df['proj_variance'] = df[['atc_proj_AVG', 'batx_proj_AVG', 'steamer_proj_AVG', 'zips_proj_AVG']].var(axis=1)

    # Sort data frame by average projected averages
    df.sort_values(by='avg_proj_AVG', ascending=False, inplace=True)
    df.head(5)

| Name | Team | ATC | THE BAT X | Steamer | ZiPS | Mean AVG | Variance |
|---|---|---|---|---|---|---|---|
| Luis Arraez | MIA | .302 | .296 | .297 | .311 | .301 | .000047 |
| Freddie Freeman | LAD | .301 | .299 | .292 | .293 | .296 | .000020 |
| Masataka Yoshida | BOS | .286 | .282 | .299 | .305 | .293 | .000117 |
| Trea Turner | PHI | .292 | .286 | .283 | .301 | .291 | .000063 |
| Vladimir Guerrero Jr. | TOR | .289 | .296 | .293 | .284 | .290 | .000027 |

So now we have a data frame that contains projections for 2023 batting averages from four separate projection sources. Additionally, we have the variance associated with the projections for each player.
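As a sanity check on the variance column, the figures in the table can be reproduced by hand. The sketch below uses only the standard library and the four projections listed above for Luis Arraez and Freddie Freeman. It assumes the sample variance (dividing by n - 1), which is what pandas' `var` computes by default; the `summarize` helper is a made-up name for this example.

```python
from statistics import mean, variance

# Projected batting averages copied from the table above
# (order: ATC, THE BAT X, Steamer, ZiPS).
projections = {
    "Luis Arraez": [0.302, 0.296, 0.297, 0.311],
    "Freddie Freeman": [0.301, 0.299, 0.292, 0.293],
}


def summarize(values):
    """Return (mean, sample variance) for one player's projections."""
    # statistics.variance divides by n - 1, matching pandas' default ddof=1.
    return mean(values), variance(values)


for player, values in projections.items():
    avg, var = summarize(values)
    print(f"{player}: mean={avg:.4f}, variance={var:.6f}")
```

The Arraez variance comes out at roughly .000047, in line with the table.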

Suppose we're getting ready for a Fantasy Baseball draft. If we treat projected batting average as the most important statistic when drafting, we want to be confident that the batting average projected for each player pans out. So when we look at available players, we want to see a high projected batting average with a low variance.
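Variances this small are hard to read at a glance. One option, which is not part of the analysis in this post, is to take the square root and express the spread in points of batting average (thousandths). The helper name `spread_in_points` is made up for this sketch; the variances are the ones from the table above.

```python
import math


def spread_in_points(variance):
    """Convert a projection variance to a standard deviation in points (.001 = 1 point)."""
    return math.sqrt(variance) * 1000


# Variances taken from the table of top projected averages.
table_variances = {
    "Luis Arraez": 0.000047,
    "Freddie Freeman": 0.000020,
    "Masataka Yoshida": 0.000117,
}

for player, var in table_variances.items():
    print(f"{player}: about {spread_in_points(var):.1f} points")
```

On that scale, Yoshida's four projections are spread by roughly 11 points, versus roughly 4 to 5 points for Freeman.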

Since we're interested in drafting good players, let's filter out players with an average projected batting average of less than .250. After that, we'll sort our data frame by projection variance so we can see which players have the highest variance with respect to their projected batting average.

    df = df[df.avg_proj_AVG >= 0.250]
    df.sort_values(by='proj_variance', ascending=False, inplace=True)

    # Plot Top 10 Players With Highest Batting Average Projection Variance
    temp = df.copy()

    # Scale Variance column by 1000
    temp.proj_variance *= 1000

    # Rename proj_variance column to something more descriptive for our visual
    # Also rename Name to Player
    temp.rename({
        'proj_variance': 'Batting Averages Projection Variance',
        'Name': 'Player'
    }, axis=1, inplace=True)

    # Make a bar plot for variance (DataFrame.plot creates its own figure)
    ax = temp.head(10).plot.bar('Player', 'Batting Averages Projection Variance')
    plt.xticks(rotation=45)
    plt.title('MLB 2023 High Variance Batting Average Projections (Projected Average >0.250)')
    ax.set_ylabel('Variance (scaled * 1000)')
    ax.set_ylim(0, 0.35)

    # Overlay Projected Batting Averages
    ax2 = ax.twinx()
    ax2.plot(temp.head(10).Player, temp.head(10).avg_proj_AVG, color="red", marker="o")
    ax2.set_ylabel("Average Batting Average Projection")
    ax2.set_ylim(0.250, 0.300)
    ax2.legend(['Projected AVG'], loc='upper right')
    ax.legend(loc='upper left');

In the plot above, we're looking at the 10 players with the highest variance among their batting average projections. Joey Ortiz has the highest variance. This comes as no surprise, since Ortiz is a prospect for the Baltimore Orioles and projection systems have little major-league history to work from.

If we move further down the list, we can see names like Jose Miranda and Jose Iglesias, each of whom is among the three highest-variance players. With solid projected batting averages greater than .270, these would be decent players to have on our season-long Fantasy Baseball squad, but we would urge caution given the high variance in their projections.

Let's take a closer look at Jose Iglesias and his career statistics.

    # Read in Jose Iglesias game data
    ji = pd.read_csv("https://raw.githubusercontent.com/fantasydatapros/fangraph-data-f6p/main/jose_iglesias.csv")
    ji.sort_values(by='game_date', inplace=True)

    # Aggregate the game log to one row per season
    # (string aggregator names avoid the FutureWarning newer pandas raises for np.sum / np.mean)
    temp = ji.groupby('season', as_index=False).agg({
        'hits': 'sum',
        'atBats': 'sum',
        'AVG': 'mean',
        'gamesPlayed': 'sum'
    })

    # Plot batting average by season
    ax = temp.plot('season', 'AVG', marker='o')
    ax.set_ylabel('Batting Average')
    ax.set_xlabel('Season')
    ax.set_title('Jose Iglesias Career Batting Averages')

    # Overlay games played on a second y-axis
    ax2 = ax.twinx()
    ax2.plot(temp.season, temp.gamesPlayed, color="red", marker="o")
    ax2.set_ylabel("Games Played")
    ax2.set_ylim(0, 162)
    ax2.legend(['Games Played'], loc='upper right')
    ax.legend(loc='upper left');
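One caveat about the aggregation above: if the `AVG` column holds single-game averages (an assumption, since the file's column definitions aren't shown in this post), then taking its mean weights a 1-for-1 game the same as a 1-for-5 game. Dividing total hits by total at-bats avoids that. The sketch below uses a tiny made-up game log, not real Iglesias data, and hypothetical helper names to show the difference.

```python
# Hypothetical game log: (hits, at_bats) per game. Illustrative numbers only.
games = [(1, 1), (0, 4), (2, 5), (1, 4)]


def mean_of_game_averages(games):
    """Average the per-game batting averages, skipping games with no at-bats."""
    per_game = [hits / at_bats for hits, at_bats in games if at_bats > 0]
    return sum(per_game) / len(per_game)


def season_average(games):
    """Total hits divided by total at-bats, the conventional batting average."""
    total_hits = sum(hits for hits, _ in games)
    total_at_bats = sum(at_bats for _, at_bats in games)
    return total_hits / total_at_bats


print(f"mean of game averages: {mean_of_game_averages(games):.3f}")
print(f"hits / at-bats:        {season_average(games):.3f}")
```

Since the aggregation already sums `hits` and `atBats` per season, the same ratio could be computed from those two columns if the per-game mean turns out to be misleading.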

Jose Iglesias had some up-and-down years between 2013 and 2018, but since then his average has hovered steadily around the .275-.280 range.

So why the high variance in the projections?

It could be that some projections weight recent years more heavily, while others give the down years early in his career equal weight. This can result in a wider range of projections.
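To see how the weighting choice alone can pull projections apart, here is a toy sketch. The season averages and both weighting schemes are invented for illustration; they are not how ATC, THE BAT X, Steamer, or ZiPS actually work.

```python
# Hypothetical season batting averages, oldest to newest (illustrative only).
season_avgs = [0.255, 0.260, 0.288, 0.290]


def weighted_projection(avgs, weights):
    """Weighted mean of past season averages."""
    if len(avgs) != len(weights):
        raise ValueError("need one weight per season")
    return sum(a * w for a, w in zip(avgs, weights)) / sum(weights)


# Every season counts the same.
equal = weighted_projection(season_avgs, [1, 1, 1, 1])
# The most recent season counts four times as much as the oldest.
recent_heavy = weighted_projection(season_avgs, [1, 2, 3, 4])

print(f"equal weights: {equal:.3f}")
print(f"recent-heavy:  {recent_heavy:.3f}")
```

With these made-up inputs the two schemes land roughly 7 points apart, even though they start from identical history.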

For fun, let's look at how his 40-game moving average compares to his 162-game moving average. We'd expect volatile players to have more hot and cold streaks, so their 40-game moving average would deviate further from their 162-game moving average.
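For readers new to rolling windows, the sketch below spells out in plain Python what a trailing moving average with a minimum number of observations does. `rolling_mean` and `mean_abs_gap` are hypothetical helpers written for this example, and the input is a short made-up series with small windows so the result is easy to check by hand.

```python
def rolling_mean(values, window, min_periods):
    """Trailing moving average; None until min_periods values are available."""
    out = []
    for i in range(len(values)):
        # Take up to `window` values ending at position i.
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk) if len(chunk) >= min_periods else None)
    return out


def mean_abs_gap(short, long):
    """Average absolute gap between two series where both are defined."""
    gaps = [abs(s - l) for s, l in zip(short, long) if s is not None and l is not None]
    return sum(gaps) / len(gaps) if gaps else None


# Made-up per-game values; windows of 2 and 4 stand in for 40 and 162 games.
values = [0.2, 0.4, 0.0, 0.4, 0.2, 0.6]
short = rolling_mean(values, window=2, min_periods=2)
long = rolling_mean(values, window=4, min_periods=2)

print(short)
print(long)
print(mean_abs_gap(short, long))
```

A larger average gap means the short window strays further from the long-run level, which is the pattern we'd expect from a streaky hitter.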

    # Rolling means over the last 162 and 40 games (at least 30 games required)
    ji['162gm_avg'] = ji.AVG.rolling(window=162, min_periods=30).mean()
    ji['40gm_avg'] = ji.AVG.rolling(window=40, min_periods=30).mean()

    ax = ji.plot('game_date', ['40gm_avg', '162gm_avg'])
    ax.set_title('Jose Iglesias 40gm moving average vs 162 gm moving average')
    ax.set_ylabel('Batting Average')
    ax.set_xlabel('Game Date');

Throughout his career, Iglesias has shown some volatility in his batting average with respect to his long-term average. This may also provide some insight into the variance in his projections for the upcoming season.

Let's look at some low-variance players now.

    df = df[df.avg_proj_AVG >= 0.250]
    df.sort_values(by='proj_variance', inplace=True)

    # Plot Top 10 Players With Lowest Batting Average Projection Variance
    temp = df.copy()

    # Scale Variance column by 1000
    temp.proj_variance *= 1000

    # Rename proj_variance column to something more descriptive for our visual
    # Also rename Name to Player
    temp.rename({
        'proj_variance': 'Batting Averages Projection Variance',
        'Name': 'Player'
    }, axis=1, inplace=True)

    # Make a bar plot for variance (DataFrame.plot creates its own figure)
    ax = temp.head(10).plot.bar('Player', 'Batting Averages Projection Variance')
    plt.xticks(rotation=45)
    plt.title('MLB 2023 Low Variance Batting Average Projections (Projected Average >0.250)')
    ax.set_ylabel('Variance (scaled * 1000)')
    ax.set_ylim(0, 0.05)

    # Overlay Projected Batting Averages
    ax2 = ax.twinx()
    ax2.plot(temp.head(10).Player, temp.head(10).avg_proj_AVG, color="red", marker="o")
    ax2.set_ylabel("Average Batting Average Projection")
    ax2.set_ylim(0.250, 0.280)
    ax2.legend(['Projected AVG'], loc='upper right')
    ax.legend(loc='upper left')
    plt.show()

Alright, so now we have the 10 players with the lowest variance among their projections. At the top of the list is Amed Rosario, with a solid projected batting average of .275.

Similar to how we looked at Jose Iglesias, let's take a closer look at what Rosario has done in his career.

    # Read in Amed Rosario game data
    ar = pd.read_csv("https://raw.githubusercontent.com/fantasydatapros/fangraph-data-f6p/main/amed_rosario.csv")

    # Aggregate the game log to one row per season
    # (string aggregator names avoid the FutureWarning newer pandas raises for np.sum / np.mean)
    temp = ar.groupby('season', as_index=False).agg({
        'hits': 'sum',
        'atBats': 'sum',
        'AVG': 'mean',
        'gamesPlayed': 'sum'
    })

    # Plot batting average by season
    ax = temp.plot('season', 'AVG', marker='o')
    ax.set_ylabel('Batting Average')
    ax.set_xlabel('Season')
    ax.set_title('Amed Rosario Career Batting Averages')

    # Overlay games played on a second y-axis
    ax2 = ax.twinx()
    ax2.plot(temp.season, temp.gamesPlayed, color="red", marker="o")
    ax2.set_ylabel("Games Played")
    ax2.set_ylim(0, 162)
    ax2.legend(['Games Played'], loc='upper right')
    ax.legend(loc='upper left');

    # Sort games chronologically so the rolling windows follow game order
    ar.sort_values(by='game_date', inplace=True)

    # Rolling means over the last 162 and 40 games (at least 30 games required)
    ar['162gm_avg'] = ar.AVG.rolling(window=162, min_periods=30).mean()
    ar['40gm_avg'] = ar.AVG.rolling(window=40, min_periods=30).mean()

    ax = ar.plot('game_date', ['40gm_avg', '162gm_avg'])
    ax.set_title('Amed Rosario 40gm moving average vs 162 gm moving average')
    ax.set_ylabel('Batting Average')
    ax.set_xlabel('Game Date')
    plt.xticks(rotation=45);

Similar to Jose Iglesias, Amed Rosario has had some volatility in his batting average when you compare his 40-game moving average to his 162-game moving average.

So why does Rosario have so much less variance in his projections for the upcoming season?

His 162-game moving average has stayed near or above the .275 mark since the latter part of the 2019 season.

Since he has been able to sustain that average over such a long period, it stands to reason that his projections for the upcoming season would be similar across multiple sources. Thus, if we consider his last 3 seasons and what he's accomplished batting average-wise over time, we can have a fairly high degree of confidence that Rosario will crack that .275 mark again, or perhaps post something higher.

## Fantasy Data Pros - Learn to Code With Fantasy Baseball

In this post, we looked at projection data combined with variance and used Python to help us make high-quality decisions about which players we should target in our Fantasy Baseball drafts.

If you found this type of analysis to be useful and are interested in learning to do this sort of thing yourself, definitely check us out at https://www.fantasydatapros.com/baseball.

Remember to become an F6P All-Access Member to receive a promo code for $15 off access to Fantasy Data Pros tutorials today!

Check out our 2023 Fantasy Baseball Rankings and our Dynasty Baseball Rankings too!