Fantasy Data Pros – Learn to Code With Fantasy Baseball

by Keith Lott

My name is Ben Dominguez (@bendominguez011) and I'm the founder of Fantasy Data Pros. I am here to help you learn to code with Fantasy Baseball!

Fantasy Data Pros is a platform for people to turn their obsession with sports into programming and data science skills. We teach programming and data science fundamentals all through sports. Fantasy Data Pros offers a free tutorial series for each of the 5 major sports (we'll be coming out with more posts in the coming months) and a couple of online courses, including a Learn Python with Baseball course.

In the Learn Python with Baseball course, we offer 11 learning modules, 7 hours of video content, and an online community of 300+ other members learning to code.

Sign up to become an F6P All-Access Member to receive a promo code for $15 off access to Fantasy Data Pros tutorials today!


Analyzing Projected Batting Averages

In this post, we'll show you the power of Python and advanced data analytics by looking at projected batting averages from four different sources for the upcoming MLB season.

We're interested in seeing which players have the highest and lowest variance in their projected batting averages. This can be valuable information for season-long fantasy drafts, since variance gives us insight into how confident we can be in a player's projected production (in this case, batting average).

In general, lower variance among a player's projections suggests we can have more confidence in those projections coming to fruition, while players with higher variance among their projections could be classified as riskier.
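
To make the idea concrete, here is a minimal sketch with two made-up players (the names, source labels, and numbers are hypothetical, not taken from the projection files). Both have the same mean projection, but the sources disagree far more about one than the other. Note that pandas' `var` computes the sample variance (dividing by n - 1) by default.

```python
import pandas as pd

# Hypothetical projections from four sources (illustrative values only)
toy = pd.DataFrame(
    {
        "src_a": [0.280, 0.260],
        "src_b": [0.281, 0.300],
        "src_c": [0.279, 0.270],
        "src_d": [0.280, 0.290],
    },
    index=["Steady Hitter", "Volatile Hitter"],
)

# Row-wise mean and sample variance (pandas defaults to ddof=1)
toy_mean = toy.mean(axis=1)
toy_var = toy.var(axis=1)

print(pd.DataFrame({"mean": toy_mean, "variance": toy_var}))
```

Both rows average .280, yet the second player's variance is several hundred times larger, which is the kind of gap we'll be looking for in the real data.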

Let's get started by reading in our data and combining it into a pandas data frame for analysis.

import pandas as pd

import matplotlib.pyplot as plt

import warnings; warnings.simplefilter('ignore')

# Read in projection sources 

atc = pd.read_csv("https://raw.githubusercontent.com/fantasydatapros/fangraph-data-f6p/main/atcBats.csv")

batx = pd.read_csv("https://raw.githubusercontent.com/fantasydatapros/fangraph-data-f6p/main/batxBats.csv")

steamer = pd.read_csv("https://raw.githubusercontent.com/fantasydatapros/fangraph-data-f6p/main/steamerBats.csv")

zips = pd.read_csv("https://raw.githubusercontent.com/fantasydatapros/fangraph-data-f6p/main/zipsBats.csv")

# Create a list of columns we want to look at that are related to batting average

stats = ['AVG']

# List of player descriptive columns

player_info = ['Name','PlayerId','Team']

# Filter out columns we won't be using

atc = atc[player_info+stats]

batx = batx[player_info+stats]

steamer = steamer[player_info+stats]

zips = zips[player_info+stats]

# We'll need to combine the projection sources into one data frame, but

# before we do that we'll need to give each stat projection a unique column name

# Loop through each stat and give it a new name that corresponds to the projection source

for stat in stats:

  atc.rename({stat:f'atc_proj_{stat}'}, axis=1, inplace=True)

  batx.rename({stat:f'batx_proj_{stat}'}, axis=1, inplace=True)

  steamer.rename({stat:f'steamer_proj_{stat}'}, axis=1, inplace=True)

  zips.rename({stat:f'zips_proj_{stat}'}, axis=1, inplace=True)

# Combine stats into one data frame

df = atc.merge(

    batx,on=player_info,how='left'

    ).merge(

    steamer,on=player_info,how='left'

    ).merge(

    zips,on=player_info,how='left')

# Determine mean projected batting average for each player

df['avg_proj_AVG'] = df[['atc_proj_AVG','batx_proj_AVG', 'steamer_proj_AVG','zips_proj_AVG']].mean(axis=1)

# Round batting averages to 3 decimal places

df = df.round(3)

# Determine variance among player projections

df['proj_variance'] = df[['atc_proj_AVG','batx_proj_AVG', 'steamer_proj_AVG','zips_proj_AVG']].var(axis=1)

# Sort data frame by mean projected batting average

df.sort_values(by='avg_proj_AVG',ascending=False,inplace=True)

df.head(5)
Name                   Team  ATC   THE BAT X  Steamer  ZiPS  AVG   Variance
Luis Arraez            MIA   .302  .296       .297     .311  .301  .000047
Freddie Freeman        LAD   .301  .299       .292     .293  .296  .000020
Masataka Yoshida       BOS   .286  .282       .299     .305  .293  .000117
Trea Turner            PHI   .292  .286       .283     .301  .291  .000063
Vladimir Guerrero Jr.  TOR   .289  .296       .293     .284  .290  .000027

So now we have a data frame that contains projections for 2023 batting averages from four separate projection sources. Additionally, we have the variance associated with the projections for each player.
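
One thing to watch for: because the sources are combined with left merges starting from ATC, a player who is missing from another source ends up with NaN in that source's column, and `mean(axis=1)` and `var(axis=1)` skip NaN by default. A variance computed from two sources tells us less than one computed from four. The sketch below uses small made-up frames (the player names, IDs, teams, and column names are hypothetical) to show how one might count the available sources per player.

```python
import pandas as pd

keys = ["Name", "PlayerId", "Team"]

# Hypothetical stand-ins for two projection sources
src_a = pd.DataFrame({
    "Name": ["Player One", "Player Two"],
    "PlayerId": [1, 2],
    "Team": ["AAA", "BBB"],
    "a_proj_AVG": [0.290, 0.265],
})
src_b = pd.DataFrame({
    "Name": ["Player One"],
    "PlayerId": [1],
    "Team": ["AAA"],
    "b_proj_AVG": [0.284],
})

merged = src_a.merge(src_b, on=keys, how="left")
proj_cols = ["a_proj_AVG", "b_proj_AVG"]

# Number of sources that actually have a projection for each player
merged["n_sources"] = merged[proj_cols].notna().sum(axis=1)

# Keep only players covered by every source before comparing variances
complete = merged[merged["n_sources"] == len(proj_cols)]
print(merged)
```

Because the merge keys include Team, a player listed with a different team in two sources would also come through as missing, so a count like this is a cheap sanity check before trusting the variance column.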

Suppose we're getting ready for a Fantasy Baseball draft and want to prioritize batting average above every other statistic. We'd want to be confident that each player's projected batting average actually pans out. So when we look at the available players, we want to see a high projected batting average paired with a low variance.

Since we're interested in drafting good players, let's filter out players with an average projected batting average of less than .250. After that, we'll sort our data frame by projection variance so we can see which players have the highest variance with respect to their projected batting average.

df = df[df.avg_proj_AVG >= 0.250]

df.sort_values(by='proj_variance', ascending=False, inplace=True)

# Plot Top 10 Players With Highest Batting Average Projection Variance

temp = df.copy()

# Scale Variance column by 1000

temp.proj_variance *= 1000

# Rename proj_variance column to something more descriptive for our visual

# Also rename Name to Player

temp.rename({

      'proj_variance': 'Batting Averages Projection Variance',

      'Name': 'Player'

    }, axis=1, inplace=True)

# Make a bar plot for variance

# DataFrame.plot.bar creates its own figure, so no separate plt.figure() call is needed

ax = temp.head(10).plot.bar('Player','Batting Averages Projection Variance')

plt.xticks(rotation=45)

plt.title('MLB 2023 High Variance Batting Average Projections\n(Projected Average >0.250)')

ax.set_ylabel('Variance (scaled * 1000)')

ax.set_ylim(0, 0.35)

# Overlay Projected Batting Averages

ax2 = ax.twinx()

ax2.plot(temp.head(10).Player, temp.head(10).avg_proj_AVG, color="red", marker="o")

ax2.set_ylabel("Average Batting Average Projection")

ax2.set_ylim(0.250, 0.300)

ax2.legend(['Projected AVG'], loc='upper right')

ax.legend(loc='upper left');

In the plot above, we're looking at the 10 players with the highest variance among their batting average projections. Joey Ortiz has the highest variance of all. This comes as no surprise, since Ortiz is a prospect for the Baltimore Orioles, and prospects have little major-league track record for the projection systems to agree on.

If we move further down the list, we see names like Jose Miranda and Jose Iglesias, both of whom are among the three highest-variance players. With solid projected batting averages above .270, they would be decent players to have on our season-long Fantasy Baseball squad, but we would urge caution given the high variance in their projections.
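
Raw variances are hard to read because they are in squared units, which is why the plot scales them by 1000. One alternative, which this post does not use, is the standard deviation, since it is in the same units as batting average itself. As a rough sketch, here are three of the variances from the `df.head(5)` output earlier converted into "points" of batting average:

```python
import math

# Projection variances as printed in the head(5) output earlier in the post
variances = {
    "Luis Arraez": 0.000047,
    "Freddie Freeman": 0.000020,
    "Masataka Yoshida": 0.000117,
}

# Standard deviation, expressed in points of batting average (.001 = 1 point)
std_points = {name: math.sqrt(v) * 1000 for name, v in variances.items()}

for name, pts in std_points.items():
    print(f"{name}: about {pts:.1f} points")
```

Read this way, Yoshida's four projections are spread by roughly 11 points of batting average, compared with about 4 to 5 points for Freeman.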

Let's take a closer look at Jose Iglesias and his career statistics.

# Read in Jose Iglesias game data

import numpy as np

ji = pd.read_csv("https://raw.githubusercontent.com/fantasydatapros/fangraph-data-f6p/main/jose_iglesias.csv")

ji.sort_values(by='game_date',inplace=True)

temp = ji.groupby('season', as_index=False).agg({

            'hits': np.sum,

            'atBats': np.sum,

            'AVG': np.mean,

            'gamesPlayed': np.sum

})

ax = temp.plot('season', 'AVG', marker='o')

ax.set_ylabel('Batting Average')

ax.set_xlabel('Season')

ax.set_title('Jose Iglesias Career Batting Averages')

ax2 = ax.twinx()

ax2.plot(temp.season, temp.gamesPlayed,color="red", marker="o")

ax2.set_ylabel("Games Played")

ax2.set_ylim(0,162)

ax2.legend(['Games Played'], loc='upper right')

ax.legend(loc='upper left');

Jose Iglesias career batting averages

Jose Iglesias had some up-and-down years between 2013 and 2018, but since then his average has hovered steadily in the .275-.280 range.

So why the high variance in the projections?

It could be that some projection systems weight recent seasons more heavily, while others give the down years early in his career equal weight. That alone can produce a wider range of projections.
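
To illustrate how weighting alone can move a projection, here is a toy sketch. The season averages and both weighting schemes below are made up for illustration; they are not Iglesias's actual numbers, and they are not the method of any of the four projection systems.

```python
import numpy as np

# Hypothetical season batting averages, oldest to most recent
season_avgs = np.array([0.255, 0.260, 0.290, 0.285])

# Scheme 1: every season counts equally
equal_weights = np.ones(4)

# Scheme 2: recent seasons count more (these weights are arbitrary)
recency_weights = np.array([1, 2, 3, 4])

equal_proj = np.average(season_avgs, weights=equal_weights)
recency_proj = np.average(season_avgs, weights=recency_weights)

print(f"Equal weighting:   {equal_proj:.3f}")
print(f"Recency weighting: {recency_proj:.3f}")
```

The same history produces two projections about six points apart purely from the choice of weights, before any other modeling differences come into play.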

For fun, let's look at how his 40-game moving average compares to his 162-game moving average. We'd expect volatile players to have more hot/cold streaks, thus their 40-game moving average would deviate further away from their 162-game moving average.

ji['162gm_avg'] = ji.AVG.rolling(window=162, min_periods=30).mean()

ji['40gm_avg'] = ji.AVG.rolling(window=40, min_periods=30).mean()

ax=ji.plot('game_date', ['40gm_avg','162gm_avg'])

ax.set_title('Jose Iglesias\n40gm moving average vs 162 gm moving average')

ax.set_ylabel('Batting Average')

ax.set_xlabel('Game Date');

Throughout his career, Iglesias has shown some volatility in his batting average with respect to his long-term average. This may also provide some insight into the variance in his projections for the upcoming season.
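
The eyeball comparison can also be turned into a single number. One possible measure, offered here only as a sketch, is the mean absolute gap between the short-window and long-window rolling averages. The `streakiness` helper is hypothetical (it is not part of the post's code), and the two hitters below are synthetic data rather than real game logs.

```python
import numpy as np
import pandas as pd

def streakiness(game_avgs: pd.Series, short: int = 40, long: int = 162,
                min_periods: int = 30) -> float:
    """Mean absolute gap between short- and long-window rolling averages."""
    short_avg = game_avgs.rolling(window=short, min_periods=min_periods).mean()
    long_avg = game_avgs.rolling(window=long, min_periods=min_periods).mean()
    return float((short_avg - long_avg).abs().mean())

rng = np.random.default_rng(0)
n_games = 400

# Synthetic per-game batting averages: one steady hitter, and one with
# alternating 40-game hot and cold stretches
steady = pd.Series(rng.normal(0.275, 0.02, n_games))
streak_cycle = np.where((np.arange(n_games) // 40) % 2 == 0, 0.06, -0.06)
streaky = pd.Series(0.275 + streak_cycle + rng.normal(0, 0.02, n_games))

print(f"steady:  {streakiness(steady):.4f}")
print(f"streaky: {streakiness(streaky):.4f}")
```

A larger value means the 40-game average strays further from the 162-game average, which matches the hot/cold streak intuition described above.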

Let's look at some low-variance players now.

df = df[df.avg_proj_AVG>=0.250]

df.sort_values(by='proj_variance', inplace=True)

#Plot Top 10 Players With Lowest Batting Average Projection Variance

temp = df.copy()

# Scale Variance column by 1000

temp.proj_variance *= 1000

# Rename proj_variance column to something more descriptive for our visual

# Also rename Name to Player

temp.rename({'proj_variance':'Batting Averages Projection Variance',

             'Name':'Player'}, axis=1, inplace=True)

# Make a bar plot for variance

# DataFrame.plot.bar creates its own figure, so no separate plt.figure() call is needed

ax = temp.head(10).plot.bar('Player','Batting Averages Projection Variance')

plt.xticks(rotation=45)

plt.title('MLB 2023 Low Variance Batting Average Projections\n(Projected Average >0.250)')

ax.set_ylabel('Variance (scaled * 1000)')

ax.set_ylim(0,0.05)

# Overlay Projected Batting Averages

ax2 = ax.twinx()

ax2.plot(temp.head(10).Player, temp.head(10).avg_proj_AVG, color="red", marker="o")

ax2.set_ylabel("Average Batting Average Projection")

ax2.set_ylim(0.250,0.280)

ax2.legend(['Projected AVG'], loc='upper right')

ax.legend(loc='upper left')

plt.show()

Alright, so now we have the 10 players with the lowest variance among their projections. At the top of the list is Amed Rosario, with a solid projected batting average of .275.

Similar to how we looked at Jose Iglesias, let's take a closer look at what Rosario has done in his career.

# Read in Amed Rosario game data

import numpy as np

ar = pd.read_csv("https://raw.githubusercontent.com/fantasydatapros/fangraph-data-f6p/main/amed_rosario.csv")

temp = ar.groupby('season',as_index=False).agg({'hits': np.sum,

                          'atBats': np.sum,

                          'AVG': np.mean,

                          'gamesPlayed': np.sum})

ax = temp.plot('season','AVG',marker='o')

ax.set_ylabel('Batting Average')

ax.set_xlabel('Season')

ax.set_title('Amed Rosario Career Batting Averages')

ax2 = ax.twinx()

ax2.plot(temp.season, temp.gamesPlayed, color="red", marker="o")

ax2.set_ylabel("Games Played")

ax2.set_ylim(0, 162)

ax2.legend(['Games Played'], loc='upper right')

ax.legend(loc='upper left');

Amed Rosario career batting averages

# Sort games chronologically so the rolling windows follow game order

ar.sort_values(by='game_date', inplace=True)

ar['162gm_avg'] = ar.AVG.rolling(window=162,min_periods=30).mean()

ar['40gm_avg'] = ar.AVG.rolling(window=40, min_periods=30).mean()

ax = ar.plot('game_date',['40gm_avg','162gm_avg'])

ax.set_title('Amed Rosario\n40gm moving average vs 162 gm moving average')

ax.set_ylabel('Batting Average')

ax.set_xlabel('Game Date')

plt.xticks(rotation=45);

Amed Rosario batting average

Similar to Jose Iglesias, Amed Rosario has had some volatility in his batting average when you compare his 40-game moving average to his 162-game moving average.

So why does Rosario have so much less variance in his projections for the upcoming season?

If you look at his 162-game moving average, you can see that since the latter part of the 2019 season, his 162-game moving average has been near or above that .275 mark.

Since he has sustained that average over such a long period, it stands to reason that his projections for the upcoming season would be similar across multiple sources. If we consider what he has accomplished batting-average-wise over his last 3 seasons, we can be fairly confident that Rosario will reach that .275 mark again, or perhaps something higher.
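
A caveat on the rolling averages used in this post: assuming the AVG column holds each game's own batting average, taking its rolling mean treats a 1-for-1 game the same as a 1-for-5 game. If the game logs include hits and at-bats (the season-level aggregation earlier suggests columns named `hits` and `atBats`), an at-bat-weighted rolling average can be sketched as follows. The data is made up, and the 3-game window is only there to keep the example small.

```python
import pandas as pd

# Hypothetical game log (values invented for illustration)
games = pd.DataFrame({
    "hits":   [1, 0, 2, 1, 3],
    "atBats": [1, 4, 5, 4, 4],
})

window = 3

# Per-game average, then an unweighted rolling mean of those averages
games["AVG"] = games["hits"] / games["atBats"]
games["rolling_unweighted"] = games["AVG"].rolling(window).mean()

# Rolling hits divided by rolling at-bats: each at-bat counts equally
games["rolling_weighted"] = (
    games["hits"].rolling(window).sum() / games["atBats"].rolling(window).sum()
)

print(games.round(3))
```

Over the first three games, the unweighted version is pulled up by the 1-for-1 game, while the weighted version reflects 3 hits in 10 at-bats. Over 40 or 162 games the difference is usually smaller, but it is worth knowing which definition a chart uses.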


In this post, we looked at projection data combined with variance and used Python to help us make high-quality decisions about which players we should target in our Fantasy Baseball drafts.

If you found this type of analysis to be useful and are interested in learning to do this sort of thing yourself, definitely check us out at https://www.fantasydatapros.com/baseball.

Remember to become an F6P All-Access Member to receive a promo code for $15 off access to Fantasy Data Pros tutorials today!


Check out our 2023 Fantasy Baseball Rankings and our Dynasty Baseball Rankings too!


Copyright © 2024 Fantasy Six Pack.