
An Introduction to PyBaseball: Using Python to Analyze Baseball Data

Disclaimer: A basic understanding of Python is required to follow parts of this article. While it may be worth trying to learn from it, I would not recommend this piece if this is your first time seeing the language.

If you happen to be a new reader of this blog, it is worth noting that since its founding, our main goal has been to spread sabermetric knowledge in an easy-to-understand fashion. From that, we have hoped the reader would read and inquire among other sources until they became something of a sabermetrician themselves. To be considered any kind of stat expert (or sabermetrician), one needs to be able to analyze data. In its most basic form, that means going to Baseball Savant, Baseball Reference, or Fangraphs and being able to understand the variety of statistical acronyms and explain them to others. In its highest form, it means utilizing computer programming to both retrieve and analyze large sets of baseball data.

In today’s baseball landscape, it is almost essential for any entry-level baseball operations employee to understand both the game and a coding language or two, given the sheer size of the data - hence the need for this type of article. Python is one of the most extensively used programming languages in the world, automating tasks in a variety of ways that make much manual data entry work obsolete. Within Python, there are packages containing functions that give the language more applications, including some specifically for baseball. Introducing PyBaseball (I know, simple), the package that allows any user to pull and analyze a variety of data from MLB Statcast as well as Baseball Reference and Fangraphs. This free tool allows any amateur baseball fan with even basic coding knowledge to produce work that could answer their deep-lying questions.

Understanding how to use PyBaseball is important, but understanding how it works in the first place is a bigger piece of the puzzle. The package uses APIs (Application Programming Interfaces) and various scripts / web scrapers to look for specific types of baseball data on the aforementioned websites. Once the package is installed, all of these functions sync so that the analyst can easily call them and retrieve the necessary data without having to write their own complicated scrapers - it streamlines the entire process. This article will primarily focus on the Statcast side of PyBaseball.

For the sake of simplicity for the reader, all code will be completed within Google Colab.

The Basics

Like with any Python package, PyBaseball needs to be installed. This can be done using Pip, a package manager that allows certain programs to be installed using Python. The code should look like this:

!pip install pybaseball

The output will likely spit out a bunch of repeating words and loading bars, but the only words that matter are “Successfully installed.” As long as that shows up, you should be set with access to a large chunk of Major League Baseball’s data. Now, just to call it. Along with PyBaseball, several baseline packages need to be imported into the code.

import pandas as pd
import numpy as np
import pybaseball as pyb
import matplotlib.pyplot as plt

As for the other lines, pandas is a data analysis tool, NumPy helps manipulate arrays of data, and matplotlib.pyplot is the module that assists with making data visuals. With those tools in place, the data needs to be selected. To do this, use a function similar to this.

statcast_data = pyb.statcast(start_dt='2023-03-30', end_dt='2023-08-30')

Let’s break this down. The pyb prefix says to use the PyBaseball library specifically, and .statcast is the function that pulls the raw Statcast data. Within this function, the period of time you want the raw data from needs to be specified, which is why the start_dt and end_dt parameters are there. In this example, Statcast data from March 30th, 2023 to August 30th, 2023 is stored in an object called statcast_data. Depending on the size of the request, the data may take a long time to load. To make sure the data loaded correctly, call a function on the new dataset.
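That check is typically a call to pandas’ .head() method. A minimal sketch of it, using a tiny invented stand-in DataFrame, since the real Statcast pull requires a network connection:

```python
import pandas as pd

# Tiny stand-in for the Statcast pull (the real statcast_data has ~90 columns;
# these rows and values are invented for illustration)
statcast_data = pd.DataFrame({
    'player_name': ['Strider, Spencer', 'Cole, Gerrit', 'Gallen, Zac'],
    'pitch_type': ['FF', 'FF', 'KC'],
    'release_pos_z': [5.95, 6.41, 6.02],
    'plate_z': [2.90, 3.11, 1.45],
})

# .head(3) returns the first three rows - if they print cleanly, the load worked
print(statcast_data.head(3))
```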


That function returns the first three rows of the data, as shown in the output, making it clear the data loaded correctly. With the dataset now in place, let’s move on.

Filtering the Data

The next job is cleaning and filtering the data. To do that, you need to understand the objective. For the sake of this introduction, suppose a team wants a visual of the relationship between release height and plate height on purely Four-Seam Fastballs (FF) thrown by Spencer Strider. While not strictly necessary, the pitcher and pitch type will be filtered first from statcast_data.

filtered_statcast_data = statcast_data[(statcast_data['player_name'] == 'Strider, Spencer') & (statcast_data['pitch_type'] == 'FF')]


For this code, two columns within statcast_data are being specified - ‘player_name’ and ‘pitch_type’. The code requires that the ‘player_name’ column equal ‘Strider, Spencer’ (the desired pitcher) and that the ‘pitch_type’ column equal ‘FF’ (the desired pitch, the four-seam fastball). The result is stored in a new object called filtered_statcast_data. As always, it’s a great idea to double-check your work: a misplaced comma or an extra space can completely change how the program interprets the filtering. Hence, the .head() function will print out the desired number of rows of the data, for peace of mind. Now, since only two columns are needed, the rest can be disregarded.
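To see what the filter is doing under the hood: each comparison produces a True/False value per row, and the & keeps only the rows where both are True. A small sketch with invented toy rows:

```python
import pandas as pd

# Toy rows standing in for the full Statcast pull (values invented)
statcast_data = pd.DataFrame({
    'player_name': ['Strider, Spencer', 'Cole, Gerrit', 'Strider, Spencer'],
    'pitch_type': ['FF', 'FF', 'SL'],
})

# Each comparison yields a boolean Series; & combines them row by row
mask = (statcast_data['player_name'] == 'Strider, Spencer') & \
       (statcast_data['pitch_type'] == 'FF')
filtered_statcast_data = statcast_data[mask]

# Only the first toy row matches both conditions
print(filtered_statcast_data)
```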

filtered_statcast_data = filtered_statcast_data[['plate_z','release_pos_z']].dropna()

Feel free to double-check with the .head() function again, though the examples will no longer show it to avoid repetitiveness. This call keeps two columns within the data, ‘plate_z’ and ‘release_pos_z’ (plate location height and release position height, respectively), while dropping any rows with unavailable / blank values via the .dropna() function. The result is reassigned to the already established filtered_statcast_data, overriding it so it now holds only the two columns, with no missing values. The dataset now contains only release height and plate location height data on four-seam fastballs thrown by Spencer Strider between March 30th, 2023 and August 30th, 2023. Again, moving on…
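A quick sketch of what the column selection and .dropna() accomplish, again with invented toy values:

```python
import pandas as pd
import numpy as np

# Toy data: one row has a missing plate_z reading; pitch_type is an extra column
filtered_statcast_data = pd.DataFrame({
    'plate_z': [2.90, np.nan, 1.45],
    'release_pos_z': [5.95, 6.01, 5.98],
    'pitch_type': ['FF', 'FF', 'FF'],
})

# Keep only the two columns of interest, then drop rows with missing values
filtered_statcast_data = filtered_statcast_data[['plate_z', 'release_pos_z']].dropna()

# Two rows and two columns remain
print(filtered_statcast_data)
```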

Visualizing the Data

Arguably the most important part, visualizing and analyzing the data is everything when trying to get a point across. Python offers a great variety of plots, charts, and other graph types, but this example will focus on a scatter plot. After all, the relationship between release height and plate location height for his fastballs needs to be seen. To do this, the matplotlib.pyplot (plt) module will be utilized.


plt.figure()
plt.scatter(filtered_statcast_data['release_pos_z'], filtered_statcast_data['plate_z'], c='blue')
plt.title("Spencer Strider's Release Height vs Plate Location Height (FF)")
plt.xlabel("Release Height (feet)")
plt.ylabel("Plate Location Height (feet)")
plt.grid(True)
plt.show()


Lots of code, so breaking this down - plt.figure() establishes that something is being plotted. plt.scatter() establishes the plot type as a scatter, with the parameters being the x-axis data, the y-axis data, and the color. The x-axis is the ‘release_pos_z’ column, the y-axis the ‘plate_z’ column, and the color (set with c=) blue. plt.title() adds a title to the plot, while plt.xlabel() and plt.ylabel() similarly add axis labels. plt.grid() just makes the plot easier to read - the reference lines behind the points do that. Lastly, plt.show() makes sure the figure is physically visible on the screen. This outputs the following:

This is a great visual to physically see, but it doesn’t say much other than that he threw a lot of four-seamers with a release height between 5.9 and 6 feet and a plate height of about 2.5 to 3 feet. The goal of this was to find the relationship, so adding some statistics may help. A common measure of the strength of a linear relationship is r-squared (the coefficient of determination), which measures how much of the variance in a dependent variable (y) is explained by the independent variable (x) - it describes association, not causation. Some replacement code can add this to our graph.

from sklearn.metrics import r2_score

coefficients = np.polyfit(filtered_statcast_data['release_pos_z'], filtered_statcast_data['plate_z'], 1)
regression_line = np.poly1d(coefficients)

plt.figure()
plt.scatter(filtered_statcast_data['release_pos_z'], filtered_statcast_data['plate_z'], c='blue')
plt.plot(filtered_statcast_data['release_pos_z'], regression_line(filtered_statcast_data['release_pos_z']), 'r-')

r2 = r2_score(filtered_statcast_data['plate_z'], regression_line(filtered_statcast_data['release_pos_z']))
plt.text(min(filtered_statcast_data['release_pos_z']), max(filtered_statcast_data['plate_z']), f'R^2 = {r2:.4f}', fontsize=12)

plt.title("Spencer Strider's Release Height vs Plate Location Height (FF)")
plt.xlabel("Release Height (feet)")
plt.ylabel("Plate Location Height (feet)")
plt.grid(True)
plt.show()

The first line imports the r-squared scoring tool from sklearn.metrics, part of another Python package (scikit-learn). np.polyfit fits a degree-one (linear) polynomial to the ‘release_pos_z’ and ‘plate_z’ columns, returning the line’s coefficients. np.poly1d then builds the regression line from those coefficients, which is assigned to the variable regression_line. The plt.scatter function is used just as before, with the main addition being plt.plot, which draws the regression line over the ‘release_pos_z’ points. Using regression_line along with the imported module, the r-squared score can be calculated: the code calls r2_score with the actual y-values first, then the regression line’s predictions for the x-values. The score is then placed on the plot with plt.text, using the minimum x-value and maximum y-value as its location (this can be changed as one sees fit). The f'R^2 = {r2:.4f}' string displays the previously calculated score to four decimal places, as set by the .4f, and fontsize=12 ensures the right look. With the same labeling code as before, this is the new output:
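For intuition on what r2_score computes: r-squared equals one minus the ratio of residual variance to total variance, and for a simple linear fit it also equals the squared Pearson correlation. A small sketch with invented numbers (NumPy only, no scikit-learn needed):

```python
import numpy as np

# Invented sample: x = release heights (feet), y = plate heights (feet)
x = np.array([5.90, 5.95, 6.00, 6.05, 6.10])
y = np.array([2.40, 2.90, 2.60, 3.10, 2.70])

# Fit the same degree-1 polynomial the article uses
slope, intercept = np.polyfit(x, y, 1)
y_pred = slope * x + intercept

# r^2 = 1 - SS_res / SS_tot
ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - np.mean(y)) ** 2)
r2_manual = 1 - ss_res / ss_tot

# For simple linear regression, this matches the squared Pearson correlation
print(round(r2_manual, 4))
```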

This makes the analysis much easier. The first thing visible is the r-squared score, an abysmal 0.0211. Those familiar with basic statistics will recognize that this means only about 2% of the variation in plate location height can be explained by release height within this specific subsample, which is almost nothing. The line, while slanted, demonstrates no clear relationship. Ergo, one could reasonably conclude that for Spencer Strider’s four-seamers between 3/30 and 8/30 of 2023, his release height did not have a significant impact on his plate location height. Congrats! By following along, you have now conducted an analysis using the PyBaseball package!


While this was admittedly a very simple analysis, that was the point - programming for baseball (and in general) is a process, and it should start with a strong foundation before moving into advanced machine-learning models and other forecasting projects. To recap: this article established how to install PyBaseball on your machine and save the data you want into an object. Filtering was then covered, showing how to keep only the rows and columns needed for the project. Visualizing the data with a plot was the next important step, but the basic look left little room for analysis. By editing the plot and adding statistical estimates through Python, one can draw reasonable conclusions from the data.

One may still be able to evaluate sabermetric data through Excel, but with the right knowledge, pulling directly from these sources can speed up the entire process by a wide margin. PyBaseball does this by integrating directly with MLB Statcast and other sites, instead of forcing the junior analyst to manually sift and scrape data across the web for the most notable measures, as was done in the early days. With at least a decent idea of how to use PyBaseball, I encourage the reader to dive further and learn more of the package so that they can conduct much deeper research on their own. For more information on PyBaseball’s scripts or other examples of how to use the program, check out PyBaseball’s GitHub repo.

