Graph Databases in Pro Sports Analysis as a Competitive Differentiator

By Rebecca Rabb / Consultant

July 23, 2021

Graph databases such as Neo4j are driving the next generation of professional sports analytics and data science. Michael Lewis captured his audience’s attention in the early 2000s when he published Moneyball, the story of the Oakland Athletics using a new angle on statistical analysis to optimize the roster of one of the lowest-salaried teams in baseball. The story inspired many curious fans, and the use of analytics in sports has grown considerably over the past decade. Many professional sports teams now have their own analytics and data science departments and are looking for the next insight that will give their team an edge.

Growing trends such as “going for it on 4th down” in football, increased 3-point field goal attempts in basketball, and maximizing tee shot distance in golf have all originated from statistical analyses suggesting that those actions lead to a higher rate of success in the aggregate. Many already agree that analytics has changed sports, for better or for worse. My thesis is that graph databases provide an improved means of organizing, interacting with, and driving value out of this sports data.

Complex Sports Statistics Require a Graph Database

Let’s begin with the complexity of sports statistics. Using baseball as an example, the casual fan knows the basics from a box score or the graphics shown on screen during a broadcast. There are offensive numbers such as runs batted in, hits, strikeouts, at-bats, and walks, as well as defensive numbers such as fielding percentage, double plays, errors, assists, and more. One challenge often encountered in bringing this data together is the sheer quantity of it. Each pitch, at-bat, and play has its own associated data, and it can become overwhelming to organize, let alone clean, in a way that maximizes quality for downstream analysis.

The required aggregations can also become cumbersome when done in traditional data stores. Players can transition between major and minor leagues within a season, get moved to the injured reserve, fill different positions depending on team needs, and more. This makes it arduous for an end user to organize the data in a fashion that is scalable and transparent for other stakeholders.

The graph database schema below provides answers to initial questions on players’ roles in games and their respective organizations. It is abstracted and simplified in a way that enables sub-second queries and aggregations that are cumbersome in traditional data stores:

Figure: Using graph databases/Neo4j in sports analysis: basic schema example

This abstracted schema also shows that a player plays a particular position in a game for a particular team and that team is also part of a conference. This is fairly simple to start out, but many possible questions already begin to emerge:

  1. Is a player being utilized at the correct position?
  2. Are divisions’ schedules equally matched?
  3. Which players have overlapped in past programs and experience?
  4. How does a player’s success compare across major versus minor league?
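
As a rough illustration of question 3, the schema can be sketched as a tiny in-memory property graph in plain Python. This is not Neo4j code. The edge layout, the relationship names (PLAYED_IN, PART_OF), the choice to store the team as a property on the relationship, and all of the sample data are assumptions made up for this sketch.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sample data; every name below is invented for illustration.
# Each tuple is one edge: (start node, relationship type, end node, properties).
# Assumption: the team a player appeared for is stored on the relationship.
edges = [
    ("Player A", "PLAYED_IN", "Game 1", {"position": "3B", "team": "Team X"}),
    ("Player B", "PLAYED_IN", "Game 1", {"position": "SS", "team": "Team X"}),
    ("Player C", "PLAYED_IN", "Game 1", {"position": "P", "team": "Team Y"}),
    ("Player A", "PLAYED_IN", "Game 2", {"position": "SS", "team": "Team X"}),
    ("Player C", "PLAYED_IN", "Game 2", {"position": "P", "team": "Team X"}),
    ("Team X", "PART_OF", "Conference East", {}),
    ("Team Y", "PART_OF", "Conference West", {}),
]


def teammates(edges):
    """Return sorted pairs of players who appeared for the same team in the same game."""
    roster = defaultdict(set)  # (game, team) -> set of player names
    for start, rel, end, props in edges:
        if rel == "PLAYED_IN":
            roster[(end, props["team"])].add(start)

    pairs = set()
    for players in roster.values():
        # Every pair of players on the same (game, team) roster has overlapped.
        pairs.update(combinations(sorted(players), 2))
    return sorted(pairs)


overlaps = teammates(edges)
print(overlaps)
```

In a graph database the same question would be a pattern match over relationships rather than a hand-written loop, but the shape of the data is the same: nodes, typed relationships, and properties on both.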

An advantage of graph databases is that many properties can be added to each node, and even to the relationships, to add the context needed to answer these emerging questions. Some examples include the player’s date of birth, height, weight, and score, or even the weather conditions of the game and the game result. In this way you could, for example, query for “5’10” second basemen under 25 who have the best win percentage in the National League”, or likewise search for “teams whose winning record has been least affected by strong winds from the southwest”. The flexibility and power start to become evident as one asks more complex questions involving the relationships between every aspect of the data.
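
To make the first example query concrete, here is a minimal sketch in plain Python. It assumes each player node already carries the relevant properties, including win and game counts that a real graph would derive by traversing to game nodes. The PlayerNode class, its field names, and the sample players are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PlayerNode:
    """Hypothetical player node with a handful of properties."""
    name: str
    position: str
    height_in: int  # height in inches; 5'10" is 70
    age: int
    league: str     # "NL" or "AL"
    wins: int       # games won while this player appeared (assumed precomputed)
    games: int      # games in which this player appeared

    @property
    def win_pct(self) -> float:
        return self.wins / self.games if self.games else 0.0


# Invented sample data for illustration only.
players = [
    PlayerNode("Player A", "2B", 70, 24, "NL", 60, 100),
    PlayerNode("Player B", "2B", 70, 23, "NL", 45, 90),
    PlayerNode("Player C", "2B", 70, 27, "NL", 80, 100),  # not under 25
    PlayerNode("Player D", "SS", 70, 22, "NL", 70, 100),  # not a second baseman
    PlayerNode("Player E", "2B", 70, 24, "AL", 75, 100),  # not in the National League
]

# 5'10" second basemen under 25 in the National League, best win percentage first.
matches = sorted(
    (
        p for p in players
        if p.position == "2B" and p.height_in == 70 and p.age < 25 and p.league == "NL"
    ),
    key=lambda p: p.win_pct,
    reverse=True,
)
best = [(p.name, round(p.win_pct, 2)) for p in matches]
print(best)
```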

Looking at roster and team composition in a graph database is novel and valuable. It becomes even more valuable when depth is added to the schema, for example the player’s performance and the opponent. This example shows just a few added nodes, but it can be extended to any number of player or team attributes, relationships, or statistics one is looking to evaluate.

Figure: Using graph databases/Neo4j in sports analysis: more advanced schema example

Asking More Complex and Valuable Questions Using Graph Databases

Rather than having only the narrow scope of a game result for a player on a given team, we now have relational information on how the opponent, or any other aspect of the context, may have contributed to that result. A night with five home runs may be enough to pull out a win against Opponent A, yet it results in a loss against Opponent B. How can the coaching staff make appropriate adjustments when playing teams similar to Opponent B in order to get the win? If Team A has a proportionally high strikeout-to-walk ratio, what kind of adjustments do batters need to make when facing the Team A pitching staff? Player A’s team completed a higher percentage of double plays when he was at shortstop than when he was at his typical position of third base, so could Player A be a utility player in the infield?
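
The double-play question can be sketched as a simple aggregation over a player’s appearances, grouped by the position stored on each one. The data layout (position, double plays turned, double-play opportunities) and the numbers are assumptions invented for this sketch, and the code is plain Python rather than a graph query.

```python
from collections import defaultdict

# Hypothetical appearances for one player:
# (position played, double plays the team turned, double-play opportunities)
appearances = [
    ("SS", 2, 3),
    ("SS", 1, 2),
    ("3B", 1, 3),
    ("3B", 1, 2),
    ("3B", 0, 1),
]


def double_play_rate_by_position(appearances):
    """Return the share of double-play opportunities converted, per position."""
    totals = defaultdict(lambda: [0, 0])  # position -> [turned, opportunities]
    for position, turned, opportunities in appearances:
        totals[position][0] += turned
        totals[position][1] += opportunities
    # Skip positions with no opportunities to avoid dividing by zero.
    return {
        position: turned / opportunities
        for position, (turned, opportunities) in totals.items()
        if opportunities
    }


rates = double_play_rate_by_position(appearances)
print(rates)
```

With this made-up data the team converts a larger share of opportunities when the player is at shortstop than at third base, which is the kind of signal that would prompt the utility-player question above.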

Data Science for Sports Analysis Using Graph Databases

Using a graph database also enables us to calculate and compare cosine similarity between nodes. In this way one can compare opponents, games, players, and other nodes against one another to better understand strengths and weaknesses and how to make the appropriate adjustments. With 30 teams in the major leagues, there are bound to be programs that play “similarly” to each other (from a data science perspective) in aspects of the game such as types of relief pitching or stolen base percentage. Evaluating your opponent based on similarity allows you to anticipate game scenarios, which leads to better practice and strategy.
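
Cosine similarity itself is straightforward to sketch. The example below assumes each team node has been reduced to a small numeric feature vector; the three features, their scaling, and the team values are hypothetical, and the code is plain Python rather than any particular graph library’s API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here: a zero vector is similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical feature vectors, each scaled to the 0..1 range:
# [stolen base percentage, share of innings thrown by relievers, scaled strikeout-to-walk ratio]
team_profiles = {
    "Team A": [0.80, 0.30, 0.60],
    "Team B": [0.78, 0.32, 0.58],
    "Team C": [0.20, 0.90, 0.10],
}


def most_similar(team, profiles):
    """Return the name of the other team whose profile is closest to `team`."""
    others = (name for name in profiles if name != team)
    return max(others, key=lambda name: cosine_similarity(profiles[team], profiles[name]))


print(most_similar("Team A", team_profiles))
```

Because cosine similarity compares direction rather than magnitude, the features should be put on comparable scales before they are compared; otherwise one large-valued statistic would dominate the result.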

This theme of similarity also comes into play in player scouting. Consider a club looking to find a couple of backups for a star player on its roster. Rather than scouting on superficial numbers such as size, speed, and slugging percentage, scouts can tailor the approach to select attributes that complement their star player.

A New Era in Professional Sports Analysis

With just these few baseball examples, it becomes evident why graph databases are being used in professional sports analysis as a transformative way to organize sports data and to ask powerful questions of that data. Whether the task is evaluating if a player is right for a given position, consolidating stats from a single game, or formulating strategy for an upcoming game based on opponent analysis, graph databases provide a means of moving toward professional sports analytics based not only on “the numbers” but also on the many nuanced, critical relationships between the entities involved.


Graphable delivers insightful graph database (e.g. Neo4j consulting) / machine learning (ML) / natural language processing (NLP) projects as well as graph and Domo consulting for BI/analytics, with measurable impact. We are known for operating ethically, communicating well, and delivering on time. With hundreds of successful projects across most industries, we thrive in the most challenging data integration and data science contexts, driving analytics success.

Want to find out more about our Hume consulting on the Hume knowledge graph / insights platform? As the Americas principal reseller, we are happy to connect and tell you more. Book a demo by contacting us here.

Check out our article, What is a Graph Database? for more info.

