A data analysis project by Anna Briamonte and Emily Colleran
Created in May of 2022 for EG 10118: Engineering Computing at the University of Notre Dame
We organized, calculated, and analyzed statistics for over 350 Division I men’s college basketball teams, as well as National Basketball Association draft data, for the 2013 to 2018 seasons. Our goal was to determine whether statistically “good” college basketball teams are also successful in the draft in a given year. For example, is it possible that “blue blood” teams (Kentucky, UCLA, UNC, Duke, Indiana, etc.) succeed in the draft despite being statistically outperformed by other teams during the season? Looking further, are certain statistics better indicators of a team's draft success than others? And which teams have performed best in specific statistics over the past several years?
Analyses like these of NCAA Division I men’s basketball statistics and NBA draft data are most likely of great help to high school recruits when deciding which school to play for, NBA teams and coaches looking to assess the benefits of drafting players from certain teams after a given season, analysts of the sport at both the collegiate and professional levels, and those of us who simply are interested in basketball.
For the purposes of this project, we calculated each team's adjusted efficiency margin (AdjEM) and used it to compare that team's performance with other teams. AdjEM is the difference between adjusted offensive efficiency (AdjO) and adjusted defensive efficiency (AdjD). AdjO is an estimate of the number of points a team scores per 100 possessions against the average DI defense, and AdjD is an estimate of the number of points a team allows per 100 possessions to the average DI offense.
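The AdjEM calculation described above can be sketched in a few lines of Python. This is a minimal illustration rather than the project's original code: the function name and the example values are hypothetical and are not taken from our dataset.

```python
def adjusted_efficiency_margin(adj_o: float, adj_d: float) -> float:
    """Return AdjEM: points scored minus points allowed, per 100 possessions."""
    return adj_o - adj_d


# Hypothetical values for illustration only, not taken from our dataset.
example_adj_o = 118.5  # points scored per 100 possessions vs. the average DI defense
example_adj_d = 94.0   # points allowed per 100 possessions vs. the average DI offense

print(adjusted_efficiency_margin(example_adj_o, example_adj_d))
```

Because a lower AdjD means a better defense, a team raises its AdjEM either by scoring more or by allowing less.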
To measure teams' performance in the draft, we used data from our NBA CSV to calculate what we call "magnitude of success". For each team in a given year, we totaled the number of picks and found the average pick number, then divided the total number of picks by the average pick number. The greater this fraction, the more successful we determined a team to be in a given draft.
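As a worked example of this ratio, here is a minimal sketch for one school in one draft. The function name, the pick numbers, and the choice to give a school with no picks a magnitude of 0 are assumptions made for illustration.

```python
def magnitude_of_success(pick_numbers):
    """Return (total_picks, average_pick, magnitude) for one school in one draft.

    pick_numbers: overall pick numbers of the school's drafted players.
    """
    total_picks = len(pick_numbers)
    if total_picks == 0:
        # Assumption: a school with no picks gets a magnitude of 0.
        return 0, None, 0.0
    average_pick = sum(pick_numbers) / total_picks
    return total_picks, average_pick, total_picks / average_pick


# Hypothetical picks, for illustration only.
print(magnitude_of_success([1, 7]))  # two early picks -> (2, 4.0, 0.5)
print(magnitude_of_success([30]))    # one late pick gives a much smaller magnitude
```

The ratio rewards both having many picks (a larger numerator) and having early picks (a smaller denominator).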
"The origin of the term 'blue bloods' in college basketball" -- ncaa.comOur data is pulled from CSV files -- one with DI basketball statistics for teams and one with NBA draft picks. These datasets overlapped for the 2013 - 2018 seasons.
To clean our DI data, we generated non-repeating lists of teams, years, and statistics, as well as a "master list" of every data point from our CSV. We made an empty dictionary to hold statistics for our chosen seasons and our list of teams, and from there built a dictionary of dataframes, one for each statistic in the CSV (as well as for the teams' calculated adjusted efficiency margins). From this we also developed code to allow the user to query a statistical category, team name, and year to output the appropriate statistic, as well as code for an interactive table that sorts a given statistical dataframe. Additionally, we wrote a function to return the highest average value of a given statistic over the six seasons, as well as the team achieving that value.
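The organization described above can be sketched roughly as follows. This is not the project's original code. Plain nested dictionaries stand in for the dictionary of dataframes so that the example needs only the standard library, and the column names (TEAM, YEAR, ADJOE, ADJDE), the sample rows, and the function names are all assumptions.

```python
import csv
import io
from collections import defaultdict

# Hypothetical rows; the column names TEAM, YEAR, ADJOE, ADJDE are assumptions.
SAMPLE_CSV = """TEAM,YEAR,ADJOE,ADJDE
Team A,2013,110.0,95.0
Team A,2014,112.0,96.0
Team B,2013,105.0,100.0
Team B,2014,120.0,98.0
"""


def build_stats(csv_text):
    """Return stats[statistic][team][year] -> value, adding a derived ADJEM."""
    stats = defaultdict(lambda: defaultdict(dict))
    for row in csv.DictReader(io.StringIO(csv_text)):
        team, year = row["TEAM"], int(row["YEAR"])
        adj_o, adj_d = float(row["ADJOE"]), float(row["ADJDE"])
        stats["ADJOE"][team][year] = adj_o
        stats["ADJDE"][team][year] = adj_d
        stats["ADJEM"][team][year] = adj_o - adj_d
    return stats


def query(stats, statistic, team, year):
    """Look up one value by statistical category, team name, and year."""
    return stats[statistic][team][year]


def highest_average(stats, statistic):
    """Return (team, average) for the team with the highest mean over all years."""
    averages = {
        team: sum(by_year.values()) / len(by_year)
        for team, by_year in stats[statistic].items()
    }
    best_team = max(averages, key=averages.get)
    return best_team, averages[best_team]


stats = build_stats(SAMPLE_CSV)
print(query(stats, "ADJEM", "Team A", 2013))
print(highest_average(stats, "ADJEM"))
```

Keying by statistic first keeps each statistic's team-by-year table together, which is the same shape a per-statistic dataframe would have.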
To clean our NBA data, we again made non-repeating lists, this time of years and schools, as well as a "master list" of all of the data points. We then calculated each school's average draft pick number and total number of draft picks in each year, along with our "magnitude of success", and generated a dataframe. From this we developed code to allow the user to query a team name and year to output average pick number, total number of picks, and magnitude of success.
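The grouping and query steps for the draft data might look something like the sketch below. It is an illustration under stated assumptions, not our original code: the rows are hypothetical, a dictionary keyed by (school, year) stands in for the dataframe, and the function names are made up for this example.

```python
from collections import defaultdict

# Hypothetical draft rows as (year, school, overall pick number).
SAMPLE_PICKS = [
    (2013, "School A", 2),
    (2013, "School A", 10),
    (2013, "School B", 5),
    (2014, "School A", 40),
]


def summarize_draft(picks):
    """Return summary[(school, year)] -> total picks, average pick, magnitude."""
    grouped = defaultdict(list)
    for year, school, pick in picks:
        grouped[(school, year)].append(pick)

    summary = {}
    for key, numbers in grouped.items():
        total = len(numbers)
        average = sum(numbers) / total
        summary[key] = {
            "total_picks": total,
            "average_pick": average,
            "magnitude": total / average,
        }
    return summary


def query_draft(summary, school, year):
    """Return the summary for a school and year, or None if it had no picks."""
    return summary.get((school, year))


summary = summarize_draft(SAMPLE_PICKS)
print(query_draft(summary, "School A", 2013))
print(query_draft(summary, "School B", 2014))  # no picks that year -> None
```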
This is a bar chart of our calculated adjusted efficiency margins for 2013. We generated a total of 18 graphs like this in our code, some for AdjEM, some for 3PO, and some for EFGO. Hover to see data points.
This is a scatterplot of average draft pick number versus total number of draft picks for teams. Toggle the slider to see change over time. Logically, dots of teams with higher magnitudes of success would be located in the top left corner of the graph. Hover to see data points.
This is a bar chart of magnitudes of success for teams. Toggle the slider to see change over time. We sorted the data for this graph, so the largest magnitudes are on the left. Hover to see data points.
Our 37 percent figure shows us that there is a discrepancy between teams that perform well statistically and teams that succeed in the NBA draft -- perhaps team name plays a role in professional recruitment. If we were to continue working on this project, we would study the numbers and statistics of the famed "blue blood" teams specifically, to see whether their players are drafted more often than those of other teams. Looking at the highest three-point and field-goal shooting percentages is also interesting. Saint Mary's is a fairly consistent team, but historically we haven't heard much about them advancing far into the tournament. The same goes for Belmont: this team makes the tournament often, but we never see them going very far. These teams have been great at shooting but have not produced titles within the last few years, so defense, tempo, and perhaps even luck or game location can play a large role in whether or not a team succeeds in a given season. We also see that the University of Washington has the highest magnitude of success, and this is another team that we generally do not hear about in the tournament. Perhaps their magnitude of success serves as a reminder that teams that have yet to succeed on a major scale are still capable of recruiting talented athletes who are worth watching, whether or not their team is winning championships.
This project taught us a lot about the more creative and demanding aspects of data analysis. We discovered how integral careful data organization can be to the success of a project. We often found ourselves drawing diagrams and rethinking different key or table orientations. This proved particularly difficult as we tried to reconcile our own organization of the CSV data with the limitations of graphing functions in Python. We assumed much of our time would be spent on syntax or on the capabilities of particular coding languages, but this was not the case. Instead, we spent time talking through different data options, weighing the pros and cons of dictionaries, lists, and dataframes. We had to move our NBA dataframes from horizontal to vertical orientation, and at different points on our data journey we elected to use either sets of dictionaries or one dictionary holding many values. It was exciting to see how important collaboration and reformatting our thinking were to creating our project. We did not expect to spend so much time drawing and planning before coding our project. It was also gratifying to start with just bare CSV cells and work our way through functions and graphs that highlighted different relationships within the data. This was an important introduction to the world of CSE.