Regression in NBA 2K
How to make the G.O.A.T. of all custom hoopers in NBA2K using regression
I’ve spent more hours than I care to admit mixing and matching different player builds for the popular myCareer mode in this year’s installment of NBA2K. What I love about this game is how differently each new player plays based on the skills I choose for them to excel at.
For those that have never played 2K, let me explain how it goes. The game mode myCareer allows gamers to create a customized basketball player and use that player to play through a career in the NBA. Customizing a player includes picking the player’s looks, height, weight, and most importantly, proficiency in certain skills such as 3-point shooting, shot-blocking, making contested layups, and so on.
All these skills add up to give your player an overall rating between 65 and 100. All-Stars such as LeBron or Kevin Durant have overall ratings in the high 90s, while rookies entering the league usually have ratings in the low 80s. Not all of these customizable skills have the same impact on a player’s overall rating, however. This makes sense, as not all basketball skills are equally important. For example, 3-point shooting will always seem more important to overall success than something such as a player’s ability to cause deflections.
Each of these skills starts at a default value between 0 and 100, which reflects how talented the player is at that skill. When creating a player, the gamer is given a limited number of extra skill points to raise specific skills above their default values. For example, if I want to play through an NBA career with a player similar to Steph Curry, I would use most of my skill points to increase my player’s shooting skills from their default values. Since all of these skills have different relationships with the overall rating of a player, and these relationships are not explicitly stated in the game, many players may ask:
“What will be my overall rating if I spend my ‘Skill’ points on these skills?”
This type of question is perfect for understanding regression. In this scenario, we have 20+ independent numeric variables (our basketball skill values) and one dependent variable (our overall player rating). What we don’t know is how each independent variable relates to the dependent variable, that is, how much each skill moves the overall rating. I suggest that if you don’t know what a relationship between variables means, it would certainly benefit you to read this article before continuing. Once we understand these relationships, we will be able to place our limited number of skill points into the skills that will increase our overall player rating the most. But how the heck are we supposed to learn these relationships when the game does not explicitly tell us?
First, we will focus on how to discover the relationship between one independent variable (one skill) and the dependent variable (overall score) with linear regression. For now, let’s look at the relationship between our custom player’s 3-point shooting skills and their overall player score.
As seen in the table above, there is a positive correlation between 3-point shooting and overall player rating. This tells us that increasing the player’s 3-point shooting skill will tend to increase the player’s overall rating, but it doesn’t yet quantify the relationship. Some of my more astute readers might think of using the correlation coefficients covered in my last article to quantify it. That could work for finding which skills are most important to the overall rating, but it wouldn’t give us a formula for predicting the value of the overall rating itself. For that, we need a good old linear regression line, Y = mX + b, or in English, a formula that will predict our overall player rating given any 3-point shooting skill value.
Overall Rating = (Slope * 3-point shooting skill) + Y-intercept
First, let’s dive into understanding what slope is and what it does in our linear regression line. The slope is the “m” in Y = mX + b. It describes how much the dependent variable (Y) is expected to change for each one-unit increase in the input variable (X). The relationship is positive if “m” is greater than 0 and negative if “m” is less than 0; if “m” is exactly 0, the line shows no linear relationship between the two. If you read my previous article on correlation, then slope may seem to represent the same concept as correlation coefficients but on a different scale. Correlation coefficients show the direction and strength of a relationship between variables, but they should not be used for mapping sample input to sample output, as they are not optimized for prediction the way the slope of a linear regression line is.
Optimized like Richard Sherman changing his Twitter name to “Optimus Prime” before matching up against Megatron, one of the greatest NFL receivers of all time? No, optimized in the context of regression simply means that the value for the slope (Change in Y given X) in a regression line should be the best value for mapping input to output for an entire dataset. Let’s take a look at the graph below to better understand this concept.
Remember that a linear regression line is really just a formula. It’s reasonable to draw it as a line because any real number given as input will always produce a corresponding output. According to the regression line in this graph, if our player has 60 skill points in 3-point shooting, their overall player rating should be 72. This prediction was spot on, but when our player had only 40 skill points in 3-point shooting, we grossly overpredicted what their overall rating would be. We can measure this error by taking the predicted value minus the actual value. In our great prediction, the error is 0 because both the predicted and actual ratings were 72, but in our poor prediction, the error is (68 - 65), or 3 points. Since the slope is optimized to reduce error across the whole dataset, we need to average the size of the errors of all of our predicted points compared to our actuals to see whether the current slope of our regression line is the best possible slope.
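Here is that error calculation as a tiny Python sketch, using only the two predictions discussed above. Averaging the absolute value of each error is my assumption here, so that an overprediction and an underprediction can't cancel each other out; many regression tools average the squared errors instead.

```python
# (predicted, actual) overall ratings for the two predictions discussed above.
predictions = [(72, 72), (68, 65)]

# Error = predicted value minus actual value.
errors = [predicted - actual for predicted, actual in predictions]
print(errors)  # [0, 3]

# Average the absolute errors so over- and under-predictions cannot cancel out.
mean_absolute_error = sum(abs(e) for e in errors) / len(errors)
print(mean_absolute_error)  # 1.5
```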
Our example above reflects the line of best fit, that is, the regression line with the least average error, which here has a slope of 0.15. This means that for every additional skill point used on 3-point shooting, our player’s overall rating is expected to increase by 0.15 points. The actual process of reducing the error of a regression line until we find the best fit can be another one of those very tedious tasks to calculate by hand in statistics. I always suggest working smarter rather than harder, so that’s why I made this Python code that can be used to find the line of best fit.
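If you just want the idea in a few lines, here is a minimal sketch (not the script linked above). It uses the standard closed-form least-squares formulas, which minimize the average squared error. The function name fit_line and the sample data are hypothetical and only there for illustration.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = how X and Y vary together, divided by how much X varies on its own.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The best-fit line always passes through the point (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data, not my actual players.
three_point = [40, 50, 60, 70, 80, 90]
overall = [65, 70, 72, 73, 76, 75]

slope, intercept = fit_line(three_point, overall)
print(f"Overall Rating = ({slope:.3f} * 3-point skill) + {intercept:.1f}")
```

Swap in your own players' 3-point skill values and overall ratings to get the line for your data.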
A quick overview of the actual logic of optimizing the slope of a regression model is as follows. The slope will start at a random value, the y-intercept will be calculated, the regression line’s predicted values will be generated, then the average error between predicted and actuals will be calculated, the slope will be adjusted to reduce error, and this process is repeated until the average error of the line can no longer be reduced. Phew, even typing that felt tiresome! For a good visual of this process, look here.
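The loop described above can be sketched in plain Python. This is a simplified illustration under my own assumptions rather than how any particular library does it: the error is the average squared error, the slope is nudged by a fixed step in whichever direction helps, and the step is halved whenever neither direction reduces the error. The function names and the step, tolerance, and seed parameters are hypothetical.

```python
import random

def mean_squared_error(slope, xs, ys):
    """Average squared gap between predicted and actual ratings for a given slope."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    intercept = mean_y - slope * mean_x          # Y-intercept implied by this slope
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def optimize_slope(xs, ys, step=0.1, tolerance=1e-9, seed=0):
    """Nudge the slope up or down until the average error stops improving."""
    slope = random.Random(seed).uniform(-1, 1)   # random starting slope
    best = mean_squared_error(slope, xs, ys)
    while step > tolerance:
        for candidate in (slope + step, slope - step):
            error = mean_squared_error(candidate, xs, ys)
            if error < best:                     # this direction reduced the error
                slope, best = candidate, error
                break
        else:
            step /= 2                            # neither direction helped: take smaller steps
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return slope, mean_y - slope * mean_x

# Hypothetical ratings that sit exactly on a line with slope 0.15 and Y-intercept 62.
skills = [40, 50, 60, 70, 80]
ratings = [0.15 * s + 62 for s in skills]
print(optimize_slope(skills, ratings))
```

On this made-up data the loop settles on a slope very close to 0.15 and a Y-intercept very close to 62, no matter where the random starting slope lands.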
Lastly, to completely understand linear regression, we need to understand what the Y-intercept is. The Y-intercept is the “b” in Y = mX + b. It is simply the Y-value when the input to the regression line is zero, in other words, our player’s overall rating when their 3-point shooting skill is 0. Even with a 3-point stroke somehow worse than Shaq’s, the lowest overall rating a player can have in 2K is 65. This is not our Y-intercept, though, because the Y-intercept is the point where the regression line crosses (intercepts) the Y-axis. That means that even though the lowest rating a player can have is 65, our regression line still considers 63 as the Y-intercept. You can see this in the table below, as well as in the graph above.
Now that we’ve found the slope and y-intercept, we can generate a decent prediction for the overall player rating of our custom NBA 2K players given their 3-point shooting talent. These predictions will be created with our regression line formula.
Player’s Overall Rating = (Player’s 3-point shooting skill value * .15) + 62
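As a quick sketch, the formula above turns into a one-line Python function. The name predict_overall is hypothetical; the slope and Y-intercept defaults are the ones from the formula.

```python
def predict_overall(three_point_skill, slope=0.15, intercept=62):
    """Predicted overall rating from the regression line formula above."""
    return slope * three_point_skill + intercept

# Try a couple of 3-point skill values.
for skill in (40, 80):
    print(skill, round(predict_overall(skill), 1))
```

Keep in mind that this is only the line's estimate; an actual player's rating can land above or below it.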
This only tells us our player’s rating based on their 3-point shooting talent. As mentioned above, there are 20+ skills for players to customize their player with. This is where we must expand on simple linear regression and use multiple linear regression.
Multiple Regression: Y = m1X1 + m2X2 + … + mnXn + b
Multiple regression uses multiple variables as input to predict an output based on previous examples. The linear regression line we created predicts a player’s overall rating based only on their 3-point shooting skill, whereas our new model will need to predict a player’s overall rating based on a player’s 3-point shooting, dunking ability, ball handling, and a lot of other skills. Here lies the difference between simple linear regression and multiple regression: simple linear regression defines the relationship between one input and one output, while multiple regression defines the relationship between multiple inputs and one output.
Both models are calculated in almost the same way. They both have a Y-intercept, commonly written as “b” in mathematical notation, which is the output of our model when every input given to the model is 0. That means Y-intercept = Y when Y = m1*0 + m2*0 + … + mn*0 + b. In our example, this would be our player’s overall rating when all of their possible basketball skills have a value of zero. They also both need to be optimized to find the values for “m”, or the relationship a given input has to the output. Multiple regression is optimized slightly differently than linear regression, but we would need another entire article to explain these two processes and how they differ. One thing to note is that in multiple regression each input has its own slope, and these slopes can differ from one another.
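To show what "one slope per input plus a Y-intercept" looks like in practice, here is a minimal sketch using NumPy's least-squares solver. The six players, the three skills, and every number in it are made up for illustration and are not my actual 2K builds.

```python
import numpy as np

# Hypothetical skill values for six made-up players
# (columns: 3-point shooting, dunking, ball handling).
skills = np.array([
    [60, 40, 55],
    [75, 30, 70],
    [45, 85, 50],
    [80, 60, 65],
    [50, 70, 45],
    [65, 55, 80],
], dtype=float)
overall = np.array([70, 72, 71, 78, 69, 76], dtype=float)

# Append a column of ones so the solver also learns the Y-intercept b.
design = np.column_stack([skills, np.ones(len(skills))])

# Least-squares solution of design @ [m1, m2, m3, b] ~ overall.
coefficients, *_ = np.linalg.lstsq(design, overall, rcond=None)
*slopes, intercept = coefficients

for name, m in zip(["3-point", "dunk", "ball handling"], slopes):
    print(f"{name:>13}: {m:+.3f} rating points per skill point")
print(f"  Y-intercept: {intercept:.1f}")
```

Each skill gets its own slope, and the Y-intercept is the predicted rating when every skill is 0. One caveat: the slopes are only pinned down uniquely when you have at least as many players as unknowns (one per skill, plus the Y-intercept).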
For now, I’ve provided some code that will complete a multiple regression model for you. The big thing to remember is that the difference between linear and multiple regression is the number of inputs that are used to calculate an output. Now, for the moment you’ve all been waiting for… Here’s a table that shows the relationship between each possible basketball skill and its effect on a player’s overall rating.
These results were calculated by using ten players I have created in NBA2K as the samples for our multiple regression model. At 38.1, the Y-intercept is significantly lower than that of our simple linear regression line. The above results show that a player’s ability to block shots has the biggest impact on their overall rating, while perimeter defense has the least. These results should be taken with a grain of salt because they were created from only ten samples, all of them players I created myself, which can introduce bias. With more samples, we could create a more accurate model, but sadly that data is not readily available. For now, I hope you’ve learned something and that this article can help you create the player of your dreams.