Overfitting, or When Your Model Tries Too Hard to Please

Overfitting is one of those words you hear constantly in machine learning circles. Try reading the LightGBM parameters documentation, and you’ll see it (and its companion, regularization) pop up over 19 times, just on that single page.

I was taught, in a somewhat hand-wavy fashion, that overfitting happens when a model learns the training data too well. But that phrasing immediately rubbed me the wrong way. Isn’t learning well the whole point? Don’t we want our models to learn as much as they can?

Truth is, the real problem isn’t learning but generalization. A model that overfits memorizes the training data instead of learning the underlying pattern. That translates to impressive scores on data it’s already seen, but disastrous predictions on anything it hasn’t. Google’s definition sums it up well:

Overfitting means creating a model that matches (memorizes) the training set so closely that the model fails to make correct predictions on new data.

This is precise and true, but it still felt abstract to a visual learner like me. If you feel the same way, you’re in luck. This article is dedicated to helping you see what overfitting looks like and how to trigger it, and we'll touch on how to fight back. We’ll build some toy datasets, train models of varying complexity, and even throw in an interactive plot you can play with yourself.

Can't wait? I hope so!

1. A Gentle Start: Fitting Curves with Polynomials

To build an intuition for overfitting, we’ll start with a tiny dataset of 25 points that follows the curve of $f(x) = x^3$, with a bit of added noise. We'll hold out 8 of those points for testing, and use the rest to train our models. Nothing fancy yet, just a single feature and a single target, easy to plot and understand!

Here’s the data generation code:

import numpy as np
from sklearn.model_selection import train_test_split

SEED = 42  # assumed value; any fixed integer makes the run reproducible
rng = np.random.default_rng(SEED)

X = np.linspace(-2, 2, 25)
y = (X ** 3) + rng.normal(0, 2, size=X.shape)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=SEED
)

Simple synthetic dataset

Notice the gray curve? It's our underlying function $f(x) = x^3$ upon which our noisy dataset is based. The models never get to see it. It represents the true pattern we're hoping they’ll uncover. The purple dots are the data the model actually sees.

Now that the data is ready and we’re aligned on the goal, let’s throw in some models. Specifically, we'll train 16 polynomial regressions, with degrees ranging from 1 (a straight line) all the way up to 16 (basically a spaghetti curve). For each one, we measure the mean absolute error (MAE) on both the training and test sets. That way we can keep track of how well the model fits the training data, and how well it generalizes to unseen data.

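from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import PolynomialFeatures

degrees = list(range(1, 17))
results = []

# Dense grid used only to draw each fitted curve
# (assumed here; the original snippet does not show where it is defined)
X_plot = np.linspace(-2, 2, 200).reshape(-1, 1)

for degree in degrees:
    # scikit-learn expects 2-D inputs, hence the reshape to a single column
    poly = PolynomialFeatures(degree)
    X_train_poly = poly.fit_transform(X_train.reshape(-1, 1))
    X_test_poly = poly.transform(X_test.reshape(-1, 1))

    model = LinearRegression()
    model.fit(X_train_poly, y_train)

    y_train_pred = model.predict(X_train_poly)
    y_test_pred = model.predict(X_test_poly)
    y_plot_pred = model.predict(poly.transform(X_plot))

    results.append({
        "degree": degree,
        "train_mae": mean_absolute_error(y_train, y_train_pred),
        "test_mae": mean_absolute_error(y_test, y_test_pred),
        "y_plot_pred": y_plot_pred
    })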

Let's compare a few of those models in a plot.

Comparison of models of various orders

On the left, the first-order model underfits. Limited to simple straight lines, it settles for the best one it can, and misses the curvature. In the middle, the third-order polynomial hits a sweet spot, as expected: low error on the test set, and a shape that follows our original function closely. On the right, the high-order model loses the plot completely (pun intended!). It almost perfectly "memorizes" the training data, bending itself into knots to hit every point known to it.

Still not convinced? Let’s make things move. On the left, you can see how the curve adapts to the data as we increase the degree of the polynomial. On the right, you get the MAE for both the training set and the test set, as a function of model complexity. As the latter increases, the model continues to improve its accuracy on the training set, eventually fitting it perfectly. But at the same time, the error on the test set rises. This is overfitting in action: the model is no longer learning the underlying trend, it’s chasing the noise.
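To make those error curves concrete, here is a minimal, self-contained sketch that refits the same family of polynomials and tabulates both errors. The seed and the fit_errors helper are assumptions made for this sketch, so the exact numbers will not match the plots above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

SEED = 0  # arbitrary seed for this sketch, not the article's


def fit_errors(max_degree=16, seed=SEED):
    """Return {degree: (train_mae, test_mae)} for polynomial fits to noisy x^3."""
    rng = np.random.default_rng(seed)
    X = np.linspace(-2, 2, 25).reshape(-1, 1)
    y = X.ravel() ** 3 + rng.normal(0, 2, size=len(X))
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=seed
    )

    errors = {}
    for degree in range(1, max_degree + 1):
        poly = PolynomialFeatures(degree)
        model = LinearRegression().fit(poly.fit_transform(X_tr), y_tr)
        errors[degree] = (
            mean_absolute_error(y_tr, model.predict(poly.transform(X_tr))),
            mean_absolute_error(y_te, model.predict(poly.transform(X_te))),
        )
    return errors


errors = fit_errors()
best_degree = min(errors, key=lambda d: errors[d][1])  # lowest test MAE

for degree, (train_mae, test_mae) in errors.items():
    marker = "  <- lowest test MAE" if degree == best_degree else ""
    print(f"degree {degree:2d}: train {train_mae:8.3f} | test {test_mae:10.3f}{marker}")
```

Which degree gets flagged depends on the seed and on the split. What tends to hold across runs is the overall shape: the training error keeps shrinking as the degree grows, while the test error eventually climbs.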

2. From Curve Fitting to Income Prediction

Now that we’ve explored overfitting on a tiny toy problem, let’s step up the complexity a bit. We'll simulate a more realistic dataset: age vs. income for the citizens of Overfitistan. Still just one feature and one target (to keep visualization easy), but the distribution and the noise will be more realistic.

Simulating Real-World Data (Sort Of)

Here are the rules of our little imaginary world:

Kids under 12 earn nothing (child labor laws and all).
Teenagers might have part-time jobs.
Income tends to rise quickly in early adulthood, peak around middle age, and then slowly decline through retirement.
People between 20 and 65 show the most variability, which simulates differences in education, layoffs, promotions, etc.

We'll model those with a piecewise function:

$$ \text{income}(a) = \begin{cases} 0 & \text{if } a < 12 \\[6pt] \displaystyle \frac{15000}{18} \cdot a & \text{if } 12 \leq a < 18 \\[6pt] \displaystyle 15000 + 50000 \cdot \left( \frac{a - 18}{14} \right)^{1.5} & \text{if } 18 \leq a < 30 \\[6pt] \displaystyle 55000 + 350 \cdot (a - 30) & \text{if } 30 \leq a < 60 \\[6pt] \displaystyle 65000 - 10000 \cdot \left(1 - e^{-0.05 \cdot (a - 60)}\right) & \text{if } a \geq 60 \end{cases} $$

Just add quite a bit of noise and that's it!

To make things more interesting, we’ll simulate a realistic age distribution. That means more 40-year-olds than elderly or toddlers, as if we were drawing names at random from a national registry. The resulting dataset will have less information at the head and the tail of the distribution, which will give our models something to struggle with.

def income(age, rng=None):
    # Income
    if age < 12:
        income = 0
    elif age < 18:
        income = (15000 / 18) * age
    elif age < 30:
        income = 15000 + 50000 * ((age - 18) / 14) ** 1.5
    elif age < 60:
        income = 55000 + 350 * (age - 30)
    else:
        income = 65000 - 10000 * (1 - np.exp(-0.05 * (age - 60)))

    # Noise
    if age < 18:
        noise_factor = np.clip((age - 12) / 6, 0, 1)
    elif age > 60:
        noise_factor = np.clip((90 - age) / 30, 0, 1)
    else:
        noise_factor = 1

    income += rng.normal(0, noise_factor * 10000) if rng else 0
    return income if income > 0 else 0

# Realistic sampling with noise
rng = np.random.default_rng(SEED)
age_train = []
while len(age_train) < 1000:
    # Use the seeded generator (not the global np.random state) so the sample is reproducible
    new_vals = rng.normal(loc=40, scale=18, size=1000 - len(age_train))
    age_train.extend(new_vals[(new_vals >= 0) & (new_vals <= 90)])

income_train = [income(x, rng=rng) for x in age_train]

And here is the resulting plot.

A better dataset

This time, the purple line is the synthetic curve, meaning the trend we are hoping our model uncovers. The blue dots are the noisy, imperfect observations the model will get. The shaded band shows the income spread at each age, using a smoothed min–max range.

It’s still a pretty simple dataset, but it’s just complex enough to make things interesting. And unlike the polynomial example, this time we’ll use a more powerful algorithm: gradient-boosted decision trees.

One Dataset, Three Very Different Fits

To see the effect of complexity, we’ll train three LightGBM models, each representing a different archetype:

- An overfitter, with far too much capacity and no guardrails.
- An underfitter, too constrained to capture the patterns in the data.
- A well-regularized model, tuned via cross-validation and grid search.

Each model uses the same learning rate (0.05) to keep things fair. For evaluation, we compute the mean absolute error (MAE) between the predictions and the true underlying curve, not a test set. This is unusual, and only possible because we know the exact function that generated the data, but it gives a more precise measure of generalization performance.
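As a rough illustration of that evaluation, the sketch below scores a prediction function against the noise-free curve. The names true_income, mae_vs_truth and age_grid, as well as the one-point-per-year grid, are assumptions of this sketch. The two stand-in "models" are plain functions, so the example runs without LightGBM.

```python
import math

import numpy as np


def true_income(age):
    """Noise-free income curve, following the piecewise definition above."""
    if age < 12:
        return 0.0
    if age < 18:
        return (15000 / 18) * age
    if age < 30:
        return 15000 + 50000 * ((age - 18) / 14) ** 1.5
    if age < 60:
        return 55000 + 350 * (age - 30)
    return 65000 - 10000 * (1 - math.exp(-0.05 * (age - 60)))


def mae_vs_truth(predict, ages):
    """Mean absolute error between predict(ages) and the noise-free curve."""
    ages = np.asarray(ages, dtype=float)
    truth = np.array([true_income(a) for a in ages])
    return float(np.mean(np.abs(np.asarray(predict(ages)) - truth)))


# Evaluation grid: one point per year of age (an assumption of this sketch).
age_grid = np.arange(0, 91)


def constant_guess(ages):
    """Stand-in model that predicts the same income for everyone."""
    return np.full(len(ages), 40000.0)


def shifted_truth(ages):
    """Stand-in model that is off by exactly 1,000 everywhere."""
    return np.array([true_income(a) + 1000 for a in ages])


print(mae_vs_truth(constant_guess, age_grid))
print(mae_vs_truth(shifted_truth, age_grid))  # about 1000, by construction
```

With a trained booster, the predict argument could be something like lambda ages: model.predict(ages.reshape(-1, 1)), assuming the model was trained on a single age column.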

overfit_params = {
    'num_leaves': 256,
    'num_iterations': 1000,
    'min_data_in_leaf': 1,
}
overfit_model = lgb.train(common_params | overfit_params, train_dataset)

Here, we let the model grow very wide and deep trees with num_leaves = 256. min_data_in_leaf is set to 1, meaning the model can populate those intricate trees with tiny leaves that may only be useful for individual data points. We also give it plenty of boosting iterations, so it has "time" to memorize the noise. You can probably guess that this leads to extreme overfitting.

underfit_params = {
    'num_leaves': 2,
    'num_iterations': 25,
}
underfit_model = lgb.train(common_params | underfit_params, train_dataset)

This one is the exact opposite. With num_leaves=2 and just 25 boosting rounds, the model is a sum of 25 one-split trees, so it has at most 50 leaves to work with, each a simple rule like if age < 10: return 0. That’s far too limited to capture subtleties like the early adulthood income increase or the gradual decline near retirement.

from itertools import product

from tqdm import tqdm

param_grid = {
    'num_leaves': [4, 8, 16],
    'min_data_in_leaf': [10, 15, 20, 25, 30],
    'lambda_l2': [0, 10, 20, 30, 40, 50],
}
total_iter = np.prod([len(v) for v in param_grid.values()])

best_score = float('inf')
best_params = None

for nl, mdl, l2 in tqdm(product(
    param_grid['num_leaves'],
    param_grid['min_data_in_leaf'],
    param_grid['lambda_l2'],
), total=total_iter):
    params = common_params | {'num_leaves': nl, 'min_data_in_leaf': mdl, 'lambda_l2': l2,}

    cv_result = lgb.cv(
        params | {'num_iterations': 10000, 'early_stopping_rounds': 50},
        train_dataset,
        nfold=5,
        stratified=False,
        seed=SEED,
    )

    mean_mae = cv_result['valid l1-mean'][-1]

    if mean_mae < best_score:
        best_score = mean_mae
        best_params = params | {'num_iterations' : len(cv_result['valid l1-mean'])}

Let’s try to strike a balance for our final model. We define a small grid of key hyperparameters (num_leaves, min_data_in_leaf, and lambda_l2) and evaluate every combination using 5-fold cross-validation. We also enable early_stopping_rounds so that LightGBM stops adding boosting iterations once the validation error stops improving, which is roughly the point where the model starts memorizing the training data. The combination with the lowest validation MAE across folds is used to train the final model. Here is the winner:

learning_rate: 0.05
num_leaves: 4
min_data_in_leaf: 30
lambda_l2: 10
num_iterations: 167

Not too deep, not too shallow (keep in mind that this is still a very simple problem!) and with quite a bit of extra regularization from both min_data_in_leaf and lambda_l2. Now we just have to train a brand new model using those on the full dataset.

regularized_model = lgb.train(best_params, train_dataset)
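The early-stopping rule used during the grid search can be pictured with a few lines of plain Python. This is only a sketch of the idea, not LightGBM's actual implementation. The function name and the sample scores are made up for illustration.

```python
def best_iteration(valid_scores, patience):
    """Return the 1-based round with the lowest score, stopping the scan
    once `patience` consecutive rounds have failed to improve on it."""
    best_score, best_iter = float("inf"), 0
    for i, score in enumerate(valid_scores, start=1):
        if score < best_score:
            best_score, best_iter = score, i
        elif i - best_iter >= patience:
            break  # no improvement for `patience` rounds: give up
    return best_iter


# Hypothetical validation MAE per boosting round.
scores = [5.0, 4.0, 3.0, 3.5, 3.2, 3.1, 2.0]

print(best_iteration(scores, patience=3))   # stops before the late improvement
print(best_iteration(scores, patience=10))  # patient enough to reach round 7
```

The two calls show the trade-off behind the patience value: too small and the search may stop on a temporary plateau, too large and training time is spent on rounds that never pay off.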

Overfit, Underfit, Just-Right: The Visuals

Let’s see how our three models perform. Each subplot shows the synthetic ground truth (purple line), which is for your eyes only, not the models’! The data distribution is still there too, along with each model’s predictions in color.

Comparison of LGBM Models

On the left, the model does a decent job for citizens aged 20 to 30-ish, but flatlines everywhere else. It simply ran out of decision leaves, a clear case of underfitting. On the right, we see a different kind of disaster. The model "sticks" to the noisy data, especially in regions with plenty of examples to learn from. But by reacting to every local bump and wiggle, it completely loses the big picture.

In the middle, the regularized model is a much better balance, and closely follows the ideal curve. But even this model struggles with the elderly range, where the trend shift is small and training data is sparse. You might be tempted to let the model fit more closely in that region, but doing so would likely cause overfitting in the well-covered age ranges. It’s a classic trade-off: there's rarely a perfect set of hyperparameters that works everywhere. Most of the time, you're just choosing the least-worst compromise. And keep in mind, this is still a toy problem with one feature and a thousand points. As your data grows in size and complexity, the balancing act only gets harder.

I should mention that there are multiple ways to address this kind of imbalance. In this case, I’d probably try giving more weight to rare cases (like elderly samples) during training, but that comes with its own caveats.
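One way to sketch that reweighting idea is inverse-frequency weighting by age bin. The helper name, the 10-year bin width and the normalization to a mean of 1 are all assumptions of this sketch, not something the article prescribes. It also assumes every age lies between 0 and max_age.

```python
import numpy as np


def inverse_frequency_weights(ages, bin_width=10, max_age=90):
    """Weight each sample by 1 / (number of samples in its age bin),
    then rescale so the weights average to 1."""
    ages = np.asarray(ages, dtype=float)
    edges = np.arange(0, max_age + bin_width, bin_width)  # 0, 10, ..., max_age
    counts, _ = np.histogram(ages, bins=edges)
    # Bin index of each sample; the clip folds age == max_age into the last bin.
    idx = np.clip(np.digitize(ages, edges) - 1, 0, len(counts) - 1)
    weights = 1.0 / counts[idx]
    return weights / weights.mean()


# Toy sample: eight people in their forties, two in their eighties.
ages = np.array([40, 41, 42, 43, 44, 45, 46, 47, 85, 86])
weights = inverse_frequency_weights(ages)
print(weights)
```

With LightGBM, such an array could be passed through the weight argument of lgb.Dataset. Weights this aggressive make every age bin count equally, which also amplifies the noise of the few samples in the sparse bins. That is one of the caveats mentioned above.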

Conclusion

I hope this was a helpful introduction to the vast world of overfitting. To be clear, we only scratched the surface here. In practice, overfitting is a deep and model-specific problem, and the tools to combat it vary widely depending on what you’re working with. That said, a few strategies and techniques do carry over.

Since I mostly work with gradient-boosted tree models, here’s a rough outline of the workflow I tend to follow:

Start wide. Use a relatively high learning rate (e.g., 0.1 or 0.2) and explore a broad range of hyperparameters: at minimum, things like num_leaves (or max_depth if your trees grow depth-wise), min_data_in_leaf, lambda_l2, colsample_bytree (and/or colsample_bynode), and subsample. Run a coarse but wide search with K-fold cross-validation, and use early_stopping_rounds to zero in on promising ranges quickly.
Then go deep. Lower the learning rate (e.g., to 0.05 or even 0.01) and tighten the search around the values identified previously. Drop regularization options that were unused. Continue using cross-validation and early_stopping_rounds, but this time to find the optimal number of boosting rounds.
Final training. Train on the full training set using the best hyperparameters found, and remember to fix num_iterations to the optimal value.
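The coarse-then-fine part of that workflow can be sketched without any model at all. In the example below, grid_search, refine and toy_score are hypothetical helpers. In practice the scoring function would wrap a cross-validated run (for instance lgb.cv with early stopping) and return the validation error.

```python
from itertools import product


def grid_search(grid, score_fn):
    """Try every combination in `grid`; return (best_params, best_score)."""
    names = list(grid)
    best_params, best_score = None, float("inf")
    for values in product(*(grid[n] for n in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


def refine(grid, best_params):
    """Build a tighter grid: the best value plus the midpoints towards
    its neighbours in the coarse grid."""
    fine = {}
    for name, values in grid.items():
        values = sorted(values)
        i = values.index(best_params[name])
        best = values[i]
        lo = values[max(i - 1, 0)]
        hi = values[min(i + 1, len(values) - 1)]
        cast = type(best)  # keep integer parameters as integers
        fine[name] = sorted({cast((lo + best) / 2), best, cast((best + hi) / 2)})
    return fine


def toy_score(params):
    """Stand-in for a cross-validated error; its optimum is made up."""
    return abs(params["num_leaves"] - 12) + abs(params["lambda_l2"] - 7)


coarse_grid = {"num_leaves": [4, 16, 64], "lambda_l2": [0.0, 10.0, 100.0]}

coarse_best, coarse_score = grid_search(coarse_grid, toy_score)  # start wide
fine_grid = refine(coarse_grid, coarse_best)                      # then go deep
fine_best, fine_score = grid_search(fine_grid, toy_score)

print(coarse_best, coarse_score)
print(fine_best, fine_score)
```

The midpoint rule is only one possible way to tighten a grid. Its one useful property is that the second pass spends its budget near the region the first pass liked.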

There’s no magic bullet. But the better you understand your data and your model’s behavior as you tweak both your feature engineering and hyperparameters, the more likely you are to find that sweet spot between memorization and over-simplification.

If you’d like to explore the full code, including the more detailed version of what was shown here and all the visualizations, you can find the complete notebook here.

Good luck out there!

Interactive Plot: Tweak and Overfit at Will!

Originally, this article was going to stop there. But as a visual learner, I couldn’t resist taking it a step further. So I brushed up on my JavaScript (I would still implore you not to look at the code, though!) and, with a bit of Plotly magic, built an interactive visualization for us to play with. Below, you can adjust the hyperparameters and see in real time how they affect the model’s predictions on the age-income problem.

A quick word of caution: this is a toy example with just one feature and a few hundred points. In real-world scenarios, with noisy, high-dimensional data, hyperparameters will behave differently. The settings that work here almost certainly won’t transfer to your next Kaggle competition or production pipeline.

Still, it’s a great way to build intuition for the trade-offs involved. So go ahead, try it! Watch how the model hugs or ignores the data. See how the wiggles form when you let it overfit and how they vanish when you regularize! Try fitting the elderly range and see what happens! The default sliders are set to a reasonably well-balanced model, but I’ll let you in on a secret: there’s an even better set of hyperparameters hidden in there. Think you can find it?