Crushing the Average: A Dive into Penalized Quantile Regression for Python

How to build penalized quantile regression models (with code!)

Photo by Joes Valentine / Unsplash: Imagine these are normal distributions.

This is my third post in the series on penalized regression. In the first one we talked about how to implement a sparse group lasso in Python, one of the best variable selection alternatives available for regression models today, and in the second one we talked about adaptive estimators and why they outperform their traditional counterparts. Today I want to talk about quantile regression and delve into the realm of high-dimensional quantile regression using the robust asgl package, with an emphasis on the implementation of quantile regression with an adaptive lasso penalty.

Today we’re going to see the following:

  • What is quantile regression?
  • What are the advantages of quantile regression compared to traditional least squares regression?
  • How to implement penalized quantile regression models in Python

What is quantile regression?

Let’s start with something that many of us have probably encountered: least squares regression. This is the classic go-to method when we want to predict an outcome based on some input variables. It works by finding the line (or hyperplane in higher dimensions) that best fits the data, minimizing the squared differences between observed and predicted values. Simply put, it’s like trying to draw the smoothest line through a scatterplot of data points. But here’s the catch: in least squares regression, the emphasis is solely on modeling the average trend in the data.
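
For readers who prefer code to words, here is a minimal sketch of that idea in NumPy: we fit a straight line by least squares and compute the sum of squared residuals that the fit minimizes (the data below is synthetic and purely illustrative).

import numpy as np

# Synthetic, purely illustrative data: y roughly follows 2x + 1 plus noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 * x + 1.0 + rng.normal(0, 1, size=50)

# np.polyfit with deg=1 solves the least squares problem for a straight line
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)
print(f"fitted line: y = {slope:.2f} * x + {intercept:.2f}")
print(f"sum of squared residuals: {np.sum(residuals ** 2):.2f}")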

So, what’s the problem with modeling the average? Well, life isn’t always about averages. Imagine analyzing income data, which is often skewed by a few high earners. Or think about data with outliers, such as real estate prices in a neighborhood with a sudden development of luxury apartments. In these situations, focusing on the average can give a distorted picture, potentially leading to misleading conclusions.

Advantages of quantile regression

Enter quantile regression. Unlike its least squares sibling, quantile regression allows us to examine different quantiles (or percentiles) of the data distribution. This means we can understand how different parts of the data behave, beyond just the average. Want to know how the bottom 10% or top 90% of your data responds to changes in the input variables? Quantile regression has you covered. It’s especially useful when working with outliers or highly skewed data, as it provides a more nuanced picture by looking at the distribution as a whole. They say a picture is worth a thousand words, so let’s see what quantile regression and least squares regression look like in a couple of simple examples.

Image by Author: Examples of a comparison of quantile regression and least squares regression.

These two figures show very simple regression models with one predictor variable and one response variable. The left figure has an outlier in the upper right corner (that lonely dot there). This outlier pulls the least squares estimate (the red line) away from the bulk of the data and produces very poor predictions. Quantile regression, on the other hand, is unaffected by the outlier and its predictions are spot-on. In the right figure, we have a dataset that is heteroscedastic. What does that mean? Imagine that your data forms a cone shape that gets wider as the value of X increases: the variability of the response is not constant but grows with X. Here, least squares (red) and quantile regression for the median (green) follow similar paths, but they only tell part of the story. By adding additional quantiles to the mix (in blue: 10%, 25%, 75%, and 90%) we can capture how the data behaves across its whole range.
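
If you want to reproduce something along the lines of the right-hand figure, the sketch below fits a least squares line plus several plain (unpenalized) quantile regression lines on synthetic heteroscedastic data, using scikit-learn’s LinearRegression and QuantileRegressor. This is not the asgl package yet, just standard quantile regression, and the data and quantile choices are only illustrative.

import numpy as np
from sklearn.linear_model import LinearRegression, QuantileRegressor

# Heteroscedastic toy data: the noise level grows with X
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X.ravel() + rng.normal(0, 0.5 + 0.5 * X.ravel())

# Least squares captures only the average trend
ols = LinearRegression().fit(X, y)
print(f"least squares: slope = {ols.coef_[0]:.2f}, intercept = {ols.intercept_:.2f}")

# One quantile regression fit per quantile of interest
for q in [0.1, 0.25, 0.5, 0.75, 0.9]:
    # alpha=0 switches off the built-in L1 penalty, giving plain quantile regression
    qr = QuantileRegressor(quantile=q, alpha=0, solver="highs").fit(X, y)
    print(f"quantile {q:.2f}: slope = {qr.coef_[0]:.2f}, intercept = {qr.intercept_:.2f}")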

Implementations of quantile regression

High-dimensional scenarios, where the number of predictors exceeds the number of observations, are becoming increasingly common in today’s data-driven world. They arise in fields such as genomics, where thousands of genes can predict a single outcome, or in image processing, where numerous pixels contribute to a single classification task. These situations require penalized regression models to manage the multitude of variables effectively. However, most existing software in R and Python offers limited support for penalized quantile regression in such high-dimensional contexts.

This is where my Python package, asgl, comes in. The asgl package provides a comprehensive framework for fitting different penalized regression models, including the sparse group lasso and the adaptive lasso, techniques I’ve talked about in previous posts. It is based on cutting-edge research and offers full compatibility with scikit-learn, enabling seamless integration with other machine learning tools.

Example (with code!)

Let’s see how we can use asgl to perform quantile regression with an adaptive lasso penalization. First, make sure you have the asgl library installed:

pip install asgl

Next we demonstrate the implementation using synthetic data:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from asgl import Regressor

# Generate synthetic data
X, y = make_regression(n_samples=100, n_features=200, n_informative=10, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the quantile regression model with adaptive lasso
model = Regressor(model="qr", penalization="alasso", quantile=0.5)

# Fit the model
model.fit(X_train, y_train)

# Make predictions
predictions = model.predict(X_test)

# Evaluate the model
mae = mean_absolute_error(y_test, predictions)
print(f'Mean Absolute Error: {mae:.3f}')

In this example, we generate a dataset with 100 samples and 200 features, where only 10 features are truly informative, making it a high-dimensional regression problem. The Regressor class from the asgl package is configured to perform quantile regression (by selecting model="qr") for the median (by selecting quantile=0.5). If we are interested in other quantiles, we just need to set quantile to any value in the interval (0, 1). We solve for an adaptive lasso penalization (by selecting penalization="alasso"), and we can either tune other aspects of the model, such as how the adaptive weights are estimated, or simply use the default configuration.
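
For example, if we wanted the 10th, 50th and 90th percentiles rather than just the median, a simple loop over the same Regressor call does the job. This sketch reuses the data and imports from the snippet above and leaves every other setting at its default.

# Fit the same adaptive lasso quantile model at several quantiles
for q in [0.1, 0.5, 0.9]:
    model_q = Regressor(model="qr", penalization="alasso", quantile=q)
    model_q.fit(X_train, y_train)
    preds_q = model_q.predict(X_test)
    print(f"quantile {q}: MAE = {mean_absolute_error(y_test, preds_q):.3f}")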

Benefits of asgl

I would like to conclude with a summary of the benefits of asgl:

  1. Scalability: The package efficiently handles high-dimensional datasets, making it suitable for a wide range of applications.
  2. Flexibility: With support for various models and penalties, asgl meets diverse analytical needs.
  3. Integration: Compatibility with scikit-learn simplifies model evaluation and hyperparameter tuning, as sketched below.
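
As an illustration of the third point, the following sketch plugs the Regressor from the example above into scikit-learn’s GridSearchCV. It is only a sketch: lambda1 is assumed here to be the name of the regularization-strength parameter, so check the asgl documentation for the exact parameter names before running it.

from sklearn.model_selection import GridSearchCV

# 'lambda1' is an assumed name for the regularization-strength parameter;
# verify it against the asgl documentation before running this sketch.
param_grid = {"lambda1": [0.01, 0.1, 1.0]}

search = GridSearchCV(
    Regressor(model="qr", penalization="alasso", quantile=0.5),
    param_grid=param_grid,
    scoring="neg_mean_absolute_error",
    cv=5,
)
search.fit(X_train, y_train)
print(search.best_params_)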

And that’s all for this post on quantile regression! By crushing the mean and exploring the full distribution of the data, we open up new possibilities for data-driven decision making. Stay tuned for more insights into the world of penalized regression and the asgl library.

