A Guide to Understanding Interaction Terms

Introduction

Interaction terms are included in regression models to capture the combined effect of two or more independent variables on the dependent variable. They are useful when the question is not just how each independent variable relates to the target on its own, but how the relationship between one independent variable and the dependent variable changes with the level of another independent variable.

In other words, the effect of one predictor on the response variable depends on the level of another predictor. In this blog, we explore interaction terms through a simulated scenario: predicting how much time users spend on an e-commerce platform based on their past behavior.

Learning objectives

  • Understand how interaction terms increase the predictive power of regression models.
  • Learn how to create and include interaction terms in a regression analysis.
  • Analyze the impact of interaction terms on model accuracy using a practical example.
  • Visualize and interpret the effects of interaction terms on predicted outcomes.
  • Gain insight into when and why to apply interaction terms in realistic scenarios.

This article was published as part of the Data Science Blogathon.

Understanding the Basics of Interaction Terms

In real data, variables rarely act in isolation from one another, so realistic models are often more complex than textbook examples. For example, the effect of a user's navigation actions, such as adding items to a shopping cart, on the time spent on an e-commerce platform can differ depending on whether the user also goes on to purchase those items. By adding interaction terms to a regression model, we can account for such dependencies and thus improve how well the model explains the patterns in the observed data and predicts future values of the dependent variable.

Mathematical representation

Let us consider a linear regression model with two independent variables, X1 and X2:

Y = β0 + β1X1 + β2X2 + ϵ,

where Y is the dependent variable, β0 is the intercept, β1 and β2 are the coefficients for the independent variables X1 and X2, respectively, and ϵ is the error term.

Adding an interaction term

To include an interaction term between X1 and X2, we introduce a new variable X1⋅X2:

Y = β0 + β1X1 + β2X2 + β3(X1⋅X2) + ϵ,

where β3 represents the interaction effect between X1 and X2. The term X1⋅X2 is the product of the two independent variables.

How do interaction terms affect regression coefficients?

  • β0: The intercept, which represents the expected value of Y when all independent variables are zero.
  • β1: The effect of X1 on Y when X2 is zero.
  • β2: The effect of X2 on Y when X1 is zero.
  • β3: The change in the effect of X1 on Y for a one-unit change in X2, or equivalently, the change in the effect of X2 on Y for a one-unit change in X1.
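
To make the last bullet concrete, here is a minimal sketch with made-up coefficient values. The numbers and the helper name effect_of_x1 are illustrative assumptions, not estimates from any data. In the interaction model, a one-unit change in X1 changes Y by β1 + β3⋅X2, so that effect shifts by β3 for every unit of X2.

```python
def effect_of_x1(beta1, beta3, x2):
    """Change in Y per one-unit change in X1, holding X2 fixed at x2."""
    return beta1 + beta3 * x2

# Hypothetical coefficients, chosen only for illustration
beta1, beta3 = 2.0, 4.0

for x2 in (0, 1, 2):
    print(f"X2 = {x2}: effect of X1 on Y = {effect_of_x1(beta1, beta3, x2)}")
```

With X2 = 0 the effect of X1 is just β1, which is why β1 is read as "the effect of X1 when X2 is zero" rather than as an overall effect.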

Example: User Activity and Time Spent

First, let’s create a simulated dataset to represent user behavior in an online store. The data consists of:

  • added_in_cart: Whether the user added products to their cart (1 if yes, 0 if no).
  • purchased: Whether the user completed a purchase (1 if yes, 0 if no).
  • time_spent: The amount of time the user spent on the e-commerce platform.

Our goal is to predict the duration of a user's visit to the online store from whether they add products to their cart and whether they complete a transaction.

# Import libraries
import pandas as pd
import numpy as np

# Generate synthetic data
def generate_synthetic_data(n_samples=2000):
    """Simulate two binary user actions and the resulting time spent."""
    np.random.seed(42)  # fixed seed so the results are reproducible
    added_in_cart = np.random.randint(0, 2, n_samples)  # 1 = added products to cart
    purchased = np.random.randint(0, 2, n_samples)      # 1 = completed a purchase
    # Main effects plus an interaction effect (coefficient 4) and Gaussian noise
    time_spent = 3 + 2*purchased + 2.5*added_in_cart + 4*purchased*added_in_cart + np.random.normal(0, 1, n_samples)
    return pd.DataFrame({'purchased': purchased, 'added_in_cart': added_in_cart, 'time_spent': time_spent})

df = generate_synthetic_data()
df.head()

Output: the first five rows of df, with the columns purchased, added_in_cart, and time_spent.
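
Because the data are simulated, the true average time spent in each of the four user groups can be read directly off the formula in generate_synthetic_data. The short sketch below simply evaluates that formula without the noise term; the helper name expected_time_spent is ours and is not part of the original code.

```python
from itertools import product

def expected_time_spent(purchased, added_in_cart):
    """Noise-free part of the simulation formula used above."""
    return 3 + 2 * purchased + 2.5 * added_in_cart + 4 * purchased * added_in_cart

# Enumerate the four combinations of the two binary actions
cell_means = {
    (p, a): expected_time_spent(p, a) for p, a in product((0, 1), repeat=2)
}
for (p, a), mean in cell_means.items():
    print(f"purchased={p}, added_in_cart={a}: expected time_spent = {mean}")
```

Users who do both average 11.5, which is 4 more than the 7.5 that the baseline plus the two separate effects would give (3 + 2 + 2.5). That gap of 4 is the interaction built into the simulation.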

Simulated scenario: user behavior on an e-commerce platform

Next, we build an ordinary least squares (OLS) regression model that includes both user actions but no interaction between them. The hypothesis behind this first model is that each action affects the time spent on the website separately, independently of the other. We then build a second model that includes the interaction term between adding products to the shopping cart and making a purchase.

This lets us compare the impact of those actions, individually and combined, on the time spent on the site. In other words, we want to find out whether users who both add products to the cart and make a purchase spend more time on the site than the two behaviors considered separately would predict.

Model without interaction term

After fitting the model, we noted the following results:

  • The model without the interaction term explains approximately 82% of the variance in time_spent on the training set and 80% on the test set (R-squared), with a test mean squared error (MSE) of 2.11. The MSE is the average squared prediction error, so a typical prediction is off by roughly 1.45 (the square root of 2.11) in the units of time_spent. The model is reasonably accurate, but it can be improved.
  • The actual-versus-predicted plot produced by the code below shows that, while the model performs reasonably well, there is still considerable room for improvement, especially in capturing higher values of time_spent.

# Import libraries
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Model without interaction term
X = df[['purchased', 'added_in_cart']]
y = df['time_spent']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Add a constant for the intercept
X_train_const = sm.add_constant(X_train)
X_test_const = sm.add_constant(X_test)

# Fit OLS on the training set and predict on the held-out test set
model = sm.OLS(y_train, X_train_const).fit()
y_pred = model.predict(X_test_const)

# Calculate metrics for model without interaction term
train_r2 = model.rsquared
test_r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)

print("Model without Interaction Term:")
print('Training R-squared Score (%):', round(train_r2 * 100, 4))
print('Test R-squared Score (%):', round(test_r2 * 100, 4))
print("MSE:", round(mse, 4))
print(model.summary())


# Function to plot actual vs predicted
def plot_actual_vs_predicted(y_test, y_pred, title):
    plt.figure(figsize=(8, 4))
    plt.scatter(y_test, y_pred, edgecolors=(0, 0, 0))
    # Dashed diagonal marks perfect predictions (predicted == actual)
    plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--', lw=2)
    plt.xlabel('Actual')
    plt.ylabel('Predicted')
    plt.title(title)
    plt.show()

# Plot without interaction term
plot_actual_vs_predicted(y_test, y_pred, 'Actual vs Predicted Time Spent (Without Interaction Term)')

Output: the printed metrics and OLS summary, followed by the actual-vs-predicted scatter plot for the model without the interaction term.
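
One way to see why this model leaves error on the table is to fit the additive model to the four noise-free group means implied by the simulation formula. This is a sketch under the assumption that the four user groups are equally common, which np.random.randint gives only approximately; the variable names are ours.

```python
import numpy as np

# Noise-free group means from the simulation formula: 3 + 2p + 2.5a + 4pa
groups = np.array([(0, 0), (1, 0), (0, 1), (1, 1)], dtype=float)
p, a = groups[:, 0], groups[:, 1]
true_means = 3 + 2 * p + 2.5 * a + 4 * p * a

# Additive design matrix: intercept, purchased, added_in_cart (no product column)
X_additive = np.column_stack([np.ones(4), p, a])
coef, *_ = np.linalg.lstsq(X_additive, true_means, rcond=None)

fitted = X_additive @ coef
bias = true_means - fitted  # systematic error the additive model cannot remove

print("Additive coefficients (const, purchased, added_in_cart):", coef.round(3))
print("Systematic error per group:", bias.round(3))
print("Mean squared systematic error:", round(float(np.mean(bias ** 2)), 3))
```

Under that assumption the additive model is off by one unit in every group, which adds about 1 to the squared error on top of the noise variance of 1 built into the simulation. That is in line with the reported MSE of 2.11.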

Model with an interaction term

  • The scatter plot for the model with the interaction term shows predicted values that are much closer to the actual values, indicating a better fit.
  • With the interaction term, the model explains much more of the variance in time_spent: the test R-squared rises from 80.36% to 90.46%.
  • The predictions of the model with the interaction term are more accurate, as shown by the lower MSE (down from 2.11 to 1.02).
  • The tighter alignment of the points along the diagonal line, especially for higher values of time_spent, indicates an improved fit. The interaction term helps express how the two user actions jointly affect the amount of time spent.

# Add interaction term (product of the two binary predictors)
df['purchased_added_in_cart'] = df['purchased'] * df['added_in_cart']
X = df[['purchased', 'added_in_cart', 'purchased_added_in_cart']]
y = df['time_spent']
# Same test_size and random_state as before, so both models use the same split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Add a constant for the intercept
X_train_const = sm.add_constant(X_train)
X_test_const = sm.add_constant(X_test)

model_with_interaction = sm.OLS(y_train, X_train_const).fit()
y_pred_with_interaction = model_with_interaction.predict(X_test_const)

# Calculate metrics for model with interaction term
train_r2_with_interaction = model_with_interaction.rsquared
test_r2_with_interaction = r2_score(y_test, y_pred_with_interaction)
mse_with_interaction = mean_squared_error(y_test, y_pred_with_interaction)

print("\nModel with Interaction Term:")
print('Training R-squared Score (%):', round(train_r2_with_interaction * 100, 4))
print('Test R-squared Score (%):', round(test_r2_with_interaction * 100, 4))
print("MSE:", round(mse_with_interaction, 4))
print(model_with_interaction.summary())


# Plot with interaction term
plot_actual_vs_predicted(y_test, y_pred_with_interaction, 'Actual vs Predicted Time Spent (With Interaction Term)')

# Print comparison
print("\nComparison of Models:")
print("R-squared without Interaction Term:", round(r2_score(y_test, y_pred)*100,4))
print("R-squared with Interaction Term:", round(r2_score(y_test, y_pred_with_interaction)*100,4))
print("MSE without Interaction Term:", round(mean_squared_error(y_test, y_pred),4))
print("MSE with Interaction Term:", round(mean_squared_error(y_test, y_pred_with_interaction),4))

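Output: the printed metrics and OLS summary for the model with the interaction term, its actual-vs-predicted scatter plot, and the printed comparison of the two models.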
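
Beyond the fit metrics, the coefficients of the interaction model can be read as conditional effects. The following is a minimal sketch that regenerates similar data with NumPy only and a different random seed (an assumption on our part, so its numbers will not match the output above) and uses np.linalg.lstsq instead of statsmodels to keep it self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumption: not the article's seed, so numbers differ
n = 2000
purchased = rng.integers(0, 2, n)
added_in_cart = rng.integers(0, 2, n)
time_spent = (3 + 2 * purchased + 2.5 * added_in_cart
              + 4 * purchased * added_in_cart + rng.normal(0, 1, n))

# Design matrix: const, purchased, added_in_cart, purchased * added_in_cart
X = np.column_stack([np.ones(n), purchased, added_in_cart, purchased * added_in_cart])
b0, b_purchased, b_cart, b_inter = np.linalg.lstsq(X, time_spent, rcond=None)[0]

# Effect of adding to cart, conditional on whether the user purchased
effect_cart_no_purchase = b_cart
effect_cart_with_purchase = b_cart + b_inter

print("Effect of added_in_cart when purchased = 0:", round(effect_cart_no_purchase, 2))
print("Effect of added_in_cart when purchased = 1:", round(effect_cart_with_purchase, 2))
```

The two printed values should land near 2.5 and 6.5, the values implied by the simulation formula. The coefficient on added_in_cart alone describes only users who did not purchase.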

Compare model performance

  • The blue dots show the predictions of the model without the interaction term. For higher actual time_spent values, these dots are spread further from the diagonal line.
  • The red dots show the predictions of the model with the interaction term. This model produces more accurate predictions, especially for higher actual time_spent values, where its points lie closer to the diagonal line.

# Compare model with and without interaction term
def plot_actual_vs_predicted_combined(y_test, y_pred1, y_pred2, title1, title2):
    plt.figure(figsize=(10, 6))
    # color= sets both face and edge, so the dots match the blue/red description above
    plt.scatter(y_test, y_pred1, color="blue", label=title1, alpha=0.6)
    plt.scatter(y_test, y_pred2, color="red", label=title2, alpha=0.6)
    # Dashed diagonal marks perfect predictions (predicted == actual)
    plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--', lw=2)
    plt.xlabel('Actual')
    plt.ylabel('Predicted')
    plt.title('Actual vs Predicted User Time Spent')
    plt.legend()
    plt.show()

plot_actual_vs_predicted_combined(y_test, y_pred, y_pred_with_interaction, 'Model Without Interaction Term', 'Model With Interaction Term')

Output: a combined actual-vs-predicted scatter plot showing both models.

Conclusion

The improvement in model performance with the interaction term demonstrates that adding interaction terms to your model can sometimes increase its explanatory and predictive power. This example highlights how interaction terms can capture information that the main effects alone miss. In practice, considering interaction terms in regression models can lead to more accurate and insightful predictions.

In this blog, we first generated a synthetic dataset to simulate user behavior on an e-commerce platform. Then, we constructed two regression models: one without interaction terms and one with interaction terms. By comparing their performances, we demonstrated the significant impact of interaction terms on the accuracy of the model.

Key Points

  • Regression models with interaction terms can help to better understand the relationships between two or more variables and the target variable by capturing their combined effects.
  • Including interaction terms can significantly improve model performance, as evidenced by the higher R-squared values and lower MSE in this guide.
  • Interaction terms are not only theoretical concepts; they can also be applied to real-world situations.

Frequently Asked Questions

Question 1. What are interaction terms in regression analysis?

A. They are variables created by multiplying two or more independent variables. They are used to capture the combined effect of these variables on the dependent variable. This can provide a more nuanced understanding of the relationships in the data.

Question 2. When should I use interaction terms in my model?

A. You should consider using interaction terms when you suspect that the effect of one independent variable on the dependent variable depends on the level of another independent variable. For example, if you suspect that the impact of adding items to the shopping cart on time spent on an e-commerce platform depends on whether the user makes a purchase, you should include an interaction term between these variables.
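
When both predictors are binary, one quick and informal way to check such a suspicion is to compare group means before fitting any model. The sketch below uses a tiny made-up dataset; the values and the names toy and gap are hypothetical and for illustration only.

```python
import pandas as pd

# Hypothetical toy data, made up for illustration only
toy = pd.DataFrame({
    "purchased":     [0, 0, 0, 0, 1, 1, 1, 1],
    "added_in_cart": [0, 0, 1, 1, 0, 0, 1, 1],
    "time_spent":    [3.0, 3.2, 5.4, 5.6, 4.9, 5.1, 11.3, 11.7],
})

# Mean time_spent for each combination of the two actions
means = toy.pivot_table(index="purchased", columns="added_in_cart",
                        values="time_spent", aggfunc="mean")

# Effect of adding to cart within each purchase group
cart_effect_no_purchase = means.loc[0, 1] - means.loc[0, 0]
cart_effect_with_purchase = means.loc[1, 1] - means.loc[1, 0]

# A gap far from zero suggests the effect depends on the other variable
gap = cart_effect_with_purchase - cart_effect_no_purchase
print(means)
print("Difference between the two cart effects:", round(gap, 2))
```

If the two cart effects are roughly equal, an additive model is probably adequate. A large gap is a hint, not a formal test, that an interaction term is worth including.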

Question 3. How do I interpret the coefficients of interaction terms?

A. The coefficient of an interaction term represents the change in the effect of one independent variable on the dependent variable for a one-unit change in another independent variable. In the example above, which has an interaction term between purchased and added_in_cart, the coefficient tells us how much the effect of adding items to the cart on time spent changes when the user also makes a purchase.

The media shown in this article is not owned by Analytics Vidhya and is used at the author’s sole discretion.