Python Analytics for Business Intelligence¶

From Insight to Action¶

Do you know what your data is trying to tell you?¶

Dr. Andrés García Medina

andgarm.n@gmail.com

https://sites.google.com/view/andresgm/home

In this session, you’ll learn how to:

  • Use logistic regression to predict key business outcomes

  • Apply clustering techniques to segment and understand your customers

  • Overcome the challenge of uncertainty in decision-making

  • See how real-world dashboards turn complex data into clear insights

Module 1: Logistic Regression for Credit Risk Analysis and Customer Churn¶

Introduction to Classification Models in Business Analytics¶

In business analytics problems, we often need to predict a categorical variable:

  • Will a customer pay a loan or not?
  • Will a customer churn or stay?

Classification models allow you to assign probabilities to each class and make decisions based on a threshold (e.g., 0.5).
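
As a minimal sketch (with made-up probabilities, not output from any model in this notebook), thresholding turns predicted probabilities into yes/no decisions:

import numpy as np

# Hypothetical predicted probabilities for five customers (illustrative values)
probs = np.array([0.12, 0.47, 0.50, 0.63, 0.91])

# Classify as 1 (e.g., "default" or "churn") when the probability reaches the threshold
threshold = 0.5
decisions = (probs >= threshold).astype(int)
print(decisions)  # [0 0 1 1 1]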

Business application examples:

  • Credit risk: Identify customers with a high probability of defaulting.
  • Customer retention: Anticipate churn to design preventive strategies.

Fundamentals of Logistic Regression¶

Logistic regression is a binary classification model. The idea is to model the probability of an event:

$ P(Y=1|X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + ... + \beta_p X_p)}} $

  • The sigmoid function transforms any real value into a number between 0 and 1.
  • The logit is the transformation:

$ \text{logit}(p) = \ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X_1 + ... + \beta_p X_p $

Interpretation of coefficients:

  • $ \beta_j $ indicates the change in the log-odds (logit) per additional unit of $ X_j $; $ e^{\beta_j} $ is the corresponding odds ratio (a short numerical sketch follows).
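
As a quick numerical sketch (the linear score below is illustrative, not taken from any fitted model), the sigmoid maps a linear score to a probability and the logit transformation inverts it:

import numpy as np

def sigmoid(z):
    # Maps any real value into the interval (0, 1)
    return 1 / (1 + np.exp(-z))

def logit(p):
    # Inverse of the sigmoid: log-odds of p
    return np.log(p / (1 - p))

z = 0.4            # illustrative linear score beta0 + beta1*x1 + ... + betap*xp
p = sigmoid(z)     # probability of the event
print(round(p, 3), round(logit(p), 3))  # logit(p) recovers z (up to rounding)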
In [1]:
# Importing necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

Example: Credit Default Risk Prediction¶

Let's simulate a customer dataset with the following variables:

  • Income: The annual income of the client

  • Credit Amount: The amount of credit the client is requesting or has taken.

  • Arrears: The number of previous late payments.

  • Default (1 = defaulted, 0 = paid)

In [2]:
# Data Simulation
np.random.seed(42)
n = 500

df_credit = pd.DataFrame({
    "Income": np.random.normal(50000, 15000, n),
    "Credit_Amount": np.random.normal(20000, 5000, n),
    "Arrears": np.random.randint(0, 5, n)
})


beta0 = -5
beta1 = 0.0005
beta2 = -0.00004
beta3 = 0.8

# Probability of Default
p_default = 1 / (1 + np.exp(-(beta0
                              + beta1*df_credit['Credit_Amount']
                              + beta2*df_credit['Income']
                              + beta3*df_credit['Arrears'])))
df_credit["Default"] = np.random.binomial(1, p_default)

# Save the simulated credit data to a CSV file
df_credit.to_csv("credit_data.csv", index=False)
In [3]:
#Open data
df_credit = pd.read_csv('credit_data.csv')
df_credit.head()
Out[3]:
         Income  Credit_Amount  Arrears  Default
0  57450.712295   24630.887738        3        1
1  47926.035482   29547.083202        0        1
2  59715.328072   13007.162131        2        0
3  72845.447846   22814.846183        4        1
4  46487.699379   16746.787154        2        1
In [4]:
# Descriptive Statistics

df_credit.describe()
Out[4]:
              Income  Credit_Amount     Arrears     Default
count     500.000000     500.000000  500.000000  500.000000
mean    50102.569919   20159.130585    2.014000    0.898000
std     14718.798710    4889.985976    1.416268    0.302951
min      1380.989899    6515.566785    0.000000    0.000000
25%     39495.388933   17023.541301    1.000000    1.000000
50%     50191.957195   20142.657998    2.000000    1.000000
75%     59551.748812   23256.211488    3.000000    1.000000
max    107790.972360   33161.910324    4.000000    1.000000
In [5]:
# Estimate parameters of the logistic regression model using the data
X = df_credit[["Income", "Credit_Amount", "Arrears"]]
y = df_credit["Default"]

model_credit_in = LogisticRegression()
model_credit_in.fit(X, y)

# Display the parameters (in-sample)
print("Intercept:", model_credit_in.intercept_[0])
print("Coefficients:", model_credit_in.coef_[0])
Intercept: -4.915706515840034
Coefficients: [-7.81822335e-05  6.20387121e-04  6.86068503e-01]

How to use the regression equation for prediction?¶

Consider a client with an annual salary of $\$60,000$, a credit amount of $\$15,000$, and one arrear.

For this particular client, what is the probability of default?

We can use the estimated equation:

$ P(Y=1|X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3)}} $

or type in Python:

In [6]:
# Define a sample client's data (Income, Credit_Amount, Arrears)
sample_client = pd.DataFrame({
    "Income": [60000],
    "Credit_Amount": [15000],
    "Arrears": [1]
})

# Predict the probability of default for the sample client
probability_default_client = model_credit_in.predict_proba(sample_client)[:, 1]

# Print the predicted probability
print(f"The predicted probability of default for the sample client is: {probability_default_client[0]:.4f}")
The predicted probability of default for the sample client is: 0.5951
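
As a sanity check (assuming the fitted model_credit_in and the sample_client DataFrame from the cells above), the same probability can be obtained by plugging the estimated coefficients into the sigmoid by hand:

import numpy as np

# Manually evaluate the logistic equation with the fitted parameters
z = (model_credit_in.intercept_[0]
     + np.dot(model_credit_in.coef_[0], sample_client.iloc[0].values))
p_manual = 1 / (1 + np.exp(-z))
print(f"Manual probability of default: {p_manual:.4f}")  # should match predict_proba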

Odds Ratios¶

Odds ratios quantify the change in the odds of the outcome (default in this case) for a one-unit increase in a predictor variable, holding all other predictors constant.

  • An odds ratio of 1 means the predictor has no effect on the odds of the outcome.
  • An odds ratio greater than 1 means the predictor increases the odds of the outcome.
  • An odds ratio less than 1 means the predictor decreases the odds of the outcome.
In [7]:
# Compute odds ratios for the credit risk model
odds_ratios_credit = pd.DataFrame({
    'Feature': X.columns,
    'Odds Ratio': np.exp(model_credit_in.coef_[0])
})

print("Odds Ratios - Credit Risk Model:")
display(odds_ratios_credit)
Odds Ratios - Credit Risk Model:
         Feature  Odds Ratio
0         Income    0.999922
1  Credit_Amount    1.000621
2        Arrears    1.985893

Interpretation of Odds Ratios for Credit Risk¶

Based on this dataset and model, the odds ratios suggest that a customer's payment history (Arrears) matters far more for assessing credit risk than their current income or the amount of credit they are seeking:

  • Income: The odds ratio for Income is very close to 1 (0.9999). This suggests that a one-unit increase in income has a negligible effect on the odds of defaulting, holding other factors constant.
  • Credit_Amount: The odds ratio for Credit_Amount is slightly greater than 1 (1.0006). This means that for every one-unit increase in the credit amount, the odds of defaulting increase by a very small amount (approximately 0.06%).
  • Arrears: The odds ratio for Arrears is approximately 1.985. This means that for every one-unit increase in the number of previous late payments (Arrears), the odds of defaulting increase by about 98.6%. This is a substantial increase, indicating that the number of previous arrears is a strong predictor of default risk.

Model training: out-of-sample¶

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model_credit_out = LogisticRegression()
model_credit_out.fit(X_train, y_train)

# Display the parameters (out-of-sample)
print("Intercept:", model_credit_out.intercept_[0])
print("Coefficients:", model_credit_out.coef_[0])
Intercept: -5.430045539160187
Coefficients: [-9.59264607e-05  7.33997106e-04  6.17921734e-01]
In [9]:
# Make predictions on the test set
y_pred = model_credit_out.predict(X_test)
print(y_pred)
[0 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 0 1 0 1 0 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 0 1 1 1 1 1 1 1 1 1
 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 0 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 0
 1 1]
In [10]:
#Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)
sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Blues")
plt.title("Confusion Matrix - Credit Risk")
plt.show()
[Figure: Confusion Matrix - Credit Risk]

Let’s define the key terms first:

  • TP (True Positives): Predicted 1 and actually 1
  • TN (True Negatives): Predicted 0 and actually 0
  • FP (False Positives): Predicted 1 but actually 0
  • FN (False Negatives): Predicted 0 but actually 1

A false positive means you deny credit to someone who would have paid it back. These are missed business opportunities.

A false negative in this context means you grant credit to someone who ends up defaulting. These are direct financial losses.


Precision $ = \frac{TP}{TP + FP} $

Measures how many of the predicted positives are actually positive.
“Of all loans predicted to default, how many really defaulted?”


Recall (Sensitivity, True Positive Rate) $ = \frac{TP}{TP + FN} $

Measures how many of the actual positives were captured.
“Of all loans that actually defaulted, how many did the model detect?”


F1-Score $ = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} $

Harmonic mean of Precision and Recall. Balances both metrics.


Accuracy $ = \frac{TP + TN}{TP + TN + FP + FN} $

Fraction of all predictions that were correct.


Macro Average $ = \frac{1}{N_{classes}} \sum_{i=1}^{N_{classes}} \text{Metric}_i $

Simple average across all classes, treating each class equally (ignores imbalance).


Weighted Average $ = \frac{\sum_{i=1}^{N_{classes}} \text{Support}_i \cdot \text{Metric}_i}{\sum_{i=1}^{N_{classes}} \text{Support}_i} $

Average across classes, weighted by the number of instances in each class (accounts for imbalance).
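
As a worked sketch, these formulas can be evaluated by hand from the four confusion-matrix counts; the numbers below are chosen to be consistent with the classification report shown in the next cell, but treat them purely as an illustration:

# Recompute the metrics for the positive class (Default = 1) from raw counts
TN, FP, FN, TP = 14, 8, 5, 123

precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (TP + TN) / (TP + TN + FP + FN)

print(f"Precision (class 1): {precision:.2f}")  # ~0.94
print(f"Recall    (class 1): {recall:.2f}")     # ~0.96
print(f"F1-score  (class 1): {f1:.2f}")         # ~0.95
print(f"Accuracy:            {accuracy:.2f}")   # ~0.91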

In [11]:
# Classification report
print(classification_report(y_test, y_pred))

#Note: The support number shows how many data points are considered for
#      calculating the metrics for each class
#      see https://www.nb-data.com/p/breaking-down-the-classification
              precision    recall  f1-score   support

           0       0.74      0.64      0.68        22
           1       0.94      0.96      0.95       128

    accuracy                           0.91       150
   macro avg       0.84      0.80      0.82       150
weighted avg       0.91      0.91      0.91       150

Interpretation of classification metrics for Credit risk¶

  • With a recall of 0.96 for the default class, the model is strongly oriented towards minimizing false negatives (missing actual defaulters), which is a common strategy in credit risk to protect against losses.

Example: Customer Churn Prediction¶

Churn rate in business refers to the number of customers or subscribers that leave a provider in a given time period.

Now, a case of customer retention:

  • Customer age
  • Months with the company
  • Number of customer service calls
  • Churn (1 = churned, 0 = stayed)
In [12]:
# Churn data simulation
np.random.seed(123)
n = 400

age = np.random.randint(18, 70, n)
months = np.random.randint(1, 60, n)
calls = np.random.poisson(2, n)

p_churn = 1 / (1 + np.exp(-( -3 + 0.1*calls -0.03*months + 0.1*age )))
churn = np.random.binomial(1, p_churn)

df_churn = pd.DataFrame({
    "Age": age,
    "Months": months,
    "Calls": calls,
    "Churn": churn
})
In [13]:
df_churn.head()
Out[13]:
   Age  Months  Calls  Churn
0   63      33      0      1
1   20      15      1      0
2   46       3      2      1
3   52      28      2      1
4   56      27      1      1
In [14]:
# Model training
X = df_churn[["Age", "Months", "Calls"]]
y = df_churn["Churn"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)

model_churn = LogisticRegression()
model_churn.fit(X_train, y_train)

y_pred = model_churn.predict(X_test)
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.67      0.62      0.65        53
           1       0.72      0.76      0.74        67

    accuracy                           0.70       120
   macro avg       0.70      0.69      0.69       120
weighted avg       0.70      0.70      0.70       120

In [15]:
conf_matrix = confusion_matrix(y_test, y_pred)
sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Oranges")
plt.title("Confusion Matrix - Customer Churn")
plt.show()
[Figure: Confusion Matrix - Customer Churn]
In [16]:
# Compute odds ratios for the churn model
odds_ratios_churn = pd.DataFrame({
    'Feature': X.columns,
    'Odds Ratio': np.exp(model_churn.coef_[0])
})

print("Odds Ratios - Customer Churn Model:")
display(odds_ratios_churn)
Odds Ratios - Customer Churn Model:
  Feature  Odds Ratio
0     Age    1.114219
1  Months    0.980445
2   Calls    1.111131

Interpreting the Odds Ratios for the Churn Model:

  • Age: The odds ratio for Age is approximately 1.114. This means that for every one-year increase in customer age, the odds of churning increase by about 11.4%. Older customers have a slightly higher propensity to churn, holding other factors constant.
  • Months: The odds ratio for Months is approximately 0.980. This means that for every one-month increase in the time a customer has been with the company, the odds of churning decrease by about 2%. Customers who have been with the company longer are slightly less likely to churn.
  • Calls: The odds ratio for Calls is approximately 1.111. This means that for every one-unit increase in the number of customer service calls, the odds of churning increase by about 11.1%. Customers who make more service calls are more likely to churn.

These odds ratios provide actionable insights (a short scoring sketch follows this list):

  • Age: While the effect is small, it suggests that slightly older customers might require targeted retention efforts.
  • Months: Customers who have been with the company for a shorter period are more likely to churn. This highlights the importance of onboarding and early engagement strategies.
  • Calls: The number of service calls is a significant predictor of churn. This indicates that customers with frequent issues or who require more support are at higher risk. Businesses should investigate the root causes of these calls and improve service quality or provide better self-service options. Addressing these issues could be a key factor in reducing churn.
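
To turn these insights into action, a hedged sketch (assuming the df_churn data and the fitted model_churn from the cells above) is to score every customer and hand the highest-risk ones to the retention team:

# Score all customers with the churn model and rank them by predicted risk
churn_scores = df_churn[["Age", "Months", "Calls"]].copy()
churn_scores["Churn_Probability"] = model_churn.predict_proba(churn_scores)[:, 1]

# The ten customers to prioritize for retention outreach
top_risk = churn_scores.sort_values("Churn_Probability", ascending=False).head(10)
print(top_risk)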

Predict the probability of churn for a particular profile of a client¶

In [17]:
# Define a sample client's data (Age, Months, Calls)
sample_client = pd.DataFrame({
    "Age": [39],
    "Months": [12],
    "Calls": [3]
})

# Predict the probability of churn for the sample client
probability_churn = model_churn.predict_proba(sample_client)[:, 1]

# Print the predicted probability
print(f"The predicted probability of churn for the sample client is: {probability_churn[0]:.4f}")
The predicted probability of churn for the sample client is: 0.6806

Module Conclusions¶

  • Logistic regression is a powerful tool for binary classification problems in business.
  • It allows you to estimate probabilities, interpret coefficients in terms of odds ratios, and make predictions for new clients.
  • Main applications in this module:
    • Credit risk: predicting payment defaults.
    • Customer churn: identifying users at risk of churn.

Module 2: Customer Segmentation with Clustering and Unsupervised Learning¶

Fundamentals of Clustering and Unsupervised Learning¶

Clustering is an unsupervised learning technique used to group similar data points together, and is widely used in business to:

  • Segment customers based on purchasing behavior.
  • Identify patterns in product usage.
  • Target marketing campaigns more effectively.

Customer Segmentation using K-means¶

K-means clustering partitions the dataset into K clusters, where each data point belongs to the cluster with the nearest mean.

Key concepts:

  • Unsupervised Learning: The algorithm learns patterns without labeled outcomes.
  • Cluster: A group of data points that are more similar to each other than to points in other clusters.

Example: Mall Customer¶

The “Mall Customers” dataset includes the following characteristics:

  • Customer ID: A unique identifier for each customer.
  • Gender: The customer's gender (Male or Female); note that the column is labeled Genre in the CSV file.
  • Age: The customer's age.
  • Annual Income (k$): The customer's approximate annual income, expressed in thousands of dollars.
  • Spending Score (1-100): The score assigned by the shopping center based on customer behavior and spending habits. This score ranges from 1 to 100.
In [18]:
# Importing necessary libraries
from sklearn.cluster import KMeans
In [19]:
customer_data = pd.read_csv('Mall_Customers.csv')
customer_data.head()
Out[19]:
   CustomerID   Genre  Age  Annual Income (k$)  Spending Score (1-100)
0           1    Male   19                  15                      39
1           2    Male   21                  15                      81
2           3  Female   20                  16                       6
3           4  Female   23                  16                      77
4           5  Female   31                  17                      40
In [20]:
# Visualize the data
plt.figure(figsize=(6, 4))
sns.scatterplot(x='Annual Income (k$)', y='Spending Score (1-100)', data=customer_data)
plt.title('Annual Income vs. Spending Score')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.show()
[Figure: Annual Income vs. Spending Score scatter plot]

The main steps of the k-means algorithm (a short from-scratch sketch follows this list):

  1. Estimate the number of clusters (K): Choose K using a valid method (e.g., Elbow Method).
  2. Initialize centroids: Randomly place K centroids in the feature space.
  3. Assign points to clusters: Each data point is assigned to the nearest centroid.
  4. Recalculate centroids: Compute the new centroid of each cluster as the average of its assigned points.
  5. Iterate until convergence: Repeat steps 3 and 4 until cluster assignments no longer change significantly.
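
To make steps 3 and 4 concrete, here is a minimal from-scratch sketch of a single assignment/update iteration on synthetic 2-D points (illustrative only; the scikit-learn implementation used below handles initialization and convergence for us):

import numpy as np

rng = np.random.default_rng(0)
points = rng.normal(size=(100, 2))                          # synthetic 2-D data
centroids = points[rng.choice(100, size=3, replace=False)]  # step 2: random init, K = 3

# Step 3: assign each point to its nearest centroid
distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
labels = distances.argmin(axis=1)

# Step 4: recompute each centroid as the mean of its assigned points
new_centroids = np.array([points[labels == k].mean(axis=0) for k in range(3)])
print(new_centroids)  # repeat steps 3 and 4 until the assignments stop changing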

End-to-end K-means algorithm¶

Elbow Method: Choosing optimal number of clusters¶

Before starting with the K-means algorithm, we need to decide the number of clusters (K) we want to identify.

  • The elbow method and other techniques are used to estimate the number of clusters (K) in the K-means algorithm.
  • These techniques, especially the elbow method, analyze how within-cluster variability, measured as WCSS (Within-Cluster Sum of Squares), changes as the number of clusters increases.
  • The goal is to find a balance where an increase in the number of clusters does not result in a significant improvement in variability (WCSS).

The formula to calculate WCSS (Within-Cluster Sum of Squares) is as follows:

$ WCSS = \sum_{i=1}^{k} \sum_{x \in S_i} \| x - \mu_i \|^2 $

Where:

  • $ k $ is the number of clusters.
  • $ S_i $ represents the set of points in cluster $ i $.
  • $ x $ is a point within cluster $ S_i $.
  • $ \mu_i $ is the centroid of cluster $ i $, i.e., the point that represents the center of the cluster.
  • $ \| x - \mu_i \|^2 $ is the squared distance between point $ x $ and centroid $ \mu_i $, which gives us an idea of how far each point is from its center.

To apply the elbow method here, we will:

  • Run the algorithm 10 times, for K = 1, 2, …, 10.
  • Calculate the within-cluster sum of squares (WCSS) for each K.
  • Plot WCSS against K and look for the point where the trend changes (the "elbow").
In [21]:
import numpy as np
X = customer_data.iloc[:, [3, 4]].values

wcss = []
n_clu = 10
for i in range(1, n_clu+1):
    kmeans = KMeans(n_clusters = i,
                    init = "k-means++",
                    max_iter = 400,
                    n_init = 10,
                    random_state = 42)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)
In [22]:
plt.plot(range(1,11), wcss, marker='o')
plt.title("Elbow plot")
plt.xlabel("#clusters")
plt.ylabel("WCSS(k)")
plt.grid(True)
plt.show()
[Figure: Elbow plot of WCSS vs. number of clusters]

In this case, K = 5 is a reasonable choice for our dataset: beyond five clusters, the reduction in WCSS becomes marginal, so five clusters strike a good balance between model simplicity and how well the data are grouped.

Implementation of the K-Means Algorithm¶

  • Now that we have the optimal K, we'll organize our dataset X into 5 distinct groups or clusters.

  • The idea is to run the K-means method again with K=5 so that the algorithm finds the 5 best clusters based on the data characteristics.

In [23]:
kmeans = KMeans(n_clusters = 5,
                init="k-means++",
                max_iter = 100,
                n_init = 10,
                random_state = 42)

y_kmeans = kmeans.fit_predict(X)
y_kmeans[0:10]
Out[23]:
array([4, 2, 4, 2, 4, 2, 4, 2, 4, 2], dtype=int32)
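
As a quick consistency check (assuming the X array, the labels y_kmeans, and the fitted kmeans object from the cell above), the WCSS formula can be evaluated by hand and compared with kmeans.inertia_:

# Recompute WCSS manually: sum of squared distances to each point's assigned centroid
centroids = kmeans.cluster_centers_
wcss_manual = sum(np.sum((X[y_kmeans == k] - centroids[k]) ** 2) for k in range(5))
print(wcss_manual, kmeans.inertia_)  # the two values should agree up to rounding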

Integrating Clustering Results into the Dataset¶

After fitting the K-Means model to our data and assigning each sample to a cluster, we must now integrate the results into our original dataset to facilitate more detailed analysis and provide added value to the business.

In [24]:
y_kmeans_series = pd.Series(y_kmeans, name='Cluster Pred')

X_clustered = pd.concat([customer_data, y_kmeans_series], axis=1)
X_clustered.head()
Out[24]:
   CustomerID   Genre  Age  Annual Income (k$)  Spending Score (1-100)  Cluster Pred
0           1    Male   19                  15                      39             4
1           2    Male   21                  15                      81             2
2           3  Female   20                  16                       6             4
3           4  Female   23                  16                      77             2
4           5  Female   31                  17                      40             4

Visualizing Clusters and Centroids¶

After integrating the cluster assignments (Cluster Pred) into our dataset, the next step is to visualize the results to gain an intuitive understanding of how the clusters are distributed based on the selected features.

In [25]:
# Visualizing clusters
plt.figure(figsize=(6,5))
sns.scatterplot(x='Annual Income (k$)', y='Spending Score (1-100)', 
    hue='Cluster Pred', palette='Set2', data=X_clustered, s=100)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            s=100, c="black", label="Centroid", marker="^")
plt.title("Customer Segments")
plt.show()
[Figure: Customer Segments scatter plot with cluster centroids]

Business Interpretation of Clusters¶

Based on the visualization of the clusters, we can interpret the segments as follows:

  • Cluster 1 (green): Low Income, Low Spending: This group has low annual income and low spending scores. They may be price-sensitive or occasional shoppers.
    • Business Insight: Target with budget-friendly promotions, loyalty programs, and focus on essential goods.
  • Cluster 2 (orange): High Income, High Spending: This group has high annual income and high spending scores. They are likely valuable customers who are willing to spend more.
    • Business Insight: Offer premium products, exclusive deals, personalized recommendations, and excellent customer service.
  • Cluster 3 (blue): Low Income, High Spending: This group has low annual income but high spending scores. They may be impulsive buyers or heavily influenced by promotions and trends.
    • Business Insight: Focus on impulse buys, limited-time offers, and create a sense of urgency.
  • Cluster 4 (pink): High Income, Low Spending: This group has high annual income but low spending scores. They may be saving money or spending elsewhere.
    • Business Insight: Encourage spending with targeted promotions, highlight value and benefits, and consider offering financial services or rewards programs.
  • Cluster 5 (green): Mid Income, Mid Spending: This is the average customer group with moderate income and spending habits.
    • Business Insight: Maintain consistent marketing efforts, offer a balanced mix of products and promotions, and focus on building customer loyalty.

By understanding these distinct customer segments, businesses can tailor their marketing strategies, product offerings, and customer service to better meet the needs and preferences of each group, ultimately leading to increased sales and customer satisfaction.
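
A simple way to back these interpretations with numbers (assuming the X_clustered DataFrame built above) is to profile each cluster by its average age, income, and spending score:

# Average profile of each segment
cluster_profile = (
    X_clustered
    .groupby("Cluster Pred")[["Age", "Annual Income (k$)", "Spending Score (1-100)"]]
    .mean()
    .round(1)
)
display(cluster_profile)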

Key Takeaways¶

  • Clustering is a powerful unsupervised learning technique for customer segmentation.
  • K-means is simple but effective for many business applications.
  • Visualizing clusters provides insight into customer behavior.
  • Combining clustering with classification allows for predictive segmentation and better decision-making.

Module 3: Time Series Analysis for Sales and Demand Forecasting¶

Understanding Time Series Patterns¶

Time series data is a sequence of observations collected over time. Common patterns, illustrated in the short synthetic sketch after this list, include:

  • Trend: Long-term upward or downward movement.
  • Seasonality: Repeating short-term cycle patterns (e.g., monthly sales peaks).
  • Noise: Random fluctuations that cannot be explained by trend or seasonality.
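
A minimal synthetic sketch (purely illustrative, not the restaurant data used below) shows how these three components add up to an observed series:

import numpy as np
import matplotlib.pyplot as plt

months = np.arange(36)
trend = 200 + 2 * months                             # long-term upward movement
seasonality = 20 * np.sin(2 * np.pi * months / 12)   # repeating 12-month cycle
noise = np.random.default_rng(0).normal(0, 5, 36)    # random fluctuations

plt.plot(months, trend + seasonality + noise, marker='o')
plt.title('Synthetic series: trend + seasonality + noise')
plt.show()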

We will illustrate these patterns using a worked example from Camm, Jeffrey D., et al., Statistics for Business and Economics, Cengage, 2024.

Case Study: Forecasting Food and Beverage Sales.

  • The Vintage Restaurant, on Captiva Island near Fort Myers, Florida, is owned and operated by Karen Payne.
  • The restaurant just completed its third year of operation. Since opening her restaurant, Karen has sought to establish a reputation for the Vintage as a high-quality dining establishment that specializes in fresh seafood.
  • Through the efforts of Karen and her staff, her restaurant has become one of the best and fastest growing restaurants on the island.
  • To better plan for future growth of the restaurant, Karen needs to develop a system that will enable her to forecast food and beverage sales by month for up to one year in advance.
In [26]:
df_vintage = pd.read_excel("vintage.xlsx")
df_vintage.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36 entries, 0 to 35
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Month   36 non-null     int64
 1   Sales   36 non-null     int64
dtypes: int64(2)
memory usage: 708.0 bytes
In [27]:
df_vintage.head()
Out[27]:
   Month  Sales
0      1    242
1      2    235
2      3    232
3      4    178
4      5    184
In [28]:
plt.figure(figsize=(6, 4))
plt.plot(df_vintage['Month'], df_vintage['Sales'])
plt.title('Vintage Restaurant Monthly Sales')
plt.xlabel('Month')
plt.ylabel('Sales ($1000s)')
plt.grid(True)
plt.show()
[Figure: Vintage Restaurant Monthly Sales line plot]

Moving Average (MA): Smooths data by averaging over a rolling window. Useful for identifying trends.

Exponential Smoothing (ES): Assigns exponentially decreasing weights to past observations. Useful for forecasting with trend and seasonality.

In [29]:
# Moving Average
df_vintage['MA_3'] = df_vintage['Sales'].rolling(window=3).mean()

# Exponential Smoothing
from statsmodels.tsa.holtwinters import ExponentialSmoothing

es_model = ExponentialSmoothing(df_vintage['Sales'], trend='add', seasonal='add', seasonal_periods=12)
es_fit = es_model.fit()
df_vintage['ES_Forecast'] = es_fit.fittedvalues
In [30]:
plt.figure(figsize=(8, 4))
plt.plot(df_vintage['Sales'], label='Original Sales', marker='o')
plt.plot(df_vintage['MA_3'], label='3-Month Moving Average', linestyle='--')
plt.plot(df_vintage['ES_Forecast'], label='Exponential Smoothing', linestyle='-.')
plt.title('Sales Forecasting with Moving Average and Exponential Smoothing')
plt.xlabel('Month')
plt.ylabel('Sales')
plt.legend()
plt.show()
[Figure: Sales with 3-Month Moving Average and Exponential Smoothing]

ARIMA: Mathematical Foundation¶

ARIMA (AutoRegressive Integrated Moving Average) is a statistical model used to forecast non-stationary time series by combining three components:

  1. Autoregressive (AR) part – order p
    • The current value of the series depends on its past values.

    • Mathematically:

      $y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \cdots + \phi_p y_{t-p} + \varepsilon_t$

    • where:

      • $\phi_i$ = autoregressive coefficients,
      • $\varepsilon_t$ = white noise error term.
  2. Integrated (I) part – order d (see the differencing sketch after this list)
    • Represents the number of differencing operations applied to remove trends and make the series stationary.
    • First differencing: $ y'_t = y_t - y_{t-1} $
    • After (d) differences, the transformed series becomes stationary.
  3. Moving Average (MA) part – order q
    • The current value depends on past forecast errors.
    • Mathematically: $ y_t = c + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2} + \cdots + \theta_q \varepsilon_{t-q} + \varepsilon_t $
    • where:
      • $\theta_j$ = moving average coefficients,
      • $\varepsilon_t$ = white noise error term.
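
To see the "I" component in practice, a short sketch (assuming the df_vintage DataFrame loaded earlier) applies first differencing with pandas:

# First differencing removes the level/trend: y'_t = y_t - y_{t-1}
sales_diff = df_vintage['Sales'].diff().dropna()
print(sales_diff.head())  # this differenced series is what ARIMA models when d = 1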

ARIMA(p, d, q) General Model¶

After differencing the series (d) times to achieve stationarity, the ARIMA model combines AR(p) and MA(q):

$ \Delta^d y_t = c + \phi_1 \Delta^d y_{t-1} + \cdots + \phi_p \Delta^d y_{t-p} + \theta_1 \varepsilon_{t-1} + \cdots + \theta_q \varepsilon_{t-q} + \varepsilon_t $

where:
  • $\Delta^d y_t$: the series differenced (d) times,
  • $\phi_i$: AR coefficients,
  • $\theta_j$: MA coefficients,
  • $\varepsilon_t$: white noise.

Key Points for Forecasting Sales¶

  • Use differencing $(d)$ to remove long-term trends.
  • AR terms $(p)$ capture momentum from past sales.
  • MA terms $(q)$ capture the effect of shocks and random fluctuations.
  • Proper selection of $(p,d,q)$ is critical for accurate forecasts.

Let's start with a simple ARIMA(1, 1, 1), adding a seasonal (1, 1, 1, 12) component to capture the monthly pattern, as an example.

In a real-world scenario, you would perform model identification (for example, inspecting ACF/PACF plots or comparing information criteria) before settling on the orders; a small AIC-based sketch follows.
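
One common identification approach, sketched below under the assumption that a lower AIC signals a better trade-off between fit and complexity, is a small grid search over candidate (p, d, q) orders:

from itertools import product
from statsmodels.tsa.arima.model import ARIMA
import warnings
warnings.filterwarnings("ignore")

# Hedged sketch: compare a small grid of non-seasonal (p, d, q) orders by AIC
results = []
for p, d, q in product(range(3), range(2), range(3)):
    try:
        fit = ARIMA(df_vintage['Sales'], order=(p, d, q)).fit()
        results.append(((p, d, q), fit.aic))
    except Exception:
        continue  # skip orders that fail to estimate

best_order, best_aic = min(results, key=lambda item: item[1])
print(f"Lowest-AIC order: {best_order} (AIC = {best_aic:.1f})")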

In [31]:
from statsmodels.tsa.arima.model import ARIMA
import warnings
warnings.filterwarnings("ignore")
model_arima = ARIMA(df_vintage['Sales'], order=(1, 1, 1), seasonal_order=(1, 1, 1, 12)) 
# Added seasonal order based on identified seasonality
model_arima_fit = model_arima.fit()
In [32]:
# Forecast for the next 12 months
forecast_periods = 12
forecast_arima = model_arima_fit.forecast(steps=forecast_periods)

# Create a DataFrame for the forecast
forecast_index = range(df_vintage['Month'].max() + 1, df_vintage['Month'].max() + 1 + forecast_periods)
forecast_df = pd.DataFrame({'Month': forecast_index, 'Sales_Forecast_ARIMA': forecast_arima})

print("ARIMA Forecast for the next 12 months:")
display(forecast_df)
ARIMA Forecast for the next 12 months:
    Month  Sales_Forecast_ARIMA
36     37            284.619781
37     38            265.305505
38     39            270.512378
39     40            214.573704
40     41            218.213499
41     42            172.243247
42     43            178.575979
43     44            184.907693
44     45            141.908406
45     46            158.627769
46     47            186.374581
47     48            246.884119
In [33]:
# Plot the original data and the forecast
plt.figure(figsize=(8, 4))
plt.plot(df_vintage['Month'], df_vintage['Sales'], label='Original Sales')
plt.plot(forecast_df['Month'], forecast_df['Sales_Forecast_ARIMA'], label='ARIMA Forecast', color='red', linestyle='--')
plt.title('Vintage Restaurant Sales Forecast (ARIMA)')
plt.xlabel('Month')
plt.ylabel('Sales ($1000s)')
plt.legend()
plt.grid(True)
plt.show()
[Figure: Vintage Restaurant Sales Forecast (ARIMA)]
In [34]:
# Get confidence intervals
forecast_ci = model_arima_fit.get_forecast(steps=forecast_periods).conf_int()

# Add confidence intervals to the forecast DataFrame
forecast_df['Lower_CI'] = forecast_ci.iloc[:, 0].values
forecast_df['Upper_CI'] = forecast_ci.iloc[:, 1].values

print("\nARIMA Forecast with Confidence Intervals:")
display(forecast_df)
ARIMA Forecast with Confidence Intervals:
    Month  Sales_Forecast_ARIMA    Lower_CI    Upper_CI
36     37            284.619781  273.838851  295.400711
37     38            265.305505  254.334222  276.276787
38     39            270.512378  258.517612  282.507144
39     40            214.573704  201.951636  227.195772
40     41            218.213499  204.898500  231.528498
41     42            172.243247  158.300834  186.185659
42     43            178.575979  164.023032  193.128925
43     44            184.907693  169.772662  200.042723
44     45            141.908406  126.210019  157.606793
45     46            158.627769  142.391206  174.864331
46     47            186.374581  169.602646  203.146516
47     48            246.884119  229.643446  264.124791
In [35]:
# Plot the original data, forecast, and confidence intervals
plt.figure(figsize=(10, 5))
plt.plot(df_vintage['Month'], df_vintage['Sales'], label='Original Sales')
plt.plot(forecast_df['Month'], forecast_df['Sales_Forecast_ARIMA'], label='ARIMA Forecast', color='red', linestyle='--')
plt.fill_between(forecast_df['Month'], forecast_df['Lower_CI'], forecast_df['Upper_CI'], color='red', alpha=0.1, label='Confidence Interval')
plt.title('Vintage Restaurant Sales Forecast with Confidence Intervals (ARIMA)')
plt.xlabel('Month')
plt.ylabel('Sales ($1000s)')
plt.legend()
plt.grid(True)
plt.show()
[Figure: Vintage Restaurant Sales Forecast with Confidence Intervals (ARIMA)]

Key Takeaways¶

  • Time series patterns (trend, seasonality, noise) are key for understanding sales data.
  • Moving averages and exponential smoothing are simple yet effective forecasting tools.
  • ARIMA models provide more flexibility for non-stationary series.
  • Forecasting sales and demand helps SMEs plan inventory, marketing, and production.

Module 4: Dashboards for Business Reporting¶

Introduction to Dashboards¶

Dashboards are interactive tools that consolidate data, metrics, and analytical results into a single interface. They help businesses:

  • Monitor performance in real time.
  • Communicate insights effectively.
  • Make data-driven decisions.

Key Features of Effective Dashboards¶

  • Clarity: Easy to interpret, minimal clutter.
  • Interactivity: Users can filter, drill-down, or explore data.
  • Integration: Can include results from models (e.g., regression, clustering, forecasts).

Setting Up the Environment¶

We will use Python libraries:

  • pandas for data manipulation
  • plotly.express for visualizations
  • dash for building the dashboard

Install libraries if needed:

!pip install pandas plotly dash
In [36]:
# Import libraries
import pandas as pd
import plotly.express as px
import plotly.io as pio
pio.renderers.default = "notebook"   # or "notebook_connected"
from dash import Dash, dcc, html, Input, Output

Let's work through a simple example.

In [37]:
# Sample dataset: sales data for an SME
data = {
    'Month': ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun'],
    'Product_A_Sales': [120, 150, 170, 130, 160, 180],
    'Product_B_Sales': [80, 90, 100, 85, 95, 105],
    'Revenue': [2000, 2300, 2500, 2100, 2400, 2700]
}

df = pd.DataFrame(data)
df
Out[37]:
  Month  Product_A_Sales  Product_B_Sales  Revenue
0   Jan              120               80     2000
1   Feb              150               90     2300
2   Mar              170              100     2500
3   Apr              130               85     2100
4   May              160               95     2400
5   Jun              180              105     2700

Exploratory Visualizations¶

Before building the dashboard, let's visualize sales and revenue trends.

In [38]:
# Line chart for sales
fig_sales = px.line(df, x='Month', y=['Product_A_Sales', 'Product_B_Sales'],
                    title='Monthly Sales for Products A and B')
fig_sales.show()
In [39]:
# Bar chart for revenue
fig_revenue = px.bar(df, x='Month', y='Revenue', title='Monthly Revenue')
fig_revenue.show()

Building an Interactive Dashboard¶

We'll create a simple dashboard with:

  • Dropdown to select products
  • Line chart for sales
  • Bar chart for revenue
In [40]:
# Initialize the Dash app
app = Dash(__name__)

# Layout
app.layout = html.Div([
    html.H1("SME Sales Dashboard", style={'text-align': 'center'}),

    html.Label("Select Product:"),
    dcc.Dropdown(
        id='product_dropdown',
        options=[
            {'label': 'Product A', 'value': 'Product_A_Sales'},
            {'label': 'Product B', 'value': 'Product_B_Sales'}
        ],
        value='Product_A_Sales'
    ),

    dcc.Graph(id='sales_graph', figure={}),
    dcc.Graph(id='revenue_graph', figure={})
])

# Callbacks
@app.callback(
    Output('sales_graph', 'figure'),
    Input('product_dropdown', 'value')
)
def update_sales(selected_product):
    fig = px.line(df, x='Month', y=selected_product, title=f'Monthly Sales of {selected_product}')
    return fig

@app.callback(
    Output('revenue_graph', 'figure'),
    Input('product_dropdown', 'value')
)
def update_revenue(_):
    fig = px.bar(df, x='Month', y='Revenue', title='Monthly Revenue')
    return fig

# Run the app (requires Dash >= 2.11 for inline rendering inside a notebook)
if __name__ == '__main__':
    app.run(jupyter_mode='inline')

In practice, the best way is to run the app as a standalone script from the shell (Bash) and open the local URL it prints in your browser. Let's go!

Practical Applications¶

  • Integrating Forecast Models: You can extend this dashboard to display predictions from regression or ARIMA models.
  • KPIs: Add cards for metrics like total sales, average revenue, or growth rates (a short pandas sketch follows this list).
  • Interactive Filtering: Include filters by region, product category, or time period.
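
As a hedged sketch of the KPI idea (using only the small df defined above), the numbers behind such cards can be computed directly with pandas:

# Simple KPIs that could feed dashboard "cards"
total_sales_a = df['Product_A_Sales'].sum()
total_sales_b = df['Product_B_Sales'].sum()
avg_revenue = df['Revenue'].mean()
mom_growth = df['Revenue'].pct_change().mean() * 100   # average month-over-month growth

print(f"Total sales, Product A: {total_sales_a}")
print(f"Total sales, Product B: {total_sales_b}")
print(f"Average monthly revenue: {avg_revenue:.0f}")
print(f"Average MoM revenue growth: {mom_growth:.1f}%")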

Take Away¶

We learned how to:

  1. Understand the importance of dashboards in business reporting.
  2. Explore sales and revenue data with interactive visualizations.
  3. Build a lightweight dashboard using Dash and Plotly.
  4. Integrate interactivity to enhance business decision-making.

Conclusion¶

Throughout this program, you have gained not only the technical knowledge of Python’s open-source analytics tools but also the strategic insights to apply them effectively within a business context. You are now equipped to:

  • Optimize performance and manage risk with data-driven decision-making.

  • Leverage Python’s accessible ecosystem to give SMEs the same analytical edge as larger competitors.

  • Implement growth-oriented analytics strategies that translate directly into smarter actions and tangible results.

By the end of this journey, you should feel confident in applying Python analytics to your own organization—empowering you to anticipate trends, uncover opportunities, and make decisions that drive sustainable growth.