Histogram Analysis¶

  • Histograms are among the most direct ways to visualize a numeric variable because they show how its values are distributed across the range of the data.
  • By grouping data into bins, histograms reveal the shape of the distribution, highlight central tendencies, identify potential outliers, and illustrate variability.
  • This visual summary makes it easier to compare different sets of data and spot trends or anomalies that might not be apparent through summary statistics alone.
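
To make the binning step concrete, here is a minimal sketch of how values can be grouped into fixed-width bins with `np.histogram`. The sample values and the bin width are hypothetical and are not taken from the dataset used in this notebook; the edge alignment (multiples of the bin width) is one reasonable choice, not necessarily what the plotting helpers below do.

```python
import numpy as np

# Hypothetical sample values, chosen only to illustrate binning
values = np.array([1.0, 2.5, 2.7, 3.1, 4.8, 5.0, 9.6])
bin_width = 2.0

# Align bin edges to multiples of bin_width that cover the full data range
start = np.floor(values.min() / bin_width) * bin_width
stop = np.ceil(values.max() / bin_width) * bin_width
edges = np.arange(start, stop + bin_width, bin_width)

# counts[i] is the number of values in [edges[i], edges[i + 1]);
# the last bin also includes its right edge
counts, edges = np.histogram(values, bins=edges)

# Share of observations per bin, comparable across datasets of different size
shares = counts / counts.sum()

for left, right, count, share in zip(edges[:-1], edges[1:], counts, shares):
    print(f"[{left:>4.1f}, {right:>4.1f}): {count} ({share:.0%})")
```

Reporting shares rather than raw counts is what makes two histograms with different sample sizes comparable on the same axis.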
In [1]:
import os
import warnings
import logging

# configure logging
logging.basicConfig(
    level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s", datefmt="%H:%M:%S",
)
logger = logging.getLogger(__name__)

# get warning filter policy from the environment variables
# set to "ignore" for rendering the HTMLs, or to "once" otherwise
WARNING_FILTER_POLICY = os.getenv("WARNING_FILTER_POLICY", "once")
logger.info(f"{WARNING_FILTER_POLICY = }")
warnings.filterwarnings(WARNING_FILTER_POLICY)
21:12:45 [INFO] WARNING_FILTER_POLICY = 'ignore'
In [2]:
import numpy as np
import pandas as pd
import seaborn as sns

pd.set_option("display.max_columns", None)
pd.options.display.float_format = "{:,.2f}".format
In [3]:
from utils.constants import RANDOM_SEED
from utils.common import (
    get_data_folder_path,
    set_plotting_config,
    plot_histogram,
    plot_comparison_histograms
)
In [4]:
# plots configuration
sns.set_style("darkgrid")
sns.set_palette("colorblind")
set_plotting_config()
%matplotlib inline

1. Load Data¶

In this notebook, we will use the Medical Insurance Payout Dataset. This dataset contains historical data for over 1300 insurance customers (age, sex, BMI, number of children, smoking habits, and region) along with their actual medical charges, i.e., the expenditure incurred for each customer.

Sources:

  1. Kaggle: https://www.kaggle.com/datasets/harshsingh2209/medical-insurance-payout
  2. Original source: https://raw.githubusercontent.com/JovianML/opendatasets/master/data/medical-charges.csv
In [5]:
data_path = get_data_folder_path()

df_input = pd.read_csv(os.path.join(data_path, "expenses.csv"))
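
Before plotting, it can help to confirm that the columns used later in this notebook (`charges` and `smoker`) are present and that `charges` is numeric. The sketch below is an assumption about how such a check might look; it reads a small inline CSV with made-up rows instead of the real file, and `REQUIRED_COLUMNS` is a hypothetical name.

```python
import io

import pandas as pd

# Inline stand-in for the CSV file; the rows are made up for illustration
csv_text = """age,smoker,charges
25,no,3000.0
40,yes,25000.0
"""

# Hypothetical constant: the columns that the plots in this notebook rely on
REQUIRED_COLUMNS = {"smoker", "charges"}

df = pd.read_csv(io.StringIO(csv_text))

# Fail early with a readable message if a required column is absent
missing = REQUIRED_COLUMNS - set(df.columns)
if missing:
    raise ValueError(f"Missing required columns: {sorted(missing)}")

# Histograms need a numeric column; a string dtype usually signals a parsing issue
is_numeric = pd.api.types.is_numeric_dtype(df["charges"])
print(f"rows = {len(df)}, charges is numeric = {is_numeric}")
```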

2. Plot Histograms¶

Distribution of Medical Charges¶

In [6]:
fig1 = plot_histogram(
    title="Distribution of Medical Charges",
    histogram_title=f"(n = {len(df_input)})",
    df=df_input,
    plot_col="charges",
    display_name="Medical Charges",
    display_unit="USD",
    bin_size=2000,
    linewidth=1.5,
    show_legend=True,
    show_percentage=True,
    show_mean=True,
    show_median=True,
    show_zero_line=False,
    show_kde=True,
    figsize=(8, 6),
)
display(fig1)
[Figure: Distribution of Medical Charges]

Distribution of Medical Charges by Smoking Status¶

In [7]:
fig2 = plot_histogram(
    title="Distribution of Medical Charges by Smoking Status",
    histogram_title=f"(n = {len(df_input)})",
    df=df_input,
    plot_col="charges",
    display_name="Medical Charges",
    display_unit="USD",
    stratify_col="smoker",
    bin_size=2000,
    linewidth=1.5,
    show_legend=True,
    show_mean=True,
    show_percentage=False,
    show_median=True,
    show_zero_line=False,
    show_kde=False,
)
display(fig2)
[Figure: Distribution of Medical Charges by Smoking Status]
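
A stratified histogram is easier to read next to the per-group numbers it summarizes. The sketch below computes the count, mean, and median per group with `groupby`; the six rows are hypothetical values invented for illustration and say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical rows; the column names mirror those used in this notebook
df = pd.DataFrame(
    {
        "smoker": ["yes", "no", "no", "yes", "no", "yes"],
        "charges": [30000.0, 4000.0, 6000.0, 40000.0, 8000.0, 20000.0],
    }
)

# One row per stratum, with the statistics that the histogram marks as lines
summary = df.groupby("smoker")["charges"].agg(["count", "mean", "median"])
print(summary)
```

On the real data, the same call would be applied to `df_input` with `stratify_col` as the grouping column.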

COVID-19 Impact on Medical Charges¶

As a hypothetical exercise, let's simulate the impact of an event such as COVID-19 on the distribution of medical charges. Assume that COVID-19 raised medical charges by about 10% on average, with the size of the change varying from customer to customer.

In [8]:
# draw one multiplicative factor per customer from a normal distribution
# with mean 1.1 (+10% on average) and std 0.1
np.random.seed(RANDOM_SEED)
random_change = np.random.normal(loc=1.1, scale=0.1, size=len(df_input))
# apply the random change to the charges to simulate COVID's impact
df_input["charges_new"] = df_input["charges"] * random_change
# calculate the difference
df_input["charges_diff"] = df_input["charges_new"] - df_input["charges"]
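
Because each factor is drawn from a normal distribution with mean 1.1 and standard deviation 0.1, the increase is about 10% only on average: a factor below 1.0 lies more than one standard deviation under the mean, so roughly 16% of customers end up with lower charges. The sketch below checks this on synthetic data; the seed, sample size, and log-normal charges are assumptions for illustration and differ from the notebook's `RANDOM_SEED` and `df_input`.

```python
import numpy as np

# Illustrative generator and seed, not the notebook's RANDOM_SEED
rng = np.random.default_rng(0)
n = 100_000

# Synthetic charges and per-customer multiplicative factors
charges = rng.lognormal(mean=9.0, sigma=0.8, size=n)
factor = rng.normal(loc=1.1, scale=0.1, size=n)
charges_new = charges * factor

# Average factor and the ratio of mean charges after vs. before
mean_factor = factor.mean()
mean_ratio = charges_new.mean() / charges.mean()

# Share of customers whose simulated charges decreased (factor below 1.0)
share_decreased = (factor < 1.0).mean()

print(f"mean factor      = {mean_factor:.3f}")
print(f"mean charge ratio = {mean_ratio:.3f}")
print(f"share decreased   = {share_decreased:.1%}")
```

This is why the histogram of differences in the comparison plot can extend below zero even though the average change is positive.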
In [9]:
plot_comparison_histograms(
    title="COVID-19 impact on Medical Charges",
    left_title="Distribution of Medical Charges Before and After COVID-19",
    right_title="Distribution of Medical Charges Differences",
    df=df_input,
    plot_col_before="charges",
    plot_col_after="charges_new",
    plot_col_diff="charges_diff",
    display_name="Medical Charges",
    display_unit="USD",
    bin_size=2000,
    show_percentage=True,
    show_mean=True,
    show_median=True,
    figsize=(14, 6),
)
Out[9]:
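[Figure: COVID-19 impact on Medical Charges. Left panel: Distribution of Medical Charges Before and After COVID-19. Right panel: Distribution of Medical Charges Differences.]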