The Math Behind Kernel Density Estimation
The following derivation takes inspiration from Bruce E. Hansen’s “Lecture Notes on Nonparametrics” (2009). If you are interested in learning more, you can refer to his original lecture notes here.
Suppose we wanted to estimate a probability density function, f(t), from a sample of data. A good starting place would be to estimate the cumulative distribution function, F(t), using the empirical distribution function (EDF). Let X1, …, Xn be independent, identically distributed real random variables with the common cumulative distribution function F(t). The EDF is defined as:
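F_n(t) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{ X_i \le t \}

where \mathbf{1}\{\cdot\} denotes the indicator function, equal to 1 when its argument is true and 0 otherwise.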
Then, by the strong law of large numbers, as n approaches infinity, the EDF converges almost surely to F(t). Now, the EDF is a step function that could look like the following:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
# Generate sample data
np.random.seed(14)
data = np.random.normal(loc=0, scale=1, size=40)
# Sort the data
data_sorted = np.sort(data)
# Compute ECDF values
ecdf_y = np.arange(1, len(data_sorted)+1) / len(data_sorted)
# Generate x values for the normal CDF
x = np.linspace(-4, 4, 1000)
cdf_y = norm.cdf(x)
# Create the plot
plt.figure(figsize=(6, 4))
plt.step(data_sorted, ecdf_y, where='post', color='blue', label='ECDF')
plt.plot(x, cdf_y, color='gray', label='Normal CDF')
plt.plot(data_sorted, np.zeros_like(data_sorted), '|', color='black', label='Data points')
# Label axes
plt.xlabel('X')
plt.ylabel('Cumulative Probability')
# Add grid
plt.grid(True)
# Set limits
plt.xlim([-4, 4])
plt.ylim([0, 1])
# Add legend
plt.legend()
# Show plot
plt.show()
Therefore, if we were to try to find an estimator for f(t) by taking the derivative of the EDF, we would get a scaled sum of Dirac delta functions, which is not very helpful. Instead, let us approximate the derivative using the two-point central difference formula applied to the EDF. For a small h > 0, we get:
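\hat{f}(t) = \frac{F_n(t + h) - F_n(t - h)}{2h}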
Now define the function k(u) as follows:
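k(u) = \frac{1}{2} \, \mathbf{1}\{ |u| \le 1 \}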
Then we have that:
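\hat{f}(t) = \frac{1}{nh} \sum_{i=1}^{n} k\!\left( \frac{X_i - t}{h} \right)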
This is a special case of the kernel density estimator, where k is the uniform kernel function. More generally, a kernel function is a non-negative function from the reals to the reals which satisfies:
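\int_{-\infty}^{\infty} k(u) \, du = 1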
We will assume that all kernels discussed in this article are symmetric, so that k(-u) = k(u).
The j-th moment of a kernel, which gives insight into the shape and behavior of the kernel function, is defined as:
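\kappa_j(k) = \int_{-\infty}^{\infty} u^j k(u) \, du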
Lastly, the order of a kernel, ν, is defined as the order of its first non-zero moment (taking j ≥ 1). Since a symmetric kernel has all of its odd moments equal to zero, the common kernels such as the Gaussian, Epanechnikov, and uniform kernels are second-order.
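\nu = \min\{ j \ge 1 : \kappa_j(k) \ne 0 \}

Before turning to bandwidth selection, a minimal from-scratch sketch may help make the estimator concrete. The code below implements the uniform-kernel estimator derived above; the function names and the sample values are only illustrative.

import numpy as np

def uniform_kernel(u):
    # k(u) = 1/2 for |u| <= 1, and 0 otherwise
    return 0.5 * (np.abs(u) <= 1)

def kde_estimate(t, data, h):
    # f_hat(t) = 1/(n*h) * sum_i k((X_i - t) / h)
    u = (data[:, None] - np.asarray(t)[None, :]) / h
    return uniform_kernel(u).sum(axis=0) / (len(data) * h)

# Evaluate the estimator on a small grid of points
np.random.seed(14)
sample = np.random.normal(loc=0, scale=1, size=100)
grid = np.linspace(-3, 3, 7)
print(kde_estimate(grid, sample, h=0.5))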
We can only reduce the error of the kernel density estimator by changing either the h value (the bandwidth) or the kernel function. The bandwidth has a much larger impact on the resulting estimate than the kernel function, but it is also much more difficult to choose. To demonstrate the influence of the h value, take the following two kernel density estimates. A Gaussian kernel was used to estimate a sample generated from a standard normal distribution; the only difference between the estimators is the chosen h value.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde
# Generate sample data
np.random.seed(14)
data = np.random.normal(loc=0, scale=1, size=100)
# Define the bandwidths
bandwidths = [0.1, 0.3]
# Plot the histogram and KDE for each bandwidth
plt.figure(figsize=(12, 8))
plt.hist(data, bins=30, density=True, color='gray', alpha=0.3, label='Histogram')
x = np.linspace(-5, 5, 1000)
for bw in bandwidths:
    # A scalar bw_method is used directly as gaussian_kde's bandwidth factor
    kde = gaussian_kde(data, bw_method=bw)
    plt.plot(x, kde(x), label=f'Bandwidth = {bw}')
# Add labels and title
plt.title('Impact of Bandwidth Selection on KDE')
plt.xlabel('Value')
plt.ylabel('Density')
plt.legend()
plt.show()
Quite a dramatic difference.
Now let us look at the impact of changing the kernel function while keeping the bandwidth constant.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KernelDensity
# Generate sample data
np.random.seed(14)
data = np.random.normal(loc=0, scale=1, size=100)[:, np.newaxis] # reshape for sklearn
# Initialize a constant bandwidth
bandwidth = 0.6
# Define different kernel functions
kernels = ["gaussian", "epanechnikov", "exponential", "linear"]
# Plot the histogram (transparent) and KDE for each kernel
plt.figure(figsize=(12, 8))
# Plot the histogram
plt.hist(data, bins=30, density=True, color="gray", alpha=0.3, label="Histogram")
# Plot KDE for each kernel function
x = np.linspace(-5, 5, 1000)[:, np.newaxis]
for kernel in kernels:
    kde = KernelDensity(bandwidth=bandwidth, kernel=kernel)
    kde.fit(data)
    # score_samples returns log-densities, so exponentiate to recover the density
    log_density = kde.score_samples(x)
    plt.plot(x[:, 0], np.exp(log_density), label=f"Kernel = {kernel}")
plt.title("Impact of Different Kernel Functions on KDE")
plt.xlabel("Value")
plt.ylabel("Density")
plt.legend()
plt.show()
While visually there is a large difference in the tails, the overall shape of the estimators is similar across the different kernel functions. Therefore, I will focus primarily on finding the optimal bandwidth for the estimator. Now, let’s explore some of the properties of the kernel density estimator, including its bias and variance.