Find linear relationship between variables in Bayesian way (using probability to measure uncertainty)
About linear regression model¶
Linear regression is one of the simplest and popular statistical models. Applying it we assume linear relationship between response variable(s) $y$ and explanatory variable(s) $x$. Mathematically such relationship can be written as
$$ y_i = \beta_{0} + \beta_{1}x_{1i} + ... + \beta_{n}x_{ni} + \epsilon_{i}$$where $ \epsilon_{i} $ is random noise and $\beta_0, \beta_1 ... \beta_n$ - unknown coefficients (model parameters) we want to predict. It's also often assumed that $ \epsilon_{i} $ is normally distributed with zero mean and unknown variance.
In case of 1 response variable $y$ and 1 explanatory variable $x$:
$$ y_i = f(x_i) = \beta_{0} + \beta_{1}x_{i} + \epsilon_{i}$$The goal of model fitting is to find unknown model parameters using the training data $ (x_1,y_1),(x_2,y_2),...,(x_n, y_n)$ . After we find such parameters it's possible to run the model in direct way predicting $y_{n+1}$ with $x_{n+1}$
Non-bayesian way of finding such coefficients is optimization of a loss function (i.e. error minimization). In case of OLS it's all about minimizing the sum of the squared differences between the observed dependent variables $y_i$ and predicted ones $f(x_i)$. Such solutions give point estimates for unknown parameters. Sometimes using them is just fine. However more often we also need some measure of uncertainty. How confident we are about predicted values. Hopefully we can use bayesian inference to find unknown parameters in probabilistic way.
Bayesian linear regression¶
We start with the same model specification
$$ y_i = \beta_{0} + \beta_{1}x_{1i} + ... + \beta_{n}x_{ni} + \epsilon_{i}$$That's equivalent to
$$ y_i \sim \beta_{0} + \beta_{1}x_{1i} + ... + \beta_{n}x_{ni} + \mathcal{N}(0,\sigma^2)$$Or
$$ \mathbf{y} \sim \mathcal{N}(\mathbf{X\beta}, \sigma^2)$$Here we treat $y$ as a random vector with the mean $\mathbf{X\beta}$ and standard deviation $\sigma$. Now it's possible to model such relationship with probabilistic programming methods.
Example¶
Let's generate some data with Python
import numpy as np
import pandas as pd
# Parameters we will try to find later with StatSim
b0 = 24
b1 = 2
sigma = 3
n = 50
# Generate X
x = np.linspace(0, 20, n)
# Set seed to get same random vector
np.random.seed(100)
# Generate Y
y = b0 + b1 * x + np.random.normal(scale=sigma, size=n)
# Show data
data = pd.DataFrame(data={'x':x,'y':y})
data.head(5)
...
Draw a plot:
data.plot(x='x', y='y', kind='scatter', title='Generated data');
Now let's save the data and load it into StatSim
data.to_csv('~/temp/linreg.csv', index=False)
Linear regression in StatSim¶
from iframer import *
iframer('https://statsim.com/app/?m=linreg-simple')
Settings:
- Chains: 3
- Samples: 5000
- Burn: 1000
- Lag: 50
SciKitLearn¶
from sklearn import datasets, linear_model
linreg = linear_model.LinearRegression()
x = x.reshape(-1,1)
y = y.reshape(-1,1)
linreg.fit(x, y)
print('b0 =', float(linreg.coef_))
print('b1 =', float(linreg.intercept_))
import matplotlib.pyplot as plt
y_true = b0 + b1 * x
y_predict = linreg.predict(x)
plt.scatter(x, y, color = 'lightgrey')
plt.plot(x, y_true, color = 'green', label='Target')
plt.plot(x, y_predict, color = 'blue', label='Predicted')
plt.legend();
Linear Regression model in PyMC3¶
import pymc3 as pm
with pm.Model() as model:
b0_pred = pm.Uniform('b0_pred', -100, 100)
b1_pred = pm.Uniform('b1_pred', -100, 100)
sigma_pred = pm.Uniform('sigma_pred', 0.01, 10)
y_obs = pm.Normal('y_obs', mu = b0_pred + b1_pred * x, sd = sigma_pred, observed = y)
trace = pm.sample(2000, cores=2)
pm.traceplot(trace, lines={'b0_pred': b0, 'b1_pred': b1, 'sigma_pred': sigma});
Linear Regression model in TensorFlow Probability¶