About linear regression model¶

Linear regression is one of the simplest and popular statistical models. Applying it we assume linear relationship between response variable(s) $y$ and explanatory variable(s) $x$. Mathematically such relationship can be written as

$$ y_i = \beta_{0} + \beta_{1}x_{1i} + ... + \beta_{n}x_{ni} + \epsilon_{i}$$

where $ \epsilon_{i} $ is random noise and $\beta_0, \beta_1 ... \beta_n$ - unknown coefficients (model parameters) we want to predict. It's also often assumed that $ \epsilon_{i} $ is normally distributed with zero mean and unknown variance.

In case of 1 response variable $y$ and 1 explanatory variable $x$:

$$ y_i = f(x_i) = \beta_{0} + \beta_{1}x_{i} + \epsilon_{i}$$

The goal of model fitting is to find unknown model parameters using the training data $ (x_1,y_1),(x_2,y_2),...,(x_n, y_n)$ . After we find such parameters it's possible to run the model in direct way predicting $y_{n+1}$ with $x_{n+1}$

Non-bayesian way of finding such coefficients is optimization of a loss function (i.e. error minimization). In case of OLS it's all about minimizing the sum of the squared differences between the observed dependent variables $y_i$ and predicted ones $f(x_i)$. Such solutions give point estimates for unknown parameters. Sometimes using them is just fine. However more often we also need some measure of uncertainty. How confident we are about predicted values. Hopefully we can use bayesian inference to find unknown parameters in probabilistic way.

Bayesian linear regression¶

We start with the same model specification

$$ y_i = \beta_{0} + \beta_{1}x_{1i} + ... + \beta_{n}x_{ni} + \epsilon_{i}$$

That's equivalent to

$$ y_i \sim \beta_{0} + \beta_{1}x_{1i} + ... + \beta_{n}x_{ni} + \mathcal{N}(0,\sigma^2)$$

Or

$$ \mathbf{y} \sim \mathcal{N}(\mathbf{X\beta}, \sigma^2)$$

Here we treat $y$ as a random vector with the mean $\mathbf{X\beta}$ and standard deviation $\sigma$. Now it's possible to model such relationship with probabilistic programming methods.

Example¶

Let's generate some data with Python

In [159]:

import numpy as np
import pandas as pd

# Parameters we will try to find later with StatSim
b0 = 24
b1 = 2
sigma = 3
n = 50

# Generate X
x = np.linspace(0, 20, n)

# Set seed to get same random vector
np.random.seed(100)

# Generate Y
y = b0 + b1 * x + np.random.normal(scale=sigma, size=n)


# Show data
data = pd.DataFrame(data={'x':x,'y':y})
data.head(5)

Out[159]:

	x	y
0	0.000000	18.750704
1	0.408163	25.844368
2	0.816327	29.091760
3	1.224490	25.691671
4	1.632653	30.209268

...

Draw a plot:

In [160]:

data.plot(x='x', y='y', kind='scatter', title='Generated data');

Now let's save the data and load it into StatSim

In [161]:

data.to_csv('~/temp/linreg.csv', index=False)

Linear regression in StatSim¶

In [162]:

from iframer import *
iframer('https://statsim.com/app/?m=linreg-simple')

Open in statsim.com

Settings:

Chains: 3
Samples: 5000
Burn: 1000
Lag: 50

SciKitLearn¶

In [163]:

from sklearn import datasets, linear_model

linreg = linear_model.LinearRegression()
x = x.reshape(-1,1)
y = y.reshape(-1,1)
linreg.fit(x, y)
print('b0 =', float(linreg.coef_))
print('b1 =', float(linreg.intercept_))

b0 = 1.9191940054276075
b1 = 24.685657822102073

In [164]:

import matplotlib.pyplot as plt

y_true = b0 + b1 * x
y_predict = linreg.predict(x)
plt.scatter(x, y, color = 'lightgrey')
plt.plot(x, y_true, color = 'green', label='Target')
plt.plot(x, y_predict, color = 'blue', label='Predicted')
plt.legend();

Linear Regression model in PyMC3¶

In [165]:

import pymc3 as pm

with pm.Model() as model:
    b0_pred = pm.Uniform('b0_pred', -100, 100)
    b1_pred = pm.Uniform('b1_pred', -100, 100)
    sigma_pred = pm.Uniform('sigma_pred', 0.01, 10)
    y_obs = pm.Normal('y_obs', mu = b0_pred + b1_pred * x, sd = sigma_pred, observed = y)
    trace = pm.sample(2000, cores=2)

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [sigma_pred, b1_pred, b0_pred]
Sampling 2 chains: 100%|██████████| 5000/5000 [00:06<00:00, 730.65draws/s] 
The acceptance probability does not match the target. It is 0.891548318682092, but should be close to 0.8. Try to increase the number of tuning steps.
The acceptance probability does not match the target. It is 0.9164934750561446, but should be close to 0.8. Try to increase the number of tuning steps.

In [166]:

pm.traceplot(trace, lines={'b0_pred': b0, 'b1_pred': b1, 'sigma_pred': sigma});

Linear Regression model in TensorFlow Probability¶

Links¶

In [ ]: