Assuming non-Gaussian noise and existing outliers, find the linear relationship between the explanatory (independent) and response (dependent) variables, then predict future values.
Example¶
Let's generate some data with predefined parameters and noise, then add outliers and compare both simple and robust linear models.
In [22]:
import numpy as np
import matplotlib.pyplot as plt
In [44]:
# Ground-truth parameters of the linear model we will try to recover
b0 = 24       # intercept
b1 = 2        # slope
sigma = 3     # noise standard deviation

# Sample size and evenly spaced explanatory variable
n = 50
x = np.linspace(0, 20, n)

# Fix the RNG so the noise (and therefore the plots) is reproducible
np.random.seed(100)

# Response: linear signal plus Gaussian noise
noise = np.random.normal(scale=sigma, size=n)
y = b0 + b1 * x + noise

# Inject a single gross outlier
y[7] = 500

# Visualise the raw data
plt.scatter(x, y, color='blue')
plt.show()
First, let's try an ordinary (non-robust) linear regression.
In [45]:
from sklearn import linear_model

# Fit an ordinary least-squares model to the outlier-contaminated data.
# sklearn expects 2-D feature matrices, hence the reshape.
linreg = linear_model.LinearRegression()
xr = x.reshape(-1, 1)
yr = y.reshape(-1, 1)
linreg.fit(xr, yr)

# Noise-free target line vs. what the fitted model predicts
y_true = b0 + b1 * x
y_predict = linreg.predict(xr)

plt.scatter(x, y, color='lightgrey')
plt.plot(x, y_true, color='green', label='Target')
plt.plot(x, y_predict, color='blue', label='Predicted')
plt.legend();
You can see that a single outlier in the dataset is enough to fool the regression algorithm. Let's build a simple Bayesian linear regression model instead.
In [50]:
# Embed an interactive StatSim model of the same data: Bayesian linear
# regression with Uniform priors on b0, b1 and sigma, a Gaussian likelihood,
# and HMC sampling. The URL encodes the x/y vectors generated above.
# NOTE(review): the wildcard import pulls every name from `iframer` into the
# notebook namespace — `from iframer import iframer` would be cleaner.
from iframer import *
iframer('https://statsim.com/app/?a=%5B%7B%22b%22%3A%5B%7B%22n%22%3A%22x%22%2C%22sh%22%3Afalse%2C%22t%22%3A2%2C%22u%22%3Afalse%2C%22dims%22%3A%22%22%2C%22v%22%3A%220%2C0.40816327%2C0.81632653%2C1.2244898%2C1.63265306%2C2.04081633%2C2.44897959%2C2.85714286%2C3.26530612%2C3.67346939%2C4.08163265%2C4.48979592%2C4.89795918%2C5.30612245%2C5.71428571%2C6.12244898%2C6.53061224%2C6.93877551%2C7.34693878%2C7.75510204%2C8.16326531%2C8.57142857%2C8.97959184%2C9.3877551%2C9.79591837%2C10.20408163%2C10.6122449%2C11.02040816%2C11.42857143%2C11.83673469%2C12.24489796%2C12.65306122%2C13.06122449%2C13.46938776%2C13.87755102%2C14.28571429%2C14.69387755%2C15.10204082%2C15.51020408%2C15.91836735%2C16.32653061%2C16.73469388%2C17.14285714%2C17.55102041%2C17.95918367%2C18.36734694%2C18.7755102%2C19.18367347%2C19.59183673%2C20%22%7D%2C%7B%22n%22%3A%22y%22%2C%22sh%22%3Afalse%2C%22t%22%3A2%2C%22u%22%3Afalse%2C%22dims%22%3A%22%22%2C%22v%22%3A%2218.75070358%2C25.84436774%2C29.09176047%2C25.69167148%2C30.20926848%2C29.62428918%2C29.56149819%2C500%2C29.96212475%2C32.11194311%2C30.78918435%2C34.2850823%2C32.04513322%2C37.06278611%2C37.44673385%2C35.93166453%2C35.46738336%2C40.96674908%2C37.37947068%2C36.15524934%2C45.18347559%2C45.76767267%2C41.20354626%2C40.24820299%2C44.14539281%2C47.21940987%2C47.41749083%2C50.1254847%2C45.87842868%2C47.84049743%2C49.15699474%2C44.97647146%2C47.85339206%2C53.38813754%2C54.00643633%2C51.20358779%2C56.95662191%2C49.13223115%2C50.95121102%2C52.13943115%2C55.01974374%2C55.46487254%2C58.30765798%2C57.26322461%2C63.81761157%2C55.53540701%2C58.60109011%2C63.4398702%2C58.34293796%2C68.4121416%22%7D%2C%7B%22d%22%3A%22Uniform%22%2C%22n%22%3A%22b0%22%2C%22o%22%3Afalse%2C%22p%22%3A%7B%22a%22%3A%22-100%22%2C%22b%22%3A%22100%22%7D%2C%22sh%22%3Atrue%2C%22t%22%3A0%2C%22dims%22%3A%221%22%7D%2C%7B%22d%22%3A%22Uniform%22%2C%22n%22%3A%22b1%22%2C%22o%22%3Afalse%2C%22p%22%3A%7B%22a%22%3A%22-50%22%2C%22b%22%3A%2250%22%7D%2C%22sh%22%3Atrue%2C%22t%22%3A0%2C%22dims%22%3A%221%22%7D%2C%7
B%22d%22%3A%22Uniform%22%2C%22n%22%3A%22sigma%22%2C%22o%22%3Afalse%2C%22p%22%3A%7B%22a%22%3A%220.01%22%2C%22b%22%3A%2210%22%7D%2C%22sh%22%3Atrue%2C%22t%22%3A0%2C%22dims%22%3A%221%22%7D%2C%7B%22d%22%3A%22Gaussian%22%2C%22p%22%3A%7B%22mu%22%3A%22b0%20%2B%20b1%20%2A%20x%22%2C%22sigma%22%3A%22sigma%22%7D%2C%22t%22%3A4%2C%22v%22%3A%22y%22%7D%5D%2C%22mod%22%3A%7B%22n%22%3A%22LinReg%22%2C%22e%22%3A%22%22%2C%22s%22%3A1%2C%22m%22%3A%22HMC%22%7D%2C%22met%22%3A%7B%22sm%22%3A%225000%22%2C%22l%22%3A%22%22%2C%22b%22%3A%22500%22%2C%22chains%22%3A%223%22%2C%22s%22%3A%227%22%2C%22stepSize%22%3A%220.3%22%7D%7D%5D')
In [51]:
import pymc3 as pm
In [52]:
# PyMC3 version of the StatSim model above: Uniform priors on the
# intercept, slope and noise scale, Gaussian likelihood for y.
# NOTE(review): the RV handles below shadow the global b0/b1/sigma constants
# from the data-generation cell; they are not needed after this cell, but the
# true values are no longer reachable once it runs.
with pm.Model() as model:
    b0 = pm.Uniform('b0', lower=-100, upper=100)
    b1 = pm.Uniform('b1', lower=-50, upper=50)
    # Upper bound 10 matches the generated StatSim model (was 100 here,
    # inconsistent with the "Generated PyMC3 code" URL above).
    sigma = pm.Uniform('sigma', lower=0.01, upper=10)
    observer = pm.Normal('observer', mu=b0 + b1 * x, sd=sigma, observed=y)
    # Metropolis needs many draws to mix; 100k draws, 500 tuning steps.
    result = pm.sample(model=model, step=pm.Metropolis(), draws=5000 * 20, tune=500)

pm.traceplot(result, varnames=['b0', 'b1', 'sigma']);
In [53]:
# Display the raw trace object returned by pm.sample.
# NOTE(review): pm.summary(result) would give a more informative table
# (posterior means, credible intervals) than the bare repr.
result
Out[53]:
In [55]:
# Plot posterior-predictive regression lines sampled from the trace.
# The default `lm` of plot_posterior_predictive_glm reads trace variables
# named 'Intercept' and 'x'; this trace uses 'b0'/'b1', so the bare call
# would raise KeyError — pass an explicit linear model and evaluation grid.
plt.scatter(x, y, color='lightgrey')
pm.plot_posterior_predictive_glm(
    result,
    eval=np.linspace(0, 20, 100),
    lm=lambda x, sample: sample['b0'] + sample['b1'] * x,
)
In [ ]: