Assuming non-Gaussian noise and the presence of outliers, find the linear relationship between the explanatory (independent) and response (dependent) variables, and predict future values.
Let's generate some data with predefined parameters and noise, then add an outlier and compare a simple and a robust linear model.
In [22]:
import numpy as np
import matplotlib.pyplot as plt
In [44]:
# Parameters we will try to find later
b0 = 24
b1 = 2
sigma = 3
# Number of samples to generate
n = 50
# Generate X
x = np.linspace(0, 20, n)
# Set seed to get same random vector
np.random.seed(0)  # any fixed integer works; 0 is an arbitrary choice
# Generate Y
y = b0 + b1 * x + np.random.normal(scale=sigma, size=n)
# Add a single outlier
y[7] = 500
# Show data
plt.scatter(x, y, color='blue')
First, let's check a simple linear regression.
In [45]:
from sklearn import linear_model
linreg = linear_model.LinearRegression()
xr = x.reshape(-1,1)
yr = y.reshape(-1,1)
# Fit ordinary least squares on the data, outlier included
linreg.fit(xr, yr)
y_true = b0 + b1 * x
y_predict = linreg.predict(xr)
plt.scatter(x, y, color = 'lightgrey')
plt.plot(x, y_true, color = 'green', label='Target')
plt.plot(x, y_predict, color = 'blue', label='Predicted')
# Show the 'Target' and 'Predicted' labels
plt.legend()
You can see that a single outlier in the dataset is enough to fool the regression algorithm. Let's use a simple Bayesian linear regression model.
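To put a number on how far one outlier moves the fit, here is a minimal sketch that fits the same kind of data with and without the outlier. It uses `np.polyfit` rather than scikit-learn to stay self-contained; the seed and the names `b0_true`, `b1_true`, `sigma_true`, `y_clean` and `y_outlier` are assumptions made for this illustration.

```python
import numpy as np

# Assumed seed, chosen only to make the sketch repeatable
rng = np.random.default_rng(0)

# Same generating parameters as in the data cell above
b0_true, b1_true, sigma_true = 24, 2, 3
n = 50
x = np.linspace(0, 20, n)

# Clean data, and a copy with the single outlier at index 7
y_clean = b0_true + b1_true * x + rng.normal(scale=sigma_true, size=n)
y_outlier = y_clean.copy()
y_outlier[7] = 500

# Least-squares line fits; polyfit returns [slope, intercept] for deg=1
slope_clean, intercept_clean = np.polyfit(x, y_clean, deg=1)
slope_out, intercept_out = np.polyfit(x, y_outlier, deg=1)

print("without outlier: intercept=%.2f slope=%.2f" % (intercept_clean, slope_clean))
print("with outlier:    intercept=%.2f slope=%.2f" % (intercept_out, slope_out))
```

The outlier sits to the left of the mean of `x`, so it should raise the intercept and flatten the slope, which matches the blue line in the plot above.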
In [50]:
from iframer import *
In [51]:
import pymc3 as pm
In [52]:
# Generated PyMC3 code
with pm.Model() as model:
    b0 = pm.Uniform('b0', lower=-100, upper=100)
    b1 = pm.Uniform('b1', lower=-50, upper=50)
    sigma = pm.Uniform('sigma', lower=0.01, upper=100)
    observer = pm.Normal('observer', mu=b0 + b1 * x, sd=sigma, observed=y)
    result = pm.sample(model=model, step=pm.Metropolis(), draws=5000*20, tune=500)
pm.traceplot(result, varnames=['b0','b1','sigma']);
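For intuition about what a Metropolis step does, here is a hand-rolled random-walk Metropolis sampler for the same model (uniform priors, Normal likelihood), written in plain NumPy. It is only an illustrative sketch, not PyMC3's implementation; the functions `log_posterior` and `metropolis`, the seed, the starting point and the proposal step sizes are all assumptions chosen for this example.

```python
import numpy as np

def log_posterior(theta, x, y):
    """Unnormalised log posterior for (b0, b1, sigma)."""
    b0, b1, sigma = theta
    # Uniform priors with the same bounds as the model above:
    # zero density (log = -inf) outside the bounds
    if not (-100 < b0 < 100 and -50 < b1 < 50 and 0.01 < sigma < 100):
        return -np.inf
    resid = y - (b0 + b1 * x)
    # Normal log-likelihood, dropping the additive constant
    return -len(y) * np.log(sigma) - 0.5 * np.sum(resid ** 2) / sigma ** 2

def metropolis(x, y, start, step_sizes, draws, rng):
    """Random-walk Metropolis with independent Gaussian proposals."""
    theta = np.asarray(start, dtype=float)
    logp = log_posterior(theta, x, y)
    samples = np.empty((draws, theta.size))
    for i in range(draws):
        proposal = theta + rng.normal(scale=step_sizes)
        logp_new = log_posterior(proposal, x, y)
        # Accept with probability min(1, p_new / p_old)
        if np.log(rng.uniform()) < logp_new - logp:
            theta, logp = proposal, logp_new
        samples[i] = theta
    return samples

# Assumed seed; data generated as in the cells above
rng = np.random.default_rng(0)
x = np.linspace(0, 20, 50)
y = 24 + 2 * x + rng.normal(scale=3, size=50)
y[7] = 500

# Hypothetical starting point and step sizes, picked by hand
samples = metropolis(x, y, start=[0.0, 0.0, 50.0],
                     step_sizes=[8.0, 0.7, 5.0], draws=20000, rng=rng)

# Discard the first 5000 draws as burn-in, then summarise
posterior_mean = samples[5000:].mean(axis=0)
print("posterior means (b0, b1, sigma):", posterior_mean)
```

Because the likelihood here is Normal, the single outlier should push the sampled `sigma` far above the generating value of 3.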