From: https://github.com/ksatola
Version: 0.1.0

Model - PM2.5 - Time Series Correlations

In [1]:
%load_ext autoreload
In [2]:
%autoreload 2
In [3]:
import sys
sys.path.insert(0, '../src')
In [4]:
import warnings
warnings.filterwarnings('ignore')
warnings.simplefilter('ignore')
In [5]:
import pandas as pd 

import matplotlib.pyplot as plt
%matplotlib inline
In [6]:
from model import (
    get_pm25_data_for_modelling,
)

from stats import (
    adfuller_test
)

from plot import (
    plot_stl,
    plot_ts_corr
)

Load hourly data

In [ ]:
dfh = get_pm25_data_for_modelling('ts', 'h')
dfh.head()
In [ ]:
#dfh.tail(24)
In [ ]:
train_range_from_h = '2008-01-01 02:00:00'
train_range_to_h = '2018-12-30 23:00:00'
test_range_from_h = '2018-12-31 01:00:00'
test_range_to_h = None

Load daily data

In [7]:
dfd = get_pm25_data_for_modelling('ts', 'd')
dfd.head()
common.py | 42 | get_pm25_data_for_modelling | 02-Jun-20 23:09:35 | INFO: Dataframe loaded: /Users/ksatola/Documents/git/air-polution/agh/data/dfpm25_2008-2018_daily.hdf
common.py | 43 | get_pm25_data_for_modelling | 02-Jun-20 23:09:35 | INFO: Dataframe size: (4019, 1)
Out[7]:
pm25
Datetime
2008-01-01 53.586957
2008-01-02 30.958333
2008-01-03 46.104167
2008-01-04 42.979167
2008-01-05 57.312500
In [8]:
#dfd.tail(7)
In [9]:
train_range_from_d = '2008-01-01'
train_range_to_d = '2018-12-25'
test_range_from_d = '2018-12-26'
test_range_to_d = None

Decompose time series

In [10]:
df = dfd.copy()
series = df['pm25'].resample(rule='D').mean()
series.index.freq
Out[10]:
<Day>
In [11]:
# STL for Daily Data
result = plot_stl(data=dfd['pm25'], period=365, low_pass=367)
In [12]:
series = result.resid
In [13]:
adfuller_test(series)
Test statistic: -12.04972808762017
P-value: 2.614368859594535e-22 -> 0.0000000000000000
Critical values: {'1%': -3.431989155785145, '5%': -2.8622642893621704, '10%': -2.5671555470083454}
The time series does not have a unit root, so it is stationary (reject H0)

Correlation for Time Series Modelling

Time series modeling assumes a relationship between an observation and the previous observation. Previous observations in a time series are called lags, with the observation at the previous time step called lag=1, the observation at two time steps ago lag=2, and so on. A useful type of plot to explore the relationship between each observation and a lag of that observation is called the scatter plot. Pandas has a built-in function for exactly this called the lag plot. It plots the observation at time t on the x-axis and the observation at the next time step (t+1) on the y-axis.

  • If the points cluster along a diagonal line from the bottom-left to the top-right of the plot, it suggests a positive correlation relationship.
  • If the points cluster along a diagonal line from the top-left to the bottom-right, it suggests a negative correlation relationship.
  • Either relationship is good as they can be modeled.

More points tighter in to the diagonal line suggests a stronger relationship and more spread from the line suggests a weaker relationship. A ball in the middle or a spread across the plot suggests a weak or no relationship.

In [14]:
# Create a Lag scatter plot
from pandas.plotting import lag_plot

#series = result.observed
series = result.resid
plt.figure(figsize=(20, 20))
lag_plot(series, lag=1)
plt.show();

The plot created from running the example shows a relatively strong positive correlation between observations and their lag1 values.

In [15]:
series = result.resid
plt.figure(figsize=(20, 20))
lag_plot(series, lag=365)
plt.show();

The plot created from running the example shows a relatively strong negative correlation between observations and their lag365 values.

In [16]:
# Create multiple lag scatter plots
series = result.resid
values = pd.DataFrame(series.values)
lags = 8
columns = [values]

for i in range(1, (lags + 1)):
    columns.append(values.shift(i))

dataframe = pd.concat(columns, axis=1) 
columns = ['t']

for i in range(1, (lags + 1)):
    columns.append('t-' + str(i)) 

dataframe.columns = columns 
plt.figure(1)

dataframe.head()
Out[16]:
t t-1 t-2 t-3 t-4 t-5 t-6 t-7 t-8
0 -68.814036 NaN NaN NaN NaN NaN NaN NaN NaN
1 -11.795487 -68.814036 NaN NaN NaN NaN NaN NaN NaN
2 1.764438 -11.795487 -68.814036 NaN NaN NaN NaN NaN NaN
3 3.714591 1.764438 -11.795487 -68.814036 NaN NaN NaN NaN NaN
4 1.840989 3.714591 1.764438 -11.795487 -68.814036 NaN NaN NaN NaN
<Figure size 432x288 with 0 Axes>
In [17]:
for i in range(1, (lags + 1)):
    # Define the plot matrix 2x4
    plt.figure(figsize=(40, 20))
    ax = plt.subplot(240 + i)
    ax.set_title('t vs t-' + str(i))
    plt.scatter(x=dataframe['t'].values, y=dataframe['t-'+str(i)].values)
plt.show();
In [18]:
# Create multiple lag scatter plots (around 365)
series = result.resid
values = pd.DataFrame(series.values)
start = 362
lags = 8
columns = [values]

for i in range(start, (lags + start)):
    columns.append(values.shift(i))

dataframe = pd.concat(columns, axis=1) 
columns = ['t']

for i in range(start, (lags + start)):
    columns.append('t-' + str(i)) 

dataframe.columns = columns 

dataframe.tail()
Out[18]:
t t-362 t-363 t-364 t-365 t-366 t-367 t-368 t-369
4014 -2.545600 -28.741087 -3.362139 6.618422 11.940012 35.775341 49.543466 6.171839 -7.935862
4015 -1.032167 -3.210180 -28.741087 -3.362139 6.618422 11.940012 35.775341 49.543466 6.171839
4016 -8.194337 8.326026 -3.210180 -28.741087 -3.362139 6.618422 11.940012 35.775341 49.543466
4017 -8.209242 -0.430453 8.326026 -3.210180 -28.741087 -3.362139 6.618422 11.940012 35.775341
4018 -19.869173 -6.424660 -0.430453 8.326026 -3.210180 -28.741087 -3.362139 6.618422 11.940012
In [19]:
plt.figure(1)
for i in range(1, (lags)):
    # Define the plot matrix 2x4
    plt.figure(figsize=(40, 20))
    ax = plt.subplot(240 + i)
    ax.set_title('t vs t-' + str(i+start))
    plt.scatter(x=dataframe['t'].values, y=dataframe['t-'+str(i+start)].values)
plt.show();
<Figure size 432x288 with 0 Axes>

Autocorrelation (ACF)

We can quantify the strength and type of relationship between observations and their lags. In statistics, this is called correlation, and when calculated against lag values in time series, it is called autocorrelation (self-correlation). A correlation value calculated between two groups of numbers, such as observations and their lag=1 values, results in a number between -1 and 1. The sign of this number indicates a negative or positive correlation respectively. A value close to zero suggests a weak correlation, whereas a value closer to -1 or 1 indicates a strong correlation. Correlation values, called correlation coefficients, can be calculated for each observation and different lag values.

Once calculated, a plot can be created to help better understand how this relationship changes over the lag. This type of plot is called an autocorrelation plot and Pandas provides this capability built in, called the autocorrelation_plot() function. The plot shows lag along the x-axis and the correlation on the y-axis. Dotted lines are provided that indicate any correlation values above those lines are statistically significant (meaningful).

Identification of an MA model is often best done with the ACF rather than the PACF. For an MA model, the theoretical PACF does not shut off, but instead tapers toward 0 in some manner. A clearer pattern for an MA model is in the ACF. The ACF will have non-zero autocorrelations only at lags involved in the model.

More: https://online.stat.psu.edu/stat510/book/export/html/662

In [20]:
from pandas.plotting import autocorrelation_plot

plt.figure(figsize=(40, 20))
autocorrelation_plot(series)
Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x125880350>
In [21]:
from statsmodels.graphics import tsaplots

fig = tsaplots.plot_acf(series, lags=40)
plt.show();

# MA
#q=0,1,2,3
In [22]:
fig = tsaplots.plot_acf(series, lags=366)
plt.show();
In [23]:
fig = tsaplots.plot_acf(series, lags=366*8)
plt.show();

There are 3 stronger autocorrelated lags at lag1, lag2 and lag3.


Partial autocorrelation (PACF)

An autocorrelation (ACF) plot represents the autocorrelation of the series with lags of itself. A partial autocorrelation (PACF) plot represents the amount of correlation between a series and a lag of itself that is not explained by correlations at all lower-order lags. Ideally, we want no correlation between the series and lags of itself. Graphically speaking, we would like all the spikes to fall in the blue region.

Identification of an AR model is often best done with the PACF. For an AR model, the theoretical PACF “shuts off” past the order of the model. The phrase “shuts off” means that in theory the partial autocorrelations are equal to 0 beyond that point. Put another way, the number of non-zero partial autocorrelations gives the order of the AR model. By the “order of the model” we mean the most extreme lag of x that is used as a predictor.

More: https://online.stat.psu.edu/stat510/book/export/html/662

In [24]:
fig = tsaplots.plot_pacf(series, lags=40)
plt.show();

# RA
#p=0,1,2
In [25]:
fig = tsaplots.plot_pacf(series, lags=366)
plt.show();
In [26]:
plot_ts_corr(series)
In [ ]: