From: https://github.com/ksatola
Version: 0.1.0

Model - Prepare Analytical Views For Modelling

We will prepare the data for:

  • statistical modelling (without further transformations),
  • machine learning analysis (by creating lagged variables).

The data will be adjusted for the following kinds of forecasts:

  • Next 24 hours (hourly data),
  • Next 7 days (daily data).

We will then save these new dataframes.

In [1]:
%load_ext autoreload
In [2]:
%autoreload 2
In [3]:
import sys
sys.path.insert(0, '../src')
In [4]:
import pandas as pd 
import numpy as np
In [24]:
from model import (
    load_data,
    calculate_season,
    build_datetime_features,
    get_df_for_lags_columns
)

from plot import (
    plot_ts_corr,
    plot_stl
)

from stats import (
    adfuller_test
)
In [6]:
pd.set_option('display.max_rows', 999)
pd.set_option('display.max_columns', 999)
pd.set_option('display.precision', 5)
In [7]:
data_path = 'data/'
data_file = data_path + 'dfpm2008_2018.csv'

Load PM2.5 and PM10 Analytical View From a CSV File

In [8]:
df = load_data(data_file)
common.py | 14 | load_data | 08-Jun-20 18:10:10 | INFO: Dataframe loaded: data/dfpm2008_2018.csv
common.py | 15 | load_data | 08-Jun-20 18:10:10 | INFO: DataFrame size: (96388, 2)

Build PM2.5 HDF Analytical View Representation File

In [9]:
df.index
Out[9]:
Index(['2008-01-01 01:00:00', '2008-01-01 02:00:00', '2008-01-01 03:00:00',
       '2008-01-01 04:00:00', '2008-01-01 05:00:00', '2008-01-01 06:00:00',
       '2008-01-01 07:00:00', '2008-01-01 08:00:00', '2008-01-01 09:00:00',
       '2008-01-01 10:00:00',
       ...
       '2018-12-31 15:00:00.000', '2018-12-31 16:00:00.000',
       '2018-12-31 17:00:00.000', '2018-12-31 18:00:00.000',
       '2018-12-31 19:00:00.000', '2018-12-31 20:00:00.000',
       '2018-12-31 21:00:00.000', '2018-12-31 22:00:00.000',
       '2018-12-31 23:00:00.000', '2019-01-01 00:00:00.000'],
      dtype='object', name='Datetime', length=96388)
In [10]:
# Convert the string index to a proper DatetimeIndex
df.index = pd.to_datetime(df.index)
In [11]:
# https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases
# 'T' is the offset alias for minute frequency; this view only illustrates asfreq
df.asfreq('T').index
Out[11]:
DatetimeIndex(['2008-01-01 01:00:00', '2008-01-01 01:01:00',
               '2008-01-01 01:02:00', '2008-01-01 01:03:00',
               '2008-01-01 01:04:00', '2008-01-01 01:05:00',
               '2008-01-01 01:06:00', '2008-01-01 01:07:00',
               '2008-01-01 01:08:00', '2008-01-01 01:09:00',
               ...
               '2018-12-31 23:51:00', '2018-12-31 23:52:00',
               '2018-12-31 23:53:00', '2018-12-31 23:54:00',
               '2018-12-31 23:55:00', '2018-12-31 23:56:00',
               '2018-12-31 23:57:00', '2018-12-31 23:58:00',
               '2018-12-31 23:59:00', '2019-01-01 00:00:00'],
              dtype='datetime64[ns]', name='Datetime', length=5785861, freq='T')
In [12]:
df.head()
Out[12]:
pm10 pm25
Datetime
2008-01-01 01:00:00 109.5 92.0
2008-01-01 02:00:00 96.0 81.0
2008-01-01 03:00:00 86.5 73.0
2008-01-01 04:00:00 71.5 60.5
2008-01-01 05:00:00 72.0 61.0
In [13]:
# We will only need PM2.5 for modelling
df.drop(columns=['pm10'], inplace=True)
df.head()
Out[13]:
pm25
Datetime
2008-01-01 01:00:00 92.0
2008-01-01 02:00:00 81.0
2008-01-01 03:00:00 73.0
2008-01-01 04:00:00 60.5
2008-01-01 05:00:00 61.0

Dataset - Hourly Frequency

In [14]:
data_file_hdf = data_path + 'dfpm25_2008-2018_hourly.hdf'
df.to_hdf(data_file_hdf, key='df', mode='w')

Test Read

In [15]:
df = pd.read_hdf(path_or_buf=data_file_hdf, key="df")
print(f'Dataframe size: {df.shape}')
df.head()
Dataframe size: (96388, 1)
Out[15]:
pm25
Datetime
2008-01-01 01:00:00 92.0
2008-01-01 02:00:00 81.0
2008-01-01 03:00:00 73.0
2008-01-01 04:00:00 60.5
2008-01-01 05:00:00 61.0

Dataset - Daily Frequency

In [16]:
# Resample data to daily using mean of values
df_daily = df.resample(rule='D').mean() # daily
df_daily.head()
Out[16]:
pm25
Datetime
2008-01-01 53.58696
2008-01-02 30.95833
2008-01-03 46.10417
2008-01-04 42.97917
2008-01-05 57.31250
In [17]:
data_file_hdf = data_path + 'dfpm25_2008-2018_daily.hdf'
df_daily.to_hdf(data_file_hdf, key='df', mode='w')

Test Read

In [18]:
df = pd.read_hdf(path_or_buf=data_file_hdf, key="df")
print(f'Dataframe size: {df.shape}')
df.head()
Dataframe size: (4019, 1)
Out[18]:
pm25
Datetime
2008-01-01 53.58696
2008-01-02 30.95833
2008-01-03 46.10417
2008-01-04 42.97917
2008-01-05 57.31250

Feature engineering is the process of transforming raw data into features that better represent the underlying problem to predictive models, improving model accuracy on unseen data. For machine learning models, instead of relying on the datetime index directly, we will derive additional time-based features from it and include them in the analytical view. In addition, we will perform time series decomposition (removing trend and seasonality) and check the ACF to select the most correlated lags as features. In the final prediction mechanism, predictions made on the stationary component can be transformed back to the original scale by adding back the previously removed trend and seasonality components.
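As a rough illustration of this decomposition-and-lag-selection idea (the notebook's plot_stl and adfuller_test helpers presumably wrap similar statsmodels calls; this sketch is not the project's API):

import pandas as pd
from statsmodels.tsa.seasonal import STL
from statsmodels.tsa.stattools import acf

# Decompose the daily series into trend + seasonal + residual;
# period=365 assumes a yearly cycle in daily data.
series = df_daily['pm25'].dropna()
stl = STL(series, period=365).fit()
residual = stl.resid  # the roughly stationary component

# Lags with high autocorrelation in the residual are feature candidates.
autocorr = acf(residual, nlags=10)
candidate_lags = [lag for lag, r in enumerate(autocorr) if lag > 0 and r > 0.4]

# Predictions made on the residual map back to the original scale as:
# y_hat = y_hat_residual + stl.trend + stl.seasonal (aligned by timestamp)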


Machine Learning Models: Build PM2.5 HDF Analytical View Representation File

In [19]:
df = load_data(data_file)
common.py | 14 | load_data | 08-Jun-20 18:12:23 | INFO: Dataframe loaded: data/dfpm2008_2018.csv
common.py | 15 | load_data | 08-Jun-20 18:12:23 | INFO: DataFrame size: (96388, 2)
In [20]:
# We will only need PM2.5 for ML modelling
df.drop(columns=['pm10'], inplace=True)
df.head()
Out[20]:
pm25
Datetime
2008-01-01 01:00:00 92.0
2008-01-01 02:00:00 81.0
2008-01-01 03:00:00 73.0
2008-01-01 04:00:00 60.5
2008-01-01 05:00:00 61.0

Create Lagged Variables - Hourly

In [28]:
# Create 10 hourly lag values to predict the current observation (correlation coefficients > 0.5)
df24h = get_df_for_lags_columns(data=df, col_name='pm25', n_lags=10, remove_nans=True)
df24h
Out[28]:
t t-1 t-2 t-3 t-4 t-5 t-6 t-7 t-8 t-9 t-10
Datetime
2008-01-01 11:00:00 30.5000 41.5000 62.5000 70.5000 69.5000 67.0000 61.0000 60.5000 73.00000 81.00000 92.00000
2008-01-01 12:00:00 26.0000 30.5000 41.5000 62.5000 70.5000 69.5000 67.0000 61.0000 60.50000 73.00000 81.00000
2008-01-01 13:00:00 28.5000 26.0000 30.5000 41.5000 62.5000 70.5000 69.5000 67.0000 61.00000 60.50000 73.00000
2008-01-01 14:00:00 28.0000 28.5000 26.0000 30.5000 41.5000 62.5000 70.5000 69.5000 67.00000 61.00000 60.50000
2008-01-01 15:00:00 32.5000 28.0000 28.5000 26.0000 30.5000 41.5000 62.5000 70.5000 69.50000 67.00000 61.00000
... ... ... ... ... ... ... ... ... ... ... ...
2018-12-31 20:00:00.000 23.6990 19.3182 18.8965 19.4075 9.7000 9.9000 11.8866 7.0000 9.01751 6.94473 7.02653
2018-12-31 21:00:00.000 32.6308 23.6990 19.3182 18.8965 19.4075 9.7000 9.9000 11.8866 7.00000 9.01751 6.94473
2018-12-31 22:00:00.000 31.0801 32.6308 23.6990 19.3182 18.8965 19.4075 9.7000 9.9000 11.88660 7.00000 9.01751
2018-12-31 23:00:00.000 33.2028 31.0801 32.6308 23.6990 19.3182 18.8965 19.4075 9.7000 9.90000 11.88660 7.00000
2019-01-01 00:00:00.000 36.6074 33.2028 31.0801 32.6308 23.6990 19.3182 18.8965 19.4075 9.70000 9.90000 11.88660

96378 rows × 11 columns
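For reference, get_df_for_lags_columns presumably builds these columns with pandas shift; a minimal equivalent sketch (the function name and signature here are illustrative, not the project's API):

import pandas as pd

def make_lag_frame(data: pd.DataFrame, col_name: str, n_lags: int,
                   remove_nans: bool = True) -> pd.DataFrame:
    # 't' is the current observation; 't-k' holds the value from k steps
    # earlier, obtained by shifting the series down by k positions.
    out = pd.DataFrame({'t': data[col_name]})
    for k in range(1, n_lags + 1):
        out[f't-{k}'] = data[col_name].shift(k)
    # The first n_lags rows lack a full history, hence the NaNs.
    return out.dropna() if remove_nans else out

Dropping the incomplete rows is why the hourly frame above has 96388 - 10 = 96378 rows.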

In [29]:
corr = df24h.corr(method='pearson')
corr[0:1]
Out[29]:
t t-1 t-2 t-3 t-4 t-5 t-6 t-7 t-8 t-9 t-10
t 1.0 0.97263 0.9289 0.87807 0.82461 0.77248 0.72383 0.68073 0.64286 0.61076 0.58424
In [30]:
# Replace the Datetime index with date-related features derived from it
df24h = build_datetime_features(df24h, 'Datetime')
df24h.head()
Out[30]:
t t-1 t-2 t-3 t-4 t-5 t-6 t-7 t-8 t-9 t-10 month day hour dayofyear weekofyear dayofweek quarter season
0 30.5 41.5 62.5 70.5 69.5 67.0 61.0 60.5 73.0 81.0 92.0 1 1 11 1 1 1 1 1
1 26.0 30.5 41.5 62.5 70.5 69.5 67.0 61.0 60.5 73.0 81.0 1 1 12 1 1 1 1 1
2 28.5 26.0 30.5 41.5 62.5 70.5 69.5 67.0 61.0 60.5 73.0 1 1 13 1 1 1 1 1
3 28.0 28.5 26.0 30.5 41.5 62.5 70.5 69.5 67.0 61.0 60.5 1 1 14 1 1 1 1 1
4 32.5 28.0 28.5 26.0 30.5 41.5 62.5 70.5 69.5 67.0 61.0 1 1 15 1 1 1 1 1
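build_datetime_features (with calculate_season) is a project helper; a hedged sketch of equivalent pandas logic follows. The season encoding is an assumption read off the output above, where January maps to season 1:

import pandas as pd

def add_datetime_features(df: pd.DataFrame, index_name: str = 'Datetime') -> pd.DataFrame:
    # Move the DatetimeIndex into a regular column, then derive calendar
    # features from it and drop the original timestamp.
    df = df.reset_index()
    dt = pd.to_datetime(df[index_name])
    df['month'] = dt.dt.month
    df['day'] = dt.dt.day
    df['hour'] = dt.dt.hour
    df['dayofyear'] = dt.dt.dayofyear
    df['weekofyear'] = dt.dt.isocalendar().week.astype(int)
    df['dayofweek'] = dt.dt.dayofweek
    df['quarter'] = dt.dt.quarter
    # Assumed meteorological-season encoding: 1=winter (Dec-Feb), 2=spring,
    # 3=summer, 4=autumn.
    df['season'] = (dt.dt.month % 12 + 3) // 3
    return df.drop(columns=[index_name])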
In [31]:
data_file_hdf = data_path + 'dfpm25_2008-2018_ml_24hours_lags.hdf'
df24h.to_hdf(data_file_hdf, key='df', mode='w')

Test Read

In [32]:
df24h = pd.read_hdf(path_or_buf=data_file_hdf, key="df")
print(f'Dataframe size: {df24h.shape}')
df24h.head()
Dataframe size: (96378, 19)
Out[32]:
t t-1 t-2 t-3 t-4 t-5 t-6 t-7 t-8 t-9 t-10 month day hour dayofyear weekofyear dayofweek quarter season
0 30.5 41.5 62.5 70.5 69.5 67.0 61.0 60.5 73.0 81.0 92.0 1 1 11 1 1 1 1 1
1 26.0 30.5 41.5 62.5 70.5 69.5 67.0 61.0 60.5 73.0 81.0 1 1 12 1 1 1 1 1
2 28.5 26.0 30.5 41.5 62.5 70.5 69.5 67.0 61.0 60.5 73.0 1 1 13 1 1 1 1 1
3 28.0 28.5 26.0 30.5 41.5 62.5 70.5 69.5 67.0 61.0 60.5 1 1 14 1 1 1 1 1
4 32.5 28.0 28.5 26.0 30.5 41.5 62.5 70.5 69.5 67.0 61.0 1 1 15 1 1 1 1 1

Create Lagged Variables - Daily

In [33]:
# Convert the string index to a proper DatetimeIndex
df.index = pd.to_datetime(df.index)

# Resample data to daily using mean of values
df_daily = df[['pm25']].resample(rule='D').mean() # daily
df_daily.head()
Out[33]:
pm25
Datetime
2008-01-01 53.58696
2008-01-02 30.95833
2008-01-03 46.10417
2008-01-04 42.97917
2008-01-05 57.31250
In [37]:
# Create 10 daily lag values to inspect their correlation coefficients
df7d = get_df_for_lags_columns(data=df_daily, col_name='pm25', n_lags=10, remove_nans=True)
df7d
Out[37]:
t t-1 t-2 t-3 t-4 t-5 t-6 t-7 t-8 t-9 t-10
Datetime
2008-01-11 141.83333 110.08333 101.37500 45.04167 46.08333 36.06250 57.31250 42.97917 46.10417 30.95833 53.58696
2008-01-12 47.62500 141.83333 110.08333 101.37500 45.04167 46.08333 36.06250 57.31250 42.97917 46.10417 30.95833
2008-01-13 76.27083 47.62500 141.83333 110.08333 101.37500 45.04167 46.08333 36.06250 57.31250 42.97917 46.10417
2008-01-14 96.68750 76.27083 47.62500 141.83333 110.08333 101.37500 45.04167 46.08333 36.06250 57.31250 42.97917
2008-01-15 155.85417 96.68750 76.27083 47.62500 141.83333 110.08333 101.37500 45.04167 46.08333 36.06250 57.31250
... ... ... ... ... ... ... ... ... ... ... ...
2018-12-28 18.61486 15.27444 17.10235 15.82144 8.04118 17.05840 7.28296 44.24834 68.65069 54.04651 88.69356
2018-12-29 16.24105 18.61486 15.27444 17.10235 15.82144 8.04118 17.05840 7.28296 44.24834 68.65069 54.04651
2018-12-30 15.51440 16.24105 18.61486 15.27444 17.10235 15.82144 8.04118 17.05840 7.28296 44.24834 68.65069
2018-12-31 12.59546 15.51440 16.24105 18.61486 15.27444 17.10235 15.82144 8.04118 17.05840 7.28296 44.24834
2019-01-01 36.60740 12.59546 15.51440 16.24105 18.61486 15.27444 17.10235 15.82144 8.04118 17.05840 7.28296

4009 rows × 11 columns

In [38]:
corr = df7d.corr(method='pearson')
corr[0:1]
Out[38]:
t t-1 t-2 t-3 t-4 t-5 t-6 t-7 t-8 t-9 t-10
t 1.0 0.74103 0.53764 0.45317 0.40674 0.37857 0.37081 0.37517 0.37945 0.36735 0.34983
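The lag cutoff used next follows from this row; it could also be derived programmatically (the 0.4 threshold is this notebook's judgment call, not a fixed rule):

# Count the leading lags whose correlation with t stays above 0.4; the
# correlations decay monotonically here, so summing over the row suffices.
lag_corrs = corr.iloc[0].drop('t')
n_lags = int((lag_corrs > 0.4).sum())  # 4 for this data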
In [39]:
# Create 4 daily lag values to predict the current observation (correlation coefficient > 0.4)
df7d = get_df_for_lags_columns(data=df_daily, col_name='pm25', n_lags=4, remove_nans=True)
df7d
Out[39]:
t t-1 t-2 t-3 t-4
Datetime
2008-01-05 57.31250 42.97917 46.10417 30.95833 53.58696
2008-01-06 36.06250 57.31250 42.97917 46.10417 30.95833
2008-01-07 46.08333 36.06250 57.31250 42.97917 46.10417
2008-01-08 45.04167 46.08333 36.06250 57.31250 42.97917
2008-01-09 101.37500 45.04167 46.08333 36.06250 57.31250
... ... ... ... ... ...
2018-12-28 18.61486 15.27444 17.10235 15.82144 8.04118
2018-12-29 16.24105 18.61486 15.27444 17.10235 15.82144
2018-12-30 15.51440 16.24105 18.61486 15.27444 17.10235
2018-12-31 12.59546 15.51440 16.24105 18.61486 15.27444
2019-01-01 36.60740 12.59546 15.51440 16.24105 18.61486

4015 rows × 5 columns

In [40]:
# Replace the Datetime index with date-related features derived from it
df7d = build_datetime_features(df7d, 'Datetime')
df7d.head()
Out[40]:
t t-1 t-2 t-3 t-4 month day hour dayofyear weekofyear dayofweek quarter season
0 57.31250 42.97917 46.10417 30.95833 53.58696 1 5 0 5 1 5 1 1
1 36.06250 57.31250 42.97917 46.10417 30.95833 1 6 0 6 1 6 1 1
2 46.08333 36.06250 57.31250 42.97917 46.10417 1 7 0 7 2 0 1 1
3 45.04167 46.08333 36.06250 57.31250 42.97917 1 8 0 8 2 1 1 1
4 101.37500 45.04167 46.08333 36.06250 57.31250 1 9 0 9 2 2 1 1
In [41]:
data_file_hdf = data_path + 'dfpm25_2008-2018_ml_7days_lags.hdf'
df7d.to_hdf(data_file_hdf, key='df', mode='w')

Test Read

In [42]:
df7d = pd.read_hdf(path_or_buf=data_file_hdf, key="df")
print(f'Dataframe size: {df7d.shape}')
df7d.head()
Dataframe size: (4015, 13)
Out[42]:
t t-1 t-2 t-3 t-4 month day hour dayofyear weekofyear dayofweek quarter season
0 57.31250 42.97917 46.10417 30.95833 53.58696 1 5 0 5 1 5 1 1
1 36.06250 57.31250 42.97917 46.10417 30.95833 1 6 0 6 1 6 1 1
2 46.08333 36.06250 57.31250 42.97917 46.10417 1 7 0 7 2 0 1 1
3 45.04167 46.08333 36.06250 57.31250 42.97917 1 8 0 8 2 1 1 1
4 101.37500 45.04167 46.08333 36.06250 57.31250 1 9 0 9 2 2 1 1