From: https://github.com/ksatola
Version: 0.1.0

Model - Prepare Analytical Views For Modelling

We will prepare the data for:

  • statistical modelling (without further transformations),
  • machine learning analysis (by creating lagged variables).

The data will be adjusted for the following kinds of forecasts:

  • Next 24 hours (hourly data),
  • Next 7 days (daily data).

We will then save these new dataframes.

In [1]:
%load_ext autoreload
In [2]:
%autoreload 2
In [3]:
import sys
sys.path.insert(0, '../src')
In [4]:
import pandas as pd 
import numpy as np
In [24]:
from model import (
    load_data,
    calculate_season,
    build_datetime_features,
    get_df_for_lags_columns
)

from plot import (
    plot_ts_corr,
    plot_stl
)

from stats import (
    adfuller_test
)
In [6]:
pd.set_option('display.max_rows', 999)
pd.set_option('display.max_columns', 999)
pd.set_option('display.precision', 5)
In [7]:
data_path = 'data/'
data_file = data_path + 'dfpm2008_2018.csv'

Load PM2.5 and PM10 Analytical View From a CSV File

In [8]:
df = load_data(data_file)
common.py | 14 | load_data | 08-Jun-20 18:10:10 | INFO: Dataframe loaded: data/dfpm2008_2018.csv
common.py | 15 | load_data | 08-Jun-20 18:10:10 | INFO: DataFrame size: (96388, 2)

Build PM2.5 HDF Analytical View Representation File

In [9]:
df.index
Out[9]:
Index(['2008-01-01 01:00:00', '2008-01-01 02:00:00', '2008-01-01 03:00:00',
       '2008-01-01 04:00:00', '2008-01-01 05:00:00', '2008-01-01 06:00:00',
       '2008-01-01 07:00:00', '2008-01-01 08:00:00', '2008-01-01 09:00:00',
       '2008-01-01 10:00:00',
       ...
       '2018-12-31 15:00:00.000', '2018-12-31 16:00:00.000',
       '2018-12-31 17:00:00.000', '2018-12-31 18:00:00.000',
       '2018-12-31 19:00:00.000', '2018-12-31 20:00:00.000',
       '2018-12-31 21:00:00.000', '2018-12-31 22:00:00.000',
       '2018-12-31 23:00:00.000', '2019-01-01 00:00:00.000'],
      dtype='object', name='Datetime', length=96388)
In [10]:
# Convert the string index to a proper DatetimeIndex
df.index = pd.to_datetime(df.index)
In [11]:
# https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases
# 'T' is the offset alias for minute frequency; this view only illustrates asfreq
df.asfreq('T').index
Out[11]:
DatetimeIndex(['2008-01-01 01:00:00', '2008-01-01 01:01:00',
               '2008-01-01 01:02:00', '2008-01-01 01:03:00',
               '2008-01-01 01:04:00', '2008-01-01 01:05:00',
               '2008-01-01 01:06:00', '2008-01-01 01:07:00',
               '2008-01-01 01:08:00', '2008-01-01 01:09:00',
               ...
               '2018-12-31 23:51:00', '2018-12-31 23:52:00',
               '2018-12-31 23:53:00', '2018-12-31 23:54:00',
               '2018-12-31 23:55:00', '2018-12-31 23:56:00',
               '2018-12-31 23:57:00', '2018-12-31 23:58:00',
               '2018-12-31 23:59:00', '2019-01-01 00:00:00'],
              dtype='datetime64[ns]', name='Datetime', length=5785861, freq='T')
In [12]:
df.head()
Out[12]:
pm10 pm25
Datetime
2008-01-01 01:00:00 109.5 92.0
2008-01-01 02:00:00 96.0 81.0
2008-01-01 03:00:00 86.5 73.0
2008-01-01 04:00:00 71.5 60.5
2008-01-01 05:00:00 72.0 61.0
In [13]:
# We will only need PM2.5 for modelling
df.drop(columns=['pm10'], inplace=True)
df.head()
Out[13]:
pm25
Datetime
2008-01-01 01:00:00 92.0
2008-01-01 02:00:00 81.0
2008-01-01 03:00:00 73.0
2008-01-01 04:00:00 60.5
2008-01-01 05:00:00 61.0

Dataset - Hourly Frequency

In [14]:
data_file_hdf = data_path + 'dfpm25_2008-2018_hourly.hdf'
df.to_hdf(data_file_hdf, key='df', mode='w')

Test Read

In [15]:
df = pd.read_hdf(path_or_buf=data_file_hdf, key="df")
print(f'Dataframe size: {df.shape}')
df.head()
Dataframe size: (96388, 1)
Out[15]:
pm25
Datetime
2008-01-01 01:00:00 92.0
2008-01-01 02:00:00 81.0
2008-01-01 03:00:00 73.0
2008-01-01 04:00:00 60.5
2008-01-01 05:00:00 61.0

Dataset - Daily Frequency

In [16]:
# Resample data to daily using mean of values
df_daily = df.resample(rule='D').mean() # daily
df_daily.head()
Out[16]:
pm25
Datetime
2008-01-01 53.58696
2008-01-02 30.95833
2008-01-03 46.10417
2008-01-04 42.97917
2008-01-05 57.31250
In [17]:
data_file_hdf = data_path + 'dfpm25_2008-2018_daily.hdf'
df_daily.to_hdf(data_file_hdf, key='df', mode='w')

Test Read

In [18]:
df = pd.read_hdf(path_or_buf=data_file_hdf, key="df")
print(f'Dataframe size: {df.shape}')
df.head()
Dataframe size: (4019, 1)
Out[18]:
pm25
Datetime
2008-01-01 53.58696
2008-01-02 30.95833
2008-01-03 46.10417
2008-01-04 42.97917
2008-01-05 57.31250

Feature engineering is the process of transforming raw data into features that better represent the underlying problem to predictive models, improving model accuracy on unseen data. For machine learning models, instead of relying on the datetime index directly, we will derive additional time-based features from it and include them in the analytical view. In addition, we will perform time series decomposition (removing trend and seasonality) and check the ACF to select the most correlated lags as features. In the final prediction mechanism, predictions made on the stationary component can be transformed back to the original scale by adding back the previously removed trend and seasonality components.
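As a rough illustration of this decomposition-and-lag-selection idea (the notebook's plot_stl and adfuller_test helpers presumably wrap similar statsmodels calls; this sketch is not the project's API):

import pandas as pd
from statsmodels.tsa.seasonal import STL
from statsmodels.tsa.stattools import acf

# Decompose the daily series into trend + seasonal + residual;
# period=365 assumes a yearly cycle in daily data.
series = df_daily['pm25'].dropna()
stl = STL(series, period=365).fit()
residual = stl.resid  # the roughly stationary component

# Lags with high autocorrelation in the residual are feature candidates.
autocorr = acf(residual, nlags=10)
candidate_lags = [lag for lag, r in enumerate(autocorr) if lag > 0 and r > 0.4]

# Predictions made on the residual map back to the original scale as:
# y_hat = y_hat_residual + stl.trend + stl.seasonal (aligned by timestamp)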


Machine Learning Models: Build PM2.5 HDF Analytical View Representation File

In [19]:
df = load_data(data_file)
common.py | 14 | load_data | 08-Jun-20 18:12:23 | INFO: Dataframe loaded: data/dfpm2008_2018.csv
common.py | 15 | load_data | 08-Jun-20 18:12:23 | INFO: DataFrame size: (96388, 2)
In [20]:
# We will only need PM2.5 for ML modelling
df.drop(columns=['pm10'], inplace=True)
df.head()
Out[20]:
pm25
Datetime
2008-01-01 01:00:00 92.0
2008-01-01 02:00:00 81.0
2008-01-01 03:00:00 73.0
2008-01-01 04:00:00 60.5
2008-01-01 05:00:00 61.0

Create Lagged Variables - Hourly

In [28]:
# Create 10 hourly lag values to predict the current observation (correlation coefficients > 0.5)
df24h = get_df_for_lags_columns(data=df, col_name='pm25', n_lags=10, remove_nans=True)
df24h
Out[28]:
t t-1 t-2 t-3 t-4 t-5 t-6 t-7 t-8 t-9 t-10
Datetime
2008-01-01 11:00:00 30.5000 41.5000 62.5000 70.5000 69.5000 67.0000 61.0000 60.5000 73.00000 81.00000 92.00000
2008-01-01 12:00:00 26.0000 30.5000 41.5000 62.5000 70.5000 69.5000 67.0000 61.0000 60.50000 73.00000 81.00000
2008-01-01 13:00:00 28.5000 26.0000 30.5000 41.5000 62.5000 70.5000 69.5000 67.0000 61.00000 60.50000 73.00000
2008-01-01 14:00:00 28.0000 28.5000 26.0000 30.5000 41.5000 62.5000 70.5000 69.5000 67.00000 61.00000 60.50000
2008-01-01 15:00:00 32.5000 28.0000 28.5000 26.0000 30.5000 41.5000 62.5000 70.5000 69.50000 67.00000 61.00000
... ... ... ... ... ... ... ... ... ... ... ...
2018-12-31 20:00:00.000 23.6990 19.3182 18.8965 19.4075 9.7000 9.9000 11.8866 7.0000 9.01751 6.94473 7.02653
2018-12-31 21:00:00.000 32.6308 23.6990 19.3182 18.8965 19.4075 9.7000 9.9000 11.8866 7.00000 9.01751 6.94473
2018-12-31 22:00:00.000 31.0801 32.6308 23.6990 19.3182 18.8965 19.4075 9.7000 9.9000 11.88660 7.00000 9.01751
2018-12-31 23:00:00.000 33.2028 31.0801 32.6308 23.6990 19.3182 18.8965 19.4075 9.7000 9.90000 11.88660 7.00000
2019-01-01 00:00:00.000 36.6074 33.2028 31.0801 32.6308 23.6990 19.3182 18.8965 19.4075 9.70000 9.90000 11.88660

96378 rows × 11 columns
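For reference, get_df_for_lags_columns presumably builds these columns with pandas shift; a minimal equivalent sketch (the function name and signature here are illustrative, not the project's API):

import pandas as pd

def make_lag_frame(data: pd.DataFrame, col_name: str, n_lags: int,
                   remove_nans: bool = True) -> pd.DataFrame:
    # 't' is the current observation; 't-k' holds the value from k steps
    # earlier, obtained by shifting the series down by k positions.
    out = pd.DataFrame({'t': data[col_name]})
    for k in range(1, n_lags + 1):
        out[f't-{k}'] = data[col_name].shift(k)
    # The first n_lags rows lack a full history, hence the NaNs.
    return out.dropna() if remove_nans else out

Dropping the incomplete rows is why the hourly frame above has 96388 - 10 = 96378 rows.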

In [29]:
corr = df24h.corr(method='pearson')
corr[0:1]
Out[29]:
t t-1 t-2 t-3 t-4 t-5 t-6 t-7 t-8 t-9 t-10
t 1.0 0.97263 0.9289 0.87807 0.82461 0.77248 0.72383 0.68073 0.64286 0.61076 0.58424
In [30]:
# Replace the Datetime index with date-related features derived from it
df24h = build_datetime_features(df24h, 'Datetime')
df24h.head()
Out[30]:
t t-1 t-2 t-3 t-4 t-5 t-6 t-7 t-8 t-9 t-10 month day hour dayofyear weekofyear dayofweek quarter season
0 30.5 41.5 62.5 70.5 69.5 67.0 61.0 60.5 73.0 81.0 92.0 1 1 11 1 1 1 1 1
1 26.0 30.5 41.5 62.5 70.5 69.5 67.0 61.0 60.5 73.0 81.0 1 1 12 1 1 1 1 1
2 28.5 26.0 30.5 41.5 62.5 70.5 69.5 67.0 61.0 60.5 73.0 1 1 13 1 1 1 1 1
3 28.0 28.5 26.0 30.5 41.5 62.5 70.5 69.5 67.0 61.0 60.5 1 1 14 1 1 1 1 1
4 32.5 28.0 28.5 26.0 30.5 41.5 62.5 70.5 69.5 67.0 61.0 1 1 15 1 1 1 1 1
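build_datetime_features (with calculate_season) is a project helper; a hedged sketch of equivalent pandas logic follows. The season encoding is an assumption read off the output above, where January maps to season 1:

import pandas as pd

def add_datetime_features(df: pd.DataFrame, index_name: str = 'Datetime') -> pd.DataFrame:
    # Move the DatetimeIndex into a regular column, then derive calendar
    # features from it and drop the original timestamp.
    df = df.reset_index()
    dt = pd.to_datetime(df[index_name])
    df['month'] = dt.dt.month
    df['day'] = dt.dt.day
    df['hour'] = dt.dt.hour
    df['dayofyear'] = dt.dt.dayofyear
    df['weekofyear'] = dt.dt.isocalendar().week.astype(int)
    df['dayofweek'] = dt.dt.dayofweek
    df['quarter'] = dt.dt.quarter
    # Assumed meteorological-season encoding: 1=winter (Dec-Feb), 2=spring,
    # 3=summer, 4=autumn.
    df['season'] = (dt.dt.month % 12 + 3) // 3
    return df.drop(columns=[index_name])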
In [31]:
data_file_hdf = data_path + 'dfpm25_2008-2018_ml_24hours_lags.hdf'
df24h.to_hdf(data_file_hdf, key='df', mode='w')

Test Read

In [32]:
df24h = pd.read_hdf(path_or_buf=data_file_hdf, key="df")
print(f'Dataframe size: {df24h.shape}')
df24h.head()
Dataframe size: (96378, 19)
Out[32]:
t t-1 t-2 t-3 t-4 t-5 t-6 t-7 t-8 t-9 t-10 month day hour dayofyear weekofyear dayofweek quarter season
0 30.5 41.5 62.5 70.5 69.5 67.0 61.0 60.5 73.0 81.0 92.0 1 1 11 1 1 1 1 1
1 26.0 30.5 41.5 62.5 70.5 69.5 67.0 61.0 60.5 73.0 81.0 1 1 12 1 1 1 1 1
2 28.5 26.0 30.5 41.5 62.5 70.5 69.5 67.0 61.0 60.5 73.0 1 1 13 1 1 1 1 1
3 28.0 28.5 26.0 30.5 41.5 62.5 70.5 69.5 67.0 61.0 60.5 1 1 14 1 1 1 1 1
4 32.5 28.0 28.5 26.0 30.5 41.5 62.5 70.5 69.5 67.0 61.0 1 1 15 1 1 1 1 1

Create Lagged Variables - Daily

In [33]:
# Convert the string index to a proper DatetimeIndex
df.index = pd.to_datetime(df.index)

# Resample data to daily using mean of values
df_daily = df[['pm25']].resample(rule='D').mean() # daily
df_daily.head()
Out[33]:
pm25
Datetime
2008-01-01 53.58696
2008-01-02 30.95833
2008-01-03 46.10417
2008-01-04 42.97917
2008-01-05 57.31250
In [37]:
# Create 10 daily lag values to inspect their correlation coefficients
df7d = get_df_for_lags_columns(data=df_daily, col_name='pm25', n_lags=10, remove_nans=True)
df7d
Out[37]:
t t-1 t-2 t-3 t-4 t-5 t-6 t-7 t-8 t-9 t-10
Datetime
2008-01-11 141.83333 110.08333 101.37500 45.04167 46.08333 36.06250 57.31250 42.97917 46.10417 30.95833 53.58696
2008-01-12 47.62500 141.83333 110.08333 101.37500 45.04167 46.08333 36.06250 57.31250 42.97917 46.10417 30.95833
2008-01-13 76.27083 47.62500 141.83333 110.08333 101.37500 45.04167 46.08333 36.06250 57.31250 42.97917 46.10417
2008-01-14 96.68750 76.27083 47.62500 141.83333 110.08333 101.37500 45.04167 46.08333 36.06250 57.31250 42.97917
2008-01-15 155.85417 96.68750 76.27083 47.62500 141.83333 110.08333 101.37500 45.04167 46.08333 36.06250 57.31250
... ... ... ... ... ... ... ... ... ... ... ...
2018-12-28 18.61486 15.27444 17.10235 15.82144 8.04118 17.05840 7.28296 44.24834 68.65069 54.04651 88.69356
2018-12-29 16.24105 18.61486 15.27444 17.10235 15.82144 8.04118 17.05840 7.28296 44.24834 68.65069 54.04651
2018-12-30 15.51440 16.24105 18.61486 15.27444 17.10235 15.82144 8.04118 17.05840 7.28296 44.24834 68.65069
2018-12-31 12.59546 15.51440 16.24105 18.61486 15.27444 17.10235 15.82144 8.04118 17.05840 7.28296 44.24834
2019-01-01 36.60740 12.59546 15.51440 16.24105 18.61486 15.27444 17.10235 15.82144 8.04118 17.05840 7.28296

4009 rows × 11 columns

In [38]:
corr = df7d.corr(method='pearson')
corr[0:1]
Out[38]:
t t-1 t-2 t-3 t-4 t-5 t-6 t-7 t-8 t-9 t-10
t 1.0 0.74103 0.53764 0.45317 0.40674 0.37857 0.37081 0.37517 0.37945 0.36735 0.34983
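The lag cutoff used next follows from this row; it could also be derived programmatically (the 0.4 threshold is this notebook's judgment call, not a fixed rule):

# Count the leading lags whose correlation with t stays above 0.4; the
# correlations decay monotonically here, so summing over the row suffices.
lag_corrs = corr.iloc[0].drop('t')
n_lags = int((lag_corrs > 0.4).sum())  # 4 for this data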
In [39]:
# Create 4 daily lag values to predict the current observation (correlation coefficient > 0.4)
df7d = get_df_for_lags_columns(data=df_daily, col_name='pm25', n_lags=4, remove_nans=True)
df7d
Out[39]:
t t-1 t-2 t-3 t-4
Datetime
2008-01-05 57.31250 42.97917 46.10417 30.95833 53.58696
2008-01-06 36.06250 57.31250 42.97917 46.10417 30.95833
2008-01-07 46.08333 36.06250 57.31250 42.97917 46.10417
2008-01-08 45.04167 46.08333 36.06250 57.31250 42.97917
2008-01-09 101.37500 45.04167 46.08333 36.06250 57.31250
... ... ... ... ... ...
2018-12-28 18.61486 15.27444 17.10235 15.82144 8.04118
2018-12-29 16.24105 18.61486 15.27444 17.10235 15.82144
2018-12-30 15.51440 16.24105 18.61486 15.27444 17.10235
2018-12-31 12.59546 15.51440 16.24105 18.61486 15.27444
2019-01-01 36.60740 12.59546 15.51440 16.24105 18.61486

4015 rows × 5 columns

In [40]:
# Replace the Datetime index with date-related features derived from it
df7d = build_datetime_features(df7d, 'Datetime')
df7d.head()
Out[40]:
t t-1 t-2 t-3 t-4 month day hour dayofyear weekofyear dayofweek quarter season
0 57.31250 42.97917 46.10417 30.95833 53.58696 1 5 0 5 1 5 1 1
1 36.06250 57.31250 42.97917 46.10417 30.95833 1 6 0 6 1 6 1 1
2 46.08333 36.06250 57.31250 42.97917 46.10417 1 7 0 7 2 0 1 1
3 45.04167 46.08333 36.06250 57.31250 42.97917 1 8 0 8 2 1 1 1
4 101.37500 45.04167 46.08333 36.06250 57.31250 1 9 0 9 2 2 1 1
In [41]:
data_file_hdf = data_path + 'dfpm25_2008-2018_ml_7days_lags.hdf'
df7d.to_hdf(data_file_hdf, key='df', mode='w')

Test Read

In [42]:
df7d = pd.read_hdf(path_or_buf=data_file_hdf, key="df")
print(f'Dataframe size: {df7d.shape}')
df7d.head()
Dataframe size: (4015, 13)
Out[42]:
t t-1 t-2 t-3 t-4 month day hour dayofyear weekofyear dayofweek quarter season
0 57.31250 42.97917 46.10417 30.95833 53.58696 1 5 0 5 1 5 1 1
1 36.06250 57.31250 42.97917 46.10417 30.95833 1 6 0 6 1 6 1 1
2 46.08333 36.06250 57.31250 42.97917 46.10417 1 7 0 7 2 0 1 1
3 45.04167 46.08333 36.06250 57.31250 42.97917 1 8 0 8 2 1 1 1
4 101.37500 45.04167 46.08333 36.06250 57.31250 1 9 0 9 2 2 1 1