From: https://github.com/ksatola
Version: 0.1.0

Model - PM2.5 - Baseline Reference Models

In [1]:
%load_ext autoreload
In [2]:
%autoreload 2
In [3]:
import sys
sys.path.insert(0, '../src')
In [4]:
import warnings
warnings.filterwarnings('ignore')
warnings.simplefilter('ignore')
In [5]:
import pandas as pd 

import matplotlib.pyplot as plt
%matplotlib inline
In [6]:
from model import (
    get_pm25_data_for_modelling
)

from measure import (
    get_rmse,
    walk_forward_ref_model_validation,
    get_mean_folds_rmse_for_n_prediction_points,
    prepare_data_for_visualization
)

from plot import (
    visualize_results
)

from utils import (
    get_datetime_identifier
)

from logger import logger

Persistence Baseline Models for Forecasting

Establishing a baseline is essential for any time series forecasting problem. A baseline in performance gives you an idea of how well all other models will actually perform on your problem.

A baseline in forecast performance provides a point of comparison. It is a point of reference for all other modeling techniques on your problem. If a model achieves performance at or below the baseline, the technique should be fixed or abandoned.

The technique used to generate the baseline forecast must be easy to implement and naive of problem-specific details. Three properties of a good technique for making a baseline forecast are:

  • Simple: A method that requires little or no training or intelligence.
  • Fast: A method that is fast to implement and computationally trivial to make a prediction.
  • Repeatable: A method that is deterministic, meaning that it produces an expected output given the same input.

Baseline forecasts quickly flesh out whether you can do significantly better. If you can’t, you’re probably working with a random walk.

Many naive forecasting methods exist; here we examine two:

  • Simple Average Forecast (Zero Rule Algorithm)
  • Persistence Model

Moving Average Forecasts

With the moving average approach, we take the n most recent known values, calculate their average, and use it as the forecast for the next value. This forecast is then appended to the history, and a new average of the last n values is calculated to predict the following point (and so on). In some situations, this simple technique turns out to be the best-performing forecasting method.
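
The recursive procedure described above can be sketched in a few lines. This is only an illustration: the function name `moving_average_forecast` and the toy values are hypothetical and are not part of the project's `model` or `measure` modules.

```python
from collections import deque


def moving_average_forecast(history, window, horizon):
    """Recursively forecast `horizon` points as the mean of the last `window` values."""
    if window < 1 or len(history) < window:
        raise ValueError("need at least `window` observations")
    # Keep only the most recent `window` values; older ones drop out automatically
    recent = deque(history[-window:], maxlen=window)
    forecasts = []
    for _ in range(horizon):
        next_value = sum(recent) / window
        forecasts.append(next_value)
        # The forecast itself becomes part of the window for the next step
        recent.append(next_value)
    return forecasts


# Hypothetical toy history, not taken from the PM2.5 data
history = [10.0, 20.0, 30.0]
forecasts = moving_average_forecast(history, window=3, horizon=3)
print(forecasts)
```

Because each forecast is fed back into the window, forecasts further into the future are built increasingly from earlier forecasts rather than from observations.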

Moving average smoothing is a naive and effective technique in time series forecasting. It can be used for data preparation, feature engineering, and even directly for making predictions.

Smoothing is a technique applied to time series to remove the fine-grained variation between time steps. The hope of smoothing is to remove noise and better expose the signal of the underlying causal processes. Moving averages are a simple and common type of smoothing used in time series analysis and time series forecasting.

There are various types of moving averages:

  • Simple Moving Average (SMA): uses a sliding window to take the average over a set number of time periods. It is an equally weighted mean of the previous n data points.
  • Cumulative Moving Average (CMA): unlike the Simple Moving Average, which drops the oldest observation as a new one is added, the cumulative moving average considers all prior observations. CMA averages all of the data up to the current data point, which makes it very similar to the Simple Average Forecast technique.
  • Exponential Moving Average (EMA): gives more weight to recent observations, so it can track changes in the trend faster. How strongly an EMA reacts depends on the pattern of the data. Because EMAs weight recent data more heavily than older data, they respond more quickly to the latest changes in value than SMAs do.
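
As a rough illustration, the three moving averages listed above can be computed with pandas. The series `s` holds made-up values, and the window and span of 3 are arbitrary choices for the example, not settings used elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical toy series; values are illustrative only
s = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])

# SMA: equally weighted mean of the last 3 values (NaN until the window is full)
sma = s.rolling(window=3).mean()

# CMA: mean of all values up to and including each point
cma = s.expanding().mean()

# EMA: exponentially decaying weights, with alpha = 2 / (span + 1) = 0.5 here
ema = s.ewm(span=3, adjust=False).mean()

print(pd.DataFrame({"value": s, "sma": sma, "cma": cma, "ema": ema}))
```

Each average at position t includes the value observed at t. To use one of them as a one-step-ahead forecast, shift it by one step first, for example `sma.shift(1)`.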

Load hourly data

In [7]:
dfh = get_pm25_data_for_modelling('ts', 'h')
dfh.head()
common.py | 42 | get_pm25_data_for_modelling | 11-Jun-20 18:53:54 | INFO: Dataframe loaded: /Users/ksatola/Documents/git/air-pollution/agh/data/dfpm25_2008-2018_hourly.hdf
common.py | 43 | get_pm25_data_for_modelling | 11-Jun-20 18:53:54 | INFO: Dataframe size: (96388, 1)
Out[7]:
pm25
Datetime
2008-01-01 01:00:00 92.0
2008-01-01 02:00:00 81.0
2008-01-01 03:00:00 73.0
2008-01-01 04:00:00 60.5
2008-01-01 05:00:00 61.0

Train test split

In [8]:
# This is done in the walk forward validation function.
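
For orientation, a walk-forward split with an expanding training window might look like the sketch below. This is an assumption about the general technique, not the actual code of `walk_forward_ref_model_validation`; the helper `walk_forward_splits` is hypothetical, and the project's function may handle the boundaries differently.

```python
def walk_forward_splits(n_obs, cut_off_offset, n_pred_points):
    """Yield (train_end, test_end) index pairs for an expanding training window."""
    # The first split leaves the last `cut_off_offset` observations for testing
    first_split = n_obs - cut_off_offset
    # Stop once fewer than `n_pred_points` observations remain to be predicted
    for train_end in range(first_split, n_obs - n_pred_points + 1):
        yield train_end, train_end + n_pred_points


# Hypothetical sizes, chosen small enough to read the result at a glance
splits = list(walk_forward_splits(n_obs=20, cut_off_offset=5, n_pred_points=2))
print(splits)
```

For each pair, the training set would be `data[:train_end]` and the test set `data[train_end:test_end]`, so every fold trains on all observations known up to that point.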

Load daily data

In [9]:
dfd = get_pm25_data_for_modelling('ts', 'd')
dfd.head()
common.py | 42 | get_pm25_data_for_modelling | 11-Jun-20 18:55:15 | INFO: Dataframe loaded: /Users/ksatola/Documents/git/air-pollution/agh/data/dfpm25_2008-2018_daily.hdf
common.py | 43 | get_pm25_data_for_modelling | 11-Jun-20 18:55:15 | INFO: Dataframe size: (4019, 1)
Out[9]:
pm25
Datetime
2008-01-01 53.586957
2008-01-02 30.958333
2008-01-03 46.104167
2008-01-04 42.979167
2008-01-05 57.312500

Train test split

In [10]:
# This is done in the walk forward validation function.

Simple Average Forecast (Zero Rule Algorithm)

The most common baseline method for supervised machine learning is the Zero Rule Algorithm. This algorithm predicts the majority class in the case of classification, or the average outcome in the case of regression.

With the Simple Average Forecast approach we take all the values previously known, calculate the average and take it as all the next values.

This method has an advantage over the Persistence Model (below): it does not depend on the value of the last observation in the training set; instead, it uses the mean of the entire training set, calculated once. Both the simple average and the naive (persistence) forecast predict a constant value.

For the seasonal data we have here, this approach can be fairer on average, but because of outliers it will be wrong for most prediction points.
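
A minimal sketch of the idea, assuming the training data is a pandas Series. The helper `simple_average_forecast` is hypothetical and only mirrors the description above; it is not the implementation behind `model_type='SAF'`.

```python
import pandas as pd


def simple_average_forecast(train: pd.Series, horizon: int) -> list:
    """Predict the training mean for each of the next `horizon` points."""
    return [float(train.mean())] * horizon


# Hypothetical toy training data, not taken from the PM2.5 series
train = pd.Series([30.0, 40.0, 50.0])
saf_forecast = simple_average_forecast(train, horizon=3)
print(saf_forecast)
```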

In [11]:
model_name = 'SAF'

24-Hour Prediction

In [12]:
df = dfh.copy()
df.columns = ['pm25']
df.head()
Out[12]:
pm25
Datetime
2008-01-01 01:00:00 92.0
2008-01-01 02:00:00 81.0
2008-01-01 03:00:00 73.0
2008-01-01 04:00:00 60.5
2008-01-01 05:00:00 61.0
In [13]:
# Define first past/future cutoff point in time offset (1 year of data)
cut_off_offset = 365*24 # for hourly data
#cut_off_offset = 365 # for daily data

# Set datetime format for index
dt_format = "%Y-%m-%d %H:%M:%S" # for hourly data
#dt_format = "%Y-%m-%d" # for daily data

# Create train and validate sets
train_test_split_position = int(len(df)-cut_off_offset)

# Create as many folds as remains till the end of known data
n_folds = len(df) #train_test_split_position+3

# Predict for X points
n_pred_points = 24 # for hourly data
#n_pred_points = 7 # for daily data
In [15]:
%%time
# Validate result on test
# Creates 365*24*24 models for hourly data, or 365*7 models for daily data

fold_results = walk_forward_ref_model_validation(data=df,
                                                 col_name='pm25',
                                                 model_type='SAF',
                                                 cut_off_offset=cut_off_offset,
                                                 n_pred_points=n_pred_points,
                                                 n_folds=-1,
                                                 period='')
print(len(fold_results))
print(fold_results[0])
validation.py | 105 | walk_forward_ref_model_validation | 11-Jun-20 18:57:58 | INFO: SAF model validation started
Started fold 000000/008760 - 2020-06-11_18-57-58
Started fold 000100/008760 - 2020-06-11_18-58-06
Started fold 000200/008760 - 2020-06-11_18-58-14
Started fold 000300/008760 - 2020-06-11_18-58-23
Started fold 000400/008760 - 2020-06-11_18-58-31
Started fold 000500/008760 - 2020-06-11_18-58-40
Started fold 000600/008760 - 2020-06-11_18-58-49
Started fold 000700/008760 - 2020-06-11_18-58-57
Started fold 000800/008760 - 2020-06-11_18-59-06
Started fold 000900/008760 - 2020-06-11_18-59-15
Started fold 001000/008760 - 2020-06-11_18-59-25
Started fold 001100/008760 - 2020-06-11_18-59-33
Started fold 001200/008760 - 2020-06-11_18-59-42
Started fold 001300/008760 - 2020-06-11_18-59-51
Started fold 001400/008760 - 2020-06-11_19-00-00
Started fold 001500/008760 - 2020-06-11_19-00-09
Started fold 001600/008760 - 2020-06-11_19-00-17
Started fold 001700/008760 - 2020-06-11_19-00-26
Started fold 001800/008760 - 2020-06-11_19-00-34
Started fold 001900/008760 - 2020-06-11_19-00-43
Started fold 002000/008760 - 2020-06-11_19-00-52
Started fold 002100/008760 - 2020-06-11_19-01-00
Started fold 002200/008760 - 2020-06-11_19-01-09
Started fold 002300/008760 - 2020-06-11_19-01-18
Started fold 002400/008760 - 2020-06-11_19-01-26
Started fold 002500/008760 - 2020-06-11_19-01-35
Started fold 002600/008760 - 2020-06-11_19-01-43
Started fold 002700/008760 - 2020-06-11_19-01-52
Started fold 002800/008760 - 2020-06-11_19-02-01
Started fold 002900/008760 - 2020-06-11_19-02-09
Started fold 003000/008760 - 2020-06-11_19-02-18
Started fold 003100/008760 - 2020-06-11_19-02-27
Started fold 003200/008760 - 2020-06-11_19-02-36
Started fold 003300/008760 - 2020-06-11_19-02-45
Started fold 003400/008760 - 2020-06-11_19-02-54
Started fold 003500/008760 - 2020-06-11_19-03-04
Started fold 003600/008760 - 2020-06-11_19-03-13
Started fold 003700/008760 - 2020-06-11_19-03-23
Started fold 003800/008760 - 2020-06-11_19-03-32
Started fold 003900/008760 - 2020-06-11_19-03-41
Started fold 004000/008760 - 2020-06-11_19-03-50
Started fold 004100/008760 - 2020-06-11_19-03-59
Started fold 004200/008760 - 2020-06-11_19-04-07
Started fold 004300/008760 - 2020-06-11_19-04-16
Started fold 004400/008760 - 2020-06-11_19-04-25
Started fold 004500/008760 - 2020-06-11_19-04-34
Started fold 004600/008760 - 2020-06-11_19-04-42
Started fold 004700/008760 - 2020-06-11_19-04-51
Started fold 004800/008760 - 2020-06-11_19-05-00
Started fold 004900/008760 - 2020-06-11_19-05-09
Started fold 005000/008760 - 2020-06-11_19-05-18
Started fold 005100/008760 - 2020-06-11_19-05-27
Started fold 005200/008760 - 2020-06-11_19-05-35
Started fold 005300/008760 - 2020-06-11_19-05-44
Started fold 005400/008760 - 2020-06-11_19-05-53
Started fold 005500/008760 - 2020-06-11_19-06-02
Started fold 005600/008760 - 2020-06-11_19-06-10
Started fold 005700/008760 - 2020-06-11_19-06-19
Started fold 005800/008760 - 2020-06-11_19-06-28
Started fold 005900/008760 - 2020-06-11_19-06-37
Started fold 006000/008760 - 2020-06-11_19-06-46
Started fold 006100/008760 - 2020-06-11_19-06-55
Started fold 006200/008760 - 2020-06-11_19-07-03
Started fold 006300/008760 - 2020-06-11_19-07-12
Started fold 006400/008760 - 2020-06-11_19-07-21
Started fold 006500/008760 - 2020-06-11_19-07-31
Started fold 006600/008760 - 2020-06-11_19-07-41
Started fold 006700/008760 - 2020-06-11_19-07-50
Started fold 006800/008760 - 2020-06-11_19-07-59
Started fold 006900/008760 - 2020-06-11_19-08-08
Started fold 007000/008760 - 2020-06-11_19-08-18
Started fold 007100/008760 - 2020-06-11_19-08-27
Started fold 007200/008760 - 2020-06-11_19-08-36
Started fold 007300/008760 - 2020-06-11_19-08-45
Started fold 007400/008760 - 2020-06-11_19-08-54
Started fold 007500/008760 - 2020-06-11_19-09-03
Started fold 007600/008760 - 2020-06-11_19-09-12
Started fold 007700/008760 - 2020-06-11_19-09-21
Started fold 007800/008760 - 2020-06-11_19-09-30
Started fold 007900/008760 - 2020-06-11_19-09-39
Started fold 008000/008760 - 2020-06-11_19-09-48
Started fold 008100/008760 - 2020-06-11_19-09-57
Started fold 008200/008760 - 2020-06-11_19-10-05
Started fold 008300/008760 - 2020-06-11_19-10-14
Started fold 008400/008760 - 2020-06-11_19-10-23
Started fold 008500/008760 - 2020-06-11_19-10-32
Started fold 008600/008760 - 2020-06-11_19-10-41
Started fold 008700/008760 - 2020-06-11_19-10-50
8736
                     observed  predicted      error  abs_error
Datetime                                                      
2018-01-01 01:00:00  84.90085  37.831378  47.069472  47.069472
2018-01-01 02:00:00  67.44355  37.831378  29.612172  29.612172
2018-01-01 03:00:00  76.66860  37.831378  38.837222  38.837222
2018-01-01 04:00:00  64.96090  37.831378  27.129522  27.129522
2018-01-01 05:00:00  64.14875  37.831378  26.317372  26.317372
2018-01-01 06:00:00  76.06410  37.831378  38.232722  38.232722
2018-01-01 07:00:00  69.19180  37.831378  31.360422  31.360422
2018-01-01 08:00:00  48.51735  37.831378  10.685972  10.685972
2018-01-01 09:00:00  45.92715  37.831378   8.095772   8.095772
2018-01-01 10:00:00  44.19595  37.831378   6.364572   6.364572
2018-01-01 11:00:00  39.27865  37.831378   1.447272   1.447272
2018-01-01 12:00:00  32.61625  37.831378   5.215128   5.215128
2018-01-01 13:00:00  34.09440  37.831378   3.736978   3.736978
2018-01-01 14:00:00  33.51795  37.831378   4.313428   4.313428
2018-01-01 15:00:00  41.24420  37.831378   3.412822   3.412822
2018-01-01 16:00:00  49.08765  37.831378  11.256272  11.256272
2018-01-01 17:00:00  51.24645  37.831378  13.415072  13.415072
2018-01-01 18:00:00  41.64520  37.831378   3.813822   3.813822
2018-01-01 19:00:00  40.98405  37.831378   3.152672   3.152672
2018-01-01 20:00:00  45.36865  37.831378   7.537272   7.537272
2018-01-01 21:00:00  58.24830  37.831378  20.416922  20.416922
2018-01-01 22:00:00  63.21335  37.831378  25.381972  25.381972
2018-01-01 23:00:00  78.28435  37.831378  40.452972  40.452972
2018-01-02 00:00:00  91.30400  37.831378  53.472622  53.472622
CPU times: user 12min 53s, sys: 2 s, total: 12min 55s
Wall time: 12min 56s

Serialize output data

In [16]:
from joblib import dump, load

timestamp = get_datetime_identifier("%Y-%m-%d_%H-%M-%S")

path = f'results/pm25_ts_{model_name}_results_h_{timestamp}.joblib'

dump(fold_results, path) 
fold_results = load(path)
print(len(fold_results))
print(fold_results[0])
8736
                     observed  predicted      error  abs_error
Datetime                                                      
2018-01-01 01:00:00  84.90085  37.831378  47.069472  47.069472
2018-01-01 02:00:00  67.44355  37.831378  29.612172  29.612172
2018-01-01 03:00:00  76.66860  37.831378  38.837222  38.837222
2018-01-01 04:00:00  64.96090  37.831378  27.129522  27.129522
2018-01-01 05:00:00  64.14875  37.831378  26.317372  26.317372
2018-01-01 06:00:00  76.06410  37.831378  38.232722  38.232722
2018-01-01 07:00:00  69.19180  37.831378  31.360422  31.360422
2018-01-01 08:00:00  48.51735  37.831378  10.685972  10.685972
2018-01-01 09:00:00  45.92715  37.831378   8.095772   8.095772
2018-01-01 10:00:00  44.19595  37.831378   6.364572   6.364572
2018-01-01 11:00:00  39.27865  37.831378   1.447272   1.447272
2018-01-01 12:00:00  32.61625  37.831378   5.215128   5.215128
2018-01-01 13:00:00  34.09440  37.831378   3.736978   3.736978
2018-01-01 14:00:00  33.51795  37.831378   4.313428   4.313428
2018-01-01 15:00:00  41.24420  37.831378   3.412822   3.412822
2018-01-01 16:00:00  49.08765  37.831378  11.256272  11.256272
2018-01-01 17:00:00  51.24645  37.831378  13.415072  13.415072
2018-01-01 18:00:00  41.64520  37.831378   3.813822   3.813822
2018-01-01 19:00:00  40.98405  37.831378   3.152672   3.152672
2018-01-01 20:00:00  45.36865  37.831378   7.537272   7.537272
2018-01-01 21:00:00  58.24830  37.831378  20.416922  20.416922
2018-01-01 22:00:00  63.21335  37.831378  25.381972  25.381972
2018-01-01 23:00:00  78.28435  37.831378  40.452972  40.452972
2018-01-02 00:00:00  91.30400  37.831378  53.472622  53.472622

Calculate and visualize results

In [17]:
%%time
# Returns a list of mean folds RMSE for n_pred_points (starting at 1 point forecast)
res = get_mean_folds_rmse_for_n_prediction_points(fold_results=fold_results, n_pred_points=n_pred_points)
res
CPU times: user 2min 56s, sys: 123 ms, total: 2min 56s
Wall time: 2min 56s
Out[17]:
[22.219594478879706,
 22.216437235996327,
 22.215180865472913,
 22.213085755280073,
 22.21252481634527,
 22.212170948117542,
 22.209960468319558,
 22.208431600091828,
 22.20969613177227,
 22.210756737832877,
 22.212124873737373,
 22.214644696969696,
 22.216666207529844,
 22.218942275022957,
 22.22145361570248,
 22.22403221992654,
 22.22595305325987,
 22.227605819559233,
 22.22990950413223,
 22.232791609274567,
 22.234595775941226,
 22.235836512855833,
 22.23636769972452,
 22.23493732782369]
In [18]:
print(res)
[22.219594478879706, 22.216437235996327, 22.215180865472913, 22.213085755280073, 22.21252481634527, 22.212170948117542, 22.209960468319558, 22.208431600091828, 22.20969613177227, 22.210756737832877, 22.212124873737373, 22.214644696969696, 22.216666207529844, 22.218942275022957, 22.22145361570248, 22.22403221992654, 22.22595305325987, 22.227605819559233, 22.22990950413223, 22.232791609274567, 22.234595775941226, 22.235836512855833, 22.23636769972452, 22.23493732782369]
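
One plausible reading of "mean folds RMSE for n prediction points" is sketched below: for each horizon k, compute the RMSE of the first k predictions in every fold, then average across folds. This is an assumption about what `get_mean_folds_rmse_for_n_prediction_points` does, not its actual code; the helper `mean_rmse_per_horizon` and the toy folds are hypothetical.

```python
import numpy as np
import pandas as pd


def mean_rmse_per_horizon(folds, n_pred_points):
    """For each horizon k = 1..n_pred_points, average the per-fold RMSE of the first k predictions."""
    result = []
    for k in range(1, n_pred_points + 1):
        fold_rmses = []
        for fold in folds:
            # Signed errors of the first k prediction points in this fold
            err = fold["observed"].iloc[:k] - fold["predicted"].iloc[:k]
            fold_rmses.append(np.sqrt(np.mean(err ** 2)))
        result.append(float(np.mean(fold_rmses)))
    return result


# Two hypothetical folds with the same column names as the fold results above
folds = [
    pd.DataFrame({"observed": [3.0, 5.0], "predicted": [0.0, 1.0]}),
    pd.DataFrame({"observed": [1.0, 1.0], "predicted": [1.0, 1.0]}),
]
rmse_by_horizon = mean_rmse_per_horizon(folds, n_pred_points=2)
print(rmse_by_horizon)
```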

In [19]:
# Show forecasts for n-th point in the future
show_n_points_of_forecasts = [1, 12, 24] # for hourly data
#show_n_points_of_forecasts = [1, 3, 7] # for daily data

# Used to zoom the plots (date ranges shown in the plots)
# for hourly data
start_end_dates = [('2018-01-01', '2019-01-01'), ('2018-02-01', '2018-03-01'), ('2018-06-01', '2018-07-01')]
# for daily data
#start_end_dates = [('2018-01-01', '2019-01-01'), ('2018-02-01', '2018-04-01'), ('2018-06-01', '2018-08-01')]

# Type of plot
# 0 -> plot_observed_vs_predicted
# 1 -> plot_observed_vs_predicted_with_error
plot_types = [0, 1, 1]

# File names for plots (format png will be used, do not add .png extension)
base_file_path = f'images/pm25_obs_vs_pred_365_h_ts_{model_name}' # for hourly data
#base_file_path = f'images/pm25_obs_vs_pred_365_d_ts_{model_name}' # for daily data
In [20]:
visualize_results(show_n_points_of_forecasts=show_n_points_of_forecasts,
                  start_end_dates=start_end_dates,
                  plot_types=plot_types,
                  base_file_path=base_file_path,
                  fold_results=fold_results,
                  n_pred_points=n_pred_points,
                  cut_off_offset=cut_off_offset,
                  model_name=model_name,
                  timestamp=timestamp)


results.py | 92 | visualize_results | 11-Jun-20 19:26:35 | INFO: images/pm25_obs_vs_pred_365_h_ts_SAF_01_lag-01_2020-06-11_19-21-56.png


results.py | 92 | visualize_results | 11-Jun-20 19:26:35 | INFO: images/pm25_obs_vs_pred_365_h_ts_SAF_01_lag-12_2020-06-11_19-21-56.png


results.py | 92 | visualize_results | 11-Jun-20 19:26:36 | INFO: images/pm25_obs_vs_pred_365_h_ts_SAF_01_lag-24_2020-06-11_19-21-56.png


results.py | 92 | visualize_results | 11-Jun-20 19:26:48 | INFO: images/pm25_obs_vs_pred_365_h_ts_SAF_02_lag-01_2020-06-11_19-21-56.png


results.py | 92 | visualize_results | 11-Jun-20 19:26:49 | INFO: images/pm25_obs_vs_pred_365_h_ts_SAF_02_lag-12_2020-06-11_19-21-56.png


results.py | 92 | visualize_results | 11-Jun-20 19:26:50 | INFO: images/pm25_obs_vs_pred_365_h_ts_SAF_02_lag-24_2020-06-11_19-21-56.png


results.py | 92 | visualize_results | 11-Jun-20 19:27:02 | INFO: images/pm25_obs_vs_pred_365_h_ts_SAF_03_lag-01_2020-06-11_19-21-56.png


results.py | 92 | visualize_results | 11-Jun-20 19:27:03 | INFO: images/pm25_obs_vs_pred_365_h_ts_SAF_03_lag-12_2020-06-11_19-21-56.png


results.py | 92 | visualize_results | 11-Jun-20 19:27:03 | INFO: images/pm25_obs_vs_pred_365_h_ts_SAF_03_lag-24_2020-06-11_19-21-56.png

7-Day Prediction

In [21]:
df = dfd.copy()
df.columns = ['pm25']
df.head()
Out[21]:
pm25
Datetime
2008-01-01 53.586957
2008-01-02 30.958333
2008-01-03 46.104167
2008-01-04 42.979167
2008-01-05 57.312500
In [22]:
# Define first past/future cutoff point in time offset (1 year of data)
#cut_off_offset = 365*24 # for hourly data
cut_off_offset = 365 # for daily data

# Set datetime format for index
#dt_format = "%Y-%m-%d %H:%M:%S" # for hourly data
dt_format = "%Y-%m-%d" # for daily data

# Create train and validate sets
train_test_split_position = int(len(df)-cut_off_offset)

# Create as many folds as remains till the end of known data
n_folds = len(df) #train_test_split_position+3

# Predict for X points
#n_pred_points = 24 # for hourly data
n_pred_points = 7 # for daily data
In [23]:
%%time
# Validate result on test
# Creates 365*24*24 models for hourly data, or 365*7 models for daily data

fold_results = walk_forward_ref_model_validation(data=df,
                                                 col_name='pm25',
                                                 model_type='SAF',
                                                 cut_off_offset=cut_off_offset,
                                                 n_pred_points=n_pred_points,
                                                 n_folds=-1,
                                                 period='')
print(len(fold_results))
print(fold_results[0])
validation.py | 105 | walk_forward_ref_model_validation | 11-Jun-20 19:28:10 | INFO: SAF model validation started
Started fold 000000/000365 - 2020-06-11_19-28-10
Started fold 000010/000365 - 2020-06-11_19-28-10
Started fold 000020/000365 - 2020-06-11_19-28-11
Started fold 000030/000365 - 2020-06-11_19-28-11
Started fold 000040/000365 - 2020-06-11_19-28-11
Started fold 000050/000365 - 2020-06-11_19-28-11
Started fold 000060/000365 - 2020-06-11_19-28-11
Started fold 000070/000365 - 2020-06-11_19-28-11
Started fold 000080/000365 - 2020-06-11_19-28-11
Started fold 000090/000365 - 2020-06-11_19-28-12
Started fold 000100/000365 - 2020-06-11_19-28-12
Started fold 000110/000365 - 2020-06-11_19-28-12
Started fold 000120/000365 - 2020-06-11_19-28-12
Started fold 000130/000365 - 2020-06-11_19-28-12
Started fold 000140/000365 - 2020-06-11_19-28-12
Started fold 000150/000365 - 2020-06-11_19-28-12
Started fold 000160/000365 - 2020-06-11_19-28-13
Started fold 000170/000365 - 2020-06-11_19-28-13
Started fold 000180/000365 - 2020-06-11_19-28-13
Started fold 000190/000365 - 2020-06-11_19-28-13
Started fold 000200/000365 - 2020-06-11_19-28-13
Started fold 000210/000365 - 2020-06-11_19-28-13
Started fold 000220/000365 - 2020-06-11_19-28-13
Started fold 000230/000365 - 2020-06-11_19-28-13
Started fold 000240/000365 - 2020-06-11_19-28-14
Started fold 000250/000365 - 2020-06-11_19-28-14
Started fold 000260/000365 - 2020-06-11_19-28-14
Started fold 000270/000365 - 2020-06-11_19-28-14
Started fold 000280/000365 - 2020-06-11_19-28-14
Started fold 000290/000365 - 2020-06-11_19-28-14
Started fold 000300/000365 - 2020-06-11_19-28-14
Started fold 000310/000365 - 2020-06-11_19-28-15
Started fold 000320/000365 - 2020-06-11_19-28-15
Started fold 000330/000365 - 2020-06-11_19-28-15
Started fold 000340/000365 - 2020-06-11_19-28-15
Started fold 000350/000365 - 2020-06-11_19-28-15
358
             observed  predicted      error  abs_error
Datetime                                              
2018-01-02  67.991848  37.836392  30.155456  30.155456
2018-01-03  16.026950  37.836392  21.809442  21.809442
2018-01-04  14.590020  37.836392  23.246371  23.246371
2018-01-05  22.094854  37.836392  15.741538  15.741538
2018-01-06  62.504217  37.836392  24.667825  24.667825
2018-01-07  43.929804  37.836392   6.093412   6.093412
2018-01-08  22.088192  37.836392  15.748200  15.748200
CPU times: user 5.06 s, sys: 42.6 ms, total: 5.1 s
Wall time: 5.11 s

Serialize output data

In [24]:
from joblib import dump, load

timestamp = get_datetime_identifier("%Y-%m-%d_%H-%M-%S")

path = f'results/pm25_ts_{model_name}_results_d_{timestamp}.joblib'

dump(fold_results, path) 
fold_results = load(path)
print(len(fold_results))
print(fold_results[0])
358
             observed  predicted      error  abs_error
Datetime                                              
2018-01-02  67.991848  37.836392  30.155456  30.155456
2018-01-03  16.026950  37.836392  21.809442  21.809442
2018-01-04  14.590020  37.836392  23.246371  23.246371
2018-01-05  22.094854  37.836392  15.741538  15.741538
2018-01-06  62.504217  37.836392  24.667825  24.667825
2018-01-07  43.929804  37.836392   6.093412   6.093412
2018-01-08  22.088192  37.836392  15.748200  15.748200

Calculate and visualize results

In [25]:
%%time
# Returns a list of mean folds RMSE for n_pred_points (starting at 1 point forecast)
res = get_mean_folds_rmse_for_n_prediction_points(fold_results=fold_results, n_pred_points=n_pred_points)
res
CPU times: user 2.14 s, sys: 4.57 ms, total: 2.15 s
Wall time: 2.15 s
Out[25]:
[19.635213105413108,
 19.601542735042734,
 19.632443589743588,
 19.589547863247862,
 19.633233903133906,
 19.623368376068374,
 19.691608831908837]
In [26]:
print(res)
[19.635213105413108, 19.601542735042734, 19.632443589743588, 19.589547863247862, 19.633233903133906, 19.623368376068374, 19.691608831908837]
In [27]:
# Show forecasts for n-th point in the future
#show_n_points_of_forecasts = [1, 12, 24] # for hourly data
show_n_points_of_forecasts = [1, 3, 7] # for daily data

# Used to zoom the plots (date ranges shown in the plots)
# for hourly data
#start_end_dates = [('2018-01-01', '2019-01-01'), ('2018-02-01', '2018-03-01'), ('2018-06-01', '2018-07-01')]
# for daily data
start_end_dates = [('2018-01-01', '2019-01-01'), ('2018-02-01', '2018-04-01'), ('2018-06-01', '2018-08-01')]

# Type of plot
# 0 -> plot_observed_vs_predicted
# 1 -> plot_observed_vs_predicted_with_error
plot_types = [0, 1, 1]

# File names for plots (format png will be used, do not add .png extension)
#base_file_path = f'images/pm25_obs_vs_pred_365_h_ts_{model_name}' # for hourly data
base_file_path = f'images/pm25_obs_vs_pred_365_d_ts_{model_name}' # for daily data
In [28]:
visualize_results(show_n_points_of_forecasts=show_n_points_of_forecasts,
                  start_end_dates=start_end_dates,
                  plot_types=plot_types,
                  base_file_path=base_file_path,
                  fold_results=fold_results,
                  n_pred_points=n_pred_points,
                  cut_off_offset=cut_off_offset,
                  model_name=model_name,
                  timestamp=timestamp)


results.py | 92 | visualize_results | 11-Jun-20 19:29:42 | INFO: images/pm25_obs_vs_pred_365_d_ts_SAF_01_lag-01_2020-06-11_19-28-34.png


results.py | 92 | visualize_results | 11-Jun-20 19:29:43 | INFO: images/pm25_obs_vs_pred_365_d_ts_SAF_01_lag-03_2020-06-11_19-28-34.png


results.py | 92 | visualize_results | 11-Jun-20 19:29:43 | INFO: images/pm25_obs_vs_pred_365_d_ts_SAF_01_lag-07_2020-06-11_19-28-34.png


results.py | 92 | visualize_results | 11-Jun-20 19:29:44 | INFO: images/pm25_obs_vs_pred_365_d_ts_SAF_02_lag-01_2020-06-11_19-28-34.png


results.py | 92 | visualize_results | 11-Jun-20 19:29:45 | INFO: images/pm25_obs_vs_pred_365_d_ts_SAF_02_lag-03_2020-06-11_19-28-34.png


results.py | 92 | visualize_results | 11-Jun-20 19:29:46 | INFO: images/pm25_obs_vs_pred_365_d_ts_SAF_02_lag-07_2020-06-11_19-28-34.png


results.py | 92 | visualize_results | 11-Jun-20 19:29:47 | INFO: images/pm25_obs_vs_pred_365_d_ts_SAF_03_lag-01_2020-06-11_19-28-34.png


results.py | 92 | visualize_results | 11-Jun-20 19:29:47 | INFO: images/pm25_obs_vs_pred_365_d_ts_SAF_03_lag-03_2020-06-11_19-28-34.png


results.py | 92 | visualize_results | 11-Jun-20 19:29:48 | INFO: images/pm25_obs_vs_pred_365_d_ts_SAF_03_lag-07_2020-06-11_19-28-34.png

Persistence Model

The Persistence Model naive forecasting technique assumes that the next expected point is equal to the last observed point.

The Persistence Algorithm uses the value at the current time step (t) to predict the expected outcome at the next time step (t+1). This satisfies the three conditions for a baseline forecast listed above.

The Persistence Algorithm is naive. It is often called the Naive Forecast. It assumes nothing about the specifics of the time series problem to which it is applied. This is what makes it so easy to understand and so quick to implement and evaluate.

The actual prediction and other performance measures depend on the value of the last training data point, so they will differ across dataset samples and train-test split points. The naive method is not well suited to datasets with high variability.
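
A minimal sketch of the persistence idea, assuming the training data is a pandas Series and that the last observed value is repeated for every point of a multi-step forecast. The helper `persistence_forecast` is hypothetical and is not the implementation behind `model_type='PER'`.

```python
import pandas as pd


def persistence_forecast(train: pd.Series, horizon: int) -> list:
    """Repeat the last observed value for each of the next `horizon` points."""
    return [float(train.iloc[-1])] * horizon


# Hypothetical toy training data, not taken from the PM2.5 series
train = pd.Series([30.0, 40.0, 55.0])
per_forecast = persistence_forecast(train, horizon=3)
print(per_forecast)
```

Moving the split point by a single observation changes the whole forecast, which is the sensitivity to the last training value noted above.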

In [29]:
model_name = 'PER'

24-Hour Prediction

In [30]:
df = dfh.copy()
df.columns = ['pm25']
df.head()
Out[30]:
pm25
Datetime
2008-01-01 01:00:00 92.0
2008-01-01 02:00:00 81.0
2008-01-01 03:00:00 73.0
2008-01-01 04:00:00 60.5
2008-01-01 05:00:00 61.0
In [31]:
# Define first past/future cutoff point in time offset (1 year of data)
cut_off_offset = 365*24 # for hourly data
#cut_off_offset = 365 # for daily data

# Set datetime format for index
dt_format = "%Y-%m-%d %H:%M:%S" # for hourly data
#dt_format = "%Y-%m-%d" # for daily data

# Create train and validate sets
train_test_split_position = int(len(df)-cut_off_offset)

# Create as many folds as remains till the end of known data
n_folds = len(df) #train_test_split_position+3

# Predict for X points
n_pred_points = 24 # for hourly data
#n_pred_points = 7 # for daily data
In [32]:
%%time
# Validate result on test
# Creates 365*24*24 models for hourly data, or 365*7 models for daily data

fold_results = walk_forward_ref_model_validation(data=df,
                                                 col_name='pm25',
                                                 model_type='PER',
                                                 cut_off_offset=cut_off_offset,
                                                 n_pred_points=n_pred_points,
                                                 n_folds=-1,
                                                 period='')
print(len(fold_results))
print(fold_results[0])
validation.py | 107 | walk_forward_ref_model_validation | 11-Jun-20 19:31:41 | INFO: PER model validation started
Started fold 000000/008760 - 2020-06-11_19-31-41
Started fold 000100/008760 - 2020-06-11_19-31-50
Started fold 000200/008760 - 2020-06-11_19-31-58
Started fold 000300/008760 - 2020-06-11_19-32-06
Started fold 000400/008760 - 2020-06-11_19-32-14
Started fold 000500/008760 - 2020-06-11_19-32-22
Started fold 000600/008760 - 2020-06-11_19-32-30
Started fold 000700/008760 - 2020-06-11_19-32-38
Started fold 000800/008760 - 2020-06-11_19-32-47
Started fold 000900/008760 - 2020-06-11_19-32-55
Started fold 001000/008760 - 2020-06-11_19-33-03
Started fold 001100/008760 - 2020-06-11_19-33-11
Started fold 001200/008760 - 2020-06-11_19-33-20
Started fold 001300/008760 - 2020-06-11_19-33-28
Started fold 001400/008760 - 2020-06-11_19-33-36
Started fold 001500/008760 - 2020-06-11_19-33-44
Started fold 001600/008760 - 2020-06-11_19-33-52
Started fold 001700/008760 - 2020-06-11_19-34-00
Started fold 001800/008760 - 2020-06-11_19-34-08
Started fold 001900/008760 - 2020-06-11_19-34-16
Started fold 002000/008760 - 2020-06-11_19-34-24
Started fold 002100/008760 - 2020-06-11_19-34-32
Started fold 002200/008760 - 2020-06-11_19-34-40
Started fold 002300/008760 - 2020-06-11_19-34-49
Started fold 002400/008760 - 2020-06-11_19-34-59
Started fold 002500/008760 - 2020-06-11_19-35-07
Started fold 002600/008760 - 2020-06-11_19-35-15
Started fold 002700/008760 - 2020-06-11_19-35-23
Started fold 002800/008760 - 2020-06-11_19-35-31
Started fold 002900/008760 - 2020-06-11_19-35-40
Started fold 003000/008760 - 2020-06-11_19-35-49
Started fold 003100/008760 - 2020-06-11_19-35-57
Started fold 003200/008760 - 2020-06-11_19-36-05
Started fold 003300/008760 - 2020-06-11_19-36-14
Started fold 003400/008760 - 2020-06-11_19-36-22
Started fold 003500/008760 - 2020-06-11_19-36-30
Started fold 003600/008760 - 2020-06-11_19-36-38
Started fold 003700/008760 - 2020-06-11_19-36-46
Started fold 003800/008760 - 2020-06-11_19-36-55
Started fold 003900/008760 - 2020-06-11_19-37-03
Started fold 004000/008760 - 2020-06-11_19-37-11
Started fold 004100/008760 - 2020-06-11_19-37-19
Started fold 004200/008760 - 2020-06-11_19-37-28
Started fold 004300/008760 - 2020-06-11_19-37-36
Started fold 004400/008760 - 2020-06-11_19-37-44
Started fold 004500/008760 - 2020-06-11_19-37-53
Started fold 004600/008760 - 2020-06-11_19-38-02
Started fold 004700/008760 - 2020-06-11_19-38-10
Started fold 004800/008760 - 2020-06-11_19-38-18
Started fold 004900/008760 - 2020-06-11_19-38-27
Started fold 005000/008760 - 2020-06-11_19-38-35
Started fold 005100/008760 - 2020-06-11_19-38-43
Started fold 005200/008760 - 2020-06-11_19-38-53
Started fold 005300/008760 - 2020-06-11_19-39-02
Started fold 005400/008760 - 2020-06-11_19-39-10
Started fold 005500/008760 - 2020-06-11_19-39-18
Started fold 005600/008760 - 2020-06-11_19-39-27
Started fold 005700/008760 - 2020-06-11_19-39-35
Started fold 005800/008760 - 2020-06-11_19-39-43
Started fold 005900/008760 - 2020-06-11_19-39-52
Started fold 006000/008760 - 2020-06-11_19-40-00
Started fold 006100/008760 - 2020-06-11_19-40-09
Started fold 006200/008760 - 2020-06-11_19-40-19
Started fold 006300/008760 - 2020-06-11_19-40-27
Started fold 006400/008760 - 2020-06-11_19-40-36
Started fold 006500/008760 - 2020-06-11_19-40-45
Started fold 006600/008760 - 2020-06-11_19-40-53
Started fold 006700/008760 - 2020-06-11_19-41-02
Started fold 006800/008760 - 2020-06-11_19-41-10
Started fold 006900/008760 - 2020-06-11_19-41-19
Started fold 007000/008760 - 2020-06-11_19-41-28
Started fold 007100/008760 - 2020-06-11_19-41-36
Started fold 007200/008760 - 2020-06-11_19-41-44
Started fold 007300/008760 - 2020-06-11_19-41-53
Started fold 007400/008760 - 2020-06-11_19-42-01
Started fold 007500/008760 - 2020-06-11_19-42-10
Started fold 007600/008760 - 2020-06-11_19-42-18
Started fold 007700/008760 - 2020-06-11_19-42-27
Started fold 007800/008760 - 2020-06-11_19-42-36
Started fold 007900/008760 - 2020-06-11_19-42-44
Started fold 008000/008760 - 2020-06-11_19-42-54
Started fold 008100/008760 - 2020-06-11_19-43-03
Started fold 008200/008760 - 2020-06-11_19-43-11
Started fold 008300/008760 - 2020-06-11_19-43-20
Started fold 008400/008760 - 2020-06-11_19-43-28
Started fold 008500/008760 - 2020-06-11_19-43-37
Started fold 008600/008760 - 2020-06-11_19-43-45
Started fold 008700/008760 - 2020-06-11_19-43-54
8736
                     observed  predicted     error  abs_error
Datetime                                                     
2018-01-01 01:00:00  84.90085    21.3458  63.55505   63.55505
2018-01-01 02:00:00  67.44355    21.3458  46.09775   46.09775
2018-01-01 03:00:00  76.66860    21.3458  55.32280   55.32280
2018-01-01 04:00:00  64.96090    21.3458  43.61510   43.61510
2018-01-01 05:00:00  64.14875    21.3458  42.80295   42.80295
2018-01-01 06:00:00  76.06410    21.3458  54.71830   54.71830
2018-01-01 07:00:00  69.19180    21.3458  47.84600   47.84600
2018-01-01 08:00:00  48.51735    21.3458  27.17155   27.17155
2018-01-01 09:00:00  45.92715    21.3458  24.58135   24.58135
2018-01-01 10:00:00  44.19595    21.3458  22.85015   22.85015
2018-01-01 11:00:00  39.27865    21.3458  17.93285   17.93285
2018-01-01 12:00:00  32.61625    21.3458  11.27045   11.27045
2018-01-01 13:00:00  34.09440    21.3458  12.74860   12.74860
2018-01-01 14:00:00  33.51795    21.3458  12.17215   12.17215
2018-01-01 15:00:00  41.24420    21.3458  19.89840   19.89840
2018-01-01 16:00:00  49.08765    21.3458  27.74185   27.74185
2018-01-01 17:00:00  51.24645    21.3458  29.90065   29.90065
2018-01-01 18:00:00  41.64520    21.3458  20.29940   20.29940
2018-01-01 19:00:00  40.98405    21.3458  19.63825   19.63825
2018-01-01 20:00:00  45.36865    21.3458  24.02285   24.02285
2018-01-01 21:00:00  58.24830    21.3458  36.90250   36.90250
2018-01-01 22:00:00  63.21335    21.3458  41.86755   41.86755
2018-01-01 23:00:00  78.28435    21.3458  56.93855   56.93855
2018-01-02 00:00:00  91.30400    21.3458  69.95820   69.95820
CPU times: user 12min 13s, sys: 1.28 s, total: 12min 14s
Wall time: 12min 15s
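The log above reports 8760 folds, while `len(fold_results)` is 8736, which equals `cut_off_offset - n_pred_points` (8760 - 24). The daily run later in this notebook shows the same relation (365 - 7 = 358). The sketch below reproduces that fold count for a persistence forecast. The project's own validation code is not shown here, so the loop bound, the helper name `walk_forward_persistence`, and the synthetic series are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence


def walk_forward_persistence(
    series: Sequence[float], cut_off_offset: int, n_pred_points: int
) -> List[List[Dict[str, float]]]:
    """Walk forward over the last `cut_off_offset` points, one step per fold."""
    if len(series) <= cut_off_offset or cut_off_offset <= n_pred_points:
        raise ValueError("series too short for the requested offset and horizon")

    first_cutoff = len(series) - cut_off_offset  # index of the first test point
    # Assumption: fold count taken from the notebook output (offset - horizon).
    n_folds = cut_off_offset - n_pred_points

    folds = []
    for i in range(n_folds):
        cutoff = first_cutoff + i
        last_value = series[cutoff - 1]  # last training observation
        fold = [
            {"observed": obs, "predicted": last_value, "abs_error": abs(obs - last_value)}
            for obs in series[cutoff:cutoff + n_pred_points]
        ]
        folds.append(fold)
    return folds


# Synthetic two-year "daily" series, used only to check the fold arithmetic.
series = [float(i % 50) for i in range(730)]
folds = walk_forward_persistence(series, cut_off_offset=365, n_pred_points=7)
print(len(folds), len(folds[0]))
```

Each fold trains on everything before its cutoff and is scored on the next `n_pred_points` observations, so later folds see progressively more history.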

Serialize output data

In [33]:
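from joblib import dump, load

# Timestamp shared by the results file and the plot file names
timestamp = get_datetime_identifier("%Y-%m-%d_%H-%M-%S")

path = f'results/pm25_ts_{model_name}_results_h_{timestamp}.joblib'

# Persist the fold results, then reload them to confirm the round trip
dump(fold_results, path)
fold_results = load(path)
print(len(fold_results))
print(fold_results[0])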
8736
                     observed  predicted     error  abs_error
Datetime                                                     
2018-01-01 01:00:00  84.90085    21.3458  63.55505   63.55505
2018-01-01 02:00:00  67.44355    21.3458  46.09775   46.09775
2018-01-01 03:00:00  76.66860    21.3458  55.32280   55.32280
2018-01-01 04:00:00  64.96090    21.3458  43.61510   43.61510
2018-01-01 05:00:00  64.14875    21.3458  42.80295   42.80295
2018-01-01 06:00:00  76.06410    21.3458  54.71830   54.71830
2018-01-01 07:00:00  69.19180    21.3458  47.84600   47.84600
2018-01-01 08:00:00  48.51735    21.3458  27.17155   27.17155
2018-01-01 09:00:00  45.92715    21.3458  24.58135   24.58135
2018-01-01 10:00:00  44.19595    21.3458  22.85015   22.85015
2018-01-01 11:00:00  39.27865    21.3458  17.93285   17.93285
2018-01-01 12:00:00  32.61625    21.3458  11.27045   11.27045
2018-01-01 13:00:00  34.09440    21.3458  12.74860   12.74860
2018-01-01 14:00:00  33.51795    21.3458  12.17215   12.17215
2018-01-01 15:00:00  41.24420    21.3458  19.89840   19.89840
2018-01-01 16:00:00  49.08765    21.3458  27.74185   27.74185
2018-01-01 17:00:00  51.24645    21.3458  29.90065   29.90065
2018-01-01 18:00:00  41.64520    21.3458  20.29940   20.29940
2018-01-01 19:00:00  40.98405    21.3458  19.63825   19.63825
2018-01-01 20:00:00  45.36865    21.3458  24.02285   24.02285
2018-01-01 21:00:00  58.24830    21.3458  36.90250   36.90250
2018-01-01 22:00:00  63.21335    21.3458  41.86755   41.86755
2018-01-01 23:00:00  78.28435    21.3458  56.93855   56.93855
2018-01-02 00:00:00  91.30400    21.3458  69.95820   69.95820

Calculate and visualize results

In [34]:
%%time
# Returns a list of mean folds RMSE for n_pred_points (starting at 1 point forecast)
res = get_mean_folds_rmse_for_n_prediction_points(fold_results=fold_results, n_pred_points=n_pred_points)
res
CPU times: user 2min 54s, sys: 121 ms, total: 2min 54s
Wall time: 2min 55s
Out[34]:
[4.736325355831037,
 7.053169960973371,
 8.956491999540862,
 10.574964657943067,
 11.980865300734619,
 13.22309616620753,
 14.234124506427918,
 15.063469019742884,
 15.818647486225895,
 16.437228615702477,
 16.915838096877867,
 17.294015518824608,
 17.48212400137741,
 17.558690449954085,
 17.534685365013775,
 17.42358526170799,
 17.222373507805326,
 17.018195661157026,
 16.706883551423324,
 16.26723811983471,
 15.87100703627181,
 15.53065098714417,
 15.336227892561983,
 15.33666230486685]
In [35]:
print(res)
[4.736325355831037, 7.053169960973371, 8.956491999540862, 10.574964657943067, 11.980865300734619, 13.22309616620753, 14.234124506427918, 15.063469019742884, 15.818647486225895, 16.437228615702477, 16.915838096877867, 17.294015518824608, 17.48212400137741, 17.558690449954085, 17.534685365013775, 17.42358526170799, 17.222373507805326, 17.018195661157026, 16.706883551423324, 16.26723811983471, 15.87100703627181, 15.53065098714417, 15.336227892561983, 15.33666230486685]
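The function `get_mean_folds_rmse_for_n_prediction_points` is not shown in this notebook. One plausible reading of its name is: for each horizon length n from 1 to `n_pred_points`, compute the RMSE over the first n forecast points of every fold, then average those values across folds. The sketch below implements that reading only. The real function may aggregate differently, and the names `rmse`, `mean_fold_rmse_per_horizon`, and the sample folds are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Fold = Sequence[Tuple[float, float]]  # (observed, predicted) pairs for one fold


def rmse(pairs: Fold) -> float:
    """Root mean squared error over (observed, predicted) pairs."""
    return math.sqrt(sum((obs - pred) ** 2 for obs, pred in pairs) / len(pairs))


def mean_fold_rmse_per_horizon(folds: Sequence[Fold], n_pred_points: int) -> List[float]:
    """For n = 1..n_pred_points, average over folds the RMSE of the first n points."""
    results = []
    for n in range(1, n_pred_points + 1):
        fold_scores = [rmse(fold[:n]) for fold in folds]
        results.append(sum(fold_scores) / len(fold_scores))
    return results


# Two hypothetical folds with a 3-point horizon.
sample_folds = [
    [(10.0, 7.0), (11.0, 7.0), (12.0, 7.0)],  # errors 3, 4, 5
    [(5.0, 5.0), (5.0, 5.0), (5.0, 5.0)],     # perfect forecast
]
scores = mean_fold_rmse_per_horizon(sample_folds, n_pred_points=3)
print(scores)
```

Whatever the exact aggregation, the list has one entry per horizon, so `res[0]` refers to the 1-point forecast and `res[-1]` to the full `n_pred_points` horizon.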

In [36]:
# Show forecasts for n-th point in the future
show_n_points_of_forecasts = [1, 12, 24] # for hourly data
#show_n_points_of_forecasts = [1, 3, 7] # for daily data

# Used to zoom the plots (date ranges shown in the plots)
# for hourly data
start_end_dates = [('2018-01-01', '2019-01-01'), ('2018-02-01', '2018-03-01'), ('2018-06-01', '2018-07-01')]
# for daily data
#start_end_dates = [('2018-01-01', '2019-01-01'), ('2018-02-01', '2018-04-01'), ('2018-06-01', '2018-08-01')]

# Type of plot
# 0 -> plot_observed_vs_predicted
# 1 -> plot_observed_vs_predicted_with_error
plot_types = [0, 1, 1]

# File names for plots (format png will be used, do not add .png extension)
base_file_path = f'images/pm25_obs_vs_pred_365_h_ts_{model_name}' # for hourly data
#base_file_path = f'images/pm25_obs_vs_pred_365_d_ts_{model_name}' # for daily data
In [37]:
visualize_results(show_n_points_of_forecasts=show_n_points_of_forecasts,
                  start_end_dates=start_end_dates,
                  plot_types=plot_types,
                  base_file_path=base_file_path,
                  fold_results=fold_results,
                  n_pred_points=n_pred_points,
                  cut_off_offset=cut_off_offset,
                  model_name=model_name,
                  timestamp=timestamp)


results.py | 92 | visualize_results | 11-Jun-20 19:48:56 | INFO: images/pm25_obs_vs_pred_365_h_ts_PER_01_lag-01_2020-06-11_19-44-28.png


results.py | 92 | visualize_results | 11-Jun-20 19:48:56 | INFO: images/pm25_obs_vs_pred_365_h_ts_PER_01_lag-12_2020-06-11_19-44-28.png


results.py | 92 | visualize_results | 11-Jun-20 19:48:57 | INFO: images/pm25_obs_vs_pred_365_h_ts_PER_01_lag-24_2020-06-11_19-44-28.png


results.py | 92 | visualize_results | 11-Jun-20 19:49:09 | INFO: images/pm25_obs_vs_pred_365_h_ts_PER_02_lag-01_2020-06-11_19-44-28.png


results.py | 92 | visualize_results | 11-Jun-20 19:49:10 | INFO: images/pm25_obs_vs_pred_365_h_ts_PER_02_lag-12_2020-06-11_19-44-28.png


results.py | 92 | visualize_results | 11-Jun-20 19:49:11 | INFO: images/pm25_obs_vs_pred_365_h_ts_PER_02_lag-24_2020-06-11_19-44-28.png


results.py | 92 | visualize_results | 11-Jun-20 19:49:23 | INFO: images/pm25_obs_vs_pred_365_h_ts_PER_03_lag-01_2020-06-11_19-44-28.png


results.py | 92 | visualize_results | 11-Jun-20 19:49:24 | INFO: images/pm25_obs_vs_pred_365_h_ts_PER_03_lag-12_2020-06-11_19-44-28.png


results.py | 92 | visualize_results | 11-Jun-20 19:49:25 | INFO: images/pm25_obs_vs_pred_365_h_ts_PER_03_lag-24_2020-06-11_19-44-28.png
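The nine log lines above follow a regular naming pattern: the base file path, a two-digit index of the date range, the forecast lag, and the timestamp. The sketch below rebuilds those names from the same inputs. It is inferred from the log output only, not from the source of `visualize_results`, and the helper name `expected_plot_paths` is hypothetical.

```python
from typing import List, Sequence, Tuple


def expected_plot_paths(
    base_file_path: str,
    start_end_dates: Sequence[Tuple[str, str]],
    show_n_points_of_forecasts: Sequence[int],
    timestamp: str,
) -> List[str]:
    """Build plot file names in the pattern seen in the log output."""
    paths = []
    # Outer loop over date ranges, inner loop over lags, as in the log order.
    for range_no, _ in enumerate(start_end_dates, start=1):
        for lag in show_n_points_of_forecasts:
            paths.append(f"{base_file_path}_{range_no:02d}_lag-{lag:02d}_{timestamp}.png")
    return paths


paths = expected_plot_paths(
    base_file_path="images/pm25_obs_vs_pred_365_h_ts_PER",
    start_end_dates=[("2018-01-01", "2019-01-01"),
                     ("2018-02-01", "2018-03-01"),
                     ("2018-06-01", "2018-07-01")],
    show_n_points_of_forecasts=[1, 12, 24],
    timestamp="2020-06-11_19-44-28",
)
for p in paths:
    print(p)
```

A helper like this can be handy for checking that all expected plots exist after a long run.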

7-Day Prediction

In [38]:
df = dfd.copy()
df.columns = ['pm25']
df.head()
Out[38]:
                 pm25
Datetime
2008-01-01  53.586957
2008-01-02  30.958333
2008-01-03  46.104167
2008-01-04  42.979167
2008-01-05  57.312500
In [39]:
# Define first past/future cutoff point in time offset (1 year of data)
#cut_off_offset = 365*24 # for hourly data
cut_off_offset = 365 # for daily data

# Set datetime format for index
#dt_format = "%Y-%m-%d %H:%M:%S" # for hourly data
dt_format = "%Y-%m-%d" # for daily data

# Create train and validate sets
train_test_split_position = int(len(df)-cut_off_offset)

# Create as many folds as remain until the end of the known data
n_folds = len(df) #train_test_split_position+3

# Predict for X points
#n_pred_points = 24 # for hourly data
n_pred_points = 7 # for daily data
In [40]:
%%time
# Validate result on test
# Creates 365*24*24 models for hourly data, or 365*7 models for daily data

fold_results = walk_forward_ref_model_validation(data=df,
                                                 col_name='pm25',
                                                 model_type='PER',
                                                 cut_off_offset=cut_off_offset,
                                                 n_pred_points=n_pred_points,
                                                 n_folds=-1,
                                                 period='')
print(len(fold_results))
print(fold_results[0])
validation.py | 107 | walk_forward_ref_model_validation | 11-Jun-20 19:50:34 | INFO: PER model validation started
Started fold 000000/000365 - 2020-06-11_19-50-34
Started fold 000010/000365 - 2020-06-11_19-50-34
Started fold 000020/000365 - 2020-06-11_19-50-34
Started fold 000030/000365 - 2020-06-11_19-50-34
Started fold 000040/000365 - 2020-06-11_19-50-35
Started fold 000050/000365 - 2020-06-11_19-50-35
Started fold 000060/000365 - 2020-06-11_19-50-35
Started fold 000070/000365 - 2020-06-11_19-50-35
Started fold 000080/000365 - 2020-06-11_19-50-35
Started fold 000090/000365 - 2020-06-11_19-50-35
Started fold 000100/000365 - 2020-06-11_19-50-35
Started fold 000110/000365 - 2020-06-11_19-50-36
Started fold 000120/000365 - 2020-06-11_19-50-36
Started fold 000130/000365 - 2020-06-11_19-50-36
Started fold 000140/000365 - 2020-06-11_19-50-36
Started fold 000150/000365 - 2020-06-11_19-50-36
Started fold 000160/000365 - 2020-06-11_19-50-36
Started fold 000170/000365 - 2020-06-11_19-50-36
Started fold 000180/000365 - 2020-06-11_19-50-37
Started fold 000190/000365 - 2020-06-11_19-50-37
Started fold 000200/000365 - 2020-06-11_19-50-37
Started fold 000210/000365 - 2020-06-11_19-50-37
Started fold 000220/000365 - 2020-06-11_19-50-37
Started fold 000230/000365 - 2020-06-11_19-50-37
Started fold 000240/000365 - 2020-06-11_19-50-38
Started fold 000250/000365 - 2020-06-11_19-50-38
Started fold 000260/000365 - 2020-06-11_19-50-38
Started fold 000270/000365 - 2020-06-11_19-50-38
Started fold 000280/000365 - 2020-06-11_19-50-38
Started fold 000290/000365 - 2020-06-11_19-50-38
Started fold 000300/000365 - 2020-06-11_19-50-38
Started fold 000310/000365 - 2020-06-11_19-50-39
Started fold 000320/000365 - 2020-06-11_19-50-39
Started fold 000330/000365 - 2020-06-11_19-50-39
Started fold 000340/000365 - 2020-06-11_19-50-39
Started fold 000350/000365 - 2020-06-11_19-50-39
358
             observed  predicted      error  abs_error
Datetime                                              
2018-01-02  67.991848  53.008094  14.983754  14.983754
2018-01-03  16.026950  53.008094  36.981144  36.981144
2018-01-04  14.590020  53.008094  38.418073  38.418073
2018-01-05  22.094854  53.008094  30.913240  30.913240
2018-01-06  62.504217  53.008094   9.496123   9.496123
2018-01-07  43.929804  53.008094   9.078290   9.078290
2018-01-08  22.088192  53.008094  30.919902  30.919902
CPU times: user 5.32 s, sys: 46.9 ms, total: 5.37 s
Wall time: 5.38 s

Serialize output data

In [41]:
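from joblib import dump, load

# Timestamp shared by the results file and the plot file names
timestamp = get_datetime_identifier("%Y-%m-%d_%H-%M-%S")

path = f'results/pm25_ts_{model_name}_results_d_{timestamp}.joblib'

# Persist the fold results, then reload them to confirm the round trip
dump(fold_results, path)
fold_results = load(path)
print(len(fold_results))
print(fold_results[0])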
358
             observed  predicted      error  abs_error
Datetime                                              
2018-01-02  67.991848  53.008094  14.983754  14.983754
2018-01-03  16.026950  53.008094  36.981144  36.981144
2018-01-04  14.590020  53.008094  38.418073  38.418073
2018-01-05  22.094854  53.008094  30.913240  30.913240
2018-01-06  62.504217  53.008094   9.496123   9.496123
2018-01-07  43.929804  53.008094   9.078290   9.078290
2018-01-08  22.088192  53.008094  30.919902  30.919902

Calculate and visualize results

In [42]:
%%time
# Returns a list of mean folds RMSE for n_pred_points (starting at 1 point forecast)
res = get_mean_folds_rmse_for_n_prediction_points(fold_results=fold_results, n_pred_points=n_pred_points)
res
CPU times: user 2.2 s, sys: 7.22 ms, total: 2.21 s
Wall time: 2.21 s
Out[42]:
[9.682181481481482,
 14.066957549857548,
 14.978520227920228,
 15.17680683760684,
 15.717147008547009,
 17.067744444444447,
 17.992249857549858]
In [43]:
print(res)
[9.682181481481482, 14.066957549857548, 14.978520227920228, 15.17680683760684, 15.717147008547009, 17.067744444444447, 17.992249857549858]
In [44]:
# Show forecasts for n-th point in the future
#show_n_points_of_forecasts = [1, 12, 24] # for hourly data
show_n_points_of_forecasts = [1, 3, 7] # for daily data

# Used to zoom the plots (date ranges shown in the plots)
# for hourly data
#start_end_dates = [('2018-01-01', '2019-01-01'), ('2018-02-01', '2018-03-01'), ('2018-06-01', '2018-07-01')]
# for daily data
start_end_dates = [('2018-01-01', '2019-01-01'), ('2018-02-01', '2018-04-01'), ('2018-06-01', '2018-08-01')]

# Type of plot
# 0 -> plot_observed_vs_predicted
# 1 -> plot_observed_vs_predicted_with_error
plot_types = [0, 1, 1]

# File names for plots (format png will be used, do not add .png extension)
#base_file_path = f'images/pm25_obs_vs_pred_365_h_ts_{model_name}' # for hourly data
base_file_path = f'images/pm25_obs_vs_pred_365_d_ts_{model_name}' # for daily data
In [45]:
visualize_results(show_n_points_of_forecasts=show_n_points_of_forecasts,
                  start_end_dates=start_end_dates,
                  plot_types=plot_types,
                  base_file_path=base_file_path,
                  fold_results=fold_results,
                  n_pred_points=n_pred_points,
                  cut_off_offset=cut_off_offset,
                  model_name=model_name,
                  timestamp=timestamp)


results.py | 92 | visualize_results | 11-Jun-20 19:52:26 | INFO: images/pm25_obs_vs_pred_365_d_ts_PER_01_lag-01_2020-06-11_19-51-52.png


results.py | 92 | visualize_results | 11-Jun-20 19:52:27 | INFO: images/pm25_obs_vs_pred_365_d_ts_PER_01_lag-03_2020-06-11_19-51-52.png


results.py | 92 | visualize_results | 11-Jun-20 19:52:27 | INFO: images/pm25_obs_vs_pred_365_d_ts_PER_01_lag-07_2020-06-11_19-51-52.png


results.py | 92 | visualize_results | 11-Jun-20 19:52:28 | INFO: images/pm25_obs_vs_pred_365_d_ts_PER_02_lag-01_2020-06-11_19-51-52.png


results.py | 92 | visualize_results | 11-Jun-20 19:52:29 | INFO: images/pm25_obs_vs_pred_365_d_ts_PER_02_lag-03_2020-06-11_19-51-52.png


results.py | 92 | visualize_results | 11-Jun-20 19:52:30 | INFO: images/pm25_obs_vs_pred_365_d_ts_PER_02_lag-07_2020-06-11_19-51-52.png


results.py | 92 | visualize_results | 11-Jun-20 19:52:31 | INFO: images/pm25_obs_vs_pred_365_d_ts_PER_03_lag-01_2020-06-11_19-51-52.png


results.py | 92 | visualize_results | 11-Jun-20 19:52:31 | INFO: images/pm25_obs_vs_pred_365_d_ts_PER_03_lag-03_2020-06-11_19-51-52.png


results.py | 92 | visualize_results | 11-Jun-20 19:52:32 | INFO: images/pm25_obs_vs_pred_365_d_ts_PER_03_lag-07_2020-06-11_19-51-52.png

Simple Moving Average (SMA)

In [46]:
model_name = 'SMA'
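This section does not define the SMA baseline, so the following is only a sketch of one common variant: a recursive simple moving average, where each forecast is the mean of the last `window` values and is then appended to the history used for the next step. The window size of 3, the helper name `sma_forecast`, and the sample values are illustrative assumptions, not the project's settings.

```python
from typing import List, Sequence


def sma_forecast(history: Sequence[float], n_pred_points: int, window: int = 3) -> List[float]:
    """Recursive simple moving average forecast over `n_pred_points` steps."""
    if window < 1 or len(history) < window:
        raise ValueError("history must contain at least `window` observations")
    values = list(history)
    forecast = []
    for _ in range(n_pred_points):
        next_value = sum(values[-window:]) / window  # mean of the latest window
        forecast.append(next_value)
        values.append(next_value)  # feed the forecast back into the window
    return forecast


# Hypothetical readings; the forecast is built step by step from them.
sma_history = [3.0, 6.0, 9.0]
sma_pred = sma_forecast(sma_history, n_pred_points=3)
print(sma_pred)
```

A recursive scheme of this kind produces forecasts that vary over the first few steps and then settle toward a constant level, which resembles the shape of the `predicted` column in the SMA fold output in this section.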

24-Hour Prediction

In [47]:
df = dfh.copy()
df.columns = ['pm25']
df.head()
Out[47]:
                     pm25
Datetime
2008-01-01 01:00:00  92.0
2008-01-01 02:00:00  81.0
2008-01-01 03:00:00  73.0
2008-01-01 04:00:00  60.5
2008-01-01 05:00:00  61.0
In [48]:
# Define first past/future cutoff point in time offset (1 year of data)
cut_off_offset = 365*24 # for hourly data
#cut_off_offset = 365 # for daily data

# Set datetime format for index
dt_format = "%Y-%m-%d %H:%M:%S" # for hourly data
#dt_format = "%Y-%m-%d" # for daily data

# Create train and validate sets
train_test_split_position = int(len(df)-cut_off_offset)

# Create as many folds as remain until the end of the known data
n_folds = len(df) #train_test_split_position+3

# Predict for X points
n_pred_points = 24 # for hourly data
#n_pred_points = 7 # for daily data
In [49]:
%%time
# Validate result on test
# Creates 365*24*24 models for hourly data, or 365*7 models for daily data

fold_results = walk_forward_ref_model_validation(data=df,
                                                 col_name='pm25',
                                                 model_type='SMA',
                                                 cut_off_offset=cut_off_offset,
                                                 n_pred_points=n_pred_points,
                                                 n_folds=-1,
                                                 period='')
print(len(fold_results))
print(fold_results[0])
validation.py | 109 | walk_forward_ref_model_validation | 11-Jun-20 19:54:02 | INFO: SMA model validation started
Started fold 000000/008760 - 2020-06-11_19-54-02
Started fold 000100/008760 - 2020-06-11_19-54-14
Started fold 000200/008760 - 2020-06-11_19-54-26
Started fold 000300/008760 - 2020-06-11_19-54-38
Started fold 000400/008760 - 2020-06-11_19-54-50
Started fold 000500/008760 - 2020-06-11_19-55-03
Started fold 000600/008760 - 2020-06-11_19-55-15
Started fold 000700/008760 - 2020-06-11_19-55-27
Started fold 000800/008760 - 2020-06-11_19-55-39
Started fold 000900/008760 - 2020-06-11_19-55-51
Started fold 001000/008760 - 2020-06-11_19-56-03
Started fold 001100/008760 - 2020-06-11_19-56-15
Started fold 001200/008760 - 2020-06-11_19-56-27
Started fold 001300/008760 - 2020-06-11_19-56-39
Started fold 001400/008760 - 2020-06-11_19-56-51
Started fold 001500/008760 - 2020-06-11_19-57-04
Started fold 001600/008760 - 2020-06-11_19-57-16
Started fold 001700/008760 - 2020-06-11_19-57-28
Started fold 001800/008760 - 2020-06-11_19-57-40
Started fold 001900/008760 - 2020-06-11_19-57-53
Started fold 002000/008760 - 2020-06-11_19-58-05
Started fold 002100/008760 - 2020-06-11_19-58-17
Started fold 002200/008760 - 2020-06-11_19-58-29
Started fold 002300/008760 - 2020-06-11_19-58-41
Started fold 002400/008760 - 2020-06-11_19-58-53
Started fold 002500/008760 - 2020-06-11_19-59-07
Started fold 002600/008760 - 2020-06-11_19-59-19
Started fold 002700/008760 - 2020-06-11_19-59-31
Started fold 002800/008760 - 2020-06-11_19-59-43
Started fold 002900/008760 - 2020-06-11_19-59-55
Started fold 003000/008760 - 2020-06-11_20-00-07
Started fold 003100/008760 - 2020-06-11_20-00-20
Started fold 003200/008760 - 2020-06-11_20-00-33
Started fold 003300/008760 - 2020-06-11_20-00-45
Started fold 003400/008760 - 2020-06-11_20-01-00
Started fold 003500/008760 - 2020-06-11_20-01-13
Started fold 003600/008760 - 2020-06-11_20-01-25
Started fold 003700/008760 - 2020-06-11_20-01-37
Started fold 003800/008760 - 2020-06-11_20-01-49
Started fold 003900/008760 - 2020-06-11_20-02-01
Started fold 004000/008760 - 2020-06-11_20-02-13
Started fold 004100/008760 - 2020-06-11_20-02-27
Started fold 004200/008760 - 2020-06-11_20-02-40
Started fold 004300/008760 - 2020-06-11_20-02-53
Started fold 004400/008760 - 2020-06-11_20-03-07
Started fold 004500/008760 - 2020-06-11_20-03-19
Started fold 004600/008760 - 2020-06-11_20-03-33
Started fold 004700/008760 - 2020-06-11_20-03-46
Started fold 004800/008760 - 2020-06-11_20-03-59
Started fold 004900/008760 - 2020-06-11_20-04-11
Started fold 005000/008760 - 2020-06-11_20-04-23
Started fold 005100/008760 - 2020-06-11_20-04-36
Started fold 005200/008760 - 2020-06-11_20-04-48
Started fold 005300/008760 - 2020-06-11_20-05-02
Started fold 005400/008760 - 2020-06-11_20-05-14
Started fold 005500/008760 - 2020-06-11_20-05-27
Started fold 005600/008760 - 2020-06-11_20-05-39
Started fold 005700/008760 - 2020-06-11_20-05-51
Started fold 005800/008760 - 2020-06-11_20-06-04
Started fold 005900/008760 - 2020-06-11_20-06-16
Started fold 006000/008760 - 2020-06-11_20-06-28
Started fold 006100/008760 - 2020-06-11_20-06-41
Started fold 006200/008760 - 2020-06-11_20-06-54
Started fold 006300/008760 - 2020-06-11_20-07-07
Started fold 006400/008760 - 2020-06-11_20-07-20
Started fold 006500/008760 - 2020-06-11_20-07-32
Started fold 006600/008760 - 2020-06-11_20-07-45
Started fold 006700/008760 - 2020-06-11_20-07-57
Started fold 006800/008760 - 2020-06-11_20-08-09
Started fold 006900/008760 - 2020-06-11_20-08-22
Started fold 007000/008760 - 2020-06-11_20-08-34
Started fold 007100/008760 - 2020-06-11_20-08-48
Started fold 007200/008760 - 2020-06-11_20-09-02
Started fold 007300/008760 - 2020-06-11_20-09-15
Started fold 007400/008760 - 2020-06-11_20-09-27
Started fold 007500/008760 - 2020-06-11_20-09-39
Started fold 007600/008760 - 2020-06-11_20-09-52
Started fold 007700/008760 - 2020-06-11_20-10-04
Started fold 007800/008760 - 2020-06-11_20-10-17
Started fold 007900/008760 - 2020-06-11_20-10-29
Started fold 008000/008760 - 2020-06-11_20-10-42
Started fold 008100/008760 - 2020-06-11_20-10-54
Started fold 008200/008760 - 2020-06-11_20-11-08
Started fold 008300/008760 - 2020-06-11_20-11-21
Started fold 008400/008760 - 2020-06-11_20-11-33
Started fold 008500/008760 - 2020-06-11_20-11-46
Started fold 008600/008760 - 2020-06-11_20-11-58
Started fold 008700/008760 - 2020-06-11_20-12-11
8736
                     observed  predicted      error  abs_error
Datetime                                                      
2018-01-01 01:00:00  84.90085  32.356375  52.544475  52.544475
2018-01-01 02:00:00  67.44355  33.695469  33.748081  33.748081
2018-01-01 03:00:00  76.66860  30.248536  46.420064  46.420064
2018-01-01 04:00:00  64.96090  29.411545  35.549355  35.549355
2018-01-01 05:00:00  64.14875  31.427981  32.720769  32.720769
2018-01-01 06:00:00  76.06410  31.195883  44.868217  44.868217
2018-01-01 07:00:00  69.19180  30.570986  38.620814  38.620814
2018-01-01 08:00:00  48.51735  30.651599  17.865751  17.865751
2018-01-01 09:00:00  45.92715  30.961612  14.965538  14.965538
2018-01-01 10:00:00  44.19595  30.845020  13.350930  13.350930
2018-01-01 11:00:00  39.27865  30.757304   8.521346   8.521346
2018-01-01 12:00:00  32.61625  30.803884   1.812366   1.812366
2018-01-01 13:00:00  34.09440  30.841955   3.252445   3.252445
2018-01-01 14:00:00  33.51795  30.812041   2.705909   2.705909
2018-01-01 15:00:00  41.24420  30.803796  10.440404  10.440404
2018-01-01 16:00:00  49.08765  30.815419  18.272231  18.272231
2018-01-01 17:00:00  51.24645  30.818303  20.428147  20.428147
2018-01-01 18:00:00  41.64520  30.812390  10.832810  10.832810
2018-01-01 19:00:00  40.98405  30.812477  10.171573  10.171573
2018-01-01 20:00:00  45.36865  30.814647  14.554003  14.554003
2018-01-01 21:00:00  58.24830  30.814454  27.433846  27.433846
2018-01-01 22:00:00  63.21335  30.813492  32.399858  32.399858
2018-01-01 23:00:00  78.28435  30.813767  47.470583  47.470583
2018-01-02 00:00:00  91.30400  30.814090  60.489910  60.489910
CPU times: user 18min 9s, sys: 2.94 s, total: 18min 12s
Wall time: 18min 13s

Serialize output data

In [50]:
from joblib import dump, load

timestamp = get_datetime_identifier("%Y-%m-%d_%H-%M-%S")

path = f'results/pm25_ts_{model_name}_results_h_{timestamp}.joblib'

dump(fold_results, path) 
fold_results = load(path)
print(len(fold_results))
print(fold_results[0])
8736
                     observed  predicted      error  abs_error
Datetime                                                      
2018-01-01 01:00:00  84.90085  32.356375  52.544475  52.544475
2018-01-01 02:00:00  67.44355  33.695469  33.748081  33.748081
2018-01-01 03:00:00  76.66860  30.248536  46.420064  46.420064
2018-01-01 04:00:00  64.96090  29.411545  35.549355  35.549355
2018-01-01 05:00:00  64.14875  31.427981  32.720769  32.720769
2018-01-01 06:00:00  76.06410  31.195883  44.868217  44.868217
2018-01-01 07:00:00  69.19180  30.570986  38.620814  38.620814
2018-01-01 08:00:00  48.51735  30.651599  17.865751  17.865751
2018-01-01 09:00:00  45.92715  30.961612  14.965538  14.965538
2018-01-01 10:00:00  44.19595  30.845020  13.350930  13.350930
2018-01-01 11:00:00  39.27865  30.757304   8.521346   8.521346
2018-01-01 12:00:00  32.61625  30.803884   1.812366   1.812366
2018-01-01 13:00:00  34.09440  30.841955   3.252445   3.252445
2018-01-01 14:00:00  33.51795  30.812041   2.705909   2.705909
2018-01-01 15:00:00  41.24420  30.803796  10.440404  10.440404
2018-01-01 16:00:00  49.08765  30.815419  18.272231  18.272231
2018-01-01 17:00:00  51.24645  30.818303  20.428147  20.428147
2018-01-01 18:00:00  41.64520  30.812390  10.832810  10.832810
2018-01-01 19:00:00  40.98405  30.812477  10.171573  10.171573
2018-01-01 20:00:00  45.36865  30.814647  14.554003  14.554003
2018-01-01 21:00:00  58.24830  30.814454  27.433846  27.433846
2018-01-01 22:00:00  63.21335  30.813492  32.399858  32.399858
2018-01-01 23:00:00  78.28435  30.813767  47.470583  47.470583
2018-01-02 00:00:00  91.30400  30.814090  60.489910  60.489910
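
The dump/load round trip above can be tried in isolation. The sketch below is illustrative only: `save_and_reload` is a hypothetical helper, the data frames hold made-up numbers, and a temporary directory stands in for `results/`. It creates the parent directory first because `joblib.dump` does not create missing directories.

```python
import tempfile
from pathlib import Path

import pandas as pd
from joblib import dump, load


def save_and_reload(obj, path):
    """Persist obj with joblib, creating the parent directory first, then read it back."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)  # dump() fails if the directory is missing
    dump(obj, str(path))
    return load(str(path))


# Toy stand-in for fold_results: a list of small data frames (made-up numbers).
toy_folds = [
    pd.DataFrame({"observed": [10.0, 12.0], "predicted": [11.0, 11.0]}),
    pd.DataFrame({"observed": [12.0, 9.0], "predicted": [11.5, 11.5]}),
]

with tempfile.TemporaryDirectory() as tmp:
    restored = save_and_reload(toy_folds, Path(tmp) / "results" / "toy_folds.joblib")

print(len(restored))
```

Reloading right after saving, as the cell above does, is a cheap check that the file is readable before a long run is discarded.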

Calculate and visualize results

In [51]:
%%time
# Returns a list of mean folds RMSE for n_pred_points (starting at 1 point forecast)
res = get_mean_folds_rmse_for_n_prediction_points(fold_results=fold_results, n_pred_points=n_pred_points)
res
CPU times: user 2min 53s, sys: 100 ms, total: 2min 53s
Wall time: 2min 53s
Out[51]:
[7.146073473370064,
 8.508867665289257,
 9.847301721763085,
 11.215512121212122,
 12.709579889807163,
 13.664001698806244,
 14.504790140036732,
 15.262237959136824,
 15.907483918732781,
 16.393955383379247,
 16.74894462809917,
 16.967410560146924,
 17.0615229912764,
 17.040708000459137,
 16.931553018824605,
 16.742444249311298,
 16.497563636363637,
 16.19251631083563,
 15.801181577134988,
 15.423336340679521,
 15.106160066574839,
 14.923811191460054,
 14.947423083103766,
 15.229386306244264]
In [52]:
print(res)
[7.146073473370064, 8.508867665289257, 9.847301721763085, 11.215512121212122, 12.709579889807163, 13.664001698806244, 14.504790140036732, 15.262237959136824, 15.907483918732781, 16.393955383379247, 16.74894462809917, 16.967410560146924, 17.0615229912764, 17.040708000459137, 16.931553018824605, 16.742444249311298, 16.497563636363637, 16.19251631083563, 15.801181577134988, 15.423336340679521, 15.106160066574839, 14.923811191460054, 14.947423083103766, 15.229386306244264]
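
The source of `get_mean_folds_rmse_for_n_prediction_points` is not shown in this notebook. One plausible reading of "RMSE per prediction point" is sketched below: take the h-th row of every fold and compute a single RMSE across folds for that horizon. The helper name `rmse_per_horizon` and the toy numbers are assumptions, and the project function may aggregate differently.

```python
import numpy as np
import pandas as pd


def rmse_per_horizon(fold_results, n_pred_points):
    """Return one RMSE per forecast horizon, pooling the h-th row of every fold."""
    scores = []
    for h in range(n_pred_points):
        # Keep only folds long enough to contain horizon h.
        rows = [fold.iloc[h] for fold in fold_results if len(fold) > h]
        observed = np.array([row["observed"] for row in rows])
        predicted = np.array([row["predicted"] for row in rows])
        scores.append(float(np.sqrt(np.mean((observed - predicted) ** 2))))
    return scores


# Two toy folds with two forecast points each (made-up numbers).
toy_folds = [
    pd.DataFrame({"observed": [10.0, 14.0], "predicted": [13.0, 13.0]}),
    pd.DataFrame({"observed": [14.0, 8.0], "predicted": [10.0, 10.0]}),
]
toy_rmse = rmse_per_horizon(toy_folds, n_pred_points=2)
print(toy_rmse)
```

The first element corresponds to the 1-point-ahead forecast and the last to the `n_pred_points`-ahead forecast, the same ordering as the list printed above.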

In [53]:
# Show forecasts for n-th point in the future
show_n_points_of_forecasts = [1, 12, 24] # for hourly data
#show_n_points_of_forecasts = [1, 3, 7] # for daily data

# Used to zoom the plots (date ranges shown in the plots)
# for hourly data
start_end_dates = [('2018-01-01', '2019-01-01'), ('2018-02-01', '2018-03-01'), ('2018-06-01', '2018-07-01')]
# for daily data
#start_end_dates = [('2018-01-01', '2019-01-01'), ('2018-02-01', '2018-04-01'), ('2018-06-01', '2018-08-01')]

# Type of plot
# 0 -> plot_observed_vs_predicted
# 1 -> plot_observed_vs_predicted_with_error
plot_types = [0, 1, 1]

# File names for plots (format png will be used, do not add .png extension)
base_file_path = f'images/pm25_obs_vs_pred_365_h_ts_{model_name}' # for hourly data
#base_file_path = f'images/pm25_obs_vs_pred_365_d_ts_{model_name}' # for daily data
In [54]:
visualize_results(show_n_points_of_forecasts=show_n_points_of_forecasts,
                   start_end_dates=start_end_dates,
                   plot_types=plot_types,
                   base_file_path=base_file_path,
                   fold_results=fold_results, 
                   n_pred_points=n_pred_points, 
                   cut_off_offset=cut_off_offset, 
                   model_name=model_name,
                   timestamp=timestamp)


results.py | 92 | visualize_results | 11-Jun-20 20:19:21 | INFO: images/pm25_obs_vs_pred_365_h_ts_SMA_01_lag-01_2020-06-11_20-12-48.png


results.py | 92 | visualize_results | 11-Jun-20 20:19:22 | INFO: images/pm25_obs_vs_pred_365_h_ts_SMA_01_lag-12_2020-06-11_20-12-48.png


results.py | 92 | visualize_results | 11-Jun-20 20:19:23 | INFO: images/pm25_obs_vs_pred_365_h_ts_SMA_01_lag-24_2020-06-11_20-12-48.png


results.py | 92 | visualize_results | 11-Jun-20 20:19:35 | INFO: images/pm25_obs_vs_pred_365_h_ts_SMA_02_lag-01_2020-06-11_20-12-48.png


results.py | 92 | visualize_results | 11-Jun-20 20:19:36 | INFO: images/pm25_obs_vs_pred_365_h_ts_SMA_02_lag-12_2020-06-11_20-12-48.png


results.py | 92 | visualize_results | 11-Jun-20 20:19:36 | INFO: images/pm25_obs_vs_pred_365_h_ts_SMA_02_lag-24_2020-06-11_20-12-48.png


results.py | 92 | visualize_results | 11-Jun-20 20:19:49 | INFO: images/pm25_obs_vs_pred_365_h_ts_SMA_03_lag-01_2020-06-11_20-12-48.png


results.py | 92 | visualize_results | 11-Jun-20 20:19:50 | INFO: images/pm25_obs_vs_pred_365_h_ts_SMA_03_lag-12_2020-06-11_20-12-48.png


results.py | 92 | visualize_results | 11-Jun-20 20:19:51 | INFO: images/pm25_obs_vs_pred_365_h_ts_SMA_03_lag-24_2020-06-11_20-12-48.png

7-Day Prediction

In [55]:
df = dfd.copy()
df.columns = ['pm25']
df.head()
Out[55]:
pm25
Datetime
2008-01-01 53.586957
2008-01-02 30.958333
2008-01-03 46.104167
2008-01-04 42.979167
2008-01-05 57.312500
In [56]:
# Define first past/future cutoff point in time offset (1 year of data)
#cut_off_offset = 365*24 # for hourly data
cut_off_offset = 365 # for daily data

# Set datetime format for index
#dt_format = "%Y-%m-%d %H:%M:%S" # for hourly data
dt_format = "%Y-%m-%d" # for daily data

# Create train and validate sets
train_test_split_position = int(len(df)-cut_off_offset)

# Create as many folds as remains till the end of known data
n_folds = len(df) #train_test_split_position+3

# Predict for X points
#n_pred_points = 24 # for hourly data
n_pred_points = 7 # for daily data
In [57]:
%%time
# Validate result on test
# Creates 365*24*24 models for hourly data, or 365*7 models for daily data

fold_results = walk_forward_ref_model_validation(data=df,
                                                 col_name='pm25',
                                                 model_type='SMA',
                                                 cut_off_offset=cut_off_offset,
                                                 n_pred_points=n_pred_points,
                                                 n_folds=-1,
                                                 period='',
                                                 rolling_window=4)
print(len(fold_results))
print(fold_results[0])
validation.py | 109 | walk_forward_ref_model_validation | 11-Jun-20 20:20:31 | INFO: SMA model validation started
Started fold 000000/000365 - 2020-06-11_20-20-31
Started fold 000010/000365 - 2020-06-11_20-20-31
Started fold 000020/000365 - 2020-06-11_20-20-32
Started fold 000030/000365 - 2020-06-11_20-20-32
Started fold 000040/000365 - 2020-06-11_20-20-32
Started fold 000050/000365 - 2020-06-11_20-20-32
Started fold 000060/000365 - 2020-06-11_20-20-32
Started fold 000070/000365 - 2020-06-11_20-20-32
Started fold 000080/000365 - 2020-06-11_20-20-33
Started fold 000090/000365 - 2020-06-11_20-20-33
Started fold 000100/000365 - 2020-06-11_20-20-33
Started fold 000110/000365 - 2020-06-11_20-20-33
Started fold 000120/000365 - 2020-06-11_20-20-33
Started fold 000130/000365 - 2020-06-11_20-20-34
Started fold 000140/000365 - 2020-06-11_20-20-34
Started fold 000150/000365 - 2020-06-11_20-20-34
Started fold 000160/000365 - 2020-06-11_20-20-34
Started fold 000170/000365 - 2020-06-11_20-20-34
Started fold 000180/000365 - 2020-06-11_20-20-34
Started fold 000190/000365 - 2020-06-11_20-20-35
Started fold 000200/000365 - 2020-06-11_20-20-35
Started fold 000210/000365 - 2020-06-11_20-20-35
Started fold 000220/000365 - 2020-06-11_20-20-35
Started fold 000230/000365 - 2020-06-11_20-20-35
Started fold 000240/000365 - 2020-06-11_20-20-35
Started fold 000250/000365 - 2020-06-11_20-20-36
Started fold 000260/000365 - 2020-06-11_20-20-36
Started fold 000270/000365 - 2020-06-11_20-20-36
Started fold 000280/000365 - 2020-06-11_20-20-36
Started fold 000290/000365 - 2020-06-11_20-20-36
Started fold 000300/000365 - 2020-06-11_20-20-36
Started fold 000310/000365 - 2020-06-11_20-20-37
Started fold 000320/000365 - 2020-06-11_20-20-37
Started fold 000330/000365 - 2020-06-11_20-20-37
Started fold 000340/000365 - 2020-06-11_20-20-37
Started fold 000350/000365 - 2020-06-11_20-20-37
358
             observed  predicted      error  abs_error
Datetime                                              
2018-01-02  67.991848  32.029379  35.962469  35.962469
2018-01-03  16.026950  33.779889  17.752939  17.752939
2018-01-04  14.590020  33.403785  18.813765  18.813765
2018-01-05  22.094854  38.055287  15.960432  15.960432
2018-01-06  62.504217  34.317085  28.187132  28.187132
2018-01-07  43.929804  34.889011   9.040793   9.040793
2018-01-08  22.088192  35.166292  13.078100  13.078100
CPU times: user 6.4 s, sys: 49.6 ms, total: 6.45 s
Wall time: 6.46 s
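
The validation above relies on the project's `walk_forward_ref_model_validation`, whose source is not shown here. As a rough sketch of the general walk-forward idea (not the project's implementation), the hypothetical `walk_forward_sma` below moves the cutoff forward one step at a time and scores a simple moving average forecast on the points that follow. Assumptions: the forecast is flat (the mean of the last `rolling_window` observations), `error` is signed, and only folds with a full horizon are kept, so fold counts and values will not match the output above.

```python
import pandas as pd


def walk_forward_sma(series, cut_off_offset, n_pred_points, rolling_window):
    """Walk forward over the last cut_off_offset points with a flat SMA forecast."""
    split = len(series) - cut_off_offset
    folds = []
    # Only keep folds that have a full n_pred_points of observations to score against.
    for start in range(split, len(series) - n_pred_points + 1):
        history = series.iloc[:start]
        observed = series.iloc[start:start + n_pred_points]
        level = history.iloc[-rolling_window:].mean()
        fold = pd.DataFrame({"observed": observed, "predicted": level})
        fold["error"] = fold["observed"] - fold["predicted"]
        fold["abs_error"] = fold["error"].abs()
        folds.append(fold)
    return folds


# Toy daily series (made-up numbers).
toy_series = pd.Series(
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    index=pd.date_range("2018-01-01", periods=8, freq="D"),
)
toy_folds = walk_forward_sma(toy_series, cut_off_offset=4, n_pred_points=2, rolling_window=2)
print(len(toy_folds))
print(toy_folds[0])
```

Each fold only sees data before its own cutoff, which is what keeps the resulting error estimates free of look-ahead.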

Serialize output data

In [58]:
from joblib import dump, load

timestamp = get_datetime_identifier("%Y-%m-%d_%H-%M-%S")

path = f'results/pm25_ts_{model_name}_results_d_{timestamp}.joblib'

dump(fold_results, path) 
fold_results = load(path)
print(len(fold_results))
print(fold_results[0])
358
             observed  predicted      error  abs_error
Datetime                                              
2018-01-02  67.991848  32.029379  35.962469  35.962469
2018-01-03  16.026950  33.779889  17.752939  17.752939
2018-01-04  14.590020  33.403785  18.813765  18.813765
2018-01-05  22.094854  38.055287  15.960432  15.960432
2018-01-06  62.504217  34.317085  28.187132  28.187132
2018-01-07  43.929804  34.889011   9.040793   9.040793
2018-01-08  22.088192  35.166292  13.078100  13.078100

Calculate and visualize results

In [59]:
%%time
# Returns a list of mean folds RMSE for n_pred_points (starting at 1 point forecast)
res = get_mean_folds_rmse_for_n_prediction_points(fold_results=fold_results, n_pred_points=n_pred_points)
res
CPU times: user 2.44 s, sys: 13 ms, total: 2.45 s
Wall time: 2.47 s
Out[59]:
[11.28434188034188,
 12.888569800569801,
 13.431560683760683,
 14.104865242165241,
 15.333503703703705,
 16.172740740740746,
 16.877947293447296]
In [60]:
print(res)
[11.28434188034188, 12.888569800569801, 13.431560683760683, 14.104865242165241, 15.333503703703705, 16.172740740740746, 16.877947293447296]
In [61]:
# Show forecasts for n-th point in the future
#show_n_points_of_forecasts = [1, 12, 24] # for hourly data
show_n_points_of_forecasts = [1, 3, 7] # for daily data

# Used to zoom the plots (date ranges shown in the plots)
# for hourly data
#start_end_dates = [('2018-01-01', '2019-01-01'), ('2018-02-01', '2018-03-01'), ('2018-06-01', '2018-07-01')]
# for daily data
start_end_dates = [('2018-01-01', '2019-01-01'), ('2018-02-01', '2018-04-01'), ('2018-06-01', '2018-08-01')]

# Type of plot
# 0 -> plot_observed_vs_predicted
# 1 -> plot_observed_vs_predicted_with_error
plot_types = [0, 1, 1]

# File names for plots (format png will be used, do not add .png extension)
#base_file_path = f'images/pm25_obs_vs_pred_365_h_ts_{model_name}' # for hourly data
base_file_path = f'images/pm25_obs_vs_pred_365_d_ts_{model_name}' # for daily data
In [62]:
visualize_results(show_n_points_of_forecasts=show_n_points_of_forecasts,
                   start_end_dates=start_end_dates,
                   plot_types=plot_types,
                   base_file_path=base_file_path,
                   fold_results=fold_results, 
                   n_pred_points=n_pred_points, 
                   cut_off_offset=cut_off_offset, 
                   model_name=model_name,
                   timestamp=timestamp)


results.py | 92 | visualize_results | 11-Jun-20 20:21:27 | INFO: images/pm25_obs_vs_pred_365_d_ts_SMA_01_lag-01_2020-06-11_20-20-51.png


results.py | 92 | visualize_results | 11-Jun-20 20:21:28 | INFO: images/pm25_obs_vs_pred_365_d_ts_SMA_01_lag-03_2020-06-11_20-20-51.png


results.py | 92 | visualize_results | 11-Jun-20 20:21:29 | INFO: images/pm25_obs_vs_pred_365_d_ts_SMA_01_lag-07_2020-06-11_20-20-51.png


results.py | 92 | visualize_results | 11-Jun-20 20:21:30 | INFO: images/pm25_obs_vs_pred_365_d_ts_SMA_02_lag-01_2020-06-11_20-20-51.png


results.py | 92 | visualize_results | 11-Jun-20 20:21:30 | INFO: images/pm25_obs_vs_pred_365_d_ts_SMA_02_lag-03_2020-06-11_20-20-51.png


results.py | 92 | visualize_results | 11-Jun-20 20:21:31 | INFO: images/pm25_obs_vs_pred_365_d_ts_SMA_02_lag-07_2020-06-11_20-20-51.png


results.py | 92 | visualize_results | 11-Jun-20 20:21:32 | INFO: images/pm25_obs_vs_pred_365_d_ts_SMA_03_lag-01_2020-06-11_20-20-51.png


results.py | 92 | visualize_results | 11-Jun-20 20:21:33 | INFO: images/pm25_obs_vs_pred_365_d_ts_SMA_03_lag-03_2020-06-11_20-20-51.png


results.py | 92 | visualize_results | 11-Jun-20 20:21:34 | INFO: images/pm25_obs_vs_pred_365_d_ts_SMA_03_lag-07_2020-06-11_20-20-51.png

Exponential Moving Average

In [80]:
model_name = 'EMA'
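
How the project computes its EMA forecast is not shown in this notebook. A minimal sketch of one common approach, assuming pandas' `ewm` with a fixed smoothing factor, is to carry the last exponentially weighted mean forward as a flat forecast for every future point. The helper `ema_flat_forecast`, the `alpha=0.5` value and the toy series are illustrative assumptions.

```python
import pandas as pd


def ema_flat_forecast(history, n_pred_points, alpha=0.5):
    """Forecast every future point as the last exponentially weighted mean of history."""
    # adjust=False gives the recursive form: level_t = (1 - alpha) * level_{t-1} + alpha * x_t
    last_level = history.ewm(alpha=alpha, adjust=False).mean().iloc[-1]
    return [float(last_level)] * n_pred_points


toy_history = pd.Series([10.0, 20.0, 30.0])
toy_forecast = ema_flat_forecast(toy_history, n_pred_points=3, alpha=0.5)
print(toy_forecast)
```

A larger `alpha` weights recent observations more heavily, so the forecast reacts faster to recent changes at the cost of more noise.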

24-Hour Prediction

In [81]:
df = dfh.copy()
df.columns = ['pm25']
df.head()
Out[81]:
pm25
Datetime
2008-01-01 01:00:00 92.0
2008-01-01 02:00:00 81.0
2008-01-01 03:00:00 73.0
2008-01-01 04:00:00 60.5
2008-01-01 05:00:00 61.0
In [82]:
# Define first past/future cutoff point in time offset (1 year of data)
cut_off_offset = 365*24 # for hourly data
#cut_off_offset = 365 # for daily data

# Set datetime format for index
dt_format = "%Y-%m-%d %H:%M:%S" # for hourly data
#dt_format = "%Y-%m-%d" # for daily data

# Create train and validate sets
train_test_split_position = int(len(df)-cut_off_offset)

# Create as many folds as remains till the end of known data
n_folds = len(df) #train_test_split_position+3

# Predict for X points
n_pred_points = 24 # for hourly data
#n_pred_points = 7 # for daily data
In [83]:
%%time
# Validate result on test
# Creates 365*24*24 models for hourly data, or 365*7 models for daily data

fold_results = walk_forward_ref_model_validation(data=df,
                                                 col_name='pm25',
                                                 model_type='EMA',
                                                 cut_off_offset=cut_off_offset,
                                                 n_pred_points=n_pred_points,
                                                 n_folds=-1,
                                                 period='')
print(len(fold_results))
print(fold_results[0])
validation.py | 111 | walk_forward_ref_model_validation | 11-Jun-20 21:31:27 | INFO: EMA model validation started
Started fold 000000/008760 - 2020-06-11_21-31-27
Started fold 000100/008760 - 2020-06-11_21-31-49
Started fold 000200/008760 - 2020-06-11_21-32-13
Started fold 000300/008760 - 2020-06-11_21-32-36
Started fold 000400/008760 - 2020-06-11_21-32-57
Started fold 000500/008760 - 2020-06-11_21-33-20
Started fold 000600/008760 - 2020-06-11_21-33-42
Started fold 000700/008760 - 2020-06-11_21-34-04
Started fold 000800/008760 - 2020-06-11_21-34-25
Started fold 000900/008760 - 2020-06-11_21-34-47
Started fold 001000/008760 - 2020-06-11_21-35-09
Started fold 001100/008760 - 2020-06-11_21-35-31
Started fold 001200/008760 - 2020-06-11_21-35-53
Started fold 001300/008760 - 2020-06-11_21-36-16
Started fold 001400/008760 - 2020-06-11_21-36-38
Started fold 001500/008760 - 2020-06-11_21-37-01
Started fold 001600/008760 - 2020-06-11_21-37-24
Started fold 001700/008760 - 2020-06-11_21-37-46
Started fold 001800/008760 - 2020-06-11_21-38-11
Started fold 001900/008760 - 2020-06-11_21-38-33
Started fold 002000/008760 - 2020-06-11_21-38-55
Started fold 002100/008760 - 2020-06-11_21-39-17
Started fold 002200/008760 - 2020-06-11_21-39-40
Started fold 002300/008760 - 2020-06-11_21-40-04
Started fold 002400/008760 - 2020-06-11_21-40-26
Started fold 002500/008760 - 2020-06-11_21-40-48
Started fold 002600/008760 - 2020-06-11_21-41-10
Started fold 002700/008760 - 2020-06-11_21-41-33
Started fold 002800/008760 - 2020-06-11_21-41-58
Started fold 002900/008760 - 2020-06-11_21-42-22
Started fold 003000/008760 - 2020-06-11_21-42-44
Started fold 003100/008760 - 2020-06-11_21-43-06
Started fold 003200/008760 - 2020-06-11_21-43-29
Started fold 003300/008760 - 2020-06-11_21-43-52
Started fold 003400/008760 - 2020-06-11_21-44-15
Started fold 003500/008760 - 2020-06-11_21-44-37
Started fold 003600/008760 - 2020-06-11_21-44-59
Started fold 003700/008760 - 2020-06-11_21-45-23
Started fold 003800/008760 - 2020-06-11_21-45-48
Started fold 003900/008760 - 2020-06-11_21-46-10
Started fold 004000/008760 - 2020-06-11_21-46-33
Started fold 004100/008760 - 2020-06-11_21-46-55
Started fold 004200/008760 - 2020-06-11_21-47-17
Started fold 004300/008760 - 2020-06-11_21-47-41
Started fold 004400/008760 - 2020-06-11_21-48-06
Started fold 004500/008760 - 2020-06-11_21-48-28
Started fold 004600/008760 - 2020-06-11_21-48-50
Started fold 004700/008760 - 2020-06-11_21-49-13
Started fold 004800/008760 - 2020-06-11_21-49-37
Started fold 004900/008760 - 2020-06-11_21-50-02
Started fold 005000/008760 - 2020-06-11_21-50-24
Started fold 005100/008760 - 2020-06-11_21-50-47
Started fold 005200/008760 - 2020-06-11_21-51-09
Started fold 005300/008760 - 2020-06-11_21-51-33
Started fold 005400/008760 - 2020-06-11_21-51-56
Started fold 005500/008760 - 2020-06-11_21-52-19
Started fold 005600/008760 - 2020-06-11_21-52-42
Started fold 005700/008760 - 2020-06-11_21-53-04
Started fold 005800/008760 - 2020-06-11_21-53-29
Started fold 005900/008760 - 2020-06-11_21-53-54
Started fold 006000/008760 - 2020-06-11_21-54-18
Started fold 006100/008760 - 2020-06-11_21-54-41
Started fold 006200/008760 - 2020-06-11_21-55-03
Started fold 006300/008760 - 2020-06-11_21-55-27
Started fold 006400/008760 - 2020-06-11_21-55-50
Started fold 006500/008760 - 2020-06-11_21-56-15
Started fold 006600/008760 - 2020-06-11_21-56-38
Started fold 006700/008760 - 2020-06-11_21-57-01
Started fold 006800/008760 - 2020-06-11_21-57-25
Started fold 006900/008760 - 2020-06-11_21-57-51
Started fold 007000/008760 - 2020-06-11_21-58-15
Started fold 007100/008760 - 2020-06-11_21-58-38
Started fold 007200/008760 - 2020-06-11_21-59-01
Started fold 007300/008760 - 2020-06-11_21-59-25
Started fold 007400/008760 - 2020-06-11_21-59-48
Started fold 007500/008760 - 2020-06-11_22-00-15
Started fold 007600/008760 - 2020-06-11_22-00-38
Started fold 007700/008760 - 2020-06-11_22-01-01
Started fold 007800/008760 - 2020-06-11_22-01-24
Started fold 007900/008760 - 2020-06-11_22-01-48
Started fold 008000/008760 - 2020-06-11_22-02-12
Started fold 008100/008760 - 2020-06-11_22-02-36
Started fold 008200/008760 - 2020-06-11_22-02-59
Started fold 008300/008760 - 2020-06-11_22-03-22
Started fold 008400/008760 - 2020-06-11_22-03-48
Started fold 008500/008760 - 2020-06-11_22-04-14
Started fold 008600/008760 - 2020-06-11_22-04-39
Started fold 008700/008760 - 2020-06-11_22-05-02
8736
                     observed  predicted      error  abs_error
Datetime                                                      
2018-01-01 01:00:00  84.90085  21.221522  63.679328  63.679328
2018-01-01 02:00:00  67.44355  21.221522  46.222028  46.222028
2018-01-01 03:00:00  76.66860  21.221522  55.447078  55.447078
2018-01-01 04:00:00  64.96090  21.221522  43.739378  43.739378
2018-01-01 05:00:00  64.14875  21.221522  42.927228  42.927228
2018-01-01 06:00:00  76.06410  21.221522  54.842578  54.842578
2018-01-01 07:00:00  69.19180  21.221522  47.970278  47.970278
2018-01-01 08:00:00  48.51735  21.221522  27.295828  27.295828
2018-01-01 09:00:00  45.92715  21.221522  24.705628  24.705628
2018-01-01 10:00:00  44.19595  21.221522  22.974428  22.974428
2018-01-01 11:00:00  39.27865  21.221522  18.057128  18.057128
2018-01-01 12:00:00  32.61625  21.221522  11.394728  11.394728
2018-01-01 13:00:00  34.09440  21.221522  12.872878  12.872878
2018-01-01 14:00:00  33.51795  21.221522  12.296428  12.296428
2018-01-01 15:00:00  41.24420  21.221522  20.022678  20.022678
2018-01-01 16:00:00  49.08765  21.221522  27.866128  27.866128
2018-01-01 17:00:00  51.24645  21.221522  30.024928  30.024928
2018-01-01 18:00:00  41.64520  21.221522  20.423678  20.423678
2018-01-01 19:00:00  40.98405  21.221522  19.762528  19.762528
2018-01-01 20:00:00  45.36865  21.221522  24.147128  24.147128
2018-01-01 21:00:00  58.24830  21.221522  37.026778  37.026778
2018-01-01 22:00:00  63.21335  21.221522  41.991828  41.991828
2018-01-01 23:00:00  78.28435  21.221522  57.062828  57.062828
2018-01-02 00:00:00  91.30400  21.221522  70.082478  70.082478
CPU times: user 33min 38s, sys: 3.58 s, total: 33min 41s
Wall time: 33min 44s

Serialize output data

In [84]:
from joblib import dump, load

timestamp = get_datetime_identifier("%Y-%m-%d_%H-%M-%S")

path = f'results/pm25_ts_{model_name}_results_h_{timestamp}.joblib'

dump(fold_results, path) 
fold_results = load(path)
print(len(fold_results))
print(fold_results[0])
8736
                     observed  predicted      error  abs_error
Datetime                                                      
2018-01-01 01:00:00  84.90085  21.221522  63.679328  63.679328
2018-01-01 02:00:00  67.44355  21.221522  46.222028  46.222028
2018-01-01 03:00:00  76.66860  21.221522  55.447078  55.447078
2018-01-01 04:00:00  64.96090  21.221522  43.739378  43.739378
2018-01-01 05:00:00  64.14875  21.221522  42.927228  42.927228
2018-01-01 06:00:00  76.06410  21.221522  54.842578  54.842578
2018-01-01 07:00:00  69.19180  21.221522  47.970278  47.970278
2018-01-01 08:00:00  48.51735  21.221522  27.295828  27.295828
2018-01-01 09:00:00  45.92715  21.221522  24.705628  24.705628
2018-01-01 10:00:00  44.19595  21.221522  22.974428  22.974428
2018-01-01 11:00:00  39.27865  21.221522  18.057128  18.057128
2018-01-01 12:00:00  32.61625  21.221522  11.394728  11.394728
2018-01-01 13:00:00  34.09440  21.221522  12.872878  12.872878
2018-01-01 14:00:00  33.51795  21.221522  12.296428  12.296428
2018-01-01 15:00:00  41.24420  21.221522  20.022678  20.022678
2018-01-01 16:00:00  49.08765  21.221522  27.866128  27.866128
2018-01-01 17:00:00  51.24645  21.221522  30.024928  30.024928
2018-01-01 18:00:00  41.64520  21.221522  20.423678  20.423678
2018-01-01 19:00:00  40.98405  21.221522  19.762528  19.762528
2018-01-01 20:00:00  45.36865  21.221522  24.147128  24.147128
2018-01-01 21:00:00  58.24830  21.221522  37.026778  37.026778
2018-01-01 22:00:00  63.21335  21.221522  41.991828  41.991828
2018-01-01 23:00:00  78.28435  21.221522  57.062828  57.062828
2018-01-02 00:00:00  91.30400  21.221522  70.082478  70.082478

Calculate and visualize results

In [85]:
%%time
# Returns a list of mean folds RMSE for n_pred_points (starting at 1 point forecast)
res = get_mean_folds_rmse_for_n_prediction_points(fold_results=fold_results, n_pred_points=n_pred_points)
res
CPU times: user 3min 2s, sys: 191 ms, total: 3min 3s
Wall time: 3min 3s
Out[85]:
[11.218553363177227,
 11.84939532828283,
 12.403505142332413,
 12.875550573921029,
 13.278597761707989,
 13.614746889348025,
 13.88626672405877,
 14.100883448117537,
 14.270553340220387,
 14.400224242424244,
 14.484374104683196,
 14.538612316345272,
 14.576462442607896,
 14.603670569329662,
 14.623023266758494,
 14.638892011019285,
 14.658972050045914,
 14.696223071625344,
 14.76162760560147,
 14.864599735996327,
 15.005145741505968,
 15.188601779155189,
 15.409990231864095,
 15.65693489439853]
In [86]:
print(res)
[11.218553363177227, 11.84939532828283, 12.403505142332413, 12.875550573921029, 13.278597761707989, 13.614746889348025, 13.88626672405877, 14.100883448117537, 14.270553340220387, 14.400224242424244, 14.484374104683196, 14.538612316345272, 14.576462442607896, 14.603670569329662, 14.623023266758494, 14.638892011019285, 14.658972050045914, 14.696223071625344, 14.76162760560147, 14.864599735996327, 15.005145741505968, 15.188601779155189, 15.409990231864095, 15.65693489439853]

In [87]:
# Show forecasts for n-th point in the future
show_n_points_of_forecasts = [1, 12, 24] # for hourly data
#show_n_points_of_forecasts = [1, 3, 7] # for daily data

# Used to zoom the plots (date ranges shown in the plots)
# for hourly data
start_end_dates = [('2018-01-01', '2019-01-01'), ('2018-02-01', '2018-03-01'), ('2018-06-01', '2018-07-01')]
# for daily data
#start_end_dates = [('2018-01-01', '2019-01-01'), ('2018-02-01', '2018-04-01'), ('2018-06-01', '2018-08-01')]

# Type of plot
# 0 -> plot_observed_vs_predicted
# 1 -> plot_observed_vs_predicted_with_error
plot_types = [0, 1, 1]

# File names for plots (format png will be used, do not add .png extension)
base_file_path = f'images/pm25_obs_vs_pred_365_h_ts_{model_name}' # for hourly data
#base_file_path = f'images/pm25_obs_vs_pred_365_d_ts_{model_name}' # for daily data
In [88]:
visualize_results(show_n_points_of_forecasts=show_n_points_of_forecasts,
                   start_end_dates=start_end_dates,
                   plot_types=plot_types,
                   base_file_path=base_file_path,
                   fold_results=fold_results, 
                   n_pred_points=n_pred_points, 
                   cut_off_offset=cut_off_offset, 
                   model_name=model_name,
                   timestamp=timestamp)


results.py | 92 | visualize_results | 11-Jun-20 22:48:16 | INFO: images/pm25_obs_vs_pred_365_h_ts_EMA_01_lag-01_2020-06-11_22-42-56.png


results.py | 92 | visualize_results | 11-Jun-20 22:48:17 | INFO: images/pm25_obs_vs_pred_365_h_ts_EMA_01_lag-12_2020-06-11_22-42-56.png


results.py | 92 | visualize_results | 11-Jun-20 22:48:17 | INFO: images/pm25_obs_vs_pred_365_h_ts_EMA_01_lag-24_2020-06-11_22-42-56.png


results.py | 92 | visualize_results | 11-Jun-20 22:48:30 | INFO: images/pm25_obs_vs_pred_365_h_ts_EMA_02_lag-01_2020-06-11_22-42-56.png


results.py | 92 | visualize_results | 11-Jun-20 22:48:30 | INFO: images/pm25_obs_vs_pred_365_h_ts_EMA_02_lag-12_2020-06-11_22-42-56.png


results.py | 92 | visualize_results | 11-Jun-20 22:48:31 | INFO: images/pm25_obs_vs_pred_365_h_ts_EMA_02_lag-24_2020-06-11_22-42-56.png


results.py | 92 | visualize_results | 11-Jun-20 22:48:43 | INFO: images/pm25_obs_vs_pred_365_h_ts_EMA_03_lag-01_2020-06-11_22-42-56.png


results.py | 92 | visualize_results | 11-Jun-20 22:48:44 | INFO: images/pm25_obs_vs_pred_365_h_ts_EMA_03_lag-12_2020-06-11_22-42-56.png


results.py | 92 | visualize_results | 11-Jun-20 22:48:45 | INFO: images/pm25_obs_vs_pred_365_h_ts_EMA_03_lag-24_2020-06-11_22-42-56.png

7-Day Prediction

In [89]:
df = dfd.copy()
df.columns = ['pm25']
df.head()
Out[89]:
pm25
Datetime
2008-01-01 53.586957
2008-01-02 30.958333
2008-01-03 46.104167
2008-01-04 42.979167
2008-01-05 57.312500
In [90]:
# Define first past/future cutoff point in time offset (1 year of data)
#cut_off_offset = 365*24 # for hourly data
cut_off_offset = 365 # for daily data

# Set datetime format for index
#dt_format = "%Y-%m-%d %H:%M:%S" # for hourly data
dt_format = "%Y-%m-%d" # for daily data

# Create train and validate sets
train_test_split_position = int(len(df)-cut_off_offset)

# Create as many folds as remains till the end of known data
n_folds = len(df) #train_test_split_position+3

# Predict for X points
#n_pred_points = 24 # for hourly data
n_pred_points = 7 # for daily data
In [91]:
%%time
# Validate result on test
# Creates 365*24*24 models for hourly data, or 365*7 models for daily data

fold_results = walk_forward_ref_model_validation(data=df,
                                                 col_name='pm25',
                                                 model_type='EMA',
                                                 cut_off_offset=cut_off_offset,
                                                 n_pred_points=n_pred_points,
                                                 n_folds=-1,
                                                 period='',
                                                 rolling_window=4)
print(len(fold_results))
print(fold_results[0])
validation.py | 111 | walk_forward_ref_model_validation | 11-Jun-20 22:49:15 | INFO: EMA model validation started
Started fold 000000/000365 - 2020-06-11_22-49-15
Started fold 000010/000365 - 2020-06-11_22-49-15
Started fold 000020/000365 - 2020-06-11_22-49-15
Started fold 000030/000365 - 2020-06-11_22-49-16
Started fold 000040/000365 - 2020-06-11_22-49-16
Started fold 000050/000365 - 2020-06-11_22-49-16
Started fold 000060/000365 - 2020-06-11_22-49-16
Started fold 000070/000365 - 2020-06-11_22-49-16
Started fold 000080/000365 - 2020-06-11_22-49-16
Started fold 000090/000365 - 2020-06-11_22-49-17
Started fold 000100/000365 - 2020-06-11_22-49-17
Started fold 000110/000365 - 2020-06-11_22-49-17
Started fold 000120/000365 - 2020-06-11_22-49-17
Started fold 000130/000365 - 2020-06-11_22-49-17
Started fold 000140/000365 - 2020-06-11_22-49-18
Started fold 000150/000365 - 2020-06-11_22-49-18
Started fold 000160/000365 - 2020-06-11_22-49-18
Started fold 000170/000365 - 2020-06-11_22-49-18
Started fold 000180/000365 - 2020-06-11_22-49-18
Started fold 000190/000365 - 2020-06-11_22-49-18
Started fold 000200/000365 - 2020-06-11_22-49-19
Started fold 000210/000365 - 2020-06-11_22-49-19
Started fold 000220/000365 - 2020-06-11_22-49-19
Started fold 000230/000365 - 2020-06-11_22-49-19
Started fold 000240/000365 - 2020-06-11_22-49-19
Started fold 000250/000365 - 2020-06-11_22-49-20
Started fold 000260/000365 - 2020-06-11_22-49-20
Started fold 000270/000365 - 2020-06-11_22-49-20
Started fold 000280/000365 - 2020-06-11_22-49-20
Started fold 000290/000365 - 2020-06-11_22-49-20
Started fold 000300/000365 - 2020-06-11_22-49-21
Started fold 000310/000365 - 2020-06-11_22-49-21
Started fold 000320/000365 - 2020-06-11_22-49-21
Started fold 000330/000365 - 2020-06-11_22-49-21
Started fold 000340/000365 - 2020-06-11_22-49-21
Started fold 000350/000365 - 2020-06-11_22-49-21
358
             observed  predicted      error  abs_error
Datetime                                              
2018-01-02  67.991848  36.725819  31.266029  31.266029
2018-01-03  16.026950  36.725819  20.698869  20.698869
2018-01-04  14.590020  36.725819  22.135798  22.135798
2018-01-05  22.094854  36.725819  14.630964  14.630964
2018-01-06  62.504217  36.725819  25.778398  25.778398
2018-01-07  43.929804  36.725819   7.203986   7.203986
2018-01-08  22.088192  36.725819  14.637627  14.637627
CPU times: user 6.63 s, sys: 34.7 ms, total: 6.66 s
Wall time: 6.66 s
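
The source of `walk_forward_ref_model_validation` (in the project's `measure` module) is not shown in this notebook. As a rough sketch of the general idea only, and not the author's implementation, the snippet below refits an exponential moving average on an expanding history at each fold and uses its last value as a flat forecast for the next `n_pred_points` observations. The helper name `ema_walk_forward`, the `span` parameter (an assumed counterpart of `rolling_window`), the signed `error` column, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def ema_walk_forward(series, cut_off_offset, n_pred_points, span):
    """Hypothetical helper: one flat EMA forecast per walk-forward fold."""
    folds = []
    first_cutoff = len(series) - cut_off_offset
    # Stop early enough that every fold has n_pred_points observed values
    for cutoff in range(first_cutoff, len(series) - n_pred_points + 1):
        history = series.iloc[:cutoff]
        # The last EMA value of the history serves as a constant forecast
        forecast = history.ewm(span=span, adjust=False).mean().iloc[-1]
        observed = series.iloc[cutoff:cutoff + n_pred_points]
        fold = pd.DataFrame({"observed": observed, "predicted": forecast})
        # Assumption: signed error; the notebook's own column may differ
        fold["error"] = fold["observed"] - fold["predicted"]
        fold["abs_error"] = fold["error"].abs()
        folds.append(fold)
    return folds


# Synthetic daily series, for illustration only (not PM2.5 data)
idx = pd.date_range("2018-01-01", periods=30, freq="D")
demo = pd.Series(np.arange(30, dtype=float), index=idx, name="pm25")
demo_folds = ema_walk_forward(demo, cut_off_offset=10, n_pred_points=3, span=4)
print(len(demo_folds))
print(demo_folds[0])
```

In this sketch the number of folds is `cut_off_offset - n_pred_points + 1`, because the last few cutoffs would not have enough observed values left to score a full forecast. The project's function may count folds differently.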

Serialize output data

In [92]:
from joblib import dump, load

timestamp = get_datetime_identifier("%Y-%m-%d_%H-%M-%S")

path = f'results/pm25_ts_{model_name}_results_d_{timestamp}.joblib'

dump(fold_results, path) 
fold_results = load(path)
print(len(fold_results))
print(fold_results[0])
358
             observed  predicted      error  abs_error
Datetime                                              
2018-01-02  67.991848  36.725819  31.266029  31.266029
2018-01-03  16.026950  36.725819  20.698869  20.698869
2018-01-04  14.590020  36.725819  22.135798  22.135798
2018-01-05  22.094854  36.725819  14.630964  14.630964
2018-01-06  62.504217  36.725819  25.778398  25.778398
2018-01-07  43.929804  36.725819   7.203986   7.203986
2018-01-08  22.088192  36.725819  14.637627  14.637627

Calculate and visualize results

In [93]:
%%time
# Returns a list of mean folds RMSE for n_pred_points (starting at 1 point forecast)
res = get_mean_folds_rmse_for_n_prediction_points(fold_results=fold_results, n_pred_points=n_pred_points)
res
CPU times: user 2.55 s, sys: 4.76 ms, total: 2.56 s
Wall time: 2.56 s
Out[93]:
[12.471266666666667,
 12.96436923076923,
 13.290245584045586,
 13.465590313390313,
 13.678582905982907,
 13.811899430199428,
 13.8896660968661]
In [94]:
print(res)
[12.471266666666667, 12.96436923076923, 13.290245584045586, 13.465590313390313, 13.678582905982907, 13.811899430199428, 13.8896660968661]
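
The list above holds one value per forecast horizon. The body of `get_mean_folds_rmse_for_n_prediction_points` is not shown here, so the following is only one plausible reading of its comment: for each horizon `k`, compute the RMSE of every fold over its first `k` predicted points, then average those RMSE values across folds. The helper name `mean_rmse_per_horizon` and the two tiny folds are hypothetical.

```python
import numpy as np
import pandas as pd


def mean_rmse_per_horizon(folds, n_pred_points):
    """Hypothetical helper: mean across folds of the RMSE over the first k points."""
    results = []
    for k in range(1, n_pred_points + 1):
        fold_rmses = []
        for fold in folds:
            head = fold.iloc[:k]
            diff = head["observed"] - head["predicted"]
            fold_rmses.append(float(np.sqrt(np.mean(diff ** 2))))
        results.append(float(np.mean(fold_rmses)))
    return results


# Two made-up folds, for illustration only
fold_a = pd.DataFrame({"observed": [3.0, 5.0], "predicted": [0.0, 1.0]})
fold_b = pd.DataFrame({"observed": [1.0, 1.0], "predicted": [1.0, 1.0]})
demo_res = mean_rmse_per_horizon([fold_a, fold_b], n_pred_points=2)
print(demo_res)
```

Under this reading, a flat forecast such as the EMA baseline would typically show the error growing with the horizon, which is consistent with the increasing values reported above.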
In [95]:
# Show forecasts for n-th point in the future
#show_n_points_of_forecasts = [1, 12, 24] # for hourly data
show_n_points_of_forecasts = [1, 3, 7] # for daily data

# Used to zoom the plots (date ranges shown in the plots)
# for hourly data
#start_end_dates = [('2018-01-01', '2019-01-01'), ('2018-02-01', '2018-03-01'), ('2018-06-01', '2018-07-01')]
# for daily data
start_end_dates = [('2018-01-01', '2019-01-01'), ('2018-02-01', '2018-04-01'), ('2018-06-01', '2018-08-01')]

# Type of plot
# 0 -> plot_observed_vs_predicted
# 1 -> plot_observed_vs_predicted_with_error
plot_types = [0, 1, 1]

# Base file path for plots (saved as PNG; do not add the .png extension)
#base_file_path = f'images/pm25_obs_vs_pred_365_h_ts_{model_name}' # for hourly data
base_file_path = f'images/pm25_obs_vs_pred_365_d_ts_{model_name}' # for daily data
In [96]:
visualize_results(show_n_points_of_forecasts=show_n_points_of_forecasts,
                   start_end_dates=start_end_dates,
                   plot_types=plot_types,
                   base_file_path=base_file_path,
                   fold_results=fold_results, 
                   n_pred_points=n_pred_points, 
                   cut_off_offset=cut_off_offset, 
                   model_name=model_name,
                   timestamp=timestamp)


results.py | 92 | visualize_results | 11-Jun-20 22:50:53 | INFO: images/pm25_obs_vs_pred_365_d_ts_EMA_01_lag-01_2020-06-11_22-49-44.png


results.py | 92 | visualize_results | 11-Jun-20 22:50:53 | INFO: images/pm25_obs_vs_pred_365_d_ts_EMA_01_lag-03_2020-06-11_22-49-44.png


results.py | 92 | visualize_results | 11-Jun-20 22:50:54 | INFO: images/pm25_obs_vs_pred_365_d_ts_EMA_01_lag-07_2020-06-11_22-49-44.png


results.py | 92 | visualize_results | 11-Jun-20 22:50:55 | INFO: images/pm25_obs_vs_pred_365_d_ts_EMA_02_lag-01_2020-06-11_22-49-44.png


results.py | 92 | visualize_results | 11-Jun-20 22:50:56 | INFO: images/pm25_obs_vs_pred_365_d_ts_EMA_02_lag-03_2020-06-11_22-49-44.png


results.py | 92 | visualize_results | 11-Jun-20 22:50:56 | INFO: images/pm25_obs_vs_pred_365_d_ts_EMA_02_lag-07_2020-06-11_22-49-44.png


results.py | 92 | visualize_results | 11-Jun-20 22:50:57 | INFO: images/pm25_obs_vs_pred_365_d_ts_EMA_03_lag-01_2020-06-11_22-49-44.png


results.py | 92 | visualize_results | 11-Jun-20 22:50:58 | INFO: images/pm25_obs_vs_pred_365_d_ts_EMA_03_lag-03_2020-06-11_22-49-44.png


results.py | 92 | visualize_results | 11-Jun-20 22:50:59 | INFO: images/pm25_obs_vs_pred_365_d_ts_EMA_03_lag-07_2020-06-11_22-49-44.png