From: https://github.com/ksatola
Version: 0.1.0

Exploratory Data Analysis (EDA) of PM2.5¶

This Exploratory Data Analysis (EDA) covers particulate matter (PM) air pollutants with a special focus on fine particles (PM2.5) which are considered as the most harmful out of all air-pollutants. The measuresements under EDA were taken in Krakow area in the years of 2008-2018.

import pandas as pd 
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

# Load the dataset
data_path = '../data/final/'
data_file = data_path + 'dfpm2008_2018.csv'
df = pd.read_csv(data_file, encoding='utf-8', sep=",", index_col="Datetime")
df.head()

df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 96388 entries, 2008-01-01 01:00:00 to 2019-01-01 00:00:00.000
Data columns (total 2 columns):
pm10    96388 non-null float64
pm25    96388 non-null float64
dtypes: float64(2)
memory usage: 2.2+ MB

df.isnull().sum()

pm10    0
pm25    0
dtype: int64

The data set contains hourly measurements of particulate matter (PM10) and fine particles (PM2.5) taken in Krakow area in the years of 2008-2018. There are 96 388 observations and no missing data.

df.index

Index(['2008-01-01 01:00:00', '2008-01-01 02:00:00', '2008-01-01 03:00:00',
       '2008-01-01 04:00:00', '2008-01-01 05:00:00', '2008-01-01 06:00:00',
       '2008-01-01 07:00:00', '2008-01-01 08:00:00', '2008-01-01 09:00:00',
       '2008-01-01 10:00:00',
       ...
       '2018-12-31 15:00:00.000', '2018-12-31 16:00:00.000',
       '2018-12-31 17:00:00.000', '2018-12-31 18:00:00.000',
       '2018-12-31 19:00:00.000', '2018-12-31 20:00:00.000',
       '2018-12-31 21:00:00.000', '2018-12-31 22:00:00.000',
       '2018-12-31 23:00:00.000', '2019-01-01 00:00:00.000'],
      dtype='object', name='Datetime', length=96388)

# Convert indextype to DateTime from the generic object type
df.index = pd.to_datetime(df.index)

df.index

DatetimeIndex(['2008-01-01 01:00:00', '2008-01-01 02:00:00',
               '2008-01-01 03:00:00', '2008-01-01 04:00:00',
               '2008-01-01 05:00:00', '2008-01-01 06:00:00',
               '2008-01-01 07:00:00', '2008-01-01 08:00:00',
               '2008-01-01 09:00:00', '2008-01-01 10:00:00',
               ...
               '2018-12-31 15:00:00', '2018-12-31 16:00:00',
               '2018-12-31 17:00:00', '2018-12-31 18:00:00',
               '2018-12-31 19:00:00', '2018-12-31 20:00:00',
               '2018-12-31 21:00:00', '2018-12-31 22:00:00',
               '2018-12-31 23:00:00', '2019-01-01 00:00:00'],
              dtype='datetime64[ns]', name='Datetime', length=96388, freq=None)

# Descriptive summary statistics
df.describe()

df['pm10'].idxmax(), df['pm25'].idxmax()

(Timestamp('2010-01-27 06:00:00'), Timestamp('2010-01-27 06:00:00'))

The maximum values observed are 546 [µg/m3] for PM10 and 445 [µg/m3] for PM2.5, both on 2010-01-27 at 6am. The mean values are respectively 53 and 37 [µg/m3]. 75% of all observations are below 65 and 45 [µg/m3]. Almost 75% of PM10 and almost 50% of PM2.5 observations exceeded WHO air quality guidelines (20 [µg/m3] for PM10 and 25 [µg/m3] for PM2.5).

df.boxplot(grid=True, figsize=(10, 8))
plt.title('Measured PM10 and PM2.5 distributions')
plt.ylabel('Observed values [µg/m3]')
plt.savefig('images/eda_pm_dists.png')
plt.show();

df.plot(kind='hist', bins=100, grid=True, figsize=(10, 8), alpha=0.5)
plt.title('PM10 and PM2.5 measured values frequency')
plt.xlabel('Observed value [µg/m3]')
plt.ylabel('Frequency')
plt.savefig('images/eda_pm_freq.png')
plt.show();

PM10 and PM2.5 distributions are similar (right-skewed) with many outliers at the high-end many times exeeding WHO and EU air quiality guidelines.

ax = df.plot(grid=True, figsize=(20, 8), alpha=1)
ax.axhline(25, color='orange', linestyle='--')
ax.axhline(20, color='blue', linestyle='-.')
plt.title('PM10 and PM2.5 Data')
plt.ylabel('Observed value [µg/m3]')
plt.xlabel('')
plt.savefig('images/eda_pm_data_representation.png')
plt.show();

There is yearly seasonality of data with the values increasing from October to April. PM10 and PM2.5 levels are similar and positively correlated. The biggest measurements were taken between 2009 and 2011.

# Mean of PM10, PM2.5 in 2008
m2008 = df['2008'].mean()
m2008

pm10    58.658482
pm25    35.409081
dtype: float64

# Mean daily measurements in 2008
ax = df['2008'].asfreq('D').plot(grid=True, figsize=(20, 8))
ax.set_ylim(0, 350)
ax.axhline(25, color='orange', linestyle='--')
ax.axhline(20, color='blue', linestyle='-.')
plt.title('Mean daily measurements in 2008')
plt.ylabel('Observed value [µg/m3]')
plt.xlabel('')
plt.savefig('images/eda_pm_mean_daily_2008.png')
plt.show();

# Mean PM10, PM2.5 in 2018
m2018 = df['2018'].mean()
m2018

pm10    40.703388
pm25    30.840038
dtype: float64

# And in 2018
ax = df['2018'].asfreq('D').plot(grid=True, figsize=(20, 8))
ax.set_ylim(0, 350)
ax.axhline(25, color='orange', linestyle='--')
ax.axhline(20, color='blue', linestyle='-.')
plt.title('Mean daily measurements in 2018')
plt.ylabel('Observed value [µg/m3]')
plt.xlabel('')
plt.savefig('images/eda_pm_mean_daily_2018.png')
plt.show();

The characteristics of measurements look similar in all years (above checked the edge cases - 2008 and 2018). In autumn and winter, there are more air pollutants in the air then during the rest of the year. Looking at the vertical axes (the scale is scaled to be the same) we can clearly see, that situation improved - the maximum values observed are lower in 2018 comparing to 2008 but still not at the satisfactory level.

# Is there any linear relationship between PM10 i PM2.5?
plt.figure(figsize=(10, 8))
plt.title("Correlation between PM10 oraz PM2.5")
plt.xlabel('PM10')
plt.ylabel('PM25')
plt.savefig('images/eda_pm_corr.png')
plt.scatter(df['pm10'], df['pm25']);

There is indeed a linear positive and strong correlation between PM10 and PM2.5.

corr = df['pm10'].corr(df['pm25'])
print("PM10 and PM2.5 correlation coefficient: ", corr)

PM10 and PM2.5 correlation coefficient:  0.9639339988579074

The Pearson's correlation coefficient is quite high (the maximum value is 1.0). This should not be surprising as PM2.5 is contained in PM10.

Question 1¶

What were mean yearly particulate matter observed values in Krakow?

# Yearly mean
dfts = df.resample(rule='A').mean()
dfts.plot(grid=True, figsize=(10, 8), marker='o')
plt.title('Mean yearly particulate matter level observed values in Krakow')
plt.ylabel('Mean observed value [µg/m3]')
plt.xlabel('Years')
plt.savefig('images/eda_pm_mean_yearly.png')
plt.show();

# Another method
index_year = df.index.year
mean_by_year = df.groupby(index_year).mean()
mean_by_year

mean_by_year.plot(grid=True, figsize=(10, 8), marker='o')
plt.title('Mean yearly particulate matter level observed values in Krakow')
plt.ylabel('Mean observed value [µg/m3]')
plt.xlabel('Years')
plt.show();

In 2010 in Krakow, there were the highest mean daily particulate matter levels observed. In the subsequent years the observed levels were lower.

Question 2¶

What was the characterists of maximum, minimum and median of PM2.5 values between 2008 and 2018?

dfts = df['pm25'].resample(rule='D').median().to_frame()
rolling = dfts.pm25.rolling(360)
dfts['q10'] = rolling.quantile(0.1)
dfts['q50'] = rolling.quantile(0.5)
dfts['q90'] = rolling.quantile(0.9)
dfts.head()

dfts.plot(grid=True, figsize=(20, 8))
plt.title('Maximum, minimum and median of PM2.5 values')
plt.ylabel('Observed value [µg/m3]')
plt.xlabel('Years')
plt.savefig('images/eda_pm_min_max_median_yearly.png')
plt.show();

Using rolling quantiles (with a window of 360 days) for 10%, 50% (median) and 90% quantiles, we can visualize trends of minimum, median and maximum observed levels of PM2.5. We can see, that the yearly maximum fine particle levels trend line goes down whereas minimum and median values stay rather at the same level.

Question 3¶

Which months out of the winter months (between 2017 and 2018) were the best and the worst in terms of observed particulate matter level?

dfts = df['2017-10':'2018-4'].resample(rule='M').mean()

dfts.plot(grid=True, figsize=(10, 8), marker='o')
plt.title('Mean monthly PM10 and PM2.5 levels')
plt.ylabel('Mean monthly observed value [µg/m3]')
plt.xlabel('')
plt.savefig('images/eda_pm_mean_monthly_winter.png')
plt.show();

Question 4¶

What is mean hourly distribution of particle matter?

index_hour = df.index.hour
mean_by_hour = df.groupby(index_hour).mean()
mean_by_hour

mean_by_hour.plot(grid=True, figsize=(10, 6), marker='o')
plt.xticks(mean_by_hour.index)
plt.title('Hourly mean particulate matter distribution')
plt.ylabel('Mean observed value [µg/m3]')
plt.xlabel('')
plt.savefig('images/eda_pm_mean_hourly_winter.png')
plt.show();

The lowest PM levels observed were around 2 and 3pm. The biggest during the night (6pm-8am).

Question 5¶

What are the average particle matter levels every day of a week?

# Monday = 0
index_dayofweek = df.index.dayofweek
mean_by_dayofweek = df.groupby(index_dayofweek).mean()
mean_by_dayofweek

mean_by_dayofweek.plot(grid=True, figsize=(10, 6), marker='o')
plt.title('Weekly average particle matter levels of PM10 and PM2.5 in 2008-2018')
plt.ylabel('Mean observed value [µg/m3]')
plt.xlabel('')
plt.savefig('images/eda_pm_mean_weekdays.png')
plt.show();

An average particle matter levels is the highest on Tuesdays (1) then goes down to Sundays (6).

Question 6¶

Which months are the cleanest on average?

index_month = df.index.month
mean_by_month = df.groupby(index_month).mean()
mean_by_month

mean_by_month.plot(grid=True, figsize=(10, 6), marker='o')
plt.xticks(mean_by_month.index)
plt.title('Monthly average particle matter levels of PM10 and PM2.5 in 2008-2018')
plt.ylabel('Mean observed value [µg/m3]')
plt.xlabel('')
plt.savefig('images/eda_pm_mean_months.png')
plt.show();

There is the cleanest air observed in Krakow from May till September.

Question 7¶

How the fine particles (PM2.5) levels looked like in different years?

years = range(2008, 2019)
df_years = pd.DataFrame()
for year in years:
    df_year = df.loc[str(year), ['pm25']].reset_index(drop=True)
    df_year.rename(columns={'pm25': year}, inplace=True)
    df_years = pd.concat([df_years, df_year], axis=1)

df_years.head()

plots = df_years.plot(subplots=True, grid=True, figsize=(20, 20))
for plot in plots:
    plot.set_ylim(0, 500)
    plot.xaxis.set_major_locator(plt.LinearLocator(13))
    plot.set_xticklabels(range(1, 13), rotation=0)
plt.savefig('images/eda_pm_mean_years.png')
plt.show();

Question 8¶

Is there any seasonality in PM2.5 data?

dfts = df.pm25.copy().to_frame()
dfts['30D'] = dfts.pm25.rolling(window='30D').mean()
dfts['90D'] = dfts.pm25.rolling(window='90D').mean()
dfts.head()

dfts.plot(grid=True, figsize=(20, 16))
plt.title('PM2.5 Seasonality')
plt.ylabel('Mean observed value [µg/m3]')
plt.xlabel('')
plt.savefig('images/eda_pm_mean_seasonality.png')
plt.show();

	pm10	pm25
Datetime
2008-01-01 01:00:00	109.5	92.0
2008-01-01 02:00:00	96.0	81.0
2008-01-01 03:00:00	86.5	73.0
2008-01-01 04:00:00	71.5	60.5
2008-01-01 05:00:00	72.0	61.0

	pm10	pm25
count	96388.000000	96388.000000
mean	53.538890	37.196145
std	48.546855	38.605214
min	3.000000	2.500000
25%	23.000000	13.699700
50%	38.000000	24.100000
75%	65.000000	45.000000
max	546.000000	445.000000

	pm10	pm25
Datetime
2008	58.658482	35.409081
2009	59.894241	40.826877
2010	67.148990	48.076949
2011	63.615723	43.532485
2012	58.044864	39.541272
2013	53.102708	37.220781
2014	50.559304	34.705822
2015	52.142013	35.649061
2016	42.422487	31.024153
2017	42.660029	32.353953
2018	40.703388	30.840038
2019	49.196300	36.607400

	pm25	q10	q50	q90
Datetime
2008-01-01	52.50	NaN	NaN	NaN
2008-01-02	28.50	NaN	NaN	NaN
2008-01-03	46.00	NaN	NaN	NaN
2008-01-04	44.00	NaN	NaN	NaN
2008-01-05	55.75	NaN	NaN	NaN

	pm10	pm25
Datetime
0	65.142781	48.841168
1	62.239097	47.202120
2	58.849983	44.983076
3	55.637367	42.564447
4	53.086791	40.324473
5	52.555740	39.261994
6	53.447625	38.949145
7	54.969955	38.525071
8	55.829887	37.592334
9	54.263990	35.754369
10	50.671900	32.946743
11	46.431875	29.939961
12	43.252247	27.429245
13	40.795246	25.600486
14	39.499671	24.698600
15	39.497136	24.481761
16	41.024019	25.505867
17	44.714742	28.151320
18	50.001380	32.379114
19	57.059593	37.843471
20	63.208892	43.207380
21	66.874105	47.127104
22	68.291337	49.469067
23	67.586187	49.925767

	pm10	pm25
Datetime
0	50.658805	34.724517
1	57.097404	39.140010
2	56.142090	38.327821
3	54.994405	37.513760
4	55.690797	38.219692
5	52.079464	37.304929
6	48.113621	35.144516

	pm10	pm25
Datetime
1	85.717663	65.816152
2	83.228053	61.435132
3	66.571877	45.508039
4	49.623295	30.178337
5	32.044806	19.652018
6	27.986224	16.111411
7	27.395191	15.589625
8	31.676381	18.871177
9	35.896588	23.145553
10	55.914509	39.070372
11	71.182091	53.079002
12	76.869549	59.162995

	2008	2009	2010	2011	2012	2013	2014	2015	2016	2017	2018
0	92.0	334.5	69.0	54.0	112.0	77.0	76.0	65.0	215.8730	119.915	21.34580
1	81.0	276.5	65.0	78.0	148.0	190.0	78.0	73.0	231.4550	177.442	84.90085
2	73.0	133.0	58.0	55.0	132.0	159.0	65.0	54.0	239.8225	178.942	67.44355
3	60.5	78.0	65.0	45.0	118.0	172.5	63.0	50.0	230.2770	174.770	76.66860
4	61.0	51.0	47.0	43.0	122.0	163.0	59.0	51.0	241.3810	154.054	64.96090