From: https://github.com/ksatola
Version: 0.1.0

Exploratory Data Analysis (EDA) of PM2.5

This Exploratory Data Analysis (EDA) covers particulate matter (PM) air pollutants with a special focus on fine particles (PM2.5) which are considered as the most harmful out of all air-pollutants. The measuresements under EDA were taken in Krakow area in the years of 2008-2018.

In [1]:
import pandas as pd 
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
In [2]:
# Load the dataset
data_path = '../data/final/'
data_file = data_path + 'dfpm2008_2018.csv'
df = pd.read_csv(data_file, encoding='utf-8', sep=",", index_col="Datetime")
df.head()
Out[2]:
pm10 pm25
Datetime
2008-01-01 01:00:00 109.5 92.0
2008-01-01 02:00:00 96.0 81.0
2008-01-01 03:00:00 86.5 73.0
2008-01-01 04:00:00 71.5 60.5
2008-01-01 05:00:00 72.0 61.0
In [3]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 96388 entries, 2008-01-01 01:00:00 to 2019-01-01 00:00:00.000
Data columns (total 2 columns):
pm10    96388 non-null float64
pm25    96388 non-null float64
dtypes: float64(2)
memory usage: 2.2+ MB
In [4]:
df.isnull().sum()
Out[4]:
pm10    0
pm25    0
dtype: int64

The data set contains hourly measurements of particulate matter (PM10) and fine particles (PM2.5) taken in Krakow area in the years of 2008-2018. There are 96 388 observations and no missing data.

In [5]:
df.index
Out[5]:
Index(['2008-01-01 01:00:00', '2008-01-01 02:00:00', '2008-01-01 03:00:00',
       '2008-01-01 04:00:00', '2008-01-01 05:00:00', '2008-01-01 06:00:00',
       '2008-01-01 07:00:00', '2008-01-01 08:00:00', '2008-01-01 09:00:00',
       '2008-01-01 10:00:00',
       ...
       '2018-12-31 15:00:00.000', '2018-12-31 16:00:00.000',
       '2018-12-31 17:00:00.000', '2018-12-31 18:00:00.000',
       '2018-12-31 19:00:00.000', '2018-12-31 20:00:00.000',
       '2018-12-31 21:00:00.000', '2018-12-31 22:00:00.000',
       '2018-12-31 23:00:00.000', '2019-01-01 00:00:00.000'],
      dtype='object', name='Datetime', length=96388)
In [6]:
# Convert indextype to DateTime from the generic object type
df.index = pd.to_datetime(df.index)
In [7]:
df.index
Out[7]:
DatetimeIndex(['2008-01-01 01:00:00', '2008-01-01 02:00:00',
               '2008-01-01 03:00:00', '2008-01-01 04:00:00',
               '2008-01-01 05:00:00', '2008-01-01 06:00:00',
               '2008-01-01 07:00:00', '2008-01-01 08:00:00',
               '2008-01-01 09:00:00', '2008-01-01 10:00:00',
               ...
               '2018-12-31 15:00:00', '2018-12-31 16:00:00',
               '2018-12-31 17:00:00', '2018-12-31 18:00:00',
               '2018-12-31 19:00:00', '2018-12-31 20:00:00',
               '2018-12-31 21:00:00', '2018-12-31 22:00:00',
               '2018-12-31 23:00:00', '2019-01-01 00:00:00'],
              dtype='datetime64[ns]', name='Datetime', length=96388, freq=None)
In [8]:
# Descriptive summary statistics
df.describe()
Out[8]:
pm10 pm25
count 96388.000000 96388.000000
mean 53.538890 37.196145
std 48.546855 38.605214
min 3.000000 2.500000
25% 23.000000 13.699700
50% 38.000000 24.100000
75% 65.000000 45.000000
max 546.000000 445.000000
In [9]:
df['pm10'].idxmax(), df['pm25'].idxmax()
Out[9]:
(Timestamp('2010-01-27 06:00:00'), Timestamp('2010-01-27 06:00:00'))

The maximum values observed are 546 [µg/m3] for PM10 and 445 [µg/m3] for PM2.5, both on 2010-01-27 at 6am. The mean values are respectively 53 and 37 [µg/m3]. 75% of all observations are below 65 and 45 [µg/m3]. Almost 75% of PM10 and almost 50% of PM2.5 observations exceeded WHO air quality guidelines (20 [µg/m3] for PM10 and 25 [µg/m3] for PM2.5).

In [10]:
df.boxplot(grid=True, figsize=(10, 8))
plt.title('Measured PM10 and PM2.5 distributions')
plt.ylabel('Observed values [µg/m3]')
plt.savefig('images/eda_pm_dists.png')
plt.show();
In [11]:
df.plot(kind='hist', bins=100, grid=True, figsize=(10, 8), alpha=0.5)
plt.title('PM10 and PM2.5 measured values frequency')
plt.xlabel('Observed value [µg/m3]')
plt.ylabel('Frequency')
plt.savefig('images/eda_pm_freq.png')
plt.show();

PM10 and PM2.5 distributions are similar (right-skewed) with many outliers at the high-end many times exeeding WHO and EU air quiality guidelines.

In [12]:
ax = df.plot(grid=True, figsize=(20, 8), alpha=1)
ax.axhline(25, color='orange', linestyle='--')
ax.axhline(20, color='blue', linestyle='-.')
plt.title('PM10 and PM2.5 Data')
plt.ylabel('Observed value [µg/m3]')
plt.xlabel('')
plt.savefig('images/eda_pm_data_representation.png')
plt.show();

There is yearly seasonality of data with the values increasing from October to April. PM10 and PM2.5 levels are similar and positively correlated. The biggest measurements were taken between 2009 and 2011.

In [13]:
# Mean of PM10, PM2.5 in 2008
m2008 = df['2008'].mean()
m2008
Out[13]:
pm10    58.658482
pm25    35.409081
dtype: float64
In [14]:
# Mean daily measurements in 2008
ax = df['2008'].asfreq('D').plot(grid=True, figsize=(20, 8))
ax.set_ylim(0, 350)
ax.axhline(25, color='orange', linestyle='--')
ax.axhline(20, color='blue', linestyle='-.')
plt.title('Mean daily measurements in 2008')
plt.ylabel('Observed value [µg/m3]')
plt.xlabel('')
plt.savefig('images/eda_pm_mean_daily_2008.png')
plt.show();
In [15]:
# Mean PM10, PM2.5 in 2018
m2018 = df['2018'].mean()
m2018
Out[15]:
pm10    40.703388
pm25    30.840038
dtype: float64
In [16]:
# And in 2018
ax = df['2018'].asfreq('D').plot(grid=True, figsize=(20, 8))
ax.set_ylim(0, 350)
ax.axhline(25, color='orange', linestyle='--')
ax.axhline(20, color='blue', linestyle='-.')
plt.title('Mean daily measurements in 2018')
plt.ylabel('Observed value [µg/m3]')
plt.xlabel('')
plt.savefig('images/eda_pm_mean_daily_2018.png')
plt.show();

The characteristics of measurements look similar in all years (above checked the edge cases - 2008 and 2018). In autumn and winter, there are more air pollutants in the air then during the rest of the year. Looking at the vertical axes (the scale is scaled to be the same) we can clearly see, that situation improved - the maximum values observed are lower in 2018 comparing to 2008 but still not at the satisfactory level.

In [17]:
# Is there any linear relationship between PM10 i PM2.5?
plt.figure(figsize=(10, 8))
plt.title("Correlation between PM10 oraz PM2.5")
plt.xlabel('PM10')
plt.ylabel('PM25')
plt.savefig('images/eda_pm_corr.png')
plt.scatter(df['pm10'], df['pm25']);

There is indeed a linear positive and strong correlation between PM10 and PM2.5.

In [18]:
corr = df['pm10'].corr(df['pm25'])
print("PM10 and PM2.5 correlation coefficient: ", corr)
PM10 and PM2.5 correlation coefficient:  0.9639339988579074

The Pearson's correlation coefficient is quite high (the maximum value is 1.0). This should not be surprising as PM2.5 is contained in PM10.

Question 1

What were mean yearly particulate matter observed values in Krakow?

In [19]:
# Yearly mean
dfts = df.resample(rule='A').mean()
dfts.plot(grid=True, figsize=(10, 8), marker='o')
plt.title('Mean yearly particulate matter level observed values in Krakow')
plt.ylabel('Mean observed value [µg/m3]')
plt.xlabel('Years')
plt.savefig('images/eda_pm_mean_yearly.png')
plt.show();
In [20]:
# Another method
index_year = df.index.year
mean_by_year = df.groupby(index_year).mean()
mean_by_year
Out[20]:
pm10 pm25
Datetime
2008 58.658482 35.409081
2009 59.894241 40.826877
2010 67.148990 48.076949
2011 63.615723 43.532485
2012 58.044864 39.541272
2013 53.102708 37.220781
2014 50.559304 34.705822
2015 52.142013 35.649061
2016 42.422487 31.024153
2017 42.660029 32.353953
2018 40.703388 30.840038
2019 49.196300 36.607400
In [21]:
mean_by_year.plot(grid=True, figsize=(10, 8), marker='o')
plt.title('Mean yearly particulate matter level observed values in Krakow')
plt.ylabel('Mean observed value [µg/m3]')
plt.xlabel('Years')
plt.show();

In 2010 in Krakow, there were the highest mean daily particulate matter levels observed. In the subsequent years the observed levels were lower.

Question 2

What was the characterists of maximum, minimum and median of PM2.5 values between 2008 and 2018?

In [22]:
dfts = df['pm25'].resample(rule='D').median().to_frame()
rolling = dfts.pm25.rolling(360)
dfts['q10'] = rolling.quantile(0.1)
dfts['q50'] = rolling.quantile(0.5)
dfts['q90'] = rolling.quantile(0.9)
dfts.head()
Out[22]:
pm25 q10 q50 q90
Datetime
2008-01-01 52.50 NaN NaN NaN
2008-01-02 28.50 NaN NaN NaN
2008-01-03 46.00 NaN NaN NaN
2008-01-04 44.00 NaN NaN NaN
2008-01-05 55.75 NaN NaN NaN
In [23]:
dfts.plot(grid=True, figsize=(20, 8))
plt.title('Maximum, minimum and median of PM2.5 values')
plt.ylabel('Observed value [µg/m3]')
plt.xlabel('Years')
plt.savefig('images/eda_pm_min_max_median_yearly.png')
plt.show();

Using rolling quantiles (with a window of 360 days) for 10%, 50% (median) and 90% quantiles, we can visualize trends of minimum, median and maximum observed levels of PM2.5. We can see, that the yearly maximum fine particle levels trend line goes down whereas minimum and median values stay rather at the same level.

Question 3

Which months out of the winter months (between 2017 and 2018) were the best and the worst in terms of observed particulate matter level?

In [24]:
dfts = df['2017-10':'2018-4'].resample(rule='M').mean()
In [25]:
dfts.plot(grid=True, figsize=(10, 8), marker='o')
plt.title('Mean monthly PM10 and PM2.5 levels')
plt.ylabel('Mean monthly observed value [µg/m3]')
plt.xlabel('')
plt.savefig('images/eda_pm_mean_monthly_winter.png')
plt.show();

Question 4

What is mean hourly distribution of particle matter?

In [26]:
index_hour = df.index.hour
mean_by_hour = df.groupby(index_hour).mean()
mean_by_hour
Out[26]:
pm10 pm25
Datetime
0 65.142781 48.841168
1 62.239097 47.202120
2 58.849983 44.983076
3 55.637367 42.564447
4 53.086791 40.324473
5 52.555740 39.261994
6 53.447625 38.949145
7 54.969955 38.525071
8 55.829887 37.592334
9 54.263990 35.754369
10 50.671900 32.946743
11 46.431875 29.939961
12 43.252247 27.429245
13 40.795246 25.600486
14 39.499671 24.698600
15 39.497136 24.481761
16 41.024019 25.505867
17 44.714742 28.151320
18 50.001380 32.379114
19 57.059593 37.843471
20 63.208892 43.207380
21 66.874105 47.127104
22 68.291337 49.469067
23 67.586187 49.925767
In [27]:
mean_by_hour.plot(grid=True, figsize=(10, 6), marker='o')
plt.xticks(mean_by_hour.index)
plt.title('Hourly mean particulate matter distribution')
plt.ylabel('Mean observed value [µg/m3]')
plt.xlabel('')
plt.savefig('images/eda_pm_mean_hourly_winter.png')
plt.show();

The lowest PM levels observed were around 2 and 3pm. The biggest during the night (6pm-8am).

Question 5

What are the average particle matter levels every day of a week?

In [28]:
# Monday = 0
index_dayofweek = df.index.dayofweek
mean_by_dayofweek = df.groupby(index_dayofweek).mean()
mean_by_dayofweek 
Out[28]:
pm10 pm25
Datetime
0 50.658805 34.724517
1 57.097404 39.140010
2 56.142090 38.327821
3 54.994405 37.513760
4 55.690797 38.219692
5 52.079464 37.304929
6 48.113621 35.144516
In [29]:
mean_by_dayofweek.plot(grid=True, figsize=(10, 6), marker='o')
plt.title('Weekly average particle matter levels of PM10 and PM2.5 in 2008-2018')
plt.ylabel('Mean observed value [µg/m3]')
plt.xlabel('')
plt.savefig('images/eda_pm_mean_weekdays.png')
plt.show();

An average particle matter levels is the highest on Tuesdays (1) then goes down to Sundays (6).

Question 6

Which months are the cleanest on average?

In [30]:
index_month = df.index.month
mean_by_month = df.groupby(index_month).mean()
mean_by_month
Out[30]:
pm10 pm25
Datetime
1 85.717663 65.816152
2 83.228053 61.435132
3 66.571877 45.508039
4 49.623295 30.178337
5 32.044806 19.652018
6 27.986224 16.111411
7 27.395191 15.589625
8 31.676381 18.871177
9 35.896588 23.145553
10 55.914509 39.070372
11 71.182091 53.079002
12 76.869549 59.162995
In [31]:
mean_by_month.plot(grid=True, figsize=(10, 6), marker='o')
plt.xticks(mean_by_month.index)
plt.title('Monthly average particle matter levels of PM10 and PM2.5 in 2008-2018')
plt.ylabel('Mean observed value [µg/m3]')
plt.xlabel('')
plt.savefig('images/eda_pm_mean_months.png')
plt.show();

There is the cleanest air observed in Krakow from May till September.

Question 7

How the fine particles (PM2.5) levels looked like in different years?

In [32]:
years = range(2008, 2019)
df_years = pd.DataFrame()
for year in years:
    df_year = df.loc[str(year), ['pm25']].reset_index(drop=True)
    df_year.rename(columns={'pm25': year}, inplace=True)
    df_years = pd.concat([df_years, df_year], axis=1)
In [33]:
df_years.head()
Out[33]:
2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018
0 92.0 334.5 69.0 54.0 112.0 77.0 76.0 65.0 215.8730 119.915 21.34580
1 81.0 276.5 65.0 78.0 148.0 190.0 78.0 73.0 231.4550 177.442 84.90085
2 73.0 133.0 58.0 55.0 132.0 159.0 65.0 54.0 239.8225 178.942 67.44355
3 60.5 78.0 65.0 45.0 118.0 172.5 63.0 50.0 230.2770 174.770 76.66860
4 61.0 51.0 47.0 43.0 122.0 163.0 59.0 51.0 241.3810 154.054 64.96090
In [34]:
plots = df_years.plot(subplots=True, grid=True, figsize=(20, 20))
for plot in plots:
    plot.set_ylim(0, 500)
    plot.xaxis.set_major_locator(plt.LinearLocator(13))
    plot.set_xticklabels(range(1, 13), rotation=0)
plt.savefig('images/eda_pm_mean_years.png')
plt.show();

Question 8

Is there any seasonality in PM2.5 data?

In [35]:
dfts = df.pm25.copy().to_frame()
dfts['30D'] = dfts.pm25.rolling(window='30D').mean()
dfts['90D'] = dfts.pm25.rolling(window='90D').mean()
dfts.head()
Out[35]:
pm25 30D 90D
Datetime
2008-01-01 01:00:00 92.0 92.000 92.000
2008-01-01 02:00:00 81.0 86.500 86.500
2008-01-01 03:00:00 73.0 82.000 82.000
2008-01-01 04:00:00 60.5 76.625 76.625
2008-01-01 05:00:00 61.0 73.500 73.500
In [36]:
dfts.plot(grid=True, figsize=(20, 16))
plt.title('PM2.5 Seasonality')
plt.ylabel('Mean observed value [µg/m3]')
plt.xlabel('')
plt.savefig('images/eda_pm_mean_seasonality.png')
plt.show();
In [ ]: