Time Series Basics

Modeling Steps

Data Preparation
- Generate or Read data
- Ensure that date column is of date datatype

Resampling or Changing Frequency
- Downsampling (e.g. Quarterly to Yearly)
- Upsampling (e.g. Quarterly to Monthly)
  - May result in missing data for lower level rows. Also see Handle Missing Data
- Set Frequency
  - Set date column as index (if not the index already)
Handle missing data
Feature Engineering
- Add Basic Date Time features
- Add Lag features
- Add Windowing features
- Add Expanding/Cumulative features
- Add / Remove other columns as needed
Explore Data
- Test for Stationarity
- Test for Seasonality
  - Apply Differencing as needed
- Test for Autocorrelation (ACF) and Partial Autocorrelation (PACF)
Split training and test data
- Walk Forward Validation
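
Walk forward validation is listed above but not shown elsewhere in these notes. A minimal sketch of the idea, assuming an expanding training window: each split trains on everything up to a cut-off and tests on the next few rows, then the cut-off moves forward. The helper name walk_forward_splits and its parameters initial_train and horizon are hypothetical, chosen only for illustration.

```python
import numpy as np
import pandas as pd


def walk_forward_splits(n_rows, initial_train, horizon=1):
    """Yield (train_positions, test_positions) with an expanding training window.

    Hypothetical helper: initial_train is the size of the first training set,
    horizon is the number of rows tested in each step.
    """
    start = initial_train
    while start + horizon <= n_rows:
        yield np.arange(0, start), np.arange(start, start + horizon)
        start += horizon


# Toy daily series with made-up values
series = pd.Series(
    range(10), index=pd.date_range("2024-01-01", periods=10, freq="D")
)

splits = list(walk_forward_splits(len(series), initial_train=6, horizon=2))
for train_pos, test_pos in splits:
    train, test = series.iloc[train_pos], series.iloc[test_pos]
    # The test rows always come strictly after the training rows
    print(f"train rows: {len(train)}, test rows: {len(test)}")
```

Because the rows are never shuffled, the model is always evaluated on data that is later than anything it was trained on.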

Data Prep

Read Data

import pandas as pd

# Parse the date column as a date while reading the data
# parse_dates takes the position of the date column (here, column 0)
# dayfirst=True reads dates as dd/mm/yyyy instead of the default mm/dd/yyyy; not needed if the dates are in the default format
df_ts_base1 = pd.read_csv('data/TimeSeriesData1.csv', dayfirst=True, parse_dates=[0])

# Alternatively, convert the date column after reading the CSV
# Do this if the date column was read without being parsed as a date
df_ts_base2 = pd.read_csv('data/TimeSeriesData1.csv')
df_ts_base2['date'] = pd.to_datetime(df_ts_base2['date'], dayfirst=True)
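
To check that both approaches give a real date column, the sketch below reads a small in-memory CSV instead of a file. The two rows of data are made up for illustration; only the column names date and spx follow the snippets in these notes.

```python
import io

import pandas as pd

# Made-up sample in dd/mm/yyyy format
csv_text = "date,spx\n07/01/1994,469.90\n10/01/1994,475.27\n"

# Option 1: parse while reading
df_parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=[0], dayfirst=True)

# Option 2: read as text, then convert
df_converted = pd.read_csv(io.StringIO(csv_text))
df_converted["date"] = pd.to_datetime(df_converted["date"], dayfirst=True)

# Both columns should now be datetime, and 07/01/1994 should mean 7 January
print(df_parsed.dtypes)
print(df_parsed["date"].iloc[0])
```

Checking the dtype right after loading catches the common case where the date column silently stays as text.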

Resampling

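Alias	Description
B	Business day
D	Calendar day
W	Weekly
M	Month end (ME from pandas 2.2)
Q	Quarter end (QE from pandas 2.2)
A	Year end (YE from pandas 2.2)
BA	Business year end (BYE from pandas 2.2)
AS	Year start (YS from pandas 2.2)
H	Hourly frequency (h from pandas 2.2)
T, min	Minutely frequency (min preferred from pandas 2.2)
S	Secondly frequency (s from pandas 2.2)
L, ms	Millisecond frequency (ms preferred from pandas 2.2)
U, us	Microsecond frequency (us preferred from pandas 2.2)
N, ns	Nanosecond frequency (ns preferred from pandas 2.2)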
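
A quick way to see what an alias means is to generate dates with it. The sketch below uses only the D, B, and W aliases from the table; the date range is arbitrary and chosen so that it starts on a Monday.

```python
import pandas as pd

# 1 January 2024 is a Monday, so this range covers exactly one Mon-Sun week
calendar_days = pd.date_range("2024-01-01", "2024-01-07", freq="D")
business_days = pd.date_range("2024-01-01", "2024-01-07", freq="B")

# 'W' is anchored on Sundays by default
weekly = pd.date_range("2024-01-01", "2024-01-31", freq="W")

print(len(calendar_days))  # 7 calendar days
print(len(business_days))  # 5 business days (Mon-Fri; holidays are not excluded)
print(list(weekly.strftime("%Y-%m-%d")))
```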

Sample Code

Resample

# Downsample
df_ts1_mth = df_ts1.resample('M', on='date').mean()

# Upsample
df_ts1_hr = df_ts1.resample('H', on='date').mean()
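
The sketch below shows the effect of both directions on a toy frame. It assumes daily data with made-up values, and uses weekly and daily frequencies rather than the monthly and hourly ones above so the numbers are easy to check by hand.

```python
import pandas as pd

# Two full Mon-Sun weeks of made-up daily values 1..14
daily = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=14, freq="D"),
    "value": range(1, 15),
})

# Downsample: daily -> weekly, each week summarised by its mean
weekly_mean = daily.resample("W", on="date").mean()
print(weekly_mean)

# Upsample: weekly -> daily; days with no source row become NaN
daily_again = weekly_mean.resample("D").mean()
print(daily_again)
```

Downsampling needs an aggregation to combine several rows into one, while upsampling creates empty rows that then have to be filled (see Handle Missing Data).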

Set Index

# Set date column as index if it has not been set as the index based on some prior operation
df_ts_base1.set_index("date", inplace=True)

Set desired frequency

# Set the frequency only after the date column has been set as the index
# asfreq generates new rows (filled with NaN) for periods missing at the desired frequency
# Alias 'B' signifies business days
df_ts_base1 = df_ts_base1.asfreq("B")
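
As an illustration of the two steps together, the sketch below builds a toy frame with one business day missing and shows the row that asfreq adds. The values are made up; the column name spx follows the snippets in these notes.

```python
import pandas as pd

# Mon, Tue, Thu, Fri of one week; Wednesday 3 January is missing
df = pd.DataFrame({
    "date": pd.to_datetime(
        ["2024-01-01", "2024-01-02", "2024-01-04", "2024-01-05"]
    ),
    "spx": [100.0, 101.0, 103.0, 104.0],
})

# Step 1: the date column becomes the index
df = df.set_index("date")

# Step 2: enforce a business-day frequency; the missing day appears as NaN
df_b = df.asfreq("B")
print(df_b)
print(df_b.index.freqstr)
```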

Handle missing data

Method	Description
bfill	Backward fill
count	Count of values
ffill	Forward fill
first	First valid data value
last	Last valid data value
max	Maximum data value
mean	Mean of values in time range
median	Median of values in time range
min	Minimum data value
nunique	Number of unique values
ohlc	Opening value, highest value, lowest value, closing value
pad	Same as forward fill (deprecated in newer pandas versions; prefer ffill)
std	Standard deviation of values
sum	Sum of values
var	Variance of values

Handle Missing Data

Interpolate

# Forward fill NaNs (carry the last valid value forward)
df_ts_base1['spx'] = df_ts_base1['spx'].ffill()

# Back fill NaNs (use the next valid value)
df_ts_base1['spx'] = df_ts_base1['spx'].bfill()

# Populate NaNs using the mean value
df_ts_base1['spx'] = df_ts_base1['spx'].fillna(value=df_ts_base1['spx'].mean())

# Fill in missing values with linear interpolation (equally spaced values)
df_ts1_hr_interpolated = df_ts1_hr.interpolate(method='linear')
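
The four options give different results on the same gap. A small side-by-side sketch on a made-up series with two consecutive missing values:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])

filled = pd.DataFrame({
    "original": s,
    "ffill": s.ffill(),                      # 1, 1, 1, 4
    "bfill": s.bfill(),                      # 1, 4, 4, 4
    "mean": s.fillna(s.mean()),              # 1, 2.5, 2.5, 4
    "linear": s.interpolate(method="linear"),  # 1, 2, 3, 4
})
print(filled)
```

Forward fill uses only past values, which matters when the filled series later feeds a forecasting model; back fill, mean fill, and interpolation all use information from later rows.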

Feature Engineering

Add Basic Date Columns

Pandas Datetime

# Add additional date columns as needed
df_ts1['year'] = df_ts1.index.year
df_ts1['month'] = df_ts1.index.month
df_ts1['day'] = df_ts1.index.day

Lag Features

Shift

df_ts1['last_day'] = df_ts1.spx.shift(1)   # value of the previous data point
df_ts1['last_week'] = df_ts1.spx.shift(7)  # value from 7 data points earlier (one week only if the data is daily)

Window Features

Rolling

# Aggregates over the specified rolling window (here, the current and previous data point)
df_ts1['2day_mean'] = df_ts1.spx.rolling(window=2).mean()
df_ts1['2day_max'] = df_ts1.spx.rolling(window=2).max()

Expanding/Cumulative Features

Expanding

# Cumulative max
df_ts1['cum_max'] =  df_ts1.spx.expanding().max()
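
To see how lag, rolling, and expanding features differ, the sketch below applies one of each to a five-row toy series. The values are made up; the column name spx follows the snippets above.

```python
import pandas as pd

df = pd.DataFrame(
    {"spx": [10.0, 12.0, 11.0, 15.0, 14.0]},
    index=pd.date_range("2024-01-01", periods=5, freq="D"),
)

df["lag_1"] = df["spx"].shift(1)                       # previous row's value
df["roll2_mean"] = df["spx"].rolling(window=2).mean()  # mean of current and previous row
df["cum_max"] = df["spx"].expanding().max()            # max of all rows so far

print(df)

# Lag and rolling features are NaN until enough history exists;
# those rows usually need to be dropped or filled before modeling
model_ready = df.dropna()
print(len(model_ready))
```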

Explore Data

Test for Stationarity

Stationarity

Augmented Dickey Fuller Test

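import statsmodels.tsa.stattools as sts

# Null hypothesis: the series has a unit root (is non-stationary)
# Returns (test statistic, p-value, lags used, number of observations, critical values, best information criterion)
# A p-value below the chosen significance level (e.g. 0.05) suggests the series is stationary
adfuller_stats = sts.adfuller(df_ts_base1.market_value)
adfuller_stats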

Test for Seasonality

Seasonality

Seasonal Decompose

import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

# Splits the series into trend, seasonal, and residual components
# Use model="multiplicative" for a multiplicative naive decomposition
s_dec_additive = seasonal_decompose(df_ts_base1.market_value, model = "additive")
s_dec_additive.plot()
plt.show()

Differencing

# Remove Trend
# Lag 1 Differencing to get rid of trends
# Uses the difference of current and 'Lag1' values (col - col.shift(1))
df_ts_base3['diff1'] = df_ts_base3['MilesMM'].diff(periods=1)

# Remove Seasonality
# Lag 12 Differencing to get rid of yearly seasonality in monthly data
# Uses the difference of current and 'Lag12' values (col - col.shift(12)), applied here to the detrended 'diff1' column
df_ts_base3['diff12'] = df_ts_base3['diff1'].diff(periods=12)
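
A worked example makes the effect of the two differences concrete. The sketch below uses a synthetic monthly series (made-up linear trend plus a fixed 12-month pattern, with no noise), so that after both differences nothing is left. The column name miles is a placeholder.

```python
import numpy as np
import pandas as pd

t = np.arange(36)
pattern = np.array([5, 3, 0, -2, -4, -5, -4, -2, 0, 3, 5, 6])  # repeats every 12 months

df = pd.DataFrame(
    {"miles": 2.0 * t + pattern[t % 12]},
    index=pd.date_range("2020-01-01", periods=36, freq="MS"),
)

# Lag 1 differencing removes the linear trend (a constant 2 per month remains,
# plus the month-to-month change of the seasonal pattern)
df["diff1"] = df["miles"].diff(periods=1)

# Lag 12 differencing of diff1 removes the repeating seasonal pattern
df["diff12"] = df["diff1"].diff(periods=12)

print(df[["diff1", "diff12"]].iloc[12:16])
print(int(df["diff12"].isna().sum()))  # 1 row lost to diff1, 12 more to diff12
```

On real data the result would be noise around zero rather than exactly zero, and each difference costs rows at the start of the series.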

Test for Autocorrelation (ACF)

Autocorrelation

Plot ACF

import matplotlib.pyplot as plt
import statsmodels.graphics.tsaplots as sgt

# zero = False means that the current period (lag 0) is not plotted,
# as the correlation between the current period and itself will always be 1
# lags = 40 is a convenient choice here, not a universal optimum; adjust it to the frequency and length of the data
sgt.plot_acf(df_ts_base1.market_value, lags = 40, zero = False)
plt.title("ACF S&P", size = 24)
plt.show()

Test for Partial Autocorrelation (PACF)

Partial Autocorrelation

Plot PACF

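import matplotlib.pyplot as plt
import statsmodels.graphics.tsaplots as sgt

# zero = False means that the current period (lag 0) is not plotted,
# as the correlation between the current period and itself will always be 1
# lags = 40 is a convenient choice here, not a universal optimum; adjust it to the frequency and length of the data
# method = 'ols' estimates each partial autocorrelation by regressing the series on its lags
sgt.plot_pacf(df_ts_base1.market_value, lags = 40, zero = False, method = 'ols')
plt.title("PACF S&P", size = 24)
plt.show()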