TIME SERIES ANALYSIS & FORECASTING

TIME SERIES ANALYSIS & FORECASTING
7+

Contents:

 

  • Introduction

     

  • Dataset and Visualizations

     

  • ARIMA (Autoregressive Integrating Moving Average) Model and Forecasting

     

  • Prophet (Fbprophet) Model and Forecasting

     

  • Conclusion

Introduction

  • What is time series?

     

    A set of ordered observations of a quantitative variable taken at successive points in time (usually evenly spaced, like once a day) is known as Time Series. Time, in terms of years, months, days, or hours is simply a device that enables one to relate all phenomenon to a set of common, stable reference points.


    For example, the data of airline ticket sales per day is a time series.

     

  • Components of a Time Series:

 

  • Trend: The trend shows a general direction of the time series data over a long period of time. A trend can be increasing(upward), decreasing(downward), or horizontal(stationary).

     

  • Seasonality: The seasonality component exhibits a trend that repeats with respect to timing, direction, and magnitude. Some examples include an increase in water consumption in summer due to hot weather conditions, or an increase in the number of airline passengers during holidays each year.

 

  • Cyclical Component:  These are the trends with no set repetition over a particular period of time. A cycle refers to the period of ups and downs, booms and slums of a time series, mostly observed in business cycles. These cycles do not exhibit a seasonal variation but generally occur over a time period of 3 to 12 years depending on the nature of the time series.

 

  • Irregular Variation:  These are the fluctuations in the time series data which become evident when trend and cyclical variations are removed. These variations are unpredictable, erratic, and may or may not be random.

     

  • Importance of time series:

     

    • Analysis of causes and conditions prevailing during occurrence of past changes, one can easily determine the future policies and programs.

       

    • Estimation of future trends on the basis of analysis or past trends.
    • Trends of trade cycles are studied and their effect can be reduced to a considerable extent.
    • Comparative study with the other time series.

     

  • Forecasting:

     


    Time series analysis comprises methods for analyzing the time series data in order to extract the meaningful statistics and the other important characteristics of the data set. And in time series analysis, Forecasting algorithm is an information process that seeks to predict the future values based on past and present data and most commonly by analysis of trends. The historical data points are extracted and prepared trying to predict future values for a selected variable of the data set.

     

In this blog I am analyzing a univariate time series model.

  • So, what is univariate time series?

Univariate time series consists of single observation recorded sequentially over equal time increments. In this project there is only one variable that is sales counts which is recorded over a certain period of time.

Let’s get started!

 

 

  • Dataset and Visualizations

     

    Here, we are working with Superstore Sales Data set. Based on the given data set we have to predict the total sales for every product and store in the next month. The data set provides; ‘Row Id’ , ‘Order ID’ , ‘Order Date’ , ‘Ship Date’ , ‘Ship Mode’ , ‘Customer ID’ , ‘Customer Name’ , ‘Segment’ , ‘Country’ , ‘City’ , ‘State’ , ‘Postal Code’ , ‘Region’ , ‘Product ID’ , ‘Category’ , ‘Sub-category’ , ‘Product Name’ , ‘Sales’ , ‘Quantity’ , ‘Discount’ and ‘Profit’.

     

Import the Dataset and Packages

 

Here we are working with a large data set. Hence, we can go through the following process to import:

 

from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
import matplotlib
matplotlib.rcParams['axes.labelsize']=14
matplotlib.rcParams['xtick.labelsize']=12
matplotlib.rcParams['ytick.labelsize']=12
matplotlib.rcParams['text.color']='k'
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)



(Download the data)

downloaded = drive.CreateFile({'id':'1v_ibPwoNak9iD6TDmVfUgC9or-mMpOsP'})
downloaded.GetContentFile('Sample - Superstore (2).xls')
df=pd.read_excel('Sample - Superstore (2).xls')
df.head()

Output:

 


Hurrah! My Data set imported successfully.

 

  • Data Processing

     

    This step includes removing columns we do not need (unnecessary data), and checking is any missing values are present or not, aggregate sales by date and so on.

    #PREPROCESSING
    cols = ['Row ID', 'Order ID', 'Ship Date', 'Ship Mode', 'Customer ID', 'Customer Name', 'Segment', 'Country', 'City', 'State', 'Postal Code', 'Region', 'Product ID', 'Category', 'Sub-Category', 'Product Name', 'Quantity', 'Discount', 'Profit']df.drop(cols, axis=1, inplace=True)
    df.head()

Output

    Order Date    Sales
0    2016-11-08    261.9600
1    2016-11-08    731.9400
2    2016-06-12    14.6200
3    2015-10-11    957.5775
4    2015-10-11    22.3680
df=df.sort_values('Order Date')
df.isnull().sum()



Output

Order Date 0
Sales 0
dtype: int64

 

print("start date is:",df['Order Date'].min())
print("End date is:",df['Order Date'].max())

Output

 

start date is: 2014-01-03 00:00:00
End date is: 2017-12-30 00:00:00
  • Indexing with Time series data:

     

    As the given data set is in DataFrame format, we cannot work with this type of the data. We have to convert the dataset to time series format for further use.

    #Make the date an index
    df = df.set_index('Order Date')
    df.index

    Output

    DatetimeIndex(['2014-01-03', '2014-01-04', '2014-01-04', '2014-01-04',
                  '2014-01-05', '2014-01-06', '2014-01-06', '2014-01-06',
                  '2014-01-06', '2014-01-06',
                  ...
                  '2017-12-29', '2017-12-29', '2017-12-29', '2017-12-30',
                  '2017-12-30', '2017-12-30', '2017-12-30', '2017-12-30',
                  '2017-12-30', '2017-12-30'],
     dtype='datetime64[ns]', name='Order Date', length=9994, freq=None)
    df.head()

 

 

Output

              Sales
Order Date
2014-01-03    16.448
2014-01-04    11.784
2014-01-04    272.736
2014-01-04    3.540
2014-01-05    19.536

Here, our dataset is suitable for using Time series analysis.

 

  • Re-sampling

     

    Re-sampling is the process of drawing repeated samples from the original dataset. The intuition behind re-sampling methods is that it creates “similar” cases for our data classes in order to render the data representative of the population we wish to investigate, and therefore feed the algorithm enough data to output more accurate results (when the data is not enough).

     

    Re-sampling methods are very easy to use, requiring little mathematical knowledge. They are methods that are easy to understand and implement compared to specialized statistical methods that may require deep technical skill in order to select and interpret.

     

    Y=df['Sales'].resample('M').mean()

    

  • Visualizing the Dataset

     

    Y.plot(figsize=(156))
    plt.show()

    Output

     


    Here, we have plotted no of sales against each month of the year. Some distinguishable patterns appear when we plot the data. Sales are low at the beginning of the year and high at the end of the year. And it changes reversely in the next year. Every two years interval pattern looks quite similar. Couple of high months present in the mid of the year.

     

    ts=df.groupby(['Order Date'])['Sales'].mean()


    ts.astype('float')


    plt.figure(figsize=(16,8))


    plt.title('Mean Sales of the company')


    plt.xlabel('Time')


    plt.ylabel('Sales')


    plt.plot(ts)
 

    Output

    

Here, we plot the mean sales of every month. And we can see that there are small bars at the starting and ending of the year and couple of high bars at the middle year.

 

We can also visualize our data using a method called time-series decomposition that allows us to decompose our time series into three distinct components:

Trend, Seasonality and Noise.

from pylab import rcParams


rcParams['figure.figsize'] = 188


decomposition = sm.tsa.seasonal_decompose(Y, model='additive')


fig = decomposition.plot()


plt.show()



Output

 

    

The plot above clearly shows that the sales is unstable, along with its obvious seasonality.

 

 

 

  • ARIMA Model Fitting and Forecasting

     


     

    One of the most common methods used in time series forecasting is known as the ARIMA model, which stands for AutoRegressive Integrated Moving Average. ARIMA is a model that can be fitted to time series data in order to better understand or predict future points in the series.

    There are three distinct integers (pdq) that are used to parameterize ARIMA models. Because of that, ARIMA models are denoted with the notation ARIMA(p, d, q). Together these three parameters account for seasonality, trend, and noise in datasets:

    • p is the auto-regressive part of the model. It allows us to incorporate the effect of past values into our model. Intuitively, this would be similar to stating that it is likely to be warm tomorrow if it has been warm the past 3 days.
    • d is the integrated part of the model. This includes terms in the model that incorporate the amount of differencing (i.e. the number of past time points to subtract from the current value) to apply to the time series. Intuitively, this would be similar to stating that it is likely to be same temperature tomorrow if the difference in temperature in the last three days has been very small.
    • q is the moving average part of the model. This allows us to set the error of our model as a linear combination of the error values observed at previous time points in the past.
    • Checking Stationary of the Data

       

      • What is Stationary Time Series:

         

        In simple words, a stationary time series is one, whose statistical properties i.e. mean and variance does not depend on time.

         

    For checking stationary, we perform ADF Test .

  • ADF Test: Augmented Dickey Fuller test (ADF Test) is a common statistical test used to test whether a given time series is stationary or not. It is one of the most commonly used statistical test when it comes to analyzing the stationary of the series.

     

    It is fundamentally a statistical significance test. That means, there is a hypothesis testing involved with a null and alternative hypothesis and as a result a test statistic is computed and p-value get reported. It is from the test statistic and the p-value, we can make an inference as to whether the given series is stationary or not.

 

## Checking for Stationary of the given series


from statsmodels.tsa.stattools import adfuller


def test_stationarity(series,mlag = 365, lag = None,):


    print('ADF Test Result')


    res = adfuller(series, maxlag = mlag, autolag = lag)


    output = pd.Series(res[0:4],index = ['Test Statistic', 'p value', 'used lag', 'Number of observations used'])


    for key, value in res[4].items():


        output['Critical Value ' + key] = value


    print(output)



test_stationarity(df['Sales'])



Output
ADF Test Result


Test Statistic -5.172130


p value 0.000010


used lag 365.000000


Number of observations used 9628.000000


Critical Value 1% -3.431029


Critical Value 5% -2.861840


Critical Value 10% -2.566930


dtype: float64


Therefore based on the ADF Test result we can conclude that the series is stationary and useful for applying ARIMA Method.

 

 

 

  • Construct ARIMA Model
import itertools





p=d=q=range(0,2)


pdq=list(itertools.product(p,d,q))


seasonal_pdq=[(x[0],x[1],x[2],12)for x in list(itertools.product(p,d,q))]

print("Example of parameter combinations for Seasonal Arima")


print("Sarimax:{}X{}".format(pdq[1],seasonal_pdq[1]))


print("Sarimax:{}X{}".format(pdq[2],seasonal_pdq[2]))


print("Sarimax:{}X{}".format(pdq[3],seasonal_pdq[3]))


print("Sarimax:{}X{}".format(pdq[4],seasonal_pdq[4]))


Output

Example of parameter combinations for Seasonal Arima


Sarimax:(0, 0, 1)X(0, 0, 1, 12)


Sarimax:(0, 1, 0)X(0, 1, 0, 12)


Sarimax:(0, 1, 1)X(0, 1, 1, 12)


Sarimax:(1, 0, 0)X(1, 0, 0, 12)

We can now use the triplets of parameters defined above to automate the process of training and evaluating ARIMA models on different combinations. In Statistics and Machine Learning, this process is known as grid search (or hyper parameter optimization) for model selection.

When evaluating and comparing statistical models fitted with different parameters, each can be ranked against one another based on how well it fits the data or its ability to accurately predict future data points. We will use the AIC (Akaike Information Criterion) value, which is conveniently returned with ARIMA models fitted using statsmodels. The AIC measures how well a model fits the data while taking into account the overall complexity of the model. A model that fits the data very well while using lots of features will be assigned a larger AIC score than a model that uses fewer features to achieve the same goodness-of-fit. Therefore, we are interested in finding the model that yields the lowest AIC value.

 

for param in pdq:


    for param_seasonal in seasonal_pdq:


        try:


            mod = sm.tsa.statespace.SARIMAX(Y,


                                            order=param,


                                            seasonal_order=param_seasonal,


                                            enforce_stationarity=False,


                                            enforce_invertibility=False)


            results = mod.fit()


            print('ARIMA{}x{}12 - AIC:{}'.format(param, param_seasonal, results.aic))


        except:


            continue



Output

 

ARIMA(0, 0, 0)x(0, 0, 0, 12)12 - AIC:648.1839075951442


ARIMA(0, 0, 0)x(0, 0, 1, 12)12 - AIC:1573.9564723209578


ARIMA(0, 0, 0)x(0, 1, 0, 12)12 - AIC:387.257330070831


ARIMA(0, 0, 0)x(1, 0, 0, 12)12 - AIC:402.97856214610994


/usr/local/lib/python3.6/dist-packages/statsmodels/base/model.py:512: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals


"Check mle_retvals", ConvergenceWarning)


ARIMA(0, 0, 0)x(1, 0, 1, 12)12 - AIC:384.7995950669331


ARIMA(0, 0, 0)x(1, 1, 0, 12)12 - AIC:262.1644595966525


ARIMA(0, 0, 1)x(0, 0, 0, 12)12 - AIC:597.654605697939


ARIMA(0, 0, 1)x(0, 0, 1, 12)12 - AIC:nan


ARIMA(0, 0, 1)x(0, 1, 0, 12)12 - AIC:376.3251637580207


/usr/local/lib/python3.6/dist-packages/statsmodels/base/model.py:512: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals


"Check mle_retvals", ConvergenceWarning)


/usr/local/lib/python3.6/dist-packages/statsmodels/base/model.py:512: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals


"Check mle_retvals", ConvergenceWarning)


ARIMA(0, 0, 1)x(1, 0, 0, 12)12 - AIC:405.0472142226208


ARIMA(0, 0, 1)x(1, 0, 1, 12)12 - AIC:369.16844047362355


ARIMA(0, 0, 1)x(1, 1, 0, 12)12 - AIC:263.3626446818747


ARIMA(0, 1, 0)x(0, 0, 0, 12)12 - AIC:528.1869579691191


ARIMA(0, 1, 0)x(0, 0, 1, 12)12 - AIC:1454.5916794067966


ARIMA(0, 1, 0)x(0, 1, 0, 12)12 - AIC:402.41176085757166


ARIMA(0, 1, 0)x(1, 0, 0, 12)12 - AIC:387.20858105018107


/usr/local/lib/python3.6/dist-packages/statsmodels/base/model.py:512: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals


"Check mle_retvals", ConvergenceWarning)


/usr/local/lib/python3.6/dist-packages/statsmodels/base/model.py:512: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals


"Check mle_retvals", ConvergenceWarning)


ARIMA(0, 1, 0)x(1, 0, 1, 12)12 - AIC:1394.9326629156938


ARIMA(0, 1, 0)x(1, 1, 0, 12)12 - AIC:270.9438065103488


ARIMA(0, 1, 1)x(0, 0, 0, 12)12 - AIC:467.6414267046183


ARIMA(0, 1, 1)x(0, 0, 1, 12)12 - AIC:2609.961816454136


ARIMA(0, 1, 1)x(0, 1, 0, 12)12 - AIC:370.0243667754068


ARIMA(0, 1, 1)x(1, 0, 0, 12)12 - AIC:368.6117762611625


ARIMA(0, 1, 1)x(1, 0, 1, 12)12 - AIC:2849.340499469712


ARIMA(0, 1, 1)x(1, 1, 0, 12)12 - AIC:257.37284772726474


ARIMA(1, 0, 0)x(0, 0, 0, 12)12 - AIC:540.6772379709616


/usr/local/lib/python3.6/dist-packages/statsmodels/base/model.py:512: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals


"Check mle_retvals", ConvergenceWarning)


ARIMA(1, 0, 0)x(0, 0, 1, 12)12 - AIC:1536.2229591079033


ARIMA(1, 0, 0)x(0, 1, 0, 12)12 - AIC:389.2433793870417


ARIMA(1, 0, 0)x(1, 0, 0, 12)12 - AIC:387.316239028495


ARIMA(1, 0, 0)x(1, 0, 1, 12)12 - AIC:387.4871336951148


ARIMA(1, 0, 0)x(1, 1, 0, 12)12 - AIC:253.0789709391584


ARIMA(1, 0, 1)x(0, 0, 0, 12)12 - AIC:496.90504904914684


ARIMA(1, 0, 1)x(0, 0, 1, 12)12 - AIC:2841.1869647833105


ARIMA(1, 0, 1)x(0, 1, 0, 12)12 - AIC:374.4935542234095


ARIMA(1, 0, 1)x(1, 0, 0, 12)12 - AIC:371.8863361971654


ARIMA(1, 0, 1)x(1, 0, 1, 12)12 - AIC:359.29908312075713


ARIMA(1, 0, 1)x(1, 1, 0, 12)12 - AIC:254.99974976566307


ARIMA(1, 1, 0)x(0, 0, 0, 12)12 - AIC:508.22613161672086


/usr/local/lib/python3.6/dist-packages/statsmodels/base/model.py:512: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals


"Check mle_retvals", ConvergenceWarning)


ARIMA(1, 1, 0)x(0, 0, 1, 12)12 - AIC:1463.290184247775


ARIMA(1, 1, 0)x(0, 1, 0, 12)12 - AIC:397.02186857072024


ARIMA(1, 1, 0)x(1, 0, 0, 12)12 - AIC:362.307716753627


/usr/local/lib/python3.6/dist-packages/statsmodels/base/model.py:512: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals


"Check mle_retvals", ConvergenceWarning)


ARIMA(1, 1, 0)x(1, 0, 1, 12)12 - AIC:1470.1783831793764


ARIMA(1, 1, 0)x(1, 1, 0, 12)12 - AIC:245.87074565604823


ARIMA(1, 1, 1)x(0, 0, 0, 12)12 - AIC:468.8192870346768


ARIMA(1, 1, 1)x(0, 0, 1, 12)12 - AIC:3213.368937063234


ARIMA(1, 1, 1)x(0, 1, 0, 12)12 - AIC:372.0243404313211


ARIMA(1, 1, 1)x(1, 0, 0, 12)12 - AIC:357.38039336226905


ARIMA(1, 1, 1)x(1, 0, 1, 12)12 - AIC:1263.3417792136752


ARIMA(1, 1, 1)x(1, 1, 0, 12)12 - AIC:243.34456553221048


/usr/local/lib/python3.6/dist-packages/statsmodels/base/model.py:512: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals


"Check mle_retvals", ConvergenceWarning)

 

The output of our code suggests that SARIMAX(1, 1, 1)x(1, 1, 0, 12) yields the lowest AIC value of 243.34. We should therefore consider this to be optimal option out of all the models we have considered.

 

  • Fitting ARIMA Model

     

    Using grid search, we have identified the set of parameters that produces the best fitting model to our time series data. We can proceed to analyze this particular model in more depth.

    We’ll start by plugging the optimal parameter values into a new SARIMAX model:

    mod = sm.tsa.statespace.SARIMAX(Y, order=(111), seasonal_order=(11012),enforce_stationarity=False, enforce_invertibility=False)


    results = mod.fit()


    print(results.summary().tables[1])

Output

 

=====================================================================


coef std err z P>|z| [0.025 0.975]

---------------------------------------------------------------------------------------------------------------


ar.L1 -0.0754 0.251 -0.300 0.764 -0.567 0.417


ma.L1 -1.0000 1665.246 -0.001 1.000 -3264.823 3262.823


ar.S.L12 -0.5256 0.180 -2.919 0.004 -0.878 -0.173


sigma2 2305.2202 3.84e+06 0.001 1.000 -7.52e+06 7.53e+06


=====================================================================

 

The summary attribute that results from the output of SARIMAX returns a significant amount of information, but we’ll focus our attention on the table of coefficients. The coef column shows the weight (i.e. importance) of each feature and how each one impacts the time series. The P>|z| column informs us of the significance of each feature weight. Here, each weight has a p-value lower or close to 0.05, so it is reasonable to retain all of them in our model.

When fitting seasonal ARIMA models (and any other models for that matter), it is important to run model diagnostics to ensure that none of the assumptions made by the model have been violated. The plot_diagnostics object allows us to quickly generate model diagnostics and investigate for any unusual behavior.

 

results.plot_diagnostics(figsize=(168))


plt.show()

 

Output

 


Our primary concern is to ensure that the residuals of our model are uncorrelated and normally distributed with zero-mean. If the seasonal ARIMA model does not satisfy these properties, it is a good indication that it can be further improved.

In this case, our model diagnostics suggests that the model residuals are normally distributed based on the following:

  • In the top right plot, we see that the red KDE line follows closely with the N(0,1) line (where N(0,1)) is the standard notation for a normal distribution with mean 0 and standard deviation of 1). This is a good indication that the residuals are normally distributed.
  • The qq-plot on the bottom left shows that the ordered distribution of residuals (blue dots) follows the linear trend of the samples taken from a standard normal distribution with N(0, 1). Again, this is a strong indication that the residuals are normally distributed.
  • The residuals over time (top left plot) don’t display any obvious seasonality and appear to be white noise. This is confirmed by the autocorrelation (i.e. correlogram) plot on the bottom right, which shows that the time series residuals have low correlation with lagged versions of itself.

     

Those observations lead us to conclude that our model produces a satisfactory fit that could help us understand our time series data and forecast future values.

Note: Although we have a satisfactory fit, some parameters of our seasonal ARIMA model could be changed to improve our model fit. For example, our grid search only considered a restricted set of parameter combinations, so we might find better models if we widen the grid search.

 

  • Forecasting

     

    • Validating Forecasting

       

      To help us understand the accuracy of our forecasts, we compare predicted sales to real sales of the time series, and we set forecasts to start at 2017–01–01 to the end of the data.

       

      pred = results.get_prediction(start=pd.to_datetime('2017-01-31'), dynamic=False)


      pred_ci = pred.conf_int()


      ax = Y['2014':].plot(label='observed')


      pred.predicted_mean.plot(ax=ax, label='One-step ahead Forecast', alpha=.7, figsize=(147))


      ax.fill_between(pred_ci.index,


                      pred_ci.iloc[:, 0],


                      pred_ci.iloc[:, 1], color='k', alpha=.2)


      ax.set_xlabel('Date')


      ax.set_ylabel('Sales')


      plt.legend()


      plt.show()

 

Output

    

The line plot is showing the observed values compared to the rolling forecast predictions. Overall, our forecasts align with the true values very well, showing a downward trend starts from the beginning of the year and captured the seasonality toward the end of the year.

It is also useful to quantify the accuracy of our forecasts. We will use the (Mean Squared Error), which summarizes the average error of our forecasts. For each predicted value, we compute its distance to the true value and square the result. The results need to be squared so that positive/negative differences do not cancel each other out when we compute the overall mean.

y_forecasted = pred.predicted_mean


y_truth = Y['2017-01-01':]

mse = ((y_forecasted - y_truth) ** 2).mean()


print('The Mean Squared Error of our forecasts is {}'.format(round(mse, 2)))

 

Output

The Mean Squared Error of our forecasts is 2240.32

 

print('The Root Mean Squared Error of our forecasts is {}'.format(round(np.sqrt(mse), 2)))

Output

The Root Mean Squared Error of our forecasts is 47.33

In statistics, the mean square error(MSE) of an estimator measures the average of the squares of the errors — that is, the average squared difference between the estimated values and what is estimated. The MSE is a measure of the quality of an estimator — it is always non-negative, and the smaller the MSE, the closer we are to finding the line of best fit.

Root mean square error (RMSE) tells us that our model was able to forecast the average daily sales in the test set within 47.33 of the real sales. This is a pretty good model so far.

  • Producing and Visualizing Forecast

In the final step, we describe how to leverage our seasonal ARIMA time series model to forecast future values. The get_forecast() attribute of our time series object can compute forecasted values for a specified number of steps ahead.

pred_uc = results.get_forecast(steps=12)


pred_ci = pred_uc.conf_int()


ax = Y.plot(label='observed', figsize=(147))


pred_uc.predicted_mean.plot(ax=ax, label='Forecast')


ax.fill_between(pred_ci.index,


                pred_ci.iloc[:, 0],


                pred_ci.iloc[:, 1], color='k', alpha=.25)


ax.set_xlabel('Date')


ax.set_ylabel('Furniture Sales')


plt.legend()


plt.show()

 

Output


Both the forecasts and associated confidence interval that we have generated can now be used to further understand the time series and foresee what to expect. Our forecasts show that the time series is expected to continue increasing at a steady pace.

As we forecast further out into the future, it is natural for us to become less confident in our values. This is reflected by the confidence intervals generated by our model, which grow larger as we move further out into the future.

 

 


  • Prophet Model Fitting and Forecasting

     


     

    Released by Facebook in 2017, forecasting tool Prophet is designed for analyzing time-series that display patterns on different time scales such as yearly, weekly and daily. It also has advanced capabilities for modeling the effects of holidays on a time-series and implementing custom change points. Therefore, we are using Prophet to get a model up and running.

    “Prophet has been a key piece to improving Facebook’s ability to create a large number of trustworthy forecasts used for decision-making and even in product features.”

    It is a decomposable time series model with three main model components: trend, seasonality, and holidays. They are combined in the following equation:


    • g(t): piecewise linear or logistic growth curve for modeling non-periodic changes in time series
    • s(t): periodic changes (e.g. weekly/yearly seasonality)
    • h(t): effects of holidays (user provided) with irregular schedules
    • εt: error term accounts for any unusual changes not accommodated by the model

     

    Trend: Trend is smooth ,regular, long term movement of a statistical series: frequent changes either in absolute amount or in rates of increase or decrease are quite inconsistent with the idea of secular trend i.e there is no periodicity.

    Seasonal Variation: A periodic movement is one which recurs with some degree of regularity within a definite time period.

    Holiday: Holidays or events incur predictability of a time series. As for example in New year the sale of clothes remain quite high as compared to other regular days and same for Diwali (in India) or other occasions.

 

  • Installing Packages and Libraries
    import warnings


    import numpy as np


    import pandas as pd


    import matplotlib.pyplot as plt


    warnings.filterwarnings("ignore")


    %matplotlib inline



 

from fbprophet import Prophet


from sklearn.metrics import mean_squared_error, mean_absolute_error


plt.rcParams['figure.figsize']=(15,10)

 

 

  • Preparing Dataset for Prophet

     

    df = df.rename(columns={'Order Date''ds''Sales''y'})


    df_model = Prophet(interval_width=0.95)


    df_model.fit(df)

     

    Output

    INFO:numexpr.utils:NumExpr defaulting to 2 threads.


    INFO:fbprophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.


    <fbprophet.forecaster.Prophet at 0x7f0ebc07c940>

     

df.set_index('ds').y.plot()

 

Output

 


Here, we plot the sales data with respect to selling date. And see that there are low bars in the starting of year and ending of year and couple of big bars in the mid year.

 

 

  • Forecast using Prophet

     

    model = Prophet()


    model.fit(df)

 

    Output

INFO:fbprophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
future = model.make_future_dataframe(periods=12, freq = 'm')


future.tail()

 

    Output

 ds


10002 2018-07-31


10003     2018-08-31


10004     2018-09-30


10005     2018-10-31


10006     2018-11-30

 

forecast = model.predict(future)


forecast.tail()

 

Output


 

forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()

    Output

 ds     yhat     yhat_lower     yhat_upper


1244    2018-08-01    1108.296244    -1751.105682    3839.363077


1245    2018-09-01    3101.296688    461.439189     5855.126001


1246    2018-10-01    3369.973509    610.175122     6172.857257


1247    2018-11-01    3502.864480    679.938129     6185.642495


1248    2018-12-01    3674.377755    783.551508     6571.994512
  • Result of Forecast

 

model.plot(forecast)

 

Output


From the diagram, we can see the forecasting sales for the next year. And we can see that the rate of the sales is increasing with respect to years.

 

Now break the model in different components;

Trend , Seasonal Variation , Holidays ;

model.plot_components(forecast)

 

Outout


Good to see that the sales have been increasing over time and will be keep growing year after year. The worst day for sale is Wednesday (should be a holiday), The best day for sale is Monday and Friday. The worst month for sale is February, The best month for sale is November and September.

 

metric_df = forecast.set_index('ds')[['yhat']].join(df.set_index('ds').y).reset_index()


metric_df.tail()

 

Output

 

 ds     yhat      y


1244    2018-08-01    1108.296244    NaN


1245    2018-09-01    3101.296688    NaN


1246    2018-10-01    3369.973509    NaN


1247    2018-11-01    3502.864480    NaN


1248    2018-12-01    3674.377755    NaN

 

 

metric_df.dropna(inplace=True)


metric_df.tail()

Output

     ds     yhat     y


1232    2017-12-26    2636.565655    814.5940


1233    2017-12-27    1670.346485    177.6360


1234    2017-12-28    2764.198983    1657.3508


1235    2017-12-29    2922.492564    2915.5340


1236    2017-12-30    2416.579612    713.7900

 

mean_squared_error(metric_df.y, metric_df.yhat)

 

Output

4564396.647441651

 

mean_absolute_error(metric_df.y, metric_df.yhat)

 

Output

1403.109843097731

Therefore, RMSE = Square root (MSE) = 2136.445

Root mean square error (RMSE) tells us that our model was able to forecast the average daily sales in the test set within 2136.445 of the real sales. This is a quite good model so far.

 

Conclusion

So here we are to conclude that we focused on the application of different time series models in this study, learned how to use them with the objective to forecast new sale values. There are many time-series analysis we can explore from now on, such as forecast with uncertainty bounds, change point and anomaly detection, forecast time-series with external data source. With respect to accuracy level, we can conclude that ARIMA performed very well. Prophet is one of the new forecasting approach, it may need more tuning for getting more accuracy.


 


Project by <br>Siddhartha Saha
Project by
Siddhartha Saha

Msc Statistics Kalyani University

Mentored by

Souvik Manna
Souvik Manna

ISI Bangalore

7+

Mathematica-City

Mathematica-city is an online Education forum for Science students run by Kounteyo, Shreyansh and Souvik. We aim to provide articles related to Actuarial Science, Data Science, Statistics, Mathematics and their applications using different Statistical Software. Feel free to reach out to us for any kind of discussion on any of the related topics,

Leave a Reply

Your email address will not be published.