# TIME SERIES ANALYSIS & FORECASTING

7+

Contents:

• Introduction

• Dataset and Visualizations

• ARIMA (Autoregressive Integrating Moving Average) Model and Forecasting

• Prophet (Fbprophet) Model and Forecasting

• Conclusion

Introduction

• What is time series?

A set of ordered observations of a quantitative variable taken at successive points in time (usually evenly spaced, like once a day) is known as Time Series. Time, in terms of years, months, days, or hours is simply a device that enables one to relate all phenomenon to a set of common, stable reference points.

For example, the data of airline ticket sales per day is a time series.

• Components of a Time Series:

• Trend: The trend shows a general direction of the time series data over a long period of time. A trend can be increasing(upward), decreasing(downward), or horizontal(stationary).

• Seasonality: The seasonality component exhibits a trend that repeats with respect to timing, direction, and magnitude. Some examples include an increase in water consumption in summer due to hot weather conditions, or an increase in the number of airline passengers during holidays each year.

• Cyclical Component:  These are the trends with no set repetition over a particular period of time. A cycle refers to the period of ups and downs, booms and slums of a time series, mostly observed in business cycles. These cycles do not exhibit a seasonal variation but generally occur over a time period of 3 to 12 years depending on the nature of the time series.

• Irregular Variation:  These are the fluctuations in the time series data which become evident when trend and cyclical variations are removed. These variations are unpredictable, erratic, and may or may not be random.

• Importance of time series:

• Analysis of causes and conditions prevailing during occurrence of past changes, one can easily determine the future policies and programs.

• Estimation of future trends on the basis of analysis or past trends.
• Trends of trade cycles are studied and their effect can be reduced to a considerable extent.
• Comparative study with the other time series.

• Forecasting: Time series analysis comprises methods for analyzing the time series data in order to extract the meaningful statistics and the other important characteristics of the data set. And in time series analysis, Forecasting algorithm is an information process that seeks to predict the future values based on past and present data and most commonly by analysis of trends. The historical data points are extracted and prepared trying to predict future values for a selected variable of the data set.

In this blog I am analyzing a univariate time series model.

• So, what is univariate time series?

Univariate time series consists of single observation recorded sequentially over equal time increments. In this project there is only one variable that is sales counts which is recorded over a certain period of time.

Let’s get started!

• Dataset and Visualizations

Here, we are working with Superstore Sales Data set. Based on the given data set we have to predict the total sales for every product and store in the next month. The data set provides; ‘Row Id’ , ‘Order ID’ , ‘Order Date’ , ‘Ship Date’ , ‘Ship Mode’ , ‘Customer ID’ , ‘Customer Name’ , ‘Segment’ , ‘Country’ , ‘City’ , ‘State’ , ‘Postal Code’ , ‘Region’ , ‘Product ID’ , ‘Category’ , ‘Sub-category’ , ‘Product Name’ , ‘Sales’ , ‘Quantity’ , ‘Discount’ and ‘Profit’.

Import the Dataset and Packages

Here we are working with a large data set. Hence, we can go through the following process to import:

`from pydrive.auth import GoogleAuthfrom pydrive.drive import GoogleDrivefrom google.colab import authfrom oauth2client.client import GoogleCredentialsimport matplotlibmatplotlib.rcParams['axes.labelsize']=14matplotlib.rcParams['xtick.labelsize']=12matplotlib.rcParams['ytick.labelsize']=12matplotlib.rcParams['text.color']='k'auth.authenticate_user()gauth = GoogleAuth()gauth.credentials = GoogleCredentials.get_application_default()drive = GoogleDrive(gauth)`

`downloaded = drive.CreateFile({'id':'1v_ibPwoNak9iD6TDmVfUgC9or-mMpOsP'})downloaded.GetContentFile('Sample - Superstore (2).xls')df=pd.read_excel('Sample - Superstore (2).xls')df.head()`

Output: Hurrah! My Data set imported successfully.

• Data Processing

This step includes removing columns we do not need (unnecessary data), and checking is any missing values are present or not, aggregate sales by date and so on.

`#PREPROCESSINGcols = ['Row ID', 'Order ID', 'Ship Date', 'Ship Mode', 'Customer ID', 'Customer Name', 'Segment', 'Country', 'City', 'State', 'Postal Code', 'Region', 'Product ID', 'Category', 'Sub-Category', 'Product Name', 'Quantity', 'Discount', 'Profit']df.drop(cols, axis=1, inplace=True)df.head()`

Output

`    Order Date    Sales0    2016-11-08    261.96001    2016-11-08    731.94002    2016-06-12    14.62003    2015-10-11    957.57754    2015-10-11    22.3680`
`df=df.sort_values('Order Date')df.isnull().sum()`

Output

`Order Date 0Sales 0dtype: int64`

`print("start date is:",df['Order Date'].min())print("End date is:",df['Order Date'].max())`

Output

`start date is: 2014-01-03 00:00:00End date is: 2017-12-30 00:00:00`
• Indexing with Time series data:

As the given data set is in DataFrame format, we cannot work with this type of the data. We have to convert the dataset to time series format for further use.

`#Make the date an indexdf = df.set_index('Order Date')df.index`

Output

`DatetimeIndex(['2014-01-03', '2014-01-04', '2014-01-04', '2014-01-04',               '2014-01-05', '2014-01-06', '2014-01-06', '2014-01-06',               '2014-01-06', '2014-01-06',               ...               '2017-12-29', '2017-12-29', '2017-12-29', '2017-12-30',               '2017-12-30', '2017-12-30', '2017-12-30', '2017-12-30',               '2017-12-30', '2017-12-30'], dtype='datetime64[ns]', name='Order Date', length=9994, freq=None)`
`    df.head()`

Output

`              SalesOrder Date2014-01-03    16.4482014-01-04    11.7842014-01-04    272.7362014-01-04    3.5402014-01-05    19.536`

Here, our dataset is suitable for using Time series analysis.

• Re-sampling

Re-sampling is the process of drawing repeated samples from the original dataset. The intuition behind re-sampling methods is that it creates “similar” cases for our data classes in order to render the data representative of the population we wish to investigate, and therefore feed the algorithm enough data to output more accurate results (when the data is not enough).

Re-sampling methods are very easy to use, requiring little mathematical knowledge. They are methods that are easy to understand and implement compared to specialized statistical methods that may require deep technical skill in order to select and interpret.

`    Y=df['Sales'].resample('M').mean()`

• Visualizing the Dataset

`Y.plot(figsize=(15, 6))plt.show()`

Output Here, we have plotted no of sales against each month of the year. Some distinguishable patterns appear when we plot the data. Sales are low at the beginning of the year and high at the end of the year. And it changes reversely in the next year. Every two years interval pattern looks quite similar. Couple of high months present in the mid of the year.

`ts=df.groupby(['Order Date'])['Sales'].mean()ts.astype('float')plt.figure(figsize=(16,8))plt.title('Mean Sales of the company')plt.xlabel('Time')plt.ylabel('Sales')plt.plot(ts)`
` `

Output Here, we plot the mean sales of every month. And we can see that there are small bars at the starting and ending of the year and couple of high bars at the middle year.

We can also visualize our data using a method called time-series decomposition that allows us to decompose our time series into three distinct components:

Trend, Seasonality and Noise.

`from pylab import rcParamsrcParams['figure.figsize'] = 18, 8decomposition = sm.tsa.seasonal_decompose(Y, model='additive')fig = decomposition.plot()plt.show()`

Output The plot above clearly shows that the sales is unstable, along with its obvious seasonality.

• ARIMA Model Fitting and Forecasting One of the most common methods used in time series forecasting is known as the ARIMA model, which stands for AutoRegressive Integrated Moving Average. ARIMA is a model that can be fitted to time series data in order to better understand or predict future points in the series.

There are three distinct integers (pdq) that are used to parameterize ARIMA models. Because of that, ARIMA models are denoted with the notation ARIMA(p, d, q). Together these three parameters account for seasonality, trend, and noise in datasets:

• p is the auto-regressive part of the model. It allows us to incorporate the effect of past values into our model. Intuitively, this would be similar to stating that it is likely to be warm tomorrow if it has been warm the past 3 days.
• d is the integrated part of the model. This includes terms in the model that incorporate the amount of differencing (i.e. the number of past time points to subtract from the current value) to apply to the time series. Intuitively, this would be similar to stating that it is likely to be same temperature tomorrow if the difference in temperature in the last three days has been very small.
• q is the moving average part of the model. This allows us to set the error of our model as a linear combination of the error values observed at previous time points in the past.
• Checking Stationary of the Data

• What is Stationary Time Series:

In simple words, a stationary time series is one, whose statistical properties i.e. mean and variance does not depend on time.

For checking stationary, we perform ADF Test .

• ADF Test: Augmented Dickey Fuller test (ADF Test) is a common statistical test used to test whether a given time series is stationary or not. It is one of the most commonly used statistical test when it comes to analyzing the stationary of the series.

It is fundamentally a statistical significance test. That means, there is a hypothesis testing involved with a null and alternative hypothesis and as a result a test statistic is computed and p-value get reported. It is from the test statistic and the p-value, we can make an inference as to whether the given series is stationary or not.

`## Checking for Stationary of the given seriesfrom statsmodels.tsa.stattools import adfullerdef test_stationarity(series,mlag = 365, lag = None,):    print('ADF Test Result')    res = adfuller(series, maxlag = mlag, autolag = lag)    output = pd.Series(res[0:4],index = ['Test Statistic', 'p value', 'used lag', 'Number of observations used'])    for key, value in res.items():        output['Critical Value ' + key] = value    print(output)test_stationarity(df['Sales'])Output`
`ADF Test ResultTest Statistic -5.172130p value 0.000010used lag 365.000000Number of observations used 9628.000000Critical Value 1% -3.431029Critical Value 5% -2.861840Critical Value 10% -2.566930dtype: float64Therefore based on the ADF Test result we can conclude that the series is stationary and useful for applying ARIMA Method.`

• Construct ARIMA Model
`import itertoolsp=d=q=range(0,2)pdq=list(itertools.product(p,d,q))seasonal_pdq=[(x,x,x,12)for x in list(itertools.product(p,d,q))]print("Example of parameter combinations for Seasonal Arima")print("Sarimax:{}X{}".format(pdq,seasonal_pdq))print("Sarimax:{}X{}".format(pdq,seasonal_pdq))print("Sarimax:{}X{}".format(pdq,seasonal_pdq))print("Sarimax:{}X{}".format(pdq,seasonal_pdq))`

Output

`Example of parameter combinations for Seasonal ArimaSarimax:(0, 0, 1)X(0, 0, 1, 12)Sarimax:(0, 1, 0)X(0, 1, 0, 12)Sarimax:(0, 1, 1)X(0, 1, 1, 12)Sarimax:(1, 0, 0)X(1, 0, 0, 12)`

We can now use the triplets of parameters defined above to automate the process of training and evaluating ARIMA models on different combinations. In Statistics and Machine Learning, this process is known as grid search (or hyper parameter optimization) for model selection.

When evaluating and comparing statistical models fitted with different parameters, each can be ranked against one another based on how well it fits the data or its ability to accurately predict future data points. We will use the AIC (Akaike Information Criterion) value, which is conveniently returned with ARIMA models fitted using statsmodels. The AIC measures how well a model fits the data while taking into account the overall complexity of the model. A model that fits the data very well while using lots of features will be assigned a larger AIC score than a model that uses fewer features to achieve the same goodness-of-fit. Therefore, we are interested in finding the model that yields the lowest AIC value.

`for param in pdq:    for param_seasonal in seasonal_pdq:        try:            mod = sm.tsa.statespace.SARIMAX(Y,                                            order=param,                                            seasonal_order=param_seasonal,                                            enforce_stationarity=False,                                            enforce_invertibility=False)            results = mod.fit()            print('ARIMA{}x{}12 - AIC:{}'.format(param, param_seasonal, results.aic))        except:            continue`

Output

`ARIMA(0, 0, 0)x(0, 0, 0, 12)12 - AIC:648.1839075951442ARIMA(0, 0, 0)x(0, 0, 1, 12)12 - AIC:1573.9564723209578ARIMA(0, 0, 0)x(0, 1, 0, 12)12 - AIC:387.257330070831ARIMA(0, 0, 0)x(1, 0, 0, 12)12 - AIC:402.97856214610994/usr/local/lib/python3.6/dist-packages/statsmodels/base/model.py:512: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals "Check mle_retvals", ConvergenceWarning)ARIMA(0, 0, 0)x(1, 0, 1, 12)12 - AIC:384.7995950669331ARIMA(0, 0, 0)x(1, 1, 0, 12)12 - AIC:262.1644595966525ARIMA(0, 0, 1)x(0, 0, 0, 12)12 - AIC:597.654605697939ARIMA(0, 0, 1)x(0, 0, 1, 12)12 - AIC:nanARIMA(0, 0, 1)x(0, 1, 0, 12)12 - AIC:376.3251637580207/usr/local/lib/python3.6/dist-packages/statsmodels/base/model.py:512: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals "Check mle_retvals", ConvergenceWarning)/usr/local/lib/python3.6/dist-packages/statsmodels/base/model.py:512: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals "Check mle_retvals", ConvergenceWarning)ARIMA(0, 0, 1)x(1, 0, 0, 12)12 - AIC:405.0472142226208ARIMA(0, 0, 1)x(1, 0, 1, 12)12 - AIC:369.16844047362355ARIMA(0, 0, 1)x(1, 1, 0, 12)12 - AIC:263.3626446818747ARIMA(0, 1, 0)x(0, 0, 0, 12)12 - AIC:528.1869579691191ARIMA(0, 1, 0)x(0, 0, 1, 12)12 - AIC:1454.5916794067966ARIMA(0, 1, 0)x(0, 1, 0, 12)12 - AIC:402.41176085757166ARIMA(0, 1, 0)x(1, 0, 0, 12)12 - AIC:387.20858105018107/usr/local/lib/python3.6/dist-packages/statsmodels/base/model.py:512: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals "Check mle_retvals", ConvergenceWarning)/usr/local/lib/python3.6/dist-packages/statsmodels/base/model.py:512: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals "Check mle_retvals", ConvergenceWarning)ARIMA(0, 1, 0)x(1, 0, 1, 12)12 - AIC:1394.9326629156938ARIMA(0, 1, 0)x(1, 1, 0, 12)12 - AIC:270.9438065103488ARIMA(0, 1, 1)x(0, 0, 0, 12)12 - AIC:467.6414267046183ARIMA(0, 1, 1)x(0, 0, 1, 12)12 - AIC:2609.961816454136ARIMA(0, 1, 1)x(0, 1, 0, 12)12 - AIC:370.0243667754068ARIMA(0, 1, 1)x(1, 0, 0, 12)12 - AIC:368.6117762611625ARIMA(0, 1, 1)x(1, 0, 1, 12)12 - AIC:2849.340499469712ARIMA(0, 1, 1)x(1, 1, 0, 12)12 - AIC:257.37284772726474ARIMA(1, 0, 0)x(0, 0, 0, 12)12 - AIC:540.6772379709616/usr/local/lib/python3.6/dist-packages/statsmodels/base/model.py:512: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals "Check mle_retvals", ConvergenceWarning)ARIMA(1, 0, 0)x(0, 0, 1, 12)12 - AIC:1536.2229591079033ARIMA(1, 0, 0)x(0, 1, 0, 12)12 - AIC:389.2433793870417ARIMA(1, 0, 0)x(1, 0, 0, 12)12 - AIC:387.316239028495ARIMA(1, 0, 0)x(1, 0, 1, 12)12 - AIC:387.4871336951148ARIMA(1, 0, 0)x(1, 1, 0, 12)12 - AIC:253.0789709391584ARIMA(1, 0, 1)x(0, 0, 0, 12)12 - AIC:496.90504904914684ARIMA(1, 0, 1)x(0, 0, 1, 12)12 - AIC:2841.1869647833105ARIMA(1, 0, 1)x(0, 1, 0, 12)12 - AIC:374.4935542234095ARIMA(1, 0, 1)x(1, 0, 0, 12)12 - AIC:371.8863361971654ARIMA(1, 0, 1)x(1, 0, 1, 12)12 - AIC:359.29908312075713ARIMA(1, 0, 1)x(1, 1, 0, 12)12 - AIC:254.99974976566307ARIMA(1, 1, 0)x(0, 0, 0, 12)12 - AIC:508.22613161672086/usr/local/lib/python3.6/dist-packages/statsmodels/base/model.py:512: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals "Check mle_retvals", ConvergenceWarning)ARIMA(1, 1, 0)x(0, 0, 1, 12)12 - AIC:1463.290184247775ARIMA(1, 1, 0)x(0, 1, 0, 12)12 - AIC:397.02186857072024ARIMA(1, 1, 0)x(1, 0, 0, 12)12 - AIC:362.307716753627/usr/local/lib/python3.6/dist-packages/statsmodels/base/model.py:512: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals "Check mle_retvals", ConvergenceWarning)ARIMA(1, 1, 0)x(1, 0, 1, 12)12 - AIC:1470.1783831793764ARIMA(1, 1, 0)x(1, 1, 0, 12)12 - AIC:245.87074565604823ARIMA(1, 1, 1)x(0, 0, 0, 12)12 - AIC:468.8192870346768ARIMA(1, 1, 1)x(0, 0, 1, 12)12 - AIC:3213.368937063234ARIMA(1, 1, 1)x(0, 1, 0, 12)12 - AIC:372.0243404313211ARIMA(1, 1, 1)x(1, 0, 0, 12)12 - AIC:357.38039336226905ARIMA(1, 1, 1)x(1, 0, 1, 12)12 - AIC:1263.3417792136752ARIMA(1, 1, 1)x(1, 1, 0, 12)12 - AIC:243.34456553221048/usr/local/lib/python3.6/dist-packages/statsmodels/base/model.py:512: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals "Check mle_retvals", ConvergenceWarning)`

The output of our code suggests that SARIMAX(1, 1, 1)x(1, 1, 0, 12) yields the lowest AIC value of 243.34. We should therefore consider this to be optimal option out of all the models we have considered.

• Fitting ARIMA Model

Using grid search, we have identified the set of parameters that produces the best fitting model to our time series data. We can proceed to analyze this particular model in more depth.

We’ll start by plugging the optimal parameter values into a new SARIMAX model:

`mod = sm.tsa.statespace.SARIMAX(Y, order=(1, 1, 1), seasonal_order=(1, 1, 0, 12),enforce_stationarity=False, enforce_invertibility=False)results = mod.fit()print(results.summary().tables)`

Output

`===================================================================== coef std err z P>|z| [0.025 0.975]---------------------------------------------------------------------------------------------------------------ar.L1 -0.0754 0.251 -0.300 0.764 -0.567 0.417ma.L1 -1.0000 1665.246 -0.001 1.000 -3264.823 3262.823ar.S.L12 -0.5256 0.180 -2.919 0.004 -0.878 -0.173sigma2 2305.2202 3.84e+06 0.001 1.000 -7.52e+06 7.53e+06=====================================================================`

The summary attribute that results from the output of SARIMAX returns a significant amount of information, but we’ll focus our attention on the table of coefficients. The coef column shows the weight (i.e. importance) of each feature and how each one impacts the time series. The P>|z| column informs us of the significance of each feature weight. Here, each weight has a p-value lower or close to 0.05, so it is reasonable to retain all of them in our model.

When fitting seasonal ARIMA models (and any other models for that matter), it is important to run model diagnostics to ensure that none of the assumptions made by the model have been violated. The plot_diagnostics object allows us to quickly generate model diagnostics and investigate for any unusual behavior.

`results.plot_diagnostics(figsize=(16, 8))plt.show()`

Output Our primary concern is to ensure that the residuals of our model are uncorrelated and normally distributed with zero-mean. If the seasonal ARIMA model does not satisfy these properties, it is a good indication that it can be further improved.

In this case, our model diagnostics suggests that the model residuals are normally distributed based on the following:

• In the top right plot, we see that the red KDE line follows closely with the N(0,1) line (where N(0,1)) is the standard notation for a normal distribution with mean 0 and standard deviation of 1). This is a good indication that the residuals are normally distributed.
• The qq-plot on the bottom left shows that the ordered distribution of residuals (blue dots) follows the linear trend of the samples taken from a standard normal distribution with N(0, 1). Again, this is a strong indication that the residuals are normally distributed.
• The residuals over time (top left plot) don’t display any obvious seasonality and appear to be white noise. This is confirmed by the autocorrelation (i.e. correlogram) plot on the bottom right, which shows that the time series residuals have low correlation with lagged versions of itself.

Those observations lead us to conclude that our model produces a satisfactory fit that could help us understand our time series data and forecast future values.

Note: Although we have a satisfactory fit, some parameters of our seasonal ARIMA model could be changed to improve our model fit. For example, our grid search only considered a restricted set of parameter combinations, so we might find better models if we widen the grid search.

• Forecasting

• Validating Forecasting

To help us understand the accuracy of our forecasts, we compare predicted sales to real sales of the time series, and we set forecasts to start at 2017–01–01 to the end of the data.

`pred = results.get_prediction(start=pd.to_datetime('2017-01-31'), dynamic=False)pred_ci = pred.conf_int()ax = Y['2014':].plot(label='observed')pred.predicted_mean.plot(ax=ax, label='One-step ahead Forecast', alpha=.7, figsize=(14, 7))ax.fill_between(pred_ci.index,                pred_ci.iloc[:, 0],                pred_ci.iloc[:, 1], color='k', alpha=.2)ax.set_xlabel('Date')ax.set_ylabel('Sales')plt.legend()plt.show()`

Output The line plot is showing the observed values compared to the rolling forecast predictions. Overall, our forecasts align with the true values very well, showing a downward trend starts from the beginning of the year and captured the seasonality toward the end of the year.

It is also useful to quantify the accuracy of our forecasts. We will use the (Mean Squared Error), which summarizes the average error of our forecasts. For each predicted value, we compute its distance to the true value and square the result. The results need to be squared so that positive/negative differences do not cancel each other out when we compute the overall mean.

`y_forecasted = pred.predicted_meany_truth = Y['2017-01-01':]mse = ((y_forecasted - y_truth) ** 2).mean()print('The Mean Squared Error of our forecasts is {}'.format(round(mse, 2)))`

Output

`The Mean Squared Error of our forecasts is 2240.32`

`print('The Root Mean Squared Error of our forecasts is {}'.format(round(np.sqrt(mse), 2)))`

Output

`The Root Mean Squared Error of our forecasts is 47.33`

In statistics, the mean square error(MSE) of an estimator measures the average of the squares of the errors — that is, the average squared difference between the estimated values and what is estimated. The MSE is a measure of the quality of an estimator — it is always non-negative, and the smaller the MSE, the closer we are to finding the line of best fit.

Root mean square error (RMSE) tells us that our model was able to forecast the average daily sales in the test set within 47.33 of the real sales. This is a pretty good model so far.

• Producing and Visualizing Forecast

In the final step, we describe how to leverage our seasonal ARIMA time series model to forecast future values. The get_forecast() attribute of our time series object can compute forecasted values for a specified number of steps ahead.

`pred_uc = results.get_forecast(steps=12)pred_ci = pred_uc.conf_int()ax = Y.plot(label='observed', figsize=(14, 7))pred_uc.predicted_mean.plot(ax=ax, label='Forecast')ax.fill_between(pred_ci.index,                pred_ci.iloc[:, 0],                pred_ci.iloc[:, 1], color='k', alpha=.25)ax.set_xlabel('Date')ax.set_ylabel('Furniture Sales')plt.legend()plt.show()`

Output Both the forecasts and associated confidence interval that we have generated can now be used to further understand the time series and foresee what to expect. Our forecasts show that the time series is expected to continue increasing at a steady pace.

As we forecast further out into the future, it is natural for us to become less confident in our values. This is reflected by the confidence intervals generated by our model, which grow larger as we move further out into the future.

• Prophet Model Fitting and Forecasting Released by Facebook in 2017, forecasting tool Prophet is designed for analyzing time-series that display patterns on different time scales such as yearly, weekly and daily. It also has advanced capabilities for modeling the effects of holidays on a time-series and implementing custom change points. Therefore, we are using Prophet to get a model up and running.

“Prophet has been a key piece to improving Facebook’s ability to create a large number of trustworthy forecasts used for decision-making and even in product features.”

It is a decomposable time series model with three main model components: trend, seasonality, and holidays. They are combined in the following equation: • g(t): piecewise linear or logistic growth curve for modeling non-periodic changes in time series
• s(t): periodic changes (e.g. weekly/yearly seasonality)
• h(t): effects of holidays (user provided) with irregular schedules
• εt: error term accounts for any unusual changes not accommodated by the model

Trend: Trend is smooth ,regular, long term movement of a statistical series: frequent changes either in absolute amount or in rates of increase or decrease are quite inconsistent with the idea of secular trend i.e there is no periodicity.

Seasonal Variation: A periodic movement is one which recurs with some degree of regularity within a definite time period.

Holiday: Holidays or events incur predictability of a time series. As for example in New year the sale of clothes remain quite high as compared to other regular days and same for Diwali (in India) or other occasions.

• Installing Packages and Libraries
`import warningsimport numpy as npimport pandas as pdimport matplotlib.pyplot as pltwarnings.filterwarnings("ignore")%matplotlib inline`

`from fbprophet import Prophetfrom sklearn.metrics import mean_squared_error, mean_absolute_errorplt.rcParams['figure.figsize']=(15,10)`

• Preparing Dataset for Prophet

`df = df.rename(columns={'Order Date': 'ds', 'Sales': 'y'})df_model = Prophet(interval_width=0.95)df_model.fit(df)`

Output

`INFO:numexpr.utils:NumExpr defaulting to 2 threads.INFO:fbprophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.<fbprophet.forecaster.Prophet at 0x7f0ebc07c940>`

`df.set_index('ds').y.plot()`

Output Here, we plot the sales data with respect to selling date. And see that there are low bars in the starting of year and ending of year and couple of big bars in the mid year.

• Forecast using Prophet

`model = Prophet()model.fit(df)`

Output

`INFO:fbprophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.`
`future = model.make_future_dataframe(periods=12, freq = 'm')future.tail()`

Output

` ds10002 2018-07-3110003     2018-08-3110004     2018-09-3010005     2018-10-3110006     2018-11-30`

`forecast = model.predict(future)forecast.tail()`

Output `forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()`

Output

` ds     yhat     yhat_lower     yhat_upper1244    2018-08-01    1108.296244    -1751.105682    3839.3630771245    2018-09-01    3101.296688    461.439189     5855.1260011246    2018-10-01    3369.973509    610.175122     6172.8572571247    2018-11-01    3502.864480    679.938129     6185.6424951248    2018-12-01    3674.377755    783.551508     6571.994512`
• Result of Forecast

`model.plot(forecast)`

Output From the diagram, we can see the forecasting sales for the next year. And we can see that the rate of the sales is increasing with respect to years.

Now break the model in different components;

Trend , Seasonal Variation , Holidays ;

`model.plot_components(forecast)`

Outout Good to see that the sales have been increasing over time and will be keep growing year after year. The worst day for sale is Wednesday (should be a holiday), The best day for sale is Monday and Friday. The worst month for sale is February, The best month for sale is November and September.

`metric_df = forecast.set_index('ds')[['yhat']].join(df.set_index('ds').y).reset_index()metric_df.tail()`

Output

` ds     yhat      y1244    2018-08-01    1108.296244    NaN1245    2018-09-01    3101.296688    NaN1246    2018-10-01    3369.973509    NaN1247    2018-11-01    3502.864480    NaN1248    2018-12-01    3674.377755    NaN`

`metric_df.dropna(inplace=True)metric_df.tail()`

Output

`     ds     yhat     y1232    2017-12-26    2636.565655    814.59401233    2017-12-27    1670.346485    177.63601234    2017-12-28    2764.198983    1657.35081235    2017-12-29    2922.492564    2915.53401236    2017-12-30    2416.579612    713.7900`

`mean_squared_error(metric_df.y, metric_df.yhat)`

Output

`4564396.647441651`

`mean_absolute_error(metric_df.y, metric_df.yhat)`

Output

`1403.109843097731`

Therefore, RMSE = Square root (MSE) = 2136.445

Root mean square error (RMSE) tells us that our model was able to forecast the average daily sales in the test set within 2136.445 of the real sales. This is a quite good model so far.

## Conclusion

So here we are to conclude that we focused on the application of different time series models in this study, learned how to use them with the objective to forecast new sale values. There are many time-series analysis we can explore from now on, such as forecast with uncertainty bounds, change point and anomaly detection, forecast time-series with external data source. With respect to accuracy level, we can conclude that ARIMA performed very well. Prophet is one of the new forecasting approach, it may need more tuning for getting more accuracy.

Project by
Siddhartha Saha

Msc Statistics Kalyani University

Mentored by

Souvik Manna

ISI Bangalore

7+ ### Mathematica-City

Mathematica-city is an online Education forum for Science students run by Kounteyo, Shreyansh and Souvik. We aim to provide articles related to Actuarial Science, Data Science, Statistics, Mathematics and their applications using different Statistical Software. Feel free to reach out to us for any kind of discussion on any of the related topics,