• Introduction


  • Dataset and Visualizations


  • ARIMA (Autoregressive Integrating Moving Average) Model and Forecasting


  • Prophet (Fbprophet) Model and Forecasting


  • Conclusion


  • What is time series?


    A set of ordered observations of a quantitative variable taken at successive points in time (usually evenly spaced, like once a day) is known as Time Series. Time, in terms of years, months, days, or hours is simply a device that enables one to relate all phenomenon to a set of common, stable reference points.

    For example, the data of airline ticket sales per day is a time series.


  • Components of a Time Series:


  • Trend: The trend shows a general direction of the time series data over a long period of time. A trend can be increasing(upward), decreasing(downward), or horizontal(stationary).


  • Seasonality: The seasonality component exhibits a trend that repeats with respect to timing, direction, and magnitude. Some examples include an increase in water consumption in summer due to hot weather conditions, or an increase in the number of airline passengers during holidays each year.


  • Cyclical Component:  These are the trends with no set repetition over a particular period of time. A cycle refers to the period of ups and downs, booms and slums of a time series, mostly observed in business cycles. These cycles do not exhibit a seasonal variation but generally occur over a time period of 3 to 12 years depending on the nature of the time series.


  • Irregular Variation:  These are the fluctuations in the time series data which become evident when trend and cyclical variations are removed. These variations are unpredictable, erratic, and may or may not be random.


  • Importance of time series:


    • Analysis of causes and conditions prevailing during occurrence of past changes, one can easily determine the future policies and programs.


    • Estimation of future trends on the basis of analysis or past trends.
    • Trends of trade cycles are studied and their effect can be reduced to a considerable extent.
    • Comparative study with the other time series.


  • Forecasting:


    Time series analysis comprises methods for analyzing the time series data in order to extract the meaningful statistics and the other important characteristics of the data set. And in time series analysis, Forecasting algorithm is an information process that seeks to predict the future values based on past and present data and most commonly by analysis of trends. The historical data points are extracted and prepared trying to predict future values for a selected variable of the data set.


In this blog I am analyzing a univariate time series model.

  • So, what is univariate time series?

Univariate time series consists of single observation recorded sequentially over equal time increments. In this project there is only one variable that is sales counts which is recorded over a certain period of time.

Let’s get started!



  • Dataset and Visualizations


    Here, we are working with Superstore Sales Data set. Based on the given data set we have to predict the total sales for every product and store in the next month. The data set provides; ‘Row Id’ , ‘Order ID’ , ‘Order Date’ , ‘Ship Date’ , ‘Ship Mode’ , ‘Customer ID’ , ‘Customer Name’ , ‘Segment’ , ‘Country’ , ‘City’ , ‘State’ , ‘Postal Code’ , ‘Region’ , ‘Product ID’ , ‘Category’ , ‘Sub-category’ , ‘Product Name’ , ‘Sales’ , ‘Quantity’ , ‘Discount’ and ‘Profit’.


Import the Dataset and Packages


Here we are working with a large data set. Hence, we can go through the following process to import:


from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
import matplotlib
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

(Download the data)

downloaded = drive.CreateFile({'id':'1v_ibPwoNak9iD6TDmVfUgC9or-mMpOsP'})
downloaded.GetContentFile('Sample - Superstore (2).xls')
df=pd.read_excel('Sample - Superstore (2).xls')



Hurrah! My Data set imported successfully.


  • Data Processing


    This step includes removing columns we do not need (unnecessary data), and checking is any missing values are present or not, aggregate sales by date and so on.

    cols = ['Row ID', 'Order ID', 'Ship Date', 'Ship Mode', 'Customer ID', 'Customer Name', 'Segment', 'Country', 'City', 'State', 'Postal Code', 'Region', 'Product ID', 'Category', 'Sub-Category', 'Product Name', 'Quantity', 'Discount', 'Profit']df.drop(cols, axis=1, inplace=True)


    Order Date    Sales
0    2016-11-08    261.9600
1    2016-11-08    731.9400
2    2016-06-12    14.6200
3    2015-10-11    957.5775
4    2015-10-11    22.3680
df=df.sort_values('Order Date')


Order Date 0
Sales 0
dtype: int64


print("start date is:",df['Order Date'].min())
print("End date is:",df['Order Date'].max())



start date is: 2014-01-03 00:00:00
End date is: 2017-12-30 00:00:00
  • Indexing with Time series data:


    As the given data set is in DataFrame format, we cannot work with this type of the data. We have to convert the dataset to time series format for further use.

    #Make the date an index
    df = df.set_index('Order Date')


    DatetimeIndex(['2014-01-03', '2014-01-04', '2014-01-04', '2014-01-04',
                  '2014-01-05', '2014-01-06', '2014-01-06', '2014-01-06',
                  '2014-01-06', '2014-01-06',
                  '2017-12-29', '2017-12-29', '2017-12-29', '2017-12-30',
                  '2017-12-30', '2017-12-30', '2017-12-30', '2017-12-30',
                  '2017-12-30', '2017-12-30'],
     dtype='datetime64[ns]', name='Order Date', length=9994, freq=None)




Order Date
2014-01-03    16.448
2014-01-04    11.784
2014-01-04    272.736
2014-01-04    3.540
2014-01-05    19.536

Here, our dataset is suitable for using Time series analysis.


  • Re-sampling


    Re-sampling is the process of drawing repeated samples from the original dataset. The intuition behind re-sampling methods is that it creates “similar” cases for our data classes in order to render the data representative of the population we wish to investigate, and therefore feed the algorithm enough data to output more accurate results (when the data is not enough).


    Re-sampling methods are very easy to use, requiring little mathematical knowledge. They are methods that are easy to understand and implement compared to specialized statistical methods that may require deep technical skill in order to select and interpret.




  • Visualizing the Dataset





    Here, we have plotted no of sales against each month of the year. Some distinguishable patterns appear when we plot the data. Sales are low at the beginning of the year and high at the end of the year. And it changes reversely in the next year. Every two years interval pattern looks quite similar. Couple of high months present in the mid of the year.


    ts=df.groupby(['Order Date'])['Sales'].mean()



    plt.title('Mean Sales of the company')






Here, we plot the mean sales of every month. And we can see that there are small bars at the starting and ending of the year and couple of high bars at the middle year.


We can also visualize our data using a method called time-series decomposition that allows us to decompose our time series into three distinct components:

Trend, Seasonality and Noise.

from pylab import rcParams

rcParams['figure.figsize'] = 188

decomposition = sm.tsa.seasonal_decompose(Y, model='additive')

fig = decomposition.plot()





The plot above clearly shows that the sales is unstable, along with its obvious seasonality.




  • ARIMA Model Fitting and Forecasting



    One of the most common methods used in time series forecasting is known as the ARIMA model, which stands for AutoRegressive Integrated Moving Average. ARIMA is a model that can be fitted to time series data in order to better understand or predict future points in the series.

    There are three distinct integers (pdq) that are used to parameterize ARIMA models. Because of that, ARIMA models are denoted with the notation ARIMA(p, d, q). Together these three parameters account for seasonality, trend, and noise in datasets:

    • p is the auto-regressive part of the model. It allows us to incorporate the effect of past values into our model. Intuitively, this would be similar to stating that it is likely to be warm tomorrow if it has been warm the past 3 days.
    • d is the integrated part of the model. This includes terms in the model that incorporate the amount of differencing (i.e. the number of past time points to subtract from the current value) to apply to the time series. Intuitively, this would be similar to stating that it is likely to be same temperature tomorrow if the difference in temperature in the last three days has been very small.
    • q is the moving average part of the model. This allows us to set the error of our model as a linear combination of the error values observed at previous time points in the past.
    • Checking Stationary of the Data


      • What is Stationary Time Series:


        In simple words, a stationary time series is one, whose statistical properties i.e. mean and variance does not depend on time.


    For checking stationary, we perform ADF Test .

  • ADF Test: Augmented Dickey Fuller test (ADF Test) is a common statistical test used to test whether a given time series is stationary or not. It is one of the most commonly used statistical test when it comes to analyzing the stationary of the series.


    It is fundamentally a statistical significance test. That means, there is a hypothesis testing involved with a null and alternative hypothesis and as a result a test statistic is computed and p-value get reported. It is from the test statistic and the p-value, we can make an inference as to whether the given series is stationary or not.


## Checking for Stationary of the given series

from statsmodels.tsa.stattools import adfuller

def test_stationarity(series,mlag = 365, lag = None,):

    print('ADF Test Result')

    res = adfuller(series, maxlag = mlag, autolag = lag)

    output = pd.Series(res[0:4],index = ['Test Statistic', 'p value', 'used lag', 'Number of observations used'])

    for key, value in res[4].items():

        output['Critical Value ' + key] = value



ADF Test Result

Test Statistic -5.172130

p value 0.000010

used lag 365.000000

Number of observations used 9628.000000

Critical Value 1% -3.431029

Critical Value 5% -2.861840

Critical Value 10% -2.566930

dtype: float64

Therefore based on the ADF Test result we can conclude that the series is stationary and useful for applying ARIMA Method.




  • Construct ARIMA Model
import itertools



seasonal_pdq=[(x[0],x[1],x[2],12)for x in list(itertools.product(p,d,q))]

print("Example of parameter combinations for Seasonal Arima")






Example of parameter combinations for Seasonal Arima

Sarimax:(0, 0, 1)X(0, 0, 1, 12)

Sarimax:(0, 1, 0)X(0, 1, 0, 12)

Sarimax:(0, 1, 1)X(0, 1, 1, 12)

Sarimax:(1, 0, 0)X(1, 0, 0, 12)

We can now use the triplets of parameters defined above to automate the process of training and evaluating ARIMA models on different combinations. In Statistics and Machine Learning, this process is known as grid search (or hyper parameter optimization) for model selection.

When evaluating and comparing statistical models fitted with different parameters, each can be ranked against one another based on how well it fits the data or its ability to accurately predict future data points. We will use the AIC (Akaike Information Criterion) value, which is conveniently returned with ARIMA models fitted using statsmodels. The AIC measures how well a model fits the data while taking into account the overall complexity of the model. A model that fits the data very well while using lots of features will be assigned a larger AIC score than a model that uses fewer features to achieve the same goodness-of-fit. Therefore, we are interested in finding the model that yields the lowest AIC value.


for param in pdq:

    for param_seasonal in seasonal_pdq:


            mod = sm.tsa.statespace.SARIMAX(Y,





            results = mod.fit()

            print('ARIMA{}x{}12 - AIC:{}'.format(param, param_seasonal, results.aic))





ARIMA(0, 0, 0)x(0, 0, 0, 12)12 - AIC:648.1839075951442

ARIMA(0, 0, 0)x(0, 0, 1, 12)12 - AIC:1573.9564723209578

ARIMA(0, 0, 0)x(0, 1, 0, 12)12 - AIC:387.257330070831

ARIMA(0, 0, 0)x(1, 0, 0, 12)12 - AIC:402.97856214610994

/usr/local/lib/python3.6/dist-packages/statsmodels/base/model.py:512: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals

"Check mle_retvals", ConvergenceWarning)

ARIMA(0, 0, 0)x(1, 0, 1, 12)12 - AIC:384.7995950669331

ARIMA(0, 0, 0)x(1, 1, 0, 12)12 - AIC:262.1644595966525

ARIMA(0, 0, 1)x(0, 0, 0, 12)12 - AIC:597.654605697939

ARIMA(0, 0, 1)x(0, 0, 1, 12)12 - AIC:nan

ARIMA(0, 0, 1)x(0, 1, 0, 12)12 - AIC:376.3251637580207

/usr/local/lib/python3.6/dist-packages/statsmodels/base/model.py:512: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals

"Check mle_retvals", ConvergenceWarning)

/usr/local/lib/python3.6/dist-packages/statsmodels/base/model.py:512: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals

"Check mle_retvals", ConvergenceWarning)

ARIMA(0, 0, 1)x(1, 0, 0, 12)12 - AIC:405.0472142226208

ARIMA(0, 0, 1)x(1, 0, 1, 12)12 - AIC:369.16844047362355

ARIMA(0, 0, 1)x(1, 1, 0, 12)12 - AIC:263.3626446818747

ARIMA(0, 1, 0)x(0, 0, 0, 12)12 - AIC:528.1869579691191

ARIMA(0, 1, 0)x(0, 0, 1, 12)12 - AIC:1454.5916794067966

ARIMA(0, 1, 0)x(0, 1, 0, 12)12 - AIC:402.41176085757166

ARIMA(0, 1, 0)x(1, 0, 0, 12)12 - AIC:387.20858105018107

/usr/local/lib/python3.6/dist-packages/statsmodels/base/model.py:512: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals

"Check mle_retvals", ConvergenceWarning)

/usr/local/lib/python3.6/dist-packages/statsmodels/base/model.py:512: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals

"Check mle_retvals", ConvergenceWarning)

ARIMA(0, 1, 0)x(1, 0, 1, 12)12 - AIC:1394.9326629156938

ARIMA(0, 1, 0)x(1, 1, 0, 12)12 - AIC:270.9438065103488

ARIMA(0, 1, 1)x(0, 0, 0, 12)12 - AIC:467.6414267046183

ARIMA(0, 1, 1)x(0, 0, 1, 12)12 - AIC:2609.961816454136

ARIMA(0, 1, 1)x(0, 1, 0, 12)12 - AIC:370.0243667754068

ARIMA(0, 1, 1)x(1, 0, 0, 12)12 - AIC:368.6117762611625

ARIMA(0, 1, 1)x(1, 0, 1, 12)12 - AIC:2849.340499469712

ARIMA(0, 1, 1)x(1, 1, 0, 12)12 - AIC:257.37284772726474

ARIMA(1, 0, 0)x(0, 0, 0, 12)12 - AIC:540.6772379709616

/usr/local/lib/python3.6/dist-packages/statsmodels/base/model.py:512: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals

"Check mle_retvals", ConvergenceWarning)

ARIMA(1, 0, 0)x(0, 0, 1, 12)12 - AIC:1536.2229591079033

ARIMA(1, 0, 0)x(0, 1, 0, 12)12 - AIC:389.2433793870417

ARIMA(1, 0, 0)x(1, 0, 0, 12)12 - AIC:387.316239028495

ARIMA(1, 0, 0)x(1, 0, 1, 12)12 - AIC:387.4871336951148

ARIMA(1, 0, 0)x(1, 1, 0, 12)12 - AIC:253.0789709391584

ARIMA(1, 0, 1)x(0, 0, 0, 12)12 - AIC:496.90504904914684

ARIMA(1, 0, 1)x(0, 0, 1, 12)12 - AIC:2841.1869647833105

ARIMA(1, 0, 1)x(0, 1, 0, 12)12 - AIC:374.4935542234095

ARIMA(1, 0, 1)x(1, 0, 0, 12)12 - AIC:371.8863361971654

ARIMA(1, 0, 1)x(1, 0, 1, 12)12 - AIC:359.29908312075713

ARIMA(1, 0, 1)x(1, 1, 0, 12)12 - AIC:254.99974976566307

ARIMA(1, 1, 0)x(0, 0, 0, 12)12 - AIC:508.22613161672086

/usr/local/lib/python3.6/dist-packages/statsmodels/base/model.py:512: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals

"Check mle_retvals", ConvergenceWarning)

ARIMA(1, 1, 0)x(0, 0, 1, 12)12 - AIC:1463.290184247775

ARIMA(1, 1, 0)x(0, 1, 0, 12)12 - AIC:397.02186857072024

ARIMA(1, 1, 0)x(1, 0, 0, 12)12 - AIC:362.307716753627

/usr/local/lib/python3.6/dist-packages/statsmodels/base/model.py:512: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals

"Check mle_retvals", ConvergenceWarning)

ARIMA(1, 1, 0)x(1, 0, 1, 12)12 - AIC:1470.1783831793764

ARIMA(1, 1, 0)x(1, 1, 0, 12)12 - AIC:245.87074565604823

ARIMA(1, 1, 1)x(0, 0, 0, 12)12 - AIC:468.8192870346768

ARIMA(1, 1, 1)x(0, 0, 1, 12)12 - AIC:3213.368937063234

ARIMA(1, 1, 1)x(0, 1, 0, 12)12 - AIC:372.0243404313211

ARIMA(1, 1, 1)x(1, 0, 0, 12)12 - AIC:357.38039336226905

ARIMA(1, 1, 1)x(1, 0, 1, 12)12 - AIC:1263.3417792136752

ARIMA(1, 1, 1)x(1, 1, 0, 12)12 - AIC:243.34456553221048

/usr/local/lib/python3.6/dist-packages/statsmodels/base/model.py:512: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals

"Check mle_retvals", ConvergenceWarning)


The output of our code suggests that SARIMAX(1, 1, 1)x(1, 1, 0, 12) yields the lowest AIC value of 243.34. We should therefore consider this to be optimal option out of all the models we have considered.


  • Fitting ARIMA Model


    Using grid search, we have identified the set of parameters that produces the best fitting model to our time series data. We can proceed to analyze this particular model in more depth.

    We’ll start by plugging the optimal parameter values into a new SARIMAX model:

    mod = sm.tsa.statespace.SARIMAX(Y, order=(111), seasonal_order=(11012),enforce_stationarity=False, enforce_invertibility=False)

    results = mod.fit()





coef std err z P>|z| [0.025 0.975]


ar.L1 -0.0754 0.251 -0.300 0.764 -0.567 0.417

ma.L1 -1.0000 1665.246 -0.001 1.000 -3264.823 3262.823

ar.S.L12 -0.5256 0.180 -2.919 0.004 -0.878 -0.173

sigma2 2305.2202 3.84e+06 0.001 1.000 -7.52e+06 7.53e+06



The summary attribute that results from the output of SARIMAX returns a significant amount of information, but we’ll focus our attention on the table of coefficients. The coef column shows the weight (i.e. importance) of each feature and how each one impacts the time series. The P>|z| column informs us of the significance of each feature weight. Here, each weight has a p-value lower or close to 0.05, so it is reasonable to retain all of them in our model.

When fitting seasonal ARIMA models (and any other models for that matter), it is important to run model diagnostics to ensure that none of the assumptions made by the model have been violated. The plot_diagnostics object allows us to quickly generate model diagnostics and investigate for any unusual behavior.







Our primary concern is to ensure that the residuals of our model are uncorrelated and normally distributed with zero-mean. If the seasonal ARIMA model does not satisfy these properties, it is a good indication that it can be further improved.

In this case, our model diagnostics suggests that the model residuals are normally distributed based on the following:

  • In the top right plot, we see that the red KDE line follows closely with the N(0,1) line (where N(0,1)) is the standard notation for a normal distribution with mean 0 and standard deviation of 1). This is a good indication that the residuals are normally distributed.
  • The qq-plot on the bottom left shows that the ordered distribution of residuals (blue dots) follows the linear trend of the samples taken from a standard normal distribution with N(0, 1). Again, this is a strong indication that the residuals are normally distributed.
  • The residuals over time (top left plot) don’t display any obvious seasonality and appear to be white noise. This is confirmed by the autocorrelation (i.e. correlogram) plot on the bottom right, which shows that the time series residuals have low correlation with lagged versions of itself.


Those observations lead us to conclude that our model produces a satisfactory fit that could help us understand our time series data and forecast future values.

Note: Although we have a satisfactory fit, some parameters of our seasonal ARIMA model could be changed to improve our model fit. For example, our grid search only considered a restricted set of parameter combinations, so we might find better models if we widen the grid search.


  • Forecasting


    • Validating Forecasting


      To help us understand the accuracy of our forecasts, we compare predicted sales to real sales of the time series, and we set forecasts to start at 2017–01–01 to the end of the data.


      pred = results.get_prediction(start=pd.to_datetime('2017-01-31'), dynamic=False)

      pred_ci = pred.conf_int()

      ax = Y['2014':].plot(label='observed')

      pred.predicted_mean.plot(ax=ax, label='One-step ahead Forecast', alpha=.7, figsize=(147))


                      pred_ci.iloc[:, 0],

                      pred_ci.iloc[:, 1], color='k', alpha=.2)








The line plot is showing the observed values compared to the rolling forecast predictions. Overall, our forecasts align with the true values very well, showing a downward trend starts from the beginning of the year and captured the seasonality toward the end of the year.

It is also useful to quantify the accuracy of our forecasts. We will use the (Mean Squared Error), which summarizes the average error of our forecasts. For each predicted value, we compute its distance to the true value and square the result. The results need to be squared so that positive/negative differences do not cancel each other out when we compute the overall mean.

y_forecasted = pred.predicted_mean

y_truth = Y['2017-01-01':]

mse = ((y_forecasted - y_truth) ** 2).mean()

print('The Mean Squared Error of our forecasts is {}'.format(round(mse, 2)))



The Mean Squared Error of our forecasts is 2240.32


print('The Root Mean Squared Error of our forecasts is {}'.format(round(np.sqrt(mse), 2)))


The Root Mean Squared Error of our forecasts is 47.33

In statistics, the mean square error(MSE) of an estimator measures the average of the squares of the errors — that is, the average squared difference between the estimated values and what is estimated. The MSE is a measure of the quality of an estimator — it is always non-negative, and the smaller the MSE, the closer we are to finding the line of best fit.

Root mean square error (RMSE) tells us that our model was able to forecast the average daily sales in the test set within 47.33 of the real sales. This is a pretty good model so far.

  • Producing and Visualizing Forecast

In the final step, we describe how to leverage our seasonal ARIMA time series model to forecast future values. The get_forecast() attribute of our time series object can compute forecasted values for a specified number of steps ahead.

pred_uc = results.get_forecast(steps=12)

pred_ci = pred_uc.conf_int()

ax = Y.plot(label='observed', figsize=(147))

pred_uc.predicted_mean.plot(ax=ax, label='Forecast')


                pred_ci.iloc[:, 0],

                pred_ci.iloc[:, 1], color='k', alpha=.25)


ax.set_ylabel('Furniture Sales')





Both the forecasts and associated confidence interval that we have generated can now be used to further understand the time series and foresee what to expect. Our forecasts show that the time series is expected to continue increasing at a steady pace.

As we forecast further out into the future, it is natural for us to become less confident in our values. This is reflected by the confidence intervals generated by our model, which grow larger as we move further out into the future.



  • Prophet Model Fitting and Forecasting



    Released by Facebook in 2017, forecasting tool Prophet is designed for analyzing time-series that display patterns on different time scales such as yearly, weekly and daily. It also has advanced capabilities for modeling the effects of holidays on a time-series and implementing custom change points. Therefore, we are using Prophet to get a model up and running.

    “Prophet has been a key piece to improving Facebook’s ability to create a large number of trustworthy forecasts used for decision-making and even in product features.”

    It is a decomposable time series model with three main model components: trend, seasonality, and holidays. They are combined in the following equation:

    • g(t): piecewise linear or logistic growth curve for modeling non-periodic changes in time series
    • s(t): periodic changes (e.g. weekly/yearly seasonality)
    • h(t): effects of holidays (user provided) with irregular schedules
    • εt: error term accounts for any unusual changes not accommodated by the model


    Trend: Trend is smooth ,regular, long term movement of a statistical series: frequent changes either in absolute amount or in rates of increase or decrease are quite inconsistent with the idea of secular trend i.e there is no periodicity.

    Seasonal Variation: A periodic movement is one which recurs with some degree of regularity within a definite time period.

    Holiday: Holidays or events incur predictability of a time series. As for example in New year the sale of clothes remain quite high as compared to other regular days and same for Diwali (in India) or other occasions.


  • Installing Packages and Libraries
    import warnings

    import numpy as np

    import pandas as pd

    import matplotlib.pyplot as plt


    %matplotlib inline


from fbprophet import Prophet

from sklearn.metrics import mean_squared_error, mean_absolute_error




  • Preparing Dataset for Prophet


    df = df.rename(columns={'Order Date''ds''Sales''y'})

    df_model = Prophet(interval_width=0.95)




    INFO:numexpr.utils:NumExpr defaulting to 2 threads.

    INFO:fbprophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.

    <fbprophet.forecaster.Prophet at 0x7f0ebc07c940>






Here, we plot the sales data with respect to selling date. And see that there are low bars in the starting of year and ending of year and couple of big bars in the mid year.



  • Forecast using Prophet


    model = Prophet()




INFO:fbprophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
future = model.make_future_dataframe(periods=12, freq = 'm')





10002 2018-07-31

10003     2018-08-31

10004     2018-09-30

10005     2018-10-31

10006     2018-11-30


forecast = model.predict(future)





forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()


 ds     yhat     yhat_lower     yhat_upper

1244    2018-08-01    1108.296244    -1751.105682    3839.363077

1245    2018-09-01    3101.296688    461.439189     5855.126001

1246    2018-10-01    3369.973509    610.175122     6172.857257

1247    2018-11-01    3502.864480    679.938129     6185.642495

1248    2018-12-01    3674.377755    783.551508     6571.994512
  • Result of Forecast





From the diagram, we can see the forecasting sales for the next year. And we can see that the rate of the sales is increasing with respect to years.


Now break the model in different components;

Trend , Seasonal Variation , Holidays ;




Good to see that the sales have been increasing over time and will be keep growing year after year. The worst day for sale is Wednesday (should be a holiday), The best day for sale is Monday and Friday. The worst month for sale is February, The best month for sale is November and September.


metric_df = forecast.set_index('ds')[['yhat']].join(df.set_index('ds').y).reset_index()





 ds     yhat      y

1244    2018-08-01    1108.296244    NaN

1245    2018-09-01    3101.296688    NaN

1246    2018-10-01    3369.973509    NaN

1247    2018-11-01    3502.864480    NaN

1248    2018-12-01    3674.377755    NaN






     ds     yhat     y

1232    2017-12-26    2636.565655    814.5940

1233    2017-12-27    1670.346485    177.6360

1234    2017-12-28    2764.198983    1657.3508

1235    2017-12-29    2922.492564    2915.5340

1236    2017-12-30    2416.579612    713.7900


mean_squared_error(metric_df.y, metric_df.yhat)





mean_absolute_error(metric_df.y, metric_df.yhat)




Therefore, RMSE = Square root (MSE) = 2136.445

Root mean square error (RMSE) tells us that our model was able to forecast the average daily sales in the test set within 2136.445 of the real sales. This is a quite good model so far.



So here we are to conclude that we focused on the application of different time series models in this study, learned how to use them with the objective to forecast new sale values. There are many time-series analysis we can explore from now on, such as forecast with uncertainty bounds, change point and anomaly detection, forecast time-series with external data source. With respect to accuracy level, we can conclude that ARIMA performed very well. Prophet is one of the new forecasting approach, it may need more tuning for getting more accuracy.


Project by <br>Siddhartha Saha
Project by
Siddhartha Saha

Msc Statistics Kalyani University

Mentored by

Souvik Manna
Souvik Manna

ISI Bangalore



Mathematica-city is an online Education forum for Science students run by Kounteyo, Shreyansh and Souvik. We aim to provide articles related to Actuarial Science, Data Science, Statistics, Mathematics and their applications using different Statistical Software. Feel free to reach out to us for any kind of discussion on any of the related topics,

Leave a Reply

Your email address will not be published.