FORECASTING CRIME USING TIME SERIES
“The world is full of obvious things which nobody by any chance ever observes.” - Arthur Conan Doyle
Some patterns are hard to observe directly, but with a large amount of data we can forecast the upcoming crime rate, take appropriate action and revise our policies. All of this can be done with some simple statistical tools. Here, I am going to forecast upcoming crime rates in the city of Boston. Do they follow a pattern? Do they increase in certain periods, or do they just move randomly? To make this forecast I will use time series analysis.
- What is Time Series?
A series of observations ordered along a single dimension, time, is called a time series. In the statistical sense, a time series is a sequence of random variables ordered in time.
- Why do we need time series analysis in the forecasting process?
As defined above, a time series is a sequence of observations on a variable made sequentially in time, on an hourly, daily, weekly, monthly, quarterly or yearly basis. By analyzing this type of data we can forecast possible upcoming trends.
In this blog I am analyzing a univariate time series.
- So, what is a univariate time series?
A univariate time series consists of a single variable recorded sequentially over equal time increments. In this project there is only one variable, the crime count, which is recorded over a certain period of time.
- Goal of this project:
Crime incident reports are provided by the Boston Police Department (BPD) to document the initial details surrounding an incident to which BPD officers respond. This data set contains records from the new crime incident report system, which includes a reduced set of fields focused on capturing the type of incident as well as when and where it occurred. My goal is to predict the upcoming trend of crime in Boston using time series analysis.
- Data Preparation:
In any analytics project, data preparation is the first step. It consists of a few steps.
- INDEXING: This project is all about a univariate time series, but the crime data set has 17 columns, i.e. 17 variables. For a univariate time series we need only two columns: “crime occurred on date” and “count of crime”. By indexing we retrieve these two columns.
- ORDERING OF THE DATA: In any time series analysis, ordering the data set is a must. If the data set is not arranged in order of time (ascending by default), a plot of the data will not be meaningful and we will not be able to see any cyclical or seasonal pattern. If the data is already ordered, this step is not required.
- DATE OBJECT TO STRING: We convert the date objects into strings in one consistent format, so that every record of the same day carries the same label. The index is turned back into a pandas DatetimeIndex later, before plotting.
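The three preparation steps can be sketched in pandas roughly as follows. This is only an illustration on a tiny made-up table: the column names (`INCIDENT_NUMBER`, `OFFENSE`, `OCCURRED_ON_DATE`) and the rows are assumptions, not the actual Boston file.

```python
import pandas as pd

# Hypothetical miniature of the incident table (the real data set has 17 columns).
# Column names and rows are made up for illustration only.
raw = pd.DataFrame({
    "INCIDENT_NUMBER": ["I3", "I1", "I2", "I4"],
    "OFFENSE": ["Larceny", "Assault", "Larceny", "Vandalism"],
    "OCCURRED_ON_DATE": [
        "2015-06-17 09:30:00",
        "2015-06-16 22:10:00",
        "2015-06-16 08:05:00",
        "2015-06-17 23:45:00",
    ],
})

# INDEXING: keep only the date column and parse it.
dates = pd.to_datetime(raw["OCCURRED_ON_DATE"])

# DATE TO STRING: format every timestamp as a day label, then count
# the incidents per day to obtain the single variable "Count".
daily = dates.dt.strftime("%Y-%m-%d").value_counts().rename("Count").to_frame()
daily.index.name = "Date"

# ORDERING: sort the days in ascending order.
daily = daily.sort_index()
print(daily)
```

The result has one row per day and a single `Count` column, which is the shape a univariate analysis needs.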
- Data Visualization:
After our data is prepared, we can plot it. Data visualization gives us a clear idea of the overall data. If there is any kind of trend, we can spot it here.
- What is trend?
Trend is one of the components of a time series. There are four components:
- TREND: When the observations increase or decrease steadily through time, we say the series follows a trend.
- SEASONALITY: The observations stay high for a while and then decrease, and this pattern repeats from one period to the next. E.g. the sale of sweaters increases in winter and decreases in summer, and we see this pattern repeat every year.
- CYCLICAL TREND: The business cycle is a typical example of a cyclical trend. It moves through prosperity, boom, recession and depression.
- IRREGULAR FLUCTUATION OR NOISE: Irregular fluctuations are completely random and nobody can predict them. E.g. the COVID-19 pandemic caused a depression in economic activity. The pandemic was not predicted beforehand, so its consequences are irregular. Looking back over the next 20 years, the 2020 depression in the world economy will appear as an irregular fluctuation.
VISUALIZING THE CRIME TIME SERIES DATA:
Here, we have plotted the crime counts against each month of the year. In this plot we can see that every year the number of crimes rises sharply in June and August, and every year it falls in January. So there is a seasonal pattern in crime, and there is also a regular up and down movement through time. Consequently, we can conclude that this time series has both a trend and a seasonal component.
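One way to check a seasonal pattern numerically, and not only by eye, is to average the counts by calendar month. The sketch below uses synthetic daily counts with a built-in summer peak, not the Boston data. The names `counts` and `monthly_profile` are chosen for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic daily counts (NOT the Boston data): a yearly cycle that
# peaks in mid-July and bottoms out in mid-January, plus small noise.
rng = np.random.default_rng(0)
idx = pd.date_range("2015-01-01", "2017-12-31", freq="D")
doy = idx.dayofyear.to_numpy()
seasonal = 20 * np.sin(2 * np.pi * (doy - 105) / 365.25)
counts = pd.Series(250 + seasonal + rng.normal(0, 2, len(idx)),
                   index=idx, name="Count")

# Average count for each calendar month, pooled over all years.
monthly_profile = counts.groupby(counts.index.month).mean()
peak_month = int(monthly_profile.idxmax())
low_month = int(monthly_profile.idxmin())

print(monthly_profile.round(1))
print("highest month:", peak_month, "lowest month:", low_month)
```

If the same months come out highest and lowest year after year, that supports reading the pattern as seasonality and not as noise.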
For proper forecasting, when a trend is present the analyst has to detrend the series. There are three processes to detrend a non-stationary time series:
- PROCESS OF ROLLING AVERAGE: Here we take the average of the points on either side of each observation. We take an average because it also tends to smooth out ‘noise’.
- PROCESS OF DIFFERENCING: Here we calculate the difference between successive data points in the time series. This is very helpful in turning a non-stationary series into a stationary one. For example, if the series is a random walk, $y_t = y_{t-1} + u_t$, then
$y_t - y_{t-1} = u_t$ [where $u_t$ is white noise]
or, $\Delta y_t = u_t$ [$u_t$ is stationary, so after the first difference the process becomes stationary]
- PROCESS OF LOG DIFFERENCING: Another type of differencing that is widely used to obtain stationarity in a non-stationary time series is log-differencing. Where simple differencing is not effective, log-differencing often is. Here we first take the log of the variable and then take the successive differences.
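The three detrending processes above can be sketched with pandas as follows. The series is synthetic (exponential growth of 10% per step, an assumption chosen so that the contrast between simple and log differencing is visible), and the window size of 3 is arbitrary.

```python
import numpy as np
import pandas as pd

# Synthetic series growing by 10% per step (illustrative assumption).
t = np.arange(1, 11)
y = pd.Series(100.0 * 1.1 ** t)

# Rolling average: a centred window of 3 averages each point with its
# neighbour on either side; the two end points have no full window.
rolling_mean = y.rolling(window=3, center=True).mean()

# First difference: y_t - y_{t-1}. The first value is undefined and dropped.
diff = y.diff(1).dropna()

# Log difference: log(y_t) - log(y_{t-1}).
log_diff = np.log(y).diff(1).dropna()

print(diff.round(2).tolist())      # still growing over time
print(log_diff.round(4).tolist())  # constant for exponential growth
```

For this kind of growth the simple differences keep increasing, while the log differences are constant, which is the situation where log-differencing is the better choice.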
- AFTER DETRENDING THE SERIES WE HAVE TO CHECK THE STATIONARITY:
First of all, we have to clear up some ideas about stationarity:
- WHAT IS A STATIONARY TIME SERIES?
In simple words, a stationary time series is one whose statistical properties, i.e. mean and variance, do not depend on time.
- PERIODICITY: A time series is periodic if it repeats itself at equally spaced intervals, for example every 12 months.
- AUTO CORRELATION: Auto correlation refers to the correlation of a time series variable with its own past and future values.
Auto correlation is defined as the ratio of the auto covariance to the variance of a time series variable.
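The definition of auto correlation as a ratio can be written out directly in NumPy. This is a minimal sketch of the usual sample estimator; the function name `autocorrelation` and the toy series are introduced here for illustration.

```python
import numpy as np

def autocorrelation(x, lag):
    """Sample auto correlation at `lag`: auto covariance divided by variance."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    dev = x - x.mean()                      # deviations from the mean
    autocov = np.sum(dev[lag:] * dev[:n - lag]) / n
    variance = np.sum(dev ** 2) / n
    return autocov / variance

series = [1.0, 2.0, 3.0, 4.0, 5.0]
print([round(autocorrelation(series, k), 2) for k in range(3)])
# [1.0, 0.4, -0.1]
```

Plotting these values against the lag gives the correlogram used in the stationarity check.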
Now we have to check the stationarity of the series. A few simple steps tell us whether our series is stationary or not.
- STEPS OF CHECKING STATIONARITY OF THE TIME SERIES:
- LOOKING AT THE CORRELOGRAM: The correlogram is the graphical representation of the ACF.
- SUMMARY STATISTICS
- STATISTICAL TEST: Here we perform the augmented Dickey-Fuller test, where the null hypothesis (H0) is that the series is non-stationary and the alternative hypothesis (H1) is that the series is stationary.
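The summary statistics step is not spelled out in the list above. One common way to do it, sketched here as an assumption and not necessarily what was done for the Boston data, is to split the series into two halves and compare their means and variances. The helper name `half_split_stats` and both series are made up.

```python
import numpy as np

def half_split_stats(x):
    """Return ((mean_1, mean_2), (var_1, var_2)) for the two halves of x."""
    x = np.asarray(x, dtype=float)
    mid = len(x) // 2
    first, second = x[:mid], x[mid:]
    return (first.mean(), second.mean()), (first.var(), second.var())

rng = np.random.default_rng(42)
noise = rng.normal(0, 1, 1000)              # stationary by construction
trending = noise + 0.01 * np.arange(1000)   # mean drifts upward over time

print(half_split_stats(noise))      # the two means are close
print(half_split_stats(trending))   # the two means differ clearly
```

Similar means and variances in both halves are consistent with stationarity; a clear difference points to a trend or a changing variance. This is only a rough check, which is why a formal test follows.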
- PLOTTING THE CORRELOGRAM (GRAPHICAL REPRESENTATION OF ACF AND PACF):
Here we can see that both the ACF and the PACF graphs have spikes on the positive and the negative side. This suggests the series may be stationary, but a correlogram alone is not conclusive, so we confirm it with a statistical test.
- STATISTICAL TEST:
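We run the augmented Dickey-Fuller (ADF) test.

    import pandas as pd
    from statsmodels.tsa.stattools import adfuller

    def test_stationarity(series, mlag=365, lag=None):
        print('ADF Test Result')
        # adfuller returns a tuple: statistic, p-value, used lag, nobs, critical values, ...
        res = adfuller(series, maxlag=mlag, autolag=lag)
        output = pd.Series(res[0:4], index=['Test Statistic', 'p value', 'used lag',
                                            'Number of observations used'])
        # the critical values are a dict stored at position 4 of the tuple
        for key, value in res[4].items():
            output['Critical Value ' + key] = value
        print(output)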
Running this code, we get:
ADF test result: p-value=0.192997
We fail to reject the null hypothesis if the p-value is greater than 0.05. Here the p-value is greater than 0.05, hence we cannot reject non-stationarity and we treat the series as not stationary.
We take the first order difference and then run the ADF test again, to see whether the series has become stationary.
    d1 = d.copy()
    d1['Count'] = d1['Count'].diff(1)
    # the first difference is undefined for the first row; drop it before testing
    d1 = d1.dropna()
After taking the first order difference, we run the ADF test again. Now the result is as follows:
p-value=2.029010e-17. The p-value is less than 0.05, so we reject the null hypothesis and conclude that the differenced series is stationary.
DETERMINATION OF PARAMETERS (p,d,q):
Conventionally, the autoregressive order p is read from the PACF and the moving average order q from the ACF. From the correlograms above we take p=4 and q=2. Lastly, since we got a stationary time series after first order differencing, d=1.
So, our parameters are (p,d,q)=(4,1,2)
- p: Number of lag observations included in the model (the autoregressive order)
- d: Degree of differencing
- q: The order of moving average
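For reference, the three parameters fit together as follows in the standard textbook form of an ARIMA(p,d,q) model. The symbols $B$ (the lag operator, $B y_t = y_{t-1}$), $\phi_i$, $\theta_j$ and $c$ are notation introduced here, $u_t$ is white noise as before, and the way the constant is defined varies between libraries.

```latex
\left(1 - \sum_{i=1}^{p} \phi_i B^i\right) (1 - B)^d \, y_t
  = c + \left(1 + \sum_{j=1}^{q} \theta_j B^j\right) u_t

% With (p, d, q) = (4, 1, 2) and \Delta y_t = y_t - y_{t-1}:
\Delta y_t = c + \phi_1 \Delta y_{t-1} + \phi_2 \Delta y_{t-2}
           + \phi_3 \Delta y_{t-3} + \phi_4 \Delta y_{t-4}
           + u_t + \theta_1 u_{t-1} + \theta_2 u_{t-2}
```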
If you want more of the theory and maths behind it, please comment below.
TIME SERIES FORECASTING WITH ARIMA: ARIMA (Auto Regressive Integrated Moving Average) models are denoted ARIMA(p,d,q). The three parameters account for the autoregressive structure, the trend (removed by differencing) and the moving average of the noise in the data. Our parameters are (4,1,2). Now it is time for forecasting.
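    # This is the legacy ARIMA class, which was removed in statsmodels 0.13,
    # so the code needs an older release. Newer releases provide
    # statsmodels.tsa.arima.model.ARIMA, whose summary output looks different.
    from statsmodels.tsa.arima_model import ARIMA

    timeseries = d['Count']
    order = (4, 1, 2)  # (p, d, q), kept in one tuple so the DataFrame `d` is not overwritten
    arima_mod = ARIMA(timeseries, order=order).fit()
    summary = arima_mod.summary2(alpha=.05, float_format="%.8f")
    print(summary)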
Our result is as follows:
                              Results: ARIMA
    ====================================================================
    Model:              ARIMA            BIC:                 11362.8143
    Dependent Variable: D.Count          Log-Likelihood:      -5653.1
    Date:               2020-07-29 20:17 Scale:               1.0000
    No. Observations:   1176             Method:              css-mle
    Df Model:           7                Sample:              06-16-2015
    Df Residuals:       1169                                  09-03-2018
    Converged:          1.0000           S.D. of innovations: 29.586
    No. Iterations:     15.0000          HQIC:                11337.549
    AIC:                11322.2553
    ---------------------------------------------------------------------
                     Coef.   Std.Err.     t      P>|t|    [0.025   0.975]
    ---------------------------------------------------------------------
    const           -0.0037   0.0642   -0.0579   0.9539  -0.1295   0.1221
    ar.L1.D.Count    0.4726   0.1516    3.1170   0.0019   0.1754   0.7698
    ar.L2.D.Count   -0.1683   0.0461   -3.6534   0.0003  -0.2586  -0.0780
    ar.L3.D.Count    0.0457   0.0354    1.2917   0.1967  -0.0236   0.1151
    ar.L4.D.Count   -0.0910   0.0309   -2.9471   0.0033  -0.1516  -0.0305
    ma.L1.D.Count   -1.1940   0.1500   -7.9578   0.0000  -1.4880  -0.8999
    ma.L2.D.Count    0.2485   0.1406    1.7669   0.0775  -0.0271   0.5241
    -----------------------------------------------------------------------------
                     Real         Imaginary        Modulus        Frequency
    -----------------------------------------------------------------------------
    AR.1            1.2461         -1.0366          1.6209          -0.1104
    AR.2            1.2461          1.0366          1.6209           0.1104
    AR.3           -0.9950         -1.7864          2.0448          -0.3309
    AR.4           -0.9950          1.7864          2.0448           0.3309
    MA.1            1.0805          0.0000          1.0805           0.0000
    MA.2            3.7246          0.0000          3.7246           0.0000
    ====================================================================
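The last part of the summary lists the roots of the AR and MA polynomials. When every root has a modulus greater than 1, the fitted AR part is stationary and the MA part is invertible. As a rough cross-check, the roots can be recomputed from the rounded coefficients in the table with NumPy. The variable names are chosen here for illustration, and small differences from the table come from rounding.

```python
import numpy as np

# Coefficients copied from the summary table (rounded to 4 decimals).
ar = [0.4726, -0.1683, 0.0457, -0.0910]   # ar.L1 .. ar.L4
ma = [-1.1940, 0.2485]                    # ma.L1, ma.L2

# AR polynomial: 1 - a1*z - a2*z^2 - a3*z^3 - a4*z^4
# MA polynomial: 1 + m1*z + m2*z^2
# np.roots expects the coefficient of the highest power first.
ar_poly = [-c for c in reversed(ar)] + [1.0]
ma_poly = list(reversed(ma)) + [1.0]

ar_moduli = np.sort(np.abs(np.roots(ar_poly)))
ma_moduli = np.sort(np.abs(np.roots(ma_poly)))

print("AR root moduli:", ar_moduli.round(4))
print("MA root moduli:", ma_moduli.round(4))
print("all outside the unit circle:",
      bool(np.all(ar_moduli > 1) and np.all(ma_moduli > 1)))
```

Note that the smallest MA root has a modulus only slightly above 1, so the MA part is close to the boundary of invertibility.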
VISUALIZATION OF OUR PREDICTION:
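    import matplotlib.pyplot as plt
    import pandas as pd

    # In-sample prediction for one year. typ='levels' asks the legacy ARIMA
    # results for predictions on the original scale; the default returns the
    # differenced scale, which cannot be compared with the observed counts.
    predict_data = arima_mod.predict(start='2016-07-01', end='2017-07-01',
                                     dynamic=False, typ='levels')
    timeseries.index = pd.DatetimeIndex(timeseries.index)

    # Plot the observed series and the prediction on the same axes.
    fig, ax = plt.subplots(figsize=(20, 15))
    timeseries.plot(ax=ax)
    predict_data.plot(ax=ax)
    plt.show()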
VISUALIZATION OF FORECAST FOR COMING MONTHS:
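    import matplotlib.pyplot as plt

    # `ts`, `pred_uc` and `pred_ci` are created in code not shown in this post.
    # From the way they are used here, `pred_uc` exposes the forecast mean as
    # `predicted_mean`, and `pred_ci` holds the lower and upper bounds of the
    # confidence interval in its first two columns.
    ax = ts['Count'][-60:].plot(label='observed', figsize=(15, 10))
    pred_uc.predicted_mean.plot(ax=ax, label='Forecast')
    # shade the confidence interval around the forecast
    ax.fill_between(pred_ci.index, pred_ci.iloc[:, 0], pred_ci.iloc[:, 1],
                    color='k', alpha=.25)
    ax.set_xlabel('Date')
    ax.set_ylabel('Counts')
    plt.legend()
    plt.show()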
The orange line shows our predicted values. They stay close to the observed values, so the forecast of crime for the coming months looks reasonable and the model appears to fit the data well.
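A visual comparison can be backed up with a simple error measure. The sketch below computes the mean absolute error (MAE) and the root mean squared error (RMSE) between observed and predicted counts. The numbers are made up for illustration and are not results from the Boston model, and `forecast_errors` is a helper name introduced here.

```python
import numpy as np

def forecast_errors(observed, predicted):
    """Return (MAE, RMSE) between two equally long sequences."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    err = observed - predicted
    mae = np.mean(np.abs(err))          # average size of the miss
    rmse = np.sqrt(np.mean(err ** 2))   # penalises large misses more
    return mae, rmse

# Made-up daily counts, for illustration only.
observed = [260, 275, 250, 290]
predicted = [255, 280, 260, 285]

mae, rmse = forecast_errors(observed, predicted)
print("MAE:", mae, "RMSE:", round(rmse, 2))
```

Comparing these errors with the typical daily count gives a feel for how large the misses are in practical terms.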
NOTE: If you have any doubts, feel free to comment.