Time series analysis is a very useful way of examining data over time, both to explain past events and to forecast the future. For example, by looking at airline ticket costs, past demand for air travel, economic growth, and other factors, an airline may hope to forecast future demand. This ability to estimate the future would help tremendously in projecting revenue and budgeting for staff, airplanes, and other investments. But as useful as it is, time series analysis has many peculiarities that must be considered before acting on its results.
Below, I will walk through a recent time series analysis that I attempted, whose outcome seemed very promising at first but later proved unreliable. As always, the Python script underlying this analysis can be found on my GitHub.
E-Commerce and Internet Penetration
I recently stumbled upon historical data for e-commerce sales. Inspired by my recent move to tech-heavy Seattle, I was interested in seeing whether there was any relationship between the growth in e-commerce and the growth in broadband internet access in the United States. I included some macroeconomic data in order to paint a more complete picture. The econometric equation that I was hoping to arrive at was:
E-Commerce Sales = β₀ + β₁ · Broadband Internet Rate + β₂ · GDP Growth + β₃ · Inflation + ε
The first problem I ran into was that I could only find data on internet usage dating back to 2002, on a semiannual basis (so n ≈ 30). I would prefer an n > 50, but roughly 30 is probably enough for this exercise. The small n would make it difficult to split the data into training and testing blocks for the purpose of forecasting, but it may be enough for simply explaining the past.
As always, my first step was to visualize the raw data. (Note: the raw data had already been seasonally adjusted).
It is pretty clear from the graphs that my dependent variable (e-commerce sales, in blue) is strongly correlated with each of my explanatory variables (in red). This is further confirmed by a series of scatter plots.
An initial regression on the raw data produces an R-squared of 0.97, further confirming the strength of the shared trend.
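For anyone following along, here is a minimal plotting sketch along these lines (the file name and column names are my assumptions, not necessarily those of the actual script):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load the combined dataset; "ecommerce_macro.csv" and the column names
# are hypothetical stand-ins for the data described in the post.
df = pd.read_csv("ecommerce_macro.csv", index_col=0, parse_dates=True)

# Plot e-commerce sales (blue) against each explanatory variable (red),
# one panel per variable, with the explanatory series on a secondary axis.
fig, axes = plt.subplots(3, 1, figsize=(8, 9), sharex=True)
for ax, col in zip(axes, ["Internet", "CPI", "GDP"]):
    ax.plot(df.index, df["Sales"], color="blue")
    ax.set_ylabel("Sales")
    ax2 = ax.twinx()
    ax2.plot(df.index, df[col], color="red")
    ax2.set_ylabel(col)
plt.tight_layout()
plt.show()

# Scatter plots of every pairing, to eyeball the correlations.
pd.plotting.scatter_matrix(df[["Sales", "Internet", "CPI", "GDP"]], figsize=(8, 8))
plt.show()
```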
Table 1
| Dep. Variable: | Sales | R-squared: | 0.970 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.967 |
| Method: | Least Squares | F-statistic: | 272.2 |
| Date: | Wed, 26 Jul 2017 | Prob (F-statistic): | 3.30e-19 |
| Time: | 12:01:18 | Log-Likelihood: | -304.70 |
| No. Observations: | 29 | AIC: | 617.4 |
| Df Residuals: | 25 | BIC: | 622.9 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| Intercept | -9.4e+05 | 7.35e+04 | -12.793 | 0.000 | -1.09e+06, -7.89e+05 |
| Internet | -5.235e+05 | 1.13e+05 | -4.632 | 0.000 | -7.56e+05, -2.91e+05 |
| CPI | 3118.9581 | 604.414 | 5.160 | 0.000 | 1874.145, 4363.771 |
| GDP | 0.0163 | 0.003 | 5.903 | 0.000 | 0.011, 0.022 |
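For reference, a regression like this is nearly a one-liner with statsmodels' formula API; a sketch, continuing with the df assumed above:

```python
import statsmodels.formula.api as smf

# OLS on the raw (level) data, mirroring the summary in Table 1.
model = smf.ols("Sales ~ Internet + CPI + GDP", data=df).fit()
print(model.summary())
```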
I didn’t bother checking the suitability of the variables in this initial regression because it was clear to me that they would not meet the statistical assumptions of OLS regression (namely normality, stationarity, etc.). Furthermore, I was interested in the growth of these variables over time, rather than their raw values. Examining the growth of these variables tells a more interesting story, and it may also address the non-stationarity that is likely present in the raw data.
I next took the percent change of each variable from one period to the next and ran an OLS regression on the result, prefixing each variable with ‘delta’ to denote its percent change over time.
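A sketch of the transformation, continuing with the same assumed DataFrame (pandas' pct_change leaves a NaN in the first row, and dropping it is what takes the observation count from 29 in Table 1 down to 28 in Table 2):

```python
# Percent change from one period to the next; the first row is NaN
# and gets dropped, taking n from 29 down to 28.
delta = df.pct_change().dropna()
delta.columns = ["delta" + c for c in df.columns]

delta_model = smf.ols("deltaSales ~ deltaInternet + deltaCPI + deltaGDP",
                      data=delta).fit()
print(delta_model.summary())
```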
Table 2
| Dep. Variable: | deltaSales | R-squared: | 0.806 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.782 |
| Method: | Least Squares | F-statistic: | 33.28 |
| Date: | Thu, 27 Jul 2017 | Prob (F-statistic): | 1.02e-08 |
| Time: | 10:09:55 | Log-Likelihood: | 72.092 |
| No. Observations: | 28 | AIC: | -136.2 |
| Df Residuals: | 24 | BIC: | -130.9 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| Intercept | 0.0435 | 0.007 | 6.177 | 0.000 | 0.029, 0.058 |
| deltaInternet | 0.2426 | 0.066 | 3.699 | 0.001 | 0.107, 0.378 |
| deltaCPI | -0.9479 | 0.502 | -1.889 | 0.071 | -1.984, 0.088 |
| deltaGDP | 3.2412 | 0.405 | 8.004 | 0.000 | 2.405, 4.077 |
Interestingly, all of my variables were significant, with the exception of inflation (deltaCPI, the percent change in CPI). I ran a third regression, dropping deltaCPI.
Table 3
| Dep. Variable: | deltaSales | R-squared: | 0.777 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.760 |
| Method: | Least Squares | F-statistic: | 43.66 |
| Date: | Thu, 27 Jul 2017 | Prob (F-statistic): | 6.98e-09 |
| Time: | 10:10:18 | Log-Likelihood: | 70.152 |
| No. Observations: | 28 | AIC: | -134.3 |
| Df Residuals: | 25 | BIC: | -130.3 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| Intercept | 0.0362 | 0.006 | 5.856 | 0.000 | 0.023, 0.049 |
| deltaInternet | 0.2140 | 0.067 | 3.192 | 0.004 | 0.076, 0.352 |
| deltaGDP | 3.1549 | 0.423 | 7.467 | 0.000 | 2.285, 4.025 |
Examining this third regression, both explanatory variables, internet growth (deltaInternet) and GDP growth (deltaGDP), were statistically significant. Also, the direction of the coefficients made practical sense, i.e. both GDP growth and greater broadband penetration had a positive effect on the growth of e-commerce sales. My resulting equation turned out to be:
ΔSales = 0.04 + 0.21ΔInternet + 3.15ΔGDP
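To put the coefficients in perspective (a purely illustrative plug-in, not a forecast from the model): a period with 2% broadband growth and 1% GDP growth would imply e-commerce sales growth of roughly 0.04 + 0.21(0.02) + 3.15(0.01) ≈ 7.6%.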
Very promising. But an R-squared of 0.78 still seems suspiciously high to me, so I need to look a little more closely at whether my variables satisfy all of the necessary assumptions of OLS.
First, let’s take a look at whether they are stationary, that is, whether their mean and variance are constant over time. For this, I employed an Augmented Dickey-Fuller test, which has a null hypothesis of non-stationarity.
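A sketch of the test with statsmodels, continuing with the assumed delta DataFrame:

```python
from statsmodels.tsa.stattools import adfuller

# ADF test on each transformed series; the null hypothesis is that the
# series has a unit root (i.e. is non-stationary).
for col in ["deltaSales", "deltaInternet", "deltaGDP"]:
    stat, pvalue, usedlag, nobs, crit, icbest = adfuller(delta[col])
    print(f"{col}: ADF statistic = {stat:.4f}, p-value = {pvalue:.4f}")
```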
Tables 4–6: Augmented Dickey-Fuller test results for deltaSales, deltaInternet, and deltaGDP.
All three transformed variables proved to be stationary at the 5% significance level (i.e., the test rejected the null hypothesis for each).
Another primary assumption behind OLS inference is normality (strictly speaking, of the error terms, though it is common practice to check the distributions of the variables themselves).
Figures 5–7: Histograms of deltaSales, deltaInternet, and deltaGDP.
Here was my first sign of trouble. Compared to the raw values, the three transformed variables were closer to a normal distribution, but still not quite there. The histograms plotted for each variable do not look particularly normal, but I decided to run a test just to make sure.
Table 7
| Variable | Test Statistic | p-value |
|---|---|---|
| deltaSales | 21.6775 | 0.00001962 |
| deltaInternet | 4.9528 | 0.08404574 |
| deltaGDP | 32.8555 | 0.00000007 |
The above two-tailed test has a null hypothesis of a normal distribution. The resulting p-values reject the null hypothesis for both the ΔSales and ΔGDP variables. The ΔInternet variable, however, failed to reject the null, which is at least consistent with ΔInternet being normally distributed. It is possible that because my n is so small (n < 50), this test for normality isn’t fully reliable. I therefore decided to continue with other tests to see how my variables fared on other metrics.
I next tested for Granger causality, which checks whether past values of one variable help predict another variable.
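The reported statistics and p-values are consistent with scipy's normaltest (the D'Agostino-Pearson K² test); a sketch, assuming that is the test used:

```python
from scipy import stats

# D'Agostino-Pearson K^2 test; the null hypothesis is that the sample
# comes from a normal distribution.
for col in ["deltaSales", "deltaInternet", "deltaGDP"]:
    stat, pvalue = stats.normaltest(delta[col])
    print(f"{col}: K^2 = {stat:.4f}, p-value = {pvalue:.8f}")
```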
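A sketch using statsmodels' grangercausalitytests, which tests whether the second column of the passed data helps predict the first:

```python
from statsmodels.tsa.stattools import grangercausalitytests

# Test both directions: does deltaSales help predict deltaInternet,
# and does deltaInternet help predict deltaSales?
for target, predictor in [("deltaInternet", "deltaSales"),
                          ("deltaSales", "deltaInternet")]:
    print(f"\nDoes {predictor} Granger-cause {target}?")
    grangercausalitytests(delta[[target, predictor]], maxlag=2)
```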
Tables 8–9: Granger causality test results, one for each direction between deltaSales and deltaInternet.
In this instance, I only used 2 lags, but I ran the test with various lag lengths and the results were all the same. Interestingly, the only pairing that exhibited Granger causality was deltaSales on deltaInternet (the low p-value led to a rejection of the null hypothesis); there was no Granger causality in the reverse direction. I suppose it’s possible that as more e-commerce options appear, more people decide to invest in broadband internet, but that seems unlikely to me.
Moving on from causality, I went on to test for cointegration, which occurs when x and y share the same stochastic trend. In other words, while each of the two variables may wander non-stationarily on its own, some linear combination of the two is stationary; in statistical terms, the residuals of the regression of one variable on the other are stationary. Given the high correlation among the raw data points and the high R-squared of the growth regression, I suspected that some cointegration might be present. I ran the Engle-Granger two-step test for cointegration, whose null hypothesis is that the residuals are non-stationary, i.e. that no cointegration exists.
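A sketch using statsmodels' coint, which implements the Engle-Granger test (the pairings are my reading of Tables 10 and 11):

```python
from statsmodels.tsa.stattools import coint

# Engle-Granger test; the null hypothesis is that the residuals of the
# pairwise regression are non-stationary, i.e. no cointegration.
for col in ["deltaInternet", "deltaGDP"]:
    t_stat, pvalue, crit = coint(delta["deltaSales"], delta[col])
    print(f"deltaSales vs {col}: t = {t_stat:.4f}, p-value = {pvalue:.4f}")
```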
Tables 10–11: Engle-Granger cointegration test results for ΔInternet and ΔGDP against ΔSales.
As suspected, both ΔGDP and ΔInternet rejected the null hypothesis of no cointegration.
Given the presence of cointegration, the next step would be to develop an error correction model that uses lagged terms to account for the cointegrating relationship. But in this instance, several other considerations led me to believe that this analysis had come to its natural end. First, the low number of observations (n ≈ 30) gave me pause from the very beginning. Second, the non-normal distribution of two of my three variables (ΔSales and ΔGDP) called into question the accuracy of an OLS regression. Third, the direction of the Granger causality between ΔSales and ΔInternet doesn’t make logical sense. And lastly, the presence of cointegration was the nail in the coffin.
Conclusion
Although I’m calling it quits on this dataset, there are a few things that could possibly save it, if I really wanted to:
First, I could increase the number of observations. While increasing n doesn’t necessarily bring a variable closer to a normal distribution, it can help in many ways. Ideally, a data set contains at least 50 observations; this one had roughly 30. Some tests for normality and other properties don’t work very well on small data sets, so increasing the number of observations could improve the accuracy of the tests and of the model built on them. In this case, however, I was not able to locate additional data on internet penetration. So although there is plenty of available data for the other variables, I was limited by that one.
A second option would be further variable transformations. I took the percent change from one period to the next for each variable. While this moved the distributions closer to normal, it didn’t fully get there. I also tried other transformations, such as taking the difference from one period to the next or the log of each variable, but on their own they didn’t do the trick either. (Also, none of the variables grew exponentially, so taking the log may not have been appropriate.) With access to more data (a higher n), perhaps there are other transformations I could have considered.
In closing, the statistical requirements of time series analysis can make it difficult to arrive at strong conclusions in the real world. Even when the data produce a result that seems logical (such as my original regression equation above), the statistical rigor required can call even seemingly obvious results into question.
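For completeness, sketches of those alternative transformations, again using the assumed df from earlier:

```python
import numpy as np

# First differences: change in level from one period to the next.
diffs = df.diff().dropna()

# Log transform (only sensible for strictly positive series), and
# log differences, which approximate period-over-period growth rates.
logs = np.log(df)
log_diffs = logs.diff().dropna()
```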