Forecasting Project

Introduction

The dataset that I will be working on is called “Production”. I’ve downloaded it from the Fred website where we can find economic data for research. Our data provide the scope of monthly industrial Production of Electric and gas utilities. It is observed from 1939 until 2020. The unit that the data is measured with is the index 2012 =100 which measures the real output of production in the United States.

We will start by importing our data and checking its structure:

data=read.csv2("/Users/khouloud/Documents/Portfolio/R/Production.csv", sep = ",")
data[,2]=as.numeric(data[,2])
str(data)
## 'data.frame':    976 obs. of  2 variables:
##  $ DATE: chr  "01/01/1939" "01/02/1939" "01/03/1939" "01/04/1939" ...
##  $ Prod: num  3.38 3.41 3.49 3.51 3.51 ...

As shown above, the data is composed of 976 instances and 2 variables. The first variable represents the monthly observations over the years and the second named “Prod” gives the yearly production of Electricity and Gaz.

Next, we have the different libraries we are going to use in our upcoming analysis.

library(tseries)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
library(forecast)
library(RColorBrewer)
#if(!require(devtools)) install.packages("devtools")
#devtools::install_github("cran/TSA", build_vignettes = FALSE)
library(TSA)
## Registered S3 methods overwritten by 'TSA':
##   method       from    
##   fitted.Arima forecast
##   plot.Arima   forecast
## 
## Attaching package: 'TSA'
## The following objects are masked from 'package:stats':
## 
##     acf, arima
## The following object is masked from 'package:utils':
## 
##     tar
library(fpp2)
## ── Attaching packages ────────────────────────────────────────────── fpp2 2.5 ──
## ✔ ggplot2   3.4.1     ✔ expsmooth 2.3  
## ✔ fma       2.5
## 

Data Understanding and Visualizing

Before diving deeper into the ARIMA modeling, we will try to preprocess the data, understand it a bit further, and Visualize it.

#Data preprocessing 

TimeS <- ts(data[,2],start=c(2005,1),frequency = 4)
class(TimeS)
## [1] "ts"
str(TimeS)
##  Time-Series [1:976] from 2005 to 2249: 3.38 3.41 3.49 3.51 3.51 ...

To prepare our data for the analysis, I started by transforming the Prod variable into a numerical one since it was imported as a character. I also changed the class of our data into time series using the ts function. Our data is now ready for time series Analysis.

Data Understanding and Visualization

summary(TimeS)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.384  19.659  55.211  55.004  88.093 128.907
boxplot(TimeS, main="Electric Production",box=0.1)

The mean of the data is 55.004 however the min is 3.384 and the max is 128.907. Thus there is visible variation in our data and we will explore it further in the analysis. We can also observe from the boxplot that there are no outliers.

autoplot(TimeS)+ggtitle("Electric Production over the years")+ylab( "index 2012 =100")

From this plot we can deduce: * It is apparent that the data is not stationary * The data has a steep upward trend which an increase in electricity production over the last 60 years. This is expected given the technological and demographic growth that the world in general and the United States, in particular, have experienced. * The data also have an increased variation at higher levels so it should be stabilized

  • The data may have seasonality but it is easily visible so we will explore it further
acf(TimeS)

pacf(TimeS)[1:10]

## 
## Partial autocorrelations of series 'TimeS', by lag
## 
##      1      2      3      4      5      6      7     NA     NA     NA 
##  0.504 -0.097 -0.057  0.068  0.014 -0.030 -0.016     NA     NA     NA
eacf(TimeS)
## AR/MA
##   0 1 2 3 4 5 6 7 8 9 10 11 12 13
## 0 x x x x x x x x x x x  x  x  x 
## 1 x x x x x x x x x x x  x  x  x 
## 2 x x x x x x x x x x x  x  x  x 
## 3 x x o x o x o o o x x  x  x  x 
## 4 x x x x o x x o o o x  x  x  x 
## 5 x x o x o x o o o o o  x  x  o 
## 6 x x o x o x o x o o o  x  x  o 
## 7 x x o x x x o x o o o  x  x  o
#Investigate seasonality

ggseasonplot(TimeS)+ggtitle("Seasonal Plot of Electrical Production")+ylab("Index 2012=100")

This graph plots each year individually which helps us identify any sort of seasonality. For the first 30 years, we can observe that the data is very stable and exudes little to no variation. However, after that, there is a pattern in which the production fluctuates each year. A downward sloêr around the star of the year followed by an upward. Around august, the production decreases again to increase around November. This pattern is repeatable in the last 40 years and even stronger in the last 10 years

==> This confirms the presence of seasonality in our data.

Data Transformation

BoxCox.lambda(TimeS)
## [1] -0.6500561
autoplot(TimeS)

TTimeS= BoxCox(TimeS,lambda =  0.279542)
autoplot(BoxCox(TimeS,lambda =  0.279542))

The variance has been stabilized but the strong trend is still present and must be accounted for before we entertain a stationary model.

library(uroot)
adf.test(TTimeS)
## Warning in adf.test(TTimeS): p-value greater than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  TTimeS
## Dickey-Fuller = 0.1554, Lag order = 9, p-value = 0.99
## alternative hypothesis: stationary
DTTimeS=diff(diff(TTimeS))
acf(DTTimeS,lag.max=10)[1:10]

## 
## Autocorrelations of series 'DTTimeS', by lag
## 
##      1      2     NA     NA     NA     NA     NA     NA     NA     NA 
## -0.309 -0.309     NA     NA     NA     NA     NA     NA     NA     NA
pacf(DTTimeS)

adf.test(DTTimeS)
## Warning in adf.test(DTTimeS): p-value smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  DTTimeS
## Dickey-Fuller = -19.093, Lag order = 9, p-value = 0.01
## alternative hypothesis: stationary
plot(DTTimeS)

eacf(DTTimeS
     
     )
## AR/MA
##   0 1 2 3 4 5 6 7 8 9 10 11 12 13
## 0 x x x x x x x x x x x  x  x  x 
## 1 x x x x x x x x x x x  x  x  x 
## 2 x x o x x x x o o x x  x  x  x 
## 3 x x o x o o o o o o x  x  x  x 
## 4 x x x x o x o o o o x  x  x  o 
## 5 x x x x x o o o o o o  x  x  o 
## 6 x x x x x o o x o o o  x  x  x 
## 7 x x x o x o x x o o o  x  x  x
acf(DTTimeS ,lag.max=10)[1:10]

## 
## Autocorrelations of series 'DTTimeS', by lag
## 
##      1      2     NA     NA     NA     NA     NA     NA     NA     NA 
## -0.309 -0.309     NA     NA     NA     NA     NA     NA     NA     NA

The trend is now removed but the variation does not appear to be constant across.

decomposedRes <- decompose(TimeS, type="mult")
plot (decomposedRes) 

AdjTimeS=TimeS-decomposedRes$seasonal
plot(AdjTimeS)

fit=auto.arima(TimeS,d=1,stepwise=FALSE, approximation = FALSE)

print(summary(fit))
## Series: TimeS 
## ARIMA(3,1,0)(2,0,0)[4] 
## 
## Coefficients:
##           ar1      ar2      ar3     sar1     sar2
##       -0.7130  -0.7592  -0.6759  -0.9345  -0.9191
## s.e.   0.0247   0.0228   0.0242   0.0133   0.0129
## 
## sigma^2 = 6.479:  log likelihood = -2299.23
## AIC=4610.46   AICc=4610.54   BIC=4639.75
## 
## Training set error measures:
##                     ME     RMSE      MAE      MPE     MAPE      MASE      ACF1
## Training set 0.9159938 2.537587 1.815966 2.913793 3.994283 0.2879627 0.2864708