The dataset that I will be working on is called “Production”. I’ve downloaded it from the Fred website where we can find economic data for research. Our data provide the scope of monthly industrial Production of Electric and gas utilities. It is observed from 1939 until 2020. The unit that the data is measured with is the index 2012 =100 which measures the real output of production in the United States.
We will start by importing our data and checking its structure:
data=read.csv2("/Users/khouloud/Documents/Portfolio/R/Production.csv", sep = ",")
data[,2]=as.numeric(data[,2])
str(data)
## 'data.frame': 976 obs. of 2 variables:
## $ DATE: chr "01/01/1939" "01/02/1939" "01/03/1939" "01/04/1939" ...
## $ Prod: num 3.38 3.41 3.49 3.51 3.51 ...
As shown above, the data is composed of 976 instances and 2 variables. The first variable represents the monthly observations over the years and the second named “Prod” gives the yearly production of Electricity and Gaz.
Next, we have the different libraries we are going to use in our upcoming analysis.
library(tseries)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library(forecast)
library(RColorBrewer)
#if(!require(devtools)) install.packages("devtools")
#devtools::install_github("cran/TSA", build_vignettes = FALSE)
library(TSA)
## Registered S3 methods overwritten by 'TSA':
## method from
## fitted.Arima forecast
## plot.Arima forecast
##
## Attaching package: 'TSA'
## The following objects are masked from 'package:stats':
##
## acf, arima
## The following object is masked from 'package:utils':
##
## tar
library(fpp2)
## ── Attaching packages ────────────────────────────────────────────── fpp2 2.5 ──
## ✔ ggplot2 3.4.1 ✔ expsmooth 2.3
## ✔ fma 2.5
##
Before diving deeper into the ARIMA modeling, we will try to preprocess the data, understand it a bit further, and Visualize it.
#Data preprocessing
TimeS <- ts(data[,2],start=c(2005,1),frequency = 4)
class(TimeS)
## [1] "ts"
str(TimeS)
## Time-Series [1:976] from 2005 to 2249: 3.38 3.41 3.49 3.51 3.51 ...
To prepare our data for the analysis, I started by transforming the Prod variable into a numerical one since it was imported as a character. I also changed the class of our data into time series using the ts function. Our data is now ready for time series Analysis.
summary(TimeS)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.384 19.659 55.211 55.004 88.093 128.907
boxplot(TimeS, main="Electric Production",box=0.1)
The mean of the data is 55.004 however the min is 3.384 and the max is 128.907. Thus there is visible variation in our data and we will explore it further in the analysis. We can also observe from the boxplot that there are no outliers.
autoplot(TimeS)+ggtitle("Electric Production over the years")+ylab( "index 2012 =100")
From this plot we can deduce: * It is apparent that the data is not
stationary * The data has a steep upward trend which an increase in
electricity production over the last 60 years. This is expected given
the technological and demographic growth that the world in general and
the United States, in particular, have experienced. * The data also have
an increased variation at higher levels so it should be stabilized
acf(TimeS)
pacf(TimeS)[1:10]
##
## Partial autocorrelations of series 'TimeS', by lag
##
## 1 2 3 4 5 6 7 NA NA NA
## 0.504 -0.097 -0.057 0.068 0.014 -0.030 -0.016 NA NA NA
eacf(TimeS)
## AR/MA
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13
## 0 x x x x x x x x x x x x x x
## 1 x x x x x x x x x x x x x x
## 2 x x x x x x x x x x x x x x
## 3 x x o x o x o o o x x x x x
## 4 x x x x o x x o o o x x x x
## 5 x x o x o x o o o o o x x o
## 6 x x o x o x o x o o o x x o
## 7 x x o x x x o x o o o x x o
#Investigate seasonality
ggseasonplot(TimeS)+ggtitle("Seasonal Plot of Electrical Production")+ylab("Index 2012=100")
This graph plots each year individually which helps us identify any sort
of seasonality. For the first 30 years, we can observe that the data is
very stable and exudes little to no variation. However, after that,
there is a pattern in which the production fluctuates each year. A
downward sloêr around the star of the year followed by an upward. Around
august, the production decreases again to increase around November. This
pattern is repeatable in the last 40 years and even stronger in the last
10 years
==> This confirms the presence of seasonality in our data.
BoxCox.lambda(TimeS)
## [1] -0.6500561
autoplot(TimeS)
TTimeS= BoxCox(TimeS,lambda = 0.279542)
autoplot(BoxCox(TimeS,lambda = 0.279542))
The variance has been stabilized but the strong trend is still present and must be accounted for before we entertain a stationary model.
library(uroot)
adf.test(TTimeS)
## Warning in adf.test(TTimeS): p-value greater than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: TTimeS
## Dickey-Fuller = 0.1554, Lag order = 9, p-value = 0.99
## alternative hypothesis: stationary
DTTimeS=diff(diff(TTimeS))
acf(DTTimeS,lag.max=10)[1:10]
##
## Autocorrelations of series 'DTTimeS', by lag
##
## 1 2 NA NA NA NA NA NA NA NA
## -0.309 -0.309 NA NA NA NA NA NA NA NA
pacf(DTTimeS)
adf.test(DTTimeS)
## Warning in adf.test(DTTimeS): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: DTTimeS
## Dickey-Fuller = -19.093, Lag order = 9, p-value = 0.01
## alternative hypothesis: stationary
plot(DTTimeS)
eacf(DTTimeS
)
## AR/MA
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13
## 0 x x x x x x x x x x x x x x
## 1 x x x x x x x x x x x x x x
## 2 x x o x x x x o o x x x x x
## 3 x x o x o o o o o o x x x x
## 4 x x x x o x o o o o x x x o
## 5 x x x x x o o o o o o x x o
## 6 x x x x x o o x o o o x x x
## 7 x x x o x o x x o o o x x x
acf(DTTimeS ,lag.max=10)[1:10]
##
## Autocorrelations of series 'DTTimeS', by lag
##
## 1 2 NA NA NA NA NA NA NA NA
## -0.309 -0.309 NA NA NA NA NA NA NA NA
The trend is now removed but the variation does not appear to be constant across.
decomposedRes <- decompose(TimeS, type="mult")
plot (decomposedRes)
AdjTimeS=TimeS-decomposedRes$seasonal
plot(AdjTimeS)
fit=auto.arima(TimeS,d=1,stepwise=FALSE, approximation = FALSE)
print(summary(fit))
## Series: TimeS
## ARIMA(3,1,0)(2,0,0)[4]
##
## Coefficients:
## ar1 ar2 ar3 sar1 sar2
## -0.7130 -0.7592 -0.6759 -0.9345 -0.9191
## s.e. 0.0247 0.0228 0.0242 0.0133 0.0129
##
## sigma^2 = 6.479: log likelihood = -2299.23
## AIC=4610.46 AICc=4610.54 BIC=4639.75
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 0.9159938 2.537587 1.815966 2.913793 3.994283 0.2879627 0.2864708