## Stock problem definition

In this article, we will address the problem of finding stocks that move in a similar matter over a certain time frame. In order to do this properly some of the main statistical methods that we need to use are the following:

- Linear regression
- Serial correlation
- Stationarity test
- Cointegration test

Each of the above-mentioned procedures will help us to uncover the real relationship between two or more stocks. For the showcase of these procedures, we will use the Box and Dropbox stocks.

The analysis that we’re going to perform can inform your domain knowledge when setting up a pair’s trading strategy. You can read more about it here, and you can also check out our cluster analysis article.

If you’re unfamiliar with the R basics or would like a refresher on it, I’d advise skimming through some bits in this article. We will use R studio without any fancy code for our analysis, and I’ll explain it bit by bit.

## How to download stock data with R?

Stock data can be downloaded in many ways with R and we will use the tidyquant library that was built for handling financial data. If the library isn’t installed you can obtain it with the following command:

`install.packages("tidyquant")`

You’ll see your console being populated and if it gets cluttered at any moment you can press CTRL + L to clear it out. Now that the library is installed let’s load it and obtain stock data for Box and Dropbox.

```
library(tidyquant)
boxx <- tq_get('BOX',
from = '2020-01-01',
to = '2021-06-06',
get="stock.prices")
dropbox <- tq_get('DBX',
from = '2020-01-01',
to = '2021-06-06',
get="stock.prices")
```

Take note: as “box” is a function in R, we named it “boxx”. Now let’s print out the first few columns of our data so you can see what we obtained and graph our stocks.

```
library(ggplot2)
boxx %>%
ggplot(aes(x = date, y = adjusted)) +
geom_line() +
theme_classic() +
labs(x = 'Date',
y = "Adjusted Price",
title = "Box price chart") +
scale_y_continuous(breaks = seq(0,300,10))
dropbox %>%
ggplot(aes(x = date, y = adjusted)) +
geom_line() +
theme_classic() +
labs(x = 'Date',
y = "Adjusted Price",
title = "Dropbox price chart") +
scale_y_continuous(breaks = seq(0,300,10))
```

Just by eye-balling, we can state that the stocks have indeed moved in a similar fashion. But our eyes have fooled us enough times in our lives to trust them, especially when it comes to financial matters.

On the bottom right side of your interface, you can browse through the plots. Now, let’s merge the data, rename the columns, and then take what we need from it into a new data frame for analysis.

```
# Merge
merged <- merge(boxx, dropbox, by=c("date"))
head(merged)
# Subset
df = subset(merged, select=c("date","adjusted.x","adjusted.y"))
# Rename
names(df)[names(df) == "adjusted.x"] <- "box"
names(df)[names(df) == "adjusted.y"] <- "dropbox"
head(df)
```

## How to save data to CSV with R?

To export a data frame to a CSV file in R, you will need to use the `write.csv()`

command where you specify the data frame and the directory path.

To obtain your default directory path you can write the following:

```
dir <- getwd()
dir
```

And now we save our data:

`write.csv(df, "box_dropbox.csv")`

To load it again simply write the following:

```
df <- read.csv("box_dropbox.csv")
# Delete index created by loading
df <- df[,-1]
```

## What is the p-value?

The p-value s the probability of obtaining test results at least as extreme as the results actually observed, under the assumption that the null hypothesis is correct.

The p-value helps us to determine the significance of our results. By convention, if the p-value is less than 0.05 (5%), we have strong evidence to reject the null hypothesis and vice-versa.

As this can be an article for itself let’s put it this way: the lower our p-value is, the more surprising our evidence is, and the more ridiculous is the null hypothesis.

## How to do a linear regression on stock data with R?

Linear regression on stock data can be done with R by using the in-built `lm()`

function and stating the appropriate variables. In our case, we will assume that the relationship between our stocks is linear.

Now, let us attach our dataset so R knows that we will pull the data from it and run the linear regression model on our stocks. We will also print out the summary statistics of our regression model.

```
attach(df)
reg <- lm(box~dropbox)
summary(reg)
```

If you look at the “Estimate” it tells us how much our dependent variable changes (Box) for each unit increase ($). In this case for each 1$ increase in Dropbox the price of Box changes by $0.807.

Below you can see that the p-value is really low and that we can reject the null hypothesis aka it is unlikely that the results above are due to random chance.

The F-statistic shows if there is a relationship between our predictor (Dropbox) and the response variable. The further it is away from 1 the better the case it makes for rejecting the null hypothesis.

The R-squared states that the movement in Dropbox explains 76% variance in Box. Let’s plot our linear regression to see the fit and check it for homoscedasticity.

```
par(mfrow=c(2,2))
plot(reg)
```

The scale location plot is used to check the homogeneity of variance of the residuals (homoscedasticity). A horizontal line with equally spread points is a reliable indication of homoscedasticity. In our case, it isn’t too bad.

The Residuals vs Fitted plot help us to check the linear relationship assumption. If the red line is horizontal or close to being one, the linear relationship is true.

In our case, it has a pattern and this isn’t surprising as we’re using a simple model on stock data. The Normal Q-Q plot shows if the data is normally distributed. If it follows a straight dashed line it is a good indicator.

And finally, the Residuals vs Leverage plot shows us if we have any extreme values that might skew our linear regression results.

Now that you know some basics about the regression, reverse the variables and see what you get. What does the data tell you? For a look at how to do some of this in python, you can check out his article.

## How to do a Granger causality test in R?

To perform a Granger causality test in R you will need to use the lmtest library and the `grangertest()`

function in which you specify the variables and lags.

Let’s do it for our data:

```
library(lmtest)
grangertest(box ~ dropbox, order = 3, data=df)
```

As you can see the p-value and F statistic support the rejection of the null hypothesis and thus we can say that the movement in Dropbox can predict Box. But have in mind the downfalls of the p-value and that it isn’t too small.

Now try to reverse the variables and see what you get.

## How to do a Serial Correlation test with R?

A Serial Correlation test can be done with R by using the Durbin Watson test that comes stored as a `dw()`

function from the lmtest library. As a parameter you pass the model that you want to test for serial correlation.

We want to do this test as a supplement to our linear regression so we will check our model for serial correlation. This is done as the linear regression might have picked up the noise from the correlated error terms in our time series.

`dwtest(reg)`

```
Result:
data: reg
DW = 0.16026, p-value < 2.2e-16
alternative hypothesis: true autocorrelation is greater than 0
```

The value of the DW test shows that the alternative hypothesis is true and the p-value is almost non-existent. This means that we indeed have a case of serial correlation in our data.

If you look at the residuals from our regression you will see that we have an overall positive trend. The residual is the difference between the observed value and the predicted one.

```
residuals = reg$residuals
plot(residuals, type="l")
```

And if we do a lagged regression on the residuals we can see how they impacted the main model.

```
sub_box <- subset(df, select=c("date","box"))
sub_dropbox <- subset(df, select=c("date","dropbox"))
d_box = diff(as.numeric(unlist(sub_box["box"])))
d_dbox = diff(as.numeric(unlist(sub_dropbox["dropbox"])))
lagged_reg <- lm(d_box~d_dbox)
summary(lagged_reg)
lagged_reg_res = lagged_reg$residuals
plot(lagged_reg_res, type="l")
```

## How to remove Serial Correlation from a model in R?

To eliminate Serial Correlation in your model you can do the Cochrane-Orcutt test. This can be done by using the `cochrane.orcutt()`

function that is a part of the orcutt package.

Let us run it on our model and see the diffrence:

```
#install.packages("orcutt")
library(orcutt)
co <- cochrane.orcutt(reg)
summary(co)
dwtest(co)
```

As you can see, we have quite an improvement on the serial correlation front. The results of the test show a lower estimate value (0.43) from the prior 0.80 one. The Durbin-Watson test isn’t significant anymore so we have eliminated autocorrelation.

Now, let us take the residuals from the first regression and the lagged one so you can see the correlation between each lag. This will get us a sense of stationarity that we will test later on.

```
acf_residual_reg = acf(residuals)
acf_lag_residual_reg = acf(lagged_reg_res)
acf_residual_reg
acf_lag_residual_reg
```

On the left chart, you can see some substantial correlation between the residuals, and when we do a regression on them the correlations (right chart) have dropped which we want to be the case.

If you need a practical guide to correlations you can check out this article.

But there is one thing that we need to have in mind and in states that the Cochrane-Orcutt test needs our time series to have constant mean, variance, and to be stationary.

## How to test for Stationarity with R?

Testing for Stationarity can be done by using several tests in R. Some of the tests are the following: ADF, KPSS, Philips-Peron, Zivot-Andrews, and more.

Let’s check out what our data shows us and run some of the mentioned tests in order:

```
#install.packages("egcm")
#install.packages("tseries")
library(egcm)
library(tseries)
adf.test(as.numeric(unlist(sub_box["box"])))
adf.test(as.numeric(unlist(sub_dropbox["dropbox"])))
adf.test(d_box)
adf.test(d_dbox)
```

Here we can observe that our data is non-stationary because we can’t reject the null hypothesis. But when we do a test on the difference of our stock data we obtain the opposite.

When we run the Philips-Peron test, it confirms the results of the previous one.

```
pp.test(as.numeric(unlist(sub_box["box"])))
pp.test(as.numeric(unlist(sub_dropbox["dropbox"])))
pp.test(d_box)
pp.test(d_dbox)
```

And finally, the KPSS test confirms it once again. The null hypothesis of the KPSS test states that the data IS stationary.

```
kpss.test(as.numeric(unlist(sub_box["box"])), null="Trend")
kpss.test(as.numeric(unlist(sub_dropbox["dropbox"])), null="Trend")
```

## What is Cointegration?

Cointegration can be viewed as a long-run relationship between two or more variables. For example, if two stocks tend to follow the same trends over enough time, we can say that they are cointegrated.

You can imagine it like walking an overly curious dog on a leash. The dog will move around and go from side to side, but he won’t ever move too much from you as he is on a leash. Well, that is cointegration.

## How to test for Cointegration with R?

Testing for Cointegration with R can be done by using the Engle-Granger test that comes as a part of the egcm package.

Let’s see how our two stocks do on the cointegration front:

```
egcm(as.numeric(unlist(sub_box["box"])), as.numeric(unlist(sub_dropbox["dropbox"])))
egcm(d_box, d_dbox)
plot(egcm(as.numeric(unlist(sub_box["box"])), as.numeric(unlist(sub_dropbox["dropbox"]))))
plot(egcm(d_box, d_dbox))
```

As we can observe, the pairs seem to be cointegrated while their differences, which I added out of curiosity, seem not to be.

Now let’s look at the plots that will shed more light on how the cointegration looks like: