Ernie Chan's book, Quantitative Trading, explains why cointegrated pairs of securities are useful for constructing mean-reverting trades. It also explains how to test pairs of securities for cointegration. Ernie uses Matlab, but some readers may want to use R, the software for statistical computing and graphics. This note explains how to perform the cointegration test using R.

Let's assume you have the history of daily prices for two stocks, GLD and GDX. You want to know if the prices are cointegrated. Let's also assume the data are captured in two files,

`GLD.csv`

and `GDX.csv`

,
which are
in comma-separated value (CSV) format with seven columns:
date, open, high, low, close, volume, and adjusted close.Two notes before we begin.

- This is not a tutorial on R. (Sorry.) You can learn about R from its official web site, from An Introduction to R, from one of the tutorials, or from one of the available books.
- This is also not a tutorial on cointegration. For that I recommend Ernie Chan's book; or, for the more mathematically inclined, the book by Bernhard Pfaff, Analysis of Integrated and Cointegrated Time Series with R. Both books are excellent.

Once your data are captured in a zoo object, say t, it behaves like a data frame. One zoo object can contain several columns, where each column is a different time series and each row is a set of (simultaneous) observations of those series. The object provides an additional attribute,

`index(t)`

,
which is the vector of dates, one date per observation. The
first and last dates are available using `start(t)`

and `end(t)`

,
respectively.- Read the two files into two data frames.
- Convert the dates from strings into Date objects.
- Convert the two data frames into two zoo objects.
- Take the intersection of the two zoo objects. That will create one zoo object with the observations common to both datasets.

library(zoo)
# Load
the zoo package

# Read the CSV files into data frames

#

gld <- read.csv("GLD.csv", stringsAsFactors=F)

gdx <- read.csv("GDX.csv", stringsAsFactors=F)

# The first column contains dates. The as.Date

# function can convert strings into Date objects.

#

gld_dates <- as.Date(gld[,1])

gdx_dates <- as.Date(gdx[,1])

# The seventh column contains the adjusted close.

# We use the zoo function to create zoo objects from that data.

# The function takes two arguments: a vector of data and

# a vector of dates.

#

gld <- zoo(gld[,7], gld_dates)

gdx <- zoo(gdx[,7], gdx_dates)

# The merge function can combine two zoo objects,

# computing either their intersection (all=FALSE)

# or union (all=TRUE).

#

t.zoo <- merge(gld, gdx, all=FALSE)

# At this point, t.zoo is a zoo object with two columns: gld and gdx.

# Most statistical functions expect a data frame for input,

# so we create a data frame here.

#

t <- as.data.frame(t.zoo)

# Tell the user what dates are spanned by the data.

#

cat("Date range is", format(start(t.zoo)), "to", format(end(t.zoo)), "\n")

# Read the CSV files into data frames

#

gld <- read.csv("GLD.csv", stringsAsFactors=F)

gdx <- read.csv("GDX.csv", stringsAsFactors=F)

# The first column contains dates. The as.Date

# function can convert strings into Date objects.

#

gld_dates <- as.Date(gld[,1])

gdx_dates <- as.Date(gdx[,1])

# The seventh column contains the adjusted close.

# We use the zoo function to create zoo objects from that data.

# The function takes two arguments: a vector of data and

# a vector of dates.

#

gld <- zoo(gld[,7], gld_dates)

gdx <- zoo(gdx[,7], gdx_dates)

# The merge function can combine two zoo objects,

# computing either their intersection (all=FALSE)

# or union (all=TRUE).

#

t.zoo <- merge(gld, gdx, all=FALSE)

# At this point, t.zoo is a zoo object with two columns: gld and gdx.

# Most statistical functions expect a data frame for input,

# so we create a data frame here.

#

t <- as.data.frame(t.zoo)

# Tell the user what dates are spanned by the data.

#

cat("Date range is", format(start(t.zoo)), "to", format(end(t.zoo)), "\n")

The as.Date function assumes your date strings are formatted like yyyy-mm-dd, which is the ISO standard format. If your strings use the common US format of mm/dd/yy, include a format specification.

gld_dates
<- as.Date(gld[,1], format="%m/%d/%y")

gdx_dates <- as.Date(gdx[,1], format="%m/%d/%y")

gdx_dates <- as.Date(gdx[,1], format="%m/%d/%y")

The spread is defined this way.

S
= y
- (β × x_{)}

where β is the hedge ratio, calculated using ordinary least squares (OLS). Rearranging terms, we want to find the value of β which best fits this equation.

y
= (-β) × x

This is a simple linear equation with no y intercept. In R, the

`lm`

function
can fit a linear model such as this.#
The lm function
builds linear regression models using OLS.

# We build the linear model, m, forcing a zero intercept,

# then we extract the model's first regression coefficient.

#

m <- lm(gld ~ gdx + 0, data=t)

beta <- coef(m)[1]

cat("Assumed hedge ratio is", beta, "\n")

# Now compute the spread

#

sprd <- t$gld - beta*t$gdx

# We build the linear model, m, forcing a zero intercept,

# then we extract the model's first regression coefficient.

#

m <- lm(gld ~ gdx + 0, data=t)

beta <- coef(m)[1]

cat("Assumed hedge ratio is", beta, "\n")

# Now compute the spread

#

sprd <- t$gld - beta*t$gdx

The first argument to

`lm`

is a formula, which specifies
the linear
model. The formula `gld ~ gdx + 0`

says the model isGLD_{i}
= β×GDX_{i}
+ ε_{i}

(If we omit "+ 0" from the formula, R would fit a y intercept, too.)

`adf.test`

function which is implemented in the `tseries`

package. The

function
returns an object which
contains the test results. In particular, it contains the p-value
that we want.library(tseries)
# Load
the tseries package

# Setting alternative="stationary" chooses the appropriate test.

# Setting k=0 forces a basic (not augmented) test. See the

# documentation for its full meaning.

#

ht <- adf.test(sprd, alternative="stationary", k=0)

cat("ADF p-value is", ht$p-value, "\n")

# Setting alternative="stationary" chooses the appropriate test.

# Setting k=0 forces a basic (not augmented) test. See the

# documentation for its full meaning.

#

ht <- adf.test(sprd, alternative="stationary", k=0)

cat("ADF p-value is", ht$p-value, "\n")

Setting

alternative="stationary"
is important.- To a statistician, it specifies a null hypothesis that the spread is non-stationary or explosive.
- For everyone else, it means that if the p-value is small, the spread is mean-reverting. By "small," I mean less than 0.10 or less than 0.05, depending how fussy you are. (Smaller is better.)

#
The ht object
contains the p-value from the ADF test.

# The p-value is the probability that the spread is NOT

# mean-reverting. Hence, a small p-value means it is very

# improbable that the spread is NOT mean-reverting

# (got that?).

#

if (ht$p.value < 0.05) {

cat("The spread is likely mean-reverting.\n")

} else {

cat("The spread is not mean-reverting.\n")

}

# The p-value is the probability that the spread is NOT

# mean-reverting. Hence, a small p-value means it is very

# improbable that the spread is NOT mean-reverting

# (got that?).

#

if (ht$p.value < 0.05) {

cat("The spread is likely mean-reverting.\n")

} else {

cat("The spread is not mean-reverting.\n")

}

One word of caution: The

`adf.test`

function essentially detrends your data before testing for
stationarity. If your data contains a strong trend, you might
be very surprised to learn it is "mean reverting" when it is obvously
moving upward or downward. If this is a problem for you, consider
the fUnitRoots package
which contains the `adfTest`

function (note the spelling!).
That function lets you analyze either with or without the trend
assumption.library(zoo)

library(tseries)

gld <- read.csv("GLD.csv", stringsAsFactors=F)

gdx <- read.csv("GDX.csv", stringsAsFactors=F)

gld <- zoo(gld[,7], as.Date(gld[,1]))

gdx <- zoo(gdx[,7], as.Date(gdx[,1]))

t.zoo <- merge(gld, gdx, all=FALSE)

t <- as.data.frame(t.zoo)

cat("Date range is", format(start(t.zoo)), "to", format(end(t.zoo)), "\n")

m <- lm(gld ~ gdx + 0, data=t)

beta <- coef(m)[1]

cat("Assumed hedge ratio is", beta, "\n")

sprd <- t$gld - beta*t$gdx

ht <- adf.test(sprd, alternative="stationary", k=0)

cat("ADF p-value is", ht$p.value, "\n")

if (ht$p.value < 0.05) {

cat("The spread is likely mean-reverting\n")

} else {

cat("The spread is not mean-reverting.\n")

}

library(tseries)

gld <- read.csv("GLD.csv", stringsAsFactors=F)

gdx <- read.csv("GDX.csv", stringsAsFactors=F)

gld <- zoo(gld[,7], as.Date(gld[,1]))

gdx <- zoo(gdx[,7], as.Date(gdx[,1]))

t.zoo <- merge(gld, gdx, all=FALSE)

t <- as.data.frame(t.zoo)

cat("Date range is", format(start(t.zoo)), "to", format(end(t.zoo)), "\n")

m <- lm(gld ~ gdx + 0, data=t)

beta <- coef(m)[1]

cat("Assumed hedge ratio is", beta, "\n")

sprd <- t$gld - beta*t$gdx

ht <- adf.test(sprd, alternative="stationary", k=0)

cat("ADF p-value is", ht$p.value, "\n")

if (ht$p.value < 0.05) {

cat("The spread is likely mean-reverting\n")

} else {

cat("The spread is not mean-reverting.\n")

}

When I ran this code on recent data, I got this output.

```
Date
range is 2006-05-23 to 2009-08-10
```

Assumed hedge ratio is 1.928096

ADF p-value is 0.2609909

The spread is not mean-reverting.

`stringsAsFactors=FALSE`

for `read.csv`

?"),
use the help facility to learn more about the function:```
>
?read.csv
```

These notes assume your data is captured inside CSV files. I did that just to mirror the example in Ernie Chan's book. R can read its data from many places. To learn more about R's input/output in general, start with the R Data Import/Export manual. To learn more about loading financial data in particular, visit the web site for the quantmod package, an excellent source of tools for anyone working with financial data.

R's input/output is so flexible, in fact, that you can read data directly from web sites, such as the Yahoo! Finance web site. Just use a URL instead of a file name, like this.

```
gld
<-
read.csv("http://ichart.finance.yahoo.com/table.csv?s=GLD&ignore=.csv",
stringsAsFactors=F)
```

gdx <-
read.csv("http://ichart.finance.yahoo.com/table.csv?s=GDX&ignore=.csv",
stringsAsFactors=F)

This eliminates the extra step of saving the data to an intermediate file.

I used the

`adf.test`

function, above, to test for stationarity. The ADF test is
implemented
in several other packages, too, such as the fUnitRoots and
urca packages. Bernhard
Pfaff's book contains an entire chapter on
tests for unit roots, including tests beyond the original ADF test.Finally, high-powered statisticians will be offended by my statement that "the p-value [of the ADF test] is the probability that the spread is not mean-reverting." Just for the record, here is the strict interpretation: "The p-value is the probability that we could have observed the spread data, given the assumption that the spread is not mean-reverting."