When calculating annualized historical volatilities for all CRSP stocks by year for 1980-present, are we calculating the standard deviation using all DAILY returns in each year, and then get annualized volatility for EACH year for EACH stock? If yes, from what I collected from CRSP, I got over 1.5GB of data. Or am I not obtaining the right data? Or are we using the ANNUAL returns throughout this 1980-2017 time period to obtain the standard deviation, and then get the annualized volatility for EACH year for EACH stock?


You are calculating annualized volatilities from the daily stock returns for each year for each stock. Welcome to "Big Data"! 1.5Gb is typical file size (and not that large), although you could break the data into chunks too. R/Python/SAS should easily handle file this size.
There is a variable in CCM (Fundamentals Annual) called OPTVOL. (Implied volatility of options, prefectly good measure), you get one value calculated at the end of each fiscal year. Unfortunately, the coverage is terrible. So since the purpose of the exercise is to get all volatilities in order to do a backtest, then this means we have to calculate from the daily (RET or RETX) for each stock for each year.
Here's a sample output for 2007 (all CRSP stocks):

Permno Volatility
1 10001 0.3088975
2 10002 0.3955540
3 10025 0.3877490
4 10026 0.3544119
5 10028 0.7179534
6 10032 0.4761090
7 10042 0.7790567
8 10044 0.2877138
9 10051 0.4303495
10 10065 0.1560827
...
6242 92284 0.5507326
6243 92340 0.8069427
6244 92399 0.4418432
6245 92583 0.7863388
6246 92655 0.2011622
6247 92663 0.5360361
6248 92690 0.2143214
6249 92807 0.3065137
6250 92874 0.2747127
6251 93105 0.9251976

Here is some rather inelegant code used to produce this (8 lines, I hope you can improve):

x2007 <- read.csv("c:/temp/482/data/crsp.2007.ret.csv", na.strings=c("NA", "", "C", "B")) #eliminate missings
calcVol <- function(x){
r <- log(1+x)
sd(r)*sqrt(length(x))
}
tmp <- tapply(x2007$RET, as.factor(x2007$PERMNO), calcVol)
tmp <- tmp[!is.na(tmp)] #get rid of NA volatilities
vol <- data.frame(as.numeric(names(tmp)), as.numeric(tmp), row.names = NULL)
names(vol) <- c("Permno", "Volatility")


Previous Back to STAT 482/682 Assignments FAQ Next