When calculating annualized historical volatilities for all CRSP stocks by year for 1980-present, are we calculating the standard deviation from all DAILY returns in each year, and then annualizing it for EACH year for EACH stock? If so, the data I collected from CRSP comes to over 1.5 GB. Or am I not obtaining the right data? Or are we using the ANNUAL returns over the 1980-2017 period to obtain the standard deviation, and then getting the annualized volatility for EACH year for EACH stock?

You are calculating annualized volatilities from the daily stock returns for each year for each stock. Welcome to "Big Data"! 1.5 GB is a typical file size (and not that large), although you could break the data into chunks too. R, Python, or SAS should easily handle a file of this size.
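As a rough illustration of the chunking idea, the sketch below splits a data frame of daily returns by calendar year and computes one volatility per stock within each chunk. The column names `date` (assumed to hold YYYYMMDD integers), `PERMNO`, and `RET`, the helper name `vol_by_year()`, and the toy numbers are all assumptions for illustration; adjust them to match your own extract.

```r
# Sketch: process daily returns one year at a time.
# vol_by_year() and the toy data are hypothetical, for illustration only.
vol_by_year <- function(df) {
  df$year <- df$date %/% 10000            # e.g. 20070103 -> 2007
  chunks <- split(df, df$year)            # one data frame per calendar year
  out <- lapply(chunks, function(d) {
    # SD of daily log returns, scaled by sqrt(number of observations)
    v <- tapply(d$RET, d$PERMNO,
                function(x) sd(log(1 + x)) * sqrt(length(x)))
    data.frame(Year = d$year[1],
               Permno = as.numeric(names(v)),
               Volatility = as.numeric(v),
               row.names = NULL)
  })
  res <- do.call(rbind, out)              # stack the yearly results
  rownames(res) <- NULL
  res
}

# Toy data (made-up returns, not CRSP values)
toy <- data.frame(
  date   = c(20070103, 20070104, 20070105, 20080102, 20080103, 20080104),
  PERMNO = rep(10001, 6),
  RET    = c(0.01, -0.02, 0.015, 0.03, -0.01, 0.005)
)
res <- vol_by_year(toy)
print(res)
```

With one file per year on disk, the same per-chunk function could be applied to each file in turn instead of to pieces of one large data frame.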
There is a variable in CCM (Fundamentals Annual) called OPTVOL (implied volatility of options, a perfectly good measure); you get one value calculated at the end of each fiscal year. Unfortunately, the coverage is terrible. Since the purpose of the exercise is to get all volatilities in order to do a backtest, we have to calculate them from the daily returns (RET or RETX) for each stock for each year.
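Written out, the calculation performed by the R code on this page amounts to the following. The symbols are introduced here for illustration and are not CRSP notation: $i$ indexes stocks, $y$ years, $t$ trading days, and $N_{i,y}$ is the number of daily returns available for stock $i$ in year $y$.

```latex
r_{i,t} = \ln\left(1 + \mathrm{RET}_{i,t}\right), \qquad
s_{i,y} = \sqrt{\frac{1}{N_{i,y}-1} \sum_{t \in y} \left(r_{i,t} - \bar{r}_{i,y}\right)^{2}}, \qquad
\hat{\sigma}_{i,y} = s_{i,y}\,\sqrt{N_{i,y}}
```

Here $\bar{r}_{i,y}$ is the mean daily log return of stock $i$ in year $y$. For a stock that trades the whole year, $N_{i,y}$ is roughly 250, so the scaling is close to the common $\sqrt{252}$ convention.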
Here's a sample output for 2007 (all CRSP stocks):

     Permno Volatility
1     10001  0.3088975
2     10002  0.3955540
3     10025  0.3877490
4     10026  0.3544119
5     10028  0.7179534
6     10032  0.4761090
7     10042  0.7790567
8     10044  0.2877138
9     10051  0.4303495
10    10065  0.1560827
...
6242  92284  0.5507326
6243  92340  0.8069427
6244  92399  0.4418432
6245  92583  0.7863388
6246  92655  0.2011622
6247  92663  0.5360361
6248  92690  0.2143214
6249  92807  0.3065137
6250  92874  0.2747127
6251  93105  0.9251976
Here is some rather inelegant code used to produce this (8 lines, I hope you can improve):
x2007 <- read.csv("c:/temp/482/data/crsp.2007.ret.csv", na.strings=c("NA", "", "C", "B")) #eliminate missings
calcVol <- function(x){
  r <- log(1+x)
  sd(r)*sqrt(length(x))
}
tmp <- tapply(x2007$RET, as.factor(x2007$PERMNO), calcVol)
tmp <- tmp[!is.na(tmp)] #get rid of NA volatilities
vol <- data.frame(as.numeric(names(tmp)), as.numeric(tmp), row.names = NULL)
names(vol) <- c("Permno", "Volatility")
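One possible tidier variant of the code above, offered as a sketch rather than a definitive answer: it wraps the calculation in a function and uses `aggregate()` so the result comes back as a data frame directly. The function name `annual_vol()` and the toy data are hypothetical. Passing `na.action = na.pass` is a deliberate choice to keep the original behaviour, where a stock with any missing return gets an NA volatility and is dropped; the default `na.omit` would instead compute the volatility from the remaining returns.

```r
# Sketch: per-stock annualized volatility via aggregate().
# annual_vol() is a hypothetical helper name, not part of the original code.
annual_vol <- function(df) {
  calc_vol <- function(x) sd(log(1 + x)) * sqrt(length(x))
  # na.pass keeps missing returns, so affected stocks yield NA (as in the original)
  out <- aggregate(RET ~ PERMNO, data = df, FUN = calc_vol, na.action = na.pass)
  out <- out[!is.na(out$RET), ]           # drop stocks with NA volatility
  names(out) <- c("Permno", "Volatility")
  rownames(out) <- NULL
  out
}

# Toy data (made-up returns, not CRSP values); stock 2 has a missing return
toy <- data.frame(
  PERMNO = c(1, 1, 1, 2, 2, 2),
  RET    = c(0.01, -0.02, 0.015, 0.03, NA, -0.01)
)
vol <- annual_vol(toy)
print(vol)
```

In this toy run only stock 1 survives, because stock 2 has a missing return and so an NA volatility.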
