Stat 410 HW 8 11-3-2005 D. Scott Due 11-10 (in class) HOMEWORK PROBLEMS BELOW OUR TAKE-HOME EXAM IS NOW SCHEDULED: Handed out: Thursday November 10 Due back: Tuesday November 15 (beginning of class...not end) Materials: Open book, computer, notes, homeworks. Note: This homework is not as long as it looks!!! If you have problems with the function you are asked to write, contact me for hints... In this file below is code for reading in the SENIC dataset in a particular fashion. The same is true for the liver data in Table 9.1, commercial properties, and body fat sets. Also given is a function which compute all of the criterion functions mentioned in Chapter 9: model.crit -- returns a list saying which variables are in the models tried and a corresponding criterion value (1-7 corresponding to the columns in Table 9.2) 1. Use this function on the data in Table 9.1 (CH09TA01.txt) and verify you obtain the values in Table 9.2 2. Write a new R or S function, crit.plot <- function(ans, crit=1), which takes the output of function model.crit and produces the appropriate graph in Figure 9.4. Also include crit=1 which is a plot of SSEp vs. p. Your figure should also include the lines shown. Show that you can reproduce Figure 9.4 for the liver data. 3. Use these two functions on the senic data below. Comment on the "best" model found the criteria 4-8. Are some variables common to all? 4. Same for the commercial properties data. 5. Same for the body fat data. 6. Using the senic data as below, BUT STANDARDIZED by using scale(y.sen) and scale(X.sen), produce a plot similar to Figure 11.3 for these variables. The datasets needed are in this homework directory. sue <- read.table("CH09TA01.txt") # surgical unit example y.sue <- as.numeric(sue[,10]) X.sue <- sue[,1:4] senic <- read.table("APPENC01.txt") y.sen <- senic[,2] X.tmp <- as.matrix( cbind( senic[,3:7], 2-senic[,8], ifelse( senic[,9]==1, 1, 0 ), ifelse( senic[,9]==2, 1, 0 ), ifelse( senic[,9]==3, 1, 0 ), senic[,10:12] ) ) X.sen <- matrix( as.numeric(X.tmp), 113, 12 ) tmp <- read.table("CH06PR18.txt") # Commercial properties y.cp <- as.numeric( tmp[,1] ) X.cp <- cbind( as.numeric(tmp[,2]), as.numeric(tmp[,3]), as.numeric(tmp[,4]), as.numeric(tmp[,5]/10^6) ) tmp <- read.table("CH07TA01.txt") # body fat data y.bf <- as.numeric( tmp[,4] ) X.bf <- cbind( as.numeric(tmp[,1]), as.numeric(tmp[,2]), as.numeric(tmp[,3]) ) # Compute 7 model criteria for all subsets # X is n by p-1 matrix (i.e. no column of 1's) model.crit <- function(X,y) { if( !is.matrix(X) ) {X <- cbind(X)} p <- 1 + ncol(X); n <- nrow(X); if(length(y)!=n) stop("y wrong length") if( !is.numeric(X) ) { X <- as.matrix(X) } K <- 2^(p-1) # number of models to consider (all with intercept) crit <- matrix(0,K,7) dimnames(crit) <- list(NULL, c("SSEp","Rp2","Rp2a","Cp","AICp","SBCp","PRESSp")) # which variables are in model (tricky) vars <- matrix(0,K,p-1); k <- seq(0,K-1) for( m in 1:(p-1) ) { i <- k %% (2^m); vars[,p-m] <- sign(i); k <- k-i } # compute criteria for each model # let pp be the "p" for the subset model ssto <- sum( (y-mean(y))^2 ) # total SS msep <- sum( lsfit(X,y)$residuals^2 ) / (n-p) # full model MSE for Cp for(k in 1:K) { # get right columns vb <- seq(p-1)[ vars[k,]==1 ]; x <- X[,vb]; pp <- 1+length(vb) if(k==1) { ei <- y-mean(y) } else { ei <- lsfit(x,y)$residuals } sse <- sum(ei^2) crit[k,1] <- sse crit[k,2] <- 1-sse/ssto crit[k,3] <- 1-((n-1)/(n-pp))*(sse/ssto) crit[k,4] <- sse/msep - (n-2*pp) crit[k,5] <- n*log(sse) - n*log(n) + 2*pp crit[k,6] <- n*log(sse) - n*log(n) + log(n)*pp U <- svd(cbind(1,x))$u; hii <- c( U^2 %*% rep(1,pp) ) crit[k,7] <- sum( (ei/(1-hii))^2 ) } list(vars=vars,crit=crit) }