Stat 410    HW 8    11-3-2005    D. Scott    Due 11-10 (in class)


HOMEWORK PROBLEMS BELOW


OUR TAKE-HOME EXAM IS NOW SCHEDULED:
    Handed out:  Thursday November 10
    Due back:    Tuesday  November 15 (beginning of class...not end)
    Materials:   Open book, computer, notes, homeworks.


Note:  This homework is not as long as it looks!!!
       If you have problems with the function you
       are asked to write, contact me for hints...


In this file below is code for reading in the SENIC
dataset in a particular fashion.  The same is true
for the liver data in Table 9.1, commercial properties,
and body fat sets.

Also given is a function which compute all of the criterion
functions mentioned in Chapter 9:

	  model.crit -- returns a list saying which variables
			are in the models tried and a corresponding
			criterion value (1-7 corresponding to
			the columns in Table 9.2)


1.  Use this function on the data in Table 9.1 (CH09TA01.txt)
    and verify you obtain the values in Table 9.2

2.  Write a new R or S function,
	  crit.plot <- function(ans, crit=1),
    which takes the output of function  model.crit   and
    produces the appropriate graph in Figure 9.4.
    Also include  crit=1  which is a plot of SSEp vs. p.
    Your figure should also include the lines shown.
    Show that you can reproduce Figure 9.4 for the liver data.

3.  Use these two functions on the senic data below.
    Comment on the "best" model found the criteria 4-8.
    Are some variables common to all?

4.  Same for the commercial properties data.

5.  Same for the body fat data.

6.  Using the senic data as below, BUT STANDARDIZED by using
    scale(y.sen) and scale(X.sen), produce a plot similar
    to Figure 11.3 for these variables.


The datasets needed are in this homework directory.


sue <- read.table("CH09TA01.txt")	 # surgical unit example
y.sue <- as.numeric(sue[,10])
X.sue <- sue[,1:4]


senic <- read.table("APPENC01.txt")
y.sen <- senic[,2]
X.tmp <- as.matrix( cbind( senic[,3:7], 2-senic[,8],
	ifelse( senic[,9]==1, 1, 0 ),
	ifelse( senic[,9]==2, 1, 0 ),
	ifelse( senic[,9]==3, 1, 0 ),
	senic[,10:12] ) )
X.sen <- matrix( as.numeric(X.tmp), 113, 12 )

tmp <- read.table("CH06PR18.txt")	 # Commercial properties
y.cp <- as.numeric( tmp[,1] )
X.cp <- cbind( as.numeric(tmp[,2]), as.numeric(tmp[,3]), 
	       as.numeric(tmp[,4]), as.numeric(tmp[,5]/10^6) )


tmp <- read.table("CH07TA01.txt")   # body fat data
y.bf <- as.numeric( tmp[,4] )
X.bf <- cbind( as.numeric(tmp[,1]), as.numeric(tmp[,2]), 
	       as.numeric(tmp[,3]) )


#	Compute 7 model criteria for all subsets

#	X is n by p-1 matrix  (i.e.  no column of 1's)

model.crit <- function(X,y) {
  if( !is.matrix(X) ) {X <- cbind(X)}
  p <- 1 + ncol(X); n <- nrow(X); if(length(y)!=n) stop("y wrong length")
  if( !is.numeric(X) ) { X <- as.matrix(X) }

  K <- 2^(p-1)		# number of models to consider (all with intercept)
  crit <- matrix(0,K,7)
  dimnames(crit) <- list(NULL,
	c("SSEp","Rp2","Rp2a","Cp","AICp","SBCp","PRESSp"))

			# which variables are in model (tricky)
  vars <- matrix(0,K,p-1); k <- seq(0,K-1)
  for( m in 1:(p-1) ) { i <- k %% (2^m); vars[,p-m] <- sign(i); k <- k-i }

			# compute criteria for each model
			# let pp be the "p" for the subset model

  ssto <- sum( (y-mean(y))^2 )			# total SS
  msep <- sum( lsfit(X,y)$residuals^2 ) / (n-p)	# full model MSE for Cp
  for(k in 1:K) {				# get right columns
    vb <- seq(p-1)[ vars[k,]==1 ]; x <- X[,vb]; pp <- 1+length(vb)
    if(k==1) { ei <- y-mean(y) } else { ei <- lsfit(x,y)$residuals }
    sse <- sum(ei^2)

    crit[k,1] <- sse
    crit[k,2] <- 1-sse/ssto
    crit[k,3] <- 1-((n-1)/(n-pp))*(sse/ssto)
    crit[k,4] <- sse/msep - (n-2*pp)
    crit[k,5] <- n*log(sse) - n*log(n) + 2*pp
    crit[k,6] <- n*log(sse) - n*log(n) + log(n)*pp
    U <- svd(cbind(1,x))$u;  hii <- c( U^2 %*% rep(1,pp) )
    crit[k,7] <- sum( (ei/(1-hii))^2 )
  }
  list(vars=vars,crit=crit)
}