Stat 410    Final HW 9    12-1-2005    D. Scott
Due 12-14 (sign in to DH 1088)

HOMEWORK PROBLEMS BELOW

Materials: Open book, computer, notes, homeworks.
Time Limit: 3 hours
Sign Pledge

Note: Two problems. Each problem is worth 10 percent.

1. Consider the baseball salary dataset in this directory. These are the
   active players in Major League Baseball (excluding pitchers) for the
   2000 season. The data consist of 439 players and 12 variables. The
   variables are:

      1  Number   - number on player's uniform
      2  Player   - name of player (no pitchers)
      3  Team     - which of the 30 baseball teams
      4  Position - C 1B 2B SS 3B RF CF LF DH (DH = designated hitter)
      5  Salary   - dollars for 2000 season
      6  Runs     - number of runs scored
      7  Hits     - number of hits
      8  HR       - number of home runs
      9  RBI      - number of runs batted in
     10  SB       - number of stolen bases
     11  CS       - number of times caught stealing
     12  BB       - number of walks (base on balls)

   a. Use the transformation log(1+x) for variables 1, 6, 7, 8, 9, 10, 11,
      and 12. Fit a linear model to log(salary). Comment on the fit and
      on which variables are significant.

   b. "Manually" remove the least significant variable and refit. Continue
      until all remaining variables have a p-value less than 0.05. Comment
      on the final set of variables and on the overall model.

   c. Using all the variables in part (a) again, add the "position"
      variable. (I have coded these in the file X2.dat; the DH position
      is coded as all zeroes.) Do any of the positions matter?

   d. Using all the variables in part (a) again, add the "team" variable.
      (I have coded these in the file X1.dat; TORonto is coded as all
      zeroes.) Do any of the teams matter?

   e. Consider variable 9, RBIs, without the log transform. It is
      reasonable to believe it is a Poisson random variable whose mean
      depends on the other variables. Use the log-transformed variables
      1, 5-8, 10-12 and fit a generalized linear model. Which variables
      seem predictive? Comment.

2. For the dataset which you have collected (with n>30 and p>3), run a
   logistic regression (GLM with binomial family and logit link). Email
   the data set to me. Include the exact source of the data and anything
   of interest about it. What are your findings?
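
As an illustration of what the logistic regression in Problem 2 computes, here is a minimal sketch in Python (an assumption; the handout names no software). It fits a binomial GLM with logit link by Newton-Raphson, which for this model is the same as iteratively reweighted least squares. The data are synthetic and stand in for a collected dataset. The names fit_logistic and solve, the seed, and the coefficients in true_beta are all made up for the example. In practice a statistics package's GLM routine would do this fit and report the same kind of estimates, standard errors, and z-statistics.

```python
import math
import random


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_logistic(X, y, iters=25, tol=1e-10):
    """Fit logistic regression by Newton-Raphson.

    X is a list of rows, each starting with 1.0 for the intercept;
    y is a list of 0/1 responses. Returns (estimates, standard errors).
    """
    p = len(X[0])
    beta = [0.0] * p
    for _ in range(iters):
        grad = [0.0] * p                      # score vector X'(y - mu)
        hess = [[0.0] * p for _ in range(p)]  # information matrix X'WX
        for row, yi in zip(X, y):
            eta = sum(b * v for b, v in zip(beta, row))
            mu = 1.0 / (1.0 + math.exp(-eta))
            w = mu * (1.0 - mu)
            for j in range(p):
                grad[j] += (yi - mu) * row[j]
                for k in range(p):
                    hess[j][k] += w * row[j] * row[k]
        step = solve(hess, grad)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    # Standard errors from the inverse information matrix, evaluated at
    # the iterate just before the final (negligible) step.
    se = []
    for j in range(p):
        e = [1.0 if k == j else 0.0 for k in range(p)]
        se.append(math.sqrt(solve(hess, e)[j]))
    return beta, se


# Synthetic data: n = 300 cases, intercept plus 4 predictors.
random.seed(410)                          # arbitrary seed
true_beta = [-0.5, 1.0, -1.0, 0.5, 0.0]   # made-up coefficients
X, y = [], []
for _ in range(300):
    row = [1.0] + [random.gauss(0.0, 1.0) for _ in range(4)]
    eta = sum(b * v for b, v in zip(true_beta, row))
    prob = 1.0 / (1.0 + math.exp(-eta))
    X.append(row)
    y.append(1 if random.random() < prob else 0)

beta_hat, se = fit_logistic(X, y)
for j, (b, s) in enumerate(zip(beta_hat, se)):
    print(f"beta[{j}] = {b: .3f}   se = {s:.3f}   z = {b / s: .2f}")
```

Each printed z value is the estimate divided by its standard error; a predictor with |z| well above 2 would usually be reported as significant at roughly the 0.05 level.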