Statistics 410: Regression

Spring 2002

 

 

Data Project 1: Correlation and Simple Linear Regression

Due: Thursday, January 31

 

 

This project involves data collected by Dr. Jonathan Cohen, University of Texas Southwestern Medical Center, to investigate genetic factors underlying various cholesterol components. In this project you will go through some of the steps taken by statistical geneticists at early stages of data analysis to investigate complex relationships between genetic traits and genetic factors (genetic markers, genes). In addition to addressing substantive issues, this project is also meant to introduce you to some basic operations in S-Plus.

 

Report

 

See the handout on writing reports. This project is limited to a total of 5 pages, excluding the title page.

 

Data

 

The data set consists of nuclear family data: each family includes two parents and their biological children. The data will be emailed to you as a text file. The variables are defined as follows.

 

fam: family identification (ID)     

ind: individual ID within family

gen: generation (1=parent; 2=child)

sex: sex of individual (1=male; 2=female)

age: age in years

tc: total cholesterol concentration (mg/dL)

tg: triglyceride (mg/dL)

ldl: low density lipoprotein cholesterol (“bad cholesterol”)

apoe1: apolipoprotein-E allele

apoe2: apolipoprotein-E allele
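
As a rough illustration of how records with these variables might be read from a text file, here is a minimal sketch in Python (the course itself uses S-Plus, where the same steps apply). The file layout is an assumption: whitespace-delimited columns with a header row naming the variables above. The sample rows and the helper name read_records are hypothetical and are not taken from the course data.

```python
import io

# Hypothetical sample in the ASSUMED layout: whitespace-delimited, header first.
# These values are invented for illustration only.
SAMPLE = """\
fam ind gen sex age tc tg ldl apoe1 apoe2
1 1 1 1 45 210 150 130 3 3
1 2 1 2 43 190 110 115 3 4
1 3 2 1 15 160 80 95 3 4
"""

CODED = ("fam", "ind", "gen", "sex")   # identifiers and categorical codes
MEASURED = ("age", "tc", "tg", "ldl")  # numeric measurements


def read_records(handle):
    """Parse whitespace-delimited rows into a list of dicts keyed by variable."""
    header = handle.readline().split()
    records = []
    for line in handle:
        fields = line.split()
        if not fields:  # skip blank lines
            continue
        row = dict(zip(header, fields))
        for name in CODED:
            row[name] = int(row[name])
        for name in MEASURED:
            row[name] = float(row[name])
        # apoe1 and apoe2 are left as strings because their coding is not
        # specified in the variable list.
        records.append(row)
    return records


records = read_records(io.StringIO(SAMPLE))
```

To read an actual file, pass an open file object instead of the io.StringIO wrapper. If the emailed file uses a different delimiter or marks missing values, the parsing would need to be adjusted accordingly.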

 

Analysis

 

  1. Summarize the distributions of the following variables: fam, gen, sex, age, tc, tg, ldl.  Keep in mind that the variables are a mixture of categorical, discrete, and continuous observations.

 

  2. Do males and females differ in total cholesterol? In triglycerides? In LDL?

 

  3. Summarize the associations between age and each of the lipids (tc, tg, ldl), and among the lipids themselves. Keep in mind that there may be a sex difference for some of the lipids.

 

 

 

  4. One way that statistical geneticists investigate the heritability of a quantitative trait is by familial resemblance. With nuclear family data this can be done by looking at how the mean value of a variable (x) in the children correlates with the mean parental value of x. Moreover, the estimated slope of the regression of “mean offspring” on “mean parent” can be shown to be a valid estimate of the proportion of total variation in x explained by genetic factors. There are certain assumptions that underlie the validity of this method, which we will discuss in class. Evaluate the heritability of (a) total cholesterol, (b) LDL, and (c) triglycerides. Can you think of possible limitations to the regression method of estimating heritability? (Think of possible confounders, which may explain any trends you see in the corresponding scatter plot.)
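
The regression of “mean offspring” on “mean parent” described in the last item can be sketched as follows. This is a minimal Python illustration (the course itself uses S-Plus), not a prescribed procedure. It assumes each record is a dict with the keys fam, gen, and the trait name, following the variable list above. The function names and the data values are hypothetical, chosen only so the arithmetic is easy to follow.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: three families, total cholesterol (tc) only.
records = [
    {"fam": 1, "gen": 1, "tc": 170}, {"fam": 1, "gen": 1, "tc": 190},
    {"fam": 1, "gen": 2, "tc": 160}, {"fam": 1, "gen": 2, "tc": 180},
    {"fam": 2, "gen": 1, "tc": 200}, {"fam": 2, "gen": 1, "tc": 200},
    {"fam": 2, "gen": 2, "tc": 180},
    {"fam": 3, "gen": 1, "tc": 210}, {"fam": 3, "gen": 1, "tc": 230},
    {"fam": 3, "gen": 2, "tc": 185}, {"fam": 3, "gen": 2, "tc": 195},
]


def family_means(records, trait):
    """Return one (mean parent, mean offspring) pair per family."""
    parents = defaultdict(list)
    children = defaultdict(list)
    for r in records:
        group = parents if r["gen"] == 1 else children
        group[r["fam"]].append(r[trait])
    # Keep only families that have both parental and offspring values.
    return [(mean(parents[f]), mean(children[f]))
            for f in sorted(parents) if f in children]


def ols_slope_intercept(pairs):
    """Least-squares fit of y on x for a list of (x, y) pairs."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    xbar, ybar = mean(xs), mean(ys)
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, ybar - slope * xbar


pairs = family_means(records, "tc")
slope, intercept = ols_slope_intercept(pairs)
print(f"slope (heritability estimate): {slope:.2f}, intercept: {intercept:.1f}")
```

With these made-up values the family means are (180, 170), (200, 180), and (220, 190), so the fitted slope is 0.5. A scatter plot of the same pairs is worth examining alongside the slope, since the fit alone does not reveal confounding by variables such as age or sex.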