Statistics 410: Regression
Spring 2002
Data Project 1: Correlation and Simple Linear Regression
Due: Thursday, January 31
This project involves data collected by Dr. Jonathan Cohen,
University of Texas Southwestern Medical Center, to investigate genetic factors
underlying various cholesterol components. In this project you will go through
some of the steps taken by statistical geneticists at early stages of data
analysis to investigate complex relationships between genetic traits and
genetic factors (genetic markers, genes). In addition to addressing substantive
issues, this project is also meant to introduce you to some basic operations in
S-Plus.
Report
See handout on writing reports. This project is limited to a
total of 5 pages, excluding the title page.
Data
The data set consists of nuclear family data: each family includes two
parents and their biological children. The data will be emailed to you as a
text file. The variables are defined as follows.
fam: family identification (ID)
ind: individual ID within family
gen: generation (1=parent; 2=children)
sex: sex of individual (1=male; 2=female)
age: age in years
tc: total cholesterol concentration (mg/dL)
tg: triglyceride (mg/dL)
ldl: low density lipoprotein cholesterol (“bad cholesterol”)
apoe1: apolipoprotein-E allele
apoe2: apolipoprotein-E allele
Analysis
- Summarize the distributions of the following variables: fam, gen, sex,
  age, tc, tg, ldl. Keep in mind that the variables are a mixture of
  categorical, discrete, and continuous observations.
- Do males and females differ in their total cholesterol? triglycerides? LDL?
- Summarize associations between age and each of the lipids (tc, tg, ldl)
  and among the lipids. Keep in mind that there may be a sex difference for
  some of the lipids.
- One way that statistical geneticists investigate heritability of a
  quantitative trait is by familial resemblance. With nuclear family data
  this can be done by looking at how the mean value of a variable (x) in the
  children correlates with the mean parental value of x. Moreover, the
  estimated slope of the regression of “mean offspring” on “mean parent” can
  be shown to be a valid estimate of the proportion of total variation in x
  explained by genetic factors. There are certain assumptions that underlie
  the validity of this method, which we will discuss in class. Evaluate the
  heritability of (a) total cholesterol, (b) LDL, and (c) triglycerides. Can
  you think of possible limitations to the regression method of estimating
  heritability? (Think of possible confounders, which may explain any trends
  you see in the corresponding scatter plot.)
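The "mean offspring" on "mean parent" regression described in the last item can be sketched as follows. This is a minimal illustration in Python, not the S-Plus workflow the course uses. The function names (family_means, least_squares) are hypothetical, and the records are invented toy values, not Dr. Cohen's data. It assumes each record carries the fam, gen, and tc variables defined under Data.

```python
from collections import defaultdict


def family_means(rows, var):
    """Map each family ID to (mean parent value, mean offspring value) of var.

    rows are dicts keyed by the handout's variable names; gen is 1 for
    parents and 2 for children. Families missing a generation are skipped.
    """
    by_family = defaultdict(lambda: {1: [], 2: []})
    for row in rows:
        by_family[row["fam"]][row["gen"]].append(row[var])
    means = {}
    for fam, gens in by_family.items():
        if gens[1] and gens[2]:
            means[fam] = (sum(gens[1]) / len(gens[1]),
                          sum(gens[2]) / len(gens[2]))
    return means


def least_squares(x, y):
    """Return (slope, intercept) of the simple linear regression of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Invented toy records (NOT the course data): (fam, gen, tc)
toy = [
    (1, 1, 200), (1, 1, 220), (1, 2, 190), (1, 2, 210),
    (2, 1, 180), (2, 1, 200), (2, 2, 185),
    (3, 1, 240), (3, 1, 260), (3, 2, 220), (3, 2, 230), (3, 2, 240),
]
rows = [{"fam": f, "gen": g, "tc": tc} for f, g, tc in toy]

means = family_means(rows, "tc")
mean_parent = [p for p, _ in means.values()]
mean_offspring = [c for _, c in means.values()]

# The slope is the quantity the handout interprets as a heritability estimate.
slope, intercept = least_squares(mean_parent, mean_offspring)
print("slope of mean offspring on mean parent:", slope)
```

The sketch makes no adjustment for age or sex, so any difference between generations in those variables would feed straight into the slope. That is one of the possible confounders the question asks you to consider.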