Lab 4: Design of Experiments and Sampling Distributions

OBJECTIVES: This lab is designed to show you aspects of experimental design, with an emphasis on randomization and randomized block designs. You will also investigate the effect of sample size on sampling distributions and see why randomization is a key to statistical inference.

DIRECTIONS: Follow the instructions below, answering all questions. Your answers for each of the questions, including output and any plots, should be summarized in the form of a brief report (Word), to be handed in to the instructor before 1:00pm Friday Oct. 12.
__________________________________________________________________________________________________________
__________________________________________________________________________________________________________

1.) Design of Experiments . . .

In preparation for this portion of the lab, describe the basic idea of sampling in statistical design.
Explain what the notion of randomization in experimental design means.
How is an experiment different from an observational study?
How is randomization helpful when designing an experiment? (e.g., what does it help prevent?)
Describe what a randomized block design is. How is this type of design used as a form of control in experimental design?

Problem Background . . .

Five treatments are being tested on a homogeneous group of 40 patients. In particular, there are four different medications and one placebo. You are asked to use randomization to assign each of the treatments to the available patients. To do so, you will use Minitab to generate the patient data, take a random sample from it, and divide that sample into five treatment groups.

Randomization . . .

a.) First of all, generate the data to represent the 40 patients and store them in column 1 (C1).
(Hint: Investigate "Calc/Make Patterned Data/Simple Set of Numbers" -- Each patient can be represented by the numbers 1-40, of course :) ).

b.) Note that this first column contains your "original" data from which you will now sample. Now take a random sample of this data and store it in column 2. You may want to label your columns as well. VERY IMPORTANT!: remember that we don't want to sample the same patient twice, so make sure you sample in the proper fashion.
(Hint: Take a look at "Calc/Random Data/Sample from Columns", and remember, all you're doing is sampling 40 rows from C1!)

c.) You should now have your random sample of the 40 patients. The next task is to assign the five treatments among the patients. Be sure that the same number of patients receive each treatment.
(Note: Perhaps the simplest way is to just assign the first 8 patients in your random sample to "Treatment 1", the next 8 patients to "Treatment 2", and so on to "Placebo.")

Describe how you allocated the treatments to the patients, and be sure it's noted on your worksheet before you turn it in.
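
(Optional cross-check: the logic of parts a.) - c.) can also be sketched outside of Minitab. The Python sketch below is only an illustration and does not replace the Minitab work. The function name, the treatment labels, and the seed value are assumptions made for this example.)

```python
import random

def randomize_treatments(n_patients=40, treatments=None, seed=None):
    """Shuffle patient IDs 1..n_patients and split them evenly among treatments."""
    if treatments is None:
        # Hypothetical labels: four medications plus one placebo.
        treatments = ["Treatment 1", "Treatment 2", "Treatment 3",
                      "Treatment 4", "Placebo"]
    if n_patients % len(treatments) != 0:
        raise ValueError("patients cannot be split evenly among the treatments")

    rng = random.Random(seed)
    patients = list(range(1, n_patients + 1))      # part a.): patient IDs 1-40
    # part b.): sample all IDs WITHOUT replacement, i.e. a random ordering
    shuffled = rng.sample(patients, k=n_patients)

    group_size = n_patients // len(treatments)
    # part c.): first 8 shuffled patients -> first treatment, next 8 -> second, ...
    return {
        treatment: sorted(shuffled[i * group_size:(i + 1) * group_size])
        for i, treatment in enumerate(treatments)
    }

allocation = randomize_treatments(seed=2024)       # seed chosen arbitrarily
for treatment, ids in allocation.items():
    print(treatment, ids)
```

Every patient ID appears in exactly one group, which is what sampling without replacement guarantees.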

"Restricted Randomization" -- Block Design . . .

In this group of 40 patients, it is known that there are 20 females and 20 males. We suspect that gender could have some effect on the efficacy of the treatments. We thus have two blocks of subjects, each one homogeneous to the best of our knowledge. You are asked to implement a randomized block design and allocate the treatments within the two randomized blocks.

d.) For this part, you should start a new worksheet and generate the data as you did in part a.) above.
(Note: Remember that this time, you're basically generating two sets of data -- 20 males and 20 females -- to represent your "original" data. (i.e., you should have male patients 1-20, as well as female patients 1-20). Thus, be sure to label your columns "Male" and "Female" to distinguish the two blocks of data.)

e.) Next, you need to take a random sample of each block of patients, in the same manner you did in part b.) above. Store them in your next two columns. Again, to distinguish among the columns, they should be labeled appropriately.

f.) Now, within each randomized block of data, assign the 5 treatments to the patients.
(Hint: You may want to recall your method from part c.) above!)

Describe how you allocated the treatments within the blocks, and be sure it's noted on your worksheet before you turn it in.
How might the conclusions you draw from this type of design be different from those you draw from the design in parts a.) - c.) ?
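
(Optional cross-check: parts d.) - f.) repeat the same randomization separately inside each block. The Python sketch below illustrates that idea under the assumption that each block is numbered 1-20, as in the note for part d.). The function name, treatment labels, and seed are hypothetical.)

```python
import random

# Hypothetical labels: four medications plus one placebo.
TREATMENTS = ["Treatment 1", "Treatment 2", "Treatment 3",
              "Treatment 4", "Placebo"]

def randomize_within_blocks(block_sizes, treatments=TREATMENTS, seed=None):
    """Randomly allocate treatments separately within each block."""
    rng = random.Random(seed)
    design = {}
    for block, size in block_sizes.items():
        if size % len(treatments) != 0:
            raise ValueError(f"block {block!r} cannot be split evenly")
        ids = list(range(1, size + 1))   # part d.): patients 1..size in this block
        rng.shuffle(ids)                 # part e.): random ordering, no repeats
        per_group = size // len(treatments)
        # part f.): consecutive slices of the shuffled block get each treatment
        design[block] = {
            treatment: sorted(ids[i * per_group:(i + 1) * per_group])
            for i, treatment in enumerate(treatments)
        }
    return design

design = randomize_within_blocks({"Male": 20, "Female": 20}, seed=2024)
for block, allocation in design.items():
    for treatment, ids in allocation.items():
        print(block, treatment, ids)
```

With 20 patients per block and 5 treatments, each treatment is given to 4 males and 4 females, so every treatment group has the same gender balance.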

__________________________________________________________________________________________________________
__________________________________________________________________________________________________________

2.) Sampling Design and Moving Towards Statistical Inference . . .

Producing data in order to draw conclusions about some larger population is a process known as what? Why?
In preparation for this portion of the lab, what is meant by a population and a sample?
What is a simple random sample?
What is the difference between a parameter and a statistic?
Describe what a sampling distribution is.

Problem Background . . .

There is a mayoral election that will be held in Owltown, USA. You are to assume that there are only going to be two candidates running -- Screech McTalons and Ima Hoot.

Let p be the proportion of voters who will vote for Mr. McTalons.

We wish to study the behavior of groups of samples of different sizes drawn from the population of voters (Owltown voter population: 500,000).

a.)    Suppose the actual value of p is a favorable 0.7, or 70%.
        Generate 25 samples of size 50 from the population.
        (Hint: Again refer to "Calc/Random Data", and we will simulate random data from a Bernoulli distribution!
            Also, to make things easier, let the rows represent the 25 samples, and store them in columns C1-C50.)

b.) Next, calculate the mean of each sample and store the means in the next available column (C51). You may want to label this column (e.g., "Means").
(Hint: Since we're working with rows as each sample, you should investigate "Calc/Row Statistics." Be sure to calculate the mean for each of the 25 rows, which basically means calculating the row statistics among all 50 columns . . .)

Do the means of each sample seem to be accurate estimates of the true value of p?
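
(Optional cross-check: parts a.) and b.) can be imitated in Python as sketched below. Each simulated voter is a 1 (votes for Mr. McTalons) with probability p and a 0 otherwise, so each row mean is a sample proportion. Drawing independent Bernoulli values treats the population as effectively infinite, which is a reasonable approximation when the sample is tiny compared with 500,000 voters. The function name and seed are assumptions for this example.)

```python
import random

def simulate_sample_proportions(p=0.7, n_samples=25, sample_size=50, seed=None):
    """Simulate n_samples Bernoulli(p) samples and return each sample's mean."""
    rng = random.Random(seed)
    proportions = []
    for _ in range(n_samples):
        # One row of the worksheet: sample_size votes coded as 0 or 1.
        votes = [1 if rng.random() < p else 0 for _ in range(sample_size)]
        # The row mean is the fraction of 1s, i.e. the estimate of p.
        proportions.append(sum(votes) / sample_size)
    return proportions

p_hats = simulate_sample_proportions(seed=1)   # seed chosen arbitrarily
for i, p_hat in enumerate(p_hats, start=1):
    print(f"sample {i:2d}: estimate of p = {p_hat:.2f}")
```

Different seeds give different estimates, just as repeating the Minitab steps gives a different worksheet each time.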

c.) Further analyze the means you just calculated by looking at the basic descriptive statistics of the means ("Stat/Basic Statistics").

In looking at "the mean of the means" here, would you say these means are reflective of the true value of p in this case? Why or why not?

d.) Create a histogram of the estimates of p from the 25 samples of size 50 drawn from the population with p=0.7.

From this histogram, what can you say about the sampling distribution in this case? Consider such aspects of it as its shape (e.g., normal, skewed, outliers, etc.), center (e.g., mean, median, closeness to true value p), and spread.
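
(Optional cross-check: the summary in parts c.) and d.) can be approximated in Python as sketched below. The tally of stars is only a rough stand-in for Minitab's histogram, with one row per distinct estimate. The helper names and seed are assumptions for this example.)

```python
import random
from collections import Counter
from statistics import mean, stdev

def sample_proportions(p, n_samples, sample_size, rng):
    """Return one sample proportion per simulated sample of 0/1 votes."""
    return [
        sum(rng.random() < p for _ in range(sample_size)) / sample_size
        for _ in range(n_samples)
    ]

def text_tally(values, digits=2):
    """Build one line per distinct value, with one '*' per occurrence."""
    counts = Counter(round(v, digits) for v in values)
    return [f"{value:.{digits}f} | {'*' * counts[value]}"
            for value in sorted(counts)]

rng = random.Random(7)                       # seed chosen arbitrarily
p_hats = sample_proportions(p=0.7, n_samples=25, sample_size=50, rng=rng)

lines = text_tally(p_hats)
print("\n".join(lines))
print("mean of the means:", round(mean(p_hats), 4))
print("standard deviation of the means:", round(stdev(p_hats), 4))
```

Reading the tally from top to bottom shows the center and spread of the 25 estimates at a glance.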

e.) Now increase the sample size to 100 and repeat parts a.)-d.) (same p of 0.7). You should probably start a new worksheet before doing this! Remember, in doing this, you will be increasing the number of columns (sample size) to 100!

Compare your results here with what you saw when the sample size was only 50.

f.) Finally, with a sample size of 100 again (still 25 samples), suppose the actual value of p is now only 0.55, or 55%, and repeat parts a.)-d.) with this new information. Again, start a new worksheet before doing this.

Compare your results here with what you saw in part e.).
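
(Optional background for the comparisons in parts e.) and f.): a standard result, not derived in this handout, is that for a simple random sample of size n from a much larger population, the sample proportion has mean p and standard deviation sqrt(p(1-p)/n). The Python sketch below evaluates that formula for the three settings used in this lab so you can compare it with the spread you observed. The function name is an assumption for this example.)

```python
import math

def sd_of_sample_proportion(p, n):
    """Theoretical standard deviation of a sample proportion: sqrt(p(1-p)/n)."""
    if not 0 <= p <= 1:
        raise ValueError("p must be between 0 and 1")
    if n <= 0:
        raise ValueError("n must be positive")
    return math.sqrt(p * (1 - p) / n)

# (label, true p, sample size) for the three settings in this lab
settings = [
    ("parts a.)-d.)", 0.70, 50),
    ("part e.)", 0.70, 100),
    ("part f.)", 0.55, 100),
]

results = {label: sd_of_sample_proportion(p, n) for label, p, n in settings}
for label, p, n in settings:
    print(f"{label}: p = {p}, n = {n}, theoretical sd = {results[label]:.4f}")
```

With only 25 simulated samples per setting, the standard deviation you observe in Minitab will differ somewhat from these theoretical values.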

;) BONUS! ;)

Discuss what is meant by the variability of a statistic, and in particular comment on how the spread of a sampling distribution relates to the population.

What is the difference between bias and variability?