**helpdesk@stat.rice.edu**

Up to this point, we've been working with numeric manipulations in S-Plus: vectors, matrices, and the mathematical operations defined on these objects. However, focusing only on these things ignores S's strengths, and even renders it noncompetitive, as these are things that can be done better in MATLAB. We now turn to things that are more distinctive to S, namely a design that recognizes that data come in many forms, not all numeric, and that there are many standard operations that we invoke to act on data in its myriad forms. This is embedded in what has come to be known as object-oriented programming (with the perhaps unfortunate acronym OOP). Things that you create and work with in S-Plus are objects. Each object has certain properties which are immediately apparent to S-Plus and which define in part how the object behaves and can be acted upon.

The most basic properties of an object are its length and mode. As an example, consider our friend the vector after some manipulations:

    > x <- 1:5             # the basic vector
    > length(x)            # how long is it?
    [1] 5
    > mode(x)              # what mode is it? This is new.
    [1] "numeric"
    > y <- c(x,"All")      # add a string
    > y
    [1] "1"   "2"   "3"   "4"   "5"   "All"
    > length(y)
    [1] 6
    > mode(y)
    [1] "character"
    > x <- y[1:5]
    > x
    [1] "1" "2" "3" "4" "5"
    > mode(x)
    [1] "character"
    > x <- as.numeric(x)
    > x
    [1] 1 2 3 4 5
    > y + y
    Problem in y + y: Non-numeric first operand
    Use traceback() to see the call stack

In the example above, we created a vector of numbers. S looked at the vector that we created, saw that all of the entries were numbers, and defined the mode of the vector to be `numeric`. When we appended the string `"All"`, every entry of the new vector `y` was stored as a character string and the mode of `y` became `character`, which is why the attempt to compute `y + y` fails.

Even though the vector is not numeric, it still makes sense to ask
how many entries it has, so queries as to the length of the object `y`
still make sense and are answered politely. Similarly, many of the
tools that we learned for forming subsets of vectors work to form new
vectors or matrices or objects by referring to `y` only through
the indices of the entries of `y`, and these functions work just fine.
Thus, redefining `x` as the vector comprised of the first 5 elements
of the vector `y` works. Note that when we do this, S looks at the
mode of `y` and applies this as the mode of `x` automatically;
thus, we get a vector of mode character even though in this case each of
the character strings is a perfectly valid number. When something like
this happens, we may want to redefine the mode of `x` to be more
appropriate for the actions we wish to take. S lets us do this through
various conversion routines: `as.<modename>(x)` will recast the
mode of `x` to that specified. In this case, `as.numeric(x)`
returned the vector `x` that we started with initially. This
mode casting approach tries to be reasonable, but it may do something
that you don't expect if you try to coerce something in a nonstandard way.
For example, `as.numeric(y)` will produce a numeric vector of length 6,
but the last entry, corresponding to the character string `"All"`,
will be rendered as a "missing value", or `NA`.
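As a minimal sketch of the coercion just described, the lines below rebuild `y` and convert it to numeric. The code is written in the S language as implemented by R, which is an assumption on my part; `suppressWarnings` is an R function used only to silence R's warning about the `NA`, and S-Plus may report the coercion differently.

```r
# Coerce a character vector to numeric; entries that are not valid
# numbers become NA.
y <- c(1:5, "All")                      # mode "character"
z <- suppressWarnings(as.numeric(y))    # R warns about the NA; silenced here
z                                       # five numbers followed by NA
is.na(z)                                # TRUE only for the last entry
mode(z)                                 # "numeric"
```

Checking `is.na` after a coercion is a cheap way to find the entries that did not convert.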

There are 5 "atomic" modes in S (we will discuss non-atomic modes shortly), corresponding to the modes that entries in a vector can have: `logical`, `numeric`, `complex`, `character`, and `null`.

    > x <- c(T,3)
    > mode(x)
    [1] "numeric"
    > x <- c(T,3,4+2i)
    > mode(x)
    [1] "complex"
    > x <- c(T,3,4+2i,"All")
    > x
    [1] "TRUE" "3"    "4+2i" "All"
    > x <- c(T,T,F)
    > as.character(x)
    [1] "TRUE"  "TRUE"  "FALSE"
    > as.null(x)
    NULL
    > x <- c(x, NULL)
    > x
    [1] T T F
    > x <- c(1:5, NA)
    > x
    [1]  1  2  3  4  5 NA
    > as.logical(x)
    [1]  T  T  T  T  T NA
    > as.logical(x) + as.logical(x)
    [1]  2  2  2  2  2 NA
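The session above suggests that when values of different modes are combined, the result takes the most general mode present, in the order logical, numeric, complex, character. The sketch below checks that reading on four small vectors. It is written for R (an assumption), and the element names in `mixed` are made up for illustration.

```r
# Combine values of different modes and report the mode of each result.
mixed <- list(
  logical.only   = c(TRUE, FALSE),
  with.numeric   = c(TRUE, 3),
  with.complex   = c(TRUE, 3, 4+2i),
  with.character = c(TRUE, 3, 4+2i, "All")
)
modes <- sapply(mixed, mode)
modes   # "logical" "numeric" "complex" "character", labelled by element name
```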

Vectors (and matrices) are objects that consist of atoms of the same mode arranged in particular ways. Lists and frames are the generalizations of these concepts: they are objects that consist of other objects, possibly "nonprimitive" in nature, whereas an atom is a primitive object that cannot be reduced further. Vectors and matrices are said to be "atomic" objects, implying amongst other things that they have uniquely defined modes and that their elements are of the same size. Lists and frames are "nonatomic", and the elements of a list or a frame can have different modes and be of different sizes. As we've seen something of how vectors work, let's take a look at a list or two to illustrate the differences:

    > x <- c(1,2,3,4,5,NA)
    > x
    [1]  1  2  3  4  5 NA
    > y <- c(x[1:5],"A")
    > y
    [1] "1" "2" "3" "4" "5" "A"
    > z <- matrix(0,2,2)
    > z
         [,1] [,2]
    [1,]    0    0
    [2,]    0    0
    > mylist <- list(x,y,z)
    > mylist
    [[1]]:
    [1]  1  2  3  4  5 NA
    [[2]]:
    [1] "1" "2" "3" "4" "5" "A"
    [[3]]:
         [,1] [,2]
    [1,]    0    0
    [2,]    0    0
    > length(mylist)
    [1] 3
    > length(mylist[[2]])
    [1] 6
    > dim(mylist[[3]])
    [1] 2 2
    > mode(mylist[[1]])
    [1] "numeric"
    > mode(mylist)
    [1] "list"

Thus, we have combined in one list a numeric vector of length 6 (the first element of the list), a character vector of length 6 (the second element), and a numeric matrix of dimension 2 by 2 (the third element). The list is an object, so it has a length, 3, and a mode, `list`.
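The session uses double brackets throughout. As a small sketch of why, the lines below contrast `mylist[2]` with `mylist[[2]]`. The code is written for R (an assumption), and the names `piece` and `element` are made up for illustration.

```r
x <- c(1, 2, 3, 4, 5, NA)
y <- c(x[1:5], "A")
z <- matrix(0, 2, 2)
mylist <- list(x, y, z)

piece   <- mylist[2]      # single brackets: a list of length 1
element <- mylist[[2]]    # double brackets: the character vector itself

mode(piece)               # "list"
mode(element)             # "character"
sapply(mylist, length)    # 6 6 4: the lengths of the three elements
```

Single brackets always hand back a sublist, so functions that expect a vector or matrix generally want the double-bracket form.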

An additional feature of objects, one that becomes very useful when
working with vectors, matrices, lists or frames, is that objects
and occasionally elements of objects can have names. For example,
we might want to return one vector of residuals, one vector of the
raw data, and one vector of the coefficients employed, and have them
named appropriately. To assign names to the elements of a list,
we use the function `names` with a vector of character
strings as follows:

    > names(mylist) <- c("resids","rawdata","mycoefs")
    > mylist
    $resids:
    [1]  1  2  3  4  5 NA
    $rawdata:
    [1] "1" "2" "3" "4" "5" "A"
    $mycoefs:
         [,1] [,2]
    [1,]    0    0
    [2,]    0    0
    > mylist$rawdata
    [1] "1" "2" "3" "4" "5" "A"
    > mylist[[2]]
    [1] "1" "2" "3" "4" "5" "A"
    > names(mylist) <- c("resids","raw data","mycoefs")
    > mylist
    $resids:
    [1]  1  2  3  4  5 NA
    $"raw data":
    [1] "1" "2" "3" "4" "5" "A"
    $mycoefs:
         [,1] [,2]
    [1,]    0    0
    [2,]    0    0

Once elements in a list have been named, you can refer to them by name using the dollar sign, as in `mylist$rawdata`.

Supplying the names in the call to `list`, as in

    mylist <- list(resids=x, rawdata=y, mycoefs=z)

produces the same result as the first use of `names` above. You can also name the components of a vector, but you cannot refer to them by name using the dollar sign.

    > x
    [1]  1  2  3  4  5 NA
    > names(x) <- c("A","B","C","D","E","F")
    > x
     A B C D E  F
     1 2 3 4 5 NA
    > x$A
    NULL

Matrices have so many elements that naming all of them was deemed counterproductive, but it is possible to name the rows and columns of a matrix, giving names to each of the dimensions. This uses the `dimnames` function:

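    > z
         [,1] [,2]
    [1,]    0    0
    [2,]    0    0
    > dimnames(z) = list(c("row1","row2"),c("col1","col2"))
    > z
         col1 col2
    row1    0    0
    row2    0    0
    > z[row1,col1]
    Problem: Object "row1" not found
    Use traceback() to see the call stack

The last command fails because an unquoted `row1` is taken to be the name of an object; to index by name, the names must be supplied as character strings, as in `z["row1","col1"]`. The dimension names can also be supplied in the initial call to `matrix`, as in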

    > matrix(0,2,2,dimnames=list(c("row1","row2"),c("col1","col2")))

If you want to change the names of the
rows or columns of a matrix without looking at everything, there
are a few options available. First, there is the function `row.names`.
Interestingly, there is no corresponding function `col.names`.
Second, you can reference the appropriate entry of the first or
second list element of `dimnames`.

    > row.names(z) <- c("r1","r2")
    > z
       col1 col2
    r1    0    0
    r2    0    0
    > col.names(z) <- c("c1","c2")
    Problem: couldn't find assignment function for "col.names"
    > dimnames(z)
    [[1]]:
    [1] "r1" "r2"
    [[2]]:
    [1] "col1" "col2"
    > dimnames(z)[[2]]
    [1] "col1" "col2"
    > dimnames(z)[[2]] <- c("c1","c2")
    > dimnames(z)
    [[1]]:
    [1] "r1" "r2"
    [[2]]:
    [1] "c1" "c2"
    > dimnames(z)[[1]][2]
    [1] "r2"
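Once a matrix has dimension names, they can be used for indexing as well as for labelling. The following is a small sketch, written for R (an assumption), that indexes by quoted names and renames a single column through `dimnames`. The value 7 and the name `"first"` are arbitrary.

```r
z <- matrix(0, 2, 2, dimnames = list(c("r1", "r2"), c("c1", "c2")))

z["r2", "c1"] <- 7               # names must be quoted character strings
z["r2", ]                        # a whole row, selected by name
dimnames(z)[[2]][1] <- "first"   # rename one column, leave the other alone
dimnames(z)
```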

Data frames extend the idea of lists, allowing for multiple modes, reference by name, and so on, while imposing the structure of a matrix so that the data can be arranged in a rectangular manner with a fixed number of rows and columns. This is reasonable in the statistical data setting, where we can think of each row as denoting an object on which observations are being made, and each column as an individual covariate, where the covariates need not all be numeric. In many cases the covariates are all numeric, in which case the frame can be coerced into a matrix and vice versa.

There are some notable ways (in addition to the allowance for columns of differing modes) in which data frames differ from matrices. Let's look at a few.

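    > x <- matrix(1:12,4,3)
    > x
         [,1] [,2] [,3]
    [1,]    1    5    9
    [2,]    2    6   10
    [3,]    3    7   11
    [4,]    4    8   12
    > zed <- data.frame(x)
    > zed
      x.1 x.2 x.3
    1   1   5   9
    2   2   6  10
    3   3   7  11
    4   4   8  12
    > zed$x.1
    [1] 1 2 3 4
    > zed$1
    Problem: Syntax error: illegal literal ("1") on input line 1
    > zed[[1]]
    [1] 1 2 3 4
    > x[1:5]
    [1] 1 2 3 4 5
    > zed[1:5]
    Problem in "[.data.frame"(x, ..1): undefined columns selected
    Use traceback() to see the call stack
    > zed[c(2,3)]
      x.2 x.3
    1   5   9
    2   6  10
    3   7  11
    4   8  12
    > zed[1,]
      x.1 x.2 x.3
    1   1   5   9
    > mode(zed)
    [1] "list"

The session shows several differences. Data frames, unlike matrices, are given row and column names automatically (here `x.1`, `x.2` and `x.3`). A single index such as `zed[c(2,3)]` selects whole columns rather than individual entries, which is why `zed[1:5]` fails on a frame with three columns while `x[1:5]` works on the matrix. Finally, the mode of a data frame is `list`.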
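To illustrate columns of differing modes, here is a minimal sketch of a data frame with one numeric and one logical column, followed by the coercion of an all-numeric frame to a matrix. It is written for R (an assumption), where default column names differ from the `x.1` style shown above, and the `patients` data are invented for the example.

```r
# Hypothetical patient records: one row per patient, columns of different modes.
patients <- data.frame(age    = c(34, 51, 28),
                       smoker = c(TRUE, FALSE, TRUE))
dim(patients)                 # rows and columns, as for a matrix
mode(patients)                # "list"
is.logical(patients$smoker)   # each column keeps its own mode

# An all-numeric frame can be coerced to a matrix.
m <- as.matrix(data.frame(matrix(1:12, 4, 3)))
dim(m)                        # 4 3
mode(m)                       # "numeric"
```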

S assumes that you know what you're doing when you define functions. Generally, this is a good thing, but there's one way in which this is carried too far: If you define a function and give it the same name as a function that already exists in S, S accepts the new function as the one to invoke. The latest version of S-Plus will bother to tell you that it already has a function by that name; if you see this warning, move your function! If you later try to invoke the S function, it will find yours instead, which can lead to much frustration in trying to debug code that looks correct. Let's see what happens when I overwrite the matrix transpose function.

    > x <- matrix(1:6,2,3)
    > x
         [,1] [,2] [,3]
    [1,]    1    3    5
    [2,]    2    4    6
    > t(x)
         [,1] [,2]
    [1,]    1    2
    [2,]    3    4
    [3,]    5    6
    > t <- function(x){
    + x
    + }
    Warning messages:
      Conflicting definitions of "t" on databases ".Data" and "splus"
    > t(x)
         [,1] [,2] [,3]
    [1,]    1    3    5
    [2,]    2    4    6
    > rm(t)
    > t(x)
         [,1] [,2]
    [1,]    1    2
    [2,]    3    4
    [3,]    5    6
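One way to avoid the problem is to check whether a name is already taken before defining a function. The sketch below uses `exists` for this. It is written for R (an assumption), and both the helper `name.in.use` and the candidate name `my.transpose.v2` are hypothetical.

```r
# A hypothetical helper: is this name already in use for a function?
name.in.use <- function(name) exists(name, mode = "function")

name.in.use("t")                  # TRUE: t is the built-in transpose
name.in.use("my.transpose.v2")    # FALSE unless you have defined it yourself
```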

Earlier, we talked about reading lines of text into
S-Plus using the `source` command, which interprets
the lines of the text file supplied as valid S commands
and attempts to execute them, one by one. While this is
useful, and is probably the best way to work on functions,
it is not the most efficient method for loading tables of
data. For these purposes, the functions `scan` and
`read.table` are better. The distinction is that
`scan` is designed to read in the contents of a
matrix, and hence looks for things to be of all the same
mode, whereas `read.table` is designed to read in
a data frame. In what follows, I will work with the
contents of the file `dummy.q`.

Scan reads entries from a file in a continuous stream, happily ignoring whitespace save for the purpose of indicating the separation between elements of the file.

    > !more dummy.q
    1 2 3 4
    5 6 7 8
    > x <- matrix(scan("dummy.q"),2,4)
    > x
         [,1] [,2] [,3] [,4]
    [1,]    1    3    5    7
    [2,]    2    4    6    8
    > x <- matrix(scan("dummy.q"),2,4,byrow=T)
    > x
         [,1] [,2] [,3] [,4]
    [1,]    1    2    3    4
    [2,]    5    6    7    8

In this case, we loaded the file into a matrix (it was very useful to know the number of rows and columns beforehand). We note once again that S fills matrices by column, which is almost never the desired behavior when we are scanning data from a file.
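If only the number of columns is known, the number of rows can be left for `matrix` to work out from the length of the scanned vector. A minimal sketch follows, written for R (an assumption). It writes its own temporary file so that it is self-contained, and `quiet` is an R argument that suppresses the count of items read.

```r
# Write a small file so the example is self-contained (contents as in dummy.q).
datafile <- tempfile()
cat("1 2 3 4\n5 6 7 8\n", file = datafile)

values <- scan(datafile, quiet = TRUE)        # one long numeric vector
x <- matrix(values, ncol = 4, byrow = TRUE)   # rows inferred from the length
x
unlink(datafile)                              # remove the temporary file
```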

Occasionally, a file comes with headings for the appropriate
columns. Unfortunately, `scan` does not provide for automatically
loading the column names, but it can be adjusted to skip over them:

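    > !more dummy.q
    col1 col2 col3 col4
    1 2 3 4
    5 6 7 8
    > x <- matrix(scan("dummy.q",skip=1),2,4,byrow=T)
    > x
         [,1] [,2] [,3] [,4]
    [1,]    1    2    3    4
    [2,]    5    6    7    8

The optional argument `skip` tells `scan` how many lines at the start of the file to pass over before it begins reading.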
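The headings need not be thrown away: they can be read in a separate call and attached as column names. The sketch below does this in R (an assumption). The `nlines` and `quiet` arguments are those of R's `scan` and may not be available in S-Plus, and the temporary file stands in for `dummy.q`.

```r
datafile <- tempfile()
cat("col1 col2 col3 col4\n1 2 3 4\n5 6 7 8\n", file = datafile)

# Read the first line as character strings, then the rest as numbers.
headings <- scan(datafile, what = "", nlines = 1, quiet = TRUE)
x <- matrix(scan(datafile, skip = 1, quiet = TRUE),
            ncol = length(headings), byrow = TRUE,
            dimnames = list(NULL, headings))
x
unlink(datafile)
```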

The `read.table` function is quite similar to `scan`, but
it allows (surprise, surprise) for data that are of different modes.
Unlike `scan`, `read.table` treats each line of the file as a distinct
unit, so a row of the resulting data frame may not be split
across several lines of the file.

    > !more dummy.q
    1 2 3 4
    5 6 7 8
    > zed <- read.table("dummy.q")
    > zed
      V1 V2 V3 V4
    1  1  2  3  4
    2  5  6  7  8
    > mode(zed)
    [1] "list"
    > zed$V1
    [1] 1 5

Note that the result of `read.table` is a data frame, with mode `list` and with default column names `V1` through `V4`.

With `header=T`, the first line of the file supplies the column names:

    > zed <- read.table("dummy.q",header=T)
    > zed
      col1 col2 col3 col4
    1    1    2    3    4
    2    5    6    7    8

Now, in some data files the rows are also labelled (for example, in medical files each row may denote a patient and each column some type of measurement). This can be accommodated (assuming the row label column also has an entry in the header line).

    > !more dummy.q
    rowlabels col1 col2 col3 col4
    row1 1 2 3 4
    row2 5 6 7 8
    > zed <- read.table("dummy.q",header=T,row.names="rowlabels")
    > zed
         col1 col2 col3 col4
    row1    1    2    3    4
    row2    5    6    7    8

In general, S will determine the mode of each entry column by column automatically, so you don't have to worry about it.
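To see the column-by-column determination of modes at work, here is a small sketch that reads a file with one numeric and one non-numeric column. It is written for R (an assumption), and the file contents and the names `patient`, `age` and `sex` are invented for the example.

```r
datafile <- tempfile()
cat("patient age sex\n",
    "p1 34 female\n",
    "p2 51 male\n", sep = "", file = datafile)

records <- read.table(datafile, header = TRUE, row.names = "patient")
records
sapply(records, is.numeric)    # age is numeric, sex is not
unlink(datafile)
```

The row labels end up as the row names of the frame rather than as a column of data.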

Okay, we've just spent a goodly amount of time discussing
the types of data structures that are employed in S, leading
up to the idea of a data frame. The reason that a data frame
is important is that data frames are often supplied as
local databases to the functions in S which work with
statistical models. There are several such functions,
but for the moment we will just be focusing on one:
the `lm`, or "linear model", function. As you might
guess, this function is quite pertinent to our course.

We begin by examining the relationship between `fuel`
and `weight` that was the subject of the last homework.
Initially, both `fuel` and `weight` exist in my
directory as vectors of numbers.

    > zed <- lm(fuel ~ weight)
    > zed
    Call:
    lm(formula = fuel ~ weight)
    Coefficients:
     (Intercept)     weight
       0.3914324 0.00131638
    Degrees of freedom: 60 total; 58 residual
    Residual standard error: 0.3877015
    > mode(zed)
    [1] "list"
    > class(zed)
    [1] "lm"
    > names(zed)
     [1] "coefficients"  "residuals"     "fitted.values" "effects"
     [5] "R"             "rank"          "assign"        "df.residual"
     [9] "contrasts"     "terms"         "call"

As you might suspect, the call to `lm` fits the model

$$\text{fuel}_i = \beta_0 + \beta_1\,\text{weight}_i + \epsilon_i$$

with the errors $\epsilon_i$ assumed to be independent and identically distributed normal random variates. Note that what the function `lm` returns is a list of class `lm`, and that other functions can be applied to this object to display or extract the results:

    > print(zed)
    Call:
    lm(formula = fuel ~ weight)
    Coefficients:
     (Intercept)     weight
       0.3914324 0.00131638
    Degrees of freedom: 60 total; 58 residual
    Residual standard error: 0.3877015
    > summary(zed)
    Call: lm(formula = fuel ~ weight)
    Residuals:
         Min      1Q  Median     3Q    Max
     -0.7957 -0.2703 0.01414 0.2547 0.9583
    Coefficients:
                  Value Std. Error t value Pr(>|t|)
    (Intercept)  0.3914     0.2995  1.3070   0.1964
         weight  0.0013     0.0001 12.9323   0.0000
    Residual standard error: 0.3877 on 58 degrees of freedom
    Multiple R-Squared: 0.7425
    F-statistic: 167.2 on 1 and 58 degrees of freedom, the p-value is 0
    Correlation of Coefficients:
           (Intercept)
    weight -0.9859
    > anova(zed)
    Analysis of Variance Table
    Response: fuel
    Terms added sequentially (first to last)
              Df Sum of Sq  Mean Sq  F Value Pr(F)
       weight  1  25.13875 25.13875 167.2433     0
    Residuals 58   8.71812  0.15031
    > coef(zed)
     (Intercept)     weight
       0.3914324 0.00131638
    > resid(zed)

where I omitted the printing of the vector of residuals (all 60 of them). In any event, many of the basic linear model results are available using functions such as these.
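Since the course's `fuel` and `weight` vectors are not reproduced here, the sketch below fits the same kind of model to six made-up observations and pulls out the pieces by function and by name. It is written for R (an assumption), and the numbers are invented, so the estimates will not match those above.

```r
# Made-up numbers standing in for the course's fuel and weight vectors.
weight <- c(2000, 2500, 3000, 3500, 4000, 4500)
fuel   <- c(3.1, 3.6, 4.4, 4.9, 5.7, 6.2)

fit <- lm(fuel ~ weight)
class(fit)            # "lm"
b <- coef(fit)        # named vector: (Intercept), weight
r <- resid(fit)       # one residual per observation
fitted(fit) + r       # fitted values plus residuals reproduce fuel
fit$coefficients      # the same numbers as coef(fit), reached by name
```

Because the fitted object is a list, the dollar sign works on it just as it did for `mylist`.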

How do the statistical model functions involve data frames?
Well, the simple answer is that in passing arguments to `lm`,
the first argument is the `formula` describing what model
is to be fitted, and the optional second argument `data`
will accept a data frame, causing S to look at the columns of
the data frame as the covariates to be acted upon. This clusters
all of the functionality of interest in a single location, which
is a form of "encapsulation" (which is a short way of arguing
against leaving things lying around). Thus, an equivalent way
of conducting the above analysis is

    > cardata <- data.frame(cbind(fuel,weight))
    > names(cardata)
    [1] "fuel"   "weight"
    > zed2 <- lm(fuel ~ weight, data = cardata)
    > zed2
    Call:
    lm(formula = fuel ~ weight, data = cardata)
    Coefficients:
     (Intercept)     weight
       0.3914324 0.00131638
    Degrees of freedom: 60 total; 58 residual
    Residual standard error: 0.3877015

Aside from having too many things lying around, are there other reasons for working with data frames? That depends on how difficult the data are to access initially. In this example, the gain is not significant. However, if there are lots of variables lying around and some of them are actually subsets of larger sets of interest, it can be useful to focus in on things by putting everything you want to work with into a single frame. If that sounds like a wishy-washy answer, that's because it is. The primary reason is aesthetic.
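As a quick check that the two routes agree, the sketch below fits the model once from a data frame and once from free-standing vectors and compares the coefficients. It is written for R (an assumption), and the six observations are made up, so they are not the course data.

```r
# Made-up data held in a single frame.
cardata <- data.frame(weight = c(2000, 2500, 3000, 3500, 4000, 4500),
                      fuel   = c(3.1, 3.6, 4.4, 4.9, 5.7, 6.2))
fit.frame <- lm(fuel ~ weight, data = cardata)

# The same fit from free-standing vectors, for comparison.
fuel   <- cardata$fuel
weight <- cardata$weight
fit.free <- lm(fuel ~ weight)

all.equal(coef(fit.frame), coef(fit.free))   # TRUE
```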

The notation used in model formulas can be summarized as follows.

The tilde (`~`) separates the response y on the left from the x's on the right.

A period by itself (`.`) means include every main effect.

Addition (`+`) means inclusion of terms, not summation.

Subtraction (`-`) indicates specific exclusion of a term.

A colon (`:`) indicates interaction terms, not sequences.

Multiplication (`*`) indicates inclusion of the terms and all of their higher-order interactions, with repetitions of a given term excluded.

Exponentiation (`^`) indicates inclusion of higher-order terms up to a specified order.

The `I()` notation says "evaluate me first": inside it, formula operators are treated as ordinary arithmetic.

You can apply exponentiation, addition, and subtraction to the dot.
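One way to see what a formula expands to is to ask `terms` for its term labels. The sketch below does this for each operator. It is written for R (an assumption), the helper `expand` is hypothetical, and `y`, `a`, `b`, `c` are placeholder variable names rather than real data.

```r
# term.labels lists the terms a formula expands to.
expand <- function(f, ...) attr(terms(f, ...), "term.labels")

expand(y ~ a + b)          # "a" "b"
expand(y ~ a * b)          # "a" "b" "a:b"
expand(y ~ a * b - a:b)    # "a" "b"
expand(y ~ (a + b + c)^2)  # "a" "b" "c" "a:b" "a:c" "b:c"
expand(y ~ I(a + b))       # "I(a + b)"

# The dot needs a data frame to say what "everything else" is.
d <- data.frame(y = 1:3, a = 4:6, b = 7:9)
expand(y ~ ., data = d)    # "a" "b"
```

Note that `I(a + b)` stays a single term: the sum is computed first and entered as one covariate.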

Statistics Helpdesk 2003-08-19