.po 1.1i .he '''Rice Technical Report 311-87-1' .fo ''-%-'' .EQ delim && .EN .ce \fBSoftware for Univariate and Bivariate Averaged Shifted Histograms\fR\** .ce 5 David W. Scott Department of Statistics Rice University P.O. Box 1892 Houston, TX 77251-1892 \fI0. Abstract\fR .pp Software for computing one and two dimensional averaged shifted histograms (Scott, 1985, \fIAnnals of Statistics\fR, \fB13\fR:1024-40) is given. The algorithms are written in Fortran \fB77\fR. Also included are function definitions and documentation for use in the interactive \fBS\fR system from Bell Labs. [Revised August 26, 1991 to provide New S interfaces.] \fI1. Introduction\fR .pp The averaged shifted histogram is a computationally effective algorithm for computing nonparametric kernel density estimates. The main portion of this technical report is code, but here we indicate the structure chosen. In general, the averaged shifted histogram operates on bin counts rather than the data themselves. Of course in many applications the raw data are in fact bin counts. Thus we have separated these two functions: first compute the bin counts; second compute the averaged shifted histogram. .pp The bins are assumed to be regular intervals or rectangles. For plotting purposes, we note that the values returned are at the \fIcenter\fR of the bins and should be linearly interpolated. The routines have been designed so that the output can be plotted directly using the \fBS\fR functions \fBplot\fR and \fBpersp\fR in the univariate and bivariate cases, respectively. Notice that in the bivariate case, if the matrix fxy contains the density estimates, and if x(i) is the center of i-th bin along the x-axis and y(j) is the center of the j-th bin along the y-axis, then the value of the averaged shifted histogram at the bin center (x(i),y(j)) is z(i,j). This follows the usual \fBS\fR convention. .pp Both algorithms use the quartic or biweight kernel &K(x) ~=~ 15(1-x sup 2 ) sup 2 /16& with a product of these kernels used in the bivariate case. The original averaged shifted histogram used a triangle kernel. [The parameter kopt defaults to the biweight kernel but provides the option for other kernels in the polynomial class.] \fI2. Code and listings\fR .pp There are two binning and two averaging routines, one each for the univariate and bivariate case. Within the double lines are three routines for each function. First is a documentation file for \fBS\fR. For those not using \fBS\fR, note that the arguments are described in order and that an example of correct usage is also given. Second is a function definition for the algorithm using the \fBS\fR format. One of the advantages of using \fBS\fR rather than Fortran directly is that only the input arguments need to be specified since the output arguments are returned in a structure as defined in this second routine. Also work space allocation can be handled dynamically in \fBS\fR. Fortran programmers will have to provide this work space in their calling program; it will also be necessary to pass arguments of actual array dimensions in the main program. The third routine is the actual fortran listing of the algorithm. I have tried to be efficient where it matters, relying upon a fairly friendly optimizing compiler to avoid recomputation of certain values inside loops. This could be handled explicitly if desired. The algorithms are fairly well documented and straight forward, although it may be helpful to glance at the second \fBS\fR function definition if additional information is needed. The twelve listings follow. Any comments or suggestions would be welcomed. .(f \**This research was partially supported by ONR contract N00014-85-K-0100 and ARO contract DAAG-29-85-K0212. The author's electronic address is scottdw@rice.edu. .)f .bp .po .5i .ll 7.0i .ps 6 .2c .nf There are 3 sections in the following: 1. The New S function definitions. 2. The Fortran subroutine code. 3. Help files. # 5-8-91 source this file # ash.s---------------------------------------------------------------------- bin1 <- function(x,ab=nicerange(x),nbin=50) { n <- length(x) if(ab[1]>=ab[2]) fatal("Interval vector has negative orientation") if(nbin<=0) fatal("Number of bin intervals nonpositive") r<-.Fortran("bin1", as.single(x), as.integer(n), as.single(ab), as.integer(nbin), nc=integer(nbin), nskip=integer(1)) list(nc=r$nc,ab=ab,nskip=r$nskip) } ash1 <- function(bins,m=5,kopt=c(2,2)){ nc <- bins$nc ab <- bins$ab nbin <- length(nc) r <- .Fortran("ash1", as.integer(m), as.integer(nc), as.integer(nbin), as.single(ab), as.integer(kopt), t=single(nbin), f=single(nbin), single(m), ier=integer(1)) if(r$ier==1) print("ash estimate nonzero outside interval ab") list(x=r$t,y=r$f,m=m,ab=ab,kopt=kopt,ier=r$ier) } nicerange <- function(x, beta = .1) { ab <- range(x) # interval surrounding data del <- ((ab[2] - ab[1]) * beta)/2 return(c(ab + c( - del, del))) } bin2 <- function(x,ab,nbin=c(20,20)){ if(missing(ab)){ ab <- t(array(c(nicerange(x[,1]),nicerange(x[,2])),c(2,2))) } n <- nrow(x) r <- .Fortran("bin2", as.single(x), as.integer(n), as.single(ab), as.integer(nbin), nc=integer(nbin[1]*nbin[2]), nskip=integer(1)) list(nc=matrix(r$nc,nbin[1],nbin[2]),ab=ab,nskip=r$nskip) } ash2 <- function(bins,m=c(5,5),kopt=c(2,2)){ nc <- bins$nc; if(!is.matrix(nc)) stop("bins does not contain bin count matrix") ab <- bins$ab; if(!is.matrix(ab)) stop("ab not a matrix - should be 2 by 2") nbin <- dim(nc) r <- .Fortran("ash2", as.integer(m), as.integer(nc), as.integer(nbin[1]), as.integer(nbin[2]), as.single(ab), as.integer(kopt), f=single(nbin[1]*nbin[2]), single(m[1]*m[2]), ier=integer(1)) if(r$ier==1) print(" estimate nonzero outside ab rectangle") list(z=matrix(r$f,nbin[1],nbin[2]), x=center(ab[1,],nbin[1])[[1]],y=center(ab[2,],nbin[2])[[1]], ab=ab,m=m,kopt=kopt,ier=r$ier) } center <- function(ab,k) { h <- diff(ab)/k list( seq(ab[1]+h/2,by=h,length=k) ) } c ash.f Contains all the fortran routines c ash.s Contains the calling functions. c bin1 Help for bin1 c ash1 Help for ash1 c bin2 Help for bin2 c ash2 Help for ash2 c To install this in a local S. c 1) get the all the files in some directory. c 2) mkdir .Data c 3) mkdir .Data/.Help c 4) cp bin1 ash1 bin2 ash2 .Data/.Help c 5) f77 -c -o ash.o ash.f c 6) S LOAD ash.o; S < ash.s OR c 7) dyn.load2("ash.o",syslibs="-lF77 -lm -lc"); source("ash.s") c ash.f---------------------------------------------------------------------- c April 8, 1986 c c Find bin counts of data array "x(n)" for ASH estimator c c "nbin" bins are formed over the interval [a,b) c c bin counts returned in array "nc" - # pts outside [a,b) = "nskip" c ##### Copyright 1986 David W. Scott subroutine bin1 ( x , n , ab , nbin , nc , nskip ) dimension x(n) , nc(nbin) , ab(2) nskip = 0 a = ab(1) b = ab(2) do 5 i = 1,nbin nc(i) = 0 5 continue d = (b-a) / nbin do 10 i = 1,n k = (x(i)-a) / d + 1.0 if (k.ge.1 .and. k.le.nbin) then nc(k) = nc(k) + 1 else nskip = nskip + 1 end if 10 continue return end c April 8, 1986 c c Computer ASH density estimate; Quartic (biweight) kernel c c Average of "m" shifted histograms c c Bin counts in array "nc(nbin)" - from routine "bin1" c c "nbin" bins are formed over the interval [a,b) c c ASH estimates returned in array "f(nbin)" c c FP-ASH plotted at a+d/2 ... b-d/2 where d = (b-a)/nbin c c Note: If "nskip" was nonzero, ASH estimates incorrect near boundary c Note: Should leave "m" empty bins on each end of array "nc" so f OK c ##### Copyright 1986 David W. Scott subroutine ash1 ( m, nc, nbin, ab, kopt, t, f, w, ier ) dimension nc(nbin), ab(2), t(nbin), f(nbin), w(m), kopt(2) ier = 0 a = ab(1) b = ab(2) n = 0 c-compute weights cons * ( 1-abs((i/m))^kopt1)^kopt2 c -- should sum to "m" 5-8-91 c w-array shifted by 1 mm1 = m-1 xm = m c cons = sum of weights from -(m-1) to (m-1) = 1 + 2 (sum from 1 to m-1) w(1) = 1.0 cons = 1.0 do 5 i = 1,mm1 w(i+1) = ( 1.0 - abs(i/xm)**kopt(1) )**kopt(2) cons = cons + 2*w(i+1) 5 continue cons = float(m)/cons do 6 i=1,m w(i) = cons*w(i) 6 continue c-check if estimate extends beyond mesh do 7 i = 1,mm1 if( nc(i)+nc(nbin+1-i) .gt. 0) ier = 1 7 continue c-compute ash(m) estimate delta = (b-a) / nbin h = m*delta do 10 i = 1,nbin t(i) = a + (i-0.5)*delta f(i) = 0.0 n = n + nc(i) 10 continue do 20 i = 1,nbin if (nc(i).eq.0) goto 20 c = nc(i) / (n*h) do 15 k = max0(1,i-mm1) , min0(nbin,i+mm1) f(k) = f(k) + c * w( iabs(k-i)+1 ) 15 continue 20 continue return end c April 12, 1986 bin2.f c c Find bin counts of data array "x(n,2)" for ASH estimator c c "nbin(1)" by "nbin(2)" bins are formed c c x:axis [ ab(1,1) , ab(1,2) ) c y:axis [ ab(2,1) , ab(2,2) ) half-open c c bin counts returned in array "nc" - # pts outside [a,b) = "nskip" c ##### Copyright 1986 David W. Scott subroutine bin2 ( x , n , ab , nbin , nc , nskip ) dimension x(n,2) , nbin(2) , nc(nbin(1),nbin(2)) , ab(2,2) nskip = 0 ax = ab(1,1) bx = ab(1,2) ay = ab(2,1) by = ab(2,2) do 5 j = 1,nbin(2) do 4 i = 1,nbin(1) nc(i,j) = 0 4 continue 5 continue dx = (bx-ax) / nbin(1) dy = (by-ay) / nbin(2) do 10 i = 1,n kx = (x(i,1)-ax) / dx + 1.0 ky = (x(i,2)-ay) / dy + 1.0 if (kx.ge.1 .and. kx.le.nbin(1) .and. 1 ky.ge.1 .and. ky.le.nbin(2)) then nc(kx,ky) = nc(kx,ky) + 1 else nskip = nskip + 1 end if 10 continue return end c April 12, 1986 ash2.f c c Computer ASH density estimate; Product Quartic (biweight) kernel c c Average of "m[1] by m[2]" shifted histograms c c Bin counts in matrix "nc" - from routine "nbin2" c c ASH estimates returned in matrix "f" c c FP-ASH plotted at a+d/2 ... b-d/2 where d = (b-a)/nbin c c Note: If "nskip" was nonzero, ASH estimates incorrect near boundary c Note: Should leave "m" empty bins on each end of array "nc" so f OK c #### Copyright 1986 David W. Scott c #### added kernel option kopt 5-8-91 subroutine ash2 ( m, nc, nbinx, nbiny, ab, kopt, f, w, ier ) dimension m(2) , nc(nbinx,nbiny) , ab(2,2) , f(nbinx,nbiny) dimension w( m(1) , m(2) ), kopt(2) ier = 0 ax = ab(1,1) bx = ab(1,2) ay = ab(2,1) by = ab(2,2) c-compute weights cons * ( 1-abs(i/m)^kopt1)^kopt2 c -- should sum to "m" 5-8-91 c w-array shifted by 1 mx = m(1) my = m(2) mxm1 = mx-1 mym1 = my-1 xm = mx ym = my c-----ASSUMES f dimensioned larger than m(1) or m(2) c-put marginal weights in f array as work array f(1,1) = 1.0 f(2,1) = 1.0 c consx = sum of weights from -(m-1) to (m-1) = 1 + 2 (sum from 1 to m-1) consx = 1.0 consy = 1.0 do 5 i = 1,mxm1 f(1,i+1) = ( 1.0 - abs(i/xm)**kopt(1) )**kopt(2) consx = consx + 2*f(1,i+1) 5 continue consx = float(mx)/consx do 6 i = 1,mym1 f(2,i+1) = ( 1.0 - abs(i/ym)**kopt(1) )**kopt(2) consy = consy + 2*f(2,i+1) 6 continue consy = float(my)/consy c-computer product weight array (avoids later multiplications) do 3 j = 1,my do 2 i = 1,mx w(i,j) = (consx*f(1,i)) * (consy*f(2,j)) 2 continue 3 continue c-compute ash(m) estimate n = 0 do 10 j = 1,nbiny do 9 i = 1,nbinx f(i,j) = 0.0 n = n + nc(i,j) 9 continue 10 continue c-check if estimate extends beyond mesh ncheck = 0 do 12 j = my , nbiny+1-my do 11 i = mx , nbinx+1-mx ncheck = ncheck + nc(i,j) 11 continue 12 continue if (ncheck .ne. n) ier = 1 dx = (bx-ax) / nbinx dy = (by-ay) / nbiny hx = mx*dx hy = my*dy do 20 j = 1,nbiny do 19 i = 1,nbinx if (nc(i,j).eq.0) goto 19 c = nc(i,j) / (n*hx*hy) do 18 ky = max0(1,j-mym1) , min0(nbiny,j+mym1) do 17 kx = max0(1,i-mxm1) , min0(nbinx,i+mxm1) f(kx,ky) = f(kx,ky) + c * 1 w( iabs(kx-i)+1 , iabs(ky-j)+1 ) 17 continue 18 continue 19 continue 20 continue return end bin1---------------------------------------------------------------------- .BG .FN bin1 .TL bin1: Function to compute array of bin counts for a data vector .CS bin1(x, ab, nbin=50) .PP .AG x (input) data vector .AG ab (input vector of length 2): half-open interval for bins [a,b). If no value is specified, the range of x is stretched by 5% at each end and used the interval. .AG nbin (input integer): number of bins desired. Default 50. .AG opt (input OPTIONAL logical): to suppress output messages, set opt=TRUE .RT bin1 returns a list including the vector of integer bin counts and the ab vector and the number of points outside the ab interval. .EX x <- rnorm(100) # data vector ab <- c(-5,5) # bin interval bins <- bin1(x,ab,10) # bin x into 10 bins over ab .KW density estimation .WR ash1---------------------------------------------------------------------- .BG .FN ash1 .TL ash1: Computes univariate averaged shifted histogram (polynomial kernel) .CS ash1(bins, m, kopt) .PP .AG bins (input list) $nc=integer vector of bin counts and $ab=bin interval .AG m (input) optional integer smoothing parameter; default=5. .AG kopt (input) vector of length 2 specifying the kernel, which is proportional to ( 1 - abs(i/m)**kopt(1) )**kopt(2); (2,2)=biweight (default); (0,0)=uniform; (1,0)=triangle; (2,1)=Epanechnikov; (2,3)=triweight. .RT returns structure suitable for input to "plot" .RC x=t vector of bin center locations .RC y=f vector of ash estimates .RC ier 0=normal exit; 1=estimate nonzero outside interval ab .EX x <- rnorm(100) # data f <- ash1(bin1(x,nbin=50),5) # compute ash estimate plot( f , type="l" ) # line plot of estimate .KW density estimation .WR bin2---------------------------------------------------------------------- .BG .FN bin2 .TL bin2: Bin bivariate data x .CS bin2(x, ab, nbin) .PP .AG x (input matrix with 2 columns) data sample .AG ab (input 2 x 2 matrix) rows 1 and 2 contain x and y axis bin intervals, respectively. If not specified, the ranges are stretched by 5% at each end for each dimension. .AG nbin (input vector of length 2) number of bins along x and y axes. Default is 20 by 20. .RT bin2 returns a list including the bivariate bin matrix (see "persp" function for orientation) and the number of points outside the ab rectangle. .EX x <- matrix( rnorm(200), 100 , 2) # bivariate normal n=100 ab <- matrix( c(-5,-5,5,5), 2, 2) # interval [-5,5) x [-5,5) nbin <- c( 20, 20) # 400 bins bins <- bin2(x, ab, nbin) # bin counts,ab,nskip .KW density estimation .WR ash2---------------------------------------------------------------------- .BG .FN ash2 .TL ash2: Compute bivariate ASH estimate (product polynomial kernel) .CS ash2(bins, m, kopt) .PP .AG bins (input list) bin count matrix nc and interval matrix ab from "bin2" .AG m (input integer vector of length 2) x and y direction smoothing parameters. Default is 5 by 5. .RT Matrix of ASH estimates returned. Components x,y,z can be given to the contour function directly. Other input variables returned in list for record keeping. .EX Continuing example from help("bin2") 1. To use with "persp" m <- c(5,5) f <- ash2(bins,m) persp( f$z / max(f$z) ) 2. To use with "contour" (continuing part 1) contour( f ) # note: it is a good idea to not plot contours at level = 0 .KW density estimation .WR .fi