ds.tapply.assign {dsBaseClient}    R Documentation

ds.tapply.assign calling tapplyDS.assign

Description

Applies one of a selected range of functions to summarize an outcome variable over one or more indexing factors, and writes the resultant summary as an object on the serverside.

Usage

ds.tapply.assign(X.name = NULL, INDEX.names = NULL, FUN.name = NULL,
  newobj = "tapply.out", datasources = NULL)

Arguments

X.name

the name of the variable to be summarized. The user must set the name as a character string in inverted commas. For example: X.name="var.name"

INDEX.names

the name of a single factor or a vector of names of factors to index the variable to be summarized. Each name must be specified in inverted commas. For example: INDEX.names="factor.name" or INDEX.names=c("factor1.name", "factor2.name", "factor3.name"). The native R tapply function can coerce non-factor vectors into factors. However, this does not always work when using the DataSHIELD ds.tapply and ds.tapply.assign functions, so if you are concerned that an indexing vector is not being treated correctly as a factor, first declare it explicitly as a factor using ds.asFactor.

FUN.name

the name of one of the allowable summarizing functions to be applied, specified in inverted commas. The present version of the function allows the user to choose one of five summarizing functions. These are "N" (or "length"), "mean", "sd", "sum", or "quantile". For more information see Details.

newobj

A character string specifying the name of the serverside object to which the output is to be written. If no <newobj> argument is specified, the output is written to "tapply.out".

datasources

specifies the particular opal object(s) to use. If the <datasources> argument is not specified the default set of opals will be used. The default opals are called default.opals and the default can be set using the function ds.setDefaultOpals. If the <datasources> is to be specified, it should be set without inverted commas: e.g. datasources=opals.em or datasources=default.opals. If you wish to apply the function solely to e.g. the second opal server in a set of three, the argument can be specified as: e.g. datasources=opals.em[2]. If you wish to specify the first and third opal servers in a set you specify: e.g. datasources=opals.em[c(1,3)]
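As a rough illustration of how the arguments above fit together, the following sketch assembles a call using only the argument names shown under Usage. The serverside variable names "D$LAB_HDL" and "D$GENDER", the output name "hdl.mean.by.gender" and the login object opals are assumptions made for illustration, not part of this documentation. The call is only attempted when dsBaseClient is installed and an object called opals already exists in the session.

```r
# Arguments for the call, using the argument names from the Usage section.
# "D$LAB_HDL", "D$GENDER" and "hdl.mean.by.gender" are hypothetical names.
call.args <- list(
  X.name      = "D$LAB_HDL",
  INDEX.names = "D$GENDER",
  FUN.name    = "mean",
  newobj      = "hdl.mean.by.gender"
)

# Only attempt the call when dsBaseClient is installed and a login object
# named 'opals' (an assumption) already exists in the session.
if (requireNamespace("dsBaseClient", quietly = TRUE) && exists("opals")) {
  # datasources is passed as an object, not as a character string.
  do.call(dsBaseClient::ds.tapply.assign,
          c(call.args, list(datasources = opals)))
}
```

To restrict the call to a subset of servers, the same pattern would pass a subset of the login object, for example opals[c(1,3)], as datasources.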

Details

ds.tapply.assign is a clientside function that calls the serverside assign function tapplyDS.assign. The serverside function uses the native R function tapply() to apply one of a selected range of functions to each cell of a ragged array, that is, to each (non-empty) group of values given by each unique combination of a series of indexing factors.

The native R tapply function is very flexible, but the range of allowable summarizing functions is much more restricted for the DataSHIELD ds.tapply and ds.tapply.assign functions. This is to protect against disclosure risk. At present the allowable functions are: N or length (the number of non-missing observations in the group defined by each combination of indexing factors); mean; sd (standard deviation); sum; and quantile (with quantile probabilities set at c(0.05,0.1,0.2,0.25,0.3,0.33,0.4,0.5,0.6,0.67,0.7,0.75,0.8,0.9,0.95)). Should other functions be required in the future then, provided they are non-disclosive, the DataSHIELD development team could work on them if requested.

As an assign function, tapplyDS.assign writes the summarized values to the serverside. Unlike the aggregate function tapplyDS, tapplyDS.assign returns no results to the clientside, so it is fundamentally non-disclosive and the number of observations in each unique indexing group does not need to be evaluated against nfilter.tab (the minimum allowable non-zero count in a contingency table). This means that tapplyDS.assign can be used, for example, to break a dataset down into a small number of values for each individual and then to flag up which individuals have at least one positive value for a binary outcome variable. This will almost inevitably generate some indexing groups smaller than nfilter.tab, but as the results are simply written as newobj to the serverside rather than returned to the clientside there is no overt disclosure risk.

The native R tapply function has optional arguments such as na.rm=TRUE for FUN = mean, which will exclude any NAs from the outcome variable to be summarized. However, in order to keep DataSHIELD's ds.tapply and ds.tapply.assign functions straightforward, the serverside functions tapplyDS and tapplyDS.assign both start by stripping any observations which have missing (NA) values in either the outcome variable or in any one of the indexing factors. In consequence, the resultant analyses are always based on complete cases.
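The complete-case behaviour and the per-individual flagging use case described above can be mimicked in base R. The sketch below is not the serverside implementation of tapplyDS.assign; it only reproduces the described steps on toy data using native tapply(). The variable names outcome and id, and the toy values, are assumptions made for illustration.

```r
# Toy data standing in for serverside variables (all names hypothetical).
outcome <- c(1, 0, NA, 0, 0, 1, 1, 0)
id      <- factor(c("a", "a", "a", "b", "b", "c", NA, "c"))

# Step 1: drop any observation with an NA in the outcome or the index.
keep       <- !is.na(outcome) & !is.na(id)
outcome.cc <- outcome[keep]
id.cc      <- id[keep]

# Step 2: summarize within each group using native tapply().
n.per.id   <- tapply(outcome.cc, id.cc, length)  # complete observations
sum.per.id <- tapply(outcome.cc, id.cc, sum)     # number of positive values

# Step 3: flag individuals with at least one positive binary outcome.
any.positive <- as.numeric(sum.per.id > 0)
names(any.positive) <- names(sum.per.id)

any.positive
# a b c
# 1 0 1
```

Each group here contains only two complete observations, which illustrates why such summaries are written to the serverside rather than returned to the client.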

Value

An array of the summarized values created by the tapplyDS.assign function. This array is written to the serverside under the name given by newobj. It has as many dimensions as there are indexing factors named in INDEX.names.
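The relationship between the number of indexing factors and the dimensions of the result can be checked with native tapply() on toy data. This is a base R sketch of the behaviour that the serverside function builds on, with hypothetical variable names x, f1 and f2.

```r
# Toy outcome and two indexing factors (all names hypothetical).
x  <- c(5.1, 4.9, 6.2, 5.8, 7.0, 6.4)
f1 <- factor(c("m", "m", "f", "f", "m", "f"))
f2 <- factor(c("young", "old", "young", "old", "young", "old"))

# One indexing factor gives a one-dimensional array.
one.way <- tapply(x, f1, mean)

# Two indexing factors give a two-dimensional array (one cell per combination).
two.way <- tapply(x, list(f1, f2), mean)

length(dim(one.way))  # 1
length(dim(two.way))  # 2
dim(two.way)          # 2 2
```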

Author(s)

Paul Burton, Demetris Avraam for DataSHIELD Development Team


[Package dsBaseClient version 5.0.0]