ds.scatterPlot {dsBaseClient} | R Documentation |

This function uses two disclosure control methods to generate non-disclosive scatter plots of two continuous variables.

ds.scatterPlot(x = NULL, y = NULL, method = "deterministic", k = 3, noise = 0.25, type = "split", datasources = NULL)

`x` |
a character, the name of a numeric vector, the x-variable. |

`y` |
a character, the name of a numeric vector, the y-variable. |

`method` |
a character that specifies the method used to generate non-disclosive
coordinates to be displayed in a scatter plot. If the |

`k` |
the number of nearest neighbours whose centroid is calculated.
The user can choose any value for k equal to or greater than the pre-specified threshold
used as a disclosure control for this method and lower than the number of observations
minus the value of this threshold. By default the value of k is set to be equal to 3
(we suggest k to be equal to, or bigger than, 3). Note that the function fails if the user
uses the default value but the study has set a bigger threshold. The value of k is used only
if the argument |

`noise` |
the percentage of the initial variance that is used as the variance of the embedded
noise if the argument |

`type` |
a character which represents the type of graph to display. A scatter plot for
combined data is generated when the |

`datasources` |
a list of opal object(s) obtained after logging in to opal servers;
these objects also hold the data assigned to R, as |

As the generation of a scatter plot from original data is disclosive and is not
permitted in DataSHIELD, this function allows the user to plot non-disclosive scatter plots.
If the argument `method` is set to 'deterministic', the server-side function searches
for the k-1 nearest neighbours of each data point and calculates the centroid of these k
points. Proximity is defined by the minimum Euclidean distance between z-score transformed data points.
Once the coordinates of all centroids have been calculated, the function rescales the
centroids to restore the dispersion of the original data. The scaling is achieved by multiplying
the centroids by a scaling factor equal to the ratio between the standard deviation of
the original variable and the standard deviation of the calculated centroids. The coordinates of
the scaled centroids are then returned to the client. The value of k in this deterministic
approach is specified by the user. The suggested and default value is 3, which is also
the suggested minimum threshold used to prevent disclosure, as specified in the
protection filter 'nfilter.kNN'. As the value of k increases, the disclosure risk decreases
but the utility loss increases.
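
The deterministic steps described above can be sketched in base R on plain local vectors. This is only an illustration, not the server-side implementation: the function name `knn_centroids` is hypothetical, the centroids are assumed to be averages of the original (untransformed) coordinates, and the scaling is assumed to be applied about the mean of the centroids.

```r
# Hypothetical helper: illustrates the k-nearest-neighbour centroid idea on
# plain local vectors. It is not the DataSHIELD server-side function.
knn_centroids <- function(x, y, k = 3) {
  stopifnot(length(x) == length(y), k >= 1, k < length(x))

  # Standardise both variables so that distances are on a common scale.
  z <- cbind((x - mean(x)) / sd(x), (y - mean(y)) / sd(y))

  # Pairwise Euclidean distances between the z-scored points.
  d <- as.matrix(dist(z))

  # For each point take the k closest points (the point itself plus its
  # k-1 nearest neighbours) and average their original coordinates.
  centroids <- t(sapply(seq_along(x), function(i) {
    nearest <- order(d[i, ])[seq_len(k)]
    c(mean(x[nearest]), mean(y[nearest]))
  }))

  # Assumption: stretch the centroids about their mean so that their
  # standard deviation matches that of the original variable.
  rescale <- function(centroid, original) {
    scaling_factor <- sd(original) / sd(centroid)
    (centroid - mean(centroid)) * scaling_factor + mean(centroid)
  }

  data.frame(x = rescale(centroids[, 1], x),
             y = rescale(centroids[, 2], y))
}

# Usage on simulated data (values are illustrative only).
set.seed(1)
x <- rnorm(100, mean = 25, sd = 4)
y <- 0.2 * x + rnorm(100)
masked <- knn_centroids(x, y, k = 3)
```

After rescaling, `sd(masked$x)` matches `sd(x)` up to floating-point error, which is the sense in which the dispersion of the original data is restored.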
If the argument `method` is set to 'probabilistic', the server-side function generates
random normal noise with zero mean and variance equal to a percentage of the variance of each
variable, as given by the argument `noise`. The noise is added to each $x$ and $y$ variable, and the
noise-perturbed data are returned to the client. Note that the seed of the random number generator is fixed to a
specific number generated from the data, so the user gets the same figure every time
they choose the probabilistic method for a given set of variables.
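
The probabilistic approach can be illustrated with a short base R sketch. The helper `add_noise` is hypothetical, and the way the seed is derived from the data here is only a stand-in, since this page does not describe the actual derivation.

```r
# Hypothetical helper: perturbs a numeric vector with zero-mean normal noise
# whose variance is `noise` times the variance of the vector.
add_noise <- function(v, noise = 0.25) {
  # Stand-in seed derived from the data so that repeated calls give the
  # same output; the real derivation is not described in this page.
  # Note that set.seed() changes the global random number generator state.
  seed <- as.integer(round(sum(abs(v))) %% 1e6)
  set.seed(seed)
  v + rnorm(length(v), mean = 0, sd = sqrt(noise * var(v)))
}

# Usage: two calls on the same data return identical perturbed values.
x <- c(21.5, 24.1, 27.8, 30.2, 19.9, 26.4)
first <- add_noise(x)
second <- add_noise(x)
same_result <- identical(first, second)
```

Because the seed depends only on the data, the same variables always produce the same perturbed coordinates, so repeated requests cannot be averaged to recover the original values.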

one or more scatter plots depending on the argument `type`

Demetris Avraam for DataSHIELD Development Team

## Not run:

# load the file that contains the login details
data(logindata)

# login to the servers
opals <- opal::datashield.login(logins=logindata, assign=TRUE)

# Example 1: generate a scatter plot for each study separately (the default behaviour)
ds.scatterPlot(x='LD$PM_BMI_CONTINUOUS', y='LD$LAB_GLUC_ADJUSTED', type="split")

# Example 2: generate a combined scatter plot with the default deterministic method
ds.scatterPlot(x='LD$PM_BMI_CONTINUOUS', y='LD$LAB_GLUC_ADJUSTED', k=3, method='deterministic', type='combine')

# Example 3: if a variable is of type factor the scatter plot is not created
ds.scatterPlot(x='LD$PM_BMI_CATEGORICAL', y='LD$LAB_GLUC_ADJUSTED')

# Example 4: same as Example 2 but with k=50
ds.scatterPlot(x='LD$PM_BMI_CONTINUOUS', y='LD$LAB_GLUC_ADJUSTED', k=50, method='deterministic', type='combine')

# Example 5: same as Example 2 but with k=1740 (here we see that as k increases we have big utility loss)
ds.scatterPlot(x='LD$PM_BMI_CONTINUOUS', y='LD$LAB_GLUC_ADJUSTED', k=1740, method='deterministic', type='combine')

# Example 6: same as Example 5 but for split analysis
ds.scatterPlot(x='LD$PM_BMI_CONTINUOUS', y='LD$LAB_GLUC_ADJUSTED', k=1740, method='deterministic', type='split')

# Example 7: if k is less than the specified threshold then the scatter plot is not created
ds.scatterPlot(x='LD$PM_BMI_CONTINUOUS', y='LD$LAB_GLUC_ADJUSTED', k=2, method='deterministic')

# Example 8: generate a combined scatter plot with the probabilistic method
ds.scatterPlot(x='LD$PM_BMI_CONTINUOUS', y='LD$LAB_GLUC_ADJUSTED', method='probabilistic', type='combine')

# Example 9: generate a scatter plot with the probabilistic method for each study separately
ds.scatterPlot(x='LD$PM_BMI_CONTINUOUS', y='LD$LAB_GLUC_ADJUSTED', method='probabilistic', type='split')

# Example 10: same as Example 9 but with higher level of noise
ds.scatterPlot(x='LD$PM_BMI_CONTINUOUS', y='LD$LAB_GLUC_ADJUSTED', method='probabilistic', noise=0.5, type='split')

# Example 11: if 'noise' is less than the specified threshold then the scatter plot is not created
ds.scatterPlot(x='LD$PM_BMI_CONTINUOUS', y='LD$LAB_GLUC_ADJUSTED', method='probabilistic', noise=0.1, type='split')

# clear the Datashield R sessions and logout
opal::datashield.logout(opals)

## End(Not run)

[Package *dsBaseClient* version 5.0.0 ]