dRisk.R and issue in interval measure definition #49

Open
leodecarlo opened this issue Aug 6, 2024 · 2 comments

leodecarlo commented Aug 6, 2024

Dear Authors,

Below I copy the message from the issue I opened in the sdcMicro development repository: sdcTools/sdcMicro#351 (comment).
Bernhard-da agrees that something should change, quoting: "hi, thanks for your question. I agree that there is some kind of ambiguity. I want to note that the sdcguide is not written by the maintainers of sdcMicro so I would suggest to create an issue for the authors of the guide https://github.com/ihsn/SDCPractice/issues"

A colleague and I think there is a problem with the dRisk.R method in the sdcMicro library:

dRisk_link

The guide's section on the interval measure:

interval_measure

says "intervals are created around each perturbed value and then a determination is made as to whether the original value of that perturbed observation is contained in this interval."
We agree that this is what lines 84 to 87 of dRisk_link do: they count 1 when x is inside the interval created around x_m and 0 when x is outside.

But we find that the next lines in interval_measure seem to say something different from what the method does:

"Values that are within the interval around the initial value after perturbation are considered too close to the initial value and hence unsafe and need more perturbation. Values that are outside of the intervals are considered safe."

and

"The result 1 indicates that all (100 percent) the observations are outside the interval of 0.1 times the standard deviation around the original values."

That is, the guide refers to intervals created around the original values, whereas the function creates them around the perturbed values x_m. The guide also says that a value counts as 0 when it is inside the interval and 1 when it is outside, whereas we understand the function to do the opposite around x_m.
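
To make the two readings concrete, here is a minimal base-R sketch for a single numeric variable. The helper names `share_inside_perturbed` and `share_outside_original`, the toy vectors, and the fixed half-width `w` (used in place of a multiple of the standard deviation) are assumptions for illustration only; this is not the sdcMicro code.

```r
# Illustrative sketch only: hypothetical helpers, not the sdcMicro implementation.

# Reading A (what we understand the function to do): build an interval around
# each perturbed value xm and report the share of originals x INSIDE it.
share_inside_perturbed <- function(x, xm, w) {
  mean(x >= xm - w & x <= xm + w)
}

# Reading B (the guide's wording): build an interval around each original
# value x and report the share of perturbed values xm OUTSIDE it.
share_outside_original <- function(x, xm, w) {
  mean(xm < x - w | xm > x + w)
}

x  <- c(10, 20, 30)       # toy original values
xm <- c(10.5, 25, 30.1)   # toy perturbed values
w  <- 1                   # assumed fixed half-width of every interval

reading_a <- share_inside_perturbed(x, xm, w)
reading_b <- share_outside_original(x, xm, w)
print(c(reading_a = reading_a, reading_b = reading_b))
```

With the same half-width on both sides, "x lies within w of x_m" and "x_m lies within w of x" are the same condition, so the two readings are complements of each other: one measures the share that is too close, the other the share that is far enough away. Which of the two is reported therefore decides whether a value near 1 means risky or safe.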

Below is an R script that tests the strange behavior of the function: as the noise in the perturbed values increases, the dRisk() method gives 1 for very high noise and very small values for very low noise. Here is the script:

library(sdcMicro)
data(testdata2)  # example data set shipped with sdcMicro

# Categorical key variables and the numeric variable to perturb
keys <- c('sex', 'age')
num_var <- c('expend')

sdc1 <- createSdcObj(dat = testdata2, keyVars = keys, numVars = num_var)

# Very strong noise
set.seed(100)
out <- addNoise(sdc1, noise = 500)
high_noise <- out@risk$numeric

# Almost no noise
set.seed(100)
out <- addNoise(sdc1, noise = 0.001)
low_noise <- out@risk$numeric

sprintf("Level of anonymity with insignificant noise %f. Level of anonymity with high noise %f", low_noise, high_noise)

So we think that part of the guide should be changed, and that whether the dRisk.R method should also change depends on the actual intention (i.e. it can stay as it is if the guide is changed to match it).

@thijsbenschop
Collaborator

@leodecarlo, thank you very much for noting this and raising this issue. We will adjust the text in the guide to reflect the calculations in the dRisk function of the sdcMicro package.

@leodecarlo
Author

Thank you for the reply.
