Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: Merging two sibling ADaM datasets using default joining keys results in a cartesian product. #317

Open
3 tasks done
chlebowa opened this issue May 16, 2024 · 1 comment
Labels

Comments

@chlebowa
Copy link
Contributor

What happened?

Inference of joining keys by parent seems insufficient.

Consider a case where two child datasets are joined.
ADLB and ADRS have the same primary keys, "STUDYID", "USUBJID", "PARAMCD", "AVISIT".

> library(teal.data)
> library(dplyr)
> 
> ADLB <- rADLB
> ADRS <- rADRS
>
> jk <- default_cdisc_join_keys["ADLB", "ADRS"]
>
> full_join(ADLB, ADRS) |> dim()
Joining with `by = join_by(STUDYID, USUBJID, SUBJID, SITEID, AGE, AGEU, SEX, RACE, ETHNIC, COUNTRY, DTHFL, INVID, INVNAM, ARM, ARMCD, ACTARM, ACTARMCD, TRT01P, TRT01A, TRT02P, TRT02A, REGION1, STRATA1,  
STRATA2, BMRKR1, BMRKR2, ITTFL, SAFFL, BMEASIFL, BEP01FL, AEWITHFL, RANDDT, TRTSDTM, TRTEDTM, TRT01SDTM, TRT01EDTM, TRT02SDTM, TRT02EDTM, AP01SDTM, AP01EDTM, AP02SDTM, AP02EDTM, EOSSTT, EOTSTT, EOSDT,   
EOSDY, DCSREAS, DTHDT, DTHCAUS, DTHCAT, LDDTHELD, LDDTHGR1, LSTALVDT, DTHADY, ADTHAUT, ASEQ, PARAM, PARAMCD, AVAL, ADTM, ADY, AVISIT, AVISITN)`
[1] 11600   104
> full_join(ADLB, ADRS, by = jk) |> dim()
[1] 67200   165
Warning message:
In full_join(ADLB, ADRS, by = jk) :
  Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 1 of `x` matches multiple rows in `y`.
ℹ Row 1 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship = "many-to-many"` to silence this warning.
>

Joining by default, i.e. using intersect(names(x), names(y)) correctly uses all primary keys as joining keys. Extracting a join_key_set from default_cdisc_join_keys results in a cartesian product.

sessionInfo()

> packageVersion("teal.data")
[1] '0.6.0'

Relevant log output

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct.

Contribution Guidelines

  • I agree to follow this project's Contribution Guidelines.

Security Policy

  • I agree to follow this project's Security Policy.
@chlebowa chlebowa added the bug label May 16, 2024
@averissimo
Copy link
Contributor

This is a byproduct of the incomplete specification of the CDISC datasets file and the implicit relationship inference.

The implicit relationship inference currently works:

image

This creates problem when the 2 datasets have additional shared primary keys, resulting in "unnecessary" duplicate rows when merging.

Possible solution

We could evolve the mechanism to:

  • (keep) only determine implicit relationship if 2 datasets have same keys
  • Infer the joining keys (foreign) keys by intersecting the primary keys of the 2 datasets
    • And union with shared parent keys
POC Patch

This would require just a few modifications.

diff --git a/R/join_keys-utils.R b/R/join_keys-utils.R
index 7ace0e74..6626755a 100644
--- a/R/join_keys-utils.R
+++ b/R/join_keys-utils.R
@@ -136,12 +136,14 @@ update_keys_given_parents <- function(x) {
         common_ix_1 <- unname(keys_d1_parent) %in% unname(keys_d2_parent)
         common_ix_2 <- unname(keys_d2_parent) %in% unname(keys_d1_parent)
 
+        pk_intersect <- intersect(jk[d1, d1], jk[d2, d2])
+
         # No common keys between datasets - leave empty
         if (all(!common_ix_1)) next
 
         fk <- structure(
-          names(keys_d2_parent)[common_ix_2],
-          names = names(keys_d1_parent)[common_ix_1]
+          union(pk_intersect, names(keys_d2_parent)[common_ix_2]),
+          names = union(pk_intersect, names(keys_d1_parent)[common_ix_1])
         )
         jk[[d1]][[d2]] <- fk # mutate join key
       }

Assumption on this:

  • joining keys names that are the same, connect two datasets

This fix/enhancement would change the behavior of joining keys and we would need to communicate the change to our users. It's not a breaking change though.

Alternative solution (manual)

Expand the cdisc_datasets.yaml file to include joining keys for additional relationships between datasets that are different from parent.

I would advise against this as it would greatly expand on that file and its complexity

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants