Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect handling of empty dataframe during conditional sampling #2334

Open
omelyanchikd opened this issue Jan 7, 2025 · 2 comments · May be fixed by #2335
Open

Incorrect handling of empty dataframe during conditional sampling #2334

omelyanchikd opened this issue Jan 7, 2025 · 2 comments · May be fixed by #2335
Labels
bug Something isn't working feature:sampling Related to generating synthetic data after a model is built under discussion Issue is currently being discussed

Comments

@omelyanchikd
Copy link

Environment Details

  • SDV version: 1.17.2
  • Python version: 3.9.19
  • Operating System: macOS 14.7.1

Error 1:

  • If single table synthesizer model doesn't produce any valid samples while trying to sample remaining columns, method _filter_conditions will throw an unhandled error when trying to access any of dataframe columns, because it is empty, and sampling will be stopped:

From sdv/single_table/base.py/_sample_rows

            sampled = self._data_processor.filter_valid(sampled)

            if conditions is not None:
                sampled = self._filter_conditions(sampled, conditions, float_rtol)

Line causing an error from sdv/single_table/base.py/_filter_conditions

        for column, value in conditions.items():
            column_values = sampled[column]

Error 2:

  • If single table synthesizer model has more than one constraint, and one of these constraints filters out all rows in the generated sample while trying to sample remaining columns, loop in method filter_valid will throw an unhandled error when trying to access any of sample dataframe columns, because it is empty, and sampling will be stopped:

From sdv/data_processing/data_processor.py/filter_valid

        for constraint in self._constraints:
            data = constraint.filter_valid(data)
@omelyanchikd omelyanchikd added bug Something isn't working new Automatic label applied to new issues labels Jan 7, 2025
@npatki
Copy link
Contributor

npatki commented Jan 8, 2025

Hi @omelyanchikd, nice to meet you.

My high-level understanding is that conditional sampling may fail to sample rows sometimes for completely valid reasons ... and that different parts of the code are responsible for re-trying the sampling until a sufficient # of rows are produced. This is why the sample function has parameters such as batch_size, max_tries_per_batch, etc.

So it's a bit hard for me to understand whether your proposed changes are internal optimizations, or whether they will fix a user-facing bug?

If it is a bug, would you mind sharing an example of code that replicates the bug you are trying to fix? -- eg. using an SDV synthesizer (such as GaussianCopulaSynthesizer) with conditional sampling, what fails? It would be nice to validate what the bug is before we try to fix it. Thanks.

@npatki npatki added under discussion Issue is currently being discussed feature:sampling Related to generating synthetic data after a model is built and removed new Automatic label applied to new issues labels Jan 8, 2025
@omelyanchikd
Copy link
Author

omelyanchikd commented Jan 9, 2025

hi @npatki
thank you for the quick response!

My high-level understanding is that conditional sampling may fail to sample rows sometimes for completely valid reasons ... and that different parts of the code are responsible for re-trying the sampling until a sufficient # of rows are produced. This is why the sample function has parameters such as batch_size, max_tries_per_batch, etc.

yes, you are totally correct that conditional sampling may fail for completely valid rows, and existing code is able to retry sampling until a sufficient # of rows are produced.

However, if valid rows filter produces an empty dataframe, the next line of code is checking whether sampled rows satisfy conditions, and this step produces a KeyError when trying to access a column of an empty dataframe, which breaks the whole generation process and makes it a user-facing bug. In my dataset this error could occur as early as within first 5 seconds after sampling start.

A simple fix that I am suggesting in this PR #2335 just returns an empty dataframe BEFORE accessing columns if the input is empty. And in this case all existing mechanisms for batch_sizes and sampling retries work as expected.

Unfortunately, I cannot share with you any actual data on which the code has been run, but an error did occur in an unmodified call to synthesizer.sample_remaining_columns(known_columns=df).

If you review my suggested change and code in question, you will immediately see why the error might be occurring in the real situation.

P.S. It is just two lines of code, so not a big change to review or look into.

Thank you very much in advance!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working feature:sampling Related to generating synthetic data after a model is built under discussion Issue is currently being discussed
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants