Expose Dataset API #158

Open
folmos-at-orange opened this issue Mar 12, 2024 · 1 comment
Labels
  • Priority/1-Medium: To do after P0
  • Status/Draft: The issue is still not well defined
  • Type/Feature: A new feature request or an improvement of a feature

folmos-at-orange commented Mar 12, 2024

Description

Currently the Dataset class is an internal utility of the sklearn module. The idea is to make this class public so that it can serve as a utility for creating multi-table datasets.

Questions/Ideas

This feature would ease many tasks:

  • Splitting a dataset into train/test sets
  • Sorting a whole dataset
  • Building core API parameters (notably additional_data_tables)
  • Simplifying tutorials and samples with a method such as get_dataset_sample("Accidents", type="pandas")
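
To make the first task concrete, here is a minimal sketch of a key-consistent train/test split for a two-table dataset. It uses plain lists of dicts instead of pandas, and the function name `split_by_key`, the table contents, and the column names `AccidentId` and `VehicleId` are assumptions made for illustration; none of them come from the prototype.

```python
import random


def split_by_key(main_rows, child_rows, key, test_ratio=0.3, seed=0):
    """Split a main table and a child table so child rows follow their parent."""
    # Draw the test set among the distinct keys of the main table.
    keys = sorted({row[key] for row in main_rows})
    rng = random.Random(seed)
    rng.shuffle(keys)
    n_test = int(len(keys) * test_ratio)
    test_keys = set(keys[:n_test])

    def select(rows, in_test):
        # Keep the rows whose key membership in the test set matches in_test.
        return [row for row in rows if (row[key] in test_keys) == in_test]

    train = {"main": select(main_rows, False), "child": select(child_rows, False)}
    test = {"main": select(main_rows, True), "child": select(child_rows, True)}
    return train, test


# Toy data (hypothetical): 10 accidents and 25 vehicles linked by AccidentId.
accidents = [{"AccidentId": i} for i in range(10)]
vehicles = [{"AccidentId": i % 10, "VehicleId": i} for i in range(25)]
train, test = split_by_key(accidents, vehicles, key="AccidentId")
```

The point of the sketch is that the split is decided on the main table keys only, and every secondary table is then filtered with the same key set. Doing this by hand for each table is the kind of repetition a public Dataset class could remove.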

Main design element: A builder pattern.

  • Add mutator methods to construct the dataset from an empty one (Dataset())
  • (Partially) Implemented in prototype:
    • Classes
      • PandasDataset
      • FileDataset
    • add_table(self, name, source, key=None)
      • key mandatory for multi-table
      • source will be different in each Dataset subclass
    • train_test_split (implemented in PandasDataset only)
    • sort, which sorts the tables of the dataset by their keys (implemented in FileDataset only)
    • create_khiops_dictionary_domain
    • create_additional_data_table_param
    • add_relation(self, parent_table_name, child_table_name, one_to_one=False)
  • Not in prototype:
    • remove_table(self, name)
      • Removes all relations associated with the table
    • remove_relation(self, parent_table_name, child_table_name)
    • check(self):
      • Raises warnings and exceptions
        • errors:
          • Non-existent table names
          • No main table set in multi-table datasets
          • No key set in multi-table datasets
        • warnings:
          • Dangling tables
    • add_external_relation(self, parent_table_name, key, another_dataset)
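
The builder pattern listed above could look roughly like the following sketch. It is not the prototype code: the method names and signatures follow the list, but everything else is an assumption, in particular that mutators return `self` for chaining, that the first table added becomes the main table, and that `check` raises `ValueError` and emits `UserWarning`. The file names are placeholders.

```python
import warnings


class Dataset:
    """Hypothetical builder-style container for a multi-table dataset."""

    def __init__(self):
        self.tables = {}      # name -> (source, key)
        self.relations = []   # (parent_table_name, child_table_name, one_to_one)
        self.main_table = None

    def add_table(self, name, source, key=None):
        if self.main_table is None:
            self.main_table = name  # assumption: first table added is the main one
        self.tables[name] = (source, key)
        return self

    def add_relation(self, parent_table_name, child_table_name, one_to_one=False):
        self.relations.append((parent_table_name, child_table_name, one_to_one))
        return self

    def remove_relation(self, parent_table_name, child_table_name):
        pair = (parent_table_name, child_table_name)
        self.relations = [r for r in self.relations if r[:2] != pair]
        return self

    def remove_table(self, name):
        del self.tables[name]
        # Also drop every relation that involves the removed table.
        self.relations = [r for r in self.relations if name not in r[:2]]
        if self.main_table == name:
            self.main_table = None
        return self

    def check(self):
        # Errors: relations that point to non-existent table names.
        for parent, child, _ in self.relations:
            for name in (parent, child):
                if name not in self.tables:
                    raise ValueError(f"Non-existent table name: {name!r}")
        if len(self.tables) > 1:
            # Errors: missing main table or missing key in a multi-table dataset.
            if self.main_table is None:
                raise ValueError("No main table set in multi-table dataset")
            for name, (_, key) in self.tables.items():
                if key is None:
                    raise ValueError(f"No key set for table {name!r}")
            # Warnings: tables that take part in no relation.
            linked = {n for parent, child, _ in self.relations for n in (parent, child)}
            for name in self.tables:
                if name != self.main_table and name not in linked:
                    warnings.warn(f"Dangling table: {name!r}")


ds = (
    Dataset()
    .add_table("Accidents", "accidents.csv", key="AccidentId")
    .add_table("Vehicles", "vehicles.csv", key="AccidentId")
    .add_relation("Accidents", "Vehicles")
)
ds.check()  # passes: both tables exist, both have keys, none is dangling
```

In this sketch `check` is only called explicitly by the user, which corresponds to the second option of the first design question below.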

Design questions:

  • Immediate consistency checks:
    • That is, should check be called at each mutator call?
      • I'm inclined toward this option, since the target audience is not only developers
    • Or should the user check the consistency of the dataset before using it?
  • Should we accept mono-table datasets?
    • This adds many edge cases
  • What about helper functions that take a FileDataset?
    • train_predictor_ds(ds, target_variable_name, output_dir, <kwargs without additional_data_tables, header_line, field_separator>)
    • deploy_model_ds(model_kdic, ds, output_dir, <kwargs without additional_data_tables, header_line, field_separator>)
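
The two options of the first design question can be compared with a small sketch. The class below is deliberately reduced to table names and relations, and the name `RelationGraph`, the `eager_check` flag, and the rollback on failure are all assumptions for illustration, not a proposal for the final API.

```python
class RelationGraph:
    """Hypothetical minimal dataset: only table names and relations."""

    def __init__(self, eager_check=True):
        self.eager_check = eager_check
        self.tables = set()
        self.relations = []

    def check(self):
        # Only one rule here: relations must refer to existing tables.
        for parent, child in self.relations:
            for name in (parent, child):
                if name not in self.tables:
                    raise ValueError(f"Non-existent table name: {name!r}")

    def add_table(self, name):
        self.tables.add(name)
        if self.eager_check:
            self.check()

    def add_relation(self, parent_table_name, child_table_name):
        self.relations.append((parent_table_name, child_table_name))
        if self.eager_check:
            try:
                self.check()
            except ValueError:
                self.relations.pop()  # leave the object unchanged on failure
                raise


# Immediate checks: the error surfaces at the faulty mutator call.
eager = RelationGraph(eager_check=True)
eager.add_table("Accidents")
try:
    eager.add_relation("Accidents", "Vehicles")
    eager_error = None
except ValueError as error:
    eager_error = error

# Deferred checks: the same call succeeds and the error surfaces at check().
deferred = RelationGraph(eager_check=False)
deferred.add_table("Accidents")
deferred.add_relation("Accidents", "Vehicles")
```

One consequence visible in the sketch: with immediate checks the order of mutator calls matters (a table must be added before any relation that mentions it), whereas deferred checks accept any order but report problems later.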
folmos-at-orange added the Status/Draft, Type/Feature and Priority/0-High labels on Mar 12, 2024
folmos-at-orange self-assigned this on Mar 12, 2024

folmos-at-orange commented

Waiting for input after the first version of the spec: https://github.com/KhiopsML/khiops-python/wiki/Dataset-Spec-Proposal

folmos-at-orange added the Priority/1-Medium label and removed the Priority/0-High label on Jun 27, 2024
folmos-at-orange removed their assignment on Sep 23, 2024