Expose Dataset API #158

folmos-at-orange · 2024-03-12T11:15:40Z

Description

Currently the Dataset class is an internal utility for the sklearn module. The idea is to make this class public so that it can serve as a utility for creating multi-table datasets.

Questions/Ideas

This feature would ease many tasks:

Splitting a dataset in train/test
Sorting a whole dataset
Building core API parameters (notably additional_data_tables)
Simplifying tutorials and samples with a method such as get_dataset_sample("Accidents", type="pandas")
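
To illustrate why train/test splitting benefits from a dataset-level helper, here is a minimal sketch of the logic involved: the split is decided on the main table's keys, and every secondary row follows its parent. The function name, the column names and the list-of-dicts representation are assumptions made for illustration (plain Python rows stand in for DataFrames); this is not the khiops-python implementation.

```python
import random


def split_multi_table(main_rows, secondary_rows, key, test_fraction=0.3, seed=0):
    """Hypothetical helper: split on main-table keys, then route child rows."""
    # Decide the split on the distinct keys of the main table.
    keys = sorted({row[key] for row in main_rows})
    random.Random(seed).shuffle(keys)
    test_keys = set(keys[: round(len(keys) * test_fraction)])

    def part(rows, in_test):
        # A row goes to the test side only if its key was drawn for test.
        return [r for r in rows if (r[key] in test_keys) == in_test]

    train = {"main": part(main_rows, False), "secondary": part(secondary_rows, False)}
    test = {"main": part(main_rows, True), "secondary": part(secondary_rows, True)}
    return train, test


# Illustrative data: 10 main rows, 25 secondary rows linked by "AccidentId".
accidents = [{"AccidentId": i} for i in range(10)]
vehicles = [{"AccidentId": v % 10, "VehicleId": v} for v in range(25)]
train, test = split_multi_table(accidents, vehicles, key="AccidentId")
```

The sketch assumes every secondary row refers to a key that exists in the main table; rows with an unknown key would silently land on the train side.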

Main design element: A builder pattern.

Add mutator methods to construct the dataset from an empty one (Dataset())
(Partially) Implemented in prototype:
- Classes
  - PandasDataset
  - FileDataset
- add_table(self, name, source, key=None)
  - key mandatory for multi-table
  - source will be different in each Dataset subclass
- train_test_split (implemented in PandasDataset only)
- sort: sorts the tables of the dataset by their keys (implemented in FileDataset only)
- create_khiops_dictionary_domain
- create_additional_data_table_param
- add_relation(self, parent_table_name, child_table_name, one_to_one=False)
Not in prototype:
- remove_table(self, name)
  - Removes all relations associated with the table
- remove_relation(self, parent_table_name, child_table_name)
- check(self):
  - Raises warnings and exceptions
    - errors:
      - Non-existent table names
      - No main table set in multi-table datasets
      - No key set in multi-table datasets
    - warnings:
      - Dangling tables
- add_external_relation(self, parent_table_name, key, another_dataset)
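
To make the builder idea concrete, here is a rough sketch of how the mutators and check could fit together. It is a toy stand-in, not the prototype: the internal storage, the rule that the first added table becomes the main table, and the definition of a dangling table (not the main table and not part of any relation) are all assumptions.

```python
import warnings


class Dataset:
    """Toy stand-in for the proposed builder; not the khiops-python class."""

    def __init__(self):
        self.tables = {}     # name -> (source, key)
        self.relations = []  # (parent, child, one_to_one)
        self.main_table = None

    def add_table(self, name, source, key=None):
        self.tables[name] = (source, key)
        # Assumption: the first table added becomes the main table.
        if self.main_table is None:
            self.main_table = name
        return self  # returning self allows chained builder calls

    def add_relation(self, parent_table_name, child_table_name, one_to_one=False):
        self.relations.append((parent_table_name, child_table_name, one_to_one))
        return self

    def remove_table(self, name):
        del self.tables[name]
        # Drop every relation that mentions the removed table.
        self.relations = [r for r in self.relations if name not in r[:2]]
        if self.main_table == name:
            self.main_table = None
        return self

    def check(self):
        # Errors: relations must only mention existing tables.
        for parent, child, _ in self.relations:
            for name in (parent, child):
                if name not in self.tables:
                    raise ValueError(f"Non-existent table name: {name!r}")
        if len(self.tables) <= 1:
            return
        # Errors specific to multi-table datasets.
        if self.main_table is None:
            raise ValueError("No main table set in multi-table dataset")
        for name, (_, key) in self.tables.items():
            if key is None:
                raise ValueError(f"No key set for table {name!r}")
        # Warnings: tables that no relation connects to the rest.
        linked = {self.main_table}
        for parent, child, _ in self.relations:
            linked.update((parent, child))
        for name in self.tables:
            if name not in linked:
                warnings.warn(f"Dangling table: {name!r}")


ds = (
    Dataset()
    .add_table("Accidents", "Accidents.txt", key="AccidentId")
    .add_table("Vehicles", "Vehicles.txt", key=["AccidentId", "VehicleId"])
    .add_relation("Accidents", "Vehicles")
)
ds.check()  # passes silently for this consistent dataset
```

Here check is called explicitly once the dataset is built, which is one of the two options raised in the design questions.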

Design questions:

Immediate consistency checks:
- That is, should check be called at each mutator call?
  - I'm inclined toward this option, since the target audience is not only developers
- Or should the user check the consistency of the dataset before using it?
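
One possible way to get immediate checks without repeating the call in every mutator is a small decorator. The sketch below is only an illustration of that option; the `checked` decorator and the `EagerDataset` class are hypothetical names, and only the "non-existent table" rule is checked.

```python
import functools


def checked(mutator):
    """Hypothetical decorator: run self.check() after the wrapped mutator."""

    @functools.wraps(mutator)
    def wrapper(self, *args, **kwargs):
        result = mutator(self, *args, **kwargs)
        self.check()
        return result

    return wrapper


class EagerDataset:
    """Toy class used only to illustrate eager consistency checks."""

    def __init__(self):
        self.tables = {}
        self.relations = []

    def check(self):
        for parent, child in self.relations:
            for name in (parent, child):
                if name not in self.tables:
                    raise ValueError(f"Non-existent table name: {name!r}")

    @checked
    def add_table(self, name, source, key=None):
        self.tables[name] = (source, key)

    @checked
    def add_relation(self, parent_table_name, child_table_name):
        self.relations.append((parent_table_name, child_table_name))
```

Two consequences are worth weighing. Eager checks force an ordering on the calls (tables before relations), and rules about the finished dataset, such as "no key set in multi-table datasets", may not hold in intermediate states. The sketch also does not roll back a mutation that fails its check.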
Should we accept mono-table datasets?
- This adds many edge-cases
What about helper functions using the FileDataset:
- train_predictor_ds(ds, target_variable_name, output_dir, <kwargs without additional_data_tables, header_line, field_separator>)
- deploy_model_ds(model_kdic, ds, output_dir, <kwargs without additional_data_tables, header_line, field_separator>)
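
The point of such helpers would be that the dataset object already knows the values of the excluded parameters. Below is a sketch of that derivation under stated assumptions: `FileDatasetSketch` and `core_api_kwargs` are hypothetical names, `data_table_path` is assumed as the name of the main-table parameter, and the backtick-separated key format for additional_data_tables is a placeholder, since the exact data-path syntax depends on the Khiops version.

```python
from dataclasses import dataclass


@dataclass
class FileDatasetSketch:
    """Hypothetical stand-in for a file-based dataset."""

    main_table: str
    table_paths: dict  # table name -> file path
    header_line: bool = True
    field_separator: str = "\t"


def core_api_kwargs(ds, path_separator="`"):
    """Derive the arguments that the *_ds helpers would fill in themselves."""
    # Assumption: secondary tables are addressed as <main><separator><table>.
    additional = {
        f"{ds.main_table}{path_separator}{name}": path
        for name, path in ds.table_paths.items()
        if name != ds.main_table
    }
    return {
        "data_table_path": ds.table_paths[ds.main_table],
        "additional_data_tables": additional,
        "header_line": ds.header_line,
        "field_separator": ds.field_separator,
    }


file_ds = FileDatasetSketch(
    main_table="Accidents",
    table_paths={"Accidents": "data/Accidents.txt", "Vehicles": "data/Vehicles.txt"},
)
kwargs = core_api_kwargs(file_ds)
```

A helper like train_predictor_ds could then merge this dictionary with the user's remaining keyword arguments before delegating to the core API function.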


folmos-at-orange · 2024-03-29T15:31:40Z

Waiting for input after the first version of the spec: https://github.com/KhiopsML/khiops-python/wiki/Dataset-Spec-Proposal

folmos-at-orange added the Status/Draft (The issue is still not well defined), Type/Feature (A new feature request or an improvement of a feature) and Priority/0-High (To do now) labels on Mar 12, 2024

folmos-at-orange self-assigned this Mar 12, 2024

This was referenced Mar 12, 2024

Implement helper to create multi-table dictionaries #52

Closed

Implement a sklearn helper to split multi-table datasets #51

Closed

folmos-at-orange added the Priority/1-Medium (To do after P0) label and removed the Priority/0-High (To do now) label on Jun 27, 2024

folmos-at-orange removed their assignment Sep 23, 2024
