added autogluon support, more models, more preprocessing strategies #81

Merged
merged 65 commits into from
Sep 10, 2024
Changes from 10 commits
5fde57a
added autogluon support
Oufattole Aug 19, 2024
d6832cb
updates for autogluon
teyaberg Aug 19, 2024
0612730
[wip] filtering features
teyaberg Aug 20, 2024
2feee79
[wip] filtering features
teyaberg Aug 20, 2024
f3c985a
[wip] sharing for updates only
teyaberg Aug 20, 2024
b65754c
[wip] sharing for updates only
teyaberg Aug 20, 2024
a8d8417
[wip] doctests
teyaberg Aug 20, 2024
d07f6a2
autogluon
teyaberg Aug 20, 2024
2aebd70
added logged warning for static data being empty and added support fo…
Oufattole Aug 20, 2024
8c54317
Merge branch 'generalized_load_model' into dev
Oufattole Aug 20, 2024
ecf9292
Added support via hydra for selecting among four imputation methods (…
Oufattole Aug 21, 2024
e6cf085
fixed xgboost model yaml to load imputer and normalization from the m…
Oufattole Aug 21, 2024
94dfde2
added autogluon test and cli support
Oufattole Aug 21, 2024
527eda5
added three more sklearn models and fixed bug with normalization and i…
Oufattole Aug 21, 2024
0d7ed27
fixed bugs so correlation code filters work now
Oufattole Aug 21, 2024
9c542ea
sweeper
teyaberg Aug 21, 2024
1a519ff
logging
teyaberg Aug 21, 2024
8fc8863
made task caching parallelize and updated tests for configs
Oufattole Aug 21, 2024
5724d9b
Merge branch 'dev' of github.com:mmcdermott/MEDS_Tabular_AutoML into dev
Oufattole Aug 21, 2024
3e223bb
added more thorough tests for output file paths of task caching and …
Oufattole Aug 22, 2024
926732b
Merge branch 'main' into dev
Oufattole Aug 25, 2024
299bf6f
setup dynamic versioning
Oufattole Aug 25, 2024
8a7692a
version updates
teyaberg Sep 5, 2024
158b8fa
version updates
teyaberg Sep 6, 2024
e92049f
fix hydra-core version for experimental callback support
teyaberg Sep 6, 2024
0623aaa
eval callback logging
teyaberg Sep 6, 2024
e1be850
added script input args checks, reduced redundancy in model launcher …
Oufattole Sep 7, 2024
0e985ee
eval callback
teyaberg Sep 7, 2024
139870f
eval callback
teyaberg Sep 7, 2024
0d5e9e8
Updated pre-commit config too.
mmcdermott Sep 8, 2024
2563aaf
Removed a function that was not yet implemented.
mmcdermott Sep 8, 2024
2d80905
Removing unused function in evaluation callback.
mmcdermott Sep 8, 2024
d29ece9
eval callback
teyaberg Sep 8, 2024
81b022f
added yaml hierarchy for model_launcher
Oufattole Sep 8, 2024
57a4a81
updated configs, fixed most tests
Oufattole Sep 9, 2024
b704bba
Merged
mmcdermott Sep 9, 2024
2f564e6
Removed unused pass block.
mmcdermott Sep 9, 2024
6f68a4b
Removing unnecessary keys call
mmcdermott Sep 9, 2024
6c2ba9a
Fixed workflow files
mmcdermott Sep 9, 2024
e678145
fixed tabularize tests
Oufattole Sep 9, 2024
d64e237
added integration tests covering multirun for all launch_model models…
Oufattole Sep 9, 2024
8d12aed
merged dev
Oufattole Sep 9, 2024
c631e93
fixed tests
Oufattole Sep 9, 2024
2601fca
Merge pull request #90 from mmcdermott/configs
Oufattole Sep 9, 2024
a4ad03c
resolved review feedback. Added a based_model docstring. Added versio…
Oufattole Sep 9, 2024
0db7bd6
fixed min_code_inclusion_frequency kwarg
Oufattole Sep 9, 2024
b289033
added mimic iv tutorial
Oufattole Sep 9, 2024
9294920
updated tabularization script to fix bugs
Oufattole Sep 9, 2024
d71f9dc
reduced the number of workers for resharding
Oufattole Sep 9, 2024
aed27f1
Merged.
mmcdermott Sep 9, 2024
0dc2bc6
updated tabularize meds to take string input for tasks
Oufattole Sep 9, 2024
c981534
Merge pull request #91 from mmcdermott/improve_test_coverage
mmcdermott Sep 9, 2024
2aa4feb
Improved error handling per https://github.com/mmcdermott/MEDS_Tabula…
mmcdermott Sep 9, 2024
a6d9103
Update README.md
mmcdermott Sep 9, 2024
23eb4d4
added try except around loading 0 codes
Oufattole Sep 10, 2024
be5f723
fixed job name config bug where we were missing the $ so it was not …
Oufattole Sep 10, 2024
4c87e94
Merge branch 'dev' into MIMICIV
Oufattole Sep 10, 2024
d390658
fixed precommit issues
Oufattole Sep 10, 2024
b82ee6d
Merge branch 'dev' into MIMICIV
Oufattole Sep 10, 2024
a564886
fix paths for eval_callback and add check to test_integration
teyaberg Sep 10, 2024
430afba
fixing tests for delete_below_top_k
teyaberg Sep 10, 2024
6a89a9f
Merge pull request #92 from mmcdermott/MIMICIV
Oufattole Sep 10, 2024
9e6d99a
fix out of memory xgboost training and added test
teyaberg Sep 10, 2024
8316365
simplified pathing for results and evaluation callback
Oufattole Sep 10, 2024
f7e03dd
fixed doctest for deleting below top k models
Oufattole Sep 10, 2024
4 changes: 2 additions & 2 deletions pyproject.toml
@@ -8,7 +8,7 @@ authors = [
 ]
 description = "Scalable Tabularization of MEDS format Time-Series data"
 readme = "README.md"
-requires-python = ">=3.12"
+requires-python = ">=3.11"
 classifiers = [
   "Programming Language :: Python :: 3",
   "License :: OSI Approved :: MIT License",
@@ -17,7 +17,6 @@ classifiers = [
 dependencies = [
   "polars", "pyarrow", "loguru", "hydra-core", "numpy", "scipy<1.14.0", "pandas", "tqdm", "xgboost",
   "scikit-learn", "hydra-optuna-sweeper", "hydra-joblib-launcher", "ml-mixins", "meds==0.3",
-  "MEDS-transforms==0.0.5",
 ]

 [project.scripts]
@@ -33,6 +32,7 @@ generate-subsets = "MEDS_tabular_automl.scripts.generate_subsets:main"
 dev = ["pre-commit"]
 tests = ["pytest", "pytest-cov", "rootutils"]
 profiling = ["mprofile", "matplotlib"]
+autogluon = ["autogluon; python_version=='3.11.*'"]  # Environment marker to restrict AutoGluon to Python 3.11

 [build-system]
 requires = ["setuptools>=61.0", "setuptools-scm>=8.0", "wheel"]
30 changes: 30 additions & 0 deletions src/MEDS_tabular_automl/base_model.py
@@ -0,0 +1,30 @@
from abc import ABC, abstractmethod
from pathlib import Path
from typing import TypeVar

from mixins import TimeableMixin
from omegaconf import DictConfig

T = TypeVar("T")


class BaseModel(ABC, TimeableMixin):
    @abstractmethod
    def __init__(self):
        pass

    @abstractmethod
    def train(self):
        pass

    @abstractmethod
    def evaluate(self) -> float:
        pass

    @abstractmethod
    def save_model(self, output_fp: Path):
        pass

    @classmethod
    def initialize(cls: T, **kwargs) -> T:
        return cls(DictConfig(kwargs, flags={"allow_objects": True}))
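As a rough illustration of how a concrete model might plug into this interface, the sketch below re-creates the pattern with the standard library only. `ToyBase`, `MeanModel`, and the use of `SimpleNamespace` in place of `DictConfig` are assumptions made for the example; none of them come from this PR.

```python
import json
from abc import ABC, abstractmethod
from pathlib import Path
from types import SimpleNamespace


class ToyBase(ABC):
    """Stand-in for BaseModel (hypothetical; omits TimeableMixin)."""

    @abstractmethod
    def __init__(self, cfg): ...

    @abstractmethod
    def train(self): ...

    @abstractmethod
    def evaluate(self) -> float: ...

    @abstractmethod
    def save_model(self, output_fp: Path): ...

    @classmethod
    def initialize(cls, **kwargs):
        # The PR wraps kwargs in a DictConfig; SimpleNamespace is a stdlib substitute.
        return cls(SimpleNamespace(**kwargs))


class MeanModel(ToyBase):
    """Hypothetical model that predicts the mean of its training labels."""

    def __init__(self, cfg):
        self.cfg = cfg
        self.mean = None

    def train(self):
        self.mean = sum(self.cfg.labels) / len(self.cfg.labels)

    def evaluate(self) -> float:
        # Mean squared error of the constant prediction on the training labels.
        return sum((y - self.mean) ** 2 for y in self.cfg.labels) / len(self.cfg.labels)

    def save_model(self, output_fp: Path):
        output_fp.write_text(json.dumps({"mean": self.mean}))


model = MeanModel.initialize(labels=[0.0, 1.0, 1.0, 0.0])
model.train()
```

Because every lifecycle method is abstract, calling `ToyBase.initialize()` directly raises `TypeError`, which is presumably the point of routing construction through a classmethod on the base class: a config entry can name `SomeModel.initialize` as its target and still get the abstract-method check.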
28 changes: 28 additions & 0 deletions src/MEDS_tabular_automl/configs/launch_autogluon.yaml
Owner
Can this have any shared overlap with the launch_model.yaml?

@@ -0,0 +1,28 @@
defaults:
- default
- tabularization: default
- override hydra/sweeper: optuna
- override hydra/sweeper/sampler: tpe
- override hydra/launcher: joblib
- _self_

task_name: task

# Task cached data dir
input_dir: ${output_cohort_dir}/${task_name}/task_cache
# Directory with task labels
input_label_dir: ${output_cohort_dir}/${task_name}/labels/
# Where to output the model and cached data
model_dir: ${output_cohort_dir}/autogluon/autogluon_${now:%Y-%m-%d_%H-%M-%S}
output_filepath: ${model_dir}

# Model parameters
model_params:
  iterator:
    keep_data_in_memory: True
    binarize_task: True

log_dir: ${model_dir}/.logs/
log_filepath: ${log_dir}/log.txt

name: launch_autogluon
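The paths in this config are assembled through `${...}` interpolation. The sketch below shows the idea with a deliberately simplified resolver: it only expands plain key references and is not OmegaConf's implementation (resolvers such as `${now:...}` are out of scope). The function name `resolve` and the value of `output_cohort_dir` are assumptions for the example.

```python
import re

_REF = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def resolve(key: str, cfg: dict[str, str], _seen: tuple[str, ...] = ()) -> str:
    """Recursively expand ${name} references in cfg[key] (simplified; no resolvers)."""
    if key in _seen:
        raise ValueError(f"Circular reference: {' -> '.join(_seen + (key,))}")
    return _REF.sub(lambda m: resolve(m.group(1), cfg, _seen + (key,)), cfg[key])


cfg = {
    "output_cohort_dir": "/data/cohort",  # hypothetical value
    "task_name": "task",
    "input_dir": "${output_cohort_dir}/${task_name}/task_cache",
    "input_label_dir": "${output_cohort_dir}/${task_name}/labels/",
}
print(resolve("input_dir", cfg))  # prints /data/cohort/task/task_cache
```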
31 changes: 31 additions & 0 deletions src/MEDS_tabular_automl/configs/launch_model.yaml
@@ -0,0 +1,31 @@
defaults:
- _self_
- default
- tabularization: default
- model: xgboost # This can be changed to sgd_classifier or any other model
- override hydra/sweeper: optuna
- override hydra/sweeper/sampler: tpe
- override hydra/launcher: joblib

task_name: task

# Task cached data dir
input_dir: ${output_cohort_dir}/${task_name}/task_cache
# Directory with task labels
input_label_dir: ${output_cohort_dir}/${task_name}/labels/
# Where to output the model and cached data
model_dir: ${output_cohort_dir}/model/model_${now:%Y-%m-%d_%H-%M-%S}
output_filepath: ${model_dir}/model_metadata.json

log_dir: ${model_dir}/.logs/

name: launch_model

hydra:
  verbose: False
  job:
    name: MEDS_TAB_${name}_${worker}_${now:%Y-%m-%d_%H-%M-%S}
  sweep:
    dir: ${log_dir}
  run:
    dir: ${log_dir}
33 changes: 33 additions & 0 deletions src/MEDS_tabular_automl/configs/launch_sklearnmodel.yaml
@@ -0,0 +1,33 @@
defaults:
- default
- tabularization: default
- override hydra/sweeper: optuna
- override hydra/sweeper/sampler: tpe
- override hydra/launcher: joblib
- _self_

task_name: task

# Task cached data dir
input_dir: ${output_cohort_dir}/${task_name}/task_cache
# Directory with task labels
input_label_dir: ${output_cohort_dir}/${task_name}/labels/
# Where to output the model and cached data
model_dir: ${output_cohort_dir}/model/model_${now:%Y-%m-%d_%H-%M-%S}
output_filepath: ${model_dir}/model_metadata.json

# Model parameters
model_params:
  epochs: 20
  early_stopping_rounds: 5
  model:
    _target_: sklearn.linear_model.SGDClassifier
    loss: log_loss
    # n_iter: ${model_params.epochs} # not sure if we want this behaviour
  iterator:
    keep_data_in_memory: True
    binarize_task: True

log_dir: ${model_dir}/.logs/

name: launch_sklearnmodel
2 changes: 1 addition & 1 deletion src/MEDS_tabular_automl/configs/launch_xgboost.yaml
@@ -53,6 +53,6 @@ hydra:
       model_params.num_boost_round: range(100, 1000)
       model_params.early_stopping_rounds: range(1, 10)
       +model_params.model.max_depth: range(2, 16)
-      tabularization.min_code_inclusion_frequency: tag(log, range(10, 1000000))
+      tabularization.min_code_inclusion_count: tag(log, range(10, 1000000))

 name: launch_xgboost
30 changes: 30 additions & 0 deletions src/MEDS_tabular_automl/configs/model/sgd_classifier.yaml
@@ -0,0 +1,30 @@
# @package _global_

model_target:
  _target_: MEDS_tabular_automl.sklearn_model.SklearnModel.initialize
  model_params: ${model_params}
  input_dir: ${input_dir}
  input_label_dir: ${input_label_dir}
  model_dir: ${model_dir}
  output_filepath: ${output_filepath}
  log_dir: ${log_dir}
  cache_dir: ${cache_dir}

model_params:
  epochs: 20
  early_stopping_rounds: 5
  model:
    _target_: sklearn.linear_model.SGDClassifier
    loss: log_loss
  iterator:
    keep_data_in_memory: True
    binarize_task: True

hydra:
  sweeper:
    params:
      +model_params.model.alpha: tag(log, interval(1e-6, 1))
      +model_params.model.l1_ratio: interval(0, 1)
      +model_params.model.penalty: choice(['l1', 'l2', 'elasticnet'])
      model_params.epochs: range(10, 100)
      model_params.early_stopping_rounds: range(1, 10)
38 changes: 38 additions & 0 deletions src/MEDS_tabular_automl/configs/model/xgboost.yaml
@@ -0,0 +1,38 @@
# @package _global_

model_target:
  _target_: MEDS_tabular_automl.xgboost_model.XGBoostModel.initialize
  model_params: ${model_params}
  input_dir: ${input_dir}
  input_label_dir: ${input_label_dir}
  model_dir: ${model_dir}
  output_filepath: ${output_filepath}
  log_dir: ${log_dir}
  cache_dir: ${cache_dir}
  # tabularization: ${tabularization} # Ideally we should define tabularization here, but there is an issue initializing with its resolvers.

model_params:
  num_boost_round: 1000
  early_stopping_rounds: 5
  model:
    booster: gbtree
    device: cpu
    nthread: 1
    tree_method: hist
    objective: binary:logistic
  iterator:
    keep_data_in_memory: True
    binarize_task: True

hydra:
  sweeper:
    params:
      +model_params.model.eta: tag(log, interval(0.001, 1))
      +model_params.model.lambda: tag(log, interval(0.001, 1))
      +model_params.model.alpha: tag(log, interval(0.001, 1))
      +model_params.model.subsample: interval(0.5, 1)
      +model_params.model.min_child_weight: interval(1e-2, 100)
      model_params.num_boost_round: range(100, 1000)
      model_params.early_stopping_rounds: range(1, 10)
      +model_params.model.max_depth: range(2, 16)
      tabularization.min_code_inclusion_frequency: tag(log, range(10, 1000000))
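Several of the sweep entries above use `tag(log, interval(a, b))`, which asks the Optuna sweeper to search that range on a log scale. The sketch below only illustrates what that scale means by comparing log-uniform and plain uniform draws; it is not how the TPE sampler itself proposes values, and `sample_log_uniform` is a name made up for the example.

```python
import math
import random


def sample_log_uniform(low: float, high: float, rng: random.Random) -> float:
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)
log_draws = [sample_log_uniform(0.001, 1.0, rng) for _ in range(10_000)]
uniform_draws = [rng.uniform(0.001, 1.0) for _ in range(10_000)]

# Share of draws below 0.01: roughly one third on the log scale (one of three
# decades), but only about 1% under plain uniform sampling.
frac_log = sum(d < 0.01 for d in log_draws) / len(log_draws)
frac_uniform = sum(d < 0.01 for d in uniform_draws) / len(uniform_draws)
print(f"log scale: {frac_log:.2f}, uniform: {frac_uniform:.2f}")
```

For parameters such as `eta` or the regularization strengths, where plausible values span several orders of magnitude, the log scale gives small values a realistic chance of being tried.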
8 changes: 5 additions & 3 deletions src/MEDS_tabular_automl/configs/tabularization/default.yaml
@@ -1,7 +1,9 @@
 # User inputs
-filtered_code_metadata_fp: ${output_cohort_dir}/metadata/codes.parquet
 allowed_codes: null
-min_code_inclusion_frequency: 10
+filtered_code_metadata_fp: ${output_cohort_dir}/tabularized_code_metadata.parquet
+min_code_inclusion_count: 10
+min_code_inclusion_frequency: null
+max_included_codes: null
 window_sizes:
   - "1d"
   - "7d"
@@ -19,4 +21,4 @@ aggs:
   - "value/max"

 # Resolved inputs
-_resolved_codes: ${filter_to_codes:${tabularization.allowed_codes},${tabularization.min_code_inclusion_frequency},${tabularization.filtered_code_metadata_fp}}
+_resolved_codes: ${filter_to_codes:${tabularization.filtered_code_metadata_fp},${tabularization.allowed_codes},${tabularization.min_code_inclusion_count},${tabularization.min_code_inclusion_frequency},${tabularization.max_included_codes}}
Owner
you might simplify this to just have it take in the tabularization dictionary
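One way to read that suggestion is a filter that receives the whole `tabularization` mapping and picks out the keys it needs. The sketch below is an assumption-heavy illustration, not the repository's `filter_to_codes`: the function name `filter_codes`, the in-memory `code_counts` table (the real code reads a parquet file), and the exact meaning given to each threshold are all hypothetical.

```python
def filter_codes(code_counts: dict[str, int], tab_cfg: dict) -> list[str]:
    """Return the sorted codes that pass every filter set in tab_cfg (None disables a filter)."""
    allowed = tab_cfg.get("allowed_codes")
    min_count = tab_cfg.get("min_code_inclusion_count")
    min_freq = tab_cfg.get("min_code_inclusion_frequency")
    max_codes = tab_cfg.get("max_included_codes")

    total = sum(code_counts.values())
    kept = dict(code_counts)
    if allowed is not None:
        allowed_set = set(allowed)
        kept = {c: n for c, n in kept.items() if c in allowed_set}
    if min_count is not None:
        kept = {c: n for c, n in kept.items() if n >= min_count}
    if min_freq is not None and total > 0:
        # Assumed meaning: a code's share of all observations.
        kept = {c: n for c, n in kept.items() if n / total >= min_freq}
    if max_codes is not None:
        # Assumed meaning: keep the most frequent codes, ties broken alphabetically.
        top = sorted(kept, key=lambda c: (-kept[c], c))[:max_codes]
        kept = {c: kept[c] for c in top}
    return sorted(kept)


counts = {"A": 100, "B": 12, "C": 3, "D": 50}  # hypothetical code -> count table
defaults = {
    "allowed_codes": None,
    "min_code_inclusion_count": 10,
    "min_code_inclusion_frequency": None,
    "max_included_codes": None,
}
print(filter_codes(counts, defaults))  # ['A', 'B', 'D']
print(filter_codes(counts, {**defaults, "max_included_codes": 2}))  # ['A', 'D']
```

Passing the mapping keeps the resolver call short and means a new filter key does not change the call signature, at the cost of a less explicit interface.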

37 changes: 37 additions & 0 deletions src/MEDS_tabular_automl/dense_iterator.py
@@ -0,0 +1,37 @@
import numpy as np
Owner
Can you add a docstring to this module explaining what it is and its purpose relative to other parts of the repo?

import scipy.sparse as sp
from mixins import TimeableMixin
from omegaconf import DictConfig

from .tabular_dataset import TabularDataset


class DenseIterator(TabularDataset, TimeableMixin):
Owner
as TabularDataset is a derived class of TimeableMixin, DenseIterator doesn't need that dependency as well. That can simplify your superclass related code and imports here.

    def __init__(self, cfg: DictConfig, split: str):
        """Initializes the DenseIterator with the provided configuration and data split.

        Args:
            cfg: The configuration dictionary.
            split: The data split to use.
        """
        TabularDataset.__init__(self, cfg=cfg, split=split)
        TimeableMixin.__init__(self)
        self.valid_event_ids, self.labels = self._load_ids_and_labels()
        # check if the labels are empty
        if len(self.labels) == 0:
            raise ValueError("No labels found.")
        # self._it = 0

    def densify(self) -> tuple[sp.spmatrix, np.ndarray]:
        """Builds the full data matrix, based on column subselection, by stacking every shard.

        The stacked matrix is still sparse; callers convert it with `.todense()` when needed.
        """

        # stack the sparse shards row-wise and concatenate their labels
        data = []
        labels = []
        for shard_idx in range(len(self._data_shards)):
            shard_data, shard_labels = self.get_data_shards(shard_idx)
            data.append(shard_data)
            labels.append(shard_labels)
        data = sp.vstack(data)
        labels = np.concatenate(labels, axis=0)
        return data, labels
2 changes: 2 additions & 0 deletions src/MEDS_tabular_automl/generate_static_features.py
@@ -185,6 +185,8 @@ def get_flat_static_rep(
     """
     static_features = get_feature_names(agg=agg, feature_columns=feature_columns)
     static_measurements = summarize_static_measurements(agg, static_features, df=shard_df)
+    if len(static_features) == 0:
+        raise ValueError(f"No static features found. Remove the aggregation function {agg}")
     # convert to sparse_matrix
     matrix = get_sparse_static_rep(static_features, static_measurements.lazy(), shard_df, feature_columns)
     assert matrix.shape[1] == len(
6 changes: 3 additions & 3 deletions src/MEDS_tabular_automl/mapper.py
Owner
Do you want to just see if you can import these functions from MEDS-Transforms? I guess that would conflict with the python version change, though...

@@ -5,10 +5,12 @@
 from collections.abc import Callable
 from datetime import datetime
 from pathlib import Path
+from typing import TypeVar

 from loguru import logger

 LOCK_TIME_FMT = "%Y-%m-%dT%H:%M:%S.%f"
+DF_T = TypeVar("DF_T")


 def get_earliest_lock(cache_directory: Path) -> datetime | None:
@@ -82,9 +84,7 @@ def register_lock(cache_directory: Path) -> tuple[datetime, Path]:
     return lock_time, lock_fp


-def wrap[
-    DF_T
-](
+def wrap(
     in_fp: Path,
     out_fp: Path,
     read_fn: Callable[[Path], DF_T],
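The `def wrap[DF_T](...)` form is PEP 695 type-parameter syntax, which only parses on Python 3.12 and later, so moving to a module-level `TypeVar` is consistent with the `requires-python = ">=3.11"` change in `pyproject.toml`. The sketch below shows the 3.11-compatible spelling on a much-reduced, hypothetical analogue of `wrap`; `read_transform_write` and its string-based arguments are invented for the example.

```python
from collections.abc import Callable
from typing import TypeVar

DF_T = TypeVar("DF_T")


def read_transform_write(
    source: str,
    read_fn: Callable[[str], DF_T],
    compute_fn: Callable[[DF_T], DF_T],
    write_fn: Callable[[DF_T], str],
) -> str:
    """Read, transform, then serialize; DF_T ties the three callables to one data type."""
    return write_fn(compute_fn(read_fn(source)))


result = read_transform_write(
    "3,1,2",
    read_fn=lambda s: [int(x) for x in s.split(",")],
    compute_fn=sorted,
    write_fn=lambda xs: ",".join(map(str, xs)),
)
print(result)  # prints 1,2,3
```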
75 changes: 75 additions & 0 deletions src/MEDS_tabular_automl/scripts/launch_autogluon.py
@@ -0,0 +1,75 @@
from importlib.resources import files

import hydra
import pandas as pd
from loguru import logger
from omegaconf import DictConfig

from MEDS_tabular_automl.dense_iterator import DenseIterator

from ..utils import hydra_loguru_init

config_yaml = files("MEDS_tabular_automl").joinpath("configs/launch_autogluon.yaml")
Owner
you might consider importing all config files in the raw package __init__.py or something. I'm not sure it is really better, but that's what I do in MEDS-transforms.

if not config_yaml.is_file():
Owner
I think at this point this error case is probably not needed.

    raise FileNotFoundError("Core configuration not successfully installed!")


@hydra.main(version_base=None, config_path=str(config_yaml.parent.resolve()), config_name=config_yaml.stem)
def main(cfg: DictConfig) -> float:
    """Launches AutoGluon after collecting data based on the provided configuration.

    Args:
        cfg: The configuration dictionary specifying model and training parameters.
    """

    # print(OmegaConf.to_yaml(cfg))
Owner
Remove unused comments.

    if not cfg.loguru_init:
        hydra_loguru_init()

    # check that autogluon is installed
    try:
        import autogluon.tabular as ag
    except ImportError:
        logger.error("AutoGluon is not installed. Please install AutoGluon.")
Owner
do you want to raise an exception here? I don't think logger.error does necessarily.


    # collect data based on the configuration
    itrain = DenseIterator(cfg, "train")
    ituning = DenseIterator(cfg, "tuning")
    iheld_out = DenseIterator(cfg, "held_out")

    # collect data for AutoGluon
    train_data, train_labels = itrain.densify()
    tuning_data, tuning_labels = ituning.densify()
    held_out_data, held_out_labels = iheld_out.densify()

    # construct dfs for AutoGluon
    train_df = pd.DataFrame(train_data.todense())  # , columns=cols)
Owner
eliminate unused comments.

    train_df[cfg.task_name] = train_labels
    tuning_df = pd.DataFrame(
        tuning_data.todense(),
    )  # columns=cols)
    tuning_df[cfg.task_name] = tuning_labels
    held_out_df = pd.DataFrame(held_out_data.todense())  # , columns=cols)
    held_out_df[cfg.task_name] = held_out_labels

    train_dataset = ag.TabularDataset(train_df)
    tuning_dataset = ag.TabularDataset(tuning_df)
    held_out_dataset = ag.TabularDataset(held_out_df)

    # train model with AutoGluon
    predictor = ag.TabularPredictor(
        label=cfg.task_name, log_to_file=True, log_file_path=cfg.log_filepath, path=cfg.output_filepath
    ).fit(train_data=train_dataset, tuning_data=tuning_dataset)

    # predict
    predictions = predictor.predict(held_out_dataset.drop(columns=[cfg.task_name]))
    print("Predictions:", predictions)
    # evaluate
    score = predictor.evaluate(held_out_dataset)
    print("Test score:", score)

    # TODO(model) add tests for autogluon pipeline


if __name__ == "__main__":
    main()
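On the review question about the AutoGluon import: `logger.error` only logs, so execution would continue and the first use of `ag` would fail with a less helpful `NameError`. A minimal sketch of one alternative is below, using only the standard library; `require_optional` is a hypothetical helper, not something defined in this PR, and `json` stands in for the optional package so the example runs anywhere.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, extra: str) -> ModuleType:
    """Import an optional dependency or fail immediately with an actionable message."""
    try:
        return importlib.import_module(module_name)
    except ImportError as e:
        raise ImportError(
            f"'{module_name}' is required for this script; install it with the '{extra}' extra."
        ) from e


# `json` is always available, so this call succeeds and returns the module.
json_mod = require_optional("json", extra="none-needed")
print(json_mod.dumps([1]))
```

In the launcher this could look like `ag = require_optional("autogluon.tabular", extra="autogluon")`, assuming the `autogluon` extra declared in `pyproject.toml` is how users are expected to install it.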