-
Notifications
You must be signed in to change notification settings - Fork 17
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
changed time_cutoff option #89
Changes from 2 commits
f723356
707b898
cb32b81
48c64e8
6b1912c
878fa96
7a3875d
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -2,6 +2,7 @@ | |
|
||
import hydra | ||
import omegaconf | ||
import numpy as np | ||
import os | ||
import pandas as pd | ||
import pathlib | ||
|
@@ -30,9 +31,10 @@ def __init__( | |
remove_non_standard_residues: bool, | ||
remove_pdb_unavailable: bool, | ||
train_val_test: List[float], | ||
split_type: Literal["sequence_similarity", "random"], | ||
split_type: Literal["sequence_similarity", "time_cutoff", "random"], | ||
split_sequence_similiarity: int, | ||
overwrite_sequence_clusters: bool | ||
overwrite_sequence_clusters: bool, | ||
split_time_frames: List[str] | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Should be There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. In theory it is optional, but we cannot specify a default option as it is later in the argument list. If we want to make the type hint optional, should we move it up the list and give it a default value? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Hmm, I'm generally in favour of fewer args (and not passing in things that aren't used since it can be confusing). What do you think about a pattern where we can pass the time splits into There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. cc @amorehead There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. With There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah, I think clarification would be helpful here There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Was not sure what exactly you meant @a-r-j, but for now, I just reordered arguments so that we can give defaults to these and the user does not need to specify them. Does that work for you? |
||
): | ||
self.fraction = fraction | ||
self.molecule_type = molecule_type | ||
|
@@ -52,6 +54,7 @@ def __init__( | |
self.split_type = split_type | ||
self.split_sequence_similarity = split_sequence_similiarity | ||
self.overwrite_sequence_clusters = overwrite_sequence_clusters | ||
self.split_time_frames = [np.datetime64(date) for date in split_time_frames] | ||
self.splits = ["train", "val", "test"] | ||
|
||
def create_dataset(self): | ||
|
@@ -128,9 +131,15 @@ def create_dataset(self): | |
elif self.split_type == "sequence_similarity": | ||
log.info(f"Splitting dataset via sequence-similarity split into {self.train_val_test}...") | ||
log.info(f"Using {self.split_sequence_similarity} sequence similarity for split") | ||
pdb_manager.cluster(min_seq_id=self.split_sequence_similarity, update=True) | ||
splits = pdb_manager.split_clusters( | ||
pdb_manager.df, update=True, overwrite = self.overwrite_sequence_clusters) | ||
pdb_manager.cluster(min_seq_id=self.split_sequence_similarity, update=True, | ||
overwrite = self.overwrite_sequence_clusters) | ||
splits = pdb_manager.split_clusters(pdb_manager.df, update=True) | ||
|
||
elif self.split_type == "time_cutoff": | ||
log.info(f"Splitting dataset via time_cutoff split into {self.train_val_test}...") | ||
log.info(f"Using {self.split_time_frames} dates for split") | ||
pdb_manager.split_time_frames = self.split_time_frames | ||
splits = pdb_manager.split_by_deposition_date(df=pdb_manager.df, update=True) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Very nice! |
||
|
||
log.info(splits["train"]) | ||
return splits | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better to set to
null
? What do you thinl?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you mean
split_time_frames
? Thought it would be good to have it in there so that users can see what format is requiredThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps you can
null
outsplit_time_frames
but leave your example in the in-line comment as follows:There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM