changed time_cutoff option #89
log.info(f"Splitting dataset via time_cutoff split into {self.train_val_test}...")
log.info(f"Using {self.split_time_frames} dates for split")
pdb_manager.split_time_frames = self.split_time_frames
splits = pdb_manager.split_by_deposition_date(df=pdb_manager.df, update=True)
Very nice!
LGTM
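For readers unfamiliar with a time_cutoff split, the idea can be sketched as follows. This is a minimal illustration, not the PDBManager implementation: the function name `split_by_date`, the `(pdb_id, date)` tuple layout, and the half-open interval convention are all assumptions made for the example.

```python
from datetime import date
from typing import Dict, List, Tuple

# Hypothetical records: (PDB ID, deposition date as an ISO string).
Entry = Tuple[str, str]


def split_by_date(entries: List[Entry], cutoffs: List[str]) -> Dict[str, List[str]]:
    """Assign entries to train/val/test from three ascending cutoff dates.

    Assumed convention: train is deposited before cutoffs[0], val falls in
    [cutoffs[0], cutoffs[1]) and test in [cutoffs[1], cutoffs[2]).
    Entries on or after the last cutoff are left out.
    """
    train_end, val_end, test_end = (date.fromisoformat(c) for c in cutoffs)
    splits: Dict[str, List[str]] = {"train": [], "val": [], "test": []}
    for pdb_id, deposited in entries:
        d = date.fromisoformat(deposited)
        if d < train_end:
            splits["train"].append(pdb_id)
        elif d < val_end:
            splits["val"].append(pdb_id)
        elif d < test_end:
            splits["test"].append(pdb_id)
    return splits


entries = [
    ("1abc", "2019-05-01"),
    ("2def", "2020-06-15"),
    ("3ghi", "2022-02-01"),
    ("4jkl", "2023-06-01"),
]
splits = split_by_date(entries, ["2020-01-01", "2021-01-01", "2023-03-01"])
```

Splitting on dates rather than at random keeps every test structure newer than every training structure, which is the usual motivation for this split type.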
split_sequence_similiarity: 0.3 # Clustering at 30% sequence similarity (argument is ignored if split_type="random")
split_type: "sequence_similarity" # Split sequences by sequence similarity clustering, other options are "random" and "time_cutoff"
split_sequence_similiarity: 0.3 # Clustering at 30% sequence similarity (argument is ignored if split_type!="sequence_similarity")
split_time_frames: ["2020-01-01", "2021-01-01", "2023-03-01"] # Time-cutoffs for train, val and test set (argument is ignored if split_type!="time_cutoff")
Better to set to null? What do you think?
You mean split_time_frames? I thought it would be good to have it in there so that users can see what format is required.
Perhaps you can null out split_time_frames but leave your example in the in-line comment as follows:
split_time_frames: null # Time-cutoffs for train, val and test set (argument is ignored if split_type!="time_cutoff") - e.g., ["2020-01-01", "2021-01-01", "2023-03-01"]
done
LGTM
split_sequence_similiarity: int,
overwrite_sequence_clusters: bool
overwrite_sequence_clusters: bool,
split_time_frames: List[str]
Should be Optional[List[str]]. This arg probably needs a check for datetime format.
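One way such a check could look is sketched below. The helper name `check_time_frames` is hypothetical, and both the `YYYY-MM-DD` format and the ascending-order requirement are assumptions inferred from the example cutoffs in the config, not something the PR confirms.

```python
from datetime import datetime
from typing import List, Optional


def check_time_frames(split_time_frames: Optional[List[str]]) -> Optional[List[str]]:
    """Validate cutoff strings; None passes through unchanged."""
    if split_time_frames is None:
        return None
    parsed = []
    for value in split_time_frames:
        try:
            # strptime rejects anything that is not a year-month-day date.
            parsed.append(datetime.strptime(value, "%Y-%m-%d"))
        except (TypeError, ValueError) as err:
            raise ValueError(f"Expected a YYYY-MM-DD date, got {value!r}") from err
    # Assumed requirement: cutoffs are listed from earliest to latest.
    if parsed != sorted(parsed):
        raise ValueError("Cutoff dates must be in ascending order")
    return split_time_frames
```

Calling this once where the argument is first received would surface a malformed date before any splitting work starts.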
In theory it is optional, but we cannot specify a default option as it is later in the argument list. If we want to make the type hint optional, should we move it up the list and give it a default value?
Hmm, I'm generally in favour of fewer args (and not passing in things that aren't used since it can be confusing). What do you think about a pattern where we can pass the time splits into train_test_split instead of the list of floats and the behaviour is controlled by split_type (with appropriate error catching)?
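One possible reading of this suggestion is a single argument whose meaning depends on split_type. The sketch below is only an interpretation: `interpret_split_spec` and `SplitSpec` are hypothetical names, and the exact error types are assumptions.

```python
from datetime import date
from typing import List, Sequence, Union

# A single spec that holds either split fractions or cutoff date strings.
SplitSpec = Sequence[Union[float, str]]


def interpret_split_spec(split_type: str, spec: SplitSpec) -> List[Union[float, date]]:
    """Read one argument as fractions or as cutoff dates, depending on split_type."""
    if split_type == "time_cutoff":
        if not all(isinstance(s, str) for s in spec):
            raise TypeError("time_cutoff expects date strings such as '2020-01-01'")
        return [date.fromisoformat(s) for s in spec]
    if split_type in ("random", "sequence_similarity"):
        if not all(isinstance(s, (int, float)) for s in spec):
            raise TypeError(f"{split_type} expects numeric split fractions")
        return [float(s) for s in spec]
    raise ValueError(f"Unknown split_type: {split_type!r}")
```

The trade-off is one fewer argument in exchange for a type that callers have to interpret through split_type, which is why the error catching matters.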
cc @amorehead
By "pass the time splits into train_test_split", do you mean giving tuples of (split_ratio, split_time_cutoff) if the time version is chosen, for example? And if the time version is not chosen, one would just take the first element of that tuple, with the second defaulting to None?
Yeah, I think clarification would be helpful here
Was not sure what exactly you meant @a-r-j, but for now, I just reordered arguments so that we can give defaults to these and the user does not need to specify them. Does that work for you?
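The reordering described here follows from a Python rule: in a function signature, positional parameters with defaults must come after those without. A minimal sketch, where the function `make_split_config`, its return value, and the `False` default are assumptions for illustration:

```python
from typing import Dict, List, Optional


def make_split_config(
    split_type: str,
    # Parameters with defaults are grouped after the required ones.
    overwrite_sequence_clusters: bool = False,  # assumed default
    split_time_frames: Optional[List[str]] = None,
) -> Dict[str, object]:
    """Collect split options; split_time_frames is only needed for time_cutoff."""
    if split_type == "time_cutoff" and split_time_frames is None:
        raise ValueError("split_time_frames is required when split_type='time_cutoff'")
    return {
        "split_type": split_type,
        "overwrite_sequence_clusters": overwrite_sequence_clusters,
        "split_time_frames": split_time_frames,
    }
```

With this ordering, a user choosing a random split can omit split_time_frames entirely, while a time_cutoff split without dates fails early with a clear message.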
LGTM
Ah, some tests are failing @kierandidi
Should be fixed now @a-r-j
@a-r-j
Added an option for the PDBDataModule to create splits based on time intervals by exposing PDBManager functionality.
Tested this functionality as well as the random and sequence_similarity split types to ensure there are no regressions.