Nd explore multi 2 #356

Closed
wants to merge 223 commits into from

Conversation

ndiamant
Contributor

@ndiamant ndiamant commented Jul 7, 2020

New way to handle multidimensional TensorMaps that doesn't run into problems with validators or normalizers. Addresses #354. This PR is missing a test for variable shape TensorMaps. Can someone (@paolodi ?) explain what the expected behavior for explore is for time series or other variable shape tmaps?

lucidtronix and others added 30 commits April 11, 2019 10:59
set those people to excluded. When exclusions occur when disease has
not occurred, also set those people to excluded.
(1) Need to apply the death censor dates.
(2) Need to harmonize field names.
date, set the censor date to enrollment date, rather than birthdate.
censor schema for the same purpose, and allow the missing_fields field
to be NA if we actually have everything we need.
add ml4cvd files
StevenSong and others added 22 commits May 30, 2020 21:52
* wip rehaul tensorizer

* wip tensorizer rehaul

* rehaul tensor writer, get metadata in tensorize

* bug fixes

* bug fixes and correctness regarding missingness

* new voltage qc tmaps

* default sampling frequency tmaps calculate based on length

* multiprocess -> multiprocessing
Also
* Remove obsolete instructions now that ml4cvd has the needed packages and permissions.
* Use newer hd5 for ECGs.
* Switch to local paths when on a ML4CVD VM.
* better generator stats, and stats multiprocessing bug fixed
* Plots histograms of continuous tensors in explore mode

* Updates figure subplot size and file name

* Fixes file extension
* wip multiple time windows

* cross reference multiple time windows

* help docs

* which -> order, reduce output

* exact/at least toggle, all/any window toggle, summary count formatting

* group by join tensor and time tensor

* global N per window, multilabel counts

* description of counts

* shorten line in output
* dynamic time series tmaps

* time series persistence #305 and redo cardiac surgery tmaps

* voltage _exact length tmaps, population_normalize -> normalization in TMap

* validator for voltage

* remove apollo xref, get newest surgery

* fixes selection of mrn_col_name in _sample_csv_to_set

* fix validator

* warning -> debug

* dsw infection

* columns

* outcome

* prolonged vent column name

* reformat voltage tmaps

* explicit _pc tmps

* type hint

* delete redundant length and zero tmaps

* use xref output csv to get newest surgery with preop ecg

* adds train_simple_model (#317)

* patient sex categorical tmaps

* explicit voltage tmaps

* sex tmap cats

* dsw outcomes resolved

* gender -> sex in plots

* voltage stats

* train/valid/test not useful in progress bar

* report median, generator

* consolidate simple shallow model

* revert change

* version TFA TFP #320

* fix abbreviations

Co-authored-by: Erik Reinertsen <[email protected]>
* #260 #324 #326

* get stats q

* variable names
* notebooks for mnist and hyperoptimization, survival analysis plots and documentation
* do not crash if sts data not found

* log and do not call build if no sts tmaps

* raise only when using sts tmaps
* define filter sizes per conv layer

* multiline fstring

* args in test

* filter size per layer or block

* standalone helper
* multiprocess -> multiprocessing

* cover all paths

* simplify

* fix cardiac surgery tmaps bug

* revert bug fix for separate pr

* revert bug fix for separate pr

* whitespace
* Fixed ecg_plot_rest to run from command line (--mode plot_resting_ecgs) with new tensors
* infer metrics

* time range consistency with cross reference
* explore now allows multidimensional tmaps

* tests for explore

* Bug found in default cts tff
* user/group of ml4cvd output is no longer root

* tf.sh options: run as root, set up jupyter. Closes #334

* root is no longer default

* disable silent reporting for -j in tf.sh

* many echo statements to check array vals

* user added to all user's groups in docker -> bash

* fix indentation

* adds --env to printed tf.sh call
* #314 ecg voltage plots

* #314 ecg voltage plots

* Formatting #315

* rehaul plot mode to use calculated scales

* readability

Co-authored-by: StevieSong <[email protected]>
@ndiamant ndiamant requested a review from StevenSong July 7, 2020 16:45
@StevenSong
Collaborator

Currently, explore can handle variable-shape tmaps that are dynamic in the first dimension, i.e. time series tmaps. It loops through the time series and saves 1 row of tensors for each time sample.

For example, if a tmap returned all ECGs from an hd5 that had 3 ECGs, there would be 3 rows in tensors_all_union.csv, 1 for each ECG in that hd5.



def _channel_explore_error_header(tm: TensorMap, channel: str) -> str:
    return f'{tm.name} {channel}'
Collaborator

Doesn't seem like this function does anything different from _channel_explore_header. I also don't think this function is used.

if tm.shape[0] is not None:
    # If not a multi-tensor tensor, wrap in array to loop through
    tensors = np.array([tensors])
for i, tensor in enumerate(tensors):
Collaborator

This for loop iterates over the tensors in a time series. If a tmap is not a time series tmap, it wraps the tensor in another dimension to simulate a time series of 1 time sample.
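
A minimal sketch of the wrapping described in this comment, assuming a tmap's shape uses `None` in the first dimension to mark a time series. `samples_from` and the small array sizes are hypothetical and only for illustration; they are not part of ml4cvd.

```python
from typing import Optional, Tuple

import numpy as np


def samples_from(shape: Tuple[Optional[int], ...], tensors: np.ndarray) -> np.ndarray:
    """Return an array whose first axis indexes time samples (hypothetical helper)."""
    if shape[0] is not None:
        # Fixed-shape tmap: add a leading axis so it behaves like a series of length 1
        tensors = np.array([tensors])
    return tensors


# A fixed-shape tensor becomes a series with one sample
fixed = samples_from((4, 2), np.zeros((4, 2)))
# A time series with 3 samples is left unchanged
series = samples_from((None, 4, 2), np.zeros((3, 4, 2)))

print(fixed.shape)   # (1, 4, 2)
print(series.shape)  # (3, 4, 2)

# Either way, the same loop can visit each sample
for i, tensor in enumerate(series):
    print(i, tensor.shape)
```

With this shape, the downstream loop does not need a separate code path for fixed-shape tmaps.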

Contributor Author

How did these different samples get distinguished in the output CSVs?

Collaborator

If patient 123 has 2 total ECGs, taken on 5/20 and 5/21, a tmap given 123.hd5 returns an array [5/20 data, 5/21 data]. In the output csv, the 2 ECGs are counted as separate samples: each ECG gets its own row in the output.
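
To illustrate the row layout described here, a rough sketch using only the standard library. The column names `sample_id` and `ecg_date` and the helper `rows_for_file` are assumptions made for this example; the actual columns written by explore may differ.

```python
import csv
import io
from typing import Dict, List


def rows_for_file(sample_id: str, samples: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Emit one output row per time sample, all sharing the same sample id (hypothetical helper)."""
    return [{"sample_id": sample_id, **sample} for sample in samples]


# Patient 123 has two ECGs in 123.hd5, so the tmap yields two samples
ecgs = [{"ecg_date": "5/20"}, {"ecg_date": "5/21"}]
rows = rows_for_file("123", ecgs)

# Write the rows to an in-memory CSV to show the resulting layout
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["sample_id", "ecg_date"])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

In this sketch the two rows share a sample id and differ only in the per-ECG columns.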

@ndiamant ndiamant marked this pull request as draft July 9, 2020 18:31
@lucidtronix lucidtronix deleted the nd_explore_multi_2 branch January 17, 2023 10:25