fully integrate the keep_demographic_info and date_format flags #249

Merged 5 commits into dev on Jan 21, 2025

Conversation

jessicarowell
Collaborator

Description

  • Renamed the keep_personal_info / val_keep_pi flag to keep_demographic_info and fully implemented it
  • Renamed val_date_format_flag to date_format_flag and fully implemented it
  • Metadata validation will now scrub potentially identifiable information (host sex, age, race, and ethnicity) from the metadata if keep_demographic_info is false.
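
The scrubbing behavior described in the last bullet could look roughly like the Python sketch below. It is only an illustration under stated assumptions: the function name `scrub_demographics`, the row-as-dict representation, and the `Not Provided` placeholder are hypothetical and not taken from the pipeline code. Only the four field names and the flag name come from this PR.

```python
# Field names come from the PR discussion; everything else is an assumption.
DEMOGRAPHIC_FIELDS = ("host_sex", "host_age", "race", "ethnicity")
PLACEHOLDER = "Not Provided"  # hypothetical replacement value


def scrub_demographics(rows, keep_demographic_info=False):
    """Return copies of the metadata rows, masking demographic fields
    unless keep_demographic_info is true. The input rows are not modified."""
    if keep_demographic_info:
        return [dict(row) for row in rows]
    scrubbed = []
    for row in rows:
        clean = dict(row)
        for field in DEMOGRAPHIC_FIELDS:
            # Only overwrite columns that are actually present in the row.
            if field in clean:
                clean[field] = PLACEHOLDER
        scrubbed.append(clean)
    return scrubbed


if __name__ == "__main__":
    sample = [{"sample_name": "s1", "host_sex": "Female", "host_age": "34"}]
    print(scrub_demographics(sample))
    print(scrub_demographics(sample, keep_demographic_info=True))
```

Returning copies rather than editing rows in place keeps the original values available for logging what was scrubbed.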

Checklist

Go Through Checklist Below and Place A ✔️ (X Inside the Box) if Completed

General Checks

  • Have you run appropriate tests (unit/integration/end-to-end) to check logic across run environments (Conda/Docker/Singularity on Scicomp/AWS/NF Tower/Local)?
    SciComp only

    For each relevant configuration:

    • Can the program run completely through without erroring out?
    • Does it produce the expected outputs, given the inputs provided?
  • Have you conducted proper linting procedures?

    • Numpy formatted docstrings for functions
    • Comments explaining lines of code
    • Consistent and intuitive naming conventions for variables, functions, classes, methods, attributes, and scripts
    • Single empty line between class functions, two lines between non-class functions, and two lines between imports and code body
    • Camel case formatting for class names
  • Have you updated existing documentation (README.md, etc.) or created new ones within docs?

CDC Checks

  • Did you check for sensitive data, and remove any?
  • If you added or modified HTML, did you check that it was 508 compliant?

Are additional approvals needed for this change? If so, please mention them below:

Are there potential vulnerabilities or licensing issues with any new dependencies introduced? If so, please mention them below:

jessicarowell added this to the v4.1.2 milestone on Jan 11, 2025
@jessicarowell
Collaborator Author

When testing, please check these two things:

  1. Populate or leave empty host_sex, host_age, race, and ethnicity, and toggle the keep_demographic_info param in nextflow.config. Check the resulting metadata TSV files and the logs to confirm they are correct.
  2. Try different collection_date formats with each option for the date_format param. Check the TSV files and the logs to confirm the date format changes (or stays unchanged) according to the selected date_format option.
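
One way to keep track of the combinations in the two items above is to enumerate them up front. The Python sketch below is a hypothetical helper, not part of the pipeline: it assumes the parameter is spelled `date_format_flag` and accepts `s`, `o`, and `v` (the values in the `validate_params` assertion), and that the params are passed as `--name value` command-line overrides rather than edited in nextflow.config.

```python
from itertools import product

KEEP_DEMOGRAPHIC_INFO = (True, False)
DATE_FORMAT_OPTIONS = ("s", "o", "v")  # values accepted by the validate_params assertion


def build_test_matrix():
    """Return every combination of the two flags as a list of dicts."""
    return [
        {"keep_demographic_info": keep, "date_format_flag": fmt}
        for keep, fmt in product(KEEP_DEMOGRAPHIC_INFO, DATE_FORMAT_OPTIONS)
    ]


def to_cli_args(combo):
    """Format one combination as Nextflow-style '--name value' overrides."""
    parts = []
    for name, value in combo.items():
        # Booleans are written in lowercase, as they would appear in a config file.
        text = str(value).lower() if isinstance(value, bool) else str(value)
        parts.append(f"--{name} {text}")
    return " ".join(parts)


if __name__ == "__main__":
    for combo in build_test_matrix():
        print(to_cli_args(combo))
```

Each printed line can be appended to the usual `nextflow run main.nf ...` command, and the resulting TSV files and logs compared across runs.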

@RamiyapriyaS
Collaborator

Testing with a modified metadata file rsv_test_metadata.xlsx

Command:

nextflow run main.nf -profile test,singularity --species rsv --annotation --submission --output_dir sample_name_check --submission_config ./tostadas/conf/submission_config.yaml --meta_path ./assets/sample_metadata/rsv_test_metadata.xlsx

Error message:

executor >  local (2)
[5b/3939e5] process > TOSTADAS_WORKFLOW:TOSTADAS:VALIDATE_PARAMS                      [100%] 1 of 1, failed: 1 ✘
[-        ] process > TOSTADAS_WORKFLOW:TOSTADAS:METADATA_VALIDATION                  -
[-        ] process > TOSTADAS_WORKFLOW:TOSTADAS:RUN_VADR:VADR_TRIM                   -
[-        ] process > TOSTADAS_WORKFLOW:TOSTADAS:RUN_VADR:VADR_ANNOTATION             -
[-        ] process > TOSTADAS_WORKFLOW:TOSTADAS:RUN_VADR:VADR_POST_CLEANUP           -
[-        ] process > TOSTADAS_WORKFLOW:TOSTADAS:GET_WAIT_TIME                        -
[-        ] process > TOSTADAS_WORKFLOW:TOSTADAS:INITIAL_SUBMISSION:SUBMISSION        -
[-        ] process > TOSTADAS_WORKFLOW:TOSTADAS:INITIAL_SUBMISSION:WAIT              -
[-        ] process > TOSTADAS_WORKFLOW:TOSTADAS:INITIAL_SUBMISSION:UPDATE_SUBMISSION -
WARN: Access to undefined parameter `val_keep_pi` -- Initialise it to a default value eg. `params.val_keep_pi = some_value`
WARN: Access to undefined parameter `val_date_format_flag` -- Initialise it to a default value eg. `params.val_date_format_flag = some_value`
ERROR ~ Error executing process > 'TOSTADAS_WORKFLOW:TOSTADAS:VALIDATE_PARAMS'

Caused by:
  assert params.val_date_format_flag == 's' || params.val_date_format_flag == 'o' || params.val_date_format_flag == 'v'
       |      |                           |  |      |                           |  |      |
       |      null                        |  |      null                        |  |      null
       |                                  |  |                                  |  ['schema':'nextflow_schema.json', 'validate_params':true, 'ref_fasta_path':'/scicomp/home-pure/rjd0/tostadas/assets/ref/Human_orthopneumovirus_NC_001781.fasta', 'meta_path':'/scicomp/home-pure/rjd0/tostadas/assets/sample_metadata/rsv_test_metadata.xlsx', 'ref_gff_path':'/scicomp/home-pure/rjd0/tostadas/assets/ref/ref.MPXV.NC063383.v7.gff', 'date_format_flag':'s', 'keep_demographic_info':false, 'validate_custom_fields':false, 'custom_fields_file':'/scicomp/home-pure/rjd0/tostadas/assets/custom_meta_fields/example_custom_fields.json', 'annotation':true, 'repeatmasker_liftoff':true, 'vadr':true, 'bakta':false, 'species':'rsv', 'submission':true, 'output_dir':'sample_name_check', 'submission_config':'/scicomp/home-pure/rjd0/tostadas/conf/submission_config.yaml', 'repeat_library':'/scicomp/home-pure/rjd0/tostadas/assets/lib/MPOX_repeats_lib.fasta', 'genbank':true, 'sra':true, 'gisaid':false, 'biosample':true, 'submission_mode':'ftp', 'submission_output_dir':'submission_outputs', 'submission_wait_time':380, 'submission_prod_or_test':'test', 'send_submission_email':false, 'update_submission':false, 'help':false, 'publish_dir_mode':'copy', 'bakta_output_dir':'bakta_outputs', 'vadr_output_dir':'vadr_outputs', 'final_liftoff_output_dir':'liftoff_outputs', 'val_output_dir':'validation_outputs', 'vadr_models_dir':'/scicomp/home-pure/rjd0/tostadas/vadr_files/rsv-models', 'env_yml':'/scicomp/home-pure/rjd0/tostadas/environment.yml', 'enable_conda':false, 'repeatmasker_env_yml':'/scicomp/home-pure/rjd0/tostadas/environments/repeatmasker_env.yml', 'vadr_env_yml':'/scicomp/home-pure/rjd0/tostadas/environments/vadr_env.yml', 'cleanup':false, 'clear_nextflow_log':false, 'clear_work_dir':false, 'clear_conda_env':false, 'clear_nf_results':false, 'overwrite_output':true, 'bakta_db_type':'light', 'download_bakta_db':false, 'bakta_db_path':'', 'bakta_min_contig_length':200, 'bakta_threads':2, 
'bakta_gram':'?', 'bakta_genus':'Genus', 'bakta_species':'species', 'bakta_strain':'strain', 'bakta_plasmid':'unnamed', 'bakta_locus':'contig', 'bakta_locus_tag':'LOCUSTAG123', 'bakta_translation_table':11, 'bakta_complete':'', 'bakta_keep_contig_headers':'', 'bakta_replicons':'', 'bakta_proteins':'', 'bakta_skip_trna':'', 'bakta_skip_tmrna':'', 'bakta_skip_rrna':'', 'bakta_skip_ncrna':'', 'bakta_skip_ncrna_region':'', 'bakta_skip_crispr':'', 'bakta_skip_cds':'', 'bakta_skip_pseudo':'', 'bakta_skip_sorf':'', 'bakta_skip_gap':'', 'bakta_skip_ori':'', 'bakta_compliant':true, 'bakta_skip_plot':true, 'lift_print_version_exit':false, 'lift_print_help_exit':false, 'lift_parallel_processes':8, 'lift_coverage_threshold':0.5, 'lift_child_feature_align_threshold':0.5, 'lift_unmapped_features_file_name':'output.unmapped_features.txt', 'lift_copy_threshold':1.0, 'lift_distance_scaling_factor':2.0, 'lift_flank':0.0, 'lift_overlap':0.1, 'lift_mismatch':2, 'lift_gap_open':2, 'lift_gap_extend':1, 'lift_minimap_path':'N/A', 'lift_feature_database_name':'N/A', 'lift_feature_types':'/scicomp/home-pure/rjd0/tostadas/assets/feature_types.txt', 'processed_samples':'/scicomp/home-pure/rjd0/tostadas/test_output/submission_outputs']
       |                                  |  |                                  false
       |                                  |  ['schema':'nextflow_schema.json', 'validate_params':true, 'ref_fasta_path':'/scicomp/home-pure/rjd0/tostadas/assets/ref/Human_orthopneumovirus_NC_001781.fasta', 'meta_path':'/scicomp/home-pure/rjd0/tostadas/assets/sample_metadata/rsv_test_metadata.xlsx', 'ref_gff_path':'/scicomp/home-pure/rjd0/tostadas/assets/ref/ref.MPXV.NC063383.v7.gff', 'date_format_flag':'s', 'keep_demographic_info':false, 'validate_custom_fields':false, 'custom_fields_file':'/scicomp/home-pure/rjd0/tostadas/assets/custom_meta_fields/example_custom_fields.json', 'annotation':true, 'repeatmasker_liftoff':true, 'vadr':true, 'bakta':false, 'species':'rsv', 'submission':true, 'output_dir':'sample_name_check', 'submission_config':'/scicomp/home-pure/rjd0/tostadas/conf/submission_config.yaml', 'repeat_library':'/scicomp/home-pure/rjd0/tostadas/assets/lib/MPOX_repeats_lib.fasta', 'genbank':true, 'sra':true, 'gisaid':false, 'biosample':true, 'submission_mode':'ftp', 'submission_output_dir':'submission_outputs', 'submission_wait_time':380, 'submission_prod_or_test':'test', 'send_submission_email':false, 'update_submission':false, 'help':false, 'publish_dir_mode':'copy', 'bakta_output_dir':'bakta_outputs', 'vadr_output_dir':'vadr_outputs', 'final_liftoff_output_dir':'liftoff_outputs', 'val_output_dir':'validation_outputs', 'vadr_models_dir':'/scicomp/home-pure/rjd0/tostadas/vadr_files/rsv-models', 'env_yml':'/scicomp/home-pure/rjd0/tostadas/environment.yml', 'enable_conda':false, 'repeatmasker_env_yml':'/scicomp/home-pure/rjd0/tostadas/environments/repeatmasker_env.yml', 'vadr_env_yml':'/scicomp/home-pure/rjd0/tostadas/environments/vadr_env.yml', 'cleanup':false, 'clear_nextflow_log':false, 'clear_work_dir':false, 'clear_conda_env':false, 'clear_nf_results':false, 'overwrite_output':true, 'bakta_db_type':'light', 'download_bakta_db':false, 'bakta_db_path':'', 'bakta_min_contig_length':200, 'bakta_threads':2, 'bakta_gram':'?', 'bakta_genus':'Genus', 
'bakta_species':'species', 'bakta_strain':'strain', 'bakta_plasmid':'unnamed', 'bakta_locus':'contig', 'bakta_locus_tag':'LOCUSTAG123', 'bakta_translation_table':11, 'bakta_complete':'', 'bakta_keep_contig_headers':'', 'bakta_replicons':'', 'bakta_proteins':'', 'bakta_skip_trna':'', 'bakta_skip_tmrna':'', 'bakta_skip_rrna':'', 'bakta_skip_ncrna':'', 'bakta_skip_ncrna_region':'', 'bakta_skip_crispr':'', 'bakta_skip_cds':'', 'bakta_skip_pseudo':'', 'bakta_skip_sorf':'', 'bakta_skip_gap':'', 'bakta_skip_ori':'', 'bakta_compliant':true, 'bakta_skip_plot':true, 'lift_print_version_exit':false, 'lift_print_help_exit':false, 'lift_parallel_processes':8, 'lift_coverage_threshold':0.5, 'lift_child_feature_align_threshold':0.5, 'lift_unmapped_features_file_name':'output.unmapped_features.txt', 'lift_copy_threshold':1.0, 'lift_distance_scaling_factor':2.0, 'lift_flank':0.0, 'lift_overlap':0.1, 'lift_mismatch':2, 'lift_gap_open':2, 'lift_gap_extend':1, 'lift_minimap_path':'N/A', 'lift_feature_database_name':'N/A', 'lift_feature_types':'/scicomp/home-pure/rjd0/tostadas/assets/feature_types.txt', 'processed_samples':'/scicomp/home-pure/rjd0/tostadas/test_output/submission_outputs']
       |                                  false
       ['schema':'nextflow_schema.json', 'validate_params':true, 'ref_fasta_path':'/scicomp/home-pure/rjd0/tostadas/assets/ref/Human_orthopneumovirus_NC_001781.fasta', 'meta_path':'/scicomp/home-pure/rjd0/tostadas/assets/sample_metadata/rsv_test_metadata.xlsx', 'ref_gff_path':'/scicomp/home-pure/rjd0/tostadas/assets/ref/ref.MPXV.NC063383.v7.gff', 'date_format_flag':'s', 'keep_demographic_info':false, 'validate_custom_fields':false, 'custom_fields_file':'/scicomp/home-pure/rjd0/tostadas/assets/custom_meta_fields/example_custom_fields.json', 'annotation':true, 'repeatmasker_liftoff':true, 'vadr':true, 'bakta':false, 'species':'rsv', 'submission':true, 'output_dir':'sample_name_check', 'submission_config':'/scicomp/home-pure/rjd0/tostadas/conf/submission_config.yaml', 'repeat_library':'/scicomp/home-pure/rjd0/tostadas/assets/lib/MPOX_repeats_lib.fasta', 'genbank':true, 'sra':true, 'gisaid':false, 'biosample':true, 'submission_mode':'ftp', 'submission_output_dir':'submission_outputs', 'submission_wait_time':380, 'submission_prod_or_test':'test', 'send_submission_email':false, 'update_submission':false, 'help':false, 'publish_dir_mode':'copy', 'bakta_output_dir':'bakta_outputs', 'vadr_output_dir':'vadr_outputs', 'final_liftoff_output_dir':'liftoff_outputs', 'val_output_dir':'validation_outputs', 'vadr_models_dir':'/scicomp/home-pure/rjd0/tostadas/vadr_files/rsv-models', 'env_yml':'/scicomp/home-pure/rjd0/tostadas/environment.yml', 'enable_conda':false, 'repeatmasker_env_yml':'/scicomp/home-pure/rjd0/tostadas/environments/repeatmasker_env.yml', 'vadr_env_yml':'/scicomp/home-pure/rjd0/tostadas/environments/vadr_env.yml', 'cleanup':false, 'clear_nextflow_log':false, 'clear_work_dir':false, 'clear_conda_env':false, 'clear_nf_results':false, 'overwrite_output':true, 'bakta_db_type':'light', 'download_bakta_db':false, 'bakta_db_path':'', 'bakta_min_contig_length':200, 'bakta_threads':2, 'bakta_gram':'?', 'bakta_genus':'Genus', 'bakta_species':'species', 
'bakta_strain':'strain', 'bakta_plasmid':'unnamed', 'bakta_locus':'contig', 'bakta_locus_tag':'LOCUSTAG123', 'bakta_translation_table':11, 'bakta_complete':'', 'bakta_keep_contig_headers':'', 'bakta_replicons':'', 'bakta_proteins':'', 'bakta_skip_trna':'', 'bakta_skip_tmrna':'', 'bakta_skip_rrna':'', 'bakta_skip_ncrna':'', 'bakta_skip_ncrna_region':'', 'bakta_skip_crispr':'', 'bakta_skip_cds':'', 'bakta_skip_pseudo':'', 'bakta_skip_sorf':'', 'bakta_skip_gap':'', 'bakta_skip_ori':'', 'bakta_compliant':true, 'bakta_skip_plot':true, 'lift_print_version_exit':false, 'lift_print_help_exit':false, 'lift_parallel_processes':8, 'lift_coverage_threshold':0.5, 'lift_child_feature_align_threshold':0.5, 'lift_unmapped_features_file_name':'output.unmapped_features.txt', 'lift_copy_threshold':1.0, 'lift_distance_scaling_factor':2.0, 'lift_flank':0.0, 'lift_overlap':0.1, 'lift_mismatch':2, 'lift_gap_open':2, 'lift_gap_extend':1, 'lift_minimap_path':'N/A', 'lift_feature_database_name':'N/A', 'lift_feature_types':'/scicomp/home-pure/rjd0/tostadas/assets/feature_types.txt', 'processed_samples':'/scicomp/home-pure/rjd0/tostadas/test_output/submission_outputs'] -- Check script './workflows/../modules/local/general_util/validate_params/main.nf' at line: 126

Source block:
  assert params.meta_path
  if ( params.annotation ) {
              if ( params.repeatmasker_liftoff ) {
                  assert params.ref_fasta_path
                  assert params.ref_fasta_path
                  assert params.ref_gff_path
                  assert params.repeat_library
              }
              if ( params.vadr ) {
                  assert params.vadr_models_dir
              }
              if ( params.bakta ) {
                  if ( !params.download_bakta_db ) {
                      assert params.bakta_db_path
                  }
              }
          }
  if ( params.repeatmasker_liftoff == true ) {
              // Check whether populated or not 
              assert params.lift_parallel_processes == 0 || params.lift_parallel_processes
              assert params.lift_mismatch
              assert params.lift_gap_open
              assert params.lift_gap_extend 
              assert params.lift_print_version_exit == true || params.lift_print_version_exit == false
              assert params.lift_print_help_exit == true || params.lift_print_help_exit == false
  
              // Check data types 
              expected_liftoff_strings = [
                  "lift_minimap_path": params.lift_minimap_path,
                  "lift_feature_database_name": params.lift_feature_database_name  
              ]
  
              expected_liftoff_integers = [
                  "lift_parallel_processes" : params.lift_parallel_processes,
                  "lift_mismatch": params.lift_mismatch,
                  "lift_gap_open": params.lift_gap_open,
                  "lift_gap_extend": params.lift_gap_extend
              ]
  
              expected_liftoff_floats = [
                  "lift_coverage_threshold": params.lift_coverage_threshold,
                  "lift_child_feature_align_threshold": params.lift_child_feature_align_threshold,
                  "lift_copy_threshold": params.lift_copy_threshold,
                  "lift_distance_scaling_factor": params.lift_distance_scaling_factor,
                  "lift_flank": params.lift_flank,
                  "lift_overlap": params.lift_overlap
              ]
  
              expected_liftoff_strings.each { key, value ->
                  if ( expected_liftoff_strings[key] instanceof String == false ) {
                      throw new Exception("Value must be of string type: $value used for $key parameter")
                  }
              }
  
              expected_liftoff_integers.each { key, value ->
                  if ( expected_liftoff_integers[key] instanceof Integer == false ) {
                      throw new Exception("Value must be of integer type: $value used for $key parameter")
                  }
              }
  
              expected_liftoff_floats.each { key, value ->
                  if ( expected_liftoff_floats[key] instanceof Integer == true || expected_liftoff_floats[key] instanceof String == true ) {
                      throw new Exception("Value must be of float type and not integer or string: $value used for $key parameter")
                  }
              } 
          }
  if ( params.bakta == true ) {
              assert params.meta_path
              assert params.bakta_min_contig_length
              assert params.bakta_translation_table
              assert params.bakta_genus
              assert params.bakta_species
              assert params.bakta_strain
              assert params.bakta_plasmid
              assert params.bakta_locus
              assert params.bakta_locus_tag
          }
  assert params.clear_nextflow_log == true || params.clear_nextflow_log == false
  assert params.clear_work_dir == true || params.clear_work_dir == false
  assert params.submission == true || params.submission == false
  assert params.cleanup == true || params.cleanup == false
  assert params.overwrite_output == true || params.overwrite_output == false
  assert params.val_date_format_flag == 's' || params.val_date_format_flag == 'o' || params.val_date_format_flag == 'v'
  assert params.val_keep_pi == true || params.val_keep_pi == false
  expected_strings = [
              "ref_fasta_path": params.ref_fasta_path,
              "ref_gff_path": params.ref_gff_path,
              "meta_path": params.meta_path,
              "output_dir": params.output_dir,   
          ]
  expected_strings.each { key, value ->
              if (!(value instanceof String || value instanceof org.codehaus.groovy.runtime.GStringImpl)) {
                  throw new Exception("Value must be of string type: $value used for $key parameter")
              }
          }

Work dir:
  /scicomp/scratch/rjd0/nextflow/work/5b/3939e5fd09ec728c3927ec2199f782

Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`

 -- Check '.nextflow.log' file for details
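
Reading the log: the two WARN lines report that `val_keep_pi` and `val_date_format_flag` are undefined, while the printed params map contains `date_format_flag` and `keep_demographic_info`. The assertions in `validate_params` therefore appear to still use the old names, which resolve to null. A minimal Python sketch of that comparison follows. The helper name is hypothetical and the `defined` list is only a small subset of the params shown in the log.

```python
def find_undefined_params(referenced, defined):
    """Return the referenced parameter names that are not defined, sorted."""
    return sorted(set(referenced) - set(defined))


# Names used by the failing assertions in the source block of the log.
referenced = ["val_date_format_flag", "val_keep_pi"]

# A few of the names present in the params map printed in the log.
defined = ["date_format_flag", "keep_demographic_info", "meta_path", "species"]

missing = find_undefined_params(referenced, defined)

if __name__ == "__main__":
    for name in missing:
        print(f"undefined parameter referenced in validation: {name}")
```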

…atest nf-schema and param validation changes
@jessicarowell
Collaborator Author

Ok I think I've fixed it by grabbing the latest changes from dev

@jessicarowell
Collaborator Author

Just changed keep_demographic_info to remove_demographic_info based on our conversations - I agree the former was confusing. This version also follows best practices.

@jessicarowell
Collaborator Author

Date function is fixed!

jessicarowell modified the milestones: v4.1.2, v4.1.3, v4.1.4 on Jan 18, 2025
RamiyapriyaS merged commit a614b35 into dev on Jan 21, 2025
Successfully merging this pull request may close these issues.

[Internal] [Bug] Enable date correction and demographic info scrubbing params
2 participants