Ndv dont share mi clr but still lock per bootstrap #63

Open
wants to merge 5 commits into
base: master
from


nickdeveaux
Contributor

@kostyat @dayanne-castro

Calculating the MI and CLR matrices once and sending them to the workers transferred a large amount of data to each worker per bootstrap. For example, for a 60k gene by 150 sample input file, the MI and CLR matrices summed to 0.6 GB and grew to 1.6 GB once they were pickled. This was sent to 70 workers across 20 bootstraps on the cluster, leading to a massive (>10x) slowdown.
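
To put those numbers in perspective, here is a rough back-of-the-envelope estimate. It uses only the figures quoted above and assumes (this is an assumption, not something measured) that every worker receives a full pickled copy on every bootstrap. The variable names are illustrative.

```python
# Rough estimate of data moved through the key-value store.
# Assumption: each worker receives one full pickled copy per bootstrap.
pickled_gb_per_worker = 1.6   # pickled MI + CLR payload, from the example above
workers = 70
bootstraps = 20

per_bootstrap_gb = pickled_gb_per_worker * workers
total_gb = per_bootstrap_gb * bootstraps

print("per bootstrap: about %.0f GB" % per_bootstrap_gb)
print("whole run:     about %.0f GB" % total_gb)
```

Under that assumption the transfer comes to roughly 112 GB per bootstrap and about 2.2 TB over the whole run, which is consistent with the slowdown described.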

Now each worker calculates MI and CLR independently, and must wait for a new special key (bootstrap %idx) before moving forward.
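
A minimal sketch of the pattern, not the code in this PR: the `InMemoryKVS` class below is a hypothetical in-process stand-in for the shared key-value store (it is not the kvsstcp API), `compute_mi_clr` is a placeholder, and the "done" keys are an assumption added so the sketch can run in lock step. Only the "bootstrap %d" key name comes from the description above.

```python
import threading


class InMemoryKVS(object):
    """Hypothetical stand-in for a shared key-value store (not the kvsstcp API)."""

    def __init__(self):
        self._data = {}
        self._cond = threading.Condition()

    def put(self, key, value):
        with self._cond:
            self._data[key] = value
            self._cond.notify_all()

    def view(self, key):
        # Block until the key exists, then return its value without removing it.
        with self._cond:
            while key not in self._data:
                self._cond.wait()
            return self._data[key]


def compute_mi_clr(bootstrap_idx):
    # Placeholder for the real per-worker MI/CLR computation.
    return "clr-%d" % bootstrap_idx, "mi-%d" % bootstrap_idx


def worker(kvs, rank, n_bootstraps, log):
    for idx in range(n_bootstraps):
        kvs.view("bootstrap %d" % idx)        # wait for the per-bootstrap key
        clr, mi = compute_mi_clr(idx)         # computed locally, nothing large is sent
        log.append((rank, idx, clr, mi))
        kvs.put("done %d %d" % (idx, rank), True)   # assumed completion signal


def run_bootstraps(n_workers=3, n_bootstraps=2):
    kvs, log = InMemoryKVS(), []
    threads = [threading.Thread(target=worker, args=(kvs, r, n_bootstraps, log))
               for r in range(n_workers)]
    for t in threads:
        t.start()
    for idx in range(n_bootstraps):
        kvs.put("bootstrap %d" % idx, True)   # release every worker for this bootstrap
        for r in range(n_workers):
            kvs.view("done %d %d" % (idx, r)) # wait until all workers have finished it
    for t in threads:
        t.join()
    return log


log = run_bootstraps()
print(len(log), "worker-bootstrap results")
```

The only values that cross the store here are small flags, so the cost per bootstrap no longer scales with the size of the matrices.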

@codecov-io

codecov-io commented May 10, 2018

Codecov Report

Merging #63 into master will decrease coverage by 0.09%.
The diff coverage is 0%.


@@            Coverage Diff            @@
##           master      #63     +/-   ##
=========================================
- Coverage   70.54%   70.44%   -0.1%     
=========================================
  Files          18       18             
  Lines        1480     1482      +2     
=========================================
  Hits         1044     1044             
- Misses        436      438      +2
Impacted Files Coverage Δ
inferelator_ng/bbsr_tfa_workflow.py 0% <0%> (ø) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 3876f18...b4ea275.

@dayanne-castro
Collaborator

👍

@kostyat
Contributor

kostyat commented May 29, 2018

I tried this on NYU HPC. I submitted the following script with sbatch:

#!/bin/sh

#SBATCH --nodes=3
#SBATCH --tasks-per-node=4
#SBATCH --mem=10GB
#SBATCH --time=2:00:00
#SBATCH --job-name=Infer_Test
#SBATCH --output=Infer_Test_KVS_10GB_3_nodes_4_tasks_pull63.out

module purge
module load r/intel/3.4.2 python/intel/2.7.12 bedtools/intel/2.26.0
source /home/kmt331/inferelator_ng/py2.7/bin/activate

cd /home/kmt331/inferelator_ng
export PYTHONPATH=$PYTHONPATH:$(pwd)/kvsstcp

time python ~/inferelator_ng/kvsstcp/kvsstcp.py --execcmd 'srun -n '${SLURM_NTASKS}' python bsubtilis_bbsr_workflow_runner.py'

When I ran this script against the original code on the master branch, everything ran fine and the results looked fine. But when I switched to the nickdeveaux-ndv_dont_share_mi_clr_but_still_lock_per_bootstrap branch (the code in this pull request), I got the following error (I'm not pasting the entire output here, just the part that looks relevant):

Creating design and response matrix ... 
Setting up TFA specific response matrix ... 
Computing Transcription Factor Activity ... 
Bootstrap 1 of 2
Calculating MI, Background MI, and CLR Matrix
Traceback (most recent call last):
  File "bsubtilis_bbsr_workflow_runner.py", line 10, in <module>
    workflow.run() 
  File "/home/kmt331/inferelator_ng/inferelator_ng/bbsr_tfa_workflow.py", line 49, in run
    (self.clr_matrix, self.mi_matrix) = self.mi_clr_driver.run(X, Y)
  File "/home/kmt331/inferelator_ng/inferelator_ng/mi_R.py", line 83, in run
Creating design and response matrix ... 
Setting up TFA specific response matrix ... 
Computing Transcription Factor Activity ... 
Bootstrap 1 of 2
Calculating MI, Background MI, and CLR Matrix
Creating design and response matrix ... 
Setting up TFA specific response matrix ... 
Computing Transcription Factor Activity ... 
Bootstrap 1 of 2
Calculating MI, Background MI, and CLR Matrix
Traceback (most recent call last):
  File "bsubtilis_bbsr_workflow_runner.py", line 10, in <module>
Traceback (most recent call last):
  File "bsubtilis_bbsr_workflow_runner.py", line 10, in <module>
    workflow.run() 
    workflow.run() 
  File "/home/kmt331/inferelator_ng/inferelator_ng/bbsr_tfa_workflow.py", line 49, in run
  File "/home/kmt331/inferelator_ng/inferelator_ng/bbsr_tfa_workflow.py", line 49, in run
    (self.clr_matrix, self.mi_matrix) = self.mi_clr_driver.run(X, Y)
    (self.clr_matrix, self.mi_matrix) = self.mi_clr_driver.run(X, Y)
  File "/home/kmt331/inferelator_ng/inferelator_ng/mi_R.py", line 83, in run
  File "/home/kmt331/inferelator_ng/inferelator_ng/mi_R.py", line 83, in run
    matrix_data_frame = pd.read_csv(matrix_path, sep='\t')
  File "/share/apps/python/2.7.12/intel/lib/python2.7/site-packages/pandas-0.19.1-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 645, in parser_f
    matrix_data_frame = pd.read_csv(matrix_path, sep='\t')
  File "/share/apps/python/2.7.12/intel/lib/python2.7/site-packages/pandas-0.19.1-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 645, in parser_f
    matrix_data_frame = pd.read_csv(matrix_path, sep='\t')
  File "/share/apps/python/2.7.12/intel/lib/python2.7/site-packages/pandas-0.19.1-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 645, in parser_f
    return _read(filepath_or_buffer, kwds)
    return _read(filepath_or_buffer, kwds)
  File "/share/apps/python/2.7.12/intel/lib/python2.7/site-packages/pandas-0.19.1-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 400, in _read
  File "/share/apps/python/2.7.12/intel/lib/python2.7/site-packages/pandas-0.19.1-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 400, in _read
    return _read(filepath_or_buffer, kwds)
  File "/share/apps/python/2.7.12/intel/lib/python2.7/site-packages/pandas-0.19.1-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 400, in _read
    data = parser.read()
  File "/share/apps/python/2.7.12/intel/lib/python2.7/site-packages/pandas-0.19.1-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 938, in read
    data = parser.read()
  File "/share/apps/python/2.7.12/intel/lib/python2.7/site-packages/pandas-0.19.1-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 938, in read
    data = parser.read()
  File "/share/apps/python/2.7.12/intel/lib/python2.7/site-packages/pandas-0.19.1-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 938, in read
    ret = self._engine.read(nrows)
  File "/share/apps/python/2.7.12/intel/lib/python2.7/site-packages/pandas-0.19.1-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 1507, in read
    ret = self._engine.read(nrows)
  File "/share/apps/python/2.7.12/intel/lib/python2.7/site-packages/pandas-0.19.1-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 1507, in read
    ret = self._engine.read(nrows)
  File "/share/apps/python/2.7.12/intel/lib/python2.7/site-packages/pandas-0.19.1-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 1507, in read
    data = self._reader.read(nrows)
  File "pandas/parser.pyx", line 846, in pandas.parser.TextReader.read (pandas/parser.c:9935)
    data = self._reader.read(nrows)
  File "pandas/parser.pyx", line 846, in pandas.parser.TextReader.read (pandas/parser.c:9935)
    data = self._reader.read(nrows)
  File "pandas/parser.pyx", line 846, in pandas.parser.TextReader.read (pandas/parser.c:9935)
  File "pandas/parser.pyx", line 868, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:10193)
  File "pandas/parser.pyx", line 868, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:10193)
  File "pandas/parser.pyx", line 868, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:10193)
  File "pandas/parser.pyx", line 922, in pandas.parser.TextReader._read_rows (pandas/parser.c:10921)
  File "pandas/parser.pyx", line 922, in pandas.parser.TextReader._read_rows (pandas/parser.c:10921)
  File "pandas/parser.pyx", line 922, in pandas.parser.TextReader._read_rows (pandas/parser.c:10921)
  File "pandas/parser.pyx", line 909, in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:10792)
  File "pandas/parser.pyx", line 909, in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:10792)
  File "pandas/parser.pyx", line 909, in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:10792)
  File "pandas/parser.pyx", line 2018, in pandas.parser.raise_parser_error (pandas/parser.c:25929)
  File "pandas/parser.pyx", line 2018, in pandas.parser.raise_parser_error (pandas/parser.c:25929)
2018-05-29 18:44:01,710 INFO     kvs            : Closing connection from ('172.16.2.127', 55612)
2018-05-29 18:44:01,710 INFO     kvs            : Closing connection from ('172.16.2.127', 55614)
2018-05-29 18:44:01,710 INFO     kvs            : Closing connection from ('172.16.2.127', 55610)
  File "pandas/parser.pyx", line 2018, in pandas.parser.raise_parser_error (pandas/parser.c:25929)
pandas.io.common.CParserError: Error tokenizing data. C error: Expected 240 fields in line 1529, saw 281

pandas.io.common.CParserError: Error tokenizing data. C error: Expected 240 fields in line 1529, saw 281

pandas.io.common.CParserError: Error tokenizing data. C error: Expected 240 fields in line 1529, saw 281

Creating design and response matrix ... 
Setting up TFA specific response matrix ... 
Computing Transcription Factor Activity ... 
Bootstrap 1 of 2
Calculating MI, Background MI, and CLR Matrix
srun: error: c41-06: tasks 5-7: Exited with exit code 1
srun: Terminating job step 6497421.0
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 6497421.0 ON c41-04 CANCELLED AT 2018-05-29T18:44:01 ***
2018-05-29 18:44:01,881 INFO     kvs            : Closing connection from ('172.16.2.129', 55022)
2018-05-29 18:44:01,890 INFO     kvs            : Closing connection from ('172.16.2.127', 55616)

...
etc...
...

srun: error: c41-04: tasks 0-3: Killed
srun: error: c41-12: tasks 8-11: Killed
Traceback (most recent call last):
2018-05-29 18:44:02,022 INFO     kvs            : Server shutting down
  File "/home/kmt331/inferelator_ng/kvsstcp/kvsstcp.py", line 605, in <module>
    subprocess.check_call(args.execcmd, shell=True, env=t.env())
  File "/share/apps/python/2.7.12/intel/lib/python2.7/subprocess.py", line 541, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'srun -n 12 python bsubtilis_bbsr_workflow_runner.py' returned non-zero exit status 1

real    0m15.581s
user    0m0.054s
sys     0m0.049s
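
The thread does not establish the cause of this error. The "Expected 240 fields in line 1529, saw 281" message does mean that the file read at `matrix_path` had rows of differing widths, so a first diagnostic step might be to list those rows. The helper below is a hypothetical sketch using only the standard library. `find_ragged_rows` is not part of inferelator_ng, and the sample data is made up.

```python
import csv
import io
from collections import Counter


def find_ragged_rows(lines, delimiter="\t"):
    """Return (typical_width, [(line_number, width), ...]) for rows whose
    field count differs from the most common field count in the file."""
    reader = csv.reader(lines, delimiter=delimiter)
    # reader.line_num is the 1-based number of the last physical line read.
    widths = [(reader.line_num, len(row)) for row in reader]
    if not widths:
        return None, []
    typical = Counter(width for _, width in widths).most_common(1)[0][0]
    ragged = [(num, width) for num, width in widths if width != typical]
    return typical, ragged


# Made-up sample: the third line has one field too many.
sample = io.StringIO(u"a\tb\tc\n1\t2\t3\n4\t5\t6\t7\n")
typical, ragged = find_ragged_rows(sample)
print("typical width:", typical)
print("ragged rows:", ragged)
```

To run it against a real file, pass an open file handle instead of the `StringIO` object. If the ragged rows differ from run to run, that would suggest the file is being written while it is read, but that is speculation until it is checked.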

@kostyat
Contributor

kostyat commented Jul 5, 2018

@nickdeveaux any ideas why I'm getting that error?

@kostyat
Contributor

kostyat commented Aug 23, 2018

Has anybody else tried this? Does it work for anyone else? I am still getting the same error on NYU HPC. This time I was working on the InfereCLaDR branch and manually applied the same changes you made to bbsr_tfa_runner.py, and I still got the same error.


4 participants