Ndv dont share mi clr but still lock per bootstrap #63

nickdeveaux · 2018-05-10T23:04:30Z

Calculating MI and CLR once and sending the results to the workers transferred a large amount of data to each worker on every bootstrap. For example, for a 60k-gene by 150-sample input file, the MI and CLR matrices totaled 0.6 GB in memory and grew to 1.6 GB once pickled. This was sent to 70 workers across 20 bootstraps on the cluster, leading to a massive (>10x) slowdown.
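As a rough illustration of how serialization can inflate a payload, the sketch below pickles the same list of floats with a text protocol and a binary one. It is only a minimal example under assumptions: the thread does not say which pickle protocol was in use, and pandas/numpy objects serialize differently from a plain list. What is certain is that Python 2's `pickle` defaults to protocol 0, which writes floats as ASCII text. The function name `pickled_sizes` is made up for this example.

```python
import pickle
import random


def pickled_sizes(num_values=10000, seed=0):
    """Return (protocol_0_bytes, protocol_2_bytes) for the same list of floats.

    Hypothetical helper for illustration; not part of inferelator_ng.
    """
    rng = random.Random(seed)
    values = [rng.random() for _ in range(num_values)]
    # Protocol 0 is ASCII-based: each float is written as its text repr.
    text_size = len(pickle.dumps(values, protocol=0))
    # Protocol 2 is binary: each float takes an opcode plus 8 bytes.
    binary_size = len(pickle.dumps(values, protocol=2))
    return text_size, binary_size


text_size, binary_size = pickled_sizes()
print("protocol 0: %d bytes, protocol 2: %d bytes" % (text_size, binary_size))
```

Measuring the serialized size this way before sending an object to workers gives a quick estimate of the per-bootstrap transfer cost.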

Now each worker calculates MI and CLR independently, and must wait for a new special key (bootstrap %idx) before moving forward.
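The wait-on-a-key pattern can be sketched roughly as follows. This is not the code in this PR: `InMemoryKVS` is a hypothetical in-process stand-in for the shared key-value store, threads stand in for cluster workers, and which process publishes the key (rank 0 here) and what value it carries are assumptions made for the example.

```python
import threading


class InMemoryKVS(object):
    """Hypothetical stand-in for a shared key-value store (not the kvsstcp API)."""

    def __init__(self):
        self._data = {}
        self._cond = threading.Condition()

    def put(self, key, value):
        with self._cond:
            self._data[key] = value
            self._cond.notify_all()

    def view(self, key):
        # Block until the key exists, then return its value without removing it.
        with self._cond:
            while key not in self._data:
                self._cond.wait()
            return self._data[key]


def worker(rank, kvs, num_bootstraps, log):
    for idx in range(num_bootstraps):
        # Placeholder for the per-worker MI/CLR computation; nothing is shared.
        _local_matrices = "mi_clr_%d_%d" % (rank, idx)
        key = "bootstrap %d" % idx
        if rank == 0:
            kvs.put(key, True)   # assumed: rank 0 publishes the per-bootstrap key
        else:
            kvs.view(key)        # other workers block until the key appears
        log.append((rank, idx))  # list.append is atomic in CPython


def run_demo(num_workers=4, num_bootstraps=2):
    kvs = InMemoryKVS()
    log = []
    threads = [threading.Thread(target=worker, args=(r, kvs, num_bootstraps, log))
               for r in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

Only a small flag crosses the store for each bootstrap, so the large matrices never have to be pickled and sent.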

codecov-io · 2018-05-10T23:16:22Z

Codecov Report

Merging #63 into master will decrease coverage by 0.09%.
The diff coverage is 0%.

@@            Coverage Diff            @@
##           master      #63     +/-   ##
=========================================
- Coverage   70.54%   70.44%   -0.1%     
=========================================
  Files          18       18             
  Lines        1480     1482      +2     
=========================================
  Hits         1044     1044             
- Misses        436      438      +2

Impacted Files	Coverage Δ
inferelator_ng/bbsr_tfa_workflow.py	`0% <0%> (ø)`	⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 3876f18...b4ea275.

dayanne-castro · 2018-05-16T15:14:58Z

👍

kostyat · 2018-05-29T22:53:21Z

I tried this on NYU HPC. I submitted the following script with sbatch:

#!/bin/sh

#SBATCH --nodes=3
#SBATCH --tasks-per-node=4
#SBATCH --mem=10GB
#SBATCH --time=2:00:00
#SBATCH --job-name=Infer_Test
#SBATCH --output=Infer_Test_KVS_10GB_3_nodes_4_tasks_pull63.out

module purge
module load r/intel/3.4.2 python/intel/2.7.12 bedtools/intel/2.26.0
source /home/kmt331/inferelator_ng/py2.7/bin/activate

cd /home/kmt331/inferelator_ng
export PYTHONPATH=$PYTHONPATH:$(pwd)/kvsstcp

time python ~/inferelator_ng/kvsstcp/kvsstcp.py --execcmd 'srun -n '${SLURM_NTASKS}' python bsubtilis_bbsr_workflow_runner.py'

When I ran this with the original code on the master branch, everything ran fine and the results looked fine. But when I switched to the nickdeveaux-ndv_dont_share_mi_clr_but_still_lock_per_bootstrap branch (with the code in this pull request), I got the following error (not pasting the entire output here, just the part that looks relevant):

Creating design and response matrix ... 
Setting up TFA specific response matrix ... 
Computing Transcription Factor Activity ... 
Bootstrap 1 of 2
Calculating MI, Background MI, and CLR Matrix
Traceback (most recent call last):
  File "bsubtilis_bbsr_workflow_runner.py", line 10, in <module>
    workflow.run() 
  File "/home/kmt331/inferelator_ng/inferelator_ng/bbsr_tfa_workflow.py", line 49, in run
    (self.clr_matrix, self.mi_matrix) = self.mi_clr_driver.run(X, Y)
  File "/home/kmt331/inferelator_ng/inferelator_ng/mi_R.py", line 83, in run
Creating design and response matrix ... 
Setting up TFA specific response matrix ... 
Computing Transcription Factor Activity ... 
Bootstrap 1 of 2
Calculating MI, Background MI, and CLR Matrix
Creating design and response matrix ... 
Setting up TFA specific response matrix ... 
Computing Transcription Factor Activity ... 
Bootstrap 1 of 2
Calculating MI, Background MI, and CLR Matrix
Traceback (most recent call last):
  File "bsubtilis_bbsr_workflow_runner.py", line 10, in <module>
Traceback (most recent call last):
  File "bsubtilis_bbsr_workflow_runner.py", line 10, in <module>
    workflow.run() 
    workflow.run() 
  File "/home/kmt331/inferelator_ng/inferelator_ng/bbsr_tfa_workflow.py", line 49, in run
  File "/home/kmt331/inferelator_ng/inferelator_ng/bbsr_tfa_workflow.py", line 49, in run
    (self.clr_matrix, self.mi_matrix) = self.mi_clr_driver.run(X, Y)
    (self.clr_matrix, self.mi_matrix) = self.mi_clr_driver.run(X, Y)
  File "/home/kmt331/inferelator_ng/inferelator_ng/mi_R.py", line 83, in run
  File "/home/kmt331/inferelator_ng/inferelator_ng/mi_R.py", line 83, in run
    matrix_data_frame = pd.read_csv(matrix_path, sep='\t')
  File "/share/apps/python/2.7.12/intel/lib/python2.7/site-packages/pandas-0.19.1-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 645, in parser_f
    matrix_data_frame = pd.read_csv(matrix_path, sep='\t')
  File "/share/apps/python/2.7.12/intel/lib/python2.7/site-packages/pandas-0.19.1-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 645, in parser_f
    matrix_data_frame = pd.read_csv(matrix_path, sep='\t')
  File "/share/apps/python/2.7.12/intel/lib/python2.7/site-packages/pandas-0.19.1-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 645, in parser_f
    return _read(filepath_or_buffer, kwds)
    return _read(filepath_or_buffer, kwds)
  File "/share/apps/python/2.7.12/intel/lib/python2.7/site-packages/pandas-0.19.1-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 400, in _read
  File "/share/apps/python/2.7.12/intel/lib/python2.7/site-packages/pandas-0.19.1-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 400, in _read
    return _read(filepath_or_buffer, kwds)
  File "/share/apps/python/2.7.12/intel/lib/python2.7/site-packages/pandas-0.19.1-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 400, in _read
    data = parser.read()
  File "/share/apps/python/2.7.12/intel/lib/python2.7/site-packages/pandas-0.19.1-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 938, in read
    data = parser.read()
  File "/share/apps/python/2.7.12/intel/lib/python2.7/site-packages/pandas-0.19.1-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 938, in read
    data = parser.read()
  File "/share/apps/python/2.7.12/intel/lib/python2.7/site-packages/pandas-0.19.1-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 938, in read
    ret = self._engine.read(nrows)
  File "/share/apps/python/2.7.12/intel/lib/python2.7/site-packages/pandas-0.19.1-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 1507, in read
    ret = self._engine.read(nrows)
  File "/share/apps/python/2.7.12/intel/lib/python2.7/site-packages/pandas-0.19.1-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 1507, in read
    ret = self._engine.read(nrows)
  File "/share/apps/python/2.7.12/intel/lib/python2.7/site-packages/pandas-0.19.1-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 1507, in read
    data = self._reader.read(nrows)
  File "pandas/parser.pyx", line 846, in pandas.parser.TextReader.read (pandas/parser.c:9935)
    data = self._reader.read(nrows)
  File "pandas/parser.pyx", line 846, in pandas.parser.TextReader.read (pandas/parser.c:9935)
    data = self._reader.read(nrows)
  File "pandas/parser.pyx", line 846, in pandas.parser.TextReader.read (pandas/parser.c:9935)
  File "pandas/parser.pyx", line 868, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:10193)
  File "pandas/parser.pyx", line 868, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:10193)
  File "pandas/parser.pyx", line 868, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:10193)
  File "pandas/parser.pyx", line 922, in pandas.parser.TextReader._read_rows (pandas/parser.c:10921)
  File "pandas/parser.pyx", line 922, in pandas.parser.TextReader._read_rows (pandas/parser.c:10921)
  File "pandas/parser.pyx", line 922, in pandas.parser.TextReader._read_rows (pandas/parser.c:10921)
  File "pandas/parser.pyx", line 909, in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:10792)
  File "pandas/parser.pyx", line 909, in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:10792)
  File "pandas/parser.pyx", line 909, in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:10792)
  File "pandas/parser.pyx", line 2018, in pandas.parser.raise_parser_error (pandas/parser.c:25929)
  File "pandas/parser.pyx", line 2018, in pandas.parser.raise_parser_error (pandas/parser.c:25929)
2018-05-29 18:44:01,710 INFO     kvs            : Closing connection from ('172.16.2.127', 55612)
2018-05-29 18:44:01,710 INFO     kvs            : Closing connection from ('172.16.2.127', 55614)
2018-05-29 18:44:01,710 INFO     kvs            : Closing connection from ('172.16.2.127', 55610)
  File "pandas/parser.pyx", line 2018, in pandas.parser.raise_parser_error (pandas/parser.c:25929)
pandas.io.common.CParserError: Error tokenizing data. C error: Expected 240 fields in line 1529, saw 281

pandas.io.common.CParserError: Error tokenizing data. C error: Expected 240 fields in line 1529, saw 281

pandas.io.common.CParserError: Error tokenizing data. C error: Expected 240 fields in line 1529, saw 281

Creating design and response matrix ... 
Setting up TFA specific response matrix ... 
Computing Transcription Factor Activity ... 
Bootstrap 1 of 2
Calculating MI, Background MI, and CLR Matrix
srun: error: c41-06: tasks 5-7: Exited with exit code 1
srun: Terminating job step 6497421.0
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 6497421.0 ON c41-04 CANCELLED AT 2018-05-29T18:44:01 ***
2018-05-29 18:44:01,881 INFO     kvs            : Closing connection from ('172.16.2.129', 55022)
2018-05-29 18:44:01,890 INFO     kvs            : Closing connection from ('172.16.2.127', 55616)

...
etc...
...

srun: error: c41-04: tasks 0-3: Killed
srun: error: c41-12: tasks 8-11: Killed
Traceback (most recent call last):
2018-05-29 18:44:02,022 INFO     kvs            : Server shutting down
  File "/home/kmt331/inferelator_ng/kvsstcp/kvsstcp.py", line 605, in <module>
    subprocess.check_call(args.execcmd, shell=True, env=t.env())
  File "/share/apps/python/2.7.12/intel/lib/python2.7/subprocess.py", line 541, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'srun -n 12 python bsubtilis_bbsr_workflow_runner.py' returned non-zero exit status 1

real    0m15.581s
user    0m0.054s
sys     0m0.049s
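The parser error above reports a row with 281 fields where 240 were expected. A minimal diagnostic sketch for locating such rows in a tab-separated file is shown below. It assumes the intermediate matrix file is plain TSV; the helper name `find_ragged_rows` is hypothetical, and the thread does not establish why the file was malformed.

```python
import io


def find_ragged_rows(handle, sep="\t"):
    """Return [(line_number, field_count)] for rows whose width differs
    from the first line. Line numbers are 1-based.

    Hypothetical helper for illustration; not part of inferelator_ng.
    """
    expected = None
    ragged = []
    for line_number, line in enumerate(handle, start=1):
        fields = line.rstrip("\n").split(sep)
        if expected is None:
            expected = len(fields)  # width of the first line sets the baseline
        elif len(fields) != expected:
            ragged.append((line_number, len(fields)))
    return ragged


sample = io.StringIO(u"a\tb\tc\n1\t2\t3\n4\t5\t6\t7\t8\n9\t10\t11\n")
print(find_ragged_rows(sample))  # [(3, 5)]
```

If the file was written by R's `write.table` with row names, the header has one field fewer than the data rows, so the baseline should be taken from the first data row instead.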

kostyat · 2018-07-05T21:54:18Z

@nickdeveaux any ideas why I'm getting that error?

kostyat · 2018-08-23T01:10:14Z

Has anybody else tried this? Does it work for anyone else? I am still getting the same error on NYU HPC. This time I was working on the InfereCLaDR branch and manually applied the same changes you made to bbsr_tfa_runner.py, and I still got the same error.

nickdeveaux added 5 commits May 10, 2018 13:10

calculate mi and clr on each worker

d9526d4

don't calculate your own kvs

39babaa

flush output

42f9096

removed unused import and key

dd8795f

missing import added

b4ea275

kostyat mentioned this pull request May 25, 2018

Running Inferelator on NYU HPC with KVS #64

Closed
