
Error stitching large sample #43

Open
FangmingXie opened this issue Jun 28, 2023 · 3 comments

Labels
bug Something isn't working

Comments

@FangmingXie
Contributor

Bug report

Description of the problem

I was trying to stitch a large sample of 20 tiles, each with [1920, 1920, ~2800] pixels. I kept getting a Spark session timeout error at different stages of the stitching pipeline.

For example, below is a case where the error came from the run_retile stage. With the same data, it would sometimes run through this stage but hit the same session timeout error at a later stage, run_stitching.

This error occurs only with large samples: I have no problem running a sample that is about 10x smaller.
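For scale, here is a rough back-of-the-envelope estimate of the data size, assuming 16-bit pixels (the actual dtype may differ):

```shell
# Back-of-the-envelope memory estimate for the sample described above.
# Assumption (not from the pipeline): pixels are stored as 16-bit values.
BYTES_PER_PIXEL=2
X=1920; Y=1920; Z=2800   # approximate tile dimensions in pixels
N_TILES=20

PIXELS_PER_TILE=$(( X * Y * Z ))
BYTES_PER_TILE=$(( PIXELS_PER_TILE * BYTES_PER_PIXEL ))
GB_PER_TILE=$(( BYTES_PER_TILE / 1024**3 ))
TOTAL_GB=$(( GB_PER_TILE * N_TILES ))
echo "~${GB_PER_TILE} GB per tile, ~${TOTAL_GB} GB total"
```

Under that assumption a single tile is already a sizable fraction of the 96g executor memory used in the submit command below, so holding several tiles (or intermediate copies) in one executor could plausibly exhaust it.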

Log file(s)

Jun-28 10:26:01.924 [Task monitor] ERROR nextflow.processor.TaskProcessor - Error executing process > 'stitching:stitch:run_retile:spark_start_app (1)'

Caused by:
  Process `stitching:stitch:run_retile:spark_start_app (1)` terminated with an error exit status (1)

Command executed:

  echo "Starting the spark driver"

  SESSION_FILE="/u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/spark/r1/.sessionId"
  echo "Checking for $SESSION_FILE"
  SLEEP_SECS=10
  MAX_WAIT_SECS=7200
  SECONDS=0

  while ! test -e "$SESSION_FILE"; do
      sleep ${SLEEP_SECS}
      if (( ${SECONDS} < ${MAX_WAIT_SECS} )); then
          echo "Waiting for $SESSION_FILE"
          SECONDS=$(( ${SECONDS} + ${SLEEP_SECS} ))
      else
          echo "-------------------------------------------------------------------------------"
          echo "ERROR: Timed out after ${SECONDS} seconds while waiting for $SESSION_FILE    "
          echo "Make sure that your --spark_work_dir is accessible to all nodes in the cluster "
          echo "-------------------------------------------------------------------------------"
          exit 1
      fi
  done
  
   if ! grep -F -x -q "dcfcb7c0-01b8-4119-90ec-8b3f63ab2c0e" $SESSION_FILE
  then
      echo "------------------------------------------------------------------------------"
      echo "ERROR: session id in $SESSION_FILE does not match current session            "
      echo "Make sure that your --spark_work_dir is accessible to all nodes in the cluster"
      echo "and that you are not running multiple pipelines with the same --spark_work_dir"
      echo "------------------------------------------------------------------------------"
      exit 1
  fi



  export SPARK_ENV_LOADED=
  export SPARK_HOME=/spark
  export PYSPARK_PYTHONPATH_SET=
  export PYTHONPATH="/spark/python"
  export SPARK_LOG_DIR="/u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/spark/r1"

  . "/spark/sbin/spark-config.sh"
  . "/spark/bin/load-spark-env.sh"



  SPARK_LOCAL_IP=`hostname -i | rev | cut -d' ' -f1 | rev`
  echo "Use Spark IP: $SPARK_LOCAL_IP"

  echo "    /spark/bin/spark-class org.apache.spark.deploy.SparkSubmit     --properties-file /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/spark/r1/spark-defaults.conf          --conf spark.driver.host=${SPARK_LOCAL_IP}     --conf spark.driver.bindAddress=${SPARK_LOCAL_IP}     --master spark://172.16.129.70:7077 --class org.janelia.stitching.ResaveAsSmallerTilesSpark --conf spark.executor.cores=16 --conf spark.files.openCostInBytes=0 --conf spark.default.parallelism=16 --executor-memory 96g --conf spark.driver.cores=1 --driver-memory 12g /app/app.jar  -i /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/outputs/r1/stitching/c0-n5.json -i /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/outputs/r1/stitching/c2-n5.json -i /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/outputs/r1/stitching/c3-n5.json --size 64     "

  /spark/bin/spark-class org.apache.spark.deploy.SparkSubmit     --properties-file /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/spark/r1/spark-defaults.conf          --conf spark.driver.host=${SPARK_LOCAL_IP}     --conf spark.driver.bindAddress=${SPARK_LOCAL_IP}     --master spark://172.16.129.70:7077 --class org.janelia.stitching.ResaveAsSmallerTilesSpark --conf spark.executor.cores=16 --conf spark.files.openCostInBytes=0 --conf spark.default.parallelism=16 --executor-memory 96g --conf spark.driver.cores=1 --driver-memory 12g /app/app.jar  -i /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/outputs/r1/stitching/c0-n5.json -i /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/outputs/r1/stitching/c2-n5.json -i /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/outputs/r1/stitching/c3-n5.json --size 64 &> /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/spark/r1/retileImages.log

Command exit status:
  1

Command output:
  Starting the spark driver
  Checking for /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/spark/r1/.sessionId
  Use Spark IP: 172.16.129.70
      /spark/bin/spark-class org.apache.spark.deploy.SparkSubmit     --properties-file /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/spark/r1/spark-defaults.conf          --conf spark.driver.host=172.16.129.70     --conf spark.driver.bindAddress=172.16.129.70     --master spark://172.16.129.70:7077 --class org.janelia.stitching.ResaveAsSmallerTilesSpark --conf spark.executor.cores=16 --conf spark.files.openCostInBytes=0 --conf spark.default.parallelism=16 --executor-memory 96g --conf spark.driver.cores=1 --driver-memory 12g /app/app.jar  -i /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/outputs/r1/stitching/c0-n5.json -i /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/outputs/r1/stitching/c2-n5.json -i /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/outputs/r1/stitching/c3-n5.json --size 64

Command error:
  INFO:    Could not find any nv files on this host!
  INFO:    Converting SIF file to temporary sandbox...
  Starting the spark driver
  Checking for /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/spark/r1/.sessionId
  Use Spark IP: 172.16.129.70
      /spark/bin/spark-class org.apache.spark.deploy.SparkSubmit     --properties-file /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/spark/r1/spark-defaults.conf          --conf spark.driver.host=172.16.129.70     --conf spark.driver.bindAddress=172.16.129.70     --master spark://172.16.129.70:7077 --class org.janelia.stitching.ResaveAsSmallerTilesSpark --conf spark.executor.cores=16 --conf spark.files.openCostInBytes=0 --conf spark.default.parallelism=16 --executor-memory 96g --conf spark.driver.cores=1 --driver-memory 12g /app/app.jar  -i /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/outputs/r1/stitching/c0-n5.json -i /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/outputs/r1/stitching/c2-n5.json -i /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/outputs/r1/stitching/c3-n5.json --size 64
  INFO:    Cleaning up image..
  
Work dir:
  /u/home/f/f7xiesnm/try_multifish/multifish/work/b3/b86ba5188c95fab8d05a827c510a56
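Incidentally, the wait loop in the executed command appears to double-count elapsed time: SECONDS is a self-incrementing bash builtin, so the manual SECONDS arithmetic adds on top of the automatic counting. A minimal standalone sketch of the same timeout pattern, relying on the builtin alone (file name and timings here are illustrative, not the pipeline's defaults):

```shell
#!/bin/bash
# Standalone sketch of the session-file wait loop from the executed command.
# File name and timings are illustrative, not the pipeline's defaults.
SESSION_FILE="$(mktemp -d)/session.id"
SLEEP_SECS=1
MAX_WAIT_SECS=5

# Simulate the Spark driver creating the session file after ~1 second.
( sleep 1; echo "some-session-id" > "$SESSION_FILE" ) &

SECONDS=0   # bash builtin: resets here, then auto-increments once per second
while ! test -e "$SESSION_FILE"; do
    if (( SECONDS >= MAX_WAIT_SECS )); then
        echo "ERROR: timed out after ${SECONDS}s waiting for $SESSION_FILE"
        exit 1
    fi
    echo "Waiting for $SESSION_FILE"
    sleep "$SLEEP_SECS"
done
echo "Found $SESSION_FILE"
```

The double-counting only makes the timeout fire early, so it is not the cause of the failure here, but it does mean the reported wait time is shorter than the configured MAX_WAIT_SECS.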

Environment

  • EASI-FISH Pipeline version: latest
  • Nextflow version: 22.10.7
  • Container runtime: Singularity
  • Platform: Local cluster
  • Operating system: Linux


@FangmingXie FangmingXie added the bug Something isn't working label Jun 28, 2023
@cgoina
Collaborator

cgoina commented Jun 29, 2023

Have you tried using more workers, or giving more memory to each Spark worker?
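For reference, these are the standard Spark properties involved, shown as a spark-defaults.conf fragment. The values are placeholders, and the EASI-FISH pipeline may expose its own command-line options for setting them rather than editing this file directly:

```
# spark-defaults.conf fragment (placeholder values; standard Spark properties)
spark.executor.memory      128g   # memory per executor (was 96g in the log above)
spark.executor.cores       16     # cores per executor
spark.network.timeout      600s   # raise if tasks stall on slow shared storage
```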

@FangmingXie
Contributor Author

Thanks @cgoina! Yes, I am trying those now, though slowly, since each trial takes ~12 hours to turn around. Which option do you think would be more useful: more memory per worker, or more workers?

@krokicki
Member

@FangmingXie Either could work, but only if the process is running out of memory. Your exit code is 1, which usually does not indicate a memory issue. Can you attach the contents of retileImages.log so we can see the actual error? The path to retileImages.log appears in your output above.
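A quick way to surface the relevant lines from a Spark driver log like retileImages.log. The log content in this sketch is fabricated for illustration; point the grep at the real path from the output above:

```shell
# Demo: surface likely failure lines from a Spark driver log.
# The log content below is fabricated for illustration only.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
23/06/28 10:30:01 INFO SparkContext: Running Spark version x.y.z
23/06/28 10:31:12 ERROR TaskSchedulerImpl: Lost executor 0: worker lost
23/06/28 10:31:12 INFO DAGScheduler: Resubmitting failed stages
EOF

# -i: case-insensitive, -n: line numbers, -E: extended regex
grep -inE 'error|exception|killed|outofmemory' "$LOG" | tail -n 20
```

The same grep against the real retileImages.log should show the actual failure, which is what is needed here since exit code 1 alone does not say why spark-submit failed.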
