
Error stitching large sample #43

Open
FangmingXie opened this issue Jun 28, 2023 · 3 comments

Labels
bug Something isn't working

Comments

@FangmingXie
Contributor

Bug report

Description of the problem

I was trying to stitch a large sample of 20 tiles, each with [1920, 1920, ~2800] pixels. I kept getting a Spark session timeout error at different stages of the stitching pipeline.

For example, below is a case where the error came from the run_retile stage. With the same data, it would sometimes run through this stage but hit the same session timeout error at a later stage, run_stitching.

This error occurs only with large samples: I have no problem running a sample that is about 10x smaller.
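For scale, here is a rough back-of-the-envelope estimate of the data size, assuming 16-bit pixels (the actual dtype may differ):

```shell
# Back-of-the-envelope memory estimate for the sample described above.
# Assumption (not from the pipeline): pixels are stored as 16-bit values.
BYTES_PER_PIXEL=2
X=1920; Y=1920; Z=2800   # approximate tile dimensions in pixels
N_TILES=20

PIXELS_PER_TILE=$(( X * Y * Z ))
BYTES_PER_TILE=$(( PIXELS_PER_TILE * BYTES_PER_PIXEL ))
GB_PER_TILE=$(( BYTES_PER_TILE / 1024**3 ))
TOTAL_GB=$(( GB_PER_TILE * N_TILES ))
echo "~${GB_PER_TILE} GB per tile, ~${TOTAL_GB} GB total"
```

Under that assumption a single tile is already a sizable fraction of the 96g executor memory used in the submit command below, so holding several tiles (or intermediate copies) in one executor could plausibly exhaust it.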

Log file(s)

Jun-28 10:26:01.924 [Task monitor] ERROR nextflow.processor.TaskProcessor - Error executing process > 'stitching:stitch:run_retile:spark_start_app (1)'

Caused by:
  Process `stitching:stitch:run_retile:spark_start_app (1)` terminated with an error exit status (1)

Command executed:

  echo "Starting the spark driver"

  SESSION_FILE="/u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/spark/r1/.sessionId"
  echo "Checking for $SESSION_FILE"
  SLEEP_SECS=10
  MAX_WAIT_SECS=7200
  SECONDS=0

  while ! test -e "$SESSION_FILE"; do
      sleep ${SLEEP_SECS}
      if (( ${SECONDS} < ${MAX_WAIT_SECS} )); then
          echo "Waiting for $SESSION_FILE"
          SECONDS=$(( ${SECONDS} + ${SLEEP_SECS} ))
      else
          echo "-------------------------------------------------------------------------------"
          echo "ERROR: Timed out after ${SECONDS} seconds while waiting for $SESSION_FILE    "
          echo "Make sure that your --spark_work_dir is accessible to all nodes in the cluster "
          echo "-------------------------------------------------------------------------------"
          exit 1
      fi
  done
  
   if ! grep -F -x -q "dcfcb7c0-01b8-4119-90ec-8b3f63ab2c0e" $SESSION_FILE
  then
      echo "------------------------------------------------------------------------------"
      echo "ERROR: session id in $SESSION_FILE does not match current session            "
      echo "Make sure that your --spark_work_dir is accessible to all nodes in the cluster"
      echo "and that you are not running multiple pipelines with the same --spark_work_dir"
      echo "------------------------------------------------------------------------------"
      exit 1
  fi



  export SPARK_ENV_LOADED=
  export SPARK_HOME=/spark
  export PYSPARK_PYTHONPATH_SET=
  export PYTHONPATH="/spark/python"
  export SPARK_LOG_DIR="/u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/spark/r1"

  . "/spark/sbin/spark-config.sh"
  . "/spark/bin/load-spark-env.sh"



  SPARK_LOCAL_IP=`hostname -i | rev | cut -d' ' -f1 | rev`
  echo "Use Spark IP: $SPARK_LOCAL_IP"

  echo "    /spark/bin/spark-class org.apache.spark.deploy.SparkSubmit     --properties-file /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/spark/r1/spark-defaults.conf          --conf spark.driver.host=${SPARK_LOCAL_IP}     --conf spark.driver.bindAddress=${SPARK_LOCAL_IP}     --master spark://172.16.129.70:7077 --class org.janelia.stitching.ResaveAsSmallerTilesSpark --conf spark.executor.cores=16 --conf spark.files.openCostInBytes=0 --conf spark.default.parallelism=16 --executor-memory 96g --conf spark.driver.cores=1 --driver-memory 12g /app/app.jar  -i /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/outputs/r1/stitching/c0-n5.json -i /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/outputs/r1/stitching/c2-n5.json -i /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/outputs/r1/stitching/c3-n5.json --size 64     "

  /spark/bin/spark-class org.apache.spark.deploy.SparkSubmit     --properties-file /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/spark/r1/spark-defaults.conf          --conf spark.driver.host=${SPARK_LOCAL_IP}     --conf spark.driver.bindAddress=${SPARK_LOCAL_IP}     --master spark://172.16.129.70:7077 --class org.janelia.stitching.ResaveAsSmallerTilesSpark --conf spark.executor.cores=16 --conf spark.files.openCostInBytes=0 --conf spark.default.parallelism=16 --executor-memory 96g --conf spark.driver.cores=1 --driver-memory 12g /app/app.jar  -i /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/outputs/r1/stitching/c0-n5.json -i /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/outputs/r1/stitching/c2-n5.json -i /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/outputs/r1/stitching/c3-n5.json --size 64 &> /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/spark/r1/retileImages.log

Command exit status:
  1

Command output:
  Starting the spark driver
  Checking for /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/spark/r1/.sessionId
  Use Spark IP: 172.16.129.70
      /spark/bin/spark-class org.apache.spark.deploy.SparkSubmit     --properties-file /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/spark/r1/spark-defaults.conf          --conf spark.driver.host=172.16.129.70     --conf spark.driver.bindAddress=172.16.129.70     --master spark://172.16.129.70:7077 --class org.janelia.stitching.ResaveAsSmallerTilesSpark --conf spark.executor.cores=16 --conf spark.files.openCostInBytes=0 --conf spark.default.parallelism=16 --executor-memory 96g --conf spark.driver.cores=1 --driver-memory 12g /app/app.jar  -i /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/outputs/r1/stitching/c0-n5.json -i /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/outputs/r1/stitching/c2-n5.json -i /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/outputs/r1/stitching/c3-n5.json --size 64

Command error:
  INFO:    Could not find any nv files on this host!
  INFO:    Converting SIF file to temporary sandbox...
  Starting the spark driver
  Checking for /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/spark/r1/.sessionId
  Use Spark IP: 172.16.129.70
      /spark/bin/spark-class org.apache.spark.deploy.SparkSubmit     --properties-file /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/spark/r1/spark-defaults.conf          --conf spark.driver.host=172.16.129.70     --conf spark.driver.bindAddress=172.16.129.70     --master spark://172.16.129.70:7077 --class org.janelia.stitching.ResaveAsSmallerTilesSpark --conf spark.executor.cores=16 --conf spark.files.openCostInBytes=0 --conf spark.default.parallelism=16 --executor-memory 96g --conf spark.driver.cores=1 --driver-memory 12g /app/app.jar  -i /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/outputs/r1/stitching/c0-n5.json -i /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/outputs/r1/stitching/c2-n5.json -i /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/outputs/r1/stitching/c3-n5.json --size 64
  INFO:    Cleaning up image..
  
Work dir:
  /u/home/f/f7xiesnm/try_multifish/multifish/work/b3/b86ba5188c95fab8d05a827c510a56
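Incidentally, the wait loop in the executed command appears to double-count elapsed time: SECONDS is a self-incrementing bash builtin, so the manual SECONDS arithmetic adds on top of the automatic counting. A minimal standalone sketch of the same timeout pattern, relying on the builtin alone (file name and timings here are illustrative, not the pipeline's defaults):

```shell
#!/bin/bash
# Standalone sketch of the session-file wait loop from the executed command.
# File name and timings are illustrative, not the pipeline's defaults.
SESSION_FILE="$(mktemp -d)/session.id"
SLEEP_SECS=1
MAX_WAIT_SECS=5

# Simulate the Spark driver creating the session file after ~1 second.
( sleep 1; echo "some-session-id" > "$SESSION_FILE" ) &

SECONDS=0   # bash builtin: resets here, then auto-increments once per second
while ! test -e "$SESSION_FILE"; do
    if (( SECONDS >= MAX_WAIT_SECS )); then
        echo "ERROR: timed out after ${SECONDS}s waiting for $SESSION_FILE"
        exit 1
    fi
    echo "Waiting for $SESSION_FILE"
    sleep "$SLEEP_SECS"
done
echo "Found $SESSION_FILE"
```

The double-counting only makes the timeout fire early, so it is not the cause of the failure here, but it does mean the reported wait time is shorter than the configured MAX_WAIT_SECS.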

Environment

  • EASI-FISH Pipeline version: latest
  • Nextflow version: 22.10.7
  • Container runtime: Singularity
  • Platform: Local cluster
  • Operating system: Linux


@FangmingXie FangmingXie added the bug Something isn't working label Jun 28, 2023
@cgoina
Collaborator

cgoina commented Jun 29, 2023

Have you tried using more workers, or giving more memory to each Spark worker?
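For reference, these are the standard Spark properties involved, shown as a spark-defaults.conf fragment. The values are placeholders, and the EASI-FISH pipeline may expose its own command-line options for setting them rather than editing this file directly:

```
# spark-defaults.conf fragment (placeholder values; standard Spark properties)
spark.executor.memory      128g   # memory per executor (was 96g in the log above)
spark.executor.cores       16     # cores per executor
spark.network.timeout      600s   # raise if tasks stall on slow shared storage
```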

@FangmingXie
Contributor Author

Thanks @cgoina! Yes, I am trying those now, though slowly, since each trial takes ~12 hours to turn around. Which option do you think would be more useful: more memory per worker, or more workers?

@krokicki
Member

@FangmingXie Either could work, but only if the process is running out of memory. Your exit code is 1, which usually does not indicate a memory issue. Can you attach the contents of retileImages.log so we can see the actual error? The path to retileImages.log appears in your output above.
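A quick way to surface the relevant lines from a Spark driver log like retileImages.log. The log content in this sketch is fabricated for illustration; point the grep at the real path from the output above:

```shell
# Demo: surface likely failure lines from a Spark driver log.
# The log content below is fabricated for illustration only.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
23/06/28 10:30:01 INFO SparkContext: Running Spark version x.y.z
23/06/28 10:31:12 ERROR TaskSchedulerImpl: Lost executor 0: worker lost
23/06/28 10:31:12 INFO DAGScheduler: Resubmitting failed stages
EOF

# -i: case-insensitive, -n: line numbers, -E: extended regex
grep -inE 'error|exception|killed|outofmemory' "$LOG" | tail -n 20
```

The same grep against the real retileImages.log should show the actual failure, which is what is needed here since exit code 1 alone does not say why spark-submit failed.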
