Bug report
Description of the problem
I was trying to stitch a large sample of 20 tiles, each tile [1920, 1920, ~2800] pixels. I kept getting a Spark session timeout error at different stages of the stitching pipeline.
For example, below is a case where the error came from the run_retile stage. For the same data, it would sometimes get through this stage and then hit the same session timeout error at a later stage, run_stitching.
The error only occurs for the large sample: I have no problem running a sample that is about 10x smaller.
Log file(s)
Jun-28 10:26:01.924 [Task monitor] ERROR nextflow.processor.TaskProcessor - Error executing process > 'stitching:stitch:run_retile:spark_start_app (1)'
Caused by:
Process `stitching:stitch:run_retile:spark_start_app (1)` terminated with an error exit status (1)
Command executed:
echo "Starting the spark driver"
SESSION_FILE="/u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/spark/r1/.sessionId"
echo "Checking for $SESSION_FILE"
SLEEP_SECS=10
MAX_WAIT_SECS=7200
SECONDS=0
while ! test -e "$SESSION_FILE"; do
sleep ${SLEEP_SECS}
if (( ${SECONDS} < ${MAX_WAIT_SECS} )); then
echo "Waiting for $SESSION_FILE"
SECONDS=$(( ${SECONDS} + ${SLEEP_SECS} ))
else
echo "-------------------------------------------------------------------------------"
echo "ERROR: Timed out after ${SECONDS} seconds while waiting for $SESSION_FILE "
echo "Make sure that your --spark_work_dir is accessible to all nodes in the cluster "
echo "-------------------------------------------------------------------------------"
exit 1
fi
done
if ! grep -F -x -q "dcfcb7c0-01b8-4119-90ec-8b3f63ab2c0e" $SESSION_FILE
then
echo "------------------------------------------------------------------------------"
echo "ERROR: session id in $SESSION_FILE does not match current session "
echo "Make sure that your --spark_work_dir is accessible to all nodes in the cluster"
echo "and that you are not running multiple pipelines with the same --spark_work_dir"
echo "------------------------------------------------------------------------------"
exit 1
fi
export SPARK_ENV_LOADED=
export SPARK_HOME=/spark
export PYSPARK_PYTHONPATH_SET=
export PYTHONPATH="/spark/python"
export SPARK_LOG_DIR="/u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/spark/r1"
. "/spark/sbin/spark-config.sh"
. "/spark/bin/load-spark-env.sh"
SPARK_LOCAL_IP=`hostname -i | rev | cut -d' ' -f1 | rev`
echo "Use Spark IP: $SPARK_LOCAL_IP"
echo " /spark/bin/spark-class org.apache.spark.deploy.SparkSubmit --properties-file /u/home/f/f7xiesnm/project-zipursky/easifish/lt18
5_stitch/spark/r1/spark-defaults.conf --conf spark.driver.host=${SPARK_LOCAL_IP} --conf spark.driver.bindAddress=${SPARK_LOCAL_IP
} --master spark://172.16.129.70:7077 --class org.janelia.stitching.ResaveAsSmallerTilesSpark --conf spark.executor.cores=16 --conf spark.
files.openCostInBytes=0 --conf spark.default.parallelism=16 --executor-memory 96g --conf spark.driver.cores=1 --driver-memory 12g /app/app.jar
-i /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/outputs/r1/stitching/c0-n5.json -i /u/home/f/f7xiesnm/project-zipursky/easifish
/lt185_stitch/outputs/r1/stitching/c2-n5.json -i /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/outputs/r1/stitching/c3-n5.json --s
ize 64 "
/spark/bin/spark-class org.apache.spark.deploy.SparkSubmit --properties-file /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/spark/r1/spark-defaults.conf --conf spark.driver.host=${SPARK_LOCAL_IP} --conf spark.driver.bindAddress=${SPARK_LOCAL_IP} --master spark://172.16.129.70:7077 --class org.janelia.stitching.ResaveAsSmallerTilesSpark --conf spark.executor.cores=16 --conf spark.files.openCostInBytes=0 --conf spark.default.parallelism=16 --executor-memory 96g --conf spark.driver.cores=1 --driver-memory 12g /app/app.jar -i /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/outputs/r1/stitching/c0-n5.json -i /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/outputs/r1/stitching/c2-n5.json -i /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/outputs/r1/stitching/c3-n5.json --size 64 &> /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/spark/r1/retileImages.log
Command exit status:
1
Command output:
Starting the spark driver
Checking for /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/spark/r1/.sessionId
Use Spark IP: 172.16.129.70
/spark/bin/spark-class org.apache.spark.deploy.SparkSubmit --properties-file /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/spark/r1/spark-defaults.conf --conf spark.driver.host=172.16.129.70 --conf spark.driver.bindAddress=172.16.129.70 --master spark://172.16.129.70:7077 --class org.janelia.stitching.ResaveAsSmallerTilesSpark --conf spark.executor.cores=16 --conf spark.files.openCostInBytes=0 --conf spark.default.parallelism=16 --executor-memory 96g --conf spark.driver.cores=1 --driver-memory 12g /app/app.jar -i /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/outputs/r1/stitching/c0-n5.json -i /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/outputs/r1/stitching/c2-n5.json -i /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/outputs/r1/stitching/c3-n5.json --size 64
Command error:
INFO: Could not find any nv files on this host!
INFO: Converting SIF file to temporary sandbox...
Starting the spark driver
Checking for /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/spark/r1/.sessionId
Use Spark IP: 172.16.129.70
/spark/bin/spark-class org.apache.spark.deploy.SparkSubmit --properties-file /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/spark/r1/spark-defaults.conf --conf spark.driver.host=172.16.129.70 --conf spark.driver.bindAddress=172.16.129.70 --master spark://172.16.129.70:7077 --class org.janelia.stitching.ResaveAsSmallerTilesSpark --conf spark.executor.cores=16 --conf spark.files.openCostInBytes=0 --conf spark.default.parallelism=16 --executor-memory 96g --conf spark.driver.cores=1 --driver-memory 12g /app/app.jar -i /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/outputs/r1/stitching/c0-n5.json -i /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/outputs/r1/stitching/c2-n5.json -i /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/outputs/r1/stitching/c3-n5.json --size 64
INFO: Cleaning up image..
Work dir:
/u/home/f/f7xiesnm/try_multifish/multifish/work/b3/b86ba5188c95fab8d05a827c510a56
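One thing worth noting when reading the trace: the session-file check actually passed here (the script printed "Use Spark IP" and went on to run spark-submit), so the exit status 1 comes from the Spark job itself, whose stdout/stderr were redirected to retileImages.log by the trailing &>. A quick sketch of how to pull the real error out of that log, using standard tools and the path taken from the command above:

# Show the end of the Spark driver log, where the failure usually surfaces
tail -n 200 /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/spark/r1/retileImages.log
# Or jump to the first reported errors / stack traces
grep -n -m 5 -E "ERROR|Exception" /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/spark/r1/retileImages.log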
Environment
EASI-FISH Pipeline version: latest
Nextflow version: 22.10.7
Container runtime: Singularity
Platform: Local cluster
Operating system: Linux
Additional context
(Add any other context about the problem here)
Thanks @cgoina! Yes, I am now trying those slowly, as each trial takes ~12 hours to turn around. Which option do you think would be more useful: more memory per worker, or more workers?
@FangmingXie Either could work, but only if the process is actually running out of memory. Your exit code is 1, which usually does not indicate a memory issue. Can you attach the contents of retileImages.log so we can see the actual error? You'll find the path to retileImages.log in your output above.
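If the log does point at memory pressure, the place to adjust it is the pipeline's Spark resource parameters rather than the generated script above. A minimal sketch, assuming the multifish parameter names --workers, --worker_cores, --gb_per_core, and --driver_memory (flag names are assumptions; verify them against nextflow.config for your pipeline version):

# Hypothetical invocation with more workers and more memory per worker core.
# Parameter names are assumptions; check nextflow.config before using.
nextflow run multifish/main.nf \
  --workers 8 \
  --worker_cores 16 \
  --gb_per_core 8 \
  --driver_memory 16g \
  --spark_work_dir /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/spark
# ...plus the rest of your existing data/stitching parameters

As a rule of thumb, per-task out-of-memory failures call for more memory per worker, while extra workers mainly help overall throughput.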