28 May 2021

Clusterising V-REP, An Easy Guide

In the Dynamic Self-repair project we've been using large numbers of V-REP simulations to gather statistical data with which to compare the performance of different modular robot group behaviours against each other. The number of simulations which can be run is, if you get things right, well in excess of what is possible with real hardware, and yet, as a high fidelity, physically detailed simulator, V-REP is liable to run slower than the real robots. As in many swarm and modular robotics experiments, each simulated scenario, long and computationally intensive as it is, leads to effectively one data point for the analysis stage. And as V-REP runs slower than the real hardware, at least in our scenarios with large numbers of fairly physically detailed robots in scenes, parallelising becomes very important. Getting clusterised V-REP to run took quite a while for someone like myself with a background in physics rather than computer science, so I thought I'd share here how it is done in case anyone else finds themselves needing to do V-REP bulk runs. The video below shows an example of our self-assembly simulations, captured while running on the cluster.

This guide provides a script for running V-REP on clusters with a SLURM workload manager and explains the key points of its operation. This method was developed for V-REP 3.5, which we have been using, but should still be applicable to later versions, including CoppeliaSim.

We'll start with the .job file for sbatch submission:


#!/bin/bash
#SBATCH --job-name=insert_name       # If this script is used please acknowledge "Robert H. Peck" for providing it
#SBATCH --mail-type=END,FAIL             # Mailing events
#SBATCH --mail-user=YourEmail@example.com     # Where to send emails

#SBATCH --mem=2gb                        # Job memory request, 2gb is typically enough for V-REP
#SBATCH --time=47:00:00                  # Time limit hrs:min:sec, always set this value somewhat above the maximum wristwatch time a job will require

#SBATCH --nodes=100 #as many nodes as tasks, this will give 100 parallel runs in this example
#SBATCH --ntasks=100
#SBATCH --cpus-per-task=1 #V-REP can only use one CPU at a time

#SBATCH --output=name_%j.log        # Standard output and error log
#SBATCH --account=if_applicable_on_system        # Project account
#SBATCH --ntasks-per-core=1 #only 1 task per core, must not be more
#SBATCH --ntasks-per-node=1 #only 1 task per node, must not be more
#SBATCH --ntasks-per-socket=1 #typically won't want multiple instances trying to use same socket

#SBATCH --no-kill #prevents restart if one of the 100 gets a NODE_FAIL

echo My working directory is `pwd`
echo Running job on host:
echo -e '\t'`hostname` at `date`
echo
 
module load toolchain/foss/2018b #depending on the cluster setup other modules may need to be loaded to support V-REP's dependencies
cd scratch #filepaths will vary
cd V-REP_PRO_EDU_V3_5_0_Linux #V-REP's own folder

chmod +x generic_per_node_script.sh #ensure that the bash script is executable


VariableName1=4 #numerical variable to provide to V-REP
VariableName2="text" #string variable to provide to V-REP
VariableName3="[[1,1,4,4],[1,3,4,7]]" #array, or table in Lua, to provide to V-REP
VariableName4="S5_filename" #filename, as another string, for V-REP to open


srun --no-kill -K0 -N "${SLURM_JOB_NUM_NODES}" -n "${SLURM_NTASKS}" ./generic_per_node_script.sh "${VariableName1}" "${VariableName2}" "${VariableName3}" "${VariableName4}" #the variables are quoted so that strings and Lua-style tables pass through intact, but where is V-REP I hear you ask?


wait
 
echo
echo Job completed at `date`

The --dependency=afterany: flag should be used if submitting a series of jobs of this kind to a cluster; it ensures that subsequent jobs do not share nodes with, and conflict with, earlier batches of simulations.
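For example, assuming the first batch was submitted as first_batch.job and SLURM reported a job ID of 12345 (both names are just placeholders here), a second batch could be chained behind it like this:

sbatch first_batch.job #SLURM replies with something like "Submitted batch job 12345"
sbatch --dependency=afterany:12345 second_batch.job #will not start until job 12345 has finished, whether it succeeded or failed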

So where is V-REP running? Well, it isn't quite yet, because whilst there are ways to launch V-REP directly from the job script, much more can be achieved if the job script instead launches a bash script on each node which then handles V-REP's functionality.

This is especially useful for video recording. If you want, on occasion, to get a visualisation of a simulation, this method also lets you take a screen-captured video of it. V-REP natively has a video recording function, but it can only be started by pressing a button within the simulator's GUI and cannot be triggered from the command line; this method allows you to record video from clusterised simulations despite that.

#!/bin/bash

#If this script is used please acknowledge "Robert H. Peck" for providing it
ScreenShottingPeriod=10 #defines a period in seconds of wristwatch time at which frames are taken
frame_rate=10 #defines the rate at which the output video will play these frames
TimeNow=$(date +'%d-%m-%Y_%H:%M:%S')
mkdir "vrep_frames_at_${TimeNow}_on_node_${SLURM_NODEID}"
#we have just made a folder within V-REP's folder, and on the scratch part of the cluster filesystem

Arg1=$1
Arg2=$2
Arg3=$3
Arg4=$4
Arg5=$5

if [[ -n "$Arg1" ]]; then #processing of variables from the job script, setting to default values if none are supplied
Var1=$Arg1
else
Var1=5
fi

if [[ -n "$Arg2" ]]; then
Var2=$Arg2
else
Var2=10
fi

if [[ -n "$Arg3" ]]; then
Var3=$Arg3
else
Var3="[[1,4,3,5],[1,3,1,8]]"
fi

if [[ -n "$Arg4" ]]; then
Var4=$Arg4
else
Var4="S1_filename"
fi

#V-REP is launched within an xvfb virtual display to provide it with a graphical frontend, without which it cannot operate.
#Variables 1, 2 and 3 are supplied to the simulation via -g options and the fourth variable is used to select which simulation file to open.
#The xvfb server number (92 here) should be set differently for different users of the cluster to avoid conflict between multiple xvfb users.
xvfb-run -s "-screen 0 1024x768x24" --server-num=92 ./vrep.sh "-gvrep_frames_at_${TimeNow}_on_node_${SLURM_NODEID}" "-g${Var1}" "-g${Var2}" "-g${Var3}" "${Var4}.ttt" -s1200000 -q > /dev/null &
#now that we have launched V-REP we shift to the node's local filesystem

cd /tmp #this saves temporary files to the node's own storage rather than stressing interconnects with multiple small writes to the scratch directory
#and make a copy of the images directory here, with the same name as the scratch copy
mkdir "vrep_frames_at_${TimeNow}_on_node_${SLURM_NODEID}"
cd "/tmp/vrep_frames_at_${TimeNow}_on_node_${SLURM_NODEID}"
#echo "local dir made on ${SLURM_NODEID}"
sleep 20 #this sleep is crucial; it ensures V-REP is actually running and has a PID by the time the next command runs
VREPpidof=$(pgrep -f "YourUsername.*Linux.*vrep") #replace YourUsername with your own username on the cluster
#echo "vrep on ${SLURM_NODEID} 's pid is ${VREPpidof}"
PSPanswering=$(ps -p ${VREPpidof})
#echo "ps -p says ${PSPanswering}"
CountingVar=1 #counter to handle file names
while ps -p ${VREPpidof} > /dev/null; do #monitor V-REP's PID; keep doing this loop until V-REP ends
    TimeNow2=$(date +'%d-%m-%Y_%H:%M:%S')
    PicFileNameNow=$(printf "image-%0.5i.png\n" $CountingVar) #$(echo "image-${CountingVar}.png")
    #the PicFile goes in the /tmp local filesystem
    xwd -display :92 -root -silent | convert xwd:- ${PicFileNameNow} #takes the screenshot on display :92; this 92 should be changed for different individuals using a cluster, as should the number used when launching xvfb
   
    sleep ${ScreenShottingPeriod} #sleep until we want the next image capture
    CountingVar=$((CountingVar+1)) #iterate the file numbering counter
done

wait #don't finish things off until V-REP is finished

#echo "vrep finished on ${SLURM_NODEID}, processing vid"

outputVidName=$(echo "${Var1}_${Var2}_${Var3}_V-REP_recording_${TimeNow2}_on_${SLURM_NODEID}")

module load vis/FFmpeg/4.1-foss-2018b #adds the ffmpeg module; exact module names may vary on different systems
ffmpeg -r ${frame_rate} -start_number 1 -i image-%05d.png -c:v libx264 -vf fps=${frame_rate} -pix_fmt yuv420p "${outputVidName}.mp4" 2> /dev/null #creates the video file from the image-00001.png style frames saved above; ensure this happens on the local filesystem, with ffmpeg's command line output redirected to trash

sleep 50 #ensure that the vid has saved properly

cp "${outputVidName}.mp4" ~/scratch/V-REP_PRO_EDU_V3_5_0_Linux/"vrep_frames_at_${TimeNow}_on_node_${SLURM_NODEID}/${outputVidName}.mp4"
#copy the video in to the scratch directory copy of "vrep_frames_at_${TimeNow}_on_node_${SLURM_NODEID}"

#now remove all the old png files in this directory so as to save some space
rm *.png #ensure this happens on the local file system


#delete the video and folder from the local filesystem
cd /tmp/vrep_frames_at_${TimeNow}_on_node_${SLURM_NODEID}
rm *.mp4
cd /tmp
rm -r vrep_frames_at_${TimeNow}_on_node_${SLURM_NODEID}
#echo "done on ${SLURM_NODEID}"

This script launches V-REP and takes a series of screenshots as it runs; these are saved in temporary folders on each node where a job is running. Once V-REP has completed, either because the simulation has run to the maximum 1200 seconds of simulated time specified or because an earlier ending condition occurs within the V-REP simulation, the images are combined into a video and the video is transferred to the central scratch filesystem, where it is placed in the same folder as the output files from the same node's V-REP simulation.
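If you want to sanity-check generic_per_node_script.sh on an ordinary Linux machine before committing to a large batch, something along the following lines can be used. This is only a rough sketch: it assumes xvfb, ImageMagick and ffmpeg are installed locally, that the module load line and the YourUsername placeholder in the pgrep pattern have been adjusted to suit the machine, and it fakes the SLURM_NODEID variable the script normally receives from SLURM.

cd ~/scratch/V-REP_PRO_EDU_V3_5_0_Linux #or wherever your local copy of V-REP lives
chmod +x generic_per_node_script.sh
SLURM_NODEID=0 ./generic_per_node_script.sh 4 "text" "[[1,1,4,4],[1,3,4,7]]" "S5_filename"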

One should also note a further useful tip, as per [Joaquin Silveira's advice here]: when multiple users need to run V-REP simulations on a cluster, each should edit portIndex1_port in remoteApiConnections.txt, in their V-REP folder on scratch, to an alternative value negotiated so as not to clash with the value used by anyone else. Assigning a unique port number in this way prevents V-REP instances from different users, which may share the same node, from conflicting.
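As a sketch, assuming remoteApiConnections.txt keeps its default layout and that 19998 is the value agreed with the other users, the change could be made with a one-off edit such as:

cd ~/scratch/V-REP_PRO_EDU_V3_5_0_Linux
sed -i 's/^portIndex1_port.*/portIndex1_port             = 19998/' remoteApiConnections.txt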
