Saturday, March 05, 2011

No More Hanging Jobs in Cron

Have you ever had 'prtdiag' or some other command hang for whatever reason? If your script runs such a command and is launched from cron, the hung jobs will simply pile up until cron hits its limit. By default, Solaris configures cron to run at most 100 concurrent jobs, so the 101st job will just fail.

I developed a launcher program (watchdog) that limits the runtime of a script (worker) that may exhibit this behaviour. It works well with my worker script and should work for other programs too. So, no more hanging jobs in cron!

#! /bin/ksh
#
# A watchdog program to limit the elapsed time of the worker shell script
# to avoid hanging processes that can pile up if worker runs under cron
#


export PATH=/usr/bin:/usr/sbin:/bin


#
# default time limit is 60 seconds
#
timelimit=${1:-60}

worker="${0%/*}/check-worker.ksh"
worker_name=${worker##*/}
worker_name=${worker_name%.*}
if [ ! -f "$worker" ]; then
    echo "Error. \"$worker\" cannot be found"
    exit 1
fi
if [ ! -x "$worker" ]; then
    echo "Error. \"$worker\" is not executable"
    exit 2
fi


watchdog()
{
    sleep 1; # wait for the worker to start
    while [ $timelimit -gt 0 ]
    do
        # pgrep is available since 5.8, else use ps -ef | grep -v grep | grep $worker_name
        jobid=`pgrep $worker_name`
        if [ $? -eq 1 ]; then
            break
        else
            sleep 1
        fi
        ((timelimit-=1))
    done
    if [ $timelimit -eq 0 ]; then
        # kill worker + child processes
        ptree $jobid | awk '$1=='$jobid'{start=1}start==1{print $1}' | while read pid
            do
                kill -TERM "$pid" > /dev/null 2>&1
            done
    fi
}


#
# start the watchdog before the worker
#
watchdog &


tmpfile="/tmp/.${worker_name}.$$"
$worker > $tmpfile 2>&1 &
worker_id=$!
wait $worker_id > /dev/null 2>&1
rc=$?


if [ $rc -ne 0 ]; then
    # replace this line to do whatever you want, send email, sms, logger....
    #
    # echo .... | mailx someone@somewhere.com

    details=`cat $tmpfile 2>/dev/null`
    echo "Exit status=$rc. There is a problem with the server '`hostname`' - $details"
fi


rm -f $tmpfile
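
The same idea can be sketched in a more compact form. This is only an illustration under a few assumptions, not a replacement for the script above: `run_with_limit` is a hypothetical helper name, `sleep 10` stands in for a hung worker, and the watchdog signals only the command itself rather than walking its child processes with ptree.

```shell
#!/bin/sh
# run_with_limit is a hypothetical helper, not part of the script above.
# Usage: run_with_limit SECONDS COMMAND [ARGS...]
run_with_limit() {
    limit=$1
    shift

    # Start the command in the background and remember its PID.
    "$@" &
    cmd_pid=$!

    # Watchdog: after $limit seconds, send TERM to the command.
    # Its output is discarded so a leftover sleep cannot hold the caller's stdout open.
    ( sleep "$limit"; kill -TERM "$cmd_pid" ) > /dev/null 2>&1 &
    dog_pid=$!

    # Block until the command exits on its own or is killed by the watchdog.
    wait "$cmd_pid"
    rc=$?

    # Stop the watchdog if it is still sleeping.
    kill -TERM "$dog_pid" > /dev/null 2>&1
    wait "$dog_pid" > /dev/null 2>&1

    return "$rc"
}

# Example: "sleep 10" stands in for a hung worker and is cut off after 2 seconds.
run_with_limit 2 sleep 10
echo "exit status: $?"
```

Because the watchdog targets the PID returned by `$!` instead of searching by name with pgrep, it cannot pick up an unrelated process with a similar name. The trade-off is that any children the command has spawned are left alone, so the ptree approach above is still the better fit for workers that fork.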


2 Comments:

Blogger Wei-Yin Chen said...

If GNU utilities are available, coreutils "timeout" is suitable for this job.

11:22 AM  
Blogger chihungchan said...

Thanks, I didn't know that. I'm not sure whether this utility is available on Solaris.

12:04 PM  
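
For readers who do have GNU coreutils installed, the suggestion in the first comment might look roughly like this. It is a minimal sketch: it assumes `timeout` is on the PATH, and `sleep 10` is only a placeholder for the real worker script.

```shell
#!/bin/sh
# Assumes GNU coreutils "timeout" is installed; "sleep 10" is a stand-in worker.
# Give the worker 1 second, then send it TERM.
timeout 1 sleep 10
rc=$?

# GNU timeout exits with status 124 when the time limit was reached.
if [ "$rc" -eq 124 ]; then
    echo "worker timed out"
elif [ "$rc" -ne 0 ]; then
    echo "worker failed with exit status $rc"
fi
```

In a cron job the limit would be something closer to the 60-second default used in the post, and `timeout -k 5 60 ...` adds a follow-up KILL if the worker ignores TERM.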
