Monitor Log & Relaunch Any Program in Linux

As a server devops, one of the most annoying issues can be program crashes. At the least, a crash diminishes server performance, and at the worst, it cripples the entire server. When the server becomes a boat anchor, clients experience more than frustration.

Clearly there are many reasons a server program can crash, and none of them may be the fault of the devops. Using a typical web server as an example, the programs commonly running on a Linux server are Nginx, MariaDB, and PHP. This setup, commonly called a LEMP stack (LAMP if Apache replaces Nginx), is set up by default to serve website content to only a handful of site visitors.

As a server devops, it’s vital to understand the default LEMP stack configurations right out of the box. Since Apache, Nginx, PHP and MariaDB have no idea the level of performance any website will be experiencing, only the bare minimums are enabled to ensure the website delivers content.

Understand that changing the size of a server will, on its own, do very little to improve the total number of concurrent connections a server can endure, because the default configuration limits stay the same. And long before a client decides to upgrade their hosting (server size), the site will crash repeatedly until they find you, the devops who can solve their problems.

Once ‘sudo’ access is established, you really have no idea what you’re getting yourself into until you open a terminal on the server to see what’s going on. One of the first things you should do is set up the monitoring and relaunch Bash script below.

#!/usr/bin/bash
# Bash script to monitor if vital programs are running
# If programs are not running, will attempt a restart
# set -x

declare -a PROGRAMS=('nginx' 'php-fpm' 'memcached')

# Random 0-17 second delay before the checks begin
timer=$(( RANDOM % 18 ))
sleep "$timer"
DATE=$(date "+%m-%d-%Y | %T")
for each in "${PROGRAMS[@]}"; do
  # Count how many times the program name appears in the process list
  count=$( ps -ax | grep -o "$each" | wc -l )
  # Holds "running" only if systemctl reports the unit as running
  status=$( systemctl status "$each" 2>/dev/null | grep -o -m 1 running )
  # Log and restart only when both checks fail
  if [ "$count" -lt 2 ] && [[ "$status" != "running" ]]; then
    echo "${each^^} PROGRAM WAS DOWN - RESTARTED: $DATE"$'\n\n' >> /var/log/relaunch.log
    example_access_log=$( tail -n 300 /var/log/nginx/example.access.log )
    example_error_log=$( tail -n 300 /var/log/nginx/example.error.log )
    php_error_log=$( tail -n 50 /var/log/php-fpm/error.log )
    echo "=========== EXAMPLE NGINX ACCESS LOG: ==============="$'\n\n'"$example_access_log"$'\n\n' >> /var/log/relaunch.log
    echo "=========== EXAMPLE NGINX ERROR LOG: ==============="$'\n\n'"$example_error_log"$'\n\n' >> /var/log/relaunch.log
    echo "=========== PHP-FPM ERROR LOG: ==============="$'\n\n'"$php_error_log"$'\n\n' >> /var/log/relaunch.log
    # Hard restart: stop, wait, then start again
    systemctl stop "$each"
    sleep 10
    systemctl start "$each"
  fi
done

# set +x
exit 0

What does the above script do? First, this is a Bash script, so be sure this program is available before trying to set this up. On the command line, enter:

[user@localhost ~]$ which bash

This should return:

/usr/bin/bash

The first line tells the script where to find Bash, so be sure this line matches the results from the ‘which’ command.

Apart from that first line, anything starting with a ‘#’ (hash) is a comment line. The next line, ‘set -x’, can be used to debug the script, but the leading ‘#’ comments it out and the command is skipped.

Next comes the list of programs the script should monitor. This is an indexed array declared with ‘declare -a’, the oldest array syntax in Bash, supported since version 2. That makes the list itself portable, which makes life easy when jumping from one server to another. Note that the ‘${each^^}’ uppercase expansion used later in the script requires Bash 4.0 or newer.
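As a small illustration of the array described above, the sketch below declares the same list and walks through it. It only prints the names; nothing is checked, started or stopped.

```shell
#!/usr/bin/env bash
# Declare an indexed array of program names (same list as the script above).
declare -a PROGRAMS=('nginx' 'php-fpm' 'memcached')

# ${#PROGRAMS[@]} expands to the number of elements in the array.
echo "Monitoring ${#PROGRAMS[@]} programs"

# "${PROGRAMS[@]}" expands to one word per element, even if an element contains spaces.
for each in "${PROGRAMS[@]}"; do
  echo "Would check: $each"
done
```

Adding another program to the monitor is just a matter of adding another quoted name to the list.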

Next is the timer. ‘$RANDOM’ returns a pseudo-random integer, and the modulo operation ‘% 18’ reduces it to a value from 0 to 17. The kernel needs no advance notice to run a script, so the delay is not there for the kernel’s benefit. What the delay does is stagger the start of the checks, so this script does not fire at the exact second that cron launches every other job scheduled for the same minute.

‘sleep’ pauses the script for the number of seconds held in ‘timer’. A sleeping process uses virtually no CPU, so the pause costs the server nothing.
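A minimal sketch of the timer and sleep steps: ‘$RANDOM % 18’ always yields an integer from 0 to 17, and ‘sleep’ would pause for that many seconds. The sketch prints the delay instead of sleeping so that it returns immediately.

```shell
#!/usr/bin/env bash
# $RANDOM is a Bash built-in variable that returns an integer from 0 to 32767.
# The modulo keeps the delay between 0 and 17 seconds.
timer=$(( RANDOM % 18 ))
echo "Would sleep for $timer seconds"

# In the real script the next command would be:
#   sleep "$timer"
```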

The DATE variable is assigned the current date and time in the form MM-DD-YYYY | HH:MM:SS. If there is a problem, this timestamp pinpoints when the error occurred and helps track down possible solutions.

Next is the magic that makes this script possible. The ‘for’ loop executes the actual monitor and restart of each program in the previously built array. The array can have any number of programs in it. This example would be for a typical web server.

The ‘for’ loop assigns each element of the array ‘${PROGRAMS[@]}’ to the variable ‘each’ in turn. The loop body then uses ‘each’ to check whether that program is running.

Inside the ‘for’ loop, the first thing to determine is the ‘count’. The script searches the process snapshot (‘ps’) for the program name and counts the matches, which later helps confirm whether or not the program is running. The ‘status’ variable is assigned the value ‘running’ if ‘systemctl status’ reports the program being monitored as running. Either a sufficient ‘count’ or a ‘running’ status is enough for the restart to be bypassed. Only when both checks fail is the restart condition executed.
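To see what the count actually measures, the sketch below runs the same ‘grep -o | wc -l’ pipeline against a hypothetical, hard-coded sample of process lines instead of live ‘ps -ax’ output. The sample text is an assumption made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical sample of process-list lines, standing in for real `ps -ax` output.
sample_ps='  812 ?  Ss  0:00 nginx: master process /usr/sbin/nginx
  813 ?  S   0:02 nginx: worker process
  990 ?  Ss  0:01 memcached -u memcached'

# grep -o prints every match on its own line, so wc -l counts matches, not lines.
count=$( grep -o "nginx" <<<"$sample_ps" | wc -l )
echo "nginx matches: $count"
```

Two lines mention nginx, but the first line contains the name twice, so the count is three. On a live server the ‘grep’ command itself may also show up in the process list, which is one reason the script compares against 2 rather than 1.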

The restart condition is found inside the ‘if’ statement that determines whether or not the restart is logged and executed. If the ‘count’ is less than ( -lt ) 2 and the ‘status’ does not equal “running”, then the ‘if’ condition is true and executes the error logging and restart.
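The condition can be tried in isolation. In the sketch below, ‘needs_restart’ is a hypothetical helper (it is not part of the original script) that takes a count and a status and succeeds only when both tests fail, mirroring the ‘&&’ in the ‘if’ statement.

```shell
#!/usr/bin/env bash
# needs_restart is a hypothetical helper: it returns success (0) only when
# the count is below 2 AND the status is not "running".
needs_restart() {
  local count=$1 status=$2
  [ "$count" -lt 2 ] && [[ "$status" != "running" ]]
}

needs_restart 0 ""        && echo "case 1: restart"     # both checks fail
needs_restart 5 ""        || echo "case 2: no restart"  # count is high enough
needs_restart 0 "running" || echo "case 3: no restart"  # systemctl says running
```

Only the first case triggers a restart; in the other two, one passing check is enough to skip it.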

The error logging starts by echoing which program was down and the date/time it was found to be down. It then creates a little separation with a couple of newlines ($'\n\n'), and then the access and error log snapshots are assigned to their respective variables. These variables are then appended to ‘/var/log/relaunch.log’. Notice that the logging variables are assigned the ‘tail’ end of the program logs. This tail can be as long or short as is required to help determine the potential cause of any troubles. Simply change the ‘tail’ number to change this output. Since this script will be executed by a cron job, it’s better to have a longer tail than a shorter one.
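The capture-and-append pattern can be tried safely with temporary files. In this sketch, ‘src_log’ and ‘relaunch_log’ are hypothetical stand-ins created with ‘mktemp’, so nothing under /var/log is touched.

```shell
#!/usr/bin/env bash
# Temporary files stand in for the real logs so the sketch is safe to run anywhere.
src_log=$(mktemp)       # hypothetical stand-in for a program's error log
relaunch_log=$(mktemp)  # hypothetical stand-in for /var/log/relaunch.log

# Write 10 numbered lines into the fake source log.
for i in $(seq 1 10); do
  echo "error line $i" >> "$src_log"
done

# Keep only the last 3 lines, then append them under a header.
snapshot=$( tail -n 3 "$src_log" )
echo "=========== SAMPLE ERROR LOG: ==============="$'\n\n'"$snapshot"$'\n\n' >> "$relaunch_log"

cat "$relaunch_log"
rm -f "$src_log"
```

Only lines 8 through 10 reach the relaunch log. Raising the number passed to ‘tail -n’ captures more history at the cost of a larger log file.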

Once the logs are gathered and appended to the relaunch.log, a hard restart is executed. First the process that was not running is stopped, and then a 10 second ‘sleep’ is executed to give the server time to remove any processes from memory before the failed program is started once again.

The ‘for’ loop returns to the top when it hits ‘done’ and runs its tests on the next program. When the array has reached its last entry, the ‘for’ loop concludes and exits.

The ‘set +x’ will now turn off any debugging, if it was turned on to begin with, and will return the script to its normal output. The ‘set’ command should always be enabled/disabled in pairs. First enable it with ‘set -x’, then disable it with ‘set +x’.
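A short sketch of the pairing: tracing is switched on around one command and switched off again. The variable names are made up for illustration.

```shell
#!/usr/bin/env bash
# Tracing is off here, so this assignment runs silently.
first=2

set -x                      # start tracing: commands are echoed to stderr with a leading '+'
total=$(( first + 3 ))
set +x                      # stop tracing (this command itself still appears in the trace)

echo "total is $total"
```

The trace goes to standard error, so it does not mix with anything the script writes to a log file through standard output.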

Once all elements of the script have been executed, the script should ‘exit’ with a ‘0’ status. ‘0’ means success with no failures.
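The exit status is what a caller sees in ‘$?’ immediately after a command finishes. The sketch below uses two subshells as stand-ins for scripts that end with ‘exit 0’ and with a non-zero code.

```shell
#!/usr/bin/env bash
# A subshell that exits 0 stands in for a script that finished successfully.
( exit 0 ); ok=$?

# A subshell that exits 3 stands in for a script that reported a failure.
( exit 3 ); bad=$?

echo "ok=$ok bad=$bad"
```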

This small script is executed at a selected time interval using a cron job. Executing this script every 5 minutes would look like:

*/5 * * * * /usr/local/sbin/

Be sure to ‘chmod’ this script to 700 so that its owner can execute it and no other user can read or modify it.

sudo chmod 700 /usr/local/sbin/
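The effect of mode 700 can be checked on a throwaway file. In this sketch, ‘demo_script’ is a hypothetical temporary file, and ‘stat -c’ assumes the GNU coreutils version of ‘stat’ found on most Linux servers.

```shell
#!/usr/bin/env bash
demo_script=$(mktemp)   # hypothetical stand-in for the real script file
chmod 700 "$demo_script"

# %a prints the permission bits in octal (GNU stat).
perms=$(stat -c '%a' "$demo_script")
echo "mode: $perms"

# -x is true when the current user is allowed to execute the file.
if [ -x "$demo_script" ]; then
  echo "owner can execute"
fi
rm -f "$demo_script"
```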

To all the Bash scripting experts out there, this is the basic framework for what can be a far more complex script. This script can be adapted to send emails, exit on failures, log additional failures, start or stop additional programs and further interact with the programs being monitored.

Please add your comments below to help those that are learning how to solve server crashes.
