Automatic network recovery when ping fails

Problem Description

I have run into network issues numerous times where I had to intervene manually to resolve them. When the root cause is always the same, you may want to automate the resolution. Beware, though: don't try to automate something that does not occur frequently enough, otherwise you spend time and effort on trivial things that don't bring any value. When I automate something, it is either because I am fed up with doing it manually (so the manual procedure is proven and can be automated) or because I want to learn something new.

In this particular case, I have VPN connections to several remote locations, and one of them sits behind CGNAT and a 4G router. If, for whatever reason, I lose the VPN to that remote site, I cannot initiate the VPN connection from the central site, since the remote site is ‘hidden’ behind the provider’s NAT. In other words, I don’t know its public IP address. After a few trips to the site, I realized that the interfaces (wlan and eth0) go down and lose their IP addresses. I still don’t know the reason. I suspect it has something to do with the dhcpcd service, and I am planning to replace it with NetworkManager in the near future. For now, though, I need to automate the network recovery. All of this runs on a Raspberry Pi with the latest Raspberry Pi OS.

Bash script to recover from network failure

So, it was time for me to get into bash scripting for pretty much the first time. The idea is the following:

  • From the remote location, ping the VPN endpoint at the central site.
  • If ping works, terminate the script.
  • If ping does not work (100% failure), restart the eth0 interface.
  • Once the interface is up, test again.
  • If ping still fails, reboot the Raspberry Pi.

Here is the Bash script:

#!/bin/bash

ip_address="10.0.0.1"       # Replace with the target IP address
max_failures=5              # Maximum number of consecutive ping failures before taking action

# Function to restart the eth0 interface
restart_eth0() {
    echo "$(date +"%Y-%m-%d %H:%M:%S") - Restarting eth0..."
    sudo ifconfig eth0 down
    sleep 5
    sudo ifconfig eth0 up
}

# Function to reboot the Raspberry Pi
restart_raspberry_pi() {
    echo "$(date +"%Y-%m-%d %H:%M:%S") - Rebooting the Raspberry Pi..."
    sudo reboot
}

# Loop for max_failures ping attempts
for ((attempt = 1; attempt <= max_failures; attempt++)); do
    # Ping the IP address once; the if tests ping's exit status directly
    if ! ping -c 1 "$ip_address" > /dev/null; then
        echo "$(date +"%Y-%m-%d %H:%M:%S") - Ping failed ($attempt/$max_failures)"

        if [ "$attempt" -eq "$max_failures" ]; then
            echo "$(date +"%Y-%m-%d %H:%M:%S") - Ping failed ($attempt/$max_failures). Restarting eth0"
            restart_eth0

            # Give the interface time to come back up before trying again
            sleep 60
            for ((retry = 1; retry <= max_failures; retry++)); do
                # Ping the IP address after restarting eth0
                if ping -c 1 "$ip_address" > /dev/null; then
                    echo "$(date +"%Y-%m-%d %H:%M:%S") - Ping successful after restarting eth0."
                    break  # Connectivity is back, no reboot needed
                fi

                echo "$(date +"%Y-%m-%d %H:%M:%S") - Ping failed ($retry/$max_failures)"
                if [ "$retry" -eq "$max_failures" ]; then
                    echo "$(date +"%Y-%m-%d %H:%M:%S") - Ping failed after restarting eth0. Rebooting Raspberry Pi..."
                    restart_raspberry_pi
                fi

                # Sleep for a moment before the next ping attempt
                sleep 10
            done
        fi
    else
        echo "$(date +"%Y-%m-%d %H:%M:%S") - Ping successful"
        break  # Exit the loop if ping is successful
    fi

    # Sleep for a moment before the next ping attempt
    sleep 10
done

I define two functions, restart_eth0() and restart_raspberry_pi(), which restart the eth0 interface and the Raspberry Pi respectively. Then the script enters a loop that pings the other end of the VPN. If a ping succeeds, the break command ends the loop and the script terminates. If not, it tries up to 5 times (max_failures), 10 seconds apart. If all 5 attempts fail, it calls restart_eth0(). After waiting 60 seconds, it starts pinging again. If the pings keep failing, it reboots the Raspberry Pi after the fifth failed attempt.
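The two loops repeat the same ping-and-retry pattern. One possible refactoring, which is my own sketch and not part of the script above, is a small generic helper. The name retry and its argument order (attempts, delay in seconds, then the command) are hypothetical.

```shell
#!/bin/bash

# Hypothetical helper (not in the original script): run a command up to
# $1 times, sleeping $2 seconds between attempts.
# Returns 0 on the first success, 1 if every attempt fails.
retry() {
    local max=$1 delay=$2 attempt=1
    shift 2
    while [ "$attempt" -le "$max" ]; do
        if "$@"; then
            return 0
        fi
        # Log to stderr so the message survives when stdout goes to /dev/null
        echo "$(date +"%Y-%m-%d %H:%M:%S") - Attempt failed ($attempt/$max)" >&2
        # No need to wait after the last attempt
        if [ "$attempt" -lt "$max" ]; then
            sleep "$delay"
        fi
        attempt=$((attempt + 1))
    done
    return 1
}
```

With this helper, each round of pings becomes a single command such as retry 5 10 ping -c 1 10.0.0.1 > /dev/null. Its exit status then decides whether to stop, call restart_eth0, or call restart_raspberry_pi. The helper's own messages go to stderr, so the 2>&1 in the cron entry still sends them to the log.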

I saved the script above in a file called check_network.sh and made it executable with the following command:

chmod +x check_network.sh

Then verify that the script is executable. In rwxr-xr-x, the x bits mean that the owner, the group and all other users can run it:

cgeo@raspi:~ $ ls -ltrh check_network.sh
-rwxr-xr-x 1 cgeo cgeo 2.0K Jan 7 21:46 check_network.sh
cgeo@raspi:~ $

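Since the script needs root privileges anyway (it takes eth0 down and reboots the Pi), it is probably better to restrict it to root. This matters because the cron job below runs the file as root: if a regular user can modify it, that user can effectively run any command as root.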

Once this is done, you can create a cron job for the root user by issuing the sudo crontab -e command.

Then add the following line:

0 * * * * /home/cgeo/check_network.sh >> /home/cgeo/check_network.log 2>&1

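This runs the script at minute 0 of every hour. The script's output is appended to check_network.log, and 2>&1 redirects stderr to stdout so that error messages are captured in the same log. You can run it more frequently, e.g. every 5 minutes (*/5 * * * *), but make sure a run has finished before the next one starts. When every ping fails, a single run takes a couple of minutes because of the 10-second pauses between attempts and the 60-second wait after restarting eth0.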
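One way to guarantee that runs never overlap is to have the script refuse to start while a previous instance is still active. Below is a minimal sketch using flock from util-linux. The lock file path /tmp/check_network.lock is my own assumption, and these lines would go near the top of check_network.sh.

```shell
#!/bin/bash

# Assumed lock file location (not from the original post); the variable can
# be overridden from the environment.
LOCK_FILE="${LOCK_FILE:-/tmp/check_network.lock}"

# Open the lock file on file descriptor 9 for the rest of the script and try
# to take an exclusive lock without waiting (-n). The lock is released
# automatically when the script exits and the descriptor is closed.
exec 9> "$LOCK_FILE"
if ! flock -n 9; then
    echo "$(date +"%Y-%m-%d %H:%M:%S") - Previous run still active, skipping"
    exit 0
fi
```

Alternatively, you can leave the script untouched and wrap the command in the crontab entry instead. flock accepts a lock file followed by the command to run: */5 * * * * flock -n /tmp/check_network.lock /home/cgeo/check_network.sh >> /home/cgeo/check_network.log 2>&1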

Conclusion

A few days ago I already had an incident where I lost the connection to the site, and the script miraculously brought connectivity back within the hour. One thing that worries me is what happens when the pings fail for reasons unrelated to the Pi's own network interfaces (for example, the remote 4G router is down). In that case the Raspberry Pi may end up stuck in a loop, rebooting every hour. I may need to revisit the script.
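One possible way to break such a loop is to count the reboots the script has triggered, stop rebooting after a few consecutive ones, and reset the counter whenever a ping succeeds. This is a hypothetical addition, not something the script above does. The state file path, the limit of 3 and the function names are all assumptions. The counter has to live on persistent storage (here under /var/lib), because /tmp may be emptied at boot.

```shell
#!/bin/bash

# Assumptions (not from the original post): where the counter is stored and
# how many script-triggered reboots in a row are allowed.
STATE_FILE="${STATE_FILE:-/var/lib/check_network/reboot_count}"
MAX_REBOOTS="${MAX_REBOOTS:-3}"

# Hypothetical hook that performs the actual reboot
do_reboot() {
    sudo reboot
}

# Print the number of consecutive reboots triggered so far (0 if none)
read_reboot_count() {
    local count
    count=$(cat "$STATE_FILE" 2>/dev/null) || count=0
    echo "${count:-0}"
}

# Store a new counter value, creating the directory if needed
write_reboot_count() {
    mkdir -p "$(dirname "$STATE_FILE")"
    echo "$1" > "$STATE_FILE"
}

# Call after a successful ping: connectivity is fine, so clear the counter
reset_reboot_count() {
    write_reboot_count 0
}

# Reboot only while the limit has not been reached; return 1 otherwise
guarded_reboot() {
    local count
    count=$(read_reboot_count)
    if [ "$count" -ge "$MAX_REBOOTS" ]; then
        echo "$(date +"%Y-%m-%d %H:%M:%S") - Already rebooted $count times in a row, not rebooting again"
        return 1
    fi
    write_reboot_count $((count + 1))
    sync    # flush the counter to disk before the reboot
    do_reboot
}
```

In check_network.sh, restart_raspberry_pi would call guarded_reboot instead of sudo reboot. Every branch that logs "Ping successful" would call reset_reboot_count. The trade-off is that once the limit is reached, the Pi stops rebooting until a ping succeeds again. A fault that only a reboot can clear would then no longer be handled automatically.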
