Monitoring pfSense with Nagios XI Using SSH – part 2

Downloading and testing the checks

In part 1, we set up passwordless SSH. What good does that do? Now that we have a secure connection between the systems, we are much closer to securely running check commands using the SSH proxy on Nagios XI (or check_by_ssh on Nagios Core).

First, though, we need to get the various plugins onto the pfSense box. We are going to use a handful of custom scripts, but we’ll also use some pre-compiled executables. You can compile your own by downloading the source from https://nagios-plugins.org/downloads/, but I would not recommend it. Instead, you can grab these plugins pre-compiled from freshports.org. This is easy in FreeBSD because you just run ‘pkg install nagios-plugins’ from the command line as shown below. After you reply ‘y’ to the ‘proceed with this action’ question, the command will pull the files down and place them in the package’s preferred directory.

# sudo pkg install nagios-plugins
Updating pfSense-core repository catalogue...
pfSense-core repository is up to date.
Updating pfSense repository catalogue...
pfSense repository is up to date.
All repositories are up to date.
The following 1 package(s) will be affected (of 0 checked):

New packages to be INSTALLED:
        nagios-plugins: 2.2.1_5,1 [pfSense]

Number of packages to be installed: 1

The process will require 2 MiB more space.
366 KiB to be downloaded.

Proceed with this action? [y/N]:y

Excellent! Now the pre-compiled plugins can be found in the ‘/usr/local/libexec/nagios’ directory. Give your newly installed plugins a test run by typing in the command below. If all goes well, you should receive some output specifying your current number of processes.

# /usr/local/libexec/nagios/check_procs
PROCS OK: 67 processes | procs=67;;;0;
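
Nagios judges a plugin by its exit code rather than by the text it prints, so it is worth looking at the exit code during a test run too. Here is a minimal sketch, assuming the standard plugin convention of 0 = OK, 1 = WARNING, 2 = CRITICAL, and 3 = UNKNOWN. The helper name nagios_state is made up for this example.

```shell
#!/bin/sh
# Translate a plugin exit code into the state Nagios will show.
# nagios_state is a hypothetical helper, not part of nagios-plugins.
nagios_state() {
    case "$1" in
        0) echo "OK" ;;
        1) echo "WARNING" ;;
        2) echo "CRITICAL" ;;
        *) echo "UNKNOWN" ;;   # 3 is UNKNOWN; anything else is lumped in here
    esac
}

# Usage on the pfSense box:
#   /usr/local/libexec/nagios/check_procs; nagios_state $?
```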

So that’s great, but those files aren’t specific to pfSense! What about monitoring items such as services, VPNs, etc.? I have created custom scripts for those checks, which are freely available on GitHub. You can easily download these to your pfSense firewall using the curl and unzip commands below. Make sure you run these commands on your pfSense system.

# curl -LO https://github.com/oneoffdallas/pfsense-nagios-checks/archive/master.zip
# sudo unzip -j master.zip -d /usr/local/libexec/nagios/
# sudo chmod +x /usr/local/libexec/nagios/check_pf_*

If everything went as planned, you can run the command below and get some output back.

# /usr/local/libexec/nagios/check_pf_version
Current version: 2.4.2-RELEASE / Mon Nov 20 08:12:56 CST 2017
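
If you would rather confirm the whole set landed in one go, something like the following might help. It is only a sketch: missing_checks is a hypothetical helper, and the list of script names is simply the ones used later in this article.

```shell
#!/bin/sh
# List any expected custom check that is missing or not executable.
# missing_checks is a hypothetical helper; the names come from this article.
missing_checks() {
    dir="$1"
    for name in check_pf_cpu check_pf_mem check_pf_services \
                check_pf_interface check_pf_ipsec_tunnel \
                check_pf_state_table check_pf_version; do
        if [ ! -x "$dir/$name" ]; then
            echo "$name"
        fi
    done
}

# Usage: prints nothing when every script is in place.
#   missing_checks /usr/local/libexec/nagios
```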

Various Checks Explained

Below, I’ve gone through a fair amount of effort explaining the plugins I recommend, along with some default values for pfSense that should work in most cases. If you want to test some of them on your own pfSense, just run the commands (not their output) from the /usr/local/libexec/nagios directory (unless you want to type in the full path).

If you want to just go with my recommendations and don’t want the full explanations, then head over to part 3.

Go to Part 3: Configuring the checks on Nagios XI

Off the shelf FreeBSD checks

./check_ping -H 208.67.222.222 -w 80,10% -c 150,40%
PING OK - Packet loss = 0%, RTA = 25.71 ms|rta=25.712999ms;80.000000;150.000000;0.000000 pl=0%;10;40;0

So this isn’t a ping “to” the pfSense. Instead, this is a ping “from” the pfSense to another system. This is useful if you are monitoring the internet connection of a firewall, local or remote. It’s also useful if you stack a few of these together, i.e. you can compare your pings to OpenDNS (in the example) vs. pings to vendor XYZ. If the pings to vendor XYZ increase and the corresponding data to OpenDNS does not, your vendor probably has an issue on their hands… Even better, now you’ll have the data to show them! Assuming you have decent internet, these thresholds should work. The check creates a warning if the round trip average reaches 80ms or the packet loss reaches 10%. Likewise, it causes a critical alert if the round trip average reaches 150ms or the packet loss reaches 40%.
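
The thing to remember with the ping thresholds is that either value on its own is enough to change the state. The sketch below mirrors that logic for ‘-w 80,10% -c 150,40%’. It is my own illustration rather than the plugin’s code: ping_state is a hypothetical helper, and I am assuming a threshold counts as breached once the value reaches it.

```shell
#!/bin/sh
# Classify a ping result the way "-w 80,10% -c 150,40%" is meant to:
# round trip time OR packet loss crossing its threshold changes the state.
# ping_state is a hypothetical helper; awk handles the decimal comparison.
ping_state() {
    rta_ms="$1"     # round trip average in milliseconds
    loss_pct="$2"   # packet loss in percent
    awk -v rta="$rta_ms" -v pl="$loss_pct" 'BEGIN {
        if (rta >= 150 || pl >= 40)     print "CRITICAL"
        else if (rta >= 80 || pl >= 10) print "WARNING"
        else                            print "OK"
    }'
}

ping_state 25.71 0   # values from the sample output above, prints OK
```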

./check_ntp_time -H time.google.com
NTP OK: Offset -0.006043791771 secs|offset=-0.006044s;60.000000;120.000000;

This check tests the firewall time against an NTP time source and produces a warning if the offset is greater than 60 seconds and a critical if it is greater than 120 seconds. As a best practice, I prefer to use a time source different from the one I use in the pfSense web GUI.

Note: This check can have issues caused by UDP packets not returning properly or in time. If this check produces a fair number of false positives, either try a different time source or simply increase the timeout to 30 seconds. Increasing the timeout can be done by adding “-t 30” in the Core Config Manager (Configure -> Core Config Manager -> NTP service) as shown below. This change can be made after the service is configured. Even with the occasional issue, I leave this check enabled because it is important for your firewall/logs to have the correct time. It also wouldn’t hurt to only check every few hours instead of every 5 minutes to further eliminate false positives.

[Screenshot: Nagios XI NTP Variation Core Config Manager timeout]
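
Notice the offset in the sample output is negative. The direction does not matter for alerting, only the size of the gap. Here is a rough sketch of how the 60/120 second thresholds apply, with ntp_state being a made-up helper and the exact comparison (greater than, not greater than or equal) being my assumption.

```shell
#!/bin/sh
# Classify a clock offset (in seconds) against the 60/120 second thresholds.
# The sign is dropped first: a clock that is behind is as wrong as one ahead.
# ntp_state is a hypothetical helper, not part of check_ntp_time.
ntp_state() {
    awk -v off="$1" 'BEGIN {
        if (off < 0) off = -off
        if (off > 120)     print "CRITICAL"
        else if (off > 60) print "WARNING"
        else               print "OK"
    }'
}

ntp_state -0.006043791771   # offset from the sample output above, prints OK
```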

./check_disk -w 20% -c 5% -p /
DISK OK - free space: / 23066 MB (90.29% inode=99%);| /=2478MB;22212;26376;0;27765

The disk check thresholds might be the opposite of what you expect because they refer to free space, not used space: warn if less than 20% of the disk is free and produce a critical alert if less than 5% is free. You’ll also note I’m only checking the root (/) partition. The other partition I would recommend checking in a standard pfSense setup is /var/run, using the command below.

./check_disk -w 20% -c 5% -p /var/run
DISK OK - free space: /var/run 3 MB (96.75% inode=97%);| /var/run=0MB;2;2;0;3

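
One more wrinkle: the thresholds go in as percent free, but the numbers after the ‘|’ in the output come back as MB used. For the root partition above, 20% and 5% free show up as 22212 and 26376 out of 27765. A small sketch of that arithmetic follows, assuming simple integer truncation. used_threshold_mb is a hypothetical helper for illustration only.

```shell
#!/bin/sh
# Convert a "percent free" threshold into the "MB used" figure
# that appears in the performance data.
# used_threshold_mb is a hypothetical helper for illustration only.
used_threshold_mb() {
    total_mb="$1"   # size of the partition in MB
    free_pct="$2"   # threshold as passed to -w or -c
    echo $(( total_mb * (100 - free_pct) / 100 ))
}

used_threshold_mb 27765 20   # warning threshold for /, prints 22212
used_threshold_mb 27765 5    # critical threshold for /, prints 26376
```
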
./check_load -w 3,2.8,2.6 -c 10,7,5 -r
OK - load average: 0.21, 0.17, 0.15|load1=0.210;3.000;10.000;0; load5=0.170;2.800;7.000;0; load15=0.150;2.600;5.000;0;

Load is a funny beast and lots of folks have different opinions on it. That’s because load can mean different things to different systems. Load reflects how busy your CPU, disk, and other resources are. Personally, on most *nix installs I prefer to keep the load under 3, so that is what I recommend here as well. The warning values of 3,2.8,2.6 apply to the load averages over the last 1, 5, and 15 minutes respectively, and the critical values of 10,7,5 work the same way. You might also take note of the ‘-r’, which divides the load by the number of processors.
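
To make the ‘-r’ part concrete, here is a minimal sketch of the per-processor division. per_cpu_load is a hypothetical helper of mine, not something the plugin ships with.

```shell
#!/bin/sh
# Divide a load average by the CPU count, which is what -r asks check_load to do.
# per_cpu_load is a hypothetical helper for illustration only.
per_cpu_load() {
    awk -v load="$1" -v cpus="$2" 'BEGIN { printf "%.2f\n", load / cpus }'
}

# On FreeBSD/pfSense the CPU count is available from: sysctl -n hw.ncpu
per_cpu_load 6 4   # a raw load of 6 on a 4-core box, prints 1.50
```

So a raw load of 6 on four cores comes out at 1.50, comfortably under the warning of 3. Without ‘-r’ the same box would already be in a warning state.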

./check_procs -w 200 -c 400
PROCS OK: 64 processes | procs=64;200;400;0;

This checks the number of processes. On a somewhat busy system with IDS enabled, I found 200 was a good warning state and it did a good job of letting me know if something went haywire or jobs were hanging. If you find yourself getting some false positives and your system regularly sits around 200, go ahead and bump it up a bit.
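
If you want to watch where your system normally sits before settling on thresholds, you can pull the number out of the performance data (the part after the ‘|’). This is just a sketch with a made-up helper name, perf_value, and it only grabs the first value on the line.

```shell
#!/bin/sh
# Extract the first performance data value from a plugin output line.
# perf_value is a hypothetical helper. Any unit suffix (e.g. MB) is kept.
perf_value() {
    # 1) drop everything up to and including the first "|"
    # 2) drop the label and "="
    # 3) drop the thresholds that follow the first ";"
    printf '%s\n' "$1" | sed -e 's/^[^|]*|[[:space:]]*//' \
                             -e 's/^[^=]*=//' \
                             -e 's/;.*$//'
}

perf_value 'PROCS OK: 64 processes | procs=64;200;400;0;'   # prints 64
```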

./check_swap -w 90% -c 40%
SWAP OK - 92% free (1879 MB out of 2047 MB) |swap=1879MB;1843;819;0;2047

Swap is another funny one and arguably optional if you are monitoring everything else. Surprisingly, the pre-configured, ARM-based Netgate/pfSense systems don’t even come with swap enabled, so there is a debate on whether it is necessary at all. Maybe this is from a flawed line of thinking, but I still prefer having swap. When swap is enabled, I also like watching it because if you are using it, your system ran out of RAM at some point. The thresholds are percentages of swap that is free, so you would assume the warning should be 100%, i.e. alert as soon as any swap is used… Unfortunately, you would be wrong. If you happen to use swap, the safest way to clear it back out is a reboot and I am a fan of seeing my uptime climb. And no, “swapoff -a && swapon -a” is not always the best or safest route. Also, using swap on rare occasions isn’t necessarily a bad thing.

So instead, I set the warning to 90% free and leave it at that. Add more RAM if you are frequently dropping below that mark. Incidentally, if you find yourself using a lot of swap, a memory increase would also help in other areas (and stats) including the load due to disk I/O. It’s just one of those things that can affect other metrics, which leads to red herrings when troubleshooting <- Yes, I’m speaking from experience!

Custom pfSense checks

So those are the standard checks Nagios provides for FreeBSD, and while they are helpful, they are seriously lacking when it comes to monitoring pfSense and firewall-specific functionality. I mean, what about VPN tunnels, interfaces, state tables, and services? That is where the power of Nagios and custom scripts comes in!

Load is fantastic because it can give wonderful indicators of how a system is doing. But if the load is high, is it the CPU, the memory, the disk I/O, or perhaps even a combination of all three? Thus, in addition to monitoring load, I recommend monitoring the CPU and memory as well using the commands below.

./check_pf_cpu -w 85 -c 95
OK - CPU Usage = 0%|CPU=0;;;;

Percentage of CPU used. It creates a warning if CPU is above 85% usage and critical if it is above 95%.

./check_pf_mem -w 90 -c 95
OK - Memory Usage = 41%|MEM=41;;;;

Percentage of memory used. It creates a warning if memory is above 90% usage and critical if it is above 95%. Yes, this is also a custom script, and it exists for the same reason as the CPU check.

./check_pf_services -name snort
OK - snort service is running
./check_pf_services -name dpinger
OK - dpinger service is running
./check_pf_services -name pfb_dnsbl
OK - pfb_dnsbl service is running
./check_pf_services -name dhcpd
OK - dhcpd service is running
./check_pf_services -name squid
OK - squid service is running

Ever had Snort stop unexpectedly? What about unbound, dhcpd, or any other pfSense service? Now you can monitor all of them! Just specify the name of the service as shown in any one of the examples.
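
When testing by hand, it can be handy to sweep through several services at once. The wrapper below is a hypothetical sketch of mine and is not part of the GitHub repo. It runs the check once per service name and hands back the highest exit code it saw, which means an UNKNOWN (3) outranks a CRITICAL (2).

```shell
#!/bin/sh
# Run one service check per name and return the highest exit code seen.
# check_all_services is a hypothetical wrapper, not part of the GitHub repo.
# $1 is the check command; the remaining arguments are service names.
check_all_services() {
    checker="$1"
    shift
    worst=0
    for svc in "$@"; do
        rc=0
        "$checker" -name "$svc" || rc=$?
        if [ "$rc" -gt "$worst" ]; then
            worst=$rc
        fi
    done
    return "$worst"
}

# Usage on the pfSense box:
#   check_all_services /usr/local/libexec/nagios/check_pf_services snort dhcpd unbound
```

In Nagios itself I would still configure one service per check so each one alerts on its own.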

./check_pf_interface -i em1_vlan6
OK - em1_vlan6 up and active
./check_pf_interface -i em1_vlan6 -name LAN
OK - LAN(em1_vlan6) up and active

Check whether your interfaces are up. This is extremely helpful on a firewall with multiple interfaces. You can see the names of all interfaces via the ‘ifconfig’ command or by going to Interfaces -> Assignments from the web GUI. The naming on VLANs is a little odd, so take note of that. Not a fan of the default name or want it to match what you have in the web interface? No problem! Use the ‘-name’ switch and it will instead report a friendlier name of your choosing.

./check_pf_ipsec_tunnel -e <IP address or hostname of remote>
OK - IPSEC VPN tunnel to <IP address of remote> - ESTABLISHED 70 seconds ago
./check_pf_ipsec_tunnel -e <IP address or hostname of remote> -name DallasTX
OK - IPSEC VPN tunnel to DallasTX - ESTABLISHED 3 minutes ago

Ever have a VPN go down and not know about it for a while? Not anymore! Once again, if you’re not a fan of the default VPN name based on IP address or if you want something more descriptive because you are monitoring numerous tunnels, you can use the ‘-name’ switch to provide a friendlier name of your choosing. On a side note, I always enjoyed calling vendors (who controlled the device on the other end) to let them know a VPN was down! One caveat: if you have pfSense firewalls on both ends of the IPsec tunnel and you’re monitoring both of them with Nagios, checking the tunnel from both ends will just double up your alerts.

./check_pf_state_table -w 60 -c 90
OK - PF state table: 315 ( 0% full - limit: 98000) | current_states=315;state_limit=98000;percent_used=0

This checks the percentage of the state table in use. With the thresholds above, it warns at 60% and goes critical at 90%. From first-hand experience, if your state table fills up you’re going to have a bad day and your firewall will do some wonky things that are nearly impossible to pin down.
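
The percentage itself is nothing fancy. Here is a sketch of the arithmetic, rounded down to a whole number to match the “0% full” in the sample output. state_pct is a hypothetical helper, and the real script may well compute it differently.

```shell
#!/bin/sh
# Percentage of the state table in use, rounded down to a whole number.
# state_pct is a hypothetical helper; the real script may differ.
state_pct() {
    current="$1"   # states currently tracked
    limit="$2"     # hard limit on states
    echo $(( current * 100 / limit ))
}

# On pfSense the two inputs can be read by hand with:
#   pfctl -si   (look for "current entries")
#   pfctl -sm   (look for the "states" hard limit)
state_pct 315 98000   # figures from the sample output above, prints 0
```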

./check_pf_version
Current version: 2.4.2-RELEASE / Mon Nov 20 08:12:56 CST 2017

Some time ago, this check compared the local version against the latest for your branch on the web. Unfortunately, some code changed and I haven’t circled back to uncover the reason. Instead, the check now returns the currently installed version and build date.

Go to Part 3: Configuring the checks on Nagios XI

Dallas Haselhorst has worked as an IT and information security consultant for over 20 years. During that time, he has owned his own businesses and worked with companies in numerous industries. Dallas holds several industry certifications and when not working or tinkering in tech, he may be found attempting to mold his daughters into card carrying nerds and organizing BSidesKC.
