Downloading and testing the checks
In part 1, we set up password-less SSH. What good does that do? Now that we have a secure connection between the systems, we are quite a bit closer to securely running check commands using the SSH proxy on Nagios XI (or check_by_ssh on Nagios Core).
First though, we need to get the various plugins onto the pfSense box. We are going to use a handful of custom scripts, but we’ll also use some pre-compiled executables. You can compile your own by downloading the source from https://nagios-plugins.org/downloads/, but I would not recommend it. Instead, you can grab these plugins pre-compiled from freshports.org. This is easy on FreeBSD because you just run ‘pkg install nagios-plugins’ from the command line as shown below. After you reply ‘y’ to the ‘Proceed with this action?’ question, the command will pull the files down and place them in the package’s preferred directory.
# sudo pkg install nagios-plugins
Updating pfSense-core repository catalogue...
pfSense-core repository is up to date.
Updating pfSense repository catalogue...
pfSense repository is up to date.
All repositories are up to date.
The following 1 package(s) will be affected (of 0 checked):
New packages to be INSTALLED:
    nagios-plugins: 2.2.1_5,1 [pfSense]
Number of packages to be installed: 1
The process will require 2 MiB more space.
366 KiB to be downloaded.
Proceed with this action? [y/N]: y
Excellent! Now the pre-compiled plugins can be found in the ‘/usr/local/libexec/nagios’ directory. Give your newly installed plugins a test run by typing in the command below. If all goes well, you should receive some output specifying your current number of processes.
# /usr/local/libexec/nagios/check_procs
PROCS OK: 67 processes | procs=67;;;0;
So that’s great, but those files aren’t specific to pfSense! What about monitoring items such as services, VPNs, etc.? I have created custom scripts for those checks, which are freely available on GitHub. You can easily download these to your pfSense firewall using the curl and unzip commands below. Make sure you run these commands on your pfSense system.
# curl -LO https://github.com/oneoffdallas/pfsense-nagios-checks/archive/master.zip
# sudo unzip -j master.zip -d /usr/local/libexec/nagios/
# sudo chmod +x /usr/local/libexec/nagios/check_pf_*
If everything went as planned, you can run the command below and get some output back.
# /usr/local/libexec/nagios/check_pf_version
Current version: 2.4.2-RELEASE / Mon Nov 20 08:12:56 CST 2017
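If the command comes back with ‘Permission denied’ instead, the chmod step probably did not take. A quick way to spot that is sketched below. Note that list_not_executable is a hypothetical helper of my own naming, not something shipped with the scripts; it prints any check_pf_* file in a directory that lacks the execute bit.

```shell
# list_not_executable DIR
# Hypothetical helper: prints every check_pf_* file in DIR that is not
# executable by the current user. No output means all of them are fine.
list_not_executable() {
    for f in "$1"/check_pf_*; do
        [ -e "$f" ] || continue      # the glob matched nothing
        [ -x "$f" ] || echo "$f"
    done
}

list_not_executable /usr/local/libexec/nagios
```

Anything it lists can be fixed by re-running the chmod command from the download step.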
Various Checks Explained
Below, I’ve gone to a fair amount of effort to explain the plugins I recommend, along with some default values for pfSense that should work in most cases. If you want to test some of them on your own pfSense, just run the commands (not their output) from the /usr/local/libexec/nagios directory (unless you want to type in the full path).
If you want to just go with my recommendations and don’t want the full explanations, then head over to part 3.
Off the shelf FreeBSD checks
./check_ping -H 220.127.116.11 -w 80,10% -c 150,40%
PING OK - Packet loss = 0%, RTA = 25.71 ms|rta=25.712999ms;80.000000;150.000000;0.000000 pl=0%;10;40;0
So this isn’t a ping “to” the pfSense box. Instead, this is a ping “from” the pfSense box to another system. This is useful if you are monitoring the internet connection of a firewall, local or remote. It’s also useful if you stack a few of these together, i.e. you can compare your pings to OpenDNS (in the example) vs. pings to vendor XYZ. If the pings to vendor XYZ increase and the corresponding data to OpenDNS does not, your vendor probably has an issue on their hands… Even better, now you’ll have the data to show them! Assuming you have decent internet, this setup should work. It will create a warning if the round trip average is greater than 80ms or the packet loss is greater than 10%. Likewise, it will raise a critical alert if the round trip exceeds 150ms or the packet loss is greater than 40%.
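Each threshold pair is an either-or test: a breach of the round trip figure or of the packet loss figure is enough on its own. Here is a minimal shell sketch of that logic using the thresholds above. The function name ping_state is hypothetical and this is not the plugin’s source; treating a value that lands exactly on a threshold as a breach is my assumption.

```shell
# ping_state RTA_MS LOSS_PCT
# Hypothetical helper (not part of check_ping): maps a round trip average
# and a packet loss percentage to a state, using -w 80,10% -c 150,40%.
# Assumption: a value equal to a threshold counts as a breach.
ping_state() {
    awk -v rta="$1" -v pl="$2" 'BEGIN {
        if (rta >= 150 || pl >= 40)     print "CRITICAL"
        else if (rta >= 80 || pl >= 10) print "WARNING"
        else                            print "OK"
    }'
}

ping_state 25.71 0    # the reading from the example output: OK
ping_state 95 0       # slow round trip, no loss: WARNING
ping_state 30 50      # fast round trip, heavy loss: CRITICAL
```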
./check_ntp_time -H time.google.com
NTP OK: Offset -0.006043791771 secs|offset=-0.006044s;60.000000;120.000000;
This check tests the firewall time against an NTP time source and, with the default thresholds, produces a warning if the offset is greater than 60 seconds and a critical if it is greater than 120 seconds. As a best practice, I prefer to use a time source different from the one I use in the pfSense web GUI. Note: This check can have issues caused by UDP packets not returning properly or in time. If this check produces a fair number of false positives, either try a different time source or increase the timeout to 30 seconds by adding “-t 30” to the check in the Core Config Manager (Configure -> Core Config Manager -> NTP service). This change can be made after the service is configured. Even with the occasional issue, I leave this check enabled because it is important for your firewall and its logs to have the correct time. It also wouldn’t hurt to check only every few hours instead of every 5 minutes to further cut down on false positives.
./check_disk -w 20% -c 5% -p /
DISK OK - free space: / 23066 MB (90.29% inode=99%);| /=2478MB;22212;26376;0;27765
The disk thresholds may be the reverse of what you expect, because they refer to free space rather than used space. This warns if less than 20% of the disk is free and produces a critical alert if less than 5% of the disk is free. You’ll also note I’m only checking the root (/) partition. The other partition I would recommend checking in a standard pfSense setup is /var/run, using the command below.
./check_disk -w 20% -c 5% -p /var/run
DISK OK - free space: /var/run 3 MB (96.75% inode=97%);| /var/run=0MB;2;2;0;3
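The warn and crit fields in the performance data are expressed as space used rather than space free, which is why those numbers look inverted next to the command line. The sketch below reproduces them from the totals in the two outputs above. used_threshold is a hypothetical helper, and the integer truncation is an assumption that happens to match these outputs, not a statement about the plugin’s internals.

```shell
# used_threshold TOTAL_MB FREE_PCT
# Hypothetical helper: converts a "percent free" threshold into the
# "MB used" figure that appears in the performance data.
# Assumption: the result is truncated to a whole number of MB.
used_threshold() {
    echo $(( $1 * (100 - $2) / 100 ))
}

used_threshold 27765 20   # root partition, -w 20%: prints 22212
used_threshold 27765 5    # root partition, -c 5%:  prints 26376
used_threshold 3 20       # /var/run, -w 20%:       prints 2
used_threshold 3 5        # /var/run, -c 5%:        prints 2
```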
./check_load -w 3,2.8,2.6 -c 10,7,5 -r
OK - load average: 0.21, 0.17, 0.15|load1=0.210;3.000;10.000;0; load5=0.170;2.800;7.000;0; load15=0.150;2.600;5.000;0;
Load is a funny beast and lots of folks have different opinions on it. That’s because load can mean different things on different systems. Load is based on how busy your CPU, disk, and other resources are. Personally, on most *nix installs I prefer to keep the load under 3, so that is what I recommend here as well. The warning of 3,2.8,2.6 applies to the load averages over the last 1, 5, and 15 minutes respectively, and the critical of 10,7,5 works the same way. You might also take note of the ‘-r’, which divides the load by the number of processors.
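As a rough illustration of what ‘-r’ does, the sketch below divides a raw load average by the CPU count. per_cpu_load is a hypothetical helper. On FreeBSD the CPU count can be read with ‘sysctl -n hw.ncpu’, but it is passed in by hand here so the example runs anywhere.

```shell
# per_cpu_load LOAD NCPU
# Hypothetical helper: scales a raw load average by the number of CPUs,
# which is the idea behind check_load's -r switch.
per_cpu_load() {
    awk -v avg="$1" -v n="$2" 'BEGIN { printf "%.2f\n", avg / n }'
}

per_cpu_load 6.00 4   # a raw load of 6 on four cores: prints 1.50
per_cpu_load 6.00 1   # the same load on one core:     prints 6.00
```

So a raw load of 6 would trip a warning threshold of 3 on a single core, but not on a four-core box.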
./check_procs -w 200 -c 400
PROCS OK: 64 processes | procs=64;200;400;0;
This checks the number of processes. On a somewhat busy system with IDS enabled, I found 200 was a good warning state and it did a good job of letting me know if something went haywire or jobs were hanging. If you find yourself getting some false positives and your system regularly sits around 200, go ahead and bump it up a bit.
./check_swap -w 90% -c 40%
SWAP OK - 92% free (1879 MB out of 2047 MB) |swap=1879MB;1843;819;0;2047
Swap is another funny one and arguably optional if you are monitoring everything else. Surprisingly, the pre-configured, ARM-based Netgate/pfSense systems don’t even come with swap enabled, so there is some debate on whether it is necessary at all. Maybe this is from a flawed line of thinking, but I still prefer having swap. When swap is enabled, I also like watching it because if you are using it, your system ran out of RAM at some point. So you would assume the warning would be 100%… Unfortunately, you would be wrong. If you happen to use swap, the safest way to clear it back out is a reboot, and I am a fan of seeing my uptime climb. And no, “swapoff -a && swapon -a” is not always the best or safest route. Also, using swap on rare occasions isn’t necessarily a bad thing. So instead, I set the warning at 90% free and leave it at that. Add more RAM if you are frequently dropping below that mark. Incidentally, if you find yourself using a lot of swap, a memory increase would also help in other areas (and stats), including the load due to disk I/O. It’s just one of those things that can affect other metrics, which leads to red herrings when troubleshooting <- Yes, I’m speaking from experience!
Custom pfSense checks
So those are the standard checks Nagios provides for FreeBSD, and while they are helpful, they are seriously lacking when it comes to monitoring pfSense and firewall-specific functionality. I mean, what about VPN tunnels, interfaces, state tables, and services? That is where the power of Nagios and custom scripts comes in!
Load is fantastic because it can give wonderful indicators of how a system is doing. But if the load is high, is it the CPU, the memory, the disk I/O, or perhaps even a combination of all 3? Thus, in addition to monitoring load, I recommend monitoring the CPU and memory as well using the commands below.
./check_pf_cpu -w 85 -c 95
OK - CPU Usage = 0%|CPU=0;;;;
Percentage of CPU used. It creates a warning if CPU is above 85% usage and critical if it is above 95%.
./check_pf_mem -w 90 -c 95
OK - Memory Usage = 41%|MEM=41;;;;
Percentage of memory used. It creates a warning if memory is above 90% usage and critical if it is above 95%. Yes, this is also a custom script, and it is here for the same reason as the CPU check.
./check_pf_services -name snort
OK - snort service is running
./check_pf_services -name dpinger
OK - dpinger service is running
./check_pf_services -name pfb_dnsbl
OK - pfb_dnsbl service is running
./check_pf_services -name dhcpd
OK - dhcpd service is running
./check_pf_services -name squid
OK - squid service is running
Ever had Snort stop unexpectedly? What about unbound, dhcpd, or any other pfSense service? Now you can monitor all of them! Just specify the name of the service as shown in any one of the examples.
./check_pf_interface -i em1_vlan6
OK - em1_vlan6 up and active
./check_pf_interface -i em1_vlan6 -name LAN
OK - LAN(em1_vlan6) up and active
Check whether your interfaces are up. This is extremely helpful on a firewall with multiple interfaces. You can see the names of all interfaces via the ‘ifconfig’ command or by going to Interfaces -> Assignments in the web GUI. The naming on VLANs is a little odd, so take note of that. Not a fan of the default name, or want it to match what you have in the web interface? No problem! Use the ‘-name’ switch and the check will instead report a friendlier name of your choosing.
./check_pf_ipsec_tunnel -e <IP address or hostname of remote>
OK - IPSEC VPN tunnel to <IP address of remote> - ESTABLISHED 70 seconds ago
./check_pf_ipsec_tunnel -e <IP address or hostname of remote> -name DallasTX
OK - IPSEC VPN tunnel to DallasTX - ESTABLISHED 3 minutes ago
Ever have a VPN go down and not know about it for a while? Not anymore! Once again, if you’re not a fan of the default VPN name based on IP address, or if you want something more descriptive because you are monitoring numerous tunnels, you can use the ‘-name’ switch to provide a friendlier name of your choosing. On a side note, I always enjoyed calling vendors (who controlled the device on the other end) to let them know a VPN was down! If you have pfSense firewalls on both ends of the IPSEC tunnel and you’re monitoring both of them with Nagios, checking the tunnel from both ends will just double up your alerts, so checking it from one side is enough.
./check_pf_state_table -w 60 -c 90
OK - PF state table: 315 ( 0% full - limit: 98000) | current_states=315;state_limit=98000;percent_used=0
This checks the percentage of the state table in use, warning at 60% and going critical at 90% with the values above. From first-hand experience, if your state table fills up you’re going to have a bad day and your firewall will do some wonky things that are nearly impossible to pin down.
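For a sense of what a check like this boils down to, here is a minimal sketch of the percentage calculation paired with the standard Nagios plugin exit codes (0 for OK, 1 for WARNING, 2 for CRITICAL). It is not the actual check_pf_state_table script: state_table_check is a hypothetical function, the counts are passed in by hand, and truncating to a whole percent is an assumption. On a live system, the current count and the limit could be read from ‘pfctl -si’ and ‘pfctl -sm’.

```shell
# state_table_check CURRENT LIMIT WARN_PCT CRIT_PCT
# Hypothetical sketch, not the check_pf_state_table source.
# Prints a status line and returns the matching Nagios exit code.
state_table_check() {
    pct=$(( $1 * 100 / $2 ))   # assumption: truncated to a whole percent
    if [ "$pct" -ge "$4" ]; then
        echo "CRITICAL - PF state table: $1 (${pct}% full - limit: $2)"
        return 2
    elif [ "$pct" -ge "$3" ]; then
        echo "WARNING - PF state table: $1 (${pct}% full - limit: $2)"
        return 1
    fi
    echo "OK - PF state table: $1 (${pct}% full - limit: $2)"
    return 0
}

# The reading from the example output: 0% full, exit code 0.
state_table_check 315 98000 60 90

# A fuller table: 61% full, so a warning with exit code 1.
state_table_check 60000 98000 60 90 || echo "exit code: $?"
```

Nagios reads the exit code to decide the state and shows the printed line as the status text.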
./check_pf_version
Current version: 2.4.2-RELEASE / Mon Nov 20 08:12:56 CST 2017
Some time ago, this check compared the local version against the latest for your branch on the web. Unfortunately, some code changed and I haven’t circled back to uncover the reason. Instead, the check now returns the currently installed version and build date.