Have you ever gotten an alert in the middle of the night, only to log in and find that everything is fine? Then, just as you’re about to log off, WhatsUp® Gold labels the device ‘Up’ again? As a former system administrator, I had to deal with that on just one occasion. That is when I learned about timeout/retry values on active monitors within WhatsUp® Gold.
By default, many of the timeout/retry values on the active monitors are too aggressive. For example, the ‘Ping’ active monitor (one of the top offenders for false positives) has a default timeout of 1 second with 1 retry. Let me lay out a scenario. WhatsUp® Gold polls active monitors every 60 seconds by default. Let’s say your action policy is set to e-mail you immediately when a device goes down. If both ICMP packets (the initial request and its single retry) are dropped during a poll, the device is labeled down and you get an e-mail. The device stays labeled down until a polling cycle succeeds, which could be the next cycle or a later one.
Now let’s say the active monitor’s timeout and retry values are set higher. I typically use a timeout of 8 seconds with 2 retries. In the same scenario, the system drops a couple of the ICMP requests, but the device remains labeled ‘Up’ because the higher timeout and retry values give it the chance to respond to a subsequent request.
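To make the two scenarios concrete, here is a minimal Python sketch of how a single polling cycle might be evaluated. It is not WhatsUp® Gold code: the function names `poll_is_up` and `worst_case_seconds` are hypothetical, and the model assumes that a cycle makes `retries + 1` attempts and that each failed attempt waits the full timeout.

```python
from typing import Optional, Sequence


def poll_is_up(reply_times: Sequence[Optional[float]],
               timeout: float, retries: int) -> bool:
    """Return True if any attempt in one polling cycle is answered in time.

    reply_times[i] is the reply latency in seconds for attempt i,
    or None if that packet was dropped. Hypothetical model only.
    """
    attempts = retries + 1  # the first request plus each retry
    for latency in reply_times[:attempts]:
        if latency is not None and latency <= timeout:
            return True
    return False


def worst_case_seconds(timeout: float, retries: int) -> float:
    """Longest a cycle can take if every attempt waits the full timeout."""
    return timeout * (retries + 1)


# Two dropped packets followed by a slow but successful reply.
cycle = [None, None, 2.5]

default_result = poll_is_up(cycle, timeout=1, retries=1)  # only 2 attempts
tuned_result = poll_is_up(cycle, timeout=8, retries=2)    # 3 attempts

print("default settings, device up:", default_result)
print("tuned settings, device up:", tuned_result)
print("tuned worst case (s):", worst_case_seconds(8, 2))
```

Under these assumptions, the default settings give up after the two dropped packets, while the tuned settings reach the third attempt and keep the device ‘Up’. The worst case for the tuned values still fits well inside the default 60-second polling interval.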
What is important to note is that every monitor within WhatsUp® Gold (excluding WMI-based monitors) has an adjustable value for timeout and retries. Now, don’t go crazy and adjust them all if you don’t have to! Simply adjust the offending active monitor. To verify which one that is, when a monitor goes down, open the ‘Device Status’ page and click on the ‘General’ tab. There you will see the ‘State Change Log’ for that device. If the monitor message shows ‘Timeout’ as the problem, then you’re good to go ahead and adjust the timeout for that monitor. Note that adjusting the value is done within the active monitor library, so the new timeout/retry applies to *ALL* devices that have that monitor applied. From experience, the monitors that most often need adjusting are ping, interface, and power supply. Adjust them to an 8 second timeout with 2 retries, as recommended above. If you still see the issue, increase the values a bit more.
v2016 and below
- In the web interface, go to Admin -> Monitors -> Active Monitors. Select your desired monitor and hit ‘Edit’.
- In the admin console, go to Configure -> Active Monitor Library… Select your desired monitor and hit ‘Edit’.
v2017 and above
- In the web interface, go to Settings -> Libraries -> Monitors. Select your desired monitor and hit the pencil icon (Edit).
Note that many of the monitors’ timeout/retry settings are in the ‘Advanced’ section of the subsequent dialog.