I remember the first remote management and monitoring (RMM) solution ever, the venerable and wonderful “ping”. We would use it all the time to see if a remote host was up and responding. And then, one day, someone wrote a program for Windows, Whatsup, and the world was changed forever. With that program, we admins could enter multiple IP addresses and that tool would ping them all day and night! It could even be set up to generate alerts.
We thought we had it made until someone asked, “Hey, I know I can ping the SQL server, but is it responding on TCP 1433?” At that point, we knew both that we needed more in our app and that there would be other admins, with other network ports, who would make similar requests. And so began the development of RMM tools.
At small companies, RMM may very well be not much more than a shareware ping/telnet suite that checks for hosts being up and responding on critical ports. It may involve learning multiple suites of RMM tools, roughly in conjunction with the trial period for one tool ending and a download for the new tool being complete. Most of what goes on is just monitoring, not management (does that mean they consume R_M products?), as there are few enough systems to manage where ssh and RDP sessions to the several devices that need management are sufficient.
Once we get to a medium company with multiple sites, that SSH/RDP solution for everything simply fails to scale. It’s time to lay some money out and actually pay for an RMM solution that will track those uptimes as well as do some kind of configuration management. Everyone makes demands of that config management solution – will it do rollbacks? Will it do point-in-time recovery? Will it track changes made outside the product? Will it enforce certain configuration parameters? Will it integrate with the helpdesk ticketing system?
The answer to all of those questions is either “no” or “yes, at an additional cost.” Nobody rides the RMM train for free.
And it’s not like that RMM will magically never make mistakes. We’re still in a garbage in, garbage out world. More than once, I was working on a project to integrate our routers and switches with a tool by pushing code to them with the RMM solution… only to have that code get overwritten because a different team pushed a change with an outdated template. So what’s the policy and procedure for undoing a change that was done in error? I found that part out the hard way as I waited for the next change window to get my changes put back into the environment.
I’ve seen RMM tools that can’t push version-specific code. Well, they can, but they don’t keep track of versions, so it’s a guess or a logic problem to figure out which devices are on which version. One solution I came up with was to push one line of code to all devices, knowing that it would fail for devices on the older version. The next push checked the config to see if that line I previously pushed was in the config. If so, skip the device. If not, then push a line of code compatible with the older versions. Would I have preferred that the tool have the intelligence to do a version check and then push the appropriate line of code, all in one go? Yes. Yes, I would. The biggest irony to me in this particular case was that the RMM tool was made by the vendor of the devices that the tool couldn’t track the version on. Very disappointing…
And then there’s RMM at the large corporation. Thousands of switches and routers, some on very dodgy Internet connections, all of them being monitored. This means the poor sap with the on-call phone is constantly answering when the NOC calls in to say that the Dakar site is down. Or the Guadalajara site. Or the Noida site. Or the Ho Chih Minh City site. Or the Chengdu site. Or the Narvik site. Or the Deadhorse site. And the NOC guy reads out the entire device name and IP address, letter and number by letter and number, so one has to sit and wait through it all before saying, “Acknowledged. Please open a ticket with the ISP.” I can’t remember a happier day than when the policy was finally re-done so that the NOC would just open the blasted ticket on their own without requiring acknowledgement from engineering.
Still, we were blessed in that we had nearly every switch under management. This did have one side effect, however… we wouldn’t believe a switch existed if it wasn’t in the RMM tool until we saw it listed as a neighbor on another switch and pinged it. That’s when we discovered that some switches couldn’t be brought into our RMM tool because they didn’t support the SNMPv2. Or because nobody could remember the password to get local access and nobody had the nerve to take it to ROMMON mode to break into it. Or because the local support contract kept that gear out of our global tools.
Those problems were relatively straightforward compared to getting gear from specialty vendors into the RMM tool. Not all of them had the same implementation when it came to reporting, even things as simple as disk space and CPU usage. For disk space, does the vendor report total available space, across all volumes, or will it send an alert when one particular volume hits 95% capacity? Will it report overall CPU utilization or will it fire an alert when one of 16 CPUs goes over 90%? The answer is, of course, “It depends.” That means that alerts from some vendors actually aren’t alerts, they’re more like transient conditions of no great importance. It also means that some vendor gear could be in an alert state, but it doesn’t actually report it as such, given how it implements a particular SNMP MIB.
At all companies, there’s the issue with keeping the tools up-to-date. The day that the tool is launched for general use is such a bright, shining moment in the history of the progress of humanity, with all the devices that need monitoring in that tool, right where they should be. Within a very short time – overnight, in some cases – the information in it is obsolete. New devices aren’t added and decommissioned devices are showing red because nothing is reporting back at that IP address… and then they go green again when that IP is re-used, but we just haven’t realized yet that it’s a security camera now, not a loopback address.
Finally, there’s the issue of access. Even at the small company, not everyone who wants to know if a system is up will have access to the RMM dashboard. At larger and larger companies, access to that dashboard can get limited to the point where even the network engineers can’t look at it… or the tool is so cumbersome, there’s severe mental pain involved in getting information out of it.
And that’s why, even at a massively huge global megacorporation, I still got plenty of use out of running a shareware app that would ping a list of devices, so I’d know if they were up… it wasn’t an official tool with management and headcount assigned to it. It just ran on my desktop and running it meant I wouldn’t have to open a service ticket to ask someone if they could check to see if the RMM had a green dot by my device or not.