With today’s user expectations of IT services being ‘always on’, it is ever more important to ensure you quickly detect, diagnose, and resolve network performance problems and outages – before you start getting calls asking if the network is down.
As the UK’s foremost authority on SolarWinds, Comtact Ltd. works with the UK's leading organisations to help them achieve the monitoring visibility they demand - avoiding over-monitoring and alert-fatigue, to quickly pinpoint the root cause of problems which impact service availability and user experience.
Unlocking the power of SolarWinds
One of the most challenging tasks for IT network administrators is keeping an eye on the health of the IT infrastructure – monitoring multiple services and applications, tracking bandwidth usage, and tracing and remediating problems and alerts – as well as taking care of the day job, to help drive the business forward!
As the undisputed leader, SolarWinds' monitoring software is hugely capable, but it can quickly overwhelm organisations with the volume of alerts generated if it is not configured and tuned to your organisation's unique requirements.
Common problems with SolarWinds
As industrial SolarWinds users ourselves, we help build your own teams’ expertise through a collaborative approach to...
- Avoid over-monitoring - and 'alert fatigue'.
- Eliminate false-positives - save time chasing shadows.
- Identify your critical alerts - with threshold tuning, dependencies and groupings.
- Create executive dashboards - for visibility of critical business services.
...transferring our knowledge to support your internal resource and unlock the formidable power of SolarWinds' monitoring platform. You know your infrastructure. We know SolarWinds.
This guide outlines the 5 key principles of effective IT network monitoring with SolarWinds – to help your organisation eliminate downtime and improve user experience - and solve the biggest problem with SolarWinds.
1. Keep things simple
This is the most common issue we see. Overly complex monitoring can generate too many warnings, leading to an inevitable ‘cry wolf’ effect of false-positives. This ultimately has a negative impact on systems performance, as resource is incorrectly allocated to troubleshooting ‘false’ alerts. Over-monitoring can also lead to excess infrastructure adjustments, as IT teams attempt to keep pace with their tuning parameters.
2. Focus on what’s actually going wrong
Smarter infrastructure monitoring systems eliminate the noise caused by outages through conditional alerting. Which component has actually failed? How does that failure impact the subsequent systems and processes?
An example of this could be the WAN link to your data centre. When the WAN router serving the link fails, you do not need 200 messages telling you that each of the 200 servers in the data centre behind it is unavailable! Filtering through the 200 messages to find the single WAN router creates a significant drain on resources, as well as an unnecessary delay in responding to the fault.
In this case, the conditional alerting rule would be that the servers do not raise an outage alert if a fault has already been reported on the WAN link they depend on.
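As a rough illustration of the idea, the sketch below filters a set of unreachable nodes down to the ones that have no failed upstream dependency. It is plain Python rather than SolarWinds configuration, and the names used (`root_cause_alerts`, `parent_of`, `wan-router`, `server-001`) are hypothetical, chosen only to mirror the WAN example above.

```python
from typing import Dict, List, Optional, Set


def root_cause_alerts(down: Set[str], parent_of: Dict[str, Optional[str]]) -> List[str]:
    """Return only the down nodes that have no down node upstream of them."""

    def has_down_ancestor(node: str) -> bool:
        seen: Set[str] = set()  # guards against accidental dependency loops
        parent = parent_of.get(node)
        while parent is not None and parent not in seen:
            if parent in down:
                return True
            seen.add(parent)
            parent = parent_of.get(parent)
        return False

    return sorted(node for node in down if not has_down_ancestor(node))


# Hypothetical topology: 200 servers sit behind a single WAN router.
servers = [f"server-{i:03d}" for i in range(1, 201)]
parent_of: Dict[str, Optional[str]] = {name: "wan-router" for name in servers}
parent_of["wan-router"] = None

# The router fails, so every server behind it is unreachable too.
down = set(servers) | {"wan-router"}
print(root_cause_alerts(down, parent_of))  # → ['wan-router']
```

With the dependency declared, 201 potential alerts collapse into one. If a single server fails while the router is healthy, that server is still reported in its own right.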
Your IT monitoring solution should allow critical components to be ranked with an appropriate level of importance, as well as collating data from across existing enterprise monitoring systems.
3. Optimise IT resource usage by alerting the right people
With IT skills specialisation, it makes sense that the right people are tasked to fix a given issue – first time. Monitoring alerts should always be tuned to go to individuals in skills groups: network issues go to network teams, server issues to server teams.
Note the reference to "individuals" in a skills group. As far as possible, direct each alert to a named person to avoid the "bystander effect", where everyone assumes someone else is dealing with it.
To get the best possible outcome, it is often appropriate to make the alerting process a human one, employing techniques applied in Network Operations Centre (NOC) environments, whereby a responsible NOC agent will track down fault owners "in person" to confirm issues are being addressed before standing down.
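One way to picture directed alerting is a simple router that maps each alert category to a skills group and then hands the alert to one named person in that group, in rotation. This is a minimal sketch in plain Python, not a SolarWinds feature: the `AlertRouter` class, the category names and the people are all hypothetical.

```python
from dataclasses import dataclass, field
from itertools import cycle
from typing import Dict, Iterator, List


@dataclass
class AlertRouter:
    """Assigns each alert to one named individual in the matching skills group."""

    teams: Dict[str, List[str]]
    _rotas: Dict[str, Iterator[str]] = field(init=False, repr=False)

    def __post_init__(self) -> None:
        for category, members in self.teams.items():
            if not members:
                raise ValueError(f"skills group '{category}' has no members")
        # One round-robin rota per skills group.
        self._rotas = {cat: cycle(members) for cat, members in self.teams.items()}

    def assign(self, category: str) -> str:
        """Return the single person who owns the next alert in this category."""
        if category not in self._rotas:
            raise KeyError(f"no skills group configured for '{category}'")
        return next(self._rotas[category])


# Hypothetical skills groups.
router = AlertRouter({"network": ["alice", "bob"], "server": ["carol"]})
print(router.assign("network"))  # → alice
print(router.assign("network"))  # → bob
print(router.assign("server"))   # → carol
```

Because every alert has exactly one owner, there is a named person for a NOC agent to follow up with, rather than a shared mailbox.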
4. Automate, wherever possible
With skilled expert support and an in-depth understanding of the technology, the remediation of issues that are prone to recurrence can be successfully automated.
For example, the restart of a software component on a critical finance server can be automated on detection of a process crash. In this example, the automation could become even more pre-emptive by parsing component log files for error conditions typically generated prior to a crash. Components can therefore be scheduled for a restart ‘out-of-hours’, preventing the crash event from happening at all.
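The pre-emptive variant of this example could be sketched as follows: scan recent log lines for known pre-crash warnings and, if any are found, plan a restart for the next out-of-hours window. The error patterns, the 02:00 maintenance window and the function names are assumptions made for illustration; the real values depend on the component in question.

```python
import re
from datetime import datetime, time, timedelta
from typing import Iterable, Optional

# Assumed examples of messages that tend to precede a crash.
PRE_CRASH_PATTERN = re.compile(r"OutOfMemory|connection pool exhausted", re.IGNORECASE)
# Assumed start of the out-of-hours maintenance window (02:00 local time).
MAINTENANCE_START = time(hour=2)


def needs_restart(log_lines: Iterable[str]) -> bool:
    """True if any log line matches a known pre-crash pattern."""
    return any(PRE_CRASH_PATTERN.search(line) for line in log_lines)


def next_maintenance_window(now: datetime) -> datetime:
    """Return the next occurrence of the maintenance start time after `now`."""
    candidate = datetime.combine(now.date(), MAINTENANCE_START)
    if candidate <= now:
        candidate += timedelta(days=1)
    return candidate


def plan_restart(log_lines: Iterable[str], now: datetime) -> Optional[datetime]:
    """Return when to restart the component, or None if no warning signs exist."""
    return next_maintenance_window(now) if needs_restart(log_lines) else None


sample_log = [
    "2024-01-10 14:58:01 INFO  batch 4412 posted",
    "2024-01-10 14:59:30 WARN  connection pool exhausted, retrying",
]
print(plan_restart(sample_log, datetime(2024, 1, 10, 15, 0)))  # → 2024-01-11 02:00:00
```

The sketch only decides when a restart should happen. Carrying out the restart, and alerting a person if it fails, is left to whatever scheduling and monitoring tooling is already in place.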
5. Ensure monitoring parameters are adaptable for change
Through the lifecycle of any given IT infrastructure component, its utilisation pattern will change, and sometimes its role will too – for instance, additional VPNs added to a firewall, leading to new ingress and egress points for networks behind it. Such changes require a review of threshold and alerting parameters for monitoring to remain ‘fit for purpose’.
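A simple way to spot thresholds that have drifted out of date is to compare each static threshold against a baseline calculated from recent utilisation. The sketch below uses the mean plus three standard deviations as that baseline, and flags a threshold for review when it differs from the baseline by more than 20%. Both figures are assumed rules of thumb rather than a SolarWinds method, and the function names are hypothetical.

```python
from statistics import mean, pstdev
from typing import Sequence


def suggested_threshold(samples: Sequence[float], sigmas: float = 3.0) -> float:
    """Baseline threshold: mean of recent samples plus `sigmas` standard deviations.

    Raises statistics.StatisticsError if `samples` is empty.
    """
    return mean(samples) + sigmas * pstdev(samples)


def threshold_needs_review(
    current: float, samples: Sequence[float], tolerance: float = 0.2
) -> bool:
    """True if the configured threshold is more than `tolerance` away from the baseline."""
    suggested = suggested_threshold(samples)
    return abs(current - suggested) > tolerance * suggested


# Hypothetical firewall CPU utilisation (%), before and after new VPNs were added.
before_vpns = [40, 42, 38, 41, 39]
after_vpns = [70, 72, 68, 71, 69]
configured_threshold = 45.0

print(threshold_needs_review(configured_threshold, before_vpns))  # → False
print(threshold_needs_review(configured_threshold, after_vpns))   # → True
```

In the second case the firewall's normal load now sits well above the old threshold, so it would alert constantly until the parameters were reviewed.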
Beyond tuning, incorporate a regular ‘big picture’ review to ensure that all opportunities to hit real-time resolution and pre-emption targets have been considered. Having done something well in the past does not guarantee that you are doing things well now, particularly as the features of your monitoring systems continually evolve.
About Comtact Ltd.
Supporting clients 24x7x365 from our ISO27001-accredited Network Operations Centre (NOC) in Northampton, Comtact Ltd. is the UK’s leading authority on SolarWinds, with a large in-house team of SolarWinds-Certified Professionals supporting and managing the network operations of some of the UK’s leading organisations.