Why is IT service monitoring so difficult?

Now, in 2023, why are so many still struggling to build a good set of monitoring capabilities for the services they deliver to end-users? At the end of the day, the main focus of IT is providing services to end-users: easily accessible, functional, available, and secure. In this blog post, I try to dig into the big ecosystem that monitoring has grown into over the last few years.

A lifetime ago, I worked a lot with different monitoring tools such as Nagios and SCOM (System Center Operations Manager) to get an overview of what was going on inside the IT environment. SCOM was used to monitor the Windows estate, some network devices, and VMware, while Nagios covered the Linux side. Using SCOM together with Squared Up, we had a good overview of the overall status of our services.

At that time, I worked at the service desk, so I needed an easily accessible overview of what was going on and what the status was; the service monitoring looked something like the dashboard below. If a service was down and I got a support ticket or a phone call, I could simply say that we had an ongoing incident and were working on it. For systems without proper monitoring, it was always a scramble: figure out who owned the system, whether we had documentation, and whether the person responsible knew what was going on or if something else was the issue.

A somewhat similar dashboard to what we had.

Sometimes the issue was the underlying infrastructure or even the network, but again, having proper monitoring in place saved me and my colleagues a lot of hours: we knew what the issue was, we knew that someone was working on it, and we knew which services were impacted.

When I eventually moved more towards infrastructure, I spent a lot of time fine-tuning the monitoring, because I knew it would save the service desk a lot of troubleshooting time and let me report on what really mattered: the availability of the service. It is important to note that it still generated a lot of noise, since some services were monitored by one team (such as the Oracle databases), while some of our Windows services depended on Oracle. This meant that if the Oracle cluster went down, the Windows services would fail too. So the bottom line is: it was not easy.

However, “back in the day” it was relatively easy to set up monitoring, since many of the systems were static: we knew what to monitor, and with limited changes to those systems, the monitoring platform and rules worked as intended for a long time.

Now, this was close to 15 years ago, and a lot has changed since then. With the adoption of the public cloud, cloud-native workloads, and more services embedded into the software stack (virtualized networking, storage, and so on), services have become far more distributed across public cloud and on-premises workloads. Surely the monitoring solutions are much better equipped now than before to monitor all these services across platforms, clouds, and devices?

In addition, the requirements have changed: we have more rapid change and new services constantly being introduced to the ecosystem, while the CxO level still worries about overall availability.

As the overall complexity has increased, configuring monitoring for these new services often feels like finding a needle in a haystack. As an example, you might have a set of predefined monitoring rules for a product, but with all the constant changes to the service, those rules can become outdated quickly, and you end up spending more time tuning alerts than getting actual value.

Another interesting dilemma is the use of SaaS services such as Office 365. A few years back, I was talking with a customer who was troubleshooting OneDrive in a Citrix environment. OneDrive was not working, and Citrix was getting the blame, but it turned out that Microsoft was having issues with Microsoft 365 at the time, which impacted OneDrive. The customer spent a lot of time troubleshooting the Citrix environment before they found the actual culprit. If they had only had some monitoring against Microsoft 365, they would have saved themselves a lot of headaches (and this has happened multiple times since).

In recent years, we have also seen a new trend growing: the focus on observability. Monitoring primarily involves gathering and analyzing data to track system performance and identify issues, focusing on long-term trends and alerting. Observability, on the other hand, is about understanding the internal state of systems through external data like logs, metrics, and traces, providing deeper insights into interdependencies and the root causes of issues.

Think about it this way: in the context of a car, monitoring would be like having a dashboard that displays information such as speed, fuel level, and engine temperature. This allows you to track the car’s performance and be alerted to potential issues like low fuel or overheating. Observability, on the other hand, would be akin to having a system that not only shows these basic metrics but also provides detailed insights into the car’s internal workings. It’s like having a diagnostic tool that can interpret data from various sensors throughout the car, enabling you to understand why a problem is occurring.

Within the observability ecosystem, we have new tools and products such as OpenTelemetry, Prometheus, ELK, and Datadog; the list goes on and on. As mentioned, much of the focus here is on digging into the logs and traces to figure out exactly why an application is not working or why a service is generating 404 errors. The older monitoring tools just focused on one question: is it up or down?
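
To make the distinction concrete, here is a minimal sketch in Python contrasting the two approaches: a classic up/down check versus emitting a trace span with the OpenTelemetry SDK. The URLs and span names are placeholders, not anything from a real environment.

```python
# A classic up/down check versus a trace span. Assumes the OpenTelemetry
# Python SDK is installed (pip install opentelemetry-sdk); the URLs and
# span names are placeholders.
import urllib.request

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Old-school monitoring answers one question: is it up or down?
def is_up(url: str) -> bool:
    try:
        return urllib.request.urlopen(url, timeout=5).status == 200
    except OSError:
        return False

# Observability: record a span so we can later ask *why* something was slow.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("demo")

with tracer.start_as_current_span("checkout") as span:
    span.set_attribute("http.url", "https://shop.example.com/checkout")
    # Real code would do the work here, with nested spans for each
    # downstream call (database, cache, payment gateway, ...).
```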

Another piece of the puzzle is DEM (digital experience monitoring): how can we monitor the availability of services and resources from an end-user perspective? This is another issue I faced many times working at the service desk. A user calls saying that an application is slow or not working; I check the monitoring tool and everything seems fine. They call again, I start digging more into it, and it turns out the user is sitting on a bad 3G connection with 400 ms latency. The application is working fine; it is just a bad user connection.
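
As a minimal sketch of what a synthetic end-user probe could look like (stdlib only; the URL and the 300 ms “slow” threshold are made up for illustration):

```python
# A synthetic end-user probe: measure what the user actually experiences.
# Stdlib only; the URL and the 300 ms "slow" threshold are made up.
import time
import urllib.request

URL = "https://app.example.com/login"  # hypothetical service endpoint
THRESHOLD_MS = 300                     # what we consider "slow" for a user

start = time.perf_counter()
try:
    with urllib.request.urlopen(URL, timeout=10) as resp:
        ok = resp.status == 200
except OSError:
    ok = False
elapsed_ms = (time.perf_counter() - start) * 1000

if not ok:
    print(f"DOWN from this location ({elapsed_ms:.0f} ms)")
elif elapsed_ms > THRESHOLD_MS:
    print(f"UP but slow: {elapsed_ms:.0f} ms - check the user's network path")
else:
    print(f"UP and healthy: {elapsed_ms:.0f} ms")
```

The point here is where the probe runs: from the home office or branch network where the users actually sit, not just from inside the data center.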

Consider the following scenario: the service desk is getting multiple requests from users saying that service X is slow or service Y is non-responsive. The incidents are coming from a set of users working from home offices. The backend service consists of multiple components that all need to work for the application to function.

This is the service stack from user to application.

End-user device – VPN – Firewall – Active Directory – SQL Server – Hypervisor – Storage – Entra ID – Kubernetes

While this could be a fictitious scenario, it was an actual case with a customer I worked with a while back. They spent many days troubleshooting what was going on, multiple teams were involved, and they had a good mix of different monitoring tools they could use to dig into the different components.

The issue? The underlying storage had multiple problems because of a faulty controller, which impacted the overall I/O throughput. The storage solution was new and was only used for some of the components in the Kubernetes cluster.
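
With a map of the chain, even a crude layer-by-layer check could have pointed at the storage much faster. A purely illustrative sketch, where the stub check simulates the faulty storage controller; in practice, each layer’s check would query whatever tool actually monitors it:

```python
# Walking the chain layer by layer to localize a fault. Purely
# illustrative: check_layer is a stub that simulates the faulty storage.
LAYERS = ["End-user device", "VPN", "Firewall", "Active Directory",
          "SQL Server", "Hypervisor", "Storage", "Entra ID", "Kubernetes"]

def check_layer(name: str) -> bool:
    # Stub: replace with a real per-layer check (ping, health API, SNMP, ...).
    return name != "Storage"  # simulate the faulty storage controller

for layer in LAYERS:
    if not check_layer(layer):
        print(f"First unhealthy layer: {layer}")
        break
else:
    print("All layers report healthy")
```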

Do you have insight into and an overview of what is going on across these different stacks, including the storage service? Probably not. What I tend to see is that some teams have infrastructure-based monitoring covering the main parts of the infrastructure (namely hardware, virtualization, and some generic OS capabilities), while observability platforms such as Jaeger, Fluentd, Prometheus, and Grafana cover the cloud-native side. Then there are also monitoring capabilities on the networking side, but those are often separate tools as well. Which tools and products are used tends to be decided per team or platform.

Here is an overview from vRealize Operations (or Aria) that provides insight from the virtualization layer.

Here we have Grafana and Jaeger, which are often used in the context of Kubernetes to monitor what is going on inside the web services running there.

While both are essential tools, they do not provide a full view of the other parts of the infrastructure that need to be working for the service to be available. And while it helps to have more insight into most parts of the infrastructure, it will not help if you do not have:

1: An information channel across teams – meaning platform teams get to know about underlying issues in the infrastructure.
2: Integrated service monitoring – to be able to monitor underlying components and group different components together into a service.

While the second part is usually done with an ITSM tool or an APM, it is hard to maintain. And usually you need information from the different components, including the network, storage, virtualization layer, machines, and services, to get the whole picture of what you want to monitor.
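
Conceptually, that service monitoring boils down to mapping a service to its components and aggregating their states, wherever those states come from. A rough sketch, with made-up component names and states:

```python
# Grouping components into a service and aggregating their states.
# The names and states are illustrative; in practice the states would be
# pulled from each team's monitoring tool via its API.
SERVICE_MAP = {
    "Webshop": ["Network", "Storage", "Hypervisor", "SQL Server", "Kubernetes"],
}

component_state = {
    "Network": "ok",
    "Storage": "degraded",  # the faulty controller from the example above
    "Hypervisor": "ok",
    "SQL Server": "ok",
    "Kubernetes": "ok",
}

for service, components in SERVICE_MAP.items():
    bad = [c for c in components if component_state.get(c, "unknown") != "ok"]
    status = "ok" if not bad else "impacted by " + ", ".join(bad)
    print(f"{service}: {status}")
```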

The example I wrote about earlier was focused on “reactive” monitoring, where you get notified once something is down. The big questions are: how long did it take before you were notified of the service outage, and how many users were impacted by it?

Outages are something we want to avoid as much as possible, since they impact our SLA and, depending on the kind of business and service, mean that our company loses money. According to some estimates, the average cost of an outage is $5,600 per minute (and that does not include the reputation you lose); at that rate, even a 30-minute outage costs $168,000. But maybe we can get notified of, or even fix, problems before they occur?

How long does it take your organization to get up and running again when services are down? To visualize this, I tend to borrow this picture from Splunk. Do your end-users know about an outage before you do?

This is where AIOps comes in, which you can view as a data platform for IT operations that uses ML to identify problems based on anomalies. The term AIOps comes from Gartner, and it describes how you can use ML to “predict” future outages, find root causes, and monitor across different technology stacks.
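
At its core, much of this is anomaly detection on metric streams. A minimal, hand-rolled sketch of the idea (real AIOps platforms use far more elaborate models; the latency numbers here are invented):

```python
# A hand-rolled rolling z-score: flag a sample that deviates strongly
# from its recent history.
from statistics import mean, stdev

def is_anomaly(history: list[float], sample: float, threshold: float = 3.0) -> bool:
    # Too little history, or a flat series, gives no baseline to judge by.
    if len(history) < 10 or stdev(history) == 0:
        return False
    z = abs(sample - mean(history)) / stdev(history)
    return z > threshold

latencies = [42, 40, 45, 41, 43, 39, 44, 42, 40, 43]  # ms, normal baseline
print(is_anomaly(latencies, 44))   # False - within normal variation
print(is_anomaly(latencies, 400))  # True - clearly something changed
```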

To get these kinds of monitoring capabilities in place, you either need to build a data lake where you integrate the different monitoring tools that you have, or use a third-party platform that provides these capabilities. In 2023, most traditional monitoring or APM tools have some form of AIOps capabilities built in, depending on where you want this functionality to live.

These are some of the vendors in the AIOps space. The biggest challenge with tools like this is that, to leverage ML algorithms to predict “future” events, you need consistent data in place. With modern cloud-based services this data changes frequently, which makes it difficult to get the desired effect out of them.

However, the ability to “group together” multiple monitoring tools in one place and get “one” incident instead of multiple incidents across multiple monitoring tools can be a big time saver. It also makes it easier to find the root cause of incidents.
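
The grouping itself can be as simple as correlating alerts on a shared key such as the affected host; real platforms also correlate on topology and time windows. A toy sketch with made-up tool, host, and message values:

```python
# Collapsing related alerts from multiple tools into one incident by
# grouping on a shared key (here: the affected host). All values are
# made up for illustration.
from collections import defaultdict

alerts = [
    {"tool": "Nagios",     "host": "sql01", "msg": "disk latency high"},
    {"tool": "vROps",      "host": "sql01", "msg": "datastore degraded"},
    {"tool": "Prometheus", "host": "web02", "msg": "5xx rate elevated"},
]

incidents = defaultdict(list)
for alert in alerts:
    incidents[alert["host"]].append(alert)  # group key: affected host

for host, grouped in incidents.items():
    tools = sorted({a["tool"] for a in grouped})
    print(f"Incident on {host}: {len(grouped)} alerts from {', '.join(tools)}")
```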

However, AIOps itself does not actually provide monitoring or observability; it is about transforming and making sense of the data that comes in from one or more monitoring and observability tools.

How do we actually solve monitoring in 2023?

To try and summarize this pretty long blog post, how do we actually solve monitoring in 2023?

Before, we could try to have one tool as a “one size fits all” solution, since it was mostly hardware and virtual machines. Now we have clouds, software-defined datacenters, short-lived microservices, and cloud-native technology where much of the networking stack is also software-based.

1: You cannot use one monitoring tool to cover everything. You need different tools at different levels. Some tools are GREAT at the networking stack, some tools are great for Kubernetes, but often not both. While having different monitoring tools requires more maintenance, it is still better than having no insight at all.
2: You need observability platforms for custom-built (cloud-native) services to provide logging and proper tracing mechanisms. Since these services can be short-lived, you need trends and history to be able to troubleshoot and deep-dive into the data.
3: You need a way to integrate the data sources, since an application or service to the end-users is not just the stuff running in Kubernetes. It is the storage, hypervisor, network, firewall, and then Kubernetes. How do you troubleshoot if services are not running? How can you optimize a service if you do not know how things are integrated?
4: If you are using public cloud, you need proper monitoring against it as well. While the cloud providers have decent uptime, their services can still go down, and having a small monitoring tool in place that can probe public cloud availability (see the sketch after this list) is going to save you a lot of time. I have lost count of how many times a little monitoring tool has notified me that Azure was down, and instead of spending a lot of time troubleshooting, I just wait until Microsoft fixes the issue on their end.
5: AIOps can be beneficial if the service supports the different monitoring tools that you have and you have a fairly distributed technology stack.
6: Observability platforms alone are not equipped for service monitoring, since they tend to focus on logs, metrics, and traces. From a service desk perspective, they are useful for deep-diving into what might be the issue when stuff breaks.
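
As for point 4, such a probe does not need to be fancy. A minimal, stdlib-only sketch, where the endpoints are illustrative; in practice you would probe your own resources in each cloud region and run this on a schedule (cron or similar):

```python
# A tiny public cloud availability probe, stdlib only. The endpoints
# are illustrative placeholders, not a recommendation.
import urllib.request

ENDPOINTS = {
    "azure-portal": "https://portal.azure.com",          # illustrative
    "our-app":      "https://myapp.example.com/health",  # hypothetical
}

for name, url in ENDPOINTS.items():
    try:
        status = urllib.request.urlopen(url, timeout=10).status
        print(f"{name}: HTTP {status}")
    except OSError as err:
        print(f"{name}: unreachable ({err})")
```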
