There is a lot of buzz around AIOps these days, from Gartner, many of the cloud providers, and now the APM vendors as well, who are all pushing it into the market. To be honest, it took me some time to understand what AIOps actually is and which capabilities are part of the definition, so I wanted to use this post to elaborate on what it can provide and how to separate it from the marketing terms that the different vendors put in front of us on Google.
The term itself originated from Gartner, but it is increasingly used by vendors of automation, APM and ITSM products, so it is clear that most vendors are pushing this as their new sales pitch. That makes it kind of hard to understand what AIOps actually is and what the vendors are really offering in terms of AIOps value.
Searching for "What is AIOps" on Google, where vendors like Splunk, Dynatrace, LogicMonitor and BMC are really visible.
AIOps is essentially a combination of the monitoring tools, the IT infrastructure and the ITSM service, integrated with an automation tool, where the brains behind it are powered by machine learning and a big data service. Sounds easy? Think of it as a way to provide intelligent filtering of incidents using machine learning, and even self-healing capabilities using an automation framework. Well, that is at least what it promises. You can also look at it this way: up until now, ITSM and monitoring have been mostly reactive, based upon incidents, while AIOps aims to provide a proactive and even self-healing IT infrastructure. Sounds like magic?
AIOps overview from Gartner
Before I go into AIOps, I want to reflect on where most organizations are coming from. Many come from an ITIL-based IT organization with an ITSM product, such as ServiceNow or BMC Helix, which brings with it certain processes and routines for incident management, as well as the knowledge base that is part of most ITSM tools.
Most of the monitoring tools that IT departments use are then linked or integrated with this ITSM tool to provide a centralized view for following up on incidents and problems. Many have a lot of different monitoring tools covering different parts of the stack, such as networking, OS monitoring, application monitoring, cloud-based monitoring and even security monitoring. What the integration provides is mostly a centralized view of alerts and incidents from the different monitoring solutions. A simplified overview of this setup looks like this:
Traditional ITSM tooling and Operations Stack (Traditional ITops)
Here we have different monitoring tools covering different parts of the application/infrastructure stack, which then integrate into the ITSM tool, where different parts of the organization manage incidents and alerts depending on role and responsibility. We typically also have a knowledge base for documentation and known issues. And of course we have the end users, who can report incidents through the ITSM tooling or through email, chat or phone.
Many have also invested in APM (Application Performance Monitoring) products or other monitoring tools that cover a wider scope, making it easier to get to an RCA (Root Cause Analysis) when an issue occurs instead of having multiple alerts from different monitoring tools. As an example, when you have a network outage somewhere you might get alerts from multiple monitoring tools at the same time, while an APM product might provide more “noise” reduction, since it can see more of the picture and can therefore determine that Application X is unavailable because Network Y is down.
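To make the noise-reduction idea a bit more concrete, here is a minimal sketch of grouping simultaneous alerts under one probable root cause. It assumes we already have a dependency map between components; the map, the component names and the `root_causes` function are hypothetical and only for illustration, not how any particular APM product works.

```python
from collections import defaultdict

# Hypothetical dependency map: each component lists what it directly depends on.
DEPENDS_ON = {
    "application-x": ["database-z", "network-y"],
    "database-z": ["network-y"],
    "network-y": [],
}


def root_causes(alerting, depends_on=DEPENDS_ON):
    """Group alerting components under the alerting dependency furthest upstream."""
    alerting = set(alerting)

    def find_root(component, seen):
        # Follow the first dependency that is also alerting; stop when none is.
        seen.add(component)
        for dep in depends_on.get(component, []):
            if dep in alerting and dep not in seen:
                return find_root(dep, seen)
        return component

    grouped = defaultdict(list)
    for component in sorted(alerting):
        grouped[find_root(component, set())].append(component)
    return dict(grouped)


if __name__ == "__main__":
    # Three tools alert at once, but they collapse into one probable root cause.
    print(root_causes(["application-x", "database-z", "network-y"]))
```

With three alerts coming in at the same time, the operator would see a single group rooted at `network-y` instead of three separate incidents. A real product would of course have to discover the dependency map itself, which is a large part of the hard work.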
So what is happening from a technology perspective that is driving the move to AIOps? For some background, I recommend listening to this podcast by Rackspace on how they adopted AIOps: https://www.rackspace.com/solve/how-we-got-aiops
IT is becoming more and more crucial to the day-to-day business of most companies, which requires that we keep downtime of services to a minimum. When downtime does occur, we need to be able to determine the root cause as soon as possible to get the service up and running again. In addition, the IT infrastructure is becoming more and more complex with the use of cloud services across different providers, which also makes us a lot more reliant on the third-party vendors and APIs/integrations that now make up our ecosystem. On the other side, we have more and more developers moving to a DevOps-based approach and even SRE, where the developers themselves are more responsible for managing and maintaining their own applications and underlying services. This means that developers can do more automated deployments at a more rapid pace, which is not something ITIL was originally made for; instead, changes that used to go through ITIL-based processes are now automated.
In addition, as businesses move to the cloud, new services are being built, which generates more data and new requirements in terms of monitoring as we move to containers or PaaS-based services.
So looking back at the original picture, what is missing from the overview above in terms of AIOps? Based upon the emerging trends, we need a system that can not only identify that something is wrong, but also pinpoint the true root cause. The missing pieces are:
- Automation Layer – As part of incident handling you should have some form of automation, so that when an issue occurs it can be resolved automatically. As an example, when an end user creates a specific incident such as a password reset, there should be a predefined automation framework that runs a specific runbook to resolve the issue, instead of involving the helpdesk in tasks that can be automated. This automation layer can consist of different automation frameworks, some API driven and some RPA driven. It can also depend on user interaction, where you might have Virtual Agent driven automation (or self-service automation) as well.
- Big Data Platform – Collecting data from the different data sources, such as monitoring tools, APM tools, event logs and performance metrics, requires a data lake. It is needed to store the vast amount of data being generated over time, because the machine learning layer requires that data to build a set of models that predict what will happen over time.
- Machine Learning Layer – This layer uses the data collected by the Big Data Platform to correlate events and find patterns. This allows the product or service to understand why Service X is down when Network Y is down, and also to predict what is going to happen when certain data appears. You can see this as providing proactive maintenance to IT operations.
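The three layers above can be sketched together in a few lines. This is a toy illustration under stated assumptions, not a real AIOps implementation: a plain list of historical values stands in for the data lake, a simple standard-deviation check stands in for the machine learning models, and the runbook names and the `handle` and `is_anomaly` functions are all hypothetical.

```python
from statistics import mean, stdev


# Automation layer: hypothetical runbooks, keyed by incident type.
def reset_password(incident):
    return f"password reset for {incident['user']}"


def restart_service(incident):
    return f"restarted {incident['service']}"


RUNBOOKS = {"password_reset": reset_password, "service_slow": restart_service}


def is_anomaly(history, value, threshold=3.0):
    """Flag a value that is more than `threshold` standard deviations from the mean."""
    if len(history) < 2:
        return False  # not enough collected data to judge
    spread = stdev(history)
    if spread == 0:
        return value != mean(history)
    return abs(value - mean(history)) / spread > threshold


def handle(incident):
    """Run the matching runbook, or hand over to a human if none exists."""
    runbook = RUNBOOKS.get(incident["type"])
    if runbook is None:
        return "escalated to helpdesk"
    return runbook(incident)


if __name__ == "__main__":
    # "Data lake": response times in ms collected over time (made-up numbers).
    history = [100, 102, 98, 101, 99]
    if is_anomaly(history, 180):
        print(handle({"type": "service_slow", "service": "application-x"}))
    print(handle({"type": "password_reset", "user": "alice"}))
    print(handle({"type": "printer_on_fire"}))
```

The point of the sketch is the flow: collected data feeds a model, the model raises an incident before a user does, and a predefined runbook resolves it without involving the helpdesk. Anything without a runbook still ends up with a human.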
Of course, these capabilities need to be tied together with the ITSM product, to ensure that we have reporting, visibility and knowledge of what is going on, and to get even better at updating our machine learning models based upon the data that is being collected.
So where should we place the AIOps capabilities: on the monitoring side, on the ITSM side, in between, or a combination? Most vendors within the APM space, the event/logging space and the ITSM space are now pushing in an AIOps direction.
Looking at big vendors like Dynatrace from an APM perspective, they are adding AIOps capabilities to their monitoring stack, based upon what they call "AIOps done right". You can read their argumentation here: https://www.dynatrace.com/news/blog/aiops-done-right-introducing-the-next-generation-of-software-intelligence/
Then we have ServiceNow, which also has its own AIOps modules (https://www.servicenow.com/content/dam/servicenow-assets/public/en-us/doc-type/resource-center/data-sheet/ds-itom-health.pdf). Unlike Dynatrace, ServiceNow lets you plug in other monitoring tools and works on the data it collects from them. ITOM Health consolidates data from monitoring tools, normalizing, deduplicating, filtering, and correlating events to generate alerts.
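As a rough illustration of what normalizing, deduplicating and filtering events can mean in practice, here is a small sketch. It is not ServiceNow's implementation or API; the field names, the severity scale and the `normalize` and `to_alerts` functions are assumptions made up for this example.

```python
# Assumed severity scale, lowest to highest.
SEVERITY = {"info": 0, "warning": 1, "critical": 2}


def normalize(raw, source):
    """Map tool-specific field names onto one shared schema."""
    return {
        "source": source,
        "host": (raw.get("host") or raw.get("hostname") or "unknown").lower(),
        "check": (raw.get("check") or raw.get("metric") or "unknown").lower(),
        "severity": (raw.get("severity") or raw.get("level") or "info").lower(),
    }


def to_alerts(events, min_severity="warning"):
    """Filter out low-severity events and merge duplicates into one alert each."""
    alerts = {}
    for event in events:
        # Filtering: drop anything below the threshold (unknown levels count as info).
        if SEVERITY.get(event["severity"], 0) < SEVERITY[min_severity]:
            continue
        # Deduplication: events for the same host and check become one alert.
        key = (event["host"], event["check"])
        alert = alerts.setdefault(key, {
            "host": event["host"],
            "check": event["check"],
            "severity": event["severity"],
            "sources": [],
            "count": 0,
        })
        alert["count"] += 1
        if event["source"] not in alert["sources"]:
            alert["sources"].append(event["source"])
        # Keep the highest severity seen for this alert.
        if SEVERITY[event["severity"]] > SEVERITY[alert["severity"]]:
            alert["severity"] = event["severity"]
    return list(alerts.values())


if __name__ == "__main__":
    events = [
        normalize({"hostname": "WEB01", "metric": "CPU", "level": "Critical"}, "tool-a"),
        normalize({"host": "web01", "check": "cpu", "severity": "warning"}, "tool-b"),
        normalize({"host": "web01", "check": "disk", "severity": "info"}, "tool-a"),
    ]
    for alert in to_alerts(events):
        print(alert)
```

Here two tools reporting the same CPU problem with different field names end up as a single alert, and the informational disk event is dropped. Correlating the resulting alerts across hosts and services is the harder step that the sketch leaves out.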
Then we have some vendors that deliver an AIOps platform that sits in between the monitoring tools and the ITSM tools, such as BigPanda, which correlates data from the different monitoring solutions before it reaches the ITSM services.
Another example is Moogsoft, which Rackspace is using.
It also sits in between the ITSM tools and the monitoring tools. So it is interesting to see how this AIOps market is now moving forward, with pushes from the application monitoring vendors, the event and security logging vendors and even the ITSM vendors. But in the end I believe that the vendors that get the furthest in this segment are those that can adapt and integrate across multiple vendors and data sources, and utilize an automation framework.