A Bayesian Network approach to diagnosing the root cause of failure from Trouble Tickets

Telecommunication networks consist of thousands of hardware elements of widely varying kinds: servers, routers, modems, switching units, cables, base stations, cooling elements, energy elements, etc. Many of the possible relations between elements are not explicitly defined, especially when the elements are heterogeneous. For example, there is a clear relation between the cooling system that controls the temperature of a room and all the hardware installed in that room: a failure in the former will probably degrade the operation of the hardware or cause it to break down. Another typical case is a fiber cut that in turn disconnects many dependent mobile base stations, affecting many customers. While this information is explicitly modelled for certain types of hardware elements, for many others it is not. This makes it difficult to use traditional programming solutions to automatically link failures to a root cause [1].

Incident management systems, usually known as trouble ticketing (TT) systems, let the technician link an incident on one element to another existing incident, creating a child-parent relation. Identifying these situations quickly therefore depends on expert knowledge. Often a root cause is recognized as the real problem only after several similar incidents have appeared and technicians have spent much time and many resources analyzing them all. In the meantime, many customers may have their services partially or fully affected. To make matters worse, root causes are precisely the incidents that tend to be harder to detect and to affect more customers. The inverse direction is also important, i.e. predicting how a failure in one element can affect other elements, so that the real scope of the problem can be evaluated as soon as possible [2, 3].

Different AI approaches are applied to a variety of problems in the Telco area, mostly churn detection [5] and forecasting [6]. However, incident management systems have hardly used AI techniques to optimize the processes involved. A lot of work has been dedicated to supervision in order to prevent or detect problems as soon as possible, but little or nothing has been done for the stage after the incident has been created. In most cases the optimizations are limited to more or less sophisticated decision rules, or to searching knowledge bases for similar previous cases, which is known as case-based reasoning [4]. These approaches may be suitable for simple problems but are not adequate for complex or changing environments. In recent years, machine learning techniques have started to be applied in the context of TT systems to discover information and automate certain tasks [7-9].

An example of complexity is the case of TT systems that manage incidents from large heterogeneous networks under difficult conditions: an incident could affect an important service offered to hundreds of people, thousands of incidents may appear every day, and the topology of the network is complex. Under these conditions, decisions cannot be delayed and actions must be carried out right away.

In recent years, with the development of efficient computational algorithms, Bayesian networks (BNs) have had a revival within the AI community. The causal semantics of BNs allow causal relationships between variables to be represented. BNs model the quantitative strength of the connections between variables, allowing probabilistic beliefs about them to be updated automatically as new information becomes available. This enables probabilistic inference and reasoning under uncertainty, known as Bayesian reasoning. Because BNs provide full representations of probability distributions over their variables, they can be conditioned on any subset of them, supporting any direction of reasoning: for example, diagnostic reasoning, which goes from observed symptoms to causes, or predictive reasoning, which goes from new information about causes to new beliefs about effects [10, 11].
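The two directions of reasoning can be illustrated with a minimal sketch of a two-node network, cooling failure to hardware failure. The variable names and all probability values below are assumptions chosen for illustration; they are not taken from the article or from any real TT system.

```python
# Hypothetical two-node network: CoolingFailure -> HardwareFailure.
# Every probability here is an illustrative assumption.

P_COOLING = 0.01              # prior P(cooling failure)
P_HW_GIVEN_COOLING = 0.90     # P(hardware failure | cooling failure)
P_HW_GIVEN_NO_COOLING = 0.02  # P(hardware failure | no cooling failure)


def predictive(cooling_failed: bool) -> float:
    """Predictive reasoning: belief in the effect once the cause is known."""
    return P_HW_GIVEN_COOLING if cooling_failed else P_HW_GIVEN_NO_COOLING


def marginal_hardware_failure() -> float:
    """P(hardware failure), summing over both states of the cause."""
    return (P_HW_GIVEN_COOLING * P_COOLING
            + P_HW_GIVEN_NO_COOLING * (1.0 - P_COOLING))


def diagnostic() -> float:
    """Diagnostic reasoning: P(cooling failure | hardware failure), by Bayes' rule."""
    return P_HW_GIVEN_COOLING * P_COOLING / marginal_hardware_failure()
```

With these assumed numbers, observing a hardware failure raises the belief in a cooling failure from the 1% prior to about 31%, which is the kind of belief update from symptom to cause that diagnostic reasoning refers to.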

All this makes BNs a good AI technique for the problem addressed in this article, finding the root cause of an incident [12, 13].

The aim of this work is to maintain a real-time directed graph of interconnected elements, in which each node indicates the current probability that the element has an incident, and an edge from element A to element B shows the current probability that a failure in A is the cause of a failure in B.
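One possible in-memory shape for such a graph is sketched below. The class name `ElementGraph`, its methods, and the element identifiers used with it are hypothetical; the sketch only stores and queries the probabilities and does not show how the article's method computes or updates them.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


def _check_probability(value: float) -> float:
    """Reject values outside [0, 1]."""
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"not a probability: {value}")
    return value


@dataclass
class ElementGraph:
    # element -> current probability that the element has an incident
    incident_prob: Dict[str, float] = field(default_factory=dict)
    # (cause, effect) -> current probability that a failure in `cause`
    # is the cause of a failure in `effect`
    cause_prob: Dict[Tuple[str, str], float] = field(default_factory=dict)

    def set_node(self, element: str, prob: float) -> None:
        self.incident_prob[element] = _check_probability(prob)

    def set_edge(self, cause: str, effect: str, prob: float) -> None:
        self.cause_prob[(cause, effect)] = _check_probability(prob)

    def candidate_causes(self, element: str) -> List[Tuple[str, float]]:
        """Diagnostic view: incoming edges of `element`, most probable first."""
        found = [(a, p) for (a, b), p in self.cause_prob.items() if b == element]
        return sorted(found, key=lambda item: item[1], reverse=True)

    def affected_elements(self, element: str) -> List[Tuple[str, float]]:
        """Predictive view: outgoing edges of `element`, most probable first."""
        found = [(b, p) for (a, b), p in self.cause_prob.items() if a == element]
        return sorted(found, key=lambda item: item[1], reverse=True)
```

Keeping both views on the same edge set mirrors the two directions discussed earlier: `candidate_causes` looks backwards from a failing element to possible root causes, and `affected_elements` looks forwards to estimate the scope of a failure. Scanning all edges on each query is a simplification; a system handling thousands of incidents a day would presumably index edges by source and target.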

This article describes how Bayesian networks combined with classification algorithms can be used in telecommunications networks to address the aforementioned problems. Since this is practical work intended for a real TT system in which several thousand incidents are created daily, the solution pays particular attention to performance: response time is critical, because even very good results are of no use once the incidents have already been solved.

(Author: Fco. Javier Molinero Velasco

Published by Sciedu Press)