Preventing network outages in a 5G world

Chris Neisinger, CTO at Guavus.

Every year, mobile network operators (MNOs) spend nearly a quarter of their revenue on network management and maintenance. They have little choice.

Network outages annoy customers. That is not just bad for an MNO’s reputation; it is also potentially ruinous for its bottom line.

Why? Because downtime is one of the biggest drivers of churn. In one study (of cable customers), 60 percent of subscribers who had churned cited network performance as their main reason for quitting. Among ‘conditional churners’ – those considering leaving – 75 percent reported network issues.

In addition to lost customers and damaged credibility, network outages can also lead to costly employee overtime and possibly even penalties for not meeting service level agreements (SLAs).

All of this explains why one study estimated that network outages cost the world’s MNOs around $15 billion a year. That study, however, was carried out in 2013. Today, the stakes are much higher, because traditional telecommunications carriers have become Internet service providers. Thanks to the widespread adoption of 4G, people and enterprises now depend on mobile networks for their connectivity.

So a network outage no longer leads merely to dropped phone calls but, potentially, to the failure of critical enterprise services. Now, as the industry prepares to evolve to standalone (SA) 5G, the volume of data signals generated by the network is set to escalate yet again.

This raises the question: how can MNOs predict and fix network faults in such a complex environment filled with so much ‘noise’?

For more forward-looking operators, the solution lies in handing over diagnostics to advanced analytics systems that use machine learning (ML) to:

•    Accurately analyze millions of network alarms 
•    Instantaneously identify alarms most likely to lead to a fault
•    Discover alarm relationships and alarm families 
•    Eliminate ‘noisy’ alarms 

Before we dive deeper into ML-based analytics systems, let’s look at how the majority of today’s network fault detection systems work.

Network diagnostics now: people, rules and alarms

Every day, the world’s MNOs engage in a battle to keep their networks operational. Unfortunately, there is plenty that can go wrong. In addition to technical faults (physical link failures, traffic congestion/overload, chip failures) there are other, more unpredictable factors – from cyberattacks to thunderstorms.

When something in the system does fail, it shows up as an anomaly in the network data. So MNOs put in place monitoring systems that trigger alarms when these anomalies occur. Network Operation Centers (NOCs) staffed by teams of human analysts then scrutinize this data and try to answer three key questions:

•    What happened to the network?
•    Why did it happen?
•    What will happen next?

The problem with this human-centered approach is obvious. NOC agents can only process a limited amount of information. They have to prioritize their attention. For this reason, they ‘mute’ the majority of the alarms they receive. In so doing they risk ignoring small problems that might develop into bigger faults and bring the network down.

Most fault detection systems use rules that were defined and tuned by experts based on their experience. But these systems can only find what they are looking for. And they rely on people to update them. The problem is that humans are unreliable. Engineers write things down and maybe don’t pass them on. Meanwhile, rules get too complex to maintain, or they change, and then you get false alarms.

Also, network engineers can’t look at every signal, so they put in silencing features. As a result, they miss things. They look through their logs after an outage, and often find a signal from weeks before that they muted.

Clearly, these human-centered alarm-based systems are struggling to cope with the volume of signals generated by today’s networks. However, things are about to get vastly more challenging. Standalone 5G is coming – and it will drastically increase network complexity yet again.

Standalone 5G: an explosion in network data

How much greater is the challenge of analyzing the data emerging from standalone 5G networks than the data from their 4G predecessors? The simple answer is: exponentially.

Why? Because the ‘standalone’ 5G Core is a new kind of network.  It is virtual – with foundational technologies (Network Function Virtualization and Software Defined Networking) that turn physical network components into software. 

Virtualization will make 5G networks much bigger and able to support millions more connections. MNOs can use this extra capacity to allocate bandwidth to enterprises. Effectively, this network slicing gives private companies the ability to run their own mini-networks. 

All of this vastly escalates the amount of data generated by the network as a whole. And to further complicate matters, the vast majority of connected 5G devices will be machines. So when there is an outage, these devices will not be able to report a problem.

People also lack perspective when it comes to 5G. As an analogy, they think: well, I can already ride a bicycle, so now I am going to ride a motorcycle. But this move from 4G to 5G is more like going from a bicycle to stepping into the cockpit of an Airbus plane. The sheer amount of data and telemetry involved is orders of magnitude greater.

The exponential increase in complexity is compelling operators to consider a new automated approach to network analytics. In the past, MNOs have used rules to solve problems.  But the truth is, rules-based systems won’t work in a 5G world where there are millions of elements rather than hundreds. In this world, you need systems that can perform advanced network analytics. They are no longer a ‘nice to have.’ They are mandatory if you are going to efficiently run your network.

How to mitigate the risks: ML-based fault detection

In the world of network fault analytics, everything comes back to three acronyms:

•    MTTA (mean-time-to-acknowledge)
•    MTTD (mean-time-to-diagnose)
•    MTTR (mean-time-to-resolve) 

Mobile network operators have to reduce these numbers if they want to reduce the number of damaging outages.
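To make the three metrics concrete, here is a minimal Python sketch that computes them from a couple of incident records. The records, the field names and the choice to measure every interval from the moment the alarm was raised are illustrative assumptions; operators may define the start and end points differently.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records (made-up timestamps, illustrative field names).
incidents = [
    {"raised": "2024-01-01T10:00:00", "acknowledged": "2024-01-01T10:05:00",
     "diagnosed": "2024-01-01T10:35:00", "resolved": "2024-01-01T11:00:00"},
    {"raised": "2024-01-02T14:00:00", "acknowledged": "2024-01-02T14:15:00",
     "diagnosed": "2024-01-02T15:00:00", "resolved": "2024-01-02T16:00:00"},
]

def mean_minutes(records, end_field, start_field="raised"):
    """Average number of minutes between start_field and end_field."""
    gaps = [
        (datetime.fromisoformat(r[end_field])
         - datetime.fromisoformat(r[start_field])).total_seconds() / 60
        for r in records
    ]
    return mean(gaps)

# Assumption: all three metrics are measured from the time the alarm was raised.
mtta = mean_minutes(incidents, "acknowledged")
mttd = mean_minutes(incidents, "diagnosed")
mttr = mean_minutes(incidents, "resolved")

print(f"MTTA={mtta} min, MTTD={mttd} min, MTTR={mttr} min")
```

Tracking these averages over time is one simple way to see whether a change in tooling is actually shortening the path from alarm to fix.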

However, we have already established how difficult it is for rules-based systems to handle the sheer volume of alarms generated by today’s mobile networks. For this reason, many forward-looking MNOs now use ML-based probabilistic algorithms to do the work instead. These systems carry out the task of monitoring network activity without human intervention. They can manage millions of alarms simultaneously. Over time, they can identify which to act upon and which to ignore. 

Here are four ways in which ML-based analytics systems produce better results.

1. They escalate alarms for predicted incidents

Probabilistic algorithms prioritize alarms that have a high probability of leading to network incidents. Typically, this is just 10 percent of all alarms. Engineers can then resolve these issues without relying on network inventory, topology or static rules.
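As a rough illustration of the idea, the Python sketch below estimates, for each alarm type, how often it was followed by an incident in the past, and escalates only the types above a threshold. This simple frequency count is a stand-in for a trained ML model, not a description of any vendor's algorithm; the alarm names, the history and the 0.5 threshold are all hypothetical.

```python
from collections import Counter

# Hypothetical history of (alarm_type, led_to_incident) observations.
history = [
    ("LINK_DOWN", True), ("LINK_DOWN", True), ("LINK_DOWN", False),
    ("HIGH_TEMP", False), ("HIGH_TEMP", False),
    ("HIGH_TEMP", True), ("HIGH_TEMP", False),
    ("CPU_SPIKE", False), ("CPU_SPIKE", False),
]

def incident_rates(observations):
    """Fraction of past occurrences of each alarm type that led to an incident."""
    seen, bad = Counter(), Counter()
    for alarm_type, led_to_incident in observations:
        seen[alarm_type] += 1
        if led_to_incident:
            bad[alarm_type] += 1
    return {alarm_type: bad[alarm_type] / seen[alarm_type] for alarm_type in seen}

def escalate(alarm_types, rates, threshold=0.5):
    """Keep alarms whose estimated incident probability meets the threshold.

    Alarm types never seen before default to 0.0 here; a real system
    would more likely flag them for review than ignore them.
    """
    return [a for a in alarm_types if rates.get(a, 0.0) >= threshold]

rates = incident_rates(history)
escalated = escalate(["CPU_SPIKE", "LINK_DOWN", "HIGH_TEMP"], rates)
print(escalated)
```

With this made-up history, only the link failure alarm clears the threshold, so it is the only one an engineer would be asked to look at.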

2. They reduce the noise of low-impact alarms

Conversely, ML-based analytics systems learn over time which alarms don’t indicate serious problems. They deprioritize them. They can also suppress any alarm related to scheduled maintenance events.
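The maintenance case is the easiest to picture. The Python sketch below drops any alarm raised by a network element while that element is inside a scheduled maintenance window. The element names, the window and the alarm records are hypothetical, and a production system would read them from a change-management calendar rather than a hard-coded dictionary.

```python
from datetime import datetime

# Hypothetical maintenance schedule: element -> list of (start, end) windows.
maintenance_windows = {
    "cell-042": [(datetime(2024, 3, 1, 1, 0), datetime(2024, 3, 1, 3, 0))],
}

def in_maintenance(element, timestamp, windows):
    """True if the element has a maintenance window covering the timestamp."""
    return any(start <= timestamp <= end
               for start, end in windows.get(element, []))

def suppress_maintenance_alarms(alarms, windows):
    """Return only the alarms that were not raised during planned maintenance."""
    return [a for a in alarms
            if not in_maintenance(a["element"], a["time"], windows)]

alarms = [
    {"element": "cell-042", "type": "LINK_DOWN", "time": datetime(2024, 3, 1, 2, 0)},
    {"element": "cell-042", "type": "LINK_DOWN", "time": datetime(2024, 3, 1, 4, 0)},
    {"element": "cell-007", "type": "LINK_DOWN", "time": datetime(2024, 3, 1, 2, 0)},
]

remaining = suppress_maintenance_alarms(alarms, maintenance_windows)
print(len(remaining))
```

Only the first alarm is suppressed: the second falls outside the window, and the third comes from an element with no maintenance scheduled.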

3. They consolidate alarms

Sometimes a single event can trigger multiple alarms. ML-based analytics systems can consolidate them into one. This avoids having engineers waste time investigating multiple trails. 
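One simple way to picture consolidation is to group alarms from the same network element that arrive close together in time. The Python sketch below does exactly that with a fixed time window; it is a hand-written heuristic used for illustration, whereas an ML-based system would learn the groupings from data. The element names, alarm types and 60-second window are assumptions.

```python
def consolidate(alarms, window_seconds=60):
    """Group alarms from one element that fall within window_seconds
    of the first alarm in the group."""
    groups = []
    latest_group = {}  # element -> most recently opened group for that element
    for alarm in sorted(alarms, key=lambda a: a["time"]):
        group = latest_group.get(alarm["element"])
        if group is not None and alarm["time"] - group["first_seen"] <= window_seconds:
            group["alarms"].append(alarm["type"])
        else:
            group = {"element": alarm["element"],
                     "first_seen": alarm["time"],
                     "alarms": [alarm["type"]]}
            groups.append(group)
            latest_group[alarm["element"]] = group
    return groups

# Hypothetical alarms; "time" is seconds since the start of the log.
alarms = [
    {"element": "router-1", "type": "LINK_DOWN", "time": 0},
    {"element": "router-1", "type": "BGP_LOSS", "time": 10},
    {"element": "router-2", "type": "HIGH_TEMP", "time": 20},
    {"element": "router-1", "type": "PACKET_LOSS", "time": 30},
    {"element": "router-1", "type": "LINK_DOWN", "time": 500},
]

groups = consolidate(alarms)
print(len(groups))
```

Here five raw alarms collapse into three items to investigate, and the three alarms that router-1 raised within half a minute are presented as one.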

4. They reveal relationships between alarms

Similarly, ML-based analytics systems can group related alarms into families for further root cause analysis.
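A basic way to surface such relationships is to count how often two alarm types appear in the same short time span. The Python sketch below buckets alarms into fixed intervals and counts co-occurring pairs. Co-occurrence counting is a deliberately simple stand-in for the statistical models a real analytics system would use, and the alarm names and 60-second bucket are hypothetical.

```python
from collections import Counter
from itertools import combinations

def co_occurrences(alarms, bucket_seconds=60):
    """Count how often each pair of alarm types appears in the same time bucket."""
    buckets = {}
    for timestamp, alarm_type in alarms:
        buckets.setdefault(timestamp // bucket_seconds, set()).add(alarm_type)
    pairs = Counter()
    for types in buckets.values():
        # Sort so that each pair is always counted under the same key.
        for pair in combinations(sorted(types), 2):
            pairs[pair] += 1
    return pairs

# Hypothetical alarm log: (seconds since start, alarm type).
alarms = [
    (5, "LINK_DOWN"), (20, "BGP_LOSS"),
    (70, "LINK_DOWN"), (100, "BGP_LOSS"), (110, "HIGH_TEMP"),
    (200, "HIGH_TEMP"),
]

pairs = co_occurrences(alarms)
print(pairs.most_common(1))
```

Pairs that keep turning up together are candidates for the same alarm family, which gives engineers a narrower starting point for root cause analysis.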

Of course, the ultimate pay-off of advanced analytics solutions is that they are self-healing – and can even anticipate faults before they occur. These models are continuously training themselves – and they get more accurate over time. So let’s say the operator adds a new cell site. This changes the network architecture, and it means a new model needs to be created. An ML-based analytics system will automatically adjust itself. Within days it will be very accurate again. And no person needs to be involved.

Chris Neisinger
CTO at Guavus

Chris Neisinger is CTO at Guavus, a Thales company and leader in AI/ML-driven analytics for CSPs. Neisinger is responsible for defining and driving the company’s technology roadmap and customer engagement strategy. He leads internal development, external partnerships and multi-vendor solutions to leverage the power of distributed data analytics. Prior to Guavus, he was Executive Director of Services Architecture at Verizon, where he leveraged emerging technologies and network intelligence to build architectures that enabled new business opportunities and created new sources of revenue.