‘Check, recheck, are we safe to go ahead?’ It was our job to act with due diligence: plan well, test well, implement with caution. We had to know when to go ahead and when to back out. That’s what telecom engineers are trained to do as they work on the ‘big pipes’ that route internet data around the nation and around the world.

Adobe Stock - By Royyimzy

I can’t tell you the number of nights I’ve sat in dim, cold rooms in obscure locations, laptop plugged into a control module of the internet superhighway. No matter how cold, tired, or wired on cheap coffee we were, the responsibility we carried had been drilled into us. Cut a finger and you bleed, but stab an artery and you haemorrhage. It’s like that working the big pipes.

 

For frontline engineers, there is immense personal responsibility, backed up by layers of team responsibility, with every participant contributing a vital part to the stability and resilience of critical infrastructure.  There are few that ‘know it all’ in telecoms.  Generalists rely on experts.  Experts achieve little without teams of knowledgeable doers at the frontline of operations. It only works because of planning, coordination, rigorous process, incredibly smart people doing what they do best, and very practical people bringing it all together with precision.

 

Last week’s Optus outage was a true ‘poor bugger’ moment for those of us who have been there. We’ve all feared it, especially the idea of being the engineer who hit ‘go’ when it all went wrong. The cause of the Optus network failure remains unclear; it may have been a third-party network or a faulty upgrade. Regardless, networks are built with redundancy and failure modes that minimise both the likelihood and the scale of outages. But it’s never perfect.

 

Engineers pursue the holy grail of “five nines” (99.999%) reliability, which translates to just 5.26 minutes of annual downtime, an impressively high bar to clear. In practice, many telcos offer a 99.9% uptime undertaking, which equates to roughly 8.76 hours of permissible annual downtime. Optus blew through that allowance in a single day. The duration of the outage was undoubtedly shocking, given its implications for businesses and citizens (especially vulnerable citizens), but we should also take stock of the other issues at play.
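For readers who want to check the arithmetic behind those figures, here is a minimal Python sketch. The function names are hypothetical, and it assumes a 365-day year, which is the assumption that reproduces the 8.76-hour figure.

```python
# Downtime budget implied by an availability target.
# Assumption: a 365-day year (8,760 hours), not a leap or averaged year.
HOURS_PER_YEAR = 365 * 24


def annual_downtime_hours(availability_percent: float) -> float:
    """Hours per year a service may be down while still meeting the target."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * HOURS_PER_YEAR


def annual_downtime_minutes(availability_percent: float) -> float:
    """The same budget expressed in minutes."""
    return annual_downtime_hours(availability_percent) * 60


if __name__ == "__main__":
    # "Five nines": roughly 5.26 minutes a year.
    print(f"99.999% -> {annual_downtime_minutes(99.999):.2f} minutes")
    # "Three nines": roughly 8.76 hours a year.
    print(f"99.9%   -> {annual_downtime_hours(99.9):.2f} hours")
```

Each extra nine cuts the budget tenfold, which is why the gap between a typical 99.9% undertaking and the five nines ideal is so wide.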

 

While it’s natural to expect high reliability from telcos, we must acknowledge that achieving 100% uptime is practically unattainable. Service interruptions can result from a variety of factors, including human and technical errors, natural disasters, and cyber breaches. And with increasingly advanced nefarious threats facing telecom operators, disruption is likely to become more frequent and more impactful.

 

Beyond individuals, teams, and leadership all working to best endeavours, there’s a national-level view that deserves more attention. It’s not just telcos under increasing pressure from disruptive forces; it’s all of our critical infrastructure. The government acknowledges this and addresses it in legislation such as the Security of Critical Infrastructure (SOCI) Act, which, whilst worthy and needed, sometimes feels like a slow burn in the right direction (perhaps an inappropriate metaphor for infrastructure at risk from natural disaster).

Are there lessons we can take from history? Those of us who worked in telecoms at the turn of this century would tell you that there are.

 

Twenty-five years ago, when I joined Nortel as a graduate engineer, the world was gripped in anticipation of the rollover from 1999 to the year 2000, and by the fear that it would herald catastrophic systems failures. The Y2K bug had a simple cause: because years were represented by their last two digits, the rollover from 99 back to 00 was going to cause havoc. Except it didn’t. Often referred to as the Y2K ‘hoax’, this was no hoax; it was a success story.
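To make the two-digit problem concrete, here is a small illustrative Python sketch. It is not taken from any real system; the function names and the account example are hypothetical, chosen only to show how a simple subtraction goes wrong at the rollover.

```python
def years_elapsed_two_digit(start_yy: int, end_yy: int) -> int:
    """Elapsed years when only the last two digits of each year are stored."""
    return end_yy - start_yy


def years_elapsed_four_digit(start_year: int, end_year: int) -> int:
    """The same calculation with full four-digit years."""
    return end_year - start_year


if __name__ == "__main__":
    # Hypothetical account opened in 1995 (stored as 95).
    print(years_elapsed_two_digit(95, 99))       # checked in 1999: 4 years
    print(years_elapsed_two_digit(95, 0))        # checked in 2000: -95 years
    print(years_elapsed_four_digit(1995, 2000))  # with full years: 5 years
```

A negative age of 95 years is harmless in a toy example. In billing, scheduling, or network timing logic, it is the kind of result that had to be found and fixed before the clocks rolled over.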

 

The Y2K bug threat was averted because the risk was proactively recognised, and countermeasures were prioritised, engaged with by leaders, and championed by governments. There was widespread collaboration and cooperation, with ample planning and resourcing. And all this was wrapped in effective communication leading to global awareness. For those of us who participated, it has become one of the greatest examples of global risk management of our time.

 

Optus is not an island. By several definitions, its network is part of a global continuum. And Optus is not alone in its responsibility to citizens and businesses. Australia has many systems whose disruption, as we’ve seen with the DP World docks disruption, will cause immediate and widespread detrimental impact. At some level all these systems intertwine and become co-dependent. Telcos need electricity, electricity needs fuel, fuel needs transportation, transportation needs freely operating transit routes, and so on.

 

With a growing threat landscape targeting increasingly complex and intertwined systems of infrastructure, we face cumulative and cascading effects from disruption. To mitigate these potential impacts, Australia needs to back itself, building integrated risk management approaches that absorb the best of operations (people, processes, and systems) from across all critical infrastructure and treat them holistically.

 

To analyse, optimise, and plan contingencies across this level of complexity will take our best brains applying our most advanced analytical technologies, such as artificial intelligence. It’s not futuristic; it’s here and now. Australia has these capabilities in the CSIRO, in our universities, and in private sector sovereign capabilities like Sentient Hubs.

 

Investing in AI technology capable of modelling and predicting critical infrastructure outages or impacts must be a core priority of this government’s resilience agenda. Whilst Optus and DP World will have questions to answer, they will not be the last businesses hit with downtime. They should, however, be a catalyst for an advanced, analytically informed, nationally integrated approach to critical infrastructure risk management. The ‘next time’ is coming; let’s be ready as a nation.

 

 

Alison Howe

Co-Founder and Interim Chief Executive Officer
