
Ten mistakes marred firewall upgrade at Australian telco • The Register

Technicians working on a firewall upgrade made at least ten mistakes, contributing to two deaths, according to a report on a September incident that saw Australian telco Optus unable to route calls to emergency services.

As The Register reported at the time, Australia’s equivalent of the USA’s 911 and the UK’s 999 and 112 emergency contact number is 000 – Triple Zero – and local law requires all telcos to route emergency calls to that number. For 14 hours on September 18, Optus could not route some customers’ calls to 000 and was unaware of any problems on its network. The company eventually learned of the situation from customers who complained to its call center.

During the 000 outage, 455 calls to emergency services did not go through, and two of those callers died.

On Thursday, Optus published an independent report on the matter written by Dr Kerry Schott, an Australian executive who has held roles at many of the country’s most significant businesses.

The report found that Optus planned 18 firewall upgrades and had executed 15 without incident. But on the 16th upgrade, Optus issued incorrect instructions to its outsourced provider Nokia.

“These errors appear to be caused by a lack of attention to the matter by the firewall network engineers,” the report states, finding that “Some network engineers failed to attend these project meetings to assess the impact of the planned work.”

Staff later required changes that meant devices would be isolated and a gateway locked, a decision that meant traffic would not be redirected. Optus had not used that procedure on six previous firewall upgrades. Nokia, meanwhile, chose to use a Method of Procedure from 2022 it did not employ on past upgrades and which was the wrong one for the job.

When Nokia got to work, it incorrectly classified the job as having no impact on network traffic.

By that point, Optus had classified the job as urgent. Doing so meant it didn’t conduct an engineering review, as would normally be the case.

Nokia then implemented the upgrade using the wrong Method of Procedure. Not long afterwards, it detected signs of network problems. Nokia noted those issues but didn’t investigate. Optus was also aware of the warnings but chose not to dig into the matter.

At 2:40 AM, the teams made a post-implementation check. The report found that call failure rates were increasing, not declining as expected. “The anomaly was not picked up,” the report states.

The final mistake was that Optus used nationwide aggregate data to assess variation in call volumes across its network. “This data was not sufficiently granular to enable detection of the emerging problem,” Schott wrote, so local issues caused by one botched upgrade were not detectable.
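To see why a nationwide aggregate can hide a local fault, consider the minimal sketch below. It is illustrative only: the region names, call counts, and the 2 percent alert threshold are invented for the example, not taken from the report or from Optus's monitoring systems.

```python
from dataclasses import dataclass


@dataclass
class RegionStats:
    """Hypothetical per-region call counters for one monitoring window."""
    name: str
    attempts: int
    failures: int

    @property
    def failure_rate(self) -> float:
        return self.failures / self.attempts if self.attempts else 0.0


def aggregate_failure_rate(regions):
    """Failure rate computed over all regions pooled together."""
    total_attempts = sum(r.attempts for r in regions)
    total_failures = sum(r.failures for r in regions)
    return total_failures / total_attempts if total_attempts else 0.0


def regions_over_threshold(regions, threshold):
    """Names of regions whose own failure rate exceeds the threshold."""
    return [r.name for r in regions if r.failure_rate > threshold]


# Invented figures: two large healthy regions and one small failing region.
REGIONS = [
    RegionStats("region_a", attempts=10_000, failures=50),
    RegionStats("region_b", attempts=8_000, failures=40),
    RegionStats("region_c", attempts=200, failures=120),
]
THRESHOLD = 0.02  # assumed alert threshold of 2 percent

if __name__ == "__main__":
    national = aggregate_failure_rate(REGIONS)
    print(f"national failure rate: {national:.4f}")
    print("national alert:", national > THRESHOLD)
    print("regional alerts:", regions_over_threshold(REGIONS, THRESHOLD))
```

In this toy data the pooled rate stays below the threshold because the two large regions dilute the small one, while a per-region check flags the failing region immediately.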

Schott summarized the incident as follows:

The review also found that Optus’ call center didn’t appreciate it could be “the first alert channel for Triple Zero difficulties.”

The document also notes that Australian telcos try to route 000 calls during outages, but that doing so is not easy and is made harder by the fact that different smartphones behave in different ways. Optus does warn customers if their devices have not been tested for their ability to connect to 000, and maintains a list of known bad devices. But the report notes Optus’s process “does not capture so-called ‘grey’ devices that have been bought online or overseas and may not be compliant.”

All Australian telcos are currently trying to understand potential problems.

A source with knowledge of handset testing operations at another Australian carrier recently told us that their team has been assessing the performance of every phone they can get their hands on.

The report calls for Optus to end its current practice of working in silos, and to improve its incident and crisis management response capabilities.

But its strongest words are reserved for the tech teams involved with the failed upgrade.

“To have a standard firewall upgrade go so badly is inexcusable,” the document states. “Execution was poor and seemed more focussed on getting things done than on being right. Supervision of both network staff and Nokia must be more disciplined to get things right.” ®
