MedDevOps, Security Updates, and the Cloudflare Outage

Cloudflare is a large content delivery network (CDN) and network infrastructure provider serving a significant fraction of the internet. On July 2nd, 2019 they had a large outage affecting many internet services, and afterwards they published a detailed post-mortem analysis in a blog post that is quite instructive from a MedDevOps point of view.

Software Updates for Medical Cloud Services

A cloud service will need to be regularly updated, not only to fix defects and to add new features, but also to patch security vulnerabilities in system components. Such patching can be time-critical if there is already proof-of-concept code available for exploiting the vulnerabilities or the vulnerabilities are trivial to exploit.

Medical software updates differ from regular software updates not only because the stakes are higher than for most regular software, but also because they are governed by regulation. Medical software updates require appropriate verification and validation activities and possibly a notification to (or even prior approval from) a Notified Body in the EU, or a 510(k) notification to the FDA in the US. In the EU there is not much specific guidance on this, but the FDA does provide guidance on deciding when to submit a 510(k) for a software change. The interesting bit is the following diagram from the guidance:

FDA-2011-D-0453: Deciding When to Submit a 510(k) for a Software Change to an Existing Device

Note that according to step 1, changes made solely to strengthen cybersecurity do not need a 510(k) notification. This makes a lot of sense – firstly, it is not as if the FDA could tell whether your security patch is going to fail, and secondly, as already noted, security patches can be time-critical. This also has a bearing on the Cloudflare case.

The Software Update Process at Cloudflare

Cloudflare is not a medical device company, but their outages can take out a significant chunk of the internet, so they pay just as much attention – if not more – to preventing update-related problems as a medical device company would. They have a standard operating procedure for updates, and it involves verification and validation steps just like a medical device company would have:

Cloudflare software release process (from the blog post)

The above picture does not show all the steps, but according to the update procedure, a change is:

  1. Created by developer
  2. Reviewed by other developers
  3. Verified by test automation
  4. Approved by manager
  5. Validated through gradually increasing exposure from least realistic to most realistic:
    1. Internal users (“dogfood”)
    2. Small subset of non-paying customer traffic (“guinea pig”)
    3. A few sites with subset of their paying and non-paying customer traffic (“canary”)
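
The gradual-exposure steps above can be sketched as a simple deployment loop. The stage names come from the post (plus a final global stage); the deploy, health-check, and rollback callbacks are hypothetical placeholders for whatever the actual deployment system provides:

```python
# Sketch of staged rollout with rollback on first unhealthy stage.
# Stage names are from Cloudflare's post; the mechanics are illustrative.
STAGES = ["dogfood", "guinea pig", "canary", "global"]

def staged_rollout(deploy, healthy, rollback):
    """Deploy to each stage in turn; on the first unhealthy stage,
    roll back every stage deployed so far and stop.

    `deploy`, `healthy`, and `rollback` are callbacks supplied by the
    deployment system; `healthy` returns True if the stage looks good.
    Returns (success, list_of_stages_deployed)."""
    done = []
    for stage in STAGES:
        deploy(stage)
        done.append(stage)
        if not healthy(stage):
            # Unwind in reverse order of deployment.
            for s in reversed(done):
                rollback(s)
            return False, done
    return True, done
```

The key property is that an unhealthy "canary" never reaches the global stage – which is exactly the safety net the security-update fast path skipped.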

Of course, Cloudflare does not contact a Notified Body or FDA, but other than that, their release process is pretty solid. Each change has an associated change request, an update plan and a rollback plan:

Change Request JIRA Ticket (from the blog post)

The only thing really missing from the process description is ensuring that the system remains in a validated state throughout the update process (a subject for another blog post), but that isn’t an issue for a small change like this. All in all, if you are following a similar update process to Cloudflare’s, you are already doing pretty well.

The Reason for the Outage

The problem at Cloudflare had several causes – an initial error, missing controls in some places, failing controls in others – but the short version is this: the update added a new rule to the Web Application Firewall (WAF) that contained a badly formulated regular expression prone to excessive backtracking, which caused CPU usage to spike. Testing did not catch it, because the rule was functionally correct and the testing concentrated on functionality rather than performance. In real-life usage at scale, though, the rule overloaded the servers’ CPUs. And because it was a security-related update, it bypassed the gradual-exposure validation steps in the update procedure that would have detected the problem.
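
The failure mode behind that regular expression is excessive backtracking. The sketch below uses the fragment `.*.*=.*`, a simplification of the problematic core reported in the post-mortem, not the full rule: on input containing no `=`, the two adjacent `.*` groups force the engine to try every split point before failing, so the work grows quadratically with input length.

```python
import re
import time

# Simplified stand-in for the problematic core of the WAF rule.
# Two adjacent ".*" groups followed by a literal that never matches
# cause quadratic backtracking on inputs with no "=".
BAD = re.compile(r'.*.*=.*')

def match_time(text: str) -> float:
    """Seconds for one (failing) anchored match attempt over `text`."""
    start = time.perf_counter()
    BAD.match(text)  # no "=" in text, so this fails after backtracking
    return time.perf_counter() - start

# Doubling the input roughly quadruples the time for the bad pattern.
print(f"5k chars:  {match_time('x' * 5_000):.4f}s")
print(f"10k chars: {match_time('x' * 10_000):.4f}s")
```

On a single request the cost is invisible; multiplied across every request hitting the WAF, it saturates CPUs – which is why only performance testing at realistic load (or a backtracking-free regex engine, which is what Cloudflare later moved towards) catches it.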

Conclusions

There are several take-aways here – please read the original blog post to see a more detailed analysis of the outage and what actions Cloudflare took – but I would like to emphasize two of them.

Firstly, functional testing is not enough. You also need to look for anomalies in resource consumption (CPU, memory, disk, network, etc.) and performance (response time, throughput, error rate, etc.) under realistic loads. Note that the “realistic loads” part may require you to replicate your entire production system at full scale. Fortunately, cloud computing makes it easy to spin up even large systems for temporary use and shut them down afterwards, though the cost can still be significant. If you are running things in your own on-premises data center, however, replicating everything for testing purposes may be impossible.
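
One way to put this into practice is an automated gate that compares a candidate build’s latency distribution against a baseline captured under the same load. A minimal sketch, where the percentile and tolerance values are arbitrary illustrations, not recommendations:

```python
import statistics

def performance_regression(baseline_ms, candidate_ms, p=0.95, tolerance=1.2):
    """Return True if the candidate's p-th latency percentile exceeds
    the baseline's by more than `tolerance`.

    `baseline_ms` and `candidate_ms` are lists of per-request latencies
    (milliseconds) collected under the same load profile."""
    def pct(samples):
        # statistics.quantiles(n=100) returns the 99 percentile cut points.
        return statistics.quantiles(samples, n=100)[int(p * 100) - 1]
    return pct(candidate_ms) > tolerance * pct(baseline_ms)
```

The same pattern extends to CPU, memory, and error-rate samples; the point is that the gate is part of the pipeline, not a manual afterthought.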

Another aspect of non-functional testing that often gets ignored is security testing. In the Cloudflare case the update was a firewall rule update, so the functional testing was the security testing, but if the update had been an application feature update, separate security test cases might be required to verify, for example, that the updated feature correctly interacts with access control. Additionally, system testing should include a security test suite which verifies that all the security features still work correctly, as well as general vulnerability scans. And the development pipeline itself should include security scans for weak code and for the use of components with known vulnerabilities.
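
In a CI pipeline, those last two checks can be ordinary pipeline steps. The fragment below is an illustrative GitHub-Actions-style snippet; the tools shown (pip-audit for dependencies with known CVEs, bandit for weak code patterns) are examples for a Python stack, not tools mentioned in the post:

```yaml
# Illustrative CI steps; substitute scanners appropriate to your stack.
- name: Scan dependencies for known vulnerabilities
  run: pip-audit -r requirements.txt
- name: Static security scan of application code
  run: bandit -r src/
```

Failing the build on findings (rather than merely reporting them) is what turns these scans into an actual control.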

Secondly, bypassing lengthy validation processes for urgent security updates is a necessary evil. This is the real systemic problem: security patches have to be applied quickly, because you are racing the bad guys. Even the FDA guidance says that you can go ahead and deploy security patches without notifying the FDA, and you do not want to delay that process. On the other hand, validating updates takes time, and even though security updates do not directly affect the medical functionality of the software, without validation there is no way to guarantee that they will not indirectly cause the medical functionality to fail or become inaccessible.

So do you deploy the update with only partial validation, and risk failure due to a defect in the update? Or do you perform full validation, and risk a security breach? Damned if you do, damned if you don’t. The only thing you can do, as Cloudflare also noted, is to be careful in your risk assessment and omit the validation process only if the security patch is truly urgent. In this case it was not, so the outage could have been avoided.
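
That risk assessment can even be written down as an explicit triage rule, so the “truly urgent” judgment is consistent and auditable. A toy sketch, where the inputs and the rule are illustrative assumptions rather than any regulatory checklist:

```python
def expedite_security_patch(exploit_poc_public: bool,
                            trivially_exploitable: bool,
                            exposed_to_internet: bool) -> bool:
    """Illustrative triage: bypass full validation only when the patch is
    genuinely racing active exploitation. Everything else goes through
    the normal gradual-exposure validation path."""
    return (exploit_poc_public or trivially_exploitable) and exposed_to_internet
```

Under a rule like this, the Cloudflare WAF change (no public proof-of-concept racing them) would have taken the slow, validated path – and the outage would have been caught at the dogfood or guinea-pig stage.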

So, in the end, as with so many things in medical devices, it boils down to a risk-benefit analysis. You already know how to do that. And now you know it also applies to security updates.
