Why QA Matters in Light of the CrowdStrike Incident
Thomas Moor
Senior Consultant - QA
Jack Middleton
Lead Consultant - QA
Today, high-cadence software updates are the norm for businesses looking to mitigate security and data threats. Many of these updates come from third parties who provide specialist software, and it’s fair to say that we have all come to expect a high level of quality from them. We simply assume that their updates will work. So, when this assumption fails, the impact can be devastating to both parties.
You will have undoubtedly come across the CrowdStrike incident in the news or been affected by it yourself. In this article, we discuss what went wrong and share the QA approaches we use to avoid this happening to our clients.
Who is CrowdStrike?
CrowdStrike is a cyber-security company that provides services for endpoint security, threat intelligence and cyberattack response. They have developed a suite of tools to provide these services, including the Falcon Sensor, an agent installed on each endpoint to detect and block threats.
As of August 2024, CrowdStrike ranks first in Endpoint Protection with a 23.67% market share, almost 30,000 customers, and a market cap of $83 billion¹.
What went wrong?
On July 19, 2024, CrowdStrike deployed two additional “Template Instances”. These are essentially new Falcon Sensor capabilities for enhanced telemetry and threat detection, delivered to the Sensor product as configuration. When the Sensor received these Templates, problematic content caused an out-of-bounds memory read, triggering an exception on Windows-based machines. This crashed the operating system (the blue screen of death).
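To make this class of failure concrete, here is a minimal Python sketch of a bounds check that rejects configuration content before it is interpreted. It is an illustration only, not CrowdStrike’s code: `TemplateInstance`, `validate_instance`, `load_instances` and the field counts are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class TemplateInstance:
    """Hypothetical configuration entry: a name plus the input-field indexes it reads."""
    name: str
    field_indexes: Tuple[int, ...]


def validate_instance(instance: TemplateInstance, supplied_field_count: int) -> List[str]:
    """Return a list of problems; an empty list means every index is in range."""
    problems = []
    for index in instance.field_indexes:
        if index < 0 or index >= supplied_field_count:
            problems.append(
                f"{instance.name}: field index {index} is outside "
                f"0..{supplied_field_count - 1}"
            )
    return problems


def load_instances(instances, supplied_field_count):
    """Split instances into accepted and rejected, rejecting instead of crashing."""
    accepted, rejected = [], []
    for instance in instances:
        problems = validate_instance(instance, supplied_field_count)
        if problems:
            rejected.append((instance, problems))
        else:
            accepted.append(instance)
    return accepted, rejected
```

In a memory-safe language such as Python, an out-of-range index raises an error rather than reading arbitrary memory, but the principle carries over: content that arrives as configuration deserves the same validation as code.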
Why weren’t the errors picked up?
Two issues contributed to the CrowdStrike incident: insufficient testing (in both approach and coverage), and the lack of a strategy for deploying updates².
Critical tests were missed from their suite of automated checks, and some of the tests contained bugs themselves. This meant that passing tests were passing in error.
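One way to catch a test that passes in error is to run it against an implementation that is known to be wrong and confirm that it fails. The Python sketch below shows the idea under simple assumptions; `clamp`, `broken_clamp` and the test functions are hypothetical examples, not taken from any real test suite.

```python
def clamp(value, low, high):
    """Function under test: limit value to the range [low, high]."""
    return max(low, min(value, high))


def broken_clamp(value, low, high):
    """Deliberately wrong implementation, used only to check that tests can fail."""
    return value


def flawed_test(fn):
    """Buggy test: the broad except swallows the assertion failure, so it always passes."""
    try:
        assert fn(15, 0, 10) == 10
    except Exception:
        pass
    return True


def sound_test(fn):
    """Returns True only if fn really clamps out-of-range values."""
    return fn(15, 0, 10) == 10 and fn(-5, 0, 10) == 0 and fn(5, 0, 10) == 5


def can_detect_bugs(test, broken_fn):
    """A trustworthy test must report failure for a known-bad implementation."""
    return test(broken_fn) is False
```

Here `can_detect_bugs(flawed_test, broken_clamp)` is `False`, which exposes the flawed test, while `can_detect_bugs(sound_test, broken_clamp)` is `True`. Mutation testing tools automate the same idea at scale.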
Of course, no system is bug-free, and mistakes do happen. But for a product as critical and deeply embedded as Falcon Sensor, a stringent test approach is essential. Had more and better tests been in place, these problems could have been identified and resolved without incident.
What was the impact?
Roughly 8.5 million systems around the world crashed and were unable to restart properly without manual intervention. An administrator had to boot each machine locally in safe mode and delete specific files from the CrowdStrike driver directory.
While CrowdStrike fixed the Falcon Sensor issue in under five hours, full recovery for customers took days.
It’s estimated that the financial cost of this outage could be in the tens of billions of dollars. CrowdStrike’s share price dropped 32% in the 12 days after the incident, causing a loss in market value of $25bn³. The human impact included grounded flights, cancelled surgeries, inaccessible banking apps, communication outages, and supermarket POS systems crashing. Beyond this, the reputational damage to CrowdStrike is likely to be far longer lasting, as they strive to recover from an update-gone-wrong that made front-page news around the world.
The clear misses in CrowdStrike’s approach to testing and delivery will take a long time to recover from.
How can incidents like this be avoided in future?
As the dust settles, it’s vital that we all learn from this incident to avoid a repeat scenario.
- Never cut corners when it comes to QA: The core success criterion of any delivery should always be quality. After all, we all want to ship successful systems and updates. And while no release is ever completely risk-free, a robust approach to testing will significantly reduce the risk of an incident of this magnitude occurring to you. Quality Assurance (QA) must be given the time and attention it requires. With continuous delivery practices commonplace nowadays, QA often finds itself in the spotlight as an easy area to cut corners to improve velocity.
Compromising on quality to meet deadlines, including delaying QA activities to the next sprint or release, is something we actively advocate against at Ensono. QA must be an element of your team’s Definition of Done before a new release can be delivered.
- QA is everybody’s responsibility: No individual should be solely responsible for quality. It takes a team to deliver quality, so it follows that quality is everyone’s responsibility.
At Ensono, we actively encourage everyone to take part in test activities. This positions us far away from outdated development practices where code was “thrown over the fence” to QA to test.
When the whole team is responsible for quality, QAs aren’t the only ones actively testing changes. Testing occurs at a variety of levels, from unit tests written by developers alongside their code, to end-to-end tests written by QAs to cover business scenarios.
One of the core purposes of QA (and testing in general) is to give confidence ahead of a release, be it a new system or a small update, and a whole-team approach to quality ensures you get continuous feedback on the state of the system.
- Test early: Testing should be considered from the requirements-gathering stage so that expected behavior is well understood and everyone is clear on what to test and how. We find that this also enables us to identify missing requirements early, mitigating issues before they become real problems later.
We also find it helpful when, as developers produce code, they pair up with a QA team member to undertake a “dev box testing” activity. Here, the developer demos the code to a QA ahead of more formal test activities. This enables the QA to get early eyes on changes so they can identify potential issues early, when they’re easier to fix.
- Utilize a variety of test techniques: To improve speed of delivery, many companies take advantage of test automation. These automated tests enable us to regularly test many scenarios much faster than a human could.
Test automation is a cornerstone of any modern, successful test approach and, if done well, can identify regressed behavior in the system as other changes are added. It’s vital that these tests are accurate, test what they are intended to test, and are kept up to date. To ensure this, they need to be reviewed just like developer code. Bugs in tests, as we have seen, can lead to bugs in the system being missed all too easily.
Supplementing automated testing is exploratory testing. This is where the skills of a QA come to the fore. They use the system, learning about it and discovering how it behaves—all the while testing critical elements of behavior. We find that exploratory testing allows us to cover many more scenarios, helping to find more bugs in the system than we would through automation alone.
We also advocate for smoke testing and for avoiding big-bang type deployments. Smoke testing covers core functionality very quickly, whilst canary releases enable a small subset of users to receive a release first. If that release completes successfully, you then widen it out to others. Had CrowdStrike smoke-tested the update on Windows machines before release, or rolled it out to a small canary group first, the issue would very likely have been caught before it reached millions of machines.
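A canary release can be sketched in a few lines of Python. This is a simplified illustration under our own assumptions, not a production rollout tool: `staged_rollout`, the `deploy` callback (which returns `True` when a host is healthy after the update) and the stage fractions are all hypothetical.

```python
from typing import Callable, Dict, List, Sequence


def staged_rollout(
    hosts: Sequence[str],
    deploy: Callable[[str], bool],
    stage_fractions: Sequence[float] = (0.01, 0.10, 0.50, 1.0),
    max_failure_rate: float = 0.0,
) -> Dict[str, object]:
    """Deploy to growing slices of hosts, halting as soon as a stage is unhealthy."""
    attempted = 0
    failures: List[str] = []
    for fraction in stage_fractions:
        # Cumulative number of hosts that should have the update after this stage;
        # at least one canary, never more than the whole fleet.
        target = min(len(hosts), max(1, int(len(hosts) * fraction)))
        stage = hosts[attempted:target]
        stage_failures = [host for host in stage if not deploy(host)]
        failures.extend(stage_failures)
        attempted = max(attempted, target)
        if stage and len(stage_failures) / len(stage) > max_failure_rate:
            # Stop here: the remaining hosts never receive the bad update.
            return {"completed": False, "attempted": attempted, "failures": failures}
    return {"completed": True, "attempted": attempted, "failures": failures}
```

With 1,000 hosts and an update that fails everywhere, this sketch halts after the first stage, having touched roughly 1% of the fleet rather than all of it.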
We also advocate for User Acceptance Testing (UAT), in which a small group of users test the changes themselves. This gives users the opportunity to give feedback on the changes and to ensure that their needs are satisfied.
Finally, it’s vital to always focus on the end user. This means testing as a user would, using the systems that they would be using. Take browsers, for example. Cross-browser testing is important, as is cross-device testing for certain systems. It gives us confidence that the system under test will work in the various configurations that may be used.
- Prepare for disasters: We know that no system is bug-free, and that a bug manifesting in your production environment is inevitable, so you must be prepared.
We have found it helpful to always consider Disaster Recovery (DR) testing for the systems we develop. We also consider advanced types of performance testing, such as stress testing, which is used to confirm the system’s response to specific events such as spikes in user load.
By developing an effective response to disaster, mitigation measures can be put in place should the worst happen. We build on this through continuous monitoring. The ability to detect issues as soon as possible after they occur in an environment enables us to respond appropriately and resolve them with minimal impact.
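As a small illustration of continuous monitoring, the Python sketch below tracks the failure rate over a sliding window of recent health checks and signals when it crosses a threshold. The class name `ErrorRateMonitor`, the window size and the threshold are assumptions made for the example; real monitoring would normally rely on a dedicated observability platform.

```python
from collections import deque


class ErrorRateMonitor:
    """Tracks the failure rate over the most recent `window_size` health checks."""

    def __init__(self, window_size: int = 100, threshold: float = 0.05):
        self.events = deque(maxlen=window_size)  # oldest results drop off automatically
        self.threshold = threshold

    def error_rate(self) -> float:
        """Fraction of failed checks in the current window (0.0 when empty)."""
        if not self.events:
            return 0.0
        return self.events.count(False) / len(self.events)

    def record(self, ok: bool) -> bool:
        """Record one health-check result; return True if an alert should fire."""
        self.events.append(ok)
        return self.error_rate() > self.threshold
```

Each call to `record` returns `True` once the share of failed checks in the window exceeds the threshold, which is the point at which a team would be alerted to respond.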
Conclusion
The CrowdStrike incident has demonstrated the need to balance speed against risk, and the danger of releasing something that’s only partially tested.
Unfortunately for CrowdStrike, this incident will go down as another example of why testing is so important and must not be neglected. The damage to their reputation will be hard to recover from.
At Ensono, we strive to deliver the best for our clients, with a constant focus on quality. We provide regular QA assessments. We also advocate strongly for a whole-team approach to quality. And we combine powerful test automation with the skills that only highly proficient, experienced test analysts can bring.
This approach helps ensure that any changes to a system, no matter how small in scope, are released safely and to a level of quality that you, as our customers, expect.
Everyone wants quality baked in, and a solid approach to QA is the best way to ensure this happens.
Get in touch to learn how Ensono can help you with your quality journey.
References
1 – https://6sense.com/tech/endpoint-protection/crowdstrike-market-share
3 – https://www.bbc.co.uk/news/articles/cy08ljxndr4o