Software development is speeding up. We build it in smaller chunks and release more often. Testing has to keep up, which has largely driven the growth of functional test automation. But what about performance testing?
Wherever I look online I see mention of ‘continuous performance testing’. At a high level this is straightforward: we performance test more often to keep up with this velocity. Often this means building our performance testing into our deployment pipeline. But does automated performance testing actually deliver what it should?
Maintaining Test Suites
In general, load testing assets are fragile. They are more fragile than functional automated test assets because we are mostly simulating network traffic. Depending on the application we are testing, it can be hugely time-consuming and complicated to build a load testing suite.
The age-old problem with load testing assets is that when the application changes, our test suites have a habit of breaking. The effort required to repair or rebuild them can be so substantial that it is not cost effective in a rapid, iterative life-cycle. From experience, this is especially true of off-the-shelf and legacy applications.
One or more of the following needs to be in place for us to succeed:
- The application needs to be testable. For a performance tester, this means the network traffic needs to be consistent and simple enough to keep on top of. That’s not something we are always in control of if we have purchased an off-the-shelf solution.
- Otherwise we have to limit what we test. If it’s not cost effective to continually run and maintain a fully integrated test suite, we need to identify what we can test. APIs are often low-hanging fruit, but we need to be careful to keep the overall solution in mind. When we start breaking down our testing to the component level, our testing loses a lot of its value.
Load test tool vendors are trying to tell a story about being DevOps and CI/CD ready. The reality is that, from a technical viewpoint, their tools have not significantly evolved in over a decade (with a few rare exceptions).
So we have some performance testing set up to run every time we deploy. A test is run and some results are recorded. Now what? How do we determine if the test ‘passed’ or ‘failed’?
How do we define pass and fail criteria that mean something to the business? We can define NFRs, but in my experience these are often numbers plucked out of the air without any real connection to what matters. If we have an NFR that response time for all user actions should be 2 seconds or less at the 95th percentile and one low-impact activity takes 2.1 seconds, should we fail the build?
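To make the dilemma concrete, here is a minimal sketch of a rigid pass/fail gate in Python. The transaction names and timings are invented for illustration, and `percentile` and `evaluate_nfr` are hypothetical helpers, not part of any load testing tool. The percentile uses the simple nearest-rank method; real tools may interpolate differently.

```python
def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least
    pct percent of all samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, avoiding float rounding issues.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def evaluate_nfr(results, limit_seconds=2.0, pct=95):
    """Apply one blanket response-time NFR to every transaction.

    Returns {transaction: (percentile_value, passed)}.
    """
    verdicts = {}
    for name, samples in results.items():
        value = percentile(samples, pct)
        verdicts[name] = (value, value <= limit_seconds)
    return verdicts


# Hypothetical response times in seconds, invented for this example.
results = {
    "login": [0.8] * 19 + [1.5],
    "export_report": [1.2] * 18 + [2.1, 2.1],
}

verdicts = evaluate_nfr(results)
build_passed = all(passed for _, passed in verdicts.values())

for name, (value, passed) in verdicts.items():
    print(f"{name}: p95={value:.1f}s {'PASS' if passed else 'FAIL'}")
print("build passed:", build_passed)
```

With this data the build fails because of a 0.1 second overshoot in one low-impact activity. The gate is easy to automate, but the verdict says little about business risk.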
A better way would be to track performance over time. We could compare against the past dozen runs and look for degradation. This is tricky to implement, and nothing on the market does it out of the box. And how, even then, do you determine when to fail the build? Still, this is possible and a good avenue for future investigation.
For the meantime, performance testing results need to be analysed by someone who has the skills and experience to make sense of them and communicate them back to the business in a way they understand. This is especially true when we need to diagnose performance issues and do exploratory analysis. If we are going to automate our performance testing, we have to account for that manual effort and the need to have someone with those skills. Our automatic validation is only going to scratch the surface at best.
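As a rough sketch of what comparing against recent runs could look like, the Python below takes the median of the last twelve runs as a baseline and flags a run that is more than 20% slower. Everything here is an assumption made for illustration: the `check_regression` helper, the 12-run window, the 20% tolerance, and the sample figures. Choosing that tolerance is exactly the open question of when to fail the build.

```python
from statistics import median


def check_regression(history, current, window=12, tolerance=0.20):
    """Compare the current run's p95 (seconds) against recent history.

    history   -- p95 values from earlier runs, oldest first
    window    -- how many recent runs form the baseline (assumed: 12)
    tolerance -- allowed relative slowdown before flagging (assumed: 20%)
    """
    recent = history[-window:]
    # With too little history there is no meaningful baseline, so do not flag.
    if len(recent) < 3:
        return {"baseline": None, "change": None, "regressed": False}
    baseline = median(recent)
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    change = (current - baseline) / baseline
    return {"baseline": baseline, "change": change, "regressed": change > tolerance}


# Hypothetical p95 values (seconds) from the previous twelve runs.
history = [1.00, 1.05, 0.95, 1.10, 1.00, 0.98, 1.02, 1.04, 0.97, 1.01, 1.03, 0.99]

ok = check_regression(history, current=1.08)
bad = check_regression(history, current=1.40)

print(f"1.08s: change={ok['change']:+.1%} regressed={ok['regressed']}")
print(f"1.40s: change={bad['change']:+.1%} regressed={bad['regressed']}")
```

The median is used rather than the mean so that a single noisy run in the history does not shift the baseline much. A slow drift across many builds would still slip under a per-run tolerance like this one, which is one reason a person still needs to look at the trend.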
Performance testing results are only accurate if the environment we test in matches (or is) production. The further we deviate from this, the higher the risk that our results do not reflect the real world. Production-like does not just mean the same hardware configuration as production; it also means the same integrations to external components are in place, the software is configured the same way, and the database contains similar data.
So, what if the environment we are testing in is not production-like? What can we say from the results? The best we can do is draw comparisons between builds, and only about response time. On top of that, the response times we observe will not necessarily reflect production, and capacity and stability cannot be measured accurately. It is a good early indication of some performance issues, but that is all.
For any performance testing to provide maximum value we need to test in a production-like environment. For continuous performance testing to work, this means continual deployment into a production-like environment. In an ideal world we would spin up a fully production-like environment at will, but this is not the reality for most businesses. So what environments do we have available to us, and what can we realistically achieve in them?
Classic performance tests generally run for an hour or more, and much longer if you are testing stability. Running a twelve hour test every time you deploy is hardly pragmatic if you are deploying multiple times a day. So how long should our ‘automated’ performance tests run for? What about five minutes? This severely limits the value of our testing:
- We do not get enough sample points to make meaningful conclusions about performance
- We miss out on periodic patterns over time – e.g. a spike in response time every quarter of an hour will not be picked up
- We are not running for long enough to assess stability, or even capacity, since we need time to ramp up and let the system stabilise
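The periodic-pattern point is simple arithmetic, sketched below in Python. It assumes, purely for illustration, a spike that recurs exactly every 15 minutes starting at time zero, and a test that happens to begin one minute after a spike; `spikes_observed` is a hypothetical helper.

```python
def spikes_observed(start_s, duration_s, period_s=900):
    """Count events recurring every period_s seconds (at 0, period_s, ...)
    that fall inside the half-open window [start_s, start_s + duration_s)."""
    end_s = start_s + duration_s
    # Integer ceiling division: -(-a // b) == ceil(a / b) for positive b.
    first_index_at_or_after_start = -(-start_s // period_s)
    first_index_at_or_after_end = -(-end_s // period_s)
    return first_index_at_or_after_end - first_index_at_or_after_start


# Test starts 60 seconds after a spike (an assumed offset).
five_minutes = spikes_observed(start_s=60, duration_s=5 * 60)
one_hour = spikes_observed(start_s=60, duration_s=60 * 60)

print("spikes seen in a 5 minute test:", five_minutes)
print("spikes seen in a 1 hour test:", one_hour)
```

Under these assumptions the five minute run sees no spikes at all, while the hour-long run sees four. Whether a short test catches the pattern comes down to when it happens to start.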
So there’s a conflict here. Either we run very short tests which provide only low-accuracy feedback on response time, or we compromise the agility and speed of our deployment schedule.
Something has to give. The conclusion I keep coming back to is that there are some limited things we can test continually, but there is also a place for some ‘big bang’ performance testing at less frequent milestones.
My view is that we shouldn’t be diving into the concept of ‘continuous performance testing’ without properly thinking about whether it actually provides value relative to the cost. More than ever we need performance specialists who understand the business risk and also have the technical depth to understand how a solution is performing.
It’s not about doing what is possible. It’s about doing what gives the business confidence in the performance of their software, efficiently. And that might mean that performance testing needs to sit somewhat apart from the development life-cycle.
What are your thoughts?