By David Murphy, Enterprise Architect at Acora
Headline story: “TSB lacked common sense before IT meltdown, says report.” In a word – ouch.
Several very interesting points can be drawn from this headline, and indeed the subsequent articles. The points that stand out most to me are:
- It specifically mentions IT – because banks, like the vast and growing number of companies today, are IT businesses.
- The wording – “TSB lacked…”. Not “TSB’s IT department lacked…” or “TSB’s CIO lacked…” – but referencing the business as a whole. Because, like security issues, IT issues in an IT company are now a boardroom topic.
These points, and a firm grasp of the reason for them, should drive home the importance of testing. It should also highlight the importance of the frequency, depth and duration of that testing – and most importantly – the level of risk associated with the activity that the testing supports.
What is live testing?
It’s quite common in IT-driven companies to have significant, segregated test platforms, known as “pre-production environments/platforms.” They mirror the technology and configuration of the live system on independent hardware and systems. Less commonly, they will also mirror the scale and capacity of the live systems (depending on the level of risk involved in that system), which may be needed to prove performance. It is important to note that this is not the same as the development or traditional test environment, but one that is controlled in the same way the live environment is – quite often it is managed not by the development team but by the infrastructure team, to ensure this control.
This pre-production platform is where most testing takes place. It provides a safe space, somewhere that is controlled in terms of change, which ensures the test is true and the results can be trusted. It can be quite expensive to deploy, maintain and manage – but that effort and expenditure are worth it to mitigate the risk of untested changes impacting the live environment.
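Because a pre-production platform is only trustworthy while it actually mirrors live, teams often check the two environments for configuration drift before relying on test results. The sketch below is a minimal, hypothetical illustration of such a check – the configuration keys and the drift-report shape are illustrative assumptions, not any particular tool.

```python
# Hypothetical sketch: compare live and pre-production configuration to spot
# drift before trusting test results. Keys and values are illustrative.

def config_drift(live: dict, preprod: dict) -> dict:
    """Return the keys whose values differ between the two environments."""
    drift = {}
    for key in live.keys() | preprod.keys():
        if live.get(key) != preprod.get(key):
            drift[key] = {"live": live.get(key), "preprod": preprod.get(key)}
    return drift

# Example: pre-production mirrors the configuration but not the scale,
# which is exactly the gap the article warns about.
live_cfg = {"db_version": "12.4", "app_nodes": 8, "tls": "1.2"}
preprod_cfg = {"db_version": "12.4", "app_nodes": 2, "tls": "1.2"}

drift = config_drift(live_cfg, preprod_cfg)
print(drift)  # only the scale-related key differs
```

Here the only drift is in `app_nodes` – the capacity difference that can make a change behave differently under real production load.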
However, in a system that is sufficiently large-scale, or perhaps public facing, it is far more difficult to simulate the load and behaviour of the REAL user base in a pre-production platform. The sheer scale of the production platform and the associated workload can affect how a system accepts change or behaves, compared to the test platform. Add to that the complexity of human behaviour, which can be quite unpredictable, and you introduce an element of risk that demands attention. This is where live testing comes into its own.
In live testing, when a change is pushed to live, it is effectively on a trial basis. It may go to a subset of users initially, allowing a control group to interact with and ‘use’ the system on a live basis. Or a maintenance window may be created to introduce that control – but the key is that the change is not signed off until all the test criteria have been met in production, and the performance and stability of the platform are observed to be correct. Until that point, the rollback plan, which should also have been thoroughly proven in pre-production, should be ‘waiting in the wings’, ready to run.
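The mechanics above – a stable subset of users trialling the change, and a sign-off gate that only promotes once every production criterion passes – can be sketched as follows. This is a hypothetical illustration only; the percentage, function names and criteria are assumptions, not a real deployment system.

```python
# Hypothetical sketch of live (canary) testing: a small, stable subset of
# users is routed to the new version, and the change is only signed off
# when ALL production test criteria are met; otherwise we roll back.

CANARY_PERCENT = 5  # illustrative share of users trialling the change


def in_canary(user_id: int, percent: int = CANARY_PERCENT) -> bool:
    """Deterministically place a stable subset of users in the trial group,
    so the same users stay on the new version throughout the trial."""
    return user_id % 100 < percent


def current_version(user_id: int) -> str:
    return f"old response for user {user_id}"  # stub: the proven live path


def new_version(user_id: int) -> str:
    return f"new response for user {user_id}"  # stub: the change under trial


def handle_request(user_id: int) -> str:
    """Route each request to the old or new code path based on canary membership."""
    return new_version(user_id) if in_canary(user_id) else current_version(user_id)


def sign_off(criteria: dict) -> str:
    """Promote only when every production test criterion passes; otherwise
    trigger the rollback plan already proven in pre-production."""
    return "promote" if all(criteria.values()) else "rollback"
```

For example, `sign_off({"error_rate_ok": True, "latency_ok": False})` returns `"rollback"` – a single failed criterion keeps the change from being signed off, which is the whole point of the gate.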
How do we make sure it doesn’t happen to us?
I can only imagine the internal conversations that happened at TSB in the hours and days following the outage, but I imagine they included such statements as “it was tested” and “we got sign-off for the change”. But I would ask, was the risk and impact of that change properly understood when it was signed-off?
The key question a leader should be asking when anyone in IT, or indeed the whole business, mentions words like ‘patching’, ‘updates’, ‘roll-out’ or ‘migration’ is “what is the worst that can happen?” Closely followed by “what is the test plan?” and “what is the rollback plan, and has it been tested?”
What they should be asking themselves, and demanding their reports ask themselves before presenting the answers is, “Does the depth and rigour of testing, and rollback planning, reflect the level of operational and/or financial impact associated with an outage to the specific system/service in question?”
Sometimes it is as simple as numbers (e.g. retail), sometimes it is SLAs or contractual obligations (e.g. finance or law), sometimes it is reputation (e.g. public transport) and sometimes it is a combination of these factors.
The important point to understand at C-level, and to drive into the culture within IT, is that change control and risk decisions are not just IT decisions anymore – they are business decisions. It is the responsibility of everyone involved, regardless of whether they work in IT, to take the time to effectively understand and communicate both the risk and the impact all the way up to C-level, and inspect all the way back down to the planning level, to ensure that what has been planned is appropriate.
Then, when the worst happens and something comes out of the blue, we can rollback – and we know the rollback will work… because we tested it.
For more information about pre-production environments and IT project planning, contact us. We’re happy to help.