In many instances of Oracle system optimisation, businesses consider a capacity increase as the first course of action. That is often a mistake. In my experience, 8 out of 10 times an expansion isn’t needed to solve the problem, and it just ends up as an unnecessary and untimely cost.
Here are some surprising statistics we’ve seen from working in this space for the last 20 years.
- 80% of the fixes we’ve provided have not required any additional infrastructure or capacity spending.
- Most of those fixes simply used the capabilities of the database platform more effectively.
- In the last five years, 95% of fixes have not required upgrades or capacity spending.
So before you upgrade, check whether you actually need one. You could save thousands of dollars, or even millions, depending on the size of your existing system.
Why is a smart fix more valuable than a capacity increase?
The reasoning behind this involves a series of complexities that your software consultant will understand at a deeper level. Chief among them is the fact that in today’s typical IT landscape, small, self-contained systems are a rarity. Systems are typically highly interconnected, supported by incredibly complex database and middleware technologies that run on a great deal of shared infrastructure.
These systems overlap in several dimensions, including:
- the volumes of data
- the type of workloads being placed on these systems
- the features and capabilities of the Oracle platform
Together, these overlaps make the applications and supporting technology stack far more sophisticated, and optimisation and tuning a significantly more complex task.
The Performance Improvement Options:
As a result of these complexities, you need a deep understanding of every technological layer at play and the business function each one supports, as well as the ability to get people talking to each other and communicating issues and resolutions clearly. So what are your options?
1. Quick Fixes and Band-Aids
These are often used as a temporary measure that deflects attention and investment from the real issues. Relying on a quick fix as a permanent solution means the real issue is ignored, and it inevitably grows into a larger, more expensive problem in the future.
Band-aids should only be used to keep a system running while the deeper fix is being actively pursued.
2. Capacity Increase
The most common quick fix or band-aid proposed by those without the experience needed to tune a system is to increase the available capacity of the infrastructure platform. Think server, storage, network upgrades along with the associated software license costs.
Whilst this is a valid option in the right circumstances, replatforming mission-critical systems typically brings a major increase in cost and risk.
Again, this is the best option only when the business growth that drives the system’s workload justifies the additional expense.
Some useful indicators that this is the most likely option:
- Efforts to tune the application code are showing diminishing returns.
- The existing infrastructure simply cannot handle the workload requirements.
- All software options available to support the workload more efficiently have been exhausted.
Whilst we typically consider this approach the last resort to resolve performance problems – it is certainly a useful approach when your system is optimised and can take advantage of technology improvements in compute, storage and network infrastructure.
3. Perform a root cause analysis and solve the real issue
The common perception of a deep fix is that it costs more money. This stems from the misconception that a costly hardware upgrade is always necessary, or that a deep fix will interfere with day-to-day operations.
This is not the case.
With a proper root cause analysis, only the most effective, specific and critical changes are made to improve application performance.
The benefits of addressing the root cause in this way are:
- Improved ability to scale in the future, delaying and reducing the need for more and faster hardware and reducing overall costs.
- Reduced need for additional system licenses, reducing overall costs.
- Lower risk and shorter time frames, as a major change and/or redesign is not needed.
- An improved user experience.
How to Perform an Effective Root Cause Fix
Overall, you need a methodical and scientific approach to ensure accurate diagnosis, solution design, and costing.
Step 1: Define the Problem
Defining the issue experienced by the user is usually the first and biggest problem when it comes to performance tuning any system. The gap between the language used by the IT department and that used by other departments is often vast, so issues are frequently defined incorrectly or not precisely enough. This usually happens because IT departments are unable to phrase questions in a way that elicits specific and accurate descriptions of the users’ issues.
“If I had an hour to solve a problem I’d spend 55 minutes thinking about the problem and 5 minutes thinking about solutions.” – Albert Einstein
Step 2: Measure the Problem
Defining the problem isn’t enough. To fully gauge the extent of the problem, you need to understand in measurable terms how big it is. For example, if it’s an issue of delay, how long is the delay in seconds? Does it occur in more than one area, and at what times?
Once the problem is defined, we know what we need to measure and which process we need to run to take that measurement.
From the measurement step, we understand what’s actually causing that particular process to run slowly.
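This measurement step can be as simple as instrumenting the slow process and recording its runtime over several executions, so a single outlier doesn’t skew the picture. A minimal sketch in Python, using a stand-in task since the article doesn’t prescribe any particular tool:

```python
import time
import statistics

def measure_runtime(task, runs=5):
    """Time a callable over several runs and summarise the samples."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        task()
        samples.append(time.perf_counter() - start)
    return {
        "min_s": min(samples),
        "median_s": statistics.median(samples),
        "max_s": max(samples),
    }

# Hypothetical stand-in for the slow business process being investigated.
def report_job():
    sum(i * i for i in range(100_000))

baseline = measure_runtime(report_job)
print(baseline)
```

With a baseline like this recorded before any change, the effect of a fix can later be demonstrated in the same measurable terms the problem was defined in.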
Step 3: Quantify the Cost of the Problem
We then identify the cost those processes are currently incurring, and establish whether they are slowing down because there isn’t enough capacity or because of another root cause.
If, for example, someone reports that the system is slow, you must define exactly why it’s slow and quantify the impact of it being slow on the business.
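Quantifying the impact usually comes down to simple arithmetic: how much time the delay wastes, multiplied by how often it occurs and what that time costs the business. A rough sketch with entirely hypothetical figures:

```python
# All figures below are hypothetical, purely for illustration.
delay_per_run_s = 120        # each report takes 2 minutes longer than it should
runs_per_day = 200           # how often staff run the report
loaded_hourly_cost = 60.0    # fully loaded cost of an employee hour, in dollars
working_days_per_year = 240

wasted_hours_per_year = delay_per_run_s * runs_per_day * working_days_per_year / 3600
annual_cost = wasted_hours_per_year * loaded_hourly_cost
print(f"{wasted_hours_per_year:.0f} hours/year, ${annual_cost:,.0f}/year")
# → 1600 hours/year, $96,000/year
```

Even a crude estimate like this turns “the system is slow” into a dollar figure that can be weighed against the cost of the fix.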
Step 4: Identify Solutions: Including Hardware Options
To ensure that a fix is going to be cost-effective and work in the long term, it is critical that you identify solutions that are scalable. There are some cases where a capacity increase is the right answer.
Scalable solutions are fixes that allow for workload growth without incurring exponential time and monetary costs. Keep in mind that the most effective solution is typically a software solution.
Step 5: Quantify the Cost of Correcting the Issue
Quantify the cost of the fix, including the small chance that an upgrade is necessary.
Step 6: Execute
It’s Often Easier to Fix the System Than The People
There are occasions where people are using a system inefficiently, and we might say, “Okay, we need to change the way people are using the system.”
But in practice it’s very hard to change the way people use a system.
Instead, it’s better to support the person doing their job, however they’re doing it. Consider factors in technology that can be changed to speed up the process. Support what the person is doing, exactly the way they are doing it.
There are many technical options available to support unique workload characteristics – especially when it comes to the database platform. Capacity increase is only one option of many.
Unfortunately, IT groups will often see more efficient ways of carrying out a process and try to change user behaviour; in practice this rarely works.
The change is deemed a failure and blamed on IT, not because there is something wrong with the method, but because people are unable to change their usage behaviours.
In practice, it’s far easier to speed up how the back-end process or system is running than trying to have people do things differently. There are so many options available around how that can be done. We’re almost spoilt for choice.
One example of this was a BI analytics system where the reporting dashboards were taking too long.
One approach would have been to train the users to change how they queried those dashboards and force them to be more selective in the data that they were viewing, or limit the types of reporting that could be performed.
Instead, we made adjustments to how the dashboards were configured using standard database features. After the changes were made, the dashboards ran in seconds, the staff used them the way they wanted and it required no more hardware, i.e. optimum performance with minimal change from the user perspective.
Using this method may require a little more work on the back end, but it avoids the cost of re-training staff on an existing process, an exercise that usually results in failure.
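The article doesn’t name the specific database features used for the dashboard fix, but one common back-end technique in this spirit is precomputing aggregates (for example as a summary table or materialized view), so each dashboard request becomes a cheap lookup rather than a scan of the raw data. A toy illustration of the idea:

```python
from collections import defaultdict

# Raw fact rows a dashboard might otherwise scan on every request
# (hypothetical sample data).
sales = [
    ("2024-01", "east", 100.0),
    ("2024-01", "west", 250.0),
    ("2024-02", "east", 175.0),
]

# Precompute the aggregate once, analogous to a summary table or
# materialized view refreshed inside the database.
summary = defaultdict(float)
for month, region, amount in sales:
    summary[(month, region)] += amount

def dashboard_total(month, region):
    """Serve the dashboard from the precomputed summary, not the raw rows."""
    return summary.get((month, region), 0.0)

print(dashboard_total("2024-01", "east"))  # 100.0
```

The users keep querying exactly as before; only the work behind the query changes, which is the point the article makes about supporting people the way they already work.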
Case Study 1: Process Runtime: 3 days to 4.5 minutes
A few years back, I did a fix for a telco client.
They had a General Ledger program that would run for three days over a weekend, which had to be scheduled every month end. The closing of month end processes was dependent on this three-day process.
In many cases, the recommended fix would be a capacity increase. Instead, it took us one day to investigate and implement a non-hardware change that dropped the runtime from 3 days to 4.5 minutes.
As a result, their month-end closing no longer waited three days for this process to complete. This meant a faster turnaround in the reporting period and a vast business cost benefit.
No additional hardware or software was required – we just used a structured and methodical approach to identifying the root cause of the problem.
Case Study 2: Report Completion Time: 300 seconds to 2.5 seconds
BSC were supporting the implementation of a large analytic reporting application and dealing with huge volumes of data.
One requirement was that every dashboard displayed in less than five seconds. The problem was that the existing response times ranged from 70 seconds to five minutes for most of the reports, with some requests not receiving any response.
We implemented a non-hardware change that dropped the runtime from 5 minutes to 2.5 seconds.
The fix did not involve a capacity increase, and the system was used far more effectively, returning even the largest reports in under 2.5 seconds. Now, when people use this application they can support near real-time decision making across complex business operations.
An indirect benefit of this solution was that the system could be scaled, with consistently low response times, to thousands of users on a shared infrastructure platform without any additional monetary cost.
Herein lies the real value of analytics. People don’t want to wait two minutes for a report; the brain works faster than the machine can generate the output. So we take the view that if you’re using analytics, it has to be instant. Otherwise, its value deteriorates.
These days, there’s no excuse for leaving end users or the organisation waiting on poor performance. By and large, we are seeing cost-effective solutions for most problems that don’t cost millions of dollars and don’t take years of effort. The best solutions simply use what’s already there a little better, or address gaps in application development or design.
Mark Burgess has been helping organisations obtain the maximum value from their data management platforms for over 20 years. Mark is passionate about enabling secure, fast and reliable access to organisations’ data assets.