Conclusion First
Our production CPU utilization dropped from 99.7% to 19.7% after applying three focused optimizations:
- Index addition for high-impact access paths
- Database cache hit-rate tuning
- Java heap sizing (
Xmx) adjustment
This post summarizes the technical approach without exposing any internal SQL text, schema details, or confidential business data.
Context
We observed persistent high CPU usage under normal workload conditions. The issue was not caused by one single bottleneck, but by a combination of:
- inefficient query access patterns,
- low effective memory utilization in the database layer,
- and suboptimal JVM memory configuration in an application container.
Because these three factors amplified each other, we solved them as one coordinated optimization package.
Runbook: Where to Execute Each Check
Use this order during an incident or performance review.
- On host OS: check container CPU and memory pressure.
- In DB container: inspect query behavior, plans, and cache metrics.
- In app container: inspect JVM memory usage and heap settings.
0) Baseline Capture (before any tuning)
Run these first and save outputs.
| |
Decision signal:
- If one DB container is CPU-hot and query duration is high, start with index path checks.
- If DB hit rate is low or memory settings are tiny, prioritize hit-rate tuning.
- If JVM memory is tight and GC pressure is visible, prioritize Xmx tuning.
1) Index Addition (Execution Plan Driven)
1-1. Find long-running SQL safely
Use metadata views only. Do not copy internal SQL text into external docs.
| |
Decision signal:
- If duration keeps increasing and CPU is high, move to EXPLAIN ANALYZE.
1-2. Validate plan shape
Run EXPLAIN ANALYZE on the target query inside your secure environment.
| |
How to interpret:
Seq Scanon large tables: candidate for index improvement.Rows Removed by Filtervery large: filter index is likely missing.- Expensive sort node: consider index aligned with ORDER BY.
1-3. Add indexes by access pattern
Choose index type based on where cost appears:
- WHERE column hot: single-column index.
- WHERE + WHERE pattern repeated: composite index.
- JOIN key hot: index on joined column.
- ORDER BY hot: composite index with same column order.
| |
| |
Re-check:
| |
Decision signal:
- If plan changes from Seq Scan to Index Scan or Index Only Scan and runtime drops, keep.
- If no plan change, drop or redesign index (wrong column order or low selectivity).
2) Hit Rate Tuning (PostgreSQL Memory Efficiency)
2-1. Measure current hit rate
| |
Optional per-table view:
| |
Decision guide (practical heuristic):
- hit_rate < 30%: high priority memory tuning.
- 30% to 70%: tune and verify against workload.
70%: likely not the first bottleneck; check plan quality and JVM.
2-2. Inspect current memory settings
| |
2-3. Tune in controlled steps
Edit postgresql.conf and increase gradually.
| |
Restart DB service according to your environment policy, then measure again.
Decision signal:
- If hit rate rises and CPU drops without query regressions, keep settings.
- If memory pressure appears (swap/OOM risk), reduce and retest.
3) JVM Heap (Xmx) Adjustment
3-1. Check current JVM memory usage
Inside the application container:
| |
Interpretation:
VmRSS: actual resident memory used now.VmPeak: peak virtual memory.
If VmRSS is close to container memory limit and CPU is high, GC overhead is a likely contributor.
3-2. Increase Xmx safely
Example update in startup script:
| |
Restart the container and re-check VmRSS plus CPU trend.
Decision signal:
- If CPU spikes flatten and response time stabilizes, keep new Xmx.
- If memory contention increases at host level, rebalance Xmx and container limits.
Combined Result and Validation Loop
Validation sequence after each change:
- Capture docker stats snapshot.
- Capture query plan and runtime.
- Capture DB hit rate.
- Capture JVM VmRSS/VmPeak.
We repeated this loop until all three layers were stable.
Final outcome:
- Before: CPU 99.7%
- After: CPU 19.7%
The key was not one magic parameter. The key was plan quality + cache efficiency + heap sizing, validated step by step.
Rollback Procedure (Index, Xmx, DB Settings)
Use this when a tuning change causes regression (latency up, errors up, CPU unstable, or memory pressure).
A) Pre-rollback snapshot
| |
B) Rollback index changes
If newly added indexes made writes slower or plan quality worse, remove only the candidate indexes.
| |
| |
| |
Decision signal:
- If write throughput recovers or plan cost improves, keep rollback.
- If read latency worsens significantly, restore only the proven-good index.
C) Rollback Xmx change
If host memory pressure rises after Xmx increase, revert Xmx to previous value.
| |
| |
Decision signal:
- If OOM risk or memory contention disappears, keep the rollback value.
- If CPU spikes return, select an intermediate heap size and retest.
D) Rollback DB memory settings
If DB tuning causes memory pressure or instability, restore prior values in postgresql.conf.
| |
Restore previously recorded values in the config file, then restart DB according to your operation policy.
| |
After restart, verify:
| |
E) Post-rollback acceptance check
Rollback is accepted when all are true:
- Error rate returns to normal.
- CPU trend is stable (no sustained spike pattern).
- Memory pressure is controlled (no swap/OOM alerts).
- Critical transaction latency is back within SLO.
What We Intentionally Redacted
For confidentiality reasons, this article does not include:
- internal SQL statements,
- table/schema names,
- customer or business identifiers,
- raw operational datasets.
The optimization logic, however, is fully reproducible in other environments.
Final Takeaway
If you want repeatable CPU optimization, use an evidence-first workflow:
- Measure baseline,
- Change one layer,
- Re-measure,
- Keep only what improves both CPU and latency.
In this case, the most reliable sequence was index path tuning, then hit-rate tuning, then Xmx tuning.