Java - Performance Tuning


New Relic
JRebel allows for newly compiled code to be redeployed without restarting the application.

HotSwap support: the object-oriented architecture of the Java HotSpot VM enables advanced features such as on-the-fly class redefinition, or "HotSwap". This feature provides the ability to substitute modified code in a running application through the debugger APIs. HotSwap adds functionality to the Java Platform Debugger Architecture, enabling a class to be updated during execution while under the control of a debugger. It also allows profiling operations to be performed by hotswapping in versions of methods in which profiling code has been inserted.

Use the most recent version of Java if possible.

Ensure that your operating system patches are up-to-date.

Eliminate variability. Run the software you are benchmarking on a stripped-down machine (no other software should be running on that system).

Change one variable at a time and benchmark each change.

Benchmark: measure your application's performance repeatedly (perhaps 100 times). Rigor is especially necessary when measuring Java application performance because the behavior of the Java HotSpot VM adapts and reacts to the specific machine and the specific application it is running. At startup, the JVM typically spends some time in interpreted mode while the application is profiled to find hot methods. When a method gets sufficiently hot, it may be compiled and optimized into native code. So when you benchmark, make sure to exercise the code with sufficient iterations, duration, and data, and give the JVM enough time to warm up. For certain applications, garbage collection can complicate writing benchmarks. It is important to note, however, that for a given set of tuning parameters GC throughput is predictable. So either avoid object allocation in your inner loop (to avoid invoking GC) or run long enough to reach GC steady state. If you do allocate objects as part of the benchmark, be careful to size the heap so as to minimize the impact of GC, and gather enough samples to get a fair average of how much time is spent in GC.
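The warm-up and repeated-sampling advice above can be sketched roughly as follows. This is a minimal hand-rolled harness, not a definitive one: the class name WarmupBenchmark, the workload() method, and the iteration counts are all hypothetical choices made for illustration, and the right warm-up length depends on the application.

```java
import java.util.Arrays;

/** Minimal benchmark sketch: untimed warm-up runs first, then many timed samples. */
final class WarmupBenchmark {

    /** Hypothetical workload: constant, non-trivial, allocation-free work per call. */
    static long workload(int n) {
        long acc = 0;
        for (int i = 1; i <= n; i++) {
            acc += ((long) i * i) % 7;
        }
        return acc;
    }

    /** Runs `warmup` untimed calls, then `samples` timed calls; returns sorted timings in ns. */
    static long[] measure(int warmup, int samples, int n) {
        long sink = 0; // accumulate results so the JIT cannot discard the work
        for (int i = 0; i < warmup; i++) {
            sink += workload(n);
        }
        long[] nanos = new long[samples];
        for (int s = 0; s < samples; s++) {
            long start = System.nanoTime();
            sink += workload(n);
            nanos[s] = System.nanoTime() - start;
        }
        if (sink == 42) {
            System.out.println(sink); // keeps `sink` observable
        }
        Arrays.sort(nanos);
        return nanos;
    }

    public static void main(String[] args) {
        // Assumed numbers: 2,000 warm-up runs and 100 samples, as a starting point only.
        long[] t = measure(2_000, 100, 100_000);
        System.out.printf("min=%d ns, median=%d ns, max=%d ns%n",
                t[0], t[t.length / 2], t[t.length - 1]);
    }
}
```

Reporting the minimum, median, and maximum rather than a single number makes outliers (for example a GC pause or a late compilation) visible instead of hiding them in an average.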

There are even more subtle traps with benchmarking. What if the work inside the loop is not really constant for each iteration? If you append to a string, for example, you may be doing a copy before append which will increase the amount of work each time the loop is executed. Remember to try to make the computation in the loop constant and non-trivial.
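To make the string-append trap concrete, the sketch below counts how many characters are copied if every append first copies the existing contents, which is the model described above. The class and method names are hypothetical; the point is only that the per-iteration work grows with the loop index, so the total grows quadratically.

```java
/** Sketch: work per iteration is not constant when appending to a String in a loop. */
final class LoopWorkSketch {

    /** Total characters copied, assuming each += copies the current contents first. */
    static long charsCopiedByConcat(int iterations) {
        long copied = 0;
        String s = "";
        for (int i = 0; i < iterations; i++) {
            copied += s.length(); // existing contents are copied before the append
            s += "x";
        }
        return copied;
    }

    public static void main(String[] args) {
        // Ten times more iterations costs roughly a hundred times more copying.
        for (int n : new int[] {10, 100, 1000}) {
            System.out.println(n + " iterations -> " + charsCopiedByConcat(n) + " chars copied");
        }
    }
}
```

A benchmark loop built this way measures a different amount of work on each pass, so timings from early and late iterations are not comparable.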

Running to steady state is essential to getting repeatable results. Consider running the benchmark for several minutes. Any application which runs for less than one minute is likely to be dominated by JVM startup time.

Sun's HotSpot JVM has incorporated technology to tune itself. This smart tuning is referred to as Ergonomics. Most computers that have at least 2 CPUs and at least 2 GB of physical memory are considered server-class machines, which means that by default the settings are:

The -server compiler
The -XX:+UseParallelGC parallel (throughput) garbage collector
The -Xms initial heap size is 1/64th of the machine's physical memory
The -Xmx maximum heap size is 1/4th of the machine's physical memory (up to 1 GB max)
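One way to see what the defaults turned out to be on a given machine is to ask the running JVM. The sketch below uses only standard java.lang.Runtime calls and the java.vm.name system property; the class name ErgonomicsReport is hypothetical, and maxMemory() is only an approximation of the effective -Xmx value.

```java
/** Sketch: report the processor count and heap limits the running JVM actually chose. */
final class ErgonomicsReport {

    static String describe() {
        Runtime rt = Runtime.getRuntime();
        long mb = 1024L * 1024L;
        return String.format(
                "processors=%d, maxHeap=%d MB, committedHeap=%d MB, vm=%s",
                rt.availableProcessors(),
                rt.maxMemory() / mb,   // roughly the effective maximum heap
                rt.totalMemory() / mb, // heap currently committed
                System.getProperty("java.vm.name"));
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Running it with and without explicit -Xms and -Xmx options shows how far the chosen defaults are from the values you intend to tune toward.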

On 32-bit Windows systems, the -client compiler is used by default. 64-bit Windows systems which meet the criteria above are treated as server-class machines.

Even though Ergonomics significantly improves the "out of the box" experience for many applications, optimal tuning often requires more attention to the sizing of the Java memory regions.

The maximum heap size of a Java application is limited by 3 factors:

  • the process data model (32-bit or 64-bit) and the associated operating system limitation
  • the amount of virtual memory available on the system
  • the amount of physical memory available on the system

The size of the Java heap for a particular application can never exceed or even reach the maximum virtual address space of the process data model. For a 32-bit process model, the maximum virtual address size of the process is typically 4GB, though some operating systems limit this to 2GB or 3GB.

The maximum heap size is typically -Xmx3800m (-Xmx1600m for 2GB limits), though the actual limitation is application dependent. For a 64-bit process model, the maximum is essentially unlimited.

The next most important Java memory tunable is the size of its young generation (also known as the NewSize). Generally speaking, the largest recommended value for the young generation is 3/8 of the maximum heap size.
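As a worked example of the 3/8 guideline, the sketch below computes the largest recommended young generation for a few heap sizes. The class and method names are hypothetical, and integer division rounds down to whole megabytes.

```java
/** Sketch: largest recommended young generation as 3/8 of the maximum heap. */
final class YoungGenSizing {

    /** Returns the 3/8 upper bound in MB, rounded down. */
    static long maxRecommendedYoungMb(long maxHeapMb) {
        return maxHeapMb * 3 / 8;
    }

    public static void main(String[] args) {
        for (long heapMb : new long[] {1600, 3550, 3800}) {
            System.out.println("-Xmx" + heapMb + "m -> young generation up to "
                    + maxRecommendedYoungMb(heapMb) + "m");
        }
    }
}
```

For a 3800 MB heap this gives 1425 MB, which is a guideline to start from rather than a hard limit.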

The Java platform offers a choice of Garbage Collection algorithms. For each of these algorithms, there are various policy tunables. The first two are common choices for large server applications:

The -XX:+UseParallelGC  parallel (throughput) garbage collector
The -XX:+UseConcMarkSweepGC concurrent (low pause time) garbage collector (also known as CMS)
The -XX:+UseSerialGC serial garbage collector (for smaller applications and systems)
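To confirm which collectors a given set of flags actually selected, the running JVM can be queried through the standard java.lang.management API. This is a small sketch; the class name ActiveCollectors is hypothetical, and the collector names printed are implementation-specific strings that vary between JVM versions.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

/** Sketch: list the garbage collectors active in the running JVM. */
final class ActiveCollectors {

    static List<String> collectorNames() {
        List<String> names = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            names.add(gc.getName());
        }
        return names;
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Count and time are cumulative since JVM start; -1 means "undefined".
            System.out.printf("%s: collections=%d, time=%d ms%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}
```

Running the same program with different collector flags is a quick check that an option was accepted and took effect.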

By appropriately configuring the operating system and then using the command line options -XX:+UseLargePages and -XX:LargePageSizeInBytes, you can get the best efficiency out of your server. Larger page sizes make better use of virtual memory hardware resources (TLBs), but they may cause larger space sizes for the Permanent Generation and the Code Cache, which in turn can force you to reduce the size of your Java heap. This is a small concern with 2MB or 4MB page sizes, but a more interesting concern with 256MB page sizes.

Tuning for throughput. Here is an example of specific command line tuning for a server application running on a system with 4GB of memory and capable of running 32 threads simultaneously:

java -Xmx3800m -Xms3800m -Xmn2g -Xss128k -XX:+UseParallelGC -XX:ParallelGCThreads=20
  • -Xmx3800m -Xms3800m: configure a large Java heap to take advantage of the large memory system
  • -Xmn2g: configure a large heap for the young generation (which can be collected in parallel), again taking advantage of the large memory system. It helps prevent short-lived objects from being prematurely promoted to the old generation, where garbage collection is more expensive
  • -Xss128k: reduce the default maximum thread stack size, which allows more of the process's virtual memory address space to be used by the Java heap
  • -XX:+UseParallelGC: select the parallel garbage collector for the new generation of the Java heap (this is generally the default on the server-class machines)
  • -XX:ParallelGCThreads=20: reduces the number of garbage collection threads. The default is equal to the number of processors, which would probably be unnecessarily high on a 32-thread-capable-system.

Try the parallel old generation collector:

java -Xmx3550m -Xms3550m -Xmn2g -Xss128k -XX:+UseParallelGC -XX:ParallelGCThreads=20 -XX:+UseParallelOldGC
  • -Xmx3550m -Xms3550m: sizes have been reduced. The ParallelOldGC collector has additional native, non-Java heap memory requirements and so the Java heap sizes may need to be reduced when running a 32-bit JVM.
  • -XX:+UseParallelOldGC: use the parallel old generation collector. Certain phases of an old generation collection can be performed in parallel, speeding up the old generation collection.

Try 256MB pages (huge page sizes):

java -Xmx2506m -Xms2506m -Xmn1536m -Xss128k -XX:+UseParallelGC -XX:ParallelGCThreads=20 
-XX:+UseParallelOldGC -XX:LargePageSizeInBytes=256m
  • -Xmx2506m -Xms2506m: sizes have been reduced because using the large page setting causes the permanent generation and code cache sizes to be 256MB and this reduces memory available for the Java heap
  • -Xmn1536m: the young generation heap is often sized as a fraction of the overall Java heap size. Typically we suggest you start tuning with a young generation size of 1/4th the overall heap size. The young generation was reduced in this case to maintain a similar ratio between young generation and old generation sizing used in previous example.
  • -XX:LargePageSizeInBytes=256m: causes the Java heap, including the permanent generation, and the compiled code cache to use as a minimum size one 256MB page (for those platforms which support it)

Try -XX:+AggressiveOpts:

java -Xmx3550m -Xms3550m -Xmn2g -Xss128k -XX:+UseParallelGC -XX:ParallelGCThreads=20 
-XX:+UseParallelOldGC -XX:+AggressiveOpts
  • -XX:+AggressiveOpts: Turns on point performance optimizations that are expected to be on by default in upcoming releases. The changes grouped by this flag are minor changes to JVM runtime compiled code and not distinct performance features. This is a good flag to try the JVM engineering team's latest performance tweaks for upcoming releases. This option is experimental. The specific optimizations enabled by this option can change from release to release, and even from build to build.

Try Biased Locking:

java -Xmx3550m -Xms3550m -Xmn2g -Xss128k -XX:+UseParallelGC -XX:ParallelGCThreads=20 
-XX:+UseParallelOldGC -XX:+AggressiveOpts -XX:+UseBiasedLocking
  • -XX:+UseBiasedLocking: enables a technique for improving the performance of uncontended synchronization. An object is "biased" toward the thread which first acquires its monitor via a monitorenter bytecode or synchronized method invocation; subsequent monitor-related operations performed by that thread are relatively much faster on multi-processor machines. Some applications with significant amounts of uncontended synchronization may attain significant speedups with this flag enabled. Some applications with certain patterns of locking may see slowdowns, though attempts have been made to minimize the negative impact.

Tuning for low pause time and high throughput (using the concurrent garbage collector):

java -Xmx3550m -Xms3550m -Xmn2g -Xss128k -XX:ParallelGCThreads=20 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC 
-XX:SurvivorRatio=8 -XX:TargetSurvivorRatio=90 -XX:MaxTenuringThreshold=31
  • -XX:+UseConcMarkSweepGC -XX:+UseParNewGC: selects the Concurrent Mark Sweep collector. This collector may deliver better response time properties for the application (i.e., low application pause time). It is a parallel and mostly-concurrent collector and can be a good match for the threading ability of a large multi-processor system.
  • -XX:SurvivorRatio=8: set the survivor ratio to 1:8, resulting in larger survivor spaces (the smaller the ratio, the larger the space). Larger survivor spaces allow short-lived objects a longer time period to die in the young generation
  • -XX:TargetSurvivorRatio=90: allows 90% of the survivor spaces to be occupied instead of the default 50%, allowing better utilization of the survivor space memory
  • -XX:MaxTenuringThreshold=31: allows short-lived objects a longer time period to die in the young generation (and hence, avoid promotion). A consequence of this setting is that minor GC times can increase due to additional objects to copy. This value and the survivor space size may need to be adjusted so as to balance the overhead of copying between survivor spaces against tenuring objects that are going to live for a long time. The default settings for CMS are SurvivorRatio=1024 and MaxTenuringThreshold=0, which cause all survivors of a scavenge to be promoted. This can place a lot of pressure on the single concurrent thread collecting the tenured generation. When used with -XX:+UseBiasedLocking, this setting should be 15
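The survivor sizing in this example can be worked through numerically. The sketch below assumes the usual young generation layout of one eden space plus two survivor spaces, so a survivor ratio of 1:8 divides the young generation into 10 parts. The class and method names are hypothetical, and the JVM aligns real sizes, so actual values will differ slightly.

```java
/** Sketch: eden and survivor sizes implied by -Xmn and -XX:SurvivorRatio. */
final class SurvivorSizing {

    /** One survivor space in MB: young / (ratio + 2), since there are two survivors. */
    static double survivorMb(double youngMb, int survivorRatio) {
        return youngMb / (survivorRatio + 2);
    }

    /** Eden in MB: the remaining `ratio` parts of the young generation. */
    static double edenMb(double youngMb, int survivorRatio) {
        return youngMb * survivorRatio / (survivorRatio + 2);
    }

    public static void main(String[] args) {
        double young = 2048; // -Xmn2g
        int ratio = 8;       // -XX:SurvivorRatio=8
        double survivor = survivorMb(young, ratio);
        System.out.printf("eden=%.1f MB, each survivor=%.1f MB%n",
                edenMb(young, ratio), survivor);
        // -XX:TargetSurvivorRatio=90 aims to fill 90% of a survivor space.
        System.out.printf("target survivor occupancy=%.1f MB%n", survivor * 0.90);
    }
}
```

With -Xmn2g this works out to about 1638.4 MB of eden and 204.8 MB per survivor space, of which roughly 184 MB is the occupancy target.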

The New I/O APIs (or NIO) offer improved performance for operations like memory-mapped files and scalable network operations. By using NIO, developers may be able to significantly improve the performance of memory- or network-intensive applications.
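As an illustration of the memory-mapped file support mentioned above, the following sketch maps a file read-only and sums its bytes. The class name, the demo() helper, and the temporary file are hypothetical; only standard java.nio calls are used. A single mapping is limited to 2 GB, so larger files would need to be mapped in pieces.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Sketch: read a file through a read-only memory mapping. */
final class MappedFileSketch {

    /** Sums all bytes (as unsigned values) of the given file. */
    static long sumBytes(Path file) {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            long sum = 0;
            while (buf.hasRemaining()) {
                sum += buf.get() & 0xFF;
            }
            return sum;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes "hello" to a temporary file and sums it via the mapping. */
    static long demo() {
        try {
            Path tmp = Files.createTempFile("nio-sketch", ".bin");
            tmp.toFile().deleteOnExit(); // mapped files may not be deletable immediately
            Files.write(tmp, "hello".getBytes(StandardCharsets.US_ASCII));
            return sumBytes(tmp);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // 104 + 101 + 108 + 108 + 111 = 532
    }
}
```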

Concurrency Utilities: Increasingly, server applications target platforms with multiple CPUs and multiple cores per CPU. To take full advantage of these systems, applications must be designed with multi-threading in mind. Classical multi-threaded programming is very complex and error prone due to subtleties in thread interactions such as race conditions. With the Concurrency Utilities, developers have a solid set of building blocks upon which to build scalable multi-threaded applications while avoiding much of the complexity of writing a multi-threaded framework.
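A small sketch of these building blocks: the example below splits a summation across a fixed thread pool using ExecutorService, Callable, and Future from java.util.concurrent. The class name ParallelSum and the chunking scheme are hypothetical choices for illustration; it assumes tasks is at least 1.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Sketch: sum 1..n in parallel chunks on a fixed-size thread pool. */
final class ParallelSum {

    static long sum(int n, int tasks) {
        ExecutorService pool =
                Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        try {
            int chunk = (n + tasks - 1) / tasks; // ceiling division
            List<Future<Long>> parts = new ArrayList<>();
            for (int start = 1; start <= n; start += chunk) {
                final int lo = start;
                final int hi = Math.min(n, start + chunk - 1);
                Callable<Long> task = () -> {
                    long s = 0;
                    for (int i = lo; i <= hi; i++) {
                        s += i;
                    }
                    return s;
                };
                parts.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> part : parts) {
                total += part.get(); // blocks until that chunk is done
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(sum(1_000_000, 8));
    }
}
```

Each task works on its own local variable and the results are combined only after Future.get() returns, so no shared mutable state or explicit locking is needed.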

How can I determine if my JVM is 32-bit or 64-bit?

From inside the code:
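A minimal sketch, assuming a HotSpot-style JVM: the sun.arch.data.model system property is vendor-specific and may be missing or "unknown" on other VMs, so the example falls back to a rough guess from the standard os.arch property. The class name JvmBitness is hypothetical, and the fallback is a heuristic that can be wrong for some architectures.

```java
/** Sketch: report whether the running JVM is 32-bit or 64-bit. */
final class JvmBitness {

    static String dataModel() {
        // Vendor-specific property: "32" or "64" on Sun/Oracle HotSpot JVMs.
        String model = System.getProperty("sun.arch.data.model");
        if (model != null && !model.equals("unknown")) {
            return model + "-bit";
        }
        // Heuristic fallback on the standard property, e.g. "amd64" or "x86".
        String arch = System.getProperty("os.arch", "");
        return arch.contains("64")
                ? "64-bit (inferred from os.arch)"
                : "32-bit (inferred from os.arch)";
    }

    public static void main(String[] args) {
        System.out.println(dataModel());
    }
}
```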


From the command line, you can try:

java -D64 -version

If it's not a 64-bit version, you'll get a message that looks like: "This Java instance does not support a 64-bit JVM. Please install the desired version."

If you are on Windows NT, you can try:

java -version

and if a 64-bit version is running, you'll get a message like:

java version "1.6.0_18"
Java(TM) SE Runtime Environment (build 1.6.0_18-b07)
Java HotSpot(TM) 64-Bit Server VM (build 16.0-b13, mixed mode)


