The Java language specification imposes unique and demanding
requirements on the runtime environment. For example, Java has such
a high level of abstraction that a seemingly simple block of source
code can turn into a large number of implicit method calls, null
checks, boundary checks, and exception handling calls. This type of
code is a worst-case scenario for most processors due to heavy
branching and extensive use of indirection. Processors that are not
designed to work effectively with these types of operations will
suffer from poor performance due to stalling in the instruction
pipeline and frequent data cache misses.
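To make this concrete, here is a minimal Java sketch of ordinary-looking code whose compiled form carries hidden checks and indirect calls. The class and method names are hypothetical. The comments list the implicit operations the language rules require; how many of them survive in machine code depends on what a given JIT can prove and eliminate. `List.of` assumes Java 9 or later.

```java
import java.util.List;

public class ImplicitChecks {
    // A simple value class used by the example (hypothetical).
    static class Order {
        private final double total;
        Order(double total) { this.total = total; }
        double getTotal() { return total; }
    }

    // One innocent-looking loop body. Per iteration the language rules imply:
    //  - interface calls to orders.size() and orders.get(i) (indirect branches),
    //  - an index check inside the List implementation's get(i),
    //  - a checkcast of the returned Object to Order (inserted by generic erasure),
    //  - a null check on the element before calling getTotal(),
    //  - a virtual call to getTotal() unless the JIT can prove the target.
    static double sum(List<Order> orders) {
        double total = 0;
        for (int i = 0; i < orders.size(); i++) {
            total += orders.get(i).getTotal();
        }
        return total;
    }

    // Each a[i] carries a null check on 'a' and a bounds check that can throw
    // ArrayIndexOutOfBoundsException.
    static int sumFirst(int[] a, int n) {
        int s = 0;
        for (int i = 0; i < n; i++) {
            s += a[i];
        }
        return s;
    }

    public static void main(String[] args) {
        System.out.println(sum(List.of(new Order(1.5), new Order(2.5)))); // prints 4.0
        try {
            sumFirst(new int[] {1, 2, 3}, 4); // index 3 fails the bounds check
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("bounds check fired: " + e.getMessage());
        }
    }
}
```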
Systems based on Intel architecture have proven to be a robust
and high-performance deployment platform for Java on both the server
and desktop. Start with the outstanding raw performance of the Intel
Pentium 4 and Xeon processors, incorporate them into a
high-performance system architecture, layer on a proven operating
system such as Windows XP or Linux, top this with a Java Virtual
Machine (JVM) optimized for Intel processors by vendors such as BEA
or IBM, and you have a top-flight scalable Java execution
environment.
Let's look at three of the technologies that directly—and
dramatically—improve Java performance on Intel architecture: garbage
collection, Hyper-Threading technology, and optimizations for
branching.
Efficient Garbage Collection
For best performance, Java
applications require software and hardware to cooperate to minimize
bottlenecks and to maximize throughput. Nowhere in a Java solution
is this more important than in garbage collection, the JVM's
automatic system for reclaiming objects that are no longer in use. It is
typical for developers to focus their efforts on writing code to
meet functional requirements. As a result, class hierarchies are
commonly designed without a clear understanding of their impact on
the heap during program execution. This neglect can lead to serious
application performance and scalability problems as the system
pauses non-deterministically for garbage collection.
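One way to see the heap churn an innocent-looking design produces is to ask the JVM how often it has collected. The sketch below uses the standard `java.lang.management` API (`GarbageCollectorMXBean.getCollectionCount()` and `getCollectionTime()`). The `GcWatch` class and its `churn` workload are hypothetical stand-ins for application code that allocates many short-lived objects.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

public class GcWatch {
    // Total collections so far across all collectors this JVM exposes.
    static long totalCollections() {
        long n = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long c = gc.getCollectionCount(); // -1 means "undefined" for this collector
            if (c > 0) n += c;
        }
        return n;
    }

    // Approximate accumulated collection time in milliseconds.
    static long totalCollectionMillis() {
        long ms = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 means "undefined" for this collector
            if (t > 0) ms += t;
        }
        return ms;
    }

    // Hypothetical workload: every iteration allocates a new ArrayList and its
    // backing arrays, all of which become garbage immediately.
    static long churn(int iterations) {
        long checksum = 0;
        for (int i = 0; i < iterations; i++) {
            List<Integer> tmp = new ArrayList<>();
            for (int j = 0; j < 100; j++) tmp.add(j);
            checksum += tmp.size();
        }
        return checksum;
    }

    public static void main(String[] args) {
        long count0 = totalCollections();
        long time0 = totalCollectionMillis();
        // A JIT may remove some of these allocations via escape analysis,
        // so the reported counts vary by JVM and settings.
        long checksum = churn(1_000_000);
        System.out.println("checksum=" + checksum
                + " collections=" + (totalCollections() - count0)
                + " gcTimeMs=" + (totalCollectionMillis() - time0));
    }
}
```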
JVM vendors such as BEA and IBM have placed a high emphasis on
reducing the drag on performance by implementing capable garbage
collection algorithms on the Intel platform. Original
implementations of garbage collection simply did not scale on
machines with more than one processor. This lack of scalability was
caused by an algorithm design that stopped all Java threads from
running while garbage collection was in progress. The problem became
much more apparent as Java applications migrated away from the
desktop to enterprise-class machines.
Intel and JVM designers worked together through Intel's partner
support programs, such as the Early Access Program, to bring these
server-class machines improved garbage collection algorithms that let
the garbage collection cycle run concurrently with the threads
processing user requests. These new designs yield significant
scalability improvements, because each processor in an n-way Xeon
processor-based system can be kept busy doing useful work instead of
waiting for a lengthy garbage collection and heap compaction cycle
to complete. Because Intel machines are so widely deployed in
enterprise computing, these innovations appeared in JVMs running on
Intel architecture before being ported to virtual machines running
on other architectures.
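The pauses described above can be observed from inside an application. In this minimal sketch (the `PauseDetector` class is hypothetical), one thread repeatedly sleeps for a short interval and records its worst late wake-up while a background thread allocates garbage; a long stop-the-world collection shows up as a large overshoot. The measurement cannot distinguish a GC pause from ordinary OS scheduling delay, so treat it as a rough indicator rather than a GC profiler.

```java
public class PauseDetector {
    // Sleeps 'samples' times for 'intervalMillis' and returns the largest amount,
    // in milliseconds, by which a wake-up was late. A stop-the-world collection
    // (or any other stall, including OS scheduling) shows up as a large value.
    static long maxOvershootMillis(long intervalMillis, int samples) {
        long worst = 0;
        try {
            for (int i = 0; i < samples; i++) {
                long start = System.nanoTime();
                Thread.sleep(intervalMillis);
                long elapsedMs = (System.nanoTime() - start) / 1_000_000;
                worst = Math.max(worst, elapsedMs - intervalMillis);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt; return what we have
        }
        return worst;
    }

    public static void main(String[] args) {
        // Background thread that keeps a rolling window of about 4 MB of live data
        // and discards everything older, giving the collector steady work.
        Thread producer = new Thread(() -> {
            byte[][] window = new byte[64][];
            int slot = 0;
            while (!Thread.currentThread().isInterrupted()) {
                window[slot] = new byte[64 * 1024];
                slot = (slot + 1) % window.length;
            }
        });
        producer.setDaemon(true);
        producer.start();
        System.out.println("worst wake-up delay: " + maxOvershootMillis(5, 200) + " ms");
        producer.interrupt();
    }
}
```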
Hyper-Threading Technology
In early 2002, Intel
introduced an architectural innovation that results in even better
Java performance and scalability. With Hyper-Threading technology,
one physical processor can be viewed as two logical processors each
with its own state. The performance improvements due to this design
arise from the following factors: (a) the JVM's threads can
execute simultaneously on the two logical processors; (b) the on-chip
execution resources are used more efficiently than when a single
thread has them to itself.
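From the Java side, logical processors simply appear as additional processors. The sketch below (the `LogicalCpus` class and its methods are hypothetical) sizes a thread pool from `Runtime.availableProcessors()`, which on a Hyper-Threading system typically counts logical processors, and splits a CPU-bound loop into one task per processor.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class LogicalCpus {
    // Processors available to this JVM. With Hyper-Threading enabled this
    // typically counts logical processors, e.g. 2 on one HT-capable CPU.
    static int workers() {
        return Runtime.getRuntime().availableProcessors();
    }

    // Sums 0..n-1 using one task per available (logical) processor.
    static long parallelSum(long n) {
        int k = workers();
        ExecutorService pool = Executors.newFixedThreadPool(k);
        try {
            List<Future<Long>> parts = new ArrayList<>();
            long chunk = n / k;
            for (int t = 0; t < k; t++) {
                long lo = t * chunk;
                long hi = (t == k - 1) ? n : lo + chunk; // last task takes the remainder
                parts.add(pool.submit(() -> {
                    long s = 0;
                    for (long x = lo; x < hi; x++) s += x;
                    return s;
                }));
            }
            long total = 0;
            for (Future<Long> f : parts) total += f.get();
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("logical processors: " + workers());
        System.out.println("sum: " + parallelSum(10_000_000L));
    }
}
```

Whether two threads sharing one physical processor actually finish sooner depends on how well their instruction mixes share the execution units; sizing the pool this way is a reasonable default, not a guarantee.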
Java should benefit from this Intel innovation more than any
other commercial language due to its insatiable demand for threads.
When Hyper-Threading technology becomes available on desktop
processors, Java performance on single-processor machines will
improve dramatically, because the JVM's threads (including
the garbage collection thread) can be interleaved on one physical processor.
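Even a program that never creates a thread runs alongside several JVM service threads. This small sketch (the `JvmThreads` class is hypothetical) lists the live Java-level threads with `Thread.getAllStackTraces()`. The names vary by JVM vendor and version, and some internal threads, such as many garbage collection workers, are native and do not appear in this list at all.

```java
import java.util.Set;
import java.util.TreeSet;

public class JvmThreads {
    // Names of all live Java-level threads, sorted (duplicate names collapse).
    static Set<String> liveThreadNames() {
        Set<String> names = new TreeSet<>();
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            names.add(t.getName());
        }
        return names;
    }

    public static void main(String[] args) {
        // This program creates no threads itself, yet typically prints several,
        // for example reference-processing and signal-handling threads alongside "main".
        for (String name : liveThreadNames()) {
            System.out.println(name);
        }
    }
}
```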
Optimizations for Branching
State-of-the-art JVM
implementations include sophisticated Just-In-Time (JIT) optimizing
compilers to minimize application execution time. These compilers take
advantage of a static call graph inferred from the bytecode
structure, together with run-time profiling, to determine the best
feasible code generation. Because JIT compilation happens on the fly
and competes with the running application for time, there will always
be a limit to what optimizations can be accomplished in this phase.
Although branching and
indirection can be reduced by intelligent JIT compilation, a large
burden still falls on the processor to quickly execute this complex
code stream.
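As an illustration of how a JIT can reduce indirection, consider an interface call site. This sketch (the `Shape` hierarchy is hypothetical) shows where the indirect branch comes from; the comment describes guarded inlining driven by type profiles, a common JIT technique, though whether and when a particular JVM applies it is implementation-specific.

```java
public class CallSites {
    interface Shape {
        double area();
    }

    static final class Square implements Shape {
        private final double side;
        Square(double side) { this.side = side; }
        public double area() { return side * side; }
    }

    static final class Circle implements Shape {
        private final double r;
        Circle(double r) { this.r = r; }
        public double area() { return Math.PI * r * r; }
    }

    // s.area() is an interface call: in general an indirect branch whose target
    // depends on the receiver's class. If run-time profiling shows only Square
    // at this site, a JIT can guard on that class and inline Square.area(),
    // replacing the indirect call with a predictable compare-and-branch. If many
    // classes show up, the indirect dispatch generally has to stay.
    static double totalArea(Shape[] shapes) {
        double sum = 0;
        for (Shape s : shapes) {
            sum += s.area();
        }
        return sum;
    }

    public static void main(String[] args) {
        Shape[] shapes = {new Square(2), new Circle(1), new Square(3)};
        System.out.println(totalArea(shapes));
    }
}
```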
The Pentium 4 and Xeon processors were designed with these
requirements in mind and provide tailored support for code heavy
with branching and indirection. Advanced branch prediction circuitry
is incorporated that can reduce delays for a correctly predicted
branch to almost zero clock cycles. The JIT cooperates with the
processor by generating code so that branches are predicted
correctly in most cases. Indirection is handled efficiently by a
512KB L2 cache. This large cache, combined with support for
speculative loads (that is, loading memory before a branch is
resolved), results in excellent throughput and utilization
of memory bandwidth.
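The effect of branch prediction can be seen with the classic sorted-versus-random experiment, sketched here with a hypothetical `BranchDemo` class. The same data-dependent `if` runs over the same values in two orders; on sorted data the branch is highly predictable. Timings are machine-dependent, and a JIT may compile the branch into a conditional move, in which case the difference largely disappears.

```java
import java.util.Arrays;
import java.util.Random;

public class BranchDemo {
    // Adds up the elements that are >= 128. The 'if' is a data-dependent branch:
    // on random values it goes either way unpredictably, while on sorted data it
    // is not taken for a long run and then taken for a long run, which a branch
    // predictor handles almost perfectly.
    static long sumLarge(int[] data) {
        long sum = 0;
        for (int v : data) {
            if (v >= 128) {
                sum += v;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] random = new Random(42).ints(1_000_000, 0, 256).toArray();
        int[] sorted = random.clone();
        Arrays.sort(sorted);
        for (int round = 0; round < 3; round++) { // later rounds run JIT-compiled code
            for (int[] data : new int[][] {random, sorted}) {
                long start = System.nanoTime();
                long sum = 0;
                for (int rep = 0; rep < 20; rep++) sum += sumLarge(data);
                long ms = (System.nanoTime() - start) / 1_000_000;
                System.out.println((data == sorted ? "sorted: " : "random: ")
                        + ms + " ms (sum " + sum + ")");
            }
        }
    }
}
```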
Java and the Intel platform are an excellent match, thanks to features
designed for speed such as a 20-stage execution pipeline,
double-clocked arithmetic units, a 12K micro-operation instruction trace
cache, a 512KB L2 cache, and Hyper-Threading support. When all else
fails and a Java application simply needs a faster processor, Intel
offers ever-increasing clock rates on the Pentium 4 and Xeon
processors (up to 2.40 GHz). A system bus with an effective data
transfer rate of 3.2 GB/s keeps processors running at these speeds
fed with instructions and data.
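The 3.2 GB/s bus figure follows from the bus parameters. Assuming the 400 MT/s system bus used by Pentium 4 and Xeon processors of this generation (a 100 MHz clock carrying four transfers per cycle) and a 64-bit (8-byte) data path, the arithmetic is:

```latex
\underbrace{100\ \text{MHz}}_{\text{bus clock}}
\times \underbrace{4\ \tfrac{\text{transfers}}{\text{clock}}}_{\text{quad-pumped}}
\times \underbrace{8\ \tfrac{\text{bytes}}{\text{transfer}}}_{\text{64-bit bus}}
= 3.2 \times 10^{9}\ \text{bytes/s} = 3.2\ \text{GB/s}
```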
The Proof in the Pudding
Independent benchmarks clearly show that the Intel
platform offers the best performance and price/performance for
Java-based solutions. The top three scores for the ECperf benchmark
are held by solutions running on Intel processors with widely differing
combinations of operating systems, databases, and application
servers. The best performing solution on the SPECjvm98 benchmark, which focuses strictly on
JVM/JIT performance, is also based on an Intel processor.
Java developers who want to keep up with the latest Java
performance tips and tricks should sign up
for Intel Developer Services (it's free), where they'll find many
tips and technical papers.