# Profiling modes

Besides CPU time, async-profiler provides various other profiling modes such as `Allocation`, `Wall Clock`, `Java Method`,
and even a `Multiple Events` profiling mode.

## CPU profiling

In this mode, the profiler collects stack trace samples that include **Java** methods,
**native** calls, **JVM** code and **kernel** functions.

The general approach is receiving call stacks generated by `perf_events`
and matching them up with call stacks generated by `AsyncGetCallTrace`,
in order to produce an accurate profile of both Java and native code.
Additionally, async-profiler provides a workaround to recover stack traces
in some [corner cases](https://bugs.openjdk.java.net/browse/JDK-8178287)
where `AsyncGetCallTrace` fails.

This approach has the following advantages compared to using `perf_events`
directly with a Java agent that translates addresses to Java method names:

- Does not require `-XX:+PreserveFramePointer`, which introduces
  performance overhead that can sometimes be as high as 10%.
- Does not require starting the JVM with an agent for translating Java code addresses
  to method names.
- Displays interpreter frames.
- Does not produce large intermediate files (perf.data) for further processing in
  user space scripts.

If you wish to resolve frames within `libjvm`, the [debug symbols](#installing-debug-symbols) are required.

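As a minimal sketch of CPU profiling (the PID 8983 is hypothetical, matching the examples in the sections below):

```
asprof -e cpu -d 30 -f cpu-flame.html 8983
```

Here `-d 30` profiles for 30 seconds and `-f` writes the resulting flame graph to the given file.
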
## Allocation profiling

The profiler can be configured to collect call sites where the largest amount
of heap memory is allocated.

async-profiler does not use intrusive techniques like bytecode instrumentation
or expensive DTrace probes, which have a significant performance impact.
It also does not affect Escape Analysis or prevent JIT optimizations
like allocation elimination. Only actual heap allocations are measured.

The profiler features TLAB-driven sampling. It relies on HotSpot-specific
callbacks to receive two kinds of notifications:

- when an object is allocated in a newly created TLAB;
- when an object is allocated on a slow path outside TLAB.

The sampling interval can be adjusted with the `--alloc` option.
For example, `--alloc 500k` will take one sample after 500 KB of allocated
space on average. Prior to JDK 11, intervals smaller than the TLAB size will not take effect.

In allocation profiling mode, the top frame of every call trace is the class
of the allocated object, and the counter is the heap pressure (the total size
of allocated TLABs or objects outside TLAB).

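Putting this together, a sketch of a one-minute allocation profile with a 500 KB sampling interval (the PID is hypothetical):

```
asprof -e alloc --alloc 500k -d 60 -f alloc-flame.html 8983
```
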
### Installing Debug Symbols

Prior to JDK 11, the allocation profiler required HotSpot debug symbols.
Some OpenJDK distributions (Amazon Corretto, Liberica JDK, Azul Zulu)
already have them embedded in `libjvm.so`; other OpenJDK builds typically
provide debug symbols in a separate package. For example, to install
OpenJDK debug symbols on Debian / Ubuntu, run:

```
# apt install openjdk-17-dbg
```

(replace `17` with the desired version of JDK).

On CentOS, RHEL and some other RPM-based distributions, this can be done with the
[debuginfo-install](http://man7.org/linux/man-pages/man1/debuginfo-install.1.html) utility:

```
# debuginfo-install java-1.8.0-openjdk
```

On Gentoo, the `icedtea` OpenJDK package can be built with the per-package setting
`FEATURES="nostrip"` to retain symbols.

The `gdb` tool can be used to verify whether debug symbols are properly installed for the `libjvm` library.
For example, on Linux:

```
$ gdb $JAVA_HOME/lib/server/libjvm.so -ex 'info address UseG1GC'
```

This command's output will contain either `Symbol "UseG1GC" is at 0xxxxx`
or `No symbol "UseG1GC" in current context`.

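If `gdb` is not available, a rough alternative check is to look for DWARF sections in the library with `readelf` (a sketch, assuming a JDK 9+ directory layout; a stripped `libjvm.so` will print no `.debug_info` line):

```
readelf -S "$JAVA_HOME/lib/server/libjvm.so" | grep debug_info
```
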
## Native memory leaks

The profiling mode `nativemem` records `malloc`, `realloc`, `calloc` and `free` calls
along with the addresses, so that allocations can be matched with frees. This helps to focus
the profile report only on unfreed allocations, which are the likely source of a memory leak.

Example:

```
asprof start -e nativemem -f app.jfr <YourApp>
# or
asprof start --nativemem N -f app.jfr <YourApp>
# or, if only allocation calls are interesting, do not collect free calls:
asprof start --nativemem N --nofree -f app.jfr <YourApp>

asprof stop <YourApp>
```

Now we need to process the JFR file to find native memory leaks:

```
# --total for bytes; the default counts invocations.
jfrconv --total --nativemem --leak app.jfr app-leak.html

# No leak analysis, include all native allocations:
jfrconv --total --nativemem app.jfr app-malloc.html
```

When the `--leak` option is used, the generated flame graph will show allocations without matching `free` calls.

![]()

To avoid bias towards the youngest allocations not freed by the end of the profiling session,
the leak profiler ignores tail allocations made in the last 10% of the profiling period.
The tail length can be altered with the `--tail` option, which accepts a `ratio` or `percent%` as an argument.
For example, to ignore allocations in the last 2 minutes of a 10 minute profile, use

```
jfrconv --nativemem --leak --tail 20% app.jfr app-leak.html
```

The overhead of `nativemem` profiling depends on the number of native allocations,
but is usually small enough even for production use. If required, the overhead can be reduced
by configuring the profiling interval. E.g., if you add the `nativemem=1m` profiler option,
allocation samples will be limited to at most one sample per allocated megabyte.

### Using LD_PRELOAD for finding native memory leaks

Similar to Java applications, `nativemem` mode can also be used with [non-Java processes](ProfilingNonJavaApplications.md).

Run an application with the `nativemem` profiler so that it dumps recordings in JFR format every 10 minutes:

```
LD_PRELOAD=/path/to/libasyncProfiler.so ASPROF_COMMAND=start,nativemem,total,loop=10m,cstack=dwarf,file=profile-%t.jfr NativeApp [args]
```

Then run `jfrconv` to generate a memory leak report as a flame graph:

```
jfrconv --total --nativemem --leak <profile>.jfr <profile>-leak.html
```

## Wall-clock profiling

The `-e wall` option tells async-profiler to sample all threads equally every given
period of time, regardless of thread status: Running, Sleeping or Blocked.
For instance, this can be helpful when profiling application start-up time.

The wall-clock profiler is most useful in per-thread mode: `-t`.

Example: `asprof -e wall -t -i 50ms -f result.html 8983`

## Lock profiling

The `-e lock` option tells async-profiler to measure lock contention in the profiled application. Lock profiling can help
developers understand lock acquisition patterns, lock contention (when threads have to wait to acquire locks), time
spent waiting for locks, and which code paths are blocked due to locks.

In lock profiling mode, the top frame is the class of the lock/monitor, and the counter is the number of nanoseconds it took to
enter this lock/monitor.

Example: `asprof -e lock -t -i 5ms -f result.html 8983`

## Native lock profiling

The `--nativelock` option tells async-profiler to measure pthread lock contention in the profiled application.
Native lock profiling can help developers understand pthread lock acquisition patterns, lock contention (when threads
have to wait to acquire native locks), time spent waiting for pthread mutexes and read-write locks, and which code paths
are blocked due to native synchronization primitives.

Native lock profiling works by intercepting calls to:

- [`pthread_mutex_lock`](https://man7.org/linux/man-pages/man3/pthread_mutex_lock.3p.html)
- [`pthread_rwlock_rdlock`](https://man7.org/linux/man-pages/man3/pthread_rwlock_rdlock.3p.html)
- [`pthread_rwlock_wrlock`](https://man7.org/linux/man-pages/man3/pthread_rwlock_wrlock.3p.html)

In this mode, the top frame shows the native function that experienced contention (e.g. `pthread_mutex_lock_hook`),
and the counter represents the number of nanoseconds threads spent waiting to acquire the lock.

Key differences from Java lock profiling:

- Profiles native pthread locks instead of Java monitors.
- Works with C/C++ applications and native libraries used by Java applications.
- Captures contention in native code paths that Java lock profiling cannot see.

Example: `asprof --nativelock 5ms -t -f result.html 8983`

## Java method profiling

The `-e ClassName.methodName` option instruments the given Java method
in order to record all invocations of this method with their stack traces.

Example: `-e java.util.Properties.getProperty` will profile all places
where the `getProperty` method is called from.

Only non-native Java methods are supported. To profile a native method,
use a hardware breakpoint event instead, e.g. `-e Java_java_lang_Throwable_fillInStackTrace`.

**Be aware** that if you attach async-profiler at runtime, the first instrumentation
of a non-native Java method may cause the [deoptimization](https://github.com/openjdk/jdk/blob/bf2e9ee9d321ed289466b2410f12ad10504d01a2/src/hotspot/share/prims/jvmtiRedefineClasses.cpp#L4092-L4096)
of all compiled methods. Subsequent instrumentation flushes only the _dependent code_.

The massive CodeCache flush does not occur when async-profiler is attached as an agent.

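A full command line for the `getProperty` example above might look like this sketch (the PID 8983 is hypothetical, as in the other examples):

```
asprof -e java.util.Properties.getProperty -d 30 -f getproperty.html 8983
```
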
### Latency profiling

Please refer to our blog post on [latency profiling](https://github.com/async-profiler/async-profiler/discussions/1497)
to learn more about this profiling mode.

## Native function profiling

Here are some useful native functions to profile:

- `G1CollectedHeap::humongous_obj_allocate` - trace _humongous allocations_ of the G1 GC,
- `JVM_StartThread` - trace creation of new Java threads,
- `Java_java_lang_ClassLoader_defineClass1` - trace class loading.

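For instance, a sketch of tracing Java thread creation for 30 seconds (the PID is hypothetical):

```
asprof -e JVM_StartThread -d 30 -f threads.html 8983
```
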
## Multiple events

It is possible to profile CPU, allocations, and locks at the same time.
Instead of CPU, you may choose any other execution event: wall-clock,
perf event, tracepoint, Java method, etc.

The only output format that supports multiple events together is JFR.
The recording will contain the following event types:

- `jdk.ExecutionSample`
- `jdk.ObjectAllocationInNewTLAB` (alloc)
- `jdk.ObjectAllocationOutsideTLAB` (alloc)
- `jdk.JavaMonitorEnter` (lock)
- `jdk.ThreadPark` (lock)

To start profiling cpu + allocations + locks together, specify

```
asprof -e cpu,alloc,lock -f profile.jfr ...
```

or use the `--alloc` and `--lock` parameters with the desired threshold:

```
asprof -e cpu --alloc 2m --lock 10ms -f profile.jfr ...
```

The same, when starting the profiler as an agent:

```
-agentpath:/path/to/libasyncProfiler.so=start,event=cpu,alloc=2m,lock=10ms,file=profile.jfr
```

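The recorded events can be inspected with the JDK's own `jfr` tool (available since JDK 11); for example, to print only the execution samples:

```
jfr print --events jdk.ExecutionSample profile.jfr
```
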
### Multi-event profiling using `--all`

The `--all` flag offers a way to simultaneously enable a predefined collection of common profiling events. By default, `--all` activates profiling for `cpu`, `wall`, `alloc`, `live`, `lock` and `nativemem`.

**Important consideration**

While the `--all` flag can be useful in development environments to get a wide overview, it is not recommended to enable it in production, especially for continuous profiling. Users are invited to select carefully what to profile and with what settings.

**Sample command:**

This command enables the default set of events included in `--all`:

```
asprof --all -f profile.jfr
```

or combine it with the `--alloc`/`--wall`/`--lock`/`--nativemem` options to override individual settings. For example:

```
asprof --all --alloc 2m --lock 10ms -f profile.jfr
```

The same, when starting the profiler as an agent:

```
-agentpath:/path/to/libasyncProfiler.so=start,all,alloc=2m,lock=10ms,file=profile.jfr
```

Instead of `cpu`, it is possible to override the `--all` parameter with any other event type of your choice. For instance, the following command will profile `cycles` along with `wall`, `alloc`, `live`, `lock` and `nativemem`:

```
asprof --all -e cycles -f profile.jfr
```

## Continuous profiling

Continuous profiling is a means by which an application can be profiled
continuously, dumping profile results every specified time period.
It is a very effective technique for finding performance degradations proactively
and efficiently. Continuous profiling helps users understand performance
differences between versions of the same application. Recent outputs can
be compared with the continuous profiling output history to find differences
and optimize the changes introduced in case of performance degradations.
async-profiler provides the ability to continuously profile an application with
the `loop` option. Make sure the filename includes a timestamp pattern, or the
output will be overwritten on each iteration.

```
asprof --loop 1h -f /var/log/profile-%t.jfr 8983
```

## perf event types supported on Linux

| Usage | Description |
| ----- | ----------- |
| Predefined: | |
| `-e cpu-clock` | High-resolution per-CPU timer. Similar to `-e cpu`, but forces using perf_events. |
| `-e page-faults` | Software page faults |
| `-e context-switches` | Context switches |
| `-e cycles` | Total CPU cycles |
| `-e ref-cycles` | CPU reference cycles, not affected by CPU frequency scaling |
| `-e instructions` | Retired CPU instructions |
| `-e cache-references` | Cache accesses (usually Last Level Cache, but may depend on the architecture) |
| `-e cache-misses` | Cache accesses requiring fetching data from a higher-level cache or main memory |
| `-e branch-instructions` | Retired branch instructions |
| `-e branch-misses` | Mispredicted branch instructions |
| `-e bus-cycles` | Bus cycles |
| `-e L1-dcache-load-misses` | Cache misses on the Level 1 Data Cache |
| `-e LLC-load-misses` | Cache misses on the Last Level Cache |
| `-e dTLB-load-misses` | Data load misses on the Translation Lookaside Buffer |
| Breakpoint: | |
| `-e mem:<addr>` | Breakpoint on a decimal or hex (0x) address |
| `-e mem:<func>` | Breakpoint on a public or a private symbol |
| `-e mem:<func>[+<offset>][/<len>][:<rwx>]` | Breakpoint on a symbol or an address with offset, length and read/write/exec flags. Address, offset and length can be hex or decimal. The format of the `mem` event is the same as in [`perf-record`](https://man7.org/linux/man-pages/man1/perf-record.1.html). |
| `-e <symbol>` | Equivalent to an execution breakpoint on a symbol: `mem:<symbol>:x`. Example: `-e strcmp` will trace all calls of the native `strcmp` function. |
| Tracepoint: | |
| `-e trace:<id>` | Kernel tracepoint with the given numeric id |
| `-e <tracepoint>` | Kernel tracepoint with the specified name. Example: `-e syscalls:sys_enter_open` will trace all `open` syscalls. |
| Probes: | |
| `-e kprobe:<func>[+<offset>]` | Kernel probe. Example: `-e kprobe:do_sys_open`. |
| `-e kretprobe:<func>[+<offset>]` | Kernel return probe. Example: `-e kretprobe:do_sys_open`. |
| `-e uprobe:<func>[+<offset>]` | Userspace probe. Example: `-e uprobe:/usr/lib64/libc-2.17.so+0x114790`. |
| `-e uretprobe:<func>[+<offset>]` | Userspace return probe |
| PMU: | |
| `-e r<NNN>` | Architecture-specific PMU event with the given number. Example: `-e r4d2` selects the `MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM` event, which corresponds to event 0xd2, umask 0x4. |
| `-e <pmu descriptor>` | PMU event descriptor. Example: `-e cpu/cache-misses/`, `-e cpu/event=0xd2,umask=4/`. The same syntax can be used for uncore and vendor-specific events, e.g. `amd_l3/event=0x01,umask=0x80/`. |