JavaScript Do It Yourself Java Profiling

META · Mar 29, 2026

But I also suggest taking a look at the 70K text of the illustrated article-transcript below the cut, compiled by me from the video and slides.

DIY Java Profiling (Roman Elizarov, ADD-2011).pdf

Today I'm giving a talk on "Do-It-Yourself Java Profiling." The slides will be in English, but I'll be delivering the talk in Russian. There are a lot of slides, and time is limited, so I'll be rushing through some of them and will try to leave more time for questions at the end, and maybe even somewhere in the middle. Feel free to ask questions or clarify anything, or even interrupt me. I'd rather cover what you're interested in in more detail than simply repeat what I wanted to say.

The talk is based on real-world experience: at our company, we've been building complex, highly loaded financial applications for over ten years. We work with large data sets, process millions of quotes per second, and serve tens of thousands of online users.

Application profiling is an essential component of any optimization; optimization without profiling is impossible. You profile, find bottlenecks, optimize, and profile again, in a constant cycle.

Why is the talk called "Do It Yourself Java Profiler"? Why bother doing anything yourself? There are a huge number of ready-made tools that help with profiling—profilers themselves, and similar tools.

The first problem is that a third-party tool may simply be off limits. For reasons such as reliability or security, you might not be able to run a third-party tool in a live environment. Unfortunately, profiling often has to be done not only on a test platform but also on a live one, and for a high-load platform you don't always have the capacity and resources to create an identical copy of the system and run it under the same load. Many bottlenecks reveal themselves only under heavy load and very specific conditions. You see that the system is behaving differently, but you don't understand why, and you don't know what load pattern you would have to create to reproduce the problem. That's why profiling is often necessary on a live system.

When we write financial applications, we also have the task of ensuring system reliability. And we're not building "banks," where the main goal is to protect your money, but which can be unavailable for hours. We're building online trading brokerage systems, where 24/7 system availability is one of their key qualities; they should never crash.

And as I've already mentioned, the entire industry is heavily regulated, and sometimes we can't use a third-party product on a real system.

But the tools themselves are often opaque. Yes, there's documentation that describes what a tool reports, but not how exactly it arrives at that result, and it's not always clear what the tool actually measured.

Even if a tool is open source, it doesn't change anything because there's a lot of code, and you'll waste a ton of time figuring it out. Tools require learning, but doing something yourself is, of course, much more enjoyable.

What's the problem with learning? Naturally, if you use a tool frequently, it's worth learning. If you program every day in your favorite development environment, you know it inside and out, and this knowledge naturally pays off handsomely.

But if you need to do something once a month—for example, if you need to profile a performance bug once a month—then learning the relevant tool won't necessarily pay off. Of course, unless there's a situation where the tool solves the problem many times faster.

By doing something yourself, you can reuse your knowledge. For example, if you have knowledge of programming languages and your own tools, you can deepen, expand, and refine that knowledge by delving deeper into the tools you already have, rather than learning a new one.

Why a talk about Java? Not only does our company program in Java, it's also the leading language of 2011 according to the TIOBE index, and it's well suited to enterprise applications. It's perfect for this particular talk because Java is a managed language that runs in a virtual machine, and profiling in Java is incredibly easy.

First, I'll talk about how to solve many profiling problems by writing some Java code. I'll also discuss the capabilities of Java virtual machines that can be used, and I'll also introduce a technique called bytecode manipulation.

We'll look at both CPU and memory profiling today. I'll cover a variety of techniques.

CPU profiling. The easiest way is to just program. If you need to figure out where, how much, and what in your program is spending time, and how many times it's called, then this is the simplest way: no tools, nothing—just write a few lines of code.

Java has a wonderful function called "currentTimeMillis," which returns the current time. You can measure it in one place, measure it in another, and then count how many times it's been done, the total time, the minimum and maximum time, whatever you like. It's the simplest method. DIY at its most simple and straightforward.

Surprisingly, in practice, this method works great and brings a ton of benefits—because it's fast, convenient, and effective.

When does this method work well? It works great for business methods: a business method is large, is not called very often, and there's something about it you need to measure. Moreover, once you've written this code, it becomes part of your application and part of its functionality. More or less any large modern application contains management interfaces and some statistics, and application performance is one of those things you often want an application to report about itself, simply as part of its functionality.

In this sense, programming an application to profile itself is a logical step. You extend the application's functionality, and profiling becomes part of it. Especially if you instrument the business methods that the end user calls, the end user will also find this information meaningful: how many times each method was called, how long it ran, and so on. With this approach, the information you collect is completely under your control. You can measure the number of calls and the minimum, maximum, and average time; you can build histograms of the execution time distribution and calculate medians and percentiles. You can track different execution paths in the code separately, as in the example on the slide. Those who were reading it while I was talking will have noticed that we record different statistics depending on the execution path: how often the query result came from the cache and how long that took, and how often the query had to go to the database and how long that took.

This is possible if you write the code yourself, collect statistics, and integrate them into your application.
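As a rough illustration of the self-profiling just described, here is a minimal Java sketch. The names CallStats, QuoteService, and loadFromDatabase are hypothetical and do not come from the talk; the only point is that the cached path and the database path feed separate statistics, measured with "currentTimeMillis."

```java
import java.util.HashMap;
import java.util.Map;

/** Accumulates call count, total, min and max time for one measured code path. */
class CallStats {
    private long count;
    private long totalMillis;
    private long minMillis = Long.MAX_VALUE;
    private long maxMillis = Long.MIN_VALUE;

    synchronized void record(long elapsedMillis) {
        count++;
        totalMillis += elapsedMillis;
        minMillis = Math.min(minMillis, elapsedMillis);
        maxMillis = Math.max(maxMillis, elapsedMillis);
    }

    synchronized long count() { return count; }

    @Override
    public synchronized String toString() {
        if (count == 0) return "no calls";
        return "calls=" + count + " total=" + totalMillis + "ms min=" + minMillis
            + "ms max=" + maxMillis + "ms avg=" + (double) totalMillis / count + "ms";
    }
}

/** Hypothetical business service that times its cached and database paths separately. */
class QuoteService {
    final CallStats cacheHits = new CallStats();
    final CallStats databaseLoads = new CallStats();
    private final Map<String, String> cache = new HashMap<>();

    synchronized String lookup(String key) {
        long start = System.currentTimeMillis();
        String cached = cache.get(key);
        if (cached != null) {
            // Fast path: the result was already cached.
            cacheHits.record(System.currentTimeMillis() - start);
            return cached;
        }
        // Slow path: go to the (simulated) database and remember the result.
        String loaded = loadFromDatabase(key);
        cache.put(key, loaded);
        databaseLoads.record(System.currentTimeMillis() - start);
        return loaded;
    }

    // Stand-in for a real database query (assumption for the sketch).
    private String loadFromDatabase(String key) {
        return "value-for-" + key;
    }
}
```

The two CallStats objects can then be exposed through whatever management interface or log output the application already has.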

Moreover, as the person responsible for the profiling-optimization cycle, you then use this information to determine what's going on in your application. This information is always within your application; the code runs in the live system. If you had a bad day, or something went wrong with the system, you can look at these statistics in the logs, figure out what's going on, and so on.

A wonderful technique, no third-party tools required, just a bit of Java code.

What if the methods are shorter and called more frequently?

This direct approach is no longer suitable, because the "currentTimeMillis" method is slow and measures only milliseconds.

If you only need to measure the number of calls, you can do so quite quickly using the Java "AtomicLong" class. It allows you to count the number of calls to a particular method with minimal performance impact. This will work for up to tens of thousands of calls per second without significantly impacting the application's performance.
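A minimal sketch of such a call counter is shown here. OrderBook and its update method are hypothetical names chosen for illustration; only "AtomicLong" comes from the talk.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical class whose frequently called method is counted. */
class OrderBook {
    // One shared counter per measured method; incrementAndGet needs no lock.
    static final AtomicLong UPDATE_CALLS = new AtomicLong();

    private long lastPrice;

    void update(long price) {
        UPDATE_CALLS.incrementAndGet(); // the only profiling overhead in the method
        lastPrice = price;
    }

    long lastPrice() {
        return lastPrice;
    }
}
```

Reading OrderBook.UPDATE_CALLS.get() at two points in time and dividing the difference by the elapsed time gives the call rate.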

What if you also need to measure execution time? Measuring the execution time of short methods is a very complex topic. It can't be solved with standard means, even though Java has the "System.nanoTime" method. It doesn't solve these problems; it is itself slow, and it's difficult to measure anything fast with it.

The only real solution is to use native code. For x86 processors, there's the wonderful rdtsc instruction, which returns the processor cycle counter. There's no direct access to it from Java, but you can write a one-line C function that executes "rdtsc," link it to your Java code, and call it from Java. The call will cost you about 100 cycles, so it makes sense when the piece of code you're measuring takes a thousand cycles or so, you're optimizing every machine cycle, and you want to know, give or take, whether the code got faster or slower. It's truly rare that you need to optimize every cycle.

More often, when dealing with shorter, more frequently called sections of code, a different approach called "sampling" is used. Instead of precisely measuring what is called how many times, you periodically sample the program's execution, looking at where it is executing at random points in time, for example once per second or once every ten seconds.

You look at where execution occurs and count where you find your program frequently executed. If you have a line in your program where it spends all, or at least 90%, of its time—for example, a loop, and then a line deep within it—then you'll likely find it there when execution stops.

Such a spot in the program is called a "hot spot." It's always a great candidate for optimization. What's cool is that the Java machine has a built-in function called "thread dump" that dumps the state of all threads. On Windows, this is done by pressing CTRL-Break in the console, and on Linux and other Unix systems, by sending signal number three (SIGQUIT) with the "kill -3" command.

In this case, the Java machine prints a detailed description of the state of all threads to the console. And if you truly have a hot spot in your code, you'll most likely find your program there. So again, when you have a performance problem with your code, don't rush to the profiler; don't do anything. If you see something slowing things down, run at least one thread dump and see. If you have a single hot spot where the program spends all its time, you'll see that line in the thread dump in your favorite development environment, without using any third-party tools. Look at this code, study it, optimize it—either it's being called too often or it's running slowly. Then it's time to analyze the situation, optimize, or further profile.

Modern Java machines also have a wonderful utility called "jstack." You can pass it a process ID and get a thread dump as output.

Take more than one thread dump; take several. If the first one doesn't catch anything, take a couple more. Maybe your program spends 50% of its time in a hot spot, not 100%. After a few thread dumps, you'll likely catch your code in the hot spot in at least some of them, and you can then inspect the places where you found it.

This idea can be developed further. You can start a Java machine, redirect its output to a file, and run a script that performs a thread dump every three seconds. This can be done completely safely on a live system, without any risk of damaging it. Because collecting a thread dump is a fairly fast procedure, taking 100 milliseconds at most, even with a very large number of threads.

And if you're writing Java, your system likely isn't hard real-time, and nanoseconds aren't important to you, because you already have periodic garbage collection pauses and so on. So an extra 100-millisecond pause won't be a disaster.

Even in our financial sector, most of the systems we write are ultimately written for people, and people operate them. Yes, there are millions of quotes per second, and yes, there are robots (that's another story), but most often, these quotes are viewed by a human, who won't notice a lag of plus or minus 100 milliseconds. A human will notice a lag of 200 milliseconds; that would be a noticeable delay, but a lag of 100 milliseconds is not.

Therefore, you don't need to worry about an extra 100 milliseconds, and it's perfectly safe to run a thread dump every three seconds, even on a live system. Moreover, thread dump is part of the Java machine, well-tested—in all my experience, I've never seen anything bad happen when trying to run a thread dump on a Java machine.

In other words, it's a completely safe tool for profiling live and running systems. Afterwards, having obtained a thread dump file, you can examine it visually, or you can write a simple piece of code that analyzes it, calculates some statistics—at least, simply by looking at which methods were invoked and how many times, and the thread states.
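A simple analyzer of this kind might look like the sketch below. It assumes the usual HotSpot text layout, in which each thread starts with a header line beginning with a quoted thread name and each stack frame is a line beginning with "at"; ThreadDumpStats and topFrameCounts are hypothetical names, not the tool mentioned in the talk.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ThreadDumpStats {
    /** Counts how often each method appears as the top frame of a thread. */
    static Map<String, Integer> topFrameCounts(List<String> dumpLines) {
        Map<String, Integer> counts = new TreeMap<>();
        boolean expectTopFrame = false;
        for (String raw : dumpLines) {
            String line = raw.trim();
            if (line.startsWith("\"")) {
                // Thread header line: the next "at ..." line is this thread's top frame.
                expectTopFrame = true;
            } else if (expectTopFrame && line.startsWith("at ")) {
                String frame = line.substring(3);
                int paren = frame.indexOf('(');
                // Drop the "(File.java:123)" suffix so that lines of one method are grouped.
                String method = paren >= 0 ? frame.substring(0, paren) : frame;
                counts.merge(method, 1, Integer::sum);
                expectTopFrame = false;
            }
        }
        return counts;
    }
}
```

Feeding it the concatenated output of many periodic dumps gives a crude histogram of where threads were found most often.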

Standard profiling tools do look at the thread state reported by the Java machine ("RUNNABLE"), but in reality that state tells you very little. If your program does a lot of database work and a lot of external network access, your code may be waiting for data from the network while the Java machine still considers it "RUNNABLE," leaving you with no way to tell which methods are actually running and which are waiting on the network. On the other hand, if you analyze stack traces yourself, you can encode what you know about your program: that this is a database call, that this method on the stack means you've entered the database. You can then calculate the percentage of time you spend in the database, and so on. You can know that this method isn't actually hogging the CPU, even though the Java machine thinks it's "RUNNABLE."

Moreover, thread dumps can be integrated into applications; Java has a wonderful method for this, "Thread.getAllStackTraces."

This allows you to obtain stacktrace information programmatically.
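Here is a minimal sketch of an in-process sampler built on "Thread.getAllStackTraces." StackSampler is a hypothetical name, and the rule that a top frame in a java.net class means "waiting on network" is only an illustrative assumption; a real application would substitute its own knowledge of its database and network calls.

```java
import java.util.Map;

class StackSampler {
    /** Takes one sample of all threads and adds each RUNNABLE thread's label to the histogram. */
    static void sample(Map<String, Integer> histogram) {
        for (Map.Entry<Thread, StackTraceElement[]> entry : Thread.getAllStackTraces().entrySet()) {
            StackTraceElement[] stack = entry.getValue();
            if (stack.length == 0 || entry.getKey().getState() != Thread.State.RUNNABLE) {
                continue; // skip threads without Java frames and threads that are blocked or waiting
            }
            histogram.merge(classify(stack), 1, Integer::sum);
        }
    }

    /** Labels a sample by its top frame; the java.net prefix rule is an assumption. */
    static String classify(StackTraceElement[] stack) {
        StackTraceElement top = stack[0];
        if (top.getClassName().startsWith("java.net.")) {
            return "[waiting on network]"; // RUNNABLE for the JVM, but not using the CPU
        }
        return top.getClassName() + "." + top.getMethodName();
    }
}
```

Calling sample from a background thread every few seconds and logging the histogram from time to time is one way to build this into an application.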

This way, you can integrate profiling as a functional part of this application and distribute it to your clients with built-in profiling. This way, you'll have a constant stream of information that you can analyze to improve your application.

But there's a problem. When you ask the Java machine for a thread dump, it doesn't simply stop the process and dump the stacks; instead, it sets a flag telling the threads to stop at the next safepoint. Safepoints are special places that the compiler inserts throughout the code, where the program has a well-defined state: the register contents are known, the execution point is known, and so on. A piece of straight-line code with no backward jumps or loops might not contain a safepoint at all. Method calls don't help either, because they can be inlined by HotSpot, and then there won't be any safepoints there.

So, if you see a line in the thread dump, it doesn't necessarily mean it's the hot line of your code; it's simply the closest safepoint to the hot spot. When you press CTRL-Break, Java flags all threads to stop at the nearest safepoint, and only when they have stopped does the JVM analyze their state.

Now let's move on to memory profiling and how it's done.

First, the JVM has some great, ready-made features. There's the excellent jmap tool, which displays a histogram of your memory usage, including which objects are occupying how much memory. This is a great tool for getting a general overview of what's going on and what's hogging your memory.

Again, if you've never profiled your program before, you'll often spot problems right away, giving you food for thought for further memory optimization.

The problem is that this method gives you information about all objects, even those that are no longer in use and are just garbage waiting to be collected.

That's why jmap has a special "live" option, which performs garbage collection before creating a histogram, leaving only used objects, and only then builds the histogram.

The problem is that even with this option, it's impossible to profile a large, live system running with many gigabytes of memory, because garbage collection for a system running with tens of gigabytes of memory takes a few dozen seconds, and this can be unacceptable... In any system, if your system is running with end users, a person to whom the system doesn't respond for more than three seconds thinks it's frozen. You really can't stop a live system running with people for longer than a second. Even a second will be noticeable to a person, but it's not a disaster. But if you plug in some tool that stops it for 10 seconds, that would be a disaster. Therefore, on live systems, you often have to make do with jmaps of the objects that are there, and it generally doesn't matter whether they're garbage or not.

It's also useful to know some additional Java machine options. For example, you can ask the Java machine to print a class histogram whenever you take a thread dump, using "-XX:+PrintClassHistogram."

You can ask the Java machine to write its state to disk when it runs out of memory. This is very useful, because people usually only start optimizing memory consumption when it's running low for some reason. No one profiles when everything is fine, but when a program runs out of memory, it crashes, and people start thinking about how to optimize it. Therefore, it's always useful to have this option enabled. Then, in the worst case, the Java machine will write a binary dump, which you can later analyze with any tools, even offline. This dump can be retrieved from the Java machine at any time, for example, with jmap, using the "-dump" option, but this, again, stops the Java machine for a long time, which is unlikely to be done on a live system.

Comment from the audience: There's an option there, "HeapDumpOnOutOfMemoryError," that's designed for the situation when memory has already run out.

Yes, of course. "HeapDumpOnOutOfMemoryError" is a very useful option, and even though it's a "-XX" option, there's no need to be afraid of it. The "XX" only emphasizes how specialized these options are. These aren't experimental options; they're normal production Java machine options. They're stable, reliable, and can be used on a live, real system.

The Java machine does clearly separate experimental options from non-experimental ones, but that division doesn't depend on the number of Xs.

A comment from the audience: This option sometimes doesn't save dumps...

Well, the Java machine has bugs too, it all depends on... there are various reasons for memory exhaustion, but I won't dwell on that; we're short on time.

I want to focus on a very important point in the remaining time: memory allocation profiling.

What occupies memory is one thing; how you allocate it is another. Suppose you have excessive allocation of temporary memory somewhere in your code: you allocate something, use it within a method, and then forget about it, so it becomes garbage and the garbage collector later picks it up. If you do this over and over again, your program will run slower than it would without those allocations, but you won't find this part of the code with any CPU profiler. The memory allocation operation itself is fantastically fast in the Java machine, faster than in any non-managed language such as C or C++, because in Java allocating memory is simply incrementing a single pointer. That's all. It's just a few assembly instructions, and the memory is already pre-zeroed, already allocated and prepared. You won't find this time when analyzing hot spots in your code; it will never show up in any thread dump or profiler as a hot spot. Yet it eats up your time and your application's performance. Why? Because the garbage collector then spends time collecting this garbage. Therefore, keep an eye on the percentage of your application's time spent on garbage collection.

The "-verbose:gc," "-XX:+PrintGC," and "-XX:+PrintGCDetails" options are useful here; they will help you understand how much of your application's time is spent on garbage collection. If you see that garbage collection takes a significant percentage of the time, it means you're allocating a lot of memory somewhere in the program. A CPU profile won't show you that place; you need to find out who is allocating the memory.
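The talk relies on the logging options for this. As an alternative that is not from the talk, the same percentage can be estimated from inside the application with the standard java.lang.management beans. The sketch below is a rough approximation under that assumption; GcShare is a hypothetical name, and the reported collection time is only an approximate figure that depends on the collector in use.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

class GcShare {
    /** Approximate total milliseconds spent in all collectors since JVM start. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime(); // -1 if this collector does not report time
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    /** Accumulated collection time as a percentage of JVM uptime. */
    static double gcPercentOfUptime() {
        long uptimeMillis = ManagementFactory.getRuntimeMXBean().getUptime();
        return uptimeMillis <= 0 ? 0.0 : 100.0 * totalGcMillis() / uptimeMillis;
    }
}
```

Logging gcPercentOfUptime() periodically, next to the other self-reported statistics, shows whether allocation pressure is becoming a problem.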

How do you find this? There's a built-in method in the Java machine, the "-Xaprof" switch. Unfortunately, it only displays the so-called allocation profile at process termination. This profile doesn't show the contents of memory, but rather the statistics of allocated objects—which objects were allocated and how often.

If this really happens frequently, you'll likely see a temporary class somewhere that is being allocated very frequently. Try running "aprof" right away—you might find your problem right away.

But that's not guaranteed. You might see allocations of a large number of character arrays, strings, or something else, and it's unclear where.

Naturally, you might have suspicions; perhaps a recent change caused this. In that case, you can add a counter at the place where you suspect memory is being allocated too frequently, using the same "AtomicLong" technique as before. Count how many allocations occur there and look at the statistics; that way you can check the suspicious places yourself.

But what if you have no idea where it's happening? Well, you need to somehow add statistics collection everywhere, for all places where memory is allocated. Aspect-oriented programming or direct bytecode manipulation are excellent for this kind of task.

I'll now use the remaining time to focus on bytecode manipulation, a technique that's ideal for solving problems like, "I want to count all the times an array is allocated throughout my code, so I can find the one place where I'm allocating a lot of int arrays for some reason." That is, I see a lot of them being allocated, but I just want to find where.

Bytecode manipulation not only solves these problems; it also lets you make any changes to the code after it's compiled, including non-functional ones. Thus, this method decouples profiling from business logic. While I mentioned at the beginning that profiling can often be a logical part of your functionality, there are times when it isn't: you need to find a problem, solve it, and leave no profiling code behind. In these cases, a wonderful technique like bytecode manipulation is ideal.

This can be done either as a post-compilation step over the compiled classes or on the fly, as the classes are loaded.

The best way I know is to use the ASM library, from ObjectWeb. It's an open-source library that makes manipulating bytecode very easy, and it's incredibly fast—you can manipulate code on the fly without significantly slowing down the application's load time.

ASM is very simple. It has a class called "ClassReader," which reads .class files and, using the Visitor pattern, converts the bytes into a sequence of calls like "I see a method" or "I see a field with such-and-such properties in this class," and so on. When it sees a method, it uses "MethodVisitor" to report the bytecode it finds there.

Then, on the other hand, there's something called "ClassWriter," which, conversely, converts the class into an array of bytes, which is what the Java machine needs.

For example, to track all array allocations using ASM... well, it's a simple matter. You only need to write a couple of classes. You define your own class adapter which, when it's told that a method has been encountered, returns its own method visitor to find out what's going on inside that method.

And when, inside the method, it's told that there's an instruction with an integer operand and the array allocation opcode ("NEWARRAY"), it has the opportunity to insert some bytecode of its own into the downstream flow, and that's it. You've found all the places where arrays are allocated and changed the corresponding bytecode.

Now—what if you want to make these changes on the fly?

If you have a set of compiled classes, that's easy: you simply process them with this tool, and that's it.

If you need to do this on the fly, the Java machine has a wonderful feature called "javaagent." You create a special jar file and specify the "Premain-Class" attribute in its manifest, giving the name of your class. Then you write a "premain" method following a specific template, and thus, even before the main code runs, you gain control in the premain method and receive a reference to the instrumentation interface. This interface is wonderful; it allows you to change classes in the Java machine on the fly. It lets you register your own class-file transformer, which the Java machine will call for each class it attempts to load.

And you can substitute classes. That is, you can load not only the classes as they actually are but also, using ObjectWeb ASM, analyze something, change it, and substitute it on the fly. You can even find out the size of a given object.
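The shape of such an agent is sketched below using only the standard java.lang.instrument API. ProfilingAgent and CountingTransformer are hypothetical names; the transformer here merely counts the classes it is shown and returns them unchanged, where a real profiler would rewrite the bytes, for example with ASM. The sketch assumes the jar manifest contains the line "Premain-Class: ProfilingAgent" and that the JVM is started with -javaagent pointing at that jar.

```java
import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.Instrumentation;
import java.security.ProtectionDomain;
import java.util.concurrent.atomic.AtomicLong;

// In a real agent jar this class would normally be public and live in its own file.
class ProfilingAgent {
    static final AtomicLong CLASSES_SEEN = new AtomicLong();

    /** Called by the JVM before main() when the agent jar is given via -javaagent. */
    public static void premain(String agentArgs, Instrumentation inst) {
        inst.addTransformer(new CountingTransformer());
        // The same interface can also report the shallow size of an object.
        System.out.println("size of an Object: " + inst.getObjectSize(new Object()) + " bytes");
    }

    static class CountingTransformer implements ClassFileTransformer {
        @Override
        public byte[] transform(ClassLoader loader, String className, Class<?> classBeingRedefined,
                                ProtectionDomain protectionDomain, byte[] classfileBuffer) {
            CLASSES_SEEN.incrementAndGet();
            // A real profiler would return modified bytecode here.
            return null; // null tells the JVM to keep the class bytes unchanged
        }
    }
}
```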

A wonderful tool for this kind of profiling on your own, when you have a specific problem to solve.

In conclusion, I'll say that mastering a specific tool isn't necessary to solve profiling problems. Knowledge of bytecode, Java machine options, and the standard Java libraries, such as java.lang, is sufficient. This allows you to solve a huge number of the specific problems you encounter. Over the past ten years, we've developed several homegrown tools that solve problems that are specific to us and that general-purpose profilers can't solve. To start with, we developed a simple tool that analyzes thread dumps and produces statistics on them. It's such a simple utility that it can hardly be called a tool: a few pages of code that collect statistics and present them in a readable format. It's incredibly useful because we don't need to connect any profilers to the production system; a thread dump is all it takes.

Finally, we have our own memory profiling tool, which, again, is small and hardly deserves to be called a tool. It tracks where and what is allocated, and it does this with virtually no impact on program performance. Both commercial and open-source profilers can also track memory allocation, but they're trying to solve a more complex and universal problem. They try to find out where memory allocation occurs, with a full stack trace. This takes a long time, and it slows things down significantly. They don't use sampling, for example. They don't always collect data, thereby not getting all the statistics, and so on.

They make their own compromises, which aren't necessary in your domain; you have your own specific tasks you want to solve when analyzing the performance of your systems.

Now I'll take questions (answers to questions from 30:06).
