Tuesday, 27 May 2008

Automated Heap Dump Analysis: Finding Memory Leaks with One Click

There is a common understanding that a single snapshot of the Java heap is not enough for finding a memory leak. The usual approach is to look for a monotonic increase in the number of objects of some class, either by “online” profiling/monitoring or by comparing a series of snapshots made over time. However, such “live” monitoring is not always possible. It is especially difficult in production systems, because of the performance cost of using a profiler and because some leaks show themselves only rarely, when certain conditions are met.

In this blog post I will try to show that analysis based on a single heap dump can also be an extremely powerful means of finding memory leaks. I will give some tips on how to obtain data suitable for the analysis. I will then describe how to use the automated analysis features of the Memory Analyzer tool, which was contributed to Eclipse several months ago. Automating the analysis greatly reduces the complexity of finding memory problems, and enables even non-experts to handle memory-related issues. All you need to do is provide a good heap dump and click once to trigger the analysis. The Memory Analyzer will create a report with the leak suspects for you. What this report contains, and how the reported leak suspects are found, is described below.


Preparation


The first thing to do before starting the analysis is to collect enough data for it. This is fairly easy: one can configure the JVM to write a heap dump whenever an OutOfMemoryError occurs. With this setup you get the data without having to observe the system the whole time and wait for the right moment to trigger the dump on your own. In a nutshell, add the option -XX:+HeapDumpOnOutOfMemoryError to the VM arguments.
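The option is normally passed on the command line (java -XX:+HeapDumpOnOutOfMemoryError ...). As a complement, here is a minimal sketch of how the same flag could be checked, and switched on, from inside a running process. It assumes a HotSpot-based JVM that exposes com.sun.management.HotSpotDiagnosticMXBean; the class name HeapDumpConfig is made up for this illustration and is not part of the Memory Analyzer.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper for illustration only. Assumes a HotSpot-based JVM.
final class HeapDumpConfig {
    private static final String FLAG = "HeapDumpOnOutOfMemoryError";

    static HotSpotDiagnosticMXBean diagnostics() {
        return ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
    }

    // True if the VM will write a heap dump when an OutOfMemoryError is thrown.
    static boolean isDumpOnOomEnabled() {
        return Boolean.parseBoolean(diagnostics().getVMOption(FLAG).getValue());
    }

    // On HotSpot this flag is "manageable", so it can be changed at runtime.
    static void enableDumpOnOom() {
        diagnostics().setVMOption(FLAG, "true");
    }

    public static void main(String[] args) {
        System.out.println("enabled at startup: " + isDumpOnOomEnabled());
        enableDumpOnOom();
        System.out.println("enabled now:        " + isDumpOnOomEnabled());
    }
}
```

Setting the option at startup is still the more robust choice, because it also covers errors that happen before any such code runs.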

The second step of the preparation is to make the memory leak more visible and easier to detect. To achieve this one can use the following trick: configure the maximum size of the Java heap to be much higher than the heap used when the application is running correctly (for example, set it to twice the size that is usually left after a full GC). Even if you don’t know how much memory the application really needs, increasing the heap is not a bad idea (it may turn out that there is no leak and the application simply needs more heap). I don’t want to discuss whether running Java applications with very large heaps is a good approach in general; simply use the tip while troubleshooting.
What do you gain by this change? On the first OutOfMemoryError the VM will write a heap dump. Most likely the objects related to the leak will make up about half of the total heap size in this dump, so it should be relatively easy to detect the leak later.
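The arithmetic behind this tip can be written down in a few lines. This is only an illustrative sketch: the 256 MB baseline is an invented number, and HeapSizing is a hypothetical helper name.

```java
// Illustrative arithmetic only; the numbers are invented, not measurements.
final class HeapSizing {
    // Tip from the text: allow about twice the heap that is left after a full GC.
    static long suggestedMaxHeapBytes(long liveAfterFullGcBytes) {
        return 2 * liveAfterFullGcBytes;
    }

    // Share of the heap the leak has to fill before the OutOfMemoryError is thrown,
    // assuming the healthy footprint stays at the baseline.
    static double expectedLeakShare(long maxHeapBytes, long liveAfterFullGcBytes) {
        return (double) (maxHeapBytes - liveAfterFullGcBytes) / maxHeapBytes;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long baseline = 256 * mb;                        // assumed healthy footprint
        long maxHeap = suggestedMaxHeapBytes(baseline);  // would correspond to -Xmx512m
        System.out.printf("-Xmx%dm, expected leak share at OOM: %.0f%%%n",
                maxHeap / mb, 100 * expectedLeakShare(maxHeap, baseline));
    }
}
```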

Executing the Report


Now imagine you have the heap dump which the VM produced when the OutOfMemoryError occurred again. It is time to begin the leak hunting. Start the Memory Analyzer tool and load the heap dump. Right after opening the heap dump you see an info page with a chart of the biggest objects, and in many cases you will notice a single huge object here already.

But this is not the leak suspects report yet. The report, as I promised, is executed by a single click – on the “Leak Suspects” link of the overview:
[Screenshot: Run Report]

Alternatively, one can execute the report using the menu from the tool bar, but it takes two clicks then ;-)
[Screenshot: Run Report From Menu Bar]

This is all you need to do. Behind the scenes we use several of the features available in the tool and try to identify suspiciously big objects or sets of objects. The findings are then summarized in an HTML report that is comprehensive yet easy to understand. The HTML report is displayed in the tool after it is generated. At the same time, it is also saved in a zip file next to the heap dump file that was provided. This makes it very easy to ask colleagues to have a look at a specific problem: just pass them the report, which is only a few kilobytes in size, instead of transferring the whole heap dump (potentially gigabytes).

Content of the Report – Suspects Overview


Now let’s have a look at such a report, which I have generated. As an example I have used a sample Eclipse plug-in which simulates a memory leak. I called it "org.eclipse.mat.demo.leak".
This is the result I see when I do the one-click described above.

[Screenshot: Leak Suspect Overview]

The first thing that catches my attention is a pie chart, which gives me a good visual impression of the size of the suspect (the darker color). I can easily see that in my example it is about 3/4 of the whole heap.

Then follows a short description, which tells me that one instance of my LeakingQueue class, loaded by "org.eclipse.mat.demo.leak", occupies 53 MB, or 80% of the heap.
It tells me also that the memory is piled up in an instance of Object[].

So, with just two sentences the report gives me a very short and meaningful explanation where the problem is – the name of the class keeping the memory, the component to which this class belongs, how much memory is kept, and where exactly the memory is accumulated.
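To make the pattern concrete, here is a minimal sketch of the kind of code that produces such a picture. It is a hypothetical reconstruction for illustration, not the source of the demo plug-in: the class name LeakingQueueSketch and its methods are invented.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical reconstruction for illustration; NOT the code of the demo plug-in.
final class LeakingQueueSketch {
    // An ArrayList keeps its elements in an internal Object[]. In a heap dump,
    // that array is where the memory would be seen piling up.
    private final List<Object> pending = new ArrayList<>();

    // Producers keep adding events ...
    void publish(Object event) {
        pending.add(event);
    }

    // ... but nothing ever removes them, so every event stays reachable.
    int size() {
        return pending.size();
    }

    public static void main(String[] args) {
        LeakingQueueSketch queue = new LeakingQueueSketch();
        for (int i = 0; i < 1_000; i++) {
            queue.publish(new byte[1024]); // each "event" carries a 1 KB payload
        }
        System.out.println("events still held: " + queue.size());
    }
}
```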

Note: Here the component "org.eclipse.mat.demo.leak" is actually the name of my plug-in, extracted from the ClassLoader that loaded it. This is very handy information, as even in this relatively small heap dump there were 181 different plug-ins/classloaders. Extracting the name makes the explanation much more helpful and intuitive.

Then the report offers me a set of keywords. What are they good for? One of the goals we set for the report was to enable the discovery of already known problems. Therefore we needed to provide a unique identifier for each suspect, which people can use to search for the problem in an existing bug-tracking system. All keywords in the report, used together, form this identifier. If the person who first encountered the problem has provided these keywords in a bug report, then others who encounter the same problem and search for a solution using the keywords should be able to find it.

Good. So far, with one click, I was able to see a problem suspect and to get some information which allows me to search for a known solution. This would enable me to react to this specific problem even if I were not the owner of the code, and even if I didn't have any experience with troubleshooting memory-related problems.

Content of the Report – Details about the Problem


Besides an overview of the leak suspects, the report contains detailed information about each of the suspects. You can display it by following the “details” link. What details are available? Well, while looking at many different real-life problems, we found that two questions usually arise when a leak suspect is found:


  • Why are the accumulated objects in memory? or Who is keeping them alive?


  • Why is the suspect so big? What is its content?




Therefore, we tried to pack the answers to these two questions into the report. First, you will find in the details the shortest path from a GC root to the accumulation point:
[Screenshot: Paths From GC Roots]

Here you can see all the classes and fields through which the reference chain goes. If you are familiar with the code, they should give you a good understanding of how the objects are held.
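The idea of a shortest path from a GC root can be sketched as a breadth-first search over the reference graph. This is a simplified illustration, not the tool's implementation: objects are represented by plain strings, and the names used in the example graph are invented.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

// Simplified illustration: objects are strings, references are map entries.
final class GcRootPath {
    // Breadth-first search visits objects in order of distance from the roots,
    // so the first time the target is reached, the path to it is a shortest one.
    static List<String> shortestPath(Map<String, List<String>> references,
                                     Collection<String> gcRoots, String target) {
        Map<String, String> parent = new HashMap<>(); // object -> who led us to it
        Deque<String> queue = new ArrayDeque<>();
        for (String root : gcRoots) {
            if (!parent.containsKey(root)) {
                parent.put(root, null);
                queue.add(root);
            }
        }
        while (!queue.isEmpty()) {
            String current = queue.remove();
            if (current.equals(target)) {
                LinkedList<String> path = new LinkedList<>();
                for (String node = current; node != null; node = parent.get(node)) {
                    path.addFirst(node);
                }
                return path;
            }
            List<String> outgoing = references.getOrDefault(current, Collections.<String>emptyList());
            for (String next : outgoing) {
                if (!parent.containsKey(next)) {
                    parent.put(next, current);
                    queue.add(next);
                }
            }
        }
        return Collections.emptyList(); // unreachable from any root, i.e. garbage
    }

    public static void main(String[] args) {
        Map<String, List<String>> refs = new HashMap<>();
        refs.put("Thread main", Arrays.asList("Activator"));
        refs.put("Activator", Arrays.asList("LeakingQueue"));
        refs.put("ClassLoader", Arrays.asList("Registry"));
        refs.put("Registry", Arrays.asList("Listener"));
        refs.put("Listener", Arrays.asList("LeakingQueue"));
        refs.put("LeakingQueue", Arrays.asList("Object[]"));
        System.out.println(shortestPath(refs, Arrays.asList("ClassLoader", "Thread main"), "Object[]"));
    }
}
```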

Then, to answer the question of why the suspect is so big, the report contains some information about the content which was accumulated:
[Screenshot: Accumulated Objects]

[Screenshot: Accumulated Objects by Class]

Here one can see which objects have piled up. In my example these are two different types of events kept by the queue.
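A "by class" view like this boils down to grouping the accumulated objects by their class and summing counts and sizes. The following is a small sketch of that grouping under simplifying assumptions; the HeapObject type and the event class names are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch; the HeapObject type and the class names are invented.
final class AccumulatedByClass {
    static final class HeapObject {
        final String className;
        final long shallowSize;

        HeapObject(String className, long shallowSize) {
            this.className = className;
            this.shallowSize = shallowSize;
        }
    }

    // Returns className -> { number of instances, sum of shallow sizes in bytes }.
    static Map<String, long[]> summarize(List<HeapObject> accumulated) {
        Map<String, long[]> byClass = new TreeMap<>();
        for (HeapObject object : accumulated) {
            long[] totals = byClass.computeIfAbsent(object.className, k -> new long[2]);
            totals[0]++;
            totals[1] += object.shallowSize;
        }
        return byClass;
    }

    public static void main(String[] args) {
        List<HeapObject> accumulated = new ArrayList<>();
        for (int i = 0; i < 3; i++) accumulated.add(new HeapObject("demo.MouseEvent", 48));
        for (int i = 0; i < 2; i++) accumulated.add(new HeapObject("demo.KeyEvent", 32));
        summarize(accumulated).forEach((name, totals) ->
                System.out.println(name + ": " + totals[0] + " objects, " + totals[1] + " bytes"));
    }
}
```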

Content of the Report – System Overview


Now that we have a detailed description of the problem, let's look at one more part of the report: the "System Overview". Once a problem is identified, questions like "In what context did this problem appear?" or "What was the environment?" may arise. To answer such questions, we pack a "System Overview" page into each report. This page contains a collection of details extracted from the heap dump that can help you better understand the context in which the problem appeared. These details include:


  • information about the heap dump - size, number of classes, number of class loaders, etc.


  • the system properties


  • an overview of all threads running at the moment the snapshot was taken


  • the top consumers - i.e. the biggest objects, classes, classloaders, packages


  • a class histogram


Here are two screenshots from my example - the "System Overview" start page and the "System Properties".

[Screenshot: System Overview]

[Screenshot: System Properties]

Behind the Scenes - Finding the Leak Suspect


Let me now try to explain how we actually find the leak suspects. When the heap dump is opened for the first time, we create several index files next to it, which enable us to access the data efficiently afterwards. During this first parsing we also build a dominator tree out of the object graph. It is this dominator tree that plays the most important role later, when we do the analysis and search for the suspects. It is difficult to explain the graph theory behind the dominator tree in just a few lines, so I will only list the most important things we gain from using it:


  • The dominator tree models the keep-alive dependencies among the objects in the heap. In this tree, every object keeps all of its descendants alive. This means that if an object were removed from the heap, all of its descendants in the dominator tree would be garbage collected. The size of the object plus all the other objects it keeps alive is what we call its retained size.


  • The dominator tree can show us the biggest objects in the heap. Using the property from the previous point, it is very easy to compute the retained size of every single object. Ordering the objects by retained size is then trivial.
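For readers who prefer code to graph theory, here is a small sketch of both ideas on a toy object graph. It computes immediate dominators with the simple iterative scheme by Cooper, Harvey and Kennedy and then sums retained sizes up the tree. This is an assumption-laden illustration: the post does not say which algorithm the tool itself uses, objects are plain integers, node 0 is a virtual root standing in for all GC roots, and the sizes are invented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy illustration; not the Memory Analyzer's implementation.
final class DominatorSketch {
    // succ[v] lists the objects referenced by v. Node 0 is a virtual root that
    // references all GC roots. Returns idom[v], or -1 for unreachable objects.
    static int[] immediateDominators(int[][] succ) {
        int n = succ.length;
        List<Integer> postorder = new ArrayList<>();
        dfs(0, succ, new boolean[n], postorder);
        int[] postIndex = new int[n];
        Arrays.fill(postIndex, -1);
        for (int i = 0; i < postorder.size(); i++) postIndex[postorder.get(i)] = i;

        List<List<Integer>> pred = new ArrayList<>();
        for (int v = 0; v < n; v++) pred.add(new ArrayList<>());
        for (int v = 0; v < n; v++) for (int w : succ[v]) pred.get(w).add(v);

        int[] idom = new int[n];
        Arrays.fill(idom, -1);
        idom[0] = 0;
        boolean changed = true;
        while (changed) {
            changed = false;
            // Reverse postorder, skipping the root (last element of the postorder).
            for (int i = postorder.size() - 2; i >= 0; i--) {
                int v = postorder.get(i);
                int candidate = -1;
                for (int p : pred.get(v)) {
                    if (idom[p] == -1) continue; // not processed yet, or unreachable
                    candidate = (candidate == -1) ? p : intersect(p, candidate, idom, postIndex);
                }
                if (candidate != idom[v]) {
                    idom[v] = candidate;
                    changed = true;
                }
            }
        }
        return idom;
    }

    // Walks two nodes up the current dominator tree until they meet.
    private static int intersect(int a, int b, int[] idom, int[] postIndex) {
        while (a != b) {
            while (postIndex[a] < postIndex[b]) a = idom[a];
            while (postIndex[b] < postIndex[a]) b = idom[b];
        }
        return a;
    }

    private static void dfs(int v, int[][] succ, boolean[] seen, List<Integer> out) {
        seen[v] = true;
        for (int w : succ[v]) if (!seen[w]) dfs(w, succ, seen, out);
        out.add(v);
    }

    // Retained size = own shallow size plus that of everything it dominates.
    static long[] retainedSizes(int[] idom, long[] shallow) {
        long[] retained = new long[idom.length];
        for (int v = 0; v < idom.length; v++) {
            if (idom[v] == -1) continue; // unreachable objects retain nothing
            int node = v;
            while (true) {
                retained[node] += shallow[v];
                if (node == 0) break;
                node = idom[node];
            }
        }
        return retained;
    }

    public static void main(String[] args) {
        // 0 = virtual root, 1 = A, 2 = B, 3 = C, 4 = D; A -> B, A -> C, B -> D, C -> D
        int[][] succ = { {1}, {2, 3}, {4}, {4}, {} };
        int[] idom = immediateDominators(succ);
        long[] retained = retainedSizes(idom, new long[] {0, 10, 20, 30, 40});
        System.out.println("idom     = " + Arrays.toString(idom));
        System.out.println("retained = " + Arrays.toString(retained));
    }
}
```

In the toy graph, D is referenced by both B and C, so neither of them keeps it alive on its own. D therefore hangs directly under A in the dominator tree and counts towards A's retained size only.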

Let me now explain how we use the dominator tree to find the leak suspects. Look at the following figure. It presents a part of the dominator tree, and the size of the circles represents the retained size of the objects: the bigger the circle, the bigger the object.

[Figure: Leak Suspects In the Dominator Tree]

We simply treat all objects with a retained size over a certain threshold as suspects. Then we go down the dominator tree and try to reach an object all of whose children are significantly smaller in size. This is what we call the "accumulation point". Then we just take these two objects - the suspect and the accumulation point - and use them to describe the problem.
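The search described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the tool's code: the Node type is invented, suspects are only looked for directly under the root of the dominator tree, and the two thresholds (20% of the heap for a suspect, 50% of the parent for "still a big child") are made-up values, since the actual thresholds are not given here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified illustration; thresholds and the Node type are assumptions.
final class LeakSuspectSketch {
    static final class Node {
        final String name;
        final long retainedSize;
        final List<Node> children = new ArrayList<>();

        Node(String name, long retainedSize, Node... kids) {
            this.name = name;
            this.retainedSize = retainedSize;
            this.children.addAll(Arrays.asList(kids));
        }
    }

    // Suspects: top-level objects retaining more than the given share of the heap.
    static List<Node> suspects(Node dominatorRoot, double suspectShare) {
        List<Node> result = new ArrayList<>();
        for (Node child : dominatorRoot.children) {
            if (child.retainedSize > suspectShare * dominatorRoot.retainedSize) {
                result.add(child);
            }
        }
        return result;
    }

    // Go down the tree while a single child still holds most of the memory.
    // Stop at the first object whose children are all significantly smaller.
    static Node accumulationPoint(Node suspect, double bigChildShare) {
        Node current = suspect;
        while (true) {
            Node biggest = null;
            for (Node child : current.children) {
                if (biggest == null || child.retainedSize > biggest.retainedSize) {
                    biggest = child;
                }
            }
            if (biggest == null || biggest.retainedSize < bigChildShare * current.retainedSize) {
                return current;
            }
            current = biggest;
        }
    }

    public static void main(String[] args) {
        // Sizes in MB, loosely modelled on the example report (invented details).
        Node array = new Node("Object[]", 52);
        for (int i = 0; i < 50; i++) array.children.add(new Node("Event#" + i, 1));
        Node queue = new Node("LeakingQueue", 53, array);
        Node heap = new Node("<heap>", 66, queue, new Node("Other A", 7), new Node("Other B", 6));

        for (Node suspect : suspects(heap, 0.20)) {
            Node point = accumulationPoint(suspect, 0.50);
            System.out.println("suspect: " + suspect.name + ", accumulation point: " + point.name);
        }
    }
}
```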

Some more information and a description of how to perform the leak hunting manually can be found in my older blog post. It is based on a different version of the tool (before it became an Eclipse project), and therefore some of the buttons differ. Nevertheless, I think the explanation may help you better understand the content of this post.

Conclusion


I still think that both "online" profiling and "off-line" analysis of snapshots have their strengths and limitations. I hope that I was able to demonstrate that heap-dump-based memory analysis can be extremely helpful for finding memory leaks (powered by the Memory Analyzer ;-) ). Some of its advantages - no performance cost at runtime, heap dumps provided automatically by the VM on OutOfMemoryError, simplicity coming from the automated analysis - make this approach my preferred one, especially for troubleshooting production systems.

Your feedback is highly appreciated!

Krum
Comments:

    1. Wow! This article came at a providential time. I was just struggling with a nasty memory leak yesterday, and the first thing that I read this morning was your article. Within 10 minutes of reading it, I found my memory leak! Perfect!

    2. Thank you for the positive feedback! Now I know that I did one good deed yesterday ;-)

    3. Just added a blog entry to our blog, how we used the Eclipse Memory Analyzer to quickly find the memory leak in our application! Thanx for the tool, this really saved a lot of time!
      Our blog: http://blog.xebia.com/2008/09/15/beware-of-transitive-dependencies-for-they-can-be-old-and-leaky/

    4. [...] http://dev.eclipse.org/blogs/memoryanalyzer/2008/05/27/automated-heap-dump-analysis-finding-memory-l... [...]

    5. [...] Automated Heap Dump Analysis: Finding Memory Leaks with One Click [...]

    6. [...] When analyzing generated heap dump I have found, that memory leak was caused by web application classloader, that managed thousands of CgLib dynamically generated classes. I was using Eclipse Memory Analyzer, that’s probably the best tool for memory heap dump analysis I have ever seen. It’s the third time it quickly identified the suspicious classes, by heuristic analysis called Leak suspect. [...]

    7. [...] a tremendous job (Yourkit can be integrated with Eclipse too). But the fact that Eclipse integrates it right away is super convenient and [...]
    8. [...] Automated Heap Dump Analysis: Finding Memory Leaks with One Click Posted by alextorex Filed in IT Leave a Comment » [...]

    9. I'm sure I'm just not seeing it somehow, but how does one open the Memory Analyzer Tool?

    10. If you have installed the standalone RCP application, then there should be a MemoryAnalyzer executable in the /mat directory.
      If you have installed just the Memory Analyzer feature into your Eclipse, then you need to open the proper perspective: Window -> Open Perspective -> Other ... -> Memory Analysis
      You can find more about installation here: http://www.eclipse.org/mat/downloads.php
      I hope this helps.

    11. [...] http://dev.eclipse.org/blogs/memoryanalyzer/2008/05/27/automated-heap-dump-analysis-finding-memory-l... 32.058365 118.796468 [...]

    12. [...] written heap dump with MAT can be a very easy way to find the root cause of the problem (read more here). If you want to analyze what the memory footprint of your application is, then MAT and heap [...]
    13. [...] Memory Leaks Start by running the leak report to automatically check for memory leaks. This blog details How to Find a Leaking Workbench Window. [...]

    14. Very useful article, Krum... I have a very basic (maybe silly) question here: the heap dump obtained on OutOfMemoryError shows a significant amount of "Remainder" memory as well. If there is space left in the heap, why is the java.lang.OutOfMemoryError: Java heap space exception raised by the JVM? Is it that the total pie does not represent the heap memory alone?
    15. To add to the above question: in one of our Eclipse plug-in applications, when we look at the dump written on OutOfMemoryError (with Xmx set to 1024m), we see one object occupying 380 MB, all other objects together occupying less than 100 MB, and a remainder of around 530 MB when the OOM occurred.

      Does that mean that when I try to process something on a button click (that is when the OOM occurred), my application is trying to create a single object of more than 530 MB, and hence the JVM complains about a memory crunch?
    16. All the images in this article are broken links. Can they be recovered/corrected to help in following the article?

    17. All the images in this article are broken links. Can they be recovered/corrected to help in following the article? +1
