Graph of the Day: Virtual and Physical Memory Starvation

Wednesday, April 24th, 2013

Today’s graph is a scatter plot of out-of-memory crashes. It categorizes crashes according to the smallest block of available VM and the amount of available pagefile space.

There were roughly 1000 crashes due to bug 829954 between 10-April and 15-April 2013. Click on individual crash plots to see memory details and a link to the crash report.

Direct link to SVG file. Link to raw data.

Conclusions

After graphing these crashes, it seems clear that there are two distinct issues:

  • Crashes which are above the blue line and to the left have free space in their page file, but we have run out of contiguous virtual memory space. This is likely caused by the virtual memory leak from last week.
  • Crashes which are below the blue line and to the right have available virtual memory, but don’t have any real memory for allocation. It is likely that the computer is already thrashing pretty heavily and appears very slow to the user.

I was surprised to learn that not all of these crashes were caused by the VM leak.

The short-term solution for this issue remains the same: the Mozilla graphics engine should stop using the infallible/aborting allocator for graphics buffers. All large allocations (network and graphics buffers) should use the fallible allocator and take extra effort to be OOM-safe.

Long-term, we need Firefox to be aware of the OS memory situation and continue to work on memory-shrinking behavior when the system starts paging or running out of memory. This includes obvious behaviors like throwing away the in-memory network and graphics caches, but it may also require drastic measures such as throwing away the contents of inactive tabs and reloading them later.

Charting Technique

With this post, I am starting to use a different charting technique. Previously, I was using the Flot JS library to generate graphs. Flot makes pretty graphs (although it doesn’t support labeling axes without a plugin!). It also features a wide range of plugins which add missing features. But often, it doesn’t do exactly what I want and I’ve had to dig deep into its guts to get things to look good. It is also cumbersome to include dynamically generated JS graphs in a blog post, and the prior graphs have been screenshots.

This time around, I generated the graph as an SVG image using the svgwrite python library. This allows me to put the full SVG graph directly into the blog, and it also allows me to add dynamic features such as rollovers directly in these blog posts. Currently I’m setting up the axes and labels manually in python, but I expect that this will turn into a library pretty quickly. I experimented with svgplotlib but the installation requirements were too complex for my needs.
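To illustrate the idea of an SVG scatter plot with rollovers, here is a minimal sketch. It is not the script used for the graph above: it builds the SVG with Python's standard xml.etree module rather than svgwrite so that it has no dependencies, and the function name scatter_svg, the sizes, and the sample points are all made up for illustration. The rollover is a <title> child on each circle, which most browsers display as a tooltip.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


def scatter_svg(points, width=400, height=300, margin=40):
    """Return SVG markup plotting (x, y, label) tuples; all names are illustrative."""
    xmax = max(p[0] for p in points) or 1
    ymax = max(p[1] for p in points) or 1
    plot_w = width - 2 * margin
    plot_h = height - 2 * margin

    svg = ET.Element("svg", {"xmlns": SVG_NS, "width": str(width), "height": str(height)})

    # Axes are drawn by hand: one horizontal and one vertical line from the origin.
    for x2, y2 in ((width - margin, height - margin), (margin, margin)):
        ET.SubElement(svg, "line", {
            "x1": str(margin), "y1": str(height - margin),
            "x2": str(x2), "y2": str(y2), "stroke": "black",
        })

    for x, y, label in points:
        circle = ET.SubElement(svg, "circle", {
            "cx": "%.1f" % (margin + x / float(xmax) * plot_w),
            # SVG's y axis points down, so flip it.
            "cy": "%.1f" % (height - margin - y / float(ymax) * plot_h),
            "r": "3", "fill": "steelblue",
        })
        # A <title> child is what browsers show as a rollover tooltip.
        ET.SubElement(circle, "title").text = label

    return ET.tostring(svg, encoding="unicode")


print(scatter_svg([(0, 0, "crash A"), (10, 20, "crash B")]))
```

The returned string can be pasted inline into an HTML page or saved as a standalone .svg file.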

I’m not sure whether the embedded SVG will make it through feed aggregators/readers. Leave comments if you see weird results.

Graph of the Day: Empty Minidump Crashes Per User

Monday, April 22nd, 2013

Sometimes I make a graph to confirm a theory. Sometimes it doesn’t work. This is one of those days.

I created this graph in an attempt to analyze bug 837835. In that bug, we are investigating an increase in the number of crash reports we receive which have an empty (0-byte) minidump file. We’re pretty sure that this usually happens because of an out-of-memory condition (or an out of VM space condition).

Robert Kaiser reported in the bug that he suspected two date ranges of causing the number of empty dumps to increase. Those numbers were generated by counting crashes per build date. But they were very noisy, partly because they didn’t account for the differences in user population between nightly builds.

In this graph, I attempt to account for crashes per user. This was a slightly complicated task, because it assembles information from three separate inputs:

  • ADU (Active Daily Users) data is collected by Metrics. After normalizing the data, it is saved into the crash-stats raw_adu table.
  • Build data is pulled into the crash-stats database by using a tool called ftpscraper and saved into the releases_raw table. Anything called “scraper” is finicky and changes to other systems can break it.
  • Crash data is collected directly in crash-stats and stored in the reports_clean table.

Unfortunately, each of these systems has its own way of representing build IDs, channel information, and operating systems:

| Table | Product | Build ID | Channel | OS |
|---|---|---|---|---|
| raw_adu | “Firefox” | string “yyyymmddhhmmss” | “nightly” | “Windows” |
| releases_raw | “firefox” | integer yyyymmddhhmmss | “Nightly” | “win32” |
| reports_clean | “Firefox” (from product_versions) | integer yyyymmddhhmmss | “Nightly” when selecting from reports_clean.release_channel, but “nightly” when selecting from reports.release_channel | “Windows NT”, but only when a valid minidump is found: when there is an empty minidump, os_name is actually “Unknown” |

In this case, I’m only interested in the Windows data, and we can safely assume that almost all of the empty minidump crashes occur on Windows. The script/SQL query to collect the data simply limits each data source separately and then combines them after they have been limited to Windows nightly builds, users, and crashes.
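As a rough illustration of what reconciling the three representations involves, the sketch below maps each source's spelling onto one common key. It is not the SQL actually used: normalize_build is a hypothetical helper, the alias table only covers the values listed above, and anything else (including the “Unknown” reported for empty minidumps) is deliberately left as "unknown" rather than guessed.

```python
# Hypothetical alias table, built only from the spellings described in this post.
OS_ALIASES = {"windows": "windows", "win32": "windows", "windows nt": "windows"}


def normalize_build(product, build_id, channel, os_name):
    """Map one source's spelling to a common (product, build_id, channel, os) key."""
    return (
        product.strip().lower(),
        int(build_id),  # accepts both the string and the integer forms
        channel.strip().lower(),
        OS_ALIASES.get(os_name.strip().lower(), "unknown"),
    )


# The same nightly, as each of the three tables would describe it.
adu_key = normalize_build("Firefox", "20130125031018", "nightly", "Windows")
release_key = normalize_build("firefox", 20130125031018, "Nightly", "win32")
crash_key = normalize_build("Firefox", 20130125031018, "Nightly", "Windows NT")
print(adu_key == release_key == crash_key)
```

Once every row carries the same key, joining users, builds, and crashes is an ordinary equality join.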

Frequency of Empty Dump crashes on Windows Nightlies

The missing builds are the result of ftpscraper failures.

I’m not sure what to make of this data. It seems likely that we may have fixed part of the problem in the 2013-01-25-03-10-18 nightly. But I don’t see a distinct regression range within this time frame. Perhaps around 25-December? Of course, it could also be that the dataset is so noisy that we can’t draw any useful conclusions from it.

Graph of the Day: Old Flash Versions and Blocklist Effectiveness

Friday, April 19th, 2013

Today’s graph charts the percentage of Firefox users who have known-insecure versions of Flash. It also lets us see the impact of various plugin blocks that have been staged over the past few months.

We are gradually rolling out blocks for more and more versions of Flash. In order to make sure that the blocklist was not causing significant user pain, we started out with the oldest versions of Flash that have the fewest users. We have since been expanding the block to include more recent versions of Flash that are still insecure. We hope to extend these blocks to all insecure versions of Flash in the next few months.

Flash Insecure Release Distribution

From the data, we see that users on very old versions of Flash (Flash 10.2 and earlier) are not changing their behavior because of the blocklist. This either means that the users never see Flash content, or that they always click through the warning. It is also possible that they attempted to upgrade but for some reason were unable to.

Users with slightly newer versions seem more likely to upgrade. Over about a month, almost half of the users who had insecure versions of Flash 10.3-11.2 have upgraded.

Finally, it is interesting that these percentages drop down on the weekends. This indicates that work or school computers are more likely to have insecure versions of Flash than home computers. Because there are well-known exploits for all of these Flash versions, this represents a significant risk to organizations that are not keeping up with security updates!

View the chart in HTML version and the raw data. This data was brought to you by Telemetry, and so the standard cautions apply: telemetry is an opt-in sample on the beta/release channels, and may under-represent certain populations, especially enterprise deployments which may lock telemetry off by default. This data represents Windows users only, because we just recently started collecting Flash version information on Mac, and the Linux Flash player doesn’t expose its version at all.

Raw aggregates for Flash usage can be found in my dated directories on crash-analysis.mozilla.com, for example yesterday’s aggregate counts. You are welcome to scrape this data if you want to play with it; I am also willing to provide interested researchers with additional data dumps on request.

Chart of the Day: Firefox Nightly Update Adoption Curves

Monday, April 15th, 2013

In general, people who are running the Firefox Nightly and Aurora channels are offered a new build every day. But users don’t update immediately, because Firefox does not interrupt you with an update prompt upon receiving an update. Instead it waits and applies the update at the next Firefox restart, or prompts the user to update only after significant idle time.

This means that there is a noticeable “delay” between a nightly build and when people start reporting bugs or crashes against the build. It also means that the number of users using any particular nightly build can vary widely. The following charts demonstrate this variability and the update adoption curves:

Per-build usage and adoption curves, Firefox nightly builds on Windows, 1-March to 14-April 2013
Overlapped adoption curves, 1-March to 14-April 2013

Because of this variability, engineers and QA should use care when using data from nightly builds. Note the following conclusions and recommendations:

  • Holidays, weekends, and other unexplained factors may mean that some nightly builds get below-average user totals.
  • Users often skip nightlies: reported regression ranges should be verified.
  • Reliable crash metrics will not be available for several days after a nightly build is released.
  • It may be necessary to correlate crash rates on particular builds against the user counts for that build in order to accurately measure crashes-per-user.
  • When multiple nightlies are built on the same day (for example, a respin for a bad regression), the user count for each build will be lower than an average nightly build.
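The crashes-per-user correlation mentioned in the list above comes down to a per-build division. A minimal sketch, assuming the crash counts and user counts have already been aggregated into dictionaries keyed by build ID (crashes_per_user and the sample numbers are hypothetical, not real data):

```python
def crashes_per_user(crashes_by_build, users_by_build):
    """Return {build_id: crashes / users} for builds that have user data."""
    rates = {}
    for build_id, crashes in crashes_by_build.items():
        users = users_by_build.get(build_id, 0)
        if users > 0:  # skip builds with no user counts rather than divide by zero
            rates[build_id] = crashes / float(users)
    return rates


# Made-up numbers: the second build has no user data and is left out.
print(crashes_per_user({20130301: 5, 20130302: 3}, {20130301: 100}))
```

Builds without user counts are dropped rather than reported as zero, so a gap in the user data shows up as a gap in the result.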

This data was collected from ADU data provided by metrics and mirrored in the crash-stats database. The script used to collect this data is available in socorro-toolbox.

Graph of the Day: Crash Report Metadata

Friday, April 12th, 2013

When Firefox crashes, it submits a minidump of the crash; it also submits some key/value metadata. The metadata includes basic information about the product/version/buildid, but it also includes a bunch of other information that we use to group, correlate, and diagnose crashes. Using data collected with jydoop, I created a graph of how frequently various metadata keys were submitted on various platforms:

Frequency of Crash Report Metadata on 2-April-2013

The jydoop script to collect this data is simple:

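import crashstatsutils
import json
import jydoop

# Request the raw metadata and the processed crash data, both decoded as JSON columns.
setupjob = crashstatsutils.dosetupjob([('meta_data', 'json'), ('processed_data', 'json')])

def map(k, meta_data, processed_data, context):
    # Skip records that have no processed data.
    if processed_data is None:
        return

    try:
        meta = json.loads(meta_data)
        processed = json.loads(processed_data)
    except:
        # Count records that fail to parse instead of aborting the job.
        context.write('jsonerror', 1)
        return

    product = meta['ProductName']

    os = processed.get('os_name', None)
    if os is None:
        return

    # '$total' counts every crash for this product/OS pair;
    # the loop then counts each metadata key that was present.
    context.write((product, os, '$total'), 1)
    for key in meta:
        context.write((product, os, key), 1)

# Sum the counts for each key in both the combine and reduce phases.
combine = jydoop.sumreducer
reduce = jydoop.sumreducer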

View the raw data (CSV).
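The job emits raw counts, so producing frequencies means dividing each key's count by the '$total' for the same product and OS. A minimal sketch of that step, assuming the summed output has been loaded into a dictionary keyed by (product, os, key); key_frequencies and the sample counts are hypothetical:

```python
def key_frequencies(counts):
    """counts maps (product, os, key) -> count and includes the '$total' rows."""
    freqs = {}
    for (product, os_name, key), n in counts.items():
        if key == "$total":
            continue
        total = counts.get((product, os_name, "$total"), 0)
        if total:  # ignore groups that have no total row
            freqs[(product, os_name, key)] = n / float(total)
    return freqs


# Made-up counts: 4 crashes, all with ProductName, one with Notes.
sample = {
    ("Firefox", "Windows NT", "$total"): 4,
    ("Firefox", "Windows NT", "ProductName"): 4,
    ("Firefox", "Windows NT", "Notes"): 1,
}
print(key_frequencies(sample))
```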

If you are interested in learning what each piece of metadata means and is used for, I’ve started to document them on the Mozilla docs site.

This graph was generated using the flot JS library and the code is available on github.

Graph of the Day: Firefox Virtual Memory Plot

Thursday, April 11th, 2013

I spend a lot of time making sense out of data, and so I’m going to try a new “Graph of the Day” series on this blog.

Today’s plot was created from a crash report submitted by a Firefox user and filed in bugzilla. This user had been experiencing problems where Firefox would, after some time, start drawing black boxes instead of normal content and soon after would crash. Most of the time, his crash report would contain an empty (0-byte) minidump file. In our experience, 0-byte minidumps are usually caused by low-memory conditions causing crashes. But the statistics metadata reported along with the crash show that there was lots of available memory on the system.

This piqued my interest, and fortunately, at least one of the crash reports did contain a valid minidump. Not only did this point us to a Firefox bug where we are aborting when large allocations fail, but it also gave me information about the virtual memory space of the process when it crashed.

When creating a Windows minidump, Firefox calls the MinidumpWriteDump function with the MiniDumpWithFullMemoryInfo flag. This causes the minidump to contain a MINIDUMP_MEMORY_INFO_LIST block, which includes information about every single block of memory pages in the process, the allocation base/size, the free/reserved/committed state, whether the page is private (allocated) memory or some kind of mapped/shared memory, and whether the page is readable/writable/copy-on-write/executable.
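For readers who want to decode these records themselves, here is a rough sketch. It assumes the 48-byte little-endian MINIDUMP_MEMORY_INFO layout and the MEM_* and PAGE_* constant values from the Windows SDK headers. It only decodes a single record: locating the list inside a minidump file is not shown, and parse_memory_info and protect_names are hypothetical helper names.

```python
import struct

# BaseAddress, AllocationBase, AllocationProtect, pad, RegionSize,
# State, Protect, Type, pad (assumed layout, 48 bytes).
MEMORY_INFO = struct.Struct("<QQIIQIIII")

STATES = {0x1000: "MEM_COMMIT", 0x2000: "MEM_RESERVE", 0x10000: "MEM_FREE"}
TYPES = {0x20000: "MEM_PRIVATE", 0x40000: "MEM_MAPPED", 0x1000000: "MEM_IMAGE"}
PROTECT_BASE = {
    0x01: "PAGE_NOACCESS", 0x02: "PAGE_READONLY", 0x04: "PAGE_READWRITE",
    0x08: "PAGE_WRITECOPY", 0x10: "PAGE_EXECUTE", 0x20: "PAGE_EXECUTE_READ",
    0x40: "PAGE_EXECUTE_READWRITE", 0x80: "PAGE_EXECUTE_WRITECOPY",
}
PROTECT_MODIFIERS = {0x100: "PAGE_GUARD", 0x200: "PAGE_NOCACHE", 0x400: "PAGE_WRITECOMBINE"}


def protect_names(protect):
    """Translate a protection bitmask into its flag names."""
    names = [name for bit, name in sorted(PROTECT_BASE.items()) if protect & bit]
    names += [name for bit, name in sorted(PROTECT_MODIFIERS.items()) if protect & bit]
    return names


def parse_memory_info(data):
    """Decode one 48-byte record into a dictionary."""
    (base, alloc_base, _alloc_protect, _pad1,
     size, state, protect, mem_type, _pad2) = MEMORY_INFO.unpack(data)
    return {
        "base": base,
        "allocation_base": alloc_base,
        "size": size,
        "state": STATES.get(state, hex(state)),
        "protect": protect_names(protect),
        # Free regions have no type, so unknown values are shown as hex.
        "type": TYPES.get(mem_type, hex(mem_type)),
    }


# A hand-built record for illustration, not data from a real crash.
record = MEMORY_INFO.pack(0x10000, 0x10000, 0x04, 0, 0x880000, 0x1000, 0x404, 0x40000, 0)
print(parse_memory_info(record))
```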

(view the plot in a new window).

There are two interesting things that I learned while creating this plot and sharing it on mozilla.dev.platform:

Virtual Memory Fragmentation

Some code is fragmenting the page space with one-page allocations. On Windows, a page is a 4k block, but page space is not allocated in one-page chunks. Instead, the minimum allocation block is 16 pages (64k). So if any code is calling VirtualAlloc with just 4k, it ties up a full 16-page block of address space and wastes the other 15 pages. Note that this doesn’t waste memory, it only wastes VM space, so it won’t show up on any traditional metrics such as “private bytes”.
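The arithmetic is easy to check. The sketch below assumes each request is a separate fresh reservation, so it starts on its own 64k boundary; wasted_address_space is a hypothetical helper, and the two constants are the page size and allocation block size quoted above.

```python
PAGE_SIZE = 4 * 1024
ALLOCATION_GRANULARITY = 64 * 1024


def round_up(value, multiple):
    """Round value up to the next multiple (integer arithmetic only)."""
    return -(-value // multiple) * multiple


def wasted_address_space(request_bytes):
    """Address space tied up by a request but not backed by usable pages."""
    used = round_up(request_bytes, PAGE_SIZE)
    reserved = round_up(request_bytes, ALLOCATION_GRANULARITY)
    return reserved - used


# A 4k request occupies a 64k slot, leaving 60k (15 pages) unusable.
print(wasted_address_space(4 * 1024) // PAGE_SIZE)
```

A thousand such one-page allocations would tie up roughly 64MB of address space while committing only about 4MB.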

Leaking Memory Mappings

Something is leaking memory mappings. Looking at the high end of memory space (bottom of the graphical plot), hover over the large blocks of purple (committed) memory and note that there are many allocations that are roughly identical:

  • Size: 0x880000
  • State: MEM_COMMIT
  • Protection: PAGE_READWRITE PAGE_WRITECOMBINE
  • Type: MEM_MAPPED

Given the other memory statistics from the crash report, it appears that these blocks are actually all mapping the same file or piece of shared memory. And it seems likely that there is a bug somewhere in code which is mapping the same memory repeatedly (with MapViewOfFile) and forgetting to call UnmapViewOfFile when it is done.
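One way to spot this kind of pattern without hovering over each block is to group the regions by their attributes and look for combinations that repeat. This is a sketch of that idea rather than the analysis that was actually run: repeated_regions is a hypothetical helper, and the region dictionaries are assumed to have already been extracted from the minidump.

```python
from collections import Counter


def repeated_regions(regions, min_count=2):
    """Return ((size, state, protect, type), count) pairs seen at least min_count times."""
    counts = Counter(
        (r["size"], r["state"], r["protect"], r["type"]) for r in regions
    )
    return [(key, n) for key, n in counts.most_common() if n >= min_count]


# Made-up regions: three identical mappings plus one unrelated private block.
mapping = {"size": 0x880000, "state": "MEM_COMMIT", "protect": 0x404, "type": "MEM_MAPPED"}
other = {"size": 0x1000, "state": "MEM_COMMIT", "protect": 0x04, "type": "MEM_PRIVATE"}
for key, n in repeated_regions([mapping, dict(mapping), dict(mapping), other]):
    print(n, "regions of size", hex(key[0]))
```

A high count for one attribute combination is only a hint. Confirming a leak still requires knowing which code created the mappings.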

Conclusion

We’re still working on diagnosing this problem. The user who originally reported this issue notes that if he switches his laptop to use his integrated graphics card instead of his nvidia graphics, then the problem disappears. So we suspect something in the graphics subsystem, but we’re not sure whether the problem is in Firefox accelerated drawing code, the Windows D3D libraries, or in the nvidia driver itself. We are looking at the possibility of hooking allocation functions such as VirtualAlloc and MapViewOfFile in order to find the call stack at the point of allocation to help determine exactly what code is responsible. If you have any tips or want to follow along, see bug 859955.