Using crash-stats-api-magic

Monday, April 20th, 2015

A while back, I wrote the tool crash-stats-api-magic which allows custom processing of results from the crash-stats API. This tool is not user-friendly, but it can be used to answer some pretty complicated questions.

As an example and demonstration, see a bug that Matthew Gregan filed this morning asking for a custom report from crash-stats:

In trying to debug bug 1135562, it’s hard to guess the severity of the problem or look for any type of version/etc. correlation because there are many types of hangs caught under the same mozilla::MediaShutdownManager::Shutdown stack. I’d like a report that contains only those with mozilla::MediaShutdownManager::Shutdown in the hung (main thread) stack *and* has wasapi_stream_init on one of the other threads, please.

To build this report, start with a basic query and then refine it in the tool:

  1. Construct a supersearch query to select the crashes we’re interested in. The only criterion for this query was “signature contains ‘MediaShutdownManager::Shutdown’”. When possible, filter on channel, OS, and version to reduce noise.
  2. After the supersearch query is constructed, choose “More Options” from the results page and copy the “Public API URL” link.
  3. Load crash-stats-api-magic and paste the query URL. Choose “Fetch” to fetch the results, and look through the raw data to get a sense of its structure.
  4. The meat of this report is filtering out the crashes that don’t have “wasapi_stream_init” on any thread. Choose “New Rule” and create a filter rule:
    function(d) {
      var ok = false;
      d.json_dump.threads.forEach(function(thread) {
        thread.frames.forEach(function(frame) {
          if (frame.function && frame.function.indexOf("wasapi_stream_init") != -1) {
            ok = true;
          }
        });
      });
      return ok;
    }

    Choose “Execute” to run the filter.

  5. To get the final report we output only the signature and the crash ID for each result. Choose “New Rule” again and create a mapping rule:
    function(d) {
      return [d.uuid, d.signature];
    }

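Taken together, the filter and mapping rules behave like a filter-then-map over the result list. Here is a standalone JavaScript sketch of the same logic, using made-up sample records shaped like the crash-stats processed-crash JSON (the uuid values and sample data are hypothetical, not real crash reports):

```javascript
// Hypothetical sample records shaped like crash-stats processed crashes:
// each has json_dump.threads[].frames[].function.
const crashes = [
  {
    uuid: "aaaa-1111",
    signature: "mozilla::MediaShutdownManager::Shutdown",
    json_dump: { threads: [
      { frames: [{ function: "mozilla::MediaShutdownManager::Shutdown" }] },
      { frames: [{ function: "wasapi_stream_init" }] },
    ] },
  },
  {
    uuid: "bbbb-2222",
    signature: "mozilla::MediaShutdownManager::Shutdown",
    json_dump: { threads: [
      { frames: [{ function: "mozilla::MediaShutdownManager::Shutdown" }] },
    ] },
  },
];

// Filter rule: keep crashes with wasapi_stream_init on any thread.
function hasWasapiInit(d) {
  return d.json_dump.threads.some((thread) =>
    thread.frames.some(
      (frame) =>
        frame.function && frame.function.includes("wasapi_stream_init")
    )
  );
}

// Mapping rule: emit [crash ID, signature] pairs for the final report.
const report = crashes.filter(hasWasapiInit).map((d) => [d.uuid, d.signature]);
console.log(report); // logs only the crash that has wasapi_stream_init
```

The nested `some()` calls do the same work as the nested `forEach` loops in the tool’s filter rule, but short-circuit as soon as a match is found.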

One of the advantages of this tool is that it is possible to iterate quickly on the data without constantly re-querying, but at the end it should be possible to permalink to the results in bugzilla or email exchanges.

If you need to do complex crash-stats analysis, please try it out! Email me if you have questions; pull requests are welcome.

Debugging Official Builds (or, how cool is the Mozilla symbol server?)

Monday, June 11th, 2007

Not infrequently, bugs are filed in Mozilla by Smart People who are experiencing odd behavior and want to help, but who really don’t want to spend the time to build Mozilla themselves (and I really don’t blame them).

Now, at least on Windows, interested hackers have the ability to debug release builds of Firefox! Mozilla finally has its own symbol server which will provide debugging PDBs for nightly and release builds. See the Mozilla Developer Center for more information about using this new and exciting service. Note that this will only work for trunk builds from 1.9a5 forward, so it won’t be much help with our current Firefox 2.0.0.x release series. If you want to disable breakpad crash reporting and have crashes in nightly builds go straight to the Windows JIT debugging system, export MOZ_CRASHREPORTER_DISABLE=1 in your environment.

Visual C++ Express Edition symbol path dialog.

Kudos to Ted and Aravind for getting this set up.

Crash! Bang! Boom!

Wednesday, May 30th, 2007

If you were one of the lucky users who tried to do a nightly update on Windows between 5am and 11am PDT Tuesday morning, you were probably treated to this dialog when you launched your new Firefox:

Crash! Bang! Boom!

Because the new crash-reporting UI had just landed, people on the forums assumed it was malfunctioning, but it was actually doing its job perfectly: there was a real startup crash on many Windows systems. What’s even more exciting is that this kind of crash is not caught by the Talkback crash reporting system, because of the sequence in which we load XPCOM components.

The Breakpad/Socorro crash reporting project has really come together in the past few weeks. After a lot of pain and frustration, Morgamic and I concocted a database schema that is scalable and efficient. We have been building the basic pieces of a reporting app that will allow QA and developers to analyze crash data. Sayrer has spent untold hours getting Socorro ready for initial deployment, and then dealing with a set of problems [1] that are still being diagnosed and fixed. Aravind has been patiently dealing with a new deployment of a complicated three-part application which is rough around the edges. Luser and dcamp rushed to get the client UI in usable shape from beltzner’s mockup, which we got landed 5 minutes before this morning’s nightly builds.

This is a major milestone, and I am really proud of the team that has come together to make this all happen.

Status Update

  • For Firefox 3.0a5, Breakpad is enabled by default:
    • on Mac: 100% of Mac installations will have Breakpad and Talkback will not be available.
    • on Windows: Breakpad is enabled on all installations, but 50% of installations will still have the Talkback client. This will allow us to compare some statistics between the old and new systems. When both systems are enabled, Talkback “wins”, because it registers last.
    • not on Linux: the Linux client is not ready yet; it will be completed within the next few weeks. There are some unsolved issues in the breakpad library itself, as well as integration with Mozilla and how to allow the client to submit reports via HTTPS.
  • The crash reporting server currently has some issues (i.e. it is only processing one report per hour, due to some design flaws). The fixes have been landed in SVN and should be on the staging server today.
  • The server currently has very basic reporting/searching capabilities only. These capabilities will be expanded fairly quickly, with weekly updates to the underlying software.
  • Currently we plan on keeping crash reporting data “forever”. The database has partitions that will allow most common queries to operate on a subset of the data in an efficient manner.

What’s Next?

There’s still a lot to be done. There are lots of reports we need on the server, and many more features that would be nice. Sancus, ispiked, and jay are on board to help develop the server, but we could use more help!

  • We need design help! If you do active QA using the existing Talkback reporting tools, please take a moment to think of what kinds of crash reporting features you would find most useful in the new system. Please post your ideas to the mozilla.dev.quality newsgroup, being as specific as possible.
  • We need a statistician. I am especially looking for someone who is skilled at identifying statistical anomalies over time in a fairly large set of data, for reports such as “Help me reproduce this crash” and “Find new crash regressions”.
  • We need implementation help on the server. To get people started, I have created a CentOS5 image which can be run in VMWare Player with a pre-installed version of the Socorro server (available on request). There are also documents on getting started hacking Socorro and on building Firefox with breakpad symbols.
  • For more information about the project schedule and planning, see the Mozilla wiki.

If you are interested in helping, or just have questions, feel free to stop by the #breakpad channel on irc.mozilla.org, or post to mozilla.dev.quality.

Socorro server pieces and interactions.

Notes

  1. Deploying a web app is really hard. Production environments are hard to replicate on local testing servers: NFS mounts, tightly controlled versions, heavy loads, secured databases, and real-world data are hard to come by.

When Partitioning Database Tables, EXPLAIN your queries

Saturday, May 12th, 2007

The past week or so I’ve been spending most of my time on the Socorro crash-reporting server software. One of the important things I’ve learned this week is that while database partitions solve some important problems, they create some equally nasty and unexpected ones.

Socorro will provide all of the reports and querying capabilities we need to analyze Firefox crashes. In order to gracefully deal with the volume of incoming crash reports from Firefox users (approx 30k reports per day), morgamic designed a database schema that would use postgres partitions to separate data into manageable and queryable pieces. This would allow any date-bound query to read only the partitions of interest. And hopefully, we’re going to be able to design the reporting system so that all queries are date-bound.
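For readers unfamiliar with how partitioning worked in 8.x-era postgres (child tables inheriting from a parent, with CHECK constraints enabling partition pruning), a minimal sketch of such a schema might look like the following. The table and column names here are illustrative, not Socorro’s actual schema:

```sql
-- Parent table: holds no data itself; children inherit its columns.
CREATE TABLE reports (
    id       serial,
    date     timestamp without time zone NOT NULL,
    comments text
);

-- One child table per date range; the CHECK constraint is what allows
-- the planner to exclude partitions from date-bound queries.
CREATE TABLE reports_part0 (
    CHECK (date >= '2007-05-07' AND date < '2007-05-14')
) INHERITS (reports);
CREATE INDEX idx_reports_part0_date ON reports_part0 (date);

-- Constraint exclusion must be enabled for the pruning to happen.
SET constraint_exclusion = on;
```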

The tables involved look something like this:

reports (index on date):

  id (primary key) | date                | comments
  -----------------|---------------------|-------------------------
  1                | 2007-05-13 00:00:01 | Yahoo games!
  2                | 2007-05-13 00:00:03 | Youtube video
  3                | 2007-05-13 00:00:02 | I just keep crashing :-(

frames (primary key on report_id, frame_num):

  report_id | frame_num | signature
  ----------|-----------|--------------------------------
  1         | 0         | 0x0
  1         | 1         | nsCOMPtr_base::~nsCOMPtr_base()


If you don’t partition the table, getting the last report is a very fast operation:

breakpad=> EXPLAIN SELECT max(date) FROM reports;
QUERY PLAN                                                          
--------------
 Result  (cost=1.73..1.74 rows=1 width=0)
   InitPlan
     ->  Limit  (cost=0.00..1.73 rows=1 width=8)
           ->  Index Scan Backward using idx_reports_date on reports  (cost=0.00..1728388.93 rows=999873 width=8)
                 Filter: (date IS NOT NULL)

For the uninitiated, this means that we are doing an index scan of the index on date and returning the highest value.

However, when you partition the table, things get ugly very quickly:

breakpad=> EXPLAIN SELECT max(date) FROM reports;
QUERY PLAN
--------------
 Aggregate  (cost=186247.04..186247.05 rows=1 width=8)
   ->  Append  (cost=0.00..175344.43 rows=4361043 width=8)
         ->  Seq Scan on reports  (cost=0.00..10.20 rows=20 width=8)
         ->  Seq Scan on reports_part0 reports  (cost=0.00..40209.73 rows=999873 width=8)
         ->  Seq Scan on reports_part1 reports  (cost=0.00..40205.75 rows=1000175 width=8)
         ->  Seq Scan on reports_part2 reports  (cost=0.00..40200.93 rows=1000093 width=8)
         ->  Seq Scan on reports_part3 reports  (cost=0.00..40197.31 rows=999731 width=8)
         ->  Seq Scan on reports_part4 reports  (cost=0.00..14510.31 rows=361131 width=8)
         ->  Seq Scan on reports_part5 reports  (cost=0.00..10.20 rows=20 width=8)

The query performs a full table scan of all the partitions, which is just about the worst result possible. Even if you don’t have any constraints or knowledge about the data in the date field, the query planner should be able to optimize the query to the following:

SELECT max(maxdate)
FROM
 (SELECT max(date) as maxdate FROM reports_part0 UNION
  SELECT max(date) FROM reports_part1 UNION...
 );

This is at most one index query per partition, which is perfectly reasonable. If you add range constraints to the date field of each partition, this query can be optimized into a loop where you query the “latest” partition first and work backwards until you find a single value that is higher than the range of all the remaining partitions.

But there are even more “gotchas” lurking in table partitioning. The query planner operates on queries before functions are called or bind parameters are substituted. This means that a SQL query which contains a constant can perform very differently than one containing a function:

breakpad=> EXPLAIN SELECT * FROM reports WHERE date < '2007-05-12 11:03' AND date > '2007-05-12 10:03' ORDER BY date DESC;
QUERY PLAN
--------------
 Sort
   Sort Key: public.reports.date
   ->  Result
         ->  Append
               ->  Seq Scan on reports
                     Filter: ((date < '2007-05-12 11:03:00'::timestamp without time zone) AND (date > '2007-05-12 10:03:00'::timestamp without time zone))
               ->  Index Scan using idx_reports_part0_date on reports_part0 reports
                     Index Cond: ((date < '2007-05-12 11:03:00'::timestamp without time zone) AND (date > '2007-05-12 10:03:00'::timestamp without time zone))

Because we have date constraints on the reports partitions, the planner is smart enough to know that only reports_part0 contains the data we’re looking for. But replace the literal dates with the equivalent functions, and the query planner has to search every partition:

breakpad=> EXPLAIN SELECT * FROM reports WHERE date < now() AND date > now() - interval '1 day' ORDER BY date DESC;
QUERY PLAN
---------------
 Sort
   Sort Key: public.reports.date
   ->  Result
         ->  Append
               ->  Seq Scan on reports
                     Filter: ((date < now()) AND (date > (now() - '1 day'::interval)))
               ->  Bitmap Heap Scan on reports_part0 reports
                     Recheck Cond: ((date < now()) AND (date > (now() - '1 day'::interval)))
                     ->  Bitmap Index Scan on idx_reports_part0_date
                           Index Cond: ((date < now()) AND (date > (now() - '1 day'::interval)))
               ->  Index Scan using idx_reports_part1_date on reports_part1 reports
                     Index Cond: ((date < now()) AND (date > (now() - '1 day'::interval)))
               ->  Index Scan using idx_reports_part2_date on reports_part2 reports
                     Index Cond: ((date < now()) AND (date > (now() - '1 day'::interval)))
               ->  Index Scan using idx_reports_part3_date on reports_part3 reports
                     Index Cond: ((date < now()) AND (date > (now() - '1 day'::interval)))
               ->  Index Scan using idx_reports_part4_date on reports_part4 reports
                     Index Cond: ((date < now()) AND (date > (now() - '1 day'::interval)))
               ->  Index Scan using idx_reports_part5_date on reports_part5 reports
                     Index Cond: ((date < now()) AND (date > (now() - '1 day'::interval)))

Both of these missed optimizations are extremely problematic when dealing with partitioned tables in postgresql. The first, less common issue should be easy to fix, because it doesn’t require any constraint information. The second is not so easy, because it would require the query planner to divide its work into a “pre-function/bindparam expansion” stage, which is cacheable, and a “post-function/bindparam expansion” stage, which is not very easy to cache.

We are going to try to work around the data-binding issue by issuing queries from Socorro using literals instead of bound variables. This is not ideal, because it requires the database to completely re-plan every query that is issued.

The moral of the story is simple: if you are planning on using database partitions, be sure to EXPLAIN all the queries you’re planning, with the actual literals or bound statements that will be used in production. Be prepared to significantly rework your queries if they perform unexpected full table scans.