Crash! Bang! Boom!
If you were one of the lucky users who tried to do a nightly update on Windows between 5am and 11am PDT Tuesday morning, you were probably treated to this dialog when you launched your new Firefox:
Because the new crash-reporting UI had just landed, people on the forums assumed that it was malfunctioning, but it was doing its job perfectly: there really was a startup crash on many Windows systems. What’s even more exciting is that the Talkback crash reporting system cannot catch this kind of crash, because of the order in which we load XPCOM components.
The Breakpad/Socorro crash reporting project has really come together in the past few weeks. After a lot of pain and frustration, Morgamic and I concocted a database schema that is scalable and efficient. We have been building the basic pieces of a reporting app that will allow QA and developers to analyze crash data. Sayrer has spent untold hours getting Socorro ready for initial deployment, and then dealing with a set of problems [1] that are still being diagnosed and fixed. Aravind has been patiently dealing with a new deployment of a complicated three-part application which is rough around the edges. Luser and dcamp rushed to get the client UI into usable shape from beltzner’s mockup, and it landed 5 minutes before this morning’s nightly builds.
This is a major milestone, and I am really proud of the team that has come together to make this all happen.
Status Update
- For Firefox 3.0a5, Breakpad is enabled by default:
  - on Mac: 100% of Mac installations will have Breakpad, and Talkback will not be available.
  - on Windows: Breakpad is enabled on all installations, but 50% of installations will still have the Talkback client. This will allow us to compare some statistics between the old and new systems. When both systems are enabled, Talkback “wins”, because it registers last.
  - not on Linux: the Linux client is not ready yet; it will be completed within the next few weeks. There are some unsolved issues in the Breakpad library itself, in its integration with Mozilla, and in how the client will submit reports via HTTPS.
- The crash reporting server currently has some issues: because of some design flaws, it is only processing one report per hour. The fixes have landed in SVN and should be on the staging server today.
- The server currently has only very basic reporting and searching capabilities. These will be expanded fairly quickly, with weekly updates to the underlying software.
- Currently we plan on keeping crash reporting data “forever”. The database is partitioned so that the most common queries can operate efficiently on a subset of the data.
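
To illustrate the partitioning idea in the last item, here is a toy in-memory sketch. It is not the Socorro schema: it assumes, purely for illustration, that reports are partitioned by the week in which they arrived, and every name in it (`PartitionedReports`, `week_start`, the signature strings) is hypothetical. The point is only that a query for a date range scans the partitions that overlap the range and skips the rest.

```python
from collections import defaultdict
from datetime import date, timedelta


def week_start(day: date) -> date:
    """Return the Monday of the week containing `day` (the assumed partition key)."""
    return day - timedelta(days=day.weekday())


class PartitionedReports:
    """Toy store that keeps crash reports in one bucket per week."""

    def __init__(self):
        # Maps week start -> list of (day, signature) tuples.
        self.partitions = defaultdict(list)

    def insert(self, day: date, signature: str) -> None:
        self.partitions[week_start(day)].append((day, signature))

    def query(self, first: date, last: date):
        """Return (matching reports, number of partitions scanned)."""
        scanned = 0
        results = []
        key = week_start(first)
        while key <= last:
            # Only partitions overlapping [first, last] are ever touched.
            if key in self.partitions:
                scanned += 1
                results.extend(
                    r for r in self.partitions[key] if first <= r[0] <= last
                )
            key += timedelta(days=7)
        return results, scanned
```

With reports spread over many weeks, a query such as `store.query(date(2007, 5, 28), date(2007, 5, 31))` reads a single weekly bucket no matter how much older data is kept, which is what makes keeping data “forever” affordable for the common "recent crashes" queries.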
What’s Next?
There’s still a lot to be done. There are lots of reports we need on the server, and many more features that would be nice. Sancus, ispiked, and jay are on board to help develop the server, but we could use more help!
- We need design help! If you do active QA using the existing Talkback reporting tools, please take a moment to think of what kinds of crash reporting features you would find most useful in the new system. Please post your ideas to the mozilla.dev.quality newsgroup, being as specific as possible.
- We need a statistician. I am especially looking for someone who is skilled at identifying statistical anomalies over time in a fairly large set of data, for reports such as “Help me reproduce this crash” and “Find new crash regressions”.
- We need implementation help on the server. To get people started, I have created a CentOS 5 image which can be run in VMware Player with a pre-installed version of the Socorro server (available on request). There are also documents on getting started hacking Socorro and on building Firefox with Breakpad symbols.
- For more information about the project schedule and planning, see the Mozilla wiki.
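
To make the statistician request a little more concrete, here is a minimal sketch of what a “Find new crash regressions” report might compute. It is an assumption about one possible approach, not a design we have settled on: for each crash signature it compares the count in the newest build against the counts from earlier builds and flags signatures that are either brand new or far above their historical mean. The function name, the threshold, and the signature strings are all hypothetical, and the sketch works on raw counts, whereas a real report would need to normalize by the number of users on each build.

```python
from statistics import mean, pstdev


def find_regressions(history, latest, threshold=3.0):
    """Flag signatures whose latest crash count looks anomalously high.

    history: dict mapping signature -> list of crash counts for earlier builds
    latest:  dict mapping signature -> crash count for the newest build
    """
    flagged = []
    for signature, count in latest.items():
        past = history.get(signature, [])
        if not past:
            # Never seen before: any crash is a candidate regression.
            flagged.append(signature)
            continue
        mu = mean(past)
        # Floor the spread at 1.0 so that a perfectly flat history does
        # not make every small fluctuation look significant.
        sigma = max(pstdev(past), 1.0)
        if (count - mu) / sigma > threshold:
            flagged.append(signature)
    return sorted(flagged)
```

A simple z-score like this is only a starting point; identifying anomalies reliably in noisy, low-count data is exactly where we need expert help.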
If you are interested in helping, or just have questions, feel free to stop by the #breakpad channel on irc.mozilla.org, or post to mozilla.dev.quality.
Notes
1. Deploying a web app is really hard. Production environments are hard to replicate on local testing servers: NFS mounts, tightly controlled versions, heavy loads, secured databases, and real-world data are hard to come by.
May 30th, 2007 at 8:15 pm
Sounds like a case of “The patient died, but the operation was a success”!
I’m not a statistician, but I did dual major in math and physics in college. I’m also experienced in numerical computing and creating efficient numerical algorithms. I’d have to review my textbook from my probability and statistics course a lot, but I might be able to help.
What type of statistics/anomalies would you be needing/looking for? What language should the code be written in?
June 1st, 2007 at 3:31 pm
It still gives that error here, even after updating to the latest trunk version. Is there any way to disable Breakpad and enable Talkback to see what’s going on? Installing the very same version on a ‘clean’ machine works just fine, so it is probably some file left somewhere or a registry thing…
June 1st, 2007 at 5:39 pm
Tommy, that particular startup crash cannot be caught by Talkback, because Talkback registers itself during XPCOM startup, after the point where that crash takes place.
It’s almost certainly a problem with some particular font on the machine.
July 9th, 2007 at 2:58 pm
[…] named this tool Breakpad (it’s something like a pad to land safely on when the app breaks). After Firefox has been using/testing the new framework for a while, SeaMonkey now has switched over Nightly builds from Talkback to […]
May 26th, 2010 at 10:05 am
[…] by the same Breakpad system as desktop Firefox. While desktop Firefox has been using Breakpad since Firefox 3.0, it was only recently ported to the ARM architecture of the Nokia […]