If you were one of the lucky users who tried to do a nightly update on Windows between 5am and 11am PDT Tuesday morning, you were probably treated to this dialog when you launched your new Firefox:
Because the new crash-reporting UI had just landed, people on the forums assumed that it was malfunctioning, but it was doing its job perfectly: there really was a startup crash on many Windows systems. What’s even more exciting is that the Talkback crash reporting system does not catch this kind of crash, because of the order in which we load XPCOM components.
The Breakpad/Socorro crash reporting project has really come together in the past few weeks. After a lot of pain and frustration, Morgamic and I concocted a database schema that is scalable and efficient. We have been building the basic pieces of a reporting app that will allow QA and developers to analyze crash data. Sayrer has spent untold hours getting Socorro ready for initial deployment, and then dealing with a set of problems (see Notes) that are still being diagnosed and fixed. Aravind has been patiently dealing with a new deployment of a complicated three-part application that is rough around the edges. Luser and dcamp rushed to get the client UI into usable shape from beltzner’s mockup, and it landed 5 minutes before this morning’s nightly builds.
This is a major milestone, and I am really proud of the team that has come together to make this all happen.
Status Update
- For Firefox 3.0a5, Breakpad is enabled by default:
  - on Mac: 100% of Mac installations will have Breakpad, and Talkback will not be available.
  - on Windows: Breakpad is enabled on all installations, but 50% of installations will still have the Talkback client. This will allow us to compare some statistics between the old and new systems. When both systems are enabled, Talkback “wins”, because it registers last.
  - not on Linux: the Linux client is not ready yet; it will be completed within the next few weeks. There are still unsolved issues in the Breakpad library itself, in its integration with Mozilla, and in how the client will submit reports via HTTPS.
- The crash reporting server currently has some issues: because of some design flaws, it is only processing one report per hour. The fixes have landed in SVN and should be on the staging server today.
- The server currently has very basic reporting/searching capabilities only. These capabilities will be expanded fairly quickly, with weekly updates to the underlying software.
- Currently we plan on keeping crash reporting data “forever”. The database has partitions that will allow most common queries to operate on a subset of the data in an efficient manner.
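As a rough illustration of the partitioning idea, the sketch below assumes reports are split into one partition per ISO week of the crash date. That scheme, the `reports_` naming, and both function names are hypothetical and are not taken from the actual Socorro schema. It only shows how a query restricted to a date range needs to touch the partitions that overlap that range, rather than the whole data set.

```python
from datetime import date, timedelta
from typing import List


def partition_name(day: date) -> str:
    """Hypothetical naming scheme: one partition per ISO week."""
    iso_year, iso_week, _ = day.isocalendar()
    return f"reports_{iso_year}w{iso_week:02d}"


def partitions_for_range(start: date, end: date) -> List[str]:
    """Return the partitions covering [start, end] inclusive, in date order."""
    if end < start:
        raise ValueError("end must not precede start")
    names: List[str] = []
    day = start
    while day <= end:
        name = partition_name(day)
        # Days are visited in order, so a repeated name is always the last one.
        if not names or names[-1] != name:
            names.append(name)
        day += timedelta(days=1)
    return names


if __name__ == "__main__":
    # A one-week query only needs the partition(s) overlapping that week.
    print(partitions_for_range(date(2007, 6, 4), date(2007, 6, 10)))
```

Under these assumptions, a query for a single week touches one or two partitions no matter how many years of reports have accumulated.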
What’s Next?
There’s still a lot to be done. There are lots of reports we need on the server, and many more features that would be nice. Sancus, ispiked, and jay are on board to help develop the server, but we could use more help!
- We need design help! If you do active QA using the existing Talkback reporting tools, please take a moment to think of what kinds of crash reporting features you would find most useful in the new system. Please post your ideas to the mozilla.dev.quality newsgroup, being as specific as possible.
- We need a statistician. I am especially looking for someone who is skilled at identifying statistical anomalies over time in a fairly large set of data, for reports such as “Help me reproduce this crash” and “Find new crash regressions”.
- We need implementation help on the server. To get people started, I have created a CentOS 5 image with a pre-installed version of the Socorro server, which can be run in VMware Player (available on request). There are also documents on getting started hacking Socorro and on building Firefox with Breakpad symbols.
- For more information about the project schedule and planning, see the Mozilla wiki.
If you are interested in helping, or just have questions, feel free to stop by the #breakpad channel on irc.mozilla.org, or post to mozilla.dev.quality.
Notes
- Deploying a web app is really hard. Production environments are hard to replicate on local testing servers: NFS mounts, tightly controlled versions, heavy loads, secured databases, and real-world data are hard to come by.