Answers and Questions

Archive for 2007

The Fun-Conference: FSOSS in Toronto

Wednesday, October 3rd, 2007

Come join me and a bunch of other Mozillians in Toronto Oct 25-26 for the Free Software and Open Source Symposium (FSOSS). Last year’s FSOSS was a blast, and I’m sure this year’s will be as well! There is a posse of exciting speakers (including myself). This year I’ll be giving a presentation on a frequently under-valued topic: reading (and reviewing) code. Other speakers from the Mozilla community include Mike Beltzner, Shane Caraveo, Mark Finkle, and Cesar Oliveira.

P.S. I know I haven’t been blogging regularly… I’m going to try and get back into rhythm, especially since I’ve been doing some very interesting work on Mozilla2 and XPCOM that needs discussion.

Posted in Mozilla | 1 Comment »

What made the web great can make the client great, too.

Wednesday, June 13th, 2007

If you listen carefully to the Web 2.0 crowd, you will hear a disturbing undercurrent. They don’t just believe that the web is great. They believe that client applications are dinosaurs, dying relics of a programming paradigm that is irrelevant. I find this tendency alternately depressing and frustrating: not only is there a place for client applications, but they have a bright and promising future that can work in cooperation with the greatness of the internet.

Shaver is right: the web is great because it has fostered open cooperation, viral programming, coding by view-source, mashups and “being able to jam jQuery in the hole that used to have Prototype in it”. The internet provides an excellent medium for viral and open markup and programming. But this kind of programming does not need to be unique to the web, and the Mozilla platform is a great bridge between these two worlds.

Take a look at the Firefox extension system, for example. The extension system not only allows programmers to mash up the client, but it brings the world of view-source copy-paste development to the client in a way that is unprecedented. We can and should foster this same model of viral programming in client-side applications. This is the real power of XULRunner and the Mozilla platform, if we can tame and harness its complexity.

What does this mean in practice? The Mozpad (Mozilla Platform Application Developers) community has been working towards defining some projects and goals: I believe these goals should be evaluated in terms of their ability to bring web-style application development to the client. For example, a traditional “integrated” IDE is very poor at copying code from others and making it your own. If we really wanted to go down the IDE route (IMO we don’t) we should be thinking of features such as “I like this dialog, steal it” and integrated search for code patterns (from e.g. Google code).

Or, to take another example close to my heart, XULRunner application packaging. It is much more important to get a simple tool that can be used to package text files into an installer than worrying at all about compiling binary code. Binary code should be a minority and diminishing case in a viral-programming development model.

Even though Rafael Ebron claims that the rich client is dying, his point about offline web apps is important. The primary difference is of course the security and trust model: local apps have their run of the system and can perform complex network and disk activity (and install binaries if necessary), while offline web apps have to work within the bounds the browser has set out for them. Client apps complement offline-enabled web applications; they are not direct competition.

I believe Mozpad should focus on the following high-impact investments:

reducing the barrier to entry for developers (code and documentation);
providing tools for viral programming (view-source, copy-paste, debugging and inspecting code);
evangelizing the platform’s strengths.

There is a bright future for client-side applications! We need to stay focused on the strengths of the Mozilla platform and the web paradigm and not be distracted by “RIA madness”, complex toolchains, or the need to imitate existing proprietary solutions.

Posted in Mozilla | 2 Comments »

Debugging Official Builds (or, how cool is the Mozilla symbol server?)

Monday, June 11th, 2007

Not infrequently, bugs are filed in Mozilla by Smart People who are experiencing an odd behavior or a bug. They want to help, but they really don’t want to spend the time to build Mozilla themselves (and I really don’t blame them).

Now, at least on Windows, interested hackers have the ability to debug release builds of Firefox! Mozilla finally has its own symbol server which will provide debugging PDBs for nightly and release builds. See the Mozilla Developer Center for more information about using this new and exciting service. Note that this will only work for trunk builds from 1.9a5 forward, so it won’t be much help with our current Firefox 2.0.0.x release series. If you want to disable breakpad crash reporting and have crashes in nightly builds go straight to the Windows JIT debugging system, export MOZ_CRASHREPORTER_DISABLE=1 in your environment.
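For anyone who launches nightly builds from a script rather than a shell, the environment variable can be set for just that one process. This is only a minimal Python sketch: the helper names and the `firefox_path` argument are made up for illustration, and the only thing taken from the text above is the `MOZ_CRASHREPORTER_DISABLE=1` setting.

```python
import os
import subprocess


def crashreporter_disabled_env(base_env=None):
    """Return a copy of the environment with MOZ_CRASHREPORTER_DISABLE=1 set."""
    env = dict(os.environ if base_env is None else base_env)
    env["MOZ_CRASHREPORTER_DISABLE"] = "1"
    return env


def launch_firefox(firefox_path):
    """Start Firefox with crash reporting disabled for this process only.

    firefox_path is a hypothetical path to a nightly build's executable.
    """
    return subprocess.Popen([firefox_path], env=crashreporter_disabled_env())
```

Because the variable is passed only to the child process, the rest of the session keeps its normal crash reporting behavior.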

Visual C++ Express Edition symbol path dialog.

Kudos to Ted and Aravind for getting this set up.

Posted in Mozilla | 5 Comments »

Crash! Bang! Boom!

Wednesday, May 30th, 2007

If you were one of the lucky users who tried to do a nightly update on Windows between 5am and 11am PDT Tuesday morning, you were probably treated to this dialog when you launched your new Firefox:

Crash! Bang! Boom!

Because the new crash-reporting UI had just landed, people on the forums assumed that it was malfunctioning, but it was doing its job perfectly: there really was a startup crash on many Windows systems. What’s even more exciting is that this kind of crash is not caught by the Talkback crash reporting system, because of the order in which we load XPCOM components.

The Breakpad/Socorro crash reporting project has really come together in the past few weeks. After a lot of pain and frustration, Morgamic and I concocted a database schema that is scalable and efficient. We have been building the basic pieces of a reporting app that will allow QA and developers to analyze crash data. Sayrer has spent untold hours getting Socorro ready for initial deployment, and then dealing with a set of problems¹ that are still being diagnosed and fixed. Aravind has been patiently dealing with a new deployment of a complicated three-part application which is rough around the edges. Luser and dcamp rushed to get the client UI in usable shape from beltzner’s mockup, which we got landed 5 minutes before this morning’s nightly builds.

This is a major milestone, and I am really proud of the team that has come together to make this all happen.

Status Update

For Firefox 3.0a5, Breakpad is enabled by default:
- on Mac: 100% of Mac installations will have Breakpad and Talkback will not be available.
- on Windows: Breakpad is enabled on all installations, but 50% of installations will still have the Talkback client. This will allow us to compare some statistics between the old and new systems. When both systems are enabled, Talkback “wins”, because it registers last.
- not on Linux: the Linux client is not ready yet; it will be completed within the next few weeks. There are some unsolved issues in the Breakpad library itself, in its integration with Mozilla, and in how to allow the client to submit reports via HTTPS.
The crash reporting server currently has some issues (i.e. it is only processing one report per hour, due to some design flaws). The fixes have landed in SVN and should be on the staging server today.
The server currently has very basic reporting/searching capabilities only. These capabilities will be expanded fairly quickly, with weekly updates to the underlying software.
Currently we plan on keeping crash reporting data “forever”. The database has partitions that will allow most common queries to operate on a subset of the data in an efficient manner.

What’s Next?

There’s still a lot to be done. There are lots of reports we need on the server, and many more features that would be nice. Sancus, ispiked, and jay are on board to help develop the server, but we could use more help!

We need design help! If you do active QA using the existing Talkback reporting tools, please take a moment to think of what kinds of crash reporting features you would find most useful in the new system. Please post your ideas to the mozilla.dev.quality newsgroup, being as specific as possible.
We need a statistician. I am especially looking for someone who is skilled at identifying statistical anomalies over time in a fairly large set of data, for reports such as “Help me reproduce this crash” and “Find new crash regressions”.
We need implementation help on the server. To get people started, I have created a CentOS5 image which can be run in VMWare Player with a pre-installed version of the Socorro server (available on request). There are also documents on getting started hacking Socorro and building Firefox with Breakpad symbols.
For more information about the project schedule and planning, see the Mozilla wiki.

If you are interested in helping, or just have questions, feel free to stop by the #breakpad channel on irc.mozilla.org, or post to mozilla.dev.quality.

Socorro server pieces and interactions.

Notes

¹ Deploying a web app is really hard. Production environments are hard to replicate on local testing servers: NFS mounts, tightly controlled versions, heavy loads, secured databases, and real-world data are hard to come by.

Posted in Mozilla | 6 Comments »

Shining Light on XUL Dark Matter

Wednesday, May 23rd, 2007

Late last year, roc wrote about XUL Dark Matter: semi-private intranet or domain-specific applications that are using XUL. Mark Finkle and I are setting out to try and get an estimate of the size and needs of this community, and how changes to XUL (especially in the Mozilla 2 timeframe) could help or hurt them.

Next month Mark will be in Tokyo and I will be in Paris for the week surrounding the Mozilla Developer Days. I am very interested in meeting up with developers of these semi-private XUL applications. If you are or know one of these developers, please encourage them to contact me so we can talk.

Here are some of the specific questions I’m trying to answer:

Why did you choose XUL?
Why was HTML insufficient for your needs?
If we added specific features of XUL to HTML, would you use HTML instead? e.g.:
- Specific UI controls easily available
- Templates
- Overlays
- Native Look and Feel
How do you control the version of Mozilla your users view your app with? Do you have concerns about the future stability of XUL?
Do you use XBL?

Posted in Mozilla | 1 Comment »

XULRunner: What we are doing

Tuesday, May 15th, 2007

In June 2003, Mozilla version 1.4 was released. For the first time, the Windows installers of Mozilla installed two separate pieces: the Gecko Runtime Environment (GRE) was installed into a shared location on the user’s hard drive, separate from the application components. Standalone installers for the shared GRE were made available so that other applications embedding Gecko could use the shared runtime. Life was full of promise… until users started upgrading or uninstalling their applications. The nightmare unfolded gradually: unsuspecting users would install a new version of Mozilla and Netscape or AIM would stop working. System administrators would install one version, and users would install into a different location, and both installs would fail to work correctly. It was not uncommon for the computer to become so horked that no combination of running uninstallers and installers would fix it: the only solution was to go into the registry and manually remove offending settings.

To this day, the phrase “GRE” conjures visions of frustrated users, unhappy developers, and instability.

The Perils of a Shared Runtime

This is the peril of a shared runtime, and one we absolutely must avoid. It is fiendishly difficult to get right, and the consequences of getting it wrong are devastating for users and developers alike. In March 2005, I posted a vision of how a shared XULRunner runtime could be implemented sanely. It kept registrations of installed runtimes and applications, and made sure that the correct version of the runtime was always available to applications. The XULRunner roadmap referenced this plan for XULRunner 1.9 and Firefox 3. Time has moved forward and we are now making hard decisions about features that were planned for Firefox 3, throwing many away because they do not fit in the schedule. The shared XULRunner runtime is one of those features.

This is the long story behind Mitchell’s post about XULRunner investment and Firefox 3. Mozilla is not “killing XULRunner”. It is merely stating the (obvious, I thought) fact that the previously published roadmaps are incorrect, and reiterating our support for continued development of the platform.

What are we doing?

Even though Mozilla isn’t going to ship Firefox 3 on a shared XULRunner, it is continuing active support of the Mozilla platform and the XULRunner project in particular:

Firefox 3 might ship a “private” XULRunner. A private XULRunner is a standard XULRunner build, but rather than trying to share it between Firefox and Thunderbird and other applications that might wish to use it, each application would ship and update its own copy independently. We have had a tinderbox building FF+XR experimental builds for months. The major tasks left to turn this on by default are mainly packaging issues to make it install correctly, and landing places-bookmarks. But Mozilla isn’t willing to let the Firefox 3 schedule slip for a feature such as XULRunner, so it may not make it in time. Developers who are interested in helping make this happen should contact me!
Make sure that prepackaged XULRunner builds are available.
- If Firefox 3 ships on top of XULRunner, the build process will produce XULRunner packages “for free” without requiring the Mozilla release team to do separate release work.
- Otherwise, I will be working with interested developers to make sure we have contributed XULRunner builds. This may take the form of build machines provided by Mozilla on the community network, or builds contributed from elsewhere. The Eclipse AJAX Tools Framework community is already creating contributed builds of XULRunner 1.8.1.x; this is a natural continuation of that process.
Linux distributions will be shipping Firefox on XULRunner. Several Linux distributions are already shipping XULRunner packages by default, for use by Epiphany and other Mozilla embedders. These distributions have already indicated that they are planning on distributing a shared XULRunner package and building Firefox 3 on top of this package. This will reduce the maintenance burden of Mozilla security updates significantly.
Continue to support the XULRunner ecosystem. XULRunner is being used in a variety of new and exciting applications. Mark Finkle and I will continue to work with developers using XUL and XULRunner to uplift appropriate features and code back into the platform, as well as solicit, consolidate, and publish documentation. XULRunner is nothing more than two files which bootstrap an application and hook it up to the Mozilla toolkit: contributed patches help make life better for all Mozilla developers, including Firefox and its extension ecosystem. Perhaps something like MozPad would be helpful, though I would really like to get as much of that documentation back into the Mozilla Developer Center as possible, for shared consumption.
Continue development in Mozilla 2. Mozilla 2 will have a clean separation between the XULRunner platform and the applications built on top of it. Having a clean API separation and build system separation between components will reduce the barrier to entry of new hackers for both the platform and the applications built on top of it. Whether this means we will have a shared runtime, or simply have better tools for producing applications on “private” runtimes remains to be seen.

The Power of Openness

I would like to extend a special acknowledgment to Ben Turner of Songbird for his continued support and patching of XULRunner and of the Mozilla build system. He recently asked me “how can I help make the Mozilla build system easier to use for external applications?” This is the kind of question that makes me happy and lifts my spirits: there are some low-cost changes to our platform that could make everyone’s life a lot easier, and all you need is to send me an email or IRC ping. The XULRunner team is not “less than one developer”; it is dozens of Mozilla Corp. employees, and hundreds or even thousands of developers working on the entire Mozilla platform. That I’m the only Mozilla Corporation employee working on nsXULStub.cpp is irrelevant, and a distraction from the accomplishments that we have all made together.

Posted in Mozilla | 11 Comments »

When Partitioning Database Tables, EXPLAIN your queries

Saturday, May 12th, 2007

The past week or so I’ve been spending most of my time on the Socorro crash-reporting server software. One of the important things I’ve learned this week is that while database partitions solve some important problems, they create some equally nasty and unexpected problems.

Socorro will provide all of the reports and querying capabilities we need to analyze Firefox crashes. In order to gracefully deal with the volume of incoming crash reports from Firefox users (approx 30k reports per day), morgamic designed a database schema that would use postgres partitions to separate data into manageable and queryable pieces. This would allow any date-bound queries to read only the partitions of interest. And hopefully, we’re going to be able to design the reporting system so that all queries are date-bound.

The tables involved look something like this:

reports

id (primary key)	date	…
1	2007-05-13 00:00:01	Yahoo games!
2	2007-05-13 00:00:03	Youtube video
3	2007-05-13 00:00:02	I just keep crashing :-(
…

(Index on date)

frames

report_id	frame_num	…
(primary key: report_id, frame_num)
1	0	0x0
1	1	nsCOMPtr_base::~nsCOMPtr_base()
…

If you don’t partition the table, getting the last report is a very fast operation:

breakpad=> EXPLAIN SELECT max(date) FROM reports;
QUERY PLAN                                                          
--------------
 Result  (cost=1.73..1.74 rows=1 width=0)
   InitPlan
     ->  Limit  (cost=0.00..1.73 rows=1 width=8)
           ->  Index Scan Backward using idx_reports_date on reports  (cost=0.00..1728388.93 rows=999873 width=8)
                 Filter: (date IS NOT NULL)

For the uninitiated, this means that we are doing an index scan of the index on date and returning the highest value.

However, when you partition the table, things get ugly very quickly:

breakpad=> EXPLAIN SELECT max(date) FROM reports;
QUERY PLAN
--------------
 Aggregate  (cost=186247.04..186247.05 rows=1 width=8)
   ->  Append  (cost=0.00..175344.43 rows=4361043 width=8)
         ->  Seq Scan on reports  (cost=0.00..10.20 rows=20 width=8)
         ->  Seq Scan on reports_part0 reports  (cost=0.00..40209.73 rows=999873 width=8)
         ->  Seq Scan on reports_part1 reports  (cost=0.00..40205.75 rows=1000175 width=8)
         ->  Seq Scan on reports_part2 reports  (cost=0.00..40200.93 rows=1000093 width=8)
         ->  Seq Scan on reports_part3 reports  (cost=0.00..40197.31 rows=999731 width=8)
         ->  Seq Scan on reports_part4 reports  (cost=0.00..14510.31 rows=361131 width=8)
         ->  Seq Scan on reports_part5 reports  (cost=0.00..10.20 rows=20 width=8)

The query performs a full table scan of all the partitions, which is just about the worst result possible. Even if you don’t have any constraints or knowledge about the data in the date field, the query planner should be able to optimize the query to the following:

SELECT max(maxdate)
FROM
 (SELECT max(date) AS maxdate FROM reports_part0 UNION
  SELECT max(date) FROM reports_part1 UNION...
 ) AS partition_maxes;

This is at most one index query per partition, which is perfectly reasonable. If you add range constraints to the date field of each partition, this query can be optimized into a loop where you query the “latest” partition first and work backwards until you find a single value that is higher than the range of all the remaining partitions.
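The "latest partition first" loop described above might look something like the following. This is a rough Python sketch of the idea rather than anything the postgres planner actually does: the `Partition` type, its `upper_bound` field (standing in for the exclusive upper end of a partition's date range constraint), and the in-memory `dates` list (standing in for that partition's date index) are all assumptions made for illustration.

```python
from datetime import datetime
from typing import List, NamedTuple, Optional, Sequence, Tuple


class Partition(NamedTuple):
    name: str
    upper_bound: datetime      # exclusive upper end of the range constraint (assumed)
    dates: Sequence[datetime]  # stand-in for the partition's index on date


def latest_date(partitions: Sequence[Partition]) -> Tuple[Optional[datetime], List[str]]:
    """Return the maximum date and the names of the partitions that were probed."""
    best = None
    probed = []
    # Visit the partition with the latest range first and work backwards.
    for part in sorted(partitions, key=lambda p: p.upper_bound, reverse=True):
        # Every remaining partition only holds dates below part.upper_bound,
        # so once best reaches that bound nothing later can beat it.
        if best is not None and best >= part.upper_bound:
            break
        probed.append(part.name)
        candidate = max(part.dates, default=None)  # one "index lookup" per partition
        if candidate is not None and (best is None or candidate > best):
            best = candidate
    return best, probed
```

With non-overlapping ranges, the loop usually stops after the first non-empty partition; empty partitions at the end (such as one created in advance for next week) cost one probe each.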

But there are even more “gotchas” lurking in table partitioning. The query planner operates on queries before functions are called or bind parameters are substituted. This means that a SQL query which contains a constant can perform very differently than one containing a function:

breakpad=> EXPLAIN SELECT * FROM reports WHERE date < '2007-05-12 11:03' AND date > '2007-05-12 10:03' ORDER BY date DESC;
QUERY PLAN
--------------
 Sort
   Sort Key: public.reports.date
   ->  Result
         ->  Append
               ->  Seq Scan on reports
                     Filter: ((date < '2007-05-12 11:03:00'::timestamp without time zone) AND (date > '2007-05-12 10:03:00'::timestamp without time zone))
               ->  Index Scan using idx_reports_part0_date on reports_part0 reports
                     Index Cond: ((date < '2007-05-12 11:03:00'::timestamp without time zone) AND (date > '2007-05-12 10:03:00'::timestamp without time zone))

Because we have date constraints on the reports partitions, the planner is smart enough to know that only reports_part0 contains the data we’re looking for. But replace the literal dates with the equivalent functions, and the query planner has to search every partition:

breakpad=> EXPLAIN SELECT * FROM reports WHERE date < now() AND date > now() - interval '1 day' ORDER BY date DESC;
QUERY PLAN
---------------
 Sort
   Sort Key: public.reports.date
   ->  Result
         ->  Append
               ->  Seq Scan on reports
                     Filter: ((date < now()) AND (date > (now() - '1 day'::interval)))
               ->  Bitmap Heap Scan on reports_part0 reports
                     Recheck Cond: ((date < now()) AND (date > (now() - '1 day'::interval)))
                     ->  Bitmap Index Scan on idx_reports_part0_date
                           Index Cond: ((date < now()) AND (date > (now() - '1 day'::interval)))
               ->  Index Scan using idx_reports_part1_date on reports_part1 reports
                     Index Cond: ((date < now()) AND (date > (now() - '1 day'::interval)))
               ->  Index Scan using idx_reports_part2_date on reports_part2 reports
                     Index Cond: ((date < now()) AND (date > (now() - '1 day'::interval)))
               ->  Index Scan using idx_reports_part3_date on reports_part3 reports
                     Index Cond: ((date < now()) AND (date > (now() - '1 day'::interval)))
               ->  Index Scan using idx_reports_part4_date on reports_part4 reports
                     Index Cond: ((date < now()) AND (date > (now() - '1 day'::interval)))
               ->  Index Scan using idx_reports_part5_date on reports_part5 reports
                     Index Cond: ((date < now()) AND (date > (now() - '1 day'::interval)))
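To make the difference between the two plans concrete, here is a small Python sketch of the pruning decision. It is a simplified model, not the planner's real logic: the `PartitionRange` type and the idea of passing `None` for a bound whose value is not known at plan time (as with `now()`) are assumptions for illustration, and the unconstrained parent table, which shows up in both plans, is left out.

```python
from datetime import datetime
from typing import List, NamedTuple, Optional, Sequence


class PartitionRange(NamedTuple):
    name: str
    start: datetime  # inclusive lower end of the date constraint (assumed)
    end: datetime    # exclusive upper end of the date constraint (assumed)


def partitions_to_scan(
    partitions: Sequence[PartitionRange],
    after: Optional[datetime],
    before: Optional[datetime],
) -> List[str]:
    """Names of partitions that may hold rows with after < date < before.

    A bound of None models a value the planner cannot see at plan time,
    such as the result of now(); in that case nothing can be excluded.
    """
    if after is None or before is None:
        return [p.name for p in partitions]
    # Keep only partitions whose range overlaps the requested window.
    return [p.name for p in partitions if p.start < before and after < p.end]
```

With literal bounds the function returns only the partitions whose range overlaps the window; with unknown bounds it returns every partition, which mirrors the two plans shown above.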

Both of these missed optimizations are extremely problematic when dealing with partitioned tables in postgresql. The first, less common issue should be easy to fix, because it doesn’t require any constraint information. The second one is not so easy, because it would require the query planner to divide its work into a “pre-function/bindparam expansion” stage, which is cacheable, and a “post-function/bindparam expansion” stage, which is not very easy to cache.

We are going to try and work around the data-binding issue by issuing the queries from Socorro using literals instead of bound variables. This is not ideal because it requires the database to completely re-plan every query that is issued.
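As an illustration of the workaround, the client can compute the time window itself and put the resulting timestamps into the SQL text as literals. The following Python sketch is not Socorro's actual code: the function name and the one-day default window are assumptions, and building SQL by string formatting is only reasonable here because the values are datetime objects formatted by the function itself, never user-supplied strings.

```python
from datetime import datetime, timedelta

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def recent_reports_query(now: datetime, window: timedelta = timedelta(days=1)) -> str:
    """Build the date-bound reports query with literal timestamps.

    The caller passes in the current time, so the planner sees constants
    instead of now() and can exclude partitions outside the window.
    """
    if not isinstance(now, datetime):
        raise TypeError("now must be a datetime")
    start = now - window
    return (
        "SELECT * FROM reports "
        f"WHERE date < '{now.strftime(TIMESTAMP_FORMAT)}' "
        f"AND date > '{start.strftime(TIMESTAMP_FORMAT)}' "
        "ORDER BY date DESC"
    )
```

Every call produces different SQL text, which is exactly why the database has to re-plan each query.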

The moral of the story is simple: if you are planning on using database partitions, be sure you EXPLAIN all the queries you’re planning, with the literals or bound parameters that will actually be used in production. Be prepared to significantly rework your queries if they perform unexpected full table scans.

Posted in Mozilla | 5 Comments »

The Power of Code Review

Tuesday, April 10th, 2007

Reviewing code written by others is hard, and yet it is one of the fundamental aspects of the Mozilla code process. Matthew Gertner recently posted about how he instituted a review requirement as part of the AllPeers development process, and asks some interesting questions about how reviews should be done.

What are reviews for? The primary goal of code reviews is to ensure code correctness and quality:

“goal” review: is the issue being fixed actually a bug? Does the patch fix the fundamental problem? It is not uncommon for coders to provide a patch to fix a particular undesirable behavior, without understanding the relevant standards or compatibility requirements.
API/design review. Because APIs define the interactions between modules, they need special care. Review is especially important to keep APIs balanced and targeted, and not too specific or overdesigned.
Maintainability review. Code which is unreadable is impossible to maintain. If the reviewer has to ask questions about the purpose of a piece of code, then it is probably not documented well enough. The reviewer should enforce common style guidelines, with the help of automated tools when practical.
Security review. This is mostly a subset of design review. Reviewers should ensure that the design uses security concepts such as input sanitizers, wrappers, and other techniques. It may also be appropriate to do a detailed review of the code which is exposed to public content.
Integration review. Does this code work properly with other modules? Is it localized properly? Does it have server dependencies? Does it have user documentation?
Testing review. The best time to develop a comprehensive test suite for code is when it is first developed. Automated and manual tests should be developed for all code modules. These tests should not only test correct function, but also test error conditions and improper inputs which could happen during operation.

For the most part, reviewers are not responsible for ensuring correct code function: unit tests are much better suited to that task. What reviewers are responsible for is much more “social”, and typically does not require a detailed line-by-line analysis of the code to perform a review. In many cases, important parts of the review process should happen before a coder starts working on a patch, or after APIs are designed but before implementation.

There are some important side effects of the review process that are also beneficial:

More than one person knows every piece of code. Many Mozilla modules have grown a buddy system where two people know the code intimately. This is very helpful because it means that a single person going on vacation doesn’t imperil a release or schedule.
Reviewing is mentoring. New hackers who are not familiar with a project can be guided and improved through code review. Initially, this requires additional effort and patience from the reviewer. Code from inexperienced hackers deserves a much more detailed review.
A public review log is a great historical resource for new and experienced hackers alike. Following CVS blame and log back to bug numbers can give lots of valuable historical information.

There can’t really be general guidelines on how much time to spend reviewing. Some experienced hackers may spend up to 50% of their time doing reviews (I typically spend two days a week doing design and code reviews and various planning tasks). This can be hard, because coding feels much more productive than reviewing.

Posted in Mozilla | 1 Comment »

Holy is God! Holy and strong! Holy Immortal One, have mercy on us.

Friday, April 6th, 2007

My people, My people, what have I done to you? How have I offended you? Answer me!

I led you out of Egypt from slavery to freedom, but you have led your Savior, and nailed Him to a cross.

Hagios O Theos, Hagios ichyros,
Hagios athanatos eleison himas.
Holy is God, Holy and Strong,
Holy Immortal One, have mercy on us.

For forty years in safety, I led you through the desert, I fed you with my manna, I gave you your own land, but you have led your Savior, and nailed Him to a Cross.

Hagios O Theos, Hagios ichyros,
Hagios athanatos eleison himas.
Holy is God, Holy and Strong,
Holy Immortal One, have mercy on us.

O what more would you ask from me? I planted you, my vineyard, but sour grapes you gave me, and vinegar to drink, and you have pierced your Savior and pierced Him with a spear.

Hagios O Theos, Hagios ichyros,
Hagios athanatos eleison himas.
Holy is God, Holy and Strong,
Holy Immortal One, have mercy on us.

For you scourged your captors, their first born sons were taken, but you have taken scourges and brought them down on Me.