Yak Shaving

Yak shaving tends to be looked down on. I don’t necessarily see it that way. It can be a way to pay down technical debt, or learn a new skill. In many ways I consider it a sign of broad engineering skill if somebody is capable of solving a multi-part problem.

It started so innocently. My team has been working on unifying the Firefox Health Report and Telemetry data collection systems, and there was a bug that I thought I could knock off pretty easily: “FHR data migration: org.mozilla.crashes”. Below are the roadblocks, mishaps, and sideshows that resulted, and I’m not even done yet:

Tryserver failure: crashes

Constant crashes only on Linux opt builds. It turned out this was entirely my fault. The following is not a safe access pattern because of c++ temporary lifetimes:

nsCSubstringTuple str = str1 + str2;
Fn(str);
Backout #1: talos xperf failure

After landing, the code was backed out because the xperf Talos test detected main-thread I/O. On desktop, this was a simple ordering problem: we always do that I/O during startup to initialize the crypto system; I just moved it slightly earlier in the startup sequence. Why are we initializing the crypto system? To generate a random number. Fixed this by whitelisting the I/O. This involved landing code to the separate Talos repo and then telling the main Firefox tree to use the new revision. Much thanks to Aaron Klotz for helping me figure out the right steps.

Backout #2: test timeouts

Test timeouts if the first test of a test run uses the PopupNotifications API. This wasn’t caught during initial try runs because it appeared to be a well-known random orange. I was apparently changing the startup sequence just enough to tickle a focus bug in the test harness. It so happened that the particular test which runs first depends on the e10s or non-e10s configuration, leading to some confusion about what was going on. Fortunately, I was able to reproduce this locally. Gavin Sharp and Neil Deakin helped get the test harness in order in bug 1138079.

Local test failures on Linux

I discovered that several xpcshell tests were failing locally on Linux which were working fine on tryserver. After some debugging, I discovered that the tests thought I wasn’t using Linux, because the cargo-culted test for Linux was let isLinux = ("@mozilla.org/gnome-gconf-service;1" in Cc). This means that if gconf is disabled at build time or not present at runtime, the tests will fail. I installed GConf2-devel and rebuilt my tree and things were much better.

Incorrect failure case in the extension manager

While debugging the test failures, I discovered an incorrect codepath in GMPProvider.jsm for clients which are not Windows, Mac, or Linux (Android and the non-Linux Unixes).

Android performance regression

The landing caused an Android startup performance regression, bug 1163049. On Android, we don’t initialize NSS during startup, and the earlier initialization of the addon manager caused us to generate random Sync IDs for addons. I first fixed this by using Math.random() instead of good crypto, but Richard Newman suggested that I just make Sync generation lazy. This appears to work and will land when there is an open tree.

mach bootstrap on Fedora doesn’t work for Android

As part of debugging the performance regression, I built Firefox for Android for the first time in several years. I discovered that mach bootstrap for Android isn’t implemented on Fedora Core. I manually installed packages until it built properly. I have a list of the packages I installed and I’ll file a bug to fix mach bootstrap when I get a chance.

android build-tools not found

A configure check for the android build-tools package failed. I still don’t understand exactly why this happened; it has something to do with a version that’s too new and unexpected. Nick Alexander pointed me at a patch on bug 1162000 which fixed this for me locally, but it’s not the “right” fix and so it’s not checked into the tree.

Debugging on Android (jimdb)

Binary debugging on Android turned out to be difficult. There are some great docs on the wiki, but those docs failed to mention that you have to pass the configure flag –enable-debug-symbols. After that, I discovered that pending breakpoints don’t work by default with Android debugging, and since I was debugging a startup issue that was a critical failure. I wrote an ask.mozilla.org question and got a custom patch which finally made debugging work. I also had to patch the implementation of DumpJSStack() so that it would print to logcat on Android; this is another bug that I’ll file later when I clean up my tree.

Crash reporting broken on Mac

I broke crash report submission on mac for some users. Crash report annotations were being truncated from unicode instead of converted from UTF8. Because JSON.stringify doesn’t produce ASCII, this was breaking crash reporting when we tried to parse the resulting data. This was an API bug that existed prior to the patch, but I should have caught it earlier. Shoutout to Ted Mielczarek for fixing this and adding automated tests!

Semi-related weirdness: improving the startup performance of Pocket

The Firefox Pocket integration caused a significant startup performance issue on some trees. The details are especially gnarly, but it seems that by reordering the initialization of the addon manager, I was able to turn a performance regression into a win by accident. Probably something to do with I/O wait, but it still feels like spooky action at a distance. Kudos to Joel Maher, Jared Wein and Gijs Kruitbosch for diving into this under time pressure.

Experiences like this are frustrating, but as long as it’s possible to keep the final goal in sight, fixing unrelated bugs along the way might be the best thing for everyone involved. It will certainly save context-switches from other experts to help out. And doing the Android build and debugging was a useful learning experience.

Perhaps, though, I’ll go back to my primary job of being a manager.

Atom Feed for Comments 2 Responses to “Yak Shaving”

  1. glandium Says:

    > you have to pass the configure flag –enable-debug-symbols

    That’s puzzling, because it’s supposed to be the default.

  2. Steve Fink Says:

    I think there is good yak shaving and bad yak shaving. Bad yak shaving is when you get endlessly distracted by artificial prerequisites to a task that you really really need to accomplish. One type of good yak shaving is when you set a goal — the specific goal may not even be relevant — and plow towards it, solving every problem that gets in the way, even if you could work around it. This can be amazingly useful when your codebase/development environment has gathered cruft and small breakages over time, and everyone is unconsciously using workarounds or simply doing without certain tools. That can build up a barrier to entry that none of the regulars is even aware of, and cutting one clean path through the mess provides a way for others to follow after. And perhaps trim back the hair a bit further on the way to accomplishing their own goals.

    Note that my descriptions of good and bad yak shaving aren’t entirely incompatible.