Yak Shaving
Wednesday, May 27th, 2015Yak shaving tends to be looked down on. I don’t necessarily see it that way. It can be a way to pay down technical debt, or learn a new skill. In many ways I consider it a sign of broad engineering skill if somebody is capable of solving a multi-part problem.
It started so innocently. My team has been working on unifying the Firefox Health Report and Telemetry data collection systems, and there was a bug that I thought I could knock off pretty easily: “FHR data migration: org.mozilla.crashes”. Below are the roadblocks, mishaps, and sideshows that resulted, and I’m not even done yet:
- Tryserver failure: crashes
- Constant crashes only on Linux opt builds. It turned out this was entirely my fault. The following is not a safe access pattern because of c++ temporary lifetimes:
nsCSubstringTuple str = str1 + str2; Fn(str);
- Backout #1: talos xperf failure
- After landing, the code was backed out because the xperf Talos test detected main-thread I/O. On desktop, this was a simple ordering problem: we always do that I/O during startup to initialize the crypto system; I just moved it slightly earlier in the startup sequence. Why are we initializing the crypto system? To generate a random number. Fixed this by whitelisting the I/O. This involved landing code to the separate Talos repo and then telling the main Firefox tree to use the new revision. Much thanks to Aaron Klotz for helping me figure out the right steps.
- Backout #2: test timeouts
- Test timeouts if the first test of a test run uses the PopupNotifications API. This wasn’t caught during initial try runs because it appeared to be a well-known random orange. I was apparently changing the startup sequence just enough to tickle a focus bug in the test harness. It so happened that the particular test which runs first depends on the e10s or non-e10s configuration, leading to some confusion about what was going on. Fortunately, I was able to reproduce this locally. Gavin Sharp and Neil Deakin helped get the test harness in order in bug 1138079.
- Local test failures on Linux
- I discovered that several xpcshell tests were failing locally on Linux which were working fine on tryserver. After some debugging, I discovered that the tests thought I wasn’t using Linux, because the cargo-culted test for Linux was let isLinux = ("@mozilla.org/gnome-gconf-service;1" in Cc). This means that if gconf is disabled at build time or not present at runtime, the tests will fail. I installed GConf2-devel and rebuilt my tree and things were much better.
- Incorrect failure case in the extension manager
- While debugging the test failures, I discovered an incorrect codepath in GMPProvider.jsm for clients which are not Windows, Mac, or Linux (Android and the non-Linux Unixes).
- Android performance regression
- The landing caused an Android startup performance regression, bug 1163049. On Android, we don’t initialize NSS during startup, and the earlier initialization of the addon manager caused us to generate random Sync IDs for addons. I first fixed this by using Math.random() instead of good crypto, but Richard Newman suggested that I just make Sync generation lazy. This appears to work and will land when there is an open tree.
- mach bootstrap on Fedora doesn’t work for Android
- As part of debugging the performance regression, I built Firefox for Android for the first time in several years. I discovered that mach bootstrap for Android isn’t implemented on Fedora Core. I manually installed packages until it built properly. I have a list of the packages I installed and I’ll file a bug to fix mach bootstrap when I get a chance.
- android build-tools not found
- A configure check for the android build-tools package failed. I still don’t understand exactly why this happened; it has something to do with a version that’s too new and unexpected. Nick Alexander pointed me at a patch on bug 1162000 which fixed this for me locally, but it’s not the “right” fix and so it’s not checked into the tree.
- Debugging on Android (jimdb)
- Binary debugging on Android turned out to be difficult. There are some great docs on the wiki, but those docs failed to mention that you have to pass the configure flag –enable-debug-symbols. After that, I discovered that pending breakpoints don’t work by default with Android debugging, and since I was debugging a startup issue that was a critical failure. I wrote an ask.mozilla.org question and got a custom patch which finally made debugging work. I also had to patch the implementation of DumpJSStack() so that it would print to logcat on Android; this is another bug that I’ll file later when I clean up my tree.
- Crash reporting broken on Mac
- I broke crash report submission on mac for some users. Crash report annotations were being truncated from unicode instead of converted from UTF8. Because JSON.stringify doesn’t produce ASCII, this was breaking crash reporting when we tried to parse the resulting data. This was an API bug that existed prior to the patch, but I should have caught it earlier. Shoutout to Ted Mielczarek for fixing this and adding automated tests!
- Semi-related weirdness: improving the startup performance of Pocket
- The Firefox Pocket integration caused a significant startup performance issue on some trees. The details are especially gnarly, but it seems that by reordering the initialization of the addon manager, I was able to turn a performance regression into a win by accident. Probably something to do with I/O wait, but it still feels like spooky action at a distance. Kudos to Joel Maher, Jared Wein and Gijs Kruitbosch for diving into this under time pressure.
Experiences like this are frustrating, but as long as it’s possible to keep the final goal in sight, fixing unrelated bugs along the way might be the best thing for everyone involved. It will certainly save context-switches from other experts to help out. And doing the Android build and debugging was a useful learning experience.
Perhaps, though, I’ll go back to my primary job of being a manager.