{"id":1047,"date":"2015-05-27T16:22:56","date_gmt":"2015-05-27T20:22:56","guid":{"rendered":"http:\/\/benjamin.smedbergs.us\/blog\/?p=1047"},"modified":"2019-12-19T11:03:39","modified_gmt":"2019-12-19T15:03:39","slug":"yak-shaving","status":"publish","type":"post","link":"http:\/\/benjamin.smedbergs.us\/blog\/2015-05-27\/yak-shaving\/","title":{"rendered":"Yak Shaving"},"content":{"rendered":"<p><a href=\"http:\/\/sethgodin.typepad.com\/seths_blog\/2005\/03\/dont_shave_that.html\">Yak shaving<\/a> tends to be looked down on. I don&#8217;t necessarily see it that way. It can be a way to pay down technical debt, or learn a new skill. In many ways I consider it a sign of broad engineering skill if somebody is capable of solving a multi-part problem.<\/p>\n<p>It started so innocently. My team has been working on unifying the Firefox Health Report and Telemetry data collection systems, and there was a bug that I thought I could knock off pretty easily: <a href=\"https:\/\/bugzilla.mozilla.org\/show_bug.cgi?id=1121013\">&#8220;FHR data migration: org.mozilla.crashes&#8221;<\/a>. Below are the roadblocks, mishaps, and sideshows that resulted, and I&#8217;m not even done yet:<\/p>\n<dl>\n<dt>Tryserver failure: crashes<\/p>\n<dd>Constant crashes only on Linux opt builds. It turned out this was entirely my fault. The following is not a safe access pattern because of c++ temporary lifetimes:<\/p>\n<pre>nsCSubstringTuple str = str1 + str2;\r\nFn(str);<\/pre>\n<dt>Backout #1: talos xperf failure<\/p>\n<dd>After landing, the code was backed out because the xperf Talos test detected main-thread I\/O. On desktop, this was a simple ordering problem: we always do that I\/O during startup to initialize the crypto system; I just moved it slightly earlier in the startup sequence. Why are we initializing the crypto system? To generate a random number. Fixed this by <a href=\"https:\/\/bug1140132.bugzilla.mozilla.org\/attachment.cgi?id=8598171\">whitelisting the I\/O<\/a>. This involved landing code to the separate Talos repo and then telling the main Firefox tree to use the new revision. Much thanks to Aaron Klotz for <a href=\"https:\/\/bugzilla.mozilla.org\/show_bug.cgi?id=1140132#c10\">helping me<\/a> figure out the right steps.<\/p>\n<dt>Backout #2: test timeouts<\/p>\n<dd>Test timeouts if the first test of a test run uses the PopupNotifications API. This wasn&#8217;t caught during initial try runs because it appeared to be a well-known random orange. I was apparently changing the startup sequence just enough to tickle a focus bug in the test harness. It so happened that the particular test which runs first depends on the e10s or non-e10s configuration, leading to some confusion about what was going on. Fortunately, I was able to reproduce this locally. Gavin Sharp and Neil Deakin helped get the test harness in order in <a href=\"https:\/\/bugzilla.mozilla.org\/show_bug.cgi?id=1138079\">bug 1138079<\/a>. <\/p>\n<dt>Local test failures on Linux<\/p>\n<dd>I discovered that several xpcshell tests were failing locally on Linux which were working fine on tryserver. After some debugging, I discovered that the tests thought I wasn&#8217;t using Linux, because the cargo-culted test for Linux was <tt>let isLinux = (\"@mozilla.org\/gnome-gconf-service;1\" in Cc)<\/tt>. This means that if gconf is disabled at build time or not present at runtime, the tests will fail. I installed GConf2-devel and rebuilt my tree and things were much better.<\/p>\n<dt>Incorrect failure case in the extension manager<\/p>\n<dd>While debugging the test failures, I discovered an <a href=\"https:\/\/bugzilla.mozilla.org\/show_bug.cgi?id=1167204\">incorrect codepath in GMPProvider.jsm<\/a> for clients which are not Windows, Mac, or Linux (Android and the non-Linux Unixes).<\/p>\n<dt>Android performance regression<\/p>\n<dd>The landing caused an Android startup performance regression, <a href=\"https:\/\/bugzilla.mozilla.org\/show_bug.cgi?id=1163049\">bug 1163049<\/a>. On Android, we don&#8217;t initialize NSS during startup, and the earlier initialization of the addon manager caused us to generate random Sync IDs for addons. I first fixed this by using Math.random() instead of good crypto, but Richard Newman suggested that I just make Sync generation lazy. This appears to work and will land when there is an open tree.<\/p>\n<dt>mach bootstrap on Fedora doesn&#8217;t work for Android<\/p>\n<dd>As part of debugging the performance regression, I built Firefox for Android for the first time in several years. I discovered that mach bootstrap for Android isn&#8217;t implemented on Fedora Core. I manually installed packages until it built properly. I have a list of the packages I installed and I&#8217;ll file a bug to fix mach bootstrap when I get a chance.<\/p>\n<dt>android build-tools not found<\/p>\n<dd>A configure check for the android build-tools package failed. I still don&#8217;t understand exactly why this happened; it has something to do with a version that&#8217;s too new and unexpected. Nick Alexander pointed me at a patch on <a href=\"https:\/\/bugzilla.mozilla.org\/show_bug.cgi?id=1162000\">bug 1162000<\/a> which fixed this for me locally, but it&#8217;s not the &#8220;right&#8221; fix and so it&#8217;s not checked into the tree.<\/p>\n<dt>Debugging on Android (jimdb)<\/p>\n<dd>Binary debugging on Android turned out to be difficult. There are some great docs <a href=\"https:\/\/wiki.mozilla.org\/Mobile\/Fennec\/Android\/GDB\">on the wiki<\/a>, but those docs failed to mention that you have to pass the configure flag &#8211;enable-debug-symbols. After that, I discovered that pending breakpoints don&#8217;t work by default with Android debugging, and since I was debugging a startup issue that was a critical failure. I wrote an <a href=\"https:\/\/ask.mozilla.org\/question\/1563\/how-do-i-set-an-early-breakpoint-in-firefox-for-android\/\">ask.mozilla.org question<\/a> and got a custom patch which finally made debugging work. I also had to patch the implementation of DumpJSStack() so that it would print to logcat on Android; this is another bug that I&#8217;ll file later when I clean up my tree.<\/p>\n<dt>Crash reporting broken on Mac<\/p>\n<dd>I <a href=\"https:\/\/bugzilla.mozilla.org\/show_bug.cgi?id=1166759\">broke crash report submission<\/a> on mac for some users. Crash report annotations were being truncated from unicode instead of converted from UTF8. Because JSON.stringify doesn&#8217;t produce ASCII, this was breaking crash reporting when we tried to parse the resulting data. This was an API bug that existed prior to the patch, but I should have caught it earlier. Shoutout to Ted Mielczarek for fixing this and adding automated tests!<\/p>\n<dt>Semi-related weirdness: improving the startup performance of Pocket<\/p>\n<dd>The Firefox Pocket integration caused a significant startup performance issue on some trees. The <a href=\"https:\/\/bugzilla.mozilla.org\/show_bug.cgi?id=1163512#c35\">details<\/a> are especially gnarly, but it seems that by reordering the initialization of the addon manager, I was able to turn a performance regression into a win by accident. Probably something to do with I\/O wait, but it still feels like spooky action at a distance. Kudos to Joel Maher, Jared Wein and Gijs Kruitbosch for diving into this under time pressure.<\/p>\n<\/dd>\n<p>Experiences like this are frustrating, but as long as it&#8217;s possible to keep the final goal in sight, fixing unrelated bugs along the way might be the best thing for everyone involved.  It will certainly save context-switches from other experts to help out. And doing the Android build and debugging was a useful learning experience. <\/p>\n<p>Perhaps, though, I&#8217;ll go back to my primary job of being a manager.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Yak shaving tends to be looked down on. I don&#8217;t necessarily see it that way. It can be a way to pay down technical debt, or learn a new skill. In many ways I consider it a sign of broad engineering skill if somebody is capable of solving a multi-part problem. It started so innocently. [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[2],"tags":[304],"class_list":["post-1047","post","type-post","status-publish","format-standard","hentry","category-mozilla","tag-yak-shaving"],"_links":{"self":[{"href":"http:\/\/benjamin.smedbergs.us\/blog\/wp-json\/wp\/v2\/posts\/1047","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/benjamin.smedbergs.us\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/benjamin.smedbergs.us\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/benjamin.smedbergs.us\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/benjamin.smedbergs.us\/blog\/wp-json\/wp\/v2\/comments?post=1047"}],"version-history":[{"count":6,"href":"http:\/\/benjamin.smedbergs.us\/blog\/wp-json\/wp\/v2\/posts\/1047\/revisions"}],"predecessor-version":[{"id":1053,"href":"http:\/\/benjamin.smedbergs.us\/blog\/wp-json\/wp\/v2\/posts\/1047\/revisions\/1053"}],"wp:attachment":[{"href":"http:\/\/benjamin.smedbergs.us\/blog\/wp-json\/wp\/v2\/media?parent=1047"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/benjamin.smedbergs.us\/blog\/wp-json\/wp\/v2\/categories?post=1047"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/benjamin.smedbergs.us\/blog\/wp-json\/wp\/v2\/tags?post=1047"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}