Archive for 2015

Yak Shaving

Wednesday, May 27th, 2015

Yak shaving tends to be looked down on. I don’t necessarily see it that way. It can be a way to pay down technical debt, or learn a new skill. In many ways I consider it a sign of broad engineering skill if somebody is capable of solving a multi-part problem.

It started so innocently. My team has been working on unifying the Firefox Health Report and Telemetry data collection systems, and there was a bug that I thought I could knock off pretty easily: “FHR data migration: org.mozilla.crashes”. Below are the roadblocks, mishaps, and sideshows that resulted, and I’m not even done yet:

Tryserver failure: crashes

Constant crashes only on Linux opt builds. It turned out this was entirely my fault. The following is not a safe access pattern because of c++ temporary lifetimes:

nsCSubstringTuple str = str1 + str2;
Fn(str);
Backout #1: talos xperf failure

After landing, the code was backed out because the xperf Talos test detected main-thread I/O. On desktop, this was a simple ordering problem: we always do that I/O during startup to initialize the crypto system; I just moved it slightly earlier in the startup sequence. Why are we initializing the crypto system? To generate a random number. Fixed this by whitelisting the I/O. This involved landing code to the separate Talos repo and then telling the main Firefox tree to use the new revision. Much thanks to Aaron Klotz for helping me figure out the right steps.

Backout #2: test timeouts

Test timeouts if the first test of a test run uses the PopupNotifications API. This wasn’t caught during initial try runs because it appeared to be a well-known random orange. I was apparently changing the startup sequence just enough to tickle a focus bug in the test harness. It so happened that the particular test which runs first depends on the e10s or non-e10s configuration, leading to some confusion about what was going on. Fortunately, I was able to reproduce this locally. Gavin Sharp and Neil Deakin helped get the test harness in order in bug 1138079.

Local test failures on Linux

I discovered that several xpcshell tests were failing locally on Linux which were working fine on tryserver. After some debugging, I discovered that the tests thought I wasn’t using Linux, because the cargo-culted test for Linux was let isLinux = ("@mozilla.org/gnome-gconf-service;1" in Cc). This means that if gconf is disabled at build time or not present at runtime, the tests will fail. I installed GConf2-devel and rebuilt my tree and things were much better.

Incorrect failure case in the extension manager

While debugging the test failures, I discovered an incorrect codepath in GMPProvider.jsm for clients which are not Windows, Mac, or Linux (Android and the non-Linux Unixes).

Android performance regression

The landing caused an Android startup performance regression, bug 1163049. On Android, we don’t initialize NSS during startup, and the earlier initialization of the addon manager caused us to generate random Sync IDs for addons. I first fixed this by using Math.random() instead of good crypto, but Richard Newman suggested that I just make Sync generation lazy. This appears to work and will land when there is an open tree.

mach bootstrap on Fedora doesn’t work for Android

As part of debugging the performance regression, I built Firefox for Android for the first time in several years. I discovered that mach bootstrap for Android isn’t implemented on Fedora Core. I manually installed packages until it built properly. I have a list of the packages I installed and I’ll file a bug to fix mach bootstrap when I get a chance.

android build-tools not found

A configure check for the android build-tools package failed. I still don’t understand exactly why this happened; it has something to do with a version that’s too new and unexpected. Nick Alexander pointed me at a patch on bug 1162000 which fixed this for me locally, but it’s not the “right” fix and so it’s not checked into the tree.

Debugging on Android (jimdb)

Binary debugging on Android turned out to be difficult. There are some great docs on the wiki, but those docs failed to mention that you have to pass the configure flag –enable-debug-symbols. After that, I discovered that pending breakpoints don’t work by default with Android debugging, and since I was debugging a startup issue that was a critical failure. I wrote an ask.mozilla.org question and got a custom patch which finally made debugging work. I also had to patch the implementation of DumpJSStack() so that it would print to logcat on Android; this is another bug that I’ll file later when I clean up my tree.

Crash reporting broken on Mac

I broke crash report submission on mac for some users. Crash report annotations were being truncated from unicode instead of converted from UTF8. Because JSON.stringify doesn’t produce ASCII, this was breaking crash reporting when we tried to parse the resulting data. This was an API bug that existed prior to the patch, but I should have caught it earlier. Shoutout to Ted Mielczarek for fixing this and adding automated tests!

Semi-related weirdness: improving the startup performance of Pocket

The Firefox Pocket integration caused a significant startup performance issue on some trees. The details are especially gnarly, but it seems that by reordering the initialization of the addon manager, I was able to turn a performance regression into a win by accident. Probably something to do with I/O wait, but it still feels like spooky action at a distance. Kudos to Joel Maher, Jared Wein and Gijs Kruitbosch for diving into this under time pressure.

Experiences like this are frustrating, but as long as it’s possible to keep the final goal in sight, fixing unrelated bugs along the way might be the best thing for everyone involved. It will certainly save context-switches from other experts to help out. And doing the Android build and debugging was a useful learning experience.

Perhaps, though, I’ll go back to my primary job of being a manager.

Hiring at Mozilla: Beyond Resumés and Interview Panels

Monday, May 11th, 2015

The standard tech hiring process is not good at selecting the best candidates, and introduces unconscious bias into the hiring process. The traditional resume screen, phone screen, and interview process is almost a dice-roll for a hiring manager. This year, my team has several open positions and we’re trying something different, both in the pre-interview screening process and in the interview process itself.

Hiring Firefox Platform Engineers now!

Earlier this year I attended a workshop for Mozilla managers by the Clayman Institute at Stanford. One of the key lessons is that when we (humans) don’t have clear criteria for making a choice, we tend alter our criteria to match subconscious preferences (see this article for some examples and more information). Another key lesson is that when humans lack information about a situation, our brain uses its subconscious associations to fill in the gaps.

Candidate Screening

I believe job descriptions are very important: not only do they help candidates decide whether they want a particular job, but they also serve as a guide to the kinds of things that will be required or important during the interview process. Please read the job description carefully before applying to any job!

In order to hire more fairly, I have changed the way I write job descriptions. Previously I mixed up job responsibilities and applicant requirements in one big bulleted list. Certainly every engineer on my team is going to eventually use C++ and JavaScript, and probably Python, and in the future Rust. But it isn’t a requirement that you know all of these coming into a job, especially as a junior engineer. It’s part of the job to work on a high-profile open-source project in a public setting. But that doesn’t mean you must have prior open-source experience. By separating out the job expectations and the applicant requirements, I was able to create a much clearer set of screening rules for incoming applications, and also clearer expectations for candidates.

Resumés are a poor tool for ranking candidates and deciding which candidates are worth the investment in a phone screen or interview. Resumés give facts about education or prior experience, but rarely make it clear whether somebody is an excellent engineer. To combat this, my team won’t be using only resumés as a screening tool. If a candidate matches basic criteria, such as living in a reasonable time zone and having demonstrated expertise in C++, JavaScript, or Python on their resumé or code samples, we will ask each candidate to submit a short written essay (like a blog post) describing their favorite debugging or profiling tool.

Why did I pick an essay about a debugging or profiling tool? In my experience, every good coder has a toolbox, and as coders gain experience they are naturally better toolsmiths. I hope that this essay requirement will be good way to screen for programmer competence and to gauge expertise.

With resumés, essays, and code samples in hand, Vladan and I will go through the applications and filter the applications. Each passing candidate will proceed to phone screens, to check for technical skill but more importantly to sell the candidate on the team and match them up with the best position. My goal is to exclude applications that don’t meet the requirements, not to rank candidates against each other. If there are too many qualified applicants, we will select a random sample for interviews. In order to make this possible, we will be evaluating applications in weekly batches.

Interview Process

To the extent possible, the interview format should line up with the job requirements. The typical Mozilla technical interview is five or six 45-minute 1:1 interview sessions. This format heavily favors people who can think quickly on their feet and who are personable. Since neither of those attributes is a requirement for this job, that format is a poor match. Here are the requirements in the job description that we need to evaluate during the interview:

  • Experience writing code. A college degree is not necessary or sufficient.
  • Expertise in any of C++, JavaScript, or Python.
  • Ability to learn new skills and solve unfamiliar problems effectively.
  • Experience debugging or profiling.
  • Good written and verbal communication skills.
  • Candidates must be located in North or South America, Europe, Africa, or the Middle East.

This is the interview format that we came up with to assess the requirements:

  • A 15-minute prepared presentation on a topic related to the candidate’s prior experience and expertise. This will be done in front of a small group. A 30-minute question and answer session will follow. Assesses experience writing code and verbal communication skills.
  • A two-hour mentoring session with two engineers from the team. The candidate will be working in a language they already know (C++/JS/Python), but will be solving an unfamiliar problem. Assesses experience writing code, language expertise, and ability to solve unfamiliar problems.
  • A 45-minute 1:1 technical interview. This will assess some particular aspect of the candidate’s prior experience with technical questions, especially if that experience is related to optional/desired skills in the job description. Assesses specialist or general expertise and verbal communication.
  • A 45-minute 1:1 interview with the hiring manager. This covers a wide range of topics including work location and hours, expectations about seniority, and to answer questions that the candidate has about Mozilla, the team, or the specific role they are interviewing for. Assesses candidate location and desire to be part of the team.

During the debrief and decision process, I intend to focus as much as possible on the job requirements. Rather than asking a simple “should we hire this person” question, I will ask interviewers to rate the candidate on each job requirement and responsibility, as well as any desired skillset. By orienting the feedback to the job description I hope that we can reduce the effects of unconscious bias and improve the overall hiring process.

Conclusion

This hiring procedure is experimental. My team and I have concerns about whether candidates will be put off by the essay requirement or an unusual interview format, and whether plagiarism will make the essay an ineffective screening tool. We’re concerned about keeping the hiring process moving and not introducing too much delay. After the first interview rounds, I plan on evaluating the process, and ask candidates to provide feedback about their experience.

If you’re interested, check out my prior post, How I Hire At Mozilla.

Using crash-stats-api-magic

Monday, April 20th, 2015

A while back, I wrote the tool crash-stats-api-magic which allows custom processing of results from the crash-stats API. This tool is not user-friendly, but it can be used to answer some pretty complicated questions.

As an example and demonstration, see a bug that Matthew Gregan filed this morning asking for a custom report from crash-stats:

In trying to debug bug 1135562, it’s hard to guess the severity of the problem or look for any type of version/etc. correlation because there are many types of hangs caught under the same mozilla::MediaShutdownManager::Shutdown stack. I’d like a report that contains only those with mozilla::MediaShutdownManager::Shutdown in the hung (main thread) stack *and* has wasapi_stream_init on one of the other threads, please.

To build this report, start with a basic query and then refine it in the tool:

  1. Construct a supersearch query to select the crashes we’re interested in. The only criteria for this query was “signature contains ‘MediaShutdownManager::Shutdown`. When possible, filter on channel, OS, and version to reduce noise.
  2. After the supersearch query is constructed, choose “More Options” from the results page and copy the “Public API URL” link.
  3. Load crash-stats-api-magic and paste the query URL. Choose “Fetch” to fetch the results. Look through the raw data to get a sense for its structure. Link
  4. The meat of this function is to filter out the crashes that don’t have “wasapi_stream_init” on a thread. Choose “New Rule” and create a filter rule:
    function(d) {
      var ok = false;
      d.json_dump.threads.forEach(function(thread) {
        thread.frames.forEach(function(frame) {
          if (frame.function && frame.function.indexOf("wasapi_stream_init") != -1) {
            ok = true;
          }
        });
      });
      return ok;
    }

    Choose “Execute” to run the filter. Link

  5. To get the final report we output only the signature and the crash ID for each result. Choose “New Rule” again and create a mapping rule:
    function(d) {
      return [d.uuid, d.signature];
    }

    Link

One of the advantages of this tool is that it is possible to iterate quickly on the data without constantly re-querying, but at the end it should be possible to permalink to the results in bugzilla or email exchanges.

If you need to do complex crash-stats analysis, please try it out! email me if you have questions, and pull requests are welcome.

Gratitude Comes in Threes

Tuesday, February 17th, 2015

Today Johnathan Nightingale announced his departure from Mozilla. There are three special people at Mozilla who shaped me into the person I am today, and Johnathan Nightingale is one of them:

Mike Shaver taught me how to be an engineer. I was a full-time musician who happened to be pretty good at writing code and volunteering for Mozilla. There were many people at Mozilla who helped teach me the fine points of programming, and techniques for being a good programmer, but it was shaver who taught me the art of software engineering: to focus on simplicity, to keep the ultimate goal always in mind, when to compromise in order to ship, and when to spend the time to make something impossibly great. Shaver was never my manager, but I credit him with a lot of my engineering success. Shaver left Mozilla a while back to do great things at Facebook, and I still miss him.

Mike Beltzner taught me to care about users. Beltzner was never my manager either, but his single-minded and sometimes pugnacious focus on users and the user experience taught me how to care about users and how to engineer products that people might actually want to use. It’s easy for an engineer to get caught up in the most perfect technology and forget why we’re building any of this at all. Or to get caught up trying to change the world, and forget that you can’t change the world without a great product. Beltzner left Mozilla a while back and is now doing great things at Pinterest.

Perhaps it is just today talking, but I will miss Johnathan Nightingale most of all. He taught me many things, but mostly how to be a leader. I have had the privilege of reporting to Johnathan for several years now. He taught me the nuances of leadership and management; how to support and grow my team and still be comfortable applying my own expertise and leadership. He has been a great and empowering leader, both for me personally and for Firefox as a whole. He also taught me how to edit my own writing and others, and especially never to bury the lede. Now Johnathan will also be leaving Mozilla, and undoubtedly doing great things on his next adventure.

It doesn’t seem coincidental that triumverate were all Torontonians. Early Toronto Mozillians, including my three mentors, built a culture of teaching, leading, mentoring, and Mozilla is better because of it. My new boss isn’t in Toronto, so it’s likely that I will be traveling there less. But I still hold a special place in my heart for it and hope that Mozilla Toronto will continue to serve as a model of mentoring and leadership for Mozilla.

Now I’m a senior leader at Mozilla. Now it’s my job to mentor, teach, and empower Mozilla’s leaders. I hope that I can be nearly as good at it as these wonderful Mozillians have been for me.