VCS Migration: The Hare and the Tortoise
Friday, January 26th, 2007What VCS would you rather have?
- The Tortoise: Bazaar (bzr)
- A version control system, which appears to have a careful and well-planned development effort. The latest release is numbered 0.14. The import from Mozilla CVS is working so far; unfortunately, at the current import rate it will take more than a month to complete. And we have concerns about it’s performance even after import is complete.
- The Hare: Mercurial (Hg)
- A version control system, which appears to have a fast-and-loose development style. The latest release is numbered 0.9.3. If it would run to completion, importing Mozilla CVS would take a couple days. Unfortunately, every time we’ve tried to do an import we’ve run into bugs or undocumented features in the import tool, or odd edge-cases in the Mozilla CVS tree.
I have been helping preed bring Mozilla into the world of distributed version control systems. It sucks.
Learning It All Over
At least for me this is a very uncomfortable experience. I’m not going to pretend I like CVS, but at least the usage model for CVS is straightforward and fairly consistent. For my projects I’ve been using SVN without problems. Using a distributed VCS is a mind-bending exercise in which familiar terms like “repository”, “branch”, and “merge” no longer mean what they meant.
The documentation for these tools is in general quite painful. Partly this is because of terminology, but I have discovered that a lot of the confusion is that there is not a single usage model. There may be an official repository which hackers “push” to (a CVS-like model). There may be a central repository maintained by the project owner who “pulls” changes from others. There may not be an official repository at all, with a bunch of mostly-equal peers.
I think that I have a general understanding of how this is supposed to work, and I have a couple projects in Mercurial trees now. But I haven’t had to resolve merge conflicts yet, so I don’t feel that I’ve done more than scratch the surface of how these tools are supposed to be used.
Importing Mozilla CVS
Importing Mozilla’s CVS repository is a large task. The entire Mozilla CVS repository, with all its branches, has over 1 million file revisions which can be represented as about 200,000 change sets.
One of the things we decided early on is that Mozilla wanted a prepackaged solution as much as possible. Mozilla developers’ collective expertise is writing a web browser, not hacking version control. Any time that developers have to spend patching/hacking their VCS is collective waste. We are willing to hire outside experts to solve problems if necessary, and would very much like to do so. Of course, before hiring somebody you have to pick a system.
Unfortunately, neither of the two candidates has reached a 1.0 release yet, and the tools which import a CVS repository into the new system are even less complete. The two candidate VCS systems seem to have very different design philosophies, which show up quickly when performing import operations.
The Mercurial->CVS importer is very hacky code. The import process proceeds fairly quickly, but every time we tried it we ran into errors. We have yet to do a complete import that contains anything but the trunk:
- Random CVS commit message character sets (patchset 565)
- The commit message in CVS are all sorts of character sets. Initially the importer assumed that they were UTF8 and failed when it encountered invalid UTF8. I discovered a hidden HGENCODING environment variable which allowed the import to proceed by ignoring unknown character sets.
- CVS backbranching (patchset 5155)
- In CVS, branching is not an atomic operation, and there are several old branches that were performed at multiple times. The importer did not know how to handle this situation. Solution: don’t import those branches.
- Commit messages containing cvsps-like output (patchset 47070)
- Several CVS commit messages contained output that could be parsed as cvsps output. The cvsps parser contained the hg-cvs-import tool is terrible. Solution: I replaced the parser with the cvsps parser code contained in bzr.
- AssertionError: failed to remove webshell/embed/ActiveX/tests/vbrowse/VBrowse.vbp from manifest (patchset 61380)
- This one is still undiagnosed.
Bazaar’s CVS importer, on the other hand, feels like a carefully designed tool, with modules and unit tests. Unlike the Hg importer, the bzr->CVS importer has dealt with all odd input successfully. However, importing the trunk and all branches looks like it will take more than a month on a fast machine.
I like the fact that Bazaar focuses on correctness first and then deal with performance. Bazaar provides several desirable features, such as partial pull and a SVN frontend. But, until we can actually complete an import from CVS we don’t know whether bzr will perform acceptably. We have asked the bzr developers for assistance; we’ll see what happens.
To help speed the process, I created a tool to post-process cvsps output and limit it to branches which are still actively being developed. This should help reduce import times somewhat, as well as solving the backbranching issue which broke the Hg importer.
Other Issues
In addition to the import issues, there are other unsolved problems with deploying a distributed VCS system. We need to be able to control commit (aka “push”) access to the repository using our existing LDAP/ssh infrastructure. In addition, we want to be able to log who committed which changesets to the main repository: because a user can push other people’s commits, this is not as easy as it sounds, and I haven’t found a builtin log which saves this information.
Conclusion
Currently, we don’t have enough information to choose between Mercurial and Bazaar for a VCS, because we have not been able to complete a CVS import of either system. Right now we’re leaning heavily toward Hg, because of speed issues and because we have managed to get a trunk-only import kinda working. But the entire process is much more complicated than any of us would like, and is already turning into a major time-sucking adventure. Hopefully we can pick a system soon, and contract out the remaining work to developers who have done this before, or at least know the tools extremely well.