VCS Migration: The Hare and the Tortoise
What VCS would you rather have?
- The Tortoise: Bazaar (bzr)
- A version control system that appears to have a careful and well-planned development effort. The latest release is numbered 0.14. The import from Mozilla CVS is working so far; unfortunately, at the current import rate it will take more than a month to complete. And we have concerns about its performance even after the import is complete.
- The Hare: Mercurial (Hg)
- A version control system that appears to have a fast-and-loose development style. The latest release is numbered 0.9.3. If it would run to completion, importing Mozilla CVS would take a couple of days. Unfortunately, every time we’ve tried to do an import we’ve run into bugs or undocumented features in the import tool, or odd edge cases in the Mozilla CVS tree.
I have been helping preed bring Mozilla into the world of distributed version control systems. It sucks.
Learning It All Over
At least for me this is a very uncomfortable experience. I’m not going to pretend I like CVS, but at least the usage model for CVS is straightforward and fairly consistent. For my projects I’ve been using SVN without problems. Using a distributed VCS is a mind-bending exercise in which familiar terms like “repository”, “branch”, and “merge” no longer mean what they meant.
The documentation for these tools is in general quite painful. Partly this is because of terminology, but I have discovered that a lot of the confusion comes from the fact that there is no single usage model. There may be an official repository which hackers “push” to (a CVS-like model). There may be a central repository maintained by the project owner who “pulls” changes from others. There may not be an official repository at all, just a bunch of mostly-equal peers.
I think that I have a general understanding of how this is supposed to work, and I have a couple projects in Mercurial trees now. But I haven’t had to resolve merge conflicts yet, so I don’t feel that I’ve done more than scratch the surface of how these tools are supposed to be used.
Importing Mozilla CVS
Importing Mozilla’s CVS repository is a large task. The entire Mozilla CVS repository, with all its branches, has over 1 million file revisions which can be represented as about 200,000 change sets.
One of the things we decided early on is that Mozilla wanted a prepackaged solution as much as possible. Mozilla developers’ collective expertise is writing a web browser, not hacking version control. Any time that developers have to spend patching/hacking their VCS is collective waste. We are willing to hire outside experts to solve problems if necessary, and would very much like to do so. Of course, before hiring somebody you have to pick a system.
Unfortunately, neither of the two candidates has reached a 1.0 release yet, and the tools which import a CVS repository into the new system are even less complete. The two candidate VCS systems seem to have very different design philosophies, which show up quickly when performing import operations.
The CVS->Mercurial importer is very hacky code. The import process proceeds fairly quickly, but every time we tried it we ran into errors, and we have yet to do a complete import that contains anything but the trunk. The errors so far:
- Random CVS commit message character sets (patchset 565)
- The commit messages in CVS use all sorts of character sets. Initially the importer assumed that they were UTF-8 and failed when it encountered invalid UTF-8. I discovered a hidden HGENCODING environment variable which allowed the import to proceed by ignoring unknown character sets.
- CVS backbranching (patchset 5155)
- In CVS, branching is not an atomic operation, and there are several old branches that were created in multiple steps at different times. The importer did not know how to handle this situation. Solution: don’t import those branches.
- Commit messages containing cvsps-like output (patchset 47070)
- Several CVS commit messages contained output that could be parsed as cvsps output. The cvsps parser contained in the hg-cvs-import tool is terrible. Solution: I replaced the parser with the cvsps parser code contained in bzr.
- AssertionError: failed to remove webshell/embed/ActiveX/tests/vbrowse/VBrowse.vbp from manifest (patchset 61380)
- This one is still undiagnosed.
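The character set problem in the first item can be illustrated with a small sketch. This is not what the Hg importer or HGENCODING actually does; it is only a minimal example of decoding commit messages with a list of fallback encodings instead of assuming UTF-8. The function name and the choice of fallback encodings are assumptions made for illustration.

```python
def decode_commit_message(raw: bytes, candidates=("utf-8", "cp1252")) -> str:
    """Decode a raw commit message, trying each candidate encoding in order.

    The candidate list is a hypothetical default, not taken from any importer.
    """
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: keep the import moving and mark the undecodable bytes
    # with U+FFFD rather than aborting the whole run.
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    print(decode_commit_message(b"caf\xc3\xa9"))  # valid UTF-8
    print(decode_commit_message(b"caf\xe9"))      # not UTF-8, valid cp1252
```

The trade-off is that a wrong guess produces readable but incorrect text, which may or may not be acceptable for twenty-year-old commit messages.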
Bazaar’s CVS importer, on the other hand, feels like a carefully designed tool, with modules and unit tests. Unlike the Hg importer, the CVS->bzr importer has dealt with all odd input successfully. However, importing the trunk and all branches looks like it will take more than a month on a fast machine.
I like the fact that Bazaar focuses on correctness first and then deals with performance. Bazaar provides several desirable features, such as partial pull and an SVN frontend. But until we can actually complete an import from CVS, we don’t know whether bzr will perform acceptably. We have asked the bzr developers for assistance; we’ll see what happens.
To help speed the process, I created a tool to post-process cvsps output and limit it to branches which are still actively being developed. This should help reduce import times somewhat, as well as solving the backbranching issue which broke the Hg importer.
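A rough sketch of that kind of post-processing, not the actual tool: it assumes cvsps text output in which each patchset begins with a line of dashes followed by a "PatchSet N" line and carries a "Branch:" header before its "Log:" section. The function names and the set of active branches are made up for illustration. Note that a commit message which itself contains cvsps-like output would fool a splitter this naive, which is exactly the problem described for patchset 47070.

```python
import re

SEPARATOR = re.compile(r"^-{5,}\s*$")
PATCHSET_START = re.compile(r"^PatchSet \d+\s*$")
BRANCH_LINE = re.compile(r"^Branch: (\S+)\s*$")


def split_patchsets(text):
    """Split cvsps-style text into blocks, one per patchset (separator included)."""
    lines = text.splitlines(keepends=True)
    # A patchset starts where a dashes line is directly followed by "PatchSet N".
    starts = [
        i for i in range(1, len(lines))
        if SEPARATOR.match(lines[i - 1]) and PATCHSET_START.match(lines[i])
    ]
    blocks = []
    for n, start in enumerate(starts):
        end = starts[n + 1] - 1 if n + 1 < len(starts) else len(lines)
        blocks.append(lines[start - 1:end])
    return blocks


def branch_of(block):
    """Return the branch named in the patchset header, or None if absent."""
    for line in block:
        match = BRANCH_LINE.match(line)
        if match:
            return match.group(1)
        if line.startswith("Log:"):
            break  # header is over; do not scan the commit message
    return None


def filter_branches(text, active_branches):
    """Keep only the patchsets whose branch is in active_branches."""
    kept = [b for b in split_patchsets(text) if branch_of(b) in active_branches]
    return "".join("".join(block) for block in kept)
```

Calling `filter_branches(cvsps_text, {"HEAD"})` would keep only trunk patchsets; dropping the old branches shrinks the import and sidesteps the backbranching cases at the same time.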
Other Issues
In addition to the import issues, there are other unsolved problems with deploying a distributed VCS system. We need to be able to control commit (aka “push”) access to the repository using our existing LDAP/ssh infrastructure. In addition, we want to be able to log who committed which changesets to the main repository: because a user can push other people’s commits, this is not as easy as it sounds, and I haven’t found a builtin log which saves this information.
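To make the logging requirement concrete, here is a minimal sketch of a server-side push log. It is not a feature of Mercurial or Bazaar. It assumes that some server-side mechanism (for example a hook run on each push) can supply the authenticated user name and the list of incoming changeset ids; the table layout and function names are hypothetical.

```python
import sqlite3
import time


def open_push_log(path=":memory:"):
    """Open (or create) the push log database."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS pushlog ("
        " pusher TEXT NOT NULL,"
        " changeset TEXT NOT NULL,"
        " pushed_at REAL NOT NULL)"
    )
    return db


def record_push(db, pusher, changesets, now=None):
    """Record that `pusher` delivered `changesets`, whoever authored them."""
    stamp = time.time() if now is None else now
    with db:  # commit all rows of one push together
        db.executemany(
            "INSERT INTO pushlog (pusher, changeset, pushed_at) VALUES (?, ?, ?)",
            [(pusher, changeset, stamp) for changeset in changesets],
        )


def pushed_by(db, changeset):
    """Return who first pushed a changeset, or None if it was never logged."""
    row = db.execute(
        "SELECT pusher FROM pushlog WHERE changeset = ?"
        " ORDER BY pushed_at LIMIT 1",
        (changeset,),
    ).fetchone()
    return row[0] if row else None
```

The point of keeping this separate from the changeset metadata is that the author field travels with the commit, while the pusher is only known to the server at the moment of the push.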
Conclusion
Currently, we don’t have enough information to choose between Mercurial and Bazaar for a VCS, because we have not been able to complete a CVS import into either system. Right now we’re leaning heavily toward Hg, because of speed issues and because we have managed to get a trunk-only import kinda working. But the entire process is much more complicated than any of us would like, and is already turning into a major time-sucking adventure. Hopefully we can pick a system soon, and contract out the remaining work to developers who have done this before, or at least know the tools extremely well.
January 26th, 2007 at 5:09 pm
Isn’t there a way to use Subversion for a central repository and let all the freedom loving hackers use distributed frontends to it? A best of both worlds sort of situation?
I’ve always just used SVN or CVS myself, so I could be way off base.
January 26th, 2007 at 6:42 pm
Indeed, I think the best solution would be to use SVN and let those folks that like distributed systems use git-svn, SVK, bzr’s svn plugin, etc.
January 26th, 2007 at 6:55 pm
Check out the git archives. Importing Mozilla into git was discussed extensively there before Mozilla chose not to use git. After you get an import to finish, carefully check how the branch bases and merge points are detected. The best tool for importing Mozilla is cvs2svn, but it does not detect branch bases and merge points correctly. This is discussed in the git archive.
If the branches are not correct when you load Mozilla into the git visualizer it looks like a bowl of spaghetti instead of something a human would have made. These spaghetti branches do produce the right output in SVN but they are impossible to understand. If you then look at the crazy branches by hand you will be able to figure out alternate bases that make sense. Several algorithms for fixing this in cvs2svn were discussed but I am not sure if any got implemented.
Using a modified cvs2svn front end and the git-fastimport tool I was able to import Mozilla CVS to git on my 2.5GB machine in under two hours. The resulting repository was about 450MB, a 10:1 compression from the 4GB of CVS files.
January 27th, 2007 at 3:47 am
Any idea why Bzr’s import is so slow?
January 27th, 2007 at 4:34 am
Hi, CVS is not the easiest system to import from. Have you tried importing into svn first and importing from that? The cvs2svn tool is quite mature and the svn repository more closely matches next gen version control systems, including support for e.g. changesets. In other words, an svn repository is a far better starting point for importing into any of the next gen tools than a CVS repository. If all else fails, svk is a nice way to layer distributed version management on top of svn. It has the nice feature that you can mix its use with regular svn use.
BTW, a problem with most next gen version management systems is that the tooling sucks. Most of them are only available as command line tools. CVS and SVN have a wealth of tooling available that to the best of my knowledge is unmatched by any other open source version management system. This includes things like viewcvs/viewsvn, build integration servers, statistics gathering tools, IDE integration, non command-line clients, integration into bug tracking systems, etc. Mozilla has historically been quite innovative with respect to tooling, so clearly that matters to mozilla developers. Also, in my experience tooling for svn is typically as good as or better than that for cvs.
January 27th, 2007 at 5:00 am
As you said, ‘Bazaar focuses on correctness first and then deals with performance’. There’s a major performance drive going on, with many recent versions being a lot faster than the previous ones. Wait a little and see.
January 27th, 2007 at 11:08 am
A very interesting read. To be honest though, the tone of your post makes me think that there is an unrealistic deadline lurking somewhere – a migration project like this will naturally be pretty complex and time-consuming to sort out. As Jonas says, Bazaar is still being optimized, so if you think it’s a better fit than Mercurial then perhaps it would be better to pause, rather than moving the development infrastructure to your second choice of VCS in order to meet an arbitrary deadline.
January 27th, 2007 at 11:52 am
Do note that any tool that imports from CVS primarily using timestamps is going to get a lot of things wrong. cvs2svn was rewritten to use the dependencies in the ,v files instead of the timestamps. The dependencies are always right, they have to be.
cvsps can not handle the Mozilla repo and it is too much trouble to fix it.
January 27th, 2007 at 12:22 pm
I’d second the recommendations to go via SVN first – GNOME recently did this and you should learn from their experience. Beyond that, did you look at monotone? Hg does look slightly more comprehensible though.
January 28th, 2007 at 1:14 am
Coming from a totally non-programming background, I see some similarities here to bench research (ie fast vs correct). Our biggest expense is human costs, and doing something fast and then having to spend man-hours fixing the mistakes is more often than not more expensive than doing it the slower way. This is similar to what Stuart Ellis says. If there’s a tool that does it right, then it’s better to get it right the first time (even if you have to wait a bit longer) than to have to redo it again because Hg screwed up.
Can bzr reimport changes that have happened after the first import? If so, this would let you make the initial import in the background w/ little intervention while hacking continues on CVS (eg not much manpower expended on the import, esp if the initial import is focused on the active branches). And then once the original import is completed, changes that occurred in CVS during that time (as well as obsoleted branches) can be imported.
January 30th, 2007 at 12:15 pm
Just a question. Have you seen http://www.darcs.net/ as an option? A friend told me about it a few weeks ago and I am going to install and test it in my own projects. Maybe it is not enough for the big and complex mozilla tree, but maybe it is a possible solution.
January 30th, 2007 at 2:42 pm
You really should have a look at other ->hg converters (contrib/convert-repo in mercurial and fromcvs/tohg which is used by BSD people).
January 30th, 2007 at 2:52 pm
By the way, what you are really criticizing are the converters, which (at least in the case of Hg) are very distinct from the scm itself. So I don’t really agree with the “hare” and “turtle” comparison. For me the main difference is that hg has always focused on speed and was done by kernel people, whereas bzr does a lot more things (different transports, lightweight layout) and only puts the emphasis on speed afterwards.
February 5th, 2007 at 3:30 am
James: Graydon Hoare works for Mozilla. We’ve looked at mtn and are using it happily in small-scale projects (e.g. the reference implementation in SML for JS2 AKA ECMAScript 4). Monotone’s solid but we would have to deal with the same problem git has: Windows performance sucking due to cygwin. We need a VCS that works well on Windows now, and that will continue to be supported on Windows. This is outside of git’s scope, and for mtn it would require us to clone Graydon (again; last time used up all our cloning budget ;-)).
Mariano: Darcs is interesting but no one has claimed it is ready for Mozilla-scale hosting.
All SVN recommenders: see preed’s blog item at http://weblogs.mozillazine.org/preed/2007/01/downplaying_the_distributed_do.html.
/be
February 6th, 2007 at 6:01 am
Regarding GIT on windows. There is now an experimental MinGW version of GIT, which is rumored to be substantially quicker than the Cygwin version.
http://repo.or.cz/w/git/mingw.git?a=blob_plain;f=README.MinGW;hb=master
March 9th, 2007 at 8:06 pm
Not sure if you’ve tried it, but in my hands the cvs import tool for mercurial, cvs20hg has chewed through about a year of the mozilla cvs archive on a slow computer in about 12 hours. (On a much faster computer, I’ve gotten from Mar to Oct 1998 in about 4 hours.)
I don’t know how correct the import will be but I’ve gotten much farther with this tool than any of the cvs to git converters I could get my hands on. Hopefully it’ll terminate in a couple of days.
March 14th, 2007 at 2:36 am
Brendan, I’m not sure why you’d bother to use the Cygwin port of Monotone unless you had some special need. We have been providing a fully supported native MinGW Win32 binary for Monotone for many, many releases.