

          The Subversion Project:  Building a Better CVS
          ==============================================

              Ben Collins-Sussman <sussman@collab.net>
              
                      Written in August 2001
              Published in Linux Journal, January 2002

Abstract
--------

This article discusses the history, goals, features and design of
Subversion (http://subversion.tigris.org), an open-source project that
aims to produce a compelling replacement for CVS.


Introduction 
------------

If you work on any kind of open-source project, you've probably worked
with CVS.  You probably remember the first time you learned to do an
anonymous checkout of a source tree over the net -- or your first
commit, or learning how to look at CVS diffs.  And then the fateful
day came: you asked your friend how to rename a file.

"You can't", was the reply.

What?  What do you mean?

"Well, you can delete the file from the repository and then re-add it
under a new name."

Yes, but then nobody would know it had been renamed...

"Let's call the CVS administrator.  She can hand-edit the repository's
RCS files for us and possibly make things work."

What?

"And by the way, don't try to delete a directory either."

You rolled your eyes and groaned.  How could such simple tasks be
difficult?


The Legacy of CVS
-----------------

No doubt about it, CVS has evolved into the standard Software
Configuration Management (SCM) system of the open source community.
And rightly so!  CVS itself is Free software, and its wonderful "non
locking" development model -- whereby dozens of far-flung programmers
collaborate -- fits the open-source world very well.  In fact, one
might argue that without CVS, it's doubtful whether sites like
Freshmeat or Sourceforge would ever have flourished as they do now.
CVS and its semi-chaotic development model have become an essential
part of open source culture.

So what's wrong with CVS?

Because it uses the RCS storage-system under the hood, CVS can only
track file contents, not tree structures.  As a result, the user has
no way to copy, move, or rename items without losing history.  Tree
rearrangements are always ugly server-side tweaks.

The RCS back-end cannot store binary files efficiently, and branching
and tagging operations can grow to be very slow.  CVS also uses the
network inefficiently; many users are annoyed by long waits, because
file differences are sent in only one direction (from server to
client, but not from client to server), and binary files are always
transmitted in their entirety.

From a developer's standpoint, the CVS codebase is the result of
layers upon layers of historical "hacks".  (Remember that CVS began
life as a collection of shell-scripts to drive RCS.)  This makes the
code difficult to understand, maintain, or extend.  For example: CVS's
networking ability was essentially "stapled on".  It was never
designed to be a native client-server system.  

Rectifying CVS's problems is a huge task -- and we've listed only a
few of the many common complaints here.


Enter Subversion
----------------

In 1995, Karl Fogel and Jim Blandy founded Cyclic Software, a company
for commercially supporting and improving CVS.  Cyclic made the first
public release of a network-enabled CVS (contributed by Cygnus
software.)  In 1999, Karl Fogel published a book about CVS and the
open-source development model it enables (cvsbook.red-bean.com).  Karl
and Jim had long talked about writing a replacement for CVS; Jim had
even drafted a new, theoretical repository design.  Finally, in
February of 2000, Brian Behlendorf of CollabNet (www.collab.net)
offered Karl a full-time job to write a CVS replacement.  Karl
gathered a team together and work began in May.

The team settled on a few simple goals: Subversion would be designed
as a functional replacement for CVS.  It would do everything that CVS
does -- preserving the same development model while fixing the flaws
in CVS's (lack-of) design.  Existing CVS users would be the target
audience: any CVS user should be able to start using Subversion with
little effort.  Any other SCM "bonus features" would be of secondary
importance (at least before a 1.0 release.)

At the time of writing, the original team has been coding for a little
over a year, and we have a number of excellent volunteer contributors.
(Subversion, like CVS, is an open-source project!)


Subversion's Features
---------------------

Here's a quick run-down of some of the reasons you should be excited
about Subversion:

  * Real copies and renames.  The Subversion repository doesn't use
    RCS files at all; instead, it implements a 'virtual' versioned
    filesystem that tracks tree-structures over time (described
    below).  Files *and* directories are versioned.  At last, there
    are real client-side `mv' and `cp' commands that behave just as
    you think.

  * Atomic commits.  A commit either goes into the repository
    completely, or not at all.

  * Advanced network layer.  The Subversion network server is Apache,
    and client and server speak WebDAV to one another.  (See the
    'design' section below.)

  * Faster network access. A binary diffing algorithm is used to
    store and transmit deltas in *both* directions, regardless of
    whether a file is of text or binary type.

  * Filesystem "properties".  Each file or directory has an invisible
    hashtable attached.  You can invent and store any arbitrary
    key/value pairs you wish: owner, perms, icons, app-creator,
    mime-type, personal notes, etc.  This is a general-purpose feature
    for users.  Properties are versioned, just like file contents.
    And some properties are auto-detected, like the mime-type of a
    file (no more remembering to use the '-kb' switch!)

  * Extensible and hackable.  Subversion has no historical baggage; it
    was designed and then implemented as a collection of shared C
    libraries with well-defined APIs.  This makes Subversion extremely
    maintainable and usable by other applications and languages.

  * Easy migration.  The Subversion command-line client is very
    similar to CVS; the development model is the same, so CVS users
    should have little trouble making the switch.  Development of a
    'cvs2svn' repository converter is in progress.

  * It's Free.  Subversion is released under an Apache/BSD-style
    open-source license.
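
To make the "deltas in both directions" item above concrete, here is a
minimal sketch of a binary delta.  It is not Subversion's algorithm or
wire format (neither is described in this article); it uses Python's
difflib as a stand-in, and the names make_delta and apply_delta are
invented for illustration.

```python
from difflib import SequenceMatcher


def make_delta(old: bytes, new: bytes):
    """Return a list of ops that rebuild `new` from `old`.

    ("copy", start, end) reuses old[start:end];
    ("insert", data) carries literal bytes that `old` does not supply.
    """
    ops = []
    # autojunk=False keeps the matcher from discarding frequent byte values.
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))
        elif tag in ("replace", "insert"):
            ops.append(("insert", new[j1:j2]))
        # "delete" needs no op: those bytes are simply never copied.
    return ops


def apply_delta(old: bytes, ops) -> bytes:
    """Rebuild the new contents from the old contents plus a delta."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += old[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)
```

Because the ops work on raw bytes, the same code handles text and
binary files alike, and either side of a connection could compute a
delta against a version the other side already holds.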


Subversion's Design
-------------------

Subversion has a modular design; it's implemented as a collection of C
libraries.  Each layer has a well-defined purpose and interface.  In
general, code flow begins at the top of the diagram and flows
"downward" -- each layer provides an interface to the layer above it.

              <<insert diagram here:  svn.tiff>>


Let's take a short tour of these layers, starting at the bottom.


--> The Subversion filesystem.  

The Subversion Filesystem is *not* a kernel-level filesystem that one
would install in an operating system (like the Linux ext2 fs.)
Instead, it refers to the design of Subversion's repository.  The
repository is built on top of a database -- currently Berkeley DB --
and thus is a collection of .db files.  However, a library accesses
these files and exports a C API that simulates a filesystem --
specifically, a "versioned" filesystem.

This means that writing a program to access the repository is like
writing against other filesystem APIs: you can open files and
directories for reading and writing as usual.  The main difference is
that this particular filesystem never loses data when written to; old
versions of files and directories are always saved as historical
artifacts.

Whereas CVS's backend (RCS) stores revision numbers on a per-file
basis, Subversion numbers entire trees.  Each atomic 'commit' to the
repository creates a completely new filesystem tree, and is
individually labeled with a single, global revision number.  Files and
directories which have changed are rewritten (and older versions are
backed up and stored as differences against the latest version), while
unchanged entries are pointed to via a shared-storage mechanism.  This
is how the repository is able to version tree structures, not just
file contents.
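
The tree-per-revision idea can be sketched in a few lines.  This is a
toy model, not the Subversion filesystem: the Repository class and its
methods are hypothetical, directories are plain dicts, files are
strings, and old versions are kept whole rather than stored as
differences.

```python
class Repository:
    """Toy versioned filesystem: every commit yields a new root tree."""

    def __init__(self):
        # Revision 0 is an empty root directory.
        self.roots = [{}]

    def commit(self, changes):
        """Apply {path: contents} as one unit; return the new revision."""
        # Shallow copy: untouched children are shared with the old tree.
        new_root = dict(self.roots[-1])
        for path, contents in changes.items():
            parts = path.strip("/").split("/")
            node = new_root
            for name in parts[:-1]:
                # Copy only the directories that lead to the change.
                child = dict(node.get(name, {}))
                node[name] = child
                node = child
            node[parts[-1]] = contents
        # The new tree becomes visible only here, all at once.
        self.roots.append(new_root)
        return len(self.roots) - 1

    def read(self, revision, path):
        """Look up a path in the tree of any past revision."""
        node = self.roots[revision]
        for name in path.strip("/").split("/"):
            node = node[name]
        return node
```

After two commits, revision 1 still reads exactly as it did, and any
directory the second commit did not touch is the very same object in
both trees.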

Finally, it should be mentioned that using a database like Berkeley DB
immediately provides other nice features that Subversion needs: data
integrity, atomic writes, recoverability, and hot backups.  (See
www.sleepycat.com for more information.)


--> The network layer.

Subversion has the mark of Apache all over it.  At its very core, the
client uses the Apache Portable Runtime (APR) library.  (In fact, this
means that the Subversion client should compile and run anywhere
Apache httpd does -- right now, this list includes all flavors of
Unix, Win32, BeOS, OS/2, Mac OS X, and possibly Netware.)

However, Subversion depends on more than just APR -- the Subversion
"server" is Apache httpd itself.

Why was Apache chosen?  Ultimately, the decision was about not
reinventing the wheel.  Apache is a time-tested, open-source server
process that is ready for serious use, yet is still extensible.  It can
sustain a high network load.  It runs on many platforms and can
operate through firewalls.  It's able to use a number of different
authentication protocols.  It can do network pipelining and caching.
By using Apache as a server, Subversion gets all these features for
free.  Why start from scratch?

Subversion uses WebDAV as its network protocol.  DAV (Distributed
Authoring and Versioning) is a whole discussion in itself (see
www.webdav.org) -- but in short, it's an extension to HTTP that allows
reads/writes and "versioning" of files over the web.  The Subversion
project is hoping to ride a slowly rising tide of support for this
protocol: all of the latest file-browsers for Win32, MacOS, and GNOME
speak this protocol already.  Interoperability will (hopefully) become
more and more of a bonus over time.

For users who simply wish to access Subversion repositories on local
disk, the client can do this too; no network is required.  The
"Repository Access" layer (RA) is an abstract API implemented by both
the DAV and local-access RA libraries.  This is a specific benefit of
writing a "librarized" version control system; it's a big win over
CVS, which has two very different, difficult-to-maintain codepaths for
local vs. network repository-access.  Feel like writing a new network
protocol for Subversion?  Just write a new library that implements the
RA API!
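
As a rough illustration of that layering, the sketch below defines one
abstract interface with two interchangeable implementations.  Every
name in it (RepositoryAccess, LocalAccess, NetworkAccess, get_file,
open_session) is hypothetical; the real RA API is a C interface and
is not reproduced here.

```python
from abc import ABC, abstractmethod


class RepositoryAccess(ABC):
    """Hypothetical stand-in for the abstract RA interface."""

    @abstractmethod
    def get_file(self, path: str) -> bytes:
        """Return the contents of one file in the repository."""


class LocalAccess(RepositoryAccess):
    """Reads from a dict standing in for a repository on local disk."""

    def __init__(self, files):
        self.files = files

    def get_file(self, path):
        return self.files[path]


class NetworkAccess(RepositoryAccess):
    """Delegates to a callable standing in for a network round trip."""

    def __init__(self, fetch):
        self.fetch = fetch

    def get_file(self, path):
        return self.fetch(path)


def open_session(url, backends):
    """Pick a backend by URL scheme; callers never see which one."""
    scheme = url.split("://", 1)[0]
    return backends[scheme]
```

Code written against RepositoryAccess runs unchanged over either
backend, and supporting a new protocol means adding one more subclass
rather than a second codepath.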


--> The client libraries.

On the client side, the Subversion "working copy" library maintains
administrative information within special SVN/ subdirectories, similar
in purpose to the CVS/ administrative directories found in CVS working
copies.

A glance inside the typical SVN/ directory turns up a bit more than
usual, however.  The `entries' file contains XML which describes the
current state of the working copy directory (and which basically
serves the purposes of CVS's Entries, Root, and Repository files
combined).  But other items present (and not found in CVS/) include
storage locations for the versioned "properties" (the metadata
mentioned in "Subversion's Features" above) and private caches of
pristine versions of each file.  This latter feature provides the
ability to report local modifications -- and do reversions --
*without* network access.  Authentication data is also stored within
SVN/, rather than in a single .cvspass-like file.
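
The pristine-cache idea is simple enough to sketch.  The SVN/ name
comes from the text above, but the "pristine" subdirectory and the
three function names are assumptions made for this example, not the
actual working-copy layout or API.

```python
import shutil
from pathlib import Path

ADMIN_DIR = "SVN"  # name from the article; the layout below is invented


def save_pristine(workdir: Path, name: str) -> None:
    """Cache an untouched copy of a file, as if just checked out."""
    cache = workdir / ADMIN_DIR / "pristine"
    cache.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(workdir / name, cache / name)


def is_modified(workdir: Path, name: str) -> bool:
    """Compare the working file with its cached copy; no network needed."""
    pristine = workdir / ADMIN_DIR / "pristine" / name
    return (workdir / name).read_bytes() != pristine.read_bytes()


def revert(workdir: Path, name: str) -> None:
    """Discard local edits by restoring the cached copy."""
    shutil.copyfile(workdir / ADMIN_DIR / "pristine" / name, workdir / name)
```

The cost is disk space: each versioned file is stored twice in the
working copy.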

The Subversion "client" library has the broadest responsibility; its
job is to mingle the functionality of the working-copy library with
that of the repository-access library, and then to provide a
highest-level API to any application that wishes to perform general
version control actions.

For example: the C routine `svn_client_checkout()' takes a URL as an
argument.  It passes this URL to the repository-access library and
opens an authenticated session with a particular repository.  It then
asks the repository for a certain tree, and sends this tree into the
working-copy library, which then writes a full working copy to disk
(SVN/ directories and all.)
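
The flow just described can be mimicked with stand-ins.  Nothing here
is the real C API: FakeSession, write_working_copy and checkout are
hypothetical names, the tree is a flat dict, and the SVN/entries value
is a placeholder rather than the real XML.

```python
class FakeSession:
    """Stands in for an authenticated repository-access session."""

    def __init__(self, tree):
        self._tree = tree

    def get_tree(self):
        return self._tree


def write_working_copy(tree, url):
    """Working-copy role: return the files a real library would write."""
    files = dict(tree)
    files["SVN/entries"] = "url=" + url  # placeholder, not the real format
    return files


def checkout(url, open_session):
    """Client role: glue repository access to the working-copy code."""
    session = open_session(url)           # 1. open a session for the URL
    tree = session.get_tree()             # 2. ask the repository for a tree
    return write_working_copy(tree, url)  # 3. hand it to the working copy
```

The client function knows nothing about networks or disk formats; it
only sequences calls into the two lower layers.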

The client library is designed to be used by any application.  While
the Subversion source code includes a standard command-line client, it
should be very easy to write any number of GUI clients on top of the
client library.  Hopefully, these GUIs should someday prove to be much
better than the current crop of CVS GUI applications (the majority of
which are no more than fragile "wrappers" around the CVS command-line
client.)

In addition, proper SWIG bindings (www.swig.org) should make
the Subversion API available to any number of languages:  Java, Perl,
Python, Guile, and so on.  In order to Subvert CVS, it helps to be
ubiquitous! 


Subversion's Future
-------------------

The release of Subversion 1.0 is currently planned for early 2002.
After the release of 1.0, Subversion is slated for additions such as
i18n support, "intelligent" merging, better "changeset" manipulation,
client-side plugins, and improved features for server administration.
(Also on the wishlist is an eclectic collection of ideas, such as
distributed, replicating repositories.)

A final thought from Subversion's FAQ:

   "We aren't (yet) attempting to break new ground in SCM systems, nor
   are we attempting to imitate all the best features of every SCM
   system out there.  We're trying to replace CVS."

If, in three years, Subversion is widely presumed to be the "standard"
SCM system in the open-source community, then the project will have
succeeded.   But the future is still hazy:  ultimately, Subversion
will have to win this position on its own technical merits.

Patches are welcome.


For More Information
--------------------

Please visit the Subversion project website at
http://subversion.tigris.org.  There are discussion lists to join, and
the source code is available via anonymous CVS -- and soon through
Subversion itself.