Hello Tornado

To get a feel for the different ways of interacting with Couch, let’s start with a Hello World. Here’s the boilerplate to set up a Tornado server with two endpoints at /hello and /hi:

import tornado.ioloop
import tornado.web

application = tornado.web.Application([
    (r'/hello/([^/]+)', JumpyHello),
    (r'/hi/([^/]+)', RelaxedHello),
])
application.listen(1920)
tornado.ioloop.IOLoop.instance().start()

Since both of the request handlers (defined below) will need a reference to the database, set up a global for them to share:

from corduroy import Database, NotFound, relax
people_db = Database('people') # i.e., http://127.0.0.1:5984/people

The Twisted-Up Way

Our first request handler will parse the url and pull out the last component (which will be treated as a document ID). It will then try to retrieve that doc from the database and print a greeting based on its contents. Here is a handler that uses an explicit callback to make a non-blocking request:

class JumpyHello(tornado.web.RequestHandler):
    @tornado.web.asynchronous
    def get(self, user_id):
        # Request the corresponding user's doc. This will return
        # immediately and control will leave this method. Later on
        # the got_user_doc callback will be invoked with the 
        # response and a status object as arguments.
        people_db.get(user_id, callback=self.got_user_doc)

    def got_user_doc(self, doc, status):
        # Generate output based on the db's response
        if status.ok:
            self.write('hello %s %s'%(doc['first'],doc['last']))
        elif status.error is NotFound:
            self.write('hello whoever you are')
        else:
            raise status.exception
        self.finish()

Though it’s gratifying to know that your server process isn’t blocking while waiting for Couch to respond, the resulting code doesn’t feel terribly pythonic. Ideally something as simple as a GET request should be a one-liner. In addition, the use of callbacks means your code is no longer in the call stack should an exception occur during the request. As a result, error handling becomes a C-like process of manual status inspection in lieu of idiomatic try/except blocks.

The Way Out

A particularly nice solution to this problem of twisted async code is provided by the tornado.gen module. As the abbreviated name suggests, its approach is to use python generators to turn request handler methods into coroutines that can be suspended during i/o, then restarted when the response arrives.

Corduroy is happy to work in this style and provides the @relax decorator to make the syntax more transparent (or at least more glazed with sugar). When applied to one of your methods, the decorator allows you to treat the API as if it were blocking and no longer requires explicit callbacks. All your code needs to do to make this possible is place a yield in front of calls to the library.

The result of this yield expression will be the data that would ordinarily be passed to your callback function but can now be captured through simple assignment. If an error occurs, the decorator handles that as well by raising the exception at the point of the yield in your code:

class RelaxedHello(tornado.web.RequestHandler):
    @relax
    def get(self, user_id):
        try:
            doc = yield people_db.get(user_id)
            self.write('hello %s %s'%(doc['first'],doc['last']))
        except NotFound:
            self.write('hello whoever you are')
        self.finish()

This code looks like ‘normal’ blocking code but will in fact execute asynchronously. At the point of the yield statement, the request handler releases control of the event loop while Corduroy creates a callback behind the scenes. Once this internal callback fires, the request handler’s method is resumed. The handler can then make other asynchronous requests or just return its output if there’s nothing more to be done.
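To demystify what such a decorator does behind the scenes, here is a toy, synchronous sketch of the trampoline pattern. This is illustrative only, not Corduroy’s or tornado.gen’s actual implementation; the fetch stand-in, the FAKE_DB, and the names are all invented:

```python
class NotFound(Exception):
    pass

FAKE_DB = {'lackey-129': {'first': 'Oliver', 'last': 'Reeder'}}

def fetch(doc_id):
    # stand-in for a CouchDB GET; a real version would be non-blocking
    if doc_id not in FAKE_DB:
        raise NotFound(doc_id)
    return FAKE_DB[doc_id]

def relax_sketch(method):
    # drive the generator: send results in, or raise errors at the yield
    def wrapper(*args, **kwargs):
        gen = method(*args, **kwargs)
        try:
            op = next(gen)                 # run up to the first yield
            while True:
                try:
                    result = op()          # 'perform' the i/o
                except Exception as exc:
                    op = gen.throw(exc)    # surfaces inside the try/except
                else:
                    op = gen.send(result)  # resume with the response
        except StopIteration:
            pass                           # the method ran to completion
    return wrapper

greetings = []

@relax_sketch
def get(user_id):
    try:
        doc = yield (lambda: fetch(user_id))
        greetings.append('hello %s %s' % (doc['first'], doc['last']))
    except NotFound:
        greetings.append('hello whoever you are')

get('lackey-129')
get('nobody')
# greetings is now ['hello Oliver Reeder', 'hello whoever you are']
```

A real event loop would return control after registering a callback and resume the generator later; the control flow inside the decorated method, though, is the same.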

Calling Conventions

Virtually every method in Corduroy accepts an optional keyword argument called callback. When this argument is omitted (and the @relax decorator is not in effect), the call will use blocking i/o and the function call will not return until the operation is complete (or until an exception is raised).

If you pass a callable object (a.k.a. a function) as the callback arg, the call will complete almost immediately, and the return value will not be the final data but a replica of the HTTP Request (which can be useful for debugging, e.g., when passing a lot of options). The callback will be invoked moments later, when the server response arrives.

Callbacks should expect two arguments and be of the form:

def mycallback(data, status):
    pass

Callback Arguments

The first argument will contain the response from the server, either as a unicode string or as a decoded json object.

The second argument allows for error checking and has five attributes of interest:

When status.ok is False, the contents of the first argument are somewhat variable. Sometimes Couch responds verbosely to error conditions and the data argument will contain a json object. At other times data will simply be None. When in doubt, consult the HTTP API.

Explicit vs. Implicit Callbacks

When you call the database from within a @relax-decorated function you don't have to provide a callback; the decorator will do it for you. To give you some idea of what happens when you type yield, the decorator-provided callback’s logic looks something like this:

def relax_callback(data, status):
    if status.exception:
        raise status.exception # (in the context of your function)
    else:
        return data # (and assign it to the yield's lvalue)

The rest of this guide will use the implicit callback style for brevity’s sake. Just keep in mind that anywhere you see a yield before a library call, you could pass a callback argument instead.

Couches & Databases

When thought of purely as a key/value store, a CouchDB installation is a cascade of json objects with three basic levels of hierarchy. In Couch nomenclature, a single Server can contain many Databases which in turn contain many Documents.

The CouchDB Server

Servers are represented by Couch objects. The constructor takes a url as an argument, but will use the values in corduroy.defaults to construct a default url if none is provided. You can also include login credentials, either inline or as a 2-tuple. If you haven’t overridden the host and port defaults, all of these instantiations should be equivalent:

couchdb = Couch('http://username:pass@127.0.0.1:5984')
couchdb = Couch('http://127.0.0.1:5984', auth=('username','pass'))
couchdb = Couch(auth=('username','pass'))

All three of the above will return with:

<Couch 'http://127.0.0.1:5984'>

Creating a Couch object doesn’t actually connect to the server. As a result it’s safe to call the constructor in an event handler without needing a callback. You can then use the object’s all_dbs method to obtain a list of available databases, monitor server activity with tasks, or read/write configuration options with config. See the Reference docs for other server-level operations.

In all likelihood the only methods you’ll use regularly are db and create, which let you retrieve a reference to a specific database using its name:

couch = Couch(auth=('user','pass'))
try:
    mydb = yield couch.db('mine')
except NotFound:
    mydb = yield couch.create('mine')

Since this pattern is fairly common, the above can be simplified to:

mydb = yield couch.db('mine', create_if_missing=True)

Databases

You don’t actually need to create a Couch just to access a database. You can also instantiate a Database directly by passing the db’s full url to its constructor. As with Couch objects, the default server url will be prepended if necessary. Both of these are equivalent:

db = Database('http://127.0.0.1:5984/some_db_name')
db = Database('some_db_name')

print db
<Database 'some_db_name'>

Creating a Database object doesn’t contact the server, but an efficient check for its presence can be performed by calling its exists method. Similarly useful is the info method which performs a GET on the database’s root url.

Documents

CouchDB documents are dict-style json objects with two required keys: _id and _rev. The _id field is a unique-to-that-database string that allows the document to be requested by name. The _rev value is filled in by the server when you create or update a document. It is also the basis of Couch’s transaction-less mechanism for detecting write conflicts (see the Eventual Consistency section for details).

Corduroy will accept any dict-like object you pass and treat it as a document. The objects it returns will by default use the corduroy.Document class (though this can be overridden). The Document class inherits from dict and has two notable differences from the stdlib model:

  1. It preserves the order of keys in dictionary objects (mimicking Erlang and Javascript behavior w/r/t json serialization)
  2. It allows you to refer to items in the dict as if they were attributes.
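A rough sketch of such a class might look like the following. This is an illustration, not Corduroy’s actual Document source; it leans on the fact that modern python dicts preserve insertion order:

```python
class Doc(dict):
    """A dict whose items can also be read and written as attributes."""
    def __getattr__(self, name):
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name)
    __setattr__ = dict.__setitem__
    __delattr__ = dict.__delitem__

ollie = Doc([('_id', 'lackey-129'), ('first', 'Oliver')])
ollie.last = 'Reeder'                 # attribute-style write...
assert ollie['last'] == 'Reeder'      # ...lands in the dict
assert ollie.first == 'Oliver'        # attribute-style read
del ollie._id                         # attribute-style delete
assert list(ollie.keys()) == ['first', 'last']   # insertion order kept
```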

Basic CRUD

To fetch a single document, call the Database object’s get method with the desired _id string. The Document object it returns can be used just like a dictionary:

db = Database('underlings')
ollie = yield db.get('lackey-129')

print ollie
print 'Mr. %(first)s %(last)s can be found at %(office)s.' % ollie
<Document lackey-129[1] {first:"Oliver", last:"Reeder", office:"Richmond Terrace"}>
Mr. Oliver Reeder can be found at Richmond Terrace.

After making local changes, a new version of the document can be written to the db by calling the save method:

del ollie.office
ollie.education = u'Oxbridge'
yield db.save(ollie)

print ollie
<Document lackey-129[2] {first:"Oliver", last:"Reeder", education:"Oxbridge"}>

Note that the number in brackets incremented as a result of the save. This number is the first portion of the _rev value (the full value looks more like “2-0348cd6cc49cb4cacdb9b94c87c83808”). The _rev will change on every successful update. Also notice that we ignored the return value since the save method updates its argument as a side effect.

To remove a document from the db, pass a current version of the doc to the delete method. If your argument’s _rev value doesn’t match the server copy, a conflict exception will be raised.

try:
    yield db.delete(ollie)
except Conflict:
    print 'Our doc is stale. Need to refetch and try again.'

To create a new document, save a dict with a valid _id string. Its _rev will be set in the process:

newdoc = {'_id':'lackey-130', 'first':'Angela', 'last':'Heaney'}
yield db.save(newdoc)

print newdoc['_rev']
1-d0a259b5b8e71a3c0b0bc7facbb690d5

If you don’t specify an _id, one will be chosen for you. Corduroy keeps a cache of identifiers collected from the couch server’s _uuids API. Relying on server-provided IDs can purportedly improve performance since the generated identifiers are semi-sequential in a way that is friendly to b-tree traversal. I take no definite stance on this.

anon = {'first':'Julius', 'last':'Nicholson'}
yield db.save(anon)

print anon['_id'], anon['_rev']
6efcdf33df6c82bfc53d7416c660ef5a 1-967a00dff5e02add41819138abb3284d

Batch CRUD

Both the get and save methods can accept either a single value or a list as the first argument. To fetch multiple documents in a single request, pass a list of ID strings. The return value is a list of Document objects in the same order as the IDs list:

db = Database('backbench')
doc_ids = ['ballentine', 'holhurst', 'swain']
docs = yield db.get(doc_ids)

print docs[0]
<Document ballentine[9] {first:"Claire", last:"Ballentine", highly_regarded:True}>

Whereas the single-doc get call will raise a NotFound exception should the requested doc not exist, a batch get flags missing documents by including a None at the corresponding element of the results list.

To update multiple documents in a single request, pass your list of updated docs to save. To delete one or more docs in the batch, add a _deleted key to each such doc before submitting the request:

claire, geoff, ben = docs

claire.standing = u'not standing'
geoff._deleted = True
ben.newsnight = {'paxman':1, 'swain':0}
yield db.save([claire, geoff, ben])

Eventual Consistency

One of the fundamental differences between CouchDB and traditional RDBMSs is the way data integrity and transactional semantics are handled. The tl;dr version is that Couch abandons nearly all SQL-ish guarantees and pushes responsibility for conflict resolution to the client.

This might sound like a mis-feature, but in the best Worse-is-Better tradition, the lack of abstraction over what-gets-written-when can force you to improve the way you structure your code. This may just be Stockholm syndrome talking, but there’s an argument to be made that the client can do a better job of deciding how to deal with conflicts than a one-size-fits-all transaction could.

Couch’s Approach to MVCC

Regardless of how they’re rationalized, update conflicts are a common enough occurrence when dealing with documents that your code should generally consider them the norm rather than an exceptional condition.

The basic rule Couch uses to determine whether an update should succeed is quite simple:

The new version of a doc must have the same _rev value as the copy currently in the database.

To see how this rule, on some level, solves the entire problem of lost data, consider a scenario where two clients simultaneously download the same copy of a doc whose revision is currently 1. Both clients will modify their local copy of the doc and attempt to save it back to the database.

The first client to connect will succeed, since the _rev of its modified doc matches the value in the database. The server’s copy of the doc is then updated and its _rev is incremented to 2.

When the second client’s save attempt is handled, the client’s _rev (1) no longer matches the server’s copy (now 2). As a result the save will fail and raise a Conflict exception.

The second client must now fetch the newly updated doc from the server, re-apply its modifications, then attempt the save again.
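The whole exchange can be modeled in a few lines of pure python. Everything below — TinyCouch, and the integer _rev values standing in for Couch’s real "1-&lt;hash&gt;" strings — is a toy illustration, not Corduroy’s API:

```python
class Conflict(Exception):
    pass

class TinyCouch(object):
    """An in-memory stand-in that enforces the _rev matching rule."""
    def __init__(self):
        self.docs = {}
    def get(self, _id):
        return dict(self.docs[_id])
    def save(self, doc):
        stored = self.docs.get(doc['_id'])
        if stored is not None and stored['_rev'] != doc.get('_rev'):
            raise Conflict(doc['_id'])    # stale _rev: reject the write
        doc = dict(doc)
        doc['_rev'] = doc.get('_rev', 0) + 1
        self.docs[doc['_id']] = doc
        return doc

db = TinyCouch()
db.save({'_id': 'memo'})                  # server copy is now at rev 1
a = db.get('memo')                        # client A fetches rev 1
b = db.get('memo')                        # client B fetches rev 1
a['subject'] = 'first!'
db.save(a)                                # A wins; server copy now rev 2
b['subject'] = 'second'
try:
    db.save(b)                            # B still carries rev 1...
except Conflict:
    pass                                  # ...so the save is rejected
```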

The Standard Recipe

The end result of this _rev-matching rule is that any writes attempted by your code should use the following algorithm:

  1. Attempt the write.
  2. If no conflict occurred, you’re done.
  3. If there was a conflict, request the now-current copy of the doc.
  4. At the very least, copy the _rev from the newly-retrieved copy of the doc to your local copy. Ideally do something clever that merges the two docs without losing any edits.
  5. Attempt a write of the merged doc and GOTO 2
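The steps above can be sketched as a retry loop. Here db is any object with get/save methods that raise a Conflict on a _rev mismatch; the helper names and the stub database are invented for illustration and are not part of Corduroy’s API:

```python
class Conflict(Exception):
    pass

def save_with_retry(db, doc, merge, max_tries=5):
    for _ in range(max_tries):
        try:
            return db.save(doc)            # step 1: attempt the write
        except Conflict:
            current = db.get(doc['_id'])   # step 3: fetch the winner
            doc = merge(doc, current)      # step 4: merge the two copies
    raise Conflict(doc['_id'])             # give up after max_tries

class StubDB(object):
    """Rejects stale _revs, simulating a write race that was lost."""
    def __init__(self):
        self.server_doc = {'_id': 'memo', '_rev': 2}
    def get(self, _id):
        return dict(self.server_doc)
    def save(self, doc):
        if doc.get('_rev') != self.server_doc['_rev']:
            raise Conflict(doc['_id'])
        doc = dict(doc)
        doc['_rev'] += 1
        self.server_doc = doc
        return doc

def merge(local, server):
    local['_rev'] = server['_rev']         # naive: just adopt the new rev
    return local

saved = save_with_retry(StubDB(), {'_id': 'memo', '_rev': 1, 'n': 9}, merge)
# saved now carries _rev 3 along with the local edit n=9
```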

It’s worth acknowledging that doing things ‘correctly’ is a fair amount of (fairly repetitive) work. Thus the temptation to perform blindfolded writes and just hope for the best can be dangerously strong.

To try to make it easier to be responsible in this context, Corduroy treats every write as a potentially conflict-inducing operation. In previous examples of the save method, the return value was ignored. Let’s now take a look at the ConflictResolution object that save returns:

# create a pair of new docs
docs = [{'_id':'first', 'n':1}, {'_id':'second', 'n':2}]
conflicts = yield db.save(docs)
print conflicts
<Success: 2 docs updated>
# create a conflict by deleting the rev and re-saving
del docs[1]['_rev']
conflicts = yield db.save(docs)
print conflicts
<Conflict: second>

From the repr strings you can see whether the write was successful and, if not, the list of conflicted IDs. To access this information from your code, inspect the conflicts object’s pair of attributes: the pending and resolved dictionaries.

Resolving Conflicts

You could use the values in pending to plan a fetch request, merge the results with your local edits, then resubmit them. But since this is such a common pattern, the ConflictResolution object provides a method called resolve to handle all of this in one shot.

Since every part of the Standard Recipe except step 4 (merging the local and fetched copies of the doc) is boilerplate, Corduroy allows you to encapsulate your merge logic in a function and pass it to resolve, which will orchestrate the required HTTP traffic.

def mergefn(local_doc, server_doc):
    # just copy over the rev (a.k.a. not a real strategy)
    local_doc._rev = server_doc._rev
    return local_doc

conflicts = yield db.save(docs)
if conflicts.pending:
    print 'pre-merge: ', len(conflicts.pending)
    yield conflicts.resolve(mergefn)
    print 'post-merge:', len(conflicts.pending)
pre-merge:  1
post-merge: 0

The ConflictResolution object fetches the current versions of all the pending docs, then calls your merge function with the local and remote copies of each. The merge function should return a dictionary that merges the data in the divergent copies. These returned dictionaries will then be sent to the server in a batch write attempt. If the merge function returns None, no attempt to write that doc will be made.

When the resolve call completes, the pending and resolved dictionaries will be updated to reflect the new state.

Anticipatory Conflict Handling

The same merge functions that the resolve method accepts can also be passed to the Database.save method directly. If any conflicts occur, a resolution will automatically be attempted using your merge function.

def mergefn(local_doc, server_doc):
    local_doc._rev = server_doc._rev
    return local_doc

yield db.save(docs, merge=mergefn)

Despite this particular merge function’s obvious unsuitability for use in production, a forced overwrite is occasionally just the thing you need (especially during development). As a further shorthand, the above can be rewritten as:

yield db.save(docs, force=True)

Views

Beyond accessing documents individually, Couch provides a mechanism for building named indexes called ‘views’ that can aggregate data across documents. Views are defined by javascript functions on the server that update the index every time a document is added or modified. From the client’s perspective they are a series of ‘rows’ with three attributes: a key, a value, and the id of the document that generated the row.

The mapping of rows to documents is not one-to-one, so it’s quite possible for one doc to be represented by multiple rows while another is omitted altogether.

Similarly, key values may be unique between rows but often will not be. In fact one of the more useful properties of views is that multiple documents can be grouped together on the basis of having the same key.
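A pure-python sketch makes the row model concrete. On a real server the map step is a javascript function calling emit(key, value); the docs and field names below are invented:

```python
docs = [
    {'_id': 'emp-1', 'last': 'abbott',   'depts': ['dosac']},
    {'_id': 'emp-2', 'last': 'coverley', 'depts': ['dosac', 'press']},
    {'_id': 'emp-3', 'last': 'cullen',   'depts': []},
]

def byname_map(doc):
    # the map step: a doc may emit several rows, or none at all
    for dept in doc['depts']:
        yield (doc['last'], doc['_id'], dept)   # (key, id, value)

# the 'view': every emitted row, sorted by key
rows = sorted(row for doc in docs for row in byname_map(doc))

assert [r[0] for r in rows] == ['abbott', 'coverley', 'coverley']
# emp-2 appears twice (two rows, one per dept); emp-3 not at all
```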

Querying a View

To retrieve all of the rows from a view, call the database object’s view method with the ‘name’ of that view as an argument. Views are named according to the design documents they live in, so for a view called byname in a design document called _design/employees, its ‘name’ would be "employees/byname".

The result of a call to view is an iterable list of Row objects. Here we grab all of the rows from a view and begin printing them out:

db = Database('dosac')
rows = yield db.view('employees/byname')
print rows
<employees/byname: 17/17 rows>
for row in rows:
    print 'key:%s\t| id:%s' % (row.key, row.id)
key:abbott   | id:emp-2c9792a3
key:coverley | id:emp-fad5095d
key:cullen   | id:emp-687e8224
⋮
key:reeder   | id:emp-c110c0a8

Random Access Rows (and Docs)

Since every row in a view has an associated key, you can selectively query the view only for rows matching a particular value. For instance, to grab just the first row the query would be:

yield db.view('employees/byname', key='abbott')

Queries with multiple keys are also possible:

yield db.view('employees/byname', keys=['coverley', 'murray'])

Part of what makes this a useful feature is that the documents associated with rows can also be included in the response. As a result, views can be used to define aliases to documents independent of their potentially unwieldy _id values:

rows = yield db.view('employees/byname', key='abbott', include_docs=True)
print rows[0].doc
<Document emp-2c9792a3[5932] {first:"Hugh", last:"Abbott", locale:"Unknown"}>

Ranged Queries

Since the rows in a view are sorted in ascending order based on their keys, you can also request all rows within a range by specifying startkey and endkey values. To select all the “C” names, the query would be:

rows = yield db.view('employees/byname', inclusive_end=False,
                                         startkey='c', endkey='d')
print [row.key for row in rows]
[u"coverley", u"cullen"]

If it’s not immediately apparent why the above query looks the way it does, I highly recommend the CouchDB Guide’s chapter on the subject.
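In local terms, the query above is just a half-open range scan over the sorted keys — which is why endkey is 'd' rather than some contrived "last C string", and why inclusive_end=False would drop a key equal to 'd' exactly:

```python
# keys as the view stores them: ascending order
keys = ['abbott', 'coverley', 'cullen', 'murray', 'reeder']

# startkey='c', endkey='d', inclusive_end=False ~ the range [c, d)
selected = [k for k in keys if 'c' <= k < 'd']
assert selected == ['coverley', 'cullen']
```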

Reductions

For the most part views as described so far – an ordered set of key/id/value rows – provide all the data-access flexibility you need for reading out the state of your database. But Couch also allows for a serverside pre-processing step to be associated with each view through the use of a reduce function.

Couch’s semantics would seem to encourage using reduce as a way to create denormalized copies of your view data (e.g., creating a list of all the values for a given key). But in practice this sort of usage is to be avoided due to an unfortunate trade-off in Couch’s b-tree-based internal design.

Instead, reduce should be thought of as a way to convert rows into a handful of numeric values – often only one. Accepting that limitation, it frequently makes sense to use one of Couch’s built-in reductions instead of writing your own. In combination with View Collation this becomes a surprisingly powerful technique.

Here is a view that uses the _count built-in reduction on rows keyed by date. Views with reduce functions will return the output of that reduction by default, but the pre-reduction rows can still be accessed by querying with reduce=False:

for row in (yield db.view('chronological/counts', reduce=False)):
    print row.key
["2007","11","28"]
["2007","11","29"]
["2007","11","29"]
⋮
["2007","12","02"]

The group and group_level keyword arguments allow you to control whether the reduce function is applied uniformly to the rows (resulting in a single value), or to groupings based on shared key values.

for row in (yield db.view('chronological/counts', group=True)):
    print "%(key)s → %(value)i" % row
["2007","11","28"] → 1
["2007","11","29"] → 17
["2007","11","30"] → 3
["2007","12","01"] → 92
["2007","12","02"] → 50
for row in (yield db.view('chronological/counts', group_level=2)):
    print "%(key)s → %(value)i" % row
["2007","11"] → 20
["2007","12"] → 142
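What group_level does can also be reproduced locally: rows whose keys share the same first N components are collapsed into one bucket, and _count tallies each bucket. This is toy data and a naive tally, not the server’s algorithm (which reduces incrementally inside the b-tree):

```python
rows = [(['2007', '11', '28'], 1),
        (['2007', '11', '29'], 1),
        (['2007', '11', '29'], 1),
        (['2007', '12', '02'], 1)]

def count(rows, group_level):
    totals = {}
    for key, _value in rows:
        bucket = tuple(key[:group_level])   # truncate key to N components
        totals[bucket] = totals.get(bucket, 0) + 1
    return totals

assert count(rows, 2) == {('2007', '11'): 3, ('2007', '12'): 1}
assert count(rows, 1) == {('2007',): 4}     # group_level=1: by year
```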

Data Formatters

Couch is not limited to presenting your documents and views as json-formatted text. It allows you to define special handlers to transform the raw data into HTML, BibTeX, WAD or whatever is appropriate for your application.

These custom formatters are defined in a design document specific to your database and are referred to as ‘lists’ (which process the output of views) and ‘shows’ (which process the content of documents). Their client APIs are quite similar: in both cases you provide the name of the formatter along with the view or document it will be applied to. In response an object is returned with two attributes of interest: its headers and body.

Formatting Documents with Shows

‘Show’ functions are reusable javascript routines that have access to a specified document as well as any query arguments passed in the request. Here the function csv in the design doc records is being used to format a document. The include_titles argument instructs this particular ‘show’ to output a header row in addition to the document data:

db = Database('mannion')
response = yield db.show('records/csv', '1978-wotw', include_titles=True)
print response.headers['Content-Type']
text/csv
print response.body
artist,title,review,doc_id
Jeff Wayne,War of the Worlds,Stimulating,1978-wotw

Formatting Views with Lists

‘List’ functions use similar syntax but apply to a view instead of a single document. In addition to query arguments, the same slicing operations that apply to views can be included to filter out selected rows before formatting:

db = Database('tucker')
response = yield db.list('agenda/html', 'nomfup', 
                         key='n', limit=3, descending=True)
print response.body
<ul><li>Nicola Murray</li><li>Newcastle</li><li>National Trust</li></ul>

If your list function is general enough, you can apply it to other views as well; even views in other design documents:

response = yield db.list('agenda/html', 'bollockings/byday', 
                         key='2007-10-13')
print response.body
<ul><li>Steve Fleming 13:00-13:07</li><li>Julius Nicholson 13:30–</li></ul>

Replication

Reliable, master-less replication is one of Couch’s marquee features. A replication involves a source and a target database and propagates changes from the former to the latter. To copy an existing database, pass the source and target database names (or urls) to a Couch object:

couch = Couch()
yield couch.replicate('olddb', 'newdb', create_target=True)

Database objects can also replicate themselves through their push and pull methods. Their behavior is analogous to the eponymous commands used by distributed version control systems. These operations are one-way in nature meaning a full synchronization of two databases (each with local edits) requires a reciprocal push and pull:

local_db = Database('localdb')
remote_db = Database('http://elsewhere.com:5984/remotedb')
yield local_db.push(remote_db)
yield local_db.pull(remote_db)

Persistence

In addition to one-off, ‘anonymous’ replications, Couch maintains a system-controlled database called _replicator to track named replications over time. This database is exposed by the Couch object and behaves like any other (supporting get, save, and friends).

The _replicator database has the special property that when docs of a certain form are created, they will trigger a new replication. The replication’s ongoing status can then be monitored by polling the document:

couch = Couch()
repl = { "_id":"local-to-remote",
         "source":"localdb", 
         "target":"http://elsewhere.com:5984/remotedb",
         "continuous":True }
yield couch.replicator.save(repl)

repl_doc = yield couch.replicator.get('local-to-remote')
print json.dumps(repl_doc, indent=2)
{
  "_id":"local-to-remote",
  "source":"localdb", 
  "target":"http://elsewhere.com:5984/remotedb",
  "continuous":true,
  "_replication_id":  "c0ebe9256695ff083347cbf95f93e280",
  "_replication_state":  "triggered",
  "_replication_state_time":  1297974122
}

As a convenience, all of the replication methods accept a keyword argument called _id. If included, a new document will be created in the _replicator database using the arguments for its fields. Thus the previous example could be rewritten as:

local_db = Database('localdb')
yield local_db.push(remote_db, 
                    "http://elsewhere.com:5984/remotedb", 
                    continuous=True, 
                    _id='local-to-remote')

Or if you want to get fancy:

repl = { "_id":"local-to-remote",
         "target":"http://elsewhere.com:5984/remotedb",
         "continuous":True }
yield local_db.push(**repl)

Change Notifications

At the filesystem level, a couch database is stored as something more akin to a journal file than a snapshot of the current state. Every document modification is appended to this journal and the ‘live’ database can be thought of as the result of playing back all these modification records in order.

Beyond being just an implementation detail, this alternative method of representing a database (as a sequence of changesets) is exposed through the _changes API. Your application code can make use of this data in two main ways:

  1. retrieving a list of changes since a known timepoint (a ‘seq’ value).
  2. subscribing to an asynchronous feed in which a user-provided callback is fired in realtime coincident with changes to the db.

The former can be useful for reinventing differently-sized wheels (see also Replication), while the latter can be used to trigger scripted events such as cache invalidation or summary statistics generation.
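The journal-playback idea is easy to model: replaying every change record in seq order reconstructs the current state, and asking for changes since=N replays only the tail. These are toy records — real change records carry revision info rather than full docs:

```python
changes = [
    {'seq': 1, 'id': 'a', 'doc': {'n': 1}},
    {'seq': 2, 'id': 'b', 'doc': {'n': 2}},
    {'seq': 3, 'id': 'a', 'doc': {'n': 7}},   # later write to 'a' wins
]

def replay(changes, since=0):
    state = {}
    for record in changes:
        if record['seq'] > since:
            state[record['id']] = record['doc']
    return state

assert replay(changes) == {'a': {'n': 7}, 'b': {'n': 2}}   # full playback
assert replay(changes, since=2) == {'a': {'n': 7}}         # just the tail
```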

When requesting a list of changes, you can either specify no arguments (in which case somewhere between 0 and update_seq change records will be returned), or you can bracket the time period in order to limit the response size:

db = Database('watched')
info = yield db.info()
print info.update_seq
11519
changes = yield db.changes()
print "seq: %i (%i changes)" % (changes.last_seq, len(changes.results))
seq: 11519 (210 changes)

changes = yield db.changes(since=11500, limit=5)
print "seq: %i (%i changes)" % (changes.last_seq, len(changes.results))
seq: 11506 (5 changes)

Each element of the changes.results list corresponds to a document and contains a delta relative to its previous state. For all the details see the CouchDB Book’s chapter on the subject.

Asynchronous Feeds

Changes can also be accessed as a ‘feed’ in which an HTTP connection is left open and Couch writes individual change notifications to it as they occur. To listen for changes in this manner, request a continuous feed and pass an explicit callback (i.e., do not yield this time):

def listener(last_seq, changes):
    pass

feed = db.changes(since=11500, feed='continuous', callback=listener)

The return value of a feed request is a ChangesFeed object. It keeps the connection alive and handles the invocation of your callback at regular intervals (see the latency parameter). The feed will continue listening until it is explicitly stopped:

feed.stop()

Further Reading

The overall documentation scene for Couch isn’t quite as focused as it could be. There’s quite a bit of good information out there, but it’s scattered all over the net. Here are some resources I’ve found useful: