CouchDB and Translucent Databases

I've been playing around with CouchDB for several weeks while working on a story for InfoWorld. The tool is little more than a big pile of pairs of data. You put in one value, the key, and back comes the data associated with it. The system is still very much in its alpha stage and the developers are debating how to add more features like security. They've got a good model but I wanted to write up how some of the most basic techniques I wrote about in Translucent Databases can be simpler and a bit more secure.

The current security model is evolving. Each pile of data pairs is called a document and each database is just a pile of documents paired with their own key. The model gives each document a readers list and an authors list. You can only read a document if you're on its readers list, and you can only write to it if you're on its authors list. That's a good model but it can be relatively heavy for such a light-weight tool.

Users can also use one-way functions like SHA-256 to push some of the security onto the client. Let's imagine that we've got a store and we want to store customer data. The traditional solution is to give each user a user id and a password, force them to log in, and then arrange for the db to return only their own data. The db checks all of the readers lists and returns only the valid information.

Another solution is to ask the user to pass this user id and password through a one-way function to create a random-looking pseudonym. In this case, the user might calculate SHA256(userid, password) and submit this. The db will search for this result as it would any other key and produce only the data attached to this key. It doesn't need to bother with readers lists and authors lists.
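Here's a minimal sketch of how a client might compute that pseudonym in Python. The function name `pseudonym` and the length-prefix encoding are my own assumptions for illustration; the only requirement from the idea above is that the user id and password go through SHA-256 together.

```python
import hashlib


def pseudonym(userid: str, password: str) -> str:
    """Return a 64-character hex pseudonym derived from userid and password.

    Hypothetical helper: the field encoding below is one reasonable choice,
    not something CouchDB prescribes.
    """
    h = hashlib.sha256()
    for field in (userid, password):
        data = field.encode("utf-8")
        # Length-prefix each field so ("ab", "c") and ("a", "bc")
        # cannot collapse into the same input.
        h.update(len(data).to_bytes(4, "big"))
        h.update(data)
    return h.hexdigest()


# The client submits this string as the document key.
key = pseudonym("alice", "correct horse")
```

The same user id and password always produce the same key, so the client can find its document again without the server ever seeing either value.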

How does this change the danger of attacks? First, assume that an attacker can submit any number of queries looking for someone's data. If the attacker can guess the userid and password, the attacker can compute the digital pseudonym and retrieve the data attached to it. But if the attacker knows these values, the attacker could also log in and gain access in the traditional model too.

What if the attacker can't guess the right combination of userid and password? If the digital pseudonyms are long enough (SHA256 returns 256 bits), then they'll be practically impossible to guess. The database is just a pile of documents indexed with random-looking 256-bit id strings.
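To put a rough number on "practically impossible," here's a small back-of-the-envelope calculation. It assumes the pseudonyms behave like uniformly random 256-bit strings and that the attacker is guessing keys blindly, not guessing weak passwords. The function name and the example figures are made up for illustration.

```python
from fractions import Fraction


def guess_success_bound(num_documents: int, num_guesses: int, bits: int = 256) -> Fraction:
    """Upper bound on the chance that any blind guess hits any stored key.

    Each guess matches one of num_documents keys with probability
    num_documents / 2**bits; summing over all guesses gives a union bound.
    """
    return min(Fraction(1), Fraction(num_documents * num_guesses, 2**bits))


# Hypothetical figures: a billion documents and a trillion guesses.
bound = guess_success_bound(10**9, 10**12)
print(float(bound))  # a value below 1e-56
```

Even with those generous numbers the attacker's odds stay vanishingly small, which is why the weak point is the userid and password rather than the key length.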

Now imagine that an attacker can gain root access to the database. This is a problem for a traditional system because each document is indexed by the user name and password. A root-level attacker can search through the pairs to find a person's documents because the readers list can't stop someone who bypasses the access checks. But with the translucent approach, there's nothing of value on the server. It's just a big pile of documents indexed with 256-bit keys. If there's no personally identifiable information in each document, then there's nothing to see.
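Here's a toy illustration of what that root-level view looks like. A plain Python dict stands in for the database; this is not CouchDB's API, and the names `server`, `save`, `load`, and `root_view` are hypothetical. The hashing is simplified to a single separator byte for brevity.

```python
import hashlib


def pseudonym(userid: str, password: str) -> str:
    # Simplified for brevity: a zero byte keeps the two fields apart.
    data = userid.encode("utf-8") + b"\x00" + password.encode("utf-8")
    return hashlib.sha256(data).hexdigest()


# A plain dict stands in for the database (an assumption for this sketch).
server = {}


def save(userid: str, password: str, document: dict) -> None:
    # Only the pseudonym reaches the server; the userid and password do not.
    server[pseudonym(userid, password)] = document


def load(userid: str, password: str):
    # Returns None when no document is stored under the computed key.
    return server.get(pseudonym(userid, password))


def root_view() -> list:
    # Everything an attacker with full access to the store can list.
    return sorted(server.keys())


save("alice", "correct horse", {"cart": ["widget", "gadget"]})
save("bob", "hunter2", {"cart": ["sprocket"]})

for key in root_view():
    print(key)  # 64 hex characters, with no user name attached
```

Someone who knows the right userid and password gets the document back with `load`, while someone reading the raw store sees only opaque keys. The documents themselves are still readable here, which is why they shouldn't carry personally identifiable information.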

Even personally sensitive information can be further protected if the client encrypts it before storing it with the central server. I've described a number of models for how this can help. Here's a case study of a library that tracks the books without keeping track of the user's reading habits. Here's another case study of a store.

I'm going to experiment more with adding translucent solutions like this to CouchDB. If anyone has any thoughts or suggestions, I hope you'll write.