Can Amazon Have A Feature-Rich Site And Protect Sensitive Information?
How much information should a company retain about its customers? I started wondering about this question when I wrote Translucent Databases last year to explore the many ways that someone could build a database that kept no personal information yet still performed useful work. The technology is easy to implement, but the social matrix where the technology must live is hard to understand.
Over the last few weeks, I've been discussing the usefulness of the approach with Jon Udell. He sees the technical charm, but wonders whether anyone would ever want to use it. So as a challenge, I offered to show how Amazon could use the techniques to offer almost all of their services without retaining any personal information about their customers. If someone breaks into their computers or someone on staff decides to get nosy, they can't find a customer's browsing habits, credit card number, address or anything. Yet, Amazon could still tune their offerings to you and suggest items related to your previous purchases.
Although it sounds like a paradox for a database to contain no useful information and still give valuable responses, such systems can be built. The right kind of encryption can scramble data so that the original never has to be retained. One-way functions like SHA can turn names, social security numbers, and medical records into unintelligible 160-bit numbers. These 160-bit numbers can serve as surrogates for names, credit card numbers and whatnot, but they can't be reversed because, behold the magic of definition, functions like SHA only work one way. They can't be inverted.
Inscrutable Surrogates
The best way to continue is with an example. SHA is a cryptographically secure hash function. That means no one has publicly described an efficient way to take SHA(x) and figure out a value of x that produced it. For instance, SHA("pcw@flyzone.com/swordfish") produces the hexadecimal string 0185fd1f5137ec04a564fdd8ef043e12fd643511. If you start with 0185fd1f5137ec04a564fdd8ef043e12fd643511, though, it is practically impossible to find any string that generates the number, much less find "pcw@flyzone.com/swordfish".
I chose that string because it combines my email address (I can't possibly get more spam) and a password. Mixing them together with SHA produces a bag of 160 bits that can act as a surrogate for the email address and password. Anyone who knows the email address and password can compute the surrogate, but someone who begins with the value 0185fd1f5137ec04a564fdd8ef043e12fd643511 can't go backwards. There's no public method for reversing the function and it seems unlikely that there's any method at all.
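Here is a minimal sketch of the computation in Python, assuming "SHA" means the 160-bit SHA-1 and that the email address and password are joined with a slash and encoded as UTF-8, as in the example string. The function name `surrogate` is my own label for illustration.

```python
import hashlib

def surrogate(email: str, password: str) -> str:
    """Return SHA-1("email/password") as 40 hex digits (160 bits)."""
    combined = f"{email}/{password}"          # separator is an assumption
    return hashlib.sha1(combined.encode("utf-8")).hexdigest()

token = surrogate("pcw@flyzone.com", "swordfish")
print(token)           # a 40-character hexadecimal string
print(len(token) * 4)  # 160 bits
```

The same inputs always produce the same surrogate, and changing a single character of either input produces an unrelated one.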
These surrogates can replace your email address and password in the Amazon system. Every time the machines at Amazon store information about the items you examined or purchased, they can store them in their huge databases under the surrogate 0185fd1f5137ec04a564fdd8ef043e12fd643511.
When you return, you log into the system again with your email address and password. Amazon passes them through SHA, looks the result up in their database and starts tuning the experience again. They can look up what 0185fd1f5137ec04a564fdd8ef043e12fd643511 bought on the last trip and start posting similar items on the screen as temptation.
It should be obvious that a malicious hacker or a curious insider at Amazon can't find out what someone bought or even browsed for in the past. If someone wants to browse through the databases, all they find is inscrutable numbers like 0185fd1f5137ec04a564fdd8ef043e12fd643511.
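The store-and-look-up cycle described above can be sketched in a few lines of Python. This is only an illustration under my own assumptions: an in-memory dictionary stands in for Amazon's databases, and the class and method names (`TranslucentHistory`, `record`, `history`, `raw_rows`) are hypothetical.

```python
import hashlib
from collections import defaultdict

def surrogate(email, password):
    """SHA-1 of "email/password" as hex; the only key the store ever sees."""
    return hashlib.sha1(f"{email}/{password}".encode("utf-8")).hexdigest()

class TranslucentHistory:
    """Keeps browsing and purchase history keyed by surrogate only."""

    def __init__(self):
        self._items = defaultdict(list)

    def record(self, email, password, item):
        # The email and password are hashed and never stored themselves.
        self._items[surrogate(email, password)].append(item)

    def history(self, email, password):
        # A returning customer recomputes the same surrogate at login.
        return list(self._items.get(surrogate(email, password), []))

    def raw_rows(self):
        # What a hacker or curious insider would see: hex keys and items.
        return dict(self._items)

store = TranslucentHistory()
store.record("pcw@flyzone.com", "swordfish", "New York on $100 a Day")
print(store.history("pcw@flyzone.com", "swordfish"))    # ['New York on $100 a Day']
print(store.history("pcw@flyzone.com", "wrong-guess"))  # []
```

Anyone reading `raw_rows()` finds item lists filed under 40-digit hexadecimal keys, with no email address anywhere in the table.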
Offering Services
How many services offered by Amazon can be converted to use this system? A surprising number. Without any real knowledge of the Amazon system beyond what I've experienced as a customer, I can divide their databases into these categories:
- Website Customization -- If Bob bought New York on $100 a Day, offer him books like A History of New York, recordings of Frank Sinatra singing "New York, New York" and maybe even movies like "The Muppets Take Manhattan".
- Purchase Analysis -- Do customers buy red shirts at Target in the same order as books like the Communist Manifesto? Someone in the marketing department may want to know.
- Fulfillment Services -- If someone orders something, get it to them tout de suite.
- Spam-like Services -- Did someone buy a book like Dow Jones 36,000 several years ago? Send them some email and suggest a book about accounting fraud and irrational exuberance.
I think the first three of these can be accomplished with fully translucent databases that protect people's sensitive information without compromising any of the features. The last one can't be fixed permanently, but the dangers can be limited by creating a somewhat translucent proxy remailer.
- Solving Website Customization -- When someone logs in with their email and password, compute the SHA surrogate and use it to track their progress. If they've visited before, use this value to look up their history and start offering them suggestions.
A trickier problem is keeping a storehouse of addresses and credit card numbers to save customers from retyping them on the next visit. This information can be encrypted with a key based upon the email and password and then the key can be thrown away after the data is scrambled.
SHA(password|name), for instance, makes a good encryption key because it bears no resemblance to SHA(name|password). You can't compute one from the other. So use one as the surrogate and the other as the key that encrypts the personal information. Once this key is destroyed, insiders and hackers can't abuse it. The address and the credit card information can only be decrypted after someone returns and presents the password and email address.
Jon points out that Microsoft offered a similar solution in Hailstorm. They encrypted the information but needed your permission to decrypt it. Hmmm.
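To make the two-hash idea concrete, here is a rough Python sketch. The loud assumption: Python's standard library has no real cipher, so the `keystream` and `xor_crypt` helpers below are a toy stand-in I built from SHA-1 purely for illustration, and a real system would use a vetted cipher such as AES. The function names and the sample address are made up.

```python
import hashlib

def sha(text: str) -> bytes:
    return hashlib.sha1(text.encode("utf-8")).digest()

def keystream(key: bytes, length: int) -> bytes:
    """Toy keystream: SHA-1(key + counter) blocks. Illustration only."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha1(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])

def xor_crypt(key: bytes, data: bytes) -> bytes:
    """XOR with the keystream; the same call encrypts and decrypts."""
    return bytes(a ^ b for a, b in zip(data, keystream(key, len(data))))

vault = {}  # surrogate -> scrambled personal record

def store_profile(email: str, password: str, record: bytes) -> None:
    row_id = sha(f"{email}|{password}").hex()  # surrogate: SHA(name|password)
    key = sha(f"{password}|{email}")           # key: SHA(password|name)
    vault[row_id] = xor_crypt(key, record)
    # The key goes out of scope here and is never written anywhere.

def load_profile(email: str, password: str):
    row_id = sha(f"{email}|{password}").hex()
    if row_id not in vault:
        return None                            # wrong email or password
    return xor_crypt(sha(f"{password}|{email}"), vault[row_id])

store_profile("pcw@flyzone.com", "swordfish", b"1 Example Street, Anytown")
print(load_profile("pcw@flyzone.com", "swordfish"))  # b'1 Example Street, Anytown'
print(load_profile("pcw@flyzone.com", "guess"))      # None
```

Between visits the vault holds only a hex surrogate and scrambled bytes. The key exists only while the customer is logged in and has supplied both inputs.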
- Solving Purchase Analysis -- This is the easiest one of the four. The surrogates also link together people's purchases. Any analysis that required the person's name can compute the same results with the surrogate.
- Solving Fulfillment Services -- This is a bit trickier. Sending a package requires collecting someone's address. This information, however, can be destroyed after the package arrives. Amazon could keep this information around until it ships and then forget it. If someone logged in to check the status of their purchase, they would log in with the email address and password to unlock the shipping records.
- Solving Spam-like Services -- One solution is to stop using them. Many people would be happy for them to end. But others might not be. I know one author who watched his new book sail into the Hot 100 after Amazon sent announcements to people who bought his last book. While many were inconvenienced, some were happy enough to buy.
Still, I don't receive many of these messages despite buying books by repeat authors. It may be that Amazon finds that they're more trouble than they're worth. Many people report these messages as spam, even though they chose to turn on the service in the past. They either forget or change their mind. This is a real hassle for Amazon because they get lumped in with random spammers. Dealing with so-called legitimate spam like this may be an even bigger practical problem than dealing with the illegitimate stuff.
I can't think of any good way to offer these services without keeping the email addresses tied to the purchase records. It is possible to isolate this information on a separate server with special access restrictions. This will constrain hackers and insiders, but it won't remove any of the deeper problems.
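Two of the easier cases from the list above, purchase analysis and fulfillment, can be sketched briefly in Python. Everything here is hypothetical: the short surrogates "a1" and "b2" stand in for real 160-bit values, and `co_purchases` and `ShippingDesk` are names I invented for the illustration.

```python
from collections import Counter
from itertools import combinations

def co_purchases(orders):
    """Count item pairs bought under the same surrogate.

    orders is an iterable of (surrogate, item); no names are needed.
    """
    baskets = {}
    for who, item in orders:
        baskets.setdefault(who, set()).add(item)
    pairs = Counter()
    for items in baskets.values():
        pairs.update(combinations(sorted(items), 2))
    return pairs

class ShippingDesk:
    """Holds an address only while an order is in transit."""

    def __init__(self):
        self._pending = {}

    def place(self, order_id, address):
        self._pending[order_id] = address

    def mark_delivered(self, order_id):
        # Forget the address as soon as the package has arrived.
        self._pending.pop(order_id, None)

    def address_for(self, order_id):
        return self._pending.get(order_id)

orders = [("a1", "red shirt"), ("a1", "Communist Manifesto"), ("b2", "red shirt")]
print(co_purchases(orders)[("Communist Manifesto", "red shirt")])  # 1

desk = ShippingDesk()
desk.place("order-1", "1 Example Street, Anytown")
desk.mark_delivered("order-1")
print(desk.address_for("order-1"))  # None
```

The analysis produces the same counts it would with real names, and the shipping table is empty once every package has been delivered.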
Social Problems
Amazon is a good choice for this demonstration because the company is constantly tweaking its privacy policy and experimenting with different rules. Some users were even prompted to start a boycott of the firm. Of course, it's not fair to pick on Amazon alone. Many stores use the same techniques and may not even do as good a job protecting privacy. To some extent, I chose Amazon for this example because they have such a feature-rich site, not because they have a history of wrestling with customer privacy. If Amazon's site can be duplicated with these techniques, then they're pretty powerful.
Some argue that the success of Amazon shows that many people don't care about privacy. This assumption may be a bit of a leap because it could be that many people weigh the benefits of Amazon against the costs of giving up some of their privacy. Free shipping and a great selection outweigh having someone track what you read. The approach in this note shows that the company could give people most of the benefits and protect their personal information.
The techniques can also fight crime and terrorism, a big problem today. One man watched his credit card number get stolen and used to send a night-vision scope to the Middle East. Insiders steal personal information all of the time. Others plunder databases filled with email addresses and sell the lists to spammers. While I'm sure the majority of folks at Amazon are clean-cut citizens, there may be someone who isn't. A translucent database can stop the insiders with malice aforethought.
More
This note is only meant to address a challenge that Jon Udell made to help determine the utility of translucency. He seemed skeptical that the ideas would find widespread use. I hope this note will at least make it clear that a fully-functional, highly customized store like Amazon can be built with a few very simple techniques. Amazon can have its cake and eat most of it too.
Of course, just because an idea is simple and stops terrorism (among other things) doesn't mean that it can or will be widely adopted. I think the resistance in ourselves is deeply buried, perhaps even below our logical layer. Many people still feel a packrat's instinct with data. They feel that this information should be kept around, just in case.
This is a natural human wish, but it should also be balanced by the just as natural aversion to responsibility. Most businesses don't have to pay the price if a customer's identity gets stolen, their credit cards get cloned, or their bank account is raided. This may change as more people and businesses become aware of the danger of misused information and the responsibility to protect it.
If anyone has thoughts about the advantages and limitations of the approach taken here, please write.
--- Peter Wayner,
p3 (at) wayner (dooot) org