Thursday, January 12, 2012

Creating a Lock with Redis

This post is a partially rewritten excerpt from Redis in Action, the book that I am writing for Manning Publications, which does not currently have a release date. Over time, I hope to include other excerpts from the book as time allows.

Redis Transactions

In the world of concurrent and parallel programming, there will be some point where you will write something that requires access to a resource without any other thread or process accessing it. In Redis this is actually very common; multiple readers, multiple writers, all dealing with the same data structures. Redis includes a combination of five commands for handling optimistic data locking. Those commands are WATCH, UNWATCH, MULTI, EXEC, and DISCARD.

The typical path for code that intends to modify data in Redis that requires checking values before updating will have a 4-step process. First you WATCH data that you want to modify/check on. Next you check your data. If it isn't what you need to continue, you UNWATCH and return. If it was what you were looking for, you start a MULTI transaction, send your commands, then EXEC them. Below is a simple example that transfers money between two accounts, making sure to prevent overdrafts.

def transfer_funds(conn, sender, recipient, amount):
    pipe = conn.pipeline(True)
    while 1:
        try:
            pipe.watch(sender)
            if pipe.hget(sender, 'money') < amount:
                pipe.unwatch()
                return False

            pipe.multi()
            pipe.hincrby(sender, 'money', -amount)
            pipe.hincrby(recipient, 'money', amount)
            pipe.execute()
            return True
        except redis.exceptions.WatchError:
            pass
The only command not used above is the DISCARD command, which will discard any commands passed after MULTI. As you can see, we will attempt to loop through the above money transfer until it either fails due to no money, or succeeds. This is fine, but if you have some account that is updated often, or if you have some shared data structure that is constantly being updated, you can fall into the except clause and have to retry the transaction repeatedly.

Locking

In other software, rather than using WATCH/MULTI/EXEC, there is a primitive called a Lock, which allows us to gain exclusive access to a resource. You acquire the lock, perform your operation, then release the lock. Because of the variety of commands in Redis, you can build a lock using the available tools. Unfortunately, using those tools correctly is not easy. In my research about locks in Redis, I haven't found a single implementation available online that is 100% correct (until now). Some of the problems with locks that I have seen are as follows:
  1. A process acquired a lock, operated on data, but took too long and the lock was automatically released. The process doesn't know that it lost the lock, or may even release the lock that some other process has since acquired.
  2. A process acquired a lock for an operation that takes a long time, and crashed. Other processes that want the lock don't know what process had the lock, so cannot detect that the process failed, and wastes time waiting for the lock to be released.
  3. One process had a lock, but it timed out. Other processes try to acquire the lock simultaneously, and multiple processes are able to get the lock.
  4. Because of a combination of #1 and #3, many processes now hold the believed exclusive lock, leading to data corruption or other incorrect behavior.
Even if each of these problems had a 1 in a million chance of occurring, Redis' high performance (typically 100,000 to 225,000 operations/second) can cause those problems to occur under heavy load surprisingly often (up to a few times per minute), so it is important to get locking right.

Building a mostly correct lock in Redis is pretty easy. Building a completely correct lock in Redis isn't actually much more difficult, but it requires being extra careful about the operations we use to build it (in the book, we first build a simple lock to show the basics, then add in the full functionality, here we will jump to the fully-featured lock).

The first part of making sure that no other code can run is to "acquire" the lock. The natural building block to use for acquiring a lock is the SETNX command, which will only set a value if the value doesn't already exist. We will set the value to be a unique identifier to ensure that no other process can get the lock. If we were able to set the value (we have acquired the lock), then we immediately set the expiration of the key to ensure that if we take too long with our operation, the lock is eventually released. But if our client happens to crash (and the worst place for it to crash for us is between SETNX and EXPIRE), we still want the lock to eventually time out. To handle that situation, any time a client fails to get the lock, the client will check the expiration on the lock, and if it's not set, set it. Because clients are going to be checking and setting timeouts if they fail to get a lock, the lock will always have a timeout, and will eventually expire, letting other clients get a timed out lock.

But what if multiple clients set the expiration time at the same time? That is fine, they will be running essentially at the same time, so the expiration will be set for the same time.

def acquire_lock(conn, lockname, identifier, atime=10, ltime=10):
    end = time.time() + atime
    while end > time.time():
        if conn.setnx(lockname, identifier):
            conn.expire(lockname, ltime)
            return identifier
        elif not conn.ttl(lockname):
            conn.expire(lockname, ltime)

        time.sleep(.001)

    return False
To release the lock, we have to be at least as careful as when acquiring the lock. Between the time when we acquired the lock and when we are trying to release it, someone may have done bad things to the lock. To release the lock, we actually need to WATCH the lock key, then check to make sure that the value is still the same as what we set it to before we delete it. This also prevents us from releasing a lock multiple times.
def release_lock(conn, lockname, identifier):
    pipe = conn.pipeline(True)

    while True:
        try:
            pipe.watch(lockname)
            if pipe.get(lockname) == identifier:
                pipe.multi()
                pipe.delete(lockname)
                pipe.execute()
                return True

            pipe.unwatch()
            break

        except redis.exceptions.WatchError:
            pass

    # we lost the lock
    return False
And as a bonus, a Python context manager with the updated transfer_funds() code using it...
import contextlib, uuid

class LockException(Exception):
    pass

@contextlib.contextmanager
def redis_lock(conn, lockname, atime=10, ltime=10):
    identifier = str(uuid.uuid4())
    if acquire_lock(**locals()) != identifier:
        raise LockException("could not acquire lock")
    try:
        yield identifier
    finally:
        if not release_lock(conn, lockname, identifier):
            raise LockException("lock was lost")

def transfer_funds(conn, sender, recipient, amount):
    with redis_lock(conn, 'lock:' + sender):
        if conn.hget(sender, 'money') < amount:
            return False

        pipe = conn.pipeline(True)
        pipe.hincrby(sender, 'money', -amount)
        pipe.hincrby(recipient, 'money', amount)
        pipe.execute()
        return True
If you generated your identifier correctly (UUID like we did, or IP address + process id + thread id, etc.), this lock is correct. Not almost correct, not mostly correct, not a little bit wrong. Completely correct. Other locks that I've seen for Redis have one of a couple different mistakes, usually either accidentally resetting the timeout when you shouldn't, or deleting the lock when you shouldn't. Those kinds of errors lead to the 4 problems I listed earlier.

Simplifying Locking

After reading this, you may think that this is a little more work than should be necessary to build a basic lock with timeouts. You are not alone. Two requests have been made to try to help the situation. The first is allowing SETNX to take an optional 3rd argument that is the expiration time if the item was set. That reduces the Redis commands in acquire_lock() to one command that is looped over (the 10 lines above turn into 4 lines). The second is a new command DELC <key> <value>, which will only delete the given key if the current value is the same as the passed value. This reduces the commands in release_lock() to one command that is executed once in the body of the function (the 15 lines above turn into 2 lines). You can read and follow the discussion on the Redis mailing list.


Thank You

If you liked this article about Redis, the section on locks that will be included in the book expands the discussion and includes more examples. If Redis changes to incorporate the new commands and features to make locking easier, you can look forward to another article revisiting classic Redis locking behavior. If you want to see more posts like this, you can buy my book, Redis in Action from Manning Publications today!

7 comments:

  1. Hey Josh,
    Thanks for the article.

    The new EVAL command would be able to achieve your "simplified locking" scheme.

    lock (setnx with expire)

    local val = redis.call('setnx', KEYS[1], KEYS[2])
    if (val == 1)
    then
    redis.call('expire', KEYS[1], KEYS[3])
    return KEYS[2]
    end
    return 0

    unlock (delete if current)

    local val = redis.call('get', KEYS[1])
    if (val == KEYS[2])
    then
    redis.call('del', KEYS[1])
    return 1
    end
    return 0

    Thoughts ?

    ReplyDelete
    Replies
    1. Indeed, with EVAL/EVALSHA, locks can be cleaned up. See chapter 11 section 2 of my book, Redis in Action, for the version that I use. It is similar to what you describe, but uses get+setex instead of setnx+expire.

      Delete
  2. Hi Josiah,

    the first of the problems you're describing in your intro is the following:

    [..]
    1. A process acquired a lock, operated on data, but took too long and the lock was automatically released. The process doesn't know that it lost the lock, or may even release the lock that some other process has since acquired.
    [..]

    I don't see how you address this issue in your design. Here's an example:
    1. Client A tries to acquire the lock and succeeds, at time 0.

    2. Client A immediately sets an expiration to let's say, 30 seconds

    3. Client A starts the critical section which is quite heavy, thus taking totally 40 seconds

    4. Somewhere in the middle (let's say at time 34) comes Client B, trying to acquire the lock

    5. Since the lock created by client A expired 4 seconds ago (at time 30), B's call to setnx succeeds and it acquires the lock

    6. Client A never finds out and between times 34 and 40, both A and B occupy the critical section

    7. At least A does not release B's lock, but that's the least of our problems right now. Our biggest one is that A and B were in the critical section for 6 seconds

    Am I missing something? If not, I don't see any way around this, except for a "refresh_lock" issued by A at intervals necessarily smaller than the expiration time.

    Of course client A could always set an incredibly high expiration time, but that would practically:
    1. Lower the possibility that A runs late and occupies the critical section longer than the (incredibly high) timeout, thus losing the lock. But stil, "lower" means "tends to zero but it's not zero", so the lock is not completely safe, but is "very safe", with "very" being quantified as the probability that something goes really wrong and A needs (more) incredibly large amount of time to leave the critical section.

    2. Had a terrible impact on concurrency. In case A crashed, then all other clients would be locked out waiting for an incredibly high expiration to take place.

    ReplyDelete
    Replies
    1. If you want locks to time out properly, the critical observation is that you must set your lock timeout to be longer than the time your process must be inside the critical section.

      As an alternative to picking longer lock times is to pick a shorter time and to "refresh" the lock periodically. Something like https://gist.github.com/josiahcarlson/5453257 , which is an answer to a similar question from the Manning forums: http://www.manning-sandbox.com/thread.jspa?threadID=56702&tstart=0 .

      Of course then you are going to need to pick a refresh timeout that is long enough to prevent a slow refresh from losing the lock, but not so long as to block other clients that want the lock.


      Building a lock in Redis has a variety of trade-offs, and the locks that can be built in Redis aren't necessarily a useful solution for 100% of applications that could find a use for locks. If you know what timeouts to use for initial and refresh timeouts, that can help locks in Redis be useful for more applications. But if you don't know what timeouts to use, or if you can't make refresh calls, then they aren't going to work for you.

      Delete
  3. Thanks for the nice post. Our first locking algorithm had problems described by you. Also in new Redis version 'SET' command has many useful options such as key expiration and nx. Definitely I'll buy your book in Kindle version.

    ReplyDelete
  4. Hi,

    I want to set a lock for a particular KEY.
    can some one tell me how to do that.

    -Thanks,

    ReplyDelete
    Replies
    1. Please see this message in the forum (locks are semantics over data, you can't actually prevent writing to a key in Redis): https://forums.manning.com/posts/list/39216.page

      Delete