LRU缓存

LRU Cache

LRU即Least Recently Used, 近期最少使用, 如名字, 作为一种缓存, 在容量受限的情况下, 会优先淘汰最少使用的对象.

在Python中, 位置在functools.lru_cache, 是一个修饰器, 用户对需要缓存结果的函数在定义时, 进行修饰. 在运行中, 作为一个代替redis的轻缓存, 以函数的入参为key, 以该入参运行得到的结果为value, 进行缓存. 调用此函数时, 如果入参在缓存中, 则直接返回结果, 不再调用函数, 大大加速了运行效率.

根据以上的特点, 可以得到, 想使用lru_cache, 必须满足:

入参必须是不可变参数, 如int, str, tuple等, list, dict等是无法使用的
- 遇到这种情况, 只能通过闭包的方法解决
函数的映射关系是固定的, 即对于同一个输入, 输出必须是固定不变的. 对于一些有状态的函数可能无法使用

原理

functools.lru_cache是通过一个双向链表外加一个字典实现缓存的.

字典的作用是在 $O(1)$ 的时间内找到结果
双向链表的作用是在 $O(1)$ 的时间内删除掉最少使用的key

源码

lru_cache

def lru_cache(maxsize=128, typed=False):
  if maxsize is not None and not isinstance(maxsize, int):
  raise TypeError('Expected maxsize to be an integer or None')

  def decorating_function(user_function):
    wrapper = _lru_cache_wrapper(user_function, maxsize, typed, _CacheInfo)
    return update_wrapper(wrapper, user_function)

  return decorating_function

lru_cache接受两个参数:

maxsize: 缓存的数量, 即双向链表的最大长度. 可以为None, 表示没有限制; 可以为0, 表示不使用缓存
typed: 以入参作为缓存, 此参数是指是否对入参的类型做区分. 例如3.0和3, 在typed=False的情况下, 共用同一缓存

lru_cache_wrapper

函数的逻辑是在_lru_cache_wrapper中实现的, 代码分段如下:

def _lru_cache_wrapper(user_function, maxsize, typed, _CacheInfo):
    # Constants shared by all lru cache instances:
    sentinel = object()          # 特殊标记, 用来表示缓存未命中
    make_key = _make_key         # 根据入参来生成key的函数
    PREV, NEXT, KEY, RESULT = 0, 1, 2, 3   # 表示单个节点的4个字段

    cache = {}
    hits = misses = 0
    full = False
    cache_get = cache.get    # bound method to lookup a key or return None
    cache_len = cache.__len__  # get cache size without calling len()
    lock = RLock()           # because linkedlist updates aren't threadsafe
    root = []                # root of the circular doubly linked list
    root[:] = [root, root, None, None]     # initialize by pointing to self

sentinel是一个唯一的对象, 在从字典中获取缓存结果时, 如果对应的key不在字典中, 默认返回这个值. 之所以用这个值, 是因为None也是函数可能的返回值, 不能作为判定key是否在字典中的条件.

对于一个结点, 我们使用一个长度为4, 表示为[prev, key, result, next]的列表来存储, 每个位置代表的意义是固定的, 分别为双向链表上一个节点(实际上是一个列表), key值, 对应的结果值, 下一个结点. 因此用PREV, NEXT, KEY, RESULT = 0, 1, 2, 3常量来表示和记录每个位置, 及其意思.

cache就是记录数据的字典, 根据key获取结果.

root代表双向链表的根节点. 双向链表即环形链表, 往往使用一个空节点(没有key和value, 只连接前后结点)作为定位, 因此长度为 $N$ 的双向链表, 其实只存储了 $N-1$ 个数据点. 在后面操作更新状态时, 要维护好root节点.

maxsize == 0

if maxsize == 0:
    def wrapper(*args, **kwds):
        # No caching -- just a statistics update after a successful call
        nonlocal misses
        result = user_function(*args, **kwds)
        misses += 1
        return result

这种情况就是不使用缓存, 直接调用函数返回即可.

maxsize is None

elif maxsize is None:
    def wrapper(*args, **kwds):
        # Simple caching without ordering or size limit
        nonlocal hits, misses
        key = make_key(args, kwds, typed)
        result = cache_get(key, sentinel)
        if result is not sentinel:
            hits += 1
            return result
        result = user_function(*args, **kwds)
        cache[key] = result
        misses += 1
        return result

这种情况代表缓存大小没有限制, 在这种情况下就不涉及不常使用的key被淘汰的问题, 无需(也不能)使用双向链表, 直接从字典中获取缓存的值即可.

有大小限制的缓存

def wrapper(*args, **kwds):
    # Size limited caching that tracks accesses by recency
    nonlocal root, hits, misses, full
    key = make_key(args, kwds, typed)
    with lock:
        link = cache_get(key)
        if link is not None:
            # Move the link to the front of the circular queue
            link_prev, link_next, _key, result = link
            link_prev[NEXT] = link_next
            link_next[PREV] = link_prev
            last = root[PREV]
            last[NEXT] = root[PREV] = link
            link[PREV] = last
            link[NEXT] = root
            hits += 1
            return result
    result = user_function(*args, **kwds)
    with lock:
        if key in cache:
            # Getting here means that this same key was added to the
            # cache while the lock was released.  Since the link
            # update is already done, we need only return the
            # computed result and update the count of misses.
            pass
        elif full:
            # Use the old root to store the new key and result.
            oldroot = root
            oldroot[KEY] = key
            oldroot[RESULT] = result
            # Empty the oldest link and make it the new root.
            # Keep a reference to the old key and old result to
            # prevent their ref counts from going to zero during the
            # update. That will prevent potentially arbitrary object
            # clean-up code (i.e. __del__) from running while we're
            # still adjusting the links.
            root = oldroot[NEXT]
            oldkey = root[KEY]
            oldresult = root[RESULT]
            root[KEY] = root[RESULT] = None
            # Now update the cache dictionary.
            del cache[oldkey]
            # Save the potentially reentrant cache[key] assignment
            # for last, after the root and links have been put in
            # a consistent state.
            cache[key] = oldroot
        else:
            # Put result in a new link at the front of the queue.
            last = root[PREV]
            link = [last, root, key, result]
            last[NEXT] = root[PREV] = cache[key] = link
            # Use the cache_len bound method instead of the len() function
            # which could potentially be wrapped in an lru_cache itself.
            full = (cache_len() >= maxsize)
        misses += 1
    return result

这种情况, 才是正常的情况. 上面是完整的代码, 拆开来看.

获取缓存

with lock:
    link = cache_get(key)
    if link is not None:
        # Move the link to the front of the circular queue
        link_prev, link_next, _key, result = link
        link_prev[NEXT] = link_next
        link_next[PREV] = link_prev
        last = root[PREV]
        last[NEXT] = root[PREV] = link
        link[PREV] = last
        link[NEXT] = root
        hits += 1
        return result

根据入参得到对应的key, 从字段中尝试获取缓存的结果. 如果是有对应缓存的, 在返回结果之前, 要更新双向链表. 这是因为我们激活了key值, 需要将这个key移动到代表最新的位置. 代码中对应的是root前面(prev)的结点.

link_prev, link_next, _key, result = link: 得到key对应结点的前后结点link
link_prev[NEXT] = link_next, link_next[PREV] = link_prev: 将key的前后结点相接, 因为key结点移走了
last = root[PREV]: root之前的结点, 是刚才最新的结点, 获取这个结点, 接下来要把key这个结点插入到last和root之间
last[NEXT] = root[PREV] = link: root之前的结点更新为link, last之后的结点更新为link, 完成了link之外相关结点的更新, 接下来更新link本身
link[PREV] = last, link[NEXT] = root: 更新link本身, 双向链表调整完成

执行函数

result = user_function(*args, **kwds)

只有在缓存中不存在时才会执行.

写入缓存

执行完函数, 就要把函数运行的结果和对应的key写入到缓存中.

if key in cache:
    # Getting here means that this same key was added to the
    # cache while the lock was released.  Since the link
    # update is already done, we need only return the
    # computed result and update the count of misses.
    pass

这里考虑了多线程的情况. 可能当前线程执行和另一个线程执行同一入参, 之前缓存不存在, 两个线程都去执行了函数. 但另一个线程快一些, 已经执行完毕, 并且写入了缓存. 轮到此线程时, 再判断一下是否已经写入到缓存了, 如果已经写入, 就不用再费心思, 跳过.

队列已满

elif full:
    # Use the old root to store the new key and result.
    oldroot = root
    oldroot[KEY] = key
    oldroot[RESULT] = result
    # Empty the oldest link and make it the new root.
    # Keep a reference to the old key and old result to
    # prevent their ref counts from going to zero during the
    # update. That will prevent potentially arbitrary object
    # clean-up code (i.e. __del__) from running while we're
    # still adjusting the links.
    root = oldroot[NEXT]
    oldkey = root[KEY]
    oldresult = root[RESULT]
    root[KEY] = root[RESULT] = None
    # Now update the cache dictionary.
    del cache[oldkey]
    # Save the potentially reentrant cache[key] assignment
    # for last, after the root and links have been put in
    # a consistent state.
    cache[key] = oldroot

写入新缓存时队列已满, 我们就要剔除掉最旧的key. 环形链表中, 所有的key按操作时间排序, 以root为界限, root之前是最新的结点, 那么root之后是最旧的结点, 应该被剔除掉. 剔除的方法, 是把root结点转移到最旧的结点(即root[NETX])上, 新的key对应的结点覆盖原来root的位置.

oldroot = root: 获取此时的根节点, 即旧的root节点, oldroot
oldroot[KEY] = key, oldroot[RESULT] = result: 将新的key的信息保存到oldroot上. 这是因为root节点都是空节点, 没有什么好转移的, 直接覆盖就好
root = oldroot[NEXT]: 新的root结点选取为原来的根节点oldroot之后的结点
oldkey = root[KEY], oldresult = root[RESULT]: 获取要删除的, 最旧结点的key值, 后面在缓存字典cache中删掉
root[KEY] = root[RESULT] = None: root结点没有key, result, 清空
del cache[oldkey]: 从缓存中删除最旧的key
cache[key] = oldroot: 缓存字典中装入最新的key

队列未满

可能是有限长度缓存还未装满, 或者缓存的大小就没有限制. 只需要在root和root[PREV]之间插入一个新的结点即可. 最后更新一下链表是否已满的状态.

else:
    # Put result in a new link at the front of the queue.
    last = root[PREV]
    link = [last, root, key, result]
    last[NEXT] = root[PREV] = cache[key] = link
    # Use the cache_len bound method instead of the len() function
    # which could potentially be wrapped in an lru_cache itself.
    full = (cache_len() >= maxsize)

参考资料

最后更新于4年前

这有帮助吗？