Asynchronous processing using celery

As we posted earlier, we have converted our backend to work asynchronously. In this post, we will explain what this means and how one can go about writing an application that will use asynchronous queues with celery. If you aren't particularly technical, this post is unlikely to interest you very much, but you're still welcome to read on to get a glimpse of how the internals of historious work!

As you probably know, historious is a bookmark search engine. You enter a URL, and it gets downloaded (unless you provide the source, which has many advantages such as bypassing paywalls and registrations) and indexed. This way, you can search across all your bookmarks for any word or words that appear in the text of the page, rather than just the title or the tags. You can also tag your bookmarks, but that is optional.

So, to support this architecture, we need something that will accept a URL, download it, index it and store it. Until a few days ago, this was done as follows:

  1. The URL is stored in a database table (along with the source, if it exists).
  2. A daemon polls the database every second for documents that haven't been downloaded yet, downloads them and stores their source in the above database table.
  3. A second daemon polls the database every second for documents that haven't been indexed yet, indexes them and marks them as indexed in the database.
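
The three steps above can be sketched roughly as follows. This is only an illustration, not our actual code: the table layout, the function names and the stand-in downloader and indexer are all made up for the example, and sqlite is used only so that the sketch runs on its own.

```python
import sqlite3

# Hypothetical schema: one row per bookmark, with a flag for the indexing stage.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents ("
    "id INTEGER PRIMARY KEY, url TEXT, source TEXT, indexed INTEGER DEFAULT 0)"
)


def add_url(url, source=None):
    """Step 1: store the URL (and the source, if the client supplied it)."""
    conn.execute("INSERT INTO documents (url, source) VALUES (?, ?)", (url, source))


def download_pending(fetch):
    """Step 2: one pass of the download daemon; returns how many rows it filled in."""
    rows = conn.execute("SELECT id, url FROM documents WHERE source IS NULL").fetchall()
    for doc_id, url in rows:
        conn.execute("UPDATE documents SET source = ? WHERE id = ?", (fetch(url), doc_id))
    return len(rows)


def index_pending(index):
    """Step 3: one pass of the indexing daemon over downloaded, unindexed rows."""
    rows = conn.execute(
        "SELECT id, source FROM documents WHERE source IS NOT NULL AND indexed = 0"
    ).fetchall()
    for doc_id, source in rows:
        index(doc_id, source)
        conn.execute("UPDATE documents SET indexed = 1 WHERE id = ?", (doc_id,))
    return len(rows)


def fake_fetch(url):
    """Stand-in for the real downloader."""
    return "<html>page at %s</html>" % url


search_index = {}  # stand-in for the real search index

add_url("http://example.com/a")
add_url("http://example.com/b", source="<html>supplied by the client</html>")

# Each daemon would run its pass in a loop, sleeping for a second between passes.
downloaded = download_pending(fake_fetch)
indexed = index_pending(search_index.__setitem__)
```

Note that both passes run a SELECT every time, whether or not anything was added, which is exactly the cost discussed next.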

While this has worked well so far, querying a table with millions of rows every second is a considerable waste of resources. The table is indexed, of course, so the query only takes a fraction of a second, but the database still gets hit, which can be bad when the server is already under load.

This approach, however, has the advantage that a user who adds a URL doesn't have to wait for the URL to be downloaded and indexed, which can take a few seconds. The user is assured that the document they just historified has been successfully added within a few milliseconds, which is the time it takes to insert a URL in the database. This is especially useful when the user migrates to historious from another bookmarking service and has many thousands of URLs to add.

As a first improvement, instead of querying the entire table every second, we set a flag in our redis store every time a document was added. This way, we could just poll the (much faster) redis key for changes instead of the database, and only hit the database when necessary. This approach did improve matters, but it was a bit unwieldy: the flag might be dropped and raised again in odd ways, and some documents were left unindexed for a few seconds until another document was added.
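
The idea behind the flag can be sketched like this. A plain dict stands in for the redis store and a list for the documents table, so the key name "documents_pending" and all the function names are hypothetical, not what we actually ran.

```python
# Stand-ins: a dict for the redis store, a list for the documents table.
flag_store = {"documents_pending": False}
pending_documents = []
database_queries = 0


def add_document(url):
    pending_documents.append(url)
    flag_store["documents_pending"] = True  # raise the flag on every insert


def poll_once():
    """One daemon tick: check the cheap flag, query the table only if it is set."""
    global database_queries
    if not flag_store["documents_pending"]:
        return []
    flag_store["documents_pending"] = False  # drop the flag before doing the work
    database_queries += 1  # this is where the real SELECT would go
    batch = list(pending_documents)
    pending_documents.clear()
    return batch


idle_ticks = [poll_once() for _ in range(5)]  # nothing added: no queries at all
add_document("http://example.com/a")
processed = poll_once()
```

The order of the drop and the query matters here. If the flag were dropped after the query instead, a document added in between would sit unnoticed until the next insert raised the flag again, which is one way the kind of delay described above can arise.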

An even better solution, and the one we have implemented now, is to use celery to run the functions asynchronously. This way, we can just call the downloading function, which will, in turn, call the indexing function when it's done. Since we are already using redis, and RabbitMQ doesn't support priorities yet, we decided to use the redis store for celery. Redis has "databases", which are basically namespaces for keys, so the celery database doesn't interfere with our caching one.

Adding celery to the project was very easy using the django-celery package. To use redis, all one needs to do is add the following lines to the settings.py file:

CARROT_BACKEND = "ghettoq.taproot.Redis"
BROKER_HOST = "localhost"
BROKER_PORT = 6379
BROKER_VHOST = "0"

CELERY_RESULT_BACKEND = "redis"
REDIS_HOST = "localhost"
REDIS_PORT = 6379
REDIS_DB = "1"

Adding both backends is necessary, as the first one is the messages backend (where celery stores the function calls) and the second is the results backend (where celery stores the return values).
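
For comparison, later celery releases describe the broker and the results backend with URLs instead of separate host, port and database settings. A rough equivalent of the settings above might look like the fragment below. It assumes a django project that loads its celery settings with the CELERY namespace, which is not the setup described in this post.

```python
# settings.py fragment for newer celery versions (assumed, not our original setup).
# The number after the final slash selects the redis database.
CELERY_BROKER_URL = "redis://localhost:6379/0"      # messages backend
CELERY_RESULT_BACKEND = "redis://localhost:6379/1"  # results backend
```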

After doing this, we can just call our functions in the normal celery way (myfunction.delay(<arguments>)), and celery will take care of everything for us.
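
To show the flow without needing celery or a running redis, here is a small stand-in built only from the Python standard library. It is not celery's API: the delay helper, the task names and the fake download are invented for the illustration. What it does show is the shape of the thing: the caller queues a download and returns at once, and the download task queues the indexing task when it is done.

```python
import queue
import threading

jobs = queue.Queue()  # stands in for the messages backend
search_index = {}     # stands in for the real index


def delay(func, *args):
    """Queue a call and return at once, loosely like myfunction.delay(...)."""
    jobs.put((func, args))


def index_document(url, source):
    search_index[url] = source.lower().split()


def download_document(url):
    source = "Text of %s" % url         # a real task would fetch the page here
    delay(index_document, url, source)  # hand off to the next stage when done


def worker():
    """Run queued calls one by one, as a worker process would."""
    while True:
        func, args = jobs.get()
        try:
            func(*args)
        finally:
            jobs.task_done()


threading.Thread(target=worker, daemon=True).start()

delay(download_document, "http://example.com/a")  # returns immediately
jobs.join()  # wait for the queue to drain, only so the sketch can check its result
```

In the real application the web request never waits like the last line does; the waiting is there only so that the result can be inspected.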

At this point, the first piece of the puzzle is done. What's equally important, though, is to ensure that users who run big import jobs won't disturb users who use the bookmarklet or extensions. When using the bookmarklet, the source of the current page gets submitted to historious, so we don't need to download it ourselves. Since the source is already there, we can just index it, and we need to make sure that these indexing tasks run before the downloading tasks.

When a user adds two thousand bookmarks, they are all added as "download" jobs. Without priorities, everyone else would have to wait for all two thousand bookmarks to be downloaded (and indexed) before their own bookmarks were processed. This is clearly unacceptable, so we need to find a way to prioritise indexing jobs over downloading jobs.

Fortunately, celery supports task priorities. RabbitMQ doesn't support them yet, but redis does, and they work fine (the docs are a bit hazy on this, but they do work). To specify a task's priority, all we need to do is add the argument to the decorator (@task(priority=0)). Taking advantage of this, we give the indexing jobs a higher priority than the downloading jobs, and celery will automatically run any indexing job before all the downloading jobs, even if the indexing job is added last!
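
The effect of the priorities can be illustrated with a small priority queue from the standard library. Again, this is not celery code. The priority values are made up, and the sketch assumes that a lower number means a higher priority.

```python
import heapq
import itertools

# Assumption for this sketch: a lower number means a higher priority.
INDEX_PRIORITY = 0
DOWNLOAD_PRIORITY = 9

pending = []                  # heap of (priority, sequence, job name, url)
sequence = itertools.count()  # tie-breaker that keeps equal priorities in FIFO order


def enqueue(priority, job, url):
    heapq.heappush(pending, (priority, next(sequence), job, url))


def next_job():
    """Pop the most urgent job, oldest first among equal priorities."""
    priority, _, job, url = heapq.heappop(pending)
    return job, url


# A large import queues two thousand downloads...
for n in range(2000):
    enqueue(DOWNLOAD_PRIORITY, "download", "http://example.com/import/%d" % n)

# ...and then someone uses the bookmarklet, which only needs indexing.
enqueue(INDEX_PRIORITY, "index", "http://example.com/bookmarklet")

first = next_job()
second = next_job()
```

Even though the indexing job was queued last, it comes out first, and the imported bookmarks then follow in the order they were added.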

As you can see, adding celery support to an existing project is very simple. Just add the above configuration to your settings.py, declare your tasks, run the daemon and call them! Celery has been working very well, both in testing and in production (for the few hours it has been there). Most importantly, the load on the server remains very low, even though users are constantly adding new documents. This is a great improvement on the old way of doing things, and we hope you have gotten enough of a taste of celery to use it in your own projects!

Good luck, and don't hesitate to comment on this post with your experiences!

Team historious

