Importing thousands of entities into CubicWeb within a few seconds with dataimport

In most cubicweb projects I've been developing on, there always comes a time where I need to import legacy data in the new application. CubicWeb provides Store and Controller objects in the dataimport module. I won't talk here about the recommended general procedure described in the module's docstring (I find it a bit convoluted for simple cases) but I will focus on Store objects. Store objects in this module are more or less a thin layer around session objects, they provide high-level helpers such as create_entity(), relate() and keep track of what was inserted, errors occurred, etc.

In a recent project, I had to create a somewhat fair amount (a few million) of simple entities (strings, integers, floats and dates) and relations. Default object store (i.e. cubicweb.dataimport.RQLObjectStore) is painfully slow, the reason being all integrity / security / metadata hooks that are constantly selected and executed. For large imports, dataimport also provides the cubicweb.dataimport.NoHookRQLObjectStore. This store bypasses all hooks and uses the underlying system source primitives directly, making it around two-times faster than the standard store. The problem is that we're still doing each sql query sequentially and we're talking here of millions of INSERT / UPDATE queries.

My idea was to create my own ObjectStore class inheriting from NoHookRQLObjectStore that would try to use executemany or even copy_from when possible [1]. It is actually not hard to make groups of similar SQL queries since create_entity() generates the same query for a given set of parameters. For instance:

create_entity('Person', firstname='John', surname='Doe')
create_entity('Person', firstname='Tim', surname='BL')

will generate the following sql queries:

INSERT INTO cw_Person ( cw_cwuri, cw_eid, cw_modification_date,
                        cw_creation_date, cw_firstname, cw_surname )
       VALUES ( %(cw_cwuri)s, %(cw_eid)s, %(cw_modification_date)s,
                %(cw_creation_date)s, %(cw_firstname)s, %(cw_surname)s )
INSERT INTO cw_Person ( cw_cwuri, cw_eid, cw_modification_date,
                        cw_creation_date, cw_firstname, cw_surname )
       VALUES ( %(cw_cwuri)s, %(cw_eid)s, %(cw_modification_date)s,
                %(cw_creation_date)s, %(cw_firstname)s, %(cw_surname)s )

The only thing that will differ is the actual data inserted. Well ... ahem ... CubicWeb actually also generates a "few" extra sql queries to insert metadata for each entity:

INSERT INTO is_instance_of_relation(eid_from,eid_to) VALUES (%s,%s)
INSERT INTO is_relation(eid_from,eid_to) VALUES (%s,%s)
INSERT INTO cw_source_relation(eid_from,eid_to) VALUES (%s,%s)
INSERT INTO owned_by_relation ( eid_to, eid_from ) VALUES ( %(eid_to)s, %(eid_from)s )
INSERT INTO created_by_relation ( eid_to, eid_from ) VALUES ( %(eid_to)s, %(eid_from)s )

Those extra queries are actually even exactly the same for each entity insterted, whatever the entity type is, hence craving for executemany or copy_from. Grouping together SQL queries is not that hard [2] but has a drawback : as you don't have an intermediate state (the data is actually inserted only at the very end of the process), you loose the ability to query your database to fetch the entities you've just created during the import.

Now, a few benchmarks ...

To create those benchmarks, I decided to use the workorder cube which is a simple cube, yet complete enough : it provides only two entity types (WorkOrder and Order), a relation between them (Order split_into WorkOrder) and uses different kind of attributes (String, Date, Float).

Once the cube was instantiated, I ran the following script to populate the database with my 3 different stores:

import sys
from datetime import date
from random import choice
from itertools import count

from logilab.common.decorators import timed

from cubicweb import cwconfig
from cubicweb.dbapi import in_memory_repo_cnx

def workorders_data(n, seq=count()):
    for i in xrange(n):
        yield \{'title': u'wo-title%s' % seq.next(), 'description': u'foo',
               'begin_date': date.today(), 'end_date': date.today()\}

def orders_data(n, seq=count()):
    for i in xrange(n):
        yield \{'title': u'o-title%s' % seq.next(), 'date': date.today(), 'budget': 0.8\}

def split_into(orders, workorders):
    for workorder in workorders:
        yield choice(orders), workorder

def initial_state(session, etype):
    return session.execute('Any S WHERE S is State, WF initial_state S, '
                           'WF workflow_of ET, ET name %(etn)s', \{'etn': etype\})[0][0]


@timed
def populate(store, nb_workorders, nb_orders, set_state=False):
    orders = [store.create_entity('Order', **attrs)
              for attrs in orders_data(nb_orders)]
    workorders = [store.create_entity('WorkOrder', **attrs)
                  for attrs in workorders_data(nb_workorders)]
    ## in_state is set by a hook, so NoHookObjectStore will need
    ## to set the relation manually
    if set_state:
        order_state = initial_state(store.session, 'Order')
        workorder_state = initial_state(store.session, 'WorkOrder')
        for order in orders:
            store.relate(order.eid, 'in_state', order_state)
        for workorder in workorders:
            store.relate(workorder.eid, 'in_state', workorder_state)
    for order, workorder in split_into(orders, workorders):
        store.relate(order.eid, 'split_into', workorder.eid)
    store.commit()


if __name__ == '__main__':
    config = cwconfig.instance_configuration(sys.argv[1])
    nb_orders = int(sys.argv[2])
    nb_workorders = int(sys.argv[3])
    repo, cnx = in_memory_repo_cnx(config, login='admin', password='admin')
    session = repo._get_session(cnx.sessionid)
    from cubicweb.dataimport import RQLObjectStore, NoHookRQLObjectStore
    from cubes.mycube.dataimport.store import CopyFromRQLObjectStore
    print 'testing RQLObjectStore'
    store = RQLObjectStore(session)
    populate(store, nb_workorders, nb_orders)
    print 'testing NoHookRQLObjectStore'
    store = NoHookRQLObjectStore(session)
    populate(store, nb_workorders, nb_orders, set_state=True)
    print 'testing CopyFromRQLObjectStore'
    store = CopyFromRQLObjectStore(session)

I ran the script and asked to create 100 Order entities, 1000 WorkOrder entities and to link each created WorkOrder to a parent Order

adim@esope:~/tmp/bench_cwdi$ python bench_cwdi.py bench_cwdi 100 1000
testing RQLObjectStore
populate clock: 24.590000000 / time: 46.169721127
testing NoHookRQLObjectStore
populate clock: 8.100000000 / time: 25.712352991
testing CopyFromRQLObjectStore
populate clock: 0.830000000 / time: 1.180006981

My interpretation of the above times is :

The clock time indicates the time spent on CubicWeb server side (i.e. hooks and data pre/postprocessing around SQL queries). The time time should be the sum of clock time + time spent in postgresql.
RQLObjectStore is slow ;-). Nothing new here, but the clock/time ratio means that we're speding a lot of time on the python side (i.e. hooks as I told earlier) and a fair amount of time in postgresql.
NoHookRQLObjectStore really takes down the time spent on the python side, the time in postgresql remains about the same as for RQLObjectStore, this is not surprising, queries performed are the same in both cases.
CopyFromRQLObjectStore seems blazingly fast in comparison (inserting a few thousands of elements in postgresql with a COPY FROM statement is not a problem). And ... yes, I checked the data was actually inserted, and I even a ran a cubicweb-ctl db-check on the instance afterwards.

This probably opens new perspective for massive data imports since the client API remains the same as before for the programmer. It's still a bit experimental, can only be used for "dummy", brute-force import scenario where you can preprocess your data in Python before updating the database, but it's probably worth having such a store in the the dataimport module.

[1] The idea is to promote an executemany('INSERT INTO ...', data) statement into a COPY FROM whenever possible (i.e. simple data types, easy enough to escape). In that case, the underlying database and python modules have to provide support for this functionality. For the record, the psycopg2 module exposes a copy_from() method and soon logilab-database will provide an additional high-level helper for this functionality (see this ticket).

[2]	The code will be posted later or even integrated into CubicWeb at some point. For now, it requires a bit of monkey patching around one or two methods in the source so that SQL is not executed but just recorded for later executions.