In most cubicweb projects I've been developing on, there always comes a time where I need to import legacy data in the new application. CubicWeb provides Store and Controller objects in the dataimport module. I won't talk here about the recommended general procedure described in the module's docstring (I find it a bit convoluted for simple cases) but I will focus on Store objects. Store objects in this module are more or less a thin layer around session objects, they provide high-level helpers such as create_entity(), relate() and keep track of what was inserted, errors occurred, etc. In a recent project, I had to create a somewhat fair amount (a few million) of simple entities (strings, integers, floats and dates) and relations. Default object store (i.e. cubicweb.dataimport.RQLObjectStore) is painfully slow, the reason being all integrity / security / metadata hooks that are constantly selected and executed. For large imports, dataimport also provides the cubicweb.dataimport.NoHookRQLObjectStore. This store bypasses all hooks and uses the underlying system source primitives directly, making it around two-times faster than the standard store. The problem is that we're still doing each sql query sequentially and we're talking here of millions of INSERT / UPDATE queries. My idea was to create my own ObjectStore class inheriting from NoHookRQLObjectStore that would try to use executemany or even copy_from when possible [1]. It is actually not hard to make groups of similar SQL queries since create_entity() generates the same query for a given set of parameters. For instance: create_entity('Person', firstname='John', surname='Doe')
create_entity('Person', firstname='Tim', surname='BL')
will generate the following sql queries: INSERT INTO cw_Person ( cw_cwuri, cw_eid, cw_modification_date,
cw_creation_date, cw_firstname, cw_surname )
VALUES ( %(cw_cwuri)s, %(cw_eid)s, %(cw_modification_date)s,
%(cw_creation_date)s, %(cw_firstname)s, %(cw_surname)s )
INSERT INTO cw_Person ( cw_cwuri, cw_eid, cw_modification_date,
cw_creation_date, cw_firstname, cw_surname )
VALUES ( %(cw_cwuri)s, %(cw_eid)s, %(cw_modification_date)s,
%(cw_creation_date)s, %(cw_firstname)s, %(cw_surname)s )
The only thing that will differ is the actual data inserted. Well ... ahem ... CubicWeb actually also generates a "few" extra sql queries to insert metadata for each entity: INSERT INTO is_instance_of_relation(eid_from,eid_to) VALUES (%s,%s)
INSERT INTO is_relation(eid_from,eid_to) VALUES (%s,%s)
INSERT INTO cw_source_relation(eid_from,eid_to) VALUES (%s,%s)
INSERT INTO owned_by_relation ( eid_to, eid_from ) VALUES ( %(eid_to)s, %(eid_from)s )
INSERT INTO created_by_relation ( eid_to, eid_from ) VALUES ( %(eid_to)s, %(eid_from)s )
Those extra queries are actually even exactly the same for each entity insterted, whatever the entity type is, hence craving for executemany or copy_from. Grouping together SQL queries is not that hard [2] but has a drawback : as you don't have an intermediate state (the data is actually inserted only at the very end of the process), you loose the ability to query your database to fetch the entities you've just created during the import. Now, a few benchmarks ... To create those benchmarks, I decided to use the workorder cube which is a simple cube, yet complete enough : it provides only two entity types (WorkOrder and Order), a relation between them (Order split_into WorkOrder) and uses different kind of attributes (String, Date, Float). Once the cube was instantiated, I ran the following script to populate the database with my 3 different stores: import sys
from datetime import date
from random import choice
from itertools import count
from logilab.common.decorators import timed
from cubicweb import cwconfig
from cubicweb.dbapi import in_memory_repo_cnx
def workorders_data(n, seq=count()):
for i in xrange(n):
yield {'title': u'wo-title%s' % seq.next(), 'description': u'foo',
'begin_date': date.today(), 'end_date': date.today()}
def orders_data(n, seq=count()):
for i in xrange(n):
yield {'title': u'o-title%s' % seq.next(), 'date': date.today(), 'budget': 0.8}
def split_into(orders, workorders):
for workorder in workorders:
yield choice(orders), workorder
def initial_state(session, etype):
return session.execute('Any S WHERE S is State, WF initial_state S, '
'WF workflow_of ET, ET name %(etn)s', {'etn': etype})[0][0]
@timed
def populate(store, nb_workorders, nb_orders, set_state=False):
orders = [store.create_entity('Order', **attrs)
for attrs in orders_data(nb_orders)]
workorders = [store.create_entity('WorkOrder', **attrs)
for attrs in workorders_data(nb_workorders)]
## in_state is set by a hook, so NoHookObjectStore will need
## to set the relation manually
if set_state:
order_state = initial_state(store.session, 'Order')
workorder_state = initial_state(store.session, 'WorkOrder')
for order in orders:
store.relate(order.eid, 'in_state', order_state)
for workorder in workorders:
store.relate(workorder.eid, 'in_state', workorder_state)
for order, workorder in split_into(orders, workorders):
store.relate(order.eid, 'split_into', workorder.eid)
store.commit()
if __name__ == '__main__':
config = cwconfig.instance_configuration(sys.argv[1])
nb_orders = int(sys.argv[2])
nb_workorders = int(sys.argv[3])
repo, cnx = in_memory_repo_cnx(config, login='admin', password='admin')
session = repo._get_session(cnx.sessionid)
from cubicweb.dataimport import RQLObjectStore, NoHookRQLObjectStore
from cubes.mycube.dataimport.store import CopyFromRQLObjectStore
print 'testing RQLObjectStore'
store = RQLObjectStore(session)
populate(store, nb_workorders, nb_orders)
print 'testing NoHookRQLObjectStore'
store = NoHookRQLObjectStore(session)
populate(store, nb_workorders, nb_orders, set_state=True)
print 'testing CopyFromRQLObjectStore'
store = CopyFromRQLObjectStore(session)
I ran the script and asked to create 100 Order entities, 1000 WorkOrder entities and to link each created WorkOrder to a parent Order adim@esope:~/tmp/bench_cwdi$ python bench_cwdi.py bench_cwdi 100 1000 testing RQLObjectStore populate clock: 24.590000000 / time: 46.169721127 testing NoHookRQLObjectStore populate clock: 8.100000000 / time: 25.712352991 testing CopyFromRQLObjectStore populate clock: 0.830000000 / time: 1.180006981 My interpretation of the above times is :
This probably opens new perspective for massive data imports since the client API remains the same as before for the programmer. It's still a bit experimental, can only be used for "dummy", brute-force import scenario where you can preprocess your data in Python before updating the database, but it's probably worth having such a store in the the dataimport module.
| |||||


Reusing OpenData from Data.gouv.fr with CubicWeb in 2 hours

Comments
With dataimport features and your above work, the CW community benefits from both advanced reutilization through cubes and impressive performances. This is really cool, congrats and many thanks for the great job!