In most CubicWeb projects I've worked on, there always comes a time when
I need to import legacy data into the new application. CubicWeb provides Store
and Controller objects in the dataimport module. I won't talk here about the
recommended general procedure described in the module's docstring (I find it a bit
convoluted for simple cases) but will focus on Store objects. Store
objects in this module are more or less a thin layer around session objects:
they provide high-level helpers such as create_entity() and relate(),
and keep track of what was inserted, errors that occurred, etc.
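For illustration, a basic import with such a store looks roughly like this (a minimal sketch: the signatures of create_entity(), relate() and commit() are the ones used in the benchmark script below, but the works_for relation and the input data are made up):

from cubicweb.dataimport import RQLObjectStore

def import_people(session, rows):
    # rows is an iterable of (firstname, surname, company_eid) tuples
    store = RQLObjectStore(session)
    for firstname, surname, company_eid in rows:
        # create_entity() returns the newly created entity
        person = store.create_entity('Person', firstname=firstname,
                                     surname=surname)
        # relate() takes the subject eid, the relation type and the object eid
        store.relate(person.eid, 'works_for', company_eid)
    store.commit()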
In a recent project, I had to create a fairly large number (a few million)
of simple entities (strings, integers, floats and dates) and relations. The default
object store (i.e. cubicweb.dataimport.RQLObjectStore) is painfully slow,
the reason being all the integrity / security / metadata hooks that are constantly
selected and executed. For large imports, dataimport also provides
cubicweb.dataimport.NoHookRQLObjectStore. This store bypasses all hooks
and uses the underlying system source primitives directly, making it around
twice as fast as the standard store. The problem is that we're still
executing each SQL query sequentially, and we're talking here about millions of
INSERT / UPDATE queries.
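To make the "sequential" part concrete, here is roughly what it looks like at the python DB-API level (a sketch with made-up data; cw_Person is the table CubicWeb generates for a Person entity type, as shown below):

# one round-trip per row: what the stores above effectively do
for params in all_params:
    cursor.execute("INSERT INTO cw_Person (cw_firstname, cw_surname) "
                   "VALUES (%(cw_firstname)s, %(cw_surname)s)", params)

# one executemany() call batching all the identical queries
cursor.executemany("INSERT INTO cw_Person (cw_firstname, cw_surname) "
                   "VALUES (%(cw_firstname)s, %(cw_surname)s)", all_params)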
My idea was to create my own ObjectStore class, inheriting from NoHookRQLObjectStore,
that would use executemany or even copy_from when possible. It
is actually not hard to group similar SQL queries since create_entity()
generates the same query for a given set of parameters. For instance:
create_entity('Person', firstname='John', surname='Doe')
create_entity('Person', firstname='Tim', surname='BL')
will generate the following SQL queries:
INSERT INTO cw_Person ( cw_cwuri, cw_eid, cw_modification_date,
cw_creation_date, cw_firstname, cw_surname )
VALUES ( %(cw_cwuri)s, %(cw_eid)s, %(cw_modification_date)s,
%(cw_creation_date)s, %(cw_firstname)s, %(cw_surname)s )
INSERT INTO cw_Person ( cw_cwuri, cw_eid, cw_modification_date,
cw_creation_date, cw_firstname, cw_surname )
VALUES ( %(cw_cwuri)s, %(cw_eid)s, %(cw_modification_date)s,
%(cw_creation_date)s, %(cw_firstname)s, %(cw_surname)s )
The only thing that differs is the actual data inserted. Well ... ahem ... CubicWeb actually
also generates a "few" extra SQL queries to insert metadata for each entity:
INSERT INTO is_instance_of_relation(eid_from,eid_to) VALUES (%s,%s)
INSERT INTO is_relation(eid_from,eid_to) VALUES (%s,%s)
INSERT INTO cw_source_relation(eid_from,eid_to) VALUES (%s,%s)
INSERT INTO owned_by_relation ( eid_to, eid_from ) VALUES ( %(eid_to)s, %(eid_from)s )
INSERT INTO created_by_relation ( eid_to, eid_from ) VALUES ( %(eid_to)s, %(eid_from)s )
Those extra queries are actually exactly the same for each entity inserted, whatever
the entity type, which makes them perfect candidates for executemany or copy_from. Grouping SQL queries
together is not that hard, but it has a drawback: since there is no intermediate state
(the data is actually inserted only at the very end of the process),
you lose the ability to query the database for the entities you've just created
during the import.
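To make that concrete, here is a simplified sketch of the buffering idea (the class and method names are mine, not the actual implementation; the real store plugs this into the points where NoHookRQLObjectStore issues its SQL):

from collections import defaultdict

class SQLBuffer(object):
    # group identical SQL statements and flush them in batches
    def __init__(self, cursor):
        self.cursor = cursor
        # maps an SQL string to the list of parameter sets seen for it
        self._pending = defaultdict(list)

    def queue(self, sql, params):
        # nothing is sent to the database yet, hence no intermediate state
        self._pending[sql].append(params)

    def flush(self):
        for sql, all_params in self._pending.items():
            # one executemany() per distinct statement instead of one
            # execute() per row; copy_from is a further optimisation
            # for plain INSERTs
            self.cursor.executemany(sql, all_params)
        self._pending.clear()

The flush would typically happen at commit() time, which is exactly why nothing is queryable in the database before the end of the import.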
Now, a few benchmarks ...
To create those benchmarks, I decided to use the workorder cube, which is simple yet complete enough: it provides only two entity types (WorkOrder and Order), a relation between them (Order split_into WorkOrder) and uses different kinds of attributes (String, Date, Float).
Once the cube was instantiated, I ran the following script to populate the database with my 3 different stores:
import sys
from datetime import date
from random import choice
from itertools import count

from logilab.common.decorators import timed

from cubicweb import cwconfig
from cubicweb.dbapi import in_memory_repo_cnx


def workorders_data(n, seq=count()):
    for i in xrange(n):
        yield {'title': u'wo-title%s' % seq.next(), 'description': u'foo',
               'begin_date': date.today(), 'end_date': date.today()}

def orders_data(n, seq=count()):
    for i in xrange(n):
        yield {'title': u'o-title%s' % seq.next(), 'date': date.today(),
               'budget': 0.8}

def split_into(orders, workorders):
    for workorder in workorders:
        yield choice(orders), workorder

def initial_state(session, etype):
    return session.execute('Any S WHERE S is State, WF initial_state S, '
                           'WF workflow_of ET, ET name %(etn)s',
                           {'etn': etype})[0][0]


@timed
def populate(store, nb_workorders, nb_orders, set_state=False):
    orders = [store.create_entity('Order', **attrs)
              for attrs in orders_data(nb_orders)]
    workorders = [store.create_entity('WorkOrder', **attrs)
                  for attrs in workorders_data(nb_workorders)]
    ## in_state is set by a hook, so NoHookRQLObjectStore will need
    ## to set the relation manually
    if set_state:
        order_state = initial_state(store.session, 'Order')
        workorder_state = initial_state(store.session, 'WorkOrder')
        for order in orders:
            store.relate(order.eid, 'in_state', order_state)
        for workorder in workorders:
            store.relate(workorder.eid, 'in_state', workorder_state)
    for order, workorder in split_into(orders, workorders):
        store.relate(order.eid, 'split_into', workorder.eid)
    store.commit()


if __name__ == '__main__':
    config = cwconfig.instance_configuration(sys.argv[1])
    nb_orders = int(sys.argv[2])
    nb_workorders = int(sys.argv[3])
    repo, cnx = in_memory_repo_cnx(config, login='admin', password='admin')
    session = repo._get_session(cnx.sessionid)
    from cubicweb.dataimport import RQLObjectStore, NoHookRQLObjectStore
    from cubes.mycube.dataimport.store import CopyFromRQLObjectStore
    print 'testing RQLObjectStore'
    store = RQLObjectStore(session)
    populate(store, nb_workorders, nb_orders)
    print 'testing NoHookRQLObjectStore'
    store = NoHookRQLObjectStore(session)
    populate(store, nb_workorders, nb_orders, set_state=True)
    print 'testing CopyFromRQLObjectStore'
    store = CopyFromRQLObjectStore(session)
    populate(store, nb_workorders, nb_orders, set_state=True)
I ran the script, asking it to create 100 Order entities and 1000 WorkOrder entities,
and to link each created WorkOrder to a parent Order:
adim@esope:~/tmp/bench_cwdi$ python bench_cwdi.py bench_cwdi 100 1000
testing RQLObjectStore
populate clock: 24.590000000 / time: 46.169721127
testing NoHookRQLObjectStore
populate clock: 8.100000000 / time: 25.712352991
testing CopyFromRQLObjectStore
populate clock: 0.830000000 / time: 1.180006981
My interpretation of the above times is:
- The clock time indicates the time spent on the CubicWeb server side (i.e. hooks
  and data pre/postprocessing around SQL queries). The time value should roughly be
  the sum of the clock time and the time spent in postgresql.
- RQLObjectStore is slow ;-). Nothing new here, but the clock/time ratio
  means that we're spending a lot of time on the python side (i.e. in hooks,
  as mentioned earlier) and a fair amount of time in postgresql.
- NoHookRQLObjectStore really cuts down the time spent on the python side,
  while the time spent in postgresql remains about the same as for RQLObjectStore.
  This is not surprising: the queries performed are the same in both cases.
- CopyFromRQLObjectStore seems blazingly fast in comparison (inserting
  a few thousand rows in postgresql with a COPY FROM statement
  is not a problem; see the sketch after this list). And ... yes, I checked that the data was actually inserted,
  and I even ran a cubicweb-ctl db-check on the instance afterwards.
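For reference, the postgresql side of the trick with psycopg2 looks roughly like this (a bare sketch with a helper name of my own, not the actual CopyFromRQLObjectStore code):

from StringIO import StringIO   # python 2, as in the script above

def copy_rows(cursor, table, columns, rows):
    # rows is an iterable of tuples of values already converted to text;
    # real code would also escape tabs/newlines and handle NULL values
    buf = StringIO()
    for row in rows:
        buf.write('\t'.join(str(value) for value in row) + '\n')
    buf.seek(0)
    # psycopg2's copy_from wraps postgresql's COPY ... FROM STDIN
    cursor.copy_from(buf, table, columns=columns)

# for instance, for one of the metadata relations shown earlier:
# copy_rows(cursor, 'owned_by_relation', ('eid_to', 'eid_from'), pairs)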
This probably opens new perspectives for massive data imports, since the client API
remains the same as before for the programmer. It's still a bit experimental and can only be
used for "dummy", brute-force import scenarios where you can preprocess your data
in Python before updating the database, but it's probably worth having such
a store in the dataimport module.