In most CubicWeb projects I've worked on, there always comes a time when I need to import legacy data into the new application. CubicWeb provides Store and Controller objects in the dataimport module. I won't talk here about the recommended general procedure described in the module's docstring (I find it a bit convoluted for simple cases) but will focus on Store objects. Store objects in this module are more or less a thin layer around session objects: they provide high-level helpers such as create_entity() and relate(), and they keep track of what was inserted, the errors that occurred, etc.
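To give an idea of the API before diving in, here is a minimal sketch of how such a store is used (the entity types and the works_for relation are made up for the example; the calls themselves are the ones used throughout this post):

from cubicweb.dataimport import RQLObjectStore

# 'session' is an open CubicWeb repository session
store = RQLObjectStore(session)
# 'Person', 'Company' and 'works_for' are hypothetical schema elements
john = store.create_entity('Person', firstname=u'John', surname=u'Doe')
acme = store.create_entity('Company', name=u'Acme')
store.relate(john.eid, 'works_for', acme.eid)
store.commit()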
In a recent project, I had to create a fairly large number (a few million) of simple entities (strings, integers, floats and dates) and relations. The default object store (i.e. cubicweb.dataimport.RQLObjectStore) is painfully slow, the reason being all the integrity / security / metadata hooks that are constantly selected and executed. For large imports, dataimport also provides cubicweb.dataimport.NoHookRQLObjectStore. This store bypasses all hooks and uses the underlying system source primitives directly, making it around two times faster than the standard store. The problem is that each SQL query is still executed sequentially, and we are talking here about millions of INSERT / UPDATE queries.
My idea was to create my own ObjectStore class, inheriting from NoHookRQLObjectStore, that would try to use executemany or even copy_from when possible [1]. It is actually not hard to group similar SQL queries, since create_entity() generates the same query for a given set of parameters. For instance:
create_entity('Person', firstname='John', surname='Doe')
create_entity('Person', firstname='Tim', surname='BL')
will generate the following SQL queries:
INSERT INTO cw_Person ( cw_cwuri, cw_eid, cw_modification_date,
                        cw_creation_date, cw_firstname, cw_surname )
       VALUES ( %(cw_cwuri)s, %(cw_eid)s, %(cw_modification_date)s,
                %(cw_creation_date)s, %(cw_firstname)s, %(cw_surname)s )

INSERT INTO cw_Person ( cw_cwuri, cw_eid, cw_modification_date,
                        cw_creation_date, cw_firstname, cw_surname )
       VALUES ( %(cw_cwuri)s, %(cw_eid)s, %(cw_modification_date)s,
                %(cw_creation_date)s, %(cw_firstname)s, %(cw_surname)s )
The only thing that will differ is the actual data inserted. Well... ahem... CubicWeb actually also generates a "few" extra SQL queries to insert metadata for each entity:
INSERT INTO is_instance_of_relation(eid_from,eid_to) VALUES (%s,%s)
INSERT INTO is_relation(eid_from,eid_to) VALUES (%s,%s)
INSERT INTO cw_source_relation(eid_from,eid_to) VALUES (%s,%s)
INSERT INTO owned_by_relation ( eid_to, eid_from ) VALUES ( %(eid_to)s, %(eid_from)s )
INSERT INTO created_by_relation ( eid_to, eid_from ) VALUES ( %(eid_to)s, %(eid_from)s )
Those extra queries are even exactly the same for each inserted entity, whatever its type, which makes them perfect candidates for executemany or copy_from. Grouping similar SQL queries together is not that hard [2], but it has a drawback: since there is no intermediate state (the data is actually inserted only at the very end of the process), you lose the ability to query the database for the entities you've just created during the import.
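To make the grouping idea more concrete, here is a toy sketch (not the actual store code) of what recording and replaying identical queries can look like: queries are bucketed by their SQL string and flushed with a single executemany() call per bucket at the end.

from collections import defaultdict

class QueryGrouper(object):
    """Toy example: bucket identical SQL queries and replay each bucket
    with a single executemany() call instead of many execute() calls."""

    def __init__(self, cursor):
        self.cursor = cursor
        self.pending = defaultdict(list)  # sql string -> list of parameter dicts

    def record(self, sql, params):
        self.pending[sql].append(params)

    def flush(self):
        for sql, all_params in self.pending.iteritems():
            self.cursor.executemany(sql, all_params)
        self.pending.clear()

A real implementation would of course also have to deal with ordering constraints (entities before relations) and with the eids generated along the way.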
Now, a few benchmarks ...
To create those benchmarks, I decided to use the workorder cube, which is simple yet complete enough: it provides only two entity types (WorkOrder and Order), a relation between them (Order split_into WorkOrder) and uses different kinds of attributes (String, Date, Float).
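For reference, the relevant part of the workorder schema looks roughly like this (a simplified sketch in yams syntax, not the cube's exact definition):

from yams.buildobjs import EntityType, String, Date, Float, SubjectRelation

class Order(EntityType):
    title = String(required=True)
    date = Date()
    budget = Float()
    split_into = SubjectRelation('WorkOrder')

class WorkOrder(EntityType):
    title = String(required=True)
    description = String()
    begin_date = Date()
    end_date = Date()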
Once the cube was instantiated, I ran the following script to populate the database with my three different stores:
import sys
from datetime import date
from random import choice
from itertools import count

from logilab.common.decorators import timed

from cubicweb import cwconfig
from cubicweb.dbapi import in_memory_repo_cnx


def workorders_data(n, seq=count()):
    for i in xrange(n):
        yield {'title': u'wo-title%s' % seq.next(), 'description': u'foo',
               'begin_date': date.today(), 'end_date': date.today()}

def orders_data(n, seq=count()):
    for i in xrange(n):
        yield {'title': u'o-title%s' % seq.next(), 'date': date.today(),
               'budget': 0.8}

def split_into(orders, workorders):
    for workorder in workorders:
        yield choice(orders), workorder

def initial_state(session, etype):
    return session.execute('Any S WHERE S is State, WF initial_state S, '
                           'WF workflow_of ET, ET name %(etn)s',
                           {'etn': etype})[0][0]


@timed
def populate(store, nb_workorders, nb_orders, set_state=False):
    orders = [store.create_entity('Order', **attrs)
              for attrs in orders_data(nb_orders)]
    workorders = [store.create_entity('WorkOrder', **attrs)
                  for attrs in workorders_data(nb_workorders)]
    ## in_state is set by a hook, so NoHookObjectStore will need
    ## to set the relation manually
    if set_state:
        order_state = initial_state(store.session, 'Order')
        workorder_state = initial_state(store.session, 'WorkOrder')
        for order in orders:
            store.relate(order.eid, 'in_state', order_state)
        for workorder in workorders:
            store.relate(workorder.eid, 'in_state', workorder_state)
    for order, workorder in split_into(orders, workorders):
        store.relate(order.eid, 'split_into', workorder.eid)
    store.commit()


if __name__ == '__main__':
    config = cwconfig.instance_configuration(sys.argv[1])
    nb_orders = int(sys.argv[2])
    nb_workorders = int(sys.argv[3])
    repo, cnx = in_memory_repo_cnx(config, login='admin', password='admin')
    session = repo._get_session(cnx.sessionid)
    from cubicweb.dataimport import RQLObjectStore, NoHookRQLObjectStore
    from cubes.mycube.dataimport.store import CopyFromRQLObjectStore
    print 'testing RQLObjectStore'
    store = RQLObjectStore(session)
    populate(store, nb_workorders, nb_orders)
    print 'testing NoHookRQLObjectStore'
    store = NoHookRQLObjectStore(session)
    populate(store, nb_workorders, nb_orders, set_state=True)
    print 'testing CopyFromRQLObjectStore'
    store = CopyFromRQLObjectStore(session)
    populate(store, nb_workorders, nb_orders, set_state=True)
I ran the script, asking it to create 100 Order entities and 1000 WorkOrder entities, and to link each created WorkOrder to a parent Order:
adim@esope:~/tmp/bench_cwdi$ python bench_cwdi.py bench_cwdi 100 1000
testing RQLObjectStore
populate clock: 24.590000000 / time: 46.169721127
testing NoHookRQLObjectStore
populate clock: 8.100000000 / time: 25.712352991
testing CopyFromRQLObjectStore
populate clock: 0.830000000 / time: 1.180006981
My interpretation of the above times is:
- The clock value indicates the time spent on the CubicWeb server side (i.e. hooks and data pre/postprocessing around SQL queries). The time value should roughly be the clock time plus the time spent in PostgreSQL.
- RQLObjectStore is slow ;-). Nothing new here, but the clock/time ratio shows that we're spending a lot of time on the Python side (i.e. in hooks, as mentioned earlier) and a fair amount of time in PostgreSQL.
- NoHookRQLObjectStore drastically reduces the time spent on the Python side, while the time spent in PostgreSQL remains about the same as for RQLObjectStore. This is not surprising since the queries performed are the same in both cases.
- CopyFromRQLObjectStore seems blazingly fast in comparison (inserting a few thousand elements in PostgreSQL with a COPY FROM statement is not a problem). And ... yes, I checked that the data was actually inserted, and I even ran a cubicweb-ctl db-check on the instance afterwards.
This probably opens new perspectives for massive data imports, since the client API remains the same as before for the programmer. It is still a bit experimental and can only be used for "dummy", brute-force import scenarios where you can preprocess your data in Python before updating the database, but it is probably worth having such a store in the dataimport module.
[1] The idea is to promote an executemany('INSERT INTO ...', data) statement into a COPY FROM whenever possible (i.e. simple data types, easy enough to escape). In that case, the underlying database and Python modules have to provide support for this functionality. For the record, the psycopg2 module exposes a copy_from() method, and logilab-database will soon provide an additional high-level helper for this functionality (see this ticket).
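As an illustration of what such a promotion could look like with psycopg2 (a rough sketch assuming simple values with no embedded tabs, newlines or NULLs; the helper name and column list are made up for the example):

from StringIO import StringIO

def copy_from_insert(cursor, table, columns, rows):
    # serialize the rows as tab-separated text, then let PostgreSQL
    # ingest them in one shot with COPY FROM
    buf = StringIO()
    for row in rows:
        buf.write(u'\t'.join(unicode(value) for value in row).encode('utf-8'))
        buf.write('\n')
    buf.seek(0)
    cursor.copy_from(buf, table, columns=columns)

# e.g. copy_from_insert(cu, 'cw_person',
#                       ('cw_eid', 'cw_firstname', 'cw_surname'), rows)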
[2] The code will be posted later or even integrated into CubicWeb at some point. For now, it requires a bit of monkey patching around one or two methods of the source so that SQL is not executed but just recorded for later execution.
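The kind of monkey patching involved is roughly the following (a sketch only: doexec() is the method the native source uses to send SQL to the backend in the CubicWeb version I used, but the exact method(s) to patch may vary between versions):

from cubicweb.server.sources import native

recorded = []

def record_instead_of_execute(self, session, sql, args=None, **kwargs):
    # do not touch the database, just remember the query for later replay
    recorded.append((sql, args))

# replace the execution method of the system source during the import
native.NativeSQLSource.doexec = record_instead_of_execute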
Comments
With the dataimport features and your work above, the CW community benefits both from advanced reuse through cubes and from impressive performance. This is really cool, congrats and many thanks for the great job!