Blog entries by Adrien Di Mascio

Importing thousands of entities into CubicWeb within a few seconds with dataimport

2011/12/09 by Adrien Di Mascio

In most cubicweb projects I've been developing on, there always comes a time where I need to import legacy data into the new application. CubicWeb provides Store and Controller objects in the dataimport module. I won't talk here about the recommended general procedure described in the module's docstring (I find it a bit convoluted for simple cases) but will focus on Store objects. Store objects in this module are more or less a thin layer around session objects: they provide high-level helpers such as create_entity() and relate(), and keep track of what was inserted, of errors that occurred, etc.

In a recent project, I had to create a fairly large number (a few million) of simple entities (strings, integers, floats and dates) and relations. The default object store (i.e. cubicweb.dataimport.RQLObjectStore) is painfully slow, the reason being all the integrity / security / metadata hooks that are constantly selected and executed. For large imports, dataimport also provides cubicweb.dataimport.NoHookRQLObjectStore. This store bypasses all hooks and uses the underlying system source primitives directly, making it about twice as fast as the standard store. The problem is that each SQL query is still executed sequentially, and we're talking here about millions of INSERT / UPDATE queries.

My idea was to create my own ObjectStore class inheriting from NoHookRQLObjectStore that would try to use executemany or even copy_from when possible [1]. It is actually not hard to group similar SQL queries since create_entity() generates the same query for a given set of parameters. For instance:

create_entity('Person', firstname='John', surname='Doe')
create_entity('Person', firstname='Tim', surname='BL')

will generate the following sql queries:

INSERT INTO cw_Person ( cw_cwuri, cw_eid, cw_modification_date,
                        cw_creation_date, cw_firstname, cw_surname )
       VALUES ( %(cw_cwuri)s, %(cw_eid)s, %(cw_modification_date)s,
                %(cw_creation_date)s, %(cw_firstname)s, %(cw_surname)s )
INSERT INTO cw_Person ( cw_cwuri, cw_eid, cw_modification_date,
                        cw_creation_date, cw_firstname, cw_surname )
       VALUES ( %(cw_cwuri)s, %(cw_eid)s, %(cw_modification_date)s,
                %(cw_creation_date)s, %(cw_firstname)s, %(cw_surname)s )

The only thing that differs is the actual data inserted. Well ... ahem ... CubicWeb actually also generates a "few" extra SQL queries to insert metadata for each entity:

INSERT INTO is_instance_of_relation(eid_from,eid_to) VALUES (%s,%s)
INSERT INTO is_relation(eid_from,eid_to) VALUES (%s,%s)
INSERT INTO cw_source_relation(eid_from,eid_to) VALUES (%s,%s)
INSERT INTO owned_by_relation ( eid_to, eid_from ) VALUES ( %(eid_to)s, %(eid_from)s )
INSERT INTO created_by_relation ( eid_to, eid_from ) VALUES ( %(eid_to)s, %(eid_from)s )

Those extra queries are in fact exactly the same for each entity inserted, whatever the entity type is, which makes them perfect candidates for executemany or copy_from. Grouping SQL queries together is not that hard [2] but has a drawback: since there is no intermediate state (the data is actually inserted only at the very end of the process), you lose the ability to query your database to fetch the entities you've just created during the import.
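The grouping idea can be sketched independently of CubicWeb: instead of executing each statement, record it keyed by its SQL string, and flush everything at the end with one executemany() per distinct statement. A minimal illustration (the QueryRecorder name and API are mine, not the actual store code):

```python
from collections import defaultdict

class QueryRecorder:
    """Buffer SQL statements instead of executing them immediately,
    grouping identical statements together so the whole batch can be
    flushed at the end with executemany()."""

    def __init__(self):
        # maps an SQL string to the list of parameter sets recorded for it
        self.groups = defaultdict(list)

    def record(self, sql, params):
        self.groups[sql].append(params)

    def flush(self, cursor):
        # one executemany() call per distinct SQL statement
        for sql, paramslist in self.groups.items():
            cursor.executemany(sql, paramslist)
        self.groups.clear()
```

With such a recorder, the two cw_Person INSERTs above would collapse into a single executemany() call carrying two parameter dicts, and each metadata INSERT into one call carrying the whole batch.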

Now, a few benchmarks ...

To create those benchmarks, I decided to use the workorder cube, which is a simple yet complete enough cube: it provides only two entity types (WorkOrder and Order), a relation between them (Order split_into WorkOrder) and uses different kinds of attributes (String, Date, Float).

Once the cube was instantiated, I ran the following script to populate the database with my 3 different stores:

import sys
from datetime import date
from random import choice
from itertools import count

from logilab.common.decorators import timed

from cubicweb import cwconfig
from cubicweb.dbapi import in_memory_repo_cnx

def workorders_data(n, seq=count()):
    for i in xrange(n):
        yield {'title': u'wo-title%s' % next(seq), 'description': u'foo',
               'begin_date': date.today(), 'end_date': date.today()}

def orders_data(n, seq=count()):
    for i in xrange(n):
        yield {'title': u'o-title%s' % next(seq), 'date': date.today(), 'budget': 0.8}

def split_into(orders, workorders):
    for workorder in workorders:
        yield choice(orders), workorder

def initial_state(session, etype):
    return session.execute('Any S WHERE S is State, WF initial_state S, '
                           'WF workflow_of ET, ET name %(etn)s', {'etn': etype})[0][0]

def populate(store, nb_workorders, nb_orders, set_state=False):
    orders = [store.create_entity('Order', **attrs)
              for attrs in orders_data(nb_orders)]
    workorders = [store.create_entity('WorkOrder', **attrs)
                  for attrs in workorders_data(nb_workorders)]
    ## in_state is set by a hook, so NoHookObjectStore will need
    ## to set the relation manually
    if set_state:
        order_state = initial_state(store.session, 'Order')
        workorder_state = initial_state(store.session, 'WorkOrder')
        for order in orders:
            store.relate(order.eid, 'in_state', order_state)
        for workorder in workorders:
            store.relate(workorder.eid, 'in_state', workorder_state)
    for order, workorder in split_into(orders, workorders):
        store.relate(order.eid, 'split_into', workorder.eid)

if __name__ == '__main__':
    config = cwconfig.instance_configuration(sys.argv[1])
    nb_orders = int(sys.argv[2])
    nb_workorders = int(sys.argv[3])
    repo, cnx = in_memory_repo_cnx(config, login='admin', password='admin')
    session = repo._get_session(cnx.sessionid)
    from cubicweb.dataimport import RQLObjectStore, NoHookRQLObjectStore
    # the custom store described in this post (assumed to live in a local module)
    from copyfromstore import CopyFromRQLObjectStore
    print 'testing RQLObjectStore'
    store = RQLObjectStore(session)
    populate(store, nb_workorders, nb_orders)
    print 'testing NoHookRQLObjectStore'
    store = NoHookRQLObjectStore(session)
    populate(store, nb_workorders, nb_orders, set_state=True)
    print 'testing CopyFromRQLObjectStore'
    store = CopyFromRQLObjectStore(session)
    populate(store, nb_workorders, nb_orders, set_state=True)

I ran the script and asked it to create 100 Order entities and 1000 WorkOrder entities, linking each created WorkOrder to a parent Order:

adim@esope:~/tmp/bench_cwdi$ python bench_cwdi 100 1000
testing RQLObjectStore
populate clock: 24.590000000 / time: 46.169721127
testing NoHookRQLObjectStore
populate clock: 8.100000000 / time: 25.712352991
testing CopyFromRQLObjectStore
populate clock: 0.830000000 / time: 1.180006981

My interpretation of the above times is:

  • The clock time indicates the time spent on the CubicWeb server side (i.e. hooks and data pre/postprocessing around SQL queries). The time value should be the sum of the clock time and the time spent in postgresql.
  • RQLObjectStore is slow ;-). Nothing new here, but the clock/time ratio shows that we're spending a lot of time on the python side (i.e. in hooks, as explained earlier) and a fair amount of time in postgresql.
  • NoHookRQLObjectStore really cuts down the time spent on the python side; the time spent in postgresql remains about the same as for RQLObjectStore, which is not surprising since the queries performed are the same in both cases.
  • CopyFromRQLObjectStore seems blazingly fast in comparison (inserting a few thousand elements into postgresql with a COPY FROM statement is not a problem). And ... yes, I checked that the data was actually inserted, and I even ran a cubicweb-ctl db-check on the instance afterwards.

This probably opens new perspectives for massive data imports since the client API remains the same as before for the programmer. It's still a bit experimental and can only be used for "dummy", brute-force import scenarios where you can preprocess your data in Python before updating the database, but it's probably worth having such a store in the dataimport module.

[1] The idea is to promote an executemany('INSERT INTO ...', data) statement into a COPY FROM whenever possible (i.e. simple data types, easy enough to escape). In that case, the underlying database and Python modules have to support this functionality. For the record, the psycopg2 module exposes a copy_from() method, and logilab-database will soon provide an additional high-level helper for this functionality (see this ticket).
[2] The code will be posted later or even integrated into CubicWeb at some point. For now, it requires a bit of monkey-patching around one or two methods in the source so that SQL is not executed but just recorded for later execution.
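As a rough illustration of the promotion mentioned in [1], here is how rows of simple values might be serialized into the text format that psycopg2's copy_from() reads (tab-separated columns, \N for NULL). This is only a sketch under those assumptions, not the logilab-database helper:

```python
import io

def rows_to_copy_buffer(rows, sep='\t'):
    """Serialize rows of simple values into the text format understood
    by PostgreSQL's COPY FROM: one line per row, columns separated by
    tabs, NULL rendered as \\N."""
    def escape(value):
        if value is None:
            return r'\N'
        # crude escaping, good enough for "easy" data types only
        return str(value).replace('\\', '\\\\').replace(sep, ' ').replace('\n', ' ')
    buf = io.StringIO()
    for row in rows:
        buf.write(sep.join(escape(value) for value in row) + '\n')
    buf.seek(0)
    return buf

# with psycopg2, the buffer would then be fed to the database with
# something like:
#   cursor.copy_from(buf, 'cw_person', columns=('cw_eid', 'cw_firstname'))
```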

CubicWeb sprint in Paris / Need for Speed

2011/03/22 by Adrien Di Mascio

Logilab is hosting a CubicWeb sprint - 3 days in our Paris offices.

The general focus will be on speed:

  • on cubicweb-server side : improve performance of massive insertions / deletions
  • on cubicweb-client side : cache implementation, HTTP server, massive parallel usage, etc.

This sprint will take place in April 2011, from Tuesday the 26th to Thursday the 28th. You are more than welcome to come along, help out and contribute, but unlike previous sprints, at least basic knowledge of CubicWeb will be required from participants since no introduction is planned.

Network resources will be available for those bringing laptops.

Address : 104 Boulevard Auguste-Blanqui, Paris. Ring "Logilab" (googlemap)

Metro : Glacière

Contact :

Dates : 26/04/2011 to 28/04/2011

CSS+JS sprint report - Day 1 and 2 (April 2010)

2010/04/30 by Adrien Di Mascio

These first two days essentially consisted in exploring the javascript world.

Documenting javascript

Sandrine and Alain worked on the javascript documentation tools and how they could be integrated into our sphinx generated documentation.

They first studied pyjsdoc which unfortunately only generates HTML. After a somewhat successful attempt to generate sphinx ReST, we decided to use a consistent documentation format between python modules and js modules and therefore switched to a home-made, very simple javascript comment parser. Here's an example of what the parser understands:

 * .. cfunction:: myFunction(a, b, /*...*/, c, d)
 *    This function is very **well** documented and does quite
 *    a lot of stuff :
 *    - task 1
 *    - task 2
 *    :param a: this is the first parameter
 *    ...
 *    :return: 42
function myFunction(a, b, /*...*/, c, d) {

The extracted ReST snippets are then concatenated and inserted in the general documentation.

Unit testing javascript

Katia, Julien and Adrien looked at the different testing tools for javascript, with the two following goals in mind:

  • low-level unit testing, as cubicweb agnostic as possible
  • high-level / functional testing, we want to write navigation scenarios and replay them

And the two winners of the exploration are:
  • QUnit for pure javascript / DOM testing. Julien and Adrien successfully managed to test a few cubicweb js functions, most notably the loadxhtml jquery plugin.
  • Windmill for higher level testing. Katia and Sylvain were able to integrate Windmill within the CubicWeb unit testing framework.

Of course, there is still a lot of work that needs to be done. For instance, we would like to have a test runner facility to run QUnit-based tests on multiple platforms / browsers automatically.

Parametrized stylesheets and vertical rhythm

Sylvain worked on property sheets and managed to implement compiled CSS based on simple string interpolation. Of course, compiled CSS files are still HTTP-cached, automatically recompiled in debug mode, etc. Along the way, he also got rid of the external_resources file. Backward compatibility will of course be guaranteed for a while.

Nicolas worked on CSS and vertical rhythm and prepared a patch that introduces a basic rhythm. The tedious work will be to get every stylesheet to dance to the beat.

Customizing search box with magicsearch

2009/12/13 by Adrien Di Mascio

During the last cubicweb sprint, I was asked whether it was possible to customize the search box CubicWeb comes with. By default, you can use it to type either RQL queries, plain text queries or standard shortcuts such as <EntityType> or <EntityType> <attrname> <value>.

Ultimately, all queries are translated to RQL, since it's the only language understood on the server (data) side. To transform the user query into RQL, CubicWeb uses the so-called magicsearch component, which in turn delegates to a number of query preprocessors, each responsible for interpreting the user query and generating the corresponding RQL.

The code of the main processor loop is easy to understand:

for proc in self.processors:
    try:
        return proc.process_query(uquery, req)
    except (RQLSyntaxError, BadRQLQuery):
        continue

The idea is simple: for each query processor, try to translate the query. If it fails, try the next processor; if it succeeds, we're done and the RQL query will be executed.
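Outside of CubicWeb, this is a classic chain of responsibility: try each handler in priority order, and let a handler signal "not applicable" by raising or returning None. A minimal standalone sketch (the names and the two toy processors are mine, not the magicsearch API):

```python
class QueryError(Exception):
    """Raised when no processor can make sense of the query."""

def process(processors, uquery):
    """Try each processor in (priority) order; the first one that
    returns a non-None result wins."""
    for proc in processors:
        try:
            result = proc(uquery)
        except ValueError:       # this processor does not apply
            continue
        if result is not None:
            return result
    raise QueryError('unable to interpret %r' % uquery)

# two toy processors: an "integer eid" shortcut and a full-text fallback
def eid_processor(uquery):
    return ('Any X WHERE X eid %(x)s', {'x': int(uquery)})

def fulltext_processor(uquery):
    return ('Any X WHERE X has_text %(t)s', {'t': uquery})
```

Here, typing "42" in the search box would hit the eid shortcut, while any non-numeric query falls through to the full-text search.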

Now that the general mechanism is understood, here's an example of code that could be used in a forge-based cube to add a new search shortcut to find tickets. We'd like to use the project_name:text syntax to search for tickets of project_name containing text (e.g. pylint:warning).

Here's the corresponding preprocessor code:

from cubicweb.web.views.magicsearch import BaseQueryProcessor

class MyCustomQueryProcessor(BaseQueryProcessor):
    priority = 0 # controls order in which processors are tried

    def preprocess_query(self, uquery, req):
        """
        :param uquery: the query as sent by the browser
        :param req: the standard, omnipresent, cubicweb req object
        """
        try:
            project_name, text = uquery.split(':')
        except ValueError:
            return None # the shortcut doesn't apply
        return (u'Any T WHERE T is Ticket, T concerns P, P name %(p)s, '
                u'T has_text %(t)s', {'p': project_name, 't': text})

The code is rather self-explanatory, but here are a few additional comments:

  • the class is registered through the standard vregistry mechanism and should be defined alongside the views
  • the priority attribute is used to sort and define the order in which processors will be tried in the main processor loop
  • preprocess_query returns None or raises an exception if the query can't be processed

To summarize, if you want to customize the search box, you have to:

  1. define a new query preprocessor component
  2. define its priority wrt other standard processors
  3. implement the preprocess_query method

and CubicWeb will do the rest !

Sparkles everywhere, CubicWeb gets fizzy

2009/07/28 by Adrien Di Mascio

Last week, we finally took a few days to dive into SPARQL in order to transform any CubicWeb application into a potential SPARQL endpoint.

The first step was to get a parser. Fortunately, the W3C provides a grammar definition and around 200 test cases. There were a few interesting options out there: we tried to reuse rdflib, rasqal, the sparql.g version designed for antlr3 and SimpleParse, but after two days of work we had nothing that worked well enough. We decided it was not worth it and switched to yapps, since we already knew it and rql already depends on it.

Maybe we'll consider changing the parser later, but the priority was to get something working as soon as possible, and we finally came up with a version of fyzz passing 90% of the W3C test suite (of course, there might be some false positives).

Fyzz parses the SPARQL query and generates something we decided to call an AST although it's still a bit rough for now. Fyzz understands simple triples, distincts, limits, offsets and other basic functionalities.

Please note that fyzz is totally independent of cubicweb and it can be reused by any project.

Here's an example of how to use fyzz:

>>> from fyzz.yappsparser import parse
>>> ast = parse("""PREFIX doap: <http://usefulinc.com/ns/doap#>
... SELECT ?project ?name WHERE {
...    ?project a doap:Project;
...         doap:name ?name.
... }
... ORDER BY ?name LIMIT 5 OFFSET 10
... """)
>>> print ast.selected
[SparqlVar('project'), SparqlVar('name')]
>>> print ast.prefixes
{'doap': 'http://usefulinc.com/ns/doap#'}
>>> print ast.orderby
[(SparqlVar('name'), 'asc')]
>>> print ast.limit, ast.offset
5 10
>>> print ast.where
[(SparqlVar('project'), ('', 'a'), ('http://usefulinc.com/ns/doap#', 'Project')),
 (SparqlVar('project'), ('http://usefulinc.com/ns/doap#', 'name'), SparqlVar('name'))]

This AST is then processed and transformed into a RQL query which can finally be processed by CubicWeb directly.

Here's what can be done in cubicweb-ctl shell session (of course, this can also be done in the web application) of our forge cube:

>>> from cubicweb.spa2rql import Sparql2rqlTranslator
>>> query = """PREFIX doap: <http://usefulinc.com/ns/doap#>
... SELECT ?project ?name WHERE {
...    ?project a doap:Project;
...         doap:name ?name.
... }
... ORDER BY ?name LIMIT 5 OFFSET 10
... """
>>> translator = Sparql2rqlTranslator(schema)  # 'schema' is the instance's yams schema
>>> qinfo = translator.translate(query)
>>> rql, args = qinfo.finalize()
>>> print rql, args

From the above example, we can notice two things. First, for CubicWeb to understand the doap namespace, we have to declare the correspondence between the standard doap vocabulary and our internal schema; this is done with yams.xy:

>>> from yams import xy
>>> xy.register_prefix('http://usefulinc.com/ns/doap#', 'doap')
>>> xy.add_equivalence('Project', 'doap:Project')
>>> xy.add_equivalence('Project name', 'doap:Project doap:name')
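Conceptually, this boils down to two lookup tables: one mapping a namespace URI to its prefix, and one mapping a prefixed term to a schema name. A simplified standalone sketch of the resolution step (this is not the yams API, and it assumes the standard doap namespace):

```python
# namespace URI -> prefix, as declared with register_prefix
PREFIXES = {'http://usefulinc.com/ns/doap#': 'doap'}
# prefixed term -> schema name, as declared with add_equivalence
# (simplified: attribute equivalences are compound in the real API)
EQUIVALENCES = {'doap:Project': 'Project',
                'doap:name': 'Project name'}

def resolve(namespace, localname):
    """Map a (namespace, localname) pair from the SPARQL query to the
    equivalent name in the CubicWeb schema, or None if unknown."""
    prefix = PREFIXES.get(namespace)
    if prefix is None:
        return None
    return EQUIVALENCES.get('%s:%s' % (prefix, localname))
```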

Secondly, for now, we notice that case is not preserved during the transformation: ?project becomes PROJECT in the RQL query. This is probably something we'll need to tackle quickly.

We've also added a few views in CubicWeb to wrap all that; it will be available in the upcoming version 3.4.0 and is already available through our public mercurial repository.

The door is now open, the path is still long, stay tuned !

image under creative commons by beger (original)

Some new standard facets on the way

2009/05/29 by Adrien Di Mascio

CubicWeb has this really nice builtin facet system to define restriction filters as easily as possible.

We've just added two new kinds of facets in CubicWeb:

  • The RangeFacet, which displays a slider (using jquery) to choose a lower bound and an upper bound. The RangeWidget works with either numerical values or date values.
  • The HasRelationFacet, which displays a simple checkbox and lets you refine your selection to get only entities that actually use this relation.

Here's an example of code that defines a facet to filter musical works according to their composition date:

class CompositionDateFacet(DateRangeFacet):
    # 1. make sure this facet is displayed only on Track selection
    __select__ = DateRangeFacet.__select__ & implements('Track')
    # 2. give the facet an id (required by CubicWeb)
    id = 'compdate-facet'
    # 3. specify the attribute name that actually stores the date in the DB
    rtype = 'composition_date'

And that's it, on each page displaying tracks, you'll be able to filter them according to their composition date with a jquery slider.

All this, brought by CubicWeb (in the next 3.3 version)

Profiling your CubicWeb instance

2009/03/27 by Adrien Di Mascio

If you feel that one of your pages takes more time than it should to be generated, chances are that you're making too many RQL queries. Obviously, there are other possible reasons, but my personal experience tends to show this is the first thing to track down. Luckily for us, CubicWeb provides a configuration option to log RQL queries. In your all-in-one.conf file, set the query-log-file option:

# web application query log file
query-log-file=~/myapp-rql.log

Then restart your application, reload your page and stop your application. The file myapp-rql.log now contains the list of RQL queries that were executed during your test. It's a simple text file containing lines such as:

Any A WHERE X eid %(x)s, X lastname A {'x': 448} -- (0.002 sec, 0.010 CPU sec)
Any A WHERE X eid %(x)s, X firstname A {'x': 447} -- (0.002 sec, 0.000 CPU sec)

The structure of each line is:

<query> <args> -- (<time> sec, <cpu time> CPU sec)

Use the cubicweb-ctl exlog command to examine and summarize data found in such a file:

adim@crater:~$ cubicweb-ctl exlog < ~/myapp-rql.log
0.07 50 Any A WHERE X eid %(x)s, X firstname A {}
0.05 50 Any A WHERE X eid %(x)s, X lastname A {}
0.01 1 Any X,AA ORDERBY AA DESC WHERE E eid %(x)s, E employees X, X modification_date AA {}
0.01 1 Any X WHERE X eid %(x)s, X owned_by U, U eid %(u)s {, }
0.01 1 Any B,T,P ORDERBY lower(T) WHERE B is Bookmark,B title T, B path P, B bookmarked_by U, U eid %(x)s {}
0.01 1 Any A,B,C,D WHERE A eid %(x)s,A name B,A creation_date C,A modification_date D {}

This command sorts and uniquifies queries, making it easy to spot the hot spots that need optimization.
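For illustration, here is a rough sketch of the kind of aggregation exlog performs, assuming the log line format shown above (this is not the actual exlog implementation):

```python
import re
from collections import defaultdict

# matches: <rql> <args dict> -- (<time> sec, <cpu> CPU sec)
LINE_RE = re.compile(
    r'^(?P<rql>.+?) (?P<args>\{.*\}) '
    r'-- \((?P<time>[\d.]+) sec, (?P<cpu>[\d.]+) CPU sec\)$')

def summarize(lines):
    """Group identical queries (ignoring their arguments) and return
    (total time, number of executions, query) tuples, slowest first."""
    stats = defaultdict(lambda: [0.0, 0])
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue
        key = match.group('rql') + ' {}'
        stats[key][0] += float(match.group('time'))
        stats[key][1] += 1
    return sorted(((total, count, rql)
                   for rql, (total, count) in stats.items()), reverse=True)
```

Feeding it the log excerpt above groups the lastname and firstname queries separately and ranks them by cumulated time, much like the exlog output shown.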

Having said all this, it would probably be worth talking about the fetch_attrs attribute you can define in your entity classes, because it can greatly reduce the number of queries executed, but I'll write a specific blog entry about it.

I should finally mention the existence of the profile option in all-in-one.conf. If set, this option will make your application run in a hotshot session and store the results in the specified file.

Google Maps and CubicWeb

2009/03/09 by Adrien Di Mascio

There is this so-called 'gmap-view' in CubicWeb; the question is: how to use it?

Well, first, no surprise: you have to generate an API key to be able to use Google Maps on your server (make sure your usage conforms to the terms defined by Google).

Now, let's say you have defined the following schema:

class Company(EntityType):
    name = String(required=True, maxsize=64)
    # ... some other attributes ...
    latitude = Float(required=True)
    longitude = Float(required=True)

class Employee(EntityType):
    # ... some attributes ...
    works_for = SubjectRelation('Company', cardinality='1*')

And you'd like to be able to display companies on a map; you've also got these nice icons that you'd wish to use as markers. First, define those three icons as external resources by editing your CUBE/data/external_resources file (the icon file names below are examples):

SMALL_MARKER_ICON=DATADIR/small_company.png
MEDIUM_MARKER_ICON=DATADIR/medium_company.png
BIG_MARKER_ICON=DATADIR/big_company.png

We're nearly done, now. We just have to make our entity class implement the cubicweb.interfaces.IGeocodable interface. Here's an example:

from cubicweb.entities import AnyEntity
from cubicweb.interfaces import IGeocodable

class Company(AnyEntity):
    id = 'Company' # this must match the type as defined in your schema
    __implements__ = AnyEntity.__implements__ + (IGeocodable,)

    def size(self):
        return self.req.execute('Any COUNT(E) WHERE E works_for C, C eid %(c)s',
                                {'c': self.eid})[0][0]

    # this is a method of IGeocodable
    def marker_icon(self):
        size = self.size()
        if size < 20:
            return self.req.external_resource('SMALL_MARKER_ICON')
        elif size < 500:
            return self.req.external_resource('MEDIUM_MARKER_ICON')
        else:
            return self.req.external_resource('BIG_MARKER_ICON')

That's it, you can now call the gmap-view on a resultset containing companies:

rset = self.req.execute('Any C WHERE C is Company')
self.wview('gmap-view', rset, gmap_key=YOUR_API_KEY)

Further configuration is possible, especially to control the size of the map or the default zoom level.

To be fair, I must say that in a real-life cube, chances are you won't be able to specify latitude and longitude directly and that you'll only have an address. This is slightly more complex to handle since you'll need to query a geocoding service (the Google one, for instance) to transform your address into latitude/longitude. This will typically be done in a hook.

Here is a screenshot of the gmap-view on a production site, the museums in Normandy:

Using Facets in Cubicweb

2009/02/25 by Adrien Di Mascio

Recently, for internal purposes, we've made a little CubicWeb application to help us organize visits to find new office locations. Here's an excerpt of the schema:

class Office(WorkflowableEntityType):
    price = Int(description='euros / m2 / HC / HT')
    surface = Int(description='m2')
    description = RichString(fulltextindexed=True)
    has_address = SubjectRelation('PostalAddress', cardinality='1?', composite='subject')
    proposed_by = SubjectRelation('Agency')
    comments = ObjectRelation('Comment', cardinality='1*', composite='object')
    screenshots = SubjectRelation(('File', 'Image'), cardinality='*1')

The two other entity types defined in the schema are Visit and Agency but we can also guess from the above that this application uses the two cubes comment and addressbook (remember, cubicweb is only a game where you assemble cubes !).

While we know that just defining the schema is enough to have a full, usable (and testable!) application, we also know that every application needs to be customized to fulfill the needs it was built for. So in this case, what we needed most was some custom filters to restrict searches according to surfaces, prices or zipcodes. Fortunately for us, CubicWeb provides the facets mechanism and a few base classes that make the task quite easy:

class PostalCodeFacet(RelationFacet):
    id = 'postalcode-facet'             # every registered class must have an id
    __select__ = implements('Office')   # this facet should only be selected when
                                        # visualizing offices
    rtype = 'has_address'               # this facet is a filter on the entity linked to
                                        # the office through the relation has_address
    target_attr = 'postalcode'          # the filter's key is the attribute "postalcode"
                                        # of the target PostalAddress entity

This is a typical RelationFacet: we want to be able to filter offices according to the postalcode attribute of their associated PostalAddress. Each line in the class is explained by the comment on its right.

Now, here is the code to define a filter based on the surface attribute of the Office:

class SurfaceFacet(AttributeFacet):
    id = 'surface-facet'              # every registered class must have an id
    __select__ = implements('Office') # this facet should only be selected when
                                      # visualizing offices
    rtype = 'surface'                 # the filter's key is the attribute "surface"
    comparator = '>='                 # override the default value of operator since
                                      # we want to filter according to a minimal
                                      # value, not an exact one

    def rset_vocabulary(self, rset):
        """override the default vocabulary method since we want to hard-code
        our threshold values.
        Not overriding would generate a filter box with all existing surfaces
        defined in the database.
        """
        return [('> 200', '200'), ('> 250', '250'),
                ('> 275', '275'), ('> 300', '300')]

And that's it: we have two filter boxes automatically displayed on each page presenting more than one office. The price facet is basically the same as the surface one but with a different vocabulary and with rtype = 'price'.

(The cube also benefits from the builtin Google Maps views defined by CubicWeb, but that's for another blog post.)