|
Blog entries by Adrien Di Mascio [9]
In most cubicweb projects I've been developing on, there always comes a time where
I need to import legacy data in the new application. CubicWeb provides Store
and Controller objects in the dataimport module. I won't talk here about the
recommended general procedure described in the module's docstring (I find it a bit
convoluted for simple cases) but I will focus on Store objects. Store
objects in this module are more or less a thin layer around session objects,
they provide high-level helpers such as create_entity(), relate()
and keep track of what was inserted, errors occurred, etc.
In a recent project, I had to create a somewhat fair amount (a few million)
of simple entities (strings, integers, floats and dates) and relations. Default
object store (i.e. cubicweb.dataimport.RQLObjectStore) is painfully slow,
the reason being all integrity / security / metadata hooks that are constantly
selected and executed. For large imports, dataimport also provides the
cubicweb.dataimport.NoHookRQLObjectStore. This store bypasses all hooks
and uses the underlying system source primitives directly, making it around
two-times faster than the standard store. The problem is that we're still
doing each sql query sequentially and we're talking here of millions of
INSERT / UPDATE queries.
My idea was to create my own ObjectStore class inheriting from NoHookRQLObjectStore
that would try to use executemany or even copy_from when possible . It
is actually not hard to make groups of similar SQL queries since create_entity()
generates the same query for a given set of parameters. For instance:
create_entity('Person', firstname='John', surname='Doe')
create_entity('Person', firstname='Tim', surname='BL')
will generate the following sql queries:
INSERT INTO cw_Person ( cw_cwuri, cw_eid, cw_modification_date,
cw_creation_date, cw_firstname, cw_surname )
VALUES ( %(cw_cwuri)s, %(cw_eid)s, %(cw_modification_date)s,
%(cw_creation_date)s, %(cw_firstname)s, %(cw_surname)s )
INSERT INTO cw_Person ( cw_cwuri, cw_eid, cw_modification_date,
cw_creation_date, cw_firstname, cw_surname )
VALUES ( %(cw_cwuri)s, %(cw_eid)s, %(cw_modification_date)s,
%(cw_creation_date)s, %(cw_firstname)s, %(cw_surname)s )
The only thing that will differ is the actual data inserted. Well ... ahem ... CubicWeb actually
also generates a "few" extra sql queries to insert metadata for each entity:
INSERT INTO is_instance_of_relation(eid_from,eid_to) VALUES (%s,%s)
INSERT INTO is_relation(eid_from,eid_to) VALUES (%s,%s)
INSERT INTO cw_source_relation(eid_from,eid_to) VALUES (%s,%s)
INSERT INTO owned_by_relation ( eid_to, eid_from ) VALUES ( %(eid_to)s, %(eid_from)s )
INSERT INTO created_by_relation ( eid_to, eid_from ) VALUES ( %(eid_to)s, %(eid_from)s )
Those extra queries are actually even exactly the same for each entity insterted, whatever
the entity type is, hence craving for executemany or copy_from. Grouping together SQL queries
is not that hard but has a drawback : as you don't have an intermediate state
(the data is actually inserted only at the very end of the process),
you loose the ability to query your database to fetch the entities you've just created
during the import.
Now, a few benchmarks ...
To create those benchmarks, I decided to use the workorder cube which is a simple cube, yet complete enough : it provides only two entity types (WorkOrder and Order), a relation between them (Order split_into WorkOrder) and uses different kind of attributes (String, Date, Float).
Once the cube was instantiated, I ran the following script to populate the database with my 3 different stores:
import sys
from datetime import date
from random import choice
from itertools import count
from logilab.common.decorators import timed
from cubicweb import cwconfig
from cubicweb.dbapi import in_memory_repo_cnx
def workorders_data(n, seq=count()):
for i in xrange(n):
yield {'title': u'wo-title%s' % seq.next(), 'description': u'foo',
'begin_date': date.today(), 'end_date': date.today()}
def orders_data(n, seq=count()):
for i in xrange(n):
yield {'title': u'o-title%s' % seq.next(), 'date': date.today(), 'budget': 0.8}
def split_into(orders, workorders):
for workorder in workorders:
yield choice(orders), workorder
def initial_state(session, etype):
return session.execute('Any S WHERE S is State, WF initial_state S, '
'WF workflow_of ET, ET name %(etn)s', {'etn': etype})[0][0]
@timed
def populate(store, nb_workorders, nb_orders, set_state=False):
orders = [store.create_entity('Order', **attrs)
for attrs in orders_data(nb_orders)]
workorders = [store.create_entity('WorkOrder', **attrs)
for attrs in workorders_data(nb_workorders)]
## in_state is set by a hook, so NoHookObjectStore will need
## to set the relation manually
if set_state:
order_state = initial_state(store.session, 'Order')
workorder_state = initial_state(store.session, 'WorkOrder')
for order in orders:
store.relate(order.eid, 'in_state', order_state)
for workorder in workorders:
store.relate(workorder.eid, 'in_state', workorder_state)
for order, workorder in split_into(orders, workorders):
store.relate(order.eid, 'split_into', workorder.eid)
store.commit()
if __name__ == '__main__':
config = cwconfig.instance_configuration(sys.argv[1])
nb_orders = int(sys.argv[2])
nb_workorders = int(sys.argv[3])
repo, cnx = in_memory_repo_cnx(config, login='admin', password='admin')
session = repo._get_session(cnx.sessionid)
from cubicweb.dataimport import RQLObjectStore, NoHookRQLObjectStore
from cubes.mycube.dataimport.store import CopyFromRQLObjectStore
print 'testing RQLObjectStore'
store = RQLObjectStore(session)
populate(store, nb_workorders, nb_orders)
print 'testing NoHookRQLObjectStore'
store = NoHookRQLObjectStore(session)
populate(store, nb_workorders, nb_orders, set_state=True)
print 'testing CopyFromRQLObjectStore'
store = CopyFromRQLObjectStore(session)
I ran the script and asked to create 100 Order entities, 1000 WorkOrder entities
and to link each created WorkOrder to a parent Order
adim@esope:~/tmp/bench_cwdi$ python bench_cwdi.py bench_cwdi 100 1000
testing RQLObjectStore
populate clock: 24.590000000 / time: 46.169721127
testing NoHookRQLObjectStore
populate clock: 8.100000000 / time: 25.712352991
testing CopyFromRQLObjectStore
populate clock: 0.830000000 / time: 1.180006981
My interpretation of the above times is :
- The clock time indicates the time spent on CubicWeb server side (i.e. hooks
and data pre/postprocessing around SQL queries). The time time should be
the sum of clock time + time spent in postgresql.
- RQLObjectStore is slow ;-). Nothing new here, but the clock/time ratio
means that we're speding a lot of time on the python side (i.e. hooks
as I told earlier) and a fair amount of time in postgresql.
- NoHookRQLObjectStore really takes down the time spent on the python side,
the time in postgresql remains about the same as for RQLObjectStore, this
is not surprising, queries performed are the same in both cases.
- CopyFromRQLObjectStore seems blazingly fast in comparison (inserting
a few thousands of elements in postgresql with a COPY FROM statement
is not a problem). And ... yes, I checked the data was actually inserted,
and I even a ran a cubicweb-ctl db-check on the instance afterwards.
This probably opens new perspective for massive data imports since the client API
remains the same as before for the programmer. It's still a bit experimental, can only be
used for "dummy", brute-force import scenario where you can preprocess your data
in Python before updating the database, but it's probably worth having such
a store in the the dataimport module.
Logilab is hosting a CubicWeb sprint - 3 days in our Paris offices.
The general focus will be on speed :
- on cubicweb-server side : improve performance of massive insertions / deletions
- on cubicweb-client side : cache implementation, HTTP server, massive parallel usage, etc.
This sprint will take place from in April 2011 from tuesday the 26th to thursday the 28th. You are more than welcome to come along and help out, contribute, but unlike previous sprints, at least basic knowledge of CubicWeb will be required for participants since no introduction is planned.
Network resources will be available for those bringing laptops.
Address : 104 Boulevard Auguste-Blanqui, Paris. Ring "Logilab" (googlemap)
Metro : Glacière
Contact : http://www.logilab.fr/contact
Dates : 26/04/2011 to 28/04/2011
These first two days essentially consisted in exploring the
javascript world.
Sandrine and Alain worked on the
javascript documentation tools and how they could be integrated
into our sphinx generated documentation.
They first studied pyjsdoc which unfortunately only generates
HTML. After a somewhat successful attempt to generate sphinx ReST, we
decided to use a consistent documentation format between python
modules and js modules and therefore switched to a home-made, very simple
javascript comment parser. Here's an example of what the parser understands:
/**
* .. cfunction:: myFunction(a, b, /*...*/, c, d)
*
* This function is very **well** documented and does quite
* a lot of stuff :
* - task 1
* - task 2
*
* :param a: this is the first parameter
* ...
* :return: 42
*/
function myFunction(a, b, /*...*/, c, d) {
}
The extracted ReST snippets are then concatenated and inserted
in the general documentation.
Katia, Julien and Adrien looked at the different testing tools
for javascript, with the two following goals in mind:
- low-level unit testing, as cubicweb agnostic as possible
- high-level / functional testing, we want to write navigation scenarios
and replay them
And the two winners of the exploration are:
- QUnit for pure javascript / DOM testing. Julien and Adrien
successfully managed to test a few cubicweb js functions, most
notably the loadxhtml jquery plugin.
- Windmill for higher level testing. Katia and Sylvain were able
to integrate Windmill within the CubicWeb unit testing framework.
Of course, there is still a lot of work that needs to be done. For
instance, we would like to have a test runner facility to run
QUnit-based tests on multiple platforms / browsers automatically.
Sylvain worked on property sheets and managed to implement
compiled CSS based on simple string interpolation. Of course,
compiled CSS are still HTTP cached, automatically recompiled
on debug mode, etc. On his way, he also got rid of the
external_resources file. Backward compatibility will of
course be guaranteed for a while.
Nicolas worked on CSS and vertical rythm and prepared a patch
that introduces a basic rhythm. The tedious work will be to get
every stylesheet to dance to the beat.
During last cubicweb sprint, I was asked if it was possible to customize
the search box CubicWeb comes with. By default, you can use it
to either type RQL queries, plain text queries or standard shortcuts
such as <EntityType> or <EntityType> <attrname> <value>.
Ultimately, all queries are translated to rql since it's the only
language understood on the server (data) side. To transform the user
query into RQL, CubicWeb uses the so-called
magicsearch component which in turn delegates to a number of
query preprocessor that are responsible of interpreting the
user query and generating corresponding RQL.
The code of the main processor loop is easy to understand:
for proc in self.processors:
try:
return proc.process_query(uquery, req)
except (RQLSyntaxError, BadRQLQuery):
pass
The idea is simple: for each query processor, try to translate the
query. If it fails, try with the next processor, if it succeeds,
we're done and the RQL query will be executed.
Now that the general mechanism is understood, here's an example
of code that could be used in a forge-based cube to add
a new search shortcut to find tickets. We'd like to use
the project_name:text syntax to search for tickets of
project_name containing text (e.g pylint:warning).
Here's the corresponding preprocessor code:
from cubicweb.web.views.magicsearch import BaseQueryProcessor
class MyCustomQueryProcessor(BaseQueryProcessor):
priority = 0 # controls order in which processors are tried
def preprocess_query(self, uquery, req):
"""
:param uqery: the query as sent by the browser
:param req: the standard, omnipresent, cubicweb's req object
"""
try:
project_name, text = uquery.split(':')
except ValueError:
return None # the shortcut doesn't apply
return (u'Any T WHERE T is Ticket, T concerns P, P name %(p)s, '
u'T has_text %(t)s', {'p': project_name, 't': text})
The code is rather self-explanatory, but here's a few additional comments:
- the class is registered with the standard vregistry mechanism and should
be defined along the views
- the priority attribute is used to sort and define the order
in which processors will be tried in the main processor loop
- the preprocess_query returns None or raise an exception if the
query can't be processed
To summarize, if you want to customize the search box, you have to:
- define a new query preprocessor component
- define its priority wrt other standard processors
- implement the preprocess_query method
and CubicWeb will do the rest !
Last week, we finally took a few days to dive into SPARQL in
order to transform any CubicWeb application into a potential
SPARQL endpoint.
The first step was to get a parser. Fortunately
the w3c provides a grammar definition and around 200 test
cases. There was a few interesting options around there: we
tried to reuse rdflib, rasqal, the sparql.g version
designed for antlr3 and SimpleParse but after two days of
work, we had nothing that worked well enough. We decided it was
not worth it and switched to yapps since we knew yapps and rql
already had a dependency on it.
Maybe we'll consider changing the parser at some point later but
the priority was to get something working as soon as we could and
we finally came up with a version of fyzz passing 90% of the
W3C test suite (of course, there might be some false positives).
Fyzz parses the SPARQL query and generates something we decided to call an
AST although it's still a bit rough for now. Fyzz understands simple triples,
distincts, limits, offsets and other basic functionalities.
Please note that fyzz is totally independent of cubicweb and it can
be reused by any project.
Here's an example of how to use fyzz:
>>> from fyzz.yappsparser import parse
>>> ast = parse("""PREFIX doap: <http://usefulinc.com/ns/doap#>
... SELECT ?project ?name WHERE {
... ?project a doap:Project;
... doap:name ?name.
... }
... ORDER BY ?name LIMIT 5 OFFSET 10
... """)
>>> print ast.selected
[SparqlVar('project'), SparqlVar('name')]
>>> print ast.prefixes
{'doap': 'http://usefulinc.com/ns/doap#'}
>>> print ast.orderby
[(SparqlVar('name'), 'asc')]
>>> print ast.limit, ast.offset
5 10
>>> print ast.where
[(SparqlVar('project'), ('', 'a'), ('http://usefulinc.com/ns/doap#', 'Project')),
(SparqlVar('project'), ('http://usefulinc.com/ns/doap#', 'name'), SparqlVar('name'))]
This AST is then processed and transformed into a RQL query which
can finally be processed by CubicWeb directly.
Here's what can be done in cubicweb-ctl shell session (of course,
this can also be done in the web application) of our forge
cube:
>>> from cubicweb.spa2rql import Sparql2rqlTranslator
>>> query = """PREFIX doap: <http://usefulinc.com/ns/doap#>
... SELECT ?project ?name WHERE {
... ?project a doap:Project;
... doap:name ?name.
... }
... ORDER BY ?name LIMIT 5 OFFSET 10
... """
>>> qinfo = translator.translate(query)
>>> rql, args = qinfo.finalize()
>>> print rql, args
Any PROJECT, NAME ORDERBY NAME ASC LIMIT 5 OFFSET 10 WHERE PROJECT name NAME, PROJECT is Project {}
From the above example, we can notice two things. First, for
cubicweb to understand the doap namespace, we have to
declare the correspondance between the standard doap vocabulary
and our internal schema, this is done with yams.xy:
>>> from yams import xy
>>> xy.register_prefix('http://usefulinc.com/ns/doap#', 'doap')
>>> xy.add_equivalence('Project', 'doap:Project')
>>> xy.add_equivalence('Project name', 'doap:Project doap:name')
Secondly, for now, we notice that the case is not preserved during the
transformation : ?project becomes PROJECT in the rql query. This
is probably something that we'll need to tackle quickly.
We've also add a few views in CubicWeb to wrap that and it will
be available in the upcoming version 3.4.0 and is already
available through our pulic mercurial repository.
The door is now open, the path is still long, stay tuned !
image under creative commons by beger (original)
CubicWeb has this really nice builtin facet system to
define restrictions filters really as easily as possible.
We've just added two new kind of facets in CubicWeb :
- The RangeFacet which displays a slider using jquery
to choose a lower bound and an upper bound. The RangeWidget
works with either numerical values or date values
- The HasRelationFacet which displays a simple checkbox and
lets you refine your selection in order to get only entities
that actually use this relation.
Here's an example of code that defines a facet to filter
musical works according to their composition date:
class CompositionDateFacet(DateRangeFacet):
# 1. make sure this facet is displayed only on Track selection
__select__ = DateRangeFacet.__select__ & implements('Track')
# 2. give the facet an id (required by CubicWeb)
id = 'compdate-facet'
# 3. specify the attribute name that actually stores the date in the DB
rtype = 'composition_date'
And that's it, on each page displaying tracks, you'll be able to filter them
according to their composition date with a jquery slider.
All this, brought by CubicWeb (in the next 3.3 version)
If you feel that one of your pages takes more time than it should
to be generated, chances are that you're making too many RQL queries.
Obviously, there are other reasons but my personal experience tends
to show this is first thing to track down. Luckily for us, CubicWeb
provides a configuration option to log rql queries. In your
all-in-one.conf file, set the query-log-file
option:
# web application query log file
query-log-file=~/myapp-rql.log
Then restart your application, reload your page and stop your application.
The file myapp-rql.log now contains the list of RQL queries that were
executed during your test. It's a simple text file containing lines such as:
Any A WHERE X eid %(x)s, X lastname A {'x': 448} -- (0.002 sec, 0.010 CPU sec)
Any A WHERE X eid %(x)s, X firstname A {'x': 447} -- (0.002 sec, 0.000 CPU sec)
The structure of each line is:
<RQL QUERY> <QUERY ARGS IF ANY> -- <TIME SPENT>
Use the cubicweb-ctl exlog command to examine and summarize data found
in such a file:
adim@crater:~$ cubicweb-ctl exlog < ~/myapp-rql.log
0.07 50 Any A WHERE X eid %(x)s, X firstname A {}
0.05 50 Any A WHERE X eid %(x)s, X lastname A {}
0.01 1 Any X,AA ORDERBY AA DESC WHERE E eid %(x)s, E employees X, X modification_date AA {}
0.01 1 Any X WHERE X eid %(x)s, X owned_by U, U eid %(u)s {, }
0.01 1 Any B,T,P ORDERBY lower(T) WHERE B is Bookmark,B title T, B path P, B bookmarked_by U, U eid %(x)s {}
0.01 1 Any A,B,C,D WHERE A eid %(x)s,A name B,A creation_date C,A modification_date D {}
This command sorts and uniquifies queries so that it's easy to see where
is the hot spot that needs optimization.
Having said all this, it would probably be worth talking about the fetch_attrs attribute
you can define in your entity classes because it can greatly reduce the
number of queries executed but I'll make a specific blog entry for this.
I should finally mention the existence of the profile option in the
all-in-on.conf. If set, this option will make your application run in
an hotshot session and store the results in the specified file.
There is this so-called 'gmap-view' in CubicWeb, the question is: how to use it ?
Well, first, no surprise, you have to generate an API key to be able to use
google maps on your server (make sure your usage conforms the terms as defined by
Google).
Now, let's say you have defined the following schema:
class Company(EntityType):
name = String(required=True, maxsize=64)
# ... some other attributes ...
latitude = Float(required=True)
longitude = Float(required=True)
class Employee(EntityType):
# ... some attributes ...
works_for = SubjectRelation('Company', cardinality='1*')
And you'd like to be able to display companies on a map; you've
also got these nice icons that you'd wish to use as markers on the map.
First thing, define those three icons as external resources. You can
do that by editing your CUBE/data/external_resources file:
SMALL_MARKER_ICON=DATADIR/small_company.png
MEDIUM_MARKER_ICON=DATADIR/MEDIUM_company.png
BIG_MARKER_ICON=DATADIR/big_company.png
We're nearly done, now. We just have to make our entity class implement
the cubicweb.interfaces.IGeocodable interface. Here's an example:
from cubicweb.entities import AnyEntity
from cubicweb.interfaces import IGeocodable
class Company(AnyEntity):
id = 'Company' # this must match the type as defined in your schema
__implements__ = AnyEntity.__implements__ + (IGeocodable,)
def size(self):
return self.req.execute('Any COUNT(E) WHERE E works_for C, C eid %(c)s',
{'c': self.eid})
# this is a method of IGeocodable
def marker_icon(self):
size = self.size()
if size < 20:
return self.req_external_resource('SMALL_MARKER_ICON')
elif size < 500:
return self.req_external_resource('MEDIUM_MARKER_ICON')
else:
return self.req_external_resource('BIG_MARKER_ICON')
That's it, you can now call the gmap-view on a resultset containing companies:
rset = self.req.execute('Any C WHERE C is Company')
self.wview(rset, 'gmap-view', gmap_key=YOUR_API_KEY)
Further configuration is possible, especially to control the size of the map
or the default zoom level.
To be fair, I must say that in a real-life cube, chances are you won't be
able to specificy directly latitude and longitude and that you'll only
have an address. This is slightly more complex to do since you'll need to
query a geocoding service (the google one for instance) to transform your
address into latitude/longitude. This will typically be done in a hook
Here is an screenshot of google maps on a production site, the museums in Normandy :
Recently, for internal purposes, we've made a little cubicweb application to help us
organizing visits to find new office locations. Here's an excerpt of the schema:
class Office(WorkflowableEntityType):
price = Int(description='euros / m2 / HC / HT')
surface = Int(description='m2')
description = RichString(fulltextindexed=True)
has_address = SubjectRelation('PostalAddress', cardinality='1?', composite='subject')
proposed_by = SubjectRelation('Agency')
comments = ObjectRelation('Comment', cardinality='1*', composite='object')
screenshots = SubjectRelation(('File', 'Image'), cardinality='*1',
composite='subject')
The two other entity types defined in the schema are Visit and Agency but we
can also guess from the above that this application uses the two cubes comment and
addressbook (remember, cubicweb is only a game where you assemble cubes !).
While we know that just defining the schema in enough to have a full, usable,
(testable !) application, we also know that every application needs to be
customized to fulfill the needs it was built for. So in this case, what we
needed most was some custom filters that would let us restrict searches according
to surfaces, prices or zipcodes. Fortunately for us, Cubicweb provides the
facets (image) mechanism and a few base classes that make the task quite easy:
class PostalCodeFacet(RelationFacet):
id = 'postalcode-facet' # every registered class must have an id
__select__ = implements('Office') # this facet should only be selected when
# visualizing offices
rtype = 'has_address' # this facet is a filter on the entity linked to
# the office thrhough the relation has_address
target_attr = 'postalcode' # the filter's key is the attribute "postal_code"
# of the target PostalAddress entity
This is a typical RelationFacet: we want to be able to filter offices according
to the attribute postalcode of their associated PostalAdress. Each line in
the class is explained by the comment on its right.
Now, here is the code to define a filter based on the surface attribute of the Office:
class SurfaceFacet(AttributeFacet):
id = 'surface-facet' # every registered class must have an id
__select__ = implements('Office') # this facet should only be selected when
# visualizing offices
rtype = 'surface' # the filter's key is the attribute "surface"
comparator = '>=' # override the default value of operator since
# we want to filter according to a minimal
# value, not an exact one
def rset_vocabulary(self, ___):
"""override the default vocabulary method since we want to hard-code
our threshold values.
Not overriding would generate a filter box with all existing surfaces
defined in the database.
"""
return [('> 200', '200'), ('> 250', '250'),
('> 275', '275'), ('> 300', '300')]
And that's it: we have two filter boxes automatically displayed on each page
presenting more than one office. The price facet is basically the same as the
surface one but with a different vocabulary and with rtype = 'price'.
(The cube also benefits from the builtin google map views defined by
cubicweb but that's for another blog).
|