Blog entries February 2014

CubicWeb sprint / winter 2014

2014/02/12 by Nicolas Chauvat

This sprint took place at Logilab's offices in Paris on Feb 13/14. People from CEA, Unlish, Crealibre and Logilab teamed up to push CubicWeb forward.

We did not forget the priorities from the roadmap:

  • CubicWeb 3.17.13 and 3.18.3 were released, and CubicWeb 3.19 made progress;
  • the branch about ComputedAttributes and ComputedRelations (CWEP-002) is ready to be merged;
  • the branch about the FROM clause (CWEP-003) made progress (the CWEP was reviewed and part of the resulting spec was implemented);
  • in order to reduce work in progress, the number of patches in state reviewed or pending-review was brought down from 302 to 243, a reduction of 59 patches or roughly 20%, which is not bad.

CubicWeb using Postgresql at its best

2014/02/08 by Nicolas Chauvat

We had a chat today with a core contributor to Postgresql from whom we may buy consulting services in the future. We discussed how CubicWeb could get the best out of Postgresql:

  • making use of the LISTEN/NOTIFY mechanism built into PG could be useful (to warn the cache about modified items, for example) and PgQ is its good friend (a LISTEN/NOTIFY sketch follows this list);
  • views (materialized or not) are another way to implement computed attributes and relations (see CWEP-002), and it could be that the Entities table is in fact a view of other tables (a small view example follows the list);
  • implementing RQL as an in-database language could open the door to new things (there is PL/pgSQL, PL/Python, what if we had PL/RQL?);
  • Foreign Data Wrappers written with Multicorn would be another way to write data feeds (see LDAP integration for an example); a minimal Multicorn wrapper is sketched after this list;
  • managing dates can be tricky when users reside in different timezones; keeping everything in UTC inside the database and converting only at the edges is the way to go (unicode/str is a good analogy: decode on input, encode on output); see the timezone sketch below;
  • for the transitive closures that are often needed when implementing access control policies with __permissions, Postgresql can go a long way with recursive and writable CTEs, along the lines of "WITH RECURSIVE ... AS (SELECT ... UNION ALL SELECT ...) UPDATE ... FROM ... RETURNING *" (see the recursive query sketch below);
  • the fastest way to load tabular data that does not need too much pre-processing is to create a temporary table in memory, COPY FROM the data into that table, index it, then write the transform and load step in SQL (maybe with PL/Python); a COPY-based loading sketch follows the list;
  • when executing more than about ten updates in a row, it is better to write into a temporary table in memory, then update the actual tables with UPDATE ... FROM (let's check whether the psycopg driver does that when executemany is called); see the batch-update sketch below;
  • reaching 10e8 rows in a table is, at the time of this writing, the point at which you should start monitoring your database seriously and consider replication, partitioning and sharding;
  • full-text search in Postgresql is much better than the general public thinks, and recent developments have made it orders of magnitude faster than tools like Lucene, Solr or ElasticSearch (a full-text search sketch follows the list);
  • when dealing with complex queries (searching graphs, maybe), an option to consider is to implement a specific data type, use it in a materialized view and put GIN or GiST indexes over it (see the materialized view sketch below);
  • for large scientific data sets, it could be interesting to link the numpy library into Postgresql and turn numpy arrays into a new data type;
  • Oh, and one last thing: the object-oriented tables of Postgresql are not such a great idea, unless you have a use case that fits them perfectly and does not hit their limitations (CubicWeb's is_instance_of does not seem to be one of these).
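
To make the LISTEN/NOTIFY point concrete, here is a minimal sketch using psycopg2; the channel name entity_changed and the eid payload are assumptions made up for the example, not something CubicWeb does today:

    import select

    import psycopg2
    import psycopg2.extensions

    conn = psycopg2.connect("dbname=mycubicweb")
    conn.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)
    cur = conn.cursor()
    cur.execute("LISTEN entity_changed;")

    # another session would run: NOTIFY entity_changed, '1234';
    while True:
        if select.select([conn], [], [], 5) == ([], [], []):
            continue  # timed out, nothing happened
        conn.poll()
        while conn.notifies:
            notification = conn.notifies.pop(0)
            print("invalidate cache entry for eid %s" % notification.payload)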
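
About views as computed attributes, a tiny sketch of what it could look like, counting the comments of each blog entry; the cw_* table and column names are assumptions chosen for the example:

    import psycopg2

    conn = psycopg2.connect("dbname=mycubicweb")
    cur = conn.cursor()
    # the comment_count "attribute" is computed by the database, not stored
    cur.execute("""
        CREATE VIEW blogentry_with_count AS
        SELECT b.cw_eid, b.cw_title, count(c.cw_eid) AS comment_count
        FROM cw_blogentry b
        LEFT JOIN cw_comment c ON c.cw_comments = b.cw_eid
        GROUP BY b.cw_eid, b.cw_title
    """)
    conn.commit()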
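
A toy Multicorn foreign data wrapper, to give an idea of how little code a data feed could boil down to; the CSV reading and the filename option are invented for the sketch:

    import csv

    from multicorn import ForeignDataWrapper


    class CsvFdw(ForeignDataWrapper):
        """Expose a CSV file as a foreign table, one line per row."""

        def __init__(self, options, columns):
            super(CsvFdw, self).__init__(options, columns)
            self.filename = options["filename"]
            self.columns = columns

        def execute(self, quals, columns):
            # quals could be used to filter rows early; ignored here
            with open(self.filename) as stream:
                for line in csv.reader(stream):
                    yield dict(zip(self.columns, line))

On the SQL side, such a wrapper would be plugged in with CREATE EXTENSION multicorn, a CREATE SERVER ... FOREIGN DATA WRAPPER multicorn OPTIONS (wrapper 'mymodule.CsvFdw') and a CREATE FOREIGN TABLE naming the columns.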
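
On the timezone point, the rule of thumb translates to something like this (the Paris offset is hard-coded only to keep the sketch short):

    from datetime import datetime, timedelta, timezone

    # store timezone-aware UTC timestamps in the database...
    created = datetime.now(timezone.utc)

    # ...and convert to the user's timezone only when displaying; a real
    # application would look the offset up in a timezone database instead
    # of hard-coding Paris winter time
    paris = timezone(timedelta(hours=1))
    print(created.astimezone(paris).isoformat())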
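
The transitive closure point, sketched with a recursive query; the in_group table with child_id/parent_id columns is hypothetical and assumed to be acyclic:

    import psycopg2

    conn = psycopg2.connect("dbname=mycubicweb")
    cur = conn.cursor()
    # all direct and indirect groups of entity 1234
    cur.execute("""
        WITH RECURSIVE closure(child_id, parent_id) AS (
            SELECT child_id, parent_id FROM in_group
          UNION ALL
            SELECT c.child_id, g.parent_id
            FROM closure c JOIN in_group g ON c.parent_id = g.child_id
        )
        SELECT DISTINCT parent_id FROM closure WHERE child_id = %s
    """, (1234,))
    print(cur.fetchall())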
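
The COPY-based loading recipe could look like this; the file name, the staging columns and the cw_person target table are assumptions for the sketch:

    import psycopg2

    conn = psycopg2.connect("dbname=mycubicweb")
    cur = conn.cursor()
    cur.execute("CREATE TEMPORARY TABLE staging (name text, birth text)")
    with open("people.csv") as stream:
        cur.copy_expert("COPY staging FROM STDIN WITH CSV", stream)
    cur.execute("CREATE INDEX ON staging (name)")
    # transform and load in SQL once the raw data is in the database
    cur.execute("""
        INSERT INTO cw_person (cw_name, cw_birthdate)
        SELECT name, to_date(birth, 'YYYY-MM-DD') FROM staging
    """)
    conn.commit()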
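
And the batch-update one: instead of thousands of single-row UPDATE statements, fill a temporary table and touch the real table in one statement (cw_person and its columns are again made up):

    import psycopg2

    new_names = [(1, "alice"), (2, "bob")]  # (eid, new name) pairs

    conn = psycopg2.connect("dbname=mycubicweb")
    cur = conn.cursor()
    cur.execute("CREATE TEMPORARY TABLE new_names (eid integer, name text)")
    cur.executemany("INSERT INTO new_names (eid, name) VALUES (%s, %s)",
                    new_names)
    # a single statement updates the real table
    cur.execute("""
        UPDATE cw_person SET cw_name = new_names.name
        FROM new_names WHERE cw_person.cw_eid = new_names.eid
    """)
    conn.commit()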
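
Built-in full-text search in two statements, an expression GIN index and a to_tsquery lookup; cw_blogentry and cw_content are assumed names:

    import psycopg2

    conn = psycopg2.connect("dbname=mycubicweb")
    cur = conn.cursor()
    cur.execute("""
        CREATE INDEX blogentry_fti ON cw_blogentry
        USING gin (to_tsvector('english', cw_content))
    """)
    cur.execute("""
        SELECT cw_eid FROM cw_blogentry
        WHERE to_tsvector('english', cw_content)
              @@ to_tsquery('english', 'postgresql & fast')
    """)
    print(cur.fetchall())
    conn.commit()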
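
Finally, the materialized view and GIN index idea, with a plain integer array standing in for the "specific data type"; the relation table and its columns are hypothetical:

    import psycopg2

    conn = psycopg2.connect("dbname=mycubicweb")
    cur = conn.cursor()
    cur.execute("""
        CREATE MATERIALIZED VIEW person_groups AS
        SELECT eid_from AS person, array_agg(eid_to) AS groups
        FROM in_group_relation GROUP BY eid_from
    """)
    cur.execute("CREATE INDEX person_groups_gin ON person_groups USING gin (groups)")
    # the containment operator @> can now use the GIN index
    cur.execute("SELECT person FROM person_groups WHERE groups @> ARRAY[42]")
    print(cur.fetchall())
    conn.commit()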

Hopin' I got you thinkin' :)

PostgreSQL elephant logo: http://developer.postgresql.org/~josh/graphics/logos/elephant.png