Blog entries october 2013 [3]

Apache authentication

2013/10/10 by Dimitri Papadopoulos

An Apache front end might be useful, as Apache provides standard log files, monitoring or authentication. In our case, we have Apache authenticate users before they are cleared to access our CubicWeb application. Still, we would like user accounts to be managed within a CubicWeb instance, avoiding separate sets of identifiers, one for Apache and the other for CubicWeb.

We have to address two issues:

  • have Apache authenticate users against accounts in the CubicWeb database,
  • have CubicWeb trust Apache authentication.

Apache authentication against CubicWeb accounts

A possible solution would be to access the identifiers associated to a CubicWeb account at the SQL level, directly from the SQL database underneath a CubicWeb instance. The login password can be found in the cw_login and cw_upassword columns of the cw_cwuser table. The benefit is that we can use existing Apache modules for authentication against SQL databases, typically mod_authn_dbd. On the other hand this is highly dependant on the underlying SQL database.

Instead we have chosen an alternate solution, directly accessing the CubicWeb repository. Since we need Python to access the repository, our sysasdmins have deployed mod_python on our Apache server.

We wrote a Python authentication module that accesses the repository using ZMQ. Thus ZMQ needs be enabled. To enable ZMQ uncomment and complete the following line in all-in-one.conf:

zmq-repository-address=zmqpickle-tcp://localhost:8181

The Python authentication module looks like:

from mod_python import apache
from cubicweb import dbapi
from cubicweb import AuthenticationError

def authenhandler(req):
    pw = req.get_basic_auth_pw()
    user = req.user

    database = 'zmqpickle-tcp://localhost:8181'
    try:
        cnx = dbapi.connect(database, login=user, password=pw)
    except AuthenticationError:
        return apache.HTTP_UNAUTHORIZED
    else:
        cnx.close()
        return apache.OK

CubicWeb trusts Apache

Our sysadmins set up Apache to add x-remote-user to the HTTP headers forwarded to CubicWeb - more on the relevant Apache configuration in the next paragraph.

We then add the cubicweb-trustedauth cube to the dependencies of our CubicWeb application. We simply had to add to the __pkginfo__.py file of our CubicWeb application:

__depends__ =  {
    'cubicweb': '>= 3.16.1',
    'cubicweb-trustedauth': None,
}

This cube gets CubicWeb to trust the x-remote-user header sent by the Apache front end. CubicWeb bypasses its own authentication mechanism. Users are directly logged into CubicWeb as the user with a login identical to the Apache login.

Apache configuration and deployment

Our Apache configuration looks like:

<Location /apppath >
  AuthType Basic
  AuthName "Restricted Area"
  AuthBasicAuthoritative Off
  AuthUserFile /dev/null
  require valid-user

  PythonAuthenHandler cubicwebhandler

  RewriteEngine On
  RewriteCond %{REMOTE_USER} (.*)
  RewriteRule . - [E=RU:%1]
<Location /apppath >

RequestHeader set X-REMOTE-USER %{RU}e

ProxyPass          /apppath  http://127.0.0.1:8080
ProxyPassReverse   /apppath  http://127.0.0.1:8080

The CubicWeb application is accessed as http://ourserver/apppath/.

The Python authentication module is deployed as /usr/lib/python2.7/dist-packages/cubicwebhandler/handler.py where cubicwebhandler is the attribute associated to PythonAuthenHandler in the Apache configuration.


Brainomics / CrEDIBLE conference report

2013/10/09 by Vincent Michel

Cubicweb and the Brainomics project were presented last week at the CrEDIBLE workshop (October 2-4, 2013, Sophia-Antipolis) on "Federating distributed and heterogeneous biomedical data and knowledge". We would like to thank the organizers for this nice opportunity to show the features of CubicWeb and Brainomics in the context of biomedical data.

http://credible.i3s.unice.fr/lib/tpl/credible/images/credible.png

Workshop highlights

  • A short presentation of SHI3LD that defines data access based on conditions that are based on ASK request. The other part was a state of the art of Open data license, and the (poor) existence of licenses expressed in RDF. Future work seems to be an interesting combination of both SHI3LD and RDF-based licenses for data access.
  • MIDAS, an open-source software for sharing medical data. This project could be an interesting source of inspiration for the file sharing part of CubicWeb, even if the (really complicated in my opinion) case of large files downloads is not addressed for now.
  • Federated queries based on FedX - the optimization techniques based on source selection & exclusive groups seems a good approach for avoiding large data transfers and finding some (sub-)optimal ways to join the different data sources. This should be taken into account in the future work on the "FROM" clause in CubicWeb.
  • WebPIE/QueryPIE: a map-reduce-based approach for large-scale reasoning.

CubicWeb and Brainomics

The slides of the presentation can be download as a PDF or viewed on slideshare.

Some people seem confused on the RQL to SQL translation. This relies on a simple translation logic that is implemented in the rql2sql file. This is only an implementation trick, not so different from the one used in RDBMS-based triplestores that have to convert SPARQL into SQL.

RQL inference : there is no magic behind the RQL inference process. As opposed to triplestores that store RDF triples that contain their own schema, and thus cannot easily know the full data model in these triples without looking at all the triples, RQL relies on a relational database with an fixed (at a given moment) data model, thus allowing inference and simple checks. In particular, in this example, we want All the Cities of `Île de France` with more than 100 000 inhabitants ?, which is expressed in RQL:

Any X WHERE X region Y, X population > 100000,
            Y uri "http://fr.dbpedia.org/resource/Île-de-France"

and SPARQL:

select ?ville where {
?ville db-owl:region <http://fr.dbpedia.org/resource/Île-de-France> .
?ville db-owl:populationTotal ?population .
FILTER (?population > 100000)
}

Beside the fact that RQL is less verbose that SPARQL (syntax matters), the simplicity of RQL relies on the fact that it can automatically infer (similarly to SPARQL) that if X is related to Y by the region relation and has a population attribute, it should be a city. If city and district both have the region relation and a population attribute, the RQL inference allows to fetch them both transparently, otherwise one can be specific by using the is relation:

Any X WHERE X is City, X region Y, X population > 100000,
            Y uri "http://fr.dbpedia.org/resource/Île-de-France"

RQL also allows subqueries, union, full-text search, stored procedures, ... (see the doc).

These really interesting discussions convinced us that we should write a journal paper for detailing the theoretical and technical concepts behind RQL and the YAMS schema.


Logilab will be in Toulouse métropole Open Data Barcamp tomorrow

2013/10/08 by Sylvain Thenault

Meet us tomorrow at the Toulouse's Cantine where several people from Logilab will be there for the open data barcamp organized by Toulouse Metropole.

More infos on barcamp.org. We'll probably talk abouthow CubicWeb manages to import large amounts of open-data to reuse.