Blog entries

  • Geonames in CubicWeb !

    2011/12/14 by Vincent Michel

    CubicWeb is a semantic web framework written in Python that has been succesfully used in large-scale projects, such as data.bnf.fr (French National Library's opendata) or Collections des musées de Haute-Normandie (museums of Haute-Normandie).

    CubicWeb provides a high-level query language, called RQL, operating over a relational database (PostgreSQL in our case), and allows to quickly instantiate an entity-relationship data-model. By separating in two distinct steps the query and the display of data, it provides powerful means for data retrieval and processing.

    In this blog, we will demonstrate some of these capabilities on the Geonames data.

    Geonames

    Geonames is an open-source compilation of geographical data from various sources:

    "...The GeoNames geographical database covers all countries and contains over eight million placenames that are available for download free of charge..." (http://www.geonames.org)

    The data is available as a dump containing different CSV files:

    • allCountries: main file containing information about 8,000,000 places in the world. We won't detail the various attributes of each location, but we will focus on some important properties, such as population and elevation. Moreover, admin_code_1 and admin_code_2 will be used to link the different locations to the corresponding AdministrativeRegion, and feature_code will be used to link the data to the corresponding type.
    • admin1CodesASCII.txt and admin2Codes.txt detail the different administrative regions, that are parts of the world such as region (Ile-de-France), department (Department of Yvelines), US counties...
    • featureCodes.txt details the different types of location that may be found in the data, such as forest(s), first-order administrative division, aqueduct, research institute, ...
    • timeZones.txt, countryInfo.txt, iso-languagecodes.txt are additional files prodividing information about timezones, countries and languages. They will be included in our CubicWeb database but won't be explained in more details here.

    The Geonames website also provides some ways to browse the data: by Countries, by Largest Cities, by Highest mountains, by postal codes, etc. We will see that CubicWeb could be used to automatically create such ways of browsing data while allowing far deeper queries. There are two main challenges when dealing with such data:

    • the number of entries: with 8,000,000 placenames, we have to use efficient tools for storing and querying them.
    • the structure of the data: the different types of entries are separated in different files, but should be merged for efficient queries (i.e. we have to rebuild the different links between entities, e.g Location to Country or Location to AdministrativeRegion).

    Data model

    With CubicWeb, the data model of the application is written in Python. It defines different entity classes with their attributes, as well as the relationships between the different entity classes. Here is a sample of the schema.py that we have used for Geonames data:

    class Location(EntityType):
        name = String(maxsize=1024, indexed=True)
        uri = String(unique=True, indexed=True)
        geonameid = Int(indexed=True)
        latitude = Float(indexed=True)
        longitude = Float(indexed=True)
        feature_code = SubjectRelation('FeatureCode', cardinality='?*', inlined=True)
        country = SubjectRelation('Country', cardinality='?*', inlined=True)
        main_administrative_region = SubjectRelation('AdministrativeRegion',
                                  cardinality='?*', inlined=True)
        timezone = SubjectRelation('TimeZone', cardinality='?*', inlined=True)
        ...
    

    This indicates that the main Location class has a name attribute (string), an uri (string), a geonameid (integer), a latitude and a longitude (both floats), and some relation to other entity classes such as FeatureCode (the relation is named feature_code), Country (the relation is named country), or AdministrativeRegion called main_administrative_region.

    The cardinality of each relation is classically defined in a similar way as RDBMS, where * means any number, ? means zero or one and 1 means one and only one.

    We give below a visualisation of the schema (obtained using the /schema relative url)

    http://www.cubicweb.org/file/2124618?vid=download

    Import

    The data contained in the CSV files could be pushed and stored without any processing, but it is interesting to reconstruct the relations that may exist between different entities and entity classes, so that queries will be easier and faster.

    Executing the import procedure took us 80 minutes on regular hardware, which seems very reasonable given the amount of data (~7,000,000 entities, 920MB for the allCountries.txt file), and the fact that we are also constructing many indexes (on attributes or on relations) to improve the queries. This import procedure uses some low-level SQL commands to load the data into the underlying relational database.

    Queries and views

    As stated before, queries are performed in CubicWeb using RQL (Relational Query Language), which is similar to SPARQL, but with a syntax that is closer to SQL. This language may be used to query directly the concepts while abstracting the physical structure of the underlying database. For example, one can use the following request:

    Any X LIMIT 10 WHERE X is Location, X population > 1000000,
        X country C, C name "France"
    

    that means:

    Give me 10 locations that have a population greater than 1000000, and that are in a country named "France"

    The corresponding SQL query is:

    SELECT _X.cw_eid FROM cw_Country AS _C, cw_Location AS _X
    WHERE _X.cw_population>1000000
          AND _X.cw_country=_C.cw_eid AND _C.cw_name="France"
    LIMIT 10
    

    We can see that RQL is higher-level than SQL and abstracts the details of the tables and the joins.

    A query returns a result set (a list of results), that can be displayed using views. A main feature of CubicWeb is to separate the two steps of querying the data and displaying the results. One can query some data and visualize the results in the standard web framework, download them in different formats (JSON, RDF, CSV,...), or display them in some specific view developed in Python.

    In particular, we will use the mapstraction.map which is based on the Mapstraction and the OpenLayers libraries to display information on maps using data from OpenStreetMap. This mapstraction.map view uses a feature of CubicWeb called adapter. An adapter adapts a class of entity to some interface, hence views can rely on interfaces instead of types and be able to display entities with different attributes and relations. In our case, the IGeocodableAdapter returns a latitude and a longitude for a given class of entity (here, the mapping is trivial, but there are more complex cases... :) ):

    class IGeocodableAdapter(EntityAdapter):
          __regid__ = 'IGeocodable'
          __select__ = is_instance('Location')
          @property
          def latitude(self):
              return self.entity.latitude
          @property
          def longitude(self):
              return self.entity.longitude
    

    We will give some results of queries and views later. It is important to notice that the following screenshoots are taken without any modification of the standard web interface of CubicWeb. It is possible to write specific views and to define a specific CSS, but we only wanted to show how CubicWeb could handle such data. However, the default web template of CubicWeb is sufficient for what we want to do, as it dynamically creates web pages showing attributes and relations, as well as some specific forms and javascript applets adapted directly to the data (e.g. map-based tools). Last but not least, the query and the view could be defined within the url, and thus open a world of new possibilities to the user:

    http://baseurl:port/?rql=The query that I want&vid=Identifier-of-the-view
    

    Facets

    We will not get into too much details about Facets, but let's just say that this feature may be used to determine some filtering axis on the data, and thus may be used to post-filter a result set. In this example, we have defined four different facets: on the population, on the elevation, one the feature_code and one the main_administrative_region. We will see illustration of these facets below.

    We give here an example of the definition of a Facet:

    class LocationPopulationFacet(facet.RangeFacet):
        __regid__ = 'population-facet'
        __select__ = is_instance('Location')
        order = 2
        rtype = 'population'
    

    where __select__ defines which class(es) of entities are targeted by this facet, order defines the order of display of the different facets, and rtype defines the target attribute/relation that will be used for filtering.

    Geonames in CubicWeb

    The main page of the Geoname application is illustrated in the screenshot below. It provides general information on the database, in particular the number of entities in the different classes:

    • 7,984,330 locations.
    • 59,201 administrative regions (e.g. regions, counties, departments...)
    • 7,766 languages.
    • 656 features (e.g. types of location).
    • 410 time zones.
    • 252 countries.
    • 7 continents.
    http://www.cubicweb.org/file/2124617?vid=download

    Simple query

    We will first illustrate the possibilites of CubicWeb with the simple query that we have detailed before (that could be directly pasted in the url...):

    Any X LIMIT 10 WHERE X is Location, X population > 1000000,
        X country C, C name "France"
    

    We obtain the following page:

    http://www.cubicweb.org/file/2124615?vid=download

    This is the standard view of CubicWeb for displaying results. We can see (right box) that we obtain 10 locations that are indeed located in France, with a population of more than 1,000,000 inhabitants. The left box shows the search panel that could be used to launch queries, and the facet filters that may be used for filtering results, e.g. we may ask to keep only results with a population greater than 4,767,709 inhabitants within the previous results:

    http://www.cubicweb.org/file/2124616?vid=download

    and we obtain now only 4 results. We can also notice that the facets are linked: by restricting the result set using the population facet, the other facets also restricted their possibilities.

    Simple query (but with more information !)

    Let's say that we now want more information about the results that we have obtained previously (for example the exact population, the elevation and the name). This is really simple ! We just have to ask within the RQL query what we want (of course, the names N, P, E of the variables could be almost anything...):

    Any N, P, E LIMIT 10 WHERE X is Location,
        X population P, X population > 1000000,
        X elevation E, X name N, X country C, C name "France"
    
    http://www.cubicweb.org/file/2124619?vid=download

    The empty column for the elevation simply means that we don't have any information about elevation.

    Anyway, we can see that fetching particular information could not be simpler! Indeed, with more complex queries, we can access countless information from the Geonames database:

    Any N,E,LA,LO ORDERBY E DESC LIMIT 10  WHERE X is Location,
          X latitude LA, X longitude LO,
          X elevation E, NOT X elevation NULL, X name N,
          X country C, C name "France"
    

    which means:

    Give me the 10 highest locations (the 10 first when sorting by decreasing elevation) with their name, elevation, latitude and longitude that are in a country named "France"
    http://www.cubicweb.org/file/2124626?vid=download

    We can now use another view on the same request, e.g. on a map (view mapstraction.map):

    Any X ORDERBY E DESC LIMIT 10  WHERE X is Location,
           X latitude LA, X longitude LO, X elevation E,
           NOT X elevation NULL, X country C, C name "France"
    
    http://www.cubicweb.org/file/2124631?vid=download

    And now, we can add the fact that we want more results (20), and that the location should have a non-null population:

    Any N, E, P, LA, LO ORDERBY E DESC LIMIT 20  WHERE X is Location,
           X latitude LA, X longitude LO,
           X elevation E, NOT X elevation NULL, X population P,
           X population > 0, X name N, X country C, C name "France"
    
    http://www.cubicweb.org/file/2124632?vid=download

    ... and on a map ...

    http://www.cubicweb.org/file/2124633?vid=download

    Conclusion

    In this blog, we have seen how CubicWeb could be used to store and query complex data, while providing (among other...) Web-based views for data vizualisation. It allows the user to directly query data within the URL and may be used to interact with and explore the data in depth. In a next blog, we will give more complex queries to show the full possibilities of the system.


  • Sparkles everywhere, CubicWeb gets fizzy

    2009/07/28 by Adrien Di Mascio
    http://www.logilab.org/file/9845/raw/sparkling.jpg

    Last week, we finally took a few days to dive into SPARQL in order to transform any CubicWeb application into a potential SPARQL endpoint.

    The first step was to get a parser. Fortunately the w3c provides a grammar definition and around 200 test cases. There was a few interesting options around there: we tried to reuse rdflib, rasqal, the sparql.g version designed for antlr3 and SimpleParse but after two days of work, we had nothing that worked well enough. We decided it was not worth it and switched to yapps since we knew yapps and rql already had a dependency on it.

    Maybe we'll consider changing the parser at some point later but the priority was to get something working as soon as we could and we finally came up with a version of fyzz passing 90% of the W3C test suite (of course, there might be some false positives).

    Fyzz parses the SPARQL query and generates something we decided to call an AST although it's still a bit rough for now. Fyzz understands simple triples, distincts, limits, offsets and other basic functionalities.

    Please note that fyzz is totally independent of cubicweb and it can be reused by any project.

    Here's an example of how to use fyzz:

    >>> from fyzz.yappsparser import parse
    >>> ast = parse("""PREFIX doap: <http://usefulinc.com/ns/doap#>
    ... SELECT ?project ?name WHERE {
    ...    ?project a doap:Project;
    ...         doap:name ?name.
    ... }
    ... ORDER BY ?name LIMIT 5 OFFSET 10
    ... """)
    >>> print ast.selected
    [SparqlVar('project'), SparqlVar('name')]
    >>> print ast.prefixes
    {'doap': 'http://usefulinc.com/ns/doap#'}
    >>> print ast.orderby
    [(SparqlVar('name'), 'asc')]
    >>> print ast.limit, ast.offset
    5 10
    >>> print ast.where
    [(SparqlVar('project'), ('', 'a'), ('http://usefulinc.com/ns/doap#', 'Project')),
     (SparqlVar('project'), ('http://usefulinc.com/ns/doap#', 'name'), SparqlVar('name'))]
    

    This AST is then processed and transformed into a RQL query which can finally be processed by CubicWeb directly.

    Here's what can be done in cubicweb-ctl shell session (of course, this can also be done in the web application) of our forge cube:

    >>> from cubicweb.spa2rql import Sparql2rqlTranslator
    >>> query = """PREFIX doap: <http://usefulinc.com/ns/doap#>
    ... SELECT ?project ?name WHERE {
    ...    ?project a doap:Project;
    ...         doap:name ?name.
    ... }
    ... ORDER BY ?name LIMIT 5 OFFSET 10
    ... """
    >>> qinfo = translator.translate(query)
    >>> rql, args = qinfo.finalize()
    >>> print rql, args
    Any PROJECT, NAME ORDERBY NAME ASC LIMIT 5 OFFSET 10 WHERE PROJECT name NAME, PROJECT is Project {}
    

    From the above example, we can notice two things. First, for cubicweb to understand the doap namespace, we have to declare the correspondance between the standard doap vocabulary and our internal schema, this is done with yams.xy:

    >>> from yams import xy
    >>> xy.register_prefix('http://usefulinc.com/ns/doap#', 'doap')
    >>> xy.add_equivalence('Project', 'doap:Project')
    >>> xy.add_equivalence('Project name', 'doap:Project doap:name')
    

    Secondly, for now, we notice that the case is not preserved during the transformation : ?project becomes PROJECT in the rql query. This is probably something that we'll need to tackle quickly.

    We've also add a few views in CubicWeb to wrap that and it will be available in the upcoming version 3.4.0 and is already available through our pulic mercurial repository.

    The door is now open, the path is still long, stay tuned !

    image under creative commons by beger (original)