blog entries created by Simon Chabot
  • Report of May 19th Cubicweb Meeting

    2020/05/20 by Simon Chabot

    Hi everyone,

Yesterday we held a new CubicWeb meeting, and we would like to share with you what was discussed:

• some issues have been migrated from the cubicweb.org forge to the heptapod forge. You can find them here. Only the issues which were on the cubicweb-dev-board have been migrated; the others have been considered too old to make their migration worthwhile;

• new milestones have been created, and some issues have been assigned to them;

• if you want to help out with CubicWeb's development, you can have a look at the Issues Board and pick a task in the “To Do” column (each task there should be related to a milestone);

• CW tests are still failing on the forge; this is also related to other packages that have been released. Fixing those tests is quite urgent now, so we suggest fixing them during a sprint this Friday afternoon. Feel free to join!

• for CubicWeb's main dependencies (RQL, YAMS, logilab-common, etc.), a heptapod job has been added that triggers CW's tests with the latest version of the dependency from the forge, not only the latest version released on PyPI. This should help us release new versions of CW's dependencies with more confidence (for now it only triggers the job; a future version of heptapod should provide better integration of multi-project pipelines);

• the documentation of CubicWeb's dependencies is now automatically published on readthedocs. The same work is in progress for CubicWeb itself.

    Next week, if the tests are successful, we will talk about a release candidate for 3.28.

    Stay tuned :)


  • A roadmap to Cubicweb 3.28 (and beyond)

    2020/05/13 by Simon Chabot

Yesterday at Logilab we had a small meeting to discuss a roadmap to CubicWeb 3.28 (and beyond), and we would like to report back to you on this meeting.

CubicWeb 3.28 will mainly bring the implementation of content negotiation: CubicWeb will be able to return RDF, using CubicWeb's ontology, when a client requests it.
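As an illustration of the mechanism (not CubicWeb's actual code), server-side content negotiation boils down to picking, among the representations the server can produce, the one the client's Accept header rates highest:

```python
# Minimal sketch of server-side content negotiation: choose a
# representation based on the client's Accept header and its q-values.
# This illustrates the HTTP mechanism, not CubicWeb's implementation.

def negotiate(accept_header, available):
    """Return the available MIME type the client rates highest."""
    prefs = []
    for part in accept_header.split(","):
        bits = part.strip().split(";")
        mimetype = bits[0].strip()
        q = 1.0  # quality factor defaults to 1.0 when omitted
        for param in bits[1:]:
            name, _, value = param.strip().partition("=")
            if name == "q":
                q = float(value)
        prefs.append((q, mimetype))
    # Highest quality first; '*/*' matches anything the server offers.
    for q, mimetype in sorted(prefs, key=lambda p: -p[0]):
        if mimetype in available:
            return mimetype
        if mimetype == "*/*" and available:
            return available[0]
    return None

# A client preferring RDF (Turtle) over HTML gets the Turtle version:
print(negotiate("text/turtle;q=0.9, text/html;q=0.5",
                ["text/html", "text/turtle"]))  # -> text/turtle
```

With such a scheme, the same URL can serve HTML to a browser and RDF to a semantic-web client.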

Version 3.28 will bring other features as well (like a new variables attribute on ResultSet that contains the names of the projected variables). Those features will be detailed in the release changelog.

    Before releasing this version, we would like to finish the migration to heptapod, to make sure that everything is ok. The remaining tasks are:

• fixing the CI (there are still some random failures that need further investigation);
• migrating the Jenkins job that pushes images to hub.docker.com to heptapod, so that everything is available from the forge. It will then be explicit for everyone when a job is done, and what its status is.

Besides releasing CubicWeb 3.28, its ecosystem will also be updated:

• logilab-common: a new version will be released very soon, bringing a refactoring of the deprecation system and type annotations (coming from pyannotate);
• yams: a new version is coming. This version:
  • brings type annotations (written manually and carefully checked);
  • removes a lot of abbreviations to make the code clearer;
  • removes some magic related to an object which used to behave like a string.

The goal of these two releases is to have type annotations in the core libraries used by CubicWeb, and then to be able to bring type annotations into CubicWeb itself in a future version.

On those projects, some “modernisation” has been started too (fixing flake8 warnings where needed, repainting the code with black). This “modernisation” step is still ongoing on the different projects related to CubicWeb (and is done for yams and logilab-common).

In the medium term, we would like to focus on the documentation of CubicWeb and its ecosystem. We do know that it is really hard for newcomers (and even for ourselves sometimes) to understand where to start, what each module does, etc. Automatic documentation has been published for some modules (see 1, 2 or 3 for instance). It would be nice to automate the update of the documentation on readthedocs, update the old examples, and add new ones about the new features we are adding (like content negotiation, pyramid predicates, etc.). This could be done during the team's Friday sprints or hackathons, for instance. CubicWeb itself would also need some modernisation (running black? and above all, making all files flake8 compliant…).

Regarding CubicWeb development, all (or at least a lot of) cubes and CubicWeb-related projects moved from cubicweb.org's forge to our instance of heptapod (4 and 5). Some issues have been imported from cubicweb.org to heptapod. New issues should be opened on heptapod, and reviews should also be done there. We hope this will ease the reappropriation of the code base and stimulate new merge requests :)

To end this report, we would like to emphasise that we will try to hold a « remote CubicWeb meeting » each Tuesday at 2 pm. If you would like to participate in this meeting, you are most welcome (if you need the webconference URL, contact one of us and we will provide it to you). We also created a #Cubicweb channel on matrix.logilab.org; feel free to ask for an invitation if you would like to discuss CubicWeb-related things with us.

    All the best, and… see you next Tuesday :)


  • Géo − Geonames alignment

    2012/12/20 by Simon Chabot

This blog post describes the main points of the alignment process between the French National Library's Géo data repository and the data extracted from Geonames.

Alignment is the process of finding similar entities in different repositories. The Géo repository contains a lot of locations, and the goal is to find those locations in the Geonames repository, and to be able to say that a location in *Géo* is the same as one in *Geonames*. For that purpose, Logilab developed a library called Nazca to build those links.

    To process the alignment between Géo and Geonames, we divided the Géo repository into two groups:

• A group gathering the Géo data having information about longitude and latitude.
• Another gathering the data having no information about longitude and latitude.

    Group 1 - Data having geographical information

    The alignment process is made in five steps (see figure below):

    1. Data gathering

We gather the information needed for the alignment, that is to say the unique identifier, the name, the longitude and the latitude. The same applies to the Geonames data.

    2. Standardization

This step aims to make the data as standard as possible, i.e. set it to lower case, remove the stop words, remove the punctuation and so on.
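The standardization step can be sketched as follows. The stop-word list here is a made-up illustrative subset, not the one actually used by the alignment:

```python
# Sketch of the standardization step: lower-case the names, replace
# punctuation with spaces, and drop stop words.  The STOP_WORDS set is
# a hypothetical subset for illustration only.
import re

STOP_WORDS = {"de", "la", "le", "les", "du", "des"}

def standardize(name):
    name = name.lower()
    # \w is Unicode-aware in Python 3, so accented letters are kept
    name = re.sub(r"[^\w\s]", " ", name)
    words = [w for w in name.split() if w not in STOP_WORDS]
    return " ".join(words)

print(standardize("Saint-Étienne-du-Bois"))  # -> saint étienne bois
```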

3. Indexing

    The geographical coordinates of the standardized data are indexed in a Kdtree, a data structure that makes nearest-neighbour queries fast.

    4. Alignment

    Thanks to the Kdtree, we can quickly find the geographical nearest neighbours. During this fourth step, we loop over the nearest neighbours and assign each a grade according to the similarity between its name and the name of the location we are looking for, using the Levenshtein distance. The alignment is made with the best-graded one.
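The alignment step can be sketched like this. It is an illustration of the idea rather than Nazca's implementation: a plain Python Levenshtein distance, and a brute-force neighbour search standing in for the Kdtree; names and coordinates are made up:

```python
# Sketch of the alignment step: grade the geographically nearest
# candidates by name similarity (Levenshtein distance) and keep the
# best one.  Brute-force search stands in for the Kdtree used in the
# real process; the sample data below is invented.
import math

def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def best_match(target, candidates, k=3):
    """Among the k geographically nearest candidates, return the one
    whose name is closest to the target's name."""
    name, lon, lat = target
    nearest = sorted(candidates,
                     key=lambda c: math.hypot(c[1] - lon, c[2] - lat))[:k]
    return min(nearest, key=lambda c: levenshtein(name, c[0]))

geo = ("lyon", 4.83, 45.76)
geonames = [("lyon", 4.84, 45.75), ("dijon", 5.04, 47.32), ("nyon", 6.23, 46.38)]
print(best_match(geo, geonames)[0])  # -> lyon
```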

    5. Saving the results

    Finally, we save all the results into a file.

    Group 2 - Data having no geographical information

Let's have a look at the data having no information on the longitude and the latitude. The steps are more or less the same as before, except that we cannot find neighbours using a Kdtree. So we use another method to find locations having quite similar names. This method, called MinHashing, has been shown to be quite relevant for this purpose.
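The idea behind MinHashing can be sketched as follows: the fraction of matching slots between two signatures estimates the Jaccard similarity of the names' character n-gram sets. Seeded md5 stands in for a proper family of hash functions here:

```python
# Sketch of MinHashing: estimate the similarity of two names from the
# minima of several hash functions applied to their character 3-grams.
# Seeded md5 is used as a stand-in family of hash functions.
import hashlib

def shingles(name, n=3):
    """Set of character n-grams of the name."""
    return {name[i:i + n] for i in range(len(name) - n + 1)}

def minhash(tokens, seeds=32):
    """One minimum per seeded hash function -> a fixed-size signature."""
    return [min(int(hashlib.md5(f"{seed}:{t}".encode()).hexdigest(), 16)
                for t in tokens)
            for seed in range(seeds)]

def similarity(a, b):
    """Fraction of matching signature slots: estimates the Jaccard
    similarity of the two shingle sets."""
    sa, sb = minhash(shingles(a)), minhash(shingles(b))
    return sum(x == y for x, y in zip(sa, sb)) / len(sa)

# Close spellings share most 3-grams, so their signatures mostly agree:
print(similarity("marseille", "marseilles"))
print(similarity("marseille", "bordeaux"))
```

In practice only candidate pairs whose signatures agree often enough are kept, which avoids comparing every location with every other.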

To minimise the number of mistakes, we try to gather locations according to their country, knowing the country is often written in the location's preferred_label. This pre-treatment helps us filter out the cities having the same name but located in different countries. For instance, there is a Paris in France, a Paris in the United States and a Paris in Canada. So the alignment is made country by country.

The fourth and fifth steps remain the same.

    Results obtained

The results we got are the following:

              Amount of locations   Aligned   Non-aligned
    Group 1   97572                 89.3%     10.7%
    Group 2   150528                72.9%     27.1%
    Total     248100                79.3%     20.7%

One problem we met is the language used to describe the locations. Indeed, the similarity grade is given according to the distance between the names, and one can notice that Londres and London, for instance, do not have the same spelling despite representing the same location.

    Results improvement

In order to improve the results a little, we had a closer look at the 10.7% of non-aligned locations in the first group. The language problem mentioned before was pretty clear there. So we decided to use the following definition: two locations are identical if they are geographically very close. Using this definition, we get rid of the name and focus on the longitude and the latitude only.
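This purely geographical criterion can be sketched with the haversine (great-circle) distance. The 5 km threshold is an assumption for the illustration, not the value used in the actual alignment:

```python
# Sketch of the purely geographical criterion: two locations are taken
# as identical when the great-circle distance between them is under a
# threshold.  The 5 km threshold is a hypothetical value.
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def same_location(a, b, threshold_km=5.0):
    return haversine_km(*a, *b) < threshold_km

# "Londres" and "London" differ in spelling but not in place: two
# points a few hundred metres apart in London are taken as identical.
print(same_location((51.5074, -0.1278), (51.5072, -0.1276)))  # -> True
```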

To estimate the exactness of the results, we picked 50 randomly chosen locations and checked them manually. And the results are pretty good! 98% are correct (49/50). That's how, based on a purely geographical approach, we can increase the coverage rate (from 89.3% to 99.6%).

    In the end, we get those results :

              Amount of locations   Aligned   Non-aligned
    Group 1   97572                 99.6%     0.4%
    Group 2   150528                72.9%     27.1%
    Total     248100                83.4%     16.6%