Blog entries by Simon Chabot
  • CubicWeb Monthly News, February/March 2021

    2021/04/09 by Simon Chabot

    It has been quite a while since the last public news; in this letter we will try to summarize what we did during February and March 2021. During this period, we fixed issues and migrated a lot of our code base to the latest version of CubicWeb. We also did some archaeology: we looked at all the old heads in CubicWeb's repository and tried to rebase them onto the latest public head. A lot of merge requests have been created and are still under review.

    CubicWeb 3.30 is out!

    We released CubicWeb 3.30 on March 16th. There are no “big” changes in this release. A lot of issues have been fixed, the documentation has been improved (a new tutorial is coming!) and a few new features have been implemented.

    Here is an extract of the changelog:

    • it is now possible to use SMTP authentication to send an email (#221)
    • required variables can be read from the environment (#85)
    • one can specify script attributes when using the add_js function (#210)
    • GROUP_CONCAT was not returning the expected results when NULL values were encountered (#109)
    • it is now possible not to drop table indexes when using the MassiveStore (#219)

    The full changelog can be found on the CubicWeb release page here.

    Adding python types to RQL

    Over the past months, Python type annotations have progressively been added to RQL. This is a step towards Python types in CubicWeb itself.

    A lot of work has been done, and there is still a lot to do. Adding types to RQL helps us simplify refactorings. As RQL's API is not clearly defined, we will run CubicWeb's tests on each RQL merge request to make sure nothing breaks.
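
    To give a rough idea of what this looks like (a minimal, purely illustrative sketch: the class and function below are hypothetical and are not part of RQL's real API), annotations make the expected inputs and outputs explicit, so a type checker such as mypy can flag misuse during refactorings:

        from typing import Dict, Optional


        class Statement:
            """Hypothetical stand-in for a parsed RQL statement (not RQL's real class)."""

            def __init__(self, text: str) -> None:
                self.text = text


        # Hypothetical helper, for illustration only.
        def parse(rql_string: str, vocabulary: Optional[Dict[str, str]] = None) -> Statement:
            if not rql_string.strip():
                raise ValueError("empty RQL query")
            return Statement(rql_string)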

    Separate Back and Front

    For some time now, the trend has been towards a clear separation between the front end and the back end. We are adding more features to make it easier to use React on the front end.

    CwClientLibJS is a library that allows using RQL in the browser. It is now at version 1.1.0 and is able to generate queries on entity state (see !20).

    react-admin-cubicweb provides new tools to generate admin pages. It is still at an early stage, but it can display entity attributes and relations, and it also allows editing them. Version 0.3.0 has been released!

    Release-new

    Release-new is one of our latest utilities; it releases new versions of our cubes. If you are using conventional commits, you should really like this tool for releasing new versions of your own cubes.

    Release-new takes care of:

    • update the version of the cube in __pkginfo__.py
    • update the debian/control if there is one
    • update the changelog
    • create a new commit with these changes
    • tag the commit

    By default, it will try to guess whether it is a major, minor or patch release (you can specify the release type if need be). The changelog will be updated and you will be able to edit it before the commit is made. Updating the changelog is easier if you use conventional commits, as your commit history will then be dispatched into the appropriate sections. For instance, the latest changelog of CubicWeb was produced by this tool!
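
    As an illustration of how a release type can be guessed from conventional commits (a simplified sketch, not release-new's actual implementation), one can scan the commit messages written since the last tag:

        def guess_release_type(commit_messages):
            """Guess a semver bump from conventional commit messages (sketch only).

            A "BREAKING CHANGE" footer or a "!" after the commit type means a major
            release, a "feat:" commit means a minor release, anything else (fix,
            chore, docs, ...) means a patch release.
            """
            release = "patch"
            for message in commit_messages:
                header = message.splitlines()[0] if message else ""
                if "BREAKING CHANGE" in message or header.split(":", 1)[0].endswith("!"):
                    return "major"
                if header.startswith("feat"):
                    release = "minor"
            return release


        print(guess_release_type(["fix: typo in README", "feat: add SMTP auth"]))  # minor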

    Docker images

    During our last hackathon, a team worked on our CubicWeb Docker images. The code that generates them has been rewritten to be simpler. During this refactoring, we took care to generate Docker images for minor versions of CubicWeb as well. Therefore, on https://hub.docker.com/r/logilab/cubicweb/tags you can find Docker images for major and minor versions of CubicWeb with different versions of Python, to suit your needs the best we can :)

    See you next month!


  • Report of May 19th Cubicweb Meeting

    2020/05/20 by Simon Chabot

    Hi everyone,

    Yesterday we held a new CubicWeb meeting, and we would like to share with you what was said during that meeting:

    • some issues have been migrated from the cubicweb.org forge to the heptapod forge. You can find them here. Only the issues which were in the cubicweb-dev-board have been migrated; the others were considered too old for their migration to be worthwhile;

    • new milestones have been created, and some issues have been assigned to them;

    • if you want to help out with CubicWeb's development, you can have a look at the Issues Board and pick a task in the “To Do” column (the task should be related to a milestone);

    • CW tests are still failing on the forge, partly because of other packages that have been released. Fixing those tests is now quite urgent, so we suggest fixing them during a sprint this Friday afternoon. Feel free to join!

    • On CubicWeb's main dependencies (RQL, YAMS, logilab-common, etc.), a heptapod job has been added to trigger CW's tests with the latest version of the dependency from the forge, and not only the last version released on PyPI. This should help us release new versions of CW's dependencies with more confidence (for now it only triggers the job; a future version of heptapod should provide better integration of multi-project pipelines);

    • The documentation of CubicWeb's dependencies is now automatically published on readthedocs. Work is in progress for CubicWeb itself.

    Next week, if the tests are successful, we will talk about a release candidate for 3.28.

    Stay tuned :)


  • A roadmap to Cubicweb 3.28 (and beyond)

    2020/05/13 by Simon Chabot

    Yesterday at Logilab we had a small meeting to discuss a roadmap to CubicWeb 3.28 (and beyond), and we would like to report back to you on this meeting.

    CubicWeb 3.28 will mainly bring the implementation of content negotiation: CubicWeb will be able to return RDF using CubicWeb's ontology when a client requests it.
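
    As a rough illustration of what content negotiation means here (a hedged sketch in plain Python, not CubicWeb's actual code), a view can inspect the client's Accept header and serve RDF or HTML accordingly:

        # Minimal sketch of HTTP content negotiation, independent of CubicWeb's
        # real implementation: the response format follows the Accept header.
        def render_entity(entity_data, accept_header):
            rdf_types = ("text/turtle", "application/rdf+xml", "application/n-triples")
            if any(mimetype in accept_header for mimetype in rdf_types):
                # Serialize the entity as RDF triples (naive Turtle-like dump).
                lines = ['<{}> <http://example.org/{}> "{}" .'.format(entity_data["uri"], key, value)
                         for key, value in entity_data["attributes"].items()]
                return "text/turtle", "\n".join(lines)
            # Fall back to HTML for browsers.
            items = "".join("<li>{}: {}</li>".format(key, value)
                            for key, value in entity_data["attributes"].items())
            return "text/html", "<ul>{}</ul>".format(items)


        entity = {"uri": "http://example.org/entity/1",
                  "attributes": {"name": "CubicWeb", "version": "3.28"}}
        print(render_entity(entity, "text/turtle, */*;q=0.1")[0])  # text/turtle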

    Version 3.28 will have other features as well (like a new variables attribute on ResultSet that contains the names of the projected variables, etc.). Those features will be detailed in the release changelog.

    Before releasing this version, we would like to finish the migration to heptapod, to make sure that everything is ok. The remaining tasks are:

    • fixing the CI (there are still some random failures that need further investigation)
    • migrating the Jenkins job that pushes images to hub.docker.com to heptapod, to make everything available from the forge. It will then be clear to everyone when a job is done and what its status is.

    Besides releasing CubicWeb 3.28, its ecosystem will also be updated:

    • logilab-common: a new version will be released very soon, bringing a refactoring of the deprecation system and type annotations (coming from pyannotate)
    • yams: a new version is coming. This version:
      • brings type annotations (manually written and carefully checked);
      • removes a lot of abbreviations to make the code clearer;
      • removes some magic related to an object which used to behave like a string.

    The goal of these two releases is to have type annotations in the core libraries used by CubicWeb, and then to be able to bring type annotations into CubicWeb itself in a future version.

    Some “modernisation” has also been started on those projects (fixing flake8 warnings when needed, reformatting the code with black). This “modernisation” step is still ongoing on the different projects related to CubicWeb (and is complete for yams and logilab-common).

    In the medium term, we would like to focus on the documentation of CubicWeb and its ecosystem. We do know that it is really hard for newcomers (and even for ourselves sometimes) to understand how to start, what each module does, etc. Automatically generated documentation has been published for some modules (see 1, 2 or 3 for instance). It would be nice to automate the update of the documentation on readthedocs, update the old examples, and add new ones about the new features we are adding (like content negotiation, pyramid predicates, etc.). This could be done during the team's Friday sprints or hackathons, for instance. CubicWeb itself would also need some modernisation (running black? and, above all, making all files flake8 compliant…).

    Regarding CubicWeb development, all (or at least a lot of) cubes and CubicWeb-related projects have moved from cubicweb.org's forge to our heptapod instance (4 and 5). Some issues have been imported from cubicweb.org to heptapod. New issues should be opened on heptapod, and reviews should also be done there. We hope this will make it easier to take ownership of the code base and will stimulate new merge requests :)

    To end this report, we would like to emphasize that we will try to hold a “remote CubicWeb meeting” every Tuesday at 2 pm. If you would like to participate in this meeting, you are very welcome (if you need the webconference URL, contact one of us and we will provide it). We also created a #Cubicweb channel on matrix.logilab.org; feel free to ask for an invitation if you'd like to discuss CubicWeb-related things with us.

    All the best, and… see you next Tuesday :)


  • Géo − Geonames alignment

    2012/12/20 by Simon Chabot

    This blog post describes the main points of the alignment process between the French National Library's Géo data repository and the data extracted from Geonames.

    Alignment is the process of finding similar entities in different repositories. The Géo repository contains a lot of locations, and the goal is to find those locations in the Geonames repository, so as to be able to say that this location in *Géo* is the same as that one in *Geonames*. For that purpose, Logilab developed a library called Nazca to build those links.

    To perform the alignment between Géo and Geonames, we divided the Géo repository into two groups:

    • a group gathering the Géo data that has longitude and latitude information;
    • another one gathering the data that has no longitude and latitude information.

    Group 1 - Data having geographical information

    The alignment process consists of five steps:

    1. Data gathering

    We gather the information needed for the alignment, that is to say the unique identifier, the name, the longitude and the latitude. The same applies to the Geonames data.

    2. Standardization

    This step aims to make the data as standard as possible, i.e. convert it to lower case, remove stop words, remove punctuation and so on.
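
    A minimal sketch of such a normalization step (the stop-word list below is purely illustrative, not the one actually used):

        import re
        import unicodedata

        STOP_WORDS = {"the", "of", "de", "la", "le"}  # illustrative only


        def normalize(name):
            """Lower-case a place name, strip accents, punctuation and stop words."""
            name = name.lower()
            # Remove accents, e.g. "Genève" -> "geneve".
            name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode()
            # Keep only letters, digits and spaces.
            name = re.sub(r"[^a-z0-9 ]", " ", name)
            words = [word for word in name.split() if word not in STOP_WORDS]
            return " ".join(words)


        print(normalize("Genève (le canton)"))  # "geneve canton"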

    3. Kdtree construction

    We build a Kdtree from the longitude and latitude of the Geonames locations, so that the geographical nearest neighbours of a point can be found quickly.

    4. Alignment

    Thanks to the Kdtree, we can quickly find the geographical nearest neighbours. During this fourth step, we loop over the nearest neighbours and assign each one a grade according to the similarity between its name and the name of the location we are looking for, using the Levenshtein distance. The alignment is made with the best-graded one.
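
    As a hedged sketch of these two steps (using scipy's cKDTree and a hand-written Levenshtein distance; this is not necessarily how Nazca implements them):

        from scipy.spatial import cKDTree


        def levenshtein(a, b):
            """Classic dynamic-programming edit distance between two strings."""
            previous = list(range(len(b) + 1))
            for i, ca in enumerate(a, 1):
                current = [i]
                for j, cb in enumerate(b, 1):
                    current.append(min(previous[j] + 1,                # deletion
                                       current[j - 1] + 1,             # insertion
                                       previous[j - 1] + (ca != cb)))  # substitution
                previous = current
            return previous[-1]


        # Illustrative data only: (name, longitude, latitude) for both repositories.
        geo = [("paris", 2.35, 48.85)]
        geonames = [("paris", 2.34, 48.86), ("paris tx", -95.55, 33.66), ("london", -0.12, 51.50)]

        tree = cKDTree([(lon, lat) for _, lon, lat in geonames])
        for name, lon, lat in geo:
            # Take the two geographically nearest candidates, then grade them by name.
            _, indices = tree.query((lon, lat), k=2)
            best = min(indices, key=lambda i: levenshtein(name, geonames[i][0]))
            print(name, "->", geonames[best][0])  # paris -> paris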

    5. Saving the results

    Finally, we save all the results into a file.

    Group 2 - Data having no geographical information

    Let's have a look at the data that has no longitude and latitude information. The steps are more or less the same as before, except that we cannot find neighbours using a Kdtree. So we use another method to find locations whose names have a fairly high level of similarity. This method, called minhashing, has been shown to be quite relevant for this purpose.
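
    Roughly speaking, minhashing turns each name into a small signature such that similar names are likely to share many signature values; a toy sketch (not Nazca's implementation) could look like this:

        import random

        random.seed(42)
        PRIME = 2_147_483_647
        # A family of hash functions h(x) = (a * x + b) % PRIME.
        HASHES = [(random.randrange(1, PRIME), random.randrange(0, PRIME)) for _ in range(50)]


        def shingles(name, size=3):
            """Character 3-grams of a name, e.g. 'paris' -> {'par', 'ari', 'ris'}."""
            return {name[i:i + size] for i in range(len(name) - size + 1)}


        def minhash_signature(name):
            """For each hash function, keep the smallest hashed shingle.

            Note: this relies on Python's built-in hash(), so values differ between
            runs but are consistent within a single run, which is enough here.
            """
            grams = shingles(name)
            return tuple(min((a * hash(g) + b) % PRIME for g in grams) for a, b in HASHES)


        def estimated_similarity(sig1, sig2):
            """The fraction of matching positions approximates the Jaccard similarity."""
            return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)


        print(estimated_similarity(minhash_signature("londres"), minhash_signature("london")))   # fairly high
        print(estimated_similarity(minhash_signature("londres"), minhash_signature("madrid")))   # close to 0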

    To minimise the number of mistakes, we try to group locations according to their country, since the country is often written in the location's preferred_label. This pre-processing helps us filter out cities that have the same name but are located in different countries: for instance, there is a Paris in France, a Paris in the United States and a Paris in Canada. So the alignment is made country by country.
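
    A possible way to do this grouping (a sketch that assumes, for illustration only, labels such as "Paris (France)"; the real preferred_label format may differ):

        import re
        from collections import defaultdict

        # Illustrative labels only.
        labels = ["Paris (France)", "Paris (United States)", "Londres (Royaume-Uni)"]

        by_country = defaultdict(list)
        for label in labels:
            match = re.search(r"\(([^)]+)\)\s*$", label)
            country = match.group(1) if match else "unknown"
            by_country[country].append(label)

        # The alignment is then run separately on each country bucket.
        for country, places in by_country.items():
            print(country, places)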

    The fourth and fifth steps remain the same.

    Results obtained

    The results we obtained are the following:

               Amount of locations    Aligned    Non-aligned
    Group 1                  97572      89.3%          10.7%
    Group 2                 150528      72.9%          27.1%
    Total                   248100      79.3%          20.7%

    One problem we encountered is the language used to describe the locations. Indeed, the similarity grade is computed from the distance between the names, and one can notice that Londres and London, for instance, do not have the same spelling even though they represent the same location.

    Results improvement

    In order to improve the results a little, we had a closer look at the 10.7% of non-aligned locations in the first group. The language problem mentioned before was pretty clear. So we decided to use the following definition: two locations are identical if they are geographically very close. Using this definition, we disregard the name and focus on the longitude and the latitude only.
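
    A sketch of this purely geographical criterion (the distance threshold below is an arbitrary illustration, not the value actually used):

        from math import asin, cos, radians, sin, sqrt


        def haversine_km(lon1, lat1, lon2, lat2):
            """Great-circle distance in kilometres between two (lon, lat) points."""
            lon1, lat1, lon2, lat2 = map(radians, (lon1, lat1, lon2, lat2))
            a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
            return 2 * 6371 * asin(sqrt(a))


        def same_location(a, b, threshold_km=1.0):  # threshold chosen arbitrarily here
            return haversine_km(a[0], a[1], b[0], b[1]) <= threshold_km


        # "Londres" and "London" have different names but nearly identical coordinates.
        print(same_location((-0.1276, 51.5072), (-0.1278, 51.5074)))  # True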

    To estimate the accuracy of the results, we picked 50 randomly chosen locations and checked them manually. The results are pretty good: 98% are correct (49/50). That is how, with a purely geographical approach, we can increase the coverage rate of the results (from 89.3% to 99.6%).

    In the end, we get the following results:

               Amount of locations    Aligned    Non-aligned
    Group 1                  97572      99.6%           0.4%
    Group 2                 150528      72.9%          27.1%
    Total                   248100      83.4%          16.6%