Blog entries

  • Hypermedia API with cubicweb-jsonschema

    2017/04/04 by Denis Laxalde

    This is the second post of a series about cubicweb-jsonschema. The first post mainly dealt with JSON Schema representations of CubicWeb entities along with a brief description of the JSON API. In this second post, I'll describe another aspect of the project that aims at building a hypermedia API by leveraging the JSON Hyper Schema specification.

    Hypermedia APIs and JSON Hyper Schema

    Hypermedia API is more or less a synonym for RESTful API, but the term makes it clearer that the API serves hypermedia responses, i.e. content that helps discoverability of other resources.

    At the heart of a hypermedia API is the concept of link relation, which aims both at describing relationships between resources and at providing ways to manipulate them.

    In JSON Hyper Schema terminology, link relations take the form of a collection of Link Description Objects gathered into a links property of a JSON Schema document. These Link Description Objects thus describe relationships between the instance described by the JSON Schema document at stake and other resources; they hold a number of properties that make relationship manipulation possible:

    • rel is the name of the relation; it is usually one of the relation names registered at IANA;
    • href indicates the URI of the target of the relation, it may be templated by a JSON Schema;
    • targetSchema is a JSON Schema document (or reference) describing the target of the link relation;
    • schema (recently renamed as submissionSchema) is a JSON Schema document (or reference) describing what the target of the link expects when submitting data.

    Hypermedia walkthrough

    In the remainder of the article, I'll walk through a navigation path made possible by the hypermedia controls provided by cubicweb-jsonschema. I'll continue with the example application described in the first post of the series, whose schema consists of Book, Author and Topic entity types. In essence, this walkthrough is typical of what an intelligent client could do when exposed to the API, i.e. from any resource, discover other resources and navigate or manipulate them.

    This walkthrough assumes that, given any resource (i.e. something that has a URL like /book/1), the server exposes data at the main URL when the client asks for JSON through the Accept header, and exposes the JSON Schema of the resource at a schema view of the same URL (i.e. /book/1/schema). This assumption can be regarded as a kind of client/server coupling, which might go away in a later implementation.
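
    As a side note, here is a minimal sketch of what this convention could look like from a client's point of view (the requests library and the base URL are assumptions, not part of cubicweb-jsonschema):

    import requests

    BASE = 'http://localhost:8080'  # assumed base URL of the example application

    def get_resource(path):
        """Fetch the JSON representation of a resource."""
        return requests.get(BASE + path,
                            headers={'Accept': 'application/json'}).json()

    def get_schema(path):
        """Fetch the JSON Schema of the same resource, at its "schema" view."""
        return requests.get(BASE + path.rstrip('/') + '/schema',
                            headers={'Accept': 'application/schema+json'}).json()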

    Site root

    While client navigation could start from any resource, we start from the root resource and retrieve its schema:

    GET /schema
    Accept: application/schema+json
    
    HTTP/1.1 200 OK
    Content-Type: application/json
    
    {
        "links": [
            {
                "href": "/author/",
                "rel": "collection",
                "schema": {
                    "$ref": "/author/schema?role=creation"
                },
                "targetSchema": {
                    "$ref": "/author/schema"
                },
                "title": "Authors"
            },
            {
                "href": "/book/",
                "rel": "collection",
                "schema": {
                    "$ref": "/book/schema?role=creation"
                },
                "targetSchema": {
                    "$ref": "/book/schema"
                },
                "title": "Books"
            },
            {
                "href": "/topic/",
                "rel": "collection",
                "schema": {
                    "$ref": "/topic/schema?role=creation"
                },
                "targetSchema": {
                    "$ref": "/topic/schema"
                },
                "title": "Topics"
            }
        ]
    }
    

    So at the root URL, our application serves a JSON Hyper Schema that only consists of links. It holds no JSON Schema describing data, which is natural since there's usually no data bound to the root resource (think of it as an empty rset in CubicWeb terminology).

    These links correspond to top-level entity types, i.e. those that would appear in the default startup page of a CubicWeb application. They all have the "collection" relation name (which comes from RFC 6573) as their target is a collection of entities. They also have schema and targetSchema properties.
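
    From here, the walkthrough creates a new book through the Books collection link before browsing it. As a hedged illustration only (the requests library, the base URL and the payload fields are assumptions rather than part of the API description), such a creation request could look like:

    import requests

    # Hypothetical creation request against the href of the Books "collection" link;
    # the body is expected to conform to /book/schema?role=creation.
    resp = requests.post(
        'http://localhost:8080/book/',
        json={'title': "L'homme qui rit",
              'publication_date': '1869-04-01T00:00:00'},
        headers={'Accept': 'application/json'},
    )
    resp.raise_for_status()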

    From collection to items

    Now that we have added a new book, let's step back and use our books link to retrieve data (verb GET):

    GET /book/
    Accept: application/json
    
    HTTP/1.1 200 OK
    Allow: GET, POST
    Content-Type: application/json
    
    [
        {
            "id": "859",
            "title": "L'homme qui rit"
        },
        {
            "id": "858",
            "title": "The Old Man and the Sea"
        }
    ]
    

    which, as always, needs to be completed by a JSON Schema:

    GET /book/schema
    Accept: application/schema+json
    
    
    HTTP/1.1 200 OK
    Content-Type: application/json
    
    {
        "$ref": "#/definitions/Book_plural",
        "definitions": {
            "Book_plural": {
                "items": {
                    "properties": {
                        "id": {
                            "type": "string"
                        },
                        "title": {
                            "type": "string"
                        }
                    },
                    "type": "object"
                },
                "title": "Books",
                "type": "array"
            }
        },
        "links": [
            {
                "href": "/book/",
                "rel": "collection",
                "schema": {
                    "$ref": "/book/schema?role=creation"
                },
                "targetSchema": {
                    "$ref": "/book/schema"
                },
                "title": "Books"
            },
            {
                "href": "/book/{id}",
                "rel": "item",
                "targetSchema": {
                    "$ref": "/book/schema?role=view"
                },
                "title": "Book"
            }
        ]
    }
    

    Consider the last item of links in the above schema. It has a "rel": "item" property which indicates how to access items of the collection; its href property is a templated URI which can be expanded using instance data and schema (here we only have a single id template variable).
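
    As a minimal illustration of that expansion (a real client would typically use a URI Template library; the snippet below is only a sketch):

    # Expand the templated href of the "item" link with instance data.
    item_link = {'href': '/book/{id}', 'rel': 'item'}
    first_item = {'id': '859', 'title': "L'homme qui rit"}
    item_uri = item_link['href'].replace('{id}', first_item['id'])
    assert item_uri == '/book/859'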

    So our client may navigate to the first item of the collection (id="859") at the /book/859 URI, and retrieve resource data:

    GET /book/859
    Accept: application/json
    
    HTTP/1.1 200 OK
    Allow: GET, PUT, DELETE
    Content-Type: application/json
    
    {
        "author": [
            "Victor Hugo"
        ],
        "publication_date": "1869-04-01T00:00:00",
        "title": "L'homme qui rit"
    }
    

    and schema:

    GET /book/859/schema
    Accept: application/schema+json
    
    HTTP/1.1 200 OK
    Content-Type: application/json
    
    {
        "$ref": "#/definitions/Book",
        "definitions": {
            "Book": {
                "additionalProperties": false,
                "properties": {
                    "author": {
                        "items": {
                            "type": "string"
                        },
                        "title": "author",
                        "type": "array"
                    },
                    "publication_date": {
                        "format": "date-time",
                        "title": "publication date",
                        "type": "string"
                    },
                    "title": {
                        "title": "title",
                        "type": "string"
                    },
                    "topics": {
                        "items": {
                            "type": "string"
                        },
                        "title": "topics",
                        "type": "array"
                    }
                },
                "title": "Book",
                "type": "object"
            }
        },
        "links": [
            {
                "href": "/book/",
                "rel": "up",
                "targetSchema": {
                    "$ref": "/book/schema"
                },
                "title": "Book_plural"
            },
            {
                "href": "/book/859/",
                "rel": "self",
                "schema": {
                    "$ref": "/book/859/schema?role=edition"
                },
                "targetSchema": {
                    "$ref": "/book/859/schema?role=view"
                },
                "title": "Book #859"
            }
        ]
    }
    

    Entity resource

    The resource obtained above as an item of a collection is actually an entity. Notice the rel="self" link. It indicates how to manipulate the current resource (i.e. at which URI, using a given schema depending on what actions we want to perform). Still, this link does not indicate which actions may be performed. That indication is found in the Allow header of the data response above:

    Allow: GET, PUT, DELETE
    

    With these bits of information, our intelligent client is able to, for instance, form a request to delete the resource. On the other hand, updating the resource (which is allowed because of the presence of PUT in the Allow header, per HTTP semantics) would take the form of a request whose body conforms to the JSON Schema pointed at by the schema property of the link.
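
    As a sketch of that client-side logic (assuming the requests and jsonschema packages and a base URL, none of which the article prescribes):

    import requests
    from jsonschema import validate

    BASE = 'http://localhost:8080'  # assumed base URL

    # Updating the resource: the body must conform to the schema of the rel="self"
    # link, i.e. /book/859/schema?role=edition.
    edition_schema = requests.get(BASE + '/book/859/schema?role=edition',
                                  headers={'Accept': 'application/schema+json'}).json()
    updated = {'title': "L'homme qui rit",
               'publication_date': '1869-04-01T00:00:00'}
    validate(updated, edition_schema)  # client-side check before submitting
    requests.put(BASE + '/book/859/', json=updated,
                 headers={'Accept': 'application/json'})

    # Deleting the resource is also allowed, since DELETE appears in the Allow header.
    requests.delete(BASE + '/book/859/')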

    Also note the rel="up" link which makes it possible to navigate to the collection of books.

    Conclusions

    This post introduced the main hypermedia capabilities of cubicweb-jsonschema, built on top of the JSON Hyper Schema specification. The resulting Hypermedia API makes it possible for an intelligent client to navigate through hypermedia resources and manipulate them by using both link relation semantics and HTTP verbs.

    In the next post, I'll deal with relationships description and manipulation both in terms of API (endpoints) and hypermedia representation.


  • Exploring the datafeed API in CubicWeb

    2014/09/26 by Denis Laxalde

    The datafeed API is one of the nice features of the CubicWeb framework. It makes it possible to easily build such things as a news aggregator (or even a semantic news feed reader), an LDAP importer or an application importing data from another web platform. The underlying API is quite flexible and powerful. Yet, since the documentation is quite thin, it may be hard to find one's way through it. In this article, we'll describe the basics of the datafeed API and provide guiding examples.

    The datafeed API is essentially built around two things: a CWSource entity and a parser, which is a kind of AppObject.

    The CWSource entity defines a list of URLs from which to fetch data to be imported into the current CubicWeb instance; it is linked to a parser through its __regid__. So something like the following should be enough to create a usable datafeed source [1].

    create_entity('CWSource', name=u'some name', type=u'datafeed', parser=u'myparser')
    

    The parser is usually a subclass of DataFeedParser (from cubicweb.server.sources.datafeed). It should at least implement the two methods process and before_entity_copy. To make it easier, there are specialized parsers such as DataFeedXMLParser that already define process so that subclasses only have to implement the process_item method.
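
    A minimal skeleton of such a parser could look like the following (the base class and module path are the ones mentioned above; the __regid__, the simplified signatures and the empty bodies are illustrative):

    from cubicweb.server.sources.datafeed import DataFeedParser

    class MyParser(DataFeedParser):
        # Matches the parser value given to the CWSource entity above.
        __regid__ = 'myparser'

        def process(self, url):
            """Fetch data from `url` and import each piece of it."""

        def before_entity_copy(self, entity, sourceparams):
            """Complete the entity with attributes fetched from the source."""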

    Overview of the datafeed API

    Before going into further details about the actual implementation of a DataFeedParser, it's worth having in mind a few details about the datafeed parsing and import process. This involves various players from the CubicWeb server, namely: a DataFeedSource (from cubicweb.server.sources.datafeed), the Repository and the DataFeedParser.

    • Everything starts from the Repository, which loops over its sources and pulls data from each of them (this is done using a looping task set up upon repository startup). In the case of datafeed sources, Repository sources are instances of the aforementioned DataFeedSource class [2].
    • The DataFeedSource selects the appropriate parser from the registry and loops over each URI defined in the respective CWSource entity by calling the parser's process method with that URI as argument (methods pull_data and process_urls of DataFeedSource).
    • If the parsing step is successful, the DataFeedSource will call the parser's handle_deletion method with the URIs of the previously imported entities.
    • Then, the import log is formatted and the transaction committed. The DataFeedSource and DataFeedParser are connected to an import_log which feeds the CubicWeb instance with a CWDataImport per data pull. This usually contains the number of created and updated entities along with any error/warning message logged by the parser. All this is visible in a table from the CWSource primary view.

    So now, you might wonder what actually happens during the parser's process method call. This method takes a URL from which to fetch data and further processes each piece of data (using a process_item method for instance). For each data item:

    1. the repository is queried to retrieve or create an entity in the system source: this is done using the extid2entity method;
    2. this extid2entity method essentially needs two pieces of information:
      • a so-called extid, which uniquely identifies an item in the distant source
      • any other information needed to create or update the corresponding entity in the system source (this will later be referred to as the sourceparams)
    3. then, given the (new or existing) entity returned by extid2entity, the parser can perform further postprocessing (for instance, updating any relation on this entity).

    In step 1 above, the parser method extid2entity in turn calls the repository method extid2eid given the current source and the extid value. If an entry in the entities table matches the specified extid, the corresponding eid (identifier in the system source) is returned. Otherwise, a new eid is created. It's worth noting that the created entity (in case the entity is to be created) is not complete with respect to the data model at this point. In order for the entity to be completed, the source method before_entity_insertion is called. This is where the aforementioned sourceparams are used. More specifically, on the parser side, the before_entity_copy method is called: it usually just updates (using entity.cw_set() for instance) the fetched entity with any relevant information.

    Case study: a news feeds parser

    Now we'll go through a concrete example to illustrate all those fairly abstract concepts and implement a datafeed parser which can be used to import news feeds. Our parser will create entities of type FeedArticle, whose minimal data model would be:

    from yams.buildobjs import EntityType, RichString, String

    class FeedArticle(EntityType):
        title = String(fulltextindexed=True)
        uri = String(unique=True)
        author = String(fulltextindexed=True)
        content = RichString(fulltextindexed=True, default_format='text/html')
    

    Here we'll reuse the DataFeedXMLParser, not because we have XML data to parse, but because its interface fits well with our purpose, namely: it provides item-based processing (a process_item method) and relies on a parse method to fetch raw data. The underlying parsing of the news feed resources will be handled by feedparser.

    import feedparser
    from cubicweb.server.sources.datafeed import DataFeedXMLParser

    class FeedParser(DataFeedXMLParser):
        __regid__ = 'newsaggregator.feed-parser'
    

    The parse method is called by process; it should return a list of tuples with item information.

    def parse(self, url):
        """Delegate to feedparser to retrieve feed items"""
        data = feedparser.parse(url)
        # Each entry is wrapped in a 1-tuple: process expects a list of tuples
        # holding item information, which are passed on to process_item().
        return zip(data.entries)
    

    Then the process_item method takes an individual item (i.e. an entry of the result obtained from feedparser in our case). It essentially defines an extid, here the uri of the feed entry (a good candidate for uniqueness), and calls extid2entity with that extid, the entity type to be created / retrieved and any additional data useful for entity completion passed as keyword arguments. (The process_feed method call just transforms the results obtained from feedparser into a dict suitable for entity creation following the data model described above.)

    def process_item(self, entry):
        data = self.process_feed(entry)
        extid = data['uri']
        entity = self.extid2entity(extid, 'FeedArticle', feeddata=data)
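
    The process_feed helper called above is not shown in the article; a minimal sketch of what it could look like (the attribute names follow feedparser's entry objects, and the helper itself is an assumption) might be:

    def process_feed(self, entry):
        """Map a feedparser entry onto the FeedArticle attributes (illustrative)."""
        return {
            'uri': entry.link,
            'title': entry.title,
            'author': entry.get('author', u''),
            'content': entry.get('summary', u''),
        }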
    

    The before_entity_copy method is called before the entity is actually created (or updated) in order to give the parser a chance to complete it with any other attribute that could be set from source data (namely feedparser data in our case).

    def before_entity_copy(self, entity, sourceparams):
        feeddata = sourceparams['feeddata']
        entity.cw_edited.update(feeddata)
    

    And that's essentially all that's needed for a simple parser. Further details can be found in the news aggregator cube. More sophisticated parsers may use other concepts not described here, such as source mappings.

    Testing datafeed parsers

    Testing a datafeed parser often involves pulling data from the corresponding datafeed source. Here is a minimal test snippet that illustrates how to retrieve the datafeed source from a CWSource entity and to pull data from it.

    with self.admin_access.repo_cnx() as cnx:
        # Assuming one knows the URI of a CWSource.
        rset = cnx.execute('CWSource X WHERE X uri %(uri)s', {'uri': uri})
        # Retrieve the datafeed source instance.
        dfsource = self.repo.sources_by_eid[rset[0][0]]
        # Make sure its parser matches the expected one.
        self.assertEqual(dfsource.parser_id, '<my-parser-id>')
        # Pull data using an internal connection.
        with self.repo.internal_cnx() as icnx:
            stats = dfsource.pull_data(icnx, force=True, raise_on_error=True)
            icnx.commit()
    

    The resulting stats is a dictionary containing the eids of entities created and updated during the pull. In addition, all created entities should have their cw_source relation set to the corresponding CWSource entity.
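
    For instance, one could extend the test above along these lines (the 'created' key name is assumed from the description of stats; entity_from_eid and cw_source are standard CubicWeb APIs):

    # Check that the pull created something and that created entities are linked
    # to the datafeed source (named u'some name' when the CWSource was created).
    self.assertTrue(stats['created'])
    with self.admin_access.repo_cnx() as cnx:
        entity = cnx.entity_from_eid(next(iter(stats['created'])))
        self.assertEqual(entity.cw_source[0].name, u'some name')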

    Notes

    [1]

    It is possible to add some configuration to the CWSource entity in the form of a string of configuration items (one per line). Noteworthy items are:

    • the synchronization-interval;
    • use-cwuri-as-url=no, which avoids using the external URL inside the CubicWeb instance (otherwise any link on an imported entity would point to the external source URI);
    • delete-entities=[yes,no] which controls if entities not found anymore in the distant source should be deleted from the CubicWeb instance.
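
    For illustration, such a configuration string could be passed when creating the source (the config attribute name and the exact value syntax are assumptions to be checked against the source documentation):

    create_entity('CWSource', name=u'some name', type=u'datafeed', parser=u'myparser',
                  config=u'synchronization-interval=30min\n'
                         u'use-cwuri-as-url=no\n'
                         u'delete-entities=no')
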
    [2] The mapping between a CWSource entity's type (e.g. "datafeed") and the DataFeedSource object is quite unusual as it does not rely on the vreg but uses a specific sources registry (defined in cubicweb.server.SOURCE_TYPES).