[dataimport] introduce the importer and extentity classes

This introduces the ExtEntity class, a transitional representation between data from an external source and the actual CubicWeb entities.
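
A minimal sketch of what such a transitional representation can look like, assuming the constructor shape added in this changeset (an entity type, an external id, and a dictionary whose values are sets). `ExtEntityStub` and `ext_entities_from_csv` are hypothetical stand-ins, not the CubicWeb classes, so that the snippet runs without a CubicWeb instance; the sample rows mirror people.csv.

```python
import csv
import io


class ExtEntityStub(object):
    """Hypothetical stand-in for ExtEntity: entity type, external id, dict of sets."""

    def __init__(self, etype, extid, values=None):
        self.etype = etype
        self.extid = extid
        self.values = values if values is not None else {}


def ext_entities_from_csv(stream):
    """Yield one stand-in ext entity per (uri, name, knows) CSV row."""
    for uri, name, knows in csv.reader(stream):
        values = {'nom': {name}}
        if knows:  # leave the relation out when the column is empty
            values['connait'] = {knows}
        yield ExtEntityStub('Personne', uri, values)


SAMPLE = (
    "http://www.example.org/alice,Alice,\n"
    "http://www.example.org/bob,Bob,http://www.example.org/alice\n"
)
entities = list(ext_entities_from_csv(io.StringIO(SAMPLE)))
```

Note that the relation target is still an external id at this stage; no eid is involved until the importer runs.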

ExtEntitiesImporter is then in charge of turning a stream of ext entities into CW entities in the repository, using a given store.
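
The idea can be sketched as a simplified two-pass import: entities are created first, then the relations that were deferred are resolved from external ids to eids. This is an illustration under stated assumptions, not the importer from this changeset: `RecordingStore` and `import_ext_entities` are hypothetical names, ext entities are plain tuples, and inlined relations, queueing and updates are left out. Only the two store method names (`prepare_insert_entity`, `prepare_insert_relation`) are taken from the diff.

```python
import itertools


class RecordingStore(object):
    """Hypothetical store stand-in: hands out eids and records what it is asked to insert."""

    def __init__(self):
        self._eids = itertools.count(1)
        self.entities = {}
        self.relations = []

    def prepare_insert_entity(self, etype, **attrs):
        eid = next(self._eids)
        self.entities[eid] = (etype, attrs)
        return eid

    def prepare_insert_relation(self, subj_eid, rtype, obj_eid):
        self.relations.append((subj_eid, rtype, obj_eid))


def import_ext_entities(store, ext_entities, relation_types):
    """Create entities, then insert deferred relations whose two ends are known.

    `ext_entities` yields (etype, extid, values) tuples, `values` mapping names to sets.
    Return the {extid: eid} mapping and the list of relations that could not be inserted.
    """
    extid2eid = {}
    deferred = []
    for etype, extid, values in ext_entities:
        attrs = {}
        for key, vals in values.items():
            if key in relation_types:
                # relation targets are external ids, resolved in the second pass
                deferred.extend((extid, key, target) for target in vals)
            elif vals:
                assert len(vals) == 1, 'more than one value for %s' % key
                attrs[key] = next(iter(vals))
        extid2eid[extid] = store.prepare_insert_entity(etype, **attrs)
    missing = []
    for subj, rtype, obj in deferred:
        if subj in extid2eid and obj in extid2eid:
            store.prepare_insert_relation(extid2eid[subj], rtype, extid2eid[obj])
        else:
            missing.append((subj, rtype, obj))
    return extid2eid, missing


store = RecordingStore()
extid2eid, missing = import_ext_entities(
    store,
    [('Personne', 'bob', {'nom': {u'Bob'}, 'connait': {'alice', 'carol'}}),
     ('Personne', 'alice', {'nom': {u'Alice'}})],
    relation_types={'connait'},
)
```

Here 'bob' refers to 'alice' before 'alice' has been created, which is why the relation is deferred; the relation to the unknown 'carol' ends up in `missing`.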

This changeset also introduces SimpleImportLog and HTMLImportLog, which implement the CW DataImportLog interface in order to show log messages in the UI, as simple text and HTML respectively, instead of storing these messages in the database.
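
A small sketch of the record format these logs produce, following the `_log` methods in the diff (severity, file name, line and message separated by tabs, with a trailing `<br/>` and escaped message in the HTML variant). `TextLog` and `HTMLLog` are hypothetical reduced stand-ins, and the standard library `html.escape` is used here as an assumption in place of `logilab.mtconverter.xml_escape`.

```python
import logging
from html import escape  # assumption: stands in for logilab.mtconverter.xml_escape


class TextLog(object):
    """Reduced stand-in for SimpleImportLog: keeps records in memory as text lines."""

    def __init__(self, filename):
        self.logs = []
        self.filename = filename

    def record_info(self, msg, path=None, line=None):
        self._log(logging.INFO, msg, path, line)

    def record_error(self, msg, path=None, line=None):
        self._log(logging.ERROR, msg, path, line)

    def _log(self, severity, msg, path, line):
        self.logs.append(u'%s\t%s\t%s\t%s' % (severity, self.filename, line or u'', msg))


class HTMLLog(TextLog):
    """Reduced stand-in for HTMLImportLog: same records, escaped for HTML display."""

    def __init__(self, filename):
        super(HTMLLog, self).__init__(escape(filename))

    def _log(self, severity, msg, path, line):
        self.logs.append(u'%s\t%s\t%s\t%s<br/>' % (severity, self.filename,
                                                   line or u'', escape(msg)))


text_log = TextLog('people.csv')
text_log.record_info(u'importing entities')

html_log = HTMLLog('people.csv')
html_log.record_error(u'missing target for <Personne 7>', line=3)
```

Both logs only accumulate strings in `logs`; displaying them in the UI is left to the caller.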

Both have mostly been backported from cubes.skos.dataimport.

Closes #5414753.

author: Yann Voté <yann.vote@logilab.fr>
changeset: d260722f2453
branch: default
phase: public
hidden: no
parent revision: 5ccc3bd8927e [test] Use store.prepare_insert_relation instead of deprecated relate method
child revision: 37644c518705 [doc] Add a tutorial and extend documentation for ExtEntityImporter
files modified by this revision:
dataimport/importer.py
dataimport/test/data/people.csv
dataimport/test/data/schema.py
dataimport/test/unittest_importer.py
doc/book/en/devrepo/dataimport.rst
# HG changeset patch
# User Yann Voté <yann.vote@logilab.fr>
# Date 1435327767 -7200
# Fri Jun 26 16:09:27 2015 +0200
# Node ID d260722f2453ae50e2accbc72510854334dde2da
# Parent 5ccc3bd8927e1080d14e7368ea4ae159b7a8b41a
[dataimport] introduce the importer and extentity classes

This introduces the ``ExtEntity`` class, a transitional representation between
data from an external source and the actual CubicWeb entities.

``ExtEntitiesImporter`` is then in charge of turning a stream of ext entities
into CW entities in the repository, using a given store.

This changeset also introduces ``SimpleImportLog`` and ``HTMLImportLog``, which
implement the CW DataImportLog interface in order to show log messages in the
UI, as simple text and HTML respectively, instead of storing these messages in
the database.

Both have mostly been backported from cubes.skos.dataimport.

Closes #5414753.

diff --git a/dataimport/importer.py b/dataimport/importer.py
@@ -0,0 +1,408 @@
1 +# copyright 2015 LOGILAB S.A. (Paris, FRANCE), all rights reserved.
2 +# contact http://www.logilab.fr -- mailto:contact@logilab.fr
3 +#
4 +# This program is free software: you can redistribute it and/or modify it under
5 +# the terms of the GNU Lesser General Public License as published by the Free
6 +# Software Foundation, either version 2.1 of the License, or (at your option)
7 +# any later version.
8 +#
9 +# This program is distributed in the hope that it will be useful, but WITHOUT
10 +# ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
11 +# FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more
12 +# details.
13 +#
14 +# You should have received a copy of the GNU Lesser General Public License along
15 +# with this program. If not, see <http://www.gnu.org/licenses/>.
16 +"""This module contains tools to programmatically import external data into CubicWeb. It's designed
17 +on top of the store concept to make it easier to share code across various data import
18 +needs.
19 +
20 +The following classes are defined:
21 +
22 +* :class:`ExtEntity`: some intermediate representation of data to import, using external identifier
23 +  but no eid,
24 +
25 +* :class:`ExtEntitiesImporter`: class responsible for turning ExtEntity's extid to eid, and create
26 +  or update CubicWeb entities accordingly (using a Store).
27 +
28 +What is left to do is to write a class or a function that will yield external entities from some
29 +data source (eg RDF, CSV), which will be case dependent (the *generator*).  You may then plug
30 +arbitrary filters into the external entities stream between the generator and the importer, so that
31 +generic generators can have their output refined by specific filters.
32 +
33 +.. code-block:: python
34 +
35 +    ext_entities = fetch(<source>) # function yielding external entities
36 +    log = SimpleImportLog('<source file/url/whatever>')
37 +    importer = ExtEntitiesImporter(schema, store, import_log=log)
38 +    importer.import_entities(ext_entities)
39 +
40 +Here are the two classes that you'll have to deal with, and maybe to override:
41 +
42 +.. autoclass:: cubicweb.dataimport.importer.ExtEntitiesImporter
43 +.. autoclass:: cubicweb.dataimport.importer.ExtEntity
44 +"""
45 +
46 +from collections import defaultdict
47 +import logging
48 +
49 +from logilab.mtconverter import xml_escape
50 +
51 +
52 +def cwuri2eid(cnx, etypes, source_eid=None):
53 +    """Return a dictionary mapping cwuri to eid for entities of the given entity types and / or
54 +    source.
55 +    """
56 +    assert source_eid or etypes, 'no entity types nor source specified'
57 +    rql = 'Any U, X WHERE X cwuri U'
58 +    args = {}
59 +    if len(etypes) == 1:
60 +        rql += ', X is %s' % etypes[0]
61 +    elif etypes:
62 +        rql += ', X is IN (%s)' % ','.join(etypes)
63 +    if source_eid is not None:
64 +        rql += ', X cw_source S, S eid %(s)s'
65 +        args['s'] = source_eid
66 +    return dict(cnx.execute(rql, args))
67 +
68 +
69 +class RelationMapping(object):
70 +    """Read-only mapping from relation type to set of related (subject, object) eids.
71 +
72 +    If `source` is specified, only return relations whose subject and object both come from
73 +    this source.
74 +    """
75 +
76 +    def __init__(self, cnx, source=None):
77 +        self.cnx = cnx
78 +        self._rql_template = 'Any S,O WHERE S {} O'
79 +        self._kwargs = {}
80 +        if source is not None:
81 +            self._rql_template += ', S cw_source SO, O cw_source SO, SO eid %(s)s'
82 +            self._kwargs['s'] = source.eid
83 +
84 +    def __getitem__(self, rtype):
85 +        """Return a set of (subject, object) eids already related by `rtype`"""
86 +        rql = self._rql_template.format(rtype)
87 +        return set(tuple(x) for x in self.cnx.execute(rql, self._kwargs))
88 +
89 +
90 +class ExtEntity(object):
91 +    """Transitional representation of an entity for use in data importer.
92 +
93 +    An external entity has the following properties:
94 +
95 +    * ``extid`` (external id), an identifier for the ext entity,
96 +    * ``etype`` (entity type), a string which must be the name of one entity type in the schema
97 +      (eg. ``'Person'``, ``'Animal'``, ...),
98 +    * ``values``, a dictionary whose keys are attribute or relation names from the schema (eg.
99 +      ``'first_name'``, ``'friend'``), and whose values are *sets*
100 +
101 +    For instance:
102 +
103 +    .. code-block:: python
104 +
105 +        ext_entity.extid = 'http://example.org/person/debby'
106 +        ext_entity.etype = 'Person'
107 +        ext_entity.values = {'first_name': set([u"Deborah", u"Debby"]),
108 +                            'friend': set(['http://example.org/person/john'])}
109 +
110 +    """
111 +
112 +    def __init__(self, etype, extid, values=None):
113 +        self.etype = etype
114 +        self.extid = extid
115 +        if values is None:
116 +            values = {}
117 +        self.values = values
118 +        self._schema = None
119 +
120 +    def __repr__(self):
121 +        return '<%s %s %s>' % (self.etype, self.extid, self.values)
122 +
123 +    def iter_rdefs(self):
124 +        """Yield (key, rtype, role) defined in `.values` dict, with:
125 +
126 +        * `key` is the original key in `.values` (i.e. the relation type or a 2-tuple (relation type,
127 +          role))
128 +
129 +        * `rtype` is a yams relation type, expected to be found in the schema (attribute or
130 +          relation)
131 +
132 +        * `role` is the role of the entity in the relation, 'subject' or 'object'
133 +
134 +        Iteration is done on a copy of the keys so values may be inserted/deleted during it.
135 +        """
136 +        for key in list(self.values):
137 +            if isinstance(key, tuple):
138 +                rtype, role = key
139 +                assert role in ('subject', 'object'), key
140 +                yield key, rtype, role
141 +            else:
142 +                yield key, key, 'subject'
143 +
144 +    def prepare(self, schema):
145 +        """Prepare an external entity for later insertion:
146 +
147 +        * ensure attributes and inlined relations have a single value
148 +        * turn set([value]) into value and remove key associated to empty set
149 +        * remove non inlined relations and return them as a [(e1key, relation, e2key)] list
150 +
151 +        Return a list of non inlined relations that may be inserted later, each relation defined by
152 +        a 3-tuple (subject extid, relation type, object extid).
153 +
154 +        Take care the importer may call this method several times.
155 +        """
156 +        assert self._schema is None, 'prepare() has already been called for %s' % self
157 +        self._schema = schema
158 +        eschema = schema.eschema(self.etype)
159 +        deferred = []
160 +        entity_dict = self.values
161 +        for key, rtype, role in self.iter_rdefs():
162 +            rschema = schema.rschema(rtype)
163 +            if rschema.final or (rschema.inlined and role == 'subject'):
164 +                assert len(entity_dict[key]) <= 1, \
165 +                    "more than one value for %s: %s (%s)" % (rtype, entity_dict[key], self.extid)
166 +                if entity_dict[key]:
167 +                    entity_dict[rtype] = entity_dict[key].pop()
168 +                    if key != rtype:
169 +                        del entity_dict[key]
170 +                    if (rschema.final and eschema.has_metadata(rtype, 'format')
171 +                            and rtype + '_format' not in entity_dict):
172 +                        entity_dict[rtype + '_format'] = u'text/plain'
173 +                else:
174 +                    del entity_dict[key]
175 +            else:
176 +                for target_extid in entity_dict.pop(key):
177 +                    if role == 'subject':
178 +                        deferred.append((self.extid, rtype, target_extid))
179 +                    else:
180 +                        deferred.append((target_extid, rtype, self.extid))
181 +        return deferred
182 +
183 +    def is_ready(self, extid2eid):
184 +        """Return True if the ext entity is ready, i.e. has all the URIs used in inlined relations
185 +        currently existing.
186 +        """
187 +        assert self._schema, 'prepare() method should be called first on %s' % self
188 +        # as .prepare has been called, we know that .values only contains subject relation *type* as
189 +        # key (no more (rtype, role) tuple)
190 +        schema = self._schema
191 +        entity_dict = self.values
192 +        for rtype in entity_dict:
193 +            rschema = schema.rschema(rtype)
194 +            if not rschema.final:
195 +                # .prepare() should drop other cases from the entity dict
196 +                assert rschema.inlined
197 +                if entity_dict[rtype] not in extid2eid:
198 +                    return False
199 +        # entity is ready, replace all relation's extid by eids
200 +        for rtype in entity_dict:
201 +            rschema = schema.rschema(rtype)
202 +            if rschema.inlined:
203 +                entity_dict[rtype] = extid2eid[entity_dict[rtype]]
204 +        return True
205 +
206 +
207 +class ExtEntitiesImporter(object):
208 +    """This class is responsible for importing external entities, that is, instances of
209 +    :class:`ExtEntity`, into CubicWeb entities.
210 +
211 +    Parameters:
212 +
213 +    * `schema`: the CubicWeb's instance schema
214 +
215 +    * `store`: a CubicWeb `Store`
216 +
217 +    * `extid2eid`: optional {extid: eid} dictionary giving information on existing entities. It
218 +      will be completed during import. You may want to use :func:`cwuri2eid` to build it.
219 +
220 +    * `existing_relations`: optional {rtype: set((subj eid, obj eid))} mapping giving information
221 +      on existing relations of a given type. You may want to use :class:`RelationMapping` to build it.
222 +
223 +    * `etypes_order_hint`: optional ordered iterable on entity types, giving a hint on the order in
224 +      which the importer should attempt to import them
225 +
226 +    * `import_log`: optional object implementing the :class:`SimpleImportLog` interface to record
227 +      events occurring during the import
228 +
229 +    * `raise_on_error`: optional boolean flag, defaulting to False, indicating whether errors should
230 +      be raised or logged. You usually want them to be raised during test but to be logged in
231 +      production.
232 +    """
233 +
234 +    def __init__(self, schema, store, extid2eid=None, existing_relations=None,
235 +                 etypes_order_hint=(), import_log=None, raise_on_error=False):
236 +        self.schema = schema
237 +        self.store = store
238 +        self.extid2eid = extid2eid if extid2eid is not None else {}
239 +        self.existing_relations = (existing_relations if existing_relations is not None
240 +                                   else defaultdict(set))
241 +        self.etypes_order_hint = etypes_order_hint
242 +        if import_log is None:
243 +            import_log = SimpleImportLog('<unspecified>')
244 +        self.import_log = import_log
245 +        self.raise_on_error = raise_on_error
246 +        # set of created/updated eids
247 +        self.created = set()
248 +        self.updated = set()
249 +
250 +    def import_entities(self, ext_entities):
251 +        """Import given external entities (:class:`ExtEntity`) stream (usually a generator)."""
252 +        # {etype: [etype dict]} of entities that are in the import queue
253 +        queue = {}
254 +        # order entity dictionaries then create/update them
255 +        deferred = self._import_entities(ext_entities, queue)
256 +        # create deferred relations that don't exist already
257 +        missing_relations = self.prepare_insert_deferred_relations(deferred)
258 +        self._warn_about_missing_work(queue, missing_relations)
259 +
260 +    def _import_entities(self, ext_entities, queue):
261 +        extid2eid = self.extid2eid
262 +        deferred = {}  # non inlined relations that may be deferred
263 +        self.import_log.record_debug('importing entities')
264 +        for ext_entity in self.iter_ext_entities(ext_entities, deferred, queue):
265 +            try:
266 +                eid = extid2eid[ext_entity.extid]
267 +            except KeyError:
268 +                self.prepare_insert_entity(ext_entity)
269 +            else:
270 +                if ext_entity.values:
271 +                    self.prepare_update_entity(ext_entity, eid)
272 +        return deferred
273 +
274 +    def iter_ext_entities(self, ext_entities, deferred, queue):
275 +        """Yield external entities in an order which attempts to satisfy
276 +        schema constraints (inlined / cardinality) and to optimize the import.
277 +        """
278 +        schema = self.schema
279 +        extid2eid = self.extid2eid
280 +        for ext_entity in ext_entities:
281 +            # check data in the transitional representation and prepare it for
282 +            # later insertion in the database
283 +            for subject_uri, rtype, object_uri in ext_entity.prepare(schema):
284 +                deferred.setdefault(rtype, set()).add((subject_uri, object_uri))
285 +            if not ext_entity.is_ready(extid2eid):
286 +                queue.setdefault(ext_entity.etype, []).append(ext_entity)
287 +                continue
288 +            yield ext_entity
289 +            # check for some entities in the queue that may now be ready. We'll have to restart the
290 +            # search for ready entities until none is yielded
291 +            new = True
292 +            while new:
293 +                new = False
294 +                for etype in self.etypes_order_hint:
295 +                    if etype in queue:
296 +                        new_queue = []
297 +                        for ext_entity in queue[etype]:
298 +                            if ext_entity.is_ready(extid2eid):
299 +                                yield ext_entity
300 +                                # may unlock entity previously handled within this loop
301 +                                new = True
302 +                            else:
303 +                                new_queue.append(ext_entity)
304 +                        if new_queue:
305 +                            queue[etype][:] = new_queue
306 +                        else:
307 +                            del queue[etype]
308 +
309 +    def prepare_insert_entity(self, ext_entity):
310 +        """Call the store to prepare insertion of the given external entity"""
311 +        eid = self.store.prepare_insert_entity(ext_entity.etype, **ext_entity.values)
312 +        self.extid2eid[ext_entity.extid] = eid
313 +        self.created.add(eid)
314 +        return eid
315 +
316 +    def prepare_update_entity(self, ext_entity, eid):
317 +        """Call the store to prepare update of the given external entity"""
318 +        self.store.prepare_update_entity(ext_entity.etype, eid, **ext_entity.values)
319 +        self.updated.add(eid)
320 +
321 +    def prepare_insert_deferred_relations(self, deferred):
322 +        """Call the store to insert deferred relations (not handled during insertion/update for
323 +        entities). Return a list of relations `[(subj ext id, rtype, obj ext id)]` that could not be
324 +        inserted because the target entities don't exist yet.
325 +        """
326 +        prepare_insert_relation = self.store.prepare_insert_relation
327 +        rschema = self.schema.rschema
328 +        extid2eid = self.extid2eid
329 +        missing_relations = []
330 +        for rtype, relations in deferred.items():
331 +            self.import_log.record_debug('importing %s %s relations' % (len(relations), rtype))
332 +            symmetric = rschema(rtype).symmetric
333 +            existing = self.existing_relations[rtype]
334 +            for subject_uri, object_uri in relations:
335 +                try:
336 +                    subject_eid = extid2eid[subject_uri]
337 +                    object_eid = extid2eid[object_uri]
338 +                except KeyError:
339 +                    missing_relations.append((subject_uri, rtype, object_uri))
340 +                    continue
341 +                if (subject_eid, object_eid) not in existing:
342 +                    prepare_insert_relation(subject_eid, rtype, object_eid)
343 +                    existing.add((subject_eid, object_eid))
344 +                    if symmetric:
345 +                        existing.add((object_eid, subject_eid))
346 +        return missing_relations
347 +
348 +    def _warn_about_missing_work(self, queue, missing_relations):
349 +        error = self.import_log.record_error
350 +        if queue:
351 +            msgs = ["can't create some entities, is there some cycle or "
352 +                    "missing data?"]
353 +            for ext_entities in queue.values():
354 +                for ext_entity in ext_entities:
355 +                    msgs.append(str(ext_entity))
356 +            map(error, msgs)
357 +            if self.raise_on_error:
358 +                raise Exception('\n'.join(msgs))
359 +        if missing_relations:
360 +            msgs = ["can't create some relations, is there missing data?"]
361 +            for subject_uri, rtype, object_uri in missing_relations:
362 +                msgs.append("%s %s %s" % (subject_uri, rtype, object_uri))
363 +            map(error, msgs)
364 +            if self.raise_on_error:
365 +                raise Exception('\n'.join(msgs))
366 +
367 +
368 +class SimpleImportLog(object):
369 +    """Fake CWDataImport log using a simple text format.
370 +
371 +    Useful to display logs in the UI instead of storing them to the
372 +    database.
373 +    """
374 +
375 +    def __init__(self, filename):
376 +        self.logs = []
377 +        self.filename = filename
378 +
379 +    def record_debug(self, msg, path=None, line=None):
380 +        self._log(logging.DEBUG, msg, path, line)
381 +
382 +    def record_info(self, msg, path=None, line=None):
383 +        self._log(logging.INFO, msg, path, line)
384 +
385 +    def record_warning(self, msg, path=None, line=None):
386 +        self._log(logging.WARNING, msg, path, line)
387 +
388 +    def record_error(self, msg, path=None, line=None):
389 +        self._log(logging.ERROR, msg, path, line)
390 +
391 +    def record_fatal(self, msg, path=None, line=None):
392 +        self._log(logging.FATAL, msg, path, line)
393 +
394 +    def _log(self, severity, msg, path, line):
395 +        encodedmsg = u'%s\t%s\t%s\t%s' % (severity, self.filename,
396 +                                          line or u'', msg)
397 +        self.logs.append(encodedmsg)
398 +
399 +
400 +class HTMLImportLog(SimpleImportLog):
401 +    """Fake CWDataImport log using a simple HTML format."""
402 +    def __init__(self, filename):
403 +        super(HTMLImportLog, self).__init__(xml_escape(filename))
404 +
405 +    def _log(self, severity, msg, path, line):
406 +        encodedmsg = u'%s\t%s\t%s\t%s<br/>' % (severity, self.filename,
407 +                                               line or u'', xml_escape(msg))
408 +        self.logs.append(encodedmsg)
diff --git a/dataimport/test/data/people.csv b/dataimport/test/data/people.csv
@@ -0,0 +1,3 @@
409 +# uri,name,knows
410 +http://www.example.org/alice,Alice,
411 +http://www.example.org/bob,Bob,http://www.example.org/alice
diff --git a/dataimport/test/data/schema.py b/dataimport/test/data/schema.py
@@ -0,0 +1,29 @@
412 +# copyright 2003-2011 LOGILAB S.A. (Paris, FRANCE), all rights reserved.
413 +# contact http://www.logilab.fr/ -- mailto:contact@logilab.fr
414 +#
415 +# This file is part of CubicWeb.
416 +#
417 +# CubicWeb is free software: you can redistribute it and/or modify it under the
418 +# terms of the GNU Lesser General Public License as published by the Free
419 +# Software Foundation, either version 2.1 of the License, or (at your option)
420 +# any later version.
421 +#
422 +# CubicWeb is distributed in the hope that it will be useful, but WITHOUT
423 +# ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
424 +# FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public License for more
425 +# details.
426 +#
427 +# You should have received a copy of the GNU Lesser General Public License along
428 +# with CubicWeb.  If not, see <http://www.gnu.org/licenses/>.
429 +
430 +from yams.buildobjs import EntityType, String, SubjectRelation
431 +
432 +from cubicweb.schema import RQLConstraint
433 +
434 +
435 +class Personne(EntityType):
436 +    nom = String(required=True)
437 +    prenom = String()
438 +    enfant = SubjectRelation('Personne', inlined=True, cardinality='?*')
439 +    connait = SubjectRelation('Personne', symmetric=True,
440 +                              constraints=[RQLConstraint('NOT S identity O')])
diff --git a/dataimport/test/unittest_importer.py b/dataimport/test/unittest_importer.py
@@ -0,0 +1,173 @@
441 +# -*- coding: utf-8 -*-
442 +# copyright 2015 LOGILAB S.A. (Paris, FRANCE), all rights reserved.
443 +# contact http://www.logilab.fr -- mailto:contact@logilab.fr
444 +#
445 +# This program is free software: you can redistribute it and/or modify it under
446 +# the terms of the GNU Lesser General Public License as published by the Free
447 +# Software Foundation, either version 2.1 of the License, or (at your option)
448 +# any later version.
449 +#
450 +# This program is distributed in the hope that it will be useful, but WITHOUT
451 +# ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
452 +# FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more
453 +# details.
454 +#
455 +# You should have received a copy of the GNU Lesser General Public License along
456 +# with this program. If not, see <http://www.gnu.org/licenses/>.
457 +"""Tests for cubicweb.dataimport.importer"""
458 +
459 +from collections import defaultdict
460 +
461 +from logilab.common.testlib import unittest_main
462 +
463 +from cubicweb import ValidationError
464 +from cubicweb.devtools.testlib import CubicWebTC
465 +from cubicweb.dataimport import RQLObjectStore, ucsvreader
466 +from cubicweb.dataimport.importer import ExtEntity, ExtEntitiesImporter, SimpleImportLog, RelationMapping
467 +
468 +
469 +class RelationMappingTC(CubicWebTC):
470 +
471 +    def test_nosource(self):
472 +        with self.admin_access.repo_cnx() as cnx:
473 +            alice_eid = cnx.create_entity('Personne', nom=u'alice').eid
474 +            bob_eid = cnx.create_entity('Personne', nom=u'bob', connait=alice_eid).eid
475 +            cnx.commit()
476 +            mapping = RelationMapping(cnx)
477 +            self.assertEqual(mapping['connait'],
478 +                             set([(bob_eid, alice_eid), (alice_eid, bob_eid)]))
479 +
480 +    def test_with_source(self):
481 +        with self.admin_access.repo_cnx() as cnx:
482 +            alice_eid = cnx.create_entity('Personne', nom=u'alice').eid
483 +            bob_eid = cnx.create_entity('Personne', nom=u'bob', connait=alice_eid).eid
484 +            cnx.commit()
485 +            mapping = RelationMapping(cnx, cnx.find('CWSource', name=u'system').one())
486 +            self.assertEqual(mapping['connait'],
487 +                             set([(bob_eid, alice_eid), (alice_eid, bob_eid)]))
488 +
489 +
490 +class ExtEntitiesImporterTC(CubicWebTC):
491 +
492 +    def importer(self, cnx):
493 +        store = RQLObjectStore(cnx)
494 +        return ExtEntitiesImporter(self.schema, store, raise_on_error=True)
495 +
496 +    def test_simple_import(self):
497 +        with self.admin_access.repo_cnx() as cnx:
498 +            importer = self.importer(cnx)
499 +            personne = ExtEntity('Personne', 1, {'nom': set([u'de la lune']),
500 +                                                 'prenom': set([u'Jean'])})
501 +            importer.import_entities([personne])
502 +            cnx.commit()
503 +            rset = cnx.execute('Any X WHERE X is Personne')
504 +            entity = rset.get_entity(0, 0)
505 +            self.assertEqual(entity.nom, u'de la lune')
506 +            self.assertEqual(entity.prenom, u'Jean')
507 +
508 +    def test_import_missing_required_attribute(self):
509 +        """Check import of ext entity with missing required attribute"""
510 +        with self.admin_access.repo_cnx() as cnx:
511 +            importer = self.importer(cnx)
512 +            personne = ExtEntity('Personne', 2, {'prenom': set([u'Jean'])})
513 +            self.assertRaises(ValidationError, importer.import_entities, [personne])
514 +
515 +    def test_import_inlined_relation(self):
516 +        """Check import of ext entities with inlined relation"""
517 +        with self.admin_access.repo_cnx() as cnx:
518 +            importer = self.importer(cnx)
519 +            richelieu = ExtEntity('Personne', 3, {'nom': set([u'Richelieu']),
520 +                                                  'enfant': set([4])})
521 +            athos = ExtEntity('Personne', 4, {'nom': set([u'Athos'])})
522 +            importer.import_entities([athos, richelieu])
523 +            cnx.commit()
524 +            rset = cnx.execute('Any X WHERE X is Personne, X nom "Richelieu"')
525 +            entity = rset.get_entity(0, 0)
526 +            self.assertEqual(entity.enfant[0].nom, 'Athos')
527 +
528 +    def test_import_non_inlined_relation(self):
529 +        """Check import of ext entities with non inlined relation"""
530 +        with self.admin_access.repo_cnx() as cnx:
531 +            importer = self.importer(cnx)
532 +            richelieu = ExtEntity('Personne', 5, {'nom': set([u'Richelieu']),
533 +                                                  'connait': set([6])})
534 +            athos = ExtEntity('Personne', 6, {'nom': set([u'Athos'])})
535 +            importer.import_entities([athos, richelieu])
536 +            cnx.commit()
537 +            rset = cnx.execute('Any X WHERE X is Personne, X nom "Richelieu"')
538 +            entity = rset.get_entity(0, 0)
539 +            self.assertEqual(entity.connait[0].nom, 'Athos')
540 +            rset = cnx.execute('Any X WHERE X is Personne, X nom "Athos"')
541 +            entity = rset.get_entity(0, 0)
542 +            self.assertEqual(entity.connait[0].nom, 'Richelieu')
543 +
544 +    def test_import_missing_inlined_relation(self):
545 +        """Check import of ext entity with missing inlined relation"""
546 +        with self.admin_access.repo_cnx() as cnx:
547 +            importer = self.importer(cnx)
548 +            richelieu = ExtEntity('Personne', 7,
549 +                                  {'nom': set([u'Richelieu']), 'enfant': set([8])})
550 +            self.assertRaises(Exception, importer.import_entities, [richelieu])
551 +            cnx.commit()
552 +            rset = cnx.execute('Any X WHERE X is Personne, X nom "Richelieu"')
553 +            self.assertEqual(len(rset), 0)
554 +
555 +    def test_import_missing_non_inlined_relation(self):
556 +        """Check import of ext entity with missing non-inlined relation"""
557 +        with self.admin_access.repo_cnx() as cnx:
558 +            importer = self.importer(cnx)
559 +            richelieu = ExtEntity('Personne', 9,
560 +                                  {'nom': set([u'Richelieu']), 'connait': set([10])})
561 +            self.assertRaises(Exception, importer.import_entities, [richelieu])
562 +            cnx.commit()
563 +            rset = cnx.execute('Any X WHERE X is Personne, X nom "Richelieu"')
564 +            entity = rset.get_entity(0, 0)
565 +            self.assertEqual(entity.nom, u'Richelieu')
566 +            self.assertEqual(len(entity.connait), 0)
567 +
568 +    def test_update(self):
569 +        """Check update of ext entity"""
570 +        with self.admin_access.repo_cnx() as cnx:
571 +            importer = self.importer(cnx)
572 +            # First import
573 +            richelieu = ExtEntity('Personne', 11,
574 +                                  {'nom': {u'Richelieu Diacre'}})
575 +            importer.import_entities([richelieu])
576 +            cnx.commit()
577 +            rset = cnx.execute('Any X WHERE X is Personne')
578 +            entity = rset.get_entity(0, 0)
579 +            self.assertEqual(entity.nom, u'Richelieu Diacre')
580 +            # Second import
581 +            richelieu = ExtEntity('Personne', 11,
582 +                                  {'nom': {u'Richelieu Cardinal'}})
583 +            importer.import_entities([richelieu])
584 +            cnx.commit()
585 +            rset = cnx.execute('Any X WHERE X is Personne')
586 +            self.assertEqual(len(rset), 1)
587 +            entity = rset.get_entity(0, 0)
588 +            self.assertEqual(entity.nom, u'Richelieu Cardinal')
589 +
590 +
591 +def extentities_from_csv(fpath):
592 +    """Yield ExtEntity read from `fpath` CSV file."""
593 +    with open(fpath) as f:
594 +        for uri, name, knows in ucsvreader(f, skipfirst=True, skip_empty=False):
595 +            yield ExtEntity('Personne', uri,
596 +                            {'nom': set([name]), 'connait': set([knows])})
597 +
598 +
599 +class DataimportFunctionalTC(CubicWebTC):
600 +
601 +    def test_csv(self):
602 +        extentities = extentities_from_csv(self.datapath('people.csv'))
603 +        with self.admin_access.repo_cnx() as cnx:
604 +            store = RQLObjectStore(cnx)
605 +            importer = ExtEntitiesImporter(self.schema, store)
606 +            importer.import_entities(extentities)
607 +            cnx.commit()
608 +            rset = cnx.execute('String N WHERE X nom N, X connait Y, Y nom "Alice"')
609 +            self.assertEqual(rset[0][0], u'Bob')
610 +
611 +
612 +if __name__ == '__main__':
613 +    unittest_main()
diff --git a/doc/book/en/devrepo/dataimport.rst b/doc/book/en/devrepo/dataimport.rst
@@ -10,11 +10,28 @@
614  speed/security tradeoffs. Those keeping all the *CubicWeb* hooks and security will be slower but the
615  possible errors in insertion (bad data types, integrity error, ...) will be raised.
616 
617  These data import utilities are provided in the package `cubicweb.dataimport`.
618 
619 -All the stores have the following API::
620 +The API is built on top of the following concepts:
621 +
622 +* `Store`, class responsible for inserting values in the backend database
623 +
624 +* `ExtEntity`, some intermediate representation of data to import, using external identifier but no
625 +  eid, and usually with slightly different representation than the associated entity's schema
626 +
627 +* `Generator`, class or function that will yield `ExtEntity` from some data source (eg RDF, CSV)
628 +
629 +* `Importer`, class responsible for turning `ExtEntity`'s extid to eid, doing creation or update
630 +  accordingly, and possibly controlling the insertion order of entities before feeding them to a
631 +  `Store`
632 +
633 +Stores
634 +~~~~~~
635 +
636 +Stores are responsible for inserting properly formatted entities and relations into the database. They
637 +have the following API::
638 
639      >>> user_eid = store.prepare_insert_entity('CWUser', login=u'johndoe')
640      >>> group_eid = store.prepare_insert_entity('CWGroup', name=u'unknown')
641      >>> store.relate(user_eid, 'in_group', group_eid)
642      >>> store.flush()
@@ -71,5 +88,10 @@
643  -----------------
644 
645  This store relies on *COPY FROM*/execute many sql commands to directly push data using SQL commands
646  rather than using the whole *CubicWeb* API. For now, **it only works with PostgreSQL** as it requires
647  the *COPY FROM* command.
648 +
649 +ExtEntity and Importer
650 +~~~~~~~~~~~~~~~~~~~~~~
651 +
652 +.. automodule:: cubicweb.dataimport.importer