"Data Fast-food": quick interactive exploratory processing and visualization of complex datasets with CubicWeb

With the emergence of the semantic web in the past few years, and the increasing number of high quality open data sets (cf the lod diagram), there is a growing interest in frameworks that allow to store/query/process/mine/visualize large data sets.

We have seen in previous blog posts how CubicWeb may be used as an efficient knowledge management system for various types of data, and how it may be used to perform complex queries. In this post, we will see, using Geonames data, how CubicWeb may perform simple or complex data mining and machine learning procedures on data, using the datamining cube. This cube adds powerful tools to CubicWeb that make it easy to interactively process and visualize datasets.

At this point, it is not meant to be used on massive datasets, for it is not fully optimized yet. If you try to perform a TF-IDF (term frequency–inverse document frequency) with a hierarchical clustering on the full dbpedia abstracts dataset, be prepared to wait. But it is a promising way to enrich the user experience while playing with different datasets, for quick interactive exploratory datamining processing (what I've called the "Data fast-food"). This cube is based on the scikit-learn toolbox that has recently gained a huge popularity in the machine learning and Python community. The release of this cube drastically increases the interest of CubicWeb for data management.

The Datamining cube

For a given query, similarly to SQL, CubicWeb returns a result set. This result set may be presented by a view to display a table, a map, a graph, etc (see documentation and previous blog posts).

The datamining cube introduces the possibility to process the result set before presenting it, for example to apply machine learning algorithms to cluster the data.

The datamining cube is based on two concepts:

  • the concept of processor: basically, a processor transforms a result set in a numpy array, given some criteria defining the mathematical processing, and the columns/rows of the result set to be taken into account. The numpy-array is a polyvalent structure that is widely used for numerical computation. This array could thus be efficiently used with any kind of datamining algorithms. Note that, in our context of knowledge management, it is more convenient to return a numpy array with additional meta-information, such as indices or labels, the result being stored in what we call a cw-array. Meta-information may be useful for display, but is not compulsory.
  • the concept of array-view: the "views" are basic components of CubicWeb, distinguish querying and displaying the data is key in this framework. So, on a given result set, many different views can be applied. In the datamining cube, we simply overload the basic view of CubicWeb, so that it works with cw-array instead of result sets. These array-views are associated to some machine learning or datamining processes. For example, one can apply the k-means (clustering process) view on a given cw-array.

A very important feature is that the processor and the array-view are called directly through the URL using the two related parameters arid (for ARray ID) and vid (for View ID, standard in CubicWeb).

Processors

We give some examples of basic processors that may be found in the datamining cube:

  • AttributesAsFloatArrayProcessor (arid='attr-asfloat'): This processor turns all Int, BigInt and Float attributes in the result set to floats, and returns the corresponding array. The number of rows is equal to the number of rows in the result set, and the number of columns is equal to the number of convertible attributes in the result set.
  • EntityAsFloatArrayProcessor (arid='entity-asfloat'): This processor performs similarly to the AttributesAsFloatArrayProcessor, but keeps the reference to the entities used to create the numpy-array. Thus, this information could be used for display (map, label, ...).
  • AttributesAsTokenArrayProcessor (arid='attr-astoken'): This processor turns all String attributes in the result set in a numpy array, based on a Word-n-gram analyze. This may be used to tokenize a set of strings.
  • PivotTableCountArrayProcessor (arid='pivot-table-count'): This processor is used to create a pivot table, with a count function. Other functions, such as sum or product also exist. This may be used to create some spreadsheet-like views.
  • UndirectedRelationArrayProcessor (arid='undirected-rel'): This processor creates a binary numpy array of dimension (nb_entities, nb_entities), that represents the relations (or corelations) between entities. This may be used for graph-based vizualisation.

We are also planning to extend the concept of processor to sparse matrix (scipy.sparse), in order to deal with very high dimensional data.

Array Views

The array views that are found in the datamining cube, are, for most of them, used for simple visualization. We used HTML-based templates and the Protovis Javascript Library.

We will not detail all the views, but rather show some examples. Read the reference documentation for a complete and detailed description.

Examples on numerical data

Histogram

The request:

Any LO, LA WHERE X latitude LA, NOT X latitude NULL, X longitude LO,  NOT X longitude NULL,
X country C, NOT X elevation NULL, C name "France"

that may be translated as:

All couples (latitude, longitude) of the locations in France, with an elevation not null

and, using vid=protovis-hist and arid=attr-asfloat

Scatter plot

Using the notion of view, we can display differently the same result set, for example using a scatter plot (vid=protovis-scatterplot).

Another example with the request:

Any P, E WHERE X is Location, X elevation E, X elevation >1, X population P,
X population >10, X country CO, CO name "France"

that may be translated as:

All couples (population, elevation) of locations in France, with a population higher than 10 (inhabitants),and an elevation higher than 1 (meter)

and, using the same vid (vid=protovis-scatterplot) and the same arid (arid=attr-asfloat)

If a third column is given in the result set (and thus in the numpy array), it will be encoded in the size/color of each dot of the scatter plot. For example with the request:

Any LO, LA, E WHERE X latitude LA, NOT X latitude NULL, X longitude LO,  NOT X longitude NULL,
X country C, NOT X elevation NULL, X elevation E, C name "France"

that may be translated as:

All tuples (latitude, longitude, elevation) of the locations in France, with an elevation not null

and, using the same vid (vid=protovis-scatterplot) and the same arid (arid=attr-asfloat), we can visualize the elevation on a map, encoded in size/color

Another example with the request:

Any LO, LA LIMIT 50000 WHERE X is Location, X population  >1000, X latitude LA, X longitude LO,
X country CO, CO name "France"

that may be translated as:

All couples (latitude, longitude) of 50000 locations in France, with a population higher than 100 (inhabitants)

There also exist some AreaChart view, LineArray view, ...

Examples on relational data

Relational Matrix (undirected graph)

The request:

Any X,Y WHERE X continent CO, CO name "North America", X neighbour_of Y

that may be translated as:

All neighbour countries in North America

and using the vid='protovis-binarymap' and arid='undirected-rel'

Relational Matrix (directed graph)

If we do not want a symmetric matrix, i.e. if we want to keep the direction of a link (X,Y is not the same relation as Y,X), we can use the directed*rel array processor. For example, with the following request:

Any X,Y LIMIT 20 WHERE X continent Y

that may be translated as:

20 countries and their continent

and using the vid='protovis-binarymap' and arid='directed-rel'

Force directed graph

For a dynamic representation of relations, we can use a force directed graph. The request:

Any X,Y WHERE X neighbour_of Y

that may be translated as:

All neighbour countries in the World.

and using the vid='protovis-forcedirected' and arid='undirected-rel', we can see the full graph, with small independent components (e.g. UK and Ireland)

Again, a third column in the result set could be used to encode some labeling information, for example the continent.

The request:

Any X,Y,CO WHERE X neighbour_of Y, X continent CO

that may be translated as:

All neighbour countries in the World, and their corresponding continent.

and again, using the vid='protovis-forcedirected' and arid='undirected-rel', we can see the full graph with the continents encoded in color (Americas in green, Africa in dark blue, ...)

Dendrogram

For hierarchical information, one can use the Dendrogram view. For example, with the request:

Any X,Y WHERE X continent Y

that may be translated as:

All couple (country, continent) in the World

and using vid='protovis-dendrogram' and arid='directed-rel', we have the following dendrogram (we only show a part due to lack of space)

Unsupervised Learning

We have also developed some machine learning view for unsupervised learning. This is more a proof of concept than a fully optimized development, but we can already do some cool stuff. Each machine learning processing is referenced by a mlid. For example, with the request:

Any LO, LA WHERE X is Location, X elevation E, X elevation >1, X latitude LA, X longitude LO,
X country CO, CO name "France"

that may be translated as:

All couples (latitude, longitude) of the locations in France, with an elevation higher than 1

and using vid='protovis-scatterplot' arid='attr-asfloat' and mlid='kmeans', we can construct a scatter plot of all couples of latitude and longitude in France, and create 10 clusters using the kmeans clustering. The labeling information is thus encoded in color/size:

Download

Finally, we have also implement a download view, based on the Pickle of the numpy-array. It is thus possible to access remotely any data within a Python shell, allowing to process them as you want. Changing the request can be done very easily by changing the rql parameter in the URL. For example:

import pickle, urllib
data = pickle.loads(urllib.open('http://mydomain?rql=my request&vid=array-numpy&arid=attr-asfloat'))