Blog entries

  • CubicWeb sprint in Paris / Need for Speed

    2011/03/22 by Adrien Di Mascio

    Logilab is hosting a CubicWeb sprint - 3 days in our Paris offices.

    The general focus will be on speed:

    • on the cubicweb-server side: improve performance of massive insertions / deletions
    • on the cubicweb-client side: cache implementation, HTTP server, massive parallel usage, etc.

    This sprint will take place in April 2011, from Tuesday the 26th to Thursday the 28th. You are more than welcome to come along, help out and contribute, but unlike previous sprints, at least basic knowledge of CubicWeb is required of participants since no introduction is planned.

    Network resources will be available for those bringing laptops.

    Address: 104 Boulevard Auguste-Blanqui, Paris. Ring "Logilab" (googlemap)

    Metro: Glacière

    Contact: http://www.logilab.fr/contact

    Dates: 26/04/2011 to 28/04/2011


  • A simple scalable web server HA architecture suitable for medium-sized projects

    2011/02/14 by Florent Cayré

    Having deployed and maintained several public medium-sized web sites running CubicWeb when I worked at SecondWeb, I was asked by my friends from Logilab to write a blog post describing how we managed our deployment while working with the customer and the hosting company.

    Non-technical (albeit important) considerations

    Customers that want to run such a medium-traffic web site either tell you which hosting company they partner with, or ask you to find one, so you have no choice but to deal with an external hosting structure to manage the servers. I actually prefer this, because:

    1. High Availability (HA) hosting really requires skills and hardware that are neither common nor cheap;
    2. HA hosting requires 24/7/365 availability that SecondWeb could not (and did not even want to) offer.

    It is clearly difficult for all parties (try to put yourself in the customer's shoes...) to manage a website with 3 partners involved, each with their own goals. From the development leader's point of view, you will notice that the technical staff of the hosting company changes continuously, and you will keep seeing the same operational errors even if you provide and keep improving high-quality documentation. The software upgrade documentation has to be particularly clear, as it greatly influences the overall web site availability. You also have to keep a history of the interventions on the servers yourself and maintain an up-to-date copy of the configuration files.

    The overall architecture proposed here partly benefits from this experience with managed hosting companies, in that we tried to keep it simple.

    Which traffic size? Why not bigger?

    The architecture proposed here has been successfully tested with sites delivering web pages to up to 2 million unique visitors per month. It should scale further up depending on your site's database access needs: if you need very fresh data and perform a lot of write operations on the database, you will need to distribute database access amongst several servers, which is beyond the scope of this post.

    This is the main limitation of the proposed architecture and the reason why it is not well suited for higher traffic.

    Design choices

    Load balancing - Preserve user sessions

    To achieve very high availability for your web site, you must have no single point of failure in the whole architecture, which can be unreasonable from a cost point of view. However, hosting companies can share costs between their customers and have them benefit from a double network infrastructure all along the way from the Internet to your web servers, themselves hosted in two distant locations. You may then choose an even number of web servers, half of them hosted on each network infrastructure.

    The important thing is that you must preserve user sessions. As of CubicWeb 3.10, DB-persistent sessions have not been implemented yet (they will be soon, as a ticket is planned for this functionality), so you must preserve session cookies by always directing a given user to the same web server. This is usually achieved by configuring the load balancer(s) in IP hash mode (it is faster than balancing on the session cookie, which implies reaching the HTTP stack rather than staying at the TCP/IP level).

    Squid caching, processor load balancing

    Now if you have multi-processor web servers (which is very likely these days), you will need to run one CubicWeb application instance per processor, or the Python GIL will limit your application to a fraction of the available CPU power. This is pretty easy: you just have to duplicate the configuration directories in /etc/cubicweb.d, changing instance names and ports. You can use a simple sed-based script to generate these copies automatically and keep them in sync; a Python equivalent is sketched below.
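
    Here is a minimal sketch of such a duplication script (the instance names myapp1, myapp2, ... and the port option in all-in-one.conf are assumptions to adapt to your own setup):

    import os
    import re
    import shutil

    CONFDIR = '/etc/cubicweb.d'
    REFERENCE = 'myapp1'  # hypothetical reference instance
    BASE_PORT = 8081      # port of the reference instance

    def clone_instance(num):
        '''duplicate the reference instance configuration under a new name,
        giving the copy its own HTTP port (8081, 8082, ...)'''
        dst = os.path.join(CONFDIR, 'myapp%d' % num)
        if os.path.exists(dst):
            shutil.rmtree(dst)  # regenerate existing copies to keep them in sync
        shutil.copytree(os.path.join(CONFDIR, REFERENCE), dst)
        conf = os.path.join(dst, 'all-in-one.conf')
        text = open(conf).read()
        text = re.sub(r'(?m)^port\s*=\s*\d+',
                      'port=%d' % (BASE_PORT + num - 1), text)
        open(conf, 'w').write(text)

    # one instance per processor on a quad-processor web server
    for i in range(2, 5):
        clone_instance(i)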

    Now that we have one instance per processor, the problem of preserving sessions is back. It can be elegantly solved using Squid, which can of course deliver cached objects (in particular images, more on this later), but can also listen on several ports and distribute incoming requests among the CubicWeb instances based on their port of origin. Note that the load balancer must be set up to balance between the ports of the web servers, one port per processor. The Squid configuration to achieve this looks like:

    http_port 81 defaultsite=www.example.org vhost
    acl portA myport 81
    
    http_port 82 defaultsite=www.example.org vhost
    acl portB myport 82
    
    acl site1 dstdomain www.example.org
    
    cache_peer 127.0.0.1 parent 8081 0 no-query originserver default name=server_1
    cache_peer_access server_1 allow portA site1
    cache_peer_access server_1 deny all
    
    cache_peer 127.0.0.1 parent 8082 0 no-query originserver default name=server_2
    cache_peer_access server_2 allow portB site1
    cache_peer_access server_2 deny all
    

    This sets up Squid to listen on ports 81 and 82 and to distribute requests for www.example.org to ports 8081 and 8082 respectively. This way, requests should be evenly balanced between the processors of a bi-processor web server.

    You can now set up Squid more classically to achieve what it was initially designed for: caching. See the Squid documentation for this, particularly the refresh_pattern directive. Note that you do not need to force any standard HTTP cache feature in Squid, as CubicWeb enables you to fine-tune caching using the simple HTTPCacheManager classes found in cubicweb/web/httpcache.py (at the end of this file, you will also find the default cache manager configuration for the entity and startup views).

    CubicWeb with Apache frontend

    This is controversial but it did not hurt in my experience: I like to put an Apache frontend between Squid and the Twisted-based CubicWeb application, because hosting companies are usually pretty good at setting it up, and they like to use the server status page for monitoring, mod_deflate for textual content compression, and mod_rewrite and other modules to customize, monitor or fine-tune the web servers.

    It can however be argued that Apache is a huge piece of software for such a restricted usage, and that its memory footprint would be better used for caching.

    No shared disk

    This is an interesting part that simplifies the overall setup: if you want to save data on disk, it is likely that you also want to keep it in sync between the web servers, or that you have to use a highly reliable network storage solution.

    As we already have a data store accessible from the web servers, namely the database itself, I often choose to use it even for images. This may look like every sysadmin's nightmare, but if you make sure the images are not fetched from the database every second, by using fine-tuned cache settings, it will not hurt. And this way you still benefit from the flexibility of a database and the easier maintenance of a single data store. We can use CubicWeb's cache settings to let Squid cache images for 1 hour, for example. If you have a very dynamic web site however, you will then need to force a URL change when an image is edited. This can easily be achieved in CubicWeb using a custom edit controller that creates a new image when the data attribute of an Image instance is edited, as illustrated here:

    from cubicweb import typed_eid
    from cubicweb.selectors import yes
    from cubicweb.web.views.editcontroller import EditController
    
    
    class CustomEditController(EditController):
        __select__ = EditController.__select__ & yes()
    
        def handle_updated_image(self, old_eid):
            'modify the submitted form to change old_eid into a new entity eid in all keys/values'
            old_eid = unicode(old_eid)
            form = self._cw.form
            new_eid = self._cw.varmaker.next()
            # handle image eid
            del form['__type:%s' % old_eid]
            form['__type:%s' % new_eid] = u'Image'
            # handle eid list
            index = form['eid'].index(old_eid)
            form['eid'] = form['eid'][:index] + [new_eid] + form['eid'][index+1:]
            # handle attributes and relations; iterate over a copy of the
            # items since the form is modified during the loop
            for (k, v) in form.items():
                if v == old_eid:
                    form[k] = new_eid
                if k.endswith(u':%s' % old_eid):
                    form[k[:-len(old_eid)] + new_eid] = v
                    del form[k]
    
        def _default_publish(self):
            # implement image creation when the image data was updated, so that
            # we can use a far expiry date cache on the download view
            images = []
            # iterate over a copy of the items since handle_updated_image
            # modifies the form during the loop
            for (k, v) in self._cw.form.items():
                if v != 'Image' or not k.startswith('__type'):
                    continue
                str_eid = k[len('__type:'):]
                if str_eid == self._cw.form.get('__maineid'):
                    continue  # skip the main edited entity
                try:
                    eid = typed_eid(str_eid)
                except ValueError:
                    continue
                if self._cw.form.get('data-subject:%s' % eid, None):
                    self.handle_updated_image(eid)
                    images.append(eid)
            super(CustomEditController, self)._default_publish()
            for eid in images:
                self._cw.execute('DELETE Image I WHERE I eid %(eid)s', {'eid': eid})
    

    To add the one-hour expiry date for the image download view, you can use:

    from cubicweb.selectors import yes
    from cubicweb.web import httpcache
    from cubicweb.web.views.idownloadable import DownloadView
    
    class CustomDownloadView(DownloadView):
        __select__ = DownloadView.__select__ & yes()
        http_cache_manager = httpcache.MaxAgeHTTPCacheManager
        cache_max_age = 3600
    

    Database server

    Hosting companies now often have a pretty good knowledge of PostgreSQL, the favorite DB backend for CubicWeb. They usually propose to replicate the database for data safety at a low cost, using PostgreSQL's log shipping feature. Note that the new PostgreSQL 9 versions should make it easier to set up replication modes that could be useful to improve performance and scalability, but there is still a lack of production-level experience for the moment. Please share your experience if you have any, because this is the main issue to deal with to scale up further.

    Pre-production

    It is worth mentioning that you need a pre-production server hosted by the same company on the same hardware (or virtual machine), because:

    • software upgrades will run smoother if the technical staff of the hosting company has already performed the same upgrade operation once: check that the same person does both within a short timeframe if possible;
    • you will feel better if your migration scripts have successfully run on a fresh copy of the production data: ask for a db copy before a pre-production upgrade, which is much easier to do if you do not have to copy the database dumps remotely;
    • the pre-production server can host its own database server as well as the replica of the production one.

    Monitoring

    When you experience web site downtime, it is much too late to start looking at the available monitoring. It is important to prepare the tools you need to diagnose a problem, to get used to reading the graphs, and to keep in mind the orders of magnitude of the values and their variations.

    Even the simplest graphs, like CPU usage, need to be correctly interpreted. In a recent setup, I did not realize that only one CPU was used on a bi-processor server, delivering half the power it should have... When you cannot access the machine and use top, the monitoring graphs are all the information you have, so you must know how to read them!

    Apart from the classical CPU, CPU load, (detailed) memory usage, and network traffic, ask for PostgreSQL, Squid, and Apache specific graphs (plug-ins for them are easy to find and install for classic monitoring solutions).

    For CubicWeb web sites, it is also worth setting up the following views and using them for automatic alerts:

    • a software / db version consistency monitoring view;
    • a db pool size monitoring view;
    • a simple db connection check view;
    • a view writing the server host name: not useful for automatic alerts, but handy to see which server your IP is directed to, which you need when you cannot reproduce the behaviour the customer is complaining about...

    Here are some classes I use for these tasks. Feel free to reuse and adapt them to your needs:

    from socket import gethostname

    from cubicweb.selectors import yes
    from cubicweb.view import View
    
    
    class _MonitoringView(View):
        __abstract__ = True
        __select__ = yes()
        content_type = 'text/plain'
        templatable = False
    
    
    class PoolMonitoringView(_MonitoringView):
        __regid__ = 'monitor_pool'
    
        def call(self):
            repo = self._cw.cnx._repo
            max_pool = self._cw.vreg.config['connections-pool-size']
            # report the percentage of connection pools currently in use
            percent = ((max_pool - repo._available_pools.qsize()) * 100.0) / max_pool
            self.w(u'%s%%' % percent)
    
    
    class DBMonitoringView(_MonitoringView):
        __regid__ = 'monitor_db'
    
        def call(self):
            try:
                count = self._cw.execute('Any COUNT(X) WHERE X is CWUser')[0][0]
                self.w(u'ServiceOK : %s users in DB' % count)
            except Exception:
                self.w(u'ServiceKO')
    
    
    class VersionMonitoringView(_MonitoringView):
        __regid__ = 'monitor_version'
    
        def versions_text(self, versions):
            return u' | '.join(cube + u': ' + u'.'.join(unicode(x) for x in version)
                               for (cube, version) in versions)
    
        def call(self):
            config = self._cw.vreg.config
            vc_config = config.vc_config()
            db_config = [('cubicweb', vc_config.get('cubicweb', '?'))]
            fs_config = [('cubicweb', config.cubicweb_version())]
            for cube in sorted(config.cubes()):
                db_config.append((cube, vc_config.get(cube, '?')))
                try:
                    fs_version = config.cube_version(cube)
                except Exception:
                    fs_version = '?'
                fs_config.append((cube, fs_version))
            db_config = self.versions_text(db_config)
            fs_config = self.versions_text(fs_config)
            if db_config == fs_config:
                self.w(u'ServiceOK : FS config %s == DB config %s' % (fs_config, db_config))
            else:
                self.w(u'ServiceKO : FS config %s != DB config %s' % (fs_config, db_config))
    
    
    class HostnameMonitoringView(_MonitoringView):
        __regid__ = 'monitor_hostname'
    
        def call(self):
            self.w(unicode(gethostname()))
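
    Since these views are registered like any other CubicWeb view, the monitoring system can poll them through plain URLs such as http://www.example.org/?vid=monitor_pool or http://www.example.org/?vid=monitor_db, and simply match the ServiceOK / ServiceKO strings in the responses.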
    

    Sketch of the architecture and conclusion

    Here is a sketch of the proposed architecture. Please comment on it and share your experience on the topic; I would be happy to learn your tips and tricks.

    I would conclude with an important remark regarding performance: a good scalable architecture is of great help to run a busy web site smoothly, but the boost you get by optimizing your software's performance is usually worth it and must be seriously considered before any hardware upgrade, however costly it may seem at first glance.

    /file/1521968?vid=download

  • Profiling your CubicWeb instance

    2009/03/27 by Adrien Di Mascio

    If you feel that one of your pages takes more time than it should to be generated, chances are that you're making too many RQL queries. Obviously there are other possible reasons, but my personal experience tends to show this is the first thing to track down. Luckily for us, CubicWeb provides a configuration option to log RQL queries. In your all-in-one.conf file, set the query-log-file option:

    # web application query log file
    query-log-file=~/myapp-rql.log
    

    Then restart your application, reload your page and stop your application. The file myapp-rql.log now contains the list of RQL queries that were executed during your test. It's a simple text file containing lines such as:

    Any A WHERE X eid %(x)s, X lastname A {'x': 448} -- (0.002 sec, 0.010 CPU sec)
    Any A WHERE X eid %(x)s, X firstname A {'x': 447} -- (0.002 sec, 0.000 CPU sec)
    

    The structure of each line is:

    <RQL QUERY> <QUERY ARGS IF ANY> -- <TIME SPENT>
    

    Use the cubicweb-ctl exlog command to examine and summarize the data found in such a file:

    adim@crater:~$ cubicweb-ctl exlog < ~/myapp-rql.log
    0.07 50 Any A WHERE X eid %(x)s, X firstname A {}
    0.05 50 Any A WHERE X eid %(x)s, X lastname A {}
    0.01 1 Any X,AA ORDERBY AA DESC WHERE E eid %(x)s, E employees X, X modification_date AA {}
    0.01 1 Any X WHERE X eid %(x)s, X owned_by U, U eid %(u)s {, }
    0.01 1 Any B,T,P ORDERBY lower(T) WHERE B is Bookmark,B title T, B path P, B bookmarked_by U, U eid %(x)s {}
    0.01 1 Any A,B,C,D WHERE A eid %(x)s,A name B,A creation_date C,A modification_date D {}
    

    This command sorts and uniquifies queries so that it's easy to see which queries are the hot spots that need optimization.

    Having said all this, it would probably be worth talking about the fetch_attrs attribute you can define in your entity classes, because it can greatly reduce the number of queries executed, but I'll keep the details for a specific blog entry.
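
    In the meantime, here is a minimal sketch of the idea (the Person entity type and its attributes are just an example): attributes listed in fetch_attrs are prefetched in a single query whenever a list of entities is built, instead of being fetched with one query per entity and attribute as in the log above.

    from cubicweb.entities import AnyEntity, fetch_config


    class Person(AnyEntity):
        __regid__ = 'Person'
        # prefetch firstname and lastname (and sort on the first attribute)
        # whenever lists of Person entities are built
        fetch_attrs, fetch_order = fetch_config(['firstname', 'lastname'])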

    I should finally mention the existence of the profile option in all-in-one.conf. If set, this option will make your application run in a hotshot session and store the results in the specified file.
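
    The resulting file can then be examined with the standard hotshot.stats module; a minimal sketch, assuming the profile data was stored in ~/myapp.prof:

    import os.path
    import hotshot.stats

    # load the profile data written by the instance
    stats = hotshot.stats.load(os.path.expanduser('~/myapp.prof'))
    stats.strip_dirs()
    stats.sort_stats('cumulative', 'calls')
    stats.print_stats(25)  # print the 25 most expensive calls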


  • CubicWeb using Postgresql at its best

    2014/02/08 by Nicolas Chauvat

    We had a chat today with a core contributor to PostgreSQL from whom we may buy consulting services in the future. We discussed how CubicWeb could get the best out of PostgreSQL:

    • making use of the LISTEN/NOTIFY mechanism built into PostgreSQL could be useful, for example to warn the cache about modified items (see the first sketch after this list), and PgQ is its good friend;
    • views (materialized or not) are another way to implement computed attributes and relations (see CWEP number 002) and it could be that the Entities table is in fact a view of other tables;
    • implementing RQL as an in-database language could open the door to new things (there is PL/pgSQL, PL/Python, what if we had PL/RQL?);
    • Foreign Data Wrappers written with Multicorn would be another way to write data feeds (see LDAP integration for an example);
    • managing dates can be tricky when users reside in different timezones and UTC is important to keep in mind (unicode/str is a good analogy);
    • for transitive closures that are often needed when implementing access control policies with __permissions, Postgresql can go a long way with queries like "WITH ... (SELECT UNION ALL SELECT RETURNING *) UPDATE USING ...";
    • the fastest way to load tabular data that does not need too much pre-processing is to create a temporary table in memory, then COPY-FROM the data into that table, then index it, then write the transform and load step in SQL, maybe with PL/Python (see the second sketch after this list);
    • when executing more than 10 updates in a row, it is better to write into a temporary table in memory, then update the actual tables with a single UPDATE ... FROM query (let's check whether the psycopg driver does that when executemany is called);
    • reaching 10e8 rows in a table is, at the time of this writing, the stage at which you should start monitoring your db seriously and start considering replication, partitioning and sharding;
    • full-text search in PostgreSQL is much better than the general public thinks it is, and recent developments have made it orders of magnitude faster than tools like Lucene, Solr or ElasticSearch;
    • when dealing with complex queries (searching graphs, for example), an option to consider is to implement a specific data type, use it in a materialized view and build GIN or GiST indexes over it;
    • for large scientific data sets, it could be interesting to link the numpy library into Postgresql and turn numpy arrays into a new data type;
    • Oh, and one last thing: the object-oriented tables of PostgreSQL are not such a great idea, unless you have a use case that fits them perfectly and does not hit their limitations (CubicWeb's is_instance_of does not seem to be one of those).
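
    To make two of these items more concrete, here are minimal sketches using the psycopg2 driver; the connection strings, table names, channel name and payloads are hypothetical. First, LISTEN/NOTIFY used to watch for cache invalidation messages:

    import select

    import psycopg2
    import psycopg2.extensions

    conn = psycopg2.connect('dbname=mydb')  # hypothetical database
    conn.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)
    cur = conn.cursor()
    cur.execute('LISTEN cache_invalidation')

    while True:
        # wait up to 5 seconds for a notification on the connection
        if select.select([conn], [], [], 5) == ([], [], []):
            continue
        conn.poll()
        while conn.notifies:
            notify = conn.notifies.pop()
            print 'invalidate item', notify.payload

    Second, the staging-table approach for bulk loads and mass updates:

    import psycopg2

    conn = psycopg2.connect('dbname=mydb')
    cur = conn.cursor()
    # load raw tabular data with COPY FROM into a temporary staging table...
    cur.execute('CREATE TEMPORARY TABLE staging '
                '(code text, label text, value float)')
    with open('data.tsv') as fobj:
        cur.copy_from(fobj, 'staging', sep='\t')
    # ...index it, then transform and load in plain SQL
    cur.execute('CREATE INDEX staging_code_idx ON staging (code)')
    cur.execute('INSERT INTO target (code, label, value) '
                'SELECT code, upper(label), value FROM staging')
    # the same table can back mass updates, instead of thousands of
    # individual UPDATE statements
    cur.execute('UPDATE target SET value = s.value '
                'FROM staging s WHERE target.code = s.code')
    conn.commit()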

    Hopin' I got you thinkin' :)
