Having deployed and maintained several public medium-sized web sites running
CubicWeb when I worked at SecondWeb, I was asked
by my friends from Logilab to write a blog post
describing how we managed our deployment while working with the customer and the
hosting company.
Customers that want to run such a medium-traffic web site either tell you
which hosting company they partner with, or ask you to find one, so you have no
choice but to deal with an external hosting company to manage the servers.
I actually prefer this, because:
- High Availability (HA) hosting really requires skills and hardware that are
neither common nor cheap;
- HA hosting requires 24/7/365 staff availability, which SecondWeb could not
(and did not even want to) offer.
It is clearly difficult for all parties (try to put yourself in the shoes of the
customer...) to manage a website with 3 partners involved, each with their own
goals. From the development leader's point of view, you will notice that the
technical staff of the hosting company changes continuously, and you keep seeing
the same operational errors even if you provide, and keep improving, high-quality
documentation. The software upgrade documentation has to be particularly clear,
as it greatly influences the overall web site availability. You also have to
keep a history of the interventions on the servers yourself, and maintain an
up-to-date copy of the configuration files.
The overall architecture proposed here partly benefits from this experience with
managed hosting companies, in that we tried to keep it simple.
The architecture proposed here has been successfully tested with sites
delivering web pages to up to 2 million unique visitors per month. It should
scale further up depending on your site's database access needs: if you need very
fresh data and have a lot of write operations to the database, you will need
to distribute database access amongst several servers, which is beyond the scope
of this post.
This is the main limitation of the proposed architecture and the reason why it
is not well suited to higher-traffic sites.
To achieve very high availability for your web site, you must have no single
point of failure in the whole architecture, which can be far from reasonable
cost-wise. However, hosting companies can share costs
between their customers and have them benefit from a double network
infrastructure all along the way from the Internet to your web servers,
themselves hosted at two distant locations. You may then choose an even number
of web servers, half of them hosted on each network infrastructure.
The important thing is that you must preserve user sessions. As of CubicWeb
3.10, DB-persistent sessions have not been implemented yet (they should be soon:
a ticket is planned for this functionality), so you must preserve session
cookies by always directing a given user to the same web server. This is usually
achieved by configuring the load balancer(s) in IP hash mode, which is faster
than balancing on the session cookie, as it stays at the TCP/IP level instead of
having to climb up into the HTTP stack.
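To make the idea concrete, here is a tiny sketch in plain Python of why an IP
hash preserves sessions (this is not the code of any actual load balancer, and
the backend addresses are made up):

# illustration of the IP hash principle, not a real load balancer
import hashlib

BACKENDS = ['10.0.0.1:81', '10.0.0.2:81']  # made-up web server addresses

def pick_backend(client_ip):
    # the same client IP always yields the same digest, hence the same
    # backend: the user keeps hitting the server that owns his session
    digest = hashlib.md5(client_ip).hexdigest()
    return BACKENDS[int(digest, 16) % len(BACKENDS)]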
Now, if you have multi-processor web servers (which is very likely these days),
you will need to run one CubicWeb application instance per processor, or the
Python GIL will limit your application to a fraction of the available CPU
power. This is pretty easy: you just have to duplicate the configuration
directories in /etc/cubicweb.d, changing instance names and ports. You can use a simple
sed-based script to generate these copies automatically and keep them in sync.
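Here is a minimal sketch of such a script; it assumes an instance named site1
listening on port 8081, with its configuration in /etc/cubicweb.d/site1, and
must of course be adapted to your actual file layout:

#!/bin/sh
# duplicate the configuration of instance site1 for a second processor
cp -r /etc/cubicweb.d/site1 /etc/cubicweb.d/site2
# rename the instance and shift its port in the copied configuration files
sed -i -e 's/site1/site2/g' -e 's/8081/8082/g' /etc/cubicweb.d/site2/*.conf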
Now that we have one instance per processor, the problem of preserving sessions
is back. It can be elegantly solved using Squid,
which can of course deliver cached objects (in particular images, more on this
later), but also listen on several ports and distribute incoming requests evenly
among the CubicWeb instances, according to the port each request arrived on.
Note that the load balancer must be set up to balance between the ports of the
web servers, one port per processor. The Squid configuration file to achieve
this looks like:
http_port 81 defaultsite=www.example.org vhost
acl portA myport 81
http_port 82 defaultsite=www.example.org vhost
acl portB myport 82
acl site1 dstdomain www.example.org
cache_peer 127.0.0.1 parent 8081 0 no-query originserver default name=server_1
cache_peer_access server_1 allow portA site1
cache_peer_access server_1 deny all
cache_peer 127.0.0.1 parent 8082 0 no-query originserver default name=server_2
cache_peer_access server_2 allow portB site1
cache_peer_access server_2 deny all
This sets up Squid to listen on ports 81 and 82 and to direct requests for
www.example.org to ports 8081 and 8082 respectively. This way, requests
should be evenly balanced between the processors of a bi-processor web server.
You can now set up Squid more classically to do what it was initially designed
for: caching. See the Squid documentation for this, particularly the
caching-related directives. Note that you do not need to force any standard HTTP
cache feature in Squid, as CubicWeb enables you to fine-tune caching using the
simple HTTPCacheManager classes found in cubicweb/web/httpcache.py (at the end
of this file, you will also find the default cache manager configuration for the
entity and startup views).
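For instance, assuming the CubicWeb 3.x view attributes http_cache_manager and
cache_max_age, one of your views could tune its cache policy along these lines
(the view and its max-age value are made up for the example):

from cubicweb.web import httpcache
from cubicweb.view import StartupView

class HomepageView(StartupView):
    """hypothetical homepage view, for illustration only"""
    __regid__ = 'my-homepage'
    # let caches serve this page for up to 5 minutes
    http_cache_manager = httpcache.MaxAgeHTTPCacheManager
    cache_max_age = 300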
This is controversial, but it did not hurt in my case: I like to put an Apache
frontend between Squid and the Twisted-based CubicWeb application, because
hosting companies are usually pretty good at setting Apache up, and like to use
server-status for monitoring, mod_deflate for textual content compression, and
mod_rewrite and other modules to customize, monitor or fine-tune the web servers.
It can however be argued that Apache is a huge piece of software for such a
restrictive usage, and its memory footprint would be better used for caching.
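As a rough sketch, such a frontend virtual host could look like the following
(the directives are standard Apache 2.2 ones, but the exact layout depends on
your distribution and on the modules the hosting company enables):

<VirtualHost *:80>
    ServerName www.example.org
    # mod_deflate: compress textual content on the fly
    AddOutputFilterByType DEFLATE text/html text/css text/plain application/javascript
    # mod_status: expose metrics to the monitoring system only
    <Location /server-status>
        SetHandler server-status
        Order deny,allow
        Deny from all
        Allow from 127.0.0.1
    </Location>
    # mod_proxy: hand everything else over to Squid (port 81 as above)
    ProxyPass /server-status !
    ProxyPass / http://127.0.0.1:81/
    ProxyPassReverse / http://127.0.0.1:81/
</VirtualHost>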
Hosting companies now often have a pretty good knowledge of PostgreSQL, the
favorite DB back end for CubicWeb. They usually propose to replicate the database
for data safety at a low cost, using the PostgreSQL log shipping feature. Note
that the new PostgreSQL 9 versions should make it easier to set up replication
modes that could be useful for improving performance and scalability, but there
is still a lack of production-level experience with them for the moment. Please
share if you have some, because this is the main issue to deal with when scaling
up further.
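For reference, pre-9.0 log shipping boils down to a few configuration lines; the
archive directory and standby host below are made up:

# postgresql.conf on the master: ship completed WAL segments
archive_mode = on
archive_command = 'rsync -a %p standby:/var/lib/postgresql/wal_archive/%f'

# recovery.conf on the standby: replay the shipped segments
restore_command = 'cp /var/lib/postgresql/wal_archive/%f %p'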
It is worth mentioning that you need a pre-production server hosted by the same
company on the same hardware (or virtual machine), because:
- software upgrades will run more smoothly if the technical staff of the hosting
company has already performed the same upgrade operation once: if possible,
check that the same person does both within a short timeframe;
- you will feel better if your migration scripts have successfully run on a
fresh copy of the production data: ask for a db copy before a pre-production
upgrade (see the commands sketched after this list); this is much easier to do
if you do not have to copy the database between two hosting companies;
- the pre-production server can host its own database server and the replication
of the production one.
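Getting such a fresh copy is a matter of a few standard PostgreSQL commands
(database names are made up here):

# on the production host: dump the database in the compact custom format
pg_dump -Fc -f /tmp/proddb.dump proddb
# on the pre-production host: recreate the database and load the dump
createdb preproddb
pg_restore -d preproddb /tmp/proddb.dump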
When you experience web site downtime, it is much too late to start looking at
the available monitoring. It is important to prepare the tools you need to
diagnose a problem beforehand, to get used to reading the graphs, and to have
the orders of magnitude of the values and their variations in mind.
Even the simplest graphs, like CPU usage, need to be correctly interpreted. In
a recent setup, I did not realize that only one CPU was used on a bi-processor
server, which was thus delivering half the power it should... When you cannot
access the machine and use top, the monitoring graphs are all the information
you have, so you must know how to read them!
Apart from the classical CPU, CPU load, (detailed) memory usage and network
traffic graphs, ask for PostgreSQL-, Squid- and Apache-specific graphs
(plug-ins for these are easy to find and install in classic monitoring
solutions).
For CubicWeb web sites, it is also worth setting up the following views and
using them for automatic alerts:
- a software / db version consistency check;
- a db connection pool size check;
- a simple db connection check view;
- a view writing the server host name: this one is not interesting for automatic
alerts, but lets you see which server your IP is directed to, which is needed
when you cannot reproduce the behaviour the customer is complaining about...
There are some classes I use for these tasks. Feel free to reuse and adapt them
to your needs:
from socket import gethostname

from cubicweb.selectors import yes  # cubicweb.predicates in later versions
from cubicweb.view import View

class _MonitoringView(View):
    __abstract__ = True
    __select__ = yes()
    content_type = 'text/plain'
    templatable = False

class PoolMonitoringView(_MonitoringView):
    """percentage of the repository connection pools in use"""
    __regid__ = 'monitor_pool'

    def call(self):
        repo = self._cw.cnx._repo
        max_pool = self._cw.vreg.config['connections-pool-size']
        percent = ((max_pool - repo._available_pools.qsize()) * 100.0) / max_pool
        self.w(u'%s%%' % percent)

class DBMonitoringView(_MonitoringView):
    """simple database connection check"""
    __regid__ = 'monitor_db'

    def call(self):
        count = self._cw.execute('Any COUNT(X) WHERE X is CWUser')[0][0]
        self.w(u'ServiceOK : %s users in DB' % count)

class VersionMonitoringView(_MonitoringView):
    """check that file system and database versions are consistent"""
    __regid__ = 'monitor_version'

    def versions_text(self, versions):
        return u' | '.join(cube + u': ' + u'.'.join(unicode(x) for x in version)
                           for (cube, version) in versions)

    def call(self):
        config = self._cw.vreg.config
        vc_config = config.vc_config()
        db_config = [('cubicweb', vc_config.get('cubicweb', '?'))]
        fs_config = [('cubicweb', config.cubicweb_version())]
        for cube in sorted(config.cubes()):
            db_config.append((cube, vc_config.get(cube, '?')))
            try:
                fs_version = config.cube_version(cube)
            except Exception:
                fs_version = '?'
            fs_config.append((cube, fs_version))
        db_config = self.versions_text(db_config)
        fs_config = self.versions_text(fs_config)
        if db_config == fs_config:
            self.w(u'ServiceOK : FS config %s == DB config %s' % (fs_config, db_config))
        else:
            self.w(u'ServiceKO : FS config %s != DB config %s' % (fs_config, db_config))

class HostnameMonitoringView(_MonitoringView):
    """write the name of the server answering the request"""
    __regid__ = 'monitor_hostname'

    def call(self):
        self.w(unicode(gethostname()))
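Once these views are deployed in your application's cube, the monitoring system
can poll them directly, for instance at http://www.example.org/?vid=monitor_db
(assuming the standard vid URL parameter is left available in your setup), and
raise an alert whenever the ServiceOK marker is missing or the pool percentage
exceeds a threshold.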
Here is a sketch of the proposed architecture. Please comment on it and share
your experience on the topic: I would be happy to learn your tips and tricks.
I will conclude with an important remark regarding performance: a good scalable
architecture is of great help in running a busy web site smoothly; however, the
performance boost you get by optimizing the software itself is usually worth it,
and must be seriously considered before any hardware upgrade, even if it seems
costly at first glance.