Following the "release early, release often" mantra, I thought it
might be a good idea to apply it to monitoring on one of our client
projects. So right from the demo stage, where we deliver a new version
every few weeks (and sometimes every few days), we set up some
monitoring.
To evaluate parts of the performance of a web page we can look at some
metrics such as the number of assets the browser will need to
download, the size of the assets (js, css, images, etc) and even
things such as the number of subdomains used to deliver assets. You
can take a look at such metrics in most developer tools available in
the browser, but we want to graph this over time. A nice tool for this
is sitespeed.io (written in JavaScript with PhantomJS). Out of the box,
it can send its results to Graphite, so we just have to add
--graphiteHost FQDN to the command line. sitespeed.io even recommends
using Grafana to visualize the results and publishes some example
dashboards that can be adapted to your needs.
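To give an idea of what ends up in Graphite, here is a minimal sketch of Carbon's plaintext protocol, where each metric is one line of the form "path value timestamp", usually sent over TCP port 2003. This is not what sitespeed.io does internally; the metric names and the host below are made up for illustration.

```python
import socket
import time


def format_metric(path, value, timestamp=None):
    """Return one Carbon plaintext line: path, value and Unix timestamp."""
    if timestamp is None:
        timestamp = int(time.time())
    return "%s %s %d\n" % (path, value, timestamp)


def send_metrics(lines, host, port=2003):
    """Send already formatted lines to a Carbon daemon over TCP."""
    payload = "".join(lines).encode("ascii")
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(payload)


# Hypothetical metric names, for illustration only.
lines = [
    format_metric("demo.homepage.requests", 42, 1467844481),
    format_metric("demo.homepage.css_bytes", 18230, 1467844481),
]
# send_metrics(lines, "graphite.example.org")  # hypothetical host
```

Keeping the formatting separate from the network call makes it easy to check what would be sent before pointing it at a real Carbon instance.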
The sitespeed.io command is configured and run by Salt, using pillars
for the configuration and the Salt scheduler to run it.
We still have to take a look at using their Jenkins plugin with our
Jenkins continuous integration instance.
Applications will have bugs (in particular when released often to get
a client to validate some design choices early). Level 0 is having
your client call you up to say the application has crashed. The
next level is watching some log somewhere to see those errors pop
up. The next level is centralised logs on which you can monitor the
numerous pieces of your application (rsyslog over UDP helps here, and
Graylog might be a good solution for visualisation).
When it starts getting useful and usable is when your bugs get
reported with some rich context. That's where Sentry comes in. It's
free software developed on GitHub (although the website does not
really show that) and it is written in Python, so it was a good match
for our culture. And it is pretty awesome too.
We plug Sentry into our WSGI pipeline (thanks to cubicweb-pyramid) by installing
and configuring the Sentry cube: cubicweb-sentry. This catches
bugs with rich context and provides us with vital information about what the user
was doing when the crash occurred.
This also helps share bug information within a team.
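As an illustration of what "plugging into the WSGI pipeline" means, here is a minimal, standard-library-only sketch of a middleware that captures some request context when the application raises. It is not the cubicweb-sentry implementation: the class name and the `report` callable are hypothetical stand-ins for whatever client actually sends the event to Sentry.

```python
import sys


class ErrorReportingMiddleware:
    """Wrap a WSGI app and pass request context to `report` when it raises."""

    def __init__(self, app, report):
        self.app = app
        self.report = report  # hypothetical callable(exc_info, context)

    def __call__(self, environ, start_response):
        try:
            # Only errors raised while calling the app are caught here,
            # not errors raised later while iterating over the response.
            return self.app(environ, start_response)
        except Exception:
            context = {
                "method": environ.get("REQUEST_METHOD"),
                "path": environ.get("PATH_INFO"),
                "query": environ.get("QUERY_STRING"),
                "user": environ.get("REMOTE_USER"),
            }
            self.report(sys.exc_info(), context)
            raise  # let the server or an outer layer build the error page
```

The middleware re-raises after reporting, so the rest of the pipeline behaves exactly as it would without it.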
The sentry cube reports on errors being raised when using the web
application, but can also catch some errors when running some
maintenance or import commands (ccplugins in CubicWeb). In this
particular case, a lot of importing is being done and Sentry can
detect and help us triage the import errors with context on which
files are failing.
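The triage idea for imports can be sketched as follows: each file is imported on its own, and a failure is reported together with the name of the file before moving on to the next one. The function names and the `report` callable are hypothetical; in our setup the reporting is done by the Sentry cube, not by this code.

```python
import traceback


def import_files(paths, import_one, report):
    """Import each file; on failure, report the error with its file and go on."""
    failed = []
    for path in paths:
        try:
            import_one(path)
        except Exception as exc:
            report({
                "file": path,
                "error": repr(exc),
                "traceback": traceback.format_exc(),
            })
            failed.append(path)
    return failed
```

Returning the list of failed files makes it possible to re-run the import on just those files once the bug is fixed.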
Usage monitoring is a bit neglected for the moment. On the client side,
we can use JavaScript to monitor usage. Some basic metrics can come from
Piwik, which is usually used for audience statistics. To get more precise
statistics, we've been told Boomerang has an interesting approach,
enabling a closer look at how fast a page was displayed client side,
how much time was spent on DNS, etc.
On the client side, we're also looking at two features of the Sentry
project: the raven-js client, which reports JavaScript errors directly
from the browser to the Sentry server, and the user feedback form, which
captures some context when something goes wrong or when a user/client
wants to report that something should be changed on a given page.
To wrap up, we also often generate traffic to catch some bugs and
performance metrics automatically:
- wget --mirror $URL
- linkchecker $URL
- for search_term in $(cat corpus); do wget "$URL/$search_term"; done
- wapiti $URL --scope page
- nikto $URL
Then watch the graphs and the errors in Sentry... Fix them. Restart.
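The search term loop above can also be written as a small script, which makes it easier to percent-encode the terms and to keep track of the status codes. This is a minimal sketch using only the standard library; it assumes, like the shell loop, that a search is reachable at URL/term, and the function names are ours.

```python
import urllib.error
import urllib.parse
import urllib.request


def search_urls(base_url, terms):
    """Build one URL per non-empty search term, percent-encoding each term."""
    base = base_url.rstrip("/")
    return [base + "/" + urllib.parse.quote(term.strip(), safe="")
            for term in terms if term.strip()]


def fetch_all(urls, timeout=10):
    """Request every URL; return (url, status) pairs, status None if unreachable."""
    results = []
    for url in urls:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                results.append((url, resp.status))
        except urllib.error.HTTPError as err:
            # 4xx and 5xx responses still carry a status code worth recording.
            results.append((url, err.code))
        except OSError:
            # Covers URLError, refused connections and timeouts.
            results.append((url, None))
    return results
```

A typical run would read the corpus file line by line, pass the lines to `search_urls` and feed the result to `fetch_all`; any 500 in the output should then show up in Sentry as well.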
We've spent little time on the dashboard so far, since we're
concentrating on collecting the metrics for now. But here is a glimpse
of the "work in progress" dashboard, which combines various data sources
and various metrics on the same screen and the same time scale.
- internal health checks: we're taking a look at python-hospital,
pyramid_health and the talk "healthz: Stop reverse engineering
applications and start monitoring from the inside" (Monitorama). The
idea is to distinguish between "the app is running" and "the app is
serving its purpose".
- graph the number of Sentry errors and the number of types of errors:
the sentry API should be able to give us this information. Feed it to
Salt and Carbon.
- set up some alerting: upcoming versions of Grafana should provide this, or we could use elastalert
- set up "release version X" events in Graphite that are displayed in
Grafana, maybe with some manual command or a post-create command when
using docker-compose up?
- make it easier for devs to have this kind of setup. Using this suite
of tools in development might sometimes be overkill, but can be
useful.
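For the "release version X" events mentioned in the list above, one possible approach is to POST a small JSON document to the /events/ endpoint of graphite-web. The sketch below is untested against our own setup: the field names follow our understanding of the Graphite events API, older versions may expect "tags" as a string instead of a list, and the URL in the comment is hypothetical.

```python
import json
import urllib.request


def release_event(version, when=None):
    """Build the JSON body for a 'release version X' event."""
    event = {
        "what": "release version %s" % version,
        "tags": ["release"],  # assumption: older graphite-web wants a string
        "data": "deployed %s" % version,
    }
    if when is not None:
        event["when"] = int(when)  # Unix timestamp of the release
    return event


def post_event(graphite_url, event, timeout=5):
    """POST the event to the /events/ endpoint and return the HTTP status."""
    request = urllib.request.Request(
        graphite_url.rstrip("/") + "/events/",
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.status


# post_event("http://graphite.example.org", release_event("1.2.0"))
```

Such a call could be run manually or from a deployment script, and Grafana can then display the events as annotations on the graphs.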