Wherein I attend Confoo

At the end of February, Montreal played host to our very own international web technology conference. Confoo was my very first web conference, so what better way to kick-start this blog than with a recap of my experience?

Before the conference

The ticket for Confoo, priced at a hefty $845, was graciously provided by the National Film Board of Canada, which sends its web developers to all local conferences and some remote ones too. They’re very awesome like that. The preparation, though, wasn’t a pleasant experience, due mainly to the sad UX on confoo.ca. For one, you can’t select multiple topics to highlight the pertinent talks on their schedule page, and you need to be logged in to be able to star them. Speaking of which, every time I tried to log in on my iPhone, their server panicked because “CSRF attack detected”. I am 1337 hax0r, ph33r me. Good thing they provided those large, ad-filled dead-tree booklets to carry around.

Seriously though, none of that really mattered because I was only there to be dazzled with fresh knowledge. A word of caution first: if you know anything about a topic, don’t attend a Confoo talk on that subject. They purposefully select their presentations to stay at a general level that’s easy to grasp. So instead go learn exciting new things, like what Perl’s been quietly up to these last few years, linked data, or crazy caching voodoo using Varnish and PostgreSQL.

Without further ado, here are the talks I was most excited about.

Exobrain

I kicked off the conference with Paul Fenwick’s guided tour of his open source If-This-Then-That clone in Perl. It lacks the usability of IFTTT but more than makes up for it by giving you the ability to write your own agents, thus circumventing the petty political posturing so common among Web 2.0 giants these days. And since all the components talk to each other in JSON over ZMQ, you could theoretically plug in agents written in any language. Paul demonstrated how tweeting at him with #todo adds an item to his list and pushes it to his watch. I of course (ab)used the system to volunteer a very helpful recommendation. He also uses it to get his Beeminder and his HabitRPG to talk to each other, so that’s nice too.
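To make the “any language can join the bus” point concrete, here’s a minimal sketch of what a foreign agent might look like in Python with pyzmq. The bus address, socket type and event shape are all assumptions on my part, not Exobrain’s actual wire protocol.

```python
import zmq  # pip install pyzmq

BUS_ADDR = "tcp://127.0.0.1:5678"   # hypothetical bus endpoint

def main():
    ctx = zmq.Context.instance()
    sock = ctx.socket(zmq.SUB)
    sock.connect(BUS_ADDR)
    sock.setsockopt_string(zmq.SUBSCRIBE, "")   # subscribe to everything on the bus

    while True:
        event = sock.recv_json()                # assumes one JSON object per message
        if event.get("type") == "todo":         # hypothetical event shape
            print("New todo:", event.get("text"))

if __name__ == "__main__":
    main()
```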

During the unconference period on the last day, a few of us had the honour of booting Exobrain up for the (allegedly) very first time on machines that didn’t belong to Paul. The procedure isn’t for the faint of heart, and if you’re not already using your environment for routine Perl development, installing all the dependencies takes nigh on half an hour. But when it was done, it was glorious, and now I get to rekindle my Perl romance and explore its renascent ecosystem. Link dump from my notes: HabitRPG, cpanminus for a better CPAN CLI, the Ubic service manager, TTYtter, the Vimperator Firefox plugin, and Zotero, the personal research assistant.

Paul blogs at pjf.id.au and tweets as @pjf, and you can find Exobrain on GitHub and on Google Groups.

Web scraping (for fun and profit)

I absolutely love web scraping; it makes me feel like some kind of data liberator. It’s an intricate labour of hack and craft, so whenever you get to hear a fellow scraper talk, it’s like peeking inside an artisan’s secret toolbox. Ben Lamb definitely delivered on that front, sharing war stories of crawling retail websites for price watching and venue sites for event info. Interesting tools in his kit: a proxy like Fiddler for spying on requests made by XHR or Flash, PyParsing for extracting data from natural language (in his case, dates) and RabbitMQ for orchestrating spiders.
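For a taste of the PyParsing approach, here’s a toy grammar for day-month-year dates buried in free text. Ben’s real grammar surely covers far more flavours; this is only meant to show the shape of the technique.

```python
from pyparsing import Optional, Word, nums, oneOf

MONTHS = ("January February March April May June July "
          "August September October November December").split()

day = Word(nums, max=2)("day")
month = oneOf(MONTHS, caseless=True)("month")
year = Word(nums, exact=4)("year")

# Matches "7 March 2014" as well as "8 march" with the year left out.
date_expr = day + month + Optional(year)

text = "Doors at 7 March 2014, second show added on 8 march"
for match in date_expr.searchString(text):
    print(match["day"], match["month"], match.get("year", "????"))
```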

He also had some great advice to give. Most importantly, prepare a generic base crawler that provides you with the tools you’ll want in all your spiders. Think about graceful degradation (because human writing is anything but clean and structured), logging with plenty of diagnostic info (because human writing…), data normalization (because human writing…), data replay (saving the scraper state so it can be resumed once the issue is fixed. You know, because human writing…) and orchestration controls so you can play with the spiders. I haven’t yet looked into Scrapy, but I believe it provides at least some of this. Then you want a good test suite for the various bits of parsing you do. Whenever you come across a new bit of data that you want from the wild wild web, you can add it to the suite and ensure your code can handle it (e.g. the various flavours that dates may come in). Finally, be nice to the sites you’re scraping and take a 5-10 second breath between requests, if for no other reason than to avoid detection.
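Here’s my own bare-bones reading of the “generic base crawler” advice, assuming requests as the HTTP client: contextual logging, a normalization hook to override per site, and the polite pause baked in.

```python
import logging
import random
import time

import requests   # assumed HTTP client

class BaseSpider:
    """The shared toolbox every concrete spider inherits."""

    delay_range = (5, 10)   # seconds between requests, per Ben's advice

    def __init__(self, name):
        self.log = logging.getLogger(name)

    def fetch(self, url):
        time.sleep(random.uniform(*self.delay_range))   # be nice, stay unnoticed
        self.log.info("fetching %s", url)
        resp = requests.get(url, timeout=30)
        resp.raise_for_status()
        return resp.text

    def normalize(self, raw):
        """Cleanup hook for messy human-written text; override per site."""
        return " ".join(raw.split())

    def parse(self, html):
        raise NotImplementedError   # each concrete spider brings its own parsing
```

The test suite Ben recommends then boils down to a table of raw snippets from the wild and the values you expect your parsing helpers to produce from them.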

Ben is on Twitter as @zurgy and his slides are here. And if scraping is your vice, you might also be interested in this Python web scraping resource.

Linked data

I’d only looked at linked data tangentially before, when JSON-LD popped up while I was investigating ways to add hypermedia affordances to JSON. Sarven Capadisli presented an excellent primer on linked data concepts and tools on the first day, and a case study on using linked statistical data on the second. We started with Tim Berners-Lee’s four principles driving linked data. We moved on to the data model used for concretizing these principles, RDF, and to its basic unit of structure, the triple <subject, predicate, object>. These triples can be serialized in a host of formats, of which we saw RDFa, Turtle, N-Triples and RDF/XML. To ensure RDF actually links things together, the triple items are IRIs (though objects can also be plain literals, like names or dates), and lists of IRIs for common concepts have been compiled as vocabularies (or ontologies), e.g. FOAF for social concepts, Dublin Core for web resources and similar real-world artifacts, and schema.org.

We then looked at some ways you can interact with RDF data. rapper is a CLI tool for parsing RDF (in RDF/XML, N-Triples, Turtle or RSS/Atom notations) based on the Redland RDF libraries. SPARQL is a query language and protocol for interacting with RDF graphs on the web or in an RDF store (i.e. a database for triples).
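To make the triple and SPARQL ideas concrete, here’s a small sketch using Python’s rdflib (my own choice of library, not one from the talk): parse a Turtle snippet, then query it. The IRIs are made up for the example.

```python
from rdflib import Graph  # pip install rdflib

turtle = """
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

<http://example.org/people/sarven> foaf:name "Sarven Capadisli" ;
    foaf:knows <http://example.org/people/timbl> .
"""

g = Graph()
g.parse(data=turtle, format="turtle")

query = """
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name WHERE { ?person foaf:name ?name . }
"""

for row in g.query(query):
    print(row.name)   # -> Sarven Capadisli
```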

Illustrating the power of linked data, Sarven built a web interface for performing statistical analysis on open statistical data sets from large organizations like the World Bank and Transparency International. You can learn more about his work on statistical linked dataspaces at 270a.info.

JS testing

Jordan @jakerella Kasper warned against pretending that eyeballing your site in a browser can substitute for disciplined testing. We’re already passing our Angular code through the Jasmine wringer at the NFB, so not much here was new to us. But if you’re still writing cowboy JavaScript, Jordan has a few useful tips: name your callback functions so you get useful information in stack traces, don’t couple your code to DOM queries but pass already-selected elements to your functions instead, and don’t couple your function internals to the server by dropping some XHR in the middle of it all. Here are the slides, and notice the beautiful Blazon presentation tool he’s been building at appendTo.
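The decoupling tip is language-agnostic, so here’s the same idea sketched in Python rather than JavaScript: inject whatever talks to the server instead of burying the XHR (or HTTP call) in the middle of your logic, and the test suddenly needs no server and no mocking framework. The endpoint and function names are made up for the example.

```python
def top_commenters(fetch_json, n=3):
    """Pure logic: takes a callable returning parsed JSON, not a URL it fetches itself."""
    comments = fetch_json("/api/comments")   # hypothetical endpoint
    counts = {}
    for comment in comments:
        counts[comment["author"]] = counts.get(comment["author"], 0) + 1
    return sorted(counts, key=counts.get, reverse=True)[:n]

def test_top_commenters():
    stub = lambda _url: [{"author": "ana"}, {"author": "bob"}, {"author": "ana"}]
    assert top_commenters(stub, n=1) == ["ana"]
```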

Anyway, turns out Jordan’s a great guy, very fun to have beers with, and the only other human being I know of to have a Matrix skull plug like mine. Represent.

Integrated cache invalidation

This one absolutely blew my mind: Magnus Hagander of the PostgreSQL team showed us how grown-ups invalidate caches. As an example setting, consider an e-commerce website where products appear on many pages: the details page, various category pages, sales pages, etc. When Django renders a page, it sends a custom header listing the ids of all the products present on it. Then, in Postgres, we write an update trigger on the products table that notifies Varnish of the ids that changed. We don’t want to send HTTP requests from inside a trigger though, so Magnus uses the pgq messaging queue from PL/pgSQL to hand that work off asynchronously. Finally, in Varnish, we BAN any cached object whose custom header contains one of those ids.
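Here’s a rough Python sketch of the two ends of that pipeline that don’t live in PL/pgSQL. The header names, template and Varnish address are my own guesses at the setup, not Magnus’s actual code; the trigger-plus-pgq piece stays in the database, and a matching ban rule has to exist in the VCL.

```python
import requests
from django.http import HttpResponse
from django.template.loader import render_to_string

# Django side: tag every rendered page with the ids of the products on it.
def product_page(request, product_id):
    html = render_to_string("product.html", {"product_id": product_id})  # hypothetical template
    response = HttpResponse(html)
    response["X-Product-Ids"] = " %s " % product_id   # padded so regex matching stays simple
    return response

# Worker side: given a "product changed" event popped off the queue,
# ask Varnish to ban every cached object that mentions this id.
VARNISH_URL = "http://127.0.0.1:6081/"   # hypothetical Varnish address

def invalidate(product_id):
    # The VCL would turn this into something like ban("obj.http.X-Product-Ids ~ ...").
    requests.request("BAN", VARNISH_URL,
                     headers={"X-Ban-Product-Id": str(product_id)})
```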

TL;DR: the database is an integration point in the web architecture; anything that wants to change data must talk to it. We can use this to our advantage for orchestrating cache invalidation. You can find the slides for this talk, as well as many other interesting ones, on Magnus’ talks page.

Small data machine learning

Andrei Zmievski has a very interesting problem: he is @a on Twitter. To get an idea of the kind of havoc this causes in his mentions stream, have a look at this Twitter search. This was the story of how he trained a logistic regression classifier to spot the legit tweets actually referring to him. He wrote a series of functions that extracted independent and discriminant features from a corpus of tweets, and stored the results for training the model. Unfortunately, due to all the noise, only 2% of the corpus was labeled as good, which posed a problem for training. Neither trivial over- nor under-sampling worked, so he used a generative model to perform synthetic over-sampling. When all that was fixed, he performed a typical gradient descent with cross-validation on the collected hand-labeled feature vectors, and now he can enjoy Twitter the same way we do.
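Andrei’s implementation was in PHP, but the pipeline translates to a few lines of scikit-learn; in this sketch SMOTE stands in for his generative over-sampler, and the feature files are hypothetical.

```python
import numpy as np
from imblearn.over_sampling import SMOTE          # pip install imbalanced-learn
from imblearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# X: hand-engineered feature vectors, one row per tweet; y: 1 if the tweet
# really refers to Andrei, 0 otherwise (roughly 2% positives).
X = np.load("tweet_features.npy")   # hypothetical saved features
y = np.load("tweet_labels.npy")

# Over-sample inside each fold so synthetic points never leak into validation.
model = Pipeline([
    ("balance", SMOTE(random_state=0)),
    ("classify", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(model, X, y, cv=5, scoring="f1")
print("cross-validated F1: %.3f" % scores.mean())

model.fit(X, y)
# model.predict(new_tweet_features) -> keep the mention or drop it
```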

Useful references from the talk: the FastML blog and Ian Barber’s PHP Information Retrieval blog.

Miscellanea

There were other useful talks and I learned a great deal about Varnish, Redis, writing web apps that use offline mode, and managing distributed teams of developers. They are all worthy of mention, but the ones I wrote about stood out as the most interesting to me.

Next up: PyCon, coming to Montréal in 2014 and 2015.