Scaling Scrapy Crawlers

At dMetrics we analyze online users’ chatter. Our algorithms process large volumes of data and discover users’ decisions, decision triggers, and motivations. For example, we know which kinds of users prefer product X to product Y, why they switch between products, and when.

In this post I would like to discuss web crawling – one of the ways we collect data. Our crawlers are based on the open-source project Scrapy (version 0.16.3). Scrapy is written in Python on top of the Twisted framework, which is famous for its asynchronous programming model. This model is especially advantageous for I/O-intensive applications: I/O calls are non-blocking, so the framework gains a great deal of concurrency without creating numerous threads.
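To make the non-blocking model concrete, here is a minimal spider sketch. It uses the current Scrapy API (the post targets 0.16, whose class names differ slightly), and the spider name and seed URL are placeholders. Every request the callback yields is handed to Twisted’s reactor, so many pages download concurrently without explicit threads.

```python
import scrapy

class ChatterSpider(scrapy.Spider):
    name = "chatter"                       # hypothetical spider name
    start_urls = ["https://example.com/"]  # placeholder seed URL

    def parse(self, response):
        # Non-blocking: each yielded request is queued on the reactor
        # and fetched asynchronously while other pages are being parsed.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
```

Saved as chatter_spider.py, it can be run with `scrapy runspider chatter_spider.py`.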

What are the benefits of using estimators derived from L1 error minimization (as opposed to least squares, or L2)?

The short answer is that L1 is suitable when the errors follow a Laplace (double-exponential) distribution, while L2, or least squares, is appropriate when the errors are normally distributed. Usually L1-based estimators are robust to outliers, while L2-based estimators are very sensitive to them.

In this post I present a very informal but intuitive explanation that omits some important assumptions and details.
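As a quick illustration of the robustness claim (a toy example, not from the post): for a constant model, the L2 minimizer of the residuals is the sample mean and the L1 minimizer is the sample median, so a single outlier drags the L2 estimate much further.

```python
import numpy as np

data = np.array([1.0, 1.1, 0.9, 1.05, 0.95, 50.0])  # 50.0 is an outlier

# argmin_c sum (x_i - c)^2 is the mean: pulled hard toward the outlier.
print("L2 estimate (mean):  ", data.mean())      # ~9.17
# argmin_c sum |x_i - c| is the median: barely affected.
print("L1 estimate (median):", np.median(data))  # ~1.03
```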

Scalable image search 2: bag of words

One way to make an image index more scalable is to use the “bag of words” model. In this model, the feature space is divided into clusters (or “words”), for instance by applying the k-means clustering algorithm (or its hierarchical variant) to the catalog features. Each descriptor is then assigned to one or more clusters with the closest centers (in some metric, e.g. Euclidean). Instead of storing the whole descriptor, it is sufficient to store a cluster number. The cluster centers are stored separately and occupy constant space with respect to the number of indexed images.
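A sketch of the quantization step, using SciPy’s vector-quantization routines on made-up descriptors (the descriptor dimension, catalog size, and vocabulary size k below are illustrative, not taken from the post):

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq

rng = np.random.default_rng(0)
catalog = rng.random((5000, 128))  # fake 128-dim catalog descriptors

# Build the vocabulary: k cluster centers ("words"); they are stored once
# and their size does not grow with the number of indexed images.
k = 100
codebook, _ = kmeans(catalog, k)

# Quantize: replace each 128-float descriptor by the id of its nearest
# center (Euclidean distance), which is all the index needs to store.
word_ids, _ = vq(catalog, codebook)
print(word_ids[:10])
```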

Scalable image search 1: using the k-d tree

Image databases for certain applications may be extremely large. For example, the TinEye index contained nearly 2 billion images as of February 2011, and in September 2010 Flickr reported that it was hosting 5 billion images. The toy image-matching algorithm described in the previous post is not suitable for databases of this size. In this post I present a method that can scale to several million images on a commodity server.
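A sketch of the idea with SciPy’s k-d tree on synthetic descriptors (the sizes are illustrative; this is not the post’s implementation). In high-dimensional descriptor spaces an exact k-d-tree search degrades badly, so queries are typically made approximate:

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
catalog = rng.random((100000, 64))  # fake 64-dim descriptors, one per row

tree = cKDTree(catalog)  # built once over the whole catalog

query = rng.random(64)
# 5 nearest neighbors; eps > 0 permits an approximate answer, trading a
# little accuracy for a much faster search in high dimensions.
dists, idxs = tree.query(query, k=5, eps=0.1)
print(idxs, dists)
```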
