## Ordering Tagged Items by Similarity

So I’ve got this web-based image gallery. The images in it are organized with simple tags (there’s a many-to-many relationship between images and tags). Every image has at least one tag, and almost all images have several tags. When a user wants to look up an image in the gallery, they select one or more tags and are shown thumbnails for all images which have at least those tags. At first the image thumbnails were simply sorted alphabetically by file name, and this worked fine when the size of the collection was modest, but that didn’t last. Soon the collection grew to the point where most queries would give forty or more thumbnails, and it became pretty challenging to find your desired image in the pile. The search was even trickier if you had only a general idea of what you were looking for and couldn’t select lots of tags to narrow things down.

To improve the usability of this thing I had to come up with a way to arrange the image thumbnails in a particular order such that images with similar characteristics were placed near each other. This would make it much, much easier to visually scan the mass of thumbnails. In this post I’ll elaborate on the technical details of the problem and how I developed the algorithm I’m currently using.

## Basic Spatial Hashing with Boost

So let’s say you’ve got some particles in 2-dimensional space.

There they are. Now let’s say you want any two particles which are very close to each other to interact. You need to know which pairs are close enough. You could take all possible pairs of particles, calculate their distances, and compare those to the minimum distance `d`. That’s fine if you have 10 particles, but if you have 5,000 that’s going to add up to over 12,000,000 pairs. That’s too many pairs.

That complexity can be alleviated if for each particle `P` we can efficiently find a subset of all particles which includes all those within `d` of `P` and which includes as few as possible which aren’t within `d`. Then we can calculate the distances from `P` to the few particles in this subset and determine which ones are actually within distance `d`.

Enter: spatial hashing. I’ll show how `boost::unordered_multimap` can be used to do this.

## 2007 wants its page files back

compcache. Use it.

The general idea is to set apart some physical memory as swap space, with compression on the pages going in. Much of what’s in RAM is very compressible, so a fast compression algorithm can do a lot. I ran a large portion of a swap file through LZO and got a ratio above 3:1. And it was fast. I’ve never seen compression that fast. The bottom line is: compressing back to RAM is orders of magnitude faster than copying to disk, in more ways than one.

I enabled this feature on an old Pentium III laptop and it works wonderfully. I found it was already present in CrunchBang 9.04, an extremely light Ubuntu variant; I just had to set `COMPCACHE_SIZE`in an init script. That machine has 256M memory and a

So I tried it on my workstation with its 2 gig memory. With `COMPCACHE_SIZE="30 %"`, if the compression ratio holds out, I’ve got more like 3.2 gig, and I’m pushing the limits of my primitive ape address space. A 64 bit machine could have, say, 8 gig RAM. A 50% compcache would be stellar: 4 gig of real RAM, and quick access to another 8 gig or more. There is definitely no need for the disk, and swapping is definitely a waste of SSD life.

• edit `/etc/initramfs-tools/initramfs.conf`
• change `COMPCACHE_SIZE`
(specific instructions are there in the file)
• run `update-initramfs -u`
• disable swap files in `/etc/fstab`
(optional probably. I think compcache gets priority.)
• reboot