Popular Content Returns

Published: Sun 23 August 2015
By EWS

In Blog.

The reboot article that I wrote recently referred to a promise I made about getting some of the original TLL content back. I've decided to approach that piecemeal and with the intention of doing a little massaging in the process (something akin to Martin Fowler's "retread").

Along those lines, I've done a very basic popularity assessment. Of course, the new site no longer has the original URLs, so looking at my Apache logs there's a slew of 404s. Some of these are baddies trying to get in through nefarious means, but a lot of them are genuine attempts to get at old content.

In an attempt to keep this meta article from getting too boring, I'm going to include just a small walk-through of how I went about determining what to concentrate on first.

Preamble: Apache Log Format

My Apache Web Server (as opposed to Apache anything else, since the term Apache doesn't mean what it used to mean) is set up to log things in the Combined Log Format (the Common Log Format plus referer and user agent). Following is an example log entry:

66.249.75.120 - - [17/Aug/2015:03:43:21 +0200] "GET /robots.txt HTTP/1.1" 404 208 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
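
For reference, the fields in that entry break down as follows:

66.249.75.120                  client IP address
- -                            identd user and authenticated user (both unset here)
[17/Aug/2015:03:43:21 +0200]   timestamp of the request
"GET /robots.txt HTTP/1.1"     request line: method, path, protocol
404 208                        response status and size in bytes
"-" "Mozilla/..."              referer and user agent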

Step 1: Get 404s

We're only interested in the log entries that failed due to the resource not being found, so:

grep ' 404 ' access_log
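
As an aside, matching the literal string ' 404 ' anywhere in the line is a little loose; a 200 response that happens to be 404 bytes long would sneak in too. If that ever bites, a stricter variant keys on the status field itself. A sketch, relying on the status being the ninth whitespace-separated field in the format above:

awk '$9 == 404' access_log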

Step 2: Extract URLs

... | sed -r 's/.*"[A-Z]{3,4} ([^[:space:]]+).*/\1/'

So there are a few things of interest around the above invocation of the stream editor:

  1. Use of -r to indicate that we want to use extended regular expressions. I must admit, when I first saw that I got terribly excited, thinking that I was going to have at my disposal the full gamut of what's available with Perl REs, but digging a little deeper it looks like, in the case of sed, extended simply means that "magic" characters are treated as such without having to escape them (e.g., the open and close parentheses used to indicate groups). I've always shaken my head at regular expression syntaxes that require those characters to be escaped, but then I never really fully immersed myself in regular expressions; there's probably a heap of history.
  2. The [A-Z]{3,4} bit is there to include GETs, POSTs and HEADs in what I want. Hey, the latter two probably aren't needed in trying to get at broken links, but I'm a stickler for completeness. Those who know just enough about REs to get by won't typically use the brace repeat syntax; {3,4} means "from 3 to 4 repetitions of the preceding element", which here is a single uppercase letter. A quick sanity check on the whole extraction follows this list.
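
As promised, a quick sanity check: feeding the example entry from the preamble through the first two steps should leave just the path. This is a sketch; the entry is piped in directly rather than read from access_log:

echo '66.249.75.120 - - [17/Aug/2015:03:43:21 +0200] "GET /robots.txt HTTP/1.1" 404 208 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"' \
  | grep ' 404 ' \
  | sed -r 's/.*"[A-Z]{3,4} ([^[:space:]]+).*/\1/'

That prints /robots.txt, as hoped.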

Step 3: Remove Trailing Parameters

... | sed -r 's/\?.*//'

In this case I need to escape the ? since by default in RE parlance it isn't interpreted literally but rather means "0 or 1 of the preceding element".
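
A quick example with a hypothetical URL (the utm_source parameter is purely illustrative):

echo '/some-old-article?utm_source=feed' | sed -r 's/\?.*//'

That prints /some-old-article.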

Step 4: Count Unique URLs

... | sort | uniq -c

The combination of sort and uniq on the Unix command-line is super-powerful. In this case, the -c parameter tells uniq to prefix each distinct URL with a count of its occurrences (uniq only collapses adjacent duplicates, hence the sort first).
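
Putting it all together, one extra numeric, descending sort surfaces the most-requested missing URLs first. A sketch, assuming the log lives in access_log:

grep ' 404 ' access_log \
  | sed -r 's/.*"[A-Z]{3,4} ([^[:space:]]+).*/\1/' \
  | sed -r 's/\?.*//' \
  | sort \
  | uniq -c \
  | sort -rn \
  | head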
