Popular Content Returns

Published: Sun 23 August 2015
By EWS

In Blog.

The reboot article that I wrote recently referred to a promise I made about getting some of the original TLL content back. I've decided to approach that piecemeal and with the intention of doing a little massaging in the process (something akin to Martin Fowler's "retread").

Along those lines, I've done a very basic popularity assessment. Of course, the new site no longer has the original URLs, so looking at my Apache logs there's a slew of 404s. Some of these are baddies trying to get in through nefarious means, but a lot of them are genuine attempts to get at old content.

In an attempt to keep this meta article from getting too boring, I'm going to include just a small walk-through of how I went about determining what to concentrate on first.

Preamble: Apache Log Format

My Apache Web Server (as opposed to Apache anything else, since the term Apache doesn't mean what it used to mean) is set up to log things in the Combined Log Format (the Common Log Format plus referer and user agent). Following is an example log entry:

66.249.75.120 - - [17/Aug/2015:03:43:21 +0200] "GET /robots.txt HTTP/1.1" 404 208 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
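
For reference, the fields in that entry break down as follows:

66.249.75.120                  client IP address
- -                            identd user and authenticated user (both unset here)
[17/Aug/2015:03:43:21 +0200]   timestamp of the request
"GET /robots.txt HTTP/1.1"     request line: method, path, protocol
404 208                        response status and size in bytes
"-" "Mozilla/..."              referer and user agent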

Step 1: Get 404s

We're only interested in the log entries that failed due to the resource not being found, so:

grep ' 404 ' access_log
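
As an aside, matching the literal string ' 404 ' anywhere in the line is a little loose; a 200 response that happens to be 404 bytes long would sneak in too. If that ever bites, a stricter variant keys on the status field itself. A sketch, relying on the status being the ninth whitespace-separated field in the format above:

awk '$9 == 404' access_log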

Step 2: Extract URLs

... | sed -r 's/.*"[A-Z]{3,4} ([^[:space:]]+).*/\1/'

So there are a few things of interest around the above invocation of the stream editor:

  1. Use of -r to indicate that we want to use extended regular expressions. I must admit, when I first saw that I got terribly excited, thinking that I was going to have at my disposal the full gamut of what's available with Perl REs, but digging a little deeper it looks like, in the case of sed, extended simply means that "magic" characters are treated as such without having to escape them (e.g., the open and close parentheses used to indicate groups). I've always shaken my head at regular expression syntaxes that require those characters to be escaped, but then I never really fully immersed myself in regular expressions; there's probably a heap of history.
  2. The [A-Z]{3,4} bit is there to include GETs, POSTs and HEADs in what I want. Hey, the latter two probably aren't needed in trying to get at broken links, but I'm a stickler for completeness. Those who know just enough about REs to get by won't typically use the brace repeat syntax; {3,4} means "from 3 to 4 repetitions of the preceding element", which here is a single uppercase letter. A quick sanity check on the whole extraction follows this list.
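
As promised, a quick sanity check: feeding the example entry from the preamble through the first two steps should leave just the path. This is a sketch; the entry is piped in directly rather than read from access_log:

echo '66.249.75.120 - - [17/Aug/2015:03:43:21 +0200] "GET /robots.txt HTTP/1.1" 404 208 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"' \
  | grep ' 404 ' \
  | sed -r 's/.*"[A-Z]{3,4} ([^[:space:]]+).*/\1/'

That prints /robots.txt, as hoped.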

Step 3: Remove Trailing Parameters

... | sed -r 's/\?.*//'

In this case I need to escape the ? since by default in RE parlance it isn't interpreted literally but rather means "0 or 1 of the preceding element".
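
A quick example with a hypothetical URL (the utm_source parameter is purely illustrative):

echo '/some-old-article?utm_source=feed' | sed -r 's/\?.*//'

That prints /some-old-article.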

Step 4: Count Unique URLs

... | sort | uniq -c

The combination of sort and uniq on the Unix command-line is super-powerful. In this case, the -c parameter tells uniq to prefix each distinct URL with a count of its occurrences (uniq only collapses adjacent duplicates, hence the sort first).
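
Putting it all together, one extra numeric, descending sort surfaces the most-requested missing URLs first. A sketch, assuming the log lives in access_log:

grep ' 404 ' access_log \
  | sed -r 's/.*"[A-Z]{3,4} ([^[:space:]]+).*/\1/' \
  | sed -r 's/\?.*//' \
  | sort \
  | uniq -c \
  | sort -rn \
  | head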
