URL Canonicalization, the Catholic Church and Your Web Site

Published on July 17, 2008


Even though Matt Cutts, Google’s talking head for “playing by the rules”, discussed URL canonicalization issues back in the early part of 2006, it bothers me how much subpar information about .htaccess is published on the Internet by search marketers under the guise of “simple ways to normalize your domain.”

I personally find tutorials like that thoroughly inadequate and ultimately frustrating, and usually when I’m about halfway through, I get the feeling that people like this know not a hole in the ground from a hole in their ass.

With that being said, let me walk you through my understanding of the importance of URL canonicalization to your Web site and the way in which you can fully counter such sensitive search optimization issues.

First off, don’t let the term canonicalization throw you. Think of canonicalization as “canon” - like the “canon of the Catholic Church.” By keeping this in mind we can think of “canon” as referring to the definitive or authoritative source of something. Stretching our analogy wider, the canon of the Catholic Church consists of only the books that represent the definitive bounds of said church’s philosophy - nothing more, nothing less. So the Gospels of Matthew and Mark? Part of the canon. The Gospels of Thomas and Judas? Absolutely not.

Think of your domain in the same way.

http://yoursite.com and http://www.yoursite.com - just like the Gospels of, say, Mark and Thomas - are different, yet similar. These URLs are viewed as totally different by the Web server, as they both have the potential to house different documents, yet on most Web site installations they point to the same default home page. In that latter, very common situation, one or the other needs to be considered the definitive - or canonical - version.

So every single URL accessed from your domain that potentially can return separate yet identical versions of a page should (rather, needs to) be canonicalized.

Typically, on any domain where canonicalization has not yet been recognized as a problem, there are a few default issues you should learn to deal with. Usually,

http://yoursite.com
http://www.yoursite.com
http://www.yoursite.com/index.php
http://yoursite.com/index.php

will all reference the same document.

The easiest and preferred way of cleaning this mess up is via Web server redirection. If you are hosted on an Apache Web server, you can set up an .htaccess file in your root directory to fix this, and every subdirectory beneath it will inherit these conditions for domain redirection.

If you are on IIS, switch your hosting provider. Well, not really. Just grab some software that enables you to use mod_rewrite functions like Apache.

Without further ado, below is the main .htaccess that I use for most of my domain installations. It solves the canonicalization issues listed above plus a few others, such as trailing slashes - but we won’t cover all that here.

ErrorDocument 404 /404.php
Options +FollowSymLinks
<IfModule mod_rewrite.c>
RewriteEngine on
RewriteCond %{HTTP_HOST} ^flexiblephilosophy\.com [NC]
RewriteRule ^(.*)$ http://www.flexiblephilosophy.com/$1 [L,R=301]
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /.*index\.php\ HTTP/
RewriteRule ^(.*)index\.php$ /$1 [R=301,L]
</IfModule>

Ok, so, lines one and two - not strictly part of the canonicalization fix - but I like to have them in there. In exchange for the real meat beneath them, I’ll leave researching those first two lines (and why they are a good idea to have) to you as a fair trade. Lines 3 through the end do the bulk of what was explained above. We’ll go through the rest sequentially.

Lines 5 and 6 take a request for a non-www URL and redirect it to the www version of that document. Swap the host names in these two lines if you would rather use the non-www version of your domain as the official - ahem, canonical - version.
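For what it’s worth, that swap might look something like the sketch below. It assumes the placeholder domain yoursite.com used earlier in this post (substitute your own), and the two rewrite lines would stand in for lines 5 and 6 of the file above. Treat it as a starting point rather than a drop-in config.

```apache
<IfModule mod_rewrite.c>
RewriteEngine on
# Assumption: yoursite.com is a placeholder; replace it with your own domain.
# Match requests whose Host header starts with the www form (case-insensitive).
RewriteCond %{HTTP_HOST} ^www\.yoursite\.com [NC]
# Permanently (301) redirect to the same path on the non-www host.
RewriteRule ^(.*)$ http://yoursite.com/$1 [L,R=301]
</IfModule>
```

Whichever direction you pick, use only one of the two host rules. Keeping both would bounce visitors back and forth between the www and non-www versions.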

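Lines 7 and 8 redirect any request that explicitly asks for ‘index.php’ to the bare directory URL, so ‘/index.php’ becomes ‘/’. The condition checks the raw request line the browser sent, so the rule fires only when a visitor actually typed or followed a link to ‘index.php’, not when Apache serves that file internally for a directory request; that is what keeps the redirect from looping. Replace ‘index’ with whichever file name you use as your default document and replace ‘.php’ with whichever file extension you use as well.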
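As an illustration of that replacement, here is a rough sketch for a site whose default document is index.html. The file name is only an assumption for the example, and the two rewrite lines would take the place of lines 7 and 8 of the file above.

```apache
<IfModule mod_rewrite.c>
RewriteEngine on
# Assumption: the default document is index.html; adjust name and extension to taste.
# Only act when the original request line explicitly asked for index.html.
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /.*index\.html\ HTTP/
# Strip the file name and permanently (301) redirect to the directory URL.
RewriteRule ^(.*)index\.html$ /$1 [R=301,L]
</IfModule>
```

With this in place, a request for /index.html should land on /, and one for /blog/index.html on /blog/.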

Note that when setting up your .htaccess, be sure to use 301 as the HTTP response, as that is the code for a permanent redirect. Yes, we are striving for permanency here. And just like the Catholic Church, put your stamp down and declare to the Google Gods that your canonical URL is and will always be listed as such and, henceforth, will not change. That is, until you modify your .htaccess file. Or until Christ returns. Or both.

In fact, … no in fact. That probably crossed the line.
