Unifying URIs from multiple domains with mod_rewrite
For a change, something totally non-Windows here! This web site is served at www.heikniemi.fi as well as www.heikniemi.net. There are multiple reasons for this, but until this week, it has sucked quite a bit from a search engine perspective. It no longer does.
Every page has had at least two URIs. This means that the links pointing to any given page have been split between them: some people have linked to .fi, some to .net. Canonicalizing these URIs to a single host name should combine the visibility (also known as Google Juice), giving pages a far better chance of gaining PageRank and thus rising in the Google search results.
My goal was simple: Everything in this blog (/hardcoded/*) was to be served under www.heikniemi.net – because that has traditionally been the URI I have given for this blog. For all the rest of the content, it made more sense to work under the .fi suffix, given that the content is in Finnish as well.
On an Apache server with Linux, this is very straightforward to do (supposing the mod_rewrite Apache module has been installed, but that’s usually the case). Just throw in an .htaccess file in the web site’s root directory with this:
    RewriteEngine on
    RewriteCond %{HTTP_HOST} !^www\.heikniemi\.fi$ [NC]
    RewriteRule ^(.*)$ http://www.heikniemi.fi/$1 [R=301,L]
Basically, the RewriteCond just says “Apply the following rule whenever the host name is NOT (note the exclamation mark) www.heikniemi.fi”. The NC flag makes the comparison case-insensitive.
The RewriteRule’s first argument, a regular expression, is matched against the path part of the request URI. (Strictly speaking, mod_rewrite tests this pattern first and checks the condition only if the pattern matches, but the net effect is the same: both must hold.) For “http://www.heikniemi.net/foobar/”, the path part would be “foobar/”. The .* in parentheses captures all of the path, and the captured group is reused as $1 in the second argument to RewriteRule, the URI to which the user is redirected.
The R=301 flag sets the HTTP status code to 301, a permanent redirect. This is essential to gather the Google Juice. Without R=301, the status code 302 (a temporary redirect) is used, and this wouldn’t unify the page ranks between the two sites. The L flag just stops the processing of further rules. It is not strictly necessary here, since no further rules follow, but it perhaps clarifies the meaning of the rules.
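To make the walkthrough concrete, here is the same rule set once more with comments, followed by a trace of the example request. The trace is derived by hand from the description above rather than captured from a server log, so treat it as a sketch. The mixed-case host name in the second trace is just an illustrative input.

```apache
RewriteEngine on

# Redirect only when the Host header is something other than
# www.heikniemi.fi. "!" negates the match, [NC] ignores case.
RewriteCond %{HTTP_HOST} !^www\.heikniemi\.fi$ [NC]

# ^(.*)$ captures the whole path (as seen from this directory) into $1.
# R=301 sends a permanent redirect, L stops further rule processing.
RewriteRule ^(.*)$ http://www.heikniemi.fi/$1 [R=301,L]

# Trace for a request to http://www.heikniemi.net/foobar/
#   HTTP_HOST = www.heikniemi.net  -> condition holds
#   path      = foobar/            -> $1 = "foobar/"
#   response  = 301 with Location: http://www.heikniemi.fi/foobar/
#
# Trace for a request to http://WWW.Heikniemi.FI/foobar/
#   HTTP_HOST matches the pattern once case is ignored
#   -> condition fails, no redirect, the page is served as usual
```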
An exception makes the rule
Then I wanted to make /hardcoded always be served under www.heikniemi.net. There are two approaches to this: Either I could add another RewriteCond/RewriteRule pair to the root .htaccess file, providing centralized control, or I could add a new .htaccess file under the specific directory, just governing files thereunder.
The choice between these two is mostly a matter of preference. I preferred modularity (a separate .htaccess is easily deployable with the content) over central control, but your mileage may vary. So I pushed the following into hardcoded/.htaccess.
    RewriteCond %{HTTP_HOST} !^www\.heikniemi\.net$ [NC]
    RewriteRule ^(.*)$ http://www.heikniemi.net/hardcoded/$1 [R=301,L]
Overall, this is just the previous example with a few things reversed. However, there is one important thing to note: ^(.*)$ gets captured, but then $1 is appended after /hardcoded/. Because this is /hardcoded/.htaccess, paths passed to its RewriteRules are relative to that directory. For example, when I request http://www.heikniemi.fi/hardcoded/2010/05/, that RewriteRule gets to parse a path of “2010/05/”, not “hardcoded/2010/05/” as it would if the same rule were pasted into the .htaccess file in the web site’s root directory.
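For comparison, the centralized alternative (both rules in the root .htaccess) might look roughly like the following. This is a sketch of the approach not taken, so it is untested here. The explicit hardcoded prefix in the first pattern and the extra RewriteCond on %{REQUEST_URI} are assumptions of the sketch, added to keep the two rules from stepping on each other.

```apache
RewriteEngine on

# Blog: anything under /hardcoded belongs on the .net host.
# In the root .htaccess the pattern sees "hardcoded/2010/05/", so the
# directory name has to be matched explicitly.
RewriteCond %{HTTP_HOST} !^www\.heikniemi\.net$ [NC]
RewriteRule ^hardcoded(/.*)?$ http://www.heikniemi.net/hardcoded$1 [R=301,L]

# Everything else belongs on the .fi host. The second condition
# (an assumption of this sketch) keeps this rule away from /hardcoded,
# which would otherwise bounce between the two hosts forever.
RewriteCond %{HTTP_HOST} !^www\.heikniemi\.fi$ [NC]
RewriteCond %{REQUEST_URI} !^/hardcoded(/|$)
RewriteRule ^(.*)$ http://www.heikniemi.fi/$1 [R=301,L]
```

The two-file setup needs no such exclusion because, by default, the rules in a subdirectory's .htaccess replace the parent directory's rules instead of being added to them (unless RewriteOptions Inherit is set). Depending on the Apache version and configuration, the subdirectory file may also need its own RewriteEngine on line.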
The two mandatory nitpicks
Something to note: I didn’t really double my Google Juice. The vast majority of documents were practically always referenced by a single domain name. This was a consequence of consistent linking, but there was nothing preventing people from linking to either domain. This is really a way to be sure that you’re getting all the benefit you can.
Another point: Really, more than two host names were unified. The site was always also accessible as heikniemi.fi and heikniemi.net, without the www prefix. The same fix took care of it all.
May 21, 2010
· Jouni Heikniemi · 2 Comments
Tags: Apache · Posted in: Web
2 Responses
Perttu Tolvanen - May 21, 2010
Nice story. This problem is one of the most common problems today with large brand sites. It could actually be interesting to build a crawler that would identify domains that have this problem since there are a lot of brands that have this problem and they don't even know it. In many cases the marketing department is doing their projects and actually making things better, but for some reason the new domains are activated the wrong way. For example http://www.operafin.fi and http://www.ooppera.fi are mirrors (they are in the process of fixing the issue, but similar cases are very common).
I don't know the situation today, but at least previously it was possible for Google to drop the other domain from the index completely if it had the same content. Today both sites are usually in the index, and either both show up in results or one has a clearly better rank (when Google has clearly been able to identify which is the main domain).
argent - November 7, 2016
This paragraph presents clear idea for the new users of blogging,
that really how to do blogging and site-building.