Create an XML Sitemap on Heroku via Amazon S3

I’ve started hosting a few simple Rails applications on Heroku and so far, I’m really pleased with their hosting service. This post is less about Heroku than about how to serve an XML sitemap for your application. Heroku apps don’t give you file system access from within your application, so you’re forced to host your sitemap on an external service, like Amazon S3. There’s a great plugin called sitemap_generator that lets you generate a sitemap and upload it to your Amazon S3 account using CarrierWave and Fog.

Even though sitemap_generator will ping all of the major search engines when you build your sitemap (which you should rebuild regularly with a rake task), you will still want to configure the sitemap in Google Webmaster Tools. Unfortunately, Webmaster Tools will only accept a sitemap served from your own domain, not from another host. What can we do to fix that?

Well, the easiest solution I came up with was to create a controller to handle requests for your sitemap and redirect them to the location of your sitemap on S3 (via CloudFront, obviously). So, let’s get to the code. Create a file called sitemap_controller.rb and paste this in:

class SitemapController < ApplicationController
  def index
    redirect_to SITEMAP_PATH
  end
end

This will redirect a call to the index action of this controller to the value of SITEMAP_PATH. But what is SITEMAP_PATH? Well, in my case, my application relies heavily on a custom Rails engine where all of my controllers and models are defined, so I figured it would be nice to configure the location of the sitemap on a per-application basis. In my actual Rails application, I created an initializer and set the value of SITEMAP_PATH there. Put this in sitemap.rb in config/initializers:

SITEMAP_PATH = "http://somepathtoyoursitemap.com/"

That's the actual location of your sitemap on S3 (again, most likely via CloudFront). Now all that's left is to wire up a Rails route to actually respond to a request for sitemap.xml. That's done easily enough with the following:

match "/sitemap.xml", :controller => "sitemap", :action => "index"

That's it! If your app is already running, restart it so the initializer loads, then request /sitemap.xml to access your sitemap.
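
To make the moving parts concrete, here is a minimal sketch of what the controller and route amount to, written as a plain Rack-style callable so it runs without Rails. The name SitemapRedirect is hypothetical, and SITEMAP_PATH here is only a stand-in for the value set in the initializer. It assumes the default redirect_to behaviour, which is a 302 response with a Location header.

```ruby
# Stand-in for the value set in config/initializers/sitemap.rb.
SITEMAP_PATH = "http://somepathtoyoursitemap.com/"

# A Rack-style app is any callable that returns [status, headers, body].
# SitemapRedirect is a hypothetical name used only for this illustration.
SitemapRedirect = lambda do |env|
  if env["PATH_INFO"] == "/sitemap.xml"
    # Same effect as redirect_to in the controller: 302 plus a Location header.
    [302, { "Location" => SITEMAP_PATH }, []]
  else
    [404, { "Content-Type" => "text/plain" }, ["Not Found"]]
  end
end

status, headers, _body = SitemapRedirect.call("PATH_INFO" => "/sitemap.xml")
puts "#{status} #{headers['Location']}"
```

Webmaster Tools only ever sees the /sitemap.xml URL on your own domain; the redirect is what sends the crawler on to the externally hosted file.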


Thinking Sphinx – Indexing Models Defined in a Rails Engine

I’m back in the Ruby on Rails game after a long hiatus and my, things have changed a lot. And they’ve changed for the better. The application I’m working on, like many other web applications, requires an internal search feature. Sphinx was very reliable for me in the past; however, it seems that ultrasphinx and acts_as_sphinx have been replaced with a better Rails plugin, Thinking Sphinx. Getting started was super easy. After installing Sphinx and adding the Thinking Sphinx gem (version 2.0.11) to my application’s Gemfile, I was ready to go.

But I ran into a problem. The platform I’m building leverages a Rails Engine to implement most of the application’s functionality. Thinking Sphinx wasn’t setting up any models to index, even though I had defined them. It turns out that if you don’t define your models in a path that Thinking Sphinx looks at by default, i.e. app/models, then you’re in trouble. After a bunch of searching, though, I found the solution to my problem. Create an initializer called sphinx.rb in your application’s config/initializers directory. To it, add:

module ThinkingSphinx
  class Context
    def load_models
      MyModule::MyClass
    end
  end
end

I defined my models in a sub-folder of app/models and put them in a module, hence the MyModule::MyClass. This explicitly tells Thinking Sphinx which models to load. Running rake thinking_sphinx:config after that change set up the Sphinx config file as I expected it would. Then I ran rake thinking_sphinx:index and I was off and running. Jumping into the Rails console, I was able to verify that searching worked as expected. Hope that helps!
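
The initializer relies on a general Ruby feature: you can reopen an existing class and replace one of its methods. The sketch below shows that technique in isolation. SearchLib::Context is a hypothetical stand-in, not the real ThinkingSphinx::Context, and returning an array is my own simplification so the result is easy to inspect. The initializer above only references the constant, which presumably is enough to force the model to load.

```ruby
# Hypothetical stand-in for a library class (NOT the real ThinkingSphinx::Context).
module SearchLib
  class Context
    # Default behaviour in this sketch: no models are discovered.
    def load_models
      []
    end
  end
end

# Stand-in for a model that lives inside an engine, namespaced in a module.
module MyModule
  class MyClass; end
end

# The "initializer": reopen the class and replace load_models so that it
# names the engine model explicitly.
module SearchLib
  class Context
    def load_models
      [MyModule::MyClass]
    end
  end
end

p SearchLib::Context.new.load_models
```

Because the second definition is evaluated after the library's own, it wins. The trade-off is that every model you want indexed has to be listed by hand.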


SqlCacheDependency and Query Notifications

There’s a lot of scattered information out there on how to configure ASP.NET applications to leverage Microsoft SQL Server’s Query Notification and Service Broker services for caching. The two best step-by-step tutorials I’ve found online are:

http://www.simple-talk.com/sql/t-sql-programming/using-and-monitoring-sql-2005-query-notification/

http://dimarzionist.wordpress.com/2009/04/01/how-to-make-sql-server-notifications-work/

Both of those articles should get you started. However, after I started leveraging Query Notifications for caching in a few of my sites, I ran into issues with our application crashing after a period of time. The biggest issue was that I would see the following exception in our logs:

When using SqlDependency without providing an options value, SqlDependency.Start() 
must be called prior to execution of a command added to the SqlDependency instance.

Never did quite get a handle on what was going on here. I did figure out though that I could always find this in my Application log around the time that exception was thrown:

The query notification dialog on conversation handle '{A1FB449B-DEB3-E011-B6D2-002590198D55}.' closed due to the following error: '-8470Remote service has been dropped.'.

So, does this mean that I called SqlDependency.Stop() and now queued notifications aren’t going to be delivered? Are these critical errors that keep the application from coming back? I’ve read that a lot of the Query Notification messages you see in the log aren’t critical errors and can be ignored. I can’t ignore how closely this error lines up with the exception above, though.

Anyway, I finally decided to pull this stuff out of our application until I get a better handle on what’s going on. The last straw was that I was trying to sync some database changes during a maintenance period and I couldn’t get them to sync because of a bunch of these SQL Query Notification issues. As I write this, I can’t even get my database back online as I’m waiting for ALTER DATABASE SET SINGLE_USER to complete (approaching 3 hours!!!). As I keep waiting, my Application log keeps filling up with the following Query Notification messages:

Query notification delivery could not send message on dialog '{FE161F6A-D6B3-E011-B6D2-002590198D55}.'. Delivery failed for notification '85addbaa-ce66-431d-870f-d91580a7480a;d527d584-9fd4-4b13-85bc-87cb6c2e166f' because of the following error in service broker: 'The conversation handle "FE161F6A-D6B3-E011-B6D2-002590198D55" is not found.'.

For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.

I got a response to a post I made on the ASP.NET Forum suggesting that, with all the cached items in the system, SQL Server simply could not keep up. This is a problem not only because it slows the entire system down, but also because when you have to cycle the SQL Server service itself, it takes forever for the system to come back up, apparently because all of the notifications get requeued.


What Twitter Means for Your Google SEO

The “intertubes” was abuzz recently with news that Google was going to add social media to its algorithm, meaning that tweets could be of more importance in the future. But exactly how important? I’m not sure anyone really knows, but a few things I would assume out of the gate:

  1. Massive tweeting on your part probably won’t have much effect on the traffic Google sends your way. I honestly don’t think Google will take the text from a tweet at face value. I believe they’ll use it in conjunction with other metrics when placing a value on the importance of a tweet.
  2. Your followers will probably play an important role in the effect of tweets. Just like how similar web sites linking to your site help with your ranking (based on keywords, linking, etc.), the same will probably be said for your Twitter followers. For instance, if you’re into Ford Mustangs and you promote your Ford Mustang site on Twitter, other Ford Mustang related Twitter accounts will be more valuable to you than a Twitter follower who’s all about Britney Spears. Makes sense.
  3. The depth of your tweets will mean the most. What I mean is, how many times does your tweet get re-tweeted? Having a tweet re-tweeted a ton of times basically means whatever you had to say started to really catch on and people thought it was important. More value would be placed on a tweet that Google could tell the social network found important.
  4. A combination of all of the above. I’m not sure anyone has any solid idea on how Google is going to use Twitter data. My guess is they’ll use a combination of my assumptions above when placing a value on anything they glean from Twitter.

What’s almost certain is that Google appears to be applying more metrics to its algorithm. Whereas domain names, inbound links, domain age, etc. were of utmost importance several years ago, Google is going to look at more metrics when determining your search rankings. In my opinion, this is a good thing. At the end of the day, it puts more relevant topics first based on how people are using the information across the web. Only time will tell what the importance of these changes will be though. What does everyone else think?


Google Page Speed Plugin vs. Page Speed Online

Today, I took a look at Google Labs’ Page Speed Online app to check the score of one of my sites. I was shocked to find out it was scoring really low at 59/100. Pathetic in my opinion, since I consider site speed a huge priority (and so does Google, in fact). I had just done a site update earlier in the week, so I was thinking that I had broken something. I checked the Page Speed plugin for Firefox (part of Firebug), and just like I remembered, we were scoring really high at 94/100. I decided to take a look at Page Speed for Chrome to see where that plugin would score us. It wasn’t as high as Firefox, but not nearly as low as the Online version, scoring 81/100.

So my question to Google is this: Why the difference? Aren’t they running the same rules? Which score means more to Google? Between the browsers I would assume the rules being run in Firebug instead of straight through Chrome could cause a slight difference. Also perhaps the rendering engines for the browsers could account for some difference too. If anyone knows the answer for sure and which score I should really believe, I’d love to know!


Close jQuery ColorBox on an Action

jQuery is awesome. If you use JavaScript on your website, you should use jQuery. If you don’t, you don’t know what you’re missing.

Recently, on a new site I’m about to launch, I was looking for some better ways to use jQuery and ColorBox when estimating shipping charges for customers. Previously, I called out to an internal web service to do some calculations and then do a redirect with the values to display to the user. I was thinking, meh, a redirect? You really need to do that?

So I ripped it all out and started over. I basically decided I could use jQuery and element IDs to do the same thing. Hide some controls, set the html or text values of others where I wanted calculated values to show up. But the kicker was, I could easily do that from my ColorBox modal window, but I wanted it to close after hitting the submit button. Turns out this is stupid simple. From the ColorBox documentation, you can manually close the ColorBox window:


$.colorbox.close();

The key to making it work is to find the element that actually opened the ColorBox window. I only managed to get this to work by first finding the form that owned the element that opened the window, then getting the element in question, i.e.

var myForm = $("#myForm");
var myElement = myForm.find('#myElement');
// A jQuery object is never null, so check whether it matched anything.
if (myElement.length > 0) {
    myElement.colorbox.close();
}

For some reason, just doing this didn’t work:

$('#myElement').colorbox.close();

That would have been simpler, but I got it to work and that’s all that I really cared about. Anyway, hopefully this will be useful to someone else!


Website Speed & Performance Tuning with GTmetrix

I stumbled upon a little gem today while searching for a few more techniques to improve the performance of my ASP.NET web applications. I use YSlow and Google Page Speed almost daily, so it was great to find GTmetrix, a website that combines both of them into an easy-to-read, tabbed table of recommendations. Each recommendation, once expanded, offers you a list of tasks that you can complete to improve your score. What’s more, it ranks each group of recommendations from Low to High priority so that you know what to go after first. If you’re serious about your web site’s performance, definitely check this one out!


Uploading Content to Amazon S3 with CloudBerry Labs’ S3 Explorer

I recently made the move to Amazon S3 and CloudFront to store and serve static content, in particular images, for some of my e-commerce web sites. We have thousands of images to serve to our visitors, in all different sizes. To get started, I went to Google to search for some quality tools. I stumbled upon CloudBerry Labs’ application S3 Explorer and downloaded it to give it a try. Installation was a snap and fairly quickly, I was configuring my Amazon S3 account in S3 Explorer. What’s very cool about this is that you can save all of the S3 accounts you might have for use later on. To configure an S3 connection, you will need your Amazon Access Key and your Amazon Secret Key. Now it was time to upload!

Like I mentioned earlier, we have thousands of images. In fact, we have over 27,000 images. And that’s just in one image dimension size! We have 6 sizes, so that’s well over 160,000 images. That would be a bear to do through Amazon’s S3 web interface. Especially if I needed to set headers and permissions. CloudBerry S3 Explorer came in handy for this. I selected one set of images and before I started the upload, it allowed me to set any HTTP Headers I needed on my images. After that, up they went. I’d say with my connection, it took an hour or so to get all of them up to S3, depending on the file sizes. After uploading, I needed to set permissions, which I was able to do by just selecting all of the S3 objects and setting the proper permissions. This was kind of slow because CloudBerry S3 Explorer needed to get information on all of the objects I had selected, which was over 27,000.

All in all, I think it took me a couple of days to sporadically upload and set up all of our images. The beauty is now we’re serving them from CloudFront, which makes our sites quite a bit faster. A total win win for us.

A few notes about this wonderful application:

  • It’s incredibly easy to set permissions on objects. They have a check box if you want to open the objects up for the world to download, which was nice for us. It would have been nice to be able to do this before upload, like HTTP Headers, but I didn’t see how.
  • Very easy to set HTTP Headers and any metadata you need on your objects. And you can do it before the upload starts!
  • One thing that confused me a little was on Windows 7, when I minimized S3 Explorer, it went into my task bar and not with other minimized applications. It took me a little while to figure out where it was hiding. At first I just thought the application had crashed on me.
  • Overwriting an object preserved its HTTP Headers and permissions, something I was a little concerned about.
  • Moving data between S3 folders and buckets was really easy. Again, it preserves HTTP Headers and permissions.

So, all in all, my impressions of this application are really good, and I was only using the Freeware version. The Pro version, for only $39.99, offers unlimited S3 accounts and multi-threading, which speeds up your uploads. Other features available in the Pro version are:

  • Compression
  • Encryption
  • Search
  • Chunking
  • FTP Support
  • Sync

For more information on CloudBerry Labs’ S3 Explorer, check out their product page for S3 Explorer. Hopefully you’ll find this nifty little application as useful as I did!


Determine if Amazon S3 Object Exists with ASP.NET SDK

After my earlier posts on invalidating Amazon CloudFront objects, I thought it would be important to see if an Amazon S3 object existed before trying to invalidate it. With the 1,000 request limit on invalidation requests before Amazon charges you for them, this seemed to be a prudent thing to do. So, I turned to the Amazon Web Services ASP.NET SDK to help me out with it. This is what I came up with:

public bool S3ObjectExists(string bucket, string key)
{
    using (AmazonS3Client client = new AmazonS3Client(this._awsAccessKey, this._awsSecretKey))
    {
        GetObjectRequest request = new GetObjectRequest();
        request.BucketName = bucket;
        request.Key = key;

        try
        {
            // A non-null response stream is treated as proof the object exists.
            S3Response response = client.GetObject(request);
            if (response.ResponseStream != null)
            {
                return true;
            }
        }
        catch (AmazonS3Exception)
        {
            // Thrown when the object isn't there.
            return false;
        }
        catch (WebException)
        {
            // Thrown when the object is there, but the request cannot be completed.
            return false;
        }
        catch (Exception)
        {
            return false;
        }
    }
    return false;
}

I decided that if I found a valid ResponseStream on the S3Response, then I had a valid object. All I’m checking is the object key itself, i.e. an image path in S3. Another note here is that I’m catching three different exceptions but returning false for all three. I coded it this way for now because I wanted to see what different exceptions GetObject might throw depending on what was wrong with the request. This was done purely for testing purposes and will probably be changed in the future. For instance, I discovered that AmazonS3Exception is thrown when the object isn’t there, while WebException is thrown when the object is there but the request cannot be completed. I’m still in the testing phase with this, but I hope this helps some other Amazon Web Services developers out there.
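
One direction the exception handling could take once testing is done is to return false only for the "not there" case and let every other failure surface to the caller. Here is a minimal sketch of that idea in Ruby rather than C#, using an in-memory stand-in for the client. FakeS3Client, ObjectNotFound and s3_object_exists? are all hypothetical names made up for illustration, not part of any Amazon SDK.

```ruby
# Hypothetical error type standing in for the SDK's "object not found" exception.
class ObjectNotFound < StandardError; end

# In-memory stand-in for an S3 client; objects are keyed by [bucket, key].
class FakeS3Client
  def initialize(objects)
    @objects = objects
  end

  # Returns the stored content or raises ObjectNotFound when the key is absent.
  def get_object(bucket, key)
    @objects.fetch([bucket, key]) { raise ObjectNotFound, "#{bucket}/#{key}" }
  end
end

# True when the object is there, false only when the store reports it missing.
# Any other error is deliberately not rescued, so it propagates to the caller.
def s3_object_exists?(client, bucket, key)
  client.get_object(bucket, key)
  true
rescue ObjectNotFound
  false
end

client = FakeS3Client.new({ ["images", "a.jpg"] => "bytes" })
puts s3_object_exists?(client, "images", "a.jpg")
puts s3_object_exists?(client, "images", "missing.jpg")
```

The point of separating the cases is that a network failure no longer looks like a missing object, which matters if the result decides whether an invalidation request gets sent.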


Finding Duplicate Row Values with SQL

Every once in a while I have the need to find duplicate row values in a SQL table. I seem to forget how to do it each time since it’s not something I have a use for every day, so I thought I’d record it here and share. The solution I found was here:

http://www.petefreitag.com/item/169.cfm

Basically, the SQL code is as follows:

SELECT ColumnName, COUNT(ColumnName) AS ColumnNameCount
FROM MyTable
GROUP BY ColumnName
HAVING COUNT(ColumnName) > 1
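
If it helps to see what the query is doing, the same grouping and filtering can be sketched in plain Ruby over an in-memory list. The sample values are made up and simply stand in for the contents of MyTable.ColumnName.

```ruby
# Made-up sample values standing in for MyTable.ColumnName.
rows = [
  "a@example.com", "b@example.com", "a@example.com",
  "c@example.com", "b@example.com", "a@example.com"
]

# GROUP BY ColumnName with COUNT(ColumnName): count occurrences per value.
counts = rows.each_with_object(Hash.new(0)) { |value, tally| tally[value] += 1 }

# HAVING COUNT(ColumnName) > 1: keep only values that occur more than once.
duplicates = counts.select { |_value, count| count > 1 }

duplicates.each { |value, count| puts "#{value}: #{count}" }
```

Here the first address appears three times and the second twice, so those are the two values reported. The third appears once and is filtered out, just as the HAVING clause would do.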

That’s it, enjoy!