A list of major sitemap-related problems

This is a breakdown of the major sitemap-related problems I put together after being prompted by a client, a large seller of used cars, who complained about indexing issues.

Problem

The client, a large seller of used cars, reported that Google was slow to index his sitemap of 50k pages. Importantly, the sitemap was initially indexed just fine, but the indexation level then gradually decreased to less than 10% of its initial level.

Approach

This prompted me to put together the major sitemap-related problems here and give some examples where appropriate.

Only include canonical URLs

A common mistake is to include URLs of duplicate pages. This increases the load on your server without improving indexing. Besides, duplication is a major factor in how Google's algorithms assess the quality of your site's content.

For example, I randomly picked a URL from the sitemap (a car listing, a 2009 Pontiac G5, as it happens) and found that the client's site has 795 pages that are practically identical, i.e. have completely duplicate content. This greatly impedes user experience and is a very strong predictor of low rankings in Google.

These are essentially the same pages that differ only in the title element, and more precisely in the location it mentions. Moreover, as far as I can tell, all of my client's duplicate listings were included in the sitemaps. This is a huge problem indeed.

So, in this case the level of indexing is commensurate with the quality of the content. If you are in a similar situation, you may want to consolidate link signals for the duplicate or similar content and decide which URL you want people to see.

There are a number of ways to indicate the preferred (canonical) version of your content to Google, including:

  1. Set your preferred domain

  2. Indicate the preferred URL with the rel="canonical" link element

  3. Use a sitemap to set preferred URLs for the same content

  4. Use 301 redirects for URLs that are not canonical

  5. Indicate how to handle dynamic parameters

  6. Specify a canonical link in your HTTP header (see the sketch after this list)
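
A sketch of item 6, using a made-up listing URL: the canonical version can also be declared in the HTTP response headers, which is mainly useful for non-HTML resources such as PDF brochures:

    Link: <http://example.com/listings/2009-pontiac-g5>; rel="canonical"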

Still, technically, you may want to consolidate pages that are identical or similar. In my case, the pages differed only in title. My guess is that the client created these duplicate pages with the intent of ranking them in local search; that is why the titles differ in the location element.

Obviously, this was a bad idea: Google will treat those 795 pages as complete duplicates and its algorithms will significantly reduce the site's rankings for spam and poor user experience reasons.

I recommend using only one URL per unique product listing in your sitemap and 301-redirecting the URLs that are not canonical. This sends visitors and crawlers from all identical or similar pages to the one listed in your sitemap.
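
As a sketch only, assuming the site runs on Apache with mod_alias enabled and using invented listing paths, such a 301 redirect could be added to the .htaccess file or virtual host configuration like this:

    Redirect 301 /listings/2009-pontiac-g5-denver http://example.com/listings/2009-pontiac-g5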

If you mostly have similar pages that were created by the normal operation of your CMS, without any special intent on your part, I recommend indicating the preferred URL to Google with the rel="canonical" link element.
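
For example, each near-duplicate listing page could point to the single preferred version by adding a canonical link element to its <head>; the URL below is invented for illustration:

    <link rel="canonical" href="http://example.com/listings/2009-pontiac-g5" />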

Only include URLs that can be fetched by Googlebot

According to Google's guidelines, it is best practice to include only URLs that can be fetched by Googlebot. It is also a bad sign when the sitemap itself is occasionally unreachable by Googlebot. My client had 577 pages returning a 404 error, and I recommend excluding such pages from the sitemap. To tell search engines about content you don't want indexed, use a robots.txt file or a robots meta tag; see robotstxt.org for more information on how to exclude content from search engines.
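
As a minimal sketch, assuming the pages you want kept out of search live under an invented /old-inventory/ path, the two mechanisms look roughly like this:

    # robots.txt at the domain root
    User-agent: *
    Disallow: /old-inventory/

    <!-- or, on an individual page, in the <head> -->
    <meta name="robots" content="noindex" />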

For a set of XML sitemaps: maximize the number of URLs in each XML sitemap

According to http://googlewebmastercentral.blogspot.com/, the limit is 50,000 URLs or a maximum size of 10MB uncompressed, whichever is reached first. A common mistake is to put only a handful of URLs into each XML sitemap file, which usually makes it harder for Google to download all of these XML sitemaps in a reasonable time.
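
If you genuinely need more than one file, the sitemaps.org protocol lets you reference several large sitemaps from a single sitemap index; a minimal sketch, with invented file names, looks like this:

    <?xml version="1.0" encoding="UTF-8"?>
    <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <sitemap>
        <loc>http://example.com/sitemap-listings-1.xml</loc>
      </sitemap>
      <sitemap>
        <loc>http://example.com/sitemap-listings-2.xml</loc>
      </sitemap>
    </sitemapindex>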

Sitemap file location is an issue

According to the 'Sitemap file location' section of http://www.sitemaps.org/protocol.html#escaping, the location of a Sitemap file determines the set of URLs that can be included in that Sitemap. A Sitemap file located at http://example.com/catalog/sitemap.xml can include any URLs starting with http://example.com/catalog/ but cannot include URLs starting with http://example.com/images/.

The same section notes that URLs which are not considered valid are dropped from further consideration, and it is strongly recommended that you place your Sitemap at the root directory of your web server.

For example, if your web server is at example.com, then your Sitemap index file should live at http://example.com/sitemap.xml. So it is important to move your sitemap index file to the domain root. If you use WordPress, you can change this on the sitemap plugin's settings page in your WordPress admin, under 'Sitemap Location'.

Specify the Sitemap location in your robots.txt file

According to the 'Informing search engine crawlers' section of http://www.sitemaps.org/protocol.html#escaping, you can specify the location of the Sitemap in your robots.txt file, and you can specify more than one Sitemap file per robots.txt file.
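
In practice this is one line per sitemap (or sitemap index) anywhere in robots.txt; the file names below are placeholders:

    Sitemap: http://example.com/sitemap.xml
    Sitemap: http://example.com/sitemap-news.xml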

Only update modification time when the content changed meaningfully

http://googlewebmastercentral.blogspot.com/ recommends that you do not set the last modification time to the current time whenever the sitemap or feed is served. All 500 URLs in the sitemap file discussed above were marked as modified at exactly the same time. This is very unnatural and violates Google's guidelines, which state that the last modification time should be the last time the content of the page changed meaningfully. If a change is meant to be visible in the search results, the last modification time should be the time of that change.
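
In sitemap terms, each <url> entry should carry the date the page content itself last changed, not the time the sitemap was generated; a minimal sketch with an invented URL and date:

    <url>
      <loc>http://example.com/listings/2009-pontiac-g5</loc>
      <lastmod>2012-03-14</lastmod>
    </url>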

Optional tags are optional

According to the XML tag definitions at http://www.sitemaps.org/protocol.html#index, the <lastmod>, <changefreq> and <priority> tags are optional, so there is no need to employ them. The Google Webmaster Central blog, for example, does not use them.
