CMS, SEO and Accessibility
The following is taken from a posting by Dan Zambonini, Box UK's technical director, on http://www.e-consultancy.com.
It used to be that content-managed websites implied doom and gloom for search engine inclusion and ranking. With the ‘opening up’ of search engine algorithms, and a little applied technology, CMS-delivered websites began to avoid exclusion. Now, with experience and a little effort, a CMS can be used to SEO advantage, allowing easily optimised output and a rapid response to the increasingly frequent changes in search engine algorithms.
At a basic level, your CMS should allow the pages to be indexed. There are two common causes for exclusion:
- Query strings – dynamically generated URLs that contain ‘?’ and ‘&’ characters. These can be removed, or replaced, using ‘URL rewriting’ – a technique that uses a web-server plug-in (e.g. mod_rewrite for Apache or IIS_rewrite for IIS) to present a ‘directory’ style URL (e.g. /news/document/latest.html) in place of one containing these special characters.
- Session IDs. Many modern sites use ‘sessions’ to allow the persistent tracking of a user throughout a site (so that the user remains logged in, or for user-path analysis, etc.). To allow this ‘persistence’ across multiple pages of a site, the CMS will create a unique number (session id) for the user, and store it in a) a cookie, b) a per-session cookie, or c) the query string (URL) of each internal link. As many users/browsers will not allow cookies, a) and b) are often replaced by c) when the CMS cannot create a cookie for the user. The Google spider, amongst others, will not accept cookies, so the site may include the session id in the URLs it serves to the spider. Google needs to uniquely identify each page (so that it doesn’t re-index the same page multiple times), but because a new session is started on each visit, the session id presents Google with a different URL for the same page every time. As Google cannot obtain a single unique URL for each page, it won’t index the site. To prevent this, sessions (or at least URL-based session ids) should be switched off for any search-engine-spider visits. Spiders can be detected (and sessions switched off accordingly) by checking the robot’s identifier in the HTTP request headers (the User-Agent header).
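As a rough illustration of both fixes, here is a minimal Python sketch. It is not taken from any particular CMS: the function names, the `sid` parameter, the example query-string layout and the list of crawler tokens are all assumptions made for the example. In practice the rewriting itself is usually configured in the web server (mod_rewrite and the like), which maps the directory-style URL back to the dynamic one; the sketch only shows the kind of URL the CMS would emit.

```python
from urllib.parse import urlsplit, parse_qsl

# Hypothetical substrings to look for in the User-Agent header;
# check each search engine's documentation for its real identifier.
CRAWLER_TOKENS = ("googlebot", "bingbot", "slurp")


def to_directory_url(url: str) -> str:
    """Turn a query-string URL into a 'directory' style URL.

    Assumes the query values, in order, form the path segments,
    e.g. ?section=news&type=document&page=latest
    """
    parts = urlsplit(url)
    values = [value for _, value in parse_qsl(parts.query)]
    if not values:
        return parts.path  # nothing to rewrite
    return "/" + "/".join(values) + ".html"


def is_crawler(user_agent: str) -> bool:
    """Guess whether the visitor is a search engine spider."""
    ua = user_agent.lower()
    return any(token in ua for token in CRAWLER_TOKENS)


def internal_link(path: str, session_id: str, user_agent: str, has_cookie: bool) -> str:
    """Append the session id only for cookieless, non-spider visitors."""
    if has_cookie or is_crawler(user_agent):
        return path
    return f"{path}?sid={session_id}"
```

For example, `to_directory_url("/index.php?section=news&type=document&page=latest")` gives `/news/document/latest.html`, and `internal_link("/news/", "abc123", "Googlebot/2.1", False)` returns `/news/` with no session id attached, so the spider sees the same URL on every visit.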
With these basic problems corrected, you can then begin to optimise your code for better rankings. There are a number of common SEO techniques, which are well documented (see other postings on e-consultancy), but I just thought I’d briefly touch on the issue of accessibility in relation to SEO.
Many people consider accessibility to be associated with disabled users. However, a better approach would be to consider accessibility as the name suggests – providing as much ‘access’ to your site/content as possible. Accessibility therefore covers access by search engine spiders, users from other languages and cultures, and users of differing age groups. A number of ‘accessibility techniques’ handily also provide SEO opportunities:
- Ensure links make sense out of context – if a hyperlink is lifted out of the surrounding text, does it still make sense? For example, a number of sites link to ‘more’, which should be replaced with a descriptive phrase such as ‘more news and events’. Benefit to SEO: some search engines use link text for relevancy.
- Use simple language. The content of your site should be as easy to read as possible, e.g. avoid sector-specific terminology and overly complex wording. Benefit to SEO: using common, neutral language will open up the content to a wider range of search terms.
- Validate the HTML (http://validator.w3.org/). To appear in a search engine, your content needs to be indexed by a search engine spider/robot. The spider will attempt to split your content/page into sections before indexing – e.g. header, metadata tags, headings, normal text, etc. In order to split the content into its components, spiders will assume a certain structure – that of valid (X)HTML. If the spider has difficulty in calculating the structure of your code, some of the text could be misclassified or omitted.
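The first of these checks is easy to automate. The following is a small sketch, using only Python's standard library, that scans a page for links whose text makes no sense out of context. The list of ‘vague’ phrases and the names `LinkTextAuditor` and `find_vague_links` are made up for the example; a real audit would want a longer list tuned to your own site.

```python
from html.parser import HTMLParser

# Hypothetical list of link texts that say nothing out of context.
VAGUE_LINK_TEXT = {"more", "read more", "here", "click here", "link"}


class LinkTextAuditor(HTMLParser):
    """Collect (href, text) pairs for links with vague link text."""

    def __init__(self):
        super().__init__()
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>
        self.vague_links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href") or ""
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace before comparing.
            text = " ".join("".join(self._text).split())
            if text.lower() in VAGUE_LINK_TEXT:
                self.vague_links.append((self._href, text))
            self._href = None


def find_vague_links(html: str):
    """Return a list of (href, link text) for links that need rewording."""
    auditor = LinkTextAuditor()
    auditor.feed(html)
    auditor.close()
    return auditor.vague_links
```

Run against a page containing both `<a href="/news">more</a>` and `<a href="/events">more news and events</a>`, only the first link is reported.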
Remember – search engine spiders are amongst the most ‘disabled’ users of the web: unable to hear, see formatting, or infer information from structure or colour. By designing in line with accessibility standards (http://www.w3.org/TR/WCAG10/), you are designing for Google.
