Tuesday, April 10, 2012

Be careful with non-ASCII characters in URLs

One thing that is rather common, especially on websites whose content is not in English, is URLs that contain unencoded characters such as space, å, ä, or ö. While this works most of the time, it can cause problems.

Looking at RFC 3986 - Uniform Resource Identifier (URI): Generic Syntax, characters that are allowed to be used unencoded in URLs are either reserved or unreserved. The unreserved characters are a-z, A-Z, 0-9, -, ., _, and ~. The reserved characters are used as delimiters and are :, /, ?, #, [, ], @, !, $, &, ', (, ), *, +, ,, ;, and =.

In essence, this means that the only characters you can reliably use for the actual name parts of a URL are a-z, A-Z, 0-9, -, ., _, and ~. Any other character needs to be percent-encoded.
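
To make the percent-encoding step concrete, here is a minimal sketch in Python. It assumes the URL is encoded as UTF-8 and uses the standard library's urllib.parse.quote; the helper name encode_segment is made up for this illustration.

```python
from urllib.parse import quote, unquote


def encode_segment(segment: str) -> str:
    """Percent-encode a single path segment (hypothetical helper).

    With safe="", quote() leaves only letters, digits, "-", ".", "_"
    and "~" untouched (Python 3.7+) and encodes everything else,
    including "/" and the space character.
    """
    return quote(segment, safe="", encoding="utf-8")


encoded = encode_segment("å ä ö")
print(encoded)           # %C3%A5%20%C3%A4%20%C3%B6
print(unquote(encoded))  # å ä ö
```

Each non-ASCII character becomes two percent-encoded bytes here because å, ä, and ö each take two bytes in UTF-8.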

So what does using an unencoded character like space or å in a URL mean in practice?

■ It doesn’t work in all user agents. Recent browsers seem to handle it pretty well, but older browsers may not be able to follow links or load images. It also causes problems for some other software that expects URIs to be valid (Ruby’s URI.parse is one).
■ It may make URLs ugly and hard to read since browsers may percent encode some of these characters before displaying them in the location bar. This varies from browser to browser. A URL like http://example.com/å ä ö/ may be displayed as http://example.com/å ä ö/, http://example.com/å%20ä%20ö/, http://example.com/%C3%A5%20%C3%A4%20%C3%B6/, or even http://example.com/√•%20√§%20√∂/.

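
The last, garbled variant in the list above looks like what happens when the UTF-8 bytes of the path are read back in a legacy single-byte encoding. The following sketch reproduces it under the assumption that the encoding involved is Mac OS Roman; the post does not say which browser or encoding produced it, and the function name misread_utf8 is made up for this illustration.

```python
def misread_utf8(text: str, wrong_encoding: str = "mac_roman") -> str:
    """Encode text as UTF-8, then decode the bytes with the wrong codec.

    Assumption: Mac OS Roman is the mistaken encoding. Other legacy
    encodings would produce different garbage.
    """
    return text.encode("utf-8").decode(wrong_encoding)


garbled = misread_utf8("å ä ö")
# The space is byte 0x20 in both encodings, so it survives and can
# still be shown percent-encoded as %20.
print(garbled.replace(" ", "%20"))  # √•%20√§%20√∂
```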
To keep things simple and predictable, consider sticking to the unreserved characters in URLs unless you have a really strong internationalisation requirement for using other characters.
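
If you want to check existing URLs against this advice, something like the following sketch could work. It tests one path segment at a time against the unreserved set from RFC 3986; the function name is_unreserved_only is made up for this illustration, and treating an empty segment as invalid is a choice made here, not something the RFC requires.

```python
import re

# Unreserved characters per RFC 3986: a-z, A-Z, 0-9, "-", ".", "_", "~".
UNRESERVED_ONLY = re.compile(r"[A-Za-z0-9\-._~]+")


def is_unreserved_only(segment: str) -> bool:
    """Return True if the segment uses only unreserved characters.

    Empty segments return False (an assumption of this sketch).
    """
    return UNRESERVED_ONLY.fullmatch(segment) is not None


for name in ["quick-tips_2012.html", "å ä ö", "my page"]:
    print(name, is_unreserved_only(name))
```

Only the first name passes; the other two contain characters that would need to be percent-encoded.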

This post is a Quick Tip. Background info is available in Quick Tips for web developers and web designers.
