Tatsuhiko Miyagawa's Blog

Re: Percent-encoding URIs in Perl - Mark Stosberg

December 17, 2010

utf8::encode $_[0] if utf8::is_utf8 $_[0];

via mark.stosberg.com

utf8::encode if utf8::is_utf8 is a bug. Don’t do it.

There are reasons URI::Escape provides two functions, uri_escape and uri_escape_utf8. The former handles arbitrary byte strings, whether they are UTF-8 encoded or not, and the latter treats the given string as (possibly wide) characters and encodes it to UTF-8 before escaping.
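To make the distinction concrete, here is a minimal sketch (it assumes the URI::Escape module is installed); both calls produce the same escaped output, but the inputs live at different levels of abstraction:

```perl
use strict;
use warnings;
use URI::Escape qw(uri_escape uri_escape_utf8);

# uri_escape operates on a byte string: each unsafe byte is
# escaped as-is, no matter what encoding produced those bytes.
my $bytes = "\xe6\x97\xa5";             # the three UTF-8 bytes of U+65E5 (日)
print uri_escape($bytes), "\n";         # %E6%97%A5

# uri_escape_utf8 operates on a character string: it encodes
# the characters to UTF-8 first, then escapes the result.
my $chars = "\x{65E5}";                 # the single character 日
print uri_escape_utf8($chars), "\n";    # %E6%97%A5
```

Pick the function that matches what your variable actually holds (bytes vs. characters), rather than trying to guess from the utf8 flag.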

Doing utf8::encode based on the utf8 flag is just wrong. The flag only tells you the internal representation of a scalar, and characters in the Latin-1 range may be encoded in a bogus way unless you explicitly call utf8::upgrade on the string before passing it to the function.
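A short sketch of the pitfall (broken_escape is a hypothetical name for the flag-based idiom; it assumes URI::Escape is installed). Two strings that compare equal with eq, differing only in internal representation, escape to different results:

```perl
use strict;
use warnings;
use URI::Escape qw(uri_escape);

# The buggy idiom: encode only when the utf8 flag happens to be set.
sub broken_escape {
    my $s = shift;
    utf8::encode $s if utf8::is_utf8 $s;
    return uri_escape($s);
}

# Two logically identical strings: "café" with é in the Latin-1 range.
my $native   = "caf\xe9";       # stored as native bytes, utf8 flag off
my $upgraded = "caf\xe9";
utf8::upgrade($upgraded);        # same characters, internal UTF-8, flag on

print $native eq $upgraded ? "equal\n" : "not equal\n";  # equal

print broken_escape($native),   "\n";   # caf%E9
print broken_escape($upgraded), "\n";   # caf%C3%A9  -- different output!
```

The same string escapes two different ways depending on an internal detail the caller never chose, which is exactly why the flag must not drive encoding decisions.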

Nothing’s wrong with URI::Escape providing uri_escape to handle arbitrary encodings. While I agree most web pages should just use UTF-8 for everything in 2010, using other text encodings such as EUC-JP, or even arbitrary binary data (such as JPEG data), in URLs is not invalid either.

Mark’s quote from RFC 3986 is taken out of context. It says “When a new URI scheme defines a component that represents textual data consisting of characters from [UCS] …”, which doesn’t apply when we encode parameters for web URLs: it’s not a new URI scheme, and it doesn’t necessarily represent “textual data” either.

Don’t rely on the utf8 flags of strings. See perlunitut and perlunifaq for more details.