PATH_INFO decoding horrors
If your application (script) is located at /foo and a request is made to /foo/bar%2fbaz (where %2f is a URI-encoded forward slash, “/”), what should the PATH_INFO value be: /bar%2fbaz (undecoded) or /bar/baz (decoded)?
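To make the difference concrete, here is a minimal Python sketch (not taken from any of the servers discussed here) showing how the two interpretations diverge once an application splits the path into segments. The variable names are purely illustrative.

```python
from urllib.parse import unquote

# Path as it appears on the wire, after stripping the /foo script prefix.
raw_path_info = "/bar%2fbaz"

# What a server that decodes PATH_INFO would hand to the application.
decoded_path_info = unquote(raw_path_info)

# Splitting first and decoding each segment afterwards keeps the encoded
# slash inside a single segment; decoding first turns it into a separator.
raw_segments = [unquote(seg) for seg in raw_path_info.split("/")[1:]]
decoded_segments = decoded_path_info.split("/")[1:]

print(decoded_path_info)  # /bar/baz
print(raw_segments)       # ['bar/baz']
print(decoded_segments)   # ['bar', 'baz']
```

With the undecoded value the application sees one path segment, “bar/baz”; with the decoded value it sees two, and nothing in PATH_INFO tells it that the slash was originally encoded.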
First of all, Apache has a problem with %2F in the URL to begin with: such requests get a 404 by default, and you have to set AllowEncodedSlashes On to accept them. More annoyingly, even though the documentation says:
Allowing encoded slashes does not imply decoding. Occurrences of %2F or %5C (only on according systems) will be left as such in the otherwise decoded URL string.
This is not actually true: mod_cgi and other Apache handlers do decode those characters. It has been reported as a bug, but we’re still seeing it today on Apache 2.x.
Back to the original question: I think PATH_INFO should be passed through undecoded, so I added a test to see how our Plack server implementations behave. Interestingly, our CGI server and HTTP::Server::Simple backend failed the automated unit tests. I also confirmed that it fails on FCGI with a lighttpd frontend as well as with the Apache2 mod_perl handler.
I looked at the code that handles this, and HTTP::Request::AsCGI and HTTP::Server::Simple both decode PATH_INFO intentionally, with a note saying “we do this because Apache and lighttpd do this”.
UPDATE: HTTP::Request::AsCGI leaves URI reserved characters such as %2F encoded, because decoding them made Catalyst tests fail. This is actually an incompatibility with Apache, and I confirmed that their TestApp tests fail when run under Apache 2.x in CGI mode: here’s a patch for Catalyst and HTTP::Request::AsCGI so the app should work correctly under Apache CGI as well.
Python’s WSGI 2.0 wiki page also complains about this issue, linking to a detailed analysis of many different web servers, and suggests including RAW_PATH_INFO in addition to PATH_INFO to avoid potential issues like this. Apache’s mod_cgi and lighttpd both provide a REQUEST_URI environment variable, which is undecoded, so it’s possible to construct those RAW variables from it (without it, we can’t tell whether the path was encoded in the first place).
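As a rough sketch of that idea, an undecoded path could be rebuilt from REQUEST_URI and SCRIPT_NAME along these lines. The helper name raw_path_info is hypothetical, and the code assumes that REQUEST_URI is a path (not an absolute URI) and that the SCRIPT_NAME portion was sent unencoded; real servers may need more care.

```python
from urllib.parse import unquote

def raw_path_info(environ):
    """Rebuild an undecoded PATH_INFO from REQUEST_URI (hypothetical helper)."""
    # Drop the query string; REQUEST_URI is assumed to be path?query.
    path = environ["REQUEST_URI"].split("?", 1)[0]
    script_name = environ.get("SCRIPT_NAME", "")
    # Assumption: the script prefix appears literally (unencoded) in the URI.
    if not path.startswith(script_name):
        return None
    rest = path[len(script_name):]
    # Reject prefix matches that end mid-segment, e.g. /foobar for /foo.
    if rest and not rest.startswith("/"):
        return None
    return rest

environ = {
    "REQUEST_URI": "/foo/bar%2fbaz?x=1",
    "SCRIPT_NAME": "/foo",
    "PATH_INFO": "/bar/baz",  # already decoded by the server
}

raw = raw_path_info(environ)
print(raw)  # /bar%2fbaz
```

Decoding the result gives back the PATH_INFO the server supplied, but the reverse is not possible: from /bar/baz alone there is no way to know which slash was originally %2f.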
I’m also interested in how Rack deals with this issue. The spec says “the value MAY be % encoded”, so it doesn’t actually require either behavior.
So I think this is an Apache bug, but unfortunately most software has been living with it, so changing the meaning of PATH_INFO might cause confusion, even though PSGI is a new spec that could be free from the existing CGI spec (or in this case, its implementations). Adding RAW_PATH_INFO, or REQUEST_URI (which is currently not in the spec), as an undecoded value might therefore make more sense.
Thoughts?