Steps in processing non-ASCII form submissions
While scraping multi-language web pages, I was sometimes confused by garbled text when parsing the HTML. I found a helpful article about how forms with non-ASCII characters are processed, and I summarize it here for quick reference:
Basically, the article explains:
- How non-ASCII characters are recorded on a Web page depends on the “charset” setting of the page.
- URL encoding is applied when input strings are transferred to the server.
- The PHP CGI module applies URL decoding when parsing input strings into $_REQUEST.
C1. Key sequences on keyboard
|- Language input tool (optional)
C2. Byte sequences
|- Web browser
C3. HTTP request
|- Internet TCP/IP Connection
C4. HTTP request
|- Web server
C5. CGI variables and input stream
|- PHP CGI interface
C6. PHP built-in variable and input stream
1. Page encoding – Input strings entered in an HTML page are encoded according to the page’s “charset” setting. For example, if the page has “charset=iso-8859-1”, Unicode characters that the charset cannot represent are encoded as HTML entities in the form “&#nnnnn;”, where “nnnnn” is the decimal value of the Unicode code point. For example, “&#20320;” is the Unicode character “你” (U+4F60, decimal 20320) encoded as an HTML entity.
If the page has “charset=utf-8”, Unicode characters are encoded as UTF-8 byte sequences. For example, “\xE4\xBD\xA0” is the same character “你” encoded as a UTF-8 byte sequence.
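To see both page encodings side by side, here is a minimal Python sketch. The helper name `encode_for_page` is my own invention for illustration; it only approximates what a browser does, using Python's `xmlcharrefreplace` error handler to stand in for the HTML-entity fallback.

```python
# Hypothetical helper: approximate how typed text is encoded
# for a page with the given charset.
def encode_for_page(text: str, charset: str) -> bytes:
    # Characters the charset cannot represent are replaced with
    # decimal HTML entities of the form &#nnnnn;
    return text.encode(charset, errors="xmlcharrefreplace")


print(encode_for_page("你", "iso-8859-1"))  # b'&#20320;'
print(encode_for_page("你", "utf-8"))       # b'\xe4\xbd\xa0'
```

Plain ASCII input such as "abc" comes out unchanged under either charset, which is why the problem only shows up on multi-language pages.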
2. URL encoding – The web browser then applies “application/x-www-form-urlencoded” encoding to all input strings when sending them to the server as part of the HTTP request. URL encoding converts every non-ASCII byte to the form “%xx”, where “xx” is the hexadecimal value of the byte. It also converts special characters to the form “%xx”, with one exception: the space character “ ” is converted to “+”.
For example, if the page has “charset=iso-8859-1” and the character “你” is entered into the page, it is first encoded as an HTML entity, “&#20320;”. When it is sent to the server, it is encoded again as “%26%2320320%3B”.
If the page has “charset=utf-8” and the same character is entered into the page, it is first encoded as a UTF-8 byte sequence, “\xE4\xBD\xA0”. When it is sent to the server, it is encoded again as “%E4%BD%A0”.
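The two URL-encoded results above can be reproduced with Python's standard `urllib.parse` module. This is only a sketch of the browser's behavior, and the field names `name` and `msg` are made up for the example.

```python
from urllib.parse import quote_plus, urlencode

# Step 1 output for the same character under each charset.
entity_bytes = "你".encode("iso-8859-1", errors="xmlcharrefreplace")
utf8_bytes = "你".encode("utf-8")

# Step 2: percent-encode the bytes; a space would become "+".
encoded_entity = quote_plus(entity_bytes)
encoded_utf8 = quote_plus(utf8_bytes)
print(encoded_entity)  # %26%2320320%3B
print(encoded_utf8)    # %E4%BD%A0

# A whole (hypothetical) form submitted from a UTF-8 page.
encoded_form = urlencode({"name": "你", "msg": "a b"}, encoding="utf-8")
print(encoded_form)    # name=%E4%BD%A0&msg=a+b
```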
3. From step “C3” to “C4”, the network carries the URL-encoded input strings unchanged.
4. From step “C4” to “C5”, the web server passes the URL-encoded input strings through unchanged.
5. From step “C5” to “C6”, the PHP CGI interface does something interesting for you:
- $_SERVER['QUERY_STRING'] stores the URL-encoded input strings as is, if input is submitted with the GET method.
- If input is submitted with the POST method, the URL-encoded input strings remain in the input stream.
- PHP parses input strings out of $_SERVER['QUERY_STRING'] or the input stream into an array called $_REQUEST. During this parsing, URL decoding is applied, so all input strings are converted back to the form produced by page encoding in step 1 (for example, “&#20320;” for an iso-8859-1 page, or “\xE4\xBD\xA0” for a utf-8 page).
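PHP does this decoding for you, but the same step can be sketched in Python with `urllib.parse.parse_qs`, which is handy when checking what a scraper actually sent. The raw strings below are hypothetical examples of what might sit in the query string or POST body. Note that `parse_qs` returns a list of values per field, which differs from the plain strings PHP puts in $_REQUEST.

```python
from urllib.parse import parse_qs

# Hypothetical raw inputs, as they might arrive from each kind of page.
raw_utf8 = "name=%E4%BD%A0&msg=a+b"   # submitted from a utf-8 page
raw_entity = "name=%26%2320320%3B"    # submitted from an iso-8859-1 page

# URL decoding: "%xx" becomes a byte again and "+" becomes a space.
decoded_utf8 = parse_qs(raw_utf8, encoding="utf-8")
decoded_entity = parse_qs(raw_entity)

print(decoded_utf8)    # {'name': ['你'], 'msg': ['a b']}
print(decoded_entity)  # {'name': ['&#20320;']}
```

The second result is still an HTML entity: URL decoding undoes step 2 only, not the page encoding from step 1.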
6. What you do with the characters in the input data is up to you. You could output them back to the HTML document or store them in a file, and you can apply any conversion you need.
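As one possible conversion, here is a small Python sketch that turns an entity-encoded value back into a real character before storing it as UTF-8. The variable names are my own, and whether you want this conversion at all depends on where the data goes next.

```python
import html

# Value as decoded by the server for a form on an iso-8859-1 page.
submitted = "&#20320;"

# Convert the HTML entity back into the actual Unicode character.
text = html.unescape(submitted)
print(text)                  # 你
print(text.encode("utf-8"))  # b'\xe4\xbd\xa0'
```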