William Jiang

JavaScript,PHP,Node,Perl,LAMP Web Developer – http://williamjxj.com; https://github.com/williamjxj?tab=repositories

Tag Archives: unicode

convert unicode to utf-8

I am converting 16-bit Unicode code points to UTF-8 so that Chinese characters display correctly. Here are the sources for the conversion:

1. UTF-8 -> http://en.wikipedia.org/wiki/Utf8
2. CJK Unified Ideographs

1. Unicode CJK

Chinese characters fall within the CJK (Chinese, Japanese, Korean) ranges of Unicode.

The basic block, named CJK Unified Ideographs (4E00–9FFF), contains 20,941 basic Chinese characters in the range U+4E00 through U+9FCC. The code charts are split into four ranges:

4E00-62FF, 6300-77FF, 7800-8CFF, 8D00-9FFF.
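As a quick illustration of the block boundaries above, here is a minimal Python sketch that tests whether a character falls inside CJK Unified Ideographs. The function and constant names are made up for this example, and it only checks the basic block, not the CJK extension blocks.

```python
CJK_START = 0x4E00  # first code point of the CJK Unified Ideographs block
CJK_END = 0x9FFF    # last code point of the block

def in_cjk_unified_block(ch: str) -> bool:
    """Return True if the single character ch lies in U+4E00..U+9FFF."""
    return CJK_START <= ord(ch) <= CJK_END

print(in_cjk_unified_block("大"))  # True  (U+5927)
print(in_cjk_unified_block("A"))   # False (U+0041)
print(0x9FCC - 0x4E00 + 1)         # 20941, the character count quoted above
```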

2. UTF-8 / Unicode table

What I am going to do is translate the Unicode values (right side) into the matching 3-byte UTF-8 characters (left side).

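UTF-8 (3 bytes, lead byte E_)        Unicode (16 bits, in hexadecimal)
Lead byte (hex)  Lead byte (dec)     First code point  Block
E0               224                 0800*             Indic
E1               225                 1000              Misc.
E2               226                 2000              Symbol
E3               227                 3000              Kana, CJK
E4               228                 4000              CJK
E5               229                 5000              CJK
E6               230                 6000              CJK
E7               231                 7000              CJK
E8               232                 8000              CJK
E9               233                 9000              CJK
EA               234                 A000              Asian
EB               235                 B000              Hangul
EC               236                 C000              Hangul
ED               237                 D000              Hangul, Surr
EE               238                 E000              Priv Use
EF               239                 F000              Forms

* The E0 row starts at 0800 rather than 0000, because lower code points use the 1-byte and 2-byte forms.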
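The pattern in the table is that the lead byte is 0xE0 (224) plus the top four bits of the code point. A small sketch of that relationship, assuming a helper name (lead_byte) that is only for illustration:

```python
def lead_byte(code_point: int) -> int:
    """Lead byte of the 3-byte UTF-8 form: 1110xxxx, where xxxx is the top 4 bits."""
    if not 0x0800 <= code_point <= 0xFFFF:
        raise ValueError("3-byte UTF-8 covers U+0800..U+FFFF only")
    return 0xE0 | (code_point >> 12)

# Reproduce a few rows of the table: first code point -> decimal and hex lead byte.
for start in (0x0800, 0x4000, 0x9000, 0xF000):
    print(f"{start:04X} -> {lead_byte(start)} (0x{lead_byte(start):02X})")
```

For the CJK rows this gives lead bytes 228 through 233 (E4 to E9), which is why UTF-8 encoded Chinese text is dominated by those byte values.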

3. Unicode -> UTF-8 conversion formula

Every character in the CJK set above is a 16-bit Unicode code point, and each one is encoded as 3 bytes in UTF-8.

Bits  Last code point  Byte 1    Byte 2    Byte 3    Byte 4    Byte 5    Byte 6
 7    U+007F           0xxxxxxx
11    U+07FF           110xxxxx  10xxxxxx
16    U+FFFF           1110xxxx  10xxxxxx  10xxxxxx
21    U+1FFFFF         11110xxx  10xxxxxx  10xxxxxx  10xxxxxx
26    U+3FFFFFF        111110xx  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx
31    U+7FFFFFFF       1111110x  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx
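The bit patterns in the table can be turned into a small encoder. The sketch below is an illustration, not a replacement for a real codec: the function name utf8_encode is made up, it stops at 4 bytes because current UTF-8 (RFC 3629) ends at U+10FFFF, and it does not reject surrogate code points the way Python's built-in encoder does.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point by filling the x bits of the patterns in the table."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("code point out of range")
    if cp <= 0x7F:        # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:      # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

# Compare against Python's own encoder for one character of each length.
for ch in ("A", "é", "大", "😀"):
    print(ch, utf8_encode(ord(ch)).hex(), utf8_encode(ord(ch)) == ch.encode("utf-8"))
```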

4. Example

For the Chinese character ‘大’ (Unicode 0x5927), the conversion from Unicode to UTF-8 goes as follows:

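(1) Under the Unicode-to-UTF-8 encoding rules, Chinese characters use a 3-byte sequence,
so apply the three-byte conversion formula:
0800 - FFFF
1110xxxx 10xxxxxx 10xxxxxx
The 16 positions marked x are filled with the corresponding bits of the Unicode code point.

(2) 0x5927 in binary is 0101 1001 0010 0111.
Filling these bits into the x positions of the formula gives
11100101 10100100 10100111
which in hexadecimal is E5 A4 A7.

(3) To verify:
type javascript:alert(encodeURI('大').replace(/%/g,'')) into the browser address bar and press Enter.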
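The same steps can be replayed outside the browser. This is a minimal Python sketch of the worked example: it fills the bits by hand and then cross-checks the result with urllib.parse.quote, which plays the role of encodeURI here. The variable names are only for illustration.

```python
from urllib.parse import quote

ch = "大"
bits = format(ord(ch), "016b")      # the 16 bits of U+5927

# Drop the bits into 1110xxxx 10xxxxxx 10xxxxxx (4 + 6 + 6 bits).
filled = ["1110" + bits[:4], "10" + bits[4:10], "10" + bits[10:]]
by_hand = bytes(int(b, 2) for b in filled)

print(bits)                         # 0101100100100111
print(" ".join(filled))             # 11100101 10100100 10100111
print(by_hand.hex().upper())        # E5A4A7
print(quote(ch).replace("%", ""))   # E5A4A7
```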

The steps of Non-ASCII form submit processing

While scraping multi-language web pages, I was sometimes confused by garbled text when parsing the HTML. I found a helpful article about processing forms with non-ASCII characters, which I list here for quick reference:

http://www.herongyang.com/PHP/Non-ASCII-Form-Basic-Rules.html

Basically, the article explains:

  1. How non ASCII characters are recorded on a Web page depends on the “charset” setting of the page.
  2. URL encoding is applied when input strings are transferred to the server.
  3. PHP CGI module applies URL decoding when parsing input strings into $_REQUEST.
C1. Key sequences on keyboard
      |
      |- Language input tool (optional)
      v
C2. Byte sequences
      |
      |- Web browser
      v
C3. HTTP request
      |
      |- Internet TCP/IP Connection
      v
C4. HTTP request
      |
      |- Web server
      v
C5. CGI variables and input stream 
      |
      |- PHP CGI interface
      v
C6. PHP built-in variable and input stream

1. Page encoding – Input strings entered in an HTML page are encoded immediately based on the page’s “charset” setting. For example, if the page has “charset=iso-8859-1”, double-byte Unicode characters are encoded as HTML entities in the form “&#nnnnn;”, where “nnnnn” is the decimal value of the Unicode code point. For example, “&#20320;” is the Unicode character “你” encoded as an HTML entity.

If the page has “charset=utf-8”, double-byte Unicode characters are encoded as UTF-8 byte sequences. For example, “\xE4\xBD\xA0” is the Unicode character “你” encoded as a UTF-8 byte sequence.
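The two page-encoding outcomes can be imitated in a few lines. This sketch uses Python's codecs as a stand-in for the browser, so it shows the resulting bytes rather than real browser behaviour; the xmlcharrefreplace error handler produces the same decimal entity form described above.

```python
text = "你"  # U+4F60, decimal 20320

# charset=iso-8859-1: the character does not fit, so it becomes a numeric entity.
as_entity = text.encode("iso-8859-1", errors="xmlcharrefreplace")

# charset=utf-8: the character becomes a 3-byte UTF-8 sequence.
as_utf8 = text.encode("utf-8")

print(as_entity)  # b'&#20320;'
print(as_utf8)    # b'\xe4\xbd\xa0'
```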

2. URL encoding – The web browser then applies “x-www-form-urlencoded” encoding to all input strings when sending them to the server as part of the HTTP request. URL encoding converts every non-ASCII byte to the form “%xx”, where “xx” is the hex value of the byte. It also converts special characters to the “%xx” form, with one exception: the space character “ ” is converted to “+”.

For example, suppose the page has “charset=iso-8859-1” and a Unicode character is entered into it. The character is encoded immediately as an HTML entity, like “&#20320;”. When it is sent to the server, it is encoded again as “%26%2320320%3B”.

If the page has “charset=utf-8” and the same Unicode character is entered, it is encoded immediately as a UTF-8 byte sequence, like “\xE4\xBD\xA0”. When it is sent to the server, it is encoded again as “%E4%BD%A0”.
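Both URL-encoded strings quoted above can be reproduced with Python's urllib.parse.quote_plus, which follows the same form-encoding convention (percent escapes, with a space turned into "+"). This is a sketch of the encoding step only, not of what any particular browser sends.

```python
from urllib.parse import quote_plus

# iso-8859-1 page: the entity text itself gets percent-encoded.
print(quote_plus("&#20320;"))             # %26%2320320%3B

# utf-8 page: the UTF-8 bytes of the character get percent-encoded.
print(quote_plus("你", encoding="utf-8"))  # %E4%BD%A0

# The space exception.
print(quote_plus("a b"))                  # a+b
```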

3. From step “C3” to “C4”, Internet will maintain the URL encoded input strings as is.

4. From step “C4” to “C5”, Web server will maintain the URL encoded input strings as is.

5. From step “C5” to “C6”, PHP CGI interface is doing something interesting for you:

  • $_SERVER['QUERY_STRING'] stores the URL-encoded input strings as is when input is submitted with the GET method.
  • If input is submitted with the POST method, the URL-encoded input strings are kept in the input stream.
  • PHP parses input strings out of $_SERVER['QUERY_STRING'] or the input stream into an array called $_REQUEST. During this parsing, URL decoding is applied, so all input strings are converted back to how they were entered on the page.
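The decoding step can be illustrated with Python's urllib.parse.parse_qs. This is only an analogy for what PHP does when it fills $_REQUEST, not PHP's own code, and the query string and field names below are made up for the example.

```python
from urllib.parse import parse_qs

# A hypothetical query string as it would appear in QUERY_STRING.
query = "name=%E4%BD%A0&note=a+b&entity=%26%2320320%3B"

params = parse_qs(query)  # URL-decodes names and values (UTF-8 by default)

print(params["name"][0])    # 你
print(params["note"][0])    # a b
print(params["entity"][0])  # &#20320;
```

Note that the entity comes back as the literal text &#20320;, so a form submitted from an iso-8859-1 page still needs an extra entity-decoding step to recover the character.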

6. What you do with the characters in the input data is your decision. You could output them back to the HTML document or store them in a file, and you can apply any conversion you want.

Perl, unicode/utf8/gb2312 convert

Here is a helpful Chinese article that summarizes Perl’s Unicode/UTF-8/GB2312 conversions. I list it here for quick reference:

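use strict;
use warnings;
use utf8;    # string literals in this file are read as UTF-8 characters
use Encode qw(encode);
use URI::Escape qw(uri_escape_utf8 uri_unescape);

$\ = "\n";   # append a newline to every print

# %uXXXX escape -> UTF-8 bytes, shown as hex (prints E694B6)
my $str = '%u6536';
$str =~ s/%u([0-9a-fA-F]{4})/pack("U", hex($1))/eg;
print uc unpack( "H*", encode( "utf8", $str ) );

# %uXXXX escape -> GB2312 bytes, shown as hex
$str = '%u6536';
$str =~ s/%u([0-9a-fA-F]{4})/pack("U", hex($1))/eg;
print uc unpack( "H*", encode( "gb2312", $str ) );

# Chinese text -> URL-encoded UTF-8 (prints %E6%94%B6)
# uri_escape() dies on characters above 255, so use uri_escape_utf8().
$str = "收";
my $escaped = uri_escape_utf8($str);
print $escaped;

# URL-encoded UTF-8 -> Chinese text
# uri_unescape() returns the raw bytes, which are already UTF-8 here.
print uri_unescape($escaped);

# Chinese text -> Perl code points in hex
# Under "use utf8" the literal is already a character string.
$str = "收";
printf "%x ", ord($_) for split //, $str;
print "";

# Chinese text -> standard \uXXXX notation
# No decode() is needed: decoding an already-decoded string fails.
my $text = "汉语";
print join "", map { sprintf( "\\u%x", $_ ) } unpack( "U*", $text );

# %uXXXX escape -> Chinese text (encoded as UTF-8 for output)
$str = '%u6536';
$str =~ s/%u([0-9a-fA-F]{4})/pack("U", hex($1))/eg;
print encode( "utf8", $str );

# Perl \x{...} code points -> Chinese text (encoded as UTF-8 for output)
my $unicode = "\x{505c}\x{8f66}";
print encode( "utf8", $unicode );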
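For comparison, here is a rough Python counterpart of the %uXXXX conversions above. It is a sketch rather than a port of the Perl: the helper name unescape_percent_u is made up, and it only handles 4-digit %uXXXX escapes, like the regular expression in the Perl code.

```python
import re

def unescape_percent_u(s: str) -> str:
    """Replace each %uXXXX escape with the character it names."""
    return re.sub(r"%u([0-9a-fA-F]{4})",
                  lambda m: chr(int(m.group(1), 16)), s)

ch = unescape_percent_u("%u6536")

print(ch)                                  # 收
print(ch.encode("utf-8").hex().upper())    # E694B6
print(ch.encode("gb2312").hex().upper())   # the GB2312 bytes as hex

# Characters -> \uXXXX notation
print("".join(f"\\u{ord(c):x}" for c in "汉语"))  # \u6c49\u8bed
```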

Actually, to convert GB2312 to Unicode and then insert it into a MySQL table with a Unicode general_ci collation (such as utf8_general_ci), the following odd-looking approach might be more efficient:

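use Encode;
# Decode the GB2312 (EUC-CN) bytes into a Perl character string.
$gb = decode( "euc-cn", $gb );
# Quote the string for SQL using the DBI handle $dbh.
$unicode = $dbh->quote($gb);
# Then insert $unicode into the MySQL table.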

It looks strange, but it works fine. The other approaches I tried, such as Encode::from_to() and Encode::encode(), did not work.
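The same decode-then-insert idea looks like this in Python. To keep the sketch runnable without a server, it uses the standard library's sqlite3 with an in-memory database as a stand-in for MySQL, and the table and column names are invented for the example. The point is only the order of operations: decode the GB2312 bytes into a Unicode string first, then let the database driver handle quoting.

```python
import sqlite3

# Stand-in for bytes read from a GB2312-encoded page or file.
gb_bytes = "收".encode("gb2312")

# Counterpart of decode("euc-cn", $gb): bytes -> Unicode string.
text = gb_bytes.decode("gb2312")

# sqlite3 is used here only as a substitute for a MySQL connection.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT)")
conn.execute("INSERT INTO words (word) VALUES (?)", (text,))  # driver does the quoting
stored = conn.execute("SELECT word FROM words").fetchone()[0]
conn.close()

print(stored == "收")  # True
```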