Character Encoding

 2013/11/14 10:49:59  天添  博客园 (cnblogs)
  • Tags: encoding

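      I have been mulling over this post for a long time without daring to publish it. To be honest, this is basic computing knowledge that I have always been hazy about. I hesitated because I was not sure that what I wrote would be correct, and I did not want to be looked down on. I have tried to back things up with tests, but my skills are limited, so I am still afraid that I have not mastered such fundamentals.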

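     Before going into the details of character encoding, a few concepts need introducing:

  • Character
  • Character set
  • Character code
  • Character encoding scheme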

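     Character:

           A symbol used to represent a language visually. This is easy to grasp: it is what we see every day, such as A, B, C, D, 啊, 喔, 额.

     Character set:

           The total number of letters, symbols, and digits in everyday use is enormous, and handling all of them at once is impractical, so we agree in advance on which characters to use. That agreed collection is called a character set. Well-known examples are ASCII from the United States, ISO 8859 from Europe, GB 2312 from China, and the later Unicode character set, which aims to cover many languages.

           [Figure: ASCII table]

     Character code:

           Within a character set, each character is assigned a number, called its character code.

     Character encoding scheme:

           The way a computer represents those character codes using integers only is called the character encoding scheme.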

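      That makes things a little clearer. A computer can handle images, animation, programs, and all kinds of data, but the CPU only works with binary numbers, so everything it processes has to be converted to binary. The people who first built computers spoke English, so early character sets such as ASCII contain only letters, digits, and basic symbols. As computers spread to other countries, including China, ASCII was no longer enough, so national character sets such as GB 2312 appeared, followed by Unicode.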
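To make the difference between a character code and an encoding scheme concrete, here is a minimal Java sketch. The class name `CodePointDemo` and the helper `hex` are my own, invented for illustration. It prints the number Unicode assigns to the character 天 and then the bytes that two different encoding schemes produce for that same number. The character is written as a Unicode escape so the source file itself stays pure ASCII.

```java
import java.nio.charset.StandardCharsets;

class CodePointDemo {
    // Formats bytes as space-separated uppercase hex, for example "E5 A4 A9".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String tian = "\u5929"; // the single character U+5929

        // Character code: the number the character set assigns to the character.
        System.out.printf("code point: U+%04X%n", tian.codePointAt(0)); // U+5929

        // Encoding scheme: how that one number is laid out as bytes.
        System.out.println("UTF-8    : " + hex(tian.getBytes(StandardCharsets.UTF_8)));    // E5 A4 A9
        System.out.println("UTF-16BE : " + hex(tian.getBytes(StandardCharsets.UTF_16BE))); // 59 29

        // In ASCII the letter A simply has the code 65.
        System.out.println("ASCII 'A': " + (int) 'A');
    }
}
```

The character code stays the same in both cases; only the byte layout changes with the encoding scheme.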

      Now let's write some code to look at this in detail.

      First, in C#, list the encodings that .NET supports:

using System;
using System.IO;
using System.Text;

namespace Text
{
    class Program
    {
        static void Main(string[] args)
        {
            // Collect one line per encoding known to this .NET installation.
            StringBuilder sb = new StringBuilder();
            foreach (EncodingInfo coif in Encoding.GetEncodings())
            {
                sb.Append("Display Name: " + coif.DisplayName + "----Name: " + coif.Name + "\n");
            }
            byte[] coByte = Encoding.Unicode.GetBytes(sb.ToString()); // UTF-16, little endian

            // FileMode.Create truncates an existing file, so stale bytes from an
            // earlier, longer run cannot remain at the end.
            using (FileStream fs = File.Open("c:\\code.txt", FileMode.Create))
            {
                fs.Write(coByte, 0, coByte.Length);
            }
            Console.ReadKey();
        }
    }
}

 

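         I originally printed this to the console, but there turned out to be quite a lot of output, so I had to write it to a file instead. Here is the output: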

  1 Display Name: IBM EBCDIC (美国-加拿大)----Name: IBM037
  2 Display Name: OEM 美国----Name: IBM437
  3 Display Name: IBM EBCDIC (国际)----Name: IBM500
  4 Display Name: 阿拉伯字符(ASMO-708)----Name: ASMO-708
  5 Display Name: 阿拉伯字符(DOS)----Name: DOS-720
  6 Display Name: 希腊字符(DOS)----Name: ibm737
  7 Display Name: 波罗的海字符(DOS)----Name: ibm775
  8 Display Name: 西欧字符(DOS)----Name: ibm850
  9 Display Name: 中欧字符(DOS)----Name: ibm852
 10 Display Name: OEM 西里尔语----Name: IBM855
 11 Display Name: 土耳其字符(DOS)----Name: ibm857
 12 Display Name: OEM 多语言拉丁语 I----Name: IBM00858
 13 Display Name: 葡萄牙语(DOS)----Name: IBM860
 14 Display Name: 冰岛语(DOS)----Name: ibm861
 15 Display Name: 希伯来字符(DOS)----Name: DOS-862
 16 Display Name: 加拿大法语(DOS)----Name: IBM863
 17 Display Name: 阿拉伯字符(864)----Name: IBM864
 18 Display Name: 北欧字符(DOS)----Name: IBM865
 19 Display Name: 西里尔字符(DOS)----Name: cp866
 20 Display Name: 现代希腊字符(DOS)----Name: ibm869
 21 Display Name: IBM EBCDIC (多语言拉丁语 2)----Name: IBM870
 22 Display Name: 泰语(Windows)----Name: windows-874
 23 Display Name: IBM EBCDIC (现代希腊语)----Name: cp875
 24 Display Name: 日语(Shift-JIS)----Name: shift_jis
 25 Display Name: 简体中文(GB2312)----Name: gb2312
 26 Display Name: 朝鲜语----Name: ks_c_5601-1987
 27 Display Name: 繁体中文(Big5)----Name: big5
 28 Display Name: IBM EBCDIC (土耳其拉丁语 5)----Name: IBM1026
 29 Display Name: IBM 拉丁语 1----Name: IBM01047
 30 Display Name: IBM EBCDIC (美国-加拿大-欧洲)----Name: IBM01140
 31 Display Name: IBM EBCDIC (德国-欧洲)----Name: IBM01141
 32 Display Name: IBM EBCDIC (丹麦-挪威-欧洲)----Name: IBM01142
 33 Display Name: IBM EBCDIC (芬兰-瑞典-欧洲)----Name: IBM01143
 34 Display Name: IBM EBCDIC (意大利-欧洲)----Name: IBM01144
 35 Display Name: IBM EBCDIC (西班牙-欧洲)----Name: IBM01145
 36 Display Name: IBM EBCDIC (英国-欧洲)----Name: IBM01146
 37 Display Name: IBM EBCDIC (法国-欧洲)----Name: IBM01147
 38 Display Name: IBM EBCDIC (国际-欧洲)----Name: IBM01148
 39 Display Name: IBM EBCDIC (冰岛语-欧洲)----Name: IBM01149
 40 Display Name: Unicode----Name: utf-16
 41 Display Name: Unicode (Big-Endian)----Name: utf-16BE
 42 Display Name: 中欧字符(Windows)----Name: windows-1250
 43 Display Name: 西里尔字符(Windows)----Name: windows-1251
 44 Display Name: 西欧字符(Windows)----Name: Windows-1252
 45 Display Name: 希腊字符(Windows)----Name: windows-1253
 46 Display Name: 土耳其字符(Windows)----Name: windows-1254
 47 Display Name: 希伯来字符(Windows)----Name: windows-1255
 48 Display Name: 阿拉伯字符(Windows)----Name: windows-1256
 49 Display Name: 波罗的海字符(Windows)----Name: windows-1257
 50 Display Name: 越南字符(Windows)----Name: windows-1258
 51 Display Name: 朝鲜语(Johab)----Name: Johab
 52 Display Name: 西欧字符(Mac)----Name: macintosh
 53 Display Name: 日语(Mac)----Name: x-mac-japanese
 54 Display Name: 繁体中文(Mac)----Name: x-mac-chinesetrad
 55 Display Name: 朝鲜语(Mac)----Name: x-mac-korean
 56 Display Name: 阿拉伯字符(Mac)----Name: x-mac-arabic
 57 Display Name: 希伯来字符(Mac)----Name: x-mac-hebrew
 58 Display Name: 希腊字符(Mac)----Name: x-mac-greek
 59 Display Name: 西里尔字符(Mac)----Name: x-mac-cyrillic
 60 Display Name: 简体中文(Mac)----Name: x-mac-chinesesimp
 61 Display Name: 罗马尼亚语(Mac)----Name: x-mac-romanian
 62 Display Name: 乌克兰语(Mac)----Name: x-mac-ukrainian
 63 Display Name: 泰语(Mac)----Name: x-mac-thai
 64 Display Name: 中欧字符(Mac)----Name: x-mac-ce
 65 Display Name: 冰岛语(Mac)----Name: x-mac-icelandic
 66 Display Name: 土耳其字符(Mac)----Name: x-mac-turkish
 67 Display Name: 克罗地亚语(Mac)----Name: x-mac-croatian
 68 Display Name: Unicode (UTF-32)----Name: utf-32
 69 Display Name: Unicode (UTF-32 Big-Endian)----Name: utf-32BE
 70 Display Name: 繁体中文(CNS)----Name: x-Chinese-CNS
 71 Display Name: TCA 台湾----Name: x-cp20001
 72 Display Name: 繁体中文(Eten)----Name: x-Chinese-Eten
 73 Display Name: IBM5550 台湾----Name: x-cp20003
 74 Display Name: TeleText 台湾----Name: x-cp20004
 75 Display Name: Wang 台湾----Name: x-cp20005
 76 Display Name: 西欧字符(IA5)----Name: x-IA5
 77 Display Name: 德语(IA5)----Name: x-IA5-German
 78 Display Name: 瑞典语(IA5)----Name: x-IA5-Swedish
 79 Display Name: 挪威语(IA5)----Name: x-IA5-Norwegian
 80 Display Name: US-ASCII----Name: us-ascii
 81 Display Name: T.61----Name: x-cp20261
 82 Display Name: ISO-6937----Name: x-cp20269
 83 Display Name: IBM EBCDIC (德国)----Name: IBM273
 84 Display Name: IBM EBCDIC (丹麦-挪威)----Name: IBM277
 85 Display Name: IBM EBCDIC (芬兰-瑞典)----Name: IBM278
 86 Display Name: IBM EBCDIC (意大利)----Name: IBM280
 87 Display Name: IBM EBCDIC (西班牙)----Name: IBM284
 88 Display Name: IBM EBCDIC (UK)----Name: IBM285
 89 Display Name: IBM EBCDIC (日语片假名)----Name: IBM290
 90 Display Name: IBM EBCDIC (法国)----Name: IBM297
 91 Display Name: IBM EBCDIC (阿拉伯语)----Name: IBM420
 92 Display Name: IBM EBCDIC (希腊语)----Name: IBM423
 93 Display Name: IBM EBCDIC (希伯来语)----Name: IBM424
 94 Display Name: IBM EBCDIC (朝鲜语扩展)----Name: x-EBCDIC-KoreanExtended
 95 Display Name: IBM EBCDIC (泰语)----Name: IBM-Thai
 96 Display Name: 西里尔字符(KOI8-R)----Name: koi8-r
 97 Display Name: IBM EBCDIC (冰岛语)----Name: IBM871
 98 Display Name: IBM EBCDIC (西里尔俄语)----Name: IBM880
 99 Display Name: IBM EBCDIC (土耳其语)----Name: IBM905
100 Display Name: IBM 拉丁语 1----Name: IBM00924
101 Display Name: 日语(JIS 0208-19900212-1990)----Name: EUC-JP
102 Display Name: 简体中文(GB2312-80)----Name: x-cp20936
103 Display Name: 朝鲜语 Wansung----Name: x-cp20949
104 Display Name: IBM EBCDIC (西里尔塞尔维亚-保加利亚语)----Name: cp1025
105 Display Name: 西里尔字符(KOI8-U)----Name: koi8-u
106 Display Name: 西欧字符(ISO)----Name: iso-8859-1
107 Display Name: 中欧字符(ISO)----Name: iso-8859-2
108 Display Name: 拉丁语 3 (ISO)----Name: iso-8859-3
109 Display Name: 波罗的海字符(ISO)----Name: iso-8859-4
110 Display Name: 西里尔字符(ISO)----Name: iso-8859-5
111 Display Name: 阿拉伯字符(ISO)----Name: iso-8859-6
112 Display Name: 希腊字符(ISO)----Name: iso-8859-7
113 Display Name: 希伯来字符(ISO-Visual)----Name: iso-8859-8
114 Display Name: 土耳其字符(ISO)----Name: iso-8859-9
115 Display Name: 爱沙尼亚语(ISO)----Name: iso-8859-13
116 Display Name: 拉丁语 9 (ISO)----Name: iso-8859-15
117 Display Name: 欧罗巴----Name: x-Europa
118 Display Name: 希伯来字符(ISO-Logical)----Name: iso-8859-8-i
119 Display Name: 日语(JIS)----Name: iso-2022-jp
120 Display Name: 日语(JIS-允许 1 字节假名)----Name: csISO2022JP
121 Display Name: 日语(JIS-允许 1 字节假名 - SO/SI)----Name: iso-2022-jp
122 Display Name: 朝鲜语(ISO)----Name: iso-2022-kr
123 Display Name: 简体中文(ISO-2022)----Name: x-cp50227
124 Display Name: 日语(EUC)----Name: euc-jp
125 Display Name: 简体中文(EUC)----Name: EUC-CN
126 Display Name: 朝鲜语(EUC)----Name: euc-kr
127 Display Name: 简体中文(HZ)----Name: hz-gb-2312
128 Display Name: 简体中文(GB18030)----Name: GB18030
129 Display Name: ISCII 梵文----Name: x-iscii-de
130 Display Name: ISCII 孟加拉语----Name: x-iscii-be
131 Display Name: ISCII 泰米尔语----Name: x-iscii-ta
132 Display Name: ISCII 泰卢固语----Name: x-iscii-te
133 Display Name: ISCII 阿萨姆语----Name: x-iscii-as
134 Display Name: ISCII 奥里雅语----Name: x-iscii-or
135 Display Name: ISCII 卡纳达语----Name: x-iscii-ka
136 Display Name: ISCII 马拉雅拉姆语----Name: x-iscii-ma
137 Display Name: ISCII 古吉拉特语----Name: x-iscii-gu
138 Display Name: ISCII 旁遮普语----Name: x-iscii-pa
139 Display Name: Unicode (UTF-7)----Name: utf-7
140 Display Name: Unicode (UTF-8)----Name: utf-8

 

Now the Java equivalent:

package code;

import java.nio.charset.Charset;
import java.util.Map;
import java.util.SortedMap;

public class Code {

    public static void main(String[] args) {
        // Maps each canonical charset name to its Charset object, sorted by name.
        SortedMap<String, Charset> availableSet = Charset.availableCharsets();
        for (Map.Entry<String, Charset> entry : availableSet.entrySet()) {
            Charset charset = entry.getValue();
            System.out.println("DisplayName: " + charset.displayName() + " Name: " + charset.name());
        }
    }

}

Here is the output:

  1 DisplayName: Big5 Name: Big5
  2 DisplayName: Big5-HKSCS Name: Big5-HKSCS
  3 DisplayName: EUC-JP Name: EUC-JP
  4 DisplayName: EUC-KR Name: EUC-KR
  5 DisplayName: GB18030 Name: GB18030
  6 DisplayName: GB2312 Name: GB2312
  7 DisplayName: GBK Name: GBK
  8 DisplayName: IBM-Thai Name: IBM-Thai
  9 DisplayName: IBM00858 Name: IBM00858
 10 DisplayName: IBM01140 Name: IBM01140
 11 DisplayName: IBM01141 Name: IBM01141
 12 DisplayName: IBM01142 Name: IBM01142
 13 DisplayName: IBM01143 Name: IBM01143
 14 DisplayName: IBM01144 Name: IBM01144
 15 DisplayName: IBM01145 Name: IBM01145
 16 DisplayName: IBM01146 Name: IBM01146
 17 DisplayName: IBM01147 Name: IBM01147
 18 DisplayName: IBM01148 Name: IBM01148
 19 DisplayName: IBM01149 Name: IBM01149
 20 DisplayName: IBM037 Name: IBM037
 21 DisplayName: IBM1026 Name: IBM1026
 22 DisplayName: IBM1047 Name: IBM1047
 23 DisplayName: IBM273 Name: IBM273
 24 DisplayName: IBM277 Name: IBM277
 25 DisplayName: IBM278 Name: IBM278
 26 DisplayName: IBM280 Name: IBM280
 27 DisplayName: IBM284 Name: IBM284
 28 DisplayName: IBM285 Name: IBM285
 29 DisplayName: IBM297 Name: IBM297
 30 DisplayName: IBM420 Name: IBM420
 31 DisplayName: IBM424 Name: IBM424
 32 DisplayName: IBM437 Name: IBM437
 33 DisplayName: IBM500 Name: IBM500
 34 DisplayName: IBM775 Name: IBM775
 35 DisplayName: IBM850 Name: IBM850
 36 DisplayName: IBM852 Name: IBM852
 37 DisplayName: IBM855 Name: IBM855
 38 DisplayName: IBM857 Name: IBM857
 39 DisplayName: IBM860 Name: IBM860
 40 DisplayName: IBM861 Name: IBM861
 41 DisplayName: IBM862 Name: IBM862
 42 DisplayName: IBM863 Name: IBM863
 43 DisplayName: IBM864 Name: IBM864
 44 DisplayName: IBM865 Name: IBM865
 45 DisplayName: IBM866 Name: IBM866
 46 DisplayName: IBM868 Name: IBM868
 47 DisplayName: IBM869 Name: IBM869
 48 DisplayName: IBM870 Name: IBM870
 49 DisplayName: IBM871 Name: IBM871
 50 DisplayName: IBM918 Name: IBM918
 51 DisplayName: ISO-2022-CN Name: ISO-2022-CN
 52 DisplayName: ISO-2022-JP Name: ISO-2022-JP
 53 DisplayName: ISO-2022-JP-2 Name: ISO-2022-JP-2
 54 DisplayName: ISO-2022-KR Name: ISO-2022-KR
 55 DisplayName: ISO-8859-1 Name: ISO-8859-1
 56 DisplayName: ISO-8859-13 Name: ISO-8859-13
 57 DisplayName: ISO-8859-15 Name: ISO-8859-15
 58 DisplayName: ISO-8859-2 Name: ISO-8859-2
 59 DisplayName: ISO-8859-3 Name: ISO-8859-3
 60 DisplayName: ISO-8859-4 Name: ISO-8859-4
 61 DisplayName: ISO-8859-5 Name: ISO-8859-5
 62 DisplayName: ISO-8859-6 Name: ISO-8859-6
 63 DisplayName: ISO-8859-7 Name: ISO-8859-7
 64 DisplayName: ISO-8859-8 Name: ISO-8859-8
 65 DisplayName: ISO-8859-9 Name: ISO-8859-9
 66 DisplayName: JIS_X0201 Name: JIS_X0201
 67 DisplayName: JIS_X0212-1990 Name: JIS_X0212-1990
 68 DisplayName: KOI8-R Name: KOI8-R
 69 DisplayName: KOI8-U Name: KOI8-U
 70 DisplayName: Shift_JIS Name: Shift_JIS
 71 DisplayName: TIS-620 Name: TIS-620
 72 DisplayName: US-ASCII Name: US-ASCII
 73 DisplayName: UTF-16 Name: UTF-16
 74 DisplayName: UTF-16BE Name: UTF-16BE
 75 DisplayName: UTF-16LE Name: UTF-16LE
 76 DisplayName: UTF-32 Name: UTF-32
 77 DisplayName: UTF-32BE Name: UTF-32BE
 78 DisplayName: UTF-32LE Name: UTF-32LE
 79 DisplayName: UTF-8 Name: UTF-8
 80 DisplayName: windows-1250 Name: windows-1250
 81 DisplayName: windows-1251 Name: windows-1251
 82 DisplayName: windows-1252 Name: windows-1252
 83 DisplayName: windows-1253 Name: windows-1253
 84 DisplayName: windows-1254 Name: windows-1254
 85 DisplayName: windows-1255 Name: windows-1255
 86 DisplayName: windows-1256 Name: windows-1256
 87 DisplayName: windows-1257 Name: windows-1257
 88 DisplayName: windows-1258 Name: windows-1258
 89 DisplayName: windows-31j Name: windows-31j
 90 DisplayName: x-Big5-HKSCS-2001 Name: x-Big5-HKSCS-2001
 91 DisplayName: x-Big5-Solaris Name: x-Big5-Solaris
 92 DisplayName: x-euc-jp-linux Name: x-euc-jp-linux
 93 DisplayName: x-EUC-TW Name: x-EUC-TW
 94 DisplayName: x-eucJP-Open Name: x-eucJP-Open
 95 DisplayName: x-IBM1006 Name: x-IBM1006
 96 DisplayName: x-IBM1025 Name: x-IBM1025
 97 DisplayName: x-IBM1046 Name: x-IBM1046
 98 DisplayName: x-IBM1097 Name: x-IBM1097
 99 DisplayName: x-IBM1098 Name: x-IBM1098
100 DisplayName: x-IBM1112 Name: x-IBM1112
101 DisplayName: x-IBM1122 Name: x-IBM1122
102 DisplayName: x-IBM1123 Name: x-IBM1123
103 DisplayName: x-IBM1124 Name: x-IBM1124
104 DisplayName: x-IBM1364 Name: x-IBM1364
105 DisplayName: x-IBM1381 Name: x-IBM1381
106 DisplayName: x-IBM1383 Name: x-IBM1383
107 DisplayName: x-IBM33722 Name: x-IBM33722
108 DisplayName: x-IBM737 Name: x-IBM737
109 DisplayName: x-IBM833 Name: x-IBM833
110 DisplayName: x-IBM834 Name: x-IBM834
111 DisplayName: x-IBM856 Name: x-IBM856
112 DisplayName: x-IBM874 Name: x-IBM874
113 DisplayName: x-IBM875 Name: x-IBM875
114 DisplayName: x-IBM921 Name: x-IBM921
115 DisplayName: x-IBM922 Name: x-IBM922
116 DisplayName: x-IBM930 Name: x-IBM930
117 DisplayName: x-IBM933 Name: x-IBM933
118 DisplayName: x-IBM935 Name: x-IBM935
119 DisplayName: x-IBM937 Name: x-IBM937
120 DisplayName: x-IBM939 Name: x-IBM939
121 DisplayName: x-IBM942 Name: x-IBM942
122 DisplayName: x-IBM942C Name: x-IBM942C
123 DisplayName: x-IBM943 Name: x-IBM943
124 DisplayName: x-IBM943C Name: x-IBM943C
125 DisplayName: x-IBM948 Name: x-IBM948
126 DisplayName: x-IBM949 Name: x-IBM949
127 DisplayName: x-IBM949C Name: x-IBM949C
128 DisplayName: x-IBM950 Name: x-IBM950
129 DisplayName: x-IBM964 Name: x-IBM964
130 DisplayName: x-IBM970 Name: x-IBM970
131 DisplayName: x-ISCII91 Name: x-ISCII91
132 DisplayName: x-ISO-2022-CN-CNS Name: x-ISO-2022-CN-CNS
133 DisplayName: x-ISO-2022-CN-GB Name: x-ISO-2022-CN-GB
134 DisplayName: x-iso-8859-11 Name: x-iso-8859-11
135 DisplayName: x-JIS0208 Name: x-JIS0208
136 DisplayName: x-JISAutoDetect Name: x-JISAutoDetect
137 DisplayName: x-Johab Name: x-Johab
138 DisplayName: x-MacArabic Name: x-MacArabic
139 DisplayName: x-MacCentralEurope Name: x-MacCentralEurope
140 DisplayName: x-MacCroatian Name: x-MacCroatian
141 DisplayName: x-MacCyrillic Name: x-MacCyrillic
142 DisplayName: x-MacDingbat Name: x-MacDingbat
143 DisplayName: x-MacGreek Name: x-MacGreek
144 DisplayName: x-MacHebrew Name: x-MacHebrew
145 DisplayName: x-MacIceland Name: x-MacIceland
146 DisplayName: x-MacRoman Name: x-MacRoman
147 DisplayName: x-MacRomania Name: x-MacRomania
148 DisplayName: x-MacSymbol Name: x-MacSymbol
149 DisplayName: x-MacThai Name: x-MacThai
150 DisplayName: x-MacTurkish Name: x-MacTurkish
151 DisplayName: x-MacUkraine Name: x-MacUkraine
152 DisplayName: x-MS932_0213 Name: x-MS932_0213
153 DisplayName: x-MS950-HKSCS Name: x-MS950-HKSCS
154 DisplayName: x-MS950-HKSCS-XP Name: x-MS950-HKSCS-XP
155 DisplayName: x-mswin-936 Name: x-mswin-936
156 DisplayName: x-PCK Name: x-PCK
157 DisplayName: x-SJIS_0213 Name: x-SJIS_0213
158 DisplayName: x-UTF-16LE-BOM Name: x-UTF-16LE-BOM
159 DisplayName: X-UTF-32BE-BOM Name: X-UTF-32BE-BOM
160 DisplayName: X-UTF-32LE-BOM Name: X-UTF-32LE-BOM
161 DisplayName: x-windows-50220 Name: x-windows-50220
162 DisplayName: x-windows-50221 Name: x-windows-50221
163 DisplayName: x-windows-874 Name: x-windows-874
164 DisplayName: x-windows-949 Name: x-windows-949
165 DisplayName: x-windows-950 Name: x-windows-950
166 DisplayName: x-windows-iso2022jp Name: x-windows-iso2022jp

Java appears to support somewhat more encodings than C#: 166 names in this listing versus 140 in the C# one.
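Comparing the two counts is only a rough guide, because both platforms accept several names for the same encoding, and the Java number depends on which runtime is installed. As a small sketch (the class `CharsetAliasDemo` and the helper `canonicalName` are names I made up), the following Java code resolves a few aliases to their canonical charset names and prints how many charsets the current runtime offers.

```java
import java.nio.charset.Charset;

class CharsetAliasDemo {
    // Resolves a charset name or alias to the charset's canonical name.
    // Lookup through Charset.forName is not case-sensitive.
    static String canonicalName(String nameOrAlias) {
        return Charset.forName(nameOrAlias).name();
    }

    public static void main(String[] args) {
        // The size varies between Java runtimes and versions.
        System.out.println("available charsets: " + Charset.availableCharsets().size());

        for (String alias : new String[] {"UTF8", "utf-8", "ASCII", "ISO8859_1"}) {
            System.out.println(alias + " -> " + canonicalName(alias));
        }

        // Every charset also reports the set of aliases it answers to.
        System.out.println("aliases of UTF-8: " + Charset.forName("UTF-8").aliases());
    }
}
```

So a list of names is not quite the same thing as a list of distinct encodings, on either platform.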

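 Setting the default character set in Eclipse

This is simple. Different computers and programs may use different encodings as their default, so a program copied from one computer to another will not necessarily compile. Next, let's print a program's default character set:

Java: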

package code;

import java.nio.charset.Charset;

public class Code {

    public static void main(String[] args) {
        // The default depends on the JVM, the operating system, and its locale settings.
        System.out.println("Default CharSet: " + Charset.defaultCharset());
    }

}

Output:

Default CharSet: UTF-8
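Because the default differs from machine to machine, one common way to avoid surprises is to name the charset explicitly whenever text is turned into bytes or back. The sketch below is my own illustration (the class `ExplicitCharsetDemo` and the method `roundTripViaTempFile` are hypothetical names): it writes a string to a temporary file as UTF-8 and reads it back as UTF-8, so the result does not depend on `Charset.defaultCharset()`.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class ExplicitCharsetDemo {
    // Writes text to a temporary file as UTF-8 and reads it back as UTF-8,
    // never consulting the platform default charset.
    static String roundTripViaTempFile(String text) {
        try {
            Path tmp = Files.createTempFile("encoding-demo", ".txt");
            try {
                Files.write(tmp, text.getBytes(StandardCharsets.UTF_8));
                return new String(Files.readAllBytes(tmp), StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // U+5929 U+6DFB, written as escapes so this source file stays ASCII.
        String text = "\u5929\u6dfb";
        System.out.println("platform default: " + Charset.defaultCharset());
        System.out.println("round trip intact: " + text.equals(roundTripViaTempFile(text)));
    }
}
```

For source files the same idea applies at compile time: `javac` accepts an `-encoding` option that states how the source is encoded instead of relying on the default.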

The default encoding for C# in my environment:

using System;
using System.Text;

namespace Text
{
    class Program
    {
        static void Main(string[] args)
        {
            // On the .NET Framework, Encoding.Default is the system's ANSI code page.
            Console.WriteLine(Encoding.Default.EncodingName);
            Console.ReadKey();
        }
    }
}

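Output:

[Figure: console output screenshot]

Now for something interesting: let's see which of the encodings C# supports can handle Chinese, reusing the program from the beginning: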

using System;
using System.IO;
using System.Text;

namespace Text
{
    class Program
    {
        static void Main(string[] args)
        {
            string testStr = "天添";
            StringBuilder sb = new StringBuilder();
            foreach (EncodingInfo coif in Encoding.GetEncodings())
            {
                // Encode and decode with the same encoding; characters it cannot
                // represent come back as '?'.
                Encoding encoding = coif.GetEncoding();
                byte[] desBytes = encoding.GetBytes(testStr);
                string desStr = encoding.GetString(desBytes);

                sb.Append(" Display Name: " + coif.DisplayName + "----Name: " + coif.Name + "----And The result is:  " + desStr + "\n");
            }
            byte[] coByte = Encoding.Unicode.GetBytes(sb.ToString());

            // FileMode.Create truncates any existing file before writing.
            using (FileStream fs = File.Open("c:\\code.txt", FileMode.Create, FileAccess.ReadWrite))
            {
                fs.Write(coByte, 0, coByte.Length);
            }
            Console.ReadKey();
        }
    }
}

Output:

  1  Display Name: IBM EBCDIC (美国-加拿大)----Name: IBM037----And The result is:  ??
  2  Display Name: OEM 美国----Name: IBM437----And The result is:  ??
  3  Display Name: IBM EBCDIC (国际)----Name: IBM500----And The result is:  ??
  4  Display Name: 阿拉伯字符(ASMO-708)----Name: ASMO-708----And The result is:  ??
  5  Display Name: 阿拉伯字符(DOS)----Name: DOS-720----And The result is:  ??
  6  Display Name: 希腊字符(DOS)----Name: ibm737----And The result is:  ??
  7  Display Name: 波罗的海字符(DOS)----Name: ibm775----And The result is:  ??
  8  Display Name: 西欧字符(DOS)----Name: ibm850----And The result is:  ??
  9  Display Name: 中欧字符(DOS)----Name: ibm852----And The result is:  ??
 10  Display Name: OEM 西里尔语----Name: IBM855----And The result is:  ??
 11  Display Name: 土耳其字符(DOS)----Name: ibm857----And The result is:  ??
 12  Display Name: OEM 多语言拉丁语 I----Name: IBM00858----And The result is:  ??
 13  Display Name: 葡萄牙语(DOS)----Name: IBM860----And The result is:  ??
 14  Display Name: 冰岛语(DOS)----Name: ibm861----And The result is:  ??
 15  Display Name: 希伯来字符(DOS)----Name: DOS-862----And The result is:  ??
 16  Display Name: 加拿大法语(DOS)----Name: IBM863----And The result is:  ??
 17  Display Name: 阿拉伯字符(864)----Name: IBM864----And The result is:  ??
 18  Display Name: 北欧字符(DOS)----Name: IBM865----And The result is:  ??
 19  Display Name: 西里尔字符(DOS)----Name: cp866----And The result is:  ??
 20  Display Name: 现代希腊字符(DOS)----Name: ibm869----And The result is:  ??
 21  Display Name: IBM EBCDIC (多语言拉丁语 2)----Name: IBM870----And The result is:  ??
 22  Display Name: 泰语(Windows)----Name: windows-874----And The result is:  ??
 23  Display Name: IBM EBCDIC (现代希腊语)----Name: cp875----And The result is:  ??
 24  Display Name: 日语(Shift-JIS)----Name: shift_jis----And The result is:  天添
 25  Display Name: 简体中文(GB2312)----Name: gb2312----And The result is:  天添
 26  Display Name: 朝鲜语----Name: ks_c_5601-1987----And The result is:  天添
 27  Display Name: 繁体中文(Big5)----Name: big5----And The result is:  天添
 28  Display Name: IBM EBCDIC (土耳其拉丁语 5)----Name: IBM1026----And The result is:  ??
 29  Display Name: IBM 拉丁语 1----Name: IBM01047----And The result is:  ??
 30  Display Name: IBM EBCDIC (美国-加拿大-欧洲)----Name: IBM01140----And The result is:  ??
 31  Display Name: IBM EBCDIC (德国-欧洲)----Name: IBM01141----And The result is:  ??
 32  Display Name: IBM EBCDIC (丹麦-挪威-欧洲)----Name: IBM01142----And The result is:  ??
 33  Display Name: IBM EBCDIC (芬兰-瑞典-欧洲)----Name: IBM01143----And The result is:  ??
 34  Display Name: IBM EBCDIC (意大利-欧洲)----Name: IBM01144----And The result is:  ??
 35  Display Name: IBM EBCDIC (西班牙-欧洲)----Name: IBM01145----And The result is:  ??
 36  Display Name: IBM EBCDIC (英国-欧洲)----Name: IBM01146----And The result is:  ??
 37  Display Name: IBM EBCDIC (法国-欧洲)----Name: IBM01147----And The result is:  ??
 38  Display Name: IBM EBCDIC (国际-欧洲)----Name: IBM01148----And The result is:  ??
 39  Display Name: IBM EBCDIC (冰岛语-欧洲)----Name: IBM01149----And The result is:  ??
 40  Display Name: Unicode----Name: utf-16----And The result is:  天添
 41  Display Name: Unicode (Big-Endian)----Name: utf-16BE----And The result is:  天添
 42  Display Name: 中欧字符(Windows)----Name: windows-1250----And The result is:  ??
 43  Display Name: 西里尔字符(Windows)----Name: windows-1251----And The result is:  ??
 44  Display Name: 西欧字符(Windows)----Name: Windows-1252----And The result is:  ??
 45  Display Name: 希腊字符(Windows)----Name: windows-1253----And The result is:  ??
 46  Display Name: 土耳其字符(Windows)----Name: windows-1254----And The result is:  ??
 47  Display Name: 希伯来字符(Windows)----Name: windows-1255----And The result is:  ??
 48  Display Name: 阿拉伯字符(Windows)----Name: windows-1256----And The result is:  ??
 49  Display Name: 波罗的海字符(Windows)----Name: windows-1257----And The result is:  ??
 50  Display Name: 越南字符(Windows)----Name: windows-1258----And The result is:  ??
 51  Display Name: 朝鲜语(Johab)----Name: Johab----And The result is:  天添
 52  Display Name: 西欧字符(Mac)----Name: macintosh----And The result is:  ??
 53  Display Name: 日语(Mac)----Name: x-mac-japanese----And The result is:  天添
 54  Display Name: 繁体中文(Mac)----Name: x-mac-chinesetrad----And The result is:  天添
 55  Display Name: 朝鲜语(Mac)----Name: x-mac-korean----And The result is:  天添
 56  Display Name: 阿拉伯字符(Mac)----Name: x-mac-arabic----And The result is:  ??
 57  Display Name: 希伯来字符(Mac)----Name: x-mac-hebrew----And The result is:  ??
 58  Display Name: 希腊字符(Mac)----Name: x-mac-greek----And The result is:  ??
 59  Display Name: 西里尔字符(Mac)----Name: x-mac-cyrillic----And The result is:  ??
 60  Display Name: 简体中文(Mac)----Name: x-mac-chinesesimp----And The result is:  天添
 61  Display Name: 罗马尼亚语(Mac)----Name: x-mac-romanian----And The result is:  ??
 62  Display Name: 乌克兰语(Mac)----Name: x-mac-ukrainian----And The result is:  ??
 63  Display Name: 泰语(Mac)----Name: x-mac-thai----And The result is:  ??
 64  Display Name: 中欧字符(Mac)----Name: x-mac-ce----And The result is:  ??
 65  Display Name: 冰岛语(Mac)----Name: x-mac-icelandic----And The result is:  ??
 66  Display Name: 土耳其字符(Mac)----Name: x-mac-turkish----And The result is:  ??
 67  Display Name: 克罗地亚语(Mac)----Name: x-mac-croatian----And The result is:  ??
 68  Display Name: Unicode (UTF-32)----Name: utf-32----And The result is:  天添
 69  Display Name: Unicode (UTF-32 Big-Endian)----Name: utf-32BE----And The result is:  天添
 70  Display Name: 繁体中文(CNS)----Name: x-Chinese-CNS----And The result is:  天添
 71  Display Name: TCA 台湾----Name: x-cp20001----And The result is:  天添
 72  Display Name: 繁体中文(Eten)----Name: x-Chinese-Eten----And The result is:  天添
 73  Display Name: IBM5550 台湾----Name: x-cp20003----And The result is:  天添
 74  Display Name: TeleText 台湾----Name: x-cp20004----And The result is:  天添
 75  Display Name: Wang 台湾----Name: x-cp20005----And The result is:  天添
 76  Display Name: 西欧字符(IA5)----Name: x-IA5----And The result is:  ??
 77  Display Name: 德语(IA5)----Name: x-IA5-German----And The result is:  ??
 78  Display Name: 瑞典语(IA5)----Name: x-IA5-Swedish----And The result is:  ??
 79  Display Name: 挪威语(IA5)----Name: x-IA5-Norwegian----And The result is:  ??
 80  Display Name: US-ASCII----Name: us-ascii----And The result is:  ??
 81  Display Name: T.61----Name: x-cp20261----And The result is:  ??
 82  Display Name: ISO-6937----Name: x-cp20269----And The result is:  ??
 83  Display Name: IBM EBCDIC (德国)----Name: IBM273----And The result is:  ??
 84  Display Name: IBM EBCDIC (丹麦-挪威)----Name: IBM277----And The result is:  ??
 85  Display Name: IBM EBCDIC (芬兰-瑞典)----Name: IBM278----And The result is:  ??
 86  Display Name: IBM EBCDIC (意大利)----Name: IBM280----And The result is:  ??
 87  Display Name: IBM EBCDIC (西班牙)----Name: IBM284----And The result is:  ??
 88  Display Name: IBM EBCDIC (UK)----Name: IBM285----And The result is:  ??
 89  Display Name: IBM EBCDIC (日语片假名)----Name: IBM290----And The result is:  ??
 90  Display Name: IBM EBCDIC (法国)----Name: IBM297----And The result is:  ??
 91  Display Name: IBM EBCDIC (阿拉伯语)----Name: IBM420----And The result is:  ??
 92  Display Name: IBM EBCDIC (希腊语)----Name: IBM423----And The result is:  ??
 93  Display Name: IBM EBCDIC (希伯来语)----Name: IBM424----And The result is:  ??
 94  Display Name: IBM EBCDIC (朝鲜语扩展)----Name: x-EBCDIC-KoreanExtended----And The result is:  ??
 95  Display Name: IBM EBCDIC (泰语)----Name: IBM-Thai----And The result is:  ??
 96  Display Name: 西里尔字符(KOI8-R)----Name: koi8-r----And The result is:  ??
 97  Display Name: IBM EBCDIC (冰岛语)----Name: IBM871----And The result is:  ??
 98  Display Name: IBM EBCDIC (西里尔俄语)----Name: IBM880----And The result is:  ??
 99  Display Name: IBM EBCDIC (土耳其语)----Name: IBM905----And The result is:  ??
100  Display Name: IBM 拉丁语 1----Name: IBM00924----And The result is:  ??
101  Display Name: 日语(JIS 0208-19900212-1990)----Name: EUC-JP----And The result is:  天添
102  Display Name: 简体中文(GB2312-80)----Name: x-cp20936----And The result is:  天添
103  Display Name: 朝鲜语 Wansung----Name: x-cp20949----And The result is:  天添
104  Display Name: IBM EBCDIC (西里尔塞尔维亚-保加利亚语)----Name: cp1025----And The result is:  ??
105  Display Name: 西里尔字符(KOI8-U)----Name: koi8-u----And The result is:  ??
106  Display Name: 西欧字符(ISO)----Name: iso-8859-1----And The result is:  ??
107  Display Name: 中欧字符(ISO)----Name: iso-8859-2----And The result is:  ??
108  Display Name: 拉丁语 3 (ISO)----Name: iso-8859-3----And The result is:  ??
109  Display Name: 波罗的海字符(ISO)----Name: iso-8859-4----And The result is:  ??
110  Display Name: 西里尔字符(ISO)----Name: iso-8859-5----And The result is:  ??
111  Display Name: 阿拉伯字符(ISO)----Name: iso-8859-6----And The result is:  ??
112  Display Name: 希腊字符(ISO)----Name: iso-8859-7----And The result is:  ??
113  Display Name: 希伯来字符(ISO-Visual)----Name: iso-8859-8----And The result is:  ??
114  Display Name: 土耳其字符(ISO)----Name: iso-8859-9----And The result is:  ??
115  Display Name: 爱沙尼亚语(ISO)----Name: iso-8859-13----And The result is:  ??
116  Display Name: 拉丁语 9 (ISO)----Name: iso-8859-15----And The result is:  ??
117  Display Name: 欧罗巴----Name: x-Europa----And The result is:  ??
118  Display Name: 希伯来字符(ISO-Logical)----Name: iso-8859-8-i----And The result is:  ??
119  Display Name: 日语(JIS)----Name: iso-2022-jp----And The result is:  天添
120  Display Name: 日语(JIS-允许 1 字节假名)----Name: csISO2022JP----And The result is:  天添
121  Display Name: 日语(JIS-允许 1 字节假名 - SO/SI)----Name: iso-2022-jp----And The result is:  天添
122  Display Name: 朝鲜语(ISO)----Name: iso-2022-kr----And The result is:  天添
123  Display Name: 简体中文(ISO-2022)----Name: x-cp50227----And The result is:  天添
124  Display Name: 日语(EUC)----Name: euc-jp----And The result is:  天添
125  Display Name: 简体中文(EUC)----Name: EUC-CN----And The result is:  天添
126  Display Name: 朝鲜语(EUC)----Name: euc-kr----And The result is:  天添
127  Display Name: 简体中文(HZ)----Name: hz-gb-2312----And The result is:  天添
128  Display Name: 简体中文(GB18030)----Name: GB18030----And The result is:  天添
129  Display Name: ISCII 梵文----Name: x-iscii-de----And The result is:  ??
130  Display Name: ISCII 孟加拉语----Name: x-iscii-be----And The result is:  ??
131  Display Name: ISCII 泰米尔语----Name: x-iscii-ta----And The result is:  ??
132  Display Name: ISCII 泰卢固语----Name: x-iscii-te----And The result is:  ??
133  Display Name: ISCII 阿萨姆语----Name: x-iscii-as----And The result is:  ??
134  Display Name: ISCII 奥里雅语----Name: x-iscii-or----And The result is:  ??
135  Display Name: ISCII 卡纳达语----Name: x-iscii-ka----And The result is:  ??
136  Display Name: ISCII 马拉雅拉姆语----Name: x-iscii-ma----And The result is:  ??
137  Display Name: ISCII 古吉拉特语----Name: x-iscii-gu----And The result is:  ??
138  Display Name: ISCII 旁遮普语----Name: x-iscii-pa----And The result is:  ??
139  Display Name: Unicode (UTF-7)----Name: utf-7----And The result is:  天添
140  Display Name: Unicode (UTF-8)----Name: utf-8----And The result is:  天添

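Looking through the list, 34 of the 140 encodings can round-trip the Chinese test string, and they include Japanese, Korean, and Taiwanese encodings. Interesting.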
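The same question can be asked in Java without doing a round trip. This is only a sketch, with a class name (`CanEncodeDemo`) and helper (`canRepresent`) of my own choosing: `CharsetEncoder.canEncode` reports whether a charset has a representation for every character in a string. The last two lines show where the `??` in the listing above comes from, since an unmappable character is silently replaced during encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CanEncodeDemo {
    // True if every character of text has a representation in the charset.
    static boolean canRepresent(Charset charset, String text) {
        return charset.canEncode() && charset.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        // U+5929 U+6DFB, the test string used in the C# program above.
        String text = "\u5929\u6dfb";

        Charset[] candidates = {
            StandardCharsets.US_ASCII,
            StandardCharsets.ISO_8859_1,
            StandardCharsets.UTF_8,
            StandardCharsets.UTF_16
        };
        for (Charset cs : candidates) {
            System.out.println(cs.name() + ": " + canRepresent(cs, text));
        }

        // String.getBytes replaces each unmappable character with the charset's
        // replacement bytes, which for US-ASCII is a single '?' (byte 63).
        byte[] lossy = text.getBytes(StandardCharsets.US_ASCII);
        System.out.println(new String(lossy, StandardCharsets.US_ASCII)); // ??
    }
}
```

Checking with `canEncode` first avoids losing data silently, which is what the `??` entries represent.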

Several encodings support Chinese, but do they really produce the same thing? Let's pick a few and look:

using System;
using System.Text;

namespace Text
{
    class Program
    {
        // Prints each byte of the array in decimal, for example [229][164][169].
        static void PrintBytes(string label, byte[] bytes)
        {
            Console.Write(label);
            foreach (byte b in bytes)
            {
                Console.Write("[{0}]", b);
            }
            Console.WriteLine();
        }

        static void Main(string[] args)
        {
            string testStr = "天添";

            Encoding ascii = new ASCIIEncoding();
            Encoding utf8 = new UTF8Encoding();
            // HZ-GB-2312 is a 7-bit encoding of the GB 2312 character set.
            Encoding hzGb2312 = Encoding.GetEncoding("hz-gb-2312");
            Encoding iso2022Jp = Encoding.GetEncoding("iso-2022-jp");

            Console.WriteLine("Original string: " + testStr);
            byte[] asciiBytes = ascii.GetBytes(testStr);
            byte[] utf8Bytes = utf8.GetBytes(testStr);
            byte[] gb2312Bytes = hzGb2312.GetBytes(testStr);
            byte[] jpBytes = iso2022Jp.GetBytes(testStr);
            PrintBytes("ASCII bytes: ", asciiBytes);
            PrintBytes("UTF-8 bytes: ", utf8Bytes);
            PrintBytes("hz-gb-2312 bytes: ", gb2312Bytes);
            PrintBytes("iso-2022-jp bytes: ", jpBytes);

            // Decode each byte array with the encoding that produced it.
            Console.WriteLine("ASCII result: " + ascii.GetString(asciiBytes));
            Console.WriteLine("UTF-8 result: " + utf8.GetString(utf8Bytes));
            Console.WriteLine("hz-gb-2312 result: " + hzGb2312.GetString(gb2312Bytes));
            Console.WriteLine("iso-2022-jp result: " + iso2022Jp.GetString(jpBytes));
            Console.ReadKey();
        }
    }
}

 

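Result:

[Figure: console output screenshot]

An observation:

     Even though UTF-8 and GB 2312 (encoded here as hz-gb-2312) both decode back to the original text, the byte arrays they produce in between are different. That is easy to understand: they are different character encodings.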
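A consequence worth spelling out is that bytes only decode correctly with the encoding that produced them. Here is a small Java sketch of that, under a couple of assumptions: the class `CompareEncodingsDemo` and the helper `transcode` are names I invented, and since GB2312 is not guaranteed to be present in every Java runtime, the code checks for it before use. Note that Java prints bytes as signed values, so the numbers look different from the unsigned ones C# prints.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class CompareEncodingsDemo {
    // Encodes text with one charset and decodes the bytes with another.
    static String transcode(String text, Charset encodeWith, Charset decodeWith) {
        return new String(text.getBytes(encodeWith), decodeWith);
    }

    public static void main(String[] args) {
        // U+5929 U+6DFB, the test string used in the C# program above.
        String text = "\u5929\u6dfb";
        Charset utf8 = StandardCharsets.UTF_8;

        // Three bytes per character in UTF-8, six in total.
        System.out.println("UTF-8 bytes : " + Arrays.toString(text.getBytes(utf8)));

        // GB2312 is optional in some Java runtimes, so check before using it.
        if (Charset.isSupported("GB2312")) {
            Charset gb2312 = Charset.forName("GB2312");
            // Two bytes per character in GB2312, four in total.
            System.out.println("GB2312 bytes: " + Arrays.toString(text.getBytes(gb2312)));
            System.out.println("GB2312 read as GB2312: " + text.equals(transcode(text, gb2312, gb2312)));
            System.out.println("GB2312 read as UTF-8 : " + text.equals(transcode(text, gb2312, utf8)));
        }
    }
}
```

Reading the bytes back with the same charset restores the text, while reading them with a different one gives garbled characters rather than an error.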

   

Next, let's look at the Encoding classes that the .NET Framework provides for handling encodings:

  • ASCIIEncoding
  • UTF7Encoding
  • UTF8Encoding
  • UnicodeEncoding(UTF-16)
  • UTF32Encoding

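  ASCIIEncoding and UTF8Encoding were used briefly above, so let's try the other three. While experimenting I noticed some small differences, which correspond to two known issues with Unicode encodings:

     The NUL problem: C uses the NUL byte (0x00) to mark the end of a string, whereas C# strings store their length and may contain NUL. In UTF-16 and UTF-32, ordinary characters such as ASCII letters are encoded with 0x00 bytes, so C-style string functions cut such text short. (I am not very familiar with this one.)

     The byte order problem: there are two ways to store a 16-bit integer. One is little endian, where the low-order 8 bits come first; Intel x86 CPUs are designed this way. The other is called big endian, typified by Sun's SPARC CPUs. This causes a problem: data written in one byte order has to be byte-swapped before a CPU of the other kind can use it, which costs extra time, so it matters which order is chosen.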

using System;
using System.Text;

namespace Text
{
    class Program
    {
        // Prints each byte of the array in decimal.
        static void PrintBytes(string label, byte[] bytes)
        {
            Console.Write(label);
            foreach (byte b in bytes)
            {
                Console.Write("[{0}]", b);
            }
            Console.WriteLine();
        }

        static void Main(string[] args)
        {
            string testStr = "天添";

            // First argument: true selects big endian. Second argument: whether
            // GetPreamble() returns a byte order mark (GetBytes never adds one).
            UnicodeEncoding unicodingBigEnd = new UnicodeEncoding(true, true);
            UnicodeEncoding unicodingLittleEnd = new UnicodeEncoding(false, true);
            Console.WriteLine("Original string: " + testStr);
            byte[] unicodingBigEndBytes = unicodingBigEnd.GetBytes(testStr);
            PrintBytes("BigEnd bytes: ", unicodingBigEndBytes);
            byte[] unicodingLittleBytes = unicodingLittleEnd.GetBytes(testStr);
            PrintBytes("Little bytes: ", unicodingLittleBytes);

            string unicodeBigEnd = Encoding.GetEncoding("utf-16BE").GetString(unicodingBigEndBytes);
            string unicodeLittleEnd = Encoding.GetEncoding("utf-16").GetString(unicodingLittleBytes);

            Console.WriteLine("BigEnd result: " + unicodeBigEnd);
            Console.WriteLine("Little result: " + unicodeLittleEnd);
            Console.ReadKey();
        }
    }
}

The result:

[Figure: console output screenshot]

As expected, the byte order is different. UTF32Encoding has the same issue.
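Both issues, byte order and NUL bytes, show up in a few lines of Java. The following is a sketch of my own (the class `ByteOrderDemo` and helper `hex` are not from any library): it encodes the two-character string made of the letter A and U+5929 in the two UTF-16 byte orders, in Java's plain `UTF-16` charset, which prefixes a byte order mark when encoding, and in UTF-8 for comparison.

```java
import java.nio.charset.StandardCharsets;

class ByteOrderDemo {
    // Formats bytes as space-separated uppercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The letter A followed by U+5929.
        String text = "A\u5929";

        // Same 16-bit code units, opposite byte order within each unit.
        System.out.println("UTF-16BE: " + hex(text.getBytes(StandardCharsets.UTF_16BE))); // 00 41 59 29
        System.out.println("UTF-16LE: " + hex(text.getBytes(StandardCharsets.UTF_16LE))); // 41 00 29 59

        // When encoding, Java's "UTF-16" writes a byte order mark (FE FF)
        // and then big-endian data, so a reader can tell the order.
        System.out.println("UTF-16  : " + hex(text.getBytes(StandardCharsets.UTF_16)));   // FE FF 00 41 59 29

        // UTF-8 is a byte sequence with no byte order, and 'A' has no 00 byte.
        System.out.println("UTF-8   : " + hex(text.getBytes(StandardCharsets.UTF_8)));    // 41 E5 A4 A9
    }
}
```

The `00` next to `41` in the UTF-16 output is the NUL byte that would end a C string early, and the byte order mark is one common way to tell a reader which order was used.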

 

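I seem to have said a lot and yet nothing at all, and in a rather messy way. Still, I feel I have gained a slightly better understanding of encodings. I am not sure whether my understanding is right, so discussion is welcome. Here is a diagram:

[Figure: diagram]

How programming languages handle text data with the UCS approach and the CSI approach is a topic for another time.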

 

 

 

   

      

      

 
