
Python functions for getting system-related encodings

Published: 2023/12/2


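How do you avoid errors like UnicodeEncodeError: 'ascii' codec can't…?

1. First, declare the file's encoding at the top of the .py file, for example: # coding: utf8

2. Save the file in the same encoding that the header of the .py file declares.

3. When using decode and encode, always be sure of the original encoding of the characters you are converting.

For example, web pages specify their encoding (<meta http-equiv=content-type content="text/html; charset=gb2312">), so take care when you fetch such a site and convert the encoding of its HTML: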

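    import urllib2

    # url is the address of the page to fetch
    html = urllib2.urlopen(url).read()  # read() returns the raw bytes of the response

    html = html.decode('gb2312')  # decode with the charset the page declares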

Follow the three points above and encoding conversion errors should not occur.
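The third point can be sketched as follows. This is a minimal Python 3 illustration (the article itself targets Python 2). The helper name decode_html, the regular expression, and the UTF-8 fallback are assumptions made for the example, not part of the original article.

```python
import re

# Hypothetical pattern: picks up the charset declared in a <meta> tag.
CHARSET_RE = re.compile(rb'charset=["\']?([\w-]+)', re.IGNORECASE)


def decode_html(raw, fallback="utf-8"):
    """Decode raw HTML bytes using the charset the page declares."""
    match = CHARSET_RE.search(raw)
    # Assumption: fall back to UTF-8 when the page declares no charset.
    charset = match.group(1).decode("ascii") if match else fallback
    return raw.decode(charset)


page = '<meta http-equiv=content-type content="text/html; charset=gb2312"><p>中文</p>'
raw_bytes = page.encode("gb2312")  # stands in for the bytes a server would send
html = decode_html(raw_bytes)
```

The point of the sketch is that the codec name comes from the data itself rather than from a guess, which is what "be sure of the original encoding" amounts to in practice.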

Python recommends keeping every variable in Python code as unicode. The flow can be written as: variable (convert to unicode) -> Python code -> variable (convert to another encoding)
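One way to picture this flow is a function that decodes at the input boundary, works only on text, and encodes at the output boundary. The sketch below is written for Python 3, where str plays the role of Python 2's unicode. The function name transcode and the strip() step are illustrative assumptions.

```python
def transcode(data, source_encoding, target_encoding):
    """Decode at the input boundary, work on text, encode at the output boundary."""
    text = data.decode(source_encoding)   # bytes -> text (unicode)
    text = text.strip()                   # all processing happens on text
    return text.encode(target_encoding)   # text -> bytes in the target encoding


gbk_bytes = " 中文 ".encode("gbk")
utf8_bytes = transcode(gbk_bytes, "gbk", "utf-8")
```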

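sys.getdefaultencoding() returns the system default encoding (usually ascii). Python's default encoding for source code is also ASCII, which is why an encoding declaration belongs at the top of every .py file: # coding:utf-8

Functions for getting the system's encoding settings in Python

System default encoding (usually ascii): sys.getdefaultencoding()
Current system encoding: locale.getdefaultlocale()
Encoding temporarily changed in code (via locale.setlocale(locale.LC_ALL, "zh_CN.UTF-8")): locale.getlocale()
File system encoding: sys.getfilesystemencoding()
Terminal input encoding: sys.stdin.encoding
Terminal output encoding: sys.stdout.encoding
Default encoding of the source code: the file header # -*- coding: utf-8 -*-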
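The calls listed above can be gathered into one small report. The following is a sketch for Python 3, not the article's Python 2. The function name encoding_report is hypothetical, and locale.getpreferredencoding(False) is swapped in for locale.getdefaultlocale(), which is deprecated in recent Python 3 releases.

```python
import locale
import sys


def encoding_report():
    """Collect the encoding-related settings listed above into a dict."""
    try:
        current_locale = locale.getlocale()
    except ValueError:
        # Some environments report a locale name Python cannot parse.
        current_locale = (None, None)
    return {
        "default": sys.getdefaultencoding(),
        "filesystem": sys.getfilesystemencoding(),
        "locale": current_locale,
        # Assumption: used here in place of locale.getdefaultlocale().
        "preferred": locale.getpreferredencoding(False),
        # Streams may be replaced or missing, so read the attribute defensively.
        "stdin": getattr(sys.stdin, "encoding", None),
        "stdout": getattr(sys.stdout, "encoding", None),
    }


for name, value in sorted(encoding_report().items()):
    print("%-10s %s" % (name, value))
```

Note that on Python 3 the "default" entry is utf-8 rather than ascii, so the values will differ from what the Python 2 calls in the list report.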

Source: http://justpy.com/archives/144

(2)

http://www.cnblogs.com/itrust/archive/2010/05/14/1735185.html

Strings

Python has two kinds of strings:

    byteString = "hello world! (in my default locale)"
    unicodeString = u"hello Unicode world!"

Converting between them

    s = "hello normal string"
    u = unicode(s, "utf-8")
    backToBytes = u.encode("utf-8")
    backToUtf8 = backToBytes.decode('utf-8')  # same effect as the second line

How to tell them apart

    isinstance(s, str)         # False for unicode strings
    isinstance(s, unicode)     # True for unicode strings
    isinstance(s, basestring)  # True for both kinds of string
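For readers on Python 3, the same two kinds of string exist under different names: str holds text (Python 2's unicode) and bytes holds encoded data (Python 2's str). The sketch below shows the round trip and the type check under that assumption. The helper name kind is hypothetical.

```python
text = "hello Unicode world!"       # str: text (Python 2's unicode)
data = text.encode("utf-8")         # bytes: encoded data (Python 2's str)
round_trip = data.decode("utf-8")   # back to text


def kind(value):
    """Return a label for the two string types."""
    if isinstance(value, str):
        return "text"
    if isinstance(value, bytes):
        return "bytes"
    return "other"
```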

Try an experiment

    # -*- coding: utf-8 -*-
    import sys

    print 'default encoding: ', sys.getdefaultencoding()
    print 'file system encoding: ', sys.getfilesystemencoding()
    print 'stdout encoding: ', sys.stdout.encoding
    print u'u"中文" is unicode: ', isinstance(u'中文', unicode)
    print u'"中文" is unicode: ', isinstance('中文', unicode)

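Look at the output and note the following facts:

Python's system default encoding is ASCII. Python uses this default when it converts strings. Two examples:

1. a = "abc" + u"bcd": Python converts this as "abc".decode(sys.getdefaultencoding()) and then joins the two unicode strings.

2. print unicode('中文'): this statement fails with "UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 …" because Python tries to decode with the default encoding and the string is not ASCII. You therefore have to name the encoding explicitly. If your source file is saved as utf-8, write: print unicode('中文', 'utf-8')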
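The failure in the second example can be reproduced without relying on any default encoding. The following Python 3 sketch decodes the UTF-8 bytes of '中文' once with the ascii codec and once with utf-8. The helper name try_decode is an assumption made for illustration.

```python
utf8_bytes = "中文".encode("utf-8")


def try_decode(data, encoding):
    """Return (text, None) on success or (None, error) on failure."""
    try:
        return data.decode(encoding), None
    except UnicodeDecodeError as exc:
        return None, exc


text, error = try_decode(utf8_bytes, "ascii")   # fails on the first byte, 0xe4
explicit, _ = try_decode(utf8_bytes, "utf-8")   # succeeds with the right codec
```

The error object records where decoding stopped (error.start), which helps when the bad byte is buried deep inside a larger input.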

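On Windows, getfilesystemencoding returns mbcs (multi-byte character set). Windows mbcs, also called ANSI, maps to a different encoding on each language edition of Windows; on Chinese Windows it is an encoding of the GB family.

On Chinese Windows the console encoding is cp936, and Python converts automatically when you print to the console. This leads to an interesting problem. Try this simple example, test.py: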

    # -*- coding: utf-8 -*-
    s = u'中文'
    print s

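Run both python test.py and python test.py > 1.txt in the console.

You will find that the second command raises an error. When printing to the console, Python automatically converts to sys.stdout.encoding. When output is redirected to a file, sys.stdout.encoding is None, Python does not perform this conversion in the write call, and the unicode string is encoded with the default ASCII codec, which fails. The PrintFails page covers this problem in more detail.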
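A common way around this is to encode explicitly before writing, instead of relying on the stream's encoding. The sketch below is Python 3 and uses an in-memory byte stream in place of the redirected file. The function name write_text and the UTF-8 fallback are assumptions, not something the original article prescribes.

```python
import io


def write_text(text, byte_stream, encoding=None):
    """Encode text explicitly before writing it to a byte stream."""
    # Assumption: fall back to UTF-8 when the stream reports no encoding,
    # as happens to sys.stdout.encoding in Python 2 under redirection.
    data = text.encode(encoding or "utf-8")
    byte_stream.write(data)
    return len(data)


redirected = io.BytesIO()   # stands in for the redirected output file
written = write_text("中文", redirected)

console_like = io.BytesIO()  # stands in for a cp936/gbk console
write_text("中文", console_like, "gbk")
```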

The UTF-8 encoding format

Reading a UTF-8 file

    import codecs
    fileObj = codecs.open("someFile", "r", "utf-8")
    u = fileObj.read()  # Returns a Unicode string from the UTF-8 bytes in the file

Writing the BOM yourself

    import codecs
    out = open("someFile", "wb")  # binary mode so the bytes are written unchanged
    out.write(codecs.BOM_UTF8)
    out.write(unicodeString.encode("utf-8"))
    out.close()
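To check that the BOM really lands at the start of the file, you can write to a scratch file and read the raw bytes back. This is a Python 3 sketch. The function name write_utf8_with_bom and the temporary file are assumptions introduced for the example.

```python
import codecs
import os
import tempfile


def write_utf8_with_bom(path, text):
    """Write text as UTF-8, preceded by the three-byte BOM."""
    with open(path, "wb") as out:
        out.write(codecs.BOM_UTF8)
        out.write(text.encode("utf-8"))


fd, demo_path = tempfile.mkstemp(suffix=".txt")  # hypothetical scratch file
os.close(fd)
try:
    write_utf8_with_bom(demo_path, "hello Unicode world!")
    with open(demo_path, "rb") as check:
        raw = check.read()   # raw bytes, BOM included
finally:
    os.remove(demo_path)
```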

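Stripping the BOM yourself

For UTF-16, Python decodes the BOM to an empty string. For UTF-8, however, the BOM is decoded to a character, for example:

    >>> import codecs
    >>> codecs.BOM_UTF16.decode("utf16")
    u''
    >>> codecs.BOM_UTF8.decode("utf8")
    u'\ufeff'

The utf16 codec consumes the BOM to determine the byte order, whereas the utf8 codec treats it as an ordinary character, so you need to strip the BOM yourself when reading a file: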

    import codecs

    # s is a byte string and u is a unicode string obtained elsewhere
    if s.startswith(codecs.BOM_UTF8):
        # The byte string s begins with the BOM: do something.
        # For example, decode the string as UTF-8.
        pass

    if u[:1] == unicode(codecs.BOM_UTF8, "utf8"):
        # The unicode string begins with the BOM: do something.
        # For example, remove the character.
        pass

    # Strip the BOM from the beginning of the unicode string, if it exists.
    # Strings are immutable, so keep the returned value.
    u = u.lstrip(unicode(codecs.BOM_UTF8, "utf8"))
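An alternative to stripping by hand is the "utf-8-sig" codec, which removes a leading BOM when one is present and otherwise behaves like utf-8. The sketch below is Python 3 and is offered as another option, not as the method the original article uses.

```python
import codecs

with_bom = codecs.BOM_UTF8 + "中文".encode("utf-8")
without_bom = "中文".encode("utf-8")

plain = with_bom.decode("utf-8")             # BOM survives as U+FEFF
stripped = with_bom.decode("utf-8-sig")      # BOM removed by the codec
unchanged = without_bom.decode("utf-8-sig")  # no BOM: decodes normally
```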

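Encoding of source files

PEP 0263 explains clearly how Python handles the encoding of source files. The key points are excerpted below.

Python assumes by default that a file is ASCII encoded.

You can declare the file's encoding on the first or second line of the source to tell Python which encoding the file uses, for example:

    # -*- coding: utf-8 -*-   # Check your editor to make sure the file is actually saved in this encoding

On platforms such as Windows, a BOM (the first three bytes of the file, \xef\xbb\xbf) is used to mark a file as utf-8 encoded. In that case:

If the file has no encoding declaration, Python treats it as utf8.
If it has an encoding declaration other than utf-8, Python reports an error.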
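These rules can be observed through the standard library's tokenize.detect_encoding, which applies the PEP 263 detection logic to the first lines of a source file. The sketch below assumes Python 3, where the default without a BOM or declaration is utf-8 rather than ASCII. The wrapper name source_encoding is hypothetical.

```python
import codecs
import io
import tokenize


def source_encoding(source_bytes):
    """Return the encoding Python 3 would use for this source, per PEP 263."""
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding


declared = source_encoding(b"# -*- coding: gbk -*-\nx = 1\n")
bom_only = source_encoding(codecs.BOM_UTF8 + b"x = 1\n")

try:
    source_encoding(codecs.BOM_UTF8 + b"# -*- coding: gbk -*-\nx = 1\n")
    conflict = None
except SyntaxError as exc:
    conflict = str(exc)   # a BOM plus a non-UTF-8 declaration is rejected
```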

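============== Additionally, about the BOM ================

(3)

Some software, such as Notepad, inserts three invisible bytes (0xEF 0xBB 0xBF, the BOM) at the start of a file when saving it as UTF-8.
We therefore have to strip these bytes ourselves when reading, and Python's codecs module defines a constant for this: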

    # coding=gbk
    import codecs

    data = open("Test.txt", "rb").read()  # read the raw bytes
    if data[:3] == codecs.BOM_UTF8:
        data = data[3:]  # drop the three-byte BOM
    print data.decode("utf-8")
