Python3 UnicodeEncodeError: 'gbk' codec can't encode character '\U0001f495' in position 16: illegal

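While analyzing reviews for a product, I found that some of them contain emoji and other non-text characters. Saving them to a txt file raised this error: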

UnicodeEncodeError: 'gbk' codec can't encode character '\U0001f495' in position 16: illegal 

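The error means that some characters cannot be encoded in GBK. When open() is called without an encoding argument, Python 3 uses the locale's default encoding, which is typically GBK on Chinese-language Windows, so the write fails on the first emoji. These characters therefore have to be filtered out. Below is the simplest, most brute-force way I found to do it; it is offered for reference only.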
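Before the workaround, here is a minimal reproduction of the failure that needs neither the database nor a file. The sample string is made up for illustration; only the emoji code point comes from the error message above.

```python
# Hypothetical review text ending in the emoji from the error message.
text = "great product \U0001f495"

error = None
try:
    # GBK has no mapping for U+1F495, so strict encoding fails.
    text.encode("gbk")
except UnicodeEncodeError as exc:
    error = exc
    # exc.start is the index of the first character that could not be encoded.
    print(exc.encoding, exc.start, hex(ord(exc.object[exc.start])))
```

The same exception is raised by f.write() when the file was opened with GBK as its encoding, whether explicitly or through the locale default.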

Only the key code is shown:

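# collection is assumed to be an existing MongoDB collection handle (not shown here)
result = collection.find({"__time": {"$regex": "2018-11-30"}}, ["product_id", "content"])

# Open the file once in append mode instead of reopening it for every record
with open("nlptest.txt", "a") as f:
    for i in result:
        # Drop characters GBK cannot encode (e.g. emoji), then turn the bytes back into a str
        content = i["content"].encode("gbk", "ignore").decode("gbk")
        f.write(i["product_id"] + "|" + content + "\n")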

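(Python 3) The key part is encode('gbk', 'ignore').decode('gbk'). When the string is encoded to GBK, the 'ignore' error handler silently drops every character that cannot be encoded; decoding the resulting bytes with GBK then gives back a str that is safe to write.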
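The filtering step can be sketched in isolation as follows. The helper name strip_non_gbk, the sample string, and the temporary file path are assumptions made for illustration and are not part of the original script. The second half shows a variation rather than the author's method: passing errors="ignore" to open() lets the file object drop unencodable characters itself.

```python
import os
import tempfile


def strip_non_gbk(text: str) -> str:
    """Return text with every character GBK cannot represent removed."""
    return text.encode("gbk", "ignore").decode("gbk")


# Hypothetical review text: four GBK-encodable characters plus one emoji.
sample = "质量很好\U0001f495"
cleaned = strip_non_gbk(sample)

# Variation: declare the encoding and error handler on the file itself,
# so unencodable characters are dropped at write time.
path = os.path.join(tempfile.mkdtemp(), "nlptest_demo.txt")
with open(path, "w", encoding="gbk", errors="ignore") as f:
    f.write(sample + "\n")

with open(path, encoding="gbk") as f:
    written = f.read()
```

Both routes discard the emoji permanently. If the emoji matter for the later analysis, opening the file with encoding="utf-8" would preserve them instead, at the cost of the file no longer being GBK.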