75. More Natural Language Processing Frameworks

75.1. Introduction

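Earlier we introduced recurrent neural networks and used a deep learning framework to complete a text classification task. In practice, general-purpose deep learning frameworks offer less built-in support for natural language processing than for computer vision. In this lab we focus on several tools commonly used in natural language processing. They tend to become indispensable as you go deeper into the field.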

75.2. Key Points

  • Natural Language Toolkit

  • PyTorch Flair

  • Natural language processing tools

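Good work needs good tools. In the spirit of not reinventing the wheel, this course teaches the tools commonly used when applying machine learning, such as the NumPy scientific computing library, the Pandas data analysis library, and the scikit-learn machine learning library, as well as the TensorFlow and PyTorch frameworks used in deep learning.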

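Most natural language processing today incorporates deep learning methods. However, the commonly used TensorFlow and PyTorch lean toward computer vision in their design, and in particular ship with a large set of image preprocessing utilities. Natural language processing has its own difficulties, the most obvious being that different languages may call for different methods. In this lab we therefore learn several tools designed specifically for natural language processing, which also supplements the earlier NLP labs.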

75.3. Natural Language Toolkit

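Natural Language Toolkit, abbreviated NLTK, is, as the name suggests, a toolbox for natural language processing. At present NLTK is mainly used for processing text in English and other languages written in the Latin alphabet.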

For example, we can use NLTK to tokenize English text. First we need to download NLTK's data packages: call nltk.download() to pick the packages you need, or run python -m nltk.downloader all to download all of them.

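Because access to overseas servers is slow, here we download only the Punkt Tokenizer Models package required for English tokenization, from the course mirror server.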

import nltk
nltk.download('punkt')  # Download the data package required for English tokenization
True

Next, use nltk.tokenize.word_tokenize to tokenize the English text.

from nltk.tokenize import word_tokenize

text = """
[English] is a West Germanic language that was first spoken in early 
medieval England and eventually became a global lingua franca.
It is named after the <Angles>, one of the Germanic tribes that 
migrated to the area of Great Britain that later took their name, 
as England.
"""

tokens = word_tokenize(text)
print(tokens)
['[', 'English', ']', 'is', 'a', 'West', 'Germanic', 'language', 'that', 'was', 'first', 'spoken', 'in', 'early', 'medieval', 'England', 'and', 'eventually', 'became', 'a', 'global', 'lingua', 'franca', '.', 'It', 'is', 'named', 'after', 'the', '<', 'Angles', '>', ',', 'one', 'of', 'the', 'Germanic', 'tribes', 'that', 'migrated', 'to', 'the', 'area', 'of', 'Great', 'Britain', 'that', 'later', 'took', 'their', 'name', ',', 'as', 'England', '.']

If you only need to split the text into sentences, use nltk.sent_tokenize.

from nltk import sent_tokenize

sent_tokenize(text)
['\n[English] is a West Germanic language that was first spoken in early \nmedieval England and eventually became a global lingua franca.',
 'It is named after the <Angles>, one of the Germanic tribes that \nmigrated to the area of Great Britain that later took their name, \nas England.']
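The sentences returned above still contain the line breaks from the original string. The following is a minimal sketch of one way to normalize them using only Python's standard library. The sample sentences are shortened copies of the output above, and the variable names are illustrative assumptions.

```python
# Shortened copies of the sentences returned by sent_tokenize above;
# note the embedded "\n" characters.
sentences = [
    "\n[English] is a West Germanic language that was first spoken in early \nmedieval England.",
    "It is named after the <Angles>, one of the Germanic tribes that \nmigrated to the area.",
]

# str.split() with no arguments splits on any run of whitespace, so joining
# the pieces with single spaces removes newlines and collapses repeated spaces.
cleaned = [" ".join(s.split()) for s in sentences]

for s in cleaned:
    print(s)
```

Normalizing after sentence splitting, rather than before, leaves the input to sent_tokenize untouched.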

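The tokenization result can then be filtered. For example, to remove punctuation, iterate over the tokens and keep only the alphabetic ones. This uses Python's built-in string method .isalpha.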

tokens = word_tokenize(text)
# Keep only alphabetic tokens
words = [word for word in tokens if word.isalpha()]
print(words)
['English', 'is', 'a', 'West', 'Germanic', 'language', 'that', 'was', 'first', 'spoken', 'in', 'early', 'medieval', 'England', 'and', 'eventually', 'became', 'a', 'global', 'lingua', 'franca', 'It', 'is', 'named', 'after', 'the', 'Angles', 'one', 'of', 'the', 'Germanic', 'tribes', 'that', 'migrated', 'to', 'the', 'area', 'of', 'Great', 'Britain', 'that', 'later', 'took', 'their', 'name', 'as', 'England']
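One caveat about the .isalpha filter is that it is Unicode-aware: it keeps any token made up entirely of alphabetic characters, not only English letters, and it drops hyphenated words and numbers. The sketch below illustrates this with a small hand-made token list (an assumption for illustration, not output from the tutorial) and shows how combining it with str.isascii, available since Python 3.7, restricts the result to ASCII letters.

```python
# A hand-made token list for illustration only.
tokens = ["English", "lingua", "franca", ".", "2024", "co-operate", "机器"]

# str.isalpha() accepts any alphabetic Unicode character, so the Chinese
# token is kept, while punctuation, digits, and hyphenated words are dropped.
alphabetic = [t for t in tokens if t.isalpha()]

# Adding str.isascii() keeps only tokens made of ASCII letters.
ascii_alphabetic = [t for t in tokens if t.isascii() and t.isalpha()]

print(alphabetic)        # ['English', 'lingua', 'franca', '机器']
print(ascii_alphabetic)  # ['English', 'lingua', 'franca']
```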

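We can also remove English stop words. This requires downloading the stop words package and loading it with nltk.corpus.stopwords. In the lab environment the stop word data is already included in the package downloaded at the start. Locally, you can load it with the following code: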

from nltk.corpus import stopwords

nltk.download('stopwords')  # Install the stop words package
stop_words = stopwords.words("english")  # Load the English stop words
print(stop_words)
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

At present, this package provides stop words for Dutch, German, Italian, Portuguese, Swedish, Arabic, English, Greek, Kazakh, Romanian, Turkish, Azerbaijani, Finnish, Hungarian, Nepali, Russian, Danish, French, Indonesian, Norwegian, Spanish, and other languages. Unfortunately, it does not include a list of common Chinese stop words.

In the same way, we can remove stop words by iterating over the tokens.

words_ = [w for w in words if w not in stop_words]
print(words_)
['English', 'West', 'Germanic', 'language', 'first', 'spoken', 'early', 'medieval', 'England', 'eventually', 'became', 'global', 'lingua', 'franca', 'It', 'named', 'Angles', 'one', 'Germanic', 'tribes', 'migrated', 'area', 'Great', 'Britain', 'later', 'took', 'name', 'England']
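Note that It survives in the result above: the stop word list is all lowercase and the membership test is case-sensitive. The following is a minimal sketch of a case-insensitive variant. The small sample_stop_words set and the remove_stop_words helper are assumptions for illustration. In the tutorial the full list comes from stopwords.words("english").

```python
# A tiny hand-written sample of stop words (hypothetical; the real list
# is loaded from NLTK in the tutorial).
sample_stop_words = {"it", "is", "a", "the", "of", "that", "after"}

words = ["It", "is", "named", "after", "the", "Angles",
         "one", "of", "the", "Germanic", "tribes"]


def remove_stop_words(tokens, stop_words):
    """Drop tokens whose lowercase form appears in stop_words."""
    stop_set = set(stop_words)  # set lookup is O(1) per token
    return [t for t in tokens if t.lower() not in stop_set]


filtered = remove_stop_words(words, sample_stop_words)
print(filtered)  # ['named', 'Angles', 'one', 'Germanic', 'tribes']
```

Converting the list to a set also avoids scanning the whole stop word list once per token.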

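In addition, NLTK makes word frequency counting easy: nltk.FreqDist returns a dictionary-like object of word counts, displayed in descending order of frequency.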

from nltk import FreqDist

FreqDist(tokens)
FreqDist({'that': 3, 'the': 3, 'is': 2, 'a': 2, 'Germanic': 2, 'England': 2, '.': 2, ',': 2, 'of': 2, '[': 1, ...})
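FreqDist behaves much like Python's collections.Counter, on which it is built. As a rough standard-library sketch of the same idea, the snippet below counts tokens in a punctuation-free fragment of the example text. The simple str.split tokenization is an assumption made to keep the example self-contained.

```python
from collections import Counter

# A punctuation-free fragment of the example text, split on whitespace.
fragment = ("It is named after the Angles one of the Germanic tribes that "
            "migrated to the area of Great Britain that later took their name")
tokens = fragment.split()

freq = Counter(tokens)

# most_common(n) returns the n most frequent items as (token, count) pairs.
print(freq.most_common(1))  # [('the', 3)]
print(freq["that"])         # 2
```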

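NLTK's many packages enable more advanced applications. For example, PorterStemmer extracts the stem of each word in a sentence. Stemming is a concept from linguistic morphology: its goal is to strip affixes and obtain the stem. For example, the stem of Germanic is german. The result is not always a dictionary word, as languag and earli in the output below show.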

from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()
stemmed = [porter.stem(word) for word in tokens]
print(stemmed)
['[', 'english', ']', 'is', 'a', 'west', 'german', 'languag', 'that', 'wa', 'first', 'spoken', 'in', 'earli', 'mediev', 'england', 'and', 'eventu', 'becam', 'a', 'global', 'lingua', 'franca', '.', 'it', 'is', 'name', 'after', 'the', '<', 'angl', '>', ',', 'one', 'of', 'the', 'german', 'tribe', 'that', 'migrat', 'to', 'the', 'area', 'of', 'great', 'britain', 'that', 'later', 'took', 'their', 'name', ',', 'as', 'england', '.']
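To give a feel for what suffix stripping means, here is a deliberately naive sketch. It is not the Porter algorithm, which applies several ordered rule phases with conditions on the remaining stem. The naive_stem function, its suffix list, and the minimum stem length are all assumptions for illustration.

```python
def naive_stem(word, suffixes=("ing", "ed", "ic", "s"), min_stem=3):
    """Lowercase the word and strip the first matching suffix.

    A toy rule set for illustration only, NOT the Porter algorithm.
    """
    word = word.lower()
    for suffix in suffixes:
        stem = word[: -len(suffix)]
        # Only strip when enough of the word remains to be meaningful.
        if word.endswith(suffix) and len(stem) >= min_stem:
            return stem
    return word


words = ["Germanic", "tribes", "migrated", "is", "spoken"]
stems = [naive_stem(w) for w in words]
print(stems)  # ['german', 'tribe', 'migrat', 'is', 'spoken']
```

A fixed suffix list like this breaks down quickly on real text, which is why a tested implementation such as PorterStemmer is the better choice in practice.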

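Finally, we strongly recommend reading and working through the book by the NLTK authors, Analyzing Text with the Natural Language Toolkit. Its coverage is comprehensive, and it will help a great deal if you often process English text. A Chinese translation, 《Python 自然语言处理》, is published by 人民邮电出版社 (Posts & Telecom Press).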

75.4. Flair

Flair is a natural language processing framework that has emerged in recent years and belongs to the PyTorch ecosystem. Its main features are as follows.

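First, Flair supports tokenization, part-of-speech tagging, and named entity recognition for English text. Part-of-speech tagging labels each word as a noun, verb, adjective, and so on according to grammar. For Chinese you can use jieba (结巴), and for English you can use NLTK or Flair. Named entity recognition identifies entities with a specific meaning in text, mainly person names, place names, organization names, and other proper nouns. Flair achieves good results on all of these tasks.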

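Second, Flair provides a variety of pretrained word embedding models with multilingual support, which makes word embedding work convenient. Examples include Flair embeddings, BERT embeddings, and ELMo embeddings. Finally, Flair builds a complete framework on top of PyTorch, which makes it convenient for tasks such as text classification.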

Next, let's learn how to use Flair. The standard text type in Flair is flair.data.Sentence. For English text, you can create a Sentence as in the example below.

from flair.data import Sentence

text = """
[English] is a West Germanic language that was first spoken in early 
medieval England and eventually became a global lingua franca.
It is named after the <Angles>, one of the Germanic tribes that 
migrated to the area of Great Britain that later took their name, 
as England.
"""

sentence = Sentence(text)
sentence
Sentence[55]: " [English] is a West Germanic language that was first spoken in early  medieval England and eventually became a global lingua franca. It is named after the <Angles>, one of the Germanic tribes that  migrated to the area of Great Britain that later took their name,  as England."

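Flair automatically splits a Sentence into tokens and reports the count, which is 55 in the output above. You can print the tokens by iterating over the Sentence. Chinese text that has already been segmented into space-separated words is split into tokens in the same way.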

for token in sentence:
    print(token)
Token[0]: "["
Token[1]: "English"
Token[2]: "]"
Token[3]: "is"
Token[4]: "a"
Token[5]: "West"
Token[6]: "Germanic"
Token[7]: "language"
Token[8]: "that"
Token[9]: "was"
Token[10]: "first"
Token[11]: "spoken"
Token[12]: "in"
Token[13]: "early"
Token[14]: "medieval"
Token[15]: "England"
Token[16]: "and"
Token[17]: "eventually"
Token[18]: "became"
Token[19]: "a"
Token[20]: "global"
Token[21]: "lingua"
Token[22]: "franca"
Token[23]: "."
Token[24]: "It"
Token[25]: "is"
Token[26]: "named"
Token[27]: "after"
Token[28]: "the"
Token[29]: "<"
Token[30]: "Angles"
Token[31]: ">"
Token[32]: ","
Token[33]: "one"
Token[34]: "of"
Token[35]: "the"
Token[36]: "Germanic"
Token[37]: "tribes"
Token[38]: "that"
Token[39]: "migrated"
Token[40]: "to"
Token[41]: "the"
Token[42]: "area"
Token[43]: "of"
Token[44]: "Great"
Token[45]: "Britain"
Token[46]: "that"
Token[47]: "later"
Token[48]: "took"
Token[49]: "their"
Token[50]: "name"
Token[51]: ","
Token[52]: "as"
Token[53]: "England"
Token[54]: "."

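When a Sentence is split on whitespace alone, the tokens are not what we want: [English], for example, stays a single token. In the output above the brackets are already separated, which suggests that the Flair version used here applies a tokenizer by default. In either case, specifying use_tokenizer=True makes Flair call segtok to tokenize English text. (Chinese word segmentation is not supported.)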

sentence = Sentence(text, use_tokenizer=True)
for token in sentence:
    print(token)
Token[0]: "["
Token[1]: "English"
Token[2]: "]"
Token[3]: "is"
Token[4]: "a"
Token[5]: "West"
Token[6]: "Germanic"
Token[7]: "language"
Token[8]: "that"
Token[9]: "was"
Token[10]: "first"
Token[11]: "spoken"
Token[12]: "in"
Token[13]: "early"
Token[14]: "medieval"
Token[15]: "England"
Token[16]: "and"
Token[17]: "eventually"
Token[18]: "became"
Token[19]: "a"
Token[20]: "global"
Token[21]: "lingua"
Token[22]: "franca"
Token[23]: "."
Token[24]: "It"
Token[25]: "is"
Token[26]: "named"
Token[27]: "after"
Token[28]: "the"
Token[29]: "<"
Token[30]: "Angles"
Token[31]: ">"
Token[32]: ","
Token[33]: "one"
Token[34]: "of"
Token[35]: "the"
Token[36]: "Germanic"
Token[37]: "tribes"
Token[38]: "that"
Token[39]: "migrated"
Token[40]: "to"
Token[41]: "the"
Token[42]: "area"
Token[43]: "of"
Token[44]: "Great"
Token[45]: "Britain"
Token[46]: "that"
Token[47]: "later"
Token[48]: "took"
Token[49]: "their"
Token[50]: "name"
Token[51]: ","
Token[52]: "as"
Token[53]: "England"
Token[54]: "."
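The difference between plain whitespace splitting and punctuation-aware tokenization can be sketched with the standard library alone. The regular expression below is a simplified stand-in chosen for illustration. It is not what segtok or Flair does internally, and the sample sentence is a shortened fragment of the example text.

```python
import re

text = "[English] is named after the <Angles>, as England."

# Whitespace splitting leaves punctuation attached to the words.
whitespace_tokens = text.split()

# \w+ matches a run of word characters; [^\w\s] matches one punctuation mark.
punct_aware_tokens = re.findall(r"\w+|[^\w\s]", text)

print(len(whitespace_tokens), whitespace_tokens)
print(len(punct_aware_tokens), punct_aware_tokens)
```

With whitespace splitting, [English] and <Angles>, each remain a single token, while the regular expression separates the brackets, comma, and full stop into tokens of their own.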

Next, you can apply word embedding to a Sentence. For Chinese, Flair uses a pretrained model provided by FastText, which was trained on a Wikipedia corpus. You can load the word vectors with flair.embeddings.WordEmbeddings('zh').

from flair.embeddings import WordEmbeddings

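# Initialize the embedding
embedding = WordEmbeddings('zh')  # Use `zh` when running this yourself
# Create the sentence; Chinese text must be pre-segmented and space-separated for Flair to identify the tokens
sentence = Sentence("机器 学习 是 一个 好 工具")
# Embed the sentence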
embedding.embed(sentence)

for token in sentence:
    print(token)
    print(token.embedding)  # Print the embedding vector for this token
2023-11-14 10:46:34,630 https://flair.informatik.hu-berlin.de/resources/embeddings/token/zh-wiki-fasttext-300d-1M.vectors.npy not found in cache, downloading to /var/folders/tc/9kxpg1x95sl6cm2lc2jwpgt80000gn/T/tmptm1wnan_
2023-11-14 10:48:27,701 copying /var/folders/tc/9kxpg1x95sl6cm2lc2jwpgt80000gn/T/tmptm1wnan_ to cache at /Users/huhuhang/.flair/embeddings/zh-wiki-fasttext-300d-1M.vectors.npy
2023-11-14 10:48:28,046 removing temp file /var/folders/tc/9kxpg1x95sl6cm2lc2jwpgt80000gn/T/tmptm1wnan_
2023-11-14 10:48:30,424 https://flair.informatik.hu-berlin.de/resources/embeddings/token/zh-wiki-fasttext-300d-1M not found in cache, downloading to /var/folders/tc/9kxpg1x95sl6cm2lc2jwpgt80000gn/T/tmpt5nod9rk
2023-11-14 10:48:37,381 copying /var/folders/tc/9kxpg1x95sl6cm2lc2jwpgt80000gn/T/tmpt5nod9rk to cache at /Users/huhuhang/.flair/embeddings/zh-wiki-fasttext-300d-1M
2023-11-14 10:48:37,394 removing temp file /var/folders/tc/9kxpg1x95sl6cm2lc2jwpgt80000gn/T/tmpt5nod9rk
Token[0]: "机器"
tensor([ 0.0143,  0.5811, -0.6224,  0.2212,  0.7530, -0.1874, -0.6064,  0.0659,
         0.2285,  0.0601,  0.8601, -0.0285,  1.1349, -0.5639, -0.1153, -0.0566,
         0.7801, -0.0867, -0.6968, -0.5147, -0.3374,  1.1837,  0.7827,  0.0867,
         0.4255,  0.1987,  0.8387, -0.0374,  0.3309, -0.0280,  0.8692, -0.9097,
        -0.8766, -0.6566, -0.4730,  1.0071,  0.7562, -0.4000, -0.0652,  0.9994,
         0.9919,  0.4734,  0.8127, -0.4761, -0.1291, -0.5706, -0.7824, -0.3793,
         0.1278, -1.0881,  0.6386, -0.4776, -0.7002, -0.8154,  0.1790, -0.6806,
        -1.2060, -1.0734, -2.0394,  0.4766,  0.9346,  0.0028,  0.5399,  0.8536,
         0.1003,  0.5261, -0.6837, -0.5685, -0.5339,  0.1208,  0.8826, -0.4829,
         1.1641, -0.2419, -0.7891, -0.1125, -0.1593, -0.8578, -0.6621, -1.1855,
         0.0431,  0.0583,  0.7011, -0.7517, -0.7582, -0.9517, -0.0285,  0.3103,
         0.1624, -0.9033, -0.7867,  0.4230, -0.2775, -0.0805, -0.3226,  0.7330,
         0.3128,  0.1851, -0.1853, -0.0596,  0.4414,  0.2600, -0.7027,  0.8328,
        -0.4970,  0.3798,  0.2092, -0.7503, -0.5770, -0.0128, -0.1826, -0.1387,
         0.3124,  0.0187,  0.0387,  0.3218, -0.6264, -0.0517, -0.8444, -0.2013,
        -0.5843,  0.4578,  0.3557,  0.3344, -0.3998, -0.3747,  0.8146,  0.5117,
        -0.8563,  0.2704,  0.3490,  0.5117, -0.7002,  0.7740, -1.1578,  0.4763,
        -0.1603, -0.2892, -0.6538, -0.2876,  0.0559,  0.3469,  0.6359, -1.0277,
        -0.3468,  0.6848, -1.1921, -0.2028, -0.2787, -0.0375,  0.3030,  0.2835,
        -0.4877,  0.5015,  0.6387, -0.7885,  0.5213, -0.1034, -0.4846,  0.7212,
         0.1653,  0.1738,  0.3371,  0.2896, -0.5367, -0.0286, -0.3308,  0.4364,
         0.8609,  0.2810,  0.4085,  0.3831,  1.1185, -0.0573,  0.1359, -0.0791,
         0.4720,  0.7635, -0.0476, -0.3413, -0.4208, -0.4342,  0.0646,  0.7787,
        -0.2778,  0.5125, -0.1750, -0.0067, -0.6191,  0.6051, -0.3996, -0.1800,
        -0.3747, -0.5957, -0.0768, -0.3267,  0.1453, -0.8712, -0.0167, -0.2440,
         0.0361, -0.1182, -0.0665, -0.2876,  0.3599,  0.2551,  0.3388, -0.4155,
         0.2375,  0.2611, -0.9597,  0.6817, -0.2156,  0.1333, -0.4112,  0.4606,
         0.2891, -0.1833,  0.6388,  0.4233, -0.3933,  0.9661,  0.4108, -0.4213,
        -0.5075,  0.4503, -0.3346, -0.2201, -0.3898,  0.0812, -0.8379, -0.5047,
         0.2715, -0.3409, -0.4785,  1.0817,  0.3356, -0.6770,  0.0201, -0.2554,
         0.6776, -0.4254,  0.1542, -0.8496, -0.3390,  0.2657, -0.7995, -0.1938,
        -0.5448,  0.7467, -0.6824, -0.0090, -1.0278,  0.4611, -0.2736, -0.2581,
         0.4046,  0.3983, -1.0149, -0.2133,  0.2896, -0.4678, -0.6328, -0.1350,
        -0.1115, -0.0133, -1.0436, -0.6583, -1.6365, -0.9621, -0.5890,  0.0709,
         1.2525, -0.6565,  0.2980, -0.3386,  0.3507,  0.5609,  0.2868, -0.8079,
         0.8194,  1.6471,  0.2568,  0.1511,  0.1190, -0.2075, -0.0378, -0.0687,
        -1.0916,  0.0909, -1.0063, -0.8164, -0.9036, -0.6002,  0.2261, -0.6284,
         0.1163,  0.2058,  1.0010, -0.0158])
Token[1]: "学习"
tensor([-3.2496e-01,  6.4774e-01, -3.8768e-01,  3.2548e-02,  7.6574e-01,
        -5.1088e-01, -7.9309e-01,  3.1974e-01,  1.4386e-01, -2.3689e-01,
         4.8327e-01, -6.3284e-02,  1.5617e+00, -6.6433e-01,  1.1605e-01,
        -2.2726e-01,  7.7193e-01, -2.1452e-01, -5.9669e-01, -6.7123e-01,
        -4.8646e-01,  1.0104e+00,  7.3959e-01, -6.9452e-02,  6.9977e-01,
        -5.0222e-01,  8.8357e-01, -2.6706e-02,  5.9556e-01,  2.4998e-01,
         8.0008e-01, -7.0294e-01, -6.3508e-01, -4.5956e-01, -7.2117e-01,
         6.0594e-01,  5.8690e-01, -3.0229e-01,  1.0712e-02,  1.4117e+00,
         9.3205e-01,  9.1340e-01,  7.4644e-01, -7.8001e-01, -3.1752e-01,
        -4.2588e-01, -6.7553e-01, -4.7500e-01,  1.2541e-01, -5.2286e-01,
         1.2321e+00, -4.2871e-01, -2.3411e-01, -7.4918e-01,  2.6468e-01,
        -9.1652e-01, -1.0878e+00, -1.3424e+00, -2.3272e+00,  2.0264e-01,
         7.5792e-01, -1.0008e-01,  6.2414e-01,  8.5643e-01,  2.8045e-01,
         6.6526e-01, -6.2577e-01, -4.4871e-01, -2.1029e-01,  5.2435e-03,
         9.8931e-01, -3.3600e-01,  1.0255e+00, -5.4760e-01, -4.6953e-01,
        -1.2598e-01, -1.1644e-01, -6.8195e-01, -2.6941e-01, -1.0325e+00,
        -4.2845e-01,  1.5560e-01,  8.3749e-01, -7.6646e-01, -6.7090e-01,
        -9.9258e-01, -1.9242e-01,  6.7083e-01,  2.2316e-01, -6.9702e-01,
        -5.0593e-01,  4.5782e-01, -1.6225e-01, -4.4178e-02, -4.5914e-01,
         9.3188e-01,  2.8645e-01,  3.5577e-01,  9.9708e-02, -5.0715e-03,
         3.0289e-01,  4.0837e-01, -3.1080e-01,  5.4059e-01, -5.3086e-01,
         4.7741e-02,  9.2715e-02, -7.3871e-01, -6.6761e-01, -2.4710e-01,
         1.1328e-02, -3.9275e-01, -8.4853e-03,  6.7699e-01,  4.8520e-01,
         2.2267e-01, -5.9829e-01, -1.2634e-01, -7.5148e-01, -4.0789e-01,
        -3.9861e-01,  4.6634e-01,  5.2215e-01,  9.5104e-02, -5.8386e-01,
        -3.6987e-01,  5.0411e-01,  2.6521e-01, -7.4881e-01,  1.3841e-01,
         5.5953e-01, -3.5650e-01, -5.7487e-01,  1.0268e+00, -1.0020e+00,
         4.0540e-01,  6.9844e-03, -4.0649e-02, -7.5194e-01, -1.7583e-01,
        -2.3509e-02,  6.2793e-01,  7.7491e-01, -9.5466e-01, -3.5790e-01,
         3.3733e-01, -9.7010e-01, -2.7844e-01, -4.7630e-01, -8.6698e-03,
         4.8556e-01,  5.4333e-01, -5.6352e-01,  4.5409e-01,  6.4429e-01,
        -8.2720e-01,  1.9464e-01, -3.3808e-02, -3.2662e-01,  6.3361e-01,
         5.6221e-01,  4.0578e-01, -5.3711e-03,  2.4223e-01, -7.3461e-02,
         2.6014e-01, -1.2481e-01,  1.1112e+00,  3.2438e-01,  2.6632e-01,
         4.4040e-01,  4.5628e-01,  1.1011e+00, -3.0905e-01,  2.0793e-01,
        -5.1031e-01,  8.0338e-01,  7.4910e-01, -1.2676e-01, -1.9419e-01,
        -5.3962e-01, -4.4887e-01,  5.0762e-02,  2.4368e-01, -1.9830e-01,
         5.7638e-01, -2.5450e-01, -2.6344e-01, -7.1175e-01,  8.4950e-01,
        -1.1203e-01, -5.1713e-02,  7.3786e-02, -4.6739e-01, -2.6118e-01,
        -1.0008e+00, -2.1247e-01, -7.9742e-01, -2.0798e-01,  2.9983e-01,
        -1.6070e-01, -1.1373e-01,  2.7128e-01, -9.5071e-01, -4.7413e-02,
         9.7961e-02,  1.5148e-01, -5.8094e-01,  1.8383e-02,  1.2603e-01,
        -7.1018e-01, -5.5209e-03, -3.5366e-01,  9.1064e-02, -8.4823e-01,
         1.4066e-01,  3.3259e-01, -4.3188e-01,  6.9093e-01,  5.9725e-01,
        -4.7339e-01,  1.4482e-02,  4.2865e-01, -1.2653e-01, -6.9798e-01,
         5.1845e-01,  1.8231e-02, -5.2561e-01, -7.6256e-01,  4.4506e-02,
        -7.8633e-01, -8.0979e-01,  1.5287e-01, -4.3581e-01, -4.7554e-01,
         5.5389e-01,  4.3198e-01, -1.1809e+00, -3.1034e-02, -1.5329e-01,
         6.3897e-01, -6.1526e-01,  6.6176e-01, -1.1912e-01,  6.6673e-02,
         1.5720e-01, -9.1384e-01, -7.0507e-02, -2.9597e-02,  8.7810e-01,
        -4.2138e-01,  6.7716e-02, -6.7661e-01,  6.9992e-01, -3.4975e-01,
        -6.0683e-02,  4.2290e-01,  6.0106e-01, -8.1242e-01, -1.7345e-01,
         8.1558e-01, -1.9420e-04, -2.6439e-01, -6.7547e-02, -4.9000e-01,
        -1.0618e-01, -4.5975e-01, -3.5768e-01, -1.3467e+00, -7.7125e-01,
        -3.1377e-01,  1.7904e-01,  7.0509e-01, -5.1039e-01, -1.6599e-01,
         1.1125e-02,  1.6963e-01,  2.1902e-01, -1.9176e-02, -6.2030e-01,
         5.8062e-01,  1.5223e+00,  3.6571e-02,  7.5735e-01,  5.1125e-01,
        -7.2188e-02, -8.3230e-03, -6.8004e-02, -1.3448e+00, -6.3490e-02,
        -8.8260e-01, -8.2197e-01, -7.2176e-01, -5.2066e-01, -3.0366e-01,
        -2.6454e-01,  6.6962e-01,  7.0614e-01,  9.2428e-01,  1.4701e-01])
Token[2]: "是"
tensor([-2.9470e-01,  9.6263e-01, -6.4208e-01,  2.6904e-01,  8.1669e-01,
        -4.1970e-01, -7.6458e-01,  2.2026e-01,  3.9823e-01, -4.4368e-02,
         9.0519e-01,  5.8803e-02,  1.5186e+00, -1.0610e+00,  3.7516e-02,
        -8.5236e-01,  7.1213e-01, -4.0752e-01, -1.4830e+00, -6.0035e-01,
        -1.1404e+00,  1.2292e+00,  1.2193e+00, -1.3205e-02,  7.5028e-01,
        -6.4071e-01,  1.0560e+00,  2.9746e-01,  6.5827e-01,  2.9021e-01,
         1.0365e+00, -8.8943e-01, -8.4054e-01, -9.1078e-01, -8.1180e-01,
         1.2037e+00,  1.1838e+00, -6.8467e-01,  1.3043e-01,  1.4297e+00,
         8.4812e-01,  1.2296e+00,  1.2645e+00, -1.3474e+00, -4.6647e-01,
        -5.0325e-01, -7.8006e-01, -2.9556e-01,  4.6694e-01, -8.6471e-01,
         1.1513e+00, -8.3127e-02, -5.6439e-01, -5.4749e-01,  2.4472e-01,
        -1.1987e+00, -1.1071e+00, -1.4940e+00, -3.5805e+00,  3.5054e-01,
         1.1822e+00, -1.4338e-01,  9.4363e-01,  1.4257e+00,  5.7719e-01,
         1.1155e+00, -1.1568e+00, -7.1422e-01, -5.9231e-01, -1.5217e-01,
         1.5964e+00, -8.5672e-01,  1.3706e+00, -7.5333e-01, -9.3702e-01,
         3.4216e-01, -1.8062e-02, -9.4624e-01, -4.6339e-01, -1.2441e+00,
        -2.4588e-01, -1.4186e-02,  9.0451e-01, -9.0349e-01, -8.4130e-01,
        -8.8856e-01,  1.9713e-01,  7.0343e-01,  6.6668e-02, -1.2076e+00,
        -9.1207e-01,  8.8654e-01, -1.5411e-01, -3.1390e-02, -8.8110e-01,
         1.3894e+00,  8.1058e-01,  6.1890e-01,  1.3928e-01,  1.5015e-01,
         6.2773e-01,  5.2557e-01, -6.0964e-01,  1.1134e+00, -6.4656e-01,
         4.3897e-01, -1.0560e-01, -8.1528e-01, -7.9294e-01,  1.1758e-02,
         2.2930e-01, -8.3421e-01, -1.2414e-01,  6.0844e-01,  2.7743e-01,
         4.4911e-01, -6.5487e-01,  1.2584e-01, -6.7680e-01, -1.1650e+00,
        -7.6231e-01,  1.0157e+00,  7.0824e-01,  7.2998e-02, -7.8838e-01,
        -7.8291e-01,  8.3730e-01,  8.5906e-01, -1.1207e+00,  3.9082e-01,
         7.1783e-01,  2.2138e-01, -9.7172e-01,  1.2464e+00, -1.2574e+00,
         3.4886e-01, -4.2288e-01, -4.6324e-01, -9.6657e-01, -6.5494e-01,
         3.6606e-02,  8.7073e-01,  1.4635e+00, -1.1964e+00, -9.0294e-01,
         2.8447e-01, -1.4719e+00, -1.9801e-01, -7.5471e-01, -4.4003e-01,
         5.0558e-01,  1.0695e+00, -2.0375e-01,  2.6232e-01,  1.0458e+00,
        -7.8943e-01,  2.9110e-01, -4.2868e-01,  5.1662e-04,  7.1790e-01,
         2.8826e-01,  3.6638e-01,  3.5106e-01, -2.6241e-01,  5.8454e-02,
         2.4675e-01, -3.5126e-01,  8.2303e-01,  7.0827e-01,  5.1579e-01,
         6.3620e-01,  6.3905e-01,  1.0002e+00, -2.7073e-01,  1.2566e-01,
         7.3826e-02,  7.0752e-01,  8.8663e-01,  1.8877e-02, -6.4460e-01,
        -7.2626e-01, -4.3245e-01,  2.9192e-01, -2.4789e-02, -5.5305e-01,
         8.3379e-01,  1.0214e-02, -7.8934e-01, -4.7006e-01,  8.4025e-01,
        -3.2540e-01, -5.3938e-02, -6.4136e-01, -4.4979e-01, -2.0078e-02,
        -3.9118e-01, -1.1196e+00, -7.6324e-01, -7.3820e-03,  5.5478e-01,
         4.8309e-02, -5.1268e-01, -1.5172e-01, -5.8643e-01,  1.2819e-01,
         2.3156e-01,  1.3604e-01, -5.1435e-01, -2.1252e-01,  2.9178e-01,
        -9.6110e-01,  2.3519e-01, -5.3226e-01,  4.5818e-01, -6.5265e-01,
         2.3695e-02,  4.7478e-01, -1.1026e-01,  1.0374e+00,  6.8898e-01,
        -6.5905e-01,  8.3337e-01,  3.8156e-01, -3.0748e-01, -1.1250e+00,
         8.5038e-01, -2.1900e-01, -1.5465e-01, -1.0664e+00,  1.8646e-01,
        -8.0464e-01, -2.3705e-01,  5.1165e-01, -8.5804e-01, -1.1827e-01,
         1.0432e+00,  5.5366e-01, -1.2645e+00, -7.6407e-02, -2.3647e-01,
         9.5911e-01, -1.0385e+00,  6.9627e-01, -3.1721e-02, -3.7161e-01,
         8.4316e-03, -6.1770e-01, -3.5963e-01, -8.2659e-01, -6.2814e-02,
        -4.9534e-01,  1.0803e-01, -8.3316e-01,  9.2093e-02, -4.6147e-01,
        -6.0965e-02,  5.5664e-01,  6.5784e-01, -1.5682e+00, -3.8237e-02,
         6.4389e-01, -3.5463e-01, -5.1613e-01, -3.1236e-01, -2.3338e-01,
        -6.3333e-02, -8.1927e-01, -8.9823e-01, -2.0185e+00, -1.4186e+00,
        -1.0503e-01,  4.2784e-01,  1.1059e+00, -7.7548e-01, -1.7601e-01,
        -1.9003e-01,  1.5087e-01,  1.0027e+00, -1.0793e-01, -7.0910e-01,
         6.4077e-01,  1.2628e+00,  1.6852e-01,  1.0921e+00,  3.0727e-02,
        -5.8835e-01,  3.9650e-01,  4.5892e-01, -8.2997e-01,  8.4672e-02,
        -7.3598e-01, -9.9018e-01, -8.3517e-01, -6.9301e-01,  1.2126e-01,
        -9.2381e-01,  6.5063e-01,  3.7096e-01,  8.6223e-01,  4.4922e-01])
Token[3]: "一个"
tensor([-0.0737,  0.5082, -0.6935, -0.0789,  0.3819, -0.2839, -0.5039,  0.0415,
         0.1331, -0.3137,  0.6677, -0.1970,  0.9833, -0.6265,  0.0319, -0.4018,
         0.5840,  0.3428, -1.0544, -0.6864, -0.5543,  0.7148,  0.5921, -0.1479,
         0.4243, -0.4255,  0.5889,  0.0828,  0.6716,  0.2270,  0.8467, -0.3782,
        -0.3355, -0.5189, -0.7975,  0.7559,  1.2961, -0.6540,  0.0932,  1.1062,
         0.7401,  0.8103,  0.6879, -0.7346, -0.5576, -0.1967, -0.6064, -0.2394,
        -0.0572, -0.7842,  0.6955, -0.1724, -0.6147, -0.8001,  0.4576, -0.6661,
        -0.9749, -1.3075, -1.7487,  0.3943,  0.7408, -0.1462,  0.6588,  0.7658,
         0.2726,  0.9435, -0.6261, -0.2634, -0.3476, -0.2783,  1.0468, -0.3697,
         0.7806, -0.3242, -0.7213, -0.0714,  0.1008, -0.9476, -0.1250, -0.8386,
        -0.2752,  0.0817,  0.3290, -0.5047, -0.5321, -0.8082,  0.1103,  0.6936,
        -0.0974, -0.9505, -0.5493,  0.7136, -0.0415, -0.1809, -0.4456,  1.1020,
         0.3636,  0.6953, -0.2080, -0.1619,  0.1121,  0.2361, -0.3169,  0.4728,
        -0.6969,  0.0054,  0.2143, -0.7620, -0.6029, -0.1606, -0.1066, -0.2429,
         0.0950,  0.3300,  0.0370,  0.0145, -0.3844,  0.0283, -0.5727, -0.5766,
        -0.5473,  0.4859,  0.1930,  0.2058, -0.6213, -0.4933,  0.1494,  0.3724,
        -0.8091,  0.1388,  0.5616,  0.1015, -0.7299,  0.8991, -0.9057,  0.1638,
        -0.0946, -0.2198, -0.7337, -0.4277,  0.0348,  0.7408,  0.8802, -0.8790,
        -0.5132,  0.3095, -1.1776,  0.1631, -0.4547, -0.0234,  0.3326,  0.4331,
        -0.1048,  0.2179,  0.7793, -0.5711,  0.3321,  0.0602, -0.3515,  0.4355,
         0.3082,  0.3437,  0.3638,  0.2393,  0.0977, -0.0503, -0.1972,  0.4905,
         0.6600,  0.3427,  0.3623,  0.2477,  1.0571, -0.2453, -0.0974,  0.1054,
         0.5306,  0.7143, -0.0602, -0.1745, -0.4580, -0.1708,  0.0271,  0.0232,
        -0.1978,  0.1574, -0.0032, -0.3223, -0.4930,  0.7378, -0.1691, -0.2887,
        -0.3597, -0.4060, -0.0719, -0.2593, -0.8465, -0.3307, -0.1629,  0.3394,
         0.0405, -0.4944, -0.1151, -0.4372,  0.2526, -0.1608,  0.3314, -0.6572,
        -0.0946,  0.2913, -0.6903,  0.2566, -0.0576,  0.0383, -0.4663,  0.1108,
         0.1470,  0.1758,  0.6486,  0.2232, -0.3011,  0.3376,  0.3911,  0.1646,
        -0.4874,  0.9991, -0.1202, -0.3505, -0.6318,  0.1634, -0.6370, -0.2258,
         0.4027, -0.4303, -0.1014,  0.7300,  0.4224, -0.6519,  0.2385, -0.0709,
         0.5122, -0.5937,  0.5370, -0.5672, -0.0740,  0.1552, -0.5209, -0.1279,
        -0.4684,  0.3429, -0.4396,  0.0586, -0.7579,  0.2429, -0.3678,  0.1680,
         0.5940,  0.2853, -1.1939, -0.1144,  0.0490, -0.2518, -0.2717, -0.4434,
        -0.2550, -0.2255, -0.8746, -0.8969, -0.9644, -1.0754, -0.5249,  0.1892,
         1.0731, -0.8210,  0.1116,  0.1603,  0.0713,  0.4434,  0.3396, -0.5897,
         0.5703,  1.1991, -0.0383,  0.5015,  0.0743, -0.3986, -0.0771,  0.0730,
        -0.3178,  0.5026, -0.7956, -0.7972, -0.4207, -0.4523,  0.2691, -0.0433,
         0.7785,  0.2873,  0.7858,  0.2248])
Token[4]: "好"
tensor([-9.8929e-02,  1.2491e+00, -8.4758e-01,  5.3167e-01,  1.0574e+00,
        -5.2330e-01, -6.3454e-01,  6.6102e-02,  2.8577e-01, -2.8122e-01,
         1.0163e+00,  9.8894e-02,  1.4946e+00, -1.1385e+00,  3.3889e-01,
        -7.4407e-01,  1.1756e+00, -4.6333e-01, -1.3732e+00, -9.0615e-01,
        -1.1597e+00,  1.2152e+00,  1.4630e+00, -9.5854e-02,  9.3928e-01,
        -2.7229e-01,  1.3321e+00, -4.2433e-02,  8.4669e-01,  3.3059e-01,
         1.0210e+00, -1.1766e+00, -1.1104e+00, -7.9354e-01, -9.3404e-01,
         1.0778e+00,  1.3256e+00, -8.2726e-01,  3.6350e-01,  1.4105e+00,
         7.5099e-01,  1.2105e+00,  1.0801e+00, -1.4571e+00, -4.3122e-01,
        -5.5185e-01, -8.6859e-01, -2.4268e-01,  6.3586e-01, -1.0281e+00,
         1.2762e+00, -4.9458e-01, -4.5010e-01, -4.6716e-01,  2.1048e-01,
        -1.1757e+00, -1.1276e+00, -1.5405e+00, -4.3190e+00,  4.7489e-01,
         1.3342e+00, -1.8817e-01,  8.5426e-01,  1.2849e+00,  3.4079e-01,
         7.8430e-01, -9.6167e-01, -5.7700e-01, -5.0483e-01, -4.4884e-01,
         1.4872e+00, -6.9987e-01,  1.5548e+00, -8.7175e-01, -7.7090e-01,
         7.0658e-01,  5.1195e-02, -1.0701e+00, -6.4907e-01, -1.3436e+00,
        -3.7279e-01,  2.2611e-01,  1.1259e+00, -7.7517e-01, -8.9178e-01,
        -1.2251e+00,  2.6167e-01,  6.9696e-01,  2.2853e-01, -1.0625e+00,
        -6.8215e-01,  6.7655e-01, -2.3570e-01, -3.6551e-01, -1.1400e+00,
         1.3774e+00,  7.5486e-01,  9.9982e-01,  7.5766e-02,  8.7429e-02,
         6.8929e-01,  9.4955e-01, -5.4822e-01,  1.1555e+00, -6.5018e-01,
         9.5104e-01, -3.0029e-02, -6.3037e-01, -1.1969e+00,  1.8153e-01,
         5.4985e-01, -8.9267e-01, -1.9155e-01,  6.1675e-01,  2.0949e-01,
         6.4118e-01, -7.3601e-01,  2.1812e-01, -9.4216e-01, -1.0408e+00,
        -9.5387e-01,  8.3657e-01,  7.9016e-01, -2.0435e-01, -7.4658e-01,
        -1.2543e+00,  1.0448e+00,  8.6996e-01, -1.0093e+00,  4.6790e-01,
         9.8485e-01,  3.6014e-01, -8.9233e-01,  1.0195e+00, -1.4158e+00,
         2.4657e-01, -5.8987e-01, -6.0092e-01, -1.0628e+00, -4.0283e-01,
         2.9971e-02,  8.7130e-01,  1.2866e+00, -1.2866e+00, -9.3108e-01,
         1.8117e-01, -1.4708e+00, -5.1765e-02, -7.2803e-01, -2.6360e-01,
         8.2880e-01,  1.0917e+00, -3.2472e-01,  1.7605e-01,  1.1482e+00,
        -9.1457e-01,  4.5168e-01, -4.1548e-01, -2.9865e-01,  8.7758e-01,
         5.7283e-01,  4.0329e-01,  4.1626e-01, -4.9962e-03, -3.0370e-01,
         2.3634e-01, -9.3532e-02,  1.2920e+00,  5.0149e-01,  4.6592e-01,
         6.0240e-01,  7.2581e-01,  1.3470e+00, -1.8658e-01,  6.0634e-02,
        -8.4577e-02,  7.9268e-01,  1.0364e+00, -1.8741e-01, -3.8072e-01,
        -6.4939e-01, -6.9535e-01,  3.1566e-01,  7.5466e-02, -4.4083e-01,
         8.4603e-01, -1.2895e-01, -7.4151e-01, -6.2922e-01,  9.0151e-01,
        -4.1163e-01, -2.1922e-01, -6.0280e-01, -8.4718e-01, -4.3876e-02,
        -8.6079e-01, -9.4760e-01, -8.7113e-01,  6.8971e-02,  5.3052e-01,
        -2.1489e-01, -5.0863e-01,  3.2898e-02, -8.4323e-01, -1.0329e-01,
         1.7324e-01,  4.5164e-01, -5.0305e-01,  2.7276e-02,  3.8356e-01,
        -1.0494e+00,  2.5880e-01, -5.8057e-01,  5.2955e-01, -9.6895e-01,
         4.0608e-01,  4.5989e-01, -3.9216e-01,  1.1082e+00,  7.3078e-01,
        -4.8874e-01,  1.0780e+00,  8.8200e-02, -3.1825e-04, -1.0974e+00,
         8.8000e-01, -3.7661e-01, -5.4520e-01, -1.0127e+00,  5.8841e-02,
        -9.7452e-01, -8.1897e-01,  4.0137e-01, -6.5215e-01, -1.9860e-01,
         1.2709e+00,  5.3455e-01, -1.2475e+00, -2.9769e-01, -5.5092e-01,
         7.9281e-01, -6.9952e-01,  7.4329e-01, -6.3450e-02, -3.0502e-01,
         8.0636e-02, -5.2637e-01, -3.3847e-01, -7.7989e-01,  2.5192e-01,
        -2.0292e-01, -1.5657e-01, -1.1926e+00,  3.0141e-01, -5.3737e-01,
        -1.6424e-01,  4.7523e-01,  5.2591e-01, -1.7125e+00,  7.9079e-03,
         5.0052e-01, -3.5621e-01, -3.8273e-01,  2.6603e-02, -4.5296e-01,
        -3.0302e-01, -8.4407e-01, -1.0795e+00, -2.1210e+00, -9.9769e-01,
         6.3477e-03,  4.1831e-01,  1.0916e+00, -6.7977e-01, -2.6044e-01,
        -4.0029e-01, -1.9433e-01,  8.1853e-01,  9.3558e-03, -7.8074e-01,
         6.8278e-01,  1.7029e+00,  1.8010e-01,  1.0896e+00, -1.8944e-01,
        -1.6505e-01,  6.6493e-01,  5.8992e-01, -1.1811e+00,  4.4594e-02,
        -8.6748e-01, -1.3192e+00, -6.1864e-01, -6.5932e-01, -9.3218e-02,
        -1.0111e+00,  7.4332e-02,  9.3045e-01,  1.1787e+00, -4.2524e-02])
Token[5]: "工具"
tensor([ 7.7626e-02,  3.4500e-01, -8.4805e-01,  6.4822e-01,  8.4264e-01,
        -1.5225e-01, -5.6491e-01,  7.3676e-02,  2.8573e-01, -3.7650e-01,
         7.1349e-01,  7.4695e-02,  1.0345e+00, -8.3189e-01,  2.7686e-01,
        -3.8717e-01,  5.5416e-01, -2.2696e-01, -7.5008e-01, -3.8934e-01,
        -7.8667e-01,  7.1259e-01,  8.0246e-01,  1.9120e-01,  6.8034e-01,
        -2.1005e-01,  7.8897e-01,  2.9680e-02,  4.6992e-01,  1.0709e-01,
         6.7642e-01, -5.0523e-01, -8.2933e-01, -1.9451e-01, -4.4002e-01,
         7.7130e-01,  8.2206e-01, -4.6257e-01,  2.5625e-01,  1.1233e+00,
         6.1146e-01,  6.2149e-01,  6.3909e-01, -1.0478e+00, -3.1095e-01,
        -3.7674e-01, -3.0834e-01, -1.0151e-01,  2.0864e-01, -6.2303e-01,
         9.2810e-01, -5.5475e-01, -2.2680e-01, -4.8252e-01,  2.0128e-01,
        -7.6970e-01, -8.0237e-01, -9.6166e-01, -2.5902e+00,  3.4217e-01,
         7.6201e-01,  2.2369e-01,  7.2643e-01,  8.1148e-01,  4.1677e-01,
         7.6627e-01, -7.6495e-01, -6.5571e-01, -5.0893e-01,  7.2132e-04,
         8.7514e-01, -3.1716e-01,  7.0562e-01, -5.0139e-01, -5.0363e-01,
         9.8056e-02,  9.5462e-02, -3.7544e-01, -5.1002e-01, -1.0472e+00,
        -1.8728e-01,  1.9773e-02,  9.3435e-01, -2.1193e-01, -7.1982e-01,
        -6.3043e-01,  3.9934e-01,  3.1178e-01,  2.8568e-01, -2.9688e-01,
        -4.2226e-01,  4.1676e-01, -3.2123e-01, -6.7405e-02, -6.3319e-01,
         1.0140e+00,  6.4152e-01,  3.5480e-01, -1.5926e-01,  2.8371e-01,
         4.9172e-01,  3.4924e-01, -3.7351e-01,  7.2828e-01, -6.3493e-01,
         6.4960e-01,  4.2411e-02, -6.6279e-01, -6.1671e-01,  1.1741e-01,
        -3.3209e-02, -3.5541e-01, -2.1580e-01,  1.2344e-01,  6.5102e-02,
         2.8030e-01, -3.7041e-01,  8.7519e-02, -2.9778e-01, -7.4646e-01,
        -4.6600e-01,  7.0127e-01,  7.1335e-01, -1.6944e-01, -6.1457e-01,
        -3.8265e-01,  6.4486e-01,  5.8761e-01, -7.5519e-01,  5.6657e-02,
         8.4512e-01,  3.8494e-01, -5.9174e-01,  6.1456e-01, -9.0165e-01,
         4.1715e-01, -2.1881e-01, -7.0829e-03, -5.1822e-01, -3.5971e-01,
         1.0889e-01,  5.2257e-01,  1.1090e+00, -1.1411e+00, -3.4349e-01,
         2.2106e-01, -1.3681e+00, -9.8239e-02, -9.5574e-01, -1.0046e-01,
         3.8280e-01,  4.3341e-01, -5.4691e-01,  7.9324e-02,  4.5687e-01,
        -6.1989e-01,  2.8172e-01, -1.7621e-01, -1.9494e-01,  4.9175e-01,
         1.8155e-01,  2.0121e-01,  3.5332e-01, -1.0972e-01, -2.4886e-01,
        -6.8137e-03,  1.7329e-02,  6.3899e-01,  5.2208e-01,  1.5924e-01,
         4.9881e-01,  3.2426e-02,  1.0455e+00, -1.0736e-01, -9.9695e-02,
         1.1880e-02,  1.6766e-01,  4.9924e-01, -4.2403e-01, -5.1307e-01,
        -5.0723e-01, -7.0273e-01,  1.2938e-01,  2.0587e-01, -2.4804e-01,
         4.3666e-01, -3.2109e-01, -4.9457e-01, -7.0252e-01,  4.8740e-01,
        -2.5155e-01, -1.2607e-01, -5.1706e-01, -1.3906e-01, -3.1923e-01,
        -5.9507e-01, -5.9941e-01, -5.9291e-01,  8.3765e-02,  2.2275e-01,
         5.0059e-02, -2.0300e-01, -1.8935e-01, -7.1984e-01,  3.8764e-01,
         1.5202e-01,  2.4962e-01,  5.9814e-02,  5.6661e-02,  4.8815e-01,
        -1.1389e+00,  3.6501e-01, -3.7702e-01,  3.8402e-01, -6.4953e-01,
         2.4076e-01,  2.2848e-01, -1.0029e-01,  3.6977e-01,  6.5855e-01,
        -5.9567e-01,  5.0346e-01,  4.5110e-01, -2.2932e-01, -7.9340e-01,
         7.2282e-01, -1.1029e-01, -2.9329e-01, -8.3050e-01, -1.7477e-01,
        -7.8086e-01, -3.5417e-01,  3.3621e-01, -6.3124e-01, -4.5237e-01,
         1.2538e+00,  5.0721e-01, -4.7086e-01, -3.9326e-01, -3.0356e-01,
        -2.5768e-02, -8.8868e-01,  2.1007e-01, -2.5571e-01, -4.9784e-01,
         1.9029e-01, -7.6648e-01, -5.4100e-01, -3.0252e-01,  5.7354e-01,
        -4.8085e-01, -3.4596e-03, -9.9778e-01,  6.6623e-02, -4.7766e-01,
        -1.2646e-01,  1.8771e-01,  1.5432e-01, -1.0605e+00, -1.3565e-01,
         5.2770e-01, -7.4257e-01, -1.7400e-01,  2.8087e-01, -4.0230e-01,
         3.1949e-02, -5.6044e-01, -6.3482e-01, -1.6851e+00, -1.0976e+00,
        -5.4208e-01,  3.5632e-01,  8.1533e-01, -3.7334e-01,  6.0636e-02,
        -3.8516e-01, -3.2269e-02,  4.6939e-01,  1.1107e-01, -6.4306e-01,
         4.4179e-01,  1.1898e+00,  6.1829e-02, -1.0223e-01,  4.0794e-02,
         2.7965e-03, -1.9893e-01,  3.7425e-01, -6.7358e-01,  1.4465e-02,
        -7.2238e-01, -1.0501e+00, -6.1100e-01, -7.7291e-01,  3.6434e-01,
        -6.0364e-01,  2.5729e-01,  1.8586e-01,  1.1608e+00,  4.6043e-02])

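As shown above, obtaining the embedding vector for each token takes very little code. If a word is not in the vocabulary of the pretrained FastText model, its embedding is a vector of zeros.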

from flair.data import Sentence

# Reuses the `embedding` (WordEmbeddings) object created earlier in this section.

sentence = Sentence("人工智能")
embedding.embed(sentence)
for token in sentence:
    print(token)
    print(token.embedding)
Token[0]: "人工智能"
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

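The Chinese word embeddings provided by FastText have 300 dimensions. Word vectors trained on a large corpus generally perform better than vectors you train yourself on a small one, which makes pretrained embeddings a convenient source of features for natural language processing.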
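The zero-vector fallback for unknown words can be pictured with a toy lookup table. This is a minimal sketch in plain Python, not Flair's implementation: the two-word vocabulary, the 4-dimensional vectors, and the helper names `lookup` and `is_oov` are all made up for illustration.

```python
# Toy pretrained table: word -> vector (real FastText vectors have 300 dimensions).
DIM = 4
toy_vectors = {
    "自然": [0.1, -0.3, 0.7, 0.2],
    "语言": [0.5, 0.0, -0.2, 0.9],
}


def lookup(word):
    """Return the stored vector, or an all-zero vector for unknown words."""
    return toy_vectors.get(word, [0.0] * DIM)


def is_oov(vector):
    """A vector made only of zeros signals an out-of-vocabulary word."""
    return all(value == 0.0 for value in vector)


tokens = ["自然", "语言", "人工智能"]
oov_tokens = [t for t in tokens if is_oov(lookup(t))]
print(oov_tokens)
```

Counting such all-zero vectors over a corpus is a quick way to estimate how much of the text the pretrained vocabulary actually covers.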

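In addition, Flair's BERT Embeddings include a model that supports Chinese, based on the pretrained BERT model released by Google in 2018. You can load it with the code below. Note that recent Flair releases deprecate BertEmbeddings in favor of TransformerWordEmbeddings, so the import may need adjusting for your installed version: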

from flair.embeddings import BertEmbeddings

embedding = BertEmbeddings('bert-base-chinese')

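Google's pretrained BERT model was trained on a much larger corpus and generally gives better results. However, the model is also much larger, and its word embedding length is 3072 (roughly ten times the 300 dimensions of the FastText vectors), which makes it considerably more expensive to compute. BertEmbeddings is recommended only when substantial computing power is available.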
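To get a feel for the difference in size, the following back-of-the-envelope sketch compares the storage needed for the token vectors of one 17-token sentence at the two embedding lengths mentioned above. It assumes 32-bit floats (4 bytes per value), and the function name is illustrative. It covers only storing the vectors; running the BERT network itself is a further cost.

```python
BYTES_PER_FLOAT32 = 4  # assumption: embeddings are stored as 32-bit floats


def embedding_bytes(num_tokens, embedding_length):
    """Memory needed to hold one float32 vector per token."""
    return num_tokens * embedding_length * BYTES_PER_FLOAT32


fasttext_bytes = embedding_bytes(17, 300)   # FastText "zh" vectors
bert_bytes = embedding_bytes(17, 3072)      # BERT vectors as described above
ratio = bert_bytes / fasttext_bytes
print(fasttext_bytes, bert_bytes, ratio)
```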

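As mentioned earlier, Flair builds a complete framework on top of PyTorch, which makes it convenient for tasks such as text classification. Below, we use Flair to redo the fake news classification task from the earlier experiment.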

First, download the data and the stopword list:

wget -nc "https://cdn.huhuhang.com/hands-on-ai/files/wsdm_mini.csv"  # fake news data
wget -nc "https://cdn.huhuhang.com/hands-on-ai/files/stopwords.txt"  # stopword list
File 'wsdm_mini.csv' already there; not retrieving.

File 'stopwords.txt' already there; not retrieving.

Preprocessing is largely unchanged. First, merge the two text columns.

import pandas as pd

df = pd.read_csv("wsdm_mini.csv")
df["text"] = df[["title1_zh", "title2_zh"]].apply(
    lambda x: "".join(x), axis=1
)  # merge the two title columns into one text column
data = df.drop(df.columns[[0, 1]], axis=1)  # drop the original text columns
data.head()
label text
0 disagreed 千叶湖八岁孩子不想去学英语,跳楼了「辟谣」千叶湖八岁孩子跳楼了为谣言信息
1 agreed 喝酸奶真的能补充益生菌吗?喝酸奶来补充益生菌,靠谱么?
2 agreed 刚刚马云终于出手了!房价要跌,扬言房地产中介都要失业了最新消息马云终于出手了!扬言房地产中介...
3 unrelated 直击“冯乡长”李正春追悼会:赵本山全程操办,赵四刘能现场祭奠昆明会议直击“活摘”谣言
4 disagreed 李雨桐爆薛之谦离婚内幕,说到底就是网红之间的恩怨情仇嘛薛之谦前女友李雨桐再次发微博爆料,薛之...

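Flair provides a very high-level API for text classification, so the data must be converted into the format that this API accepts. In particular, every value in the label column needs the __label__ prefix.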

data["label"] = "__label__" + data["label"].astype(str)
data.head()
label text
0 __label__disagreed 千叶湖八岁孩子不想去学英语,跳楼了「辟谣」千叶湖八岁孩子跳楼了为谣言信息
1 __label__agreed 喝酸奶真的能补充益生菌吗?喝酸奶来补充益生菌,靠谱么?
2 __label__agreed 刚刚马云终于出手了!房价要跌,扬言房地产中介都要失业了最新消息马云终于出手了!扬言房地产中介...
3 __label__unrelated 直击“冯乡长”李正春追悼会:赵本山全程操办,赵四刘能现场祭奠昆明会议直击“活摘”谣言
4 __label__disagreed 李雨桐爆薛之谦离婚内幕,说到底就是网红之间的恩怨情仇嘛薛之谦前女友李雨桐再次发微博爆料,薛之...
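The labeling convention can be sketched without pandas. The helpers below are hypothetical names, not part of Flair: one builds a single line with the prefixed label, using the tab separator that is applied later when the files are written, and the other recovers the label and text from such a line.

```python
LABEL_PREFIX = "__label__"


def to_fasttext_line(label, text):
    """Build one 'label<TAB>text' line with the prefixed label."""
    return f"{LABEL_PREFIX}{label}\t{text}"


def parse_fasttext_line(line):
    """Split a line back into (label, text), dropping the prefix."""
    tagged_label, text = line.rstrip("\n").split("\t", 1)
    if not tagged_label.startswith(LABEL_PREFIX):
        raise ValueError(f"missing {LABEL_PREFIX} prefix: {tagged_label!r}")
    return tagged_label[len(LABEL_PREFIX):], text


line = to_fasttext_line("agreed", "喝 酸奶 真的 补充 益生菌")
print(line)
print(parse_fasttext_line(line))
```

A round trip like this is a cheap sanity check that no label lost its prefix before the files are handed to the corpus loader.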

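Next, we segment the text and remove stopwords. As before, this step relies on the jieba Chinese word segmentation library. Note that the segmented tokens must be joined with spaces so that Flair can parse each text as a Sentence.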

import jieba
from tqdm.notebook import tqdm


def load_stopwords(file_path):
    # One stopword per line; assumes the file is UTF-8 encoded.
    # A set makes the membership test in the loop below fast.
    with open(file_path, "r", encoding="utf-8") as f:
        stopwords = {line.strip("\n") for line in f}
    return stopwords


stopwords = load_stopwords("stopwords.txt")

corpus = []
for line in tqdm(data["text"]):
    words = []
    seg_list = list(jieba.cut(line))  # word segmentation
    for word in seg_list:
        if word in stopwords:  # skip stopwords
            continue
        words.append(word)
    corpus.append(" ".join(words))

data["text"] = corpus  # write the result back to the DataFrame
data.head()
label text
0 __label__disagreed 千叶 湖 八岁 孩子 不想 去学 英语 跳楼 「 辟谣 千叶 湖 八岁 孩子 跳楼 谣言 信息
1 __label__agreed 喝 酸奶 真的 补充 益生菌 喝 酸奶 补充 益生菌 谱
2 __label__agreed 刚刚 马云 终于 出手 房价 跌 扬言 房地产 中介 失业 最新消息 马云 终于 出手 扬言...
3 __label__unrelated 直击 冯 乡长 李正春 追悼会 赵本山 全程 操办 赵四 刘能 现场 祭奠 昆明 会议 直击...
4 __label__disagreed 李雨桐 爆 薛之谦 离婚 内幕 说到底 网红 之间 恩怨 情仇 薛之谦 前女友 李雨桐 发微...
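The filtering step can be tried in isolation on a list that is already segmented, without jieba. This is a minimal sketch with a made-up stopword set and a hypothetical helper name. Storing the stopwords in a set keeps each membership test at constant time, which matters because the test runs once per token of the whole corpus.

```python
toy_stopwords = {"了", "的", ",", "?"}  # hypothetical stopwords for this demo


def remove_stopwords(tokens, stopwords):
    """Drop stopwords and join the remaining tokens with single spaces."""
    return " ".join(tok for tok in tokens if tok not in stopwords)


segmented = ["喝", "酸奶", "真的", "能", "补充", "益生菌", "吗", "?"]
filtered = remove_stopwords(segmented, toy_stopwords)
print(filtered)
```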

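Next, we split the dataset into two parts: a training set and a test set. Flair's text classification API also supports a separate validation (dev) set, but we skip it here for simplicity.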

data.iloc[0 : int(len(data) * 0.8)].to_csv(
    "train.csv", sep="\t", index=False, header=False
)
data.iloc[int(len(data) * 0.8) :].to_csv(
    "test.csv", sep="\t", index=False, header=False
)

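Note that the Flair API expects each split to be stored as a separate file, so we save them as CSV files for later use. Setting index=False and header=False drops the index column and the column names, and \t is used as the separator between the label and the text.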
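The slicing above keeps the rows in their original order. If the source file happens to be sorted, for example by label, shuffling before the cut may give more representative splits. The sketch below is a variation under that assumption, not the tutorial's method: it uses only the standard library, toy rows, a fixed seed, and hypothetical helper names, and it writes to a temporary directory.

```python
import csv
import random
import tempfile
from pathlib import Path


def split_rows(rows, train_ratio=0.8, seed=42):
    """Shuffle a copy of the rows, then cut it into train and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


def write_tsv(rows, path):
    """Write (label, text) pairs as tab-separated lines without a header."""
    with open(path, "w", encoding="utf-8", newline="") as f:
        csv.writer(f, delimiter="\t", lineterminator="\n").writerows(rows)


# Toy (label, text) rows standing in for the real DataFrame.
rows = [(f"__label__class{i % 3}", f"词{i} 词{i + 1}") for i in range(10)]
train_rows, test_rows = split_rows(rows)

out_dir = Path(tempfile.mkdtemp())
write_tsv(train_rows, out_dir / "train.csv")
write_tsv(test_rows, out_dir / "test.csv")
print(len(train_rows), len(test_rows))
```

With pandas, a comparable effect can be had by shuffling the DataFrame before slicing it.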

Next, we use flair.datasets.ClassificationCorpus to load the prepared corpus.

from flair.datasets import ClassificationCorpus
from pathlib import Path

corpus = ClassificationCorpus(Path("./"), test_file="test.csv", train_file="train.csv")
corpus
2023-11-14 10:48:59,383 Reading data from .
2023-11-14 10:48:59,384 Train: train.csv
2023-11-14 10:48:59,385 Dev: None
2023-11-14 10:48:59,386 Test: test.csv
2023-11-14 10:48:59,810 No dev split found. Using 0% (i.e. 1200 samples) of the train split as dev data
2023-11-14 10:48:59,810 Initialized corpus . (label type name is 'class')
<flair.datasets.document_classification.ClassificationCorpus at 0x2a7e95ba0>

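For example, you can preview the training corpus through corpus.train, which also lets you check that the data loaded correctly. The data loaded correctly if the Chinese tokens are separated by spaces and the number of tokens is reported, as in Sentence[17].

corpus.train[0]  # first training sample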
Sentence[17]: "千叶 湖 八岁 孩子 不想 去学 英语 跳楼 「 辟谣 千叶 湖 八岁 孩子 跳楼 谣言 信息" → disagreed (1.0)

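Next, we compute embeddings for the corpus. First, though, we need to introduce a Flair concept: Document Embeddings. We have in fact already met document embeddings in the earlier text classification experiment, where we simply summed the embedding vectors of all words in a text and used the result as the feature for the whole text. That is the same idea as a document embedding here, except that Flair provides a richer set of document embedding methods.

For example, Flair provides a pooling-based document embedding, which takes the mean, maximum, or minimum of the word embedding vectors as the embedding of the whole text. As the sample code below shows, Flair also accepts a list of different word embeddings in a document embedding, so several word embedding methods can be combined flexibly.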

from flair.embeddings import WordEmbeddings, FlairEmbeddings, DocumentPoolEmbeddings

# initialize the word embeddings
glove_embedding = WordEmbeddings('glove')
flair_embedding_forward = FlairEmbeddings('news-forward')
flair_embedding_backward = FlairEmbeddings('news-backward')

# initialize the document embeddings, pooling = mean
document_embeddings = DocumentPoolEmbeddings([glove_embedding,
                                              flair_embedding_backward,
                                              flair_embedding_forward], pooling='mean')
document_embeddings.embed(sentence)
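What pooling does to the token vectors can be shown with plain Python lists. This is a minimal sketch of the idea rather than Flair's implementation; the function name `pool` and the toy 3-dimensional vectors are made up for illustration.

```python
def pool(token_vectors, mode="mean"):
    """Collapse a list of equal-length token vectors into one vector."""
    columns = list(zip(*token_vectors))  # one tuple per dimension
    if mode == "mean":
        return [sum(col) / len(col) for col in columns]
    if mode == "max":
        return [max(col) for col in columns]
    if mode == "min":
        return [min(col) for col in columns]
    raise ValueError(f"unknown pooling mode: {mode}")


# Three tokens, each with a toy 3-dimensional embedding.
token_vectors = [
    [1.0, -2.0, 0.5],
    [3.0, 0.0, 0.5],
    [2.0, 2.0, -1.0],
]
print(pool(token_vectors, "mean"))
print(pool(token_vectors, "max"))
print(pool(token_vectors, "min"))
```

Whatever the number of tokens, the pooled vector has the same length as a single word embedding, which is what lets texts of different lengths feed the same classifier. Pooling ignores word order.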

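Here, however, we use only WordEmbeddings() as the word embedding method to keep things fast. We then use DocumentRNNEmbeddings, an RNN-based document embedding, to obtain the embedding vector of each text passage. DocumentRNNEmbeddings simply builds a small RNN that takes the word embedding vectors as input and treats its output as the document embedding.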

from flair.embeddings import WordEmbeddings, DocumentRNNEmbeddings

word_embeddings = [WordEmbeddings("zh")]  # word embeddings
document_embeddings = DocumentRNNEmbeddings(
    word_embeddings,
    hidden_size=512,
    reproject_words=True,
    reproject_words_dimension=256,
)  # document embedding
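The idea of using an RNN's final hidden state as the document embedding can be sketched in a few lines of plain Python. This is only an illustration under loud assumptions: it uses a plain tanh RNN with small hand-picked weights, whereas Flair trains its weights and, judging by the model summary printed during training, uses a GRU. The function name and all numbers are hypothetical.

```python
import math


def rnn_document_embedding(token_vectors, w_in, w_hidden, bias):
    """Run a plain tanh RNN over the tokens and return the last hidden state."""
    hidden = [0.0] * len(bias)
    for x in token_vectors:
        new_hidden = []
        for j in range(len(bias)):
            total = bias[j]
            total += sum(w_in[j][i] * x[i] for i in range(len(x)))
            total += sum(w_hidden[j][k] * hidden[k] for k in range(len(hidden)))
            new_hidden.append(math.tanh(total))
        hidden = new_hidden
    return hidden


# Hypothetical fixed weights: 2-dimensional word vectors, 3 hidden units.
w_in = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.6]]
w_hidden = [[0.1, 0.0, 0.2], [0.0, 0.1, -0.1], [0.3, -0.2, 0.1]]
bias = [0.0, 0.1, -0.1]

short_doc = [[1.0, 0.0], [0.0, 1.0]]
long_doc = short_doc + [[0.5, 0.5], [1.0, 1.0]]
short_embedding = rnn_document_embedding(short_doc, w_in, w_hidden, bias)
long_embedding = rnn_document_embedding(long_doc, w_in, w_hidden, bias)
print(len(short_embedding), len(long_embedding))
```

Both documents map to a vector whose length equals the hidden size, and unlike pooling the result depends on the order of the tokens.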

Next, we can use Flair's text classification API to build a text classifier and train it.

from flair.models import TextClassifier
from flair.trainers import ModelTrainer

# Initialize the classifier
classifier = TextClassifier(
    document_embeddings,
    label_dictionary=corpus.make_label_dictionary(label_type='class'),
    multi_label=False,
    label_type='class'
)
# Train the classifier
trainer = ModelTrainer(classifier, corpus)
trainer.train("./", max_epochs=1)  # model files and logs are written to the current directory
2023-11-14 10:53:16,623 Computing label dictionary. Progress:
2023-11-14 10:53:17,566 Dictionary created for label 'class' with 3 values: disagreed (seen 3628 times), unrelated (seen 3596 times), agreed (seen 3576 times)
2023-11-14 10:53:17,579 ----------------------------------------------------------------------------------------------------
2023-11-14 10:53:17,579 Model: "TextClassifier(
  (embeddings): DocumentRNNEmbeddings(
    (embeddings): StackedEmbeddings(
      (list_embedding_0): WordEmbeddings(
        'zh'
        (embedding): Embedding(332647, 300)
      )
    )
    (word_reprojection_map): Linear(in_features=300, out_features=256, bias=True)
    (rnn): GRU(256, 512, batch_first=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
  (decoder): Linear(in_features=512, out_features=3, bias=True)
  (dropout): Dropout(p=0.0, inplace=False)
  (locked_dropout): LockedDropout(p=0.0)
  (word_dropout): WordDropout(p=0.0)
  (loss_function): CrossEntropyLoss()
  (weights): None
  (weight_tensor) None
)"
2023-11-14 10:53:17,579 ----------------------------------------------------------------------------------------------------
2023-11-14 10:53:17,580 Corpus: 10800 train + 1200 dev + 3000 test sentences
2023-11-14 10:53:17,580 ----------------------------------------------------------------------------------------------------
2023-11-14 10:53:17,580 Train:  10800 sentences
2023-11-14 10:53:17,580         (train_with_dev=False, train_with_test=False)
2023-11-14 10:53:17,581 ----------------------------------------------------------------------------------------------------
2023-11-14 10:53:17,581 Training Params:
2023-11-14 10:53:17,581  - learning_rate: "0.1" 
2023-11-14 10:53:17,581  - mini_batch_size: "32"
2023-11-14 10:53:17,582  - max_epochs: "1"
2023-11-14 10:53:17,582  - shuffle: "True"
2023-11-14 10:53:17,582 ----------------------------------------------------------------------------------------------------
2023-11-14 10:53:17,582 Plugins:
2023-11-14 10:53:17,583  - AnnealOnPlateau | patience: '3', anneal_factor: '0.5', min_learning_rate: '0.0001'
2023-11-14 10:53:17,583 ----------------------------------------------------------------------------------------------------
2023-11-14 10:53:17,583 Final evaluation on model from best epoch (best-model.pt)
2023-11-14 10:53:17,583  - metric: "('micro avg', 'f1-score')"
2023-11-14 10:53:17,584 ----------------------------------------------------------------------------------------------------
2023-11-14 10:53:17,584 Computation:
2023-11-14 10:53:17,584  - compute on device: cpu
2023-11-14 10:53:17,585  - embedding storage: cpu
2023-11-14 10:53:17,585 ----------------------------------------------------------------------------------------------------
2023-11-14 10:53:17,585 Model training base path: "."
2023-11-14 10:53:17,585 ----------------------------------------------------------------------------------------------------
2023-11-14 10:53:17,586 ----------------------------------------------------------------------------------------------------
2023-11-14 10:53:20,609 epoch 1 - iter 33/338 - loss 1.24147453 - time (sec): 3.02 - samples/sec: 349.38 - lr: 0.100000 - momentum: 0.000000
2023-11-14 10:53:23,036 epoch 1 - iter 66/338 - loss 1.19262065 - time (sec): 5.45 - samples/sec: 387.56 - lr: 0.100000 - momentum: 0.000000
2023-11-14 10:53:25,995 epoch 1 - iter 99/338 - loss 1.17272384 - time (sec): 8.41 - samples/sec: 376.75 - lr: 0.100000 - momentum: 0.000000
2023-11-14 10:53:28,436 epoch 1 - iter 132/338 - loss 1.16206076 - time (sec): 10.85 - samples/sec: 389.33 - lr: 0.100000 - momentum: 0.000000
2023-11-14 10:53:30,844 epoch 1 - iter 165/338 - loss 1.15570651 - time (sec): 13.26 - samples/sec: 398.28 - lr: 0.100000 - momentum: 0.000000
2023-11-14 10:53:33,111 epoch 1 - iter 198/338 - loss 1.14712376 - time (sec): 15.53 - samples/sec: 408.11 - lr: 0.100000 - momentum: 0.000000
2023-11-14 10:53:35,864 epoch 1 - iter 231/338 - loss 1.14621714 - time (sec): 18.28 - samples/sec: 404.42 - lr: 0.100000 - momentum: 0.000000
2023-11-14 10:53:38,292 epoch 1 - iter 264/338 - loss 1.14031862 - time (sec): 20.71 - samples/sec: 408.00 - lr: 0.100000 - momentum: 0.000000
2023-11-14 10:53:40,654 epoch 1 - iter 297/338 - loss 1.13598081 - time (sec): 23.07 - samples/sec: 412.00 - lr: 0.100000 - momentum: 0.000000
2023-11-14 10:53:42,987 epoch 1 - iter 330/338 - loss 1.13634822 - time (sec): 25.40 - samples/sec: 415.73 - lr: 0.100000 - momentum: 0.000000
2023-11-14 10:53:44,147 ----------------------------------------------------------------------------------------------------
2023-11-14 10:53:44,148 EPOCH 1 done: loss 1.1351 - lr: 0.100000
2023-11-14 10:53:44,997 DEV : loss 1.0034687519073486 - f1-score (micro avg)  0.4283
2023-11-14 10:53:45,114  - 0 epochs without improvement
2023-11-14 10:53:45,114 saving best model
2023-11-14 10:53:47,711 ----------------------------------------------------------------------------------------------------
2023-11-14 10:53:47,715 Loading model from best epoch ...
2023-11-14 10:53:51,879 
Results:
- F-score (micro) 0.4107
- F-score (macro) 0.3339
- Accuracy 0.4107

By class:
              precision    recall  f1-score   support

   unrelated     0.3432    0.8278    0.4852       999
   disagreed     0.6864    0.4141    0.5166       978
      agreed     0.0000    0.0000    0.0000      1023

    accuracy                         0.4107      3000
   macro avg     0.3432    0.4140    0.3339      3000
weighted avg     0.3380    0.4107    0.3300      3000

2023-11-14 10:53:51,880 ----------------------------------------------------------------------------------------------------
{'test_score': 0.4106666666666667}

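Without a GPU, the training above takes a long time. Because we use only the FastText WordEmbeddings and train for a single epoch, the final accuracy on the fake news dataset is also not ideal (about 0.41 on the test set in the run above). You may stop the training and continue with the rest of the material.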
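The summary rows of the evaluation report can be recomputed from the per-class rows, which also makes clear what each average means. The sketch below copies the rounded numbers from the report above, so the results agree with the printed values only up to rounding.

```python
# (precision, recall, f1, support) per class, copied from the report above.
report = {
    "unrelated": (0.3432, 0.8278, 0.4852, 999),
    "disagreed": (0.6864, 0.4141, 0.5166, 978),
    "agreed": (0.0000, 0.0000, 0.0000, 1023),
}

total = sum(support for *_, support in report.values())

# Macro average: every class counts equally.
macro_f1 = sum(f1 for _, _, f1, _ in report.values()) / len(report)

# Weighted average: classes count in proportion to their support.
weighted_f1 = sum(f1 * support for _, _, f1, support in report.values()) / total

# Accuracy: correctly predicted samples (recall * support) over all samples.
accuracy = sum(recall * support for _, recall, _, support in report.values()) / total

print(round(macro_f1, 4), round(weighted_f1, 4), round(accuracy, 4))
```

The agreed class is never predicted correctly in this run, which drags both averages well below the F1 scores of the other two classes.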

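After training finishes, Flair saves the final model final-model.pt and the best model best-model.pt in the current directory for later reuse. Training logs and the recorded loss values are saved there as well.

To run inference with a saved model, load it with TextClassifier.load.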

classifier = TextClassifier.load("./best-model.pt")  # load the best model
sentence = Sentence("千叶 湖 八岁 孩子 不想 去学 英语 跳楼 辟谣 千叶 湖 八岁 孩子 跳楼 谣言 信息")
classifier.predict(sentence)  # run inference
print(sentence.labels)  # print the predicted label
['Sentence[16]: "千叶 湖 八岁 孩子 不想 去学 英语 跳楼 辟谣 千叶 湖 八岁 孩子 跳楼 谣言 信息"'/'disagreed' (0.5996)]

Finally, the model outputs the predicted class together with its probability.
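How a class and its probability can be derived from raw classifier scores is sketched below with a softmax. This assumes a single-label classifier that normalizes the outputs of its final linear layer with softmax; the logits are invented for the demo and are not taken from the trained model.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


labels = ["disagreed", "unrelated", "agreed"]
logits = [1.2, 0.4, -0.3]  # hypothetical decoder outputs, not from the model

probabilities = softmax(logits)
best = max(range(len(labels)), key=lambda i: probabilities[i])
print(labels[best], round(probabilities[best], 4))
```

The reported label is simply the class with the highest probability, so a value only slightly above one third indicates a prediction the model is not very sure about.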

75.5. Summary

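In this experiment, we focused on two commonly used natural language processing tools: NLTK and Flair. Beyond these, you can also explore Python libraries such as FastText, spaCy, Pattern, and TextBlob on your own. In practice, most natural language processing tools support English and other Latin-script languages well, while support for Chinese is often less satisfactory. There are, however, open-source tools from Chinese institutions, such as THULAC from Tsinghua University and LAC from Baidu.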

Related links