数据准备

下面都用20-newsgroups数据集来进行说明，使用方法：

from sklearn.datasets import fetch_20newsgroups

categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
dataset = fetch_20newsgroups(
    subset='train', categories=categories, shuffle=True, random_state=42)

数据大概是这样子：

In [36]: print(dataset.data[:2])
Out[36]: 
['From: [email protected] (Michael Collier)\nSubject: Converting images to HP LaserJet III?\nNntp-Posting-Host: hampton\nOrganization: The City University\nLines: 14\n\nDoes anyone know of a good way (standard PC application/PD utility) to\nconvert tif/img/tga files into LaserJet III format.  We would also like to\ndo the same, converting to HPGL (HP plotter) files.\n\nPlease email any response.\n\nIs this the correct group?\n\nThanks in advance.  Michael.\n-- \nMichael Collier (Programmer)                 The Computer Unit,\nEmail: [email protected]                The City University,\nTel: 071 477-8000 x3769                      London,\nFax: 071 477-8565                            EC1V 0HB.\n',
 "From: [email protected] (Aniruddha B. Deglurkar)\nSubject: help: Splitting a trimming region along a mesh \nOrganization: University Of Kentucky, Dept. of Math Sciences\nLines: 28\n\n\n\n\tHi,\n\n\tI have a problem, I hope some of the 'gurus' can help me solve.\n\n\tBackground of the problem:\n\tI have a rectangular mesh in the uv domain, i.e  the mesh is a \n\tmapping of a 3d Bezier patch into 2d. The area in this domain\n\twhich is inside a trimming loop had to be rendered. The trimming\n\tloop is a set of 2d Bezier curve segments.\n\tFor the sake of notation: the mesh is made up of cells.\n\n\tMy problem is this :\n\tThe trimming area has to be split up into individual smaller\n\tcells bounded by the trimming curve segments. If a cell\n\tis wholly inside the area...then it is output as a whole ,\n\telse it is trivially rejected. \n\n\tDoes any body know how thiss can be done, or is there any algo. \n\tsomewhere for doing this.\n\n\tAny help would be appreciated.\n\n\tThanks, \n\tAni.\n-- \nTo get irritated is human, to stay cool, divine.\n"]

One-Hot Encoding

下面说两种实现方式：

自己写一个

这功能比较简单，自己可以写一个。虽然不是很提倡，但是下面做个简单的示范，拿英语为例子（因为这样子分词会比较简单，要自己写一个中文分词会比较麻烦）：

from sklearn.datasets import fetch_20newsgroups
from scipy.sparse import csr_matrix
import re


class Counter():
    def __init__(self, lang):
        self.lang = lang
        self.word2idx = {}
        self.idx2word = {}
        self.vocab = set()
        self.col_length = 0
        self.row_length = 0
        self.clean_re = re.compile(r"[^a-zA-Z?.!,¿]+") # 去掉一些不是内容的字符集
        self.token_re = re.compile(r"[\.|\?|\ |\,|\']") # 分词

        self.create_index() # 构建mapping

    def tokenizer(self, text):
        text = self.clean_re.sub(" ", text)
        return self.token_re.split(text)

    def create_index(self):
        for text in self.lang:
            tokens = self.tokenizer(text)
            if self.col_length < len(tokens):
                self.col_length = len(tokens)
            self.vocab.update(tokens)

        self.vocab = sorted(self.vocab)

        self.word2idx['<pad>'] = 0
        for index, word in enumerate(self.vocab):
            self.word2idx[word] = index + 1

        for word, index in self.word2idx.items():
            self.idx2word[index] = word

    def transform2index(self, texts):
        for text in texts:
            tokens = self.tokenizer(text)
            indices = [self.word2idx.get(token, 0) for token in tokens]
            indices += [0 for i in range(self.col_length - len(indices))]
            yield indices

    def transform2matrix(self, texts):
        indices = self.transform2index(texts)
        self.row_length = len(self.vocab)
        for index in indices:
            _row = list(range(len(index)))[:self.row_length]
            _col = index
            data = [1 for i in _row]
            yield csr_matrix((data, (_row, _col)), shape=(self.col_length, self.row_length))

我们拿上面的数据试一下：

In [72]: counter = Counter(dataset.data)
    ...: print(list(counter.transform2index(["I'm Groot"])))
    ...: print(list(counter.transform2matrix(["I'm Groot"])))
[[6136, 28031, 0]]
[<11000x39831 sparse matrix of type '<class 'numpy.int64'>'
	with 3 stored elements in Compressed Sparse Row format>]

Scikit-Learn

第一种，用的是scikit-learn提供的接口CountVectorizer，方法如下：

from sklearn.feature_extraction.text import CountVectorizer
counter = CountVectorizer()
counter.fit(dataset.data)

通过上面的fit操作，CountVectorizer就会帮你构建一个词表，查看词表如下：

In [39]: counter.vocabulary_
Out[39]: 
{'from': 14887,
 'sd345': 29022,
 'city': 8696,
 'ac': 4017,
 'uk': 33256,
 'michael': 21661,
 'collier': 9031,
...
}

有人说，这个值不值持中文或者其它语言。答案是当然了，实际上CountVectorizer在构建这个mapping的时候，会把输入text进行tokenization，也就是我们平时说的分词。负责这个是tokenizer函数，具体细节大家可以去看下源码。比如中文，你可以调用jieba分词来定义它：

import jieba

def tokenizer(text):
    return list(jieba.cut(text, cut_all=False))

然后我们在创建CountVectorizer的对象时候，加上它：

counter = CountVectorizer(tokenizer=tokenizer)

利用上面的counter把字符串转成向量，返回的是一个稀疏矩阵

In [42]: counter.transform(["hello i'm goot"])
Out[42]: 
<1x35788 sparse matrix of type '<class 'numpy.int64'>'
	with 1 stored elements in Compressed Sparse Row format>

如果你需要在一开始拿去fit的数据都转成vector，你可以用fit_transform来代替掉fit，这样它会在训练完之后就返回那个稀疏矩阵。

Embedding （词嵌入）

上面讲的是One-Hot编码的，因为要进行Embedding之前，都需要把Word转成词表。同样下面说两种方法：

Gensim

Gensim可以说是NLP界处理文本的神器,最常用就是拿它来做文本特征，Gensim用的就是WORD2VEC算法。当你拿去构建词表的文本不多时候，可以直接这么做：

import re
from gensim.models import Word2Vec

texts = [re.sub("[^a-zA-Z?.!,¿]+", " ", text) for text in dataset.data]
texts = [re.split("[\.|\?|\ |\,|\']", text) for text in texts]
model = Word2Vec(texts)

当然啦，很多时候我们要处理的文本量都比较大，我们没法一次都加载到内存中，这时候我们可以这么做：


def text_generator():
    for text in dataset.data"
        text = re.sub("[^a-zA-Z?.!,¿]+", " ", text)
        text = re.split("[\.|\?|\ |\,|\']", text)
        yield text

model = Word2Vec(workers=4)
model.build_vocab(text_generator())
model.train(text_generator())

保存模型

如果你的文本量很大的话，训练一次这个模型也是很费时间的，所以我们训练完之后可以把它保存起来，后面使用的时候只需要load进来就可以了。

model.save('path_to_save') # 保训练好的模型

model = Word2Vec.load('path_to_save') # 加载训练过的模型

Tensorflow

准确的说，实际上用的是Keras的模块，不错博主平时用的是Tensorflow，这里就偷懒不去用Keras来说明了。使用之前，也是需要把文本转成id:

counter = Counter(dataset.data)
indices = counter.transform2index(dataset.data)

其实这里应该可以把Embedding看成是网络中的一层，降输入层映射到高维。由于输入的文本太多，我们用Dataset去加载。

import tensorflow as tf

tf.enable_eager_execution()

data = tf.data.Dataset().from_generator(lambda :indices, output_types=(tf.float32))
data = data.batch(100).make_one_shot_iterator()
X = data.get_next()
embedding = tf.keras.layers.Embedding(len(counter.vocab), 200)
embedding(X)

输出：

Out[12]: 
<tf.Tensor: id=77, shape=(100, 11000, 200), dtype=float32, numpy=
array([[[0.7414243 , 0.07747948, 0.60920155, ..., 0.46979737,
         0.9815726 , 0.17995226],
        [0.03834593, 0.39614272, 0.3659439 , ..., 0.24742019,
         0.9771553 , 0.30359387],
        [0.5240742 , 0.58283293, 0.95961046, ..., 0.5749551 ,
         0.846769  , 0.9823524 ],
        ...,
        [0.3502736 , 0.988209  , 0.85835266, ..., 0.626845  ,
         0.56388843, 0.13658428],
        [0.3502736 , 0.988209  , 0.85835266, ..., 0.626845  ,
         0.56388843, 0.13658428],
        [0.3502736 , 0.988209  , 0.85835266, ..., 0.626845  ,
         0.56388843, 0.13658428]]], dtype=float32)>

几种词向量的实现

相关理论