fastNLP.embeddings.char_embedding

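This module contains the character-level embeddings: a CNN-based and an LSTM-based character embedding. Like the other embeddings, these take word indices as input; you do not need to supply the indices of the characters inside each word to obtain a representation.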
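To illustrate why word indices are sufficient as input, here is a minimal sketch of a word-index to character-index lookup table in plain Python. It is not fastNLP's implementation: the special tokens `<pad>` and `<unk>`, the helper `build_word_to_chars`, and the choice to leave special tokens as all-padding rows are assumptions made for this example.

```python
from typing import Dict, List

PAD, UNK = "<pad>", "<unk>"  # hypothetical special tokens for this sketch


def build_word_to_chars(words: List[str]) -> tuple:
    """Return (word2idx, char2idx, table), where table[i] holds the
    padded character indices of the word whose index is i."""
    word2idx: Dict[str, int] = {PAD: 0, UNK: 1}
    for w in words:
        word2idx.setdefault(w, len(word2idx))

    char2idx: Dict[str, int] = {PAD: 0, UNK: 1}
    for w in words:
        for ch in w:
            char2idx.setdefault(ch, len(char2idx))

    max_word_len = max(len(w) for w in words)
    table = [[0] * max_word_len for _ in word2idx]
    for w, i in word2idx.items():
        if w in (PAD, UNK):
            continue  # special tokens keep an all-padding row in this sketch
        ids = [char2idx[ch] for ch in w]
        table[i][: len(ids)] = ids
    return word2idx, char2idx, table


sentence = "The weather is good .".split()
word2idx, char2idx, table = build_word_to_chars(sentence)

# A batch of word indices, shape [batch_size, max_len] = [1, 5].
batch = [[word2idx[w] for w in sentence]]

# Looking up each word index yields character indices,
# shape [batch_size, max_len, max_word_len] = [1, 5, 7].
char_batch = [[table[i] for i in sent] for sent in batch]
```

Because the table is built once from the vocabulary, the caller only ever handles word indices; the character indices are recovered inside the embedding.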

class fastNLP.embeddings.char_embedding.CNNCharEmbedding(vocab: fastNLP.core.vocabulary.Vocabulary, embed_size: int = 50, char_emb_size: int = 50, word_dropout: float = 0, dropout: float = 0, filter_nums: List[int] = (40, 30, 20), kernel_sizes: List[int] = (5, 3, 1), pool_method: str = 'max', activation='relu', min_char_freq: int = 2, pre_train_char_embed: str = None, requires_grad: bool = True, include_word_start_end: bool = True)

Bases: fastNLP.embeddings.TokenEmbedding

Aliases: fastNLP.embeddings.CNNCharEmbedding, fastNLP.embeddings.char_embedding.CNNCharEmbedding

Generates character embeddings with a CNN. The structure is embed(x) -> Dropout(x) -> CNN(x) -> activation(x) -> pool -> fc -> Dropout. The outputs of the filters with different kernel sizes are concatenated and passed through one fully connected layer, which produces the word representation.

Example:
>>> import torch
>>> from fastNLP import Vocabulary
>>> from fastNLP.embeddings import CNNCharEmbedding
>>> vocab = Vocabulary().add_word_lst("The weather is good .".split())
>>> embed = CNNCharEmbedding(vocab, embed_size=50)
>>> words = torch.LongTensor([[vocab.to_index(word) for word in "The weather is good .".split()]])
>>> outputs = embed(words)
>>> outputs.size()
>>> # torch.Size([1, 5, 50])
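The CNN -> activation -> pool -> fc part of the pipeline can be sketched for a single word in plain Python. This is only an illustration under simplifying assumptions, not the library's code: the helpers `conv1d`, `cnn_char_features` and `linear` are hypothetical, dropout and convolution padding are omitted, the activation is fixed to ReLU with max pooling, and the word is assumed to be at least as long as the largest kernel.

```python
from typing import List

Vector = List[float]


def relu(x: float) -> float:
    return x if x > 0.0 else 0.0


def conv1d(chars: List[Vector], filt: List[Vector]) -> List[float]:
    """Slide one filter (kernel_size x char_emb_size) over the character
    vectors of a word; returns one value per window position."""
    k = len(filt)
    out = []
    for start in range(len(chars) - k + 1):
        total = 0.0
        for offset in range(k):
            total += sum(c * w for c, w in zip(chars[start + offset], filt[offset]))
        out.append(total)
    return out


def cnn_char_features(chars: List[Vector],
                      filters_per_kernel: List[List[List[Vector]]]) -> Vector:
    """conv -> ReLU -> max-pool for every filter, then concatenate."""
    features = []
    for filters in filters_per_kernel:       # one group per kernel size
        for filt in filters:                 # several filters per group
            activated = [relu(v) for v in conv1d(chars, filt)]
            features.append(max(activated))  # pool over character positions
    return features


def linear(x: Vector, weight: List[Vector]) -> Vector:
    """Fully connected layer without bias: one output per weight row."""
    return [sum(xi * wi for xi, wi in zip(x, row)) for row in weight]


# Toy word with 3 characters and char_emb_size = 2.
chars = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
filters_per_kernel = [
    [[[1.0, -1.0]], [[0.5, 0.5]]],   # kernel size 1, two filters
    [[[1.0, 0.0], [0.0, 1.0]]],      # kernel size 2, one filter
]
features = cnn_char_features(chars, filters_per_kernel)  # length 2 + 1 = 3
fc_weight = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]           # projects 3 -> embed_size 2
word_vector = linear(features, fc_weight)
```

The concatenated feature vector has one entry per filter, so its length is the sum of filter_nums; the fully connected layer then maps it to embed_size.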

__init__(vocab: fastNLP.core.vocabulary.Vocabulary, embed_size: int = 50, char_emb_size: int = 50, word_dropout: float = 0, dropout: float = 0, filter_nums: List[int] = (40, 30, 20), kernel_sizes: List[int] = (5, 3, 1), pool_method: str = 'max', activation='relu', min_char_freq: int = 2, pre_train_char_embed: str = None, requires_grad: bool = True, include_word_start_end: bool = True)

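Parameters

vocab -- The vocabulary.
embed_size -- Output dimension of this CNNCharEmbedding. Defaults to 50.
char_emb_size -- Dimension of the character embeddings. The characters are derived from vocab. Defaults to 50.
word_dropout (float) -- Probability of replacing a word with unk. This both trains the unk representation and acts as a form of regularization.
dropout (float) -- Probability of dropping values in the distributed representation and in the output of the char embedding.
filter_nums -- Number of filters for each kernel size. Must have the same length as kernel_sizes. Defaults to [40, 30, 20].
kernel_sizes -- Kernel sizes. Defaults to [5, 3, 1].
pool_method -- Pooling method used to combine the character representations into a single representation. Supports 'avg' and 'max'.
activation -- Activation applied after the CNN. Supports 'relu', 'sigmoid', 'tanh', or a custom function.
min_char_freq -- Minimum number of occurrences of a character. Defaults to 2.
pre_train_char_embed -- A pretrained character embedding can be loaded in two ways. The first is to pass an embedding folder (which should contain exactly one file with the .txt suffix) or a file path. The second is to pass the name of an embedding, in which case the cache is checked for that model and it is downloaded automatically if missing. If None, an embedding of dimension char_emb_size is randomly initialized.
requires_grad -- Whether to update the weights.
include_word_start_end -- Whether to add special marker symbols before the first character and after the last character of each word.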
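The interaction of min_char_freq and include_word_start_end can be sketched as follows. This is an assumption-laden illustration, not fastNLP's code: the token names `<bow>`, `<eow>`, `<pad>`, `<unk>`, the helpers `build_char_vocab` and `encode_word`, and the choice to count each vocabulary word once are all made up for this example.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical marker and special tokens; names chosen for this sketch.
BOW, EOW, PAD, UNK = "<bow>", "<eow>", "<pad>", "<unk>"


def build_char_vocab(words: List[str], min_char_freq: int = 2,
                     include_word_start_end: bool = True) -> Dict[str, int]:
    """Keep only characters seen at least min_char_freq times."""
    counts = Counter(ch for w in words for ch in w)
    char2idx = {PAD: 0, UNK: 1}
    if include_word_start_end:
        char2idx[BOW] = len(char2idx)
        char2idx[EOW] = len(char2idx)
    for ch, n in counts.items():
        if n >= min_char_freq:  # rarer characters fall back to UNK
            char2idx[ch] = len(char2idx)
    return char2idx


def encode_word(word: str, char2idx: Dict[str, int],
                include_word_start_end: bool = True) -> List[int]:
    """Map a word to character indices, optionally wrapped in markers."""
    ids = [char2idx.get(ch, char2idx[UNK]) for ch in word]
    if include_word_start_end:
        ids = [char2idx[BOW]] + ids + [char2idx[EOW]]
    return ids


vocab_words = "The weather is good .".split()
char2idx = build_char_vocab(vocab_words, min_char_freq=2)
encoded = encode_word("The", char2idx)
```

In this toy vocabulary only 'h', 'e' and 'o' occur at least twice, so the 'T' in "The" is encoded as the unknown character while the start and end markers wrap the word.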

forward(words)

Takes the indices of the words as input and generates the corresponding word representations.

Parameters: words -- [batch_size, max_len]
Returns: [batch_size, max_len, embed_size]
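As a hedged illustration of the [batch_size, max_len] input shape, the sketch below pads sentences of different lengths into a rectangular batch of word indices. The helper `to_padded_batch`, the toy `word2idx` mapping, and the padding and unknown indices 0 and 1 are assumptions for this example; in practice the indices would come from the Vocabulary.

```python
from typing import Dict, List


def to_padded_batch(sentences: List[List[str]], word2idx: Dict[str, int],
                    pad_index: int = 0, unk_index: int = 1) -> List[List[int]]:
    """Convert tokenized sentences to a [batch_size, max_len] index matrix."""
    max_len = max(len(s) for s in sentences)
    batch = []
    for sent in sentences:
        ids = [word2idx.get(w, unk_index) for w in sent]
        batch.append(ids + [pad_index] * (max_len - len(ids)))
    return batch


# Toy mapping standing in for a real vocabulary.
word2idx = {"<pad>": 0, "<unk>": 1, "The": 2, "weather": 3,
            "is": 4, "good": 5, ".": 6}
sentences = [["The", "weather", "is", "good", "."], ["is", "it", "good"]]
batch = to_padded_batch(sentences, word2idx)
```

The resulting nested list can be wrapped in torch.LongTensor before being passed to forward, as in the example above.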

training: bool

class fastNLP.embeddings.char_embedding.LSTMCharEmbedding(vocab: fastNLP.core.vocabulary.Vocabulary, embed_size: int = 50, char_emb_size: int = 50, word_dropout: float = 0, dropout: float = 0, hidden_size=50, pool_method: str = 'max', activation='relu', min_char_freq: int = 2, bidirectional=True, pre_train_char_embed: str = None, requires_grad: bool = True, include_word_start_end: bool = True)

Bases: fastNLP.embeddings.TokenEmbedding

Aliases: fastNLP.embeddings.LSTMCharEmbedding, fastNLP.embeddings.char_embedding.LSTMCharEmbedding

Encodes characters with an LSTM. The structure is embed(x) -> Dropout(x) -> LSTM(x) -> activation(x) -> pool -> Dropout.

Example:

>>> import torch
>>> from fastNLP import Vocabulary
>>> from fastNLP.embeddings import LSTMCharEmbedding
>>> vocab = Vocabulary().add_word_lst("The weather is good .".split())
>>> embed = LSTMCharEmbedding(vocab, embed_size=50)
>>> words = torch.LongTensor([[vocab.to_index(word) for word in "The weather is good .".split()]])
>>> outputs = embed(words)
>>> outputs.size()
>>> # torch.Size([1, 5, 50])
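The pool step, which collapses the per-character outputs of a word into a single vector, might look like the sketch below for the 'max' and 'avg' methods. The helper `pool_chars` and the explicit boolean mask for padded character positions are assumptions of this example rather than the library's interface.

```python
from typing import List


def pool_chars(char_vectors: List[List[float]], mask: List[bool],
               pool_method: str = "max") -> List[float]:
    """Pool per-character vectors into one word vector, ignoring padding.

    char_vectors has shape [max_word_len, hidden]; mask[i] is True for
    real characters and False for padding positions.
    """
    real = [vec for vec, keep in zip(char_vectors, mask) if keep]
    if not real:
        raise ValueError("word has no real characters")
    columns = list(zip(*real))  # transpose to [hidden, num_real_chars]
    if pool_method == "max":
        return [max(col) for col in columns]
    if pool_method == "avg":
        return [sum(col) / len(col) for col in columns]
    raise ValueError(f"unsupported pool_method: {pool_method!r}")


vectors = [[1.0, -2.0], [3.0, 0.0], [9.0, 9.0]]  # last row is padding
mask = [True, True, False]
max_pooled = pool_chars(vectors, mask, "max")
avg_pooled = pool_chars(vectors, mask, "avg")
```

Masking matters for both methods: without it, the padding row would dominate the max and skew the average.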

__init__(vocab: fastNLP.core.vocabulary.Vocabulary, embed_size: int = 50, char_emb_size: int = 50, word_dropout: float = 0, dropout: float = 0, hidden_size=50, pool_method: str = 'max', activation='relu', min_char_freq: int = 2, bidirectional=True, pre_train_char_embed: str = None, requires_grad: bool = True, include_word_start_end: bool = True)

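Parameters

vocab -- The vocabulary.
embed_size -- Output dimension of the LSTMCharEmbedding. Defaults to 50.
char_emb_size -- Dimension of the character embeddings. Defaults to 50.
word_dropout (float) -- Probability of replacing a word with unk. This both trains the unk representation and acts as a form of regularization.
dropout -- Probability of dropping values in the output of the character embedding and in the final word output.
hidden_size -- Hidden size of the LSTM. If the LSTM is bidirectional, the hidden size is halved. Defaults to 50.
pool_method -- Supports 'max' and 'avg'.
activation -- Activation function. Supports 'relu', 'sigmoid', 'tanh', or a custom function.
min_char_freq -- Minimum number of occurrences of a character. Defaults to 2.
bidirectional -- Whether to encode with a bidirectional LSTM. Defaults to True.
pre_train_char_embed -- A pretrained character embedding can be loaded in two ways. The first is to pass an embedding folder (which should contain exactly one file with the .txt suffix) or a file path. The second is to pass the name of an embedding, in which case the cache is checked for that model and it is downloaded automatically if missing. If None, an embedding of dimension char_emb_size is randomly initialized.
requires_grad -- Whether to update the weights.
include_word_start_end -- Whether to add special marker symbols before the first character and after the last character of each word.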
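The effect of word_dropout can be sketched in plain Python as below. This is a minimal illustration and not the library's code: the helper `word_dropout`, the indices 0 for padding and 1 for unk, and the decision to leave padding positions untouched are assumptions of this example.

```python
import random
from typing import List


def word_dropout(words: List[List[int]], p: float, unk_index: int,
                 pad_index: int, rng: random.Random) -> List[List[int]]:
    """Replace each non-padding word index with unk_index with probability p."""
    return [
        [unk_index if w != pad_index and rng.random() < p else w for w in sent]
        for sent in words
    ]


rng = random.Random(0)
batch = [[5, 6, 7, 0], [8, 9, 0, 0]]  # 0 is the padding index in this sketch
unchanged = word_dropout(batch, p=0.0, unk_index=1, pad_index=0, rng=rng)
all_unk = word_dropout(batch, p=1.0, unk_index=1, pad_index=0, rng=rng)
```

With p=0 the batch passes through unchanged, and with p=1 every real word becomes unk, which shows the two extremes between which the regularization strength is tuned.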

training: bool

forward(words)

Takes the indices of the words as input and generates the corresponding word representations.

Parameters: words -- [batch_size, max_len]
Returns: [batch_size, max_len, embed_size]