fastNLP.io.pipe

A Pipe processes the data read by a Loader. Every Pipe provides two methods, process and process_from_file. process(data_bundle) takes a DataBundle object, modifies it in place, and returns it; process_from_file(paths) takes file paths and returns a DataBundle object. The DataSet objects in the data_bundle returned by either method generally contain the raw text, the input converted to indices, and the target converted to indices. Besides the DataSet objects, the data_bundle also holds the vocabularies built while converting fields to indices.

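class fastNLP.io.pipe.Pipe

Alias: fastNLP.io.Pipe, fastNLP.io.pipe.Pipe

Pipe is the fastNLP class for processing a DataBundle; in practice it processes the DataSet objects inside the DataBundle. Each Pipe states, in the documentation of its process() function, the format a DataSet must have to be processed by that Pipe. The Pipe's own documentation describes the format of the DataSet after processing, its fields, and any newly added Vocabulary objects.

In general, a Pipe performs the following steps: (1) tokenize raw_words or raw_chars into words or characters; (2) build a Vocabulary over the words or characters and convert them to indices; (3) build a vocabulary for the target column and convert the target column to indices.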
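The three steps can be sketched in plain Python. This is only an illustration under stated assumptions, not fastNLP's implementation: the helpers build_vocab and to_indices, the special tokens "<pad>" and "<unk>", and the frequency-based index order are all hypothetical choices made for this example.

```python
from collections import Counter


def build_vocab(sequences, specials=("<pad>", "<unk>")):
    """Map each token to an index; special tokens take the lowest indices."""
    counts = Counter(tok for seq in sequences for tok in seq)
    vocab = {tok: i for i, tok in enumerate(specials)}
    # Most frequent tokens first; ties are broken alphabetically for determinism.
    for tok, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        vocab[tok] = len(vocab)
    return vocab


def to_indices(seq, vocab):
    """Convert tokens to indices, falling back to <unk> for unseen tokens."""
    unk = vocab["<unk>"]
    return [vocab.get(tok, unk) for tok in seq]


raw_words = ["it 's a charming film", "a bleak film"]
targets = ["pos", "neg"]

# (1) tokenize on whitespace
tokens = [s.split() for s in raw_words]
# (2) build the word vocabulary and index the input
word_vocab = build_vocab(tokens)
words = [to_indices(t, word_vocab) for t in tokens]
seq_len = [len(w) for w in words]
# (3) build the target vocabulary (no special tokens) and index the target
target_vocab = build_vocab([[t] for t in targets], specials=())
target = [target_vocab[t] for t in targets]
```

The resulting words, seq_len, and target lists play the role of the index columns that a Pipe adds next to the raw text.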

A Pipe provides two methods:

- process(): takes a DataBundle as input.
- process_from_file(): takes any input type accepted by the load function of the corresponding Loader.

__init__: Initialize self. See help(type(self)) for accurate signature.

process(data_bundle: fastNLP.io.data_bundle.DataBundle) → fastNLP.io.data_bundle.DataBundle

Processes the input DataBundle and returns that same DataBundle.

Parameters:	data_bundle (DataBundle) -- the DataBundle object to process
Returns:	DataBundle

process_from_file(paths) → fastNLP.io.data_bundle.DataBundle

Takes file paths and produces a processed DataBundle object. For the path forms that paths supports, see fastNLP.io.Loader.load().

Parameters:	paths --
Returns:	DataBundle

class fastNLP.io.pipe.CWSPipe(dataset_name=None, encoding_type='bmes', replace_num_alpha=True, bigrams=False, trigrams=False)

Bases: fastNLP.io.Pipe

Alias: fastNLP.io.CWSPipe, fastNLP.io.pipe.CWSPipe

Preprocesses Chinese word segmentation (CWS) data. After processing, the data has the following structure:

raw_words	chars	target	seq_len
共同创造美好...	[2, 3, 4...]	[0, 2, 0, 2,...]	13
2001年新年钟声...	[8, 9, 9, 7, ...]	[0, 1, 1, 1, 2...]	20
...	[...]	[...]	.

The dataset's print_field_meta() function reports which fields are set as input and target:
+-------------+-----------+-------+--------+---------+
| field_names | raw_words | chars | target | seq_len |
+-------------+-----------+-------+--------+---------+
|   is_input  |   False   |  True |  True  |   True  |
|  is_target  |   False   | False |  True  |   True  |
| ignore_type |           | False | False  |  False  |
|  pad_value  |           |   0   |   0    |    0    |
+-------------+-----------+-------+--------+---------+

__init__(dataset_name=None, encoding_type='bmes', replace_num_alpha=True, bigrams=False, trigrams=False)

Parameters:

dataset_name (str, None) -- supports 'pku', 'msra', 'cityu', 'as', and None
encoding_type (str) -- either 'bmes' or 'segapp'. For "我来自复旦大学..." (segmented as 我 / 来自 / 复旦大学), the bmes tags are [S, B, E, B, M, M, E...] and the segapp tags are [seg, app, seg, app, app, app, seg, ...]
replace_num_alpha (bool) -- whether to replace digits and letters with special characters.
bigrams (bool) -- whether to add a bigram column. Bigrams are built as ['复', '旦', '大', '学', ...] -> ["复旦", "旦大", ...]
trigrams (bool) -- whether to add a trigram column. Trigrams are built as ['复', '旦', '大', '学', ...] -> ["复旦大", "旦大学", ...]
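The two tagging schemes named by encoding_type can be illustrated with a short sketch. It assumes the sentence has already been segmented into non-empty words; bmes_tags and segapp_tags are hypothetical helper names for this example and are not part of fastNLP.

```python
def bmes_tags(words):
    """Tag each character as S (single), B (begin), M (middle), or E (end)."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return tags


def segapp_tags(words):
    """Tag the last character of each word as seg and all earlier ones as app."""
    tags = []
    for w in words:
        tags.extend(["app"] * (len(w) - 1) + ["seg"])
    return tags


words = ["我", "来自", "复旦大学"]
bmes = bmes_tags(words)
segapp = segapp_tags(words)
```

Both functions emit one tag per character, so the tag sequence has the same length as the chars column.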

process(data_bundle: fastNLP.io.data_bundle.DataBundle) → fastNLP.io.data_bundle.DataBundle

The DataSet to process must contain a raw_words column:

raw_words
上海浦东开发与法制建设同步
新华社上海二月十日电（记者谢金虎、张持坚）
...

Parameters:	data_bundle (DataBundle) -- the DataBundle object to process
Returns:	DataBundle

process_from_file(paths=None) → fastNLP.io.data_bundle.DataBundle

Parameters:	paths (str) --
Returns:	DataBundle

class fastNLP.io.pipe.AGsNewsPipe(lower: bool = False, tokenizer: str = 'spacy')

Bases: fastNLP.io.CLSBasePipe

Alias: fastNLP.io.AGsNewsPipe, fastNLP.io.pipe.AGsNewsPipe

Processes the AG's News data. The fields of the DataSet after processing with AGsNewsPipe are as follows:

raw_words	target	words	seq_len
I got 'new' tires from them and within...	0	[7, 110, 22, 107, 22, 499, 59, 140, 3,...]	160
Don't waste your time. We had two dif...	0	[277, 17, 278, 38, 30, 112, 24, 85, 27...	40
...	.	[...]	.

The dataset's print_field_meta() function reports which fields are set as input and target:
+-------------+-----------+--------+-------+---------+
| field_names | raw_words | target | words | seq_len |
+-------------+-----------+--------+-------+---------+
|   is_input  |   False   | False  |  True |   True  |
|  is_target  |   False   |  True  | False |  False  |
| ignore_type |           | False  | False |  False  |
|  pad_value  |           |   0    |   0   |    0    |
+-------------+-----------+--------+-------+---------+

__init__(lower: bool = False, tokenizer: str = 'spacy')

Parameters:

lower (bool) -- whether to lowercase the input.
tokenizer (str) -- which tokenization method to use to split the data into words. Supports 'spacy' and 'raw'; 'raw' splits on spaces.

process(data_bundle: fastNLP.io.data_bundle.DataBundle)

The input DataSet should have the following structure:

raw_words	target
I got 'new' tires from them and...	1
Don't waste your time. We had two...	1
...	...

Parameters:	data_bundle (DataBundle) -- the DataBundle object to process
Returns:	DataBundle

process_from_file(paths=None)

Parameters:	paths (str) --
Returns:	DataBundle

class fastNLP.io.pipe.DBPediaPipe(lower: bool = False, tokenizer: str = 'spacy')

Bases: fastNLP.io.CLSBasePipe

Alias: fastNLP.io.DBPediaPipe, fastNLP.io.pipe.DBPediaPipe

Processes the DBPedia data. The fields of the DataSet after processing with DBPediaPipe are as follows:

raw_words	target	words	seq_len
I got 'new' tires from them and within...	0	[7, 110, 22, 107, 22, 499, 59, 140, 3,...]	160
Don't waste your time. We had two dif...	0	[277, 17, 278, 38, 30, 112, 24, 85, 27...	40
...	.	[...]	.

The dataset's print_field_meta() function reports which fields are set as input and target:
+-------------+-----------+--------+-------+---------+
| field_names | raw_words | target | words | seq_len |
+-------------+-----------+--------+-------+---------+
|   is_input  |   False   | False  |  True |   True  |
|  is_target  |   False   |  True  | False |  False  |
| ignore_type |           | False  | False |  False  |
|  pad_value  |           |   0    |   0   |    0    |
+-------------+-----------+--------+-------+---------+

__init__(lower: bool = False, tokenizer: str = 'spacy')

Parameters:

lower (bool) -- whether to lowercase the input.
tokenizer (str) -- which tokenization method to use to split the data into words. Supports 'spacy' and 'raw'; 'raw' splits on spaces.

process(data_bundle: fastNLP.io.data_bundle.DataBundle)

The input DataSet should have the following structure:

raw_words	target
I got 'new' tires from them and...	1
Don't waste your time. We had two...	1
...	...

Parameters:	data_bundle (DataBundle) -- the DataBundle object to process
Returns:	DataBundle

process_from_file(paths=None)

Parameters:	paths (str) --
Returns:	DataBundle

class fastNLP.io.pipe.YelpFullPipe(lower: bool = False, granularity=5, tokenizer: str = 'spacy')

Bases: fastNLP.io.CLSBasePipe

Alias: fastNLP.io.YelpFullPipe, fastNLP.io.pipe.YelpFullPipe

Processes the YelpFull data. The fields of the DataSet after processing with YelpFullPipe are as follows:

raw_words	target	words	seq_len
I got 'new' tires from them and within...	0	[7, 110, 22, 107, 22, 499, 59, 140, 3,...]	160
Don't waste your time. We had two dif...	0	[277, 17, 278, 38, 30, 112, 24, 85, 27...	40
...	.	[...]	.

The dataset's print_field_meta() function reports which fields are set as input and target:
+-------------+-----------+--------+-------+---------+
| field_names | raw_words | target | words | seq_len |
+-------------+-----------+--------+-------+---------+
|   is_input  |   False   | False  |  True |   True  |
|  is_target  |   False   |  True  | False |  False  |
| ignore_type |           | False  | False |  False  |
|  pad_value  |           |   0    |   0   |    0    |
+-------------+-----------+--------+-------+---------+

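__init__(lower: bool = False, granularity=5, tokenizer: str = 'spacy')

Parameters:

lower (bool) -- whether to lowercase the input.
granularity (int) -- supports 2, 3, and 5. With 2, the task is treated as binary classification: ratings 1 and 2 form one class, 4 and 5 form another, and 3 is dropped. With 3, it is 3-class classification: 1 and 2 form one class, 3 forms one class, and 4 and 5 form one class. With 5, it is 5-class classification.
tokenizer (str) -- which tokenization method to use to split the data into words. Supports 'spacy' and 'raw'; 'raw' splits on spaces.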
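The effect of granularity on the labels can be pictured with a small mapping. This is a sketch only: map_star_rating is a hypothetical helper, the integer class ids are an assumption made for illustration, and the neutral 3-star rating is taken to be the one dropped in the binary setting.

```python
def map_star_rating(stars, granularity=5):
    """Map a 1-5 star rating to a class id, or None if the sample is dropped."""
    if granularity == 5:
        return stars - 1
    if granularity == 3:
        return {1: 0, 2: 0, 3: 1, 4: 2, 5: 2}[stars]
    if granularity == 2:
        # 3-star reviews have no entry, so .get() returns None for them.
        return {1: 0, 2: 0, 4: 1, 5: 1}.get(stars)
    raise ValueError("granularity must be 2, 3, or 5")


ratings = [1, 2, 3, 4, 5]
five_class = [map_star_rating(s, 5) for s in ratings]
three_class = [map_star_rating(s, 3) for s in ratings]
two_class = [map_star_rating(s, 2) for s in ratings]
```

A caller would filter out the None entries before building the target column.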

process(data_bundle)

The input DataSet should have the following structure:

raw_words	target
I got 'new' tires from them and...	1
Don't waste your time. We had two...	1
...	...

Parameters:	data_bundle (DataBundle) -- the DataBundle object to process
Returns:	DataBundle

process_from_file(paths=None)

Parameters:	paths --
Returns:	DataBundle

class fastNLP.io.pipe.YelpPolarityPipe(lower: bool = False, tokenizer: str = 'spacy')

Bases: fastNLP.io.CLSBasePipe

Alias: fastNLP.io.YelpPolarityPipe, fastNLP.io.pipe.YelpPolarityPipe

Processes the YelpPolarity data. The fields of the DataSet after processing with YelpPolarityPipe are as follows:

raw_words	target	words	seq_len
I got 'new' tires from them and within...	0	[7, 110, 22, 107, 22, 499, 59, 140, 3,...]	160
Don't waste your time. We had two dif...	0	[277, 17, 278, 38, 30, 112, 24, 85, 27...	40
...	.	[...]	.

The dataset's print_field_meta() function reports which fields are set as input and target:
+-------------+-----------+--------+-------+---------+
| field_names | raw_words | target | words | seq_len |
+-------------+-----------+--------+-------+---------+
|   is_input  |   False   | False  |  True |   True  |
|  is_target  |   False   |  True  | False |  False  |
| ignore_type |           | False  | False |  False  |
|  pad_value  |           |   0    |   0   |    0    |
+-------------+-----------+--------+-------+---------+

__init__(lower: bool = False, tokenizer: str = 'spacy')

Parameters:

lower (bool) -- whether to lowercase the input.
tokenizer (str) -- which tokenization method to use to split the data into words. Supports 'spacy' and 'raw'; 'raw' splits on spaces.

process(data_bundle: fastNLP.io.data_bundle.DataBundle)

The input DataSet should have the following structure:

raw_words	target
I got 'new' tires from them and...	1
Don't waste your time. We had two...	1
...	...

Parameters:	data_bundle (DataBundle) -- the DataBundle object to process
Returns:	DataBundle

process_from_file(paths=None)

Parameters:	paths (str) --
Returns:	DataBundle

class fastNLP.io.pipe.SSTPipe(subtree=False, train_subtree=True, lower=False, granularity=5, tokenizer='spacy')

Bases: fastNLP.io.CLSBasePipe

Alias: fastNLP.io.SSTPipe, fastNLP.io.pipe.SSTPipe

The fields of the DataSet after processing with SSTPipe are as follows:

raw_words	target	words	seq_len
It 's a lovely film with lovely perfor...	1	[187, 6, 5, 132, 120, 70, 132, 188, 25...	13
No one goes unindicted here , which is...	0	[191, 126, 192, 193, 194, 4, 195, 17, ...	13
...	.	[...]	.

The dataset's print_field_meta() function reports which fields are set as input and target:
+-------------+-----------+--------+-------+---------+
| field_names | raw_words | target | words | seq_len |
+-------------+-----------+--------+-------+---------+
|   is_input  |   False   | False  |  True |   True  |
|  is_target  |   False   |  True  | False |  False  |
| ignore_type |           | False  | False |  False  |
|  pad_value  |           |   0    |   0   |    0    |
+-------------+-----------+--------+-------+---------+

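__init__(subtree=False, train_subtree=True, lower=False, granularity=5, tokenizer='spacy')

Parameters:

subtree (bool) -- whether to expand the train, test, and dev data into subtrees to enlarge the dataset. Default: False
train_subtree (bool) -- whether to expand the train set with subtrees.
lower (bool) -- whether to lowercase the input.
granularity (int) -- supports 2, 3, and 5. With 2, the task is treated as binary classification: labels 0 and 1 form one class, 3 and 4 form another, and 2 is dropped. With 3, it is 3-class classification: 0 and 1 form one class, 2 forms one class, and 3 and 4 form one class. With 5, it is 5-class classification.
tokenizer (str) -- which tokenization method to use to split the data into words. Supports 'spacy' and 'raw'; 'raw' splits on spaces.

process(data_bundle: fastNLP.io.data_bundle.DataBundle)

Preprocesses the data in the DataBundle. The input DataSet must have at least a raw_words column whose content looks like the following.

Fields of the DataSet as read by SSTLoader:
raw_words
(2 (3 (3 Effective) (2 but)) (1 (1 too-tepid)...
(3 (3 (2 If) (3 (2 you) (3 (2 sometimes) ...
...

Parameters:	data_bundle (DataBundle) -- the DataBundle object to process
Returns:	DataBundle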
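To make the idea of expanding a labelled tree into subtrees concrete, here is a minimal sketch. It is not fastNLP's implementation: parse_tree and subtrees are hypothetical helpers, the example tree is made up, and the parser assumes that words never contain parentheses.

```python
def parse_tree(s):
    """Parse '(label child ...)' into (label, children); a leaf is (label, word)."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        assert tokens[pos] == "("
        label = int(tokens[pos + 1])
        pos += 2
        if tokens[pos] != "(":      # leaf of the form "(label word)"
            word = tokens[pos]
            pos += 2                # skip the word and the closing ")"
            return (label, word)
        children = []
        while tokens[pos] == "(":
            children.append(node())
        pos += 1                    # skip the closing ")"
        return (label, children)

    return node()


def subtrees(tree):
    """Yield (words, label) for the tree itself and then for every subtree."""
    label, body = tree
    if isinstance(body, str):
        yield [body], label
        return
    words, parts = [], []
    for child in body:
        child_items = list(subtrees(child))
        words.extend(child_items[0][0])  # the first item spans the whole child
        parts.extend(child_items)
    yield words, label
    yield from parts


# A made-up tree in the same bracketed format as the raw_words column.
items = list(subtrees(parse_tree("(3 (2 A) (4 (3 lovely) (2 film)))")))
```

Each yielded pair is a candidate training sample: a phrase together with the sentiment label of the node that covers it.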

class fastNLP.io.pipe.SST2Pipe(lower=False, tokenizer='spacy')

Bases: fastNLP.io.CLSBasePipe

Alias: fastNLP.io.SST2Pipe, fastNLP.io.pipe.SST2Pipe

Loads the SST2 data. After processing, the DataSet has the following fields:

raw_words	target	words	seq_len
it 's a charming and often affecting j...	1	[19, 9, 6, 111, 5, 112, 113, 114, 3]	9
unflinchingly bleak and desperate	0	[115, 116, 5, 117]	4
...	...	.	.

The dataset's print_field_meta() function reports which fields are set as input and target:
+-------------+-----------+--------+-------+---------+
| field_names | raw_words | target | words | seq_len |
+-------------+-----------+--------+-------+---------+
|   is_input  |   False   | False  |  True |   True  |
|  is_target  |   False   |  True  | False |  False  |
| ignore_type |           | False  | False |  False  |
|  pad_value  |           |   0    |   0   |    0    |
+-------------+-----------+--------+-------+---------+

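__init__(lower=False, tokenizer='spacy')

Parameters:

lower (bool) -- whether to lowercase the input.
tokenizer (str) -- which tokenization method to use to split the data into words. Supports 'spacy' and 'raw'; 'raw' splits on spaces.

process(data_bundle: fastNLP.io.data_bundle.DataBundle)

The input DataSet should have the following structure:

raw_words	target
I got 'new' tires from them and...	1
Don't waste your time. We had two...	1
...	...

Parameters:	data_bundle (DataBundle) -- the DataBundle object to process
Returns:	DataBundle

process_from_file(paths=None)

Parameters:	paths (str) -- if None, the data is downloaded automatically and cached in fastNLP's cache directory.
Returns:	DataBundle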

class fastNLP.io.pipe.IMDBPipe(lower: bool = False, tokenizer: str = 'spacy')

Bases: fastNLP.io.CLSBasePipe

Alias: fastNLP.io.IMDBPipe, fastNLP.io.pipe.IMDBPipe

After processing by this Pipe, the output DataSet has the following fields:

raw_words	target	words	seq_len
Bromwell High is a cartoon ...	0	[3, 5, 6, 9, ...]	20
Story of a man who has ...	1	[20, 43, 9, 10, ...]	31
...	.	[...]	.

raw_words is of type str and holds the raw text; words is the input converted to indices; target is the target converted to indices. The words column is set as input and the target column is set as target.

The dataset's print_field_meta() function reports which fields are set as input and target:
+-------------+-----------+--------+-------+---------+
| field_names | raw_words | target | words | seq_len |
+-------------+-----------+--------+-------+---------+
|   is_input  |   False   | False  |  True |   True  |
|  is_target  |   False   |  True  | False |  False  |
| ignore_type |           | False  | False |  False  |
|  pad_value  |           |   0    |   0   |    0    |
+-------------+-----------+--------+-------+---------+

__init__(lower: bool = False, tokenizer: str = 'spacy')

Parameters:

lower (bool) -- whether to lowercase the data in the words column.
tokenizer (str) -- which tokenizer to use to split sentences into words. Supports spacy and raw; raw splits on spaces.

process(data_bundle: fastNLP.io.data_bundle.DataBundle)

The DataSet in the expected DataBundle should look like the following, with two fields, raw_words and target, both of type str.

Fields of the input DataSet:
raw_words	target
Bromwell High is a cartoon ...	pos
Story of a man who has ...	neg
...	...

Parameters:	data_bundle (DataBundle) -- the DataSet objects in the DataBundle must contain the raw_words and target fields; the raw_words column should be str and the target column should be str.
Returns:	DataBundle

process_from_file(paths=None)

Parameters:	paths -- for the supported path types, see the load function of `fastNLP.io.loader.Loader`.
Returns:	DataBundle

class fastNLP.io.pipe.ChnSentiCorpPipe(bigrams=False, trigrams=False)

Bases: fastNLP.io.Pipe

Alias: fastNLP.io.ChnSentiCorpPipe, fastNLP.io.pipe.ChnSentiCorpPipe

After processing, the DataSet has the following structure:

raw_chars	target	chars	seq_len
這間酒店環境和服務態度亦算不錯,但房間空間太小~~	1	[2, 3, 4, 5, ...]	31
<荐书> 推荐所有喜欢<红楼>...	1	[10, 21, ....]	25
...

Here chars and seq_len are inputs and target is the target. The dataset's print_field_meta() function reports which fields are set as input and target:
+-------------+-----------+--------+-------+---------+
| field_names | raw_chars | target | chars | seq_len |
+-------------+-----------+--------+-------+---------+
|   is_input  |   False   |  True  |  True |   True  |
|  is_target  |   False   |  True  | False |  False  |
| ignore_type |           | False  | False |  False  |
|  pad_value  |           |   0    |   0   |    0    |
+-------------+-----------+--------+-------+---------+

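__init__(bigrams=False, trigrams=False)

Parameters:

bigrams (bool) -- whether to add a bigrams column. Bigrams are built as ['复', '旦', '大', '学', ...] -> ["复旦", "旦大", ...]. If True, the returned DataSet has a column named bigrams that is already converted to indices and set as input; the corresponding vocabulary can be retrieved with data_bundle.get_vocab('bigrams').
trigrams (bool) -- whether to add a trigrams column. Trigrams are built as ['复', '旦', '大', '学', ...] -> ["复旦大", "旦大学", ...]. If True, the returned DataSet has a column named trigrams that is already converted to indices and set as input; the corresponding vocabulary can be retrieved with data_bundle.get_vocab('trigrams').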
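The bigram and trigram construction described above amounts to sliding a window over the character list. A minimal sketch, assuming no padding at the sequence end (the library may additionally pad the end so that the new column has the same length as chars; this sketch does not); char_ngrams is a hypothetical helper name:

```python
def char_ngrams(chars, n):
    """Join every run of n consecutive characters into one string."""
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


chars = ["复", "旦", "大", "学"]
bigrams = char_ngrams(chars, 2)
trigrams = char_ngrams(chars, 3)
```

The resulting strings would then be indexed with their own vocabulary, separate from the character vocabulary.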

process(data_bundle: fastNLP.io.data_bundle.DataBundle)

The DataSet to process should have the following fields:

raw_chars	target
這間酒店環境和服務態度亦算不錯,但房間空間太小~~	1
<荐书> 推荐所有喜欢<红楼>...	1
...

Parameters:	data_bundle (DataBundle) -- the DataBundle object to process
Returns:	DataBundle

process_from_file(paths=None)

Parameters:	paths -- for the supported path types, see the load function of `fastNLP.io.loader.Loader`.
Returns:	DataBundle

class fastNLP.io.pipe.THUCNewsPipe(bigrams=False, trigrams=False)

Bases: fastNLP.io.CLSBasePipe

Alias: fastNLP.io.THUCNewsPipe, fastNLP.io.pipe.THUCNewsPipe

After processing, the DataSet has the following structure:

raw_chars	target	chars	seq_len
马晓旭意外受伤让国奥警惕无奈大雨格外青睐殷家军记者傅亚雨沈阳报道...	0	[409, 1197, 2146, 213, ...]	746
...

Here chars and seq_len are inputs and target is the target. The dataset's print_field_meta() function reports which fields are set as input and target:
+-------------+-----------+--------+-------+---------+
| field_names | raw_chars | target | chars | seq_len |
+-------------+-----------+--------+-------+---------+
|   is_input  |   False   |  True  |  True |   True  |
|  is_target  |   False   |  True  | False |  False  |
| ignore_type |           | False  | False |  False  |
|  pad_value  |           |   0    |   0   |    0    |
+-------------+-----------+--------+-------+---------+
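
Parameters:

bigrams (bool) -- whether to add a bigrams column. Bigrams are built as ['复', '旦', '大', '学', ...] -> ["复旦", "旦大", ...]. If True, the returned DataSet has a column named bigrams that is already converted to indices and set as input; the corresponding vocabulary can be retrieved with data_bundle.get_vocab('bigrams').
trigrams (bool) -- whether to add a trigrams column. Trigrams are built as ['复', '旦', '大', '学', ...] -> ["复旦大", "旦大学", ...]. If True, the returned DataSet has a column named trigrams that is already converted to indices and set as input; the corresponding vocabulary can be retrieved with data_bundle.get_vocab('trigrams').

process(data_bundle: fastNLP.io.data_bundle.DataBundle)

The DataSet to process should have the following fields:

raw_words	target
马晓旭意外受伤让国奥警惕无奈大雨格外青睐殷家军记者傅亚雨沈阳报道 ...	体育
...	...

Parameters:	data_bundle (DataBundle) -- the DataBundle object to process
Returns:	DataBundle

process_from_file(paths=None)

Parameters:	paths -- for the supported path types, see the load function of `fastNLP.io.loader.Loader`.
Returns:	DataBundle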

class fastNLP.io.pipe.WeiboSenti100kPipe(bigrams=False, trigrams=False)

Bases: fastNLP.io.CLSBasePipe

Alias: fastNLP.io.WeiboSenti100kPipe, fastNLP.io.pipe.WeiboSenti100kPipe

After processing, the DataSet has the following structure:

raw_chars	target	chars	seq_len
六一出生的？好讽刺…… //@祭春姬:他爸爸是外星人吧 //@面孔小高:现在的孩子都怎么了 [怒][怒][怒]	0	[0, 690, 18, ...]	56
...

Here chars and seq_len are inputs and target is the target. The dataset's print_field_meta() function reports which fields are set as input and target:
+-------------+-----------+--------+-------+---------+
| field_names | raw_chars | target | chars | seq_len |
+-------------+-----------+--------+-------+---------+
|   is_input  |   False   |  True  |  True |   True  |
|  is_target  |   False   |  True  | False |  False  |
| ignore_type |           | False  | False |  False  |
|  pad_value  |           |   0    |   0   |    0    |
+-------------+-----------+--------+-------+---------+
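
Parameters:

bigrams (bool) -- whether to add a bigrams column. Bigrams are built as ['复', '旦', '大', '学', ...] -> ["复旦", "旦大", ...]. If True, the returned DataSet has a column named bigrams that is already converted to indices and set as input; the corresponding vocabulary can be retrieved with data_bundle.get_vocab('bigrams').
trigrams (bool) -- whether to add a trigrams column. Trigrams are built as ['复', '旦', '大', '学', ...] -> ["复旦大", "旦大学", ...]. If True, the returned DataSet has a column named trigrams that is already converted to indices and set as input; the corresponding vocabulary can be retrieved with data_bundle.get_vocab('trigrams').

process(data_bundle: fastNLP.io.data_bundle.DataBundle)

The DataSet to process should have the following fields:

raw_chars	target
六一出生的？好讽刺…… //@祭春姬:他爸爸是外星人吧 //@面孔小高:现在的孩子都怎么了 [怒][怒][怒]	0
...	...

Parameters:	data_bundle (DataBundle) -- the DataBundle object to process
Returns:	DataBundle

process_from_file(paths=None)

Parameters:	paths -- for the supported path types, see the load function of `fastNLP.io.loader.Loader`.
Returns:	DataBundle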

class fastNLP.io.pipe.Conll2003NERPipe(encoding_type: str = 'bio', lower: bool = False)

Bases: fastNLP.io._NERPipe

Alias: fastNLP.io.Conll2003NERPipe, fastNLP.io.pipe.Conll2003NERPipe

The Pipe for the Conll2003 NER task. This Pipe (1) copies the raw_words column and names the copy words; (2) builds vocabularies over the words and target columns (creating fastNLP.Vocabulary objects, so the returned DataBundle holds two Vocabulary objects); (3) converts the words and target columns to indices using the corresponding Vocabulary. After this Pipe, the DataSet looks as follows.

Following is a demo layout of DataSet returned by Conll2003Loader

raw_words	target	words	seq_len
[Nadim, Ladki]	[1, 2]	[2, 3]	2
[AL-AIN, United, Arab, ...]	[3, 4,...]	[4, 5, 6,...]	6
[...]	[...]	[...]	.

The raw_words column is List[str] and holds the unconverted raw data; the words column is List[int] and holds the input converted to indices; the target column is List[int] and holds the target converted to indices. In the returned DataSet, words, target, and seq_len are set as input, and target is set as target.

The dataset's print_field_meta() function reports which fields are set as input and target:
+-------------+-----------+--------+-------+---------+
| field_names | raw_words | target | words | seq_len |
+-------------+-----------+--------+-------+---------+
|   is_input  |   False   |  True  |  True |   True  |
|  is_target  |   False   |  True  | False |   True  |
| ignore_type |           | False  | False |  False  |
|  pad_value  |           |   0    |   0   |    0    |
+-------------+-----------+--------+-------+---------+

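__init__(encoding_type: str = 'bio', lower: bool = False)

Parameters:

encoding_type (str) -- which encoding scheme the target column uses; supports bioes and bio.
lower (bool) -- whether to lowercase words before building the vocabulary; in the vast majority of cases this does not need to be True.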
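The difference between the bio and bioes schemes can be seen by converting one into the other. The following is a sketch of the standard conversion, written for illustration; bio_to_bioes is a hypothetical helper and is not claimed to be the function fastNLP uses internally. It assumes well-formed BIO input.

```python
def bio_to_bioes(tags):
    """Convert BIO tags to BIOES: single-token spans become S-, span ends become E-."""
    bioes = []
    for i, tag in enumerate(tags):
        if tag == "O":
            bioes.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = nxt == "I-" + label  # does the same span go on?
        if prefix == "B":
            bioes.append(("B-" if continues else "S-") + label)
        else:  # prefix == "I"
            bioes.append(("I-" if continues else "E-") + label)
    return bioes


person = bio_to_bioes(["B-PER", "I-PER"])
places = bio_to_bioes(["B-LOC", "B-LOC", "I-LOC", "O"])
```

The bioes scheme yields a larger target vocabulary than bio, because every entity type gains S- and E- variants.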

process(data_bundle: fastNLP.io.data_bundle.DataBundle) → fastNLP.io.data_bundle.DataBundle

The supported DataSet fields are:

raw_words	target
[Nadim, Ladki]	[B-PER, I-PER]
[AL-AIN, United, Arab, ...]	[B-LOC, B-LOC, I-LOC, ...]
[...]	[...]

Parameters:	data_bundle (DataBundle) -- the DataSet objects in the DataBundle must contain the raw_words and target fields, both of type List[str]. The DataBundle is modified in place.
Returns:	DataBundle

process_from_file(paths) → fastNLP.io.data_bundle.DataBundle

Parameters:	paths -- for the supported path types, see the load function of `fastNLP.io.loader.ConllLoader`.
Returns:	DataBundle

class fastNLP.io.pipe.OntoNotesNERPipe(encoding_type: str = 'bio', lower: bool = False)

Bases: fastNLP.io._NERPipe

Alias: fastNLP.io.OntoNotesNERPipe, fastNLP.io.pipe.OntoNotesNERPipe

Processes the OntoNotes NER data. After processing, the DataSet has the following fields:

raw_words	target	words	seq_len
[Nadim, Ladki]	[1, 2]	[2, 3]	2
[AL-AIN, United, Arab, ...]	[3, 4]	[4, 5, 6,...]	6
[...]	[...]	[...]	.

The raw_words column is List[str] and holds the unconverted raw data; the words column is List[int] and holds the input converted to indices; the target column is List[int] and holds the target converted to indices. In the returned DataSet, words, target, and seq_len are set as input, and target is set as target.

The dataset's print_field_meta() function reports which fields are set as input and target:
+-------------+-----------+--------+-------+---------+
| field_names | raw_words | target | words | seq_len |
+-------------+-----------+--------+-------+---------+
|   is_input  |   False   |  True  |  True |   True  |
|  is_target  |   False   |  True  | False |   True  |
| ignore_type |           | False  | False |  False  |
|  pad_value  |           |   0    |   0   |    0    |
+-------------+-----------+--------+-------+---------+

__init__(encoding_type: str = 'bio', lower: bool = False)

Parameters:

encoding_type (str) -- which encoding scheme the target column uses; supports bioes and bio.
lower (bool) -- whether to lowercase words before building the vocabulary; in the vast majority of cases this does not need to be True.

process(data_bundle: fastNLP.io.data_bundle.DataBundle) → fastNLP.io.data_bundle.DataBundle

The supported DataSet fields are:

raw_words	target
[Nadim, Ladki]	[B-PER, I-PER]
[AL-AIN, United, Arab, ...]	[B-LOC, B-LOC, I-LOC, ...]
[...]	[...]

Parameters:	data_bundle (DataBundle) -- the DataSet objects in the DataBundle must contain the raw_words and target fields, both of type List[str]. The DataBundle is modified in place.
Returns:	DataBundle

class fastNLP.io.pipe.MsraNERPipe(encoding_type: str = 'bio', bigrams=False, trigrams=False)

Bases: fastNLP.io._CNNERPipe

Alias: fastNLP.io.MsraNERPipe, fastNLP.io.pipe.MsraNERPipe

Processes the MSRA-NER data. After processing, the DataSet has the following fields:

raw_chars	target	chars	seq_len
[相, 比, 之, 下,...]	[0, 0, 0, 0, ...]	[2, 3, 4, 5, ...]	11
[青, 岛, 海, 牛, 队, 和, ...]	[1, 2, 3, ...]	[10, 21, ....]	21
[...]	[...]	[...]	.

The raw_chars column is List[str] and holds the unconverted raw data; the chars column is List[int] and holds the input converted to indices; the target column is List[int] and holds the target converted to indices. In the returned DataSet, chars, target, and seq_len are set as input, and target is set as target.

The dataset's print_field_meta() function reports which fields are set as input and target:
+-------------+-----------+--------+-------+---------+
| field_names | raw_chars | target | chars | seq_len |
+-------------+-----------+--------+-------+---------+
|   is_input  |   False   |  True  |  True |   True  |
|  is_target  |   False   |  True  | False |   True  |
| ignore_type |           | False  | False |  False  |
|  pad_value  |           |   0    |   0   |    0    |
+-------------+-----------+--------+-------+---------+

__init__(encoding_type: str = 'bio', bigrams=False, trigrams=False)

Parameters:

encoding_type (str) -- which encoding scheme the target column uses; supports bioes and bio.
bigrams (bool) -- whether to add a bigrams column. Bigrams are built as ['复', '旦', '大', '学', ...] -> ["复旦", "旦大", ...]. If True, the returned DataSet has a column named bigrams that is already converted to indices and set as input; the corresponding vocabulary can be retrieved with data_bundle.get_vocab('bigrams').
trigrams (bool) -- whether to add a trigrams column. Trigrams are built as ['复', '旦', '大', '学', ...] -> ["复旦大", "旦大学", ...]. If True, the returned DataSet has a column named trigrams that is already converted to indices and set as input; the corresponding vocabulary can be retrieved with data_bundle.get_vocab('trigrams').

process(data_bundle: fastNLP.io.data_bundle.DataBundle) → fastNLP.io.data_bundle.DataBundle

The supported DataSet fields are:

raw_chars	target
[相, 比, 之, 下,...]	[O, O, O, O, ...]
[青, 岛, 海, 牛, 队, 和, ...]	[B-ORG, I-ORG, I-ORG, ...]
[...]	[...]

The raw_chars column is List[str] and holds the unconverted raw data; the chars column is List[int] and holds the input converted to indices; the target column is List[int] and holds the target converted to indices. In the returned DataSet, chars, target, and seq_len are set as input, and target is set as target.

Parameters:	data_bundle (DataBundle) -- the DataSet objects in the DataBundle must contain the raw_chars and target fields, both of type List[str]. The DataBundle is modified in place.
Returns:	DataBundle

class fastNLP.io.pipe.WeiboNERPipe(encoding_type: str = 'bio', bigrams=False, trigrams=False)

Bases: fastNLP.io._CNNERPipe

Alias: fastNLP.io.WeiboNERPipe, fastNLP.io.pipe.WeiboNERPipe

Processes the Weibo NER data. After processing, the DataSet has the following fields:

raw_chars	target	chars	seq_len
['老', '百', '姓']	[4, 3, 3]	[38, 39, 40]	3
['心']	[0]	[41]	1
[...]	[...]	[...]	.

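The raw_chars column is List[str] and holds the unconverted raw data; the chars column is List[int] and holds the input converted to indices; the target column is List[int] and holds the target converted to indices. In the returned DataSet, chars, target, and seq_len are set as input, and target is set as target.

The dataset's print_field_meta() function reports which fields are set as input and target: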
+-------------+-----------+--------+-------+---------+
| field_names | raw_chars | target | chars | seq_len |
+-------------+-----------+--------+-------+---------+
|   is_input  |   False   |  True  |  True |   True  |
|  is_target  |   False   |  True  | False |   True  |
| ignore_type |           | False  | False |  False  |
|  pad_value  |           |   0    |   0   |    0    |
+-------------+-----------+--------+-------+---------+

__init__(encoding_type: str = 'bio', bigrams=False, trigrams=False)

Parameters:

encoding_type (str) -- which encoding scheme the target column uses; supports bioes and bio.
bigrams (bool) -- whether to add a bigrams column. Bigrams are built as ['复', '旦', '大', '学', ...] -> ["复旦", "旦大", ...]. If True, the returned DataSet has a column named bigrams that is already converted to indices and set as input; the corresponding vocabulary can be retrieved with data_bundle.get_vocab('bigrams').
trigrams (bool) -- whether to add a trigrams column. Trigrams are built as ['复', '旦', '大', '学', ...] -> ["复旦大", "旦大学", ...]. If True, the returned DataSet has a column named trigrams that is already converted to indices and set as input; the corresponding vocabulary can be retrieved with data_bundle.get_vocab('trigrams').

process(data_bundle: fastNLP.io.data_bundle.DataBundle) → fastNLP.io.data_bundle.DataBundle

The supported DataSet fields are:

raw_chars	target
[相, 比, 之, 下,...]	[O, O, O, O, ...]
[青, 岛, 海, 牛, 队, 和, ...]	[B-ORG, I-ORG, I-ORG, ...]
[...]	[...]

The raw_chars column is List[str] and holds the unconverted raw data; the chars column is List[int] and holds the input converted to indices; the target column is List[int] and holds the target converted to indices. In the returned DataSet, chars, target, and seq_len are set as input, and target is set as target.

Parameters:	data_bundle (DataBundle) -- the DataSet objects in the DataBundle must contain the raw_chars and target fields, both of type List[str]. The DataBundle is modified in place.
Returns:	DataBundle

class fastNLP.io.pipe.PeopleDailyPipe(encoding_type: str = 'bio', bigrams=False, trigrams=False)

Bases: fastNLP.io._CNNERPipe

Alias: fastNLP.io.PeopleDailyPipe, fastNLP.io.pipe.PeopleDailyPipe

Processes the People's Daily NER data. After processing, the DataSet has the following fields:

raw_chars	target	chars	seq_len
[相, 比, 之, 下,...]	[0, 0, 0, 0, ...]	[2, 3, 4, 5, ...]	11
[青, 岛, 海, 牛, 队, 和, ...]	[1, 2, 3, ...]	[10, 21, ....]	21
[...]	[...]	[...]	.

The raw_chars column is List[str] and holds the unconverted raw data; the chars column is List[int] and holds the input converted to indices; the target column is List[int] and holds the target converted to indices. In the returned DataSet, chars, target, and seq_len are set as input, and target is set as target.

The dataset's print_field_meta() function reports which fields are set as input and target:
+-------------+-----------+--------+-------+---------+
| field_names | raw_chars | target | chars | seq_len |
+-------------+-----------+--------+-------+---------+
|   is_input  |   False   |  True  |  True |   True  |
|  is_target  |   False   |  True  | False |   True  |
| ignore_type |           | False  | False |  False  |
|  pad_value  |           |   0    |   0   |    0    |
+-------------+-----------+--------+-------+---------+

__init__(encoding_type: str = 'bio', bigrams=False, trigrams=False)

Parameters:

encoding_type (str) -- which encoding scheme the target column uses; supports bioes and bio.
bigrams (bool) -- whether to add a bigrams column. Bigrams are built as ['复', '旦', '大', '学', ...] -> ["复旦", "旦大", ...]. If True, the returned DataSet has a column named bigrams that is already converted to indices and set as input; the corresponding vocabulary can be retrieved with data_bundle.get_vocab('bigrams').
trigrams (bool) -- whether to add a trigrams column. Trigrams are built as ['复', '旦', '大', '学', ...] -> ["复旦大", "旦大学", ...]. If True, the returned DataSet has a column named trigrams that is already converted to indices and set as input; the corresponding vocabulary can be retrieved with data_bundle.get_vocab('trigrams').

process(data_bundle: fastNLP.io.data_bundle.DataBundle) → fastNLP.io.data_bundle.DataBundle

The supported DataSet fields are:

raw_chars	target
[相, 比, 之, 下,...]	[O, O, O, O, ...]
[青, 岛, 海, 牛, 队, 和, ...]	[B-ORG, I-ORG, I-ORG, ...]
[...]	[...]

The raw_chars column is List[str] and holds the unconverted raw data; the chars column is List[int] and holds the input converted to indices; the target column is List[int] and holds the target converted to indices. In the returned DataSet, chars, target, and seq_len are set as input, and target is set as target.

Parameters:	data_bundle (DataBundle) -- the DataSet objects in the DataBundle must contain the raw_chars and target fields, both of type List[str]. The DataBundle is modified in place.
Returns:	DataBundle

class fastNLP.io.pipe.Conll2003Pipe(chunk_encoding_type='bioes', ner_encoding_type='bioes', lower: bool = False)

Bases: fastNLP.io.Pipe

Alias: fastNLP.io.Conll2003Pipe, fastNLP.io.pipe.Conll2003Pipe

After this Pipe, the DataSet looks as follows:

raw_words	pos	chunk	ner	words	seq_len
[Nadim, Ladki]	[0, 0]	[1, 2]	[1, 2]	[2, 3]	2
[AL-AIN, United, Arab, ...]	[1, 2...]	[3, 4...]	[3, 4...]	[4, 5, 6,...]	6
[...]	[...]	[...]	[...]	[...]	.

Here words and seq_len are inputs; pos, chunk, ner, and seq_len are targets. The dataset's print_field_meta() function reports which fields are set as input and target:

+-------------+-----------+-------+-------+-------+-------+---------+
| field_names | raw_words |  pos  | chunk |  ner  | words | seq_len |
+-------------+-----------+-------+-------+-------+-------+---------+
|   is_input  |   False   | False | False | False |  True |   True  |
|  is_target  |   False   |  True |  True |  True | False |   True  |
| ignore_type |           | False | False | False | False |  False  |
|  pad_value  |           |   0   |   0   |   0   |   0   |    0    |
+-------------+-----------+-------+-------+-------+-------+---------+

__init__(chunk_encoding_type='bioes', ner_encoding_type='bioes', lower: bool = False)

Parameters:

chunk_encoding_type (str) -- supports bioes and bio.
ner_encoding_type (str) -- supports bioes and bio.
lower (bool) -- whether to lowercase the words column before building the vocabulary.

process(data_bundle) → fastNLP.io.data_bundle.DataBundle

The input DataSet should look like the following:

raw_words	pos	chunk	ner
[Nadim, Ladki]	[NNP, NNP]	[B-NP, I-NP]	[B-PER, I-PER]
[AL-AIN, United, Arab, ...]	[NNP, NNP...]	[B-NP, B-NP, ...]	[B-LOC, B-LOC,...]
[...]	[...]	[...]	[...]

Parameters:	data_bundle (DataBundle) -- the DataBundle object to process
Returns:	the DataBundle that was passed in

process_from_file(paths)

Parameters:	paths --
Returns:	DataBundle

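class fastNLP.io.pipe.MatchingBertPipe(lower=False, tokenizer: str = 'raw')

Bases: fastNLP.io.Pipe

Alias: fastNLP.io.MatchingBertPipe, fastNLP.io.pipe.MatchingBertPipe

The Bert pipe for matching tasks. The output DataSet contains the following fields:

raw_words1	raw_words2	target	words	seq_len
The new rights are...	Everyone really likes..	1	[2, 3, 4, 5, ...]	10
This site includes a...	The Government Executive...	0	[11, 12, 13,...]	5
...	...	.	[...]	.

The words column is produced by joining raw_words1 (the premise) and raw_words2 (the hypothesis) with "[SEP]" and converting the result to indices. The words column is set as input, and the target column is set as both target and input (setting it as input makes it convenient to compute the loss inside the forward function; it does no harm if the loss is not computed there, because fastNLP passes arguments according to the parameter names of the forward function).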
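How a sentence pair turns into a single words column can be sketched as follows. This is an illustration only: join_pair is a hypothetical helper, whitespace splitting stands in for the 'raw' tokenizer, and the first-occurrence index order is an assumption rather than fastNLP's vocabulary order.

```python
def join_pair(premise, hypothesis, sep="[SEP]", lower=False):
    """Tokenize both sentences on whitespace and join them with a separator token."""
    if lower:
        premise, hypothesis = premise.lower(), hypothesis.lower()
    return premise.split() + [sep] + hypothesis.split()


tokens = join_pair("The new rights are nice", "Everyone really likes them")

# Assign indices in order of first occurrence (a stand-in for a real vocabulary).
vocab = {}
words = [vocab.setdefault(tok, len(vocab)) for tok in tokens]
seq_len = len(words)
```

Lowercasing is applied before the separator is inserted, so "[SEP]" itself is never altered.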

The dataset's print_field_meta() function reports which fields are set as input and target:
+-------------+------------+------------+--------+-------+---------+
| field_names | raw_words1 | raw_words2 | target | words | seq_len |
+-------------+------------+------------+--------+-------+---------+
|   is_input  |   False    |   False    | False  |  True |   True  |
|  is_target  |   False    |   False    |  True  | False |  False  |
| ignore_type |            |            | False  | False |  False  |
|  pad_value  |            |            |   0    |   0   |    0    |
+-------------+------------+------------+--------+-------+---------+

__init__(lower=False, tokenizer: str = 'raw')

Parameters:

lower (bool) -- whether to lowercase the words.
tokenizer (str) -- which tokenizer to use to split sentences into words. Supports spacy and raw; raw splits on spaces.

process(data_bundle)

The dataset in the input data_bundle must have the following structure:

raw_words1	raw_words2	target
Dana Reeve, the widow of the actor...	Christopher Reeve had an...	not_entailment
...	...

Parameters:	data_bundle (DataBundle) -- the DataBundle object to process
Returns:	DataBundle

process_from_file(paths) → fastNLP.io.data_bundle.DataBundle

Takes file paths and produces a processed DataBundle object. For the path forms that paths supports, see fastNLP.io.Loader.load().

Parameters:	paths --
Returns:	DataBundle

class fastNLP.io.pipe.MatchingPipe(lower=False, tokenizer: str = 'raw')

Bases: fastNLP.io.Pipe

Alias: fastNLP.io.MatchingPipe, fastNLP.io.pipe.MatchingPipe

The Pipe for matching tasks. The output DataSet contains the following fields:

raw_words1	raw_words2	target	words1	words2	seq_len1	seq_len2
The new rights are...	Everyone really likes..	1	[2, 3, 4, 5, ...]	[10, 20, 6]	10	13
This site includes a...	The Government Executive...	0	[11, 12, 13,...]	[2, 7, ...]	6	7
...	...	.	[...]	[...]	.	.

words1 is the premise and words2 is the hypothesis. words1, words2, seq_len1, and seq_len2 are set as input; target is set as both target and input (setting it as input makes it convenient to compute the loss inside the forward function; it does no harm if the loss is not computed there, because fastNLP passes arguments according to the parameter names of the forward function).

The dataset's print_field_meta() function reports which fields are set as input and target:
+-------------+------------+------------+--------+--------+--------+----------+----------+
| field_names | raw_words1 | raw_words2 | target | words1 | words2 | seq_len1 | seq_len2 |
+-------------+------------+------------+--------+--------+--------+----------+----------+
|   is_input  |   False    |   False    | False  |  True  |  True  |   True   |   True   |
|  is_target  |   False    |   False    |  True  | False  | False  |  False   |  False   |
| ignore_type |            |            | False  | False  | False  |  False   |  False   |
|  pad_value  |            |            |   0    |   0    |   0    |    0     |    0     |
+-------------+------------+------------+--------+--------+--------+----------+----------+
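
The field layout above can be reproduced with a small plain-Python sketch. It is not the MatchingPipe implementation: the toy rows, the helper `index_of`, the whitespace tokenization, the single vocabulary shared by both sentences, and the reserved `<pad>`/`<unk>` entries are assumptions chosen for illustration.

```python
from typing import Any, Dict, List

# Toy rows in the raw format; the sentences are invented for this example.
rows: List[Dict[str, Any]] = [
    {"raw_words1": "the new rights are nice",
     "raw_words2": "everyone likes them",
     "target": "entailment"},
    {"raw_words1": "this site is new",
     "raw_words2": "the site is old",
     "target": "not_entailment"},
]

# Assumed layout: index 0 for padding, 1 for unknown words.
word_vocab: Dict[str, int] = {"<pad>": 0, "<unk>": 1}
target_vocab: Dict[str, int] = {}


def index_of(vocab: Dict[str, int], key: str) -> int:
    """Return the index of key, adding it to the vocabulary if it is new."""
    return vocab.setdefault(key, len(vocab))


for row in rows:
    row["words1"] = [index_of(word_vocab, w) for w in row["raw_words1"].split()]
    row["words2"] = [index_of(word_vocab, w) for w in row["raw_words2"].split()]
    row["seq_len1"] = len(row["words1"])
    row["seq_len2"] = len(row["words2"])
    if "target" in row:  # the target column may be absent, e.g. for test data
        row["target"] = index_of(target_vocab, row["target"])
```

Keeping the premise and the hypothesis in separate columns, each with its own length, lets a model encode the two sentences independently before comparing them.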

__init__(lower=False, tokenizer: str = 'raw')[source]¶

Parameters:	lower (bool) -- whether to convert all raw_words to lowercase. tokenizer (str) -- how to tokenize the raw data. Supports spacy and raw: spacy tokenizes with spaCy, raw splits on spaces.

process(data_bundle)[source]¶

The DataSet in the accepted DataBundle should have the following fields; the target column may be absent.

raw_words1	raw_words2	target
The new rights are...	Everyone really likes..	entailment
This site includes a...	The Government Executive...	not_entailment
...	...	...

Parameters:	data_bundle (DataBundle) -- the data_bundle read by a loader, containing the raw contents of the dataset
Returns:	data_bundle

process_from_file(paths) → fastNLP.io.data_bundle.DataBundle¶

Takes file paths and produces a processed DataBundle object. For the path forms that paths supports, see fastNLP.io.Loader.load()

Parameters:	paths --
Returns:	DataBundle

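class fastNLP.io.pipe.CoReferencePipe(config)[source]¶

Bases: fastNLP.io.pipe.Pipe

Aliases: fastNLP.io.CoReferencePipe, fastNLP.io.pipe.CoReferencePipe

Processes data for the coreference resolution task, producing the document genre, speakers, character-level information, and sequence lengths.

After processing, the data contains the document genre, speaker information, sentence information, the indices corresponding to the sentences, chars, sentence lengths, and target: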

words1 words2 words3 words4 chars seq_len target

bc [[0,0],[1,1]] [['I','am'],[]] [[1,2],[]] [[[1],[2,3]],[]] [2,3] [[[2,3],[6,7]],[[10,12],[20,22]]]

[...] [...] [...] [...] [...] [...] [...]

The input and target settings of each field, as printed by the dataset's print_field_meta() function, are:
+-------------+-----------+--------+-------+---------+
| field_names | raw_chars | target | chars | seq_len |
+-------------+-----------+--------+-------+---------+
|   is_input  |   False   |  True  |  True |   True  |
|  is_target  |   False   |  True  | False |   True  |
| ignore_type |           | False  | False |  False  |
|  pad_value  |           |   0    |   0   |    0    |
+-------------+-----------+--------+-------+---------+

process(data_bundle: fastNLP.io.data_bundle.DataBundle)[source]¶

Further processes the loaded data. The raw data contains: raw_key, raw_speaker, raw_words, raw_clusters.

raw_key	raw_speaker	raw_words	raw_clusters
bc/cctv/00/cctv_0000_0	[['Speaker#1', 'Speaker#1'],[]]	[['I','am'],[]]	[[[2,3],[6,7]],[[10,12],[20,22]]]
bc/cctv/00/cctv_0000_1	[['Speaker#1', 'Speaker#1'],[]]	[['He','is'],[]]	[[[2,3],[6,7]],[[10,12],[20,22]]]
[...]	[...]	[...]	[...]

Parameters:	data_bundle --
Returns:

class fastNLP.io.pipe.CMRC2018BertPipe(max_len=510)[source]¶

Bases: fastNLP.io.Pipe

Aliases: fastNLP.io.CMRC2018BertPipe, fastNLP.io.pipe.qa.CMRC2018BertPipe

After processing, the DataSet gains the following fields (the fields passed in are kept):

context_len	raw_chars	target_start	target_end	chars
492	['范', '廷', '颂', ...]	30	34	[21, 25, ...]
491	['范', '廷', '颂', ...]	41	61	[21, 25, ...]
.	...	...	...	...

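The raw_chars column is the concatenation of the context and the question (with [SEP] inserted at the junction), and chars holds the corresponding indices. target_start is the index at which the answer starts and target_end is the index at which it ends (closed interval); context_len gives the length of the context within the chars column.

The meta information of each column is as follows: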

+-------------+-------------+-----------+--------------+------------+-------+---------+
| field_names | context_len | raw_chars | target_start | target_end | chars | answers |
+-------------+-------------+-----------+--------------+------------+-------+---------+
|   is_input  |    False    |   False   |    False     |   False    |  True |  False  |
|  is_target  |     True    |    True   |     True     |    True    | False |  True   |
| ignore_type |    False    |    True   |    False     |   False    | False |  True   |
|  pad_value  |      0      |     0     |      0       |     0      |   0   |   0     |
+-------------+-------------+-----------+--------------+------------+-------+---------+
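
The relationship between these fields can be sketched in plain Python. This is an illustration under stated assumptions, not the pipe's actual code: the function names `answer_span` and `join_context_question` are hypothetical, the text is split into single characters, and truncation to max_len is not modelled.

```python
from typing import List, Tuple


def answer_span(answer: str, answer_start: int) -> Tuple[int, int]:
    """Return (target_start, target_end) as a closed character interval."""
    return answer_start, answer_start + len(answer) - 1


def join_context_question(context: str, question: str) -> Tuple[List[str], int]:
    """Concatenate context and question characters with "[SEP]" in between.

    Returns the combined character list and the context length.
    """
    context_chars = list(context)
    raw_chars = context_chars + ["[SEP]"] + list(question)
    return raw_chars, len(context_chars)


# The first sample row: the answer "1963年" starts at character 30.
start, end = answer_span("1963年", 30)

# A toy English example, invented for illustration.
raw_chars, context_len = join_context_question("born in 1963", "when?")
toy_start, toy_end = answer_span("1963", 8)
answer_text = "".join(raw_chars[toy_start:toy_end + 1])
```

With a closed interval, the answer is recovered by slicing up to `target_end + 1`. For the first sample row this gives the span (30, 34), which agrees with the target_start and target_end values listed for that row.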

process(data_bundle: fastNLP.io.data_bundle.DataBundle) → fastNLP.io.data_bundle.DataBundle[source]¶

The DataSet passed in should have the following fields:

title	context	question	answers	answer_starts	id
范廷颂	范廷颂枢机（，），圣名保禄·若瑟（）...	范廷颂是什么时候被任为主教的？	["1963年"]	["30"]	TRAIN_186_QUERY_0
范廷颂	范廷颂枢机（，），圣名保禄·若瑟（）...	1990年，范廷颂担任什么职务？	["1990年被擢升为天..."]	["41"]	TRAIN_186_QUERY_1
...	...	...	...	.	...

Parameters:	data_bundle --
Returns: