fastNLP.core.utils

The utils module implements many utilities needed both inside and outside fastNLP. The one intended for direct use is the cache_results() decorator.

fastNLP.core.utils.cache_results(_cache_fp, _refresh=False, _verbose=1)

Aliases: fastNLP.cache_results, fastNLP.core.utils.cache_results

cache_results is the decorator fastNLP provides for caching data. The following example shows how to use it:

import time
import numpy as np
from fastNLP import cache_results

@cache_results('cache.pkl')
def process_data():
    # Some time-consuming work, such as reading or preprocessing data; time.sleep() stands in for it here
    time.sleep(1)
    return np.random.randint(10, size=(5,))

start_time = time.time()
print("res =",process_data())
print(time.time() - start_time)

start_time = time.time()
print("res =",process_data())
print(time.time() - start_time)

# Output is shown below: both calls return the same result, and the second takes almost no time
# Save cache to cache.pkl.
# res = [5 4 9 1 8]
# 1.0042750835418701
# Read cache from cache.pkl.
# res = [5 4 9 1 8]
# 0.0040721893310546875

The second call took only about 0.004 s, because the result was read directly from the file cache.pkl instead of being preprocessed again:

# Continuing the example above: to generate a separate cache, for instance for another dataset, call the function like this
process_data(_cache_fp='cache2.pkl')  # Leaves the earlier 'cache.pkl' untouched

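The _cache_fp argument above is recognized by cache_results: data is cached to and read from 'cache2.pkl', which overrides the default 'cache.pkl'. Decorating a function with @cache_results() adds three keyword arguments to it: _cache_fp, _refresh, and _verbose. The example above uses _cache_fp. These three arguments are not passed on to your function, so your function's own parameters must not use these names: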

process_data(_cache_fp='cache2.pkl', _refresh=True)  # Forces the preprocessing cache to be regenerated.
# _verbose controls the printed messages: 0 prints nothing; 1 reports whether the call read an existing cache or generated a new one
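The behaviour described above can be sketched with a small pickle-based decorator. This is only an illustration under stated assumptions, not fastNLP's actual implementation: the name simple_cache_results is hypothetical, the message strings are copied from the sample output above, and the sketch does not key the cache on the function's arguments.

```python
import os
import pickle
from functools import wraps


def simple_cache_results(_cache_fp, _refresh=False, _verbose=1):
    """Hypothetical stand-in for cache_results; not fastNLP's actual code."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            # Pop the three control arguments so they never reach func.
            cache_fp = kwargs.pop('_cache_fp', _cache_fp)
            refresh = kwargs.pop('_refresh', _refresh)
            verbose = kwargs.pop('_verbose', _verbose)

            # Reuse an existing cache file unless a refresh was requested.
            if cache_fp is not None and not refresh and os.path.exists(cache_fp):
                with open(cache_fp, 'rb') as f:
                    result = pickle.load(f)
                if verbose == 1:
                    print("Read cache from {}.".format(cache_fp))
                return result

            # Otherwise run the function and store its result.
            result = func(*args, **kwargs)
            if cache_fp is not None:
                with open(cache_fp, 'wb') as f:
                    pickle.dump(result, f)
                if verbose == 1:
                    print("Save cache to {}.".format(cache_fp))
            return result
        return wrapper
    return decorator
```

Popping the control arguments from kwargs before calling func is what keeps them out of the wrapped function, which is also why the wrapped function cannot use those three names for its own parameters.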

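Parameters

_cache_fp (str) -- Path to which the return value is cached and from which the cache is read. If None, cache_results has no effect unless _cache_fp is passed when the function is called.
_refresh (bool) -- Whether to regenerate the cache.
_verbose (int) -- Whether to print cache information.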

Returns

fastNLP.core.utils.seq_len_to_mask(seq_len, max_len=None)

Aliases: fastNLP.seq_len_to_mask, fastNLP.core.utils.seq_len_to_mask

Converts a 1-D array of sequence lengths (seq_len) into a 2-D mask. Positions beyond a sequence's length are set to 0.

>>> import numpy as np
>>> import torch
>>> from fastNLP import seq_len_to_mask
>>> seq_len = torch.arange(2, 16)
>>> mask = seq_len_to_mask(seq_len)
>>> print(mask.size())
torch.Size([14, 15])
>>> seq_len = np.arange(2, 16)
>>> mask = seq_len_to_mask(seq_len)
>>> print(mask.shape)
(14, 15)
>>> seq_len = torch.arange(2, 16)
>>> mask = seq_len_to_mask(seq_len, max_len=100)
>>> print(mask.size())
torch.Size([14, 100])

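Parameters

seq_len (np.ndarray,torch.LongTensor) -- Array of shape (B,).
max_len (int) -- Length to which the mask is padded. By default (None), the longest length in seq_len is used. Under nn.DataParallel, however, seq_len may differ across devices, so pass max_len to pad every mask to the same length.

Returns

np.ndarray or torch.Tensor of shape (B, max_length). The element type is bool or torch.uint8.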
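The conversion can be illustrated with a NumPy-only sketch based on broadcasting. The function name seq_len_to_mask_np is hypothetical and the code is not taken from fastNLP; it only reproduces the documented input and output shapes.

```python
import numpy as np


def seq_len_to_mask_np(seq_len, max_len=None):
    """Hypothetical NumPy-only sketch; not fastNLP's actual implementation."""
    seq_len = np.asarray(seq_len)
    if seq_len.ndim != 1:
        raise ValueError("seq_len must be one-dimensional")
    if max_len is None:
        max_len = int(seq_len.max())
    # Row of positions 0..max_len-1, broadcast against a column of lengths.
    positions = np.arange(max_len)[None, :]
    return positions < seq_len[:, None]


mask = seq_len_to_mask_np([1, 3, 2])
print(mask.astype(int))
# [[1 0 0]
#  [1 1 1]
#  [1 1 0]]
```

Each row i is True for the first seq_len[i] positions and False afterwards, so passing a larger max_len only appends False columns.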

fastNLP.core.utils.get_seq_len(words, pad_value=0)

Aliases: fastNLP.get_seq_len, fastNLP.core.utils.get_seq_len

Given a words matrix of shape batch_size x max_len, returns the length of each sentence.

Parameters: words -- batch_size x max_len
Returns: (batch_size,)
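A rough NumPy sketch of this computation follows. It assumes that a sentence's length is the number of entries in its row that differ from pad_value; that reading of pad_value is an assumption, since the documentation above does not describe the parameter, and the name get_seq_len_np is hypothetical.

```python
import numpy as np


def get_seq_len_np(words, pad_value=0):
    """Hypothetical sketch: counts the non-pad entries in each row (assumed behaviour)."""
    words = np.asarray(words)
    return (words != pad_value).sum(axis=-1)


words = np.array([[4, 7, 9, 0, 0],
                  [5, 2, 0, 0, 0],
                  [3, 8, 6, 1, 2]])
print(get_seq_len_np(words))  # [3 2 5]
```

Under this assumption, a real token whose index equals pad_value would not be counted, so the padding index should be reserved for padding only.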