This article introduces tf.data.Dataset, the data-input format officially recommended by TensorFlow. It has two major advantages: 1) it provides a pipelined data-input mechanism that makes parallel processing easy; 2) it provides functions for data preprocessing (such as batch and shuffle), which make it easy to write a preprocessing module.

Basic concepts

  • tf.data.Dataset represents a sequence of elements, where each data example (features) is regarded as a single separate element. Most commonly, an element is a tuple of tensors consisting of one data example and its corresponding label. In addition, for the sake of readability, we often transform the data example into a dictionary whose keys are feature names and whose values are feature values. An example can be found here.
  • A Dataset can be constructed in two ways: 1. reading in-memory data from array-based structures such as numpy arrays or pandas dataframes; 2. reading data from files on disk, where the format can be csv, tsv, plain text, or TFRecord.
  • After constructing the dataset, there is usually some pre-processing to do, and we can use the tf.data.Dataset.map() function to perform the data transformation.
  • We can use the iterator mechanism to access the elements of a dataset, as sketched below. However, in some high-level TensorFlow APIs (such as tf.estimator.Estimator) this process is already implemented, and all we need to do is provide them with the constructed dataset.
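  • A minimal sketch of manual iteration under TF 1.x graph mode (the in-memory data here is purely illustrative):
    import tensorflow as tf

    dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3, 4])
    iterator = dataset.make_one_shot_iterator()  # TF 1.x one-shot iterator
    next_element = iterator.get_next()

    with tf.Session() as sess:
        while True:
            try:
                print(sess.run(next_element))    # prints 1, 2, 3, 4
            except tf.errors.OutOfRangeError:    # raised once the dataset is exhausted
                break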

Constructing a Dataset

from in-memory data

  • We can use the tf.data.Dataset.from_tensor_slices() or tf.data.Dataset.from_tensors() methods to build a dataset from in-memory tensors.
  • from_tensor_slices() slices the given tensor along the 0th dimension/axis, so the dataset gets many elements.
  • By contrast, from_tensors() treats the tensor passed in as a single whole, so the dataset gets only one element.
  • example code
    dataset1 = tf.data.Dataset.from_tensor_slices(tf.random_uniform([4, 10]))
    print(dataset1.output_types)  # ==> "tf.float32"
    print(dataset1.output_shapes) # ==> "(10,)"

    dataset2 = tf.data.Dataset.from_tensor_slices(
        (tf.random_uniform([4]),
         tf.random_uniform([4, 100], maxval=100, dtype=tf.int32)))
    print(dataset2.output_types)  # ==> "(tf.float32, tf.int32)"
    print(dataset2.output_shapes) # ==> "((), (100,))"

    dataset3 = tf.data.Dataset.zip((dataset1, dataset2))
    print(dataset3.output_types)  # ==> "(tf.float32, (tf.float32, tf.int32))"
    print(dataset3.output_shapes) # ==> "(10, ((), (100,)))"
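
  • For contrast, a minimal sketch of from_tensors() on the same kind of input (dataset4 is an illustrative name): it wraps the whole tensor as a single element, so the dataset contains exactly one element of shape (4, 10).
    dataset4 = tf.data.Dataset.from_tensors(tf.random_uniform([4, 10]))
    print(dataset4.output_shapes) # ==> "(4, 10)"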

from csv/tsv file

  • Effectively, csv and tsv files are both plain text; the only difference is that within a line, csv elements are separated by commas (,) while tsv elements are separated by tabs ('\t').
  • So we can use tf.data.TextLineDataset to extract the data of each line in the csv/tsv file.
  • It is worth noting that this step only extracts the data from the file; transforming the data into the desired format (e.g. splitting the line and constructing feature columns) is done by the tf.data.Dataset.map() function.
  • example code
    def map_function(line):
        # The argument is one dataset element (one line of text); map_function
        # implements the transformation applied to each element.
        COLUMNS = ['SepalLength', 'SepalWidth',
                   'PetalLength', 'PetalWidth',
                   'label']
        FIELD_DEFAULTS = [[0.0], [0.0], [0.0], [0.0], [0]]

        fields = tf.decode_csv(line, FIELD_DEFAULTS)
        feature = dict(zip(COLUMNS, fields))
        label = feature.pop('label')
        return feature, label

    dt = tf.data.TextLineDataset("path/to/iris_training.csv")
    print(dt) # <TextLineDataset shapes: (), types: tf.string>
    dt = dt.map(map_function)
    print(dt)
    # <MapDataset shapes: ({SepalLength: (), SepalWidth: (), PetalLength: (), PetalWidth: ()}, ()) \
    #  types: ({SepalLength: tf.float32, SepalWidth: tf.float32, \
    #  PetalLength: tf.float32, PetalWidth: tf.float32}, tf.int32)>

from TFRecord file

  • TFRecord is a binary file format for TensorFlow.
  • to be updated
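  • Pending a fuller write-up, a minimal sketch of reading a TFRecord file of tf.train.Example records (the file path, feature names, and shapes are placeholders):
    import tensorflow as tf

    def parse_fn(serialized):
        # Parse one serialized tf.train.Example into fixed-length tensors.
        features = {
            'feature': tf.FixedLenFeature([10], tf.float32),  # placeholder name/shape
            'label': tf.FixedLenFeature([], tf.int64),
        }
        parsed = tf.parse_single_example(serialized, features)
        return parsed['feature'], parsed['label']

    dataset = tf.data.TFRecordDataset("path/to/data.tfrecord")  # placeholder path
    dataset = dataset.map(parse_fn)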

Data preprocessing

  • dataset = dataset.map(map_func=parse_fn) transforms the data into the format appropriate for the model to consume.
  • The transformation logic should be implemented in the parse_fn function. Details can be seen here.

How to use dataset

basic usage

dataset = dataset.shuffle(1000).repeat().batch(batch_size)
shuffle(buffer_size) shuffles the dataset; buffer_size determines the degree of shuffling.
repeat() determines how many times the data can be repeated by the iterator (with no argument, it repeats indefinitely).
batch() sets the batch size.
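
For instance, a minimal sketch of wiring these calls into a tf.estimator input function (the function and variable names are illustrative):

def train_input_fn(features_dict, labels, batch_size):
    # Build a dataset of (features, label) pairs from in-memory data.
    dataset = tf.data.Dataset.from_tensor_slices((features_dict, labels))
    # Shuffle with a 1000-element buffer, repeat indefinitely, and batch.
    return dataset.shuffle(1000).repeat().batch(batch_size)

# estimator.train(input_fn=lambda: train_input_fn(features, labels, 32), steps=1000)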

Performance optimization

pipelining

Data extraction by the dataset and data processing by the model can be executed at the same time.

# change
dataset = dataset.batch(batch_size=FLAGS.batch_size)

# to
dataset = dataset.batch(batch_size=FLAGS.batch_size)
dataset = dataset.prefetch(buffer_size=FLAGS.prefetch_buffer_size) # commonly buffer_size is set to 1

parallelizing

Extract multiple data examples in parallel (data examples have little logical dependence on each other, so this step can be highly parallelized).

# change
dataset = dataset.map(map_func=parse_fn)

# to
dataset = dataset.map(map_func=parse_fn, num_parallel_calls=FLAGS.num_parallel_calls)

Additional notes on Dataset

  • A dataset is a structure representing slices of a pipeline; it reads and processes elements one by one, in order. Dataset slicing handles nested dictionaries and tuples transparently: every value in a dict and every element in a tuple is sliced automatically, so after slicing a single dataset element is still a dict/tuple, only with axis 0 removed from each of its values.
  • Dataset provides a reading interface for TFRecord, tf.data.TFRecordDataset, which maps the records in a TFRecord file to dataset elements one by one; these can then be processed further.
  • Commonly used dataset functions:
    • tf.data.Dataset.shuffle(buffer_size): shuffles the data given a buffer_size. Elements are fed into the buffer one by one, shuffled, and then output, so this is not a strict full shuffle.
    • tf.data.Dataset.repeat(num): how many times the data may be repeated (used to limit the number of epochs); if num is not specified, the data can be repeated indefinitely.
    • tf.data.Dataset.batch(batch_size, drop_remainder=True/False): batches the data, effectively adding a batch dimension in front of axis 0 of each dataset element. drop_remainder specifies whether the last (possibly incomplete) batch is dropped (default False); with False, axis 0 is Unknown/None, and with True, axis 0 is the specific batch_size.
  • How to iterate over a dataset: use for ... in directly, or use an iterator.
  • dataset.padded_batch(batch_size, padded_shapes, padding_values): batches sequence data and pads while batching, padding shorter sequences up to the longest length in the current batch, because all elements within a batch must have the same shape (see the sketch below).
    • padded_shapes: how far to pad. Axis 0 is the group size (usually None, i.e. pad to the longest length in the current batch); axis 1 and beyond are the inner dimensions of each feature and can simply be set to the feature's shape.
    • padding_values: the value to pad with, given as a tensor whose type must match the original data type.
  • The padding_values of padded_batch must have the same structure as the dataset elements (it can be a list, dict, etc.; for a dict the keys must correspond); just specify the corresponding shapes.
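  • A minimal sketch of padded_batch on variable-length sequences (the data here is illustrative; padding_values is omitted, so the default pad value 0 is used):
    import tensorflow as tf

    # Elements of increasing length: [1], [2, 2], [3, 3, 3], [4, 4, 4, 4].
    dataset = tf.data.Dataset.range(1, 5)
    dataset = dataset.map(lambda x: tf.fill([tf.cast(x, tf.int32)], x))
    # Pad each element to the longest length in its batch.
    dataset = dataset.padded_batch(2, padded_shapes=[None])
    # First batch:  [[1, 0], [2, 2]]
    # Second batch: [[3, 3, 3, 0], [4, 4, 4, 4]]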

  • For the dataset's map function, the input is one dataset element (a serialized string) and the output is one element (for an Example this is a dict; for a SequenceExample it can also be returned as a dict whose keys are context and feature_lists, each of which is itself a dict).

  • Shape of the returned tensors: parse_single gives [None] + shape (the None dimension is the number of feature objects within the group); parse gives [batch_size, None] + shape.
  • With FixedLenSequenceFeature, groups/examples of different sizes within the same batch are automatically padded to the maximum size (determined by the largest group_size in that batch), so all group sizes within a batch are the same; group sizes may differ across batches, e.g. (32, 10, hidden_size), (32, 12, hidden_size) and (32, 8, hidden_size) are all fine and do not affect training. Therefore, if you inspect the shape in TF you will see (None, None, hidden_size): the batch dimension is None because the last batch may hold fewer than batch_size examples, and the group dimension is None because the group_size differs from batch to batch. A sketch of this is shown below.
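  • A minimal sketch of parsing SequenceExamples with FixedLenSequenceFeature and then batching (the feature names, hidden_size, and file path are illustrative):
    import tensorflow as tf

    hidden_size = 128  # illustrative

    def parse_fn(serialized):
        context_features = {'label': tf.FixedLenFeature([], tf.int64)}
        sequence_features = {
            # One hidden_size-dim vector per group member; group size varies per example.
            'steps': tf.FixedLenSequenceFeature([hidden_size], tf.float32),
        }
        context, sequences = tf.parse_single_sequence_example(
            serialized,
            context_features=context_features,
            sequence_features=sequence_features)
        return sequences['steps'], context['label']  # steps: (group_size, hidden_size)

    dataset = tf.data.TFRecordDataset("path/to/data.tfrecord")  # placeholder path
    dataset = dataset.map(parse_fn)
    # Pad the variable group dimension within each batch.
    dataset = dataset.padded_batch(32, padded_shapes=([None, hidden_size], []))
    # Batched 'steps' static shape: (None, None, hidden_size)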

Post Date: 2019-07-23

Copyright notice: This is an original article; please credit the source when reposting.