The Example Mechanism

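Example is a data format defined officially by TensorFlow. It is essentially a dictionary that is easy to serialize and parse back, and combined with the TFRecord storage format it lets you make the most of parallel data reads and writes.
There are two kinds of example: the plain Example and the SequenceExample.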
Example
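A TensorFlow example is a key-value store. Each key is a string that maps to a feature, and a feature holds one of three types: Int64List (a list of 64-bit integers), FloatList (a list of floats), or BytesList (a list of byte strings). All three are list types, so a feature can store a single number or a whole run of numbers (a flattened matrix, for instance).
Note 1: Int64List, FloatList, and BytesList take their values as a list, so a scalar must be wrapped in a list first, and other iterables should be converted to a list as well.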
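
Note 1 is often handled with small wrapper functions. The following is a minimal sketch of one way to do that; the helper names (`_as_list`, `int64_feature`, `float_feature`, `bytes_feature`) are hypothetical and are not part of TensorFlow.

```python
import tensorflow as tf


def _as_list(value):
    """Wrap a scalar in a list; turn any other iterable into a list."""
    if isinstance(value, (str, bytes)):
        return [value]  # a single string counts as one scalar, not as characters
    try:
        return list(value)
    except TypeError:  # not iterable, so treat it as a scalar
        return [value]


def int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=_as_list(value)))


def float_feature(value):
    return tf.train.Feature(float_list=tf.train.FloatList(value=_as_list(value)))


def bytes_feature(value):
    # BytesList stores raw bytes, so text is encoded as UTF-8 first.
    items = [v.encode("utf-8") if isinstance(v, str) else v for v in _as_list(value)]
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=items))


scalar_feature = int64_feature(7)        # scalar is wrapped in a list
range_feature = int64_feature(range(3))  # iterable is converted to a list
text_feature = bytes_feature("abc")      # str is encoded to bytes
```

With helpers like these, call sites no longer need to care whether a value is a scalar or a sequence.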

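Note 2: an example stores values as one flat row, so an M*N matrix (M rows, N columns) has to be flattened into a single row of length M*N in row-major order: positions 0 to N-1 hold the first row, positions N to 2N-1 hold the second row, and so on.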
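
The row-major flattening can be sketched in plain Python. The function names `flatten_row_major` and `unflatten_row_major` are hypothetical helpers made up for this illustration; with N columns, row i ends up in positions i*N through (i+1)*N-1 of the flat list.

```python
def flatten_row_major(matrix):
    """Concatenate the rows of an M x N matrix into one list of length M * N."""
    return [x for row in matrix for x in row]


def unflatten_row_major(flat, num_cols):
    """Rebuild the rows; row i occupies flat[i * num_cols : (i + 1) * num_cols]."""
    return [flat[i:i + num_cols] for i in range(0, len(flat), num_cols)]


matrix = [[1, 2, 3],
          [4, 5, 6]]              # M = 2 rows, N = 3 columns
flat = flatten_row_major(matrix)  # this flat list is what goes into Int64List(value=...)
restored = unflatten_row_major(flat, num_cols=3)
```

The number of columns is not stored in the example itself, so the reader has to know it (or store it as a separate feature) to restore the matrix.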

code

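import collections

import tensorflow as tf

# Building an Example: raw data -> Int64List -> Feature -> Features -> Example

# First build the individual features.
# Int64List() must be given a list.
value1 = [0, 1, 2]
value2 = [3, 4, 5]
f1 = tf.train.Feature(int64_list=tf.train.Int64List(value=value1))
f2 = tf.train.Feature(int64_list=tf.train.Int64List(value=value2))


# Then build the Features.
dict_features = collections.OrderedDict()
# or dict_features = {}; any dict works
# Build the str -> Feature mapping.
dict_features["f1"] = f1
dict_features["f2"] = f2
# Pass the finished dict as the `feature` argument.
features = tf.train.Features(feature=dict_features)


# Then wrap the Features in an Example.
tf_example = tf.train.Example(features=features)

# Serialize the Example.
serialized_example = tf_example.SerializeToString()


# Write the serialized Example to a TFRecord file.
# output_file_path is a placeholder; point it at your own file.
# (In TensorFlow 2.x, use tf.io.TFRecordWriter instead of tf.python_io.TFRecordWriter.)
output_file_path = "example.tfrecord"
writer = tf.python_io.TFRecordWriter(output_file_path)
writer.write(serialized_example)
writer.close()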

# Output of print(tf_example)
features {
  feature {
    key: "f1"
    value {
      int64_list {
        value: 0
        value: 1
        value: 2
      }
    }
  }
  feature {
    key: "f2"
    value {
      int64_list {
        value: 3
        value: 4
        value: 5
      }
    }
  }
}
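
Serialization also works in the other direction. The sketch below assumes only that `tf.train.Example` is a protocol buffer message, so the standard protobuf `FromString` call can rebuild it from the serialized bytes; the variable names are illustrative.

```python
import tensorflow as tf

feature = tf.train.Feature(int64_list=tf.train.Int64List(value=[0, 1, 2]))
original = tf.train.Example(features=tf.train.Features(feature={"f1": feature}))

serialized = original.SerializeToString()  # bytes, ready for a TFRecord file

# Rebuild the message from its serialized bytes.
restored = tf.train.Example.FromString(serialized)

# Fields are reached by attribute access and by key, like a nested dict.
values = list(restored.features.feature["f1"].int64_list.value)
```

This is handy for inspecting a record by hand; inside an input pipeline, the parsing functions described later are the usual route.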

Sequence Example

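A SequenceExample stores a series of examples. It suits models that process one group of data at a time, and it can equally be used for sequence modeling. In other words, for the value of a feature list, axis 0 can index different objects within the same group, or it can index time steps (how one feature changes over time).
A sequence example has two parts, context and feature_lists. context holds group-level features (such as group_id or the labels of the objects in the group) and is organized exactly like features; feature_lists holds the features of each object in the group, and its organization is described next.
The basic data unit of a SequenceExample is the feature_list, and the top-level container changes accordingly from features to feature_lists. A feature_list is also a key-value pair, except that one value contains multiple features (that is, one key maps to multiple values), and each feature corresponds to one feature of a single object in the group.
Some requirements for a sequence example:
- Features within the same feature list must all have the same type.
- Features within the same feature list should preferably have the same size (they may differ, but this is not recommended).
- Different feature lists within the same feature lists must have the same length (because the length of every feature list is the number of elements in the group).
- Corresponding feature lists in different sequence examples may have different lengths (different groups may contain different numbers of elements).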
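
The length requirement can be checked at construction time. This is a minimal sketch, not a TensorFlow feature: `build_feature_lists` is a hypothetical helper, and the keys `clicks` and `prices` are made-up sample data.

```python
import tensorflow as tf


def build_feature_lists(rows_by_key):
    """Build FeatureLists from {key: list of int rows}, one row per group member."""
    lengths = {key: len(rows) for key, rows in rows_by_key.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError("feature lists differ in length: %s" % lengths)
    feature_list = {
        key: tf.train.FeatureList(feature=[
            tf.train.Feature(int64_list=tf.train.Int64List(value=row))
            for row in rows
        ])
        for key, rows in rows_by_key.items()
    }
    return tf.train.FeatureLists(feature_list=feature_list)


feature_lists = build_feature_lists({
    "clicks": [[1, 0], [0, 0], [1, 1]],  # 3 group members, 2 values each
    "prices": [[10], [20], [30]],        # 3 group members, 1 value each
})
# Axis 0 indexes the group member (or time step): take member 1 of "clicks".
second_member = list(feature_lists.feature_list["clicks"].feature[1].int64_list.value)
```

Failing early here is a design choice: a length mismatch is much easier to diagnose when the record is written than when it is parsed during training.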

code

import tensorflow as tf

# Building a SequenceExample: raw data -> Int64List -> Feature -> list of Feature -> FeatureList -> FeatureLists -> SequenceExample
# Build the context (same process as building Features).
group_id = 1
labels = [1, 0, 0, 0]
group_id_feature = tf.train.Feature(
    int64_list=tf.train.Int64List(value=[group_id]))  # the value must be a list
labels_feature = tf.train.Feature(int64_list=tf.train.Int64List(value=labels))
context_dict = {}
context_dict["group_id"] = group_id_feature
context_dict["labels"] = labels_feature
context = tf.train.Features(feature=context_dict)

# Build each feature and collect the features into lists.
value_list1 = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
value_list2 = [[0, -1, -2], [-3, -4, -5], [-6, -7, -8]]
list_of_feature1 = [tf.train.Feature(int64_list=tf.train.Int64List(value=value)) for value in value_list1]
list_of_feature2 = [tf.train.Feature(int64_list=tf.train.Int64List(value=value)) for value in value_list2]

# Assemble the features into feature_list entries, then into feature_lists.
feature_lists_dict = {}
feature_lists_dict["feature_list1"] = tf.train.FeatureList(feature=list_of_feature1)
feature_lists_dict["feature_list2"] = tf.train.FeatureList(feature=list_of_feature2)
feature_lists = tf.train.FeatureLists(feature_list=feature_lists_dict)

# Assemble context and feature lists into a sequence example.
seq_example = tf.train.SequenceExample(context=context, feature_lists=feature_lists)

# Serialize the sequence example.
serialized_seq_example = seq_example.SerializeToString()

# Write the serialized sequence example to a TFRecord file.
# output_file_path is a placeholder; point it at your own file.
# (In TensorFlow 2.x, use tf.io.TFRecordWriter instead of tf.python_io.TFRecordWriter.)
output_file_path = "sequence_example.tfrecord"
writer = tf.python_io.TFRecordWriter(output_file_path)
writer.write(serialized_seq_example)  # write the sequence example, not tf_example
writer.close()

# Output of print(seq_example)
context {
  feature {
    key: "group_id"
    value {
      int64_list {
        value: 1
      }
    }
  }
  feature {
    key: "labels"
    value {
      int64_list {
        value: 1
        value: 0
        value: 0
        value: 0
      }
    }
  }
}
feature_lists {
  feature_list {
    key: "feature_list1"
    value {
      feature {
        int64_list {
          value: 0
          value: 1
          value: 2
        }
      }
      feature {
        int64_list {
          value: 3
          value: 4
          value: 5
        }
      }
      feature {
        int64_list {
          value: 6
          value: 7
          value: 8
        }
      }
    }
  }
  feature_list {
    key: "feature_list2"
    value {
      feature {
        int64_list {
          value: 0
          value: -1
          value: -2
        }
      }
      feature {
        int64_list {
          value: -3
          value: -4
          value: -5
        }
      }
      feature {
        int64_list {
          value: -6
          value: -7
          value: -8
        }
      }
    }
  }
}

String representation of an example

[Figure: an example shown in its string representation]

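Reading examples from a TFRecord file

Reading examples from a TFRecord file takes two steps:
1. Read the TFRecord file through a dataset; dataset has a built-in reader for record files.
2. The dataset yields serialized strings, so parse each string into the form the program needs (with the map function).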
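
The two steps can be tried end to end with a throwaway file. The sketch below assumes TensorFlow 2.x with eager execution (so the dataset can be iterated directly) and therefore uses the `tf.io` names; the temporary path, the key `f1`, and the sample values are made up for the illustration.

```python
import os
import tempfile

import tensorflow as tf

# Hypothetical file location used only for this sketch.
path = os.path.join(tempfile.mkdtemp(), "demo.tfrecord")

# Write three tiny examples, each holding a length-3 int64 vector under "f1".
with tf.io.TFRecordWriter(path) as writer:
    for start in (0, 3, 6):
        feature = tf.train.Feature(
            int64_list=tf.train.Int64List(value=[start, start + 1, start + 2]))
        example = tf.train.Example(features=tf.train.Features(feature={"f1": feature}))
        writer.write(example.SerializeToString())

schema = {"f1": tf.io.FixedLenFeature([3], tf.int64)}

# Step 1: the dataset reads the record file and yields serialized strings.
dataset = tf.data.TFRecordDataset(path)
# Step 2: map() parses every serialized string into a dict of tensors.
dataset = dataset.map(lambda record: tf.io.parse_single_example(record, schema))
dataset = dataset.batch(2)

batches = [batch["f1"].numpy().tolist() for batch in dataset]
```

With three records and a batch size of 2, the last batch holds a single record.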

Parsing an example

tf.parse_single_example(
serialized,
features,
name=None,
example_names=None
)

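Parses a single serialized example.
Parameters:
- serialized: the serialized string to parse.
- features: a dict mapping str to FixedLenFeature.
Returns: a dict mapping str to tensor.
tf.parse_example differs from tf.parse_single_example in that it parses a batch of serialized examples, i.e. it takes a tensor of shape (batch_size,).
This post generally does not recommend tf.parse_example(): it is simpler to parse each example on its own and then batch the results with dataset.batch(), although batching first and then calling tf.parse_example() can be faster on large datasets.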
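
The difference between the two calls shows up in the output shapes. This is a small in-memory sketch, assuming TensorFlow 2.x with eager execution and the `tf.io` names; `make_serialized` and the key `f1` are hypothetical.

```python
import tensorflow as tf


def make_serialized(values):
    """Serialize an Example holding one int64 vector under the key "f1"."""
    feature = tf.train.Feature(int64_list=tf.train.Int64List(value=values))
    example = tf.train.Example(features=tf.train.Features(feature={"f1": feature}))
    return example.SerializeToString()


schema = {"f1": tf.io.FixedLenFeature([3], tf.int64)}
serialized_batch = [make_serialized([0, 1, 2]), make_serialized([3, 4, 5])]

# One record in, one dict of tensors out: "f1" has shape (3,).
single = tf.io.parse_single_example(serialized_batch[0], schema)

# A (batch_size,) string tensor in, batched tensors out: "f1" has shape (2, 3).
batched = tf.io.parse_example(tf.constant(serialized_batch), schema)

single_shape = single["f1"].shape.as_list()
batched_shape = batched["f1"].shape.as_list()
```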

code

import tensorflow as tf

# seq_length, input_file, and batch_size are assumed to be defined by the caller.
# In TensorFlow 2.x the same names live under tf.io (tf.io.FixedLenFeature, tf.io.parse_single_example).
# [seq_length] is the length of the vector, since the values were originally passed into features as an Int64List.
name_to_features = {
    "input_ids": tf.FixedLenFeature([seq_length], tf.int64),
    "input_mask": tf.FixedLenFeature([seq_length], tf.int64),
    "segment_ids": tf.FixedLenFeature([seq_length], tf.int64),
    "label_ids": tf.FixedLenFeature([], tf.int64),
    "is_real_example": tf.FixedLenFeature([], tf.int64),
}

def _decode_record(record, name_to_features):
  """Decodes a record to a TensorFlow example."""
  example = tf.parse_single_example(record, name_to_features)  # type: dict
  # The return value becomes the new element of the dataset.
  return example

d = tf.data.TFRecordDataset(input_file)
d = d.map(lambda record: _decode_record(record, name_to_features))
d = d.batch(batch_size)

Parsing a sequence example

tf.io.parse_single_sequence_example(
serialized,
context_features=None,
sequence_features=None,
example_name=None,
name=None
)

Parses a single serialized sequence example.
Parameters:
- context_features: a dict mapping str to tf.FixedLenFeature.
- sequence_features: a dict mapping str to tf.FixedLenSequenceFeature.
Returns: two dicts, context and feature_lists, each mapping str to tensor.

code

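import tensorflow as tf

# seq_length and input_file are assumed to be defined by the caller.
# The shape argument gives the shape of a single element within the group.
# The parsed tensor has shape (None,) + shape, where None is the group size.
name_to_feature_lists = {
    "input_ids": tf.FixedLenSequenceFeature(shape=[seq_length], dtype=tf.int64),
    "input_mask": tf.FixedLenSequenceFeature(shape=[seq_length], dtype=tf.int64),
    "segment_ids": tf.FixedLenSequenceFeature(shape=[seq_length], dtype=tf.int64),
    "label_ids": tf.FixedLenSequenceFeature(shape=[], dtype=tf.int64),
}

def _decode_record(record, name_to_feature_lists):
  """Decodes a record to a TensorFlow sequence example."""
  # The call returns a (context, feature_lists) pair; the context is ignored here.
  _, seq_example = tf.parse_single_sequence_example(
      record, context_features=None, sequence_features=name_to_feature_lists)
  # Return the parsed dict so that map() produces it as the new dataset element.
  return seq_example

d = tf.data.TFRecordDataset(input_file)
d = d.map(lambda record: _decode_record(record, name_to_feature_lists))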
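
To see both returned dicts, the following sketch builds a small sequence example in memory and parses it straight away. It assumes TensorFlow 2.x with eager execution and the `tf.io` names; the helper `_int64` and the sample values are illustrative only.

```python
import tensorflow as tf


def _int64(values):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=values))


seq_example = tf.train.SequenceExample(
    context=tf.train.Features(feature={"group_id": _int64([1])}),
    feature_lists=tf.train.FeatureLists(feature_list={
        "feature_list1": tf.train.FeatureList(
            feature=[_int64(row) for row in [[0, 1, 2], [3, 4, 5], [6, 7, 8]]]),
    }),
)

# The context uses FixedLenFeature; the feature lists use FixedLenSequenceFeature,
# whose shape describes one element of the group (here a length-3 vector).
context, sequences = tf.io.parse_single_sequence_example(
    seq_example.SerializeToString(),
    context_features={"group_id": tf.io.FixedLenFeature([], tf.int64)},
    sequence_features={
        "feature_list1": tf.io.FixedLenSequenceFeature([3], tf.int64)},
)

group_id = int(context["group_id"].numpy())
rows = sequences["feature_list1"].numpy().tolist()  # shape (group size, 3)
```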

Post Date: 2019-08-19