An introduction to tfds.load() and tf.data.Dataset
tfds.load() takes the following parameters:

```python
tfds.load(name, split=None, data_dir=None, batch_size=None, shuffle_files=False,
          download=True, as_supervised=False, decoders=None, read_config=None,
          with_info=False, builder_kwargs=None, download_and_prepare_kwargs=None,
          as_dataset_kwargs=None, try_gcs=False)
```

The important parameters are listed below, with a typical call sketched after the list:
- name: the name of the dataset
- split: which split(s) of the dataset to load
- data_dir: where the data lives, or where it should be downloaded to
- batch_size: the batch size
- shuffle_files: whether to shuffle the input files
- as_supervised: return (input, label) tuples (the default is a dictionary of features)
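Putting these together, a typical call looks like this (a minimal sketch using the 'mnist' dataset):

```python
import tensorflow_datasets as tfds

# Load the MNIST training split as (image, label) tuples, with file
# shuffling, batches of 32, and the accompanying metadata object.
ds, info = tfds.load(
    'mnist',
    split='train',
    shuffle_files=True,
    as_supervised=True,  # (image, label) tuples instead of feature dicts
    batch_size=32,
    with_info=True,      # also return a tfds.core.DatasetInfo
)
```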
1. Splitting the dataset
```python
# Take just the training split (datasets are split into 'train' and 'test' by default)
train_ds = tfds.load('mnist', split='train')

# Take both splits as two separate datasets
train_ds, test_ds = tfds.load('mnist', split=['train', 'test'])

# Take both splits merged into a single dataset
train_test_ds = tfds.load('mnist', split='train+test')

# Records 10 (included) to 20 (excluded) of the training split
train_10_20_ds = tfds.load('mnist', split='train[10:20]')

# The first 10% of the training split
train_10pct_ds = tfds.load('mnist', split='train[:10%]')

# The first 10% plus the last 80% of the training split
train_10_80pct_ds = tfds.load('mnist', split='train[:10%]+train[-80%:]')

# ---------------------------------------------------
# 10-fold cross-validation:
# Each validation set takes 10% of the training split:
# [0%:10%], [10%:20%], ..., [90%:100%].
vals_ds = tfds.load('mnist', split=[
    f'train[{k}%:{k+10}%]' for k in range(0, 100, 10)
])
# Each training set takes the complementary 90%:
# [10%:100%] (validation set [0%:10%]),
# [0%:10%] + [20%:100%] (validation set [10%:20%]), ...,
# [0%:90%] (validation set [90%:100%]).
trains_ds = tfds.load('mnist', split=[
    f'train[:{k}%]+train[{k+10}%:]' for k in range(0, 100, 10)
])
```

The same splits can also be expressed with the ReadInstruction API, with identical results:
```python
# The full `train` split.
train_ds = tfds.load('mnist', split=tfds.core.ReadInstruction('train'))

# The full `train` split and the full `test` split as two distinct datasets.
train_ds, test_ds = tfds.load('mnist', split=[
    tfds.core.ReadInstruction('train'),
    tfds.core.ReadInstruction('test'),
])

# The full `train` and `test` splits, interleaved together.
ri = tfds.core.ReadInstruction('train') + tfds.core.ReadInstruction('test')
train_test_ds = tfds.load('mnist', split=ri)

# From record 10 (included) to record 20 (excluded) of the `train` split.
train_10_20_ds = tfds.load('mnist', split=tfds.core.ReadInstruction(
    'train', from_=10, to=20, unit='abs'))

# The first 10% of the train split.
train_10pct_ds = tfds.load('mnist', split=tfds.core.ReadInstruction(
    'train', to=10, unit='%'))

# The first 10% of train + the last 80% of train.
ri = (tfds.core.ReadInstruction('train', to=10, unit='%') +
      tfds.core.ReadInstruction('train', from_=-80, unit='%'))
train_10_80pct_ds = tfds.load('mnist', split=ri)

# 10-fold cross-validation:
# The validation datasets are each going to be 10%:
# [0%:10%], [10%:20%], ..., [90%:100%].
# And the training datasets are each going to be the complementary 90%:
# [10%:100%] (for a corresponding validation set of [0%:10%]),
# [0%:10%] + [20%:100%] (for a validation set of [10%:20%]), ...,
# [0%:90%] (for a validation set of [90%:100%]).
vals_ds = tfds.load('mnist', split=[
    tfds.core.ReadInstruction('train', from_=k, to=k+10, unit='%')
    for k in range(0, 100, 10)])
trains_ds = tfds.load('mnist', split=[
    (tfds.core.ReadInstruction('train', to=k, unit='%') +
     tfds.core.ReadInstruction('train', from_=k+10, unit='%'))
    for k in range(0, 100, 10)])
```
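Both styles describe the same records. One quick sanity check (assuming the dataset has already been downloaded) is to compare cardinalities:

```python
# The string form and the ReadInstruction form should yield identical splits
ds_str = tfds.load('mnist', split='train[10:20]')
ds_ri = tfds.load('mnist', split=tfds.core.ReadInstruction(
    'train', from_=10, to=20, unit='abs'))

# Both should print 10
print(ds_str.cardinality().numpy())
print(ds_ri.cardinality().numpy())
```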
2. The returned object

tfds.load() returns a tf.data.Dataset, or a (tf.data.Dataset, tfds.core.DatasetInfo) tuple when with_info=True.
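For example, with with_info=True the second return value describes the dataset:

```python
ds, info = tfds.load('mnist', split='train', with_info=True)
print(info.features)                      # feature schema: image (28, 28, 1), label (10 classes)
print(info.splits['train'].num_examples)  # 60000
```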
3. Specifying a directory
Specifying a directory is straightforward (by default, data is placed under the user's home directory):
```python
train_ds = tfds.load('mnist', split='train', data_dir='~/user')
```

4. Getting the images and labels
Since the returned object is a tf.data.Dataset, we can transform the dataset before iterating over it, in order to get the data into the shape we need.
tf.data.Dataset has several important methods:
4.1 shuffle
Shuffles the elements of the dataset.
```python
shuffle(buffer_size, seed=None, reshuffle_each_iteration=None)
```

Randomly reshuffles the elements of this dataset. The dataset fills a buffer with buffer_size elements, then randomly samples elements from this buffer, replacing the selected elements with new ones. For a perfect shuffle, the buffer size must be greater than or equal to the full size of the dataset. For example, if your dataset contains 10,000 elements but buffer_size is set to 1,000, then shuffle will initially pick a random element from only the first 1,000 elements in the buffer. Once an element is selected, its slot in the buffer is filled by the next (i.e. the 1,001st) element, maintaining the 1,000-element buffer. reshuffle_each_iteration controls whether the shuffle order should differ for each epoch.
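Two small illustrations of the buffer behavior (the shuffled order varies from run to run):

```python
import tensorflow as tf

# A buffer at least as large as the dataset gives a full shuffle
dataset = tf.data.Dataset.range(5).shuffle(buffer_size=5)
print(list(dataset.as_numpy_iterator()))  # e.g. [3, 0, 4, 1, 2]

# With buffer_size=1 the "shuffle" degenerates to the original order,
# since the buffer only ever holds one candidate element
dataset = tf.data.Dataset.range(5).shuffle(buffer_size=1)
print(list(dataset.as_numpy_iterator()))  # [0, 1, 2, 3, 4]
```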
4.2 batch

Sets the batch size (how many elements go into one batch); during iteration, the dataset then yields that many elements at a time.
```python
batch(batch_size, drop_remainder=False)
```

```python
dataset = tf.data.Dataset.range(8)
dataset = dataset.batch(3)
list(dataset.as_numpy_iterator())
# [array([0, 1, 2]), array([3, 4, 5]), array([6, 7])]

dataset = tf.data.Dataset.range(8)
dataset = dataset.batch(3, drop_remainder=True)
list(dataset.as_numpy_iterator())
# [array([0, 1, 2]), array([3, 4, 5])]
```

Returns a Dataset.
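Note that batching adds a leading batch dimension to each element; element_spec shows it as None rather than 3 when the last batch may be smaller, unless drop_remainder=True:

```python
dataset = tf.data.Dataset.range(8).batch(3)
print(dataset.element_spec)
# TensorSpec(shape=(None,), dtype=tf.int64, name=None)

dataset = tf.data.Dataset.range(8).batch(3, drop_remainder=True)
print(dataset.element_spec)
# TensorSpec(shape=(3,), dtype=tf.int64, name=None)
```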
4.3 map
Works much like the ordinary map function: it applies a transformation to every element of the dataset.
```python
map(map_func, num_parallel_calls=None, deterministic=None)
```

```python
dataset = tf.data.Dataset.range(1, 6)  # ==> [ 1, 2, 3, 4, 5 ]
dataset = dataset.map(lambda x: x + 1)
list(dataset.as_numpy_iterator())
# [2, 3, 4, 5, 6]
```

Returns a Dataset.
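For heavier per-element work such as image preprocessing, map can run in parallel. A sketch, assuming a dataset of (img, label) pairs and a TF version that provides tf.data.AUTOTUNE:

```python
import tensorflow as tf
import tensorflow_datasets as tfds

def normalize(img, label):
    # Cast to float and scale pixel values into [0, 1]
    return tf.cast(img, tf.float32) / 255.0, label

ds = tfds.load('mnist', split='train', as_supervised=True)
# AUTOTUNE lets tf.data pick the degree of parallelism at runtime
ds = ds.map(normalize, num_parallel_calls=tf.data.AUTOTUNE)
```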
4.4 as_numpy_iterator
Returns an iterator that converts all elements of the dataset to numpy.
Use as_numpy_iterator to inspect the contents of your dataset. To see element shapes and dtypes, print the dataset elements directly instead of using as_numpy_iterator.
```python
dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3])
for element in dataset:
    print(element)
# tf.Tensor(1, shape=(), dtype=int32)
# tf.Tensor(2, shape=(), dtype=int32)
# tf.Tensor(3, shape=(), dtype=int32)

dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3])
for element in dataset.as_numpy_iterator():
    print(element)
# 1
# 2
# 3
```
4.5 Example: transforming a dataset

The following pipeline produces data in the format we want:
```python
# First use map() to resize the images, then shuffle, then set the batch size
# returned on each iteration
dataset_train = dataset_train.map(
    lambda img, label: (tf.image.resize(img, (224, 224)) / 255.0, label)
).shuffle(1024).batch(batch_size)

# The test set is not shuffled; only resize the images and batch
dataset_test = dataset_test.map(
    lambda img, label: (tf.image.resize(img, (224, 224)) / 255.0, label)
).batch(batch_size)
```

Iterating over the data:
```python
for images, labels in dataset_train:
    labels_pred = model(images, training=True)
    loss = tf.keras.losses.sparse_categorical_crossentropy(
        y_true=labels, y_pred=labels_pred)
    loss = tf.reduce_mean(loss)
    # ...
```
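The elided part of the loop would typically compute and apply gradients. A minimal sketch, assuming `model` is a tf.keras.Model and using a hypothetical Adam optimizer:

```python
optimizer = tf.keras.optimizers.Adam()

for images, labels in dataset_train:
    with tf.GradientTape() as tape:
        labels_pred = model(images, training=True)
        loss = tf.keras.losses.sparse_categorical_crossentropy(
            y_true=labels, y_pred=labels_pred)
        loss = tf.reduce_mean(loss)
    # Backpropagate and update the model's weights
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
```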