[TensorFlow Basics] Loading and Preprocessing Data
Contents
1. Images
1.1 Environment setup
1.2 Loading the dataset
1.3 Data preprocessing
1.4 Training a model
2. CSV
2.1 Environment setup
2.2 Loading the data
2.3 Data preprocessing
2.4 Building a model
2.5 Training, evaluation, and prediction
3. NumPy
3.1 Environment setup
3.2 Loading the data
3.3 Data preprocessing
4. pandas.DataFrame
4.1 Environment setup
4.2 Loading the data
4.3 Data preprocessing
4.4 Building and training a model
5. TFRecord and tf.Example
6. Text
6.1 Environment setup
6.2 Loading the data
6.3 Data preprocessing
6.4 Building a model
6.5 Training the model
1. Images
1.1 Environment setup
import tensorflow as tf

AUTOTUNE = tf.data.experimental.AUTOTUNE

1.2 Loading the dataset
Load the images with tf.data.
# Download the dataset
import pathlib
data_root_orig = tf.keras.utils.get_file(origin='https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz',
                                         fname='flower_photos', untar=True)
data_root = pathlib.Path(data_root_orig)
print(data_root)
'''
Downloading data from https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz
228813984/228813984 [==============================] - 3s 0us/step
/home/kbuilder/.keras/datasets/flower_photos
'''

# Inspect the dataset
for item in data_root.iterdir():
    print(item)
'''
/home/kbuilder/.keras/datasets/flower_photos/daisy
/home/kbuilder/.keras/datasets/flower_photos/dandelion
/home/kbuilder/.keras/datasets/flower_photos/sunflowers
/home/kbuilder/.keras/datasets/flower_photos/roses
/home/kbuilder/.keras/datasets/flower_photos/LICENSE.txt
/home/kbuilder/.keras/datasets/flower_photos/tulips
'''

1.3 Data preprocessing
# 1. Shuffle the dataset
import random
all_image_paths = list(data_root.glob('*/*'))
all_image_paths = [str(path) for path in all_image_paths]
random.shuffle(all_image_paths)

image_count = len(all_image_paths)
image_count  # 3670

# Inspect a few images
import os
attributions = (data_root/"LICENSE.txt").open(encoding='utf-8').readlines()[4:]
attributions = [line.split(' CC-BY') for line in attributions]
attributions = dict(attributions)

import IPython.display as display

def caption_image(image_path):
    image_rel = pathlib.Path(image_path).relative_to(data_root)
    return "Image (CC BY 2.0) " + ' - '.join(attributions[str(image_rel)].split(' - ')[:-1])

for n in range(3):
    image_path = random.choice(all_image_paths)
    display.display(display.Image(image_path))
    print(caption_image(image_path))
    print()

# 2. Determine the label for each image
# 2.1 List the available labels
label_names = sorted(item.name for item in data_root.glob('*/') if item.is_dir())
label_names  # ['daisy', 'dandelion', 'roses', 'sunflowers', 'tulips']

# 2.2 Assign an index to each label
label_to_index = dict((name, index) for index, name in enumerate(label_names))
label_to_index  # {'daisy': 0, 'dandelion': 1, 'roses': 2, 'sunflowers': 3, 'tulips': 4}

# 2.3 Create a list containing the label index of every file
all_image_labels = [label_to_index[pathlib.Path(path).parent.name]
                    for path in all_image_paths]
print("First 10 labels indices: ", all_image_labels[:10])
# First 10 labels indices:  [4, 2, 4, 1, 1, 2, 4, 4, 3, 2]

# 3. Load and format the images
# 3.1 Take the first image
img_path = all_image_paths[0]
img_path  # '/home/kbuilder/.keras/datasets/flower_photos/tulips/14099204939_60e6ffa4c3_n.jpg'

# Inspect the raw data
img_raw = tf.io.read_file(img_path)
print(repr(img_raw)[:100]+"...")
# <tf.Tensor: shape=(), dtype=string, numpy=b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x00\x00\x01\x00...

# 3.2 Decode the raw data into an image tensor
img_tensor = tf.image.decode_image(img_raw)

print(img_tensor.shape)
print(img_tensor.dtype)
'''
(212, 320, 3)
<dtype: 'uint8'>
'''

# 3.3 Resize the image tensor to fit the model
img_final = tf.image.resize(img_tensor, [192, 192])
img_final = img_final/255.0
print(img_final.shape)
print(img_final.numpy().min())
print(img_final.numpy().max())
'''
(192, 192, 3)
0.0
1.0
'''

# 3.4 Wrap the steps above in functions
def preprocess_image(image):
    image = tf.image.decode_jpeg(image, channels=3)
    image = tf.image.resize(image, [192, 192])
    image /= 255.0  # normalize to [0,1] range
    return image

def load_and_preprocess_image(path):
    image = tf.io.read_file(path)
    return preprocess_image(image)

# Apply the wrapped functions
import matplotlib.pyplot as plt

image_path = all_image_paths[0]
label = all_image_labels[0]

plt.imshow(load_and_preprocess_image(img_path))
plt.grid(False)
plt.xlabel(caption_image(img_path))
plt.title(label_names[label].title())
print()

Build a tf.data.Dataset:
The from_tensor_slices method is the simplest way to build a tf.data.Dataset.
# 1. Build a dataset of path strings, path_ds
# Slicing the array of strings yields a dataset of strings
path_ds = tf.data.Dataset.from_tensor_slices(all_image_paths)

print(path_ds)
# <TensorSliceDataset element_spec=TensorSpec(shape=(), dtype=tf.string, name=None)>

# 2. Load and format the images to build an image dataset, image_ds
# Map preprocess_image over the dataset of paths so the images are loaded and formatted on the fly
image_ds = path_ds.map(load_and_preprocess_image, num_parallel_calls=AUTOTUNE)

import matplotlib.pyplot as plt

plt.figure(figsize=(8,8))
for n, image in enumerate(image_ds.take(4)):
    plt.subplot(2,2,n+1)
    plt.imshow(image)
    plt.grid(False)
    plt.xticks([])
    plt.yticks([])
    plt.xlabel(caption_image(all_image_paths[n]))
plt.show()

# 3. Create a dataset of labels, label_ds
label_ds = tf.data.Dataset.from_tensor_slices(tf.cast(all_image_labels, tf.int64))

for label in label_ds.take(10):
    print(label_names[label.numpy()])

# 4. Build a dataset of (image, label) pairs, image_label_ds
image_label_ds = tf.data.Dataset.zip((image_ds, label_ds))
print(image_label_ds)
# <ZipDataset element_spec=(TensorSpec(shape=(192, 192, 3), dtype=tf.float32, name=None), TensorSpec(shape=(), dtype=tf.int64, name=None))>

1.4 Training a model
Prepare the data for training:
- the training data should be thoroughly shuffled;
- the training data should be split into batches;
1. The order of operations matters:
- calling .shuffle after .repeat shuffles elements across epoch boundaries (some elements may appear twice before others have appeared at all);
- calling .shuffle after .batch shuffles the order of the batches, but does not shuffle elements across batches;
2. A larger shuffle buffer gives more thorough randomization, but uses more memory.
3. The shuffle buffer must be filled before any element can be pulled from it, so a large buffer_size can cause a delay when the Dataset starts up.
4. A shuffled dataset does not report the end of the dataset until the shuffle buffer is completely empty. When the Dataset is restarted by .repeat, there is another wait for the shuffle buffer to fill. A pipeline sketch that applies these points follows this list.
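The training code in the next block uses a dataset named ds and a BATCH_SIZE that this excerpt never defines. A minimal pipeline sketch that would produce them, assuming the image_label_ds and image_count built earlier; the buffer size and the batch size of 32 are assumptions rather than values stated in this text (32 is consistent with the (32, 6, 6, 1280) feature-map shape printed later):

BATCH_SIZE = 32  # assumed value; consistent with the batch shape printed below

# Shuffle over the full dataset, repeat across epochs, then batch and prefetch
ds = image_label_ds.shuffle(buffer_size=image_count)
ds = ds.repeat()
ds = ds.batch(BATCH_SIZE)
# prefetch lets the pipeline prepare the next batch while the current one is being consumed
ds = ds.prefetch(buffer_size=AUTOTUNE)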
Build and train the model:
# 2. Build the model
# 2.1 Use a copy of the MobileNet v2 model for training
mobile_net = tf.keras.applications.MobileNetV2(input_shape=(192, 192, 3), include_top=False)

# Freeze the MobileNet v2 weights
mobile_net.trainable=False

# Rescale the images from [0, 1] to the [-1, 1] range the model expects
def change_range(image,label):
    return 2*image-1, label

keras_ds = ds.map(change_range)

# MobileNet v2 returns a 6x6 spatial grid of features for each image
# The dataset may take a few seconds to start while its shuffle buffer fills; pull one batch of images
image_batch, label_batch = next(iter(keras_ds))

feature_map_batch = mobile_net(image_batch)
print(feature_map_batch.shape)
# (32, 6, 6, 1280)
# Each batch holds 32 examples, and each example yields a 6x6 spatial grid

# 2.2 Build the model
model = tf.keras.Sequential([
    mobile_net,
    tf.keras.layers.GlobalAveragePooling2D(),  # average over the spatial dimensions
    tf.keras.layers.Dense(len(label_names), activation = 'softmax')])

logit_batch = model(image_batch).numpy()

print("min logit:", logit_batch.min())
print("max logit:", logit_batch.max())
print()

print("Shape:", logit_batch.shape)
'''
min logit: 0.014231807
max logit: 0.7678226

Shape: (32, 5)
'''

model.compile(optimizer=tf.keras.optimizers.Adam(),
              loss='sparse_categorical_crossentropy',
              metrics=["accuracy"])

# 2.3 Inspect the model structure
# The only trainable variables are the weights and bias of the dense layer
model.summary()
'''
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #
=================================================================
 mobilenetv2_1.00_192 (Funct  (None, 6, 6, 1280)       2257984
 ional)
 global_average_pooling2d (G  (None, 1280)             0
 lobalAveragePooling2D)
 dense (Dense)               (None, 5)                 6405
=================================================================
Total params: 2,264,389
Trainable params: 6,405
Non-trainable params: 2,257,984
_________________________________________________________________
'''

# 2.4 Train the model
# For demonstration purposes each epoch runs only 3 steps; the number of steps is specified before being passed to model.fit()
steps_per_epoch=tf.math.ceil(len(all_image_paths)/BATCH_SIZE).numpy()
steps_per_epoch  # 115.0

model.fit(ds, epochs=1, steps_per_epoch=3)

2. CSV
This section uses the Titanic passenger data; the model predicts how likely a passenger is to survive from features such as age, sex, ticket class, and whether the passenger travelled alone.
2.1 Environment setup
import functools

import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds

TRAIN_DATA_URL = "https://storage.googleapis.com/tf-datasets/titanic/train.csv"
TEST_DATA_URL = "https://storage.googleapis.com/tf-datasets/titanic/eval.csv"

train_file_path = tf.keras.utils.get_file("train.csv", TRAIN_DATA_URL)
test_file_path = tf.keras.utils.get_file("eval.csv", TEST_DATA_URL)

# Make numpy values easier to read
np.set_printoptions(precision=3, suppress=True)

2.2 Loading the data
Inspect the CSV file to see its format:
head {train_file_path}
'''
survived,sex,age,n_siblings_spouses,parch,fare,class,deck,embark_town,alone
0,male,22.0,1,0,7.25,Third,unknown,Southampton,n
1,female,38.0,1,0,71.2833,First,C,Cherbourg,n
1,female,26.0,0,0,7.925,Third,unknown,Southampton,y
1,female,35.0,1,0,53.1,First,C,Southampton,n
0,male,28.0,0,0,8.4583,Third,unknown,Queenstown,y
0,male,2.0,3,1,21.075,Third,unknown,Southampton,n
1,female,27.0,0,2,11.1333,Third,unknown,Southampton,n
1,female,14.0,1,0,30.0708,Second,unknown,Cherbourg,n
1,female,4.0,1,1,16.7,Third,G,Southampton,n
'''
Every column in the CSV file has a name, and the dataset constructor picks these column names up automatically. If the first row of the file does not contain the column names, pass them as a list of strings to the column_names argument of make_csv_dataset.
CSV_COLUMNS = ['survived', 'sex', 'age', 'n_siblings_spouses', 'parch', 'fare', 'class', 'deck', 'embark_town', 'alone']

dataset = tf.data.experimental.make_csv_dataset(
    ...,
    column_names=CSV_COLUMNS,
    ...)
This example uses all of the columns. If you need to omit some columns from the dataset, create a list of just the columns you plan to use and pass it to the constructor's (optional) select_columns argument.
dataset = tf.data.experimental.make_csv_dataset(
    ...,
    select_columns = columns_to_use,
    ...)
Specify the column that contains the label:
LABEL_COLUMN = 'survived'
LABELS = [0, 1]
Read the CSV data and create a dataset:
def get_dataset(file_path):
    dataset = tf.data.experimental.make_csv_dataset(
        file_path,
        batch_size=12,  # kept artificially small so the examples are easy to display
        label_name=LABEL_COLUMN,
        na_value="?",
        num_epochs=1,
        ignore_errors=True)
    return dataset

raw_train_data = get_dataset(train_file_path)
raw_test_data = get_dataset(test_file_path)
Each item in the dataset is a batch, represented as a tuple of (many examples, many labels). The example data is organized as column-major tensors (rather than row-major ones), each holding as many elements as the batch size (12 in this example).
examples, labels = next(iter(raw_train_data))  # the first batch
print("EXAMPLES: \n", examples, "\n")
print("LABELS: \n", labels)
'''
EXAMPLES:
 OrderedDict([('sex', <tf.Tensor: shape=(12,), dtype=string, numpy=
array([b'female', b'female', b'male', b'male', b'female', b'male',
       b'female', b'male', b'female', b'male', b'female', b'male'],
      dtype=object)>), ('age', <tf.Tensor: shape=(12,), dtype=float32, numpy=
array([28., 41., 28., 24., 63., 28., 28., 65., 34.,  9., 27., 30.],
      dtype=float32)>), ...])
LABELS:
 tf.Tensor([1 0 0 0 1 0 0 0 1 0 1 0], shape=(12,), dtype=int32)
'''

2.3 Data preprocessing
Categorical data: for features whose values come from a fixed, finite set, use the tf.feature_column API to create a collection of tf.feature_column.indicator_column objects, one for each categorical column.
CATEGORIES = {
    'sex': ['male', 'female'],
    'class' : ['First', 'Second', 'Third'],
    'deck' : ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'],
    'embark_town' : ['Cherbourg', 'Southhampton', 'Queenstown'],
    'alone' : ['y', 'n']
}

categorical_columns = []
for feature, vocab in CATEGORIES.items():
    cat_col = tf.feature_column.categorical_column_with_vocabulary_list(
        key=feature, vocabulary_list=vocab)
    categorical_columns.append(tf.feature_column.indicator_column(cat_col))

# Inspect what was just created
categorical_columns
'''
[IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='sex', vocabulary_list=('male', 'female'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
 IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='class', vocabulary_list=('First', 'Second', 'Third'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
 IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='deck', vocabulary_list=('A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
 IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='embark_town', vocabulary_list=('Cherbourg', 'Southhampton', 'Queenstown'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
 IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='alone', vocabulary_list=('y', 'n'), dtype=tf.string, default_value=-1, num_oov_buckets=0))]
'''
Continuous data: continuous features need to be normalized.
def process_continuous_data(mean, data):
    # Normalize the data
    data = tf.cast(data, tf.float32) * 1/(2*mean)
    return tf.reshape(data, [-1, 1])

MEANS = {
    'age' : 29.631308,
    'n_siblings_spouses' : 0.545455,
    'parch' : 0.379585,
    'fare' : 34.385399
}

numerical_columns = []

for feature in MEANS.keys():
    num_col = tf.feature_column.numeric_column(feature, normalizer_fn=functools.partial(process_continuous_data, MEANS[feature]))
    numerical_columns.append(num_col)

# Inspect what was just created
numerical_columns
'''
[NumericColumn(key='age', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=functools.partial(<function process_continuous_data at 0x7f74fc4f8c10>, 29.631308)),
 NumericColumn(key='n_siblings_spouses', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=functools.partial(<function process_continuous_data at 0x7f74fc4f8c10>, 0.545455)),
 NumericColumn(key='parch', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=functools.partial(<function process_continuous_data at 0x7f74fc4f8c10>, 0.379585)),
 NumericColumn(key='fare', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=functools.partial(<function process_continuous_data at 0x7f74fc4f8c10>, 34.385399))]
'''
Combine the categorical and numeric column collections and pass them to tf.keras.layers.DenseFeatures to create an input layer that performs the preprocessing.
preprocessing_layer = tf.keras.layers.DenseFeatures(categorical_columns+numerical_columns)

2.4 Building a model
model = tf.keras.Sequential([
    preprocessing_layer,                             # preprocessing layer
    tf.keras.layers.Dense(128, activation='relu'),   # fully connected layer
    tf.keras.layers.Dense(128, activation='relu'),   # fully connected layer
    tf.keras.layers.Dense(1, activation='sigmoid'),  # output layer
])

model.compile(
    loss='binary_crossentropy',
    optimizer='adam',
    metrics=['accuracy'])

2.5 Training, evaluation, and prediction
# 1. Train the model
train_data = raw_train_data.shuffle(500)
test_data = raw_test_data

model.fit(train_data, epochs=20)

# 2. Evaluate the model
test_loss, test_accuracy = model.evaluate(test_data)

print('\n\nTest Loss {}, Test Accuracy {}'.format(test_loss, test_accuracy))
# Test Loss 0.4481576979160309, Test Accuracy 0.8030303120613098

# 3. Make predictions
predictions = model.predict(test_data)

# Show the first ten predictions alongside the labels of the first batch
for prediction, survived in zip(predictions[:10], list(test_data)[0][1][:10]):
    print("Predicted survival: {:.2%}".format(prediction[0]),
          " | Actual outcome: ",
          ("SURVIVED" if bool(survived) else "DIED"))
'''
22/22 [==============================] - 0s 4ms/step
Predicted survival: 90.08%  | Actual outcome:  SURVIVED
Predicted survival: 0.97%  | Actual outcome:  SURVIVED
Predicted survival: 0.98%  | Actual outcome:  DIED
Predicted survival: 10.06%  | Actual outcome:  SURVIVED
Predicted survival: 62.49%  | Actual outcome:  DIED
Predicted survival: 62.38%  | Actual outcome:  SURVIVED
Predicted survival: 11.18%  | Actual outcome:  SURVIVED
Predicted survival: 60.31%  | Actual outcome:  SURVIVED
Predicted survival: 9.72%  | Actual outcome:  DIED
Predicted survival: 21.93%  | Actual outcome:  DIED
'''

3. NumPy
3.1 Environment setup
import numpy as np
import tensorflow as tf

3.2 Loading the data
# Load the data from an .npz file
DATA_URL = 'https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz'

path = tf.keras.utils.get_file('mnist.npz', DATA_URL)
with np.load(path) as data:
    train_examples = data['x_train']
    train_labels = data['y_train']
    test_examples = data['x_test']
    test_labels = data['y_test']

# Create the datasets
train_dataset = tf.data.Dataset.from_tensor_slices((train_examples, train_labels))
test_dataset = tf.data.Dataset.from_tensor_slices((test_examples, test_labels))

3.3 Data preprocessing
# Shuffle and batch
BATCH_SIZE = 64
SHUFFLE_BUFFER_SIZE = 100

train_dataset = train_dataset.shuffle(SHUFFLE_BUFFER_SIZE).batch(BATCH_SIZE)
test_dataset = test_dataset.batch(BATCH_SIZE)

# Build and train the model
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10)
])

model.compile(optimizer=tf.keras.optimizers.RMSprop(),
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['sparse_categorical_accuracy'])

model.fit(train_dataset, epochs=10)

4. pandas.DataFrame
This section uses a small dataset provided by the Cleveland Clinic Foundation for Heart Disease. The CSV has a few hundred rows; each row describes a patient and each column describes an attribute. We use this information to predict whether a patient has heart disease, which is a binary classification problem.
4.1 Environment setup
!pip install tensorflow-gpu==2.0.0-rc1
import pandas as pd
import tensorflow as tf

4.2 Loading the data
# Download the dataset as a CSV file
csv_file = tf.keras.utils.get_file('heart.csv', 'https://storage.googleapis.com/applied-dl/heart.csv')

# Read the CSV file with pandas
df = pd.read_csv(csv_file)

# Use df.head() to preview the dataset and df.dtypes to check the data type of each column
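The comment above mentions df.head() and df.dtypes without showing them; a minimal sketch of that inspection step (the remark about 'thal' assumes the column still holds strings at this point, and the exact output is not reproduced here):

# Preview the raw DataFrame before any preprocessing
print(df.head())   # first five patient rows
print(df.dtypes)   # per-column dtypes; 'thal' is an object (string) column until it is converted below
print(df.shape)    # (rows, columns) -- a few hundred rows, as described above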
4.3 Data preprocessing
# Convert the categorical feature to discrete numeric codes
df['thal'] = pd.Categorical(df['thal'])
df['thal'] = df.thal.cat.codes

# Read the data with tf.data.Dataset.from_tensor_slices
target = df.pop('target')  # the label column
dataset = tf.data.Dataset.from_tensor_slices((df.values, target.values))

# Shuffle and batch
train_dataset = dataset.shuffle(len(df)).batch(1)

4.4 Building and training a model
def get_compiled_model():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(10, activation='relu'),
        tf.keras.layers.Dense(10, activation='relu'),
        tf.keras.layers.Dense(1, activation='sigmoid')
    ])

    model.compile(optimizer='adam',
                  loss='binary_crossentropy',
                  metrics=['accuracy'])
    return model

model = get_compiled_model()
model.fit(train_dataset, epochs=15)

5. TFRecord and tf.Example
Skipped for now.
6. Text
This section uses tf.data.TextLineDataset to load text files; it is typically used to build a dataset from text files in which each line of the original file is one example. Here we take three different English translations of the same work (Homer's Iliad) and train a model to identify the translator from a single line of text.
6.1 Environment setup
import tensorflow as tf

import tensorflow_datasets as tfds
import os

6.2 Loading the data
DIRECTORY_URL = 'https://storage.googleapis.com/download.tensorflow.org/data/illiad/'
FILE_NAMES = ['cowper.txt', 'derby.txt', 'butler.txt']

for name in FILE_NAMES:
    text_dir = tf.keras.utils.get_file(name, origin=DIRECTORY_URL+name)

parent_dir = os.path.dirname(text_dir)
parent_dir  # '/home/kbuilder/.keras/datasets'

6.3 Data preprocessing
Each example needs its own label, so use tf.data.Dataset.map to attach one to every example. This iterates over every example in the dataset and returns (example, label) pairs.
def labeler(example, index):
    return example, tf.cast(index, tf.int64)

labeled_data_sets = []

for i, file_name in enumerate(FILE_NAMES):
    lines_dataset = tf.data.TextLineDataset(os.path.join(parent_dir, file_name))
    labeled_dataset = lines_dataset.map(lambda ex: labeler(ex, i))
    labeled_data_sets.append(labeled_dataset)

Build the combined dataset:
BUFFER_SIZE = 50000
BATCH_SIZE = 64
TAKE_SIZE = 5000

all_labeled_data = labeled_data_sets[0]
for labeled_dataset in labeled_data_sets[1:]:
    all_labeled_data = all_labeled_data.concatenate(labeled_dataset)

all_labeled_data = all_labeled_data.shuffle(
    BUFFER_SIZE, reshuffle_each_iteration=False)

# Use tf.data.Dataset.take with print to see what the (example, label) pairs look like
# The numpy attribute shows each Tensor's value
for ex in all_labeled_data.take(5):
    print(ex)
'''
(<tf.Tensor: shape=(), dtype=string, numpy=b'To Ida; in his presence once arrived,'>, <tf.Tensor: shape=(), dtype=int64, numpy=0>)
(<tf.Tensor: shape=(), dtype=string, numpy=b"Such now appears th' o'er-ruling sov'reign will">, <tf.Tensor: shape=(), dtype=int64, numpy=1>)
(<tf.Tensor: shape=(), dtype=string, numpy=b'Them so prepared the King of men beheld'>, <tf.Tensor: shape=(), dtype=int64, numpy=0>)
(<tf.Tensor: shape=(), dtype=string, numpy=b'mourn you, but the eddies of Scamander shall bear you into the broad'>, <tf.Tensor: shape=(), dtype=int64, numpy=2>)
(<tf.Tensor: shape=(), dtype=string, numpy=b'there was no life left in him.'>, <tf.Tensor: shape=(), dtype=int64, numpy=2>)
'''

Encode the text lines as numbers:
# 1. Build a vocabulary
# 1.1 Iterate over each example's numpy value
# 1.2 Use tfds.features.text.Tokenizer to split it into tokens
tokenizer = tfds.features.text.Tokenizer()

# 1.3 Collect the tokens in a Python set to remove duplicates
vocabulary_set = set()
for text_tensor, _ in all_labeled_data:
    some_tokens = tokenizer.tokenize(text_tensor.numpy())
    vocabulary_set.update(some_tokens)

# 1.4 Get the size of the vocabulary
vocab_size = len(vocabulary_set)
vocab_size  # 17178

# 2. Encode the examples
# Passing a line of text to the encoder's encode method returns a list of integers
# 2.1 Build the encoder
encoder = tfds.features.text.TokenTextEncoder(vocabulary_set)

# 2.2 Run the encoder over the dataset
# Wrap the encoder in tf.py_function and pass it to the dataset's map method
# tf.py_function wraps a Python function so it can run as an op inside a TensorFlow graph
# tf.py_function(func, inp, Tout, name=None): func is a Python function that takes inp as arguments and returns outputs of type Tout
def encode(text_tensor, label):
    encoded_text = encoder.encode(text_tensor.numpy())
    return encoded_text, label

def encode_map_fn(text, label):
    encoded_text, label = tf.py_function(encode,
                                         inp=[text, label],
                                         Tout=(tf.int64, tf.int64))
    encoded_text.set_shape([None])
    label.set_shape([])
    return encoded_text, label

all_encoded_data = all_labeled_data.map(encode_map_fn)

Split the dataset into test and training sets, and batch them:
Use tf.data.Dataset.take and tf.data.Dataset.skip to build a smaller test dataset and a larger training dataset. Before being passed to the model, the dataset needs to be batched. The examples inside each batch must have the same size and shape, but the examples in the dataset are not all the same size (each line of text has a different number of words). So use tf.data.Dataset.padded_batch (instead of batch) to pad the examples to the same size.
train_data = all_encoded_data.skip(TAKE_SIZE).shuffle(BUFFER_SIZE)
train_data = train_data.padded_batch(BATCH_SIZE)

test_data = all_encoded_data.take(TAKE_SIZE)
test_data = test_data.padded_batch(BATCH_SIZE)

# Look at the test set
sample_text, sample_labels = next(iter(test_data))

sample_text[0], sample_labels[0]
'''
(<tf.Tensor: shape=(16,), dtype=int64, numpy=array([15746, 11433,  8394,  9006,   379,  3463, 17072,     0,     0,
           0,     0,     0,     0,     0,     0,     0])>,
 <tf.Tensor: shape=(), dtype=int64, numpy=0>)
'''

# test_data and train_data are no longer collections of (example, label) pairs but collections of batches
# Each batch is a pair of (many examples, many labels) represented as arrays

# Since a new token has been introduced by the encoding (the zero used for padding), the vocabulary size increases by one
vocab_size += 1

6.4 Building a model
model = tf.keras.Sequential()

# The Embedding layer converts the integer representations into dense vector embeddings
model.add(tf.keras.layers.Embedding(vocab_size, 64))

# The LSTM layer lets the model interpret words in the context of the words around them
model.add(tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)))

# One or more densely connected layers
# Edit the list in the `for` line to experiment with the layer sizes
for units in [64, 64]:
    model.add(tf.keras.layers.Dense(units, activation='relu'))

# Output layer; the first argument is the number of labels
model.add(tf.keras.layers.Dense(3, activation='softmax'))

# Compile the model
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

6.5 Training the model
model.fit(train_data, epochs=3, validation_data=test_data)

eval_loss, eval_acc = model.evaluate(test_data)

print('\nEval loss: {}, Eval accuracy: {}'.format(eval_loss, eval_acc))
'''
79/79 [==============================] - 1s 18ms/step - loss: 0.3794 - accuracy: 0.8246

Eval loss: 0.3794495761394501, Eval accuracy: 0.8245999813079834
'''