深度學習

1. 深度學習
- 1.1. 深度學習的知名模型
- 1.2. 深度學習的高速化
2. 深度學習運作原理
3. 實作範例

1. 深度學習

所謂深度學習是指

Figure 1: AI, Machine Learning與Deep Learning

在神經網路中我們曾經提及：

深度神經網路(Deep Neural Network, DNN)，顧名思義就是有很多層的神經網路。然而，幾層才算是多呢？一般來說有1-2個隱藏層的神經網絡就可以叫做多層，準確的說是(淺層)神經網絡(Shallow Neural Networks)。隨著隱藏層的增多，更深的神經網絡(一般來說超過3層)就都叫做深度神經網路¹。而那些以深度神經網路為模型的機器學習就是我們耳熟能詳的深度學習。

那麼幾層才算是夠深呢？實際上，「深度」只是一個商業概念，很多時候工業界把3層隱藏層也叫做「深度學習」，在機器學習領域的約定俗成是，名字中有深度(Deep)的網絡僅代表其有超過5-7層的隱藏層¹。

典型的深度學習如圖2，在此例中，輸入為一張手寫數字的影像，經由 4 層的深度學習模型後得知此數字為 4。

Figure 2: 典型的深度神經網路-1

圖3進一步說明網路模型中每一層的作用，可以將每一層網路視為對影像的特殊運算，如此一層一層逐一精煉(purified)，最後得到結果。

Figure 3: 典型的深度神經網路-2

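The importance of adding layers still lacks firm theoretical backing, but past research and experiments point to several observations.

In large visual recognition competitions such as ILSVRC, recognition performance has tended to improve as the top entries used deeper networks.
Adding depth can achieve the same effect with fewer network parameters. Stacking layers places activation functions such as ReLU between the convolution layers, which further increases the expressive power of the network: each activation function adds nonlinearity, and stacking nonlinear functions makes more complex representations possible.
Learning efficiency is another advantage of depth. Neurons in early convolution layers respond to simple shapes such as edges; as the layers go deeper, they respond to textures, object parts, and other increasingly complex features, level by level.
Take recognizing a "dog" as an example. A network with few layers would need its convolution layers to "understand" many features at once and also cope with variations caused by different shooting conditions; handling so much at once takes a lot of training time. A deeper network can break the problem down hierarchically so that each layer handles a simpler problem. For example, the first layer can learn only edges, which can be learned efficiently from a small amount of training data.
Depth lets information be passed along hierarchically. For example, the layer after the one that extracts edges can use the edge information to learn a higher-level problem (such as recognizing shapes).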
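
The point about reaching the same effect with fewer parameters can be checked with a little arithmetic. The sketch below is an illustration, not taken from the source: it assumes stride-1 convolutions with a single input and output channel and ignores biases, and the helper names `conv_weights` and `receptive_field` are hypothetical. It compares one 5x5 convolution with two stacked 3x3 convolutions, which cover the same 5x5 input region.

```python
def conv_weights(kernel_size: int, channels_in: int = 1, channels_out: int = 1) -> int:
    """Number of weights in one square convolution layer (biases ignored)."""
    return kernel_size * kernel_size * channels_in * channels_out


def receptive_field(kernel_sizes: list[int]) -> int:
    """Input width seen by one output value of a stack of stride-1 convolutions."""
    field = 1
    for k in kernel_sizes:
        field += k - 1
    return field


one_5x5 = conv_weights(5)
two_3x3 = 2 * conv_weights(3)

print(receptive_field([5]), one_5x5)     # 5 25
print(receptive_field([3, 3]), two_3x3)  # 5 18
```

Both designs see a 5x5 patch, but the stacked version uses 18 weights instead of 25 and gains an extra activation function between the two layers.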

1.1. 深度學習的知名模型

幾個知名的深度學習模型如下：

1.1.1. VGG

VGG 為由卷積層與池化層構成的基本 CNN。特色是含權重層（卷積層及全連接層）共 16-19 層，有時會稱為 VGG16² 或 VGG19³。VGG 由於結構非常簡單，應用性高，所以多數技術人員喜歡使用以 VGG 為最基礎的網路。

Figure 4: VGG-16

Figure 5: VGG-19

1.1.2. GoogLeNet

GoogLeNet⁴為2014 年 ILSVRC (ImageNet Large Scale Visual Recognition Competition)圖像分類競賽的冠軍得主，與 VGGNet（該年的亞軍）相比具有相對較低的錯誤率。GoogLeNet基本上與 CNN 相同，其特色是不僅會往垂直方向加深網路，也會往水平方向加深。GoogLeNet 往水平方向的做法稱為「Inception 結構」。

Figure 6: GoogLeNet

1.1.3. ResNet

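ResNet⁵ is a network developed by a team at Microsoft. Its distinguishing feature is a structure that allows far more layers than earlier networks. To solve the problem that training fails when too many layers are stacked, ResNet introduces a "skip connection" (also called a shortcut or bypass). A skip connection passes its input through unchanged and adds it to the output of the layers it bypasses, so during backpropagation the upstream gradient is also passed "directly" to the earlier layer. With this path in place, a meaningful gradient reaches the earlier layers without being shrunk (or blown up) along the way, which reduces the vanishing-gradient problem that previously appeared as layers were added.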

Figure 7: ResNet
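
To make the effect of the skip connection on gradients concrete, here is a deliberately simplified sketch. It is not ResNet itself: it assumes every layer is the scalar function y = w * x (plain) or y = x + w * x (with a skip connection), and the function names are hypothetical. The gradient of the output with respect to the input is then just a product over the layers.

```python
def plain_gradient(weights):
    """d(output)/d(input) for a chain of layers y = w * x."""
    grad = 1.0
    for w in weights:
        grad *= w
    return grad


def residual_gradient(weights):
    """Same chain, but each layer is y = x + w * x (skip connection added)."""
    grad = 1.0
    for w in weights:
        grad *= 1.0 + w  # the "+ 1" comes from the identity path
    return grad


weights = [0.01] * 20  # 20 layers with small weights

print(plain_gradient(weights))     # vanishingly small
print(residual_gradient(weights))  # stays on the order of 1
```

In the plain chain the gradient is multiplied by a small number at every layer and disappears; with the identity path each factor is close to 1, so the signal survives all 20 layers.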

1.1.4. ImageNet大賽

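As the figure below shows, the number of layers jumped from 22 in GoogLeNet (2014) to 152 in ResNet (2015), an increase of 130 layers. This result suggests that, as long as there is no overfitting, deeper networks tend to perform better⁶.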

Figure 8: ImageNet歷年冠軍

So, to improve a model, is it enough to keep adding layers and make the network deeper? The following result offers a hint:

Figure 9: DNN層數與誤差的實驗

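This result comes from a paper presented at the 2016 IEEE Conference on Computer Vision and Pattern Recognition⁷. The experiment shows that, for an ordinary deep network, adding more layers made the training error go up rather than down.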

1.2. 深度學習的高速化

由於大資料(big data)與大型網路的關係，使得深度學習必須進行大量運算，過去我們使用 CPU 來進行運算，如今多數深度學習的框架多支援 GPU，甚至支援以多個 GPU 與多台裝置進行分散式學習。GPU 原本是圖形專用處理器，可以快速處理平行運算，GPU 運算的目標是把其強大的效能運用在各種用途。比較 CPU 與 GPU 在 AlexNet 的學習，CPU 需花費 40 天以上，GPU 則可以在 6 天內完成。

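A GPU greatly speeds up deep learning computation, but once the network has many layers, training can still take days or weeks. Google's TensorFlow and Microsoft's CNTK were developed with distributed training in mind: 100 distributed GPUs can reach up to 56 times the speed of a single GPU, which means that training that originally took 7 days (168 hours) can finish in about 3 hours.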

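Besides the amount of computation, memory capacity and bus bandwidth can also become bottlenecks when speeding up deep learning. For memory capacity, the large number of weight parameters and the intermediate data must all be held in memory. For bus bandwidth, once the data flowing through the GPU (or CPU) bus exceeds a certain limit, the bus becomes the bottleneck, so it is best to reduce the number of bits of the data flowing through the network as much as possible.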

1.2.1. GPU v.s. CPU

Figure 10: CPU 與 GPU 在架構上的設計差異

如圖10，CPU 和 GPU 的差異起源於其相異的設計目標與應用場景， CPU的設計目的是處理各種不同的資料運算、邏輯判斷和中斷要求;而 GPU 的設計目的則是為了圖形運算，其優勢在於能快速對同類型的資料進行平行運算⁸。二者主要差異大致如下：

CPU 是由幾個每次可處理數個獨立「執行緒」(threads)的核心(core)所組成；GPU 則有數百個這樣的核心，同時可以處理上千個執行緒
CPU 主要是線性執行； GPU 則是個高度平行化的單元
CPU 的發展主要致力於最佳化系統的遲滯時間，讓系統能有迅速流暢的反應；GPU 的發展則是朝頻寬最佳化努力。在深度神經網路中，頻寬為主要的系統瓶頸
GPU 的 Level 1 cache 比 CPU 快且大，在深度神經網路中，大部份的資料都會再次被使用到

2. 深度學習運作原理

2.1. 機器學習的SOP

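1. Load and preprocess the data
2. Build the model
3. Train the model
4. Evaluate the model / revise the model (go back to 3.)
5. Deploy / apply the model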

2.2. 資料的分割

訓練模型時通常會將資料集分割為訓練集、測試集，有時還會再細分出驗證集。
其具體分割方式可以分為以下幾類：

2.2.1. 基本資料分割

以下這些方法用於基本的模型訓練與評估，適合初步模型測試或大規模資料。

2.2.1.1. Train/Test Split

部份文獻也稱為Holdout Method，就是我們平時最常見的資料分割方式，把資料集分為以下兩部份：

訓練集(Training set): 用於訓練模型
測試集(Test set): 用於評估模型的性能

一般的分割比例常見為：

80% 訓練，20% 測試
70% 訓練，30% 測試

適用情境：

大規模資料集（如上萬筆以上）
訓練成本較高時，因為只需要訓練一次

缺點：

結果依賴單次分割，可能因為某次隨機分割影響評估結果，導致高變異性（high variance）
無法充分利用資料，因為測試集的部分資料未參與訓練，而這部分測試資料可能對模型的訓練有幫助，例如包含重要特徵

範例程式

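from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # example data (an assumption); any X, y will do
# 80% training, 20% test; random_state fixes the shuffle for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)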

2.2.1.2. 訓練/驗證/測試分割（Train-Validation-Test Split）

這個方法的主要目的為了調整模型的超參數，避免測試集的資料洩漏，做法是將資料集分為以下三部份：

訓練集(Training set): 用於訓練模型
驗證集(Validation set): 用於調整模型的超參數
測試集(Test set): 用於評估模型的性能

常見比例：

80% 訓練 / 10% 驗證 / 10% 測試
70% 訓練 / 15% 驗證 / 15% 測試

適用情境：

需要調整超參數（如學習率、模型架構）
確保測試集未參與模型選擇過程，避免資訊洩漏

缺點：

需要更多資料，否則訓練集會變小
驗證結果可能受隨機分割影響

範例程式

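from sklearn.model_selection import train_test_split

# Step 1: hold out 20% of the data (X and y are assumed to be defined already)
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.2, random_state=42)
# Step 2: split the held-out 20% in half, giving 10% validation and 10% test
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)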
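
The two `test_size` values depend on the ratio you want. As a small sketch (the helper `two_stage_test_sizes` is hypothetical, not part of scikit-learn), the following computes them for any train/validation/test ratio, assuming the same two-call pattern as the example above.

```python
def two_stage_test_sizes(train: float, val: float, test: float) -> tuple[float, float]:
    """Return (first_test_size, second_test_size) for two train_test_split calls.

    The first call holds out val + test; the second call splits that
    held-out part into validation and test.
    """
    total = train + val + test
    first = (val + test) / total
    second = test / (val + test)
    return first, second


first, second = two_stage_test_sizes(0.70, 0.15, 0.15)
print(round(first, 2), round(second, 2))  # 0.3 0.5
```

For the 80/10/10 split the same function gives 0.2 and 0.5, which are the values used in the example above.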

2.2.2. 交叉驗證（Cross-Validation, CV）

多次切分資料，讓每個資料點都能當一次測試集，避免評估的不穩定性。

2.2.2.1. K-Fold Cross-Validation

K-Fold 交叉驗證的主要目的之一是評估不同模型或超參數組合的表現，進而選擇最佳的設定。完整的流程如下：

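1. Set aside the test set first
- The test set does not take part in K-Fold cross-validation, so it stays truly unseen data.
- This keeps the final evaluation fair and prevents data leakage.
2. Use K-Fold on the training set to tune hyperparameters
- Use K-Fold cross-validation to evaluate the different hyperparameter combinations and choose the best one.
- The test set is still untouched at this point; only cross-validation decides the hyperparameters.
3. Train the final model on the full training set (no more K-Fold)
- Once the best hyperparameters are found, K-Fold is no longer used; the final model is trained on the complete training set.
4. Do the final evaluation on the test set
- The test set was set aside in step 1 and must not take part in K-Fold cross-validation. Cross-validation is applied to the training set only, so that the estimate of generalization ability is not contaminated.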

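In practice, K-Fold divides the training set into K groups (folds). Each round uses one fold as the validation set and the remaining K-1 folds for training; this is repeated K times, and the average score is taken as the estimate of model performance.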
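
The splitting itself is simple enough to write out by hand. The following is a minimal sketch in plain Python, without shuffling; the function name `k_fold_indices` is hypothetical, and in practice scikit-learn's `KFold` does this job.

```python
def k_fold_indices(n_samples: int, k: int):
    """Yield (train_idx, val_idx) for k contiguous folds, without shuffling."""
    # Spread the remainder over the first folds so fold sizes differ by at most 1.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(10, 5))
for train_idx, val_idx in folds:
    print("train:", train_idx, "val:", val_idx)
```

Every sample lands in the validation fold exactly once, and a round never trains on the samples it validates on.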

至於如何利用驗證集找出最佳超參數或模型，可以參考以下三種方法：

手動調整

定義超參數組合（例如 max_depth 和 n_estimators）。
使用 K-Fold 交叉驗證來評估每組超參數的表現。
選擇具有最佳平均交叉驗證準確率的超參數組合。

from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import numpy as np

# Candidate hyperparameter combinations
param_grid = [
    {"n_estimators": 50, "max_depth": 5},
    {"n_estimators": 100, "max_depth": 5},
    {"n_estimators": 50, "max_depth": 10},
    {"n_estimators": 100, "max_depth": 10}
]

best_params = None
best_score = 0

kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Try every combination (X_train and y_train are NumPy arrays from the earlier split)
for params in param_grid:
    val_scores = []

    for train_idx, val_idx in kf.split(X_train):
        model = RandomForestClassifier(n_estimators=params["n_estimators"], max_depth=params["max_depth"])
        model.fit(X_train[train_idx], y_train[train_idx])  # train on K-1 folds
        y_val_pred = model.predict(X_train[val_idx])  # predict on the held-out fold
        val_acc = accuracy_score(y_train[val_idx], y_val_pred)  # accuracy on this fold
        val_scores.append(val_acc)

    avg_score = np.mean(val_scores)  # mean cross-validation accuracy

    print(f"Params: {params}, mean CV accuracy: {avg_score:.4f}")

    if avg_score > best_score:
        best_score = avg_score
        best_params = params

print(f"Best params: {best_params}, best CV accuracy: {best_score:.4f}")

在上述程式中，我們做了以下幾個步驟：

定義 4 組不同的超參數組合（n_estimators 和 max_depth）。
對每組超參數執行 K-Fold 交叉驗證，計算 5 次的平均準確率。
選擇擁有最高交叉驗證準確率的超參數組合作為最佳組合。

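Drawback: with many hyperparameter combinations (say 1000), testing them by hand like this is very time-consuming, so GridSearchCV or RandomizedSearchCV can be used to search for the best hyperparameters automatically.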

GridSearchCV

GridSearchCV 是 Scikit-Learn 提供的一個自動化調參工具，可以幫助我們快速找到最佳超參數組合。它會自動遍歷所有超參數組合，並使用 K-Fold 交叉驗證來評估每組超參數的表現，最後返回最佳超參數組合。

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Hyperparameter search space
param_grid = {
    "n_estimators": [50, 100, 150],
    "max_depth": [5, 10, 15]
}

# Random forest model
model = RandomForestClassifier()

# GridSearchCV with 5-fold cross-validation searches for the best hyperparameters
grid_search = GridSearchCV(model, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train, y_train)

# Print the best hyperparameters and the corresponding score
print(f"Best params: {grid_search.best_params_}")
print(f"Best CV accuracy: {grid_search.best_score_:.4f}")

在上述程式中，我們做了以下幾個步驟：

定義超參數範圍（n_estimators 和 max_depth）。
使用 GridSearchCV 進行 K-Fold 交叉驗證，自動測試所有超參數組合。
輸出最佳超參數和最高的交叉驗證分數。

優勢：

完全自動化，不需要手動測試每組超參數。
計算效能較高，可以使用 n_jobs=-1 同時運行多個超參數組合，提高速度。
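
To see how much work "trying every combination" means, the grid can be enumerated by hand. This is a sketch using only the standard library; it mirrors the `param_grid` and `cv=5` of the example above but does not call scikit-learn.

```python
from itertools import product

param_grid = {
    "n_estimators": [50, 100, 150],
    "max_depth": [5, 10, 15],
}
cv = 5

# Every combination of the listed values, as a list of dicts.
names = list(param_grid)
combinations = [dict(zip(names, values)) for values in product(*param_grid.values())]

print(len(combinations))       # 9 candidate settings
print(len(combinations) * cv)  # 45 model fits during cross-validation
```

The number of fits grows multiplicatively with every hyperparameter added to the grid, which is the cost that random search tries to avoid.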

RandomizedSearchCV

RandomizedSearchCV 與 GridSearchCV 類似，但是它不是遍歷所有超參數組合，而是隨機選取一部分超參數組合進行測試，這樣可以節省時間，尤其是當超參數範圍很大時。

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

# Hyperparameter ranges, given as random distributions
param_dist = {
    "n_estimators": randint(50, 200),
    "max_depth": randint(5, 20)
}

# Random forest model
model = RandomForestClassifier()

# RandomizedSearchCV samples n_iter combinations instead of trying them all
random_search = RandomizedSearchCV(model, param_dist, cv=5, n_iter=10, scoring='accuracy', n_jobs=-1, random_state=42)
random_search.fit(X_train, y_train)

# Print the best hyperparameters and the corresponding score
print(f"Best params: {random_search.best_params_}")
print(f"Best CV accuracy: {random_search.best_score_:.4f}")

在上述程式中，我們做了以下幾個步驟：

使用隨機分布設定超參數範圍（randint(50, 200)）。
n_iter=10 表示只隨機測試 10 組超參數組合，大幅減少計算時間。
使用 K-Fold 交叉驗證來評估每組超參數的表現。

優勢：

比 Grid Search 快速許多，適合超參數組合很多時使用。
可以涵蓋更廣的搜尋範圍，避免限制於固定的超參數組合。

2.2.2.2. LOOCV (Leave-One-Out Cross Validation)

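LOOCV is a special case of K-Fold cross-validation in which K equals the number of samples N. The procedure is simple, but the training burden is heavy and time-consuming. In each round the test set contains a single sample and the training set contains the remaining N-1 samples; training is repeated N times, and the average of all test results is taken as the estimate of model performance. This kind of split:

guarantees that every sample is used for testing exactly once
makes full use of the data, since every sample gets a turn as the test set
gives a result that does not depend on a particular random split, because there is only one way to leave each sample out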

適用情境：

小型資料集（如 N < 100），因為可以最大化利用訓練資料
不適合大規模資料集，因為計算成本過高。

缺點

訓練成本極高：因為有 N 筆資料，就需要訓練 N 次，如果 N = 10000，則要訓練 10000 次，時間成本極高。
對離群值（outliers）極度敏感：如果某個樣本是異常值，這個異常值在某次迭代會成為測試集，會嚴重影響模型表現。

範例程式:

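import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut

model = LogisticRegression(max_iter=1000)  # assumption: any scikit-learn classifier works here
loo = LeaveOneOut()
loo_accuracies = []

# One round per sample: train on N-1 samples, test on the one left out
for train_idx, test_idx in loo.split(X):
    model.fit(X[train_idx], y[train_idx])
    acc = model.score(X[test_idx], y[test_idx])  # 1.0 or 0.0 for a single sample
    loo_accuracies.append(acc)

print("Mean accuracy:", np.mean(loo_accuracies))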

2.2.2.3. Stratified K-fold Cross-Validation

Stratified K-fold Cross-Validation 是 K-fold Cross-Validation 的一種變形，主要是針對分類問題而設計的。在 K-fold Cross-Validation 中，每次分割都是隨機的，可能會導致某些類別的樣本在訓練集和測試集中的比例不均衡，進而影響模型的性能評估。Stratified K-fold Cross-Validation 通過保持每個類別在每個分割中的比例來解決這個問題，確保每個類別在訓練集和測試集中的比例相同。

適用場景

類別不均衡問題（例如詐欺檢測、罕見疾病分類）
確保所有類別都能在每個 fold 中出現，避免某些 fold 缺少少數類別
小型資料集時特別重要，避免某些類別完全未出現在特定 fold

範例程式

import numpy as np
from collections import Counter
from sklearn.model_selection import StratifiedKFold

X = np.arange(20).reshape(-1, 1)  # 20 samples
# 15 samples of class 0 and 5 of class 1, so each of the 5 folds can hold one class-1 sample
y = np.array([0] * 15 + [1] * 5)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    print(f"Fold {fold+1} - test set class distribution:", Counter(y[test_idx]))

2.2.3. 時間序列交叉驗證（Time Series Split）

Time Series Split 是時序資料（如股市、氣象預測）的專用分割方式，確保未來的資料不會洩漏到訓練中。

適用場景

股票預測
氣象預測
銷售趨勢分析
任何未來資料不可提前出現的應用

範例程式

from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
# Each split trains on earlier samples and validates on the samples that follow them
for train_idx, val_idx in tscv.split(X):
    model.fit(X[train_idx], y[train_idx])
    acc = model.score(X[val_idx], y[val_idx])

Schematic of the splits

Round 1: train on 1 2 3 4 -> test on 5
Round 2: train on 1 2 3 4 5 -> test on 6
Round 3: train on 1 2 3 4 5 6 -> test on 7
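
The expanding-window pattern in the schematic can be reproduced in a few lines of plain Python. This is a sketch; `expanding_window_splits` is a hypothetical helper, and its fold sizes are chosen to match the schematic rather than the defaults of `TimeSeriesSplit`.

```python
def expanding_window_splits(n_samples: int, initial_train: int, test_size: int = 1):
    """Yield (train_idx, test_idx) where training data always precedes test data."""
    end = initial_train
    while end + test_size <= n_samples:
        yield list(range(end)), list(range(end, end + test_size))
        end += test_size


splits = list(expanding_window_splits(n_samples=7, initial_train=4))

# Time steps are numbered from 1 in the schematic and from 0 in the code.
for train_idx, test_idx in splits:
    print("train:", [i + 1 for i in train_idx], "-> test:", [i + 1 for i in test_idx])
```

Because every training index is smaller than every test index, no future observation can leak into training.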

2.2.4. 自助抽樣（Bootstrap Sampling）

一種隨機抽樣方法，主要用於統計學與集成學習（如隨機森林）。以 抽完放回 的方式從原始資料集中隨機抽取樣本，形成新的訓練集。

適用場景

當資料量較小時，希望模擬多次抽樣的效果。
隨機森林（Random Forest）、統計學中的區間估計

優點

可以充分利用資料，因為每次抽樣都是隨機的，每次抽樣都有可能抽到重要樣本
可以模擬多次抽樣的效果，得到更穩定的結果

缺點

會產生重複樣本，導致模型訓練時的方差增加
計算成本高，因為要重複抽樣多次

範例程式1

from sklearn.utils import resample

# Draw a bootstrap sample: same size as the training set, sampled with replacement
X_train_bootstrap, y_train_bootstrap = resample(X_train, y_train, replace=True, random_state=42)
model.fit(X_train_bootstrap, y_train_bootstrap)

範例程式2: 隨機森林（Random Forest）

from sklearn.ensemble import RandomForestClassifier

# bootstrap=True: each tree is trained on its own bootstrap sample
model = RandomForestClassifier(n_estimators=100, bootstrap=True)
model.fit(X_train, y_train)
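
What sampling with replacement does to a dataset is easy to observe directly. The sketch below uses only the standard library; `bootstrap_indices` is a hypothetical helper, and the seed and sample size are arbitrary choices for illustration.

```python
import random


def bootstrap_indices(n_samples: int, rng: random.Random) -> list[int]:
    """Draw n_samples indices uniformly at random, with replacement."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


rng = random.Random(42)
n = 1000
sample = bootstrap_indices(n, rng)
unique_fraction = len(set(sample)) / n

print(len(sample))                # same size as the original data
print(round(unique_fraction, 2))  # typically close to 1 - 1/e, about 0.63
```

Roughly a third of the original samples are left out of any one bootstrap sample, while others appear more than once, which is the duplication mentioned in the drawbacks above.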

2.2.5. 交叉驗證的效能評估指標

在機器學習中，我們通常使用一些指標來評估模型的性能，這些指標可以幫助我們了解模型的表現，並選擇最佳模型。以下是一些常見的效能評估指標：

2.2.5.1. 分類問題

準確率（Accuracy）：正確預測的樣本數除以總樣本數。
精確率（Precision）：正確預測為正例的樣本數除以所有預測為正例的樣本數。
召回率（Recall）：正確預測為正例的樣本數除以所有真正為正例的樣本數。
F1 Score：精確率和召回率的調和平均數。
ROC 曲線（Receiver Operating Characteristic Curve）：以假陽性率（False Positive Rate）為橫軸，真陽性率（True Positive Rate）為縱軸繪製的曲線。
AUC（Area Under Curve）：ROC 曲線下的面積，用來衡量模型的性能。
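
The first four definitions translate directly into code. The following is a minimal sketch for binary labels (1 = positive, 0 = negative); the function name and the toy labels are made up for illustration, and scikit-learn's `sklearn.metrics` module provides production versions.

```python
def classification_metrics(y_true: list[int], y_pred: list[int]) -> dict[str, float]:
    """Accuracy, precision, recall and F1 for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
scores = classification_metrics(y_true, y_pred)
print(scores)
```

Here 3 of the 4 positives are found (recall 0.75) and 3 of the 4 positive predictions are correct (precision 0.75), while accuracy is 0.8 because the negatives dominate.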

2.2.5.2. 回歸問題

平均絕對誤差（Mean Absolute Error, MAE）：預測值和真實值之間的絕對誤差的平均值。
均方誤差（Mean Squared Error, MSE）：預測值和真實值之間的誤差的平方的平均值。
均方根誤差（Root Mean Squared Error, RMSE）：均方誤差的平方根。
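R-squared (R², also called the coefficient of determination): the proportion of the total variance explained by the model. It is at most 1, and closer to 1 means a better fit; on held-out data it can be negative when the model does worse than always predicting the mean.
MAPE (Mean Absolute Percentage Error): the mean of the absolute percentage errors between predictions and true values.
Residual plot: used to check whether the residuals show a pattern, such as nonlinearity or heteroscedasticity.
Q-Q plot: used to check whether the residuals follow a normal distribution.
Prediction interval: used to express the uncertainty of a prediction, usually as a range in which the true value is expected to fall.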
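
The numeric regression metrics can be computed in a few lines. This is a sketch with made-up values; the function name `regression_metrics` is hypothetical.

```python
import math


def regression_metrics(y_true: list[float], y_pred: list[float]) -> dict[str, float]:
    """MAE, MSE, RMSE and R-squared for paired true and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)

    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                # unexplained variation
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total variation
    r2 = 1.0 - ss_res / ss_tot
    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2}


y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]
result = regression_metrics(y_true, y_pred)
print(result)
```

With two errors of size 1 out of four predictions, MAE and MSE are both 0.5, and R² is 1 - 2/20 = 0.9.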

2.3. Layer, 損失函數與優化器

前節深度學習中的每一「層」(layer)如何運作，取決於儲存於該層的權重(weight)，而權重是由多個數字組成。從技術層面來看，layer 是由各個權重參數(parameters)來和輸入的資料(如圖11中的X)進行運算以執行資料轉換的工作(如圖11)。而所謂的學習，指的就是幫助神經網路的每一層找出適當的權重值，讓神經網路可以將輸入的訓練資料經由與權重的運作推導出接近標準答案的運算結果(即圖11中的預測 Y)。

然而，這在實際運作上是十分困難的，因為一個深度神經網路可以包含數千萬個權重，此外，其中一個權重被改變後，往往會影響其他權重的運作。

Figure 11: nn 中 layer 的 parameter

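To improve the network's performance (its prediction accuracy), we need to keep track of how far the current output (the prediction) is from the true target Y. This is the job of the network's loss function (also called the objective function or the cost function⁹), as shown in Figure 12. The loss function takes the network's prediction and the true target and computes a loss score (also called a distance score), which is the measure of how well the network did on each learning step.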

Figure 12: 損失函數

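The basic work of deep learning is to use the loss score as a feedback signal to adjust the weights a little at a time, lowering the loss score step by step, with the ultimate goal of minimizing the loss function. This adjustment is carried out by the optimizer. The optimizer relies on the backpropagation algorithm, the core algorithm of deep learning, to work out how each weight should be adjusted.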

Figure 13: 優化器

事實上，同樣的流程我們也曾在迴歸裡看過，在找到一條理想的迴歸方程式時，我們也是先隨便找一條，然後用loss function去評估這條方程式的優劣，再「求切線斜率」的方式來修正方程式的係數。差別只在於：在迴歸時我們要修正的係數只有一、兩個，而在深度學習中，我們要同時修正成千上萬個權重。

那麼，在最初一次的學習，權重的值是如何設定的呢？可以先全數設為零，但更常用的做法是隨機指定，隨著多次學習後，權重會逐步往正確的方向調整，損失分數也會慢慢降低。
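
The loop of "random start, measure the loss, nudge the weight" can be shown on the smallest possible model. The sketch below is an illustration under simplifying assumptions: a single weight w in y = w * x, a mean-squared-error loss, and a hand-written gradient instead of backpropagation. The data, seed, and learning rate are arbitrary choices.

```python
import random


def mse_loss(w: float, xs: list[float], ys: list[float]) -> float:
    """Mean squared error of the model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def mse_gradient(w: float, xs: list[float], ys: list[float]) -> float:
    """Derivative of the loss with respect to w (the slope of the tangent)."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # generated by y = 2x, so the ideal weight is 2

rng = random.Random(0)
w = rng.uniform(-1.0, 1.0)  # random initial weight
learning_rate = 0.05
initial_loss = mse_loss(w, xs, ys)

for step in range(100):
    w -= learning_rate * mse_gradient(w, xs, ys)  # move against the slope

final_loss = mse_loss(w, xs, ys)
print(w, initial_loss, final_loss)
```

A deep network does the same thing, except that the gradient is computed for every weight at once and the update rule is supplied by the chosen optimizer.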

我們再複習一下神經網路這章裡的文字：

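Yes, just as you would when facing an unfamiliar multiple-choice question in an exam, the neural network decides to do the same: throw some arbitrary values into the matrices as the first batch of parameters. In fact, we already played with this strategy in "Linear Regression: Predicting Height from Age / The Power of Randomness", where we also closed our eyes and picked an arbitrary set when looking for the best parameters of the equation. No matter how many parameters the network has, once all of them have been given random initial values, the whole neural network can run. Well... at least it can follow the forward-propagation procedure and output its first prediction. See, we have already taken a big step toward perfect artificial intelligence -_-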

接下來的流程其實和迴歸有點類似，我們評估預測結果的品質，然後回頭修正參數，只是這次的工程有點浩大，我們要修正所有的參數，這個回頭修正所有參數的過程稱為反向傳播(backward propagation)。

3. 實作範例

3.1. 二元分類：IMDB

自 IMDB 資料集中取得 50000 個正/負評論，各 25000 個，該資料集已內建於 Keras 中，且資料已先預處理，電影評論內容為由單字構成的 list 結構，例如，若評論內容為

In a Wonderful morning...

其 list 結構可能為

(8, 3, 386, 1969...)

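That is, each word is assigned an index according to how often it appears, and smaller indices correspond to more common words. (For papers related to IMDb, see Sentiment Analysis on IMDb / paperswithcode.)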

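from keras.datasets import imdb

# Keep only the 10,000 most frequent words
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)
print(train_data[0])    # first review, as a list of word indices
print(train_labels[0])  # its label: 1 = positive, 0 = negative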

[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]
1

如上為第一筆評論的單字代號與評論結果，若要將原始資料的單字代號還原，其程式碼如下：

 1: # word_index is a dictionary mapping words to an integer index
 2: word_index = imdb.get_word_index()
 3: print("字典中key為this對應的value:",word_index['this'])
 4: # We reverse it, mapping integer indices to words
 5: reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])
 6: print("反轉字典中key為11所對應到的value:",reverse_word_index[11])
 7: print("反轉字典中key為1所對應到的value:",reverse_word_index[1])
 8: print("反轉字典中key為2所對應到的value:",reverse_word_index[2])
 9: # We decode the review; note that our indices were offset by 3
10: # because 0, 1 and 2 are reserved indices for "padding", "start of sequence", and "unknown".
11: decoded_review = ' '.join([reverse_word_index.get(i - 3, '?') for i in train_data[0]])
12: print(decoded_review)

字典中key為this對應的value: 11
反轉字典中key為11所對應到的value: this
反轉字典中key為1所對應到的value: the
反轉字典中key為2所對應到的value: and
? this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert ? is an amazing actor and now the same being director ? father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for ? and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also ? to the two little boy's that played the ? of norman and paul they were just brilliant children are often left out of the ? list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have done don't you think the whole story was so lovely because it was true and was someone's life after all that was shared with us all

上述程式中第2行主要負責取得單字(key)的對應數字(value)的字典，再藉由第5行將(key:value)轉換為(value:key)，最後第11行將字典中的單字回復至原始評論，程式中(i-3)的原因是imdb.load_data已預留了第 0~2 個位置做特殊用途。

3.1.1. 準備資料

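Because the train_data and test_data loaded from IMDB are lists, they must be converted to tensors before being fed to the neural network. There are two ways to do this:

Pad each sub-list so that all of them have the same length, then reshape
Multi-hot encode each sub-list (often loosely called one-hot encoding): set position i to 1 if word index i occurs in the review. The code is as follows: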

import numpy as np

def vectorize_sequences(sequences, dimension=10000):
    # Create an all-zero matrix of shape (len(sequences), dimension)
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.  # set specific indices of results[i] to 1s
    return results

print("====train_data[0]======")
print(train_data[0])

# Our vectorized training data
x_train = vectorize_sequences(train_data)
# Our vectorized test data
x_test = vectorize_sequences(test_data)
print("====x_data[0]======")
print(x_train[0])

# Convert the labels to float arrays as well
y_train = np.asarray(train_labels).astype('float32')
y_test = np.asarray(test_labels).astype('float32')
print("====y_data[0]======")
print(y_train[0])

====train_data[0]======
[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]
====x_data[0]======
[0. 1. 1. ... 0. 0. 0.]
====y_data[0]======
1.0
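
The first option (padding) is not shown in the source. As a minimal sketch of the idea, the hypothetical helper below truncates or left-pads every review to a fixed length, using 0 as the padding value because index 0 is reserved for padding in this dataset. Keras ships its own padding utility; this version only illustrates what such a step does.

```python
def pad_sequences_simple(sequences, maxlen, pad_value=0):
    """Truncate or left-pad every sequence to exactly maxlen items."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep at most the last maxlen items
        padded.append([pad_value] * (maxlen - len(seq)) + seq)
    return padded


reviews = [[1, 14, 22], [1, 8, 3, 386, 1969], [1, 4]]
padded = pad_sequences_simple(reviews, maxlen=4)
print(padded)  # [[0, 1, 14, 22], [8, 3, 386, 1969], [0, 0, 1, 4]]
```

After padding, all rows have the same length, so the list of lists can be turned into a rectangular integer tensor.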

3.1.2. 建立神經網路

要建構一個 Dense 層堆疊架構的神經網路，要考慮兩個關鍵：

要用多少層？
每一層要有多少神經元？

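This example uses two intermediate layers and one output layer, as shown in Figure 14. In a typical neural network, the layers between the input layer and the output layer are conventionally called hidden layers; here, however, the first layer added in Keras receives the input and also behaves as a hidden layer. The hidden layers in Figure 14 use relu as the activation function, and the output layer uses a sigmoid activation to output a probability.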

Figure 14: IMDB model 架構

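Because the input data are vectors and the labels are scalars (1 or 0), a stack of fully connected (Dense) layers with relu activations suits this problem: Dense(16, activation='relu'). Here 16 is the number of neurons in the layer (which can also be seen as the width of the layer). A typical way to write it is: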

# 加入Dense隱藏層，該層有16個神經元
model.add(layers.Dense(16, activation='relu'))

擁有 16 個神經單元表示權重矩陣 W 的 shape 為(input_dimension, 16)，在 W 和 input 做內積後，input 資料會被映射到 16 維的空間上，最後加上 b、套用 relu 運算來產生輸出值。每一層的神經元數越多，可以讓神經網路學習更複雜的資料表示法，但也使計算成本更高。
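
The computation described above, output = relu(dot(input, W) + b), can be written out by hand. The sketch below uses toy sizes (input_dim = 3, units = 2) instead of 10000 and 16, and the weights are arbitrary numbers chosen for illustration; `dense_relu` is a hypothetical name, not a Keras function.

```python
def dense_relu(inputs, weights, biases):
    """output = relu(dot(inputs, W) + b), with W of shape (input_dim, units)."""
    units = len(biases)
    outputs = []
    for j in range(units):
        z = sum(x * weights[i][j] for i, x in enumerate(inputs)) + biases[j]
        outputs.append(max(0.0, z))  # relu: negative values become 0
    return outputs


W = [[0.5, -1.0],
     [0.25, 0.5],
     [-0.5, 1.0]]   # shape (3, 2): one column per neuron
b = [1.0, -4.0]
x = [1.0, 2.0, 3.0]

print(dense_relu(x, W, b))  # [0.5, 0.0]
```

The 3-dimensional input is mapped to a 2-dimensional output, one value per neuron; the second neuron's pre-activation is negative, so relu outputs 0 for it.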

Figure 15: ReLU 函數圖

3.1.3. 為什麼要加入Activation Function

為何要有 relu 等啟動函數？原因之一是這類函數為非線性函數(如圖18)，回顧神經網路中的「學測成績預測模型」，像圖16的模型，我們也只是在解一個如\(f(x)=x_1*w_1+x_2*w_2+x_3*w_3+...+x_7*w_7\)這樣的函式問題。

Figure 16: 學測成績預測模型#2

就算我們把模型2進化為模型3(如圖17)，本質上也仍只是一層，再多的層數也能合併為一層，此類模型並無助於複雜的學習。

Figure 17: 學測成績預測模型#3

以圖17為例，最後對學測成績\(\hat{y}\)的預測為：
\[ \hat{y}=y_1w_8+y_2w_9+y_3w_{10} \]
其中

\begin{eqnarray} y_1 &=& x_1w_{11} + x_2w_{21} + x_3w_{31} \\ y_2 &=& x_3w_{32} + x_4w_{42} + x_5w_{52} + x_6w_{62} \\ y_3 &=& x_5w_{53} + x_6w_{63} + x_7w_{73} \end{eqnarray}

如果我們稍微整理一下上面這個看起來像兩層的模型：

\begin{equation*} \begin{split} \hat{y} =& y_1w_8+y_2w_9+y_3w_{10} \\ =& (x_1w_{11} + x_2w_{21} + x_3w_{31})w_8 \\ &+ (x_3w_{32} + x_4w_{42} + x_5w_{52} + x_6w_{62})w_9 \\ &+ (x_5w_{53} + x_6w_{63} + x_7w_{73})w_{10} \\ =& x_1w_{11}w_8 + x_2w_{21}w_8 + x_3w_{31}w_8 \\ &+ x_3w_{32}w_9 + x_4w_{42}w_9 + x_5w_{52}w_9 + x_6w_{62}w_9 \\ &+ x_5w_{53}w_{10} + x_6w_{63}w_{10} + x_7w_{73}w_{10} \\ \end{split} \end{equation*}

最後就會發現，不管它看起來像是幾層，最後都能整理成一層的模樣:

\begin{equation*} \begin{split} \hat{y} =& x_1\times(w_{11}w_8) \\ &+ x_2\times(w_{21}w_8) \\ &+ x_3\times(w_{31}w_8+w_{32}w_9) \\ &+ x_4\times(w_{42}w_9) \\ &+ x_5\times(w_{52}w_9 + w_{53}w_{10}) \\ &+ x_6\times(w_{62}w_9 + w_{63}w_{10})\\ &+ x_7\times(w_{73}w_{10}) \end{split} \end{equation*}

結果就是跟底下的方程式一樣
\(f(x)=x_1*w_1+x_2*w_2+x_3*w_3+...+x_7*w_7\)
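
The algebra above can be double-checked numerically. In the sketch below the input and weight values are arbitrary numbers chosen for illustration; the variable names follow the symbols in the derivation.

```python
# Arbitrary example values (assumptions for illustration only).
x = {1: 1.0, 2: 2.0, 3: 3.0, 4: 4.0, 5: 5.0, 6: 6.0, 7: 7.0}
w11, w21, w31 = 0.1, 0.2, 0.3
w32, w42, w52, w62 = 0.4, 0.5, 0.6, 0.7
w53, w63, w73 = 0.8, 0.9, 1.0
w8, w9, w10 = 2.0, 3.0, 4.0

# "Two-layer" form: hidden values y1..y3, then the output.
y1 = x[1] * w11 + x[2] * w21 + x[3] * w31
y2 = x[3] * w32 + x[4] * w42 + x[5] * w52 + x[6] * w62
y3 = x[5] * w53 + x[6] * w63 + x[7] * w73
two_layer = y1 * w8 + y2 * w9 + y3 * w10

# Collapsed form: one combined weight per input.
single_weights = {
    1: w11 * w8,
    2: w21 * w8,
    3: w31 * w8 + w32 * w9,
    4: w42 * w9,
    5: w52 * w9 + w53 * w10,
    6: w62 * w9 + w63 * w10,
    7: w73 * w10,
}
one_layer = sum(x[i] * single_weights[i] for i in x)

print(abs(two_layer - one_layer) < 1e-9)  # True
```

Whatever values are chosen, the two forms agree up to floating-point rounding, which is why stacking purely linear layers adds no expressive power.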

為了有效讓模型更加複雜，此處可以在模型中加入非線性轉換，如圖18中的ReLU激勵函數，其結果如圖19所示。

Figure 18: ReLU 函數圖

Figure 19: 學測成績預測模型#4

3.1.4. 程式實作

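The implementation of Figure 14 is shown below, using the simplest kind of NN (neural network) as the example. The core data structure of Keras is the model, and the most commonly used one is the Sequential model. With .add(), the network can be stacked one layer at a time. For each layer, we only need to set its size (units) and its activation function.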

from keras import models
from keras import layers

model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/keras/src/layers/core/dense.py:85: UserWarning: Do not pass an `input_shape`/`input_dim` argument to a layer. When using Sequential models, prefer using an `Input(shape)` object as the first layer in the model instead.
  super().__init__(activity_regularizer=activity_regularizer, **kwargs)

建好 model 後，要選擇一個損失函數和一個優化器，由於要處理的是二元分類問題，所以最好用 binary_crossentropy 損失函數，因為 crossentropy 主要就是用來測量機率分佈之間的距離(差異)。其實作如下：

model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])
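
For reference, the quantity behind the name binary_crossentropy can be computed by hand. The sketch below implements the standard formula, the mean of -[y log(p) + (1 - y) log(1 - p)], in plain Python; the clipping constant `eps` and the toy predictions are assumptions for illustration, not values taken from Keras.

```python
import math


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


labels = [1, 0, 1, 0]
mostly_right = [0.9, 0.1, 0.8, 0.2]
mostly_wrong = [0.1, 0.9, 0.2, 0.8]

loss_right = binary_crossentropy(labels, mostly_right)
loss_wrong = binary_crossentropy(labels, mostly_wrong)
print(loss_right, loss_wrong)
```

Predicted probabilities close to the true labels give a small loss, while confident wrong predictions are penalized heavily, which is what makes this loss a good fit for a sigmoid output.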

之所以能將 optimizer 和 loss function 以字串方式經由參數傳給 compile()，這是因為 rmsprop、binary_crossentropy 和 accuracy 均已事先在 Keras 套件中定義好了，若是要進一步自訂參數(如自訂學習率)，做法如下：

# Set the learning rate explicitly
from keras import optimizers

model.compile(optimizer=optimizers.RMSprop(learning_rate=0.001),
              loss='binary_crossentropy',
              metrics=['accuracy'])

# Pass the loss and metric as function objects instead of strings
from keras import losses
from keras import metrics

model.compile(optimizer=optimizers.RMSprop(learning_rate=0.001),
              loss=losses.binary_crossentropy,
              metrics=[metrics.binary_accuracy])

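If you are using a Mac with an M1/M2 chip, the messages above may appear. They do not affect the results, but you can refer to the related article on Stack Overflow to get rid of these annoying messages.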

3.1.5. 驗證神經網路的 model

為了在訓練期間監控 model 對新資料的準確度，可以從原始訓練資料中分離出 10000 個樣本來建立驗證資料集。

x_val = x_train[:10000]  # first 10,000 samples form the validation set
partial_x_train = x_train[10000:]  # the rest form the training set

y_val = y_train[:10000]
partial_y_train = y_train[10000:]

Next, fit() trains the model for 20 epochs (that is, 20 full passes over all training samples in partial_x_train and partial_y_train), using mini-batches of 512 samples (batch_size):

history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=20,
                    batch_size=512,
                    validation_data=(x_val, y_val))

Epoch 1/20
30/30 ━━━━━━━━━━━━━━━━━━━━ 2s 34ms/step - binary_accuracy: 0.6895 - loss: 0.5987 - val_binary_accuracy: 0.8637 - val_loss: 0.3995
...略...
Epoch 20/20
30/30 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - binary_accuracy: 0.9995 - loss: 0.0107 - val_binary_accuracy: 0.8726 - val_loss: 0.5637

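model.fit() returns a History object whose history attribute is a dictionary of data recorded during training. The dictionary has 4 entries (binary_accuracy, loss, val_binary_accuracy, val_loss), which are the metrics monitored during training and validation.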

print(history.history)
print("binary_accuracy:",history.history['binary_accuracy'])
print("loss:",history.history['loss'])
print("val_binary_accuracy:",history.history['val_binary_accuracy'])
print("val_loss:",history.history['val_loss'])

{'binary_accuracy': [0.7724000215530396, 0.8913999795913696, 0.9195333123207092, 0.9340000152587891, 0.946066677570343, 0.9564666748046875, 0.9628000259399414, 0.966533362865448, 0.974133312702179, 0.9778666496276855, 0.9836000204086304, 0.9865333437919617, 0.9886000156402588, 0.9922000169754028, 0.9932000041007996, 0.9955999851226807, 0.9947333335876465, 0.9983999729156494, 0.997866690158844, 0.9980000257492065], 'loss': [0.523308277130127, 0.32568830251693726, 0.2428768277168274, 0.195417582988739, 0.16299770772457123, 0.13810740411281586, 0.12029378116130829, 0.10577400773763657, 0.08585202693939209, 0.07604678720235825, 0.0627065971493721, 0.05464218556880951, 0.04679805412888527, 0.03932145610451698, 0.033363644033670425, 0.02628222107887268, 0.026078475639224052, 0.018040597438812256, 0.017737768590450287, 0.014618363231420517], 'val_binary_accuracy': [0.8636999726295471, 0.8859000205993652, 0.8891000151634216, 0.876800000667572, 0.8762000203132629, 0.8863999843597412, 0.8859999775886536, 0.8823999762535095, 0.8826000094413757, 0.8780999779701233, 0.8794999718666077, 0.8788999915122986, 0.878000020980835, 0.8666999936103821, 0.8751999735832214, 0.8748999834060669, 0.8738999962806702, 0.8673999905586243, 0.8712999820709229, 0.8726000189781189], 'val_loss': [0.39949268102645874, 0.3119995892047882, 0.28234928846359253, 0.3045506775379181, 0.30820411443710327, 0.2834608554840088, 0.2936294972896576, 0.3101825714111328, 0.3229252099990845, 0.34479930996894836, 0.36042720079421997, 0.3789125978946686, 0.4012978971004486, 0.46704238653182983, 0.44717246294021606, 0.47011518478393555, 0.4929317533969879, 0.5508306622505188, 0.5428465604782104, 0.563687264919281]}
binary_accuracy: [0.7724000215530396, 0.8913999795913696, 0.9195333123207092, 0.9340000152587891, 0.946066677570343, 0.9564666748046875, 0.9628000259399414, 0.966533362865448, 0.974133312702179, 0.9778666496276855, 0.9836000204086304, 0.9865333437919617, 0.9886000156402588, 0.9922000169754028, 0.9932000041007996, 0.9955999851226807, 0.9947333335876465, 0.9983999729156494, 0.997866690158844, 0.9980000257492065]
loss: [0.523308277130127, 0.32568830251693726, 0.2428768277168274, 0.195417582988739, 0.16299770772457123, 0.13810740411281586, 0.12029378116130829, 0.10577400773763657, 0.08585202693939209, 0.07604678720235825, 0.0627065971493721, 0.05464218556880951, 0.04679805412888527, 0.03932145610451698, 0.033363644033670425, 0.02628222107887268, 0.026078475639224052, 0.018040597438812256, 0.017737768590450287, 0.014618363231420517]
val_binary_accuracy: [0.8636999726295471, 0.8859000205993652, 0.8891000151634216, 0.876800000667572, 0.8762000203132629, 0.8863999843597412, 0.8859999775886536, 0.8823999762535095, 0.8826000094413757, 0.8780999779701233, 0.8794999718666077, 0.8788999915122986, 0.878000020980835, 0.8666999936103821, 0.8751999735832214, 0.8748999834060669, 0.8738999962806702, 0.8673999905586243, 0.8712999820709229, 0.8726000189781189]
val_loss: [0.39949268102645874, 0.3119995892047882, 0.28234928846359253, 0.3045506775379181, 0.30820411443710327, 0.2834608554840088, 0.2936294972896576, 0.3101825714111328, 0.3229252099990845, 0.34479930996894836, 0.36042720079421997, 0.3789125978946686, 0.4012978971004486, 0.46704238653182983, 0.44717246294021606, 0.47011518478393555, 0.4929317533969879, 0.5508306622505188, 0.5428465604782104, 0.563687264919281]

# Show the keys recorded in history
history_dict = history.history
print(history_dict.keys())

# Plot the curves
import matplotlib.pyplot as plt

accuracy = history_dict['binary_accuracy']
val_accuracy = history_dict['val_binary_accuracy']
loss = history_dict['loss']
val_loss = history_dict['val_loss']
epochs = range(1, len(accuracy) + 1)

# Training vs. validation loss
plt.clf()
plt.plot(epochs, loss, 'bo', label='Training loss')       # "bo" is for "blue dot"
plt.plot(epochs, val_loss, 'b', label='Validation loss')  # "b" is for "solid blue line"
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.savefig("images/imdb-Keras-1.png")  # assumes the images/ directory exists
#plt.show()

# Training vs. validation accuracy
plt.clf()  # clear the figure before drawing the second plot
plt.plot(epochs, accuracy, 'bo', label='Training acc')
plt.plot(epochs, val_accuracy, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.savefig("images/imdb-Keras-2.png")
#plt.show()

dict_keys(['binary_accuracy', 'loss', 'val_binary_accuracy', 'val_loss'])

Figure 20: IMDB-Keras-1

Figure 21: IMDB-Keras-2

3.1.6. 優化 model

由圖23、24可以看出，上述 model 雖然在訓練階段的效能不錯，loss function 隨 epoch 下降、accuracy 也隨 epoch 升高，但在驗證階段的表現卻十分不理想，不僅 accuracy 隨 epoch 的增加呈緩降趨勢，loss function 甚至還往上急升。
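
One way to read this pattern off the numbers is to look for the epoch where the validation loss is lowest. The sketch below uses the val_loss values from the history output shown earlier, rounded to four decimals; stopping near that epoch is a common remedy for overfitting, but it is offered here as an illustration and is not the approach taken in the rest of this section.

```python
# Validation loss per epoch, rounded from the history output shown earlier.
val_loss = [0.3995, 0.3120, 0.2823, 0.3046, 0.3082, 0.2835, 0.2936, 0.3102,
            0.3229, 0.3448, 0.3604, 0.3789, 0.4013, 0.4670, 0.4472, 0.4701,
            0.4929, 0.5508, 0.5428, 0.5637]

# Epochs are numbered from 1, list positions from 0.
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1
print(best_epoch, val_loss[best_epoch - 1])  # 3 0.2823
```

The validation loss is lowest at epoch 3 and roughly doubles by epoch 20, even though the training loss keeps falling: the model is memorizing the training data.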

第二版的 model 做了以下改進:

將資料向量化(vectorize_sequences())
加入了兩層 layer 以及 dropout 層，其架構如圖22

Figure 22: IMDB model 架構#2
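
As background for the Dropout layers used below, here is a minimal sketch of what dropout does during training, written in plain Python rather than Keras. It assumes the common "inverted dropout" formulation, in which each value is zeroed with probability `rate` and the survivors are scaled by 1 / (1 - rate) so that the expected sum is unchanged; the function name and the seed are hypothetical.

```python
import random


def inverted_dropout(values, rate, rng):
    """Zero each value with probability `rate`; scale survivors by 1 / (1 - rate)."""
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else v * keep_scale for v in values]


rng = random.Random(0)
activations = [1.0] * 10000
dropped = inverted_dropout(activations, rate=0.25, rng=rng)

zero_fraction = dropped.count(0.0) / len(dropped)
mean_value = sum(dropped) / len(dropped)
print(round(zero_fraction, 2), round(mean_value, 2))
```

About a quarter of the activations are silenced on each pass, so the network cannot rely on any single neuron, which is why dropout tends to reduce overfitting. At prediction time no values are dropped.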

# Vectorization function
def vectorize_sequences(sequences, dimension=10000):
    # Create an all-zero matrix of shape (len(sequences), dimension)
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.  # set specific indices of results[i] to 1s
    return results

# Our vectorized training data
x_train = vectorize_sequences(train_data)
# Our vectorized test data
x_test = vectorize_sequences(test_data)
# Convert the labels to float arrays as well
y_train = np.asarray(train_labels).astype('float32')
y_test = np.asarray(test_labels).astype('float32')

# Build the model
model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dropout(0.25))
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dropout(0.25))
model.add(layers.Dense(1, activation='sigmoid'))

# Choose the optimizer (the platform check from the original is left commented out)
#import platform
#if platform.system() == "Darwin" and platform.processor() == "arm":
#    opt = optimizers.legacy.RMSprop(learning_rate=0.0001)
#else:
opt = optimizers.RMSprop(learning_rate=0.0001)

model.compile(optimizer=opt, loss='binary_crossentropy',
              metrics=[metrics.binary_accuracy])

# Validation set
x_val = x_train[:10000]  # first 10,000 samples form the validation set
partial_x_train = x_train[10000:]  # the rest form the training set
y_val = y_train[:10000]
partial_y_train = y_train[10000:]

# Train the model
history = model.fit(partial_x_train, partial_y_train,
                    epochs=20, batch_size=512,
                    validation_data=(x_val, y_val), verbose=0)

# Show the keys recorded in history
history_dict = history.history
print(history_dict.keys())

# Predict on the test set
predictions = model.predict(x_test)
print(predictions)

# Plot the curves
import matplotlib.pyplot as plt

loss = history_dict['loss']
val_loss = history_dict['val_loss']
binary_accuracy = history_dict['binary_accuracy']
val_binary_accuracy = history_dict['val_binary_accuracy']
epochs = range(1, len(binary_accuracy) + 1)

# Training vs. validation loss
plt.clf()
plt.plot(epochs, loss, 'bo', label='Training loss')       # "bo" is for "blue dot"
plt.plot(epochs, val_loss, 'b', label='Validation loss')  # "b" is for "solid blue line"
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.savefig("images/imdb-Keras-3.png")
#plt.show()

# Training vs. validation accuracy
plt.clf()
plt.plot(epochs, binary_accuracy, 'bo', label='Training acc')
plt.plot(epochs, val_binary_accuracy, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.savefig("images/imdb-Keras-4.png")
#plt.show()

dict_keys(['binary_accuracy', 'loss', 'val_binary_accuracy', 'val_loss'])
782/782 ━━━━━━━━━━━━━━━━━━━━ 0s 530us/step
[[0.28722998]
 [0.99044675]
 [0.72447336]
 ...
 [0.04659463]
 [0.12886518]
 [0.4191768 ]]

Figure 23: IMDB-Keras-1

Figure 24: IMDB-Keras-2

Figure 25: IMDB-Keras-3

Figure 26: IMDB-Keras-4

Comparing the two sets of results, the tuned model improves on both the loss and the accuracy.

3.2. Data Splitting: IRIS / K-Fold Cross Validation

3.2.1. GridSearchCV

import numpy as np
import tensorflow as tf
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report, accuracy_score
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from scikeras.wrappers import KerasClassifier

# Load the IRIS dataset
iris = load_iris()
x, y = iris.data, iris.target

# Standardize the features
scaler = StandardScaler()
x = scaler.fit_transform(x)

# Split the data into a training set and a test set
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=9527
)

# Model-building function (used by KerasClassifier)
def create_model(optimizer='adam', activation='relu'):
    model = Sequential()
    model.add(Dense(16, input_dim=4, activation=activation))
    model.add(Dense(8, activation=activation))
    model.add(Dense(3, activation='softmax'))

    model.compile(optimizer=optimizer,
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    return model

# Wrap the model with the scikeras KerasClassifier
model = KerasClassifier(model=create_model, verbose=0)

# Parameter grid (arguments of create_model need the model__ prefix)
param_grid = {
    'batch_size': [10, 20, 32],
    'epochs': [50, 100],
    'model__optimizer': ['adam', 'sgd'],
    'model__activation': ['relu', 'tanh']
}

# Create the GridSearchCV object
grid = GridSearchCV(estimator=model,
                    param_grid=param_grid,
                    cv=5,
                    n_jobs=-1)

# Run the parameter search
grid_result = grid.fit(x_train, y_train)

# Print the best parameters and the best score
print(f"Best parameters: {grid_result.best_params_}")
print(f"Best validation accuracy: {grid_result.best_score_:.4f}")

# Evaluate the best model on the test set
best_model = grid_result.best_estimator_
y_pred = best_model.predict(x_test)
print(f"Test accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(classification_report(y_test, y_pred))

Best parameters: {'batch_size': 10, 'epochs': 100, 'model__activation': 'tanh', 'model__optimizer': 'adam'}
Best validation accuracy: 0.9583
Test accuracy: 1.0000
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        13
           1       1.00      1.00      1.00         6
           2       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30
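
To get a feel for the cost of this search, the following sketch counts how many models the grid above asks for. It uses only the standard library and reuses the grid values from the listing; the variable names (candidates, n_cv_fits) are illustrative assumptions, not part of scikit-learn.

```python
from itertools import product

# Same grid values as in the GridSearchCV listing.
param_grid = {
    "batch_size": [10, 20, 32],
    "epochs": [50, 100],
    "model__optimizer": ["adam", "sgd"],
    "model__activation": ["relu", "tanh"],
}
cv_folds = 5  # matches cv=5 in the listing

# The Cartesian product of all value lists gives every candidate setting.
names = sorted(param_grid)
candidates = [dict(zip(names, values))
              for values in product(*(param_grid[n] for n in names))]

n_candidates = len(candidates)       # 3 * 2 * 2 * 2
n_cv_fits = n_candidates * cv_folds  # each candidate is trained once per fold

print(n_candidates)  # 24
print(n_cv_fits)     # 120
```

Each of the 120 fits trains a Keras model for 50 or 100 epochs, and with the default refit behaviour the best setting is trained once more on the whole training set, which is why the search is run with n_jobs=-1.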

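3.3. Multiclass Classification: Newswires

Goal: classify Reuters newswires into 46 topics. This is a multiclass classification problem in which each data point is assigned to exactly one category. If each data point could belong to several categories, it would be a multilabel multiclass classification problem.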
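
The difference between the two problem types shows up in how the targets are encoded. The sketch below is only an illustration with NumPy; the 5-class setting and the label values are made-up assumptions (the Reuters task has 46 classes).

```python
import numpy as np

num_classes = 5  # hypothetical; kept small for readability

# Single-label multiclass: exactly one topic per sample -> one-hot vector.
single_label = 3
one_hot = np.zeros(num_classes)
one_hot[single_label] = 1.0

# Multilabel multiclass: a sample may carry several topics -> multi-hot vector.
multi_labels = [1, 3]
multi_hot = np.zeros(num_classes)
multi_hot[multi_labels] = 1.0

print(one_hot)    # [0. 0. 0. 1. 0.]
print(multi_hot)  # [0. 1. 0. 1. 0.]
```

A one-hot target sums to 1, which fits a softmax output; a multi-hot target can sum to more than 1, so a multilabel model usually needs one independent output per class instead.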

3.3.1. Dataset

Like MNIST and IMDB, this dataset of short newswires and their topics, published by Reuters in 1986, is bundled with Keras. It covers 46 different topics.

from keras.datasets import reuters
(train_data, train_labels), (test_data, test_labels) = reuters.load_data(num_words=10000)
print(train_data[0])
print(train_labels[0])

[1, 2, 2, 8, 43, 10, 447, 5, 25, 207, 270, 5, 3095, 111, 16, 369, 186, 90, 67, 7, 89, 5, 19, 102, 6, 19, 124, 15, 90, 67, 84, 22, 482, 26, 7, 48, 4, 49, 8, 864, 39, 209, 154, 6, 151, 6, 83, 11, 15, 22, 155, 11, 15, 7, 48, 9, 4579, 1005, 504, 6, 258, 6, 272, 11, 15, 22, 134, 44, 11, 15, 16, 8, 197, 1245, 90, 67, 52, 29, 209, 30, 32, 132, 6, 109, 15, 17, 12]
3

The data must be vectorized before it reaches the network. For the labels there are two options: cast the label list to an integer tensor, or one-hot encode it. The input sequences are multi-hot encoded with the following hand-written Python function:

import numpy as np

def vectorize_sequences(sequences, dimension=10000):
    # One row per sample, one column per word index
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        # Set the columns of all word indices in this sample to 1
        results[i, sequence] = 1.
    return results

# Our vectorized training data
x_train = vectorize_sequences(train_data)
# Our vectorized test data
x_test = vectorize_sequences(test_data)
print('Original dataset shape:', train_data.shape)
print('Vectorized dataset shape:', x_train.shape)
print(x_train[0])

Original dataset shape: (8982,)
Vectorized dataset shape: (8982, 10000)
[0. 1. 1. ... 0. 0. 0.]

Keras also provides a built-in function for one-hot encoding the labels:

from tensorflow.keras.utils import to_categorical

one_hot_train_labels = to_categorical(train_labels)
one_hot_test_labels = to_categorical(test_labels)
print(one_hot_train_labels.shape)
print(one_hot_train_labels[0])

(8982, 46)
[0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]

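3.3.2. Building the Neural Network Model

Unlike IMDB, which has only two classes, this problem has 46. If each Dense layer still used only 16 units, the network might not be able to learn to separate 46 different classes, so the layers are made wider:

from keras import models
from keras import layers

model = models.Sequential()
model.add(layers.Dense(64, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(46, activation='softmax'))

In addition, the activation function of the output layer changes from sigmoid to softmax, so the network outputs one probability per class. The best-suited loss function in this setting is categorical_crossentropy. It measures the distance between two probability distributions, here the predicted distribution output by the network and the true distribution of the labels. Training minimizes this distance so that the predictions move toward the correct answers.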
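
The following NumPy sketch shows what softmax and categorical cross-entropy compute for a single sample. It is a simplified illustration, not the Keras implementation; the three-class logits and the helper names are assumptions made for the example.

```python
import numpy as np

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

def categorical_crossentropy(y_true, y_pred, eps=1e-12):
    """Cross-entropy between a one-hot target and a predicted distribution."""
    return float(-np.sum(y_true * np.log(y_pred + eps)))

logits = np.array([2.0, 1.0, 0.1])  # hypothetical raw outputs for 3 classes
y_true = np.array([1.0, 0.0, 0.0])  # one-hot label: the true class is 0

probs = softmax(logits)
loss = categorical_crossentropy(y_true, probs)

print(probs, probs.sum())
print(loss)
```

With a one-hot target the loss reduces to the negative log of the probability assigned to the true class, so it approaches 0 as that probability approaches 1.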

model.compile(optimizer='rmsprop', loss='categorical_crossentropy',
              metrics=['accuracy'])

The metrics argument lists the metrics that are tracked during training and reported by the later evaluation (model.evaluate).

3.3.3. Validation Set

Set apart 1,000 samples from the training set for validation:

x_val = x_train[:1000]
partial_x_train = x_train[1000:]

y_val = one_hot_train_labels[:1000]
partial_y_train = one_hot_train_labels[1000:]

3.3.4. Training the Model

history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=9,
                    batch_size=512,
                    validation_data=(x_val, y_val),
                    verbose=0)
history_dict = history.history
print(history_dict.keys())

dict_keys(['accuracy', 'loss', 'val_accuracy', 'val_loss'])

3.3.5. Evaluating the Model

model.evaluate() on line 6 of the listing below returns two results:

the loss value
the metrics specified in model.compile(), here accuracy

 1: print('loss:', history_dict['loss'])
 2: print('accuracy:', history_dict['accuracy'])
 3: print('val_accuracy:', history_dict['val_accuracy'])
 4: # Evaluate
 5: # Returns the loss value & metrics values for the model in test mode.
 6: results = model.evaluate(x_test, one_hot_test_labels)
 7: print("Evaluation results:", results)
 8: # Predict
 9: predictions = model.predict(x_test)
10: print("Prediction shape:", predictions[0].shape)
11: print("Prediction values:", predictions[0])
12: print("Predicted class:", np.argmax(predictions[0]))
13: print("Answer:", one_hot_test_labels[0])

loss: [3.0026934146881104, 1.6911225318908691, 1.232879877090454, 0.995611846446991, 0.8247262835502625, 0.6870928406715393, 0.5740740299224854, 0.4755861163139343, 0.4013058543205261]
accuracy: [0.47619643807411194, 0.6736406683921814, 0.7411676049232483, 0.7872713804244995, 0.8251065015792847, 0.8561763763427734, 0.8801052570343018, 0.9030318260192871, 0.9189426302909851]
val_accuracy: [0.6200000047683716, 0.6959999799728394, 0.7379999756813049, 0.765999972820282, 0.7929999828338623, 0.8069999814033508, 0.8130000233650208, 0.8069999814033508, 0.8130000233650208]
71/71 ━━━━━━━━━━━━━━━━━━━━ 0s 839us/step - accuracy: 0.8001 - loss: 0.9160
Evaluation results: [0.9611411094665527, 0.7858415246009827]
71/71 ━━━━━━━━━━━━━━━━━━━━ 0s 685us/step
Prediction shape: (46,)
Prediction values: [1.2854149e-05 3.7789090e-05 4.2292118e-05 9.0718436e-01 8.3002120e-02
 2.3392004e-06 2.9046484e-04 1.1120371e-05 1.7645630e-03 3.6879155e-05
 1.1398656e-06 5.2984012e-04 7.8006480e-05 5.6496664e-04 2.5368774e-05
 1.3207436e-04 7.5159752e-04 2.6923684e-05 1.9250263e-05 4.1414335e-04
 1.8693011e-03 7.0673483e-04 6.4346145e-06 1.2796119e-04 6.4659413e-05
 3.2037435e-05 5.2164037e-06 1.2275139e-04 9.2562705e-06 2.1327383e-04
 5.8351434e-05 1.9944042e-04 3.2126438e-05 2.4959205e-05 8.7677283e-05
 2.4225967e-05 6.1072549e-04 4.1610518e-05 3.6517431e-06 2.3155226e-04
 9.5444564e-05 4.4300844e-04 1.3225984e-05 1.0810289e-05 4.7312778e-07
 3.7070378e-05]
Predicted class: 3
Answer: [0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]

After 9 epochs, the model reaches a test accuracy of nearly 80% (about 0.79).

3.3.6. Visualizing the Results

# Plot the training and validation loss
import matplotlib.pyplot as plt

loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = range(1, len(loss) + 1)
plt.cla()
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.axis([0, 10, 0, 3])
plt.legend()
plt.plot()
plt.savefig("images/reuters-1.png")
#plt.show()

plt.cla()   # clear figure

# Plot the training and validation accuracy
accuracy = history.history['accuracy']
val_accuracy = history.history['val_accuracy']

plt.plot(epochs, accuracy, 'bo', label='Training accuracy')
plt.plot(epochs, val_accuracy, 'b', label='Validation accuracy')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.axis([0, 10, 0, 1])
plt.legend()
plt.plot()
plt.savefig("images/reuters-2.png")
# plt.show()

Figure 27: Reuters-1

Figure 28: Reuters-2

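3.3.7. Tuning the Model

If the number of units (dimensions) in the intermediate layers of the example above is reduced to 4, validation accuracy drops to about 71%. The main reason is that this squeezes a large amount of information into a low-dimensional intermediate representation: the network can cram most of the necessary information into these 4 dimensions, but not all of it. Conversely, widening the layers, adding more layers, and adding Dropout does not improve the result (test accuracy is about 0.75 here, compared with about 0.79 before). Why?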

model = models.Sequential()
model.add(layers.Dense(64, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(128, activation='relu'))
model.add(layers.Dropout(0.25))
model.add(layers.Dense(256, activation='relu'))
model.add(layers.Dropout(0.3))
model.add(layers.Dense(512, activation='relu'))
model.add(layers.Dropout(0.5))
model.add(layers.Dense(46, activation='softmax'))

model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# Train
history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=9,
                    batch_size=512,
                    validation_data=(x_val, y_val),
                    verbose=0)

history_dict = history.history
print(history_dict.keys())

# Evaluate
# Returns the loss value & metrics values for the model in test mode.
results = model.evaluate(x_test, one_hot_test_labels)
print("Evaluation results:", results)

# Predict
predictions = model.predict(x_test)
print("Prediction shape:", predictions[0].shape)
print("Prediction values:", predictions[0])
print("Predicted class:", np.argmax(predictions[0]))
print("Answer:", one_hot_test_labels[0])

# Plot the training and validation loss
import matplotlib.pyplot as plt

loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = range(1, len(loss) + 1)
plt.cla()
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.axis([0, 10, 0, 3])
plt.legend()
plt.plot()
plt.savefig("images/reuters-3.png")
#plt.show()

plt.cla()   # clear figure

# Plot the training and validation accuracy
accuracy = history.history['accuracy']
val_accuracy = history.history['val_accuracy']

plt.plot(epochs, accuracy, 'bo', label='Training accuracy')
plt.plot(epochs, val_accuracy, 'b', label='Validation accuracy')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.axis([0, 10, 0, 1])
plt.legend()
plt.plot()
plt.savefig("images/reuters-4.png")
# plt.show()

/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/keras/src/layers/core/dense.py:85: UserWarning: Do not pass an `input_shape`/`input_dim` argument to a layer. When using Sequential models, prefer using an `Input(shape)` object as the first layer in the model instead.
  super().__init__(activity_regularizer=activity_regularizer, **kwargs)
dict_keys(['accuracy', 'loss', 'val_accuracy', 'val_loss'])
71/71 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.7549 - loss: 1.2041
Evaluation results: [1.2634036540985107, 0.7471059560775757]
71/71 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step
Prediction shape: (46,)
Prediction values: [8.09684124e-08 9.49222340e-07 5.19393861e-09 9.99711692e-01
 1.80036150e-04 5.10908560e-09 1.91957028e-07 8.43487769e-08
 4.33161113e-05 5.34591793e-09 8.28265286e-07 2.17356887e-06
 1.84676892e-07 6.15979616e-07 5.97942105e-08 6.40554010e-09
 1.57276918e-05 3.67254273e-07 8.75066561e-08 4.16998364e-06
 3.44055552e-05 2.87553036e-07 9.22512200e-09 1.12576892e-07
 4.54701921e-09 1.97374948e-06 1.79199517e-08 2.03319690e-08
 1.43965934e-07 7.99068971e-08 3.35195637e-07 5.80273181e-08
 3.48503910e-08 4.89572605e-09 8.85782299e-07 2.67696922e-08
 3.60888407e-07 1.24438397e-08 2.77553855e-08 3.01288338e-07
 8.20946955e-09 1.16714205e-07 2.22603358e-09 1.22548434e-08
 2.11367723e-09 2.50342702e-09]
Predicted class: 3
Answer: [0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]

Figure 29: Reuters-1

Figure 30: Reuters-2

Figure 31: Reuters-3

Figure 32: Reuters-4

3.4. Regression: Predicting House Prices

3.4.1. Preparing the Data

from keras.datasets import boston_housing

(train_data, train_targets), (test_data, test_targets) = boston_housing.load_data()

print(train_data.shape)
print(test_data.shape)

(404, 13)
(102, 13)

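3.4.1.1. Standardizing the Dataset

# Compute the mean and standard deviation on the training data only
mean = train_data.mean(axis=0)
train_data -= mean
std = train_data.std(axis=0)
train_data /= std

# Apply the training-set statistics to the test data
test_data -= mean
test_data /= std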

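3.4.2. Building the Neural Network

Because so few samples are available, a small network is used. In general, the less training data there is, the worse overfitting becomes.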

 1: from keras import models
 2: from keras import layers
 3: 
 4: def build_model():
 5:     # Because we will need to instantiate
 6:     # the same model multiple times,
 7:     # we use a function to construct it.
 8:     model = models.Sequential()
 9:     model.add(layers.Dense(64, activation='relu',
10:                            input_shape=(train_data.shape[1],)))
11:     model.add(layers.Dense(64, activation='relu'))
12:     model.add(layers.Dense(1))
13:     model.compile(optimizer='rmsprop', loss='mse', metrics=['mae'])
14:     return model

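The network ends with a single unit and no activation function (line 12), so the last layer is a purely linear transformation. This is the basic setup for scalar regression: the network outputs one floating-point value (the regression value) of any magnitude. If an activation such as sigmoid were applied to the last layer, the output would be restricted to the range 0 to 1. mse (mean squared error) is a commonly used loss function for regression, and the metric monitored here is mae (mean absolute error), the absolute value of the difference between the predictions and the targets.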
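
The two quantities can be written out in a few lines of plain Python. This is a minimal sketch for illustration, not the Keras implementation; the target and prediction values are made up.

```python
def mse(y_true, y_pred):
    """Mean squared error: average of the squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean absolute error: average of the absolute differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

targets = [20.0, 30.0, 50.0]      # hypothetical house prices
predictions = [22.0, 29.0, 44.0]  # hypothetical model outputs

print(mse(targets, predictions))  # (4 + 1 + 36) / 3
print(mae(targets, predictions))  # (2 + 1 + 6) / 3 = 3.0
```

MSE penalizes the single large miss (6) much more heavily than MAE does, which makes it a useful training loss, while MAE stays in the same unit as the target and is easier to read as a metric.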

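3.4.3. Validation

Because this example has so few data points, a validation set would hold only about 100 samples. The validation score could therefore vary a lot depending on which points end up in the validation set and which in the training set, which makes the evaluation of the model unreliable. In this situation the best practice is K-fold cross-validation, illustrated in Figure 33: split the data into K partitions (typically K = 4 or 5), train K times, each time holding out a different partition as the validation set, and finally average the K validation scores.

Figure 33: K-fold cross-validation

A Python implementation of K-fold cross-validation: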

import numpy as np

k = 4
num_val_samples = len(train_data) // k
num_epochs = 100
all_scores = []
for i in range(k):
    print('processing fold #', i)
    # Prepare the validation data: data from partition # i
    val_data = train_data[i * num_val_samples: (i + 1) * num_val_samples]
    val_targets = train_targets[i * num_val_samples: (i + 1) * num_val_samples]

    # Prepare the training data: data from all other partitions
    partial_train_data = np.concatenate([train_data[:i * num_val_samples],
         train_data[(i + 1) * num_val_samples:]], axis=0)
    partial_train_targets = np.concatenate([train_targets[:i * num_val_samples],
         train_targets[(i + 1) * num_val_samples:]], axis=0)

    # Build the Keras model (already compiled)
    model = build_model()
    # Train the model (in silent mode, verbose=0)
    model.fit(partial_train_data, partial_train_targets,
              epochs=num_epochs, batch_size=1, verbose=0)
    # Evaluate the model on the validation data
    val_mse, val_mae = model.evaluate(val_data, val_targets, verbose=0)
    all_scores.append(val_mae)

processing fold # 0
processing fold # 1
processing fold # 2
processing fold # 3

3.4.4. Reviewing the Results

print(all_scores)
print(np.mean(all_scores))

[2.2767844200134277, 2.619281053543091, 2.72979474067688, 2.562032461166382]
2.546973168849945

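The four folds give validation MAE scores from 2.28 to 2.73, with a mean of about 2.55. The mean is a more reliable indicator than any single fold. Because the targets in this dataset are expressed in thousands of dollars, an MAE of 2.55 means the predictions are off by roughly $2,550 on average, which is still a sizeable error for these house prices.

Possibly because of compatibility differences between the Mac and Linux builds of Anaconda, or because of differing Keras versions, the keys of history.history differ slightly between the two environments:

# Linux with Keras 2.2.5
dict_keys(['val_loss', 'val_mean_absolute_error', 'loss', 'mean_absolute_error'])
# Mac with Keras 2.3.1
dict_keys(['val_loss', 'val_mae', 'loss', 'mae'])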

3.4.4.1. Visualizing the Results

# Repeat the K-fold validation, recording the validation MAE of every epoch
k = 4
num_val_samples = len(train_data) // k
num_epochs = 500
all_mae_histories = []
for i in range(k):
    print('processing fold #', i)
    # Prepare the validation data: data from partition # i
    val_data = train_data[i * num_val_samples: (i + 1) * num_val_samples]
    val_targets = train_targets[i * num_val_samples: (i + 1) * num_val_samples]
    # Prepare the training data: data from all other partitions
    partial_train_data = np.concatenate(
        [train_data[:i * num_val_samples],
         train_data[(i + 1) * num_val_samples:]],
        axis=0)
    partial_train_targets = np.concatenate(
        [train_targets[:i * num_val_samples],
         train_targets[(i + 1) * num_val_samples:]],
        axis=0)
    # Build the Keras model (already compiled)
    model = build_model()
    # Train the model (in silent mode, verbose=0)
    history = model.fit(partial_train_data, partial_train_targets,
                        validation_data=(val_data, val_targets),
                        epochs=num_epochs, batch_size=1, verbose=0)
    mae_history = history.history['val_mae']
    all_mae_histories.append(mae_history)

# Average the per-epoch validation MAE over the k folds
average_mae_history = [np.mean([x[i] for x in all_mae_histories]) for i in range(num_epochs)]

import matplotlib.pyplot as plt
plt.plot(range(1, len(average_mae_history) + 1), average_mae_history)
plt.xlabel('Epochs')
plt.ylabel('Validation MAE')
plt.plot()
plt.savefig("images/Boston-House-Price.png")

# Smooth the curve with an exponential moving average
# (factor plays the role of 1 - a in the EMA formula)
def smooth_curve(points, factor=0.9):
    smoothed_points = []
    for point in points:
        if smoothed_points:
            previous = smoothed_points[-1]
            smoothed_points.append(previous * factor + point * (1 - factor))
        else:
            smoothed_points.append(point)
    return smoothed_points

# Drop the first 10 data points, then smooth the rest
smooth_mae_history = smooth_curve(average_mae_history[10:])
plt.clf()
plt.plot(range(1, len(smooth_mae_history) + 1), smooth_mae_history)
plt.xlabel('Epochs')
plt.ylabel('Validation MAE')
plt.plot()
plt.savefig("images/Boston-House-Price-ex10.png")

processing fold # 0
processing fold # 1
processing fold # 2
processing fold # 3

Figure 34: Boston House Price Training MAE

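Figure 34 plots the average validation MAE for each training epoch. Because of scaling issues on the y-axis, the plot hides some important details, so it is adjusted as follows:

Omit the first 10 data points.
Replace each data point with an exponential moving average (EMA) of the previous points, so that the error curve becomes smooth.

EMA is widely used in data analysis. Its core idea is that the current value is influenced by past data, and the more recent the data, the greater its influence. Take stock prices: the price changes of the last 10 days naturally matter more for the near future than those of 10 years ago.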

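The EMA is defined as:
\( E_t = a \times V_t + (1-a) \times E_{t-1} \), where

\(E_t\) is the exponential moving average at time \(t\)
\(a\) is the smoothing coefficient, usually between 0 and 1
\(V_t\) is the raw value at time \(t\)
\(E_{t-1}\) is the exponential moving average at time \(t-1\)

To see how the EMA weights earlier data points, take 10 days of data (one value per day). The EMA on day 10 is:
\( E_{10} = aV_{10} + (1-a)E_9 \)
Expanding \(E_9\) for day 9 gives
\( E_{10} = aV_{10} + (1-a)[aV_9 + (1-a)E_8] \)
which rearranges to
\( E_{10} = a(V_{10} + (1-a)V_9) + (1-a)^{2}E_8 \)
Expanding all remaining days yields
\( E_{10} = a(V_{10} + (1-a)V_9 + (1-a)^{2}V_8 + \dots + (1-a)^{9}V_{1}) + (1-a)^{10}E_0 \)
The last term usually becomes very small when the series is long, so it can be ignored. The expression also shows that \(E_{10}\) is influenced by every raw value \((V_{10} \dots V_{1})\): for each additional day that passes, a raw value is multiplied by another factor of \((1-a)\), an exponential relationship, so the older an event is, the smaller its influence.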
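
The expansion can be checked numerically. The sketch below compares the recursive definition with the expanded weighted sum; it is an illustration only, and the ten values, the starting value e0, and the function names are assumptions made for the example.

```python
def ema(values, a, e0):
    """Recursive EMA: E_t = a * V_t + (1 - a) * E_{t-1}, starting from E_0 = e0."""
    e = e0
    for v in values:
        e = a * v + (1 - a) * e
    return e

def ema_expanded(values, a, e0):
    """Expanded form: each V_{n-k} carries the weight a * (1 - a)^k."""
    n = len(values)
    # values[0] is V_1 and values[-1] is V_n
    weighted = sum(a * (1 - a) ** k * values[n - 1 - k] for k in range(n))
    return weighted + (1 - a) ** n * e0

values = [3.0, 2.8, 2.9, 2.6, 2.7, 2.5, 2.6, 2.4, 2.5, 2.3]  # hypothetical V_1..V_10
a = 0.1   # smoothing coefficient
e0 = 3.0  # hypothetical starting value E_0

recursive = ema(values, a, e0)
expanded = ema_expanded(values, a, e0)
leftover_weight = (1 - a) ** len(values)  # weight still attached to E_0

print(recursive, expanded)
print(leftover_weight)
```

Both forms give the same number. Note that with a = 0.1 the weight left on the starting value after 10 steps is still about 0.35, so the last term only becomes negligible for longer series or a larger a.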

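Figure 35: Boston House Price Training MAE (first 10 data points excluded)

Figure 35 shows that the validation MAE stops improving after about 80 epochs and then starts to rise; past that point the model begins to overfit.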

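3.4.5. Summary

This example shows that:

In regression, MSE is commonly used as the loss function and MAE as the evaluation metric (rather than accuracy).
When the input features are on different scales, each feature should be rescaled first.
When little data is available, use K-fold validation to evaluate the model.
When little data is available, prefer a small network with few hidden layers (typically one or two) to avoid overfitting.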

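3.5. Image Recognition: MNIST

This example uses the simplest kind of NN (neural network). The core data structure in Keras is the model, and the most commonly used one is the Sequential model. With .add() the network can be stacked layer by layer. For each layer we only need to set its size (units) and its activation function. Two things to remember: the first layer must declare the size of the input vector, and the units of the last layer must equal the size of the output vector. Here the last layer uses softmax as its activation function.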

3.5.1. Import Library

from keras.datasets import mnist
from tensorflow.keras.utils import to_categorical
import numpy as np

(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Flatten each training image into a one-dimensional vector
X_train = x_train.reshape(x_train.shape[0], -1)
# One-hot encode the training labels
Y_train = to_categorical(y_train)

X_test = x_test.reshape(x_test.shape[0], -1)
Y_test = to_categorical(y_test)

3.5.2. Building the Model

Keras offers two kinds of models: Sequential and Model.
After choosing the model type, decide how to stack the layers. There are many layer types to choose from, for example Dense, Activation, Conv1D, Dropout, and so on.
After choosing the layers, choose the activation functions, for example relu, sigmoid, softmax, and so on.
Keras API reference / Layers API / Layer activation functions

from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout

model = Sequential()
# Stack the layers
model.add(Dense(input_dim=28*28,units=128,activation='relu'))
model.add(Dense(units=64,activation='relu'))
model.add(Dense(units=10,activation='softmax'))
model.summary()

/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/keras/src/layers/core/dense.py:85: UserWarning: Do not pass an `input_shape`/`input_dim` argument to a layer. When using Sequential models, prefer using an `Input(shape)` object as the first layer in the model instead.
  super().__init__(activity_regularizer=activity_regularizer, **kwargs)
Model: "sequential"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ dense (Dense)                   │ (None, 128)            │       100,480 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_1 (Dense)                 │ (None, 64)             │         8,256 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_2 (Dense)                 │ (None, 10)             │           650 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
 Total params: 109,386 (427.29 KB)
 Trainable params: 109,386 (427.29 KB)
 Non-trainable params: 0 (0.00 B)

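model.summary() prints a brief overview of the model architecture and the total number of parameters.

This example stacks three Dense layers. Each image enters the network as a vector of 784 values (28*28 pixels). The first layer has 128 neurons (or nodes) and the second has 64; both are hidden layers. The last layer has 10 neurons, one for each of the 10 possible digits.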
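
The parameter total reported by model.summary() can be reproduced by hand: a Dense layer has one weight per input-unit pair plus one bias per unit. The helper name dense_params below is an illustrative assumption, not a Keras function.

```python
def dense_params(n_inputs, n_units):
    """Parameters of a fully connected layer: weights plus one bias per unit."""
    return n_inputs * n_units + n_units

# Input size followed by the units of each Dense layer in the model above.
layer_sizes = [28 * 28, 128, 64, 10]

per_layer = [dense_params(n_in, n_out)
             for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)

print(per_layer)  # [100480, 8256, 650]
print(total)      # 109386
```

The total matches the 109,386 parameters shown in the summary; almost all of them sit in the first layer, because it is connected to all 784 input pixels.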

3.5.3. Training the Model

Training requires choosing a loss function and an optimizer. The official Keras documentation (Model training APIs) lists the available options and indicates which ones suit which kinds of datasets and problems.

model.compile(loss='categorical_crossentropy',
              optimizer='adam', metrics=['accuracy'])

train_history = model.fit(x=X_train, y=Y_train, validation_split=0.2,
                          epochs=50, batch_size=1000, verbose=2)

Epoch 1/50
48/48 - 1s - 27ms/step - accuracy: 0.9686 - loss: 0.1297 - val_accuracy: 0.9424 - val_loss: 0.4813
... (omitted) ...
Epoch 50/50
48/48 - 1s - 12ms/step - accuracy: 0.9961 - loss: 0.0137 - val_accuracy: 0.9631 - val_loss: 0.4587

3.5.4. Reviewing the Training Process

Look at the structure of the history:

print(train_history.history.keys())

dict_keys(['accuracy', 'loss', 'val_accuracy', 'val_loss'])

import matplotlib.pyplot as plt

def show_train_history(ylabel, train, test, fn):
    # Plot one training metric and its validation counterpart, then save the figure
    plt.cla()
    plt.plot(train_history.history[train])
    plt.plot(train_history.history[test])
    plt.title('Train History')
    plt.ylabel(ylabel)
    plt.xlabel('Epoch')
    plt.legend(['train', 'test'], loc='center left')
    plt.savefig("images/" + fn, dpi=300)
    # plt.show()

show_train_history('Accuracy', 'accuracy', 'val_accuracy', 'mnist-acc-val.png')
show_train_history('Loss', 'loss', 'val_loss', 'mnist-loss-val.png')

After training, accuracy and loss can be used to assess the model. The plots show roughly that, as the number of epochs grows, the training accuracy rises and the training loss falls.

Figure 36: Accuracy

Figure 37: Loss

3.5.5. Evaluating Model Accuracy

score = model.evaluate(X_train, Y_train, batch_size=200)
print('\nTrain Acc:', score[1])
score = model.evaluate(X_test, Y_test, batch_size=200)
print('\nTest Acc:', score[1])

300/300 ━━━━━━━━━━━━━━━━━━━━ 1s 1ms/step - accuracy: 0.9955 - loss: 0.0254

Train Acc: 0.9897500276565552
50/50 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.9575 - loss: 0.5729

Test Acc: 0.9627000093460083

3.5.6. Prediction Results

prediction = model.predict(X_test)
print(prediction.shape)
print(prediction[:2])

313/313 ━━━━━━━━━━━━━━━━━━━━ 0s 519us/step
(10000, 10)
[[0.0000000e+00 4.8615205e-32 4.4841320e-31 5.9096544e-32 9.9364236e-38
  5.8718204e-37 0.0000000e+00 1.0000000e+00 0.0000000e+00 3.7505548e-29]
 [1.0376822e-33 4.1142359e-27 1.0000000e+00 6.7152534e-21 0.0000000e+00
  0.0000000e+00 0.0000000e+00 0.0000000e+00 1.6477517e-24 0.0000000e+00]]

import matplotlib.pyplot as plt
import numpy as np

def oneHotDecode(number):
    # Convert a one-hot vector back to its class index
    return np.argmax(number)

def plot_images_labels_prediction(images, labels, prediction, num, fn):
    plt.cla()
    fig = plt.gcf()
    fig.set_size_inches(10, 14)

    idx = 0
    for i in range(0, num):
        ax = plt.subplot(5, 5, 1 + i)
        ax.imshow(images[idx].reshape(28, 28), cmap='binary')
        ax.set_title("label=" + str(oneHotDecode(labels[idx])) +
                     ",\npredict=" + str(np.argmax(prediction[idx])),
                     fontsize=10)
        idx += 1
    plt.savefig("images/" + fn, dpi=300, bbox_inches='tight', pad_inches=0.2)

# Pass the one-hot labels (Y_test), because oneHotDecode expects one-hot vectors
plot_images_labels_prediction(x_test, Y_test, prediction, 20, 'mnist-predic-perf.png')

Finally, the images, labels, and predictions of the first 20 test samples are plotted.

Figure 38: Predictions for the first 20 test samples

3.6. Image Recognition, Version 2: MNIST

3.6.1. Another Version

# Load the data
from keras.datasets import mnist
from tensorflow.keras.utils import to_categorical

def load_data():
    # Load the MNIST data
    (x_train, y_train), (x_test, y_test) = mnist.load_data()
    # Reshape the images into a 60000*784 matrix and scale the pixels to [0, 1]
    x_train = x_train.reshape(x_train.shape[0], 28*28)
    x_test = x_test.reshape(x_test.shape[0], 28*28)
    x_train = x_train.astype('float32')
    x_test = x_test.astype('float32')
    x_train = x_train/255
    x_test = x_test/255
    # Convert y to one-hot encoding
    y_train = to_categorical(y_train, 10)
    y_test = to_categorical(y_test, 10)
    # Return the processed data
    return (x_train, y_train), (x_test, y_test)

import numpy as np
from keras import layers
from keras import models

def build_model():
    # Build the model by stacking the layers
    model = models.Sequential()
    model.add(layers.Dense(input_dim=28*28,units=128,activation='relu'))
    model.add(layers.Dense(units=64,activation='relu'))
    model.add(layers.Dense(units=10,activation='softmax'))
    model.summary()
    return model

# Train the model, using Adam as the optimizer and
# categorical_crossentropy as the loss function
(x_train,y_train),(x_test,y_test)=load_data()
model = build_model()
model.compile(loss='categorical_crossentropy',optimizer="adam",metrics=['accuracy'])
model.fit(x_train, y_train, batch_size=100, epochs=5, verbose=0)
# Show the training results
score = model.evaluate(x_train, y_train)
print ('\nTrain Acc:', score[1])
score = model.evaluate(x_test,y_test)
print ('\nTest Acc:', score[1])

# Predict
prediction = model.predict(x_test)
print(prediction[:10])

import pandas as pd
# Convert the predicted probabilities to class labels
predicted_labels = np.argmax(prediction, axis=1)
# Convert the one-hot true labels to class labels
true_labels = np.argmax(y_test, axis=1)

# Cross-tabulate true labels against predictions (a confusion matrix)
p = pd.crosstab(true_labels, predicted_labels, rownames=['label'], colnames=['predict'])
print(p)

Model: "sequential"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ dense (Dense)                   │ (None, 128)            │       100,480 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_1 (Dense)                 │ (None, 64)             │         8,256 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_2 (Dense)                 │ (None, 10)             │           650 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
 Total params: 109,386 (427.29 KB)
 Trainable params: 109,386 (427.29 KB)
 Non-trainable params: 0 (0.00 B)

1875/1875 ━━━━━━━━━━━━━━━━━━━━ 1s 364us/step - accuracy: 0.9900 - loss: 0.0361

Train Acc: 0.9893333315849304

313/313 ━━━━━━━━━━━━━━━━━━━━ 0s 359us/step - accuracy: 0.9733 - loss: 0.0810

Test Acc: 0.9764999747276306

313/313 ━━━━━━━━━━━━━━━━━━━━ 0s 378us/step
[[1.58942123e-06 4.56630090e-07 1.44183810e-04 7.96658918e-04
  4.83289035e-08 5.57914291e-07 8.43276837e-12 9.98949349e-01
  7.18311139e-06 9.99482872e-05]
 [9.20066245e-11 8.16951651e-05 9.99909282e-01 7.55548444e-06
  9.96026872e-11 2.12249276e-07 8.12293024e-08 3.26049777e-11
  1.17174454e-06 2.21182258e-14]
 [1.82215524e-06 9.96792972e-01 3.96484742e-04 1.60194541e-04
  1.02035912e-04 1.21604295e-04 6.79323348e-05 8.09717865e-04
  1.49104348e-03 5.62478417e-05]
 [9.99915719e-01 5.35815747e-09 4.51234046e-05 5.88590240e-07
  1.43257850e-07 3.82396547e-06 1.11071859e-05 1.55277394e-06
  2.66697526e-08 2.18579280e-05]
 [4.33285859e-05 1.10104907e-06 1.34856982e-05 1.56399324e-06
  9.93066788e-01 4.01588704e-06 4.50718362e-05 1.48709762e-04
  9.48531579e-06 6.66645123e-03]
 [6.66389894e-08 9.97991085e-01 3.08640438e-06 4.88619798e-06
  1.13685437e-05 2.65808126e-07 6.75553053e-08 1.93838566e-03
  2.31987287e-05 2.76108785e-05]
 [7.66120678e-08 4.94036567e-06 2.84373783e-07 6.58262920e-08
  9.98910308e-01 1.59385650e-06 7.10860562e-08 8.05889431e-06
  4.70556435e-04 6.04005065e-04]
 [6.92483184e-07 7.98684960e-06 4.36144364e-05 3.25308600e-03
  4.72561878e-05 9.02633201e-06 8.00792588e-09 1.29608334e-05
  2.07478279e-05 9.96604562e-01]
 [1.25154367e-08 2.27490159e-06 1.40500115e-03 6.09770814e-06
  4.96629109e-05 8.98099899e-01 9.85215753e-02 2.18058993e-08
  1.91533507e-03 1.61481594e-07]
 [6.98578688e-08 2.19913043e-09 5.58190187e-08 9.24819687e-06
  1.30430009e-04 7.13636261e-09 4.16892909e-12 4.35012007e-05
  2.02734009e-06 9.99814689e-01]]
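Each row of the array above appears to be a probability distribution over the ten digit classes (the entries in each row sum to roughly 1), so the predicted digit is simply the index of the largest entry. The following is a minimal sketch of that conversion, assuming NumPy is available. The names `probs`, `predicted_digits`, and `confidence` are illustrative and not taken from the original code, and only the first three rows of the printed output are copied in.

```python
import numpy as np

# First three rows of the prediction output shown above (values copied verbatim).
probs = np.array([
    [1.58942123e-06, 4.56630090e-07, 1.44183810e-04, 7.96658918e-04,
     4.83289035e-08, 5.57914291e-07, 8.43276837e-12, 9.98949349e-01,
     7.18311139e-06, 9.99482872e-05],
    [9.20066245e-11, 8.16951651e-05, 9.99909282e-01, 7.55548444e-06,
     9.96026872e-11, 2.12249276e-07, 8.12293024e-08, 3.26049777e-11,
     1.17174454e-06, 2.21182258e-14],
    [1.82215524e-06, 9.96792972e-01, 3.96484742e-04, 1.60194541e-04,
     1.02035912e-04, 1.21604295e-04, 6.79323348e-05, 8.09717865e-04,
     1.49104348e-03, 5.62478417e-05],
])

# The column index of the largest probability in each row is the predicted digit.
predicted_digits = probs.argmax(axis=1)

# The largest probability itself can be read as the model's confidence.
confidence = probs.max(axis=1)

print(predicted_digits)  # [7 2 1]
```

Applied to all ten rows of the output, the same rule gives the digits 7, 2, 1, 0, 4, 1, 4, 9, 5, 9. The ninth row is the least confident one: about 0.90 for digit 5 and about 0.10 for digit 6.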
predict    0     1     2    3    4    5    6     7    8    9
label
0        960     0     6    0    2    0    7     1    4    0
1          0  1128     3    0    0    0    1     0    3    0
2          2     4  1015    2    1    0    1     4    2    1
3          0     0    11  979    0    9    0     4    2    5
4          0     1     1    0  955    0    6     4    2   13
5          2     0     0   14    1  860    4     1    7    3
6          3     3     2    1    1    3  942     1    2    0
7          0     4     7    1    1    0    0  1004    0   11
8          5     1     4    7    3    2    3     5  938    6
9          3     2     2    6    6    1    2     3    0  984
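The table above cross-tabulates true labels (rows) against predicted labels (columns), so the diagonal holds the correctly classified test images. As a rough sketch of how summary metrics can be read off such a table, assuming NumPy is available (the counts are copied from the table, and variable names such as `cm` and `recall` are illustrative rather than part of the original code):

```python
import numpy as np

# Confusion matrix copied from the table above: rows = true label, columns = prediction.
cm = np.array([
    [960,    0,    6,   0,   2,   0,   7,    1,   4,   0],
    [  0, 1128,    3,   0,   0,   0,   1,    0,   3,   0],
    [  2,    4, 1015,   2,   1,   0,   1,    4,   2,   1],
    [  0,    0,   11, 979,   0,   9,   0,    4,   2,   5],
    [  0,    1,    1,   0, 955,   0,   6,    4,   2,  13],
    [  2,    0,    0,  14,   1, 860,   4,    1,   7,   3],
    [  3,    3,    2,   1,   1,   3, 942,    1,   2,   0],
    [  0,    4,    7,   1,   1,   0,   0, 1004,   0,  11],
    [  5,    1,    4,   7,   3,   2,   3,    5, 938,   6],
    [  3,    2,    2,   6,   6,   1,   2,    3,   0, 984],
])

# Overall accuracy: correctly classified samples (the diagonal) over all samples.
accuracy = np.trace(cm) / cm.sum()

# Per-class recall: diagonal divided by the row totals (samples of each true label).
recall = np.diag(cm) / cm.sum(axis=1)

# Per-class precision: diagonal divided by the column totals (samples predicted as each label).
precision = np.diag(cm) / cm.sum(axis=0)

# Most frequent single confusion: the largest off-diagonal cell.
off_diagonal = cm.copy()
np.fill_diagonal(off_diagonal, 0)
true_label, pred_label = np.unravel_index(off_diagonal.argmax(), off_diagonal.shape)

print(f"accuracy = {accuracy:.4f}")  # accuracy = 0.9765
print("lowest recall: digit", recall.argmin())
print("most common confusion:", true_label, "predicted as", pred_label)
```

The accuracy computed this way (9765 correct out of 10000) agrees with the `Test Acc` value reported above. In this run the most frequent single error is a 5 predicted as a 3 (14 cases), followed by a 4 predicted as a 9 (13 cases).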

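Footnotes:

¹ 主流的深度學習模型有哪些？ [What are the mainstream deep learning models?]
² VGG 16 Easiest Explanation
³ Extract Features, Visualize Filters and Feature Maps in VGG16 and VGG19 CNN Models
⁴ Inception-v1 (GoogLeNet) — Winner of ILSVRC 2014 (Image Classification)
⁵ Residual Leaning: 認識ResNet與他的冠名後繼者ResNeXt、ResNeSt [Getting to know ResNet and its namesake successors ResNeXt and ResNeSt]
⁶ 直觀理解ResNet —簡介、觀念及實作(Python Keras) [An intuitive understanding of ResNet: introduction, concepts, and implementation (Python Keras)]
⁷ Deep Residual Learning for Image Recognition
⁸ 台科大資訊科技 [Taiwan Tech (NTUST) Information Technology]
⁹ Difference Between the Cost, Loss, and the Objective Function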
