Strategies for Reducing Computational Load

This tutorial covers two strategies for trimming TPOT's computational load in order to reduce run time.

Successive Halving

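This idea was first tested with TPOT by Parmentier et al. in "TPOT-SH: a Faster Optimization Algorithm to Solve the AutoML Problem on Large Datasets". The algorithm runs in two stages. Initially, it trains early generations on a small subset of the data with a large population. Later generations then evaluate a smaller number of promising pipelines on larger, or even full, portions of the data. This approach quickly identifies the best-performing pipeline configurations through coarse initial evaluations, followed by more thorough ones. More information on this strategy can be found in Tutorial 8.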
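
The two-stage idea can be illustrated with a generic successive-halving loop. This is a minimal sketch and not TPOT's implementation: the `successive_halving` helper, the `noisy_score` stand-in for "train on a fraction of the data and score", and the keep-the-top-half rule are all assumptions made for illustration.

```python
import random

def successive_halving(candidates, evaluate, budgets):
    """Keep the better half of the candidates after each budget level.

    candidates: list of candidate identifiers
    evaluate:   function (candidate, budget) -> score, higher is better
    budgets:    increasing data fractions, e.g. [0.25, 0.5, 1.0]
    """
    survivors = list(candidates)
    for budget in budgets:
        ranked = sorted(survivors, key=lambda c: evaluate(c, budget), reverse=True)
        # Keep the top half (at least one) for the next, larger budget.
        survivors = ranked[: max(1, len(ranked) // 2)]
    return survivors

# Hypothetical stand-in for pipeline evaluation: each candidate has a hidden
# quality that is observed with noise, and the noise shrinks as the budget grows.
rng = random.Random(0)
true_quality = {f"pipeline_{i}": i / 15 for i in range(16)}

def noisy_score(candidate, budget):
    return true_quality[candidate] + rng.gauss(0, 0.02 * (1 - budget))

winners = successive_halving(list(true_quality), noisy_score, budgets=[0.25, 0.5, 1.0])
print(winners)
```

Cheap, noisy evaluations discard most candidates early, so only a few of them are ever scored on the full data.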

In this tutorial, we will cover the following parameters:

population_size

initial_population_size

population_scaling

generations_until_end_population

budget_range

generations_until_end_budget

budget_scaling

stepwise_steps

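Population size is the number of individuals evaluated in each generation. Budget is the proportion of the data that is sampled. By adjusting these parameters, we can control how quickly the budget increases and how the population size changes over time. This is typically used to narrow down the best models quickly by evaluating a large number of pipelines on small subsets of the data, and then to obtain better estimates for fewer pipelines using larger samples. This lowers the overall computational cost by spending less time evaluating poorly performing pipelines.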

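population_size determines the number of individuals evaluated in each generation. Sometimes we may want to evaluate more or fewer individuals in the early generations. The initial_population_size parameter specifies the starting size of the population. Over the course of generations_until_end_population generations, the population size gradually moves from initial_population_size to population_size. population_scaling dictates how fast that scaling takes place. The interpolation over generations_until_end_population is done stepwise, with the number of steps specified by stepwise_steps.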
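
To make the stepwise interpolation concrete, here is a minimal sketch that assumes plain linear spacing between plateaus. The `stepwise_linear_schedule` helper is hypothetical; TPOT's own `tpot.utils.beta_interpolation` additionally takes a `scale` argument that changes the shape of the curve, which is not modelled here.

```python
def stepwise_linear_schedule(start, end, n, n_steps):
    """Return n values that move from start to end in n_steps plateaus.

    Assumption: plateaus are linearly spaced between start and end.
    """
    if n_steps < 2 or n < n_steps:
        raise ValueError("need n_steps >= 2 and n >= n_steps")
    levels = [start + (end - start) * k / (n_steps - 1) for k in range(n_steps)]
    # Generation g uses plateau number floor(g * n_steps / n).
    return [levels[g * n_steps // n] for g in range(n)]

# Same settings as in this tutorial: 100 -> 30 individuals and a budget of
# 0.3 -> 1.0, both over 50 generations in 5 steps.
population = [round(v) for v in stepwise_linear_schedule(100, 30, n=50, n_steps=5)]
budget = stepwise_linear_schedule(0.3, 1.0, n=50, n_steps=5)
print(population[0], population[-1], len(set(population)))
```

With 50 generations and 5 steps, each plateau lasts 10 generations, so the population size and budget change only four times during the run.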

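Budget scaling follows the same process: the budget moves from budget_range[0] to budget_range[1] over generations_until_end_budget generations, at a rate set by budget_scaling.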

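The following cells illustrate how the population size and budget change over time with the given settings. (Note that TPOT converges rather quickly on this dataset, but early stopping is turned off so that the run completes in full.)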

import matplotlib.pyplot as plt
import tpot

population_size=30
initial_population_size=100
population_scaling = .5
generations_until_end_population = 50

budget_range = [.3,1]
generations_until_end_budget=50
budget_scaling = .5
stepwise_steps = 5

#Population and budget use stepwise
fig, ax1 = plt.subplots()
ax2 = ax1.twinx()

interpolated_values_population = tpot.utils.beta_interpolation(start=initial_population_size, end=population_size, n=generations_until_end_population, n_steps=stepwise_steps, scale=population_scaling)
interpolated_values_budget = tpot.utils.beta_interpolation(start=budget_range[0], end=budget_range[1], n=generations_until_end_budget, n_steps=stepwise_steps, scale=budget_scaling)
ax1.step(list(range(len(interpolated_values_population))), interpolated_values_population, label=f"population size")
ax2.step(list(range(len(interpolated_values_budget))), interpolated_values_budget, label=f"budget", color='r')
ax1.set_xlabel("generation")
ax1.set_ylabel("population size")
ax2.set_ylabel("budget")

ax1.legend(loc='center left', bbox_to_anchor=(1.1, 0.4))
ax2.legend(loc='center left', bbox_to_anchor=(1.1, 0.3))
plt.show()

[Figure: step plot of population size (left axis) and budget (right axis) against generation.]

# A search over the 'linear' search space, scored with 10-fold cross-validated
# ROC AUC on the breast cancer dataset.

import time

import sklearn
import sklearn.datasets
import sklearn.metrics
import sklearn.model_selection
import tpot

X, y = sklearn.datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, random_state=1)
# response_method replaces the deprecated needs_proba argument (scikit-learn >= 1.4).
scorer = sklearn.metrics.make_scorer(sklearn.metrics.roc_auc_score, response_method="predict_proba", multi_class='ovr')


est = tpot.TPOTEstimator(
    generations=50,
    max_time_mins=None,
    scorers=['roc_auc_ovr'],
    scorers_weights=[1],
    classification=True,
    search_space = 'linear',
    n_jobs=32,
    cv=10,
    verbose=3,

    population_size=population_size,
    initial_population_size=initial_population_size,
    population_scaling = population_scaling,
    generations_until_end_population = generations_until_end_population,
    
    budget_range = budget_range,
    generations_until_end_budget=generations_until_end_budget,
    )



start = time.time()
est.fit(X_train, y_train)
print(f"total time: {time.time()-start}")

print("test score: ", scorer(est, X_test, y_test))

Generation:   2%|▏         | 1/50 [00:20<16:51, 20.64s/it]

Generation:  1
Best roc_auc_score score: 1.0

Generation:   4%|▍         | 2/50 [00:45<18:39, 23.32s/it]

Generation:  2
Best roc_auc_score score: 1.0

Generation:   6%|▌         | 3/50 [01:19<22:03, 28.17s/it]

Generation:  3
Best roc_auc_score score: 1.0

Generation:   8%|▊         | 4/50 [01:57<24:30, 31.97s/it]

Generation:  4
Best roc_auc_score score: 1.0

Generation:  10%|█         | 5/50 [02:24<22:44, 30.32s/it]

Generation:  5
Best roc_auc_score score: 1.0

Generation:  12%|█▏        | 6/50 [03:09<25:40, 35.02s/it]

Generation:  6
Best roc_auc_score score: 1.0

Generation:  14%|█▍        | 7/50 [03:50<26:29, 36.96s/it]

Generation:  7
Best roc_auc_score score: 1.0

Generation:  16%|█▌        | 8/50 [04:27<26:04, 37.26s/it]

Generation:  8
Best roc_auc_score score: 1.0

Generation:  18%|█▊        | 9/50 [05:20<28:45, 42.08s/it]

Generation:  9
Best roc_auc_score score: 1.0

Generation:  20%|██        | 10/50 [06:03<28:09, 42.25s/it]

Generation:  10
Best roc_auc_score score: 1.0

Generation:  22%|██▏       | 11/50 [07:16<33:43, 51.88s/it]

Generation:  11
Best roc_auc_score score: 1.0

Generation:  24%|██▍       | 12/50 [08:04<31:55, 50.42s/it]

Generation:  12
Best roc_auc_score score: 1.0

Generation:  26%|██▌       | 13/50 [09:13<34:35, 56.10s/it]

Generation:  13
Best roc_auc_score score: 1.0

Generation:  28%|██▊       | 14/50 [10:15<34:49, 58.04s/it]

Generation:  14
Best roc_auc_score score: 1.0

Generation:  30%|███       | 15/50 [11:36<37:49, 64.85s/it]

Generation:  15
Best roc_auc_score score: 1.0

Generation:  32%|███▏      | 16/50 [12:59<39:47, 70.21s/it]

Generation:  16
Best roc_auc_score score: 1.0

Generation:  34%|███▍      | 17/50 [14:05<37:58, 69.05s/it]

Generation:  17
Best roc_auc_score score: 1.0

Generation:  36%|███▌      | 18/50 [15:23<38:13, 71.66s/it]

Generation:  18
Best roc_auc_score score: 1.0

Generation:  38%|███▊      | 19/50 [17:03<41:30, 80.33s/it]

Generation:  19
Best roc_auc_score score: 1.0

Generation:  40%|████      | 20/50 [18:32<41:28, 82.96s/it]

Generation:  20
Best roc_auc_score score: 1.0

Generation:  42%|████▏     | 21/50 [22:13<1:00:02, 124.23s/it]

Generation:  21
Best roc_auc_score score: 1.0

Generation:  44%|████▍     | 22/50 [24:54<1:03:11, 135.40s/it]

Generation:  22
Best roc_auc_score score: 1.0

Generation:  46%|████▌     | 23/50 [27:03<1:00:01, 133.40s/it]

Generation:  23
Best roc_auc_score score: 1.0

Generation:  48%|████▊     | 24/50 [29:09<56:48, 131.10s/it]

Generation:  24
Best roc_auc_score score: 1.0

Generation:  50%|█████     | 25/50 [31:26<55:27, 133.09s/it]

Generation:  25
Best roc_auc_score score: 1.0

Generation:  52%|█████▏    | 26/50 [33:27<51:48, 129.50s/it]

Generation:  26
Best roc_auc_score score: 1.0

Generation:  54%|█████▍    | 27/50 [35:51<51:12, 133.60s/it]

Generation:  27
Best roc_auc_score score: 1.0

Generation:  56%|█████▌    | 28/50 [38:40<52:54, 144.28s/it]

Generation:  28
Best roc_auc_score score: 1.0

Generation:  58%|█████▊    | 29/50 [40:49<48:55, 139.80s/it]

Generation:  29
Best roc_auc_score score: 1.0

Generation:  60%|██████    | 30/50 [43:49<50:36, 151.83s/it]

Generation:  30
Best roc_auc_score score: 1.0

Generation:  62%|██████▏   | 31/50 [48:19<59:20, 187.37s/it]

Generation:  31
Best roc_auc_score score: 1.0

Generation:  64%|██████▍   | 32/50 [50:18<50:01, 166.77s/it]

Generation:  32
Best roc_auc_score score: 1.0

Generation:  66%|██████▌   | 33/50 [52:22<43:38, 154.01s/it]

Generation:  33
Best roc_auc_score score: 1.0

Generation:  68%|██████▊   | 34/50 [54:38<39:35, 148.46s/it]

Generation:  34
Best roc_auc_score score: 1.0

Generation:  70%|███████   | 35/50 [57:43<39:52, 159.52s/it]

Generation:  35
Best roc_auc_score score: 1.0

Generation:  72%|███████▏  | 36/50 [1:00:16<36:44, 157.44s/it]

Generation:  36
Best roc_auc_score score: 1.0

Generation:  74%|███████▍  | 37/50 [1:04:37<40:51, 188.57s/it]

Generation:  37
Best roc_auc_score score: 1.0

Generation:  76%|███████▌  | 38/50 [1:27:43<1:49:32, 547.68s/it]

Generation:  38
Best roc_auc_score score: 1.0

Generation:  78%|███████▊  | 39/50 [1:29:40<1:16:43, 418.49s/it]

Generation:  39
Best roc_auc_score score: 1.0

Generation:  80%|████████  | 40/50 [1:32:56<58:39, 351.95s/it]

Generation:  40
Best roc_auc_score score: 1.0

Generation:  82%|████████▏ | 41/50 [1:36:55<47:41, 317.92s/it]

Generation:  41
Best roc_auc_score score: 1.0

Generation:  84%|████████▍ | 42/50 [1:38:54<34:27, 258.41s/it]

Generation:  42
Best roc_auc_score score: 1.0

Generation:  86%|████████▌ | 43/50 [1:40:59<25:28, 218.30s/it]

Generation:  43
Best roc_auc_score score: 1.0

Generation:  88%|████████▊ | 44/50 [1:42:38<18:14, 182.39s/it]

Generation:  44
Best roc_auc_score score: 1.0

Generation:  90%|█████████ | 45/50 [1:44:18<13:08, 157.68s/it]

Generation:  45
Best roc_auc_score score: 1.0

Generation:  92%|█████████▏| 46/50 [1:46:13<09:39, 144.91s/it]

Generation:  46
Best roc_auc_score score: 1.0

Generation:  94%|█████████▍| 47/50 [1:48:29<07:06, 142.24s/it]

Generation:  47
Best roc_auc_score score: 1.0

Generation:  96%|█████████▌| 48/50 [1:50:06<04:17, 128.67s/it]

Generation:  48
Best roc_auc_score score: 1.0

Generation:  98%|█████████▊| 49/50 [1:52:18<02:09, 129.85s/it]

Generation:  49
Best roc_auc_score score: 1.0

Generation: 100%|██████████| 50/50 [1:54:14<00:00, 137.09s/it]

Generation:  50
Best roc_auc_score score: 1.0

total time: 6862.724096059799
test score:  0.9917355371900827

Cross-Validation Early Pruning

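Most often, we evaluate pipelines using cross-validation. However, we can often tell within the first few folds whether a pipeline has a reasonable chance of outperforming the previous best pipelines. For example, if the best score so far is .92 AUROC and the average score over the first five folds of the current pipeline is only around .61, we can be reasonably confident that the next five folds are unlikely to put this pipeline ahead of the others. By not computing the remaining folds, we can save a substantial amount of compute. TPOT can use two strategies to accomplish this (more information on these strategies can be found in Tutorial 8).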

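Threshold pruning: pipelines must reach a predefined percentile threshold (based on the scores of previous pipelines) at each cross-validation (CV) fold in order to continue.
Selection pruning: within each population, only the top N% of pipelines (ranked by performance in the previous CV folds) are selected for evaluation on the next fold.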

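We can further reduce the computational load by terminating the evaluation of an individual pipeline early when its first few CV scores are not promising. Note that this is different from early stopping of the entire algorithm. In this section we will cover: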

threshold_evaluation_pruning

threshold_evaluation_scaling

min_history_threshold

selection_evaluation_pruning

selection_evaluation_scaling

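Threshold early stopping uses previous scores to identify and terminate the cross-validation evaluation of poorly performing pipelines. We calculate percentile scores from the previously evaluated pipelines. A pipeline must reach the given percentile on each fold for the next fold to be evaluated; otherwise the pipeline is discarded.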

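The threshold_evaluation_pruning parameter is a list that specifies the starting and ending percentiles to use as the threshold for early stopping. The threshold_evaluation_scaling parameter is a float that controls how fast the threshold moves from the starting percentile to the ending percentile. The min_history_threshold parameter specifies the minimum number of previous scores required before threshold early stopping is applied. This ensures that the algorithm has enough historical data to make an informed decision about when to stop evaluating a pipeline.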
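
The threshold rule can be sketched as follows. This is a simplified illustration, not TPOT's code: the helper names, the nearest-rank percentile, and the choice to compare each fold's score against that fold's history are assumptions. The `min_history` argument plays the role described for min_history_threshold.

```python
import math

def percentile(values, q):
    """Nearest-rank percentile (q in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]

def evaluate_with_threshold_pruning(fold_scores, history, percentiles, min_history=20):
    """Walk through a pipeline's CV folds and stop once it misses the threshold.

    fold_scores: score this pipeline obtains on each fold, in order
    history:     history[k] holds scores earlier pipelines obtained on fold k
    percentiles: percentile required on each fold, e.g. rising from 30 to 90
    Returns the scores actually computed and whether the pipeline was pruned.
    """
    computed = []
    for k, score in enumerate(fold_scores):
        computed.append(score)
        enough_history = len(history[k]) >= min_history
        if enough_history and score < percentile(history[k], percentiles[k]):
            return computed, True  # remaining folds are never evaluated
    return computed, False

# Toy history: 20 earlier pipelines scored between 0.80 and 0.895 on every fold.
history = [[0.80 + 0.005 * i for i in range(20)] for _ in range(3)]
weak = evaluate_with_threshold_pruning([0.85, 0.84, 0.86], history, [30, 60, 90])
strong = evaluate_with_threshold_pruning([0.90, 0.90, 0.90], history, [30, 60, 90])
print(weak, strong)
```

In this toy run the weaker pipeline clears the 30th percentile on the first fold but misses the 60th on the second, so its third fold is skipped.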

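Selection early stopping uses a selection algorithm after each fold to choose which individuals will be evaluated on the next fold. For example, after evaluating 100 individuals on fold 1, we may want to evaluate only the best 50 on the remaining folds.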
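
A minimal sketch of selection after each fold is shown below. The `selection_pruning` helper is hypothetical, and two choices in it are assumptions rather than TPOT's documented behaviour: survivors are ranked by their mean score over the folds seen so far, and the fraction to keep is taken relative to the starting population.

```python
def selection_pruning(fold_scores, keep_fractions):
    """Return the names still alive after each fold.

    fold_scores:    dict mapping pipeline name -> list of per-fold scores
    keep_fractions: fraction of the starting population kept after each fold
    """
    alive = list(fold_scores)
    n_start = len(alive)
    alive_after_fold = []
    for k, fraction in enumerate(keep_fractions):
        def mean_so_far(name):
            # Average over folds 0..k, the only folds evaluated so far.
            return sum(fold_scores[name][: k + 1]) / (k + 1)

        n_keep = max(1, min(len(alive), round(fraction * n_start)))
        alive = sorted(alive, key=mean_so_far, reverse=True)[:n_keep]
        alive_after_fold.append(list(alive))
    return alive_after_fold

# Ten toy pipelines whose scores increase with their index.
toy_scores = {f"p{i}": [0.5 + 0.04 * i] * 3 for i in range(10)}
survivors = selection_pruning(toy_scores, keep_fractions=[0.9, 0.6, 0.3])
print([len(s) for s in survivors], survivors[-1])
```

Only the survivors of one fold are evaluated on the next, so the number of fold evaluations shrinks as the kept fraction falls.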

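The selection_evaluation_pruning parameter is a list that specifies the starting and ending fraction of the population to select in each round of CV. It determines which individuals are evaluated on the next fold. The selection_evaluation_scaling parameter is a float that controls how fast this fraction moves from its starting value to its ending value.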

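By adjusting these parameters, we can control how the algorithm selects the individuals to evaluate on the next fold and when it stops evaluating poorly performing pipelines.

In practice, the values of these parameters will depend on the specific problem and the available computational resources.

In the following sections, we show how to set and tune these parameters using Python code in a Jupyter Notebook, along with examples of how they affect the algorithm's performance.

(Note that in these small test cases you may not notice much, or any, performance improvement; these methods are likely to be more beneficial in real-world scenarios with larger datasets and pipelines that are slow to evaluate.)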

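Considerations: it is important to be aware of how CV pruning interacts with the evolutionary algorithm. When pipelines are pruned by one of these methods, they are removed from the live population and are no longer used to inform the TPOT algorithm. If too many pipelines are pruned, this can reduce the diversity of pipelines in each generation and limit TPOT's ability to learn. The pruning methods may also affect TPOT's run time. If the pruning algorithm removes pipelines that perform slightly worse but run faster, TPOT will most likely fill the next generation with only slower-running pipelines, which technically increases the total run time. This may be acceptable, since more compute is then spent on higher-performing pipelines.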

import matplotlib.pyplot as plt
import tpot
import time
import sklearn
import sklearn.datasets

threshold_evaluation_pruning = [30, 90]
threshold_evaluation_scaling = .2 #.5
cv = 10

# Percentile threshold required at each CV fold
fig, ax1 = plt.subplots()

interpolated_values = tpot.utils.beta_interpolation(start=threshold_evaluation_pruning[0], end=threshold_evaluation_pruning[-1], n=cv, n_steps=cv, scale=threshold_evaluation_scaling)
ax1.step(list(range(len(interpolated_values))), interpolated_values, label=f"threshold")
ax1.set_xlabel("fold")
ax1.set_ylabel("percentile")
#ax1.legend(loc='center left', bbox_to_anchor=(1.1, 0.4))
plt.show()

import tpot
from tpot.search_spaces.pipelines import *
from tpot.search_spaces.nodes import *
from tpot.config.get_configspace import get_search_space
import sklearn.model_selection
import sklearn


selectors = get_search_space(["selectors","selectors_classification", "Passthrough"], random_state=42,)
estimators = get_search_space(['XGBClassifier'],random_state=42,)

scalers = get_search_space(["scalers","Passthrough"],random_state=42,)

transformers_layer =UnionPipeline([
                        ChoicePipeline([
                            DynamicUnionPipeline(get_search_space(["transformers"], random_state=42,)),
                            get_search_space("SkipTransformer"),
                        ]),
                        get_search_space("Passthrough")
                        ]
                    )
    
search_space = SequentialPipeline(search_spaces=[
                                            scalers,
                                            selectors, 
                                            transformers_layer,
                                            estimators,
                                            ])

import matplotlib.pyplot as plt
import tpot
import time
import sklearn
import sklearn.datasets

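# response_method replaces the deprecated needs_proba argument (scikit-learn >= 1.4).
scorer = sklearn.metrics.make_scorer(sklearn.metrics.roc_auc_score, response_method="predict_proba", multi_class='ovr')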

X, y = sklearn.datasets.make_classification(n_samples=5000, n_features=20, n_classes=5, random_state=1, n_informative=15, n_redundant=5, n_repeated=0, n_clusters_per_class=3, class_sep=.8)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, random_state=1)

# search_space = tpot.config.template_search_spaces.get_template_search_spaces("linear",inner_predictors=False, random_state=42)

# no pruning
est = tpot.TPOTEstimator(  
                            generations=10,
                            max_time_mins=None,
                            scorers=['roc_auc_ovr'],
                            scorers_weights=[1],
                            classification=True,
                            search_space = search_space,
                            population_size=100,
                            n_jobs=32,
                            cv=cv,
                            verbose=3,
                            random_state=42,
                            )


start = time.time()
est.fit(X_train, y_train)
print(f"total time: {time.time()-start}")
print("test score: ", scorer(est, X_test, y_test))

Generation:  10%|█         | 1/10 [02:42<24:26, 162.98s/it]

Generation:  1
Best roc_auc_score score: 0.9212394545585599

Generation:  20%|██        | 2/10 [06:10<25:14, 189.31s/it]

Generation:  2
Best roc_auc_score score: 0.921316057689257

Generation:  30%|███       | 3/10 [10:07<24:37, 211.00s/it]

Generation:  3
Best roc_auc_score score: 0.9291812014325632

Generation:  40%|████      | 4/10 [16:26<27:43, 277.33s/it]

Generation:  4
Best roc_auc_score score: 0.9291812014325632

Generation:  50%|█████     | 5/10 [21:24<23:44, 284.90s/it]

Generation:  5
Best roc_auc_score score: 0.9309353469187138

Generation:  60%|██████    | 6/10 [28:02<21:32, 323.19s/it]

Generation:  6
Best roc_auc_score score: 0.9328394699598583

Generation:  70%|███████   | 7/10 [36:02<18:43, 374.57s/it]

Generation:  7
Best roc_auc_score score: 0.9341963775600117

Generation:  80%|████████  | 8/10 [45:34<14:34, 437.41s/it]

Generation:  8
Best roc_auc_score score: 0.9341963775600117

Generation:  90%|█████████ | 9/10 [54:40<07:51, 471.27s/it]

Generation:  9
Best roc_auc_score score: 0.9356175936945494

Generation: 100%|██████████| 10/10 [1:03:45<00:00, 382.55s/it]

Generation:  10
Best roc_auc_score score: 0.9371852416832148

total time: 3836.4180731773376
test score:  0.9422368174356803

import tpot.config
import tpot.config.template_search_spaces
import tpot.search_spaces



# search_space = tpot.config.get_search_space(["RandomForestClassifier"])

est = tpot.TPOTEstimator(  
                            generations=10,
                            max_time_mins=None,
                            scorers=['roc_auc_ovr'],
                            scorers_weights=[1],
                            classification=True,
                            search_space = search_space,
                            population_size=100,
                            n_jobs=32,
                            cv=cv,
                            verbose=3,
                            random_state=42,

                            threshold_evaluation_pruning = threshold_evaluation_pruning,
                            threshold_evaluation_scaling = threshold_evaluation_scaling,
                            )


start = time.time()
est.fit(X_train, y_train)
print(f"total time: {time.time()-start}")
print("test score: ", scorer(est, X_test, y_test))

Generation:  10%|█         | 1/10 [02:57<26:40, 177.87s/it]

Generation:  1
Best roc_auc_score score: 0.9212394545585602

Generation:  20%|██        | 2/10 [03:57<14:24, 108.05s/it]

Generation:  2
Best roc_auc_score score: 0.9212394545585602

Generation:  30%|███       | 3/10 [05:58<13:18, 114.13s/it]

Generation:  3
Best roc_auc_score score: 0.9212394545585602

Generation:  40%|████      | 4/10 [07:54<11:29, 114.96s/it]

Generation:  4
Best roc_auc_score score: 0.9212394545585602

Generation:  50%|█████     | 5/10 [10:43<11:11, 134.34s/it]

Generation:  5
Best roc_auc_score score: 0.921316057689257

Generation:  60%|██████    | 6/10 [13:16<09:23, 140.78s/it]

Generation:  6
Best roc_auc_score score: 0.921316057689257

Generation:  70%|███████   | 7/10 [15:05<06:31, 130.43s/it]

Generation:  7
Best roc_auc_score score: 0.921316057689257

Generation:  80%|████████  | 8/10 [18:01<04:49, 144.72s/it]

Generation:  8
Best roc_auc_score score: 0.9255953925256337

Generation:  90%|█████████ | 9/10 [19:53<02:14, 134.59s/it]

Generation:  9
Best roc_auc_score score: 0.9255953925256337

Generation: 100%|██████████| 10/10 [21:24<00:00, 128.50s/it]

Generation:  10
Best roc_auc_score score: 0.9255953925256337

total time: 1295.825649023056
test score:  0.9320499022897322

import matplotlib.pyplot as plt
import tpot

selection_evaluation_pruning = [.9, .3]
selection_evaluation_scaling = .2

# Fraction of the population selected at each CV fold
fig, ax1 = plt.subplots()

interpolated_values = tpot.utils.beta_interpolation(start=selection_evaluation_pruning[0], end=selection_evaluation_pruning[-1], n=cv, n_steps=cv, scale=selection_evaluation_scaling)
ax1.step(list(range(len(interpolated_values))), interpolated_values, label=f"threshold")
ax1.set_xlabel("fold")
ax1.set_ylabel("percent to select")
#ax1.legend(loc='center left', bbox_to_anchor=(1.1, 0.4))
plt.show()

est = tpot.TPOTEstimator(  
                            generations=10,
                            max_time_mins=None,
                            scorers=['roc_auc_ovr'],
                            scorers_weights=[1],
                            classification=True,
                            search_space = search_space,
                            population_size=100,
                            n_jobs=32,
                            cv=cv,
                            verbose=3,
                            random_state=42,

                            selection_evaluation_pruning  = selection_evaluation_pruning,
                            selection_evaluation_scaling = selection_evaluation_scaling,
                            )


start = time.time()
est.fit(X_train, y_train)
print(f"total time: {time.time()-start}")
print("test score: ", scorer(est, X_test, y_test))

Generation:  10%|█         | 1/10 [02:23<21:31, 143.50s/it]

Generation:  1
Best roc_auc_score score: 0.9212394545585602

Generation:  20%|██        | 2/10 [04:00<15:30, 116.31s/it]

Generation:  2
Best roc_auc_score score: 0.9212394545585602

Generation:  30%|███       | 3/10 [05:42<12:48, 109.73s/it]

Generation:  3
Best roc_auc_score score: 0.9212394545585602

Generation:  40%|████      | 4/10 [07:36<11:08, 111.45s/it]

Generation:  4
Best roc_auc_score score: 0.9212394545585602

Generation:  50%|█████     | 5/10 [09:12<08:48, 105.72s/it]

Generation:  5
Best roc_auc_score score: 0.9212394545585602

Generation:  60%|██████    | 6/10 [11:04<07:11, 107.81s/it]

Generation:  6
Best roc_auc_score score: 0.9212394545585602

Generation:  70%|███████   | 7/10 [12:54<05:26, 108.71s/it]

Generation:  7
Best roc_auc_score score: 0.9212394545585602

Generation:  80%|████████  | 8/10 [14:45<03:38, 109.49s/it]

Generation:  8
Best roc_auc_score score: 0.925549420935039

Generation:  90%|█████████ | 9/10 [16:49<01:54, 114.03s/it]

Generation:  9
Best roc_auc_score score: 0.925549420935039

Generation: 100%|██████████| 10/10 [18:36<00:00, 111.67s/it]

Generation:  10
Best roc_auc_score score: 0.925549420935039

/opt/anaconda3/envs/tpotenv/lib/python3.10/site-packages/sklearn/decomposition/_fastica.py:595: UserWarning: n_components is too large: it will be set to 20
  warnings.warn(
/opt/anaconda3/envs/tpotenv/lib/python3.10/site-packages/sklearn/decomposition/_fastica.py:128: ConvergenceWarning: FastICA did not converge. Consider increasing tolerance or the maximum number of iterations.
  warnings.warn(

total time: 1129.1526980400085
test score:  0.9324219154371735
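
The trade-off between run time and score in the three runs above can be summarized with a little arithmetic. The numbers below are copied from the printed outputs; each configuration was run only once, so the comparison is indicative rather than conclusive.

```python
# Wall-clock seconds and held-out ROC AUC, copied from the three runs above.
runs = {
    "no pruning": (3836.4180731773376, 0.9422368174356803),
    "threshold pruning": (1295.825649023056, 0.9320499022897322),
    "selection pruning": (1129.1526980400085, 0.9324219154371735),
}

baseline_seconds, baseline_score = runs["no pruning"]
summary = {}
for name, (seconds, score) in runs.items():
    summary[name] = {
        "speedup": baseline_seconds / seconds,
        "score_change": score - baseline_score,
    }
    print(f"{name:>17}: {seconds / 60:5.1f} min, "
          f"{summary[name]['speedup']:.2f}x speedup, "
          f"test AUC change {summary[name]['score_change']:+.4f}")
```

On this dataset both pruning strategies cut the run time to roughly a third while the test AUC dropped by about 0.01.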

est.evaluated_individuals[est.evaluated_individuals['roc_auc_score_step_9']>0]

	roc_auc_score	Parents	Variation_Function	Individual	Generation	roc_auc_score_step_0	Submitted Timestamp	Completed Timestamp	Eval Error	roc_auc_score_step_1	roc_auc_score_step_2	roc_auc_score_step_3	roc_auc_score_step_4	roc_auc_score_step_5	roc_auc_score_step_6	roc_auc_score_step_7	roc_auc_score_step_8	roc_auc_score_step_9	Pareto_Front	Instance
0	0.812263	NaN	NaN	<tpot.search_spaces.pipelines.sequential.Seque...	0.0	0.811153	1.740198e+09	1.740198e+09	None	0.799213	0.807710	0.813587	0.797528	0.820692	0.827614	0.815069	0.805447	0.824616	NaN	(MinMaxScaler(), RFE(estimator=ExtraTreesClass...
1	0.848068	NaN	NaN	<tpot.search_spaces.pipelines.sequential.Seque...	0.0	0.846478	1.740197e+09	1.740197e+09	None	0.839894	0.844619	0.848321	0.846915	0.857902	0.855875	0.827655	0.850938	0.862081	NaN	(Passthrough(), RFE(estimator=ExtraTreesClassi...
4	0.831502	NaN	NaN	<tpot.search_spaces.pipelines.sequential.Seque...	0.0	0.817219	1.740197e+09	1.740197e+09	None	0.827888	0.821911	0.825558	0.830020	0.831529	0.836955	0.844634	0.832499	0.846805	NaN	(StandardScaler(), VarianceThreshold(threshold...
5	0.830374	NaN	NaN	<tpot.search_spaces.pipelines.sequential.Seque...	0.0	0.817150	1.740197e+09	1.740197e+09	None	0.831885	0.820694	0.824899	0.824409	0.827861	0.833923	0.844308	0.832798	0.845818	NaN	(MinMaxScaler(), SelectFromModel(estimator=Ext...
6	0.850091	NaN	NaN	<tpot.search_spaces.pipelines.sequential.Seque...	0.0	0.843524	1.740197e+09	1.740197e+09	None	0.841176	0.840619	0.846209	0.849561	0.854367	0.858035	0.860165	0.845179	0.862077	NaN	(Normalizer(norm='max'), SelectFwe(alpha=0.000...
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
983	0.886974	(13, 13)	ind_mutate	<tpot.search_spaces.pipelines.sequential.Seque...	9.0	0.871580	1.740198e+09	1.740198e+09	None	0.887762	0.882504	0.860872	0.898100	0.885523	0.893527	0.904779	0.884557	0.900537	NaN	(StandardScaler(), SelectFromModel(estimator=E...
986	0.850281	(35, 470)	ind_crossover	<tpot.search_spaces.pipelines.sequential.Seque...	9.0	0.837493	1.740198e+09	1.740198e+09	None	0.858289	0.844141	0.851260	0.848909	0.853002	0.856132	0.845356	0.847830	0.860393	NaN	(StandardScaler(), SelectPercentile(percentile...
990	0.878811	(866, 866)	ind_mutate	<tpot.search_spaces.pipelines.sequential.Seque...	9.0	0.875842	1.740198e+09	1.740198e+09	None	0.862567	0.881858	0.885539	0.874347	0.888858	0.891205	0.882103	0.863952	0.881838	NaN	(Normalizer(norm='l1'), SelectPercentile(perce...
991	0.835669	(72, 855)	ind_crossover	<tpot.search_spaces.pipelines.sequential.Seque...	9.0	0.838375	1.740198e+09	1.740198e+09	None	0.844572	0.837234	0.822799	0.818868	0.840971	0.845122	0.816390	0.840709	0.851650	NaN	(MinMaxScaler(), SelectPercentile(percentile=4...
992	0.892459	(898, 898)	ind_mutate	<tpot.search_spaces.pipelines.sequential.Seque...	9.0	0.881991	1.740198e+09	1.740198e+09	None	0.893987	0.882514	0.887394	0.902290	0.894360	0.903944	0.884672	0.889588	0.903849	NaN	(RobustScaler(quantile_range=(0.0911728428421,...

326 rows × 20 columns
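
To see how much work pruning saved, one can count how many individuals were scored on each fold. The sketch below uses a toy DataFrame in place of est.evaluated_individuals and assumes that folds which were never evaluated are stored as missing values (NaN) in the roc_auc_score_step_k columns. The `evaluations_per_fold` helper is hypothetical.

```python
import numpy as np
import pandas as pd

def evaluations_per_fold(evaluated, prefix="roc_auc_score_step_"):
    """Count the non-missing scores in each per-fold column, in fold order."""
    fold_columns = sorted(
        (c for c in evaluated.columns if c.startswith(prefix)),
        key=lambda c: int(c[len(prefix):]),
    )
    return evaluated[fold_columns].notna().sum()

# Toy stand-in for est.evaluated_individuals: two pipelines were pruned early.
toy = pd.DataFrame({
    "roc_auc_score_step_0": [0.81, 0.85, 0.62, 0.70],
    "roc_auc_score_step_1": [0.80, 0.84, np.nan, 0.71],
    "roc_auc_score_step_2": [0.81, 0.84, np.nan, np.nan],
})
counts = evaluations_per_fold(toy)
print(counts)
```

Applied to the real table, the count for the last fold should match the number of rows returned by the filter on roc_auc_score_step_9 shown above, provided the missing-value assumption holds.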

All of the above methods can be used independently or together, as shown below:

est = tpot.TPOTEstimator(  
                            generations=10,
                            max_time_mins=None,
                            scorers=['roc_auc_ovr'],
                            scorers_weights=[1],
                            classification=True,
                            search_space = search_space,
                            population_size=30,
                            n_jobs=3,
                            cv=cv,
                            verbose=3,

                            initial_population_size=initial_population_size,
                            population_scaling = population_scaling,
                            generations_until_end_population = generations_until_end_population,
                            
                            budget_range = budget_range,
                            generations_until_end_budget=generations_until_end_budget,
                            
                            threshold_evaluation_pruning = threshold_evaluation_pruning,
                            threshold_evaluation_scaling = threshold_evaluation_scaling,

                            selection_evaluation_pruning  = selection_evaluation_pruning,
                            selection_evaluation_scaling = selection_evaluation_scaling,
                            )


start = time.time()
est.fit(X_train, y_train)
print(f"total time: {time.time()-start}")
print("test score: ", scorer(est, X_test, y_test))

Generation:  10%|█         | 1/10 [01:34<14:09, 94.40s/it]

Generation:  1
Best roc_auc_score score: 0.8515086951804098

Generation:  20%|██        | 2/10 [02:26<09:14, 69.36s/it]

Generation:  2
Best roc_auc_score score: 0.8515086951804098

Generation:  30%|███       | 3/10 [03:41<08:23, 71.97s/it]

Generation:  3
Best roc_auc_score score: 0.8515086951804098

Generation:  40%|████      | 4/10 [04:52<07:09, 71.53s/it]

Generation:  4
Best roc_auc_score score: 0.8515086951804098

Generation:  50%|█████     | 5/10 [05:52<05:37, 67.57s/it]

Generation:  5
Best roc_auc_score score: 0.8515086951804098

Generation:  60%|██████    | 6/10 [07:13<04:48, 72.10s/it]

Generation:  6
Best roc_auc_score score: 0.8515086951804098

Generation:  70%|███████   | 7/10 [08:06<03:17, 65.84s/it]

Generation:  7
Best roc_auc_score score: 0.8515086951804098

Generation:  80%|████████  | 8/10 [08:57<02:02, 61.13s/it]

Generation:  8
Best roc_auc_score score: 0.8515086951804098

Generation:  90%|█████████ | 9/10 [09:39<00:55, 55.14s/it]

Generation:  9
Best roc_auc_score score: 0.8515086951804098

Generation: 100%|██████████| 10/10 [10:17<00:00, 61.70s/it]

Generation:  10
Best roc_auc_score score: 0.8515086951804098

total time: 621.607882976532
test score:  0.9084772293865335