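Strategies for Reducing Computational Load
This tutorial covers two strategies for pruning TPOT's computational load to reduce run time.
Successive Halving
This idea was first tested with TPOT by Parmentier et al. in "TPOT-SH: a Faster Optimization Algorithm to Solve the AutoML Problem on Large Datasets". The algorithm runs in two phases. Initially, it trains early generations using a small subset of the data and a large population size. Later generations then evaluate a smaller number of promising pipelines on a larger portion of the data, or all of it. This approach quickly identifies the best-performing pipeline configurations through an initial rough evaluation, followed by a more thorough one. See Tutorial 8 for more information on this strategy.
In this tutorial, we will cover the following parameters: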
population_size
initial_population_size
population_scaling
generations_until_end_population
budget_range
generations_until_end_budget
budget_scaling
stepwise_steps
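The population size is the number of individuals evaluated in each generation. The budget is the proportion of the data that is sampled. By adjusting these parameters, we can control how quickly the budget increases and how the population size changes over time. This is often used to evaluate a large number of pipelines on a small subset of the data to quickly narrow down the best models, and then to obtain better estimates by evaluating fewer pipelines on larger samples. This can lower the overall computational cost by reducing the time spent evaluating poorly performing pipelines.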
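As a rough illustration of why this saves compute, the sketch below compares the work done in one generation under different settings. It assumes, purely for illustration, that the cost of evaluating a pipeline is proportional to the fraction of the data it is trained on. `relative_cost` is a hypothetical helper and not part of TPOT.

```python
# Assumption (illustrative only): evaluating one pipeline costs time
# proportional to the fraction of the data (the budget) it is trained on.
def relative_cost(population_size, budget):
    """Hypothetical helper: cost of one generation, in full-data pipeline evaluations."""
    return population_size * budget

early = relative_cost(100, 0.3)  # many pipelines on 30% of the data
late = relative_cost(30, 1.0)    # few pipelines on all of the data
naive = relative_cost(100, 1.0)  # many pipelines on all of the data

print(f"early generation: {early:.0f} units")
print(f"late generation:  {late:.0f} units")
print(f"no scaling:       {naive:.0f} units")
```

Under that assumption, the early and late generations each cost about 30 units, compared with 100 units for a generation that evaluates the full population on all of the data. Real training cost is rarely exactly linear in the data size, so this is only an order-of-magnitude guide.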
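population_size determines the number of individuals evaluated in each generation. Sometimes we may want to evaluate more or fewer individuals in the early generations. The initial_population_size parameter specifies the starting size of the population. The population size gradually moves from initial_population_size to population_size over the course of generations_until_end_population generations. population_scaling determines how fast this scaling happens. The interpolation over generations_until_end_population generations is done in discrete steps, with the number of steps specified by stepwise_steps.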
Budget scaling follows the same process, using budget_range, generations_until_end_budget, and budget_scaling.
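The stepwise behaviour can be sketched without TPOT. The helper below, `stepwise_schedule`, is a hypothetical stand-in that interpolates linearly and holds each level for an equal number of generations. TPOT's own `tpot.utils.beta_interpolation` also takes a `scale` argument that changes the shape of the curve, which this sketch does not attempt to reproduce.

```python
import numpy as np

def stepwise_schedule(start, end, n, n_steps):
    """Hypothetical helper: n values that move from start to end in n_steps flat steps.

    Uses plain linear spacing between levels; this is a simplification,
    not TPOT's beta-based interpolation.
    """
    levels = np.linspace(start, end, n_steps)
    # Map every generation index to the level of the step it falls in.
    step_index = np.minimum(np.arange(n) * n_steps // n, n_steps - 1)
    return levels[step_index]

# Same settings as in this tutorial: 100 -> 30 individuals, 30% -> 100% of the data.
population = stepwise_schedule(100, 30, n=50, n_steps=5).round().astype(int)
budget = stepwise_schedule(0.3, 1.0, n=50, n_steps=5)

for generation in (0, 10, 20, 30, 40, 49):
    print(generation, population[generation], round(float(budget[generation]), 3))
```

The population shrinks while the budget grows, and both change only at step boundaries (every 10 generations with these settings).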
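The following cell illustrates how the population size and budget change over time with the given settings. (Note that TPOT converges fairly quickly on this dataset, but we turned off early stopping to get the full run.)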
import matplotlib.pyplot as plt
import tpot
population_size=30
initial_population_size=100
population_scaling = .5
generations_until_end_population = 50
budget_range = [.3,1]
generations_until_end_budget=50
budget_scaling = .5
stepwise_steps = 5
#Population and budget use stepwise
fig, ax1 = plt.subplots()
ax2 = ax1.twinx()
interpolated_values_population = tpot.utils.beta_interpolation(start=initial_population_size, end=population_size, n=generations_until_end_population, n_steps=stepwise_steps, scale=population_scaling)
interpolated_values_budget = tpot.utils.beta_interpolation(start=budget_range[0], end=budget_range[1], n=generations_until_end_budget, n_steps=stepwise_steps, scale=budget_scaling)
ax1.step(list(range(len(interpolated_values_population))), interpolated_values_population, label=f"population size")
ax2.step(list(range(len(interpolated_values_budget))), interpolated_values_budget, label=f"budget", color='r')
ax1.set_xlabel("generation")
ax1.set_ylabel("population size")
ax2.set_ylabel("budget")
ax1.legend(loc='center left', bbox_to_anchor=(1.1, 0.4))
ax2.legend(loc='center left', bbox_to_anchor=(1.1, 0.3))
plt.show()
# Run TPOT on the breast cancer dataset with the linear search space,
# using the population and budget schedules defined above.
import time
import sklearn
import sklearn.datasets
import sklearn.metrics
import sklearn.model_selection
import tpot
X, y = sklearn.datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, random_state=1)
# Built-in one-vs-rest ROC AUC scorer; avoids the deprecated needs_proba argument.
scorer = sklearn.metrics.get_scorer('roc_auc_ovr')
est = tpot.TPOTEstimator(
    generations=50,
    max_time_mins=None,
    scorers=['roc_auc_ovr'],
    scorers_weights=[1],
    classification=True,
    search_space='linear',
    n_jobs=32,
    cv=10,
    verbose=3,
    population_size=population_size,
    initial_population_size=initial_population_size,
    population_scaling=population_scaling,
    generations_until_end_population=generations_until_end_population,
    budget_range=budget_range,
    generations_until_end_budget=generations_until_end_budget,
    budget_scaling=budget_scaling,
    stepwise_steps=stepwise_steps,
)
start = time.time()
est.fit(X_train, y_train)
print(f"total time: {time.time()-start}")
print("test score: ", scorer(est, X_test, y_test))
Generation: 2%|▏ | 1/50 [00:20<16:51, 20.64s/it]
Generation: 1 Best roc_auc_score score: 1.0
Generation: 4%|▍ | 2/50 [00:45<18:39, 23.32s/it]
Generation: 2 Best roc_auc_score score: 1.0
Generation: 6%|▌ | 3/50 [01:19<22:03, 28.17s/it]
Generation: 3 Best roc_auc_score score: 1.0
Generation: 8%|▊ | 4/50 [01:57<24:30, 31.97s/it]
Generation: 4 Best roc_auc_score score: 1.0
Generation: 10%|█ | 5/50 [02:24<22:44, 30.32s/it]
Generation: 5 Best roc_auc_score score: 1.0
Generation: 12%|█▏ | 6/50 [03:09<25:40, 35.02s/it]
Generation: 6 Best roc_auc_score score: 1.0
Generation: 14%|█▍ | 7/50 [03:50<26:29, 36.96s/it]
Generation: 7 Best roc_auc_score score: 1.0
Generation: 16%|█▌ | 8/50 [04:27<26:04, 37.26s/it]
Generation: 8 Best roc_auc_score score: 1.0
Generation: 18%|█▊ | 9/50 [05:20<28:45, 42.08s/it]
Generation: 9 Best roc_auc_score score: 1.0
Generation: 20%|██ | 10/50 [06:03<28:09, 42.25s/it]
Generation: 10 Best roc_auc_score score: 1.0
Generation: 22%|██▏ | 11/50 [07:16<33:43, 51.88s/it]
Generation: 11 Best roc_auc_score score: 1.0
Generation: 24%|██▍ | 12/50 [08:04<31:55, 50.42s/it]
Generation: 12 Best roc_auc_score score: 1.0
Generation: 26%|██▌ | 13/50 [09:13<34:35, 56.10s/it]
Generation: 13 Best roc_auc_score score: 1.0
Generation: 28%|██▊ | 14/50 [10:15<34:49, 58.04s/it]
Generation: 14 Best roc_auc_score score: 1.0
Generation: 30%|███ | 15/50 [11:36<37:49, 64.85s/it]
Generation: 15 Best roc_auc_score score: 1.0
Generation: 32%|███▏ | 16/50 [12:59<39:47, 70.21s/it]
Generation: 16 Best roc_auc_score score: 1.0
Generation: 34%|███▍ | 17/50 [14:05<37:58, 69.05s/it]
Generation: 17 Best roc_auc_score score: 1.0
Generation: 36%|███▌ | 18/50 [15:23<38:13, 71.66s/it]
Generation: 18 Best roc_auc_score score: 1.0
Generation: 38%|███▊ | 19/50 [17:03<41:30, 80.33s/it]
Generation: 19 Best roc_auc_score score: 1.0
Generation: 40%|████ | 20/50 [18:32<41:28, 82.96s/it]
Generation: 20 Best roc_auc_score score: 1.0
Generation: 42%|████▏ | 21/50 [22:13<1:00:02, 124.23s/it]
Generation: 21 Best roc_auc_score score: 1.0
Generation: 44%|████▍ | 22/50 [24:54<1:03:11, 135.40s/it]
Generation: 22 Best roc_auc_score score: 1.0
Generation: 46%|████▌ | 23/50 [27:03<1:00:01, 133.40s/it]
Generation: 23 Best roc_auc_score score: 1.0
Generation: 48%|████▊ | 24/50 [29:09<56:48, 131.10s/it]
Generation: 24 Best roc_auc_score score: 1.0
Generation: 50%|█████ | 25/50 [31:26<55:27, 133.09s/it]
Generation: 25 Best roc_auc_score score: 1.0
Generation: 52%|█████▏ | 26/50 [33:27<51:48, 129.50s/it]
Generation: 26 Best roc_auc_score score: 1.0
Generation: 54%|█████▍ | 27/50 [35:51<51:12, 133.60s/it]
Generation: 27 Best roc_auc_score score: 1.0
Generation: 56%|█████▌ | 28/50 [38:40<52:54, 144.28s/it]
Generation: 28 Best roc_auc_score score: 1.0
Generation: 58%|█████▊ | 29/50 [40:49<48:55, 139.80s/it]
Generation: 29 Best roc_auc_score score: 1.0
Generation: 60%|██████ | 30/50 [43:49<50:36, 151.83s/it]
Generation: 30 Best roc_auc_score score: 1.0
Generation: 62%|██████▏ | 31/50 [48:19<59:20, 187.37s/it]
Generation: 31 Best roc_auc_score score: 1.0
Generation: 64%|██████▍ | 32/50 [50:18<50:01, 166.77s/it]
Generation: 32 Best roc_auc_score score: 1.0
Generation: 66%|██████▌ | 33/50 [52:22<43:38, 154.01s/it]
Generation: 33 Best roc_auc_score score: 1.0
Generation: 68%|██████▊ | 34/50 [54:38<39:35, 148.46s/it]
Generation: 34 Best roc_auc_score score: 1.0
Generation: 70%|███████ | 35/50 [57:43<39:52, 159.52s/it]
Generation: 35 Best roc_auc_score score: 1.0
Generation: 72%|███████▏ | 36/50 [1:00:16<36:44, 157.44s/it]
Generation: 36 Best roc_auc_score score: 1.0
Generation: 74%|███████▍ | 37/50 [1:04:37<40:51, 188.57s/it]
Generation: 37 Best roc_auc_score score: 1.0
Generation: 76%|███████▌ | 38/50 [1:27:43<1:49:32, 547.68s/it]
Generation: 38 Best roc_auc_score score: 1.0
Generation: 78%|███████▊ | 39/50 [1:29:40<1:16:43, 418.49s/it]
Generation: 39 Best roc_auc_score score: 1.0
Generation: 80%|████████ | 40/50 [1:32:56<58:39, 351.95s/it]
Generation: 40 Best roc_auc_score score: 1.0
Generation: 82%|████████▏ | 41/50 [1:36:55<47:41, 317.92s/it]
Generation: 41 Best roc_auc_score score: 1.0
Generation: 84%|████████▍ | 42/50 [1:38:54<34:27, 258.41s/it]
Generation: 42 Best roc_auc_score score: 1.0
Generation: 86%|████████▌ | 43/50 [1:40:59<25:28, 218.30s/it]
Generation: 43 Best roc_auc_score score: 1.0
Generation: 88%|████████▊ | 44/50 [1:42:38<18:14, 182.39s/it]
Generation: 44 Best roc_auc_score score: 1.0
Generation: 90%|█████████ | 45/50 [1:44:18<13:08, 157.68s/it]
Generation: 45 Best roc_auc_score score: 1.0
Generation: 92%|█████████▏| 46/50 [1:46:13<09:39, 144.91s/it]
Generation: 46 Best roc_auc_score score: 1.0
Generation: 94%|█████████▍| 47/50 [1:48:29<07:06, 142.24s/it]
Generation: 47 Best roc_auc_score score: 1.0
Generation: 96%|█████████▌| 48/50 [1:50:06<04:17, 128.67s/it]
Generation: 48 Best roc_auc_score score: 1.0
Generation: 98%|█████████▊| 49/50 [1:52:18<02:09, 129.85s/it]
Generation: 49 Best roc_auc_score score: 1.0
Generation: 100%|██████████| 50/50 [1:54:14<00:00, 137.09s/it]
Generation: 50 Best roc_auc_score score: 1.0
total time: 6862.724096059799
test score: 0.9917355371900827
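Cross-Validation Early Pruning
Most often, we will use cross-validation to evaluate pipelines. However, we can often tell within the first few folds whether a pipeline has a reasonable chance of outperforming the previous best pipelines. For example, if the best score so far is .92 AUROC and the average score over the first five folds of the current pipeline is only around .61, we can be reasonably confident that the next five folds are unlikely to put this pipeline ahead of the others. By not computing the remaining folds, we can save a substantial amount of computation. TPOT can use two strategies to accomplish this (see Tutorial 8 for more information on these strategies).
- Threshold pruning: a pipeline must reach a predefined percentile threshold (based on the scores of previous pipelines) at each cross-validation (CV) fold in order to continue.
- Selection pruning: within each population, only the top N% of pipelines (ranked by performance in the previous CV folds) are selected to be evaluated in the next fold.
We can further reduce the computational load by terminating the evaluation of individual pipelines early if the first few CV scores are not promising. Note that this is different from early stopping of the entire algorithm. In this section we will cover: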
threshold_evaluation_pruning
threshold_evaluation_scaling
min_history_threshold
selection_evaluation_pruning
selection_evaluation_scaling
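Threshold early stopping uses previous scores to identify and terminate the cross-validation evaluation of poorly performing pipelines. We calculate percentile scores from the previously evaluated pipelines. A pipeline must reach the given percentile at each fold for the next fold to be evaluated; otherwise the pipeline is discarded.
The threshold_evaluation_pruning parameter is a list that specifies the starting and ending percentiles to use as the threshold for early stopping. The threshold_evaluation_scaling parameter is a float that controls how quickly the threshold moves from the starting percentile to the ending percentile. The min_history_threshold parameter specifies the minimum number of previous scores required before threshold early stopping is used. This ensures that the algorithm has enough historical data to make an informed decision about when to stop evaluating a pipeline.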
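The threshold rule described above can be sketched in a few lines of NumPy. This is a simplified illustration, not TPOT's implementation: `evaluate_with_threshold` and its arguments are hypothetical names, the fold scores are passed in precomputed rather than obtained by fitting a pipeline, and the percentile schedule is a plain linear ramp.

```python
import numpy as np

def evaluate_with_threshold(fold_scores, history, percentiles, min_history=20):
    """Hypothetical sketch: return the scores of the folds that get evaluated.

    fold_scores: per-fold scores of the candidate pipeline (precomputed here).
    history:     per-fold score lists of previously evaluated pipelines.
    percentiles: percentile the candidate must reach at each fold.
    min_history: number of earlier scores required before pruning is applied.
    """
    evaluated = []
    for fold, score in enumerate(fold_scores):
        evaluated.append(score)
        earlier = [h[fold] for h in history if len(h) > fold]
        if len(earlier) < min_history:
            continue  # too little history to judge, keep evaluating
        if score < np.percentile(earlier, percentiles[fold]):
            break  # below the threshold: skip the remaining folds
    return evaluated

# 30 earlier pipelines with scores spread between 0.70 and 0.95 on all 10 folds.
history = [[s] * 10 for s in np.linspace(0.70, 0.95, 30)]
percentiles = np.linspace(30, 90, 10)  # linear stand-in for the real schedule

weak = evaluate_with_threshold([0.60] * 10, history, percentiles)
strong = evaluate_with_threshold([0.99] * 10, history, percentiles)
print(len(weak), "fold(s) evaluated for the weak pipeline")
print(len(strong), "fold(s) evaluated for the strong pipeline")
```

In this toy setup the weak candidate is dropped after its first fold, while the strong one is evaluated on every fold. With fewer than `min_history` earlier scores, nothing is pruned.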
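Selection early stopping uses a selection algorithm after each fold to choose which individuals will be evaluated on the next fold. For example, after evaluating 100 individuals on fold 1, we may want to evaluate only the best 50 on the remaining folds.
The selection_evaluation_pruning parameter is a list that specifies the starting and ending fraction of the population to select after each round of CV. This determines which individuals are evaluated on the next fold. The selection_evaluation_scaling parameter is a float that controls how quickly the selected fraction moves from the starting value to the ending value.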
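A minimal sketch of the selection rule follows. It assumes a simple "keep the top fraction by mean score so far" selector and a linear schedule for the fraction; the selector TPOT actually uses may differ, and `selection_pruning` is a hypothetical name. Fold scores are supplied as a precomputed matrix so the example runs on its own.

```python
import numpy as np

def selection_pruning(score_matrix, fractions):
    """Hypothetical sketch: indices of pipelines that survive every fold.

    score_matrix: array of shape (n_pipelines, n_folds) with precomputed fold scores.
    fractions:    fraction of the original population kept after each fold.
    """
    n_pipelines, n_folds = score_matrix.shape
    alive = np.arange(n_pipelines)
    for fold in range(n_folds):
        # Rank the surviving pipelines by their mean score over the folds seen so far.
        mean_so_far = score_matrix[alive, : fold + 1].mean(axis=1)
        n_keep = min(len(alive), max(1, int(round(fractions[fold] * n_pipelines))))
        best_first = np.argsort(mean_so_far)[::-1]
        alive = alive[best_first[:n_keep]]
    return alive

# 100 toy pipelines whose quality increases with their index, 10 folds each.
quality = np.linspace(0.0, 1.0, 100)
score_matrix = np.tile(quality[:, None], (1, 10))
fractions = np.linspace(0.9, 0.3, 10)  # keep 90% after the first fold, 30% after the last

survivors = selection_pruning(score_matrix, fractions)
print(len(survivors), "pipelines were evaluated on every fold")
```

Only the pipelines that remain in `alive` need to be fitted on the next fold, which is where the savings come from.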
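By adjusting these parameters, we can control how the algorithm selects which individuals continue to be evaluated and when it stops evaluating poorly performing pipelines.
In practice, the values of these parameters will depend on the specific problem and the available computational resources.
In the following sections, we show how to set and adjust these parameters using Python code in a Jupyter Notebook, along with examples of how they affect the performance of the algorithm.
(Note that in these small test cases you may notice little or no performance improvement; these methods are likely to be more beneficial in real-world scenarios with larger datasets and pipelines that are slow to evaluate.)
Considerations: It is important to note how CV pruning interacts with the evolutionary algorithm. When pipelines are pruned with one of these methods, they are removed from the live population and are no longer used to inform the TPOT algorithm. If too many pipelines are pruned, this can reduce the diversity of pipelines in each generation and limit TPOT's ability to learn. Pruning can also affect TPOT's run time. If the pruning algorithm removes pipelines that perform slightly worse but run faster, TPOT will most likely fill the next generation with only the slower-running pipelines, which technically increases the total run time. This may be acceptable, since more compute is then spent on the better-performing pipelines.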
import matplotlib.pyplot as plt
import tpot
import time
import sklearn
import sklearn.datasets
threshold_evaluation_pruning = [30, 90]
threshold_evaluation_scaling = .2 #.5
cv = 10
# Percentile threshold a pipeline must reach at each CV fold
fig, ax1 = plt.subplots()
interpolated_values = tpot.utils.beta_interpolation(start=threshold_evaluation_pruning[0], end=threshold_evaluation_pruning[-1], n=cv, n_steps=cv, scale=threshold_evaluation_scaling)
ax1.step(list(range(len(interpolated_values))), interpolated_values, label=f"threshold")
ax1.set_xlabel("fold")
ax1.set_ylabel("percentile")
#ax1.legend(loc='center left', bbox_to_anchor=(1.1, 0.4))
plt.show()
import tpot
from tpot.search_spaces.pipelines import *
from tpot.search_spaces.nodes import *
from tpot.config.get_configspace import get_search_space
import sklearn.model_selection
import sklearn
selectors = get_search_space(["selectors","selectors_classification", "Passthrough"], random_state=42,)
estimators = get_search_space(['XGBClassifier'],random_state=42,)
scalers = get_search_space(["scalers","Passthrough"],random_state=42,)
transformers_layer = UnionPipeline([
    ChoicePipeline([
        DynamicUnionPipeline(get_search_space(["transformers"], random_state=42)),
        get_search_space("SkipTransformer"),
    ]),
    get_search_space("Passthrough"),
])
search_space = SequentialPipeline(search_spaces=[
    scalers,
    selectors,
    transformers_layer,
    estimators,
])
import time
import sklearn
import sklearn.datasets
import sklearn.metrics
import sklearn.model_selection
import tpot
# Built-in one-vs-rest ROC AUC scorer; avoids the deprecated needs_proba argument.
scorer = sklearn.metrics.get_scorer('roc_auc_ovr')
X, y = sklearn.datasets.make_classification(n_samples=5000, n_features=20, n_classes=5, random_state=1, n_informative=15, n_redundant=5, n_repeated=0, n_clusters_per_class=3, class_sep=.8)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, random_state=1)
# search_space = tpot.config.template_search_spaces.get_template_search_spaces("linear",inner_predictors=False, random_state=42)
# no pruning
est = tpot.TPOTEstimator(
    generations=10,
    max_time_mins=None,
    scorers=['roc_auc_ovr'],
    scorers_weights=[1],
    classification=True,
    search_space=search_space,
    population_size=100,
    n_jobs=32,
    cv=cv,
    verbose=3,
    random_state=42,
)
start = time.time()
est.fit(X_train, y_train)
print(f"total time: {time.time()-start}")
print("test score: ", scorer(est, X_test, y_test))
Generation: 10%|█ | 1/10 [02:42<24:26, 162.98s/it]
Generation: 1 Best roc_auc_score score: 0.9212394545585599
Generation: 20%|██ | 2/10 [06:10<25:14, 189.31s/it]
Generation: 2 Best roc_auc_score score: 0.921316057689257
Generation: 30%|███ | 3/10 [10:07<24:37, 211.00s/it]
Generation: 3 Best roc_auc_score score: 0.9291812014325632
Generation: 40%|████ | 4/10 [16:26<27:43, 277.33s/it]
Generation: 4 Best roc_auc_score score: 0.9291812014325632
Generation: 50%|█████ | 5/10 [21:24<23:44, 284.90s/it]
Generation: 5 Best roc_auc_score score: 0.9309353469187138
Generation: 60%|██████ | 6/10 [28:02<21:32, 323.19s/it]
Generation: 6 Best roc_auc_score score: 0.9328394699598583
Generation: 70%|███████ | 7/10 [36:02<18:43, 374.57s/it]
Generation: 7 Best roc_auc_score score: 0.9341963775600117
Generation: 80%|████████ | 8/10 [45:34<14:34, 437.41s/it]
Generation: 8 Best roc_auc_score score: 0.9341963775600117
Generation: 90%|█████████ | 9/10 [54:40<07:51, 471.27s/it]
Generation: 9 Best roc_auc_score score: 0.9356175936945494
Generation: 100%|██████████| 10/10 [1:03:45<00:00, 382.55s/it]
Generation: 10 Best roc_auc_score score: 0.9371852416832148
total time: 3836.4180731773376
test score: 0.9422368174356803
import tpot.config
import tpot.config.template_search_spaces
import tpot.search_spaces
# search_space = tpot.config.get_search_space(["RandomForestClassifier"])
# threshold pruning
est = tpot.TPOTEstimator(
    generations=10,
    max_time_mins=None,
    scorers=['roc_auc_ovr'],
    scorers_weights=[1],
    classification=True,
    search_space=search_space,
    population_size=100,
    n_jobs=32,
    cv=cv,
    verbose=3,
    random_state=42,
    threshold_evaluation_pruning=threshold_evaluation_pruning,
    threshold_evaluation_scaling=threshold_evaluation_scaling,
)
start = time.time()
est.fit(X_train, y_train)
print(f"total time: {time.time()-start}")
print("test score: ", scorer(est, X_test, y_test))
Generation: 10%|█ | 1/10 [02:57<26:40, 177.87s/it]
Generation: 1 Best roc_auc_score score: 0.9212394545585602
Generation: 20%|██ | 2/10 [03:57<14:24, 108.05s/it]
Generation: 2 Best roc_auc_score score: 0.9212394545585602
Generation: 30%|███ | 3/10 [05:58<13:18, 114.13s/it]
Generation: 3 Best roc_auc_score score: 0.9212394545585602
Generation: 40%|████ | 4/10 [07:54<11:29, 114.96s/it]
Generation: 4 Best roc_auc_score score: 0.9212394545585602
Generation: 50%|█████ | 5/10 [10:43<11:11, 134.34s/it]
Generation: 5 Best roc_auc_score score: 0.921316057689257
Generation: 60%|██████ | 6/10 [13:16<09:23, 140.78s/it]
Generation: 6 Best roc_auc_score score: 0.921316057689257
Generation: 70%|███████ | 7/10 [15:05<06:31, 130.43s/it]
Generation: 7 Best roc_auc_score score: 0.921316057689257
Generation: 80%|████████ | 8/10 [18:01<04:49, 144.72s/it]
Generation: 8 Best roc_auc_score score: 0.9255953925256337
Generation: 90%|█████████ | 9/10 [19:53<02:14, 134.59s/it]
Generation: 9 Best roc_auc_score score: 0.9255953925256337
Generation: 100%|██████████| 10/10 [21:24<00:00, 128.50s/it]
Generation: 10 Best roc_auc_score score: 0.9255953925256337
total time: 1295.825649023056
test score: 0.9320499022897322
import matplotlib.pyplot as plt
import tpot
selection_evaluation_pruning = [.9, .3]
selection_evaluation_scaling = .2
# Fraction of the population selected after each CV fold
fig, ax1 = plt.subplots()
interpolated_values = tpot.utils.beta_interpolation(start=selection_evaluation_pruning[0], end=selection_evaluation_pruning[-1], n=cv, n_steps=cv, scale=selection_evaluation_scaling)
ax1.step(list(range(len(interpolated_values))), interpolated_values, label=f"threshold")
ax1.set_xlabel("fold")
ax1.set_ylabel("percent to select")
#ax1.legend(loc='center left', bbox_to_anchor=(1.1, 0.4))
plt.show()
# selection pruning
est = tpot.TPOTEstimator(
    generations=10,
    max_time_mins=None,
    scorers=['roc_auc_ovr'],
    scorers_weights=[1],
    classification=True,
    search_space=search_space,
    population_size=100,
    n_jobs=32,
    cv=cv,
    verbose=3,
    random_state=42,
    selection_evaluation_pruning=selection_evaluation_pruning,
    selection_evaluation_scaling=selection_evaluation_scaling,
)
start = time.time()
est.fit(X_train, y_train)
print(f"total time: {time.time()-start}")
print("test score: ", scorer(est, X_test, y_test))
Generation: 10%|█ | 1/10 [02:23<21:31, 143.50s/it]
Generation: 1 Best roc_auc_score score: 0.9212394545585602
Generation: 20%|██ | 2/10 [04:00<15:30, 116.31s/it]
Generation: 2 Best roc_auc_score score: 0.9212394545585602
Generation: 30%|███ | 3/10 [05:42<12:48, 109.73s/it]
Generation: 3 Best roc_auc_score score: 0.9212394545585602
Generation: 40%|████ | 4/10 [07:36<11:08, 111.45s/it]
Generation: 4 Best roc_auc_score score: 0.9212394545585602
Generation: 50%|█████ | 5/10 [09:12<08:48, 105.72s/it]
Generation: 5 Best roc_auc_score score: 0.9212394545585602
Generation: 60%|██████ | 6/10 [11:04<07:11, 107.81s/it]
Generation: 6 Best roc_auc_score score: 0.9212394545585602
Generation: 70%|███████ | 7/10 [12:54<05:26, 108.71s/it]
Generation: 7 Best roc_auc_score score: 0.9212394545585602
Generation: 80%|████████ | 8/10 [14:45<03:38, 109.49s/it]
Generation: 8 Best roc_auc_score score: 0.925549420935039
Generation: 90%|█████████ | 9/10 [16:49<01:54, 114.03s/it]
Generation: 9 Best roc_auc_score score: 0.925549420935039
Generation: 100%|██████████| 10/10 [18:36<00:00, 111.67s/it]
Generation: 10 Best roc_auc_score score: 0.925549420935039
total time: 1129.1526980400085
test score: 0.9324219154371735
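The three runs so far (no pruning, threshold pruning, selection pruning) can be compared directly from their printed totals. The numbers below are copied from the outputs in this tutorial. Each comes from a single run, so the ratios should be read as indicative rather than as a benchmark.

```python
# (total time in seconds, test AUROC) as printed by the runs in this tutorial
runs = {
    "no pruning": (3836.4180731773376, 0.9422368174356803),
    "threshold pruning": (1295.825649023056, 0.9320499022897322),
    "selection pruning": (1129.1526980400085, 0.9324219154371735),
}

baseline_time, baseline_score = runs["no pruning"]
summary = {}
for name, (seconds, score) in runs.items():
    speedup = baseline_time / seconds          # > 1 means faster than no pruning
    score_change = score - baseline_score      # < 0 means lower test AUROC
    summary[name] = (speedup, score_change)
    print(f"{name:18s} speedup: {speedup:.2f}x   test AUROC change: {score_change:+.4f}")
```

In these runs both pruning strategies finished roughly three times faster than the unpruned search, at the cost of a test AUROC about 0.01 lower.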
est.evaluated_individuals[est.evaluated_individuals['roc_auc_score_step_9']>0]
| | roc_auc_score | Parents | Variation_Function | Individual | Generation | roc_auc_score_step_0 | Submitted Timestamp | Completed Timestamp | Eval Error | roc_auc_score_step_1 | roc_auc_score_step_2 | roc_auc_score_step_3 | roc_auc_score_step_4 | roc_auc_score_step_5 | roc_auc_score_step_6 | roc_auc_score_step_7 | roc_auc_score_step_8 | roc_auc_score_step_9 | Pareto_Front | Instance |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.812263 | NaN | NaN | <tpot.search_spaces.pipelines.sequential.Seque... | 0.0 | 0.811153 | 1.740198e+09 | 1.740198e+09 | None | 0.799213 | 0.807710 | 0.813587 | 0.797528 | 0.820692 | 0.827614 | 0.815069 | 0.805447 | 0.824616 | NaN | (MinMaxScaler(), RFE(estimator=ExtraTreesClass... |
1 | 0.848068 | NaN | NaN | <tpot.search_spaces.pipelines.sequential.Seque... | 0.0 | 0.846478 | 1.740197e+09 | 1.740197e+09 | None | 0.839894 | 0.844619 | 0.848321 | 0.846915 | 0.857902 | 0.855875 | 0.827655 | 0.850938 | 0.862081 | NaN | (Passthrough(), RFE(estimator=ExtraTreesClassi... |
4 | 0.831502 | NaN | NaN | <tpot.search_spaces.pipelines.sequential.Seque... | 0.0 | 0.817219 | 1.740197e+09 | 1.740197e+09 | None | 0.827888 | 0.821911 | 0.825558 | 0.830020 | 0.831529 | 0.836955 | 0.844634 | 0.832499 | 0.846805 | NaN | (StandardScaler(), VarianceThreshold(threshold... |
5 | 0.830374 | NaN | NaN | <tpot.search_spaces.pipelines.sequential.Seque... | 0.0 | 0.817150 | 1.740197e+09 | 1.740197e+09 | None | 0.831885 | 0.820694 | 0.824899 | 0.824409 | 0.827861 | 0.833923 | 0.844308 | 0.832798 | 0.845818 | NaN | (MinMaxScaler(), SelectFromModel(estimator=Ext... |
6 | 0.850091 | NaN | NaN | <tpot.search_spaces.pipelines.sequential.Seque... | 0.0 | 0.843524 | 1.740197e+09 | 1.740197e+09 | None | 0.841176 | 0.840619 | 0.846209 | 0.849561 | 0.854367 | 0.858035 | 0.860165 | 0.845179 | 0.862077 | NaN | (Normalizer(norm='max'), SelectFwe(alpha=0.000... |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
983 | 0.886974 | (13, 13) | ind_mutate | <tpot.search_spaces.pipelines.sequential.Seque... | 9.0 | 0.871580 | 1.740198e+09 | 1.740198e+09 | None | 0.887762 | 0.882504 | 0.860872 | 0.898100 | 0.885523 | 0.893527 | 0.904779 | 0.884557 | 0.900537 | NaN | (StandardScaler(), SelectFromModel(estimator=E... |
986 | 0.850281 | (35, 470) | ind_crossover | <tpot.search_spaces.pipelines.sequential.Seque... | 9.0 | 0.837493 | 1.740198e+09 | 1.740198e+09 | None | 0.858289 | 0.844141 | 0.851260 | 0.848909 | 0.853002 | 0.856132 | 0.845356 | 0.847830 | 0.860393 | NaN | (StandardScaler(), SelectPercentile(percentile... |
990 | 0.878811 | (866, 866) | ind_mutate | <tpot.search_spaces.pipelines.sequential.Seque... | 9.0 | 0.875842 | 1.740198e+09 | 1.740198e+09 | None | 0.862567 | 0.881858 | 0.885539 | 0.874347 | 0.888858 | 0.891205 | 0.882103 | 0.863952 | 0.881838 | NaN | (Normalizer(norm='l1'), SelectPercentile(perce... |
991 | 0.835669 | (72, 855) | ind_crossover | <tpot.search_spaces.pipelines.sequential.Seque... | 9.0 | 0.838375 | 1.740198e+09 | 1.740198e+09 | None | 0.844572 | 0.837234 | 0.822799 | 0.818868 | 0.840971 | 0.845122 | 0.816390 | 0.840709 | 0.851650 | NaN | (MinMaxScaler(), SelectPercentile(percentile=4... |
992 | 0.892459 | (898, 898) | ind_mutate | <tpot.search_spaces.pipelines.sequential.Seque... | 9.0 | 0.881991 | 1.740198e+09 | 1.740198e+09 | None | 0.893987 | 0.882514 | 0.887394 | 0.902290 | 0.894360 | 0.903944 | 0.884672 | 0.889588 | 0.903849 | NaN | (RobustScaler(quantile_range=(0.0911728428421,... |
326 rows × 20 columns
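The same per-fold columns can be used to see how far each pipeline got before it was pruned. The sketch below uses a small hand-made DataFrame so that it runs without fitting TPOT. It assumes that folds which were never evaluated show up as NaN, which is consistent with the `> 0` filter used above but is an assumption about the table's contents; the toy values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for est.evaluated_individuals with three CV folds (values are made up).
toy = pd.DataFrame({
    "roc_auc_score_step_0": [0.81, 0.62, 0.85, 0.70],
    "roc_auc_score_step_1": [0.80, np.nan, 0.84, 0.71],
    "roc_auc_score_step_2": [0.81, np.nan, 0.85, np.nan],
})

step_cols = [c for c in toy.columns if c.startswith("roc_auc_score_step_")]

# Number of pipelines that were evaluated on each fold.
reached = toy[step_cols].notna().sum()
print(reached)

# Pipelines that were evaluated on every fold (i.e. never pruned).
completed = toy[toy[step_cols[-1]].notna()]
print(len(completed), "of", len(toy), "pipelines completed all folds")
```

Applied to a fitted estimator, the same two expressions on `est.evaluated_individuals` would show how aggressively the chosen schedule prunes at each fold.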
All of the above methods can be used independently or together, as shown below.
est = tpot.TPOTEstimator(
    generations=10,
    max_time_mins=None,
    scorers=['roc_auc_ovr'],
    scorers_weights=[1],
    classification=True,
    search_space=search_space,
    population_size=30,
    n_jobs=3,
    cv=cv,
    verbose=3,
    initial_population_size=initial_population_size,
    population_scaling=population_scaling,
    generations_until_end_population=generations_until_end_population,
    budget_range=budget_range,
    generations_until_end_budget=generations_until_end_budget,
    threshold_evaluation_pruning=threshold_evaluation_pruning,
    threshold_evaluation_scaling=threshold_evaluation_scaling,
    selection_evaluation_pruning=selection_evaluation_pruning,
    selection_evaluation_scaling=selection_evaluation_scaling,
)
start = time.time()
est.fit(X_train, y_train)
print(f"total time: {time.time()-start}")
print("test score: ", scorer(est, X_test, y_test))
Generation: 10%|█ | 1/10 [01:34<14:09, 94.40s/it]
Generation: 1 Best roc_auc_score score: 0.8515086951804098
Generation: 20%|██ | 2/10 [02:26<09:14, 69.36s/it]
Generation: 2 Best roc_auc_score score: 0.8515086951804098
Generation: 30%|███ | 3/10 [03:41<08:23, 71.97s/it]
Generation: 3 Best roc_auc_score score: 0.8515086951804098
Generation: 40%|████ | 4/10 [04:52<07:09, 71.53s/it]
Generation: 4 Best roc_auc_score score: 0.8515086951804098
Generation: 50%|█████ | 5/10 [05:52<05:37, 67.57s/it]
Generation: 5 Best roc_auc_score score: 0.8515086951804098
Generation: 60%|██████ | 6/10 [07:13<04:48, 72.10s/it]
Generation: 6 Best roc_auc_score score: 0.8515086951804098
Generation: 70%|███████ | 7/10 [08:06<03:17, 65.84s/it]
Generation: 7 Best roc_auc_score score: 0.8515086951804098
Generation: 80%|████████ | 8/10 [08:57<02:02, 61.13s/it]
Generation: 8 Best roc_auc_score score: 0.8515086951804098
Generation: 90%|█████████ | 9/10 [09:39<00:55, 55.14s/it]
Generation: 9 Best roc_auc_score score: 0.8515086951804098
Generation: 100%|██████████| 10/10 [10:17<00:00, 61.70s/it]
Generation: 10 Best roc_auc_score score: 0.8515086951804098
total time: 621.607882976532
test score: 0.9084772293865335