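What to expect from AutoML software¶
Automated machine learning (AutoML) takes a higher-level approach to machine learning than most practitioners are used to, so we have gathered a handful of guidelines on what to expect when running AutoML software such as TPOT.
AutoML algorithms aren't intended to run for only a few minutes¶
Of course, you can run TPOT for only a few minutes, and it will find a reasonably good pipeline for your dataset. However, if you don't run TPOT for long enough, it may not find the best possible pipeline for your dataset. It may not even find any suitable pipeline at all, in which case a RuntimeError('A pipeline has not yet been optimized. Please call fit() first.') will be raised. It is often worthwhile to run multiple instances of TPOT in parallel for a long time (hours to days) so that TPOT can thoroughly search the pipeline space for your dataset.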
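AutoML algorithms can take a long time to finish their search¶
AutoML algorithms aren't as simple as fitting one model on the dataset; they consider multiple machine learning algorithms (random forests, linear models, SVMs, etc.) in a pipeline with multiple preprocessing steps (missing value imputation, scaling, PCA, feature selection, etc.), the hyperparameters for all of the models and preprocessing steps, and multiple ways to ensemble or stack the algorithms within the pipeline.
As such, TPOT will take a while to run on larger datasets, but it's important to realize why. With 100 generations and a population size of 100, for example, TPOT will evaluate 10,000 pipeline configurations before finishing. To put this number into context, think about a grid search of 10,000 hyperparameter combinations for a machine learning algorithm and how long that grid search would take. That is 10,000 model configurations to evaluate with 10-fold cross-validation, which means that roughly 100,000 models are fit and evaluated on the training data in one grid search. That's a time-consuming procedure, even for simpler models like decision trees.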
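The arithmetic behind these figures can be checked with a short sketch. The variable names are illustrative only and are not TPOT parameters; the numbers are the ones quoted in the paragraph above.

```python
# Back-of-the-envelope count of model fits for the settings quoted above.
generations = 100        # number of generations
population_size = 100    # pipelines evaluated per generation
cv_folds = 10            # folds used to score each pipeline

pipelines_evaluated = generations * population_size
model_fits = pipelines_evaluated * cv_folds

print(f"pipelines evaluated: {pipelines_evaluated:,}")
print(f"model fits:          {model_fits:,}")
```

Halving the number of folds or the population size halves the number of fits, which is why these settings dominate the run time.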
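A typical TPOT run takes hours to days to finish (unless the dataset is small), but you can always interrupt the run partway through and see the best results so far. TPOT also provides the warm_start and periodic_checkpoint_folder parameters, which let you restart a TPOT run from where it left off.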
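AutoML algorithms can recommend different solutions for the same dataset¶
If you're working with a reasonably complex dataset or run TPOT for a short amount of time, different TPOT runs may result in different pipeline recommendations. TPOT's optimization algorithm is stochastic, which means that it uses randomness (in part) to search the possible pipeline space. When two TPOT runs recommend different pipelines, this means either that the runs didn't converge for lack of time or that multiple pipelines perform more or less the same on your dataset.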
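The effect of a short budget on a stochastic search can be seen in a toy sketch. This is plain random search over a made-up one-dimensional objective, not TPOT's evolutionary algorithm; `toy_score` and `random_search` are hypothetical names used only for illustration.

```python
import random

def toy_score(x):
    # Hypothetical objective whose best value (1.0) is at x = 0.7.
    return 1.0 - abs(x - 0.7)

def random_search(seed, n_evaluations):
    """Return the best (score, candidate) pair found by sampling at random."""
    rng = random.Random(seed)
    candidates = [rng.random() for _ in range(n_evaluations)]
    return max((toy_score(x), x) for x in candidates)

# Two short runs with different seeds, then two long runs with the same seeds.
short_a = random_search(seed=1, n_evaluations=5)
short_b = random_search(seed=2, n_evaluations=5)
long_a = random_search(seed=1, n_evaluations=5000)
long_b = random_search(seed=2, n_evaluations=5000)

print("short runs:", short_a, short_b)
print("long runs: ", long_a, long_b)
```

With only a handful of evaluations the two seeds settle on noticeably different candidates; with a larger budget both end up close to the same optimum.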
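This is actually an advantage over fixed grid search techniques: TPOT is meant to be an assistant that gives you ideas on how to solve a particular machine learning problem by exploring pipeline configurations that you might never have considered, then leaves the fine-tuning to more constrained parameter tuning techniques such as grid search or Bayesian optimization.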
import tpot
from tpot import TPOTClassifier
Then create a TPOT instance as follows
classification_optimizer = TPOTClassifier()
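TPOT can also be used for regression problems with the TPOTRegressor class. Other than the class name, a TPOTRegressor is used the same way as a TPOTClassifier. You can read more about the TPOTClassifier and TPOTRegressor classes in the API documentation.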
from tpot import TPOTRegressor
regression_optimizer = TPOTRegressor()
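Fitting a TPOT model works exactly like any other sklearn estimator. Example code with custom TPOT parameters might look like this
import sklearn
import sklearn.datasets
import sklearn.metrics
import sklearn.model_selection
from tpot import TPOTClassifier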
classification_optimizer = TPOTClassifier(search_space="linear-light", max_time_mins=30/60, n_jobs=30, cv=5)
X, y = sklearn.datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, random_state=1, test_size=0.2)
classification_optimizer.fit(X_train, y_train)
auroc_score = sklearn.metrics.roc_auc_score(y_test, classification_optimizer.predict_proba(X_test)[:,1])
print("auroc_score: ", auroc_score)
Generation: : 5it [00:32, 6.57s/it]
auroc_score: 0.9950396825396826
Scorers, objective functions, and multi-objective optimization¶
There are two ways of passing objectives into TPOT.
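- scorers: Scorers are functions with the signature (estimator, X_test, y_test) that take an estimator which is expected to have been fitted on the training data. They can be produced with the sklearn.metrics.make_scorer function. TPOT uses them to evaluate the test folds during cross-validation (defined in the cv parameter). They are passed into TPOT via the scorers parameter, which accepts either a scorer itself or the string name of a scoring function (listed here). TPOT also supports passing a list of several scorers for multi-objective optimization. For each fold of CV, TPOT fits the estimator only once and then evaluates all of the provided scorers in a loop.
- other_objective_functions: Other objective functions in TPOT have the signature (estimator) and return a float or a list of floats. They are called once, outside of cross-validation, with an unfitted estimator. The user may choose to fit the pipeline inside such an objective function.
Every scorer and objective function must be accompanied by a list of weights matching the list of objectives; these are scorers_weights and other_objective_functions_weights, respectively. By default, TPOT maximizes objective functions (this can be changed with bigger_is_better=False). A positive weight means that TPOT will seek to maximize that objective, and a negative weight corresponds to minimization. For most selectors (and the default), only the sign matters. The scale of the weights may matter if you use a custom selection function for the optimization algorithm. A weight of zero means that the score has no effect on the selection algorithm.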
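The sign convention for weights can be illustrated with a small sketch. It is not TPOT's selection code: `to_maximization` and `dominates` are hypothetical helpers, and comparing candidates by dominance is an assumption made for the example.

```python
def to_maximization(scores, weights):
    """Multiply each score by its weight so that larger is always better."""
    if len(scores) != len(weights):
        raise ValueError("need exactly one weight per score")
    return [s * w for s, w in zip(scores, weights)]

def dominates(a, b, weights):
    """True if a is at least as good as b everywhere and strictly better once."""
    wa, wb = to_maximization(a, weights), to_maximization(b, weights)
    no_worse = all(x >= y for x, y in zip(wa, wb))
    strictly_better = any(x > y for x, y in zip(wa, wb))
    return no_worse and strictly_better

weights = [1, -1]          # maximize ROC AUC, minimize complexity
simple = [0.996, 31.0]     # [roc_auc, complexity]
complex_ = [0.996, 582.0]

print(dominates(simple, complex_, weights))   # the simpler pipeline wins
print(dominates(complex_, simple, weights))
# With a zero weight the second objective no longer separates the two.
print(dominates(simple, complex_, [1, 0]))
```

Because only the sign of each weight enters the comparison here, weights of [1, -1] and [5, -0.1] would rank these two candidates identically.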
Here is an example that uses two scorers
scorers=['roc_auc_ovr',tpot.objectives.complexity_scorer],
scorers_weights=[1,-1],
Here is an example that uses one scorer and one secondary objective function
scorers=['roc_auc_ovr'],
scorers_weights=[1],
other_objective_functions=[tpot.objectives.number_of_leaves_objective],
other_objective_functions_weights=[-1],
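TPOT always names the scorer columns in the final results DataFrame automatically, based on the function name. TPOT also uses the function name as the column name for other_objective_functions. However, if you would like to specify custom column names, you can set objective_function_names to a list of names (str), one for each value returned by each function in other_objective_functions. This is useful when an additional function returns more than one value.
Either a scorer or an other_objective_function can return multiple values. In that case, just make sure that scorers_weights and other_objective_functions_weights have the same length as the number of returned scores.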
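As a minimal sketch of an objective that returns more than one value, the hypothetical function below takes an unfitted pipeline and returns two numbers. `step_counts` and the column names are made up for illustration; the keyword names are the ones used in this section, and the stand-in object only mimics the `.steps` attribute of a scikit-learn Pipeline.

```python
from types import SimpleNamespace

def step_counts(estimator):
    """Hypothetical objective: return [number of steps, number of scaler steps]."""
    steps = getattr(estimator, "steps", [])
    n_scalers = sum("scaler" in name.lower() for name, _ in steps)
    return [float(len(steps)), float(n_scalers)]

# Two returned values, so two weights and two column names are needed.
objective_kwargs = dict(
    other_objective_functions=[step_counts],
    other_objective_functions_weights=[-1, -1],
    objective_function_names=["n_steps", "n_scalers"],
)

# Stand-in for an unfitted scikit-learn Pipeline, which exposes `.steps`.
fake_pipeline = SimpleNamespace(
    steps=[("minmaxscaler", None), ("mlpclassifier", None)]
)
values = step_counts(fake_pipeline)
assert len(values) == len(objective_kwargs["other_objective_functions_weights"])
print(dict(zip(objective_kwargs["objective_function_names"], values)))
```

The final check mirrors the rule stated above: one weight (and one name) per returned value, not per function.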
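TPOT comes with a few additional built-in objective functions that you can use. The first table lists objective functions that are applied to fitted pipelines and are therefore passed to the scorers parameter. The second table lists objective functions for the other_objective_functions parameter.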
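Scorers
Function | Description |
---|---|
tpot.objectives.complexity_scorer | Estimates the number of learned parameters across all classifiers and regressors in the pipeline. Additionally, transformers currently add 1 and selectors add 0 (since they don't affect the complexity of the "final" predictive pipeline). |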
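Other objective functions
Function | Description |
---|---|
tpot.objectives.average_path_length | Computes the average shortest path from all nodes to the root/final estimator (only supported for GraphPipeline) |
tpot.objectives.number_of_leaves_objective | Counts the number of leaf nodes (input nodes) in a GraphPipeline |
tpot.objectives.number_of_nodes_objective | Counts the number of nodes in a pipeline (whether it is a scikit-learn Pipeline, GraphPipeline, FeatureUnion, or any of these nested within one another) |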
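Measuring model complexity¶
When running TPOT, it is sometimes beneficial to include a secondary objective that measures model complexity. More complex models can yield higher performance, but at the cost of interpretability. Simpler models may be easier to interpret but often have lower predictive performance. Sometimes, however, a large increase in complexity yields only a marginal improvement in predictive performance. There may be other, simpler and more interpretable pipelines whose slightly lower performance is acceptable in exchange for better interpretability. These pipelines are often overlooked when optimizing purely for performance. By including both performance and complexity as objectives, TPOT attempts to optimize the best pipeline at every level of complexity simultaneously. After optimization, the user can inspect the trade-off between complexity and performance and decide which pipeline best suits their needs.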
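Two ways of measuring complexity to consider are tpot.objectives.number_of_nodes_objective and tpot.objectives.complexity_scorer. The number-of-nodes objective simply counts the number of steps in a pipeline. This is a simple metric, but it does not differentiate between the complexity of different model types. For example, a simple LogisticRegression counts the same as the far more complex XGBoost. The complexity scorer attempts to estimate the number of learned parameters in the classifiers and regressors of the pipeline. Precisely quantifying and comparing complexity across different classes of models is challenging and potentially subjective. Nevertheless, this function gives the evolutionary algorithm a reasonable heuristic that separates, at least qualitatively, more complex algorithms from less complex ones. While it may be hard to compare the relative complexities of LogisticRegression and XGBoost exactly, the two will, for example, always sit at opposite ends of the complexity values returned by this function. This allows the Pareto front to have LogisticRegression at one end and XGBoost at the other.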
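To see why counting learned parameters separates models that a node count treats as equal, consider the rough sketch below. It is not the formula used by complexity_scorer; it only counts weights and biases for a linear model and for a fully connected network (used here as a stand-in for a more complex model), and the input width of 30 features is an assumption.

```python
def linear_model_params(n_features, n_outputs=1):
    """Weights plus one bias per output."""
    return n_features * n_outputs + n_outputs

def mlp_params(n_features, hidden_layer_sizes, n_outputs=1):
    """Weights and biases of a fully connected network."""
    sizes = [n_features, *hidden_layer_sizes, n_outputs]
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))

n_features = 30  # assumed input width
logreg = linear_model_params(n_features)
mlp = mlp_params(n_features, [139, 139])

# Both models are a single pipeline step, yet their parameter counts differ
# by almost three orders of magnitude.
print(logreg, mlp)
```

A node-count objective would score both of these as one step, while a parameter-based estimate places them far apart.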
An example of this analysis is shown later, in the example analysis section.
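Built-in configurations¶
TPOT can be used to optimize hyperparameters, select models, and optimize pipelines of models, including determining the sequence of steps. Tutorial 2 goes into more detail on how to customize the search space with custom hyperparameter ranges, model types, and possible pipeline configurations. TPOT also comes with a handful of default operator and parameter configurations that we believe work well for optimizing machine learning pipelines. Below is a list of the built-in configurations that currently come with TPOT. Any of them can be passed as a string to the search_space parameter of any TPOT estimator.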
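String | Description |
---|---|
linear | A linear pipeline with the structure "Selector->(transformers+Passthrough)->(classifiers/regressors+Passthrough)->final classifier/regressor." For the transformer and inner estimator layers, TPOT may choose one or more transformers/classifiers, or none at all. The inner classifier/regressor layer is optional. |
linear-light | The same search space as linear, but without the inner classifier/regressor layer and with a reduced set of faster-running estimators. |
graph | TPOT will optimize a pipeline in the shape of a directed acyclic graph. The nodes of the graph can include selectors, scalers, transformers, or classifiers/regressors (inner classifiers/regressors can optionally be excluded). This returns a custom GraphPipeline rather than an sklearn Pipeline. See Tutorial 6 for more details. |
graph-light | The same search space as graph, but without the inner classifiers/regressors and with a reduced set of faster-running estimators. |
mdr | TPOT will search over a series of feature selectors and Multifactor Dimensionality Reduction models to find a series of operators that maximize prediction accuracy. The TPOT MDR configuration is specialized for genome-wide association studies (GWAS), and is described in detail online here. Note that TPOT MDR may be slow to run because the feature selection routines are computationally expensive, especially on large datasets. |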
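By default, the linear and graph configurations allow additional stacked classifiers/regressors within the pipeline, in addition to the final classifier/regressor. If you would like to disable this, you can manually get a search space without inner classifiers/regressors by calling the function tpot.config.template_search_spaces.get_template_search_spaces with inner_predictors=False. You can pass the resulting search space to the search_space parameter.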
import tpot
from tpot.search_spaces.pipelines import SequentialPipeline
from tpot.config import get_search_space
stc_search_space = SequentialPipeline([
    get_search_space("selectors"),
    get_search_space("all_transformers"),
    get_search_space("classifiers"),
])
est = tpot.TPOTEstimator(
    search_space = stc_search_space,
    scorers=["roc_auc_ovr", tpot.objectives.complexity_scorer],
    scorers_weights=[1.0, -1.0],
    classification = True,
    cv = 5,
    max_eval_time_mins = 10,
    early_stop = 2,
    verbose = 2,
    n_jobs=4,
)
Using a built-in configuration
est = tpot.TPOTEstimator(
    search_space = "linear",
    scorers=["roc_auc_ovr", tpot.objectives.complexity_scorer],
    scorers_weights=[1.0, -1.0],
    classification = True,
    cv = 5,
    max_eval_time_mins = 10,
    early_stop = 2,
    verbose = 2,
    n_jobs=4,
)
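The specific hyperparameter ranges used by TPOT can be found in the files in the tpot/config folder. The template search spaces listed above are defined in tpot/config/template_search_spaces.py. The search spaces for individual models, which can be retrieved with tpot.config.get_search_space, are defined in tpot/config/get_configspace.py. More details on customizing search spaces can be found in Tutorial 2.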
`tpot.config.template_search_spaces.get_template_search_spaces`
Returns a search space which can be optimized by TPOT.
Parameters
----------
search_space: str or SearchSpace
The default search space to use. If a string, it should be one of the following:
- 'linear': A search space for linear pipelines
- 'linear-light': A search space for linear pipelines with a smaller, faster search space
- 'graph': A search space for graph pipelines
- 'graph-light': A search space for graph pipelines with a smaller, faster search space
- 'mdr': A search space for MDR pipelines
If a SearchSpace object, it should be a valid search space object for TPOT.
classification: bool, default=True
Whether the problem is a classification problem or a regression problem.
inner_predictors: bool, default=None
Whether to include additional classifiers/regressors before the final classifier/regressor (allowing for ensembles).
Defaults to False for 'linear-light' and 'graph-light' search spaces, and True otherwise. (Not used for 'mdr' search space)
cross_val_predict_cv: int, default=None
The number of folds to use for cross_val_predict.
Defaults to 0 for 'linear-light' and 'graph-light' search spaces, and 5 otherwise. (Not used for 'mdr' search space)
get_search_space_params: dict
Additional parameters to pass to the get_search_space function.
cross_val_predict_cv¶
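Additionally, using cross_val_predict_cv may improve performance when training models with inner classifiers/regressors. If this parameter is set, then during model training any classifier or regressor that is not the final predictor uses sklearn.model_selection.cross_val_predict to pass out-of-sample predictions on to the following steps of the model. The model is still fit to the full data, and that fit is used for predictions after training. Training downstream models on out-of-sample predictions can often prevent overfitting and improve performance. The reason is that this gives the downstream models an estimate of how the upstream models perform on unseen data. Otherwise, if an upstream model overfits the data severely, the downstream models may simply learn to blindly trust a model that only appears to perform well, propagating the overfitting through to the final result.
The downside is that cross_val_predict_cv is considerably more computationally demanding and may not be necessary for your dataset.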
linear_with_cross_val_predict_sp = tpot.config.template_search_spaces.get_template_search_spaces(search_space="linear", classification=True, inner_predictors=True, cross_val_predict_cv=5)
classification_optimizer = TPOTClassifier(search_space=linear_with_cross_val_predict_sp, max_time_mins=30/60, n_jobs=30, cv=5)
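The difference between in-sample and out-of-sample predictions described above can be shown with a toy sketch in plain Python. `Memorizer` is a made-up model that overfits completely, and the fold assignment is a simple assumption; this only illustrates the idea behind cross_val_predict and is not how TPOT or scikit-learn implement it.

```python
from collections import Counter

class Memorizer:
    """Toy model that overfits completely: it memorizes every training label."""
    def fit(self, X, y):
        self.table = dict(zip(X, y))
        self.fallback = Counter(y).most_common(1)[0][0]  # used for unseen inputs
        return self

    def predict(self, X):
        return [self.table.get(x, self.fallback) for x in X]

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def out_of_fold_predictions(X, y, n_folds):
    """Predict each sample with a model that never saw it during fitting."""
    preds = [None] * len(X)
    for k in range(n_folds):
        held_out = [i for i in range(len(X)) if i % n_folds == k]
        train = [i for i in range(len(X)) if i % n_folds != k]
        model = Memorizer().fit([X[i] for i in train], [y[i] for i in train])
        for i, p in zip(held_out, model.predict([X[i] for i in held_out])):
            preds[i] = p
    return preds

X = list(range(20))
y = [i % 2 for i in X]

in_sample_acc = accuracy(y, Memorizer().fit(X, y).predict(X))
oof_acc = accuracy(y, out_of_fold_predictions(X, y, n_folds=5))
print(in_sample_acc, oof_acc)
```

A downstream model fed the in-sample predictions would see an apparently perfect upstream model, while the out-of-fold predictions expose that it does not generalize.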
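Terminating the optimization (early stopping)¶
Note that the examples here use a short time limit for the sake of a quick demonstration. By default, TPOT sets a total time limit of 1 hour with a limit of 5 minutes per pipeline. In practice, you may want to increase these values.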
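There are three ways to terminate a TPOT run and end the optimization process. TPOT terminates as soon as one of the conditions is met.
- max_time_mins: (default: 60 minutes) After this many minutes, TPOT terminates and returns the best pipeline found so far.
- early_stop: The number of generations without improvement in performance after which TPOT terminates. Generally, a value of around 5 to 20 is sufficient to be reasonably sure that performance has converged.
- generations: The total number of generations of the evolutionary algorithm to run.
By default, TPOT runs until the time limit is up, with no generation or early stop limits.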
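How the three conditions interact can be sketched with a small loop. This is an illustration under simplifying assumptions, not TPOT's implementation: `run_search` is a hypothetical function, it replays a fixed list of per-generation best scores instead of evaluating pipelines, and its time limit is given in seconds rather than minutes.

```python
import time

def run_search(generation_scores, max_time_s=None, early_stop=None, generations=None):
    """Replay per-generation best scores and report why the loop stopped."""
    start = time.monotonic()
    best = float("-inf")
    stale = 0       # generations since the last improvement
    completed = 0   # generations finished so far
    for score in generation_scores:
        completed += 1
        if score > best:
            best, stale = score, 0
        else:
            stale += 1
        # Whichever condition is met first ends the run.
        if early_stop is not None and stale >= early_stop:
            return completed, "early_stop"
        if generations is not None and completed >= generations:
            return completed, "generations"
        if max_time_s is not None and time.monotonic() - start >= max_time_s:
            return completed, "max_time"
    return completed, "scores exhausted"

scores = [0.90, 0.95, 0.95, 0.95, 0.96]
print(run_search(scores, early_stop=2))
print(run_search(scores, generations=3))
print(run_search(scores, max_time_s=0.0))
```

Note that with early_stop=2 the run ends before the improvement in the fifth generation is ever seen, which is one reason to prefer larger values in practice.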
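Best practices and tips:¶
- When running tpot from a .py script, be sure to guard your code with if __name__=="__main__":. This is required because of how TPOT handles parallelization with Python and Dask.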
from dask.distributed import Client, LocalCluster
import tpot
import sklearn
import sklearn.datasets
import sklearn.metrics
import sklearn.model_selection
import numpy as np
# Everything that starts TPOT lives under the guard, so importing this file does not launch a run.
if __name__=="__main__":
    scorer = sklearn.metrics.get_scorer('roc_auc_ovo')
    X, y = sklearn.datasets.load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, train_size=0.75, test_size=0.25)
    est = tpot.TPOTClassifier(n_jobs=4, max_time_mins=3, verbose=2, early_stop=3)
    est.fit(X_train, y_train)
    print(scorer(est, X_test, y_test))
Generation: : 1it [03:13, 193.20s/it]
0.999621947852182
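Example analysis and the Estimator class¶
Here we use a toy example dataset included in scikit-learn. We use the linear configuration together with complexity_scorer to estimate complexity.
Note that for this toy example we set a relatively short run time. In practice, we recommend running TPOT for longer and setting early_stop to a value of around 5 to 20 (see the section on terminating the optimization above).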
from dask.distributed import Client, LocalCluster
import tpot
import sklearn
import sklearn.datasets
import sklearn.metrics
import sklearn.model_selection
import numpy as np
import tpot.objectives
scorer = sklearn.metrics.get_scorer('roc_auc_ovr')
X, y = sklearn.datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, train_size=0.75, test_size=0.25)
est = tpot.TPOTClassifier(
    scorers=[scorer, tpot.objectives.complexity_scorer],
    scorers_weights=[1.0, -1.0],
    search_space="linear",
    n_jobs=4,
    max_time_mins=60,
    max_eval_time_mins=10,
    early_stop=2,
    verbose=2,)
est.fit(X_train, y_train)
print(scorer(est, X_test, y_test))
Generation: : 4it [02:34, 38.64s/it]
0.9978289188015632
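You can access the best pipeline selected by TPOT through the fitted_pipeline_ attribute. This is the pipeline with the highest cross-validation score on the first scorer (or on the first objective function if no scorer is provided).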
best_pipeline = est.fitted_pipeline_
best_pipeline
Pipeline(steps=[('minmaxscaler', MinMaxScaler()), ('selectpercentile', SelectPercentile(percentile=68.60012151662)), ('featureunion-1', FeatureUnion(transformer_list=[('skiptransformer', SkipTransformer()), ('passthrough', Passthrough())])), ('featureunion-2', FeatureUnion(transformer_list=[('skiptransformer', SkipTransformer()), ('passthrough', Passthrough())])), ('mlpclassifier', MLPClassifier(activation='identity', alpha=0.0023692590029, hidden_layer_sizes=[139, 139], learning_rate='invscaling', learning_rate_init=0.0004707733364, n_iter_no_change=32))])
best_pipeline.predict(X_test)
array([1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1])
Saving the pipeline¶
We recommend using dill or pickle to save the instance of fitted_pipeline_. Note that we do not recommend pickling the TPOT object itself.
import dill as pickle
# save the pipeline
with open("best_pipeline.pkl", "wb") as f:
    pickle.dump(best_pipeline, f)
# load the pipeline
with open("best_pipeline.pkl", "rb") as f:
    my_loaded_best_pipeline = pickle.load(f)
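The evaluated_individuals DataFrame: further analysis of results¶
The evaluated_individuals attribute of a tpot estimator object is a pandas DataFrame containing information about the run. Each row corresponds to an individual pipeline explored by tpot. The DataFrame contains the following columns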
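Column | Description |
---|---|
<n objective function columns> | The first set of columns corresponds to each objective function. These can be named automatically by TPOT or passed in by the user. |
Parents | A tuple containing the indexes of the "parents" of the current pipeline. For example, (29, 42) means that the pipelines at indexes 29 and 42 were used to generate this pipeline. |
Variation_Function | The function applied to the parents to generate the new pipeline |
Individual | The individual class that represents a specific pipeline and hyperparameter configuration. This class also contains the functions for mutation and crossover. To get the sklearn estimator/pipeline object from an individual, call its export_pipeline() function (for example, pipe = ind.export_pipeline()). |
Generation | The generation in which the individual was created. (Note that higher-performing pipelines from earlier generations may still be present in the current "population" of a given generation if they are selected.) |
Submitted Timestamp | Timestamp, in seconds, at which the pipeline was sent to be evaluated. This is the output of time.time(), which "returns the time in seconds since the epoch as a floating point number." |
Completed Timestamp | Timestamp at which the pipeline evaluation completed, in the same units as Submitted Timestamp |
Pareto_Front | If you have multiple objectives, this column is True if the pipeline's performance falls on the Pareto front. The Pareto front is the set of pipelines whose scores are strictly better than those of pipelines not on the front, but not strictly better than one another. |
Instance | The unfitted pipeline evaluated for this row. (This is the pipeline returned by calling export_pipeline() on the individual class.) |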
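The meaning of the Pareto_Front column can be made concrete with a short sketch that recomputes the front by hand. This is not TPOT's code: `pareto_front` is a hypothetical helper, and the rows are a few (index, ROC AUC, complexity) values copied from the example run shown further down in this document, with ROC AUC maximized and complexity minimized.

```python
def pareto_front(rows):
    """Return the indexes of rows that no other row dominates.

    Each row is (index, roc_auc, complexity); ROC AUC is maximized and
    complexity is minimized.
    """
    def dominates(a, b):
        no_worse = a[1] >= b[1] and a[2] <= b[2]
        strictly_better = a[1] > b[1] or a[2] < b[2]
        return no_worse and strictly_better

    return [r[0] for r in rows if not any(dominates(o, r) for o in rows)]

rows = [
    (51, 0.996818, 582.0),
    (133, 0.996239, 31.0),
    (215, 0.988614, 9.0),
    (245, 0.986280, 44.0),
    (246, 0.902845, 9.0),
    (247, 0.992851, 5301.0),
    (249, 0.515242, 9.0),
]
front = pareto_front(rows)
print(front)
```

Row 245, for example, is left out because row 133 has both a higher ROC AUC and a lower complexity.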
#get the score/objective column names generated by TPOT
est.objective_names
['roc_auc_score', 'complexity_scorer']
df = est.evaluated_individuals
df
| | roc_auc_score | complexity_scorer | Parents | Variation_Function | Individual | Generation | Submitted Timestamp | Completed Timestamp | Eval Error | Pareto_Front | Instance |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | NaN | NaN | NaN | NaN | <tpot.search_spaces.pipelines.sequential.Seque... | 0.0 | 1.740178e+09 | 1.740178e+09 | INVALID | NaN | (MaxAbsScaler(), RFE(estimator=ExtraTreesClass... |
1 | NaN | NaN | NaN | NaN | <tpot.search_spaces.pipelines.sequential.Seque... | 0.0 | 1.740178e+09 | 1.740179e+09 | INVALID | NaN | (RobustScaler(quantile_range=(0.1386847479391,... |
2 | NaN | NaN | NaN | NaN | <tpot.search_spaces.pipelines.sequential.Seque... | 0.0 | 1.740178e+09 | 1.740178e+09 | INVALID | NaN | (RobustScaler(quantile_range=(0.0087917518794,... |
3 | NaN | NaN | NaN | NaN | <tpot.search_spaces.pipelines.sequential.Seque... | 0.0 | 1.740178e+09 | 1.740178e+09 | INVALID | NaN | (Passthrough(), Passthrough(), FeatureUnion(tr... |
4 | 0.969262 | 241.2 | NaN | NaN | <tpot.search_spaces.pipelines.sequential.Seque... | 0.0 | 1.740178e+09 | 1.740178e+09 | None | NaN | (RobustScaler(quantile_range=(0.0359502923061,... |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
245 | 0.986280 | 44.0 | (184, 184) | ind_crossover | <tpot.search_spaces.pipelines.sequential.Seque... | 4.0 | 1.740179e+09 | 1.740179e+09 | None | NaN | (RobustScaler(quantile_range=(0.1428289713161,... |
246 | 0.902845 | 9.0 | (145, 148) | ind_mutate , ind_mutate , ind_crossover | <tpot.search_spaces.pipelines.sequential.Seque... | 4.0 | 1.740179e+09 | 1.740179e+09 | None | NaN | (MinMaxScaler(), SelectFwe(alpha=0.00184795618... |
247 | 0.992851 | 5301.0 | (155, 133) | ind_mutate , ind_mutate , ind_crossover | <tpot.search_spaces.pipelines.sequential.Seque... | 4.0 | 1.740179e+09 | 1.740179e+09 | None | NaN | (MaxAbsScaler(), SelectFwe(alpha=0.00212090942... |
248 | 0.992349 | 7749.0 | (152, 152) | ind_mutate | <tpot.search_spaces.pipelines.sequential.Seque... | 4.0 | 1.740179e+09 | 1.740179e+09 | None | NaN | (MinMaxScaler(), SelectFromModel(estimator=Ext... |
249 | 0.515242 | 9.0 | (182, 182) | ind_mutate | <tpot.search_spaces.pipelines.sequential.Seque... | 4.0 | 1.740179e+09 | 1.740179e+09 | None | NaN | (MaxAbsScaler(), VarianceThreshold(threshold=0... |
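250 rows × 11 columns
Let's plot the performance of the different pipelines, including the Pareto front¶
Plotting performance on multiple objectives in a scatterplot is a helpful way to visualize the trade-off between model complexity and predictive performance. This is best seen by plotting the Pareto front pipelines, which are the best-performing pipelines across the spectrum of complexity. In general, higher-complexity models may yield higher performance but are harder to interpret.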
import seaborn as sns
import matplotlib.pyplot as plt
# rows off the Pareto front hold NaN in 'Pareto_Front', so `!= 1` selects them
fig, ax = plt.subplots(figsize=(5,5))
sns.scatterplot(df[df['Pareto_Front']!=1], x='roc_auc_score', y='complexity_scorer', label='other', ax=ax)
sns.scatterplot(df[df['Pareto_Front']==1], x='roc_auc_score', y='complexity_scorer', label='Pareto Front', ax=ax)
ax.title.set_text('Performance of all pipelines')
#log scale y
ax.set_yscale('log')
plt.show()
#scatter only the Pareto front pipelines
fig, ax = plt.subplots(figsize=(10,5))
sns.scatterplot(df[df['Pareto_Front']==1], x='roc_auc_score', y='complexity_scorer', label='Pareto Front', ax=ax)
ax.title.set_text('Performance of only the Pareto Front')
#log scale y
# ax.set_yscale('log')
plt.show()
#list only the Pareto front pipelines, sorted from best to worst roc_auc_score
sorted_pareto_front = df[df['Pareto_Front']==1].sort_values('roc_auc_score', ascending=False)
sorted_pareto_front
 | roc_auc_score | complexity_scorer | Parents | Variation_Function | Individual | Generation | Submitted Timestamp | Completed Timestamp | Eval Error | Pareto_Front | Instance |
---|---|---|---|---|---|---|---|---|---|---|---|
51 | 0.996818 | 582.0 | (13, 13) | ind_mutate | <tpot.search_spaces.pipelines.sequential.Seque... | 1.0 | 1.740179e+09 | 1.740179e+09 | None | 1.0 | (MinMaxScaler(), SelectPercentile(percentile=6... |
133 | 0.996239 | 31.0 | (65, 65) | ind_mutate | <tpot.search_spaces.pipelines.sequential.Seque... | 2.0 | 1.740179e+09 | 1.740179e+09 | None | 1.0 | (StandardScaler(), SelectFwe(alpha=0.002276474... |
185 | 0.995843 | 30.9 | (133, 133) | ind_mutate | <tpot.search_spaces.pipelines.sequential.Seque... | 3.0 | 1.740179e+09 | 1.740179e+09 | None | 1.0 | (StandardScaler(), SelectFwe(alpha=0.000234016... |
233 | 0.995115 | 30.7 | (185, 185) | ind_mutate | <tpot.search_spaces.pipelines.sequential.Seque... | 4.0 | 1.740179e+09 | 1.740179e+09 | None | 1.0 | (StandardScaler(), SelectFwe(alpha=0.000234016... |
85 | 0.990894 | 26.0 | (6, 23) | ind_crossover | <tpot.search_spaces.pipelines.sequential.Seque... | 1.0 | 1.740179e+09 | 1.740179e+09 | None | 1.0 | (MaxAbsScaler(), SelectFwe(alpha=0.00114277554... |
228 | 0.990081 | 19.0 | (162, 162) | ind_mutate | <tpot.search_spaces.pipelines.sequential.Seque... | 4.0 | 1.740179e+09 | 1.740179e+09 | None | 1.0 | (MaxAbsScaler(), VarianceThreshold(threshold=0... |
215 | 0.988614 | 9.0 | (162, 162) | ind_mutate | <tpot.search_spaces.pipelines.sequential.Seque... | 4.0 | 1.740179e+09 | 1.740179e+09 | None | 1.0 | (MaxAbsScaler(), VarianceThreshold(threshold=0... |
121 | 0.982524 | 7.0 | (10, 10) | ind_mutate | <tpot.search_spaces.pipelines.sequential.Seque... | 2.0 | 1.740179e+09 | 1.740179e+09 | None | 1.0 | (MaxAbsScaler(), SelectFwe(alpha=0.03019980124... |
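To make the idea of the Pareto front concrete: a pipeline is on the front if no other pipeline is at least as good on both objectives and strictly better on one. The following is a minimal pure-Python sketch of that rule, not TPOT's own implementation. The function name `pareto_front` is hypothetical, and the (roc_auc_score, complexity_scorer) pairs are copied from the tables above.

```python
def pareto_front(points):
    """Return the (score, complexity) points that no other point dominates.

    Higher score is better, lower complexity is better.
    """
    front = []
    for i, (score_i, complexity_i) in enumerate(points):
        dominated = any(
            # j is at least as good on both objectives ...
            score_j >= score_i and complexity_j <= complexity_i
            # ... and strictly better on at least one of them
            and (score_j > score_i or complexity_j < complexity_i)
            for j, (score_j, complexity_j) in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append((score_i, complexity_i))
    return front


# Rows 51, 133, 245, 121 and 246 from the tables above.
points = [
    (0.996818, 582.0),
    (0.996239, 31.0),
    (0.986280, 44.0),  # dominated by (0.996239, 31.0)
    (0.982524, 7.0),
    (0.902845, 9.0),   # dominated by (0.982524, 7.0)
]
front = pareto_front(points)
```

The quadratic loop is fine for a few hundred pipelines; it is written for readability rather than speed.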
In some cases, you may want to select a pipeline with slightly lower performance but significantly lower complexity.
#access the lowest-complexity pipeline on the Pareto front (the last row after sorting by score)
best_pipeline_lowest_complexity = sorted_pareto_front.iloc[-1]['Instance']
best_pipeline_lowest_complexity
Pipeline(steps=[('maxabsscaler', MaxAbsScaler()), ('selectfwe', SelectFwe(alpha=0.0301998012478)), ('featureunion-1', FeatureUnion(transformer_list=[('skiptransformer', SkipTransformer()), ('passthrough', Passthrough())])), ('featureunion-2', FeatureUnion(transformer_list=[('skiptransformer', SkipTransformer()), ('passthrough', Passthrough())])), ('kneighborsclassifier', KNeighborsClassifier(n_jobs=1, n_neighbors=2))])
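Taking the very last row of the front is the most aggressive choice. A gentler option is to accept any Pareto front pipeline whose score is within a chosen tolerance of the best score, then take the least complex one. The sketch below assumes a hypothetical helper `simplest_within_tolerance` and tolerances picked purely for illustration. The rows are the (index, roc_auc_score, complexity_scorer) values from the sorted Pareto front table above.

```python
def simplest_within_tolerance(rows, tolerance):
    """Return the least complex (index, score, complexity) row whose score
    is within `tolerance` of the best score in `rows`."""
    best_score = max(score for _, score, _ in rows)
    candidates = [row for row in rows if row[1] >= best_score - tolerance]
    return min(candidates, key=lambda row: row[2])


pareto_rows = [
    (51, 0.996818, 582.0),
    (133, 0.996239, 31.0),
    (185, 0.995843, 30.9),
    (233, 0.995115, 30.7),
    (85, 0.990894, 26.0),
    (228, 0.990081, 19.0),
    (215, 0.988614, 9.0),
    (121, 0.982524, 7.0),
]

tight_choice = simplest_within_tolerance(pareto_rows, tolerance=0.001)
loose_choice = simplest_within_tolerance(pareto_rows, tolerance=0.01)
```

The first element of the returned row is the data frame index, so the corresponding pipeline could then be looked up with something like `sorted_pareto_front.loc[tight_choice[0], 'Instance']`.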
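Plotting performance over time + continuing a run from where it left off¶
Plotting performance over time is a good way to assess whether the TPOT model has converged. If performance plateaus over time, running longer is unlikely to yield much improvement. If the plot still appears to be improving, it may be worth running TPOT for longer.
In this case, we can see that performance is close to optimal and has leveled off, so more time is most likely not needed.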
import numpy as np
#get rows where roc_auc_score is not NaN
scores_and_times = df[df['roc_auc_score'].notna()][['roc_auc_score', 'Completed Timestamp']].sort_values('Completed Timestamp', ascending=True).to_numpy()
#get best score at a given time
best_scores = np.maximum.accumulate(scores_and_times[:,0])
times = scores_and_times[:,1]
times = times - df['Submitted Timestamp'].min()
fig, ax = plt.subplots(figsize=(10,5))
ax.plot(times, best_scores)
ax.set_xlabel('Time (seconds)')
ax.set_ylabel('Best Score')
plt.show()
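The same "best score so far" curve can also be checked numerically instead of by eye. The following is a minimal sketch, assuming a hypothetical rule that a run has plateaued when the best score did not improve by more than `min_gain` over the last `window` evaluations. The scores are made up for illustration, and this is not a criterion TPOT itself applies.

```python
from itertools import accumulate


def best_so_far(scores):
    """Running maximum of the scores, in completion order."""
    return list(accumulate(scores, max))


def has_plateaued(scores, window, min_gain=0.0):
    """True if the best score improved by at most `min_gain`
    over the last `window` evaluations."""
    best = best_so_far(scores)
    if len(best) <= window:
        return False  # not enough history to judge
    return best[-1] - best[-1 - window] <= min_gain


# Hypothetical scores, ordered by completion time.
scores = [0.90, 0.95, 0.93, 0.97, 0.96, 0.97, 0.95]
curve = best_so_far(scores)
short_window = has_plateaued(scores, window=3)
long_window = has_plateaued(scores, window=5)
```

A short window reports a plateau sooner; a longer window is more conservative about declaring convergence.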
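Checkpointing¶
There are two ways to resume a TPOT run.
- If the warm_start parameter is set to True, subsequent calls to fit continue training where the previous call left off (the conventional scikit-learn default is to retrain from scratch on subsequent calls to fit).
- If periodic_checkpoint_folder is set, TPOT periodically saves its current state to disk. If TPOT is interrupted (job canceled, computer shut down, crash), you can resume training from where it left off. The checkpoint folder stores a data frame of all evaluated pipelines. This data frame can be loaded and inspected to help diagnose problems when debugging.
Note: TPOT does not clean up the checkpoint files. If the periodic_checkpoint_folder parameter is set, training always continues from the last saved point, even if the input data has changed. A common issue is forgetting to change this folder between experiments, which causes TPOT to continue training from pipelines that were optimized for another dataset. If you intend to start a run from scratch, you must remove the parameter, supply an empty folder, or delete the original checkpoint folder.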
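One way to avoid accidentally resuming from a stale checkpoint is to give every experiment its own folder and refuse to reuse a non-empty one unless resuming is intended. The following is a minimal sketch. The helper names `checkpoint_folder_for` and `build_classifier`, the `resume` flag, and the experiment name are hypothetical; only the `periodic_checkpoint_folder` and `warm_start` parameters come from the text above.

```python
import tempfile
from pathlib import Path


def checkpoint_folder_for(base_dir, experiment_name, resume=False):
    """Return a per-experiment checkpoint folder, creating it if needed.

    Raises FileExistsError if the folder already holds files and resume is
    False, so a new experiment never silently continues from an old checkpoint.
    """
    folder = Path(base_dir) / experiment_name
    folder.mkdir(parents=True, exist_ok=True)
    if not resume and any(folder.iterdir()):
        raise FileExistsError(f"{folder} is not empty; pass resume=True to continue from it")
    return str(folder)


def build_classifier(folder):
    # Imported lazily so the helper above can be used without TPOT installed.
    from tpot import TPOTClassifier
    return TPOTClassifier(periodic_checkpoint_folder=folder, warm_start=True)


# Demonstration with a throwaway base directory and a stand-in checkpoint file.
with tempfile.TemporaryDirectory() as base:
    first = checkpoint_folder_for(base, "breast_cancer_run1")
    Path(first, "stand_in_checkpoint.txt").write_text("placeholder")
    try:
        checkpoint_folder_for(base, "breast_cancer_run1")
        refused = False
    except FileExistsError:
        refused = True
    resumed = checkpoint_folder_for(base, "breast_cancer_run1", resume=True)
```

Failing loudly is a deliberate choice here: a refused run is cheaper than hours of optimization that started from the wrong population.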
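Common parameters¶
Listed here are some of the most commonly used customizable parameters and what they do. See the documentation for TPOTEstimator or TPOTEstimatorSteadyState for the full list of parameters.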
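Parameter | Type | Description |
---|---|---|
scorers | list, scorer | List of scorers used for cross-validation |
scorers_weights | list | Weights applied to the scorers during optimization |
classification | bool | Problem type: True for classification, False for regression |
cv | int, cross-validator | Cross-validation strategy: an int for the number of folds, or a custom cross-validator |
max_depth | int | Maximum pipeline depth |
other_objective_functions | list | Additional objective functions; default: [average_path_length_objective] |
other_objective_functions_weights | list | Weights for the additional objective functions; default: [-1] |
objective_function_names | list | Names of the objective functions; default: None (uses the function names) |
bigger_is_better | bool | Optimization direction: True to maximize, False to minimize |
generations | int | Number of generations to optimize; default: 50 |
max_time_mins | float | Maximum optimization time in minutes; default: infinite |
max_eval_time_mins | float | Maximum evaluation time per individual in minutes; default: 300 |
n_jobs | int | Number of parallel processes; default: 1 |
memory_limit | str | Memory limit per job; default: "4GB" |
verbose | int | Verbosity of the optimization process: 0 (none), 1 (progress), 3 (best individual), 4 (warnings), 5+ (full warnings) |
memory | str, Memory object | If provided, the pipeline caches each transformer after calling fit, using joblib.Memory. |
periodic_checkpoint_folder | str | Folder in which to periodically save the population. If None, no periodic saving is done. If provided, training resumes from this checkpoint. |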
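Preventing overfitting¶
On small datasets, it is entirely possible for TPOT to overfit the cross-validation score itself. This can lead to lower-than-expected performance on held-out data. TPOT always returns the model with the highest CV score as its final fitted_pipeline. However, if the best model according to cross-validation is simply overfit to the CV score, it may actually perform worse than other models on the Pareto front.
- Using a secondary complexity objective and evaluating the entire Pareto front may help. In some cases, a pipeline with lower complexity and a lower CV score actually performs better on held-out data. These pipelines can be evaluated and compared on a held-out validation set or, when data is very limited, by simply using a different seed to split the CV folds.
- TPOT can do this automatically. The validation_strategy parameter can be set to re-test the final Pareto front either on a held-out validation set (a share of the dataset set by the validation_fraction parameter) or on CV folds split with a different seed. These strategies are selected by setting validation_strategy to "split" or "reshuffled", respectively.
- Increasing the number of cross-validation folds can mitigate this issue.
- Nested cross-validation can also be used to estimate the performance of the TPOT optimization algorithm itself.
- Removing more complex methods from the search space can reduce the chance of overfitting.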
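As a rough illustration of the automatic option, the settings can be collected in one place and passed to the estimator. This is a sketch under stated assumptions: the specific values are hypothetical, `validation_fraction` is assumed to be a fraction between 0 and 1, and `build_classifier` is a made-up helper. Only the parameter names and the "split" / "reshuffled" options come from the text above.

```python
# Hypothetical settings chosen for illustration only.
overfitting_guard = {
    "validation_strategy": "split",  # or "reshuffled" to re-split the CV folds with a new seed
    "validation_fraction": 0.2,      # assumed: share of the data held out for re-testing
    "cv": 10,                        # more folds than the cv=5 used earlier in this document
}


def build_classifier(**extra_kwargs):
    # Imported lazily so the settings above can be inspected without TPOT installed.
    from tpot import TPOTClassifier
    return TPOTClassifier(**overfitting_guard, **extra_kwargs)
```

Calling `build_classifier(max_time_mins=30)` would then return an estimator that is fitted in the usual way with `fit(X_train, y_train)`.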
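Tips and tricks for speeding up TPOT¶
TPOT can be a computationally heavy algorithm, since it fits thousands of complex machine learning pipelines on potentially large datasets. Several strategies are available to reduce the amount of computation needed and shorten the run time.
TPOT implements three main strategies to reduce redundant work and/or avoid wasting compute on poorly performing pipelines.
- TPOT pipelines often contain exactly the same components doing exactly the same computation (for example, the first step of a pipeline stays the same and only the parameters of the final classifier change). In these cases, the first strategy is simply to cache these repeated computations so they happen only once. More information is given in the next subsection.
- Successive Halving. This idea was first tested with TPOT by Parmentier et al. in "TPOT-SH: A Faster Optimization Algorithm to Solve the AutoML Problem on Large Datasets". The algorithm operates in two stages. Initially, it trains early generations on a small subset of the data with a large population size. Later generations evaluate a smaller set of promising pipelines on larger, or even full, portions of the data. This approach quickly identifies top-performing pipeline configurations through rough initial evaluations, followed by more thorough ones. See Tutorial 8 for more on this strategy.
- Most commonly, pipelines are evaluated with cross-validation. However, it is often possible to tell within the first few folds whether a pipeline has a reasonable chance of outperforming the previous best pipelines. For example, if the best score so far is 0.92 AUROC and the average over the first five folds of the current pipeline is only around 0.61, we can be reasonably confident that the remaining five folds are unlikely to put this pipeline ahead of the others. By not computing the remaining folds, we can save a substantial amount of compute. TPOT can use two strategies to accomplish this (see Tutorial 8 for more on these strategies).
  - Threshold Pruning: Pipelines must achieve a score above a predefined percentile threshold (based on previous pipeline scores) to proceed in each cross-validation (CV) fold.
  - Selection Pruning: Within each population, only the top N% of pipelines (ranked by performance in the previous CV fold) are selected to be evaluated in the next fold.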
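The two pruning ideas above can be sketched in a few lines of plain Python. This is an illustration of the general technique rather than TPOT's implementation: the function names, the nearest-rank percentile rule, and all scores are assumptions made for the example, and the fold scores are supplied as a list where a real run would compute them one fold at a time.

```python
import math
from statistics import mean


def percentile_threshold(previous_scores, q):
    """Nearest-rank q-th percentile of previously observed pipeline scores."""
    ordered = sorted(previous_scores)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def cross_validate_with_threshold(fold_scores, threshold):
    """Threshold pruning: consume fold scores one at a time and stop as soon
    as the running mean drops below `threshold`.

    Returns (running_mean, folds_evaluated, pruned).
    """
    seen = []
    for score in fold_scores:
        seen.append(score)
        if mean(seen) < threshold:
            return mean(seen), len(seen), True
    return mean(seen), len(seen), False


def select_top_fraction(scores_by_pipeline, keep_fraction):
    """Selection pruning: keep only the best-scoring fraction of pipelines
    for evaluation on the next fold."""
    n_keep = max(1, math.ceil(len(scores_by_pipeline) * keep_fraction))
    ranked = sorted(scores_by_pipeline, key=scores_by_pipeline.get, reverse=True)
    return ranked[:n_keep]


# Hypothetical scores of previously evaluated pipelines.
previous = [0.80, 0.85, 0.88, 0.90, 0.92]
threshold = percentile_threshold(previous, q=50)

weak = cross_validate_with_threshold([0.61, 0.60, 0.63, 0.59, 0.62], threshold)
strong = cross_validate_with_threshold([0.93, 0.91, 0.94], threshold)
survivors = select_top_fraction({"a": 0.91, "b": 0.61, "c": 0.88, "d": 0.75}, keep_fraction=0.5)
```

Here the weak pipeline is dropped after its first fold, so the remaining four folds are never computed, while the strong pipeline is evaluated on every fold.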
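Pipeline caching in TPOT (joblib.Memory)¶
With the memory parameter, pipelines can cache the results of each transformer after fitting it. This avoids repeated computation during optimization when a transformer's parameters and input data are identical to those in another pipeline that has already been fitted. TPOT lets you specify a custom directory path or a joblib.Memory object so that the cache can be reused in future TPOT runs (or in a warm_start run).
There are three ways to enable memory caching in TPOT: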
from tpot import TPOTClassifier
from tempfile import mkdtemp
from joblib import Memory
from shutil import rmtree
# Method 1, auto mode: TPOT uses memory caching with a temporary directory and cleans it up upon shutdown
est = TPOTClassifier(memory='auto')
# Method 2, with a custom directory for memory caching
est = TPOTClassifier(memory='/to/your/path')
# Method 3, with a Memory object
memory = Memory(location='./to/your/path', verbose=0)
est = TPOTClassifier(memory=memory)
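Note: TPOT does not clean up the memory cache if you set a custom directory path or a Memory object. We recommend cleaning up the cache when you no longer need it.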
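Since a custom cache directory is not cleaned up automatically, one option is to create it with `mkdtemp` and remove it with `rmtree` once fitting is done, even if fitting fails. The following is a minimal sketch; the wrapper name `run_with_temporary_cache` and the `tpot_cache_` prefix are hypothetical, and the stand-in callable replaces a real TPOT fit so the pattern can be shown on its own.

```python
import os
from shutil import rmtree
from tempfile import mkdtemp


def run_with_temporary_cache(fit_fn):
    """Create a cache directory, pass its path to `fit_fn`, and always
    remove the directory afterwards, even if `fit_fn` raises."""
    cachedir = mkdtemp(prefix="tpot_cache_")
    try:
        return fit_fn(cachedir)
    finally:
        rmtree(cachedir, ignore_errors=True)


# Stand-in for a real fit: records the directory and checks that it exists.
used_dirs = []


def stand_in_fit(cachedir):
    used_dirs.append(cachedir)
    return os.path.isdir(cachedir)


existed_during_fit = run_with_temporary_cache(stand_in_fit)
exists_after_fit = os.path.exists(used_dirs[0])
```

With TPOT installed, the callable could be something like `lambda cachedir: TPOTClassifier(memory=cachedir).fit(X_train, y_train)`.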
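Advanced parallelization (HPC and multi-node training)¶
See Tutorial 7 for more details on parallelization with Dask, including information on using multiple nodes.
FAQ and debugging¶
If you are running into problems with TPOT, here are some common issues and how to solve them.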
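- Performance is lower than expected. What can I do?
  - TPOT may need to run for longer. Increase the values of max_time_mins, early_stop, or generations.
  - Individual pipelines may need more time to finish fitting. Increase the value of max_eval_time_mins.
  - The configuration may not include the optimal model types or hyperparameter ranges. Explore the other included templates, or customize your own search space (see Tutorial 2!).
  - Check that periodic_checkpoint_folder is set correctly. A common issue is forgetting to change this folder between experiments, which causes TPOT to continue training from pipelines that were optimized for another dataset.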
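- TPOT is too slow! It keeps running and never terminates.
  - Check that at least one of the three termination conditions is set to a reasonable level: max_time_mins, early_stop, or generations. Also check that max_eval_time_mins gives most models enough time to train without being overly long. (Some estimators can take an unreasonably long time to fit; this parameter is intended to keep them from stalling the whole process. In my experience, SVC and SVR tend to be the culprits, so removing them from the search space can also improve run time.)
  - Set the memory parameter so that TPOT can avoid repeated work when using scikit-learn pipelines or TPOT GraphPipelines.
  - Increase n_jobs to use more processes/CPU power. See Tutorial 7 for advanced Dask usage, including parallelizing across multiple nodes on an HPC.
  - Use feature selection, either the built-in configurations of sklearn methods (see Tutorial 2) or genetic feature selection (see Tutorials 3 and 5 for two different strategies).
  - Use successive halving to reduce the computational load (see Tutorial 8).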
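- Many pipelines in the evaluated_individuals data frame crashed or are invalid!
  - This is normal and expected behavior for TPOT. In some cases, TPOT may try an invalid hyperparameter combination that causes the pipeline to fail. Other times, the pipeline configuration itself is invalid. For example, a selector may select no features because of its hyperparameters. Another common example is MultinomialNB throwing an error because it expects non-negative values but a prior transformation produced negative ones.
  - If you are using a custom search space, you can use ConfigSpace conditionals to prevent invalid hyperparameters (they can still occur because of how TPOT applies crossover).
  - Setting verbose=5 prints the full error message for every failed pipeline. This is useful for checking whether a pipeline, a custom search space module, or something else is misconfigured.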
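- TPOT is crashing due to memory issues.
  - Set the memory_limit parameter so that n_jobs * memory_limit is less than the available RAM on your machine, with some headroom. This should prevent crashes due to memory issues.
  - As mentioned above, using feature selection may also improve memory usage.
  - Remove modules that lead to high RAM usage (for example, multiple PolynomialFeatures or a single high-degree PolynomialFeatures).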
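- Why are my TPOT runs not reproducible even though random_state is set?
  - Check that periodic_checkpoint_folder is set correctly. If it points to a non-empty folder, TPOT continues training from the checkpoint rather than starting a new run from scratch. For TPOT runs to be reproducible, they must have the same starting point.
  - If using a custom search space, pass a fixed random_state value into the configspace of the scikit-learn modules that use one. TPOT does not check whether estimators accept a random state value (see Tutorial 2).
  - If using the pre-built search spaces provided by TPOT, make sure to pass random_state to tpot.config.get_configspace or tpot.config.template_search_spaces.get_template_search_spaces. This ensures that all estimators that support it receive a fixed random_state value (see Tutorial 2).
  - If using custom nodes and pipeline types, make sure all random decisions use the rng parameter passed into the mutation/crossover functions.
  - If max_eval_time_mins is set, TPOT terminates pipelines that exceed this time limit. If a pipeline's evaluation time is very close to the limit, small random fluctuations in CPU allocation can cause a pipeline to be evaluated in one run but not in another. This small difference in results shifts the random number generator for the rest of the run. Setting max_eval_time_mins to None or to a higher value can prevent this edge case.
  - If using TPOTEstimatorSteadyState with n_jobs > 1, random fluctuations in CPU allocation can also slightly change the order in which pipelines are evaluated, which affects downstream results. TPOTEstimatorSteadyState is more reliably reproducible when n_jobs=1. (This is not an issue for the default TPOTEstimator, TPOTClassifier, and TPOTRegressor, because they use a batched generational approach in which execution order does not affect results.)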
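- TPOT is not using all the CPU cores I expected, given my n_jobs setting.
  - The default TPOT algorithm uses a generational approach. This means TPOT needs to evaluate population_size (default 50) pipelines before starting the next batch. At the end of each generation, TPOT may leave threads idle while waiting for the last few pipelines to finish evaluating. Some estimators or pipelines can be much slower to evaluate than others. This can be addressed in a few ways:
    - Decrease max_eval_time_mins to cut long-running pipeline evaluations short.
    - Remove estimators or hyperparameter configurations that are prone to very slow convergence (often SVC or SVR).
    - Alternatively, TPOTEstimatorSteadyState uses a slightly different backend for the evolutionary algorithm that does not use the generational approach. Instead, it generates and evaluates the next individual as soon as one finishes evaluating. With this estimator, all cores should be utilized at all times.
    - Sometimes, setting n_jobs to a multiple of the number of threads can help minimize the chance of threads sitting idle while waiting for others to finish.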
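For the memory question above, the rule that n_jobs * memory_limit should stay below the available RAM can be turned into a small calculation. The following is a minimal sketch: the helper name `per_job_memory_limit` and the 20% headroom are hypothetical choices, and the returned string format is assumed to match the "4GB" default shown in the parameter table.

```python
import math


def per_job_memory_limit(available_gb, n_jobs, headroom_fraction=0.2):
    """Largest whole-GB per-job limit such that n_jobs * limit stays below
    the available RAM after reserving `headroom_fraction` as headroom."""
    if n_jobs < 1:
        raise ValueError("n_jobs must be at least 1")
    usable_gb = available_gb * (1 - headroom_fraction)
    per_job_gb = math.floor(usable_gb / n_jobs)
    if per_job_gb < 1:
        raise ValueError("not enough RAM for this many jobs; reduce n_jobs")
    return f"{per_job_gb}GB"


# A machine with 64 GB of RAM running 8 parallel jobs.
limit = per_job_memory_limit(available_gb=64, n_jobs=8)
```

The result would then be passed as `memory_limit=limit` together with `n_jobs=8`.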
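More options¶
tpot.TPOTClassifier and tpot.TPOTRegressor have a simplified set of hyperparameters, with default values set for classification and regression problems. Currently, both use the standard evolutionary algorithm in the tpot.TPOTEstimator class. If you want more control, you can look into the tpot.TPOTEstimator or tpot.TPOTEstimatorSteadyState classes.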
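TPOT comes with two evolutionary algorithms, corresponding to two different estimator classes.
- tpot.TPOTEstimator uses a standard evolutionary algorithm that evaluates exactly population_size individuals each generation. This is similar to the algorithm in TPOT1. The next generation does not start until the previous one has been completely evaluated. This leaves CPU time underutilized, as cores wait for the last individuals to finish training, but it may help preserve diversity in the population.
- tpot.TPOTEstimatorSteadyState differs in that it generates and evaluates the next individual as soon as an individual finishes evaluating. The number of individuals being evaluated at once is determined by the n_jobs parameter. There is no longer a concept of generations. The population_size parameter now refers to the size of the list of evaluated parents. When an individual is evaluated, the selection method updates the list of parents. This allows more efficient use of resources when using multiple cores.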
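The difference in CPU utilization between the two approaches can be illustrated with a toy scheduling simulation. This is a simplified sketch, not TPOT code: it models only when workers become free, ignores how new individuals are generated, and uses made-up evaluation times and hypothetical function names.

```python
import heapq


def makespan(durations, n_workers):
    """Wall-clock time to finish all tasks when each task is handed to
    whichever worker becomes free first."""
    free_at = [0] * n_workers
    heapq.heapify(free_at)
    for duration in durations:
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + duration)
    return max(free_at)


def generational_time(durations, population_size, n_workers):
    """Each batch of `population_size` tasks must finish completely
    before the next batch may start."""
    return sum(
        makespan(durations[i:i + population_size], n_workers)
        for i in range(0, len(durations), population_size)
    )


def steady_state_time(durations, n_workers):
    """A new task starts as soon as any worker is free; no batch barrier."""
    return makespan(durations, n_workers)


# Hypothetical evaluation times: most pipelines are quick, a few are slow.
durations = [1, 1, 1, 9, 1, 1, 1, 9]
generational = generational_time(durations, population_size=4, n_workers=2)
steady = steady_state_time(durations, n_workers=2)
```

In this toy case the generational schedule takes 20 time units against 14 for the steady-state schedule, because one worker sits idle at the end of each batch while the slow pipeline finishes.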
tpot.TPOTEstimatorSteadyState¶
import tpot
import sklearn
import sklearn.datasets
import sklearn.metrics
import sklearn.model_selection
graph_search_space = tpot.search_spaces.pipelines.GraphSearchPipeline(
root_search_space= tpot.config.get_search_space(["KNeighborsClassifier", "LogisticRegression", "DecisionTreeClassifier"]),
leaf_search_space = tpot.config.get_search_space("selectors"),
inner_search_space = tpot.config.get_search_space(["transformers"]),
max_size = 10,
)
est = tpot.TPOTEstimatorSteadyState(
search_space = graph_search_space,
scorers=['roc_auc_ovr',tpot.objectives.complexity_scorer],
scorers_weights=[1,-1],
classification=True,
max_eval_time_mins=15,
max_time_mins=30,
early_stop=10, #In TPOTEstimatorSteadyState, since there are no generations, early_stop is the number of pipelines to evaluate before stopping.
n_jobs=30,
verbose=2)
scorer = sklearn.metrics.get_scorer('roc_auc_ovo')
X, y = sklearn.datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, train_size=0.75, test_size=0.25)
est.fit(X_train, y_train)
print(scorer(est, X_test, y_test))
Evaluations: : 119it [00:37, 3.21it/s] /opt/anaconda3/envs/tpotenv/lib/python3.10/site-packages/sklearn/preprocessing/_data.py:2785: UserWarning: n_quantiles (688) is greater than the total number of samples (426). n_quantiles is set to n_samples. warnings.warn(
0.9816225907664725
fitted_pipeline = est.fitted_pipeline_ # access best pipeline directly
fitted_pipeline.plot()
#view the summary of all evaluated individuals as a pandas dataframe
est.evaluated_individuals.head()
 | roc_auc_score | complexity_scorer | Parents | Variation_Function | Individual | Submitted Timestamp | Completed Timestamp | Eval Error | Pareto_Front | Instance |
---|---|---|---|---|---|---|---|---|---|---|
0 | 0.841954 | 95.0 | NaN | NaN | <tpot.search_spaces.pipelines.graph.GraphPipel... | 1.740179e+09 | 1.740179e+09 | None | NaN | [('DecisionTreeClassifier_1', 'SelectPercentil... |
1 | 0.967781 | 89.0 | NaN | NaN | <tpot.search_spaces.pipelines.graph.GraphPipel... | 1.740179e+09 | 1.740179e+09 | None | NaN | [('DecisionTreeClassifier_1', 'SelectFwe_1'), ... |
2 | 0.972412 | 22.0 | NaN | NaN | <tpot.search_spaces.pipelines.graph.GraphPipel... | 1.740179e+09 | 1.740179e+09 | None | NaN | [('KNeighborsClassifier_1', 'ColumnOneHotEncod... |
3 | 0.975926 | 54.0 | NaN | NaN | <tpot.search_spaces.pipelines.graph.GraphPipel... | 1.740179e+09 | 1.740179e+09 | None | NaN | [('KNeighborsClassifier_1', 'SelectFwe_1')] |
4 | 0.964352 | 84.0 | NaN | NaN | <tpot.search_spaces.pipelines.graph.GraphPipel... | 1.740179e+09 | 1.740179e+09 | None | NaN | [('DecisionTreeClassifier_1', 'ZeroCount_1'), ... |
tpot.TPOTEstimator¶
import tpot
import sklearn
import sklearn.datasets
import sklearn.metrics
import sklearn.model_selection
est = tpot.TPOTEstimator(
search_space = graph_search_space,
max_time_mins=10,
scorers=['roc_auc_ovr'], #scorers can be a list of strings or a list of scorers. These get evaluated during cross validation.
scorers_weights=[1],
classification=True,
n_jobs=1,
early_stop=5, #how many generations with no improvement to stop after
#List of other objective functions. All objective functions take in an untrained GraphPipeline and return a score or a list of scores
other_objective_functions= [ ],
#List of weights for the other objective functions. Must be the same length as other_objective_functions. By default, bigger is better is set to True.
other_objective_functions_weights=[],
verbose=2)
scorer = sklearn.metrics.get_scorer('roc_auc_ovo')
X, y = sklearn.datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, train_size=0.75, test_size=0.25)
est.fit(X_train, y_train)
print(scorer(est, X_test, y_test))
Generation: : 5it [10:06, 121.38s/it] /opt/anaconda3/envs/tpotenv/lib/python3.10/site-packages/sklearn/linear_model/_sag.py:349: ConvergenceWarning: The max_iter was reached which means the coef_ did not converge warnings.warn(
0.996046608406159
Regression example
import tpot
import sklearn
import sklearn.metrics
import sklearn.datasets
import sklearn.model_selection
scorer = sklearn.metrics.get_scorer('neg_mean_squared_error')
X, y = sklearn.datasets.load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, train_size=0.75, test_size=0.25)
est = tpot.tpot_estimator.templates.TPOTRegressor(n_jobs=4, max_time_mins=30, verbose=2, cv=5, early_stop=5)
est.fit(X_train, y_train)
print(scorer(est, X_test, y_test))
Generation: : 24it [18:15, 45.63s/it]
-2968.0005982574958