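Genetic Feature Selection Nodes in TPOT¶
TPOT can use evolutionary algorithms to optimize feature selection at the same time as it optimizes the pipeline. It includes two node search spaces with different feature selection strategies: FSSNode and GeneticFeatureSelectorNode.
FSSNode - (Feature Set Selector) This node is useful if you have a predefined list of feature sets to choose from. Each FeatureSetSelector node selects a single group of features to pass to the next step of the pipeline. Note that FSSNode does not create its own feature subsets, nor does it mix and match across multiple predefined feature sets.
GeneticFeatureSelectorNode - Whereas FSSNode selects from a predefined list of feature subsets, this node uses evolutionary algorithms to optimize a new feature subset from scratch. This is useful when there is no predefined grouping of features.
This tutorial focuses on FSSNode. For more information on GeneticFeatureSelectorNode, see Tutorial 5.
It may also be beneficial to pair these search spaces with a secondary objective function that minimizes complexity. This encourages TPOT to try to produce the simplest pipelines with the fewest features.
tpot.objectives.number_of_nodes_objective - This can be used as an other_objective_function that counts the number of nodes.
tpot.objectives.complexity_scorer - This is a scorer that tries to count the total number of learned parameters (number of coefficients, number of nodes in decision trees, etc.).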
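As an illustration of what a complexity-style secondary objective could look like, here is a minimal sketch that counts the columns kept by selector-like objects. The name `number_of_selected_features` appears in a commented-out cell later in this tutorial without a definition, so the body below is only a guess at its intent. It assumes the objective receives an object laid out like a scikit-learn Pipeline (`steps`), FeatureUnion (`transformer_list`), or selector (`sel_subset`); whether TPOT passes such an object to an other_objective_function is an assumption, and the stand-in objects are hypothetical.

```python
from types import SimpleNamespace


def number_of_selected_features(est):
    """Count the columns kept by every selector-like object inside `est`.

    Hypothetical helper: walks `steps` (Pipeline-style) and `transformer_list`
    (FeatureUnion-style) attributes and sums the sizes of `sel_subset`.
    """
    sel_subset = getattr(est, "sel_subset", None)
    if sel_subset is not None:
        # A single int means one column; otherwise count the listed columns.
        return 1 if isinstance(sel_subset, int) else len(sel_subset)
    total = 0
    for attr in ("steps", "transformer_list"):
        for _name, child in getattr(est, attr, None) or []:
            total += number_of_selected_features(child)
    return total


# Stand-in objects that only mimic the attribute layout of a pipeline.
selector_a = SimpleNamespace(sel_subset=["a", "b", "c"])
selector_b = SimpleNamespace(sel_subset=["d", "e", "f"])
union = SimpleNamespace(transformer_list=[("s1", selector_a), ("s2", selector_b)])
classifier = SimpleNamespace()  # no sel_subset, so it contributes 0
pipeline = SimpleNamespace(steps=[("union", union), ("clf", classifier)])

print(number_of_selected_features(pipeline))  # 6
```

A function of this shape would be given a negative weight so that fewer selected features are preferred.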
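Feature Set Selector¶
FeatureSetSelector is a subclass of sklearn.feature_selection.SelectorMixin that simply returns the manually specified columns. The parameter sel_subset specifies the names or indexes of the columns it selects. The transform function then simply indexes and returns the selected columns. You can optionally name the group with the name parameter, though this is only for note keeping and is not used by the class.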
sel_subset: list or int
If X is a dataframe, items in sel_subset list must correspond to column names
If X is a numpy array, items in sel_subset list must correspond to column indexes
int: index of a single column
import tpot
import pandas as pd
import numpy as np
# build a NumPy array with 10 identical rows holding the values 0..5,
# then wrap it in a DataFrame with columns a,b,c,d,e,f
data = np.repeat([np.arange(6)],10,0)
df = pd.DataFrame(data,columns=['a','b','c','d','e','f'])
fss = tpot.builtin_modules.FeatureSetSelector(name='test',sel_subset=['a','b','c'])
print("original DataFrame")
print(df)
print("Transformed Data")
print(fss.fit_transform(df))
original DataFrame
   a  b  c  d  e  f
0  0  1  2  3  4  5
1  0  1  2  3  4  5
2  0  1  2  3  4  5
3  0  1  2  3  4  5
4  0  1  2  3  4  5
5  0  1  2  3  4  5
6  0  1  2  3  4  5
7  0  1  2  3  4  5
8  0  1  2  3  4  5
9  0  1  2  3  4  5
Transformed Data
[[0 1 2]
 [0 1 2]
 [0 1 2]
 [0 1 2]
 [0 1 2]
 [0 1 2]
 [0 1 2]
 [0 1 2]
 [0 1 2]
 [0 1 2]]
FSSNode¶
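FSSNode is a node search space that simply selects one feature set from a list of feature sets. It works the same way as EstimatorNode but provides an easier interface for defining the feature sets.
Note that FSS is only well defined when used as the first step of a pipeline. Downstream nodes receive data that has been transformed in different ways, so the original indexes no longer correspond to the same columns in the transformed data.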
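The index drift described above can be shown with plain Python lists. This is only an illustration: `drop_first_column` is a hypothetical upstream transform and `select` is a stand-in for positional column selection; no TPOT or scikit-learn objects are involved.

```python
# Plain-Python illustration of why positional subsets break after a transform.
columns = ["a", "b", "c", "d"]
rows = [[1.0, 2.0, 3.0, 4.0],
        [5.0, 6.0, 7.0, 8.0]]


def drop_first_column(names, data):
    """Hypothetical upstream transform that removes column 0."""
    return names[1:], [row[1:] for row in data]


def select(names, data, indexes):
    """Pick columns by position and report which names were actually hit."""
    picked_names = [names[i] for i in indexes]
    picked_rows = [[row[i] for i in indexes] for row in data]
    return picked_names, picked_rows


subset = [0, 1]  # intended to mean columns 'a' and 'b'

# Selecting as the first step hits the intended columns.
first_step_names, _ = select(columns, rows, subset)

# Selecting after another transform hits different columns.
later_names, later_rows = drop_first_column(columns, rows)
second_step_names, _ = select(later_names, later_rows, subset)

print(first_step_names)   # ['a', 'b']
print(second_step_names)  # ['b', 'c']
```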
FSSNode takes a single parameter, subsets, which defines the feature groups. The subsets can be defined in several ways, listed below.
subsets : str or list, default=None
Sets the subsets that the FeatureSetSelector will select from if set as an option in one of the configuration dictionaries.
Features are defined by column names if using a Pandas data frame, or ints corresponding to indexes if using numpy arrays.
- str : If a string, it is assumed to be a path to a csv file with the subsets.
The first column is assumed to be the name of the subset and the remaining columns are the features in the subset.
- list or np.ndarray : If a list or np.ndarray, it is assumed to be a list of subsets (i.e., a list of lists).
- dict : A dictionary where keys are the names of the subsets and the values are the list of features.
- int : If an int, it is assumed to be the number of subsets to generate. Each subset will contain one feature.
- None : If None, each column will be treated as a subset. One column will be selected per subset.
Suppose you want three feature groups with three columns each. The following examples are equivalent:
str¶
sel_subsets = "simple_fss.csv"
# simple_fss.csv
group_one,1,2,3
group_two,4,5,6
group_three,7,8,9
dict¶
sel_subsets = { "group_one" : [1,2,3], "group_two" : [4,5,6], "group_three" : [7,8,9], }
list¶
sel_subsets = [[1,2,3], [4,5,6], [7,8,9]]
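To make the equivalence of the three forms concrete, the sketch below normalizes each of them into one name-to-features mapping. `normalize_subsets` is a hypothetical helper and not TPOT's own parsing code. It reads CSV text from a string instead of a file path, treats all-digit cells as integer indexes, and invents placeholder names for the list form; all three choices are assumptions made for this illustration.

```python
import csv
import io


def normalize_subsets(subsets):
    """Return {name: [features]} for CSV text, a dict, or a list of lists.

    Illustration only: takes CSV *text* rather than a file path, and the
    automatic names used for the list form are made up for this sketch.
    """
    if isinstance(subsets, str):
        result = {}
        for row in csv.reader(io.StringIO(subsets)):
            cells = [cell.strip() for cell in row if cell.strip()]
            if not cells or cells[0].startswith("#"):
                continue  # skip blank lines and comment lines
            name, features = cells[0], cells[1:]
            # Treat all-digit cells as integer column indexes.
            result[name] = [int(f) if f.isdigit() else f for f in features]
        return result
    if isinstance(subsets, dict):
        return {name: list(features) for name, features in subsets.items()}
    return {f"subset_{i}": list(features) for i, features in enumerate(subsets)}


csv_text = "group_one,1,2,3\ngroup_two,4,5,6\ngroup_three,7,8,9\n"
as_dict = {"group_one": [1, 2, 3], "group_two": [4, 5, 6], "group_three": [7, 8, 9]}
as_list = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

from_csv = normalize_subsets(csv_text)
from_dict = normalize_subsets(as_dict)
from_list = normalize_subsets(as_list)

# All three describe the same feature groups; only the list form lacks names.
print(from_csv == from_dict)                                  # True
print(list(from_list.values()) == list(from_dict.values()))  # True
```

Because the helper strips each cell, stray spaces in the CSV text do not change the result here; a real file passed to FSSNode may not be as forgiving.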
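Examples¶
For these examples, we create a dummy dataset in which the first six columns are informative and the remaining columns are uninformative.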
import tpot
import sklearn.datasets
import sklearn.metrics
import sklearn.model_selection
from sklearn.linear_model import LogisticRegression
import numpy as np
import pandas as pd
from tpot.search_spaces.nodes import *
from tpot.search_spaces.pipelines import *
from tpot.config import get_search_space
X, y = sklearn.datasets.make_classification(n_samples=1000, n_features=6, n_informative=6, n_redundant=0, n_repeated=0, n_classes=2, n_clusters_per_class=2, weights=None, flip_y=0.01, class_sep=1.0, hypercube=True, shift=0.0, scale=1.0, shuffle=True, random_state=None)
X = np.hstack([X, np.random.rand(X.shape[0],6)]) #add six uninformative features
X = pd.DataFrame(X, columns=['a','b','c','d','e','f','g','h','i', 'j', 'k', 'l']) # a-f are informative; g-l are uninformative
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, train_size=0.75, test_size=0.25)
X.head()
 | a | b | c | d | e | f | g | h | i | j | k | l |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2.315814 | -3.427720 | -1.314654 | -1.508737 | -0.300932 | 0.089448 | 0.327651 | 0.329022 | 0.857495 | 0.734238 | 0.257218 | 0.652350 |
1 | -0.191001 | -1.396922 | 0.149488 | -1.730145 | -0.394932 | 0.519712 | 0.807762 | 0.509823 | 0.876159 | 0.002806 | 0.449828 | 0.671350 |
2 | 0.661264 | -0.981737 | 0.703879 | 0.730321 | -2.750405 | 0.396581 | 0.380302 | 0.532604 | 0.877129 | 0.610919 | 0.780108 | 0.625689 |
3 | 1.445936 | 0.354237 | 0.779040 | 1.288014 | 2.397133 | 0.186324 | 0.544191 | 0.465419 | 0.588535 | 0.919575 | 0.513460 | 0.831546 |
4 | -0.989027 | -1.824787 | -1.448234 | 1.546442 | 1.643775 | 0.167975 | 0.188238 | 0.024149 | 0.544878 | 0.834503 | 0.877869 | 0.278330 |
Suppose that, based on prior knowledge or interest, we know the features can be grouped as follows:
subsets = { "group_one" : ['a','b','c',],
"group_two" : ['d','e','f'],
"group_three" : ['g','h','i'],
"group_four" : ['j','k','l'],
}
We can create an FSSNode that selects from these subsets. Each node in a pipeline selects only one subset.
fss_search_space = FSSNode(subsets=subsets)
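If we sample randomly from this search space, we get a selector that picks one of the predefined sets. In this case, it selected group two, which contains ['d', 'e', 'f']. (A random seed is set in the generate function so that the same group is selected when the notebook is rerun.)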
fss_selector = fss_search_space.generate(rng=1).export_pipeline()
fss_selector
FeatureSetSelector(name='group_two', sel_subset=['d', 'e', 'f'])
fss_selector.set_output(transform="pandas") #by default sklearn selectors return numpy arrays. this will make it return pandas dataframes
fss_selector.fit(X_train)
fss_selector.transform(X_train)
 | d | e | f |
---|---|---|---|
162 | 1.315442 | -1.039258 | 0.194516 |
168 | -1.908995 | -0.953551 | -1.430472 |
214 | 0.181162 | 1.022858 | -2.289700 |
895 | 2.825765 | -1.205520 | 1.147791 |
154 | -2.300481 | 1.023173 | 0.449162 |
... | ... | ... | ... |
32 | -1.793062 | 2.209649 | -0.045031 |
829 | -0.221409 | 1.688750 | 0.069356 |
176 | 0.141471 | -1.880294 | 1.984397 |
124 | -0.359952 | 1.141758 | 2.019301 |
35 | 0.171312 | 0.079332 | 0.178522 |
750 rows × 3 columns
Under the hood, mutation randomly selects another feature set, and crossover swaps the feature sets selected by two individuals.
ind1 = fss_search_space.generate(rng=1)
ind1.export_pipeline()
FeatureSetSelector(name='group_two', sel_subset=['d', 'e', 'f'])
ind1.mutate()
ind1.export_pipeline()
FeatureSetSelector(name='group_four', sel_subset=['j', 'k', 'l'])
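The mutation and crossover behaviour described above can be sketched with a toy individual. `FeatureSetChoice` is a hypothetical class written for this illustration and is not TPOT's implementation; in particular, the rule that mutation always moves to a different feature set is an assumption.

```python
import random


class FeatureSetChoice:
    """Toy individual that holds the name of one chosen feature set."""

    def __init__(self, subsets, rng):
        self.subsets = subsets
        self.rng = rng
        self.name = rng.choice(sorted(subsets))

    def mutate(self):
        # Assumption: mutation always moves to a *different* feature set.
        others = [n for n in sorted(self.subsets) if n != self.name]
        self.name = self.rng.choice(others)

    def crossover(self, other):
        # Swap the chosen feature sets of the two individuals.
        self.name, other.name = other.name, self.name

    def selected(self):
        return self.subsets[self.name]


subsets = {
    "group_one": ["a", "b", "c"],
    "group_two": ["d", "e", "f"],
    "group_three": ["g", "h", "i"],
    "group_four": ["j", "k", "l"],
}
rng = random.Random(0)
ind1 = FeatureSetChoice(subsets, rng)
ind2 = FeatureSetChoice(subsets, rng)

before = ind1.name
ind1.mutate()
print(before != ind1.name)  # True

name1, name2 = ind1.name, ind2.name
ind1.crossover(ind2)
print((ind1.name, ind2.name) == (name2, name1))  # True
```

Neither operator ever edits the contents of a feature set; both only change which predefined set an individual points to.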
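We can now use this when defining our pipelines. For the first example, we construct a simple linear pipeline in which the first step is a feature set selector and the second is a classifier.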
classification_search_space = get_search_space(["RandomForestClassifier"])
fss_and_classifier_search_space = SequentialPipeline([fss_search_space, classification_search_space])
est = tpot.TPOTEstimator(generations=5,
scorers=["roc_auc_ovr", tpot.objectives.complexity_scorer],
scorers_weights=[1.0, -1.0],
n_jobs=32,
classification=True,
search_space = fss_and_classifier_search_space,
verbose=1,
)
scorer = sklearn.metrics.get_scorer('roc_auc_ovr')
est.fit(X_train, y_train)
print(scorer(est, X_test, y_test))
Generation: 100%|██████████| 5/5 [00:36<00:00, 7.26s/it]
0.926166142557652
est.fitted_pipeline_
Pipeline(steps=[('featuresetselector', FeatureSetSelector(name='group_one', sel_subset=['a', 'b', 'c'])), ('randomforestclassifier', RandomForestClassifier(max_features=0.30141491087, min_samples_leaf=4, min_samples_split=17, n_estimators=128, n_jobs=1))])
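With this setup, TPOT was able to identify one of the subsets used, but the performance is not optimal. In this case, we happen to know that multiple feature sets are required. To include multiple feature sets in a pipeline, we have to modify the search space. There are three options for this.
- UnionPipeline - This lets you select a fixed number of feature sets. With a UnionPipeline containing two FSSNodes, you always select two feature sets that are simply concatenated together.
- DynamicUnionPipeline - This space allows multiple FSSNodes to be selected. Unlike UnionPipeline, you do not have to specify the number of selected sets; TPOT determines the best number. In addition, with DynamicUnionPipeline the same feature set cannot be selected twice. Note that although DynamicUnionPipeline can select multiple feature sets, it never mixes two feature sets together.
- GraphSearchPipeline - When set as the leaf_search_space, FSSNode can also be selected multiple times by GraphSearchPipeline, with the selected nodes acting as the inputs to the rest of the pipeline.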
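The difference between a fixed union and a dynamic union can be sketched with the standard library alone. Both sampling functions below are hypothetical stand-ins: how TPOT actually draws the number of sets, and with what probabilities, is not specified in this tutorial, so the uniform draws are assumptions.

```python
import random

group_names = ["group_one", "group_two", "group_three", "group_four"]


def sample_fixed_union(rng, n_slots=2):
    """Fixed union: one independent draw per slot, so repeats are possible."""
    return [rng.choice(group_names) for _ in range(n_slots)]


def sample_dynamic_union(rng, max_sets=None):
    """Dynamic union: a variable number of distinct feature sets."""
    max_sets = max_sets or len(group_names)
    k = rng.randint(1, max_sets)       # how many sets to keep (assumed uniform)
    return rng.sample(group_names, k)  # sampling without replacement


rng = random.Random(0)
fixed = sample_fixed_union(rng)
dynamic = sample_dynamic_union(rng)

print(len(fixed))                         # 2
print(len(dynamic) == len(set(dynamic)))  # True
```

In both cases whole feature sets are concatenated; no sketch here, and no option in the list above, builds a new set from parts of two existing ones.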
UnionPipeline + FSSNode example¶
union_fss_space = UnionPipeline([fss_search_space, fss_search_space])
# this union search space will always select exactly two fss_search_space
selector1 = union_fss_space.generate(rng=1).export_pipeline()
selector1
FeatureUnion(transformer_list=[('featuresetselector-1', FeatureSetSelector(name='group_two', sel_subset=['d', 'e', 'f'])), ('featuresetselector-2', FeatureSetSelector(name='group_three', sel_subset=['g', 'h', 'i']))])
selector1.set_output(transform="pandas")
selector1.fit(X_train)
selector1.transform(X_train)
 | d | e | f | g | h | i |
---|---|---|---|---|---|---|
162 | 1.315442 | -1.039258 | 0.194516 | 0.751175 | 0.411340 | 0.824754 |
168 | -1.908995 | -0.953551 | -1.430472 | 0.072697 | 0.875766 | 0.953255 |
214 | 0.181162 | 1.022858 | -2.289700 | 0.135222 | 0.395847 | 0.232638 |
895 | 2.825765 | -1.205520 | 1.147791 | 0.925905 | 0.486645 | 0.710991 |
154 | -2.300481 | 1.023173 | 0.449162 | 0.645161 | 0.131657 | 0.863514 |
... | ... | ... | ... | ... | ... | ... |
32 | -1.793062 | 2.209649 | -0.045031 | 0.502947 | 0.994603 | 0.280062 |
829 | -0.221409 | 1.688750 | 0.069356 | 0.328066 | 0.102381 | 0.492280 |
176 | 0.141471 | -1.880294 | 1.984397 | 0.365550 | 0.465859 | 0.974601 |
124 | -0.359952 | 1.141758 | 2.019301 | 0.329380 | 0.718647 | 0.365507 |
35 | 0.171312 | 0.079332 | 0.178522 | 0.215759 | 0.546279 | 0.662928 |
750 rows × 6 columns
DynamicUnionPipeline + FSSNode example¶
A dynamic union pipeline can select a variable number of feature sets.
dynamic_fss_space = DynamicUnionPipeline(fss_search_space)
dynamic_fss_space.generate(rng=1).export_pipeline()
FeatureUnion(transformer_list=[('featuresetselector', FeatureSetSelector(name='group_three', sel_subset=['g', 'h', 'i']))])
dynamic_fss_space.generate(rng=3).export_pipeline()
FeatureUnion(transformer_list=[('featuresetselector-1', FeatureSetSelector(name='group_one', sel_subset=['a', 'b', 'c'])), ('featuresetselector-2', FeatureSetSelector(name='group_four', sel_subset=['j', 'k', 'l']))])
GraphSearchPipeline + FSSNode example¶
graph_search_space = tpot.search_spaces.pipelines.GraphSearchPipeline(
leaf_search_space = fss_search_space,
inner_search_space = tpot.config.get_search_space(["transformers"]),
root_search_space= tpot.config.get_search_space(["KNeighborsClassifier", "LogisticRegression", "DecisionTreeClassifier"]),
max_size = 10,
)
graph_search_space.generate(rng=4).export_pipeline().plot()
Optimizing with TPOT¶
For this example, we will optimize the DynamicUnion search space.
import tpot
import sklearn.datasets
from sklearn.linear_model import LogisticRegression
import numpy as np
final_classification_search_space = SequentialPipeline([dynamic_fss_space, classification_search_space])
est = tpot.TPOTEstimator(generations=5,
scorers=["roc_auc_ovr", tpot.objectives.complexity_scorer],
scorers_weights=[1.0, -1.0],
n_jobs=32,
classification=True,
search_space = final_classification_search_space,
verbose=1,
)
scorer = sklearn.metrics.get_scorer('roc_auc_ovr')
est.fit(X_train, y_train)
print(scorer(est, X_test, y_test))
Generation: 100%|██████████| 5/5 [00:41<00:00, 8.33s/it]
0.9838836477987423
We can see that this pipeline performs slightly better and correctly identified group one and group two as the feature sets used in the generating equation.
est.fitted_pipeline_
Pipeline(steps=[('featureunion', FeatureUnion(transformer_list=[('featuresetselector-1', FeatureSetSelector(name='group_two', sel_subset=['d', 'e', 'f'])), ('featuresetselector-2', FeatureSetSelector(name='group_one', sel_subset=['a', 'b', 'c']))])), ('randomforestclassifier', RandomForestClassifier(max_features=0.0530704381152, min_samples_leaf=2, min_samples_split=5, n_estimators=128, n_jobs=1))])
linear_search_space = tpot.config.template_search_spaces.get_template_search_spaces("linear", classification=True)
fss_and_linear_search_space = SequentialPipeline([fss_search_space, linear_search_space])
# est = tpot.TPOTEstimator(
# population_size=32,
# generations=10,
# scorers=["roc_auc_ovr", tpot.objectives.complexity_scorer],
# scorers_weights=[1.0, -1.0],
# other_objective_functions=[number_of_selected_features],
# other_objective_functions_weights = [-1],
# objective_function_names = ["Number of selected features"],
# n_jobs=32,
# classification=True,
# search_space = fss_and_linear_search_space,
# verbose=2,
# )
fss_and_linear_search_space.generate(rng=1).export_pipeline()
Pipeline(steps=[('featuresetselector', FeatureSetSelector(name='group_two', sel_subset=['d', 'e', 'f'])), ('pipeline', Pipeline(steps=[('maxabsscaler', MaxAbsScaler()), ('rfe', RFE(estimator=ExtraTreesClassifier(max_features=0.0390676831531, min_samples_leaf=8, min_samples_split=14, n_jobs=1), step=0.753983388654)), ('featureunion-1', FeatureUnion(transformer_lis... FeatureUnion(transformer_list=[('skiptransformer', SkipTransformer()), ('passthrough', Passthrough())])), ('histgradientboostingclassifier', HistGradientBoostingClassifier(early_stopping=True, l2_regularization=9.1304e-09, learning_rate=0.0036310282582, max_features=0.238877814721, max_leaf_nodes=1696, min_samples_leaf=59, n_iter_no_change=14, tol=0.0001, validation_fraction=None))]))])
Advanced usage¶
For more advanced use, you can combine more search spaces to give each feature set its own preprocessing pipeline. Here is an example:
dynamic_transformers = DynamicUnionPipeline(get_search_space("all_transformers"), max_estimators=4)
dynamic_transformers_with_passthrough = tpot.search_spaces.pipelines.UnionPipeline([
dynamic_transformers,
tpot.config.get_search_space("Passthrough")],
)
multi_step_engineering = DynamicLinearPipeline(dynamic_transformers_with_passthrough, max_length=4)
fss_engineering_search_space = SequentialPipeline([fss_search_space, multi_step_engineering])
union_fss_engineering_search_space = DynamicUnionPipeline(fss_engineering_search_space)
final_fancy_search_space = SequentialPipeline([union_fss_engineering_search_space, classification_search_space])
final_fancy_search_space.generate(rng=3).export_pipeline()
Pipeline(steps=[('featureunion', FeatureUnion(transformer_list=[('pipeline-1', Pipeline(steps=[('featuresetselector', FeatureSetSelector(name='group_one', sel_subset=['a', 'b', 'c'])), ('pipeline', Pipeline(steps=[('featureunion', FeatureUnion(transformer_list=[('featureunion', FeatureUnion(transformer_list=[('zerocount', ZeroCount())])), ('passthrough', Passth... KBinsDiscretizer(encode='onehot-dense', n_bins=11)), ('rbfsampler', RBFSampler(gamma=0.0925899621466, n_components=17)), ('maxabsscaler', MaxAbsScaler())])), ('passthrough', Passthrough())]))]))]))])), ('randomforestclassifier', RandomForestClassifier(bootstrap=False, class_weight='balanced', max_features=0.8205760841606, min_samples_leaf=16, min_samples_split=11, n_estimators=128, n_jobs=1))])
Other examples¶
dict¶
import tpot
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
import sklearn
subsets = { "group_one" : ['a','b','c'],
"group_two" : ['d','e','f'],
"group_three" : ['g','h','i'],
}
fss_search_space = tpot.search_spaces.nodes.FSSNode(subsets=subsets)
selector = fss_search_space.generate(rng=1).export_pipeline()
selector.set_output(transform="pandas")
selector.fit(X_train)
selector.transform(X_train)
 | d | e | f |
---|---|---|---|
162 | 1.315442 | -1.039258 | 0.194516 |
168 | -1.908995 | -0.953551 | -1.430472 |
214 | 0.181162 | 1.022858 | -2.289700 |
895 | 2.825765 | -1.205520 | 1.147791 |
154 | -2.300481 | 1.023173 | 0.449162 |
... | ... | ... | ... |
32 | -1.793062 | 2.209649 | -0.045031 |
829 | -0.221409 | 1.688750 | 0.069356 |
176 | 0.141471 | -1.880294 | 1.984397 |
124 | -0.359952 | 1.141758 | 2.019301 |
35 | 0.171312 | 0.079332 | 0.178522 |
750 rows × 3 columns
list¶
import tpot
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
import sklearn
subsets = [['a','b','c'],['d','e','f'],['g','h','i']]
fss_search_space = tpot.search_spaces.nodes.FSSNode(subsets=subsets)
selector = fss_search_space.generate(rng=1).export_pipeline()
selector.set_output(transform="pandas")
selector.fit(X_train)
selector.transform(X_train)
 | d | e | f |
---|---|---|---|
162 | 1.315442 | -1.039258 | 0.194516 |
168 | -1.908995 | -0.953551 | -1.430472 |
214 | 0.181162 | 1.022858 | -2.289700 |
895 | 2.825765 | -1.205520 | 1.147791 |
154 | -2.300481 | 1.023173 | 0.449162 |
... | ... | ... | ... |
32 | -1.793062 | 2.209649 | -0.045031 |
829 | -0.221409 | 1.688750 | 0.069356 |
176 | 0.141471 | -1.880294 | 1.984397 |
124 | -0.359952 | 1.141758 | 2.019301 |
35 | 0.171312 | 0.079332 | 0.178522 |
750 rows × 3 columns
csv file¶
Note: watch for spaces in the csv file!
import tpot
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
import sklearn
subsets = 'simple_fss.csv'
'''
# simple_fss.csv
one,a,b,c
two,d,e,f
three,g,h,i
'''
fss_search_space = tpot.search_spaces.nodes.FSSNode(subsets=subsets)
selector = fss_search_space.generate(rng=1).export_pipeline()
selector.set_output(transform="pandas")
selector.fit(X_train)
selector.transform(X_train)
 | d | e | f |
---|---|---|---|
162 | 1.315442 | -1.039258 | 0.194516 |
168 | -1.908995 | -0.953551 | -1.430472 |
214 | 0.181162 | 1.022858 | -2.289700 |
895 | 2.825765 | -1.205520 | 1.147791 |
154 | -2.300481 | 1.023173 | 0.449162 |
... | ... | ... | ... |
32 | -1.793062 | 2.209649 | -0.045031 |
829 | -0.221409 | 1.688750 | 0.069356 |
176 | 0.141471 | -1.880294 | 1.984397 |
124 | -0.359952 | 1.141758 | 2.019301 |
35 | 0.171312 | 0.079332 | 0.178522 |
750 rows × 3 columns
When using numpy data, everything above works the same way, except that column names are replaced by integer indexes.
import tpot
import sklearn.datasets
from sklearn.linear_model import LogisticRegression
import numpy as np
import pandas as pd
n_features = 6
X, y = sklearn.datasets.make_classification(n_samples=1000, n_features=n_features, n_informative=6, n_redundant=0, n_repeated=0, n_classes=2, n_clusters_per_class=2, weights=None, flip_y=0.01, class_sep=1.0, hypercube=True, shift=0.0, scale=1.0, shuffle=True, random_state=None)
X = np.hstack([X, np.random.rand(X.shape[0],3)]) #add three uninformative features
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, train_size=0.75, test_size=0.25)
print(X)
[[-0.31748616 2.20805859 -2.21719911 ... 0.5595234 0.80605806 0.41484993]
 [ 2.8673731 1.45905176 -1.11516833 ... 0.74646156 0.95635356 0.03575697]
 [-1.64867116 2.14478724 2.31196119 ... 0.22969172 0.72447325 0.81842014]
 ...
 [ 1.17772695 0.7188885 -0.52548496 ... 0.99266968 0.95436462 0.57430922]
 [ 0.14052568 0.15042817 -0.86281564 ... 0.25379746 0.1818071 0.55993116]
 [ 1.37273916 -0.14898886 -0.89938251 ... 0.767549 0.66184827 0.49174333]]
import tpot
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
import sklearn
subsets = { "group_one" : [0,1,2],
"group_two" : [3,4,5],
"group_three" : [6,7,8],
}
fss_search_space = tpot.search_spaces.nodes.FSSNode(subsets=subsets)
selector = fss_search_space.generate(rng=1).export_pipeline()
selector.fit(X_train)
selector.transform(X_train)
array([[-0.76235619, -1.97629642, 1.05447979],
       [ 2.16944118, -1.55515714, 0.67925075],
       [ 1.96557199, 0.13789923, 1.588271 ],
       ...,
       [ 0.78956322, 2.12535053, 0.63115798],
       [-0.80184984, -0.40793866, 1.3880617 ],
       [-1.38085267, 1.62568989, -1.42046795]])
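When moving from a DataFrame to a NumPy array, name-based subsets have to be rewritten as positional indexes. The sketch below shows one way to do that conversion; `subsets_to_indexes` is a hypothetical helper that is not part of TPOT, and it assumes the array keeps the DataFrame's column order.

```python
def subsets_to_indexes(subsets, column_names):
    """Map each column name in `subsets` to its position in `column_names`."""
    position = {name: i for i, name in enumerate(column_names)}
    return {group: [position[c] for c in cols] for group, cols in subsets.items()}


# Assumed column order of the original DataFrame.
column_names = ["a", "b", "c", "d", "e", "f", "g", "h", "i"]
named_subsets = {
    "group_one": ["a", "b", "c"],
    "group_two": ["d", "e", "f"],
    "group_three": ["g", "h", "i"],
}

index_subsets = subsets_to_indexes(named_subsets, column_names)
print(index_subsets)
# {'group_one': [0, 1, 2], 'group_two': [3, 4, 5], 'group_three': [6, 7, 8]}
```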