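Genetic Feature Selection nodes in TPOT¶
TPOT can use evolutionary algorithms to optimize feature selection at the same time as it optimizes the pipeline. It includes two node search spaces with different feature selection strategies: FSSNode and GeneticFeatureSelectorNode.
FSSNode - (Feature Set Selector) This node is useful if you have a list of predefined feature sets to select from. Each FeatureSetSelector node selects a single group of features to pass to the next step in the pipeline. Note that FSSNode does not create its own subsets of features and does not mix/match multiple predefined feature sets.
GeneticFeatureSelectorNode - Whereas FSSNode selects from a predefined list of feature subsets, this node uses evolutionary algorithms to optimize a novel subset of features from scratch. This is useful when there is no predefined grouping of features.
This tutorial focuses on FSSNode. See Tutorial 5 for more information on GeneticFeatureSelectorNode.
It may also be beneficial to pair these search spaces with a secondary objective function that minimizes complexity. This encourages TPOT to try to produce the simplest pipeline with the fewest features.
tpot.objectives.number_of_nodes_objective - This can be used as an other_objective_function that counts the number of nodes.
tpot.objectives.complexity_scorer - This is a scorer that tries to count the total number of learned parameters (number of coefficients, number of nodes in decision trees, etc.).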
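To make the idea of a secondary complexity objective concrete, here is a minimal sketch of how two objectives can be traded off in the Pareto sense. It does not show TPOT's internal selection procedure; the candidate scores, the `weights` tuple, and the `dominates` helper are all hypothetical names invented for illustration. The only assumption carried over is that a positive weight means "maximize" and a negative weight means "minimize".

```python
# Hypothetical (roc_auc, complexity) scores for three candidate pipelines.
candidates = {
    "A": (0.93, 120.0),
    "B": (0.93, 450.0),
    "C": (0.98, 900.0),
}

# Assumed convention: positive weight = maximize, negative weight = minimize.
weights = (1.0, -1.0)


def weighted(scores):
    """Multiply each score by its weight so that larger is always better."""
    return tuple(w * s for w, s in zip(weights, scores))


def dominates(p, q):
    """True if p is at least as good as q on every objective and strictly better on one."""
    wp, wq = weighted(p), weighted(q)
    return all(a >= b for a, b in zip(wp, wq)) and any(a > b for a, b in zip(wp, wq))


# A candidate is on the Pareto front if no other candidate dominates it.
pareto_front = sorted(
    name
    for name, scores in candidates.items()
    if not any(dominates(other, scores) for other in candidates.values())
)

print(pareto_front)  # ['A', 'C']
```

Candidate B is dropped because A reaches the same score with lower complexity, while A and C both survive as different trade-offs between accuracy and simplicity.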
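Feature Set Selector¶
FeatureSetSelector is a subclass of sklearn.feature_selection.SelectorMixin that simply returns the manually specified columns. The sel_subset parameter specifies the names or indexes of the columns it selects. The transform function then simply indexes and returns the selected columns. You can optionally name the group with the name parameter, though this is only for note keeping and is not used by the class.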
sel_subset: list or int
If X is a dataframe, items in sel_subset list must correspond to column names
If X is a numpy array, items in sel_subset list must correspond to column indexes
int: index of a single column
import tpot
import pandas as pd
import numpy as np
# make a DataFrame with columns a, b, c, d, e, f
# (built from a numpy array whose columns hold the values 0, 1, 2, 3, 4, 5)
data = np.repeat([np.arange(6)],10,0)
df = pd.DataFrame(data,columns=['a','b','c','d','e','f'])
fss = tpot.builtin_modules.FeatureSetSelector(name='test',sel_subset=['a','b','c'])
print("original DataFrame")
print(df)
print("Transformed Data")
print(fss.fit_transform(df))
original DataFrame
   a  b  c  d  e  f
0  0  1  2  3  4  5
1  0  1  2  3  4  5
2  0  1  2  3  4  5
3  0  1  2  3  4  5
4  0  1  2  3  4  5
5  0  1  2  3  4  5
6  0  1  2  3  4  5
7  0  1  2  3  4  5
8  0  1  2  3  4  5
9  0  1  2  3  4  5
Transformed Data
[[0 1 2]
 [0 1 2]
 [0 1 2]
 [0 1 2]
 [0 1 2]
 [0 1 2]
 [0 1 2]
 [0 1 2]
 [0 1 2]
 [0 1 2]]
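FSSNode¶
FSSNode is a node search space that simply selects one feature set from a list of feature sets. It works identically to EstimatorNode, but provides an easier interface for defining the feature sets.
Note that FSS is only well defined when used as the first step in a pipeline. Downstream nodes receive transformed versions of the data, so the original indexes no longer correspond to the same columns in the transformed data.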
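The following toy sketch illustrates why index-based selection is only well defined on the original data. It uses plain Python lists rather than TPOT or scikit-learn objects, and the `select` helper and the sample values are hypothetical.

```python
# Hypothetical 3-row dataset; the columns are named a, b, c, d in that order.
rows = [
    [1, 10, 100, 1000],
    [2, 20, 200, 2000],
    [3, 30, 300, 3000],
]


def select(data, indices):
    """Return the given column positions from every row."""
    return [[row[i] for i in indices] for row in data]


subset = [1, 2]  # intended to mean columns "b" and "c"

# Selecting first: the indexes refer to the original column layout.
first_step = select(rows, subset)

# Selecting after an upstream step that dropped column "a":
# every remaining column has shifted one position to the left.
upstream = select(rows, [1, 2, 3])      # now holds b, c, d
later_step = select(upstream, subset)   # picks c and d, not b and c

print(first_step[0])  # [10, 100]
print(later_step[0])  # [100, 1000]
```

The same subset definition returns different columns depending on where the selector sits, which is why the selector belongs at the start of the pipeline.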
FSSNode takes a single argument, subsets, which defines the feature groups. The subsets can be specified in several ways.
subsets : str or list, default=None
Sets the subsets that the FeatureSetSelector will select from if set as an option in one of the configuration dictionaries.
Features are defined by column names if using a Pandas data frame, or ints corresponding to indexes if using numpy arrays.
- str : If a string, it is assumed to be a path to a csv file with the subsets.
The first column is assumed to be the name of the subset and the remaining columns are the features in the subset.
- list or np.ndarray : If a list or np.ndarray, it is assumed to be a list of subsets (i.e. a list of lists).
- dict : A dictionary where keys are the names of the subsets and the values are the list of features.
- int : If an int, it is assumed to be the number of subsets to generate. Each subset will contain one feature.
- None : If None, each column will be treated as a subset. One column will be selected per subset.
Suppose you want three feature groups, each containing three columns. The following examples are equivalent:
str¶
sel_subsets = "simple_fss.csv"
# simple_fss.csv
group_one, 1,2,3
group_two, 4,5,6
group_three, 7,8,9
dict¶
sel_subsets = { "group_one" : [1,2,3], "group_two" : [4,5,6], "group_three" : [7,8,9], }
list¶
sel_subsets = [[1,2,3], [4,5,6], [7,8,9]]
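As a quick check of that equivalence, the sketch below parses the CSV layout into a dictionary and compares it with the dict and list forms. The `read_subsets` helper is hypothetical and is not TPOT's own loader; in particular it strips whitespace around every cell and converts features to integers, which TPOT may or may not do for you.

```python
import csv
import io

# Contents of the example simple_fss.csv: subset name first, then its features.
csv_text = """group_one, 1,2,3
group_two, 4,5,6
group_three, 7,8,9
"""


def read_subsets(text):
    """Parse 'name, feature, feature, ...' rows into {name: [features]}."""
    subsets = {}
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        name, *features = (cell.strip() for cell in row)
        subsets[name] = [int(feature) for feature in features]
    return subsets


from_csv = read_subsets(csv_text)

as_dict = {
    "group_one": [1, 2, 3],
    "group_two": [4, 5, 6],
    "group_three": [7, 8, 9],
}
as_list = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

print(from_csv == as_dict)                  # True
print(list(from_csv.values()) == as_list)   # True
```

The list form carries the same groups but drops the group names, so the dict or CSV form is the better choice when you want readable names in the exported pipeline.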
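Examples¶
For these examples, we create a dummy dataset in which the first six columns are informative and the remaining columns are uninformative.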
import tpot
import sklearn.datasets
import sklearn.metrics
import sklearn.model_selection
from sklearn.linear_model import LogisticRegression
import numpy as np
import pandas as pd
from tpot.search_spaces.nodes import *
from tpot.search_spaces.pipelines import *
from tpot.config import get_search_space
X, y = sklearn.datasets.make_classification(n_samples=1000, n_features=6, n_informative=6, n_redundant=0, n_repeated=0, n_classes=2, n_clusters_per_class=2, weights=None, flip_y=0.01, class_sep=1.0, hypercube=True, shift=0.0, scale=1.0, shuffle=True, random_state=None)
X = np.hstack([X, np.random.rand(X.shape[0],6)]) #add six uninformative features
X = pd.DataFrame(X, columns=['a','b','c','d','e','f','g','h','i', 'j', 'k', 'l']) # columns a-f are informative, g-l are uninformative
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, train_size=0.75, test_size=0.25)
X.head()
| | a | b | c | d | e | f | g | h | i | j | k | l |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.315814 | -3.427720 | -1.314654 | -1.508737 | -0.300932 | 0.089448 | 0.327651 | 0.329022 | 0.857495 | 0.734238 | 0.257218 | 0.652350 |
| 1 | -0.191001 | -1.396922 | 0.149488 | -1.730145 | -0.394932 | 0.519712 | 0.807762 | 0.509823 | 0.876159 | 0.002806 | 0.449828 | 0.671350 |
| 2 | 0.661264 | -0.981737 | 0.703879 | 0.730321 | -2.750405 | 0.396581 | 0.380302 | 0.532604 | 0.877129 | 0.610919 | 0.780108 | 0.625689 |
| 3 | 1.445936 | 0.354237 | 0.779040 | 1.288014 | 2.397133 | 0.186324 | 0.544191 | 0.465419 | 0.588535 | 0.919575 | 0.513460 | 0.831546 |
| 4 | -0.989027 | -1.824787 | -1.448234 | 1.546442 | 1.643775 | 0.167975 | 0.188238 | 0.024149 | 0.544878 | 0.834503 | 0.877869 | 0.278330 |
Suppose that, based on prior knowledge or interest, we know the features can be grouped as follows:
subsets = { "group_one" : ['a','b','c',],
"group_two" : ['d','e','f'],
"group_three" : ['g','h','i'],
"group_four" : ['j','k','l'],
}
We can create an FSSNode that selects from these subsets. Each node in a pipeline selects only one subset.
fss_search_space = FSSNode(subsets=subsets)
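If we randomly sample from this search space, we get a single selector that selects one of the predefined sets. In this case, it selects group two, which contains ['d', 'e', 'f']. (A random seed is passed to the generate function so that the same group is selected when the notebook is rerun.)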
fss_selector = fss_search_space.generate(rng=1).export_pipeline()
fss_selector
FeatureSetSelector(name='group_two', sel_subset=['d', 'e', 'f'])
fss_selector.set_output(transform="pandas") #by default sklearn selectors return numpy arrays. this will make it return pandas dataframes
fss_selector.fit(X_train)
fss_selector.transform(X_train)
| | d | e | f |
|---|---|---|---|
| 162 | 1.315442 | -1.039258 | 0.194516 |
| 168 | -1.908995 | -0.953551 | -1.430472 |
| 214 | 0.181162 | 1.022858 | -2.289700 |
| 895 | 2.825765 | -1.205520 | 1.147791 |
| 154 | -2.300481 | 1.023173 | 0.449162 |
| ... | ... | ... | ... |
| 32 | -1.793062 | 2.209649 | -0.045031 |
| 829 | -0.221409 | 1.688750 | 0.069356 |
| 176 | 0.141471 | -1.880294 | 1.984397 |
| 124 | -0.359952 | 1.141758 | 2.019301 |
| 35 | 0.171312 | 0.079332 | 0.178522 |
750 rows × 3 columns
Under the hood, mutation randomly selects another feature set, and crossover swaps the feature sets selected by two individuals.
ind1 = fss_search_space.generate(rng=1)
ind1.export_pipeline()
FeatureSetSelector(name='group_two', sel_subset=['d', 'e', 'f'])
ind1.mutate()
ind1.export_pipeline()
FeatureSetSelector(name='group_four', sel_subset=['j', 'k', 'l'])
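The mutation and crossover behavior described above can be mimicked with a small toy class. This is only a sketch of the idea, not TPOT's implementation: `ToySelectorIndividual` and its methods are hypothetical, and the sketch assumes that mutation always moves to a different subset.

```python
import random

subsets = {
    "group_one": ["a", "b", "c"],
    "group_two": ["d", "e", "f"],
    "group_three": ["g", "h", "i"],
    "group_four": ["j", "k", "l"],
}


class ToySelectorIndividual:
    """Toy stand-in for a feature-set-selector individual: it holds one subset name."""

    def __init__(self, subsets, rng):
        self.subsets = subsets
        self.name = rng.choice(sorted(subsets))

    def mutate(self, rng):
        """Switch to a different, randomly chosen subset (assumed behavior)."""
        others = [name for name in sorted(self.subsets) if name != self.name]
        self.name = rng.choice(others)

    def crossover(self, other):
        """Swap the selected subset with another individual."""
        self.name, other.name = other.name, self.name

    def features(self):
        """Return the columns of the currently selected subset."""
        return self.subsets[self.name]


rng = random.Random(0)
ind1 = ToySelectorIndividual(subsets, rng)
ind2 = ToySelectorIndividual(subsets, rng)

before_mutation = ind1.name
ind1.mutate(rng)
after_mutation = ind1.name

name1, name2 = ind1.name, ind2.name
ind1.crossover(ind2)

print(before_mutation, "->", after_mutation)
print(ind1.features(), ind2.features())
```

Because each individual only ever holds a whole predefined group, neither operator can produce a new combination of columns; they only change which group is selected.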
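We can now use this when defining pipelines. For this first example, we build a simple linear pipeline in which the first step is a feature set selector and the second step is a classifier.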
classification_search_space = get_search_space(["RandomForestClassifier"])
fss_and_classifier_search_space = SequentialPipeline([fss_search_space, classification_search_space])
est = tpot.TPOTEstimator(generations=5,
scorers=["roc_auc_ovr", tpot.objectives.complexity_scorer],
scorers_weights=[1.0, -1.0],
n_jobs=32,
classification=True,
search_space = fss_and_classifier_search_space,
verbose=1,
)
scorer = sklearn.metrics.get_scorer('roc_auc_ovr')
est.fit(X_train, y_train)
print(scorer(est, X_test, y_test))
Generation: 100%|██████████| 5/5 [00:36<00:00, 7.26s/it]
0.926166142557652
est.fitted_pipeline_
Pipeline(steps=[('featuresetselector',
FeatureSetSelector(name='group_one',
sel_subset=['a', 'b', 'c'])),
('randomforestclassifier',
RandomForestClassifier(max_features=0.30141491087,
min_samples_leaf=4,
min_samples_split=17, n_estimators=128,
n_jobs=1))])
With this setup, TPOT identified one of the subsets that was used, but performance was not optimal. In this case we happen to know that more than one feature set is needed. To include multiple feature sets in a pipeline, we have to modify the search space. There are three options for this.
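- UnionPipeline - This lets you select a fixed number of feature sets. With a UnionPipeline containing two FSSNodes, you always select two feature sets, which are simply concatenated together.
- DynamicUnionPipeline - This space allows multiple FSSNodes to be selected. Unlike UnionPipeline, you do not have to specify the number of selected sets; TPOT determines the best number. Additionally, with DynamicUnionPipeline the same feature set cannot be selected twice. Note that although DynamicUnionPipeline can select multiple feature sets, it never mixes two feature sets together.
- GraphSearchPipeline - When set as the leaf_search_space, GraphSearchPipeline can also select multiple FSSNodes, which act as the inputs to the rest of the pipeline.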
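The difference between a fixed-size union and a dynamic union can be sketched with plain random sampling. This is an illustration of the selection behavior described in the list above, not TPOT code: `fixed_union`, `dynamic_union`, and `concatenate` are hypothetical helpers, and the sketch assumes that each slot of a fixed union draws independently.

```python
import random

subsets = {
    "group_one": ["a", "b", "c"],
    "group_two": ["d", "e", "f"],
    "group_three": ["g", "h", "i"],
    "group_four": ["j", "k", "l"],
}
names = list(subsets)


def fixed_union(rng, n_slots=2):
    """Fill a fixed number of slots; slots are assumed to draw independently."""
    return [rng.choice(names) for _ in range(n_slots)]


def dynamic_union(rng):
    """Draw a variable number of distinct sets, so no set is chosen twice."""
    k = rng.randint(1, len(names))
    return rng.sample(names, k)


def concatenate(chosen):
    """Concatenate whole sets side by side; sets are never split or mixed."""
    return [feature for name in chosen for feature in subsets[name]]


rng = random.Random(0)
fixed = fixed_union(rng)
dynamic = dynamic_union(rng)

print(fixed, concatenate(fixed))
print(dynamic, concatenate(dynamic))
print(concatenate(["group_one", "group_three"]))  # ['a', 'b', 'c', 'g', 'h', 'i']
```

In both cases the output is a concatenation of complete groups, which matches the note that feature sets are never mixed.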
UnionPipeline + FSSNode example¶
union_fss_space = UnionPipeline([fss_search_space, fss_search_space])
# this union search space will always select exactly two fss_search_space
selector1 = union_fss_space.generate(rng=1).export_pipeline()
selector1
FeatureUnion(transformer_list=[('featuresetselector-1',
FeatureSetSelector(name='group_two',
sel_subset=['d', 'e', 'f'])),
('featuresetselector-2',
FeatureSetSelector(name='group_three',
sel_subset=['g', 'h',
'i']))])
selector1.set_output(transform="pandas")
selector1.fit(X_train)
selector1.transform(X_train)
| | d | e | f | g | h | i |
|---|---|---|---|---|---|---|
| 162 | 1.315442 | -1.039258 | 0.194516 | 0.751175 | 0.411340 | 0.824754 |
| 168 | -1.908995 | -0.953551 | -1.430472 | 0.072697 | 0.875766 | 0.953255 |
| 214 | 0.181162 | 1.022858 | -2.289700 | 0.135222 | 0.395847 | 0.232638 |
| 895 | 2.825765 | -1.205520 | 1.147791 | 0.925905 | 0.486645 | 0.710991 |
| 154 | -2.300481 | 1.023173 | 0.449162 | 0.645161 | 0.131657 | 0.863514 |
| ... | ... | ... | ... | ... | ... | ... |
| 32 | -1.793062 | 2.209649 | -0.045031 | 0.502947 | 0.994603 | 0.280062 |
| 829 | -0.221409 | 1.688750 | 0.069356 | 0.328066 | 0.102381 | 0.492280 |
| 176 | 0.141471 | -1.880294 | 1.984397 | 0.365550 | 0.465859 | 0.974601 |
| 124 | -0.359952 | 1.141758 | 2.019301 | 0.329380 | 0.718647 | 0.365507 |
| 35 | 0.171312 | 0.079332 | 0.178522 | 0.215759 | 0.546279 | 0.662928 |
750 rows × 6 columns
DynamicUnionPipeline + FSSNode example¶
A dynamic union pipeline can select a variable number of feature sets.
dynamic_fss_space = DynamicUnionPipeline(fss_search_space)
dynamic_fss_space.generate(rng=1).export_pipeline()
FeatureUnion(transformer_list=[('featuresetselector',
FeatureSetSelector(name='group_three',
sel_subset=['g', 'h',
'i']))])
dynamic_fss_space.generate(rng=3).export_pipeline()
FeatureUnion(transformer_list=[('featuresetselector-1',
FeatureSetSelector(name='group_one',
sel_subset=['a', 'b', 'c'])),
('featuresetselector-2',
FeatureSetSelector(name='group_four',
sel_subset=['j', 'k',
'l']))])
GraphSearchPipeline + FSSNode example¶
graph_search_space = tpot.search_spaces.pipelines.GraphSearchPipeline(
leaf_search_space = fss_search_space,
inner_search_space = tpot.config.get_search_space(["transformers"]),
root_search_space= tpot.config.get_search_space(["KNeighborsClassifier", "LogisticRegression", "DecisionTreeClassifier"]),
max_size = 10,
)
graph_search_space.generate(rng=4).export_pipeline().plot()
Optimize with TPOT¶
For this example, we optimize the DynamicUnion search space.
import tpot
import sklearn.datasets
from sklearn.linear_model import LogisticRegression
import numpy as np
final_classification_search_space = SequentialPipeline([dynamic_fss_space, classification_search_space])
est = tpot.TPOTEstimator(generations=5,
scorers=["roc_auc_ovr", tpot.objectives.complexity_scorer],
scorers_weights=[1.0, -1.0],
n_jobs=32,
classification=True,
search_space = final_classification_search_space,
verbose=1,
)
scorer = sklearn.metrics.get_scorer('roc_auc_ovr')
est.fit(X_train, y_train)
print(scorer(est, X_test, y_test))
Generation: 100%|██████████| 5/5 [00:41<00:00, 8.33s/it]
0.9838836477987423
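We can see that this pipeline performs somewhat better (a test score of about 0.98 versus 0.93) and correctly identifies group one and group two as the feature sets used to generate the data.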
est.fitted_pipeline_
Pipeline(steps=[('featureunion',
FeatureUnion(transformer_list=[('featuresetselector-1',
FeatureSetSelector(name='group_two',
sel_subset=['d',
'e',
'f'])),
('featuresetselector-2',
FeatureSetSelector(name='group_one',
sel_subset=['a',
'b',
'c']))])),
('randomforestclassifier',
RandomForestClassifier(max_features=0.0530704381152,
min_samples_leaf=2, min_samples_split=5,
n_estimators=128, n_jobs=1))])
linear_search_space = tpot.config.template_search_spaces.get_template_search_spaces("linear", classification=True)
fss_and_linear_search_space = SequentialPipeline([fss_search_space, linear_search_space])
# est = tpot.TPOTEstimator(
# population_size=32,
# generations=10,
# scorers=["roc_auc_ovr", tpot.objectives.complexity_scorer],
# scorers_weights=[1.0, -1.0],
# other_objective_functions=[number_of_selected_features],
# other_objective_functions_weights = [-1],
# objective_function_names = ["Number of selected features"],
# n_jobs=32,
# classification=True,
# search_space = fss_and_linear_search_space,
# verbose=2,
# )
fss_and_linear_search_space.generate(rng=1).export_pipeline()
Pipeline(steps=[('featuresetselector',
FeatureSetSelector(name='group_two',
sel_subset=['d', 'e', 'f'])),
('pipeline',
Pipeline(steps=[('maxabsscaler', MaxAbsScaler()),
('rfe',
RFE(estimator=ExtraTreesClassifier(max_features=0.0390676831531,
min_samples_leaf=8,
min_samples_split=14,
n_jobs=1),
step=0.753983388654)),
('featureunion-1',
FeatureUnion(transformer_lis...
FeatureUnion(transformer_list=[('skiptransformer',
SkipTransformer()),
('passthrough',
Passthrough())])),
('histgradientboostingclassifier',
HistGradientBoostingClassifier(early_stopping=True,
l2_regularization=9.1304e-09,
learning_rate=0.0036310282582,
max_features=0.238877814721,
max_leaf_nodes=1696,
min_samples_leaf=59,
n_iter_no_change=14,
tol=0.0001,
validation_fraction=None))]))])
Advanced usage¶
To go further, you can combine more search spaces so that each feature set gets its own preprocessing pipeline. Here is an example:
dynamic_transformers = DynamicUnionPipeline(get_search_space("all_transformers"), max_estimators=4)
dynamic_transformers_with_passthrough = tpot.search_spaces.pipelines.UnionPipeline([
dynamic_transformers,
tpot.config.get_search_space("Passthrough")],
)
multi_step_engineering = DynamicLinearPipeline(dynamic_transformers_with_passthrough, max_length=4)
fss_engineering_search_space = SequentialPipeline([fss_search_space, multi_step_engineering])
union_fss_engineering_search_space = DynamicUnionPipeline(fss_engineering_search_space)
final_fancy_search_space = SequentialPipeline([union_fss_engineering_search_space, classification_search_space])
final_fancy_search_space.generate(rng=3).export_pipeline()
Pipeline(steps=[('featureunion',
FeatureUnion(transformer_list=[('pipeline-1',
Pipeline(steps=[('featuresetselector',
FeatureSetSelector(name='group_one',
sel_subset=['a',
'b',
'c'])),
('pipeline',
Pipeline(steps=[('featureunion',
FeatureUnion(transformer_list=[('featureunion',
FeatureUnion(transformer_list=[('zerocount',
ZeroCount())])),
('passthrough',
Passth...
KBinsDiscretizer(encode='onehot-dense',
n_bins=11)),
('rbfsampler',
RBFSampler(gamma=0.0925899621466,
n_components=17)),
('maxabsscaler',
MaxAbsScaler())])),
('passthrough',
Passthrough())]))]))]))])),
('randomforestclassifier',
RandomForestClassifier(bootstrap=False,
class_weight='balanced',
max_features=0.8205760841606,
min_samples_leaf=16,
min_samples_split=11, n_estimators=128,
n_jobs=1))])
Other examples¶
dict¶
import tpot
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
import sklearn
subsets = { "group_one" : ['a','b','c'],
"group_two" : ['d','e','f'],
"group_three" : ['g','h','i'],
}
fss_search_space = tpot.search_spaces.nodes.FSSNode(subsets=subsets)
selector = fss_search_space.generate(rng=1).export_pipeline()
selector.set_output(transform="pandas")
selector.fit(X_train)
selector.transform(X_train)
| | d | e | f |
|---|---|---|---|
| 162 | 1.315442 | -1.039258 | 0.194516 |
| 168 | -1.908995 | -0.953551 | -1.430472 |
| 214 | 0.181162 | 1.022858 | -2.289700 |
| 895 | 2.825765 | -1.205520 | 1.147791 |
| 154 | -2.300481 | 1.023173 | 0.449162 |
| ... | ... | ... | ... |
| 32 | -1.793062 | 2.209649 | -0.045031 |
| 829 | -0.221409 | 1.688750 | 0.069356 |
| 176 | 0.141471 | -1.880294 | 1.984397 |
| 124 | -0.359952 | 1.141758 | 2.019301 |
| 35 | 0.171312 | 0.079332 | 0.178522 |
750 rows × 3 columns
list¶
import tpot
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
import sklearn
subsets = [['a','b','c'],['d','e','f'],['g','h','i']]
fss_search_space = tpot.search_spaces.nodes.FSSNode(subsets=subsets)
selector = fss_search_space.generate(rng=1).export_pipeline()
selector.set_output(transform="pandas")
selector.fit(X_train)
selector.transform(X_train)
| | d | e | f |
|---|---|---|---|
| 162 | 1.315442 | -1.039258 | 0.194516 |
| 168 | -1.908995 | -0.953551 | -1.430472 |
| 214 | 0.181162 | 1.022858 | -2.289700 |
| 895 | 2.825765 | -1.205520 | 1.147791 |
| 154 | -2.300481 | 1.023173 | 0.449162 |
| ... | ... | ... | ... |
| 32 | -1.793062 | 2.209649 | -0.045031 |
| 829 | -0.221409 | 1.688750 | 0.069356 |
| 176 | 0.141471 | -1.880294 | 1.984397 |
| 124 | -0.359952 | 1.141758 | 2.019301 |
| 35 | 0.171312 | 0.079332 | 0.178522 |
750 rows × 3 columns
CSV file¶
Note: watch out for spaces in the CSV file!
import tpot
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
import sklearn
subsets = 'simple_fss.csv'
'''
# simple_fss.csv
one,a,b,c
two,d,e,f
three,g,h,i
'''
fss_search_space = tpot.search_spaces.nodes.FSSNode(subsets=subsets)
selector = fss_search_space.generate(rng=1).export_pipeline()
selector.set_output(transform="pandas")
selector.fit(X_train)
selector.transform(X_train)
| | d | e | f |
|---|---|---|---|
| 162 | 1.315442 | -1.039258 | 0.194516 |
| 168 | -1.908995 | -0.953551 | -1.430472 |
| 214 | 0.181162 | 1.022858 | -2.289700 |
| 895 | 2.825765 | -1.205520 | 1.147791 |
| 154 | -2.300481 | 1.023173 | 0.449162 |
| ... | ... | ... | ... |
| 32 | -1.793062 | 2.209649 | -0.045031 |
| 829 | -0.221409 | 1.688750 | 0.069356 |
| 176 | 0.141471 | -1.880294 | 1.984397 |
| 124 | -0.359952 | 1.141758 | 2.019301 |
| 35 | 0.171312 | 0.079332 | 0.178522 |
750 rows × 3 columns
All of the above works the same way with numpy data, except that column names are replaced by integer indexes.
import tpot
import sklearn.datasets
from sklearn.linear_model import LogisticRegression
import numpy as np
import pandas as pd
n_features = 6
X, y = sklearn.datasets.make_classification(n_samples=1000, n_features=n_features, n_informative=6, n_redundant=0, n_repeated=0, n_classes=2, n_clusters_per_class=2, weights=None, flip_y=0.01, class_sep=1.0, hypercube=True, shift=0.0, scale=1.0, shuffle=True, random_state=None)
X = np.hstack([X, np.random.rand(X.shape[0],3)]) #add three uninformative features
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, train_size=0.75, test_size=0.25)
print(X)
[[-0.31748616 2.20805859 -2.21719911 ... 0.5595234 0.80605806 0.41484993]
 [ 2.8673731 1.45905176 -1.11516833 ... 0.74646156 0.95635356 0.03575697]
 [-1.64867116 2.14478724 2.31196119 ... 0.22969172 0.72447325 0.81842014]
 ...
 [ 1.17772695 0.7188885 -0.52548496 ... 0.99266968 0.95436462 0.57430922]
 [ 0.14052568 0.15042817 -0.86281564 ... 0.25379746 0.1818071 0.55993116]
 [ 1.37273916 -0.14898886 -0.89938251 ... 0.767549 0.66184827 0.49174333]]
import tpot
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
import sklearn
subsets = { "group_one" : [0,1,2],
"group_two" : [3,4,5],
"group_three" : [6,7,8],
}
fss_search_space = tpot.search_spaces.nodes.FSSNode(subsets=subsets)
selector = fss_search_space.generate(rng=1).export_pipeline()
selector.fit(X_train)
selector.transform(X_train)
array([[-0.76235619, -1.97629642, 1.05447979],
[ 2.16944118, -1.55515714, 0.67925075],
[ 1.96557199, 0.13789923, 1.588271 ],
...,
[ 0.78956322, 2.12535053, 0.63115798],
[-0.80184984, -0.40793866, 1.3880617 ],
[-1.38085267, 1.62568989, -1.42046795]])
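If you already have subsets defined by column name and need the integer-index form for numpy input, the conversion is a simple lookup. The sketch below assumes a hypothetical column order `a` through `i` that matches the nine columns of the array; the variable names are illustrative only.

```python
# Assumed column order of the numpy array (hypothetical names for its 9 columns).
columns = ["a", "b", "c", "d", "e", "f", "g", "h", "i"]

named_subsets = {
    "group_one": ["a", "b", "c"],
    "group_two": ["d", "e", "f"],
    "group_three": ["g", "h", "i"],
}

# Map each column name to its position, then translate every subset.
position = {name: index for index, name in enumerate(columns)}
index_subsets = {
    group: [position[column] for column in cols]
    for group, cols in named_subsets.items()
}

print(index_subsets)
# {'group_one': [0, 1, 2], 'group_two': [3, 4, 5], 'group_three': [6, 7, 8]}
```

The resulting dictionary has the same shape as the index-based subsets used in the numpy example above.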