Genetic Feature Selection Nodes in TPOT¶

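TPOT can use evolutionary algorithms to optimize feature selection at the same time as it optimizes the pipeline. It includes two node search spaces with different feature selection strategies: FSSNode and GeneticFeatureSelectorNode.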

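FSSNode (Feature Set Selector) - Useful when you have a predefined list of feature sets to choose from. Each FeatureSetSelector node selects a single group of features to pass to the next step of the pipeline. Note that FSSNode does not create its own feature subsets, and it does not mix and match features across the predefined sets.
GeneticFeatureSelectorNode - Instead of choosing from a predefined list of subsets as FSSNode does, this node uses evolutionary algorithms to optimize a new subset of features from scratch. This is useful when no predefined feature grouping exists.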

This tutorial focuses on FSSNode. For more information on GeneticFeatureSelectorNode, see Tutorial 5.

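It can also be beneficial to pair these search spaces with a secondary objective function that minimizes complexity. This encourages TPOT to look for the simplest pipelines that use the fewest features.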

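tpot.objectives.number_of_nodes_objective - Can be used as an other_objective_function that counts the number of nodes in the pipeline.

tpot.objectives.complexity_scorer - A scorer that tries to count the total number of learned parameters (number of coefficients, number of nodes in decision trees, and so on).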
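
To make the trade-off between a primary score and a complexity objective concrete, here is a minimal sketch of comparing candidates by Pareto dominance, with the score maximized and the complexity minimized. It illustrates the general multi-objective idea only and is not TPOT's internal code; the `Candidate` class, the `dominates` and `pareto_front` helpers, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    score: float       # higher is better (e.g. ROC AUC)
    complexity: int    # lower is better (e.g. number of learned parameters)


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b on both objectives and strictly better on one."""
    no_worse = a.score >= b.score and a.complexity <= b.complexity
    strictly_better = a.score > b.score or a.complexity < b.complexity
    return no_worse and strictly_better


def pareto_front(candidates):
    """Keep only the candidates that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


# Hypothetical pipelines with made-up numbers, for illustration only.
candidates = [
    Candidate("one_group", score=0.92, complexity=300),
    Candidate("two_groups", score=0.98, complexity=600),
    Candidate("all_groups", score=0.98, complexity=1200),
]
front = pareto_front(candidates)
print([c.name for c in front])  # ['one_group', 'two_groups']
```

In this toy case, `all_groups` drops out because `two_groups` reaches the same score with fewer parameters, which is the kind of pressure a complexity objective is meant to apply.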

Feature Set Selector¶

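FeatureSetSelector is a subclass of sklearn.feature_selection.SelectorMixin that simply returns manually specified columns. The sel_subset parameter gives the names or indexes of the columns it selects. The transform function then indexes into the data and returns those columns. You can optionally name the group with the name parameter; the name is for bookkeeping only and is not used by the class.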

sel_subset: list or int
    If X is a dataframe, items in sel_subset list must correspond to column names
    If X is a numpy array, items in sel_subset list must correspond to column indexes
    int: index of a single column

In [1]

import tpot
import pandas as pd
import numpy as np
# Build a NumPy array of 10 identical rows holding the values 0 through 5,
# then wrap it in a DataFrame with columns a, b, c, d, e, f.
data = np.repeat([np.arange(6)],10,0)

df = pd.DataFrame(data,columns=['a','b','c','d','e','f'])
fss = tpot.builtin_modules.FeatureSetSelector(name='test',sel_subset=['a','b','c'])

print("original DataFrame")
print(df)
print("Transformed Data")
print(fss.fit_transform(df))

original DataFrame
   a  b  c  d  e  f
0  0  1  2  3  4  5
1  0  1  2  3  4  5
2  0  1  2  3  4  5
3  0  1  2  3  4  5
4  0  1  2  3  4  5
5  0  1  2  3  4  5
6  0  1  2  3  4  5
7  0  1  2  3  4  5
8  0  1  2  3  4  5
9  0  1  2  3  4  5
Transformed Data
[[0 1 2]
 [0 1 2]
 [0 1 2]
 [0 1 2]
 [0 1 2]
 [0 1 2]
 [0 1 2]
 [0 1 2]
 [0 1 2]
 [0 1 2]]

FSSNode¶

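FSSNode is a node search space that simply selects one feature set from a list of feature sets. It works the same way as EstimatorNode but provides a simpler interface for defining the feature sets.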

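Note that FSS is only well defined when used as the first step of a pipeline. Downstream nodes receive data that has already been transformed, so the original indexes no longer correspond to the same columns in the transformed data.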
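
The following sketch shows why positional indexes stop being meaningful after an upstream transformation. It uses plain Python lists rather than TPOT or scikit-learn objects, and `select` and `drop_first_column` are hypothetical helpers invented for the illustration.

```python
# Toy rows with named columns; group_two is meant to be columns d, e, f.
columns = ["a", "b", "c", "d", "e", "f"]
rows = [
    [1, 2, 3, 4, 5, 6],
    [7, 8, 9, 10, 11, 12],
]
group_two = [3, 4, 5]  # positional indexes of d, e, f in the raw data


def select(rows, indexes):
    """Return only the requested column positions from each row."""
    return [[row[i] for i in indexes] for row in rows]


def drop_first_column(rows):
    """Hypothetical upstream step that changes the column layout."""
    return [row[1:] for row in rows]


# As the first step, the indexes pick d, e, f as intended.
first_step = select(rows, group_two)

# After an upstream transformation the same indexes point elsewhere;
# here the last index even runs past the end of the shortened rows.
shifted = drop_first_column(rows)
try:
    select(shifted, group_two)
    misaligned = False
except IndexError:
    misaligned = True

# Position 3 used to hold the value of 'd' (4); it now holds the value of 'e' (5).
print(first_step[0], shifted[0][3], misaligned)  # [4, 5, 6] 5 True
```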

FSSNode takes a single parameter, subsets, which defines the feature groups. There are several ways to define the subsets.

subsets : str or list, default=None
        Sets the subsets that the FeatureSetSelector will select from if set as an option in one of the configuration dictionaries.
        Features are defined by column names if using a Pandas data frame, or ints corresponding to indexes if using numpy arrays.
        - str : If a string, it is assumed to be a path to a csv file with the subsets.
            The first column is assumed to be the name of the subset and the remaining columns are the features in the subset.
        - list or np.ndarray : If a list or np.ndarray, it is assumed to be a list of subsets (i.e., a list of lists).
        - dict : A dictionary where keys are the names of the subsets and the values are the list of features.
        - int : If an int, it is assumed to be the number of subsets to generate. Each subset will contain one feature.
        - None : If None, each column will be treated as a subset. One column will be selected per subset.

Suppose you want three feature groups, each containing three columns. The following examples are equivalent.

str¶

subsets = 'simple_fss.csv'

# simple_fss.csv
group_one,1,2,3
group_two,4,5,6
group_three,7,8,9

dict¶

subsets = { "group_one" : [1,2,3], "group_two" : [4,5,6], "group_three" : [7,8,9], }

list¶

subsets = [[1,2,3], [4,5,6], [7,8,9]]
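
As a rough check that the three forms describe the same groups, the sketch below parses the CSV layout with the standard library and compares it with the dict and list forms. `subsets_from_csv` is a hypothetical helper, not TPOT's loader, and converting the entries to int is an assumption made so the values compare equal.

```python
import csv
import io

# Contents of a hypothetical simple_fss.csv: subset name first, then its features.
csv_text = "group_one,1,2,3\ngroup_two,4,5,6\ngroup_three,7,8,9\n"


def subsets_from_csv(text):
    """Parse 'name,feature,feature,...' rows into a {name: [features]} dict."""
    parsed = {}
    for row in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not row:
            continue  # skip blank lines
        name, *features = [cell.strip() for cell in row]
        parsed[name] = [int(f) for f in features]
    return parsed


from_csv = subsets_from_csv(csv_text)
from_dict = {"group_one": [1, 2, 3], "group_two": [4, 5, 6], "group_three": [7, 8, 9]}
from_list = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# All three forms describe the same groups; the list form just has no names.
print(from_csv == from_dict, list(from_dict.values()) == from_list)  # True True
```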

Examples¶

For these examples, we create a dummy dataset in which the first six columns are informative and the remaining columns are uninformative.

In [2]

import tpot
import sklearn.datasets
import sklearn.metrics
import sklearn.model_selection
from sklearn.linear_model import LogisticRegression
import numpy as np
import pandas as pd
from tpot.search_spaces.nodes import *
from tpot.search_spaces.pipelines import *
from tpot.config import get_search_space


X, y = sklearn.datasets.make_classification(n_samples=1000, n_features=6, n_informative=6, n_redundant=0, n_repeated=0, n_classes=2, n_clusters_per_class=2, weights=None, flip_y=0.01, class_sep=1.0, hypercube=True, shift=0.0, scale=1.0, shuffle=True, random_state=None)
X = np.hstack([X, np.random.rand(X.shape[0],6)]) # add six uninformative features
X = pd.DataFrame(X, columns=['a','b','c','d','e','f','g','h','i', 'j', 'k', 'l']) # a-f are informative; g-l are uninformative
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, train_size=0.75, test_size=0.25)

X.head()

Out[2]

	a	b	c	d	e	f	g	h	i	j	k	l
0	2.315814	-3.427720	-1.314654	-1.508737	-0.300932	0.089448	0.327651	0.329022	0.857495	0.734238	0.257218	0.652350
1	-0.191001	-1.396922	0.149488	-1.730145	-0.394932	0.519712	0.807762	0.509823	0.876159	0.002806	0.449828	0.671350
2	0.661264	-0.981737	0.703879	0.730321	-2.750405	0.396581	0.380302	0.532604	0.877129	0.610919	0.780108	0.625689
3	1.445936	0.354237	0.779040	1.288014	2.397133	0.186324	0.544191	0.465419	0.588535	0.919575	0.513460	0.831546
4	-0.989027	-1.824787	-1.448234	1.546442	1.643775	0.167975	0.188238	0.024149	0.544878	0.834503	0.877869	0.278330

Suppose that, based on prior knowledge or interest, we know the features can be grouped as follows:

In [3]

subsets = { "group_one" :  ['a','b','c',],
            "group_two" :  ['d','e','f'],
            "group_three" :  ['g','h','i'],
            "group_four" :  ['j','k','l'],
            }

We can create an FSSNode that selects from these subsets. Each node in a pipeline selects exactly one subset.

In [4]

fss_search_space = FSSNode(subsets=subsets)

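If we randomly sample from this search space, we get a selector that picks one of the predefined sets. In this case it selected group_two, which contains ['d', 'e', 'f']. (A random seed is passed to the generate function so that the same group is selected when the notebook is rerun.)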

In [5]

fss_selector = fss_search_space.generate(rng=1).export_pipeline()
fss_selector

Out[5]

FeatureSetSelector(name='group_two', sel_subset=['d', 'e', 'f'])

In [6]

fss_selector.set_output(transform="pandas") #by default sklearn selectors return numpy arrays. this will make it return pandas dataframes
fss_selector.fit(X_train)
fss_selector.transform(X_train)

Out[6]

	d	e	f
162	1.315442	-1.039258	0.194516
168	-1.908995	-0.953551	-1.430472
214	0.181162	1.022858	-2.289700
895	2.825765	-1.205520	1.147791
154	-2.300481	1.023173	0.449162
...	...	...	...
32	-1.793062	2.209649	-0.045031
829	-0.221409	1.688750	0.069356
176	0.141471	-1.880294	1.984397
124	-0.359952	1.141758	2.019301
35	0.171312	0.079332	0.178522

750 rows × 3 columns

Under the hood, mutation randomly selects a different feature set, and crossover swaps the feature sets selected by two individuals.

In [7]

ind1 = fss_search_space.generate(rng=1)
ind1.export_pipeline()

Out[7]

FeatureSetSelector(name='group_two', sel_subset=['d', 'e', 'f'])

In [8]

ind1.mutate()
ind1.export_pipeline()

Out[8]

FeatureSetSelector(name='group_four', sel_subset=['j', 'k', 'l'])
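
The mutation and crossover behavior described above can be sketched in plain Python. This is a conceptual stand-in, not TPOT's implementation: the `SubsetChoice` class and its methods are hypothetical, and it assumes mutation always moves to a different set.

```python
import random


class SubsetChoice:
    """Hypothetical stand-in for an individual that holds one chosen feature set."""

    def __init__(self, subsets, rng):
        self.subsets = subsets
        self.rng = rng
        self.name = rng.choice(sorted(subsets))

    def mutate(self):
        """Switch to a different, randomly chosen feature set."""
        others = [n for n in sorted(self.subsets) if n != self.name]
        self.name = self.rng.choice(others)

    def crossover(self, other):
        """Swap the chosen feature sets of two individuals."""
        self.name, other.name = other.name, self.name

    def columns(self):
        """Return the columns of the currently selected feature set."""
        return self.subsets[self.name]


subsets = {
    "group_one": ["a", "b", "c"],
    "group_two": ["d", "e", "f"],
    "group_three": ["g", "h", "i"],
    "group_four": ["j", "k", "l"],
}
rng = random.Random(1)  # seeded so reruns make the same choices
ind1 = SubsetChoice(subsets, rng)
ind2 = SubsetChoice(subsets, rng)

before_mutation = ind1.name
ind1.mutate()
changed = ind1.name != before_mutation

pair_before = (ind1.name, ind2.name)
ind1.crossover(ind2)
swapped = (ind1.name, ind2.name) == (pair_before[1], pair_before[0])
print(changed, swapped)  # True True
```

Note that neither operator builds a new subset: both only move between the predefined groups, which matches the earlier statement that FSSNode never mixes features across sets.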

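We can now use this when defining pipelines. For the first example, we build a simple linear pipeline in which the first step is a feature set selector and the second step is a classifier.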

In [9]

classification_search_space = get_search_space(["RandomForestClassifier"])
fss_and_classifier_search_space = SequentialPipeline([fss_search_space, classification_search_space])


est = tpot.TPOTEstimator(generations=5, 
                            scorers=["roc_auc_ovr", tpot.objectives.complexity_scorer],
                            scorers_weights=[1.0, -1.0],
                            n_jobs=32,
                            classification=True,
                            search_space = fss_and_classifier_search_space,
                            verbose=1,
                            )


scorer = sklearn.metrics.get_scorer('roc_auc_ovr')
est.fit(X_train, y_train)
print(scorer(est, X_test, y_test))

/Users/ketrong/Desktop/tpotvalidation/tpot/tpot/tpot_estimator/estimator.py:456: UserWarning: Both generations and max_time_mins are set. TPOT will terminate when the first condition is met.
  warnings.warn("Both generations and max_time_mins are set. TPOT will terminate when the first condition is met.")
Generation: 100%|██████████| 5/5 [00:36<00:00,  7.26s/it]

0.926166142557652

In [10]

est.fitted_pipeline_

Out[10]

Pipeline(steps=[('featuresetselector',
                 FeatureSetSelector(name='group_one',
                                    sel_subset=['a', 'b', 'c'])),
                ('randomforestclassifier',
                 RandomForestClassifier(max_features=0.30141491087,
                                        min_samples_leaf=4,
                                        min_samples_split=17, n_estimators=128,
                                        n_jobs=1))])

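With this setup, TPOT was able to identify one of the subsets that was used, but performance was not optimal. In this case, we happen to know that multiple feature sets are required. To include multiple feature sets in our pipelines, we have to modify the search space. There are three options.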

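UnionPipeline - Lets you select a fixed number of feature sets. With a UnionPipeline containing two FSSNodes, you always select two feature sets, which are simply concatenated together.
DynamicUnionPipeline - Lets you select multiple FSSNodes. Unlike UnionPipeline, you do not specify the number of sets to select; TPOT determines the best number. In addition, the same feature set cannot be selected twice. Note that although DynamicUnionPipeline can select several feature sets, it never mixes features from two sets into a new one.
GraphSearchPipeline - When FSSNode is set as the leaf_search_space, GraphSearchPipeline can also select multiple FSSNodes, which act as the inputs to the rest of the pipeline.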
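
The difference between a fixed-size union and a dynamic union can be sketched with the standard library alone. This is an illustration of the sampling behavior described above, not TPOT code: `sample_fixed_union` and `sample_dynamic_union` are hypothetical helpers, and the assumption that a fixed union fills each slot independently is mine.

```python
import random

group_names = ["group_one", "group_two", "group_three", "group_four"]


def sample_fixed_union(rng, n_slots=2):
    """Fixed-size union: every slot picks a set independently (repeats possible here)."""
    return [rng.choice(group_names) for _ in range(n_slots)]


def sample_dynamic_union(rng, max_sets=None):
    """Dynamic union: a variable number of distinct sets, never the same one twice."""
    max_sets = max_sets or len(group_names)
    k = rng.randint(1, max_sets)
    return rng.sample(group_names, k)


rng = random.Random(0)
fixed = sample_fixed_union(rng)
dynamic = sample_dynamic_union(rng)

# The fixed union always has two entries; the dynamic union has no duplicates.
print(len(fixed), len(dynamic) == len(set(dynamic)))  # 2 True
```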

UnionPipeline + FSSNode example¶

In [11]

union_fss_space = UnionPipeline([fss_search_space, fss_search_space])

In [12]

# this union search space will always select exactly two fss_search_space
selector1 = union_fss_space.generate(rng=1).export_pipeline()
selector1

Out[12]

FeatureUnion(transformer_list=[('featuresetselector-1',
                                FeatureSetSelector(name='group_two',
                                                   sel_subset=['d', 'e', 'f'])),
                               ('featuresetselector-2',
                                FeatureSetSelector(name='group_three',
                                                   sel_subset=['g', 'h',
                                                               'i']))])

In [13]

selector1.set_output(transform="pandas") 
selector1.fit(X_train)
selector1.transform(X_train)

Out[13]

	d	e	f	g	h	i
162	1.315442	-1.039258	0.194516	0.751175	0.411340	0.824754
168	-1.908995	-0.953551	-1.430472	0.072697	0.875766	0.953255
214	0.181162	1.022858	-2.289700	0.135222	0.395847	0.232638
895	2.825765	-1.205520	1.147791	0.925905	0.486645	0.710991
154	-2.300481	1.023173	0.449162	0.645161	0.131657	0.863514
...	...	...	...	...	...	...
32	-1.793062	2.209649	-0.045031	0.502947	0.994603	0.280062
829	-0.221409	1.688750	0.069356	0.328066	0.102381	0.492280
176	0.141471	-1.880294	1.984397	0.365550	0.465859	0.974601
124	-0.359952	1.141758	2.019301	0.329380	0.718647	0.365507
35	0.171312	0.079332	0.178522	0.215759	0.546279	0.662928

750 rows × 6 columns

DynamicUnionPipeline + FSSNode example¶

A dynamic union pipeline can select a variable number of feature sets.

In [14]

dynamic_fss_space = DynamicUnionPipeline(fss_search_space)
dynamic_fss_space.generate(rng=1).export_pipeline()

Out[14]

FeatureUnion(transformer_list=[('featuresetselector',
                                FeatureSetSelector(name='group_three',
                                                   sel_subset=['g', 'h',
                                                               'i']))])

In [15]

dynamic_fss_space.generate(rng=3).export_pipeline()

Out[15]

FeatureUnion(transformer_list=[('featuresetselector-1',
                                FeatureSetSelector(name='group_one',
                                                   sel_subset=['a', 'b', 'c'])),
                               ('featuresetselector-2',
                                FeatureSetSelector(name='group_four',
                                                   sel_subset=['j', 'k',
                                                               'l']))])

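GraphSearchPipeline + FSSNode example¶

FSSNodes must be set as the leaf search space, since they act as the inputs to the pipeline.

Here is an example pipeline from this search space that uses two feature sets.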

In [16]

graph_search_space = tpot.search_spaces.pipelines.GraphSearchPipeline(
    leaf_search_space = fss_search_space,
    inner_search_space = tpot.config.get_search_space(["transformers"]),
    root_search_space= tpot.config.get_search_space(["KNeighborsClassifier", "LogisticRegression", "DecisionTreeClassifier"]),
    max_size = 10,
)

graph_search_space.generate(rng=4).export_pipeline().plot()

[Image: plot of the sampled graph pipeline]

Optimize with TPOT¶

For this example, we optimize the DynamicUnion search space.

In [17]

import tpot
import sklearn.datasets
from sklearn.linear_model import LogisticRegression
import numpy as np


final_classification_search_space = SequentialPipeline([dynamic_fss_space, classification_search_space])

est = tpot.TPOTEstimator(generations=5, 
                            scorers=["roc_auc_ovr", tpot.objectives.complexity_scorer],
                            scorers_weights=[1.0, -1.0],
                            n_jobs=32,
                            classification=True,
                            search_space = final_classification_search_space,
                            verbose=1,
                            )


scorer = sklearn.metrics.get_scorer('roc_auc_ovr')

est.fit(X_train, y_train)
print(scorer(est, X_test, y_test))

/Users/ketrong/Desktop/tpotvalidation/tpot/tpot/tpot_estimator/estimator.py:456: UserWarning: Both generations and max_time_mins are set. TPOT will terminate when the first condition is met.
  warnings.warn("Both generations and max_time_mins are set. TPOT will terminate when the first condition is met.")
Generation: 100%|██████████| 5/5 [00:41<00:00,  8.33s/it]

0.9838836477987423

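We can see that this pipeline performs better (a test ROC AUC of about 0.98, versus about 0.93 for the single-subset pipeline) and correctly identifies group_one and group_two as the feature sets used to generate the data.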

In [18]

est.fitted_pipeline_

Out[18]

Pipeline(steps=[('featureunion',
                 FeatureUnion(transformer_list=[('featuresetselector-1',
                                                 FeatureSetSelector(name='group_two',
                                                                    sel_subset=['d',
                                                                                'e',
                                                                                'f'])),
                                                ('featuresetselector-2',
                                                 FeatureSetSelector(name='group_one',
                                                                    sel_subset=['a',
                                                                                'b',
                                                                                'c']))])),
                ('randomforestclassifier',
                 RandomForestClassifier(max_features=0.0530704381152,
                                        min_samples_leaf=2, min_samples_split=5,
                                        n_estimators=128, n_jobs=1))])

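Combining with existing search spaces¶

As with all search spaces, FSSNode can be combined with any other search space.

You can also pair it with an existing prebuilt template.

Getting fancy¶

If you want to get fancy, you can combine more search spaces so that each feature set gets its own preprocessing pipeline. Here is an example: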

In [22]

dynamic_transformers = DynamicUnionPipeline(get_search_space("all_transformers"), max_estimators=4)
dynamic_transformers_with_passthrough = tpot.search_spaces.pipelines.UnionPipeline([
    dynamic_transformers,
    tpot.config.get_search_space("Passthrough")],
    )
multi_step_engineering = DynamicLinearPipeline(dynamic_transformers_with_passthrough, max_length=4)
fss_engineering_search_space = SequentialPipeline([fss_search_space, multi_step_engineering])
union_fss_engineering_search_space = DynamicUnionPipeline(fss_engineering_search_space)

final_fancy_search_space = SequentialPipeline([union_fss_engineering_search_space, classification_search_space])
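
To show what "a separate preprocessing pipeline per feature set" means for the data itself, here is a conceptual sketch in plain Python. It does not use TPOT or scikit-learn; `standardize`, `min_max`, the `branches` list, and the toy values are all hypothetical choices made for the illustration.

```python
from statistics import mean, pstdev


def standardize(column):
    """Scale one column to zero mean and unit variance."""
    mu, sigma = mean(column), pstdev(column)
    return [(x - mu) / sigma for x in column]


def min_max(column):
    """Scale one column to the range [0, 1]."""
    lo, hi = min(column), max(column)
    return [(x - lo) / (hi - lo) for x in column]


# Toy data stored column-wise; the values are made up.
data = {"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0], "d": [5.0, 7.0, 9.0]}

# Each feature set is paired with its own preprocessing step.
branches = [
    (["a", "b"], standardize),
    (["d"], min_max),
]

# Apply each branch to its own columns, then concatenate the results side by side.
combined = {}
for cols, transform in branches:
    for col in cols:
        combined[col] = transform(data[col])

print(combined["d"])  # [0.0, 0.5, 1.0]
```

In the search space above, TPOT chooses both the feature sets and the transformers for each branch; the sketch fixes those choices by hand to make the data flow visible.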

Other examples¶

dict¶

In [24]

import tpot
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
import sklearn

subsets = { "group_one" :  ['a','b','c'],
            "group_two" :  ['d','e','f'],
            "group_three" :  ['g','h','i'],
            }

fss_search_space = tpot.search_spaces.nodes.FSSNode(subsets=subsets)

selector = fss_search_space.generate(rng=1).export_pipeline()
selector.set_output(transform="pandas")
selector.fit(X_train)
selector.transform(X_train)

Out[24]

	d	e	f
162	1.315442	-1.039258	0.194516
168	-1.908995	-0.953551	-1.430472
214	0.181162	1.022858	-2.289700
895	2.825765	-1.205520	1.147791
154	-2.300481	1.023173	0.449162
...	...	...	...
32	-1.793062	2.209649	-0.045031
829	-0.221409	1.688750	0.069356
176	0.141471	-1.880294	1.984397
124	-0.359952	1.141758	2.019301
35	0.171312	0.079332	0.178522

750 rows × 3 columns

list¶

In [25]

import tpot
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
import sklearn

subsets = [['a','b','c'],['d','e','f'],['g','h','i']]

fss_search_space = tpot.search_spaces.nodes.FSSNode(subsets=subsets)

selector = fss_search_space.generate(rng=1).export_pipeline()
selector.set_output(transform="pandas")
selector.fit(X_train)
selector.transform(X_train)

Out[25]

	d	e	f
162	1.315442	-1.039258	0.194516
168	-1.908995	-0.953551	-1.430472
214	0.181162	1.022858	-2.289700
895	2.825765	-1.205520	1.147791
154	-2.300481	1.023173	0.449162
...	...	...	...
32	-1.793062	2.209649	-0.045031
829	-0.221409	1.688750	0.069356
176	0.141471	-1.880294	1.984397
124	-0.359952	1.141758	2.019301
35	0.171312	0.079332	0.178522

750 rows × 3 columns

CSV file¶

Note: watch for stray spaces in the CSV file!
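
The CSV example assumes that simple_fss.csv already exists. One way to produce it without stray spaces is to write it with the standard csv module, as in this sketch. The temporary directory is only there to keep the example self-contained; the row contents follow the file layout shown in the cell's comment.

```python
import csv
import os
import tempfile

# Subset name first, then the column names that belong to that subset.
rows = [
    ["one", "a", "b", "c"],
    ["two", "d", "e", "f"],
    ["three", "g", "h", "i"],
]

# Written to a temporary directory here; use your own path in practice.
csv_path = os.path.join(tempfile.mkdtemp(), "simple_fss.csv")
with open(csv_path, "w", newline="") as handle:
    csv.writer(handle).writerows(rows)

# Read the file back to confirm there is no whitespace around any entry.
with open(csv_path, newline="") as handle:
    written = list(csv.reader(handle))
print(all(cell == cell.strip() for row in written for cell in row))  # True
```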

In [26]

import tpot
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
import sklearn

subsets = 'simple_fss.csv'
'''
# simple_fss.csv
one,a,b,c
two,d,e,f
three,g,h,i
'''

fss_search_space = tpot.search_spaces.nodes.FSSNode(subsets=subsets)

selector = fss_search_space.generate(rng=1).export_pipeline()
selector.set_output(transform="pandas")
selector.fit(X_train)
selector.transform(X_train)

Out[26]

	d	e	f
162	1.315442	-1.039258	0.194516
168	-1.908995	-0.953551	-1.430472
214	0.181162	1.022858	-2.289700
895	2.825765	-1.205520	1.147791
154	-2.300481	1.023173	0.449162
...	...	...	...
32	-1.793062	2.209649	-0.045031
829	-0.221409	1.688750	0.069356
176	0.141471	-1.880294	1.984397
124	-0.359952	1.141758	2.019301
35	0.171312	0.079332	0.178522

750 rows × 3 columns

Everything above works the same way with NumPy data, except that column names are replaced by integer indexes.
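
If the subsets are already defined by column name, they can be translated into positional indexes before switching to a NumPy array. This is a small sketch of that bookkeeping; `columns`, `named_subsets`, and `index_subsets` are illustrative names, not part of TPOT.

```python
columns = ["a", "b", "c", "d", "e", "f", "g", "h", "i"]
named_subsets = {
    "group_one": ["a", "b", "c"],
    "group_two": ["d", "e", "f"],
    "group_three": ["g", "h", "i"],
}

# Look up each name's position once, then translate every subset.
position = {name: i for i, name in enumerate(columns)}
index_subsets = {
    group: [position[name] for name in names]
    for group, names in named_subsets.items()
}
print(index_subsets)
# {'group_one': [0, 1, 2], 'group_two': [3, 4, 5], 'group_three': [6, 7, 8]}
```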

In [27]

import tpot
import sklearn.datasets
from sklearn.linear_model import LogisticRegression
import numpy as np
import pandas as pd

n_features = 6
X, y = sklearn.datasets.make_classification(n_samples=1000, n_features=n_features, n_informative=6, n_redundant=0, n_repeated=0, n_classes=2, n_clusters_per_class=2, weights=None, flip_y=0.01, class_sep=1.0, hypercube=True, shift=0.0, scale=1.0, shuffle=True, random_state=None)
X = np.hstack([X, np.random.rand(X.shape[0],3)]) #add three uninformative features

X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, train_size=0.75, test_size=0.25)

print(X)

[[-0.31748616  2.20805859 -2.21719911 ...  0.5595234   0.80605806
   0.41484993]
 [ 2.8673731   1.45905176 -1.11516833 ...  0.74646156  0.95635356
   0.03575697]
 [-1.64867116  2.14478724  2.31196119 ...  0.22969172  0.72447325
   0.81842014]
 ...
 [ 1.17772695  0.7188885  -0.52548496 ...  0.99266968  0.95436462
   0.57430922]
 [ 0.14052568  0.15042817 -0.86281564 ...  0.25379746  0.1818071
   0.55993116]
 [ 1.37273916 -0.14898886 -0.89938251 ...  0.767549    0.66184827
   0.49174333]]

In [28]

import tpot
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
import sklearn

subsets = { "group_one" :  [0,1,2],
            "group_two" :  [3,4,5],
            "group_three" :  [6,7,8],
            }

fss_search_space = tpot.search_spaces.nodes.FSSNode(subsets=subsets)
selector = fss_search_space.generate(rng=1).export_pipeline()
selector.fit(X_train)
selector.transform(X_train)

Out[28]

array([[-0.76235619, -1.97629642,  1.05447979],
       [ 2.16944118, -1.55515714,  0.67925075],
       [ 1.96557199,  0.13789923,  1.588271  ],
       ...,
       [ 0.78956322,  2.12535053,  0.63115798],
       [-0.80184984, -0.40793866,  1.3880617 ],
       [-1.38085267,  1.62568989, -1.42046795]])