Genetic Feature Selection

This file is part of the TPOT library.

The current version of TPOT was developed at Cedars-Sinai by:
- Pedro Henrique Ribeiro (https://github.com/perib, https://www.linkedin.com/in/pedro-ribeiro/)
- Anil Saini (anil.saini@cshs.org)
- Jose Hernandez (jgh9094@gmail.com)
- Jay Moran (jay.moran@cshs.org)
- Nicholas Matsumoto (nicholas.matsumoto@cshs.org)
- Hyunjun Choi (hyunjun.choi@cshs.org)
- Gabriel Ketron (gabriel.ketron@cshs.org)
- Miguel E. Hernandez (miguel.e.hernandez@cshs.org)
- Jason Moore (moorejh28@gmail.com)

The original version of TPOT was primarily developed at the University of Pennsylvania by:
- Randal S. Olson (rso@randalolson.com)
- Weixuan Fu (weixuanf@upenn.edu)
- Daniel Angell (dpa34@drexel.edu)
- Jason Moore (moorejh28@gmail.com)
- and many more generous open-source contributors

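TPOT is free software: you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

TPOT is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.

You should have received a copy of the GNU Lesser General Public License along with TPOT. If not, see https://gnu.ac.cn/licenses/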

GeneticFeatureSelectorNode

Bases: SearchSpace

Source code in tpot/search_spaces/nodes/genetic_feature_selection.py
class GeneticFeatureSelectorNode(SearchSpace):
    def __init__(self,                     
                    n_features,
                    start_p=0.2,
                    mutation_rate = 0.1,
                    crossover_rate = 0.1,
                    mutation_rate_rate = 0, # Experimental, but seems to help. The theory is that the search takes smaller steps as it gets closer to the optimal solution.
                    crossover_rate_rate = 0, # Otherwise, if mutation_rate is too small, the search takes forever, and if it is too large, it never converges.
                    ):
        """
        A node that generates a GeneticFeatureSelectorIndividual. Uses genetic algorithm to select novel subsets of features.

        Parameters
        ----------
        n_features : int
            Number of features in the dataset.
        start_p : float
            Probability of selecting a given feature for the initial subset of features.
        mutation_rate : float
            Probability of adding/removing a feature from the subset of features.
        crossover_rate : float
            Probability of swapping a feature between two subsets of features.
        mutation_rate_rate : float
            Probability of changing the mutation rate. (experimental)
        crossover_rate_rate : float
            Probability of changing the crossover rate. (experimental)

        """

        self.n_features = n_features
        self.start_p = start_p
        self.mutation_rate = mutation_rate
        self.crossover_rate = crossover_rate
        self.mutation_rate_rate = mutation_rate_rate
        self.crossover_rate_rate = crossover_rate_rate


    def generate(self, rng=None) -> SklearnIndividual:
        return GeneticFeatureSelectorIndividual(   mask=self.n_features,
                                                    start_p=self.start_p,
                                                    mutation_rate=self.mutation_rate,
                                                    crossover_rate=self.crossover_rate,
                                                    mutation_rate_rate=self.mutation_rate_rate,
                                                    crossover_rate_rate=self.crossover_rate_rate,
                                                    rng=rng
                                                )

__init__(n_features, start_p=0.2, mutation_rate=0.1, crossover_rate=0.1, mutation_rate_rate=0, crossover_rate_rate=0)

A node that generates a GeneticFeatureSelectorIndividual. Uses a genetic algorithm to select novel subsets of features.

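Parameters

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| n_features | int | Number of features in the dataset. | required |
| start_p | float | Probability of selecting a given feature for the initial subset of features. | 0.2 |
| mutation_rate | float | Probability of adding/removing a feature from the subset of features. | 0.1 |
| crossover_rate | float | Probability of swapping a feature between two subsets of features. | 0.1 |
| mutation_rate_rate | float | Probability of changing the mutation rate. (experimental) | 0 |
| crossover_rate_rate | float | Probability of changing the crossover rate. (experimental) | 0 |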
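The parameter descriptions above can be made concrete with a small standard-library sketch. This is not TPOT's implementation: the helper names `initial_mask`, `mutate`, and `crossover` are hypothetical, and the rule that a subset is never left empty is an assumption made for illustration. It only shows one plausible reading of `start_p`, `mutation_rate`, and `crossover_rate` as per-feature probabilities acting on a boolean mask.

```python
import random


def initial_mask(n_features, start_p, rng):
    """Include each feature independently with probability start_p."""
    mask = [rng.random() < start_p for _ in range(n_features)]
    if not any(mask):
        # Assumption: force one feature in so the subset is never empty.
        mask[rng.randrange(n_features)] = True
    return mask


def mutate(mask, mutation_rate, rng):
    """Flip each entry (add or remove that feature) with probability mutation_rate."""
    child = [(not bit) if rng.random() < mutation_rate else bit for bit in mask]
    if not any(child):
        # Same assumption as above: never return an empty subset.
        child[rng.randrange(len(child))] = True
    return child


def crossover(mask_a, mask_b, crossover_rate, rng):
    """Swap each position between the two masks with probability crossover_rate."""
    a, b = list(mask_a), list(mask_b)
    for i in range(len(a)):
        if rng.random() < crossover_rate:
            a[i], b[i] = b[i], a[i]
    return a, b


rng = random.Random(0)  # seeded so the run is repeatable
parent_a = initial_mask(10, start_p=0.2, rng=rng)
parent_b = initial_mask(10, start_p=0.2, rng=rng)
child_a, child_b = crossover(parent_a, parent_b, crossover_rate=0.1, rng=rng)
mutant = mutate(child_a, mutation_rate=0.1, rng=rng)
```

With the defaults, roughly 20% of the features start out selected and each variation step changes about one feature in ten, so the search moves through nearby subsets rather than jumping to unrelated ones.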

MaskSelector

Bases: BaseEstimator, SelectorMixin

Select predefined feature subsets.

Source code in tpot/search_spaces/nodes/genetic_feature_selection.py
class MaskSelector(BaseEstimator, SelectorMixin):
    """Select predefined feature subsets."""

    def __init__(self, mask, set_output_transform=None):
        self.mask = mask
        self.set_output_transform = set_output_transform
        if set_output_transform is not None:
            self.set_output(transform=set_output_transform)

    def fit(self, X, y=None):
        self.n_features_in_ = X.shape[1]
        if isinstance(X, pd.DataFrame):
            self.feature_names_in_ = X.columns
        #     self.set_output(transform="pandas")
        self.is_fitted_ = True #so sklearn knows it's fitted
        return self

    def _get_tags(self):
        tags = {"allow_nan": True, "requires_y": False}
        return tags

    def _get_support_mask(self):
        return np.array(self.mask)

    def get_feature_names_out(self, input_features=None):
        return self.feature_names_in_[self.get_support()]
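The selection semantics of the listing can be sketched without scikit-learn. The helpers `select_columns` and `names_out` below are hypothetical and exist only for illustration: the first keeps the columns whose mask entry is true, and the second mirrors the boolean indexing used in `get_feature_names_out`. In the listing, `feature_names_in_` is set in `fit` only when `X` is a pandas DataFrame, so the name lookup there presumes DataFrame input.

```python
def select_columns(rows, mask):
    """Keep only the columns whose mask entry is True."""
    if any(len(row) != len(mask) for row in rows):
        raise ValueError("mask length must match the number of columns")
    return [[value for value, keep in zip(row, mask) if keep] for row in rows]


def names_out(feature_names, mask):
    """Return the names of the kept columns, in their original order."""
    return [name for name, keep in zip(feature_names, mask) if keep]


# Made-up example data: four features, two of which are selected.
feature_names = ["age", "height", "weight", "income"]
mask = [True, False, True, False]
rows = [
    [31, 170, 65, 40000],
    [45, 182, 80, 52000],
]

selected = select_columns(rows, mask)
kept_names = names_out(feature_names, mask)
```

Here `selected` holds only the `age` and `weight` columns, and `kept_names` lists those two names in their original order.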