
Feature Transformers

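This file is part of the TPOT library.

The current version of TPOT was developed at Cedars-Sinai by:
- Pedro Henrique Ribeiro (https://github.com/perib, https://www.linkedin.com/in/pedro-ribeiro/)
- Anil Saini (anil.saini@cshs.org)
- Jose Hernandez (jgh9094@gmail.com)
- Jay Moran (jay.moran@cshs.org)
- Nicholas Matsumoto (nicholas.matsumoto@cshs.org)
- Hyunjun Choi (hyunjun.choi@cshs.org)
- Gabriel Ketron (gabriel.ketron@cshs.org)
- Miguel E. Hernandez (miguel.e.hernandez@cshs.org)
- Jason Moore (moorejh28@gmail.com)

The original version of TPOT was primarily developed at the University of Pennsylvania by:
- Randal S. Olson (rso@randalolson.com)
- Weixuan Fu (weixuanf@upenn.edu)
- Daniel Angell (dpa34@drexel.edu)
- Jason Moore (moorejh28@gmail.com)
- and many more generous open-source contributors

TPOT is free software: you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

TPOT is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.

You should have received a copy of the GNU Lesser General Public License along with TPOT. If not, see https://gnu.ac.cn/licenses/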

CategoricalSelector

Bases: BaseEstimator, TransformerMixin

Meta-transformer that selects categorical features and transforms them using OneHotEncoder.

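Parameters

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| threshold | int | Maximum number of unique values a feature may have to be considered categorical. | 10 |
| minimum_fraction | float | Minimum fraction of unique values in a feature to consider the feature to be categorical. | None |
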
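
The role of threshold can be illustrated with a minimal sketch. It assumes that a feature counts as categorical when its number of unique values is at most threshold, which is how the parameter description reads; `select_categorical_columns` is a hypothetical helper written for this example and is not part of TPOT.

```python
import numpy as np


def select_categorical_columns(X, threshold=10):
    """Hypothetical helper: flag columns with at most `threshold` unique values."""
    X = np.asarray(X)
    # Assumption: "maximum number of unique values" is an inclusive bound.
    return [len(np.unique(X[:, j])) <= threshold for j in range(X.shape[1])]


X = np.array([
    [0, 1.5, 2],
    [1, 2.7, 2],
    [0, 3.1, 3],
    [1, 4.9, 3],
])

mask = select_categorical_columns(X, threshold=2)
print(mask)  # [True, False, True]
```

With threshold=2, the first and third columns (two unique values each) are flagged as categorical, while the second column (four unique values) is not.
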
Source code in tpot/builtin_modules/feature_transformers.py
class CategoricalSelector(BaseEstimator, TransformerMixin):
    """Meta-transformer for selecting categorical features and transform them using OneHotEncoder.

    Parameters
    ----------

    threshold : int, default=10
        Maximum number of unique values per feature to consider the feature
        to be categorical.

    minimum_fraction: float, default=None
        Minimum fraction of unique values in a feature to consider the feature
        to be categorical.
    """

    def __init__(self, threshold=10, minimum_fraction=None):
        """Create a CategoricalSelector object."""
        self.threshold = threshold
        self.minimum_fraction = minimum_fraction


    def fit(self, X, y=None):
        """Do nothing and return the estimator unchanged
        This method is just there to implement the usual API and hence
        work in pipelines.
        Parameters
        ----------
        X : array-like
        """
        X = check_array(X, accept_sparse='csr')
        return self


    def transform(self, X):
        """Select categorical features and transform them using OneHotEncoder.

        Parameters
        ----------
        X: numpy ndarray, {n_samples, n_components}
            New data, where n_samples is the number of samples and n_components is the number of components.

        Returns
        -------
        array-like, {n_samples, n_components}
        """
        selected = auto_select_categorical_features(X, threshold=self.threshold)
        X_sel, _, n_selected, _ = _X_selected(X, selected)

        if n_selected == 0:
            # No features selected.
            raise ValueError('No categorical feature was found!')
        else:
            ohe = OneHotEncoder(categorical_features='all', sparse=False, minimum_fraction=self.minimum_fraction)
            return ohe.fit_transform(X_sel)

__init__(threshold=10, minimum_fraction=None)

Create a CategoricalSelector object.

Source code in tpot/builtin_modules/feature_transformers.py
def __init__(self, threshold=10, minimum_fraction=None):
    """Create a CategoricalSelector object."""
    self.threshold = threshold
    self.minimum_fraction = minimum_fraction

fit(X, y=None)

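Does nothing beyond validating X and returns the estimator unchanged. This method exists only to implement the usual API so that the estimator works in pipelines.

Parameters

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| X | array-like | Input data. | required |

Source code in tpot/builtin_modules/feature_transformers.py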
def fit(self, X, y=None):
    """Do nothing and return the estimator unchanged
    This method is just there to implement the usual API and hence
    work in pipelines.
    Parameters
    ----------
    X : array-like
    """
    X = check_array(X, accept_sparse='csr')
    return self
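
The stateless fit pattern can be sketched in a few lines. `PassThroughSelector` is a hypothetical class used only for illustration; it mirrors the convention that fit learns nothing and returns the estimator itself, so that calls can be chained and the object can sit inside a pipeline.

```python
class PassThroughSelector:
    """Hypothetical minimal transformer mirroring the stateless fit() pattern."""

    def __init__(self, threshold=10):
        self.threshold = threshold

    def fit(self, X, y=None):
        # Nothing is learned here; returning self keeps the usual
        # estimator convention that fit() can be chained.
        return self

    def transform(self, X):
        # All real work would happen at transform time; this sketch
        # simply copies the rows.
        return [list(row) for row in X]


selector = PassThroughSelector(threshold=5)
X = [[1, 2], [3, 4]]

fitted = selector.fit(X)
print(fitted is selector)            # True
print(selector.fit(X).transform(X))  # [[1, 2], [3, 4]]
```

One consequence of this design, visible in the source listings on this page, is that transform calls fit_transform on whatever data it receives, so nothing is carried over from fit to transform.
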

transform(X)

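Select categorical features and transform them using OneHotEncoder.

Parameters

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| X | numpy ndarray, {n_samples, n_components} | New data, where n_samples is the number of samples and n_components is the number of components. | required |

Returns

| Type | Description |
| --- | --- |
| array-like, {n_samples, n_components} | The one-hot encoded categorical features. |

Source code in tpot/builtin_modules/feature_transformers.py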
def transform(self, X):
    """Select categorical features and transform them using OneHotEncoder.

    Parameters
    ----------
    X: numpy ndarray, {n_samples, n_components}
        New data, where n_samples is the number of samples and n_components is the number of components.

    Returns
    -------
    array-like, {n_samples, n_components}
    """
    selected = auto_select_categorical_features(X, threshold=self.threshold)
    X_sel, _, n_selected, _ = _X_selected(X, selected)

    if n_selected == 0:
        # No features selected.
        raise ValueError('No categorical feature was found!')
    else:
        ohe = OneHotEncoder(categorical_features='all', sparse=False, minimum_fraction=self.minimum_fraction)
        return ohe.fit_transform(X_sel)
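
The select-then-encode flow of transform can be approximated with plain NumPy. This is a sketch under stated assumptions, not TPOT's implementation: `one_hot_selected` is a hypothetical function, the inclusive threshold test is an assumption, and options of TPOT's own OneHotEncoder such as minimum_fraction are not modeled.

```python
import numpy as np


def one_hot_selected(X, threshold=10):
    """Hypothetical stand-in for the select-then-encode step (not TPOT's encoder)."""
    X = np.asarray(X)
    # Assumption: a column is categorical when it has at most `threshold` unique values.
    selected = [j for j in range(X.shape[1])
                if len(np.unique(X[:, j])) <= threshold]
    if not selected:
        # Same error message as the source listing above.
        raise ValueError('No categorical feature was found!')

    blocks = []
    for j in selected:
        # `codes` holds, for each row, the index of its category.
        categories, codes = np.unique(X[:, j], return_inverse=True)
        # Indexing an identity matrix by code yields one indicator row per sample.
        blocks.append(np.eye(len(categories))[codes])
    # Dense output, one indicator column per category of each selected feature.
    return np.hstack(blocks)


X = np.array([
    [0, 1.5, 7],
    [1, 2.7, 7],
    [0, 3.1, 9],
    [1, 4.9, 8],
])

encoded = one_hot_selected(X, threshold=3)
print(encoded.shape)  # (4, 5)
```

Here the first column contributes two indicator columns and the third contributes three; the second column has four unique values and is dropped because it exceeds the threshold.
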

ContinuousSelector

Bases: BaseEstimator, TransformerMixin

Meta-transformer that selects continuous features and transforms them using PCA.

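Parameters

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| threshold | int | Maximum number of unique values a feature may have to be considered categorical. Features with more unique values are treated as continuous. | 10 |
| svd_solver | string {'auto', 'full', 'arpack', 'randomized'} | auto: the solver is selected by a default policy based on X.shape and n_components: if the input data is larger than 500x500 and the number of components to extract is lower than 80% of the smallest dimension of the data, the more efficient 'randomized' method is enabled. Otherwise the exact full SVD is computed and optionally truncated afterwards. full: run exact full SVD calling the standard LAPACK solver via scipy.linalg.svd and select the components by postprocessing. arpack: run SVD truncated to n_components calling the ARPACK solver via scipy.sparse.linalg.svds. It requires strictly 0 < n_components < X.shape[1]. randomized: run randomized SVD by the method of Halko et al. | 'randomized' |
| iterated_power | int >= 0, or 'auto' | Number of iterations for the power method computed by svd_solver == 'randomized'. | 'auto' |
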
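
The 'auto' policy and the arpack constraint described for svd_solver can be written out as a small sketch. Both functions are hypothetical and follow only the wording of the parameter description; reading "larger than 500x500" as "the larger dimension exceeds 500" is an assumption, and the heuristic actually used by the underlying PCA implementation may differ between versions.

```python
def choose_svd_solver(n_samples, n_features, n_components):
    """Hypothetical sketch of the 'auto' policy as described in the text."""
    # Assumption: "larger than 500x500" means the larger dimension exceeds 500.
    is_large = max(n_samples, n_features) > 500
    # Fewer components than 80% of the smallest dimension of the data.
    is_few_components = n_components < 0.8 * min(n_samples, n_features)
    if is_large and is_few_components:
        return 'randomized'
    return 'full'


def arpack_is_valid(n_components, n_features):
    """arpack requires strictly 0 < n_components < X.shape[1]."""
    return 0 < n_components < n_features


print(choose_svd_solver(1000, 600, 50))   # randomized
print(choose_svd_solver(100, 20, 5))      # full
print(choose_svd_solver(1000, 600, 590))  # full
print(arpack_is_valid(20, 20))            # False
```

Because ContinuousSelector defaults to svd_solver='randomized', this policy only comes into play when 'auto' is passed explicitly.
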
Source code in tpot/builtin_modules/feature_transformers.py
class ContinuousSelector(BaseEstimator, TransformerMixin):
    """Meta-transformer for selecting continuous features and transform them using PCA.

    Parameters
    ----------

    threshold : int, default=10
        Maximum number of unique values per feature to consider the feature
        to be categorical.

    svd_solver : string {'auto', 'full', 'arpack', 'randomized'}
        auto :
            the solver is selected by a default policy based on `X.shape` and
            `n_components`: if the input data is larger than 500x500 and the
            number of components to extract is lower than 80% of the smallest
            dimension of the data, then the more efficient 'randomized'
            method is enabled. Otherwise the exact full SVD is computed and
            optionally truncated afterwards.
        full :
            run exact full SVD calling the standard LAPACK solver via
            `scipy.linalg.svd` and select the components by postprocessing
        arpack :
            run SVD truncated to n_components calling ARPACK solver via
            `scipy.sparse.linalg.svds`. It requires strictly
            0 < n_components < X.shape[1]
        randomized :
            run randomized SVD by the method of Halko et al.

    iterated_power : int >= 0, or 'auto', (default 'auto')
        Number of iterations for the power method computed by
        svd_solver == 'randomized'.

    """

    def __init__(self, threshold=10, svd_solver='randomized' ,iterated_power='auto', random_state=42):
        """Create a ContinuousSelector object."""
        self.threshold = threshold
        self.svd_solver = svd_solver
        self.iterated_power = iterated_power
        self.random_state = random_state


    def fit(self, X, y=None):
        """Do nothing and return the estimator unchanged
        This method is just there to implement the usual API and hence
        work in pipelines.
        Parameters
        ----------
        X : array-like
        """
        X = check_array(X)
        return self


    def transform(self, X):
        """Select continuous features and transform them using PCA.

        Parameters
        ----------
        X: numpy ndarray, {n_samples, n_components}
            New data, where n_samples is the number of samples and n_components is the number of components.

        Returns
        -------
        array-like, {n_samples, n_components}
        """
        selected = auto_select_categorical_features(X, threshold=self.threshold)
        _, X_sel, n_selected, _ = _X_selected(X, selected)

        if n_selected == 0:
            # No features selected.
            raise ValueError('No continuous feature was found!')
        else:
            pca = PCA(svd_solver=self.svd_solver, iterated_power=self.iterated_power, random_state=self.random_state)
            return pca.fit_transform(X_sel)

__init__(threshold=10, svd_solver='randomized', iterated_power='auto', random_state=42)

Create a ContinuousSelector object.

Source code in tpot/builtin_modules/feature_transformers.py
def __init__(self, threshold=10, svd_solver='randomized' ,iterated_power='auto', random_state=42):
    """Create a ContinuousSelector object."""
    self.threshold = threshold
    self.svd_solver = svd_solver
    self.iterated_power = iterated_power
    self.random_state = random_state

fit(X, y=None)

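Does nothing beyond validating X and returns the estimator unchanged. This method exists only to implement the usual API so that the estimator works in pipelines.

Parameters

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| X | array-like | Input data. | required |

Source code in tpot/builtin_modules/feature_transformers.py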
def fit(self, X, y=None):
    """Do nothing and return the estimator unchanged
    This method is just there to implement the usual API and hence
    work in pipelines.
    Parameters
    ----------
    X : array-like
    """
    X = check_array(X)
    return self

transform(X)

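Select continuous features and transform them using PCA.

Parameters

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| X | numpy ndarray, {n_samples, n_components} | New data, where n_samples is the number of samples and n_components is the number of components. | required |

Returns

| Type | Description |
| --- | --- |
| array-like, {n_samples, n_components} | The PCA-transformed continuous features. |

Source code in tpot/builtin_modules/feature_transformers.py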
def transform(self, X):
    """Select continuous features and transform them using PCA.

    Parameters
    ----------
    X: numpy ndarray, {n_samples, n_components}
        New data, where n_samples is the number of samples and n_components is the number of components.

    Returns
    -------
    array-like, {n_samples, n_components}
    """
    selected = auto_select_categorical_features(X, threshold=self.threshold)
    _, X_sel, n_selected, _ = _X_selected(X, selected)

    if n_selected == 0:
        # No features selected.
        raise ValueError('No continuous feature was found!')
    else:
        pca = PCA(svd_solver=self.svd_solver, iterated_power=self.iterated_power, random_state=self.random_state)
        return pca.fit_transform(X_sel)
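
What the continuous branch computes can be sketched with NumPy alone. This is an illustration under stated assumptions, not TPOT's code: `pca_on_continuous` is a hypothetical function, columns are treated as continuous when they have more than threshold unique values, and an exact SVD stands in for the randomized solver, so results should agree with a PCA projection only up to the sign of each component.

```python
import numpy as np


def pca_on_continuous(X, threshold=10):
    """Hypothetical stand-in: keep continuous columns, then project with PCA."""
    X = np.asarray(X, dtype=float)
    # Assumption: a column is continuous when it has more than `threshold` unique values.
    continuous = [j for j in range(X.shape[1])
                  if len(np.unique(X[:, j])) > threshold]
    if not continuous:
        # Same error message as the source listing above.
        raise ValueError('No continuous feature was found!')

    X_sel = X[:, continuous]
    X_centered = X_sel - X_sel.mean(axis=0)
    # Rows of Vt are the principal axes, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    # Project the centered data onto all principal axes.
    return X_centered @ Vt.T


rng = np.random.default_rng(0)
X = np.column_stack([
    rng.integers(0, 3, size=50),     # at most 3 unique values: categorical
    rng.normal(size=50),             # continuous
    rng.normal(scale=5.0, size=50),  # continuous
])

Z = pca_on_continuous(X, threshold=10)
print(Z.shape)  # (50, 2)
```

The low-cardinality first column is dropped, and the two remaining columns are rotated onto their principal axes, so the output keeps the sample count and has one column per retained component.
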