
Passkbinsdiscretizer

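This file is part of the TPOT library.

The current version of TPOT was developed at Cedars-Sinai by:
- Pedro Henrique Ribeiro (https://github.com/perib, https://www.linkedin.com/in/pedro-ribeiro/)
- Anil Saini (anil.saini@cshs.org)
- Jose Hernandez (jgh9094@gmail.com)
- Jay Moran (jay.moran@cshs.org)
- Nicholas Matsumoto (nicholas.matsumoto@cshs.org)
- Hyunjun Choi (hyunjun.choi@cshs.org)
- Gabriel Ketron (gabriel.ketron@cshs.org)
- Miguel E. Hernandez (miguel.e.hernandez@cshs.org)
- Jason Moore (moorejh28@gmail.com)

The original version of TPOT was primarily developed at the University of Pennsylvania by:
- Randal S. Olson (rso@randalolson.com)
- Weixuan Fu (weixuanf@upenn.edu)
- Daniel Angell (dpa34@drexel.edu)
- Jason Moore (moorejh28@gmail.com)
- and many more generous open-source contributors

TPOT is free software: you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

TPOT is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.

You should have received a copy of the GNU Lesser General Public License along with TPOT. If not, see https://gnu.ac.cn/licenses/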

PassKBinsDiscretizer

Bases: BaseEstimator, TransformerMixin

Source code in tpot/builtin_modules/passkbinsdiscretizer.py
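# Imports this listing relies on; select_features is defined in the same module (listed further down this page)
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import KBinsDiscretizer


class PassKBinsDiscretizer(BaseEstimator, TransformerMixin):
    def __init__(self, n_bins=5, encode='onehot-dense', strategy='quantile', subsample=None, random_state=None):
        """
        Same as sklearn.preprocessing.KBinsDiscretizer, but passes through columns that are not discretized due to having fewer than n_bins unique values instead of ignoring them.
        See sklearn.preprocessing.KBinsDiscretizer for more information.
        """
        # Store constructor arguments unchanged, as scikit-learn estimators are expected to do
        self.n_bins = n_bins
        self.encode = encode
        self.strategy = strategy
        self.subsample = subsample
        self.random_state = random_state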

    def fit(self, X, y=None):
        # Identify columns with more than 10 unique values (min_unique is hardcoded here, independent of n_bins)
        # Then build a ColumnTransformer that discretizes those columns and passes the remaining ones through unchanged
        self.selected_columns_ = select_features(X, min_unique=10)
        if isinstance(X, pd.DataFrame):
            self.not_selected_columns_ = [col for col in X.columns if col not in self.selected_columns_]
        else:
            self.not_selected_columns_ = [i for i in range(X.shape[1]) if i not in self.selected_columns_]

        enc = KBinsDiscretizer(n_bins=self.n_bins, encode=self.encode, strategy=self.strategy, subsample=self.subsample, random_state=self.random_state)
        self.transformer = ColumnTransformer([
            ('discretizer', enc, self.selected_columns_),
            ('passthrough', 'passthrough', self.not_selected_columns_)
        ])
        self.transformer.fit(X)
        return self

    def transform(self, X):
        return self.transformer.transform(X)

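random_state = random_state (instance attribute)

Same as sklearn.preprocessing.KBinsDiscretizer, but columns that are not discretized because they have fewer than n_bins unique values are passed through unchanged instead of being ignored. Note that fit in the source above selects columns with select_features(X, min_unique=10), so the cutoff actually applied is more than 10 unique values, regardless of n_bins. See sklearn.preprocessing.KBinsDiscretizer for more information.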
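To make the pass-through behavior concrete, here is a minimal sketch that rebuilds the same ColumnTransformer directly with scikit-learn, so it runs without TPOT installed. The toy data and the variable names (`selected`, `not_selected`, `ct`, `Xt`) are assumptions for illustration, and `subsample` and `random_state` are left at scikit-learn's defaults rather than set as in the class.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(0)
# Column 0: continuous values (many unique values).
# Column 1: binary flag (at most 2 unique values).
X = np.column_stack([rng.normal(size=100), rng.integers(0, 2, size=100)])

# Same split rule as select_features(X, min_unique=10) for a numpy array.
selected = [i for i in range(X.shape[1]) if len(np.unique(X[:, i])) > 10]
not_selected = [i for i in range(X.shape[1]) if i not in selected]

# Discretize the selected columns, pass the others through unchanged.
enc = KBinsDiscretizer(n_bins=5, encode="onehot-dense", strategy="quantile")
ct = ColumnTransformer([
    ("discretizer", enc, selected),
    ("passthrough", "passthrough", not_selected),
])
Xt = ct.fit_transform(X)

print(selected, not_selected)  # which columns were binned vs. passed through
print(Xt.shape)                # one-hot bin columns followed by passthrough columns
```

In the output, the one-hot bin columns come first and the passed-through columns come last, because ColumnTransformer concatenates results in the order the transformers are listed.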

select_features(X, min_unique=10)

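Given a DataFrame or numpy array, return the list of columns that have more than min_unique unique values. For a DataFrame the list contains column labels; for a numpy array it contains integer column positions.

Parameters

| Name | Type | Description | Default |
|---|---|---|---|
| X | DataFrame or numpy array | Data to select features from | required |
| min_unique | int | Unique-value threshold; a column is selected only if it has strictly more than this many unique values | 10 |

Returns

| Type | Description |
|---|---|
| list | List of column indices that have more than min_unique unique values |

Source code in tpot/builtin_modules/passkbinsdiscretizer.py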
def select_features(X, min_unique=10):
    """
    Given a DataFrame or numpy array, return a list of column indices that have more than min_unique unique values.

    Parameters
    ----------
    X: DataFrame or numpy array
        Data to select features from
    min_unique: int, default=10
        Minimum number of unique values a column must have to be selected

    Returns
    -------
    list
        List of column indices that have more than min_unique unique values

    """

    if isinstance(X, pd.DataFrame):
        return [col for col in X.columns if len(X[col].unique()) > min_unique]
    else:
        return [i for i in range(X.shape[1]) if len(np.unique(X[:, i])) > min_unique]
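
The strict `>` comparison and the difference between DataFrame and array inputs are easy to miss, so here is a short sketch of both. The function body is copied from the listing above with the docstring omitted; the toy DataFrame and its column names (`age`, `flag`, `decile`) are made up for illustration.

```python
import numpy as np
import pandas as pd


def select_features(X, min_unique=10):
    # Copied from the listing above (docstring omitted).
    if isinstance(X, pd.DataFrame):
        return [col for col in X.columns if len(X[col].unique()) > min_unique]
    else:
        return [i for i in range(X.shape[1]) if len(np.unique(X[:, i])) > min_unique]


df = pd.DataFrame({
    "age": np.arange(20),          # 20 unique values
    "flag": np.arange(20) % 2,     # 2 unique values
    "decile": np.arange(20) % 10,  # exactly 10 unique values
})

labels = select_features(df)                 # DataFrame input: column labels
positions = select_features(df.to_numpy())   # array input: integer positions
relaxed = select_features(df, min_unique=9)  # lower threshold admits "decile"

print(labels)     # ['age']
print(positions)  # [0]
print(relaxed)    # ['age', 'decile']
```

A column with exactly min_unique unique values, such as `decile` under the default threshold, is not selected, because the comparison is strictly greater than.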