
Column One Hot Encoder

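This file is part of the TPOT library.

The current version of TPOT was developed at Cedars-Sinai by:
- Pedro Henrique Ribeiro (https://github.com/perib, https://www.linkedin.com/in/pedro-ribeiro/)
- Anil Saini (anil.saini@cshs.org)
- Jose Hernandez (jgh9094@gmail.com)
- Jay Moran (jay.moran@cshs.org)
- Nicholas Matsumoto (nicholas.matsumoto@cshs.org)
- Hyunjun Choi (hyunjun.choi@cshs.org)
- Gabriel Ketron (gabriel.ketron@cshs.org)
- Miguel E. Hernandez (miguel.e.hernandez@cshs.org)
- Jason Moore (moorejh28@gmail.com)

The original version of TPOT was primarily developed at the University of Pennsylvania by:
- Randal S. Olson (rso@randalolson.com)
- Weixuan Fu (weixuanf@upenn.edu)
- Daniel Angell (dpa34@drexel.edu)
- Jason Moore (moorejh28@gmail.com)
- and many more generous open-source contributors

TPOT is free software: you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

TPOT is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.

You should have received a copy of the GNU Lesser General Public License along with TPOT. If not, see https://gnu.ac.cn/licenses/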

ColumnOneHotEncoder

Bases: BaseEstimator, TransformerMixin

Source code in tpot/builtin_modules/column_one_hot_encoder.py
class ColumnOneHotEncoder(BaseEstimator, TransformerMixin):


    def __init__(self, columns='auto', drop=None, handle_unknown='infrequent_if_exist', sparse_output=False, min_frequency=None,max_categories=None):
        '''
        A wrapper for OneHotEncoder that allows for onehot encoding of specific columns in a DataFrame or np array.

        Parameters
        ----------

        columns : str, list, default='auto'
            Determines which columns to onehot encode with sklearn.preprocessing.OneHotEncoder.
            - 'auto' : Automatically select categorical features based on columns with less than 10 unique values
            - 'categorical' : Automatically select categorical features
            - 'numeric' : Automatically select numeric features
            - 'all' : Select all features
            - list : A list of columns to select

        drop, handle_unknown, sparse_output, min_frequency, max_categories : see sklearn.preprocessing.OneHotEncoder

        '''

        self.columns = columns
        self.drop = drop
        self.handle_unknown = handle_unknown
        self.sparse_output = sparse_output
        self.min_frequency = min_frequency
        self.max_categories = max_categories



    def fit(self, X, y=None):
        """Fit OneHotEncoder to the selected columns of X.

        Resolves `columns` into `self.columns_` and fits the underlying
        encoder on those columns only. X is not transformed; self is returned.

        Parameters
        ----------
        X : array-like or sparse matrix, shape=(n_samples, n_features)
            Dense array or sparse matrix.
        y: array-like {n_samples,} (Optional, ignored)
            Labels; unused.
        """

        if (self.columns == "categorical" or self.columns == "numeric") and not isinstance(X, pd.DataFrame):
            raise ValueError(f"Invalid value for columns: {self.columns}. "
                             "Only 'all' or <list> is supported for np arrays")

        if self.columns == "categorical":
            self.columns_ = list(X.select_dtypes(exclude='number').columns)
        elif self.columns == "numeric":
            self.columns_ =  [col for col in X.columns if is_numeric_dtype(X[col])]
        elif self.columns == "auto":
            self.columns_ = auto_select_categorical_features(X)
        elif self.columns == "all":
            if isinstance(X, pd.DataFrame):
                self.columns_ = X.columns
            else:
                self.columns_ = list(range(X.shape[1]))
        elif isinstance(self.columns, list):
            self.columns_ = self.columns
        else:
            raise ValueError(f"Invalid value for columns: {self.columns}")



        if len(self.columns_) == 0:
            return self

        self.enc = sklearn.preprocessing.OneHotEncoder( categories='auto',   
                                                        drop = self.drop,
                                                        handle_unknown = self.handle_unknown,
                                                        sparse_output = self.sparse_output,
                                                        min_frequency = self.min_frequency,
                                                        max_categories = self.max_categories)

        #TODO make this more consistent with sklearn baseimputer/baseencoder
        if isinstance(X, pd.DataFrame):
            self.enc.set_output(transform="pandas")
            for col in X.columns:
                # check if the column name is not a string
                if not isinstance(col, str):
                    # if it's not a string, rename the column with "X" prefix
                    X.rename(columns={col: f"X{col}"}, inplace=True)


        if len(self.columns_) == X.shape[1]:
            X_sel = self.enc.fit(X)
        else:
            X_sel, X_not_sel = _X_selected(X, self.columns_)
            X_sel = self.enc.fit(X_sel)

        return self

    def transform(self, X):
        """Transform X using one-hot encoding.

        Parameters
        ----------
        X : array-like or sparse matrix, shape=(n_samples, n_features)
            Dense array or sparse matrix.

        Returns
        -------
        X_out : sparse matrix if sparse_output=True else a 2-d array
            Transformed input. A DataFrame is returned when X is a DataFrame.
        """


        if len(self.columns_) == 0:
            return X

        #TODO make this more consistent with sklearn baseimputer/baseencoder
        if isinstance(X, pd.DataFrame):
            for col in X.columns:
                # check if the column name is not a string
                if not isinstance(col, str):
                    # if it's not a string, rename the column with "X" prefix
                    X.rename(columns={col: f"X{col}"}, inplace=True)

        if len(self.columns_) == X.shape[1]:
            return self.enc.transform(X)
        else:

            X_sel, X_not_sel= _X_selected(X, self.columns_)
            X_sel = self.enc.transform(X_sel)

            #If X is dataframe
            if isinstance(X, pd.DataFrame):

                X_sel = pd.DataFrame(X_sel, columns=self.enc.get_feature_names_out())
                return pd.concat([X_not_sel.reset_index(drop=True), X_sel.reset_index(drop=True)], axis=1)
            else:
                return np.hstack((X_not_sel, X_sel))

__init__(columns='auto', drop=None, handle_unknown='infrequent_if_exist', sparse_output=False, min_frequency=None, max_categories=None)

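A wrapper for OneHotEncoder that allows one-hot encoding of specific columns in a DataFrame or NumPy array.

Parameters

| Name | Type | Description | Default |
|---|---|---|---|
| columns | str, list | Determines which columns to one-hot encode with sklearn.preprocessing.OneHotEncoder. See the options below. | 'auto' |
| drop | | See sklearn.preprocessing.OneHotEncoder | None |
| handle_unknown | | See sklearn.preprocessing.OneHotEncoder | 'infrequent_if_exist' |
| sparse_output | | See sklearn.preprocessing.OneHotEncoder | False |
| min_frequency | | See sklearn.preprocessing.OneHotEncoder | None |
| max_categories | | See sklearn.preprocessing.OneHotEncoder | None |

Options for columns:
- 'auto' : Automatically select categorical features, treating columns with fewer than 10 unique values as categorical
- 'categorical' : Automatically select categorical features, that is, columns with a non-numeric dtype (DataFrame input only)
- 'numeric' : Automatically select numeric features (DataFrame input only)
- 'all' : Select all features
- list : A list of columns to select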
Source code in tpot/builtin_modules/column_one_hot_encoder.py
def __init__(self, columns='auto', drop=None, handle_unknown='infrequent_if_exist', sparse_output=False, min_frequency=None,max_categories=None):
    '''
    A wrapper for OneHotEncoder that allows for onehot encoding of specific columns in a DataFrame or np array.

    Parameters
    ----------

    columns : str, list, default='auto'
        Determines which columns to onehot encode with sklearn.preprocessing.OneHotEncoder.
        - 'auto' : Automatically select categorical features based on columns with less than 10 unique values
        - 'categorical' : Automatically select categorical features
        - 'numeric' : Automatically select numeric features
        - 'all' : Select all features
        - list : A list of columns to select

    drop, handle_unknown, sparse_output, min_frequency, max_categories : see sklearn.preprocessing.OneHotEncoder

    '''

    self.columns = columns
    self.drop = drop
    self.handle_unknown = handle_unknown
    self.sparse_output = sparse_output
    self.min_frequency = min_frequency
    self.max_categories = max_categories

fit(X, y=None)

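Fit OneHotEncoder to the selected columns of X.

Resolves `columns` into `self.columns_` and fits the underlying encoder on those columns only. X is not transformed; the fitted estimator itself is returned.

Parameters

| Name | Type | Description | Default |
|---|---|---|---|
| X | array-like or sparse matrix, shape=(n_samples, n_features) | Dense array or sparse matrix. | required |
| y | array-like, shape=(n_samples,) | Labels (optional, ignored). | None |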
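The string options for `columns` reduce to simple dtype checks on a DataFrame. The following sketch reproduces the 'categorical', 'numeric', and 'all' rules used in `fit`. The 'auto' line is a hypothetical re-implementation based only on the docstring's "fewer than 10 unique values" description, not TPOT's own helper, and the sample DataFrame is invented.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Toy data (hypothetical).
df = pd.DataFrame({
    "city": ["Oslo", "Lima", "Oslo"],
    "rooms": [2, 3, 4],
    "price": [1.0, 2.5, 3.0],
})

# columns="categorical": every column whose dtype is not numeric.
categorical_cols = list(df.select_dtypes(exclude="number").columns)

# columns="numeric": every column whose dtype is numeric.
numeric_cols = [col for col in df.columns if is_numeric_dtype(df[col])]

# columns="all": every column.
all_cols = list(df.columns)

# columns="auto" (assumed rule): columns with fewer than 10 unique values,
# regardless of dtype.
auto_cols = [col for col in df.columns if df[col].nunique() < 10]

print(categorical_cols, numeric_cols, auto_cols)
```

On a small frame like this one, the assumed 'auto' rule selects the numeric columns as well, because they also have fewer than 10 distinct values.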
Source code in tpot/builtin_modules/column_one_hot_encoder.py
def fit(self, X, y=None):
    """Fit OneHotEncoder to the selected columns of X.

    Resolves `columns` into `self.columns_` and fits the underlying
    encoder on those columns only. X is not transformed; self is returned.

    Parameters
    ----------
    X : array-like or sparse matrix, shape=(n_samples, n_features)
        Dense array or sparse matrix.
    y: array-like {n_samples,} (Optional, ignored)
        Labels; unused.
    """

    if (self.columns == "categorical" or self.columns == "numeric") and not isinstance(X, pd.DataFrame):
        raise ValueError(f"Invalid value for columns: {self.columns}. "
                         "Only 'all' or <list> is supported for np arrays")

    if self.columns == "categorical":
        self.columns_ = list(X.select_dtypes(exclude='number').columns)
    elif self.columns == "numeric":
        self.columns_ =  [col for col in X.columns if is_numeric_dtype(X[col])]
    elif self.columns == "auto":
        self.columns_ = auto_select_categorical_features(X)
    elif self.columns == "all":
        if isinstance(X, pd.DataFrame):
            self.columns_ = X.columns
        else:
            self.columns_ = list(range(X.shape[1]))
    elif isinstance(self.columns, list):
        self.columns_ = self.columns
    else:
        raise ValueError(f"Invalid value for columns: {self.columns}")



    if len(self.columns_) == 0:
        return self

    self.enc = sklearn.preprocessing.OneHotEncoder( categories='auto',   
                                                    drop = self.drop,
                                                    handle_unknown = self.handle_unknown,
                                                    sparse_output = self.sparse_output,
                                                    min_frequency = self.min_frequency,
                                                    max_categories = self.max_categories)

    #TODO make this more consistent with sklearn baseimputer/baseencoder
    if isinstance(X, pd.DataFrame):
        self.enc.set_output(transform="pandas")
        for col in X.columns:
            # check if the column name is not a string
            if not isinstance(col, str):
                # if it's not a string, rename the column with "X" prefix
                X.rename(columns={col: f"X{col}"}, inplace=True)


    if len(self.columns_) == X.shape[1]:
        X_sel = self.enc.fit(X)
    else:
        X_sel, X_not_sel = _X_selected(X, self.columns_)
        X_sel = self.enc.fit(X_sel)

    return self

transform(X)

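Transform X using one-hot encoding.

Parameters

| Name | Type | Description | Default |
|---|---|---|---|
| X | array-like or sparse matrix, shape=(n_samples, n_features) | Dense array or sparse matrix. | required |

Returns

| Name | Type | Description |
|---|---|---|
| X_out | sparse matrix if sparse_output=True, else a 2-d array | Transformed input. A DataFrame is returned when X is a DataFrame. |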
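When only some columns are encoded and X is a DataFrame, `transform` calls `reset_index(drop=True)` on both pieces before concatenating them. The pandas-only sketch below, with invented data, shows why: `pd.concat(..., axis=1)` aligns rows on the index, so pieces with different indexes would otherwise produce extra rows filled with NaN.

```python
import pandas as pd

# Untouched columns keep the caller's index (hypothetical values 10 and 11).
kept = pd.DataFrame({"size": [1.5, 2.0]}, index=[10, 11])

# A freshly built frame of encoded columns gets a default 0..n-1 index.
encoded = pd.DataFrame({"color_blue": [0.0, 1.0], "color_red": [1.0, 0.0]})

# Without resetting, concat takes the union of both indexes.
misaligned = pd.concat([kept, encoded], axis=1)

# Resetting both indexes lines the rows up by position.
aligned = pd.concat(
    [kept.reset_index(drop=True), encoded.reset_index(drop=True)],
    axis=1,
)

print(misaligned.shape, aligned.shape)
```

A consequence is that, on this code path, the returned DataFrame carries a fresh positional index and not the index of the input.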

Source code in tpot/builtin_modules/column_one_hot_encoder.py
def transform(self, X):
    """Transform X using one-hot encoding.

    Parameters
    ----------
    X : array-like or sparse matrix, shape=(n_samples, n_features)
        Dense array or sparse matrix.

    Returns
    -------
    X_out : sparse matrix if sparse_output=True else a 2-d array
        Transformed input. A DataFrame is returned when X is a DataFrame.
    """


    if len(self.columns_) == 0:
        return X

    #TODO make this more consistent with sklearn baseimputer/baseencoder
    if isinstance(X, pd.DataFrame):
        for col in X.columns:
            # check if the column name is not a string
            if not isinstance(col, str):
                # if it's not a string, rename the column with "X" prefix
                X.rename(columns={col: f"X{col}"}, inplace=True)

    if len(self.columns_) == X.shape[1]:
        return self.enc.transform(X)
    else:

        X_sel, X_not_sel= _X_selected(X, self.columns_)
        X_sel = self.enc.transform(X_sel)

        #If X is dataframe
        if isinstance(X, pd.DataFrame):

            X_sel = pd.DataFrame(X_sel, columns=self.enc.get_feature_names_out())
            return pd.concat([X_not_sel.reset_index(drop=True), X_sel.reset_index(drop=True)], axis=1)
        else:
            return np.hstack((X_not_sel, X_sel))

ColumnOrdinalEncoder

Bases: BaseEstimator, TransformerMixin

Source code in tpot/builtin_modules/column_one_hot_encoder.py
class ColumnOrdinalEncoder(BaseEstimator, TransformerMixin):


    def __init__(self, columns='auto', handle_unknown='error', unknown_value = -1, encoded_missing_value = np.nan, min_frequency=None,max_categories=None):
        '''

        Parameters
        ----------

        columns : str, list, default='auto'
            Determines which columns to ordinal encode with sklearn.preprocessing.OrdinalEncoder.
            - 'auto' : Automatically select categorical features based on columns with less than 10 unique values
            - 'categorical' : Automatically select categorical features
            - 'numeric' : Automatically select numeric features
            - 'all' : Select all features
            - list : A list of columns to select

        handle_unknown, unknown_value, encoded_missing_value, min_frequency, max_categories : see sklearn.preprocessing.OrdinalEncoder

        '''

        self.columns = columns
        self.handle_unknown = handle_unknown
        self.unknown_value = unknown_value
        self.encoded_missing_value = encoded_missing_value
        self.min_frequency = min_frequency
        self.max_categories = max_categories



    def fit(self, X, y=None):
        """Fit OrdinalEncoder to the selected columns of X.

        Resolves `columns` into `self.columns_` and fits the underlying
        encoder on those columns only. X is not transformed; self is returned.

        Parameters
        ----------
        X : array-like or sparse matrix, shape=(n_samples, n_features)
            Dense array or sparse matrix.
        y: array-like {n_samples,} (Optional, ignored)
            Labels; unused.
        """

        if (self.columns == "categorical" or self.columns == "numeric") and not isinstance(X, pd.DataFrame):
            raise ValueError(f"Invalid value for columns: {self.columns}. "
                             "Only 'all' or <list> is supported for np arrays")

        if self.columns == "categorical":
            self.columns_ = list(X.select_dtypes(exclude='number').columns)
        elif self.columns == "numeric":
            self.columns_ =  [col for col in X.columns if is_numeric_dtype(X[col])]
        elif self.columns == "auto":
            self.columns_ = auto_select_categorical_features(X)
        elif self.columns == "all":
            if isinstance(X, pd.DataFrame):
                self.columns_ = X.columns
            else:
                self.columns_ = list(range(X.shape[1]))
        elif isinstance(self.columns, list):
            self.columns_ = self.columns
        else:
            raise ValueError(f"Invalid value for columns: {self.columns}")

        if len(self.columns_) == 0:
            return self

        self.enc = sklearn.preprocessing.OrdinalEncoder(categories='auto',   
                                                        handle_unknown = self.handle_unknown,
                                                        unknown_value = self.unknown_value, 
                                                        encoded_missing_value = self.encoded_missing_value,
                                                        min_frequency = self.min_frequency,
                                                        max_categories = self.max_categories)
        #TODO make this more consistent with sklearn baseimputer/baseencoder
        '''
        if isinstance(X, pd.DataFrame):
            self.enc.set_output(transform="pandas")
            for col in X.columns:
                # check if the column name is not a string
                if not isinstance(col, str):
                    # if it's not a string, rename the column with "X" prefix
                    X.rename(columns={col: f"X{col}"}, inplace=True)
        '''

        if len(self.columns_) == X.shape[1]:
            X_sel = self.enc.fit(X)
        else:
            X_sel, X_not_sel = _X_selected(X, self.columns_)
            X_sel = self.enc.fit(X_sel)

        return self

    def transform(self, X):
        """Transform X using ordinal encoding.

        Parameters
        ----------
        X : array-like or sparse matrix, shape=(n_samples, n_features)
            Dense array or sparse matrix.

        Returns
        -------
        X_out : 2-d array, or DataFrame when X is a DataFrame and only some columns are encoded
            Transformed input.
        """


        if len(self.columns_) == 0:
            return X

        #TODO make this more consistent with sklearn baseimputer/baseencoder
        '''
        if isinstance(X, pd.DataFrame):
            for col in X.columns:
                # check if the column name is not a string
                if not isinstance(col, str):
                    # if it's not a string, rename the column with "X" prefix
                    X.rename(columns={col: f"X{col}"}, inplace=True)
        '''

        if len(self.columns_) == X.shape[1]:
            return self.enc.transform(X)
        else:

            X_sel, X_not_sel= _X_selected(X, self.columns_)
            X_sel = self.enc.transform(X_sel)

            #If X is dataframe
            if isinstance(X, pd.DataFrame):

                X_sel = pd.DataFrame(X_sel, columns=self.enc.get_feature_names_out())
                return pd.concat([X_not_sel.reset_index(drop=True), X_sel.reset_index(drop=True)], axis=1)
            else:
                return np.hstack((X_not_sel, X_sel))

__init__(columns='auto', handle_unknown='error', unknown_value=-1, encoded_missing_value=np.nan, min_frequency=None, max_categories=None)

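Parameters

| Name | Type | Description | Default |
|---|---|---|---|
| columns | str, list | Determines which columns to ordinal encode with sklearn.preprocessing.OrdinalEncoder. See the options below. | 'auto' |
| handle_unknown | | See sklearn.preprocessing.OrdinalEncoder | 'error' |
| unknown_value | | See sklearn.preprocessing.OrdinalEncoder | -1 |
| encoded_missing_value | | See sklearn.preprocessing.OrdinalEncoder | np.nan |
| min_frequency | | See sklearn.preprocessing.OrdinalEncoder | None |
| max_categories | | See sklearn.preprocessing.OrdinalEncoder | None |

Options for columns:
- 'auto' : Automatically select categorical features, treating columns with fewer than 10 unique values as categorical
- 'categorical' : Automatically select categorical features, that is, columns with a non-numeric dtype (DataFrame input only)
- 'numeric' : Automatically select numeric features (DataFrame input only)
- 'all' : Select all features
- list : A list of columns to select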
Source code in tpot/builtin_modules/column_one_hot_encoder.py
def __init__(self, columns='auto', handle_unknown='error', unknown_value = -1, encoded_missing_value = np.nan, min_frequency=None,max_categories=None):
    '''

    Parameters
    ----------

    columns : str, list, default='auto'
        Determines which columns to ordinal encode with sklearn.preprocessing.OrdinalEncoder.
        - 'auto' : Automatically select categorical features based on columns with less than 10 unique values
        - 'categorical' : Automatically select categorical features
        - 'numeric' : Automatically select numeric features
        - 'all' : Select all features
        - list : A list of columns to select

    handle_unknown, unknown_value, encoded_missing_value, min_frequency, max_categories : see sklearn.preprocessing.OrdinalEncoder

    '''

    self.columns = columns
    self.handle_unknown = handle_unknown
    self.unknown_value = unknown_value
    self.encoded_missing_value = encoded_missing_value
    self.min_frequency = min_frequency
    self.max_categories = max_categories

fit(X, y=None)

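Fit OrdinalEncoder to the selected columns of X.

Resolves `columns` into `self.columns_` and fits the underlying encoder on those columns only. X is not transformed; the fitted estimator itself is returned.

Parameters

| Name | Type | Description | Default |
|---|---|---|---|
| X | array-like or sparse matrix, shape=(n_samples, n_features) | Dense array or sparse matrix. | required |
| y | array-like, shape=(n_samples,) | Labels (optional, ignored). | None |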
Source code in tpot/builtin_modules/column_one_hot_encoder.py
def fit(self, X, y=None):
    """Fit OrdinalEncoder to the selected columns of X.

    Resolves `columns` into `self.columns_` and fits the underlying
    encoder on those columns only. X is not transformed; self is returned.

    Parameters
    ----------
    X : array-like or sparse matrix, shape=(n_samples, n_features)
        Dense array or sparse matrix.
    y: array-like {n_samples,} (Optional, ignored)
        Labels; unused.
    """

    if (self.columns == "categorical" or self.columns == "numeric") and not isinstance(X, pd.DataFrame):
        raise ValueError(f"Invalid value for columns: {self.columns}. "
                         "Only 'all' or <list> is supported for np arrays")

    if self.columns == "categorical":
        self.columns_ = list(X.select_dtypes(exclude='number').columns)
    elif self.columns == "numeric":
        self.columns_ =  [col for col in X.columns if is_numeric_dtype(X[col])]
    elif self.columns == "auto":
        self.columns_ = auto_select_categorical_features(X)
    elif self.columns == "all":
        if isinstance(X, pd.DataFrame):
            self.columns_ = X.columns
        else:
            self.columns_ = list(range(X.shape[1]))
    elif isinstance(self.columns, list):
        self.columns_ = self.columns
    else:
        raise ValueError(f"Invalid value for columns: {self.columns}")

    if len(self.columns_) == 0:
        return self

    self.enc = sklearn.preprocessing.OrdinalEncoder(categories='auto',   
                                                    handle_unknown = self.handle_unknown,
                                                    unknown_value = self.unknown_value, 
                                                    encoded_missing_value = self.encoded_missing_value,
                                                    min_frequency = self.min_frequency,
                                                    max_categories = self.max_categories)
    #TODO make this more consistent with sklearn baseimputer/baseencoder
    '''
    if isinstance(X, pd.DataFrame):
        self.enc.set_output(transform="pandas")
        for col in X.columns:
            # check if the column name is not a string
            if not isinstance(col, str):
                # if it's not a string, rename the column with "X" prefix
                X.rename(columns={col: f"X{col}"}, inplace=True)
    '''

    if len(self.columns_) == X.shape[1]:
        X_sel = self.enc.fit(X)
    else:
        X_sel, X_not_sel = _X_selected(X, self.columns_)
        X_sel = self.enc.fit(X_sel)

    return self

transform(X)

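Transform X using ordinal encoding.

Parameters

| Name | Type | Description | Default |
|---|---|---|---|
| X | array-like or sparse matrix, shape=(n_samples, n_features) | Dense array or sparse matrix. | required |

Returns

| Name | Type | Description |
|---|---|---|
| X_out | 2-d array, or DataFrame when X is a DataFrame and only some columns are encoded | Transformed input. |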
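For NumPy input, `transform` encodes the selected columns and stacks them to the right of the untouched ones with `np.hstack`. The sketch below illustrates that path. `split_columns` is a hypothetical stand-in for the private helper `_X_selected`, whose implementation is not shown on this page, and the array values are invented.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder


def split_columns(X, selected):
    """Hypothetical stand-in for _X_selected: return (selected, not selected)."""
    mask = np.zeros(X.shape[1], dtype=bool)
    mask[selected] = True
    return X[:, mask], X[:, ~mask]


# Column 0 holds category-like codes, column 1 is a continuous feature.
X = np.array([[10.0, 0.5], [30.0, 1.5], [20.0, 2.5], [10.0, 3.5]])

X_sel, X_not_sel = split_columns(X, [0])
enc = OrdinalEncoder().fit(X_sel)

# Untouched columns first, encoded columns appended on the right.
out = np.hstack((X_not_sel, enc.transform(X_sel)))
print(out)
```

Because the encoded block is appended at the end, column 0 of the input becomes the last column of the output.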

Source code in tpot/builtin_modules/column_one_hot_encoder.py
def transform(self, X):
    """Transform X using ordinal encoding.

    Parameters
    ----------
    X : array-like or sparse matrix, shape=(n_samples, n_features)
        Dense array or sparse matrix.

    Returns
    -------
    X_out : 2-d array, or DataFrame when X is a DataFrame and only some columns are encoded
        Transformed input.
    """


    if len(self.columns_) == 0:
        return X

    #TODO make this more consistent with sklearn baseimputer/baseencoder
    '''
    if isinstance(X, pd.DataFrame):
        for col in X.columns:
            # check if the column name is not a string
            if not isinstance(col, str):
                # if it's not a string, rename the column with "X" prefix
                X.rename(columns={col: f"X{col}"}, inplace=True)
    '''

    if len(self.columns_) == X.shape[1]:
        return self.enc.transform(X)
    else:

        X_sel, X_not_sel= _X_selected(X, self.columns_)
        X_sel = self.enc.transform(X_sel)

        #If X is dataframe
        if isinstance(X, pd.DataFrame):

            X_sel = pd.DataFrame(X_sel, columns=self.enc.get_feature_names_out())
            return pd.concat([X_not_sel.reset_index(drop=True), X_sel.reset_index(drop=True)], axis=1)
        else:
            return np.hstack((X_not_sel, X_sel))