Cross-validation utilities

This file is part of the TPOT library.

The current version of TPOT was developed at Cedars-Sinai by:

- Pedro Henrique Ribeiro (https://github.com/perib, https://www.linkedin.com/in/pedro-ribeiro/)
- Anil Saini (anil.saini@cshs.org)
- Jose Hernandez (jgh9094@gmail.com)
- Jay Moran (jay.moran@cshs.org)
- Nicholas Matsumoto (nicholas.matsumoto@cshs.org)
- Hyunjun Choi (hyunjun.choi@cshs.org)
- Gabriel Ketron (gabriel.ketron@cshs.org)
- Miguel E. Hernandez (miguel.e.hernandez@cshs.org)
- Jason Moore (moorejh28@gmail.com)

The original version of TPOT was primarily developed at the University of Pennsylvania by:

- Randal S. Olson (rso@randalolson.com)
- Weixuan Fu (weixuanf@upenn.edu)
- Daniel Angell (dpa34@drexel.edu)
- Jason Moore (moorejh28@gmail.com)
- and many more generous open-source contributors

TPOT is free software: you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

TPOT is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.

You should have received a copy of the GNU Lesser General Public License along with TPOT. If not, see https://www.gnu.org/licenses/.

cross_val_score_objective(estimator, X, y, scorers, cv, fold=None)

Compute the cross-validated scores for an estimator. The estimator is fit only once per fold, and the scorers are looped over to evaluate it.

Parameters

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `estimator` | `sklearn.base.BaseEstimator` | The estimator to fit and score. | required |
| `X` | `np.ndarray` or `pd.DataFrame` | The feature matrix. | required |
| `y` | `np.ndarray` or `pd.Series` | The target vector. | required |
| `scorers` | list or scorer | The scorers to use. If a list, the scorers are looped over and a list of scores is returned. If a single scorer, a single score is returned. | required |
| `cv` | sklearn cross-validator | The cross-validator to use, for example `sklearn.model_selection.KFold` or `sklearn.model_selection.StratifiedKFold`. | required |
| `fold` | `int`, optional | The fold to return the scores for. If `None`, the mean of all the scores (per scorer) is returned. Default is `None`. | `None` |

Returns

| Name | Type | Description |
| --- | --- | --- |
| `scores` | `np.ndarray` or `float` | The scores for the estimator, per scorer. If `fold` is `None`, the mean of all the scores (per scorer) is returned. A list is returned if multiple scorers are used, otherwise a float for the single scorer. |
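As a usage sketch (not part of the TPOT documentation itself): assuming the function can be imported from `tpot.tpot_estimator.cross_val_utils` (the path shown under "Source code" below) and given an arbitrary scikit-learn estimator, cross-validator, and list of scorer names, a call with `fold=None` returns the per-scorer mean over all folds.

```python
# Minimal usage sketch; the import path, estimator, and dataset are
# illustrative assumptions, not part of the documentation above.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

from tpot.tpot_estimator.cross_val_utils import cross_val_score_objective

X, y = load_breast_cancer(return_X_y=True)
estimator = LogisticRegression(max_iter=1000)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# The estimator is cloned and fit once per fold; both scorers are then
# evaluated on the same fitted model for that fold.
mean_scores = cross_val_score_objective(
    estimator, X, y, scorers=["accuracy", "roc_auc"], cv=cv, fold=None
)
print(mean_scores)  # one mean value per scorer, averaged over the 5 folds
```
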

Source code in tpot/tpot_estimator/cross_val_utils.py
# Imports needed to run this function standalone (added for completeness;
# they are not shown in the original excerpt):
import time
from collections.abc import Iterable

import numpy as np
import pandas as pd
import sklearn.base
import sklearn.metrics


def cross_val_score_objective(estimator, X, y, scorers, cv, fold=None):
    """
    Compute the cross validated scores for a estimator. Only fits the estimator once per fold, and loops over the scorers to evaluate the estimator.

    Parameters
    ----------
    estimator: sklearn.base.BaseEstimator
        The estimator to fit and score.
    X: np.ndarray or pd.DataFrame
        The feature matrix.
    y: np.ndarray or pd.Series
        The target vector.
    scorers: list or scorer
        The scorers to use. 
        If a list, will loop over the scorers and return a list of scores.
        If a single scorer, will return a single score.
    cv: sklearn cross-validator
        The cross-validator to use. For example, sklearn.model_selection.KFold or sklearn.model_selection.StratifiedKFold.
    fold: int, optional
        The fold to return the scores for. If None, will return the mean of all the scores (per scorer). Default is None.

    Returns
    -------
    scores: np.ndarray or float
        The scores for the estimator per scorer. If fold is None, will return the mean of all the scores (per scorer).
        Returns a list if multiple scorers are used, otherwise returns a float for the single scorer.

    """

    # Wrap a single scorer in a list so the code below can always iterate over scorers.
    if not isinstance(scorers, Iterable): 
        scorers = [scorers]
    scores = []
    if fold is None:
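        # No specific fold requested: fit and score on every fold, then average per scorer.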
        for train_index, test_index in cv.split(X, y):
            this_fold_estimator = sklearn.base.clone(estimator)
            if isinstance(X, pd.DataFrame) or isinstance(X, pd.Series):
                X_train, X_test = X.iloc[train_index], X.iloc[test_index]
            else:
                X_train, X_test = X[train_index], X[test_index]

            if isinstance(y, pd.DataFrame) or isinstance(y, pd.Series):
                y_train, y_test = y.iloc[train_index], y.iloc[test_index]
            else:
                y_train, y_test = y[train_index], y[test_index]


            start = time.time()
            this_fold_estimator.fit(X_train,y_train)
            duration = time.time() - start

            this_fold_scores = [sklearn.metrics.get_scorer(scorer)(this_fold_estimator, X_test, y_test) for scorer in scorers] 
            scores.append(this_fold_scores)
            del this_fold_estimator
            del X_train
            del X_test
            del y_train
            del y_test


        return np.mean(scores,0)
    else:
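        # A specific fold was requested: fit and score on that fold only.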
        this_fold_estimator = sklearn.base.clone(estimator)
        train_index, test_index = list(cv.split(X, y))[fold]
        if isinstance(X, pd.DataFrame) or isinstance(X, pd.Series):
            X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        else:
            X_train, X_test = X[train_index], X[test_index]

        if isinstance(y, pd.DataFrame) or isinstance(y, pd.Series):
            y_train, y_test = y.iloc[train_index], y.iloc[test_index]
        else:
            y_train, y_test = y[train_index], y[test_index]

        start = time.time()
        this_fold_estimator.fit(X_train,y_train)
        duration = time.time() - start
        this_fold_scores = [sklearn.metrics.get_scorer(scorer)(this_fold_estimator, X_test, y_test) for scorer in scorers] 
        return this_fold_scores
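
Continuing the sketch above (reusing `estimator`, `X`, `y`, and `cv`), the `fold` argument can be used to score one fold at a time, for example when folds are evaluated separately or spread across workers; this is an illustrative assumption about how a caller might use it, not something stated in the listing.

```python
import numpy as np

# Hypothetical per-fold use of the same function; fold indices follow the
# order produced by cv.split(X, y).
fold_scores = [
    cross_val_score_objective(
        estimator, X, y, scorers=["accuracy", "roc_auc"], cv=cv, fold=k
    )
    for k in range(cv.get_n_splits())
]

# Each entry holds one score per scorer for that fold; averaging over folds
# reproduces the fold=None result for a deterministic estimator.
print(np.mean(fold_scores, axis=0))
```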