
The corr() function and the numeric_only option

Introduction

  • Starting with Pandas 2.0.0, the default value of the @numeric_only@ option of the @corr@ function changed to @False@.
  • This post summarizes what changed and how to deal with it.

 

Explanation

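  • Starting with Pandas 2.0.0, the default value of the @numeric_only@ option of the @corr@ function changed to @False@.
    • In earlier versions, non-numeric columns were dropped by default (the behavior of @numeric_only=True@), so the option did not need to be passed explicitly.
  • Therefore, from the 9th session of the Big Data Analysis Engineer (빅데이터분석기사) practical exam onward, which runs Pandas 2.0.0 or later, you must pass @numeric_only=True@ when calling @corr@ on a DataFrame that contains non-numeric columns.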
import pandas as pd

df = pd.read_csv("data/Titanic.csv")

corr_table = df.corr(numeric_only=True)   # pass numeric_only=True so non-numeric columns are skipped
print(corr_table)
              PassengerId  Survived    Pclass  ...     SibSp     Parch      Fare
PassengerId     1.000000 -0.005007 -0.035144  ... -0.057527 -0.001652  0.012658
Survived       -0.005007  1.000000 -0.338481  ... -0.035322  0.081629  0.257307
Pclass         -0.035144 -0.338481  1.000000  ...  0.083081  0.018443 -0.549500
Age             0.036847 -0.077221 -0.369226  ... -0.308247 -0.189119  0.096067
SibSp          -0.057527 -0.035322  0.083081  ...  1.000000  0.414838  0.159651
Parch          -0.001652  0.081629  0.018443  ...  0.414838  1.000000  0.216225
Fare            0.012658  0.257307 -0.549500  ...  0.159651  0.216225  1.000000

[7 rows x 7 columns]

 

  • If the option is omitted, an error like the following is raised.
Makefile:6: recipe for target 'py3_run' failed
make: *** [py3_run] Error 1
Traceback (most recent call last):
  File "/goorm/Main.out", line 11, in <module>
    corr_table = df.corr()
                 ^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/pandas/core/frame.py", line 11049, in corr
    mat = data.to_numpy(dtype=float, na_value=np.nan, copy=False)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/pandas/core/frame.py", line 1993, in to_numpy
    result = self._mgr.as_array(dtype=dtype, copy=copy, na_value=na_value)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/pandas/core/internals/managers.py", line 1694, in as_array
    arr = self._interleave(dtype=dtype, na_value=na_value)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/pandas/core/internals/managers.py", line 1753, in _interleave
    result[rl.indexer] = arr
    ~~~~~~^^^^^^^^^^^^
ValueError: could not convert string to float: 'Braund, Mr. Owen Harris'
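
  • The same failure, and two ways around it, can be sketched on a small in-memory DataFrame. The column names and values below are made up for illustration and are not taken from Titanic.csv; the sketch assumes Pandas 2.0.0 or later.

```python
import pandas as pd

# Hypothetical toy data: one string column plus two numeric columns.
df = pd.DataFrame({
    "Name": ["Braund", "Cumings", "Heikkinen", "Futrelle"],
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.93, 53.10],
})

# Option 1: let corr() skip the non-numeric columns itself.
corr_a = df.corr(numeric_only=True)

# Option 2: keep only numeric columns first, then call corr() with its defaults.
corr_b = df.select_dtypes(include="number").corr()

print(corr_a)
print(corr_a.equals(corr_b))

# With the string column still present, corr() without the option
# is expected to raise ValueError on Pandas 2.0.0 or later.
try:
    df.corr()
except ValueError as exc:
    print("ValueError:", exc)
```

  • Both options give the same table here because the toy data has no boolean columns. @numeric_only=True@ also keeps boolean columns, whereas @select_dtypes(include="number")@ leaves them out, so the two can differ on other data.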

 

  • If you forget the option name @numeric_only@, use @help@ as shown below to check the signature, the parameter descriptions, and the usage examples.
import pandas as pd

help(pd.DataFrame.corr)
Help on function corr in module pandas.core.frame:

corr(self, method: 'CorrelationMethod' = 'pearson', min_periods: 'int' = 1, numeric_only: 'bool' = False) -> 'DataFrame'
    Compute pairwise correlation of columns, excluding NA/null values.

    Parameters
    ----------
    method : {'pearson', 'kendall', 'spearman'} or callable
        Method of correlation:

        * pearson : standard correlation coefficient
        * kendall : Kendall Tau correlation coefficient
        * spearman : Spearman rank correlation
        * callable: callable with input two 1d ndarrays
            and returning a float. Note that the returned matrix from corr
            will have 1 along the diagonals and will be symmetric
            regardless of the callable's behavior.
    min_periods : int, optional
        Minimum number of observations required per pair of columns
        to have a valid result. Currently only available for Pearson
        and Spearman correlation.
    numeric_only : bool, default False
        Include only `float`, `int` or `boolean` data.

        .. versionadded:: 1.5.0

        .. versionchanged:: 2.0.0
            The default value of ``numeric_only`` is now ``False``.

    Returns
    -------
    DataFrame
        Correlation matrix.

    See Also
    --------
    DataFrame.corrwith : Compute pairwise correlation with another
        DataFrame or Series.
    Series.corr : Compute the correlation between two Series.

    Notes
    -----
    Pearson, Kendall and Spearman correlation are currently computed using pairwise complete observations.

    * `Pearson correlation coefficient <https://en.wikipedia.org/wiki/Pearson_correlation_coefficient>`_
    * `Kendall rank correlation coefficient <https://en.wikipedia.org/wiki/Kendall_rank_correlation_coefficient>`_
    * `Spearman's rank correlation coefficient <https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient>`_

    Examples
    --------
    >>> def histogram_intersection(a, b):
    ...     v = np.minimum(a, b).sum().round(decimals=1)
    ...     return v
    >>> df = pd.DataFrame([(.2, .3), (.0, .6), (.6, .0), (.2, .1)],
    ...                   columns=['dogs', 'cats'])
    >>> df.corr(method=histogram_intersection)
          dogs  cats
    dogs   1.0   0.3
    cats   0.3   1.0

    >>> df = pd.DataFrame([(1, 1), (2, np.nan), (np.nan, 3), (4, 4)],
    ...                   columns=['dogs', 'cats'])
    >>> df.corr(min_periods=3)
          dogs  cats
    dogs   1.0   NaN
    cats   NaN   1.0
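
  • When the full @help@ output is more than you need, the parameter names and defaults can also be read from the function signature. This is only a sketch using the standard-library @inspect@ module; what it prints depends on the installed Pandas version.

```python
import inspect

import pandas as pd

# Which Pandas version is installed in this environment?
print(pd.__version__)

# Read the parameters of DataFrame.corr straight from its signature.
params = inspect.signature(pd.DataFrame.corr).parameters
print(list(params))

# numeric_only only exists as a parameter on newer versions, so check first.
if "numeric_only" in params:
    print("numeric_only default:", params["numeric_only"].default)
```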

 

References

 

pandas.DataFrame.corr — pandas 2.2.3 documentation

Include only float, int or boolean data. Changed in version 2.0.0: The default value of numeric_only is now False.

pandas.pydata.org

 
