
[Big Data Analysis Engineer Practical Exam] Computing the Pearson Correlation Coefficient

2024. 11. 30. 19:03


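Computing the Pearson Correlation Coefficient

Introduction

This post summarizes how to compute the Pearson correlation coefficient.
A problem asking for the Pearson correlation coefficient appeared as a Type 3 question in the 9th exam (2024).

Pearson Correlation Coefficient

Concept

A statistical measure of the strength and direction of the linear relationship between two variables.
It takes values between -1 and 1.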

$r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \cdot \sum (y_i - \bar{y})^2}}$
☑️ $x_i, y_i$ : the individual data values
☑️ $\bar{x}, \bar{y}$ : the means of $x$ and $y$, respectively
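
The formula can be evaluated term by term with NumPy. The following is a minimal sketch: the helper name `pearson_r` is hypothetical, and the sample data are borrowed from the SciPy `pearsonr` documentation rather than from the exam.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson r computed directly from the definition (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()   # x_i - x_bar
    dy = y - y.mean()   # y_i - y_bar
    numerator = np.sum(dx * dy)
    denominator = np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))
    return numerator / denominator

# Sample data taken from the SciPy documentation example
x = [1, 2, 3, 4, 5, 6, 7]
y = [10, 9, 2.5, 6, 4, 3, 2]

r = pearson_r(x, y)
print(round(r, 4))
```

The result should agree with `np.corrcoef(x, y)[0, 1]`, which is a quick way to check a hand-written implementation.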

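Characteristics

It is meaningful only when the relationship between the two variables is linear (for example, a straight-line relationship).
It is unaffected by the units (scale) of the variables: switching between cm and m, for example, gives the same value.
Its value is bounded between -1 and 1.
It measures only linear association, so it is not suitable for non-linear relationships.
- Example: for a curved relationship such as $y = x^{2}$, the Pearson correlation coefficient can come out low (close to 0 when the $x$ values are symmetric around 0).
It is sensitive to outliers, which can distort its value.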
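
These characteristics can be checked numerically. The sketch below uses made-up data (the height/weight values and the outlier value are assumptions chosen only for illustration) to show unit invariance, the $y = x^{2}$ case, and the effect of a single outlier.

```python
import numpy as np
from scipy.stats import pearsonr

# 1) Changing units (cm -> m) leaves r unchanged.
height_cm = np.array([150.0, 160.0, 165.0, 172.0, 180.0])  # illustrative data
weight_kg = np.array([50.0, 58.0, 63.0, 70.0, 77.0])       # illustrative data
r_cm, _ = pearsonr(height_cm, weight_kg)
r_m, _ = pearsonr(height_cm / 100.0, weight_kg)

# 2) A perfect but non-linear relation: y = x^2 with x symmetric around 0.
x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
r_quadratic, _ = pearsonr(x, x ** 2)

# 3) A single outlier distorts an otherwise perfect linear relation.
base_x = np.arange(1.0, 11.0)
base_y = 2.0 * base_x
outlier_y = base_y.copy()
outlier_y[-1] = -50.0  # replace the last point with an extreme value
r_clean, _ = pearsonr(base_x, base_y)
r_outlier, _ = pearsonr(base_x, outlier_y)

print(f"cm vs m: {r_cm:.6f} {r_m:.6f}")
print(f"y = x^2: {r_quadratic:.6f}")
print(f"clean vs outlier: {r_clean:.6f} {r_outlier:.6f}")
```

In this particular data set, the one outlier is enough to flip the sign of the coefficient.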

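Interpretation

Value	Description
+1	- Perfect positive linear relationship between the two variables - As one variable increases, the other increases at a constant rate - Example of a positive relationship: study time and exam score (real data rarely reach exactly +1)
0	- No linear relationship between the two variables - Example given: car speed and driving time (assuming trip distance varies freely; over a fixed distance the two are negatively related)
-1	- Perfect negative linear relationship between the two variables - As one variable increases, the other decreases at a constant rate - Example of a negative relationship: exercise time and coffee consumption (real data rarely reach exactly -1)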
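
The three reference values in the table can be reproduced with synthetic data. This is only a sketch: the lines $y = 2x + 1$ and $y = -3x + 5$ and the V-shaped data are assumptions chosen to hit +1, -1, and 0.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Exact increasing line: r should be +1 (up to floating-point error).
r_pos = np.corrcoef(x, 2.0 * x + 1.0)[0, 1]

# Exact decreasing line: r should be -1 (up to floating-point error).
r_neg = np.corrcoef(x, -3.0 * x + 5.0)[0, 1]

# Symmetric "V" shape: y depends on x, yet there is no linear trend.
centered = x - 3.0
r_zero = np.corrcoef(centered, np.abs(centered))[0, 1]

print(r_pos, r_neg, r_zero)
```

The last case is a reminder that r near 0 rules out a linear relationship only, not dependence in general.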

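Methods

1️⃣ Using the pearsonr function in the scipy.stats module

The pearsonr function in the scipy.stats module computes the Pearson correlation coefficient.
It returns both the correlation coefficient and the p-value (significance probability).

Help output
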
Help on function pearsonr in module scipy.stats._stats_py:
 
pearsonr(x, y, *, alternative='two-sided', method=None, axis=0)
    Pearson correlation coefficient and p-value for testing non-correlation.
 
    The Pearson correlation coefficient [1]_ measures the linear relationship
    between two datasets. Like other correlation
    coefficients, this one varies between -1 and +1 with 0 implying no
    correlation. Correlations of -1 or +1 imply an exact linear relationship.
    Positive correlations imply that as x increases, so does y. Negative
    correlations imply that as x increases, y decreases.
 
    This function also performs a test of the null hypothesis that the
    distributions underlying the samples are uncorrelated and normally
    distributed. (See Kowalski [3]_
    for a discussion of the effects of non-normality of the input on the
    distribution of the correlation coefficient.)
    The p-value roughly indicates the probability of an uncorrelated system
    producing datasets that have a Pearson correlation at least as extreme
    as the one computed from these datasets.
 
    Parameters
    ----------
    x : array_like
        Input array.
    y : array_like
        Input array.
    axis : int or None, default
        Axis along which to perform the calculation. Default is 0.
        If None, ravel both arrays before performing the calculation.
 
        .. versionadded:: 1.13.0
    alternative : {'two-sided', 'greater', 'less'}, optional
        Defines the alternative hypothesis. Default is 'two-sided'.
        The following options are available:
 
        * 'two-sided': the correlation is nonzero
        * 'less': the correlation is negative (less than zero)
        * 'greater':  the correlation is positive (greater than zero)
 
        .. versionadded:: 1.9.0
    method : ResamplingMethod, optional
        Defines the method used to compute the p-value. If `method` is an
        instance of `PermutationMethod`/`MonteCarloMethod`, the p-value is
        computed using
        `scipy.stats.permutation_test`/`scipy.stats.monte_carlo_test` with the
        provided configuration options and other appropriate settings.
        Otherwise, the p-value is computed as documented in the notes.
 
        .. versionadded:: 1.11.0
 
    Returns
    -------
    result : `~scipy.stats._result_classes.PearsonRResult`
        An object with the following attributes:
 
        statistic : float
            Pearson product-moment correlation coefficient.
        pvalue : float
            The p-value associated with the chosen alternative.
 
        The object has the following method:
 
        confidence_interval(confidence_level, method)
            This computes the confidence interval of the correlation
            coefficient `statistic` for the given confidence level.
            The confidence interval is returned in a ``namedtuple`` with
            fields `low` and `high`. If `method` is not provided, the
            confidence interval is computed using the Fisher transformation
            [1]_. If `method` is an instance of `BootstrapMethod`, the
            confidence interval is computed using `scipy.stats.bootstrap` with
            the provided configuration options and other appropriate settings.
            In some cases, confidence limits may be NaN due to a degenerate
            resample, and this is typical for very small samples (~6
            observations).
 
    Warns
    -----
    `~scipy.stats.ConstantInputWarning`
        Raised if an input is a constant array.  The correlation coefficient
        is not defined in this case, so ``np.nan`` is returned.
 
    `~scipy.stats.NearConstantInputWarning`
        Raised if an input is "nearly" constant.  The array ``x`` is considered
        nearly constant if ``norm(x - mean(x)) < 1e-13 * abs(mean(x))``.
        Numerical errors in the calculation ``x - mean(x)`` in this case might
        result in an inaccurate calculation of r.
 
    See Also
    --------
    spearmanr : Spearman rank-order correlation coefficient.
    kendalltau : Kendall's tau, a correlation measure for ordinal data.
 
    Notes
    -----
    The correlation coefficient is calculated as follows:
 
    .. math::
 
        r = \frac{\sum (x - m_x) (y - m_y)}
                 {\sqrt{\sum (x - m_x)^2 \sum (y - m_y)^2}}
 
    where :math:`m_x` is the mean of the vector x and :math:`m_y` is
    the mean of the vector y.
 
    Under the assumption that x and y are drawn from
    independent normal distributions (so the population correlation coefficient
    is 0), the probability density function of the sample correlation
    coefficient r is ([1]_, [2]_):
 
    .. math::
        f(r) = \frac{{(1-r^2)}^{n/2-2}}{\mathrm{B}(\frac{1}{2},\frac{n}{2}-1)}
 
    where n is the number of samples, and B is the beta function.  This
    is sometimes referred to as the exact distribution of r.  This is
    the distribution that is used in `pearsonr` to compute the p-value when
    the `method` parameter is left at its default value (None).
    The distribution is a beta distribution on the interval [-1, 1],
    with equal shape parameters a = b = n/2 - 1.  In terms of SciPy's
    implementation of the beta distribution, the distribution of r is::
 
        dist = scipy.stats.beta(n/2 - 1, n/2 - 1, loc=-1, scale=2)
 
    The default p-value returned by `pearsonr` is a two-sided p-value. For a
    given sample with correlation coefficient r, the p-value is
    the probability that abs(r') of a random sample x' and y' drawn from
    the population with zero correlation would be greater than or equal
    to abs(r). In terms of the object ``dist`` shown above, the p-value
    for a given r and length n can be computed as::
 
        p = 2*dist.cdf(-abs(r))
 
    When n is 2, the above continuous distribution is not well-defined.
    One can interpret the limit of the beta distribution as the shape
    parameters a and b approach a = b = 0 as a discrete distribution with
    equal probability masses at r = 1 and r = -1.  More directly, one
    can observe that, given the data x = [x1, x2] and y = [y1, y2], and
    assuming x1 != x2 and y1 != y2, the only possible values for r are 1
    and -1.  Because abs(r') for any sample x' and y' with length 2 will
    be 1, the two-sided p-value for a sample of length 2 is always 1.
 
    For backwards compatibility, the object that is returned also behaves
    like a tuple of length two that holds the statistic and the p-value.
 
    References
    ----------
    .. [1] "Pearson correlation coefficient", Wikipedia,
           https://en.wikipedia.org/wiki/Pearson_correlation_coefficient
    .. [2] Student, "Probable error of a correlation coefficient",
           Biometrika, Volume 6, Issue 2-3, 1 September 1908, pp. 302-310.
    .. [3] C. J. Kowalski, "On the Effects of Non-Normality on the Distribution
           of the Sample Product-Moment Correlation Coefficient"
           Journal of the Royal Statistical Society. Series C (Applied
           Statistics), Vol. 21, No. 1 (1972), pp. 1-12.
 
    Examples
    --------
    >>> import numpy as np
    >>> from scipy import stats
    >>> x, y = [1, 2, 3, 4, 5, 6, 7], [10, 9, 2.5, 6, 4, 3, 2]
    >>> res = stats.pearsonr(x, y)
    >>> res
    PearsonRResult(statistic=-0.828503883588428, pvalue=0.021280260007523286)
 
    To perform an exact permutation version of the test:
 
    >>> rng = np.random.default_rng(7796654889291491997)
    >>> method = stats.PermutationMethod(n_resamples=np.inf, random_state=rng)
    >>> stats.pearsonr(x, y, method=method)
    PearsonRResult(statistic=-0.828503883588428, pvalue=0.028174603174603175)
 
    To perform the test under the null hypothesis that the data were drawn from
    *uniform* distributions:
 
    >>> method = stats.MonteCarloMethod(rvs=(rng.uniform, rng.uniform))
    >>> stats.pearsonr(x, y, method=method)
    PearsonRResult(statistic=-0.828503883588428, pvalue=0.0188)
 
    To produce an asymptotic 90% confidence interval:
 
    >>> res.confidence_interval(confidence_level=0.9)
    ConfidenceInterval(low=-0.9644331982722841, high=-0.3460237473272273)
 
    And for a bootstrap confidence interval:
 
    >>> method = stats.BootstrapMethod(method='BCa', random_state=rng)
    >>> res.confidence_interval(confidence_level=0.9, method=method)
    ConfidenceInterval(low=-0.9983163756488651, high=-0.22771001702132443)  # may vary
 
    If N-dimensional arrays are provided, multiple tests are performed in a
    single call according to the same conventions as most `scipy.stats` functions:
 
    >>> rng = np.random.default_rng(2348246935601934321)
    >>> x = rng.standard_normal((8, 15))
    >>> y = rng.standard_normal((8, 15))
    >>> stats.pearsonr(x, y, axis=0).statistic.shape  # between corresponding columns
    (15,)
    >>> stats.pearsonr(x, y, axis=1).statistic.shape  # between corresponding rows
    (8,)
 
    To perform all pairwise comparisons between slices of the arrays,
    use standard NumPy broadcasting techniques. For instance, to compute the
    correlation between all pairs of rows:
 
    >>> stats.pearsonr(x[:, np.newaxis, :], y, axis=-1).statistic.shape
    (8, 8)
 
    There is a linear dependence between x and y if y = a + b*x + e, where
    a,b are constants and e is a random error term, assumed to be independent
    of x. For simplicity, assume that x is standard normal, a=0, b=1 and let
    e follow a normal distribution with mean zero and standard deviation s>0.
 
    >>> rng = np.random.default_rng()
    >>> s = 0.5
    >>> x = stats.norm.rvs(size=500, random_state=rng)
    >>> e = stats.norm.rvs(scale=s, size=500, random_state=rng)
    >>> y = x + e
    >>> stats.pearsonr(x, y).statistic
    0.9001942438244763
 
    This should be close to the exact value given by
 
    >>> 1/np.sqrt(1 + s**2)
    0.8944271909999159
 
    For s=0.5, we observe a high level of correlation. In general, a large
    variance of the noise reduces the correlation, while the correlation
    approaches one as the variance of the error goes to zero.
 
    It is important to keep in mind that no correlation does not imply
    independence unless (x, y) is jointly normal. Correlation can even be zero
    when there is a very simple dependence structure: if X follows a
    standard normal distribution, let y = abs(x). Note that the correlation
    between x and y is zero. Indeed, since the expectation of x is zero,
    cov(x, y) = E[x*y]. By definition, this equals E[x*abs(x)] which is zero
    by symmetry. The following lines of code illustrate this observation:
 
    >>> y = np.abs(x)
    >>> stats.pearsonr(x, y)
    PearsonRResult(statistic=-0.05444919272687482, pvalue=0.22422294836207743)
 
    A non-zero correlation coefficient can be misleading. For example, if X has
    a standard normal distribution, define y = x if x < 0 and y = 0 otherwise.
    A simple calculation shows that corr(x, y) = sqrt(2/Pi) = 0.797...,
    implying a high level of correlation:
 
    >>> y = np.where(x < 0, x, 0)
    >>> stats.pearsonr(x, y)
    PearsonRResult(statistic=0.861985781588, pvalue=4.813432002751103e-149)
 
    This is unintuitive since there is no dependence of x and y if x is larger
    than zero which happens in about half of the cases if we sample x and y.
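
The Notes section of the help text describes how the default two-sided p-value is obtained from a beta distribution. The sketch below follows that description with the documentation's sample data; it is meant as a check of the documented formula, not as a copy of SciPy's internal implementation.

```python
from scipy import stats

# Sample data from the documentation example
x = [1, 2, 3, 4, 5, 6, 7]
y = [10, 9, 2.5, 6, 4, 3, 2]
n = len(x)

r, p_scipy = stats.pearsonr(x, y)

# Null distribution of r for n samples, as written in the Notes section.
dist = stats.beta(n / 2 - 1, n / 2 - 1, loc=-1, scale=2)

# Two-sided p-value: probability of |r'| >= |r| under zero correlation.
p_manual = 2 * dist.cdf(-abs(r))

print(p_scipy, p_manual)
```

The two values should agree to within floating-point error when `method` is left at its default.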

Example code

from scipy.stats import pearsonr

x = [10, 20, 30, 40, 50]
y = [15, 25, 35, 45, 55]

# Compute the Pearson correlation coefficient and its p-value
corr, p_value = pearsonr(x, y)

print(f"Pearson correlation coefficient: {corr}")
print(f"P-value: {p_value}")

""" Output
Pearson correlation coefficient: 1.0
P-value: 0.0
"""

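2️⃣ Using the `corr` method of the pandas package

This approach uses the pandas corr method (here Series.corr), whose default method is Pearson.
It returns only the Pearson correlation coefficient; no p-value is reported.

Example code
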
import pandas as pd

df = pd.DataFrame({
    'X': [10, 20, 30, 40, 50],
    'Y': [15, 25, 35, 45, 55]
})

corr = df['X'].corr(df['Y'])    # Pearson correlation coefficient (default method)
print(f"Pearson correlation coefficient: {corr}")

""" Output
Pearson correlation coefficient: 1.0
"""

References

Pearson correlation coefficient (피어슨 상관 계수) - Korean Wikipedia, ko.wikipedia.org
pearsonr - SciPy v1.14.1 Manual, docs.scipy.org


License: Attribution, NonCommercial, NoDerivatives
