Machine Learning: Generative Models

Reading time: ~25 min

Kernel density estimation and quadratic discriminant analysis are generative models, meaning that they are built using an estimate of the probability distribution of the data-generating process. Models which estimate the prediction function directly (such as linear or polynomial regression) are discriminative models. In this section we will discuss three generative models.

To recap, from the second section, quadratic discriminant analysis posits that the class conditional densities are multivariate Gaussian. We use the observations from each class to estimate a mean and a covariance matrix for that class. We also use sample proportions \widehat{p}_c to estimate the class proportions. Given this approximation of the probability measure on \mathcal{X} \times \mathcal{Y}, we return the classifier h(\mathbf{x}) = \operatorname{argmax}_c \widehat{p}_c \widehat{f}_c(\mathbf{x}) (where \widehat{f}_c is the multivariate normal density with mean \widehat{\boldsymbol{\mu}}_c and covariance \widehat{\Sigma}_c).
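As a rough illustration of this recap, here is a minimal NumPy sketch of the QDA recipe. The function names `fit_qda`, `log_score`, and `predict_qda` are hypothetical, and the covariance estimate uses NumPy's default unbiased normalization, which is one choice among several. Scores are compared on the log scale, which does not change the argmax.

```python
import numpy as np

def fit_qda(X, y):
    """Estimate (proportion, mean, covariance) for each class."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (
            len(Xc) / len(X),           # sample proportion, estimates p_c
            Xc.mean(axis=0),            # class mean, estimates mu_c
            np.cov(Xc, rowvar=False),   # class covariance, estimates Sigma_c
        )
    return params

def log_score(x, p, mu, Sigma):
    """Return log(p * f(x)) where f is the N(mu, Sigma) density."""
    diff = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    maha = diff @ np.linalg.solve(Sigma, diff)  # (x - mu)' Sigma^{-1} (x - mu)
    return np.log(p) - 0.5 * (len(x) * np.log(2 * np.pi) + logdet + maha)

def predict_qda(params, x):
    """Return the class c maximizing p_c * f_c(x)."""
    return max(params, key=lambda c: log_score(x, *params[c]))
```

Calling `predict_qda(fit_qda(X, y), x)` with an `(n, p)` array `X`, a length-`n` label array `y`, and a length-`p` point `x` returns the predicted label.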

A common variation on this idea is to posit that the class conditional densities have the same covariance matrix. Then observations from all of the classes can be pooled to estimate this common covariance matrix. We estimate the mean \widehat{\boldsymbol{\mu}}_c of each class c, and then we average (\mathbf{X}_i - \widehat{\boldsymbol{\mu}}_{Y_i})(\mathbf{X}_i - \widehat{\boldsymbol{\mu}}_{Y_i})' over all the sample points (\mathbf{X}_i, Y_i). This approach is called linear discriminant analysis (LDA).
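The pooled estimate described above can be sketched in a few lines of NumPy. The function name `pooled_covariance` is hypothetical, and the sketch divides by the total number of sample points, matching the plain average in the text (other normalizations are also used in practice).

```python
import numpy as np

def pooled_covariance(X, y):
    """Estimate per-class means and a single covariance shared by all classes."""
    classes, idx = np.unique(y, return_inverse=True)
    # One mean vector per class, in the order given by `classes`
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    # Subtract from each observation the mean of its own class
    centered = X - means[idx]
    # Average the outer products (X_i - mu_{Y_i})(X_i - mu_{Y_i})' over all points
    Sigma = centered.T @ centered / len(X)
    return classes, means, Sigma
```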

The advantage of LDA over QDA stems from the difficulty of estimating the p(p+1)/2 distinct entries of a symmetric p\times p covariance matrix if the number of features p is even moderately large. Pooling the classes allows us to marshal more observations in the service of this estimation task.

The terms quadratic and linear refer to the resulting decision boundaries: solution sets of equations of the form p_1f_1(\mathbf{x})=p_2f_2(\mathbf{x}) are quadric hypersurfaces or hyperplanes if p_1 and p_2 are positive real numbers and f_1 and f_2 are distinct multivariate normal densities. If the covariances of f_1 and f_2 are equal, then the solution set of p_1f_1(\mathbf{x})=p_2f_2(\mathbf{x}) is a hyperplane.

Exercise
Use the code cell below to confirm for the given covariance matrix and mean vectors that the solution set of p_1f_1(\mathbf{x})=p_2f_2(\mathbf{x}) is indeed a plane in three-dimensional space. (Hint: call simplify on the expression returned in the last line.)

Solution. The last line returns 12.182 \operatorname{e}^{2x - 3y + 4z}, so the set of points where this ratio is equal to p_1/p_2 is the solution set of 2x - 3y + 4z = \log(p_1/(12.182p_2)), which is a plane.

Although we used specific numbers in this example, it does illustrate the general point: the only quadratic term in the argument of the exponential in the formula for the multivariate normal distribution is \mathbf{x}' \Sigma^{-1} \mathbf{x}. Thus if we divide two such densities with the same \Sigma, the quadratic terms will cancel, and the only remaining variables will appear in the form of a linear combination in the exponent. When such expressions are set equal to a constant, the equation can be rearranged by dividing and taking logs to obtain a linear equation.
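The cancellation can also be checked symbolically. The following SymPy sketch uses a hypothetical covariance matrix and hypothetical mean vectors (they are not the values from the exercise above) and confirms that the log of the density ratio has total degree 1 in x, y, z.

```python
import sympy as sp

x, y, z = sp.symbols("x y z", real=True)
v = sp.Matrix([x, y, z])

# Hypothetical shared covariance and class means, chosen for illustration only
Sigma = sp.Matrix([[2, 1, 0], [1, 2, 0], [0, 0, 1]])
mu1 = sp.Matrix([1, 0, 0])
mu2 = sp.Matrix([0, 1, 1])

def log_density(v, mu, Sigma):
    """Log of the trivariate normal density with mean mu and covariance Sigma."""
    d = v - mu
    quad = (d.T * Sigma.inv() * d)[0, 0]
    return -quad / 2 - sp.log(sp.sqrt((2 * sp.pi) ** 3 * Sigma.det()))

# The quadratic terms and the normalizing constants cancel in the difference
log_ratio = sp.expand(log_density(v, mu1, Sigma) - log_density(v, mu2, Sigma))
degree = sp.Poly(log_ratio, x, y, z).total_degree()
print(log_ratio, degree)
```

Since `log_ratio` is linear in the variables, setting it equal to any constant describes a plane.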

Naive Bayes

The naive Bayes approach to classification is to assume that the components of \mathbf{X} are conditionally independent given Y. In the context of the flower example, this would mean assuming that blue-flower petal width and length are independent (which was true in that example), that the red-flower petal width and length are independent (which was not true), and that the green-flower petal width and length are independent (also not true).

To train a naive Bayes classifier, we use the observations from each class to estimate a density \widehat{f}_{c,i} on \mathbb{R} for each feature component i=1,2,\ldots, p, and then we estimate

\begin{align*} \widehat{f}_c(x_1,\ldots, x_p) = \widehat{f}_{c,1}(x_1)\widehat{f}_{c,2}(x_2)\cdots \widehat{f}_{c,p} (x_p),\end{align*}

in accordance with the conditional independence assumption. The method for estimating the univariate densities \widehat{f}_{c,i} is up to the user; options include kernel density estimation and parametric estimation.
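As one possible instance of the parametric option, the sketch below fits a univariate Gaussian to each feature within each class. The Gaussian choice and the function names `fit_naive_bayes` and `predict_naive_bayes` are assumptions made for illustration; a kernel density estimate could be substituted for each univariate density.

```python
import numpy as np

def fit_naive_bayes(X, y):
    """Estimate (proportion, per-feature means, per-feature std devs) per class."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (len(Xc) / len(X), Xc.mean(axis=0), Xc.std(axis=0))
    return params

def predict_naive_bayes(params, x):
    """Return the class maximizing p_c times the product of univariate densities."""
    def log_score(p, mu, sigma):
        # Sum of log univariate Gaussian densities equals log of their product
        log_f = -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))
        return np.log(p) + log_f.sum()
    return max(params, key=lambda c: log_score(*params[c]))
```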

Exercise
Each scatter plot shows a set of sample points for a three-category classification problem. Match each data set to the best-suited model: Naive Bayes, LDA, QDA.

Solution. The correct order is (c), (a), (b), since the third plot shows class conditional densities which factor as a product of marginals, the first plot shows Gaussian class conditional densities with the same covariance matrix, and the second plot shows Gaussian class conditional densities with distinct covariance matrices.

Exercise
Consider a classification problem where the features X_1 and X_2 have the property that X_1 is uniformly distributed on [0,1] and X_2 is equal to 1 - X_1. Suppose further that the conditional distribution of Y given X_1 and X_2 assigns probability mass 80% to class 1 and 20% to class 0 when the observation is left of the vertical line x_1 = \frac{1}{2}, and assigns probability mass 75% to class 0 and 25% to class 1 when the observation is right of the vertical line x_1 = \frac{1}{2}.

(a) Find the prediction function which minimizes the misclassification probability.

(b) Show that the Naive Bayes assumption leads to the optimal prediction function, even though the relationship between X_1 and X_2 is modeled incorrectly.

Solution. (a) The classifier which minimizes the misclassification probability predicts class 1 for points in the northwest quadrant of the square (since the class-1 density is larger there), and class 0 for points in the southeast quadrant (since the class-0 density is larger there).

(b) The probability of the event \{Y = 1\} is

\begin{align*}\mathbb{P}(\{Y = 1\} \cap \{X_1 \le 1/2\}) + \mathbb{P}(\{Y = 1\} \cap \{X_1 > 1/2\}) = (1/2)(80\%) + (1/2)(25\%) = 52.5\%.\end{align*}

Therefore, the conditional density of X_1 given Y = 1 is

\begin{align*}f_{X_1|Y = 1}(x_1) = \begin{cases} \frac{80\%}{52.5\%} = \frac{32}{21} & \text{if }x_1 \le 1/2 \\ \frac{25\%}{52.5\%} = \frac{10}{21} & \text{if }x_1 > 1/2 \end{cases}\end{align*}

Likewise, the conditional density of X_2 given Y = 1 is

\begin{align*}f_{X_2|Y = 1}(x_2) = \begin{cases} \frac{80\%}{52.5\%} = \frac{32}{21} & \text{if }x_2 \ge 1/2 \\ \frac{25\%}{52.5\%} = \frac{10}{21} & \text{if }x_2 < 1/2 \end{cases}\end{align*}

Under the (erroneous) assumption that X_1 and X_2 are conditionally independent given Y = 1, we would get a joint conditional density function (given the event \{Y = 1\}) which is constant on each quadrant of the unit square, with value (32/21)^2 throughout the northwest quadrant, (10/21)^2 on the southeast quadrant, and (32/21)(10/21) on each of the other two quadrants. To emphasize the distinction between the actual measures and the naive Bayes measure, here's a visualization of each:

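Likewise, the conditional density of X_1 given \{Y = 0\} is (20\%)/(47.5\%) = 8/19 for x_1 \le 1/2 and (75\%)/(47.5\%) = 30/19 for x_1 > 1/2, and the conditional density of X_2 given \{Y = 0\} takes the same two values for x_2 \ge 1/2 and x_2 < 1/2, respectively. The naive Bayes density given \{Y = 0\} is therefore (8/19)^2 in the northwest quadrant and (30/19)^2 in the southeast quadrant. Weighting by the class probabilities, we have (52.5\%)(32/21)^2 \approx 1.22 > (47.5\%)(8/19)^2 \approx 0.08, so the naive Bayes classifier predicts 1 in the northwest quadrant of the square. Likewise, (47.5\%)(30/19)^2 \approx 1.18 > (52.5\%)(10/21)^2 \approx 0.12, so it predicts 0 in the southeast quadrant.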

Therefore, despite modeling the relationship between the features incorrectly, the naive Bayes classifier does yield the optimal prediction function.
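The arithmetic in this solution can be checked exactly with rational numbers. The sketch below is only a verification aid; the variable names are hypothetical, and each class score is the class probability times the product of the two marginal conditional densities, as the naive Bayes assumption prescribes.

```python
from fractions import Fraction

# P(Y = c) given that the observation is left or right of x_1 = 1/2
p_left = {1: Fraction(80, 100), 0: Fraction(20, 100)}
p_right = {1: Fraction(25, 100), 0: Fraction(75, 100)}
half = Fraction(1, 2)  # X_1 is uniform, so each side has probability 1/2

# Class probabilities P(Y = c)
prior = {c: half * p_left[c] + half * p_right[c] for c in (0, 1)}

# Conditional density of X_1 given Y = c on each side; by symmetry these are
# also the values of the density of X_2 for x_2 >= 1/2 and x_2 < 1/2
dens_left = {c: p_left[c] / prior[c] for c in (0, 1)}
dens_right = {c: p_right[c] / prior[c] for c in (0, 1)}

def nb_score(c, quadrant):
    """Class probability times the naive Bayes density on the given quadrant."""
    d = dens_left[c] if quadrant == "NW" else dens_right[c]
    return prior[c] * d * d

nb_prediction = {q: max((0, 1), key=lambda c: nb_score(c, q)) for q in ("NW", "SE")}
print(prior, dens_left, dens_right, nb_prediction)
```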
