VSnake notes: Week 4, Logistic Regression probabilistic interpretation

2015-11-01

Week 4, Logistic Regression probabilistic interpretation

The Scalable Machine Learning course: Hadoop, Apache Spark, Python, ML, all of that.

I continue writing up my notes on the course I completed. Week 4.

Last time we covered the problem statement (predicting a click on an ad banner) and the model used to solve it (logistic regression). Let's continue.

Probabilistic interpretation of logistic regression.

In this segment, we'll introduce the probabilistic interpretation of logistic regression. And we'll show how this probabilistic interpretation relates to our previous discussion on linear decision boundaries.

When we looked at logistic regression, we stopped at the point where the model gave us a yes/no answer depending on the sign of the result.

But what if we want a probability value instead, that is, a number between 0 and 1?

However, what if we want more granular information and, instead, want a model of a conditional probability that a label equals 1 given some set of predictive features?

For example, the probability of rain on Thursday.

Returning to our ad banners.

A click probability of 10% is a very high probability, given the context of the problem.

If we're considering an ad that has good historical performance, a user that has a high click rate frequency, and a highly relevant publisher page, we might expect a probability to be around 0.1. And this would suggest that there's a 10% chance that a click event will occur. Although this number's low on an absolute scale, recall that click events are rare, and thus, the probability of 0.1 is a relatively high probability in our context.

So if we want to work with a linear model, we need to squash its output so that it'll be in the appropriate range. We can do this using the logistic, or sigmoid function, which takes as input the dot product between our model and our features and returns a value between 0 and 1 which we can interpret as a probability.

The logistic function, or sigmoid, squashes the output of the linear model into the range from 0 to 1.

As a result, we can interpret logistic regression as using the logistic function to model a probability.
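
To make the squashing concrete, here is a minimal sketch in plain Python. The weights `w` and features `x` are made-up values for illustration, and `click_probability` is a hypothetical helper name, not something from the course.

```python
import math

def sigmoid(z):
    """Logistic function: squashes any real number into the range 0..1."""
    # Two branches so that exp() never receives a large positive argument.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def click_probability(w, x):
    """Model P(y = 1 | x) as sigmoid of the dot product w . x."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    return sigmoid(z)

# Hypothetical weights and features, for illustration only.
w = [0.5, -1.0, 2.0]
x = [1.0, 2.0, 0.25]
p = click_probability(w, x)  # the dot product here is -1.0
```

A negative dot product gives a probability below 0.5, a positive one gives a probability above 0.5, and a dot product of exactly 0 gives 0.5.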

If we need a binary classification instead (yes/no), we can apply a threshold.

It turns out that using this thresholding rule leads to a very natural connection with our previous discussion about decision boundaries.

When the sigmoid value equals 0.5 (the threshold), this is equivalent to w^T x = 0, which is exactly the decision boundary.
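
The equivalence can be checked in a few lines. This is a short derivation, writing z for the dot product and sigma for the logistic function (notation assumed here, not taken from the lecture slides):

```latex
\[
\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad z = \mathbf{w}^\top \mathbf{x}
\]
\[
\sigma(z) \ge \tfrac{1}{2}
\iff 1 + e^{-z} \le 2
\iff e^{-z} \le 1
\iff z \ge 0
\]
```

So predicting "yes" when the probability is at least 0.5 is the same rule as predicting "yes" when the dot product is non-negative.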

Let's talk about using a threshold and about predicting probabilities.

USING PROBABILISTIC PREDICTIONS

We don't necessarily have to set our threshold to 0.5.

Let's consider a spam detection example to see this point. Now, imagine that we've made an incorrect spam prediction. We can potentially make two types of mistakes.

The first type, which we call a false positive, occurs when we classify a legitimate email as spam.

The second type, which we call a false negative, occurs when we classify a spam email as a legitimate one.
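
As a small illustration of the two error types, the sketch below names the outcome of each prediction and counts them. The `emails` list is invented data and `error_type` is a hypothetical helper name.

```python
def error_type(is_spam, predicted_spam):
    """Name the outcome of one prediction; 'positive' means 'spam' here."""
    if predicted_spam and is_spam:
        return "true positive"
    if predicted_spam and not is_spam:
        return "false positive"   # legitimate email classified as spam
    if not predicted_spam and is_spam:
        return "false negative"   # spam email classified as legitimate
    return "true negative"

# Hypothetical (actual, predicted) pairs, for illustration only.
emails = [(True, True), (False, True), (True, False), (False, False), (False, False)]

counts = {}
for is_spam, predicted_spam in emails:
    kind = error_type(is_spam, predicted_spam)
    counts[kind] = counts.get(kind, 0) + 1
```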

In the spam case, a false positive is more harmful than a false negative: not everyone will think to look for a missing email in the spam folder.

These two types of errors are typically at odds with each other. Or, in other words, we often trade off one type of error for the other.

We can shift the threshold from 0.5 to another value. But how do we find the optimal threshold?

We need to try a set of threshold values and plot the results.

One natural way to do this is to use a ROC plot, which visualizes the trade-offs we make as we change our threshold. In particular, a ROC plot focuses on two standard metrics that are often at odds with one another, namely the false positive rate, or FPR, and the true positive rate, or TPR.

Using our spam detection application as a running example, we can interpret FPR as the percentage of legitimate emails that are incorrectly predicted as spam. And we can view TPR as the percentage of spam emails that are correctly predicted as spam.
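
The two rates can be computed directly from labels and predictions. This is a minimal sketch, assuming labels and predictions are encoded as 1 for spam and 0 for legitimate; the function name `fpr_tpr` and the sample data are invented for illustration.

```python
def fpr_tpr(labels, predictions):
    """Return (FPR, TPR) for binary labels and predictions (1 = spam)."""
    negatives = sum(1 for y in labels if y == 0)  # legitimate emails
    positives = sum(1 for y in labels if y == 1)  # spam emails
    if negatives == 0 or positives == 0:
        raise ValueError("need at least one example of each class")
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    return fp / negatives, tp / positives

# Hypothetical data: 3 spam emails and 5 legitimate ones.
labels      = [1, 1, 1, 0, 0, 0, 0, 0]
predictions = [1, 1, 0, 1, 0, 0, 0, 0]
fpr, tpr = fpr_tpr(labels, predictions)
```

Here one of five legitimate emails is flagged as spam, and two of three spam emails are caught.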

In the ideal case we should get FPR = 0%, TPR = 100%.

When the spam/not-spam decision is made at random, for example by a fair coin flip, we get FPR = 50%, TPR = 50%.

We can use our classifier to generate conditional probabilities for observations in our validation set. And once we have these probabilities, we can generate a ROC plot by varying the threshold we use to convert these probabilities to class predictions, and computing FPR and TPR at each of these thresholds.
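
The procedure from the quote can be sketched as a loop over thresholds. The validation labels, the predicted probabilities, and the threshold grid below are all invented for illustration; plotting the resulting points (FPR on the x axis, TPR on the y axis) would give the ROC plot.

```python
def roc_points(labels, probs, thresholds):
    """Return a list of (threshold, FPR, TPR), one entry per threshold."""
    negatives = sum(1 for y in labels if y == 0)
    positives = sum(1 for y in labels if y == 1)
    points = []
    for t in thresholds:
        # Convert probabilities to class predictions at this threshold.
        preds = [1 if p >= t else 0 for p in probs]
        fp = sum(1 for y, yhat in zip(labels, preds) if y == 0 and yhat == 1)
        tp = sum(1 for y, yhat in zip(labels, preds) if y == 1 and yhat == 1)
        points.append((t, fp / negatives, tp / positives))
    return points

# Hypothetical validation set: true labels and predicted probabilities.
labels = [1, 0, 1, 0, 0, 1, 0, 0]
probs  = [0.9, 0.8, 0.7, 0.4, 0.3, 0.6, 0.2, 0.1]

points = roc_points(labels, probs, [0.0, 0.25, 0.5, 0.75, 1.01])
for t, fpr, tpr in points:
    print(f"threshold={t:.2f}  FPR={fpr:.2f}  TPR={tpr:.2f}")
```

A threshold of 0 labels everything as positive (FPR and TPR both 100%), and a threshold above 1 labels nothing as positive (both 0%); the interesting trade-offs lie in between.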

Suppose we are willing to tolerate FPR = 10%, which means that 10% of legitimate emails may end up in spam.
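
One way to turn such a tolerance into a threshold is to scan candidate thresholds and keep the one with the highest TPR among those whose FPR stays within the limit. This is only a sketch of that idea under assumed data; the selection rule and the name `pick_threshold` are illustrative choices, not something prescribed by the course.

```python
def pick_threshold(labels, probs, max_fpr):
    """Return (threshold, FPR, TPR) with the highest TPR such that FPR <= max_fpr."""
    negatives = sum(1 for y in labels if y == 0)
    positives = sum(1 for y in labels if y == 1)
    # Candidates: every observed probability, plus "predict nothing as spam".
    candidates = sorted(set(probs)) + [float("inf")]
    best = None
    for t in candidates:
        fp = sum(1 for y, p in zip(labels, probs) if y == 0 and p >= t)
        tp = sum(1 for y, p in zip(labels, probs) if y == 1 and p >= t)
        fpr, tpr = fp / negatives, tp / positives
        if fpr <= max_fpr and (best is None or tpr > best[2]):
            best = (t, fpr, tpr)
    return best

# Hypothetical validation set: 4 spam emails and 10 legitimate ones.
labels = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
probs  = [0.95, 0.9, 0.85, 0.8, 0.6, 0.55, 0.5, 0.4, 0.35, 0.3, 0.25, 0.2, 0.15, 0.1]

best = pick_threshold(labels, probs, max_fpr=0.10)
```

With this data the limit allows at most one legitimate email out of ten to be flagged, so the threshold cannot be pushed low enough to catch the fourth spam email.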

But in some cases we don't need a binary classification and instead want to use the probability values directly.

As we know, click events are rare. And as a result, our predicted click probabilities will be uniformly low. We may also want to combine these predictions with other information and thus don't want a threshold, as we'll lose this fine-grained information by doing so.
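
A toy illustration of why the raw probabilities matter: below, every predicted click probability is far under 0.5, so thresholding would label all ads as "no click", yet the probabilities still let us rank the ads once they are combined with other information. The "value of a click" field is a purely hypothetical example of such information and is not taken from the course.

```python
# Hypothetical candidates: (ad id, predicted click probability, value of a click).
candidates = [
    ("ad_a", 0.02, 1.50),
    ("ad_b", 0.10, 0.40),
    ("ad_c", 0.05, 0.50),
]

def expected_value(p_click, click_value):
    """Combine the predicted probability with the extra information."""
    return p_click * click_value

# With a 0.5 threshold every ad gets the same class, so nothing can be ranked.
thresholded = [1 if p >= 0.5 else 0 for _, p, _ in candidates]

# Using the probabilities directly keeps the fine-grained ordering.
ranked = sorted(candidates, key=lambda c: expected_value(c[1], c[2]), reverse=True)
ranked_ids = [ad_id for ad_id, _, _ in ranked]
```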

original post http://vasnake.blogspot.com/2015/11/week-4-logistic-regression_1.html
