A programmer's notes, about everything and nothing. Mostly professional, I suppose.


Week 4, Logistic Regression probabilistic interpretation

The Scalable Machine Learning course. Hadoop, Apache Spark, Python, ML -- all of that.

Continuing my notes on the course. Week 4.
Last time we covered the problem statement (predicting a click on an ad banner) and the model used to solve it (logistic regression). Let's continue.

A probabilistic interpretation of logistic regression.
In this segment, we'll introduce the probabilistic
interpretation of logistic regression.
And we'll show how this probabilistic interpretation
relates to our previous discussion on linear decision boundaries.

When we looked at logistic regression, we left off at the point where the model gave us a yes/no answer depending on the sign of its output.
But what if we want an actual probability value, say between 0 and 1?

However, what if we want more granular information
and, instead, want a model of a conditional probability
that a label equals 1 given some set of predictive features?

For example, the probability of rain on Thursday.

Back to our ad banners.
Given the context of the task, a click probability of 10% is very high.

If we're considering an ad that has
good historical performance, a user that has a high click rate
frequency, and a highly relevant publisher page,
we might expect a probability to be around 0.1.
And this would suggest that there's a 10% chance
that a click event will occur.
Although this number's low on an absolute scale,
recall that click events are rare,
and thus, the probability of 0.1 is a relatively high
probability in our context.

So if we want to work with a linear model,
we need to squash its output so that it'll
be in the appropriate range.
We can do this using the logistic, or sigmoid function,
which takes as input the dot product between our model
and our features and returns a value between 0 and 1
which we can interpret as a probability.

The logistic (sigmoid) function squashes the output of the linear model into the range from 0 to 1.
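As a quick sketch of the idea (my own illustration, not course code), the sigmoid applied to the dot product of the weights and features gives a value we can read as a probability:

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, x):
    """Modeled probability that the label equals 1, given weights w and features x."""
    z = sum(wi * xi for wi, xi in zip(w, x))  # dot product w . x
    return sigmoid(z)
```

At z = 0 the sigmoid returns exactly 0.5; large positive z pushes the output toward 1, large negative z toward 0.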

So we can interpret logistic regression as using the logistic function to model the conditional probability of a label.

If we need a binary yes/no classification instead, we can apply a threshold value.

It turns out that using this thresholding rule leads to a very natural connection with our previous discussion about decision boundaries

When the sigmoid value equals 0.5 (the default threshold), this is equivalent to w^T x = 0, which is exactly the decision boundary.
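A small sketch (my own, under the same setup as above) makes the equivalence concrete: thresholding the sigmoid at 0.5 gives the same classification as checking the sign of w^T x, because sigmoid(z) >= 0.5 exactly when z >= 0.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def classify(w, x, threshold=0.5):
    """Return 1 if the modeled probability reaches the threshold, else 0."""
    z = sum(wi * xi for wi, xi in zip(w, x))  # w . x
    return 1 if sigmoid(z) >= threshold else 0
```

With threshold = 0.5 this agrees with the sign-based rule from last week; moving the threshold shifts the decision boundary away from w^T x = 0.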

Let's talk about choosing the threshold versus predicting probabilities directly.

We don't necessarily
have to set our threshold to 0.5.
Let's consider a spam detection example to see this point.

Now, imagine that we've made an incorrect spam prediction.
We can potentially make two types of mistakes.
The first type, which we call a false positive,
occurs when we classify a legitimate email as spam.
The second type, which we call a false negative,
occurs when we classify a spam email as a legitimate one.

In the spam case, a false positive is more harmful than a false negative. Not everyone will think to look for a missing email in the spam folder.

These two types of errors are typically
at odds with each other.
Or, in other words, we often trade off one type of error
for the other.

We can move the threshold away from 0.5 to some other value. But how do we find the optimal threshold?

We need to try a range of threshold values and plot the results.

One natural way to do this is to use a ROC plot, which
visualizes the trade-offs we make
as we change our threshold.
In particular, a ROC plot focuses on two standard metrics
that are often at odds with one another, namely
the false positive rate, or FPR, and the true positive rate,
or TPR.

Using our spam detection application
as a running example, we can interpret
FPR as the percentage of legitimate emails
that are incorrectly predicted as spam.
And we can view TPR as the percentage of spam emails
that are correctly predicted as spam.

In the ideal case we would get FPR = 0%, TPR = 100%.
If the spam/not-spam decision were made at random, we would get FPR = 50%, TPR = 50%.

We can use our classifier to generate
conditional probabilities for observations in our validation
set. And once we have these probabilities,
we can generate a ROC plot by varying
the threshold we use to convert these probabilities to class
predictions, and computing FPR and TPR
at each of these thresholds.
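The procedure above can be sketched as follows (my own illustration, not course code): for each threshold, convert the validation-set probabilities into class predictions and count the error types.

```python
def roc_points(probs, labels, thresholds):
    """Compute (FPR, TPR) at each threshold.

    probs  -- predicted probabilities of the positive class (e.g. spam)
    labels -- true labels: 1 = positive (spam), 0 = negative (legitimate)
    """
    points = []
    for t in thresholds:
        tp = fp = fn = tn = 0
        for p, y in zip(probs, labels):
            pred = 1 if p >= t else 0  # threshold the probability
            if pred == 1 and y == 1:
                tp += 1
            elif pred == 1 and y == 0:
                fp += 1
            elif pred == 0 and y == 1:
                fn += 1
            else:
                tn += 1
        tpr = tp / (tp + fn) if (tp + fn) else 0.0  # true positive rate
        fpr = fp / (fp + tn) if (fp + tn) else 0.0  # false positive rate
        points.append((fpr, tpr))
    return points
```

Plotting these (FPR, TPR) pairs gives the ROC curve; in practice, a library routine such as scikit-learn's `roc_curve` does the same job.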

Suppose we are willing to tolerate FPR = 10%, meaning that 10% of legitimate emails may end up in the spam folder.

But in some cases we don't need a binary classification at all; instead we want to use the probability values directly.
As we know, click events are rare.
And as a result, our predicted click probabilities
will be uniformly low.

We may also want to combine these predictions
with other information and thus don't want a threshold,
as we'll lose this fine-grained information by doing so.

original post http://vasnake.blogspot.com/2015/11/week-4-logistic-regression_1.html
