VSnake notes: November 2015

2015-11-30

Now IBM too

It has become an across-the-board fashion among big companies to open up their Machine Learning work:

IBM has announced that it is handing the SystemML platform over to the Apache Software Foundation. SystemML provides tools for building scalable distributed machine learning systems. The platform provides a translator for various machine learning algorithms which, given a declarative description of an algorithm, can automatically generate hybrid execution plans both for single machines that process data in memory and for clusters with large data stores deployed with Apache Hadoop and Apache Spark.

http://www.opennet.ru/opennews/art.shtml?num=43406

in the realm of Big Data, application developers face huge challenges when combining information from different sources and when deploying data-heavy applications to different types of computers. What they need is a good translator.
That’s why IBM has donated to the open source community SystemML, which is a universal translator for Big Data and the machine learning algorithms that are becoming essential to processing it. System ML enables developers who don’t have expertise in machine learning to embed it in their applications once and use it in industry-specific scenarios on a wide variety of computing platforms, from mainframes to smartphones.

http://www.ibm.com/blogs/think/2015/11/24/introducing-a-universal-translator-for-big-data-and-machine-learning/

http://researcher.watson.ibm.com/researcher/view_group.php?id=3174
https://github.com/SparkTC/systemml
https://github.com/apache/incubator-systemml
http://systemml.apache.org/

Everyone has checked in by now: Google, Microsoft, and now IBM too.
Are they competing, or what?

original post http://vasnake.blogspot.com/2015/11/ibm.html

2015-11-26

Scala

Scala => SCAlable LAnguage

Having freed myself from pointless work for bankers, the first thing I did was take the course

Functional Programming Principles in Scala

https://www.coursera.org/course/progfun

Remarkably, the course is taught by Martin Odersky, the author of Scala.

Since the course had nothing new for me on FP itself (monads and lambda calculus were not covered), I paid attention mostly to the specifics of Scala rather than to the FP features.

I liked the language a lot, from a consumer's point of view. Concise, expressive, with a rich standard library. There is a REPL! And all this even though it is Java (the JVM) under the hood.

It feels almost as easy to write as Python. Only the static typing spoils the feeling. But then again, why would we need another Python? Or put it this way: let it be a Python-ish language with static typing.

What was stressful: Martin designed the problems and exercises so that life would not seem too sweet. Often you have to rack your brains to find the right answer. And given that I had not used recursion more complex than a tree traversal for ages, let alone solved combinatorics problems, I had to sweat. On the other hand, now I know where the gaps in my education are.

But once you figure out how a problem is solved, the Scala code comes out so concise that you marvel. True, concise code has a downside: if you forget what it was about, you spend a long time "unzipping" the code back into a description of the algorithm in your head.

By the way, tellingly: all problems and exercises in the course are solved without mutable variables. Not once did I need to mutate a variable. Beautiful.

I liked Scala; I will keep studying and using it.

Videos

Getting Started

-- Course Introduction (2:44)

-- Tools Setup for Linux (12:24)

-- Tools Setup for Mac OS X (12:17)

-- Tools Setup for Windows (10:37)

-- Tutorial: Working on the Programming Assignments (8:47)

-- IntelliJ IDEA (optional alternative IDE)

Week 1: Functions & Evaluations

-- Lecture 1.1 - Programming Paradigms (14:32);

-- Lecture 1.2 - Elements of Programming (14:25);

-- Lecture 1.3 - Evaluation Strategies and Termination (4:22);

-- Lecture 1.4 - Conditionals and Value Definitions (8:49);

-- Lecture 1.5 - Example: square roots with Newton's method (11:25);

-- Lecture 1.6 - Blocks and Lexical Scope (8:00);

-- Lecture 1.7 - Tail Recursion (12:32)

Week 2: Higher Order Functions

-- Lecture 2.1 - Higher-Order Functions (10:18);

-- Lecture 2.2 - Currying (14:58);

-- Lecture 2.3 - Example: Finding Fixed Points (10:46);

-- Lecture 2.4 - Scala Syntax Summary (4:13);

-- Lecture 2.5 - Functions and Data (11:50);

-- Lecture 2.6 - More Fun With Rationals (15:08);

-- Lecture 2.7 - Evaluation and Operators (16:25);

Week 3: Data and Abstraction

-- Lecture 3.1 - Class Hierarchies (25:50);

-- Lecture 3.2 - How Classes Are Organized (20:30);

-- Lecture 3.3 - Polymorphism (21:09);

Week 4: Types and Pattern Matching

-- Lecture 4.1 - Functions as Objects (8:04);

-- Lecture 4.3 - Subtyping and Generics (15:02);

-- Lecture 4.2 - Objects Everywhere (19:07);

-- Lecture 4.4 - Variance (Optional) (21:33);

-- Lecture 4.5 - Decomposition (16:57);

-- Lecture 4.6 - Pattern Matching (19:36);

-- Lecture 4.7 - Lists (16:20);

Week 5: Lists

-- Lecture 5.1 - More Functions on Lists (13:04);

-- Lecture 5.2 - Pairs and Tuples (10:45);

-- Lecture 5.3 - Implicit Parameters (11:08);

-- Lecture 5.4 - Higher-Order List Functions (14:53);

-- Lecture 5.5 - Reduction of Lists (15:35);

-- Lecture 5.6 - Reasoning About Concat (13:00);

-- Lecture 5.7 - A Larger Equational Proof on Lists (9:53);

Week 6: Collections

-- Lecture 6.1 - Other Collections (20:45);

-- Lecture 6.2 - Combinatorial Search and For-Expressions (13:12);

-- Lecture 6.3 - Combinatorial Search Example (16:54);

-- Lecture 6.4 - Queries with For (7:50);

-- Lecture 6.5 - Translation of For (11:23);

-- Lecture 6.6 - Maps (22:39);

-- Lecture 6.7 - Putting the Pieces Together (20:35);

Week 7: Lazy Evaluation

-- Lecture 7.1 - Structural Induction on Trees (15:10);

-- Lecture 7.2 - Streams (12:12);

-- Lecture 7.3 - Lazy Evaluation (11:38);

-- Lecture 7.4 - Computing with Infinite Sequences (9:01);

-- Lecture 7.5 - Case Study: the Water Pouring Problem (31:45);

-- Lecture 7.6 - Course Conclusion (5:34);

original post http://vasnake.blogspot.com/2015/11/scala.html

2015-11-20

TensorFlow

Google has decided that it is not right to hide its ML work from the public. It is not good when code libraries are tightly tied to the internal architecture of a datacenter.

Deep Learning has had a huge impact on computer science, making it possible to explore new frontiers of research and to develop amazingly useful products that millions of people use every day. Our internal deep learning infrastructure DistBelief, developed in 2011, has allowed Googlers to build ever larger neural networks and scale training to thousands of cores in our datacenters. We’ve used it to demonstrate that concepts like “cat” can be learned from unlabeled YouTube images, to improve speech recognition in the Google app by 25%, and to build image search in Google Photos. DistBelief also trained the Inception model that won Imagenet’s Large Scale Visual Recognition Challenge in 2014, and drove our experiments in automated image captioning as well as DeepDream.

While DistBelief was very successful, it had some limitations. It was narrowly targeted to neural networks, it was difficult to configure, and it was tightly coupled to Google’s internal infrastructure -- making it nearly impossible to share research code externally

http://googleresearch.blogspot.ru/2015/11/tensorflow-googles-latest-machine_9.html

So Google decided that the core libraries should be convenient and available to everyone. Who knows, maybe synergy will kick in.

That is how the TensorFlow project came about

Today we’re proud to announce the open source release of TensorFlow -- our second-generation machine learning system, specifically designed to correct these shortcomings. TensorFlow is general, flexible, portable, easy-to-use, and completely open source. We added all this while improving upon DistBelief’s speed, scalability, and production readiness -- in fact, on some benchmarks, TensorFlow is twice as fast as DistBelief (see the whitepaper for details of TensorFlow’s programming model and implementation).

http://googleresearch.blogspot.ru/2015/11/tensorflow-googles-latest-machine_9.html

Google has published a new open project, TensorFlow, which provides a practical implementation of deep machine learning algorithms created by the Google Brain team, a group that does research in artificial intelligence, neural networks, and machine learning. TensorFlow technology is already used by Google in areas such as speech recognition, detecting faces in photos, determining image similarity, filtering spam in Gmail, and determining meaning in the translation service. The system's code is written in C++ and Python and is distributed under the Apache license.

TensorFlow provides a library of ready-made numerical computation algorithms implemented via data flow graphs. Nodes in such graphs implement mathematical operations or input/output points, while the graph's edges represent multidimensional data arrays (tensors) that flow between the nodes. Nodes can be assigned to computational devices and executed asynchronously, processing in parallel all the tensors that arrive at them. In this way a neural network is built in which all nodes work simultaneously, by analogy with the simultaneous activation of neurons in the brain.

http://www.opennet.ru/opennews/art.shtml?num=43285
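
To make the data flow graph idea concrete, here is a minimal pure-Python sketch. It is not the TensorFlow API: the Node class and the run function are hypothetical names made up for this illustration, plain lists stand in for tensors, and evaluation is sequential, whereas a real system would schedule independent nodes in parallel across devices.

```python
# A toy data flow graph: NOT TensorFlow, just an illustration of the idea.
# Node and run are hypothetical names introduced for this sketch.

class Node(object):
    def __init__(self, op, inputs=()):
        self.op = op                  # function applied to the incoming values
        self.inputs = list(inputs)    # upstream nodes (the incoming edges)


def run(node, cache=None):
    """Evaluate a node after evaluating everything that flows into it."""
    if cache is None:
        cache = {}
    if node not in cache:
        args = [run(n, cache) for n in node.inputs]
        cache[node] = node.op(*args)
    return cache[node]


# Input nodes produce "tensors" (here: plain lists of floats).
x = Node(lambda: [1.0, 2.0, 3.0])
w = Node(lambda: [0.5, 0.5, 0.5])

# Operation nodes; the edges carry the lists from one node to the next.
mul = Node(lambda a, b: [ai * bi for ai, bi in zip(a, b)], [x, w])
total = Node(lambda v: sum(v), [mul])

result = run(total)
print(result)
```

The cache makes sure every node is computed once even if several downstream nodes consume its output.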

Parallelism, CPU, GPU, all that.

Such a beauty that there is simply no reason not to drink to it :)

http://tensorflow.org/

https://github.com/tensorflow/tensorflow

And people are already playing with it in full swing:

Reinforcement Learning using Tensor Flow

https://github.com/nivwusquorum/tensorflow-deepq

original post http://vasnake.blogspot.com/2015/11/tensorflow.html

2015-11-05

Week 5

The Scalable Machine Learning course. Hadoop, Apache Spark, Python, ML -- all that.

I continue writing up my notes on the course I took. Week 5.

Last time we finished lab 4. Next: week five, the lectures.

WEEK 5: Principal Component Analysis and Neuroimaging.

Topics: Introduction to neuroscience and neuroimaging data, exploratory data analysis, principal component analysis (PCA) formulations and solution, distributed PCA.

On the whole, apart from the attempts to get people interested with "look what cool pictures we get when we record brain activity", the whole story of week five is about PCA, Principal Component Analysis. Clustering was discussed too.

As we remember, PCA is a clever bit of math that lets us drop insignificant components from a dataset, reducing the dimensionality of the data (dimensionality reduction).

In other words, whereas until now we dealt with Supervised Learning problems, now we turn to Unsupervised methods.

In most cases, clustering is seen as a means to reduce dimensionality, to simplify the representation of the data.

Let us look at the concept of dimensionality reduction using shoe sizes as an example:

To introduce the concept of dimensionality reduction, we're going to consider a simple toy example involving shoe sizes. If we consider an abstract size space, on the right, in which the true size lives, then we can think of both the American and European sizes as just different linear functions of that underlying space. If the size space ranges from one to two, American sizes are the result of multiplying by six, and European sizes are the result of multiplying by six and adding 33. When performing dimensionality reduction, we're just going in the opposite direction. In physics, for example, the space on the right is referred to as the state space of a system.
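
The shoe-size example can be sketched in a few lines of plain Python. The function names to_american, to_european, and from_american are made up for illustration; the constants (multiply by six, add 33) come from the lecture quote.

```python
# Two observed features that are both linear functions of one hidden "size".

def to_american(size):
    return 6.0 * size

def to_european(size):
    return 6.0 * size + 33.0

def from_american(us):
    # Dimensionality reduction goes the other way: from observed to hidden.
    return us / 6.0

hidden = [1.0, 1.25, 1.5, 1.75, 2.0]
observed = [(to_american(s), to_european(s)) for s in hidden]   # 2-D data

recovered = [from_american(us) for us, eu in observed]          # back to 1-D

# Every 2-D point lies on one line: eu - us is the same constant everywhere.
offsets = set(eu - us for us, eu in observed)
print(observed)
print(recovered)
print(offsets)
```

Two numbers are stored per person, but one number is enough to reconstruct both, which is exactly the redundancy dimensionality reduction removes.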

Another example: recordings of a leech's neuron activity while it crawls or swims. Many signals from many neurons were reduced to three components, which made it possible to draw a three-dimensional plot.

So the motivation is clear: the data is redundant, the data is tangled, and the storage format makes processing hard (for example images, where the information is encoded in pixel color). We need to throw away the excess, keep what matters, and convert it to numbers.

So what is the idea? Back to the shoe-size example.

We pick a direction (defining a one-dimensional space) and project our data onto that vector.

The task is to pick the vector for which the sum of Euclidean distances from the data points to it is minimal.

our goal is to minimize the Euclidean distances between our original points and their projections

PCA finds exactly those projections for which the distance between the points and their projections is minimal.

Although this looks like linear regression, the two algorithms are quite different.

Linear regression aims to predict y from x. As shown in the picture on the left, we have a single feature, not two features, as in the PCA picture. We fit a line to the single feature in order to minimize distances to y. And thus we compute errors as vertical lines.

Now let us come at it from another side and think about variance. For shoe sizes, the largest spread is along the blue line, isn't it?

Now let's think about another way we could potentially better represent our data. We do this by noting that in order to identify patterns in our data, we often look for variation across observations. So it seems reasonable to find a succinct representation that best captures variation in our initial data. It turns out that the PCA solution represents the original data in terms of its directions of maximal variation.

Time to get into matrices: the math behind PCA.

The input data, the observations:

We represent this data in an n by d matrix, which we call X. Note that each row of this matrix corresponds to an observation.

Specifically, we aim to find a k dimensional representation for each of our n data points, where k is much smaller than d.

The matrix Z stores the k dimensional representations of our data points.

By defining Z equals XP, we are assuming that Z is a linear combination of the original data, and this linearity assumption simplifies our problem.

The assumption is that the relationship between the original and the reduced data is linear. This matters.

PCA is an optimization problem in which we look for a matrix P that satisfies the variance requirements.

We impose specific variance and covariance constraints on P related to the variance of our original data.

What are variance and covariance? It is simple: variance measures the deviation from the mean (the average squared deviation).

Along the way, preprocessing shifts the mean to 0. Then the notation (and the computation) gets simpler.

Covariance is the same thing, only for the product of two features.

A sum of products, does that remind you of anything? Right, matrix multiplication.
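
For reference, these definitions can be written out as follows. This is a sketch that assumes n observations and features already centered to zero mean; the symbol $x_j^{(i)}$ for the value of feature j in observation i is a notation chosen here, not necessarily the one on the lecture slides.

```latex
\sigma_j^2 = \frac{1}{n} \sum_{i=1}^{n} \left(x_j^{(i)}\right)^2,
\qquad
\sigma_{jk} = \frac{1}{n} \sum_{i=1}^{n} x_j^{(i)} x_k^{(i)},
\qquad
C_X = \frac{1}{n} X^\top X
```

Entry (j, k) of $X^\top X$ is exactly the sum of products in the covariance formula, which is why the whole d by d covariance matrix comes out of a single matrix multiplication.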

Properties of covariance:

– large positive covariance indicates that the two features are highly correlated

– large negative covariance indicates that the two features are highly anticorrelated

– covariance of zero means that the two features are uncorrelated

Additionally, if the covariance between the two features equals each of their squared variances, then the two features are the same

The covariance matrix is d by d matrix where each entry stores pairwise covariance information about the d features

In particular, the diagonal entries of this matrix equal the sample variances of the features, and the ijth entries equal the sample covariance between the ith and jth features.

How does this relate to the PCA optimization problem?

PCA requires that the features in our reduced dimensions have no correlation

Which can be expressed as: for the reduced dataset, the covariance matrix must contain zeros everywhere except on the diagonal.

Second, PCA requires that the features in the reduced dimension maximize variance, which means that the features are ordered by their variance.

This shows up in the top-left entry of the covariance matrix of the new reduced dataset being the largest and the bottom-right entry the smallest. Not "minimal" in an absolute sense, just that the diagonal is sorted.

Enter the eigenvectors.

The solution to the problem lies in the covariance matrix.

Linear algebra to the rescue.
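
As a rough sketch of how eigenvectors solve the problem, the snippet below finds the top principal direction of a tiny 2-D dataset in plain Python. It uses power iteration as a stand-in for a proper eigendecomposition (normally a library routine would do this), and the toy data and the helper names center, covariance, and top_eigenvector are made up for illustration.

```python
import math

def center(rows):
    """Subtract the per-feature mean so every feature has zero mean."""
    n = float(len(rows))
    d = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]

def covariance(rows):
    """d x d sample covariance matrix (1/n) * X^T X of centered rows."""
    n = float(len(rows))
    d = len(rows[0])
    return [[sum(r[j] * r[k] for r in rows) / n for k in range(d)]
            for j in range(d)]

def top_eigenvector(matrix, iters=200):
    """Power iteration: repeatedly apply the matrix and renormalize."""
    d = len(matrix)
    v = [1.0] + [0.0] * (d - 1)
    for _ in range(iters):
        w = [sum(matrix[j][k] * v[k] for k in range(d)) for j in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Toy data: (American size, European size) pairs from the shoe example.
data = [[6.0 * s, 6.0 * s + 33.0] for s in (1.0, 1.25, 1.5, 1.75, 2.0)]
X = center(data)
C = covariance(X)
p = top_eigenvector(C)      # first principal direction (a column of P)
z = [sum(xi * pi for xi, pi in zip(row, p)) for row in X]   # Z = XP, k = 1
print(p)
print(z)
```

For this data both centered features are identical, so the direction comes out along the diagonal, with both components equal to about 0.707.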

One question remains: which value of k (the dimensionality of the reduced dataset) should we choose?

For visualization: the two or three dimensions with the greatest variance.

For any numerical analysis: enough to capture "most of the variance".

In particular, given the eigenvalues of a sample covariance matrix, we can easily compute the fraction of retained variance for any given k by computing the sum of the top k eigenvalues and dividing the sum by the sum of all of the eigenvalues.
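
The rule from the quote fits in one small function. This is a sketch: the name retained_variance and the eigenvalues below are hypothetical, chosen only to show the arithmetic.

```python
def retained_variance(eigenvalues, k):
    """Fraction of total variance kept by the top k principal components."""
    ordered = sorted(eigenvalues, reverse=True)
    return sum(ordered[:k]) / float(sum(ordered))

# Hypothetical eigenvalues of a sample covariance matrix.
eigs = [4.0, 3.0, 2.0, 1.0]
fractions = [retained_variance(eigs, k) for k in range(1, len(eigs) + 1)]
print(fractions)
```

One way to use it is to pick the smallest k whose fraction reaches whatever target suits the analysis.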

Limitations and assumptions of the method:

First, PCA makes linearity assumptions, and also assumes that the principal components are orthogonal.

Second, ..., we've assumed that our data is centered, or, in other words, that our features have zero mean, as this simplifies our notation.

And finally, since PCA chooses directions of max variance, it depends on the relative scale of each feature.

That is enough for today.

We will continue tomorrow

PCA ALGORITHM

...

original post http://vasnake.blogspot.com/2015/11/week-5.html

2015-11-04

Week 4, Lab 4, Part 3

The Scalable Machine Learning course. Hadoop, Apache Spark, Python, ML -- all that.

I continue writing up my notes on the course I took. Week 4.

Last time we started on lab 4 but did not finish it.

Let us continue.

Part 3: Parse CTR data and generate OHE features

Visualization 1: Feature frequency

Work with the Criteo dataset begins.

Loading the raw data into an RDD

import os

# fileName must point to the Criteo sample file
if os.path.isfile(fileName):
    rawData = (sc
               .textFile(fileName, 2)
               .map(lambda x: x.replace('\t', ',')))  # work with either ',' or '\t' separated data
    print rawData.take(1)

[u'0,1,1,5,0,1382,4,15,2,181,1,2,,2,68fd1e64,80e26c9b,fb936136,7b4723c4,25c83c98,7e0ccccf,de7995b8,1f89b562,a73ee510,a8cd5504,b2cb9c98,37c9c164,2824a5f6,1adce6ef,8ba8b39a,891b62e7,e5ba7672,f54016b9,21ddcdc9,b1252a9d,07b5194c,,3a171ecb,c5c50484,e8b83407,9727dd16']

Lots of features, some of them empty; there are numbers, and there is odd text that looks like hashes of some kind. But that does not matter for now. Here we are doing feature extraction.

(3a) Loading and splitting the data

Let us split the dataset into three parts (3 RDDs): training, validation, and test sets. And cache them in memory.

https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.randomSplit

https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.cache

weights = [.8, .1, .1]
seed = 42
rawTrainData, rawValidationData, rawTestData = rawData.randomSplit(weights, seed)
rawTrainData.cache()
rawValidationData.cache()
rawTestData.cache()

nTrain = rawTrainData.count()
nVal = rawValidationData.count()
nTest = rawTestData.count()
print nTrain, nVal, nTest, nTrain + nVal + nTest

79911 10075 10014 100000

100 thousand records in total.

(3b) Extract features

Let us turn each row of the dataset into a list of tuples. The first element of a tuple is the feature's ordinal number, the second is the feature's value.

The resulting RDD will be a list of lists.

def parsePoint(point):
    """Converts a comma separated string into a list of (featureID, value) tuples.

    Note:
        featureIDs should start at 0 and increase to the number of features - 1.

    Args:
        point (str): A comma separated string where the first value is the label and the rest
            are features.

    Returns:
        list: A list of (featureID, value) tuples.
    """
    res = []
    rowFeats = [x.strip() for x in point.split(',')]  # split on commas, trim whitespace
    for idx, x in enumerate(rowFeats):
        if idx == 0:
            continue  # the first field is the label, not a feature
        res.append((idx - 1, x))
    return res

parsedTrainFeat = rawTrainData.map(parsePoint)

numCategories = (parsedTrainFeat
                 .flatMap(lambda x: x)
                 .distinct()
                 .map(lambda x: (x[0], 1))
                 .reduceByKey(lambda x, y: x + y)
                 .sortByKey()
                 .collect())

numCategories holds the number of distinct values for each feature

[(0, 144), (1, 2467), (2, 855), (3, 129), (4, 20311), (5, 1890), (6, 567), (7, 142), (8, 1796), (9, 8), (10, 81), (11, 62), (12, 252), (13, 471), (14, 492), (15, 36044), (16, 21331), (17, 131), (18, 12), (19, 7221), (20, 233), (21, 3), (22, 9905), (23, 3678), (24, 33988), (25, 2741), (26, 25), (27, 4844), (28, 28762), (29, 10), (30, 2379), (31, 1223), (32, 4), (33, 31887), (34, 11), (35, 14), (36, 10799), (37, 49), (38, 8325)]

(3c) Create an OHE dictionary from the dataset

We have already done this on the toy dataset.

ctrOHEDict = createOneHotDict(parsedTrainFeat)
numCtrOHEFeats = len(ctrOHEDict.keys())
print numCtrOHEFeats

233286

(3d) Apply OHE to the dataset

Now we use the OHE dictionary to create a new dataset of OHE features (by the way, we did not have to encode every feature; some were already numeric). We did the same kind of work for the toy dataset.

http://spark.apache.org/docs/1.3.1/api/python/pyspark.mllib.html#pyspark.mllib.regression.LabeledPoint

from pyspark.mllib.regression import LabeledPoint

def parseOHEPoint(point, OHEDict, numOHEFeats):
    """Obtain the label and feature vector for this raw observation.

    Note:
        You must use the function `oneHotEncoding` in this implementation or later portions
        of this lab may not function as expected.
        e.g. oneHotEncoding([(1, 'black'), (0, 'mouse')], sampleOHEDictManual, numSampleOHEFeats)

    Args:
        point (str): A comma separated string where the first value is the label and the rest
            are features.
        OHEDict (dict of (int, str) to int): Mapping of (featureID, value) to unique integer.
        numOHEFeats (int): The number of unique features in the training dataset.

    Returns:
        LabeledPoint: Contains the label for the observation and the one-hot-encoding of the
            raw features based on the provided OHE dictionary.
    """
    features = []
    label = None
    rowFeats = [x.strip() for x in point.split(',')]  # split on commas, trim whitespace
    for idx, x in enumerate(rowFeats):
        if idx == 0:  # the first field is the label
            label = x
            continue
        features.append((idx - 1, x))
    oheFeats = oneHotEncoding(features, OHEDict, numOHEFeats)
    res = LabeledPoint(label, oheFeats)
    return res

OHETrainData = rawTrainData.map(lambda point: parseOHEPoint(point, ctrOHEDict, numCtrOHEFeats))
OHETrainData.cache()
print OHETrainData.take(1)

[LabeledPoint(0.0, (233286,[382,3101,6842,8311,8911,11887,12893,16211,17631,18646,23513,29366,33157,39536,55820,61797,81485,82753,93671,96986,109720,110662,112139,120263,128571,132400,132805,140595,160666,185457,190322,191105,195902,202638,204242,206037,222753,225966,229941],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0]))]

You can see how an OHE record of the new dataset is laid out: the label, then a SparseVector given by its length, its indices, and the values at those indices.

We will skip the visualization; it is not very interesting.

(3e) Handling unseen features

In the oneHotEncoding function, pay attention to the case where a feature value is missing from the dictionary. Such a value is simply ignored (the SparseVector keeps a zero there).

def oneHotEncoding(rawFeats, OHEDict, numOHEFeats):
???
        key = OHEDict.get(x, None)
        if key is not None:
???
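
Since the notes leave the function body out, here is a minimal plain-Python sketch of the idea, not the lab's solution. It returns a dict of index to 1.0 instead of a pyspark SparseVector so that it runs without Spark; createOneHotDictLocal and oneHotEncodingLocal are hypothetical names, and the sample data reuses the toy animal samples listed in the hashing section.

```python
def createOneHotDictLocal(observations):
    """Map every distinct (featureID, value) pair to a unique integer index."""
    distinct = sorted(set(pair for obs in observations for pair in obs))
    return dict((pair, idx) for idx, pair in enumerate(distinct))

def oneHotEncodingLocal(rawFeats, OHEDict):
    """Return {index: 1.0} for known pairs; unseen pairs are silently skipped."""
    encoded = {}
    for pair in rawFeats:
        key = OHEDict.get(pair, None)
        if key is not None:       # unseen feature: leave its slot at zero
            encoded[key] = 1.0
    return encoded

train = [
    [(0, 'mouse'), (1, 'black')],
    [(0, 'cat'), (1, 'tabby'), (2, 'mouse')],
    [(0, 'bear'), (1, 'black'), (2, 'salmon')],
]
oheDict = createOneHotDictLocal(train)
known = oneHotEncodingLocal([(0, 'cat'), (1, 'black')], oheDict)
unseen = oneHotEncodingLocal([(0, 'dog'), (1, 'black')], oheDict)
print(len(oheDict))
print(known)
print(unseen)
```

The pair (0, 'dog') never occurred in the training data, so it contributes nothing and only (1, 'black') is encoded.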

Part 4: CTR prediction and logloss evaluation

Visualization 2: ROC curve

(4a) Logistic regression

At last, we have reached building and training the model.

https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.classification.LogisticRegressionWithSGD

https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.regression.LogisticRegressionModel

from pyspark.mllib.classification import LogisticRegressionWithSGD
# fixed hyperparameters
numIters = 50
stepSize = 10.
regParam = 1e-6
regType = 'l2'
includeIntercept = True

model0 = LogisticRegressionWithSGD.train(
    OHETrainData,
    iterations=numIters,
    step=stepSize,
    regParam=regParam,
    regType=regType,
    intercept=includeIntercept
)
sortedWeights = sorted(model0.weights)
print sortedWeights[:5], model0.intercept

[-0.4589923685357562, -0.37973707648623972, -0.3699655826675331, -0.36934962879928285, -0.32697945415010637] 0.56455084025

As we remember, the sum of the products of the coefficients (weights) and the feature values gives us a number; feeding that number into the sigmoid function gives the probability of a click.
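
In symbols, with weight vector w, intercept b, and feature vector x (the notation is assumed here, it does not come from the lab):

```latex
p(\text{click} \mid x) = \sigma\left(w^\top x + b\right),
\qquad
\sigma(t) = \frac{1}{1 + e^{-t}}
```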

(4b) Log loss

Let us write a function that scores the result

we will use log loss to evaluate the quality of models

from math import log

def computeLogLoss(p, y):
    """Calculates the value of log loss for a given probability and label.

    Note:
        log(0) is undefined, so when p is 0 we need to add a small value (epsilon) to it
        and when p is 1 we need to subtract a small value (epsilon) from it.

    Args:
        p (float): A probability between 0 and 1.
        y (int): A label.  Takes on the values 0 and 1.

    Returns:
        float: The log loss value.
    """
    epsilon = 10e-12
    p = max(epsilon, p)
    p = min(1-epsilon, p)
    if y == 1:
        res = log(p)
    else:
        res = log(1 - p)
    return -1.0 * res

print computeLogLoss(.5, 1)
print computeLogLoss(.5, 0)
print computeLogLoss(.99, 1)
print computeLogLoss(.99, 0)
print computeLogLoss(.01, 1)
print computeLogLoss(.01, 0)
print computeLogLoss(0, 1)
print computeLogLoss(1, 1)
print computeLogLoss(1, 0)

0.69314718056

0.69314718056

0.0100503358535

4.60517018599

4.60517018599

0.0100503358535

25.3284360229

1.00000008275e-11

25.3284359402

You can see how the penalty grows as the prediction moves away from the correct answer.

(4c) Baseline log loss

To evaluate models we need a reference point. The baseline model always predicts a click probability equal to the mean of the labels.

classOneFracTrain = (OHETrainData
    .map(lambda x: x.label)
    .reduce(lambda x, y: x + y)
) / OHETrainData.count()
print classOneFracTrain

logLossTrBase = (OHETrainData
    .map(lambda x: computeLogLoss(classOneFracTrain, x.label))
    .reduce(lambda x, y: x + y)
) / OHETrainData.count()
print 'Baseline Train Logloss = {0:.3f}\n'.format(logLossTrBase)

0.22717773523

Baseline Train Logloss = 0.536

As a predictor, the baseline is no better than the joke about 50% odds: either we meet a dinosaur or we do not.
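
A quick sanity check on the baseline number, as a plain-Python sketch: when the prediction is a constant p equal to the fraction of positive labels, the average log loss on that same data reduces to the binary entropy of p. The function name baselineLogLoss is made up; 0.22717773523 is the class fraction printed by the lab code.

```python
from math import log

def baselineLogLoss(p):
    """Average log loss of always predicting p on data whose mean label is p."""
    # A fraction p of the records has label 1 (penalty -log p),
    # the remaining 1 - p has label 0 (penalty -log(1 - p)).
    return -(p * log(p) + (1.0 - p) * log(1.0 - p))

classOneFrac = 0.22717773523
baseline = baselineLogLoss(classOneFrac)
print('Baseline Train Logloss = {0:.3f}'.format(baseline))
```

This should land on the same 0.536 without touching the dataset.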

(4d) Predicted probability

Let us write a function that returns a prediction from the trained model.

http://en.wikipedia.org/wiki/Sigmoid_function

from math import exp #  exp(-t) = e^-t
import math

def getP(x, w, intercept):
    """Calculate the probability for an observation given a set of weights and intercept.

    Note:
        We'll bound our raw prediction between 20 and -20 for numerical purposes.

    Args:
        x (SparseVector): A vector with values of 1.0 for features that exist in this
            observation and 0.0 otherwise.
        w (DenseVector): A vector of weights (betas) for the model.
        intercept (float): The model's intercept.

    Returns:
        float: A probability between 0 and 1.
    """
    rawPrediction = intercept + x.dot(w)
    # Bound the raw prediction value
    rawPrediction = min(rawPrediction, 20)
    rawPrediction = max(rawPrediction, -20)
    return math.pow(1.0 + exp(-1 * rawPrediction), -1)

trainingPredictions = (OHETrainData
    .map(lambda x: getP(x.features, model0.weights, model0.intercept))
)
print trainingPredictions.take(5)

[0.30262882023911125, 0.10362661997434075, 0.283634247838756, 0.17846102057880114, 0.5389775379218853]

(4e) Evaluate the model

Let us evaluate the model's log loss and compare it with the baseline

def evaluateResults(model, data):
    """Calculates the log loss for the data given the model.

    Args:
        model (LogisticRegressionModel): A trained logistic regression model.
        data (RDD of LabeledPoint): Labels and features for each observation.

    Returns:
        float: Log loss for the data.
    """
    res = (data
        .map(lambda x: computeLogLoss(getP(x.features, model.weights, model.intercept), x.label))
        .reduce(lambda x, y: x + y)
    ) / data.count()
    return res

logLossTrLR0 = evaluateResults(model0, OHETrainData)

print ('OHE Features Train Logloss:\n\tBaseline = {0:.3f}\n\tLogReg = {1:.3f}'
       .format(logLossTrBase, logLossTrLR0))

OHE Features Train Logloss:

Baseline = 0.536

LogReg = 0.457

Not bad for a start.

(4f) Validation log loss

Now let us check the result on the validation dataset

logLossValBase = (OHEValidationData
    .map(lambda x: computeLogLoss(classOneFracTrain, x.label))
    .reduce(lambda x, y: x + y)
) / OHEValidationData.count()

logLossValLR0 = evaluateResults(model0, OHEValidationData)

print ('OHE Features Validation Logloss:\n\tBaseline = {0:.3f}\n\tLogReg = {1:.3f}'
       .format(logLossValBase, logLossValLR0))

OHE Features Validation Logloss:

Baseline = 0.528

LogReg = 0.457

Better than the baseline model.

Visualization 2: ROC curve

We will now visualize how well the model predicts our target. To do this we generate a plot of the ROC curve. The ROC curve shows us the trade-off between the false positive rate and true positive rate, as we liberalize the threshold required to predict a positive outcome. A random model is represented by the dashed line.

Pick whatever threshold suits you.
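
How the points of such a curve come about can be sketched in plain Python. This is an illustration rather than the lab's plotting code: the name rocPoints and the labels and scores are made up, and tied scores are not treated specially.

```python
def rocPoints(labels, scores):
    """Return (false positive rate, true positive rate) pairs, one per threshold.

    Records are sorted by descending score; lowering the threshold past each
    record in turn flips it to "predicted positive".
    """
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    positives = float(sum(labels))
    negatives = float(len(labels) - positives)
    tp, fp = 0, 0
    points = [(0.0, 0.0)]           # threshold above every score
    for score, label in ranked:
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points

# Made-up labels and predicted click probabilities.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.1]
points = rocPoints(labels, scores)
print(points)
```

Each point corresponds to one choice of threshold, so choosing a threshold means choosing a point on the curve.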

Part 5: Reduce feature dimension via feature hashing

Visualization 3: Hyperparameter heat map

And now let us quickly run through the feature hashing method.

(5a) Hash function

from collections import defaultdict
import hashlib

def hashFunction(numBuckets, rawFeats, printMapping=False):
    """Calculate a feature dictionary for an observation's features based on hashing.

    Note:
        Use printMapping=True for debug purposes and to better understand how the hashing works.

    Args:
        numBuckets (int): Number of buckets to use as features.
        rawFeats (list of (int, str)): A list of features for an observation.  Represented as
            (featureID, value) tuples.
        printMapping (bool, optional): If true, the mappings of featureString to index will be
            printed.

    Returns:
        dict of int to float:  The keys will be integers which represent the buckets that the
            features have been hashed to.  The value for a given key will contain the count of the
            (featureID, value) tuples that have hashed to that key.
    """
    mapping = {}
    for ind, category in rawFeats:
        featureString = category + str(ind)
        mapping[featureString] = int(int(hashlib.md5(featureString).hexdigest(), 16) % numBuckets)
    if(printMapping): print mapping
    sparseFeatures = defaultdict(float)
    for bucket in mapping.values():
        sparseFeatures[bucket] += 1.0
    return dict(sparseFeatures)

# Reminder of the sample values:
# sampleOne = [(0, 'mouse'), (1, 'black')]
# sampleTwo = [(0, 'cat'), (1, 'tabby'), (2, 'mouse')]
# sampleThree =  [(0, 'bear'), (1, 'black'), (2, 'salmon')]

sampOneFourBuckets = hashFunction(4, sampleOne, True)
sampTwoFourBuckets = hashFunction(4, sampleTwo, True)
sampThreeFourBuckets = hashFunction(4, sampleThree, True)

# Use one hundred buckets
sampOneHundredBuckets = hashFunction(100, sampleOne, True)
sampTwoHundredBuckets = hashFunction(100, sampleTwo, True)
sampThreeHundredBuckets = hashFunction(100, sampleThree, True)

print '\t\t 4 Buckets \t\t\t 100 Buckets'
print 'SampleOne:\t {0}\t\t {1}'.format(sampOneFourBuckets, sampOneHundredBuckets)
print 'SampleTwo:\t {0}\t\t {1}'.format(sampTwoFourBuckets, sampTwoHundredBuckets)
print 'SampleThree:\t {0}\t {1}'.format(sampThreeFourBuckets, sampThreeHundredBuckets)

{'black1': 2, 'mouse0': 3}

{'cat0': 0, 'tabby1': 0, 'mouse2': 2}

{'bear0': 0, 'black1': 2, 'salmon2': 1}

{'black1': 14, 'mouse0': 31}

{'cat0': 40, 'tabby1': 16, 'mouse2': 62}

{'bear0': 72, 'black1': 14, 'salmon2': 5}

4 Buckets 100 Buckets

SampleOne: {2: 1.0, 3: 1.0} {14: 1.0, 31: 1.0}

SampleTwo: {0: 2.0, 2: 1.0} {40: 1.0, 16: 1.0, 62: 1.0}

SampleThree: {0: 1.0, 1: 1.0, 2: 1.0} {72: 1.0, 5: 1.0, 14: 1.0}

The toy example shows how the hash function works.

In effect, it returns the dict that we pass to SparseVector as its second argument.
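For anyone on Python 3 without Spark at hand, the same hashing trick can be sketched with the standard library alone. The name hash_features is made up for this illustration, and the explicit UTF-8 encoding is an assumption of the sketch (hashlib.md5 only accepts bytes in Python 3).

```python
import hashlib
from collections import defaultdict


def hash_features(num_buckets, raw_feats):
    """Hash (featureID, value) tuples into num_buckets buckets and count hits per bucket."""
    buckets = defaultdict(float)
    for ind, category in raw_feats:
        feature_string = category + str(ind)
        # md5 needs bytes in Python 3, hence the explicit encode (assumption of this sketch)
        digest = hashlib.md5(feature_string.encode("utf-8")).hexdigest()
        buckets[int(digest, 16) % num_buckets] += 1.0
    return dict(buckets)


sample_two = [(0, "cat"), (1, "tabby"), (2, "mouse")]
hashed_4 = hash_features(4, sample_two)
hashed_100 = hash_features(100, sample_two)
print(hashed_4)
print(hashed_100)
```

The counts always add up to the number of input features. With few buckets, several features can land in the same bucket, so there may be fewer keys than features.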

(5b) Creating hashed features

We use the hash function to prepare the datasets for the task.

from pyspark.mllib.linalg import SparseVector
from pyspark.mllib.regression import LabeledPoint

def parseHashPoint(point, numBuckets):
    """Create a LabeledPoint for this observation using hashing.

    Args:
        point (str): A comma separated string where the first value is the label and the rest are
            features.
        numBuckets: The number of buckets to hash to.

    Returns:
        LabeledPoint: A LabeledPoint with a label (0.0 or 1.0) and a SparseVector of hashed
            features.
    """
    # split the row on commas and trim spaces
    rowFeats = [x.strip() for x in point.split(',')]
    # the first value is the label
    label = float(rowFeats[0])
    # the rest are features, indexed from 0: [(0, 'mouse'), (1, 'black')]
    features = [(idx, x) for idx, x in enumerate(rowFeats[1:])]
    dictHashed = hashFunction(numBuckets, features, False)
    hashedFeats = SparseVector(numBuckets, dictHashed)
    res = LabeledPoint(label, hashedFeats)
    return res

numBucketsCTR = 2 ** 15
hashTrainData = rawTrainData.map(lambda x: parseHashPoint(x, numBucketsCTR))
hashTrainData.cache()
hashValidationData = rawValidationData.map ...
hashValidationData.cache()
hashTestData = rawTestData.map ...
hashTestData.cache()

print hashTrainData.take(1)

[LabeledPoint(0.0, (32768,[1305,2883,3807,4814,4866,4913,6952,7117,9985,10316,11512,11722,12365,13893,14735,15816,16198,17761,19274,21604,22256,22563,22785,24855,25202,25533,25721,26487,26656,27668,28211,29152,29402,29873,30039,31484,32493,32708],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0]))]

A label and a SparseVector make up one record of the dataset.

(5c) Sparsity

How sparse the data is. Who cares? Let's skip the details and go straight to the numbers:

Average OHE Sparsity: 1.6717677e-04

Average Hash Sparsity: 1.1805561e-03
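These two numbers can be read as the average fraction of non-zero entries per record. A minimal sketch of that calculation, assuming this definition (the lab's own helper is not shown in these notes, so average_sparsity and its arguments are hypothetical):

```python
def average_sparsity(records, dim):
    """Average fraction of non-zero entries, given each record's list of non-zero indices."""
    total_nonzero = sum(len(indices) for indices in records)
    return total_nonzero / float(dim * len(records))


# Two toy records in a 10-dimensional space, with 2 and 3 non-zero entries
toy = [[0, 4], [1, 2, 9]]
print(average_sparsity(toy, 10))  # 5 non-zeros over 20 slots = 0.25
```

Under this reading, a hash sparsity of about 1.18e-03 with 2 ** 15 buckets works out to roughly 38.7 non-zero entries per record, which is consistent with the 38 indices in the sample record printed in (5b).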

(5d) Logistic model with hashed features

We will train logistic regression, picking hyperparameters via grid search.

numIters = 500
regType = 'l2'
includeIntercept = True

# Initialize variables using values from initial model training
bestModel = None
bestLogLoss = 1e10

stepSizes = (1, 10)
regParams = (1e-6, 1e-3)
for stepSize in stepSizes:
    for regParam in regParams:
        model = (LogisticRegressionWithSGD
                 .train(hashTrainData, numIters, stepSize, 
                        regParam=regParam, regType=regType,
                        intercept=includeIntercept))
        logLossVa = evaluateResults(model, hashValidationData)
        print ('\tstepSize = {0:.1f}, regParam = {1:.0e}: logloss = {2:.3f}'
               .format(stepSize, regParam, logLossVa))
        if (logLossVa < bestLogLoss):
            bestModel = model
            bestLogLoss = logLossVa

print ('Hashed Features Validation Logloss:\n\tBaseline = {0:.3f}\n\tLogReg = {1:.3f}'
       .format(logLossValBase, bestLogLoss))

stepSize = 1.0, regParam = 1e-06: logloss = 0.470

stepSize = 1.0, regParam = 1e-03: logloss = 0.470

stepSize = 10.0, regParam = 1e-06: logloss = 0.448

stepSize = 10.0, regParam = 1e-03: logloss = 0.450

Hashed Features Validation Logloss:

Baseline = 0.528

LogReg = 0.448

Visualization 3: Hyperparameter heat map

We will now perform a visualization of an extensive hyperparameter search. Specifically, we will create a heat map where the brighter colors correspond to lower values of logLoss

(5e) Evaluate on the test set

Let's check our best model on the test dataset.

logLossTest = evaluateResults(bestModel, hashTestData)

# Log loss for the baseline model
logLossTestBaseline = (hashTestData
    .map(lambda x: computeLogLoss(classOneFracTrain, x.label))
    .sum()
) / hashTestData.count()

print ('Hashed Features Test Log Loss:\n\tBaseline = {0:.3f}\n\tLogReg = {1:.3f}'
       .format(logLossTestBaseline, logLossTest))

Hashed Features Test Log Loss:

Baseline = 0.537

LogReg = 0.456
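computeLogLoss comes from an earlier part of the lab and is not shown in this post. As a reminder of what is being averaged here, below is a minimal sketch of per-point log loss. The names log_loss_point and mean_log_loss, and the epsilon clipping, are assumptions of this sketch, not the lab's code.

```python
from math import log


def log_loss_point(p, y, eps=1e-11):
    """Log loss for predicted probability p of class 1 and true label y (0 or 1)."""
    # Clip p away from 0 and 1 so that log() never receives zero
    p = min(max(p, eps), 1.0 - eps)
    return -log(p) if y == 1 else -log(1.0 - p)


def mean_log_loss(preds_and_labels):
    """Average log loss over (p, y) pairs."""
    losses = [log_loss_point(p, y) for p, y in preds_and_labels]
    return sum(losses) / len(losses)


# A constant prediction of 0.5 costs log(2) per point, whatever the label
print(mean_log_loss([(0.5, 1), (0.5, 0)]))
```

A baseline that always predicts the training-set fraction of class 1 is scored the same way, which is why a model only counts as useful if its log loss comes in below that number.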

There, the fourth lab is done.

We transformed the datasets with feature extraction, using one-hot-encoding and hashing.

We trained logistic regression models, evaluated their accuracy and made predictions.

The idea, on the whole, is clear.

Next time: the materials of week five of the course. The last one.

WEEK 5: Principal Component Analysis and Neuroimaging.

Topics: Introduction to neuroscience and neuroimaging data, exploratory data analysis, principal component analysis (PCA) formulations and solution, distributed PCA

original post http://vasnake.blogspot.com/2015/11/week-4-lab-4-part-3.html

2015-11-03

Week 4, Lab 4

The Scalable Machine Learning course. Hadoop, Apache Spark, Python, ML, all that stuff.

I continue taking notes on the course I have completed. Week 4.

Last time it was about one-hot-encoding for turning features into numbers, and about hashing as a way to reduce the dimensionality of a dataset.

Time to get hands-on with the topics covered. The lab.

CTR PREDICTION PIPELINE LAB PREVIEW

In this segment, we'll provide an overview of the click-through rate prediction pipeline that you'll be working on in this week's Spark coding lab. The goal of the lab is to implement a click-through rate prediction pipeline using various techniques that we've discussed in this week's lecture.

The raw data consists of a subset of data from a Kaggle competition sponsored by Criteo. This data includes 39 features describing users, ads, and publishers. Many of these features contain a large number of categories.

You'll next need to extract features to feed into a supervised learning model. And feature extraction is the main focus of this lab. You'll create OHE features, as well as hashed features, and store these features using a sparse representation.

Feature extraction is not as scary as it sounds. It is a plain transformation: text data into numbers, combined features, and so on.

Given a set of features, either OHE or hashed features, you will use MLlib to train logistic regression models. You will then perform hyperparameter tuning to search for a good regularization parameter, evaluating the results via log loss, and visualizing the results of your grid search.

Seems clear enough.

OK, grab the notebook

https://raw.githubusercontent.com/spark-mooc/mooc-setup/master/ML_lab4_ctr_student.ipynb

Start the VM

valik@snafu:~$  pushd ~/sparkvagrant/
valik@snafu:~/sparkvagrant$  vagrant up

And off we go: http://localhost:8001/tree

The same notebook on GitHub

https://github.com/spark-mooc/mooc-setup/blob/master/ML_lab4_ctr_student.ipynb

Plan of action:

Part 1: Featurize categorical data using one-hot-encoding (OHE)
Part 2: Construct an OHE dictionary
Part 3: Parse CTR data and generate OHE features: Visualization 1: Feature frequency
Part 4: CTR prediction and logloss evaluation: Visualization 2: ROC curve
Part 5: Reduce feature dimension via feature hashing: Visualization 3: Hyperparameter heat map

For reference

https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD

http://docs.scipy.org/doc/numpy/reference/index.html

Part 1: Featurize categorical data using one-hot-encoding (OHE)

(1a) One-hot-encoding

First we build the OHE dictionary by hand. As a warm-up, take a dataset of three records about three animals.

# Data for manual OHE
# Note: the first data point does not include any value for the optional third feature
sampleOne = [(0, 'mouse'), (1, 'black')]
sampleTwo = [(0, 'cat'), (1, 'tabby'), (2, 'mouse')]
sampleThree =  [(0, 'bear'), (1, 'black'), (2, 'salmon')]
sampleDataRDD = sc.parallelize([sampleOne, sampleTwo, sampleThree])

sampleOHEDictManual = {}
sampleOHEDictManual[(0,'bear')] = 0
sampleOHEDictManual[(0,'cat')] = 1
sampleOHEDictManual[(0,'mouse')] = 2
sampleOHEDictManual[(1,'black')] = 3
sampleOHEDictManual[(1,'tabby')] = 4
sampleOHEDictManual[(2,'mouse')] = 5
sampleOHEDictManual[(2,'salmon')] = 6

Seven categories in total.

(1b) Sparse vectors

https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.linalg.SparseVector

We need to practice creating sparse vectors. By hand, for now.

import numpy as np
from pyspark.mllib.linalg import SparseVector
# TODO: Replace <FILL IN> with appropriate code
aDense = np.array([0., 3., 0., 4.])
aSparse = SparseVector(len(aDense), enumerate(aDense))

bDense = np.array([0., 0., 0., 1.])
bSparse = SparseVector(len(bDense), enumerate(bDense))

w = np.array([0.4, 3.1, -1.4, -.5])
print aDense.dot(w)
print aSparse.dot(w)
print bDense.dot(w)
print bSparse.dot(w)

Tellingly, even though this solution satisfies the test (the dot products give identical results), it is wrong.

The right way is:

aDense = np.array([0., 3., 0., 4.])
aSparse = SparseVector(len(aDense), {1: 3., 3: 4.})
bDense = np.array([0., 0., 0., 1.])
bSparse = SparseVector(len(bDense), [(3, 1.)])

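Why? Just because. See the definition of SparseVector: it expects only the non-zero entries, while enumerate() hands it every position, zeros included, so nothing is actually sparse.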
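The difference between the two solutions can be seen without Spark. The following is a plain-Python illustration that uses dicts in place of SparseVector; sparse_dot is a helper invented for this sketch.

```python
dense = [0.0, 3.0, 0.0, 4.0]
w = [0.4, 3.1, -1.4, -0.5]

# What enumerate() feeds in: every position, zeros included
all_entries = dict(enumerate(dense))
# What a sparse representation should hold: non-zero positions only
nonzero_entries = {i: v for i, v in enumerate(dense) if v != 0.0}


def sparse_dot(entries, weights):
    """Dot product of an {index: value} mapping with a dense weight list."""
    return sum(v * weights[i] for i, v in entries.items())


print(len(all_entries), len(nonzero_entries))  # 4 2
print(sparse_dot(all_entries, w), sparse_dot(nonzero_entries, w))
```

Both mappings give the same dot product, which is why the test passes either way, but only the second one saves any memory.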

(1c) OHE features as sparse vectors

Now that the idea is clear, let's create SparseVectors for the toy animal dataset.

If the idea is not clear, here it is: there are seven categories, so the vector has length 7. Initially it is all zeros. For each record we take such a vector and set ones at the positions given by the dictionary, where the dictionary key is a raw feature of the record.

Example: (animal, mouse) maps to 2 in the dictionary. So in the SparseVector we put a one at position 2 (counting from 0).

# sampleOHEDictManual[(0,'bear')] = 0
# sampleOHEDictManual[(0,'cat')] = 1
# sampleOHEDictManual[(0,'mouse')] = 2
# sampleOHEDictManual[(1,'black')] = 3
# sampleOHEDictManual[(1,'tabby')] = 4
# sampleOHEDictManual[(2,'mouse')] = 5
# sampleOHEDictManual[(2,'salmon')] = 6

# sampleOne = [(0, 'mouse'), (1, 'black')] = 2, 3
sampleOneOHEFeatManual = SparseVector(7, {2: 1.0, 3: 1.0})
# sampleTwo = [(0, 'cat'), (1, 'tabby'), (2, 'mouse')] = 1, 4, 5
sampleTwoOHEFeatManual = SparseVector(7, {1: 1.0, 4: 1.0, 5: 1.0})
# sampleThree =  [(0, 'bear'), (1, 'black'), (2, 'salmon')] = 0, 3, 6
sampleThreeOHEFeatManual = SparseVector(7, {0: 1.0, 3: 1.0, 6: 1.0})

Easy, right? Yet I fiddled with it for quite a while before I figured out how to write SparseVectors properly.

(1d) Define a OHE function

Let's write a function that returns a SparseVector for one record of the raw dataset.

def oneHotEncoding(rawFeats, OHEDict, numOHEFeats):
    """Produce a one-hot-encoding from a list of features and an OHE dictionary.

    Note:
        You should ensure that the indices used to create a SparseVector are sorted.

    Args:
        rawFeats (list of (int, str)): The features corresponding to a single observation.  Each
            feature consists of a tuple of featureID and the feature's value.
            (e.g. sampleOne) sampleOne = [(0, 'mouse'), (1, 'black')]
        OHEDict (dict): A mapping of (featureID, value) to unique integer.
            OHE Dictionary example:
                sampleOHEDictManual[(0,'bear')] = 0
                ...
                sampleOHEDictManual[(1,'black')] = 3
                ...
                sampleOHEDictManual[(2,'salmon')] = 6
        numOHEFeats (int): The total number of unique OHE features (combinations of featureID and
            value).

    Returns:
        SparseVector: A SparseVector of length numOHEFeats with indices equal to the unique
            identifiers for the (featureID, value) combinations that occur in the observation and
            with values equal to 1.0.
            e.g. sampleOneOHEFeatManual = SparseVector(7, {2: 1.0, 3: 1.0})
    """
    spDict = {}
    for feat in rawFeats:
        # look up the OHE index; None if the feature is not in the dictionary
        key = OHEDict.get(feat)
        if key is not None:
            spDict[key] = 1.0
    res = SparseVector(numOHEFeats, spDict)
    return res

# Calculate the number of features in sampleOHEDictManual
numSampleOHEFeats = len(sampleOHEDictManual)

# Run oneHotEncoding on sampleOne
sampleOneOHEFeat = oneHotEncoding(sampleOne, sampleOHEDictManual, numSampleOHEFeats)
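To check the lookup logic without Spark, here is a plain-Python stand-in that returns just the sorted indices that would be set to 1.0. The name one_hot_indices is hypothetical, and the dictionary is the manual one from (1a).

```python
def one_hot_indices(raw_feats, ohe_dict):
    """Return the sorted OHE indices for one observation, skipping unknown features."""
    return sorted(ohe_dict[feat] for feat in raw_feats if feat in ohe_dict)


ohe_dict = {
    (0, "bear"): 0, (0, "cat"): 1, (0, "mouse"): 2,
    (1, "black"): 3, (1, "tabby"): 4,
    (2, "mouse"): 5, (2, "salmon"): 6,
}
sample_one = [(0, "mouse"), (1, "black")]
print(one_hot_indices(sample_one, ohe_dict))  # [2, 3]
# A feature missing from the dictionary is silently dropped
print(one_hot_indices([(0, "dog"), (1, "tabby")], ohe_dict))  # [4]
```

Skipping unknown features matters later, when validation or test data contains categories that never occurred in the training set.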

(1e) Apply OHE to a dataset

Well, this one is trivial: apply the function to the raw dataset and get an encoded dataset, ready to be fed into logistic regression.

sampleOHEData = sampleDataRDD.map(lambda x: oneHotEncoding(x, sampleOHEDictManual, numSampleOHEFeats))
print sampleOHEData.collect()

[SparseVector(7, {2: 1.0, 3: 1.0}), SparseVector(7, {1: 1.0, 4: 1.0, 5: 1.0}), SparseVector(7, {0: 1.0, 3: 1.0, 6: 1.0})]

Part 2: Construct an OHE dictionary

(2a) Pair RDD of (featureID, category)

We need to automate the creation of the dictionary. So far we have written it out by hand.

To begin with, create an RDD of the unique feature values from the raw list of lists.

create an RDD of distinct (featureID, category) tuples

https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.flatMap

https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.distinct

# sampleOne = [(0, 'mouse'), (1, 'black')]
# sampleTwo = [(0, 'cat'), (1, 'tabby'), (2, 'mouse')]
# sampleThree =  [(0, 'bear'), (1, 'black'), (2, 'salmon')]
# sampleDataRDD = sc.parallelize([sampleOne, sampleTwo, sampleThree])
sampleDistinctFeats = (sampleDataRDD
                       .flatMap(lambda x: x)  # flatten the list of lists
                       .distinct())           # drop duplicates

(2b) OHE Dictionary from distinct features

Now we can generate the dictionary by mapping the unique categories to consecutive numbers.

https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.zipWithIndex

https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.collectAsMap

sampleOHEDict = (sampleDistinctFeats
    .zipWithIndex()   # pair each feature with an index
    .collectAsMap())  # collect into a dict
print sampleOHEDict

{(2, 'mouse'): 0, (0, 'cat'): 1, (0, 'bear'): 2, (2, 'salmon'): 3, (1, 'tabby'): 4, (1, 'black'): 5, (0, 'mouse'): 6}

(2c) Automated creation of an OHE dictionary

Putting the Lego together: write a function that returns the dictionary for a raw dataset (the raw dataset is a list of lists of tuples).

def createOneHotDict(inputData):
    """Creates a one-hot-encoder dictionary based on the input data.

    Args:
        inputData (RDD of lists of (int, str)): An RDD of observations where each observation is
            made up of a list of (featureID, value) tuples.

    Returns:
        dict: A dictionary where the keys are (featureID, value) tuples and map to values that are
            unique integers.
    """
    distinctFeats = (inputData
        .flatMap(lambda x: x)  # flatten the list of lists
        .distinct())           # drop duplicates
    res = (distinctFeats
        .zipWithIndex()        # pair each feature with an index
        .collectAsMap())       # collect into a dict
    return res

sampleOHEDictAuto = createOneHotDict(sampleDataRDD)
print sampleOHEDictAuto

{(2, 'mouse'): 0, (0, 'cat'): 1, (0, 'bear'): 2, (2, 'salmon'): 3, (1, 'tabby'): 4, (1, 'black'): 5, (0, 'mouse'): 6}
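The same four steps (flatten, deduplicate, index, collect) can be tried in plain Python. This is a sketch, not the lab's solution: create_one_hot_dict is a made-up name, and the sorting step is an addition to make the numbering reproducible, whereas the order that comes out of distinct() on an RDD is not something to rely on.

```python
def create_one_hot_dict(observations):
    """Map each distinct (featureID, value) tuple to a unique integer index."""
    # flatMap + distinct: flatten the list of lists and drop duplicates
    distinct_feats = {feat for obs in observations for feat in obs}
    # zipWithIndex + collectAsMap: number the features and build a dict.
    # Sorting is an addition of this sketch, for a reproducible numbering.
    return {feat: idx for idx, feat in enumerate(sorted(distinct_feats))}


sample_data = [
    [(0, "mouse"), (1, "black")],
    [(0, "cat"), (1, "tabby"), (2, "mouse")],
    [(0, "bear"), (1, "black"), (2, "salmon")],
]
ohe_auto = create_one_hot_dict(sample_data)
print(len(ohe_auto))  # 7
print(ohe_auto[(0, "mouse")], ohe_auto[(2, "mouse")])  # 2 5
```

Because of the sorting, this version happens to reproduce the manual dictionary from (1a). The Spark output above numbers the same seven keys differently, and that is fine: any one-to-one numbering works as long as the same dictionary is used everywhere.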

That's enough for today, to be continued tomorrow.

Next on the program:

Part 3: Parse CTR data and generate OHE features

Visualization 1: Feature frequency

original post http://vasnake.blogspot.com/2015/11/week-4-lab-4.html
