VSnake notes: Week 5

Курс Scalable Machine Learning. Hadoop, Apache Spark, Python, ML -- вот это всё.

Продолжаю конспектировать пройденный курс. Неделя 5.

В прошлый раз закончили лабораторку №4. Далее: пятая неделя, лекции.

WEEK 5: Principal Component Analysis and Neuroimaging.

Topics: Introduction to neuroscience and neuroimaging data, exploratory data analysis, principal component analysis (PCA) formulations and solution, distributed PCA.

В целом, если не считать попытки заинтересовать народ темой «а вот смотрите, какие клевые картинки получаются, когда мы мозговую деятельность снимаем», вся история пятой недели про PCA – Principal Component Analysis. И про кластеризацию тоже поговорили.

Как мы помним, PCA эта такая хитрая математика, которая позволяет отбросить из датасета малозначительные компоненты, сократив размерность набора данных (dimensionality reduction).

Другими словами, если до этого мы занимались проблемами Supervised Learning, то теперь займемся Unsupervised методами.

В большинстве случаев, кластеризация рассматривается как средство снизить размерность, упростить представление данных.

Рассмотрим концепцию снижения размерности на примере размеров обуви:

To introduce the concept of dimensionality reduction,

we're going to consider a simple toy

example involving shoe sizes.

If we consider an abstract size space,

on the right, in which the true size lives,

then we can think of both the American and European sizes

as just different linear functions

of that underlined space.

If the size space ranges from one to two,

American sizes are the result of multiplying by six,

and European size are the result of multiplying by six

and adding 33.

When performing dimensionality reduction,

we're just going in the opposite direction.

In physics, for example, the space on the right

is referred to as the state space of a system.

Другой пример: записи активности нейронов пиявки в моменты когда она ползает или плавает. Множество сигналов, множество нейронов были сведены до трех компонент. Стало возможно построить трехмерный график

В общем, с мотивацией понятно: данные избыточны, данные запутаны, формат хранения затрудняет обработку (например картинки, где информация закодирована цветом пикселя). Надо откинуть лишнее, оставить нужное и преобразовать в числа.

В чем же заключается идея? Вернемся к примеру с размерами обуви

Мы выбираем направление (определяем одномерное пространство) и проецируем наши данные на этот вектор.

Задача заключается в том, чтобы выбрать такой вектор, для которого сумма Евклидовых расстояний до которого (от точек данных) была бы минимальной.

our goal is to minimize

the Euclidean distances between our original points

and their projections

PCA как раз и находит такие проекции, для которых расстояние между точками и их проекциями минимально.

Хотя это и похоже на линейную регрессию, алгоритмы весьма разные.

Linear regression aims to predict y from x.

As shown in the picture on the left,

we have a single feature, not two features,

as in the PCA picture.

We fit a line to the single feature

in order to minimize distances to y.

And thus we compute errors as vertical lines.

Теперь зайдем с другой стороны. Подумаем о вариативности. Для размеров обуви максимальный разброс наблюдается вдоль синей линии, не так ли?

Now let's think about another way we could potentially

better represent our data.

We do this by noting that in order to identify patterns

in our data, we often look for variation across observations.

So it seems reasonable to find a succinct representation

that best captures variation in our initial data.

It turns out that the PCA solution represents

the original data in terms of it's directions

of maximal variation.

Пора заняться матрицами. Математика, лежащая за PCA

Исходные данные, наблюдения:

We represent this data in an n by d matrix,

which we call X. Note that each row of this matrix

corresponds to an observation.

Specifically, we aim to find a k dimensional representation

for each of our n data points, where k is much smaller than d

The matrix Z stores the k dimensional representations

of our data points.

By defining Z equals XP, we are assuming

that Z is a linear combination of the original data,

and this linearity assumption simplifies our problem.

Сделано предположение, что отношение между исходными данными и редуцированными линейное. Это важно.

PCA это задача оптимизации, где мы ищем такую матрицу P, для которой выполняются требования вариативности.

We impose specific variance and covariance constraints on P related

to the variance of our original data.

Что такое вариации и ковариации? Это просто: это отклонение от среднего значения

По ходу, предварительная обработка данных приводит к тому, что среднее выводят в 0. Тогда запись (и вычисления) упрощаются.

Ковариация это то же самое, только для произведения двух фич

Сумма произведений, ничего не напоминает? Ага, умножение матриц.

Свойства ковариантности:

– large positive covariance indicates that the two features are highly correlated

– large negative covariance indicates that the two features are highly anticorrelated

– covariance of zero means that the two features are uncorrelated

Additionally, if the covariance between the two features equals each of their squared variances, then the two features are the same

The covariance matrix is d by d matrix where each entry stores pairwise covariance information about the d features

In particular, the diagonal entries of this matrix

equal the sample variances of the features,

and the ijth entries equal the sample covariance

between the ith and jth features.

Как это соотносится с задачей оптимизации PCA?

PCA requires that the features in our reduced dimensions have no correlation

Что можно выразить как: для сжатого датасета матрица ковариантности должна содержать нули везде, кроме диагонали.

Second, PCA requires that the features in the reduced dimension maximize variance, which means that the features are ordered by their variance.

Что выражается в том, что верхнее левое значение в матрице ковариантности для нового сжатого датасета будет максимальным. А правое нижнее – наименьшим. Не в том смысле, что минимальным, а в том, что диагональ отсортирована.

Входят эйгенвекторы – eigenvector.

Решение задачи кроется в матрицах ковариантности

Линейная алгебра нам в помощь.

Остается решить вопрос: какое значение к (размерность сжатого датасета) следует выбрать?

Для визуализации – два или три измерения с наибольшей вариативностью.

Для всякого числового анализа – столько, чтобы захватить «наибольшую вариативность)

In particular, given the eigenvalues

of a sample covariance matrix, we

can easily compute the fraction of retained variance

for any given k by computing the sum of the top k eigenvalues

and dividing the sum by the sum of all of the eigenvalues.

Ограничения и предположения метода:

First, PCA makes linearity assumptions,

and also assumes that the principal components

are orthogonal.

Second, ..., we've assumed that our data is centered, or, in other words,

that our features have zero mean, as this

simplifies our notation.

And finally, since PCA chooses directions of max variance,

it depends on the relative scale of each feature.

Ну и хватит на сегодня.

Завтра продолжим

PCA ALGORITHM

...

original post http://vasnake.blogspot.com/2015/11/week-5.html

Tools

VSnake notes

2015-11-05

Week 5

Комментариев нет:

Отправить комментарий

Архив блога

Ярлыки

Обо мне

Links