VSnake notes: Week 5, part 2

Курс Scalable Machine Learning. Hadoop, Apache Spark, Python, ML -- вот это всё.

Продолжаю конспектировать пройденный курс. Неделя 5, продолжение.

В прошлый раз начали разбирать теорию PCA.

Далее: пятая неделя, лекции, алгоритмы вычисления PCA в распределенной среде, бигдата.

WEEK 5: Principal Component Analysis and Neuroimaging.

PCA ALGORITHM

Освежим в памяти, что такое ортогональные и ортонормальные векторы:

Orthogonal vectors are simply perpendicular vectors,

and one nice property of orthogonal vectors

is that their dot product always equals 0.

Note that a unit vector is simply

a vector whose Euclidean norm equals one.

Orthonormal … These are vectors that are orthogonal and also

have unit norm.

Going back to our example, since a and b are

both unit norm and orthogonal, they

are also orthonormal vectors.

In contrast, b and d, even though they're orthogonal,

are not orthonormal since the Euclidean norm

of d is greater than one.

Now that we're equipped with a better understanding

of these concepts, we could discuss

the interpretation of PCA as an iterative algorithm.

Понятно, ортонормальные векторы, ага.

Теперь как формулируется итеративный алгоритм PCA:

imagine that we want a k dimensional

representation of our data.

... we can imagine taking the following iterative

approach.

At the i-th iteration we aim to find

the direction of maximal variance in the data,

subject to the constraint that this direction is of unit norm

and that it is orthogonal to all directions we

chose in previous iterations.

We can then project our data onto this direction,

and the locations along this direction

become the i-th feature in our new representation.

As a result of this algorithm, we

have found k unit vectors that are pairwise orthogonal,

and thus, these k directions are orthonormal.

DISTRIBUTED PCA

Так как же нам найти Принципиальные Компоненты и получить редуцированный датасет?

4 шага:

отцентрировать данные,

вычислить матрицу ковариантности,

найти эйгенвекторы,

умножить данные на эйгенвекторы:

we'll assume that our raw data is not

centered. And so the first step in PCA involves centering our data.

Or in other words, computing the mean of each feature

and subtracting these feature means from each original data

point.

We'll assume that our centered data is stored in an n

by d matrix which we'll call x.

Once we've centered our data, we next

need to compute the sample covariance

matrix, or the scatter matrix.

Note that the scatter matrix is simply the sample covariance

matrix without dividing by n.

PCA returns the same solution in either case,

as discussed in the previous segment.

And we'll work with the scatter matrix in this segment

to simplify our notation.

As we previously discussed, the principal components

of our data equal the eigenvectors of the sample

covariance matrix.

So in the third step, we compute these eigenvectors

by performing an eigendecomposition.

Finally, in order to obtain our k dimensional representation,

we need to multiply our original data by the top k eigenvectors

to compute the PCA scores.

Distributed PCA, Big n, small d

Как посчитать PCA на практике, с учетом того, что количество фич невелико (small d).

… we must center our data.

And to do this, we must compute the mean of each feature.

There are d features, and thus d feature means.

And we define the vector m to be the d dimensional vector whose

i-th component is the mean of the i-th feature.

We can compute the mean vector via a simple

reduce operation whereby we sum all of the data points

together.

After we have performed this sum,

we can compute the mean vector, m,

by simply dividing by the number of data points, n.

After computing m on the driver, we

must send it back to the workers so that each data

point can be centered.

Together, the reduce operation to compute m

and the subsequent communication are inexpensive,

as they are linear in d in terms of local storage, computation,

and communication.

Once each worker has access to m,

we can then perform a map operation

to create the centered data points, which

simply involves subtracting m from each original data point.

… we next need to compute the scatter matrix.

As in the case of closed-form linear regression,

we can efficiently perform this computation in a distributed

fashion by using outer products.

Матрица ковариантности = X transpose * X, а это это сумма произведений строк датасета.

Очень легко ложится на модель распределенных вычислений.

We'll represent x visually by its rows or data points,

and then we can express this matrix multiplication

as a sum of outer products where each outer product involves

only a single row of x, or a single data point.

Also recall that, in the previous step,

we computed our center data and we stored it

in a data parallel fashion.

So with this context, we can now compute the scatter matrix

as a simple MapReduce operation.

In the map step, we take each point

and compute its outer product with itself.

This is a computational bottleneck

in our PCA computation, but it's distributed

across multiple workers.

In the reduce step, we simply sum over

all of these outer products.

This requires quadratic storage and computation in d,

both of which are feasible since we're assuming that d is small.

Once we've computed the scatter matrix,

we need to perform its eigendecomposition.

Since we want to compute a k dimensional representation

for our data, we're interested in the top k

principal components of our data, which we know

are the top k eigenvectors of a scatter matrix.

We represent these principal components by the d by k matrix P.

As a result of eigendecomposition,

we have access to the top k eigenvectors on the driver.

But now we need to communicate them to the workers

so that they can compute these k dimensional representations

for the data points.

This requires O of dk communication,

which is the communication bottleneck in this algorithm.

Additionally, the eigendecomposition

generally requires cubic time and quadratic space.

But if we only want the top k eigenvalues and eigenvectors,

we can reduce this time complexity to O of d squared k.

Finally, now that each worker has the principal component

stored locally, we can compute the k dimensional

representation for each point via a simple matrix vector

multiply.

This process requires O of dk local computation,

and can be performed via a simple map operation.

Простой алгоритм, но можно упереться в количество фич, превышающее возможности Spark кластера.

Не беда, есть способ проделать PCA в Big N and Big D setting.

Distributed PCA: big d, big n

Ключевая идея заключается в хитром, итеративном способе вычисления ейгенвекторов.

where both N and D are large,

we can only afford storage, computation, and communication that are linear in N & D.

So we can't locally store or operate on the scatter matrix.

Instead, we'll introduce an iterative approach.

Our iterative approach relies on performing

a sequence of matrix vector products

to compute the top K eigenvectors of the scatter matrix.

The most common methods for doing this are

Krylov subspace and random projection based methods.

And Spark's MLlib in particular relies on Krylov subspace methods.

In this approach, the driver provides the workers

with a D dimensional vector on each iteration,

and requires that the workers left multiply this vector

by the scatter matrix.

Overall, the algorithm requires O of K iterations,

or passes over the data, and O of DK local storage.

And notably, this algorithm computes the PCA solution

without ever explicitly computing the covariance

of a scatter matrix.

The first step involves the algorithm

communicating the D dimensional vector, vi, to all workers.

Next (step 2, see below), we need to multiply the scatter matrix by the vector vi

in a distributed fashion.

And we'll denote the result of this matrix multiplication

as the D dimensional vector qi.

The driver then uses qi to update its estimate of p,

which are the top K eigenvectors of the scatter matrix.

We repeat this process, o of k for o of k iterations,

until we converge on our answer for p

Also, note that we're using the letter

i here to denote the iteration number.

And so the vector vi communicated

to the workers in step one changes on each iteration.

step two is interesting.

The challenge is that we want to perform this matrix

multiplication without ever having to explicitly compute

the scatter matrix, or even having to store copies

of both X and X transpose.

And so we need to be a bit clever in how

we perform this computation.

And it turns out that by carefully breaking

this multiplication into two steps,

we're able to work to achieve our desired goal.

We first compute bi, which is an n dimensional vector

equal to x times vi.

We then multiply X transpose by this intermediate result

to obtained qi.

we can efficiently compute steps one

and step two by only storing X in a data parallel fashion.

Remember that bi is an n dimensional vector,

and each component of this vector is simply equal to the

dot product between a row of X and the vector vi.

Since each row of X is an observation,

and since each worker is storing vi locally,

we can simply compute the dot product between each data point

and vi, and then concatenate the results to obtain bi.

We can first perform a map operation, in which we compute

the dot product of each data point and the vector vi.

This requires O of d storage to store vi, and O of nd

distributed computation to compute the dot products.

Next, in the reduce step, we can simply concatenate the results.

Finally, each worker will need to store bi locally

in order to compute the overall result, qi, in the next step.

So we're going to need to communicate bi to each worker.

So overall, this reduce step, combined with the communication of qi,

requires linear time, space, and communication in terms of n.

The following spark code snippet succinctly

summarizes what we've done.

Starting with an RDD of training data,

we first compute a dot product in the map step,

then collect the results, which creates a list.

And finally, we convert this list into a NumPy array.

Now let's consider the second step of this two step process.

In this step, our goal is to compute the product of X transpose and bi.

By inspecting this product, we can interpret this multiplication

as the sum of rescaled data points.

In particular, for the jth data point, we can multiply it by the jth component of the vector bi,

which gives us a new d dimensional vector.

We can similarly rescale each of the other data points,

which gives us a total of n rescaled d dimensional vectors.

Taking the sum of these rescaled vectors gives us qi,

which is our desired result.

In the map step, we simply rescale each vector

by its corresponding component of bi.

Storing bi requires O of n storage,

and computing the rescaled vectors

requires a pass over all of our data,

and thus takes O of nd distributed computation.

In the reduce step, we simply take a sum

of these rescaled vectors.

Each worker can aggregate all of the rescaled vectors

that it is storing locally.

So each worker only needs to send a single D dimensional

vector to the driver, which is the driver then must sum.

Hence, this process is linear in d, in terms of storage,

computation, and communication.

The following spark code snippet summarizes the step,

showing how we can rescale in the map step,

and then sum the vectors in the reduce step.

С теоретической частью всё. Осталась еще одна лабораторка, пятая (№5).

И конспект будет завершен.

original post http://vasnake.blogspot.com/2015/12/week-5-part-2.html

Tools

VSnake notes

2015-12-28

Week 5, part 2

Distributed PCA: big d, big n

Комментариев нет:

Отправить комментарий

Архив блога

Ярлыки

Обо мне

Links