Записки программиста, обо всем и ни о чем. Но, наверное, больше профессионального.

2015-12-28

Week 5, part 2

Курс Scalable Machine Learning. Hadoop, Apache Spark, Python, ML -- вот это всё.

Продолжаю конспектировать пройденный курс. Неделя 5, продолжение.
В прошлый раз начали разбирать теорию PCA.
Далее: пятая неделя, лекции, алгоритмы вычисления PCA в распределенной среде, бигдата.

WEEK 5: Principal Component Analysis and Neuroimaging.

PCA ALGORITHM

Освежим в памяти, что такое ортогональные и ортонормальные векторы:

Orthogonal vectors are simply perpendicular vectors,
and one nice property of orthogonal vectors
is that their dot product always equals 0.

Note that a unit vector is simply
a vector whose Euclidean norm equals one.

Orthonormal … These are vectors that are orthogonal and also
have unit norm.



Going back to our example, since a and b are
both unit norm and orthogonal, they
are also orthonormal vectors.
In contrast, b and d, even though they're orthogonal,
are not orthonormal since the Euclidean norm
of d is greater than one.

Now that we're equipped with a better understanding
of these concepts, we could discuss
the interpretation of PCA as an iterative algorithm.

Понятно, ортонормальные векторы, ага.
Теперь как формулируется итеративный алгоритм PCA:

imagine that we want a k dimensional
representation of our data.
... we can imagine taking the following iterative
approach.

At the i-th iteration we aim to find
the direction of maximal variance in the data,
subject to the constraint that this direction is of unit norm
and that it is orthogonal to all directions we
chose in previous iterations.

We can then project our data onto this direction,
and the locations along this direction
become the i-th feature in our new representation.

As a result of this algorithm, we
have found k unit vectors that are pairwise orthogonal,
and thus, these k directions are orthonormal.



DISTRIBUTED PCA

Так как же нам найти Принципиальные Компоненты и получить редуцированный датасет?
4 шага
отцентрировать данные, 
вычислить матрицу ковариантности, 
найти эйгенвекторы, 
умножить данные на эйгенвекторы:

we'll assume that our raw data is not
centered. And so the first step in PCA involves centering our data.
Or in other words, computing the mean of each feature
and subtracting these feature means from each original data
point.

We'll assume that our centered data is stored in an n
by d matrix which we'll call x.

Once we've centered our data, we next
need to compute the sample covariance
matrix, or the scatter matrix.

Note that the scatter matrix is simply the sample covariance
matrix without dividing by n.
PCA returns the same solution in either case,
as discussed in the previous segment.
And we'll work with the scatter matrix in this segment
to simplify our notation.

As we previously discussed, the principal components
of our data equal the eigenvectors of the sample
covariance matrix.

So in the third step, we compute these eigenvectors
by performing an eigendecomposition.

Finally, in order to obtain our k dimensional representation,
we need to multiply our original data by the top k eigenvectors
to compute the PCA scores.



Distributed PCA, Big n, small d
Как посчитать PCA на практике, с учетом того, что количество фич невелико (small d).

we must center our data.
And to do this, we must compute the mean of each feature.
There are d features, and thus d feature means.
And we define the vector m to be the d dimensional vector whose
i-th component is the mean of the i-th feature.

We can compute the mean vector via a simple
reduce operation whereby we sum all of the data points
together.

After we have performed this sum,
we can compute the mean vector, m,
by simply dividing by the number of data points, n.

After computing m on the driver, we
must send it back to the workers so that each data
point can be centered.

Together, the reduce operation to compute m
and the subsequent communication are inexpensive,
as they are linear in d in terms of local storage, computation,
and communication.

Once each worker has access to m,
we can then perform a map operation
to create the centered data points, which
simply involves subtracting m from each original data point.



we next need to compute the scatter matrix.

As in the case of closed-form linear regression,
we can efficiently perform this computation in a distributed
fashion by using outer products.

Матрица ковариантности = X transpose * X, а это это сумма произведений строк датасета.
Очень легко ложится на модель распределенных вычислений.

We'll represent x visually by its rows or data points,
and then we can express this matrix multiplication
as a sum of outer products where each outer product involves
only a single row of x, or a single data point.

Also recall that, in the previous step,
we computed our center data and we stored it
in a data parallel fashion.
So with this context, we can now compute the scatter matrix
as a simple MapReduce operation.

In the map step, we take each point
and compute its outer product with itself.
This is a computational bottleneck
in our PCA computation, but it's distributed
across multiple workers.


In the reduce step, we simply sum over
all of these outer products.


This requires quadratic storage and computation in d,
both of which are feasible since we're assuming that d is small.

Once we've computed the scatter matrix,
we need to perform its eigendecomposition.

Since we want to compute a k dimensional representation
for our data, we're interested in the top k
principal components of our data, which we know
are the top k eigenvectors of a scatter matrix.

We represent these principal components by the d by k matrix P.
As a result of eigendecomposition,
we have access to the top k eigenvectors on the driver.
But now we need to communicate them to the workers
so that they can compute these k dimensional representations
for the data points.

This requires O of dk communication,
which is the communication bottleneck in this algorithm.


Additionally, the eigendecomposition
generally requires cubic time and quadratic space.
But if we only want the top k eigenvalues and eigenvectors,
we can reduce this time complexity to O of d squared k.

Finally, now that each worker has the principal component
stored locally, we can compute the k dimensional
representation for each point via a simple matrix vector
multiply.


This process requires O of dk local computation,
and can be performed via a simple map operation.

Простой алгоритм, но можно упереться в количество фич, превышающее возможности Spark кластера.
Не беда, есть способ проделать PCA в Big N and Big D setting.

Distributed PCA: big d, big n


Ключевая идея заключается в хитром, итеративном способе вычисления ейгенвекторов.

where both N and D are large,
we can only afford storage, computation, and communication that are linear in N & D.

So we can't locally store or operate on the scatter matrix.
Instead, we'll introduce an iterative approach.

Our iterative approach relies on performing
a sequence of matrix vector products
to compute the top K eigenvectors of the scatter matrix.


The most common methods for doing this are
Krylov subspace and random projection based methods.

And Spark's MLlib in particular relies on Krylov subspace methods.

In this approach, the driver provides the workers
with a D dimensional vector on each iteration,
and requires that the workers left multiply this vector
by the scatter matrix.
Overall, the algorithm requires O of K iterations,
or passes over the data, and O of DK local storage.

And notably, this algorithm computes the PCA solution
without ever explicitly computing the covariance
of a scatter matrix.

The first step involves the algorithm
communicating the D dimensional vector, vi, to all workers.

Next (step 2, see below), we need to multiply the scatter matrix by the vector vi
in a distributed fashion.
And we'll denote the result of this matrix multiplication
as the D dimensional vector qi.

The driver then uses qi to update its estimate of p,
which are the top K eigenvectors of the scatter matrix.



We repeat this process, o of k for o of k iterations,
until we converge on our answer for p

Also, note that we're using the letter
i here to denote the iteration number.
And so the vector vi communicated
to the workers in step one changes on each iteration.

step two is interesting.
The challenge is that we want to perform this matrix
multiplication without ever having to explicitly compute
the scatter matrix, or even having to store copies
of both X and X transpose.

And so we need to be a bit clever in how
we perform this computation.
And it turns out that by carefully breaking
this multiplication into two steps,
we're able to work to achieve our desired goal.

We first compute bi, which is an n dimensional vector
equal to x times vi.
We then multiply X transpose by this intermediate result
to obtained qi.

we can efficiently compute steps one
and step two by only storing X in a data parallel fashion.

Remember that bi is an n dimensional vector,
and each component of this vector is simply equal to the
dot product between a row of X and the vector vi.

Since each row of X is an observation,
and since each worker is storing vi locally,
we can simply compute the dot product between each data point
and vi, and then concatenate the results to obtain bi.

We can first perform a map operation, in which we compute
the dot product of each data point and the vector vi.
This requires O of d storage to store vi, and O of nd
distributed computation to compute the dot products.

Next, in the reduce step, we can simply concatenate the results.
Finally, each worker will need to store bi locally
in order to compute the overall result, qi, in the next step.
So we're going to need to communicate bi to each worker.



So overall, this reduce step, combined with the communication of qi,
requires linear time, space, and communication in terms of n.

The following spark code snippet succinctly
summarizes what we've done.
Starting with an RDD of training data,
we first compute a dot product in the map step,
then collect the results, which creates a list.
And finally, we convert this list into a NumPy array.



Now let's consider the second step of this two step process.
In this step, our goal is to compute the product of X transpose and bi.

By inspecting this product, we can interpret this multiplication
as the sum of rescaled data points.

In particular, for the jth data point, we can multiply it by the jth component of the vector bi,
which gives us a new d dimensional vector.

We can similarly rescale each of the other data points,
which gives us a total of n rescaled d dimensional vectors.
Taking the sum of these rescaled vectors gives us qi,
which is our desired result.

In the map step, we simply rescale each vector
by its corresponding component of bi.

Storing bi requires O of n storage,
and computing the rescaled vectors
requires a pass over all of our data,
and thus takes O of nd distributed computation.

In the reduce step, we simply take a sum
of these rescaled vectors.
Each worker can aggregate all of the rescaled vectors
that it is storing locally.
So each worker only needs to send a single D dimensional
vector to the driver, which is the driver then must sum.

Hence, this process is linear in d, in terms of storage,
computation, and communication.

The following spark code snippet summarizes the step,
showing how we can rescale in the map step,
and then sum the vectors in the reduce step.




С теоретической частью всё. Осталась еще одна лабораторка, пятая (№5).

И конспект будет завершен.






original post http://vasnake.blogspot.com/2015/12/week-5-part-2.html

Комментариев нет:

Отправить комментарий

Архив блога

Ярлыки

linux (241) python (191) citation (185) web-develop (170) gov.ru (157) video (123) бытовуха (112) sysadm (100) GIS (97) Zope(Plone) (88) Book (81) programming (81) бурчалки (81) грабли (77) development (73) Fun (72) windsurfing (72) Microsoft (64) hiload (62) opensource (58) internet provider (57) security (57) опыт (55) movie (52) Wisdom (51) ML (47) language (45) hardware (44) JS (41) curse (40) driving (40) money (40) DBMS (38) bigdata (38) ArcGIS (34) history (31) PDA (30) howto (30) holyday (29) Google (27) Oracle (27) virtbox (27) health (26) vacation (24) AI (23) Autodesk (23) SQL (23) Java (22) humor (22) knowledge (22) translate (20) CSS (19) cheatsheet (19) hack (19) tourism (19) Apache (16) Manager (15) web-browser (15) Никонов (15) happiness (14) music (14) todo (14) PHP (13) course (13) functional programming (13) weapon (13) HTTP. Apache (12) SSH (12) frameworks (12) hero (12) im (12) settings (12) HTML (11) SciTE (11) crypto (11) game (11) map (11) scala (11) HTTPD (9) ODF (9) купи/продай (9) benchmark (8) documentation (8) 3D (7) CS (7) DNS (7) NoSQL (7) Photo (7) cloud (7) django (7) gun (7) matroska (7) telephony (7) Microsoft Office (6) VCS (6) bluetooth (6) pidgin (6) proxy (6) Donald Knuth (5) ETL (5) NVIDIA (5) REST (5) bash (5) flash (5) keyboard (5) price (5) samba (5) CGI (4) LISP (4) RoR (4) cache (4) display (4) holywar (4) nginx (4) pistol (4) xml (4) Лебедев (4) IDE (3) IE8 (3) J2EE (3) NTFS (3) RDP (3) USA (3) holiday (3) mount (3) spark (3) Гоблин (3) кухня (3) урюк (3) AMQP (2) ERP (2) IE7 (2) NAS (2) Naudoc (2) PDF (2) address (2) air (2) british (2) coffee (2) font (2) ftp (2) messaging (2) notify (2) sharepoint (2) ssl/tls (2) stardict (2) tests (2) tunnel (2) udev (2) APT (1) CRUD (1) Canyonlands (1) Cyprus (1) DVDShrink (1) Jabber (1) K9Copy (1) Matlab (1) Palanga (1) Portugal (1) VBA (1) WD My Book (1) autoit (1) bike (1) cannabis (1) chat (1) concurrent (1) dbf (1) ext4 (1) idioten (1) krusader (1) license (1) mindmap (1) pneumatic weapon (1) quiz (1) regexp (1) robot (1) science (1) serialization (1) tie (1) vim (1) Науру (1) крысы (1) налоги (1) пианино (1)

Google+ Followers