I recently finished the course Data Manipulation at Scale: Systems and Algorithms by the University of Washington.
My feelings are mixed: I did learn something, but not enough to justify the time spent (and I do regret the time).
The vast majority of the lecture time is taken up by superficial stories about what is going on around Data Science and what it actually is, with a slant towards Big Data.
I don't know what target audience the course is aimed at: accountants with basic knowledge of databases and programming who want to become analysts? With my background as a programmer, I was frankly bored 90% of the time, and when I wasn't bored, I badly wanted a deeper treatment of the material.
The assignments are trivial too; you barely have to think.
Overall: a very shallow, introductory course on modern data analysis.
Syllabus
Data Science Context and Concepts:
– Examples and the Diversity of Data Science
– Working Definitions of Data Science
– Characterizing this Course
– Related Topics
– Course Logistics
Assignment: Twitter Sentiment Analysis
Relational Databases and the Relational Algebra:
– Principles of Data Manipulation and Management
– Relational Algebra
– SQL for Data Science
– Key Principles of Relational Databases
Assignment: SQL
MapReduce and Parallel Dataflow Programming:
– Reasoning about Scale
– The MapReduce Programming Model
– Algorithms in MapReduce
– Parallel Databases vs. MapReduce
Assignment: MapReduce
NoSQL: Systems and Concepts:
– What problems do NoSQL systems aim to solve?
– Early key-value systems and key concepts
– Document Stores and Extensible Record Stores
– Extended NoSQL Systems
– Pig: Programming with Relational Algebra
– Pig Analytics
– Spark
Graph Analytics:
– Structural Tasks
– Traversal Tasks
– Pattern Matching Tasks and Graph Query
– Recursive Queries
– Representations and Algorithms
Very interesting topics, but, alas, covered at a gallop.
Digest
Data Science Context and Concepts
Data Science: intersect(Programming, Math/Statistics, Substantive expertise)
Simple methods + lots of not-so-good data vs sophisticated methods + not-so-many good and clean data: the first approach wins.
Describing/quantifying uncertainty, making a statistical argument; conveying the results to the public: critical to the success of the effort, and far from trivial.
Collecting data from the web, etc.; processing data; making revealing presentations: that's data science.
Graph analytics, databases, visualisation, large datasets, ad hoc interactive analysis, repurposing: that's what data scientists do.
Data scientist: find nuggets of truth in data and then explain them to the business leaders.
Strong mathematical background.
Skills: statistics, data munging, visualisation.
80% of analytics is sums and averages.
Concerned with the collection, preparation, analysis, visualisation, management and preservation of large collections of information.
Tasks:
– 80% of the work: preparing to run a model (data gathering, cleaning, integrating, restructuring, transforming, loading, filtering, deleting, combining, merging, verifying, extracting, shaping, massaging)
– running the model
– the other 80% of the work: interpreting the results, communicating the results
Data products: data-driven apps (spellchecker, translator, ...)
If you're a DBA, you need to learn to deal with unstructured data;
if you're a statistician, learn to deal with big data;
if you're a software engineer, learn statistical modeling and communicating results;
if you're a business analyst, learn algorithms and tradeoffs at scale.
Abstractions: matrices, linear algebra, relations + relational algebra.
Barriers: incompatible data formats, non-aligned data, inconsistent data semantics.
90% of the time goes to 'handling' data (not doing science).
Working with large and heterogeneous datasets requires significant programming chops.
Science: empirical → theoretical → computational → eScience (massive datasets).
Sometimes only one pass through the dataset is possible.
Data: Volume, Variety, Velocity, (Veracity, ...)
Big Data: hard data problems (a 1G social graph to build a recommender system on; 10K spreadsheets to integrate).
– data exhaust from customers; new sensors; the ability to keep everything
Assignment:
– sentiment of each tweet: sum(sentiments for each word)
– sentiment of a new word: numPositiveTweets - numNegativeTweets
– word frequency: wordCount / numWords
– happiest state: group tweets by state, then max(sumSentiment / numTweets)
– top ten hash tags: sort(tagCount / totalTags)
I'll publish the code snippets in a separate post.
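In the meantime, a rough sketch of the per-tweet sentiment step described above (assuming a word-to-score dict such as one loaded from an AFINN-style file; the names are illustrative, not the assignment's starter code):

def tweet_sentiment(text, scores):
    # sum the known word scores; unknown words contribute 0
    return sum(scores.get(word, 0) for word in text.lower().split())

def new_word_sentiment(word, tweets, scores):
    # crude proxy from the notes: positive tweets containing the word
    # minus negative tweets containing it
    pos = sum(1 for t in tweets if word in t.lower().split() and tweet_sentiment(t, scores) > 0)
    neg = sum(1 for t in tweets if word in t.lower().split() and tweet_sentiment(t, scores) < 0)
    return pos - neg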
Relational Algebra (algebra of tables):
– applicable to Big Data
– well known
– abstracted / optimized query plans
– SQL to express queries, operations
An RA expression does specify the order of operations; SQL does not.
SQL: a declarative language.
A data model is about:
– structures (rows/columns, key-value pairs),
– constraints (all rows must have the same number of columns, ...),
– operations (get next n bytes, find the rows where column 'a' is 'b').
Database: a collection of information organized to afford efficient retrieval.
What problems do databases solve:
– sharing (concurrent access by multiple readers/writers)
– data model enforcement (data clean, organized)
– scale (work with datasets larger than RAM)
– flexibility (use the data in new, unanticipated ways)
Questions:
– how is the data physically organized on disk
– what kinds of queries are efficiently supported
– how hard is it to update/add the data
– what happens when I encounter new queries? Do I reorganize the data?
Relational DBMSs were invented to let you use one set of data in multiple ways, including unforeseen ones.
The algebraic structure of the data allows reasoning and manipulating independently of the physical data representation.
Relational algebra operations (closure: operations return relations):
– set operations (union [all], intersection, difference [except])
– selection (condition in where ...)
– projection (select [distinct] a, b, ...)
– join (cross [cartesian] product, left/right/inner/outer join, theta join)
and extended (bag) ops:
– duplicate elimination (distinct)
– grouping, aggregating (group by, sum, count, ...)
– sorting (order by)
Every paper will assume set semantics; every implementation – bag semantics.
Query optimization: apply laws/rules and find a less expensive equivalent
– like in arithmetic: ((z*2)+((z*3)+0))/1 = (2+3)*z
– the same logical expression → different physical algorithms (declarative query language)
– explain … shows the mapping from SQL to RA
UDF: user-defined functions:
– scalar
– aggregate
– table
Views: logical data independence (just a query with a name)
– abstraction and reuse
Indexes: for finding a needle in a haystack.
SQL assignment: I'll publish the code snippets in a separate post.
MapReduce, parallel dataflow
Shared nothing vs
– shared memory (multicore PC)
– shared disk (Oracle, …)
Only shared nothing can be scaled out to 1000s of nodes.
Implementation:
– master node
– M input splits are read by M map tasks, which write to R regions that are read by R reduce tasks
– DFS: fault tolerant, files stored as blocks, block size 64 MB
– YARN: task scheduler, fault tolerant
Scalable:
– can make use of 1000s of cheap computers
– for N data items you must do no more than (N^m)/k operations
– scale up vs scale out
MapReduce:
– map(inkey, invalue) => [(outkey, outvalue)], e.g. (docname, listofwords) => [(word, frequency)]
– shuffle: group/aggregate by outkey
– reduce(outkey, [outvalues]) => [resvalues], e.g. (word, [frequency]) => (word, totalfrequency)
– works with tuples, bags of tuples
– think backward: what result you have to produce, from what reduce input you can produce it, and what you have to do in the map phase to generate that reduce input.
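A word-count sketch in plain Python that mimics this model (map emits pairs, shuffle groups by key, reduce aggregates); it is just an illustration, not the course's framework:

from collections import defaultdict

def map_phase(docname, text):
    # emit (word, 1) for every word in the document
    for word in text.split():
        yield (word, 1)

def shuffle(pairs):
    # group values by key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(word, counts):
    return (word, sum(counts))

# usage:
# pairs = [p for doc, text in docs for p in map_phase(doc, text)]
# result = [reduce_phase(w, c) for w, c in shuffle(pairs).items()]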
Inverted index: «eat»: (tweet1, tweet3); «pancakes»: (tweet3, tweet2)
– map to (word, tweetnum)
– the grouped reduce input already is the result.
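The same idea specialized to the inverted index, where the grouped map output is the answer; the (tweet_id, text) input format is an assumption:

from collections import defaultdict

def inverted_index(tweets):
    # tweets: iterable of (tweet_id, text) pairs
    index = defaultdict(list)
    for tweet_id, text in tweets:
        for word in set(text.split()):    # map: emit (word, tweet_id)
            index[word].append(tweet_id)  # shuffle/group by word
    return index

# inverted_index([(1, 'eat pancakes'), (2, 'pancakes'), (3, 'eat')])['eat'] -> [1, 3]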
Relational join in MR:
– lump the two datasets into one
– map to (joinkey, datarow)
– reduce to (joinkey, tab1row, tab2row)
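A reduce-side join sketch along those lines: bucket rows of both tables by join key, then pair them up within each group; the key positions are illustrative assumptions:

from collections import defaultdict

def mr_join(table1, table2, key1=0, key2=0):
    # "map": bucket each row by its join key, keeping the two tables apart
    groups = defaultdict(lambda: ([], []))
    for row in table1:
        groups[row[key1]][0].append(row)
    for row in table2:
        groups[row[key2]][1].append(row)
    # "reduce": within each key group, pair up rows from the two tables
    for key, (rows1, rows2) in groups.items():
        for r1 in rows1:
            for r2 in rows2:
                yield (key, r1, r2)

# list(mr_join([(1, 'Alice')], [(1, 'Math'), (1, 'CS')]))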
Counting friends in MR:
– a record ('Jim', 'Sue') is an edge of a digraph
– count like in the 'word count' example, with the person name as the key
Matrix multiplication in MR:
– the reduce output must be in the form ((i, j), val) → (i, j) is the key
– the mapper has to emit each cell of A numBcols times, and so on (see the sketch below)
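A sketch of that scheme for C = A x B: the mapper replicates each A cell per column of B and each B cell per row of A under the key (i, j), and the reducer multiplies matching k values and sums; dense list-of-lists input is an assumption:

from collections import defaultdict

def matmul_mr(A, B, n_rows_a, n_cols_b, n_inner):
    # map phase: key (i, j), value (matrix tag, k, cell value)
    groups = defaultdict(list)
    for i in range(n_rows_a):
        for k in range(n_inner):
            for j in range(n_cols_b):
                groups[(i, j)].append(('A', k, A[i][k]))
    for k in range(n_inner):
        for j in range(n_cols_b):
            for i in range(n_rows_a):
                groups[(i, j)].append(('B', k, B[k][j]))
    # reduce phase: pair up A and B cells by k, multiply, sum
    C = {}
    for (i, j), values in groups.items():
        a = {k: v for tag, k, v in values if tag == 'A'}
        b = {k: v for tag, k, v in values if tag == 'B'}
        C[(i, j)] = sum(a[k] * b[k] for k in a)
    return C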
Parallel databases vs MapReduce:
– MR: cheap scale-out, fault tolerance, easy to deploy
– PD: schema, low latency, indexes, consistency (transactions/ACID), SQL, but very expensive
– SQL, indexes, schemas, low latency, algebraic optimisation are emerging in the MR world (Pig, Hive, Impala, Spark, ...)
– relational algebra everywhere
For data that fits into an RDBMS (not so big), the RDBMS may be faster than MR:
– MR loads data faster
– the RDBMS processes data faster (preprocessing is done in the loading phase)
MapReduce assignment: I'll publish the code snippets in a separate post.
NoSQL: key-value, extensible record, document DB
NoSQL systems are purely about scale rather than analytics, and are arguably less relevant for the practicing data scientist.
Joins and a query language/algebra are barely present in NoSQL databases:
– CouchDB (no language)
– Pig, Hive, Spark, Impala
It's a very small fraction of all available databases. Scaling out reads and writes is what it's all about.
NoSQL features:
– scale horizontally
– replicate and partition data over many nodes
– simple API, no query language
– weak concurrency model (vs ACID transactions)
– efficient use of distributed indexes, RAM data storage
– dynamically add new attributes to data records
ACID:
– atomicity
– consistency (written data must be valid according to all defined rules)
– isolation (often relaxed)
– durability
Relaxing consistency guarantees: the CAP theorem
– consistency, availability, partition tolerance: choose 2 of 3
– without partition tolerance there is no scalability => AP, no C
– at big scale, with frequent failures, transactions can't be finished fast enough
– eventually consistent data.
Consistency: do all applications see the same data?
Availability: can I interact with the system in the presence of failures?
Partitioning: if two sections of your system cannot talk to each other, can they make forward progress?
– If not => no Availability
– If yes => no Consistency
Conventional databases assume no partitioning.
NoSQL systems may sacrifice consistency.
Paxos protocol for solving consensus:
– a distributed voting scheme
MVCC: multi-version concurrency control:
– creates a new version on write
– checks timestamps to determine which version to read
Major impact systems:
– Memcached: showed that in-memory indexes can be highly scalable (consistent hashing / hash ring; see the sketch after this list)
– Dynamo: pioneered the idea of using eventual consistency as a way to achieve higher availability and scalability (SLA: respond within 300 ms for 99.9% of requests at the 99th percentile); vector clocks
– BigTable: showed that persistent record storage can be scaled to 1000s of nodes
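A minimal hash-ring sketch of the consistent-hashing idea mentioned above (no virtual nodes; md5 and the class name are illustrative choices, not any particular library's API):

import bisect
import hashlib

class HashRing:
    def __init__(self, nodes):
        # place every node on the ring at the position given by its hash
        self.ring = sorted((self._hash(n), n) for n in nodes)

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def node_for(self, key):
        # a key belongs to the first node clockwise from its hash position,
        # so adding/removing a node remaps only a small fraction of keys
        h = self._hash(key)
        idx = bisect.bisect(self.ring, (h, ''))
        return self.ring[idx % len(self.ring)][1]

# HashRing(['cache1', 'cache2', 'cache3']).node_for('user:42')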
BigTable/HBase:
– a sparse, distributed, persistent multi-dimensional sorted map
– a record is: (row, col, time): string
– data sorted by row key
– the row key range is broken into tablets (tablet: the unit of distribution)
– column families (a family is the unit of access control, memory accounting, disk accounting)
– writes go through a memtable; SSTables (disk chunks) are immutable
RA and declarative query languages come to MapReduce:
– Hive
– Pig
Take joins: to perform a join, the system can use different methods/strategies (different query plans).
The system should pick the right one, not the programmer => we need a declarative query language with an optimizer to build the RA plan.
Enterprise needs:
– ACID
– declarative query language
– schema/standards
Pig: almost SQL on top of Hadoop (MapReduce), nested types:
– data model: atom, tuple, bag, dict (map)
– operators: load, filter, group/cogroup (join), distinct, foreach, flatten, store, union, cross, order, ...
Spark: iterative tasks, cached RDDs:
– RDD: a distributed collection of key-value pairs
– transformations → actions
– lazy evaluation (optimization)
– operations: filter, group, sort, union, cross, … (RA?)
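A PySpark word-count sketch to illustrate transformations vs. actions (assumes a local pyspark installation; 'input.txt' is a placeholder path, not course data):

from pyspark import SparkContext

sc = SparkContext('local[*]', 'wordcount')
lines = sc.textFile('input.txt')                      # RDD of lines
counts = (lines.flatMap(lambda line: line.split())    # transformation (lazy)
               .map(lambda word: (word, 1))           # transformation (lazy)
               .reduceByKey(lambda a, b: a + b))      # transformation (lazy)
print(counts.take(10))                                # action triggers execution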
Graphs: Structural tasks
Graph-structured data are increasingly common in data science contexts due to their ubiquity in modeling the communication between entities.
Graph:
– a set of vertices V
– a set of edges E
– an edge is a pair (v, w); directed, undirected, weighted, capacitated, loaded with data
– structural algorithms, traversal, pattern matching
Structural:
– in-degree, out-degree
– degree histogram: for d = 0..maxOutDegree find n = count(vertices.outdegree == d) (see the sketch after this list)
– random graphs have an exponential distribution: n(d) = c*x^d, x < 1 (most vertices have small out-degree)
– human-generated graphs have a Zipf distribution: n(d) = 1/d^x, x > 0 (long tail, most vertices have significant out-degree)
– connectivity coefficient: the minimum number of vertices you need to remove to disconnect the graph
– centrality:
closeness centrality of a vertex: the average length of all its shortest paths;
betweenness centrality of v: the fraction of all shortest paths that pass through v
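A minimal sketch of the degree histogram from the list above, for a directed edge list; vertices with out-degree 0 never appear as a source and so are not counted here:

from collections import Counter

def degree_histogram(edges):
    # edges: iterable of (v, w) pairs
    outdeg = Counter(v for v, w in edges)    # out-degree of each source vertex
    return Counter(outdeg.values())          # {out-degree d: number of vertices with that d}

# degree_histogram([('a', 'b'), ('a', 'c'), ('b', 'c')]) -> Counter({2: 1, 1: 1})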
Traversal tasks:
– PageRank (eigenvector centrality): propagate node rank to connected nodes, divided by the number of outgoing links, mitigated by a damping factor
– minimum spanning tree (smallest subset of edges)
– Euler path: traverse every edge exactly once; easy
– Hamiltonian path: visit every vertex exactly once; NP-complete
– maxflow/mincut
PageRank in detail:
– formula: PR(v) = (1 - damp) / len(Vertices) + damp * sumPointingToV( PR(wi) / wi.outdegree )
– meaning: the probability of stepping into a vertex while walking the graph randomly
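A straightforward iterative sketch of that formula; `graph` is assumed to be a dict {vertex: [outgoing neighbors]} with every vertex present as a key (dangling-node handling omitted):

def pagerank(graph, damp=0.85, iterations=20):
    n = len(graph)
    rank = {v: 1.0 / n for v in graph}
    for _ in range(iterations):
        new_rank = {v: (1 - damp) / n for v in graph}   # teleport term
        for v, neighbors in graph.items():
            if neighbors:
                share = damp * rank[v] / len(neighbors)  # divide by out-degree
                for w in neighbors:
                    new_rank[w] += share                 # propagate to neighbors
        rank = new_rank
    return rank

# pagerank({'a': ['b'], 'b': ['a', 'c'], 'c': ['a']})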
Graph: Pattern matching tasks
– pattern example: two vertices connected to each other directly; vertices and edges may be labeled.
A popular pattern: the triangle A → B → C → A, a cycle with 3 vertices.
– brute force: for each edge (v, w) check every vertex u: if (v, u) and (w, u) are edges => triangle (see the sketch below).
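A brute-force sketch of that check, treating edges as undirected for simplicity; the edge-list input format is an assumption:

def triangles(edges):
    # edges: iterable of (v, w) pairs; store them as frozensets for O(1) lookup
    edge_set = {frozenset(e) for e in edges}
    vertices = {x for e in edges for x in e}
    found = set()
    for v, w in edges:
        for u in vertices - {v, w}:
            # (v, w) is an edge; if (v, u) and (w, u) are edges too, it's a triangle
            if frozenset((v, u)) in edge_set and frozenset((w, u)) in edge_set:
                found.add(frozenset((v, w, u)))
    return found

# triangles([('a', 'b'), ('b', 'c'), ('c', 'a'), ('c', 'd')]) -> {frozenset({'a', 'b', 'c'})}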
RDF: resource description framework (subject, predicate, object):
– given a graph with edge labels
– pattern match example: A interferes with B regulates C associated with D
– we need some kind of pattern expression language
– RDF triples: (subject, predicate, object)
– a triple is just a labeled edge in a graph
Relational algebra and Datalog for graphs
– SQL query for the pattern:
select i.subject from R i, R r, R a
where i.predicate = 'interferes with' and r.predicate = 'regulates' and a.predicate = 'associated with'
and i.object = r.subject and r.object = a.subject
– Datalog query:
Ans(x) :- R(x, interferes with, y), R(y, regulates, z), R(z, associated with, w)
Another example:
– relations: InterferesWith(drug1, drug2); Regulates(drug, gene); AssociatedWith(gene, disease)
– SQL:
select i.drug1 from InterferesWith i, Regulates r, AssociatedWith a
where i.drug2 = r.drug and r.gene = a.gene
– Datalog:
Ans(x) :- InterferesWith(x, y), Regulates(y, z), AssociatedWith(z, w)
Recursion
knew(Person2) :- knew(Person1), email(Person1, Person2, Message, Time), Time < 'June 3'
– a sort of BFS algorithm: a loop and a queue (see the sketch below)
– beware: cycles in the graph
– MapReduce, Spark: loop, cache the invariants
– in graphs with a Zipf distribution (long tail) you have to break out of the loop: insignificant data keeps trickling in long after you already have everything you need
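A small sketch of that loop-and-queue evaluation of the recursive rule; `emails` as a list of (sender, recipient, time) tuples and a comparable `cutoff` value are assumptions, not the course data:

from collections import deque

def who_knew(seed_person, emails, cutoff):
    knew = {seed_person}
    queue = deque([seed_person])
    while queue:
        p1 = queue.popleft()
        for sender, recipient, time in emails:
            if sender == p1 and time < cutoff and recipient not in knew:
                knew.add(recipient)       # new fact derived by the rule
                queue.append(recipient)   # the visited set cuts cycles
    return knew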
Pregel, Giraph, Hama:
– PageRank in MapReduce has problems: all graph data is shuffled on every iteration; we need to shuffle only the new rank contributions; iterations are controlled outside of MR.
– Pregel: batch algorithms on large graphs
while any vertex is active or max iterations not reached:
    for each vertex:  # parallelized
        process messages from neighbors from the previous iteration
        send messages to neighbors
        set active flag
Data representation
top 5 highest in-degree vertices:
select top 5 w, incount from (
    select w, count(v) as incount from edges group by w
) order by incount desc
– adjacency list: edges(v, [w, …]); good for MR
– adjacency matrix: good for finding a specific edge
Different algorithms have different properties.
original post http://vasnake.blogspot.com/2016/02/data-manipulation-at-scale-systems-and.html