With all the hype and buzz surrounding NoSQL, I decided to have a look at it. I quickly found that there is not one NoSQL I could learn. Rather, there are various solutions with different purposes and trade-offs. Those solutions tend to have one thing in common: processing of data in NoSQL storage is usually done using the MapReduce framework.
Searching for MapReduce material turned up various scattered blog posts, a few university course pages and one book that seems to contain almost everything the other sources did.
This post contains MapReduce questions and answers based on the book. Basically, if I were a student, these are the notes I would have made to prepare for a test. If I were a teacher, this is what I would ask on the exam.
The first chapter gives credit where credit is due; the rest contains questions. The last chapter contains hands-on coding exercises.
The Book
The book is named Data-Intensive Text Processing with MapReduce. If you are unsure whether you want to buy it or not, a pre-production manuscript is available for free.
Do not be fooled by the title. The book is more about MapReduce itself and less about text processing. The first half of the book describes general-purpose tricks (design patterns) useful for any task. The second half contains chapters on text processing, graph algorithms and expectation maximization.
The book contains almost everything I found on various blogs and university course pages, and much more.
Questions
Questions are split by book chapters. With one minor exception (counters), the questions are mostly based on the first half of the book. The second half contains concrete algorithms, and I wanted to focus on general-purpose knowledge.
That does not mean learning them is not useful; the Graph Algorithms chapter in particular contains easily generalizable ideas.
2 MapReduce Basics
2.2 Mappers and Reducers
Describe general MapReduce algorithm. Split it into phases. For each phase include:
- who is responsible (framework/programmer/customizable),
- what it does,
- phase input,
- phase output.
The answer is:
MapReduce has four phases:
- map,
- combine,
- shuffle and sort,
- reduce.
The map phase is done by mappers. Mappers run on unsorted input key/value pairs. The same physical nodes that keep the input data also run the mappers. Each mapper emits zero, one or multiple output key/value pairs for each input key/value pair. Output key/value pairs are called intermediate key/value pairs. Their type is usually different from the input key/value pair type. The mapper must be supplied by the programmer.
The combine phase is done by combiners. A combiner should combine key/value pairs with the same key together. Each combiner may run zero, one or multiple times. The framework decides whether and how many times to run the combiner; the programmer has no control over it. The combiner's output key/value pair type must be the same as its input key/value pair type.
The shuffle and sort phase is done by the framework. Data from all mappers are grouped by the key, split among reducers and sorted by the key. Each reducer obtains all values associated with the same key. The programmer may supply a custom compare function for sorting and a partitioner for the data split. All key/value pairs going to the same reducer are sorted by the key, but there is no global sorting.
The reduce phase is done by reducers. A reducer obtains key/[list of values] pairs sorted by the key. The values list contains all values with the same key produced by the mappers. Each reducer emits zero, one or multiple output key/value pairs for each input key/[list of values] pair. The output key/value pair type is usually different from the input type. The reducer must be supplied by the programmer.
If the algorithm requires multiple MapReduce iterations, tasks (typically reducers) may increment a global counter. The driver program reads the counter after the reduce phase and decides whether another iteration is needed.
Note: chapter 2 does not mention counters. They are explained later, in chapter 5.
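A minimal sketch of how these phases are wired together in a Hadoop driver; the MyMapper, MyCombiner, MyPartitioner and MyReducer class names are hypothetical placeholders for the programmer-supplied pieces.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Driver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "phases demo");
        job.setJarByClass(Driver.class);
        job.setMapperClass(MyMapper.class);            // map phase: supplied by the programmer
        job.setCombinerClass(MyCombiner.class);        // combine phase: optional, framework decides when to run it
        job.setPartitionerClass(MyPartitioner.class);  // data split during shuffle and sort: customizable
        job.setReducerClass(MyReducer.class);          // reduce phase: supplied by the programmer
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}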
Decide if the statement is true or false: All MapReduce implementations implement exactly same algorithm.
The answer is:
False. For example, Google's implementation does not allow the reducer to change the key but provides sorting of values, while Hadoop does not provide value sorting but allows the reducer to change the key.
True or false: Each mapper must generate the same number of key/value pairs as its input had.
True or false: Mappers input key/value pairs are sorted by the key.
True or false: Mappers output key/value must be of the same type as its input.
True or false: Reducer is applied to all values associated with the same key.
True or false: Reducers input key/value pairs are sorted by the key.
True or false: Each reducer must generate the same number of key/value pairs as its input had.
True or false: Reducers output key/value pair must be of the same type as its input.
2.3 The Execution Framework
What happens in case of hardware/software failure?
The answer is:
MapReduce framework must be able to recover from both hardware (disk failures, RAM errors) and software (bugs, unexpected exceptions) errors. Both are common and expected.
Is it possible to start reducers while some mappers still run? Why?
The answer is:
No. The reducer's input is grouped by the key. The last running mapper could theoretically produce a key already consumed by a running reducer.
Define a straggler.
What is speculative execution (also called backup tasks)? What problem does it solve?
The answer is:
An identical copy of the same task is executed on multiple nodes. The output of the fastest task is used.
Speculative execution helps if the task is slow because of hardware problem. It does not help if the distribution of values over keys is skewed.
2.4 Partitioners and Combiners
What does partitioner do?
What does combiner do?
The answer is:
The combiner does local aggregation of key/value pairs produced by the mapper before or during the shuffle and sort phase. In general, it reduces the amount of data to be transferred between nodes.
The framework decides how many times to run it. Combiner may run zero, one or multiple times on the same input.
Decide if the statement is true or false: Each combiner runs exactly once.
2.5 The Distributed File System
Briefly describe HDFS architecture.
The answer is:
HDFS has one namenode and a lot of datanodes. The namenode is the master; it coordinates file operations, ensures integrity of the system and keeps the namespace (metadata, directory structure, file-to-block mapping etc.).
Data are stored in big blocks on datanodes. Each block is stored on multiple datanodes, three by default. The namenode checks whether the datanodes work correctly and manages data replication.
The client contacts the namenode, which answers with a data block id and a datanode id. The datanode then sends the data directly to the client.
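A minimal sketch of the client side of this exchange using the standard Hadoop FileSystem API (the file path is a made-up example); the namenode lookup and the direct transfer from datanodes happen behind fs.open().

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);          // asks the namenode for metadata
        Path file = new Path("/data/example.txt");     // hypothetical file
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(fs.open(file)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);              // blocks are streamed directly from datanodes
            }
        }
    }
}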
Decide if the statement is true or false: HDFS is good at managing a large number of small files.
2.6 Hadoop Cluster Architecture
Explain the difference between the jobtracker and the tasktracker.
The answer is:
The client submits jobs to the jobtracker, which runs on the master node. The jobtracker monitors MapReduce jobs and coordinates mappers and reducers.
The tasktracker runs both user code and the datanode daemon on slave nodes. It is never contacted by the client directly.
Explain mapper lifecycle.
The answer is:
The initialization method is called before any other method. It has no parameters and no output.
The map method is called separately for each key/value pair. It processes input key/value pairs and emits intermediate key/value pairs.
The close method runs after all input key/value pairs have been processed. The method should close all open resources. It may also emit key/value pairs.
Explain reducer lifecycle.
The answer is:
The initialization method is called before any other method. It has no parameters and no output.
The reduce method is called separately for each key/[values list] pair. It processes intermediate key/value pairs and emits final key/value pairs. Its input is a key and an iterator over all intermediate values associated with that key.
The close method runs after all input key/value pairs have been processed. The method should close all open resources. It may also emit key/value pairs.
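In Hadoop's Java API these lifecycle methods are called setup, map/reduce and cleanup. A minimal sketch of both lifecycles; the word-count logic is only illustrative filler.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class LifecycleDemo {

    public static class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void setup(Context context) {
            // initialization: called once per task, before any map() call
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // called once per input key/value pair, emits intermediate pairs
            for (String word : value.toString().split("\\s+")) {
                context.write(new Text(word), new IntWritable(1));
            }
        }

        @Override
        protected void cleanup(Context context) {
            // close: called after the last map() call, may still emit pairs
        }
    }

    public static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // called once per key with an iterator over all values sharing that key
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}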
3 MapReduce Algorithm Design
3.1 Local Aggregation
What is local aggregation and why is it used?
The answer is:
Either a combiner or the mapper itself combines key/value pairs with the same key together. It may also do some additional preprocessing of the combined values. Only key/value pairs produced by the same mapper are combined.
Key/value pairs created by map tasks are transferred between nodes during the shuffle and sort phase. Local aggregation reduces the amount of data to be transferred.
If the distribution of values over keys is skewed, data preprocessing in the combiner helps to eliminate reduce stragglers.
What is in-mapper combining? State advantages and disadvantages over writing a custom combiner.
The answer is:
Local aggregation (combining of key/value pairs) done inside the mapper.
The map method does not emit key/value pairs; it only updates an internal data structure. The close method combines and preprocesses all stored data and emits the final key/value pairs. The internal data structure is initialized in the init method. A small Java sketch is given after the lists below.
Advantages:
- It will run exactly once. A combiner may run multiple times or not at all.
- We are sure it will run during the map phase. A combiner may run either after the map phase or before the reduce phase; the latter case provides no reduction in transferred data.
- In-mapper combining is typically more efficient. The combiner does not reduce the amount of data produced by mappers, it only groups the generated data together. That causes unnecessary object creation, destruction, serialization and deserialization.
Disadvantages:
- Scalability bottleneck: the technique depends on having enough memory to store all partial results. We have to flush partial results regularly to avoid it. Using a combiner produces no scalability bottleneck.
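A minimal word-count sketch of in-mapper combining, assuming the per-mapper dictionary fits into memory; in a real job it would be flushed periodically to avoid the bottleneck mentioned above.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class InMapperCombiningMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private Map<String, Integer> counts;

    @Override
    protected void setup(Context context) {
        counts = new HashMap<>();                      // internal structure, initialized once
    }

    @Override
    protected void map(LongWritable key, Text value, Context context) {
        for (String word : value.toString().split("\\s+")) {
            counts.merge(word, 1, Integer::sum);       // aggregate locally, emit nothing yet
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // emit the aggregated pairs once, after all input has been processed
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
        }
    }
}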
3.2 Pairs and Stripes
Explain the Pairs design pattern on a co-occurrence example. Include advantages/disadvantages compared to the Stripes approach, possible optimizations and their efficacy.
The answer is:
The mapper generates keys composed of pairs of words that occurred together; the value contains the number 1. The framework groups key/value pairs with the same word pair together and the reducer simply sums the values for each incoming key. A sketch of the mapper is given after the lists below.
Each final pair encodes a cell in the co-occurrence matrix. Local aggregation, e.g. a combiner or in-mapper combining, can be used.
Advantages:
- Simple values, less serialization/deserialization overhead.
- Simpler memory management. No scalability bottleneck (unless in-mapper combining is used).
Disadvantages:
- Huge amount of intermediate key/value pairs. Shuffle and sort phase is slower.
- Local aggregation is less effective - too many distinct keys.
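A sketch of the pairs mapper; for brevity the word pair is encoded as a single tab-separated Text key instead of a custom WritableComparable, and the co-occurrence window is simply the whole line.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class PairsMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] terms = value.toString().split("\\s+");
        for (String w : terms) {
            for (String u : terms) {
                if (!w.equals(u)) {
                    // one intermediate pair per co-occurring word pair
                    context.write(new Text(w + "\t" + u), ONE);
                }
            }
        }
    }
    // the reducer just sums the 1s for each (w, u) key, exactly like a word count
}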
Explain the Stripes design pattern on a co-occurrence example. Include advantages/disadvantages compared to the Pairs approach, possible optimizations and their efficacy.
The answer is:
The mapper generates a key from each encountered word. The associated value contains a map with all co-occurring words as map keys and the numbers of co-occurrences as map values. The framework groups the same words together and the reducer merges the value maps. A sketch of the mapper is given after the lists below.
Each final pair encodes a row in the co-occurrence matrix. A combiner or in-mapper combining can be used.
Advantages:
- Small amount of intermediate key/value pairs. Shuffle and sort phase is faster.
- Intermediate keys are smaller.
- Effective local aggregation - smaller number of distinct keys.
Disadvantages:
- Complex values, more serialization/deserialization overhead.
- More complex memory management. As value maps may grow too big, the approach is a potential scalability bottleneck.
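A sketch of the stripes mapper using Hadoop's MapWritable as the associative-array value (again treating the whole line as the co-occurrence window); the reducer would element-wise sum the stripes for each word.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class StripesMapper extends Mapper<LongWritable, Text, Text, MapWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] terms = value.toString().split("\\s+");
        for (String w : terms) {
            MapWritable stripe = new MapWritable();    // one stripe per occurrence of w
            for (String u : terms) {
                if (w.equals(u)) {
                    continue;
                }
                Text neighbour = new Text(u);
                IntWritable count = (IntWritable) stripe.get(neighbour);
                if (count == null) {
                    stripe.put(neighbour, new IntWritable(1));
                } else {
                    count.set(count.get() + 1);
                }
            }
            context.write(new Text(w), stripe);        // key: word, value: map of co-occurring words
        }
    }
}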
Explain scalability bottleneck caused by stripes approach.
The answer is:
The stripes solution keeps a map of co-occurring words in memory. As the number of co-occurring words is unbounded, the map size is unbounded too. A huge map does not fit into memory and causes paging or out-of-memory errors.
3.3 Computing Relative Frequencies
Relative frequencies of co-occurrences problem:
Input: text documents
  key: document id
  value: text document
Output: key/value pairs where
  key: pair(word1, word2)
  value: #co-occurrences(word1, word2) / #co-occurrences(word1, any word)
Fix the following solution to the relative frequencies of co-occurrences problem:
class MAPPER
  method INITIALIZE
    H = new hash map
  method MAP(docid a, doc d)
    for all term w in doc d do
      for all term u in neighbors(w) do
        H(w) = H(w) + 1
        emit(pair(u, w), count 1)
  method CLOSE
    for all term w in H
      emit(pair(w, *), H(w))

class REDUCER
  variable total_occurrences = 0
  method REDUCE(pair (p, u), counts [c1, c2, ..., cn])
    s = 0
    for all c in counts [c1, c2, ..., cn] do
      s = s + c
    if u = *
      total_occurrences = s
    else
      emit(pair p, s / total_occurrences)

class SORTING_COMPARATOR
  method compare(key (p1, u1), key (p2, u2))
    if p1 = p2 AND u1 = *
      return key1 is lower
    if p1 = p2 AND u2 = *
      return key2 is lower
    return compare(p1, p2)
The answer is:
The partitioner is missing; the framework could send the key/value pairs with totals to a different reducer than the key/value pairs with word pairs.
class PARTITIONING_COMPARATOR
  method compare(key (p1, u1), key (p2, u2))
    if p1 = p2
      return keys are equal
    return keys are different
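In Hadoop the same fix would be a custom Partitioner that looks only at the left word of the composite key. A sketch, assuming the pair is serialized as a tab-separated Text key; it would be registered with job.setPartitionerClass(LeftWordPartitioner.class).

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Sends (word, u) and (word, *) to the same reducer by hashing only the left word.
public class LeftWordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String leftWord = key.toString().split("\t", 2)[0];
        return (leftWord.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}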
Describe order inversion design pattern.
The answer is:
Order inversion is used if the algorithm requires two passes through mapper-generated key/value pairs with the same key. The first pass generates some overall statistic which is then applied to the data during the second pass. Without it, the reducer would need to buffer the data in memory just to be able to pass through them twice.
The first-pass result is calculated by the mappers and stored in some internal data structure. The mapper emits the result in its close method, after all the usual intermediate key/value pairs.
The pattern requires custom partitioning and sorting. The first-pass result must come to the reducer before the usual key/value pairs and, of course, it must come to the same reducer.
3.4 Secondary Sorting
Describe value-to-key design pattern.
The answer is:
The Hadoop implementation does not sort the grouped values in the reducer's input. Value-to-key conversion is used as a workaround.
Part of the value is added to the key. A custom sort then sorts primarily by the original key and secondarily by the added value part. A custom partitioner must move all data with the same original key to the same reducer.
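A sketch of the value-to-key idea in Hadoop, assuming records keyed by user id whose values should arrive sorted by timestamp; the composite key is a tab-separated Text and the class names are placeholders.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Partitioner;

public class SecondarySortSketch {

    // Partition by the original key (user id) only, so all records of one user
    // reach the same reducer even though the timestamp is now part of the key.
    public static class NaturalKeyPartitioner extends Partitioner<Text, Text> {
        @Override
        public int getPartition(Text key, Text value, int numPartitions) {
            String userId = key.toString().split("\t", 2)[0];
            return (userId.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }

    static void configure(Job job) {
        // the mapper emits key = "userId\ttimestamp"; the default sort of this
        // composite key then orders each user's records by timestamp
        job.setPartitionerClass(NaturalKeyPartitioner.class);
        // a grouping comparator that compares only the userId part (omitted here)
        // would additionally let one reduce() call see all of a user's values:
        // job.setGroupingComparatorClass(NaturalKeyGroupingComparator.class);
    }
}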
3.5 Relational Joins
Describe reduce side join between tables with one-on-one relationship.
The answer is:
The mapper produces key/value pairs with the join id as the key and the row as the value. Corresponding rows from both tables are grouped together by the framework during the shuffle and sort phase.
The reduce method obtains a join id and two values, each representing a row from one table. The reducer joins the data.
Describe reduce side join between tables with one-to-many relationship.
The answer is:
We assume that the join key is the primary key in a table called S and that the second table is called T. In other words, the table S is on the 'one' side of the relationship and the table T is on the 'many' side of the relationship.
We have to implement mapper, custom sorter, partitioner and reducer.
The mapper produces a key composed of the join id and a table flag. The partitioner splits the data in such a way that all key/value pairs with the same join id go to the same reducer. A custom sort puts the key/value pair generated from the table S right before the key/value pairs with the same join id from the table T.
The reducer's input looks like this:
((JoinId1, s)-> row)
((JoinId1, t)-> [rows])
((JoinId2, s)-> row)
((JoinId2, t)-> [rows])
...
((JoinIdn, s), row)
((JoinIdn, t), [rows])
The reducer joins all rows from the s pair with all rows from the following t pair.
Describe reduce side join between tables with many-to-many relationship.
The answer is:
We assume that data are stored in tables called S and T. The table S is smaller. We have to implement mapper, custom sorter, partitioner and reducer.
The mapper produces a key composed of the join id and a table flag. The partitioner splits the data in such a way that all key/value pairs with the same join id go to the same reducer. A custom sort puts the key/value pairs generated from the table S right before all key/value pairs with the data from the table T.
The reducer's input looks like this:
((JoinId1, s)-> [rows])
((JoinId1, t)-> [rows])
((JoinId2, s)-> [rows])
((JoinId2, t)-> [rows])
...
((JoinIdn, s), [rows])
((JoinIdn, t), [rows])
The reducer buffers all rows with the same join id from the table S into memory and joins them with the following table T rows.
All data from the smaller table must fit into memory - the algorithm has a scalability bottleneck.
Describe map side join between two database tables.
The answer is:
Map side join works only if the following assumptions hold:
- both datasets are sorted by the join key,
- both datasets are partitioned the same way.
The mapper maps over the larger dataset and reads the corresponding part of the smaller dataset inside the mapper. As the smaller set is partitioned the same way as the bigger one, only one map task accesses the same data. As the data are sorted by the join key, we can perform a merge join in O(n).
Describe memory backed join.
The answer is:
The smaller dataset is loaded into memory in every mapper. Mappers loop over the larger dataset and join it with the data in memory. If the smaller set is too big to fit into memory, it is loaded into memcached or some other caching solution.
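A sketch of a memory-backed join, assuming the small table is a local tab-separated file whose path is passed in the configuration; the property name and file format are made up for illustration.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MemoryBackedJoinMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Map<String, String> smallTable = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException {
        // load the whole small table into memory, once per mapper;
        // "small.table.path" is a hypothetical configuration property
        String path = context.getConfiguration().get("small.table.path");
        try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split("\t", 2);          // joinId \t row
                smallTable.put(parts[0], parts[1]);
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] parts = value.toString().split("\t", 2);      // joinId \t row of the large table
        String match = smallTable.get(parts[0]);
        if (match != null) {
            context.write(new Text(parts[0]), new Text(parts[1] + "\t" + match));
        }
    }
}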
Which one is faster? Map side join or reduce side join?
4 Inverted Indexing for Text Retrieval
The chapter contains a lot of details about integer number encoding and compression. Since these topics are not directly about MapReduce, I made no questions about them.
4.4 Inverted Indexing: Revised Implementation
Explain the inverted indexing algorithm. You may assume that each document fits into memory. Assume also that there is a huge number of documents. Which part should be optimized by integer encoding and compression?
Input: text documents
  key: document id
  value: text document
Output: key/value pairs where
  key: word
  value: list[documentId, numberOfOccurrences]
  list elements must be sorted by numberOfOccurrences
The answer is:
Mapper counts number of occurrences in the document for each word. As the whole document fits into the memory, we can hold partial results in a map.
Intermediate key/values:
  key: word, numberOfOccurrences
  value: documentId
The custom partitioner groups intermediate key/values by the word only. The custom sort sorts them primarily by the word and secondarily by the number of occurrences.
The reducer uses its initialize method to initialize the list of all postings. The reduce method handles two cases:
- current word equal to the previous word - add documentId and numberOfOccurrences to the posting list,
- current word not equal to the previous word - emit the previous word and its posting list; initialize a new posting list.
Posting list in reducer should be compressed.
class MAPPER
  method MAP(docid, doc d)
    H = new hash map
    for all term w in doc d do
      H(w) = H(w) + 1
    for all term w in H do
      emit(pair(w, H(w)), docid)

class REDUCER
  variable previous_word = 0
  variable PL = new list of postings
  method REDUCE(pair (w, #occurrences), docid)
    if w <> previous_word && previous_word <> 0 do
      emit(previous_word, PL)
      PL = new list of postings
    PL.add(pair(#occurrences, docid))
    previous_word = w

class PARTITIONING_COMPARATOR
  method compare(key (w1, o1), key (w2, o2))
    if w1 = w2
      return keys are equal
    return keys are different

class SORTING_COMPARATOR
  method compare(key (w1, o1), key (w2, o2))
    if w1 = w2 do
      return compare(o1, o2)
    return compare(w1, w2)
5 Graph Algorithms
The chapter contains two algorithms: shortest path in a graph and the page rank algorithm. The questions are straightforward.
5.2 Parallel Breadth-First Search
Find the shortest path from one node origin to all other nodes. Each edge has an associated weight. Input key/value pairs have already been preprocessed into a convenient form.
Input: graph
  key: node id
  value: distance to origin, list[adjacent node, edge length]
Output: key/value pairs where
  key: node id
  value: distance to origin, list[adjacent node, edge length]
The answer is:
The algorithm requires multiple iterations. It stops when an iteration does not change any 'distance to origin'. At worst, there will be O(n) iterations, where n is the number of nodes in the graph.
The mapper passes the original graph to the next iteration as it is. In addition, it generates a key/value pair for each adjacent node. The value contains the minimum known distance from the origin if the route went through the node.
class MAPPER
  method MAP(node, pair(dist, adjacencylist))
    emit(node, pair(dist, adjacencylist))
    for all (closenode, nodedist) in adjacencylist do
      emit(closenode, pair(dist + nodedist, empty))
The reducer finds the minimum known distance for each node. It passes the distance along with the original graph to the next iteration. It also increments a global counter whenever the minimum known distance to any node changes.
class REDUCER
  method REDUCE(node, list(dist, adjacencylist))
    minimum = infinity
    previous_iteration_solution = infinity
    original_graph = empty
    for all (dist, adjacencylist) in list do
      if adjacencylist not empty do
        original_graph = adjacencylist
        previous_iteration_solution = dist
      if minimum > dist
        minimum = dist
    if previous_iteration_solution <> minimum
      increment global counter
    emit(node, pair(minimum, original_graph))
If the global counter is 0, the algorithm stops. Otherwise another iteration is needed.
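A sketch of the driver loop that checks such a counter between iterations, using Hadoop's counter API; the counter group/name and the configureIteration helper are made up for illustration.

import org.apache.hadoop.mapreduce.Counter;
import org.apache.hadoop.mapreduce.Job;

public class BfsDriver {
    public static void main(String[] args) throws Exception {
        long changed = 1;
        int iteration = 0;
        while (changed > 0) {
            Job job = configureIteration(iteration);   // hypothetical helper building one MapReduce pass
            if (!job.waitForCompletion(true)) {
                System.exit(1);
            }
            // read the global counter incremented by reducers whenever a distance changed
            Counter counter = job.getCounters().findCounter("bfs", "distanceChanged");
            changed = counter.getValue();
            iteration++;
        }
    }

    private static Job configureIteration(int iteration) throws Exception {
        // set mapper, reducer, and wire the output of iteration i - 1 as the input of iteration i
        throw new UnsupportedOperationException("left out of this sketch");
    }
}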
Explain the page rank algorithm; assume alpha = 0.
The answer is:
The page rank P(n) of a page n is calculated from the page ranks of all pages linking to it:
P(n) = sum over all pages m linking to n of P(m) / C(m), where C(m) is the number of outgoing links of the page m.
The page rank algorithm runs in iterations. The mapper passes the page rank contribution of each page to the adjacent pages. The reducer updates the page rank of each node. The algorithm stops when the page ranks no longer move.
class MAPPER
  method MAP(page, (page_rank, adjacency_list))
    emit(page, (0, adjacency_list))
    contribution = page_rank / adjacency_list.length
    for all node in adjacency_list do
      emit(node, (contribution, empty))

class REDUCER
  method REDUCE(page, contributions[c1, c2, ..., cn])
    rank = 0
    adjacency_list = new list
    for all c in contributions do
      adjacency_list.addAll(c.adjacency_list)
      rank = rank + c.contribution
    emit(page, (rank, adjacency_list))
6 EM Algorithms For Text Processing
I made no questions out of this chapter.
Exercises
This chapter contains hands-on exercises for MapReduce. Some of them require multiple iterations.
Warm Up
Count number of occurrences of every word in a text collection.
Input: key: document id, value: text document. Output: key: word, value: number of occurrences.
The answer is:
Intermediate pairs: key: word, value: integer - how many times the word was seen in the input.
class MAPPER
  method MAP(docid a, doc d)
    for all term w in doc d do
      emit(w, 1)

class COMBINER
  method COMBINE(word w, counts[c1, c2, ..., cn])
    s = 0
    for all c in counts[c1, c2, ..., cn] do
      s = s + c
    emit(word w, s)

class REDUCER
  method REDUCE(word w, counts[c1, c2, ..., cn])
    s = 0
    for all c in counts[c1, c2, ..., cn] do
      s = s + c
    emit(word w, s)
An alternative solution would use in-mapper combining.
Web Store
The website user log contains user ids and the length of each session. The website has a modest number of registered users. Compute the average session length for each user.
Input: key: user id, value: session length. Output: key: user id, value: average session length.
The answer is:
As the number of registered users is modest, we can use in-mapper combining.
class MAPPER
  variable total_time = new hash map
  variable sessions_number = new hash map
  method MAP(user_id, session_length)
    total_time(user_id) = total_time(user_id) + session_length
    sessions_number(user_id) = sessions_number(user_id) + 1
  method CLOSE
    for all user_id in total_time
      tt = total_time(user_id)
      sn = sessions_number(user_id)
      emit(user_id, pair(tt, sn))

class REDUCER
  method REDUCE(user_id, [pairs(time, sessions_number)])
    total_time = 0
    total_sessions = 0
    for all pairs in [pairs(time, sessions_number)] do
      total_time = total_time + time
      total_sessions = total_sessions + sessions_number
    emit(user_id, total_time / total_sessions)
The web store log contains a user id and the bought item for each sale. You need to implement "buyers of item also bought" functionality. Whenever an item is shown, the store will suggest the five items most often bought by the item's buyers.
Input: key: user id, value: bought item. Output: key: item, value: list of the five most common "buyers of item also bought" items.
The answer is:
Our solution has two iterations. The first iteration generates lists of all items bought by the same user. Grouping is done by the framework; both mapper and reducer perform an identity function.
Input: key: user id, value: bought item. Output: key: user id, value: list of all bought items.
class MAPPER
  method MAP(user_id, item)
    emit(user_id, item)

class REDUCER
  method REDUCE(user_id, items[i1, i2, ..., in])
    emit(user_id, items)
The second iteration solves the co-occurrences problem on the list items. It uses the stripes approach. The only difference from the standard solution is that we emit only the five most common co-occurrences.
Input: key: user id, value: list of all bought items. Output: key: item, value: list of the five most common co-occurrences.
class MAPPER
  method MAP(user_id, items[i1, i2, ..., in])
    for all item i in items do
      H = new hash map
      for all item j in items do
        if i <> j
          H(j) = H(j) + 1
      emit(i, H)

class REDUCER
  method REDUCE(item, stripes[H1, H2, ..., Hn])
    T = new hash map
    for all H in stripes do
      for all (key/value) in H do
        T(key) = T(key) + value
    emit(item, max_five(T))
The web store log contains a user id, timestamp, item and number of bought pieces for each sale. The store is looking for items whose sales rise or decline at the same time. Find the 20 item pairs with the maximum number of such months.
Input: key: user id, value: timestamp, bought item, count. Output: key: (item, item), value: number of months when both items' sales rose or declined. # the output contains the 20 key/value pairs with the maximum value
The answer is:
Our solution requires multiple MapReduce iterations. We have to:
- calculate whether item sales for any given month went up or down,
- create lists of items with the same sales change during the same month,
- find number of co-occurrences in those lists,
- choose items with maximum co-occurrences.
First iteration calculates sales changes for any given month. We have to supply mapper, partitioner, custom sort and reducer. Mapper generates one intermediate key/value pair for each input key/value. Key is composed of sold item and sales month. Value contains number of sold pieces.
Partitioner sends all key/value pairs with the same item to the same reducer. Custom sort sorts them by months. Finally, reducer calculates sales changes.
Input: key: user id, value: timestamp, item, count.
Intermediate key/values: key: (item, month), value: count.
Output: key: (month, up/down/equal), value: item.
class MAPPER
  method MAP(user_id, (timestamp, item, count))
    month = get_month(timestamp)
    emit((item, month), count)

class PARTITIONING_COMPARATOR
  method compare(key (item1, month1), key (item2, month2))
    if item1 = item2
      return keys are equal
    return keys are different

class SORTING_COMPARATOR
  method compare(key (item1, month1), key (item2, month2))
    if item1 = item2 do
      return compare(month1, month2)
    return compare(item1, item2)

class REDUCER
  method REDUCE((item, month), counts[c1, c2, ..., cn])
    c = sum([c1, c2, ..., cn])
    if last_item = item
      if last_month + 1 = month
        // emit correct up/down/equal flags
        if last_count < c
          emit((month, up), item)
        if last_count > c
          emit((month, down), item)
        if last_count = c
          emit((month, equal), item)
      else
        // no sales during some months
        emit((last_month + 1, down), item)
        emit((month, up), item)
    else
      // new item
      emit((last_month + 1, down), last_item)
      emit((month, up), item)
    last_item = item
    last_count = c
    last_month = month
Second iteration groups first iteration results by keys. It generates lists of items with same sales changes during the same month. Framework does all the work. Both mapper and reducer perform an identity function.
Input: key: (month, up/down/equal), value: item. Output: key: (month, up/down/equal), value: [items].
The third iteration performs the standard 'co-occurrences by pairs' algorithm.
Input: key: (month, up/down/equal), value: [items].
Intermediate key/values: key: (item, item), value: partial number of co-occurrences.
Output: key: (item, item), value: number of months when both items' sales rose or declined. # the output contains all item pairs
class MAPPER
  method MAP((month, change), items[i1, i2, ..., in])
    for each i in items do
      for each j in items do
        if i != j
          emit((i, j), 1)

class COMBINER
  method COMBINE((item1, item2), co-occurrences[c1, c2, ..., cn])
    s = 0
    for all c in co-occurrences[c1, c2, ..., cn] do
      s = s + c
    emit((item1, item2), s)

class REDUCER
  method REDUCE((item1, item2), co-occurrences[c1, c2, ..., cn])
    s = 0
    for all c in co-occurrences[c1, c2, ..., cn] do
      s = s + c
    emit((item1, item2), s)
Finally, we have to choose 20 key/value pairs with maximum value. Each mapper selects 20 key/value pairs with maximum value and emits them with the same key. There will be only one reducer which selects final 20 key/value pairs.
Input: key: (item, item), value: number of months when both items' sales rose or declined. # the input contains all item pairs
Intermediate key/values: key: 1, value: (item, item, number of months when both items' sales rose or declined). # contains the 20 key/value pairs with the maximum value from each mapper
Output: key: (item, item), value: number of months when both items' sales rose or declined. # the output contains the 20 key/value pairs with the maximum value
The code is very simple but long.
Criminal Agency
Inputs to all exercises in this chapter use the same data structures.
A criminal agency stole Facebook's friendship database and wants to analyze the new data. Friendships are stored in the form of key/value pairs; each friendship corresponds to two key/value pairs:
Friends:
  key: first friend name, value: second friend name
  key: second friend name, value: first friend name
The agency also owns criminal records of all citizens:
Criminal record:
  key: citizen name, value: when, where, accomplices, description
Find at-risk youths. A person is considered an at-risk youth if more than half of his friends have a criminal record.
The answer is:
Our solution has two iterations. The first iteration joins the two data sets and flags each 'value friend' with a has_record/law_abiding flag.
Output: key: first friend, value: second friend, has_record/law_abiding
The mapper flags each key with the data set name. The partitioner groups the data according to the names in keys and the sorter puts criminal records before friendships. We could use local aggregation to remove multiple criminal records for the same person.
class MAPPER
  method MAP(name, value)
    if value is a name do
      emit(pair(name, friendship), value)
    else
      emit(pair(name, criminal), value)

class PARTITIONING_COMPARATOR
  method compare(key (name1, dataset1), key (name2, dataset2))
    if name1 = name2
      return keys are equal
    return keys are different

class SORTING_COMPARATOR
  method compare(key (name1, dataset1), key (name2, dataset2))
    if name1 = name2 AND dataset1 is criminal
      return key1 is lower
    if name1 = name2 AND dataset2 is criminal
      return key2 is lower
    return compare(name1, name2)

class REDUCER
  variable previous_name
  method REDUCE(pair(name, flag), items[i1, i2, ..., in])
    if flag is criminal do
      previous_name = name
      has_record = criminal
      return
    if previous_name <> name do
      has_record = law_abiding
    else
      has_record = criminal
    previous_name = name
    for all i in items do
      emit(i, pair(name, has_record))
The second iteration counts both the total number of friends and the number of friends with a criminal record. The reducer emits key/value pairs only for at-risk youths. This iteration could also use some kind of local aggregation.
Intermediate key/value: key: name, value: (number of criminal friends, number of friends) # totals are partial, computed per mapper input subset
Output: key: name, value: empty # only at-risk youths
class MAPPER
  method MAP(name, pair(friend_name, has_record))
    if has_record is law_abiding do
      emit(name, pair(0, 1))
    else
      emit(name, pair(1, 1))

class REDUCER
  method REDUCE(name, items[pair(criminals, total)])
    total = 0
    criminals = 0
    for all i in items do
      total = total + i.total
      criminals = criminals + i.criminals
    if criminals / total > 0.5 do
      emit(name, empty)
Find gangs. A gang is a group of people that:
- has exactly 5 members,
- each member is friend with all other members,
- each two members committed at least 3 crimes together.
The answer is:
Again, we need three iterations. The idea is to first clean the graph of all useless edges, so that only criminal contacts remain. Then we split the graph into smaller, manageable sub-graphs: we attach all criminal contacts and the edges between them to each person.
Last iteration reducer's input: key: person, values: all his criminal contacts and the relationships between them.
The final reducer takes the smaller graph represented by the value in each key/value pair and finds complete sub-graphs with 4 vertices in it. Add the person from the key to it, and you have found a complete sub-graph with 5 vertices. The reducer may use any polynomial algorithm.
The first iteration uses the pairs approach to clean the graph. We omit both local aggregation and removal of duplicates; both would make the algorithm more efficient.
Intermediate key/values: key: (first friend, second friend, friendship/accomplice), value: 1
Output: key: (first friend, second friend), value: empty # only friends with a common criminal record
class MAPPER
  method MAP(name, value)
    if value is a name do
      emit(triple(name, value, friendship), empty)
    else
      for all crime_accomplice in value.accomplices do
        emit(triple(name, crime_accomplice, accomplice), 1)

class PARTITIONING_COMPARATOR
  method compare(key (name1, accomplice1, flag1), key (name2, accomplice2, flag2))
    if name1 = name2 AND accomplice1 = accomplice2
      return keys are equal
    return keys are different

class SORTING_COMPARATOR
  method compare(key (name1, accomplice1, flag1), key (name2, accomplice2, flag2))
    if name1 = name2 AND accomplice1 = accomplice2 AND flag1 is friendship
      return key1 is lower
    if name1 = name2 AND accomplice1 = accomplice2 AND flag2 is friendship
      return key2 is lower
    return compare(pair(name1, accomplice1), pair(name2, accomplice2))

class REDUCER
  variable previous_name
  variable previous_accomplice
  method sameAsPrevious(name, accomplice)
    if previous_name <> name
      return false
    if previous_accomplice <> accomplice
      return false
    return true
  method REDUCE(triple(name, accomplice, flag), items[i1, i2, ..., in])
    if sameAsPrevious(name, accomplice) do
      if items.length > 2 do
        emit(name, accomplice)
      return
    if flag is friendship do
      previous_name = name
      previous_accomplice = accomplice
Second iteration attaches lists of all 'second degree' friends to edges:
Input: key: (first friend, second friend), value: empty
Intermediate key/values:
  key: first friend, value: (first friend, second friend)
  key: second friend, value: (first friend, second friend)
Output:
  key: (first friend, second friend), value: all friends of second friend
  key: (second friend, first friend), value: all friends of first friend
class MAPPER
  method MAP((first friend, second friend), empty)
    emit(first friend, (first friend, second friend))
    emit(second friend, (first friend, second friend))

class REDUCER
  method REDUCE(name, edges[e1, e2, ..., en])
    friends = new Set
    friends.add(name)
    for all edge in edges do
      friends.add(edge.v1, edge.v2)
    for all edge in edges do
      emit(edge, friends)
Finally, mapper and shuffle and sort phase together generate lists of all friends of any given person and relationships between them.
Input: key: (friend 1, friend 2), value: all friends of friend 2
Intermediate key/values: key: friend 1, value: (friend 2, all friends of friend 2)
Reducer's input (after shuffle and sort): key: person, values: all his friends and the relationships between them
Output: key: (first friend, second friend, third friend, fourth friend, fifth friend), value: gang
class MAPPER
  method MAP((friend 1, friend 2), all friends of friend 2)
    emit(friend 1, (friend 2, all friends of friend 2))

class REDUCER
  method REDUCE(name, graph attached to it)
    any polynomial algorithm will work