Tuesday, October 25, 2011

MapReduce Questions and Answers

With all the hype and buzz surrounding NoSQL, I decided to have a look at it. I quickly found that there is not one NoSQL I could learn. Rather, there are various different solutions with different purposes and trade offs. Those various solutions tend to have one thing in common: processing of data in NoSQL storage is usually done using MapReduce framework.

Search on MapReduce found various scattered blog posts, some universities courses pages and one book that seems to contain almost everything other sources did.

This post contains MapReduce questions and answers based on the book. Basically, if I would be a student, this is what I would have made as a test preparation notes. If I would be a teacher, this is what I would ask on the exam.

First chapter gives credit where the credit is due, the rest contains questions. Last chapter contains hands-on coding exercises.

The Book

The book is named Data-Intensive Text Processing with MapReduce. If you are unsure whether you want to buy it or not, pre-production manuscript is available for free.

Do not be fooled by the title. The book is more about MapReduce itself and less about text processing. First half of the book describes general purpose tricks (design patterns) useful for any task. Second half contains a chapters on text processing, graph algorithms and expectation maximization.

The book contains almost everything I found on various blogs, university courses pages and much more.

Questions

Questions are split by book chapters. With one minor exception (counters), questions are mostly based on the first half of the book. Second half contains concrete algorithms and I wanted to focus on general purpose knowledge.

It does not mean that learning them is not useful. Especially Graph Algorithms chapter contains easily generalizable ideas.

All answers are by default collapsed. Any of them can be expanded by click on '[+] Click to Expand the Answer' link. Use following links to expand/collapse all of them:

2 MapReduce Basics
2.2 Mappers and Reducers
Describe general MapReduce algorithm. Split it into phases. For each phase include:
  • who is responsible (framework/programmer/customizable),
  • what it does,
  • phase input,
  • phase output.
[+] Click to Expand the Answer
Decide if the statement is true or false: All MapReduce implementations implement exactly same algorithm.
[+] Click to Expand the Answer
True or false: Each mapper must generate the same number of key/value pairs as its input had.
[+] Click to Expand the Answer
True or false: Mappers input key/value pairs are sorted by the key.
[+] Click to Expand the Answer
True or false: Mappers output key/value must be of the same type as its input.
[+] Click to Expand the Answer
True or false: Reducer is applied to all values associated with the same key.
[+] Click to Expand the Answer
True or false: Reducers input key/value pairs are sorted by the key.
[+] Click to Expand the Answer
True or false: Each reducer must generate the same number of key/value pairs as its input had.
[+] Click to Expand the Answer
True or false: Reducers output key/value pair must be of the same type as its input.
[+] Click to Expand the Answer
2.3 The Execution Framework
What happens in case of hardware/software failure?
[+] Click to Expand the Answer
Is it possible to start reducers while some mappers still run? Why?
[+] Click to Expand the Answer
Define a straggler.
[+] Click to Expand the Answer
What is speculative execution (also called backup tasks)? What problem does it solve?
[+] Click to Expand the Answer
2.4 Partitioners and Combiners
What does partitioner do?
[+] Click to Expand the Answer
What does combiner do?
[+] Click to Expand the Answer
Decide if the statement is true or false: Each combiner runs exactly once.
[+] Click to Expand the Answer
2.5 The Distributed File System
Briefly describe HDFS architecture.
[+] Click to Expand the Answer
Decide if the statement is true or false: HDFS is good at managing large number of small files.
[+] Click to Expand the Answer
2.6 Hadoop Cluster Architecture
Explain difference between jobtracker and tasktracker?
[+] Click to Expand the Answer
Explain mapper lifecycle.
[+] Click to Expand the Answer
Explain reducer lifecycle.
[+] Click to Expand the Answer
3 MapReduce Algorithm Design
3.1 Local Aggregation
What is local aggregation and why is it used?
[+] Click to Expand the Answer
What is in-mapper combining? State advantages and disadvantages over writing custom combiner.
[+] Click to Expand the Answer
3.2 Pairs and Stripes
Explain Pair design patter on a co-occurence example. Include advantages/disadvantages against Stripes approach, possible optimizations and their efficacy.
[+] Click to Expand the Answer
Explain Stripes design patter on a co-occurence example. Include advantages/disadvantages against Pairs approach, possible optimizations and their efficacy.
[+] Click to Expand the Answer
Explain scalability bottleneck caused by stripes approach.
[+] Click to Expand the Answer
3.3 Computing Relative Frequencies
Relative frequencies of co-occurrences problem:
Input: text documents
    key: document id
  value: text document

Output: key/value pairs where
    key: pair(word1, word2)
  value: #co-occurrences(word1, word2)/#co-occurrences(word1, any word)

Fix following solution to relative frequencies of co-occurrences problem:
class MAPPER
  method INITIALIZE
    H = new hash map    

  method MAP(docid a, doc d)
    for all term w in doc d do
      for all term u patri neighbors(w) do
        H(w) = H(w) + 1
        emit(pair(u, w), count 1)

  method CLOSE 
    for all term w in H
      emit(pair(w, *), H(w))    

class REDUCER
  variable total_occurrences = 0

  method REDUCE(pair (p, u), counts[c1, c2, ..., cn])
    s = 0
    for all c in counts[c1, c2, ..., cn] do 
      s = s + c

    if u = * 
      total_occurrences = s
    else
      emit(pair p, s/total_occurrences)

class SORTING_COMPARATOR
  method compare(key (p1, u1), key (p2, u2))
    if p1 = p2 AND u1 = *
      return key1 is lower

    if p1 = p2 AND u2 = *
      return key2 is lower

    return compare(p1, p2)
[+] Click to Expand the Answer
Describe order inversion design pattern.
[+] Click to Expand the Answer
3.4 Secondary Sorting
Describe value-to-key design pattern.
[+] Click to Expand the Answer
3.5 Relational Joins
Describe reduce side join between tables with one-on-one relationship.
[+] Click to Expand the Answer
Describe reduce side join between tables with one-to-many relationship.
[+] Click to Expand the Answer
Describe reduce side join between tables with many-to-many relationship.
[+] Click to Expand the Answer
Describe map side join between two database tables.
[+] Click to Expand the Answer
Describe memory backed join.
[+] Click to Expand the Answer
Which one is faster? Map side join or reduce side join?
[+] Click to Expand the Answer
4 Inverting Indexing for Text Retrieval
The chapter contains a lot of details about integer numbers encoding and compression. Since these topics are not directly about MapReduce, I made no questions about them.
4.4 Inverting Indexing: Revised Implementation
Explain inverting index retrieval algorithm. You may assume that each document fits into the memory. Assume also then there is a huge number of documents. Which part should be optimized by integer encoding and compression?
Input: text documents
    key: document id
  value: text document

Output: key/value pairs where
    key: word 
  value: list[documentId, numberOfOccurences]
         list elements must be sorted by numberOfOccurences
[+] Click to Expand the Answer
5 Graph Algorithms
The chapter contains two algorithms: shortest path in the graph and page ranking algorithm. The questions are straightforward.
5.2 Parallel Breadth-First Search
Find shortest path from one node origin to all other nodes. Each edge has a weight associated. Input key/value pairs have already bean preprocessed into comfortable form.
Input: graph
    key: node id
  value: distance to origin, list[adjacent node, edge length]

Output: key/value pairs where
    key: node id
  value: distance to origin, list[adjacent node, edge length]
[+] Click to Expand the Answer
Explain page rank algorithm, assume alpha = 0.
[+] Click to Expand the Answer
6 EM Algorithms For Text Processing
I made no questions out of this chapter.

Exercises

This chapter contains hands-on exercises for MapReduce. Some of them require multiple iterations.

Warm Up
Count number of occurrences of every word in a text collection.
Input:
    key: document id,
  value: text document.

Output:
    key: word,
  value: number of occurences.
[+] Click to Expand the Answer
Web Store
Website user log contains user ids and length of each session. The website has modest number of registered users. Compute the average session length for each user.
Input:
    key: user id,
  value: session length.

Output:
    key: user id,
  value: average session length.
[+] Click to Expand the Answer
Web store log contains user id and bought item for each sale. You need to implement "buyers of item also bought" functionality. Whenever the item is shown, the store will suggest five items most often bought by items buyers.
Input:
    key: user id,
  value: brought item.

Output:
    key: item,
  value: list of five most common "buyers of item also bought" items.
[+] Click to Expand the Answer
Web store log contains user id, timestamp, item and number of brought pieces for each sale. The store is looking for items whose sales rise or decline at the same time. Find 20 item couples with maximum of such months.
Input:
    key: user id,
  value: timestamp, brought item, count.

Output:
    key: item, item
  value: number of months when both items sales rose or decline.
      #: the output contains 20 key/value pairs with maximum value
[+] Click to Expand the Answer
Criminal Agency
Inputs to all exercises in this chapter uses the same data structure.

Criminal agency stole Facebook's friendships database and wants to analyze new data. Friendships are stored in form key/value pairs, each friendship corresponds to two key/value pairs:
Friends:
    key: first friend name
  value: second friend name

    key: second friend name
  value: first friend name

The agency owns also criminal records of all citizens:
Criminal record:
    key: citizen name
  value: when, where, accomplices, description
Find at risk youths. A person is considered at risk youth if more than half of his friends have criminal record.
[+] Click to Expand the Answer
Find gangs. Gang is a group of people that:
  • has exactly 5 members,
  • each member is friend with all other members,
  • each two members committed at least 3 crimes together.
[+] Click to Expand the Answer

33 comments:

Anonymous said...

Nice!

PS: Link to "Data-Intensive Text Processing with MapReduce" book is broken

Meri said...

@Anonymous Thank you, I fixed the link. They have it hosted on Github now.

Anonymous said...

This covers most topics on hadoop, thanks

Kumar said...

Thanks a lot.its gives more idea about Mapreduce

Anonymous said...

excellent!!!

Unknown said...

Thanks Meri .. nice article

Anonymous said...

Many thanks for the article :)

Jimmy said...

thanks a not god bless :)

Navin Karn said...

The word "shuttle" should be replaced by "shuffle".

sathya said...

This is a great post. I like this topic.This site has lots of advantage.I found many interesting things from this site. It helps me in many ways.Thanks for posting this again.

Hadoop Training in Chennai

Base SAS Training in Chennai

Unknown said...

Thanks for sharing the Map Reduce questions and answers which really helpful to develop my knowledge and cracking the interview easily.
Check out the
https://www.credosystemz.com/training-in-chennai/best-hadoop-training-in-chennai/

Unknown said...

Hi, your post on mapreduce was very much useful for me in my codings do keep posting your blogs and links Hadoop Training in Velachery | Hadoop Training .
Hadoop Training in Chennai | Hadoop .

Java training in chennai said...
This comment has been removed by the author.
Yourdoorstep said...

Thanks for Fantasctic blog and its to much informatic which i never think ..Keep writing and grwoing your self

Birth certificate in delhi
Birth certificate in ghaziabad
Birth certificate in gurgaon
Birth certificate in noida
birth certificate
how to get birth certificate in delhi
birth certificate agent in delhi
how to download birth certificate
birth certificate in greater noida
name add in birth certificate
Birth certificate in delhi

daizy mathew said...

Very nice and informative post
php training
python training

dezayno said...

At Dezayno, we understand that it is the responsibility of each of us, as individuals and together, to protect the environment, and provide safe, environmentally friendly products for the world. This is why all of our Premium Limited Edition T-Shirts are made on Non-GMO 100% Certified Organic Cotton that is grown on organic farms. Our Natural Organic ringspun cotton t-shirts are not only Eco friendly, they are extremely comfortable and designed to last. Our textile and manufacturing facilities have been carefully selected to help Dezayno source the world's best all-natural, organic materials for our premium apparel brand. Organic Clothing

Joyal said...

Oracle Training | Online Course | Certification in chennai | Oracle Training | Online Course | Certification in bangalore | Oracle Training | Online Course | Certification in hyderabad | Oracle Training | Online Course | Certification in pune | Oracle Training | Online Course | Certification in coimbatore

Mr Sourav said...

Nice Article, Thanks For Sharing The helpful content...
Keep Also Check My new article..
thesrvtech
Top 5 best smartphone under 10000 in hindi
Sim swap kya hai
Telegram vs Whatsapp In Hindi
Shardul thakur biography in hindi
Wifi caliing kya hai
How to link pan card with adhar card
What is Madhar aap in hindi

moumita said...

Hey , I have readout your stunning post. It is very informative and effective post. Bridal Makeup Artists in Kolkata

Boomerang Airplane said...
This comment has been removed by the author.
Sneha Cubestech said...

Thanks for sharing these questions and answers, This is really helpful for my work, Keep on sharing more like this

Software Development company
Best web development company
Mobile app development company

Rosy S said...

Great post with good content and thanks for sharing!!
Also Visit:
Google Analytics Training in Chennai |
Google Analytics Online Course

Abdulla said...

PrintWay provides its customers a varied suite of printed cards. We have a history of over 6 years in the printing industry. We have produced business cards, pamphlets, flyers, brochures, etc.

sadaesq said...

kralbet
betpark
tipobet
slot siteleri
kibris bahis siteleri
poker siteleri
bonus veren siteler
mobil ödeme bahis
betmatik
CNLUB

Sanu Dutta said...

Get access to all certification questions and answers:
scrum certification exam questions
safe for teams exam answers
google ads search certification answers
google ads measurement certification answers

city space me said...

best design company uae City Space

Anonymous said...

gynghjgynghbytgtyfg
شركة تنظيف مجالس

Anonymous said...

شركة تنظيف شقق بالقطيف WJ3oW44wN8

Anonymous said...

 افران جدة q4GQaNU0pl

Anonymous said...

شركة تسليك مجاري بالجبيل SrjrdcjtBm

Anonymous said...

شركة مكافحة النمل الابيض بالاحساء q9Opl0L1NC

James Sarah Blogger said...

Examining some MapReduce questions and answers will help you grasp the concept of this incredible data processing framework to a greater extent. The community driven insights can really be valuable whether analytical factory or data factory whether you're taking on complex data sets or optimizing performance. Taking the example of life insurance landing page, with the help of MapReduce, you can analyze huge data and identify important trends which in return help in enhancing user experience by processing large data efficiently and effectively.

Anonymous said...

شركة عزل اسطح بجازان L5HhZNZiRL

Post a Comment