Top AI Interview Questions & Answers — Part 3
1. How would you transfer data from one Hadoop cluster to another?
This question tests your hands-on experience with Hadoop. Migrating data from one cluster to another is not a frequent task, so it is usually only someone with deep operational expertise who can tackle this one.
Distributed copy command (distcp) is a tool provided by Hadoop for copying large data sets between distributed file systems within and across clusters. The command submits a regular MapReduce job that performs a file-by-file copy. MapReduce is also used to effect its distribution, error handling and recovery, and reporting.
hadoop distcp [source] [destination]
Here [source] and [destination] are HDFS URIs, for example hdfs://nn1:8020/data and hdfs://nn2:8020/data (the hostnames and paths are placeholders).
2. What are Fact Tables?
A fact table record captures a measurement or a metric. For example, FACT_PURCHASED gives us the number of units purchased by date, by store and by product for a company. The other tables, which provide context around the measurements and metrics in a fact table, are dimension tables. For the same scenario, DIM_TIME provides the date details and DIM_STORE provides the store details around the purchase. A fact table usually holds foreign keys that reference the primary keys of the required dimension tables, along with the measurement or metric value for that record.
3. Give some problems or scenarios where the MapReduce concept works well and where it doesn’t work.
To answer this question, we should understand the motive behind it. The interviewer wants to know whether we understand MapReduce and similar frameworks well enough to tell where MapReduce fits and where it does not. The aim is to avoid the hammer and nail problem, where to a person with a hammer every problem looks like a nail.
A MapReduce program consists of a Map method and a Reduce method. The Map method takes a set of data and converts it into another set of data, typically by filtering or transforming it, with individual elements broken down into key-value pairs. The Reduce method takes the Map’s output as input and performs a summary operation on the values for each key.
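The Map and Reduce steps described above can be sketched in a few lines of plain Python. This is only a single-process illustration of the idea, not Hadoop's API: the function names (map_phase, shuffle, reduce_phase, word_count) are made up for this example, and the word-count task is an assumed stand-in for a real workload.

```python
from collections import defaultdict


def map_phase(line):
    """Map: break one input line into (word, 1) key-value pairs."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between Map and Reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: summarize all values for one key (here, by summing them)."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence on an in-memory list of lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["the quick brown fox", "the lazy dog"]))
```

In a real cluster, each call to map_phase and reduce_phase could run on a different machine, which is why the two functions share no state.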
At a system level, the MapReduce System orchestrates the processing by marshalling the distributed servers, running the various tasks in parallel, managing all communications and data transfer between the various parts of the system, and providing the redundancy and fault tolerance.
Once we understand this process, the limitations follow. First, a MapReduce job expects its full input to be available up front, so it is a poor fit for streaming data. Second, some computations need to happen in memory to be effective, while MapReduce writes intermediate results to disk between stages, so it handles these poorly. Third, in ML we use iterative processing that converges toward a result, getting closer to the desired answer with each iteration. MapReduce cannot be used directly for iterative processing, because each iteration has to run as a separate job. These are the top three areas where MapReduce may not work.
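To make the iterative case concrete, here is a minimal sketch, assuming a toy problem: estimating the mean of a dataset by gradient descent. The names (map_gradient, reduce_mean, iterate_to_mean) and the parameter values are hypothetical. The point is that every iteration is a full map-and-reduce pass over the data, and the next pass cannot start until the previous one has finished. In classic MapReduce, each such pass would be a separate job that reads its input from disk and writes its output back.

```python
def map_gradient(x, estimate):
    """Map: gradient of the squared error 0.5 * (estimate - x) ** 2 for one record."""
    return estimate - x


def reduce_mean(gradients):
    """Reduce: average the per-record gradients into one number."""
    return sum(gradients) / len(gradients)


def iterate_to_mean(data, learning_rate=0.5, tolerance=1e-6, max_iterations=100):
    """Repeat map + reduce passes until the averaged gradient is close to zero."""
    estimate = 0.0
    passes = 0
    for passes in range(1, max_iterations + 1):
        # One full pass over the data; this is what a single job would compute.
        gradient = reduce_mean([map_gradient(x, estimate) for x in data])
        estimate -= learning_rate * gradient
        if abs(gradient) < tolerance:
            break
    return estimate, passes


if __name__ == "__main__":
    estimate, passes = iterate_to_mean([2.0, 4.0, 6.0])
    print(f"estimate={estimate:.4f} after {passes} passes over the data")
```

Even this tiny problem needs many passes to converge, which is why frameworks that keep data in memory across iterations tend to suit this kind of workload better.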
The sole motivation of this blog article is to provide answers to some Data Science Interview Questions. I aim to make this a living document, so any updates and suggested changes can always be included. Please provide relevant feedback.