
About Hadoop

Hadoop is an open-source distributed processing framework that manages data processing and storage for big data applications running on clusters of commodity hardware, with HDFS as its file system. It sits at the center of a growing ecosystem of big data technologies that are mainly used to support advanced analytics initiatives, including predictive analytics, data mining, and machine learning. Hadoop can handle various forms of structured and unstructured data, giving users more flexibility for collecting, processing, and analyzing data than relational databases and data warehouses.

Hadoop Interview Questions And Answers

1. What is Hadoop and list its components?

Hadoop is an open-source framework used for storing large data sets and running applications across clusters of commodity hardware.

It offers massive storage for any type of data and can handle virtually limitless concurrent tasks.

Core components of Hadoop:

  • Storage unit – HDFS (NameNode, DataNode)
  • Processing framework – YARN (ResourceManager, NodeManager)

2. What is YARN and explain its components?

Yet Another Resource Negotiator (YARN) is one of the core components of Hadoop. It is responsible for managing resources for the various applications operating in a Hadoop cluster and for scheduling tasks on different cluster nodes.

YARN components:

  • Resource Manager - It is the master daemon and controls resource allocation in the cluster.
  • Node Manager - It is the slave daemon that runs on every DataNode and is responsible for the execution of tasks on that node.
  • Application Master - It maintains the user job lifecycle and resource requirements of individual applications. It operates along with the Node Manager and controls the execution of tasks.
  • Container - It is a combination of resources such as Network, HDD, RAM, CPU, etc., on a single node.

3. What are the Hadoop daemons and explain their roles in a Hadoop cluster?

A daemon is simply a process that runs in the background. Hadoop has five such daemons. They are:

  • NameNode - It is the master node, responsible for storing the metadata of all the directories and files.
  • DataNode - It is the slave node, responsible for storing the actual data.
  • Secondary NameNode - It periodically merges the edits log with the FsImage to create checkpoints of the NameNode's metadata, so the NameNode can recover faster; it is not a hot standby.
  • JobTracker - It runs on the master node, accepts jobs from clients, and allocates tasks to the TaskTrackers.
  • TaskTracker - It runs on the data nodes, executes the tasks assigned to it, and reports their status back to the JobTracker.

4. What is Avro Serialization in Hadoop?

  • The process of translating the state of objects or data structures into binary or textual form is called serialization. Avro is a serialization framework that uses a language-independent schema (written in JSON).
  • It provides AvroMapper and AvroReducer for running MapReduce programs.

5. How can you skip the bad records in Hadoop?

Hadoop provides the SkipBadRecords class for skipping bad records while processing map inputs, as sketched below.
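For illustration, here is a minimal sketch of how the skipping thresholds could be set through the older org.apache.hadoop.mapred API; the specific threshold values below are only examples:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapred.SkipBadRecords;

    public class SkipBadRecordsExample {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Enter "skipping mode" only after a task attempt has already failed twice.
            SkipBadRecords.setAttemptsToStartSkipping(conf, 2);
            // Allow up to 10 bad records to be skipped around a failing map input.
            SkipBadRecords.setMapperMaxSkipRecords(conf, 10L);
            // Allow up to 5 bad key groups to be skipped on the reduce side.
            SkipBadRecords.setReducerMaxSkipGroups(conf, 5L);
            // This Configuration is then passed to the job that should tolerate bad records.
        }
    }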

6. Compare Hadoop and RDBMS?

Apache Hadoop is often called the future of the database because it stores and processes amounts of data that would not be possible with a traditional database. The differences between Hadoop and an RDBMS are as follows:

  • Architecture – A traditional RDBMS provides ACID transactional properties, whereas Hadoop is a distributed computing framework with two main components: a distributed file system (HDFS) and MapReduce.
  • Data acceptance – An RDBMS accepts only structured data, while Hadoop can accept both structured and unstructured data. This is a great feature of Hadoop, because we can store everything in our cluster and no data is lost.
  • Scalability – An RDBMS is a traditional database that scales vertically: if the data to be stored grows, we have to upgrade the configuration of that particular machine. Hadoop scales horizontally, so we just add one or more nodes to the cluster when more capacity is needed.
  • OLTP (real-time data processing) and OLAP – A traditional RDBMS supports OLTP (real-time data processing). OLTP is not supported in Apache Hadoop; instead, Hadoop supports large-scale batch-processing (OLAP) workloads.
  • Cost – An RDBMS is typically licensed software, so we have to pay for it, whereas Hadoop is an open-source framework, so there is no software cost.

7. What are the modes in which Hadoop can run?

Apache Hadoop runs in three modes:

  • Local (Standalone) Mode – By default, Hadoop runs in a single-node, non-distributed mode as a single Java process. Local mode uses the local file system for input and output operations, does not use HDFS, and is also used for debugging. No custom changes to the configuration files are required in this mode.
  • Pseudo-Distributed Mode – Like standalone mode, Hadoop runs on a single node, but each daemon runs in a separate Java process. In this mode we need to configure all four configuration files listed in the next answer. All daemons run on one node, so the master and slave nodes are the same.
  • Fully-Distributed Mode – In this mode, the daemons run on separate nodes, forming a multi-node cluster with separate master and slave nodes.

8. What are the features of Standalone (local) mode?

By default, Hadoop runs in a single-node, non-distributed mode as a single Java process. Local mode uses the local file system for input and output operations. One can also use it for debugging. It does not support the use of HDFS. Standalone mode is suitable only for running programs during development and testing. Further, no custom changes to the configuration files are required in this mode. The configuration files are:

  • core-site.xml
  • hdfs-site.xml
  • mapred-site.xml
  • yarn-site.xml

9. What are the features of Pseudo mode?

Just like standalone mode, Hadoop can also run on a single node in this mode. The difference is that each Hadoop daemon runs in a separate Java process. In pseudo-distributed mode, we need to configure all four files mentioned above. In this case, all daemons run on one node, so the master and slave nodes are the same.

Pseudo-distributed mode is suitable for both development and testing environments; all the daemons run on the same machine.

10. Compare Hadoop 2 and Hadoop 3?

  • In Hadoop 2, the minimum supported version of Java is Java 7, while in Hadoop 3 it is Java 8.
  • Hadoop 2 handles fault tolerance through replication (which wastes storage space), while Hadoop 3 handles it through erasure coding.
  • For data balancing, Hadoop 2 uses the HDFS balancer, while Hadoop 3 adds an intra-DataNode balancer.
  • In Hadoop 2, some default service ports fall within the Linux ephemeral port range, so they can fail to bind at startup. In Hadoop 3, these ports have been moved out of the ephemeral range.
  • In Hadoop 2, HDFS replication has a 200% overhead in storage space, while erasure coding in Hadoop 3 has roughly a 50% overhead.
  • Both Hadoop 2 and Hadoop 3 address the NameNode single point of failure (SPOF) with automatic failover, so no manual intervention is needed when the NameNode fails; Hadoop 3 additionally supports more than one standby NameNode.

11. How is security achieved in Hadoop?

Apache Hadoop achieves security by using Kerberos.

At a high level, there are three steps that a client must take to access a service when using Kerberos, each of which involves a message exchange with a server (a login sketch follows the list):

  • Authentication – The client authenticates itself to the authentication server and receives a timestamped Ticket-Granting Ticket (TGT).
  • Authorization – The client uses the TGT to request a service ticket from the Ticket-Granting Server.
  • Service Request – The client uses the service ticket to authenticate itself to the server that hosts the service.
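As an illustration only, a minimal client-side Kerberos login sketch is shown below; the principal name and keytab path are placeholders for whatever the cluster actually uses:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.security.UserGroupInformation;

    public class KerberosLoginExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Tell the Hadoop client libraries to use Kerberos authentication.
            conf.set("hadoop.security.authentication", "kerberos");
            UserGroupInformation.setConfiguration(conf);
            // Obtain Kerberos credentials from a keytab (principal and path are placeholders).
            UserGroupInformation.loginUserFromKeytab(
                    "hdfsuser@EXAMPLE.COM", "/etc/security/keytabs/hdfsuser.keytab");
            System.out.println("Logged in as: " + UserGroupInformation.getLoginUser());
        }
    }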

12. Why does one remove or add nodes in a Hadoop cluster frequently?

One of the most important features of Hadoop is its use of commodity hardware. However, this leads to frequent DataNode crashes in a Hadoop cluster.

Another striking feature of Hadoop is the ease with which it scales in response to rapid growth in data volume.

Hence, for these reasons, administrators frequently add and remove DataNodes in a Hadoop cluster.

13. How do you define “Rack Awareness” in Hadoop?

Rack Awareness is the algorithm by which the “NameNode” decides how blocks and their replicas are placed, based on rack definitions, so as to improve fault tolerance while limiting network traffic between racks. With the default replication factor of 3, the policy is that “for every block of data, two copies will exist in one rack and the third copy in a different rack”. This rule is known as the “Replica Placement Policy”.

14. What is “speculative execution” in Hadoop?

If a node appears to be executing a task slowly, the master node can redundantly execute another instance of the same task on another node. The task that finishes first is accepted, and the other is killed. This process is called “speculative execution”.
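Speculative execution is enabled by default. As a small sketch, assuming the standard Hadoop 2.x property names, it can be switched off for a particular job like this:

    import org.apache.hadoop.conf.Configuration;

    public class SpeculativeExecutionConfig {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Disable speculative execution for map tasks (it is on by default).
            conf.setBoolean("mapreduce.map.speculative", false);
            // Disable speculative execution for reduce tasks as well.
            conf.setBoolean("mapreduce.reduce.speculative", false);
            // Pass this Configuration to Job.getInstance(conf) when submitting the job.
        }
    }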

15. What is the difference between an “HDFS Block” and an “Input Split”?

The “HDFS Block” is the physical division of the data, while the “Input Split” is the logical division of the data. HDFS divides data into blocks for storage, whereas for processing, MapReduce divides the data into input splits and assigns each split to a mapper function.
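As a brief sketch, the logical split size can be tuned per job through FileInputFormat, independently of the physical HDFS block size; the sizes below are arbitrary examples:

    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class InputSplitSizeExample {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance();
            // The HDFS block size is fixed when a file is written (dfs.blocksize),
            // but the logical input split size used by MapReduce can be tuned per job.
            FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);   // 64 MB minimum split
            FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);  // 256 MB maximum split
        }
    }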

16. Name the three modes in which Hadoop can run.

The three modes in which Hadoop can run are as follows:

  1. Standalone (local) mode: This is the default mode if we don’t configure anything. In this mode, all the components of Hadoop, such as NameNode, DataNode, ResourceManager, and NodeManager, run as a single Java process. This mode uses the local filesystem.
  2. Pseudo-distributed mode: A single-node Hadoop deployment is considered to be running in pseudo-distributed mode. In this mode, all the Hadoop services, including both the master and the slave services, are executed on a single compute node.
  3. Fully distributed mode: A Hadoop deployment in which the master and slave services run on separate nodes is said to be in fully distributed mode.

17. What happens when two clients try to access the same file in HDFS?

HDFS supports exclusive writes only.

When the first client contacts the “NameNode” to open the file for writing, the “NameNode” grants a lease to the client to create this file. When the second client tries to open the same file for writing, the “NameNode” will notice that the lease for the file is already granted to another client, and will reject the open request for the second client.

18. How does NameNode tackle DataNode failures?

The NameNode periodically receives a Heartbeat (signal) from each DataNode in the cluster, which implies that the DataNode is functioning properly.

A block report contains a list of all the blocks on a DataNode. If a DataNode fails to send a heartbeat message, after a specific period of time it is marked dead.

The NameNode then replicates the blocks of the dead node to another DataNode using the replicas created earlier.

19. What is the HDFS block size?

By default, the HDFS block size is 128MB for Hadoop 2.x.

20. What is the default replication factor?

  • The replication factor is the number of times each block of a file is replicated (copied) across the cluster.
  • The default replication factor is 3 (see the sketch after this list).
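As a hedged sketch (the file path is a placeholder), the default block size can be inspected and the replication factor of an existing file changed through the HDFS FileSystem API:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationAndBlockSize {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            Path file = new Path("/user/demo/sample.txt"); // placeholder path
            // Default block size for new files (128 MB on Hadoop 2.x unless overridden).
            System.out.println("Default block size: " + fs.getDefaultBlockSize(file));
            // Change the replication factor of this file from the default (3) to 2.
            fs.setReplication(file, (short) 2);
            fs.close();
        }
    }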

21. What is a Checkpoint Node in Hadoop?

Checkpoint Node is the new implementation of the Secondary NameNode in Hadoop. It periodically creates checkpoints of the filesystem metadata by merging the edits log file with the FsImage file.

22. Can NameNode and DataNode be commodity hardware?

The smart answer to this question is that DataNodes can be commodity hardware, like personal computers and laptops, because they only store data and are required in large numbers. From experience, however, you can say that the NameNode is the master node and stores metadata about all the blocks stored in HDFS. It requires a lot of memory (RAM), so the NameNode needs to be a high-end machine with good memory capacity.

23. What does ‘jps’ command do?

The ‘jps’ command helps us check whether the Hadoop daemons are running or not. It shows all the Hadoop daemons, i.e. NameNode, DataNode, ResourceManager, NodeManager, etc., that are running on the machine.

24. How will you write a custom partitioner?

A custom partitioner for a Hadoop job can be written by following the steps below (a sketch follows the list):

  • Create a new class that extends the Partitioner class.
  • Override the getPartition method with the partitioning logic that should run during the MapReduce shuffle.
  • Register the custom partitioner with the job using the setPartitionerClass method, or add it to the job via a configuration property.
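Below is a minimal, illustrative partitioner; the key/value types and the routing rule are invented for the example, and it would be registered in the driver with job.setPartitionerClass(FirstCharPartitioner.class):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Illustrative rule: keys starting with a digit go to partition 0,
    // all other keys are spread over the remaining partitions by hash.
    public class FirstCharPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            String k = key.toString();
            if (numPartitions <= 1 || k.isEmpty() || Character.isDigit(k.charAt(0))) {
                return 0;
            }
            return 1 + (k.hashCode() & Integer.MAX_VALUE) % (numPartitions - 1);
        }
    }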

25. What is a “Combiner”?

A “Combiner” is a mini “reducer” that performs the local “reduce” task. It receives the input from the “mapper” on a particular “node” and sends the output to the “reducer”. “Combiners” help in enhancing the efficiency of “MapReduce” by reducing the quantum of data that is required to be sent to the “reducers”.
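As a sketch, the driver below reuses Hadoop's bundled TokenCounterMapper and IntSumReducer for a word count and simply registers the same reducer class as the combiner, so partial sums are computed locally on each mapper node before the shuffle:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
    import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

    public class CombinerDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "wordcount-with-combiner");
            job.setJarByClass(CombinerDriver.class);
            job.setMapperClass(TokenCounterMapper.class);   // emits (word, 1) pairs
            // The combiner performs a local "reduce" on each mapper node, so only
            // partial sums travel across the network to the reducers.
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            // Input/output paths would be added here before calling job.waitForCompletion(true).
        }
    }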

26. What do you know about “SequenceFileInputFormat”?

“SequenceFileInputFormat” is an input format for reading data from sequence files. It is a specific compressed binary file format that is optimized for passing the output of one “MapReduce” job to the input of another “MapReduce” job.

Sequence files can be generated as the output of other MapReduce tasks and are an efficient intermediate representation for data that is passing from one MapReduce job to another.
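For illustration, the driver of an intermediate job in such a chain might be configured as follows; the input and output paths are placeholders:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

    public class SequenceFileChainDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "sequence-file-stage");
            // Read the sequence-file output of the previous MapReduce job...
            job.setInputFormatClass(SequenceFileInputFormat.class);
            FileInputFormat.addInputPath(job, new Path("/data/stage1-output"));   // placeholder
            // ...and write sequence files for the next job in the chain.
            job.setOutputFormatClass(SequenceFileOutputFormat.class);
            FileOutputFormat.setOutputPath(job, new Path("/data/stage2-input"));  // placeholder
        }
    }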

27. What is a Heartbeat in HDFS?

A Heartbeat is the signal that the NameNode receives from the DataNodes to show that they are functioning (alive); it is how the NameNode and DataNodes communicate. If the NameNode does not receive a heartbeat from a DataNode within a certain time, that node is considered dead, and the NameNode then schedules the creation of new replicas of its blocks on other DataNodes.

Heartbeats from a DataNode also carry information about its total storage capacity, the fraction of storage in use, and the number of data transfers currently in progress.

The default heartbeat interval is 3 seconds. One can change it by using dfs.heartbeat.interval in hdfs-site.xml.
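As a small sketch, the same property can also be read or overridden programmatically through a Configuration object; it is normally set in hdfs-site.xml, and the value below is only an example:

    import org.apache.hadoop.conf.Configuration;

    public class HeartbeatIntervalConfig {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Print the configured DataNode heartbeat interval (3 seconds if unset).
            System.out.println("dfs.heartbeat.interval = " + conf.get("dfs.heartbeat.interval", "3"));
            // Override it for this configuration object (example value only).
            conf.setLong("dfs.heartbeat.interval", 5);
        }
    }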

28. Explain Hadoop Archives?

Apache Hadoop HDFS stores and processes large (terabytes) data sets. However, storing a large number of small files in HDFS is inefficient, since each file is stored in a block, and block metadata is held in memory by the namenode.

Reading through small files normally causes lots of seeks and lots of hopping from DataNode to DataNode to retrieve each small file, which is an inefficient data access pattern.

Hadoop Archive (HAR) deals with this small-files issue. A HAR packs a number of small files into a larger file, so one can still access the original files in parallel, transparently (without expanding the archive) and efficiently.

Hadoop Archives are special-format archives. A Hadoop Archive maps to a file system directory and always has a *.har extension. In particular, Hadoop MapReduce can use Hadoop Archives as input.
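As a hedged sketch (the archive location is a placeholder), the files inside a HAR can be listed through the ordinary FileSystem API using the har:// scheme, without expanding the archive:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReadHarArchive {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Archive assumed to have been created earlier, e.g. with:
            //   hadoop archive -archiveName files.har -p /user/demo input /user/demo/archives
            Path har = new Path("har:///user/demo/archives/files.har"); // placeholder location
            FileSystem harFs = har.getFileSystem(conf);
            // The archived files are listed transparently through the har:// scheme.
            for (FileStatus status : harFs.listStatus(har)) {
                System.out.println(status.getPath());
            }
        }
    }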

29. Explain the Single point of Failure in Hadoop?

In Hadoop 1.0, the NameNode is a single point of failure (SPOF). If the NameNode fails, all clients are unable to read or write files, and the whole Hadoop system is out of service until a new NameNode is brought up.

Hadoop 2.0 overcomes this SPOF by providing support for multiple NameNodes. The high-availability feature adds a standby NameNode to the Hadoop architecture and provides automatic failover: if the active NameNode fails, the standby NameNode takes over all its responsibilities and the cluster continues to work.

The initial implementation of NameNode high availability provided for a single active and a single standby NameNode. However, some deployments require a higher degree of fault tolerance, so Hadoop 3.0 extends this feature by allowing the user to run multiple standby NameNodes. For instance, with three NameNodes and five JournalNodes configured, the cluster can tolerate the failure of two NameNodes rather than one.

30. Explain Erasure Coding in Hadoop?

In Hadoop, HDFS replicates each block three times by default. Replication in HDFS is a very simple and robust form of redundancy that shields against DataNode failure, but it is expensive: the 3x replication scheme has a 200% overhead in storage space and other resources.

Thus, Hadoop 3.x introduced Erasure Coding, a new feature to use in place of replication. It provides the same level of fault tolerance with much less storage, roughly a 50% storage overhead.

Erasure Coding borrows from RAID (Redundant Array of Inexpensive Disks). RAID implements EC through striping, in which logically sequential data (such as a file) is divided into smaller units (such as bits, bytes, or blocks) and stored on different disks.

Encoding – In this process, parity cells are calculated for each stripe of data cells, and errors are recovered from the parity. Erasure coding extends a message with redundant data for fault tolerance. The EC codec operates on uniformly sized data cells: it takes a number of data cells as input and produces parity cells as output. Data cells and parity cells together are called an erasure coding group.

There are two algorithms available for Erasure Coding (a minimal XOR illustration follows the list):

  • XOR Algorithm
  • Reed-Solomon Algorithm
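The following toy example (not the HDFS implementation) illustrates the XOR idea: with two data cells and one parity cell, any single lost cell can be rebuilt from the remaining two:

    public class XorParityDemo {
        public static void main(String[] args) {
            byte d1 = 0b0101_1010;          // data cell 1
            byte d2 = 0b0011_1100;          // data cell 2
            byte parity = (byte) (d1 ^ d2); // parity cell stored on a third node

            // Suppose the node holding d1 fails: rebuild it from d2 and the parity cell.
            byte recovered = (byte) (parity ^ d2);
            System.out.println(recovered == d1); // prints true
        }
    }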

31. What are the different data types in Pig Latin?

Pig Latin can handle both atomic data types, like int, float, long, and double, and complex data types, like tuple, bag, and map.

Atomic data types: Atomic or scalar data types are the basic data types used in most languages, such as int, long, float, double, chararray, and bytearray.

Complex data types: The complex data types are Tuple, Bag, and Map.

Career scope and salary scale

We are in the digital age of instant expectation. The internet, smartphones, social networking sites, and various online data-sharing platforms have all contributed to making data big. Big data is not a new term for us. Hadoop, with HDFS (Hadoop Distributed File System), is used for managing the storage and processing of such big data. Candidates with in-depth Hadoop knowledge and skill-based training can find a smooth path to their career goals in the job market.

A Hadoop developer or programmer can expect a minimum salary of about 38,000 dollars per annum, and the salary of an experienced Hadoop developer can reach double that figure. Salaries, however, depend heavily on the location, the business, and the company’s requirements.

Conclusion

This article, ‘Hadoop interview questions’, has answered the most important advanced Hadoop interview questions. Any student or professional who has studied these Hadoop interview questions for experienced candidates can find success in the interview. Even then, if learners still need more detail on Hadoop design, structure, storage, and implementation, they may drop a message to our experts regarding Hadoop interview questions for experienced professionals. Our trainers would be happy to help resolve your Hadoop design questions. Join Hadoop Training in Noida, Hadoop Training in Delhi, or Hadoop Training in Gurgaon.


