Apache Hadoop. This article describes how to use Apache Hadoop.

1. Apache Hadoop

1.1. Overview

Apache Hadoop is a software solution for distributed computing of large datasets. Hadoop provides a distributed filesystem (HDFS) and a MapReduce implementation.

A dedicated computer acts as the "name node". It keeps track of the available nodes and of the files stored on them. The other computers in a Hadoop cluster are called nodes. The "name node" is currently a single point of failure; the Hadoop project is working on solutions for this.

1.2. Typical tasks

Apache Hadoop can be used to filter and aggregate data. A typical use case is the analysis of web server log files to find the most visited pages. MapReduce has also been used to traverse graphs and for other tasks.
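The log file use case can be sketched in plain Python. The following is a minimal single-process simulation of the map, shuffle and reduce steps, not Hadoop code. The log format ("IP METHOD PATH STATUS") and all function names are assumptions made for illustration.

```python
from collections import defaultdict

def map_log_line(line):
    """Map step: emit (page, 1) for one access-log line.

    Assumes a hypothetical, simplified format: 'IP METHOD PATH STATUS'.
    Lines with fewer than three fields are skipped.
    """
    parts = line.split()
    if len(parts) >= 3:
        yield parts[2], 1

def reduce_counts(page, counts):
    """Reduce step: sum all counts emitted for one page."""
    return page, sum(counts)

def run_job(lines):
    """Simulate map, shuffle (group by key) and reduce in one process."""
    grouped = defaultdict(list)
    for line in lines:
        for page, count in map_log_line(line):
            grouped[page].append(count)
    results = [reduce_counts(page, counts) for page, counts in grouped.items()]
    # Most visited pages first; ties are ordered by page name.
    return sorted(results, key=lambda item: (-item[1], item[0]))

SAMPLE_LOG = [
    "10.0.0.1 GET /index.html 200",
    "10.0.0.2 GET /about.html 200",
    "10.0.0.3 GET /index.html 200",
    "10.0.0.1 GET /index.html 304",
    "malformed",
]
```

Calling run_job(SAMPLE_LOG) returns the pages ordered by number of visits. In a real cluster the map and reduce steps would run on many nodes in parallel, and the framework would perform the grouping between them.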

1.3. Writing the map and reduce functions

The map and reduce functions can be written in Java. Hadoop also provides adapters so that map and reduce functions can be written in other languages, e.g., C++, Python, Perl, etc.
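For non-Java languages, one common route is Hadoop Streaming, which passes records to external programs as text lines and expects key and value separated by a tab. The sketch below shows what a word-count mapper and reducer could look like in that style. The function names are hypothetical, and the functions work on any iterable of lines so that they can be tried without a cluster.

```python
from itertools import groupby

def mapper(lines):
    """Emit one 'word<TAB>1' record for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_records):
    """Sum the counts per key.

    The records must arrive sorted by key, so that all records
    for one key are adjacent and can be grouped in a single pass.
    """
    pairs = (record.rstrip("\n").split("\t", 1) for record in sorted_records)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"

def run_locally(lines):
    """Local stand-in for the framework: map, sort by key, then reduce."""
    return list(reducer(sorted(mapper(lines))))
```

In an actual streaming job, each function would typically live in its own script that reads from standard input and prints to standard output; the sorting between the two steps is then done by the framework rather than by run_locally.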

2. Hadoop file system

The Hadoop file system (HDFS) is a distributed file system. It builds on the existing file system of the operating system and extends it with redundancy and distribution. HDFS hides the complexity of distributed storage and redundancy from the programmer.

In the standard configuration, HDFS stores three copies of each file on different nodes. The "name node" (server) keeps track of where the files are stored.
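The bookkeeping described above can be illustrated with a toy model. This is only a sketch under simplifying assumptions: the class name, the round-robin choice of nodes and the whole-file granularity are invented for illustration and do not reflect how HDFS actually places its copies.

```python
class ToyNameNode:
    """Toy model of the name node's bookkeeping (not the HDFS implementation)."""

    def __init__(self, nodes, replication=3):
        if replication > len(nodes):
            raise ValueError("need at least as many nodes as replicas")
        self.nodes = list(nodes)
        self.replication = replication
        self.locations = {}   # file name -> list of nodes holding a copy
        self._next = 0        # round-robin cursor, an arbitrary choice

    def store(self, filename):
        """Pick `replication` distinct nodes and remember where the file lives."""
        chosen = [self.nodes[(self._next + i) % len(self.nodes)]
                  for i in range(self.replication)]
        self._next = (self._next + 1) % len(self.nodes)
        self.locations[filename] = chosen
        return chosen

    def readable_from(self, filename, failed_nodes=()):
        """Return the nodes that still hold a copy after some nodes failed."""
        return [n for n in self.locations[filename] if n not in failed_nodes]
```

With three copies, a file stays readable as long as at least one of the nodes returned by store() is still available, which is what readable_from() reports.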

Hard disks are very efficient at reading large files sequentially but are much slower at random access. HDFS is therefore optimized for large files.

To improve performance, Hadoop also tries to move the computation to the nodes which store the data, and not vice versa. With very large datasets this helps performance in particular, because it keeps the network from becoming the bottleneck.
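The idea of moving the computation to the data can be sketched as a small scheduling rule: prefer a free node that already stores a copy, and fall back to any free node only if none exists. The function below is a hypothetical illustration of that preference, not the scheduler Hadoop uses.

```python
def assign_task(replica_nodes, free_nodes):
    """Choose a node for a task that reads one piece of data.

    replica_nodes: nodes that store a copy of the data
    free_nodes:    nodes that can currently accept a task
    Returns (node, is_local). is_local is False when the data
    would have to be transferred over the network.
    """
    for node in replica_nodes:
        if node in free_nodes:
            return node, True
    if not free_nodes:
        raise RuntimeError("no free node available")
    # No local option: pick a free node deterministically by name.
    return sorted(free_nodes)[0], False
```

For example, assign_task(["n1", "n2", "n3"], {"n2", "n4"}) picks "n2" as a local node, while a call with no overlapping nodes falls back to a remote one.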

3. MapReduce

Apache Hadoop jobs work according to the MapReduce principle. See MapReduce for details.

4. Installation

Apache Hadoop can be downloaded from the Hadoop homepage. To get started with Hadoop you require the following sub-projects:

  • Hadoop Common

  • MapReduce

  • HDFS

5. Getting started


7. Links and Literature

7.1. Source Code


Copyright © 2012-2016 vogella GmbH. Free use of the software examples is granted under the terms of the EPL License. This tutorial is published under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Germany license.

See Licence.