Hadoop on Linux and Windows

Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models (typically written in Java). It is part of the Apache project sponsored by the Apache Software Foundation. It is designed to scale up from a single server to thousands of machines, i.e., it can run on a single node or on a cluster of nodes. Each node in a Hadoop cluster can execute computation logic and store data.


Hadoop is designed to detect and handle failures at the application layer, delivering a highly available service on top of a cluster of computers, each of which may be prone to failure. In other words, when a node goes down while executing a job, the Hadoop framework automatically handles it by restarting the affected tasks on another node available in the cluster.


Supported Platforms:

  • GNU/Linux as a development and production platform
  • Windows as a development platform

Pre-requisites:

  • GNU/Linux – Java (1.6.0 or above) and ssh must be installed
  • Windows – Java (1.6.0 or above), ssh and Cygwin (to provide shell support) must be installed

Modes:

  • Standalone Mode (single node): Hadoop runs in a non-distributed mode as a single Java process, with no daemons started.
  • Pseudo-Distributed Mode (single-node cluster): each Hadoop daemon runs in a separate Java process on the same node.
  • Fully Distributed Mode (multi-node cluster): the Hadoop daemons run on different nodes of the cluster.

For further installation details, refer here – http://www.oreillynet.com/pub/a/other-programming/excerpts/hadoop-tdg/installing-apache-hadoop.html.

Configurations:

After installing Hadoop we need to configure it by setting the NameNode details, the dfs name and data directories, the JobTracker details and so on. This is done in core-site.xml (NameNode details), hdfs-site.xml (DataNode and distributed filesystem information) and mapred-site.xml (JobTracker and TaskTracker details).
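All three files share the same structure: one or more <property> elements wrapped in a single <configuration> root element (the files are typically found in Hadoop's conf directory). The snippets below show only the <property> elements; each of them is assumed to sit inside a skeleton like the following, where the property name and value are placeholders:

<?xml version="1.0"?>
<configuration>
  <!-- one or more property definitions go here -->
  <property>
    <name>some.property.name</name>
    <value>some-value</value>
  </property>
</configuration>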

Configuring core-site.xml:

  1. We need to configure the NameNode details, i.e., specify the host and port of the NameNode. If Hadoop is running in Pseudo-Distributed mode it can be configured as follows (in Standalone mode the property is normally left at its default, which points to the local filesystem):

<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost/</value>
</property>

If Hadoop is running in Fully Distributed mode, the configuration is:

<property>
  <name>fs.default.name</name>
  <value>hdfs://hostname:port</value>
</property>

  2. We can set hadoop.tmp.dir, the base temporary directory used by Hadoop (its default value includes the name of the user submitting the job).

For Linux:

<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/hadoop/tmp</value>
</property>

If not set, the default value is /tmp/hadoop-${user.name}.

For Windows:

<property>
  <name>hadoop.tmp.dir</name>
  <value>/${local directory}/hadoop/tmp</value>
</property>

${local directory} => can be any local directory like C://

Configuring hdfs-site.xml:

1) The replication factor of files in the distributed filesystem can be set here. The default value is 3.

<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>

For a single-node cluster the replication factor can be set to 1, since we have only one machine; if it fails we cannot recover the data anyway, because all of the copies would be on the same machine.

Note:

Standalone cluster => replication factor does not apply (non-distributed).

Pseudo-Distributed cluster => replication factor will be 1 (single-node cluster).

Fully Distributed cluster => replication factor can be any value (default is 3).
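
For example, a pseudo-distributed (single-node) setup would typically carry the following entry in hdfs-site.xml:

<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>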

2) The distributed filesystem (dfs) name directory can be set here. This determines where on the local filesystem the NameNode stores the name table. We can specify a comma-separated list of directories, in which case a duplicate copy is stored in every directory for redundancy (see the example at the end of this item).

<property>
  <name>dfs.name.dir</name>
  <value>${local path}</value>
</property>

For Linux:

${local path} => Ex: /home/hadoop/…

Setting dfs.name.dir is optional. If you do not specify a value, the default is ${hadoop.tmp.dir}/dfs/name, where ${hadoop.tmp.dir} is the value set in core-site.xml.

For Windows:

${local path} => Ex: C://hadoop/…
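
As a sketch of the redundant, comma-separated form, assuming two hypothetical local directories /home/hadoop/name1 and /home/hadoop/name2:

<property>
  <name>dfs.name.dir</name>
  <value>/home/hadoop/name1,/home/hadoop/name2</value>
</property>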

3) The distributed filesystem (dfs) data directory can also be set in this file. This determines where on the local filesystem a DataNode stores its blocks. We can specify a comma-separated list of directories, in which case data is stored across all of the named directories. Directories that do not exist are ignored.

<property>
  <name>dfs.data.dir</name>
  <value>${local path}</value>
</property>

For Linux:

${local path} => Ex: /home/hadoop/data/…

Setting dfs.data.dir is optional. If you do not specify a value, the default is ${hadoop.tmp.dir}/dfs/data, where ${hadoop.tmp.dir} is the value set in core-site.xml.

For Windows:

${local path} => Ex: C://hadoop/…

This kind of configuration makes it easy to extend the cluster's storage, i.e., we can grow capacity by adding a hard disk or another external device to the existing DataNodes. To increase the filesystem size after attaching additional storage, we just add its path to dfs.data.dir (comma-separated); the DataNode will then create blocks in the new directory as well and start storing data files there (see the sketch below).

In Linux, an attached external device shows up under /dev/ and, once mounted, its mount point can be used in the configuration. In Windows it appears as a separate drive (e.g., a removable hard disk).
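
For instance, assuming the existing data directory is /home/hadoop/data and a new disk has been mounted at /mnt/disk2 (both paths hypothetical), the property could be extended as:

<property>
  <name>dfs.data.dir</name>
  <value>/home/hadoop/data,/mnt/disk2/hadoop/data</value>
</property>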

Configuring mapred-site.xml:

1) In this file we need to configure the JobTracker details, i.e., specify the JobTracker host and port.

If Hadoop is running in Pseudo-Distributed mode it can be configured as follows (in Standalone mode this property defaults to local and no JobTracker daemon is started):

<property>
  <name>mapred.job.tracker</name>
  <value>localhost:port</value>
</property>

If Hadoop is running in Fully Distributed mode, the configuration is:

<property>
  <name>mapred.job.tracker</name>
  <value>hostname:port</value>
</property>

2) We can also configure the MapReduce local directory, which determines where MapReduce stores its intermediate data files. We can give a comma-separated list of directories on different devices in order to spread the disk I/O. Directories that do not exist are ignored (see the sketch at the end of this item).

<property>
  <name>mapred.local.dir</name>
  <value>${local path}</value>
</property>

${local path} => Ex: /home/hadoop/mapred-local…

Setting mapred.local.dir is optional. If you do not specify a value, the default is ${hadoop.tmp.dir}/mapred/local, where ${hadoop.tmp.dir} is the value set in core-site.xml.
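
As a sketch, assuming two hypothetical directories on separate disks, /disk1/mapred/local and /disk2/mapred/local:

<property>
  <name>mapred.local.dir</name>
  <value>/disk1/mapred/local,/disk2/mapred/local</value>
</property>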

Why is Hadoop tightly coupled with Linux but not with Windows?

As mentioned above, Hadoop is supported on both Linux and Windows, but there are clear advantages to using Linux as the platform rather than Windows.

Advantages of using Linux as the Hadoop platform over Windows:

      1. Cost: Linux is open source and available for free, and even licensed distributions are inexpensive; Windows, on the other hand, is proprietary and expensive.
      2. Reliability: Linux is more reliable compared to Windows, as Linux systems can run for years without needing a reboot. Since Hadoop daemons are required to run continuously in production (and in many development environments), Linux is the better choice.
      3. Security: Linux is more secure compared to Windows.
      4. Directory structure: The directory structure in Linux is simpler compared to Windows, since in Linux everything is referenced from the root (/). While configuring paths in the Hadoop configuration files on Linux (the dfs name and data directories, the Hadoop temporary directory for the user submitting the job, the mapred local directory where intermediate data files are stored), we specify directories relative to the root (/). In Windows the directory structure is different: the disk space is divided into drives (e.g., C drive, D drive), which makes configuring the Hadoop parameters a little more confusing.

Do let us know if you would like to add more to the above.
