In order to run the application, you need to have Cloudera's Hadoop installed. Among its facilities, you need to have Oozie installed as well; installing HBase may not be necessary in your case, unless you specifically know that you will be using it. The steps for installing and configuring Hadoop and its facilities are given below.

---

**Warning**: the description below is outdated. For example, the most recent versions of Hadoop and Oozie work without any problems with Java 1.7.

---

Hadoop
======
Java environment
----------------
IMPORTANT: Because of a bug in the Oozie version provided with Cloudera's Hadoop (by the way: this bug has been fixed in the version of Oozie available in the source code repository), **you need to have Oracle Java JDK 1.6 installed**. Oozie **does not** work with JDK 1.7.
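
Since it is easy to end up with the wrong JDK active, a quick sanity check may help. This is only a sketch; matching `1.6.` in the version string is an assumption about how Oracle JDK 6 reports itself:

```shell
# Sketch: warn if the active JDK does not look like 1.6 (the "1.6." match is
# an assumption about the Oracle JDK 6 version string).
if java -version 2>&1 | grep -q '1\.6\.'; then
    msg="JDK 1.6 detected"
else
    msg="WARNING: the Oozie shipped with CDH4 requires Oracle JDK 1.6"
fi
echo "$msg"
```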

Note that in order to install and configure Hadoop, you need to have the `JAVA_HOME` environment variable set properly. If it is not set yet, you can do so by, e.g., adding a line with the contents `JAVA_HOME="/usr/lib/jvm/default-java"` (assuming that this is the correct path to the installation directory of your Java distribution) to your `/etc/environment` file.
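
The edit above can also be scripted idempotently. In this sketch, `ENV_FILE` defaults to a scratch file so it can be tried safely first; point it at `/etc/environment` and run with root privileges to apply it for real (the JDK path is the assumed default from above):

```shell
# Sketch: append JAVA_HOME to an environment file only if it is not already
# set. ENV_FILE defaults to a scratch file so the snippet can be tested
# before touching /etc/environment; the JDK path is an assumed default.
ENV_FILE="${ENV_FILE:-$(mktemp)}"
JDK_PATH="/usr/lib/jvm/default-java"
if ! grep -q '^JAVA_HOME=' "$ENV_FILE"; then
    echo "JAVA_HOME=\"$JDK_PATH\"" >> "$ENV_FILE"
fi
```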

General information on Hadoop
-----------------------------
The instructions below show how to install Cloudera Hadoop CDH4 with MRv1 in accordance with the instructions given in the [Cloudera CDH4 installation guide](https://ccp.cloudera.com/display/CDH4DOC/CDH4+Installation+Guide).

It is important to know that Hadoop can be run in one of three modes:

- **standalone mode** - runs all of the Hadoop processes in a single JVM, which makes it easy to debug the application.
- **pseudo-distributed mode** - runs a full-fledged Hadoop on your local computer.
- **distributed mode** - runs the application on a cluster consisting of many nodes/hosts.

Below we will show how to install Hadoop initially in the pseudo-distributed mode, but with the possibility to switch between the standalone and the pseudo-distributed mode.

Installation
------------
Installing Hadoop in pseudo-distributed mode (based on the [Cloudera CDH4 pseudo-distributed mode installation guide](https://ccp.cloudera.com/display/CDH4DOC/Installing+CDH4+on+a+Single+Linux+Node+in+Pseudo-distributed+Mode)) on 64-bit Ubuntu 12.04:

- create a new file `/etc/apt/sources.list.d/cloudera.list` with the contents:

		deb [arch=amd64] http://archive.cloudera.com/cdh4/ubuntu/precise/amd64/cdh precise-cdh4 contrib
		deb-src [arch=amd64] http://archive.cloudera.com/cdh4/ubuntu/precise/amd64/cdh precise-cdh4 contrib

- add the repository key:

		curl -s http://archive.cloudera.com/cdh4/ubuntu/precise/amd64/cdh/archive.key | sudo apt-key add -

- update the package lists:

		sudo apt-get update

- install the packages:

		sudo apt-get install hadoop-0.20-conf-pseudo

- next, follow the steps described in Cloudera's guide to installing Hadoop in the pseudo-distributed mode, starting from ["Step 1: Format the NameNode"](https://ccp.cloudera.com/display/CDH4DOC/Installing+CDH4+on+a+Single+Linux+Node+in+Pseudo-distributed+Mode#InstallingCDH4onaSingleLinuxNodeinPseudo-distributedMode-Step1%3AFormattheNameNode.).
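
For convenience, the file-creation step above can be scripted. In this sketch, `TARGET_DIR` is a variable introduced here that defaults to a scratch directory, so the generated file can be reviewed before being copied to `/etc/apt/sources.list.d`:

```shell
# Sketch: generate cloudera.list; TARGET_DIR defaults to a scratch directory
# so the file can be reviewed before copying it to /etc/apt/sources.list.d.
TARGET_DIR="${TARGET_DIR:-$(mktemp -d)}"
cat > "$TARGET_DIR/cloudera.list" <<'EOF'
deb [arch=amd64] http://archive.cloudera.com/cdh4/ubuntu/precise/amd64/cdh precise-cdh4 contrib
deb-src [arch=amd64] http://archive.cloudera.com/cdh4/ubuntu/precise/amd64/cdh precise-cdh4 contrib
EOF
echo "wrote $TARGET_DIR/cloudera.list"
```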

After install
-------------

### Switching between Hadoop modes
When you have Hadoop installed, you can **switch between the standalone and pseudo-distributed configurations** (or other kinds of configurations) of Hadoop using the `update-alternatives` command, e.g.:

- `update-alternatives --display hadoop-conf` - lists the available configurations and shows which one is currently active
- `sudo update-alternatives --set hadoop-conf /etc/hadoop/conf.empty` - sets the active configuration to `/etc/hadoop/conf.empty`, which corresponds to Hadoop's standalone mode.

### Web interfaces
You can view the web interfaces of the following services at the appropriate addresses:

- **NameNode** - provides a web console for viewing HDFS, the number of DataNodes, and logs - [http://localhost:50070/](http://localhost:50070/)
	- In the pseudo-distributed configuration, you should see one live DataNode named "localhost".
- **JobTracker** - allows viewing the completed, currently running, and failed jobs along with their logs - [http://localhost:50030/](http://localhost:50030/)
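
A quick way to verify that both web interfaces are up is to probe them from the command line. This is a sketch using the default ports mentioned above:

```shell
# Sketch: probe the default web UI ports (50070 = NameNode, 50030 = JobTracker)
# and record which ones respond; adjust the URLs if you changed the defaults.
results=""
for url in http://localhost:50070/ http://localhost:50030/; do
    if curl -s -o /dev/null --max-time 5 "$url"; then
        results="$results$url OK\n"
    else
        results="$results$url DOWN\n"
    fi
done
printf "$results"
```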

Oozie
=====

Installation
------------

The description below is based on the [Cloudera CDH4 Oozie installation guide](https://ccp.cloudera.com/display/CDH4DOC/Oozie+Installation#OozieInstallation-ConfiguringOozieinstall).

- Install Oozie with:

		sudo apt-get install oozie oozie-client

- Create the Oozie database schema:

		sudo -u oozie /usr/lib/oozie/bin/ooziedb.sh create -run

	- this should result in an output similar to this one:

			Validate DB Connection
			DONE
			Check DB schema does not exist
			DONE
			Check OOZIE_SYS table does not exist
			DONE
			Create SQL schema
			DONE
			Create OOZIE_SYS table
			DONE

			Oozie DB has been created for Oozie version '3.1.3-cdh4.0.1'

			The SQL commands have been written to: /tmp/ooziedb-8221670220279408806.sql

- Install version 2.2 of the ExtJS library:
	- download the zipped library from [http://extjs.com/deploy/ext-2.2.zip](http://extjs.com/deploy/ext-2.2.zip)
	- copy the zip file to `/var/lib/oozie` and extract it there
- Install the Oozie ShareLib:

		mkdir /tmp/ooziesharelib
		cd /tmp/ooziesharelib
		tar -zxf /usr/lib/oozie/oozie-sharelib.tar.gz
		sudo -u hdfs hadoop fs -mkdir /user/oozie
		sudo -u hdfs hadoop fs -chown oozie /user/oozie
		sudo -u oozie hadoop fs -put share /user/oozie/share

- Start the Oozie server:

		sudo service oozie start

- Check the status of the server:
	- Using the **command line**:

			oozie admin -oozie http://localhost:11000/oozie -status

	as a result, the following should be printed out:

			System mode: NORMAL

	If instead of this output you get a `java.lang.NullPointerException`, try executing the same command with the `-auth SIMPLE` arguments, namely:

		oozie admin -auth SIMPLE -oozie http://localhost:11000/oozie -status

	This is related to a certain [Jira issue](https://issues.apache.org/jira/browse/OOZIE-1010). It seems that you have to use the `-auth SIMPLE` arguments only once - after using them for the first time, the `oozie` program will no longer require them.

	- You can also check the status of the server using the **web interface** - use a web browser to open the page at the following address: [http://localhost:11000/oozie/](http://localhost:11000/oozie/)

If you want to check whether Oozie correctly executes its workflows, you can run some of the example workflows provided with Oozie, as described in [Cloudera Oozie example workflows](http://archive.cloudera.com/cdh4/cdh/4/oozie/DG_Examples.html). Note that contrary to what is written there, the Oozie server is not available at `http://localhost:8080/oozie` but at the `http://localhost:11000/oozie` address.

Documentation
-------------
Documentation on Oozie and on creating Oozie workflows can be found in the [Cloudera Oozie documentation](http://archive.cloudera.com/cdh4/cdh/4/oozie/).

A quite nice three-part tutorial can be found at:

- [1. Introduction to Oozie at InfoQ](http://www.infoq.com/articles/introductionOozie)
- [2. Oozie by example at InfoQ](http://www.infoq.com/articles/oozieexample)
- [3. Extending Oozie at InfoQ](http://www.infoq.com/articles/ExtendingOozie)

HBase
=====
The description below is based on the [Cloudera CDH4 HBase installation guide](https://ccp.cloudera.com/display/CDH4DOC/HBase+Installation).
The main goal is to have HBase installed and working in pseudo-distributed mode, using the HDFS of the previously installed Hadoop instance.

Notice: on Ubuntu systems, installing the packages listed below starts the corresponding services straight away.

- Install the HBase package:

		sudo apt-get install hbase

- Install the HBase Master package:

		sudo apt-get install hbase-master

- Stop the HBase Master in order to reconfigure HBase to work in pseudo-distributed mode:

		sudo service hbase-master stop

- Configure HBase in pseudo-distributed mode by modifying the `/etc/hbase/conf/hbase-site.xml` HBase configuration file and inserting:

		<property>
		  <name>hbase.cluster.distributed</name>
		  <value>true</value>
		</property>
		<property>
		  <name>hbase.rootdir</name>
		  <value>hdfs://localhost:8020/hbase</value>
		</property>

	between the `<configuration>` and `</configuration>` tags.
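
If you prefer generating the whole file to editing it by hand, the following sketch writes a complete pseudo-distributed `hbase-site.xml`. `TARGET` is a variable introduced here that defaults to a scratch file, so the result can be reviewed before it replaces `/etc/hbase/conf/hbase-site.xml`; if your existing file already contains other properties, merge them manually instead of overwriting:

```shell
# Sketch: write a complete pseudo-distributed hbase-site.xml. TARGET defaults
# to a scratch file so the result can be reviewed before it replaces
# /etc/hbase/conf/hbase-site.xml (merge by hand if you have other properties).
TARGET="${TARGET:-$(mktemp)}"
cat > "$TARGET" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://localhost:8020/hbase</value>
  </property>
</configuration>
EOF
```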

- Create the `/hbase` directory in HDFS, where HBase data will be stored, with proper permissions:

		sudo -u hdfs hadoop fs -mkdir /hbase
		sudo -u hdfs hadoop fs -chown hbase /hbase

- Install ZooKeeper, required for pseudo-distributed mode:

		sudo apt-get install zookeeper-server

	At first, ZooKeeper won't start because its data directory is missing; it is created during the init phase, so run:

		sudo service zookeeper-server init
		sudo service zookeeper-server start

- Inspect the HBase Master administration panel in order to verify that all required services are running. The default address is:

		http://localhost:60010/master-status

	At least one region server should be listed. The default HBase home directory location is:

		hdfs://localhost:8020/hbase

- Access HBase by using the HBase shell:

		hbase shell

	A more detailed description of the shell commands is available at [http://wiki.apache.org/hadoop/Hbase/Shell](http://wiki.apache.org/hadoop/Hbase/Shell)

Troubleshooting
---------------
According to [http://hbase.apache.org/book.html#loopback.ip](http://hbase.apache.org/book.html#loopback.ip), the `127.0.1.1` entry in Ubuntu's `/etc/hosts` file:

		127.0.1.1      laptop-work

may cause problems when deploying the HBase Master with a Region Server. The following exception:

		org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed setting up proxy interface org.apache.hadoop.hbase.ipc.HRegionInterface to localhost/127.0.0.1:60020 after attempts=1

may occur in the hbase-master log file. As a solution, the `127.0.1.1` entry should be removed from the `/etc/hosts` file.
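
The fix can be applied via a cleaned copy first. In this sketch, `HOSTS` and `CLEANED` are variables introduced here so you can inspect the result before copying it back over `/etc/hosts` with root privileges:

```shell
# Sketch: produce a cleaned copy of /etc/hosts with the 127.0.1.1 line
# removed. HOSTS and CLEANED are variables so the result can be reviewed
# before being copied back over /etc/hosts (with sudo) on a real system.
HOSTS="${HOSTS:-/etc/hosts}"
CLEANED="${CLEANED:-$(mktemp)}"
sed '/^127\.0\.1\.1[[:space:]]/d' "$HOSTS" > "$CLEANED"
```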