In order to run the application, you need to have Cloudera's Hadoop installed. You also need Oozie installed; installing HBase is only needed if you know that you will be using it. The steps for installing and configuring Hadoop and these services are given below.

---

**Warning**: the description below is outdated. For example, the most recent versions of Hadoop and Oozie work without any problems with Java 1.7.

---

Hadoop
======

Java environment
----------------

IMPORTANT: Because of a bug in the Oozie version provided with Cloudera's Hadoop (this bug has already been fixed in the version of Oozie available in the source code repository), **you need to have Oracle Java JDK 1.6 installed**. Oozie **does not** work with JDK 1.7.

Note that in order to install and configure Hadoop, you need to have the `JAVA_HOME` environment variable set properly. If you haven't set it up already, you can do so by, e.g., adding a line with the contents `JAVA_HOME="/usr/lib/jvm/default-java"` (assuming that this is the correct path to the installation directory of your Java distribution) to your `/etc/environment` file.
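
A minimal sketch of setting the variable for the current shell and checking it; the JDK path below is only an example and may differ on your system:

```shell
# Set JAVA_HOME for the current shell; the path is an example - adjust it
# to the actual installation directory of your JDK.
export JAVA_HOME="/usr/lib/jvm/default-java"
echo "JAVA_HOME is set to: $JAVA_HOME"
# To make the setting permanent, add the same assignment (without 'export')
# to /etc/environment and log in again, as described above.
```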

General information on Hadoop
-----------------------------

The instructions below show how to install Cloudera Hadoop CDH4 with MRv1, following the [Cloudera CDH4 installation guide](https://ccp.cloudera.com/display/CDH4DOC/CDH4+Installation+Guide).

It is important to know that Hadoop can be run in one of three modes:

- **standalone mode** - runs all of the Hadoop processes in a single JVM, which makes it easy to debug the application.
- **pseudo-distributed mode** - runs a full-fledged Hadoop installation on your local computer.
- **distributed mode** - runs the application on a cluster consisting of many nodes/hosts.

Below we show how to install Hadoop in the pseudo-distributed mode, while keeping the possibility to switch between the standalone and the pseudo-distributed modes.

Installation
------------

To install Hadoop in pseudo-distributed mode on 64-bit Ubuntu 12.04 (based on the [Cloudera CDH4 pseudo distributed mode installation guide](https://ccp.cloudera.com/display/CDH4DOC/Installing+CDH4+on+a+Single+Linux+Node+in+Pseudo-distributed+Mode)):

- create a new file `/etc/apt/sources.list.d/cloudera.list` with the following contents:

        deb [arch=amd64] http://archive.cloudera.com/cdh4/ubuntu/precise/amd64/cdh precise-cdh4 contrib
        deb-src [arch=amd64] http://archive.cloudera.com/cdh4/ubuntu/precise/amd64/cdh precise-cdh4 contrib

- add a repository key:

        curl -s http://archive.cloudera.com/cdh4/ubuntu/precise/amd64/cdh/archive.key | sudo apt-key add -

- update the package index:

        sudo apt-get update

- install the pseudo-distributed configuration package:

        sudo apt-get install hadoop-0.20-conf-pseudo

- next, follow the steps described in Cloudera's guide to installing Hadoop in the pseudo-distributed mode, starting from "Step 1: Format the NameNode", available at [Cloudera CDH4 pseudo distributed mode installation guide - "Step 1: Format the NameNode"](https://ccp.cloudera.com/display/CDH4DOC/Installing+CDH4+on+a+Single+Linux+Node+in+Pseudo-distributed+Mode#InstallingCDH4onaSingleLinuxNodeinPseudo-distributedMode-Step1%3AFormattheNameNode.).
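
Those remaining steps can be sketched roughly as follows. The service names below are the CDH4 MRv1 defaults and may differ in other releases; the guide also includes creating the HDFS `/tmp` and MapReduce system directories between formatting and starting MapReduce, which is omitted here:

```shell
# Format the NameNode (only once, on a fresh installation!).
sudo -u hdfs hdfs namenode -format

# Start the HDFS and MRv1 daemons; names are CDH4 MRv1 defaults.
for service in hadoop-hdfs-namenode hadoop-hdfs-datanode \
               hadoop-hdfs-secondarynamenode \
               hadoop-0.20-mapreduce-jobtracker \
               hadoop-0.20-mapreduce-tasktracker; do
    sudo service "$service" start
done
```

See the linked guide for the authoritative sequence, including the HDFS directory setup steps.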

After install
-------------

### Switching between Hadoop modes

When you have Hadoop installed, you can **switch between the standalone and pseudo-distributed configurations** (or other kinds of configurations) of Hadoop using the `update-alternatives` command, e.g.:

- `update-alternatives --display hadoop-conf` to list the available configurations and see which one is currently active
- `sudo update-alternatives --set hadoop-conf /etc/hadoop/conf.empty` to set the active configuration to `/etc/hadoop/conf.empty`, which corresponds to the standalone mode.
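
A sketch of switching back and forth; the pseudo-distributed configuration directory name below is an assumption - read the actual name off the `--display` output on your machine:

```shell
# Show the registered configurations and the currently active one.
update-alternatives --display hadoop-conf

# Switch to the standalone configuration.
sudo update-alternatives --set hadoop-conf /etc/hadoop/conf.empty

# Switch back to pseudo-distributed mode; the directory name is an
# assumption - verify it against the --display output.
sudo update-alternatives --set hadoop-conf /etc/hadoop/conf.pseudo.mr1
```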

### Web interfaces

You can view the web interfaces of the following services at these addresses:

- **NameNode** - provides a web console for viewing HDFS, the number of DataNodes, and logs - [http://localhost:50070/](http://localhost:50070/)
    - In the pseudo-distributed configuration, you should see one live DataNode named "localhost".
- **JobTracker** - allows viewing the completed, currently running, and failed jobs along with their logs - [http://localhost:50030/](http://localhost:50030/)
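
A quick way to check from the command line that both consoles respond (a sketch; it assumes the pseudo-distributed services are already running on the default ports and should print `200` for each):

```shell
# Print the HTTP status code of each web console; 200 means it is up.
curl -s -o /dev/null -w "NameNode UI:   %{http_code}\n" http://localhost:50070/
curl -s -o /dev/null -w "JobTracker UI: %{http_code}\n" http://localhost:50030/
```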

Oozie
=====

Installation
------------

The description below is based on [Cloudera CDH4 Oozie installation guide](https://ccp.cloudera.com/display/CDH4DOC/Oozie+Installation#OozieInstallation-ConfiguringOozieinstall).

- Install Oozie:

        sudo apt-get install oozie oozie-client

- Create the Oozie database schema:

        sudo -u oozie /usr/lib/oozie/bin/ooziedb.sh create -run

- this should result in output similar to the following:

        Validate DB Connection
        DONE
        Check DB schema does not exist
        DONE
        Check OOZIE_SYS table does not exist
        DONE
        Create SQL schema
        DONE
        Create OOZIE_SYS table
        DONE

        Oozie DB has been created for Oozie version '3.1.3-cdh4.0.1'

        The SQL commands have been written to: /tmp/ooziedb-8221670220279408806.sql

- Install version 2.2 of the ExtJS library:
    - download the zipped library from [http://extjs.com/deploy/ext-2.2.zip](http://extjs.com/deploy/ext-2.2.zip)
    - copy the zip file to `/var/lib/oozie` and extract it there
- Install the Oozie ShareLib:

        mkdir /tmp/ooziesharelib
        cd /tmp/ooziesharelib
        tar -zxf /usr/lib/oozie/oozie-sharelib.tar.gz
        sudo -u hdfs hadoop fs -mkdir /user/oozie
        sudo -u hdfs hadoop fs -chown oozie /user/oozie
        sudo -u oozie hadoop fs -put share /user/oozie/share
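
An optional check (a sketch, assuming the commands above completed) that the ShareLib actually landed in HDFS; you should see the uploaded `lib` content listed:

```shell
# List the uploaded ShareLib directory in HDFS.
sudo -u oozie hadoop fs -ls /user/oozie/share
```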

- Start the Oozie server:

        sudo service oozie start

- Check the status of the server:
    - Using the **command line**:

            oozie admin -oozie http://localhost:11000/oozie -status

      as a result, the following should be printed out:

            System mode: NORMAL

      If instead of this output you get a `java.lang.NullPointerException`, try executing the same command with the `-auth SIMPLE` argument, namely:

            oozie admin -auth SIMPLE -oozie http://localhost:11000/oozie -status

      This is related to a known [Jira issue](https://issues.apache.org/jira/browse/OOZIE-1010). It seems that you have to use the `-auth SIMPLE` argument only once - after using it for the first time, the `oozie` program will no longer require it.

    - You can also check the status of the server using the **web interface** - use a web browser to open the following address: [http://localhost:11000/oozie/](http://localhost:11000/oozie/)

If you want to check whether Oozie correctly executes its workflows, you can run some of the example workflows provided with Oozie, as described in [Cloudera Oozie example workflows](http://archive.cloudera.com/cdh4/cdh/4/oozie/DG_Examples.html). Note that contrary to what is written there, the Oozie server is not available at `http://localhost:8080/oozie` but at `http://localhost:11000/oozie`.
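
A sketch of running one of the bundled examples; the location of the examples archive and the HDFS layout are assumptions - adjust them to your installation and user:

```shell
# Extract the bundled examples (archive location is an assumption).
tar -zxf /usr/share/doc/oozie/oozie-examples.tar.gz -C ~

# Upload the examples to the current user's HDFS home directory.
hadoop fs -put ~/examples examples

# Submit the map-reduce example workflow; note the corrected server URL.
oozie job -oozie http://localhost:11000/oozie \
      -config ~/examples/apps/map-reduce/job.properties -run

# The command prints a job ID; check the job's progress with:
#   oozie job -oozie http://localhost:11000/oozie -info <job-ID>
```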

Documentation
-------------

Documentation on Oozie and on creating Oozie workflows can be found in the [Cloudera Oozie documentation](http://archive.cloudera.com/cdh4/cdh/4/oozie/).

A good three-part tutorial can be found at:

- [1. Introduction to Oozie at InfoQ](http://www.infoq.com/articles/introductionOozie)
- [2. Oozie by example at InfoQ](http://www.infoq.com/articles/oozieexample)
- [3. Extending Oozie at InfoQ](http://www.infoq.com/articles/ExtendingOozie)

HBase
=====

The description below is based on the [Cloudera CDH4 HBase installation guide](https://ccp.cloudera.com/display/CDH4DOC/HBase+Installation).
The main goal is to have HBase installed and working in pseudo-distributed mode, using the HDFS of the previously installed Hadoop instance.

Note: on Ubuntu systems, installing the packages listed below starts the corresponding services right away.

- Install the HBase package:

        sudo apt-get install hbase

- Install the HBase Master package:

        sudo apt-get install hbase-master

- Stop the HBase Master in order to reconfigure HBase to work in pseudo-distributed mode:

        sudo service hbase-master stop

- Configure HBase in pseudo-distributed mode by modifying the `/etc/hbase/conf/hbase-site.xml` configuration file and inserting:

        <property>
            <name>hbase.cluster.distributed</name>
            <value>true</value>
        </property>
        <property>
            <name>hbase.rootdir</name>
            <value>hdfs://localhost:8020/hbase</value>
        </property>

    between the `<configuration>` and `</configuration>` tags.
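
After editing, it may help to confirm the file is still well-formed XML before restarting any services (a sketch; `xmllint` comes from the `libxml2-utils` package):

```shell
# Validate the edited configuration file; prints nothing on success.
xmllint --noout /etc/hbase/conf/hbase-site.xml && echo "hbase-site.xml OK"
```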

- Create the `/hbase` directory in HDFS with the proper permissions:

        sudo -u hdfs hadoop fs -mkdir /hbase
        sudo -u hdfs hadoop fs -chown hbase /hbase

    This is where the HBase data will be stored.

- Install ZooKeeper, required for the pseudo-distributed mode:

        sudo apt-get install zookeeper-server

    At first, ZooKeeper won't start because its data directory is missing; the directory is created during the init phase, so run:

        sudo service zookeeper-server init
        sudo service zookeeper-server start

- Inspect the HBase Master administration panel in order to verify that all required services are running. The default address is:

        http://localhost:60010/master-status

    At least one region server should be listed. The default HBase home directory location is:

        hdfs://localhost:8020/hbase

- Access HBase by using the HBase shell:

        hbase shell

A more detailed description of the shell commands is available at [http://wiki.apache.org/hadoop/Hbase/Shell](http://wiki.apache.org/hadoop/Hbase/Shell).
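
A short example session (a sketch, assuming the services above are running); the table and column-family names are arbitrary illustrations:

```shell
# Create a table, store one cell, scan it, then drop the table again.
hbase shell <<'EOF'
create 'test_table', 'cf'
put 'test_table', 'row1', 'cf:greeting', 'hello'
scan 'test_table'
disable 'test_table'
drop 'test_table'
EOF
```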

Troubleshooting
---------------

According to [http://hbase.apache.org/book.html#loopback.ip](http://hbase.apache.org/book.html#loopback.ip), the `127.0.1.1` entry in the Ubuntu `/etc/hosts` file:

    127.0.1.1 laptop-work

may cause problems when deploying the HBase Master with a Region Server. The following exception:

    org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed setting up proxy interface org.apache.hadoop.hbase.ipc.HRegionInterface to localhost/127.0.0.1:60020 after attempts=1

may occur in the hbase-master log file. As a solution, remove the `127.0.1.1` entry from the `/etc/hosts` file.
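
A sketch of previewing that removal on a copy of the file before touching the real `/etc/hosts`:

```shell
# Work on a copy first, so the change can be inspected safely.
cp /etc/hosts /tmp/hosts.fixed

# Delete any line starting with 127.0.1.1 from the copy.
sed -i '/^127\.0\.1\.1[[:space:]]/d' /tmp/hosts.fixed

# Show what would be removed (diff exits non-zero when files differ).
diff /etc/hosts /tmp/hosts.fixed || true

# If the diff looks right, apply it for real:
#   sudo cp /tmp/hosts.fixed /etc/hosts
```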