General notes
====================

Oozie-installer is a utility for building, uploading and running Oozie workflows. In practice, it creates a `*.tar.gz` package that contains the resources defining a workflow, along with some helper scripts. See the `icm-iis-core-examples` project for usage examples.

This module is executed automatically when running:

`mvn package -Poozie-package -Dworkflow.source.dir=classpath/to/parent/directory/of/oozie_app`

on a module that has the following parent set:

    <parent>
        <groupId>eu.dnetlib</groupId>
        <artifactId>icm-iis-parent-container</artifactId>
        <version>0.0.1-SNAPSHOT</version>
    </parent>

in its `pom.xml` file. The `oozie-package` profile initializes Oozie workflow packaging, and the `workflow.source.dir` property points to a workflow (note: this is not a relative filesystem path but a classpath location of the directory).

The outcome of this packaging is an `oozie-package.tar.gz` file containing all the resources required to run the Oozie workflow:

- jar packages
- workflow definitions
- job properties
- maintenance scripts

Required properties
====================

To include the proper workflow in the package, the `workflow.source.dir` property has to be set. It can be provided with the `-Dworkflow.source.dir=some/job/dir` Maven parameter.

Other placeholders used in shell scripts (`*.sh`), along with their default values defined in the `pom.xml` file:

| property name | default value |
|---------------|---------------|
| `iis.hadoop.frontend.host.name` | `localhost` |
| `iis.hadoop.master.host.name` | `localhost` |
| `iis.hadoop.frontend.user.name` | `${user.name}` (Maven property holding the current user name) |
| `iis.hadoop.frontend.home.dir` | `/mnt/tmp` |
| `sandboxName` | generated by a dedicated plugin, based on `workflow.source.dir` |
| `sandboxDir` | `/user/${iis.hadoop.frontend.user.name}/${sandboxName}` |
| `workingDir` | `${sandboxDir}/working_dir` |
| `oozieAppDir` | `oozie_app` |
| `oozieServiceLoc` | `http://${iis.hadoop.master.host.name}:11000/oozie` |

This list can be supplemented with the `job.properties` default values defined in the `pom.xml` file:

| property name | default value |
|---------------|---------------|
| `nameNode` | `hdfs://${iis.hadoop.master.host.name}:8020` |
| `jobTracker` | `${iis.hadoop.master.host.name}:8021` |
| `queueName` | `default` |

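To illustrate how the derived defaults listed above compose, the following shell sketch mirrors the placeholder substitution with ordinary shell variables. The values `jdoe` and `my_workflow` are hypothetical stand-ins: the real `sandboxName` is generated by the plugin from `workflow.source.dir`, and the actual substitution is performed by Maven, not by the shell.

```shell
# Hypothetical stand-ins for values that Maven or the plugin would supply.
iis_hadoop_master_host_name="localhost"
iis_hadoop_frontend_user_name="jdoe"      # default: ${user.name}
sandboxName="my_workflow"                 # really generated from workflow.source.dir

# Derived defaults, composed the same way as in the listings above.
sandboxDir="/user/${iis_hadoop_frontend_user_name}/${sandboxName}"
workingDir="${sandboxDir}/working_dir"
oozieServiceLoc="http://${iis_hadoop_master_host_name}:11000/oozie"
nameNode="hdfs://${iis_hadoop_master_host_name}:8020"
jobTracker="${iis_hadoop_master_host_name}:8021"

printf 'sandboxDir=%s\n' "$sandboxDir"
printf 'workingDir=%s\n' "$workingDir"
printf 'oozieServiceLoc=%s\n' "$oozieServiceLoc"
printf 'nameNode=%s\n' "$nameNode"
printf 'jobTracker=%s\n' "$jobTracker"
```

Changing only the master host name in this sketch changes `oozieServiceLoc`, `nameNode` and `jobTracker` together, which is why overriding `iis.hadoop.master.host.name` is usually enough to point a build at another cluster.
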
All of these values are overridden by the ones from `job.properties` and, finally, by those from `job-override.properties` stored in the module's main folder. Values can also be provided as Maven command-line `-D` arguments.

To override properties from `job.properties`, create a `job-override.properties` file in the main module directory (the one containing the `pom.xml` file) and define the new values there; they will override the existing ones. Alternatively, those properties can be provided one by one as command-line arguments.

The property overriding order is the following:

1. properties defined in `pom.xml` (located in the project root dir)
2. properties defined in `~/.m2/settings.xml`
3. `${workflow.source.dir}/job.properties`
4. `job-override.properties` (located in the project root dir)
5. `mvn -Dparam=value`

where each source overrides the ones listed before it, so the Maven `-Dparam` property overrides all the others.

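The "last source wins" rule can be emulated with a few lines of shell. This is only a sketch of the precedence for the two properties files, not the plugin's implementation; the sample keys and values written to the scratch directory are hypothetical.

```shell
# Emulates "last source wins" for key=value files; this is NOT the plugin's code.
tmp=$(mktemp -d)

# Hypothetical sample values for the two properties files.
printf 'queueName=default\nnameNode=hdfs://localhost:8020\n' > "$tmp/job.properties"
printf 'queueName=priority\n' > "$tmp/job-override.properties"

# Files are read in overriding order; a later assignment replaces an earlier one.
resolve() {
    awk -F= '
        /^[ \t]*#/ || /^[ \t]*$/ { next }       # skip comments and blank lines
        { key = $1; sub(/^[^=]*=/, ""); value[key] = $0 }
        END { for (k in value) print k "=" value[k] }
    ' "$@" | sort
}

resolved=$(resolve "$tmp/job.properties" "$tmp/job-override.properties")
printf '%s\n' "$resolved"
```

In this emulation a `-D` argument would simply be one more source, passed last to `resolve`.
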
Workflow definition requirements
====================

The `workflow.source.dir` property should point to the following directory structure:

    [${workflow.source.dir}]
      |
      |-job.properties (optional)
      |
      \-[oozie_app]
        |
        \-workflow.xml

This property can be set using the Maven `-D` switch.

`[oozie_app]` is the default directory name; however, it can be set to any value as long as the `oozieAppDir` property is provided with the directory name as its value.

Subworkflows are supported as well; subworkflow directories should be nested within the `[oozie_app]` directory.

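As a starting point, the layout described above can be created with a short shell sketch. The scratch location and the names `my_workflow` and `my_subworkflow` are hypothetical, the `workflow.xml` file is an empty placeholder, and what a subworkflow directory has to contain is not shown here.

```shell
# Creates the minimal layout in a scratch directory (hypothetical location).
workflow_source_dir=$(mktemp -d)/my_workflow
oozieAppDir="oozie_app"                    # default name; see the oozieAppDir property

mkdir -p "$workflow_source_dir/$oozieAppDir"
: > "$workflow_source_dir/job.properties"               # optional
: > "$workflow_source_dir/$oozieAppDir/workflow.xml"    # placeholder for a real Oozie definition

# A subworkflow lives in a directory nested within the application directory.
mkdir -p "$workflow_source_dir/$oozieAppDir/my_subworkflow"

find "$workflow_source_dir" | sort
```
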
Creating oozie installer step-by-step
=====================================

The automated oozie-installer steps are the following:

1. creating jar packages: `*.jar` and `*tests.jar`, along with copying all dependencies into `target/dependencies`
2. reading properties from Maven, `job.properties` and `job-override.properties`
3. invoking the priming mechanism, which links resources listed in the `import.txt` file (currently resolving subworkflow resources)
4. assembling shell scripts for preparing the Hadoop filesystem, uploading the Oozie application and starting the workflow
5. copying the whole `${workflow.source.dir}` content to `target/${oozie.package.file.name}`
6. generating an updated `job.properties` file in `target/${oozie.package.file.name}` based on Maven, `job.properties` and `job-override.properties`
7. creating a `lib` directory (or multiple directories, one for each nested subworkflow directory) and copying the jar packages created in step (1) to each of them
8. bundling the whole `${oozie.package.file.name}` directory into a single `tar.gz` package

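Steps 7 and 8 can be pictured with the following shell emulation on dummy data. It is a sketch of the resulting layout only, not the installer's actual implementation; the directory names, the `example.jar` file and the `oozie-package` stand-in for `${oozie.package.file.name}` are assumptions.

```shell
# Emulates steps 7 and 8 on dummy data; NOT the installer's actual implementation.
target=$(mktemp -d)
pkg="oozie-package"                        # stand-in for ${oozie.package.file.name}

# Dummy inputs: one jar and an application directory with one nested subworkflow.
mkdir -p "$target/jars" "$target/$pkg/oozie_app/my_subworkflow"
: > "$target/jars/example.jar"

# Step 7: a lib directory in the application directory and in each nested directory.
# The directory list is collected first, so the new lib directories are not revisited.
find "$target/$pkg/oozie_app" -type d ! -name lib > "$target/dirs.txt"
while IFS= read -r dir; do
    mkdir -p "$dir/lib"
    cp "$target/jars/"*.jar "$dir/lib/"
done < "$target/dirs.txt"

# Step 8: bundle the whole package directory into a single tar.gz file.
tar -czf "$target/$pkg.tar.gz" -C "$target" "$pkg"
tar -tzf "$target/$pkg.tar.gz" | sort
```
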
Uploading oozie package and running workflow on cluster
=======================================================

To simplify the deployment and execution process, four dedicated profiles were introduced:

- deploy-local
- run-local
- deploy
- run

They are to be used along with the `oozie-package` profile, e.g. by providing the `-Poozie-package,deploy-local,run-local` or `-Poozie-package,deploy,run` Maven parameters.

`deploy-local` profile supplements the packaging process with:
1) extracting the oozie package to the `target/local-upload` directory
2) uploading the oozie package content to the local Hadoop cluster

`run-local` profile introduces:
1) executing the workflow uploaded to the HDFS cluster using the `deploy-local` command

`deploy` profile supplements the packaging process with:
1) uploading the oozie package via scp to the `/mnt/tmp/${user.name}/oozie-package-${timestamp}` directory on the `${iis.hadoop.frontend.host.name}` machine
2) extracting the uploaded package
3) uploading the oozie content to the Hadoop cluster

`run` profile introduces:
1) executing the workflow uploaded to the HDFS cluster using the `deploy` command
2) removing the uploaded files

Notice: ssh access to the frontend machine has to be configured at the system level; key-based authentication is preferable, as it simplifies remote operations.

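The profile combinations can be wrapped in a small helper that prints the Maven command line for a local or a cluster run. This is a sketch under a few assumptions: it uses the `oozie-package` profile name and the `package` goal from the command shown at the start of this document, the function name `build_command` is hypothetical, and the workflow location is an example value. The command is only printed, not executed.

```shell
# Prints (does not run) the Maven command line for a given mode.
build_command() {
    mode=$1            # "local" or "cluster"
    workflow=$2        # classpath location passed as workflow.source.dir
    case "$mode" in
        local)   profiles="oozie-package,deploy-local,run-local" ;;
        cluster) profiles="oozie-package,deploy,run" ;;
        *)       echo "unknown mode: $mode" >&2; return 1 ;;
    esac
    printf 'mvn package -P%s -Dworkflow.source.dir=%s\n' "$profiles" "$workflow"
}

# Hypothetical workflow location, used for illustration only.
build_command local   eu/dnetlib/example/my_workflow
build_command cluster eu/dnetlib/example/my_workflow
```
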
Other tips
==========

It is good practice to define all Hadoop cluster related properties in the local `~/.m2/settings.xml` file.

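A minimal sketch of such a settings file is shown below. It is written to a scratch location rather than to `~/.m2/settings.xml`, so that an existing file is not overwritten. The host names, the user name and the profile id `hadoop-cluster` are hypothetical; the property names are the ones listed in the "Required properties" section.

```shell
# Writes an example settings file to a scratch location instead of ~/.m2/settings.xml.
# Host names, the user name and the profile id are hypothetical.
settings_file=$(mktemp)
cat > "$settings_file" <<'EOF'
<settings>
  <profiles>
    <profile>
      <id>hadoop-cluster</id>
      <properties>
        <iis.hadoop.frontend.host.name>frontend.example.org</iis.hadoop.frontend.host.name>
        <iis.hadoop.master.host.name>master.example.org</iis.hadoop.master.host.name>
        <iis.hadoop.frontend.user.name>jdoe</iis.hadoop.frontend.user.name>
      </properties>
    </profile>
  </profiles>
  <activeProfiles>
    <activeProfile>hadoop-cluster</activeProfile>
  </activeProfiles>
</settings>
EOF
cat "$settings_file"
```

Listing the profile under `activeProfiles` makes its properties available to every build without an extra `-P` switch.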