I’ve seen a lot of confusion about how to include additional jars with your workflow and I’d like to use this opportunity to clarify. Below are the various ways to include a jar with your workflow:
- Set oozie.libpath=/path/to/jars,another/path/to/jars in job.properties.
  - This is useful if you have many workflows that all need the same jar; you can put it in one place in HDFS and use it with many workflows. The jars will be available to all actions in that workflow.
  - There is never any need to point this at the ShareLib location. (I see that in a lot of workflows.) Oozie knows where the ShareLib is and will include it automatically if you set oozie.use.system.libpath=true in job.properties.
- Create a directory named “lib” next to your workflow.xml in HDFS and put jars in there.
  - This is useful if you have some jars that you only need for one workflow. Oozie will automatically make those jars available to all actions in that workflow.
- Specify the <file> tag in an action with the path to a single jar; you can have multiple <file> tags.
  - This is useful if you want some jars only for a specific action and not all actions in a workflow.
  - The downside is that you have to specify them in your workflow.xml, so if you ever need to add or remove jars, you have to change your workflow.xml.
- Add jars to the ShareLib (e.g. /user/oozie/share/lib/lib_<timestamp>/pig).
  - While this will work, it’s not recommended for two reasons:
    - The additional jars will be included with every workflow using that ShareLib, which may be unexpected to those workflows and users.
    - When upgrading the ShareLib, you’ll have to recopy the additional jars to the new ShareLib.
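As a rough illustration of the first option, a job.properties might look something like the sketch below. The nameNode address and the workflow application path are hypothetical placeholders, not values from any real cluster; the two library-related properties are the ones described above, with the example paths carried over unchanged.

```properties
# Hypothetical cluster address and workflow location; replace with your own.
nameNode=hdfs://namenode.example.com:8020
oozie.wf.application.path=${nameNode}/user/${user.name}/my-workflow

# Extra jars shared by many workflows (comma-separated HDFS directories).
oozie.libpath=/path/to/jars,another/path/to/jars

# Let Oozie add the ShareLib on its own; the ShareLib path is deliberately
# NOT listed in oozie.libpath.
oozie.use.system.libpath=true
```

With a file like this, jars in the listed directories and in the ShareLib should both be available to the workflow's actions, and jars in a “lib” directory next to workflow.xml are picked up without any property at all.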
Conclusion
At first, these changes may seem complicated and overwhelming. But just remember that, in a nutshell, all we did was add an extra level with a timestamp (the lib_<timestamp> directory). The ShareLib still works the same way as before and you don’t have to update any of your workflows to continue using it. Other than the installation changes (which Cloudera Manager can handle for you), everything else is optional or provided to make things easier.