Monday, November 30, 2015

Submitting Oozie job



1.       Keep coordinator.xml and workflow.xml in HDFS. Keep job.properties in the local file system.
2.       Keep the required sqoop script files in any HDFS location.

3.       Go to the local directory that contains job.properties and run the command below.
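The three steps above can be sketched as shell commands (all paths and hostnames below are illustrative placeholders, not from the original setup):

```shell
# HDFS application directory (assumed name for illustration)
APP_DIR=/user/hue/oozie/coord-app

# Steps 1 and 2: push the XML definitions and sqoop scripts to HDFS
# (guarded so the sketch is inert where the hadoop CLI is absent)
if command -v hadoop >/dev/null 2>&1; then
    hadoop fs -mkdir -p "$APP_DIR"/scripts
    hadoop fs -put coordinator.xml workflow.xml "$APP_DIR"/
    hadoop fs -put sqoop-import.sh "$APP_DIR"/scripts/
fi

# Step 3: job.properties stays local and points at the HDFS copies
cat > job.properties <<EOF
nameNode=hdfs://namenode:8020
oozie.coord.application.path=\${nameNode}${APP_DIR}
oozie.use.system.libpath=true
EOF
```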

Running Oozie co-ordinator job:

>oozie job -oozie http://<IP>:11000/oozie -config ./job.properties -run
0000008-150525074405209-oozie-oozi-C

Knowing the status for the job workflow:
>oozie job -oozie http://<IP>:11000/oozie -info 0000008-150525074405209-oozie-oozi-C


Kill the oozie workflow job:
>oozie job -oozie http://<IP>:11000/oozie -kill 0000008-150525074405209-oozie-oozi-C

Including additional jars with your workflow in oozie


I’ve seen a lot of confusion about how to include additional jars with your workflow and I’d like to use this opportunity to clarify. Below are the various ways to include a jar with your workflow:
  1. Set oozie.libpath=/path/to/jars,another/path/to/jars in job.properties.
    • This is useful if you have many workflows that all need the same jar; you can put it in one place in HDFS and use it with many workflows. The jars will be available to all actions in that workflow. 
    • There is no need to ever point this at the ShareLib location. (I see that in a lot of workflows.) Oozie knows where the ShareLib is and will include it automatically if you set oozie.use.system.libpath=true in job.properties.
  2. Create a directory named “lib” next to your workflow.xml in HDFS and put jars in there. 
    • This is useful if you have some jars that you only need for one workflow. Oozie will automatically make those jars available to all actions in that workflow. 
  3. Specify the <archive> tag in an action with the path to a single jar; you can have multiple <archive> tags.
    • This is useful if you want some jars only for a specific action and not all actions in a workflow.
    • The downside is that you have to specify them in your workflow.xml, so if you ever need to add/remove some jars, you have to change your workflow.xml.
  4. Add jars to the ShareLib (e.g. /user/oozie/share/lib/lib_<timestamp>/pig)
    • While this will work, it’s not recommended for two reasons:
      • The additional jars will be included with every workflow using that ShareLib, which may be unexpected to those workflows and users.
      • When upgrading the ShareLib, you’ll have to recopy the additional jars to the new ShareLib.
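Options 1 and 2 above can be sketched as follows (the HDFS paths and jar names are made up for illustration):

```shell
# Option 1: shared jars in HDFS, referenced from job.properties
cat > job.properties <<'EOF'
oozie.libpath=/user/shared/jars,/user/shared/more-jars
oozie.use.system.libpath=true
EOF

# Option 2: a "lib" directory next to workflow.xml
# (guarded so the sketch is inert where the hadoop CLI is absent)
if command -v hadoop >/dev/null 2>&1; then
    hadoop fs -mkdir -p /user/me/app/lib
    hadoop fs -put my-udf.jar /user/me/app/lib/
fi
```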

Conclusion

At first, these changes may seem complicated and overwhelming. But just remember that, in a nutshell, all we did was add an extra level with a timestamp (the lib_<timestamp> directory). The ShareLib still works the same way as before and you don’t have to update any of your workflows to continue using it. Other than the installation changes (which Cloudera Manager can handle for you), everything else is optional or provided to make things easier. 

Monday, June 8, 2015

Oozie commands

Oozie commands
---------------
Note: Replace the Oozie server and port with your cluster-specific values.
1) Submit job:
$ oozie job -oozie http://cdh-dev01:11000/oozie -config oozieProject/workflowSshAction/job.properties -submit
job: 0000012-130712212133144-oozie-oozi-W
2) Run job:
$ oozie job -oozie http://cdh-dev01:11000/oozie -start 0000014-130712212133144-oozie-oozi-W
3) Check the status:
$ oozie job -oozie http://cdh-dev01:11000/oozie -info 0000014-130712212133144-oozie-oozi-W
4) Suspend workflow:
$ oozie job -oozie http://cdh-dev01:11000/oozie -suspend 0000014-130712212133144-oozie-oozi-W
5) Resume workflow:
$ oozie job -oozie http://cdh-dev01:11000/oozie -resume 0000014-130712212133144-oozie-oozi-W
6) Re-run workflow:
$ oozie job -oozie http://cdh-dev01:11000/oozie -config oozieProject/workflowSshAction/job.properties -rerun 0000014-130712212133144-oozie-oozi-W
7) Should you need to kill the job:
$ oozie job -oozie http://cdh-dev01:11000/oozie -kill 0000014-130712212133144-oozie-oozi-W
8) View server logs:
$ oozie job -oozie http://cdh-dev01:11000/oozie -logs 0000014-130712212133144-oozie-oozi-W
Logs are available at:
/var/log/oozie on the Oozie server.
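As a small convenience, the -info command above can be polled from a script until the workflow leaves the RUNNING state. A rough sketch (the server URL and job id are the placeholders from above, and the "Status :" line format is an assumption about the -info output):

```shell
OOZIE_URL=http://cdh-dev01:11000/oozie
JOB_ID=0000014-130712212133144-oozie-oozi-W

# Extract the workflow status from the -info output
check_status() {
    oozie job -oozie "$OOZIE_URL" -info "$1" | awk '/^Status/ {print $3}'
}

# Poll every 30 seconds while the job is still RUNNING
# (guarded so the sketch is inert where the oozie CLI is absent)
if command -v oozie >/dev/null 2>&1; then
    while [ "$(check_status "$JOB_ID")" = "RUNNING" ]; do
        sleep 30
    done
fi
```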

Tuesday, May 26, 2015

Hive metastore upgrade and Database Requirements


By default, Hive and Oozie use an embedded Derby database for their metastores. To use an external database instead, have a MySQL, Oracle, or PostgreSQL database deployed and available before configuring the metastore.

Ensure that your database administrator creates the following databases and users:
• For Hive: a database (hive_dbname) and a user (hive_dbuser) with password (hive_dbpasswd).
• For Oozie: a database (oozie_dbname) and a user (oozie_dbuser) with password (oozie_dbpasswd).
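For a MySQL metastore, the DBA steps above might look like this (the database name, user, and password are hypothetical stand-ins for hive_dbname, hive_dbuser, and hive_dbpasswd):

```shell
# Hypothetical names -- substitute your own
HIVE_DB=metastore
HIVE_USER=hive
HIVE_PASS=hivepassword

SQL="CREATE DATABASE IF NOT EXISTS ${HIVE_DB};
CREATE USER '${HIVE_USER}'@'%' IDENTIFIED BY '${HIVE_PASS}';
GRANT ALL PRIVILEGES ON ${HIVE_DB}.* TO '${HIVE_USER}'@'%';
FLUSH PRIVILEGES;"

# Run as the MySQL root user:
#   echo "$SQL" | mysql -u root -p
echo "$SQL"
```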

Instructions to configure an Oracle database:

Run the following SQL script against your Hive schema: /usr/lib/hive/scripts/metastore/upgrade/oracle/hive-schema-0.12.0.oracle.sql

Inside the upgrade folder you will find schema scripts for each supported database:
derby  mssql  mysql  oracle  postgres

e.g., /usr/hdp/2.2.0.0-2041/hive/scripts/metastore/upgrade/oracle/hive-schema-0.13.0.oracle.sql
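A rough sketch of running such a script against Oracle with sqlplus (the connect string and credentials are placeholders; the script path matches the HDP layout shown above):

```shell
SCHEMA_SQL=/usr/hdp/2.2.0.0-2041/hive/scripts/metastore/upgrade/oracle/hive-schema-0.13.0.oracle.sql

# Execute as the Hive schema owner:
#   sqlplus hiveuser/hivepassword@//dbhost:1521/ORCL @"$SCHEMA_SQL"
echo "$SCHEMA_SQL"
```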

Saturday, February 28, 2015

vi Editor: Useful concepts


Creating a file

  1. vi testfile
    In your home directory, invoke vi by typing vi followed by the name of the file you wish to create. You will see a screen with a column of tildes (~) along the left side. vi is now in command mode. Anything you type will be understood as a command, not as content to add to the file. In order to input text, you must type a command.
  2. i
    The two basic input commands are i, which means "insert the text I'm about to type to the left of the cursor", and a, which means "append the text I'm about to type to the right of the cursor". Since you are at the beginning of an empty file, either of these would work. We picked i arbitrarily.
  3. Type in some text; here's a profound statement from philosopher Charles Sanders Peirce, if you can't think of your own:
         And what, then, is belief? It is the demi-cadence 
         which closes a musical phrase in the symphony of our 
         intellectual life.  We have seen that it has just 
         three properties: First, it is something that we are
         aware of; second, it appeases the irritation of doubt; 
         and, third, it involves the establishment in our 
         nature of a rule of action, or, say for short, a 
         habit.
    
    Press RET after each line, since vi will not move to the next line automatically; when you finish typing, press the ESC key to leave insert or append mode and return to command mode.
  4. :wq
    If you've done everything correctly, when you type this command it should appear at the bottom of your screen, below all the ~ characters. The : tells vi you're about to give a series of commands; the w means to write the file you've just typed in --- in most new programs this is called "save" --- and the q means to quit vi. So you should be back at the shell prompt.
  5. cat testfile
    cat will display the file you typed on the screen.
Don't remove testfile; we'll use it in the next tutorial section.
As you use vi, always remember that pressing ESC will return you to command mode. So if you get confused, press ESC a couple times and start over.
vi has an annoying tendency to beep whenever you do something you aren't supposed to, like type an unknown command; don't be alarmed by this.

Moving around in a file

To move around in a file, Debian's vi allows you to use the arrow keys. The traditional keys also work, however; they are h for left, j for down, k for up, and l for right. These keys were chosen because they are adjacent on the home row of the keyboard, and thus easy to type. Many people use them instead of the arrow keys since they're faster to reach with your fingers.
  1. vi testfile
    Open the file you created earlier with vi. You should see the text you typed before.
  2. Move around the file with the arrow keys or the hjkl keys. If you try to move too far in any direction, vi will beep and refuse to do so; if you want to put text there, you have to use an insertion command like i or a.
  3. :q
    Exit vi.

Deleting text

  1. vi testfile
    Open your practice file again.
  2. dd
    The dd command deletes a line; the top line of the file should be gone now. [Note: u undoes the last change.]
  3. x
    x deletes a single character; the first letter of the second line will be erased. Delete and backspace don't work in vi, for historical reasons. Some vi variants, such as vim, will let you use backspace and delete.
  4. 10x
    If you type a number before a command, it will repeat the command that many times. So this will delete 10 characters.
  5. 2dd
    You can use a number with the dd command as well, deleting two lines.
  6. :q
    This will cause an error, because you've changed the file but haven't saved yet. There are two ways to avoid this; you can :wq, thus writing the file as you quit, or you can quit without saving:
  7. :q!
    With an exclamation point, you tell vi that you really mean it, and it should quit even though the file isn't saved. If you use :q! your deletions will not be saved to testfile; if you use :wq, they will be.
  8. cat testfile
    Back at the shell prompt, view testfile. It should be shorter now, if you used :wq, or be unchanged if you used :q!.
:q! is an excellent command to remember, because you can use it to bail out if you get hopelessly confused and feel you've ruined the file you were editing. Just press ESC a few times to be sure you're in command mode and then type :q!. This is guaranteed to get you out of vi with no damage done.
You now know everything you need to do basic editing; insertion, deletion, saving, and quitting. The following sections describe useful commands for doing things faster; you can skip over them if you like.

Wednesday, January 28, 2015

Multiple OutputFile Mapreduce Program

Data Sample:
BoroughCode + StatusOrderNumber + SignSequence + Distance in Miles + ArrowPoints + SignDescription
B P-004958 1 0     Curb Line
B P-004958 2 9     Property Line
B P-004958 3 30    NIGHT REGULATION
B P-004958 4 30    1 HOUR PARKING 9AM-7PM





All of the above data is tab-delimited (\t).
Each line contains the column-header fields listed above.
The full dataset holds more than 800,000 records.

Mapreduce Case:
The client wants one output file per StatusOrderNumber, named after that StatusOrderNumber,
containing the maximum Distance and its corresponding SignDescription.

My mapreduce Program goes like this:

MultipleOutputMapper.Java:-

package org.bkfs.multipleoutput;

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MultipleOutputMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        /* BoroughCode StatusOrderNumber SignSequence Distance ArrowPoints SignDescription
           B P-004958 1 0   Curb Line */
        if (!value.toString().contains("BoroughCode")) {   // skip the header row
            String[] tokens = value.toString().split("\t");
            // key: StatusOrderNumber, value: "Distance,SignDescription"
            context.write(new Text(tokens[1]), new Text(tokens[3] + "," + tokens[5]));
        }
    }
}

MultipleOutputReducer.Java:-

package org.bkfs.multipleoutput;

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class MultipleOutputReducer extends Reducer<Text, Text, Text, Text> {

    private MultipleOutputs<Text, Text> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<Text, Text>(context);
    }

    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        /* key   -> P-004958
           value -> "1,Curb Line" */
        int maxValue = Integer.MIN_VALUE;
        Text maxDescr = new Text();

        for (Text value : values) {
            String[] items = value.toString().split(",");
            int distance = Integer.parseInt(items[0]);
            if (distance > maxValue) {
                maxValue = distance;
                maxDescr = new Text(items[1]);
            }
        }

        // Write to an output file named after the StatusOrderNumber key
        mos.write(new Text(String.valueOf(maxValue)), maxDescr, key.toString());
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
    }
}

MultipleOutputDriver.Java:-

package org.bkfs.multipleoutput;

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MultipleOutputDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: MultipleOutput <input path> <output path>");
            return -1;
        }

        Job job = Job.getInstance(getConf());
        job.setJarByClass(MultipleOutputDriver.class);
        job.setJobName("MultipleOutput Job");

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);

        job.setMapperClass(MultipleOutputMapper.class);
        job.setReducerClass(MultipleOutputReducer.class);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        MultipleOutputs.addNamedOutput(job, "MaxDistanceStatusOrderNumber",
                TextOutputFormat.class, NullWritable.class, Text.class);

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new MultipleOutputDriver(), args);
        System.exit(res);
    }
}
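A sketch of building and running the job (the jar name and HDFS paths are assumptions, not from the post):

```shell
JAR=multipleoutput.jar
MAIN=org.bkfs.multipleoutput.MultipleOutputDriver
IN=/user/me/signs/input
OUT=/user/me/signs/output

# Guarded so the sketch is inert where the hadoop CLI is absent
if command -v hadoop >/dev/null 2>&1; then
    hadoop jar "$JAR" "$MAIN" "$IN" "$OUT"
    # Expect one output file per StatusOrderNumber, e.g. P-004958-r-00000
    hadoop fs -ls "$OUT"
fi
```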

Wednesday, January 14, 2015

Running Shell scripting from Crontab in Linux


>crontab -l

>vi masterscript
* * * * * pathto.sh pathto.cfg >/dev/null 2>&1

ESC -> :q (exit without saving)
ESC -> :wq! (save and exit)
ESC -> :q! (force exit without saving)
[ESC: escape from insert mode]

>crontab masterscript

>crontab -l
* * * * * pathto.sh pathto.cfg >/dev/null 2>&1


You need to press the [Esc] key followed by the colon (:) before typing the following commands:
e.g., [ESC]  :wq!
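Putting the pieces together, a complete crontab entry might look like this (the 02:00 schedule and script paths are only examples):

```shell
# Run every day at 02:00, discarding both stdout and stderr
cat > masterscript <<'EOF'
0 2 * * * /path/to/script.sh /path/to/job.cfg >/dev/null 2>&1
EOF

# Install and verify:
#   crontab masterscript
#   crontab -l
```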


Tuesday, January 13, 2015

Apache Maven POM.xml for Hadoop Mapreduce programs

Download the Maven distribution and place it under Program Files.
Set the M2_HOME and M2 environment variables at the system level.

Run the command below from a command prompt to download dependencies to the .m2/repository folder


mvn org.apache.maven.plugins:maven-dependency-plugin:2.4:get \
    -DremoteRepositories=http://download.java.net/maven/2 \
    -Dartifact=robo-guice:robo-guice:0.4-SNAPSHOT \
    -Ddest=c:\temp\robo-guice.jar

Below is the POM.xml for my MaximumTempSample application. In Eclipse, use Project properties -> Maven -> Update Project... to pull in dependency updates.

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
       <modelVersion>4.0.0</modelVersion>
       <groupId>MaximumTempSample</groupId>
       <artifactId>org.mycompany.MaxRecord</artifactId>
       <version>0.0.1-SNAPSHOT</version>
       <packaging>jar</packaging>

       <name>MaximumTempSample</name>
       <url>http://maven.apache.org</url>

       <properties>
              <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
              <hadoop.version>2.5.0-cdh5.2.0</hadoop.version>
              <log4j.version>1.2.17</log4j.version>
              <maven_jar_plugin.version>2.5</maven_jar_plugin.version>
       </properties>
      
       <dependencies>
              <dependency>
                     <groupId>org.apache.logging.log4j</groupId>
                     <artifactId>log4j-api</artifactId>
                     <version>2.1</version>
              </dependency>
              <dependency>
                     <groupId>org.apache.logging.log4j</groupId>
                     <artifactId>log4j-core</artifactId>
                     <version>2.1</version>
              </dependency>
              <dependency>
                     <groupId>org.apache.hadoop</groupId>
                     <artifactId>hadoop-client</artifactId>
                     <version>${hadoop.version}</version>
              </dependency>
       </dependencies>
       
      <build>
              <plugins>
                     <plugin>
                           <groupId>org.apache.maven.plugins</groupId>
                           <artifactId>maven-jar-plugin</artifactId>
                           <version>${maven_jar_plugin.version}</version>
                     </plugin>
              </plugins>
       </build>
      
      <repositories>
              <repository>
                     <id>cloudera-repo</id>
                     <url>http://repository.cloudera.com/artifactory/cloudera-repos/</url>
              </repository>
       </repositories>

 </project>
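With this POM in place, the jar can be built from the directory containing pom.xml (the jar name below follows Maven's target/<artifactId>-<version>.jar convention):

```shell
JAR=target/org.mycompany.MaxRecord-0.0.1-SNAPSHOT.jar

# Guarded so the sketch is inert where mvn or the project is absent
if command -v mvn >/dev/null 2>&1 && [ -f pom.xml ]; then
    mvn clean package && ls "$JAR"
fi
```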

Monday, January 12, 2015

Maximum Temperature Mapreduce Program

DriverClass.Java:-
----------------------
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DriverClass {
    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: MaxTemperature <input path> <output path>");
            System.exit(-1);
        }

        Job job = Job.getInstance();   // new Job() is deprecated
        job.setJarByClass(DriverClass.class);
        job.setJobName("Max temperature");
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapperClass(DMapper.class);
        job.setCombinerClass(DReducer.class);
        job.setReducerClass(DReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}


DMapper.Java:-
--------------------
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class DMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final int MISSING = 9999;

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        if (value.toString().length() > 130) {
            String line = value.toString();
            String year = line.substring(15, 19);
            int airTemperature;

            if (line.charAt(87) == '+') {
                // parseInt doesn't like leading plus signs
                airTemperature = Integer.parseInt(line.substring(88, 92));
            } else {
                airTemperature = Integer.parseInt(line.substring(87, 92));
            }
            String quality = line.substring(92, 93);
            if (airTemperature != MISSING && quality.matches("[01459]")) {
                context.write(new Text(year), new IntWritable(airTemperature));
            }
        }
    }
}

DReducer.Java:-
--------------------
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class DReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int maxValue = Integer.MIN_VALUE;
        for (IntWritable value : values) {
            maxValue = Math.max(maxValue, value.get());
        }
        context.write(key, new IntWritable(maxValue));
    }
}


Sample Data:
0029029070999991901010106004+64333+023450FM-12+000599999V0202701N015919999999N0000001N9-00781+99999102001ADDGF108991999999999999999999
0029029070999991901010113004+64333+023450FM-12+000599999V0202901N008219999999N0000001N9-00721+99999102001ADDGF104991999999999999999999
0029029070999991901010120004+64333+023450FM-12+000599999V0209991C000019999999N0000001N9-00941+99999102001ADDGF108991999999999999999999
0029029070999991901010206004+64333+023450FM-12+000599999V0201801N008219999999N0000001N9-00611+99999101831ADDGF108991999999999999999999
0029029070999991901010213004+64333+023450FM-12+000599999V0201801N009819999999N0000001N9-00561+99999101761ADDGF108991999999999999999999
0029029070999991901010220004+64333+023450FM-12+000599999V0201801N009819999999N0000001N9-00281+99999101751ADDGF108991999999999999999999
0029029070999991901010306004+64333+023450FM-12+000599999V0202001N009819999999N0000001N9-00671+99999101701ADDGF106991999999999999999999
0029029070999991901010313004+64333+023450FM-12+000599999V0202301N011819999999N0000001N9-00331+99999101741ADDGF108991999999999999999999
0029029070999991901010320004+64333+023450FM-12+000599999V0202301N011819999999N0000001N9-00281+99999101741ADDGF108991999999999999999999
0029029070999991901010406004+64333+023450FM-12+000599999V0209991C000019999999N0000001N9-00331+99999102311ADDGF108991999999999999999999
0029029070999991901010413004+64333+023450FM-12+000599999V0202301N008219999999N0000001N9-00441+99999102261ADDGF108991999999999999999999
0029029070999991901010420004+64333+023450FM-12+000599999V0202001N011819999999N0000001N9-00391+99999102231ADDGF108991999999999999999999
0029029070999991901010506004+64333+023450FM-12+000599999V0202701N004119999999N0000001N9+00001+99999101821ADDGF104991999999999999999999
0029029070999991901010513004+64333+023450FM-12+000599999V0202701N002119999999N0000001N9+00061+99999102591ADDGF104991999999999999999999
0029029070999991901010520004+64333+023450FM-12+000599999V0202301N004119999999N0000001N9+00001+99999102671ADDGF104991999999999999999999
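To sanity-check the parsing logic locally, the mapper's fixed-width extraction can be mirrored in awk over a few of the sample records above (note the awk offsets are 1-based where Java's substring is 0-based):

```shell
# Three of the sample records, written to a temp file
cat > sample.txt <<'EOF'
0029029070999991901010106004+64333+023450FM-12+000599999V0202701N015919999999N0000001N9-00781+99999102001ADDGF108991999999999999999999
0029029070999991901010513004+64333+023450FM-12+000599999V0202701N002119999999N0000001N9+00061+99999102591ADDGF104991999999999999999999
0029029070999991901010520004+64333+023450FM-12+000599999V0202301N004119999999N0000001N9+00001+99999102671ADDGF104991999999999999999999
EOF

result=$(awk 'length($0) > 130 {
    year = substr($0, 16, 4)              # Java substring(15, 19)
    sign = substr($0, 88, 1)              # Java charAt(87)
    temp = (sign == "+") ? substr($0, 89, 4) + 0 : substr($0, 88, 5) + 0
    q = substr($0, 93, 1)                 # quality code
    if (temp != 9999 && q ~ /[01459]/ && (!(year in max) || temp > max[year]))
        max[year] = temp
} END { for (y in max) print y, max[y] }' sample.txt)

echo "$result"   # prints: 1901 6
```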