What's going on HADOOP with BIGDATA?

This blog focuses on discussions of Hadoop technology and the tools in the Hadoop ecosystem. Hadoop experts and beginners share their views and experiences here, and newcomers can post questions for the experts to answer. I created this blog to spare beginners the expensive Hadoop projects on the market. I also do my best to collect and share genuine posts from Hadoop discussions around the internet.

Saturday, June 10, 2017

What is the difference between Map and FlatMap in Apache Spark

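The example below demonstrates the difference between the map() and flatMap() operations on an RDD, using the Scala shell. map() returns exactly one output element per input element, so splitting each line produces an array of arrays. flatMap() flattens those per-line arrays into a single array.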

mountain@mountain:~/sbook$ cat words.txt

line1 word1

line2 word2 word1

line3 word3 word4

line4 word1

scala> val lines = sc.textFile("words.txt");

...

scala> lines.map(_.split(" ")).take(3)

res4: Array[Array[String]] = Array(Array(line1, word1), Array(line2, word2, word1), Array(line3, word3, word4))

flatMap() flattens the per-line arrays into one single array of words, so take(3) now returns the first three words rather than the first three lines:

scala> lines.flatMap(_.split(" ")).take(3)

res5: Array[String] = Array(line1, word1, line2)
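The same behaviour can be tried without a cluster. The snippet below is a minimal sketch that uses plain Scala collections in place of the RDD; the `lines` list is an assumption that simply copies the four rows of words.txt, and the names `mapped` and `flattened` are made up for illustration. It can be pasted into the Scala shell as-is.

```scala
// Plain Scala collections stand in for the RDD here; no Spark is needed.
// The list copies the four rows of words.txt shown above.
val lines = List(
  "line1 word1",
  "line2 word2 word1",
  "line3 word3 word4",
  "line4 word1"
)

// map: one output element per input line, so the result stays nested.
val mapped: List[Array[String]] = lines.map(_.split(" "))

// flatMap: the per-line arrays are concatenated into one flat list.
val flattened: List[String] = lines.flatMap(_.split(" "))

println(mapped.map(_.mkString("[", ", ", "]")).mkString(", "))
println(flattened.mkString(", "))
```

Here `mapped` has four elements (one per line), while `flattened` has ten (one per word), which is why take(3) on the flatMap() result above stops in the middle of the second line.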

Sunday, June 4, 2017

Finding duplicate records from a Hive Table

SET hive.cli.errors.ignore=true;
SET hive.vectorized.execution.enabled=true;
SET hive.vectorized.execution.reduce.enabled=true;
SET hive.cbo.enable=true;
SET hive.compute.query.using.stats=true;
SET hive.stats.fetch.column.stats=true;
SET hive.stats.fetch.partition.stats=true;
SET hive.optimize.index.filter=true;
SET tez.runtime.sort.threads=2;
SET tez.task.generate.counters.per.io=true;
SET hive.prewarm.enabled=true;
SET hive.prewarm.numcontainers=8;
SET tez.runtime.io.sort.mb=550;
SET tez.runtime.optimize.local.fetch=true;
SET tez.runtime.shuffle.keep-alive.enabled=true;
SET hive.cli.print.header=true;

SELECT column1, column2, column3, column4, dt, time, inst_id, customer_id
FROM SampleDB.SampleTable
GROUP BY column1, column2, column3, column4, dt, time, inst_id, customer_id
HAVING COUNT(*) > 1;
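The query groups on every selected column and keeps only the groups that contain more than one row. The snippet below is a rough sketch of that same logic on an in-memory Scala collection, not a replacement for the Hive query. The `Record` case class, its three fields, and the sample rows are all hypothetical and only stand in for the table's columns.

```scala
// Hypothetical row type: the fields stand in for a few of the grouped columns.
case class Record(column1: String, dt: String, customerId: Long)

// Made-up sample data for illustration only.
val rows = List(
  Record("a", "2017-06-01", 1L),
  Record("b", "2017-06-01", 2L),
  Record("a", "2017-06-01", 1L), // exact duplicate of the first row
  Record("a", "2017-06-02", 1L)  // differs in dt, so not a duplicate
)

// GROUP BY all columns: case-class equality compares every field.
// HAVING COUNT(*) > 1: keep only the groups with more than one member.
val duplicates: Map[Record, Int] =
  rows.groupBy(identity).collect {
    case (record, group) if group.size > 1 => record -> group.size
  }

duplicates.foreach { case (record, n) => println(s"$record appears $n times") }
```

As in the Hive query, a row counts as a duplicate only when every grouped column matches, so the record that differs in `dt` is not reported.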