Saturday, June 10, 2017

What is the difference between Map and FlatMap in Apache Spark

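The example below demonstrates the difference between the map() and flatMap() operations on an RDD, using the Scala shell. map() produces exactly one output element for each input element, whereas flatMap() flattens the collection produced for each input element into a single collection.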

mountain@mountain:~/sbook$ cat words.txt

line1 word1

line2 word2 word1

line3 word3 word4

line4 word1

scala> val lines = sc.textFile("words.txt");

...

scala> lines.map(_.split(" ")).take(3)

res4: Array[Array[String]] = Array(Array(line1, word1), Array(line2, word2, word1), Array(line3, word3, word4))

flatMap() flattens the per-line arrays into a single collection of words, so take(3) now returns the first three words rather than the first three lines:

scala> lines.flatMap(_.split(" ")).take(3)

res5: Array[String] = Array(line1, word1, line2)
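
The same contrast can be reproduced without a cluster. The following is a minimal sketch that uses plain Scala collections in place of the RDD (an assumption made so that it runs without a SparkContext); the object name MapVsFlatMap is made up for illustration, and the input mirrors words.txt above.

```scala
// Plain Scala collections stand in for the RDD here (an assumption, so that
// no SparkContext is needed). The lines mirror the contents of words.txt.
object MapVsFlatMap {
  val lines: Seq[String] = Seq(
    "line1 word1",
    "line2 word2 word1",
    "line3 word3 word4",
    "line4 word1"
  )

  // map: one output element (an Array of words) per input line.
  val mapped: Seq[Array[String]] = lines.map(_.split(" "))

  // flatMap: the per-line arrays are concatenated into one sequence of words.
  val flattened: Seq[String] = lines.flatMap(_.split(" "))

  def main(args: Array[String]): Unit = {
    println(s"map produced ${mapped.length} elements")         // one per line
    println(s"flatMap produced ${flattened.length} elements")  // one per word
    println(flattened.take(3).mkString(", "))
  }
}
```

With map() the element count equals the number of lines; with flatMap() it equals the total number of words, which is why take(3) returns such different results in the two shell sessions.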

Sunday, June 4, 2017

Finding duplicate records from a Hive Table


SET hive.cli.errors.ignore=true;
SET hive.vectorized.execution.enabled=true;
SET hive.vectorized.execution.reduce.enabled=true;
SET hive.cbo.enable=true;
SET hive.compute.query.using.stats=true;
SET hive.stats.fetch.column.stats=true;
SET hive.stats.fetch.partition.stats=true;
SET hive.optimize.index.filter=true;
SET tez.runtime.sort.threads=2;
SET tez.task.generate.counters.per.io=true;
SET hive.prewarm.enabled=true;
SET hive.prewarm.numcontainers=8;
SET tez.runtime.io.sort.mb=550;
SET tez.runtime.optimize.local.fetch=true;
SET tez.runtime.shuffle.keep-alive.enabled=true;
SET hive.cli.print.header=true;

SELECT column1, column2, column3, column4, dt, time, inst_id, customer_id
FROM SampleDB.SampleTable
GROUP BY column1, column2, column3, column4, dt, time, inst_id, customer_id HAVING COUNT(*) > 1;
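
The query treats a row as a duplicate when every selected column matches another row: grouping by all columns and keeping the groups with COUNT(*) > 1 returns each duplicated combination once. As a rough illustration of that logic only, here is a sketch on an in-memory Scala collection; the Record type, its reduced set of three fields, and the sample values are hypothetical and are not taken from SampleDB.SampleTable.

```scala
object FindDuplicates {
  // Hypothetical record type: only three of the query's columns are modelled.
  final case class Record(column1: String, dt: String, customerId: Long)

  // Equivalent of GROUP BY <all columns> HAVING COUNT(*) > 1:
  // group identical rows together and keep the keys of groups larger than one.
  def duplicates(rows: Seq[Record]): Set[Record] =
    rows.groupBy(identity)
      .collect { case (row, group) if group.size > 1 => row }
      .toSet

  def main(args: Array[String]): Unit = {
    // Made-up sample data containing one duplicated row.
    val rows = Seq(
      Record("a", "2017-06-04", 1L),
      Record("b", "2017-06-04", 2L),
      Record("a", "2017-06-04", 1L)
    )
    duplicates(rows).foreach(println)
  }
}
```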