One of the things I’m frequently asked about is how to use HBase from Apache Hive. Not just how to do it, but what works, how well it works, and how to make good use of it. I’ve done a bit of research in this area, so hopefully this will be useful to someone besides myself. This is a topic we did not get to cover in HBase in Action; perhaps these notes will become the basis for the 2nd edition ;) These notes are applicable to Hive 0.11.x used in conjunction with HBase 0.94.x. They should be largely applicable to 0.12.x + 0.96.x, though I haven’t tested everything yet.
The Hive project includes an optional library for interacting with HBase. This is where the bridge layer between the two systems is implemented. The primary interface you use when accessing HBase from Hive queries is called the HBaseStorageHandler. You can also interact with HBase tables directly via input and output formats, but the handler is simpler and works for most uses.
HBase tables from Hive
Use the HBaseStorageHandler to register HBase tables with the Hive metastore. You can optionally specify the HBase table as EXTERNAL, in which case Hive will not create or drop that table directly – you’ll have to use the HBase shell to do so.
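The statement below is a sketch reconstructed from the description that follows (the Hive table foo, the HBase table bar, and the column mapping are the examples used throughout this section; the mapping itself is explained under Schema mapping):

CREATE TABLE foo(rowkey STRING, a STRING, b STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,f:c1,f:c2')
TBLPROPERTIES ('hbase.table.name' = 'bar');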
The above statement registers the HBase table named bar in the Hive metastore, accessible from Hive by the name foo. Under the hood, HBaseStorageHandler delegates interaction with the HBase table to HiveHBaseTableInputFormat and HiveHBaseTableOutputFormat. You can register your HBase table in Hive using those classes directly if you desire. The above statement is roughly equivalent to:
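A sketch of that direct form, eliding the SerDe and table-property clauses the storage handler would otherwise supply for you:

CREATE TABLE foo(rowkey STRING, a STRING, b STRING)
STORED AS
  INPUTFORMAT 'org.apache.hadoop.hive.hbase.HiveHBaseTableInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.hbase.HiveHBaseTableOutputFormat';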
Also provided is the HiveHFileOutputFormat which means it should be possible to generate HFiles for bulkloading from Hive as well. In practice, I haven’t gotten this to work end-to-end (see HIVE-4627).
Schema mapping
Registering the table is only the first step. As part of that registration, you also need to specify a column mapping. This is how you link Hive column names to the HBase table’s rowkey and columns. Do so using the hbase.columns.mapping SerDe property.
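For the foo table registered above, the mapping rides along on the table definition as a SerDe property, for example:

WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,f:c1,f:c2')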
The values provided in the mapping property correspond one-for-one with column names of the Hive table. HBase column names are fully qualified by column family, and you use the special token :key to represent the rowkey. The above example makes rows from the HBase table bar available via the Hive table foo. The foo column rowkey maps to the HBase table’s rowkey, a to c1 in the f column family, and b to c2, also in the f family.
You can also associate Hive’s MAP data structures with HBase column families. In this case, only the STRING Hive type is used. The other Hive type currently supported is BINARY. See the wiki page for more examples.
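As a sketch, mapping an entire column family to a Hive MAP looks like this (a family name with no qualifier, here f:, pulls in every column of that family; the Hive table name foo_map is illustrative):

CREATE TABLE foo_map(rowkey STRING, f MAP<STRING,STRING>)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,f:')
TBLPROPERTIES ('hbase.table.name' = 'bar');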
Interacting with data
With the column mappings defined, you can now access HBase data just like you would any other Hive data. Only simple query predicates are currently supported.
SELECT * FROM foo WHERE ...;
You can also populate an HBase table using Hive. This works with both INTO and OVERWRITE clauses.
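For example (a sketch; source_table stands in for whatever Hive table or view supplies the data, with columns matching foo's):

INSERT INTO TABLE foo SELECT rowkey, a, b FROM source_table;
INSERT OVERWRITE TABLE foo SELECT rowkey, a, b FROM source_table;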
Be advised that there is a regression in Hive 0.12.0 which breaks this feature, see HIVE-5515.
In practice
There’s still a little finesse required to get everything wired up properly at runtime. The HBase interaction module is completely optional, so you have to make sure it and its HBase dependencies are available on Hive’s classpath.
The installation environment could do a better job of handling this for users, but for the time being you must manage it yourself. Ideally the hive bin script would detect the presence of HBase and automatically make the necessary CLASSPATH adjustments; this enhancement appears to be tracked in HIVE-2055. The last mile is provided by the distribution itself, which ensures the environment variables are set for hive. This functionality is provided by BIGTOP-955. You also need to make sure the necessary jars are shipped out to the MapReduce jobs when you execute your Hive statements. Hive provides a mechanism for shipping additional job dependencies via the auxjars feature.
$ export HIVE_AUX_JARS_PATH=...
$ hive -e "SELECT * FROM ..."
I did discover a small bug in HDP-1.3 builds which masks user-specified values of HIVE_AUX_JARS_PATH. With administrative rights, this is easily fixed by correcting the line in hive-env.sh to respect an existing value. The work-around in user scripts is to use the SET statement to provide a value once you’ve launched the Hive CLI.
SET hive.aux.jars.path=...
Hive should be able to detect which jars are necessary and add them itself. HBase provides the TableMapReduceUtil#addDependencyJars methods for this purpose. It appears that this is done in hive-0.12.0, at least according to HIVE-2379.
Future work
Much has been said about proper support for predicate pushdown (HIVE-1643, HIVE-2854, HIVE-3617, HIVE-3684) and data type awareness (HIVE-1245, HIVE-2599). These go hand-in-hand, as predicate semantics are defined in terms of the types upon which they operate. More could be done to map Hive’s complex data types like Maps and Structs onto HBase column families as well (HIVE-3211). Support for HBase timestamps is a bit of a mess; they’re not made available to Hive applications with any level of granularity (HIVE-2828, HIVE-2306). The only interaction a user has is via a storage handler setting for writing a custom timestamp with all operations.
From a performance perspective, there are things Hive can do today (i.e., not dependent on data types) to take advantage of HBase. There’s also the possibility of an HBase-aware Hive making use of HBase tables as an intermediate storage location (HIVE-3565), facilitating map-side joins against dimension tables loaded into HBase. Hive could make use of HBase’s natural indexed structure (HIVE-3634, HIVE-3727), potentially saving huge scans.
Currently, the user doesn’t have (any?) control over the scans which are executed. Configuration on a per-job, or at least per-table, basis should be enabled (HIVE-1233). That would allow an HBase-savvy user to provide Hive with hints regarding how it should interact with HBase. Support for simple split sampling of HBase tables (HIVE-3399) could also be done easily, because HBase manages table partitions already.
Other access channels
Everything discussed thus far has required Hive to interact with online HBase RegionServers. Applications may stand to gain significant throughput and enjoy greater flexibility by interacting directly with HBase data persisted to HDFS. This also has the benefit of preventing Hive workloads from interfering with online SLA-bound HBase applications (at least, until we see HBase improvements in QOS isolation between tasks, HBASE-4441).
As mentioned earlier, there is the HiveHFileOutputFormat. Resolving HIVE-4627 should make Hive a straight-forward way to generate HFiles for bulk loading. Once you’ve created the HFiles using Hive, there’s still the last step of running the LoadIncrementalHFiles utility to copy and register them in the regions. For this, the HiveStorageHandler interface will need some kind of hook to influence the query plan as it’s created, allowing it to append steps. Once in place, it should be possible to SET a runtime flag, switching an INSERT operation to use bulkload.
HBase recently introduced the table snapshot feature. This allows a user to take a point-in-time view of a table, persisted to HDFS. HBase is able to restore a table from a snapshot to a previous state, and to create an entirely new table from an existing snapshot. Hive does not currently support reading from an HBase snapshot. For that matter, HBase doesn’t yet support MapReduce jobs over snapshots, though the feature is a work in progress (HBASE-8369).
Conclusions
The interface between HBase and Hive is young, but has nice potential. There’s a lot of low-hanging fruit that can be picked up to make things easier and faster. The most glaring issue barring real application development is the impedance mismatch between Hive’s typed, dense schema and HBase’s untyped, sparse schema. This is as much a cognitive problem as a technical issue. Solutions here would allow a number of improvements to fall out, including much in the way of performance improvements. I’m hopeful that continuing work to add data types to HBase (HBASE-8089) can help bridge this gap.
Basic operations mostly work, at least in a rudimentary way. You can read data out of and write data back into HBase using Hive. Configuring the environment is an opaque and manual process, one which likely stymies novices from adopting the tools. There’s also the question of bulk operations – support for writing HFiles and reading HBase snapshots using Hive is entirely lacking at this point. And of course, there are bugs sprinkled throughout. The biggest recent improvement is the deprecation of HCatalog’s interface, removing the necessary upfront decision regarding which interface to use.
Hive provides a very usable SQL interface on top of HBase, one which integrates easily into many existing ETL workflows. That interface requires simplifying some of the BigTable semantics HBase provides, but the result will be to open up HBase to a much broader audience of users. The Hive interop complements extremely well the experience provided by Phoenix. Hive has the benefit of not requiring the deployment complexities currently required by that system. Hopefully the common definition of types will allow a complementary future.
Grab some data and register it in Hive
We’ll need some data to work with. For this purpose, grab some traffic stats from Wikipedia. Once we have some data, copy it up to HDFS.
$ mkdir pagecounts ; cd pagecounts
$ for x in {0..9} ; do wget "http://dumps.wikimedia.org/other/pagecounts-raw/2008/2008-10/pagecounts-20081001-0${x}0000.gz" ; done
$ hadoop fs -copyFromLocal $(pwd) ./
For reference, this is what the data looks like.
$ zcat pagecounts-20081001-000000.gz | head -n5
aa.b Special:Statistics 1 837
aa Main_Page 4 41431
aa Special:ListUsers 1 5555
aa Special:Listusers 1 1052
aa Special:PrefixIndex/Comparison_of_Guaze%27s_Law_and_Coulomb%27s_Law 1 4332
As I understand it, each record is a count of page views of a specific page on Wikipedia. The first column is the language code, second is the page name, third is the number of page views, and fourth is the size of the page in bytes. Each file contains an hour’s worth of aggregated data. None of the above pages were particularly popular that hour. Now that we have data and understand its raw schema, create a Hive table over it. To do that, we’ll use a DDL script that looks like this.
$ cat 00_pagecounts.ddl
-- define an external table over raw pagecounts data
CREATE TABLE IF NOT EXISTS pagecounts (projectcode STRING, pagename STRING, pageviews STRING, bytes STRING)
ROW FORMAT
  DELIMITED FIELDS TERMINATED BY ' '
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/user/ndimiduk/pagecounts';
Run the script to register our dataset with Hive.
$ hive -f 00_pagecounts.ddl
OK
Time taken: 2.268 seconds
Verify that the schema mapping works by calculating a simple statistic over the dataset.
$ hive -e "SELECT count(*) FROM pagecounts;"Total MapReduce jobs= 1
Launching Job 1 out of 1
...
OK
36668549
Time taken: 25.31 seconds, Fetched: 1 row(s)
Hive says the 10 files we downloaded contain about 36.7 million records. Let’s just confirm things are working as expected by getting a second opinion. This isn’t that much data, so confirm on the command line.
$ zcat * | wc -l
36668549
The record counts match up – excellent.
Transform the schema for HBase
The next step is to transform the raw data into a schema that makes sense for HBase. In our case, we’ll create a schema that allows us to calculate aggregate summaries of pages according to their titles. To do this, we want all the data for a single page grouped together. We’ll manage that by creating a Hive view that represents our target HBase schema. Here’s the DDL.
$ cat 01_pgc.ddl
-- create a view, building a custom hbase rowkey
CREATE VIEW IF NOT EXISTS pgc (rowkey, pageviews, bytes)
AS SELECT
  concat_ws('/',
    projectcode,
    concat_ws('/',
      pagename,
      regexp_extract(INPUT__FILE__NAME, 'pagecounts-(\\d{8}-\\d{6})\\..*$', 1))),
  pageviews,
  bytes
FROM pagecounts;
The SELECT statement uses Hive to build a compound rowkey for HBase. It concatenates the project code, page name, and date, joined by the '/' character. A handy trick: it uses a simple regex to extract the date from the source file names. Run it now.
$ hive -f 01_pgc.ddl
OK
Time taken: 2.712 seconds
This is just a view, so the SELECT statement won’t be evaluated until we query it for data; registering it with Hive doesn’t actually process any data. Again, make sure it works by querying Hive for a subset of the data.
$hive-e"SELECT * FROM pgc WHERE rowkey LIKE 'en/q%' LIMIT 10;"TotalMapReducejobs=1LaunchingJob1outof1...OKen/q:Special:Search/Blues/20081001-09000011168en/q:Special:Search/rock/20081001-0900001985en/qadam_rasul/20081001-09000011108en/qarqay/20081001-0900001933en/qemu/20081001-09000011144en/qian_lin/20081001-0900001918en/qiang_(spear)/20081001-0900001973en/qin_dynasty/20081001-09000011120en/qinghe_special_steel_corporation_disaster/20081001-0900001963en/qmail/20081001-09000011146Timetaken:40.382seconds,Fetched:10row(s)
Register the HBase table
Now that we have a dataset in Hive, it’s time to introduce HBase. The first step is to register our HBase table in Hive so that we can interact with it using Hive queries. That means another DDL statement. Here’s what it looks like.
$ cat 02_pagecounts_hbase.ddl
-- create a table in hbase to host the view
CREATE TABLE IF NOT EXISTS pagecounts_hbase (rowkey STRING, pageviews STRING, bytes STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,f:c1,f:c2')
TBLPROPERTIES ('hbase.table.name' = 'pagecounts');
This statement will tell Hive to go create an HBase table named pagecounts with the single column family f. It registers that HBase table in the Hive metastore by the name pagecounts_hbase with 3 columns: rowkey, pageviews, and bytes. The SerDe property hbase.columns.mapping makes the association from Hive column to HBase column. It says the Hive column rowkey is mapped to the HBase table’s rowkey, the Hive column pageviews to the HBase column f:c1, and bytes to the HBase column f:c2. To keep the example simple, we have Hive treat all these columns as the STRING type.
In order to use the HBase library, we need to make the HBase jars and configuration available to the local Hive process (at least until HIVE-5518 is resolved). Do that by specifying a value for the HADOOP_CLASSPATH environment variable before executing the statement.
$ export HADOOP_CLASSPATH=/etc/hbase/conf:/usr/lib/hbase/hbase-0.94.6.1.3.2.0-111-security.jar:/usr/lib/zookeeper/zookeeper.jar
$ hive -f 02_pagecounts_hbase.ddl
OK
Time taken: 4.399 seconds
Populate the HBase table
Now it’s time to write data to HBase. This is done using a regular Hive INSERT statement, sourcing data from the view with SELECT. There’s one more bit of administration we need to take care of though. This INSERT statement will run a mapreduce job that writes data to HBase. That means we need to tell Hive to ship the HBase jars and dependencies with the job. Note that this is a separate step from the classpath modification we did previously. Normally you can do this with an export statement from the shell, the same way we specified the HADOOP_CLASSPATH. However there’s a bug in HDP-1.3 that requires me to use Hive’s SET statement in the script instead.
$ cat 03_populate_hbase.hql
-- ensure hbase dependency jars are shipped with the MR job
-- Should export HIVE_AUX_JARS_PATH but this is broken in HDP-1.3.x
SET hive.aux.jars.path=file:///etc/hbase/conf/hbase-site.xml,file:///usr/lib/hive/lib/hive-hbase-handler-0.11.0.1.3.2.0-111.jar,file:///usr/lib/hbase/hbase-0.94.6.1.3.2.0-111-security.jar,file:///usr/lib/zookeeper/zookeeper-3.4.5.1.3.2.0-111.jar;

-- populate our hbase table
FROM pgc INSERT INTO TABLE pagecounts_hbase SELECT pgc.* WHERE rowkey LIKE 'en/q%' LIMIT 10;
Note there’s a big ugly bug in Hive 0.12.0 which means this doesn’t work with that version. Never fear though, we have a patch in progress. Follow along at HIVE-5515. If you choose to use a different method for setting Hive’s auxpath, be advised that it’s a tricky process – depending on how you specify it (HIVE_AUX_JARS_PATH, --auxpath), Hive will interpret the argument differently. HIVE-2349 seeks to remedy this unfortunate state of affairs.
$ hive -f 03_populate_hbase.hql
Total MapReduce jobs= 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
...
OK
Time taken: 40.296 seconds
Be advised also that this step is currently broken on secured HBase deployments. Follow along at HIVE-5523 if that’s of interest to you.
Query data from HBase-land
40 seconds later, you now have data in HBase. Let’s have a look using the HBase shell.
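The commands look something like this (scan output elided):

$ hbase shell
hbase(main):001:0> scan 'pagecounts'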
The scan returns our 10 rows, each with two columns, containing the data loaded using Hive. It’s now accessible in your online world using HBase. For example, perhaps you receive an updated data file and have a corrected value for one of the stats. You can update the record in HBase with a regular put command, as shown below.
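A hypothetical correction might look like this in the shell (the row key comes from the data above; the new value is made up):

hbase(main):002:0> put 'pagecounts', 'en/qmail/20081001-090000', 'f:c1', '2'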
Verify data from Hive
The HBase table remains available to your Hive world; Hive’s HBaseStorageHandler works both ways, after all. Note that querying it still expects HADOOP_CLASSPATH to be set, and HIVE_AUX_JARS_PATH as well if your query is complex.
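A minimal sanity check might look like this (a sketch; the LIMIT value is arbitrary, and whether the aux jars matter depends on whether the query launches a MapReduce job):

$ hive -e "SELECT * FROM pagecounts_hbase LIMIT 3;"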
Since the HBase table is accessible from Hive, you can continue to use Hive for your ETL processing with MapReduce. Keep in mind that the auxpath considerations apply here too, so I’ve scripted out the query instead of just running it directly at the command line.
$ cat 04_query_hbase.hql
-- ensure hbase dependency jars are shipped with the MR job
-- Should export HIVE_AUX_JARS_PATH but this is broken in HDP-1.3.x
SET hive.aux.jars.path=file:///etc/hbase/conf/hbase-site.xml,file:///usr/lib/hive/lib/hive-hbase-handler-0.11.0.1.3.2.0-111.jar,file:///usr/lib/hbase/hbase-0.94.6.1.3.2.0-111-security.jar,file:///usr/lib/zookeeper/zookeeper-3.4.5.1.3.2.0-111.jar;

-- query hive data
SELECT count(*) FROM pagecounts_hbase;
Run it the same way we did the others.
$ hive -f 04_query_hbase.hql
Total MapReduce jobs= 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
...
OK
10
Time taken: 19.473 seconds, Fetched: 1 row(s)
There you have it: a hands-on, end to end demonstration of interacting with HBase from Hive. You can learn more about the nitty-gritty details in Enis’s deck on the topic, or see the presentation he and Ashutosh gave at HBaseCon. If you’re inclined to make the intersection of these technologies work better (faster, stronger), I encourage you to pick up any of the JIRA issues mentioned in this post or the previous.
Can you give us some examples of how Hadoop is used in a real-world environment?
Let us assume that we have an exam consisting of 10 multiple-choice questions and 20 students appear for that exam. Every student will attempt each question. For each question and each answer option, a key will be generated. So we have a set of key-value pairs for all the questions and all the answer options for every student. Based on the options that the students have selected, you have to analyze and find out how many students have answered correctly.
This isn’t an easy task. Here, Hadoop comes into the picture! Hadoop helps you solve these problems quickly and without much effort. You may also take the case of how many students have wrongly attempted a particular question.
What is BloomMapFile used for?
The BloomMapFile is a class that extends MapFile. So its functionality is similar to MapFile.
BloomMapFile uses dynamic Bloom filters to provide a quick membership test for the keys. It is used in the HBase table format.
What is PIG?
Pig is a platform for analyzing large data sets. It consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating those programs. Pig’s infrastructure layer consists of a compiler that produces a sequence of MapReduce programs.
What is the difference between logical and physical plans?
Pig undergoes some steps when a Pig Latin Script is converted into MapReduce jobs. After performing the basic parsing and semantic checking, it produces a logical plan. The logical plan describes the logical operators that have to be executed by Pig during execution. After this, Pig produces a physical plan. The physical plan describes the physical operators that are needed to execute the script.
Does ‘ILLUSTRATE’ run MR job?
No, ILLUSTRATE will not run any MR job; it works on a small sample of the internal data. On the console, ILLUSTRATE does not run any job. It just shows the output of each stage and not the final output.
Is the keyword ‘DEFINE’ like a function name?
Yes, the keyword ‘DEFINE’ is like a function name. Once you have registered your jar, you have to define an alias for the function. Whatever logic you have written in your Java program is packaged in the jar that you export and register. The compiler will then check for the function; when the function is not present in the built-in library, it looks into your jar.
Is the keyword ‘FUNCTIONAL’ a User Defined Function (UDF)?
No, the keyword ‘FUNCTIONAL’ is not a User Defined Function (UDF). While using UDF, we have to override some functions. Certainly you have to do your job with the help of these functions only. But the keyword ‘FUNCTIONAL’ is a built-in function i.e a pre-defined function, therefore it does not work as a UDF.
Why do we need MapReduce during Pig programming?
Pig is a high-level platform that makes many Hadoop data analysis issues easier to execute. The language we use for this platform is: Pig Latin. A program written in Pig Latin is like a query written in SQL, where we need an execution engine to execute the query. So, when a program is written in Pig Latin, Pig compiler will convert the program into MapReduce jobs. Here, MapReduce acts as the execution engine.
Are there any problems which can only be solved by MapReduce and cannot be solved by PIG? In which kind of scenarios MR jobs will be more useful than PIG?
Let us take a scenario where we want to count the population in two cities. I have a data set and a sensor list of different cities. I want to count the population using one MapReduce job for two cities. Let us assume that one is Bangalore and the other is Noida. So I need to make the key of Bangalore city similar to Noida, through which I can bring the population data of these two cities to one reducer. The idea behind this is that somehow I have to instruct the MapReduce program: whenever you find a city with the name ‘Bangalore‘ and a city with the name ‘Noida’, create an alias name which will be the common name for these two cities, so that you create a common key for both cities and it gets passed to the same reducer. For this, we have to write a custom partitioner.
In MapReduce, when you create a ‘key’ for a city, you have to consider ‘city’ as the key. So, whenever the framework comes across a different city, it considers it as a different key. Hence, we need to use a customized partitioner. There is a provision in MapReduce only, where you can write your custom partitioner and specify: if city = bangalore or noida, then pass a similar hashcode. However, we cannot create a custom partitioner in Pig. As Pig is not a framework, we cannot direct the execution engine to customize the partitioner. In such scenarios, MapReduce works better than Pig.
Does Pig give any warning when there is a type mismatch or missing field?
No, Pig will not show any warning if there is a missing field or a type mismatch, which makes such problems difficult to find in the log file. If any mismatch is found, Pig assumes a null value.
What does co-group do in Pig?
Co-group joins the data sets by grouping each data set on a particular field. It groups the elements by their common field and then returns a set of records containing two separate bags. The first bag consists of the records of the first data set that share the common field, and the second bag consists of the records of the second data set that share the common field.
Can we say cogroup is a group of more than 1 data set?
Cogroup can group a single data set. But in the case of more than one data set, cogroup will group all the data sets and join them based on the common field. Hence, we can say that cogroup is a group of more than one data set, and a join of those data sets as well.
What does FOREACH do?
FOREACH is used to apply transformations to the data and to generate new data items. The name itself is indicating that for each element of a data bag, the respective action will be performed.
In a FOREACH ... GENERATE statement, the expressions mentioned after GENERATE are applied to each record of the data bag.
What is bag?
A bag is one of the data models present in Pig. It is an unordered collection of tuples with possible duplicates. Bags are used to store collections while grouping. A bag can grow to the size of the local disk, which means its size is limited. When memory fills up, Pig will spill the bag to the local disk and keep only part of the bag in memory. There is no necessity that the complete bag should fit into memory. We represent bags with “{}”.
What is Hive? Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems. Hive was originally developed at Facebook. It’s now a Hadoop subproject with many contributors. Users need to concentrate only on the top-level Hive language rather than Java MapReduce programs. One of the main advantages of Hive is its SQLish nature, which raises its usability considerably. A Hive program is automatically compiled into MapReduce jobs executed on Hadoop. In addition, HiveQL supports custom MapReduce scripts to be plugged into queries.
Hive example: selecting the employee names whose salary is more than 100 dollars from a Hive table called tbl_employee. SELECT employee_name FROM tbl_employee WHERE salary > 100; Users are excited to use Hive since it is very similar to SQL. What are the types of tables in Hive? There are two types of tables: 1. Managed tables. 2. External tables.
Only the drop table command differentiates managed and external tables. Otherwise, both types of tables are very similar.
Does Hive support record-level insert, delete or update? Hive does not provide record-level update, insert, or delete. Hence, Hive does not provide transactions either. However, users can use CASE statements and built-in functions of Hive to achieve the effect of those DML operations. Thus, a complex update query in an RDBMS may need many lines of code in Hive, as sketched below.
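A minimal sketch of that workaround, assuming a hypothetical users table where one email address needs correcting (the "update" is expressed by rewriting the whole table with INSERT OVERWRITE plus a CASE expression):

-- hypothetical table and values; rewrite every row, substituting the corrected email for id 42
INSERT OVERWRITE TABLE users
SELECT id,
       name,
       CASE WHEN id = 42 THEN 'corrected@example.com' ELSE email END AS email
FROM users;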
What kind of data warehouse application is suitable for Hive? Hive is not a full database. The design constraints and limitations of Hadoop and HDFS impose limits on what Hive can do. Hive is most suited for data warehouse applications, where 1) relatively static data is analyzed, 2) fast response times are not required, and 3) the data is not changing rapidly. Hive doesn’t provide crucial features required for OLTP (Online Transaction Processing); it’s closer to being an OLAP (Online Analytic Processing) tool. So, Hive is best suited for data warehouse applications, where a large data set is maintained and mined for insights, reports, etc.
How can the columns of a table in Hive be written to a file? By using the awk command in the shell, the output from HiveQL (DESCRIBE) can be written to a file: hive -S -e "describe table_name;" | awk -F" " '{print $1}' > ~/output.
CONCAT function in Hive with example? The CONCAT function concatenates the input strings. You can specify any number of strings separated by commas.
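A likely example, reconstructed to match the CONCAT_WS comparison that follows:

Example: CONCAT('Hive','-','performs','-','good','-','in','-','Hadoop');
Output: Hive-performs-good-in-Hadoop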
In the example above, you have to supply the delimiter '-' between every pair of strings. If the delimiter is common for all the strings, Hive provides another command, CONCAT_WS, where you specify the delimiter first.
CONCAT_WS('-','Hive','performs','good','in','Hadoop');
Output: Hive-performs-good-in-Hadoop
REPEAT function in Hive with example? The REPEAT function repeats the input string the number of times (n) specified in the command.
Example: REPEAT('Hadoop',3);
Output: HadoopHadoopHadoop.
Note: You can add a space with the input string also.
TRIM function in Hive with example? The TRIM function removes the leading and trailing spaces from a string.
Example: TRIM(' Hadoop ');
Output: Hadoop.
Note: If you want to remove only leading or trailing spaces then you can use the below commands respectively. LTRIM(' Hadoop'); RTRIM('Hadoop ');
REVERSE function in Hive with example? REVERSE function will reverse the characters in a string.
Example: REVERSE('Hadoop');
Output: poodaH
LOWER or LCASE function in Hive with example? LOWER or LCASE function will convert the input string to lower case characters.
Example: LOWER('Hadoop'); LCASE('Hadoop');
Output: hadoop
Note: If the characters are already in lower case then they will be preserved.
UPPER or UCASE function in Hive with example? UPPER or UCASE function will convert the input string to upper case characters.
Example: UPPER('Hadoop'); UCASE('Hadoop');
Output: HADOOP
Note: If the characters are already in upper case then they will be preserved.
Double type in Hive – Important points? It is important to know about the double type in Hive. The double type in Hive presents data differently than an RDBMS. See the double type data below: 24624.0 32556.0 3.99893E5 4366.0
E5 represents 10^5 here, so the value 3.99893E5 represents 399893. All the calculations will be accurately performed using the double type. The maximum value for an IEEE 754 double is about 1.8E308. This is crucial while exporting double-type data to any RDBMS, since the value may be wrongly interpreted. So, it is advised to cast the double type into an appropriate type before exporting, for example:
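A sketch of that cast, using made-up table and column names:

SELECT CAST(page_bytes AS BIGINT) FROM traffic_summary;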
Rename a table in Hive – How to do it? Using ALTER command, we can rename a table in Hive. ALTER TABLE hive_table_name RENAME TO new_name;
There is another way to rename a table in Hive. Sometimes, ALTER may take more time if the underlying table has more partitions/functions. In that case, Import and export options can be utilized. Here you are saving the hive data into HDFS and importing back to new table like below. EXPORT TABLE tbl_name TO 'HDFS_location'; IMPORT TABLE new_tbl_name FROM 'HDFS_location';
If you prefer to just preserve the data, you can create a new table from old table like below. CREATE TABLE new_tbl_name AS SELECT * FROM old_tbl_name; DROP TABLE old_tbl_name;
How to change a column data type in Hive? ALTER TABLE table_name CHANGE column_name column_name new_datatype; Example: If you want to change the data type of ID column from integer to bigint in a table called employee. ALTER TABLE employee CHANGE id id BIGINT;
Difference between ORDER BY and SORT BY in Hive? SORT BY will sort the data within each reducer; you can use any number of reducers for a SORT BY operation. ORDER BY will sort all of the data together, which has to pass through one reducer; thus, ORDER BY in Hive uses a single reducer. ORDER BY guarantees total order in the output while SORT BY only guarantees ordering of the rows within a reducer. If there is more than one reducer, SORT BY may give a partially ordered final result, as illustrated below.
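For illustration, using the tbl_employee table from the earlier example (column name is assumed):

SELECT * FROM tbl_employee ORDER BY salary;  -- total order, funneled through a single reducer
SELECT * FROM tbl_employee SORT BY salary;   -- ordered within each reducer only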
RLIKE in Hive? RLIKE (regular-expression LIKE) is a special function in Hive: if any substring of A matches the Java regular expression B, then A RLIKE B evaluates to true. Users don't need to put the % symbol for a simple match in RLIKE.
Moreover, RLIKE comes in handy when the string has some extra spaces: without using the TRIM function, RLIKE still satisfies the required scenario. Suppose A has the value 'Express  ' (with 2 additional spaces) and B has the value 'Express'; RLIKE will work better without using TRIM. 'Express  ' RLIKE 'Express' --> True