How to upload a dataset into HIVE and run queries against it on Google Cloud Platform.
Dataset: movies_few.csv (See Assignment 4 or D2L for the location of the folder with this and other files.)
Create a folder named data in your bucket to upload your input dataset.
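If you prefer the command line over the Cloud Console, the upload can also be done with gsutil from Cloud Shell (a sketch, assuming the example bucket name cs588-a4-demo used later in this guide; GCS has no real folders, so copying into the data/ prefix creates it):

```shell
# Copy the input dataset into the data/ prefix of the bucket.
# Replace cs588-a4-demo with the name of your own bucket.
gsutil cp movies_few.csv gs://cs588-a4-demo/data/
```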
Create a cluster with one master and three worker nodes in Dataproc.
In the Set up cluster section, fill in the following information.
In the Configure nodes section, choose one master node and three worker nodes.
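Instead of clicking through the console, the same cluster can be created from Cloud Shell with gcloud; a sketch, where the cluster name my-cluster and the region us-west1 are assumptions (pick your own):

```shell
# Create a Dataproc cluster with 1 master and 3 worker nodes.
gcloud dataproc clusters create my-cluster \
    --region=us-west1 \
    --num-masters=1 \
    --num-workers=3
```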
Note - Make sure to replace the bucket name below (cs588-a4-demo) with the name of the bucket that you created.
gsutil ls gs://cs588-a4-demo/data/
Note: Here, myusername is your ODIN username, since you are logged in to GCP with your pdx.edu email account. Replace clustername with the name of the cluster that you created. ([clustername]-m is always the name of the master VM in a cluster.)
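One way to reach the master VM is to SSH into it from Cloud Shell (a sketch; the zone is an assumption and must match the zone your cluster was created in):

```shell
# SSH into the master VM of the cluster ([clustername]-m).
gcloud compute ssh [clustername]-m --zone=us-west1-a
```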
beeline -u jdbc:hive2://localhost:10000/default -n [myusername@clustername-m] -d org.apache.hive.jdbc.HiveDriver
An example of its usage is given below.
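For instance, with a hypothetical ODIN username jsmith and a cluster named demo-cluster, the command would read:

```shell
beeline -u jdbc:hive2://localhost:10000/default -n jsmith@demo-cluster-m -d org.apache.hive.jdbc.HiveDriver
```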
At this point, you are interacting with the HIVE terminal and can issue HIVE commands.
CREATE EXTERNAL TABLE movies (
    film VARCHAR(100),
    genre ARRAY<VARCHAR(100)>,
    lead_studio VARCHAR(100),
    audience_score_percent BIGINT,
    profitability_percent BIGINT,
    rotten_tomatoes_percent BIGINT,
    world_wide_gross MAP<STRING,FLOAT>,
    year BIGINT)
ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    COLLECTION ITEMS TERMINATED BY '|'
    MAP KEYS TERMINATED BY ':'
LOCATION 'gs://cs588-a4-demo/data/';
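Given these delimiters, a line of movies_few.csv would be expected to look like the following (illustrative values only, not an actual row from the file): fields are separated by commas, the items of the genre array and the entries of the world_wide_gross map by '|', and each map key from its value by ':'.

```text
Some Film,Comedy|Romance,Some Studio,70,120,64,US:41.94|UK:10.2,2008
```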
Note - It might take a while for the queries to execute, as they have to be dispatched to the cluster.
0: jdbc:hive2://localhost:10000/default> SELECT * FROM movies LIMIT 10;
0: jdbc:hive2://localhost:10000/default> SELECT COUNT(*) FROM movies;
0: jdbc:hive2://localhost:10000/default> SELECT film, genre FROM movies WHERE rotten_tomatoes_percent > 80;
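Individual elements of the complex-typed columns can also be addressed directly in HIVE: array elements by index and map values by key. A sketch (the map key 'US' is an assumption about what movies_few.csv actually contains):

```sql
SELECT film, genre[0], world_wide_gross['US'] FROM movies LIMIT 10;
```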
SELECT film, country, gross FROM movies LATERAL VIEW EXPLODE(world_wide_gross) gross_table AS country, gross WHERE gross > 200;
SELECT A.film AS film, A.country AS country, B.max_gross AS gross
FROM (SELECT film, country, gross
      FROM movies LATERAL VIEW EXPLODE(world_wide_gross) gross_table1 AS country, gross) A
INNER JOIN (SELECT max(gross) AS max_gross
            FROM movies LATERAL VIEW EXPLODE(world_wide_gross) gross_table AS country, gross
            GROUP BY country) B
ON A.gross = B.max_gross;
NOTE: You might need to write 'as A' and 'as B' instead of only 'A' and 'B' when providing the aliases for the subqueries.
SELECT genre_item, collect_set(film) as film_list FROM movies LATERAL VIEW EXPLODE(genre) genre_table as genre_item WHERE genre_item in ('Adventure', 'Drama') GROUP BY genre_item;
As always, do not forget to delete the cluster, the bucket and the project when you complete the assignment.