How to upload a dataset into HIVE and run queries against it on Google Cloud Platform.
Dataset: movies_few.csv (See Assignment 4 or D2L for the location of the folder with this and other files.)
Create a folder named data in your bucket to upload your input dataset.
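If you prefer the command line over the Cloud Console, the upload can also be done with gsutil from Cloud Shell (a sketch, assuming the example bucket name cs588-a4-demo used later in this guide; GCS has no real folders, so copying into the data/ prefix creates it):

```shell
# Copy the input dataset into the data/ prefix of the bucket.
# Replace cs588-a4-demo with the name of your own bucket.
gsutil cp movies_few.csv gs://cs588-a4-demo/data/
```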
Create a cluster with one master and three worker nodes in Dataproc.
In the Set up cluster section, fill in the following information.
In the Configure nodes section, choose one master node and three worker nodes.
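Instead of clicking through the console, the same cluster can be created from Cloud Shell with gcloud; a sketch, where the cluster name my-cluster and the region us-west1 are assumptions (pick your own):

```shell
# Create a Dataproc cluster with 1 master and 3 worker nodes.
gcloud dataproc clusters create my-cluster \
    --region=us-west1 \
    --num-masters=1 \
    --num-workers=3
```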
Note - Make sure to replace the bucket name below (cs588-a4-demo) with the name of the bucket that you created.
gsutil ls gs://cs588-a4-demo/data/
Note: Here, myusername is your ODIN username, since you are logged in to GCP with your pdx.edu email account. Replace clustername with the name of the cluster that you created. ([clustername]-m is always the name of the master VM in a cluster.)
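One way to reach the master VM is to SSH into it from Cloud Shell (a sketch; the zone is an assumption and must match the zone your cluster was created in):

```shell
# SSH into the master VM of the cluster ([clustername]-m).
gcloud compute ssh [clustername]-m --zone=us-west1-a
```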
beeline -u jdbc:hive2://localhost:10000/default -n [myusername@clustername-m] -d org.apache.hive.jdbc.HiveDriver
An example of its usage is given below.
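For instance, with a hypothetical ODIN username jsmith and a cluster named demo-cluster, the command would read:

```shell
beeline -u jdbc:hive2://localhost:10000/default -n jsmith@demo-cluster-m -d org.apache.hive.jdbc.HiveDriver
```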
At this point, you are interacting with the HIVE terminal and can issue HIVE commands.
CREATE EXTERNAL TABLE movies (
    film VARCHAR(100),
    genre ARRAY<VARCHAR(100)>,
    lead_studio VARCHAR(100),
    audience_score_percent BIGINT,
    profitability_percent BIGINT,
    rotten_tomatoes_percent BIGINT,
    world_wide_gross MAP<STRING,FLOAT>,
    year BIGINT)
ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    COLLECTION ITEMS TERMINATED BY '|'
    MAP KEYS TERMINATED BY ':'
LOCATION 'gs://cs588-a4-demo/data/';
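Given these delimiters, a line of movies_few.csv would be expected to look like the following (illustrative values only, not an actual row from the file): fields are separated by commas, the items of the genre array and the entries of the world_wide_gross map by '|', and each map key from its value by ':'.

```text
Some Film,Comedy|Romance,Some Studio,70,120,64,US:41.94|UK:10.2,2008
```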
Note - It might take a while for the queries to execute, as they have to be dispatched to the cluster.
0: jdbc:hive2://localhost:10000/default> SELECT * FROM movies LIMIT 10;
0: jdbc:hive2://localhost:10000/default> SELECT COUNT(*) FROM movies;
0: jdbc:hive2://localhost:10000/default> SELECT film, genre FROM movies WHERE rotten_tomatoes_percent > 80;
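Individual elements of the complex-typed columns can also be addressed directly in HIVE: array elements by index and map values by key. A sketch (the map key 'US' is an assumption about what movies_few.csv actually contains):

```sql
SELECT film, genre[0], world_wide_gross['US'] FROM movies LIMIT 10;
```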
SELECT film, country, gross FROM movies LATERAL VIEW EXPLODE(world_wide_gross) gross_table AS country, gross WHERE gross > 200;
SELECT A.film AS film, A.country AS country, B.max_gross AS gross
FROM (SELECT film, country, gross
      FROM movies LATERAL VIEW EXPLODE(world_wide_gross) gross_table1 AS country, gross) A
INNER JOIN (SELECT max(gross) AS max_gross
            FROM movies LATERAL VIEW EXPLODE(world_wide_gross) gross_table AS country, gross
            GROUP BY country) B
ON A.gross = B.max_gross;
NOTE: You might need to write 'as A' and 'as B' instead of only 'A' and 'B' when providing the aliases for the subqueries.
SELECT genre_item, collect_set(film) as film_list FROM movies LATERAL VIEW EXPLODE(genre) genre_table as genre_item WHERE genre_item in ('Adventure', 'Drama') GROUP BY genre_item;
As always, do not forget to delete the cluster, the bucket and the project when you complete the assignment.