How to upload a dataset in HIVE and run queries against it in Google Cloud Platform.

Dataset: movies_few.csv (See Assignment 4 or D2L for the location of the folder with this and other files.)

  1. Log onto console.cloud.google.com
  2. Click on Products and Services icon on the top left corner of the page

  1. Scroll down and choose Cloud Storage -> Browser from the list

  1. Create a storage bucket in the google cloud storage for uploading the input dataset.

  1. Create a folder called data in your bucket to upload your input dataset.

  1. Double click on the folder to go inside it and upload your dataset to this folder

Create a cluster with one master and three worker nodes in Dataproc.

In Set up Cluster section, fill the following information

In Configure Nodes section,

  1. SSH into the master node of the cluster by clicking on the cluster and then going to the VM Instances tab.

  1. We can view the input dataset in cloud storage using the following command

Note - Make sure to replace the below bucket name (cs588-a4-demo) with your bucket name that you have created.

gsutil ls gs://cs588-a4-demo/data/

  1. Run the Beeline shell using the JDBC HIVE interface. Hive runs on localhost at port 10000. We use the Google Cloud user name at the Master Node host name in the command. The command is shown below

Note: Here myusername is basically your ODIN username since you are logged in to GCP with your pdx.edu email account. Replace the clustername with the name of the cluster that you have created. ([clustername]-m is always the name of the master VM in a cluster.

beeline -u jdbc:hive2://localhost:10000/default 
-n [myusername@clustername-m] 
-d org.apache.hive.jdbc.HiveDriver

An example of the usage is given below

At this point, you are interacting with the HIVE terminal and can issue HIVE commands.

  1. We create a table in HIVE to store the input dataset using the following syntax
CREATE EXTERNAL TABLE movies 
(film VARCHAR(100), 
genre ARRAY<VARCHAR(100)>, 
lead_studio VARCHAR(100), 
audience_score_percent BIGINT,
profitability_percent BIGINT, 
rotten_tomatoes_percent BIGINT, 
world_wide_gross map<string,float>, 
year BIGINT) 
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY ',' 
COLLECTION ITEMS TERMINATED BY '|' 
MAP KEYS TERMINATED BY ':' 
LOCATION 'gs://cs588-a4-demo/data/';

Note - It might take a while for the queries to execute, as they have to be dispatched to the cluster.

  1. Select 10 rows from the table
0: jdbc:hive2://localhost:10000/default> SELECT * FROM movies LIMIT 10; 

  1. Count the number of rows in the table
0:jdbc:hive2://localhost:10000/default> SELECT COUNT(*) FROM movies;

  1. Select films where the rotten tomatoes percentage > 80
0: jdbc:hive2://localhost:10000/default> SELECT film, genre FROM movies  where rotten_tomatoes_percent > 80;

  1. Select films whose gross is greater than 200
SELECT film, country, gross 
FROM movies 
LATERAL VIEW EXPLODE(world_wide_gross) gross_table AS 
country, gross 
WHERE gross > 200; 

  1. Select the film with the highest gross in each country
SELECT A.film as film, A.country as country, B.max_gross as gross FROM 
(SELECT film, country, gross 
FROM movies 
LATERAL VIEW EXPLODE(world_wide_gross) gross_table1 AS 
country, gross) A 
INNER JOIN 
(SELECT max(gross) as max_gross 
FROM movies 
LATERAL VIEW EXPLODE(world_wide_gross) gross_table AS 
country, gross 
GROUP BY country) B 
ON A.gross = B.max_gross; 

NOTE: You might need to write ‘as A' and ‘as B' instead of only ‘A' and ‘B' when providing alias in the subquery

  1. Return all the films in the ‘Adventure' and ‘Drama' genre
SELECT genre_item, collect_set(film) as film_list 
FROM movies 
LATERAL VIEW EXPLODE(genre) genre_table as genre_item 
WHERE genre_item in ('Adventure', 'Drama') 
GROUP BY genre_item; 

  1. To exit the HIVE terminal, enter !q

As always, do not forget to delete the cluster, the bucket and the project when you complete the assignment.