BigQuery is a serverless data warehouse used for analytic processing of large amounts of data at low cost. In this codelab, you will import some data into BigQuery and use a number of its interfaces to perform queries.

Begin by enabling the service and API. First, list all of the services available to enable via:

gcloud services list --available

Then, enable the BigQuery API as well as the BigQuery Connection API. These will be used to access the service programmatically.

gcloud services enable ...

Go to the web console and visit BigQuery. In the left-hand menu, scroll down to "Resources", select your project, then click on "CREATE DATASET". Name your dataset yob.

Go back to the interface and select the new dataset.

In Cloud Shell, copy a file containing the name, gender, and count of newborn babies in 2014 from a storage bucket.

gsutil cp gs://cloud-training/CP100/Lab12/yob2014.txt ./

Then, examine the file's format and determine the data type of each field; you will need these types when creating a table from the file. Use wc to determine how many names the file includes.

head -3 yob2014.txt
wc -l yob2014.txt

Go back to the BigQuery console, click on the dataset, and create a table from Cloud Storage. Specify the input file location as the storage bucket URI, then select the CSV format (comma-separated values). Under "Destination", name the table baby_names. Then, under "Schema", add fields that correspond to the columns in the file, including their types (in this case string or integer), as shown in the reference below. Create the table.
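
For reference, a plausible schema for the file's three columns is sketched below. The field names are just one choice, not required names; num is used for the count column to sidestep any clash with the SQL COUNT keyword. Any identifiers will work as long as the types match the data.

name: STRING
gender: STRING
num: INTEGER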

Click on the table created within the dataset, then click the "Preview" tab to validate that the table has been created properly.

Then, click on the "Details" tab.

BigQuery supports multiple methods for querying the data. We'll go through several of them.

Via web console

In the Query editor for your table, run a query that lists the 10 most popular female names in 2014. Table names must be escaped with back-ticks in the UI. Notice that when given a valid query, the validator shows a green checkmark along with an estimate of how much data the query will scan when run. It is important to examine this estimate for every query so you can eventually learn to optimize costs. A sketch of one possible query follows.
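
This sketch assumes you named the schema fields name, gender, and num as suggested earlier; adjust the identifiers to match your own schema.

SELECT name, num
FROM `yob.baby_names`
WHERE gender = 'F'
ORDER BY num DESC
LIMIT 10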

Run the query, then hide the editor to see your query results.

Via command-line bq command

BigQuery can also be queried from Cloud Shell using the SDK's bq command. Bring up a Cloud Shell session and run a query to get the 10 least popular boys' names in 2014. Note that the tool's default legacy SQL requires the table name to be delimited by square brackets, with a colon and a period separating the project name, dataset name, and table name, as shown below.
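
A sketch of such a command, under the same schema assumptions as before (substitute your own project ID for <project-id>):

bq query "SELECT name, num FROM [<project-id>:yob.baby_names] WHERE gender = 'M' ORDER BY num ASC LIMIT 10"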

Via interactive bq session

bq can be run interactively in Cloud Shell:

bq shell

At the prompt, you can then enter your query. Run a query to find the 10 most popular male names in 2014.
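
The shell presents a prompt named after your project, and queries are typed directly at it (legacy bracket syntax again). A sketch, under the same schema assumptions as before:

SELECT name, num FROM [<project-id>:yob.baby_names] WHERE gender = 'M' ORDER BY num DESC LIMIT 10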

Finally, run a query on your name. How popular was it?

One of the more common service combinations for performing data science is pairing the backend data warehouse (BigQuery) with an interactive Python- or R-based Jupyter notebook. On GCP, the managed Jupyter notebook service is called AI Platform Notebooks. In this lab, we'll use these tools to perform a simple analysis of the birth weight of babies born in particular years.

First, begin by enabling the Notebooks API:

gcloud services enable notebooks.googleapis.com

AI Platform Notebooks wraps the configuration and startup of a Compute Engine instance that hosts an interactive Jupyter notebook that can be used to ingest data from BigQuery, process it, and generate data products from it. Run the gcloud command below to bring up an instance.

gcloud beta notebooks instances create bq-jupyter-instance \
  --vm-image-project=deeplearning-platform-release \
  --vm-image-family=tf2-2-2-cpu \
  --machine-type=n1-standard-1 \
  --location=us-west1-b

Then, in the console, visit "AI Platform" => "Notebooks".

Click on "Open JupyterLab" when the instance comes up.

Create a Python 3 notebook and leave it open.

Visit the BigQuery console to compose a new query.

Enter a query that dumps the entire table we'll be querying. Do not run the query; instead, note the amount of data the query would scan, shown in the bottom right-hand corner. This is the size of the table.

SELECT * FROM `bigquery-public-data.samples.natality`

We'll be running a query to obtain data on birthweight from a publicly available natality dataset hosted on the platform.

Modify the <FMI> in the WHERE clause of the query template below with a SQL conjunction so that the query returns the number of babies born between 2001 and 2003 and their average weight, grouped by plurality (single, twins, triplets, and so on).

SELECT
  plurality,
  COUNT(1) AS num_babies,
  AVG(weight_pounds) AS avg_wt
FROM
  `bigquery-public-data.samples.natality`
WHERE
  year <FMI>
GROUP BY
  plurality

Answer the following question for your lab notebook:

Go back to your notebook. We will now repeat the query in Python. Enter the query from the previous step into a cell to construct the query string.

query_string = """
SELECT
  plurality,
  COUNT(1) AS num_babies,
  etc...
"""

Then click on the play button to execute the cell and set the query-string variable.

Then, in a single cell, enter in code that imports the BigQuery Python package, creates a BigQuery client, issues the query, obtains the response as a Pandas data frame, and outputs the results. Click on the play button to execute the code.

from google.cloud import bigquery

# Run the query and fetch the results as a Pandas DataFrame
df = bigquery.Client().query(query_string).to_dataframe()
df.head()

The results should match your prior results in BigQuery. Using the plotting facility of Pandas, create a plot that visualizes the average weight of a baby given its plurality by placing the following code in a cell and executing it.

df.plot(x='plurality', y='avg_wt', kind='scatter')

The plot in the previous step shows a marked correlation between birthweight and plurality. If we were attempting to predict the birth weight of a baby that is about to be born using a machine learning model, the plurality would be a good feature to include.

What other features might be predictive of birth weight? In the next steps, we'll be exploring the answer to this question. Go back to your notebook and run the following code in a cell.

query_string = """
SELECT
  weight_pounds,
  is_male,
  mother_age,
  plurality,
  gestation_weeks
FROM
  publicdata.samples.natality
WHERE year > 2000
"""

from google.cloud import bigquery

# Append LIMIT 100 so only a sample of rows is returned for preview
df = bigquery.Client().query(query_string + " LIMIT 100").to_dataframe()
df.head()

The query produces all of the babies born after 2000, along with their weight, gender, mother's age, plurality, and gestation time. We have seen that plurality is highly correlated with birth weight, but what about the other attributes? To answer this question, we'll use the function get_distinct_values(column_name) below, which parameterizes the initial query, substituting the column_name argument in place of plurality. Examine the code, then paste it into a cell in your notebook and hit play.

def get_distinct_values(column_name):
  # Count babies and average their birth weight, grouped by the given column
  query_string = f"""
SELECT
  {column_name},
  COUNT(1) AS num_babies,
  AVG(weight_pounds) AS avg_wt
FROM
  publicdata.samples.natality
WHERE
  year > 2000
GROUP BY
  {column_name}
  """
  # Return the results as a DataFrame sorted by the grouping column
  return bigquery.Client().query(query_string).to_dataframe().sort_values(column_name)

We'll now run our queries and view their plots. First, re-run the plurality query using the function, but generate a bar graph instead.

df = get_distinct_values('plurality')
df.plot(x='plurality', y='avg_wt', kind='bar')

Then, run the query using gender:

df = get_distinct_values('is_male')
df.plot(x='is_male', y='avg_wt', kind='bar')

Then, run the query using gestation time:

df = get_distinct_values('gestation_weeks')
df.plot(x='gestation_weeks', y='avg_wt', kind='bar')

Finally, run the query using the mother's age:

df = get_distinct_values('mother_age')
df.plot(x='mother_age', y='avg_wt', kind='bar')

In examining the plots, which two features are the strongest predictors for a newborn baby's weight?

Google hosts a number of COVID-19 datasets that we can query and generate information from. One is the Community Mobility dataset, which measures the impact lockdown orders had in various locales. To begin, go to the BigQuery console, click on "Resources", expand the bigquery-public-data drop-down, find Google's mobility dataset and expand it (covid19_google_mobility.mobility_report).

Find the link that documents what the dataset measures and answer the following question:

Then, run the query below:

SELECT *
FROM `bigquery-public-data.covid19_google_mobility.mobility_report`
WHERE sub_region_1 = 'Oregon'
  AND sub_region_2 = 'Multnomah County'
  AND date BETWEEN '2020-03-01' AND '2020-03-31'
ORDER BY date

Another available dataset measures vehicle traffic changes. Go to the BigQuery console, click on "Resources", expand the bigquery-public-data drop-down, find the mobility impact dataset for the US and expand it (covid19_geotab_mobility_impact). Then click on the airport_traffic table. Find the column in this table that gives information on the traffic impact.

Then, adapt the query below to find the following:

SELECT
  airport_name,
  AVG( ... ) AS traffic_fraction
FROM
  `bigquery-public-data.covid19_geotab_mobility_impact.airport_traffic`
WHERE
  country_name = 'United States of America (the)'
  AND EXTRACT(MONTH from date) = 4
GROUP BY
  airport_name
ORDER BY
  traffic_fraction

Go back to the BigQuery console, click on "Resources", expand the bigquery-public-data drop-down, find the New York Times COVID-19 dataset (covid19_nyt), and expand it.

There are four tables within the dataset. Click on each and view its schema to see the columns it has. Find the following for the subsequent queries you will need to run:

Go back to your Jupyter notebook.

Confirmed cases in Oregon

Within the notebook, perform the following query and plot the results using a line graph. The query pulls out the confirmed cases of COVID-19 across the state of Oregon by date.

SELECT date, confirmed_cases
FROM `bigquery-public-data.covid19_nyt.us_states`
WHERE state_name = 'Oregon'
ORDER BY date ASC

The plot should have the date on the x-axis in ascending order and the confirmed cases on the y-axis. One way to produce it is sketched below.
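
This sketch reuses the notebook pattern from the earlier cells; the column names come directly from the query:

query_string = """
SELECT date, confirmed_cases
FROM `bigquery-public-data.covid19_nyt.us_states`
WHERE state_name = 'Oregon'
ORDER BY date ASC
"""

from google.cloud import bigquery
df = bigquery.Client().query(query_string).to_dataframe()

# Line plot of confirmed cases over time
df.plot(x='date', y='confirmed_cases', kind='line')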


Date when states reached 1000 deaths

One of the macabre benchmarks for states is the day on which total deaths attributable to COVID-19 reached 1000 across the state. Run the following query in your lab notebook. It selects each state name and the earliest date on which the number of deaths in the state exceeded 1000.

SELECT state_name, MIN(date) as date_of_1000
FROM `bigquery-public-data.covid19_nyt.us_states`
WHERE deaths > 1000
GROUP BY state_name
ORDER BY date_of_1000 ASC

Mask usage per county

Surveys on mask usage may be an indicator of how severely the pandemic will hit particular counties. Run the following query in your lab notebook; it ranks counties in the US by the fraction of survey respondents who report always wearing masks.

SELECT DISTINCT mu.county_fips_code, mu.always, ct.county
FROM `bigquery-public-data.covid19_nyt.mask_use_by_county` as mu
LEFT JOIN `bigquery-public-data.covid19_nyt.us_counties` as ct
ON mu.county_fips_code = ct.county_fips_code
ORDER BY mu.always DESC

Using your Jupyter notebook, perform the following...

Deaths in Multnomah county

Construct a query string that obtains the number of deaths from COVID-19 that have occurred in Multnomah County for each day in the dataset, ensuring the data is returned in ascending order of date. Run the query and obtain the results; a possible sketch follows.
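
This sketch assumes the us_counties table records cumulative deaths per county per date, with the county name stored without the word "County":

SELECT date, deaths
FROM `bigquery-public-data.covid19_nyt.us_counties`
WHERE county = 'Multnomah' AND state_name = 'Oregon'
ORDER BY date ASC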

Deaths in Oregon

Construct a query string that obtains the total number of deaths from COVID-19 that have occurred in Oregon for each day in the dataset, ensuring the data is returned in ascending order of date. Run the query and obtain the results; a state-level sketch follows.
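
The state-level variant uses the us_states table from the earlier queries:

SELECT date, deaths
FROM `bigquery-public-data.covid19_nyt.us_states`
WHERE state_name = 'Oregon'
ORDER BY date ASC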

When you are finished, delete the notebook instance:

gcloud beta notebooks instances delete bq-jupyter-instance \
  --location us-west1-b

In the BigQuery console, delete the yob dataset.