Skip to content
This repository has been archived by the owner on Nov 26, 2021. It is now read-only.

Querying Dynamic Tables with SQL

Fabian Hueske edited this page Jun 8, 2020 · 8 revisions

This session discusses how to write SQL queries to process streaming data.

Specifically, you will learn about

  • Flink's concept of Dynamic Tables
  • Stream-Table conversions
  • Running queries on dynamic tables
  • Reasoning about and managing query state

Slides

Hands-On Exercises

These exercises will teach you how to write queries on dynamic tables.

Queries with Materializing Aggregation

The following exercises process data by continuously updating their results. Queries produce updates and/or deletions.

Ride Count per Number of Passengers

For this exercise you should compute the number of rides per number of passengers, i.e., determine how many rides happened with 1, 2, 3, ... passengers.

We are only interested in counts for rides that started in New York City.

Click here for hints.
  • Each ride is represented by two events. Filter out all end events for accurate counts.


The output should look similar to:
psgCnt         cnt
     4        6051
     2       31063
     3       10812
     6        5408
     5       11031
     1      100367
Click to see the solution.
SELECT
  psgCnt,
  COUNT(*) AS cnt
FROM Rides
WHERE isInNYC(lon, lat) AND isStart
GROUP BY
  psgCnt;

Ride Count per Area and Hour of Day

In this exercise we want to count the total number of arriving and departing rides per area for every hour of a day, i.e., the total number of rides per area that happen from 12pm to 1am, 1am to 2am, and so on. Hence, we do not want a separate count for every day but the combined counts of all days for every hour.

We are only interested in events that start or end in New York City. To limit the size of the result, return only areas with counts that include more than 60 events.

Click here for hints.
  • The query is similar to the query of the Windowed Ride Count exercise but differs in the GROUP BY clause, because it updates previous counts every day.
  • Use the built-in HOUR function to extract the hour of a timestamp.
  • Use the provided toAreaId to convert coordinates to an area id.


The output should look similar to:
area       isStart     hourOfDay        cnt
49551      false       0                85
49789       true       0                64
48806       true       0                75
50044      false       0                62
52543       true       0                67
49792      false       1                77
48559       true       1               114
48808       true       1               100
Click to see the solution.
SELECT
  toAreaId(lon, lat) AS area,
  isStart,
  HOUR(rideTime) AS hourOfDay,
  COUNT(*) AS cnt
FROM Rides
WHERE isInNYC(lon, lat)
GROUP BY
  toAreaId(lon, lat),
  isStart,
  HOUR(rideTime)
HAVING COUNT(*) > 60;

The query is a typical SELECT FROM GROUP BY query. After filtering by rides in New York City, the result is grouped by area id, isStart flag, and hour of the day (computed by the HOUR(timestamp) function). By using the HAVING clause, we only return counts that are greater than 60.