Times should show up in your local timezone

Wednesday, August 4

15:00-16:00

Opening Keynote: Community update + The State of Apache Beam

by Brittany Hermann, Austin Bennett, Pablo Estrada & Matthias Baetens

Brittany and Austin will provide an update on Apache Beam from the perspective of the community, its growth and initiatives. Pablo will provide an update on the current state of Apache Beam, including where and how it is being used.

16:00-16:25

You belong together - detecting linked accounts at Ricardo

by Tobias Kaymak

This talk is about the learnings along the way from someone who started out using Beam with the Java SDK and fell in love with the Python one.

16:30-16:55

Scaling machine learning to millions of users with Apache Beam

by Tatiana Al-Chueyr Martins

Apache Beam is a critical technology in delivering millions of personalised recommendations to the BBC audience daily. The journey to adopt the technology, however, wasn’t the smoothest. The objective of this talk is to save others time and money. This talk will discuss: why Beam; the first pipeline, which allowed us to go from a machine learning prototype to production; issues faced with the first approach; solutions embraced to handle problems; and the current pipeline design and cost gains. This talk will focus on using the Python SDK and the Dataflow runner.

17:00-17:45

TPC-DS and Apache Beam - the time has come!

by Alexey Romanenko & Ismael Mejía

TPC-DS is the de facto SQL-based benchmark framework used to measure database systems and Big Data processing frameworks. Beam introduced an early TPC-DS implementation last year, but so far we have not started to use it to measure Beam’s performance.

17:50-18:00

From factory to cloud: The Path to Beam

by Dan H

This talk will cover our use case of Apache Beam in the Industrial IoT space. Primarily, it will cover the changes we’ve implemented this past year to acquire time series data from factories and how Apache Beam plays a role in our stream and batch processing needs.

18:00-18:45

Profiling Apache Beam Python pipelines

by Israel Herraiz

In this talk, we will explore how to profile Apache Beam Python pipelines to identify potential bottlenecks in our code.

18:00-20:00

Workshop: Build a Unified Batch and Streaming Pipeline with Apache Beam on AWS

by Gandhi Swaminathan & Steffen Hausmann

In this workshop, you will explore an end-to-end example that combines batch and streaming aspects in one uniform Beam pipeline. Check out the prerequisites and additional registration.

20:00-20:25

Deduplication: Where Beam Fits In

by Jeff Klukas

This session will start with a brief overview of the problem of duplicate records and the different options available for handling them. We’ll then explore two concrete approaches to deduplication within a Beam streaming pipeline implemented in Mozilla’s open source codebase [0] for ingesting telemetry data from Firefox clients. We’ll compare the robustness, performance, and operational experience of using the deduplication built in to PubsubIO vs. storing IDs in an external Redis cluster and why Mozilla switched from one approach to the other.

20:30-20:55

Fault Tolerant Integration of Apache Beam With Relational Database

by Savitha Jayasankar & Piaw Na

We will share a use case at Niantic where we used Postgres as a time-series database to store metrics information from Apache Beam workflows.

21:00-21:50

Relational Beam: Automatically optimize your pipeline

by Andrew Pilloud

Wouldn’t it be great if Beam could automatically optimize your pipeline? It can with Relational Beam. This talk will cover the current Relational offerings in Beam (with a focus on Java) including Schemas, SchemaIO, and SQL. Learn how SQL can automatically optimize your pipeline today and about our plans to move that functionality into the core Beam SDK. You’ll come away from this talk with an understanding of why you want to migrate to Schemas (and SchemaIO) even if you don’t use SQL.

Thursday, August 5

15:00-15:50

Simple Distributed Raytracer with the Beam Go SDK

by Robert Burke

A demonstration of the Beam Go SDK, with a raytracer as the motivating example.

15:50-16:00

Ingesting and Processing Level 3 Order Book Streams

by Daniel Tyreus

Use case of Beam to ingest and process billions of order book messages for cryptocurrency.

16:30-16:55

GCP Dataflow Architecture

by Svetak Sundhar

Overview of the architecture for the Dataflow Runner of Apache Beam.

17:00-17:50

How to handle duplicate data in streaming pipelines using Dataflow and Pub/Sub

by Zeeshan

How to handle duplicate data in streaming pipelines using Dataflow and Pub/Sub.

17:00-18:00

Workshop: Step by step development of a streaming pipeline using Scio (Scala)

by Israel Herraiz

This workshop requires additional registration and has limited capacity. See details.

18:00-18:50

Multi-language Pipelines for improving usability and reducing overheads

by Chamikara Jayalath

Apache Beam offers a novel and powerful framework named Multi-Language Pipelines that allows you to execute pipelines that employ multiple supported SDK languages.

18:00-20:00

Workshop: Create your first Dataflow Flex template and set up a CI/CD pipeline for it on Cloud Build

by Miren Esnaola

In-depth workshop where we will explain how to use Flex templates for testing and CI/CD of Beam data pipelines. This workshop requires additional registration and has limited capacity. See details.

20:00-20:25

Image classification with Beam and AutoML

by David Cavazos

Walkthrough of a sample based on a real ML use case: dealing with large unbalanced datasets, lazily preprocessing only the data used for training, and orchestrating the workflow in Beam. Inspired by: https://wildlifeinsights.org

20:30-20:55

Using Beam for Real-time Manufacturing Data Analysis

by Jeswanth Yadagani

This is an application talk targeted at users of Apache Beam to illustrate how a combination of stateless, stateful, and windowed streaming transformations can be used to support arbitrarily complex real-time analysis of manufacturing time-series data. At Oden, we are focused on the ingest and analysis of data from connected manufacturing equipment, context from manufacturing execution systems, and input from operators on the manufacturing factory floor. We run several real-time analytics using ML models deployed on Apache Beam to provide access to patterns, alerts, and process optimization insights to end-users.

21:00-21:25

Lessons learned from using Dataflow for local ML batch inference

by Ramtin Rassoli

At BenchSci, we mine the world’s biological research papers with the aim of extracting information that will accelerate future pharmaceutical research programs by enabling more reproducible experiments. Machine Learning and specifically our in-house Deep Learning models play an important role in extracting these key pieces of information and organizing the knowledge into a meaningful and easy-to-use structure. While we have the luxury of processing this information in batch, the size and number of our models along with the size of the input data eventually outgrew our on-premise model serving infrastructure.

21:30-21:55

Autoscaling your transforms with auto-sharded GroupIntoBatches

by Pablo Estrada

Big data systems have implemented the ability to scale up from the cluster perspective: Add more workers, and parallelize further. An issue with this is that many transforms still work with a fixed number of shards that can’t scale up or down with the cluster. We recently added this capability to Cloud Dataflow, and we want to share our experiments and what we learned from this - and how you can take advantage of it.

Friday, August 6

15:00-15:45

ML Inference at scale, easy as learning your 5 times table, with tfx-dsl and Apache Beam!

by Reza Rokni & Robert Crowe

In this talk, we will make use of the RunInference transform from the tfx-dsl library to build several inference pipelines, from single models to multiple models in chained or branched configurations.

15:50-16:00

Improve the usability of Apache Beam

by Mariann Nagy

In this talk we will share existing user feedback that we’ve gathered from Apache Beam users. The rough structure of the lightning talk is: we take user feedback seriously; what existing users like and dislike about Apache Beam; what we’re currently doing to make improvements; and how the community can help us build the best SDK ever.

16:00-16:25

Real-time change data in Apache Beam with DebeziumIO

by Pablo Estrada

In this session, we will talk about DebeziumIO, which is a new transform that allows us to read change streams from various databases into Beam by relying on Debezium. The great thing about DebeziumIO is that we can run the Debezium connector within our Beam Pipeline, without having to depend on Kafka for infrastructure.

16:30-16:55

Scalable Predictions of Deep Learning models with Apache Beam

by Jo Pu & Hannes Hapke

With the rise of deep learning applications come questions of how to integrate larger machine learning models (e.g. transformer models) into Apache Beam data pipelines. At Digits, we took a deep dive into optimal integrations of our deep learning models to be consumed efficiently with Apache Beam and Google Cloud’s Dataflow. In this presentation, we will walk the audience through how we evaluated the various options for serving Deep Learning models on Beam with Dataflow, the architecture of our production model pipelines, and some lessons we learned while optimizing inference performance and maximizing pipeline throughput.

16:00-17:00

Workshop: Privacy on Beam - E2E Differential Privacy Solution for Apache Beam

by Mirac Vuslat Basaran

This workshop requires additional registration and has limited capacity. See details.

17:00-17:25

Large scale streaming infrastructure using Apache Beam and Dataflow

by Talat Uyarer

Cortex Data Lake collects, transforms and integrates your enterprise’s security data to enable Palo Alto Networks solutions. We build streaming infrastructure for our customers. I will share our architecture and experience while building that infrastructure.

17:30-17:55

How to build streaming data pipelines with Google Cloud Dataflow and Confluent Cloud

by Elena Cuevas

We will demonstrate how easy it is to use Confluent Cloud as the data source of your Beam pipelines. You will learn how to process the information that comes from Confluent Cloud in real time, make transformations on such information and feed it back to your Kafka topics and other parts of your architecture.

18:00-18:50

Leveraging Beam's Batch-Mode for Robust Recoveries and Late-Data Processing of Streaming Pipelines

by Devon Peticolas

This will be an application talk targeted at users or potential users of Apache Beam for real-time streaming applications. It will show how to write a Beam application deployable as both a streaming and batch job, and how to leverage that batch deployment for robust batch recoveries and late-data processing for your streaming application. At Oden, we focus on the ingest and analysis of data from connected manufacturing equipment, context from manufacturing execution systems, and input from operators on the manufacturing factory floor.

17:00-19:00

Workshop: Visually build Beam pipelines using Apache Hop

by Matt Casters

This workshop requires additional registration and has limited capacity. See details.
