Apache Beam Cheat Sheet

Beam Transformations

  1. GroupByKey (GBK) : KV<K,V> → KV<K,Iterable<V>>
  2. CoGroupByKey (CoGBK): <KV<Id,CoGbkResult>> ; special map from tuple tag PCollection → Iterable<PCollection>

3. Combine.perKey :

3.1. Serializable Function<Iterable<T>,T>

3.2 CombineFn<InputT,AccumT,OutputT>

  • Flatten: PCollectionList<T> is created, then merged
  • Partition: returns a partition for (override) PCollectionList<T> with PartitionFn.
  • Distinct.create: PCollection<KV<Type1,Type2>>
  • Count.perKey: PCollection<KV<Type1,Long>>

Beam Transformations Code

  • Beam Transformations are written using:

<new DoFn<K,V>

Beam allows lambda functions while writing code in Java.


  • Filter without Lambda:
  • Filter with Lambda:
  • Writing your own PTransform
  • PTransform Methods:

public abstract OutputT expand(Input input)

Side Inputs: Addition inputs to a ParDo transformation

asList(): PCollectionView<T> passed to Transform<T>

PCollectionView (PCV)

  1. View.asMap: Produces PCV mapping each window to a Map<K,V>.
  • It is required that each key of the input be associated with a single value, per window.

2. View.asSingleton: Produces a PCV (from PCollection with single value per window as input) that returns the value in the main input window.

  • When read a single input:

<n keys, n values> → per windows <n keys, 1 value>


Immutable tuple of hetrogeneously typed PCollections.

  • Keyed by “Tuple Tags”.
  • Multiple PCollection input or output.

Full-Stack Data Engineer | System Design Enthusiast | ayusharora.me