Apache Beam Cheat Sheet
2 min readJan 13, 2021
Beam Transformations
- GroupByKey (GBK) : KV<K,V> → KV<K,Iterable<V>>
- CoGroupByKey (CoGBK): <KV<Id,CoGbkResult>> ; special map from tuple tag PCollection → Iterable<PCollection>
3. Combine.perKey :
3.1. Serializable Function<Iterable<T>,T>
3.2 CombineFn<InputT,AccumT,OutputT>
- Flatten: PCollectionList<T> is created, then merged
- Partition: returns a partition for (override) PCollectionList<T> with PartitionFn.
- Distinct.create: PCollection<KV<Type1,Type2>>
- Count.perKey: PCollection<KV<Type1,Long>>
Beam Transformations Code
- Beam Transformations are written using:
<new DoFn<K,V>
Beam allows lambda functions while writing code in Java.
Example
- Filter without Lambda:
- Filter with Lambda:
- Writing your own PTransform
- PTransform Methods:
public abstract OutputT expand(Input input)
Side Inputs: Addition inputs to a ParDo transformation
asList(): PCollectionView<T> passed to Transform<T>
PCollectionView (PCV)
- View.asMap: Produces PCV mapping each window to a Map<K,V>.
- It is required that each key of the input be associated with a single value, per window.
2. View.asSingleton: Produces a PCV (from PCollection with single value per window as input) that returns the value in the main input window.
- When read a single input:
<n keys, n values> → per windows <n keys, 1 value>
PCollectionTuple
Immutable tuple of hetrogeneously typed PCollections.
- Keyed by “Tuple Tags”.
- Multiple PCollection input or output.