Class DistributedRandomSampler<T>
- java.lang.Object
  - org.apache.flink.api.java.sampling.RandomSampler<T>
    - org.apache.flink.api.java.sampling.DistributedRandomSampler<T>
- Type Parameters:
  T - The input data type.
- Direct Known Subclasses:
  ReservoirSamplerWithoutReplacement, ReservoirSamplerWithReplacement
@Internal public abstract class DistributedRandomSampler<T> extends RandomSampler<T>
Fraction-based sampling algorithms are natively distributed; fixed-size sampling algorithms are not. In the current implementation, fixed-size sampling requires two phases. In the first phase, each distributed partition is sampled independently. In the second phase, a central coordinator combines the partial sampling results to form the final result.
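The two phases described above can be illustrated with a small, Flink-free sketch. It shows one common way to do fixed-size sampling without replacement (tag every element with a random weight and keep the k largest), and it is not taken from this class or its subclasses. All names here (`TwoPhaseSampleSketch`, `Weighted`, `topK`) are hypothetical; `Weighted` is only a stand-in for the role `IntermediateSampleData<T>` plays between the phases.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Random;

/** Hypothetical sketch of two-phase fixed-size sampling; not Flink's implementation. */
class TwoPhaseSampleSketch {

    /** Hypothetical intermediate record: an element plus the random weight assigned to it. */
    static final class Weighted<T> {
        final double weight;
        final T element;

        Weighted(double weight, T element) {
            this.weight = weight;
            this.element = element;
        }
    }

    /** Keeps the k candidates with the largest weights, using a min-heap of size at most k. */
    static <T> List<Weighted<T>> topK(Iterable<Weighted<T>> candidates, int k) {
        PriorityQueue<Weighted<T>> heap =
                new PriorityQueue<>(Comparator.comparingDouble((Weighted<T> w) -> w.weight));
        for (Weighted<T> c : candidates) {
            heap.add(c);
            if (heap.size() > k) {
                heap.poll(); // evict the candidate with the smallest weight
            }
        }
        return new ArrayList<>(heap);
    }

    /** Phase 1: runs independently on each partition and returns at most k weighted elements. */
    static <T> List<Weighted<T>> sampleInPartition(List<T> partition, int k, Random rnd) {
        List<Weighted<T>> weighted = new ArrayList<>();
        for (T e : partition) {
            weighted.add(new Weighted<>(rnd.nextDouble(), e));
        }
        return topK(weighted, k);
    }

    /** Phase 2: the coordinator merges all partial results and keeps the overall top k. */
    static <T> List<T> sampleInCoordinator(List<List<Weighted<T>>> partials, int k) {
        List<Weighted<T>> all = new ArrayList<>();
        for (List<Weighted<T>> p : partials) {
            all.addAll(p);
        }
        List<T> result = new ArrayList<>();
        for (Weighted<T> w : topK(all, k)) {
            result.add(w.element);
        }
        return result;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        int k = 3;
        List<List<Integer>> partitions = Arrays.asList(
                Arrays.asList(1, 2, 3, 4), Arrays.asList(5, 6), Arrays.asList(7, 8, 9));
        List<List<Weighted<Integer>>> partials = new ArrayList<>();
        for (List<Integer> p : partitions) {
            partials.add(sampleInPartition(p, k, rnd));
        }
        // Prints k distinct elements; which ones depends on the random weights.
        System.out.println(sampleInCoordinator(partials, k));
    }
}
```

The sketch works because the k globally largest weights are always among the per-partition top k, so no partition needs to send more than k candidates to the coordinator.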
-
-
Field Summary
Fields (Modifier and Type, Field):
- protected Iterator<IntermediateSampleData<T>> emptyIntermediateIterable
- protected int numSamples
Fields inherited from class org.apache.flink.api.java.sampling.RandomSampler
emptyIterable, EPSILON
-
-
Constructor Summary
Constructors:
- DistributedRandomSampler(int numSamples)
-
Method Summary
Methods (Modifier and Type, Method, Description):
- Iterator<T> sample(Iterator<T> input): Combines the first and second phases in sequence; implemented for testing purposes only.
- Iterator<T> sampleInCoordinator(Iterator<IntermediateSampleData<T>> input): Sample algorithm for the second phase.
- abstract Iterator<IntermediateSampleData<T>> sampleInPartition(Iterator<T> input): Sample algorithm for the first phase.
-
-
-
Field Detail
-
numSamples
protected final int numSamples
-
emptyIntermediateIterable
protected final Iterator<IntermediateSampleData<T>> emptyIntermediateIterable
-
-
Method Detail
-
sampleInPartition
public abstract Iterator<IntermediateSampleData<T>> sampleInPartition(Iterator<T> input)
Sample algorithm for the first phase. It operates on a single partition.
- Parameters:
  input - The DataSet input of each partition.
- Returns:
  Intermediate sample output which will be used as the input of the second phase.
-
sampleInCoordinator
public Iterator<T> sampleInCoordinator(Iterator<IntermediateSampleData<T>> input)
Sample algorithm for the second phase. This operation should be executed as the UDF of an all reduce operation.
- Parameters:
  input - The intermediate sample output generated in the first phase.
- Returns:
  The sampled output.
-
-