Class ReservoirSamplerWithoutReplacement<T>
- java.lang.Object
-
- org.apache.flink.api.java.sampling.RandomSampler<T>
-
- org.apache.flink.api.java.sampling.DistributedRandomSampler<T>
-
- org.apache.flink.api.java.sampling.ReservoirSamplerWithoutReplacement<T>
-
- Type Parameters:
T- The type of the sampler.
@Internal public class ReservoirSamplerWithoutReplacement<T> extends DistributedRandomSampler<T>
A simple in memory implementation of Reservoir Sampling without replacement, and with only one pass through the input iteration whose size is unpredictable. The basic idea behind this sampler implementation is to generate a random number for each input element as its weight, select the top K elements with max weight. As the weights are generated randomly, so are the selected top K elements. The algorithm is implemented using theDistributedRandomSamplerinterface. In the first phase, we generate random numbers as the weights for each element and select top K elements as the output of each partitions. In the second phase, we select top K elements from all the outputs of the first phase.This implementation refers to the algorithm described in "Optimal Random Sampling from Distributed Streams Revisited".
-
-
Field Summary
-
Fields inherited from class org.apache.flink.api.java.sampling.DistributedRandomSampler
emptyIntermediateIterable, numSamples
-
Fields inherited from class org.apache.flink.api.java.sampling.RandomSampler
emptyIterable, EPSILON
-
-
Constructor Summary
Constructors Constructor Description ReservoirSamplerWithoutReplacement(int numSamples)Create a new sampler with reservoir size and a default random number generator.ReservoirSamplerWithoutReplacement(int numSamples, long seed)Create a new sampler with reservoir size and the seed for random number generator.ReservoirSamplerWithoutReplacement(int numSamples, Random random)Create a new sampler with reservoir size and a supplied random number generator.
-
Method Summary
All Methods Instance Methods Concrete Methods Modifier and Type Method Description Iterator<IntermediateSampleData<T>>sampleInPartition(Iterator<T> input)Sample algorithm for the first phase.-
Methods inherited from class org.apache.flink.api.java.sampling.DistributedRandomSampler
sample, sampleInCoordinator
-
-
-
-
Constructor Detail
-
ReservoirSamplerWithoutReplacement
public ReservoirSamplerWithoutReplacement(int numSamples, Random random)Create a new sampler with reservoir size and a supplied random number generator.- Parameters:
numSamples- Maximum number of samples to retain in reservoir, must be non-negative.random- Instance of random number generator for sampling.
-
ReservoirSamplerWithoutReplacement
public ReservoirSamplerWithoutReplacement(int numSamples)
Create a new sampler with reservoir size and a default random number generator.- Parameters:
numSamples- Maximum number of samples to retain in reservoir, must be non-negative.
-
ReservoirSamplerWithoutReplacement
public ReservoirSamplerWithoutReplacement(int numSamples, long seed)Create a new sampler with reservoir size and the seed for random number generator.- Parameters:
numSamples- Maximum number of samples to retain in reservoir, must be non-negative.seed- Random number generator seed.
-
-
Method Detail
-
sampleInPartition
public Iterator<IntermediateSampleData<T>> sampleInPartition(Iterator<T> input)
Description copied from class:DistributedRandomSamplerSample algorithm for the first phase. It operates on a single partition.- Specified by:
sampleInPartitionin classDistributedRandomSampler<T>- Parameters:
input- The DataSet input of each partition.- Returns:
- Intermediate sample output which will be used as the input of the second phase.
-
-