Class ReservoirSamplerWithoutReplacement<T>

  • Type Parameters:
    T - The type of the sampler.

    @Internal
    public class ReservoirSamplerWithoutReplacement<T>
    extends DistributedRandomSampler<T>
    A simple in memory implementation of Reservoir Sampling without replacement, and with only one pass through the input iteration whose size is unpredictable. The basic idea behind this sampler implementation is to generate a random number for each input element as its weight, select the top K elements with max weight. As the weights are generated randomly, so are the selected top K elements. The algorithm is implemented using the DistributedRandomSampler interface. In the first phase, we generate random numbers as the weights for each element and select top K elements as the output of each partitions. In the second phase, we select top K elements from all the outputs of the first phase.

    This implementation refers to the algorithm described in "Optimal Random Sampling from Distributed Streams Revisited".

    • Constructor Detail

      • ReservoirSamplerWithoutReplacement

        public ReservoirSamplerWithoutReplacement​(int numSamples,
                                                  Random random)
        Create a new sampler with reservoir size and a supplied random number generator.
        Parameters:
        numSamples - Maximum number of samples to retain in reservoir, must be non-negative.
        random - Instance of random number generator for sampling.
      • ReservoirSamplerWithoutReplacement

        public ReservoirSamplerWithoutReplacement​(int numSamples)
        Create a new sampler with reservoir size and a default random number generator.
        Parameters:
        numSamples - Maximum number of samples to retain in reservoir, must be non-negative.
      • ReservoirSamplerWithoutReplacement

        public ReservoirSamplerWithoutReplacement​(int numSamples,
                                                  long seed)
        Create a new sampler with reservoir size and the seed for random number generator.
        Parameters:
        numSamples - Maximum number of samples to retain in reservoir, must be non-negative.
        seed - Random number generator seed.