Class DistributedRandomSampler<T>

  • Type Parameters:
    T - The input data type.
    Direct Known Subclasses:
    ReservoirSamplerWithoutReplacement, ReservoirSamplerWithReplacement

    @Internal
    public abstract class DistributedRandomSampler<T>
    extends RandomSampler<T>
    For sampling with fraction, the sample algorithms are natively distributed, while it's not true for fixed size sample algorithms. The fixed size sample algorithms require two-phases sampling (according to our current implementation). In the first phase, each distributed partition is sampled independently. The partial sampling results are handled by a central coordinator. The central coordinator combines the partial sampling results to form the final result.
    • Constructor Detail

      • DistributedRandomSampler

        public DistributedRandomSampler​(int numSamples)
    • Method Detail

      • sampleInPartition

        public abstract Iterator<IntermediateSampleData<T>> sampleInPartition​(Iterator<T> input)
        Sample algorithm for the first phase. It operates on a single partition.
        Parameters:
        input - The DataSet input of each partition.
        Returns:
        Intermediate sample output which will be used as the input of the second phase.
      • sampleInCoordinator

        public Iterator<T> sampleInCoordinator​(Iterator<IntermediateSampleData<T>> input)
        Sample algorithm for the second phase. This operation should be executed as the UDF of an all reduce operation.
        Parameters:
        input - The intermediate sample output generated in the first phase.
        Returns:
        The sampled output.
      • sample

        public Iterator<T> sample​(Iterator<T> input)
        Combine the first phase and second phase in sequence, implemented for test purpose only.
        Specified by:
        sample in class RandomSampler<T>
        Parameters:
        input - Source data.
        Returns:
        Sample result in sequence.