Class DataSetUtils


  • @Deprecated
    @PublicEvolving
    public final class DataSetUtils
    extends Object
    Deprecated.
    All Flink DataSet APIs are deprecated since Flink 1.18 and will be removed in a future Flink major version. You can still build your application in DataSet, but you should move to either the DataStream and/or Table API.
    This class provides simple utility methods for zipping elements in a data set with an index or with a unique identifier.
    See Also:
    FLIP-131: Consolidate the user-facing Dataflow SDKs/APIs (and deprecate the DataSet API
    • Method Summary

      All Methods Static Methods Concrete Methods Deprecated Methods 
      Modifier and Type Method Description
      static <T> Utils.ChecksumHashCode checksumHashCode​(DataSet<T> input)
      Deprecated.
      This method will be removed at some point.
      static <T> DataSet<org.apache.flink.api.java.tuple.Tuple2<Integer,​Long>> countElementsPerPartition​(DataSet<T> input)
      Deprecated.
      Method that goes over all the elements in each partition in order to retrieve the total number of elements.
      static int getBitSize​(long value)
      Deprecated.
       
      static <T> PartitionOperator<T> partitionByRange​(DataSet<T> input, org.apache.flink.api.common.distributions.DataDistribution distribution, int... fields)
      Deprecated.
      Range-partitions a DataSet on the specified tuple field positions.
      static <T> PartitionOperator<T> partitionByRange​(DataSet<T> input, org.apache.flink.api.common.distributions.DataDistribution distribution, String... fields)
      Deprecated.
      Range-partitions a DataSet on the specified fields.
      static <T,​K extends Comparable<K>>
      PartitionOperator<T>
      partitionByRange​(DataSet<T> input, org.apache.flink.api.common.distributions.DataDistribution distribution, org.apache.flink.api.java.functions.KeySelector<T,​K> keyExtractor)
      Deprecated.
      Range-partitions a DataSet using the specified key selector function.
      static <T> MapPartitionOperator<T,​T> sample​(DataSet<T> input, boolean withReplacement, double fraction)
      Deprecated.
      Generate a sample of DataSet by the probability fraction of each element.
      static <T> MapPartitionOperator<T,​T> sample​(DataSet<T> input, boolean withReplacement, double fraction, long seed)
      Deprecated.
      Generate a sample of DataSet by the probability fraction of each element.
      static <T> DataSet<T> sampleWithSize​(DataSet<T> input, boolean withReplacement, int numSamples)
      Deprecated.
      Generate a sample of DataSet which contains fixed size elements.
      static <T> DataSet<T> sampleWithSize​(DataSet<T> input, boolean withReplacement, int numSamples, long seed)
      Deprecated.
      Generate a sample of DataSet which contains fixed size elements.
      static <R extends org.apache.flink.api.java.tuple.Tuple,​T extends org.apache.flink.api.java.tuple.Tuple>
      R
      summarize​(DataSet<T> input)
      Deprecated.
      Summarize a DataSet of Tuples by collecting single pass statistics for all columns.
      static <T> DataSet<org.apache.flink.api.java.tuple.Tuple2<Long,​T>> zipWithIndex​(DataSet<T> input)
      Deprecated.
      Method that assigns a unique Long value to all elements in the input data set.
      static <T> DataSet<org.apache.flink.api.java.tuple.Tuple2<Long,​T>> zipWithUniqueId​(DataSet<T> input)
      Deprecated.
      Method that assigns a unique Long value to all elements in the input data set as described below.
    • Method Detail

      • countElementsPerPartition

        public static <T> DataSet<org.apache.flink.api.java.tuple.Tuple2<Integer,​Long>> countElementsPerPartition​(DataSet<T> input)
        Deprecated.
        Method that goes over all the elements in each partition in order to retrieve the total number of elements.
        Parameters:
        input - the DataSet received as input
        Returns:
        a data set containing tuples of subtask index, number of elements mappings.
      • zipWithIndex

        public static <T> DataSet<org.apache.flink.api.java.tuple.Tuple2<Long,​T>> zipWithIndex​(DataSet<T> input)
        Deprecated.
        Method that assigns a unique Long value to all elements in the input data set. The generated values are consecutive.
        Parameters:
        input - the input data set
        Returns:
        a data set of tuple 2 consisting of consecutive ids and initial values.
      • zipWithUniqueId

        public static <T> DataSet<org.apache.flink.api.java.tuple.Tuple2<Long,​T>> zipWithUniqueId​(DataSet<T> input)
        Deprecated.
        Method that assigns a unique Long value to all elements in the input data set as described below.
        • a map function is applied to the input data set
        • each map task holds a counter c which is increased for each record
        • c is shifted by n bits where n = log2(number of parallel tasks)
        • to create a unique ID among all tasks, the task id is added to the counter
        • for each record, the resulting counter is collected
        Parameters:
        input - the input data set
        Returns:
        a data set of tuple 2 consisting of ids and initial values.
      • sample

        public static <T> MapPartitionOperator<T,​T> sample​(DataSet<T> input,
                                                                 boolean withReplacement,
                                                                 double fraction)
        Deprecated.
        Generate a sample of DataSet by the probability fraction of each element.
        Parameters:
        withReplacement - Whether element can be selected more than once.
        fraction - Probability that each element is chosen, should be [0,1] without replacement, and [0, ∞) with replacement. While fraction is larger than 1, the elements are expected to be selected multi times into sample on average.
        Returns:
        The sampled DataSet
      • sample

        public static <T> MapPartitionOperator<T,​T> sample​(DataSet<T> input,
                                                                 boolean withReplacement,
                                                                 double fraction,
                                                                 long seed)
        Deprecated.
        Generate a sample of DataSet by the probability fraction of each element.
        Parameters:
        withReplacement - Whether element can be selected more than once.
        fraction - Probability that each element is chosen, should be [0,1] without replacement, and [0, ∞) with replacement. While fraction is larger than 1, the elements are expected to be selected multi times into sample on average.
        seed - random number generator seed.
        Returns:
        The sampled DataSet
      • sampleWithSize

        public static <T> DataSet<T> sampleWithSize​(DataSet<T> input,
                                                    boolean withReplacement,
                                                    int numSamples)
        Deprecated.
        Generate a sample of DataSet which contains fixed size elements.

        NOTE: Sample with fixed size is not as efficient as sample with fraction, use sample with fraction unless you need exact precision.

        Parameters:
        withReplacement - Whether element can be selected more than once.
        numSamples - The expected sample size.
        Returns:
        The sampled DataSet
      • sampleWithSize

        public static <T> DataSet<T> sampleWithSize​(DataSet<T> input,
                                                    boolean withReplacement,
                                                    int numSamples,
                                                    long seed)
        Deprecated.
        Generate a sample of DataSet which contains fixed size elements.

        NOTE: Sample with fixed size is not as efficient as sample with fraction, use sample with fraction unless you need exact precision.

        Parameters:
        withReplacement - Whether element can be selected more than once.
        numSamples - The expected sample size.
        seed - Random number generator seed.
        Returns:
        The sampled DataSet
      • partitionByRange

        public static <T> PartitionOperator<T> partitionByRange​(DataSet<T> input,
                                                                org.apache.flink.api.common.distributions.DataDistribution distribution,
                                                                int... fields)
        Deprecated.
        Range-partitions a DataSet on the specified tuple field positions.
      • partitionByRange

        public static <T> PartitionOperator<T> partitionByRange​(DataSet<T> input,
                                                                org.apache.flink.api.common.distributions.DataDistribution distribution,
                                                                String... fields)
        Deprecated.
        Range-partitions a DataSet on the specified fields.
      • partitionByRange

        public static <T,​K extends Comparable<K>> PartitionOperator<T> partitionByRange​(DataSet<T> input,
                                                                                              org.apache.flink.api.common.distributions.DataDistribution distribution,
                                                                                              org.apache.flink.api.java.functions.KeySelector<T,​K> keyExtractor)
        Deprecated.
        Range-partitions a DataSet using the specified key selector function.
      • summarize

        public static <R extends org.apache.flink.api.java.tuple.Tuple,​T extends org.apache.flink.api.java.tuple.Tuple> R summarize​(DataSet<T> input)
                                                                                                                                   throws Exception
        Deprecated.
        Summarize a DataSet of Tuples by collecting single pass statistics for all columns.

        Example usage:

        
         Dataset<Tuple3<Double, String, Boolean>> input = // [...]
         Tuple3<NumericColumnSummary,StringColumnSummary, BooleanColumnSummary> summary = DataSetUtils.summarize(input)
        
         summary.f0.getStandardDeviation()
         summary.f1.getMaxLength()
         
        Returns:
        the summary as a Tuple the same width as input rows
        Throws:
        Exception
      • checksumHashCode

        @Deprecated
        public static <T> Utils.ChecksumHashCode checksumHashCode​(DataSet<T> input)
                                                           throws Exception
        Deprecated.
        This method will be removed at some point.
        Convenience method to get the count (number of elements) of a DataSet as well as the checksum (sum over element hashes).
        Returns:
        A ChecksumHashCode that represents the count and checksum of elements in the data set.
        Throws:
        Exception
      • getBitSize

        public static int getBitSize​(long value)
        Deprecated.