Package org.apache.flink.api.java.utils
Class DataSetUtils
- java.lang.Object
-
- org.apache.flink.api.java.utils.DataSetUtils
-
@Deprecated @PublicEvolving public final class DataSetUtils extends Object
Deprecated. All Flink DataSet APIs have been deprecated since Flink 1.18 and will be removed in a future Flink major version. You can still build your application with the DataSet API, but you should move to the DataStream API, the Table API, or both.
This class provides simple utility methods for zipping elements in a data set with an index or with a unique identifier.
-
-
Method Summary
Modifier and Type, Method, Description
- static <T> Utils.ChecksumHashCode checksumHashCode(DataSet<T> input)
  Deprecated. This method will be removed at some point.
- static <T> DataSet<org.apache.flink.api.java.tuple.Tuple2<Integer,Long>> countElementsPerPartition(DataSet<T> input)
  Deprecated. Method that goes over all the elements in each partition in order to retrieve the total number of elements.
- static int getBitSize(long value)
  Deprecated.
- static <T> PartitionOperator<T> partitionByRange(DataSet<T> input, org.apache.flink.api.common.distributions.DataDistribution distribution, int... fields)
  Deprecated. Range-partitions a DataSet on the specified tuple field positions.
- static <T> PartitionOperator<T> partitionByRange(DataSet<T> input, org.apache.flink.api.common.distributions.DataDistribution distribution, String... fields)
  Deprecated. Range-partitions a DataSet on the specified fields.
- static <T,K extends Comparable<K>> PartitionOperator<T> partitionByRange(DataSet<T> input, org.apache.flink.api.common.distributions.DataDistribution distribution, org.apache.flink.api.java.functions.KeySelector<T,K> keyExtractor)
  Deprecated. Range-partitions a DataSet using the specified key selector function.
- static <T> MapPartitionOperator<T,T> sample(DataSet<T> input, boolean withReplacement, double fraction)
  Deprecated. Generate a sample of DataSet by the probability fraction of each element.
- static <T> MapPartitionOperator<T,T> sample(DataSet<T> input, boolean withReplacement, double fraction, long seed)
  Deprecated. Generate a sample of DataSet by the probability fraction of each element.
- static <T> DataSet<T> sampleWithSize(DataSet<T> input, boolean withReplacement, int numSamples)
  Deprecated. Generate a sample of DataSet which contains fixed size elements.
- static <T> DataSet<T> sampleWithSize(DataSet<T> input, boolean withReplacement, int numSamples, long seed)
  Deprecated. Generate a sample of DataSet which contains fixed size elements.
- static <R extends org.apache.flink.api.java.tuple.Tuple,T extends org.apache.flink.api.java.tuple.Tuple> R summarize(DataSet<T> input)
  Deprecated. Summarize a DataSet of Tuples by collecting single pass statistics for all columns.
- static <T> DataSet<org.apache.flink.api.java.tuple.Tuple2<Long,T>> zipWithIndex(DataSet<T> input)
  Deprecated. Method that assigns a unique Long value to all elements in the input data set.
- static <T> DataSet<org.apache.flink.api.java.tuple.Tuple2<Long,T>> zipWithUniqueId(DataSet<T> input)
  Deprecated. Method that assigns a unique Long value to all elements in the input data set as described below.
-
-
-
Method Detail
-
countElementsPerPartition
public static <T> DataSet<org.apache.flink.api.java.tuple.Tuple2<Integer,Long>> countElementsPerPartition(DataSet<T> input)
Deprecated. Iterates over all elements in each partition to count the number of elements per partition.
- Parameters:
input - the DataSet received as input
- Returns:
- a data set of tuples that map each subtask index to the number of elements in that partition.
-
zipWithIndex
public static <T> DataSet<org.apache.flink.api.java.tuple.Tuple2<Long,T>> zipWithIndex(DataSet<T> input)
Deprecated. Assigns a unique Long value to every element in the input data set. The generated values are consecutive.
- Parameters:
input - the input data set
- Returns:
- a data set of Tuple2 consisting of consecutive ids and initial values.
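One way to obtain consecutive ids across parallel partitions is to count the elements of each partition first (the kind of result countElementsPerPartition returns) and then start each partition at the total size of all earlier partitions. The plain-Java sketch below illustrates that idea on in-memory lists. The class ZipWithIndexSketch and its methods are hypothetical names introduced for illustration; this page does not state that Flink implements zipWithIndex this way.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration only; not part of Flink. */
final class ZipWithIndexSketch {

    /** Start offset of a partition = number of elements in all earlier partitions. */
    static long[] startOffsets(long[] countsPerPartition) {
        long[] offsets = new long[countsPerPartition.length];
        long running = 0L;
        for (int i = 0; i < countsPerPartition.length; i++) {
            offsets[i] = running;
            running += countsPerPartition[i];
        }
        return offsets;
    }

    /** Pairs every element with a consecutive index, walking partitions in order. */
    static <T> List<Map.Entry<Long, T>> zipWithIndex(List<List<T>> partitions) {
        long[] counts = new long[partitions.size()];
        for (int i = 0; i < counts.length; i++) {
            counts[i] = partitions.get(i).size();
        }
        long[] offsets = startOffsets(counts);

        List<Map.Entry<Long, T>> result = new ArrayList<>();
        for (int p = 0; p < partitions.size(); p++) {
            long index = offsets[p];
            for (T element : partitions.get(p)) {
                result.add(new AbstractMap.SimpleImmutableEntry<>(Long.valueOf(index), element));
                index++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<String>> partitions = Arrays.asList(
                Arrays.asList("a", "b"),
                Arrays.asList("c"),
                Arrays.asList("d", "e"));
        System.out.println(zipWithIndex(partitions));
    }
}
```

In a distributed setting the counting step is a separate pass over the data, which is the extra cost of getting consecutive rather than merely unique ids.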
-
zipWithUniqueId
public static <T> DataSet<org.apache.flink.api.java.tuple.Tuple2<Long,T>> zipWithUniqueId(DataSet<T> input)
Deprecated. Assigns a unique Long value to every element in the input data set, as described below.
- a map function is applied to the input data set
- each map task holds a counter c which is increased for each record
- c is shifted by n bits where n = log2(number of parallel tasks)
- to create a unique ID among all tasks, the task id is added to the counter
- for each record, the resulting counter is collected
- Parameters:
input - the input data set
- Returns:
- a data set of Tuple2 consisting of ids and initial values.
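The steps above can be sketched in plain Java as follows. UniqueIdSketch, bitsFor and uniqueId are hypothetical names, and rounding n up to the number of bits needed to represent the largest task id is an assumption made here so that the scheme also works when the number of parallel tasks is not a power of two.

```java
/** Hypothetical illustration only; not part of Flink. */
final class UniqueIdSketch {

    /** Number of bits needed to represent task ids 0 .. parallelism - 1 (assumed rounding of log2). */
    static int bitsFor(int parallelism) {
        return 64 - Long.numberOfLeadingZeros(parallelism - 1L);
    }

    /** Shifts the per-task counter left by n bits and adds the task id in the freed low bits. */
    static long uniqueId(long counter, int taskId, int parallelism) {
        return (counter << bitsFor(parallelism)) + taskId;
    }

    public static void main(String[] args) {
        int parallelism = 4;
        for (int task = 0; task < parallelism; task++) {
            for (long counter = 0; counter < 3; counter++) {
                System.out.println("task " + task + ", record " + counter
                        + " -> id " + uniqueId(counter, task, parallelism));
            }
        }
    }
}
```

Because the task id occupies the low bits, two tasks can never produce the same id, and no coordination between tasks is needed. The ids are unique but generally not consecutive.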
-
sample
public static <T> MapPartitionOperator<T,T> sample(DataSet<T> input, boolean withReplacement, double fraction)
Deprecated. Generates a sample of the DataSet in which each element is selected with the given probability fraction.
- Parameters:
withReplacement - Whether an element can be selected more than once.
fraction - Probability that each element is chosen; should be in [0, 1] without replacement and in [0, ∞) with replacement. When fraction is larger than 1, each element is expected to be selected multiple times into the sample on average.
- Returns:
- The sampled DataSet
-
sample
public static <T> MapPartitionOperator<T,T> sample(DataSet<T> input, boolean withReplacement, double fraction, long seed)
Deprecated. Generates a sample of the DataSet in which each element is selected with the given probability fraction.
- Parameters:
withReplacement - Whether an element can be selected more than once.
fraction - Probability that each element is chosen; should be in [0, 1] without replacement and in [0, ∞) with replacement. When fraction is larger than 1, each element is expected to be selected multiple times into the sample on average.
seed - Random number generator seed.
- Returns:
- The sampled DataSet
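To make the meaning of fraction and seed concrete, here is a minimal single-machine sketch of sampling without replacement, where each element is kept independently with probability fraction (Bernoulli sampling). SampleSketch and bernoulliSample are hypothetical names; the sampler Flink uses internally is not described on this page, and the with-replacement case is not covered by this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

/** Hypothetical illustration only; not part of Flink. */
final class SampleSketch {

    /** Keeps each element independently with probability {@code fraction} (without replacement). */
    static <T> List<T> bernoulliSample(List<T> input, double fraction, long seed) {
        if (fraction < 0.0 || fraction > 1.0) {
            throw new IllegalArgumentException("fraction must be in [0, 1] without replacement");
        }
        Random random = new Random(seed); // a fixed seed makes the sample reproducible
        List<T> sample = new ArrayList<>();
        for (T element : input) {
            // nextDouble() is uniform in [0, 1), so fraction 0 keeps nothing and 1 keeps everything.
            if (random.nextDouble() < fraction) {
                sample.add(element);
            }
        }
        return sample;
    }

    public static void main(String[] args) {
        List<Integer> input = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
        System.out.println(bernoulliSample(input, 0.5, 42L));
    }
}
```

The size of the result varies from run to run unless the seed is fixed; only its expected value, fraction times the input size, is determined.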
-
sampleWithSize
public static <T> DataSet<T> sampleWithSize(DataSet<T> input, boolean withReplacement, int numSamples)
Deprecated. Generates a sample of the DataSet that contains a fixed number of elements.
NOTE: Sampling with a fixed size is not as efficient as sampling with a fraction; use sample with fraction unless you need an exact sample size.
- Parameters:
withReplacement - Whether an element can be selected more than once.
numSamples - The expected sample size.
- Returns:
- The sampled DataSet
-
sampleWithSize
public static <T> DataSet<T> sampleWithSize(DataSet<T> input, boolean withReplacement, int numSamples, long seed)
Deprecated. Generates a sample of the DataSet that contains a fixed number of elements.
NOTE: Sampling with a fixed size is not as efficient as sampling with a fraction; use sample with fraction unless you need an exact sample size.
- Parameters:
withReplacement - Whether an element can be selected more than once.
numSamples - The expected sample size.
seed - Random number generator seed.
- Returns:
- The sampled DataSet
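A classic single-machine way to draw a fixed-size sample without replacement in one pass is reservoir sampling (Algorithm R), sketched below. It is shown only to illustrate what a fixed-size sample means; ReservoirSketch and reservoirSample are hypothetical names, and this page does not say which algorithm Flink uses or how it is distributed across tasks.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

/** Hypothetical illustration only; not part of Flink. */
final class ReservoirSketch {

    /** Returns min(numSamples, input size) elements, each input element being equally likely. */
    static <T> List<T> reservoirSample(Iterable<T> input, int numSamples, long seed) {
        if (numSamples < 0) {
            throw new IllegalArgumentException("numSamples must be non-negative");
        }
        Random random = new Random(seed);
        List<T> reservoir = new ArrayList<>(numSamples);
        long seen = 0L;
        for (T element : input) {
            seen++;
            if (reservoir.size() < numSamples) {
                // Fill phase: the first numSamples elements are always kept.
                reservoir.add(element);
            } else if (numSamples > 0) {
                // Replace phase: pick a slot in [0, seen); only slots inside the reservoir count.
                long slot = (long) (random.nextDouble() * seen);
                if (slot < numSamples) {
                    reservoir.set((int) slot, element);
                }
            }
        }
        return reservoir;
    }

    public static void main(String[] args) {
        List<Integer> input = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
        System.out.println(reservoirSample(input, 3, 42L));
    }
}
```

Unlike fraction-based sampling, every element has to be considered against shared state (the reservoir), which hints at why a fixed-size sample is harder to produce efficiently in parallel.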
-
partitionByRange
public static <T> PartitionOperator<T> partitionByRange(DataSet<T> input, org.apache.flink.api.common.distributions.DataDistribution distribution, int... fields)
Deprecated. Range-partitions a DataSet on the specified tuple field positions.
-
partitionByRange
public static <T> PartitionOperator<T> partitionByRange(DataSet<T> input, org.apache.flink.api.common.distributions.DataDistribution distribution, String... fields)
Deprecated. Range-partitions a DataSet on the specified fields.
-
partitionByRange
public static <T,K extends Comparable<K>> PartitionOperator<T> partitionByRange(DataSet<T> input, org.apache.flink.api.common.distributions.DataDistribution distribution, org.apache.flink.api.java.functions.KeySelector<T,K> keyExtractor)
Deprecated. Range-partitions a DataSet using the specified key selector function.
-
summarize
public static <R extends org.apache.flink.api.java.tuple.Tuple,T extends org.apache.flink.api.java.tuple.Tuple> R summarize(DataSet<T> input) throws Exception
Deprecated. Summarizes a DataSet of Tuples by collecting single-pass statistics for all columns.
Example usage:
    DataSet<Tuple3<Double, String, Boolean>> input = // [...]
    Tuple3<NumericColumnSummary, StringColumnSummary, BooleanColumnSummary> summary = DataSetUtils.summarize(input);
    summary.f0.getStandardDeviation();
    summary.f1.getMaxLength();
- Returns:
- the summary as a Tuple the same width as input rows
- Throws:
Exception
-
checksumHashCode
@Deprecated public static <T> Utils.ChecksumHashCode checksumHashCode(DataSet<T> input) throws Exception
Deprecated. This method will be removed at some point.
Convenience method to get the count (number of elements) of a DataSet as well as the checksum (sum over element hashes).
- Returns:
- A ChecksumHashCode that represents the count and checksum of elements in the data set.
- Throws:
Exception
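The two quantities described above, an element count and a sum over element hashes, can be illustrated on an in-memory collection as follows. ChecksumSketch and countAndChecksum are hypothetical names, and using hashCode() accumulated into a long is an assumption; the exact hash and accumulation Flink uses are not specified on this page.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

/** Hypothetical illustration only; not part of Flink. */
final class ChecksumSketch {

    /** Returns {count, checksum}, where checksum is the sum of the elements' hash codes. */
    static <T> long[] countAndChecksum(Iterable<T> elements) {
        long count = 0L;
        long checksum = 0L;
        for (T element : elements) {
            count++;
            checksum += Objects.hashCode(element); // null-safe; a null element contributes 0
        }
        return new long[] {count, checksum};
    }

    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(1, 2, 3);
        long[] result = countAndChecksum(data);
        System.out.println("count=" + result[0] + ", checksum=" + result[1]);
    }
}
```

Since addition is commutative, the result does not depend on the order of the elements, which makes such a checksum convenient for comparing the outputs of parallel runs whose element order may differ.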
-
getBitSize
public static int getBitSize(long value)
Deprecated.
-
-