Interface InputFormat<OT, T extends InputSplit>

  • Type Parameters:
    OT - The type of the produced records.
    T - The type of input split.
    All Superinterfaces:
    InputSplitSource<T>, Serializable
    All Known Implementing Classes:
    BinaryInputFormat, DelimitedInputFormat, FileInputFormat, GenericCsvInputFormat, GenericInputFormat, ReplicatingInputFormat, RichInputFormat, SerializedInputFormat

    @Public
    public interface InputFormat<OT, T extends InputSplit>
    extends InputSplitSource<T>, Serializable
    The base interface for data sources that produce records.

    The input format handles the following:

    • It describes how the input is split into splits that can be processed in parallel.
    • It describes how to read records from the input split.
    • It describes how to gather basic statistics from the input.

    The life cycle of an input format is the following:

    1. After being instantiated (parameterless), it is configured with a Configuration object. Basic fields are read from the configuration, such as a file path, if the format describes files as input.
    2. Optionally: It is called by the compiler to produce basic statistics about the input.
    3. It is called to create the input splits.
    4. Each parallel input task creates an instance, configures it and opens it for a specific split.
    5. All records are read from the input.
    6. The input format is closed.

    IMPORTANT NOTE: Input formats must be written such that an instance can be opened again after it was closed, because a single instance may be used for multiple splits. After a split is done, the format's close function is invoked and, if another split is available, the open function is invoked afterwards for the next split.
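
    The life cycle above can be sketched with a small, self-contained example. RangeSplit, RangeFormat and LifecycleSketch are hypothetical stand-ins that only mirror the method names documented on this page; they do not implement the real interface, and configure takes a plain int here instead of a Configuration object. The point of the sketch is that one instance is opened, read and closed once per split, so all per-split state is reset in open.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical split: a half-open range [start, end) of integers.
final class RangeSplit {
    final int start;
    final int end;
    RangeSplit(int start, int end) { this.start = start; this.end = end; }
}

// Hypothetical format that "reads" the integers 0..total-1.
// It mirrors the documented method names but is not the real Flink interface.
final class RangeFormat {
    private int total;       // basic field, set in configure()
    private int next;        // per-split state, reset in open()
    private int end;         // per-split state, reset in open()
    private boolean opened;

    void configure(int total) { this.total = total; }

    RangeSplit[] createInputSplits(int minNumSplits) {
        // The requested number is only a hint; never create more splits than records.
        int n = Math.max(1, Math.min(minNumSplits, Math.max(total, 1)));
        RangeSplit[] splits = new RangeSplit[n];
        for (int i = 0; i < n; i++) {
            int from = (int) ((long) total * i / n);
            int to = (int) ((long) total * (i + 1) / n);
            splits[i] = new RangeSplit(from, to);
        }
        return splits;
    }

    void open(RangeSplit split) throws IOException {
        if (opened) {
            throw new IOException("format is already open");
        }
        next = split.start;
        end = split.end;
        opened = true;
    }

    boolean reachedEnd() { return next >= end; }

    Integer nextRecord(Integer reuse) { return next++; } // Integer is immutable, so reuse is ignored

    void close() { opened = false; }
}

final class LifecycleSketch {
    static List<Integer> readAll(int total, int minNumSplits) {
        // Steps 1 and 3: one configured instance creates the splits.
        RangeFormat planner = new RangeFormat();
        planner.configure(total);
        RangeSplit[] splits = planner.createInputSplits(minNumSplits);

        // Steps 4 to 6: a task-side instance is configured once, then reopened per split.
        RangeFormat task = new RangeFormat();
        task.configure(total);
        List<Integer> records = new ArrayList<>();
        try {
            for (RangeSplit split : splits) {
                task.open(split);
                while (!task.reachedEnd()) {
                    records.add(task.nextRecord(null));
                }
                task.close();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }

    public static void main(String[] args) {
        System.out.println(readAll(10, 3));
    }
}
```

    In this sketch a single task instance processes every split in sequence; in a parallel job each task would run the same open, read, close loop over the splits assigned to it.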

    See Also:
    InputSplit, BaseStatistics
    • Method Detail

      • configure

        void configure(Configuration parameters)
        Configures this input format. Since input formats are instantiated generically and hence without constructor parameters, this method is where an input format sets its basic fields based on configuration values.

        This method is always called first on a newly instantiated input format.

        Parameters:
        parameters - The configuration with all parameters (note: not the Flink config but the TaskConfig).
      • getStatistics

        BaseStatistics getStatistics(BaseStatistics cachedStatistics)
                              throws IOException
        Gets the basic statistics from the input described by this format. If the input format does not know how to create those statistics, it may return null. This method optionally gets a cached version of the statistics. The input format may examine them and decide whether it directly returns them without spending effort to re-gather the statistics.

        When this method is called, the input format is guaranteed to be configured.

        Parameters:
        cachedStatistics - The statistics that were cached. May be null.
        Returns:
        The base statistics for the input, or null, if not available.
        Throws:
        IOException
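
        One way to use the cached statistics is sketched below. SketchStatistics and StatisticsSketch are hypothetical stand-ins, and the sourceVersion field is an assumed change marker (for example a modification time); neither is part of the documented API. The cached object is returned unchanged when it still matches the source, and the statistics are gathered again otherwise.

```java
// Hypothetical stand-in for the statistics object; the real BaseStatistics is not reproduced here.
final class SketchStatistics {
    final long totalInputSize;
    final long sourceVersion; // assumed change marker, e.g. a modification time

    SketchStatistics(long totalInputSize, long sourceVersion) {
        this.totalInputSize = totalInputSize;
        this.sourceVersion = sourceVersion;
    }
}

// Hypothetical format-like class that only models the statistics call.
final class StatisticsSketch {
    long size;     // current size of the (imaginary) input
    long version;  // current version of the (imaginary) input
    int scans;     // counts how often statistics were re-gathered

    StatisticsSketch(long size, long version) {
        this.size = size;
        this.version = version;
    }

    SketchStatistics getStatistics(SketchStatistics cachedStatistics) {
        // Reuse the cached statistics if the input has not changed since they were gathered.
        if (cachedStatistics != null && cachedStatistics.sourceVersion == version) {
            return cachedStatistics;
        }
        // Otherwise spend the effort to gather them again.
        scans++;
        return new SketchStatistics(size, version);
    }

    public static void main(String[] args) {
        StatisticsSketch format = new StatisticsSketch(1024L, 1L);
        SketchStatistics first = format.getStatistics(null);
        SketchStatistics second = format.getStatistics(first);
        System.out.println("cached object reused: " + (first == second));
    }
}
```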
      • createInputSplits

        T[] createInputSplits(int minNumSplits)
                       throws IOException
        Description copied from interface: InputSplitSource
        Computes the input splits. The given minimum number of splits is a hint as to how many splits are desired.
        Specified by:
        createInputSplits in interface InputSplitSource<T extends InputSplit>
        Parameters:
        minNumSplits - The minimum number of input splits, as a hint.
        Returns:
        An array of input splits.
        Throws:
        IOException
      • getInputSplitAssigner

        InputSplitAssigner getInputSplitAssigner(T[] inputSplits)
        Description copied from interface: InputSplitSource
        Returns the assigner for the input splits. The assigner determines which parallel instance of the input format gets which input split.
        Specified by:
        getInputSplitAssigner in interface InputSplitSource<T extends InputSplit>
        Returns:
        The input split assigner.
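
        As a rough illustration of what an assigner does, the sketch below hands out splits on request in first-come, first-served order. QueueAssignerSketch is a hypothetical class, and its getNextInputSplit(int taskId) signature is a simplifying assumption rather than the InputSplitAssigner API; a locality-aware assigner would use information about the requesting task instead of ignoring it.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Hypothetical assigner: parallel readers pull splits until none are left.
final class QueueAssignerSketch<S> {
    private final Deque<S> unassigned;

    QueueAssignerSketch(List<S> splits) {
        this.unassigned = new ArrayDeque<>(splits);
    }

    // Synchronized because several parallel instances may request splits concurrently.
    // Returns null once every split has been handed out.
    synchronized S getNextInputSplit(int taskId) {
        return unassigned.pollFirst();
    }

    public static void main(String[] args) {
        QueueAssignerSketch<String> assigner =
                new QueueAssignerSketch<>(Arrays.asList("split-0", "split-1", "split-2"));
        // Two tasks take turns requesting work until the assigner runs dry.
        int taskId = 0;
        String split;
        while ((split = assigner.getNextInputSplit(taskId)) != null) {
            System.out.println("task " + taskId + " reads " + split);
            taskId = 1 - taskId;
        }
    }
}
```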
      • open

        void open(T split)
           throws IOException
        Opens a parallel instance of the input format to work on a split.

        When this method is called, the input format is guaranteed to be configured.

        Parameters:
        split - The split to be opened.
        Throws:
        IOException - Thrown, if the split could not be opened due to an I/O problem.
      • reachedEnd

        boolean reachedEnd()
                    throws IOException
        Method used to check if the end of the input is reached.

        When this method is called, the input format is guaranteed to be opened.

        Returns:
        True if the end is reached, otherwise false.
        Throws:
        IOException - Thrown, if an I/O error occurred.
      • nextRecord

        OT nextRecord(OT reuse)
               throws IOException
        Reads the next record from the input.

        When this method is called, the input format is guaranteed to be opened.

        Parameters:
        reuse - Object that may be reused.
        Returns:
        Read record.
        Throws:
        IOException - Thrown, if an I/O error occurred.
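
        The reuse parameter can be illustrated with a mutable record type. MutablePair and PairReaderSketch are hypothetical names, and the reader works on an in-memory array instead of a real split. The sketch assumes the format may write into the passed-in object and return it, so a caller that keeps records beyond the next call copies them first.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical mutable record type.
final class MutablePair {
    int key;
    String value;

    MutablePair copy() {
        MutablePair p = new MutablePair();
        p.key = key;
        p.value = value;
        return p;
    }
}

// Hypothetical reader that mirrors reachedEnd() and nextRecord(reuse).
final class PairReaderSketch {
    private final String[] values;
    private int pos;

    PairReaderSketch(String... values) {
        this.values = values;
    }

    boolean reachedEnd() {
        return pos >= values.length;
    }

    MutablePair nextRecord(MutablePair reuse) {
        // Fill the caller's object if one was provided, otherwise allocate a new one.
        MutablePair target = (reuse != null) ? reuse : new MutablePair();
        target.key = pos;
        target.value = values[pos];
        pos++;
        return target;
    }

    static List<MutablePair> readAll(PairReaderSketch reader) {
        List<MutablePair> out = new ArrayList<>();
        MutablePair reuse = new MutablePair();
        while (!reader.reachedEnd()) {
            MutablePair record = reader.nextRecord(reuse);
            // The same object is overwritten on the next call, so store a copy.
            out.add(record.copy());
        }
        return out;
    }

    public static void main(String[] args) {
        for (MutablePair p : readAll(new PairReaderSketch("a", "b", "c"))) {
            System.out.println(p.key + " -> " + p.value);
        }
    }
}
```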
      • close

        void close()
            throws IOException
        Method that marks the end of the life-cycle of an input split. Should be used to close channels and streams and release resources. After this method returns without an error, the input is assumed to be correctly read.

        When this method is called, the input format is guaranteed to be opened.

        Throws:
        IOException - Thrown, if the input could not be closed properly.