Package org.apache.flink.api.common.io
Class DelimitedInputFormat<OT>
- java.lang.Object
-
- org.apache.flink.api.common.io.RichInputFormat<OT,FileInputSplit>
-
- org.apache.flink.api.common.io.FileInputFormat<OT>
-
- org.apache.flink.api.common.io.DelimitedInputFormat<OT>
-
- All Implemented Interfaces:
Serializable, CheckpointableInputFormat<FileInputSplit,Long>, InputFormat<OT,FileInputSplit>, InputSplitSource<FileInputSplit>
- Direct Known Subclasses:
GenericCsvInputFormat
@Public public abstract class DelimitedInputFormat<OT> extends FileInputFormat<OT> implements CheckpointableInputFormat<FileInputSplit,Long>
Base implementation for input formats that split the input at a delimiter into records. The parsing of the record bytes into the record has to be implemented in the readRecord(Object, byte[], int, int) method.
The default delimiter is the newline character '\n'.
- See Also:
- Serialized Form
-
-
Nested Class Summary
-
Nested classes/interfaces inherited from class org.apache.flink.api.common.io.FileInputFormat
FileInputFormat.FileBaseStatistics, FileInputFormat.InputSplitOpenThread
-
-
Field Summary
Fields
- protected byte[] currBuffer
- protected int currLen
- protected int currOffset
- protected static String RECORD_DELIMITER: The configuration key to set the record delimiter.
-
Fields inherited from class org.apache.flink.api.common.io.FileInputFormat
currentSplit, ENUMERATE_NESTED_FILES_FLAG, enumerateNestedFiles, filePath, INFLATER_INPUT_STREAM_FACTORIES, minSplitSize, numSplits, openTimeout, READ_WHOLE_SPLIT_FLAG, splitLength, splitStart, stream, unsplittable
-
-
Constructor Summary
Constructors
- DelimitedInputFormat()
- protected DelimitedInputFormat(Path filePath, Configuration configuration)
-
Method Summary
Methods
- void close(): Closes the input by releasing all buffers and closing the file input stream.
- void configure(Configuration parameters): Configures this input format by reading the path to the file from the configuration and the string that defines the record delimiter.
- int getBufferSize()
- Charset getCharset(): Get the character set used for the row delimiter.
- Long getCurrentState(): Returns the split currently being read, along with its current state.
- byte[] getDelimiter()
- int getLineLengthLimit()
- int getNumLineSamples()
- FileInputFormat.FileBaseStatistics getStatistics(BaseStatistics cachedStats): Obtains basic file statistics containing only file size.
- protected void initializeSplit(FileInputSplit split, Long state): Initialization method that is called after opening or reopening an input split.
- protected static void loadConfigParameters(Configuration parameters)
- protected static void loadGlobalConfigParams(): Deprecated. Please use loadConfigParameters(Configuration config).
- OT nextRecord(OT record): Reads the next record from the input.
- void open(FileInputSplit split): Opens the given input split.
- boolean reachedEnd(): Checks whether the current split is at its end.
- protected boolean readLine()
- abstract OT readRecord(OT reuse, byte[] bytes, int offset, int numBytes): This function parses the given byte array which represents a serialized record.
- void reopen(FileInputSplit split, Long state): Restores the state of a parallel instance reading from an InputFormat.
- void setBufferSize(int bufferSize)
- void setCharset(String charset): Set the name of the character set used for the row delimiter.
- void setDelimiter(byte[] delimiter)
- void setDelimiter(char delimiter)
- void setDelimiter(String delimiter)
- void setLineLengthLimit(int lineLengthLimit)
- void setNumLineSamples(int numLineSamples)
-
Methods inherited from class org.apache.flink.api.common.io.FileInputFormat
acceptFile, createInputSplits, decorateInputStream, extractFileExtension, getFilePath, getFilePaths, getFileStats, getFileStats, getInflaterInputStreamFactory, getInputSplitAssigner, getMinSplitSize, getNestedFileEnumeration, getNumSplits, getOpenTimeout, getSplitLength, getSplitStart, registerInflaterInputStreamFactory, setFilePath, setFilePath, setFilePaths, setFilePaths, setFilesFilter, setMinSplitSize, setNestedFileEnumeration, setNumSplits, setOpenTimeout, supportsMultiPaths, testForUnsplittable, toString
-
Methods inherited from class org.apache.flink.api.common.io.RichInputFormat
closeInputFormat, getRuntimeContext, openInputFormat, setRuntimeContext
-
-
-
-
Field Detail
-
currBuffer
protected transient byte[] currBuffer
-
currOffset
protected transient int currOffset
-
currLen
protected transient int currLen
-
RECORD_DELIMITER
protected static final String RECORD_DELIMITER
The configuration key to set the record delimiter.
- See Also:
- Constant Field Values
-
-
Constructor Detail
-
DelimitedInputFormat
public DelimitedInputFormat()
-
DelimitedInputFormat
protected DelimitedInputFormat(Path filePath, Configuration configuration)
-
-
Method Detail
-
loadGlobalConfigParams
@Deprecated protected static void loadGlobalConfigParams()
Deprecated. Please use loadConfigParameters(Configuration config).
-
loadConfigParameters
protected static void loadConfigParameters(Configuration parameters)
-
getCharset
@PublicEvolving public Charset getCharset()
Get the character set used for the row delimiter. This is also used by subclasses to interpret field delimiters, comment strings, and for configuring FieldParsers.
- Returns:
- the charset
-
setCharset
@PublicEvolving public void setCharset(String charset)
Set the name of the character set used for the row delimiter. This is also used by subclasses to interpret field delimiters, comment strings, and for configuring FieldParsers.
These fields are interpreted when set. Changing the charset thereafter may cause unexpected results.
- Parameters:
charset - name of the charset
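The warning about changing the charset after the delimiter has been set can be illustrated without Flink: the same delimiter string encodes to different byte sequences under different charsets. The following is a minimal standalone sketch using only the JDK. The class name DelimiterEncodingSketch and the method encodeDelimiter are hypothetical and are not part of this class; the sketch does not claim to show how DelimitedInputFormat encodes its delimiter internally.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DelimiterEncodingSketch {

    // Hypothetical helper: encodes a delimiter string into bytes with the given charset.
    public static byte[] encodeDelimiter(String delimiter, Charset charset) {
        return delimiter.getBytes(charset);
    }

    public static void main(String[] args) {
        byte[] utf8 = encodeDelimiter("\n", StandardCharsets.UTF_8);
        byte[] utf16be = encodeDelimiter("\n", StandardCharsets.UTF_16BE);
        // The newline is one byte in UTF-8 but two bytes in UTF-16BE, so delimiter
        // bytes produced under one charset will not match data read under the other.
        System.out.println(utf8.length);    // prints 1
        System.out.println(utf16be.length); // prints 2
    }
}
```

Under this reading, a cautious approach is to set the charset first and the delimiter afterwards.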
-
getDelimiter
public byte[] getDelimiter()
-
setDelimiter
public void setDelimiter(byte[] delimiter)
-
setDelimiter
public void setDelimiter(char delimiter)
-
setDelimiter
public void setDelimiter(String delimiter)
-
getLineLengthLimit
public int getLineLengthLimit()
-
setLineLengthLimit
public void setLineLengthLimit(int lineLengthLimit)
-
getBufferSize
public int getBufferSize()
-
setBufferSize
public void setBufferSize(int bufferSize)
-
getNumLineSamples
public int getNumLineSamples()
-
setNumLineSamples
public void setNumLineSamples(int numLineSamples)
-
readRecord
public abstract OT readRecord(OT reuse, byte[] bytes, int offset, int numBytes) throws IOException
This function parses the given byte array which represents a serialized record. The function returns a valid record or throws an IOException.
- Parameters:
reuse - An optionally reusable object.
bytes - Binary data of serialized records.
offset - The offset where to start to read the record data.
numBytes - The number of bytes that can be read starting at the offset position.
- Returns:
- Returns the read record if it was successfully deserialized.
- Throws:
IOException - if the record could not be read.
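The (bytes, offset, numBytes) contract means an implementation must read only the indicated range of the buffer, not the whole array. The following standalone sketch mirrors that contract for String records using only the JDK. The class ReadRecordSketch and the method parseRecord are hypothetical names chosen for illustration; a real subclass would override readRecord instead, and the UTF-8 decoding here is an assumption.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class ReadRecordSketch {

    // Mirrors the readRecord(reuse, bytes, offset, numBytes) parameter contract.
    // The 'reuse' argument is ignored here because String is immutable.
    public static String parseRecord(String reuse, byte[] bytes, int offset, int numBytes)
            throws IOException {
        if (offset < 0 || numBytes < 0 || numBytes > bytes.length - offset) {
            throw new IOException("Record range is outside the buffer");
        }
        // Only the bytes in [offset, offset + numBytes) belong to this record.
        return new String(bytes, offset, numBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        byte[] buffer = "alpha\nbeta\n".getBytes(StandardCharsets.UTF_8);
        // The second record starts at offset 6 and is 4 bytes long.
        System.out.println(parseRecord(null, buffer, 6, 4)); // prints "beta"
    }
}
```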
-
configure
public void configure(Configuration parameters)
Configures this input format by reading the path to the file from the configuration and the string that defines the record delimiter.
- Specified by:
configure in interface InputFormat<OT,FileInputSplit>
- Overrides:
configure in class FileInputFormat<OT>
- Parameters:
parameters - The configuration object to read the parameters from.
- See Also:
InputFormat.configure(org.apache.flink.configuration.Configuration)
-
getStatistics
public FileInputFormat.FileBaseStatistics getStatistics(BaseStatistics cachedStats) throws IOException
Description copied from class: FileInputFormat
Obtains basic file statistics containing only file size. If the input is a directory, then the size is the sum of all contained files.
- Specified by:
getStatistics in interface InputFormat<OT,FileInputSplit>
- Overrides:
getStatistics in class FileInputFormat<OT>
- Parameters:
cachedStats - The statistics that were cached. May be null.
- Returns:
- The base statistics for the input, or null, if not available.
- Throws:
IOException
- See Also:
InputFormat.getStatistics(org.apache.flink.api.common.io.statistics.BaseStatistics)
-
open
public void open(FileInputSplit split) throws IOException
Opens the given input split. This method opens the input stream to the specified file, allocates read buffers and positions the stream at the correct position, making sure that any partial record at the beginning is skipped.
- Specified by:
open in interface InputFormat<OT,FileInputSplit>
- Overrides:
open in class FileInputFormat<OT>
- Parameters:
split - The input split to open.
- Throws:
IOException - Thrown, if the split could not be opened due to an I/O problem.
- See Also:
FileInputFormat.open(org.apache.flink.core.fs.FileInputSplit)
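Skipping a partial record at the beginning of a split can be sketched as follows. This standalone example uses only the JDK and operates on an in-memory byte array instead of a file stream. The class SplitBoundarySketch, the method readSplit, and the ownership convention it applies (stated in the code comment) are assumptions made for illustration; they show one common convention for splittable delimited data and are not taken from the implementation of open.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class SplitBoundarySketch {

    // Returns the records owned by the split [start, start + length) of 'data'.
    // Assumed convention: a split that does not begin at offset 0 skips everything up to
    // and including the first delimiter, and every split keeps reading records that begin
    // at or before its end offset, so each record is read by exactly one split.
    public static List<String> readSplit(byte[] data, int start, int length, byte delimiter) {
        int end = start + length;
        int pos = start;
        if (start > 0) {
            // Skip the record that began in (or is owned by) the previous split.
            while (pos < data.length && data[pos] != delimiter) {
                pos++;
            }
            pos++; // step over the delimiter itself
        }
        List<String> records = new ArrayList<>();
        while (pos < data.length && pos <= end) {
            int recordStart = pos;
            while (pos < data.length && data[pos] != delimiter) {
                pos++;
            }
            records.add(new String(data, recordStart, pos - recordStart, StandardCharsets.UTF_8));
            pos++; // step over the delimiter
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = "aa\nbbbb\ncc\n".getBytes(StandardCharsets.UTF_8);
        // The boundary at offset 5 falls in the middle of "bbbb".
        System.out.println(readSplit(data, 0, 5, (byte) '\n')); // prints [aa, bbbb]
        System.out.println(readSplit(data, 5, 6, (byte) '\n')); // prints [cc]
    }
}
```

In this sketch the first split finishes the record that crosses the boundary, and the second split discards its leading fragment, so no record is lost or read twice.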
-
reachedEnd
public boolean reachedEnd()
Checks whether the current split is at its end.
- Specified by:
reachedEnd in interface InputFormat<OT,FileInputSplit>
- Returns:
- True, if the split is at its end, false otherwise.
-
nextRecord
public OT nextRecord(OT record) throws IOException
Description copied from interface: InputFormat
Reads the next record from the input.
When this method is called, the input format is guaranteed to be opened.
- Specified by:
nextRecord in interface InputFormat<OT,FileInputSplit>
- Parameters:
record - Object that may be reused.
- Returns:
- Read record.
- Throws:
IOException - Thrown, if an I/O error occurred.
-
close
public void close() throws IOException
Closes the input by releasing all buffers and closing the file input stream.
- Specified by:
close in interface InputFormat<OT,FileInputSplit>
- Overrides:
close in class FileInputFormat<OT>
- Throws:
IOException - Thrown, if the closing of the file stream causes an I/O error.
-
readLine
protected final boolean readLine() throws IOException
- Throws:
IOException
-
getCurrentState
@PublicEvolving public Long getCurrentState() throws IOException
Description copied from interface: CheckpointableInputFormat
Returns the split currently being read, along with its current state. This will be used to restore the state of the reading channel when recovering from a task failure. In the case of a simple text file, the state can correspond to the last read offset in the split.
- Specified by:
getCurrentState in interface CheckpointableInputFormat<FileInputSplit,Long>
- Returns:
- The state of the channel.
- Throws:
IOException - Thrown if the creation of the state object failed.
-
reopen
@PublicEvolving public void reopen(FileInputSplit split, Long state) throws IOException
Description copied from interface: CheckpointableInputFormat
Restores the state of a parallel instance reading from an InputFormat. This is necessary when recovering from a task failure. When this method is called, the input format is guaranteed to be configured.
NOTE: The caller has to make sure that the provided split is the one to which the state belongs.
- Specified by:
reopen in interface CheckpointableInputFormat<FileInputSplit,Long>
- Parameters:
split - The split to be opened.
state - The state from which to start. This can contain the offset, but also other data, depending on the input format.
- Throws:
IOException
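The relationship between getCurrentState and reopen can be sketched with a plain offset, following the remark that for a simple text file the state can correspond to the last read offset. This standalone example uses only the JDK. The class OffsetCheckpointSketch and its constructor are hypothetical: the constructor stands in for reopen(split, state), and the byte array stands in for the split.

```java
import java.nio.charset.StandardCharsets;

public class OffsetCheckpointSketch {
    private final byte[] data;
    private int pos; // offset of the next unread record

    // Stands in for reopen(split, state): resumes reading at a previously captured offset.
    public OffsetCheckpointSketch(byte[] data, long state) {
        this.data = data;
        this.pos = (int) state;
    }

    // Returns the next newline-delimited record, or null at the end of the data.
    public String nextRecord() {
        if (pos >= data.length) {
            return null;
        }
        int start = pos;
        while (pos < data.length && data[pos] != '\n') {
            pos++;
        }
        String record = new String(data, start, pos - start, StandardCharsets.UTF_8);
        pos++; // step over the delimiter
        return record;
    }

    // The state is simply the offset at which reading should resume.
    public Long getCurrentState() {
        return (long) pos;
    }

    public static void main(String[] args) {
        byte[] data = "r1\nr2\nr3\n".getBytes(StandardCharsets.UTF_8);
        OffsetCheckpointSketch reader = new OffsetCheckpointSketch(data, 0L);
        reader.nextRecord();                        // consumes "r1"
        Long checkpoint = reader.getCurrentState(); // offset 3

        // Simulated recovery: a fresh reader continues from the checkpointed offset.
        OffsetCheckpointSketch restored = new OffsetCheckpointSketch(data, checkpoint);
        System.out.println(restored.nextRecord());  // prints "r2"
    }
}
```

The sketch also shows why the caller must pair the state with the right split: the same offset applied to different data would resume in the wrong place.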
-
initializeSplit
protected void initializeSplit(FileInputSplit split, @Nullable Long state) throws IOException
Initialization method that is called after opening or reopening an input split.
- Parameters:
split - Split that was opened or reopened
state - Checkpointed state if the split was reopened
- Throws:
IOException
-
-