Package org.apache.flink.api.common.io
Class FileInputFormat<OT>
- java.lang.Object
-
- org.apache.flink.api.common.io.RichInputFormat<OT,FileInputSplit>
-
- org.apache.flink.api.common.io.FileInputFormat<OT>
-
- All Implemented Interfaces:
Serializable, InputFormat<OT,FileInputSplit>, InputSplitSource<FileInputSplit>
- Direct Known Subclasses:
BinaryInputFormat, DelimitedInputFormat
@Public public abstract class FileInputFormat<OT> extends RichInputFormat<OT,FileInputSplit>
The base class for RichInputFormats that read from files. For specific input types, the InputFormat.nextRecord(Object) and InputFormat.reachedEnd() methods need to be implemented. Additionally, one may override open(FileInputSplit) and close() to change the life cycle behavior.
After the open(FileInputSplit) method has completed, the file input data is available from the stream field.
- See Also:
- Serialized Form
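The life cycle described in the class comment (open a split, read records until the end is reached, then close) can be sketched without Flink on the classpath. The `MiniFormat` interface and the `LineFormat` class below are hypothetical stand-ins that only mirror the method names from this page; they are not Flink types, and the real `nextRecord(Object)` takes a reuse object that this sketch leaves out.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in that mirrors the method names on this page; NOT a Flink type.
interface MiniFormat<T> {
    void open(String split) throws IOException;
    boolean reachedEnd() throws IOException;
    T nextRecord() throws IOException;
    void close() throws IOException;
}

// Reads one line per record from an in-memory "split".
class LineFormat implements MiniFormat<String> {
    private BufferedReader reader; // plays the role of the `stream` field
    private String lookahead;      // next record, or null once the split is exhausted

    @Override
    public void open(String split) throws IOException {
        reader = new BufferedReader(new StringReader(split));
        lookahead = reader.readLine();
    }

    @Override
    public boolean reachedEnd() {
        return lookahead == null;
    }

    @Override
    public String nextRecord() throws IOException {
        String current = lookahead;
        lookahead = reader.readLine();
        return current;
    }

    @Override
    public void close() throws IOException {
        if (reader != null) {
            reader.close();
            reader = null;
        }
    }
}

class LifecycleSketch {
    // Drives one split through open -> (reachedEnd / nextRecord)* -> close.
    static <T> List<T> readSplit(MiniFormat<T> format, String split) {
        List<T> records = new ArrayList<>();
        try {
            format.open(split);
            try {
                while (!format.reachedEnd()) {
                    records.add(format.nextRecord());
                }
            } finally {
                format.close(); // always release the input, even if reading fails
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }
}
```

Calling `LifecycleSketch.readSplit(new LineFormat(), "a\nb\nc")` walks the three lines as three records. In a real subclass the split would be a `FileInputSplit` and the data would come from the `stream` field.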
-
-
Nested Class Summary
Nested Classes (Modifier and Type, Class, Description)
static class FileInputFormat.FileBaseStatistics: Encapsulation of the basic statistics the optimizer obtains about a file.
static class FileInputFormat.InputSplitOpenThread: Obtains a DataInputStream in a thread that is not interrupted.
-
Field Summary
Fields (Modifier and Type, Field, Description)
protected FileInputSplit currentSplit: The current split that this parallel instance must consume.
static String ENUMERATE_NESTED_FILES_FLAG: Deprecated.
protected boolean enumerateNestedFiles: The flag to specify whether recursive traversal of the input directory structure is enabled.
protected Path filePath: Deprecated.
protected static Map<String,InflaterInputStreamFactory<?>> INFLATER_INPUT_STREAM_FACTORIES: A mapping of file extensions to decompression algorithms based on DEFLATE.
protected long minSplitSize: The minimal split size, set by the configure() method.
protected int numSplits: The desired number of splits, as set by the configure() method.
protected long openTimeout: Stream opening timeout.
protected static long READ_WHOLE_SPLIT_FLAG: The splitLength is set to -1L for reading the whole split.
protected long splitLength: The length of the split that this parallel instance must consume.
protected long splitStart: The start of the split that this parallel instance must consume.
protected FSDataInputStream stream: The input stream reading from the input file.
protected boolean unsplittable: Some file input formats are not splittable on a block level (deflate). Therefore, the FileInputFormat can only read whole files.
-
Constructor Summary
Constructors (Modifier, Constructor, Description)
FileInputFormat()
protected FileInputFormat(Path filePath)
-
Method Summary
Methods (Modifier and Type, Method, Description)
boolean acceptFile(FileStatus fileStatus): A simple hook to filter files and directories from the input.
void close(): Closes the file input stream of the input format.
void configure(Configuration parameters): Configures the file input format by reading the file path from the configuration.
FileInputSplit[] createInputSplits(int minNumSplits): Computes the input splits for the file.
protected FSDataInputStream decorateInputStream(FSDataInputStream inputStream, FileInputSplit fileSplit): Wraps or decorates the raw FSDataInputStream for a certain file split, e.g., for decoding.
protected static String extractFileExtension(String fileName): Returns the extension of a file name (!= a path).
Path getFilePath(): Deprecated. Please use getFilePaths() instead.
Path[] getFilePaths(): Returns the paths of all files to be read by the FileInputFormat.
protected FileInputFormat.FileBaseStatistics getFileStats(FileInputFormat.FileBaseStatistics cachedStats, Path[] filePaths, ArrayList<FileStatus> files)
protected FileInputFormat.FileBaseStatistics getFileStats(FileInputFormat.FileBaseStatistics cachedStats, Path filePath, FileSystem fs, ArrayList<FileStatus> files)
protected static InflaterInputStreamFactory<?> getInflaterInputStreamFactory(String fileExtension)
LocatableInputSplitAssigner getInputSplitAssigner(FileInputSplit[] splits): Returns the assigner for the input splits.
long getMinSplitSize()
boolean getNestedFileEnumeration()
int getNumSplits()
long getOpenTimeout()
long getSplitLength(): Gets the length or remaining length of the current split.
long getSplitStart(): Gets the start of the current split.
FileInputFormat.FileBaseStatistics getStatistics(BaseStatistics cachedStats): Obtains basic file statistics containing only file size.
static Set<String> getSupportedCompressionFormats()
void open(FileInputSplit fileSplit): Opens an input stream to the file defined in the input format.
static void registerInflaterInputStreamFactory(String fileExtension, InflaterInputStreamFactory<?> factory): Registers a decompression algorithm through an InflaterInputStreamFactory with a file extension for transparent decompression.
void setFilePath(String filePath)
void setFilePath(Path filePath): Sets a single path of a file to be read.
void setFilePaths(String... filePaths): Sets multiple paths of files to be read.
void setFilePaths(Path... filePaths): Sets multiple paths of files to be read.
void setFilesFilter(FilePathFilter filesFilter)
void setMinSplitSize(long minSplitSize)
void setNestedFileEnumeration(boolean enable)
void setNumSplits(int numSplits)
void setOpenTimeout(long openTimeout)
boolean supportsMultiPaths(): Deprecated. Will be removed for Flink 2.0.
protected boolean testForUnsplittable(FileStatus pathFile)
String toString()
-
Methods inherited from class org.apache.flink.api.common.io.RichInputFormat
closeInputFormat, getRuntimeContext, openInputFormat, setRuntimeContext
-
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
-
Methods inherited from interface org.apache.flink.api.common.io.InputFormat
nextRecord, reachedEnd
-
-
-
-
Field Detail
-
INFLATER_INPUT_STREAM_FACTORIES
protected static final Map<String,InflaterInputStreamFactory<?>> INFLATER_INPUT_STREAM_FACTORIES
A mapping of file extensions to decompression algorithms based on DEFLATE. Such compressions lead to unsplittable files.
-
READ_WHOLE_SPLIT_FLAG
protected static final long READ_WHOLE_SPLIT_FLAG
The splitLength is set to -1L for reading the whole split.
- See Also:
- Constant Field Values
-
stream
protected transient FSDataInputStream stream
The input stream reading from the input file.
-
splitStart
protected transient long splitStart
The start of the split that this parallel instance must consume.
-
splitLength
protected transient long splitLength
The length of the split that this parallel instance must consume.
-
currentSplit
protected transient FileInputSplit currentSplit
The current split that this parallel instance must consume.
-
filePath
@Deprecated protected Path filePath
Deprecated. The path to the file that contains the input.
-
minSplitSize
protected long minSplitSize
The minimal split size, set by the configure() method.
-
numSplits
protected int numSplits
The desired number of splits, as set by the configure() method.
-
openTimeout
protected long openTimeout
Stream opening timeout.
-
unsplittable
protected boolean unsplittable
Some file input formats (for example, deflate-compressed files) are not splittable on a block level. For such inputs, the FileInputFormat can only read whole files.
-
enumerateNestedFiles
protected boolean enumerateNestedFiles
The flag to specify whether recursive traversal of the input directory structure is enabled.
-
ENUMERATE_NESTED_FILES_FLAG
@Deprecated public static final String ENUMERATE_NESTED_FILES_FLAG
Deprecated. The config parameter which defines whether input directories are recursively traversed.
- See Also:
- Constant Field Values
-
-
Constructor Detail
-
FileInputFormat
public FileInputFormat()
-
FileInputFormat
protected FileInputFormat(Path filePath)
-
-
Method Detail
-
registerInflaterInputStreamFactory
public static void registerInflaterInputStreamFactory(String fileExtension, InflaterInputStreamFactory<?> factory)
Registers a decompression algorithm through an InflaterInputStreamFactory with a file extension for transparent decompression.
- Parameters:
fileExtension - of the compressed files
factory - to create an InflaterInputStream that handles the decompression format
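The idea of registering a decompressor per file extension can be illustrated with a small stand-alone registry built on `java.util.zip`. This is only a sketch under assumptions: `DecompressorFactory` and `DecompressionRegistrySketch` are hypothetical names, not Flink's `InflaterInputStreamFactory` or its internal map, and details such as extension case handling are not taken from Flink.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.zip.GZIPOutputStream;

// Hypothetical stand-in for a stream factory; NOT Flink's InflaterInputStreamFactory.
interface DecompressorFactory {
    InputStream create(InputStream raw) throws IOException;
}

class DecompressionRegistrySketch {
    // Maps a file extension (without the dot) to the factory that decodes such files.
    private static final Map<String, DecompressorFactory> FACTORIES = new ConcurrentHashMap<>();

    static void register(String fileExtension, DecompressorFactory factory) {
        FACTORIES.put(fileExtension, factory);
    }

    // Wraps the raw stream if a factory is registered for the file's extension;
    // otherwise returns the raw stream unchanged.
    static InputStream decorate(String fileName, InputStream raw) {
        int dot = fileName.lastIndexOf('.');
        DecompressorFactory factory = dot < 0 ? null : FACTORIES.get(fileName.substring(dot + 1));
        if (factory == null) {
            return raw;
        }
        try {
            return factory.create(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Helper for the usage example: gzip-compresses a string.
    static byte[] gzip(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Helper for the usage example: drains a stream into a UTF-8 string.
    static String readAll(InputStream in) {
        try (InputStream s = in; ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = s.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

After `DecompressionRegistrySketch.register("gz", java.util.zip.GZIPInputStream::new)`, a file named `part-0.gz` is read through the gzip decoder, while `part-0.txt` passes through untouched. That is the sense in which the decompression is "transparent" to the reading code.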
-
getInflaterInputStreamFactory
protected static InflaterInputStreamFactory<?> getInflaterInputStreamFactory(String fileExtension)
-
getSupportedCompressionFormats
@VisibleForTesting public static Set<String> getSupportedCompressionFormats()
-
extractFileExtension
protected static String extractFileExtension(String fileName)
Returns the extension of a file name (a plain name, not a path).
- Returns:
- the extension of the file name or null if there is no extension.
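A minimal sketch of the documented contract (extension of a plain file name, `null` when there is none) might look as follows. `ExtensionSketch` is a hypothetical class written for illustration; it is assumed, not confirmed, that Flink's own method uses the same last-dot rule.

```java
import java.util.Objects;

class ExtensionSketch {
    // Returns the text after the last '.', or null when the name contains no '.'.
    // Assumption: the argument is a bare file name, not a full path.
    static String extractFileExtension(String fileName) {
        Objects.requireNonNull(fileName, "fileName");
        int lastDot = fileName.lastIndexOf('.');
        return lastDot < 0 ? null : fileName.substring(lastDot + 1);
    }
}
```

The restriction to file names matters: with a last-dot rule like this one, a path such as `/data.v2/part` would yield `v2/part`, because the dot sits in a directory name rather than in the file name.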
-
getFilePath
@Deprecated public Path getFilePath()
Deprecated. Please use getFilePaths() instead.
- Returns:
- The path of the file to read.
-
getFilePaths
public Path[] getFilePaths()
Returns the paths of all files to be read by the FileInputFormat.
- Returns:
- The list of all paths to read.
-
setFilePath
public void setFilePath(String filePath)
-
setFilePath
public void setFilePath(Path filePath)
Sets a single path of a file to be read.
- Parameters:
filePath - The path of the file to read.
-
setFilePaths
public void setFilePaths(String... filePaths)
Sets multiple paths of files to be read.
- Parameters:
filePaths - The paths of the files to read.
-
setFilePaths
public void setFilePaths(Path... filePaths)
Sets multiple paths of files to be read.
- Parameters:
filePaths - The paths of the files to read.
-
getMinSplitSize
public long getMinSplitSize()
-
setMinSplitSize
public void setMinSplitSize(long minSplitSize)
-
getNumSplits
public int getNumSplits()
-
setNumSplits
public void setNumSplits(int numSplits)
-
getOpenTimeout
public long getOpenTimeout()
-
setOpenTimeout
public void setOpenTimeout(long openTimeout)
-
setNestedFileEnumeration
public void setNestedFileEnumeration(boolean enable)
-
getNestedFileEnumeration
public boolean getNestedFileEnumeration()
-
getSplitStart
public long getSplitStart()
Gets the start of the current split.
- Returns:
- The start of the split.
-
getSplitLength
public long getSplitLength()
Gets the length or remaining length of the current split.
- Returns:
- The length or remaining length of the current split.
-
setFilesFilter
public void setFilesFilter(FilePathFilter filesFilter)
-
configure
public void configure(Configuration parameters)
Configures the file input format by reading the file path from the configuration.
- Parameters:
parameters - The configuration with all parameters (note: not the Flink config but the TaskConfig).
- See Also:
InputFormat.configure(org.apache.flink.configuration.Configuration)
-
getStatistics
public FileInputFormat.FileBaseStatistics getStatistics(BaseStatistics cachedStats) throws IOException
Obtains basic file statistics containing only file size. If the input is a directory, then the size is the sum of all contained files.
- Parameters:
cachedStats - The statistics that were cached. May be null.
- Returns:
- The base statistics for the input, or null, if not available.
- Throws:
IOException
- See Also:
InputFormat.getStatistics(org.apache.flink.api.common.io.statistics.BaseStatistics)
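The "size of a directory is the sum of its files" rule can be sketched against the local file system with `java.nio.file`. This is an illustration only: `FileSizeSketch` is a hypothetical class, `Path` here is `java.nio.file.Path` rather than Flink's `Path`, and the `nested` switch is an assumption about how nested directories might be treated, not something this page specifies.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

class FileSizeSketch {
    // Total size in bytes: the file's own size, or the sum over the regular files
    // in a directory. With nested == false only direct children are counted.
    static long totalSize(Path input, boolean nested) {
        int maxDepth = nested ? Integer.MAX_VALUE : 1;
        try (Stream<Path> paths = Files.walk(input, maxDepth)) {
            return paths
                    .filter(Files::isRegularFile)
                    .mapToLong(FileSizeSketch::sizeOf)
                    .sum();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static long sizeOf(Path file) {
        try {
            return Files.size(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For a single file, `Files.walk` yields just that file, so the same method covers both the file and the directory case.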
-
getFileStats
protected FileInputFormat.FileBaseStatistics getFileStats(FileInputFormat.FileBaseStatistics cachedStats, Path[] filePaths, ArrayList<FileStatus> files) throws IOException
- Throws:
IOException
-
getFileStats
protected FileInputFormat.FileBaseStatistics getFileStats(FileInputFormat.FileBaseStatistics cachedStats, Path filePath, FileSystem fs, ArrayList<FileStatus> files) throws IOException
- Throws:
IOException
-
getInputSplitAssigner
public LocatableInputSplitAssigner getInputSplitAssigner(FileInputSplit[] splits)
Description copied from interface: InputSplitSource
Returns the assigner for the input splits. The assigner determines which parallel instance of the input format gets which input split.
- Returns:
- The input split assigner.
-
createInputSplits
public FileInputSplit[] createInputSplits(int minNumSplits) throws IOException
Computes the input splits for the file. By default, one file block is one split. If more splits are requested than blocks are available, then a split may be a fraction of a block and splits may cross block boundaries.
- Parameters:
minNumSplits - The minimum desired number of file splits.
- Returns:
- The computed file splits.
- Throws:
IOException
- See Also:
InputFormat.createInputSplits(int)
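The block-versus-split relationship described for createInputSplits can be made concrete with a simplified calculation. The sketch below is an assumption-laden illustration, not Flink's algorithm: `SplitSketch`, its parameters, and the ceiling-division rule are chosen only to show how requesting more splits than blocks leads to splits that are fractions of a block and cross block boundaries.

```java
import java.util.ArrayList;
import java.util.List;

class SplitSketch {
    // Returns {start, length} pairs that cover the byte range [0, fileLength).
    static List<long[]> computeSplits(long fileLength, long blockSize, int minNumSplits, long minSplitSize) {
        if (fileLength <= 0 || blockSize <= 0) {
            throw new IllegalArgumentException("fileLength and blockSize must be positive");
        }
        int wanted = Math.max(1, minNumSplits);
        // Largest split size that still produces at least `wanted` splits (ceiling division).
        long maxSplitSize = (fileLength + wanted - 1) / wanted;
        // One block per split by default; smaller when more splits are wanted than blocks exist,
        // but never below the configured minimum split size.
        long splitSize = Math.max(Math.max(1L, minSplitSize), Math.min(maxSplitSize, blockSize));

        List<long[]> splits = new ArrayList<>();
        for (long start = 0; start < fileLength; start += splitSize) {
            splits.add(new long[] {start, Math.min(splitSize, fileLength - start)});
        }
        return splits;
    }
}
```

With a 256-byte file and 128-byte blocks, asking for one split gives two block-sized splits. Asking for three gives a split size of 86 bytes, so the second split covers bytes 86 to 171 and straddles the block boundary at byte 128.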
-
testForUnsplittable
protected boolean testForUnsplittable(FileStatus pathFile)
-
acceptFile
public boolean acceptFile(FileStatus fileStatus)
A simple hook to filter files and directories from the input. The method may be overridden. Hadoop's FileInputFormat has a similar mechanism and applies the same filters by default.
- Parameters:
fileStatus - The file status to check.
- Returns:
- true, if the given file or directory is accepted
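For orientation, Hadoop's default hidden-file filter skips names that begin with `_` or `.`. The sketch below applies that convention to plain file names; `AcceptFileSketch` is a hypothetical helper that works on strings rather than `FileStatus`, and it leaves out any additional filter configured through setFilesFilter.

```java
import java.util.ArrayList;
import java.util.List;

class AcceptFileSketch {
    // Accepts a name unless it starts with '_' or '.', the Hadoop hidden-file convention.
    static boolean acceptFile(String fileName) {
        return !fileName.startsWith("_") && !fileName.startsWith(".");
    }

    // Keeps only the accepted names, preserving their order.
    static List<String> accepted(List<String> fileNames) {
        List<String> result = new ArrayList<>();
        for (String name : fileNames) {
            if (acceptFile(name)) {
                result.add(name);
            }
        }
        return result;
    }
}
```

Marker files such as `_SUCCESS` and checksum files such as `.part-00000.crc` are dropped this way, while data files like `part-00000` are kept.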
-
open
public void open(FileInputSplit fileSplit) throws IOException
Opens an input stream to the file defined in the input format. The stream is positioned at the beginning of the given split.
The stream is actually opened in an asynchronous thread to make sure any interruptions to the thread working on the input format do not reach the file system.
- Parameters:
fileSplit - The split to be opened.
- Throws:
IOException - Thrown, if the split could not be opened due to an I/O problem.
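The pattern of opening a stream in a separate thread, so that interrupts on the calling thread never reach the I/O call, can be sketched with plain JDK threads. This is not the code of `FileInputFormat.InputSplitOpenThread`; `OpenWithTimeoutSketch` and its `opener` callback are hypothetical, and the timeout plays the role that openTimeout is assumed to play here.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicReference;

class OpenWithTimeoutSketch {
    // Runs `opener` in a helper thread and waits at most `timeoutMillis` for the stream.
    static InputStream open(Callable<InputStream> opener, long timeoutMillis) throws IOException {
        AtomicReference<InputStream> result = new AtomicReference<>();
        AtomicReference<Throwable> failure = new AtomicReference<>();

        Thread worker = new Thread(() -> {
            try {
                result.set(opener.call());
            } catch (Throwable t) {
                failure.set(t);
            }
        }, "open-sketch");
        worker.setDaemon(true);
        worker.start();

        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        boolean interrupted = false;
        try {
            while (worker.isAlive()) {
                long remainingMs = (deadline - System.nanoTime()) / 1_000_000L;
                if (remainingMs <= 0) {
                    // Note: a stream opened after this point is leaked by this sketch.
                    throw new IOException("Opening timed out after " + timeoutMillis + " ms");
                }
                try {
                    worker.join(remainingMs);
                } catch (InterruptedException e) {
                    // Remember the interrupt but keep waiting; it is never forwarded to the worker.
                    interrupted = true;
                }
            }
        } finally {
            if (interrupted) {
                Thread.currentThread().interrupt(); // restore the caller's interrupt status
            }
        }

        if (failure.get() != null) {
            throw new IOException("Opening failed", failure.get());
        }
        return result.get();
    }
}
```

The design point is that only the waiting thread observes the interrupt. The thread that talks to the file system runs to completion or is abandoned after the timeout, so a half-finished file system call is never cut short by an interrupt.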
-
decorateInputStream
protected FSDataInputStream decorateInputStream(FSDataInputStream inputStream, FileInputSplit fileSplit) throws Throwable
Wraps or decorates the raw FSDataInputStream for a certain file split, e.g., for decoding. When overriding this method, also consider adapting testForUnsplittable(org.apache.flink.core.fs.FileStatus) if your stream decoration renders the input file unsplittable. Also consider calling existing superclass implementations.
- Parameters:
inputStream - the input stream to decorate
fileSplit - the file split for which the input stream shall be decorated
- Returns:
- the decorated input stream
- Throws:
Throwable - if the decoration fails
- See Also:
InputStreamFSInputWrapper
-
close
public void close() throws IOException
Closes the file input stream of the input format.
- Throws:
IOException- Thrown, if the input could not be closed properly.
-
supportsMultiPaths
@Deprecated public boolean supportsMultiPaths()
Deprecated. Will be removed for Flink 2.0.
Override this method to support multiple paths. Once this method is removed, all FileInputFormats will have to support multiple paths.
- Returns:
- True if the FileInputFormat supports multiple paths, false otherwise.
-
-