PDF-table is Java utility library that can be used for parsing tabular data in PDF documents.
Core processing of PDF documents is performed with utilization of Apache PDFBox and OpenCV.
pdf-table requires compiled OpenCV 3.4.2 to work properly:
-
Download OpenCV v3.4.2 from https://github.com/opencv/opencv/releases/tag/3.4.2
-
Unpack it and add to your system PATH:
-
Windows:
<opencv dir>\build\java\x64
-
Linux:
TODO
-
<dependency>
<groupId>com.github.rostrovsky</groupId>
<artifactId>pdf-table</artifactId>
<version>1.0.0</version>
</dependency>
When PDF document page is being parsed, following operations are performed:
-
Page is converted to grayscale image [OpenCV].
-
Binary Inverted Threshold (BIT) is applied to grayscaled image [OpenCV].
-
Contours are detected on BIT image and contour mask is created (additional Canny filtering can be turned on in this step) [OpenCV].
-
Contour mask is XORed with BIT image [OpenCV].
-
Contours are detected once again on XORed image (additional Canny filtering can be turned on in this step) [OpenCV].
-
Final contours are drawn [OpenCV].
-
Bounding rectangles are detected from final contours [OpenCV].
-
PDF is being parsed region-by-region using bounding rectangles coordinates [Apache PDFBox].
Above algorithm is mostly derived from http://stackoverflow.com/a/23106594.
For more information about parsed output, refer to Output format
class SingleThreadParser {
public static void main(String[] args) throws IOException {
PDDocument pdfDoc = PDDocument.load(new File("some.pdf"));
PdfTableReader reader = new PdfTableReader();
List<ParsedTablePage> parsed = reader.parsePdfTablePages(pdfDoc, 1, pdfDoc.getNumberOfPages());
}
}
class MultiThreadParser {
public static void main(String[] args) throws IOException {
final int THREAD_COUNT = 8;
PDDocument pdfDoc = PDDocument.load(new File("some.pdf"));
PdfTableReader reader = new PdfTableReader();
// parse pages simultaneously
ExecutorService executor = Executors.newFixedThreadPool(THREAD_COUNT);
List<Future<ParsedTablePage>> futures = new ArrayList<>();
for (final int pageNum : IntStream.rangeClosed(1, pdfDoc.getNumberOfPages()).toArray()) {
Callable<ParsedTablePage> callable = () -> {
ParsedTablePage page = reader.parsePdfTablePage(pdfDoc, pageNum);
return page;
};
futures.add(executor.submit(callable));
}
// collect parsed pages
List<ParsedTablePage> unsortedParsedPages = new ArrayList<>(pdfDoc.getNumberOfPages());
try {
for (Future<ParsedTablePage> f : futures) {
ParsedTablePage page = f.get();
unsortedParsedPages.add(page.getPageNum() - 1, page);
}
} catch (Exception e) {
throw new RuntimeException(e);
}
// sort pages by pageNum
List<ParsedTablePage> sortedParsedPages = unsortedParsedPages.stream()
.sorted((p1, p2) -> Integer.compare(p1.getPageNum(), p2.getPageNum())).collect(Collectors.toList());
}
}
PDF-Table provides methods for saving PDF pages as PNG images.
Rendering DPI can be modified in PdfTableSettings
(see: Parsing settings).
class SingleThreadPNGDump {
public static void main(String[] args) throws IOException {
PDDocument pdfDoc = PDDocument.load(new File("some.pdf"));
Path outputPath = Paths.get("C:", "some_directory");
PdfTableReader reader = new PdfTableReader();
reader.savePdfPagesAsPNG(pdfDoc, 1, pdfDoc.getNumberOfPages(), outputPath);
}
}
class MultiThreadPNGDump {
public static void main(String[] args) throws IOException {
final int THREAD_COUNT = 8;
Path outputPath = Paths.get("C:", "some_directory");
PDDocument pdfDoc = PDDocument.load(new File("some.pdf"));
PdfTableReader reader = new PdfTableReader();
ExecutorService executor = Executors.newFixedThreadPool(THREAD_COUNT);
List<Future<Boolean>> futures = new ArrayList<>();
for (final int pageNum : IntStream.rangeClosed(1, pdfDoc.getNumberOfPages()).toArray()) {
Callable<Boolean> callable = () -> {
reader.savePdfPageAsPNG(pdfDoc, pageNum, outputPath);
return true;
};
futures.add(executor.submit(callable));
}
try {
for (Future<Boolean> f : futures) {
f.get();
}
} catch (Exception e) {
throw new RuntimeException(e);
}
}
}
When tables in PDF document cannot be parsed correctly with default settings, user can save debug images that show page
at various stages of processing.
Using these images, user can adjust PdfTableSettings
accordingly to achieve desired results
(see: Parsing settings).
class SingleThreadDebugImgsDump {
public static void main(String[] args) throws IOException {
PDDocument pdfDoc = PDDocument.load(new File("some.pdf"));
Path outputPath = Paths.get("C:", "some_directory");
PdfTableReader reader = new PdfTableReader();
reader.savePdfTablePagesDebugImages(pdfDoc, 1, pdfDoc.getNumberOfPages(), outputPath);
}
}
class MultiThreadDebugImgsDump {
public static void main(String[] args) throws IOException {
final int THREAD_COUNT = 8;
Path outputPath = Paths.get("C:", "some_directory");
PDDocument pdfDoc = PDDocument.load(new File("some.pdf"));
PdfTableReader reader = new PdfTableReader();
ExecutorService executor = Executors.newFixedThreadPool(THREAD_COUNT);
List<Future<Boolean>> futures = new ArrayList<>();
for (final int pageNum : IntStream.rangeClosed(1, pdfDoc.getNumberOfPages()).toArray()) {
Callable<Boolean> callable = () -> {
reader.savePdfTablePagesDebugImage(pdfDoc, pageNum, outputPath);
return true;
};
futures.add(executor.submit(callable));
}
try {
for (Future<Boolean> f : futures) {
f.get();
}
} catch (Exception e) {
throw new RuntimeException(e);
}
}
}
PDF rendering and OpenCV filtering settings are stored in PdfTableSettings
object.
Custom settings instance can be passed to PdfTableReader
constructor when non-default values are needed:
(...)
// build settings object
PdfTableSettings settings = PdfTableSettings.getBuilder()
.setCannyFiltering(true)
.setCannyApertureSize(5)
.setCannyThreshold1(40)
.setCannyThreshold2(190.5)
.setPdfRenderingDpi(160)
.build();
// pass settings to reader
PdfTableReader reader = new PdfTableReader(settings);
Each parsed PDF page is being returned as ParsedTablePage
object:
(...)
PDDocument pdfDoc = PDDocument.load(new File("some.pdf"));
PdfTableReader reader = new PdfTableReader();
// first page in document has index == 1, not 0 !
ParsedTablePage firstPage = reader.parsePdfTablePage(pdfDoc, 1);
// getting page number
assert firstPage.getPageNum() == 1;
// rows and cells are zero-indexed just like elements of the List
// getting first row
ParsedTablePage.ParsedTableRow firstRow = firstPage.getRow(0);
// getting third cell in second row
String thirdCellContent = firstPage.getRow(1).getCell(2);
// cell content usually contain <CR><LF> characters,
// so it is recommended to trim them before processing
double thirdCellNumericValue = Double.valueOf(thirdCellContent.trim());