Text Extractor

xsx | 2025/7/21 | About 1 minute | pdfbox module, advanced features, extractor

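Examples

Extract Text

Notes

The return value is a map whose key is the page index and whose value is the extracted text.
Extraction can be limited to specific pages by page index.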

try (
        // Load the document
        Document document = PdfHandler.getDocumentHandler().load("E:\\PDF\\pdfbox\\extractor\\hello-world.pdf");
        // Get the document extractor
        DocumentExtractor extractor = PdfHandler.getDocumentExtractor(document);
) {
    // Extract the text
    Map<Integer, List<String>> map = extractor.extractText();
    // Print the extracted text, one entry per page
    map.forEach((key, value) -> System.out.println("Page " + key + ": " + value));
}
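
Because the result is keyed by page index, one way to consume it is to visit the pages in ascending order and join each page's entries. The following is a minimal sketch using only the JDK; the map is built by hand as a stand-in for the value returned by `extractor.extractText()`, and `PageTextDemo` and `joinPages` are hypothetical names, not part of the library.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class PageTextDemo {

    // Hypothetical helper: joins the entries of each page, visiting pages in ascending index order.
    static String joinPages(Map<Integer, List<String>> pages) {
        StringBuilder sb = new StringBuilder();
        // Copy into a TreeMap so the output order does not depend on the input map's ordering
        for (Map.Entry<Integer, List<String>> entry : new TreeMap<>(pages).entrySet()) {
            sb.append("Page ").append(entry.getKey()).append(": ")
              .append(String.join(" ", entry.getValue()))
              .append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in data (assumption); in real use this would be the map returned by extractText()
        Map<Integer, List<String>> pages = new HashMap<>();
        pages.put(1, Arrays.asList("second page"));
        pages.put(0, Arrays.asList("Hello", "World"));
        System.out.print(joinPages(pages));
    }
}
```

Copying into a `TreeMap` is a defensive choice: the source does not say which `Map` implementation the extractor returns, so the sketch does not rely on its iteration order.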

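Extract Text by Regular Expression

Notes

The return value is a map whose key is the page index and whose value is the extracted text.
Extraction can be limited to specific pages by page index.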

try (
        // Load the document
        Document document = PdfHandler.getDocumentHandler().load("E:\\PDF\\pdfbox\\extractor\\hello-world.pdf");
        // Get the document extractor
        DocumentExtractor extractor = PdfHandler.getDocumentExtractor(document);
) {
    // Extract the text that matches the regular expression
    Map<Integer, List<String>> map = extractor.extractText(".ll.");
    // Print the extracted text, one entry per page
    map.forEach((key, value) -> System.out.println("Page " + key + ": " + value));
}
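
For reference, the pattern `.ll.` matches any four-character run whose two middle characters are `ll` (by default `.` does not match line terminators). The sketch below demonstrates this with plain `java.util.regex`. It is not the library's implementation, and whether `extractText(String)` applies the pattern with the same find-all semantics is an assumption. `RegexMatchDemo` and `findAll` are illustrative names, and the sample text `Hello World` is only guessed from the file name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexMatchDemo {

    // Collects every non-overlapping substring of text that matches regex, in order of appearance.
    static List<String> findAll(String regex, String text) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Sample text is an assumption about what the PDF might contain
        System.out.println(findAll(".ll.", "Hello World"));
    }
}
```

For the input `Hello World`, the only match is `ello`: one arbitrary character, two `l` characters, then one more arbitrary character.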

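Extract Text by Region

Notes

The return value is a map whose key is the page index and whose value is another map, in which the key is the region name and the value is the extracted text.
Extraction can be limited to specific pages by page index.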

try (
        // Load the document
        Document document = PdfHandler.getDocumentHandler().load("E:\\PDF\\pdfbox\\extractor\\hello-world.pdf");
        // Get the document extractor
        DocumentExtractor extractor = PdfHandler.getDocumentExtractor(document);
) {
    // Get the first page
    Page page = document.getPage(0);
    // Extract the text inside a region named "test" that covers the whole page
    Map<Integer, Map<String, String>> map = extractor.extractTextByRegionArea(
            Collections.singletonMap("test", new Rectangle(0, 0, page.getWidth().intValue(), page.getHeight().intValue())),
            0
    );
    // Print the extracted text, one entry per page
    map.forEach((key, value) -> System.out.println("Page " + key + ": " + value));
}
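
The nested result can be read by first selecting the page index and then the region name. Below is a small JDK-only sketch of such a lookup that tolerates a missing page or region. The data is a hand-built stand-in for the value returned by `extractTextByRegionArea`, and `RegionTextDemo` and `regionText` are hypothetical names, not part of the library.

```java
import java.util.Collections;
import java.util.Map;

class RegionTextDemo {

    // Hypothetical helper: returns the text of a named region on a page,
    // or an empty string when the page or the region is not present.
    static String regionText(Map<Integer, Map<String, String>> result, int pageIndex, String regionName) {
        return result
                .getOrDefault(pageIndex, Collections.<String, String>emptyMap())
                .getOrDefault(regionName, "");
    }

    public static void main(String[] args) {
        // Stand-in data (assumption): page 0 has one region named "test"
        Map<Integer, Map<String, String>> result =
                Collections.singletonMap(0, Collections.singletonMap("test", "Hello World"));
        System.out.println(regionText(result, 0, "test"));
        System.out.println(regionText(result, 1, "test").isEmpty());
    }
}
```

Returning an empty string instead of `null` keeps callers free of null checks; whether the library itself omits empty regions or maps them to empty text is not stated in the source.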