Reading PDF Files in Java with Apache PDFBox

PDF (Portable Document Format) is a widely used file format for document exchange. In this article, we'll use Apache PDFBox to load a PDF file in Java and extract its text.

Introduction to Apache PDFBox

Apache PDFBox is an open-source Java library for working with PDF files. It provides a rich set of APIs for parsing, creating, modifying, and extracting data from PDF files. The library is easy to use and well-documented, making it a popular choice for developers working with PDF files in Java.

Setting up the Development Environment

To get started with Apache PDFBox, you'll need a Java Development Kit (JDK) and a development environment installed on your computer. You can download and install the latest JDK from the official Oracle website. For the development environment, you can use any Java Integrated Development Environment (IDE) of your choice, such as Eclipse, IntelliJ IDEA, or NetBeans.

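You can add Apache PDFBox to your project by downloading the library from the official Apache website or, if you're using Maven, by declaring the pdfbox artifact (group ID org.apache.pdfbox) as a dependency in your project's pom.xml. The code in this article calls PDDocument.load, which belongs to the PDFBox 2.x API; PDFBox 3.x moved document loading to the Loader class, so use a 2.x release to follow along.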
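For Maven, the dependency entry might look like the sketch below. The group ID and artifact ID are PDFBox's published coordinates; the version number shown is an assumption chosen from the 2.0.x line, so substitute whichever 2.x release is current for your project.

```xml
<dependency>
  <groupId>org.apache.pdfbox</groupId>
  <artifactId>pdfbox</artifactId>
  <!-- Assumed version: pick the 2.0.x release you actually want to use -->
  <version>2.0.30</version>
</dependency>
```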


Reading PDF Files with Apache PDFBox

With the development environment set up, we can now write some code to read PDF files in Java. The following example shows how to read a PDF file using Apache PDFBox:

import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PDFReader {
  public static void main(String[] args) {
    try {
      // Load the PDF document
      File file = new File("sample.pdf");
      PDDocument document = PDDocument.load(file);
      // Extract text from the PDF document
      PDFTextStripper pdfStripper = new PDFTextStripper();
      String text = pdfStripper.getText(document);
      // Print the extracted text
      System.out.println(text);
      // Close the PDF document
      document.close();
    } catch (IOException e) {
      // Report a failure to load the file or extract its text
      e.printStackTrace();
    }
  }
}

The code starts by loading the PDF document using the PDDocument.load method. The PDFTextStripper class is then used to extract the text from the document, and the extracted text is printed to the console using the System.out.println method. Finally, the document.close method is used to close the PDF document and release the resources used by PDFBox.
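One caveat with this flow: document.close is the last step inside the try block, so it is skipped if text extraction throws an exception. Because PDDocument implements Closeable, the same steps can be wrapped in a try-with-resources statement, which closes the document on every exit path. The following is a minimal sketch under the assumption that PDFBox 2.x is on the classpath; the class name SafePdfReader and the helper extractText are hypothetical names introduced for illustration, not part of PDFBox.

```java
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// Hypothetical class name, used only for this sketch
class SafePdfReader {

  // Hypothetical helper (not a PDFBox API): returns all text found in the file
  public static String extractText(File file) throws IOException {
    // try-with-resources closes the document even if getText throws
    try (PDDocument document = PDDocument.load(file)) {
      PDFTextStripper stripper = new PDFTextStripper();
      return stripper.getText(document);
    }
  }

  public static void main(String[] args) {
    // Use the first argument as the path, falling back to sample.pdf
    File file = new File(args.length > 0 ? args[0] : "sample.pdf");
    try {
      System.out.println(extractText(file));
    } catch (IOException e) {
      System.err.println("Could not read " + file + ": " + e.getMessage());
    }
  }
}
```

Moving the extraction into its own method also lets callers decide how to handle an IOException, rather than having it printed inside the reading code.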


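Conclusion

In this article, we learned how to read PDF files in Java with Apache PDFBox: load the document with PDDocument.load, extract its text with PDFTextStripper, and close the document when you're done.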