Converting PDF to HTML in Java: A Simple Guide

Daniel Angel
4 min read · Mar 31, 2023

In today’s digital world, there is a high demand for the ability to convert PDF to HTML format. This is because:

  1. Compatibility: HTML is a widely-used format for web content, and converting PDF files to HTML makes them more accessible and compatible with web browsers, mobile devices, and other digital platforms.
  2. Searchability: HTML is plain text markup, so search engines can index it readily and users can search it with any browser or text editor, which makes it easier to find and extract specific information.
  3. Accessibility: Converting PDF files to HTML format can make them more accessible to people with disabilities who use assistive technologies, such as screen readers, to access digital content.
  4. Editing: HTML files can be easily edited and customized using web development tools, whereas PDF files require specialized software for editing.

We will be covering the following steps:

  1. Setting up the project
  2. Creating an HTML generator utility class
  3. Creating a REST controller
  4. Testing our API in Postman

Step 1: Setting up the project

To get started, we need to create a new Spring Boot project. You can use your favorite IDE or build tool to do this.

We will also need to add the Apache PDFBox and Pdf2Dom libraries as dependencies. We can do this by adding the following to our pom.xml file:

  <dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox-tools</artifactId>
    <version>2.0.25</version>
  </dependency>
  <dependency>
    <groupId>net.sf.cssbox</groupId>
    <artifactId>pdf2dom</artifactId>
    <version>2.0.1</version>
  </dependency>

  • org.apache.pdfbox:pdfbox-tools:2.0.25 is an Apache PDFBox module that provides additional tools for working with PDF files, such as extracting text or images, merging or splitting PDF files, or converting files to other formats. It also brings in the core PDFBox library, which we use to load the PDF.
  • net.sf.cssbox:pdf2dom:2.0.1 is Pdf2Dom, a library from the CSSBox project that converts PDF files to an HTML DOM. This allows working with the PDF content as if it were HTML, making it easier to manipulate and analyze.

If you need to perform the reverse process (HTML to PDF), I cover it in a separate tutorial.

Step 2: Creating an HTML generator utility class

Next, we will create a utility class called HtmlGenerator that will handle the conversion of PDF to HTML. This class will have a static method called generateHtmlFromPdf that takes a single parameter, the inputStream of the PDF, and returns the generated HTML as a String.

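import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PrintWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.fit.pdfdom.PDFDomTree;

public class HtmlGenerator {

    // Converts the PDF read from inputStream and returns the HTML as a string.
    public static String generateHtmlFromPdf(InputStream inputStream) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        // try-with-resources closes the writer and the document even if the conversion fails.
        try (PDDocument pdf = PDDocument.load(inputStream);
             Writer output = new PrintWriter(baos, true, StandardCharsets.UTF_8)) {
            new PDFDomTree().writeText(pdf, output);
        }
        return new String(baos.toByteArray(), StandardCharsets.UTF_8);
    }
}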

Let’s break down what we did here:

  • We loaded the PDF from the input stream into a PDDocument.
  • We created a new ByteArrayOutputStream called baos to store the generated HTML.
  • We created a new PrintWriter called output and passed it the ByteArrayOutputStream as well as the encoding to use (UTF-8).
  • We used the PDFDomTree class to parse the PDF and generate the HTML, writing the output to the PrintWriter.
  • We closed the PrintWriter and the PDDocument.
  • Finally, we returned the contents of the ByteArrayOutputStream as a string using the UTF-8 encoding.
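
The buffering pattern in these steps can be tried on its own, without any PDF library. The following is a minimal sketch using only the JDK (Java 10 or later); the class name InMemoryWriterDemo and the capture method are made up for illustration and are not part of the tutorial's project. It writes text through a UTF-8 PrintWriter into a ByteArrayOutputStream and decodes the bytes back into a String, which is what the utility class does with the HTML produced by PDFDomTree.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.PrintWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: shows only the in-memory buffering, not the PDF conversion.
class InMemoryWriterDemo {

    // Writes the text through a UTF-8 PrintWriter into memory and returns what was captured.
    static String capture(String text) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        // Closing the writer (done here by try-with-resources) flushes it into baos.
        try (Writer output = new PrintWriter(baos, true, StandardCharsets.UTF_8)) {
            output.write(text);
        }
        // Decode with the same charset that was used for writing.
        return baos.toString(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // \u00e9 is a non-ASCII character, to show that the encoding round-trips.
        System.out.println(capture("<p>caf\u00e9</p>"));
    }
}
```

Using the same charset on both sides matters: if the writer and the final decoding step disagreed, non-ASCII characters from the PDF would come out garbled.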

With this utility class in place, the API endpoint we build next can return the generated HTML to the client.

That’s it for this step! In the next step, we will create the REST controller that exposes the conversion as an API endpoint.

Step 3: Creating a REST controller

In this step, we define a @PostMapping endpoint that listens for requests to /convert-html. The endpoint expects a MultipartFile named file to be included in the request.

The endpoint then calls the generateHtmlFromPdf method and passes the input stream of the MultipartFile as an argument. The generateHtmlFromPdf method converts the PDF to HTML and returns the HTML string.

Finally, the endpoint returns a ResponseEntity with the HTML string as the body and a 200 OK status code.

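// Place this method inside a class annotated with @RestController.
@PostMapping(value = "/convert-html")
public ResponseEntity<String> convertToHtml(@RequestParam("file") MultipartFile file) throws IOException {
    // Convert the uploaded PDF and return the HTML with a 200 OK status.
    String html = HtmlGenerator.generateHtmlFromPdf(file.getInputStream());
    return ResponseEntity.ok(html);
}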
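
The endpoint above passes whatever is uploaded straight to the converter. As an optional hardening step, it can be worth checking that the upload at least starts like a PDF before converting it. The sketch below is an assumption of mine rather than part of the original project: the class PdfSignatureCheck and the method looksLikePdf are hypothetical names, and the check only looks for the "%PDF-" marker that PDF files normally begin with, so it is a quick sanity test and not a full validation. It uses only the JDK (Java 11 or later).

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper: a quick "does this look like a PDF?" check.
class PdfSignatureCheck {

    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // Returns true when the stream begins with "%PDF-". The stream must support
    // mark/reset so that the peeked bytes can be read again by the converter.
    static boolean looksLikePdf(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(PDF_MAGIC.length);
        try {
            byte[] head = in.readNBytes(PDF_MAGIC.length);
            return Arrays.equals(head, PDF_MAGIC);
        } finally {
            // Rewind so the caller still sees the stream from its first byte.
            in.reset();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] sample = "%PDF-1.7 sample bytes".getBytes(StandardCharsets.US_ASCII);
        InputStream upload = new BufferedInputStream(new ByteArrayInputStream(sample));
        System.out.println(looksLikePdf(upload));
    }
}
```

In the controller, one way to use it would be to wrap file.getInputStream() in a BufferedInputStream, call looksLikePdf on it, and answer with a 400 Bad Request response when the check fails instead of calling generateHtmlFromPdf.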

Step 4: Testing our API in Postman

In Postman, send a POST request to /convert-html with a form-data body that contains the PDF under the key file. Keep in mind that HTML and CSS generated from a PDF will not match the quality of a hand-built web page, so this approach is best suited to scenarios where that is acceptable.
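
As an alternative to Postman, the request can also be scripted. The following is a rough sketch using the JDK's built-in java.net.http client (Java 11 or later). It rests on several assumptions that are not in the tutorial: the application is running at http://localhost:8080, a file called sample.pdf exists in the working directory (or a path is passed as the first argument), and the class name ConvertHtmlClient and the multipartBody helper are made up for illustration. The JDK client has no multipart support of its own, so the body is assembled by hand with a single part named file, matching what the endpoint expects.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical client: posts a PDF to /convert-html and prints the response.
class ConvertHtmlClient {

    // Builds a multipart/form-data body with a single part named "file".
    static byte[] multipartBody(String boundary, String filename, byte[] content) throws IOException {
        String header = "--" + boundary + "\r\n"
                + "Content-Disposition: form-data; name=\"file\"; filename=\"" + filename + "\"\r\n"
                + "Content-Type: application/pdf\r\n\r\n";
        String footer = "\r\n--" + boundary + "--\r\n";
        ByteArrayOutputStream body = new ByteArrayOutputStream();
        body.write(header.getBytes(StandardCharsets.UTF_8));
        body.write(content);
        body.write(footer.getBytes(StandardCharsets.UTF_8));
        return body.toByteArray();
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        // Assumed inputs: a local PDF and an application listening on port 8080.
        Path pdf = Path.of(args.length > 0 ? args[0] : "sample.pdf");
        String boundary = "----pdf2html" + System.nanoTime();

        HttpRequest request = HttpRequest.newBuilder(URI.create("http://localhost:8080/convert-html"))
                .header("Content-Type", "multipart/form-data; boundary=" + boundary)
                .POST(HttpRequest.BodyPublishers.ofByteArray(
                        multipartBody(boundary, pdf.getFileName().toString(), Files.readAllBytes(pdf))))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString(StandardCharsets.UTF_8));

        // Print the status code followed by the generated HTML.
        System.out.println(response.statusCode());
        System.out.println(response.body());
    }
}
```

The part name in the Content-Disposition header has to be file, because that is the request parameter the controller binds the MultipartFile to.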

Content of html:

Preview of result:

Conclusion

In this tutorial, we have learned how to convert PDF to HTML using Java with the Apache PDFBox and Pdf2Dom libraries. We’ve created an HTML-generating utility class that can handle PDF files, and exposed it as a RESTful web service through a Spring REST controller. This example provides a good starting point for anyone who wants to convert PDF to HTML using Java.

If you liked it, leave a round of applause or subscribe for more interesting topics, explained from an easy and fast perspective.
