Converting PDF to HTML in Java: A Simple Guide
In today’s digital world, there is a high demand for the ability to convert PDF to HTML format. This is because:
- Compatibility: HTML is a widely used format for web content, and converting PDF files to HTML makes them easier to display in web browsers, on mobile devices, and on other digital platforms.
- Searchability: Unlike PDF files, HTML files are easily searchable by search engines and text editors, which makes it easier for users to find and extract specific information from them.
- Accessibility: Converting PDF files to HTML format can make them more accessible to people with disabilities who use assistive technologies, such as screen readers, to access digital content.
- Editing: HTML files can be easily edited and customized using web development tools, whereas PDF files require specialized software for editing.
We will be covering the following steps:
- Setting up the project
- Creating an HTML generator utility class
- Creating a REST controller
- Testing our API in Postman
Step 1: Setting up the project
To get started, we need to create a new Spring Boot project. You can use your favorite IDE or build tool to do this.
We will also need to add the Apache PDFBox tools and Pdf2Dom libraries as dependencies. We can do this by adding the following dependencies to our pom.xml file:
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox-tools</artifactId>
    <version>2.0.25</version>
</dependency>
<dependency>
    <groupId>net.sf.cssbox</groupId>
    <artifactId>pdf2dom</artifactId>
    <version>2.0.1</version>
</dependency>
- org.apache.pdfbox:pdfbox-tools:2.0.25 is an Apache PDFBox artifact that provides additional tools for working with PDF files, such as extracting text or images, merging or splitting PDF files, or converting files to other formats.
- net.sf.cssbox:pdf2dom:2.0.1 is part of the CSSBox project and provides a tool for converting PDF files to an HTML DOM. This allows working with the PDF content as if it were HTML, making it easier to manipulate and analyze.
If you need to perform the reverse process (HTML to PDF), I cover it in a separate tutorial.
Step 2: Creating an HTML generator utility class
Next, we will create a utility class called HtmlGenerator that will handle the conversion of PDF to HTML. This class will have a static method called generateHtmlFromPdf that takes one parameter, the inputStream of the PDF, and returns the generated HTML as a string.
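import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PrintWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.fit.pdfdom.PDFDomTree;

public class HtmlGenerator {
    // Converts the PDF read from inputStream and returns the HTML as a string
    public static String generateHtmlFromPdf(InputStream inputStream) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        // try-with-resources closes the writer and the document even if parsing fails
        try (PDDocument pdf = PDDocument.load(inputStream);
             Writer output = new PrintWriter(baos, true, StandardCharsets.UTF_8)) {
            PDFDomTree parser = new PDFDomTree();
            parser.writeText(pdf, output);
        }
        return new String(baos.toByteArray(), StandardCharsets.UTF_8);
    }
}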
Let’s break down what we did here:
- We created a new ByteArrayOutputStream called baos to store the generated HTML.
- We created a new PrintWriter called output and passed it the ByteArrayOutputStream as well as the encoding to use (UTF-8).
- We used the PDFDomTree class to parse the PDF and generate the HTML, writing the output to the PrintWriter.
- We closed the PrintWriter and the PDDocument.
- Finally, we returned the contents of the ByteArrayOutputStream as a string using the UTF-8 encoding.
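If you want to try the buffer-and-writer pattern from the list above without a PDF at hand, here is a minimal sketch that uses only the JDK. The class WriterCaptureSketch and the method writeMarkup are hypothetical names invented for this illustration; writeMarkup simply stands in for the call that writes the HTML to the Writer, and it is not part of PDFBox or Pdf2Dom.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.PrintWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

class WriterCaptureSketch {
    // Hypothetical stand-in for the parser call: writes some markup to any Writer.
    static void writeMarkup(String text, Writer out) throws IOException {
        out.write("<p>" + text + "</p>");
    }

    // Captures everything written to the Writer and decodes it with the same charset.
    static String capture(String text) {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try (Writer output = new PrintWriter(baos, true, StandardCharsets.UTF_8)) {
            writeMarkup(text, output);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The writer is closed (and therefore flushed) before the bytes are read.
        return new String(baos.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // A non-ASCII character shows why encoding and decoding must use the same charset.
        System.out.println(capture("caf\u00e9"));
    }
}
```

The important detail is that the same charset is used both when writing and when decoding the bytes; mixing charsets would corrupt any non-ASCII characters in the converted document.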
With this utility class in place, our API endpoint will be able to return the generated HTML to the client.
That’s it for this step! In the next step, we will create the REST controller that exposes the conversion.
Step 3: Creating a REST controller
In this step, we define a @PostMapping endpoint that listens for requests to /convert-html. The endpoint expects a MultipartFile named file to be included in the request.
The endpoint then calls the generateHtmlFromPdf method and passes the input stream of the MultipartFile as an argument. The generateHtmlFromPdf method converts the PDF to HTML and returns the HTML string.
Finally, the endpoint returns a ResponseEntity with the HTML string as the body and a 200 OK status code.
@PostMapping(value = "/convert-html")
public ResponseEntity<String> convertToHtml(@RequestParam MultipartFile file) throws IOException {
    // Convert the uploaded PDF and return the HTML with a 200 OK status
    String html = HtmlGenerator.generateHtmlFromPdf(file.getInputStream());
    return ResponseEntity.ok(html);
}
Step 4: Testing our API in Postman
Keep in mind that the HTML and CSS generated from a PDF will not match the quality of a hand-built web page, so this approach is best suited to scenarios where that is not a problem.
Content of html:
Preview of result:
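If you prefer to call the endpoint from code instead of Postman, the request can be sketched with the JDK's own HttpClient (Java 11 or later). This is only a rough sketch under a few assumptions that are not part of the tutorial: the application is running on localhost:8080, a file called sample.pdf exists in the working directory, and the class name ConvertHtmlClient is made up for this example. The JDK has no built-in multipart support, so the body is assembled by hand.

```java
import java.io.ByteArrayOutputStream;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class ConvertHtmlClient {
    // Builds a multipart/form-data body with a single part named "file",
    // which is the parameter name the /convert-html endpoint expects.
    static byte[] multipartBody(String boundary, String filename, byte[] content) {
        String head = "--" + boundary + "\r\n"
                + "Content-Disposition: form-data; name=\"file\"; filename=\"" + filename + "\"\r\n"
                + "Content-Type: application/pdf\r\n\r\n";
        String tail = "\r\n--" + boundary + "--\r\n";
        ByteArrayOutputStream body = new ByteArrayOutputStream();
        body.writeBytes(head.getBytes(StandardCharsets.UTF_8));
        body.writeBytes(content);
        body.writeBytes(tail.getBytes(StandardCharsets.UTF_8));
        return body.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        // Assumptions: host, port, and sample.pdf are placeholders for your own setup.
        String boundary = "----pdf2html" + System.nanoTime();
        byte[] pdf = Files.readAllBytes(Path.of("sample.pdf"));
        HttpRequest request = HttpRequest.newBuilder(URI.create("http://localhost:8080/convert-html"))
                .header("Content-Type", "multipart/form-data; boundary=" + boundary)
                .POST(HttpRequest.BodyPublishers.ofByteArray(multipartBody(boundary, "sample.pdf", pdf)))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
        System.out.println(response.body());
    }
}
```

The part name in the Content-Disposition header has to be file, because that is the name the controller binds the MultipartFile to.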
Conclusion
In this tutorial, we have learned how to convert PDF to HTML using Java with the Apache PDFBox and Pdf2Dom libraries. We’ve created an HTML-generating utility class that can handle PDF files and exposed it as a RESTful web service using a Spring REST controller. This example provides a good starting point for anyone who wants to convert PDF to HTML using Java.