How to Extract Text and Images from PDF using Python?

This article shows how to use Python to work with PDF (Portable Document Format) files. PDF files can contain text, images, links, audio, and video, and you can also add a hyperlink to a PDF file. So, basically, this article will help you extract text and images from PDF files using Python.

The topics we are covering in this article are given below.

  1. Reading text PDF files.
  2. Reading tables in PDF files.
  3. Extracting images from PDF files.
  4. Write a PDF file

Working with PDF files in Python is easy: there are several libraries for the job, such as PyPDF2, tabula-py, and PyMuPDF. We are going to use some of these libraries in this tutorial. You just need to install the library and run some code in your IDE. So, let’s start with how to extract text and images from PDF using Python.

Reading PDF files

Step -1: Get a sample file

The first thing we need is a .pdf file (sample.pdf) for reading pdf files. After you have the .pdf file to work, let’s get to the coding.

Step -2: Install the required library/module

You need to install a library called PyPDF2 for Python. You can install it by running this command in your terminal:

pip3 install PyPDF2

Step -3: Writing the code

Open your IDE (I am using PyCharm, but you can use a different one like VS Code) and start writing code. Before that, let’s see the steps we need to take:

  • Import the PyPDF2 module.
  • Open the PDF file in binary mode and keep the file object in a variable (PDFfile).
  • Create an object of the PdfReader class.
  • Print the number of pages by calling len() on the reader’s ‘pages’ list (in our PDF file there are 206 pages).
  • Pick the page to read from the same ‘pages’ list. Page numbers start at 0; here we are extracting text from the page at index 85.
  • Call ‘extract_text()’ on that page to extract its text.
  • Lastly, close the PDF file.

Now let’s see the process in Python code:

#import the PyPDF2 module
import PyPDF2

#open the PDF file in binary mode
PDFfile = open('sample.pdf', 'rb')

#create the reader (named PdfFileReader before PyPDF2 3.0)
PDFfilereader = PyPDF2.PdfReader(PDFfile)

#print the number of pages (formerly numPages)
print(len(PDFfilereader.pages))

#select the page by index, starting at 0 (formerly getPage)
pages = PDFfilereader.pages[85]

#extract the text of that page (formerly extractText)
print(pages.extract_text())

#close the PDF file
PDFfile.close()

Output:

206
76pronounced:declareddiscreet......................................................... .........................Complete the Table as shown below. Comprehension.

The first line of the output (206) is the number of pages in the file, and the rest is the text content of the page we specified.
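Since page numbers start at 0, the last valid index in our 206-page file is 205. As a minimal sketch, the helper below checks the page number before reading it. The name extract_page_text is made up for this example, and it assumes a PyPDF2 reader object that exposes `pages` and `extract_text()`, as recent PyPDF2 releases do.

```python
def extract_page_text(reader, page_number):
    """Return the text of one page, using a zero-based page number.

    `reader` is assumed to be a PyPDF2 reader object exposing `pages`
    (a list-like of pages) whose items offer `extract_text()`.
    """
    total = len(reader.pages)
    if not 0 <= page_number < total:
        raise IndexError(
            f"page {page_number} is out of range; the file has {total} pages"
        )
    # Fall back to an empty string if the page yields no text.
    return reader.pages[page_number].extract_text() or ""
```

With a reader created as in the code above, extract_page_text(PDFfilereader, 85) should return the text of the same page.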

Reading tables in PDF files

Step -1: Get a sample file

The first thing we need for reading the table in a pdf file is a .pdf (sample.pdf) file that contains a table. After you have the .pdf file to work, let’s get to the coding.

Step -2: Install the required library/module

Method -1:

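You need to install a library called tabula-py for Python; it helps read tables in a PDF file (tabula-py also needs Java installed on your machine). The code below uses the tabulate library as well, to print the tables. You can install both by running this command in your terminal:

pip3 install tabula-py tabulate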

Open your IDE (I am using PyCharm, but you can use a different one like VS Code) and start writing code. Before that, let’s see the steps we need to take:

  • First, import read_pdf from the tabula library and tabulate from the tabulate library.
  • Second, read the PDF file that contains a table.
  • Finally, print each table that was found.

from tabula import read_pdf
from tabulate import tabulate

#read the tables from the pdf file (read_pdf returns a list of DataFrames)
dfs = read_pdf("sample.pdf", pages="all") #address of pdf file

#print each table
for df in dfs:
    print(tabulate(df, headers="keys"))

You can also read multiple tables as independent tables. You can use the below code to do so:

#import the library
import tabula

#select the pdf file
file = "sample.pdf"

#reading both tables as independent tables
tables = tabula.read_pdf(file, pages=1, multiple_tables=True)
print(tables[0])
print(tables[1])
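Indexing tables[0] and tables[1] directly assumes the page holds at least two tables. As a small sketch, the helper below reports however many tables were found. It assumes each item exposes `shape` as (rows, columns), the way a pandas DataFrame does; the name describe_tables is made up for this example.

```python
def describe_tables(tables):
    """Return one summary line per extracted table.

    Each item is assumed to expose `shape` as (rows, columns),
    the way a pandas DataFrame does.
    """
    lines = []
    for index, table in enumerate(tables):
        rows, columns = table.shape
        lines.append(f"table {index}: {rows} rows x {columns} columns")
    return lines
```

Printing each line of describe_tables(tables) shows how many tables there are before you pick one by index.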

Method -2:

You need to install a library called camelot-py for Python. It helps to read the table in a pdf file. You can install it by running a command in your terminal:

pip3 install camelot-py

Let’s see the steps we need to write the code:

  • Import the Camelot library.
  • Extracting all the tables from the pdf
  • Finally print it.

It’s a very simple process. Just don’t forget to keep the PDF file in the same folder as the Python file, or use its full path instead.
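The three steps above can be sketched as follows. This is a minimal example, assuming camelot’s read_pdf() function with its `pages` argument and the `df` attribute of each extracted table; the helper name extract_tables is hypothetical.

```python
def extract_tables(pdf_path, pages="all"):
    """Return every table Camelot finds in the PDF as a list of DataFrames."""
    # Step 1: import the Camelot library. It is imported inside the
    # function so that this file still loads where Camelot is missing.
    import camelot

    # Step 2: extract all the tables from the requested pages.
    tables = camelot.read_pdf(pdf_path, pages=pages)

    # Each extracted table keeps its data in the `df` attribute.
    return [table.df for table in tables]
```

For the last step, loop over extract_tables('sample.pdf') and print each item.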

Extracting images from PDF files

Step -1: Get a sample file

The first thing we need for extracting the images from PDF files is a .pdf file (sample.pdf) that contains images that you want to extract. After you have the .pdf file to work, let’s get to the coding.

Step -2: Install the required library/module

You need to install a library called PyMuPDF (you can use PyPDF2 as well but this is easier) for Python. You can install it by running a command in your terminal.

pip3 install PyMuPDF Pillow

Step -3: Writing the code

Let’s start writing the code, but before that let’s see the steps we need to take:

  • Import the fitz module (this is the import name of PyMuPDF).
  • Next, create a variable called file and store the name of the file, “sample.pdf”, in it.
  • Then open the PDF file with fitz.open().
  • Then create another variable called image_list by calling get_page_images() on the PDF with a page number (older PyMuPDF versions call this method getPageImageList()).
  • Next, loop over this image list.
  • From each entry, take the xref (the reference number of the image inside the PDF), because it is all we need to get the pixels (each entry also carries other properties of the image, if you want them).
  • Convert the image into a pixmap and store it in a variable called pix.
  • Then use an “if” condition: if the image is grayscale or RGB, save it directly; otherwise convert it to RGB first and then save it.
  • Lastly, print the number of images that were found on the page.

#import the library
import fitz

file = 'sample.pdf'

#open the PDF file
pdf = fitz.open(file)

#list the images on the first page (older versions: getPageImageList)
image_list = pdf.get_page_images(0)

#applying the loop
for image in image_list:
   xref = image[0]
   pix = fitz.Pixmap(pdf, xref)
   if pix.n - pix.alpha < 4:
       #grayscale or RGB: save directly (older versions: writePNG)
       pix.save(f'{xref}.png')
   else:
       #CMYK: convert to RGB first, then save
       pix1 = fitz.Pixmap(fitz.csRGB, pix)
       pix1.save(f'{xref}.png')
       pix1 = None
   pix = None

#print the number of images found
print(len(image_list), 'detected')

Output:

2 detected
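The code above looks at a single page (page 0). As a rough sketch, the helper below walks every page and collects each image reference once, since the same image can be placed on several pages. It assumes a PyMuPDF document that supports len() and the get_page_images() method of recent releases; the name collect_image_xrefs is made up for this example.

```python
def collect_image_xrefs(doc):
    """Return the xref of every image in the document, without repeats."""
    xrefs = []
    for page_number in range(len(doc)):
        for image in doc.get_page_images(page_number):
            xref = image[0]  # the first item of each entry is the xref
            if xref not in xrefs:
                xrefs.append(xref)
    return xrefs
```

Each xref returned by collect_image_xrefs(pdf) can then be turned into a pixmap and saved in the same way as in the loop above.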

Writing PDF files

We’re going to use the FPDF module to write the PDF file. So, install the FPDF module using the command below:

pip3 install fpdf==1.7

Once you’re done with the installation, use the code below to write the PDF file:

from fpdf import FPDF
text = "Hello, this text will be stored in PDF file by GeekyHumans"
pdf = FPDF()
pdf.add_page()
pdf.set_xy(0, 0)
pdf.set_font('arial', 'B', 13.0)
pdf.cell(ln=0, h=5.0, align='L', w=0, txt=text, border=0)
pdf.output('test.pdf', 'F')

Now, you’re good to go with the PDF. A new PDF file will be created in the same folder where your Python code resides.
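A single cell() call writes one line of text. For several lines, a sketch along these lines may help. It assumes the FPDF object’s multi_cell() method, which moves to the start of the next line after each call; the helper name write_lines and the default line height of 8 are choices made for this example.

```python
def write_lines(pdf, lines, line_height=8):
    """Write each string in `lines` on its own line of the current page.

    `pdf` is assumed to be an FPDF object that already has a page
    added and a font set.
    """
    for line in lines:
        # A width of 0 lets the cell stretch to the right margin.
        pdf.multi_cell(0, line_height, line)
```

After pdf.add_page() and pdf.set_font(...), calling write_lines(pdf, ['first line', 'second line']) before pdf.output(...) should put both lines on the page.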

Final Words

In this article, we covered how to extract text and images from PDF using Python. Writing and reading a PDF file can be a tough task as it involves a lot of elements such as text, images, tables, etc. But we made it simple for you to understand the basics of manipulating a PDF file using Python. I hope you understood the code and it was easy for you to implement the same. Please let us know in the comment section if you’re facing any problems or you’re not able to run the code.
