Efficient PDF to Text Conversion: Handling Unprocessed Files
Chapter 1: Introduction
In the process of converting scanned PDF documents into text files, I encountered an issue where my program crashed, leaving several files unprocessed.
If I had relied solely on Python for the conversion, the approach would have been straightforward. I could simply loop through the PDFs to check for existing text files. If a corresponding text file was absent, I would convert the PDF.
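For reference, a minimal sketch of that pure-Python check might look like the following. It assumes each text file shares its PDF's base name and sits in the same folder; the function name `find_unconverted` is hypothetical, and the conversion step itself is left out.

```python
from pathlib import Path


def find_unconverted(folder):
    """Return the PDFs in `folder` that have no same-named .txt file beside them."""
    folder = Path(folder)
    missing = []
    for pdf in sorted(folder.glob("*.pdf")):
        # with_suffix swaps ".pdf" for ".txt" and keeps the rest of the path
        if not pdf.with_suffix(".txt").exists():
            missing.append(pdf)
    return missing
```

Each path returned would then be handed to whatever conversion routine is in use.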
However, I opted for PDFElement to expedite the conversion process, as it proved to be faster than executing a Python script. Having already invested $2000 in this project, I found the $120 cost for PDFElement reasonable.
But how could I effectively handle the unconverted files? Let me walk you through my method utilizing Python.
Section 1.1: Setting Up the Environment
The first step involved importing the necessary packages:
# Used to find names of the files in a given directory
import glob
# Used to strip the directory portion from each file path
import os
# Used collections to identify names that only appeared once
from collections import Counter
# Used to copy files to a secondary directory for processing
import shutil
Next, I gathered a list of all the files within the PDF directory. I printed the length of my files list to ascertain how many files I had to manage.
files = glob.glob(r'D:\pdfs\citibank\pdfs\*.*')
print(len(files))
glob.glob returns each file's full path. I only needed the name without the extension, so I sliced off the last four characters (the dot plus a three-letter extension such as .pdf or .txt) and passed the result to os.path.basename. I collected these names in a list called pdfs.
pdfs = []
for file in files:
    pdfs.append(os.path.basename(file[:-4]))
I then created a Counter object from the pdfs list and used a list comprehension to pick out the names that appeared only once, that is, PDFs with no matching text file. This revealed that 2,758 files still needed converting.
# Counter for the list of pdfs
c = Counter(pdfs)
# List of files occurring once identified using list comprehension
unprocessed = [n for n in c if c[n] == 1]
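To see why a count of one marks an unconverted PDF, here is a small self-contained sketch using made-up file names (the `demo/` paths are purely illustrative). A converted document contributes its base name twice, once for the PDF and once for the text file, while a missed one contributes it only once. The sketch uses os.path.splitext instead of slicing off four characters, so extensions of any length would be handled.

```python
import os
from collections import Counter

# Hypothetical listing: two converted PDFs and one that was missed
sample_files = [
    "demo/jan.pdf", "demo/jan.txt",
    "demo/feb.pdf",
    "demo/mar.pdf", "demo/mar.txt",
]

# Reduce each path to its base name without the extension
names = [os.path.splitext(os.path.basename(path))[0] for path in sample_files]

# Names counted exactly once have a PDF but no text file
counts = Counter(names)
unprocessed_sample = [name for name, count in counts.items() if count == 1]
print(unprocessed_sample)  # ['feb']
```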
Section 1.2: Finding Unprocessed Files
With the list of unprocessed files established, I needed to obtain their full paths. Although I could have concatenated the path and extension due to their uniformity, I chose to implement a for loop for clarity.
I initialized an empty list to hold the files designated for moving, then created a nested for loop. The outer loop iterated through the unprocessed file names, while the inner loop traversed the original file names. If a match was found, the file name was added to moveFiles. While this may not be the most efficient method, it was effective and executed swiftly.
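moveFiles = []
for name in unprocessed:
    for file in files:
        # Compare base names exactly; a substring test (name in file) would
        # also match longer names that merely contain this one
        if os.path.basename(file[:-4]) == name:
            moveFiles.append(file)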
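The concatenation alternative mentioned earlier could be sketched roughly as follows. It assumes every unprocessed file lives in one folder and ends in .pdf; the helper name `build_pdf_paths` and the sample names are hypothetical.

```python
import os


def build_pdf_paths(names, folder, extension=".pdf"):
    """Rebuild full paths from base names, assuming one folder and one extension."""
    return [os.path.join(folder, name + extension) for name in names]


# Illustrative call with made-up names
print(build_pdf_paths(["feb", "apr"], "demo"))
```

This avoids the nested loop entirely, at the cost of trusting that the folder and extension really are uniform.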
Finally, I copied the identified files into a subdirectory and re-launched PDFElement to finalize the conversion.
for file in moveFiles:
    shutil.copy(file, r'D:\pdfs\citibank\pdfs\missed')
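One detail worth guarding against: if the destination folder does not exist yet, shutil.copy treats the target as a file name, so each copy would overwrite the previous one. A minimal sketch that creates the folder first is shown below; the helper name `copy_to_folder` is hypothetical.

```python
import os
import shutil


def copy_to_folder(paths, destination):
    """Copy each file into `destination`, creating the folder if needed."""
    os.makedirs(destination, exist_ok=True)
    copied = []
    for path in paths:
        # shutil.copy returns the path of the newly created file
        copied.append(shutil.copy(path, destination))
    return copied
```

Calling it with moveFiles and the target subdirectory would stand in for the loop above.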
And that’s how I handled the process. Thank you for taking the time to read through my experience.
Chapter 2: Additional Resources
The first video, "Common Words in PDF Files: Let Python do the Reading," delves into how Python can help extract meaningful information from PDF documents.
The second video, "Power Automate Desktop: PDF Extraction and Application Entry," showcases how to automate PDF extraction using Power Automate Desktop, enhancing your data entry tasks.