Documents often arrive in varying orientations, creating problems in automatic processing and data extraction. While many documents adhere to standardized formats, like invoices or intake forms, the inconsistency in orientation disrupts document classification, data extraction, and validation workflows.
This issue is typically caused by human error during scanning, differing departmental standards, or automated systems that mishandle document preparation. Existing solutions, including deep learning-based models and brute-force methods, are either inefficient or computationally expensive.
We evaluated and compared three initial methods before developing a novel, optimized solution for document rotation detection. Here’s a five-step summary of the progression from brute force to our final solution:
Brute Force Approach: In this method, OCR is applied to each document four times, once for each possible orientation (North, South, East, West). While this method extracts the text, it is computationally inefficient, especially for large document volumes, due to the repeated OCR operations.
Deep Learning Model Approach: We trained multiple deep learning models to detect document orientations, achieving up to 94% accuracy. However, this method requires two inference passes per page, one for the orientation model and another for OCR, resulting in high computational costs and longer processing times.
DFTOP Approach (Double Fourier Transform Optimized Process): This approach uses the Fast Fourier Transform (FFT) to analyze text frequency patterns, coupled with OCR for text extraction. This combination reduces the number of OCR operations, improving both efficiency and scalability while maintaining accuracy.
Observations and Insights: We observed distinct rotation patterns in human-scanned documents. These patterns enabled us to anticipate likely rotations based on user behavior, which informed the development of a more efficient detection method that reduces the need for multiple OCR passes.
Implementation Overview: The final approach uses Fourier Transforms for rotation detection, supported by Parseval’s theorem for signal processing, which allows us to simplify computations and determine the correct document orientation before running OCR. This hybrid technique minimizes computational costs and improves throughput.
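To make the cost of the brute-force baseline concrete, here is a minimal sketch of that method. It assumes the page is already a NumPy array and that some OCR engine is wrapped in a callable that returns a confidence score; the name `ocr_confidence` is hypothetical and stands in for whichever OCR library is actually used.

```python
import numpy as np


def brute_force_orientation(page, ocr_confidence):
    """Try all four quarter-turn rotations and keep the best-scoring one.

    `ocr_confidence` is a hypothetical callable (image -> float) wrapping
    a real OCR engine; it is an assumption of this sketch.
    Returns the counter-clockwise rotation, in degrees, that scored best.
    """
    best_angle, best_score = 0, float("-inf")
    for quarter_turns in range(4):
        rotated = np.rot90(page, k=quarter_turns)
        score = ocr_confidence(rotated)  # one full OCR pass per rotation
        if score > best_score:
            best_angle, best_score = 90 * quarter_turns, score
    return best_angle
```

Every page costs four OCR passes regardless of how it was scanned, which is the overhead the later approaches try to avoid.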
Consider a 4-page PDF document with different orientations for each page:
The following table outlines the validation accuracy of several deep learning models used:
* Fourier Transform: Used to analyze the frequency patterns in document text, allowing us to detect the correct orientation without running OCR multiple times.
* Parseval’s Theorem: Applied for signal processing to simplify computations while detecting document alignment through frequency analysis.
We rely on Parseval’s theorem for the Fourier analysis and signal processing. Mathematically, for a continuous function \(f(t)\) with its Fourier transform \(F(\omega)\), Parseval’s theorem can be expressed as:
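For reference, the standard statement of the theorem is given below. The exact constant depends on the transform convention; this form assumes the non-unitary angular-frequency convention \(F(\omega) = \int_{-\infty}^{\infty} f(t)\, e^{-i\omega t}\, dt\), which may differ from the convention used in the original derivation.

```latex
\[
\int_{-\infty}^{\infty} \lvert f(t) \rvert^{2} \, dt
  \;=\; \frac{1}{2\pi} \int_{-\infty}^{\infty} \lvert F(\omega) \rvert^{2} \, d\omega
\]
% Discrete analogue for an N-point DFT X[k] of a sampled signal x[n],
% which is the form that applies to image rows and columns:
\[
\sum_{n=0}^{N-1} \lvert x[n] \rvert^{2}
  \;=\; \frac{1}{N} \sum_{k=0}^{N-1} \lvert X[k] \rvert^{2}
\]
```

In both forms the total signal energy is the same whether measured in the spatial domain or in the frequency domain, which is what lets energy comparisons between directions be made from either representation.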
The probability distribution over the possible orientations of the document pages:
This figure shows the pattern commonly expected from human users of the system: one orientation (usually the correct one) has a much higher probability than the others. The misoriented pages are also biased, so one of the incorrect orientations is more likely than the other two.
The following figure shows:
The figure above illustrates how the algorithm detects text alignment through the following steps:
Step 1 – Grayscale Conversion: Convert each document page to grayscale for normalization.
Step 2 – Resizing: Resize to a standardized dimension (720×720 pixels).
Step 3 – Normalization: Generate vertical and horizontal components by normalizing the image.
Step 4 – Horizontal Frequency Analysis: Apply a Fourier Transform (FFT) along the horizontal direction to extract the horizontal frequency components.
Step 5 – Vertical Frequency Analysis: Apply an FFT along the vertical direction to extract the vertical frequency components.
Step 6 – Correct Document Orientation: Based on the frequency distribution, determine the correct document orientation.
When the text is correctly aligned, the vertical frequency component exhibits higher energy compared to the horizontal component. This phenomenon occurs because the spacing between lines and the lines themselves create a sine wave-like pattern, which is more pronounced in the vertical frequency domain.
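The six steps above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the production implementation: it assumes the "vertical and horizontal components" are row-wise and column-wise mean intensity profiles, uses a nearest-neighbour resize to stay dependency-free, and all function names are made up for the example.

```python
import numpy as np

SIDE = 720  # Step 2: standardized side length in pixels


def to_grayscale(image):
    """Step 1: collapse an RGB(A) array to a single luminance channel."""
    image = np.asarray(image, dtype=np.float64)
    if image.ndim == 3:
        image = image[..., :3] @ np.array([0.299, 0.587, 0.114])
    return image


def resize_nearest(gray, side=SIDE):
    """Step 2: nearest-neighbour resize to side x side (NumPy only)."""
    rows = np.arange(side) * gray.shape[0] // side
    cols = np.arange(side) * gray.shape[1] // side
    return gray[np.ix_(rows, cols)]


def spectral_energy(profile):
    """Sum of squared FFT magnitudes of a 1-D profile, mean (DC) removed."""
    spectrum = np.fft.fft(profile - profile.mean())
    return float(np.sum(np.abs(spectrum) ** 2))


def candidate_rotations(image):
    """Return the pair of rotations consistent with the dominant text axis."""
    gray = resize_nearest(to_grayscale(image))
    span = gray.max() - gray.min()
    # Step 3: scale intensities to [0, 1] (assumed meaning of "normalizing")
    norm = (gray - gray.min()) / span if span > 0 else np.zeros_like(gray)
    # Step 4: profile running left to right (one mean value per column)
    horizontal = spectral_energy(norm.mean(axis=0))
    # Step 5: profile running top to bottom (one mean value per row)
    vertical = spectral_energy(norm.mean(axis=1))
    # Step 6: text lines stacked top to bottom give higher vertical energy
    return (0, 180) if vertical > horizontal else (90, 270)
```

Under these assumptions the frequency analysis narrows four possible orientations down to two; telling 0° from 180° (or 90° from 270°) is left to OCR, which is consistent with needing at most two OCR passes per page.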
The following tables compare the efficiency of the different approaches to document orientation detection.
This table summarizes the number of OCR operations required by each approach based on the probability of different rotations P(rot).
This table shows how the number of OCR operations varies for each approach, and highlights how the DFTOP approach reduces the need for multiple OCR passes.
This table compares the execution time required by each approach for a given document processing task, assuming 1 unit of time per OCR operation.
We can clearly see that the time advantage of DFTOP over Brute Force is:
Also, since the execution time of DFTOP is bounded by 2, it is always more computationally efficient than the DL approach.
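The bound of 2 can be illustrated with a small expected-cost calculation. The sketch below rests on several assumptions that are not taken from the tables: the frequency analysis is treated as free relative to OCR and as always picking the right axis, the more probable orientation within the detected pair is tried first, and a failed OCR pass is recognized as failed. The probabilities in `p_rot` are invented for illustration.

```python
BRUTE_FORCE_PASSES = 4  # one OCR pass per orientation, as described earlier


def expected_ocr_passes_dftop(p):
    """Expected OCR passes per page under the assumptions stated above.

    `p` maps each rotation angle (0, 90, 180, 270) to its probability.
    """
    total = 0.0
    for pair in ((0, 180), (90, 270)):
        # Within the detected axis, try the more likely orientation first.
        first, second = sorted(pair, key=lambda angle: p[angle], reverse=True)
        total += 1 * p[first] + 2 * p[second]
    return total


# Hypothetical distribution: most pages upright, a few misfed.
p_rot = {0: 0.70, 90: 0.10, 180: 0.15, 270: 0.05}
dftop = expected_ocr_passes_dftop(p_rot)
speedup = BRUTE_FORCE_PASSES / dftop
print(f"DFTOP: {dftop:.2f} passes, brute force: {BRUTE_FORCE_PASSES}, "
      f"speedup: {speedup:.2f}x")
```

Because each page needs either one or two passes in this model, the expected cost always lies between 1 and 2 OCR units, and it approaches 1 as the orientation distribution becomes more skewed toward a single orientation per axis.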
Our approach outperformed both the brute force and deep learning models in terms of computational efficiency and accuracy:
This case study demonstrates how our novel Fourier Transform-based approach solved the problem of document rotation detection with significantly fewer computational resources than traditional methods like brute force and deep learning models. It effectively balances accuracy and efficiency, making it scalable for large document volumes.