Convert PDF to text
Extract text from a PDF - Supports OCR
PDF-to-Text extracts all the text from a PDF file so that you can analyze it further or use it in applications such as question answering. You can save the extracted text into a knowledge-set to avoid repeating the PDF-to-Text step.
On this page, we introduce the Relevance AI Tool step that converts PDF to text.
How to use the Convert PDF to text step
Add the component
Add the PDF to text converter step to your Tool (check how to get started with creating a tool).
File URL
A PDF-to-text converter requires a file as an input. If your file is publicly accessible on the web (i.e. with no authentication or sign-up requirement), simply provide the URL directly or as a text input. Otherwise, you will need to add a File-to-URL input. In either situation, use the {{variable name}} to provide the data to the converter.
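As a rough illustration of what the {{variable name}} placeholder does, the sketch below substitutes a Tool input into the File URL field. It is not Relevance AI's implementation: the function render_template, the input name pdf_url, and the example URL are all hypothetical and only model the idea of a placeholder being replaced by an input value.

```python
import re


def render_template(template: str, variables: dict) -> str:
    """Replace each {{name}} placeholder with the matching input value.

    Hypothetical helper: it only models the substitution idea and is not
    part of any Relevance AI SDK.
    """
    def substitute(match: re.Match) -> str:
        name = match.group(1).strip()
        if name not in variables:
            raise KeyError(f"No value provided for variable '{name}'")
        return str(variables[name])

    # Non-greedy match so that several placeholders in one string work.
    return re.sub(r"\{\{(.*?)\}\}", substitute, template)


# Assumed Tool input holding a publicly accessible PDF URL.
tool_inputs = {"pdf_url": "https://example.com/report.pdf"}

# The File URL field contains only the placeholder.
file_url_field = render_template("{{pdf_url}}", tool_inputs)
print(file_url_field)
```

Raising an error for an unknown variable name, rather than leaving the placeholder in place, makes a mistyped name visible before the converter receives an invalid URL.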
Use OCR
OCR (optical character recognition) is needed for image-based PDFs (e.g. scanned documents). This option uses more credits, so only activate it for image-based PDFs.
Available converters
- Fast converter: Relevance AI’s default PDF-to-text converter, which is fast and reasonably accurate
- Quality converter: Slower but more accurate than the Fast converter
Follow the links below for more information about:
- How to run a step
- How to delete a step
- How to configure output
- How to configure a default value
- How to move a step in a Tool
- How to duplicate a step
- How to add a condition to a step (i.e. execute only if a condition is met)
- How to loop a step (i.e. run one step multiple times)
Access the step output
The output is a dictionary with two keys, text and number_of_pages, containing the extracted text and the number of pages in the file respectively. The samples below use pdf_to_text, the default name assigned to the step.
Note that a step name is different from the step title. The step title is found at the top left of a step. The step name is shown at the bottom left, in a smaller font and highlighted in green.
pdf_to_text.text
pdf_to_text.number_of_pages
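To make the dotted references above concrete, here is a minimal Python sketch that models the step output as a plain dictionary and looks up both keys. The key names text and number_of_pages and the step name pdf_to_text come from this page. The sample values and the resolve helper are made up for illustration and are not part of the platform.

```python
# Illustrative data only: the shape follows the description above,
# but the values are invented.
step_outputs = {
    "pdf_to_text": {
        "text": "Placeholder text standing in for the extracted PDF content.",
        "number_of_pages": 3,
    }
}


def resolve(reference: str, outputs: dict):
    """Follow a dotted reference such as 'pdf_to_text.text' through nested dicts.

    Hypothetical helper used only to mirror the dotted notation.
    """
    value = outputs
    for key in reference.split("."):
        value = value[key]
    return value


extracted_text = resolve("pdf_to_text.text", step_outputs)
page_count = resolve("pdf_to_text.number_of_pages", step_outputs)
print(f"{page_count} pages, {len(extracted_text)} characters")
```

If the step is renamed, the first segment of each reference changes accordingly, since the reference uses the step name rather than the step title.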
Common errors
Unsupported protocol
An error similar to the one noted below indicates that the provided input is not a valid URL.
Error:
Only HTTP(S) protocols are supported
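One way to catch this error early is to check the value before it reaches the step. The sketch below uses Python's standard urllib.parse module. The function name is_supported_url is hypothetical, and the check only approximates the rule stated in the error message (an HTTP or HTTPS scheme). It does not verify that the file exists or is publicly accessible.

```python
from urllib.parse import urlparse


def is_supported_url(value: str) -> bool:
    """Return True if value looks like an HTTP(S) URL with a host.

    Hypothetical pre-check; the platform's own validation may differ.
    """
    parsed = urlparse(value.strip())
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


candidates = [
    "https://example.com/report.pdf",   # accepted
    "ftp://example.com/report.pdf",     # wrong protocol
    "report.pdf",                       # local file name, not a URL
    "example.com/report.pdf",           # missing protocol
]

for candidate in candidates:
    status = "ok" if is_supported_url(candidate) else "unsupported"
    print(f"{candidate}: {status}")
```

A local file name or a URL without the https:// prefix is a likely cause of this error, and in the local-file case the File-to-URL input described earlier is the appropriate route.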