Extract Website Content

There is a large amount of text data available on the net which can be an amazing source for tasks such as Q/A, research, and context generation. Relevance provides you with an easy-to-use component for scraping website contents

How to use the Extract website content step

Add the component

You need to add the “Extract website content” step to your Tool (check how to get started with creating a tool). Extract website content step

Website URL

Extract website content step

Directly enter a URL in the box or use the {{ to activate the variable mode. For instance, if the URL is in an input component called my_url, use {{my_url}}. Or if the URL is the output of a Google-Search step, use {{google.organic[0].link}} to access the url in the first search results (FAQ).

Method

You can specify if you wish to scrape data as Text or HTML. By default, Relevance scraps the Text.

Element Selector

You can specify which element from the HTML components to scrape. By default, it is set to body. Note that using + Add new, you can specify a list of elements to be scrapped.

Extra headers

If you need to provide special information to be able to scrape a website, provide the data as a JSON object. The below object shows an example where an authentication token called auth-token and a user-id are required.

{
    "auth-token":"AUTHENTICATION-TOKEN",
    "user-id":"USER-ID"
}

Follow the links below for more information about

How to run a step
How to delete a step
How to configure output
How to configure a default value
How to move a step in a Tool
How to duplicate a step
How to add condition to a step (i.e. execute only if a condition is met)
How to loop a step (i.e. run one step multiple times)

Access the step output

The output is a dictionary with two keys page and selectors containing the extracted text and if any selectors were used. Below you can see samples where the default name assigned to the step scrape is used. Note that a step name is different from the step title. Step titles can be found on the top left of steps. A step name is shown on the bottom left, in smaller font and highlighted green.

scrape.output.page
scrape.output.selectors

Common errors

Wrong URL formatting

This error occurs when the URL field is set to a value that is not of type string. When using the output of another step make sure, you access the URL field. Read more at URL must be a string.

Non-array Elements

When setting up specific elements to be scrapped, make sure to use + Add new to have more than one Element. And if the button is clicked do not leave it as an empty list. Use the x icon to the right of the row to remove the extra line.

Studio transformation browserless_scrape input validation error: must be array {"type":"array"} /element_selector

Invalid URL

This error occurs when the provided URL is not valid. Protocol error (Page.navigate): Cannot navigate to invalid URL

Network issue

This error normally occurs when there are network issues. Make sure your connection is strong, refresh the page and try again. Navigation failed because browser has disconnected!

Time output

This happens wh Navigation timeout of 30000 ms exceeded

Was this page helpful?

PDF to Text Knowledge search

How to use the Extract website content step
Add the component
Website URL
Method
Element Selector
Extra headers
Access the step output
Common errors
Wrong URL formatting
Non-array Elements
Invalid URL
Network issue
Time output

Get started

Agents

Tools and Templates

Use cases

How to build a custom Tool

Data

Notebook

Integrations

SDK

Datasets

How to use the Extract website content step

Add the component

Website URL

Method

Element Selector

Extra headers

Access the step output

Common errors

Wrong URL formatting

Non-array Elements

Invalid URL

Network issue

Time output

Get started

Agents

Tools and Templates

Use cases

How to build a custom Tool

Data

Notebook

Integrations

SDK

Datasets

​How to use the Extract website content step

​Add the component

​Website URL

​Method

​Element Selector

​Extra headers

​Access the step output

​Common errors

​Wrong URL formatting

​Non-array Elements

​Invalid URL

​Network issue

​Time output

How to use the Extract website content step

Add the component

Website URL

Method

Element Selector

Extra headers

Access the step output

Common errors

Wrong URL formatting

Non-array Elements

Invalid URL

Network issue

Time output