How to Scrape Any Website with n8n: A Complete 2026 Guide
n8nautomation Team · April 30, 2026
TL;DR: This guide provides a step-by-step tutorial on web scraping with n8n. You'll learn to use the HTTP Request and HTML Extract nodes to pull data from websites, handle pagination to scrape multiple pages, and save the structured data into Google Sheets or a database.
Automating data collection through **n8n web scraping** allows you to extract valuable information directly from websites without manual copy-pasting. Whether you're tracking competitor prices, gathering sales leads, or aggregating news articles, n8n provides the tools to build powerful, automated scrapers. This guide walks you through the entire process, from fetching a web page's content to parsing it and saving the structured data where you need it.
Core Nodes for n8n Web Scraping
At the heart of any web scraping workflow in n8n are two essential nodes: `HTTP Request` and `HTML Extract`. Understanding their roles is the first step to building a successful scraper.
- HTTP Request Node: This is your starting point. Its job is to connect to a specific URL and fetch the raw HTML content of the page, just as a web browser does. You specify the URL you want to scrape and the method (usually GET). It outputs the entire HTML document as a single string.
- HTML Extract Node: Once you have the raw HTML from the HTTP Request node, the HTML Extract node comes into play. It parses this HTML and allows you to pinpoint the exact pieces of data you want to extract using CSS Selectors. You can tell it to grab all headlines, product prices, or image links, and it will return the data in a structured JSON format.
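Conceptually, the two nodes turn "one URL in" into "structured JSON out". Here is a minimal stand-alone sketch of that idea, using a hard-coded HTML snippet in place of the HTTP Request node's output and a naive regex in place of HTML Extract's real CSS-selector engine (the class names mirror the example site used below):

```javascript
// Simplified stand-in for the HTTP Request + HTML Extract pair.
// In a real workflow the HTML comes from the HTTP Request node;
// here it is hard-coded so the example is self-contained.
const html = `
  <div class="card">
    <a class="title">Laptop A</a>
    <h4 class="price">$299.99</h4>
  </div>
  <div class="card">
    <a class="title">Laptop B</a>
    <h4 class="price">$499.99</h4>
  </div>`;

// HTML Extract uses a proper CSS-selector engine; a naive regex is
// enough to illustrate the idea of "class name -> list of text values".
function extractByClass(source, className) {
  const re = new RegExp(`class="${className}"[^>]*>([^<]+)<`, 'g');
  return [...source.matchAll(re)].map((m) => m[1].trim());
}

const result = {
  title: extractByClass(html, 'title'),
  price: extractByClass(html, 'price'),
};
console.log(result);
// { title: ['Laptop A', 'Laptop B'], price: ['$299.99', '$499.99'] }
```

The output shape (one array per extraction key) is the same shape the `HTML Extract` node produces, which is why a restructuring step is needed later before writing rows to a spreadsheet.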
Step-by-Step: Scraping Product Data from a Website
Let's build a practical workflow to scrape product names and prices from a demo e-commerce site. For this example, we'll use Web Scraper Test Sites (https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops).
- Set up the HTTP Request Node:
- Add an `HTTP Request` node to your canvas.
- Set the 'URL' parameter to `https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops`.
- Under 'Options', ensure 'Response Format' is set to 'HTML'.
- Execute the node. You should see the full HTML of the page in the output.
- Configure the HTML Extract Node:
- Add an `HTML Extract` node and connect it after the `HTTP Request` node.
- In the 'Source Data' field, keep the default `Input Field Name`, which is `data`. This tells the node to use the output from the previous step.
- Now we need to find the CSS selectors. Open the target URL in your browser, right-click a product title, and select "Inspect". You'll see it's an `<a>` tag with a class like `title`. The price is inside an `<h4>` with the class `price`.
- In the `HTML Extract` node properties, click 'Add Extraction Value'.
- For the first value (the product title):
  - Key: `title`
  - CSS Selector: `.title`
  - Return Value: `Text`
- Add a second extraction value for the price:
  - Key: `price`
  - CSS Selector: `.price`
  - Return Value: `Text`
- Execute and Check: Run the `HTML Extract` node. The output will be a JSON object containing arrays of all the titles and prices it found on the page.
Handling Pagination to Scrape Multiple Pages
Most websites spread their content across multiple pages. A robust **n8n web scraping** workflow must be able to navigate through this pagination. The most common method is to use a Loop node to increment the page number in the URL. Let's assume the URL structure is `/products?page=1`, `/products?page=2`, etc.
- Start with a Manual Trigger: Begin your workflow with a `Manual` trigger node.
- Set up the Loop:
- Add a `Loop Over Items` node.
- Set the 'Mode' to 'For Loop'.
- Let's say we want to scrape 3 pages. Set 'Start' to `1` and 'End' to `3`. This will make the loop run three times, with the iteration number being 1, 2, and 3.
- Modify the HTTP Request Node:
- Place your existing `HTTP Request` node inside the loop (drag it over the 'Add Node' circle).
- Modify the URL to use an expression that includes the loop's current iteration number. Click the 'Add Expression' button next to the URL field.
- Your new URL expression will look like this: `https://my-example-site.com/products?page={{ $item(0).$node["Loop Over Items"].context["iteration"] }}`. This dynamically inserts the current loop number into the URL.
- Connect the HTML Extract Node: Place the `HTML Extract` node after the `HTTP Request` node *inside* the loop. It will now process each page fetched by the loop. When the workflow runs, it will scrape page 1, then page 2, then page 3, collecting all the data.
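Outside of n8n, the loop's effect is easy to picture: one request URL per iteration. This sketch simulates the three passes with a plain `for` loop (the base URL is the placeholder site from the expression above, not a real endpoint):

```javascript
// Simulate what the Loop Over Items + HTTP Request pair produces:
// one URL per iteration, with the page number inserted dynamically.
const baseUrl = 'https://my-example-site.com/products'; // placeholder site
const pageUrls = [];

for (let iteration = 1; iteration <= 3; iteration++) {
  // In n8n, this interpolation is what the URL expression does on each pass.
  pageUrls.push(`${baseUrl}?page=${iteration}`);
}

console.log(pageUrls);
// ['https://my-example-site.com/products?page=1',
//  'https://my-example-site.com/products?page=2',
//  'https://my-example-site.com/products?page=3']
```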
Tip: Some sites use "infinite scroll". In these cases, you might need to use a tool like Puppeteer or Playwright to control a browser instance, which can be done via the `Execute Command` node in n8n if set up on your server. For most standard pagination, looping is sufficient.
Extracting Structured Data with JSON-LD
Many modern websites embed structured data directly into their HTML using a format called JSON-LD (JSON for Linking Data). This is a gift for web scrapers because it provides clean, machine-readable data without needing complex CSS selectors. Look for a `<script type="application/ld+json">` tag in the page's HTML source. It often contains detailed product info, article metadata, or event details.
- Use an `HTTP Request` node to fetch the page as usual.
- Add an `HTML Extract` node.
- Set up one 'Extraction Value':
  - Key: `jsonData`
  - CSS Selector: `script[type="application/ld+json"]`
  - Return Value: `HTML` (or `Text`)
- The output will be the raw JSON string. Now, add a `Code` node after it to parse this string into a usable JSON object.
```javascript
// Get the JSON string from the previous node
const jsonString = $input.item.json.jsonData;

// Parse the string into a JSON object
const parsedData = JSON.parse(jsonString);

// Return the parsed data
return { json: parsedData };
```
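Scraped pages occasionally ship broken or truncated JSON-LD, and `JSON.parse` throws on malformed input, so a slightly more defensive variant is worth considering. This self-contained sketch uses a hard-coded sample string (in the workflow, the string comes from the `HTML Extract` node's `jsonData` field):

```javascript
// Hard-coded sample of a typical product JSON-LD block; in the workflow
// this string comes from the HTML Extract node.
const jsonData = `{
  "@context": "https://schema.org",
  "@type": "Product",
  "name": "Example Laptop",
  "offers": { "@type": "Offer", "price": "299.99", "priceCurrency": "USD" }
}`;

// Guard the parse: malformed structured data should not crash the workflow.
let parsedData;
try {
  parsedData = JSON.parse(jsonData);
} catch (err) {
  parsedData = null; // skip pages with broken JSON-LD instead of failing
}

console.log(parsedData?.name, parsedData?.offers?.price);
// Example Laptop 299.99
```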
Saving Your Scraped Data to Google Sheets
Once you've extracted your data, you need to store it somewhere useful. A common destination is Google Sheets.
- Restructure the Data: The `HTML Extract` node outputs data in a format like `{ "title": ["Title A", "Title B"], "price": ["$100", "$200"] }`. To add this to a sheet, you need to restructure it into a list of objects, like `[{ "title": "Title A", "price": "$100" }, { "title": "Title B", "price": "$200" }]`. Use a `Code` node for this transformation.
- Connect the Google Sheets Node:
- Add a `Google Sheets` node to your workflow.
- Authenticate with your Google account.
- Set the 'Operation' to 'Append or Update'.
- Select your 'Sheet ID' and 'Sheet Name'.
- Map the columns in your sheet to the data fields from your n8n workflow. For example, map the 'Title' column to the `{{ $json.title }}` expression and the 'Price' column to `{{ $json.price }}`.
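The restructuring described in the first step above can be sketched in plain JavaScript. Inside an actual n8n `Code` node you would read the arrays from `$input.item.json` and return an array of `{ json: ... }` items instead of logging:

```javascript
// Parallel arrays, as produced by the HTML Extract node.
const input = { title: ['Title A', 'Title B'], price: ['$100', '$200'] };

// Zip them into one object per product: the row-like shape a
// spreadsheet append expects.
const rows = input.title.map((title, i) => ({
  title,
  price: input.price[i],
}));

console.log(rows);
// [ { title: 'Title A', price: '$100' },
//   { title: 'Title B', price: '$200' } ]
```

This assumes the two arrays stay in lockstep, which holds when both selectors match once per product card; if a card is missing a price, the arrays drift and a per-card extraction is safer.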
Best Practices for Ethical Web Scraping
Web scraping exists in a legal and ethical gray area. Being a responsible scraper is crucial to avoid getting your IP blocked and to respect website owners.
- Check `robots.txt`: Always check a website's `robots.txt` file (e.g., `example.com/robots.txt`). It outlines which parts of the site the owner prefers bots not to access. While not legally binding, it's a rule of etiquette.
- Scrape at a Reasonable Rate: Don't hammer a server with hundreds of requests per second. This can slow down the site for human users and will likely get you blocked. Introduce a delay between your requests using the `Wait` node in n8n. A few seconds between requests is a good starting point.
- Identify Your Bot: In the `HTTP Request` node, under 'Options', you can add a 'Header' called `User-Agent`. Set its value to something that identifies your bot, like "MyProductScraper/1.0 (Contact: myemail@example.com)". This is a transparent way to show who you are.
- Use Public APIs When Available: Before you resort to scraping, check if the website offers a public API. An API is a much more stable and legitimate way to get the data you need.
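The rate-limiting and identification advice above can be sketched together in a few lines. The bot name and contact address are placeholders, and the actual request is stubbed out so the sketch has no network dependency; in n8n, the delay is what the `Wait` node provides and the header goes in the HTTP Request node's 'Headers' option:

```javascript
// A delay helper, playing the role of the Wait node between requests.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// An identifying User-Agent header (placeholder name and contact).
const politeHeaders = {
  'User-Agent': 'MyProductScraper/1.0 (Contact: myemail@example.com)',
};

// Fetch a list of URLs one at a time, pausing between requests.
async function fetchPolitely(urls, delayMs = 2000) {
  const results = [];
  for (const url of urls) {
    // Real request would be:
    // results.push(await fetch(url, { headers: politeHeaders }));
    results.push(url); // stubbed so the sketch runs offline
    await sleep(delayMs);
  }
  return results;
}
```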