Convert URL HTML to Markdown Format and Get Page Links

Convert web pages into clean Markdown and extract every link on each page. This n8n workflow reads URLs from your own data source, sends them in batches to the Firecrawl.dev API, and passes the resulting title, description, Markdown content, and links to a destination of your choice. It suits content strategists, SEO professionals, and data analysts who need to archive web pages, analyze link structures, or repurpose web content without manual copying and reformatting. Automating retrieval, conversion, and link extraction cuts manual effort and keeps the output consistent.
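The batching can also be pictured outside n8n. The Python sketch below is not part of the workflow. It only mirrors the numbers found in the workflow JSON on this page: a limit of 40 items per run, batches of 10, and a Wait node set to 45. The names `plan_batches` and `run_batches` are illustrative, and reading the Wait amount as seconds is an assumption.

```python
import time
from typing import Callable, Iterator, List

MAX_ITEMS = 40     # "40 items at a time" Limit node
BATCH_SIZE = 10    # "10 at a time" Split In Batches node
WAIT_SECONDS = 45  # Wait node amount; the unit is assumed to be seconds


def plan_batches(urls: List[str],
                 max_items: int = MAX_ITEMS,
                 batch_size: int = BATCH_SIZE) -> Iterator[List[str]]:
    """Yield successive batches taken from the first max_items URLs."""
    limited = urls[:max_items]
    for start in range(0, len(limited), batch_size):
        yield limited[start:start + batch_size]


def run_batches(urls: List[str],
                handle_batch: Callable[[List[str]], None],
                pause: Callable[[float], None] = time.sleep) -> int:
    """Process each batch, pausing after every one, and return the batch count."""
    count = 0
    for batch in plan_batches(urls):
        handle_batch(batch)
        pause(WAIT_SECONDS)  # the loop waits after every batch, including the last
        count += 1
    return count
```

With 45 input URLs this plan yields four batches of 10. The remaining five URLs are left for a later run, which matches the sticky note's advice to run until complete in batches of 40.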

17 nodes | manual trigger | 198 views | 0 copies | Data

Workflow JSON

{"meta": {"instanceId": "6b6a2db47bdf8371d21090c511052883cc9a3f6af5d0d9d567c702d74a18820e"}, "nodes": [{"id": "f4570aad-db25-4dcd-8589-b1c8335935de", "name": "When clicking \u2018Test workflow\u2019", "type": "n8n-nodes-base.manualTrigger", "position": [-180, 3800], "parameters": {}, "typeVersion": 1}, {"id": "bd481559-85f2-4865-8d85-e50e72369f26", "name": "Wait", "type": "n8n-nodes-base.wait", "position": [940, 3620], "webhookId": "f10708f0-38c6-4c75-b635-37222d5b183a", "parameters": {"amount": 45}, "typeVersion": 1.1}, {"id": "cc9e9947-19e4-47c5-95b0-a631d688a8b6", "name": "Sticky Note36", "type": "n8n-nodes-base.stickyNote", "position": [549.7858793743054, 3709.534654112671], "parameters": {"color": 7, "width": 327.8244990224782, "height": 268.48353140372035, "content": "**40 at a time seems to be the memory limit on my server - run until complete with batches of 40 or increase based on your server memory**\n"}, "typeVersion": 1}, {"id": "9ebbd993-9194-40b1-a98e-352eb3a3f9eb", "name": "Sticky Note28", "type": "n8n-nodes-base.stickyNote", "position": [-50.797941767307435, 3729.028866440868], "parameters": {"color": 7, "width": 574.7594700148138, "height": 248.90718753310907, "content": "**Firecrawl.dev retrieves markdown inc. title, description, links & content. 
First define the URLs you'd like to scrape**\n"}, "typeVersion": 1}, {"id": "71c0f975-c0f9-47ae-a245-f852387ad461", "name": "Connect to your own data source", "type": "n8n-nodes-base.noOp", "position": [1380, 3820], "parameters": {}, "typeVersion": 1}, {"id": "fba918e7-2c88-4de3-a789-cadbf4f2584e", "name": "Get urls from own data source", "type": "n8n-nodes-base.noOp", "position": [0, 3800], "parameters": {}, "typeVersion": 1}, {"id": "221a75eb-0bc8-4747-9ec1-1879c46d9163", "name": "Example fields from data source", "type": "n8n-nodes-base.set", "notes": "Define URLs in array", "position": [200, 3800], "parameters": {"options": {}, "assignments": {"assignments": [{"id": "cc2c6af0-68d3-49eb-85fe-3288d2ed0f6b", "name": "Page", "type": "array", "value": "[\"https://www.automake.io/\", \"https://www.n8n.io/\"]"}]}, "includeOtherFields": true}, "notesInFlow": true, "typeVersion": 3.4}, {"id": "5a914964-e8ef-4064-8ecb-f1866de0d8c6", "name": "Sticky Note33", "type": "n8n-nodes-base.stickyNote", "position": [-40, 4000], "parameters": {"color": 3, "width": 510.3561134140244, "height": 94.13486342358942, "content": "**REQUIRED**\nConnect to your database of urls to input. 
Name the column `Page` like in the `Example fields from data source` node and make sure it has one link per row like `split out page urls`"}, "typeVersion": 1}, {"id": "5c004d5c-afeb-47c9-b30b-eb88880f87b9", "name": "Sticky Note34", "type": "n8n-nodes-base.stickyNote", "position": [900, 4000], "parameters": {"color": 3, "width": 284.87764467541297, "height": 168.68864948728321, "content": "**REQUIRED**\nUpdate the Auth parameter to your own [Firecrawl](https://firecrawl.dev) dev token\n\n**Header Auth parameter**\nname - Authorization\nvalue - your-own-api-key"}, "typeVersion": 1}, {"id": "53d97054-a5e4-4819-bdd9-f8632c33eba2", "name": "Sticky Note35", "type": "n8n-nodes-base.stickyNote", "position": [1360, 4000], "parameters": {"color": 3, "width": 284.87764467541297, "height": 91.91340067739628, "content": "**REQUIRED** \nOutput the data to your own data source e.g. Airtable"}, "typeVersion": 1}, {"id": "357a463f-7581-43ba-8930-af27e4762905", "name": "Sticky Note37", "type": "n8n-nodes-base.stickyNote", "position": [900, 3570.2075673933587], "parameters": {"color": 7, "width": 181.96744211154697, "height": 189.23753199986137, "content": "**Respect API limits (10 requests per min)**\n"}, "typeVersion": 1}, {"id": "77311c67-f50f-427a-87fd-b29b1f542bbc", "name": "40 items at a time", "type": "n8n-nodes-base.limit", "position": [580, 3800], "parameters": {"maxItems": 40}, "typeVersion": 1}, {"id": "43557ab1-4e52-4598-83a9-e39d5afc6de7", "name": "10 at a time", "type": "n8n-nodes-base.splitInBatches", "position": [740, 3800], "parameters": {"options": {}, "batchSize": 10}, "typeVersion": 3}, {"id": "555d52e7-010b-462b-9382-26804493de1c", "name": "Markdown data and Links", "type": "n8n-nodes-base.set", "position": [1160, 3820], "parameters": {"options": {}, "assignments": {"assignments": [{"id": "3a959c64-4c3c-4072-8427-67f6f6ecba1b", "name": "title", "type": "string", "value": "={{ $json.data.metadata.title }}"}, {"id": "d2da0859-a7a0-4c39-913a-150ecb95d075", "name": 
"description", "type": "string", "value": "={{ $json.data.metadata.description }}"}, {"id": "62bd2d76-b78d-4501-a59b-a25ed7b345b0", "name": "content", "type": "string", "value": "={{ $json.data.markdown }}"}, {"id": "d4c712fa-b52a-498f-8abc-26dc72be61f7", "name": "links", "type": "string", "value": "={{ $json.data.links }} "}]}}, "notesInFlow": true, "typeVersion": 3.4}, {"id": "aac948e6-ac86-4cea-be84-f27919d6d936", "name": "Split out page URLs", "type": "n8n-nodes-base.splitOut", "position": [380, 3800], "parameters": {"options": {}, "fieldToSplitOut": "Page"}, "typeVersion": 1}, {"id": "71c5a0d4-540e-4766-ae99-bdc427019dac", "name": "Retrieve Page Markdown and Links", "type": "n8n-nodes-base.httpRequest", "notes": "curl -X POST https://api.firecrawl.dev/v1/scrape \\\n -H 'Content-Type: application/json' \\\n -H 'Authorization: Bearer YOUR_API_KEY' \\\n -d '{\n \"url\": \"https://docs.firecrawl.dev\",\n \"formats\" : [\"markdown\", \"html\"]\n }'\n", "position": [960, 3820], "parameters": {"url": "https://api.firecrawl.dev/v1/scrape", "method": "POST", "options": {}, "jsonBody": "={\n \"url\": \"{{ $json.Page }}\",\n \"formats\" : [\"markdown\", \"links\"]\n} ", "sendBody": true, "sendHeaders": true, "specifyBody": "json", "authentication": "genericCredentialType", "genericAuthType": "httpHeaderAuth", "headerParameters": {"parameters": [{"name": "Content-Type", "value": "application/json"}]}}, "credentials": {"httpHeaderAuth": {"id": "", "name": "[Your httpHeaderAuth]"}}, "retryOnFail": true, "typeVersion": 4.2, "waitBetweenTries": 5000}, {"id": "a2f12929-262e-4354-baa3-f9e3c05ec2eb", "name": "Sticky Note38", "type": "n8n-nodes-base.stickyNote", "position": [-840, 3340], "parameters": {"color": 4, "width": 581.9949654101088, "height": 818.5240734585421, "content": "## Convert URL HTML to Markdown and Get Page Links\n\n## Use Case\nTransform web pages into AI-friendly markdown format:\n- You need to process webpage content for LLM analysis\n- You want to extract 
both content and links from web pages\n- You need clean, formatted text without HTML markup\n- You want to respect API rate limits while crawling pages\n\n## What this Workflow Does\nThe workflow uses Firecrawl.dev API to process webpages:\n- Converts HTML content to markdown format\n- Extracts all links from each webpage\n- Handles API rate limiting automatically\n- Processes URLs in batches from your database\n\n## Setup\n1. Create a [Firecrawl.dev](https://www.firecrawl.dev/) account and get your API key\n2. Add your Firecrawl API key to the HTTP Request node's Authorization header\n3. Connect your URL database to the input node (column name must be \"Page\") or edit the array in `Example fields from data source`\n4. Configure your preferred output database connection\n\n## How to Adjust it to Your Needs\n- Modify input source to pull URLs from different databases\n- Adjust rate limiting parameters if needed\n- Customize output format for your specific use case\n\n\nMade by Simon @ [automake.io](https://automake.io)\n"}, "typeVersion": 1}], "pinData": {}, "connections": {"Wait": {"main": [[{"node": "10 at a time", "type": "main", "index": 0}]]}, "10 at a time": {"main": [null, [{"node": "Retrieve Page Markdown and Links", "type": "main", "index": 0}]]}, "40 items at a time": {"main": [[{"node": "10 at a time", "type": "main", "index": 0}]]}, "Split out page URLs": {"main": [[{"node": "40 items at a time", "type": "main", "index": 0}]]}, "Markdown data and Links": {"main": [[{"node": "Connect to your own data source", "type": "main", "index": 0}]]}, "Get urls from own data source": {"main": [[{"node": "Example fields from data source", "type": "main", "index": 0}]]}, "Connect to your own data source": {"main": [[{"node": "Wait", "type": "main", "index": 0}]]}, "Example fields from data source": {"main": [[{"node": "Split out page URLs", "type": "main", "index": 0}]]}, "Retrieve Page Markdown and Links": {"main": [[{"node": "Markdown data and Links", "type": 
"main", "index": 0}]]}, "When clicking \u2018Test workflow\u2019": {"main": [[{"node": "Get urls from own data source", "type": "main", "index": 0}]]}}}

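For readers who want to see what the `Retrieve Page Markdown and Links` and `Markdown data and Links` nodes amount to, here is a minimal Python sketch under stated assumptions. The endpoint, the request body, and the four output fields are taken from the workflow JSON. The `Bearer` prefix follows the curl example in the node's notes. The function names are hypothetical, and the response shape is inferred only from the expressions in the Set node (`data.metadata.title`, `data.markdown`, `data.links`).

```python
import json
import urllib.request
from typing import Any, Dict

SCRAPE_URL = "https://api.firecrawl.dev/v1/scrape"  # from the HTTP Request node


def build_request(page_url: str, api_key: str) -> urllib.request.Request:
    """Build the POST request that the HTTP Request node sends for one page."""
    body = json.dumps({"url": page_url, "formats": ["markdown", "links"]})
    return urllib.request.Request(
        SCRAPE_URL,
        data=body.encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # prefix per the curl note
        },
    )


def map_response(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Pick the same four fields as the 'Markdown data and Links' node."""
    data = payload.get("data", {})
    metadata = data.get("metadata", {})
    return {
        "title": metadata.get("title"),
        "description": metadata.get("description"),
        "content": data.get("markdown"),
        "links": data.get("links", []),
    }


def scrape(page_url: str, api_key: str) -> Dict[str, Any]:
    """Send one request and return the mapped fields (performs a network call)."""
    with urllib.request.urlopen(build_request(page_url, api_key), timeout=60) as resp:
        return map_response(json.load(resp))
```

One difference from the workflow: the Set node stores `links` as a string, while this sketch keeps it as a list so the links stay easy to iterate over.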
How to Import This Workflow

  1. Copy the workflow JSON above using the Copy Workflow JSON button.
  2. Open your n8n instance and go to Workflows.
  3. Click Import from JSON and paste the copied workflow.
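Before pasting, it can help to confirm that the copied text is still valid JSON and that every connection points at a node that exists. The Python sketch below is one possible check, not an n8n feature. The function name `check_workflow` is hypothetical, and the structure it walks (`nodes`, `connections`, `main`) is taken from the workflow JSON above.

```python
import json
from typing import Dict, List


def check_workflow(text: str) -> Dict[str, List[str]]:
    """Parse workflow JSON and report connection names that match no node."""
    workflow = json.loads(text)  # raises json.JSONDecodeError on a damaged copy
    names = {node["name"] for node in workflow.get("nodes", [])}
    missing = set()
    for source, outputs in workflow.get("connections", {}).items():
        if source not in names:
            missing.add(source)
        for branches in outputs.values():
            for branch in branches:
                # An unused output appears as null in the JSON above.
                for link in branch or []:
                    if link["node"] not in names:
                        missing.add(link["node"])
    return {"nodes": sorted(names), "missing": sorted(missing)}
```

By default `json.loads` rejects raw line breaks inside string values, so a copy that picked up hard line wraps fails here with a `JSONDecodeError` rather than at import time.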

Don't have an n8n instance? Start your free trial at n8nautomation.cloud

