issue: Unable to get self-hosted Firecrawl Web Loader Engine to work #4941
Originally created by @MikeNatC on GitHub (Apr 23, 2025).
Check Existing Issues
Installation Method
Docker
Open WebUI Version
v0.6.5
Ollama Version (if applicable)
N.A.
Operating System
Unraid
Browser (if applicable)
Chrome Version 135.0.7049.85 (Official Build) (64-bit)
Confirmation
Expected Behavior
Open WebUI uses the Firecrawl Web Loader Engine to scrape the web pages returned by SearXNG and uses them to provide context to the chat.
Actual Behavior
Open WebUI indicates 'No search results found' and completes the chat without any search results.
Steps to Reproduce
Set the Web Loader Engine to Firecrawl and insert an API key based on the `TEST_API_KEY` env variable used in the Firecrawl container.
Logs & Screenshots
I didn't include the browser console logs because they didn't show any errors.
This is my Open WebUI Docker container log. Note that I used the `diff` language and the `-` marker to highlight the error messages.
@tth37 commented on GitHub (Apr 23, 2025):
It appears there's an issue with the FirecrawlLoader where the loaded documents don't include a `source` key in the metadata. Instead, an `og:url` is being returned. Would you mind sharing your Firecrawl URL with me personally? I'll work on resolving this issue. (Alternatively, sharing the steps to set up a Firecrawl Docker container that replicates your case would also be helpful.)
My email: xgpsthd0902@outlook.com
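As a hedged illustration of the symptom described above (not the actual change that later landed in Open WebUI), one way to normalize such documents would be to copy the Open Graph URL into a `source` key when it is missing; the exact metadata key names (`og:url` / `ogUrl`) are assumptions about what the loader returns:

```python
# Illustrative sketch only; the metadata key names are assumptions.
from langchain_core.documents import Document


def ensure_source_metadata(docs: list[Document]) -> list[Document]:
    """Backfill metadata['source'] from the Open Graph URL when the loader omitted it."""
    for doc in docs:
        if "source" not in doc.metadata:
            og_url = doc.metadata.get("og:url") or doc.metadata.get("ogUrl")
            if og_url:
                doc.metadata["source"] = og_url
    return docs


# A document shaped like the one described in this comment: og:url present, source missing.
docs = [Document(page_content="…", metadata={"og:url": "https://example.com/page"})]
print(ensure_source_metadata(docs)[0].metadata["source"])
```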
@MikeNatC commented on GitHub (Apr 23, 2025):
Thanks for looking into this, @tth37
My Firecrawl Docker container isn't exposed to the internet, and I'm not sure whether providing access via a cloudflared tunnel would complicate things. So I think it might be easier to provide you with the steps I took to create the Docker container.
1. `git clone https://github.com/mendableai/firecrawl.git /path/to/selected/folder/` (this is because one of the services does not have a container image and must be built from the source code).
2. Edit `TEST_API_KEY` to include a default key as follows: `TEST_API_KEY: ${TEST_API_KEY:-default_api_key}`.
3. Create a `.env` file in the folder at the root directory using the following template.
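One hedged way to sanity-check that the self-hosted instance and the key from step 2 work outside Open WebUI is a direct request to the scrape endpoint; the port, endpoint path, and response shape below are assumptions based on a typical self-hosted Firecrawl setup:

```python
# Hedged verification sketch; adjust the base URL, key, and endpoint to your deployment.
import requests

FIRECRAWL_URL = "http://localhost:3002"  # assumed self-hosted base URL
API_KEY = "default_api_key"              # the fallback key from step 2

resp = requests.post(
    f"{FIRECRAWL_URL}/v1/scrape",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"url": "https://example.com", "formats": ["markdown"]},
    timeout=60,
)
resp.raise_for_status()
body = resp.json()
print(body.get("success"), list(body.get("data", {}).keys()))
```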
@tjbck commented on GitHub (Apr 23, 2025):
Should be addressed with 09874ab83d.
@tth37 commented on GitHub (Apr 23, 2025):
@tjbck Thanks!! But what are your thoughts on changing FirecrawlLoader's default mode to 'scrape'? The 'scrape' mode only fetches a single URL, which aligns with the behavior of most other web loaders. Additionally, if 'crawl' mode is selected (especially in a self-hosted Firecrawl instance), it can take an unacceptably long time just to process a single webpage.
09874ab83d/backend/open_webui/retrieval/web/utils.py (L173)
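For illustration, a hedged sketch of constructing the LangChain loader in each mode against a self-hosted instance (the `api_url` and key are placeholders, and `firecrawl-py` must be installed):

```python
# 'scrape' fetches only the given URL; 'crawl' walks the whole site and can be much slower.
from langchain_community.document_loaders import FireCrawlLoader

common = dict(
    api_key="default_api_key",        # placeholder key
    api_url="http://firecrawl:3002",  # placeholder self-hosted endpoint
)

scrape_loader = FireCrawlLoader(url="https://example.com", mode="scrape", **common)
crawl_loader = FireCrawlLoader(url="https://example.com", mode="crawl", **common)

single_page_docs = list(scrape_loader.lazy_load())
# whole_site_docs = list(crawl_loader.lazy_load())  # potentially many pages, very slow
print(len(single_page_docs))
```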
@Xi-Gong commented on GitHub (Apr 24, 2025):
After several tests I found that the issue comes from `crawl` mode: it consumes a lot of credits, and it's too slow even on Firecrawl's official paid plan. Another problem is that Open WebUI cannot use Firecrawl to `scrape` webpages in parallel, which means the processing time is multiplied. The official Firecrawl service does have a feature called Concurrent Browsers, which means you can `scrape` pages in parallel; it would be nice to support that soon.
@ER-EPR commented on GitHub (May 22, 2025):
Single-threaded scraping is still too slow right now; could you provide a pull request to support parallel scraping, please?
@tth37 commented on GitHub (May 22, 2025):
@ER-EPR Certainly! I'll submit a PR to add support for batched parallel crawling in the next few days.
@ER-EPR commented on GitHub (Jun 3, 2025):
I did some investigation today. It seems the LangChain document loader API uses the synchronous lazy_load() and the asynchronous alazy_load() to do the work, and in the Open WebUI code I saw a rate-limiting class. Also, from the self-hosted Firecrawl log, it seems Open WebUI, which uses lazy_load(), has to wait for each result before sending the next URL. If a URL is fake or inaccessible, it takes a long time to return an error. So if Open WebUI could use alazy_load() to send the URLs and, in a separate task, watch for completed scrape requests and start the embedding work, it would be a huge boost compared with the current workflow. Right now it has to wait until all URLs are scraped before embedding can begin, which I think is a huge waste of time.
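A minimal sketch of the workflow described above, assuming a self-hosted Firecrawl endpoint and a hypothetical `embed_document` step (this is not Open WebUI's actual code): each URL gets its own task, `alazy_load()` yields documents asynchronously, and embedding starts as soon as each document arrives instead of after every URL has been scraped:

```python
# Hedged sketch: fan out alazy_load() per URL and embed documents as they arrive.
import asyncio

from langchain_community.document_loaders import FireCrawlLoader


def make_loader(url: str) -> FireCrawlLoader:
    # Placeholder self-hosted endpoint and key.
    return FireCrawlLoader(
        url=url,
        api_key="default_api_key",
        api_url="http://firecrawl:3002",
        mode="scrape",
    )


async def embed_document(doc) -> None:
    # Hypothetical stand-in for the embedding/indexing step.
    print("embedding", doc.metadata.get("source", "unknown"))


async def scrape_and_embed(url: str) -> None:
    # A slow or unreachable URL only blocks its own task, not the whole batch.
    async for doc in make_loader(url).alazy_load():
        await embed_document(doc)


async def main(urls: list[str]) -> None:
    await asyncio.gather(*(scrape_and_embed(u) for u in urls), return_exceptions=True)


asyncio.run(main(["https://example.com", "https://example.org"]))
```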
Firecrawl log: