[PR #977] [CLOSED] Adding weboob in the Web Crawling section #882

Closed
opened 2025-11-06 13:04:36 -06:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/vinta/awesome-python/pull/977
Author: @Mistress-Anna
Created: 11/15/2017
Status: Closed

Base: master ← Head: patch-1


📝 Commits (1)

  • 75fc90d Adding weboob in the Web Crawling section

📊 Changes

1 file changed (+1 additions, -0 deletions)

📝 README.md (+1 -0)

📄 Description

What is this Python project?

WebOOB is a framework for scraping websites and aggregating data from multiple websites.

What's the difference between this Python project and similar ones?

  • Routing model that maps URL patterns to Page classes, with all the parsing for each page kept in its own class, for cleaner code
  • Scraping is made easy thanks to "declarative parsing": each Page can declare a few XPaths, configure a few "filters" to apply to those XPaths (such as parsing an int or applying a regex), and you're set!
  • Like every high-level feature in WebOOB, declarative parsing can be disabled locally when it doesn't fit a particular site, and it's always possible to fall back to plain old procedural parsing code
  • Pagination handling, with support for infinite iterators
  • Typed data models to ensure clean scraped data
  • Can handle HTML/XML, JSON, and even XLS or PDF
  • (Optional) Can aggregate data from multiple websites by grouping them into categories (for example "video sites", "banking sites", "public transport sites", "event sites", etc.)
  • Comes with ~250 built-in website crawling backends
  • Has a few graphical and command-line apps to explore and search the scraped data
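
To make the first two points more concrete, here is a minimal, standard-library-only sketch of the general idea: URL patterns routed to page classes, each of which declares its fields as an XPath plus a chain of filters. This is not WebOOB's actual API; every name in it (`Field`, `regex`, `Page`, `VideoPage`, `Browser`) is hypothetical and chosen only for illustration, and the XPath support is limited to what `xml.etree.ElementTree` offers.

```python
import re
import xml.etree.ElementTree as ET


class Field:
    """Hypothetical declarative field: an XPath plus filters applied in order."""

    def __init__(self, xpath, *filters):
        self.xpath = xpath
        self.filters = filters

    def extract(self, root):
        node = root.find(self.xpath)
        value = "" if node is None else "".join(node.itertext()).strip()
        for apply_filter in self.filters:
            value = apply_filter(value)
        return value


def regex(pattern):
    """Filter factory: keep the first capture group of `pattern`, or ''."""
    def apply(text):
        match = re.search(pattern, text)
        return match.group(1) if match else ""
    return apply


class Page:
    """Base page: gathers every Field declared directly on the subclass."""

    def __init__(self, document):
        self.root = ET.fromstring(document)

    def parse(self):
        declared = vars(type(self)).items()
        return {
            name: attr.extract(self.root)
            for name, attr in declared
            if isinstance(attr, Field)
        }


class VideoPage(Page):
    # Declarative parsing: no procedural code, only XPaths and filters.
    title = Field(".//h1")
    views = Field(".//span[@class='views']", regex(r"(\d+)"), int)


class Browser:
    """Routes a URL to the first Page class whose pattern matches."""

    routes = [(re.compile(r"/video/\d+$"), VideoPage)]

    def open(self, url, document):
        for pattern, page_cls in self.routes:
            if pattern.search(url):
                return page_cls(document)
        raise LookupError(f"no page registered for {url}")


# Usage: the document is passed in directly, so no network access is needed.
doc = "<html><body><h1> Demo </h1><span class='views'>1234 views</span></body></html>"
page = Browser().open("https://example.com/video/42", doc)
result = page.parse()
```

A site that does not fit this declarative style could override `parse` in its own page class and fall back to procedural code, which mirrors the "can be disabled locally" point above.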

🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2025-11-06 13:04:36 -06:00
Reference: github-starred/awesome-python#882