Transforms complex documents like PDFs into LLM-ready markdown/JSON for your Agentic workflows.
PyMuPDF is a high-performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
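As a rough illustration of the kind of extraction these libraries offer, here is a minimal PyMuPDF sketch that pulls plain text from each page of a PDF; the file path is a placeholder, not part of the original listing.

```python
# Minimal sketch: extract page text from a PDF with PyMuPDF.
import fitz  # PyMuPDF's import name

doc = fitz.open("sample.pdf")  # placeholder path
pages = []
for page in doc:
    # get_text() returns the page's plain text; other modes
    # ("blocks", "dict", "html") expose layout information.
    pages.append(page.get_text())
doc.close()

print("\n\n".join(pages))
```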
Web Crawler/Spider for NodeJS + server-side jQuery ;-)
Meltano: the declarative code-first data integration engine that powers your wildest data and ML-powered product ideas. Say goodbye to writing, maintaining, and scaling your own API integrations.
Open-source platform for extracting structured data from documents using AI.
Crawly, a high-level web crawling & scraping framework for Elixir.
Extract structured data from websites (web scraping).
A simple resume parser for extracting information from resumes.
Receipt scanner that extracts information from your PDF or image receipts, built in Node.js.
Turn webpages into LLM-friendly input text, similar to Firecrawl and the Jina Reader API. Makes RAG, AI web scraping, and image and webpage link extraction easy.
Extract data from .trace documents generated by Instruments
📄🔍 Parse, extract, and analyze documents with ease 📄🔍
wxpath - declarative web crawling with XPath; a Web Query Language (WQL)
Extract data from HTML tables.
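A generic sketch of this technique, using pandas rather than the listed repository's API: parse the `<table>` elements in an HTML snippet into DataFrames. The HTML here is made up for illustration.

```python
# Parse all <table> elements from HTML into pandas DataFrames
# (generic technique, not the listed project's API).
from io import StringIO
import pandas as pd

html = """
<table>
  <tr><th>name</th><th>stars</th></tr>
  <tr><td>pymupdf</td><td>5000</td></tr>
</table>
"""
tables = pd.read_html(StringIO(html))  # one DataFrame per <table>
print(tables[0])
```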
An R package for acquisition and processing of NASA SMAP data
Library and CLI for extracting data from HTML via CSS selectors.
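Since the repository itself is not named here, a hedged sketch of CSS-selector extraction using BeautifulSoup shows the general approach; the HTML snippet and selector are invented for illustration.

```python
# Generic CSS-selector extraction with BeautifulSoup
# (illustrates the technique, not the listed repository's API).
from bs4 import BeautifulSoup

html = '<ul><li class="repo">crawly</li><li class="repo">meltano</li></ul>'
soup = BeautifulSoup(html, "html.parser")
names = [li.get_text(strip=True) for li in soup.select("li.repo")]
print(names)  # ['crawly', 'meltano']
```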
FBLYZE is a Facebook scraping and analysis system.
Get lyrics for any song by just passing in the song name (spelled or misspelled) in less than 2 seconds, using this awesome Python library.
Extracting and parsing structured data with jQuery selectors, XPath, or JsonPath from common web formats like HTML, XML, and JSON.
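For the XPath side of that workflow, a minimal sketch with lxml (a generic stand-in, not the listed project's API); the XML fragment is made up for illustration, and the JsonPath and jQuery-selector paths are not shown.

```python
# XPath extraction sketch with lxml (generic technique).
from lxml import etree

xml = "<feed><entry><title>PyMuPDF</title></entry><entry><title>Crawly</title></entry></feed>"
root = etree.fromstring(xml)
titles = root.xpath("//entry/title/text()")
print(titles)  # ['PyMuPDF', 'Crawly']
```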