mishushakov/llm-scraper
LLM Scraper
LLM Scraper is a TypeScript library that allows you to extract structured data from any webpage using LLMs.
[!IMPORTANT] LLM Scraper was updated to version 1.6.
The new version comes with Vercel AI SDK 4 support, JSON Schema, better type-safety, improved code generation and updated examples.
[!TIP] Under the hood, it uses function calling to convert pages to structured data. You can find more about this approach here.
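To make the function-calling idea concrete, here is a minimal sketch of the general pattern, not of LLM Scraper's internals. The desired output shape is published to the model as a function's JSON Schema parameters, and the model's reply arrives as a JSON string of arguments for that function. The tool name `extract_page_data`, the OpenAI-style tool layout, the `Story` fields, and the `parseToolArguments` helper are all assumptions made for illustration.

```typescript
// A JSON Schema describing the data we want back (hypothetical fields).
const parameters = {
  type: 'object',
  properties: {
    title: { type: 'string' },
    points: { type: 'number' },
  },
  required: ['title', 'points'],
} as const

// OpenAI-style tool definition: the schema becomes the function's parameters.
const tool = {
  type: 'function',
  function: {
    name: 'extract_page_data', // hypothetical name
    description: 'Return structured data found in the page content',
    parameters,
  },
} as const

interface Story {
  title: string
  points: number
}

// The model answers with a JSON string of arguments; parse it and check the
// fields before treating it as structured data.
function parseToolArguments(json: string): Story {
  const value: unknown = JSON.parse(json)
  if (typeof value !== 'object' || value === null) {
    throw new Error('expected a JSON object')
  }
  const { title, points } = value as Record<string, unknown>
  if (typeof title !== 'string' || typeof points !== 'number') {
    throw new Error('arguments do not match the schema')
  }
  return { title, points }
}
```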
Features
- Supports GPT, Sonnet, Gemini, Llama, Qwen model series
- Schemas defined with Zod or JSON Schema
- Full type-safety with TypeScript
- Based on the Playwright framework
- Streaming objects
- Code-generation
- Supports 5 formatting modes:
  - `html` for loading pre-processed HTML
  - `raw_html` for loading raw HTML (no processing)
  - `markdown` for loading markdown
  - `text` for loading extracted text (using Readability.js)
  - `image` for loading a screenshot (multi-modal only)
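As a rough illustration of how a mode might be selected in application code, the sketch below models the modes listed above as a TypeScript union and builds an options object from an untrusted string. The `Format` type, the `isFormat` and `scrapeOptions` helpers, the `{ format }` option shape, and the fallback to `html` are assumptions made for this example, not the library's exported API.

```typescript
// Hypothetical union mirroring the modes listed above.
type Format = 'html' | 'raw_html' | 'markdown' | 'text' | 'image'

// What each mode loads, taken from the list above.
const FORMAT_DESCRIPTIONS: Record<Format, string> = {
  html: 'pre-processed HTML',
  raw_html: 'raw HTML (no processing)',
  markdown: 'markdown',
  text: 'extracted text (using Readability.js)',
  image: 'a screenshot (multi-modal only)',
}

// Type guard for values that arrive as plain strings, e.g. from a CLI flag.
function isFormat(value: string): value is Format {
  return Object.prototype.hasOwnProperty.call(FORMAT_DESCRIPTIONS, value)
}

// Build an options object; falling back to 'html' is an assumption of this sketch.
function scrapeOptions(requested: string): { format: Format } {
  return { format: isFormat(requested) ? requested : 'html' }
}
```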
Make sure to give it a star!
Getting started
- Install the required dependencies from npm:

      npm i zod playwright llm-scraper

- Initialize your LLM:

  OpenAI

      npm i @ai-sdk/openai

      import { openai } from '@ai-sdk/openai'

      const llm = openai.chat('gpt-4o')

  Anthropic

      npm i @ai-sdk/anthropic

      import { anthropic } from '@ai-sdk/anthropic'

      const llm = anthropic('claude-3-5-sonnet-20240620')

  Google

      npm i @ai-sdk/google

      import { google } from '@ai-sdk/google'

      const llm = google('gemini-1.5-flash')

  Groq

      npm i @ai-sdk/openai

      import { createOpenAI } from '@ai-sdk/openai'

      const groq = createOpenAI({
        baseURL: 'https://api.groq.com/openai/v1',
        apiKey: process.env.GROQ_API_KEY,
      })
      const llm = groq('llama3-8b-8192')

  Ollama

      npm i ollama-ai-provider

      import { ollama } from 'ollama-ai-provider'

      const llm = ollama('llama3')

- Create a new scraper instance provided with the llm:

      import LLMScraper from 'llm-scraper'

      const scraper = new LLMScraper(llm)
Example
In this example, we’re extracting top stories from HackerNews:
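A minimal sketch of such a run, written against a small structural interface so that it does not depend on any particular version of the library. It assumes the scraper exposes `run(page, schema, options)` resolving to an object with a `data` field, and it describes the stories with a plain JSON Schema object. The `ScraperLike` interface, the `Story` fields, and the `topStories` helper are hypothetical names for this example; depending on the library version, the schema may instead need to be a Zod schema or be wrapped by an SDK helper.

```typescript
// JSON Schema for the data to extract (field names are hypothetical).
const topStoriesSchema = {
  type: 'object',
  properties: {
    top: {
      type: 'array',
      description: 'Top 5 stories on Hacker News',
      items: {
        type: 'object',
        properties: {
          title: { type: 'string' },
          points: { type: 'number' },
          by: { type: 'string' },
          commentsURL: { type: 'string' },
        },
        required: ['title', 'points', 'by', 'commentsURL'],
      },
    },
  },
  required: ['top'],
}

interface Story {
  title: string
  points: number
  by: string
  commentsURL: string
}

// Assumed shape of the scraper: run(page, schema, options) resolves to { data }.
interface ScraperLike<Page> {
  run(
    page: Page,
    schema: object,
    options?: { format?: string },
  ): Promise<{ data: unknown }>
}

// Scrape an already-loaded page and return its stories.
async function topStories<Page>(
  scraper: ScraperLike<Page>,
  page: Page,
): Promise<Story[]> {
  const { data } = await scraper.run(page, topStoriesSchema, { format: 'html' })
  // The cast trusts the model to follow the schema; validate in real code.
  return (data as { top: Story[] }).top
}
```

With the real library, `scraper` would be the `LLMScraper` instance created in the previous step and `page` a Playwright page that has already navigated to the Hacker News front page.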
Output
More examples can be found in the examples folder.
Streaming
Replace your run function with stream to get a partial object stream (Vercel AI SDK only).
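A minimal sketch of consuming such a stream. It assumes `stream(page, schema)` resolves to an object whose `stream` property is an async iterable of progressively more complete objects. The `StreamingScraperLike` interface, the `PartialStories` type, and the `lastSnapshot` helper are hypothetical names for this example.

```typescript
interface Story {
  title: string
  points: number
}

// Each emitted value is a partial snapshot: fields may be missing until the end.
type PartialStories = { top?: Array<Partial<Story>> }

// Assumed shape of the streaming call.
interface StreamingScraperLike<Page> {
  stream(
    page: Page,
    schema: object,
  ): Promise<{ stream: AsyncIterable<PartialStories> }>
}

// Report every snapshot as it arrives and return the last (most complete) one.
async function lastSnapshot<Page>(
  scraper: StreamingScraperLike<Page>,
  page: Page,
  schema: object,
  onUpdate: (snapshot: PartialStories) => void = () => {},
): Promise<PartialStories> {
  const { stream } = await scraper.stream(page, schema)
  let latest: PartialStories = {}
  for await (const snapshot of stream) {
    onUpdate(snapshot) // e.g. re-render a UI with what is known so far
    latest = snapshot
  }
  return latest
}
```

Because intermediate snapshots can lack fields, code that reads them should treat every property as optional until the stream has finished.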
Code-generation
Using the generate function, you can generate a reusable Playwright script that scrapes the page contents according to a schema.
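A minimal sketch of that flow: ask the scraper for code, evaluate it in the page, and validate what comes back before trusting it. It assumes `generate(page, schema)` resolves to an object with a `code` string and that the page offers a Playwright-style `evaluate` that accepts that string. The two interfaces, the `runGenerated` helper, and the `validate` callback are hypothetical names for this example.

```typescript
// Assumed shape of the code-generation call.
interface GeneratingScraperLike<Page> {
  generate(page: Page, schema: object): Promise<{ code: string }>
}

// The one page capability this sketch needs; Playwright pages accept a
// string expression in evaluate.
interface EvaluatingPage {
  evaluate(script: string): Promise<unknown>
}

// Generate a script once, run it in the page, and validate the result.
async function runGenerated<T>(
  scraper: GeneratingScraperLike<EvaluatingPage>,
  page: EvaluatingPage,
  schema: object,
  validate: (value: unknown) => T, // e.g. a Zod schema's parse method
): Promise<{ code: string; data: T }> {
  const { code } = await scraper.generate(page, schema)
  const result = await page.evaluate(code)
  return { code, data: validate(result) }
}
```

Returning the `code` alongside the data lets a caller store the script and re-run it on later visits; validating each result is a guard against the page layout drifting away from what the script expects.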
Contributing
As an open-source project, we welcome contributions from the community. If you are experiencing any bugs or want to add some improvements, please feel free to open an issue or pull request.