📁 Document Loaders
Document loaders allow you to load documents from different sources such as PDF, TXT, CSV, Notion, and Confluence. They are often used together with Vector Stores to be upserted as embeddings.
Load data from an API.
Load data from an Airtable table.
Load data from Apify Website Content Crawler.
Cheerio is lightweight and doesn't require a full browser environment like some other scraping tools. Keep in mind that when scraping websites, you should always review and comply with the website's terms of service and policies to ensure ethical and legal use of the data.
Scrape One URL
• (Optional) Connect Text Splitter.
• Input the desired URL to be scraped.
Crawl & Scrape Multiple URLs
• Visit the Web Crawl guide to allow scraping of multiple pages.
Output
Loads URL content as Document
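For illustration, a minimal sketch of what this kind of node does under the hood, using the cheerio package directly; the URL and the body selector are placeholders, not node settings:

```typescript
// Fetch a page and extract its text with Cheerio (no browser needed).
import * as cheerio from "cheerio";

async function scrapeOne(url: string): Promise<string> {
  const res = await fetch(url);        // Node 18+ global fetch
  const html = await res.text();
  const $ = cheerio.load(html);        // parse the static HTML
  return $("body").text().trim();      // plain text of the page body
}

scrapeOne("https://example.com").then((text) => console.log(text.slice(0, 200)));
```

Because Cheerio parses static HTML, content rendered by client-side JavaScript will not appear; the Playwright and Puppeteer loaders below handle such pages.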
Load data from a Confluence Document
Load data from CSV files.
Custom function for loading documents.
Load data from pre-configured document stores.
Load data from DOCX files.
Load data from a Figma file.
Load data from folder with multiple files.
Load data from GitBook.
Load data from a GitHub repository.
Load data from JSON files.
Load data from JSON Lines files.
Load data from Notion Database (each row is a separate document with all properties as metadata).
Load data from the exported and unzipped Notion folder.
Load data from Notion Page (including child pages all as separate documents).
Portable Document Format (PDF), standardized as ISO 32000, is a file format developed by Adobe in 1992 to present documents, including text formatting and images, in a manner independent of application software, hardware, and operating systems. The PDF File module decodes the base64-encoded data from the PDF document and then loads the PDF content. If a textSplitter is provided, it uses it to split the text content.
Inputs
• Text Splitter (optional)
• PDF File
• Usage: One Document per Page OR One Document per File
Output
Loads PDF content as Document
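A rough equivalent of this behaviour, sketched with LangChain's PDFLoader (the import paths may vary by LangChain version, and the file path is a placeholder):

```typescript
import { PDFLoader } from "langchain/document_loaders/fs/pdf";
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";

// splitPages: true  -> One Document per Page
// splitPages: false -> One Document per File
const loader = new PDFLoader("example.pdf", { splitPages: true });
const docs = await loader.load();

// If a Text Splitter is connected, the loaded text is chunked further.
const splitter = new RecursiveCharacterTextSplitter({ chunkSize: 1000, chunkOverlap: 200 });
const chunks = await splitter.splitDocuments(docs);
console.log(`${docs.length} document(s), ${chunks.length} chunks`);
```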
Load data from plain text.
Playwright is a Node.js library that allows automation of web browsers for web scraping. It was developed by Microsoft and supports multiple browsers, including Chromium. Keep in mind that when scraping websites, you should always review and comply with the website's terms of service and policies to ensure ethical and legal use of the data.
Scrape One URL
• (Optional) Connect Text Splitter.
• Input the desired URL to be scraped.
Crawl & Scrape Multiple URLs
• Visit the Web Crawl guide to allow scraping of multiple pages.
Output
Loads URL content as Document
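As a sketch, a minimal Playwright script doing the same kind of single-URL scrape (the URL is a placeholder):

```typescript
import { chromium } from "playwright";

const browser = await chromium.launch();   // headless by default
const page = await browser.newPage();
await page.goto("https://example.com", { waitUntil: "networkidle" });
const text = await page.innerText("body"); // includes JS-rendered content
await browser.close();
console.log(text.slice(0, 200));
```

Unlike Cheerio, the page runs in a real browser, so content generated by client-side JavaScript is captured.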
Puppeteer is a Node.js library that controls Chrome/Chromium through the DevTools Protocol, typically in headless mode. Keep in mind that when scraping websites, you should always review and comply with the website's terms of service and policies to ensure ethical and legal use of the data.
Scrape One URL
• (Optional) Connect Text Splitter.
• Input the desired URL to be scraped.
Crawl & Scrape Multiple URLs
• Visit the Web Crawl guide to allow scraping of multiple pages.
Output
Loads URL content as Document
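The same single-URL scrape sketched with Puppeteer (placeholder URL):

```typescript
import puppeteer from "puppeteer";

const browser = await puppeteer.launch();  // headless Chrome/Chromium via DevTools Protocol
const page = await browser.newPage();
await page.goto("https://example.com", { waitUntil: "networkidle2" });
const text = await page.evaluate(() => document.body.innerText);
await browser.close();
console.log(text.slice(0, 200));
```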
The S3 File Loader allows you to retrieve a file from S3 and use Unstructured to preprocess it into a structured Document object that is ready to be converted into vector embeddings. Unstructured is used to cater for a wide range of file types: whether your file on S3 is PDF, XML, DOCX, or CSV, it can be processed by Unstructured. See here for supported file types.
Unstructured Setup
You can either use the hosted API or run it locally via Docker.
Docker: docker run -p 8000:8000 -d --rm --name unstructured-api quay.io/unstructured-io/unstructured-api:latest --port 8000 --host 0.0.0.0
S3 File Loader Setup
1. Drag and drop the S3 File Loader onto the canvas.
2. AWS Credential: Create a new credential for your AWS account. You'll need the access and secret key. Remember to grant the S3 bucket policy to the associated account. You can refer to the policy guide here.
3. Bucket: Log in to your AWS console and navigate to S3. Get your bucket name.
4. Key: Click on the object you would like to use, and get the Key name.
5. Unstructured API URL: Depending on how you are using Unstructured, whether through the Hosted API or Docker, change the Unstructured API URL parameter. If you are using the Hosted API, you'll need the API key as well.
6. You can then start chatting with your file from S3. You don't have to specify a text splitter for chunking the document, because that's handled automatically by Unstructured.
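To illustrate what happens behind the scenes, here is a hedged sketch of a direct call to the Unstructured container started by the Docker command above; /general/v0/general is the API's standard route, and the file name is a placeholder:

```typescript
import { readFile } from "node:fs/promises";

// Send a file to the local Unstructured container for partitioning.
const fileBuffer = await readFile("example.pdf");
const form = new FormData();
form.append("files", new Blob([fileBuffer]), "example.pdf");

const res = await fetch("http://localhost:8000/general/v0/general", {
  method: "POST",
  // For the Hosted API, also set: headers: { "unstructured-api-key": "<your key>" }
  body: form,
});
const elements = await res.json(); // structured elements, ready for embedding
console.log(`${elements.length} elements extracted`);
```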
Load data from real-time search results.
Load and process data from web search results.
Load data from text files.
Use Unstructured.io to load data from a file path.
Use Unstructured.io to load data from a folder. Note: Currently doesn't support .png and .heic until Unstructured is updated.
Search documents with scores from vector store.
30) BraveSearch API Document Loader
Purpose: Fetches real-time web search results from Brave Search.
Functionality: Utilizes the Brave Search API to retrieve up-to-date information from the web.
Use Case: Ideal for applications requiring current web data integration.
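As a sketch, the raw API call this loader wraps (BRAVE_API_KEY and the query are placeholders):

```typescript
const res = await fetch(
  "https://api.search.brave.com/res/v1/web/search?q=" +
    encodeURIComponent("flowise document loaders"),
  { headers: { "X-Subscription-Token": process.env.BRAVE_API_KEY ?? "" } }
);
const data = await res.json();
// Each web result can become one Document: snippet as content, URL as metadata.
for (const r of data.web?.results ?? []) console.log(r.title, r.url);
```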
31) File Loader
Purpose: Handles various file types by automatically selecting the appropriate loader.
Functionality: When a file is uploaded, it determines the file type (e.g., .csv, .pdf) and uses the corresponding loader.
Use Case: Simplifies the process of loading diverse document types without manual configuration.
32) FireCrawl
Purpose: Crawls websites and converts them into LLM-ready markdown data.
Functionality: Processes a given URL, crawls accessible subpages, and extracts clean markdown content.
Use Case: Useful for transforming entire websites into structured data suitable for AI applications.
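A hedged sketch of a single-page scrape against FireCrawl's REST API, assuming the v1 /scrape endpoint; FIRECRAWL_API_KEY and the URL are placeholders:

```typescript
const res = await fetch("https://api.firecrawl.dev/v1/scrape", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    Authorization: `Bearer ${process.env.FIRECRAWL_API_KEY}`,
  },
  body: JSON.stringify({ url: "https://example.com", formats: ["markdown"] }),
});
const { data } = await res.json();
console.log(data?.markdown?.slice(0, 200)); // LLM-ready markdown
```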
33) S3 Directory
Purpose: Retrieves and processes files stored in an AWS S3 bucket.
Functionality: Accesses files from S3, uses Unstructured to preprocess them into structured documents ready for embedding.
Use Case: Ideal for loading and processing documents stored in AWS S3 for further analysis or integration.
34) Spider Document Loader
Purpose: Scrapes or crawls web pages to extract content.
Functionality: Offers two modes: "Scrape" for single pages and "Crawl" for multiple pages, extracting content accordingly.
Use Case: Suitable for applications needing to gather data from websites for analysis or integration.