📁 Document Loaders
Document loaders allow you to load documents from different sources like PDF, TXT, CSV, Notion, Confluence, etc. They are often used together with Vector Stores, where the loaded documents are upserted as embeddings.
1) API Loader
Load data from an API.
2) Airtable
Load data from an Airtable table.
3) Apify Website Content Crawler
Load data from Apify Website Content Crawler.
4) Cheerio Web Scraper
Cheerio is lightweight and doesn't require a full browser environment like some other scraping tools. Keep in mind that when scraping websites, you should always review and comply with the website's terms of service and policies to ensure ethical and legal use of the data.
Scrape One URL
• (Optional) Connect Text Splitter.
• Input desired URL to be scraped.
Crawl & Scrape Multiple URLs
Visit the Web Crawl guide to enable scraping of multiple pages.
Output
Loads URL content as Document
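For illustration, this is roughly how a Cheerio-based loader behaves, sketched with the LangChain JS CheerioWebBaseLoader (a minimal sketch: import paths vary by LangChain version, and the URL and splitter settings are placeholders):

```typescript
import { CheerioWebBaseLoader } from "langchain/document_loaders/web/cheerio";
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";

// Fetch and parse a single URL (placeholder) into Document objects
const loader = new CheerioWebBaseLoader("https://example.com");
const docs = await loader.load();

// Optional: split the content into chunks, mirroring the
// optional Text Splitter connection described above
const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000,
  chunkOverlap: 200,
});
const chunks = await splitter.splitDocuments(docs);
```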
5) Confluence
Load data from a Confluence document.
6) Csv File
Load data from CSV files.
7) Custom Document Loader
Custom function for loading documents.
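Conceptually, a custom loader is just a function that returns an array of Document objects, each with pageContent and metadata. A minimal sketch, where fetchRecords and its URL are hypothetical stand-ins for your own data source:

```typescript
import { Document } from "@langchain/core/documents";

// Hypothetical helper fetching raw records from your own source
async function fetchRecords(): Promise<{ id: string; body: string }[]> {
  const res = await fetch("https://example.com/api/records"); // placeholder URL
  return res.json();
}

// Custom loading function: map each raw record to a Document,
// carrying the record id along as metadata
export async function loadCustomDocuments(): Promise<Document[]> {
  const records = await fetchRecords();
  return records.map(
    (r) => new Document({ pageContent: r.body, metadata: { id: r.id } })
  );
}
```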
8) Document Store
Load data from pre-configured document stores.
9) Docx File
Load data from DOCX files.
10) Figma
Load data from a Figma file.
11) Folder with Files
Load data from a folder containing multiple files.
12) GitBook
Load data from GitBook.
13) Github
Load data from a GitHub repository.
14) Json File
Load data from JSON files.
15) Json Lines File
Load data from JSON Lines files.
16) Notion Database
Load data from Notion Database (each row is a separate document with all properties as metadata).
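As an illustration of that mapping, a single row might come back shaped like this (field names and values are made up):

```typescript
import { Document } from "@langchain/core/documents";

// Illustrative only: one Notion database row mapped to one Document,
// with the row's properties carried along as metadata
const row = new Document({
  pageContent: "Acme Corp, enterprise plan, renewal due Q3",
  metadata: { Name: "Acme Corp", Plan: "Enterprise", Renewal: "Q3" },
});
```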
17) Notion Folder
Load data from the exported and unzipped Notion folder.
18) Notion Page
Load data from Notion Page (including child pages all as separate documents).
19) PDF Files
Portable Document Format (PDF), standardized as ISO 32000, is a file format developed by Adobe in 1992 to present documents, including text formatting and images, in a manner independent of application software, hardware, and operating systems. The Pdf File module decodes the base64-encoded data from the PDF document and then loads the PDF content. If a textSplitter is provided, it uses it to split the text content.
Inputs
• Text Splitter (optional)
• PDF File
• Usage: One Document per Page OR One Document per File
Output
Loads PDF content as Document
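The page-vs-file choice corresponds to a splitPages-style flag in the underlying loader. A sketch using the LangChain JS PDFLoader (import path varies by version; sample.pdf is a placeholder, and the loader relies on the pdf-parse package):

```typescript
import { PDFLoader } from "langchain/document_loaders/fs/pdf";

// splitPages: true  -> one Document per page
// splitPages: false -> one Document per file
const perPage = new PDFLoader("sample.pdf", { splitPages: true });
const pageDocs = await perPage.load();

const perFile = new PDFLoader("sample.pdf", { splitPages: false });
const [fileDoc] = await perFile.load();
console.log(pageDocs.length, fileDoc.pageContent.length);
```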
20) Plain Text
Load data from plain text.
21) Playwright Web Scraper
Playwright is a Node.js library that allows automation of web browsers for web scraping. It was developed by Microsoft and supports multiple browsers, including Chromium. Keep in mind that when scraping websites, you should always review and comply with the website's terms of service and policies to ensure ethical and legal use of the data.
Scrape One URL
• (Optional) Connect Text Splitter.
• Input desired URL to be scraped.
Crawl & Scrape Multiple URLs
Visit the Web Crawl guide to enable scraping of multiple pages.
Output
Loads URL content as Document
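For illustration, a sketch of the equivalent LangChain JS PlaywrightWebBaseLoader usage (the URL and options are placeholders; Playwright itself must be installed, and import paths vary by version):

```typescript
import { PlaywrightWebBaseLoader } from "langchain/document_loaders/web/playwright";

// Launch a headless browser, render the page (including
// JS-driven content), and return it as Document objects
const loader = new PlaywrightWebBaseLoader("https://example.com", {
  launchOptions: { headless: true },
  gotoOptions: { waitUntil: "domcontentloaded" },
});
const docs = await loader.load();
```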
22) Puppeteer Web Scraper
Puppeteer is a Node.js library that controls Chrome/Chromium through the DevTools Protocol, typically in headless mode. Keep in mind that when scraping websites, you should always review and comply with the website's terms of service and policies to ensure ethical and legal use of the data.
Scrape One URL
• (Optional) Connect Text Splitter.
• Input desired URL to be scraped.
Crawl & Scrape Multiple URLs
Visit the Web Crawl guide to enable scraping of multiple pages.
Output
Loads URL content as Document
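Similarly, a sketch of the equivalent LangChain JS PuppeteerWebBaseLoader usage (the URL and options are placeholders; Puppeteer itself must be installed, and import paths vary by version):

```typescript
import { PuppeteerWebBaseLoader } from "langchain/document_loaders/web/puppeteer";

// Drive headless Chrome/Chromium via the DevTools Protocol
// and capture the rendered page as Document objects
const loader = new PuppeteerWebBaseLoader("https://example.com", {
  launchOptions: { headless: true },
  gotoOptions: { waitUntil: "networkidle0" },
});
const docs = await loader.load();
```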
23) S3 File Loader
The S3 File Loader allows you to retrieve a file from S3 and use Unstructured to preprocess it into a structured Document object that is ready to be converted into vector embeddings. Unstructured is used to cater for a wide range of file types: whether your file on S3 is PDF, XML, DOCX, or CSV, it can be processed by Unstructured. See here for supported file types.
Unstructured Setup
You can either use the hosted API or run it locally via Docker.
Docker:

```bash
docker run -p 8000:8000 -d --rm --name unstructured-api quay.io/unstructured-io/unstructured-api:latest --port 8000 --host 0.0.0.0
```
S3 File Loader Setup
1. Drag and drop the S3 File Loader onto the canvas.
2. AWS Credential: Create a new credential for your AWS account. You'll need the access key and secret key. Remember to grant the S3 bucket policy to the associated account. You can refer to the policy guide here.
3. Bucket: Log in to your AWS console and navigate to S3. Get your bucket name.
4. Key: Click on the object you would like to use, and get the Key name.
5. Unstructured API URL: Depending on how you are using Unstructured, whether through the Hosted API or Docker, change the Unstructured API URL parameter. If you are using the Hosted API, you'll need the API key as well.
6. You can then start chatting with your file from S3. You don't have to specify a text splitter for chunking the document, because that is handled automatically by Unstructured. A configuration sketch is shown below.
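For illustration, an equivalent setup with the LangChain JS S3Loader looks roughly like this. A minimal sketch: the bucket, key, region, and environment variables are placeholders, the import path varies by LangChain version, and the local URL assumes the Docker container from the setup above.

```typescript
import { S3Loader } from "langchain/document_loaders/web/s3";

const loader = new S3Loader({
  bucket: "my-bucket",        // placeholder bucket name
  key: "reports/annual.pdf",  // placeholder object key
  s3Config: {
    region: "us-east-1",
    credentials: {
      accessKeyId: process.env.AWS_ACCESS_KEY_ID!,
      secretAccessKey: process.env.AWS_SECRET_ACCESS_KEY!,
    },
  },
  // Local Docker container from the setup above; for the hosted
  // API, use its URL and supply unstructuredAPIKey as well
  unstructuredAPIURL: "http://localhost:8000/general/v0/general",
  unstructuredAPIKey: "",
});

// Returns structured Document objects preprocessed by Unstructured
const docs = await loader.load();
```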
24) Search API For Web Search
Load data from real-time search results.
25) SerpApi For Web Search
Load and process data from web search results.
26) Text File
Load data from text files.
27) Unstructured File Loader
Use Unstructured.io to load data from a file path.
28) Unstructured Folder Loader
Use Unstructured.io to load data from a folder. Note: currently doesn't support .png and .heic until Unstructured is updated.
29) VectorStore To Document
Search documents with scores from a vector store.
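Under the hood this corresponds to a scored similarity search. A minimal sketch with the LangChain JS MemoryVectorStore (the sample text, query, embeddings provider, and k value are placeholder choices):

```typescript
import { MemoryVectorStore } from "langchain/vectorstores/memory";
import { OpenAIEmbeddings } from "@langchain/openai";
import { Document } from "@langchain/core/documents";

// Embed a sample document into an in-memory vector store
const store = await MemoryVectorStore.fromDocuments(
  [new Document({ pageContent: "hello world" })],
  new OpenAIEmbeddings()
);

// Retrieve documents together with their similarity scores
const results = await store.similaritySearchWithScore("hello", 4);
for (const [doc, score] of results) {
  console.log(score, doc.pageContent);
}
```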