📁 Document Loaders
Document loaders allow you to load documents from different sources such as PDF, TXT, CSV, Notion, Confluence, and more. They are often used together with Vector Stores, where the loaded content is upserted as embeddings.
Load data from an API.
Load data from an Airtable table.
Load data from Apify Website Content Crawler.
Cheerio is a fast, lightweight library for parsing HTML, and it doesn't require a full browser environment like some other scraping tools. Keep in mind that when scraping websites, you should always review and comply with the website's terms of service and policies to ensure ethical and legal use of the data.
Scrape One URL
• (Optional) Connect Text Splitter.
• Input desired URL to be scraped.
Crawl & Scrape Multiple URLs
• Visit the Web Crawl guide to enable scraping of multiple pages.
Output
• Loads URL content as Document
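Under the hood, a Cheerio-style loader boils down to fetching the page HTML and parsing it in-process, with no browser involved. Below is a minimal sketch of that pattern (the URL is a placeholder and the extraction logic is simplified, not Flowise's exact implementation):

```typescript
import * as cheerio from "cheerio";

// Fetch the raw HTML, then parse it in-process -- no headless browser needed.
async function scrapeUrl(url: string): Promise<string> {
  const res = await fetch(url);
  const html = await res.text();
  const $ = cheerio.load(html);
  // Extract the visible body text; a real loader would also keep
  // metadata such as the source URL alongside the page content.
  return $("body").text().trim();
}

scrapeUrl("https://example.com").then(console.log);
```

Because Cheerio only parses the HTML the server returns, content rendered by client-side JavaScript will be missing; that is the trade-off for its speed, and the reason the Playwright and Puppeteer loaders below exist.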
Load data from a Confluence Document
Load data from CSV files.
Custom function for loading documents.
Load data from pre-configured document stores.
Load data from DOCX files.
Load data from a Figma file.
Load data from folder with multiple files.
Load data from GitBook.
Load data from a GitHub repository.
Load data from JSON files.
Load data from JSON Lines files.
Load data from Notion Database (each row is a separate document with all properties as metadata).
Load data from the exported and unzipped Notion folder.
Load data from Notion Page (including child pages all as separate documents).
Portable Document Format (PDF), standardized as ISO 32000, is a file format developed by Adobe in 1992 to present documents, including text formatting and images, in a manner independent of application software, hardware, and operating systems. The Pdf File module decodes the base64-encoded data from the PDF document and then loads the PDF content. If a textSplitter is provided, it uses it to split the text content.
Inputs
• Text Splitter (optional)
• PDF File
• Usage: One Document per Page OR One Document per File
Output
• Loads PDF content as Document
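The decode-then-parse flow described above can be sketched with the pdf-parse package (an assumption for illustration; it is not necessarily the parser Flowise uses internally):

```typescript
import pdf from "pdf-parse";

// Decode the base64-encoded PDF payload into a Buffer, then extract its text.
async function loadPdf(base64Data: string): Promise<string[]> {
  const buffer = Buffer.from(base64Data, "base64");
  const parsed = await pdf(buffer);
  // "One Document per File" mode: return the whole text as a single
  // document; a text splitter (or per-page mode) would chunk it further.
  return [parsed.text];
}
```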
Load data from plain text.
Playwright is a Node.js library that automates web browsers for web scraping. It was developed by Microsoft and supports multiple browsers, including Chromium, Firefox, and WebKit. Keep in mind that when scraping websites, you should always review and comply with the website's terms of service and policies to ensure ethical and legal use of the data.
Scrape One URL
• (Optional) Connect Text Splitter.
• Input desired URL to be scraped.
Crawl & Scrape Multiple URLs
• Visit the Web Crawl guide to enable scraping of multiple pages.
Output
• Loads URL content as Document
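For reference, the browser-driven scrape Playwright performs looks roughly like this (a simplified sketch; the wait strategy and text extraction are assumptions, not Flowise's exact settings):

```typescript
import { chromium } from "playwright";

// Launch headless Chromium, load the page, and pull the rendered text.
async function scrapeWithPlaywright(url: string): Promise<string> {
  const browser = await chromium.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: "networkidle" });
    // Reading <body> after the page settles captures content produced
    // by client-side JavaScript, which Cheerio cannot see.
    return (await page.textContent("body")) ?? "";
  } finally {
    await browser.close();
  }
}
```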
Puppeteer is a Node.js library that controls Chrome/Chromium through the DevTools Protocol, typically in headless mode. Keep in mind that when scraping websites, you should always review and comply with the website's terms of service and policies to ensure ethical and legal use of the data.
Scrape One URL
• (Optional) Connect Text Splitter.
• Input desired URL to be scraped.
Crawl & Scrape Multiple URLs
• Visit the Web Crawl guide to enable scraping of multiple pages.
Output
• Loads URL content as Document
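The Puppeteer equivalent is nearly identical, driving Chrome over the DevTools Protocol (again a simplified sketch, not the node's exact implementation):

```typescript
import puppeteer from "puppeteer";

// Launch headless Chrome, load the page, and read the rendered body text.
async function scrapeWithPuppeteer(url: string): Promise<string> {
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: "networkidle0" });
    // Evaluate inside the page context to get the fully rendered text.
    return await page.evaluate(() => document.body.innerText);
  } finally {
    await browser.close();
  }
}
```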
S3 File Loader allows you to retrieve a file from S3 and use Unstructured to preprocess it into a structured Document object that is ready to be converted into vector embeddings. Unstructured is used to cater for a wide range of file types: regardless of whether your file on S3 is PDF, XML, DOCX, or CSV, it can be processed by Unstructured. See here for supported file types.
Unstructured Setup
You can either use the hosted API or run it locally via Docker.
Docker: docker run -p 8000:8000 -d --rm --name unstructured-api quay.io/unstructured-io/unstructured-api:latest --port 8000 --host 0.0.0.0
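Once the container is running, the loader sends files to Unstructured's general partition endpoint. A minimal sketch of that request against the local container started above (the file path is a placeholder, and this assumes Node 18+ for the built-in fetch/FormData):

```typescript
import { readFile } from "node:fs/promises";

// POST a file to the local Unstructured API; the response is a JSON
// array of structured elements (titles, paragraphs, tables, ...).
async function partitionFile(path: string): Promise<unknown> {
  const form = new FormData();
  form.append("files", new Blob([await readFile(path)]), path);
  const res = await fetch("http://localhost:8000/general/v0/general", {
    method: "POST",
    body: form,
  });
  if (!res.ok) throw new Error(`Unstructured API error: ${res.status}`);
  return res.json();
}
```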
S3 File Loader Setup
1. Drag and drop the S3 File Loader onto the canvas.
2. AWS Credential: Create a new credential for your AWS account. You'll need the access and secret key. Remember to grant the associated account an S3 bucket policy (a minimal example follows these steps). You can refer to the policy guide here.
3. Bucket: Log in to your AWS console, navigate to S3, and get your bucket name.
4. Key: Click on the object you would like to use and get the Key name.
5. Unstructured API URL: Depending on how you are using Unstructured, whether through the Hosted API or Docker, change the Unstructured API URL parameter accordingly. If you are using the Hosted API, you'll need the API key as well.
6. You can then start chatting with your file from S3. You don't have to specify a text splitter for chunking the document, because that's handled automatically by Unstructured.
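For step 2 above, a minimal read-only IAM policy for the credential's account might look like the following (the bucket name is a placeholder to replace with your own):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::YOUR_BUCKET_NAME",
        "arn:aws:s3:::YOUR_BUCKET_NAME/*"
      ]
    }
  ]
}
```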
Load data from real-time search results.
Load and process data from web search results.
Load data from text files.
Use Unstructured.io to load data from a file path.
Use Unstructured.io to load data from a folder. Note: Currently doesn't support .png and .heic until unstructured is updated.
Search documents with scores from vector store.