📁 Document Loaders
Document loaders allow you to load documents from different sources such as PDF, TXT, CSV, Notion, Confluence, and more. They are often used together with Vector Stores, where the loaded content is upserted as embeddings.
Load data from an API.
Load data from an Airtable table.
Load data from Apify Website Content Crawler.
Cheerio is a fast, lightweight library for parsing HTML, and it doesn't require a full browser environment like some other scraping tools. Keep in mind that when scraping websites, you should always review and comply with the website's terms of service and policies to ensure ethical and legal use of the data.
Scrape One URL
• (Optional) Connect Text Splitter.
• Input desired URL to be scraped.
Crawl & Scrape Multiple URLs
• Visit the Web Crawl guide to enable scraping of multiple pages.
Output
• Loads URL content as Document
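Under the hood, a Cheerio-style loader boils down to fetching the page HTML and parsing it in-process, with no browser involved. Below is a minimal sketch of that pattern (the URL is a placeholder and the extraction logic is simplified, not Flowise's exact implementation):

```typescript
import * as cheerio from "cheerio";

// Fetch the raw HTML, then parse it in-process -- no headless browser needed.
async function scrapeUrl(url: string): Promise<string> {
  const res = await fetch(url);
  const html = await res.text();
  const $ = cheerio.load(html);
  // Extract the visible body text; a real loader would also keep
  // metadata such as the source URL alongside the page content.
  return $("body").text().trim();
}

scrapeUrl("https://example.com").then(console.log);
```

Because Cheerio only parses the HTML the server returns, content rendered by client-side JavaScript will be missing; that is the trade-off for its speed, and the reason the Playwright and Puppeteer loaders below exist.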
Load data from a Confluence Document
Load data from CSV files.
Custom function for loading documents.
Load data from pre-configured document stores.
Load data from DOCX files.
Load data from a Figma file.
Load data from folder with multiple files.
Load data from GitBook.
Load data from a GitHub repository.
Load data from JSON files.
Load data from JSON Lines files.
Load data from Notion Database (each row is a separate document with all properties as metadata).
Load data from the exported and unzipped Notion folder.
Load data from Notion Page (including child pages all as separate documents).
Portable Document Format (PDF), standardized as ISO 32000, is a file format developed by Adobe in 1992 to present documents, including text formatting and images, in a manner independent of application software, hardware, and operating systems. The Pdf File module decodes the base64-encoded data from the PDF document and then loads the PDF content. If a textSplitter is provided, it uses it to split the text content.
Inputs
• Text Splitter (optional)
• PDF File
• Usage: One Document per Page OR One Document per File
Output
• Loads PDF content as Document
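The decode-then-parse flow described above can be sketched with the pdf-parse package (an assumption for illustration; it is not necessarily the parser Flowise uses internally):

```typescript
import pdf from "pdf-parse";

// Decode the base64-encoded PDF payload into a Buffer, then extract its text.
async function loadPdf(base64Data: string): Promise<string[]> {
  const buffer = Buffer.from(base64Data, "base64");
  const parsed = await pdf(buffer);
  // "One Document per File" mode: return the whole text as a single
  // document; a text splitter (or per-page mode) would chunk it further.
  return [parsed.text];
}
```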
Load data from plain text.
Playwright is a Node.js library that automates web browsers for web scraping. It was developed by Microsoft and supports multiple browsers, including Chromium, Firefox, and WebKit. Keep in mind that when scraping websites, you should always review and comply with the website's terms of service and policies to ensure ethical and legal use of the data.
Scrape One URL
• (Optional) Connect Text Splitter.
• Input desired URL to be scraped.
Crawl & Scrape Multiple URLs
• Visit the Web Crawl guide to enable scraping of multiple pages.
Output
• Loads URL content as Document
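For reference, the browser-driven scrape Playwright performs looks roughly like this (a simplified sketch; the wait strategy and text extraction are assumptions, not Flowise's exact settings):

```typescript
import { chromium } from "playwright";

// Launch headless Chromium, load the page, and pull the rendered text.
async function scrapeWithPlaywright(url: string): Promise<string> {
  const browser = await chromium.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: "networkidle" });
    // Reading <body> after the page settles captures content produced
    // by client-side JavaScript, which Cheerio cannot see.
    return (await page.textContent("body")) ?? "";
  } finally {
    await browser.close();
  }
}
```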
Puppeteer is a Node.js library that controls Chrome/Chromium through the DevTools Protocol, typically in headless mode. Keep in mind that when scraping websites, you should always review and comply with the website's terms of service and policies to ensure ethical and legal use of the data.
Scrape One URL
• (Optional) Connect Text Splitter.
• Input desired URL to be scraped.
Crawl & Scrape Multiple URLs
• Visit the Web Crawl guide to enable scraping of multiple pages.
Output
• Loads URL content as Document
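The Puppeteer equivalent is nearly identical, driving Chrome over the DevTools Protocol (again a simplified sketch, not the node's exact implementation):

```typescript
import puppeteer from "puppeteer";

// Launch headless Chrome, load the page, and read the rendered body text.
async function scrapeWithPuppeteer(url: string): Promise<string> {
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: "networkidle0" });
    // Evaluate inside the page context to get the fully rendered text.
    return await page.evaluate(() => document.body.innerText);
  } finally {
    await browser.close();
  }
}
```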
S3 File Loader allows you to retrieve a file from S3 and use Unstructured to preprocess it into a structured Document object that is ready to be converted into vector embeddings. Unstructured is used to cater for a wide range of file types: regardless of whether your file on S3 is PDF, XML, DOCX, or CSV, it can be processed by Unstructured. See here for supported file types.
Unstructured Setup
You can either use the hosted API or run it locally via Docker.
Docker: docker run -p 8000:8000 -d --rm --name unstructured-api quay.io/unstructured-io/unstructured-api:latest --port 8000 --host 0.0.0.0
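Once the container is running, the loader sends files to Unstructured's general partition endpoint. A minimal sketch of that request against the local container started above (the file path is a placeholder, and this assumes Node 18+ for the built-in fetch/FormData):

```typescript
import { readFile } from "node:fs/promises";

// POST a file to the local Unstructured API; the response is a JSON
// array of structured elements (titles, paragraphs, tables, ...).
async function partitionFile(path: string): Promise<unknown> {
  const form = new FormData();
  form.append("files", new Blob([await readFile(path)]), path);
  const res = await fetch("http://localhost:8000/general/v0/general", {
    method: "POST",
    body: form,
  });
  if (!res.ok) throw new Error(`Unstructured API error: ${res.status}`);
  return res.json();
}
```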
S3 File Loader Setup
1. Drag and drop the S3 File Loader onto the canvas.
2. AWS Credential: Create a new credential for your AWS account. You'll need the access and secret key. Remember to grant the associated account an S3 bucket policy (a minimal example follows these steps). You can refer to the policy guide here.
3. Bucket: Log in to your AWS console, navigate to S3, and get your bucket name.
4. Key: Click on the object you would like to use and get the Key name.
5. Unstructured API URL: Depending on how you are using Unstructured, whether through the Hosted API or Docker, change the Unstructured API URL parameter accordingly. If you are using the Hosted API, you'll need the API key as well.
6. You can then start chatting with your file from S3. You don't have to specify a text splitter for chunking the document, because that's handled automatically by Unstructured.
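For step 2 above, a minimal read-only IAM policy for the credential's account might look like the following (the bucket name is a placeholder to replace with your own):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::YOUR_BUCKET_NAME",
        "arn:aws:s3:::YOUR_BUCKET_NAME/*"
      ]
    }
  ]
}
```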
Load data from real-time search results.
Load and process data from web search results.
Load data from text files.
Use Unstructured.io to load data from a file path.
Use Unstructured.io to load data from a folder. Note: Currently doesn't support .png and .heic until unstructured is updated.
Search documents with scores from vector store.