Working with Files

How to upload and index PDF files and text documents

Ducky can upload and index files directly, so you can add PDFs and text documents to your search index without extracting their content yourself.

Supported File Types

  • PDF files - Automatically extracts text content
  • Text files - UTF-8 encoded documents (.txt, .md, etc.)
  • Maximum file size: 60MB
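Checking a file locally before uploading can catch the most common rejections early. The sketch below is one way to do that under stated assumptions: it treats 60MB as 60 × 1024 × 1024 bytes, uses an illustrative set of text extensions (the list above ends in "etc."), and relies on the fact that PDF files normally begin with the bytes %PDF-. The function name check_uploadable is hypothetical and is not part of the Ducky SDK.

```python
from pathlib import Path

# Assumption: "60MB" is read as 60 MiB; lower this if the service counts decimal megabytes.
MAX_FILE_BYTES = 60 * 1024 * 1024
# Illustrative only: extend with whatever other text extensions you use.
TEXT_SUFFIXES = {".txt", ".md"}


def check_uploadable(path: str) -> list[str]:
    """Return a list of problems; an empty list means the file looks uploadable."""
    p = Path(path)
    if not p.is_file():
        return [f"{path}: not a file"]

    problems = []
    if p.stat().st_size > MAX_FILE_BYTES:
        problems.append(f"{p.name}: larger than {MAX_FILE_BYTES} bytes")

    suffix = p.suffix.lower()
    if suffix == ".pdf":
        # PDF files normally start with the "%PDF-" header.
        with p.open("rb") as f:
            if f.read(5) != b"%PDF-":
                problems.append(f"{p.name}: missing %PDF- header")
    elif suffix in TEXT_SUFFIXES:
        # Text files must be UTF-8, so try to decode the whole file.
        try:
            p.read_bytes().decode("utf-8")
        except UnicodeDecodeError:
            problems.append(f"{p.name}: not valid UTF-8")
    else:
        problems.append(f"{p.name}: unrecognized extension {suffix!r}")
    return problems
```

A caller would typically skip the upload and report the messages when the returned list is not empty.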

Uploading Files

Python SDK

from duckyai import DuckyAI

ducky = DuckyAI(api_key="your-api-key")

# Upload a PDF file
with open("user-manual.pdf", "rb") as file:
    result = ducky.documents.index_file(
        index_name="documentation",
        doc_id="user-manual-v2",
        file={
            "file_name": "user-manual.pdf",
            "content": file
        },
        title="User Manual v2.0",
        metadata={"version": "2.0", "type": "manual"}
    )

print(f"File uploaded: {result.doc_id}")

# Upload a text file
with open("policy.txt", "rb") as file:
    result = ducky.documents.index_file(
        index_name="policies",
        doc_id="privacy-policy",
        file={
            "file_name": "policy.txt", 
            "content": file
        },
        title="Privacy Policy"
    )

TypeScript SDK

import { Ducky } from "duckyai-ts";
import { openAsBlob } from "node:fs";

const ducky = new Ducky({
  apiKey: process.env.DUCKY_API_KEY ?? "",
});

// Upload a PDF file
const pdfResult = await ducky.documents.indexFile({
  indexName: "documentation",
  docId: "user-manual-v2",
  file: await openAsBlob("user-manual.pdf"),
  title: "User Manual v2.0",
  metadata: { version: "2.0", type: "manual" }
});

console.log(`File uploaded: ${pdfResult.docId}`);

// Upload a text file
const textResult = await ducky.documents.indexFile({
  indexName: "policies",
  docId: "privacy-policy", 
  file: await openAsBlob("policy.txt"),
  title: "Privacy Policy"
});

How File Processing Works

Automatic Content Extraction

When you upload a file, Ducky automatically:

  1. Extracts text content from PDFs or reads UTF-8 text files
  2. Processes content asynchronously in the background
  3. Makes content searchable once processing completes

Processing Time

  • Text files: Ready within seconds
  • PDF files: Processing time depends on file size and complexity
  • Large files: May take several minutes to become fully searchable
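Because processing is asynchronous, code that searches right after an upload may need to wait. The following is a minimal polling sketch with exponential backoff. This page does not describe a status endpoint, so the readiness check is left to the caller as the is_ready callable; wait_until_searchable and all of its parameters are hypothetical names, not SDK functions.

```python
import time
from typing import Callable


def wait_until_searchable(
    is_ready: Callable[[], bool],
    timeout_s: float = 300.0,
    initial_delay_s: float = 1.0,
    max_delay_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll is_ready() with exponential backoff.

    Returns True as soon as is_ready() returns True, or False once
    timeout_s seconds have passed without success.
    """
    deadline = time.monotonic() + timeout_s
    delay = initial_delay_s
    while True:
        if is_ready():
            return True
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        # Never sleep past the deadline; double the delay up to max_delay_s.
        sleep(min(delay, remaining))
        delay = min(delay * 2, max_delay_s)
```

One possible is_ready is a small function that runs a query you expect the new document to match and checks whether it appears in the results. The sleep parameter exists so the loop can be tested without real waiting.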

PDF Processing

For PDF files, Ducky:

  • Extracts text from all pages
  • Maintains document structure where possible
  • Handles multi-page documents automatically

File Updates

You can update files using the same doc_id:

# Update an existing file
with open("user-manual-v3.pdf", "rb") as file:
    result = ducky.documents.index_file(
        index_name="documentation",
        doc_id="user-manual-v2",  # Same doc_id updates the existing file
        file={
            "file_name": "user-manual-v3.pdf",
            "content": file
        },
        title="User Manual v3.0",
        metadata={"version": "3.0", "type": "manual"}
    )

// Update an existing file
const updateResult = await ducky.documents.indexFile({
  indexName: "documentation",
  docId: "user-manual-v2",  // Same docId updates the existing file
  file: await openAsBlob("user-manual-v3.pdf"),
  title: "User Manual v3.0",
  metadata: { version: "3.0", type: "manual" }
});
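If files are re-uploaded on a schedule, it can be worth skipping files whose content has not changed. One way to do this, sketched below, is to compare a SHA-256 hash of the local file with a hash recorded at the previous upload. Where that hash is kept is up to you; storing it in the document metadata under a key such as content_sha256 is one option, but that key and the helper names here are assumptions, not part of the SDK.

```python
import hashlib
from typing import Optional


def file_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_reupload(path: str, stored_hash: Optional[str]) -> bool:
    """True if no hash was recorded yet or the file content has changed."""
    return stored_hash != file_sha256(path)
```

When needs_reupload returns True, upload the file with the same doc_id as before and record the new hash.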

Best Practices

File Naming and Organization

# Good - descriptive doc_ids
doc_id = "user-manual-2024"
doc_id = "privacy-policy-latest"
doc_id = "product-spec-v2-1"

# Include version info in metadata
metadata = {
    "version": "2.1",
    "document_type": "specification",
    "last_updated": "2024-01-15"
}
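When uploading many files, deriving doc_ids from file names keeps them descriptive and consistent. The helper below is a sketch of one convention (lowercase, hyphen-separated, extension dropped). The name make_doc_id is hypothetical, and the character set is a stylistic choice rather than a documented Ducky requirement.

```python
import re
from pathlib import Path


def make_doc_id(filename: str) -> str:
    """Turn a file name into a lowercase, hyphen-separated doc_id."""
    stem = Path(filename).stem.lower()  # drop directories and the final extension
    # Collapse every run of characters other than a-z and 0-9 into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", stem).strip("-")
    if not slug:
        raise ValueError(f"cannot derive a doc_id from {filename!r}")
    return slug


print(make_doc_id("Product Spec v2.1.pdf"))  # product-spec-v2-1
```

Two different file names can map to the same doc_id (for example "User Manual.pdf" and "user_manual.pdf"), so check for collisions if that matters for your collection.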

Handling Large Files

# For large files, consider breaking them into sections
# if they contain distinct topics

# Upload individual chapters
with open("chapter1.pdf", "rb") as file:
    ducky.documents.index_file(
        index_name="textbook",
        doc_id="textbook-chapter-1",
        file={"file_name": "chapter1.pdf", "content": file},
        title="Chapter 1: Introduction",
        metadata={"chapter": 1, "subject": "mathematics"}
    )
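For text documents, the split into sections can be done locally before uploading. The sketch below splits Markdown text at top-level "# " headings and returns (title, body) pairs that could each be written to its own file and uploaded as shown above. It is a simplification: it does not recognize headings inside fenced code blocks, it drops sections with an empty body, and the "Preamble" title for text before the first heading is an arbitrary choice. Splitting PDFs would need a PDF library and is not shown.

```python
import re


def split_markdown_sections(text: str) -> list[tuple[str, str]]:
    """Split Markdown text at top-level '# ' headings into (title, body) pairs."""
    sections: list[tuple[str, str]] = []
    title = "Preamble"  # used for any text that precedes the first heading
    lines: list[str] = []

    def flush() -> None:
        # Keep a section only if its body has non-whitespace content.
        body = "\n".join(lines).strip()
        if body:
            sections.append((title, body))

    for line in text.splitlines():
        match = re.match(r"^# (.+)$", line)
        if match:
            flush()
            title = match.group(1).strip()
            lines = []
        else:
            lines.append(line)
    flush()
    return sections
```

Each title can then be passed as the title argument, with the section number stored in metadata as in the chapter example.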

File Metadata

Use metadata to organize and filter your files:

# Categorize by document type
metadata = {
    "document_type": "manual",
    "department": "engineering", 
    "confidentiality": "public",
    "file_format": "pdf"
}

# Then filter when searching
results = ducky.documents.retrieve(
    index_name="documents",
    query="installation process",
    metadata_filter={"document_type": "manual"}
)

// Categorize by document type
const metadata = {
  documentType: "manual",
  department: "engineering", 
  confidentiality: "public",
  fileFormat: "pdf"
};

// Then filter when searching
const results = await ducky.documents.retrieve({
  indexName: "documents",
  query: "installation process",
  metadataFilter: { documentType: "manual" }
});

Common Use Cases

Documentation Libraries

# Upload company documentation
files = ["handbook.pdf", "policies.pdf", "procedures.pdf"]

for filename in files:
    with open(filename, "rb") as file:
        ducky.documents.index_file(
            index_name="company-docs",
            doc_id=filename.removesuffix(".pdf"),  # strips only the trailing extension (Python 3.9+)
            file={"file_name": filename, "content": file}
        )
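For larger libraries, one failed upload should not stop the whole batch. The sketch below wraps the same index_file call in a loop over a directory and collects failures instead of raising. The function name index_pdf_directory and the shape of its return value are hypothetical. The broad except clause is deliberate, since this page does not list the exceptions the SDK raises; narrow it once you know them.

```python
from pathlib import Path


def index_pdf_directory(client, index_name: str, directory: str) -> dict:
    """Upload every *.pdf in a directory and report what succeeded and failed.

    client is expected to expose documents.index_file(...) with the keyword
    arguments used elsewhere on this page.
    """
    indexed: list[str] = []
    failed: dict[str, str] = {}
    for path in sorted(Path(directory).glob("*.pdf")):
        try:
            with path.open("rb") as file:
                client.documents.index_file(
                    index_name=index_name,
                    doc_id=path.stem,  # file name without the extension
                    file={"file_name": path.name, "content": file},
                )
            indexed.append(path.stem)
        except Exception as exc:  # assumption: SDK exception types are unknown here
            failed[path.name] = str(exc)
    return {"indexed": indexed, "failed": failed}
```

Calling index_pdf_directory(ducky, "company-docs", "./docs") returns the doc_ids that were submitted and a mapping of file name to error message for the rest, which can be retried later.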

Knowledge Bases

# Create searchable knowledge base from PDF manuals
with open("technical-manual.pdf", "rb") as file:
    ducky.documents.index_file(
        index_name="technical-knowledge",
        doc_id="tech-manual-2024",
        file={"file_name": "technical-manual.pdf", "content": file},
        metadata={"category": "technical", "audience": "engineers"}
    )

File uploads make it easy to get your existing documents into Ducky without manual content copying. The automatic processing handles the technical details, so you can focus on organizing and searching your content.
