How to load data into Ducky
Upload your documents into Ducky's indexes
Prerequisites
Before you can upload data into Ducky, ensure you have already done the following:
- You have an activated Ducky account.
- You have created a project and an index, or followed the Authentication and setup guide.
- You have a valid API key for your project.
- [Optional] Install the Python SDK if you are developing in Python:
python -m pip install duckyai
Data Organization
In Ducky, documents are stored and organized into indexes, which are structured collections of related data. Each index represents a specific dataset, such as customer reviews, product descriptions, or internal reports. When data is added, each document is deposited into a designated index, ensuring it is categorized appropriately. During a search, users target a specific index, so queries return results only from the most relevant dataset.
[Diagram: Ducky data structure]
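Because searches are scoped to a single index, queries name the index they run against. As an illustrative sketch only, assuming a hypothetical retrieve method on the SDK (the actual query method name and parameters may differ), a scoped query could look like:

from duckyai import DuckyAI

client = DuckyAI(api_key="<DUCKYAI_API_KEY>")

# Hypothetical query call: results come only from the "customer-reviews" index
results = client.documents.retrieve(
    index_name="customer-reviews",
    query="how was the customer service?",
)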
Example of how to load a document into Ducky
from duckyai import DuckyAI

client = DuckyAI(api_key="<DUCKYAI_API_KEY>")

result = client.documents.index(
    index_name="customer-reviews",
    doc_id="review_id1",
    content="The bag itself was mediocre, but the customer service team was exceptional in resolving my issue quickly and efficiently.",
    title="okay bag but great customer service",
    metadata={"source": "online", "rating": 4},
)
curl --request POST \
  --url https://api.ducky.ai/v1/documents/index-text \
  --header 'accept: application/json' \
  --header 'content-type: application/json' \
  --header 'x-api-key: <DUCKYAI_API_KEY>' \
  --data '
{
  "index_name": "customer-reviews",
  "source_document_id": "review_id1",
  "title": "okay bag but great customer service",
  "content": "The bag itself was mediocre, but the customer service team was exceptional in resolving my issue quickly and efficiently.",
  "metadata": {
    "source": "online",
    "rating": 4
  }
}
'
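If you omit source_document_id, Ducky generates an identifier and returns it in the response (see the field reference below). A minimal sketch of capturing it from the SDK result; the doc_id attribute name here is an assumption and may differ in your SDK version:

result = client.documents.index(
    index_name="customer-reviews",
    content="Great durability, but the zipper feels flimsy.",
)
# Hypothetical attribute name; check the actual response object in your SDK
print(result.doc_id)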
Ducky organizes data into a simple, flexible structure to make it easy for developers to upload, manage, and search through their documents. Each document in Ducky must follow a specific schema with the following key fields:
content (Required)
The core material of the document and the most critical field in Ducky’s data model. Ducky processes and indexes this text to enable fast and accurate search functionality. Currently, only text is supported, but future updates will include support for images and other content types.

title (Optional)
A concise, human-readable summary of the document. This field is indexed and searchable and can be used to give users quick identification or context. Developers can optionally use the metadata field to include alternative titles if needed for display purposes but not for search.

url (Optional)
A link to the source of the document or additional reference material. This field is indexed and searchable, allowing developers to retrieve documents based on their source URL. It is particularly useful for enabling navigation to original documents in user-facing interfaces, while also supporting search when the URL itself is a key part of the query.

source_document_id (Optional)
A unique identifier for the document that links back to its source. Useful for tracking documents and mapping them to external systems. If you do not provide one, the system generates an ID and returns it in the response.

metadata (Optional)
A flexible dictionary for storing additional context or attributes about the document, such as tags, categories, or ratings, which can enhance filtering or organization. Supported value types are string, number, and boolean.
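Putting the schema together, here is a sketch of a request that exercises every field via the same REST endpoint as the curl example above. The url value, review text, and document ID are illustrative placeholders:

import requests

# Example payload using every documented field; only "content" is required.
# Metadata values may be strings, numbers, or booleans.
payload = {
    "index_name": "customer-reviews",
    "source_document_id": "review_id2",
    "title": "sturdy backpack, fast shipping",
    "url": "https://example.com/reviews/review_id2",  # illustrative source link
    "content": "The backpack survived a month of daily commuting without any wear.",
    "metadata": {"source": "online", "rating": 5, "verified": True},
}

response = requests.post(
    "https://api.ducky.ai/v1/documents/index-text",
    headers={
        "accept": "application/json",
        "content-type": "application/json",
        "x-api-key": "<DUCKYAI_API_KEY>",
    },
    json=payload,
)
response.raise_for_status()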
Batch Indexing in Ducky
Ducky supports batch indexing, which lets you index multiple documents in a single operation, up to 100 documents per batch. This is particularly useful for large datasets, since it avoids the overhead of one API call per document. Below is an example of DuckyAI's batch indexing with a sample dataset of customer reviews:
import asyncio
from duckyai import DuckyAI

# Example dataset with three entries
dataset = [
    {
        "content": "The bag itself was mediocre, but the customer service team was exceptional in resolving my issue quickly and efficiently.",
        "title": "okay bag but great customer service",
        "metadata": {"source": "online", "rating": 4},
    },
    {
        "content": "The product quality was outstanding, and the delivery was quicker than expected.",
        "title": "excellent product and fast delivery",
        "metadata": {"source": "online", "rating": 5},
    },
    {
        "content": "The item was defective, but the return process was smooth and hassle-free.",
        "title": "defective item but smooth return process",
        "metadata": {"source": "online", "rating": 3},
    },
]

client = DuckyAI(api_key="<DUCKYAI_API_KEY>")

async def index_all_documents(dataset):
    docs = [
        {
            "index_name": "customer-reviews",
            "content": doc["content"],
            "title": doc["title"],
            "metadata": doc["metadata"],
        }
        for doc in dataset
    ]
    await client.documents.batch_index_async(documents=docs)

if __name__ == "__main__":
    asyncio.run(index_all_documents(dataset))
    print("Done")
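Each batch call accepts at most 100 documents, so larger datasets need to be split into chunks first. A minimal sketch of that chunking, assuming docs is a list of document dictionaries shaped like the ones built in the example above:

import asyncio
from duckyai import DuckyAI

client = DuckyAI(api_key="<DUCKYAI_API_KEY>")

BATCH_SIZE = 100  # documented per-batch limit

async def index_in_batches(docs):
    # Submit documents 100 at a time to stay within the batch limit
    for start in range(0, len(docs), BATCH_SIZE):
        await client.documents.batch_index_async(
            documents=docs[start:start + BATCH_SIZE]
        )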
Async Indexing in Ducky
For developers handling large datasets or multiple documents, Ducky supports asynchronous indexing to improve performance and scalability. With index_async, you can upload documents concurrently, minimizing the time spent waiting for individual operations to complete.
Below is an example of how to index multiple documents asynchronously:
import asyncio
from duckyai import DuckyAI

# Example dataset with three entries
dataset = [
    {
        "content": "The bag itself was mediocre, but the customer service team was exceptional in resolving my issue quickly and efficiently.",
        "title": "okay bag but great customer service",
        "metadata": {"source": "online", "rating": 4},
    },
    {
        "content": "The product quality was outstanding, and the delivery was quicker than expected.",
        "title": "excellent product and fast delivery",
        "metadata": {"source": "online", "rating": 5},
    },
    {
        "content": "The item was defective, but the return process was smooth and hassle-free.",
        "title": "defective item but smooth return process",
        "metadata": {"source": "online", "rating": 3},
    },
]

client = DuckyAI(api_key="<DUCKYAI_API_KEY>")

CONCURRENCY_LIMIT = 5
semaphore = asyncio.Semaphore(CONCURRENCY_LIMIT)

async def index_document(doc):
    async with semaphore:  # Limit concurrency using the semaphore
        await client.documents.index_async(**doc, index_name="customer-reviews")

async def index_all_documents(dataset):
    tasks = [index_document(doc) for doc in dataset]
    # Run all tasks with concurrency control
    await asyncio.gather(*tasks)

if __name__ == "__main__":
    asyncio.run(index_all_documents(dataset))
    print("Done")
Async indexing in Ducky is efficient, as it reduces the time required to upload multiple documents by handling them concurrently. It scales easily to larger datasets without blocking other processes and integrates seamlessly with Python's asyncio module, simplifying the management of asynchronous tasks.
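One practical note: with concurrent uploads, a single failure should not go unnoticed. A sketch of one way to surface per-document failures, replacing index_all_documents above with a variant that uses asyncio.gather's return_exceptions flag (any retry policy is left to you):

async def index_all_documents(dataset):
    tasks = [index_document(doc) for doc in dataset]
    # return_exceptions=True collects failures instead of raising on the first one
    results = await asyncio.gather(*tasks, return_exceptions=True)
    for doc, result in zip(dataset, results):
        if isinstance(result, Exception):
            print(f"Failed to index {doc['title']!r}: {result}")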
Get in touch or see our roadmap if you need help.