Advanced Metadata Usage

Master complex filtering, data types, and metadata best practices

Metadata in Ducky allows you to attach structured information to your documents, enabling powerful filtering and organization capabilities. This guide covers advanced patterns for using metadata effectively in your applications.

Introduction

Metadata is key-value data attached to documents that helps you:

  • Organize content by categories, tags, or hierarchies
  • Filter search results based on specific criteria
  • Implement business logic like permissions, workflows, or content states
  • Track document properties like creation dates, authors, or versions

The sections below cover the supported value types and field naming rules, the comparison and logical operators available in filters, and practices for designing metadata that stays easy and efficient to filter.

Metadata Basics

Supported Data Types

Metadata values can be strings, numbers, booleans, or arrays of strings. The examples below index one document for each type:

from duckyai import DuckyAI

ducky = DuckyAI(api_key="<DUCKYAI_API_KEY>")

# String values
ducky.documents.index(
    index_name="content",
    doc_id="doc1",
    content="Document content",
    metadata={
        "category": "technology",
        "author": "John Doe",
        "status": "published"
    }
)

# Number values
ducky.documents.index(
    index_name="content",
    doc_id="doc2",
    content="Document content",
    metadata={
        "price": 29.99,
        "rating": 4.5,
        "view_count": 1250
    }
)

# Boolean values
ducky.documents.index(
    index_name="content",
    doc_id="doc3",
    content="Document content",
    metadata={
        "is_featured": True,
        "is_public": False,
        "requires_login": True
    }
)

# Array of strings
ducky.documents.index(
    index_name="content",
    doc_id="doc4",
    content="Document content",
    metadata={
        "tags": ["python", "tutorial", "beginner"],
        "departments": ["engineering", "product"],
        "permissions": ["read", "write"]
    }
)
import { Ducky } from "duckyai-ts";

const ducky = new Ducky({
  apiKey: process.env["DUCKY_API_KEY"] ?? "",
});

// String values
await ducky.documents.index({
  indexName: "content",
  docId: "doc1",
  content: "Document content",
  metadata: {
    category: "technology",
    author: "John Doe",
    status: "published"
  }
});

// Number values
await ducky.documents.index({
  indexName: "content",
  docId: "doc2",
  content: "Document content",
  metadata: {
    price: 29.99,
    rating: 4.5,
    view_count: 1250
  }
});

// Boolean values
await ducky.documents.index({
  indexName: "content",
  docId: "doc3",
  content: "Document content",
  metadata: {
    is_featured: true,
    is_public: false,
    requires_login: true
  }
});

// Array of strings
await ducky.documents.index({
  indexName: "content",
  docId: "doc4",
  content: "Document content",
  metadata: {
    tags: ["python", "tutorial", "beginner"],
    departments: ["engineering", "product"],
    permissions: ["read", "write"]
  }
});

Field Naming Rules

  • No forward slashes: Field names cannot contain "/" characters
  • Reserved prefixes: Avoid fields starting with "ducky_" (reserved for internal use)
  • Recommended naming: Use snake_case or camelCase consistently
# Good field names
metadata = {
    "user_role": "admin",
    "created_date": "2024-01-15",
    "content_type": "article"
}

# Avoid these
metadata = {
    "user/role": "admin",        # Contains "/"
    "ducky_internal": "value",   # Reserved prefix
}
// Good field names
const metadata = {
  userRole: "admin",
  createdDate: "2024-01-15",
  contentType: "article"
};

// Avoid these
const invalidMetadata = {
  "user/role": "admin",        // Contains "/"
  "ducky_internal": "value",   // Reserved prefix
};
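
The naming rules above can be checked on the client before a document is sent for indexing. The following is a minimal sketch in plain Python; field_name_problems and invalid_field_names are hypothetical helpers written for this guide, not part of the Ducky SDK, and they cover only the two hard rules listed above (no "/" and no "ducky_" prefix).

```python
RESERVED_PREFIX = "ducky_"  # reserved for internal use, per the rules above


def field_name_problems(name: str) -> list[str]:
    """Return the naming rules that `name` breaks (empty list if none)."""
    problems = []
    if "/" in name:
        problems.append('contains "/"')
    if name.startswith(RESERVED_PREFIX):
        problems.append(f'starts with reserved prefix "{RESERVED_PREFIX}"')
    return problems


def invalid_field_names(metadata: dict) -> dict[str, list[str]]:
    """Map each offending field name to the rules it breaks."""
    report = {}
    for name in metadata:
        problems = field_name_problems(name)
        if problems:
            report[name] = problems
    return report


# Hypothetical usage: an empty report means every field name passed.
report = invalid_field_names({
    "user_role": "admin",
    "user/role": "admin",
    "ducky_internal": "value",
})
print(report)
```

Returning a report rather than raising on the first problem lets a caller log or fix every bad name in one pass.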

Common Validation Errors

# This will cause validation errors
try:
    ducky.documents.index(
        index_name="content",
        doc_id="doc1",
        content="Content",
        metadata={
            "nested": {"objects": "not supported"},  # Nested objects not allowed
            "invalid/field": "value",                # Forward slash not allowed
            "mixed_array": ["string", 123, True]     # Mixed-type arrays not supported
        }
    )
except Exception as e:
    print(f"Validation error: {e}")
// This will cause validation errors
try {
  await ducky.documents.index({
    indexName: "content",
    docId: "doc1",
    content: "Content",
    metadata: {
      nested: { objects: "not supported" },  // Nested objects not allowed
      "invalid/field": "value",              // Forward slash not allowed
      mixed_array: ["string", 123, true]     // Mixed-type arrays not supported
    }
  });
} catch (error) {
  console.log(`Validation error: ${error}`);
}
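
A pre-flight check can surface these problems before any request is made. The sketch below is an assumption-laden illustration, not Ducky's own validation: value_errors is a hypothetical helper that accepts only the four value types listed under Supported Data Types and treats everything else (including None) as unsupported. It checks values only; field-name rules are described under Field Naming Rules.

```python
def value_errors(metadata: dict) -> list[str]:
    """Collect every value-rule violation instead of stopping at the first."""
    errors = []
    for name, value in metadata.items():
        # Scalars listed as supported: string, number, boolean.
        if isinstance(value, (str, bool, int, float)):
            continue
        if isinstance(value, list):
            # Only arrays made entirely of strings are listed as supported.
            if not all(isinstance(item, str) for item in value):
                errors.append(f"{name}: arrays must contain only strings")
            continue
        if isinstance(value, dict):
            errors.append(f"{name}: nested objects are not supported")
        else:
            # Assumption: any type not listed above is rejected.
            errors.append(f"{name}: unsupported type {type(value).__name__}")
    return errors


errors = value_errors({
    "nested": {"objects": "not supported"},
    "invalid/field": "value",  # the value is fine; the name is not checked here
    "mixed_array": ["string", 123, True],
})
for message in errors:
    print(message)
```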

Advanced Filtering

Comparison Operators

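Use comparison operators to filter documents by metadata value. The operators used in this guide are $eq, $ne, $gt, $gte, $lt, $lte, $in, and $nin; a bare value is shorthand for $eq: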

# Exact match (simplified syntax)
results = ducky.documents.retrieve(
    index_name="content",
    query="search term",
    top_k=10,
    metadata_filter={
        "category": "technology"  # Equivalent to {"$eq": "technology"}
    }
)

# Explicit equality
results = ducky.documents.retrieve(
    index_name="content",
    query="search term",
    top_k=10,
    metadata_filter={
        "category": {"$eq": "technology"}
    }
)

# Not equal
results = ducky.documents.retrieve(
    index_name="content",
    query="search term",
    top_k=10,
    metadata_filter={
        "status": {"$ne": "draft"}
    }
)

# Numerical comparisons
results = ducky.documents.retrieve(
    index_name="content",
    query="search term",
    top_k=10,
    metadata_filter={
        "price": {"$gte": 20.0, "$lte": 100.0},  # Between 20 and 100
        "rating": {"$gt": 4.0}                   # Greater than 4.0
    }
)

# Array membership
results = ducky.documents.retrieve(
    index_name="content",
    query="search term",
    top_k=10,
    metadata_filter={
        "tags": {"$in": ["python", "javascript"]},      # Contains python OR javascript
        "departments": {"$nin": ["deprecated", "old"]}  # Does NOT contain deprecated or old
    }
)
// Exact match (simplified syntax)
const results = await ducky.documents.retrieve({
  indexName: "content",
  query: "search term",
  topK: 10,
  metadataFilter: {
    category: "technology"  // Equivalent to {"$eq": "technology"}
  }
});

// Explicit equality
const explicitEqResults = await ducky.documents.retrieve({
  indexName: "content",
  query: "search term",
  topK: 10,
  metadataFilter: {
    category: { "$eq": "technology" }
  }
});

// Not equal
const notEqualResults = await ducky.documents.retrieve({
  indexName: "content",
  query: "search term",
  topK: 10,
  metadataFilter: {
    status: { "$ne": "draft" }
  }
});

// Numerical comparisons
const rangeResults = await ducky.documents.retrieve({
  indexName: "content",
  query: "search term",
  topK: 10,
  metadataFilter: {
    price: { "$gte": 20.0, "$lte": 100.0 },  // Between 20 and 100
    rating: { "$gt": 4.0 }                   // Greater than 4.0
  }
});

// Array membership
const membershipResults = await ducky.documents.retrieve({
  indexName: "content",
  query: "search term",
  topK: 10,
  metadataFilter: {
    tags: { "$in": ["python", "javascript"] },      // Contains python OR javascript
    departments: { "$nin": ["deprecated", "old"] }  // Does NOT contain deprecated or old
  }
});
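
Range conditions such as the price example above are easy to get subtly wrong, for instance by passing a string bound or swapping the limits. The following sketch shows one way to build them defensively. number_range is a hypothetical helper, not an SDK function; it only assembles the dictionary that would be placed under a field name in metadata_filter.

```python
from numbers import Real


def number_range(minimum=None, maximum=None, inclusive=True) -> dict:
    """Build a range condition such as {"$gte": 20.0, "$lte": 100.0}."""
    if minimum is None and maximum is None:
        raise ValueError("give at least one bound")
    for bound in (minimum, maximum):
        # bool is a subclass of int in Python, so reject it explicitly.
        if bound is not None and (isinstance(bound, bool) or not isinstance(bound, Real)):
            raise TypeError(f"bounds must be numbers, got {bound!r}")
    if minimum is not None and maximum is not None and minimum > maximum:
        raise ValueError("minimum is greater than maximum")

    lower_op, upper_op = ("$gte", "$lte") if inclusive else ("$gt", "$lt")
    condition = {}
    if minimum is not None:
        condition[lower_op] = minimum
    if maximum is not None:
        condition[upper_op] = maximum
    return condition


# Same conditions as the "Numerical comparisons" example above.
price_filter = {"price": number_range(20.0, 100.0)}
rating_filter = {"rating": number_range(minimum=4.0, inclusive=False)}
print(price_filter, rating_filter)
```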

Logical Operators

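Combine multiple conditions using logical operators. Listing several fields in one filter requires all of them to match (AND); "$or" is placed under a field name and takes a list of alternative conditions for that field: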

# AND logic (multiple conditions must ALL be true)
results = ducky.documents.retrieve(
    index_name="content",
    query="search term",
    top_k=10,
    metadata_filter={
        "category": "technology",
        "status": "published",
        "is_featured": True
    }
)

# OR logic (at least one condition must be true)
results = ducky.documents.retrieve(
    index_name="content",
    query="search term",
    top_k=10,
    metadata_filter={
        "category": {
            "$or": [
                {"$eq": "technology"},
                {"$eq": "science"},
                {"$eq": "engineering"}
            ]
        }
    }
)

# Complex nested logic
results = ducky.documents.retrieve(
    index_name="content",
    query="search term",
    top_k=10,
    metadata_filter={
        "status": "published",
        "category": {
            "$or": [
                {"$eq": "technology"},
                {"$eq": "science"}
            ]
        },
        "rating": {"$gte": 4.0}
    }
)
// AND logic (multiple conditions must ALL be true)
const results = await ducky.documents.retrieve({
  indexName: "content",
  query: "search term",
  topK: 10,
  metadataFilter: {
    category: "technology",
    status: "published",
    is_featured: true
  }
});

// OR logic (at least one condition must be true)
const orResults = await ducky.documents.retrieve({
  indexName: "content",
  query: "search term",
  topK: 10,
  metadataFilter: {
    category: {
      "$or": [
        { "$eq": "technology" },
        { "$eq": "science" },
        { "$eq": "engineering" }
      ]
    }
  }
});

// Complex nested logic
const nestedResults = await ducky.documents.retrieve({
  indexName: "content",
  query: "search term",
  topK: 10,
  metadataFilter: {
    status: "published",
    category: {
      "$or": [
        { "$eq": "technology" },
        { "$eq": "science" }
      ]
    },
    rating: { "$gte": 4.0 }
  }
});
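
When filters grow, it can help to unit test them locally against sample metadata. The sketch below is a small in-memory model of the filter semantics as the examples above describe them. It is not Ducky's implementation, and the server may differ in edge cases. Two assumptions are built in: a document that lacks a filtered field never matches, and "$in" / "$nin" on an array field test whether any element is in the given list. The names matches and _field_matches are hypothetical.

```python
import operator

_COMPARATORS = {
    "$eq": operator.eq, "$ne": operator.ne,
    "$gt": operator.gt, "$gte": operator.ge,
    "$lt": operator.lt, "$lte": operator.le,
}


def _field_matches(value, condition) -> bool:
    """Check one field value against one condition."""
    if not isinstance(condition, dict):
        return value == condition  # bare value is shorthand for {"$eq": ...}
    for op, operand in condition.items():
        if op == "$or":
            ok = any(_field_matches(value, sub) for sub in operand)
        elif op in ("$in", "$nin"):
            values = value if isinstance(value, list) else [value]
            hit = any(v in operand for v in values)
            ok = hit if op == "$in" else not hit
        elif op in _COMPARATORS:
            ok = _COMPARATORS[op](value, operand)
        else:
            raise ValueError(f"unknown operator: {op}")
        if not ok:
            return False  # operators inside one condition are ANDed
    return True


def matches(metadata: dict, metadata_filter: dict) -> bool:
    """True if every field condition holds (implicit AND across fields)."""
    return all(
        field in metadata and _field_matches(metadata[field], cond)
        for field, cond in metadata_filter.items()
    )


doc = {"status": "published", "category": "science", "rating": 4.2, "tags": ["python"]}
nested = {
    "status": "published",
    "category": {"$or": [{"$eq": "technology"}, {"$eq": "science"}]},
    "rating": {"$gte": 4.0},
}
print(matches(doc, nested))
```

A model like this is only useful for catching mistakes in filter construction; confirm actual behavior against the service.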

Complex Query Examples

# Find high-rated technology articles for premium users
results = ducky.documents.retrieve(
    index_name="content",
    query="artificial intelligence",
    top_k=10,
    metadata_filter={
        "category": "technology",
        "rating": {"$gte": 4.5},
        "access_level": {
            "$or": [
                {"$eq": "premium"},
                {"$eq": "enterprise"}
            ]
        },
        "tags": {"$in": ["ai", "machine-learning", "deep-learning"]},
        "is_featured": True
    }
)

# Find recent documents excluding drafts, with numerical scoring
results = ducky.documents.retrieve(
    index_name="content",
    query="product updates",
    top_k=20,
    metadata_filter={
        "created_year": {"$gte": 2024},
        "status": {"$ne": "draft"},
        "priority_score": {"$gt": 75},
        "departments": {
            "$or": [
                {"$in": ["product", "engineering"]},
                {"$in": ["marketing", "sales"]}
            ]
        }
    }
)
// Find high-rated technology articles for premium users
const results = await ducky.documents.retrieve({
  indexName: "content",
  query: "artificial intelligence",
  topK: 10,
  metadataFilter: {
    category: "technology",
    rating: { "$gte": 4.5 },
    access_level: {
      "$or": [
        { "$eq": "premium" },
        { "$eq": "enterprise" }
      ]
    },
    tags: { "$in": ["ai", "machine-learning", "deep-learning"] },
    is_featured: true
  }
});

// Find recent documents excluding drafts, with numerical scoring
const recentResults = await ducky.documents.retrieve({
  indexName: "content",
  query: "product updates",
  topK: 20,
  metadataFilter: {
    created_year: { "$gte": 2024 },
    status: { "$ne": "draft" },
    priority_score: { "$gt": 75 },
    departments: {
      "$or": [
        { "$in": ["product", "engineering"] },
        { "$in": ["marketing", "sales"] }
      ]
    }
  }
});
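
In an application, filters like these are usually assembled from optional user input rather than written by hand. The following sketch shows one way to do that. build_filter and its parameter names are hypothetical and chosen to mirror the fields used in the examples above; the function only builds a dictionary and makes no API calls.

```python
def build_filter(category=None, min_rating=None, tags=None, exclude_status=None) -> dict:
    """Assemble a metadata filter, skipping any criterion that was not given."""
    metadata_filter = {}
    if category is not None:
        metadata_filter["category"] = category          # exact match
    if min_rating is not None:
        metadata_filter["rating"] = {"$gte": min_rating}
    if tags:                                            # skips None and empty lists
        metadata_filter["tags"] = {"$in": list(tags)}
    if exclude_status is not None:
        metadata_filter["status"] = {"$ne": exclude_status}
    return metadata_filter


ai_filter = build_filter(
    category="technology",
    min_rating=4.5,
    tags=["ai", "machine-learning"],
)
print(ai_filter)
```

The result can be passed as metadata_filter to ducky.documents.retrieve in the same way as the hand-written filters above. An empty dictionary means no criteria were supplied; how the API treats an empty filter is not covered here, so a caller may prefer to skip filtering in that case.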

Best Practices

Efficient Metadata Design

# Good: Structured, consistent metadata
good_metadata = {
    "content_type": "article",           # Consistent categorization
    "publish_date": "2024-01-15",        # Standardized date format
    "author_id": "user_123",             # Use IDs for relationships
    "tags": ["python", "tutorial"],      # Normalized, lowercase tags
    "priority": 85,                      # Numerical for comparisons
    "is_public": True                    # Boolean for binary states
}

# Avoid: Inconsistent, hard-to-filter metadata
avoid_metadata = {
    "Type": "ARTICLE",                   # Inconsistent casing
    "date": "Jan 15, 2024",             # Non-standard date format
    "author": "John Doe",               # Full names instead of IDs
    "tags": ["Python", "TUTORIAL"],     # Inconsistent casing
    "priority": "high",                 # String instead of number
    "visibility": "public"              # String instead of boolean
}
// Good: Structured, consistent metadata
const goodMetadata = {
  contentType: "article",           // Consistent categorization
  publishDate: "2024-01-15",        // Standardized date format
  authorId: "user_123",             // Use IDs for relationships
  tags: ["python", "tutorial"],     // Normalized, lowercase tags
  priority: 85,                     // Numerical for comparisons
  isPublic: true                    // Boolean for binary states
};

// Avoid: Inconsistent, hard-to-filter metadata
const avoidMetadata = {
  Type: "ARTICLE",                   // Inconsistent casing
  date: "Jan 15, 2024",             // Non-standard date format
  author: "John Doe",               // Full names instead of IDs
  tags: ["Python", "TUTORIAL"],     // Inconsistent casing
  priority: "high",                 // String instead of number
  visibility: "public"              // String instead of boolean
};
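
Consistency is easier to keep when values are normalized in one place before indexing. The sketch below covers two of the points above: standardized dates and lowercase tags. to_iso_date and normalize_tags are hypothetical helpers; the fallback date format matches the "Jan 15, 2024" style from the example and assumes English month abbreviations.

```python
from datetime import date, datetime


def to_iso_date(text: str, fallback_format: str = "%b %d, %Y") -> str:
    """Return the date as YYYY-MM-DD, parsing a fallback format if needed."""
    try:
        return date.fromisoformat(text).isoformat()
    except ValueError:
        # Not ISO already, so try the looser format (raises if that fails too).
        return datetime.strptime(text, fallback_format).date().isoformat()


def normalize_tags(tags: list[str]) -> list[str]:
    """Lowercase and trim tags, dropping blanks and duplicates in order."""
    cleaned_tags = []
    for tag in tags:
        cleaned = tag.strip().lower()
        if cleaned and cleaned not in cleaned_tags:
            cleaned_tags.append(cleaned)
    return cleaned_tags


metadata = {
    "publish_date": to_iso_date("Jan 15, 2024"),
    "tags": normalize_tags(["Python", "TUTORIAL", "python "]),
}
print(metadata)
```

ISO 8601 date strings also sort correctly as plain text, which is one reason to prefer them over free-form dates.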

Performance Tips

# Efficient: Use specific, selective filters
efficient_filter = {
    "category": "technology",        # Highly selective
    "status": "published",           # Filters out many documents
    "rating": {"$gte": 4.5}         # Numerical comparison
}

# Less efficient: Broad, non-selective filters
broad_filter = {
    "has_content": True,             # Matches almost all documents
    "tags": {"$nin": ["deprecated"]} # Excludes very few documents
}

# Optimize array searches
# Good: Search for specific values
tags_filter = {
    "tags": {"$in": ["python", "javascript"]}
}

# Better: Use boolean flags for common filters
metadata_with_flags = {
    "tags": ["python", "web", "tutorial"],
    "is_beginner_friendly": True,    # Boolean flag for common filter
    "is_advanced": False,
    "has_code_examples": True
}
// Efficient: Use specific, selective filters
const efficientFilter = {
  category: "technology",        // Highly selective
  status: "published",           // Filters out many documents
  rating: { "$gte": 4.5 }        // Numerical comparison
};

// Less efficient: Broad, non-selective filters
const broadFilter = {
  has_content: true,             // Matches almost all documents
  tags: { "$nin": ["deprecated"] } // Excludes very few documents
};

// Optimize array searches
// Good: Search for specific values
const tagsFilter = {
  tags: { "$in": ["python", "javascript"] }
};

// Better: Use boolean flags for common filters
const metadataWithFlags = {
  tags: ["python", "web", "tutorial"],
  isBeginnerFriendly: true,    // Boolean flag for common filter
  isAdvanced: false,
  hasCodeExamples: true
};
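
Boolean flags like those in the last example can be derived from tags at indexing time so that they never drift out of sync. This is a minimal sketch under stated assumptions: with_flags and the FLAG_TAGS mapping are hypothetical, and which tag implies which flag is an application decision, not something Ducky defines.

```python
# Hypothetical mapping: flag field name -> tag that switches it on.
FLAG_TAGS = {
    "is_beginner_friendly": "beginner",
    "is_advanced": "advanced",
}


def with_flags(metadata: dict, flag_tags: dict = FLAG_TAGS) -> dict:
    """Return a copy of metadata with one boolean flag per configured tag."""
    tags = set(metadata.get("tags", []))
    flags = {flag: tag in tags for flag, tag in flag_tags.items()}
    return {**metadata, **flags}  # the original dict is left untouched


enriched = with_flags({"tags": ["python", "web", "beginner"]})
print(enriched)
```

A frequent filter can then use a plain equality check such as {"is_beginner_friendly": True} instead of an array membership test.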

Common Mistakes to Avoid

# Mistake 1: Using strings for numerical comparisons
# Bad
metadata = {"rating": "4.5"}  # String - can't use $gt, $lt
metadata_filter = {"rating": {"$gt": "4.0"}}  # String comparison doesn't work as expected

# Good
metadata = {"rating": 4.5}  # Number
metadata_filter = {"rating": {"$gt": 4.0}}  # Numerical comparison

# Mistake 2: Inconsistent data types
# Bad
metadata1 = {"priority": "high"}
metadata2 = {"priority": 85}
metadata3 = {"priority": True}

# Good - consistent numerical priorities
metadata1 = {"priority": 90}
metadata2 = {"priority": 85}
metadata3 = {"priority": 95}

# Mistake 3: Overly complex metadata structures
# Bad - trying to nest objects
metadata = {
    "user": {
        "name": "John",
        "role": "admin"
    }
}

# Good - flatten the structure
metadata = {
    "user_name": "John",
    "user_role": "admin"
}
// Mistake 1: Using strings for numerical comparisons
// Bad
const badMetadata = { rating: "4.5" };  // String - can't use $gt, $lt
const badFilter = { rating: { "$gt": "4.0" } };  // String comparison doesn't work as expected

// Good
const goodMetadata = { rating: 4.5 };  // Number
const goodFilter = { rating: { "$gt": 4.0 } };  // Numerical comparison

// Mistake 2: Inconsistent data types
// Bad
const metadata1 = { priority: "high" };
const metadata2 = { priority: 85 };
const metadata3 = { priority: true };

// Good - consistent numerical priorities
const fixedMetadata1 = { priority: 90 };
const fixedMetadata2 = { priority: 85 };
const fixedMetadata3 = { priority: 95 };

// Mistake 3: Overly complex metadata structures
// Bad - trying to nest objects
const nestedMetadata = {
  user: {
    name: "John",
    role: "admin"
  }
};

// Good - flatten the structure
const flatMetadata = {
  userName: "John",
  userRole: "admin"
};
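
If source data arrives nested, it can be flattened automatically before indexing. The sketch below shows one way to do this; flatten_metadata is a hypothetical helper, and the "_" separator is a choice made here (any separator other than "/" would satisfy the naming rules).

```python
def flatten_metadata(nested: dict, separator: str = "_", prefix: str = "") -> dict:
    """Flatten nested dicts into a single level, joining keys with separator."""
    flat = {}
    for key, value in nested.items():
        name = f"{prefix}{separator}{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse, carrying the joined name down as the new prefix.
            flat.update(flatten_metadata(value, separator, name))
        else:
            flat[name] = value
    return flat


flat = flatten_metadata({"user": {"name": "John", "role": "admin"}, "status": "published"})
print(flat)
```

If two paths flatten to the same name (for example a top-level "user_name" next to a nested "user" object with a "name" key), the later one overwrites the earlier, so it is worth checking for collisions in real data.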