📊 Database Profiling

A comprehensive database metadata extraction system based on research from the "Automatic Metadata Extraction for Text-to-SQL" paper.

🎯 Overview

The Database Profiling system provides:

  • Statistical Analysis - Row counts, NULL statistics, distinct-value ratios

  • Type-Specific Profiling - Numeric, date, and string analysis

  • Semantic Type Inference - Automatic detection of emails, phones, URLs, etc.

  • Data Quality Scoring - Multi-dimensional quality assessment

  • LSH Similarity - Locality-sensitive hashing for column similarity

  • Ibis Integration - Native database optimizations for better performance

🚀 Quick Start

Enable Profiling

from ryoma_ai.datasource.postgres import PostgresDataSource

# Enable profiling with default settings
datasource = PostgresDataSource(
    connection_string="postgresql://user:pass@localhost:5432/db",
    enable_profiling=True
)

# Profile a table
profile = datasource.profile_table("customers")
print(f"Rows: {profile['table_profile']['row_count']:,}")
print(f"Completeness: {profile['table_profile']['completeness_score']:.2%}")

Custom Configuration

# Advanced profiling configuration
datasource = PostgresDataSource(
    connection_string="postgresql://user:pass@localhost:5432/db",
    enable_profiling=True,
    profiler_config={
        "sample_size": 10000,      # Rows to analyze
        "top_k": 10,               # Top frequent values
        "enable_lsh": True,        # Column similarity
        "lsh_threshold": 0.8,      # Similarity threshold
        "num_hashes": 128          # LSH precision
    }
)

📋 Core Features

Table-Level Profiling

# Comprehensive table analysis
profile = datasource.profile_table("customers")

# Access table metrics
table_info = profile["table_profile"]
print(f"Row count: {table_info['row_count']:,}")
print(f"Column count: {table_info['column_count']}")
print(f"Completeness score: {table_info['completeness_score']:.3f}")
print(f"Consistency score: {table_info['consistency_score']:.3f}")
print(f"Profiling method: {profile['profiling_summary']['profiling_method']}")

Column-Level Analysis

# Individual column profiling
column_profile = datasource.profile_column("customers", "email")

# Check semantic type and quality
if column_profile["semantic_type"] == "email":
    quality = column_profile["data_quality_score"]
    null_pct = column_profile["null_percentage"]
    distinct_ratio = column_profile["distinct_ratio"]
    
    print(f"Email column quality: {quality:.3f}")
    print(f"NULL percentage: {null_pct:.1f}%")
    print(f"Distinct ratio: {distinct_ratio:.3f}")
    
    # Show top values
    for i, value_info in enumerate(column_profile["top_k_values"][:3], 1):
        print(f"{i}. {value_info['value']} ({value_info['count']} times)")

Enhanced Catalog

# Get catalog with profiling data
catalog = datasource.get_enhanced_catalog(include_profiles=True)

for schema in catalog.schemas:
    print(f"Schema: {schema.schema_name}")
    for table in schema.tables:
        if table.profile:
            print(f"  Table: {table.table_name} ({table.profile.row_count:,} rows)")
            
            # Find high-quality columns
            high_quality_cols = table.get_high_quality_columns(min_quality_score=0.8)
            print(f"    High-quality columns: {len(high_quality_cols)}")
            
            # Show profiled columns
            profiled_cols = table.get_profiled_columns()
            print(f"    Profiled columns: {len(profiled_cols)}")

๐Ÿ” Profiling Resultsยถ

Table Profile Structureยถ

{
  "table_profile": {
    "table_name": "customers",
    "row_count": 150000,
    "column_count": 12,
    "completeness_score": 0.95,
    "consistency_score": 0.88,
    "profiled_at": "2024-01-15T10:30:00Z",
    "profiling_duration_seconds": 2.34
  },
  "profiling_summary": {
    "profiling_method": "ibis_enhanced",
    "total_columns": 12
  }
}

Column Profile Structure

{
  "column_name": "email",
  "semantic_type": "email",
  "data_quality_score": 0.92,
  "row_count": 150000,
  "null_count": 7800,
  "null_percentage": 5.2,
  "distinct_count": 147200,
  "distinct_ratio": 0.98,
  "top_k_values": [
    {"value": "user@example.com", "count": 15, "percentage": 0.01}
  ],
  "string_stats": {
    "min_length": 8,
    "max_length": 64,
    "avg_length": 24.5,
    "character_types": {
      "alphabetic": 15420,
      "numeric": 3240,
      "special": 890
    }
  }
}
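
The string_stats block can be read like any other part of the profile. A minimal sketch using the field names from the structure above (string_stats may be absent for non-string columns, hence the guard):

# Inspect string-length statistics from a column profile
column_profile = datasource.profile_column("customers", "email")

string_stats = column_profile.get("string_stats")
if string_stats:
    print(f"Length range: {string_stats['min_length']}-{string_stats['max_length']}")
    print(f"Average length: {string_stats['avg_length']:.1f}")
    # Unusually long values can hint at dirty or concatenated data
    if string_stats["max_length"] > 3 * string_stats["avg_length"]:
        print("⚠️ Some values are much longer than average; inspect outliers")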

🎯 Advanced Features

Semantic Type Detection

Automatically detects column semantic types:

# Check detected semantic types
column_profile = datasource.profile_column("users", "phone_number")

semantic_type = column_profile["semantic_type"]
if semantic_type == "phone":
    print("๐Ÿ“ž Phone number column detected")
elif semantic_type == "email":
    print("๐Ÿ“ง Email column detected")
elif semantic_type == "url":
    print("๐ŸŒ URL column detected")
elif semantic_type == "identifier":
    print("๐Ÿ†” ID column detected")

Supported Types (a detection sketch follows this list):

  • 📧 Email - Pattern-based email detection

  • 📞 Phone - Phone number format recognition

  • 🌐 URL - Web URL identification

  • 🆔 Identifier - High-uniqueness ID detection

  • 📝 General - Default text classification

Data Quality Scoring

Multi-dimensional quality assessment:

# Quality score calculation
def explain_quality_score(profile):
    completeness = 1 - (profile["null_percentage"] / 100)
    uniqueness = min(1.0, profile["distinct_ratio"] * 2)
    reliability = min(1.0, profile["sample_size"] / 1000)
    
    quality_score = (
        completeness * 0.5 +      # 50% weight
        uniqueness * 0.3 +        # 30% weight  
        reliability * 0.2         # 20% weight
    )
    
    print(f"Completeness: {completeness:.3f}")
    print(f"Uniqueness: {uniqueness:.3f}")
    print(f"Reliability: {reliability:.3f}")
    print(f"Overall Quality: {quality_score:.3f}")

column_profile = datasource.profile_column("customers", "customer_id")
explain_quality_score(column_profile)

Column Similarity Analysis

Find similar columns using LSH (Locality-Sensitive Hashing):

# Find similar columns
similar_columns = datasource.find_similar_columns("customer_id", threshold=0.8)
print(f"Columns similar to 'customer_id': {similar_columns}")
# Output: ["user_id", "client_id", "account_id"]

# Use for schema linking
if similar_columns:
    print("Found potential join candidates:")
    for col in similar_columns:
        print(f"  - {col}")

โš™๏ธ Configuration Optionsยถ

Performance Tuningยถ

# Development configuration (fast)
dev_config = {
    "sample_size": 1000,
    "top_k": 5,
    "enable_lsh": False
}

# Production configuration (balanced)
prod_config = {
    "sample_size": 10000,
    "top_k": 10,
    "enable_lsh": True,
    "lsh_threshold": 0.8
}

# High accuracy configuration (thorough)
accuracy_config = {
    "sample_size": 50000,
    "top_k": 20,
    "enable_lsh": True,
    "lsh_threshold": 0.9,
    "num_hashes": 256
}

Backend-Specific Optimizations

# PostgreSQL with advanced features
postgres_ds = PostgresDataSource(
    connection_string="postgresql://localhost:5432/db",
    enable_profiling=True,
    profiler_config={
        "use_pg_stats": True,      # Leverage pg_stats views
        "use_histograms": True,    # Use histogram data
        "sample_size": 10000
    }
)

# BigQuery with ML functions
# Import path assumed; adjust to wherever BigQueryDataSource lives in your install
from ryoma_ai.datasource.bigquery import BigQueryDataSource

bigquery_ds = BigQueryDataSource(
    project_id="my-project",
    enable_profiling=True,
    profiler_config={
        "use_ml_functions": True,  # Use ML.FEATURE_INFO
        "sample_size": 20000
    }
)

🔧 Direct Profiler Usage

Standalone Profiler

from ryoma_ai.datasource.profiler import DatabaseProfiler

# Create profiler instance
profiler = DatabaseProfiler(
    sample_size=10000,
    top_k=10,
    enable_lsh=True
)

# Profile table directly
table_profile = profiler.profile_table(datasource, "customers")
print(f"Profiled {table_profile.table_name} in {table_profile.profiling_duration_seconds:.2f}s")

# Profile individual column
column_profile = profiler.profile_column(datasource, "customers", "email")
print(f"Email quality score: {column_profile.data_quality_score:.3f}")

Batch Profiling

# Profile multiple tables
tables_to_profile = ["customers", "orders", "products"]
profiles = {}

for table_name in tables_to_profile:
    print(f"Profiling {table_name}...")
    profile = datasource.profile_table(table_name)
    profiles[table_name] = profile
    
    # Show summary
    table_info = profile["table_profile"]
    print(f"  Rows: {table_info['row_count']:,}")
    print(f"  Completeness: {table_info['completeness_score']:.2%}")

📊 Integration with SQL Agent

Enhanced Query Generation

from ryoma_ai.agent.sql import SqlAgent

# Create agent with profiled datasource
agent = SqlAgent(model="gpt-4", mode="enhanced")
agent.add_datasource(datasource)  # Profiling data automatically used

# Profiling improves query generation
response = agent.stream("""
Find high-value customers who haven't purchased recently.
Focus on customers with good data quality.
""")

# Agent uses profiling data for:
# - Better table selection (row counts, completeness)
# - Smarter column selection (quality scores, semantic types)
# - Optimized joins (similarity analysis)
# - Appropriate filtering (NULL percentages)

๐Ÿ›ก๏ธ Production Considerationsยถ

Performance Monitoringยถ

# Monitor profiling performance
profile = datasource.profile_table("large_table")
duration = profile["table_profile"]["profiling_duration_seconds"]

if duration > 10:
    print("โš ๏ธ Profiling took too long, consider:")
    print("  - Reducing sample_size")
    print("  - Disabling LSH (enable_lsh=False)")
    print("  - Using cached profiles")

Caching Profiles

# Cache profiles for reuse
import json
import os
from datetime import datetime, timedelta

def cache_profile(table_name, profile):
    cache_data = {
        "profile": profile,
        "cached_at": datetime.now().isoformat()
    }
    os.makedirs("profiles", exist_ok=True)  # ensure the cache directory exists
    with open(f"profiles/{table_name}.json", "w") as f:
        json.dump(cache_data, f)

def load_cached_profile(table_name, max_age_hours=24):
    try:
        with open(f"profiles/{table_name}.json", "r") as f:
            cache_data = json.load(f)
        
        cached_at = datetime.fromisoformat(cache_data["cached_at"])
        if datetime.now() - cached_at < timedelta(hours=max_age_hours):
            return cache_data["profile"]
    except FileNotFoundError:
        pass
    return None

# Use cached profile if available
cached = load_cached_profile("customers")
if cached:
    profile = cached
else:
    profile = datasource.profile_table("customers")
    cache_profile("customers", profile)

🎯 Best Practices

1. Configure Appropriately

  • Use smaller samples for development

  • Enable LSH for large schemas

  • Adjust thresholds based on your data

2. Monitor Performance

  • Track profiling duration

  • Cache profiles for frequently used tables

  • Use appropriate sample sizes

3. Leverage Results

  • Use quality scores for column selection

  • Apply semantic types for better formatting

  • Utilize similarity for schema linking

4. Production Deployment

  • Schedule regular profile updates (see the sketch after this list)

  • Monitor profiling overhead

  • Set up alerts for failures
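
A minimal sketch of such a refresh job, assuming a simple table list and a placeholder send_alert hook; the 30-second threshold and daily interval are arbitrary, and a real deployment would use cron, Airflow, or similar instead of a sleep loop:

import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("profile_refresh")

def send_alert(message):
    # Placeholder: route to email, Slack, PagerDuty, etc.
    logger.error("ALERT: %s", message)

def refresh_profiles(datasource, table_names):
    """Re-profile each table, alerting on failures and unusually slow runs."""
    for name in table_names:
        try:
            profile = datasource.profile_table(name)
            duration = profile["table_profile"]["profiling_duration_seconds"]
            logger.info("Profiled %s in %.2fs", name, duration)
            if duration > 30:
                send_alert(f"Profiling {name} is slow ({duration:.1f}s)")
        except Exception as exc:
            send_alert(f"Profiling {name} failed: {exc}")

# Simple daily loop; replace with your scheduler of choice
while True:
    refresh_profiles(datasource, ["customers", "orders", "products"])
    time.sleep(24 * 60 * 60)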