📊 Database Profiling

Comprehensive database metadata extraction system based on the research paper "Automatic Metadata Extraction for Text-to-SQL".

🎯 Overview

The Database Profiling system extracts metadata through two complementary profiling modes:

📊 Statistical Profiling (Base Layer)

  • Row counts & NULL statistics - Data completeness analysis

  • Type-specific profiling - Numeric, date, and string analysis

  • Semantic type inference - Automatic detection of emails, phones, URLs, etc.

  • Data quality scoring - Multi-dimensional quality assessment

  • LSH similarity - Locality-sensitive hashing for column similarity

  • Database-native operations - Optimal performance with universal compatibility

🤖 LLM-Enhanced Profiling (Advanced Layer)

  • Business purpose analysis - LLM-generated field descriptions and context

  • SQL generation hints - Task-aligned metadata for better query generation

  • Join candidate scoring - Intelligent relationship discovery

  • Usage pattern analysis - Common query patterns and optimization suggestions

  • Domain classification - Business domain and data freshness assessment

🚀 Quick Start

Optimized Database Profiling

from ryoma_ai.datasource.postgres import PostgresDataSource

# Create datasource with automatic profiling capabilities
datasource = PostgresDataSource(
    connection_string="postgresql://user:pass@localhost:5432/db"
)

# Profile a table - automatically uses database-native methods
profile = datasource.profile_table("customers")
print(f"Rows: {profile['table_profile']['row_count']:,}")
print(f"Completeness: {profile['table_profile']['completeness_score']:.2%}")
print(f"Method: {profile['profiling_summary']['profiling_method']}")  # Always "native_optimized"

LLM-Enhanced Profiling

from ryoma_ai.agent.internals.enhanced_profiler import EnhancedDatabaseProfiler
from ryoma_ai.models.openai import OpenAIModel

# Create enhanced profiler with LLM analysis
model = OpenAIModel("gpt-4")
enhanced_profiler = EnhancedDatabaseProfiler(
    datasource=datasource,
    model=model,
    enable_llm_analysis=True
)

# Get enhanced table metadata
enhanced_metadata = enhanced_profiler.profile_table_enhanced("customers")
print(f"Table purpose: {enhanced_metadata.primary_purpose}")
print(f"Business domain: {enhanced_metadata.business_domain}")

# Get enhanced field metadata
field_metadata = enhanced_profiler.profile_field_enhanced("customers", "email")
print(f"Field description: {field_metadata.llm_description}")
print(f"Business purpose: {field_metadata.business_purpose}")
print(f"SQL hints: {field_metadata.sql_generation_hints}")

Custom Configuration

# Custom profiling configuration
datasource = PostgresDataSource(
    connection_string="postgresql://user:pass@localhost:5432/db",
    profiler_config={
        "sample_size": 10000,      # Rows for LSH analysis
        "top_k": 10,               # Top frequent values
        "enable_lsh": True,        # Column similarity matching
        "lsh_threshold": 0.8,      # Similarity threshold
        "num_hashes": 128,         # LSH precision
        "enable_llm_enhancement": True  # Enable LLM analysis
    }
)

# Enhanced profiler with Ibis base layer
enhanced_profiler = EnhancedDatabaseProfiler(
    datasource=datasource,
    model=model,
    enable_llm_analysis=True,
    analysis_sample_size=100  # Sample size for LLM analysis
)

🤖 LLM-Enhanced Profiling

Enhanced Field Metadata

The EnhancedDatabaseProfiler provides AI-powered metadata generation:

from ryoma_ai.agent.internals.enhanced_profiler import EnhancedDatabaseProfiler

# Create enhanced profiler
enhanced_profiler = EnhancedDatabaseProfiler(
    datasource=datasource,
    model=model,
    enable_llm_analysis=True
)

# Get enhanced field analysis
field_metadata = enhanced_profiler.profile_field_enhanced("orders", "customer_id")

# Access enhanced information
print(f"Description: {field_metadata.llm_description}")
print(f"Business purpose: {field_metadata.business_purpose}")
print(f"SQL hints: {field_metadata.sql_generation_hints}")
print(f"Join score: {field_metadata.join_candidate_score}")
print(f"Semantic tags: {field_metadata.semantic_tags}")
print(f"Usage patterns: {field_metadata.usage_patterns}")

Enhanced Table Metadata

# Get comprehensive table analysis
table_metadata = enhanced_profiler.profile_table_enhanced("customers")

# Access table-level insights
print(f"Table description: {table_metadata.table_description}")
print(f"Primary purpose: {table_metadata.primary_purpose}")
print(f"Business domain: {table_metadata.business_domain}")
print(f"Data freshness: {table_metadata.data_freshness_assessment}")

# Access join patterns
for pattern in table_metadata.common_join_patterns:
    print(f"Join pattern: {pattern}")

# Access all field metadata
for field_name, field_meta in table_metadata.field_metadata.items():
    print(f"{field_name}: {field_meta.business_purpose}")

SQL Generation Context

Get optimized metadata for SQL generation tasks:

# Get context optimized for SQL generation
sql_context = enhanced_profiler.get_sql_generation_context("customers")

# This provides metadata in a format optimized for text-to-SQL models
print(f"Table context: {sql_context['table_context']}")
print(f"Field contexts: {sql_context['field_contexts']}")
print(f"Join suggestions: {sql_context['join_suggestions']}")

📋 Core Features

Table-Level Profiling

# Comprehensive table analysis
profile = datasource.profile_table("customers")

# Access table metrics
table_info = profile["table_profile"]
print(f"Row count: {table_info['row_count']:,}")
print(f"Column count: {table_info['column_count']}")
print(f"Completeness score: {table_info['completeness_score']:.3f}")
print(f"Consistency score: {table_info['consistency_score']:.3f}")
print(f"Profiling method: {profile['profiling_summary']['profiling_method']}")

Column-Level Analysis

# Individual column profiling
column_profile = datasource.profile_column("customers", "email")

# Check semantic type and quality
if column_profile["semantic_type"] == "email":
    quality = column_profile["data_quality_score"]
    null_pct = column_profile["null_percentage"]
    distinct_ratio = column_profile["distinct_ratio"]
    
    print(f"Email column quality: {quality:.3f}")
    print(f"NULL percentage: {null_pct:.1f}%")
    print(f"Distinct ratio: {distinct_ratio:.3f}")
    
    # Show top values
    for i, value_info in enumerate(column_profile["top_k_values"][:3], 1):
        print(f"{i}. {value_info['value']} ({value_info['count']} times)")

Enhanced Catalog

# Get catalog with profiling data
catalog = datasource.get_enhanced_catalog(include_profiles=True)

for schema in catalog.schemas:
    print(f"Schema: {schema.schema_name}")
    for table in schema.tables:
        if table.profile:
            print(f"  Table: {table.table_name} ({table.profile.row_count:,} rows)")
            
            # Find high-quality columns
            high_quality_cols = table.get_high_quality_columns(min_quality_score=0.8)
            print(f"    High-quality columns: {len(high_quality_cols)}")
            
            # Show profiled columns
            profiled_cols = table.get_profiled_columns()
            print(f"    Profiled columns: {len(profiled_cols)}")

๐Ÿ” Profiling Resultsยถ

Basic Table Profile Structure

{
  "table_profile": {
    "table_name": "customers",
    "row_count": 150000,
    "column_count": 12,
    "completeness_score": 0.95,
    "consistency_score": 0.88,
    "profiled_at": "2024-01-15T10:30:00Z",
    "profiling_duration_seconds": 2.34
  },
  "profiling_summary": {
    "profiling_method": "ibis_enhanced",
    "total_columns": 12
  }
}

Enhanced Table Metadata Structure

{
  "base_profile": {
    "table_name": "customers",
    "row_count": 150000,
    "column_count": 12
  },
  "table_description": "Customer information table containing contact details and account data",
  "primary_purpose": "Store customer profiles and contact information for CRM operations",
  "business_domain": "Customer Relationship Management",
  "data_freshness_assessment": "Updated daily via ETL pipeline",
  "common_join_patterns": [
    {
      "target_table": "orders",
      "join_column": "customer_id",
      "join_type": "one_to_many",
      "confidence": 0.95
    }
  ],
  "field_metadata": {
    "customer_id": {
      "llm_description": "Unique identifier for customer records",
      "business_purpose": "Primary key for customer identification",
      "sql_generation_hints": ["Use for JOINs", "Always include in GROUP BY"],
      "join_candidate_score": 0.98
    }
  }
}

Enhanced Field Metadata Structure

{
  "base_profile": {
    "column_name": "email",
    "semantic_type": "email",
    "data_quality_score": 0.92,
    "null_percentage": 5.2,
    "distinct_ratio": 0.98
  },
  "llm_description": "Customer email addresses for communication and account identification",
  "business_purpose": "Primary contact method for customer communications and login",
  "sql_generation_hints": [
    "Use for customer identification",
    "Good for filtering active customers",
    "Consider NULL handling in queries"
  ],
  "join_candidate_score": 0.85,
  "semantic_tags": ["contact_info", "identifier", "communication"],
  "data_quality_assessment": "High quality with 95% completeness and valid email format",
  "usage_patterns": [
    "Frequently used in WHERE clauses",
    "Common in customer lookup queries",
    "Often joined with user_accounts table"
  ]
}

🎯 Advanced Features

Semantic Type Detection

Automatically detects column semantic types:

# Check detected semantic types
column_profile = datasource.profile_column("users", "phone_number")

semantic_type = column_profile["semantic_type"]
if semantic_type == "phone":
    print("๐Ÿ“ž Phone number column detected")
elif semantic_type == "email":
    print("๐Ÿ“ง Email column detected")
elif semantic_type == "url":
    print("๐ŸŒ URL column detected")
elif semantic_type == "identifier":
    print("๐Ÿ†” ID column detected")

Supported Types (a minimal detection sketch follows the list):

  • 📧 Email - Pattern-based email detection

  • 📞 Phone - Phone number format recognition

  • 🌐 URL - Web URL identification

  • 🆔 Identifier - High-uniqueness ID detection

  • 📝 General - Default text classification

Data Quality Scoring

Multi-dimensional quality assessment:

# Quality score calculation
def explain_quality_score(profile):
    completeness = 1 - (profile["null_percentage"] / 100)
    uniqueness = min(1.0, profile["distinct_ratio"] * 2)
    reliability = min(1.0, profile["sample_size"] / 1000)
    
    quality_score = (
        completeness * 0.5 +      # 50% weight
        uniqueness * 0.3 +        # 30% weight  
        reliability * 0.2         # 20% weight
    )
    
    print(f"Completeness: {completeness:.3f}")
    print(f"Uniqueness: {uniqueness:.3f}")
    print(f"Reliability: {reliability:.3f}")
    print(f"Overall Quality: {quality_score:.3f}")

column_profile = datasource.profile_column("customers", "customer_id")
explain_quality_score(column_profile)

Column Similarity Analysis

Find similar columns using LSH (Locality-Sensitive Hashing):

# Find similar columns
similar_columns = datasource.find_similar_columns("customer_id", threshold=0.8)
print(f"Columns similar to 'customer_id': {similar_columns}")
# Output: ["user_id", "client_id", "account_id"]

# Use for schema linking
if similar_columns:
    print("Found potential join candidates:")
    for col in similar_columns:
        print(f"  - {col}")

โš™๏ธ Configuration Optionsยถ

Performance Tuning

# Development configuration (fast Ibis profiling)
dev_config = {
    "sample_size": 1000,    # For LSH only (Ibis handles all stats natively)
    "top_k": 5,             # Ibis value_counts limit
    "enable_lsh": False,    # Skip similarity analysis
    "enable_llm_enhancement": False
}

# Production configuration (optimized Ibis + LLM)
prod_config = {
    "sample_size": 10000,   # LSH sample size (Ibis stats are full table)
    "top_k": 10,            # Ibis value_counts limit
    "enable_lsh": True,     # Column similarity via LSH
    "lsh_threshold": 0.8,
    "enable_llm_enhancement": True,
    "analysis_sample_size": 50
}

# High accuracy configuration (full Ibis + comprehensive LLM)
accuracy_config = {
    "sample_size": 50000,   # Large LSH sample (Ibis always uses full data)
    "top_k": 20,            # More frequent values via Ibis
    "enable_lsh": True,
    "lsh_threshold": 0.9,
    "num_hashes": 256,
    "enable_llm_enhancement": True,
    "analysis_sample_size": 200
}

# LLM-focused configuration (efficient Ibis + deep business analysis)
llm_focused_config = {
    "sample_size": 5000,    # Minimal LSH (Ibis stats are always complete)
    "enable_llm_enhancement": True,
    "analysis_sample_size": 500,  # Large LLM analysis sample
    "enable_business_context": True,
    "enable_join_analysis": True
}

Backend-Specific Optimizations

# PostgreSQL with advanced features
postgres_ds = PostgresDataSource(
    connection_string="postgresql://localhost:5432/db",
    profiler_config={
        "use_pg_stats": True,      # Leverage pg_stats views
        "use_histograms": True,    # Use histogram data
        "sample_size": 10000
    }
)

# BigQuery with ML functions
# (BigQueryDataSource import path assumed to mirror the Postgres datasource)
from ryoma_ai.datasource.bigquery import BigQueryDataSource

bigquery_ds = BigQueryDataSource(
    project_id="my-project",
    profiler_config={
        "use_ml_functions": True,  # Use ML.FEATURE_INFO
        "sample_size": 20000
    }
)
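
For context, pg_stats is PostgreSQL's built-in statistics view, populated by ANALYZE from a sample of each table. Querying it directly shows the kind of precomputed data the use_pg_stats shortcut can reuse instead of rescanning the table; the snippet below is a plain psycopg2 illustration, independent of Ryoma:

import psycopg2

conn = psycopg2.connect("postgresql://user:pass@localhost:5432/db")
with conn.cursor() as cur:
    cur.execute(
        """
        SELECT attname, null_frac, n_distinct, most_common_vals
        FROM pg_stats
        WHERE schemaname = 'public' AND tablename = 'customers'
        """
    )
    for attname, null_frac, n_distinct, most_common_vals in cur.fetchall():
        # These values come from ANALYZE samples, not full table scans
        print(attname, null_frac, n_distinct, most_common_vals)
conn.close()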

🔧 Direct Profiler Usage

Optimized Ibis Profiler

from ryoma_ai.datasource.profiler import DatabaseProfiler

# Create optimized profiler instance (always uses Ibis)
profiler = DatabaseProfiler(
    sample_size=10000,  # Used for LSH analysis only
    top_k=10,           # Ibis value_counts limit
    enable_lsh=True,    # Column similarity matching
    enable_llm_enhancement=False  # Statistical only
)

# Profile table directly - uses Ibis native methods
table_profile = profiler.profile_table(datasource, "customers")
print(f"Profiled {table_profile.table_name} in {table_profile.profiling_duration_seconds:.2f}s")

# Profile individual column - uses Ibis statistical functions
column_profile = profiler.profile_column(datasource, "customers", "email")
print(f"Email quality score: {column_profile.data_quality_score:.3f}")

Enhanced Profiler with LLM Analysis

from ryoma_ai.agent.internals.enhanced_profiler import EnhancedDatabaseProfiler
from ryoma_ai.models.openai import OpenAIModel

# Create enhanced profiler with LLM capabilities
model = OpenAIModel("gpt-4")
enhanced_profiler = EnhancedDatabaseProfiler(
    datasource=datasource,
    model=model,
    enable_llm_analysis=True,
    analysis_sample_size=100
)

# Profile table with LLM enhancement
enhanced_table = enhanced_profiler.profile_table_enhanced("customers")
print(f"Table purpose: {enhanced_table.primary_purpose}")
print(f"Business domain: {enhanced_table.business_domain}")

# Profile field with LLM enhancement
enhanced_field = enhanced_profiler.profile_field_enhanced("customers", "email")
print(f"Field description: {enhanced_field.llm_description}")
print(f"SQL hints: {enhanced_field.sql_generation_hints}")

Hybrid Approach (Ibis + LLM)

# Combine optimized Ibis profiling with LLM enhancement
base_profiler = DatabaseProfiler(sample_size=10000, enable_lsh=True)  # Always uses Ibis
enhanced_profiler = EnhancedDatabaseProfiler(
    datasource=datasource,
    model=model,
    base_profiler=base_profiler,  # Reuse optimized Ibis profiler
    enable_llm_analysis=True
)

# Get both Ibis statistical data and LLM-enhanced metadata
enhanced_metadata = enhanced_profiler.profile_table_enhanced("customers")

# Access Ibis-generated statistical data
base_stats = enhanced_metadata.base_profile
print(f"Row count: {base_stats.row_count:,}")  # From Ibis COUNT()
print(f"Completeness: {base_stats.completeness_score:.2%}")  # From Ibis NULL analysis

# Access LLM-enhanced insights
print(f"Business purpose: {enhanced_metadata.primary_purpose}")
print(f"Data freshness: {enhanced_metadata.data_freshness_assessment}")

Batch Profiling

# Profile multiple tables
tables_to_profile = ["customers", "orders", "products"]
profiles = {}

for table_name in tables_to_profile:
    print(f"Profiling {table_name}...")
    profile = datasource.profile_table(table_name)
    profiles[table_name] = profile
    
    # Show summary
    table_info = profile["table_profile"]
    print(f"  Rows: {table_info['row_count']:,}")
    print(f"  Completeness: {table_info['completeness_score']:.2%}")

🤖 LLM-Enhanced Features

Business Purpose Analysis

The enhanced profiler uses LLM analysis to understand the business context of your data:

# Get business context for a field
field_metadata = enhanced_profiler.profile_field_enhanced("orders", "total_amount")

print(f"Business purpose: {field_metadata.business_purpose}")
# Output: "Monetary value of customer orders for revenue tracking and financial reporting"

print(f"Usage patterns: {field_metadata.usage_patterns}")
# Output: ["Used in revenue calculations", "Filtered for large orders", "Aggregated for reporting"]

SQL Generation Optimization

Enhanced metadata provides specific hints for better SQL generation:

# Get SQL-optimized context
sql_context = enhanced_profiler.get_sql_generation_context("customers")

# Use in SQL agent for better query generation
from ryoma_ai.agent.sql import SqlAgent

agent = SqlAgent(model="gpt-4", mode="enhanced")
agent.add_datasource(datasource)

# The agent automatically uses enhanced metadata for:
# - Better table selection based on business purpose
# - Smarter column selection using SQL hints
# - Optimized joins using join candidate scores
# - Appropriate filtering based on data quality scores

response = agent.stream("Find high-value customers from last quarter")

Join Pattern Discovery

Automatically discover and score potential join relationships:

# Get table metadata with join patterns
table_metadata = enhanced_profiler.profile_table_enhanced("customers")

for pattern in table_metadata.common_join_patterns:
    print(f"Can join with {pattern['target_table']} on {pattern['join_column']}")
    print(f"Join type: {pattern['join_type']}, Confidence: {pattern['confidence']:.2%}")

Data Quality Assessment

LLM-powered data quality analysis provides actionable insights:

field_metadata = enhanced_profiler.profile_field_enhanced("customers", "phone")

print(f"Quality assessment: {field_metadata.data_quality_assessment}")
# Output: "Moderate quality with 15% missing values and inconsistent formatting.
#          Recommend data cleaning for phone number standardization."

# Use quality scores for column selection
high_quality_fields = [
    field_name for field_name, field_meta in table_metadata.field_metadata.items()
    if field_meta.base_profile.data_quality_score > 0.8
]
print(f"High quality fields: {high_quality_fields}")

📊 Integration with SQL Agent

Enhanced Query Generation

from ryoma_ai.agent.sql import SqlAgent

# Create agent with profiled datasource
agent = SqlAgent(model="gpt-4", mode="enhanced")
agent.add_datasource(datasource)  # Profiling data automatically used

# Profiling improves query generation
response = agent.stream("""
Find high-value customers who haven't purchased recently.
Focus on customers with good data quality.
""")

# Agent uses profiling data for:
# - Better table selection (row counts, completeness)
# - Smarter column selection (quality scores, semantic types)
# - Optimized joins (similarity analysis)
# - Appropriate filtering (NULL percentages)

๐Ÿ›ก๏ธ Production Considerationsยถ

Performance Monitoring

# Monitor profiling performance
profile = datasource.profile_table("large_table")
duration = profile["table_profile"]["profiling_duration_seconds"]

if duration > 10:
    print("โš ๏ธ Profiling took too long, consider:")
    print("  - Reducing sample_size")
    print("  - Disabling LSH (enable_lsh=False)")
    print("  - Using cached profiles")

Caching Profiles

# Cache profiles for reuse
import json
import os
from datetime import datetime, timedelta

def cache_profile(table_name, profile):
    os.makedirs("profiles", exist_ok=True)  # ensure the cache directory exists
    cache_data = {
        "profile": profile,
        "cached_at": datetime.now().isoformat()
    }
    with open(f"profiles/{table_name}.json", "w") as f:
        json.dump(cache_data, f)

def load_cached_profile(table_name, max_age_hours=24):
    try:
        with open(f"profiles/{table_name}.json", "r") as f:
            cache_data = json.load(f)
        
        cached_at = datetime.fromisoformat(cache_data["cached_at"])
        if datetime.now() - cached_at < timedelta(hours=max_age_hours):
            return cache_data["profile"]
    except FileNotFoundError:
        pass
    return None

# Use cached profile if available
cached = load_cached_profile("customers")
if cached:
    profile = cached
else:
    profile = datasource.profile_table("customers")
    cache_profile("customers", profile)

🎯 Best Practices

1. Choose the Right Profiling Mode

  • Ibis-only: Fast, consistent, database-native profiling

  • Ibis + LLM: Production use with business context analysis

  • Hybrid: Optimized Ibis base with cached LLM enhancement

# Development: Fast Ibis-only profiling
dev_profiler = DatabaseProfiler(sample_size=1000, enable_llm_enhancement=False)

# Production: Ibis + LLM enhancement with caching
prod_profiler = EnhancedDatabaseProfiler(
    datasource=datasource,
    model=model,
    enable_llm_analysis=True,
    analysis_sample_size=100
)

2. Optimize LLM Usage

  • Cache enhanced metadata to reduce API calls

  • Use appropriate sample sizes for LLM analysis

  • Enable LLM analysis only for critical tables

# Cache enhanced metadata
enhanced_metadata = enhanced_profiler.profile_table_enhanced("customers")
# Subsequent calls use cached results

# Selective LLM analysis
important_tables = ["customers", "orders", "products"]
for table in important_tables:
    enhanced_profiler.profile_table_enhanced(table)

3. Leverage Enhanced Insights

  • Use business purpose for better table selection

  • Apply SQL hints for query optimization

  • Utilize join scores for relationship discovery

# Use enhanced metadata in SQL generation
sql_context = enhanced_profiler.get_sql_generation_context("customers")
# Pass to SQL agent for better query generation

4. Monitor Performance and Costs

  • Track LLM API usage and costs

  • Monitor profiling duration for large tables

  • Cache profiles for frequently used tables

  • Set up alerts for profiling failures (a minimal monitoring wrapper follows this list)
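
One lightweight pattern that covers the duration and alerting bullets is to wrap the profiling call with a timer and log anything that crosses a budget. The profile_with_monitoring helper and the 10-second budget below are illustrative, not part of the library:

import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("profiling")

def profile_with_monitoring(datasource, table_name, max_seconds=10):
    start = time.monotonic()
    profile = datasource.profile_table(table_name)
    elapsed = time.monotonic() - start
    logger.info("Profiled %s in %.2fs", table_name, elapsed)
    if elapsed > max_seconds:
        # Route this warning into your alerting system in production
        logger.warning("Profiling %s exceeded the %ds budget", table_name, max_seconds)
    return profile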

5. Production Deployment

  • Schedule regular profile updates

  • Use hybrid approach for cost optimization

  • Implement proper error handling and fallbacks (see the sketch after this list)

  • Monitor data quality trends over time
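
As a starting point for the error-handling bullet above, a scheduled profiling job might fall back to the most recent cached profile when live profiling fails. This sketch reuses the cache_profile and load_cached_profile helpers from the caching example:

def profile_with_fallback(datasource, table_name):
    """Try live profiling; fall back to the last cached profile on failure."""
    try:
        profile = datasource.profile_table(table_name)
        cache_profile(table_name, profile)  # helper from the caching example
        return profile
    except Exception as exc:
        print(f"Profiling {table_name} failed ({exc}); using cached profile")
        cached = load_cached_profile(table_name, max_age_hours=24 * 7)
        if cached is None:
            raise  # no fallback available; surface the original error
        return cached

for table in ["customers", "orders", "products"]:
    profile_with_fallback(datasource, table)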