# 📊 Database Profiling Comprehensive database metadata extraction system based on research from "Automatic Metadata Extraction for Text-to-SQL" paper. ## 🎯 Overview The Database Profiling system provides: - **Statistical Analysis** - Row counts, NULL statistics, distinct-value ratios - **Type-Specific Profiling** - Numeric, date, and string analysis - **Semantic Type Inference** - Automatic detection of emails, phones, URLs, etc. - **Data Quality Scoring** - Multi-dimensional quality assessment - **LSH Similarity** - Locality-sensitive hashing for column similarity - **Ibis Integration** - Native database optimizations for better performance ## 🚀 Quick Start ### Enable Profiling ```python from ryoma_ai.datasource.postgres import PostgresDataSource # Enable profiling with default settings datasource = PostgresDataSource( connection_string="postgresql://user:pass@localhost:5432/db", enable_profiling=True ) # Profile a table profile = datasource.profile_table("customers") print(f"Rows: {profile['table_profile']['row_count']:,}") print(f"Completeness: {profile['table_profile']['completeness_score']:.2%}") ``` ### Custom Configuration ```python # Advanced profiling configuration datasource = PostgresDataSource( connection_string="postgresql://user:pass@localhost:5432/db", enable_profiling=True, profiler_config={ "sample_size": 10000, # Rows to analyze "top_k": 10, # Top frequent values "enable_lsh": True, # Column similarity "lsh_threshold": 0.8, # Similarity threshold "num_hashes": 128 # LSH precision } ) ``` ## 📋 Core Features ### Table-Level Profiling ```python # Comprehensive table analysis profile = datasource.profile_table("customers") # Access table metrics table_info = profile["table_profile"] print(f"Row count: {table_info['row_count']:,}") print(f"Column count: {table_info['column_count']}") print(f"Completeness score: {table_info['completeness_score']:.3f}") print(f"Consistency score: {table_info['consistency_score']:.3f}") print(f"Profiling method: {profile['profiling_summary']['profiling_method']}") ``` ### Column-Level Analysis ```python # Individual column profiling column_profile = datasource.profile_column("customers", "email") # Check semantic type and quality if column_profile["semantic_type"] == "email": quality = column_profile["data_quality_score"] null_pct = column_profile["null_percentage"] distinct_ratio = column_profile["distinct_ratio"] print(f"Email column quality: {quality:.3f}") print(f"NULL percentage: {null_pct:.1f}%") print(f"Distinct ratio: {distinct_ratio:.3f}") # Show top values for i, value_info in enumerate(column_profile["top_k_values"][:3], 1): print(f"{i}. {value_info['value']} ({value_info['count']} times)") ``` ### Enhanced Catalog ```python # Get catalog with profiling data catalog = datasource.get_enhanced_catalog(include_profiles=True) for schema in catalog.schemas: print(f"Schema: {schema.schema_name}") for table in schema.tables: if table.profile: print(f" Table: {table.table_name} ({table.profile.row_count:,} rows)") # Find high-quality columns high_quality_cols = table.get_high_quality_columns(min_quality_score=0.8) print(f" High-quality columns: {len(high_quality_cols)}") # Show profiled columns profiled_cols = table.get_profiled_columns() print(f" Profiled columns: {len(profiled_cols)}") ``` ## 🔍 Profiling Results ### Table Profile Structure ```json { "table_profile": { "table_name": "customers", "row_count": 150000, "column_count": 12, "completeness_score": 0.95, "consistency_score": 0.88, "profiled_at": "2024-01-15T10:30:00Z", "profiling_duration_seconds": 2.34 }, "profiling_summary": { "profiling_method": "ibis_enhanced", "total_columns": 12 } } ``` ### Column Profile Structure ```json { "column_name": "email", "semantic_type": "email", "data_quality_score": 0.92, "row_count": 150000, "null_count": 7800, "null_percentage": 5.2, "distinct_count": 147200, "distinct_ratio": 0.98, "top_k_values": [ {"value": "user@example.com", "count": 15, "percentage": 0.01} ], "string_stats": { "min_length": 8, "max_length": 64, "avg_length": 24.5, "character_types": { "alphabetic": 15420, "numeric": 3240, "special": 890 } } } ``` ## 🎯 Advanced Features ### Semantic Type Detection Automatically detects column semantic types: ```python # Check detected semantic types column_profile = datasource.profile_column("users", "phone_number") semantic_type = column_profile["semantic_type"] if semantic_type == "phone": print("📞 Phone number column detected") elif semantic_type == "email": print("📧 Email column detected") elif semantic_type == "url": print("🌐 URL column detected") elif semantic_type == "identifier": print("🆔 ID column detected") ``` **Supported Types:** - **📧 Email** - Pattern-based email detection - **📞 Phone** - Phone number format recognition - **🌐 URL** - Web URL identification - **🆔 Identifier** - High-uniqueness ID detection - **📝 General** - Default text classification ### Data Quality Scoring Multi-dimensional quality assessment: ```python # Quality score calculation def explain_quality_score(profile): completeness = 1 - (profile["null_percentage"] / 100) uniqueness = min(1.0, profile["distinct_ratio"] * 2) reliability = min(1.0, profile["sample_size"] / 1000) quality_score = ( completeness * 0.5 + # 50% weight uniqueness * 0.3 + # 30% weight reliability * 0.2 # 20% weight ) print(f"Completeness: {completeness:.3f}") print(f"Uniqueness: {uniqueness:.3f}") print(f"Reliability: {reliability:.3f}") print(f"Overall Quality: {quality_score:.3f}") column_profile = datasource.profile_column("customers", "customer_id") explain_quality_score(column_profile) ``` ### Column Similarity Analysis Find similar columns using LSH (Locality-Sensitive Hashing): ```python # Find similar columns similar_columns = datasource.find_similar_columns("customer_id", threshold=0.8) print(f"Columns similar to 'customer_id': {similar_columns}") # Output: ["user_id", "client_id", "account_id"] # Use for schema linking if similar_columns: print("Found potential join candidates:") for col in similar_columns: print(f" - {col}") ``` ## ⚙️ Configuration Options ### Performance Tuning ```python # Development configuration (fast) dev_config = { "sample_size": 1000, "top_k": 5, "enable_lsh": False } # Production configuration (balanced) prod_config = { "sample_size": 10000, "top_k": 10, "enable_lsh": True, "lsh_threshold": 0.8 } # High accuracy configuration (thorough) accuracy_config = { "sample_size": 50000, "top_k": 20, "enable_lsh": True, "lsh_threshold": 0.9, "num_hashes": 256 } ``` ### Backend-Specific Optimizations ```python # PostgreSQL with advanced features postgres_ds = PostgresDataSource( connection_string="postgresql://localhost:5432/db", enable_profiling=True, profiler_config={ "use_pg_stats": True, # Leverage pg_stats views "use_histograms": True, # Use histogram data "sample_size": 10000 } ) # BigQuery with ML functions bigquery_ds = BigQueryDataSource( project_id="my-project", enable_profiling=True, profiler_config={ "use_ml_functions": True, # Use ML.FEATURE_INFO "sample_size": 20000 } ) ``` ## 🔧 Direct Profiler Usage ### Standalone Profiler ```python from ryoma_ai.datasource.profiler import DatabaseProfiler # Create profiler instance profiler = DatabaseProfiler( sample_size=10000, top_k=10, enable_lsh=True ) # Profile table directly table_profile = profiler.profile_table(datasource, "customers") print(f"Profiled {table_profile.table_name} in {table_profile.profiling_duration_seconds:.2f}s") # Profile individual column column_profile = profiler.profile_column(datasource, "customers", "email") print(f"Email quality score: {column_profile.data_quality_score:.3f}") ``` ### Batch Profiling ```python # Profile multiple tables tables_to_profile = ["customers", "orders", "products"] profiles = {} for table_name in tables_to_profile: print(f"Profiling {table_name}...") profile = datasource.profile_table(table_name) profiles[table_name] = profile # Show summary table_info = profile["table_profile"] print(f" Rows: {table_info['row_count']:,}") print(f" Completeness: {table_info['completeness_score']:.2%}") ``` ## 📊 Integration with SQL Agent ### Enhanced Query Generation ```python from ryoma_ai.agent.sql import SqlAgent # Create agent with profiled datasource agent = SqlAgent(model="gpt-4", mode="enhanced") agent.add_datasource(datasource) # Profiling data automatically used # Profiling improves query generation response = agent.stream(""" Find high-value customers who haven't purchased recently. Focus on customers with good data quality. """) # Agent uses profiling data for: # - Better table selection (row counts, completeness) # - Smarter column selection (quality scores, semantic types) # - Optimized joins (similarity analysis) # - Appropriate filtering (NULL percentages) ``` ## 🛡️ Production Considerations ### Performance Monitoring ```python # Monitor profiling performance profile = datasource.profile_table("large_table") duration = profile["table_profile"]["profiling_duration_seconds"] if duration > 10: print("⚠️ Profiling took too long, consider:") print(" - Reducing sample_size") print(" - Disabling LSH (enable_lsh=False)") print(" - Using cached profiles") ``` ### Caching Profiles ```python # Cache profiles for reuse import json from datetime import datetime, timedelta def cache_profile(table_name, profile): cache_data = { "profile": profile, "cached_at": datetime.now().isoformat() } with open(f"profiles/{table_name}.json", "w") as f: json.dump(cache_data, f) def load_cached_profile(table_name, max_age_hours=24): try: with open(f"profiles/{table_name}.json", "r") as f: cache_data = json.load(f) cached_at = datetime.fromisoformat(cache_data["cached_at"]) if datetime.now() - cached_at < timedelta(hours=max_age_hours): return cache_data["profile"] except FileNotFoundError: pass return None # Use cached profile if available cached = load_cached_profile("customers") if cached: profile = cached else: profile = datasource.profile_table("customers") cache_profile("customers", profile) ``` ## 🎯 Best Practices ### 1. **Configure Appropriately** - Use smaller samples for development - Enable LSH for large schemas - Adjust thresholds based on your data ### 2. **Monitor Performance** - Track profiling duration - Cache profiles for frequently used tables - Use appropriate sample sizes ### 3. **Leverage Results** - Use quality scores for column selection - Apply semantic types for better formatting - Utilize similarity for schema linking ### 4. **Production Deployment** - Schedule regular profile updates - Monitor profiling overhead - Set up alerts for failures ## 🔗 Related Documentation - **[Enhanced SQL Agent](../agent/sql.md)** - Uses profiling for better queries - **[Data Sources](../data-sources/index.md)** - Database connectors with profiling - **[Architecture Guide](../../architecture/database-profiling.md)** - Technical details