# 📊 Database Profiling System > **Comprehensive metadata extraction for Text-to-SQL applications** > Based on "Automatic Metadata Extraction for Text-to-SQL" research paper ## 🚀 Quick Start ```python from ryoma_ai.datasource.postgres import PostgresDataSource # Enable profiling with Ibis optimization datasource = PostgresDataSource( connection_string="postgresql://user:pass@host:5432/db", enable_profiling=True ) # Profile table - automatic method selection profile = datasource.profile_table("customers") print(f"Method: {profile['profiling_summary']['profiling_method']}") ``` ## 🎯 Core Features | 📊 Feature | 🔧 Implementation | 💡 Use Case | |------------|------------------|-------------| | **Row counts & NULL stats** | `COUNT()`, `COUNT(column)` | Data completeness | | **Distinct-value ratios** | `COUNT(DISTINCT column)` | Cardinality analysis | | **Statistical measures** | `MIN()`, `MAX()`, `AVG()` | Numeric profiling | | **String analysis** | `LENGTH()`, pattern matching | Text data insights | | **Top-k frequent values** | `GROUP BY ... ORDER BY COUNT()` | Common patterns | | **LSH similarity** | MinHash algorithms | Column matching | ## 🏗️ Architecture ### Intelligent Profiling Engine ```mermaid graph LR A[📊 Profile Request] --> B{🔍 Ibis Available?} B -->|✅ Yes| C[⚡ Ibis-Enhanced] B -->|❌ No| D[🔧 Standard Mode] C --> E{✅ Success?} E -->|✅ Yes| F[📋 Results] E -->|❌ No| D D --> F classDef request fill:#e3f2fd,stroke:#1976d2 classDef decision fill:#fff8e1,stroke:#f57c00 classDef process fill:#f3e5f5,stroke:#7b1fa2 classDef result fill:#e8f5e8,stroke:#388e3c class A request class B,E decision class C,D process class F result ``` ### 🚀 Why Ibis Integration? | 💡 Benefit | 📝 Description | |------------|----------------| | **Server-side computation** | Statistics calculated in database | | **Optimized SQL generation** | Uses database query planner | | **Reduced network I/O** | Only results transferred | | **Native functions** | Leverages built-in DB functions | | **Automatic fallback** | Ensures 100% compatibility | ### 🔧 API Methods ```python # 📊 Table profiling profile = datasource.profile_table("customers") # 📋 Column analysis column_profile = datasource.profile_column("customers", "email") # 🔍 Direct Ibis access ibis_table = datasource.connect().table("customers") stats = ibis_table.describe().to_pandas() ``` ## 📊 Example Results
📋 Table Profile Example ```json { "table_profile": { "row_count": 150000, "completeness_score": 0.95, "consistency_score": 0.88 }, "profiling_summary": { "profiling_method": "ibis_enhanced", "total_columns": 12 } } ```
📋 Column Profile Example ```json { "email": { "semantic_type": "email", "data_quality_score": 0.92, "null_percentage": 5.2, "distinct_ratio": 0.98, "top_k_values": [ {"value": "user@example.com", "count": 15, "percentage": 0.01} ] } } ```
## 🎯 Advanced Features ### 🏷️ Semantic Type Detection | 🔍 Type | 🎯 Use Case | |---------|-------------| | **📧 Email** | Contact analysis | | **📞 Phone** | Communication data | | **🌐 URL** | Web analytics | | **🆔 ID** | Primary key detection | ### 📊 Data Quality Formula ```python quality_score = ( completeness * 0.5 + # 1 - NULL% uniqueness * 0.3 + # Distinct ratio reliability * 0.2 # Sample size ) ``` ### 🔗 Column Similarity ```python # Find similar columns similar = datasource.find_similar_columns("customer_id", threshold=0.8) # → ["user_id", "client_id", "account_id"] ``` ## ⚙️ Configuration ### 🚀 Quick Setup ```python datasource = PostgresDataSource( connection_string="postgresql://...", enable_profiling=True # Auto-optimization ) ``` ### 🎛️ Tuning Options | 🎯 Use Case | 📊 Sample | 🔝 Top-K | 🔗 LSH | |-------------|-----------|----------|--------| | **🚀 Development** | 1K | 5 | Off | | **⚖️ Production** | 10K | 10 | On | | **🎯 High Accuracy** | 50K | 20 | On |
🔧 Custom Configuration ```python profiler_config = { "sample_size": 10000, # Rows to analyze "top_k": 10, # Frequent values "enable_lsh": True, # Similarity matching "lsh_threshold": 0.8 # Similarity threshold } ```
## 💻 Usage Examples
🚀 Basic Profiling ```python from ryoma_ai.datasource.postgres import PostgresDataSource datasource = PostgresDataSource( connection_string="postgresql://user:pass@host:5432/db", enable_profiling=True ) # Profile table profile = datasource.profile_table("customers") method = profile["profiling_summary"]["profiling_method"] print(f"Method: {method}") ```
🔍 Advanced Analysis ```python # Column profiling email_profile = datasource.profile_column("customers", "email") if email_profile["semantic_type"] == "email": quality = email_profile["data_quality_score"] print(f"Email quality: {quality:.2f}") # Enhanced catalog catalog = datasource.get_enhanced_catalog(include_profiles=True) for schema in catalog.schemas: for table in schema.tables: high_quality = table.get_high_quality_columns(min_quality_score=0.8) print(f"{table.table_name}: {len(high_quality)} high-quality columns") ```
⚡ Direct Ibis Access ```python # Custom analysis with Ibis conn = datasource.connect() ibis_table = conn.table("customers") stats = ibis_table.describe().to_pandas() age_mean = ibis_table.age.mean().to_pandas() top_cities = ibis_table.city.value_counts().limit(5).to_pandas() ```
## 🔧 Backend Support | 🗄️ Database | 🚀 Native Features | 🎯 Best For | |-------------|-------------------|-------------| | **PostgreSQL** | `pg_stats`, histograms | Production OLTP | | **BigQuery** | ML functions, `INFORMATION_SCHEMA` | Analytics | | **DuckDB** | Advanced analytics | OLAP workloads | | **SQLite** | `PRAGMA` optimizations | Development | | **Snowflake** | Cloud-native views | Data warehouse | | **MySQL** | `INFORMATION_SCHEMA` | Web backends | ## 🤖 Text-to-SQL Benefits ### 🔗 Enhanced Schema Linking | 🎯 Feature | 💡 How It Helps | |------------|-----------------| | **Statistical relevance** | Uses row counts for table selection | | **Semantic types** | Detects emails, phones, IDs | | **Quality scores** | Filters low-quality columns | | **Similarity analysis** | Finds related columns | ### ⚡ Smarter Query Generation ```python # Profiling-informed optimization if column_profile["distinct_ratio"] > 0.8: query += f"GROUP BY {column_name}" # High cardinality elif column_profile["null_percentage"] < 5: query += f"WHERE {column_name} IS NOT NULL" # Low nulls ``` ### 🛡️ Error Prevention - **NULL handling** based on actual percentages - **Type safety** using semantic types - **Cardinality awareness** for optimization ## 🚀 Production Guide ### ✅ Deployment Checklist - [ ] Enable profiling: `enable_profiling=True` - [ ] Configure sampling for your data size - [ ] Set quality thresholds - [ ] Monitor profiling overhead - [ ] Schedule regular updates ### 📊 Scaling Guide | 📏 Table Size | ⚙️ Config | 📝 Notes | |---------------|-----------|----------| | < 100K rows | Default | Full analysis | | 100K-1M rows | `sample_size=5000` | Balanced | | > 1M rows | `sample_size=10000` | Optimized |
🔍 Monitoring & Troubleshooting ```python # Check method used method = profile["profiling_summary"]["profiling_method"] if method == "standard": print("⚠️ Ibis not used - check compatibility") # Monitor duration duration = profile["table_profile"]["profiling_duration_seconds"] if duration > 10: print("⚠️ Consider reducing sample_size") ```
## 📚 API Reference
🔧 Core Methods ```python # Table profiling profile = datasource.profile_table(table_name, schema=None) # Column profiling column_profile = datasource.profile_column(table_name, column_name, schema=None) # Enhanced catalog catalog = datasource.get_enhanced_catalog(include_profiles=True) # Column similarity similar = datasource.find_similar_columns(column_name, threshold=0.8) ```
⚙️ Configuration Options ```python profiler_config = { "sample_size": 10000, # Rows to sample "top_k": 10, # Top frequent values "lsh_threshold": 0.8, # Similarity threshold "num_hashes": 128, # LSH hash functions "enable_lsh": True # Enable similarity matching } ```
## 🚨 Troubleshooting | ⚠️ Issue | 💡 Solution | |----------|-------------| | **Slow profiling** | Reduce `sample_size` | | **Memory errors** | Set `enable_lsh=False` | | **Permission errors** | Check DB user permissions | | **Type errors** | Auto-handled by fallback |
🔍 Debug Examples ```python # Check profiling method method = profile["profiling_summary"]["profiling_method"] if method == "standard": print("Ibis not used - check compatibility") # Reduce sample for large tables profiler_config = {"sample_size": 1000} ```
## 🔮 Roadmap ### 🚀 Upcoming Features - **🤖 ML-based pattern detection** - Advanced semantic types - **🔗 Cross-table relationships** - Statistical correlation analysis - **⏰ Temporal profiling** - Data quality tracking over time - **🎯 Custom semantic types** - Domain-specific definitions --- ## 📖 References | 📚 Resource | 🔗 Link | |-------------|---------| | **Research Paper** | "Automatic Metadata Extraction for Text-to-SQL" | | **Ibis Framework** | [ibis-project.org](https://ibis-project.org) | | **MinHash/LSH** | Locality-Sensitive Hashing algorithms | | **Statistical Methods** | Database profiling techniques |