# Database Profiling System
Comprehensive metadata extraction for Text-to-SQL applications, based on the research paper "Automatic Metadata Extraction for Text-to-SQL".
## Quick Start

```python
from ryoma_ai.datasource.postgres import PostgresDataSource

# Enable profiling with Ibis optimization
datasource = PostgresDataSource(
    connection_string="postgresql://user:pass@host:5432/db",
    enable_profiling=True,
)

# Profile a table - the profiling method is selected automatically
profile = datasource.profile_table("customers")
print(f"Method: {profile['profiling_summary']['profiling_method']}")
```
## Core Features

| Feature | Implementation | Use Case |
|---|---|---|
| Row counts & NULL stats | | Data completeness |
| Distinct-value ratios | | Cardinality analysis |
| Statistical measures | | Numeric profiling |
| String analysis | | Text data insights |
| Top-k frequent values | | Common patterns |
| LSH similarity | MinHash algorithms | Column matching |
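Most of these statistics are simple aggregates over a column sample. A minimal, library-independent sketch of what a per-column profile computes (pure Python for illustration, not the package's actual implementation):

```python
from collections import Counter

def basic_column_stats(values, top_k=3):
    """Compute the core per-column statistics from a sample of values."""
    total = len(values)
    non_null = [v for v in values if v is not None]
    distinct = set(non_null)
    return {
        "row_count": total,
        "null_percentage": 100 * (total - len(non_null)) / total,
        "distinct_ratio": len(distinct) / len(non_null) if non_null else 0.0,
        "top_k_values": Counter(non_null).most_common(top_k),
    }

stats = basic_column_stats(["a", "b", "a", None, "a"])
print(stats["null_percentage"])  # 20.0
print(stats["distinct_ratio"])   # 0.5
print(stats["top_k_values"][0])  # ('a', 3)
```

In practice the profiler pushes these aggregates into the database rather than computing them in Python, as described in the architecture below.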
## Architecture

### Intelligent Profiling Engine

```mermaid
graph LR
    A[Profile Request] --> B{Ibis Available?}
    B -->|Yes| C[Ibis-Enhanced]
    B -->|No| D[Standard Mode]
    C --> E{Success?}
    E -->|Yes| F[Results]
    E -->|No| D
    D --> F

    classDef request fill:#e3f2fd,stroke:#1976d2
    classDef decision fill:#fff8e1,stroke:#f57c00
    classDef process fill:#f3e5f5,stroke:#7b1fa2
    classDef result fill:#e8f5e8,stroke:#388e3c

    class A request
    class B,E decision
    class C,D process
    class F result
```
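The fallback in the diagram amounts to trying the Ibis-enhanced path and catching any failure. A simplified sketch (the profiler callables here are illustrative stand-ins, not the library's internals):

```python
def profile_with_fallback(table_name, ibis_profiler=None, standard_profiler=None):
    """Try the Ibis-enhanced path first; fall back to standard mode on any failure."""
    if ibis_profiler is not None:
        try:
            result = ibis_profiler(table_name)
            result["profiling_method"] = "ibis_enhanced"
            return result
        except Exception:
            pass  # fall through to the standard profiler
    result = standard_profiler(table_name)
    result["profiling_method"] = "standard"
    return result

def broken_ibis(table):
    # Stand-in for an Ibis path that fails at runtime
    raise RuntimeError("Ibis backend unavailable")

# A failing Ibis path transparently falls back to standard mode
result = profile_with_fallback("customers", broken_ibis, lambda t: {"table": t})
print(result["profiling_method"])  # standard
```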
## Why Ibis Integration?

| Benefit | Description |
|---|---|
| Server-side computation | Statistics are calculated in the database |
| Optimized SQL generation | Uses the database query planner |
| Reduced network I/O | Only results are transferred |
| Native functions | Leverages built-in DB functions |
| Automatic fallback | Ensures 100% compatibility |
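The server-side idea can be demonstrated with plain SQL: all of the statistics below are computed inside the database engine, and only one aggregated row crosses the wire. This uses the stdlib `sqlite3` driver for a self-contained illustration; Ibis generates equivalent SQL for each backend.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "a@x.com"), (2, None), (3, "b@x.com"), (4, "a@x.com")],
)

# Row count, non-null count, distinct count, and NULL percentage,
# all computed server-side in a single query.
row = conn.execute(
    """
    SELECT COUNT(*),
           COUNT(email),
           COUNT(DISTINCT email),
           100.0 * (COUNT(*) - COUNT(email)) / COUNT(*)
    FROM customers
    """
).fetchone()
print(row)  # (4, 3, 2, 25.0)
```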
## API Methods

```python
# Table profiling
profile = datasource.profile_table("customers")

# Column analysis
column_profile = datasource.profile_column("customers", "email")

# Direct Ibis access
ibis_table = datasource.connect().table("customers")
stats = ibis_table.describe().to_pandas()
```
## Example Results

**Table Profile Example**

```json
{
  "table_profile": {
    "row_count": 150000,
    "completeness_score": 0.95,
    "consistency_score": 0.88
  },
  "profiling_summary": {
    "profiling_method": "ibis_enhanced",
    "total_columns": 12
  }
}
```

**Column Profile Example**

```json
{
  "email": {
    "semantic_type": "email",
    "data_quality_score": 0.92,
    "null_percentage": 5.2,
    "distinct_ratio": 0.98,
    "top_k_values": [
      {"value": "user@example.com", "count": 15, "percentage": 0.01}
    ]
  }
}
```
## Advanced Features

### Semantic Type Detection

| Type | Use Case |
|---|---|
| Email | Contact analysis |
| Phone | Communication data |
| URL | Web analytics |
| ID | Primary key detection |
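Semantic type detection is commonly implemented as pattern matching over a value sample. A hedged sketch with illustrative regexes and a match threshold (the library's real detection rules and thresholds may differ):

```python
import re

# Illustrative patterns only - the profiler's real rules may be more sophisticated.
SEMANTIC_PATTERNS = {
    "email": re.compile(r"^[\w.+-]+@[\w-]+\.[\w.]+$"),
    "phone": re.compile(r"^\+?[\d\s().-]{7,15}$"),
    "url":   re.compile(r"^https?://\S+$"),
    "id":    re.compile(r"^[A-Za-z0-9_-]+$"),
}

def detect_semantic_type(values, threshold=0.9):
    """Return the first semantic type matching at least `threshold` of the sample."""
    non_null = [str(v) for v in values if v is not None]
    for name, pattern in SEMANTIC_PATTERNS.items():
        matches = sum(1 for v in non_null if pattern.match(v))
        if non_null and matches / len(non_null) >= threshold:
            return name
    return "unknown"

print(detect_semantic_type(["a@x.com", "b@y.org", "c@z.io"]))  # email
```

Checking the broad `id` pattern last matters: almost any token matches it, so more specific types must win first.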
### Data Quality Formula

```python
quality_score = (
    completeness * 0.5 +   # 1 - NULL%
    uniqueness * 0.3 +     # Distinct ratio
    reliability * 0.2      # Sample size
)
```
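Plugging in the numbers from the email column profile shown earlier, with reliability assumed to be 1.0 since the sample-size term is not specified here:

```python
# Worked example using the email column profile above.
completeness = 1 - 5.2 / 100   # 1 - NULL% = 0.948
uniqueness = 0.98              # distinct ratio
reliability = 1.0              # assumed: full, reliable sample

quality_score = completeness * 0.5 + uniqueness * 0.3 + reliability * 0.2
print(round(quality_score, 2))  # 0.97
```

This lands near, but not exactly at, the 0.92 in the example profile; the difference would come from a reliability term below 1.0, which depends on the actual sample size.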
### Column Similarity

```python
# Find similar columns
similar = datasource.find_similar_columns("customer_id", threshold=0.8)
# -> ["user_id", "client_id", "account_id"]
```
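Under the hood, MinHash lets the profiler compare columns without materializing full value sets: the fraction of matching signature slots estimates the Jaccard overlap of two columns' values. A minimal sketch (illustrative only; production code would use a tuned LSH index rather than pairwise comparison):

```python
import hashlib

def minhash_signature(values, num_hashes=64):
    """MinHash signature: for each seed, keep the minimum hash over the value set."""
    return [
        min(
            int(hashlib.md5(f"{seed}:{v}".encode()).hexdigest(), 16)
            for v in values
        )
        for seed in range(num_hashes)
    ]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = minhash_signature({"alice", "bob", "carol", "dave"})
b = minhash_signature({"alice", "bob", "carol", "erin"})
print(estimated_jaccard(a, b))  # estimate of the true Jaccard similarity (3/5)
```

Signatures are cheap to store per column, so similarity checks scale to wide schemas without rereading the data.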
## Configuration

### Quick Setup

```python
datasource = PostgresDataSource(
    connection_string="postgresql://...",
    enable_profiling=True,  # Auto-optimization
)
```
### Tuning Options

| Use Case | Sample | Top-K | LSH |
|---|---|---|---|
| Development | 1K | 5 | Off |
| Production | 10K | 10 | On |
| High Accuracy | 50K | 20 | On |
**Custom Configuration**

```python
profiler_config = {
    "sample_size": 10000,   # Rows to analyze
    "top_k": 10,            # Frequent values
    "enable_lsh": True,     # Similarity matching
    "lsh_threshold": 0.8,   # Similarity threshold
}
```
## Usage Examples

**Basic Profiling**

```python
from ryoma_ai.datasource.postgres import PostgresDataSource

datasource = PostgresDataSource(
    connection_string="postgresql://user:pass@host:5432/db",
    enable_profiling=True,
)

# Profile a table
profile = datasource.profile_table("customers")
method = profile["profiling_summary"]["profiling_method"]
print(f"Method: {method}")
```
**Advanced Analysis**

```python
# Column profiling
email_profile = datasource.profile_column("customers", "email")
if email_profile["semantic_type"] == "email":
    quality = email_profile["data_quality_score"]
    print(f"Email quality: {quality:.2f}")

# Enhanced catalog
catalog = datasource.get_enhanced_catalog(include_profiles=True)
for schema in catalog.schemas:
    for table in schema.tables:
        high_quality = table.get_high_quality_columns(min_quality_score=0.8)
        print(f"{table.table_name}: {len(high_quality)} high-quality columns")
```
**Direct Ibis Access**

```python
# Custom analysis with Ibis
conn = datasource.connect()
ibis_table = conn.table("customers")

stats = ibis_table.describe().to_pandas()
age_mean = ibis_table.age.mean().to_pandas()
top_cities = ibis_table.city.value_counts().limit(5).to_pandas()
```
## Backend Support

| Database | Native Features | Best For |
|---|---|---|
| PostgreSQL | | Production OLTP |
| BigQuery | ML functions | Analytics |
| DuckDB | Advanced analytics | OLAP workloads |
| SQLite | | Development |
| Snowflake | Cloud-native views | Data warehouse |
| MySQL | | Web backends |
## Text-to-SQL Benefits

### Enhanced Schema Linking

| Feature | How It Helps |
|---|---|
| Statistical relevance | Uses row counts for table selection |
| Semantic types | Detects emails, phones, IDs |
| Quality scores | Filters out low-quality columns |
| Similarity analysis | Finds related columns |
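As a sketch of how these signals combine, candidate columns can be filtered by quality score and ranked before schema linking. Field names follow the profile examples above; the ranking heuristic itself is illustrative, not the library's actual linker:

```python
def rank_candidate_columns(columns, min_quality=0.5):
    """Filter and rank candidate columns for schema linking by quality score.

    `columns` maps column name -> profile dict with a `data_quality_score`
    field (as in the column profile examples above).
    """
    scored = [
        (name, prof["data_quality_score"])
        for name, prof in columns.items()
        if prof["data_quality_score"] >= min_quality
    ]
    return [name for name, _ in sorted(scored, key=lambda x: -x[1])]

columns = {
    "email":   {"data_quality_score": 0.92, "semantic_type": "email"},
    "notes":   {"data_quality_score": 0.30, "semantic_type": "text"},
    "user_id": {"data_quality_score": 0.99, "semantic_type": "id"},
}
print(rank_candidate_columns(columns))  # ['user_id', 'email']
```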
### Smarter Query Generation

```python
# Profiling-informed optimization
if column_profile["distinct_ratio"] > 0.8:
    query += f" GROUP BY {column_name}"  # High cardinality
elif column_profile["null_percentage"] < 5:
    query += f" WHERE {column_name} IS NOT NULL"  # Low nulls
```
### Error Prevention

- **NULL handling** based on actual percentages
- **Type safety** using semantic types
- **Cardinality awareness** for optimization
## Production Guide

### Deployment Checklist

- Enable profiling: `enable_profiling=True`
- Configure sampling for your data size
- Set quality thresholds
- Monitor profiling overhead
- Schedule regular updates
### Scaling Guide

| Table Size | Config | Notes |
|---|---|---|
| < 100K rows | Default | Full analysis |
| 100K-1M rows | | Balanced |
| > 1M rows | | Optimized |
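A helper that picks a configuration tier from table size, mirroring the tiers above. The concrete sample sizes are placeholder values consistent with the tuning table earlier, not library defaults:

```python
def profiler_config_for(row_count):
    """Pick a profiler config tier based on table size (placeholder values)."""
    if row_count < 100_000:
        return {"sample_size": None}           # default: full analysis
    if row_count <= 1_000_000:
        return {"sample_size": 10_000}         # balanced
    return {"sample_size": 1_000, "top_k": 5}  # optimized for very large tables

print(profiler_config_for(50_000))     # {'sample_size': None}
print(profiler_config_for(5_000_000))  # {'sample_size': 1000, 'top_k': 5}
```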
**Monitoring & Troubleshooting**

```python
# Check which profiling method was used
method = profile["profiling_summary"]["profiling_method"]
if method == "standard":
    print("Warning: Ibis not used - check compatibility")

# Monitor profiling duration
duration = profile["table_profile"]["profiling_duration_seconds"]
if duration > 10:
    print("Warning: consider reducing sample_size")
```
## API Reference

**Core Methods**

```python
# Table profiling
profile = datasource.profile_table(table_name, schema=None)

# Column profiling
column_profile = datasource.profile_column(table_name, column_name, schema=None)

# Enhanced catalog
catalog = datasource.get_enhanced_catalog(include_profiles=True)

# Column similarity
similar = datasource.find_similar_columns(column_name, threshold=0.8)
```
**Configuration Options**

```python
profiler_config = {
    "sample_size": 10000,   # Rows to sample
    "top_k": 10,            # Top frequent values
    "lsh_threshold": 0.8,   # Similarity threshold
    "num_hashes": 128,      # LSH hash functions
    "enable_lsh": True,     # Enable similarity matching
}
```
## Troubleshooting

| Issue | Solution |
|---|---|
| Slow profiling | Reduce `sample_size` |
| Memory errors | Set a smaller `sample_size` |
| Permission errors | Check DB user permissions |
| Type errors | Handled automatically by the fallback |
**Debug Examples**

```python
# Check which profiling method was used
method = profile["profiling_summary"]["profiling_method"]
if method == "standard":
    print("Ibis not used - check compatibility")

# Reduce the sample size for large tables
profiler_config = {"sample_size": 1000}
```
## Roadmap

### Upcoming Features

- **ML-based pattern detection** - advanced semantic types
- **Cross-table relationships** - statistical correlation analysis
- **Temporal profiling** - data quality tracking over time
- **Custom semantic types** - domain-specific definitions
## References

| Resource | Link |
|---|---|
| Research Paper | "Automatic Metadata Extraction for Text-to-SQL" |
| Ibis Framework | |
| MinHash/LSH | Locality-Sensitive Hashing algorithms |
| Statistical Methods | Database profiling techniques |