๐Ÿ“Š Database Profiling Systemยถ

Comprehensive metadata extraction for Text-to-SQL applications Based on โ€œAutomatic Metadata Extraction for Text-to-SQLโ€ research paper

๐Ÿš€ Quick Startยถ

from ryoma_ai.datasource.postgres import PostgresDataSource

# Enable profiling with Ibis optimization
datasource = PostgresDataSource(
    connection_string="postgresql://user:pass@host:5432/db",
    enable_profiling=True
)

# Profile table - automatic method selection
profile = datasource.profile_table("customers")
print(f"Method: {profile['profiling_summary']['profiling_method']}")

๐ŸŽฏ Core Featuresยถ

๐Ÿ“Š Feature

๐Ÿ”ง Implementation

๐Ÿ’ก Use Case

Row counts & NULL stats

COUNT(), COUNT(column)

Data completeness

Distinct-value ratios

COUNT(DISTINCT column)

Cardinality analysis

Statistical measures

MIN(), MAX(), AVG()

Numeric profiling

String analysis

LENGTH(), pattern matching

Text data insights

Top-k frequent values

GROUP BY ... ORDER BY COUNT()

Common patterns

LSH similarity

MinHash algorithms

Column matching

๐Ÿ—๏ธ Architectureยถ

Intelligent Profiling Engineยถ

graph LR
    A[๐Ÿ“Š Profile Request] --> B{๐Ÿ” Ibis Available?}
    B -->|โœ… Yes| C[โšก Ibis-Enhanced]
    B -->|โŒ No| D[๐Ÿ”ง Standard Mode]
    C --> E{โœ… Success?}
    E -->|โœ… Yes| F[๐Ÿ“‹ Results]
    E -->|โŒ No| D
    D --> F

    classDef request fill:#e3f2fd,stroke:#1976d2
    classDef decision fill:#fff8e1,stroke:#f57c00
    classDef process fill:#f3e5f5,stroke:#7b1fa2
    classDef result fill:#e8f5e8,stroke:#388e3c

    class A request
    class B,E decision
    class C,D process
    class F result

๐Ÿš€ Why Ibis Integration?ยถ

๐Ÿ’ก Benefit

๐Ÿ“ Description

Server-side computation

Statistics calculated in database

Optimized SQL generation

Uses database query planner

Reduced network I/O

Only results transferred

Native functions

Leverages built-in DB functions

Automatic fallback

Ensures 100% compatibility

๐Ÿ”ง API Methodsยถ

# ๐Ÿ“Š Table profiling
profile = datasource.profile_table("customers")

# ๐Ÿ“‹ Column analysis
column_profile = datasource.profile_column("customers", "email")

# ๐Ÿ” Direct Ibis access
ibis_table = datasource.connect().table("customers")
stats = ibis_table.describe().to_pandas()

๐Ÿ“Š Example Resultsยถ

๐Ÿ“‹ Table Profile Example
{
  "table_profile": {
    "row_count": 150000,
    "completeness_score": 0.95,
    "consistency_score": 0.88
  },
  "profiling_summary": {
    "profiling_method": "ibis_enhanced",
    "total_columns": 12
  }
}
๐Ÿ“‹ Column Profile Example
{
  "email": {
    "semantic_type": "email",
    "data_quality_score": 0.92,
    "null_percentage": 5.2,
    "distinct_ratio": 0.98,
    "top_k_values": [
      {"value": "user@example.com", "count": 15, "percentage": 0.01}
    ]
  }
}

๐ŸŽฏ Advanced Featuresยถ

๐Ÿท๏ธ Semantic Type Detectionยถ

๐Ÿ” Type

๐ŸŽฏ Use Case

๐Ÿ“ง Email

Contact analysis

๐Ÿ“ž Phone

Communication data

๐ŸŒ URL

Web analytics

๐Ÿ†” ID

Primary key detection

๐Ÿ“Š Data Quality Formulaยถ

quality_score = (
    completeness * 0.5 +      # 1 - NULL%
    uniqueness * 0.3 +        # Distinct ratio
    reliability * 0.2         # Sample size
)

๐Ÿ”— Column Similarityยถ

# Find similar columns
similar = datasource.find_similar_columns("customer_id", threshold=0.8)
# โ†’ ["user_id", "client_id", "account_id"]

โš™๏ธ Configurationยถ

๐Ÿš€ Quick Setupยถ

datasource = PostgresDataSource(
    connection_string="postgresql://...",
    enable_profiling=True  # Auto-optimization
)

๐ŸŽ›๏ธ Tuning Optionsยถ

๐ŸŽฏ Use Case

๐Ÿ“Š Sample

๐Ÿ” Top-K

๐Ÿ”— LSH

๐Ÿš€ Development

1K

5

Off

โš–๏ธ Production

10K

10

On

๐ŸŽฏ High Accuracy

50K

20

On

๐Ÿ”ง Custom Configuration
profiler_config = {
    "sample_size": 10000,    # Rows to analyze
    "top_k": 10,             # Frequent values
    "enable_lsh": True,      # Similarity matching
    "lsh_threshold": 0.8     # Similarity threshold
}

๐Ÿ’ป Usage Examplesยถ

๐Ÿš€ Basic Profiling
from ryoma_ai.datasource.postgres import PostgresDataSource

datasource = PostgresDataSource(
    connection_string="postgresql://user:pass@host:5432/db",
    enable_profiling=True
)

# Profile table
profile = datasource.profile_table("customers")
method = profile["profiling_summary"]["profiling_method"]
print(f"Method: {method}")
๐Ÿ” Advanced Analysis
# Column profiling
email_profile = datasource.profile_column("customers", "email")
if email_profile["semantic_type"] == "email":
    quality = email_profile["data_quality_score"]
    print(f"Email quality: {quality:.2f}")

# Enhanced catalog
catalog = datasource.get_enhanced_catalog(include_profiles=True)
for schema in catalog.schemas:
    for table in schema.tables:
        high_quality = table.get_high_quality_columns(min_quality_score=0.8)
        print(f"{table.table_name}: {len(high_quality)} high-quality columns")
โšก Direct Ibis Access
# Custom analysis with Ibis
conn = datasource.connect()
ibis_table = conn.table("customers")

stats = ibis_table.describe().to_pandas()
age_mean = ibis_table.age.mean().to_pandas()
top_cities = ibis_table.city.value_counts().limit(5).to_pandas()

๐Ÿ”ง Backend Supportยถ

๐Ÿ—„๏ธ Database

๐Ÿš€ Native Features

๐ŸŽฏ Best For

PostgreSQL

pg_stats, histograms

Production OLTP

BigQuery

ML functions, INFORMATION_SCHEMA

Analytics

DuckDB

Advanced analytics

OLAP workloads

SQLite

PRAGMA optimizations

Development

Snowflake

Cloud-native views

Data warehouse

MySQL

INFORMATION_SCHEMA

Web backends

๐Ÿค– Text-to-SQL Benefitsยถ

๐Ÿ”— Enhanced Schema Linkingยถ

๐ŸŽฏ Feature

๐Ÿ’ก How It Helps

Statistical relevance

Uses row counts for table selection

Semantic types

Detects emails, phones, IDs

Quality scores

Filters low-quality columns

Similarity analysis

Finds related columns

โšก Smarter Query Generationยถ

# Profiling-informed optimization
if column_profile["distinct_ratio"] > 0.8:
    query += f"GROUP BY {column_name}"  # High cardinality
elif column_profile["null_percentage"] < 5:
    query += f"WHERE {column_name} IS NOT NULL"  # Low nulls

๐Ÿ›ก๏ธ Error Preventionยถ

  • NULL handling based on actual percentages

  • Type safety using semantic types

  • Cardinality awareness for optimization

๐Ÿš€ Production Guideยถ

โœ… Deployment Checklistยถ

  • Enable profiling: enable_profiling=True

  • Configure sampling for your data size

  • Set quality thresholds

  • Monitor profiling overhead

  • Schedule regular updates

๐Ÿ“Š Scaling Guideยถ

๐Ÿ“ Table Size

โš™๏ธ Config

๐Ÿ“ Notes

< 100K rows

Default

Full analysis

100K-1M rows

sample_size=5000

Balanced

> 1M rows

sample_size=10000

Optimized

๐Ÿ” Monitoring & Troubleshooting
# Check method used
method = profile["profiling_summary"]["profiling_method"]
if method == "standard":
    print("โš ๏ธ Ibis not used - check compatibility")

# Monitor duration
duration = profile["table_profile"]["profiling_duration_seconds"]
if duration > 10:
    print("โš ๏ธ Consider reducing sample_size")

๐Ÿ“š API Referenceยถ

๐Ÿ”ง Core Methods
# Table profiling
profile = datasource.profile_table(table_name, schema=None)

# Column profiling
column_profile = datasource.profile_column(table_name, column_name, schema=None)

# Enhanced catalog
catalog = datasource.get_enhanced_catalog(include_profiles=True)

# Column similarity
similar = datasource.find_similar_columns(column_name, threshold=0.8)
โš™๏ธ Configuration Options
profiler_config = {
    "sample_size": 10000,        # Rows to sample
    "top_k": 10,                 # Top frequent values
    "lsh_threshold": 0.8,        # Similarity threshold
    "num_hashes": 128,           # LSH hash functions
    "enable_lsh": True           # Enable similarity matching
}

๐Ÿšจ Troubleshootingยถ

โš ๏ธ Issue

๐Ÿ’ก Solution

Slow profiling

Reduce sample_size

Memory errors

Set enable_lsh=False

Permission errors

Check DB user permissions

Type errors

Auto-handled by fallback

๐Ÿ” Debug Examples
# Check profiling method
method = profile["profiling_summary"]["profiling_method"]
if method == "standard":
    print("Ibis not used - check compatibility")

# Reduce sample for large tables
profiler_config = {"sample_size": 1000}

๐Ÿ”ฎ Roadmapยถ

๐Ÿš€ Upcoming Featuresยถ

  • ๐Ÿค– ML-based pattern detection - Advanced semantic types

  • ๐Ÿ”— Cross-table relationships - Statistical correlation analysis

  • โฐ Temporal profiling - Data quality tracking over time

  • ๐ŸŽฏ Custom semantic types - Domain-specific definitions


๐Ÿ“– Referencesยถ

๐Ÿ“š Resource

๐Ÿ”— Link

Research Paper

โ€œAutomatic Metadata Extraction for Text-to-SQLโ€

Ibis Framework

ibis-project.org

MinHash/LSH

Locality-Sensitive Hashing algorithms

Statistical Methods

Database profiling techniques