📊 Database Profiling System

Comprehensive metadata extraction for Text-to-SQL applications

🚀 Quick Start

from ryoma_data import DataSource, DatabaseProfiler

# Create datasource
datasource = DataSource(
    "postgres",
    host="localhost",
    database="mydb",
    user="user",
    password="pass"
)

# Create profiler
profiler = DatabaseProfiler()

# Profile table
profile = profiler.profile_table(datasource, "customers")
print(f"Rows: {profile.row_count}")
print(f"Completeness: {profile.completeness_score}")

🎯 Core Features

| 📊 Feature | 💡 Use Case |
| --- | --- |
| Row counts & NULL stats | Data completeness |
| Distinct-value ratios | Cardinality analysis |
| Statistical measures | Numeric profiling |
| String analysis | Text data insights |
| Top-k frequent values | Common patterns |
| LSH similarity | Column matching |

🔧 API Methods

# Table profiling
profile = profiler.profile_table(datasource, "customers")

# Column analysis
column_profile = profiler.profile_column(datasource, "customers", "email")

# Configure profiler
profiler = DatabaseProfiler(
    sample_size=10000,
    top_k=20,
    enable_lsh=True
)

📊 Example Results

📋 Table Profile Example
{
  "table_profile": {
    "row_count": 150000,
    "completeness_score": 0.95,
    "consistency_score": 0.88
  },
  "profiling_summary": {
    "profiling_method": "ibis_enhanced",
    "total_columns": 12
  }
}
📋 Column Profile Example
{
  "email": {
    "semantic_type": "email",
    "data_quality_score": 0.92,
    "null_percentage": 5.2,
    "distinct_ratio": 0.98,
    "top_k_values": [
      {"value": "user@example.com", "count": 15, "percentage": 0.01}
    ]
  }
}
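These dictionaries can be consumed like any other Python mapping. A minimal sketch, using hypothetical profile data shaped like the examples above (`dominant_value` is an illustrative helper, not part of the library):

```python
# Hypothetical profile dictionaries mirroring the example output above.
column_profile = {
    "email": {
        "semantic_type": "email",
        "null_percentage": 5.2,
        "distinct_ratio": 0.98,
        "top_k_values": [{"value": "user@example.com", "count": 15, "percentage": 0.01}],
    }
}

def dominant_value(col, threshold=0.5):
    """Return the most frequent value if it dominates the distribution, else None."""
    top = col.get("top_k_values") or []
    if top and top[0]["percentage"] >= threshold:
        return top[0]["value"]
    return None

# The email column's top value covers only ~1% of rows, so nothing dominates.
print(dominant_value(column_profile["email"]))
```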

🎯 Advanced Features

๐Ÿท๏ธ Semantic Type Detectionยถ

๐Ÿ” Type

๐ŸŽฏ Use Case

๐Ÿ“ง Email

Contact analysis

๐Ÿ“ž Phone

Communication data

๐ŸŒ URL

Web analytics

๐Ÿ†” ID

Primary key detection
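As an illustration of how detection like this can work, here is a minimal regex-based sketch: classify a column by the share of sampled values matching a pattern. The profiler's actual rules may be more sophisticated; `SEMANTIC_PATTERNS` and `detect_semantic_type` are names invented for this example.

```python
import re

# Illustrative patterns only; the library's detection logic may differ.
SEMANTIC_PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "phone": re.compile(r"^\+?[\d\s().-]{7,20}$"),
    "url":   re.compile(r"^https?://\S+$"),
}

def detect_semantic_type(values, min_match_ratio=0.9):
    """Return the first semantic type matched by >= min_match_ratio of values."""
    values = [v for v in values if v is not None]
    if not values:
        return None
    for name, pattern in SEMANTIC_PATTERNS.items():
        matches = sum(bool(pattern.match(str(v))) for v in values)
        if matches / len(values) >= min_match_ratio:
            return name
    return None

print(detect_semantic_type(["a@x.com", "b@y.org", "c@z.net"]))  # → email
```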

📊 Data Quality Formula

quality_score = (
    completeness * 0.5 +      # 1 - NULL%
    uniqueness * 0.3 +        # Distinct ratio
    reliability * 0.2         # Sample size
)
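The weighted sum above can be computed directly. A sketch under one stated assumption: the `reliability` term is modeled here as sample coverage capped at 1.0, since the library's exact definition of that term is not given.

```python
def data_quality_score(null_pct, distinct_ratio, sample_size, full_sample=10_000):
    """Combine the three components with the documented 0.5/0.3/0.2 weights."""
    completeness = 1.0 - null_pct / 100.0              # 1 - NULL%
    uniqueness = distinct_ratio                        # distinct-value ratio
    reliability = min(sample_size / full_sample, 1.0)  # assumed form: sample coverage
    return completeness * 0.5 + uniqueness * 0.3 + reliability * 0.2

# Inputs from the email column example: 5.2% NULLs, 0.98 distinct ratio.
print(round(data_quality_score(5.2, 0.98, 10_000), 2))  # 0.97 with these inputs
```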

🔗 Column Similarity

# Find similar columns
similar = datasource.find_similar_columns("customer_id", threshold=0.8)
# → ["user_id", "client_id", "account_id"]
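Under the hood, LSH-style similarity rests on MinHash signatures: the fraction of matching signature slots estimates the Jaccard similarity between two columns' value sets. A self-contained sketch of the idea (not the library's implementation):

```python
import hashlib

def minhash_signature(values, num_hashes=64):
    """MinHash signature: for each seeded hash, keep the minimum over all values."""
    return [
        min(
            int.from_bytes(
                hashlib.blake2b(f"{i}:{v}".encode(), digest_size=8).digest(), "big"
            )
            for v in values
        )
        for i in range(num_hashes)
    ]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching slots approximates Jaccard similarity of the sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

ids_a = {f"cust-{i}" for i in range(100)}
ids_b = {f"cust-{i}" for i in range(80)}  # 80% overlap with ids_a
sim = estimated_jaccard(minhash_signature(ids_a), minhash_signature(ids_b))
print(f"estimated similarity ≈ {sim:.2f}")
```

With 64 hashes the estimate is within roughly ±0.06 of the true Jaccard value; more hashes tighten the bound at linear cost.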

⚙️ Configuration

🚀 Quick Setup

from ryoma_data import DataSource, DatabaseProfiler

datasource = DataSource(
    "postgres",
    connection_string="postgresql://..."
)

profiler = DatabaseProfiler()

🎛️ Tuning Options

| 🎯 Use Case | 📊 Sample | 🔍 Top-K | 🔗 LSH |
| --- | --- | --- | --- |
| 🚀 Development | 1K | 5 | Off |
| ⚖️ Production | 10K | 10 | On |
| 🎯 High Accuracy | 50K | 20 | On |

🔧 Custom Configuration
from ryoma_data import DatabaseProfiler

profiler = DatabaseProfiler(
    sample_size=10000,
    top_k=10,
    enable_lsh=True
)

💻 Usage Examples

🚀 Basic Profiling
from ryoma_data import DataSource, DatabaseProfiler

datasource = DataSource(
    "postgres",
    connection_string="postgresql://user:pass@host:5432/db"
)

# Profile table
profiler = DatabaseProfiler()
profile = profiler.profile_table(datasource, "customers")
method = profile["profiling_summary"]["profiling_method"]
print(f"Method: {method}")
๐Ÿ” Advanced Analysis
# Column profiling
email_profile = datasource.profile_column("customers", "email")
if email_profile["semantic_type"] == "email":
    quality = email_profile["data_quality_score"]
    print(f"Email quality: {quality:.2f}")

# Enhanced catalog
catalog = datasource.get_enhanced_catalog(include_profiles=True)
for schema in catalog.schemas:
    for table in schema.tables:
        high_quality = table.get_high_quality_columns(min_quality_score=0.8)
        print(f"{table.table_name}: {len(high_quality)} high-quality columns")
⚡ Direct Ibis Access
# Custom analysis with Ibis
conn = datasource.connect()
ibis_table = conn.table("customers")

stats = ibis_table.describe().to_pandas()
age_mean = ibis_table.age.mean().to_pandas()
top_cities = ibis_table.city.value_counts().limit(5).to_pandas()

🔧 Backend Support

| 🗄️ Database | 🚀 Native Features | 🎯 Best For |
| --- | --- | --- |
| PostgreSQL | pg_stats, histograms | Production OLTP |
| BigQuery | ML functions, INFORMATION_SCHEMA | Analytics |
| DuckDB | Advanced analytics | OLAP workloads |
| SQLite | PRAGMA optimizations | Development |
| Snowflake | Cloud-native views | Data warehouse |
| MySQL | INFORMATION_SCHEMA | Web backends |

🤖 Text-to-SQL Benefits

🔗 Enhanced Schema Linking

| 🎯 Feature | 💡 How It Helps |
| --- | --- |
| Statistical relevance | Uses row counts for table selection |
| Semantic types | Detects emails, phones, IDs |
| Quality scores | Filters low-quality columns |
| Similarity analysis | Finds related columns |

⚡ Smarter Query Generation

# Profiling-informed optimization
if column_profile["distinct_ratio"] > 0.8:
    query += f" GROUP BY {column_name}"  # High cardinality
elif column_profile["null_percentage"] < 5:
    query += f" WHERE {column_name} IS NOT NULL"  # Low nulls

🛡️ Error Prevention

  • NULL handling based on actual percentages

  • Type safety using semantic types

  • Cardinality awareness for optimization
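The first bullet can be sketched as a small helper that emits a NULL guard only when the profile shows the column actually contains NULLs. `null_guard` is a hypothetical helper for illustration, not a library API:

```python
def null_guard(column, profile, max_null_pct=1.0):
    """Emit an IS NOT NULL predicate only when the profile shows real NULLs."""
    if profile.get("null_percentage", 0.0) > max_null_pct:
        return f"{column} IS NOT NULL"
    return ""  # column is effectively complete; no guard needed

print(null_guard("email", {"null_percentage": 5.2}))  # guard emitted
print(null_guard("id", {"null_percentage": 0.0}))     # empty string
```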

🚀 Production Guide

✅ Deployment Checklist

  • Create DatabaseProfiler instance

  • Configure sampling for your data size

  • Set quality thresholds

  • Monitor profiling overhead

  • Schedule regular updates

📊 Scaling Guide

| 📏 Table Size | ⚙️ Config | 📝 Notes |
| --- | --- | --- |
| < 100K rows | Default | Full analysis |
| 100K-1M rows | sample_size=5000 | Balanced |
| > 1M rows | sample_size=10000 | Optimized |
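The scaling table above can be encoded as a small helper that picks constructor settings from a row count (`profiler_config` is an illustrative name, not part of the library):

```python
def profiler_config(row_count):
    """Map table size to the sampling settings from the scaling guide."""
    if row_count < 100_000:
        return {}                      # defaults: full analysis
    if row_count <= 1_000_000:
        return {"sample_size": 5_000}  # balanced
    return {"sample_size": 10_000}     # optimized for very large tables

print(profiler_config(250_000))  # {'sample_size': 5000}
```

The result can be splatted into the constructor, e.g. `DatabaseProfiler(**profiler_config(row_count))`.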

๐Ÿ” Monitoring & Troubleshooting
# Check method used
method = profile["profiling_summary"]["profiling_method"]
if method == "standard":
    print("⚠️ Ibis not used - check compatibility")

# Monitor duration
duration = profile["table_profile"]["profiling_duration_seconds"]
if duration > 10:
    print("⚠️ Consider reducing sample_size")

📚 API Reference

🔧 Core Methods
# Table profiling
profile = datasource.profile_table(table_name, schema=None)

# Column profiling
column_profile = datasource.profile_column(table_name, column_name, schema=None)

# Enhanced catalog
catalog = datasource.get_enhanced_catalog(include_profiles=True)

# Column similarity
similar = datasource.find_similar_columns(column_name, threshold=0.8)
⚙️ Configuration Options
from ryoma_data import DatabaseProfiler

profiler = DatabaseProfiler(
    sample_size=10000,
    top_k=10,
    enable_lsh=True
)

🚨 Troubleshooting

| ⚠️ Issue | 💡 Solution |
| --- | --- |
| Slow profiling | Reduce sample_size |
| Memory errors | Set enable_lsh=False |
| Permission errors | Check DB user permissions |
| Type errors | Auto-handled by fallback |

๐Ÿ” Debug Examples
# Check profiling method
method = profile["profiling_summary"]["profiling_method"]
if method == "standard":
    print("Ibis not used - check compatibility")

# Reduce sample for large tables
profiler = DatabaseProfiler(sample_size=1000)

🔮 Roadmap

🚀 Upcoming Features

  • 🤖 ML-based pattern detection - Advanced semantic types

  • 🔗 Cross-table relationships - Statistical correlation analysis

  • ⏰ Temporal profiling - Data quality tracking over time

  • 🎯 Custom semantic types - Domain-specific definitions


📖 References

| 📚 Resource | 🔗 Link |
| --- | --- |
| Research Paper | "Automatic Metadata Extraction for Text-to-SQL" |
| Ibis Framework | ibis-project.org |
| MinHash/LSH | Locality-Sensitive Hashing algorithms |
| Statistical Methods | Database profiling techniques |