# Database Profiling System

Comprehensive metadata extraction for Text-to-SQL applications.

## Quick Start
```python
from ryoma_data import DataSource, DatabaseProfiler

# Create datasource
datasource = DataSource(
    "postgres",
    host="localhost",
    database="mydb",
    user="user",
    password="pass",
)

# Create profiler
profiler = DatabaseProfiler()

# Profile table
profile = profiler.profile_table(datasource, "customers")
print(f"Rows: {profile.row_count}")
print(f"Completeness: {profile.completeness_score}")
```
## Core Features

| Feature | Use Case |
|---|---|
| Row counts & NULL stats | Data completeness |
| Distinct-value ratios | Cardinality analysis |
| Statistical measures | Numeric profiling |
| String analysis | Text data insights |
| Top-k frequent values | Common patterns |
| LSH similarity | Column matching |
## API Methods

```python
# Table profiling
profile = profiler.profile_table(datasource, "customers")

# Column analysis
column_profile = profiler.profile_column(datasource, "customers", "email")

# Configure profiler
profiler = DatabaseProfiler(
    sample_size=10000,
    top_k=20,
    enable_lsh=True,
)
```
## Example Results

**Table Profile Example**

```json
{
  "table_profile": {
    "row_count": 150000,
    "completeness_score": 0.95,
    "consistency_score": 0.88
  },
  "profiling_summary": {
    "profiling_method": "ibis_enhanced",
    "total_columns": 12
  }
}
```
**Column Profile Example**

```json
{
  "email": {
    "semantic_type": "email",
    "data_quality_score": 0.92,
    "null_percentage": 5.2,
    "distinct_ratio": 0.98,
    "top_k_values": [
      {"value": "user@example.com", "count": 15, "percentage": 0.01}
    ]
  }
}
```
## Advanced Features

### Semantic Type Detection

| Type | Use Case |
|---|---|
| Email | Contact analysis |
| Phone | Communication data |
| URL | Web analytics |
| ID | Primary key detection |
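Semantic types like these are commonly inferred by matching a sample of values against patterns. A minimal sketch of the idea, assuming regex-based heuristics (the `PATTERNS` table, `detect_semantic_type` helper, and the 90% threshold below are illustrative, not the library's actual detector):

```python
import re

# Illustrative patterns only -- the real detector may use different rules.
PATTERNS = {
    "email": re.compile(r"^[\w.+-]+@[\w-]+\.[\w.]+$"),
    "phone": re.compile(r"^\+?[\d\s().-]{7,15}$"),
    "url": re.compile(r"^https?://\S+$"),
}

def detect_semantic_type(values):
    """Return the semantic type matched by >=90% of non-null samples, or None."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return None
    for name, pattern in PATTERNS.items():
        matches = sum(1 for v in non_null if pattern.match(str(v)))
        if matches / len(non_null) >= 0.9:
            return name
    return None

print(detect_semantic_type(["a@b.com", "c@d.org", None]))  # email
```

Sampling a few hundred values per column is usually enough to classify it with high confidence while keeping scans cheap.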
### Data Quality Formula

```python
quality_score = (
    completeness * 0.5    # 1 - NULL fraction
    + uniqueness * 0.3    # distinct ratio
    + reliability * 0.2   # sample-size factor
)
```
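Plugging in the example column profile shown earlier (5.2% NULLs, 0.98 distinct ratio), and assuming a large enough sample that reliability is 1.0, the weighted score works out as:

```python
# Worked example of the weighted quality score above.
completeness = 1 - 0.052   # 5.2% NULLs -> 0.948
uniqueness = 0.98          # distinct ratio from the column profile
reliability = 1.0          # assume a sufficiently large sample

quality_score = (
    completeness * 0.5
    + uniqueness * 0.3
    + reliability * 0.2
)
print(round(quality_score, 3))  # 0.968
```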
### Column Similarity

```python
# Find similar columns
similar = datasource.find_similar_columns("customer_id", threshold=0.8)
# -> ["user_id", "client_id", "account_id"]
```
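Similarity search of this kind typically rests on MinHash signatures: the fraction of matching min-hashes between two columns approximates the Jaccard similarity of their value sets. A self-contained sketch of that idea (not the library's implementation, and a real system would add LSH banding to avoid pairwise comparison):

```python
import hashlib

def minhash_signature(values, num_hashes=64):
    """One min-hash per seed: the smallest hash of any value in the set."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(hashlib.sha1(f"{seed}:{v}".encode()).digest()[:8], "big")
            for v in values
        ))
    return sig

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching min-hashes approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

col_a = {f"id-{i}" for i in range(100)}
col_b = {f"id-{i}" for i in range(80, 180)}  # 20 shared of 180 total
sim = estimated_jaccard(minhash_signature(col_a), minhash_signature(col_b))
print(f"estimated Jaccard: {sim:.2f}")  # true value is ~0.11
```

Signatures are small and composable, so they can be computed once per column during profiling and compared cheaply at query time.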
## Configuration

### Quick Setup

```python
from ryoma_data import DataSource, DatabaseProfiler

datasource = DataSource(
    "postgres",
    connection_string="postgresql://..."
)
profiler = DatabaseProfiler()
```
### Tuning Options

| Use Case | Sample | Top-K | LSH |
|---|---|---|---|
| Development | 1K | 5 | Off |
| Production | 10K | 10 | On |
| High Accuracy | 50K | 20 | On |

**Custom Configuration**

```python
from ryoma_data import DatabaseProfiler

profiler = DatabaseProfiler(
    sample_size=10000,
    top_k=10,
    enable_lsh=True,
)
```
## Usage Examples

**Basic Profiling**

```python
from ryoma_data import DataSource, DatabaseProfiler

datasource = DataSource(
    "postgres",
    connection_string="postgresql://user:pass@host:5432/db"
)

# Profile table
profiler = DatabaseProfiler()
profile = profiler.profile_table(datasource, "customers")
method = profile["profiling_summary"]["profiling_method"]
print(f"Method: {method}")
```
**Advanced Analysis**

```python
# Column profiling
email_profile = datasource.profile_column("customers", "email")
if email_profile["semantic_type"] == "email":
    quality = email_profile["data_quality_score"]
    print(f"Email quality: {quality:.2f}")

# Enhanced catalog
catalog = datasource.get_enhanced_catalog(include_profiles=True)
for schema in catalog.schemas:
    for table in schema.tables:
        high_quality = table.get_high_quality_columns(min_quality_score=0.8)
        print(f"{table.table_name}: {len(high_quality)} high-quality columns")
```
**Direct Ibis Access**

```python
# Custom analysis with Ibis
conn = datasource.connect()
ibis_table = conn.table("customers")
stats = ibis_table.describe().to_pandas()
age_mean = ibis_table.age.mean().to_pandas()
top_cities = ibis_table.city.value_counts().limit(5).to_pandas()
```
## Backend Support

| Database | Native Features | Best For |
|---|---|---|
| PostgreSQL | | Production OLTP |
| BigQuery | ML functions | Analytics |
| DuckDB | Advanced analytics | OLAP workloads |
| SQLite | | Development |
| Snowflake | Cloud-native views | Data warehouse |
| MySQL | | Web backends |
## Text-to-SQL Benefits

### Enhanced Schema Linking

| Feature | How It Helps |
|---|---|
| Statistical relevance | Uses row counts for table selection |
| Semantic types | Detects emails, phones, IDs |
| Quality scores | Filters low-quality columns |
| Similarity analysis | Finds related columns |
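One way these signals can combine during schema linking is a simple weighted score over each column's profile. A hypothetical helper (the `score_column` function and its weights are illustrative, not part of the library's API):

```python
def score_column(profile, query_mentions_email=False):
    """Combine profile signals into a schema-linking relevance score.

    `profile` is a dict shaped like the column-profile example above;
    the weights here are illustrative.
    """
    score = profile["data_quality_score"]                # prefer clean columns
    if query_mentions_email and profile.get("semantic_type") == "email":
        score += 0.5                                     # semantic-type match
    if profile["null_percentage"] > 50:
        score -= 0.3                                     # mostly-NULL columns rank low
    return score

email_col = {"semantic_type": "email", "data_quality_score": 0.92,
             "null_percentage": 5.2}
print(score_column(email_col, query_mentions_email=True))  # 1.42
```

Ranking candidate columns this way lets the Text-to-SQL layer prune the schema to the handful of columns most likely to answer the question.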
### Smarter Query Generation

```python
# Profiling-informed optimization (note the leading spaces when appending clauses)
if column_profile["distinct_ratio"] > 0.8:
    query += f" GROUP BY {column_name}"  # high cardinality
elif column_profile["null_percentage"] < 5:
    query += f" WHERE {column_name} IS NOT NULL"  # low NULLs
```
### Error Prevention

- **NULL handling** based on actual percentages
- **Type safety** using semantic types
- **Cardinality awareness** for optimization
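As a concrete instance of profile-driven NULL handling, the generator can guard aggregates on NULL-heavy columns. A hypothetical helper, not part of the library (the 20% threshold and `safe_avg` name are illustrative):

```python
def safe_avg(column, null_percentage):
    """Emit an AVG that treats NULLs explicitly on NULL-heavy columns.

    Plain AVG skips NULLs, shrinking the denominator; coalescing to 0
    counts every row instead. Which is correct depends on the question.
    """
    if null_percentage > 20:
        return f"AVG(COALESCE({column}, 0))"
    return f"AVG({column})"

print(safe_avg("age", null_percentage=3.1))   # AVG(age)
print(safe_avg("score", null_percentage=42))  # AVG(COALESCE(score, 0))
```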
## Production Guide

### Deployment Checklist

- Create a `DatabaseProfiler` instance
- Configure sampling for your data size
- Set quality thresholds
- Monitor profiling overhead
- Schedule regular updates
### Scaling Guide

| Table Size | Config | Notes |
|---|---|---|
| < 100K rows | Default | Full analysis |
| 100K-1M rows | | Balanced |
| > 1M rows | | Optimized |
**Monitoring & Troubleshooting**

```python
# Check which profiling method was used
method = profile["profiling_summary"]["profiling_method"]
if method == "standard":
    print("Warning: Ibis not used - check compatibility")

# Monitor duration
duration = profile["table_profile"]["profiling_duration_seconds"]
if duration > 10:
    print("Warning: consider reducing sample_size")
```
## API Reference

**Core Methods**

```python
# Table profiling
profile = datasource.profile_table(table_name, schema=None)

# Column profiling
column_profile = datasource.profile_column(table_name, column_name, schema=None)

# Enhanced catalog
catalog = datasource.get_enhanced_catalog(include_profiles=True)

# Column similarity
similar = datasource.find_similar_columns(column_name, threshold=0.8)
```

**Configuration Options**

```python
from ryoma_data import DatabaseProfiler

profiler = DatabaseProfiler(
    sample_size=10000,
    top_k=10,
    enable_lsh=True,
)
```
## Troubleshooting

| Issue | Solution |
|---|---|
| Slow profiling | Reduce `sample_size` |
| Memory errors | Set a smaller `sample_size` |
| Permission errors | Check DB user permissions |
| Type errors | Auto-handled by fallback |
**Debug Examples**

```python
# Check profiling method
method = profile["profiling_summary"]["profiling_method"]
if method == "standard":
    print("Ibis not used - check compatibility")

# Reduce sample for large tables
profiler = DatabaseProfiler(sample_size=1000)
```
## Roadmap

### Upcoming Features

- **ML-based pattern detection** - advanced semantic types
- **Cross-table relationships** - statistical correlation analysis
- **Temporal profiling** - data quality tracking over time
- **Custom semantic types** - domain-specific definitions
## References

| Resource | Link |
|---|---|
| Research Paper | "Automatic Metadata Extraction for Text-to-SQL" |
| Ibis Framework | |
| MinHash/LSH | Locality-Sensitive Hashing algorithms |
| Statistical Methods | Database profiling techniques |