mlvern.data package

Submodules

mlvern.data.fingerprint module

mlvern.data.fingerprint.fingerprint_dataset(df: DataFrame, target: str)

FAST fingerprinting – runs every time.

mlvern.data.inspect module

class mlvern.data.inspect.DataInspector(df: DataFrame, target: str | None = None, mlvern_dir: str = '.')

Bases: object

Comprehensive data profiling and validation framework.

inspect() → dict[str, Any]

Run complete inspection.

profile_data() → dict[str, Any]

Part 1: Comprehensive data profiling.

safe_numeric_profile(min_rows: int = 2) → dict[str, Any]

Check whether the dataset is large enough for numeric profiling.

Returns the analysis, or an explicit skip status if it is not.

save_report(filename: str = 'data_inspection_report.json') → Path

Save the inspection report to a JSON file and return its Path.

validate_data() → dict[str, Any]

Part 2: Comprehensive data validation.

validate_input() → bool

Validate input data.

mlvern.data.inspect.inspect_data(df: DataFrame, target: str | None = None, mlvern_dir: str = '.') → dict[str, Any]

Convenience function for data inspection.

mlvern.data.register module

mlvern.data.register.register_dataset(df, target, mlvern_dir)

mlvern.data.risk_check module

mlvern.data.risk_check.class_imbalance(df: DataFrame, target: str) → Dict[str, Any]

mlvern.data.risk_check.data_drift(baseline: DataFrame, current: DataFrame, cols: List[str] | None = None) → Dict[str, Any]

Check drift between baseline and current.

Uses the Kolmogorov-Smirnov (KS) test for numeric columns and the chi-squared test for categorical columns.
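
The per-column test selection described above can be sketched as follows. This is an illustration of the technique, not mlvern's implementation: the function name `drift_sketch`, the `alpha` cutoff, and the report keys are all hypothetical, and the sketch assumes SciPy is available.

```python
import numpy as np
import pandas as pd
from scipy import stats


def drift_sketch(baseline: pd.DataFrame, current: pd.DataFrame, alpha: float = 0.05) -> dict:
    """Hypothetical helper: KS test for numeric columns, chi-squared for the rest."""
    report = {}
    for col in baseline.columns.intersection(current.columns):
        b, c = baseline[col].dropna(), current[col].dropna()
        if pd.api.types.is_numeric_dtype(b):
            # Two-sample Kolmogorov-Smirnov test on the raw values.
            test, p_value = "ks", float(stats.ks_2samp(b, c).pvalue)
        else:
            # Contingency table: one row per category, one column per dataset.
            counts = pd.concat(
                [b.value_counts(), c.value_counts()],
                axis=1,
                keys=["baseline", "current"],
            ).fillna(0)
            _, p, _, _ = stats.chi2_contingency(counts.to_numpy())
            test, p_value = "chi2", float(p)
        # Report keys and the alpha cutoff are assumptions of this sketch.
        report[col] = {"test": test, "p_value": p_value, "drift": p_value < alpha}
    return report


rng = np.random.default_rng(0)
baseline = pd.DataFrame({
    "x": rng.normal(0.0, 1.0, 500),
    "cat": rng.choice(["a", "b"], 500, p=[0.5, 0.5]),
})
current = pd.DataFrame({
    "x": rng.normal(2.0, 1.0, 500),  # mean shifted by two standard deviations
    "cat": rng.choice(["a", "b"], 500, p=[0.9, 0.1]),  # category mix changed
})
report = drift_sketch(baseline, current)
```

Both columns in this synthetic example are shifted on purpose, so each test should flag drift at the chosen cutoff.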

mlvern.data.risk_check.run_risk_checks(df: DataFrame, target: str | None = None, sensitive: List[str] | None = None, baseline: DataFrame | None = None, train: DataFrame | None = None, test: DataFrame | None = None, mlvern_dir: str | None = None) → Dict[str, Any]

mlvern.data.risk_check.sampling_bias(baseline: DataFrame, current: DataFrame, cols: List[str] | None = None) → Dict[str, Any]

Compare categorical distributions using chi-squared test.

mlvern.data.risk_check.sensitive_attribute_imbalance(df: DataFrame, sensitive_cols: List[str]) → Dict[str, Any]

mlvern.data.risk_check.target_leakage_detection(df: DataFrame, target: str, threshold: float = 0.99) → Dict[str, Any]

mlvern.data.risk_check.train_test_mismatch(train: DataFrame, test: DataFrame, cols: List[str] | None = None) → Dict[str, Any]

Wrapper around data_drift to check train vs test mismatch.

mlvern.data.statistics module

mlvern.data.statistics.compute_statistics(df: DataFrame, target: str | None = None, mlvern_dir: str | None = None) → Dict[str, Any]

Collect statistics combining multiple functions.

mlvern.data.statistics.correlations(df: DataFrame, method: str = 'pearson') → DataFrame

Compute the correlation matrix (Pearson or Spearman).
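
As a rough illustration of the two methods, the sketch below builds both matrices with pandas. The helper name `correlation_matrix` and the decision to drop non-numeric columns are assumptions made for this example, not documented behavior of `correlations`.

```python
import pandas as pd


def correlation_matrix(df: pd.DataFrame, method: str = "pearson") -> pd.DataFrame:
    """Hypothetical helper: correlation matrix over the numeric columns only."""
    if method not in ("pearson", "spearman"):
        raise ValueError(f"unsupported method: {method!r}")
    return df.select_dtypes(include="number").corr(method=method)


df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [1, 4, 9, 16, 25],   # b = a squared: monotonic in a, but not linear
    "label": list("vwxyz"),   # non-numeric, dropped by the helper
})
pearson = correlation_matrix(df)
spearman = correlation_matrix(df, method="spearman")
```

Because `b` increases monotonically with `a`, the Spearman (rank) correlation is exactly 1, while the Pearson correlation falls slightly below 1 since the relationship is not linear.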

mlvern.data.statistics.dimensionality_signals(df: DataFrame, n_components: int = 5) → Dict[str, Any]

mlvern.data.statistics.distribution_shape(df: DataFrame, col: str) → Dict[str, Any]

Assess approximate distribution shape using skewness and kurtosis.
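
One way such an assessment could look is sketched below. The function name `shape_sketch`, the `skew_tol` cutoff of 0.5, the labels, and the returned keys are all hypothetical; the actual output of `distribution_shape` is not specified here.

```python
import numpy as np
import pandas as pd


def shape_sketch(series: pd.Series, skew_tol: float = 0.5) -> dict:
    """Hypothetical helper: label a distribution from its sample skewness."""
    s = series.dropna()
    skewness = float(s.skew())         # sample skewness (bias-corrected)
    excess_kurtosis = float(s.kurt())  # excess kurtosis, about 0 for a normal
    if skewness > skew_tol:
        label = "right-skewed"
    elif skewness < -skew_tol:
        label = "left-skewed"
    else:
        label = "approximately symmetric"
    return {"skewness": skewness, "excess_kurtosis": excess_kurtosis, "shape": label}


rng = np.random.default_rng(1)
normal_shape = shape_sketch(pd.Series(rng.normal(size=2000)))
exponential_shape = shape_sketch(pd.Series(rng.exponential(size=2000)))
```

An exponential distribution has a theoretical skewness of 2, so the second sample should land clearly in the right-skewed bucket.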

mlvern.data.statistics.feature_target_association(df: DataFrame, target: str) → Dict[str, Any]

mlvern.data.statistics.hypothesis_test_two_samples(x: Series, y: Series) → Dict[str, Any]

Perform a two-sample t-test (Welch's t-test, which does not assume equal variances).
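
For reference, Welch's statistic and its Welch-Satterthwaite degrees of freedom can be computed by hand as below. This is a sketch under stated assumptions: the function name `welch_t_test` and the returned keys are hypothetical, and SciPy is assumed for the t distribution.

```python
import numpy as np
from scipy import stats


def welch_t_test(x, y) -> dict:
    """Hypothetical helper: two-sided Welch's t-test computed from first principles."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Variance of each sample mean (unbiased sample variance over n).
    vm_x = x.var(ddof=1) / x.size
    vm_y = y.var(ddof=1) / y.size
    t_stat = (x.mean() - y.mean()) / np.sqrt(vm_x + vm_y)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    dof = (vm_x + vm_y) ** 2 / (vm_x ** 2 / (x.size - 1) + vm_y ** 2 / (y.size - 1))
    p_value = 2.0 * stats.t.sf(abs(t_stat), dof)
    return {"t": float(t_stat), "dof": float(dof), "p_value": float(p_value)}


rng = np.random.default_rng(4)
x = rng.normal(0.0, 1.0, 200)
y = rng.normal(0.5, 3.0, 50)  # different variance and sample size
result = welch_t_test(x, y)

# SciPy's built-in Welch test, used here as a cross-check.
reference = stats.ttest_ind(x, y, equal_var=False)
```

In practice the `scipy.stats.ttest_ind(..., equal_var=False)` call alone is enough; the manual version only makes the formula explicit.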

mlvern.data.statistics.interaction_patterns(df: DataFrame, target: str | None = None, top_n: int = 10) → Dict[str, Any]

Detect interaction patterns via pairwise product correlation.
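
A minimal sketch of the pairwise-product idea follows, assuming a numeric target is given: multiply each pair of features and rank the pairs by how strongly the product correlates with the target. The name `interaction_sketch` and the tuple-based return value are hypothetical and do not reflect the dictionary that `interaction_patterns` returns.

```python
from itertools import combinations

import numpy as np
import pandas as pd


def interaction_sketch(df: pd.DataFrame, target: str, top_n: int = 10) -> list:
    """Hypothetical helper: rank feature pairs by |corr(a * b, target)|."""
    features = df.select_dtypes(include="number").columns.drop(target)
    scores = []
    for a, b in combinations(features, 2):
        corr = (df[a] * df[b]).corr(df[target])  # Pearson correlation of the product
        if pd.notna(corr):
            scores.append((a, b, float(corr)))
    scores.sort(key=lambda item: abs(item[2]), reverse=True)
    return scores[:top_n]


rng = np.random.default_rng(2)
df = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["x1", "x2", "x3"])
# The target depends on the product x1 * x2, which no single feature reveals.
df["y"] = df["x1"] * df["x2"] + 0.1 * rng.normal(size=1000)
top = interaction_sketch(df, "y", top_n=2)
```

Here the `("x1", "x2")` pair should rank first, since the target was constructed from that product.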

mlvern.data.statistics.numeric_summary(df: DataFrame, cols: List[str] | None = None) → Dict[str, Dict[str, Any]]

Compute the mean, median, standard deviation, and skewness of numeric columns.

mlvern.data.statistics.redundant_features(df: DataFrame, threshold: float = 0.95) → List[Tuple[str, str, float]]

Return pairs of features with abs(correlation) >= threshold.
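
The thresholding rule can be illustrated with a short sketch. The helper name `redundant_pairs` and the use of Pearson correlation are assumptions of this example; only the `(feature, feature, correlation)` tuple shape is taken from the signature above.

```python
from itertools import combinations

import pandas as pd


def redundant_pairs(df: pd.DataFrame, threshold: float = 0.95) -> list:
    """Hypothetical helper: feature pairs whose |Pearson correlation| >= threshold."""
    corr = df.select_dtypes(include="number").corr()
    pairs = []
    for a, b in combinations(corr.columns, 2):  # each unordered pair once
        value = corr.loc[a, b]
        if abs(value) >= threshold:             # NaN compares False and is skipped
            pairs.append((a, b, float(value)))
    return pairs


df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10.5],  # almost exactly 2 * a
    "c": [5, 1, 4, 2, 3],     # only weakly related to a and b
})
pairs = redundant_pairs(df)
```

Only the `("a", "b")` pair exceeds the default threshold in this toy frame.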

mlvern.data.statistics.vif(df: DataFrame, cols: List[str] | None = None) → Dict[str, float]

Compute the Variance Inflation Factor (VIF) for numeric features.

Uses linear regression via NumPy to compute R^2 for each feature regressed on the others; VIF = 1 / (1 - R^2).
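
The regression-based computation can be sketched with `numpy.linalg.lstsq` as below. The name `vif_sketch`, the added intercept column, the row-wise `dropna`, and the handling of a perfect fit are assumptions of this example rather than documented behavior of `vif`.

```python
import numpy as np
import pandas as pd


def vif_sketch(df: pd.DataFrame) -> dict:
    """Hypothetical helper: VIF = 1 / (1 - R^2), regressing each feature on the rest."""
    X = df.select_dtypes(include="number").dropna()
    result = {}
    for col in X.columns:
        y = X[col].to_numpy(dtype=float)
        others = X.drop(columns=col).to_numpy(dtype=float)
        design = np.column_stack([np.ones(len(y)), others])  # intercept + other features
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        residuals = y - design @ coef
        ss_res = float(residuals @ residuals)
        ss_tot = float(((y - y.mean()) ** 2).sum())
        r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else 0.0
        # A perfect fit means exact collinearity; report it as infinite.
        result[col] = float("inf") if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return result


rng = np.random.default_rng(3)
a = rng.normal(size=500)
b = rng.normal(size=500)
df = pd.DataFrame({
    "a": a,
    "b": b,
    "c": a + b + 0.1 * rng.normal(size=500),  # nearly a linear combination of a and b
    "d": rng.normal(size=500),                # independent of the others
})
vifs = vif_sketch(df)
```

Columns `a`, `b`, and `c` are nearly collinear and get large factors, while the independent column `d` stays close to 1. A common rule of thumb treats values above roughly 5 to 10 as a sign of problematic multicollinearity.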

Module contents

mlvern.data.fingerprint_dataset(df: DataFrame, target: str)

FAST fingerprinting – runs every time.

mlvern.data.register_dataset(df, target, mlvern_dir)