activelearner module

class spark_matcher.activelearner.ScoringLearner(col_names: List[str], scorer: sklearn.base.BaseEstimator, min_nr_samples: int = 10, uncertainty_threshold: float = 0.1, uncertainty_improvement_threshold: float = 0.01, n_uncertainty_improvement: int = 5, n_queries: int = 9999, sampling_method=<function uncertainty_sampling>, verbose: int = 0)

Bases: object

Class to train a string matching model using active learning.

col_names: column names used for matching

scorer: the scorer to be used in the active learning loop

min_nr_samples: minimum number of responses required before classifier convergence is tested

uncertainty_threshold: threshold on the uncertainty of the classifier during active learning, used for determining if the model has converged

uncertainty_improvement_threshold: threshold on the uncertainty improvement of the classifier during active learning, used for determining if the model has converged

n_uncertainty_improvement: span of iterations to check for the largest difference between uncertainties

n_queries: maximum number of iterations to be done for the active learning session

sampling_method: sampling method to be used for the active learning session

verbose: sets verbosity
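
The convergence-related attributes above can be read together as a stopping rule. The following is a minimal sketch of one plausible reading, not the library's actual implementation: the function name `is_converged`, the `uncertainties` history list, and the `n_responses` counter are hypothetical, and the way the two thresholds are combined (both must hold) is an assumption.

```python
from typing import List


def is_converged(
    uncertainties: List[float],
    n_responses: int,
    min_nr_samples: int = 10,
    uncertainty_threshold: float = 0.1,
    uncertainty_improvement_threshold: float = 0.01,
    n_uncertainty_improvement: int = 5,
) -> bool:
    """Hypothetical convergence test built from the documented attributes."""
    # Convergence is not tested until enough responses have been collected.
    if n_responses < min_nr_samples:
        return False
    # A full window of past uncertainties is needed to measure improvement.
    if len(uncertainties) < n_uncertainty_improvement:
        return False
    window = uncertainties[-n_uncertainty_improvement:]
    # Largest difference between uncertainties inside the window.
    largest_difference = max(window) - min(window)
    # Assumption: the model counts as converged only if it is both
    # confident enough and no longer improving noticeably.
    return (
        uncertainties[-1] <= uncertainty_threshold
        and largest_difference <= uncertainty_improvement_threshold
    )


if __name__ == "__main__":
    history = [0.45, 0.30, 0.12, 0.08, 0.08, 0.08, 0.08, 0.08]
    print(is_converged(history, n_responses=12))
    print(is_converged(history, n_responses=5))
```

With the default values, the first call reports convergence because the last five uncertainties are flat and below 0.1, while the second does not because fewer than `min_nr_samples` responses were given.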

fit(X: pandas.core.frame.DataFrame) → spark_matcher.activelearner.active_learner.ScoringLearner

Fit ScoringLearner instance on pairs of strings.

Parameters: X, a Pandas dataframe containing pairs of strings and distance metrics of the paired strings

predict_proba(X: Union[pyspark.sql.dataframe.DataFrame, pandas.core.frame.DataFrame]) → Union[pyspark.sql.dataframe.DataFrame, pandas.core.frame.DataFrame]

Predict, for new data, the probability that each pair is a match.

Parameters: X, a Pandas or Spark dataframe to predict on

Returns: match probabilities
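
Match probabilities of this kind are also what the default `sampling_method`, uncertainty sampling, operates on during active learning. The sketch below illustrates the usual least-confidence definition in plain Python. It assumes one match probability per pair and does not reproduce the library's code. The helper names `classification_uncertainty` and `pick_most_uncertain` are hypothetical.

```python
from typing import List, Sequence


def classification_uncertainty(match_probabilities: Sequence[float]) -> List[float]:
    """Binary uncertainty: 1 minus the probability of the likelier class."""
    return [1.0 - max(p, 1.0 - p) for p in match_probabilities]


def pick_most_uncertain(match_probabilities: Sequence[float]) -> int:
    """Index of the pair the classifier is least sure about (non-empty input)."""
    uncertainties = classification_uncertainty(match_probabilities)
    return max(range(len(uncertainties)), key=uncertainties.__getitem__)


if __name__ == "__main__":
    # Hypothetical match probabilities for four candidate pairs.
    probabilities = [0.97, 0.55, 0.10, 0.80]
    print(pick_most_uncertain(probabilities))
```

The pair with probability closest to 0.5 is selected, since a label for it is expected to be the most informative for the classifier.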