activelearner module

class spark_matcher.activelearner.ScoringLearner(col_names: List[str], scorer: sklearn.base.BaseEstimator, min_nr_samples: int = 10, uncertainty_threshold: float = 0.1, uncertainty_improvement_threshold: float = 0.01, n_uncertainty_improvement: int = 5, n_queries: int = 9999, sampling_method=<function uncertainty_sampling>, verbose: int = 0)

Bases: object

Class to train a string matching model using active learning.

col_names

column names used for matching

scorer

the scorer to be used in the active learning loop

min_nr_samples

minimum number of responses required before classifier convergence is tested

uncertainty_threshold

threshold on the uncertainty of the classifier during active learning, used to determine whether the model has converged

uncertainty_improvement_threshold

threshold on the improvement in classifier uncertainty during active learning, used to determine whether the model has converged

n_uncertainty_improvement

number of most recent iterations over which the largest difference between uncertainties is checked

n_queries

maximum number of iterations in the active learning session

sampling_method

sampling method to be used for the active learning session

verbose

verbosity level
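The convergence-related attributes above can be read as a stopping rule for the labelling loop. The sketch below is only an illustration of one plausible reading, not the code used inside ScoringLearner: the helper names classifier_uncertainty, most_uncertain_index and has_converged are hypothetical, the uncertainty definition (one minus the probability of the most likely class) is the usual one for uncertainty sampling, and requiring both thresholds to be met at once is an assumption.

```python
from typing import Sequence


def classifier_uncertainty(match_probability: float) -> float:
    """Uncertainty of a binary prediction: near 0 when confident, 0.5 at p = 0.5."""
    return 1.0 - max(match_probability, 1.0 - match_probability)


def most_uncertain_index(match_probabilities: Sequence[float]) -> int:
    """Index of the pair an uncertainty-sampling strategy would query next."""
    uncertainties = [classifier_uncertainty(p) for p in match_probabilities]
    return max(range(len(uncertainties)), key=uncertainties.__getitem__)


def has_converged(
    uncertainties: Sequence[float],
    min_nr_samples: int = 10,
    uncertainty_threshold: float = 0.1,
    uncertainty_improvement_threshold: float = 0.01,
    n_uncertainty_improvement: int = 5,
) -> bool:
    """Hypothetical stopping rule built from the documented attributes.

    `uncertainties` holds one classifier uncertainty per labelled response,
    oldest first.
    """
    # Convergence is not tested before enough responses have been collected.
    if len(uncertainties) < min_nr_samples:
        return False
    # Look only at the most recent span of iterations.
    recent = list(uncertainties)[-n_uncertainty_improvement:]
    largest_difference = max(recent) - min(recent)
    # ASSUMPTION: both conditions must hold; the library may combine them differently.
    return (
        recent[-1] <= uncertainty_threshold
        and largest_difference <= uncertainty_improvement_threshold
    )


if __name__ == "__main__":
    history = [0.5, 0.4, 0.3, 0.2, 0.1, 0.08, 0.08, 0.08, 0.08, 0.08]
    print(has_converged(history))
    print(most_uncertain_index([0.9, 0.55, 0.1]))
```

Under this reading, n_queries would act as a hard cap on the loop even when the rule never fires.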

fit(X: pandas.core.frame.DataFrame) → spark_matcher.activelearner.active_learner.ScoringLearner

Fit the ScoringLearner instance on pairs of strings.

Parameters: X, a pandas DataFrame containing pairs of strings and distance metrics of the paired strings

predict_proba(X: Union[pyspark.sql.dataframe.DataFrame, pandas.core.frame.DataFrame]) → Union[pyspark.sql.dataframe.DataFrame, pandas.core.frame.DataFrame]

Predict, for new data, the probability that each pair is a match.

Parameters: X, a pandas or Spark DataFrame to predict on

Returns: match probabilities