activelearner module

class spark_matcher.activelearner.ScoringLearner(col_names: List[str], scorer: sklearn.base.BaseEstimator, min_nr_samples: int = 10, uncertainty_threshold: float = 0.1, uncertainty_improvement_threshold: float = 0.01, n_uncertainty_improvement: int = 5, n_queries: int = 9999, sampling_method=<function uncertainty_sampling>, verbose: int = 0)

Bases: object

Class to train a string matching model using active learning.

col_names

column names used for matching

scorer

the scorer to be used in the active learning loop

min_nr_samples

minimum number of responses required before classifier convergence is tested

uncertainty_threshold

threshold on the uncertainty of the classifier during active learning, used for determining if the model has converged

uncertainty_improvement_threshold

threshold on the uncertainty improvement of the classifier during active learning, used for determining if the model has converged

n_uncertainty_improvement

span of iterations to check for largest difference between uncertainties

n_queries

maximum number of iterations to be done for the active learning session

sampling_method

sampling method to be used for the active learning session

verbose

sets verbosity
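The convergence-related attributes above can be read together as a stopping rule. The following is a minimal sketch of one plausible way they could combine, not the library's actual implementation: the function name `has_converged`, the list of per-iteration uncertainties, and the reading of "largest difference" as the largest change between consecutive iterations are all assumptions made for illustration.

```python
from typing import List


def has_converged(
    uncertainties: List[float],
    min_nr_samples: int = 10,
    uncertainty_threshold: float = 0.1,
    uncertainty_improvement_threshold: float = 0.01,
    n_uncertainty_improvement: int = 5,
) -> bool:
    """Hypothetical convergence test; one uncertainty value per labelled response."""
    # Too few responses: convergence is not tested yet.
    if len(uncertainties) < min_nr_samples:
        return False

    # Criterion 1 (assumed): the latest uncertainty is already low enough.
    if uncertainties[-1] <= uncertainty_threshold:
        return True

    # Criterion 2 (assumed): the uncertainty has stopped changing, measured as
    # the largest difference between consecutive values in the recent span.
    recent = uncertainties[-n_uncertainty_improvement:]
    if len(recent) < 2:
        return False
    largest_diff = max(abs(b - a) for a, b in zip(recent, recent[1:]))
    return largest_diff < uncertainty_improvement_threshold


# A session whose uncertainty is still high but has flattened out.
print(has_converged([0.5] * 10))
```

Under this reading, `n_queries` would act as a hard cap on the number of iterations regardless of whether either criterion is met.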

fit(X: pandas.core.frame.DataFrame) → spark_matcher.activelearner.active_learner.ScoringLearner

Fit ScoringLearner instance on pairs of strings.

Parameters: X: Pandas dataframe containing pairs of strings and distance metrics of paired strings

predict_proba(X: Union[pyspark.sql.dataframe.DataFrame, pandas.core.frame.DataFrame]) → Union[pyspark.sql.dataframe.DataFrame, pandas.core.frame.DataFrame]

Predict, for new data, the probability that each pair is a match.

Parameters: X: Pandas or Spark dataframe to predict on

Returns: match probabilities
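Match probabilities are also what uncertainty sampling, the default `sampling_method`, typically works from. The sketch below illustrates the general technique in plain Python, assuming one match probability per pair. The helper name `most_uncertain_pair` is hypothetical and the code is not taken from this module.

```python
from typing import List, Tuple


def most_uncertain_pair(match_probabilities: List[float]) -> Tuple[int, float]:
    """Return (index, uncertainty) of the pair the classifier is least sure about."""
    # For a binary match / no-match decision the class probabilities are p and
    # 1 - p, so the uncertainty 1 - max(p, 1 - p) is largest when p is 0.5.
    uncertainties = [1.0 - max(p, 1.0 - p) for p in match_probabilities]
    index = max(range(len(uncertainties)), key=uncertainties.__getitem__)
    return index, uncertainties[index]


# Illustrative probabilities for three candidate pairs (made-up values).
probabilities = [0.95, 0.55, 0.10]
index, uncertainty = most_uncertain_pair(probabilities)
print(index, round(uncertainty, 2))
```

In an active learning loop, the pair selected this way would be the next one presented for labelling, since its label is expected to be the most informative.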