activelearner module¶
-
class spark_matcher.activelearner.ScoringLearner(col_names: List[str], scorer: sklearn.base.BaseEstimator, min_nr_samples: int = 10, uncertainty_threshold: float = 0.1, uncertainty_improvement_threshold: float = 0.01, n_uncertainty_improvement: int = 5, n_queries: int = 9999, sampling_method=<function uncertainty_sampling>, verbose: int = 0)¶
Bases: object
Class to train a string matching model using active learning.
-
col_names¶ column names used for matching
-
scorer¶ the scorer to be used in the active learning loop
-
min_nr_samples¶ minimum number of responses required before classifier convergence is tested
-
uncertainty_threshold¶ threshold on the uncertainty of the classifier during active learning, used for determining if the model has converged
-
uncertainty_improvement_threshold¶ threshold on the uncertainty improvement of the classifier during active learning, used for determining if the model has converged
-
n_uncertainty_improvement¶ span of iterations over which the largest difference between uncertainties is checked
-
n_queries¶ maximum number of iterations to be done for the active learning session
-
sampling_method¶ sampling method to be used for the active learning session
-
verbose¶ sets verbosity
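The convergence-related attributes above (min_nr_samples, uncertainty_threshold, uncertainty_improvement_threshold, n_uncertainty_improvement) can be read as a stopping rule for the active learning loop. The sketch below is one plausible reading, not the library's implementation: the function name has_converged, the use of one uncertainty value per labelled sample, and the choice to stop when either criterion holds are all assumptions made for illustration.

```python
from typing import Sequence


def has_converged(
    uncertainties: Sequence[float],
    min_nr_samples: int = 10,
    uncertainty_threshold: float = 0.1,
    uncertainty_improvement_threshold: float = 0.01,
    n_uncertainty_improvement: int = 5,
) -> bool:
    """Hypothetical convergence test; one uncertainty value per labelled sample."""
    # Convergence is not tested until enough responses have been collected.
    if len(uncertainties) < min_nr_samples:
        return False
    # Criterion 1 (assumed): the latest uncertainty is already low enough.
    if uncertainties[-1] < uncertainty_threshold:
        return True
    # Criterion 2 (assumed): uncertainty has stopped changing, i.e. the largest
    # difference within the last n_uncertainty_improvement values is small.
    window = uncertainties[-n_uncertainty_improvement:]
    return max(window) - min(window) < uncertainty_improvement_threshold
```

With the default values, a history of ten uncertainties that has plateaued at 0.3 would count as converged under this reading, because the spread over the last five values is below 0.01 even though 0.3 is above uncertainty_threshold.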
-
fit(X: pandas.core.frame.DataFrame) → spark_matcher.activelearner.active_learner.ScoringLearner¶ Fit ScoringLearner instance on pairs of strings
Parameters: X – Pandas dataframe containing pairs of strings and distance metrics of paired strings
-
predict_proba(X: Union[pyspark.sql.dataframe.DataFrame, pandas.core.frame.DataFrame]) → Union[pyspark.sql.dataframe.DataFrame, pandas.core.frame.DataFrame]¶ Predict, for each pair in new data, the probability that it is a match
Parameters: X – Pandas or Spark dataframe to predict on
Returns: match probabilities
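Match probabilities like those returned by predict_proba are what an uncertainty-based sampling method typically acts on: pairs whose probability is close to 0.5 are the most informative to label next. The following is a minimal sketch of that idea in plain Python, assuming binary match probabilities as input. The helper names binary_uncertainty and most_uncertain_indices are hypothetical, and the sketch does not reproduce the uncertainty_sampling function referenced in the class signature.

```python
from typing import List, Sequence


def binary_uncertainty(match_probability: float) -> float:
    """Uncertainty of a binary prediction: 0.0 when certain, 0.5 when undecided."""
    return 1.0 - max(match_probability, 1.0 - match_probability)


def most_uncertain_indices(match_probabilities: Sequence[float], n: int = 1) -> List[int]:
    """Return the indices of the n pairs the classifier is least sure about."""
    ranked = sorted(
        range(len(match_probabilities)),
        key=lambda i: binary_uncertainty(match_probabilities[i]),
        reverse=True,  # highest uncertainty first
    )
    return ranked[:n]


# Illustrative probabilities, not real model output.
probabilities = [0.98, 0.55, 0.10, 0.40]
queries = most_uncertain_indices(probabilities, n=2)  # [1, 3]
```

Here the pairs at positions 1 and 3 would be presented for labelling first, since their probabilities (0.55 and 0.40) lie closest to 0.5.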
-