activelearner module
-
class spark_matcher.activelearner.ScoringLearner(col_names: List[str], scorer: sklearn.base.BaseEstimator, min_nr_samples: int = 10, uncertainty_threshold: float = 0.1, uncertainty_improvement_threshold: float = 0.01, n_uncertainty_improvement: int = 5, n_queries: int = 9999, sampling_method=<function uncertainty_sampling>, verbose: int = 0)
Bases: object
Class to train a string matching model using active learning.
-
col_names
column names used for matching
-
scorer
the scorer to be used in the active learning loop
-
min_nr_samples
minimum number of responses required before classifier convergence is tested
-
uncertainty_threshold
threshold on the uncertainty of the classifier during active learning, used to determine whether the model has converged
-
uncertainty_improvement_threshold
threshold on the improvement in classifier uncertainty during active learning, used to determine whether the model has converged
-
n_uncertainty_improvement
number of most recent iterations over which the largest difference between uncertainties is checked
-
n_queries
maximum number of iterations in the active learning session
-
sampling_method
sampling method to be used for the active learning session
-
verbose
sets verbosity
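The attribute descriptions above suggest how the convergence parameters interact. The following is a minimal sketch of one plausible stopping rule, not the library's actual code: the function name `has_converged`, the `uncertainties` history list, and the choice to require both threshold tests at once are assumptions made for illustration.

```python
from typing import List


def has_converged(
    uncertainties: List[float],
    min_nr_samples: int = 10,
    uncertainty_threshold: float = 0.1,
    uncertainty_improvement_threshold: float = 0.01,
    n_uncertainty_improvement: int = 5,
) -> bool:
    """Hypothetical stopping rule built from the documented parameters.

    `uncertainties` holds one classifier uncertainty per labelled response,
    oldest first. This is an illustration, not spark_matcher's implementation.
    """
    # Convergence is not tested before enough responses have been collected.
    if len(uncertainties) < min_nr_samples:
        return False
    # Assumption: the latest uncertainty must not exceed the threshold.
    if uncertainties[-1] > uncertainty_threshold:
        return False
    # Largest difference between uncertainties over the most recent span.
    recent = uncertainties[-n_uncertainty_improvement:]
    largest_difference = max(recent) - min(recent)
    # Assumption: a small spread means the model has stopped improving.
    return largest_difference < uncertainty_improvement_threshold


if __name__ == "__main__":
    print(has_converged([0.05] * 10))
    print(has_converged([0.5] * 10))
```

Under these assumptions, a session stops early only when the classifier is both confident and stable; otherwise it continues until `n_queries` is reached.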
-
fit(X: pandas.core.frame.DataFrame) → spark_matcher.activelearner.active_learner.ScoringLearner
Fit ScoringLearner instance on pairs of strings.
Parameters: X, a Pandas dataframe containing pairs of strings and distance metrics of paired strings
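The class signature names `uncertainty_sampling` as the default `sampling_method` for the session that `fit` runs. The idea behind uncertainty sampling for a binary match/no-match model can be sketched in plain Python as follows. This is a simplified illustration, not the library's code: the helper names, the `oracle` callback, and the use of fixed probabilities (a real session would retrain the scorer after each answer) are all assumptions.

```python
from typing import Callable, Dict, Sequence


def most_uncertain_index(probabilities: Sequence[float]) -> int:
    """Return the index of the pair whose match probability is closest to 0.5."""
    return min(range(len(probabilities)), key=lambda i: abs(probabilities[i] - 0.5))


def query_labels(
    probabilities: Sequence[float],
    oracle: Callable[[int], bool],
    n_queries: int,
) -> Dict[int, bool]:
    """Ask the oracle about the most uncertain pairs, at most n_queries times.

    Simplification: probabilities stay fixed, whereas an active learning
    loop would refit the model and rescore the pairs after every answer.
    """
    labelled: Dict[int, bool] = {}
    remaining = list(range(len(probabilities)))
    while remaining and len(labelled) < n_queries:
        # Pick the unlabelled pair the model is least sure about.
        pick = min(remaining, key=lambda i: abs(probabilities[i] - 0.5))
        labelled[pick] = oracle(pick)
        remaining.remove(pick)
    return labelled


if __name__ == "__main__":
    scores = [0.9, 0.55, 0.1, 0.4]
    # Hypothetical oracle: pretend that pairs with an even index are matches.
    print(query_labels(scores, oracle=lambda i: i % 2 == 0, n_queries=2))
```

Querying the pairs nearest to 0.5 first spends the limited number of user responses on the examples the model is least sure about.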
-
predict_proba(X: Union[pyspark.sql.dataframe.DataFrame, pandas.core.frame.DataFrame]) → Union[pyspark.sql.dataframe.DataFrame, pandas.core.frame.DataFrame]
Predict, for new data, the probability that each pair is a match.
Parameters: X, a Pandas or Spark dataframe to predict on
Returns: match probabilities
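Match probabilities usually have to be turned into match decisions by the caller. A minimal sketch of that step is shown below, assuming the probabilities have already been extracted into a plain list. The helper `select_matches` and the default threshold of 0.5 are hypothetical and not part of spark_matcher.

```python
from typing import List, Sequence, Tuple


def select_matches(
    pairs: Sequence[Tuple[str, str]],
    probabilities: Sequence[float],
    threshold: float = 0.5,
) -> List[Tuple[str, str]]:
    """Keep the pairs whose match probability reaches the threshold.

    Hypothetical helper for illustration; the threshold value is an assumption.
    """
    if len(pairs) != len(probabilities):
        raise ValueError("pairs and probabilities must have the same length")
    return [pair for pair, p in zip(pairs, probabilities) if p >= threshold]


if __name__ == "__main__":
    candidate_pairs = [("Jon Smith", "John Smith"), ("Jon Smith", "Mary Jones")]
    print(select_matches(candidate_pairs, [0.93, 0.04]))
```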
-