kgextension package

Submodules

kgextension.caching_helper module

kgextension.caching_helper.clear_cache()

Function that clears the cache of all cached methods when it’s called.

kgextension.caching_helper.freeze_unhashable(freeze_by='argument', freeze_argument=None, freeze_index=None)

Wrapper function to “freeze” an unhashable function argument (dictionary or pandas Series) into a hashable OrderedDict. Used for functions that need to be cached but take these types of arguments as inputs.

Parameters
  • freeze_by (str, optional) – Indicates whether the argument to be frozen is selected by its name (“argument”) or by its index (“index”). Defaults to “argument”.

  • freeze_argument (str, optional) – Name of the argument that should be frozen. Used if freeze_by = “argument”. Defaults to None.

  • freeze_index (int, optional) – Index of the argument that should be frozen. Used if freeze_by = “index”. Defaults to None.

kgextension.caching_helper.show_cache_info()

Function that gives the user an overview of the status of all cached methods.

kgextension.caching_helper.unfreeze_unhashable(frozen_argument, frozen_type='series')

Function to “unfreeze” unhashable arguments “frozen” by the freeze_unhashable function.

Parameters
  • frozen_argument (tuple/OrderedDict) – The frozen argument. A pandas Series is frozen as a tuple, a dictionary as an OrderedDict.

  • frozen_type (str, optional) – Indicates whether the frozen argument is a pandas Series (“series”) or a dictionary (“dict”). Defaults to “series”.

Returns

The content of the OrderedDict in its original format.

Return type

pd.Series/dict
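
The freeze/unfreeze round trip can be sketched with the standard library alone. The helper names below (freeze_dict, unfreeze_dict, count_keys) are hypothetical and not part of kgextension, and the sketch freezes a dictionary into a tuple of items rather than an OrderedDict. It only illustrates why a hashable form is needed before a function can be cached.

```python
from functools import lru_cache


def freeze_dict(d):
    """Turn a dict into a hashable tuple of (key, value) pairs (hypothetical helper)."""
    return tuple(d.items())


def unfreeze_dict(frozen):
    """Restore the original dict from its frozen form (hypothetical helper)."""
    return dict(frozen)


@lru_cache(maxsize=None)
def count_keys(frozen):
    # lru_cache requires hashable arguments, so the dict arrives frozen
    # and is unfrozen again inside the cached function.
    return len(unfreeze_dict(frozen))


prefixes = {
    "dbo": "http://dbpedia.org/ontology/",
    "dbr": "http://dbpedia.org/resource/",
}
frozen = freeze_dict(prefixes)

first = count_keys(frozen)   # computed
second = count_keys(frozen)  # served from the cache
hits = count_keys.cache_info().hits
```

Passing the plain dictionary to count_keys would raise a TypeError, because dictionaries are not hashable.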

kgextension.endpoints module

kgextension.endpoints.DBpedia = <kgextension.sparql_helper.RemoteEndpoint object>

Predefined SPARQL endpoint for DBpedia.

Settings:

ResultSetMaxRows = 10000; MaxQueryExecutionTime = 120 (seconds); MaxQueryCostEstimationTime = 1500 (seconds); Connection limit = 50 (parallel connections per IP address); maximum request rate = 100 (requests per second per IP address, with an initial burst of 120 requests)

NOTE: Queries which time out will return PARTIAL results in a best effort fashion, and will NOT return an error.

Source: https://wiki.dbpedia.org/public-sparql-endpoint

kgextension.endpoints.EUOpenData = <kgextension.sparql_helper.RemoteEndpoint object>

Predefined SPARQL endpoint for the EU Open Data Portal (EU ODP).

No usage policy was found for this endpoint.

Source: https://data.europa.eu/euodp/en/developerscorner

kgextension.endpoints.WikiData = <kgextension.sparql_helper.RemoteEndpoint object>

Predefined SPARQL endpoint for WikiData.

NOTE: A user-specific User-Agent header is required (https://meta.wikimedia.org/wiki/User-Agent_policy). Set it via the “agent” argument!

There is a hard query deadline of 60 seconds. The following limits also apply:

  • One client (user agent + IP) is allowed 60 seconds of processing time every 60 seconds.

  • One client is allowed 30 error queries per minute.

Clients exceeding the limits above are throttled with HTTP code 429. Use Retry-After header to see when the request can be repeated. If the client ignores 429 responses and continues to produce requests over the limits, it can be temporarily banned from the service. Clients who don’t comply with the User-Agent policy may be blocked completely – make sure to send a good User-Agent header.

Every query will time out if it takes longer to execute than the configured deadline. You may want to optimize the query or report a problematic query here.

Also note that currently access to the service is limited to 5 parallel queries per IP. The above limits are subject to change depending on resources and usage patterns.

Source: https://www.mediawiki.org/wiki/Wikidata_Query_Service/User_Manual#Query_limits
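
A client that respects the throttling described above has to read the Retry-After header of a 429 response before repeating a request. The following is a minimal sketch under stated assumptions: the function name retry_delay and the fallback of 60 seconds are hypothetical, headers are passed as a plain dict, and only the numeric (seconds) form of Retry-After is parsed.

```python
def retry_delay(status_code, headers, default_delay=60.0):
    """Return the seconds to wait before repeating a request, or None if no wait is needed."""
    if status_code != 429:
        return None
    value = headers.get("Retry-After")
    try:
        # Numeric form: number of seconds to wait.
        return max(float(value), 0.0)
    except (TypeError, ValueError):
        # Header missing or given as an HTTP date: fall back to a fixed delay.
        return default_delay


ok = retry_delay(200, {})
throttled = retry_delay(429, {"Retry-After": "12"})
no_header = retry_delay(429, {})
```

A caller would sleep for the returned number of seconds (for example with time.sleep) and only then resend the query.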

kgextension.feature_selection module

kgextension.feature_selection.greedy_top_down_filter(df, label_column, column_prefix='new_link_type_', G=None, progress=True)

Hierarchical feature selection based on the Greedy Top Down algorithm.

Lu, S., Ye, Y., Tsui, R., Su, H., Rexit, R., Wesaratchakit, S., Liu, X. and Hwa, R., 2013, October. Domain ontology-based feature reduction for high dimensional drug data and its application to 30-day heart failure readmission prediction. In 9th IEEE International Conference on Collaborative Computing: Networking, Applications and Worksharing (pp. 478-484). IEEE.

Parameters
  • df (pd.DataFrame) – DataFrame that contains the label as well as the features generated (by a generator).

  • label_column (str) – Name of the label column.

  • column_prefix (str) – Prefix of the columns generated by the generator (e.g. “new_link_type_”). Defaults to “new_link_type_”.

  • G (nx.DiGraph, optional) – Graph that contains the hierarchy. If None, the hierarchy attached to the provided df will be used. Defaults to None.

  • progress (bool, optional) – If True, progress bars will be shown to inform the user about the progress made by the process. Defaults to True.

Raises

TypeError – Raised if the graph provided is not a directed acyclic graph (DAG).

Returns

DataFrame reduced to columns determined by the GTD algorithm as well as columns in the original df that are not created by a generator (that don’t start with column_prefix).

Return type

pd.DataFrame

kgextension.feature_selection.hierarchy_based_filter(df, label_column, G=None, threshold=0.99, metric='info_gain', pruning=True, all_remove=True, progress=True, **kwargs)

Feature selection using the SHSEL approach, which consists of an initial selection algorithm and a pruning algorithm. It identifies and filters out ranges of nodes with similar relevance in each branch of the hierarchy.

Ristoski, P. and Paulheim, H., 2014, October. Feature selection in hierarchical feature spaces. In International conference on discovery science (pp. 288-300). Springer, Cham.

Parameters
  • df (pd.DataFrame) – Dataframe containing the original features and the class column.

  • label_column (str) – Name of the output/class column.

  • G (nx.DiGraph, optional) – The directed graph of all classes and superclasses can be specified here; if None, the function looks for the graph in the pd.DataFrame.attrs.hierarchy attribute of the input dataframe. Defaults to None.

  • threshold (float, optional) – Relevance similarity threshold set by the user; 0.99 is recommended. Defaults to 0.99.

  • metric (str/func, optional) – Relevance similarity metric: information gain (“info_gain”) or correlation (“correlation”). A custom metric function can be passed instead. Defaults to “info_gain”.

  • pruning (bool, optional) – Whether to apply the pruning algorithm. If True, only the most valuable features are kept, i.e. those whose Information Gain is greater than the average Information Gain of the previously reduced set. Defaults to True.

  • all_remove (bool, optional) – Only relevant when pruning is True. Whether to strictly remove all nodes as soon as one of their info gain values is smaller than the average info gain of the paths. Defaults to True.

  • progress (bool, optional) – If True, progress bars will be shown to inform the user about the progress made by the process. Defaults to True.

Returns

Filtered Dataframe containing the selected attributes.

Return type

pd.DataFrame

kgextension.feature_selection.hill_climbing_filter(df, label_column, metric='hill_climbing_cost_function', G=None, beta=0.05, k=5, progress=True, **kwargs)

Feature selection performed by comparing nodes with their parents in a bottom-up approach.

Wang, B.B., Mckay, R.B., Abbass, H.A. and Barlow, M., 2003, February. A comparative study for domain ontology guided feature extraction. In Proceedings of the 26th Australasian computer science conference-Volume 16 (pp. 69-78).

Parameters
  • df (pd.DataFrame) – Dataframe containing the original features and the class column.

  • label_column (str) – Name of the output/class column.

  • metric (str/func, optional) – Cost function to determine the value of a feature set. Higher values indicate a better feature set. Should take at least df and class_col (pd.Series of the class column) as input and output a single numeric value. Defaults to ‘hill_climbing_cost_function’.

  • G (nx.DiGraph, optional) – The directed graph of all classes and superclasses can be specified here; if None, the function looks for the graph in the pd.DataFrame.attrs.hierarchy attribute of the input dataframe. Defaults to None.

  • beta (float, optional) – Regularization parameter of cost function. Defaults to 0.05.

  • k (int, optional) – Number of nearest neighbors for cost function. Defaults to 5.

  • progress (bool, optional) – If True, progress updates will be shown to inform the user about the progress made by the process. Defaults to True.

Returns

Dataframe with filtered classes.

Return type

pd.DataFrame

kgextension.feature_selection.tree_based_filter(df, label_column, G=None, metric='Lift', progress=True)

Filter attributes with Tree-Based Feature Selection (TSEL). TSEL selects the most valuable attributes from each path in the hierarchy, based on lift or information gain.

Jeong, Y. and Myaeng, S.H., 2013, October. Feature selection using a semantic hierarchy for event recognition and type classification. In Proceedings of the Sixth International Joint Conference on Natural Language Processing (pp. 136-144).

Parameters
  • df (pd.DataFrame) – Dataframe with hierarchy (output of generator)

  • label_column (str) – Name of the column with the class/label

  • G (nx.DiGraph, optional) – The directed graph of all classes and superclasses can be specified here; if None, the function looks for the graph in the pd.DataFrame.attrs.hierarchy attribute of the input dataframe. Defaults to None.

  • metric (str/func, optional) – Metric which is used to determine the representative features (IG/Lift). Defaults to ‘Lift’.

  • progress (bool, optional) – If True, progress updates will be shown to inform the user about the progress made by the process. Defaults to True.

Returns

Filtered Dataframe containing the selected attributes.

Return type

pd.DataFrame

kgextension.feature_selection_helper module

kgextension.feature_selection_helper.add_hierarchy_columns(df, G, keep_prefix=False)

Given a feature dataframe and corresponding hierarchy graph, add all the higher-level features to the dataframe with correct boolean values.

Parameters
  • df (pd.DataFrame) – Dataframe with all the lowest-level children features.

  • G (nx.DiGraph) – Directed feature hierarchy graph, direction from children to parents.

  • keep_prefix (bool, optional) – Whether to keep prefixes from original directory children. Defaults to False.

Returns

Dataframe with all higher hierarchy features appended.

Return type

pd.DataFrame

kgextension.feature_selection_helper.calc_average_ig(path_nodes, node_values)

Helper function for the SHSEL filter algorithm. It returns the average Information Gain value of one existing path in the pruning function.

Parameters
  • path_nodes (list) – Nodes in the path whose node_availability is True.

  • node_values (dict) – Dictionary about every node in the directed graph and its information gain value.

Returns

The average InfoGain value of one existing path.

Return type

float

kgextension.feature_selection_helper.calc_gr(df, label_column, progress=True)

Calculates the Gain Ratio for each column of a df in relation to a specified label_column.

Parameters
  • df (pd.DataFrame) – Dataframe the Gain Ratio values need to be calculated for.

  • label_column (str) – Name of the label_column.

  • progress (bool, optional) – If True, progress bars will be shown to inform the user about the progress made by the process. Defaults to True.

Raises

RuntimeWarning – Is raised if the gain ratio calculation fails for a column (and returns a nan).

Returns

Dictionary with the column names as keys and the corresponding Gain Ratio values as values.

Return type

dict
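
For orientation, the textbook definition of Gain Ratio (information gain divided by the entropy of the feature itself) can be sketched in plain Python. The function names entropy and gain_ratio are hypothetical, and this is not necessarily how calc_gr computes its values; returning nan for a constant column is an assumption chosen to mirror the failure case described above.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    total = len(values)
    return -sum((n / total) * log2(n / total) for n in Counter(values).values())


def gain_ratio(feature, labels):
    """Information gain of `feature` about `labels`, divided by the feature's entropy."""
    total = len(labels)
    conditional = 0.0
    for value in set(feature):
        subset = [lab for feat, lab in zip(feature, labels) if feat == value]
        conditional += len(subset) / total * entropy(subset)
    info_gain = entropy(labels) - conditional
    split_info = entropy(feature)
    if split_info == 0:
        # A constant feature carries no split information; the ratio is undefined.
        return float("nan")
    return info_gain / split_info


perfect = gain_ratio([1, 1, 0, 0], ["a", "a", "b", "b"])
constant = gain_ratio([1, 1, 1, 1], ["a", "a", "b", "b"])
```

Here the first feature separates the two labels perfectly, while the second is constant and therefore yields nan.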

kgextension.feature_selection_helper.calculate_lift(df, G, label_column)

Helper function for TSEL filter. Calculates the lift value for every node in a given graph.

Parameters
  • df (pd.DataFrame) – Dataframe the lift needs to be calculated for.

  • G (nx.DiGraph) – Directed graph for the dataframe.

  • label_column (str) – Name of the column with the class/label.

Returns

Dictionary containing column names as keys and lift as value.

Return type

dict
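
As an illustration of the quantity involved, the sketch below uses the common definition of lift, P(feature and class) / (P(feature) * P(class)), for a boolean feature and a boolean label. The function name lift and the restriction to binary labels are assumptions; calculate_lift may handle multi-class labels differently.

```python
def lift(feature, labels):
    """Lift of a boolean feature with respect to a boolean label."""
    n = len(labels)
    p_feature = sum(feature) / n
    p_label = sum(labels) / n
    p_both = sum(1 for feat, lab in zip(feature, labels) if feat and lab) / n
    if p_feature == 0 or p_label == 0:
        # Lift is undefined when either event never occurs; report 0.0 here.
        return 0.0
    return p_both / (p_feature * p_label)


feature = [True, True, False, False]
labels = [True, True, True, False]
value = lift(feature, labels)
```

A value above 1 means the feature and the class co-occur more often than expected under independence.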

kgextension.feature_selection_helper.exist_unchecked_leafs(G)

Helper function for hierarchical hill climbing. The function determines whether any of the leaf nodes of the graph have the attribute checked set to False. It returns the number of leaves for which this is the case.

Parameters

G (nx.DiGraph) – The directed graph to be checked.

Returns

Number of unchecked leaves.

Return type

int

kgextension.feature_selection_helper.find_shortest_paths(G, root='VRN', progress=True)

Finds the shortest path between the (virtual) root node of a graph and each leaf of the graph.

Parameters
  • G (nx.DiGraph) – Directed Graph.

  • root (str, optional) – Name of the root node. Defaults to “VRN”.

  • progress (bool, optional) – If True, progress bars will be shown to inform the user about the progress made by the process. Defaults to True.

Returns

List of shortest paths.

Return type

list
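
The idea of collecting one shortest path from the root to every leaf can be sketched with a breadth-first search over a plain adjacency dictionary. The function name shortest_paths_to_leaves and the example hierarchy are hypothetical, and the sketch assumes edges pointing from the root towards the leaves; it does not use networkx and is not the library's implementation.

```python
from collections import deque


def shortest_paths_to_leaves(children, root="VRN"):
    """Return one shortest root-to-leaf path per leaf.

    `children` maps each node to a list of its child nodes.
    """
    parent = {root: None}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in parent:  # the first visit is via a shortest path
                parent[child] = node
                queue.append(child)

    paths = []
    for node in parent:
        if not children.get(node):  # leaf: no outgoing edges
            path = []
            current = node
            while current is not None:
                path.append(current)
                current = parent[current]
            paths.append(path[::-1])
    return paths


hierarchy = {
    "VRN": ["Agent", "Place"],
    "Agent": ["Person"],
    "Person": ["Athlete"],
    "Place": ["Athlete", "City"],
}
paths = shortest_paths_to_leaves(hierarchy)
```

In this example “Athlete” is reachable via “Agent” and via “Place”, and only the shorter route through “Place” is kept.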

kgextension.feature_selection_helper.get_all_paths(G, root)

Helper function for TSEL filter. Returns all possible paths in a given graph.

Parameters
  • G (nx.DiGraph) – Directed graph.

  • root (str) – Name of the root node.

Returns

List containing all paths in the graph.

Return type

list

kgextension.feature_selection_helper.get_max_node(candidates, gr_values, column_prefix='')

Given a set of candidate nodes, returns the one with the highest Gain Ratio.

Parameters
  • candidates (list) – List of candidate nodes.

  • gr_values (dict) – Dictionary with column names as keys and the corresponding Gain Ratio values as values.

  • column_prefix (str) – Prefix of the columns generated by the generator (e.g. “new_link_type_”). Defaults to “”.

Returns

Name of the node with the highest Gain Ratio in the candidate set.

Return type

str

kgextension.feature_selection_helper.gtd_logic(df, G, label_column, column_prefix, progress=True)

Greedy Top Down algorithm to select most relevant nodes in a Graph based on Gain Ratio.

Parameters
  • df (pd.DataFrame) – DataFrame.

  • G (nx.DiGraph) – Directed Graph containing the hierarchy.

  • label_column (str) – Name of the label column.

  • column_prefix (str) – Prefix of the columns generated by the generator (e.g. “new_link_type_”).

  • progress (bool, optional) – If True, progress bars will be shown to inform the user about the progress made by the process. Defaults to True.

Returns

Set of nodes (as strings) that are deemed most relevant by the algorithm.

Return type

set

kgextension.feature_selection_helper.hill_climbing_cost_function(df, class_col, alpha, beta, k)

Calculates the regularized purity for the hierarchical hill climbing algorithm using Nearest Neighbors.

Parameters
  • df (pd.DataFrame) – Dataframe with the feature selection to be evaluated.

  • class_col (pd.Series) – The column with the class/output values.

  • alpha (float) – Size of original feature space.

  • beta (float) – Regularization parameter.

  • k (int) – Number of nearest neighbors.

Returns

Cost value for this set of features.

Return type

float

kgextension.feature_selection_helper.prune(df_filtered, G, node_values, node_availability, L, remove_flag=True, progress=True)

The pruning function of the hierarchy_based_filter algorithm: selects only the most valuable features from the previously reduced set, i.e. those whose Information Gain is greater than the average Information Gain.

Parameters
  • df_filtered (pd.DataFrame) – The resulting dataframe that is output by the initial selection algorithm.

  • G (nx.DiGraph) – The reverse of the directed graph of all classes and superclasses.

  • node_values (dict) – Dictionary containing the information gain value of every node in the directed graph.

  • node_availability (dict) – Dictionary containing every node in the directed graph and its availability (either True or False).

  • L (list) – A list containing the leaf nodes of the directed graph.

  • remove_flag (bool, optional) – Whether to strictly remove all nodes as soon as one of their info gain values is smaller than the average info gain of the paths. Defaults to True.

  • progress (bool, optional) – If True, progress bars will be shown to inform the user about the progress made by the process. Defaults to True.

Returns

Filtered Dataframe containing the selected attributes.

Return type

pd.DataFrame

kgextension.feature_selection_helper.representative_feature(path, values)

Helper function for TSEL filter. Returns the representative node of a given path.

Parameters
  • path (list) – Path containing some node names.

  • values (dict) – Dictionary containing nodes and their values.

Returns

Name of most valuable/representative node of the given path.

Return type

str

kgextension.feature_selection_sklearn module

class kgextension.feature_selection_sklearn.GreedyTopDownFilter(label_column, column_prefix='new_link_type_', G=None, progress=True)

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

fit(X, y=None)

transform(X, y=None)

class kgextension.feature_selection_sklearn.HierarchyBasedFilter(label_column, G=None, threshold=0.99, metric='info_gain', pruning=True, all_remove=True, progress=True, **kwargs)

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Feature selection using the SHSEL approach, which consists of an initial selection algorithm and a pruning algorithm. It identifies and filters out ranges of nodes with similar relevance in each branch of the hierarchy. It can be used in a sklearn pipeline.

Ristoski, P. and Paulheim, H., 2014, October. Feature selection in hierarchical feature spaces. In International conference on discovery science (pp. 288-300). Springer, Cham.

Parameters
  • BaseEstimator (sklearn.base.BaseEstimator) –

  • TransformerMixin (sklearn.base.TransformerMixin) –

fit(X, y=None)

transform(X, y=None)

class kgextension.feature_selection_sklearn.HillClimbingFilter(label_column, metric='hill_climbing_cost_function', G=None, beta=0.05, k=5, progress=True, **kwargs)

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Feature selection performed by comparing nodes with their parents in a bottom-up approach. Can be used in a sklearn pipeline.

Wang, B.B., Mckay, R.B., Abbass, H.A. and Barlow, M., 2003, February. A comparative study for domain ontology guided feature extraction. In Proceedings of the 26th Australasian computer science conference-Volume 16 (pp. 69-78).

Parameters
  • BaseEstimator (sklearn.base.BaseEstimator) –

  • TransformerMixin (sklearn.base.TransformerMixin) –

fit(X, y=None)

transform(X, y=None)

class kgextension.feature_selection_sklearn.TreeBasedFilter(label_column, G=None, metric='Lift', progress=True)

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

fit(X, y=None)

transform(X, y=None)

kgextension.fusion module

kgextension.fusion.data_fuser(df, clusters, boolean_method_single='provenance', boolean_method_multiple='voting', numeric_method_single='average', numeric_method_multiple='average', string_method_single='longest', string_method_multiple='longest', provenance_regex='http://dbpedia.org/', progress=True)

Fuses the columns in the “match” sets of the clusters. Determines type and size and automatically detects which of the functions to use. If a fusion match is a pair, the “single” function is used, otherwise the “multiple” function. Available functions are first, last, longest, shortest, random.choice, voting and provenance. Other existing and user-defined functions can be passed as well; they should be applicable to pd.DataFrame.apply(axis=1).

Parameters
  • df (pd.DataFrame) – The DataFrame where schema matches are to be fused

  • clusters (list) – contains the clusters with the matching column names as sets

  • boolean_method_single (str, optional) – Method for single matches with boolean type. Defaults to “provenance”.

  • boolean_method_multiple (str, optional) – Method for multiple matches with boolean type. Defaults to “voting”.

  • numeric_method_single (str, optional) – Method for single matches with numeric type. Defaults to “average”.

  • numeric_method_multiple (str, optional) – Method for multiple matches with numeric type. Defaults to “average”.

  • string_method_single (str, optional) – Method for single matches with string type. Defaults to “longest”.

  • string_method_multiple (str, optional) – Method for multiple matches with string type. Defaults to “longest”.

  • provenance_regex (str, optional) – Pattern after which provenance is selected. Defaults to “http://dbpedia.org/”.

  • progress (bool, optional) – If True, progress bars will be shown to inform the user about the progress made by the process. Defaults to True.

Returns

DataFrame with fused columns.

Return type

pd.DataFrame

kgextension.fusion.get_fusion_clusters(df, threshold, progress=True)

Takes the attribute pairs generated by one of the matchers, discards all pairs with a similarity below the specified threshold and then clusters the remaining pairs into sets of equal attributes (based on the idea that similarities between attributes are Euclidean). Example: The pairs {car, auto} and {car, automobile} would be clustered into the set {car, auto, automobile} (if both pairs have a similarity ≥ the specified threshold).

Parameters
  • df (pd.DataFrame) – The DataFrame containing the similarities between the attribute pairs. This is generated by one of the matchers.

  • threshold (float) – Threshold that specifies the minimal similarity between two attributes, so that they are considered as matched.

Returns

List of sets that contain equal (matched) attributes.

Return type

list
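
The clustering step described above (drop pairs below the threshold, then merge overlapping pairs into sets) can be sketched with a small union-find structure. The function name cluster_pairs and the input format, a list of (attribute, attribute, similarity) tuples instead of a DataFrame, are assumptions made to keep the example self-contained.

```python
def cluster_pairs(pairs, threshold):
    """Merge attribute pairs with similarity >= threshold into sets of matched attributes."""
    parent = {}

    def find(x):
        # Find the representative of x, shortening the chain while walking it.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, similarity in pairs:
        if similarity >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for node in list(parent):
        clusters.setdefault(find(node), set()).add(node)
    return list(clusters.values())


pairs = [
    ("car", "auto", 0.90),
    ("car", "automobile", 0.85),
    ("bike", "plane", 0.20),
]
clusters = cluster_pairs(pairs, threshold=0.8)
```

With a threshold of 0.8 the pair (“bike”, “plane”) is discarded and the two remaining pairs collapse into the single set {car, auto, automobile}.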

kgextension.fusion_helper module

kgextension.fusion_helper.first(x)

Returns the first not-NA value, helper function for pd.DataFrame.apply.

Parameters

x (pd.Series) – columns/rows passed in pd.DataFrame.apply function

Returns

first not-NA value of the pd.Series

Return type

flexible

kgextension.fusion_helper.fusion_function_lookup(boolean_method_single, boolean_method_multiple, numeric_method_single, numeric_method_multiple, string_method_single, string_method_multiple)

Maps each method name passed as a string to the corresponding function, e.g. boolean_method_single = ‘random’ –> random.choice.

Parameters
  • boolean_method_single (str) – method to use for a cluster of size two and boolean values.

  • boolean_method_multiple (str) – method to use for a cluster of more than size two and boolean values.

  • numeric_method_single (str) – method to use for a cluster of size two and numeric values.

  • numeric_method_multiple (str) – method to use for a cluster of more than size two and numeric values.

  • string_method_single (str) – method to use for a cluster of size two and string values.

  • string_method_multiple (str) – method to use for a cluster of more than size two and string values.

Returns

A dictionary with the mapping from method to function.

Return type

dict

kgextension.fusion_helper.last(x)

Returns the last not-NA value, helper function for pd.DataFrame.apply.

Parameters

x (pd.Series) – columns/rows passed in pd.DataFrame.apply function

Returns

last not-NA value of the pd.Series

Return type

flexible

kgextension.fusion_helper.longest(x)

Returns the longest value, helper function for pd.DataFrame.apply.

Parameters

x (pd.Series) – columns/rows passed in pd.DataFrame.apply function

Returns

longest value of the pd.Series

Return type

str

kgextension.fusion_helper.provenance(columns, regex='http://dbpedia.org/')

Determines the name of the column matching the regex pattern.

Parameters
  • columns (pd.DataFrame.columns) – The columns of the schema matches to be fused

  • regex (str, optional) – The regex string identifying the column name, generally the prefix of the feature. Defaults to “http://dbpedia.org/”.

Returns

The name of the column matching the regex pattern.

Return type

str

Raises

AttributeError – If no column or more than one column of the fusion cluster matches the pattern.
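
The selection rule can be sketched as follows: exactly one column name must match the pattern, otherwise an AttributeError is raised. The function name pick_by_provenance and the example column names are hypothetical; a plain list of strings stands in for pd.DataFrame.columns.

```python
import re


def pick_by_provenance(columns, regex="http://dbpedia.org/"):
    """Return the single column name that matches the regex pattern."""
    matches = [column for column in columns if re.search(regex, column)]
    if len(matches) != 1:
        raise AttributeError(
            f"Expected exactly one column matching {regex!r}, found {len(matches)}."
        )
    return matches[0]


columns = [
    "population_http://dbpedia.org/ontology/populationTotal",
    "population_http://www.wikidata.org/prop/direct/P1082",
]
chosen = pick_by_provenance(columns)
```

Calling the function with a pattern that matches none or both of the columns raises the error instead of silently picking one.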

kgextension.fusion_helper.shortest(x)

Returns the shortest value, helper function for pd.DataFrame.apply.

Parameters

x (pd.Series) – columns/rows passed in pd.DataFrame.apply function

Returns

shortest value of the pd.Series

Return type

str

kgextension.fusion_helper.voting(x)

Chooses the value with the most votes (mode value in statistics). If there is a draw, the first value is chosen.

Parameters

x (pd.Series) – columns/rows passed in pd.DataFrame.apply function

Returns

mode value of the pd.Series

Return type

flexible
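
The behaviour of these row-wise helpers can be illustrated on plain Python lists. The names first_not_na, longest_value and vote are hypothetical stand-ins, and None is used in place of pandas NA values; the actual helpers operate on a pd.Series passed in by pd.DataFrame.apply.

```python
from collections import Counter


def first_not_na(values):
    """Return the first value that is not None, or None if there is none."""
    return next((v for v in values if v is not None), None)


def longest_value(values):
    """Return the longest of the values that are not None."""
    return max((v for v in values if v is not None), key=len)


def vote(values):
    """Return the most frequent value; on a draw, the value seen first wins."""
    present = [v for v in values if v is not None]
    return Counter(present).most_common(1)[0][0]


first = first_not_na([None, 3, 4])
longest = longest_value(["NYC", None, "New York City"])
winner = vote(["a", "b", "b", "a"])
```

In the last call both values receive two votes, so the value that occurs first is returned.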

kgextension.generator module

kgextension.generator.custom_sparql_generator(df, link_attribute, query, endpoint=<kgextension.sparql_helper.RemoteEndpoint object>, progress=True, attribute_generation_strategy='first', prefix_lookup=False, caching=True)

This generator issues a custom SPARQL query and creates additional attributes from the query results.

Parameters
  • df (pd.DataFrame) – Dataframe to which links are added

  • link_attribute (str) – Name of column containing the link to the knowledge graph.

  • query (str) – Custom SPARQL query which returns attributes to be appended.

  • endpoint (Endpoint, optional) – SPARQL Endpoint to be queried; ignored when “uri_data_model” = True. Defaults to DBpedia.

  • progress (bool, optional) – If True, progress bars will be shown to inform the user about the progress made by the process. Defaults to True.

  • prefix_lookup (bool/str/dict, optional) – True: Namespaces of prefixes will be looked up at prefix.cc and added to the sparql query. str: User provides the path to a json-file with prefixes and namespaces. dict: User provides a dictionary with prefixes and namespaces. Defaults to False.

  • caching (bool, optional) – Turn result-caching for queries issued during the execution on or off. Defaults to True.

Returns

Dataframe with new columns containing the query results.

Return type

pd.DataFrame

kgextension.generator.data_properties_generator(df, columns, endpoint=<kgextension.sparql_helper.RemoteEndpoint object>, uri_data_model=False, progress=True, type_filter=None, regex_filter=None, bundled_mode=True, prefix_lookup=False, caching=True)

Generator that takes a dataset with a link to a knowledge graph and creates a new feature for each data property of the given resource.

Parameters
  • df (pd.DataFrame) – Dataframe to which the features will be added

  • columns (str/list) – Name(s) of column(s) which contain(s) the link(s) to the knowledge graph.

  • endpoint (Endpoint, optional) – Base string to the knowledge graph; ignored when “uri_data_model” = True. Defaults to DBpedia.

  • uri_data_model (bool, optional) – If enabled, the URI is directly queried instead of a SPARQL endpoint. Defaults to False.

  • progress (bool, optional) – If True, progress bars will be shown to inform the user about the progress made by the process. Defaults to True.

  • type_filter (str, optional) – Property datatype to be selected from results (e.g. xsd:string). If a specific datatype should be excluded a “- ” needs to be prepended (e.g. - xsd:string). Defaults to None.

  • regex_filter (str, optional) – Regular expression for filtering properties. Defaults to None.

  • bundled_mode (bool, optional) – If True, all necessary queries are bundled into one query (using the VALUES method). Requires a SPARQL 1.1 implementation! Defaults to True.

  • prefix_lookup (bool/str/dict, optional) – True: Namespaces of prefixes will be looked up at prefix.cc and added to the sparql query. str: User provides the path to a json-file with prefixes and namespaces. dict: User provides a dictionary with prefixes and namespaces. Defaults to False.

  • caching (bool, optional) – Turn result-caching for queries issued during the execution on or off. Defaults to True.

Returns

Dataframe with a new column for each property.

Return type

pd.DataFrame

kgextension.generator.direct_type_generator(df, columns, endpoint=<kgextension.sparql_helper.RemoteEndpoint object>, uri_data_model=False, progress=True, prefix='', regex_filter=None, result_type='boolean', bundled_mode=True, hierarchy=False, prefix_lookup=False, caching=True)

Generator that takes a dataset with (a) link(s) to a knowledge graph and queries the type(s) of the linked resources (using rdf:type). The resulting types are added as new columns, which are filled either with a boolean indicator or a count.

Parameters
  • df (pd.DataFrame) – Dataframe to which types are added.

  • columns (str/list) – Name(s) of column(s) which contain(s) the link(s) to the knowledge graph.

  • endpoint (Endpoint, optional) – SPARQL Endpoint to be queried; ignored when “uri_data_model” = True. Defaults to DBpedia.

  • uri_data_model (bool, optional) – If enabled, the URI is directly queried instead of a SPARQL endpoint. Defaults to False.

  • progress (bool, optional) – If True, progress bars will be shown to inform the user about the progress made by the process. Defaults to True.

  • prefix (str, optional) – Custom prefix for the SPARQL query. Defaults to “”.

  • regex_filter (list, optional) – A list filled with regexes (as strings) to filter the results. Defaults to None.

  • result_type (str, optional) – States whether the results should be boolean (“boolean”), counts (“counts”), relative counts (“relative”) or tfidf-values (“tfidf”). Defaults to “boolean”.

  • bundled_mode (bool, optional) – If True, all necessary queries are bundled into one query (using the VALUES method). Requires a SPARQL 1.1 implementation! Defaults to True.

  • hierarchy (bool, optional) – If True, a hierarchy of all superclasses of the returned types is attached to the resulting dataframe. Defaults to False.

  • prefix_lookup (bool/str/dict, optional) – True: Namespaces of prefixes will be looked up at prefix.cc and added to the sparql query. str: User provides the path to a json-file with prefixes and namespaces. dict: User provides a dictionary with prefixes and namespaces. Defaults to False.

  • caching (bool, optional) – Turn result-caching for queries issued during the execution on or off. Defaults to True.

Returns

Returns dataframe with (a) new column(s) containing the found types.

Return type

pd.DataFrame
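
To make the bundled mode more concrete, the sketch below builds a single SPARQL 1.1 query that asks for the rdf:type of several resources at once via a VALUES clause. The function name build_bundled_type_query and the exact query shape are assumptions for illustration; the query actually issued by the generator may look different.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def build_bundled_type_query(uris):
    """Build one query that fetches the rdf:type of all given resource URIs."""
    values = " ".join(f"<{uri}>" for uri in uris)
    return (
        "SELECT DISTINCT ?value ?type WHERE { "
        f"VALUES ?value {{ {values} }} "
        f"?value <{RDF_TYPE}> ?type . }}"
    )


query = build_bundled_type_query(
    [
        "http://dbpedia.org/resource/Berlin",
        "http://dbpedia.org/resource/Paris",
    ]
)
```

Without bundling, one query per resource would be needed, which matters given the request and connection limits of public endpoints.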

kgextension.generator.qualified_relation_generator(df, columns, endpoint=<kgextension.sparql_helper.RemoteEndpoint object>, uri_data_model=False, progress=True, prefix='Link', direction='Out', properties_regex_filter=None, types_regex_filter=None, result_type='boolean', hierarchy=False, prefix_lookup=False, caching=True)

Qualified relation generator considers not only relations, but also the related types, adding boolean, counts, relative counts or tfidf-values features for incoming and outgoing relations.

Parameters
  • df (pd.DataFrame) – Dataframe to which links are added.

  • columns (str/list) – Name(s) of column(s) which contain(s) the link(s) to the knowledge graph.

  • endpoint (Endpoint, optional) – SPARQL Endpoint to be queried; ignored when “uri_data_model” = True. Defaults to DBpedia.

  • uri_data_model (bool, optional) – If enabled, the URI is directly queried instead of a SPARQL endpoint. Defaults to False.

  • progress (bool, optional) – If True, progress bars will be shown to inform the user about the progress made by the process. Defaults to True.

  • prefix (str, optional) – Custom prefix for the SPARQL query. Defaults to “Link”.

  • direction (str, optional) – Direction of the properties to consider: incoming (“In”) or outgoing (“Out”). Defaults to “Out”.

  • properties_regex_filter (str, optional) – Regular expression for filtering properties. Defaults to None.

  • types_regex_filter (str, optional) – Regular expression for filtering types. Defaults to None.

  • result_type (str, optional) – States whether the results should be boolean (“boolean”), counts (“counts”), relative counts (“relative”) or tfidf-values (“tfidf”). Defaults to “boolean”.

  • hierarchy (bool, optional) – If True, a hierarchy of all superclasses of the returned types is attached to the resulting dataframe. Defaults to False.

  • prefix_lookup (bool/str/dict, optional) – True: Namespaces of prefixes will be looked up at prefix.cc and added to the sparql query. str: User provides the path to a json-file with prefixes and namespaces. dict: User provides a dictionary with prefixes and namespaces. Defaults to False.

  • caching (bool, optional) – Turn result-caching for queries issued during the execution on or off. Defaults to True.

Returns

Dataframe with new columns containing the links of properties to the knowledge graph

Return type

pd.DataFrame

kgextension.generator.specific_relation_generator(df, columns, endpoint=<kgextension.sparql_helper.RemoteEndpoint object>, uri_data_model=False, progress=True, direct_relation='http://purl.org/dc/terms/subject', hierarchy_relation=None, max_hierarchy_depth=1, prefix_lookup=False, caching=True)

Creates attributes from a specific direct relation. Additionally, it is possible to append a hierarchy with a user-defined hierarchy relation.

Parameters
  • df (pd.DataFrame) – The dataframe to extend.

  • columns (str/list) – Name(s) of column(s) which contain(s) the link(s) to the knowledge graph.

  • endpoint (Endpoint, optional) – SPARQL Endpoint to be queried; ignored when “uri_data_model” = True. Defaults to DBpedia.

  • uri_data_model (bool, optional) – If enabled, the URI is directly queried instead of a SPARQL endpoint. Defaults to False.

  • progress (bool, optional) – If True, progress bars will be shown to inform the user about the progress made by the process. Defaults to True.

  • direct_relation (str, optional) – Direct relation used to create features. Defaults to “http://purl.org/dc/terms/subject”.

  • hierarchy_relation (str, optional) – Hierarchy relation used to connect categories, e.g. http://www.w3.org/2004/02/skos/core#broader. Defaults to None.

  • max_hierarchy_depth (int, optional) – Maximal number of hierarchy steps taken. Defaults to 1.

  • prefix_lookup (bool/str/dict, optional) – True: Namespaces of prefixes will be looked up at prefix.cc and added to the sparql query. str: User provides the path to a json-file with prefixes and namespaces. dict: User provides a dictionary with prefixes and namespaces. Defaults to False.

  • caching (bool, optional) – Turn result-caching for queries issued during the execution on or off. Defaults to True.

Returns

The dataframe with additional features.

Return type

pd.DataFrame

kgextension.generator.unqualified_relation_generator(df, columns, endpoint=<kgextension.sparql_helper.RemoteEndpoint object>, uri_data_model=False, progress=True, prefix='Link', direction='Out', regex_filter=None, result_type='boolean', prefix_lookup=False, caching=True)

Unqualified relation generator creates attributes from the existence of relations and adds boolean, counts, relative counts or tfidf-values features for incoming and outgoing relations.

Parameters
  • df (pd.DataFrame) – Dataframe to which links are added.

  • columns (str/list) – Name(s) of column(s) which contain(s) the link(s) to the knowledge graph.

  • endpoint (Endpoint, optional) – SPARQL Endpoint to be queried; ignored when “uri_data_model” = True. Defaults to DBpedia.

  • uri_data_model (bool, optional) – If enabled, the URI is directly queried instead of a SPARQL endpoint. Defaults to False.

  • progress (bool, optional) – If True, progress bars will be shown to inform the user about the progress made by the process. Defaults to True.

  • prefix (str, optional) – Custom prefix for the SPARQL query. Defaults to “Link”.

  • direction (str, optional) – Direction of the properties to consider: incoming (“In”) or outgoing (“Out”). Defaults to “Out”.

  • regex_filter (str, optional) – Regular expression for filtering properties. Defaults to None.

  • result_type (str, optional) – States whether the results should be boolean (“boolean”), counts (“counts”), relative counts (“relative”) or tfidf-values (“tfidf”). Defaults to “boolean”.

  • prefix_lookup (bool/str/dict, optional) – True: Namespaces of prefixes will be looked up at prefix.cc and added to the sparql query. str: User provides the path to a json-file with prefixes and namespaces. dict: User provides a dictionary with prefixes and namespaces. Defaults to False.

  • caching (bool, optional) – Turn result-caching for queries issued during the execution on or off. Defaults to True.

Returns

Dataframe with new columns containing the links of properties to the knowledge graph

Return type

pd.DataFrame
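
The four result_type options can be illustrated on toy data. The sketch below is not the library's implementation: it assumes the relations of each entity have already been retrieved, the function name relation_features is hypothetical, and the tf-idf weighting shown (relative count times log(N / document frequency)) is one common definition chosen for illustration.

```python
import math
from collections import Counter


def relation_features(relations_per_entity, result_type="boolean"):
    """Turn per-entity relation lists into feature dicts (illustrative only)."""
    n_entities = len(relations_per_entity)
    counts = [Counter(rels) for rels in relations_per_entity]
    # Document frequency: number of entities that have a relation at least once.
    doc_freq = Counter(rel for c in counts for rel in c)
    features = []
    for c in counts:
        total = sum(c.values())
        row = {}
        for rel in doc_freq:
            n = c.get(rel, 0)
            if result_type == "boolean":
                row[rel] = n > 0
            elif result_type == "counts":
                row[rel] = n
            elif result_type == "relative":
                row[rel] = n / total if total else 0.0
            elif result_type == "tfidf":
                # Assumed weighting: relative count * log(N / document frequency).
                tf = n / total if total else 0.0
                row[rel] = tf * math.log(n_entities / doc_freq[rel])
            else:
                raise ValueError(f"Unknown result_type: {result_type}")
        features.append(row)
    return features


# Two toy entities with their outgoing relations.
toy = [["dbo:birthPlace", "dbo:birthPlace", "dbo:spouse"], ["dbo:spouse"]]
boolean_rows = relation_features(toy, "boolean")
count_rows = relation_features(toy, "counts")
```

With this weighting, a relation that occurs for every entity gets a tf-idf value of zero, since it does not help to tell entities apart.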

kgextension.generator_helper module

kgextension.generator_helper.create_graph_from_raw(DG, results, max_hierarchy_depth, current_level, uri_data_model)

Converts the XML obtained by the endpoint wrapper into a hierarchical directed graph.

Parameters
  • DG (Directed Graph) – The empty or preprocessed graph to be appended.

  • results (DOM/pd.DataFrame) – The raw results of the SPARQL query.

  • max_hierarchy_depth (int) – The maximum number of hierarchy levels when the direct search is used.

  • current_level (pd.Series) – In case of iterative hierarchy generation the values of the current hierarchy level.

  • uri_data_model (bool) – If enabled, the URI is directly queried instead of a SPARQL endpoint.

Returns

Graph where edges point to direct superclasses of nodes. current_level: In case of iterative hierarchy generation the updated hierarchy level.

Return type

nx.DirectedGraph

kgextension.generator_helper.get_result_df(df, result_type, prefix, merged_df, column)

Helper function for the unqualified and qualified relation generators. It creates the result dataframe and avoids duplicating code between the two main functions.

Parameters
  • df (pd.DataFrame) – The result dataframe dummies.

  • result_type (str) – The type of result chosen from boolean, count, relative count or tf-idf.

  • prefix (str) – Prefix set automatically by the generator.

  • merged_df (pd.DataFrame) – The original dataframe supplied by the user.

  • column (str) – Name of the attribute containing entities that should be found.

Returns

The final dataframe.

Return type

pd.DataFrame

kgextension.generator_helper.hierarchy_graph_generator(col, hierarchy_relation='http://www.w3.org/2000/01/rdf-schema#subClassOf', max_hierarchy_depth=None, endpoint=<kgextension.sparql_helper.RemoteEndpoint object>, uri_data_model=False, progress=False, caching=True)

Computes a hierarchy graph from an original set of features, where directed edges symbolise a hierarchy relation from subclass to superclass.

Parameters
  • col (pd.Series) – The classes/categories for which the hierarchy graph is generated.

  • hierarchy_relation (str, optional) – The hierarchy relation to be used. Defaults to “http://www.w3.org/2000/01/rdf-schema#subClassOf”.

  • max_hierarchy_depth (int, optional) – Number of jumps in hierarchy. If None, transitive jumps are used. Defaults to None.

  • endpoint (Endpoint, optional) – Link to the SPARQL endpoint that should be queried. Defaults to DBpedia.

  • uri_data_model (bool, optional) – If enabled, the URI is directly queried instead of a SPARQL endpoint. Defaults to False.

  • progress (bool, optional) – If True, progress bars will be shown to inform the user about the progress made by the process (if “uri_data_model” = True). Defaults to False.

  • caching (bool, optional) – Turn result-caching for queries issued during the execution on or off. Defaults to True.

Returns

Graph where edges point to direct superclasses of nodes.

Return type

nx.DirectedGraph

kgextension.generator_helper.hierarchy_query_creator(col, hierarchy_relation, max_hierarchy_depth, uri_data_model)

Creates a Sparql query to retrieve the hierarchy of classes/categories.

Parameters
  • col (pd.Series) – pd.Series containing the URIs.

  • hierarchy_relation (str) – A hierarchy relation, e.g. http://www.w3.org/2004/02/skos/core#broader.

  • max_hierarchy_depth (int) – The maximum number of hierarchy levels added based on the original resources. If None is passed, transitive hierarchies are created, which may lead to a timeout.

  • uri_data_model (bool) – If False, the query is formulated for SPARQL endpoints.

Returns

The SPARQL Query for hierarchy retrieval.

Return type

str
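
How such a query might look can be sketched with SPARQL 1.1 property paths. The function below is a hypothetical illustration, not the query the library actually generates: the variable names ?value and ?super and the use of a VALUES clause are assumptions. A depth of None maps to the transitive path operator, a fixed depth to an alternation of fixed-length paths.

```python
def sketch_hierarchy_query(uris, hierarchy_relation, max_hierarchy_depth=None):
    """Build a SPARQL 1.1 query for the superclasses of the given URIs."""
    values = " ".join(f"<{uri}>" for uri in uris)
    rel = f"<{hierarchy_relation}>"
    if max_hierarchy_depth is None:
        # One or more steps: transitive closure, which may time out on large graphs.
        path = f"{rel}+"
    else:
        # Alternation of fixed-length paths: p | p/p | ... up to the maximum depth.
        path = " | ".join(
            "/".join([rel] * depth) for depth in range(1, max_hierarchy_depth + 1)
        )
    return (
        "SELECT DISTINCT ?value ?super WHERE { "
        f"VALUES ?value {{ {values} }} "
        f"?value {path} ?super . }}"
    )


sub_class_of = "http://www.w3.org/2000/01/rdf-schema#subClassOf"
transitive_query = sketch_hierarchy_query(
    ["http://dbpedia.org/ontology/City"], sub_class_of
)
two_level_query = sketch_hierarchy_query(
    ["http://dbpedia.org/ontology/City"], sub_class_of, max_hierarchy_depth=2
)
```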

kgextension.generator_sklearn module

class kgextension.generator_sklearn.DataPropertiesGenerator(columns, endpoint=<kgextension.sparql_helper.RemoteEndpoint object>, uri_data_model=False, progress=True, type_filter=None, regex_filter=None, bundled_mode=True, prefix_lookup=False, caching=True)

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

fit(X, y=None)
transform(X, y=None)
class kgextension.generator_sklearn.DirectTypeGenerator(columns, endpoint=<kgextension.sparql_helper.RemoteEndpoint object>, uri_data_model=False, progress=True, prefix='', regex_filter=None, result_type='boolean', bundled_mode=True, hierarchy=False, prefix_lookup=False, caching=True)

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

fit(X, y=None)
transform(X, y=None)
class kgextension.generator_sklearn.QualifiedRelationGenerator(columns, endpoint=<kgextension.sparql_helper.RemoteEndpoint object>, uri_data_model=False, progress=True, prefix='Link', direction='Out', properties_regex_filter=None, types_regex_filter=None, result_type='boolean', hierarchy=False, prefix_lookup=False, caching=True)

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Qualified relation generator considers not only relations, but also the related types, adding boolean, counts, relative counts or tfidf-values features for incoming and outgoing relations.

Args:

BaseEstimator (sklearn.base.BaseEstimator) TransformerMixin (sklearn.base.TransformerMixin)

fit(X, y=None)
transform(X, y=None)
class kgextension.generator_sklearn.SpecificRelationGenerator(columns, endpoint=<kgextension.sparql_helper.RemoteEndpoint object>, uri_data_model=False, progress=True, direct_relation='http://purl.org/dc/terms/subject', hierarchy_relation=None, max_hierarchy_depth=1, prefix_lookup=False, caching=True)

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

fit(X, y=None)
transform(X, y=None)
class kgextension.generator_sklearn.UnqualifiedRelationGenerator(columns, endpoint=<kgextension.sparql_helper.RemoteEndpoint object>, uri_data_model=False, progress=True, prefix='Link', direction='Out', regex_filter=None, result_type='boolean', prefix_lookup=False, caching=True)

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Unqualified relation generator creates attributes from the existence of relations and adds boolean, counts, relative counts or tfidf-values features for incoming and outgoing relations.

Parameters
  • BaseEstimator (sklearn.base.BaseEstimator) –

  • TransformerMixin (sklearn.base.TransformerMixin) –

fit(X, y=None)
transform(X, y=None)
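
The classes in this module expose the fit/transform interface. The following plain-Python stand-in mirrors that interface without depending on scikit-learn; it assumes that each class simply stores its constructor arguments and delegates to a function in transform, which this reference does not confirm. FunctionTransformerSketch and add_suffix are hypothetical names used only for illustration.

```python
class FunctionTransformerSketch:
    """Hypothetical stand-in for a fit/transform wrapper around a function."""

    def __init__(self, func, **params):
        self.func = func
        self.params = params

    def fit(self, X, y=None):
        # Nothing is learned from the data, so fit just returns the transformer.
        return self

    def transform(self, X, y=None):
        # Delegate to the wrapped function with the stored keyword arguments.
        return self.func(X, **self.params)


def add_suffix(rows, suffix=""):
    """Toy 'generator' used only for this example."""
    return [row + suffix for row in rows]


step = FunctionTransformerSketch(add_suffix, suffix="_linked")
result = step.fit(["a", "b"]).transform(["a", "b"])
```

A real transformer would additionally inherit from sklearn.base.BaseEstimator and sklearn.base.TransformerMixin, as the classes listed above do.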

kgextension.linking module

kgextension.linking.dbpedia_lookup_linker(df, column, new_attribute_name='new_link', progress=True, base_url='https://lookup.dbpedia.org/api/search/', max_hits=1, query_class='', lookup_api='KeywordSearch', caching=True)

Implementation of the DBpedia Lookup service (https://github.com/dbpedia/lookup). Takes strings from a column, looks for matching DBpedia entities and returns their URIs in newly added columns.

Parameters
  • df (pd.DataFrame) – Dataframe to which links are added.

  • column (str) – Name of the attribute containing entities that should be looked up.

  • new_attribute_name (str, optional) – Name of column / prefix of columns containing the link to the knowledge graph. Defaults to “new_link”.

  • progress (bool, optional) – If True, progress bars will be shown to inform the user about the progress made by the process. Defaults to True.

  • base_url (str, optional) – Set the base URL for the generation of request URLs. Defaults to “https://lookup.dbpedia.org/api/search/”.

  • max_hits (int, optional) – Maximal number of URIs that should be returned per entity. Defaults to 1.

  • query_class (str, optional) – A DBpedia class from the DBpedia Ontology (https://wiki.dbpedia.org/services-resources/ontology) that the results should have (without prefix, e.g., dbo:place as place). Defaults to “”.

  • lookup_api (str, optional) – Choose between KeywordSearch and PrefixSearch mode of DBpedia Lookup. Defaults to “KeywordSearch”.

  • caching (bool, optional) – Turn result-caching for lookups issued during the execution on or off. Defaults to True.

Returns

Returns dataframe with (a) new column(s) containing the links to the DBpedia entities.

Return type

pd.DataFrame
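
A request URL for one lookup can be assembled from base_url, lookup_api and the search term. The sketch below is an assumption-laden illustration: the query parameter names (QueryClass, QueryString, MaxHits) follow the legacy DBpedia Lookup API and may differ from what the library or the current service uses, and sketch_lookup_url is a hypothetical helper.

```python
from urllib.parse import urlencode


def sketch_lookup_url(term, base_url="https://lookup.dbpedia.org/api/search/",
                      lookup_api="KeywordSearch", query_class="", max_hits=1):
    """Assemble a request URL for a single search term (illustrative only)."""
    # Parameter names follow the legacy Lookup API and are an assumption here.
    params = {"QueryClass": query_class, "QueryString": term, "MaxHits": max_hits}
    return f"{base_url}{lookup_api}?{urlencode(params)}"


url = sketch_lookup_url("Berlin Wall", query_class="place")
```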

kgextension.linking.dbpedia_spotlight_linker(df, column, new_attribute_name='new_link', progress=True, max_hits=1, language='en', selection='first', confidence=0.3, support=5, min_similarity_score=0.5, caching=True)

Implementation of the DBpedia Spotlight Service (https://www.dbpedia-spotlight.org/). Takes strings from a column, looks for linked Wikipedia entities and returns their URIs to newly added columns.

Parameters
  • df (pd.DataFrame) – Dataframe to which links are added.

  • column (str) – Name of the column whose entities should be found.

  • new_attribute_name (str, optional) – Name of column containing the link to the knowledge graph. Defaults to “new_link”.

  • progress (bool, optional) – If True, progress bars will be shown to inform the user about the progress made by the process. Defaults to True.

  • max_hits (int, optional) – Maximal number of URIs that should be returned per entity. Defaults to 1.

  • language (str, optional) – The DBpedia language setting. Defaults to “en”.

  • selection (str, optional) – Specifies whether the entities that occur first (first), that have the highest support (support) or that have the highest similarity score (similarityScore) should be chosen. Defaults to “first”.

  • confidence (float, optional) – Confidence threshold. Defaults to 0.3.

  • support (int, optional) – Support threshold. Defaults to 5.

  • min_similarity_score (float, optional) – Minimal similarity threshold. Defaults to 0.5.

  • caching (bool, optional) – Turn result-caching for queries issued during the execution on or off. Defaults to True.

Returns

Returns dataframe with (a) new column(s) containing the DBpedia URIs.

Return type

pd.DataFrame

kgextension.linking.label_linker(df, column, new_attribute_name='new_link', progress=True, endpoint=<kgextension.sparql_helper.RemoteEndpoint object>, result_filter=None, language='en', max_hits=1, label_property='rdfs:label', prefix_lookup=False, caching=True)

Label Linker takes attributes from a column and adds a new column with the respective knowledge graph links based on the provided label_property (rdfs:label by default).

Parameters
  • df (pd.DataFrame) – Dataframe to which links are added.

  • column (str) – Name of the column whose entities should be found.

  • new_attribute_name (str, optional) – Name of column containing the link to the knowledge graph. Defaults to “new_link”.

  • progress (bool, optional) – If True, progress bars will be shown to inform the user about the progress made by the process. Defaults to True.

  • endpoint (Endpoint, optional) – Choose SPARQL endpoint connection. Defaults to DBpedia.

  • result_filter (list, optional) – A list filled with regexes (as strings) to filter the results. Defaults to None.

  • language (str, optional) – Used to specify the language the labels are in. If the queried endpoint does not use language tags, set to None. Defaults to “en”.

  • max_hits (int, optional) – Maximal number of URIs that should be returned per entity. Defaults to 1.

  • label_property (str, optional) – Specifies the label_property that should be used in the query. Defaults to “rdfs:label”.

  • prefix_lookup (bool/str/dict, optional) – True: Namespaces of prefixes will be looked up at prefix.cc and added to the sparql query. str: User provides the path to a json-file with prefixes and namespaces. dict: User provides a dictionary with prefixes and namespaces. Defaults to False.

  • caching (bool, optional) – Turn result-caching for queries issued during the execution on or off. Defaults to True.

Returns

Dataframe with a new column containing the links to the knowledge graph.

Return type

pd.DataFrame

kgextension.linking.pattern_linker(df, column, new_attribute_name='new_link', progress=True, base_url='http://dbpedia.org/resource/', url_encoding=True, DBpedia_link_format=True)

Basic Pattern Linker that takes attributes from a column and a base link and generates a new column with the respective knowledge graph links.

Parameters
  • df (pd.DataFrame) – Dataframe to which links are added.

  • column (str) – Name of column whose entities should be found.

  • new_attribute_name (str, optional) – Name of column containing the link to the knowledge graph. Defaults to “new_link”.

  • progress (bool, optional) – If True, progress updates will be shown to inform the user about the progress made by the process. Defaults to True.

  • base_url (str, optional) – Base string to the knowledge graph. Defaults to “http://dbpedia.org/resource/”.

  • url_encoding (bool, optional) – Enables automatic url encoding. Defaults to True.

  • DBpedia_link_format (bool, optional) – Enables conversion to DBpedia link format. Defaults to True.

Returns

Dataframe with a new column containing the links to the knowledge graph.

Return type

pd.DataFrame
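
The idea behind pattern-based linking can be sketched for a single value as follows. This is a minimal illustration, not the library's code: it assumes that the DBpedia link format means replacing spaces with underscores, and sketch_pattern_link is a hypothetical helper.

```python
from urllib.parse import quote


def sketch_pattern_link(label, base_url="http://dbpedia.org/resource/",
                        url_encoding=True, dbpedia_link_format=True):
    """Build a knowledge graph link from a plain label (illustrative only)."""
    name = label.strip()
    if dbpedia_link_format:
        # Assumption: DBpedia resource names use underscores instead of spaces.
        name = name.replace(" ", "_")
    if url_encoding:
        # Percent-encode characters that are not safe in a URL path.
        name = quote(name)
    return base_url + name


link = sketch_pattern_link("Barack Obama")
```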

kgextension.linking.sameas_linker(df, column, new_attribute_name='new_link', progress=True, endpoint=<kgextension.sparql_helper.RemoteEndpoint object>, result_filter=None, uri_data_model=False, bundled_mode=True, prefix_lookup=False, caching=True)

Function that takes URIs from a column of a DataFrame and queries a given SPARQL endpoint for resources which are connected to these URIs via owl:sameAs. Found resources are added as new columns to the dataframe and the dataframe is returned.

Parameters
  • df (pd.DataFrame) – Dataframe to which links are added.

  • column (str) – Name of the column for whose entities links should be found.

  • new_attribute_name (str, optional) – Name / prefix of the column(s) containing the found links. Defaults to “new_link”.

  • progress (bool, optional) – If True, progress bars will be shown to inform the user about the progress made by the process (if “uri_data_model” = True). Defaults to True.

  • endpoint (Endpoint, optional) – SPARQL Endpoint to be queried; ignored when “uri_data_model” = True. Defaults to DBpedia.

  • result_filter (list, optional) – A list filled with regexes (as strings) to filter the results. Defaults to None.

  • uri_data_model (bool, optional) – If enabled, the URI is directly queried instead of a SPARQL endpoint. Defaults to False.

  • bundled_mode (bool, optional) – If True, all necessary queries are bundled into one query (using the VALUES method). Requires a SPARQL 1.1 implementation. Defaults to True.

  • prefix_lookup (bool/str/dict, optional) – True: Namespaces of prefixes will be looked up at prefix.cc and added to the sparql query. str: User provides the path to a json-file with prefixes and namespaces. dict: User provides a dictionary with prefixes and namespaces. Defaults to False.

  • caching (bool, optional) – Turn result-caching for queries issued during the execution on or off. Defaults to True.

Returns

Returns dataframe with (a) new column(s) containing the found resources.

Return type

pd.DataFrame
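
The difference between bundled and unbundled mode can be sketched as follows. The query text and variable names are assumptions made for illustration and are not taken from the library: bundled mode sends one query with a SPARQL 1.1 VALUES clause, whereas unbundled mode issues one query per URI.

```python
def sketch_sameas_queries(uris, bundled_mode=True):
    """Return the SPARQL queries needed to fetch owl:sameAs links."""
    prefix = "PREFIX owl: <http://www.w3.org/2002/07/owl#> "
    if bundled_mode:
        # One query for all URIs, using the SPARQL 1.1 VALUES clause.
        values = " ".join(f"<{uri}>" for uri in uris)
        return [
            prefix
            + f"SELECT ?value ?sameas WHERE {{ VALUES ?value {{ {values} }} "
            + "?value owl:sameAs ?sameas . }"
        ]
    # One query per URI, which also works on SPARQL 1.0 endpoints.
    return [
        prefix + f"SELECT ?sameas WHERE {{ <{uri}> owl:sameAs ?sameas . }}"
        for uri in uris
    ]


uris = ["http://dbpedia.org/resource/Berlin", "http://dbpedia.org/resource/Paris"]
bundled = sketch_sameas_queries(uris)
unbundled = sketch_sameas_queries(uris, bundled_mode=False)
```

Bundling reduces the number of requests to one, at the cost of requiring VALUES support on the endpoint.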

kgextension.linking_helper module

kgextension.linking_helper.dll_query_resolver(query_link, maxHits)

Resolves a query link for the DBpedia Lookup API to a series of the URIs returned for that query.

Parameters
  • query_link (str) – API request in link form.

  • maxHits (int) – Maximal number of URIs that should be returned by the API.

Returns

Series containing the URIs as strings.

Return type

pd.Series

kgextension.linking_helper.spotlight_uri_extractor(entry, link, max_hits=1, selection='first', confidence=0.5, support=20, min_similarity_score=0.8)

Finds linked DBpedia entities of a string and returns them as a list.

Parameters
  • entry (str) – Text in which entities are to be found.

  • link (str) – Link to DBpedia Spotlight.

  • max_hits (int, optional) – Maximal number of URIs that should be returned per entity. Defaults to 1.

  • selection (str, optional) – Specifies whether the entities that occur first (first), that have the highest support (support) or that have the highest similarity score (similarityScore) should be chosen. Defaults to “first”.

  • confidence (float, optional) – #TODO. Defaults to 0.5.

  • support (int, optional) – #TODO. Defaults to 20.

  • min_similarity_score (float, optional) – #TODO. Defaults to 0.8.

Returns

All URIs found in accordance with the parameters. If max_hits > found URIs the list is filled with NAs.

Return type

list
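
The selection and padding behaviour described above can be sketched on an already parsed candidate list. The dictionary keys used for the candidates and the function name are hypothetical, and None stands in for NA; how the library parses the Spotlight response is not shown here.

```python
def sketch_select_uris(candidates, max_hits=1, selection="first"):
    """Pick up to max_hits URIs and pad the result with None."""
    if selection == "first":
        ranked = list(candidates)  # keep the order of occurrence
    elif selection in ("support", "similarityScore"):
        ranked = sorted(candidates, key=lambda c: c[selection], reverse=True)
    else:
        raise ValueError(f"Unknown selection: {selection}")
    uris = [c["uri"] for c in ranked[:max_hits]]
    # Fill up with None (standing in for NA) when fewer URIs were found.
    return uris + [None] * (max_hits - len(uris))


candidates = [
    {"uri": "http://dbpedia.org/resource/A", "support": 10, "similarityScore": 0.9},
    {"uri": "http://dbpedia.org/resource/B", "support": 50, "similarityScore": 0.7},
]
by_support = sketch_select_uris(candidates, selection="support")
padded = sketch_select_uris(candidates, max_hits=3)
```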

kgextension.linking_sklearn module

class kgextension.linking_sklearn.DbpediaLookupLinker(column, new_attribute_name='new_link', progress=True, base_url='http://lookup.dbpedia.org/api/search/', max_hits=1, query_class='', lookup_api='KeywordSearch', caching=True)

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

fit(X, y=None)
transform(X, y=None)
class kgextension.linking_sklearn.DbpediaSpotlightLinker(column, new_attribute_name='new_link', progress=True, max_hits=1, language='en', selection='first', confidence=0.3, support=5, min_similarity_score=0.5, caching=True)

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

fit(X, y=None)
transform(X, y=None)
class kgextension.linking_sklearn.LabelLinker(column, new_attribute_name='new_link', progress=True, endpoint=<kgextension.sparql_helper.RemoteEndpoint object>, result_filter=None, language='en', max_hits=1, label_property='rdfs:label', prefix_lookup=False, caching=True)

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

fit(X, y=None)
transform(X, y=None)
class kgextension.linking_sklearn.PatternLinker(column, new_attribute_name='new_link', progress=True, base_url='www.dbpedia.org/resource/', url_encoding=True, DBpedia_link_format=True)

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

fit(X, y=None)
transform(X, y=None)
class kgextension.linking_sklearn.SameAsLinker(column, new_attribute_name='new_link', progress=True, endpoint=<kgextension.sparql_helper.RemoteEndpoint object>, result_filter=None, uri_data_model=False, prefix='', bundled_mode=True, prefix_lookup=False, caching=True)

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

fit(X, y=None)
transform(X, y=None)

kgextension.schema_matching module

kgextension.schema_matching.label_schema_matching(df, endpoint=<kgextension.sparql_helper.RemoteEndpoint object>, uri_data_model=False, to_lowercase=True, remove_prefixes=True, remove_punctuation=True, prefix_threshold=1, progress=True, caching=True)

Schema matching method that checks for matching rdfs:label values between the linked attributes.

Parameters
  • df (pd.DataFrame) – The dataframe where matching attributes are supposed to be found.

  • endpoint (Endpoint, optional) – SPARQL Endpoint to be queried. Defaults to DBpedia.

  • uri_data_model (bool, optional) – If enabled, the URI is directly queried instead of a SPARQL endpoint. Defaults to False.

  • to_lowercase (bool, optional) – Converts queried strings to lowercase. Defaults to True.

  • remove_prefixes (bool, optional) – Removes prefixes of queried strings. Defaults to True.

  • remove_punctuation (bool, optional) – Removes punctuation from queried strings. Defaults to True.

  • prefix_threshold (int, optional) – The number of occurrences after which a prefix is considered “common”. Defaults to 1.

  • progress (bool, optional) – If True, progress bars will be shown to inform the user about the progress made by the process (if “uri_data_model” = True). Defaults to True.

  • caching (bool, optional) – Turn result-caching for queries issued during the execution on or off. Defaults to True.

Returns

Two columns with matching links and a third column with the overlapped label.

Return type

pd.DataFrame

kgextension.schema_matching.matching_combiner(matching_result_dfs, method='avg', columns=None, ignore_single_missings=False, weights=None, thresholds=None, merge_on=['uri_1', 'uri_2'])

Combines results of the schema matching functions into a single score per combination of attributes.

Parameters
  • matching_result_dfs (list) – Results of the schema matching functions.

  • method (str/method, optional) – Function combining the individual scores. Defaults to “avg”.

  • columns (list, optional) – Columns of the input dataframes to take into account. If none are given automatically detects them from the input. Defaults to None.

  • ignore_single_missings (bool, optional) – If enabled, computes scores even if one of the values is missing. Defaults to False.

  • weights (list, optional) – Weights for weighting the different scores, if method = “weighted”. Defaults to None.

  • thresholds (float, optional) – Thresholds for thresholding the different scores, if method = “thresholding”. Defaults to None.

  • merge_on (list, optional) – Names of the columns on which the DataFrames in “matching_result_dfs” should be merged. Defaults to [“uri_1”, “uri_2”].

Raises

ValueError – Raised if the input of “weights” or “thresholds” is not correct.

Returns

DataFrame that contains the combined matching score for each URI-pair.

Return type

pd.DataFrame
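
Combining the scores of a single URI pair can be sketched as follows. This is an illustration under assumptions, not the library's implementation: only the “avg” and “weighted” methods are covered, weights are normalised by their sum, and a missing score is represented by None.

```python
def sketch_combine(scores, method="avg", weights=None, ignore_single_missings=False):
    """Combine the scores of one URI pair; None marks a missing score."""
    if method not in ("avg", "weighted"):
        raise ValueError(f"Unknown method: {method}")
    if method == "weighted" and (weights is None or len(weights) != len(scores)):
        raise ValueError("weights must contain one weight per score")
    if any(s is None for s in scores) and not ignore_single_missings:
        # Without ignore_single_missings a missing value yields no score.
        return None
    if method == "avg":
        weights = [1] * len(scores)
    present = [(s, w) for s, w in zip(scores, weights) if s is not None]
    if not present:
        return None
    # Assumption: weights are normalised by their sum over the present scores.
    return sum(s * w for s, w in present) / sum(w for _, w in present)


average = sketch_combine([1.0, 0.5])
weighted = sketch_combine([1.0, 0.5], method="weighted", weights=[3, 1])
```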

kgextension.schema_matching.relational_matching(df, endpoints=[<kgextension.sparql_helper.RemoteEndpoint object>, <kgextension.sparql_helper.RemoteEndpoint object>], uri_data_model=False, match_score=1, progress=True, caching=True)

Creates a mapping of matching attributes in the schema by checking for owl:sameAs, owl:equivalentClass, owl:Equivalent and wdt:P1628 links between them.

Parameters
  • df (pd.DataFrame) – Dataframe where matching attributes are supposed to be found.

  • endpoints (list, optional) – SPARQL endpoints to be queried. Defaults to [DBpedia, WikiData].

  • uri_data_model (bool, optional) – If enabled, the URI is directly queried instead of a SPARQL endpoint. Defaults to False.

  • match_score (int, optional) – Score of the match: 0 < match_score <= 1. Defaults to 1.

  • progress (bool, optional) – If True, progress bars will be shown to inform the user about the progress made by the process (if “uri_data_model” = True). Defaults to True.

  • caching (bool, optional) – Turn result-caching for queries issued during the execution on or off. Defaults to True.

Returns

Two columns with matching links and a third column with the score, which is always one in case of the relational matching unless specified otherwise.

Return type

pd.DataFrame

kgextension.schema_matching.string_similarity_matching(df, predicate='rdfs:label', to_lowercase=True, remove_prefixes=True, remove_punctuation=True, similarity_metric='norm_levenshtein', prefix_threshold=1, n=2, progress=True, caching=True)

Calculates the string similarity from the text field obtained by querying the attributes for the predicate, by default rdfs:label.

Parameters
  • df (pd.DataFrame) – Dataframe where matching attributes are supposed to be found.

  • predicate (str, optional) – Predicate whose values are queried and compared. Defaults to “rdfs:label”.

  • to_lowercase (bool, optional) – Converts queried strings to lowercase. Defaults to True.

  • remove_prefixes (bool, optional) – Removes prefixes of queried strings. Defaults to True.

  • remove_punctuation (bool, optional) – Removes punctuation from queried strings. Defaults to True.

  • similarity_metric (str, optional) – Metric by which strings are compared. Defaults to “norm_levenshtein”.

  • prefix_threshold (int, optional) – The number of occurrences after which a prefix is considered “common”. Defaults to 1.

  • n (int, optional) – Parameter for n-gram and Jaccard similarities. Defaults to 2.

  • progress (bool, optional) – If True, progress bars will be shown to inform the user about the progress made by the process. Defaults to True.

  • caching (bool, optional) – Turn result-caching for queries issued during the execution on or off. Defaults to True.

Returns

Two columns with matching links and a third column with the string similarity score.

Return type

pd.DataFrame

kgextension.schema_matching.value_overlap_matching(df, progress=True)

Schema matching method that calculates the similarity of link values.

Parameters
  • df (pd.DataFrame) – The dataframe where matching attributes are supposed to be found.

  • progress (bool, optional) – If True, progress bars will be shown to inform the user about the progress made by the process. Defaults to True.

Returns

Two columns with matching links and a third column with “value_overlap”.

Return type

pd.DataFrame

kgextension.schema_matching_fusion_sklearn module

class kgextension.schema_matching_fusion_sklearn.MatchingFuser(matching_functions, threshold=0.85, method='avg', columns=None, ignore_single_missings=False, weights=None, merge_on=['uri_1', 'uri_2'], boolean_method_single='provenance', boolean_method_multiple='voting', numeric_method_single='average', numeric_method_multiple='average', string_method_single='longest', string_method_multiple='longest', provenance_regex='http://dbpedia.org/', progress=True, caching=True)

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

fit(X, y=None)
transform(X, y=None)

kgextension.schema_matching_helper module

kgextension.schema_matching_helper.calc_string_similarity(uri_1, uri_2, label_dict, metric='norm_levenshtein', n=2)

Calculates the string similarity between two strings based on various metrics. The strings are retrieved from a dictionary provided to the function.

Parameters
  • uri_1 (str) – URI linked to the first string (used as key for the label_dict).

  • uri_2 (str) – URI linked to the second string (used as key for the label_dict).

  • label_dict (dict) – Dictionary mapping the provided URIs (keys) to their respective strings.

  • metric (str/method, optional) – Name of the metric that should be used for the similarity calculation. Defaults to “norm_levenshtein”.

  • n (int, optional) – n-Value for the metrics “ngram” and “jaccard”. Defaults to 2.

Raises

ValueError – Raised if an unknown metric is provided.

Returns

The similarity between the two strings.

Return type

float
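
A normalised Levenshtein similarity, looked up through a label dictionary as described above, might be computed like this. The normalisation (one minus the edit distance divided by the length of the longer string) is an assumption about what “norm_levenshtein” means, and both function names are hypothetical.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def sketch_norm_levenshtein(uri_1, uri_2, label_dict):
    """Similarity of the labels stored for two URIs (illustrative only)."""
    s1, s2 = label_dict[uri_1], label_dict[uri_2]
    longest = max(len(s1), len(s2))
    # Assumption: similarity = 1 - distance / length of the longer string.
    return 1.0 if longest == 0 else 1 - levenshtein(s1, s2) / longest


labels = {"uri:a": "birth place", "uri:b": "birthplace"}
similarity = sketch_norm_levenshtein("uri:a", "uri:b", labels)
```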

kgextension.schema_matching_helper.clean_string(string, common_prefixes, to_lowercase=True, remove_prefixes=True, remove_punctuation=True)

Cleans a passed string by e.g. lowercasing it, stripping common prefixes from it and removing any punctuation.

Parameters
  • string (str) – The string that should be cleaned.

  • common_prefixes (list) – A list containing all (common) prefixes that should be removed.

  • to_lowercase (bool, optional) – Indicates whether or not the string should be transformed to lowercase. Defaults to True.

  • remove_prefixes (bool, optional) – Indicates whether or not the string should be stripped from the specified common prefixes (of type: PREFIX:string). Defaults to True.

  • remove_punctuation (bool, optional) – Indicates whether or not all punctuation should be removed from the string. Defaults to True.

Returns

The cleaned string.

Return type

str
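
The cleaning steps can be sketched with the standard library alone. The order of the steps and the assumption that prefixes are passed with their trailing colon (for example “dbo:”) are choices made for this illustration, not confirmed behaviour of the library.

```python
import string as string_module


def sketch_clean_string(text, common_prefixes, to_lowercase=True,
                        remove_prefixes=True, remove_punctuation=True):
    """Strip a common prefix, lowercase and remove punctuation (illustrative)."""
    if remove_prefixes:
        for prefix in common_prefixes:
            # Prefixes are assumed to include the trailing colon, e.g. "dbo:".
            if text.startswith(prefix):
                text = text[len(prefix):]
                break
    if to_lowercase:
        text = text.lower()
    if remove_punctuation:
        text = text.translate(str.maketrans("", "", string_module.punctuation))
    return text


cleaned = sketch_clean_string("dbo:Birth-Place", ["dbo:"])
```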

kgextension.schema_matching_helper.get_common_prefixes(df, threshold, column_name='o')

Finds common string prefixes (of type PREFIX:string) in a column of a specified DataFrame. Creates a list of all prefixes that appear more often than the specified threshold.

Parameters
  • df (pd.DataFrame) – The DataFrame containing the data.

  • threshold (int) – The threshold to filter uncommon prefixes.

  • column_name (str, optional) – Column name of the column containing the relevant strings. Defaults to “o”.

Returns

A list of all prefixes (of type PREFIX:string) that appear more often than the specified threshold.

Return type

list
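
A minimal sketch of the counting logic, assuming the column has already been extracted as a plain list of strings (a stand-in for df[column_name]) and that a prefix is everything up to and including the first colon. The function name is hypothetical.

```python
from collections import Counter


def common_prefixes_sketch(values, threshold):
    """Return prefixes (e.g. "dbo:") that occur more often than `threshold`."""
    counts = Counter(
        value.split(":", 1)[0] + ":"
        for value in values
        if isinstance(value, str) and ":" in value  # skip missing / prefix-free entries
    )
    return [prefix for prefix, count in counts.items() if count > threshold]


column = ["dbo:a", "dbo:b", "dbo:c", "foaf:x", "plain", None]
prefixes = common_prefixes_sketch(column, threshold=2)
```

Note that this naive split would also treat the scheme of a full URI ("http:") as a prefix, so a real implementation presumably needs a stricter pattern.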

kgextension.schema_matching_helper.get_value_overlap(df, col_name_dict, uri_1, uri_2)

Calculates the ratio of overlapping values of two columns of a DataFrame, using row-wise comparison.

Parameters
  • df (pd.DataFrame) – The DataFrame containing the rows that should be compared (with column names reduced to the URIs).

  • col_name_dict (dict) – Dictionary mapping the cleaned column names from the DataFrame to the full column names.

  • uri_1 (str) – Column name of the first column (just the URI).

  • uri_2 (str) – Column name of the second column (just the URI).

Returns

Ratio of overlapping values in the two columns.

Return type

float
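
One plausible reading of "row-wise comparison" is sketched below: two columns are compared position by position, and the ratio is the share of rows with equal values. Plain lists stand in for the DataFrame columns, and skipping rows with missing values is an assumption of this example, not a documented behaviour of the package.

```python
def value_overlap_sketch(col_1, col_2):
    """Share of rows in which both columns hold the same (non-missing) value."""
    pairs = [(a, b) for a, b in zip(col_1, col_2)
             if a is not None and b is not None]  # assumption: ignore missing values
    if not pairs:
        return 0.0
    return sum(a == b for a, b in pairs) / len(pairs)


overlap = value_overlap_sketch([1, 2, 3, None], [1, 5, 3, 4])
```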

kgextension.sparql_helper module

class kgextension.sparql_helper.Endpoint

Bases: object

Base Endpoint class.

class kgextension.sparql_helper.LocalEndpoint(file_path, file_format='auto')

Bases: kgextension.sparql_helper.Endpoint

LocalEndpoint class, that handles access to local RDF files.

close()

Closing the LocalEndpoint, i.e. releasing the data from memory.

initialize()

Initializing the LocalEndpoint, i.e. loading the data into memory.

query(query)

Function to issue a query against a LocalEndpoint.

Parameters

query (str) – SPARQL query.

Returns

The query results as DataFrame.

Return type

pd.DataFrame
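
The three methods suggest a load, query, release lifecycle. The sketch below wraps that lifecycle in a hypothetical helper; it assumes that initialize() has to be called explicitly before query(), which the documentation above does not state.

```python
def run_local_query(file_path, sparql_query):
    """Load a local RDF file, run one SPARQL query, and release the data again."""
    # Imported lazily so the sketch can be defined without kgextension installed.
    from kgextension.sparql_helper import LocalEndpoint

    endpoint = LocalEndpoint(file_path)  # file_format defaults to "auto"
    endpoint.initialize()                # assumption: required before query()
    try:
        return endpoint.query(sparql_query)  # documented to return a pd.DataFrame
    finally:
        endpoint.close()                 # release the data even if the query fails
```

A call such as run_local_query("data.ttl", "SELECT * WHERE { ?s ?p ?o } LIMIT 5") would then return the first five triples, where "data.ttl" is a placeholder path.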

class kgextension.sparql_helper.RemoteEndpoint(url, timeout=60, requests_per_min=100000, retries=10, page_size=0, supports_bundled_mode=True, persistence_file_path='rate_limits.db', agent='sparqlwrapper 1.8.5 (rdflib.github.io/sparqlwrapper)')

Bases: kgextension.sparql_helper.Endpoint

RemoteEndpoint class, that handles remote SPARQL endpoints.

kgextension.sparql_helper.endpoint_wrapper(query: str, endpoint: kgextension.sparql_helper.Endpoint, request_return_format='XML', verbose=False, return_XML=False, prefix_lookup=False, caching=True)

Wrapper function for querying remote SPARQL endpoints and local RDF files.

Parameters
  • query (str) – Query that should be sent to the SPARQL endpoint.

  • endpoint (Endpoint) – The SPARQL endpoint (Endpoint object) that should be queried.

  • request_return_format (str, optional) – Requesting a specific return format from the SPARQL endpoint. Defaults to “XML”.

  • verbose (bool, optional) – Set to True to let the function print additional information about the returned data - for debugging and testing. Defaults to False.

  • return_XML (bool, optional) – If True, the XML results are returned instead of a DataFrame. Defaults to False.

  • prefix_lookup (bool/str/dict, optional) – True: Namespaces of prefixes will be looked up at prefix.cc and added to the SPARQL query. str: User provides the path to a JSON file with prefixes and namespaces. dict: User provides a dictionary with prefixes and namespaces. Defaults to False.

  • caching (bool, optional) – Turn result caching on or off. Defaults to True.

Returns

The query results in form of a DataFrame.

Return type

pd.DataFrame
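
To illustrate what the dict form of prefix_lookup amounts to, the following sketch prepends PREFIX declarations to a query. The function name is hypothetical, and adding every entry of the dictionary (rather than only the prefixes actually used in the query) is an assumption of this example.

```python
def add_prefixes_sketch(query, prefixes):
    """Prepend one SPARQL PREFIX declaration per dictionary entry."""
    header = "\n".join(
        f"PREFIX {prefix}: <{namespace}>" for prefix, namespace in prefixes.items()
    )
    return f"{header}\n{query}" if header else query


prefixed_query = add_prefixes_sketch(
    "SELECT ?o WHERE { dbr:Berlin dbo:country ?o }",
    {"dbo": "http://dbpedia.org/ontology/", "dbr": "http://dbpedia.org/resource/"},
)
```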

kgextension.sparql_helper.endpoint_wrapper_logic(query, endpoint, request_return_format, verbose, return_XML)

This is a helper function for “endpoint_wrapper”, separated out for caching purposes. Not intended for end-user usage. #TODO: Find a cleaner solution?

kgextension.sparql_helper.regex_string_generator(attribute, filters, logical_connective='OR')

#TODO

Parameters
  • attribute ([type]) – [description]

  • filters ([type]) – [description]

  • logical_connective (str, optional) – [description]. Defaults to “OR”.

Raises

ValueError – [description]

Returns

[description]

Return type

[type]

kgextension.sparql_helper_helper module

kgextension.sparql_helper_helper.get_initial_query_limit(query: str)

Returns the LIMIT within a SPARQL query string.

Parameters

query (str) – SPARQL query string.

Returns

Limit or 0 if no limit.

Return type

int

kgextension.sparql_helper_helper.get_initial_query_offset(query: str)

Returns the OFFSET within a SPARQL query string.

Parameters

query (str) – SPARQL query string.

Returns

Offset or 0 if no offset.

Return type

int
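
Both get_initial_query_limit and get_initial_query_offset can be approximated with a single regular expression. The sketch below is a simplified stand-in, not the package's implementation: it returns the first LIMIT or OFFSET value found and does not distinguish subqueries, comments, or string literals.

```python
import re


def _first_int_after(query, keyword):
    """Return the integer following `keyword` in `query`, or 0 if absent."""
    match = re.search(rf"\b{keyword}\s+(\d+)", query, flags=re.IGNORECASE)
    return int(match.group(1)) if match else 0


def query_limit_sketch(query):
    return _first_int_after(query, "LIMIT")


def query_offset_sketch(query):
    return _first_int_after(query, "OFFSET")


example_query = "SELECT * WHERE { ?s ?p ?o } LIMIT 10 OFFSET 20"
limit, offset = query_limit_sketch(example_query), query_offset_sketch(example_query)
```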

kgextension.uri_helper module

kgextension.uri_helper.query_uri(uri, query_string, return_formats={'wikidata.org': 'n3'}, verbose=True, caching=True)

Function that queries a given dereferenceable URI with a given SPARQL query, without the need for a SPARQL endpoint.

Parameters
  • uri (str) – Dereferencable URI.

  • query_string (str) – SPARQL query (the URI should already be inserted via the values statement).

  • return_formats (dict, optional) – Used to set specific return formats for data sources (if the default “application/rdf+xml” is not supported). For supported formats see: https://rdflib.readthedocs.io/en/stable/plugin_parsers.html. Defaults to {“wikidata.org”: “n3”}.

  • verbose (bool, optional) – Turn on/off warnings for likely malformed URIs. Defaults to True.

  • caching (bool, optional) – Turn result caching on or off. Defaults to True.

Returns

Result of the SPARQL query issued against the URI. If the provided URI is NULL, then an empty DataFrame is returned.

Return type

pd.DataFrame

kgextension.uri_helper.query_uri_logic(uri, query_string, return_format)

Parsing & querying logic of the “query_uri” function. Detached from the main function for caching purposes.

Parameters
  • uri (str) – Dereferencable URI.

  • query_string (str) – SPARQL query (the URI should already be inserted via the values statement).

  • return_format (dict) – Used to set specific return formats for data sources (if the default “application/rdf+xml” is not supported). For supported formats see: https://rdflib.readthedocs.io/en/stable/plugin_parsers.html.

Returns

Result of the SPARQL query issued against the URI.

Return type

pd.DataFrame

kgextension.uri_helper.uri_querier(df, column, query, regex_filter=None, return_formats={'wikidata.org': 'n3'}, verbose=True, caching=True, prefix_lookup=False, progress=True)

Wrapper function for the query_uri function. Queries each URI in a specified column of a DataFrame with a user-provided query and returns the results as one joint DataFrame.

Parameters
  • df (pd.DataFrame) – DataFrame that contains the URIs that should be queried.

  • column (str) – Column in the specified DataFrame that contains the URIs that should be queried.

  • query (str) – The SPARQL query that’s used for querying the URIs. Has to contain a single placeholder (URI) in the VALUES statement. Example: “SELECT ?value ?p ?o WHERE {VALUES (?value) { (<URI>)} ?value ?p ?o }”

  • regex_filter (str, optional) – If set, just URIs matching the specified RegEx are queried. Defaults to None.

  • return_formats (dict, optional) – Used to set specific return formats for data sources (if the default “application/rdf+xml” is not supported). For supported formats see: https://rdflib.readthedocs.io/en/stable/plugin_parsers.html. Defaults to {“wikidata.org”: “n3”}.

  • verbose (bool, optional) – Turn on/off warnings for likely malformed URIs. Defaults to True.

  • caching (bool, optional) – Turn result caching on or off. Defaults to True.

  • prefix_lookup (bool/str/dict, optional) – True: Namespaces of prefixes will be looked up at prefix.cc and added to the SPARQL query. str: User provides the path to a JSON file with prefixes and namespaces. dict: User provides a dictionary with prefixes and namespaces. Defaults to False.

  • progress (bool, optional) – If True, progress bars will be shown to inform the user about the progress made by the process. Defaults to True.

Returns

Joint DataFrame that contains the query-results of all URIs.

Return type

pd.DataFrame
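
The per-URI query generation can be pictured as a template substitution combined with the optional RegEx filter. The sketch below is illustrative only: the function name is hypothetical, a plain list stands in for the DataFrame column, and it assumes that the literal text <URI> in the template is replaced by the URI wrapped in angle brackets.

```python
import re

QUERY_TEMPLATE = "SELECT ?value ?p ?o WHERE {VALUES (?value) { (<URI>)} ?value ?p ?o }"


def build_uri_queries(uris, query_template, regex_filter=None):
    """Map each URI that passes the filter to its filled-in SPARQL query."""
    queries = {}
    for uri in uris:
        if uri is None:
            continue  # missing URIs are not queried
        if regex_filter is not None and not re.search(regex_filter, uri):
            continue  # only URIs matching the RegEx are queried
        queries[uri] = query_template.replace("<URI>", f"<{uri}>")
    return queries


uris = ["http://dbpedia.org/resource/Berlin", "http://example.org/x", None]
queries = build_uri_queries(uris, QUERY_TEMPLATE, regex_filter="dbpedia")
```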

kgextension.utilities module

kgextension.utilities.check_uri_redirects(df, column, replace=True, custom_name_postfix=None, redirection_property='http://dbpedia.org/ontology/wikiPageRedirects', endpoint=<kgextension.sparql_helper.RemoteEndpoint object>, regex_filter='dbpedia', bundled_mode=True, uri_data_model=False, progress=True, caching=True)

Takes a column of URIs from a DataFrame and checks for each if it has a redirection set by the endpoint. If this is the case, the URI it redirects to is either added in a new column or replaces the original URI.

Parameters
  • df (pd.DataFrame) – Dataframe for which the URIs should be inspected.

  • column (str) – Name of the column that contains the URIs that should be checked.

  • replace (bool, optional) – If True: URIs that get redirected will be replaced with the new URI; If False: A new column, containing the result for each URI, is added to the DataFrame. Defaults to True.

  • custom_name_postfix (str, optional) – Custom postfix for the newly created column (in case “replace” is set to False). Defaults to None.

  • redirection_property (str, optional) – Relation/Property URI that signals a redirect for this endpoint. Defaults to “http://dbpedia.org/ontology/wikiPageRedirects”.

  • endpoint (Endpoint, optional) – SPARQL Endpoint to be queried; ignored when “uri_data_model” = True. Defaults to DBpedia.

  • regex_filter (str, optional) – Just URIs matching the specified RegEx are checked for redirects. Defaults to “dbpedia”.

  • bundled_mode (bool, optional) – If True, all necessary queries are bundled into one query (using the VALUES method), which requires a SPARQL 1.1 implementation; ignored when “uri_data_model” = True. Defaults to True.

  • uri_data_model (bool, optional) – If enabled, the URI is directly queried instead of a SPARQL endpoint. Defaults to False.

  • progress (bool, optional) – If True, progress bars will be shown to inform the user about the progress made by the process (if “uri_data_model” = True). Defaults to True.

  • caching (bool, optional) – Turn result-caching for queries issued during the execution on or off. Defaults to True.

Raises

ValueError – Raised if ‘custom_name_postfix’ is set to “” instead of None.

Returns

Returns dataframe with cleaned links / a new column.

Return type

pd.DataFrame
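
The bundled mode relies on the SPARQL 1.1 VALUES clause to check many URIs in a single request. A minimal sketch of how such a query could be assembled is shown below; the function name, variable names, and query shape are assumptions for illustration and are not taken from the package.

```python
def build_bundled_redirect_query(
    uris,
    redirection_property="http://dbpedia.org/ontology/wikiPageRedirects",
):
    """Build one SPARQL 1.1 query that looks up redirects for all given URIs."""
    values = " ".join(f"(<{uri}>)" for uri in uris)
    return (
        "SELECT ?value ?redirect WHERE { "
        f"VALUES (?value) {{ {values} }} "
        f"?value <{redirection_property}> ?redirect }}"
    )


bundled_query = build_bundled_redirect_query(
    ["http://dbpedia.org/resource/A", "http://dbpedia.org/resource/B"]
)
```

With bundled_mode disabled, the equivalent would presumably be one such query per URI, which is slower but also works on endpoints without VALUES support.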

Takes a column of URLs / URIs from a DataFrame and checks for each if it is resolvable. If not it’s either replaced with some user-specified entry or a flag is added to a newly generated column.

Parameters
  • df (pd.DataFrame) – Dataframe for which the links should be inspected.

  • columns (list) – List containing the names of the columns in the DataFrame, that contain the links.

  • purge (bool, optional) – If True: Links that are not resolvable will be replaced with “fill_with”; If False: A new column, containing the result for each link in boolean format, is added to the DataFrame. Defaults to True.

  • custom_name_postfix (str, optional) – Custom postfix for the newly created column (in case “purge” is set to False). Defaults to None.

  • fill_with (flexible, optional) – Specifies what not resolvable links should be replaced with (in case “purge” is set to True). Defaults to np.NaN.

  • caching (bool, optional) – Turn result caching on or off. Defaults to True.

  • progress (bool, optional) – If True, progress bars will be shown to inform the user about the progress made by the process. Defaults to True.

Raises

ValueError – Raised if ‘custom_name_postfix’ is set to “” instead of None.

Returns

Returns dataframe with cleaned links / a new column.

Return type

pd.DataFrame

kgextension.utilities_helper module

kgextension.utilities_helper.is_valid_url(url)

Checks if a URL is in proper format.

Parameters

url (str) – The URL that should be checked.

Returns

Result of the validity check in boolean form.

Return type

bool
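
A format check of this kind can be sketched with the standard library alone. The version below is an assumption about what "proper format" means (an http or https scheme plus a network location) and is not the package's own validation rule.

```python
from urllib.parse import urlparse


def is_valid_url_sketch(url):
    """Return True if `url` looks like an absolute http(s) URL."""
    if not isinstance(url, str):
        return False
    try:
        parsed = urlparse(url)
    except ValueError:  # raised for some malformed inputs, e.g. broken IPv6 hosts
        return False
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


checks = [
    is_valid_url_sketch("http://dbpedia.org/resource/Berlin"),
    is_valid_url_sketch("dbpedia.org/resource/Berlin"),
    is_valid_url_sketch(None),
]
```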

kgextension.utilities_helper.url_exists(url)

Checks if a URL is resolvable / existing.

Parameters

url (str) – The URL that should be checked.

Returns

Result of the resolvability check in boolean form.

Return type

bool
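
A resolvability check typically issues a lightweight request and treats any error as "not resolvable". The sketch below uses a HEAD request from the standard library; the function name, the timeout value, and the choice of HEAD are assumptions of this example, and some servers reject HEAD requests even for existing resources.

```python
import http.client
import urllib.request


def url_exists_sketch(url, timeout=5.0):
    """Return True if a HEAD request to `url` succeeds, False otherwise."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (ValueError, OSError, http.client.HTTPException):
        # ValueError: malformed URL; OSError: URLError, HTTPError (4xx/5xx), timeouts
        return False
```

Calling url_exists_sketch("http://dbpedia.org/resource/Berlin") requires network access, so its result depends on the environment in which it is run.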

kgextension.utilities_sklearn module

class kgextension.utilities_sklearn.CheckUriRedirects(column, replace=True, custom_name_postfix=None, redirection_property='http://dbpedia.org/ontology/wikiPageRedirects', endpoint=<kgextension.sparql_helper.RemoteEndpoint object>, regex_filter='dbpedia', bundled_mode=True, uri_data_model=False, progress=True, caching=True)

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

fit(X, y=None)
transform(X, y=None)

Module contents