kgextension package
Submodules
kgextension.caching_helper module
- kgextension.caching_helper.clear_cache()
Function that clears the cache of all cached methods when it’s called.
- kgextension.caching_helper.freeze_unhashable(freeze_by='argument', freeze_argument=None, freeze_index=None)
Wrapper function to “freeze” an unhashable function argument (dictionary or pandas Series) into a hashable OrderedDict. Used for functions that need to be cached but take these types of arguments as inputs.
- Parameters
freeze_by (str, optional) – Used to indicate whether the argument that needs to be frozen is selected via its argument name (“argument”) or its index (“index”). Defaults to “argument”.
freeze_argument (str, optional) – Name of the argument that should be frozen. Used if freeze_by = “argument”. Defaults to None.
freeze_index (int, optional) – Index of the argument that should be frozen. Used if freeze_by = “index”. Defaults to None.
- kgextension.caching_helper.show_cache_info()
Function that gives the user an overview over the status of all cached methods.
- kgextension.caching_helper.unfreeze_unhashable(frozen_argument, frozen_type='series')
Function to “unfreeze” unhashable arguments “frozen” by the freeze_unhashable function.
- Parameters
frozen_argument (tuple/OrderedDict) – The frozen argument. Pandas Series as tuple and dictionaries as OrderedDict.
frozen_type (str, optional) – Indicates whether the frozen argument is a pandas Series (“series”) or a dictionary (“dict”). Defaults to “series”.
- Returns
The content of the OrderedDict in its original format.
- Return type
pd.Series/dict
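The freeze/unfreeze idea can be illustrated with a minimal, standard-library-only sketch. It is not the library's implementation: the names freeze_dict_argument, unfreeze and build_header are hypothetical, and the frozen form used here is a sorted tuple of items rather than the representation described above.

```python
from collections import OrderedDict
from functools import lru_cache, wraps


def freeze_dict_argument(name):
    """Hypothetical decorator: turn the dict passed as keyword `name` into a hashable tuple."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            value = kwargs.get(name)
            # Only keyword arguments are handled in this sketch.
            if isinstance(value, dict):
                kwargs[name] = tuple(sorted(value.items()))
            return func(*args, **kwargs)
        return wrapper
    return decorator


def unfreeze(frozen):
    """Rebuild an ordered mapping from the frozen tuple of (key, value) pairs."""
    return OrderedDict(frozen)


calls = []  # records every real (uncached) execution


@freeze_dict_argument("prefixes")
@lru_cache(maxsize=None)
def build_header(query, prefixes=None):
    calls.append(query)
    mapping = unfreeze(prefixes) if prefixes is not None else {}
    lines = [f"PREFIX {prefix}: <{namespace}>" for prefix, namespace in mapping.items()]
    return "\n".join(lines + [query])


first = build_header("SELECT * WHERE { ?s ?p ?o }", prefixes={"dbo": "http://dbpedia.org/ontology/"})
second = build_header("SELECT * WHERE { ?s ?p ?o }", prefixes={"dbo": "http://dbpedia.org/ontology/"})
print(len(calls))  # 1: the second call is served from the cache
```

The freezing wrapper sits outside the cache, so the cached function only ever sees hashable arguments.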
kgextension.endpoints module
- kgextension.endpoints.DBpedia = <kgextension.sparql_helper.RemoteEndpoint object>
Predefined SPARQL endpoint for DBpedia.
Settings:
ResultSetMaxRows = 10000; MaxQueryExecutionTime = 120 (seconds); MaxQueryCostEstimationTime = 1500 (seconds); Connection limit = 50 (parallel connections per IP address); maximum request rate = 100 (requests per second per IP address, with an initial burst of 120 requests)
NOTE: Queries which time out will return PARTIAL results in a best effort fashion, and will NOT return an error.
- kgextension.endpoints.EUOpenData = <kgextension.sparql_helper.RemoteEndpoint object>
Predefined SPARQL endpoint for the EU Open Data Portal (EU ODP).
No usage policy was found for this endpoint.
- kgextension.endpoints.WikiData = <kgextension.sparql_helper.RemoteEndpoint object>
Predefined SPARQL endpoint for WikiData.
NOTE: A user-specific user agent header is needed (https://meta.wikimedia.org/wiki/User-Agent_policy) -> Use “agent” argument!
There is a hard query deadline, which is set to 60 seconds. The following limits also apply:
One client (user agent + IP) is allowed 60 seconds of processing time every 60 seconds. One client is allowed 30 error queries per minute.
Clients exceeding the limits above are throttled with HTTP code 429. Use Retry-After header to see when the request can be repeated. If the client ignores 429 responses and continues to produce requests over the limits, it can be temporarily banned from the service. Clients who don’t comply with the User-Agent policy may be blocked completely – make sure to send a good User-Agent header.
Every query will time out when it takes more time to execute than this configured deadline. You may want to optimize the query or report it as problematic.
Also note that currently access to the service is limited to 5 parallel queries per IP. The above limits are subject to change depending on resources and usage patterns.
Source: https://www.mediawiki.org/wiki/Wikidata_Query_Service/User_Manual#Query_limits
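A client that respects the throttling described above could derive its waiting time from the Retry-After header. The following is a minimal sketch under stated assumptions: seconds_until_retry is a hypothetical helper that is not part of kgextension, the headers are given as a plain dict with the exact key "Retry-After", and the 60 second fallback is an arbitrary choice.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def seconds_until_retry(status_code, headers, now=None, default=60.0):
    """Return how long to wait before repeating a request (hypothetical helper)."""
    if status_code != 429:
        return 0.0  # not throttled, no waiting needed
    value = headers.get("Retry-After")
    if value is None:
        return default  # assumption: fall back to a fixed pause
    value = value.strip()
    if value.isdigit():
        return float(value)  # header given as a number of seconds
    # Otherwise the header is expected to be an HTTP date.
    now = now or datetime.now(timezone.utc)
    retry_at = parsedate_to_datetime(value)
    return max(0.0, (retry_at - now).total_seconds())


print(seconds_until_retry(429, {"Retry-After": "30"}))  # 30.0
```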
kgextension.feature_selection module
- kgextension.feature_selection.greedy_top_down_filter(df, label_column, column_prefix='new_link_type_', G=None, progress=True)
Hierarchical feature selection based on the Greedy Top Down algorithm.
Lu, S., Ye, Y., Tsui, R., Su, H., Rexit, R., Wesaratchakit, S., Liu, X. and Hwa, R., 2013, October. Domain ontology-based feature reduction for high dimensional drug data and its application to 30-day heart failure readmission prediction. In 9th IEEE International Conference on Collaborative Computing: Networking, Applications and Worksharing (pp. 478-484). IEEE.
- Parameters
df (pd.DataFrame) – DataFrame that contains the label as well as the features generated (by a generator).
label_column (str) – Name of the label column.
column_prefix (str) – Prefix of the columns generated by the generator (e.g. “new_link_type_”). Defaults to “new_link_type_”.
G (nx.DirectedGraph, optional) – Graph that contains the hierarchy. If None, the hierarchy attached to the provided df will be used. Defaults to None.
progress (bool, optional) – If True, progress bars will be shown to inform the user about the progress made by the process. Defaults to True.
- Raises
TypeError – Raised if the graph provided is not a directed acyclic graph (DAG).
- Returns
DataFrame reduced to columns determined by the GTD algorithm as well as columns in the original df that are not created by a generator (that don’t start with column_prefix).
- Return type
pd.DataFrame
- kgextension.feature_selection.hierarchy_based_filter(df, label_column, G=None, threshold=0.99, metric='info_gain', pruning=True, all_remove=True, progress=True, **kwargs)
Feature selection approach, namely, SHSEL including the initial selection algorithm and pruning algorithm. Identify and filter out the ranges of nodes with similar relevance in each branch of the hierarchy.
Ristoski, P. and Paulheim, H., 2014, October. Feature selection in hierarchical feature spaces. In International conference on discovery science (pp. 288-300). Springer, Cham.
- Parameters
df (pd.DataFrame) – Dataframe containing the original features and the class column.
label_column (str) – Name of the output/class column.
G (nx.DirectedGraph, optional) – The directed graph of all classes and superclasses can be specified here; if None the function looks for the graph in the pd.DataFrame.attrs.hierarchy attribute of the input dataframe. Defaults to None.
threshold (float, optional) – A relevance similarity threshold which is set by the user, recommended to be 0.99. Defaults to 0.99.
metric (str/func, optional) – The relevance similarity metric, either information gain (“info_gain”) or correlation (“correlation”). A custom metric function can be passed instead. Defaults to “info_gain”.
pruning (bool, optional) – Whether to apply the pruning algorithm. If True, only the most valuable features are selected from the previously reduced set, i.e. those whose Information Gain is greater than the average. Defaults to True.
all_remove (bool, optional) – Only applies when pruning is True. Whether to strictly remove all the nodes once one of their info gain values is smaller than the average info gain of the paths. Defaults to True.
progress (bool, optional) – If True, progress bars will be shown to inform the user about the progress made by the process. Defaults to True.
- Returns
Filtered Dataframe containing the selected attributes.
- Return type
pd.DataFrame
- kgextension.feature_selection.hill_climbing_filter(df, label_column, metric='hill_climbing_cost_function', G=None, beta=0.05, k=5, progress=True, **kwargs)
Feature selection performed by comparing nodes with their parents in a bottom-up approach.
Wang, B.B., Mckay, R.B., Abbass, H.A. and Barlow, M., 2003, February. A comparative study for domain ontology guided feature extraction. In Proceedings of the 26th Australasian computer science conference-Volume 16 (pp. 69-78).
- Parameters
df (pd.DataFrame) – Dataframe containing the original features and the class column.
label_column (str) – Name of the output/class column.
metric (str/func, optional) – Cost function to determine value of feature set. Higher values indicate a better feature set. Should take at least df and class_col(pd.Series of class column) as input and output a single numeric value. Defaults to ‘hill_climbing_cost_function’.
G (nx.DirectedGraph, optional) – The directed graph of all classes and superclasses can be specified here; if None the function looks for the graph in the pd.DataFrame.attrs.hierarchy attribute of the input dataframe. Defaults to None.
beta (float, optional) – Regularization parameter of cost function. Defaults to 0.05.
k (int, optional) – Number of nearest neighbors for cost function. Defaults to 5.
progress (bool, optional) – If True, progress updates will be shown to inform the user about the progress made by the process. Defaults to True.
- Returns
dataframe with filtered classes
- Return type
pd.DataFrame
- kgextension.feature_selection.tree_based_filter(df, label_column, G=None, metric='Lift', progress=True)
Filter attributes with Tree-Based Feature Selection (TSEL). TSEL selects the most valuable attributes from each path in the hierarchy, based on lift or information gain.
Jeong, Y. and Myaeng, S.H., 2013, October. Feature selection using a semantic hierarchy for event recognition and type classification. In Proceedings of the Sixth International Joint Conference on Natural Language Processing (pp. 136-144).
- Parameters
df (pd.DataFrame) – Dataframe with hierarchy (output of generator)
label_column (str) – Name of the column with the class/label
G (nx.DirectedGraph, optional) – The directed graph of all classes and superclasses can be specified here; if None the function looks for the graph in the pd.DataFrame.attrs.hierarchy attribute of the input dataframe. Defaults to None.
metric (str/func, optional) – Metric which is used to determine the representative features (IG/Lift). Defaults to ‘Lift’.
progress (bool, optional) – If True, progress updates will be shown to inform the user about the progress made by the process. Defaults to True.
- Returns
Filtered Dataframe containing the selected attributes.
- Return type
pd.DataFrame
kgextension.feature_selection_helper module
- kgextension.feature_selection_helper.add_hierarchy_columns(df, G, keep_prefix=False)
Given a feature dataframe and corresponding hierarchy graph, add all the higher-level features to the dataframe with correct boolean values.
- Parameters
df (pd.DataFrame) – Dataframe with all the lowest-level children features.
G (nx.DiGraph) – Directed feature hierarchy graph, direction from children to parents.
keep_prefix (bool, optional) – Whether to keep the prefixes of the original children. Defaults to False.
- Returns
Dataframe with all higher hierarchy features appended.
- Return type
pd.DataFrame
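As an illustration of what adding higher-level features can mean, the sketch below works on plain dictionaries instead of a DataFrame. It assumes that a parent feature is True whenever at least one of its descendants is True; this rule, the helper names and the data layout are assumptions and not taken from the library.

```python
def ancestors(node, parents_of):
    """Collect all ancestors of a node by following child -> parent edges."""
    found, stack = set(), [node]
    while stack:
        for parent in parents_of.get(stack.pop(), []):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def add_parent_features(rows, parents_of):
    """Hypothetical helper: extend each row with a boolean value for every ancestor feature."""
    result = []
    for row in rows:
        extended = dict(row)
        for feature, present in row.items():
            for ancestor in ancestors(feature, parents_of):
                # Assumed rule: an ancestor is True if any of its descendants is True.
                extended[ancestor] = extended.get(ancestor, False) or bool(present)
        result.append(extended)
    return result


hierarchy = {"Dog": ["Mammal"], "Mammal": ["Animal"], "Car": ["Vehicle"]}
rows = [{"Dog": True, "Car": False}, {"Dog": False, "Car": False}]
extended_rows = add_parent_features(rows, hierarchy)
```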
- kgextension.feature_selection_helper.calc_average_ig(path_nodes, node_values)
Helper function for the SHSEL filter algorithm. It returns the average Information Gain value of one existing path in the pruning function.
- Parameters
path_nodes (list) – Node in path whose node_availability is True.
node_values (dict) – Dictionary about every node in the directed graph and its information gain value.
- Returns
The average InfoGain value of one existing path.
- Return type
float
- kgextension.feature_selection_helper.calc_gr(df, label_column, progress=True)
Calculates the Gain Ratio for each column of a df in relation to a specified label_column.
- Parameters
df (pd.DataFrame) – Dataframe the Gain Ratio values need to be calculated for.
label_column (str) – Name of the label_column.
progress (bool, optional) – If True, progress bars will be shown to inform the user about the progress made by the process. Defaults to True.
- Raises
RuntimeWarning – Is raised if the gain ratio calculation fails for a column (and returns a nan).
- Returns
Dictionary with the column names as keys and the corresponding Gain Ratio values as values.
- Return type
dict
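For readers unfamiliar with the measure, the textbook definition of Gain Ratio for a discrete feature (information gain divided by the entropy of the feature itself) can be sketched as follows. This is a generic illustration with hypothetical function names, not the code used by calc_gr; returning nan for a constant feature is an assumption that mirrors the failure case mentioned above.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    total = len(values)
    return -sum((count / total) * log2(count / total) for count in Counter(values).values())


def gain_ratio(feature, labels):
    """Information gain of `feature` with respect to `labels`, divided by the split information."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(group) / total * entropy(group) for group in groups.values())
    info_gain = entropy(labels) - conditional
    split_info = entropy(feature)
    if split_info == 0:
        return float("nan")  # constant feature: the ratio is undefined
    return info_gain / split_info


labels = ["a", "a", "b", "b"]
print(gain_ratio([1, 1, 0, 0], labels))  # 1.0, the feature separates the classes perfectly
print(gain_ratio([1, 0, 1, 0], labels))  # 0.0, the feature carries no information
```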
- kgextension.feature_selection_helper.calculate_lift(df, G, label_column)
Helper function for TSEL filter. Calculates the lift value for every node in a given graph.
- Parameters
df (pd.DataFrame) – Dataframe the lift needs to be calculated for.
G (nx.DirectedGraph) – Directed graph for the dataframe.
label_column (str) – Name of the column with the class/label.
- Returns
Dictionary containing column names as keys and lift as value.
- Return type
dictionary
- kgextension.feature_selection_helper.exist_unchecked_leafs(G)
Helper function for hierarchical hill climbing. The function determines whether any of the leaf nodes of the graph have the attribute checked set to False. It returns the number of leaves for which this is the case.
- Parameters
G (nx.DirectedGraph) – The directed graph to be checked.
- Returns
Number of unchecked leafs.
- Return type
int
- kgextension.feature_selection_helper.find_shortest_paths(G, root='VRN', progress=True)
Finds the shortest path between the (virtual) root node of a graph and each leaf of the graph.
- Parameters
G (nx.DirectedGraph) – Directed Graph.
root (str, optional) – Name of the root node. Defaults to “VRN”.
progress (bool, optional) – If True, progress bars will be shown to inform the user about the progress made by the process. Defaults to True.
- Returns
List of shortest paths.
- Return type
list
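The idea of collecting one shortest path per leaf can be sketched with a breadth-first search. The example below is a standalone illustration on an adjacency dictionary; it does not use networkx, the function name is hypothetical, and it assumes edges pointing from the root towards the leaves.

```python
from collections import deque


def shortest_paths_to_leaves(children, root="VRN"):
    """Return one shortest root-to-leaf path per reachable leaf (breadth-first search)."""
    parent = {root: None}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in parent:  # first visit is via a shortest path
                parent[child] = node
                queue.append(child)
    paths = []
    for node in parent:
        if not children.get(node):  # a leaf has no outgoing edges
            path = []
            current = node
            while current is not None:
                path.append(current)
                current = parent[current]
            paths.append(path[::-1])
    return paths


graph = {"VRN": ["A", "B"], "A": ["C"], "B": ["C", "D"]}
leaf_paths = shortest_paths_to_leaves(graph)
```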
- kgextension.feature_selection_helper.get_all_paths(G, root)
Helper function for TSEL filter. Returns all possible paths in a given graph.
- Parameters
G (nx.DirectedGraph) – Directed graph.
root (str) – Name of the root node.
- Returns
List containing all paths in the graph.
- Return type
list
- kgextension.feature_selection_helper.get_max_node(candidates, gr_values, column_prefix='')
Given a set of candidate nodes, returns the one with the highest Gain Ratio.
- Parameters
candidates (list) – List of candidate nodes.
gr_values (dict) – Dictionary with column names as keys and the corresponding Gain Ratio values as values.
column_prefix (str) – Prefix of the columns generated by the generator (e.g. “new_link_type_”). Defaults to “”.
- Returns
Name of the node with the highest Gain Ratio in the candidate set.
- Return type
str
- kgextension.feature_selection_helper.gtd_logic(df, G, label_column, column_prefix, progress=True)
Greedy Top Down algorithm to select most relevant nodes in a Graph based on Gain Ratio.
- Parameters
df (pd.DataFrame) – DataFrame.
G (nx.DirectedGraph) – Directed Graph containing the hierarchy.
label_column (str) – Name of the label column.
column_prefix (str) – Prefix of the columns generated by the generator (e.g. “new_link_type_”).
progress (bool, optional) – If True, progress bars will be shown to inform the user about the progress made by the process. Defaults to True.
- Returns
Set of nodes (as strings) that are deemed most relevant by the algorithm.
- Return type
set
- kgextension.feature_selection_helper.hill_climbing_cost_function(df, class_col, alpha, beta, k)
Calculates the regularized purity for the hierarchical hill climbing algorithm using Nearest Neighbors.
- Parameters
df (pd.DataFrame) – Dataframe with the feature selection to be evaluated.
class_col (pd.Series) – The column with the class/output values.
alpha (float) – Size of original feature space.
beta (float) – Regularization parameter.
k (int) – Number of nearest neighbors.
- Returns
Cost value for this set of features.
- Return type
float
- kgextension.feature_selection_helper.prune(df_filtered, G, node_values, node_availability, L, remove_flag=True, progress=True)
The pruning function of the hierarchy_based_filter algorithm: selects from the previously reduced set only the most valuable features, i.e. those whose Information Gain is greater than the average.
- Parameters
df_filtered (pd.DataFrame) – The resulting dataframe output by the initial selection algorithm.
G (nx.DirectedGraph) – The reverse of the directed graph of all classes and superclasses.
node_values (dictionary) – Dictionary containing the information gain value of every node in the DirectedGraph.
node_availability (dictionary) – Dictionary containing every node in the DirectedGraph and its availability (either True or False).
L (list) – A list containing the leaf nodes of the DirectedGraph.
remove_flag (bool, optional) – Whether to strictly remove all the nodes once one of their info gain values is smaller than the average info gain of the paths. Defaults to True.
progress (bool, optional) – If True, progress bars will be shown to inform the user about the progress made by the process. Defaults to True.
- Returns
Filtered Dataframe containing the selected attributes.
- Return type
pd.DataFrame
- kgextension.feature_selection_helper.representative_feature(path, values)
Helper function for TSEL filter. Returns the representative node of a given path.
- Parameters
path (list) – Path containing some node names.
values (dict) – values containing nodes and their values.
- Returns
Name of most valuable/representative node of the given path.
- Return type
str
kgextension.feature_selection_sklearn module
- class kgextension.feature_selection_sklearn.GreedyTopDownFilter(label_column, column_prefix='new_link_type_', G=None, progress=True)
Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin
- fit(X, y=None)
- transform(X, y=None)
- class kgextension.feature_selection_sklearn.HierarchyBasedFilter(label_column, G=None, threshold=0.99, metric='info_gain', pruning=True, all_remove=True, progress=True, **kwargs)
Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin
Feature selection approach, namely, SHSEL including the initial selection algorithm and pruning algorithm. Identify and filter out the ranges of nodes with similar relevance in each branch of the hierarchy. It can be used in a sklearn pipeline.
Ristoski, P. and Paulheim, H., 2014, October. Feature selection in hierarchical feature spaces. In International conference on discovery science (pp. 288-300). Springer, Cham.
- Parameters
BaseEstimator (sklearn.base.BaseEstimator) –
TransformerMixin (sklearn.base.TransformerMixin) –
- fit(X, y=None)
- transform(X, y=None)
- class kgextension.feature_selection_sklearn.HillClimbingFilter(label_column, metric='hill_climbing_cost_function', G=None, beta=0.05, k=5, progress=True, **kwargs)
Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin
Feature selection performed by comparing nodes with their parents in a bottom-up approach. Can be used in a sklearn pipeline.
Wang, B.B., Mckay, R.B., Abbass, H.A. and Barlow, M., 2003, February. A comparative study for domain ontology guided feature extraction. In Proceedings of the 26th Australasian computer science conference-Volume 16 (pp. 69-78).
- Parameters
BaseEstimator (sklearn.base.BaseEstimator) –
TransformerMixin (sklearn.base.TransformerMixin) –
- fit(X, y=None)
- transform(X, y=None)
kgextension.fusion module
- kgextension.fusion.data_fuser(df, clusters, boolean_method_single='provenance', boolean_method_multiple='voting', numeric_method_single='average', numeric_method_multiple='average', string_method_single='longest', string_method_multiple='longest', provenance_regex='http://dbpedia.org/', progress=True)
Fuses the columns in the “match” sets of the clusters. Determines type and size and automatically detects which of the functions to use. If a fusion match is a pair, the “single” function is used, otherwise the “multiple” function. Available functions are first, last, longest, shortest, random.choice, voting and provenance. Other existing and user-defined functions can be passed as well; they should be applicable to pd.DataFrame.apply(axis=1).
- Parameters
df (pd.DataFrame) – The DataFrame where schema matches are to be fused
clusters (list) – contains the clusters with the matching column names as sets
boolean_method_single (str, optional) – Method for single matches with boolean type. Defaults to “provenance”.
boolean_method_multiple (str, optional) – Method for multiple matches with boolean type. Defaults to “voting”.
numeric_method_single (str, optional) – Method for single matches with numeric type. Defaults to “average”.
numeric_method_multiple (str, optional) – Method for multiple matches with numeric type. Defaults to “average”.
string_method_single (str, optional) – Method for single matches with string type. Defaults to “longest”.
string_method_multiple (str, optional) – Method for multiple matches with string type. Defaults to “longest”.
provenance_regex (str, optional) – Pattern after which provenance is selected. Defaults to “http://dbpedia.org/”.
progress (bool, optional) – If True, progress bars will be shown to inform the user about the progress made by the process. Defaults to True.
- Returns
DataFrame with fused columns.
- Return type
pd.DataFrame
- kgextension.fusion.get_fusion_clusters(df, threshold, progress=True)
Takes the attribute pairs generated by one of the matchers, discards all pairs with a similarity below the specified threshold and then clusters the remaining pairs into sets of equal attributes (based on the idea that similarities between attributes are Euclidean). Example: The pairs {car, auto} and {car, automobile} would be clustered into the set {car, auto, automobile} (if both pairs have a similarity ≥ the specified threshold).
- Parameters
df (pd.DataFrame) – The DataFrame containing the similarities between the attribute pairs. This is generated by one of the matchers.
threshold (float) – Threshold that specifies the minimal similarity between two attributes, so that they are considered as matched.
- Returns
List of sets that contain equal (matched) attributes.
- Return type
list
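The clustering step described above amounts to finding connected components among the pairs that pass the threshold. The sketch below uses a small union-find structure on plain tuples; the function name cluster_pairs and the (attribute, attribute, similarity) layout are assumptions for illustration, not the format produced by the matchers.

```python
def cluster_pairs(pairs, threshold):
    """Group attributes that are linked by pairs with similarity >= threshold."""
    parent = {}

    def find(item):
        # Find the representative of the set containing `item`, with path halving.
        parent.setdefault(item, item)
        while parent[item] != item:
            parent[item] = parent[parent[item]]
            item = parent[item]
        return item

    for left, right, similarity in pairs:
        if similarity >= threshold:
            parent[find(left)] = find(right)  # merge the two sets

    clusters = {}
    for item in list(parent):
        clusters.setdefault(find(item), set()).add(item)
    return list(clusters.values())


pairs = [
    ("car", "auto", 0.90),
    ("car", "automobile", 0.85),
    ("car", "bike", 0.20),      # below the threshold, discarded
    ("price", "cost", 0.95),
]
clusters = cluster_pairs(pairs, threshold=0.8)
```

With these inputs, car, auto and automobile end up in one set and price and cost in another, while bike is not clustered at all.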
kgextension.fusion_helper module
- kgextension.fusion_helper.first(x)
Returns the first not-NA value, helper function for pd.DataFrame.apply.
- Parameters
x (pd.Series) – columns/rows passed in pd.DataFrame.apply function
- Returns
first not-NA value of the pd.Series
- Return type
flexible
- kgextension.fusion_helper.fusion_function_lookup(boolean_method_single, boolean_method_multiple, numeric_method_single, numeric_method_multiple, string_method_single, string_method_multiple)
Maps each method passed as a string to the corresponding function, e.g. boolean_method_single = ‘random’ –> random.choice.
- Parameters
boolean_method_single (str) – method to use for a cluster of size two and boolean values.
boolean_method_multiple (str) – method to use for a cluster of more than size two and boolean values.
numeric_method_single (str) – method to use for a cluster of size two and numeric values
numeric_method_multiple (str) – method to use for a cluster of more than size two and numeric values.
string_method_single (str) – method to use for a cluster of size two and string values.
string_method_multiple (str) – method to use for a cluster of more than size two and string values.
- Returns
A dictionary with the mapping from method to function.
- Return type
dict
- kgextension.fusion_helper.last(x)
Returns the last not-NA value, helper function for pd.DataFrame.apply.
- Parameters
x (pd.Series) – columns/rows passed in pd.DataFrame.apply function
- Returns
last not-NA value of the pd.Series
- Return type
flexible
- kgextension.fusion_helper.longest(x)
Returns the longest value, helper function for pd.DataFrame.apply.
- Parameters
x (pd.Series) – columns/rows passed in pd.DataFrame.apply function
- Returns
longest value of the pd.Series
- Return type
str
- kgextension.fusion_helper.provenance(columns, regex='http://dbpedia.org/')
Determines the name of the column matching the regex pattern.
- Parameters
columns (pd.DataFrame.columns) – The columns of the schema matches to be fused
regex (str, optional) – The regex string identifiying the column name, generally the prefix of the feature. Defaults to “http://dbpedia.org/”.
- Returns
The name of the column matching the regex pattern.
- Return type
str
- Raises
AttributeError – If no column or more than one columns of the fusion cluster match the pattern.
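A provenance lookup of this kind can be sketched with the standard re module. The helper name provenance_column is hypothetical, and the use of re.search (a match anywhere in the column name) is an assumption; only the behaviour of raising an AttributeError for zero or several matches follows the description above.

```python
import re


def provenance_column(columns, regex="http://dbpedia.org/"):
    """Return the single column name matching the pattern (hypothetical helper)."""
    matches = [column for column in columns if re.search(regex, column)]
    if len(matches) != 1:
        raise AttributeError(
            f"Expected exactly one column matching {regex!r}, found {len(matches)}."
        )
    return matches[0]


cluster_columns = [
    "http://dbpedia.org/ontology/populationTotal",
    "http://www.wikidata.org/prop/direct/P1082",
]
preferred = provenance_column(cluster_columns)
```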
- kgextension.fusion_helper.shortest(x)
Returns the shortest value, helper function for pd.DataFrame.apply.
- Parameters
x (pd.Series) – columns/rows passed in pd.DataFrame.apply function
- Returns
shortest value of the pd.Series
- Return type
str
- kgextension.fusion_helper.voting(x)
Chooses the value with the most votes (mode value in statistics). If there is a draw, the first value is chosen.
- Parameters
x (pd.Series) – columns/rows passed in pd.DataFrame.apply function
- Returns
mode value of the pd.Series
- Return type
flexible
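The behaviour of these fusion helpers can be illustrated on plain Python lists. The sketch below is not the library code: the *_value function names are hypothetical, None stands in for pandas NA values, and length is measured with len, which assumes string values.

```python
from collections import Counter


def _not_na(values):
    """Drop missing entries (None stands in for pandas NA in this sketch)."""
    return [value for value in values if value is not None]


def first_value(values):
    present = _not_na(values)
    return present[0] if present else None


def last_value(values):
    present = _not_na(values)
    return present[-1] if present else None


def longest_value(values):
    present = _not_na(values)
    return max(present, key=len) if present else None


def shortest_value(values):
    present = _not_na(values)
    return min(present, key=len) if present else None


def voting_value(values):
    """Most frequent value; on a draw, the value seen first wins."""
    counts = Counter(_not_na(values))
    return max(counts, key=counts.get) if counts else None


row = [None, "Berlin", "Berlin, Germany", "Berlin"]
print(first_value(row), "|", longest_value(row), "|", voting_value(row))
```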
kgextension.generator module
- kgextension.generator.custom_sparql_generator(df, link_attribute, query, endpoint=<kgextension.sparql_helper.RemoteEndpoint object>, progress=True, attribute_generation_strategy='first', prefix_lookup=False, caching=True)
This generator issues a custom SPARQL query and creates additional attributes from the query results.
- Parameters
df (pd.DataFrame) – Dataframe to which links are added
link_attribute (str) – Name of column containing the link to the knowledge graph.
query (str) – Custom SPARQL query which returns attributes to be appended.
endpoint (Endpoint, optional) – SPARQL Endpoint to be queried; ignored when “uri_data_model” = True. Defaults to DBpedia.
progress (bool, optional) – If True, progress bars will be shown to inform the user about the progress made by the process. Defaults to True.
prefix_lookup (bool/str/dict, optional) – True: Namespaces of prefixes will be looked up at prefix.cc and added to the sparql query. str: User provides the path to a json-file with prefixes and namespaces. dict: User provides a dictionary with prefixes and namespaces. Defaults to False.
caching (bool, optional) – Turn result-caching for queries issued during the execution on or off. Defaults to True.
- Returns
Dataframe with new columns containing the query results.
- Return type
pd.DataFrame
- kgextension.generator.data_properties_generator(df, columns, endpoint=<kgextension.sparql_helper.RemoteEndpoint object>, uri_data_model=False, progress=True, type_filter=None, regex_filter=None, bundled_mode=True, prefix_lookup=False, caching=True)
Generator that takes a dataset with a link to a knowledge graph and creates a new feature for each data property of the given resource.
- Parameters
df (pd.DataFrame) – Dataframe to which the features will be added
columns (str/list) – Name(s) of column(s) which contain(s) the link(s) to the knowledge graph.
endpoint (Endpoint, optional) – Base string to the knowledge graph; ignored when “uri_data_model” = True. Defaults to DBpedia.
uri_data_model (bool, optional) – If enabled, the URI is directly queried instead of a SPARQL endpoint. Defaults to False.
progress (bool, optional) – If True, progress bars will be shown to inform the user about the progress made by the process. Defaults to True.
type_filter (str, optional) – Property datatype to be selected from results (e.g. xsd:string). If a specific datatype should be excluded a “- ” needs to be prepended (e.g. - xsd:string). Defaults to None.
regex_filter (str, optional) – Regular expression for filtering properties. Defaults to None.
bundled_mode (bool, optional) – If True, all necessary queries are bundled into one query (using the VALUES method). Requires a SPARQL 1.1 implementation! Defaults to True.
prefix_lookup (bool/str/dict, optional) – True: Namespaces of prefixes will be looked up at prefix.cc and added to the sparql query. str: User provides the path to a json-file with prefixes and namespaces. dict: User provides a dictionary with prefixes and namespaces. Defaults to False.
caching (bool, optional) – Turn result-caching for queries issued during the execution on or off. Defaults to True.
- Returns
Dataframe with a new column for each property.
- Return type
pd.DataFrame
- kgextension.generator.direct_type_generator(df, columns, endpoint=<kgextension.sparql_helper.RemoteEndpoint object>, uri_data_model=False, progress=True, prefix='', regex_filter=None, result_type='boolean', bundled_mode=True, hierarchy=False, prefix_lookup=False, caching=True)
Generator that takes a dataset with (a) link(s) to a knowledge graph and queries the type(s) of the linked resources (using rdf:type). The resulting types are added as new columns, which are filled either with a boolean indicator or a count.
- Parameters
df (pd.DataFrame) – Dataframe to which types are added.
columns (str/list) – Name(s) of column(s) which contain(s) the link(s) to the knowledge graph.
endpoint (Endpoint, optional) – SPARQL Endpoint to be queried; ignored when “uri_data_model” = True. Defaults to DBpedia.
uri_data_model (bool, optional) – If enabled, the URI is directly queried instead of a SPARQL endpoint. Defaults to False.
progress (bool, optional) – If True, progress bars will be shown to inform the user about the progress made by the process. Defaults to True.
prefix (str, optional) – Custom prefix for the SPARQL query. Defaults to “”.
regex_filter (list, optional) – A list filled with regexes (as strings) to filter the results. Defaults to None.
result_type (str, optional) – States whether the results should be boolean (“boolean”), counts (“counts”), relative counts (“relative”) or tfidf-values (“tfidf”). Defaults to “boolean”.
bundled_mode (bool, optional) – If True, all necessary queries are bundled into one query (using the VALUES method). Requires a SPARQL 1.1 implementation! Defaults to True.
hierarchy (bool, optional) – If True, a hierarchy of all superclasses of the returned types is attached to the resulting dataframe. Defaults to False.
prefix_lookup (bool/str/dict, optional) – True: Namespaces of prefixes will be looked up at prefix.cc and added to the sparql query. str: User provides the path to a json-file with prefixes and namespaces. dict: User provides a dictionary with prefixes and namespaces. Defaults to False.
caching (bool, optional) – Turn result-caching for queries issued during the execution on or off. Defaults to True.
- Returns
Returns dataframe with (a) new column(s) containing the found types.
- Return type
pd.DataFrame
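To make the bundled mode more concrete, the sketch below builds a single SPARQL 1.1 query that asks for the types of several resources at once through a VALUES clause. The function name, the variable names and the exact query shape are assumptions for illustration; the query actually issued by the generator may look different.

```python
def build_bundled_type_query(uris):
    """Build one query covering all given resource URIs (hypothetical helper)."""
    values = " ".join(f"<{uri}>" for uri in uris)
    return (
        "SELECT DISTINCT ?s ?type WHERE { "
        f"VALUES ?s {{ {values} }} "  # SPARQL 1.1 inline data
        "?s a ?type . }"              # "a" is shorthand for rdf:type
    )


query = build_bundled_type_query([
    "http://dbpedia.org/resource/Berlin",
    "http://dbpedia.org/resource/Paris",
])
print(query)
```

Without bundling, one query per resource would be needed, which matters given the request limits of public endpoints.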
- kgextension.generator.qualified_relation_generator(df, columns, endpoint=<kgextension.sparql_helper.RemoteEndpoint object>, uri_data_model=False, progress=True, prefix='Link', direction='Out', properties_regex_filter=None, types_regex_filter=None, result_type='boolean', hierarchy=False, prefix_lookup=False, caching=True)
Qualified relation generator considers not only relations, but also the related types, adding boolean, counts, relative counts or tfidf-values features for incoming and outgoing relations.
- Parameters
df (pd.DataFrame) – Dataframe to which links are added.
columns (str/list) – Name(s) of column(s) which contain(s) the link(s) to the knowledge graph.
endpoint (Endpoint, optional) – SPARQL Endpoint to be queried; ignored when “uri_data_model” = True. Defaults to DBpedia.
uri_data_model (bool, optional) – If enabled, the URI is directly queried instead of a SPARQL endpoint. Defaults to False.
progress (bool, optional) – If True, progress bars will be shown to inform the user about the progress made by the process. Defaults to True.
prefix (str, optional) – Custom prefix for the SPARQL query. Defaults to “Link”.
direction (str, optional) – Direction of the properties to consider: incoming (“In”) or outgoing (“Out”). Defaults to “Out”.
properties_regex_filter (str, optional) – Regular expression for filtering properties. Defaults to None.
types_regex_filter (str, optional) – Regular expression for filtering types. Defaults to None.
result_type (str, optional) – States whether the results should be boolean (“boolean”), counts (“counts”), relative counts (“relative”) or tfidf-values (“tfidf”). Defaults to “boolean”.
hierarchy (bool, optional) – If True, a hierarchy of all superclasses of the returned types is attached to the resulting dataframe. Defaults to False.
prefix_lookup (bool/str/dict, optional) – True: Namespaces of prefixes will be looked up at prefix.cc and added to the sparql query. str: User provides the path to a json-file with prefixes and namespaces. dict: User provides a dictionary with prefixes and namespaces. Defaults to False.
caching (bool, optional) – Turn result-caching for queries issued during the execution on or off. Defaults to True.
- Returns
Dataframe with new columns containing the links of properties to the knowledge graph
- Return type
pd.DataFrame
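The four result_type options can be illustrated with a small, library-independent sketch. It does not call kgextension and is not its implementation: the function name relation_features, the feature names and the tf-idf formula (relative count times log(N / document frequency)) are assumptions chosen only to show how the options differ.

```python
import math
from collections import Counter


def relation_features(relations_per_row, result_type="boolean"):
    """Turn one list of relation names per row into feature dictionaries.

    Hypothetical helper for illustration; the tf-idf variant used here
    (relative count * log(N / document frequency)) is an assumption.
    """
    counts = [Counter(relations) for relations in relations_per_row]
    features = sorted({name for row in counts for name in row})
    n_rows = len(counts)
    # Number of rows in which each feature occurs at least once.
    doc_freq = {f: sum(1 for row in counts if row[f] > 0) for f in features}

    result = []
    for row in counts:
        total = sum(row.values())
        out = {}
        for f in features:
            relative = row[f] / total if total else 0.0
            if result_type == "boolean":
                out[f] = row[f] > 0
            elif result_type == "counts":
                out[f] = row[f]
            elif result_type == "relative":
                out[f] = relative
            elif result_type == "tfidf":
                out[f] = relative * math.log(n_rows / doc_freq[f])
            else:
                raise ValueError(f"unknown result_type: {result_type}")
        result.append(out)
    return result


# Made-up relations for two rows of a dataframe.
example = [["birthPlace", "birthPlace", "spouse"], ["spouse"]]
boolean_rows = relation_features(example, "boolean")
count_rows = relation_features(example, "counts")
relative_rows = relation_features(example, "relative")
tfidf_rows = relation_features(example, "tfidf")
```

In this sketch a relation that occurs in every row gets a tf-idf value of 0, because log(N / N) = 0.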
- kgextension.generator.specific_relation_generator(df, columns, endpoint=<kgextension.sparql_helper.RemoteEndpoint object>, uri_data_model=False, progress=True, direct_relation='http://purl.org/dc/terms/subject', hierarchy_relation=None, max_hierarchy_depth=1, prefix_lookup=False, caching=True)
Creates attributes from a specific direct relation. Additionally, it is possible to append a hierarchy with a user-defined hierarchy relation.
- Parameters
df (pd.DataFrame) – the dataframe to extend
columns (str/list) – Name(s) of column(s) which contain(s) the link(s) to the knowledge graph.
endpoint (Endpoint, optional) – SPARQL Endpoint to be queried; ignored when “uri_data_model” = True. Defaults to DBpedia.
uri_data_model (bool, optional) – If enabled, the URI is directly queried instead of a SPARQL endpoint. Defaults to False.
progress (bool, optional) – If True, progress bars will be shown to inform the user about the progress made by the process. Defaults to True.
direct_relation (str, optional) – Direct relation used to create features. Defaults to “http://purl.org/dc/terms/subject”.
hierarchy_relation (str, optional) – Hierarchy relation used to connect categories, e.g. http://www.w3.org/2004/02/skos/core#broader. Defaults to None.
max_hierarchy_depth (int, optional) – Maximal number of hierarchy steps taken. Defaults to 1.
prefix_lookup (bool/str/dict, optional) – True: Namespaces of prefixes will be looked up at prefix.cc and added to the sparql query. str: User provides the path to a json-file with prefixes and namespaces. dict: User provides a dictionary with prefixes and namespaces. Defaults to False.
caching (bool, optional) – Turn result-caching for queries issued during the execution on or off. Defaults to True.
- Returns
The dataframe with additional features.
- Return type
pd.DataFrame
- kgextension.generator.unqualified_relation_generator(df, columns, endpoint=<kgextension.sparql_helper.RemoteEndpoint object>, uri_data_model=False, progress=True, prefix='Link', direction='Out', regex_filter=None, result_type='boolean', prefix_lookup=False, caching=True)
Unqualified relation generator creates attributes from the existence of relations and adds boolean, counts, relative counts or tfidf-values features for incoming and outgoing relations.
- Parameters
df (pd.DataFrame) – Dataframe to which links are added.
columns (str/list) – Name(s) of column(s) which contain(s) the link(s) to the knowledge graph.
endpoint (Endpoint, optional) – SPARQL Endpoint to be queried; ignored when “uri_data_model” = True. Defaults to DBpedia.
uri_data_model (bool, optional) – If enabled, the URI is directly queried instead of a SPARQL endpoint. Defaults to False.
progress (bool, optional) – If True, progress bars will be shown to inform the user about the progress made by the process. Defaults to True.
prefix (str, optional) – Custom prefix for the SPARQL query. Defaults to “Link”.
direction (str, optional) – Direction of the properties to consider: incoming (“In”) or outgoing (“Out”). Defaults to “Out”.
regex_filter (str, optional) – Regular expression for filtering properties. Defaults to None.
result_type (str, optional) – States whether the results should be boolean (“boolean”), counts (“counts”), relative counts (“relative”) or tfidf-values (“tfidf”). Defaults to “boolean”.
prefix_lookup (bool/str/dict, optional) – True: Namespaces of prefixes will be looked up at prefix.cc and added to the sparql query. str: User provides the path to a json-file with prefixes and namespaces. dict: User provides a dictionary with prefixes and namespaces. Defaults to False.
caching (bool, optional) – Turn result-caching for queries issued during the execution on or off. Defaults to True.
- Returns
Dataframe with new columns containing the links of properties to the knowledge graph
- Return type
pd.DataFrame
kgextension.generator_helper module
- kgextension.generator_helper.create_graph_from_raw(DG, results, max_hierarchy_depth, current_level, uri_data_model)
Converts the XML obtained by the endpoint wrapper into a hierarchical directed graph.
- Parameters
DG (Directed Graph) – The empty or preprocessed graph to be appended.
results (DOM/pd.DataFrame) – The raw results of the SPARQL query
max_hierarchy_depth (int) – The maximum number of hierarchy levels when the direct search is used.
current_level (pd.Series) – In case of iterative hierarchy generation the values of the current hierarchy level.
uri_data_model (bool) – If enabled, the URI is directly queried instead of a SPARQL endpoint.
- Returns
Graph where edges point to direct superclasses of nodes. current_level: In case of iterative hierarchy generation the updated hierarchy level.
- Return type
nx.DirectedGraph
- kgextension.generator_helper.get_result_df(df, result_type, prefix, merged_df, column)
Helper function for the unqualified and qualified relation generators. It creates the result dataframe and reduces code duplication between the two main functions.
- Parameters
df (pd.DataFrame) – The result dataframe dummies.
result_type (str) – The type of result chosen from boolean, count, relative count or tf-idf.
prefix (str) – Prefix set automatically by the generator.
merged_df (pd.DataFrame) – The original dataframe provided by the user.
column (str) – Name of the attribute containing entities that should be found.
- Returns
The final dataframe.
- Return type
pd.DataFrame
- kgextension.generator_helper.hierarchy_graph_generator(col, hierarchy_relation='http://www.w3.org/2000/01/rdf-schema#subClassOf', max_hierarchy_depth=None, endpoint=<kgextension.sparql_helper.RemoteEndpoint object>, uri_data_model=False, progress=False, caching=True)
Computes a hierarchy graph from an original set of features, where directed edges symbolise a hierarchy relation from subclass to superclass.
- Parameters
col (pd.Series) – The classes/categories for which the hierarchy graph is generated.
hierarchy_relation (str, optional) – The hierarchy relation to be used. Defaults to “http://www.w3.org/2000/01/rdf-schema#subClassOf”.
max_hierarchy_depth (int, optional) – Number of jumps in hierarchy. If None, transitive jumps are used. Defaults to None.
endpoint (Endpoint, optional) – Link to the SPARQL endpoint that should be queried. Defaults to DBpedia.
uri_data_model (bool, optional) – Whether to use the URI data model instead of querying a SPARQL endpoint. Defaults to False.
progress (bool, optional) – If True, progress bars will be shown to inform the user about the progress made by the process (if “uri_data_model” = True). Defaults to False.
caching (bool, optional) – Turn result-caching for queries issued during the execution on or off. Defaults to True.
- Returns
Graph where edges point to direct superclasses of nodes.
- Return type
nx.DirectedGraph
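The level-by-level construction of such a hierarchy can be sketched without networkx or an endpoint. The snippet below is an illustration only: build_hierarchy and the in-memory superclasses_of lookup are hypothetical stand-ins for the queries the library issues, and a plain dictionary of sets replaces the directed graph.

```python
def build_hierarchy(start_nodes, superclasses_of, max_hierarchy_depth=None):
    """Collect edges node -> direct superclasses, one hierarchy level at a time.

    superclasses_of is a hypothetical in-memory lookup standing in for the
    results of a hierarchy query.
    """
    graph = {}
    visited = set(start_nodes)
    current_level = set(start_nodes)
    depth = 0
    while current_level and (max_hierarchy_depth is None or depth < max_hierarchy_depth):
        next_level = set()
        for node in current_level:
            parents = set(superclasses_of.get(node, ()))
            graph.setdefault(node, set()).update(parents)
            for parent in parents:
                # Expand every node only once; this also guards against cycles.
                if parent not in visited:
                    visited.add(parent)
                    next_level.add(parent)
        current_level = next_level
        depth += 1
    return graph


# Made-up class hierarchy for illustration.
superclasses = {
    "ex:Scientist": ["ex:Person"],
    "ex:Person": ["ex:Agent"],
    "ex:Agent": ["ex:Thing"],
}
one_step = build_hierarchy(["ex:Scientist"], superclasses, max_hierarchy_depth=1)
two_steps = build_hierarchy(["ex:Scientist"], superclasses, max_hierarchy_depth=2)
transitive = build_hierarchy(["ex:Scientist"], superclasses)
```

With max_hierarchy_depth=None the loop runs until no new superclasses appear, which mirrors the transitive behaviour described above.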
- kgextension.generator_helper.hierarchy_query_creator(col, hierarchy_relation, max_hierarchy_depth, uri_data_model)
Creates a Sparql query to retrieve the hierarchy of classes/categories.
- Parameters
col (pd.Series) – pd.Series containing the URIs.
hierarchy_relation (str) – A hierarchy relation, e.g. http://www.w3.org/2004/02/skos/core#broader.
max_hierarchy_depth (int) – The maximum number of hierarchy levels added based on the original resources. If None is passed, transitive hierarchies are created, which may lead to a timeout.
uri_data_model (bool) – If False, the query is formulated for SPARQL endpoints.
- Returns
The SPARQL Query for hierarchy retrieval.
- Return type
str
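One way such a query could be assembled is sketched below, using SPARQL 1.1 VALUES and property paths. This is not the query text that hierarchy_query_creator produces; the function name build_hierarchy_query, the variable names ?start, ?sub and ?super and the way the depth limit is encoded are all assumptions.

```python
def build_hierarchy_query(uris, hierarchy_relation, max_hierarchy_depth=None):
    """Build a SPARQL 1.1 query returning (?sub, ?super) hierarchy edges.

    Hypothetical sketch: a depth limit is encoded as a sequence of optional
    steps, and None falls back to a transitive property path.
    """
    values = " ".join(f"<{uri}>" for uri in uris)
    rel = f"<{hierarchy_relation}>"
    if max_hierarchy_depth is None:
        # Zero or more steps: may be expensive on large graphs.
        reach = f"?start {rel}* ?sub ."
    elif max_hierarchy_depth <= 1:
        # Only the direct superclasses of the start resources.
        reach = "BIND(?start AS ?sub)"
    else:
        # Up to (depth - 1) optional steps before the final edge.
        optional_steps = "/".join([f"{rel}?"] * (max_hierarchy_depth - 1))
        reach = f"?start {optional_steps} ?sub ."
    return (
        "SELECT DISTINCT ?sub ?super WHERE { "
        f"VALUES ?start {{ {values} }} "
        f"{reach} "
        f"?sub {rel} ?super . }}"
    )


broader = "http://www.w3.org/2004/02/skos/core#broader"
start = ["http://example.org/Category_A"]  # made-up URI
direct_query = build_hierarchy_query(start, broader, max_hierarchy_depth=1)
bounded_query = build_hierarchy_query(start, broader, max_hierarchy_depth=3)
transitive_query = build_hierarchy_query(start, broader)
```

The transitive variant is the one most likely to run into endpoint timeouts, which is why a bounded depth is often the safer choice.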
kgextension.generator_sklearn module
- class kgextension.generator_sklearn.DataPropertiesGenerator(columns, endpoint=<kgextension.sparql_helper.RemoteEndpoint object>, uri_data_model=False, progress=True, type_filter=None, regex_filter=None, bundled_mode=True, prefix_lookup=False, caching=True)
Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin
- fit(X, y=None)
- transform(X, y=None)
- class kgextension.generator_sklearn.DirectTypeGenerator(columns, endpoint=<kgextension.sparql_helper.RemoteEndpoint object>, uri_data_model=False, progress=True, prefix='', regex_filter=None, result_type='boolean', bundled_mode=True, hierarchy=False, prefix_lookup=False, caching=True)
Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin
- fit(X, y=None)
- transform(X, y=None)
- class kgextension.generator_sklearn.QualifiedRelationGenerator(columns, endpoint=<kgextension.sparql_helper.RemoteEndpoint object>, uri_data_model=False, progress=True, prefix='Link', direction='Out', properties_regex_filter=None, types_regex_filter=None, result_type='boolean', hierarchy=False, prefix_lookup=False, caching=True)
Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin
Qualified relation generator considers not only relations, but also the related types, adding boolean, counts, relative counts or tfidf-values features for incoming and outgoing relations.
- Parameters
BaseEstimator (sklearn.base.BaseEstimator) –
TransformerMixin (sklearn.base.TransformerMixin) –
- fit(X, y=None)
- transform(X, y=None)
- class kgextension.generator_sklearn.SpecificRelationGenerator(columns, endpoint=<kgextension.sparql_helper.RemoteEndpoint object>, uri_data_model=False, progress=True, direct_relation='http://purl.org/dc/terms/subject', hierarchy_relation=None, max_hierarchy_depth=1, prefix_lookup=False, caching=True)
Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin
- fit(X, y=None)
- transform(X, y=None)
- class kgextension.generator_sklearn.UnqualifiedRelationGenerator(columns, endpoint=<kgextension.sparql_helper.RemoteEndpoint object>, uri_data_model=False, progress=True, prefix='Link', direction='Out', regex_filter=None, result_type='boolean', prefix_lookup=False, caching=True)
Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin
Unqualified relation generator creates attributes from the existence of relations and adds boolean, counts, relative counts or tfidf-values features for incoming and outgoing relations.
- Parameters
BaseEstimator (sklearn.base.BaseEstimator) –
TransformerMixin (sklearn.base.TransformerMixin) –
- fit(X, y=None)
- transform(X, y=None)
kgextension.link_exploration module
- kgextension.link_exploration.link_explorer(df, base_link_column, number_of_hops=1, links_to_follow=['owl:sameAs'], lod_sources=[], exclude_sources=[], prefix_lookup=False, progress=True, caching=True)
Follows the defined links starting from a base link to a certain number of hops. Adds the discovered links as new columns to the dataframe.
- Parameters
df (pd.DataFrame) – Dataframe with a base link
base_link_column (str) – Name of column which contains the base link to start with.
number_of_hops (int, optional) – Depth of exploration of the LOD cloud. Defaults to 1.
links_to_follow (list, optional) – Names of links that should be followed. Defaults to [“owl:sameAs”].
lod_sources (list, optional) – Restrict exploration to certain datasets. Use strings or regular expressions to define the allowed datasets. Defaults to [].
exclude_sources (list, optional) – Exclude certain datasets from exploration. Use strings or regular expressions to define the datasets. Defaults to [].
prefix_lookup (bool/str/dict, optional) – True: Namespaces of prefixes will be looked up at prefix.cc and added to the sparql query. str: User provides the path to a json-file with prefixes and namespaces. dict: User provides a dictionary with prefixes and namespaces. Defaults to False.
progress (bool, optional) – If True, progress bars will be shown to inform the user about the progress made by the process. Defaults to True.
caching (bool, optional) – Turn result-caching for queries issued during the execution on or off. Defaults to True.
- Returns
Dataframe with a new column for each discovered link.
- Return type
pd.DataFrame
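The hop-wise exploration with source filters can be sketched as a breadth-first walk over a link table. The snippet is not the library's code: explore_links, the in-memory links dictionary and the example.org-style URIs are hypothetical, and the choice not to follow links through excluded sources is an assumption.

```python
import re


def explore_links(base_uri, links, number_of_hops=1, lod_sources=(), exclude_sources=()):
    """Follow links hop by hop and return the newly discovered URIs in order.

    links is a hypothetical in-memory mapping uri -> list of linked URIs,
    standing in for the lookups performed against the LOD cloud.
    """
    def allowed(uri):
        # If lod_sources is given, at least one pattern must match.
        if lod_sources and not any(re.search(p, uri) for p in lod_sources):
            return False
        return not any(re.search(p, uri) for p in exclude_sources)

    seen = {base_uri}
    frontier = [base_uri]
    discovered = []
    for _ in range(number_of_hops):
        next_frontier = []
        for uri in frontier:
            for target in links.get(uri, []):
                if target not in seen and allowed(target):
                    seen.add(target)
                    discovered.append(target)
                    next_frontier.append(target)
        frontier = next_frontier
    return discovered


# Made-up link table for illustration.
links = {
    "http://a.example/Berlin": ["http://b.example/Berlin", "http://c.example/Berlin"],
    "http://b.example/Berlin": ["http://d.example/Berlin"],
}
one_hop = explore_links("http://a.example/Berlin", links, number_of_hops=1)
two_hops = explore_links("http://a.example/Berlin", links, number_of_hops=2)
without_b = explore_links(
    "http://a.example/Berlin", links, number_of_hops=2, exclude_sources=[r"b\.example"]
)
```

Because excluded sources are not followed in this sketch, the link that is only reachable through b.example is not discovered in the last call.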
kgextension.link_exploration_sklearn module
kgextension.linking module
- kgextension.linking.dbpedia_lookup_linker(df, column, new_attribute_name='new_link', progress=True, base_url='https://lookup.dbpedia.org/api/search/', max_hits=1, query_class='', lookup_api='KeywordSearch', caching=True)
Implementation of the DBpedia Lookup service (https://github.com/dbpedia/lookup). Takes strings from a column, looks for matching DBpedia entities and returns their URIs in newly added columns.
- Parameters
df (pd.DataFrame) – Dataframe to which links are added.
column (str) – Name of the attribute containing entities that should be looked up.
new_attribute_name (str, optional) – Name of column / prefix of columns containing the link to the knowledge graph. Defaults to “new_link”.
progress (bool, optional) – If True, progress bars will be shown to inform the user about the progress made by the process. Defaults to True.
base_url (str, optional) – Set the base URL for the generation of request URLs. Defaults to “https://lookup.dbpedia.org/api/search/”.
max_hits (int, optional) – Maximal number of URIs that should be returned per entity. Defaults to 1.
query_class (str, optional) – A DBpedia class from the DBpedia Ontology (https://wiki.dbpedia.org/services-resources/ontology) that the results should have (without prefix, e.g., dbo:place as place). Defaults to “”.
lookup_api (str, optional) – Choose between KeywordSearch and PrefixSearch mode of DBpedia Lookup. Defaults to “KeywordSearch”.
caching (bool, optional) – Turn result-caching for lookups issued during the execution on or off. Defaults to True.
- Returns
Returns dataframe with (a) new column(s) containing the links to the DBpedia entities.
- Return type
pd.DataFrame
- kgextension.linking.dbpedia_spotlight_linker(df, column, new_attribute_name='new_link', progress=True, max_hits=1, language='en', selection='first', confidence=0.3, support=5, min_similarity_score=0.5, caching=True)
Implementation of the DBpedia Spotlight Service (https://www.dbpedia-spotlight.org/). Takes strings from a column, looks for linked Wikipedia entities and returns their URIs to newly added columns.
- Parameters
df (pd.DataFrame) – Dataframe to which links are added.
column (str) – Name of the column whose entities should be found.
new_attribute_name (str, optional) – Name of column containing the link to the knowledge graph. Defaults to “new_link”.
progress (bool, optional) – If True, progress bars will be shown to inform the user about the progress made by the process. Defaults to True.
max_hits (int, optional) – Maximal number of URIs that should be returned per entity. Defaults to 1.
language (str, optional) – The DBpedia language setting. Defaults to “en”.
selection (str, optional) – Specifies whether the entities that occur first (first), that have the highest support (support) or that have the highest similarity score (similarityScore) should be chosen. Defaults to “first”.
confidence (float, optional) – Confidence threshold. Defaults to 0.3.
support (int, optional) – Support threshold. Defaults to 5.
min_similarity_score (float, optional) – Minimal similarity threshold. Defaults to 0.5.
caching (bool, optional) – Turn result-caching for queries issued during the execution on or off. Defaults to True.
- Returns
Dataframe with (a) new column(s) containing the DBpedia URIs.
- Return type
pd.DataFrame
- kgextension.linking.label_linker(df, column, new_attribute_name='new_link', progress=True, endpoint=<kgextension.sparql_helper.RemoteEndpoint object>, result_filter=None, language='en', max_hits=1, label_property='rdfs:label', prefix_lookup=False, caching=True)
Label Linker takes attributes from a column and adds a new column with the respective knowledge graph links based on the provided label_property (rdfs:label by default).
- Parameters
df (pd.DataFrame) – Dataframe to which links are added.
column (str) – Name of the column whose entities should be found.
new_attribute_name (str, optional) – Name of column containing the link to the knowledge graph. Defaults to “new_link”.
progress (bool, optional) – If True, progress bars will be shown to inform the user about the progress made by the process. Defaults to True.
endpoint (Endpoint, optional) – Choose SPARQL endpoint connection. Defaults to DBpedia.
result_filter (list, optional) – A list filled with regexes (as strings) to filter the results. Defaults to None.
language (str, optional) – Used to specify the language the labels are in. If the queried endpoint does not use language tags, set to None. Defaults to “en”.
max_hits (int, optional) – Maximal number of URIs that should be returned per entity. Defaults to 1.
label_property (str, optional) – Specifies the label_property that should be used in the query. Defaults to “rdfs:label”.
prefix_lookup (bool/str/dict, optional) – True: Namespaces of prefixes will be looked up at prefix.cc and added to the sparql query. str: User provides the path to a json-file with prefixes and namespaces. dict: User provides a dictionary with prefixes and namespaces. Defaults to False.
caching (bool, optional) – Turn result-caching for queries issued during the execution on or off. Defaults to True.
- Returns
Dataframe with a new column containing the links to the knowledge graph.
- Return type
pd.DataFrame
- kgextension.linking.pattern_linker(df, column, new_attribute_name='new_link', progress=True, base_url='http://dbpedia.org/resource/', url_encoding=True, DBpedia_link_format=True)
Basic Pattern Linker that takes attributes from a column and a base link and generates a new column with the respective knowledge graph links.
- Parameters
df (pd.DataFrame) – Dataframe to which links are added.
column (str) – Name of column whose entities should be found.
new_attribute_name (str, optional) – Name of column containing the link to the knowledge graph. Defaults to “new_link”.
progress (bool, optional) – If True, progress updates will be shown to inform the user about the progress made by the process. Defaults to True.
base_url (str, optional) – Base string to the knowledge graph. Defaults to “http://dbpedia.org/resource/”.
url_encoding (bool, optional) – Enables automatic url encoding. Defaults to True.
DBpedia_link_format (bool, optional) – Enables conversion to DBpedia link format. Defaults to True.
- Returns
Dataframe with a new column containing the links to the knowledge graph.
- Return type
pd.DataFrame
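The idea behind pattern-based linking can be shown with the standard library alone. The sketch below is not pattern_linker itself: pattern_link is a hypothetical helper, and the assumption that the DBpedia link format amounts to replacing spaces with underscores is a simplification.

```python
from urllib.parse import quote


def pattern_link(value, base_url="http://dbpedia.org/resource/",
                 url_encoding=True, dbpedia_link_format=True):
    """Build a knowledge graph link from a plain string (hypothetical sketch)."""
    label = value.strip()
    if dbpedia_link_format:
        # Assumption: DBpedia-style resource names use underscores for spaces.
        label = label.replace(" ", "_")
    if url_encoding:
        # Percent-encode characters that are not safe in a URL path.
        label = quote(label)
    return base_url + label


plain = pattern_link("Albert Einstein")
encoded = pattern_link("São Paulo")
no_format = pattern_link("Albert Einstein", dbpedia_link_format=False)
```

No lookup takes place here, so a generated link is not guaranteed to exist in the knowledge graph.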
- kgextension.linking.sameas_linker(df, column, new_attribute_name='new_link', progress=True, endpoint=<kgextension.sparql_helper.RemoteEndpoint object>, result_filter=None, uri_data_model=False, bundled_mode=True, prefix_lookup=False, caching=True)
Function that takes URIs from a column of a DataFrame and queries a given SPARQL endpoint for resources which are connected to these URIs via owl:sameAs. Found resources are added as new columns to the dataframe and the dataframe is returned.
- Parameters
df (pd.DataFrame) – Dataframe to which links are added.
column (str) – Name of the column for whose entities links should be found.
new_attribute_name (str, optional) – Name / prefix of the column(s) containing the found links. Defaults to “new_link”.
progress (bool, optional) – If True, progress bars will be shown to inform the user about the progress made by the process (if “uri_data_model” = True). Defaults to True.
endpoint (Endpoint, optional) – SPARQL Endpoint to be queried; ignored when “uri_data_model” = True. Defaults to DBpedia.
result_filter (list, optional) – A list filled with regexes (as strings) to filter the results. Defaults to None.
uri_data_model (bool, optional) – If enabled, the URI is directly queried instead of a SPARQL endpoint. Defaults to False.
bundled_mode (bool, optional) – If True, all necessary queries are bundled into one query (using the VALUES method). Requires a SPARQL 1.1 implementation. Defaults to True.
prefix_lookup (bool/str/dict, optional) – True: Namespaces of prefixes will be looked up at prefix.cc and added to the sparql query. str: User provides the path to a json-file with prefixes and namespaces. dict: User provides a dictionary with prefixes and namespaces. Defaults to False.
caching (bool, optional) – Turn result-caching for queries issued during the execution on or off. Defaults to True.
- Returns
Returns dataframe with (a) new column(s) containing the found resources.
- Return type
pd.DataFrame
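The difference between bundled and unbundled mode can be sketched by building the query strings directly. The helper names and the variable names ?value and ?sameas are assumptions; the actual query text issued by sameas_linker may differ.

```python
def bundled_sameas_query(uris):
    """One SPARQL 1.1 query covering all URIs via a VALUES block (sketch)."""
    values = " ".join(f"<{uri}>" for uri in uris)
    return (
        "PREFIX owl: <http://www.w3.org/2002/07/owl#> "
        "SELECT ?value ?sameas WHERE { "
        f"VALUES ?value {{ {values} }} "
        "?value owl:sameAs ?sameas . }"
    )


def single_sameas_queries(uris):
    """Fallback for endpoints without VALUES support: one query per URI."""
    return [bundled_sameas_query([uri]) for uri in uris]


uris = ["http://dbpedia.org/resource/Berlin", "http://dbpedia.org/resource/Paris"]
bundled = bundled_sameas_query(uris)
unbundled = single_sameas_queries(uris)
```

Bundling trades many small requests for a single larger one, which is why it depends on the endpoint implementing SPARQL 1.1.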
kgextension.linking_helper module
- kgextension.linking_helper.dll_query_resolver(query_link, maxHits)
Resolves a query link for the DBpedia Lookup API to a series of the URIs returned for that query.
- Parameters
query_link (str) – API request in link form.
maxHits (int) – Maximal number of URIs that should be returned by the API.
- Returns
Series containing the URIs as strings.
- Return type
pd.Series
- kgextension.linking_helper.spotlight_uri_extractor(entry, link, max_hits=1, selection='first', confidence=0.5, support=20, min_similarity_score=0.8)
Finds linked DBpedia entities of a string and returns them as a list.
- Parameters
entry (str) – Text in which entities are to be found.
link (str) – Link to DBpedia Spotlight.
max_hits (int, optional) – Maximal number of URIs that should be returned per entity. Defaults to 1.
selection (str, optional) – Specifies whether the entities that occur first (first), that have the highest support (support) or that have the highest similarity score (similarityScore) should be chosen. Defaults to “first”.
confidence (float, optional) – Confidence threshold. Defaults to 0.5.
support (int, optional) – Support threshold. Defaults to 20.
min_similarity_score (float, optional) – Minimal similarity threshold. Defaults to 0.8.
- Returns
All URIs found in accordance with the parameters. If max_hits exceeds the number of URIs found, the list is padded with NAs.
- Return type
list
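The selection and padding behaviour described above can be sketched on already parsed candidates. Everything in the snippet is hypothetical: select_uris, the candidate dictionaries with the keys uri, support and similarityScore, and the use of None as a stand-in for NA values.

```python
def select_uris(candidates, max_hits=1, selection="first"):
    """Pick up to max_hits URIs and pad the result to a fixed length (sketch)."""
    if selection == "first":
        ranked = list(candidates)  # keep the order of occurrence
    elif selection in ("support", "similarityScore"):
        ranked = sorted(candidates, key=lambda c: c[selection], reverse=True)
    else:
        raise ValueError(f"unknown selection: {selection}")
    uris = [c["uri"] for c in ranked[:max_hits]]
    # Pad with None (stand-in for NA) so every row yields max_hits entries.
    return uris + [None] * (max_hits - len(uris))


# Made-up candidates for illustration.
candidates = [
    {"uri": "http://example.org/u1", "support": 10, "similarityScore": 0.9},
    {"uri": "http://example.org/u2", "support": 50, "similarityScore": 0.7},
]
first_hit = select_uris(candidates)
by_support = select_uris(candidates, selection="support")
padded = select_uris(candidates, max_hits=3)
```

Padding to a fixed length keeps the number of new columns constant across rows.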
kgextension.linking_sklearn module
- class kgextension.linking_sklearn.DbpediaLookupLinker(column, new_attribute_name='new_link', progress=True, base_url='http://lookup.dbpedia.org/api/search/', max_hits=1, query_class='', lookup_api='KeywordSearch', caching=True)
Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin
- fit(X, y=None)
- transform(X, y=None)
- class kgextension.linking_sklearn.DbpediaSpotlightLinker(column, new_attribute_name='new_link', progress=True, max_hits=1, language='en', selection='first', confidence=0.3, support=5, min_similarity_score=0.5, caching=True)
Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin
- fit(X, y=None)
- transform(X, y=None)
- class kgextension.linking_sklearn.LabelLinker(column, new_attribute_name='new_link', progress=True, endpoint=<kgextension.sparql_helper.RemoteEndpoint object>, result_filter=None, language='en', max_hits=1, label_property='rdfs:label', prefix_lookup=False, caching=True)
Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin
- fit(X, y=None)
- transform(X, y=None)
- class kgextension.linking_sklearn.PatternLinker(column, new_attribute_name='new_link', progress=True, base_url='www.dbpedia.org/resource/', url_encoding=True, DBpedia_link_format=True)
Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin
- fit(X, y=None)
- transform(X, y=None)
- class kgextension.linking_sklearn.SameAsLinker(column, new_attribute_name='new_link', progress=True, endpoint=<kgextension.sparql_helper.RemoteEndpoint object>, result_filter=None, uri_data_model=False, prefix='', bundled_mode=True, prefix_lookup=False, caching=True)
Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin
- fit(X, y=None)
- transform(X, y=None)
kgextension.schema_matching module
- kgextension.schema_matching.label_schema_matching(df, endpoint=<kgextension.sparql_helper.RemoteEndpoint object>, uri_data_model=False, to_lowercase=True, remove_prefixes=True, remove_punctuation=True, prefix_threshold=1, progress=True, caching=True)
Schema matching method that compares the rdfs:label values of the links.
- Parameters
df (pd.DataFrame) – The dataframe where matching attributes are supposed to be found.
endpoint (Endpoint, optional) – SPARQL Endpoint to be queried. Defaults to DBpedia.
uri_data_model (bool, optional) – If enabled, the URI is directly queried instead of a SPARQL endpoint. Defaults to False.
to_lowercase (bool, optional) – Converts queried strings to lowercase. Defaults to True.
remove_prefixes (bool, optional) – Removes prefixes of queried strings. Defaults to True.
remove_punctuation (bool, optional) – Removes punctuation from queried strings. Defaults to True.
prefix_threshold (int, optional) – The number of occurrences after which a prefix is considered “common”. Defaults to 1.
progress (bool, optional) – If True, progress bars will be shown to inform the user about the progress made by the process (if “uri_data_model” = True). Defaults to True.
caching (bool, optional) – Turn result-caching for queries issued during the execution on or off. Defaults to True.
- Returns
Two columns with matching links and a third column with the overlapped label.
- Return type
pd.DataFrame
- kgextension.schema_matching.matching_combiner(matching_result_dfs, method='avg', columns=None, ignore_single_missings=False, weights=None, thresholds=None, merge_on=['uri_1', 'uri_2'])
Combines results of the schema matching functions into a single score per combination of attributes.
- Parameters
matching_result_dfs (list) – Results of the schema matching functions.
method (str/method, optional) – Function combining the individual scores. Defaults to “avg”.
columns (list, optional) – Columns of the input dataframes to take into account. If none are given automatically detects them from the input. Defaults to None.
ignore_single_missings (bool, optional) – If enabled, computes scores even if one of the values is missing. Defaults to False.
weights (list, optional) – Weights for weighting the different scores, if method = “weighted”. Defaults to None.
thresholds (float, optional) – Thresholds for thresholding the different scores, if method = “thresholding”. Defaults to None.
merge_on (list, optional) – Names of the columns on which the DataFrames in “matching_result_dfs” should be merged. Defaults to [“uri_1”, “uri_2”].
- Raises
ValueError – Raised if the input of “weights” or “thresholds” is not correct.
- Returns
DataFrame that contains the combined matching score for each URI-pair.
- Return type
pd.DataFrame
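How several matching scores for one URI pair might be combined is sketched below for the “avg” and “weighted” methods. This is an illustration, not matching_combiner: combine_scores is hypothetical, None stands in for a missing score, and the handling of missing values is an assumption based on the description of ignore_single_missings.

```python
def combine_scores(scores, method="avg", weights=None, ignore_single_missings=False):
    """Combine the scores of one URI pair into a single value (sketch)."""
    if method == "avg":
        weights = [1.0] * len(scores)
    elif method == "weighted":
        if weights is None or len(weights) != len(scores):
            raise ValueError("weights must contain one entry per score")
    else:
        raise ValueError(f"unsupported method in this sketch: {method}")

    present = [(s, w) for s, w in zip(scores, weights) if s is not None]
    if not present:
        return None
    if len(present) < len(scores) and not ignore_single_missings:
        # A missing score invalidates the combination unless told otherwise.
        return None
    total_weight = sum(w for _, w in present)
    return sum(s * w for s, w in present) / total_weight


average = combine_scores([1.0, 0.5])
weighted = combine_scores([1.0, 0.5], method="weighted", weights=[3, 1])
strict = combine_scores([1.0, None])
lenient = combine_scores([1.0, None], ignore_single_missings=True)
```

In the real function the scores come from the dataframes in matching_result_dfs, merged on the columns given in merge_on.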
- kgextension.schema_matching.relational_matching(df, endpoints=[<kgextension.sparql_helper.RemoteEndpoint object>, <kgextension.sparql_helper.RemoteEndpoint object>], uri_data_model=False, match_score=1, progress=True, caching=True)
Creates a mapping of matching attributes in the schema by checking for owl:sameAs, owl:equivalentClass, owl:Equivalent and wdt:P1628 links between them.
- Parameters
df (pd.DataFrame) – Dataframe where matching attributes are supposed to be found.
endpoints (list, optional) – SPARQL endpoints to be queried. Defaults to [DBpedia, WikiData].
uri_data_model (bool, optional) – If enabled, the URI is directly queried instead of a SPARQL endpoint. Defaults to False.
match_score (int, optional) – Score of the match: 0 < match_score <= 1. Defaults to 1.
progress (bool, optional) – If True, progress bars will be shown to inform the user about the progress made by the process (if “uri_data_model” = True). Defaults to True.
caching (bool, optional) – Turn result-caching for queries issued during the execution on or off. Defaults to True.
- Returns
Two columns with matching links and a third column with the score, which is always one in case of the relational matching unless specified otherwise.
- Return type
pd.DataFrame
- kgextension.schema_matching.string_similarity_matching(df, predicate='rdfs:label', to_lowercase=True, remove_prefixes=True, remove_punctuation=True, similarity_metric='norm_levenshtein', prefix_threshold=1, n=2, progress=True, caching=True)
Calculates the string similarity from the text field obtained by querying the attributes for the predicate, by default rdfs:label.
- Parameters
df (pd.DataFrame) – Dataframe where matching attributes are supposed to be found.
predicate (str, optional) – Defaults to “rdfs:label”.
to_lowercase (bool, optional) – Converts queried strings to lowercase. Defaults to True.
remove_prefixes (bool, optional) – Removes prefixes of queried strings. Defaults to True.
remove_punctuation (bool, optional) – Removes punctuation from queried strings. Defaults to True.
similarity_metric (str, optional) – Metric by which strings are compared. Defaults to “norm_levenshtein”.
prefix_threshold (int, optional) – The number of occurrences after which a prefix is considered “common”. Defaults to 1.
n (int, optional) – Parameter for n-gram and Jaccard similarities. Defaults to 2.
progress (bool, optional) – If True, progress bars will be shown to inform the user about the progress made by the process. Defaults to True.
caching (bool, optional) – Turn result-caching for queries issued during the execution on or off. Defaults to True.
- Returns
Two columns with matching links and a third column with the string similarity score.
- Return type
pd.DataFrame
- kgextension.schema_matching.value_overlap_matching(df, progress=True)
Schema matching method that calculates the similarity of link values.
- Parameters
df (pd.DataFrame) – The dataframe where matching attributes are supposed to be found.
progress (bool, optional) – If True, progress bars will be shown to inform the user about the progress made by the process. Defaults to True.
- Returns
Two columns with matching links and a third column with “value_overlap”.
- Return type
pd.DataFrame
kgextension.schema_matching_fusion_sklearn module
- class kgextension.schema_matching_fusion_sklearn.MatchingFuser(matching_functions, threshold=0.85, method='avg', columns=None, ignore_single_missings=False, weights=None, merge_on=['uri_1', 'uri_2'], boolean_method_single='provenance', boolean_method_multiple='voting', numeric_method_single='average', numeric_method_multiple='average', string_method_single='longest', string_method_multiple='longest', provenance_regex='http://dbpedia.org/', progress=True, caching=True)
Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin
- fit(X, y=None)
- transform(X, y=None)
kgextension.schema_matching_helper module
- kgextension.schema_matching_helper.calc_string_similarity(uri_1, uri_2, label_dict, metric='norm_levenshtein', n=2)
Calculates the string similarity between two strings based on various metrics. The strings are retrieved from a dictionary provided to the function.
- Parameters
uri_1 (str) – URI linked to the first string (used as key for the label_dict).
uri_2 (str) – URI linked to the second string (used as key for the label_dict).
label_dict (dict) – Dictionary mapping the provided URIs (keys) to their respective strings.
metric (str/method, optional) – Name of the metric that should be used for the similarity calculation. Defaults to “norm_levenshtein”.
n (int, optional) – n-Value for the metrics “ngram” and “jaccard”. Defaults to 2.
- Raises
ValueError – Raised if an unknown metric is provided.
- Returns
The similarity between the two strings.
- Return type
float
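A normalized Levenshtein similarity over labels looked up by URI can be sketched as follows. The normalization used here (one minus the edit distance divided by the length of the longer string) is a common definition but an assumption about what “norm_levenshtein” means in this package; the function names are hypothetical.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def label_similarity(uri_1, uri_2, label_dict):
    """Similarity in [0, 1] between the labels of two URIs (sketch)."""
    a, b = label_dict[uri_1], label_dict[uri_2]
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty labels are treated as identical
    return 1.0 - levenshtein(a, b) / longest


# Made-up labels for illustration.
labels = {"ex:a": "kitten", "ex:b": "sitting", "ex:c": "kitten"}
different = label_similarity("ex:a", "ex:b", labels)
identical = label_similarity("ex:a", "ex:c", labels)
```

Normalizing by the longer string keeps the score comparable across label pairs of different lengths.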
- kgextension.schema_matching_helper.clean_string(string, common_prefixes, to_lowercase=True, remove_prefixes=True, remove_punctuation=True)
Cleans a passed string by e.g. lowercasing it, stripping common prefixes from it and removing any punctuation.
- Parameters
string (str) – The string that should be cleaned.
common_prefixes (list) – A list containing all (common) prefixes that should be removed.
to_lowercase (bool, optional) – Indicates whether or not the string should be transformed to lowercase. Defaults to True.
remove_prefixes (bool, optional) – Indicates whether or not the string should be stripped from the specified common prefixes (of type: PREFIX:string). Defaults to True.
remove_punctuation (bool, optional) – Indicates whether or not all punctuation should be removed from the string. Defaults to True.
- Returns
The cleaned string.
- Return type
str
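The three cleaning steps can be sketched with the standard library. The snippet is not clean_string itself: the helper name clean, the assumption that common_prefixes holds prefix names without the trailing colon, and the order of the steps are illustrative choices.

```python
import string as string_module


def clean(text, common_prefixes, to_lowercase=True,
          remove_prefixes=True, remove_punctuation=True):
    """Strip a known PREFIX: part, lowercase and drop punctuation (sketch)."""
    if remove_prefixes:
        # Done first, because the colon would otherwise be removed as punctuation.
        for prefix in common_prefixes:
            if text.startswith(prefix + ":"):
                text = text[len(prefix) + 1:]
                break
    if to_lowercase:
        text = text.lower()
    if remove_punctuation:
        text = text.translate(str.maketrans("", "", string_module.punctuation))
    return text


cleaned = clean("dbo:Birth-Place", ["dbo"])
kept_prefix = clean("dbo:birthPlace", ["dbo"], remove_prefixes=False)
```

The second call shows why the order matters: without prefix removal, the colon is dropped as punctuation and the prefix is merged into the string.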
- kgextension.schema_matching_helper.get_common_prefixes(df, threshold, column_name='o')
Finds common string prefixes (of type PREFIX:string) in a column of a specified DataFrame. Creates a list of all prefixes that appear more often than the specified threshold.
- Parameters
df (pd.DataFrame) – The DataFrame containing the data.
threshold (int) – The threshold to filter uncommon prefixes.
column_name (str, optional) – Column name of the column containing the relevant strings. Defaults to “o”.
- Returns
A list of all prefixes (of type PREFIX:string) that appear more often than the specified threshold.
- Return type
list
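A rough sketch of the counting step, using a plain list of strings instead of a DataFrame column to stay dependency-free. The function name and the rule for skipping full URIs are assumptions made for this example.

```python
from collections import Counter


def common_prefixes_sketch(values, threshold):
    """Hypothetical helper: count PREFIX:string prefixes, keep the frequent ones."""
    counts = Counter(
        value.split(":", 1)[0]
        for value in values
        # Assumption: full URIs such as "http://..." are not prefixed names.
        if isinstance(value, str) and ":" in value and "://" not in value
    )
    # "More often than the threshold" is read as a strict comparison.
    return [prefix for prefix, count in counts.items() if count > threshold]


values = ["dbo:birthPlace", "dbo:deathPlace", "dbo:name", "foaf:name", "plain"]
prefixes = common_prefixes_sketch(values, threshold=2)
```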
- kgextension.schema_matching_helper.get_value_overlap(df, col_name_dict, uri_1, uri_2)
Calculates the ratio of overlapping values of two columns of a DataFrame, using row-wise comparison.
- Parameters
df (pd.DataFrame) – The DataFrame containing the rows that should be compared (with column names reduced to the URIs).
col_name_dict (dict) – Dictionary mapping the cleaned column names from the DataFrame to the full column names.
uri_1 (str) – Column name of the first column (just the URI).
uri_2 (str) – Column name of the second column (just the URI).
- Returns
Ratio of overlapping values in the two columns.
- Return type
float
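Row-wise comparison can be pictured with the sketch below, which works on two plain lists instead of DataFrame columns. How the package treats missing values is not documented here, so the sketch simply compares every pair of rows; the function name is hypothetical.

```python
def value_overlap_sketch(column_1, column_2):
    """Hypothetical helper: share of rows in which both columns hold the same value."""
    pairs = list(zip(column_1, column_2))
    if not pairs:
        return 0.0  # assumption: empty input counts as no overlap
    matches = sum(1 for left, right in pairs if left == right)
    return matches / len(pairs)


overlap = value_overlap_sketch(
    ["Berlin", "Paris", "Rome", "Oslo"],
    ["Berlin", "Lyon", "Rome", "Bergen"],
)
```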
kgextension.sparql_helper module
- class kgextension.sparql_helper.Endpoint
Bases: object
Base Endpoint class.
- class kgextension.sparql_helper.LocalEndpoint(file_path, file_format='auto')
Bases: kgextension.sparql_helper.Endpoint
LocalEndpoint class, that handles access to local RDF files.
- close()
Closing the LocalEndpoint, i.e. releasing the data from memory.
- initialize()
Initializing the LocalEndpoint, i.e. loading the data into memory.
- query(query)
Function to issue a query against a LocalEndpoint.
- Parameters
query (str) – SPARQL query.
- Returns
The query results as DataFrame.
- Return type
pd.DataFrame
- class kgextension.sparql_helper.RemoteEndpoint(url, timeout=60, requests_per_min=100000, retries=10, page_size=0, supports_bundled_mode=True, persistence_file_path='rate_limits.db', agent='sparqlwrapper 1.8.5 (rdflib.github.io/sparqlwrapper)')
Bases: kgextension.sparql_helper.Endpoint
RemoteEndpoint class, that handles remote SPARQL endpoints.
- kgextension.sparql_helper.endpoint_wrapper(query: str, endpoint: kgextension.sparql_helper.Endpoint, request_return_format='XML', verbose=False, return_XML=False, prefix_lookup=False, caching=True)
Wrapper function for sparql-querier and local rdf-files.
- Parameters
query (str) – Query that should be sent to the SPARQL endpoint
endpoint (Endpoint) – Link to the SPARQL endpoint that should be queried.
request_return_format (str, optional) – Requesting a specific return format from the SPARQL endpoint. Defaults to “XML”.
verbose (bool, optional) – Set to True to let the function print additional information about the returned data - for debugging and testing. Defaults to False.
return_XML (bool, optional) – If True, the XML results are returned instead of a DataFrame. Defaults to False.
prefix_lookup (bool/str/dict, optional) – True: Namespaces of prefixes will be looked up at prefix.cc and added to the sparql query. str: User provides the path to a json-file with prefixes and namespaces. dict: User provides a dictionary with prefixes and namespaces. Defaults to False.
caching (bool, optional) – Turn result caching on or off. Defaults to True.
- Returns
The query results in form of a DataFrame.
- Return type
pd.DataFrame
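When prefix_lookup is given as a dictionary, the effect on the query can be imagined roughly as in the sketch below. The helper name is hypothetical, and the assumption that declarations are simply prepended is not confirmed by this reference.

```python
def add_prefixes_sketch(query, prefixes):
    """Hypothetical helper: prepend one PREFIX declaration per dictionary entry."""
    declarations = [
        f"PREFIX {prefix}: <{namespace}>"
        for prefix, namespace in prefixes.items()
    ]
    return "\n".join(declarations + [query])


full_query = add_prefixes_sketch(
    "SELECT ?s WHERE { ?s a dbo:City } LIMIT 10",
    {"dbo": "http://dbpedia.org/ontology/"},
)
```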
- kgextension.sparql_helper.endpoint_wrapper_logic(query, endpoint, request_return_format, verbose, return_XML)
This is a helper function for “endpoint_wrapper”, split out for caching purposes. Not intended for end-user usage. #TODO: Find a cleaner solution?
- kgextension.sparql_helper.regex_string_generator(attribute, filters, logical_connective='OR')
#TODO
- Parameters
attribute ([type]) – [description]
filters ([type]) – [description]
logical_connective (str, optional) – [description]. Defaults to “OR”.
- Raises
ValueError – [description]
- Returns
[description]
- Return type
[type]
kgextension.sparql_helper_helper module
- kgextension.sparql_helper_helper.get_initial_query_limit(query: str)
Returns the LIMIT within a SPARQL query string.
- Parameters
query (str) – SPARQL query string.
- Returns
Limit or 0 if no limit.
- Return type
int
- kgextension.sparql_helper_helper.get_initial_query_offset(query: str)
Returns the OFFSET within a SPARQL query string.
- Parameters
query (str) – SPARQL query string.
- Returns
Offset or 0 if no offset.
- Return type
int
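Both helpers of this module can be approximated with a single regular expression, as in the sketch below. It assumes a flat query without subqueries (the first match wins), and the function name is hypothetical.

```python
import re


def find_keyword_value(query, keyword):
    """Hypothetical helper: return the integer after LIMIT/OFFSET, or 0 if absent."""
    match = re.search(rf"\b{keyword}\s+(\d+)", query, flags=re.IGNORECASE)
    return int(match.group(1)) if match else 0


query = "SELECT ?s WHERE { ?s ?p ?o } LIMIT 50 OFFSET 100"
limit = find_keyword_value(query, "LIMIT")
offset = find_keyword_value(query, "OFFSET")
```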
kgextension.uri_helper module
- kgextension.uri_helper.query_uri(uri, query_string, return_formats={'wikidata.org': 'n3'}, verbose=True, caching=True)
Queries a given dereferenceable URI with a given SPARQL query, without the need for a SPARQL endpoint.
- Parameters
uri (str) – Dereferenceable URI.
query_string (str) – SPARQL query (the URI should already be inserted via the VALUES statement).
return_formats (dict, optional) – Used to set specific return formats for data sources (if the default “application/rdf+xml” is not supported). For supported formats see: https://rdflib.readthedocs.io/en/stable/plugin_parsers.html. Defaults to {“wikidata.org”: “n3”}.
verbose (bool, optional) – Turn on/off warnings for likely malformed URIs. Defaults to True.
caching (bool, optional) – Turn result caching on or off. Defaults to True.
- Returns
Result of the SPARQL query issued against the URI. If the provided URI is NULL, an empty DataFrame is returned.
- Return type
pd.DataFrame
- kgextension.uri_helper.query_uri_logic(uri, query_string, return_format)
Parsing & querying logic of the “query_uri” function. Detached from the main function for caching purposes.
- Parameters
uri (str) – Dereferenceable URI.
query_string (str) – SPARQL query (the URI should already be inserted via the VALUES statement).
return_format (dict) – Used to set specific return formats for data sources (if the default “application/rdf+xml” is not supported). For supported formats see: https://rdflib.readthedocs.io/en/stable/plugin_parsers.html.
- Returns
Result of the SPARQL query issued against the URI.
- Return type
pd.DataFrame
- kgextension.uri_helper.uri_querier(df, column, query, regex_filter=None, return_formats={'wikidata.org': 'n3'}, verbose=True, caching=True, prefix_lookup=False, progress=True)
Wrapper function for the query_uri function. Queries each URI in a specified column of a DataFrame with a user-provided query and returns the results as one joint DataFrame.
- Parameters
df (pd.DataFrame) – DataFrame that contains the URIs that should be queried.
column (str) – Column in the specified DataFrame that contains the URIs that should be queried.
query (str) – The SPARQL query that’s used for querying the URIs. Has to contain a single placeholder (URI) in the VALUES statement. Example: “SELECT ?value ?p ?o WHERE {VALUES (?value) { (<URI>)} ?value ?p ?o }”
regex_filter (str, optional) – If set, just URIs matching the specified RegEx are queried. Defaults to None.
return_formats (dict, optional) – Used to set specific return formats for data sources (if the default “application/rdf+xml” is not supported). For supported formats see: https://rdflib.readthedocs.io/en/stable/plugin_parsers.html. Defaults to {“wikidata.org”: “n3”}.
verbose (bool, optional) – Turn on/off warnings for likely malformed URIs. Defaults to True.
caching (bool, optional) – Turn result caching on or off. Defaults to True.
prefix_lookup (bool/str/dict, optional) – True: Namespaces of prefixes will be looked up at prefix.cc and added to the sparql query. str: User provides the path to a json-file with prefixes and namespaces. dict: User provides a dictionary with prefixes and namespaces. Defaults to False.
progress (bool, optional) – If True, progress bars will be shown to inform the user about the progress made by the process. Defaults to True.
- Returns
Joint DataFrame that contains the query-results of all URIs.
- Return type
pd.DataFrame
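The placeholder mechanism can be sketched as shown below. It assumes that the literal text <URI> in the template is replaced with each URI wrapped in angle brackets and that regex_filter is applied as a search; both are guesses based on the parameter descriptions above, and the function name is hypothetical.

```python
import re


def build_queries_sketch(uris, query_template, regex_filter=None):
    """Hypothetical helper: create one query per URI that passes the filter."""
    queries = {}
    for uri in uris:
        if regex_filter is not None and not re.search(regex_filter, uri):
            continue  # URI does not match the filter, so it is not queried
        queries[uri] = query_template.replace("<URI>", f"<{uri}>")
    return queries


template = "SELECT ?value ?p ?o WHERE {VALUES (?value) { (<URI>)} ?value ?p ?o }"
queries = build_queries_sketch(
    ["http://dbpedia.org/resource/Berlin", "http://example.org/x"],
    template,
    regex_filter="dbpedia",
)
```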
kgextension.utilities module
- kgextension.utilities.check_uri_redirects(df, column, replace=True, custom_name_postfix=None, redirection_property='http://dbpedia.org/ontology/wikiPageRedirects', endpoint=<kgextension.sparql_helper.RemoteEndpoint object>, regex_filter='dbpedia', bundled_mode=True, uri_data_model=False, progress=True, caching=True)
Takes a column of URIs from a DataFrame and checks for each if it has a redirection set by the endpoint. If this is the case, the URI it redirects to is either added in a new column or replaces the original URI.
- Parameters
df (pd.DataFrame) – Dataframe for which the URIs should be inspected.
column (str) – Name of the column that contains the URIs that should be checked.
replace (bool, optional) – If True: URIs that get redirected will be replaced with the new URI; If False: A new column, containing the result for each URI, is added to the DataFrame. Defaults to True.
custom_name_postfix (str, optional) – Custom postfix for the newly created column (in case “replace” is set to False). Defaults to None.
redirection_property (str, optional) – Relation/Property URI that signals a redirect for this endpoint. Defaults to “http://dbpedia.org/ontology/wikiPageRedirects”.
endpoint (Endpoint, optional) – SPARQL Endpoint to be queried; ignored when “uri_data_model” = True. Defaults to DBpedia.
regex_filter (str, optional) – Just URIs matching the specified RegEx are checked for redirects. Defaults to “dbpedia”.
bundled_mode (bool, optional) – If True, all necessary queries are bundled into one query (using the VALUES method). Requires a SPARQL 1.1 implementation. Ignored when “uri_data_model” = True. Defaults to True.
uri_data_model (bool, optional) – If enabled, the URI is directly queried instead of a SPARQL endpoint. Defaults to False.
progress (bool, optional) – If True, progress bars will be shown to inform the user about the progress made by the process (if “uri_data_model” = True). Defaults to True.
caching (bool, optional) – Turn result-caching for queries issued during the execution on or off. Defaults to True.
- Raises
ValueError – Raised if ‘custom_name_postfix’ is set to “” instead of None.
- Returns
Returns dataframe with cleaned links / a new column.
- Return type
pd.DataFrame
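Bundled mode can be pictured as one query whose VALUES block lists every URI at once. The query shape, variable names, and function name below are assumptions made for illustration; the query the package actually issues may differ.

```python
def bundled_redirect_query_sketch(uris, redirection_property):
    """Hypothetical helper: build one SPARQL 1.1 query covering all URIs."""
    values = " ".join(f"(<{uri}>)" for uri in uris)
    return (
        "SELECT ?value ?redirect WHERE { "
        f"VALUES (?value) {{ {values} }} "
        f"?value <{redirection_property}> ?redirect }}"
    )


redirect_query = bundled_redirect_query_sketch(
    ["http://dbpedia.org/resource/NYC", "http://dbpedia.org/resource/Berlin"],
    "http://dbpedia.org/ontology/wikiPageRedirects",
)
```

Without bundling, each URI would need its own request, which is why bundled mode depends on VALUES support in the endpoint.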
- kgextension.utilities.link_validator(df, columns, purge=True, custom_name_postfix=None, fill_with=nan, caching=True, progress=True)
Takes a column of URLs / URIs from a DataFrame and checks for each if it is resolvable. If not it’s either replaced with some user-specified entry or a flag is added to a newly generated column.
- Parameters
df (pd.DataFrame) – Dataframe for which the links should be inspected.
columns (list) – List containing the names of the columns in the DataFrame, that contain the links.
purge (bool, optional) – If True: Links that are not resolvable will be replaced with “fill_with”; If False: A new column, containing the result for each link in boolean format, is added to the DataFrame. Defaults to True.
custom_name_postfix (str, optional) – Custom postfix for the newly created column (in case “purge” is set to False). Defaults to None.
fill_with (flexible, optional) – Specifies what unresolvable links should be replaced with (in case “purge” is set to True). Defaults to np.NaN.
caching (bool, optional) – Turn result caching on or off. Defaults to True.
progress (bool, optional) – If True, progress bars will be shown to inform the user about the progress made by the process. Defaults to True.
- Raises
ValueError – Raised if ‘custom_name_postfix’ is set to “” instead of None.
- Returns
Returns dataframe with cleaned links / a new column.
- Return type
pd.DataFrame
kgextension.utilities_helper module
- kgextension.utilities_helper.is_valid_url(url)
Checks if a URL is in proper format.
- Parameters
url (str) – The URL that should be checked.
- Returns
Result of the validity check in boolean form.
- Return type
bool
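A format check of this kind could look like the sketch below. It uses urllib.parse from the standard library and accepts only http and https URLs with a host part; the criteria the package actually applies are not documented here, and the function name is hypothetical.

```python
from urllib.parse import urlparse


def is_valid_url_sketch(url):
    """Hypothetical helper: True if the value looks like an http(s) URL with a host."""
    if not isinstance(url, str):
        return False
    try:
        parts = urlparse(url)
    except ValueError:
        # urlparse rejects some malformed inputs, e.g. broken IPv6 brackets.
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)


ok = is_valid_url_sketch("http://dbpedia.org/resource/Berlin")
bad = is_valid_url_sketch("dbpedia.org/resource/Berlin")
```

Such a check only inspects the string; whether the URL can actually be resolved is a separate question that needs a network request.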
- kgextension.utilities_helper.url_exists(url)
Checks if a URL is resolvable / existing.
- Parameters
url (str) – The URL that should be checked.
- Returns
Result of the resolvability check in boolean form.
- Return type
bool
kgextension.utilities_sklearn module
- class kgextension.utilities_sklearn.CheckUriRedirects(column, replace=True, custom_name_postfix=None, redirection_property='http://dbpedia.org/ontology/wikiPageRedirects', endpoint=<kgextension.sparql_helper.RemoteEndpoint object>, regex_filter='dbpedia', bundled_mode=True, uri_data_model=False, progress=True, caching=True)
Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin
- fit(X, y=None)
- transform(X, y=None)