Schema Matching & Fusion

Idea

When data from several sources is combined, the same entity may be represented by two or more URIs. Matching and fusing the data avoids this duplication.

Basics

The schema matching functions create a mapping of matching attributes in the schema, each based on a different metric. The matching scores from several functions can be merged by the combiner function into a single score per combination of attributes. The final mappings can then be fused by the data fuser function, with the fusion strategy selected separately for boolean, numeric and string values.

  • Types of Schema Matching Functions

    • relational_matching

    • label_schema_matching

    • string_similarity_matching

    • value_overlap_matching

  • Combiner Function

    • matching_combiner

  • Data Fusion Functions

    • get_fusion_clusters

    • data_fuser

Matching

Relational Matching

kgextension.schema_matching.relational_matching

The relational matching function determines whether two different resources refer to the same real-world object by querying for sameAs, equivalentClass or equivalent-property links between them. The matching information is obtained directly from Linked Open Data. The queried predicate path is owl:equivalentProperty|owl:equivalentClass|owl:sameAs|wdt:P1628.

from kgextension.schema_matching import relational_matching

df_relational_matcher = relational_matching(
   df, endpoints=[DBpedia, WikiData], uri_data_model=False,
   match_score=1, progress=True, caching=True
)

The output DataFrame, shown in the example below, contains every pairwise combination of URIs. With the default match_score=1, a pair of URIs that refer to the same object gets the value 1; every other pair gets 0.

uri_1                                     uri_2                                value
http://dbpedia.org/ontology/Organisation  http://schema.org/Organization       1
http://dbpedia.org/ontology/country       http://schema.org/Organization       0
http://dbpedia.org/ontology/Organisation  http://dbpedia.org/ontology/country  0
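The pairwise layout of this output can be pictured with a small sketch. It is not the library's implementation: the helper name pairwise_link_scores and the hard-coded list of equivalence links are assumptions made for illustration, whereas the real function obtains the links by querying a SPARQL endpoint.

```python
from itertools import combinations


def pairwise_link_scores(uris, equivalent_links, match_score=1):
    """Score every unordered pair of URIs: match_score if linked, else 0.

    `equivalent_links` is a hypothetical stand-in for the links that would
    otherwise be queried from Linked Open Data.
    """
    # Store links as frozensets so that the direction of a link is irrelevant.
    links = {frozenset(link) for link in equivalent_links}
    rows = []
    for uri_1, uri_2 in combinations(uris, 2):
        value = match_score if frozenset((uri_1, uri_2)) in links else 0
        rows.append((uri_1, uri_2, value))
    return rows


uris = [
    "http://dbpedia.org/ontology/Organisation",
    "http://dbpedia.org/ontology/country",
    "http://schema.org/Organization",
]
# Assumed link, taken from the example table above.
links = [("http://dbpedia.org/ontology/Organisation",
          "http://schema.org/Organization")]

for row in pairwise_link_scores(uris, links):
    print(row)
```

Three URIs yield three unordered pairs, and only the linked pair receives the match score.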

Label Schema Matching

kgextension.schema_matching.label_schema_matching

The label schema matching function queries the labels of the two URIs in each combination and compares them. The queried predicate is rdfs:label.

from kgextension.schema_matching import label_schema_matching

df_label_matcher = label_schema_matching(
   df, endpoint=DBpedia, uri_data_model=False, to_lowercase=True,
   remove_prefixes=True, remove_punctuation=True, prefix_threshold=1,
   progress=True, caching=True
)

The queried text can be preprocessed before comparison to improve matching accuracy. The available preprocessing options are to_lowercase (convert all letters to lowercase), remove_prefixes (remove prefixes such as “Category:” from the label) and remove_punctuation (remove all punctuation from the string).
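The effect of the three preprocessing options can be sketched in plain Python. The function name preprocess_label is hypothetical, and treating a prefix as "one word followed by a colon" is an assumption; the library may detect prefixes differently (for example via prefix_threshold).

```python
import re
import string


def preprocess_label(label, to_lowercase=True, remove_prefixes=True,
                     remove_punctuation=True):
    """Normalise a label before comparison (illustrative sketch only)."""
    if remove_prefixes:
        # Assumption: a prefix is a single word followed by a colon,
        # for example "Category:".
        label = re.sub(r"^\w+:", "", label)
    if to_lowercase:
        label = label.lower()
    if remove_punctuation:
        # Drop every ASCII punctuation character.
        label = label.translate(str.maketrans("", "", string.punctuation))
    return label.strip()


print(preprocess_label("Category:Non-profit Organisations"))
```

After this normalisation, labels that differ only in case, prefix or punctuation compare as equal.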

The output DataFrame, shown in the example below, contains every pairwise combination of URIs. A pair whose labels are identical gets the value 1; every other pair gets 0.

uri_1                                     uri_2                                same_label
http://dbpedia.org/ontology/Organisation  http://schema.org/Organization       1
http://dbpedia.org/ontology/country       http://schema.org/Organization       0
http://dbpedia.org/ontology/Organisation  http://dbpedia.org/ontology/country  0

String Similarity Matching

kgextension.schema_matching.string_similarity_matching

The string similarity matching function queries a text field for each attribute via the given predicate and calculates the string similarity for each combination. The available metrics are Norm Levenshtein, Partial Levenshtein, Token Sort Levenshtein, Token Set Levenshtein, N-gram and Jaccard. The default predicate is rdfs:label.
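As an illustration of what a normalised Levenshtein similarity measures, here is a minimal pure-Python sketch. The function names are hypothetical, and dividing by the length of the longer string is an assumed normalisation; the library's own metric may be normalised differently, so the numbers need not agree with its output.

```python
def levenshtein_distance(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn `a` into `b`."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def normalized_levenshtein(a, b):
    """Similarity in [0, 1]; 1.0 means the strings are identical."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / max(len(a), len(b))


print(levenshtein_distance("kitten", "sitting"))
print(normalized_levenshtein("organisation", "organization"))
```

The second call compares two spellings that differ in a single character, so the similarity is close to 1.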

Default

from kgextension.schema_matching import string_similarity_matching

df_string_similarity_matcher = string_similarity_matching(
   df, predicate="rdfs:label", to_lowercase=True, remove_prefixes=True,
   remove_punctuation=True, similarity_metric="norm_levenshtein",
   prefix_threshold=1, n=2, progress=True, caching=True
)

The queried text can be preprocessed before comparison to improve matching accuracy. The available preprocessing options are to_lowercase (convert all letters to lowercase), remove_prefixes (remove prefixes such as “Category:” from the label) and remove_punctuation (remove all punctuation from the string).

With the default settings, the output DataFrame is:

uri_1                                     uri_2                                value_string
http://dbpedia.org/ontology/Organisation  http://dbpedia.org/ontology/country  0.52
http://dbpedia.org/ontology/Organisation  http://schema.org/Organization       NaN
http://dbpedia.org/ontology/country       http://schema.org/Organization       NaN

Note

The value_string is null if the queried predicate is missing for at least one URI of a combination. In the example above, http://schema.org/Organization has no rdfs:label.

Other Similarity Metrics

The parameter n sets the n-gram size for the metrics “ngram” and “jaccard”. It defaults to 2.
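The role of n can be shown with a short sketch of a Jaccard similarity over character n-grams. The helper names are hypothetical, and building the sets from character n-grams is an assumption; the library may tokenise differently.

```python
def char_ngrams(text, n=2):
    """Return the set of character n-grams contained in `text`."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard_similarity(a, b, n=2):
    """Jaccard similarity of the two n-gram sets: |A & B| / |A | B|."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    union = grams_a | grams_b
    if not union:
        # Both strings are shorter than n, so there is nothing to compare.
        return 0.0
    return len(grams_a & grams_b) / len(union)


print(sorted(char_ngrams("night", n=2)))
print(jaccard_similarity("night", "nacht", n=2))
print(jaccard_similarity("night", "nacht", n=1))
```

A smaller n makes the sets overlap more easily: with n=2 the two words share only the bigram "ht", while with n=1 they share three of seven distinct characters.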

similarity_metric="partial_levenshtein"

uri_1                                     uri_2                                value_string
http://dbpedia.org/ontology/Organisation  http://dbpedia.org/ontology/country  0.45

similarity_metric="token_sort_levenshtein"

uri_1                                     uri_2                                value_string
http://dbpedia.org/ontology/Organisation  http://dbpedia.org/ontology/country  0.32

similarity_metric="token_set_levenshtein"

uri_1                                     uri_2                                value_string
http://dbpedia.org/ontology/Organisation  http://dbpedia.org/ontology/country  0.32

similarity_metric="ngram"

uri_1                                     uri_2                                value_string
http://dbpedia.org/ontology/Organisation  http://dbpedia.org/ontology/country  1.0

similarity_metric="jaccard"

uri_1                                     uri_2                                value_string
http://dbpedia.org/ontology/Organisation  http://dbpedia.org/ontology/country  0.0

Value Overlap Matching

kgextension.schema_matching.value_overlap_matching

The value overlap matching function calculates the ratio of overlapping values between two columns of a DataFrame, comparing them row by row. The overlap is the number of equal values divided by the total number of entity values.

from kgextension.schema_matching import value_overlap_matching

df_value_matcher = value_overlap_matching(
   df, progress=True
)

uri_1                                     uri_2                                value_overlap
http://dbpedia.org/ontology/Organisation  http://dbpedia.org/ontology/country  0.75
http://dbpedia.org/ontology/country       http://schema.org/Organization       0.75
http://dbpedia.org/ontology/Organisation  http://schema.org/Organization       1.00
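The row-wise overlap ratio can be sketched for two plain lists standing in for DataFrame columns. The function name value_overlap and the sample values are made up for illustration, and the sketch does not model how the library treats missing values.

```python
def value_overlap(column_a, column_b):
    """Share of rows in which both columns hold the same value."""
    if len(column_a) != len(column_b):
        raise ValueError("columns must have the same number of rows")
    if not column_a:
        return 0.0
    equal_rows = sum(1 for a, b in zip(column_a, column_b) if a == b)
    return equal_rows / len(column_a)


# Invented sample data: the two columns agree in three of four rows.
column_1 = ["Siemens", "SAP", "Bosch", "BMW"]
column_2 = ["Siemens", "SAP", "Bosch", "Germany"]
print(value_overlap(column_1, column_2))
```

Because the comparison is row-wise, the same values in a different row order would not count as overlap.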

Combine Matchings

Matching Combiner

kgextension.schema_matching.matching_combiner

The matching combiner merges the results of the schema matching functions into a single similarity score per combination of attributes. There are five methods for combining the individual scores: Maximum, Minimum, Average, Weighted and Thresholding.

The examples below use the result DataFrames of the schema matching functions above, with default settings, as input.

Default: Method-Average

from kgextension.schema_matching import matching_combiner

df_combiner = matching_combiner(
   matching_result_dfs=[df_relational_matcher, df_label_matcher,
    df_string_similarity_matcher, df_value_matcher],
   method="avg", columns=None,
   ignore_single_missings=False, weights=None,
   thresholds=None, merge_on=["uri_1", "uri_2"]
)

This method calculates the mean of the scores from all input matching result DataFrames and stores it in the column “result”. The output DataFrame has the same layout as the results of the schema matching functions:

uri_1                                     uri_2                                result
http://dbpedia.org/ontology/Organisation  http://dbpedia.org/ontology/country  0.423333
http://dbpedia.org/ontology/country       http://schema.org/Organization       0.375000
http://dbpedia.org/ontology/Organisation  http://schema.org/Organization       1.000000

Other Methods

method="max"

This method takes the maximum of the scores from all input matching result DataFrames and stores it in the column “result”.

uri_1                                     uri_2                                result
http://dbpedia.org/ontology/Organisation  http://dbpedia.org/ontology/country  0.75
http://dbpedia.org/ontology/country       http://schema.org/Organization       0.75
http://dbpedia.org/ontology/Organisation  http://schema.org/Organization       1.00

method="min"

This method takes the minimum of the scores from all input matching result DataFrames and stores it in the column “result”.

uri_1                                     uri_2                                result
http://dbpedia.org/ontology/Organisation  http://dbpedia.org/ontology/country  0
http://dbpedia.org/ontology/country       http://schema.org/Organization       0
http://dbpedia.org/ontology/Organisation  http://schema.org/Organization       1
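The maximum and minimum methods can be sketched as a reduction over the scores that the individual matchers produced for one pair of URIs. The function name combine_extreme is hypothetical, and skipping missing (NaN) scores is an assumption; it is consistent with the example tables above but not confirmed for the library in general.

```python
import math


def combine_extreme(scores, method="max"):
    """Reduce one pair's matcher scores to their maximum or minimum.

    Missing scores (NaN) are skipped, which is an assumption of this sketch.
    """
    present = [score for score in scores if not math.isnan(score)]
    if not present:
        return math.nan
    if method == "max":
        return max(present)
    if method == "min":
        return min(present)
    raise ValueError(f"unknown method: {method}")


# Scores in the order relational, label, string similarity, value overlap,
# taken from the example outputs of the matching functions above.
organisation_country = [0, 0, 0.52, 0.75]
country_schema_org = [0, 0, math.nan, 0.75]

print(combine_extreme(organisation_country, "max"))
print(combine_extreme(country_schema_org, "min"))
```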

The Weighted and Thresholding methods require the user to supply one weight or one threshold per input matching result DataFrame.

method="weighted", weights=[0.2,0.2,0.4,0.2]

The result is the sum of the scores from the input matching result DataFrames, each multiplied by its weight.

uri_1                                     uri_2                                result
http://dbpedia.org/ontology/Organisation  http://dbpedia.org/ontology/country  0.358000
http://dbpedia.org/ontology/country       http://schema.org/Organization       NaN
http://dbpedia.org/ontology/Organisation  http://schema.org/Organization       NaN
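The weighted result for the first row can be retraced by hand. The sketch below uses a hypothetical helper, weighted_score, and assumes that a single missing score makes the whole result NaN, as the table above suggests for the default settings.

```python
import math


def weighted_score(scores, weights):
    """Weighted sum of one pair's scores; NaN if any score is missing."""
    if len(scores) != len(weights):
        raise ValueError("one weight per score is required")
    if any(math.isnan(score) for score in scores):
        return math.nan
    return sum(score * weight for score, weight in zip(scores, weights))


weights = [0.2, 0.2, 0.4, 0.2]

# Scores in the order relational, label, string similarity, value overlap.
print(weighted_score([0, 0, 0.52, 0.75], weights))      # about 0.358
print(weighted_score([1, 1, math.nan, 1], weights))     # nan
```

For the first pair this is 0 × 0.2 + 0 × 0.2 + 0.52 × 0.4 + 0.75 × 0.2 = 0.358, up to floating-point rounding.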

method="thresholding", thresholds=[0.7,0.7,0.7,0.7]

The result is the number of input matching result DataFrames whose score is greater than or equal to the corresponding threshold.

uri_1                                     uri_2                                result
http://dbpedia.org/ontology/Organisation  http://dbpedia.org/ontology/country  1
http://dbpedia.org/ontology/country       http://schema.org/Organization       NaN
http://dbpedia.org/ontology/Organisation  http://schema.org/Organization       NaN

Users can also set ignore_single_missings=True so that missing individual scores are ignored and no null similarity value appears in the final result.

method="thresholding", thresholds=[0.7,0.7,0.7,0.7], ignore_single_missings=True

Then the result DataFrame would be:

uri_1                                     uri_2                                result
http://dbpedia.org/ontology/Organisation  http://dbpedia.org/ontology/country  1
http://dbpedia.org/ontology/country       http://schema.org/Organization       1
http://dbpedia.org/ontology/Organisation  http://schema.org/Organization       3
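The thresholding counts, with and without ignore_single_missings, can be retraced with a small sketch. The function name threshold_count is hypothetical, and the NaN handling is inferred from the two tables above rather than taken from the library's source.

```python
import math


def threshold_count(scores, thresholds, ignore_single_missings=False):
    """Count the scores that reach their threshold.

    By default a missing score (NaN) makes the result NaN; with
    ignore_single_missings=True, missing scores are skipped instead.
    """
    if len(scores) != len(thresholds):
        raise ValueError("one threshold per score is required")
    count = 0
    for score, threshold in zip(scores, thresholds):
        if math.isnan(score):
            if ignore_single_missings:
                continue
            return math.nan
        if score >= threshold:
            count += 1
    return count


thresholds = [0.7, 0.7, 0.7, 0.7]
organisation_schema_org = [1, 1, math.nan, 1]

print(threshold_count(organisation_schema_org, thresholds))
print(threshold_count(organisation_schema_org, thresholds,
                      ignore_single_missings=True))
```

For the pair Organisation / Organization, three of the four matchers report a score of 1 and the string similarity is missing, which gives NaN by default and 3 once the missing score is ignored.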

Fusion

Get Fusion Clusters

kgextension.fusion.get_fusion_clusters

The get fusion clusters function groups matching column names into clusters (sets) according to a user-defined threshold. The input DataFrame should be the result of the Matching Combiner function. For example, the pairs {car, auto} and {car, automobile} would be clustered into the set {car, auto, automobile} (if both pairs have a similarity ≥ the specified threshold).

from kgextension.fusion import get_fusion_clusters

clusters = get_fusion_clusters(
   df_combiner, threshold=0.85, progress=True
)

In our example, the function returns:

[{'http://dbpedia.org/ontology/Organisation',
  'http://schema.org/Organization'}]
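The clustering step amounts to merging all pairs at or above the threshold into connected groups. The sketch below shows one way to do this on plain tuples; the function name cluster_pairs is hypothetical and the code is not the library's implementation.

```python
def cluster_pairs(scored_pairs, threshold):
    """Merge all pairs scoring >= threshold into disjoint clusters (sets)."""
    clusters = []
    for name_1, name_2, score in scored_pairs:
        # `not score >= threshold` also skips NaN scores.
        if not score >= threshold:
            continue
        merged = {name_1, name_2}
        remaining = []
        for cluster in clusters:
            if cluster & merged:
                merged |= cluster          # pair touches this cluster: absorb it
            else:
                remaining.append(cluster)  # unrelated cluster stays as it is
        remaining.append(merged)
        clusters = remaining
    return clusters


# The car example from the text; the scores are invented for illustration.
pairs = [("car", "auto", 0.9), ("car", "automobile", 0.95), ("car", "bike", 0.2)]
print(cluster_pairs(pairs, threshold=0.85))
```

Because "car" appears in both qualifying pairs, the two pairs end up in a single cluster of three names, while the low-scoring pair is dropped.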

Data Fuser

kgextension.fusion.data_fuser

The data fuser function fuses the columns in each matching set of the clusters. The fuser metric can be selected separately for boolean, numeric and string values, as shown in the table below. Other existing or user-defined functions can also be passed, provided they are applicable to pd.DataFrame.apply(axis=1). The final output is a DataFrame that contains no more than one URI for each entity.

from kgextension.fusion import data_fuser

df_fused = data_fuser(
   df, clusters, boolean_method_single="provenance",
   boolean_method_multiple="voting", numeric_method_single="average",
   numeric_method_multiple="average", string_method_single="longest",
   string_method_multiple="longest", provenance_regex="http://dbpedia.org/",
   progress=True
)

Fuser Metrics for Different Type and Size Matchers

The following table lists which fuser metrics are available for each data type and match size.

Data Type                   Boolean     Numeric     String
Fuser Metrics               First       Minimum     First
                            Last        Maximum     Last
                            Random      Average     Longest
                            Provenance  Random      Shortest
                                        Provenance  Random
                                                    Provenance
Only for Multiple Matchers  Voting      Voting      Voting
                                        Median

Note

The function asserts that the metrics Voting and Median are not applied to single matches (a pair).
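To make the metrics more concrete, the sketch below fuses the values that one row holds in the columns of a cluster. The function name fuse_values is hypothetical, Random and Provenance are left out, missing values are not handled, and the assertion only mirrors the restriction described in the note; none of this is taken from the library's source.

```python
from collections import Counter
from statistics import mean, median


def fuse_values(values, method):
    """Fuse the values of one row across the matched columns of a cluster."""
    if method in ("voting", "median"):
        # Mirrors the note above: not applicable to a single match (a pair).
        assert len(values) > 2, f"'{method}' needs more than two matched columns"
    if method == "first":
        return values[0]
    if method == "last":
        return values[-1]
    if method == "longest":
        return max(values, key=len)
    if method == "shortest":
        return min(values, key=len)
    if method == "average":
        return mean(values)
    if method == "median":
        return median(values)
    if method == "voting":
        # Most frequent value wins.
        return Counter(values).most_common(1)[0][0]
    raise ValueError(f"unknown method: {method}")


print(fuse_values(["Org", "Organisation"], "longest"))
print(fuse_values([10, 20], "average"))
print(fuse_values([True, False, True], "voting"))
```

A user-defined fusion function would follow the same shape: it receives the values of one row and returns the single fused value.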