Generators
Overview
The generators extract data from a knowledge graph in different ways. The newly generated features are added to the pandas DataFrame in form of additional columns. To connect to a knowledge graph, the input DataFrame should contain at least one column holding URIs.
Apart from the feature generation, most generators also have the option to create a hierarchy of the features generated, which can be used for later filtering with one of the Hierarchical Feature Selection functions.
Some of the generators support different result types:
Boolean: True if a certain caracteristic is fulfilled and false if not.
Counts: Count the occurrences of a certain characteristic
Relative counts: Count the occurrences of a certain characteristic in relation to the total number of occurrences.
TF-IDF: Similar feature as relative counts, but weighs the percentage with the overall occurrence of a characteristic in the dataset.
A detailed comparison of the characteristics of these strategies is given in [RP14].
The input to the generators is a pandas.DataFrame
with
at least one column containing URIs linking the entities to a knowledge graph.
Usually this columns is added to the DataFrame by a Linking function.
country |
link |
---|---|
Spain |
|
Japan |
|
Chile |
Data Properties Generator
kgextension.generator.data_properties_generator
The data properties generator creates an attribute for each literal value that the linked entity has. The following example uses the default parameter settings:
from kgextension.generator import data_properties_generator
df_data_properties = data_properties_generator(df_linked, "link")
country |
link |
link_data_http://dbpedia.org/ontology/currencyCode |
link_data_http://dbpedia.org/ontology/humanDevelopmentIndex |
---|---|---|---|
Spain |
EUR |
0.893 |
|
Japan |
JPY |
0.915 |
|
Chile |
CLP |
0.847 |
The column header of a new attribute shows the origin of its values. In the example link_data_http://dbpedia.org/ontology/currencyCode from the results, the prefix link indicates that the data was queried from the URIs in the column link, the prefix data stands for the generator function that was used and http://dbpedia.org/ontology/currencyCode is the URI of the respective property.
Following example shows the rdf-triple, from which the generator creates the attribute currencyCode with the value “EUR” for “Spain”:
<http://dbpedia.org/resource/Spain> <http://dbpedia.org/ontology/currencyCode> “EUR”
Type Filter
The data properties generator usually returns a large number of new attributes of different types. With the type filter it is possible to limit the results to certain datatypes or to explicitly exclude a datatype from the results.
In this example, only properties of the datatype xsd:double are added:
from kgextension.generator import data_properties_generator
df_data_properties = data_properties_generator(
df_linked, "link",
type_filter="xsd:double"
)
A datatype can be excluded by prepending a “- “:
from kgextension.generator import data_properties_generator
df_data_properties = data_properties_generator(
df_linked, "link",
type_filter="- xsd:string"
)
Direct Type Generator
kgextension.generator.direct_type_generator
The direct type generator extracts the type(s) of the linked ressources (using rdf:type). The resulting types are added as new columns, which are filled either with a boolean indicator or a count. The following example uses the default parameter settings:
from kgextension.generator import direct_type_generator
df_direct_type = direct_type_generator(df_linked, "link")
country |
link |
link_type_http://dbpedia.org/ontology/Country |
link_type_http://dbpedia.org/class/yago/WikicatCountriesInSouthAmerica |
---|---|---|---|
Spain |
True |
False |
|
Japan |
True |
False |
|
Chile |
True |
True |
Following example shows the rdf-triple, from which the generator creates the attribute WikicatCountriesInSouthAmerica being True for “Chile”:
<http://dbpedia.org/resource/Chile> rdf:Type <http://dbpedia.org/class/yago/WikicatCountriesInSouthAmerica>
Unqualified Relation Generator
kgextension.generator.unqualified_relation_generator
The unqualified relation generator creates attributes from the existence of relations. The following example uses the default parameter settings:
from kgextension.generator import unqualified_relation_generator
df_unqualified_relation = unqualified_relation_generator(df_linked, "link")
country |
link |
Link_Out_boolean_http://dbpedia.org/ontology/capital |
Link_Out_boolean_http://dbpedia.org/property/nativeName |
---|---|---|---|
Spain |
True |
False |
|
Japan |
True |
True |
|
Chile |
True |
False |
Following example shows the rdf-triples, from which the generator creates the attribute Capital being True for “Japan”:
<http://dbpedia.org/resource/Japan> <http://dbpedia.org/ontology/Capital> <http://dbpedia.org/resource/Tokyo>
The value is set to True, because the relation <http://dbpedia.org/ontology/Capital> exists for <http://dbpedia.org/resource/Japan>.
Qualified Relation Generator
kgextension.generator.qualified_relation_generator
The qualified relation generator creates attributes from the existence of relations and takes also the types of the related resources into account. The following example uses the default parameter settings:
from kgextension.generator import qualified_relation_generator
df_qualified_relation = qualified_relation_generator(df_linked, "link")
country |
link |
Link_Out_boolean_http://dbpedia.org/ontology/currency_type_http://dbpedia.org/class/yago/WikicatCurrenciesOfAsia |
Link_Out_boolean_http://dbpedia.org/ontology/capital_type_http://dbpedia.org/class/yago/WikicatCapitalsInEurope |
---|---|---|---|
Spain |
False |
True |
|
Japan |
True |
False |
|
Chile |
False |
False |
Following example shows the rdf-triples, from which the generator creates the attribute Capital_type_WikicatCapitalsInEurope being True for “Spain”:
<http://dbpedia.org/resource/Spain> <http://dbpedia.org/ontology/Capital> <http://dbpedia.org/resource/Madrid>
<http://dbpedia.org/resource/Madrid> rdf:type <http://dbpedia.org/class/yago/WikicatCapitalsInEurope>
The value is set to True, because <http://dbpedia.org/resource/Madrid> is the <http://dbpedia.org/ontology/Capital> of “Spain” and is of type <http://dbpedia.org/class/yago/WikicatCapitalsInEurope>.
Specific Relation Generator
kgextension.generator.specific_relation_generator
The specific relation generator creates attributes from a specific direct relation (default = “http://purl.org/dc/terms/subject”). The following example uses the default parameter settings:
from kgextension.generator import specific_relation_generator
df_specific_relation = specific_relation_generator(df_linked, "link")
country |
link |
link_in_boolean_http://dbpedia.org/resource/Category:Former_Spanish_colonies |
link_in_boolean_http://dbpedia.org/resource/Category:East_Asian_countries |
---|---|---|---|
Spain |
False |
False |
|
Japan |
False |
True |
|
Chile |
True |
False |
Following example shows the rdf-triple, from which the generator creates the attribute Category:Former_Spanish_colonies being True for “Chile”:
<http://dbpedia.org/resource/Chile> <http://purl.org/dc/terms/subject> <http://dbpedia.org/resource/Category:East_Asian_countries>
Hierarchy Relation
With the specific relation generator it is also possible to create a hierarchy of the attributes with a user-defined hierarchy relation. The resulting hierarchy is appended to the DataFrame.
from kgextension.generator import specific_relation_generator
df_specific_relation = specific_relation_generator(
df_linked, "link",
hierarchy_relation="http://www.w3.org/2004/02/skos/core#broader"
)
Custom SPARQL Generator
kgextension.generator.custom_sparql_generator
The custom SPARQL generator allows you to define your own SPARQL query and creates additional attributes from the query results. Within the SPARQL query you can use links generated by linkers as placeholders, enclosed in asterisks. In the following example the Gini coefficient of the countries in the DataFrame is queried. When the query is executed, the placeholder *link* in the query is replaced by the value in the column link of the respective entity.
from kgextension.generator import custom_sparql_generator
query = "select ?gini where {*link* <http://dbpedia.org/ontology/giniCoefficient> ?gini}"
df_custom_sparql = custom_sparql_generator(df_linked, "link", query)
country |
link |
gini |
---|---|---|
Spain |
33.0 |
|
Japan |
33.9 |
|
Chile |
44.4 |
- RP14
Petar Ristoski and Heiko Paulheim. A comparison of propositionalization strategies for creating features from linked open data. In Ilaria Tiddi, editor, LD4KD 2014 : Proceedings of the 1st Workshop on Linked Data for Knowledge Discovery co-located with European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD 2014); Nancy, France, Sept. 19th, 2014, volume 1232, Paper 1. Aachen, 2014. RWTH. URL: https://madoc.bib.uni-mannheim.de/37028/.