How to control Semantic Spam

Spam is a broad term, used mostly for email and blog comments, but everyone grasps what it means. Generally, spam is “undesired electronic content”. The definition is of course not absolute, as something might be undesired for some but not for most (or was it vice versa?). In our everyday lives we trust spam filters, which generally do the job well, but from time to time messages we would have wanted to receive are still removed by the filter. Then we customize the filter and train it, and we accept that spam is not always spam and vice versa. It is individual and diverse, and it does not necessarily come from ill-meaning or unethical individuals.
So what does spam have to do with Linked Data publishing? Let's look at an example.

SemSpam1

In a Linked Data browser, the describe page of a person or thing can list an enormous number of items, which obscures the view. As an example, look at the view of Lionel Messi in DBpedia. I would like to focus on the inferred and materialized semantics about the “Thing”, i.e. its classifications, which are asserted as rdf:type properties.

Everyone obviously looks at this list differently, depending on why they are on this page. But some concepts really stick out as not very useful knowledge. Concepts like Whole, Winner, Citizen, Medalist, YagoLegalActor, Contestant and Player look vague and carry no really useful semantics. On the other hand, concepts like PeopleFromRosarioSantaFe and 2007CopaAmericaPlayer look very specific and are hardly of common interest to a wide group of linked data consumers. Then there is a third category, like BasketballPlayer, which makes you doubt that the tagging is correct. I would be surprised to see Messi in the same list as Kobe Bryant!

Another type of semantic spam, in the form of redundancy, is parallel ontologies. For example, you find SoccerPlayer and Athlete type assertions from both the dbpedia and umbel.org ontologies. Both foaf and schema.org define Person.
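As a rough illustration of this kind of redundancy, the Python sketch below groups rdf:type values by their local name and reports the names that appear in more than one namespace. The function names and the sample list are made up for this example, and matching on the local name is only a heuristic: two classes with the same name are not guaranteed to mean the same thing.

```python
from collections import defaultdict
from typing import Dict, List

# Illustrative sample of rdf:type values for one resource (not a real query result).
TYPES = [
    "http://dbpedia.org/ontology/SoccerPlayer",
    "http://umbel.org/umbel/rc/SoccerPlayer",
    "http://dbpedia.org/ontology/Athlete",
    "http://umbel.org/umbel/rc/Athlete",
    "http://xmlns.com/foaf/0.1/Person",
    "http://schema.org/Person",
    "http://dbpedia.org/ontology/Agent",
]


def local_name(uri: str) -> str:
    """Return the part of the URI after the last '/' or '#'."""
    return uri.rstrip("/#").replace("#", "/").rsplit("/", 1)[-1]


def parallel_types(type_uris: List[str]) -> Dict[str, List[str]]:
    """Group URIs by local name and keep only names used by several URIs."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for uri in type_uris:
        groups[local_name(uri)].append(uri)
    return {name: uris for name, uris in groups.items() if len(uris) > 1}


if __name__ == "__main__":
    for name, uris in parallel_types(TYPES).items():
        print(name, "->", uris)
```

A result like this only flags candidates; whether two same-named classes are truly redundant still needs a human decision.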

I appreciate that there is a reason why these concepts are defined and asserted, and that for some people under some circumstances they may be relevant and useful. However, my purpose in visiting this describe page is to find a sample soccer player that I can use to understand the ontology describing soccer players in general, and then to construct a SPARQL query that fetches data for some analysis or aggregated information about them.

So for me these assertions are semantic spam. They obscure my view of the relevant concepts and properties, and browsing through the long list wastes my time.

How would I like to deal with it? I would like to apply my spam filter.

Firstly, I would like those additional assertions to come from a separate named graph, and I would like the choice to see only triples from the named graphs I am interested in. In the case of DBpedia, they are all mixed together in <dbpedia.org>. I think that, if kept in a separate context, assertions of this type, originating from some specific need, can be very useful. I have used this mechanism of typing things of local interest for query performance optimization and for efficiently cutting a semantic subset out of a huge graph of billions of facts. But those assertions are then kept in a “sandbox-like” named graph, private to a person or a team.
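To make the named graph idea concrete, here is a minimal Python sketch that treats each statement as a quad (subject, predicate, object, graph) and keeps only the quads from the graphs the reader has chosen. The graph names, the two sandbox type URIs and the function name are hypothetical; DBpedia does not actually separate its assertions this way.

```python
from typing import Iterable, List, NamedTuple, Set

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
MESSI = "http://dbpedia.org/resource/Lionel_Messi"

# Hypothetical graph names, assumed only for this example.
CORE_GRAPH = "http://example.org/graph/core"
SANDBOX_GRAPH = "http://example.org/graph/team-sandbox"


class Quad(NamedTuple):
    subject: str
    predicate: str
    obj: str
    graph: str


# Illustrative data: core types in one graph, special-purpose types in a sandbox graph.
QUADS = [
    Quad(MESSI, RDF_TYPE, "http://dbpedia.org/ontology/SoccerPlayer", CORE_GRAPH),
    Quad(MESSI, RDF_TYPE, "http://xmlns.com/foaf/0.1/Person", CORE_GRAPH),
    Quad(MESSI, RDF_TYPE, "http://example.org/types/Contestant", SANDBOX_GRAPH),
    Quad(MESSI, RDF_TYPE, "http://example.org/types/2007CopaAmericaPlayer", SANDBOX_GRAPH),
]


def filter_by_graph(quads: Iterable[Quad], allowed_graphs: Set[str]) -> List[Quad]:
    """Keep only the quads whose named graph is in the allowed set."""
    return [q for q in quads if q.graph in allowed_graphs]


if __name__ == "__main__":
    # Show only the types asserted in the core graph.
    for quad in filter_by_graph(QUADS, {CORE_GRAPH}):
        print(quad.obj)
```

Against a real store, the same restriction would normally be expressed in the query itself, for example with a SPARQL GRAPH clause, rather than by filtering results afterwards.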

Secondly, if every author and publisher of semantic concepts is given the freedom to put their stuff in a common bowl, then another possibility is to filter based on namespace.

Explore&Query provides support for both filter types.

Below is what the latter, namespace-based method looks like. It is a feature based on namespace prefix registration. In the system configuration, users can add the prefixes and namespaces they consider relevant. Anything without a defined prefix is displayed as a full URI. On a describe page, along with the Navigation Map actions, there is a button “NS Control On/Off”.

SemSpam2

Here you can toggle the namespace filter “on” and “off”.

SemSpam3
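The namespace-based filter can be sketched in a few lines of Python. This is not how Explore&Query is implemented; it is only an illustration under two assumptions: that the registered prefixes are a simple prefix-to-namespace mapping, and that switching the control on hides every type whose namespace has no registered prefix. The function names and the sample type list are invented for the example.

```python
from typing import Dict, List, Optional

# Prefixes the user has registered as relevant (an assumed configuration).
PREFIXES: Dict[str, str] = {
    "dbo": "http://dbpedia.org/ontology/",
    "foaf": "http://xmlns.com/foaf/0.1/",
}

# Illustrative rdf:type values for one resource.
TYPES: List[str] = [
    "http://dbpedia.org/ontology/SoccerPlayer",
    "http://xmlns.com/foaf/0.1/Person",
    "http://schema.org/Person",
    "http://umbel.org/umbel/rc/Athlete",
]


def match_prefix(uri: str, prefixes: Dict[str, str]) -> Optional[str]:
    """Return the registered prefix with the longest matching namespace, if any."""
    best: Optional[str] = None
    for prefix, namespace in prefixes.items():
        if uri.startswith(namespace):
            if best is None or len(namespace) > len(prefixes[best]):
                best = prefix
    return best


def display_name(uri: str, prefixes: Dict[str, str]) -> str:
    """Shorten the URI to prefix:local form, or return the full URI if no prefix matches."""
    prefix = match_prefix(uri, prefixes)
    if prefix is None:
        return uri
    return f"{prefix}:{uri[len(prefixes[prefix]):]}"


def visible_types(type_uris: List[str], prefixes: Dict[str, str], ns_control_on: bool) -> List[str]:
    """With the control on, keep only URIs in registered namespaces; then shorten for display."""
    if ns_control_on:
        type_uris = [u for u in type_uris if match_prefix(u, prefixes) is not None]
    return [display_name(u, prefixes) for u in type_uris]


if __name__ == "__main__":
    print("NS Control on: ", visible_types(TYPES, PREFIXES, True))
    print("NS Control off:", visible_types(TYPES, PREFIXES, False))
```

With the control off, nothing is hidden: types from unregistered namespaces simply show up as full URIs, which already makes them easy to tell apart from the ones the reader cares about.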

The example I used is the DBpedia and Yago classifications, which are both immensely impressive and significant linked data sources. I used them because they are available and make a really good example for imagining what might happen if authors in the Semantic Web have the freedom to annotate facts and assert classifications, meta facts or anything else circumstantial linked to the raw original facts.

The mechanisms for semantic spam control exist in the form of named graphs, graph-level access control and namespaces. But the rules need to be defined and applied, because what is spam to some might be valuable knowledge to others.