Channel: web of data – Tom Heath's Displacement Activities

Powerset: More Than Just a Pretty Face?

For this month’s Semantic Web Gang podcast we were joined by Barney Pell from Powerset, who recently launched a public beta of their long-awaited natural language query engine operating over Wikipedia data. Amid all the buzz, it was great to hear about Powerset straight from the horse’s mouth, and it prompted me to spend some time exploring the system. This post is about what I found.

I took Charlie Chaplin as my starting point, wanting a topic that should have fairly broad coverage, and asked “who did Charlie Chaplin marry?”. Powerset returned the name “Mildred Harris” in the results, which seemed like a fairly reasonable response. I have no idea if it’s correct, but looking for the same information via DBpedia I found two answers: Mildred Harris and Paulette Goddard. Interesting that Powerset didn’t pick up both of those, or at least it didn’t show me those in the first set of results.

Interestingly, the results page for this query shows “Factz” at the top that the Powerset algorithms have extracted from the Wikipedia articles, presented (broadly speaking) in the form of subject, predicate and object triples, e.g. “Chaplin married actress, Mildred Harris”, and showing the sentence context from which they were extracted. At a general level this reminds me of Vanessa’s work on PowerAqua, which breaks queries down into “linguistic triples” and operates pretty impressively over existing RDF data sets. I can’t help feeling that Powerset’s triple extraction algorithms and the PowerAqua query engine could be an interesting combination.

Underneath the “Factz” at the top of the results page is a series of “Wikipedia Article” results, the first of which contains the sentence from which the “Chaplin married actress, Mildred Harris” information is extracted. The key parts of this sentence are also highlighted, enabling me to pick out the information that answered my question (in part at least).

By this point I was fairly taken with the interface, which is sweeter eye candy than either Wikipedia or DBpedia, but not necessarily faster than either, and may be guilty of presenting only half the picture. I’m also not yet convinced that, if we took a large sample of natural language queries and compared the results returned, Powerset would significantly outperform Google, who are consistently good at highlighting in their search results the passage of a document that is relevant to the query. Of course Google uses a much larger corpus than Powerset, but it’s interesting to note that the summary of the first result for the Charlie Chaplin query on Google reads “Charlie Chaplin was married four times and had 11 children between 1919 and 1962”.

To continue my exploration I tried another natural language query: “what is the population of brazil?”. This would seem like something of a no-brainer for a search engine with any semantic capabilities, and access to the rich knowledge bound up in Wikipedia. However, this time there were no headline “Factz” helping the answer to jump out at me. Instead there were Wikipedia Article results, the first of which was a node titled “Population of Brazil” that comes with an accompanying chart, but does not show the actual answer based on the latest available figures. Result number 4 (“Economy of Brazil”) does have as its result summary the text “In the space of fifty five years (1950 to 2005), the population of Brazil grew by 51 million to approximately 180 million inhabitants, an increase of over 2% per year”, but none of this is highlighted as the answer to my question.

Going back to the Charlie Chaplin example, I followed the associative links in my own mental Web and arrived at the entry for “Waibaidu Bridge”, an historic landmark on the Shanghai waterfront, located (when it’s not been taken away for repairs) just down the street from the Astor House Hotel, another Shanghai landmark where Chaplin apparently stayed on more than one occasion. Waibaidu Bridge has an entry on Wikipedia, and therefore also an entry on DBpedia (http://dbpedia.org/resource/Waibaidu_Bridge) and in Powerset.

The Wikipedia entry itself is a really nice one; just enough historical background to be useful, a couple of bits of trivia (the bridge features briefly in the film Empire of the Sun), and a manually compiled list of places nearby. All of this is visible in Powerset, wrapped in their rather more 2008 interface. There are also a number of “Factz” extracted from the text of the Wikipedia article and presented in a box on the right. These are simply more of the subject, predicate, object triples mentioned previously, and sadly they add little value to the article. Here are some examples from the first section of the article:

* name bears name
* Waibaidu bore name
* citizens use ferries
* decade(1850) increases need

There are a couple that capture key elements from the article:

* Wales built bridge (note that this was a person named Wales, rather than the country Wales)
* Chinese paid toll (reflecting the history of the original Waibaidu bridges and the discriminatory tolls charged to Chinese people crossing them)

However these are mostly drowned out by the surrounding noise:

* ferry eases traffic
* Outer ferried cross
* powers restrict people

In the end it’s quicker just to read the article, as you’ll need to do so anyway to understand the “Factz” and check that they stand up. The “ferry eases traffic” “Fact” is actually incorrect, as the sentence from which this is extracted reads “In 1856, a British businessman named Wales built a first, wooden bridge at the location of the outermost ferry crossing to ease traffic between the British Settlement to the south, and the American Settlement to the north of Suzhou River.”, which has quite the opposite implication.
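To make the subject, predicate, object idea concrete, here is a minimal sketch that stores a few of the “Factz” quoted in this post as plain triples and answers a question by pattern matching. The `Triple` type and the `match` helper are hypothetical names of my own, and this is in no way how Powerset or PowerAqua work internally; the Chaplin object is also shortened from “actress, Mildred Harris” for readability.

```python
from typing import List, NamedTuple, Optional


class Triple(NamedTuple):
    """A hypothetical container for one extracted fact."""
    subject: str
    predicate: str
    obj: str


# Factz quoted in this post (Chaplin object simplified by hand).
FACTZ = [
    Triple("Chaplin", "married", "Mildred Harris"),
    Triple("Wales", "built", "bridge"),
    Triple("Chinese", "paid", "toll"),
    Triple("ferry", "eases", "traffic"),
]


def match(triples: List[Triple],
          subject: Optional[str] = None,
          predicate: Optional[str] = None,
          obj: Optional[str] = None) -> List[Triple]:
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


# "Who did Chaplin marry?" becomes the pattern (Chaplin, married, ?).
spouses = [t.obj for t in match(FACTZ, subject="Chaplin", predicate="married")]

# "Who built the bridge?" becomes the pattern (?, built, bridge).
builders = [t.subject for t in match(FACTZ, predicate="built", obj="bridge")]
```

Note that the store happily returns “ferry eases traffic” for a query about traffic, which is exactly the problem: a pattern match is only as good as the extraction behind it.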

All this aside, one glaring omission from Powerset struck me when looking at this page, and it was this that really made me wonder whether Powerset is anything more than just a pretty face. Some thoughtful geodata geek has made the effort to record the geo coordinates of Waibaidu Bridge in the Wikipedia entry; 31°14’43″N, 121°29’7.98″E apparently. Now Wikipedia doesn’t seem to do anything in particular with this data; the list of places nearby is manually compiled. I’ll forgive them this, as I imagine they have their hands full with keeping the whole operation running. Perhaps if I donated some money they would consider doing this by default for all entries with geo-coordinates.

However, what isn’t so easily forgiven is Powerset ignoring this information completely, not even bothering to start the page with a small Google map next to the nice old photo showing the view to the Bund, let alone thinking to use a service like Geonames to compute from the Web of Data a list of nearby places. (For the record, DBpedia doesn’t do this itself, but by making the effort to link items across the Wikipedia and Geonames data sets it does the majority of the hard work already). For an application that gets so closely associated with the Semantic Web effort (whether Powerset desire this or not) I find this omission quite sad. It’s such a no-brainer, and beautifully demonstrates the kind of thing that will separate Semantic Web applications from just more closed-world systems that happen to do something smart. I put this question about use of external data sets to Barney in the Gang podcast, but, either due to the intensity of the medium or bad communication on my part due to my cold-addled brain, the true meaning of my question was lost.
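As a rough illustration of how little is needed to get from those coordinates to a list of nearby places, here is a minimal sketch. It converts the degrees, minutes and seconds recorded in the Wikipedia entry to decimal degrees, then filters a list of candidates by great-circle distance. The function names are my own, the two candidate places are made-up placeholders rather than real Geonames records, and a real application would fetch candidates from a service such as Geonames instead of a hard-coded list.

```python
import math


def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees/minutes/seconds to signed decimal degrees."""
    value = degrees + minutes / 60 + seconds / 3600
    return -value if hemisphere in ("S", "W") else value


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km, assuming a spherical Earth."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


def places_within(origin, places, radius_km):
    """Return names of places within radius_km of origin, nearest first."""
    lat0, lon0 = origin
    hits = []
    for name, (lat, lon) in places.items():
        distance = haversine_km(lat0, lon0, lat, lon)
        if distance <= radius_km:
            hits.append((distance, name))
    return [name for _, name in sorted(hits)]


# Coordinates from the Wikipedia entry: 31°14'43"N, 121°29'7.98"E.
BRIDGE = (dms_to_decimal(31, 14, 43, "N"), dms_to_decimal(121, 29, 7.98, "E"))

# Placeholder candidates, NOT real data: one point close by, one far away.
CANDIDATES = {
    "placeholder-near": (31.25, 121.49),
    "placeholder-far": (31.0, 121.0),
}

nearby = places_within(BRIDGE, CANDIDATES, radius_km=2.0)
```

With the linking between Wikipedia and Geonames already done by DBpedia, the remaining work is essentially this filter plus a lookup.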

The question of when Powerset will open up its technology to other text sources, and even the Web at large, always comes up. For me this is a less interesting question than the one about when/if the company will make use of existing structured data sets in their user-facing tools. I hope that with time, and perhaps less pressure now that a product is out the door, Powerset will implement the kind of features I talk about above, as the starting point for becoming a true Semantic Web application. Until then however, the current product will be, for me, really just Wikipedia hiding behind a pretty face.

