Monday, January 31, 2005

dlpconvert

There's so much to do right now here, and even much, much more I'd like to do - and thus I'm getting a bit late on my announcements. Finally, here we go: dlpconvert is released in a 0.5 version.

You probably think: what is dlpconvert? dlpconvert is able to convert ontologies, that are within the dlp-fragment, from one syntax - namely the OWL XML Presentation Syntax - into another, here Datalog. Thus you can just take your ontology, convert it and then use the result as your program in your Prolog-engine. Isn't that cool?
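To give a rough idea of what such a conversion amounts to, here is a minimal Python sketch of the textbook mapping from a few DLP axiom types to Datalog rules. This is not dlpconvert's code and not its actual output format; the tuple encoding of the axioms, the function name `axiom_to_rule` and the example class and property names are all made up for this illustration.

```python
# Hypothetical illustration only: not dlpconvert, just the general DLP idea.

def axiom_to_rule(axiom):
    """Translate one simple axiom, given as a (kind, a, b) tuple, into a Datalog rule."""
    kind, a, b = axiom
    if kind == "subClassOf":      # every a is also a b
        return f"{b}(X) :- {a}(X)."
    if kind == "subPropertyOf":   # a(x, y) implies b(x, y)
        return f"{b}(X, Y) :- {a}(X, Y)."
    if kind == "domain":          # subjects of property a belong to class b
        return f"{b}(X) :- {a}(X, Y)."
    if kind == "range":           # objects of property a belong to class b
        return f"{b}(Y) :- {a}(X, Y)."
    raise ValueError(f"axiom type outside this sketch: {kind}")


axioms = [
    ("subClassOf", "philosopher", "person"),
    ("domain", "wrote", "person"),
    ("range", "wrote", "document"),
]
for axiom in axioms:
    print(axiom_to_rule(axiom))
```

The printed rules can be loaded into a Prolog or Datalog engine as they are; anything beyond these four axiom types is deliberately left out of the sketch.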

Well, it would be much cooler if it were tested more thoroughly, and if it could read some more common syntaxes like the RDF/XML serialisation of OWL ontologies, but both are on the way. As for testing, I hope that you will test it a bit - as for the serialisation, it should be available pretty soon.

dlpconvert is based totally on KAON2 for the reduction of the ontology.

I will write more as soon as I have more time.

Thursday, January 13, 2005

Comments to naming

Richard Newman sent me some thoughtful comments via eMail on the What's in a name series (there were also some great comments on the individual entries, feel free to browse them). He sent them via eMail because he thought he couldn't comment - that should not be the case, everyone should be able to comment anonymously. Or did anyone else encounter problems? I should switch to some dedicated software soon, anyway, but right now I don't have the time to dig deeper into it. I especially miss trackback, sigh.

Here's what Richard wrote:

"Your first point, about ISBNs and "what's being referenced" --- I think you'd be interested in FRBR, which is a modelling of the bibliographical domain. It splits things up into

Work -> Expression -> Manifestation -> Item

A work is an abstract concept, like "Politeia". An expression is a realisation of a work, so a particular translation is an expression. A manifestation is a physical embodiment of an expression: this is what's given an ISBN. All copies of a certain book are Items; the edition of the book is their Manifestation.

So, you see, when you're discussing Plato's Politeia, you have to be conceptually clear about whether you're talking about works, expressions, manifestations, or items.

E.g.

:PolWork dc:creator "Plato" ;
    rdfs:label "Plato's Politeia, the abstract concept." .
:PolExp1 ex:translator "Mr Smith" ;
    frbr:work :PolWork ;
    rdfs:label "Mr. Smith's translation of Plato's Politeia." .
:PolMan1 ex:publisher "Penguin" ;
    frbr:expression :PolExp1 ;
    rdfs:label "Penguin's edition of Smith's translation." .
:MyCopy ex:owner hg:RichardNewman ;
    frbr:manifestation :PolMan1 ;
    rdfs:label "Richard's copy of the Penguin edition." .

Do you see? Each level has its own properties (and some may be duplicated; e.g. each has a title: the title of the abstract work, the name given to the translation, the name Penguin prints on each book, and the name printed on my copy).

I've done a bit of work on modelling FRBR in RDFS/OWL, but haven't yet finished. "

I think that's really interesting, and having taken a look at FRBR, I find it pretty well done. I sure am looking forward to seeing Richard's interpretation in OWL, and will probably use it.
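Just to make the four levels tangible for myself, here is a small Python sketch of the chain Richard describes. It is not Richard's RDFS/OWL model; the class layout, the helper `work_of` and the second edition ("Another translation", "Another publisher's edition") are assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Work:
    label: str


@dataclass(frozen=True)
class Expression:
    label: str
    work: Work


@dataclass(frozen=True)
class Manifestation:
    label: str
    expression: Expression


@dataclass(frozen=True)
class Item:
    label: str
    manifestation: Manifestation


def work_of(item: Item) -> Work:
    """Walk up the chain Item -> Manifestation -> Expression -> Work."""
    return item.manifestation.expression.work


politeia = Work("Plato's Politeia, the abstract concept")
smith = Expression("Mr. Smith's translation", politeia)
penguin = Manifestation("Penguin's edition of Smith's translation", smith)
my_copy = Item("Richard's copy of the Penguin edition", penguin)

# A second, made-up edition of the same work.
other_edition = Manifestation("Another publisher's edition",
                              Expression("Another translation", politeia))
other_copy = Item("A copy of the other edition", other_edition)

print(work_of(my_copy) == work_of(other_copy))             # True: same work
print(my_copy.manifestation == other_copy.manifestation)   # False: different editions
```

The point is only that "the same book" can mean the same work while the manifestations, and with them the ISBNs, differ.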

"Your second issue is the difference between a resource and its representation. A URI should only refer to one thing; it is entirely wrong to use http://www.holygoat.co.uk to refer both to my homepage (as in using RDF to describe its language, or size, or last-modified) and to me (my name, my email address, etc.) which I have seen done.

Your web server should return RDF for http://semantic.nodix.net/#Plato if your browser says that it accepts RDF+XML. A normal browser should have an HTML representation returned. Indeed, it's possible to do the following:

# the abstract resource. Hit this with a browser, get an HTML page; with an RDF agent, get some RDF.
http://example.com/Plato a rdf:resource .

# the HTML representation.
http://example.com/Plato/html a ex:representation ;
ex:representationOf http://example.com/Plato .

# the RDF.
http://example.com/Plato/rdf a ex:representation ;
ex:representationOf http://example.com/Plato .

i.e. you can unambiguously refer to each representation, and the resource. When your client arrives, asking for Plato, you can redirect them to the appropriate place. Clever, huh?

URIs should never give a 404. They should return the appropriate headers or content for whatever the client is requesting; this may be the RDF file in which the resource is defined, if the client understands RDF, or an HTML page.

If you're interested in this sort of thing, it pops up on the W3C's RDF Interest Group list occasionally.

Patrick Stickler and others have come up with an additional HTTP verb, MGET, which will return the RDF description of a resource. Combined with their URIQA architecture, it will give you a Concise Bounded Description for a URI. This stops you having to somehow put descriptions into particular files, and better deals with the distributed nature of the Semantic Web. Check it out; it presents several convincing arguments for not using fragment identifiers to refer to resources, and solves your bandwidth problem. You should never have to dump a whole file to get a description of a URI."

I have to note that Richard wrote me this just after part 4 of the series was released, so I could answer some of the questions already in the last two parts. Just to summarise it: I don't like content negotiation. Although it is technically totally feasible, I disagree that it should be done or is a good solution. If my browser asks for http://semantic.nodix.net/#Plato I don't think I should get different things depending on the content negotiation. This feels like cheating.

I wrote that to Richard already, and he answered:

"I think we agree on the main point, which is that

foaf:name "Richard" ; ex:format "HTML" .

which is a travesty :) "

He is totally right here.


"You still see it happen, though, with people referring to Wikipedia pages as if they were the abstract resource.

The content negotiation (getting different things depending on what you accept) is exactly what the Web is supposed to do. If I'm using a mobile browser, I want a simplified version of a page; if I'm an RDF agent, I want RDF, if it exists, because HTML is of no use to me. A common usage of this is to serve up strict XHTML to Mozilla, and less-strict HTML to Internet Explorer. It is also done all the time to serve PNG where the client accepts it, and GIF if it doesn't, and there is an intentional disconnect on the Web between a resource and its representations.

The lack of such a disconnect would lead to exactly the problem you describe; if I can't return a representation of a resource, because it's abstract, then how do I find out anything about it? I could use MGET, but you can't MGET a person... so, if you want to talk about the real world thing "Plato", he has to 404, or you get the "what am I talking about?" problem. Better, in my view, to redirect a browser to plato.html and a SW agent to a chunk of RDF. "

I would rather like to ask for http://semantic.nodix.net/Plato.rdf to get the RDF/XML representation, http://semantic.nodix.net/Plato.owl to get the OWL/XML representation, http://semantic.nodix.net/Plato.html to get an HTML page for the user to read and http://semantic.nodix.net/Plato.jpg for a picture of Plato. This shouldn't be hidden behind content negotiation. I know, I know, Patrick would strongly disagree here, but I think it feels wrong and actually defies the idea of a URI.

"You can do exactly that (and I agree that the representations should have separate URIs --- conneg is only for when you're trying to get some description of an abstract resource), but then how do you refer to the abstract concept of "Plato"? http://.../Plato is a resource, and I want to make statements about him. But there's no point in it being 404 when dereferenced, because then how would I find out that Plato.html exists? HTTP doesn't return URIs, it returns representations of them.

A URI is simply something that is dereferenced to get a representation, and that representation should be decided on by conneg. In this case, /Plato is an abstract resource, so one of the representations should be returned. We can then make statements about Plato (e.g. foaf:name "Plato"), and about the JPEG and HTML representations, because they have different URIs, but still get something useful back when we want to access /Plato."

I also dislike MGET right now. Maybe I am wrong, but to me, the whole URIQA architecture feels somewhat wrong - but maybe I should just delve deeper into it. I have to admit, I haven't studied it enough yet to really be in a position to bash it. The problem is that MGET seems unnecessary to me - and it works on a different conceptual level than the rest of the Semantic Web proposals. I think everything MGET solves can be solved with tools that already exist: Richard's example above, where he gives triples telling us which representations are used to describe a resource, shows perfectly well that you actually don't need content negotiation and MGET.

"There are things to question about URIQA, but it does have some good going for it. MGET is actually an implicit query. In the standard Web model, you request URIs and get back document representations. Doing an MGET on a Web server is asking it to return a description, regardless of where on the site descriptions of that resource exist, and you're explicitly asking for meta-data. As Patrick points out, it's similar to doing a GET and specifying that you accept RDF, but is likely to be more concise (the difference between a "representation" and a "description"). In fact, this is exactly what the Nokia URIQA server does.

MGET overlaps with query servers a bit, and with GET a bit, but it's a little bit special, too. The whole idea is that from a single URI you can get a useful description of a resource, just by issuing a single MGET. Every other approach needs more work."

This URIQA / MGET stuff sounds more and more interesting. I really should delve deeper into it.

Also, the idea of Concise Bounded Descriptions may be very neat, I have to study that more as well. Funny thing, the very same day Richard pointed me to it, a colleague told me about it too - this is usually a sign that the idea is worth considering more.

Richard also wrote "URIs should never give a 404", and as you know, I disagreed with it mildly. He tried to summarise his position:

"I consider that each returned resource should have its own URI --- e.g. Plato.jpg --- and that the original URI should be used to make statements about the abstract resource. This allows you to say

...Plato foaf:name "Plato" .
...Plato.jpg ex:resolution "150dpi" .
...Plato.html dc:creator "Denny" .

Dereferencing the abstract resource, rather than throwing a 404, should do something useful --- e.g. redirecting with a 303 to one of the representations. Have you ever tried viewing a Blogger Atom feed in your browser? If you hit it with an RSS reader, you get the XML, but in a browser Blogger shows you an XHTML transformation of the XML. That's useful, and I think that's how the Semantic Web should work. Imagine if your agent hit /Plato, and got RDF out of it, but when you looked at it with your browser you saw a dynamically-generated HTML page? Handy!

I can understand your objection, though; it does seem wrong that you get different things out of the same URI. However, you should almost always get HTML out of plato.html, and RDF out of plato.rdf. All the conneg is doing is making sure you can see an abstract thing in the best way possible, according to what you've told the server you can understand. "
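To see what Richard's proposal would mean in practice, here is a minimal Python sketch of a server-side decision: the abstract resource /Plato never returns a 404, but answers with a 303 redirect to the representation that fits the Accept header best. The paths, the table of media types and the function names are assumptions for this example, and the Accept parsing is deliberately simplified (no wildcards, no media type parameters other than q).

```python
# Hypothetical mapping from media types to representation URLs.
REPRESENTATIONS = {
    "application/rdf+xml": "/Plato.rdf",
    "text/html": "/Plato.html",
}
DEFAULT = "/Plato.html"  # assumed fallback when nothing in the header matches


def parse_accept(header):
    """Return the acceptable media types from an Accept header, best q-value first."""
    entries = []
    for position, part in enumerate(header.split(",")):
        fields = [field.strip() for field in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0
        for field in fields[1:]:
            if field.lower().startswith("q="):
                try:
                    q = float(field[2:])
                except ValueError:
                    q = 0.0
        if q > 0:  # q=0 means "not acceptable"
            entries.append((-q, position, media_type))
    return [media_type for _, _, media_type in sorted(entries)]


def dereference(path, accept):
    """Answer a GET on the abstract resource with (status, Location)."""
    if path != "/Plato":
        return 404, None
    for media_type in parse_accept(accept):
        if media_type in REPRESENTATIONS:
            return 303, REPRESENTATIONS[media_type]
    return 303, DEFAULT


print(dereference("/Plato", "application/rdf+xml"))
print(dereference("/Plato", "text/html,application/xhtml+xml;q=0.9"))
```

Note that Plato.rdf and Plato.html keep their own URIs here, so statements about the representations and statements about Plato stay apart.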

Richard is pretty good at convincing me, because he uses the right arguments: it's for the people, dummy, and the machines can work it out anyway.

I still totally stick to the recommendations I gave yesterday. But just as I am writing, and rereading it all, I am starting to change my mind on content negotiation. Maybe it is a good thing. I will have to think about it some more, and as soon as I come to a solution, I will bother you with it again. I still have a gut feeling about it that tells me 'no', but the reasons given sound very convincing and I agree with most of them, so heck, let's meditate on this as soon as I find a few hours to spare.

Big thanks to Richard and his thoughts, anyway. I hope this discussion helps you to make up your own mind as well.

Tuesday, January 11, 2005

What's in a name - Part 6

In this series we learned how to make URIs for entities. I know there's a big discussion flaring up every few weeks or so about whether we should use fragment identifiers or not. For me, this question is pretty much settled. Using a fragment identifier has the advantage of giving you the ability of providing a human readable page for those few lost souls who look up the URI, so maybe it's a tad nicer than using no fragment identifier and returning 404s. Not using fragids has the advantage of probably reducing bandwidth - but this discussion should be more or less academic, because looking up URIs, as we have seen, should not happen.

There is some talk about different representations, negotiating media-types, returning RDF in one, XHTML in the other case, but to be honest, I think that's far too complicated. And you would need to use another web server and extensions to HTTP to make this real, which doesn't really help the advent of the Semantic Web. Look at Nokia's URIQA project for more information.

Keep these rules in mind, and everything should be fine:
  • be careful to use unused URIs if you reference a new entity. Take one from a URI space you have control of, so that URI collision won't appear
  • don't put a website under the URI you used to name an entity. That would lead to URI collision
  • try to make nice looking URIs, but don't try too hard. They are supposed to be hidden by the application anyway
  • provide rdfs:label and rdfs:seeAlso instead. This solves everything you would want to try to solve with URI naming, but in a standard compliant way
  • give your resources URIs. Please. So that others can reference them more easily.
I should emphasise the last one more. RDF/XML syntax in particular easily leads to anonymous nodes, which are a pain in the ass because they are hard or impossible to address. The first way to get them is rdf:nodeID. It doesn't give your node an ID that's visible to the outer world. It is just a local name. Don't use it, please.

The second way to get anonymous nodes is nesting descriptions like this:
<foaf:Person rdf:about="me">
  <foaf:knows>
    <foaf:Person>
      <foaf:name>J. Random User</foaf:name>
    </foaf:Person>
  </foaf:knows>
</foaf:Person>
Actually, the Person known to "me" is an anonymous one. You can't refer to her. Again, try to avoid that. If you can, look up the URI the person gave to herself in her own FOAF-file. Or give her a name in your own URI-space. Don't be afraid, you won't run out of it.
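If you want to check your own files for such nodes, a few lines of Python with the standard library's XML parser are enough for simple cases. This is only a sketch under assumptions: it looks at foaf:Person elements only, the URI http://example.org/people#me is made up, and a real RDF parser would be the better tool for anything serious.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
FOAF = "http://xmlns.com/foaf/0.1/"

# The snippet from above, wrapped in rdf:RDF; the rdf:about value is made up.
DOC = f"""
<rdf:RDF xmlns:rdf="{RDF}" xmlns:foaf="{FOAF}">
  <foaf:Person rdf:about="http://example.org/people#me">
    <foaf:knows>
      <foaf:Person>
        <foaf:name>J. Random User</foaf:name>
      </foaf:Person>
    </foaf:knows>
  </foaf:Person>
</rdf:RDF>
"""


def persons(xml_text):
    """Return (URI or None, foaf:name or None) for every foaf:Person element."""
    root = ET.fromstring(xml_text.strip())
    found = []
    for person in root.iter(f"{{{FOAF}}}Person"):
        uri = person.get(f"{{{RDF}}}about")          # None: the node is anonymous
        name = person.findtext(f"{{{FOAF}}}name")    # looks at direct children only
        found.append((uri, name))
    return found


for uri, name in persons(DOC):
    print(uri or "(anonymous, cannot be referenced)", "-", name)
```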

Another very interesting approach is to use published subjects. I will return to this in another blog entry, promised, but so long: never forget, there is owl:sameAs to make two URIs point to the same thing, so don't mind too much if you name something twice.

Well, that's it. I hope you enjoyed the series, and that you learned a bit from it. Looking forward to your comments, and your questions.

Monday, January 10, 2005

What's in a name - Part 5

After calling Plato an XML-Element, making movies out of websites and having several accidents with careless URIs, it seems we return to the very beginning of this series.

http://semantic.nodix.net/document/Politeia dc:creator "Plato".

Whereby http://semantic.nodix.net/document/Politeia explicitly does not resolve but returns a 404, resource not found. Let's remember, why didn't we like it? Because humans, upon seeing this, have the urge to click on it in order to get more information about it. A pretty good argument, but every solution we tried brought us more or less trouble. We didn't get happy with any of them.

But how can I dismiss such an argument? Don't I risk losing focus by saying "don't care about humans going nowhere"? No, I really don't think so. For two reasons, one meant for humans and one for the machines.

First the humans (humans always should go first, remember this, Ms and Mr PhD-student): humans actually never see this URI (or at least, should not, except when debugging). URIs that will grace the GUI should have an rdfs:label, which provides the label human users will see when working with this resource. Let's be honest: only geeks like us think that http://semantic.nodix.net/document/Politeia is a pretty obvious and easy name for a resource. Normal humans would probably prefer "Politeia", or even "The Republic" (which is the usual name in English-speaking countries). Or be able to define their own name.

As they don't see the URI, they actually never feel the urge to click on it, or to copy and paste it to the next browser window. Naming it http://semantic.nodix.net/document/Politeia instead of http://semantic.nodix.net/concept/1383b_xc is just for the sake of readability of the source RDF files, but actually you should not derive any information out of the URI (that's what the standard says). The computer won't either.

The second point is, an RDF application shouldn't look up URIs either. It's just wrong. URIs are just names, it is important that they remain unique, but they are not there for looking up in a browser. That's what URLs are for. It's a shame they look the same. Mozilla realised the distinction when they gave their XUL language the namespace http://www.mozilla.org/keymaster/gatekeeper/there.is.only.xul. Application developers should realise this too. rdfs:seeAlso and rdfs:isDefinedBy give explicit links applications may follow to get more information about a resource, and using owl:imports actually forces this behaviour - but the name does not.
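In code, the rule boils down to something like the following Python sketch: an application collects the documents it may fetch from the explicit link properties only, and never from the names of the resources themselves. The triples are plain tuples here, and the seeAlso target http://example.org/about-politeia.html is made up for the example.

```python
# Properties whose objects an application may (or, for owl:imports, must) fetch.
FOLLOWABLE = {"rdfs:seeAlso", "rdfs:isDefinedBy", "owl:imports"}

POLITEIA = "http://semantic.nodix.net/document/Politeia"

triples = [
    (POLITEIA, "rdfs:label", "Politeia"),
    (POLITEIA, "dc:creator", "http://semantic.nodix.net/person#Plato"),
    (POLITEIA, "rdfs:seeAlso", "http://example.org/about-politeia.html"),
]


def links_to_follow(triples):
    """Return the URLs an application may look up; subject and object names are ignored."""
    return sorted({o for s, p, o in triples if p in FOLLOWABLE})


print(links_to_follow(triples))
```

Neither the URI of the Politeia nor the URI of its creator ends up in the list, although both look like perfectly good web addresses.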

Getting information out of names is like making fun of names. It's mean. Remember the in-kids in primary school making fun of out-kids because of their names? You know you're better than that (and, being a geek, you probably were an out-kid, so mere compassion and fond memories should hold you back too).

Just to repeat it explicitly: if a URI gives back a 404 when you put it in a browser navigation bar - that's OK. It was supposed to identify resources, not to locate them.

Now you know the difference between URIs and URLs, and you know why avoiding URI collision is important and how to avoid it. We'll wrap it all in the final installment of the series (tomorrow, I sincerely hope) and give some practical hints, too.

By the way, right after the series I will talk about content negotiation, which was mentioned in the comments and in e-Mails.

Uh, and just another thing: the wary reader (and every reader should be wary) may also have noticed that

Philosophy:Politeia dc:creator "Plato".

is total nonsense: it says that there is a resource (identified with QName Philosophy:Politeia) that is created by "Plato". Rest assured that this is wrong - no, not because Socrates should be credited as the creator of the Politeia (this is another discussion entirely) but because the statement claims that the string "Plato" created it - not a person known by this name (who would be a resource that should have a URI). This mistake is probably the most frequent one in the world of the Semantic Web - a mistake nevertheless.

It's OK if you make it. Most applications will cope with it (and some are actually not able to cope with the correct way). But it would not be OK if you didn't know that you are making a mistake.
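The difference between the two ways of stating the creator can be sketched in a few lines of Python. The two little classes and the helper `creators_are_resources` are made up for this example, and the URI for Plato is just the one suggested in part 4; no real RDF library is involved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Literal:
    value: str   # just a string; nothing more can be said about it


@dataclass(frozen=True)
class Resource:
    uri: str     # a name that other statements can refer to


POLITEIA = Resource("http://semantic.nodix.net/document/Politeia")
PLATO = Resource("http://semantic.nodix.net/person#Plato")

# The frequent mistake: the string "Plato" is the creator.
sloppy = [(POLITEIA, "dc:creator", Literal("Plato"))]

# The correct way: a person is the creator, and "Plato" is merely his label.
careful = [
    (POLITEIA, "dc:creator", PLATO),
    (PLATO, "rdfs:label", Literal("Plato")),
]


def creators_are_resources(triples):
    """True if every dc:creator is a resource that further statements can describe."""
    creators = [o for s, p, o in triples if p == "dc:creator"]
    return all(isinstance(creator, Resource) for creator in creators)


print(creators_are_resources(sloppy))    # False
print(creators_are_resources(careful))   # True
```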

Friday, January 07, 2005

What's in a name - Part 4

I promised you four solutions to the problem of dubbing with appropriate URIs. So, without further ado, let's go.

The first one you've seen already. It's using anonymous nodes.
_person foaf:interest _security.
http://dmoz.org/Computers/Security/ dc:subject _security.

But here we get the problem that we can't reference _security from outside, thus losing a lot of the possibilities inherent in the Semantic Web, because this way you cannot say that someone else is interested in the same topic as _person above. Even if you say, in another RDF file,
_person2 foaf:interest _security.
http://dmoz.org/Computers/Security/ dc:subject _security.
_security actually does not have to be the same as above. Who says websites only have one subject? The coincidental equality of the variable name _security bears as much semantics as the equality of two variables x in a C and a Python program.
So this solution, although possible, has too many shortcomings. Let's move on.
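Here is the same point as a small Python sketch. The loader renames every anonymous node per document before merging, which is roughly what an RDF store has to do; the function name `load`, the document ids "a" and "b" and the renaming scheme are made up for this illustration.

```python
def load(doc_id, triples):
    """Rename anonymous nodes (terms starting with '_') so they are unique per document."""
    def scoped(term):
        return f"_:{doc_id}/{term[1:]}" if term.startswith("_") else term
    return {(scoped(s), p, scoped(o)) for s, p, o in triples}


doc_a = [
    ("_person", "foaf:interest", "_security"),
    ("http://dmoz.org/Computers/Security/", "dc:subject", "_security"),
]
doc_b = [
    ("_person2", "foaf:interest", "_security"),
    ("http://dmoz.org/Computers/Security/", "dc:subject", "_security"),
]

merged = load("a", doc_a) | load("b", doc_b)
interests = {o for s, p, o in merged if p == "foaf:interest"}

# Two different topics, although both files wrote "_security".
print(sorted(interests))
```

After merging, nothing says that the two persons share an interest; with a real URI for the topic, both files would have pointed to the same node.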

The second solution is hardly available to the majority of us puny mortals. It's introducing a new URI scheme. Let's return to our very first example, where we wanted to say that the Politeia was written by Plato.

urn:isbn:0192833707 dc:creator "Plato".

Great! No problems here. Sure, your web-browser can't (yet) resolve urn:isbn:0192833707, but no ambiguity here: we know exactly of what we speak.

Do we? Incidentally, urn:isbn:0465069347 also denotes the Politeia. No, not in another language (those would be another handful of ISBN numbers), just a different version (the text is public domain). Now, does the following statement hold?

urn:isbn:0192833707 owl:sameAs urn:isbn:0465069347.

Most definitely not. They have different translators. They have different publishers. These are different books. But it's the same - what? What is the same? It's not the same text. It's not the same book. They may have the same source text they are translated from. But how to express this correctly and still usefully?

The urn:isbn: scheme is very useful for a very special kind of entity - published books, even the different versions of published books.
The problem with this solution is that you would need tons of schemes. Imagine the number of committees! This would, no, this should never happen. We definitely need an easier solution, although this one certainly does work for very special domains.

Let's move on to the third solution: the magic word is fragment identifier. #. Instead of saying:
http://semantic.nodix.net/Politeia dc:creator http://semantic.nodix.net/Plato.
and thus getting 404s en masse, I just say:
http://semantic.nodix.net/#Politeia dc:creator http://semantic.nodix.net/#Plato.

See? No 404. You get to the homepage of this blog by clicking there. And it's valid RDF as well. So, isn't it just perfect? Everything we wished for?

Not totally, I fear. If I click on http://semantic.nodix.net/#Plato, I actually expect to read something about Plato, and not to see a blog about the Semantic Web. So this somehow would disappoint me. Better than a 404, still...
The other point is my bandwidth. There can be RDF files with thousands of references. Following every single one will lead to considerable bandwidth abuse. For naught, as there is no further information about the subject on the other side.
Maybe using http://semantic.nodix.net/person#Plato would solve both problems, with http://semantic.nodix.net/person being a website saying something like "This page is used to reserve conceptual space for persons. To understand this, you must understand the magic of URIs and the Semantic Web. Now, go back wherever you came from and have a nice day." Not too much webspace and bandwidth will be used for this tiny HTML-page.

You should be careful though not to have a real fragment identifier "Plato" in the page, or you would actually dereference to this element. URI collision again. You don't want Plato to become half-philosopher / half-XML-element, do you?
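Why is there no 404? Because the fragment never reaches the server: the client cuts it off and requests only the part before the #. A short Python sketch with the standard library shows which document is actually fetched for each of the URIs above; the helper name `document_to_fetch` is made up for the example.

```python
from urllib.parse import urldefrag


def document_to_fetch(uri):
    """Return the URL a client actually requests; the fragment stays on the client."""
    return urldefrag(uri).url


for uri in (
    "http://semantic.nodix.net/#Politeia",
    "http://semantic.nodix.net/#Plato",
    "http://semantic.nodix.net/person#Plato",
):
    print(uri, "->", document_to_fetch(uri))
```

The first two URIs lead to the very same page, so a client that caches it has to fetch it only once, no matter how many entities are named in that URI space.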

We will return to fragment identifiers in the last part of this six part series again. And now let's take a quick look at the fourth solution - we will discuss it more thoroughly next time.

Use a fresh URI whenever you need a URI and don't care about it giving a 404.

Wednesday, January 05, 2005

What's in a name - Part 3

Last time we merrily published our first statement for the Semantic Web:

http://www.imdb.com/title/tt0088247/ http://purl.org/dc/elements/1.1/creator "James Cameron".

A fellow Semantic Web author didn't like the number-encoded IMDb URI, but found a much more compelling one and then published the following statement:

http://en.wikipedia.org/wiki/The_Terminator http://purl.org/dc/elements/1.1/date "1984-10-26".

A third one sees those and, in order to foster the integration of data, helpfully offers the following statement:

http://www.imdb.com/title/tt0088247/ owl:sameAs http://en.wikipedia.org/wiki/The_Terminator.

And now they live merrily ever after. Or do you hear the thunder of doom rolling?

The problem is that the URIs above actually already denote something, namely the IMDb website about the Terminator and the Wikipedia article on the Terminator. They did not denote the movie itself, but that's how they're used in our examples. Statement #3 above actually says the two websites are the same. The first one says that "James Cameron" created the IMDb website on the Terminator (they'd wish), and the second one says that the Wikipedia article was created in 1984, which is wrong (July 23, 2001 would be the correct date). We have a classic case of URI collision.
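What an application makes of the three statements can be shown with a naive Python sketch. It treats triples as tuples and does a simple owl:sameAs merge; the function name `describe` and the merging strategy are assumptions for this example, not how any particular reasoner works.

```python
IMDB = "http://www.imdb.com/title/tt0088247/"
WIKI = "http://en.wikipedia.org/wiki/The_Terminator"

triples = [
    (IMDB, "dc:creator", "James Cameron"),
    (WIKI, "dc:date", "1984-10-26"),
    (IMDB, "owl:sameAs", WIKI),
]


def describe(uri, triples):
    """Collect all property/value pairs for uri and everything owl:sameAs links it to."""
    same = {uri}
    changed = True
    while changed:
        changed = False
        for s, p, o in triples:
            # Exactly one side is known so far: pull the other one in.
            if p == "owl:sameAs" and (s in same) != (o in same):
                same |= {s, o}
                changed = True
    return {(p, o) for s, p, o in triples if s in same and p != "owl:sameAs"}


# Everything ends up attached to the Wikipedia article, not to the movie.
print(sorted(describe(WIKI, triples)))
```

The machine now believes that the Wikipedia article was created by James Cameron in 1984, and nobody ever said a word about the movie.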

This happens all the time. People working professionally on this do this too:
_person foaf:interest http://dmoz.org/Computers/Security/.

I'd bet _person (remaining anonymous here) does not have such a heavy interest in the website http://dmoz.org/Computers/Security/, but rather in the topic the website is about.

_person foaf:interest _security.
http://dmoz.org/Computers/Security/ dc:subject _security.

Instead of letting _security be anonymous, we'd rather give it a real URI. This way we can reference it later.

_person foaf:interest http://semantic.nodix.net/topic/security.
http://dmoz.org/Computers/Security/ dc:subject http://semantic.nodix.net/topic/security.

But, oh pain - now we're at exactly the same spot we were in at the end of the last part. We have a URI that does not dereference to a website (by the way, I do know that the definition of foaf:interest actually says that the subject is interested in the topic of the object, and not the object itself, but that's not my point here).
Thinking for a moment about it, we must conclude that it is actually impossible to achieve both goals: either the URIs will identify a resource retrievable over the web and are thus unsuitable as URIs for entities outside the web (like persons, chairs and such) because of URI collision, or they don't - and will then lead to 404-land.

Isn't there any solution? (Drums) Stay tuned for the next exciting installment of this series, introducing not one, not two, not three, but four solutions to this problem!