
Class review: 6.898 Linked Data Ventures

You may remember a brief preview, at the beginning of the fall semester, of my Linked Data Ventures class, which was taught by Tim Berners-Lee. In the months since that post, we rolled up our sleeves and got into the concepts and languages that support the Semantic Web, and also created real applications and business ideas based on Semantic Web/Linked Data.

TBL taught some of the classes, but we also had some great technical sessions with Lalana Kagal and Ian Jacobi from MIT’s CSAIL as well as business sessions with Reed Sturtevant and Katie Rae. Another organizer for the class was K. Krasnow Waterman, a 2006 MIT Sloan Fellow who told me about the history of the Linked Data Ventures Class when I met her at an alumni reception in New York earlier this month.

In addition, nearly every week we had guest speakers who work with these technologies or develop companies based on Linked Data, including OpenCalais; Jim Hendler, an RPI faculty member who has worked on the federal government's linked data initiatives; and numerous startup founders.

What I want to show in this post is a summary of what we learned, from the point of view of someone who started the class with only a vague understanding of what the Semantic Web was. Here are some examples from my homework assignments for 6.898 in the early part of the semester (Note: There may be mistakes!). At the end of the post, I offer some concluding thoughts about the class and the broader SemWeb ecosystem.

Assignments:

My circles-and-arrows diagram for assignment 2. The goal was to get us to think about the relationships described in a paragraph of text in terms of subject-predicate-object “triples”. Here’s the assigned text:

Joe Lambda, a 25-year-old man, has a FOAF file. Joe has an AIM account “jlambda”, and a Jabber account “joe.lambda@example.com”, which is also his e-mail address. Joe is a graduate student at Foobar University, a university in Cambridge, Massachusetts (42.373611°N, 71.110556°W), the homepage of which is located at “http://foobar.example.org/”.

Joe Lambda has two friends, Bill Foo and G. Baz. Normally, Joe lives in Somerville, Massachusetts (42.3875°N, 71.1°W), a city that borders Cambridge, with Bill. G. Baz is their neighbor. Joe, Bill, and G. have a number of different interests, but are all interested in Linked Data. Joe is also interested in Astronomy, and Cricket, Bill also enjoys American Literature and Baseball, and G. is interested in the TV show Arrested Development and Hockey.

And here’s the diagram:

Then we moved on to the languages, starting with Turtle/N3, which expresses subject-predicate-object (SPO) relationships in a more human-readable format than XML-based RDF. A brief, imperfect sample, based on the text from assignment 2 above:

@prefix ex: .
@prefix dbp: .
@prefix sws57: .
@prefix sws72: .
@prefix sws26: .
@prefix foo: .
@prefix rdf: .
@prefix rdfs: .
@prefix foaf: .
@prefix gn: .
@prefix rel: .
@prefix geo: .
@prefix vivo: .
@prefix xsd: .


ex:me foaf:interest dbp:Cricket .
dbp:Cricket rdfs:label "Cricket"@en .
ex:me foaf:name "Joe Lambda"@en;
    foaf:age "25"^^xsd:int;
    foaf:gender "male";
    foaf:aimChatID "jlambda";
    foaf:mbox <mailto:joe.lambda@example.com>;
    foaf:schoolHomepage foo:;
    foaf:based_near sws57:;
    rel:livesWith [ rel:livesWith ex:me;
        rdf:type foaf:Person;
        foaf:based_near sws57:;
        foaf:name "Bill Foo";
        foaf:interest dbp:Baseball;
        foaf:interest dbp:Linked_Data ] .
dbp:Linked_Data rdfs:label "Linked Data" .
dbp:Baseball rdfs:label "Baseball" .
foo:about\#university foaf:homepage foo:;
    rdf:type vivo:University;
    rdfs:label "Foobar University";
    foaf:based_near sws72: .
sws72: rdfs:label "Cambridge"@en;
    geo:lat "42.373611"^^xsd:decimal;
    geo:long "-71.110556"^^xsd:decimal;
    gn:parentADM1 sws26:;
    rdf:type gn:Feature;
    gn:neighbour sws57: .
sws26: rdfs:label "Massachusetts"@en;
    rdf:type gn:Feature .
sws57: gn:neighbour sws72:;
    rdf:type gn:Feature;
    gn:parentADM1 sws26:;
    geo:lat "42.3875"^^xsd:decimal;
    geo:long "-71.1"^^xsd:decimal;
    rdfs:label "Somerville"@en .
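
To make the role of the @prefix lines a little more concrete, here is a minimal Python sketch (my own illustration, not something from the class) of how a prefixed name such as foaf:name expands into a full URI. The foaf: and rdfs: namespaces are the published ones; the ex: namespace and the `expand` helper are assumptions made up for this example.

```python
# Sketch: how an @prefix declaration turns a short prefixed name into a full URI.
PREFIXES = {
    "foaf": "http://xmlns.com/foaf/0.1/",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "ex": "http://example.org/joe#",  # hypothetical namespace, for illustration only
}


def expand(name: str) -> str:
    """Expand 'prefix:local' into a full URI using the PREFIXES table."""
    prefix, sep, local = name.partition(":")
    if not sep or prefix not in PREFIXES:
        raise ValueError(f"unknown or missing prefix in {name!r}")
    return PREFIXES[prefix] + local


# One triple from the sample above: subject, predicate, object (a plain literal).
subject, predicate, obj = ("ex:me", "foaf:name", "Joe Lambda")
expanded = (expand(subject), expand(predicate), obj)
print(expanded)
```

The point is simply that every subject and predicate in the graph is really a full URI, and the prefixes are shorthand.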

We also designed our own ontologies, which define the terms, relationships, and other Semantic Web concepts for a given topic area. RDF and Turtle/N3 graphs can then reuse those ontologies (this is what the @prefix lines refer to in the previous example). For assignment #4, we had to create an ontology for top-level biology definitions. Mine looked like this:

@prefix owl: .
@prefix xsd: .
@prefix rdfs: .
@prefix rdf: .

owl:Class rdfs:subClassOf rdfs:Class .

Eukaryote a owl:Class,
    [ a owl:Restriction;
      owl:onProperty cell;
      owl:allValuesFrom CellWithNucleus ] .

NonEukaryote a owl:Class,
    [ a owl:Restriction;
      owl:onProperty cell;
      owl:allValuesFrom CellNoNucleus ] .

LivingThing a owl:Class;
    owl:unionOf ( Eukaryote NonEukaryote ) .
NonLivingThing a owl:Class .
LivingThing owl:complementOf NonLivingThing .

CellWithNucleus a owl:Class,
    [ a owl:Restriction;
      owl:cardinality "1"^^xsd:nonNegativeInteger;
      owl:onProperty nucleus ] .

CellNoNucleus a owl:Class,
    [ a owl:Restriction;
      owl:cardinality "0"^^xsd:nonNegativeInteger;
      owl:onProperty nucleus ] .

CellWithNucleus owl:complementOf CellNoNucleus .

cell rdf:type rdf:Property;
    rdfs:domain LivingThing .

nucleus rdf:type rdf:Property;
    rdfs:domain CellWithNucleus .

Species rdfs:subClassOf LivingThing .

speciesName rdf:type rdf:Property;
    rdfs:domain LivingThing;
    rdfs:range Species .

datedescribed rdfs:subPropertyOf speciesName;
    a owl:DatatypeProperty;
    rdfs:range xsd:date;
    rdfs:domain Species .

describername rdfs:subPropertyOf speciesName;
    rdfs:domain Person .

Animal a owl:Class,
    [ a owl:Restriction;
      owl:minCardinality "1"^^xsd:nonNegativeInteger;
      owl:onProperty cell ] .
Animal owl:intersectionOf ( Eukaryote Species ) .

HasTail rdfs:subClassOf Animal .
HasLegs rdfs:subClassOf Animal .
LeggedTailedAnimal a owl:Class;
    owl:unionOf ( HasTail HasLegs ) .
numberOfLegs a owl:DatatypeProperty;
    rdfs:domain HasLegs;
    rdfs:range xsd:integer .

Fungi a owl:Class,
    [ a owl:Restriction;
      owl:minCardinality "1"^^xsd:nonNegativeInteger;
      owl:onProperty cell ] .
Fungi owl:intersectionOf ( Eukaryote Species ) .

Plants a owl:Class,
    [ a owl:Restriction;
      owl:minCardinality "1"^^xsd:nonNegativeInteger;
      owl:onProperty cell ] .
Plants owl:intersectionOf ( Eukaryote Species ) .

Bacteria a owl:Class,
    [ a owl:Restriction;
      owl:cardinality "1"^^xsd:nonNegativeInteger;
      owl:onProperty cell ] .
Bacteria owl:intersectionOf ( NonEukaryote Species ) .

Archaea a owl:Class,
    [ a owl:Restriction;
      owl:cardinality "1"^^xsd:nonNegativeInteger;
      owl:onProperty cell ] .
Archaea owl:intersectionOf ( NonEukaryote Species ) .

Protists a owl:Class,
    [ a owl:Restriction;
      owl:cardinality "1"^^xsd:nonNegativeInteger;
      owl:onProperty cell ] .
Protists owl:intersectionOf ( Eukaryote Species ) .

Unfortunately, it did not map well to the ideal solution that we were shown after we handed it in. Creating a model of these relationships depends heavily on logic as well as an understanding of the capabilities of OWL, the language that ontologies are written in.
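
One small piece of that logic is that rdfs:subClassOf is transitive: if HasTail is a subclass of Animal, and Animal is a subclass of Species, then every HasTail is also a Species. The Python sketch below walks that chain for a few class names from my ontology. It is only an illustration under stated assumptions: the `superclasses` helper is hypothetical, it looks at explicit subclass statements only and ignores OWL restrictions, and the two Animal entries are my reading of what the intersectionOf statement implies.

```python
# Sketch: following rdfs:subClassOf links transitively.
# Each pair is (subclass, superclass), using class names from the ontology above.
SUBCLASS_OF = {
    ("HasTail", "Animal"),
    ("HasLegs", "Animal"),
    ("Animal", "Species"),     # assumed: implied by intersectionOf ( Eukaryote Species )
    ("Animal", "Eukaryote"),   # assumed: implied by the same intersectionOf statement
    ("Species", "LivingThing"),
}


def superclasses(cls, edges=SUBCLASS_OF):
    """Return every class reachable from cls by following subclass links."""
    found = set()
    frontier = [cls]
    while frontier:
        current = frontier.pop()
        for sub, sup in edges:
            if sub == current and sup not in found:
                found.add(sup)
                frontier.append(sup)
    return found


print(sorted(superclasses("HasTail")))
```

A real OWL reasoner does far more than this, but even this small closure shows how a single misplaced subclass statement can ripple through a whole model.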

Finally, we learned the Semantic Web query language, SPARQL. I had taken a SQL class years ago at the Boston College Woods College of Advancing Studies, and that experience was a good introduction to SPARQL, which involves matching patterns against existing triples to produce result sets or new graphs, in a very SQL-like manner.

The following SPARQL example that was shown to us in the class lab generates a list of countries from a triplestore based on the CIA World Factbook and restricts it to countries with a certain area and population:

PREFIX factbook:
SELECT ?country ?population_total ?area
WHERE {
    ?country factbook:population_total ?population_total .
    ?country factbook:name ?country_name .
    ?country factbook:area_total ?area .
    FILTER (?population_total > "5000000"^^xsd:long || ?area > "500000"^^xsd:long) .
}
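
For readers who think in code rather than query languages, here is a rough Python sketch of what that query does mechanically. It is not how a triplestore actually works, the `values` and `run_query` helpers are hypothetical, and the three countries and their numbers are made-up placeholders rather than Factbook data.

```python
# Sketch: the SPARQL query above, imitated over a tiny in-memory list of triples.
# All subjects and values here are invented placeholders, not real Factbook data.
TRIPLES = [
    ("ex:A", "factbook:population_total", 7_000_000),
    ("ex:A", "factbook:name", "Country A"),
    ("ex:A", "factbook:area_total", 40_000),
    ("ex:B", "factbook:population_total", 300_000),
    ("ex:B", "factbook:name", "Country B"),
    ("ex:B", "factbook:area_total", 900_000),
    ("ex:C", "factbook:population_total", 100_000),
    ("ex:C", "factbook:name", "Country C"),
    ("ex:C", "factbook:area_total", 2_000),
]


def values(subject, predicate):
    """All objects for a given subject and predicate."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def run_query():
    """Bind ?country, ?population_total and ?area, then apply the FILTER."""
    results = []
    countries = sorted({s for s, p, _ in TRIPLES if p == "factbook:population_total"})
    for country in countries:
        for population in values(country, "factbook:population_total"):
            for _name in values(country, "factbook:name"):  # must exist, as in the WHERE clause
                for area in values(country, "factbook:area_total"):
                    if population > 5_000_000 or area > 500_000:
                        results.append((country, population, area))
    return results


for row in run_query():
    print(row)
```

Note that || is a logical OR, so a country is returned if it exceeds either threshold, not necessarily both.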

But the class wasn’t just about learning these languages and concepts. For the second half, we were tasked with forming teams and developing an actual application and business model built on Linked Data. The instructors for this segment were Reed Sturtevant and Katie Rae, but we got a lot of feedback from Tim Berners-Lee, Lalana Kagal and Ian Jacobi during the practice demo in late November. Startup founders and angels gave us some additional feedback on demo/pitch day on December 7. Our team consisted of two Sloan Fellows and an undergrad Computer Science/Media Lab student. We ended up creating a neat little educational app that teaches kids about different countries. You can see a brief demo in the following video (scroll ahead about two or three minutes to see it):

The winner of the demo contest was a neat restaurant review/location service. The people on the team seemed pretty serious about taking it to the next level, so we’ll see how that progresses over the spring.

There is also the question of the future of the wider Semantic Web/Linked Data world. For ten years people have been talking about the potential of the technology, and there have certainly been a slew of tools, projects, apps, and datasets made available. But there are also some limitations to the Semantic Web/Linked Data, as our study group found out when we were designing our mobile educational application. Performing live queries to the Web was a no-go, owing to the slow response time, and many of the datasets (including the widely used DBpedia graph) were inconsistent or had other flaws.

Yod, Mads and I went to TBL after our December 7 demo to discuss the “curation problem,” and he offered some interesting suggestions. For instance, in choosing the best photos from flickrwrapper for the “places” part of the geography app, we could add some geocoded logic to find the best light/positioning (300 meters west of the object at a certain time of the day) and employ some to-be-determined algorithm or AI to “make sure Aunt Jenny isn’t in the frame”. He also suggested leveraging Google to programmatically derive the semantic meaning of certain terms that have additional definitions beyond geography. But the idea of using existing Linked Data, standard queries and ontologies without extensive programmed/human curation is just a dream … at least for the time being.

Beyond the technical issues, there is also the lingering question of what sorts of killer apps might be derived from the Semantic Web. I think a key reason the 6.898 class exists is to help launch more Semantic Web-based startups, open-source tools, and new datasets, in the hope that one or more of these efforts will spark a truly innovative or ground-breaking app that moves LD and the Semantic Web into the mainstream in a highly visible way. I don’t know if our educational app or the others from the class will move beyond the prototype phase, but there has been a lot of serious talk in our class about using these and other ideas as the basis of new ventures once we finish. I’ve been thinking about how the Semantic Web could vastly improve many common data-driven genealogy or history applications, areas I have written about for years (see “Google/Ancestry.com followup: Using outsourced Chinese labor to overcome OCR limits” and “Making a case for quantitative research in the study of modern Chinese history: The Xinhua News Agency and Chinese policy views of Vietnam, 1977–1993”). Over the next few months I will do some additional research and reach out to people at MIT and elsewhere to evaluate the viability of such a venture (feel free to contact me at ian dot lamont -at- sloan dot mit dot edu if you want to discuss).

Lastly, I would like to offer my profuse thanks to K. Krasnow, Reed, Katie, Ian, Lalana and TBL for not only offering Linked Data Ventures this year, but also for making it a truly challenging and eye-opening experience. It really is one of the best classes I’ve had at MIT.

Tim Berners-Lee’s primer on the Semantic Web

The students file into an ordinary, medium-sized classroom in building 4, near the center of campus. Outside, it’s a beautiful afternoon, a few days before the autumnal equinox. The room is brightly lit, thanks to its tall windows. Muffled sounds of trumpets and horns can be heard nearby; there is an active music community at MIT, and some students take classes in music and the performing arts in building 4.

After everyone has settled into their seats, the professor gets up in front of the class. He is thin, has gray hair, and wears the standard faculty attire — khakis and a long-sleeved, light blue button-down shirt without a tie. Seeing him walking down the corridor, most would have no idea who he is, but to a few he’s given away by the large MacBook Pro tucked under one arm, covered with stickers, including one from the W3C — the World Wide Web Consortium.

The man is actually the director of the W3C and has played a remarkable role in the history of computing and, indeed, the course of human history. He’s Tim Berners-Lee, the inventor of the World Wide Web, arguably the most important communications invention since Gutenberg used movable type to create the first printed Bible.

Everyone reading this post has been touched by the Web in untold ways. For some people, including me, the Web has changed their lives. Now I am about to hear about another Internet technology that Berners-Lee hopes will make as big an impact: the Semantic Web.

Berners-Lee starts talking. He has an English accent, I’m guessing from somewhere in the Southeast. In front of this new audience he talks quickly, the thoughts sometimes tumbling out faster than he can speak them.

The first thing he writes on the chalkboard is http:// and a domain name — two of the fundamental elements of the World Wide Web. He adds an anchor tag.

“To a certain extent, when you go to the Semantic Web, you’ll have to leave that all behind,” he says.

Berners-Lee writes a URI, http://www.w3.org/People/Berners-Lee/card#i, and explains that it returns data, not a Web page.

“This,” he says, pointing to the URI, “is me.”

As a muffled horn ensemble begins to warm up in the next room, he gives a primer on the Semantic Web, how it’s different from the World Wide Web, and some of the basic concepts that make it work: URIs (not URLs), XML, RDF (see my post from earlier in the week), triples, and ontologies. These technologies can turn the World Wide Web into a linked, queryable database, and give relationships and meaning to otherwise unstructured data on the Web.

Berners-Lee likes to draw diagrams of the RDF graphs, and sometimes uses the circle/arrow notation that’s used to model Linked Data relationships (I am using “Semantic Web” and “Linked Data” interchangeably, per the usage employed by one of the other instructors later in the class). He shows the standard “Subject-Predicate-Object” (aka subject-verb-object) format used for triples, and describes how they might be used to describe certain relations:

Tim Berners-Lee (subject) has an assistant (predicate) Amy (object).

And vice-versa:

Each one of the elements in these relationships will be a link. For unique entities, like a person, there should be a document that describes all of the properties of that individual. As described above, Tim Berners-Lee’s is http://www.w3.org/People/Berners-Lee/card#i, and contains information such as his public home page, photographs, projects he’s participated in, and even the people he knows. Everything in the list is a link. For common verbs or relations, there are definitions already in existence that can also be referenced by a link, so new definitions need not be created from scratch. The idea of the Semantic Web is that these machine-readable entities, relationships, and descriptions can be used for queries or specialized applications, for instance “Who is Tim Berners-Lee’s current assistant?” or “What is TBL’s assistant’s email address?” or “Return a list of all of the email addresses of current MIT faculty assistants”. The beauty of the Semantic Web is that the data is (ideally) readily available on the Web, instead of in a proprietary database somewhere, and can be manipulated by software agents.
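
As a rough illustration of how a software agent might answer the first two of those questions, here is a minimal Python sketch that stores triples as tuples and matches them against a pattern. Only Berners-Lee's URI comes from the class; Amy's URI, her mailbox, the ex: predicate names, and the `match` helper are all hypothetical, invented for this example.

```python
# Sketch: answering "Who is TBL's assistant?" by matching triple patterns.
TBL = "http://www.w3.org/People/Berners-Lee/card#i"
AMY = "http://example.org/people/amy#i"  # hypothetical URI, for illustration only

TRIPLES = {
    (TBL, "ex:hasAssistant", AMY),                 # hypothetical predicate
    (AMY, "ex:assistantTo", TBL),                  # the reciprocal link
    (AMY, "foaf:mbox", "mailto:amy@example.org"),  # made-up address
}


def match(s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        t for t in TRIPLES
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# "Who is Tim Berners-Lee's current assistant?"
assistants = [o for _, _, o in match(s=TBL, p="ex:hasAssistant")]

# "What is TBL's assistant's email address?" Chain a second pattern onto the first.
emails = [o for a in assistants for _, _, o in match(s=a, p="foaf:mbox")]

print(assistants, emails)
```

Because every element is a link, the second lookup can pick up exactly where the first one left off, even if the two facts live in different documents.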

Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/

Students in the class ask questions. They vary in complexity. The audience is a mixed bunch of Computer Science graduate students, Sloan MBAs, and the odd LGO and Sloan Fellow. Some of the CS students already get this. To others with non-technical backgrounds, it’s completely new. I fall somewhere in between: I can code HTML and am familiar with XML, but other Semantic Web technologies were unknown to me before I registered for the course.

An MBA asks: What happens when inconsistencies arise in linked data? For instance, what if Amy leaves her job, but only one of the reciprocal links above is adjusted to reflect that?

“This is the Web!” Berners-Lee declares. “It’s not consistent!”

This leads to a discussion of the value of having links in both directions from RDF graphs talking about the same thing, and then his “five-star” system of rating sites (or organizations?) on their ability to post data openly on the Web, especially machine-readable data.

I want to ask a lot of questions, but I hesitate. My background is online media, and the creator of the Web is standing in front of the class. It’s like being able to ask Gutenberg a question about his next generation of printing presses.

“Can you talk a little bit about trust?” I finally ask. I’m thinking about the reliability of the relationships identified in triples, and the potential for the linked data system to be abused, much as earlier Internet platforms such as email and the Web have been overrun by spam and malware.

Berners-Lee pauses, expressionless. A few people laugh. Have I really asked that stupid a question, or does everyone think I am talking about the broader concept of trust?

I make a clarification. “At last week’s lab, we were shown the layers of the Semantic Web, and one of them was –”

He interrupts me, and gestures toward the blackboard. “I can talk about it, but I am afraid it would take hours,” he says. The long and short of it: It’s a complex area, and the subject of much of the current Semantic Web research. “There’s a big social element,” he concludes, and leaves the discussion at that.


Other classroom encounters: