I’ve created a new experiment using ClearForest’s Content Analysis services. It’s called SixDegrees and is located here:
SixDegrees is a Semantic Web experiment using Ajax, RSS and RDF combined creatively with the Content Analysis services of ClearForest. It’s a mashup but not in the typical sense.
SixDegrees borrows the notion that six degrees of separation connect everyone in the world. My idea takes this a step further by trying to work out the degrees of separation between everything rather than everyone.
Here’s how it works.
A repository of RSS feeds is stored in my database. These feeds are polled periodically for new content. The latest content is then parsed by ClearForest, which classifies it and returns a set of relevant tags grouped by type (e.g. person, company). These are then stored along with sundry other metadata. For example, a given story on Windows Vista might return the tags "Seattle", "Microsoft" and "Bill Gates", depending on the content.
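To make the stored data concrete, here is a minimal Python sketch of what one processed story and a tag lookup might look like. The `Story` class, its field names, the tag type labels and `build_tag_index` are all assumptions made for illustration; the actual database schema and the ClearForest response format are not shown in this post.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class Story:
    """One processed feed item plus its typed tags (hypothetical layout)."""
    story_id: str
    title: str
    # Maps a tag type (e.g. "person", "company") to the tags of that type.
    tags: Dict[str, Set[str]] = field(default_factory=dict)


def build_tag_index(stories):
    """Invert the stories: map each tag to the IDs of the stories mentioning it."""
    index = defaultdict(set)
    for story in stories:
        for tag_set in story.tags.values():
            for tag in tag_set:
                index[tag].add(story.story_id)
    return dict(index)


# The Windows Vista example from the text, with assumed type labels.
vista = Story(
    story_id="AAA",
    title="A story on Windows Vista",
    tags={
        "city": {"Seattle"},
        "company": {"Microsoft"},
        "person": {"Bill Gates"},
    },
)
index = build_tag_index([vista])
```

Keeping an inverted index from tag to stories is one simple way to answer "which stories mention this entity?", which is the lookup the rest of the service depends on.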
Hundreds of blog entries are processed in this manner and over time a repository is established. This is where the SixDegrees service comes in.
The front-end website allows you to choose a start and an end entity. These are essentially tags queried from the database. The web service then determines whether these entities are linked in any way through common references within the database.
For example, if the term Bill Gates appears in story AAA and also in story BBB, then those two stories can be thought of as linked to one another through the common reference to Bill Gates. If these stories also contain other key terms, for example Windows in one and Steve Ballmer in the other, then a link between Steve Ballmer and Windows is established through Bill Gates. The more references there are, the more confident we can be that the link makes sense semantically.
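The linking idea can be sketched in Python as a co-occurrence count: every pair of tags that appears in the same story gets an edge, and the edge weight is the number of stories that mention both. This is an assumption about how references might be counted, not a description of the actual service; `build_cooccurrence_graph` and the sample stories are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_cooccurrence_graph(story_tags):
    """Return {tag: {neighbour: shared_story_count}} for an undirected graph."""
    graph = defaultdict(lambda: defaultdict(int))
    for tags in story_tags.values():
        # Each unordered pair of distinct tags in one story is one reference.
        for a, b in combinations(sorted(set(tags)), 2):
            graph[a][b] += 1
            graph[b][a] += 1
    return {tag: dict(neighbours) for tag, neighbours in graph.items()}


# Hypothetical stories: tags only, keyed by story ID.
story_tags = {
    "AAA": ["Bill Gates", "Windows"],
    "BBB": ["Bill Gates", "Steve Ballmer"],
    "CCC": ["Bill Gates", "Windows", "Microsoft"],
}
graph = build_cooccurrence_graph(story_tags)

# Bill Gates and Windows share two stories, so that edge has weight 2.
# Steve Ballmer and Windows never share a story, so they are only
# connected indirectly, through Bill Gates.
```

A higher weight corresponds to more common references, which matches the intuition that a frequently repeated link is more trustworthy than a one-off mention.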
As you can see, the data forms an undirected graph. My service processes this graph efficiently enough to return results in real time. I also use Ajax techniques to improve the user interface.
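One common way to find the fewest degrees of separation in an undirected, unweighted graph is a breadth-first search. The sketch below assumes that approach; the post does not say which algorithm the service actually uses, and `find_connection` and the sample graph are hypothetical.

```python
from collections import deque


def find_connection(graph, start, end):
    """Return the shortest list of entities from start to end, or None."""
    if start not in graph or end not in graph:
        return None
    if start == end:
        return [start]
    previous = {start: None}  # Also serves as the visited set.
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbour in graph[current]:
            if neighbour in previous:
                continue
            previous[neighbour] = current
            if neighbour == end:
                # Walk the predecessor chain back to the start.
                path = [end]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(neighbour)
    return None


# Hypothetical adjacency sets; "Australia" has no links in this tiny sample.
graph = {
    "Steve Ballmer": {"Bill Gates"},
    "Bill Gates": {"Steve Ballmer", "Windows"},
    "Windows": {"Bill Gates"},
    "Australia": set(),
}
linked = find_connection(graph, "Steve Ballmer", "Windows")
unlinked = find_connection(graph, "Australia", "Windows")
```

Here `linked` is the three-entity chain through Bill Gates and `unlinked` is `None`. Breadth-first search visits each entity and edge at most once, which is one reason it suits interactive queries.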
All data is user-generated, essentially from blogs. By processing data from the blogosphere in this manner and combining it with the services of ClearForest, I can determine the semantics of the content. This is essentially a very small step towards the Semantic Web.
Once a connection has been found, the resulting link is documented in RDF. The RDF essentially describes the triples within the link. Developers will be glad to learn that I have validated the RDF using the W3C RDF validator located here [LINK].
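To illustrate what "triples within the link" could mean, the Python sketch below turns a chain of entities into subject-predicate-object statements, one per hop, written in the N-Triples syntax. Everything specific here is an assumption: the `http://example.org/sixdegrees/` namespace, the `linkedTo` predicate and the helper names are invented for the example, and the service's real vocabulary and serialization may differ (the W3C validator, for instance, takes RDF/XML).

```python
from urllib.parse import quote

# Hypothetical namespace and predicate, chosen only for this illustration.
BASE = "http://example.org/sixdegrees/"
LINKED_TO = BASE + "linkedTo"


def entity_uri(name):
    """Build a URI for an entity, percent-encoding spaces and punctuation."""
    return BASE + "entity/" + quote(name, safe="")


def path_to_ntriples(path):
    """Emit one N-Triples statement for each consecutive pair in the path."""
    lines = []
    for subject, obj in zip(path, path[1:]):
        lines.append(
            f"<{entity_uri(subject)}> <{LINKED_TO}> <{entity_uri(obj)}> ."
        )
    return "\n".join(lines)


document = path_to_ntriples(["Steve Ballmer", "Bill Gates", "Windows"])
print(document)
```

A path of three entities yields two statements, so a reader of the RDF can rebuild the whole chain by following the `linkedTo` statements from the start entity.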
The RDF validator can generate a graph representing your RDF triples. As an example of the output, and as proof that this works, check out the graph generated between the country "Australia" and the company "Dell". [LINK]
Lastly, should you so desire, I have exposed the capabilities of the service through SOAP and REST interfaces so that developers can build on top of the data collected. I need to document these better, but for now here are a few sample queries:
And of course, the connection WSDL is located here [LINK].
This is an experiment and I have already thought of a number of ways to improve it, time permitting. I hope you find this tool as interesting as I do.