ScratchPad (Opensearch Web) - r2 - 30 Sep 2007 - 11:45:38 - RobinGareus

Inventory:

(as of Sept 2007)
  • client/server core - provides user interface abstraction
  • pluggable architecture to multiple peer-to-peer networks
  • p2p network interface [testing] API
  • extendable crawler with website keyword extraction (prototype: so far only web keywords, no content extraction)
  • keyword to URL map HASH-table (to be stored in p2p DHT)
  • chord p2p network (C++ implementation) - development possibly stalled
  • 4 src/binaries that need to be set up and launched by hand... :(
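
The crawler and keyword map items above can be pictured roughly as follows. This is only a sketch, not the prototype's code: it assumes that "web keywords" means the page's <meta name="keywords"> tag, and the names MetaKeywordParser, keyword_key and index_page, the SHA-1 key and the in-memory dict standing in for the DHT are all made up for illustration.

```python
import hashlib
from collections import defaultdict
from html.parser import HTMLParser


class MetaKeywordParser(HTMLParser):
    """Collects the comma-separated terms of <meta name="keywords" content="...">."""

    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "keywords":
            content = attr.get("content") or ""
            self.keywords += [k.strip().lower() for k in content.split(",") if k.strip()]


def keyword_key(keyword):
    """Hypothetical DHT key: SHA-1 hex digest of the normalised keyword."""
    return hashlib.sha1(keyword.encode("utf-8")).hexdigest()


def index_page(index, url, html):
    """Adds url under the key of every keyword found in html."""
    parser = MetaKeywordParser()
    parser.feed(html)
    for kw in parser.keywords:
        index[keyword_key(kw)].add(url)
    return parser.keywords


index = defaultdict(set)  # in-memory stand-in for the p2p DHT (assumption)
page = '<html><head><meta name="keywords" content="P2P, search, Chord"></head></html>'
found = index_page(index, "http://example.org/", page)
```

Hashing the keyword gives fixed-size keys, which is what a DHT wants; a real deployment would put the URL sets into the p2p layer instead of a local dict.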

Design

  • High level Database: Keyword - Metadata - Content - URI (XML dataformat - low activity but complex tasks)

  • Low level Database: local db, p2p hash table(s), cache connected to p2p network (binary format - many simple requests - no overhead for keywords)

Flow:

crawling - indexing - storage - retrieval - filtering results

Crawler and indexer feed the High level database, which is mapped to the underlying storage system.

A user search is mapped to meta-data queries, which are in turn mapped to multiple p2p keyword lookups that are executed in parallel. This makes it possible to connect to p2p layers that provide schema-based search (e.g. jxta http://en.wikipedia.org/wiki/JXTA) as well as to make use of keyword-set search (as chord does http://pdos.csail.mit.edu/chord/).
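
A rough sketch of the "one search, many parallel keyword lookups" step. Everything here is an assumption for illustration: FAKE_DHT is an in-memory stand-in for a p2p layer, lookup and search are made-up names, and combining the per-keyword hits by intersection (AND semantics) is just one possible choice.

```python
from concurrent.futures import ThreadPoolExecutor

# In-memory stand-in for one p2p layer's keyword table (hypothetical data).
FAKE_DHT = {
    "p2p": {"http://a.example/", "http://b.example/"},
    "search": {"http://b.example/", "http://c.example/"},
}


def lookup(keyword):
    """One keyword lookup; a real backend would do a network round trip here."""
    return FAKE_DHT.get(keyword, set())


def search(query):
    """Splits the query into keywords, looks them up in parallel and intersects the hits."""
    keywords = query.lower().split()
    if not keywords:
        return set()
    with ThreadPoolExecutor(max_workers=len(keywords)) as pool:
        results = list(pool.map(lookup, keywords))
    return set.intersection(*results)


hits = search("P2P search")
```

Since each lookup mostly waits on the network, threads are enough to overlap the round trips; the total latency is then roughly that of the slowest single lookup.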

open-search uses XML on the user interface and for content storage, but not for the keywords, as the XML overhead is not affordable. With more powerful machines and p2p layers optimized for XML protocols this may change within 5 years from now.
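
To get a feeling for the overhead argument, here is a toy comparison of one keyword -> URL entry encoded as XML and as a packed binary record. The XML schema (entry/keyword/uri) and the binary layout (20-byte SHA-1 of the keyword, 2-byte length, URL bytes) are both invented for this sketch; neither is the open-search wire format.

```python
import hashlib
import struct
import xml.etree.ElementTree as ET


def encode_xml(keyword, url):
    """Encodes one keyword -> URL entry as a small XML element (hypothetical schema)."""
    entry = ET.Element("entry")
    ET.SubElement(entry, "keyword").text = keyword
    ET.SubElement(entry, "uri").text = url
    return ET.tostring(entry, encoding="utf-8")


def encode_binary(keyword, url):
    """Encodes the same entry as a 20-byte SHA-1 keyword hash, a 2-byte length and the URL."""
    digest = hashlib.sha1(keyword.encode("utf-8")).digest()
    raw_url = url.encode("utf-8")
    return struct.pack("!20sH", digest, len(raw_url)) + raw_url


def decode_binary(blob):
    """Returns (keyword_hash, url) from a record produced by encode_binary."""
    digest, length = struct.unpack("!20sH", blob[:22])
    return digest, blob[22:22 + length].decode("utf-8")


xml_blob = encode_xml("chord", "http://pdos.csail.mit.edu/chord/")
bin_blob = encode_binary("chord", "http://pdos.csail.mit.edu/chord/")
print(len(xml_blob), len(bin_blob))
```

The binary record only carries the keyword's hash, so the keyword itself cannot be read back from it; that is fine for lookups, where the client hashes the query term anyway. Parsing cost, which this sketch does not measure, adds to the size difference.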

For now chord is very promising for getting/storing objects such as the keyword <-> URL relationship. With DHash and SFS it provides a distributed filesystem structure that can be used to store the indexed content. It's robust, scalable and well-maintained open source software. (Chord variations with alternate routing algorithms include Pastry, CAN, Symphony & Kademlia; they trade off routing latency vs. node state (memory use) vs. arrival/departure messages.)
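
For readers new to chord: keys and nodes are hashed onto the same identifier ring, and a key is stored on its successor, i.e. the first node whose id is equal to or follows the key's id. The sketch below shows only that placement rule with a global view of a tiny ring; it is not the chord C++ code, it leaves out finger tables and node joins/leaves, and the names Ring, ident and the node addresses are made up.

```python
import bisect
import hashlib

M = 16  # identifier bits for this toy ring; chord itself uses 160-bit SHA-1 ids


def ident(name):
    """Maps a node address or a key onto the 2**M identifier ring."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** M)


class Ring:
    """Global view of the ring; real chord nodes only know a few other nodes."""

    def __init__(self, node_names):
        self.nodes = sorted((ident(n), n) for n in node_names)
        self.ids = [i for i, _ in self.nodes]

    def successor(self, key):
        """Returns the name of the first node whose id is >= ident(key), wrapping around."""
        pos = bisect.bisect_left(self.ids, ident(key))
        return self.nodes[pos % len(self.nodes)][1]


ring = Ring(["node-a:4000", "node-b:4000", "node-c:4000"])  # hypothetical addresses
owner = ring.successor("chord")
```

In the real protocol no node has this global list; each node keeps a small routing table and forwards the lookup, which is where the latency vs. node-state trade-off mentioned above comes from.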

Q&A

What were our difficulties & design choices, and who is researching what we need?

We chose XML as the top-level format (it will prevail in the future), and as networking hardware evolves the hashed p2p data format can be replaced with XML. XML offers all the search/filter/tag/meta-info features required to build a content search engine.

Difficulties are expected when merging search results from different p2p engines into the open-search XML format:

  • There'd need to be some fuzzy logic which weights the results according to the credibility of the crawler that indexed the document, as well as gpg-key verification.

  • We'd also need a small engine to expand (recurse) search results: e.g. the result set for a search can be another search. (Recursion has a high latency, so we need to apply some tricks.)
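
The weighting idea from the first point could start out as simple as this. It is a plain weighted sum, not real fuzzy logic, and it skips gpg-key verification and result recursion entirely. The trust table, the default score and the (url, crawler_id, relevance) hit format are assumptions made for the sketch.

```python
from collections import defaultdict

# Hypothetical trust scores in [0, 1] per crawler; how they are derived is left open.
CRAWLER_TRUST = {"crawler-a": 0.9, "crawler-b": 0.4}
DEFAULT_TRUST = 0.1  # assumed score for crawlers we know nothing about


def merge_results(result_sets):
    """Merges hits from several p2p engines into one ranking.

    Each hit is (url, crawler_id, relevance) with relevance in [0, 1].
    A URL's score is the sum of relevance * trust over all hits for it.
    """
    scores = defaultdict(float)
    for hits in result_sets:
        for url, crawler_id, relevance in hits:
            scores[url] += relevance * CRAWLER_TRUST.get(crawler_id, DEFAULT_TRUST)
    # Highest score first; ties are broken by URL to keep the order stable.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


chord_hits = [("http://a.example/", "crawler-a", 0.5), ("http://b.example/", "crawler-b", 1.0)]
jxta_hits = [("http://b.example/", "crawler-a", 0.2), ("http://c.example/", "crawler-x", 1.0)]
ranking = merge_results([chord_hits, jxta_hits])
```

A URL reported by several engines collects score from each of them, so agreement between engines pushes a result up even if no single crawler is highly trusted.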

As for content indexing, there's a fair amount going on:

MIT has a couple of groups who extend the chord p2p framework. For data storage in p2p, berkeley.edu is ahead, and the MPI is researching p2p networks. Sun is pushing integration with jxta, and the w3.org is developing standards in the hope of better data exchange: http://www.w3.org/TR/wsdl

Resources / Stumble upons:

  • https://jxta.dev.java.net/ - we want a highly scalable version of this ;) The XML overhead limits its use for massive keyword lookups.
