This page is rather outdated. See http://doc.open-search.net for the current design document. Most of the ideas presented here, however, are still valid, and you are still encouraged to participate on this page or elsewhere on this website.
Objectives
- Redundant
- Distributed
- Resistant against attacks:
- DoS or DDoS
- Pollution
- Consider the case (Rolf Kleef suggested this one) where a marketeer employs a zombie-net to run tweaked crawlers that return false pages with links to their marketing targets
- Censoring
- Good search results
- No overload on participating nodes (as in e.g. onion)
- Open source, cross-platform
- Self-contained, easy installation, no dependencies
Ideas
- Let the client function as a proxy that captures URLs and pages to crawl, so surfing becomes crawling
- great idea! Not only does surfing become crawling, but in the meantime all links from that page should be fetched / followed / analyzed. We'll need a good scheme to make sure this does not obstruct surfing too much; round robin won't do
- consequences:
- popular sites are indexed a lot, probably a power law for crawled pages ...
- by using this principle, people actually help their own field of interest to be indexed rather well
- ...
- Recently I had the idea of making a Firefox plugin for 'collaborative topic exploration'. This plugin would be used by organisers and participants of congresses, students following a course, project groups, etc. The plugin would interface with some kind of online bookmarking service. When users are researching or looking into a particular topic they can bookmark / tag / annotate a page or snippet they find on the web. This way a common knowledge pool on a specific topic is built up - web2.0 style. This pool may then be consulted by everybody interested in that topic. It could also be used to feed URLs into our (distributed) database.
- In the first phase of the project we could implement an API to e.g. Google to help people actually find something
- Make it flexible enough to be able to implement new searching structures/algorithms without having to upgrade the entire system
- Let the crawler transform each page into a standard format to ease the work of the indexer when it has to analyze different document types
- Let the crawler do some caching of the links it follows; this could speed up the browsing experience significantly
- note: it might be nice to limit pages that were not requested to a maximum transfer rate until they are actually requested, so that prefetching does not slow down browsing.
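The note about limiting not-requested pages could be sketched as a token bucket that throttles background prefetching while leaving pages the user actually asked for untouched. This is only a sketch under assumptions: the class name `PrefetchThrottle`, its rate and burst parameters, and the injectable clock are hypothetical and not part of any existing design.

```python
import time


class PrefetchThrottle:
    """Token bucket limiting bytes spent on pages the user did not request.

    Hypothetical helper: names and parameters are illustrative only.
    """

    def __init__(self, rate_bytes_per_s, burst_bytes, clock=time.monotonic):
        self.rate = float(rate_bytes_per_s)   # refill speed
        self.capacity = float(burst_bytes)    # maximum saved-up budget
        self.tokens = float(burst_bytes)      # start with a full bucket
        self.clock = clock
        self.last = clock()

    def _refill(self):
        # Add tokens for the time elapsed since the last check, up to capacity.
        now = self.clock()
        elapsed = now - self.last
        self.last = now
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)

    def allow(self, nbytes, requested_by_user=False):
        """Return True if nbytes may be transferred right now."""
        if requested_by_user:
            return True  # never slow down what the user is actually surfing
        self._refill()
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False
```

A proxy would call `allow()` before each background fetch and simply postpone the fetch when it returns False. A page that the user then requests passes immediately with `requested_by_user=True`.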
Distribution
- create a leveled setup with a 'level' for each functionality:
- crawling
- indexing
- searching
- this makes it possible to have separate clients for each functionality, while still allowing a crawler to be integrated with an indexer, a searcher with an indexer, or a searcher with a crawler
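One way to picture the leveled setup is as three small interfaces that can live in separate clients or be combined in one. The sketch below is an assumption-heavy illustration: the names `Crawler`, `Indexer`, `Searcher`, `InMemoryIndex`, `StaticCrawler` and `crawl_and_index` are hypothetical, and the toy index is far simpler than anything a real node would use.

```python
from typing import Dict, List, Protocol, Set


class Crawler(Protocol):
    def fetch(self, url: str) -> str: ...


class Indexer(Protocol):
    def add(self, url: str, text: str) -> None: ...


class Searcher(Protocol):
    def search(self, query: str) -> List[str]: ...


class InMemoryIndex:
    """Toy client combining the indexing and searching levels."""

    def __init__(self) -> None:
        self.postings: Dict[str, Set[str]] = {}  # word -> URLs containing it

    def add(self, url: str, text: str) -> None:
        for word in text.lower().split():
            self.postings.setdefault(word, set()).add(url)

    def search(self, query: str) -> List[str]:
        words = query.lower().split()
        if not words:
            return []
        # A URL matches only if it contains every query word.
        hits = set.intersection(*(self.postings.get(w, set()) for w in words))
        return sorted(hits)


class StaticCrawler:
    """Stand-in crawler that serves pages from a dict instead of the network."""

    def __init__(self, pages: Dict[str, str]) -> None:
        self.pages = pages

    def fetch(self, url: str) -> str:
        return self.pages[url]


def crawl_and_index(crawler: Crawler, indexer: Indexer, urls: List[str]) -> None:
    """Glue for a client that integrates the crawling and indexing levels."""
    for url in urls:
        indexer.add(url, crawler.fetch(url))
```

Because each level only depends on the interface of the next, a crawler-only client could hand its pages to a remote indexer through the same `add` call.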
Ranking
How will ranking happen?
- seeing a link as a recommendation (the original Google idea) won't work because of link spam
- I see an opportunity in ranking higher whatever gets surfed most (see the client-as-proxy idea)
problems:
- automated spiders that only surf particular (spam) sites can be made
- only the popular web will be indexed
- how to recognize personal pages, like your webmail inbox
- Of course there should be content analysis. What kind of (unsupervised?) algorithms can be used to determine if a page or body of text is relevant to a particular query? Naive Bayes won't do
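The surf-based ranking idea, and the first problem listed with it, could be sketched by counting distinct clients per URL instead of raw hits, with a logarithm to dampen very popular sites. Everything here is an assumption made for illustration: the class `SurfRank`, the notion of a `client_id`, and the choice of `log1p` as dampening. It does not stop a zombie-net that controls many client ids, and it is no substitute for content analysis.

```python
import math
from collections import defaultdict


class SurfRank:
    """Rank URLs by how many distinct clients surfed them (hypothetical sketch)."""

    def __init__(self):
        self.visitors = defaultdict(set)  # url -> set of client ids

    def record_visit(self, client_id, url):
        # Repeated visits by the same client do not increase the score.
        self.visitors[url].add(client_id)

    def score(self, url):
        # log1p dampens the advantage of the most popular sites.
        return math.log1p(len(self.visitors.get(url, ())))

    def top(self, n):
        # Highest score first; ties are broken alphabetically by URL.
        return sorted(self.visitors, key=lambda u: (-self.score(u), u))[:n]
```

With this scheme a single spider hammering one site counts as one visitor, so it cannot outrank a page visited by two different people.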
Trust
How can you trust the data you receive, or how can its trustworthiness be measured?
- compare redundant data from multiple nodes
- (in case of crawled documents, small differences (personalisations) could be filtered out, and larger differences could mean a broken copy)
- combine certain elements in one product (crawler + indexer)
- keep track of all the nodes through registration?
- in case of registered nodes, keep track of node trustworthiness/value (how good are its provided results)
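Comparing redundant data and tracking node trustworthiness could be combined roughly as follows: hash each node's copy of a document, take the strict majority as the accepted version, and nudge every node's trust score up or down accordingly. This is a sketch under assumptions: the function names, the starting trust of 0.5 and the step size are hypothetical, and the comparison is on exact bytes, so it does not yet filter out the small personalisation differences mentioned above.

```python
import hashlib
from collections import Counter


def majority_digest(copies):
    """copies: dict node_id -> bytes.

    Returns (digest, agreeing_nodes), or (None, set()) without a strict majority.
    """
    digests = {node: hashlib.sha256(body).hexdigest() for node, body in copies.items()}
    if not digests:
        return None, set()
    winner, votes = Counter(digests.values()).most_common(1)[0]
    if votes * 2 <= len(digests):
        return None, set()  # a tie or plurality is not enough to decide
    return winner, {node for node, d in digests.items() if d == winner}


def update_trust(trust, copies, step=0.1):
    """Move each node's trust towards 1.0 if it agreed with the majority, else 0.0."""
    winner, agreeing = majority_digest(copies)
    if winner is None:
        return trust  # undecided: leave all scores untouched
    for node in copies:
        old = trust.get(node, 0.5)  # assumed neutral starting value
        target = 1.0 if node in agreeing else 0.0
        trust[node] = old + step * (target - old)
    return trust
```

The scores stay between 0 and 1 and change slowly, so a single bad copy does not immediately discredit a node.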
Search and Indexing
- search: routing, pheromone trails, flooding, slsk-like, pagerank, ... The first question is how to find a document; the second is how relevant it is. Maybe if a lot of people look at a document it becomes redundant, so there are many sources (eDonkey-like).
- index: try to find out what kind of document we are looking at(recipe, CV, blog, etc)
- index: link terms that are often associated with each other, and possibly use that in the search
- index: collect as much metadata about documents as possible and index this as well
- index: combine the crawler and indexer in one to make the trust issues smaller.
- http://www.techcrunch.com/2006/11/15/google-yahoo-and-microsoft-agree-to-standard-sitemaps-protocol/ In an encouraging act of collaboration, Google, Yahoo and Microsoft announced tonight that they will all begin using the same Sitemaps protocol to index sites around the web. Now based at Sitemaps.org, the system instructs web masters on how to install an XML file on their servers that all three engines can use to track updates to pages. This should make it easier to get your pages indexed in a simple and standardized way. People who use Google Sitemaps don’t need to change anything, those maps will now be indexed by Yahoo and Microsoft. The protocol is offered under an Attribution-ShareAlike Creative Commons License, so it can be used by any search engine, derivative variations using the same license can be created and it can be used for commercial purposes.
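A crawler in this project could read such Sitemaps files to learn which pages exist and when they last changed. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; the function name `parse_sitemap` is hypothetical, and it only handles a plain `urlset` document, not sitemap index files or compressed sitemaps.

```python
import xml.etree.ElementTree as ET

# Namespace used by Sitemaps protocol 0.9 documents.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text):
    """Return a list of (loc, lastmod) tuples; lastmod is None when absent."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", default=None, namespaces=SITEMAP_NS)
        if not loc:
            continue  # skip entries without a location
        lastmod = url.findtext("sm:lastmod", default=None, namespaces=SITEMAP_NS)
        entries.append((loc.strip(), lastmod.strip() if lastmod else None))
    return entries
```

The `lastmod` value could then be compared with the time of the last crawl to decide whether a page needs to be fetched again.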
Storage
Distributed storage will make searches slow. Should there be central nodes that index rankings?
Suggestions against the slowness:
- partially centralise index databases (special nodes)
- let the searchers do some caching of the databases
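Caching on the searcher side could look roughly like a small least-recently-used cache with an expiry time, so that repeated lookups do not go out to the distributed storage every time. This is a sketch under assumptions: the class `IndexCache`, its size and lifetime parameters, and the `fetch` callback standing in for a remote index lookup are all hypothetical.

```python
import time
from collections import OrderedDict


class IndexCache:
    """LRU cache with expiry for index lookups fetched from remote nodes."""

    def __init__(self, max_entries=1000, ttl_s=300.0, clock=time.monotonic):
        self.max_entries = max_entries
        self.ttl_s = ttl_s
        self.clock = clock
        self.entries = OrderedDict()  # term -> (time stored, value)

    def get(self, term, fetch):
        """Return the cached value for term, calling fetch(term) on a miss."""
        now = self.clock()
        hit = self.entries.get(term)
        if hit is not None and now - hit[0] < self.ttl_s:
            self.entries.move_to_end(term)  # mark as recently used
            return hit[1]
        value = fetch(term)  # slow path: ask the distributed storage
        self.entries[term] = (now, value)
        self.entries.move_to_end(term)
        while len(self.entries) > self.max_entries:
            self.entries.popitem(last=False)  # evict the least recently used
        return value
```

The expiry time trades freshness for speed: a longer lifetime means fewer remote lookups but staler results.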
Organisation
- Modularize (strong interaction needed in early phase of course)
- Storage team, working on storing data in a p2p-way
- Search team, working on search algorithms
- Indexing team, working on how to organize the crawled data
- Crawling team, working on distributed crawling
- Funding team, to find funding so we can pay people to work on this
Funding?