I successfully defended my thesis on September 2nd, 2003! The final version is available in Word format here: http://www.employees.org/~alokem/thesis/thesis-outline.doc
[july 4, 2003]
[july 2, 2003]
[june 30, 2003]
Good explanation of the techniques that some p2p apps use to get around the problem of Network Address Translation (NAT). Includes a small application that checks your NAT to see if it supports these techniques.
[may 29, 2003]
Nullsoft announces a new app for collaboration in small groups called WASTE. And it gets taken down from their website the same day that it is posted! Shades of Gnutella, but will this app be as revolutionary? Mirrors and ports to new platforms are proliferating as I write.
integrating biological databases | http://www.nature.com/cgi-taf/DynaPage.taf?file=/nrg/journal/v4/n5/full/nrg1065_fs.html
Great article by Perl guru Lincoln Stein about approaches to the incredibly difficult task of integrating information from the growing number of biological databases.
[april 14, 2003]
Archived webcasts from CITO/OCRI Techtalk "Empowering the Edge: Issues, Challenges and Technology for Peer-to-Peer Computing" - Originally Presented: Tuesday, February 25, 2003. Includes a talk by my supervisor Dr. Babak Esfandiari "Meta-data Support in P2P Systems: Overview and Case Study" which talks about U-P2P.
[april 8, 2003]
Paper submission deadline: 2 June 2003
Notification of acceptance: 23 June 2003
Camera-ready papers due: 7 July 2003
Conference: 1-3 September 2003
[march 18, 2003]
etime is a useful utility for timing commands in DOS. Cygwin (http://www.cygwin.com) comes with an equivalent of the Unix "time" command, but its output is directed to stderr, and in the MS-DOS shell there's no way to redirect stderr (e.g. to a file). etime gets around that problem. For example:
etime -2filename dir
etime -2+filename dir
These would, respectively, time the "dir" command and write the result to "filename", or append the result to "filename". etime's output is not as detailed as the Unix / Cygwin version, but it does the job:
d:\up2p-exp>etime sleep 3
3.41 c:\cygwin\bin/sleep.exe 3
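For comparison, the same idea (time a command and write or append the elapsed time to a file, without relying on stderr redirection) can be sketched in a few lines of Python. This is only an illustration and says nothing about how etime itself is implemented; the function name `time_command` and the output format are assumptions made for the example.

```python
import subprocess
import sys
import time


def time_command(argv, out_path=None, append=False):
    """Run argv, return elapsed wall-clock seconds, optionally log to out_path."""
    start = time.perf_counter()
    subprocess.run(argv, check=False)  # the command's own output is left untouched
    elapsed = time.perf_counter() - start
    line = f"{elapsed:.2f} {' '.join(argv)}\n"
    if out_path is None:
        sys.stdout.write(line)  # stdout rather than stderr, so it can be redirected
    else:
        # "w" overwrites (like -2filename), "a" appends (like -2+filename)
        with open(out_path, "a" if append else "w") as fh:
            fh.write(line)
    return elapsed
```

Calling `time_command(["sleep", "3"], "filename", append=True)` would append one timing line to "filename" each time it runs.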
[march 16, 2003]
XML Schema in general plus some stuff about use of namespaces within them.
[march 12, 2003]
Site hosting "dtd2xs", a simple-to-use package for converting DTDs to XML Schema files. I've tried it out: it did most of the work, but I had to edit some of the output by hand because the XML files I had didn't quite conform to the generated schema.
perl one liners | http://www-106.ibm.com/developerworks/linux/library/l-p101/
Basic tutorial on using Perl as a command-line tool.
one more xml source | http://www.fruitfly.org/sequence/dlXML.shtml
Drosophila Genomic Sequence Annotations from http://www.fruitfly.org, the Berkeley Drosophila Genome Project. Drosophila refers to "Drosophila melanogaster", a.k.a. the fruit fly. The Drosophila genome was completely mapped in 2000. More about Drosophila here: http://www.ceolas.org/fly/intro.html.
[march 8, 2003]
Some of these data sources consist of large collections concatenated into a single XML file. Here's a useful utility for splitting these into individual XML files:
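As a rough illustration of what such a splitter does (this is not the utility referred to above, only a minimal sketch), the following Python assumes that every child of the root element is one record. The function name `split_xml` and the `record-NNNNN.xml` naming scheme are made up for the example, and the whole file is parsed into memory, so very large collections would need a streaming parser instead.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def split_xml(source_path, out_dir):
    """Write each child of the root element to its own XML file; return the paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    root = ET.parse(source_path).getroot()
    written = []
    for index, record in enumerate(root):
        record.tail = None  # drop whitespace that separated records in the big file
        target = str(out / f"record-{index:05d}.xml")
        ET.ElementTree(record).write(target, encoding="utf-8", xml_declaration=True)
        written.append(target)
    return written
```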
xml rpc specification | http://www.xmlrpc.com/spec
A very simple document that sets out the format of XML-RPC requests and responses. The request format is useful if you want to check that a server is up and running using telnet: you can cut and paste the example request from the spec and verify that the server responds with a properly formatted response (probably an error, unless you happened to implement the examples.getStateName method).
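If you would rather generate the text to paste than copy it from the spec, Python's standard `xmlrpc.client` module can serialize a call for you. The sketch below is only an illustration: the host "localhost" and path "/RPC2" are placeholders for whatever server you are testing, and `build_raw_request` is a helper name invented here.

```python
import xmlrpc.client


def build_raw_request(host, path, method, params):
    """Return a complete HTTP POST carrying an XML-RPC call, as text."""
    body = xmlrpc.client.dumps(tuple(params), methodname=method)
    headers = [
        f"POST {path} HTTP/1.0",
        f"Host: {host}",
        "Content-Type: text/xml",
        f"Content-Length: {len(body.encode('utf-8'))}",
    ]
    # A blank line separates the HTTP headers from the XML payload.
    return "\r\n".join(headers) + "\r\n\r\n" + body


if __name__ == "__main__":
    # Placeholder host and path; the method and argument follow the spec's example.
    print(build_raw_request("localhost", "/RPC2", "examples.getStateName", [41]))
```

The printed text can be pasted into a telnet session opened against the server's port; any well-formed `methodResponse` coming back, even a fault, shows the server is alive.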
[march 5, 2003]
Detailed tutorial on how to create an XML-RPC client and server in Java (including where to get the source, compile it, etc.). Also a nice demo of how XML-RPC allows clients and servers written in different languages (Perl, PHP, Java) to interact.
[march 1, 2003]
Good tutorial on how to perform various CVS operations (tagging, branching, merging, etc.) in Eclipse.
[february 27, 2003]
In U-P2P all files are XML files! That can be a problem if you don't happen to have a pile of XML documents lying around. I think an important sub-project for U-P2P will be building tools to extract XML from files (e.g. from MP3 ID3 tags) or to create XML wrappers for files. I've created something simple for MP3 files using Rabbitfarm's GPL'd ID3 library (http://www.rabbitfarm.com/id3.html).
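To give a feel for what such a wrapper tool involves, here is a minimal Python sketch that reads an ID3v1 tag (the fixed 128-byte block at the end of an MP3 file) and emits a small XML document. It does not use the Rabbitfarm library mentioned above, and the element names (`song`, `title`, and so on) are invented for the example rather than taken from any U-P2P schema.

```python
import xml.etree.ElementTree as ET


def id3v1_to_xml(mp3_bytes):
    """Return an XML string describing the ID3v1 tag, or None if there is none."""
    tag = mp3_bytes[-128:]
    if len(tag) < 128 or tag[:3] != b"TAG":
        return None

    def text(start, length):
        # Fields are fixed-width and padded with NUL bytes (or spaces).
        raw = tag[start:start + length]
        return raw.split(b"\x00", 1)[0].decode("latin-1").strip()

    song = ET.Element("song")  # hypothetical element names
    ET.SubElement(song, "title").text = text(3, 30)
    ET.SubElement(song, "artist").text = text(33, 30)
    ET.SubElement(song, "album").text = text(63, 30)
    ET.SubElement(song, "year").text = text(93, 4)
    return ET.tostring(song, encoding="unicode")
```

In practice you would call it as `id3v1_to_xml(open(path, "rb").read())`. Files that carry only ID3v2 tags (stored at the start of the file) would need a different reader.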
Here are some other sources of XML files:
[february 24, 2003]
Outlines a distributed method for communities to come to an agreement on semantic descriptions.
[february 8, 2003]
Who has to do what when + formatting guidelines.
Examples of other theses in the department:
[february 6, 2003]
Interesting simulation results concerning the performance (e.g. average path length) of semantic routing in large networks.
"And in fact if we perform a million search iterations on a 10,000-node network we find that it converges to a path length of around 2, but that its connectivity level is comparable with that of a 1000 node network. What this seems to be indicating is that the operation of semantic routing, in combination with connection/knowledge updates, leads the nodes to stabilize their connectivity levels when they are connected to enough other nodes that can directly service their search needs. However, it is important to note that the level of recall achieved in the larger network is much lower but this follows naturally from distributing the same number of documents per node over a larger network."
If nodes are allowed to establish unlimited numbers of connections with nodes sharing common interests the path length converges to a lower value. But adding new connections seems to have a diminishing return (i.e. the comment about "connectivity level is comparable with that of a 1000 node network"). At a certain point there are enough nodes in your partially converged neighbourhood that adding new connections won't improve your results by much. The term "recall" is defined in the paper as "proportion of possible matches retrieved". Existing search systems show that you need not discover every possible match to find what you are looking for (what the Stanford folks call "satisfaction").
[january 23, 2003]
(via p2p-hackers) Interesting paper based on a trace collected at the University of Washington. The authors analyze the nature of traffic transferred on the University network via four methods: HTTP, Akamai, Kazaa and Gnutella. Many interesting observations, for example: the rate of P2P requests is 1/100 that of HTTP. Yet P2P traffic exceeded web traffic by a factor of 3. Why? P2P documents are 3 orders of magnitude larger than web objects.
Another important observation is that a small number of objects in Kazaa account for the bulk of the outbound traffic. "Outbound traffic" here refers to objects requested by external hosts from servers within the UWashington network. To take advantage of this profile they propose the use of a "reverse cache", i.e. a cache storing objects requested by outside hosts.
The outbound hit rate (Figure 14b) increases as the cache warms, stabilizing at approximately 85%. This is a remarkably high hit rate -- double that reported for web traffic. A reverse P2P cache deployed in the University's ISP would result in a peak bandwidth savings of over 120 megabits per second!
[january 19, 2003]
Research from CMU about reducing Gnutella hopcounts using "interest-based shortcuts". I would group this with other "community implicit" systems such as query routing, Neurogrid and Associative P2P because it attempts to identify communities using implicit indicators of common interest. In this proposal possession of similar files or the file being searched for is used to discover nodes that will act as shortcuts to content of interest. Shortcuts are ranked according to different characteristics (e.g. latency of path, b/w of path, probability of providing results, etc.) and new searches are directed to members of the shortcut list in rank order.
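The mechanism described above can be sketched roughly as follows. This is an illustration of the idea, not the CMU authors' algorithm: shortcuts are ranked by a single characteristic (observed success rate), `peer_files` is a local stand-in for actually querying a remote peer, and `flood` stands in for an ordinary Gnutella flood. All names are hypothetical.

```python
class ShortcutList:
    """Shortcuts to peers that answered earlier queries, ranked by success rate."""

    def __init__(self, max_size=10):
        self.max_size = max_size
        self.stats = {}  # peer -> [hits, attempts]

    def ranked(self):
        def success_rate(peer):
            hits, attempts = self.stats[peer]
            return hits / attempts if attempts else 0.0
        return sorted(self.stats, key=success_rate, reverse=True)

    def record(self, peer, hit):
        entry = self.stats.setdefault(peer, [0, 0])
        entry[0] += int(hit)
        entry[1] += 1
        # Keep only the best-ranked shortcuts.
        for extra in self.ranked()[self.max_size:]:
            del self.stats[extra]


def search(filename, shortcuts, peer_files, flood):
    """Ask shortcuts in rank order; fall back to flooding if none has the file."""
    for peer in shortcuts.ranked():
        hit = filename in peer_files.get(peer, ())
        shortcuts.record(peer, hit)
        if hit:
            return peer
    peer = flood(filename)  # stand-in for a normal flooded search
    if peer is not None:
        shortcuts.record(peer, True)  # the responder becomes a new shortcut
    return peer
```

A peer that answers by flooding is added to the list, so later searches for related content reach it directly before any flood is attempted.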
[january 12, 2003]
Downloadable Java library implementing Rendezvous, Apple's implementation of Zeroconf, which aims to simplify the setup of small IP networks. A post on p2p-hackers asks: "Who will be the first Java P2P hacker to integrate this for peer discovery?" More information about Zeroconf here: http://www.zeroconf.org and about Rendezvous here: http://www.zeroconf.org/Rendezvous/
[december 8, 2002]
(via decentralization) Melc writes: "We have started development on our CP2PC uberclient file-sharing application. As part of the CP2PC project we have developed a minimal programming interface (API) to peer-to-peer file-sharing systems. Now, based on this API, we have started building a file-sharing uberclient that will provide seamless access to multiple file-sharing networks from a single client."
gnutella search | http://www.grouter.net/gnutella/search.htm
(via decentralization) Serguei Osokine writes: "I've just published a study that can be of some interest to the people who think about the ways to improve the P2P network search performance."
[december 5, 2002]
Paul Ford's use of the Google News Search results for java as an example of why metadata is necessary (http://www.ftrain.com/links_sub_Semantic_Web.html) gave me the idea to create a new page here to keep track of other similar examples. I don't disagree that implicit metadata is powerful and useful, but it doesn't solve every problem and that's what WhyMetadata attempts to illustrate. (full disclosure: I've got a vested interest since U-P2P is built around explicit metadata)
[november 24, 2002]
Project that allows users to easily publish websites into Freenet. Interesting because Freenet URLs (URIs?) are location independent and theoretically persistent (you still have to update your site daily for it to be accessible!). From a search perspective this is interesting because it is an implementation of link structure in an "exact identifier" system that could be used to create Google-style indices. It applies equally well to DHTs. Of course, it only works for hypertext documents, and it requires the spider to have an entry point and the link network to be well connected.
exact identifiers / search in dhts | http://www.employees.org/~alokem/thesis/notes-11-23-2002.txt
Some notes about why exactly "exact-identifier" systems (e.g. Freenet, Chord, CAN, etc.) don't support "partial match" search. Search requires access to metadata that is meaningful to humans. The part of these systems that is problematic for search is that file identifiers are opaque (hashes) or that the file contents are opaque (encrypted). If the requirement for anonymity / privacy is relaxed (i.e. if human-readable filenames and file contents were also maintained) then there is nothing to stop implementation of traditional metadata search strategies on top of these stores. The more challenging question is how to accommodate search within the architecture.
userv papers | http://www.almaden.ibm.com/cs/people/bayardo/userv/plugins/plugin.html
(via slashdot) P2P sharing of web applications (plugins) written in Java.
"One important feature of the PluginAPI is a suite of methods that allow the plugin code to query the set of all (private or public) files accessible on the current site, and to obtain a file handle and/or URL to the content. The plugin can also query the site owner's name, the site's domain, the location of the shared web folder, the name assigned to the plugin, and so on. This state allows plugins to be implemented so that they work properly on any YouServ site within which they are installed, without this sort of information having to be explicitly configured by the user."
http://www-db.stanford.edu/~bawa/Pub/usearch.pdf - "Make it Fresh, Make it Quick — Searching a Network of Personal Webservers"
Distributed search engine for "YouServ" personal webservers. Users connect to an individual peer through a web interface (e.g. Bob can use Alice's YouServ instance to search the network). Each peer indexes its local collection, creating an inverted index associating each keyword with a set of documents. Information about which keywords have matches is transferred to a central "registrar", which maintains a mapping between keywords and the IP addresses of peers that have advertised matches for that keyword. The mapping is done using Bloom filters: each peer generates a bitmap indicating which keywords it supports and sends it to the registrar, which adds the peer's IP address to the bucket for each set bit. Searches can be restricted to a group of peers; users maintain group definitions.
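A toy version of the registrar scheme described above might look like the Python below. It is a sketch under stated assumptions, not the YouServ implementation: the bitmap size, the number of hash functions, the use of SHA-1, and all names (`bit_positions`, `make_bitmap`, `Registrar`) are choices made for this example. The bitmap is represented simply as the set of bit positions that are set.

```python
import hashlib

NUM_BITS = 1024   # bitmap size; an arbitrary choice for this sketch
NUM_HASHES = 3    # hash functions per keyword; also arbitrary


def bit_positions(keyword):
    """Map a keyword to NUM_HASHES bit positions."""
    positions = []
    for seed in range(NUM_HASHES):
        digest = hashlib.sha1(f"{seed}:{keyword.lower()}".encode("utf-8")).digest()
        positions.append(int.from_bytes(digest[:4], "big") % NUM_BITS)
    return positions


def make_bitmap(keywords):
    """Peer side: the set of bits covering every locally indexed keyword."""
    return {pos for kw in keywords for pos in bit_positions(kw)}


class Registrar:
    """Central side: one bucket of peer addresses per bit position."""

    def __init__(self):
        self.buckets = [set() for _ in range(NUM_BITS)]

    def register(self, peer_ip, bitmap):
        for pos in bitmap:
            self.buckets[pos].add(peer_ip)

    def candidates(self, keyword):
        """Peers that may hold the keyword (false positives are possible)."""
        sets = [self.buckets[pos] for pos in bit_positions(keyword)]
        return set.intersection(*sets)
```

A query is then forwarded only to the peers returned by `candidates`, each of which checks its own inverted index. A peer that really has the keyword is never missed, but some peers without it may be contacted because of hash collisions.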
More YouServ papers here: http://www.almaden.ibm.com/cs/people/bayardo/userv/
[november 21, 2002]
[november 19, 2002]
Order form. Table of contents is here: http://computer.org/proceedings/p2p/1810/1810toc.htm.
[november 13, 2002]
[october 17, 2002]
"Pond" prototype source. Project based at Berkeley run by John Kubiatowicz. More info here: http://oceanstore.cs.berkeley.edu/
"OceanStore is a global persistent data store designed to scale to billions of users. It provides a consistent, highly-available, and durable storage utility atop an infrastructure comprised of untrusted servers."
conference | http://www2003.org/cfp.htm
WWW 2003 - Budapest, Hungary - 20-24 May 2003 - paper submission deadline November 15 - 8-10 pages - template: http://www2003.org/www2003-submission.doc
Possible tracks: Search and Data Mining, Semantic Web
[october 5, 2002]
Some papers of interest...
http://research.microsoft.com/sn/Herald/papers/tr-2002-48.pdf - "Overlook: Scalable Name Service on an Overlay Network", Marvin Theimer, Michael B. Jones, April 2002, Technical Report, MSR-TR-2002-48
Pastry as a basis for an "internet-scale" (fast update, resilient to flash crowds) naming service.
http://www.ececs.uc.edu/~mjovanov/thesis/thesis.html - "Modeling Large-scale Peer-to-Peer Networks and a Case Study of Gnutella", Masters Thesis, Mihajlo A. Jovanovic, June 2001
Modelled topology and latency of Gnutella network. Includes Java code for a Gnutella simulation "gnutsim".
http://dbpubs.stanford.edu:8090/pub/2002-13 - "Designing a Super-Peer Network" - Yang, Beverly; Garcia-Molina, Hector, 22 February 2002
Good description of "super-peer" networks, identifies important parameters for efficient network operation.
http://www.research.microsoft.com/~antr/PAST/location.pdf - "Topology-aware routing in structured peer-to-peer overlay networks", Miguel Castro, Peter Druschel, Y. Charlie Hu, Antony Rowstron, submitted for publication February 2002, Technical Report MSR-TR-2002-82
How to take into account IP-level proximity in the overlay network.
http://www.research.microsoft.com/~antr/PAST/ring.pdf - "One ring to rule them all: Service discovery and binding in structured peer-to-peer overlay networks", M. Castro, P. Druschel, A-M. Kermarrec and A. Rowstron, SIGOPS European Workshop, France, September, 2002
Talks about distributed inverted index (keyword: loc1, loc2, loc3) for search. Also idea of "universal ring" - a bootstrap for service discovery.
[october 1, 2002]
Submissions due: 25 October 2002
Notification of acceptance: 20 December 2002
Camera-ready copy due: 15 January 2003
Workshop: 20-21 February 2003
iris project | http://www.ddj.com/documents/s=7338/ddj1033245969026/
Article from Dr. Dobb's about IRIS, a project to develop "a secure, fault-tolerant, distributed system for data storage" based on Distributed Hash Tables (DHTs). They just got $12 million in funding over the next five years from the NSF. Most of the quotes in this article are from Frans Kaashoek, who was one of the researchers behind Chord. The IRIS project website is here: http://iris.lcs.mit.edu.
The article also includes this interesting link: http://www.planet-lab.org - an Intel Research (http://www.intel-research.net) backed project to create "a global testbed for developing and accessing new network services"
Another IRIS article here: http://www.newscientist.com/news/news.jsp?id=ns99992861 (no mention of its difficulties with search?)
conference | http://wwwteo.informatik.uni-rostock.de/DASD/
Neal writes: "There's a call for papers for the Design, Analysis and Simulation of Distributed Systems (DASD) conference that is due on October 18th (three weeks) if y'all are interested. It's in Orlando, Florida and the CFP is here: http://wwwteo.informatik.uni-rostock.de/DASD/."
[september 28, 2002]
Project somewhat along the lines of U-P2P to develop a framework for plugging together different peer-to-peer components to form applications. Basically, specifying interfaces for the components that allow them to interoperate without specifying the inner workings. It does seem to be standardizing a search interface though: http://tristero.sourceforge.net/search-java.html. (involves Brandon Wiley, one of the original Freenet programmers as well as Sam "Neurogrid" Joseph and Aaron "W3C" Swartz)
overview of p2p meta-data search | http://www.neurogrid.net/Decentralized_Meta-Data_Strategies-neat.html
Excellent and comprehensive overview of some of the existing approaches to distributed meta-data search. The techniques he lists are:
Bloom Filter, Semantic Routing, Reputation Learning, Query Spaces, Trust Metrics, Query Forwarding, Distributed Hash Tables, Caching
A U-P2P user could choose which one to use by plugging in a different Peer Network Adapter. It might be fruitful to discuss how each would work in the U-P2P framework.
p2p radio | http://www.openp2p.com/pub/a/p2p/2002/09/24/p2pradio.html
Two recent pieces of software allow users to host and share their own radio stations.
Peercast: http://www.peercast.org/
Streamer: http://www.chaotica.u-net.com/streamer.htm
Some implementation information from Peercast's website:
The client software uses the Gnutella 0.6 protocol, but is not connected to the Gnutella file share network. It works in much the same way as other Gnutella clients except that instead of downloading files, the users download streams. These streams are then exchanged in real-time with other users. No data is stored locally on any machine connected to the network.
The client software has the ability to serve web pages to normal browsers such as Mozilla and Internet Explorer. This means that people on your LAN can search for and listen to channels without having to install the client software on their PC. Offices can have one PeerCast client providing audio streams to the entire LAN. Or you can set up a private network with your friends on the Internet to listen to music. Its your choice about whether you connect directly to the PeerCast network or not.
[august 12, 2002]
"Developers use the Buddy Script SDK released Monday to write interactive agents that mine information from databases in a corporate network or on the Internet. End users at a company can then add a new contact, or "buddy," to their instant messaging software that makes use of those agents to retrieve answers to questions, company officials said." (via Professor E.)
[august 2, 2002]
This might be useful for quickly converting our up2p objects into XML. Need to download it and give it a try.
[july 21, 2002]
Starts out with the traditional "taxonomy of p2p" section then mentions a few apps: