CryptNET Peer Cache Daemon (cpcd) Administrator Manual

V. Alex Brennen


Table of Contents
1. Introduction to cpcd
1.1. Introduction to cpcd
1.2. How p2p Systems Bootstrap
1.3. History of gWebCache
1.4. Ultra Host Caches
1.5. Why is it important to run it
2. Installation of cpcd
2.1. Quick Start
2.2. The GNU/Linux Operating System
2.3. ./configure Options
3. Configuration and Running of cpcd
3.1. The Configuration Sub System
3.2. Item by Item Options
3.2.1. General Configuration Options
3.2.2. Performance Related Options
3.2.3. Security Related Options
3.3. Starting cpcd
3.4. Stopping cpcd
3.5. Restarting cpcd to Reload its Configuration and Data
4. Monitoring
4.1. Overview of Monitoring
4.2. The Access Log
5. Troubleshooting
5.1. Program Crashes
5.2. Aggressive memory consumption
5.3. Lost data, errant data
5.4. Hostile Hosts
6. Bug Reports, Feature Requests and Patches
6.1. Information to Include in a Bug Report
6.2. Information to Include in a Feature Request
6.3. Submitting a Patch to the Project
Glossary of Terms

Chapter 1. Introduction to cpcd


1.1. Introduction to cpcd

The CryptNET Peer Cache Daemon (cpcd) is a lightweight, multithreaded program that serves as a bootstrapping system for peer-to-peer networks. Currently, it is specifically targeted at the Gnutella Network. It implements the gWebCache protocol, versions 1 and 2, with the exception of "node clusters". Node clusters were not implemented in the original reference implementations of gWebCache and are not implemented in any known p2p programs.

cpcd was designed to allow peer-to-peer servlets to quickly gain access to other peers, and their resources, while consuming minimal resources on the server that the cache is running on. Because it is multithreaded and written in the C programming language, a single copy of cpcd can serve many millions of requests per day while consuming only small amounts of memory and CPU time.

In order to run cpcd, you should have at least a Pentium II 266MHz and 256MB of RAM. The amount of RAM that cpcd needs can be reduced by reducing the number of threads the program keeps in its thread pools. More information about the sizing of thread pools can be found in the configuration chapter.

Warning

It is very strongly recommended that you set up a dedicated DNS hostname for your cache. For example, if you own the domain example.ie and are willing to run a copy of cpcd on the server that hosts the website for example.ie, you should create a DNS entry called something like cpcd.example.ie. This is very important because some older, poorly implemented clients and caches will never stop attempting to access a cache even after it no longer exists. So example.ie could receive many hundreds of thousands of connection attempts per day even years after you decided that you no longer wanted to run a copy of cpcd. This could result in significant bandwidth costs on a metered connection. By using a separate DNS entry for the cache, you can stop these poorly written programs from contacting you simply by deleting the DNS entry. If you have a dedicated cache DNS entry, shutting down your cache in this way won't disrupt your web site or e-mail.
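
As a rough sketch, assuming your zone is served from a BIND-style zone file, the dedicated entry can be a single A record such as the following (the hostname and address are illustrative only; substitute your own values):


; cpcd.example.ie -- dedicated cache hostname (illustrative address)
cpcd.example.ie.    IN    A    192.0.2.10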


1.2. How p2p Systems Bootstrap

Gnutella systems currently bootstrap themselves by first loading cached peer data from the local disk. If they are unable to connect to the network in a reasonable amount of time using that approach, they then attempt to use an internet bootstrapping solution. The two most commonly used bootstrapping solutions are Ultra Host Caches (UHCs) and gWebCaches. Most peer-to-peer servlets will attempt to bootstrap from disk-cached nodes for one to five minutes. The cpcd program, as mentioned in the introduction, currently implements the gWebCache protocol.

For a specific example of a p2p servlet, we can look to LimeWire. LimeWire bootstraps in the following order:


Local Disk Cache --> Dedicated UHC --> gWebCaches (including cpcd)

Once a Gnutella servlet finds another active node, it discovers further peers through special response headers, by analysing search packets, and through pings and pongs.


1.3. History of gWebCache

gWebCache was one of the first solutions to the bootstrapping problem. gWebCache was originally a CGI script written in PHP, meant to help small numbers of people form very small local p2p clouds. gWebCaches had the ability to maintain continually refreshed caches of active peers in a network. They were also capable of caching the URLs of other gWebCache scripts, allowing the gWebCaches to form a network among themselves to distribute load as their popularity grew.

Before gWebCache was written, people often relied on asking for the addresses of active nodes in IRC channels, or on viewing static web pages, and then entering the addresses into their servlets by hand. After its introduction, people could rely on the automated mechanism that it provided.

The protocol used to interact with the caches was a simple set of CGI queries and pipe-delimited response data. This was both good and bad: while it made it easy for developers to start using the network, it also made it easy for sub-par programmers to write poor implementations.

Poor implementations of the protocol on both the client and server side, as well as rapidly growing popularity, eventually made it impossible to continue to run the first generation of small scripts. None of them verified that IP addresses submitted to them were valid, non-busy nodes, and none of them verified that cache URLs submitted to them actually belonged to caches. These oversights led to all kinds of abuses and denial-of-service attacks against both the cache network and unrelated websites. These problems have been corrected in cpcd.


1.4. Ultra Host Caches

Ultra Host Caches (UHCs) are a relatively new component of the Gnutella protocol. Hooks for UHC support have been placed in cpcd. However, cpcd will not support the UHC protocol until a future release.


1.5. Why is it important to run it

It is important to run as many peer caches as possible because this ensures the long-term viability of the peer-to-peer networks which use them. In the case of very large networks such as Gnutella, a large number of caches not only maximizes redundancy and availability, but also helps reduce the load on individual caches, since clients will naturally balance themselves across all available caches.

The gWebCache-based caches are a last resort for all p2p clients on the Gnutella network, and many older clients still lack UHC support. Therefore, even with the introduction of the UHC solution, if the cache network fails due to a lack of caches or the overburdening of caches, many people will be prevented from using the Gnutella p2p network.


Chapter 2. Installation of cpcd


2.1. Quick Start

  1. Run GNU/Linux

  2. ./configure

  3. make

  4. make install

  5. Make a DNS Sub Domain for your cache

  6. vi /usr/local/etc/cpcd.conf

  7. /etc/init.d/cpcd start


2.2. The GNU/Linux Operating System

The CryptNET Peer Cache Daemon was written specifically for the GNU/Linux operating system. It is not meant to be run on other operating systems such as FreeBSD, MacOS, or Windows. It is expected that cpcd will most likely be run on a dedicated machine, and therefore flexibility in operating system choice was not viewed as being worth the effort. The level of effort required to support other operating systems is expected to be significant because the design of cpcd is heavily integrated with the pthreads threading library.


2.3. ./configure Options

Compile, Build and Installation Options

Option: --prefix

Description: Install prefix

Option: --logdir

Description: Log Directory Location

Option: --datadir

Description: Data Directory Location

Option: --withconfig

Description: Config File Location

Option: --enable-lfs

Description: Enable Large File Support (--enable-lfs=yes)
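
For example, a build that installs under /opt/cpcd and keeps its logs and data under /var might be configured and built as follows (the paths are illustrative only; check ./configure --help for the exact option spellings in your version):


bash# ./configure --prefix=/opt/cpcd --logdir=/var/log/cpcd --datadir=/var/tmp/cpcd --enable-lfs=yes
bash# make
bash# make install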

Tip

Options specified at compile time with configure can be overridden by placing new values in the cpcd.conf configuration file. The values specified with configure script options are placed in your cpcd.conf file automatically.


Chapter 3. Configuration and Running of cpcd


3.1. The Configuration Sub System

The configuration of the cpcd software is performed by editing the file cpcd.conf. This is a standard space-delimited Linux configuration file. Blank lines are ignored, lines starting with the '#' character are comments, and name-value pairs are separated by a space. By default, the cpcd.conf file is stored in /usr/local/etc.
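
As a brief illustration of the format, a cpcd.conf fragment might look like the following (the option names are drawn from the tables below; the values are examples only):


# Bind to the public address and listen on the default web port
bind_ip 192.0.2.10
web_bind_port 8080
data_dir /var/tmp/cpcd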


3.2. Item by Item Options

The following configuration options are available to you:


3.2.1. General Configuration Options

Configuration Options

Option: bind_ip

Default: 127.0.0.1

Description: This is the IP address which cpcd will bind to. At this time only IPv4 addresses are supported. cpcd does not bind to all IPs on a host because I expect some people may wish to run it in a virtual server environment.

Option: web_bind_port

Default: 8080

Description: The port on which your cache listens for HTTP CGI requests. Port 8080 was chosen as the default because it is the standard non-privileged alternative port for web traffic. It is unlikely to be firewalled by network administrators.

Option: gnutella_bind_port

Default: 6346

Description: The port on which cpcd should listen for UHC traffic once UHC is fully implemented. Port 6346 was chosen because it is the default Gnutella port for most Gnutella servlets.

Option: server_url

Default: http://localhost:8080/pwc.cgi

Description: The full URL to your cache. This is used to keep your cache from providing its own URL to clients who obviously already know it. It is also displayed on the default human-readable HTML page.

Option: cpcd_keep_log

Default: Off

Description: A flag to activate or deactivate logging. If it is set, cpcd will log requests. By default, it is turned off and cpcd does not log any information.

Option: data_dir

Default: /var/tmp/cpcd

Description: This is the directory where cpcd will store its data files. Each network specified in the allowed_networks configuration option will have a file here named network_name.dat. There will also be a file called bad_caches.dat for URLs submitted by clients which could not be validated.

Option: log_dir

Default: /var/log/cpcd

Description: This is the directory where the access log for your cache will be stored. Note that logging is disabled by default due to the large numbers of hits most caches get.

Option: pid_dir

Default: /var/run/cpcd

Description: The directory where the current process id for the cache is stored. It will be stored in a file called cpcd.pid. The process id file is used to allow the init script to easily interact with the running instance of the cache.

Tip

The default IP address that cpcd binds to is the loopback. This is useful for testing before deploying your cache, but you'll need to change it to your public address before your cache will be able to participate in the cache network. When you change bind_ip, you'll also want to change the value of server_url to avoid caching your own URL.
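
A minimal sketch of those two changes, assuming the dedicated cpcd.example.ie hostname suggested in the introduction (the address and URL are illustrative only):


bind_ip 198.51.100.25
server_url http://cpcd.example.ie:8080/pwc.cgi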


3.2.2. Performance Related Options

Configuration Options

Option: listen_threads

Default: 100

Description: The number of threads in the thread pool that accepts incoming connections from peers. All of these threads have a main routine which repeatedly calls accept(). If accept() returns a valid connection, the thread goes on to handle the request. The default number of threads is sufficient to handle many millions of connections a day. You may wish to reduce it to lower the amount of memory cpcd uses. Signs that it needs to be increased include slow responses from the daemon or a significant backlog of connections waiting to be accepted.

Option: host_threads

Default: 20

Description: The number of threads in the verify pool. When a valid peer servlet host address is submitted to cpcd, it is placed in a processing queue. The threads in the host_threads pool make mutex-protected pop() calls against that queue. Upon popping an address, a thread attempts to make a valid Gnutella connection to it. The 5:1 ratio with the listen_threads pool seems to work well. Such a large number of threads is needed, not because the work is intense, but because network communication and connection establishment can take some time. An upper limit on the size of the queue is defined in the source code. If the upper limit is reached, the oldest pending item is removed as additional items are added, to prevent a memory-bound DoS attack. A brief tuning sketch appears at the end of this section.

Option: cpcd_throttle_support

Default: 0

Description: [Not Yet Supported]
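
As a sketch, a memory-constrained host might halve both pools while keeping the roughly 5:1 ratio described above (the values are illustrative only):


listen_threads 50
host_threads 10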


3.2.3. Security Related Options

Configuration Options

Option: allowed_networks

Default: gnutella gnutella2 MetaNET

Description: A list of networks for which you will allow your cache to accept queries and updates. Each network will have its own data structure for hosts and alternative caches, and its own .dat file.

Option: default_network

Default: gnutella

Description: This is the network for which hosts and URLs are returned in response to queries which do not specify a network.

Option: allow_servlets

Default: LIME GTKG GNUC BEAR MRPH MESH RAZA ACQX MNAP SWAP MUTE TEST META XOLO QTLA PHEX KIWI TFLS GALA ACQL GNZL GDNA GIFT

Description: This is the list of servlets which are allowed to access your cache. You may wish to remove some of these servlets if they produce too high a load on your cache. You can also add new programs as they come into use without updating your cache's source code.

Option: max_hosts

Default: 50

Description: The maximum number of servlet hosts that cpcd will store on a per-network basis. You can reduce this number to dramatically reduce the amount of bandwidth that your cache uses. You can also increase it to increase the likelihood that a servlet using your cache's data will be able to find a non-busy peer and establish a connection to its peer-to-peer network.

Option: max_urls

Default: 20

Description: The maximum number of URLs for other caches that cpcd will store on a per network basis. You can reduce this number to reduce the bandwidth usage of your cache.

Option: port_restriction

Default: Off

Description: Force cpcd to only accept hosts on the standard Gnutella port. The gWebCache network was once abused by a Microsoft Windows virus writer as a mechanism to help the machines he had compromised link together into a botnet. It was possible to identify the compromised machines because they were operating on a specific non-standard port. This type of misuse is unlikely to happen with cpcd because full Gnutella handshaking is done with every listed node. However, it is still possible, so you may want to turn this option on. Please note that many legitimate Gnutella users use non-standard ports to get around traffic shaping or firewall rules on their local network.
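
As a sketch, a conservative configuration for a low-bandwidth cache might trim the servlet list and lower the storage limits as follows (the servlet codes are taken from the default list above; the values are illustrative only):


allow_servlets LIME GTKG GNUC BEAR PHEX
max_hosts 25
max_urls 10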


3.3. Starting cpcd

cpcd is started using the init script system. To start cpcd, you simply execute its init script as root. Such an invocation might look as follows:


bash# /etc/init.d/cpcd start

Since cpcd uses the init system, you can configure it to start automatically at boot time. This will ensure that cpcd is running again after a reboot caused by a power outage or system crash. The command to schedule the start of a daemon at boot time differs across Linux distributions. As an example, on Gentoo Linux that command would be:


bash# rc-update add cpcd default
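
On other distributions, the equivalent command differs. Assuming the init script is installed as /etc/init.d/cpcd, the following are typical (consult your distribution's documentation to be sure):


(Debian/Ubuntu)
bash# update-rc.d cpcd defaults

(Red Hat/Fedora)
bash# chkconfig --add cpcd
bash# chkconfig cpcd on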

3.4. Stopping cpcd

cpcd is also stopped using the init system. This is done with the following command:


bash# /etc/init.d/cpcd stop

Since cpcd will most likely have many active connections, it may take a few moments for it to fully shut down. cpcd will attempt to give connections some time to complete their handshakes and give read calls a few seconds to time out. While it is shutting down, cpcd will not accept any new connections.

If it is an emergency, you can halt cpcd immediately by sending it a KILL signal. To do this, get a list of all running cpcd processes, and then send the signal with the kill command. This series of steps would look as follows:


bash# ps -ef | grep cpcd
cpcd     11063 10414  0 Mar29 pts/28   00:00:00 su - cpcd -c cpcd -v
cpcd     11064 11063  0 Mar29 pts/28   00:00:00 cpcd -v
cpcd     11065 11064  0 Mar29 pts/28   00:00:01 cpcd -v
cpcd     11066 11065  0 Mar29 pts/28   00:00:02 cpcd -v
cpcd     11067 11065  0 Mar29 pts/28   00:00:02 cpcd -v
bash# kill -9 11063

3.5. Restarting cpcd to Reload its Configuration and Data

While many daemons will restart and reload their configurations when sent a HUP signal, this is not the case with cpcd. The multithreaded design of cpcd makes such a reload very difficult to implement programmatically.

In order to get cpcd to reload either its configuration or its on-disk data cache, you must fully stop and then start the daemon. The init script included in the distribution performs these two steps for you when issued a restart command. A restart command can be issued to the script as follows:


bash# /etc/init.d/cpcd restart

Chapter 4. Monitoring


4.1. Overview of Monitoring

Monitoring of the cpcd software can be done by turning on the logging feature and running analysis on the logs with a web statistics program such as Analog. A realtime statistics and data display page is also provided. To access this page, simply point your web browser at the port that cpcd is listening on.


4.2. The Access Log

By default the name of the log file is cpcd.log and it is stored in the directory /var/log/cpcd.

The log is in the NCSA Combined Format. This format includes information about the request the client made, the number of bytes transmitted, the client's IP address, and the client's software version.

All requests are logged to the cpcd.log file. Unlike the Apache web server, there is not a separate log file for failed or invalid requests; having a single log should make troubleshooting and analysis easier. Since the cache operates with a well defined protocol, a client should not make an errant request unless it has an error in its implementation or was given bad data by another cache.

If you wish to produce stats for your cache from its access log with the Analog program, you can use the following LOGFORMAT definition:


LOGFORMAT (%S %j %j [%d/%M/%Y:%h:%n:%j] "%j%w%r%w%j" %b %c "%j" "%B" "%j")
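
For example, assuming Analog is installed and the log is in its default location, you might place that LOGFORMAT line in a small Analog configuration file along with LOGFILE and OUTFILE directives, and then point Analog at it (the file names and output path are illustrative only):


# cpcd-analog.cfg -- add the LOGFORMAT line from above, then:
LOGFILE /var/log/cpcd/cpcd.log
OUTFILE /var/www/cpcd-stats.html

bash# analog +gcpcd-analog.cfg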

Chapter 5. Troubleshooting


5.1. Program Crashes

In order for a bug to be fixed, it must first be found. If you experience a crash, the most helpful way to aid in the correction of the problem is to run cpcd within the debugger. It was most likely built with debugging symbols when you installed it. If not, you can rebuild it from its source code with debugging symbols and simply run it within gdb from the src directory of the distribution archive.

The gdb debugger has the ability to create a backtrace. A backtrace is basically a list of all the programmatic steps that led up to the crash. You can create a backtrace by running cpcd inside gdb and issuing the command 'bt' after a crash. This is the most useful piece of data available when dealing with a program crash.
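
A minimal gdb session might look like the following, assuming a debug build of the cpcd binary in the src directory and the -v flag shown in the process listing in Section 3.4 (adjust the arguments to match how you normally start the daemon):


bash# gdb ./cpcd
(gdb) run -v
  ... wait for the crash ...
(gdb) bt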

If you can duplicate a program crash, but not when the program is running in gdb, you can run the program under strace or under valgrind. Both of those tools may give you an idea of what is going on.

Tip

Some older versions of GDB and Valgrind cannot handle programs which have a large number of threads. Before attempting to use these programs, you should upgrade to the latest possible version. If you still encounter problems, you should reduce the number of threads cpcd is using in its thread pools significantly and try again.


5.2. Aggressive memory consumption

Diagnosing aggressive memory consumption can be very difficult since the program is multithreaded, has a continual execution flow, and was designed to cache and buffer as much data in memory as possible in order to run as fast as possible.

Before attempting to diagnose a memory leak or a memory-bound denial-of-service attack, you should first estimate how much memory your specific configuration will use. This can be done most easily by launching the program and checking its memory usage before advertising your cache. The Linux kernel uses a Light Weight Process (LWP) threading model; therefore, each thread in the pool will use a not insignificant amount of additional memory. Running with its default configuration, cpcd's memory usage tends to be about 2.5KB per thread and to total about 200MB.

If your instance of cpcd is using significantly more memory than this, you should check that the size of bad_caches.dat and of the .dat files for your supported networks is sane. Next, you should access your cache's web page to make sure it is returning a valid data set. You should also check the byte counts in the log file and watch your cache in 'top' to see if its memory usage is growing rapidly.

Tip

You can sort the processes in top by memory usage by hitting uppercase-M while top is running.

If you are reasonably sure that there is in fact a memory leak, you can run the program under valgrind to attempt to find and report it. Please see the tip in the previous section about using a recent version of valgrind.
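
A sketch of such a run, assuming you start the binary directly from the src directory (the flags are standard Valgrind options; adjust the cpcd arguments to your setup):


bash# valgrind --leak-check=full --log-file=cpcd-valgrind.log ./cpcd -v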

If you are unable to find a memory leak, but still believe that one may exist, it is still helpful to report it. A cpcd developer can use more advanced memory allocation tracking tools than valgrind, such as the glibc mtrace system.


5.3. Lost data, errant data

If the problem is incorrect or errant data being delivered to clients, you will first need to determine whether you are the target of a cache poisoning attack. A cache poisoning attack occurs when a client accessing a cache purposefully submits bad data in the hope that the cache software will deliver that false data to other clients. Determining whether such an attack is taking place can be done by reviewing the access log and by reviewing the transmitted data of sessions, recording them with a program such as Ethereal or tcpdump.

Hostile data submission should be suspected when one or more entries in the cache are invalid. If all entries are invalid, or there are no entries at all, the problem is more likely to be occurring at a programmatic or system level locally. If data is lost or wholly incorrect, strace may be used to watch exactly what the program does during an update and query session.

Due to the extensive verification that cpcd does, cache poisoning is highly unlikely. The most likely cause of the problem is memory corruption due to a program bug, or bad hardware. Completely corrupted data files should be submitted along with log excerpts to the developer for analysis, to help determine whether the cause is a bug.

When attempting to identify and diagnose a problem suspected to be a program bug, a configuration problem, or a system problem (such as a failed hard drive), it is useful to run an additional copy of cpcd on the loopback. This will allow you to easily look at data with Ethereal and avoid confusion caused by the actions of active clients.


5.4. Hostile Hosts

To deal with a hostile host, you should drop the route to the host or to the host's network. This will allow your cache to continue to run with only minimal disruption until the attack has ended. Since many of the peer-to-peer clients in use are free software, a user of a servlet may not realize that he is attacking you. Some of the more inexperienced programmers have made programmatic errors in their applications which result in behavior equivalent to a denial-of-service attack. The only way to reduce cache load is to refuse such connections in the first place. The first step is to remove the software's version code from the allow_servlets option in the configuration file, and then to drop the routes to anyone who continues to abuse your cache with the errant software.

When dealing with this type of problem, it may be helpful to determine how many open connections a specific host or network has to your cache at a given time. You can do that with the following command:


bash# netstat -an | grep (cache port #)
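
If you only need a count of established connections, you can pipe the same output through wc. For example, assuming the default web_bind_port of 8080 (substitute your own port):


bash# netstat -an | grep :8080 | grep ESTABLISHED | wc -l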

While attacks are rare, they do happen. The ability to abuse the old gWebCaches to link compromised Microsoft Windows machines into botnets has motivated feuding hackers to attack caches, and record and media companies may also attempt to attack your cache. Again, the solution here is to drop routes.

These are the commands to drop a route:


(Host)
bash# route add -host 83.116.208.110 reject

(Whole Network)
bash# route add -net 24.0.154.0 netmask 255.255.255.0 reject
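
Once the attack has ended, the same routes can be removed again (the addresses mirror the examples above):


(Host)
bash# route del -host 83.116.208.110 reject

(Whole Network)
bash# route del -net 24.0.154.0 netmask 255.255.255.0 reject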

Chapter 6. Bug Reports, Feature Requests and Patches


6.1. Information to Include in a Bug Report

If your bug report relates to a system crash, please attempt to create and submit a backtrace.

Ideally, your bug report should include as much of this information as possible:

  1. Your contact information

  2. The version of cpcd you are running

  3. Your configuration file

  4. A Backtrace, if relevant

  5. Any relevant sample data

  6. Any relevant sample requests from the cpcd.log file

Bug reports should be submitted to the SourceForge cpcd Project Bug Tracking System.


6.2. Information to Include in a Feature Request

Please include as complete a description of the feature as possible and an explanation of why you feel the feature should be implemented. A fair warning: I'm a big believer in keeping program source code as simple as possible, using as few system resources as possible, and staying away from unnecessary new technologies. Therefore, you need to make a case for a clear benefit to end users if you want your request taken seriously. Submitting a patch is a good way to increase the likelihood that your request will be seriously considered. Submitting a patch is covered in the next section.

Feature Requests which include patches should be submitted as described in the Patch Submission Section. Feature Requests which do not include patches should be submitted to the SourceForge cpcd Project Feature Request Tracking System.


6.3. Submitting a Patch to the Project

Your patch should be placed in the Public Domain in order to be considered. If your patch is not in the Public Domain, do not submit it.

Patches should be submitted to the SourceForge cpcd Project Patch Tracking System.

Glossary of Terms

B

Backtrace

A printout of stack information which can be created after a crash. This information is very useful in the debugging of programs.

G

Gnutella

A peer-to-peer network which is primarily used to distribute files.

Gnutella Web Cache
(GWC)

A CGI-based solution to the problem of bootstrapping peer-to-peer nodes onto the existing network.

P

Ping (Gnutella)

A UDP packet sent to a Gnutella servlet, to which the servlet responds with a pong.

Ping (Gnutella Web Cache)

A message sent to a Gnutella Web Cache, to which the cache will respond with a well defined message. These pings are useful for verifying that a cache submitted to the network is actually a cache, rather than, for example, a pay-per-click advertising link.

Pong (Gnutella)

A response sent to a Gnutella servlet which had sent a ping. A pong may contain a compressed list of ultrapeers.

S

Servlet

A peer-to-peer program. Also, a peer in a peer-to-peer network.

T

Thread Pool

A collection of threads which are all working on the same task. When a single thread is not fast enough to handle a given task, a collection of threads is often created. A pending instance of a task is handled by the first non-busy thread in the pool.

U

Ultra Host Cache
(UHC)

A Gnutella node which keeps a list of UltraPeers and responds to pings with that list.

Ultra Peer

A Gnutella node which allows connections from other Gnutella nodes.