Introduction to Tortoise

Tortoise is a Usenet news server that supports both transit (feeding) and readers. Tortoise is an outgrowth of NNTPRelay, a high-performance news routing system.

Primary design goals for Tortoise were high performance and "fire and forget" installation. Once installed, tortoise should run without intervention except for group additions and deletions, which are not yet automated.

This document covers tortoise concepts, including match strings, stores, and databases. General installation and management instructions are also included. Further information might be available at http://tortoise.maxwell.syr.edu

Match Strings

Match strings are compared against articles to determine whether or not an article is appropriate for a particular operation. Match strings are used in many places in tortoise (though the prompts are not yet consistent) including the incoming filter, external filter match string, feed specifiers, and article stores.

Match strings are relatively simple constructs that are evaluated left to right. They are of the form:

MatchString={MatchElement {"," MatchElement}}

MatchElement={Modifier {Modifier}} [SpecifierElement Operator Value | BooleanElement | BooleanOperator | GroupPattern]

SpecifierElement="$" Specifier

Specifier="P" | "G" | "I" | "S" | "A" | "H[" HeaderName"]"

BooleanOperator="&" | "|"

Operator=">" | "<" | "="

BooleanElement="$" "C"

GroupPattern= (glob string, including * and ?)

Value=(integer) | (glob string)

Modifier= "!" | "~"

Modifier definitions:

! - logical NOT

~ - poison

Specifier definitions:

P is a path line search

G is a count of the groups in the Newsgroups Header

I is the intervening host count (number of elements in the Path header)

S is article size

A is the article age in seconds (time now – time in the Date header)

C boolean TRUE if the article has a Control header

H is a header index that allows you to compare a header by name (for example, H["Subject:"]=*FAQ* would only match if the Subject header contained "FAQ").

A specific ordering principle applies to match strings: the line is evaluated from left to right. For a group match, a positive match stops the search and selects the article. A negated positive match moves to the next match element. A poison stops the search without selecting. A positive match on the path line stops the search without selecting. A negated positive match on the path line selects the article. For all other matches, a false evaluation (after the ! is applied) ends the search, whereas a true evaluation continues it. If there is no positive selection, the article is not selected.

Some examples:

$P=news.host.net,!comp.binaries*,comp.*

(select articles that do not have news.host.net in the path header and are posted to comp.* but not comp.binaries*)

!$P=news.goodhost.net

(select all articles with news.goodhost.net in the path header)

$C

(select all control messages)

$P=news.host.net,$S<10000,$A<86400,alt.*,comp.*

(select all articles that do not have news.host.net in the Path header, are less than 10000 bytes long, less than a day old and posted in alt.* or comp.*)
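The grammar can be turned into a small parser. The following is a minimal sketch in Python, not tortoise's own code: the MatchElement class, the function names, and the decision to split on every comma (which would break a header value that itself contains a comma) are assumptions made for illustration. It follows the "$C" form used in the examples, and it only parses; it does not implement the evaluation order described above.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MatchElement:
    kind: str                        # "specifier", "control", "boolean_op" or "group"
    negate: bool = False             # "!" modifier present
    poison: bool = False             # "~" modifier present
    specifier: Optional[str] = None  # P, G, I, S, A or H
    header: Optional[str] = None     # header name for $H[...]
    operator: Optional[str] = None   # ">", "<", "=" (or "&", "|" for boolean_op)
    value: Optional[str] = None      # integer or glob, kept as text


# $P=..., $S<..., $H["Subject:"]=... and so on
_SPEC = re.compile(r'^\$(?:([PGISA])|H\[([^\]]+)\])([<>=])(.*)$')


def parse_element(text: str) -> MatchElement:
    """Parse one comma-separated element of a match string."""
    text = text.strip()
    negate = poison = False
    # Strip any leading modifiers and remember which ones were seen.
    while text[:1] in ("!", "~"):
        if text[0] == "!":
            negate = True
        else:
            poison = True
        text = text[1:]
    if text in ("&", "|"):
        return MatchElement("boolean_op", negate, poison, operator=text)
    if text == "$C":
        return MatchElement("control", negate, poison)
    m = _SPEC.match(text)
    if m:
        spec, header, op, value = m.groups()
        return MatchElement("specifier", negate, poison,
                            specifier=spec or "H",
                            header=header.strip('"') if header else None,
                            operator=op, value=value)
    if text.startswith("$") or not text:
        raise ValueError(f"unrecognised match element: {text!r}")
    # Anything else is treated as a group glob pattern.
    return MatchElement("group", negate, poison, value=text)


def parse_match_string(match_string: str) -> List[MatchElement]:
    """Split a match string on commas and parse each element in order."""
    return [parse_element(part) for part in match_string.split(",")]


for element in parse_match_string("$P=news.host.net,!comp.binaries*,comp.*"):
    print(element)
```

Keeping the parsed elements in their original order matters, because the evaluation rules depend on reading the string left to right.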

Stores

Tortoise keeps all of its articles, overview, and overview cache information in "stores." Stores are cyclical data storage containers that may be up to 1 terabyte in size.

Article stores

Every tortoise installation will need at least one article store to do anything useful. Stores are managed through the web interface. Stores may be grown at any time.

After an article has arrived, passed any filtering in place, and been recorded in appropriate groups, it is ready to be written to a store. The article’s overview information is written first (sans article number, which would vary), then the entire article follows. There are no delimiters in stores; all information is tracked through offsets kept in the databases.

To create a store, connect to the web interface and choose "Add new" under Stores in the left pane. You’ll need to specify the following:

Number. Stores are numbered 1-30. The store number is used in at least two ways: first, the number determines the search order (when trying to find a place for an article to be written, it searches from 1 to 30 in order). Second, the store number is encoded into an article’s key (see the databases section) so that tortoise knows which store contains the article.

Directory. This is a full path to a directory to hold the store. Example: f:/artstores

The directory can be a network UNC, but this requires advanced knowledge of NT to configure since most services (including tortoise by default) run in the context of the local system account, which does not have network access. This may cause difficulty because tortoise will run fine in debug mode (in the context of your login which has network access), but then fail when started as a service.

MatchString. The match string is a selection string that determines which articles are written to this store. The match string "overcache" has special meaning; see the section on overview cache stores.

Mbytes. This is the size of the store in megabytes. Maximum value is 1000000 (about 1TB)

ExpirationPriority. The expiration priority helps determine the likelihood that an entry in the seen database will be overwritten. Values range from 1-255, and those >=250 are special. See the databases section for more information.
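The store parameters and the search order can be summarised in a short sketch. It is illustrative only: StoreConfig, choose_store, and the selects callback (a stand-in for real match string evaluation) are hypothetical names, and the rules that the first matching store wins and that overcache stores are skipped for articles are assumptions drawn from the description of the store number.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class StoreConfig:
    number: int               # 1-30, also the search order
    directory: str            # full path, e.g. f:/artstores
    match_string: str         # "overcache" marks an overview cache store
    mbytes: int               # size in megabytes, at most 1000000
    expiration_priority: int  # 1-255

    def validate(self) -> None:
        """Check the documented ranges; raise ValueError if one is violated."""
        if not 1 <= self.number <= 30:
            raise ValueError("store number must be 1-30")
        if not 0 < self.mbytes <= 1_000_000:
            raise ValueError("Mbytes must be at most 1000000")
        if not 1 <= self.expiration_priority <= 255:
            raise ValueError("ExpirationPriority must be 1-255")


def choose_store(stores: Iterable[StoreConfig], article: object,
                 selects: Callable[[str, object], bool]) -> Optional[StoreConfig]:
    """Walk the stores in number order and return the first that selects the article."""
    for store in sorted(stores, key=lambda s: s.number):
        if store.match_string == "overcache":
            continue  # assumption: overview cache stores never hold articles
        if selects(store.match_string, article):
            return store
    return None
```

Because lower-numbered stores are tried first, a store with a narrow match string would normally be given a lower number than a catch-all store.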

 

 

Overview Cache Stores

Tortoise supports caching of overview information. When running without an overview cache, overview information is pulled from the store in which the article is written. This is adequate for a very small number of groups or for a system that serves only a few readers. Before the number of readers grows very large, overview retrieval will become slow.

To combat this, tortoise can do overview caching. To accomplish this, it generates and stores the overview information for a group in a contiguous block in an overview cache. This overview cache is just a repurposed article store (with a matchstring of overcache).

The caching is done using cache blocks. Each group receives (when at least one article is stored for the group) a minimum block allocation of 16 kilobytes in the overview cache store. Updates are done in-place until the overview information exceeds the size of the block, when a new block is allocated in the store.

The overview cache is updated by a scheduled delay mechanism. When an article is added to a group, that group gets scheduled for an overview update in 30 seconds. This update will attempt to simply append the new entry onto the end of the existing cache block. If it cannot (because the block is too small, or because tortoise is in slave mode and the article arrives out of order), the whole block is rewritten. In all cases, except when the cache block is expired by the cyclical action of the store, the overview data is read from the block, updated to be consistent with which articles are still available, then rewritten to disk. If wrapping the overview cache has overwritten the block, the overview records are pulled out of the article stores. While some overwriting is desirable (it has a garbage collection effect), undersizing your overview cache can cause performance degradation, in the worst case leading to constant thrashing of the article stores and the overview store. In general, the overview cache should be 15% to 20% of the total size of your article stores: 15% for servers where space utilization is dominated by large articles, 20% for those dominated by small articles.
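The sizing rule of thumb is simple arithmetic. A minimal sketch follows; the function name and its parameters are hypothetical, and it applies only the two endpoints (15% and 20%) given above.

```python
from typing import Iterable


def recommended_overcache_mbytes(article_store_mbytes: Iterable[int],
                                 small_articles_dominate: bool) -> int:
    """Return a suggested overview cache size in megabytes.

    Uses 20% of the combined article store size when small articles
    dominate space utilization, and 15% when large articles do.
    """
    total = sum(article_store_mbytes)
    percent = 20 if small_articles_dominate else 15
    return total * percent // 100


# Two article stores of 40000 MB and 60000 MB.
print(recommended_overcache_mbytes([40000, 60000], small_articles_dominate=False))
print(recommended_overcache_mbytes([40000, 60000], small_articles_dominate=True))
```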

Databases

Tortoise maintains two separate databases to track information about articles. The first database, called the "seen" database (a historical carryover from an NNTPRelay oddity), tracks information on the articles of which the system has knowledge. The second is the group database, which matches articles with groups and is responsible for tracking overview cache information.

The Seen Database

The seen database tracks the following information about each article:

Hash of the message id (64 bits)

A pointer to the next article in the hash chain (32 bits)

The record’s use-count (8 bits)

Expiration priority (8 bits)

An article’s "key" (a 64 bit value that can be decomposed to find an article)

Message-ID length (8 bits)

The length of the article (32 bits)

The header length (16 bits)

The overview record length (16 bits)

The offset of the message id (16 bits)

Total length of each entry=33 bytes or about 31MB per million articles
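The 33-byte total can be checked by adding up the field widths. The sketch below uses Python's struct module purely as a calculator; the field order follows the list above, but the real on-disk order and byte order are not documented here and are assumptions.

```python
import struct

# "<" selects little-endian with no padding between fields.
# Q hash of message id (64)   I next-in-chain pointer (32)
# B use-count (8)             B expiration priority (8)
# Q article key (64)          B message-id length (8)
# I article length (32)       H header length (16)
# H overview length (16)      H message-id offset (16)
SEEN_RECORD = struct.Struct("<QIBBQBIHHH")

bytes_per_entry = SEEN_RECORD.size
mib_per_million = bytes_per_entry * 1_000_000 / 2**20

print(bytes_per_entry, round(mib_per_million, 1))  # prints: 33 31.5
```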

The seen database is cyclical and its size is set (number of articles to remember) in config.txt. Articles (records) are stored in the seen database as follows. There are two ways to reference the seen database, through a message-id lookup, or by direct reference to a particular item (since the seen is a memory mapped array).

When an article is referenced by message-id, the message-id is hashed, and the hash is used to check the seen for existence. This happens by first transforming the hash value into a seen index, which points into the seen database (or doesn’t, since that index slot may not have been used previously, in which case the search terminates). The hash is checked against the record. If it matches, we’ve found the article. If not, we proceed to the next entry in that hash chain (referenced through the pointer in that record). At some point, either a match is found, or the search terminates because the hash chain ends.
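The lookup described above is a conventional chained hash table. The following sketch shows the idea in Python under several assumptions: truncated SHA-256 stands in for tortoise's unspecified 64-bit hash, the hash is turned into an index with a simple modulo, new records are linked at the head of their chain, and wrapping of the database is ignored. SeenSketch and its methods are hypothetical names.

```python
import hashlib
from dataclasses import dataclass
from typing import List, Optional

NO_ENTRY = -1  # marks an unused index slot or the end of a hash chain


@dataclass
class SeenRecord:
    msgid_hash: int                # 64-bit hash of the message-id
    next_in_chain: int = NO_ENTRY  # next record with the same index


class SeenSketch:
    def __init__(self, buckets: int) -> None:
        self.index: List[int] = [NO_ENTRY] * buckets
        self.records: List[SeenRecord] = []

    @staticmethod
    def hash_message_id(message_id: str) -> int:
        digest = hashlib.sha256(message_id.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")  # keep 64 bits

    def add(self, message_id: str) -> int:
        """Append a record and link it at the head of its hash chain."""
        h = self.hash_message_id(message_id)
        bucket = h % len(self.index)
        self.records.append(SeenRecord(h, self.index[bucket]))
        self.index[bucket] = len(self.records) - 1
        return self.index[bucket]

    def lookup(self, message_id: str) -> Optional[int]:
        """Follow the hash chain; return the record number or None."""
        h = self.hash_message_id(message_id)
        slot = self.index[h % len(self.index)]
        while slot != NO_ENTRY:
            record = self.records[slot]
            if record.msgid_hash == h:
                return slot
            slot = record.next_in_chain
        return None
```

Note that, as in the description, only the hash is compared, so two message-ids with the same 64-bit hash would be indistinguishable to this lookup.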

When adding an article, we need to take a new seen slot. If we haven’t yet wrapped the database, we just use the next unused one. If we have wrapped, the process is a little more complex: first we check to see if the article is still available (check to see if the length>0, and then see if it has been overwritten in the store in which it resides). If the article is no longer available or its expiration priority is less than 5 (more on this shortly), we overwrite its entry in the seen, and increment the usecount so that any references to the old article can determine that this article is not the same one it had recorded.

If the article is still available, the process becomes a little more dynamic. To find a new slot for the article, we begin searching ahead in the database. As we examine each entry, we check that its article is still available and that its expiration priority is more than 5; if either check fails, the search terminates and that entry is overwritten. Otherwise, we divide the entry’s expiration priority in half (excepting entries with a priority >250, which are never reduced in priority, and thus are only overwritten when the article is no longer available, or when the search cannot find another entry in the next 25000 with a lower priority). As we search, we track the lowest expiration priority observed, and limit the number of entries searched to 100 times that priority. If no more suitable entry is found, the lowest-priority entry observed is overwritten and a log entry is made noting that the seen appears to be saturated.

As this cycling process is ongoing, the entries are "fixed up" so that all articles that are still available remain in the appropriate hash chain so as to be available by message-id lookup.

The group database

The group database is responsible for tracking the group membership of articles. It is of relatively simple design, and is essentially a memory mapped array of "slots." Slots can operate in one of two fashions: they are either "head" or "subordinate" slots. Every group on the system has one head slot. The head slot tracks the following information:

The number of slots dedicated to the group

The number of article entries dedicated to the group

The start and end numbers of the group (low and high watermarks)

The overview cache information for the group (key, length, block length, validity bit, cache start and end numbers)

Each slot also has space to store information about 500 articles, and 25 index points into the overview cache (so that we can get within 20 articles of any requested start/end point). Each slot requires approximately 4800 bytes of storage. It would be possible to index for an endpoint as well (say, where the cache has overview records for articles numbered 500-1000 and overview information is requested for 500-750), but this is not currently implemented: the overview cache is read from the calculated start offset through the end of the cache range. This is a significant penalty when early overview data is requested, but it does not appear to happen often enough to cause a notable performance problem. Implementation is not constrained by any known factors other than time.

 

Groups may have one slot (a head slot), in which case they can have a maximum of 500 articles available concurrently. Groups may also chain a head slot with up to 99 subordinate slots to have a maximum of 50000 articles concurrently available.
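The slot arithmetic can be written down directly. A small sketch, with hypothetical function and constant names, using the 500-articles-per-slot and 100-slots-per-group limits stated above:

```python
ARTICLES_PER_SLOT = 500
MAX_SLOTS_PER_GROUP = 100  # one head slot plus up to 99 subordinate slots


def slots_needed(articles_to_keep: int) -> int:
    """Return how many slots a group needs to hold this many articles at once."""
    if articles_to_keep < 1:
        raise ValueError("a group needs room for at least one article")
    slots = -(-articles_to_keep // ARTICLES_PER_SLOT)  # ceiling division
    if slots > MAX_SLOTS_PER_GROUP:
        raise ValueError("a group cannot hold more than 50000 articles")
    return slots


print(slots_needed(500), slots_needed(501), slots_needed(50000))  # prints: 1 2 100
```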

The total number of slots available is set in config.txt.

Database Sizing

Perhaps the most difficult configuration work lies in sizing the databases for tortoise. Since current versions of Windows NT constrain the process VM usage to something less than 2GB, you must ensure that the memory mapped seen and group databases do not consume all available VM space. Generally, you should avoid having these two total more than about 1400 or 1500 MB. Subject to VM limits, you can expand the group or seen databases at any restart by increasing the configured size in config.txt. Our testing of large configurations is limited. If you plan to have combined database sizes exceeding 1GB you should probably drop a line to news@maxwell.syr.edu and do extensive testing on your own.
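A rough estimate of the combined database size can be made from the per-entry figures given earlier (33 bytes per seen entry, approximately 4800 bytes per slot). The sketch below is only an estimate: the function name is hypothetical, the slot size is approximate, and any file or mapping overhead is ignored.

```python
SEEN_ENTRY_BYTES = 33  # from the seen database record layout
SLOT_BYTES = 4800      # approximate size of one group database slot


def database_mbytes(seen_entries: int, slots: int) -> float:
    """Estimate the combined size of the seen and group databases in MB."""
    return (seen_entries * SEEN_ENTRY_BYTES + slots * SLOT_BYTES) / 2**20


# 7000000 seen entries and 180000 slots come to roughly 1044 MB,
# under the suggested 1400 MB ceiling but above the 1GB mark.
estimate = database_mbytes(7_000_000, 180_000)
print(round(estimate), estimate < 1400)
```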

 

If foreknowledge of the exact article to group distribution existed, you could simply decide how many articles you could afford to keep in each group, and set it up. Since such knowledge doesn’t exist, you will need to guess.

Beyond the simple traffic issues, there are also machine scalability issues. The size of your feed, retention, number of concurrent readers, and number of suck feeds will determine the hardware required. It is hard to generate formulas, so here’s a sort of configuration/machine required set that you can use to gauge machine configuration:

# of groups  # of slots  # of seen entries  # of concurrent readers  # feeds (in/out)  Machine (proc/mem)     Disk notes
30000        180000      7000000            100-150                  5/5               PP200/256-512          Seen - 2 disk stripe; Group - 2 disk stripe; Overcache - stripe set; Articles - stripe set
20000        100000      7000000            40-60                    5/5               P200/192               Databases - 2 disk stripe; Overcache - stripe set; Articles - stripe set
8000         100000      7000000            80-120                   5/5               P200/192               Databases - 2 disk stripe; Overcache - stripe set; Articles - stripe set
<1000        <10000      <1000000           <30                      <3/<3             P60/64MB (maybe less)  All on one stripe set or one disk

Note that none of the CPU calculations include filtering software. Cleanfeed (and others) may require CPU and/or memory augmentation.

 

Group Management

Group management in tortoise has evolved, but probably not reached the end of its evolution. The current system is almost certainly not in the running for the best system that could be devised, and future development should be invested in this area. In particular, the software currently does not track the usage status of slots—the administrator must do this manually by examining group.num.

To track groups, tortoise uses two files, group.num and active.tort. Group.num cross references group names with the slot numbers they occupy. Active.tort is an INN format active file used to generate responses to the NNTP LIST command.

A third file, groups.imp, is used to add groups to the system. Upon startup, the groups.imp file is read, each group found is added to group.num and active.tort, and the slot or slots that the group is to use are initialized. Groups.imp has the following format:

[group name] [head slot] [number of slots] [posting flag y/n/m] [start number]

After groups.imp is processed it is deleted.
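For sites that generate groups.imp from a script, the five-field line format is easy to parse and emit. The sketch below assumes the fields are separated by whitespace and that the posting flag is written in lower case; GroupImport and the two functions are hypothetical names, not part of tortoise.

```python
from dataclasses import dataclass


@dataclass
class GroupImport:
    name: str
    head_slot: int
    slot_count: int
    posting_flag: str  # "y", "n" or "m"
    start_number: int


def parse_groups_imp_line(line: str) -> GroupImport:
    """Parse one '[group name] [head slot] [number of slots] [flag] [start]' line."""
    fields = line.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}: {line!r}")
    name, head, count, flag, start = fields
    if flag not in ("y", "n", "m"):
        raise ValueError(f"posting flag must be y, n or m: {flag!r}")
    return GroupImport(name, int(head), int(count), flag, int(start))


def format_groups_imp_line(group: GroupImport) -> str:
    """Render a GroupImport back into a groups.imp line."""
    return (f"{group.name} {group.head_slot} {group.slot_count} "
            f"{group.posting_flag} {group.start_number}")


entry = parse_groups_imp_line("comp.example 22005 4 y 1")
print(entry)
print(format_groups_imp_line(entry))
```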

Procedures for adding a group

Examine group.num and find a free slot or set of slots depending on what you need for the group you’re adding. Create a groups.imp file with the information, and restart tortoise.

For example, if you had 25000 slots configured and the end of your group.num looked like:

Znet.test 22001 3

You might create a groups.imp that looked like this:

New.low.volume.group 22004 1 y 1

New.high.volume.group 22005 4 y 1

This would create two new groups. Note that new.high.volume.group would get 4 slots, or a 2000 article capacity, where new.low.volume.group would get 1 slot. Note that it is technically legal to do this:

New.group.name 22006 1 y 1

New.group.alias 22006 1 y 1

if you want to have 2 names for the same group. This may have unintended side effects, like causing the group to get 2 copies of the same post if both names appear in the Newsgroups line. This is presently untested and unsupported, but it might make for a neat experiment in having group names show up in multiple languages while referring to the same group. For obvious reasons, this should only be done deliberately; accidentally configuring two groups with the same slot number will cause odd behavior.

 

Procedures for deleting a group

If you want to drop a group that is in group.num, comment it out with the # symbol as the first character on the line. Remember that if you comment a group out, you should remove that group from the active.tort to prevent it from being returned in a response to a LIST command.

You can reuse the slots of groups that have been removed—simply specify the appropriate information in the groups.imp file.
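Since slot usage has to be tracked by hand, a small script can help find free slots in group.num. The sketch below rests on assumptions: group.num lines are taken to be "name, head slot, number of slots" as in the Znet.test example, a group's slots are taken to be consecutive starting at its head slot, slots are taken to be numbered from 1, and lines starting with # are treated as removed groups whose slots are free. The function names are hypothetical.

```python
from typing import Iterable, Optional, Set


def used_slots(group_num_lines: Iterable[str]) -> Set[int]:
    """Collect every slot number claimed by an active line of group.num."""
    used: Set[int] = set()
    for line in group_num_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line, or a group that has been commented out
        _name, head, count = line.split()[:3]
        used.update(range(int(head), int(head) + int(count)))
    return used


def find_free_run(used: Set[int], total_slots: int, count: int,
                  first_slot: int = 1) -> Optional[int]:
    """Return the first head slot that starts `count` consecutive free slots."""
    run_start = first_slot
    run_length = 0
    for slot in range(first_slot, total_slots + 1):
        if slot in used:
            run_length = 0
            continue
        if run_length == 0:
            run_start = slot
        run_length += 1
        if run_length == count:
            return run_start
    return None


lines = ["a.group 1 2", "#old.group 3 2", "Znet.test 5 3"]
taken = used_slots(lines)
print(find_free_run(taken, total_slots=10, count=1))
print(find_free_run(taken, total_slots=10, count=3))
```

The value returned is what would go in the head slot field of a groups.imp line.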

Installation

Tortoise installation is not automated. You will need to retrieve the zip archive and decompress it into the directory in which you wish the program files, logs, and other supporting parts to reside.

Before installation you will need to determine how you want to use tortoise. Tortoise can operate in one of three major modes. "Normal" mode supports news readers and feeding. "Transit" mode is for news transit and lowers overhead by avoiding any processing associated with supporting readers (the group database is not used). "Slave" mode causes tortoise to use article numbering supplied in the Xref header.

To run in normal or slave mode, you should obtain the list of groups that will be available on the server. If you have a server available that has the groups you want to carry, the install scripts can retrieve the list for you. If you have an active file, tortoise can utilize that. If you have neither, you will need to craft a groups.imp file (see group management section). For slave mode, the group list should be obtained from the master server.

After it is decompressed in its final home, you should run install.bat.

 

Web Management

Most of the management of Tortoise can be accomplished through the web. By default, tortoise will set up its own internal web server on port 2001 on the computer you select. The built in web server is not designed to face attacks from the Internet—you should protect it appropriately. Because usernames and passwords used for tortoise management are not encrypted, additional considerations apply to managing tortoise from or through untrusted networks.

When you connect to the server, you will need to authenticate with a username and password. You should have created one as part of the install process. If you did not, you should see the registry section to add one manually.

Feeding

Tortoise has a relatively efficient feeding system. For small numbers of feeds (<10) its performance should be approximately equivalent to NNTPRelay if given adequate memory. The feeding system will be significantly more efficient if operated in a real-time mode and the majority of the fed hosts are not consistently fed from a backlog.

Feeds can operate in two modes in tortoise. With real-time feeding enabled, tortoise will attempt to forward articles to peers as soon as they arrive and are processed. This option is efficient and very low latency. If the host is unable to process the articles as quickly as needed to "keep up" with incoming arrivals, the articles will be queued. If real-time feeding is disabled, the articles are directly queued, and they will be processed when the queue file is rotated.

Feed queuing for tortoise is done in a subdirectory per outgoing feed. Queue files are created and remain active for 10 minutes. After 10 minutes, a new file is opened (if necessary) and the current one becomes available to be fed. Feed queue files are simple in format: <message-id>NUL<message-id>NUL….

You can freely copy the queue files from site to site (or move them if desired), or create them by hand and drop them in. It might be useful to create a site that does not have a hostname, just to generate queue files so that you can backfill another host easily.
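Because the queue format is just NUL-terminated message-ids, building or inspecting a queue file by hand takes only a few lines. This is a sketch under assumptions: message-ids are treated as ASCII, every message-id (including the last) is followed by a NUL byte, and the function names are hypothetical. Queue file naming is not covered.

```python
from typing import Iterable, List


def encode_queue(message_ids: Iterable[str]) -> bytes:
    """Build queue file contents: each message-id followed by a NUL byte."""
    return b"".join(mid.encode("ascii") + b"\x00" for mid in message_ids)


def decode_queue(data: bytes) -> List[str]:
    """Split queue file contents back into message-ids, skipping empty chunks."""
    return [chunk.decode("ascii") for chunk in data.split(b"\x00") if chunk]


payload = encode_queue(["<one@example.net>", "<two@example.net>"])
print(payload)
print(decode_queue(payload))
```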

The tortoise feeder supports both streaming (CHECK/TAKETHIS) and non-streaming (IHAVE) feeds. Streaming feeds support total disconnection between the outgoing check/takethis stream and the incoming responses. This has at least one (current) implication that is worth mentioning; streaming responses are not validated against the message-ids that are queued for it. This means that a downstream site can essentially retrieve any message from an upstream streaming feed by sending a 238 <message-id> response. This could have security or policy implications.

One statistic deserving special explanation is "TransmitQ Overflows." You will see this statistic for each outgoing feed through the web interface. A transmit queue overflow occurs when a remote site with a streaming feed suffers from an "eyes bigger than stomach" problem. Basically, we have a fixed queue size for tracking articles that have been checked, indicated as desired by the remote host, and are waiting to be TAKETHIS’d. An overflow means that this queue has overrun and must be grown; this growing can burn CPU if it happens too much. We are working on strategies to temper the check flow to resolve this problem, but for now it is just an indicator to watch.

Incoming Access

Incoming access control is configured through the web interface. At this point, incoming access control is somewhat limited when compared with other news systems, and future development work should address this.

Incoming access control is done through an array of entries. These entries consist of a hostname, the maximum number of connections allowed, the rights (read, post, transfer), and whether or not streaming is allowed.

The hostname can include * or ? as wildcards which will always apply to IP addresses (so 10.0.0.* is always correct), but will only apply to hostnames if you set the "doHostnameResolution" to 1 in the global configuration options. If you have hostname resolution enabled, tortoise will reverse resolve the IP address, forward resolve the hostname and compare the result with the original address in an effort to ensure that there wasn’t any DNS spoofing going on. Using hostname resolution can cause connection delays if the DNS is slow or unresponsive, so consideration should be given to avoiding the use of hostname resolution if connecting clients require non-local DNS access.
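The reverse-then-forward check described above is sometimes called forward-confirmed reverse DNS. The sketch below shows the general technique with Python's standard socket module; it is not tortoise's implementation, the function names are hypothetical, and it handles IPv4 only. The resolver functions are passed in as parameters so the logic can be exercised without touching the network.

```python
import socket
from typing import Callable, List, Optional


def default_reverse(ip: str) -> str:
    """Reverse resolve an address to its primary hostname."""
    return socket.gethostbyaddr(ip)[0]


def default_forward(hostname: str) -> List[str]:
    """Forward resolve a hostname to its list of IPv4 addresses."""
    return socket.gethostbyname_ex(hostname)[2]


def verified_hostname(ip: str,
                      reverse: Callable[[str], str] = default_reverse,
                      forward: Callable[[str], List[str]] = default_forward
                      ) -> Optional[str]:
    """Return the hostname only if it resolves back to the connecting address."""
    try:
        hostname = reverse(ip)
        addresses = forward(hostname)
    except OSError:  # covers socket.herror and socket.gaierror
        return None
    return hostname if ip in addresses else None
```

A result of None would mean the hostname cannot be trusted, so only the IP address would be matched against the access entries.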

When the array is searched, the first entry found is chosen, regardless of whether another entry exists which might apply. This means that entries need to go from more specific to more general. During the creation process, you can select an index value (between 1 and 250) to determine where in the list an entry should go. Entries can be moved after they are created, but this is essentially treated as a "delete and recreate" by tortoise, and will not free up the old entry until all the users that connected by access granted in that entry have disconnected. In general, we recommend that you space out the entries to allow for future changes or expansion.
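The first-match rule is why ordering matters. The sketch below models the access list with hypothetical names (AccessEntry, find_access) and uses Python's fnmatch for the * and ? wildcards; fnmatch also understands [...] character sets, which the tortoise documentation does not mention, so patterns here are assumed to use only * and ?.

```python
import fnmatch
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass
class AccessEntry:
    index: int               # position in the list, 1-250
    host_pattern: str        # IP or hostname, may contain * and ?
    max_connections: int
    rights: Tuple[str, ...]  # any of "read", "post", "transfer"
    streaming: bool


def find_access(entries: Iterable[AccessEntry], ip: str,
                hostname: Optional[str] = None) -> Optional[AccessEntry]:
    """Return the first entry, in index order, whose pattern matches the client."""
    for entry in sorted(entries, key=lambda e: e.index):
        pattern = entry.host_pattern.lower()
        if fnmatch.fnmatchcase(ip, pattern):
            return entry
        # Hostnames are only available when hostname resolution is enabled.
        if hostname is not None and fnmatch.fnmatchcase(hostname.lower(), pattern):
            return entry
    return None
```

With a specific entry such as 10.0.0.5 at index 10 and the general 10.0.0.* at index 200, the specific entry wins for that one host; reversing the two index values would make the specific entry unreachable.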

Registry information

Tortoise stores values that may be changed during operation in the registry. This section documents the registry information.

If you examine the Tortoise registry key in HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\Tortoise, these entries should be there:

Parameters

Feeds

Globals

Incoming

Stores

Users

Under the Parameters key, there should be a ConfigFile REG_SZ value, which contains the full path and name of the config.txt file.

Feeds

For each outgoing feed, a registry key will be created, with the outgoing feed label as the key name. Under this key, you will find all the feed’s configuration values stored as REG_SZ strings.

Globals

This key contains all the configuration options (again as REG_SZ strings) you see in the web interface under "Global Configuration".

Incoming

A new key is created for each incoming access entry. Under this key, you’ll find the access parameters set for this entry.

Stores

Each store has a key that is named for the store number. All the store parameters are contained under this key.

Users

Each user will have a value, with the username being the value name, and the password as the content of that value.

Note: Changes made directly to the registry will not take effect unless tortoise is restarted.