Gebruiker:RonaldB/Open proxy fighting

The open proxy fighting system as it has developed over time consists of several components, all built around a database. This database stores all relevant information on open proxies (IP, port, type, date acquired, (re)confirmation date, etc.).

Processes

The following processes act on this database:

Grabbing
This semi-automatic process acquires IPs from many different sources. These cover normal open proxies, TOR nodes and the URLs of anonymizers (CGI/PHP proxies). Among the sources are also manually maintained lists of provider proxies, mobile IP ranges and so on.
Scanning
This multi-threaded process, running more or less on a 24/7 basis, scans the IPs in the database for open proxy properties, recording all relevant information as seen from the server side. In this way it automatically discovers the exit server IP of a cascade of proxies and the IP actually used by web-based proxies (anonymizers). Where appropriate, the scanning process attempts to discover multiple exit nodes, which can occur with so-called high-anonymous (or elite) proxies and with anonymizers.
As the primary internet source for TOR nodes (like many other internet sources) is sometimes down, the scanning module has a facility to discover TOR exit nodes independently. This is quite a slow process and is only meant for emergencies.
The speed at which IPs can be scanned varies depending on many factors, but a typical scanning speed is between 10 and 20 IPs per second.
Auto-blocking
Some categories of open proxies are blocked pro-actively. That includes exit servers (i.e. of a cascade of open proxies), TOR exit nodes and the IPs used by anonymizers. The auto-blocking also involves unblocking if an IP could not be reconfirmed as an open proxy for a certain period of time. This ensures that proxies that have probably been closed again do not remain blocked forever.
For e.g. TOR exit nodes the system allows a certain grace period. Analysis has shown that quite a few IPs try TOR and quickly disable it again. It is not meaningful to block such experimenters.
Pro-actively blocked IPs do not get a template but a block comment that redirects the blocked user to an explanation. Such blocked IPs are listed on wiki pages, but in such a way that the IP cannot be read by e.g. the Google crawler.
Currently the pro-active blocking is done on nl:w and he:w (the latter on request).
Monitoring
On several of the larger wikis a monitoring module is running. Based on the IRC feeds of recent changes, this module queries the database. If a hit is found, it is reported on a wiki page.
If the last confirmation date is some time in the past, the IP is automatically rescanned to inspect its status (normal open proxies only; this is not possible for the other types). On nl: such a live open proxy is automatically blocked.
This function can also check any IP against a so-called suspect table. That table consists of ranges containing IPs with relatively many vandalism edits in the past, as well as ranges that are known to be used for (highly) dynamic IPs. The latter because some trolls appeared to misuse these dynamics to change their identity very quickly.
Edits by this category of IPs are reported on a special page. Under investigation is an extension reporting edits by certain organisations and/or institutions, thus turning this tool into an on-line version of the WikiScanner. A sketch of the basic monitoring flow is given below.
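
The sketch below illustrates the basic monitoring flow: take the IP seen in a recent-changes feed line, look it up in the open proxy database and report a hit. The table and column names (openproxy, ip, port, last_confirm), the SQLite backend and the feed parsing are assumptions for illustration only; the real module reports to a wiki page and triggers a rescan where needed.

  import re
  import sqlite3

  IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

  def check_recent_change(rc_line, db_path="openproxy.db"):
      """Extract the editing IP from a recent-changes line and query the database."""
      match = IP_RE.search(rc_line)
      if not match:
          return None                       # edit by a registered user, nothing to check
      ip = match.group(0)
      conn = sqlite3.connect(db_path)
      row = conn.execute(
          "SELECT ip, port, last_confirm FROM openproxy WHERE ip = ?", (ip,)
      ).fetchone()
      conn.close()
      if row is None:
          return None                       # not known as an open proxy
      # In the real system a hit is written to a wiki report page, and the IP
      # is rescanned (or blocked on nl:) if last_confirm is too far in the past.
      return {"ip": row[0], "port": row[1], "last_confirm": row[2]}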

Auxiliary modules

Besides these main modules there are a few auxiliary modules.

Query tool
Simply by copying and pasting a chunk of text that includes IP addresses, the database can be queried as to whether or not the IPs in it are some sort of open proxy.
Batch inspection
Most of the IPs that have ever made an edit on nl:w are contained (via the grabbing process) in a table and can be queried against the open proxy tables. This can be useful if the on-line monitoring process has had some drop-outs for whatever reason.
IP range and port scanning
A special mode of the scanning module can scan IP ranges with a list of the most common ports (common according to the database). Alternatively, a single address or a range can be scanned for open ports (either a port range, a list of ports used by known Trojans, or the most common ports for open proxies). A minimal sketch of such a port scan is given below.
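
At its core such a scan attempts plain TCP connections. The sketch below shows the idea for one address and a list of candidate ports; the port list is illustrative only, and the real module of course follows up with the proxy verification described under Reliability.

  import socket

  COMMON_PROXY_PORTS = [80, 1080, 3128, 8000, 8080]    # illustrative list only

  def open_ports(ip, ports=COMMON_PROXY_PORTS, timeout=2.0):
      """Return the subset of candidate ports that accept a TCP connection."""
      found = []
      for port in ports:
          try:
              with socket.create_connection((ip, port), timeout=timeout):
                  found.append(port)
          except OSError:
              pass                          # closed, filtered or host unreachable
      return found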

Some numbers

Global numbers

Growth of IPs in the open proxy database over time

The database contains more than 220,000 IPs, including exit servers, TOR nodes and the IPs used by anonymizers. The "normal" OPs make use of more than 9,000 different ports.

The scanner is able to confirm about one third of the OPs as real. The other two thirds are either very short-lived or have never been an OP at all. Note that it is not always clear how the reporting internet sites determine that an IP is an OP. If this is based on port scanning alone, there may be quite a few false positives.

A complete cycle of the scanning process results in 2,000-3,000 confirmations. In other words: at any moment some 2,500 IPs are really working (normal) proxies.

The confirmed OPs make use of more than 2,300 different ports. That is much less than the more than 9,000 above, but still enough to make the IP of an open proxy useless without the port information.

In the graph at the right, the stronger growth in February 2007 is remarkable. This could be caused by viruses that became active after New Year, spreading more trojans onto the victims' computers. The dip in June is caused by a three-week holiday.

The database furthermore contains the URLs of approximately 11,000 anonymizers, of which more than 9,000 have at some point been confirmed as working.

Lifetime of OPs

Average lifetime of open proxies as a function of acquisition date

Open proxies come and go for various reasons. To mention a few:

  • Machines are switched on and off, a pattern that may depend on the time zone.
  • If the open proxy is a zombie computer (e.g. infected by a trojan horse), this may be discovered sooner or later.
  • If the proxy is a TOR node, the user may decide to stop using the TOR network.
  • If the anonymizer (CGI/PHP proxy) does not generate enough income from the ads, the owner of the domain may decide to drop the domain name, etc.

The age and size of the database are such that some analysis of the lifetime of proxies can reasonably be done. The graph at the right shows the average lifetime for various kinds of open proxies as a function of the acquisition date.
It is obvious that the lines are declining: old and persistent IPs contribute more to the average lifetime than "fresh" IPs. Eventually all lines will stabilise at a certain value, as already appears to be the case for normal OPs and TOR exit nodes.

CGI/PHP proxies and exit nodes have a significantly longer lifetime than the other types. For CGI/PHP proxies that is easy to understand: they exist deliberately. For exit nodes there is also an explanation: these IPs are only useful in a cascade of proxies if they are "reliable", which acts as a kind of natural selection process.
Remarkable is the relatively short lifetime of TOR exit nodes. Apparently many people try it, become dissatisfied with the performance and stop using it.

Experiences

Since nl:w started to pro-actively block the most annoying open proxy types, trolling, spamming and so forth have diminished considerably. The same has been experienced on he:w after my assistance was called in.

This does not mean that trolls that make use of double cloaks and the like have completely disappeared. But there is now ample time (and resources) to investigate other tricky methods of circumventing troll blocks. This way we discovered a troll making use of the dynamic IPs used by mobile networks in NL (now soft blocked) and even a suspect probably making use of open WiFi networks in his neighbourhood.
Spamming that is not yet pro-actively prevented - if it happens - comes in most cases from CN IPs.

Reliability

The reliability of the data is predominantly determined by the scanning process. Rather than merely testing whether the subject IP has open ports, the scanning process really tests whether the IP can be used as an open proxy or not.

Normal OPs
The scanner attempts to establish a connection to a test URL via the IP under investigation. The HTTP request header is defined in such a way that the IP acts as a proxy (if it responds at all).
The test URL returns the client IP and a unique string, enabling the program to recognize the response as coming from the test site.
  • If the client IP thus returned equals the IP used as proxy, the result is stored in the database as a normal open proxy.
  • If the client IP is different, the scanner concludes that the IP under test forwards the request to another proxy. That IP is then added to the database as an exit server (if not already present). The scanner repeats the request to find out whether this cascaded proxy varies from time to time.
The scanner also checks for the presence of "forwarded for" (e.g. X-Forwarded-For) and similar information in the HTTP header. A minimal sketch of this check is given at the end of this section.
Anonymizers
The scanner calls the URL of the anonymizer (also called web-based proxy or CGI/PHP proxy). The returned HTML is analysed for the presence of suitable HTML forms (and much more) and the relevant field is filled with the URL of the test site. The form is subsequently submitted to the URL of the anonymizer.
The remainder of the process is similar to the above.
TOR
Using the scanner to discover TOR exit nodes is very straightforward: the program requests the test page via the TOR network, and the returned page is analysed in the same way as described above. This process is very time-consuming, however, and only acts as a back-up if the premier internet source for active TOR nodes is down for a longer period of time.
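
The sketch below illustrates the check for normal OPs described above: a test page is requested through the candidate IP:port and the IP echoed by that page is compared with the candidate IP. The test URL and the exact response format are assumptions for illustration; the real scanner does considerably more (unique string matching, header inspection, multi-threading) than shown here.

  import requests                                 # third-party HTTP library

  TEST_URL = "http://example.org/whatsmyip"       # hypothetical page that echoes the client IP

  def classify_proxy(ip, port, timeout=10):
      """Request the test page via ip:port and classify the result."""
      proxies = {"http": f"http://{ip}:{port}"}
      try:
          reply = requests.get(TEST_URL, proxies=proxies, timeout=timeout)
      except requests.RequestException:
          return "not confirmed"                  # closed, filtered or not a proxy at all
      seen_ip = reply.text.strip()                # assumes the page simply echoes the client IP
      if seen_ip == ip:
          return "normal open proxy"
      return f"cascade, exit server {seen_ip}"    # request was forwarded to another proxy

The anonymizer and TOR checks follow the same compare-the-echoed-IP idea; only the way the request reaches the test page differs (filling in the web form of the anonymizer, respectively routing the request through the TOR network).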

Reporting via the monitoring function

If an IP address is confirmed to be an open proxy as described, it definitely is an open proxy, but only at the time of confirmation. The other way around: if it could not be confirmed at the moment the scanner probed it, it might still be an open proxy. For instance, the computer could have been switched off at that moment. Therefore the monitoring function reports the relevant dates (first in db, first confirmation and last confirmation).
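
Because a confirmation is only a snapshot in time, the report carries these dates along. A small illustration of how such a report entry could be labelled (the field names and the 30-day threshold are made up; the real report is a wiki page):

  from datetime import date

  def report_line(ip, first_in_db, first_confirm, last_confirm, today=None):
      """Format one monitoring report entry with the relevant dates."""
      today = today or date.today()
      if last_confirm is None:
          status = "never confirmed - suspect only"
      elif (today - last_confirm).days > 30:      # arbitrary threshold, for illustration
          status = "last confirmation is old - rescan advised"
      else:
          status = "recently confirmed open proxy"
      return (f"{ip}: first in db {first_in_db}, first confirm {first_confirm}, "
              f"last confirm {last_confirm} ({status})")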

Coverage

This is the issue of reliability viewed from a different angle; it attempts to answer the question: if an open proxy could only be confirmed once (or even never), or the last confirmation was a while ago, what is then the likelihood that it is indeed an open proxy?

The statistical data on this page, which are rather representative for the English Wikipedia, are taken as an example. Please note that en:w has enough edits per day to use it for general statistical analysis. Besides, the English Wikipedia appears not to be very well protected against edits by open proxies.

The automatically generated statistics show that there are roughly a quarter of a million edits per day, of which 50k come from IPs, of which 230 edits were performed by open proxies (confirmed and suspect together). This means that some 4-5 ‰ (per mille!) of the IP edits come from a (potential) open proxy.

As said before, the database currently holds some 220k addresses of open proxies. Related to the roughly 3 billion IP addresses in use worldwide, this represents only 0.07 ‰ of the total number of IPs in use.

The conclusion is clear. The proportion of open proxy edits is a factor of 50-100 higher than what one would expect. If the database suggests an open proxy edit, even when the date of the (last) confirmation is some time ago or absent, the IP is still suspect and careful examination of the kind of edit(s) is strongly recommended.
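
As a quick sanity check of that factor, the per-mille figures above can be recomputed from the raw numbers (a rough back-of-the-envelope calculation using only the values quoted in this section):

  # rough check of the coverage figures quoted above
  ip_edits_per_day = 50_000           # IP edits per day on en:w
  op_edits_per_day = 230              # of which by (potential) open proxies
  op_addresses = 220_000              # open proxy IPs in the database
  ips_in_use = 3_000_000_000          # rough number of IP addresses in use worldwide

  observed = op_edits_per_day / ip_edits_per_day    # ~0.0046   -> 4.6 per mille
  expected = op_addresses / ips_in_use              # ~0.000073 -> 0.07 per mille
  print(round(observed / expected))                 # ~63, within the quoted factor of 50-100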

Discussion on some topics

  • To follow soon