
T123442: Pageview API: Better filtering of bot traffic on top endpoints
Closed, Resolved · Public

Description


There are pages like "Java" and "Web scraping" in the top 10 at all times. We know our bot filtering is less than desirable, but could we use the nocookie tagging here? We are going to lose 10% of real user traffic, but it would be a cheap proxy to deal with bots.

Let's discuss

Some users complaining:
https://twitter.com/ReaderMeter/status/684804121208045569

Event Timeline

Nuria raised the priority of this task from to Medium.
Nuria updated the task description. (Show Details)
Nuria added a project: Analytics-Backlog.
Nuria subscribed.

Did a quick check this morning:

  • The top endpoint doesn't contain what we flag as bots (namely spiders); it only contains what we flag as "user".
  • Double-checked the pages "Java_(programming_language)" for Jan 10th and "Web_scraping" for Jan 12th:
    • In the pageview_hourly table, I found 2 to 3 browser + city groups with more than 70k requests per day.
    • In webrequest, for a given hour of the specified day, I found that fewer than 5 IPs make most of the traffic, with request counts that are unreasonable for a human (thousands).

A heuristic could be to remove from pageview_hourly the distinct pages (by project, language_variant and page_title) that receive more than X views from a single IP.
If we go in that direction, we'd need to double-check data loss first.
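
As a rough illustration of that heuristic (it has to go against webrequest, since pageview_hourly has no IP column; the column names assume the webrequest schema, and the threshold and date are only placeholders):

-- Pages where a single IP accounts for an unreasonably large number of
-- pageviews in one day; 10000 is an arbitrary illustrative threshold.
SELECT project, language_variant, page_title, client_ip, COUNT(1) AS views_from_ip
FROM (
  SELECT
    pageview_info['project']          AS project,
    pageview_info['language_variant'] AS language_variant,
    pageview_info['page_title']       AS page_title,
    client_ip
  FROM webrequest
  WHERE year = 2016 AND month = 1 AND day = 12
    AND is_pageview = TRUE
    AND agent_type = 'user'
) pv
GROUP BY project, language_variant, page_title, client_ip
HAVING COUNT(1) > 10000;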

I like @JAllemandou's approach. Excluding IPs with a completely unrealistic amount of human traffic is probably simple enough. One way to validate this is to compare the results to the human-filtered list in the English Wikipedia Top 25 report. It uses the percentage of mobile views to detect and remove artificial traffic:

Since mobile view data became available to the Report in October 2014, we exclude articles that have almost no mobile views (~2% or less) or almost all mobile views (~95% or more) because they are very likely to be automated views based on our experience and research of the issue.

You can see West.andrew.g's chart of pages (with % mobile) here.
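
For reference, that mobile-share check could be sketched against pageview_hourly roughly like this (assuming access_method carries the 'desktop' / 'mobile web' / 'mobile app' values, reusing the Report's ~2% / ~95% cutoffs; the 100k-view floor is only illustrative):

-- Per-page mobile share for one day on enwiki; pages whose mobile share falls
-- outside the ~2%..~95% band are candidates for the "likely automated" bucket.
SELECT page_title, total_views, mobile_share
FROM (
  SELECT
    page_title,
    SUM(view_count) AS total_views,
    SUM(IF(access_method IN ('mobile web', 'mobile app'), view_count, 0))
      / SUM(view_count) AS mobile_share
  FROM pageview_hourly
  WHERE project = 'en.wikipedia'
    AND agent_type = 'user'
    AND year = 2016 AND month = 1 AND day = 12
  GROUP BY page_title
) per_page
WHERE total_views > 100000
  AND (mobile_share <= 0.02 OR mobile_share >= 0.95)
ORDER BY total_views DESC;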

We deprioritized this task earlier in Q1; moving it back to Q2 in case we want to take a second look.

I have released a new version of Topviews that shows the percentage of mobile views each page receives: http://tools.wmflabs.org/topviews/?project=en.wikipedia.org&platform=all-access&mobileviews=true

It automatically hides some false positives, but before this was done, you could see that 404.php has 0% mobile views (rounded down). For enwiki at least, this immediately indicates a false positive. There was also Oxford Manifesto, again with 0% mobile views. Then we have pages like XHamster and XXX with over 90% mobile views. Those I'm quite certain are also false positives.

So I guess the big indicator is if the percentage of mobile views is either extremely low or extremely high. However, as you might expect this is not consistent across projects. For instance, see results for swwiki: http://tools.wmflabs.org/topviews-test/?project=sw.wikipedia.org&platform=all-access&mobileviews=1&debug=true Here the percentage of mobile views is regularly over 90%, presumably because mobile devices in this part of the world are the most popular portal to the internet. So the logic we use for enwiki won't work there.

Not sure if these findings are helpful but I thought I'd share :)

Meant to post this earlier, but great work @MusikAnimal! I'm eager to see this codified into some sort of anti-spam correction, but I'm concerned by articles like "Oxford Manifesto", which also has <0.1% mobile views. Though on second thought, the page does look a bit anomalous to be ranking so highly.

Milimetric moved this task from Wikistats to Dashiki on the Analytics board.

So for March 14 we had this: https://wikimedia.org/api/rest_v1/metrics/pageviews/top/ru.wikipedia/all-access/2017/03/14

On the Russian Wikipedia, roughly half of the top 200 pages are false positives. The bot (or whatever) was apparently written to scrape pages alphabetically, starting with the characters "Бе". They begin consistently at around 7,480 hits, and the hits slowly decrease as the bot iterated alphabetically through the pages. Single-page false positives like this happen all the time, but this is the first time I've seen it on a large scale for a single endpoint.

I didn't check all the pages, but the several I did reflect the same scenario I see with most false positives, where the vast majority of traffic comes from a single city. I haven't been checking IPs because those queries take a lot longer, so I can't say for sure if your everyday false positives are from the same IP, but that's most likely the case. Going by city should still be sufficient, provided you make the threshold high enough. So I would suggest that if the top city has over, say, 1,000 times as many pageviews as the next city (an unreasonable amount), it's safe to assume it's a false positive. That's a very simple but (in my experience) effective comparison, and it would filter out most of the false positives I've uncovered. The query I typically use:

SELECT
  city,
  SUM(view_count) AS viewcount
FROM
  pageview_hourly
WHERE
  page_title = 'Без_границ_(организация)'
  AND project = 'ru.wikipedia'
  AND year = 2017
  AND month = 3
  AND day = 14
GROUP BY city
ORDER BY viewcount DESC

And from here compare the counts for the top city to the others. In this case the false positives also had less than 0.1% mobile pageviews, a tactic that works for ru.wikipedia. By contrast I think comparing the top cities might work for any wiki, again provided the threshold is crazy high. I can't imagine how pageviews originating in the top city could be 1,000 times more than the next. This could happen with New York City vs Hertford, North Carolina, but I doubt you'd see two cities like that side by side when sorted by viewcount.
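
As a sketch of how that comparison could be automated on top of the query above (assuming Hive window functions are available; the 1,000× factor is the same illustrative threshold as before):

-- Flag pages whose top city outweighs the second city by an unreasonable
-- factor. Pages seen from only one city drop out, because the comparison
-- against a NULL second-city count is never true.
SELECT
  page_title,
  MAX(CASE WHEN city_rank = 1 THEN viewcount END) AS top_city_views,
  MAX(CASE WHEN city_rank = 2 THEN viewcount END) AS second_city_views
FROM (
  SELECT
    page_title,
    city,
    viewcount,
    ROW_NUMBER() OVER (PARTITION BY page_title ORDER BY viewcount DESC) AS city_rank
  FROM (
    SELECT page_title, city, SUM(view_count) AS viewcount
    FROM pageview_hourly
    WHERE project = 'ru.wikipedia'
      AND year = 2017 AND month = 3 AND day = 14
    GROUP BY page_title, city
  ) per_city
) ranked
WHERE city_rank <= 2
GROUP BY page_title
HAVING MAX(CASE WHEN city_rank = 1 THEN viewcount END)
     > 1000 * MAX(CASE WHEN city_rank = 2 THEN viewcount END);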

What do you think? Is it possible to automate these queries and exclude any pages meeting the criteria? The beeline queries are pretty slow (though you probably have a better way to do it), so if it helps we could maybe only test the top 100 pages. That would however mean you'd first need to compute the top 1,000 then test for false positives. Not sure if we're OK with returning less than the advertised 1,000 pages, but if so maybe compute the top 1,100 or so to give a little wiggle room. If we were somehow able to do this it'd greatly improve the data. There are other mysterious false positives with IPs all around the world, and those we may not ever figure out, but I think we should attempt to filter out the obvious ones.

The problem I'd be worried about is when traffic from a specific city makes sense, like there is local news about that city that isn't relevant to the rest of the world. We'd have to find events like that and figure out how they're different from false positives like the ones you identified here. More importantly than stats, it seems we're being bombarded with fake traffic. A solution to this seems highly desirable. Will try and up the priority.

Milimetric raised the priority of this task from Medium to High. Mar 16 2017, 10:56 AM

This is on our radar for Q1/Q2 (July 2017–September 2017); we will not be able to tackle this problem any sooner.

The problem I'd be worried about is when traffic from a specific city makes sense, like there is local news about that city that isn't relevant to the rest of the world. We'd have to find events like that and figure out how they're different from false positives like the ones you identified here. More importantly than stats, it seems we're being bombarded with fake traffic. A solution to this seems highly desirable. Will try and up the priority.

I would assume at least the top 100 would reflect more nationwide or international attention, and the top 100 are what most people are interested in. Usually any huge news that happens in NYC is going to be reported elsewhere, for instance. I can try to find out for sure, but again I bet these bots usually operate from a single IP, so going by that should alleviate your concern. There are some exceptions, like countries or perhaps individual cities where the public predominantly share the same IP or a small pool of IPs. To account for that you might consider restricting this false positive detection to the more popular projects that don't have weird edge cases and are more subject to fake traffic.

I was talking to someone about bot detection, and they mentioned that they have gotten good mileage in bot filtering by grading IP addresses by the ratio of HTML pages requested. I ran a quick query against a day's webrequest logs to get a top-level idea of what's plausible:

select count(1) as n_ip, percentile_approx(n_html/n_req, array(0.001, 0.01, 0.05, 0.25, 0.5, 0.75, 0.95, 0.99, 0.999)) as percentiles from (select sum(if(substring(content_type, 1, 9) == "text/html", 1, 0)) as n_html, count(1) as n_req from webrequest where year=2018 and month=4 and day=17 group by client_ip ) x;

Number of IP addresses: 166,962,302
Percentiles for the ratio of requests returning HTML over total requests, by IP address:

0.1%    1%      5%      25%     50%     75%     95%     99%     99.9%
0       0       0       0       0.046   0.097   0.285   0.999   0.999

I was surprised how many IPs don't request HTML. I verified with a direct count that indeed 64M out of 167M IP addresses request less than 1 HTML page per 1,000 requests. This could be a problem with my ad hoc classification method.
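
For completeness, that direct count was roughly of this shape (a sketch reusing the per-IP subquery from the percentile query above):

-- Count IPs whose requests return HTML less than once per 1,000 requests
SELECT COUNT(1) AS n_low_html_ips
FROM (
  SELECT
    client_ip,
    SUM(IF(content_type LIKE '%text/html%', 1, 0)) AS n_html,
    COUNT(1) AS n_req
  FROM webrequest
  WHERE year = 2018 AND month = 4 AND day = 17
  GROUP BY client_ip
) per_ip
WHERE n_html / n_req < 0.001;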

This seems to at least have the ability to differentiate IP addresses, although it would require a good bit more evaluation to determine whether we could do something useful with it. I'm not sure if apps ever get an HTML response either, or whether their content is embedded in a reply of a different content type.

This is so cool, @EBernhardson, thank you. Formatting for my future reference:

select count(1) as n_ip,
       percentile_approx(
           n_html/n_req,
           array(0.001, 0.01, 0.05, 0.25, 0.5, 0.75, 0.95, 0.99, 0.999)
       ) as percentiles

  from (select sum(if(content_type like '%text/html%', 1, 0)) as n_html,
               count(1) as n_req
          from webrequest
         where year=2018
           and month=4
           and day=17
         group by client_ip)  html_vs_total_webrequests;

Closing: the automated marker has been deployed, and the top endpoints will no longer report data marked as 'automated'. See: https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/BotDetection
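
Assuming, per the linked wiki page, that the new marker surfaces as an agent_type value of 'automated' in pageview_hourly alongside 'user' and 'spider', a top-pages query restricted to genuine user traffic now looks roughly like this (date is illustrative):

-- Top pages for one day, keeping only traffic classified as genuine users,
-- i.e. excluding both 'spider' and the new 'automated' marker.
SELECT page_title, SUM(view_count) AS views
FROM pageview_hourly
WHERE project = 'en.wikipedia'
  AND agent_type = 'user'
  AND year = 2020 AND month = 1 AND day = 1
GROUP BY page_title
ORDER BY views DESC
LIMIT 10;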