WaybackMedic
by GreenC
WaybackMedic 2.5 is a bot that adds and maintains links from the list of known web archive services in use on the English Wikipedia.
Edits made after 2018-12-04 were made by version 2.5.
The bot operator is User:GreenC. The bot account is User:GreenC bot. The bot (software) is "WaybackMedic".
Fix number | Function name | Example edit | Description | Notes | Date added |
---|---|---|---|---|---|
1 | fixthespuriousone | Example | Remove spurious |1= in cite templates. | | August 2016 |
2 | fixmissingprotocol | Example | 1. Add https if the protocol is missing from the archive.org URL. 2. Convert an existing http protocol to https. 3. Add the web subdomain if missing (archive.org/web/ → web.archive.org/web/). 4. Add the /web/ path if missing (web.archive.org/2016/ → web.archive.org/web/2016/). In some URLs adding /web/ breaks the link, so those are tested for. | HTTPS per RFC | August 2016 |
3 | fixemptyarchive | Example | 1. If |archiveurl= is empty or missing but |archivedate= has content, attempt to find a working archive URL based on the archive date; otherwise add {{dead link}} if appropriate. 2. If |archivedate= is empty or missing but |archiveurl= has content, generate the date value from the timestamp in the archive URL. 3. If |archiveurl= and |archivedate= are both empty, remove both and leave a {{dead link}} if appropriate. | | August 2016 |
4 | fixbadstatus | Example | Check all Wayback Machine URLs for response code errors (anything but 200s). On an error code, try for a better URL via the Wayback API, first using the accessdate, then using the earliest date available. If none is found there, check the WebCite API, then the Memento API, which checks a few dozen other archives. Other techniques are undocumented. If still none is found, remove |archiveurl= and |archivedate= and add {{dead link}}. | | August 2016 |
5 | Retired | ||||
6 | fixemptywayback | Example | The {{wayback}} template is mangled in a certain way; the bot re-assembles it. It does not delete multiple instances if they exist in the same ref (as in the Example). | | August 2016 |
7 | fixencodedurl | Example | The URL was incorrectly encoded. Fully decode the URL and re-encode it. | | August 2016 |
8 | fixdatemismatch | Example | 1. Ensure |archivedate= matches the snapshot date in the URL. 2. Ensure the date format matches dmy or mdy if set (retain ymd if in use). | | August 2016 |
9 | fixwebcitlong | Example Example | Convert WebCite URLs from short-form to long-form. Convert Freezepage.com URLs from short-form to long-form. | WebCite Usage | January 2017 |
10 | fixstraydt | Example | Remove a stray {{dead link}} template when an archive exists for the link. | | January 2017 |
11 | fixwam | Example | Merge {{wayback}} and {{webcite}} --> {{webarchive}}. Merge completed February 5, 2017. | Webarchive TfM | January 2017 |
12 | fixiats | Example | |archive url -> |archive-url | | January 2017 |
13 | fixswitchurl | Example | Move an archive.org URL from |url= to |archiveurl= and add |archivedate= if missing. | | January 2017 |
14 | Retired | ||||
15 | fixembway | Example Example | 1. A {{wayback}} is embedded in a CS template. 2. A {{dead link}} is embedded in a CS template. | | January 2017 |
16 | <various> | Example | Timestamp and/or |archivedate= is 19700101 and/or out-of-bounds. | | January 2017 |
17 | fixdoubleurl | Example | archive.org URLs are doubled, tripled, etc. | | January 2017 |
18 | fixemptywebarchive | Example | {{webarchive}} |date= is missing or has an empty value. | | January 2017 |
19 | fixdoublewebarchive | Example | Remove duplicate {{webarchive}} instances. | | January 2017 |
20 | fixembwebarchive | Example | A {{cite web}} is embedded in a {{webarchive}}. | | January 2017 |
21 | fixarchiveis | Example Example | 1. Convert Archive.today URLs from short-form to long-form. 2. Fix URL encoding of broken links. 3. Normalize as "archive.today" (see note). | Archive.today Usage | January 2017 |
22 | fixitems | Example | Change "/items/" URLs that are using machine IDs | BRFA | January 2017 |
23 | encodemag | Example | Convert MediaWiki encoding to URL encoding in URLs (i.e. {{!}} and {{=}}) | RFC3986 | January 2017 |
24 | decodespace | Example | Convert %20 to +, + to %20, etc. in URLs that can be repaired this way | See also | June 2017 |
25 | waytree_trailgarb | Example Example Example | Remove typical garbage characters found at the end of URLs: .,;:-"l(%XX)('') | | February 2018 |
26 | fixcommentarchive | Example | Open up commented-out archives and add a |deadurl= "yes" or "no". | | February 2018 |
27 | waytree_x2encoding | Example | Repair double URL-encoding, e.g. %253A (a double-encoded %3A) -> %3A | | February 2018 |
28 | fixencodebug | Example | Repair missed URL-encoding of square brackets | T186417 | February 2018 |
29 | fixiats | Example Example | Restore truncated Wayback URL | | February 2018 |
30 | fixiats | Example | Convert |title={title } -> |title=Archived copy | T203865 | September 2018 |
31 | urlchanger | Example | Move broken URL to a new working URL and undo previous archives. | BOTREQ | November 2018 |
32 | cosmetic | Example Example Example Example Example | Edits that might be cosmetic, made only alongside other edits. 1. Delete trailing # in URLs. 2. Delete empty archive fields. 3. archive.is --> archive.today. 4. Fix double fragments. 5. Convert protocol-relative URLs. | WP:PRURL, T214855, Archive.today | January 2019 |
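
Several fixes in the table are plain string transformations, so they can be illustrated without network access. The following is a minimal Python sketch, not the bot's code (the bot is written in Nim and Awk, and its source is not reproduced here). The function names `normalize_wayback_url`, `archive_date_from_url`, and `undo_double_encoding` are hypothetical, and the regular expressions are assumptions that approximate fixes 2, 3/8, and 27 respectively.

```python
import re
from datetime import datetime
from typing import Optional

MONTHS = ("January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December")


def normalize_wayback_url(url: str) -> str:
    """Approximate fix 2: force https, add the web. subdomain and the /web/ path."""
    # Drop an existing scheme (or a protocol-relative "//"); https is re-added below.
    rest = re.sub(r"^(?:https?:)?//", "", url.strip(), flags=re.IGNORECASE)
    # archive.org/web/... -> web.archive.org/web/...
    rest = re.sub(r"^(?:www\.)?archive\.org/web/", "web.archive.org/web/",
                  rest, flags=re.IGNORECASE)
    # web.archive.org/2016.../ -> web.archive.org/web/2016.../
    # (Assumption: a leading 4-digit run marks a timestamp. The table notes that
    # adding /web/ breaks some links, which only a live check could detect.)
    rest = re.sub(r"^web\.archive\.org/(?=\d{4})", "web.archive.org/web/",
                  rest, flags=re.IGNORECASE)
    return "https://" + rest


def archive_date_from_url(url: str, style: str = "ymd") -> Optional[str]:
    """Approximate fixes 3 and 8: derive an |archivedate= value from the timestamp."""
    match = re.search(r"/web/(\d{8})", url)
    if match is None:
        return None
    try:
        day = datetime.strptime(match.group(1), "%Y%m%d")
    except ValueError:
        return None  # digits that are not a real calendar date, e.g. month 13
    month = MONTHS[day.month - 1]
    if style == "dmy":
        return f"{day.day} {month} {day.year}"
    if style == "mdy":
        return f"{month} {day.day}, {day.year}"
    return day.strftime("%Y-%m-%d")


def undo_double_encoding(url: str) -> str:
    """Approximate fix 27: turn %25XX back into %XX (one level of decoding)."""
    return re.sub(r"%25([0-9A-Fa-f]{2})", r"%\1", url)
```

For example, `archive_date_from_url("https://web.archive.org/web/20160805123456/http://example.com/", "dmy")` yields `5 August 2016`. The sketch does no link checking; per the technical details, the bot verifies changed URLs against the remote site before saving.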
Technical details
- Changes to URLs are checked against the remote site to ensure they are working
- Real-time link checks, no link database. However, links are checked over a 24-hour period before the final upload of the diff.
- Supports many APIs including the Internet Archive, Memento, WebCite and "Timemap" APIs at individual services
- Multiple HTTP header status code checks at the application (WaybackMedic) layer
- Additional time-outs and retries built into the web transfer libraries.
- Additional operating-procedure level checks against network and other errors – bot is semi-supervised in known trouble areas.
- Multiple redundant checks of the APIs using multiple dates to ensure a page really is unavailable
- Accepts API results but then verifies by looking at page headers and/or contents
- The bot is primarily written in Nim (compiles to C source) with support utilities in Awk. Libraries were custom made, including a string primitives library for regex, a wiki template parsing library, an OAuth library (in Awk), a MediaWiki API interface library, and a soft404 detector.
- Due to the nature of the task, running the bot includes a fair amount of supervisory overhead so it requires operator training, though the steps are documented in the source package.
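
The redundant, multi-date lookup described in the list above (and in fix 4) can be sketched as follows. This is a hedged illustration in Python rather than the bot's Nim code. It queries the Internet Archive's public availability endpoint, and the helper names `fetch_json`, `closest_snapshot`, and `find_archive`, the retry count, and the choice to treat only status "200" as usable are assumptions made for the example.

```python
import json
from typing import Callable, Iterable, Optional
from urllib.parse import urlencode
from urllib.request import urlopen

API = "https://archive.org/wayback/available"


def fetch_json(url: str, timeout: float = 30.0) -> dict:
    """Fetch a URL and decode its JSON body."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def closest_snapshot(target: str, timestamp: str,
                     fetch: Callable[[str], dict] = fetch_json) -> Optional[str]:
    """Ask for the snapshot closest to `timestamp` (YYYYMMDD or longer)."""
    query = urlencode({"url": target, "timestamp": timestamp})
    data = fetch(f"{API}?{query}")
    closest = data.get("archived_snapshots", {}).get("closest") or {}
    if closest.get("available") and closest.get("status") == "200":
        return closest.get("url")
    return None


def find_archive(target: str, timestamps: Iterable[str],
                 fetch: Callable[[str], dict] = fetch_json,
                 retries: int = 2) -> Optional[str]:
    """Try each timestamp in order, e.g. the access date, then an early date."""
    for ts in timestamps:
        for _attempt in range(retries + 1):
            try:
                found = closest_snapshot(target, ts, fetch)
            except OSError:
                continue  # transient network error: retry the same timestamp
            if found:
                return found
            break  # a clean "not available" answer: move to the next timestamp
    return None
```

A call such as `find_archive("http://example.com/", ["20150301", "19960101"])` tries the access date first and an early date second. Passing `fetch` as a parameter keeps the logic testable offline. Consistent with the list above, an API hit would still need to be verified by inspecting the snapshot's headers or contents before it is used.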
Running
The bot takes requests at WP:URLREQ on a per-domain basis. You can request a domain name for the bot to process.
Paid editor
GreenC, in accordance with the Wikimedia Foundation's Terms of Use, discloses that he has been paid by the Internet Archive for his contributions to Wikipedia. This funding is for the ongoing development of WaybackMedic and a module of InternetArchiveBot related to books.
General sources
- GitHub is an old public repo. The most current version is not public. The bot is written in Nim and GNU awk.