Reviving Dead Links on the Web with Fable

Authors:

Anish Nyayachavadi,

Vaspol Ruamviboonsuk, and

Harsha V. Madhyastha

IMC '23: Proceedings of the 2023 ACM on Internet Measurement Conference

October 2023

Pages 131 - 144

https://doi.org/10.1145/3618257.3624832

Published: 24 October 2023

Abstract

The web is littered with millions of links which previously worked but no longer do. When users encounter any such broken link, they resort to looking up an archived copy of the linked page. But, for a sizeable fraction of these broken links, no archived copies exist. Even if a copy exists, it often poorly approximates the original page, e.g., any functionality on the page which requires the client browser to communicate with the page's backend servers will not work, and even the latest copy will be missing updates made to the page's content after that copy was captured.

To address this situation, we observe that broken links are often merely a result of website reorganizations; the linked page still exists on the same site, albeit at a different URL. Therefore, given a broken link, our system FABLE attempts to find the linked page's new URL by learning and exploiting the pattern in how the old URLs for other pages on the same site have transformed to their new URLs. We show that our approach is significantly more accurate and efficient than prior approaches which rely on stability in page content over time. FABLE increases the fraction of dead links for which the corresponding new URLs can be found by 50%, while reducing the median delay incurred in identifying the new URL for a broken link from over 40 seconds to less than 10 seconds.
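The idea of learning how a site's old URLs map to its new ones can be illustrated with a deliberately simplified sketch. It is not FABLE's actual algorithm: it assumes the only change is a rewritten leading portion of the URL path, and the function names (`infer_prefix_rule`, `learn_rule`, `apply_rule`) and the `example.com` URLs are hypothetical, introduced only for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit


def path_segments(url):
    """Return the non-empty path segments of a URL."""
    return [s for s in urlsplit(url).path.split("/") if s]


def infer_prefix_rule(old_url, new_url):
    """Infer an (old_prefix, new_prefix) rewrite from one known pair.

    The longest run of identical trailing path segments is stripped;
    whatever remains at the front is treated as the rewritten prefix.
    """
    old_segs = path_segments(old_url)
    new_segs = path_segments(new_url)
    shared = 0
    while (shared < min(len(old_segs), len(new_segs))
           and old_segs[-1 - shared] == new_segs[-1 - shared]):
        shared += 1
    old_prefix = tuple(old_segs[:len(old_segs) - shared])
    new_prefix = tuple(new_segs[:len(new_segs) - shared])
    return old_prefix, new_prefix


def learn_rule(examples):
    """Pick the rewrite rule that most example pairs agree on."""
    rules = Counter(infer_prefix_rule(old, new) for old, new in examples)
    return rules.most_common(1)[0][0] if rules else None


def apply_rule(rule, broken_url):
    """Apply a learned rule to a broken URL; None if it does not match."""
    old_prefix, new_prefix = rule
    parts = urlsplit(broken_url)
    segs = path_segments(broken_url)
    if tuple(segs[:len(old_prefix)]) != old_prefix:
        return None
    new_path = "/" + "/".join(list(new_prefix) + segs[len(old_prefix):])
    return f"{parts.scheme}://{parts.netloc}{new_path}"


# Hypothetical (old URL, new URL) pairs for other pages on the same site.
examples = [
    ("http://example.com/articles/2019/intro.html",
     "http://example.com/blog/2019/intro.html"),
    ("http://example.com/articles/2020/faq.html",
     "http://example.com/blog/2020/faq.html"),
]
rule = learn_rule(examples)
candidate = apply_rule(rule, "http://example.com/articles/2021/setup.html")
print(candidate)  # http://example.com/blog/2021/setup.html
```

A candidate produced this way is only a guess: it would still need to be fetched and checked against the live site before being treated as the page's new URL.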

Index Terms

Reviving Dead Links on the Web with Fable
1. Information systems
  1. Information systems applications
    1. Digital libraries and archives
  2. World Wide Web

Published In

IMC '23: Proceedings of the 2023 ACM on Internet Measurement Conference

October 2023

746 pages

ISBN:9798400703829

DOI:10.1145/3618257

General Chairs: Marie-José Montpetit (McGill University, Canada), Aris Leivadeas (École de Technologie Supérieure, Canada)

Program Chairs: Steve Uhlig (Queen Mary University of London, United Kingdom), Mobin Javed (Lahore University of Management Sciences, Pakistan)

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGCOMM: ACM Special Interest Group on Data Communication

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

IMC '23

Sponsor:

SIGCOMM

IMC '23: ACM Internet Measurement Conference

October 24 - 26, 2023

Montreal QC, Canada

Acceptance Rates

Overall Acceptance Rate 277 of 1,083 submissions, 26%
