Exploiting web scraping in a collaborative filtering-based approach to web advertising

Web advertising is an emerging research field at the intersection of information retrieval, machine learning, optimization, and microeconomics. It is one of the major sources of income for a large number of websites. Its main goal is to suggest products and services to the ever-growing population of Internet users.

There are two primary channels for distributing ads: sponsored search (or paid search advertising) and contextual advertising (or content match). Sponsored search advertising displays ads on the page returned by a Web search engine in response to a query, whereas contextual advertising displays ads within the content of a generic, third-party Web page. A commercial intermediary, called an ad network, is usually in charge of optimizing the selection of ads with the twofold goal of increasing revenue and improving user experience. The ads are selected and served by automated systems based on the content displayed to the user.

Web scraping (also called Web harvesting or Web data extraction) is a software technique aimed at extracting information from websites [1]. Usually, Web scrapers simulate human exploration of the World Wide Web either by implementing the low-level Hypertext Transfer Protocol (HTTP) or by embedding suitable Web browsers. Web scraping is closely related to Web indexing, an information retrieval technique adopted by several search engines to index information on the Web through a bot. In contrast, Web scraping focuses on transforming unstructured data on the Web, typically in HTML format, into structured data that can be stored and analyzed in a central local database or spreadsheet. Web scraping is also related to Web automation [2], which simulates human Web browsing using computer software. Web scraping is currently used for online price comparison, weather data monitoring, website change detection, Web research, Web mashups, and Web data integration. Several (commercial) software tools aimed at personalizing websites by adopting scraping techniques are currently available.
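The transformation from HTML into structured records described above can be illustrated with a minimal sketch. It uses only Python's standard `html.parser` module; the class name `LinkExtractor`, the function `scrape_links`, the sample markup, and the `example.com` URLs are all assumptions made for illustration and do not come from the paper.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect one structured record (url, text) per <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.records = []   # structured output: one dict per link
        self._href = None   # href of the <a> element currently open, if any
        self._text = []     # text fragments seen inside that element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Keep text only while inside a link that has an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = "".join(self._text).strip()
            self.records.append({"url": self._href, "text": text})
            self._href = None


def scrape_links(html):
    """Turn raw HTML into a list of {'url', 'text'} records."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.records


# Hypothetical page fragment standing in for a downloaded Web page.
SAMPLE_HTML = """
<html><body>
  <p>Offers: <a href="https://example.com/shoes">Running <b>shoes</b></a>
  and <a href="https://example.com/bags">Travel bags</a>.</p>
</body></html>
"""

if __name__ == "__main__":
    for row in scrape_links(SAMPLE_HTML):
        print(row)
```

The resulting list of dictionaries could then be written to a spreadsheet or a local database table, which is the "structured data" side of the transformation.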

In this paper, we present a collaborative filtering-based Web advertising system that exploits Web scraping techniques to suggest suitable ads for a given Web page. In particular, we treat Web advertising as an information filtering task and build the proposed system on collaborative filtering [3]. The system first applies collaborative filtering and then relies on Web scraping to extract the ads to be suggested. The idea of exploiting collaborative filtering in Web advertising was proposed by Armano & Vargiu [4] and also adopted in Armano et al. [5]. To the best of our knowledge, this is the first attempt to adopt Web scraping techniques to perform Web advertising. The underlying motivation for adopting Web scraping is that, when no ad dataset is available (such datasets are available only to companies that operate advertising systems, e.g., Yahoo!, Google, or Microsoft, and not for academic purposes), this unsupervised approach can be adopted instead of building an ad hoc dataset by hand.
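As a rough illustration of the collaborative filtering idea in general, the sketch below ranks candidate ads for a target page using similarity-weighted votes from other pages. This is a generic neighborhood-based scheme with cosine similarity, not the authors' actual method, which this abstract does not detail. The function names `cosine` and `suggest_ads`, the page and ad identifiers, and the toy association weights are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def suggest_ads(target, interactions, top_n=2):
    """Rank ads not yet associated with `target` by votes from similar pages."""
    scores = {}
    for page, ads in interactions.items():
        if page == target:
            continue
        sim = cosine(interactions[target], ads)
        if sim <= 0.0:
            continue  # pages with nothing in common do not vote
        for ad, weight in ads.items():
            if ad not in interactions[target]:
                scores[ad] = scores.get(ad, 0.0) + sim * weight
    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [ad for ad, _ in ranked[:top_n]]


# Hypothetical toy data: page -> {ad: association strength}.
INTERACTIONS = {
    "page_a": {"ad_shoes": 1.0, "ad_bags": 1.0},
    "page_b": {"ad_shoes": 1.0, "ad_bags": 1.0, "ad_watches": 1.0},
    "page_c": {"ad_bags": 1.0, "ad_hotels": 1.0},
    "page_d": {"ad_cars": 1.0},
}

if __name__ == "__main__":
    print(suggest_ads("page_a", INTERACTIONS))
```

In a pipeline like the one described, the association data would come from whatever the filtering step uses as evidence, and the candidate ads would be obtained by scraping rather than from a hand-built dataset; both of those parts are left out of this sketch.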

(Authors: Eloisa Vargiu, Mirko Urru

Published by Sciedu Press)