UCL Discovery

Web crawlers on a health related portal: Detection, characterisation and implications

Jawaheer, G; Kostkova, P; (2011) Web crawlers on a health related portal: Detection, characterisation and implications. In: 2011 Developments in E-systems Engineering. (pp. 24-29). IEEE: Dubai, United Arab Emirates. Green open access

Abstract

Web crawlers are automated computer programs that visit websites in order to download their content. They are employed for non-malicious (search engine crawlers indexing websites) and malicious purposes (those breaching privacy by harvesting email addresses for unsolicited email promotion and spam databases). Whatever their usage, web crawlers need to be accurately identified in an analysis of the overall traffic to a website. Visits from web crawlers as well as from genuine users are recorded in the web server logs. In this paper, we analyse the web server logs of NRIC, a health related portal. We present the techniques used to identify malicious and non-malicious web crawlers from these logs, using a blacklist database and analysis of the characteristics of the online behaviour of malicious crawlers. We use visualisation to carry out sanity checks along the crawler removal process. We illustrate the use of these techniques using 3 months of web server logs from NRIC. We use a combination of visualisation and baseline measures from Google Analytics to demonstrate the efficacy of our techniques. Finally, we discuss the implications of our work on the analysis of the web traffic to a website using web server logs and on the interpretation of the results from such analysis. © 2011 IEEE.
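As an illustration of the kind of log filtering the abstract describes, the following is a minimal sketch that flags crawler IP addresses by combining a blacklist lookup with simple behavioural checks. It is not the authors' method: the abstract does not give the log layout, the blacklist source, or the behavioural criteria, so the Combined Log Format pattern, the blacklist entries, the user-agent tokens, the robots.txt check, and the request threshold below are all assumptions chosen for illustration.

```python
import re
from collections import Counter

# Assumed layout: Apache-style "combined" log format.
# The actual NRIC log format is not stated in the abstract.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Hypothetical blacklist and thresholds, for illustration only.
BLACKLISTED_IPS = {"203.0.113.7"}
BOT_AGENT_TOKENS = ("bot", "crawler", "spider")
MAX_REQUESTS_PER_IP = 3


def parse(line):
    """Return the named fields of one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def find_crawler_ips(lines):
    """Return the set of IPs flagged by the blacklist or by behaviour."""
    entries = [e for e in map(parse, lines) if e is not None]
    requests_per_ip = Counter(e["ip"] for e in entries)
    crawlers = set()
    for e in entries:
        agent = e["agent"].lower()
        if (
            e["ip"] in BLACKLISTED_IPS                       # blacklist lookup
            or any(tok in agent for tok in BOT_AGENT_TOKENS)  # self-declared bot
            or e["path"] == "/robots.txt"                     # typical crawler request
            or requests_per_ip[e["ip"]] > MAX_REQUESTS_PER_IP # unusually many requests
        ):
            crawlers.add(e["ip"])
    return crawlers
```

Entries from the flagged IPs can then be dropped before computing traffic statistics, and the remaining counts compared against an independent baseline, in the spirit of the sanity checks mentioned in the abstract.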

Type: Proceedings paper
Title: Web crawlers on a health related portal: Detection, characterisation and implications
Event: 2011 Developments in E-systems Engineering
Open access status: An open access version is available from UCL Discovery
DOI: 10.1109/DeSE.2011.83
Publisher version: https://doi.org/10.1109/DeSE.2011.83
Language: English
Additional information: This version is the author accepted manuscript. For information on re-use, please refer to the publisher’s terms and conditions.
UCL classification: UCL
UCL > Provost and Vice Provost Offices
UCL > Provost and Vice Provost Offices > UCL BEAMS
UCL > Provost and Vice Provost Offices > UCL BEAMS > Faculty of Maths and Physical Sciences
UCL > Provost and Vice Provost Offices > UCL BEAMS > Faculty of Maths and Physical Sciences > Inst for Risk and Disaster Reduction
URI: https://discovery.ucl.ac.uk/id/eprint/10088955
Downloads since deposit: 55