Automatically identifying coherent web sessions from browser logs

Due to the increasing diversity in both users’ behaviour and the types of web tasks performed, many studies in information retrieval (IR) are turning towards session-based retrieval rather than single URL-query pairs. However, extracting the meaningful session data from the raw discrete logs is stil...

Bibliographic Details
Main Author: Ye, Chaoyu
Format: Thesis (University of Nottingham only)
Language: English
Published: 2019
Subjects: web sessions; browsers; internet; information retrieval
Online Access: https://eprints.nottingham.ac.uk/59119/
_version_ 1848799587881451520
author Ye, Chaoyu
building Nottingham Research Data Repository
collection Online Access
description Due to the increasing diversity in both users’ behaviour and the types of web tasks performed, many studies in information retrieval (IR) are turning towards session-based retrieval rather than single URL-query pairs. However, extracting meaningful session data from raw, discrete logs is still a significant challenge. Most prior studies have been based on datasets in which the logs of each user’s web history were simply divided by fixed periods of inactivity, such as 5, 15, or 30 minutes [52,31]. There have also been some attempts to go beyond these simplistic fixed timeouts [91], but they focus on search-related activities only rather than covering all web activities. Consequently, it is necessary to find a meaningful way to cluster all activities on a web browser, including both searching and browsing. The goal of this study is to find a better way to automatically segment users’ web activity into sessions. There are three research stages: 1) how people understand session segmentation in their own mental models, 2) how these self-identified sessions look in practically implemented web logs, and 3) how we can algorithmically identify these sessions from browser activity, and how each algorithm performs. To answer these questions, a qualitative study was first conducted, generating a taxonomy of six factors related to user-identified sessions. Then a Chrome extension was built that captured user-identified sessions in practice, together with comprehensive web logs including both user interaction and visit details. This helped to gather a ground-truth dataset to support further evaluation. Finally, several algorithmic approaches to automatically clustering web activities into groups closer to user-identified sessions were evaluated.
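The fixed-inactivity baseline mentioned in the description (splitting a user's log whenever the gap between consecutive visits exceeds 5, 15, or 30 minutes) can be sketched roughly as follows. This is only an illustration of that prior-work baseline, not the approach evaluated in the thesis; the function name `split_by_timeout`, its parameters, and the sample timestamps are all hypothetical.

```python
from datetime import datetime, timedelta


def split_by_timeout(timestamps, timeout_minutes=30):
    """Group visit timestamps into sessions using a fixed inactivity gap.

    Hypothetical helper: a new session starts whenever the gap since the
    previous visit is longer than `timeout_minutes`.
    """
    timeout = timedelta(minutes=timeout_minutes)
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= timeout:
            # Gap is within the timeout: extend the current session.
            sessions[-1].append(ts)
        else:
            # First visit, or gap too long: open a new session.
            sessions.append([ts])
    return sessions


# Made-up sample data: gaps of 10, 40, and 3 minutes between visits.
visits = [
    datetime(2019, 1, 1, 9, 0),
    datetime(2019, 1, 1, 9, 10),
    datetime(2019, 1, 1, 9, 50),
    datetime(2019, 1, 1, 9, 53),
]

for minutes in (5, 15, 30):
    print(minutes, len(split_by_timeout(visits, minutes)))
```

The same visits fall into a different number of sessions depending on the chosen threshold, which illustrates why the description calls fixed timeouts simplistic.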
first_indexed 2025-11-14T20:38:03Z
format Thesis (University of Nottingham only)
id nottingham-59119
institution University of Nottingham Malaysia Campus
institution_category Local University
language English
last_indexed 2025-11-14T20:38:03Z
publishDate 2019
recordtype eprints
repository_type Digital Repository
spelling nottingham-59119 2025-02-28T14:39:37Z https://eprints.nottingham.ac.uk/59119/ Automatically identifying coherent web sessions from browser logs Ye, Chaoyu
2019-12-12 Thesis (University of Nottingham only) NonPeerReviewed application/pdf en arr https://eprints.nottingham.ac.uk/59119/1/thesis_submitted_20190920.pdf Ye, Chaoyu (2019) Automatically identifying coherent web sessions from browser logs. PhD thesis, University of Nottingham. web sessions browsers internet information retrieval
title Automatically identifying coherent web sessions from browser logs
topic web sessions
browsers
internet
information retrieval
url https://eprints.nottingham.ac.uk/59119/