Automatically identifying coherent web sessions from browser logs
Due to the increasing diversity in both users’ behaviour and the types of web tasks performed, many studies in information retrieval (IR) are turning towards session-based retrieval rather than single URL-query pairs. However, extracting the meaningful session data from the raw discrete logs is stil...
| Main Author: | |
|---|---|
| Format: | Thesis (University of Nottingham only) |
| Language: | English |
| Published: | 2019 |
| Subjects: | |
| Online Access: | https://eprints.nottingham.ac.uk/59119/ |
| _version_ | 1848799587881451520 |
|---|---|
| author | Ye, Chaoyu |
| author_facet | Ye, Chaoyu |
| author_sort | Ye, Chaoyu |
| building | Nottingham Research Data Repository |
| collection | Online Access |
| description | Due to the increasing diversity in both users’ behaviour and the types of web tasks performed, many studies in information retrieval (IR) are turning towards session-based retrieval rather than single URL-query pairs. However, extracting meaningful session data from raw, discrete logs is still a significant challenge. Most prior studies have been based on datasets where the logs of each user’s web history were simply divided by fixed periods of inactivity, such as 5, 15, or 30 minutes [52,31]. There have also been some attempts to go beyond these simplistic fixed timeouts [91], but rather than covering all web activities, these attempts focus on search-related activities only. Consequently, it is necessary to find a meaningful way to cluster all activities on a web browser, including both searching and browsing. The goal of this study is to find a better way to automatically segment users’ web activity into sessions. There are three research stages: 1) how people understand session segmentation in their mental models, 2) how these self-identified sessions look in web logs collected in practice, and 3) how we can algorithmically identify these sessions from browser activity, and how each algorithm performs. To answer these questions, a qualitative study was first conducted and a taxonomy of six factors related to user-identified sessions was generated. Then a Chrome Extension was built that captured user-identified sessions in practice, together with comprehensive sets of web logs including both user interaction and visit details. This helped to gather a ground truth dataset to support further evaluation. Finally, several algorithmic approaches to automatically clustering web activities into groups closer to user-identified sessions were evaluated. |
| first_indexed | 2025-11-14T20:38:03Z |
| format | Thesis (University of Nottingham only) |
| id | nottingham-59119 |
| institution | University of Nottingham Malaysia Campus |
| institution_category | Local University |
| language | English |
| last_indexed | 2025-11-14T20:38:03Z |
| publishDate | 2019 |
| recordtype | eprints |
| repository_type | Digital Repository |
| spelling | nottingham-59119 2025-02-28T14:39:37Z https://eprints.nottingham.ac.uk/59119/ Automatically identifying coherent web sessions from browser logs Ye, Chaoyu Due to the increasing diversity in both users’ behaviour and the types of web tasks performed, many studies in information retrieval (IR) are turning towards session-based retrieval rather than single URL-query pairs. However, extracting meaningful session data from raw, discrete logs is still a significant challenge. Most prior studies have been based on datasets where the logs of each user’s web history were simply divided by fixed periods of inactivity, such as 5, 15, or 30 minutes [52,31]. There have also been some attempts to go beyond these simplistic fixed timeouts [91], but rather than covering all web activities, these attempts focus on search-related activities only. Consequently, it is necessary to find a meaningful way to cluster all activities on a web browser, including both searching and browsing. The goal of this study is to find a better way to automatically segment users’ web activity into sessions. There are three research stages: 1) how people understand session segmentation in their mental models, 2) how these self-identified sessions look in web logs collected in practice, and 3) how we can algorithmically identify these sessions from browser activity, and how each algorithm performs. To answer these questions, a qualitative study was first conducted and a taxonomy of six factors related to user-identified sessions was generated. Then a Chrome Extension was built that captured user-identified sessions in practice, together with comprehensive sets of web logs including both user interaction and visit details. This helped to gather a ground truth dataset to support further evaluation. Finally, several algorithmic approaches to automatically clustering web activities into groups closer to user-identified sessions were evaluated. 
2019-12-12 Thesis (University of Nottingham only) NonPeerReviewed application/pdf en arr https://eprints.nottingham.ac.uk/59119/1/thesis_submitted_20190920.pdf Ye, Chaoyu (2019) Automatically identifying coherent web sessions from browser logs. PhD thesis, University of Nottingham. web sessions browsers internet information retrieval |
| spellingShingle | web sessions browsers internet information retrieval Ye, Chaoyu Automatically identifying coherent web sessions from browser logs |
| title | Automatically identifying coherent web sessions from browser logs |
| title_full | Automatically identifying coherent web sessions from browser logs |
| title_fullStr | Automatically identifying coherent web sessions from browser logs |
| title_full_unstemmed | Automatically identifying coherent web sessions from browser logs |
| title_short | Automatically identifying coherent web sessions from browser logs |
| title_sort | automatically identifying coherent web sessions from browser logs |
| topic | web sessions browsers internet information retrieval |
| url | https://eprints.nottingham.ac.uk/59119/ |
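
The description in this record refers to a common baseline in which a user's web history is divided by fixed periods of inactivity (5, 15, or 30 minutes). The following is a minimal Python sketch of that baseline only, not of the algorithms evaluated in the thesis. The function name `split_by_inactivity`, its parameters, and the sample timestamps are assumptions made for illustration and do not come from the thesis.

```python
from datetime import datetime, timedelta


def split_by_inactivity(timestamps, timeout_minutes=30):
    """Group visit timestamps into sessions using a fixed inactivity timeout.

    Hypothetical helper for illustration: a new session starts whenever the
    gap since the previous visit is longer than `timeout_minutes`.
    """
    timeout = timedelta(minutes=timeout_minutes)
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= timeout:
            # Gap is within the timeout: extend the current session.
            sessions[-1].append(ts)
        else:
            # First visit, or gap exceeds the timeout: open a new session.
            sessions.append([ts])
    return sessions


# Illustrative visit times (made up); gaps are 10, 40, and 2 minutes.
visits = [
    datetime(2019, 1, 1, 9, 0),
    datetime(2019, 1, 1, 9, 10),
    datetime(2019, 1, 1, 9, 50),
    datetime(2019, 1, 1, 9, 52),
]

for minutes in (5, 15, 30):
    print(minutes, len(split_by_inactivity(visits, minutes)))
```

With these made-up timestamps the same log yields a different number of sessions depending on the chosen timeout, which illustrates why a fixed threshold may not match the sessions users identify themselves.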