Website Graph Generation
The first step in our dataset generation is to build a user-interaction model of each website. This model takes the form of a directed graph describing a state machine for the website. The root node of the graph is the website's main landing page. Each user click is represented by a directed edge that transitions the website to a new state. Because a click may generate network traffic, any traffic observed during the corresponding graph state transition is captured and associated with that edge.
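As an illustration only (the paper does not publish its data structures), a minimal sketch of such a graph might look like the following; the field names click_selector and pcap_file are assumptions, not the dataset's actual schema:

```python
# Hypothetical sketch of the user-interaction graph: nodes are DOM-state
# hashes, edges are clicks annotated with the network traffic captured
# while the transition took place.
from dataclasses import dataclass, field

@dataclass
class Edge:
    target: str           # hash of the DOM state reached by the click
    click_selector: str   # element that was clicked (assumed field name)
    pcap_file: str        # traffic captured during this transition (assumed field name)

@dataclass
class InteractionGraph:
    root: str                                   # hash of the landing-page DOM
    edges: dict = field(default_factory=dict)   # node hash -> list of Edge

    def add_transition(self, source: str, edge: Edge) -> None:
        self.edges.setdefault(source, []).append(edge)
```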
Each node in the graph is uniquely identified by the hash of the Document Object Model (DOM) state it represents. The current state of the DOM (including any modifications made by client-side JavaScript) is obtained by reading document.body.innerHTML from JavaScript. This value is saved to a file, and its hash is used to label the corresponding node in the user-interaction graph. Consequently, for any node in the graph, setting document.body.innerHTML to the value stored in the file whose name matches that node's hash (label) resets the browser's DOM to the state it had during dataset generation.
For debugging purposes, a PNG image of the virtual desktop is also captured during test set generation and labelled with the hash of the DOM to which it corresponds.
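The following sketch shows how the DOM capture, hashing, restore, and debug screenshot described above could be implemented; it assumes a Selenium-driven browser and SHA-256 as the hash, neither of which is specified by the paper:

```python
import hashlib
from selenium import webdriver

driver = webdriver.Firefox()   # assumed browser; any WebDriver would do
driver.get("https://www.reddit.com/")

def capture_dom_state(driver) -> str:
    """Save the current DOM (after client-side JavaScript) and return its hash label."""
    dom = driver.execute_script("return document.body.innerHTML;")
    label = hashlib.sha256(dom.encode("utf-8")).hexdigest()   # hash algorithm assumed
    with open(f"{label}.html", "w", encoding="utf-8") as f:
        f.write(dom)
    # Screenshot of the browser window; the paper captures the full virtual desktop.
    driver.save_screenshot(f"{label}.png")
    return label

def restore_dom_state(driver, label: str) -> None:
    """Reset the browser DOM to a previously captured state."""
    with open(f"{label}.html", encoding="utf-8") as f:
        dom = f.read()
    driver.execute_script("document.body.innerHTML = arguments[0];", dom)
```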
~
LESCISIN, Michael and MAHMOUD, Qusay H., 2018. Dataset for Web Traffic Security Analysis. In: IECON 2018 - 44th Annual Conference of the IEEE Industrial Electronics Society. October 2018. p. 2700–2705. DOI 10.1109/IECON.2018.8591589.
Patterns of network activity can reveal an abundance of information about the behaviour of an application. Research has shown that, despite the widespread use of network encryption protocols such as TLS and SSH, the confidentiality of network applications can often be violated through network traffic pattern analysis. Achieving sound mitigation of these information leaks while maintaining network usage efficiency remains an open research topic.
The goal of the research conducted in this paper is to provide network security researchers with a dataset of network traffic captured from a popular SSL/TLS-protected website, which we have chosen to be reddit.com, for the purpose of evaluating algorithms for attacking and defending against network-based side-channel information leaks. Our dataset is represented as a graph describing a crawl through the website. Every path from the root of the graph (the landing page) to any other connected node represents a sequence of user interactions with the website. By following a directed path through the graph, researchers can obtain probable sequences of user interactions with the website and the associated patterns of network traffic which these interactions generate.
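A hedged sketch of how a directed path through such a graph could be replayed to recover an interaction sequence and its traffic captures is shown below; it reuses the assumed InteractionGraph/Edge structures from the earlier sketch, which may differ from the dataset's actual schema:

```python
def traffic_along_path(graph: "InteractionGraph", path: list) -> list:
    """Return (click, pcap) pairs for a sequence of node hashes starting at the root."""
    captures = []
    for source, target in zip(path, path[1:]):
        # Find the edge for this transition and record its click and traffic capture.
        edge = next(e for e in graph.edges.get(source, []) if e.target == target)
        captures.append((edge.click_selector, edge.pcap_file))
    return captures
```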