LLQA - Lifelog Question Answering Dataset

Recollecting details from lifelog data involves a higher level of granularity and reasoning than a conventional lifelog retrieval task. Investigating the task of Question Answering (QA) in lifelog data could help in human memory recollection, as well as improve traditional lifelog retrieval systems. However, there has not yet been a standardised benchmark dataset for lifelog-based QA. In order to provide a first dataset and baseline benchmark for QA on lifelog data, we present a novel dataset, LLQA, which is an augmented 85-day lifelog collection and includes over 15,000 multiple-choice questions. We also provide different baselines for the evaluation of future work. The results show that lifelog QA is a challenging task that requires more exploration. The dataset is publicly available at https://github.com/allie-tran/LLQA.


Introduction
Lifelogging has gained popularity within the research community in recent years with the main focus on lifelog retrieval. The term lifelogging refers to the process of capturing a personal digital diary by technologies such as body cameras and various other wearable sensors. The most extensive published lifelog data, used in the Lifelog Search Challenge workshop 2020 [12], features a collection of first-person images captured throughout the day, as well as the corresponding metadata such as time, GPS coordinates, and biometrics data. Such lifelog data can be processed in lifelog systems, which can serve as a form of 'prosthetic' memory. Lifelogs can support users in memory-related activities such as recollecting, reminiscing, retrieving, reflecting, and remembering intentions, as defined by Sellen and Whittaker's five R's [24]. Out of the five R's, retrieving lifelog data, typically lifelog photos, has been the subject of the majority of lifelog research, as seen in various workshops [11,12,21]. Recollecting details in past lifelog data, on the other hand, involves a higher level of granularity and reasoning; for example, it might involve answering memory questions such as 'What did I do, where did I go, and who did I see on [Tuesday][afternoon], [July 14, 2018] ?'. Thus, it becomes clear that Question Answering (QA) is an important related topic for research and this paper introduces the first QA dataset for lifelogs.
QA systems are designed to automatically answer questions posed in natural language and are considered to be one of the ultimate goals for retrieval systems [27]. For instance, users may prefer getting concise answers to specific questions instead of browsing an entire document. The same argument could be made for other types of media such as photos and videos; Visual QA systems can save the user from extraneous effort by automatically inferring a user's question regarding an image/video and producing a short and accurate answer. To produce the correct answer, the model needs to be able to interpret the question and focus on the relevant part of the image/video. Due to advances in the field of computer vision, visual QA has been a fast-growing area with various techniques for images [1,8,14] and videos [7,15,17]. Applying such visual QA techniques to lifelogs suggests that lifelog QA can be a valuable and impactful research area, since lifelog data is heavily visual-based. Having the ability to understand the whole context of a real-world event, Lifelog QA systems ultimately could provide help in human memory recollection, as well as improve traditional lifelog retrieval systems.
Despite the similarities to visual QA, the data used in Lifelog QA has several distinct aspects that render the direct application of Visual QA techniques less effective. Image QA techniques do not exploit the temporal nature of lifelog data. In the case of Video QA, standard action recognition techniques such as C3D [15] may not be useful because lifelog data in the current generation of lifelog datasets is discontinuous (with an average frequency of 1 snapshot every 30 s). Moreover, current state-of-the-art video QA methods learn inference by relying on appearance and motion data from a third-person point of view, which differs from the first-person photos in lifelog data. The work most closely related to Lifelog QA is EgoVQA [7], an egocentric video question answering dataset containing first-person perspective videos, similar to lifelog photos. However, videos still have different characteristics from lifelog photos. For this reason, a novel benchmark dataset for Lifelog QA is a prerequisite for evaluating a model's ability to 'recollect' details in lifelog data.
In the field of lifelog QA, the novel dataset proposed in this paper supports the following research contributions: 1. Describing a new semi-automatic process for constructing a Lifelog QA dataset, based on an existing lifelog collection; 2. Providing 15,065 lifelog QA pairs, comprising both multiple-choice questions and yes/no questions; 3. Presenting results of a pilot experiment to identify the gap between the human gold standard and existing QA models.

Lifelogs and Personal Data Analytics
The inspiration for lifelogging dates back to Vannevar Bush's 1945 article As We May Think [3], which describes a blueprint for a personal information system that he called Memex. Bush considered Memex as 'a device in which an individual stores all his books, records, and communications, which is mechanised to be consulted with exceeding speed and flexibility. It is an enlarged intimate supplement to his memory'. However, it was not until Gordon Bell started MyLifeBits [10], a Microsoft Research project, in 2001 that lifelogging began to gain attention from the research community. The MyLifeBits system attempted to capture every possible aspect of Bell's daily life, including every web page visited, all Instant Message (IM) chat sessions, all telephone conversations, meetings, radio, and television programs, as well as all mouse and keyboard activities and media files on his personal computers. All digitised data were stored in a SQL database to support a simple interface for functionalities such as organising, associating metadata, assessing, and reporting information. Since then, due to advances in sensor technology and the availability of low-cost data storage, lifelogging has become an achievable activity for many. However, the primarily passive nature of lifelogging means that the amount of data generated can be massive (over 1 TB of multimodal data per individual per year), and therefore effectively searching through such extensive archives of lifelog data remains an important yet challenging task. Different lifelog benchmarking workshops/challenges have been established with distinctive evaluation metrics to assess lifelog systems, with the common objective being to facilitate the effective retrieval of specific lifelog images in an interactive or automatic manner.
The standard approach taken by existing lifelog retrieval systems, such as MyScéal [26] and LifeSeeker [20], is assigning semantic context, e.g., visual concepts, to lifelog photos and applying traditional information retrieval techniques to produce a ranked list of relevant images. This approach treats each lifelog photo individually, which does not exploit the temporal and continuous nature of lifelog data. This is important because an individual snapshot of lifelog data is likely to not fully convey the whole context of an event [13].
There have been a number of lifelog datasets released since 2015. At the most recent lifelog retrieval workshop, LSC'20 [12], the organisers published a large collection including six months of anonymised lifelog data, consisting of 50 GB of fully redacted wearable camera images at 1024 × 768 resolution, captured using OMG Autographer and Narrative Clip devices. These images were collected during periods in 2015, 2016, and 2018; private information appearing in these images (for example, faces and identifiable materials) is anonymised in a manual or semi-manual process. The metadata for the collection consists of textual metadata representing time, physical activities, biometrics, and locations, as well as visual concepts extracted from the non-redacted version of the visual dataset using a CAFFE CNN-based object detector [16]. This dataset forms the basis of the dataset augmented and released in this paper.

Video Question Answering (Video QA)
Video QA, an application of QA, is a task requiring the generation of correct answers to given questions related to a video or video archive. The questions take the form of fill-in-the-blank, multiple-choice, or open-ended types.
All existing Video QA datasets except for EgoVQA [7] are from a third-person perspective. TGIF-QA [15] is a dataset of over 165,000 questions on 71,741 animated pictures. Multiple tasks are formulated upon this dataset, including counting the repetitions of the queried action, detecting the transitions of two actions, and image-based QA. MSVD-QA and MSRVTT-QA [28] are two datasets with third-person videos. The Video QA tasks formulated in both of these datasets are open-ended questions of the types what, who, how, when, and where, and their answer sets are of size 1000. YouTube2Text-QA [29] is a dataset with both open-ended and multiple-choice tasks of three major question types (what, who, and other). TVQA and TVQA+ [17,18] are built on 21,793 video clips of 6 popular TV shows with 152.5K human-written QA pairs. EgoVQA [7] was proposed due to the lack of first-person point-of-view videos in these datasets; however, the dataset is small, with just over 600 QA pairs.
After a comprehensive review of research on video QA, we observe that there are three unique characteristics of Lifelog QA compared with Video QA: (1) lifelog QA deals with more channels of information because of the inherent multimodality of lifelog data; (2) the collected activities in lifelogs are captured in snapshots instead of being continuous, rendering motion features ineffective; (3) unlike most video QA datasets, the point of view in lifelog visual data is first-person instead of third-person. The existing approaches and datasets for visual QA are therefore not representative of the challenge posed by lifelog QA. It is necessary to investigate Lifelog QA in more detail, which is the primary motivation for this research.

LLQA - A Lifelog Question Answering Dataset
We define Lifelog Question Answering (Lifelog QA) as a task to produce correct answers to given textual representations of an individual's information needs concerning a past moment or experience from a lifelogger's daily life. In the scope of this initial research, we will consider only multiple-choice questions and yes/no questions due to the straightforward means of evaluation. It is anticipated that other types of answers will be explored at a later point.
This section explains in detail how the first lifelog QA dataset was built. This process is part of our contribution to the field of lifelog QA. To save time and effort, we applied automated steps where possible. The pipeline of the entire process is summarised in Fig. 1, and each component is described below.

Data Collection
The lifelog QA dataset for this work is based on 85 days of the LSC'20 collection [12], of which 59 days were selected from 2016. Each day is segmented into short events based on the locations and activities of the lifelogger, following the event segmentation approach of Doherty and Smeaton [6]. This encourages the annotators to focus on individual events. Using the metadata provided throughout the day, a new segment is created whenever the location (work, home, etc.) or the activity (walking, driving, etc.) changes. The process results in a total of 2,412 segments.
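As a minimal sketch of the segmentation rule described above, the following Python snippet starts a new segment whenever the location or activity label changes between consecutive metadata records. The Record class, its field names, and the sample values are hypothetical and do not reflect the actual LSC'20 metadata schema; the full approach of Doherty and Smeaton [6] is not reproduced here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Record:
    """One metadata row for a lifelog image (field names are hypothetical)."""
    timestamp: str
    location: str
    activity: str


def segment_day(records: List[Record]) -> List[List[Record]]:
    """Start a new segment whenever the location or the activity changes."""
    segments: List[List[Record]] = []
    for record in records:
        if segments:
            last = segments[-1][-1]
            if (last.location, last.activity) == (record.location, record.activity):
                # Same context as the previous image: extend the current segment.
                segments[-1].append(record)
                continue
        # First record of the day, or the context changed: open a new segment.
        segments.append([record])
    return segments


# Illustrative records only; these values are made up.
day = [
    Record("08:00", "home", "walking"),
    Record("08:01", "home", "walking"),
    Record("08:02", "home", "driving"),
    Record("08:30", "work", "driving"),
    Record("08:31", "work", "walking"),
]
segments = segment_day(day)
print([len(s) for s in segments])  # [2, 1, 1, 1]
```

The sketch assumes the records of one day are already sorted by time, so that a single pass over the day is enough.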
An annotation system was developed that presents annotators with all images in each segment along with the metadata such as time, GPS location, and the relative position of the segment in the whole day. Annotators, who are volunteers from undergraduate Computer Science programmes, were asked to describe the events happening in each segment as seen in Fig. 2. Every description is annotated along with its starting and ending times.
The description should include actions or activities; objects that the lifelogger interacted with, along with their properties such as size, shape, or colour; the location that the lifelogger was in, heading towards, or moving away from; and people (with a general identity description to preserve privacy). One example could be 'The lifelogger is reading a book in a cafe with a person in a black t-shirt.'

Generation of Question and Answers
The descriptions were converted to a list of questions by an automatic system, which is summarised in Fig. 3. Entity extraction and syntax transformation were completed using hand-crafted rules based on POS tags and semantic role labels. To generate question words (who, what, where, etc.), a Seq2Seq neural network was trained on the questions and answers in the CoQA [23] dataset. False answers (distractors) are generated using RACE [9], with knowledge gathered from ConceptNet [25] facts as context.

[Fig. 3: stages of the generation pipeline: entities extraction, syntax transformation, Wh-word generation, distractor generation]

Given the description 'The lifelogger was reading a book in a cafe.', the generation process would be as follows:
1. Entities extraction: The lifelogger, reading a book, and in a cafe are examples of entities in the sentence. We choose reading a book to illustrate the remaining steps, so the correct answer to the generated question-answer pair is reading a book.
2. Syntax transformation (yes/no): By moving was to the beginning of the sentence, we get 'Was the lifelogger reading a book in the cafe?' - 'Yes' as a yes/no question-answer pair.
3. Syntax transformation (multiple-choice): Based on the POS tags, an automated process first decides that the entity is a verb phrase. Replacing it with doing in the sentence and applying a rule-based syntax transformation gives '[...] was the lifelogger doing in the cafe?'
4. Wh-word generation: Since questions in this dataset start with a Wh-word, a pretrained S2S model chooses an appropriate question word for this question. In this case, a sensible one would be What.
5. Distractor generation: So far, we have the question-answer pair 'What was the lifelogger doing in the cafe?' - 'Reading a book'. To make this a multiple-choice question, we use RACE [9], a distractor generator for reading comprehension questions, and obtain the wrong answers 'Using his phone', 'Drinking coffee', and 'Playing football'.
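The two syntax transformations in the walkthrough above can be illustrated with a small Python sketch. It is a simplification under stated assumptions: a fixed set of auxiliaries stands in for the POS tags and semantic role labels used in the paper, the entity to replace and the Wh-word are passed in by hand rather than produced by the extraction rules and the S2S model, and the function names are hypothetical. The sketch also leaves articles untouched, so it yields 'in a cafe' where the example above reads 'in the cafe'.

```python
# Assumption: a tiny auxiliary list replaces real POS tagging.
AUXILIARIES = {"is", "was", "are", "were"}


def to_yes_no_question(description: str) -> str:
    """Move the first auxiliary verb to the front of a declarative sentence."""
    words = description.rstrip(".").split()
    for i, word in enumerate(words):
        if word.lower() in AUXILIARIES:
            rest = words[:i] + words[i + 1:]
            # Lowercase the old sentence-initial word (e.g. "The" -> "the").
            rest[0] = rest[0][0].lower() + rest[0][1:]
            return f"{word.capitalize()} {' '.join(rest)}?"
    raise ValueError("no auxiliary verb found")


def to_wh_question(description: str, entity: str, wh_word: str) -> str:
    """Replace the chosen verb-phrase entity with 'doing', then prepend the Wh-word."""
    replaced = description.replace(entity, "doing", 1)
    yes_no = to_yes_no_question(replaced)
    return f"{wh_word} {yes_no[0].lower()}{yes_no[1:]}"


description = "The lifelogger was reading a book in a cafe."
print(to_yes_no_question(description))
# Was the lifelogger reading a book in a cafe?
print(to_wh_question(description, "reading a book", "What"))
# What was the lifelogger doing in a cafe?
```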

Review
The generated questions and answers are reviewed by the annotators, who correct the semantics, delete duplicates, and enforce constraints such as: 1. There are no duplicate answers for the same question; 2. The numbers of yes and no answers are balanced. As the automatic syntax transformation could only generate positive yes/no questions, the annotators are asked to create negative ones manually.
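The two review constraints lend themselves to a simple automated check that could support the annotators. The Python sketch below is only an illustration: the function names and the normalisation (case-insensitive, whitespace-trimmed comparison) are assumptions, and the paper does not state that such a script was used.

```python
from collections import Counter
from typing import List


def has_duplicate_options(options: List[str]) -> bool:
    """True if two answer options are identical after trimming and lowercasing."""
    normalised = [option.strip().lower() for option in options]
    return len(set(normalised)) != len(normalised)


def yes_ratio(yes_no_answers: List[str]) -> float:
    """Fraction of 'yes' answers among all yes/no answers (0.5 means balanced)."""
    counts = Counter(answer.strip().lower() for answer in yes_no_answers)
    total = counts["yes"] + counts["no"]
    return counts["yes"] / total if total else 0.0


options = ["Reading a book", "Using his phone", "Drinking coffee", "Playing football"]
print(has_duplicate_options(options))          # False
print(yes_ratio(["Yes", "No", "yes", "no"]))   # 0.5
```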
The dataset contains 15,065 QA pairs in total. Examples of the QA pairs can be seen in Fig. 4. On average, our questions contain 7.66 words. Correct answers tend to contain 3.57 words compared to 4.34 words in the generated wrong answers. Figure 5 and Table 1 present the breakdown of questions generated. The dataset is split into two sets: training and testing sets consisting of 10,668 (70.81%) and 4,397 (29.19%) question-answer pairs, respectively. The splitting was done in a manner that ensures there are no overlapping days between the subsets, or in other words, the lifelog data in the testing set are unseen.
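A day-level split of this kind can be sketched in Python as follows. The snippet groups QA pairs by day and assigns whole days to either subset, so no day appears in both. The "day" key, the test fraction, the random selection of days, and the sample dates are all assumptions made for illustration; the paper does not describe how the test days were chosen.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

QAPair = Dict[str, str]


def split_by_day(
    qa_pairs: List[QAPair], test_fraction: float = 0.3, seed: int = 0
) -> Tuple[List[QAPair], List[QAPair]]:
    """Assign whole days to train or test so the two subsets share no day."""
    by_day = defaultdict(list)
    for pair in qa_pairs:
        by_day[pair["day"]].append(pair)

    days = sorted(by_day)
    random.Random(seed).shuffle(days)  # seeded for reproducibility
    n_test = max(1, round(len(days) * test_fraction))
    test_days = set(days[:n_test])

    train = [p for d in days if d not in test_days for p in by_day[d]]
    test = [p for d in days if d in test_days for p in by_day[d]]
    return train, test


# Hypothetical data: 10 days with 3 questions each.
qa_pairs = [
    {"day": f"2016-08-{d:02d}", "question": f"q{i}"}
    for d in range(1, 11)
    for i in range(3)
]
train, test = split_by_day(qa_pairs)
train_days = {p["day"] for p in train}
test_days = {p["day"] for p in test}
print(len(train), len(test), train_days & test_days)
```

Because the split is made over days rather than over questions, the resulting proportions of QA pairs depend on how many questions each day contributes.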

Pilot Experiment
In order to evaluate the dataset and provide accompanying baselines for subsequent comparison, a pilot experiment has been carried out on several baselines, which are described below.

Human Gold-Standard Baseline
To determine the target performance (in terms of accuracy) on our dataset, we performed a user study, asking different groups of 10 volunteer students to complete the question-answering task. Each volunteer was asked to answer 20 yes/no questions and 20 multiple-choice questions chosen randomly from the testing set. Each question was accompanied by the relevant images. To avoid bias, there was no overlap between the annotators who worked on the questions and the students participating in this study. The gold standard accuracy was found to be 0.8417 for yes/no questions and 0.8625 for multiple-choice questions. The scores are less than 1.0 because the volunteers were presented with the relevant section for the question, rather than the lifelog data for the whole day, so in some cases they did not fully understand the context of the event mentioned in the question. Another interesting piece of feedback from the participants, as well as the annotators, is that the volume of lifelog data causes issues in understanding. This is a common problem in lifelog analytics when decisions regarding lifelog data are made by a third party rather than by the lifelogger who gathered the data, for example, as seen in the studies carried out by Byrne et al. [4].

Question-Only
We implement several heuristic baselines that use only the questions and their candidate answers, in a similar approach to Castro et al. [5]. Specifically, Longest answer and Shortest answer choose the option, out of the four, with the most or the fewest tokens, respectively. Word matching chooses the answer that has the most tokens in common with the question. Because yes/no answers differ neither in length nor in the number of words shared with the question, we omit these heuristics for yes/no questions. Moreover, we implement a sequence-to-sequence (S2S) model based on the architecture of UniLMv2 [2], the state-of-the-art model in natural language understanding and generation tasks. We trained S2S on the CoQA [23] question-answer pairs. It encodes the question with a 2-layer LSTM, then encodes the candidate answers and assigns a score to each one. The text is tokenised and represented using GloVe 300-D embeddings [22].
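The three heuristic baselines are simple enough to sketch directly. The Python snippet below is an illustrative reading of the description above, not the authors' code: the whitespace tokeniser, the tie-breaking rule (the first option wins), and the answer options in the example are assumptions.

```python
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase whitespace tokeniser that drops question marks (an assumption)."""
    return text.lower().replace("?", " ").split()


def longest_answer(question: str, options: List[str]) -> str:
    """Pick the option with the most tokens; ties go to the first option."""
    return max(options, key=lambda option: len(tokenize(option)))


def shortest_answer(question: str, options: List[str]) -> str:
    """Pick the option with the fewest tokens; ties go to the first option."""
    return min(options, key=lambda option: len(tokenize(option)))


def word_matching(question: str, options: List[str]) -> str:
    """Pick the option sharing the most distinct tokens with the question."""
    question_tokens = set(tokenize(question))
    return max(options, key=lambda option: len(question_tokens & set(tokenize(option))))


question = "What was the lifelogger doing in the cafe?"
# Hypothetical options chosen so that the heuristics disagree.
options = ["Reading a book", "Drinking coffee in the cafe", "Running"]
print(longest_answer(question, options))   # Drinking coffee in the cafe
print(shortest_answer(question, options))  # Running
print(word_matching(question, options))    # Drinking coffee in the cafe
```

None of these heuristics looks at the images or metadata, which is why they serve as a check on how much the answer options alone give away.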

Question and Vision
Because of the similarity to the Video QA task, we implemented TVQA, the original TVQA [17] model, trained on the TVQA dataset. This is the state-of-the-art system in Video QA. To evaluate its application to lifelog data, we consider each day to be a 1 fps video, with each image (along with the attached metadata) as a single frame in that video. We converted the annotated starting and ending times into the ordinal indices of the frames in the video. Moreover, we replaced the subtitles intended for videos with a concatenation of the metadata associated with the frames. While it may seem strange to treat visual lifelog data as motion video, it is temporal in nature, and many of the participants in the LSC challenge [12] have modified existing video search systems from the VBS challenge [19] to treat lifelog data as 1 fps video.
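The conversion from annotated times to ordinal frame indices can be sketched as a binary search over the sorted capture times of a day's images. The Python snippet below is one possible reading of that step; the function name, the inclusive treatment of both boundaries, and the sample timestamps are assumptions rather than details taken from the paper.

```python
from bisect import bisect_left, bisect_right
from datetime import datetime, timedelta
from typing import List, Optional, Tuple


def time_span_to_frame_span(
    frame_times: List[datetime], start: datetime, end: datetime
) -> Optional[Tuple[int, int]]:
    """Return (first, last) indices of frames with start <= time <= end.

    frame_times must be sorted; returns None if no frame falls in the span.
    """
    first = bisect_left(frame_times, start)
    last = bisect_right(frame_times, end) - 1
    if first > last:
        return None
    return first, last


# Hypothetical day: one image every 30 seconds starting at 09:00:00.
base = datetime(2016, 8, 15, 9, 0, 0)
frame_times = [base + timedelta(seconds=30 * i) for i in range(5)]

span = time_span_to_frame_span(
    frame_times,
    start=base + timedelta(seconds=20),
    end=base + timedelta(seconds=90),
)
print(span)  # (1, 3)
```

Since each image is treated as one frame of a 1 fps video, the returned indices can be used directly as the start and end positions of the annotated moment within that video.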

Results
Both the S2S and TVQA models were retrained on the training set of the lifelog QA dataset and achieved a small improvement in accuracy compared to the versions that were not retrained, as seen in Table 2. Furthermore, there is no considerable difference between the question-only models. Although the average length of the correct answers is shorter than that of the wrong ones, Shortest answer performed worst, with an accuracy of 0.1717 for multiple-choice questions. Amongst the models, the retrained TVQA achieved the best performance, with accuracies of 0.6338 and 0.6136 for yes/no questions and multiple-choice questions, respectively. However, humans still significantly outperformed the models.
The results highlight that the existing approaches are still far from the human gold standard for the lifelog QA task, so they should be optimised to improve performance. This is a promising topic for future research in the lifelog domain in general, and in lifelog QA in particular.

Conclusion
In this work, we introduced Lifelog QA, a question answering dataset for lifelog data. The dataset consists of over 15,000 yes/no questions and multiple-choice questions. Through several baseline experiments, we assessed the suitability of the dataset for the task of lifelog QA. We note that there is still a significant gap between the proposed baselines and human performance in QA accuracy, meaning that there is a significant research challenge to be addressed. Our findings suggest that a large proportion of the dataset involves the lifelogger's actions or interactions with other objects; it is therefore crucial to improve the standard action recognition mechanism. One possible approach is to sample video frames at a lower rate, similar to lifelog data, and develop models based on this. Furthermore, we could develop dedicated sequences of features for other metadata instead of using the existing textual subtitle stream as in the TVQA model. Additionally, temporal reasoning is essential to this task, especially for questions containing before or after actions. These three points can be integrated in future work to improve the semantic understanding of lifelog data. The dataset is published at https://github.com/allie-tran/LLQA. We also include the annotated descriptions with timestamps, which can be used to develop models for lifelog captioning tasks. By creating this dataset, we hope to encourage more researchers to participate in and explore this research area further.
18/CRT/6224. For the purpose of Open Access, the author has applied a CC BY public copyright licence to any Author Accepted Manuscript version arising from this submission.