Introduction
A. Background
Recent advances in sensing methods have spurred extensive study of context recognition technologies such as activity recognition and indoor positioning in the IoT community. Many activity recognition studies employ body-worn sensors, including acceleration sensors, gyroscopes, cameras, and microphones, to recognize daily activities such as walking, running, and house cleaning [1]–[5]. Indoor positioning studies rely on signaling technologies, for example, infrared [6], ultrasound [7], active sound probing [8], [9], Bluetooth [10], and Wi-Fi [11], [12]. The recognized context information can be used in real-world services, e.g., context-aware systems, lifelogging, and the monitoring of the elderly [13]–[17].
With the recent proliferation of smart speakers such as Amazon Echo and Google Home, question answering (QA) by these smart devices is being woven into our daily lives. As mentioned above, our daily lives are already monitored by context recognition techniques such as activity recognition and indoor positioning. Based on the recognized and stored daily activity data, real-world question answering (real-world QA) has been investigated [18]. Real-world QA, which provides a more fine-grained understanding of our daily living than simply retrieving past events, offers many useful real-world applications for improving quality of life. For instance, answering questions such as “What did I eat last night?,” “Where is my smartphone?,” and “Did Mary take her medicine after eating?” serves as a memory aid, locates lost items, and monitors human activities.
Because real-world QA is assumed to output a linguistic description as an answer, which is read out by a smart speaker, recognized daily life events are stored as linguistic descriptions (a series of sentences describing real-world events); this facilitates answer generation from the stored sentences with state-of-the-art QA methods. For example, when an indoor positioning system detects that Mary’s indoor coordinates have changed from the living room to the kitchen, this event is converted into the following sentence: “Mary has moved from the living room to the kitchen.” Since real-world QA requires complex reasoning and the generation of diverse answers (e.g., numbers, “yes/no,” and a sequence of words) in response to various information needs in the real world, current real-world QA approaches mainly use neural network-based QA models that demonstrate high performance on many story-based QA tasks [19]–[21]. However, the performance of neural network-based models relies heavily on large training datasets [22], [23].
In real-world QA, question and answer pairs as well as sentences about daily stories collected in a target environment are required as a training dataset; 5,000 QA pairs and 1,500 sentences are required for each environment [18]. However, preparing a sufficiently large real-world QA dataset is costly and impractical. Preparing QA pairs is especially difficult because daily life events must be observed by someone to answer real-world questions, which also raises severe privacy concerns.
B. Approach
To address these problems, we propose using a life simulator to produce sufficient amounts of QA data for training neural QA models. With a life simulator, we can easily create realistic daily living environments, rather than building actual houses, and obtain diverse, realistic daily life stories. From these virtual daily life stories, a large virtual-world QA dataset can be compiled efficiently and without breaching privacy. Leveraging these advantages, we trained a neural QA model with this virtual-world QA dataset and solved real-world QA problems without real-world labeled data. We designate this proposed framework simulation-to-real QA (Sim2RealQA). Fig. 1 presents an example of our proposed framework. In this study, we use a life-simulator game (e.g., The Sims), which replicates a person’s life in a virtual world. Because a simulated person performs a variety of activities while interacting with common objects and other persons, replicating real-world daily life, we can generate enough information about daily events to acquire the reasoning rules of QA in normal daily lives without privacy issues.
Sim2RealQA example: We train models with a virtual-world QA dataset produced in a life simulator (left) to solve real-world QA problems (right).
We assume the existing activity recognition and indoor positioning methods studied in the IoT community and generate a sequence of sentences about daily stories in a real-world environment based on the expected outputs of these methods. We generate sentences by embedding the information acquired by the activity recognition and indoor positioning methods into a template, such as “Subject + Predicate + Object + Location.” Assume that the activity recognition system detects that David has started a “sleeping” activity. In addition, an indoor positioning system has tracked his position, and his location is estimated to be a bedroom. By referring to these predictions, “David slept in the bedroom” is generated for that event. In this study, we prepared templates that can decode the results of various types of activity recognition and indoor positioning methods. Based on these templates, we generated sequences of sentences about daily stories in real/virtual environments. We believe that generating sentences about daily life stories from sensor data has the following benefits. (i) We can design our real-world QA model based on state-of-the-art QA techniques in NLP studies, which assume linguistic descriptions as input. (ii) Because a question (and an answer read out by a smart speaker) is expressed in natural language, QA targets (i.e., events) that have the same modality as the question are easy to find. (iii) Since various activity recognition and indoor positioning systems are available, the output formats of such systems also vary. When outputs are stored in a relational database, for example, the columns of the tables depend on the activity recognition/indoor positioning systems. A process that handles information from these systems therefore depends greatly on the output formats (e.g., table structures) and must be handcrafted for each system, complicating the implementation of these processes in a neural network model. In contrast, our approach can deal with the outputs of any type of system once templates are prepared for each one.
In addition to sentences about real-world stories, we generate typical real-world questions based on the entities (e.g., object, place, and person) in a real/virtual environment, which are used to train/evaluate a QA model. We made various types of questions that can be used in real-world situations, including “Where is Sally?”, “What did Ann do before going to bed?”, “Who opened the refrigerator?”, and “How many times did Tom drink coffee?” The answers require different forms: “bathroom,” “used her smartphone,” “Ann, Sally, and Tom,” and “5”.
To train a QA model that is applicable for such real-world QA problems using a virtual-world QA dataset, we designed a QA model so that it learns a general reasoning process that can be used in any environment. Assume that the following time-series of sentences about daily stories (story events) are generated by activity recognition and indoor positioning systems:
Mary moved to the washroom from the entrance.
Tom drank coffee in the living room.
Mary washed her hands in the washroom.
Mary brushed her teeth in the washroom.
Tom moved to the kitchen from the living room.
Tom opened the fridge in the kitchen.
Given these events and a question such as “Where is Mary?”, our QA model learns a general reasoning process for real-world events that is independent of the environment. When sentences about daily stories and a question are given, our model first focuses on the sentences that relate to the question. Our model then focuses on the words in those sentences that might relate to the question, which are used to generate an answer. In the above example, our model first focuses on the sentences that include the word “Mary.” Because “Where is” in the question asks for the latest location of “Mary,” our model focuses on the last such sentence and the words in it related to the question. Since “Where” in the question specifies a place, our model focuses on the target word “washroom,” which is located after “in.” Moreover, our model can output the target word “washroom” even if it did not appear in the training data by copying it from the sentences about real-world stories into the answer. This reasoning process can be learned in any environment when sufficient training data are given, and it can be applied to any real target environment.
Note that, unlike standard question answering studied in the NLP community, which mainly focuses on choosing an answer from multiple choices or selecting an answer span in documents, real-world QA must generate diverse types of answers (e.g., numbers and a sequence of entity names) by processing a sequence of real-world events. Therefore, our method is equipped with an answer-generation module that efficiently generates an answer by synthesizing information about multiple important events (sentences that require high attention), which facilitates counting the number of occurrences of an event and enumerating the names of the entities related to the question.
As described above, our method focuses on critical sentences and words in given stories. We implement this idea by employing an attention mechanism that extracts important information from neural network inputs. We incorporate into our QA model event-level attention, which computes the weight (importance) of each event (sentence), as well as word-level attention, which computes each word’s weight. We then generate or copy an answer by employing words with high weights that appear in the events (sentences) with high weights. By using the word attention distribution over the input sequence, this approach can output answers that contain an entity even when that entity is not found in the virtual-world QA dataset.
C. Contributions
The following are the contributions of this study:
We introduced a novel Sim2RealQA framework that solves real-world QA problems with a QA model trained entirely on a virtual-world dataset.
To accelerate real-world QA study in the IoT community, we developed real-world and virtual-world QA datasets comprised of daily life stories collected from an actual house and simulated household environments. The dataset is available: https://miyatai.org/data/Sim2RealQA.zip.
We proposed a new real-world QA model tailored to Sim2RealQA. The proposed model can learn a general reasoning process of real-world QA independent of environments by leveraging an attention mechanism. In addition, the model is designed for real-world QA so that it can generate diverse types of answers about real-world events by synthesizing information about important events detected by the attention mechanism.
We evaluated our model using the real-world and virtual-world QA datasets. Our experimental results demonstrate that the Sim2RealQA framework with our model accurately solved real-world QA problems without real-world answer labels for training.
In the rest of this paper, we first review studies on sensor-based context recognition and question answering, then present our dataset construction method and our real-world QA model tailored to Sim2RealQA, and finally evaluate the model using data collected in real-world and virtual-world environments.
Related Work
We solve real-world QA problems with a QA model trained on virtual-world data. To tackle this cross-domain QA, we used a life simulator that generates virtual-world stories for constructing virtual-world QA datasets. In this section, we review sensor-based context recognition, existing language tasks in simulated worlds, simulation-to-real approaches, and cross-domain QA methods, and describe how they differ from our work.
A. Sensor-Based Context Recognition
In the IoT community, sensor-based context recognition methods have been widely studied, especially for context related to human daily activities and indoor positions. Context recognition methods can be roughly grouped into wearable sensing and environment augmentation. The former approach employs body-worn sensors such as accelerometers, microphones, cameras, and Wi-Fi receivers. The latter approach employs sensors embedded in an environment, e.g., cameras, microphones, switch sensors, RFID tags, and Wi-Fi transmitters/receivers. Activity recognition based on body-worn accelerometers recognizes simple activities such as walking, eating, drinking, and brushing teeth [3], [4], [24]–[26]. Activity recognition based on body-worn cameras recognizes complex activities that involve interactions with objects or other persons, such as eating, talking with someone, and reading [5], [27], [28]. Activity recognition based on object-attached sensors such as switch sensors and RFID tags also categorizes complex activities by sensing interactions with objects [29]–[32]. Several methods also detect a person’s activities as well as the objects or persons with which the person of interest is interacting.
Indoor positioning methods estimate the indoor coordinates of a signal receiver or a place class, e.g., toilet or kitchen, using wearable sensors [8], [33], [34]. Cameras installed in an environment can also be used for indoor positioning with a person identification technique [35]. Estimated indoor coordinates are usually converted to the name of a room when this information is provided to a user. We assume the above activity recognition and indoor positioning techniques for generating sentences about daily life stories.
B. Language Tasks in a Simulated World
To develop intelligent systems that perform language tasks in realistic environments, many studies have used simulators to train models for language tasks such as executing natural-language navigation instructions [36]–[39], answering questions about virtual-world situations with embodied agents in a simulated house [40]–[42], and generating the daily household activities of human-like agents [43], [44]. The methods presented in these earlier works performed the given tasks well because the simulations provide a sufficient amount of labeled data or rewards for training. However, they only solve language tasks within a simulator. Although our framework also uses simulations, its main purpose is to solve real-world language tasks with virtual-world datasets comprised of simulation data.
C. Simulation to Real
Modern machine-learning systems that use deep neural networks require large labeled training datasets to achieve superior performance. To execute real-world tasks that often lack labeled data, transferring machine-learning systems from simulations to the real world has been widely adopted, for example, navigating a robot to find a target object indoors [45], grasping various objects with robotic arms [46], learning to drive from a simulation [47], collision avoidance for drones [48], in-hand manipulation [49], agile locomotion for quadruped robots [50], and the semantic segmentation of actual driving video using a popular video game [51]. The results of these works indicate that simulation plays an important role in training machine-learning systems and improves real-world task performance. The method proposed and examined in this study also follows the simulation-to-real paradigm. Unlike earlier works, we specifically leverage simulators to address real-world QA problems.
D. Natural Answer Generation
Question answering aims to automatically answer questions about a given context (e.g., documents, knowledge bases, and multimedia data). There are several ways of answering questions: selecting the span corresponding to the answer from the document [52], [53], choosing one of multiple answer options [19], [54], or generating an answer [20]. To respond to various real-world information needs, the real-world QA task [18] takes the form of generating answers. As with real-world QA, most existing works use the encoder-decoder framework, which takes as input a question and a sequence of words in a given context and then generates answer words [55]–[57]. Methods that generate questions as well as answers have also been proposed to augment question and answer pairs and improve QA performance [58], [59]. However, these works mainly focus on a single domain for answer generation. In contrast, we take a cross-domain QA framework that trains the QA model with source domain data (virtual-world data) and generates answers with target domain data (real-world data). Another setting of natural answer generation is multi-turn QA (i.e., dialogue) [60], [61], which generates answer responses based on a given context and a history that includes past questions and answers. Our story-based QA can be viewed as a special case of single-turn dialogue that does not use any past questions and answers. Although this study’s main topic is to generalize a single-turn QA model trained with virtual-world data to real-world data, our story-based QA method can naturally be extended to multi-turn QA by using the QA history as input.
E. Cross-Domain Question Answering
For real-world QA tasks, making QA datasets by collecting real-world stories is costly and complicated by privacy issues. We address this difficulty with a cross-domain QA approach that applies QA models learned from a source QA dataset to target QA problems. Existing cross-domain QA methods operate under supervised, semi-supervised, and unsupervised conditions. Supervised methods use both labeled source and target datasets: they pre-train a QA model using the source dataset and fine-tune it on the target domain [62], [63]. Semi-supervised methods use an unlabeled source dataset and a labeled target dataset, exploiting the unlabeled source data to boost the performance of QA models by adapting them to the target domain [64], [65]. In contrast to the supervised and semi-supervised QA methods, we use an unsupervised approach because of the difficulty of obtaining labeled real-world data in the target domain. In the unsupervised QA scenario, we can use a source domain (virtual-world) dataset for training the models, but target domain (real-world) labels are unavailable. Earlier studies investigated cross-domain QA under unsupervised conditions and demonstrated its usefulness [66]–[69]. However, these works used standard reading-comprehension or QA datasets comprised of Wikipedia entries, web snippets, and newspaper articles. These documents are completely different from the daily living stories used for real-world QA. Moreover, these studies focused on choosing an answer from multiple choices or selecting an answer span in documents. In contrast, real-world QA tasks must generate numbers, “yes/no” answers, and sequences of words as answers based on the content of a given question to address various information needs in real-world situations. For these reasons, we created daily life stories with a life simulator and constructed a virtual-world QA dataset that resembles a target real-world QA dataset.
Generating Daily Life Stories and Constructing Dataset
In this section, we introduce how to create sentences about daily stories from the outputs of activity recognition and indoor positioning systems. After that, we describe the real-world and virtual-world QA datasets constructed in this study.
A. Generating Stories
In this study, we generate sentences related to (i) locomotion and (ii) activity. When an indoor positioning system detects that a person has moved to a room, we generate a sentence of the event based on the following template:
Template for locomotion
[person] moved to the [place.current] from the [place.previous].
Here, [person] is replaced by the name of the individual being tracked, [place.current] is replaced by the name of the current room, and [place.previous] is replaced by the name of the room from which the individual moved. The following is an example sentence generated from this template: “David moved to the toilet from the living room.”
As for daily activities, we assume activities that can be performed both by a single person and by multiple persons. When a single-person activity is detected by an activity recognition system, we generate a sentence of the event based on the following template:
Template for single-person activity
[person] [activity] [[activity.object]] in the [place.current].
Here, [person] is replaced by the name of the individual being monitored, and [activity] is replaced by the name of the detected activity. Because [[activity.object]] is an optional slot, it is replaced by an object’s name only if the system can detect the object used in the activity (e.g., using RFIDs or body-worn cameras). A preposition is inserted before the object name if necessary (e.g., “with”). Note that for some activities, the object being used can be determined automatically; for example, when toothbrushing is detected, the object is undoubtedly a toothbrush. [place.current] is replaced by the user’s current place as detected by the indoor positioning system. The following is an example sentence generated from this template: “Sheldon read a book in the living room.”
When a multi-person activity is detected by the activity recognition system, we generate a sentence of the event based on the following template:
Template for multi-person activity
[person] [activity] [[activity.object]] with [activity.person] in the [place.current].
Here, [activity.person] is replaced by the name(s) of a person detected as a member of the activity (e.g., using body-worn cameras or proximity sensors). The following is an example sentence generated from this template: “Sheldon talked with David in the living room.” We assume that the activity recognition and indoor positioning systems generate sentences based on the above templates.
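To make the template mechanism concrete, the following is a minimal Python sketch of how such sentences could be assembled from recognizer outputs. The dict-style event representation and the function names are illustrative assumptions, not the actual implementation used in this study.

def locomotion_sentence(event):
    # Locomotion template: "[person] moved to the [place.current] from the [place.previous]."
    return (f"{event['person']} moved to the {event['place_current']} "
            f"from the {event['place_previous']}.")

def activity_sentence(event):
    # Single-person template: "[person] [activity] [[activity.object]] in the [place.current]."
    # The multi-person template additionally inserts "with [activity.person]".
    parts = [event['person'], event['activity']]
    if event.get('activity_object'):              # optional object slot
        parts.append(event['activity_object'])
    if event.get('activity_person'):              # present only for multi-person activities
        parts.append(f"with {event['activity_person']}")
    parts.append(f"in the {event['place_current']}")
    return " ".join(parts) + "."

print(locomotion_sentence({'person': 'David', 'place_current': 'toilet',
                           'place_previous': 'living room'}))
# David moved to the toilet from the living room.
print(activity_sentence({'person': 'Sheldon', 'activity': 'talked',
                         'activity_person': 'David', 'place_current': 'living room'}))
# Sheldon talked with David in the living room.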
B. Dataset Construction
We assume the above context recognition systems and introduce the construction of real-world and virtual-world QA datasets.
1) Story Collection Method
We generated story events for our datasets by observing real/virtual environments depicted in Figs. 2 and 3.
Actual house environment for collecting real-world stories, where a person performs a variety of daily life activities at each location.
House environments in The Sims 4 for simulating daily life: Environment 1 (left) simulates a single-person household. Environment 2 (center) simulates a shared house. Environment 3 (right) simulates a nuclear family. We use artificial personages as described above for each environment to obtain daily life stories.
a: Real-World Stories
We attached a wearable camera to each subject to obtain stories in the real house environment depicted in Fig. 2. Annotators watched the captured videos and manually created sentences based on the templates described in Section III-A. We manually generated the story sentences because this study’s purpose is to construct precise QA datasets that are independent of sensor systems and to validate the effectiveness of our proposed Sim2RealQA framework.
In the experiment, we generally followed an earlier study of real-world QA tasks [18]. Real-world stories consist of sentences that describe various daily living activities in residential settings. To collect more diverse daily activities than in a laboratory setting, we used a semi-naturalistic collection protocol [1]. We attached a Tobii Pro wearable eye tracker to five subjects who repeatedly performed 20 daily activities ten times in six different places: bathroom, bedroom, entrance, kitchen, living room, and washroom. For example, a subject makes coffee in the kitchen, takes it to the living room, and drinks it while watching TV. During the data collection, we captured first-person videos of their daily activities and obtained ten real-world stories per person: 50 stories in total. Two annotators labeled these first-person videos with sequences of sentences that described what the subjects were doing, when, and where inside the house. Fig. 4 (top) shows examples of the annotated real-world stories. We obtained 7,369 story events (697 unique) about their daily activities. Each story has 147 ± 8 sentences and 1,338 ± 76 words on average.
Examples of real-world and virtual-world QA datasets using daily life stories collected in an actual house and a life simulator. Our QA datasets ask questions about daily life stories in both worlds.
b: Virtual-World Stories
For collecting virtual-world stories, we used a life-simulation game called The Sims, which replicates the life of an individual person in a virtual world. In contrast to recent simulators that imitate household environments [70]–[73], The Sims easily and automatically generates many human-life stories about virtual-world residents called Sims. In their respective environments, they make their own life choices based on the electrical appliances and furniture available in their house and on their physical and mental needs (e.g., hunger, companionship, hygiene, entertainment, health, and bodily functions), which are represented by internal parameters. For example, when a Sim’s hunger parameter decreases, she removes food from her refrigerator, moves to the dining room table, and starts eating. We can easily customize the room layout and include furniture and appliances such as beds, sofas, coffee-makers, and PCs in the house environment where the Sims live. With these advantages of a life simulator, we obtain more realistic and diverse daily living stories than from fictional stories [21], [54], [74], books [19], [20], and movie scripts [75], where detailed human activity logs are not recorded. Note that this work used The Sims due to its simplicity, but other life simulators (e.g., VirtualHome [43]) can also be used.
Because the real target environments and residents are actually unknown, we simulated daily activities by preparing three housing environments with typical household compositions using The Sims. Each environment has a kitchen as well as a dining room, a living room, a bathroom, and bedrooms with appropriate objects for daily life. Fig. 3 depicts the three environments used for the data collection. Using these house settings, we simulated the daily lives of ten family units comprised of 16 adults. With the life-simulation game, we collected 30 days of daily activities per family unit (i.e., 300 stories). Two annotators manually created story events by watching the recorded game-play videos. We obtained 54,770 story events (7,218 unique) about the daily activities. Each story has 183 ± 58 sentences and 1,568 ± 523 words on average. Fig. 4 (bottom) shows examples of the annotated virtual-world stories. Table 1 shows a list of the names of entities used for creating sentences about daily life stories in the real and virtual worlds.
2) QA Creation Method
For each story, we made question and answer pairs to construct QA datasets, i.e., question, answer, and story triplets. We made templates for 22 QA tasks related to real-world situations, following previous real-world QA work [18]. Table 2 shows the QA templates with which we generated questions for each task. First, we randomly select the positions where questions are inserted in each story for each person in both worlds and then generate a question using a question template and the events before the question’s position. We repeat this process until 20 questions per task are generated. Then, we generate answers for a given story with an oracle QA that can accurately answer all the questions for both the real-world and virtual-world QAs using the syntactic structure of the questions and stories. For example, given the story “Tom washed the plate in the kitchen. Tom moved to the living room from the kitchen.”, we insert the question template “Where is [person]?” after the first event. Then, we automatically fill in the question template according to the first event’s content and generate the question “Where is Tom?”. Finally, we create the answer “kitchen” with the oracle QA. When a second question is inserted after the second event, the same question “Where is Tom?” with a different answer, “living room,” will be generated. An oracle for generating answers was also used in earlier studies [18], [21], [76], [77]. Note that we only use the oracle QA to validate the concept of Sim2RealQA; such an oracle is not available in actual use. Due to the diversity of natural language questions, the oracle QA is not practical in real-world situations compared to learning-based approaches, which can learn such diversity from data. “No answer” tokens are used if a question has no answer. We used the same data format for all the tasks as the bAbI dataset [21]. We also show examples of real-world and virtual-world QAs in Fig. 4. The activities are identical in many cases, but the persons, objects, places, and daily life patterns in the two worlds sometimes differ. In particular, QA models must output answers containing unknown entities that appear in the real world but not in the virtual world (e.g., “Lisa” and “washroom” in Fig. 4) and generate such answers as numbers, “yes/no,” and several entities (e.g., “entrance, kitchen, living room” in Fig. 4) in response to various information needs in the real world. In fact, 28% of the answer words in the target domain (real world) do not appear in the source domain (virtual world). To exploit the simulation data and further improve real-world QA performance, we need to address these differences between the real and virtual worlds.
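As an illustration of this procedure, the following is a minimal sketch of QA-pair creation for a single question template, “Where is [person]?”. The event representation and the oracle shown here only track locations; the actual oracle in this study covers all 22 tasks, so this is a simplified, assumption-laden sketch.

import random

def oracle_where_is(events_so_far, person):
    # Return the last known location of `person`, or a "no answer" token.
    for event in reversed(events_so_far):
        if event['person'] == person:
            return event['place_current']
    return 'no answer'

def make_where_is_qa(story_events, person):
    # Insert the question at a random position and answer it with the oracle,
    # using only the events that precede the question.
    position = random.randrange(1, len(story_events) + 1)
    question = f"Where is {person}?"
    answer = oracle_where_is(story_events[:position], person)
    return {'position': position, 'question': question, 'answer': answer}

story = [
    {'person': 'Tom', 'place_current': 'kitchen'},      # Tom washed the plate in the kitchen.
    {'person': 'Tom', 'place_current': 'living room'},  # Tom moved to the living room from the kitchen.
]
print(make_where_is_qa(story, 'Tom'))  # answer is "kitchen" or "living room" depending on the position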
Real-World QA Model for Sim2RealQA
A. Overview
We introduce our real-world QA model for the Sim2RealQA framework, which is trained on the source QA examples (i.e., question, answer, and story triplets) of the virtual world and can output correct answers when the target story and question sets of the real world are given. To achieve this generalization to real-world QA problems, we address unknown entities that do not appear in the source QA examples but do appear in the target QA examples, while simultaneously capturing the relations over multiple events related to a given question.
The overall architecture of our QA model is shown in Fig. 5. The model mainly consists of five layers: (i) embedding, (ii) context, (iii) attention, (iv) matching, and (v) answer. Layers (i)-(iv) form a dynamic memory module inspired by dynamic memory networks [78], [79]. Layer (v) is a pointer-generator module inspired by pointer-generator networks [80]. For brevity, we refer to a sentence in a story as an event and to a sequence of sentences about a daily story as story events. The input of our model consists of events (a daily life story) and a question. First, the embedding layer extracts their word feature vectors (i.e., word embeddings). Second, the context layer takes the word embeddings as input, computes the sequential dependencies of the words in the question and each event, and outputs a question embedding and event embeddings. In addition, this layer takes the event embeddings as input and calculates story embeddings that capture the context of the story events. The attention layer takes the question and story embeddings as input and computes the event-level attention that represents an event’s importance for a given question. The matching layer aggregates the events weighted by the event-level attentions and outputs the matching embedding. Because the matching embedding is a vector that represents the association between a question and its relevant events, it is used for decoding answers. Finally, the answer layer outputs an answer sentence based on two types of word weights: the vocabulary distribution and the word attention distribution. The vocabulary distribution, a probability distribution over all the words in the vocabulary of the training dataset, is calculated from the hidden state of an RNN language model trained to predict the answer words. With this distribution, the model can generate answer words from the fixed vocabulary used in training. The word attention distribution, a probability distribution over the words in the input sequence, is calculated from the cumulative attentions on the input words when generating the answer words. By sampling words from this distribution, the model can output unseen entities as an answer when such entities are included in the input story. We extend this word attention distribution with the event-level attention that represents the relevance between a question and events, because events relevant to a question are likely to contain words suitable for the answer. Finally, the answer decoder recurrently generates a sequence of answer words by integrating the vocabulary distribution and the extended word attention distribution. We explain each component of our model in detail below.
Overall model architecture for solving real-world QA problems with Sim2RealQA framework. Given sentences of a daily life story and a question, the QA model outputs a corresponding answer sentence through inference by five layers.
B. Embedding Layer
The model’s input is the question and story events. First, we convert each word in the question and story events into vectors that represent the semantics of the words. For the vector representation of the words in the events and the question, we use the GloVe model trained on the Gigaword 5 + Wikipedia 2014 corpus [81]. This layer’s outputs are the question word embeddings and the event word embeddings.
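The following minimal sketch illustrates this lookup with a toy vocabulary and 4-dimensional vectors; the actual model uses pretrained GloVe vectors, and the fallback to a zero vector for unknown words is an assumption of this sketch.

import numpy as np

glove = {                                   # toy stand-in for the pretrained GloVe table
    "mary": np.array([0.1, 0.3, -0.2, 0.5]),
    "washroom": np.array([0.4, -0.1, 0.2, 0.0]),
}
unk = np.zeros(4)                           # unknown words map to a zero vector in this sketch

def embed(sentence):
    # Map each word of a question or event sentence to its embedding vector.
    return np.stack([glove.get(word.lower(), unk) for word in sentence.split()])

print(embed("Mary brushed her teeth in the washroom").shape)  # (7, 4): one vector per word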
C. Context Layer
This layer models a sequence of words (i.e., the question and each event) and a sequence of events (i.e., the story) using the question, event, and story encoders. To model the long-term dependencies in the input sequences, we encode the sequences of words and events using a bidirectional GRU (Bi-GRU) [82], [83], a special kind of recurrent neural network. The question encoder outputs the GRU’s hidden state after reading the question word embeddings, which serves as the question embedding; the event and story encoders are built analogously to produce the event embeddings and the story embeddings.
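The sketch below outlines how such encoders could be composed in PyTorch, assuming 128-dimensional word embeddings and 256-dimensional hidden states as in the experimental settings; the module and variable names are ours, and the details may differ from the authors’ implementation.

import torch
import torch.nn as nn

class ContextLayer(nn.Module):
    def __init__(self, emb_dim=128, hid_dim=256):
        super().__init__()
        # Bi-GRUs over words (question/event) and over events (story).
        self.question_gru = nn.GRU(emb_dim, hid_dim, batch_first=True, bidirectional=True)
        self.event_gru = nn.GRU(emb_dim, hid_dim, batch_first=True, bidirectional=True)
        self.story_gru = nn.GRU(2 * hid_dim, hid_dim, batch_first=True, bidirectional=True)

    def forward(self, question_emb, event_embs):
        # question_emb: (1, q_len, emb_dim); event_embs: (n_events, e_len, emb_dim)
        _, q_h = self.question_gru(question_emb)
        question = torch.cat([q_h[0], q_h[1]], dim=-1)       # question embedding (1, 2*hid_dim)
        _, e_h = self.event_gru(event_embs)
        events = torch.cat([e_h[0], e_h[1]], dim=-1)          # event embeddings (n_events, 2*hid_dim)
        story, _ = self.story_gru(events.unsqueeze(0))        # story embeddings (1, n_events, 2*hid_dim)
        return question, events, story.squeeze(0)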
D. Attention Layer
This layer computes how essential an event is to a given question using event-level attention, which is a useful clue for finding the events relevant to a user’s question in a long story. In addition, this information helps enumerate the names of entities in particular events related to a question. Using the story embeddings and the question embedding, the event-level attention β_k for the k-th event is computed as \begin{equation*} \beta _{k} = \mathrm {softmax}(W_\beta \mathrm {tanh}(W_{z} z_{k} + b_{z}) + b_\beta),\tag{1}\end{equation*} where z_k is an interaction feature computed from the k-th story embedding and the question embedding, and W_β, W_z, b_β, and b_z are learnable parameters.
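A minimal PyTorch sketch of Eq. (1) follows. It assumes the interaction feature z_k is simply the concatenation of the k-th story embedding and the question embedding, following dynamic memory networks; the exact composition of z_k in the authors’ model may differ.

import torch
import torch.nn as nn

class EventAttention(nn.Module):
    def __init__(self, story_dim=512, q_dim=512, hid_dim=256):
        super().__init__()
        self.W_z = nn.Linear(story_dim + q_dim, hid_dim)   # W_z, b_z in Eq. (1)
        self.W_beta = nn.Linear(hid_dim, 1)                 # W_beta, b_beta in Eq. (1)

    def forward(self, story, question):
        # story: (n_events, story_dim); question: (q_dim,)
        q = question.unsqueeze(0).expand(story.size(0), -1)
        z = torch.cat([story, q], dim=-1)                             # assumed interaction feature z_k
        scores = self.W_beta(torch.tanh(self.W_z(z))).squeeze(-1)     # one score per event
        return torch.softmax(scores, dim=0)                           # event-level attention beta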
E. Matching Layer
This layer encodes the story and question embeddings as a question-story matching embedding (matching embedding for short) with an attention-based encoder. First, the attention-based encoder sequentially aggregates the story embeddings weighted by the event-level attentions: \begin{equation*} c_{k} = (1-\beta _{k}) \circ c_{k-1} + \beta _{k} \circ {s}_{k},\tag{2}\end{equation*} where s_k is the k-th story embedding and c_k is the accumulated context after the k-th event.
The output of the encoder, c, is then combined with the question embedding q to produce the matching embedding: \begin{equation*} m = \mathrm {ReLU}(W_{m}[c; q]+b_{m}),\tag{3}\end{equation*} where W_m and b_m are learnable parameters.
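The following minimal sketch implements Eqs. (2) and (3): a gated, attention-weighted accumulation over story embeddings followed by fusion with the question embedding. Dimensions and module names are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MatchingLayer(nn.Module):
    def __init__(self, story_dim=512, q_dim=512, out_dim=512):
        super().__init__()
        self.W_m = nn.Linear(story_dim + q_dim, out_dim)    # W_m, b_m in Eq. (3)

    def forward(self, story, beta, question):
        # story: (n_events, story_dim); beta: (n_events,); question: (q_dim,)
        c = torch.zeros(story.size(1))
        for k in range(story.size(0)):                      # Eq. (2): gated accumulation over events
            c = (1 - beta[k]) * c + beta[k] * story[k]
        m = F.relu(self.W_m(torch.cat([c, question])))      # Eq. (3): matching embedding
        return m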
F. Answer Layer
This layer, which forms the pointer-generator module, generates a sequence of answer words based on the matching results between the question and the story using the pointer-generator decoder. Fig. 6 shows its overview. First, the RNN-based decoder, whose hidden state is initialized with the matching embedding, reads the previously generated answer word at each decoding step, as described below.
Overview of answer layer: the pointer-generator decoder sequentially generates answer words or copies them from the input in an autoregressive manner by integrating the vocabulary and extended word attention distributions. The extended word attention distribution reflects the question’s intention by updating the original word attention distribution with the event attention distribution.
1) Pointer-Generator Decoder
To predict a sequence of answer words, the pointer-generator decoder sequentially generates words, using previously generated words as additional input at each decoding step. At each decoding step t, the decoder computes the word-level attention over the input word encodings e_j from its hidden state h_t: \begin{align*} o_{j}^{t}=&v^{T} \mathrm {tanh}(W_{e} e_{j} + W_{h} h_{t} + b_{e}) \tag{4}\\ \alpha ^{t}=&\mathrm {softmax}(o^{t}),\tag{5}\end{align*} where v, W_e, W_h, and b_e are learnable parameters.
The vocabulary distribution is then calculated from the decoder hidden state h_t and the attention-weighted context vector u_t: \begin{equation*} P_{vocab} = \mathrm {softmax}(W_{1}(W_{2}[h_{t}; u_{t}] +b_{2})+b_{1}),\tag{6}\end{equation*} where W_1, W_2, b_1, and b_2 are learnable parameters. However, because P_vocab is defined only over the fixed training vocabulary, the decoder cannot output unknown entities that appear only in the real-world stories.
To address this problem, the pointer-generator decoder can also output words from the story by using the following cumulative word attention distribution: \begin{equation*} P_{copy} = \sum _{j:w_{j}=w} {\alpha }_{j}^{t},\tag{7}\end{equation*} which sums the attention weights of all input positions j whose word w_j equals w.
Finally, the decoder outputs the next answer word from the final distribution that mixes the two distributions: \begin{equation*} P_{final}(a) = g P_{vocab}(a) + (1-g) P_{copy}(a),\tag{8}\end{equation*}
where the generation probability g is computed from the decoder hidden state h_t, the context vector u_t, and the decoder input a_t (the embedding of the previously generated answer word): \begin{equation*} g = \sigma (w_{h}^{T} h_{t} + w_{u}^{T} u_{t} + w_{a}^{T} a_{t} + b_{g}),\tag{9}\end{equation*} where w_h, w_u, w_a, and b_g are learnable parameters and σ is the sigmoid function.
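For concreteness, the following is a minimal sketch of one pointer-generator decoding step covering Eqs. (4)-(9). The dimensions, module names, and the assumption that every input word maps into the fixed vocabulary (the extended-vocabulary handling of out-of-vocabulary copies is omitted) are ours, not the authors’.

import torch
import torch.nn as nn

class PointerGeneratorStep(nn.Module):
    def __init__(self, enc_dim=512, dec_dim=256, emb_dim=128, vocab_size=10000):
        super().__init__()
        self.W_e = nn.Linear(enc_dim, dec_dim)                  # W_e, b_e in Eq. (4)
        self.W_h = nn.Linear(dec_dim, dec_dim, bias=False)      # W_h in Eq. (4)
        self.v = nn.Linear(dec_dim, 1, bias=False)              # v in Eq. (4)
        self.W_2 = nn.Linear(dec_dim + enc_dim, dec_dim)        # W_2, b_2 in Eq. (6)
        self.W_1 = nn.Linear(dec_dim, vocab_size)               # W_1, b_1 in Eq. (6)
        self.gate = nn.Linear(dec_dim + enc_dim + emb_dim, 1)   # w_h, w_u, w_a, b_g in Eq. (9)

    def forward(self, word_enc, word_ids, h_t, a_t):
        # word_enc: (n_words, enc_dim) encodings e_j of the input story words;
        # word_ids: (n_words,) vocabulary ids; h_t: (dec_dim,); a_t: (emb_dim,)
        scores = self.v(torch.tanh(self.W_e(word_enc) + self.W_h(h_t))).squeeze(-1)
        alpha = torch.softmax(scores, dim=0)                                        # Eqs. (4)-(5)
        u_t = (alpha.unsqueeze(-1) * word_enc).sum(dim=0)                           # context vector
        p_vocab = torch.softmax(self.W_1(self.W_2(torch.cat([h_t, u_t]))), dim=0)   # Eq. (6)
        p_copy = torch.zeros(p_vocab.size(0)).scatter_add(0, word_ids, alpha)       # Eq. (7)
        g = torch.sigmoid(self.gate(torch.cat([h_t, u_t, a_t])))                    # Eq. (9)
        return g * p_vocab + (1 - g) * p_copy                                       # Eq. (8)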
Unfortunately, this pointer-generator mechanism only uses word-level attention for copying words; it ignores the relations between story events and a question, i.e., event-level attention. We assume the encoder’s event-level attention indicates the important events in which users are interested. Since the words in these events are worth copying, we integrate the word-level and event-level attentions for decoding.
2) Cross Attention
To identify suitable words for copying, we extend the word-level attention α_j with the event-level attention β_[j] of the event containing the j-th input word: \begin{equation*} {\alpha }'_{j} = \frac {\alpha _{j} \times \beta _{[j]}}{\sum _{i} \alpha _{i} \times \beta _{[i]}},\tag{10}\end{equation*} where [j] denotes the index of the event to which the j-th input word belongs. The extended attention α'_j replaces α_j when computing P_copy in Eq. (7).
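The cross attention of Eq. (10) amounts to rescaling each word-level attention weight by the event-level attention of the event that contains the word and renormalizing, as in this minimal sketch; the word-to-event index mapping is an assumption about the data layout.

import torch

def cross_attention(alpha, beta, word_to_event):
    # alpha: (n_words,) word-level attention; beta: (n_events,) event-level attention;
    # word_to_event: (n_words,) index of the event containing each word.
    combined = alpha * beta[word_to_event]
    return combined / combined.sum()        # Eq. (10)

alpha = torch.tensor([0.1, 0.4, 0.2, 0.3])
beta = torch.tensor([0.7, 0.3])
word_to_event = torch.tensor([0, 0, 1, 1])  # first two words belong to event 0
print(cross_attention(alpha, beta, word_to_event))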
Experiments
We evaluated our Sim2RealQA framework using real- and virtual-world QA datasets. First, we investigated how well our model performed in a Sim2RealQA setting and ascertained what components of QA models contribute to Sim2RealQA. Then we investigated the capability of Sim2RealQA with our model and baselines.
A. Experimental Setup
1) Methods
We conducted our empirical investigation using the following models, each of which has different modules. By comparing them, we can ascertain which components drive generalization when using Sim2RealQA:
RNN is a standard neural QA model based on sequence-to-sequence (Seq2Seq) [88] that encodes the story and a question and then decodes answer words.
RNN-AT is the Seq2Seq-based QA model, which uses a word-level attention mechanism [89], [90]. This mechanism considers the input context for predicting answer words at each decoding step in addition to the above RNN method.
RNN-PG uses the pointer-generator decoder [80] for predicting answers in addition to RNN-AT. This method uses the input word distribution, computed with word-level attention, to generate or copy words from the input in the decoding phase.
DMN uses the dynamic memory module to encode a story and a question with their mutual relevance using event-level attention; it is a special case of the dynamic memory network [78], [79]. The model uses the same decoder as RNN.
DMN-PG uses the pointer-generator decoder in addition to the above DMN. Moreover, this method uses the relevance between story events and a question (i.e., the event-level attention distribution) to find words worth copying in the events that are important for the question.
In addition to these baselines, we prepared a frequent answer baseline and question-only baselines to check the biases of the real-world and virtual-world QA datasets because question-only methods are competitive in some QA tasks [91], [92]. We prepared the following baselines:
Q-Prior uses the most popular answer per task described in Table 2. We used the frequent answers in the source domain (virtual-world) for predicting answers in the test phase.
RNN (Q), RNN-AT (Q), and RNN-PG (Q) are almost identical to the RNN, RNN-AT, and RNN-PG methods, but they use questions only for decoding answer words.
2) Parameter Settings
We trained all the methods with Adam [93] using a learning rate of 0.0001 and a batch size of 20 for at most 32 epochs. We used early stopping if the accuracy on the validation split in the source domain did not increase for 10 epochs. A null symbol was used to pad all input sequences to a fixed size, and the embedding of the null symbol was constrained to zero. We did not update the word vectors during training. For all the RNN encoders and decoders, we used a GRU [82] with a single hidden layer; the RNN encoders were bidirectional. For all the methods, we used 128-dimensional word embeddings and 256-dimensional hidden states and set the dropout [94] rate to 0.5.
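A minimal sketch of this training setup is given below; model, train_loader, and evaluate are placeholders for the reader’s own implementation, and the loss interface is an assumption.

import torch

def train(model, train_loader, evaluate, max_epochs=32, patience=10):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)    # Adam with learning rate 0.0001
    best_acc, epochs_without_improvement = 0.0, 0
    for epoch in range(max_epochs):                              # at most 32 epochs
        model.train()
        for batch in train_loader:                               # batches of 20 QA examples
            optimizer.zero_grad()
            loss = model(batch)                                  # assumed to return the training loss
            loss.backward()
            optimizer.step()
        val_acc = evaluate(model)                                # accuracy on the source-domain validation split
        if val_acc > best_acc:
            best_acc, epochs_without_improvement = val_acc, 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:           # early stopping after 10 stagnant epochs
                break
    return model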
3) Evaluation Settings
To assess our proposed framework, we compared the prepared methods on the real-world and virtual-world QA datasets. For training, we divided the virtual-world QA dataset into 169K, 21.1K, and 21.1K examples for the training/validation/test splits. For evaluation, we divided the real-world QA dataset into 17.6K, 2.2K, and 2.2K examples for the training/validation/test splits. For Sim2RealQA, we trained all the models using the virtual-world training data and evaluated them with the test data from the real world; in this case, the labels in the target domain (real world) were withheld. We evaluated all the methods under the ParlAI framework [95], using accuracy as the evaluation metric, which computes the exact match between the predicted answer words and the ground truth. We report QA performance averaged over five training runs with different initialization values.
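The exact-match accuracy can be computed as in the minimal sketch below; the whitespace and case normalization shown here is an assumption, since the text only states that predictions must exactly match the ground truth.

def exact_match_accuracy(predictions, references):
    # A prediction counts as correct only if its answer words match the ground truth exactly.
    correct = sum(p.strip().lower() == r.strip().lower()
                  for p, r in zip(predictions, references))
    return correct / len(references)

print(exact_match_accuracy(["kitchen", "living room"], ["kitchen", "bedroom"]))  # 0.5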
B. Results and Analysis
In this section, we evaluate the performance of the prepared methods across the two worlds and explore how well models trained on the virtual-world dataset generalize to the target real-world dataset. Models learned with virtual-world data that have high generalization ability should show high performance even when tested with real-world data.
1) What Factors Fuel Generalization in the Real World?
First, we investigated how well our model performed in a Sim2RealQA setting and ascertained what components improved Sim2RealQA’s performance. In Table 3 (bottom), we show the Sim2RealQA performance of the prepared methods over all the QA tasks on average. The learning-based approaches (RNN, RNN-AT, RNN-PG, DMN, and DMN-PG) significantly outperformed a simple frequency-based approach, Q-Prior, which uses the most popular answers in the source domain for each task. Moreover, these methods using both stories and questions for QA significantly outperformed RNN (Q), RNN-AT (Q), and RNN-PG (Q), all of which only use questions. These results indicate that training on a virtual-world QA dataset for reasoning over story events in response to a question effectively and accurately solves real-world QA problems. In addition, RNN-PG and DMN-PG significantly outperformed RNN, RNN-AT, and DMN, which do not use the pointer-generator decoder, suggesting that the pointer-generator mechanism further improves the Sim2RealQA performance. Note that our DMN-PG outperformed the other methods, indicating the benefit of integrating the word- and event-level attentions for the pointer-generator mechanism.
We also compare our model’s QA performance for each task to the others in Table 3. Across most tasks, RNN-PG and DMN-PG significantly outperformed the others. In particular, RNN-PG and DMN-PG succeeded in tasks 11, 12, and 13, which require answering questions about the names of people, whereas Q-Prior, RNN (Q), RNN-AT (Q), RNN-PG (Q), RNN, RNN-AT, and DMN failed because none of the real-world people appear in the virtual world (i.e., the training dataset). The models with the pointer-generator decoder extracted such unknown persons from the given real-world story events and produced them as answers, but the others could not. For example, task 11’s results in Fig. 7 (center left) show that DMN-PG and RNN-PG predicted the correct answer “David,” who only appears in the real world, but RNN could not. For task 13 in Fig. 7 (center right), RNN also predicted the incorrect answer “Howard,” who only appears in the virtual world. In addition, the models without the pointer-generator decoder failed to predict unknown places that do not appear in the virtual world, whereas both RNN-PG and DMN-PG output them as answers. For example, task 5’s results in Fig. 7 (top left) show that DMN-PG and RNN-PG predicted the correct answer “washroom,” which only exists in the real world; RNN incorrectly output “bathroom.” Both RNN-PG and DMN-PG also significantly outperformed Q-Prior, RNN (Q), RNN-AT (Q), RNN-PG (Q), RNN, RNN-AT, and DMN in task 17, which requires answers about activities that often differ between the two worlds; the models without the pointer-generator decoder failed to provide answers. For example, task 17’s results in Fig. 7 (bottom right) show that DMN-PG and RNN-PG predicted the correct answer “poured hot water into the cup,” which only took place in the real world, while RNN incorrectly output “close to the ice cream maker,” which took place only in the virtual world. These results provide compelling evidence that the pointer-generator mechanism is necessary for handling unknown entities caused by the gap between the virtual and real worlds. Moreover, the proposed DMN-PG outperformed RNN-PG in the difficult tasks 7, 8, 16, 18, and 19, where the models have to answer by referring to multiple events in a long story; unlike RNN-PG, DMN-PG can find multiple events relevant to a question when predicting answers. For example, task 8’s results in Fig. 7 (top right), which require understanding the context of the place, show that DMN-PG predicted the correct answer but RNN-PG did not, because RNN-PG lacks an event-level attention mechanism that helps find important clues to the correct answer in a long story. For task 16, which requires multiple events to answer questions, DMN-PG predicted the correct answer “bedroom, entrance, kitchen, living room,” while RNN-PG predicted an incorrect answer, “bedroom, kitchen, living room,” because it is more difficult for the word-level attention in RNN-PG to find and memorize answer words than for the event-level attention in DMN-PG. These findings suggest that the integration of the pointer-generator decoder and event-level attention is effective for further improving our Sim2RealQA framework.
2) How Well Can Models Trained With the Virtual-World Data be Generalized to the Real-World Data?
To assess the generalization of our proposed Sim2RealQA framework, we compared the methods that used only a virtual-world dataset for training to the same methods trained on the target real-world dataset. The train-on-target results represent ideal performance, but the target real-world answer labels are actually unavailable. By comparing the Sim2RealQA and ideal train-on-target performances, we can quantitatively investigate how well models trained with the virtual-world data generalize to the real-world data. The results are presented in Fig. 8 (left). Horizontal lines show the Sim2RealQA performance because it did not use any target examples for training. As we anticipated, the performance of the methods trained on the target improved with an increase in the proportion of real-world examples. Compared to the oracle methods trained with all the target examples, the methods using Sim2RealQA still have room for improvement. However, these oracle methods achieved lower accuracy when only a small amount of training data was available, because the performance of neural QA models relies heavily on large labeled training datasets. Note that all the RNN, RNN-AT, RNN-PG, DMN, and DMN-PG methods on the Sim2RealQA framework outperformed the oracle methods trained with a small real-world training dataset. The DMN-PG of Sim2RealQA significantly outperformed all the methods trained with 1,000 target examples, a number that we cannot realistically collect. This result indicates the effectiveness of the proposed Sim2RealQA framework in the absence of real-world answers. This finding is useful because making real-world QA datasets is extremely difficult and laborious due to privacy reasons.
Sim2RealQA performance of baselines and proposed model: Sim2RealQA performance and oracle supervised QA performance over a number of examples in real-world dataset used for training (left); Sim2RealQA performance over number of examples in virtual-world dataset used for training (center); learning curves and accuracies (right).
3) Does Generalization to Real-World Data Improve With More Virtual-World Data?
To validate the generalization ability of the proposed methods when the training set size increases, we explored the models’ performance with virtual-world QA training sets of several sizes. As shown in Fig. 8 (center), the RNN, RNN-AT, RNN-PG, DMN, and DMN-PG performances steadily improved as the virtual-world training data size increased. DMN-PG outperformed the other methods when a large number of examples was used for training. The results suggest that using a large virtual-world QA dataset is effective for more accurately solving real-world QA problems. This finding is fruitful because we can obtain diverse daily life stories from simulators and compile a large amount of virtual-world QA data without breaching privacy.
4) Do the Pointer-Generator Mechanism and Event-Level Attention Quickly Improve the Sim2RealQA Performance?
Next we studied how the number of training epochs affected the Sim2RealQA performance. Fig. 8 (right) shows the accuracy of RNN, RNN-AT, RNN-PG, DMN, and DMN-PG over {1, 2, 4, 8, 16, 32} epochs with the Sim2RealQA framework. As the number of training epochs increased, the performance of all the QA methods also improved. In addition, the models with a pointer-generator decoder (DMN-PG and RNN-PG) dramatically outperformed those without it (RNN, RNN-AT, and DMN), even with many fewer training epochs. DMN-PG outperformed the other methods over all the training epochs, indicating the pointer-generator mechanism’s effectiveness for quick learning. The combination of the pointer-generator decoder and relevance matching between the question and story events (i.e., event-level attention) improved generalization to the real-world data more quickly.
Conclusion
We proposed a novel simulation-to-real QA (Sim2RealQA) framework that trains a neural QA model with a large QA dataset produced in a life simulator and uses it to solve real-world QA problems. To evaluate our framework, we developed real-world and virtual-world QA datasets using an actual house and a pre-made life simulator. We validated our proposed approach with a neural QA model that can handle the different entities in the two worlds by combining the pointer-generator decoder with relevance matching between the question and story events. The experiments showed that our method accurately solved real-world QA problems with the aid of virtual-world QA datasets. Moreover, our model, which was trained entirely with the virtual-world QA dataset, significantly outperformed models trained with 1,000 examples in the target domain. In addition, the Sim2RealQA performance improved with an increasing number of examples from virtual-world QA datasets, which can be created while protecting privacy. Furthermore, Sim2RealQA’s quick learning was achieved by the integration of a pointer-generator mechanism and relevance matching (i.e., event-level attention). These findings indicate that using life simulations is a promising approach for solving real-world QA problems when no real-world answers are available.
In future work, we will refine our model to detect the temporal intention of a user’s question and answer questions about events that occurred at specific dates and times over a long period of daily life. In this study, we created textual questions with a template-based approach to purely investigate how models trained with virtual-world data generalize to real-world data in a low-noise setting. In more realistic situations, natural language questions are composed by a person seeking complicated real-world information, and there are multiple ways to say the same thing. Another direction of our future work is to create datasets that include more natural and complex questions. Such diverse questions will be beneficial for robust and accurate real-world QA.