Learning Physical Common Sense as Knowledge Graph Completion via BERT Data Augmentation and Constrained Tucker Factorization

Physical common sense plays an essential role in the cognitive abilities of robots for human-robot interaction. Machine learning methods have shown promising results on physical commonsense learning in natural language processing but still suffer from poor model generalization. In this paper, we formulate physical commonsense learning as a knowledge graph completion problem to better use the latent relationships among training samples. Compared with completing general knowledge graphs, completing a physical commonsense knowledge graph has three unique characteristics: training data are scarce, not all facts can be mined from existing texts, and the number of relationships is small. To deal with these problems, we first use the pre-trained language model BERT to augment training data, and then employ constrained Tucker factorization to model complex relationships by constraining types and adding negative relationships. We compare our method with existing state-of-the-art knowledge graph embedding methods and show its superior performance.


Introduction
Physical common sense means understanding the physical properties of objects and how they can be manipulated (Forbes et al., 2019). Empowering natural language processing (NLP) methods with physical common sense is important when dealing with tasks that are related to the physical world, such as physical commonsense reasoning (Bisk et al., 2020), grounded verb semantics (She and Chai, 2017), and the more general human-robot interaction problem.
There are currently three main approaches to learning physical common sense: manual annotation, text mining, and machine learning. Manual annotation is difficult for human annotators due to inconsistent perceptions and the challenge of enumerating all physical facts. Mining text data is also challenging because some physical facts are not written in texts explicitly. Machine learning is a promising method to discover new physical facts using existing data. Forbes et al. (2019) formulate physical commonsense learning as three separate machine learning tasks: 1) given an object and a property, predicting whether they follow an object-property (OP) relationship, e.g., an apple is edible; 2) given an object and an affordance, predicting whether they follow an object-affordance (OA) relationship, e.g., he drove the car; and 3) given an affordance and a property, predicting whether they follow an affordance-property (AP) relationship, e.g., if you can eat something, then it is edible. However, it is difficult for a machine learning model to generalize through the use of the latent relationships among samples. For example, even if we have a training sample an apple is edible, it is hard to say that the trained model can generalize to predict a testing sample an apple is red correctly.
In this paper, we propose to model physical commonsense learning as a knowledge graph completion problem to better use the latent relationships among samples. A knowledge graph can be represented as a 3-way binary tensor in which each entry corresponds to a triple (e_h, r, e_t) (Nickel et al., 2016; Wang et al., 2017), where e_h denotes the head entity, e_t denotes the tail entity, and r denotes the relationship between them; (e_h, r, e_t) = 1 denotes that the fact is true in the training data, and (e_h, r, e_t) = 0 denotes that the fact does not exist or is false in the training data. The goal of knowledge graph completion is to predict the real value of (e_h, r, e_t) when it is missing or its label is wrong in the training data. In terms of physical common sense, entities come from the set of all objects, properties, and affordances, and relationships come from the set {OP, OA, AP}.
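As a concrete illustration, the tensor formulation above can be sketched in a few lines of NumPy; the toy vocabulary and facts here are hypothetical and only show how triples index into the binary tensor:

```python
import numpy as np

# Hypothetical toy vocabulary: one object, one property, one affordance.
entities = ["apple", "edible", "eat"]
relations = ["OP", "OA", "AP"]
e_idx = {e: i for i, e in enumerate(entities)}
r_idx = {r: i for i, r in enumerate(relations)}

n_e, n_r = len(entities), len(relations)
# 3-way binary tensor X: entry (i, j, k) = 1 means head i and tail j
# follow relationship k in the training data.
X = np.zeros((n_e, n_e, n_r), dtype=np.int8)

facts = [("apple", "OP", "edible"), ("eat", "AP", "edible")]
for h, r, t in facts:
    X[e_idx[h], e_idx[t], r_idx[r]] = 1
```

All remaining entries are 0, which under the open-world assumption means "missing or false", not necessarily negative.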
Compared with general knowledge graphs such as DBpedia (Auer et al., 2007) and Freebase (Bollacker et al., 2008), a physical commonsense knowledge graph has at least three characteristics: 1) Training facts are scarce. For example, when labeling the properties of an object, people usually name the ones that are easiest to think of but cannot enumerate all properties. 2) Not all facts can be mined from existing texts. For example, the relationships between affordances and properties usually do not appear in texts explicitly and must be inferred.
3) The number of relationships is small and all are n-to-n relationships, which makes modeling relationships between entities more complicated. Forbes et al. (2019) show that with supervised fine-tuning, Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2019) can learn the relationships OP and OA well but not AP. In this paper, we first use BERT to augment the training data of OP and OA and then employ constrained Tucker factorization (Balazevic et al., 2019) to complete the knowledge graph of physical common sense. More specifically, we use typed constraints to reduce the solution space and add negative relationships to leverage negative training samples. We evaluate this method on triple classification and link prediction tasks using a physical commonsense dataset (Forbes et al., 2019), and show that it can model physical common sense more effectively than state-of-the-art knowledge graph embedding methods.
The contributions of this paper are: 1) we formulate physical commonsense learning as a knowledge graph completion problem, and 2) we propose a novel pipeline that combines pre-trained models and knowledge graph embedding to learn physical common sense; experimental results show its superior performance.

Common Sense and Physical Common Sense
Common sense learning is one of the main challenges in NLP (Cambria and White, 2014). Although existing works have made significant progress on reading comprehension and question answering (Rajpurkar et al., 2016), they are still text-based and challenging to use for commonsense reasoning (Ostermann et al., 2018). In general, commonsense modeling can be classified into two categories: 1) explicit encoding via knowledge graphs (Auer et al., 2007; Bollacker et al., 2008) and 2) implicit encoding via language models (Bosselut et al., 2019). Building high-quality knowledge graphs usually requires expensive human annotation. There is some research on extracting facts from unstructured text (Clancy et al., 2019), but this approach is not flexible enough to build domain-specific knowledge graphs. Recent research shows that pre-trained models can be good at encoding commonsense knowledge due to their large number of model parameters and text corpora, and they can be used to complete knowledge graphs (Bosselut et al., 2019). Physical commonsense learning is a recently proposed task (Forbes et al., 2019) related to language understanding in a physical world context, which is a sub-category of commonsense learning. Forbes et al. (2019) formulate physical commonsense learning as a machine learning problem, and show that a pre-trained BERT model can learn the OP and OA tasks well but cannot generalize well on the AP task. In this paper, to deal with the generalization problem of BERT, we explore using knowledge graph embedding, which is commonly used in commonsense modeling, to address physical commonsense learning.

Knowledge Graph Embedding
Knowledge graphs have been shown to be useful for many NLP tasks, such as contextual word embedding (Peters et al., 2019), text classification (K M et al., 2018), and language generation (Zhou et al., 2018). In general, knowledge graph embedding can be classified into two categories: translational distance models and semantic matching models (Wang et al., 2017). Translational distance models compute the score of a factual triple (e_h, r, e_t) as the distance between e_h and e_t through the relationship r. Typical methods include TransE (Bordes et al., 2013) and its variants, such as TransD (Ji et al., 2015). Semantic matching models compute the score of a factual triple by exploiting the latent semantics between e_h and e_t, with the knowledge graph usually modeled as a 3-way tensor. Typical methods include RESCAL (Nickel et al., 2011), DistMult (Yang et al., 2015), ComplEx (Trouillon et al., 2016), SimplE (Kazemi and Poole, 2018), and Tucker factorization (Balazevic et al., 2019). Compared with other methods, the Tucker factorization method learns a basis of relationship embeddings and can model more complex relationships, so it is used in this paper.

Method
Our method consists of two components: 1) we first augment all (object, property) and (object, affordance) pairs of the OP and OA tasks using BERT; 2) with the training data of OP, OA, and AP as input, we use constrained Tucker factorization to de-noise and complete the knowledge graph. In particular, we use typed constraints to reduce the solution space and add negative relationships to leverage negative training samples.

Data Augmentation
Because BERT can only do well on the OP and OA tasks (Forbes et al., 2019), we only augment data for these two tasks. In particular, for each pair (o, p) of OP, where o ∈ O is an object and p ∈ P is a property, we compose the sentence "A/An o is p." and use BERT fine-tuned on OP to predict its label l_op. Similarly, for each pair (o, a) of OA, where o ∈ O is an object and a ∈ A is an affordance, we compose the sentence "He a the o." and use BERT fine-tuned on OA to predict its label l_oa. We use the augmented data D_OP and D_OA, together with the original AP data D_AP, as input to the constrained Tucker factorization model.
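The two sentence templates can be sketched as follows; the article heuristic ("A" vs. "An") is our simplifying assumption, and the fine-tuned BERT classifier that would label each composed sentence is omitted:

```python
# Sentence templates for data augmentation. The vowel-based article choice
# is a simplifying assumption, not a detail from the paper.
def op_sentence(obj: str, prop: str) -> str:
    article = "An" if obj[0].lower() in "aeiou" else "A"
    return f"{article} {obj} is {prop}."

def oa_sentence(obj: str, aff: str) -> str:
    return f"He {aff} the {obj}."

# Each composed sentence would then be labeled by a fine-tuned BERT classifier.
sent = op_sentence("apple", "edible")   # -> "An apple is edible."
```

Iterating over all object-property and object-affordance pairs and recording the predicted labels yields the augmented sets D_OP and D_OA.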

Constrained Tucker Factorization
All (e_h, r, e_t) tuples compose a 3-way binary tensor X ∈ {0, 1}^{n_e × n_e × n_r}, where each entry X(i, j, k) denotes whether the i-th head entity and j-th tail entity follow the k-th relationship, n_e is the number of entities, and n_r is the number of relationships. Each slice of X is an n_e × n_e matrix of the relationship k. The Tucker factorization model proposed by Balazevic et al. (2019) approximates X as

X ≈ W ×_1 E^T ×_2 E^T ×_3 R^T,    (1)

where ×_i denotes the i-mode product, E ∈ R^{d_e × n_e} is the entity embedding matrix, R ∈ R^{d_r × n_r} is the relationship embedding matrix, W ∈ R^{d_e × d_e × d_r} is a core tensor, d_e is the latent dimension of entities, and d_r is the latent dimension of relationships.
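Under the dimensions defined above, the mode products can be written as a single einsum contraction; the random embeddings and toy sizes below are placeholders, not trained values:

```python
import numpy as np

rng = np.random.default_rng(0)
n_e, n_r, d_e, d_r = 5, 3, 4, 2        # toy sizes, illustrative only
E = rng.normal(size=(d_e, n_e))        # entity embeddings
R = rng.normal(size=(d_r, n_r))        # relationship embeddings
W = rng.normal(size=(d_e, d_e, d_r))   # core tensor

# Contract each mode of the core tensor W with the matching embedding
# matrix: X_hat[i, j, k] = sum_{a,b,c} W[a,b,c] E[a,i] E[b,j] R[c,k].
X_hat = np.einsum("abc,ai,bj,ck->ijk", W, E, E, R)
```

The reconstruction X_hat has shape (n_e, n_e, n_r), matching the binary tensor X it approximates.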

Typed Constraints
Similar to the typed tensor decomposition method in (Chang et al., 2014), because we know that only objects and properties can potentially have the relationship OP, we can constrain the remaining entries of the OP slice to 0. We can constrain the OA and AP relationships in a similar way. Therefore, we optimize the following objective jointly for the three tasks:

min_{E, R, W} ||M ⊙ (X̂ − X)||_F^2 + λ ||(1 − M) ⊙ X̂||_F^2 + β f(X),    (2)

where X̂ is the reconstruction in equation 1, ||·||_F denotes the Frobenius norm, ⊙ denotes the element-wise product, M is the mask tensor for the typed constraint, and f(X) = ||E||_F^2 + ||R||_F^2 + ||W||_F^2 is the regularization term. λ and β are the coefficient weights of the constraint and regularization terms. Because all entities are categorized and we consider the type constraint, there is only one possible relationship for a single head and tail pair.
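A minimal sketch of building the mask tensor M, assuming a toy entity ordering (objects first, then properties, then affordances); the sizes and the ordering convention are our illustrative assumptions:

```python
import numpy as np

# Toy sizes; the real dataset has 80 objects, 50 properties, 504 affordances.
n_obj, n_prop, n_aff = 2, 2, 2
n_e = n_obj + n_prop + n_aff
obj = slice(0, n_obj)
prop = slice(n_obj, n_obj + n_prop)
aff = slice(n_obj + n_prop, n_e)

# Each relationship only admits one (head type, tail type) combination.
rel_types = {"OP": (obj, prop), "OA": (obj, aff), "AP": (aff, prop)}
relations = list(rel_types)

# M(i, j, k) = 1 only when the types of head i and tail j match relationship k.
M = np.zeros((n_e, n_e, len(relations)), dtype=np.int8)
for k, r in enumerate(relations):
    head, tail = rel_types[r]
    M[head, tail, k] = 1
```

Entries where M is 0 are type-incompatible, so the objective pushes the corresponding reconstructed values toward 0.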

Negative Samples
One unique challenge of a physical commonsense knowledge graph is that we have to use the open-world assumption. Namely, for unknown facts, we cannot assume that they are negative samples. In this paper, we propose encoding negative samples by adding corresponding negative relationships explicitly. For each OP, OA, and AP relationship, we add a corresponding negative relationship, i.e., NOT-OP, NOT-OA, and NOT-AP, e.g., (person, NOT-OP, a tool), (cup, NOT-OA, twist), (walk, NOT-AP, used for eating). Similar to Balazevic et al. (2019), we also use reverse relationships. Namely, for each tuple (h, r, t), we add (t, r-reverse, h). Therefore, there are six negative relationships in total. For the OP and OA tasks, the negative samples are added through the data augmentation module in subsection 3.1, i.e., the labels are predicted by BERT, and for the AP task, we use the negative samples from the dataset. In this way, we can not only increase the number of relationships but also leverage labeled negative samples more effectively.
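The expansion of labeled pairs into negative and reverse triples can be sketched as follows; the helper name and tuple format are ours, not from the original implementation:

```python
# Expand labeled triples with explicit negative and reverse relationships.
def expand_triples(labeled):
    """labeled: iterable of ((head, rel, tail), label) pairs, label in {0, 1}."""
    out = []
    for (h, r, t), label in labeled:
        rel = r if label == 1 else f"NOT-{r}"   # negative relationship for label 0
        out.append((h, rel, t))
        out.append((t, f"{rel}-reverse", h))    # reverse relationship
    return out

triples = expand_triples([(("apple", "OP", "edible"), 1),
                          (("cup", "OA", "twist"), 0)])
```

A negatively labeled pair thus becomes a positive fact under a NOT-relationship, so the factorization can learn from it directly instead of treating it as a missing entry.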

Experiments
To evaluate the method, we conducted experiments with a physical commonsense dataset (Forbes et al., 2019) on triple classification and link prediction. To simplify the problem, we only used the situated OP, OA, and AP data, which contain 80 objects, 50 properties, and 504 affordances. The statistics are shown in Table 2. With the data augmentation component, we generated 4000 OP samples and 40320 OA samples. We compared the method with state-of-the-art knowledge graph embedding methods, including TransE, TransD, RESCAL, DistMult, ComplEx, SimplE, and Tucker¹. We optimized equation 2 with Adam in PyTorch and did not optimize the regularization term explicitly. λ was set to 0.1 through 5-fold cross-validation. d_e and d_r were set to 200 by default.

Triple Classification
Triple classification needs to predict whether a fact (e_h, r, e_t) is correct or not. With the learned E, R, and W, we calculated the probability that two entities e_h and e_t follow a relationship r as

p((e_h, r, e_t)) = σ(W ×_1 e_h ×_2 e_t ×_3 r),

where e_h and e_t are the embeddings of the head and tail entities (columns of E), r is the embedding of the relationship (a column of R), and σ is the sigmoid function. With the typed constraint, we then selected the relationship with the maximal probability. The results are shown in Table 1. For the other methods, we only input the original training data without data augmentation.
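Assuming the Tucker reconstruction above, the probability of a single triple can be computed by contracting the core tensor with the three embedding vectors; the random parameters below are placeholders for learned values:

```python
import numpy as np

def score(W, E, R, h, t, k):
    """Probability that head h and tail t follow relationship k (sketch)."""
    # Contract W with the head, tail, and relationship embedding vectors.
    logit = np.einsum("abc,a,b,c->", W, E[:, h], E[:, t], R[:, k])
    return 1.0 / (1.0 + np.exp(-logit))   # sigmoid

rng = np.random.default_rng(0)
E = rng.normal(size=(4, 5))
R = rng.normal(size=(2, 3))
W = rng.normal(size=(4, 4, 2))
p = score(W, E, R, h=0, t=1, k=2)
```

With the typed constraint, only the one admissible relationship for a given head/tail pair needs to be scored.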
With the data augmentation (DA) and typed constraints (CSTR), we achieved the best classification accuracy. In particular, we achieved relatively high micro and macro F1 scores for the three tasks, indicating that our method can predict positive samples more accurately.

Link Prediction
Link prediction predicts the tail entity given a head entity and a relationship, i.e., (e_h, r, ?). With the learned E, R, and W, we calculated the probability of every candidate tail entity e as

p((e_h, r, e)) = σ(W ×_1 e_h ×_2 e ×_3 r).

Similarly, we compared our results with typical knowledge graph embedding methods. For the Tucker method, we trained 2000 epochs, and for our method, we trained 50 epochs. The results are shown in Table 3. Compared with other methods, our method usually had relatively higher performance, indicating its potential in discovering new physical commonsense facts.

¹The implementations of TransE, TransD, RESCAL, DistMult, ComplEx, and SimplE are from OpenKE (Han et al., 2018). The implementation of Tucker is from Balazevic et al. (2019). Unless otherwise stated, we used their default parameters.
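Scoring all candidate tails for a query (e_h, r, ?) amounts to leaving one entity mode uncontracted; a sketch with placeholder parameters:

```python
import numpy as np

def rank_tails(W, E, R, h, k):
    """Score every candidate tail entity for the query (e_h, r_k, ?)."""
    # Contract head and relationship embeddings, keep the tail mode open.
    logits = np.einsum("abc,a,bj,c->j", W, E[:, h], E, R[:, k])
    probs = 1.0 / (1.0 + np.exp(-logits))
    return np.argsort(-probs)              # entity indices, best first

rng = np.random.default_rng(0)
E = rng.normal(size=(4, 5))
R = rng.normal(size=(2, 3))
W = rng.normal(size=(4, 4, 2))
ranking = rank_tails(W, E, R, h=0, k=1)
```

Ranking metrics such as hits@k can then be read directly off the position of the true tail entity in the returned ordering.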

Discussion
To evaluate the effectiveness of the data augmentation (DA) and typed constraint (CSTR) components, we also conducted ablation studies on triple classification and link prediction separately, and the results are shown in Tables 1 and 3, from which we can see that DA and CSTR can help improve the performance of Tucker factorization.
Compared with knowledge graph embedding methods, the pre-trained BERT model can perform better on OP and OA, but it is more difficult for it to generalize well on AP because such facts are not written in existing texts explicitly and BERT does not encode them as well as the OP and OA tasks (Forbes et al., 2019). For example, in terms of AP triple classification, the results of BERT are: a micro F1 score of 0.37, an affordance macro F1 score of 0.36, and a property macro F1 score of 0.25. Our results for triple classification outperform them by a large margin, although our results are still worse in terms of OP and OA classification.
From the perspective of multi-task learning, one explanation for the improvement on the AP task is that the core tensor W can be viewed as a parameter-sharing mechanism among the three tasks, through which the OP and OA tasks help improve the performance of AP. In a separate experiment, we used a multi-task BERT model (Stickland and Murray, 2019) and got a micro F1 score of 0.46, an affordance macro F1 score of 0.37, and a property macro F1 score of 0.48 for the AP task, which is similar to the result with our model.

Conclusion
In this paper, we formulate physical commonsense learning as a knowledge graph completion problem. We first use BERT to augment the training data of OP and OA, and then employ constrained Tucker factorization to complete the knowledge graph. We constrain types to reduce the solution space and add negative relationships to leverage negative training samples. Compared with typical knowledge graph embedding methods, our results show good performance on triple classification and link prediction. Our method also has the potential to serve as a generic approach to the knowledge graph completion problem.