23 Sep 2018

Frustratingly Easy Domain Adaptation

The paper can be found here

What is this about?

Put simply: the paper presents a method for training a model when you have only a small amount of labelled data in the domain you are working on, but have access to loads of labelled data from some other domain. The name reflects the author's point that it can be frustrating to find that methods as simple as those illustrated here perform reasonably well and are such difficult benchmarks to beat.

Basic Idea

* Target Domain: The domain of interest, for which you typically have access to only a small amount of data.
* Source Domain: A different domain in which the same task can be performed, and for which we have access to a fairly large corpus of annotated data.

Frequently, we have access to a large annotated corpus from one domain and want to use it to improve a model for the domain we are actually working on, for which we only have a small annotated corpus. Note that the paper only deals with the fully supervised setting: it does not leverage any unlabelled data you might have available from the target domain. The author transforms the domain adaptation problem into a standard supervised learning problem.

Prior Work & Benchmarks

  • SRCONLY: ignore the target data and train the model using only the source data
  • TGTONLY: ignore the source data and train the model using only the target data
  • ALL: train the model using the union of both the datasets
  • WEIGHTED: when the target data is small compared to the source data, ALL tends to fit its weights to the source domain. To avoid this, we can re-weight the examples, for instance by replicating the target data until it matches the source data in size (or, equivalently, by down-weighting the source examples).
  • PRED: use the SRCONLY model to make predictions on the TGT data and use these predictions as an additional feature when training on the TGT domain
  • LININT: train the SRCONLY and TGTONLY models separately and linearly interpolate their predictions, with the interpolation weight tuned on held-out target data
  • PRIOR: the feature weights learned by the SRCONLY model are considered valuable, and the TGT model should be allowed to drift away from them only if the target data strongly suggests so. Hence, we use the regularization term lambda * ||w - ws||^2, where ws is the weight vector learned by the source model and w is the weight vector of the target model.
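
The PRIOR regularizer can be sketched in a few lines. This is only an illustration of the penalty term described above, not the paper's implementation; the names `prior_penalty`, `prior_penalty_grad`, `w_src` and `lam` are assumptions made for this example.

```python
from typing import List, Sequence


def prior_penalty(w: Sequence[float], w_src: Sequence[float], lam: float) -> float:
    """Return lam * ||w - w_src||^2, the cost of drifting away from the source weights."""
    return lam * sum((wi - si) ** 2 for wi, si in zip(w, w_src))


def prior_penalty_grad(w: Sequence[float], w_src: Sequence[float], lam: float) -> List[float]:
    """Gradient of the penalty with respect to w: 2 * lam * (w - w_src)."""
    return [2.0 * lam * (wi - si) for wi, si in zip(w, w_src)]


# Hypothetical weights: the first feature agrees with the source model,
# the second has drifted away from it.
w_source = [1.0, 0.0]
w_target = [1.0, 2.0]
penalty = prior_penalty(w_target, w_source, lam=0.5)
gradient = prior_penalty_grad(w_target, w_source, lam=0.5)
```

The penalty is added to the usual training loss on the target data, so a weight only moves away from its source value when the reduction in target loss outweighs the penalty.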

Method suggested in the paper

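The above methods form baselines that are surprisingly difficult to beat. The method suggested by the author, however, beats them on most of the tasks reported in the paper. It is also cheap, since it amounts to a preprocessing step on the features. (The figure of 10 to 15 times slower than PRIOR that appears in the paper refers to an earlier, more complex model by Daumé III and Marcu (2006), not to this method.)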

  • take each feature in the original problem and make three versions out of it.
    • general version/common version
    • source-specific version
    • target-specific version
  • modify the source data to have three features instead of one: if a feature was present in the source data, mark it present in the general and the source-specific versions and absent from the target-specific version. For example, suppose a function f_s maps each source feature vector x to the three versions we require, in the order (general, source-specific, target-specific). Then:
    • f_s(x) = (x, x, 0)
  • Do the same with the target data, but mark the target-specific version instead of the source-specific one, i.e. f_t(x) = (x, 0, x)
  • Combine the two datasets (you can use the weighted combination as in one of the prior works above)
  • Now you can use any supervised learning algorithm of your choice on this modified dataset
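
The steps above can be sketched in plain Python. This is a minimal illustration, assuming dense feature vectors stored as lists and the ordering (general, source-specific, target-specific); the function name `augment` and the toy data are made up for this example. The target mapping (x, 0, x) follows from applying the same procedure with the target-specific copy marked instead of the source-specific one.

```python
from typing import List, Sequence


def augment(x: Sequence[float], domain: str) -> List[float]:
    """Return the general, source-specific and target-specific copies of x, concatenated."""
    x = list(x)
    zeros = [0.0] * len(x)
    if domain == "source":
        return x + x + zeros   # (x, x, 0)
    if domain == "target":
        return x + zeros + x   # (x, 0, x)
    raise ValueError("domain must be 'source' or 'target'")


# Toy data: two source examples and one target example with two features each.
source_X = [[1.0, 0.0], [0.0, 1.0]]
target_X = [[1.0, 1.0]]

# The combined dataset can be passed to any supervised learner,
# together with the original labels in the same order.
train_X = [augment(x, "source") for x in source_X] + [augment(x, "target") for x in target_X]
```

Every example ends up three times as wide as before, and the learner itself needs no changes.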

Intuitive explanation

Here’s an intuitive explanation of why this strategy might work. The example is taken from the paper.

Suppose we are doing POS tagging, with the WSJ corpus as the source domain and reviews of computer hardware from some ecommerce website as the target domain. We can clearly observe words of the following kinds:
* words like “the”: they are determiners in both cases and should be treated more or less the same way across all domains. So, for words like these, most of the weight is likely to end up on the general version of the feature.
* words like “monitor”: this is likely to be a verb in the context of WSJ but is often a noun in the context of computer hardware. Had we used a single feature to represent it, our tagger would likely be confused about which tag should go with this word.

Making separate features for the word “monitor” in the different domains allows the model to capture such differences: it can learn that “monitor” is likely to be a noun in the target domain and likely to be a verb in the source domain.
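
The same idea can be sketched for sparse, named features, which is how word features are often represented in a tagger. This is an illustrative sketch only; the `general:` / `source:` / `target:` prefixes and the function name `augment_names` are assumptions of this example, not something taken from the paper.

```python
from typing import Iterable, List


def augment_names(features: Iterable[str], domain: str) -> List[str]:
    """Emit a general copy and a domain-specific copy of every active feature name."""
    out = []
    for name in features:
        out.append("general:" + name)
        out.append(domain + ":" + name)
    return out


# The same word feature seen in a WSJ sentence (source) and in a hardware review (target).
wsj_token = augment_names(["word=monitor"], "source")
review_token = augment_names(["word=monitor"], "target")

# Only the general copy is shared between the two domains.
shared = set(wsj_token) & set(review_token)
```

Because `source:word=monitor` and `target:word=monitor` are different features, the learner can give them different weights for the verb and noun tags, while a word like “the” can rely on its shared `general:` copy.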

This method also generalises very easily to more than two domains: instead of replicating each feature 3 times, we replicate it k+1 times (k is the number of domains), giving one general version plus one version specific to each domain.
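
A sketch of this generalisation, with k counting the domains and again assuming dense list-based feature vectors; `augment_multi`, `domain_index` and `num_domains` are names chosen for this example.

```python
from typing import List, Sequence


def augment_multi(x: Sequence[float], domain_index: int, num_domains: int) -> List[float]:
    """Return the general copy of x followed by num_domains domain-specific blocks.

    Only the block belonging to domain_index holds x; all other blocks are zero.
    """
    if not 0 <= domain_index < num_domains:
        raise ValueError("domain_index must be in range(num_domains)")
    x = list(x)
    out = list(x)  # general copy, shared by all domains
    for d in range(num_domains):
        out.extend(x if d == domain_index else [0.0] * len(x))
    return out


# An example from the second of three domains: (k + 1) * 2 = 8 values.
row = augment_multi([1.0, 2.0], domain_index=1, num_domains=3)
```

With `num_domains=2` this reduces to the two-domain mapping, with index 0 as the source and index 1 as the target.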


The paper illustrates a very simple and straightforward strategy for domain adaptation. It is easy to implement and may quickly give you a baseline, or even good results, on your dataset. A nice illustration of this method and the baselines can be found in the presentation here
