The paper can be found here.
Put simply: the paper presents a method for training a model when you have only a small amount of labelled data in the domain you are working on, but access to a large amount of labelled data from some other domain. The paper is so named because the author suggests it can be frustrating to discover that simple methods like those illustrated are such difficult baselines to beat, and perform reasonably well.
Terminology:
* Target Domain: The domain of interest, of which you typically have only a small amount of data.
* Source Domain: A domain that can be used for the same task, of which we have a fairly large corpus of annotated data.
Frequently, we have access to a large annotated corpus from one domain, which we want to use to enrich a model so that it works well on the domain of interest (for which we have only a small annotated corpus). Note that the paper deals only with fully supervised learning: it does not leverage any unlabelled data you might have from the target domain. The authors transform the domain adaptation problem into a standard supervised learning problem.
The above methods form baselines that are surprisingly difficult to beat. The method suggested by the authors, however, does beat these baselines. Nonetheless, it is at least 10 to 15 times slower than the PRIOR method above.
Here’s an intuitive explanation as to why this strategy might work. This example is also taken from the paper.
Suppose we are doing POS tagging, with the WSJ corpus as the source domain and reviews of computer hardware from some e-commerce website as the target domain. We can clearly observe words of the following kinds:
* Words like “the”: they are determiners in both cases and should be treated more or less the same way across all domains. For such words, the common or general feature weights may be higher.
* Words like “monitor”: this is likely to be a verb in the context of the WSJ but is often a noun in the context of computer hardware. Had we used a single feature to represent it, our tagger would likely be confused about which tag should go with this word.
Making separate features for the word “monitor” in different domains allows the model to capture distinctions such as the one above: it can learn that “monitor” in the target domain is likely to be a noun, while in the source domain it is likely to be a verb.
This model also generalises easily to more than two domains: instead of replicating each feature 3 times, we replicate it k+1 times (where k is the number of domains).
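The feature replication described above can be sketched in a few lines. This is only a minimal illustration of the idea, not the paper's reference implementation; the `augment` helper, the use of NumPy, and the toy feature vector are assumptions made here for the example.

```python
import numpy as np

def augment(x, domain, num_domains=2):
    """Feature augmentation in the spirit of the paper.

    Maps a feature vector x of dimension d to dimension (num_domains + 1) * d:
    the first d entries are the shared "general" copy, followed by one
    domain-specific block per domain. Every domain-specific block is zero
    except the block belonging to the vector's own domain.
    """
    d = len(x)
    out = np.zeros((num_domains + 1) * d)
    out[:d] = x                    # general copy, shared across all domains
    start = (domain + 1) * d
    out[start:start + d] = x       # this domain's own copy
    return out

# A toy 3-dimensional feature vector, e.g. for the word "monitor".
x = np.array([1.0, 0.0, 2.0])

print(augment(x, domain=0))  # source: general copy + source block filled
print(augment(x, domain=1))  # target: general copy + target block filled
```

A standard supervised learner trained on these augmented vectors can then put weight on the general copy for words like “the” and on the domain-specific copies for words like “monitor”.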
The paper illustrates a very simple and straightforward strategy for domain adaptation. It is easy to implement and may quickly give you a baseline, or even good results, on your dataset. A nice illustration of this method and the baselines can be found in the presentation here.