Google has released the Patent Phrase Similarity dataset, intended to help AI models better understand the somewhat odd world of patent language:
The process of using traditional patent search methods (e.g., keyword searching) to search through the corpus of over one hundred million patent documents can be tedious and result in many missed results due to the broad and non-standard language used. For example, a “soccer ball” may be described as a “spherical recreation device”, “inflatable sportsball” or “ball for ball game”.
Announcing the Patent Phrase Similarity Dataset
The dataset was used in the U.S. Patent Phrase to Phrase Matching Kaggle competition, where some entries achieved close-to-human results.
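To make the keyword-search problem from the quoted passage concrete, here is a minimal sketch that scores the three example paraphrases against "soccer ball" using plain word overlap (Jaccard similarity on whitespace tokens). The function names and the choice of Jaccard are assumptions made for illustration; this is not how the dataset was built or how the competition was scored.

```python
def tokens(phrase):
    """Lowercase a phrase and split it into a set of whitespace-separated words."""
    return set(phrase.lower().split())


def jaccard(a, b):
    """Word-overlap similarity: shared words divided by all distinct words."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


# Phrases taken from the quoted example above.
query = "soccer ball"
paraphrases = [
    "spherical recreation device",
    "inflatable sportsball",
    "ball for ball game",
]

# Score each paraphrase against the query by word overlap alone.
scores = {p: jaccard(query, p) for p in paraphrases}

for phrase, score in scores.items():
    print(f"{phrase!r}: {score:.2f}")
```

Two of the three paraphrases share no words with the query and score zero, so a purely keyword-based search would miss them even though they describe the same object. That gap is what a phrase-similarity dataset is meant to help models close.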