The TakeLab Semantic Text Similarity System
Version: 1.0
Release date: April 18, 2012
1 Description
The TakeLab Semantic Text Similarity System was entered as task6-takelab-simple in SemEval-2012 (Semantic Evaluation Exercises) to perform the Semantic Textual Similarity task:
Given two sentences, s1 and s2, the system produces a similarity score as a floating-point number on a scale from 0 (no relation) to 5 (semantic equivalence).
The SemEval-2012 organizers measured the performance of our system by comparing its output against the aggregated similarity scores obtained from human subjects on the same data set, using the Pearson correlation coefficient.
Please see the SemEval-2012 STS task page for more detailed information.
If you use the software or the datasets, please cite the following paper:
Šarić, F., Glavaš G., Karan M., Šnajder J., Dalbelo Bašić B.: TakeLab: Systems for Measuring Semantic Text Similarity. In Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012). pp. 441-448. ACL, Montreal, Canada (7-8 June 2012).
The BibTeX format is:
@InProceedings{saric2012takelab,
author = {\v{S}ari\'{c}, Frane and Glava\v{s}, Goran and Karan, Mladen
and \v{S}najder, Jan and Dalbelo Ba\v{s}i\'{c}, Bojana},
title = {TakeLab: Systems for Measuring Semantic Text Similarity},
booktitle = {Proceedings of the Sixth International Workshop on Semantic Evaluation
{(SemEval 2012)}},
month = {7-8 June},
year = {2012},
address = {Montr\'{e}al, Canada},
publisher = {Association for Computational Linguistics},
pages = {441--448},
url = {http://www.aclweb.org/anthology/S12-1060}
}
2 Instructions
2.1 Installation prerequisites
This system requires the following packages: Python 2.7,
NLTK, NumPy, and LIBSVM.
If you are using Ubuntu Linux or a similar distribution, you can install all of the
prerequisites by typing:
sudo apt-get install python-nltk python-numpy libsvm-tools
Once you have installed the NLTK package, issue the following command to download the POS tagger model and the WordNet corpus:

python -c "import nltk; nltk.download()"

When prompted, please select the maxent_treebank_pos_tagger model and the wordnet corpus.
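Alternatively, the same two resources can be fetched non-interactively by passing their package identifiers to the downloader:

python -c "import nltk; nltk.download('maxent_treebank_pos_tagger'); nltk.download('wordnet')"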
2.2 Download
The source code and the data files can be downloaded from this location:
2.3 Usage
Generating the features
Issuing
python takelab_simple_features.py train/STS.input.MSRvid.txt train/STS.gs.MSRvid.txt > msrvid-train.txt
python takelab_simple_features.py test/STS.input.MSRvid.txt > msrvid-test.txt
will generate the requisite features from the train/STS.input.MSRvid.txt and test/STS.input.MSRvid.txt files, respectively.
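The generated files use LIBSVM's sparse text format: each line carries a score (the gold score for training data, presumably a dummy value for the test data, since no gold scores are supplied) followed by index:value feature pairs. The following is a minimal inspection sketch, not part of the distribution, written under that format assumption:

# inspect_features.py -- hypothetical helper, not part of the distribution.
# Prints the score and feature count for each line of a LIBSVM-format file.
import sys

with open(sys.argv[1]) as f:
    for i, line in enumerate(f):
        parts = line.split()
        label = float(parts[0])                        # gold score, or a dummy value
        feats = dict(p.split(':') for p in parts[1:])  # feature index -> value
        print 'pair %d: score=%.2f, %d features' % (i, label, len(feats))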
Note: In order to avoid an unreasonably large file download, the provided implementation contains the file with the word frequencies (word-frequencies.txt) and the LSA word vectors (nyt_words.txt, nyt_word_vectors.txt, wikipedia_words.txt, and wikipedia_word_vectors.txt), which were filtered to contain only the words appearing in the official train and test sets. Using the larger, unfiltered datasets on the provided train and test sets will not affect the output of this software in any way.
Obtaining the model parameters
Executing the provided shell script grid-search.sh

./grid-search.sh msrvid-train.txt

will find the optimal model parameters for Support Vector Regression using LIBSVM. The above command will finish with the following output:
Best correlation: 0.873686
Type: svm-train -s 3 -t 2 -c 200 -g .02 -p .5 msrvid-train.txt model.txt
Thus the optimal LIBSVM parameters for msrvid-train.txt are: -s 3 -t 2 -c 200 -g .02 -p .5 (-s 3 selects epsilon-SVR, -t 2 the RBF kernel; -c is the cost, -g the kernel's gamma, and -p the epsilon of the loss function).
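For reference, here is a rough sketch of this kind of grid search, built on LIBSVM's own 5-fold cross-validation (-v 5). The parameter ranges below are illustrative assumptions; the provided grid-search.sh may search a different grid and select by Pearson correlation instead of the squared correlation coefficient:

# grid_search_sketch.py -- a rough approximation of grid-search.sh (hypothetical;
# the provided script may use different ranges and a different selection criterion).
import subprocess, itertools, re

def cv_score(train_file, c, g, p):
    # svm-train -v 5 runs 5-fold cross-validation and reports the
    # squared correlation coefficient for epsilon-SVR.
    out = subprocess.check_output(['svm-train', '-s', '3', '-t', '2', '-v', '5',
                                   '-c', str(c), '-g', str(g), '-p', str(p),
                                   train_file])
    m = re.search(r'Squared correlation coefficient = (\S+)', out)
    return float(m.group(1))

best = max((cv_score('msrvid-train.txt', c, g, p), c, g, p)
           for c, g, p in itertools.product([10, 50, 200], [0.01, 0.02, 0.05],
                                            [0.1, 0.5, 1.0]))
print 'best r^2 = %.4f with -c %g -g %g -p %g' % best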
Note: The best correlation and the set of optimal parameters might vary slightly depending on the installed version of NLTK, the WordNet corpus, and LIBSVM.
Training
By copy-pasting the command suggested by grid-search.sh, i.e. the text following "Type:"

svm-train -s 3 -t 2 -c 200 -g .02 -p .5 msrvid-train.txt model.txt

one trains the model for the MSR-Video corpus.
The training will finish with the following output:
.
Warning: using -h 0 may be faster
*.*
optimization finished, #iter = 2777
nu = 0.431544
obj = -33239.258433, rho = -2.938525
nSV = 337, nBSV = 311
Evaluation
To evaluate the msrvid model against the msrvid test set, one can issue
svm-predict msrvid-test.txt model.txt msrvid-output.txt
which should output the following:
Mean squared error = 7.03655 (regression)
Squared correlation coefficient = -nan (regression)
(The -nan correlation is expected here, since the test feature file contains no real gold scores.) The obtained sentence similarity scores need postprocessing by the postprocess_scores.py script:

python postprocess_scores.py test/STS.input.MSRvid.txt msrvid-output.txt

which handles some trivial corner cases of our scoring function.
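For illustration, here is a minimal sketch of the simplest part of such postprocessing: clamping raw SVR outputs to the valid [0, 5] score range. This is NOT the provided postprocess_scores.py, which may handle further corner cases:

# clamp_scores.py -- a minimal sketch, not the provided postprocess_scores.py;
# it only clamps raw SVR outputs to the valid [0, 5] score range.
import sys

for line in open(sys.argv[1]):
    score = float(line)
    print '%g' % min(5.0, max(0.0, score))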
To obtain the Pearson correlation, one can run the correlation.pl script, which was provided by the SemEval organizers alongside the training data:

perl correlation.pl msrvid-output.txt test/STS.gs.MSRvid.txt

which outputs

Pearson: 0.88516

indicating that the sentence similarity scores produced by our system have a Pearson correlation of 0.88516 with the similarity scores reported by the human subjects (contained in the file test/STS.gs.MSRvid.txt).
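If correlation.pl is not at hand, the same statistic can be computed with NumPy. This sketch assumes both files contain one floating-point score per line, in matching pair order:

# pearson.py -- Pearson correlation between two score files,
# assuming one score per line in matching pair order.
import sys
import numpy as np

sys_scores = np.loadtxt(sys.argv[1])
gold_scores = np.loadtxt(sys.argv[2])
print 'Pearson: %.5f' % np.corrcoef(sys_scores, gold_scores)[0, 1]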
Please note that the system we provide on this page differs slightly from the system that was submitted to SemEval. A couple of features described in the paper are not used in this implementation (in particular, sentence length and the number of differing lemmas), and the implementation of some features may differ slightly. Please email us if you notice any major differences.
2.4 Additional data
We used the Google Books Ngrams dataset to obtain word frequencies. The word-frequencies.txt file provided herein is filtered to contain only the words appearing in the train and test sets; however, providing a larger, unfiltered file does not change the output of the system in any way. Several features use word LSA vectors obtained from Wikipedia and the New York Times Annotated Corpus. Since the full matrices containing all the LSA vectors are very large, we provide only shortened versions containing the vectors corresponding to the train and test set words. If you are interested in the full matrices or the code that generates them, please email info@takelab.hr.
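For illustration, here is a sketch of how these vector files might be used to compare two words. The layout assumed here (one word per line in nyt_words.txt, one whitespace-separated vector per line in nyt_word_vectors.txt, line-aligned) is an assumption, as are the example words:

# lsa_cosine.py -- hypothetical example of using the LSA vector files;
# assumes nyt_words.txt and nyt_word_vectors.txt are line-aligned.
import numpy as np

words = [w.strip() for w in open('nyt_words.txt')]
vectors = np.loadtxt('nyt_word_vectors.txt')
index = dict((w, i) for i, w in enumerate(words))

def cosine(w1, w2):
    # Cosine similarity between the LSA vectors of two words.
    v1, v2 = vectors[index[w1]], vectors[index[w2]]
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

print cosine('car', 'automobile')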
3 License
This code is available under a derivative of the BSD license that requires proper attribution. Essentially, you may use, modify, and sell this software or derived software in academic and non-academic, commercial and non-commercial, open-source and closed-source settings, provided you give proper credit.