TweetingJay - the TakeLab and LIIR semantic similarity
detection system
Version: 1.0
Release date: February 19, 2015
For other TakeLab data and software see our software and resources section.
1 Description
The TweetingJay Twitter text similarity detection system was entered as TKLBLIIR
in the SemEval-2015 Semantic Evaluation Exercises, where it took part in the
Paraphrase and Semantic Similarity in Twitter task.
Given two tweets, t1 and t2, the system must make a binary yes/no decision on whether the tweets are semantically equivalent. The SemEval-2015 organizers measured the performance of our system by comparing its output against the aggregated similarity scores obtained from human subjects on the same data set, using the F1 measure.
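For reference, the F1 measure over such binary decisions can be computed as follows (a minimal sketch; the function name and the 0/1 label encoding are our own, not part of the system):

```python
def f1_score(gold, pred):
    """F1 over binary paraphrase decisions (1 = paraphrase, 0 = not)."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    precision = float(tp) / (tp + fp) if tp + fp else 0.0
    recall = float(tp) / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)
```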
If you use the software or the datasets, please cite the following paper:
TODO apalike citation with link
The BibTeX format is:
@InProceedings{TODO,
author = {Glava\v{s}, Goran and Karan, Mladen
and \v{S}najder, Jan and Dalbelo Ba\v{s}i\'{c}, Bojana and Vuli\v{c}, Ivan and Moens, Marie-Francine},
title = {TKLBLIIR: Detecting Twitter Paraphrases with TweetingJay},
booktitle = {Proceedings of the TODO
{(SemEval 2015)}},
month = {TODO},
year = {2015},
address = {Denver, United States of America},
publisher = {Association for Computational Linguistics},
pages = {x-y},
url = {TODO}
}
2 Instructions
2.1 Installation prerequisites
You will need Python (2.7.6) with the nltk, numpy, and joblib libraries, as well as the libsvm command-line tools.
Unix (Ubuntu or similar):
You very likely already have Python; if not, install it first from www.python.org. You can get everything else with:
sudo apt-get install python-nltk python-numpy python-joblib libsvm-tools
python -c "import nltk; nltk.download('punkt')"
Windows:
Not tested, but it should work. Roughly, you should:
- Get python (2.7.x) from www.python.org.
- Install the required libraries (nltk, numpy, and joblib), e.g. using the pip package manager (bundled with the Python installation).
- Download the libsvm Windows binaries from http://www.csie.ntu.edu.tw/~cjlin/libsvm/ and put them into your PATH (or into the "experiment" subfolder of the package). The only two binaries needed are svm-train.exe and svm-predict.exe.
- Install the "punkt" resource into nltk (same python command as above).
- Rename run.sh to run.bat
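On either platform, a quick way to confirm the prerequisites before running the pipeline is a small check script (a sketch; the find_executable and check_prerequisites helpers are illustrative and not part of the package):

```python
import importlib
import os

def find_executable(name):
    """Search PATH for an executable; also tries a .exe suffix for Windows."""
    for directory in os.environ.get("PATH", "").split(os.pathsep):
        for candidate in (name, name + ".exe"):
            path = os.path.join(directory, candidate)
            if os.path.isfile(path) and os.access(path, os.X_OK):
                return path
    return None

def check_prerequisites():
    """Return a list of missing libraries/binaries; empty means you are ready."""
    missing = []
    for module in ("nltk", "numpy", "joblib"):
        try:
            importlib.import_module(module)
        except ImportError:
            missing.append(module)
    for binary in ("svm-train", "svm-predict"):
        if find_executable(binary) is None:
            missing.append(binary)
    return missing
```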
2.2 Download
The source code and the data files can be downloaded from this location:
2.3 Usage
Expected usage is as follows:
- Generate a combined (train + dev + test) data set file:
./data-set/generateCombined.py
- Generate the features:
./feature-generation/mostFeatures.py
- Train an SVM model on train + dev and apply it to test:
./experiment/experiment.py
Once finished, the output can be found in ./experiment/output/PIT2015_TKLBLIIR_rbfsvm.output
These steps are automated in the run.sh script.
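For illustration, the automated steps amount to running the three scripts above in order, roughly equivalent to this Python sketch (run.sh itself is a shell script; the run_pipeline helper here is our own):

```python
import subprocess

# The three pipeline stages, in order. The paths are taken directly
# from the usage instructions above.
PIPELINE = [
    ["python", "./data-set/generateCombined.py"],
    ["python", "./feature-generation/mostFeatures.py"],
    ["python", "./experiment/experiment.py"],
]

def run_pipeline(steps=PIPELINE):
    for step in steps:
        # check_call raises CalledProcessError if a stage fails, so a
        # later stage never runs on a broken intermediate file.
        subprocess.check_call(step)
```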
Some common usage scenarios:
Replicate the TKLBLIIR SemEval result:
Follow the above pipeline exactly (execute the run.sh script).
Evaluation is done using the official evaluation script (also available in the ./experiment/output folder).
The output should be:
838 TKLBLIIR 01_rbfsvm 0.658 0.627 0.691
Annotate a new test set:
Before starting, replace ./data-set/test with your own test file in the same format (the official SemEval 2015 Task 1 format). Run the above pipeline (execute the run.sh script). The output can be found in ./experiment/output/PIT2015_TKLBLIIR_rbfsvm.output.
NOTE:
- If your data set is not as richly tagged as the one from the SemEval task, you can omit the missing tags (use empty strings for them). Our final feature set does not use the tags.
- You will want to download the full embeddings matrix (see the final notes) and filter it to leave only the words appearing in the file combined.data (which now includes your test data).
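Filtering the embeddings matrix can be done with a few lines of Python. The sketch below assumes the common plain-text embeddings format (one word per line followed by its vector components, whitespace-separated); the helper name is our own:

```python
def filter_embeddings(lines, vocabulary):
    """Keep only embedding lines whose first token (the word) is in vocabulary.

    Assumes one word per line followed by its vector components,
    whitespace-separated."""
    vocabulary = set(vocabulary)
    kept = []
    for line in lines:
        parts = line.split(None, 1)  # split off the leading word
        if parts and parts[0] in vocabulary:
            kept.append(line)
    return kept
```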
Perform SVM parameter optimisation:
The C and gamma SVM parameter values are hard-coded into experiment.py. If you wish to verify these values using grid search (or find better ones), go to the experiment subfolder and run
./experiment/opt-params.py N
where N is the number of parallel threads you can afford to use. The script performs a grid search over the space of possible C and gamma values, guided by 10-fold cross-validation on the training set. Depending on your computer, this may take a long time. After the search has finished, the optimal parameters are reported.
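For illustration, the core of such a grid search can be sketched using libsvm's built-in cross-validation mode (svm-train -v n prints a line of the form "Cross Validation Accuracy = 65.8%"). This is a simplified, sequential sketch, not the actual opt-params.py implementation:

```python
import itertools
import re
import subprocess

def parse_cv_accuracy(output):
    """Extract the accuracy from svm-train's '-v' cross-validation output."""
    match = re.search(r"Cross Validation Accuracy = ([\d.]+)%", output)
    return float(match.group(1)) if match else None

def grid_search(train_file, log2c_range, log2g_range, folds=10):
    """Exhaustive (C, gamma) grid search with k-fold cross-validation.

    Follows the usual libsvm practice of searching over powers of two."""
    best = (None, None, -1.0)  # (C, gamma, accuracy)
    for log2c, log2g in itertools.product(log2c_range, log2g_range):
        cmd = ["svm-train", "-v", str(folds),
               "-c", str(2.0 ** log2c), "-g", str(2.0 ** log2g),
               train_file]
        output = subprocess.check_output(cmd).decode("utf-8", "replace")
        accuracy = parse_cv_accuracy(output)
        if accuracy is not None and accuracy > best[2]:
            best = (2.0 ** log2c, 2.0 ** log2g, accuracy)
    return best
```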
Change the features used:
The format of a feature file is (e.g. for two features):
(line 0) featurename1 | featurename2
(line 1) example1-ID | ex1-featurename1-value | ex1-featurename2-value
(line 2) example2-ID | ex2-featurename1-value | ex2-featurename2-value
Each file can contain one or more features, and there can be more than one file. The models use all features from all feature files located in ./experiment/feature-files.
You may change the features as you wish by manipulating these files, e.g. you may add a new feature by adding a properly formatted file to this folder. This is exactly how the ./feature-generation/mostFeatures.py script adds features.
You can find the raw text data (including IDs) from which to generate your features in the file ./data-set/combined.data.
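A new feature file in the format described above can be generated with a small helper like this (a sketch; the function names and the exact " | " separator with surrounding spaces are assumptions based on the format example):

```python
def format_feature_lines(feature_names, rows):
    """Build the lines of a feature file: a header of feature names,
    then one line per example with its ID followed by its values."""
    lines = [" | ".join(feature_names)]
    for example_id, values in rows:
        lines.append(" | ".join([str(example_id)] + [str(v) for v in values]))
    return lines

def write_feature_file(path, feature_names, rows):
    """Write the formatted feature lines to a file, one line each."""
    with open(path, "w") as out:
        out.write("\n".join(format_feature_lines(feature_names, rows)) + "\n")
```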
2.4 Final notes
3 License
This code is available under a derivative of a BSD license that requires proper attribution. Essentially, you may use, modify, and sell this software or derived software in academic and non-academic, commercial and non-commercial, open-source and closed-source settings, but you must give proper credit.