TweetingJay - the TakeLab and LIIR semantic similarity
detection system
Version: 1.0
Release date: February 19, 2015
For other TakeLab data and software see our software and resources section.
1 Description
The TweetingJay Twitter text similarity detection system was entered as TKLBLIIR
in the SemEval-2015 Semantic Evaluation Exercises, where it took part in the
Paraphrase and Semantic Similarity in Twitter task.
Given two tweets, t1 and t2, the system must make a binary yes/no decision on whether the tweets are semantically equivalent. The SemEval-2015 organizers measured the performance of our system by comparing its output against the aggregated similarity scores obtained from human subjects on the same data set, using the F1 measure.
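For reference, the F1 measure over such binary decisions can be computed as follows (a minimal sketch; the function name and the 0/1 label encoding are our own, not part of the system):

```python
def f1_score(gold, pred):
    """F1 over binary paraphrase decisions (1 = paraphrase, 0 = not)."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    precision = float(tp) / (tp + fp) if tp + fp else 0.0
    recall = float(tp) / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)
```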
If you use the software or the datasets, please cite the following paper:
TODO apalike citation with link
The BibTeX format is:
@InProceedings{TODO,
author = {Glava\v{s}, Goran and Karan, Mladen
and \v{S}najder, Jan and Dalbelo Ba\v{s}i\'{c}, Bojana and Vuli\v{c}, Ivan and Moens, Marie-Francine},
title = {TKLBLIIR: Detecting Twitter Paraphrases with TweetingJay},
booktitle = {Proceedings of the TODO
{(SemEval 2015)}},
month = {TODO},
year = {2015},
address = {Denver, United States of America},
publisher = {Association for Computational Linguistics},
pages = {x-y},
url = {TODO}
}
2 Instructions
2.1 Installation prerequisites
You will need Python (2.7.6) with the nltk, numpy, and joblib libraries, as well as the libsvm command-line tools.
Unix (Ubuntu or similar):
You very likely already have Python; if not, install it first from www.python.org. You can get everything else with:
sudo apt-get install python-nltk python-numpy python-joblib libsvm-tools
python -c "import nltk; nltk.download('punkt')"
Windows:
Not tested, but it should work. Roughly, you should:
- Get python (2.7.x) from www.python.org.
- Install the required libraries (nltk, numpy, and joblib), e.g. using the pip package manager (bundled with the Python installation).
- Download the libsvm Windows binaries from http://www.csie.ntu.edu.tw/~cjlin/libsvm/ and put them into your PATH (or into the "experiment" subfolder of the package). The only two binaries needed are svm-train.exe and svm-predict.exe.
- Install the "punkt" resource into nltk (same python command as above).
- Rename run.sh to run.bat
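On either platform, a quick way to confirm the prerequisites before running the pipeline is a small check script (a sketch; the find_executable and check_prerequisites helpers are illustrative and not part of the package):

```python
import importlib
import os

def find_executable(name):
    """Search PATH for an executable; also tries a .exe suffix for Windows."""
    for directory in os.environ.get("PATH", "").split(os.pathsep):
        for candidate in (name, name + ".exe"):
            path = os.path.join(directory, candidate)
            if os.path.isfile(path) and os.access(path, os.X_OK):
                return path
    return None

def check_prerequisites():
    """Return a list of missing libraries/binaries; empty means you are ready."""
    missing = []
    for module in ("nltk", "numpy", "joblib"):
        try:
            importlib.import_module(module)
        except ImportError:
            missing.append(module)
    for binary in ("svm-train", "svm-predict"):
        if find_executable(binary) is None:
            missing.append(binary)
    return missing
```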
2.2 Download
The source code and the data files can be downloaded from this location:
2.3 Usage
Expected usage is as follows:
- Generate a combined (train + dev + test) data set file:
./data-set/generateCombined.py
- Generate the features:
./feature-generation/mostFeatures.py
- Train an SVM model on train + dev and apply it to test:
./experiment/experiment.py
Once finished, the output can be found in ./experiment/output/PIT2015_TKLBLIIR_rbfsvm.output
These steps are automated in the run.sh script.
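For illustration, the automated steps amount to running the three scripts above in order, roughly equivalent to this Python sketch (run.sh itself is a shell script; the run_pipeline helper here is our own):

```python
import subprocess

# The three pipeline stages, in order. The paths are taken directly
# from the usage instructions above.
PIPELINE = [
    ["python", "./data-set/generateCombined.py"],
    ["python", "./feature-generation/mostFeatures.py"],
    ["python", "./experiment/experiment.py"],
]

def run_pipeline(steps=PIPELINE):
    for step in steps:
        # check_call raises CalledProcessError if a stage fails, so a
        # later stage never runs on a broken intermediate file.
        subprocess.check_call(step)
```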
Some common usage scenarios:
Replicate the TKLBLIIR SemEval result:
Follow the above pipeline exactly (execute the run.sh script).
Evaluation is done using the official evaluation script (also available in the ./experiment/output folder).
The output should be:
838 TKLBLIIR 01_rbfsvm 0.658 0.627 0.691
Annotate a new test set:
Before starting, replace ./data-set/test with your own test file in the same format (the official SemEval 2015 Task 1 format). Run the above pipeline (execute the run.sh script). The output can be found in ./experiment/output/PIT2015_TKLBLIIR_rbfsvm.output.
NOTE:
- If your data set is not as richly tagged as the one from the SemEval task, you can omit the missing tags (use empty strings for them). Our final feature set does not use the tags.
- You will want to download the full embeddings matrix (see the final notes) and filter it to leave only the words appearing in the file combined.data (which now includes your test data).
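Filtering the embeddings matrix can be done with a few lines of Python. The sketch below assumes the common plain-text embeddings format (one word per line followed by its vector components, whitespace-separated); the helper name is our own:

```python
def filter_embeddings(lines, vocabulary):
    """Keep only embedding lines whose first token (the word) is in vocabulary.

    Assumes one word per line followed by its vector components,
    whitespace-separated."""
    vocabulary = set(vocabulary)
    kept = []
    for line in lines:
        parts = line.split(None, 1)  # split off the leading word
        if parts and parts[0] in vocabulary:
            kept.append(line)
    return kept
```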
Perform SVM parameter optimisation:
The C and gamma SVM parameter values are hard-coded into experiment.py. If you wish to verify these values using grid search (or find better ones), go to the experiment subfolder and run
./experiment/opt-params.py N
where N is the number of parallel threads you can afford to use. The script performs a grid search over the space of possible C and gamma values, guided by 10-fold cross-validation on the training set. Depending on your computer, this may take a long time. After the search has finished, the optimal parameters are reported.
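For illustration, the core of such a grid search can be sketched using libsvm's built-in cross-validation mode (svm-train -v n prints a line of the form "Cross Validation Accuracy = 65.8%"). This is a simplified, sequential sketch, not the actual opt-params.py implementation:

```python
import itertools
import re
import subprocess

def parse_cv_accuracy(output):
    """Extract the accuracy from svm-train's '-v' cross-validation output."""
    match = re.search(r"Cross Validation Accuracy = ([\d.]+)%", output)
    return float(match.group(1)) if match else None

def grid_search(train_file, log2c_range, log2g_range, folds=10):
    """Exhaustive (C, gamma) grid search with k-fold cross-validation.

    Follows the usual libsvm practice of searching over powers of two."""
    best = (None, None, -1.0)  # (C, gamma, accuracy)
    for log2c, log2g in itertools.product(log2c_range, log2g_range):
        cmd = ["svm-train", "-v", str(folds),
               "-c", str(2.0 ** log2c), "-g", str(2.0 ** log2g),
               train_file]
        output = subprocess.check_output(cmd).decode("utf-8", "replace")
        accuracy = parse_cv_accuracy(output)
        if accuracy is not None and accuracy > best[2]:
            best = (2.0 ** log2c, 2.0 ** log2g, accuracy)
    return best
```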
Change the features used:
The format of a feature file is (e.g. for two features):
(line 0) featurename1 | featurename2
(line 1) example1-ID | ex1-featurename1-value | ex1-featurename2-value
(line 2) example2-ID | ex2-featurename1-value | ex2-featurename2-value
Each file can contain one or more features, and there can be more than one file. The models use all features from all feature files located in ./experiment/feature-files.
You may change the features as you wish by manipulating these files, e.g. you may add a new feature by adding a properly formatted file to this folder. This is exactly how the ./feature-generation/mostFeatures.py script adds features.
You can find the raw text data (including IDs) from which to generate your features in the file ./data-set/combined.data.
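A new feature file in the format described above can be generated with a small helper like this (a sketch; the function names and the exact " | " separator with surrounding spaces are assumptions based on the format example):

```python
def format_feature_lines(feature_names, rows):
    """Build the lines of a feature file: a header of feature names,
    then one line per example with its ID followed by its values."""
    lines = [" | ".join(feature_names)]
    for example_id, values in rows:
        lines.append(" | ".join([str(example_id)] + [str(v) for v in values]))
    return lines

def write_feature_file(path, feature_names, rows):
    """Write the formatted feature lines to a file, one line each."""
    with open(path, "w") as out:
        out.write("\n".join(format_feature_lines(feature_names, rows)) + "\n")
```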
2.4 Final notes
3 License
This code is available under a derivative of a BSD license that requires proper attribution. Essentially, you may use, modify, and sell this software or derived software in academic and non-academic, commercial and non-commercial, open-source and closed-source settings, but you must give proper credit.