TweetingJay - the TakeLab and LIIR semantic similarity detection system

Version: 1.0
Release date: February 19, 2015


For other TakeLab data and software see our software and resources section.

1 Description

TweetingJay is a system for detecting semantic similarity between tweets. It was entered as TKLBLIIR in SemEval-2015 (Semantic Evaluation Exercises), Task 1: Paraphrase and Semantic Similarity in Twitter.

Given two tweets, t1 and t2, the system must make a binary yes/no decision about whether the tweets are semantically equivalent. The SemEval-2015 organizers measured the performance of our system by comparing its output with the aggregated similarity judgments obtained from human subjects on the same data set, using the F1 measure.
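For reference, the F1 measure over binary paraphrase decisions can be sketched as below. This is a minimal stand-alone illustration, not the official evaluation script shipped with the system:

```python
def f1_score(gold, pred):
    """Compute precision, recall, and F1 for binary (0/1) paraphrase labels."""
    tp = sum(1 for g, p in zip(gold, pred) if g and p)       # true positives
    fp = sum(1 for g, p in zip(gold, pred) if not g and p)   # false positives
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)   # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

F1 is the harmonic mean of precision and recall, so a system cannot score well by over- or under-predicting the positive (paraphrase) class.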

If you use the software or the datasets, please cite the following paper:

TODO apalike citation with link

The BibTeX format is:
  @InProceedings{TODO,
    author    = {Glava\v{s}, Goran  and  Karan, Mladen  
                 and  \v{S}najder, Jan  and  Dalbelo Ba\v{s}i\'{c}, Bojana and Vuli\v{c}, Ivan and Moens, Marie-Francine},
    title     = {TKLBLIIR: Detecting Twitter Paraphrases with TweetingJay},
    booktitle = {Proceedings of the TODO
                 {(SemEval 2015)}},
    month     = {TODO},
    year      = {2015},
    address   = {Denver, United States of America},
    publisher = {Association for Computational Linguistics},
    pages     = {x-y},
    url       = {TODO}
  }
  

2 Instructions

2.1 Installation prerequisites

You will need Python (2.7.6) with the nltk, numpy, and joblib libraries, plus the LIBSVM command-line tools.

Unix (Ubuntu or similar):

You very likely already have Python; if not, install it first from www.python.org. You can get everything else with:
      sudo apt-get install python-nltk python-numpy python-joblib libsvm-tools
      python -c "import nltk; nltk.download('punkt')" 
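Before running the pipeline, you can quickly check that the required libraries are importable. This is a Python 3-style check using importlib (under Python 2.7 you would simply try importing each package); the package names are the ones listed above:

```python
import importlib.util

def missing_packages(names):
    """Return the subset of package names that cannot be found on this system."""
    return [n for n in names if importlib.util.find_spec(n) is None]

missing = missing_packages(["nltk", "numpy", "joblib"])
if missing:
    print("Missing packages: " + ", ".join(missing))
else:
    print("All prerequisites found.")
```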
      

Windows:

Not tested, but the pipeline should work. Roughly, you should install the same prerequisites (Python 2.7 with nltk, numpy, and joblib, plus the LIBSVM command-line tools) using their Windows installers.

2.2 Download

The source code and the data files can be downloaded from this location:

2.3 Usage

Expected usage is as follows:
  1. Generate a combined (train + dev + test) data set file:
    ./data-set/generateCombined.py 
  2. Generate the features:
    ./feature-generation/mostFeatures.py
  3. Train an SVM model on train + dev and apply it to test:
    ./experiment/experiment.py
Once finished, the output can be found in ./experiment/output/PIT2015_TKLBLIIR_rbfsvm.output
These steps are automated in the run.sh script.

Some common usage scenarios:

Replicate the TKLBLIIR SemEval result:

Follow the above pipeline exactly (execute the run.sh script).

Evaluation is done using the official evaluation script (also available in the ./experiment/output folder). The output should be:
  838	TKLBLIIR	01_rbfsvm		0.658	0.627	0.691	
  

Annotate a new test set

Before starting, replace ./data-set/test with your own test file in the same format (the official SemEval-2015 Task 1 format). Run the above pipeline (execute the run.sh script). The output can be found in ./experiment/output/PIT2015_TKLBLIIR_rbfsvm.output.

Perform SVM parameter optimisation:

The C and gamma SVM parameter values are hard-coded into experiment.py. If you wish to verify these values using grid search (or find better ones), go to the experiment subfolder and run
./experiment/opt-params.py N 
where N is the number of parallel threads you can afford to use. The script performs a grid search over the space of possible C and gamma values, guided by 10-fold cross-validation on the training set. Depending on your computer, this may take a long time. After the search has finished, the optimal parameters are reported.
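The grid-search logic can be sketched as follows. The parameter ranges and the scoring callback are placeholders, not the actual grid or cross-validation code used by opt-params.py:

```python
import itertools

# Hypothetical log-scale grids; the ranges searched by opt-params.py
# are an assumption here.
C_GRID = [2 ** k for k in range(-5, 16, 2)]
GAMMA_GRID = [2 ** k for k in range(-15, 4, 2)]

def grid_search(cv_score):
    """Return the (C, gamma) pair with the best cross-validation score.

    cv_score(C, gamma) is expected to run 10-fold cross-validation on
    the training set and return a quality measure (e.g. F1).
    """
    return max(itertools.product(C_GRID, GAMMA_GRID),
               key=lambda params: cv_score(*params))
```

Each (C, gamma) pair is scored independently, which is why the search parallelizes cleanly across N threads.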

Change the features used:

The format of a feature file is (e.g. for two features):
     (line 0) featurename1 | featurename2
     (line 1) example1-ID | ex1-featurename1-value | ex1-featurename2-value
     (line 2) example2-ID | ex2-featurename1-value | ex2-featurename2-value
     
Each file can contain one or more features. There can be more than one file. The models use all features from all feature files located in ./experiment/feature-files.

You may change the features as you wish by manipulating these files. For example, you can add a new feature by adding a properly formatted file to this folder. This is exactly how the ./feature-generation/mostFeatures.py script adds features.
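For illustration, a minimal writer for this format (write_feature_file is a hypothetical helper, not part of the distributed code; the exact whitespace around the | delimiter is an assumption):

```python
def write_feature_file(path, names, rows):
    """Write a feature file in the pipe-separated format described above.

    names: feature names, written as the header line
    rows:  (example_id, value_1, ..., value_n) tuples, one per example
    """
    with open(path, "w") as f:
        f.write(" | ".join(names) + "\n")
        for row in rows:
            f.write(" | ".join(str(x) for x in row) + "\n")
```

Dropping such a file into ./experiment/feature-files makes its columns available to the model alongside the existing features.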

You can find the raw text data (including IDs) from which to generate your features in the file ./data-set/combined.data.

3 License

This code is available under a derivative of the BSD license that requires proper attribution. Essentially, you can use, modify, and sell this software or derived software in academic and non-academic, commercial and non-commercial, open-source and closed-source settings, but you must give proper credit.