GSOC2015_Progress_Emilio

Emilio Dorigatti edited this page Aug 17, 2015 · 20 revisions

Warmup Period (until 25th of May) Worked on warm-up tickets on GitHub and built an experimental date normalization module using the ANTLR4 bindings for Python and the provided grammar. We decided to abandon this approach and write our own regular expressions directly in Python.
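
As an illustration of the regular-expression approach, here is a minimal sketch of a date normalizer for Italian text. The patterns, the `normalize_dates` name and the output format are assumptions made for this example, not the module's actual rules, and calendar validity (e.g. a 31st of February) is not checked.

```python
import re

# Hypothetical month table; the real normalizer's patterns are not shown on this page.
MONTHS = {
    "gennaio": 1, "febbraio": 2, "marzo": 3, "aprile": 4,
    "maggio": 5, "giugno": 6, "luglio": 7, "agosto": 8,
    "settembre": 9, "ottobre": 10, "novembre": 11, "dicembre": 12,
}
MONTH_RE = "|".join(MONTHS)

# Patterns are tried from the most specific to the least specific.
FULL_DATE = re.compile(r"\b(\d{1,2})\s+(%s)\s+(\d{4})\b" % MONTH_RE, re.IGNORECASE)
YEAR_SPAN = re.compile(r"\b(\d{4})\s*[-/]\s*(\d{4})\b")
YEAR = re.compile(r"\b(1[89]\d{2}|20\d{2})\b")


def normalize_dates(text):
    """Return (matched text, normalized value) pairs found in `text`."""
    results = []
    taken = []  # character spans already claimed by a more specific pattern

    def is_free(start, end):
        return all(end <= s or start >= e for s, e in taken)

    for m in FULL_DATE.finditer(text):
        day, month, year = int(m.group(1)), MONTHS[m.group(2).lower()], int(m.group(3))
        results.append((m.group(0), "%04d-%02d-%02d" % (year, month, day)))
        taken.append(m.span())

    for m in YEAR_SPAN.finditer(text):
        if is_free(*m.span()):
            results.append((m.group(0), "%s/%s" % (m.group(1), m.group(2))))
            taken.append(m.span())

    for m in YEAR.finditer(text):
        # Skip years that are already part of a full date or a year span.
        if is_free(*m.span()):
            results.append((m.group(0), m.group(1)))
            taken.append(m.span())

    return results
```

Trying the specific patterns first and recording the spans they claim keeps a bare year from being reported again when it is already part of a longer expression.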

First Week (25/5 - 31/5) Some ideas about the date normalizer, meeting at FBK with mentor Marco Fossati.

Second Week (1/6 - 7/6) First prototype of the date normalizer, reviewed the crowd-annotated gold standard.

Third Week (8/6 - 14/6) Exams!

Fourth and Fifth Weeks (15/6 - 28/6) Almost finished and tested the date normalizer, as well as the code that uses it.

Fifth Week (29/6 - 5/7) Final refinements for the mid-term: successfully outputting reified triples, plus a script for transforming the Wikipedia dump into sentences about soccer.

Sixth Week (6/7 - 12/7) Refactoring and cleaning of the code base, experiments with the unsupervised classifier. As it turned out, the classifier depends heavily on the quality of the entities produced by the linker (for example, stagione 2010-2011, i.e. "2010-2011 season", was linked to Serie B) and on the mapping between frame elements and ontology types in DBpedia.

Seventh Week (13/7 - 19/7) Script to compute Fleiss' kappa on the CrowdFlower results; slowly refactoring the code base.
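
For reference, Fleiss' kappa can be computed from a table of per-item judgment counts as in the sketch below. The `fleiss_kappa` name and the input layout are assumptions made for this example; the actual script and the CrowdFlower export format are not shown here.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table of judgment counts.

    counts[i][j] is the number of annotators who assigned item i to
    category j. Every item must have the same number of judgments.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("at least two judgments per item are required")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must have the same number of judgments")

    n_categories = len(counts[0])
    total = float(n_items * n_raters)

    # Share of all judgments that fell into each category.
    p_cat = [sum(row[j] for row in counts) / total for j in range(n_categories)]

    # Observed agreement on each item, then averaged over items.
    pairs = float(n_raters * (n_raters - 1))
    p_item = [(sum(c * c for c in row) - n_raters) / pairs for row in counts]
    p_observed = sum(p_item) / n_items

    # Agreement expected by chance.
    p_expected = sum(p * p for p in p_cat)
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when all judgments share one category")

    return (p_observed - p_expected) / (1.0 - p_expected)
```

With made-up data, `fleiss_kappa([[3, 0], [0, 3], [2, 1]])` scores three items judged by three annotators each over two categories.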

Eighth Week (20/7 - 26/7) Holidays.

Ninth Week (27/7 - 2/8) Created rules to run the supervised classifier, thought about how to score the confidence of triples, and implemented a score for unsupervised classification based on the entity linking score.

Tenth Week (3/8 - 9/8) Scoring the facts produced by supervised classification and serializing the triples' scores in a separate dataset, heavy refactoring of the classifier.

Eleventh Week (10/8 - 16/8) Integrated the mappings to DBPO into the output triples, found some critical bugs which might be responsible for the low frame classification performance. Fixed the bugs and computed the confusion matrix for the classifier. Performance is good but not exceptional, perhaps due to problems in how the training set was built with respect to the gold standard (i.e. some labels in the gold standard do not appear in the training set).
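
The kind of evaluation described above can be sketched as follows: a confusion matrix built from gold and predicted labels, per-class precision and recall derived from it, and a check for gold labels that never occur in the training set. All function names are hypothetical and chosen for this example; this is not the project's evaluation code.

```python
from collections import Counter


def confusion_matrix(gold, predicted):
    """Count how often each (gold label, predicted label) pair occurs."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    return Counter(zip(gold, predicted))


def precision_recall(matrix, label):
    """Precision and recall of one label, read off the confusion matrix."""
    true_positives = matrix[(label, label)]
    predicted_as_label = sum(n for (g, p), n in matrix.items() if p == label)
    actually_label = sum(n for (g, p), n in matrix.items() if g == label)
    precision = true_positives / float(predicted_as_label) if predicted_as_label else 0.0
    recall = true_positives / float(actually_label) if actually_label else 0.0
    return precision, recall


def labels_missing_from_training(gold_labels, training_labels):
    """Gold-standard labels that the classifier never saw during training."""
    return set(gold_labels) - set(training_labels)
```

A label returned by `labels_missing_from_training` can never be predicted, so its recall is necessarily zero regardless of how the classifier is tuned.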

Twelfth Week (17/8 - 23/8) As we are using the 'C-SVC' flavour of libsvm, I explored how the classifier's performance changes as C takes the values 0.001, 0.01, 0.1, 1, 10, 100 and 1000.

As for roles, the best performance is obtained with C values near 1 (0.1, 1, 10). Here, most roles keep good precision and recall (0.7 or higher), except for Agente, Premio and Perdente (but, curiously, not Vincitore). Perdente has low recall (below 0.25) for every value of C except 1, where it reaches 0.45, while its precision stays stable at 0.8. Agente and Premio always have low recall (< 0.35), while their precision peaks at C = 10 (0.8 and 0.67 respectively).

The confusion matrix shows a lot of activity around these classes, as well as around Competizione and O (i.e. not tagged), which is understandable as they require some semantic understanding of the sentence. For example, Competizione is often mistaken for Premio and, in fact, the distinction is not clear at all (think of Ha vinto la Champions League, "He won the Champions League"). Squadra, Agente, Vincitore and Perdente are another example of classes that the classifier frequently confuses.

The classifier tends to tag a significant number of Competizione, Agente and O tokens as Entità, which is understandable considering the scarce and ambiguous training examples tagged as Entità (34 out of more than 55,000 role examples in almost 3,000 sentences). Two distinct strips can be noticed in the row and column of the O role, representing tokens that were missed and tokens that were tagged by mistake.

Frame classification is very stable, with very high precision and recall (above 0.8) for all classes except Stato, which has precision below 0.4 and is often mistaken for Attività.