By D. Miklós, Vera T. Sós, T. Szőnyi

[...] 2), since it is supposed to quantify how “surprising” a collocation candidate is.

[...] sufficient discriminatory power. This might be the reason why these descriptions have not really been taken into account by practical work on collocation identification. In particular, techniques based on semantic criteria that have been successfully used to detect semantically opaque expressions (Lin, 1999; Fazly, 2007) are not applicable to collocations, which are rather compositional.

[...] 1, which also shows the typical notations for the marginal and joint frequencies.

[...] 1 Contingency table for the candidate pair (u, v). X, Y = random variables associated with each position in the pair; a = joint frequency; R1, C1 = marginal frequencies for u, resp. v.

Here a represents the number of items in the candidate data that have u in the first position and v in the second; b represents the number of items that have u in the first position and ¬v in the second, and so on. The sum R1 = a + b is the frequency of all pairs with u in the first position, also written as (u, ∗) or (u, •); similarly, C1 = a + c is the frequency of all pairs with v in the second position, written as (∗, v) or (•, v).
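The cell and marginal counts described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `contingency_table`, the toy list of pairs, and the names `d`, `R2`, `C2`, and `N` for the remaining cell, marginals, and total are not taken from the text, which only spells out a, b, R1, and C1.

```python
from collections import Counter


def contingency_table(pairs, u, v):
    """Return the cells (a, b, c, d) for the candidate pair (u, v).

    `pairs` is a list of (first, second) tuples, one per candidate occurrence.
    The names c and d follow the usual 2x2 layout and are an assumption here.
    """
    joint = Counter(pairs)                    # frequency of each (first, second) pair
    first = Counter(x for x, _ in pairs)      # frequency of each item in position 1
    second = Counter(y for _, y in pairs)     # frequency of each item in position 2
    n = len(pairs)                            # total number of pairs

    a = joint[(u, v)]       # u in first position, v in second
    b = first[u] - a        # u in first position, not v in second
    c = second[v] - a       # not u in first position, v in second
    d = n - a - b - c       # neither u first nor v second
    return a, b, c, d


# Hypothetical toy data, for illustration only.
pairs = [
    ("strong", "tea"),
    ("strong", "tea"),
    ("strong", "coffee"),
    ("green", "tea"),
    ("hot", "coffee"),
]

a, b, c, d = contingency_table(pairs, "strong", "tea")
R1, R2 = a + b, c + d       # row marginals: (u, *) and (not u, *)
C1, C2 = a + c, b + d       # column marginals: (*, v) and (*, not v)
N = a + b + c + d
print(a, b, c, d, R1, C1, N)
```

Deriving b and c from the marginal counts, rather than scanning the data again, mirrors the relations R1 = a + b and C1 = a + c given in the text.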

The errors that can be made by a hypothesis test are classified as:
– Type I errors: wrongly rejecting the null hypothesis when it is in fact true;
– Type II errors: not rejecting the null hypothesis when it is in fact false.

A type I error is a false positive; a type II error is a false negative. Using a smaller α-level ensures that the test produces fewer type I errors, but has the disadvantage of introducing more type II errors; the opposite holds when increasing the α-level. Choosing the α-level therefore means striking the right balance between precision and recall.
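The trade-off between the two error types can be made concrete with a small Python sketch. Everything in it is assumed for illustration: the function name `count_errors`, the p-values, and the `null_is_false` flags are invented toy data, not results from the text. The sketch assumes the usual decision rule of rejecting the null hypothesis when the p-value falls below α.

```python
def count_errors(p_values, null_is_false, alpha):
    """Count (type I, type II) errors when rejecting H0 for p < alpha.

    `null_is_false[i]` states whether the null hypothesis is actually
    false for candidate i (hypothetical gold-standard information).
    """
    type_1 = 0  # H0 rejected although it is true  -> false positive
    type_2 = 0  # H0 kept although it is false     -> false negative
    for p, h0_false in zip(p_values, null_is_false):
        rejected = p < alpha
        if rejected and not h0_false:
            type_1 += 1
        elif not rejected and h0_false:
            type_2 += 1
    return type_1, type_2


# Hypothetical toy data, for illustration only.
p_values = [0.001, 0.004, 0.02, 0.03, 0.08, 0.20]
null_is_false = [True, True, False, True, True, False]

for alpha in (0.01, 0.05, 0.10):
    t1, t2 = count_errors(p_values, null_is_false, alpha)
    print(f"alpha={alpha}: type I={t1}, type II={t2}")
```

On this toy data, raising α from 0.01 to 0.10 never decreases the number of type I errors and never increases the number of type II errors, which is the pattern the text describes.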