Language Independent Methods of Clustering Similar Contexts ...

Language Independent Methods of Clustering Similar Contexts (with applications) Ted Pedersen University of Minnesota, Duluth [email protected] EACL-2006 Tutorial 1 Language Independent Methods Do not utilize syntactic information Do not utilize dictionaries or other manually created lexical resources Based on lexical features selected from corpora

No parsers, part of speech taggers, etc. required Assumption: word segmentation can be done by looking for white spaces between strings No manually annotated data of any kind, methods are completely unsupervised in the strictest sense EACL-2006 Tutorial 2 Clustering Similar Contexts A context is a short unit of text often a phrase to a paragraph in length, although it can be longer Input: N contexts

Output: K clusters Where the contexts in each cluster are more similar to each other than to the contexts found in other clusters EACL-2006 Tutorial 3 Applications Headed contexts (contain target word) Headless contexts Name Discrimination

Word Sense Discrimination Email Organization Document Clustering Paraphrase identification Clustering Sets of Related Words EACL-2006 Tutorial 4 Tutorial Outline Identifying lexical features Context representations Singular Value Decomposition Clustering

First & second order Dimensionality reduction Measures of association & tests of significance Partitional techniques Cluster stopping Cluster labeling Hands On Exercises EACL-2006 Tutorial 5 General Info

Please fill out the short survey Break from 4:00-4:30pm Finish at 6pm Slides and video from the tutorial will be posted (I will send you email when that is ready) Questions are welcome now, or via email to me or the SenseClusters list Comments, observations, criticisms are all welcome Reception tonight at 7pm at the Castle (?) The Knoppix CD will give you Linux and SenseClusters when your computer is booted from the CD. EACL-2006 Tutorial

6 SenseClusters A package for clustering contexts SenseClusters Live! (Knoppix CD) Integrates with various other tools Ngram Statistics Package CLUTO SVDPACKC EACL-2006 Tutorial 7 Many thanks

Amruta Purandare (M.S., 2004) Founding developer of SenseClusters (2002-2004) Now a PhD student in Intelligent Systems at the University of Pittsburgh Anagha Kulkarni (M.S., 2006, expected) Enhancing SenseClusters since Fall 2004! National Science Foundation (USA) for supporting Amruta, Anagha and me via CAREER award #0092784 EACL-2006 Tutorial

8 Background and Motivations EACL-2006 Tutorial 9 Headed and Headless Contexts A headed context includes a target word Our goal is to cluster the target words based on their surrounding contexts Target word is center of context and our attention A headless context has no target word

Our goal is to cluster the contexts based on their similarity to each other The focus is on the context as a whole EACL-2006 Tutorial 10 Headed Contexts (input) I can hear the ocean in that shell. My operating system shell is bash. The shells on the shore are lovely. The shell command line is flexible. The oyster shell is very hard and black. EACL-2006 Tutorial 11

Headed Contexts (output) Cluster 1: My operating system shell is bash. The shell command line is flexible. Cluster 2: The shells on the shore are lovely. The oyster shell is very hard and black. I can hear the ocean in that shell. EACL-2006 Tutorial 12 Headless Contexts (input)

The new version of Linux is more stable and has better support for cameras. My Chevy Malibu has had some front end troubles. Osborne made one of the first personal computers. The brakes went out, and the car flew into the house. With the price of gasoline, I think I'll be taking the bus more often! EACL-2006 Tutorial 13 Headless Contexts (output) Cluster 1:

The new version of Linux is more stable and has better support for cameras. Osborne made one of the first personal computers. Cluster 2: My Chevy Malibu has had some front end troubles. The brakes went out, and the car flew into the house. With the price of gasoline, I think I'll be taking the bus more often! EACL-2006 Tutorial 14 Web Search as Application Web search results are headed contexts

Search term is target word (found in snippets) Web search results are often disorganized: two people sharing the same name, two organizations sharing the same abbreviation, etc. often have their pages mixed up If you click on search results or follow links in the pages found, you will encounter headless contexts too EACL-2006 Tutorial 15 Name Discrimination EACL-2006 Tutorial 16 George Millers! EACL-2006 Tutorial 17

EACL-2006 Tutorial 18-22 [screenshot slides] Email Foldering as Application Email (public or private) is made up of headless contexts

Short, usually focused Cluster similar email messages together Automatic email foldering Take all messages from a sent-mail file or inbox and organize into categories EACL-2006 Tutorial 23-25 [screenshot slides] Clustering News as

Application News articles are headless contexts Entire article or first paragraph Short, usually focused Cluster similar articles together EACL-2006 Tutorial 26-28 [screenshot slides] EACL-2006 Tutorial

29 What is it to be similar? You shall know a word by the company it keeps (Firth, 1957, Studies in Linguistic Analysis) Meanings of words are (largely) determined by their distributional patterns (Distributional Hypothesis) (Harris, 1968, Mathematical Structures of Language) Words that occur in similar contexts will have similar meanings (Strong Contextual Hypothesis) (Miller and Charles, 1991, Language and Cognitive Processes) Various extensions: Similar contexts will have similar meanings, etc. Names that occur in similar contexts will refer to the same underlying person, etc. EACL-2006 Tutorial 30 General Methodology Represent contexts to be clustered using first or second order feature vectors Reduce dimensionality to make vectors more tractable and/or understandable

Singular value decomposition Cluster the context vectors Lexical features Find the number of clusters Label the clusters Evaluate and/or use the contexts! EACL-2006 Tutorial 31 Identifying Lexical Features Measures of Association and Tests of Significance EACL-2006 Tutorial

32 What are features? Features represent the (hopefully) salient characteristics of the contexts to be clustered Eventually we will represent each context as a vector, where the dimensions of the vector are associated with features Vectors/contexts that include many of the same features will be similar to each other EACL-2006 Tutorial 33 Where do features come from? In unsupervised clustering, it is common

for the feature selection data to be the same data that is to be clustered This is not cheating, since data to be clustered does not have any labeled classes that can be used to assist feature selection It may also be necessary, since we may need to cluster all available data, and not hold out some for a separate feature identification step Email or news articles EACL-2006 Tutorial 34 Feature Selection Test data the contexts to be clustered

Assume that the feature selection data is the same as the test data, unless otherwise indicated Training data a separate corpus of held out feature selection data (that will not be clustered) may need to use if you have a small number of contexts to cluster (e.g., web search results) This sense of training is due to Schütze (1998) EACL-2006 Tutorial 35 Lexical Features Unigram a single word that occurs more than a given number of times Bigram an ordered pair of words that occur together more often than expected by

chance (may be consecutive or have intervening words) Co-occurrence an unordered bigram Target Co-occurrence a co-occurrence where one of the words is the target word EACL-2006 Tutorial 36 Bigrams fine wine (window size of 2) baseball bat house of representatives (window size of 3)

president of the republic (window size of 4) apple orchard Selected using a small window size (2-4 words), trying to capture a regular (localized) pattern between two words (collocation?) EACL-2006 Tutorial 37 Co-occurrences tropics water boat fish law president train travel Usually selected using a larger window (7-10 words) of context, hoping to capture pairs of related words rather than collocations EACL-2006 Tutorial 38
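Both feature types described above can be collected with a simple sliding window. A minimal sketch in plain Python (whitespace tokenization, as assumed earlier; frequency thresholds and stop word removal omitted):

```python
from collections import Counter

def extract_pairs(tokens, window=2, ordered=True):
    """Count word pairs that occur within `window` tokens of each other.

    ordered=True  -> bigrams (first word precedes second; small window, 2-4)
    ordered=False -> co-occurrences (unordered; larger window, e.g. 7-10)
    """
    counts = Counter()
    for i, w1 in enumerate(tokens):
        # pair w1 with every word up to window-1 positions to its right
        for w2 in tokens[i + 1:i + window]:
            pair = (w1, w2) if ordered else tuple(sorted((w1, w2)))
            counts[pair] += 1
    return counts

tokens = "president of the republic met the president of the senate".split()
bigrams = extract_pairs(tokens, window=4, ordered=True)
print(bigrams[("president", "republic")])  # 1
```

A frequency cutoff or a measure of association (next section) would then be applied to these counts to select the final feature set.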

Bigrams and Co-occurrences Pairs of words tend to be much less ambiguous than unigrams bank versus river bank and bank card dot versus dot com and dot product Trigrams and beyond occur much less frequently (Ngrams are very Zipfian) Unigrams are noisy, but bountiful EACL-2006 Tutorial 39 Bigrams occur together more often than expected by chance

Observed frequencies for two words occurring together and alone are stored in a 2x2 matrix Throw out bigrams that include one or two stop words Expected values are calculated, based on the model of independence and observed values How often would you expect these words to occur together, if they only occurred together by chance? If two words occur together significantly more often than the expected value, then the words do not occur together by chance EACL-2006 Tutorial 40

2x2 Contingency Table

              Intelligence  !Intelligence
Artificial         100                          400
!Artificial
                   300                      100,000

EACL-2006 Tutorial 41

2x2 Contingency Table

              Intelligence  !Intelligence
Artificial         100           300            400
!Artificial        200        99,400         99,600
                   300        99,700        100,000

EACL-2006 Tutorial 42

2x2 Contingency Table (expected values in parentheses)

              Intelligence       !Intelligence
Artificial    100.0 (1.2)        300.0 (398.8)          400
!Artificial   200.0 (298.8)      99,400.0 (99,301.2)    99,600
              300                99,700                 100,000

EACL-2006 Tutorial 43

Measures of Association

G^2 = 2 * SUM over i,j of observed(wi, wj) * log( observed(wi, wj) / expected(wi, wj) )

X^2 = SUM over i,j of [ observed(wi, wj) - expected(wi, wj) ]^2 / expected(wi, wj)

EACL-2006 Tutorial 44

Measures of Association

G^2 = 750.88 X^2 = 8191.78 EACL-2006 Tutorial 45 Interpreting the Scores G^2 and X^2 are asymptotically approximated by the chi-squared distribution This means: if you fix the marginal totals of a table, randomly generate internal cell values in the table, calculate the G^2 or X^2 scores for each resulting table, and plot the distribution of the scores, you *should* get the chi-squared distribution EACL-2006 Tutorial
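These scores follow directly from the observed contingency table and the expected values computed from its marginal totals; a minimal sketch:

```python
from math import log

# Observed counts from the "Artificial Intelligence" example
# rows: Artificial / !Artificial, columns: Intelligence / !Intelligence
observed = [[100.0, 300.0],
            [200.0, 99400.0]]

n = sum(map(sum, observed))                 # grand total: 100,000
row = [sum(r) for r in observed]            # row totals: 400, 99,600
col = [sum(c) for c in zip(*observed)]      # column totals: 300, 99,700

# expected counts under the model of independence
expected = [[row[i] * col[j] / n for j in range(2)] for i in range(2)]

g2 = 2 * sum(observed[i][j] * log(observed[i][j] / expected[i][j])
             for i in range(2) for j in range(2))
x2 = sum((observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
         for i in range(2) for j in range(2))

print(round(g2, 2), round(x2, 2))  # 750.88 8191.78
```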

46 EACL-2006 Tutorial 47 Interpreting the Scores Values above a certain level of significance can be considered grounds for rejecting the null hypothesis H0: the words in the bigram are independent 3.841 is associated with 95% confidence that the null hypothesis should be rejected EACL-2006 Tutorial 48 Measures of Association

There are numerous measures of association that can be used to identify bigram and co-occurrence features Many of these are supported in the Ngram Statistics Package (NSP) EACL-2006 Tutorial 49 Measures Supported in NSP Log-likelihood Ratio (ll)

True Mutual Information (tmi) Pearson's Chi-squared Test (x2) Pointwise Mutual Information (pmi) Phi coefficient (phi) T-test (tscore) Fisher's Exact Test (leftFisher, rightFisher) Dice Coefficient (dice) Odds Ratio (odds) EACL-2006 Tutorial 50 Summary Identify lexical features based on frequency counts or measures of association either in the data to be clustered or in a separate set of feature selection data

Unigrams usually only selected by frequency Language independent Remember, no labeled data from which to learn, so somewhat less effective as features than in supervised case Bigrams and co-occurrences can also be selected by frequency, or better yet measures of association Bigrams and co-occurrences need not be consecutive Stop words should be eliminated Frequency thresholds are helpful (e.g., unigram/bigram that occurs once may be too rare to be useful) EACL-2006 Tutorial 52 Related Work

Moore, 2004 (EMNLP) follow-up to Dunning and Pedersen on log-likelihood and exact tests Pedersen, 1996 (SCSUG) explanation of exact tests, and comparison to log-likelihood (also see Pedersen, Kayaalp, and Bruce, AAAI-1996) Dunning, 1993 (Computational Linguistics) introduces the log-likelihood ratio for collocation identification EACL-2006 Tutorial 53 Context Representations First and Second Order Methods EACL-2006 Tutorial 54

Once features selected We will have a set of unigrams, bigrams, cooccurrences or target co-occurrences that we believe are somehow interesting and useful We also have any frequency and measure of association score that have been used in their selection Convert contexts to be clustered into a vector representation based on these features EACL-2006 Tutorial 55 First Order Representation Each context is represented by a vector with M dimensions, each of

which indicates whether or not a particular feature occurred in that context Value may be binary, a frequency count, or an association score Context by Feature representation EACL-2006 Tutorial 56 Contexts Cxt1: There was an island curse of black magic cast by that voodoo child. Cxt2: Harold, a known voodoo child, was

gifted in the arts of black magic. Cxt3: Despite their military might, it was a serious error to attack. Cxt4: Military might is no defense against a voodoo child or an island curse. EACL-2006 Tutorial 57

Unigram Feature Set

island  1000
black    700
curse    500
magic    400
child    200

(assume these are frequency counts obtained from some corpus) EACL-2006 Tutorial 58

First Order Vectors of Unigrams

          Cxt1  Cxt2  Cxt3  Cxt4
island      1     0     0     1
black       1     1     0     0
curse       1     0     0     1
magic       1     1     0     0
child       1     1     0     1

EACL-2006 Tutorial 59

Bigram Feature Set

island curse      189.2
black magic       123.5
voodoo child      120.0
military might    100.3
serious error      89.2
island child       73.2
voodoo might       69.4
military error     54.9
black child        43.2
serious curse      21.2

(assume these are log-likelihood scores based on frequency counts from some corpus) EACL-2006 Tutorial 60

First Order Vectors of Bigrams

       black magic  island curse  military might  serious error  voodoo child
Cxt1        1             1              0               0              1
Cxt2        1             0              0               0              1
Cxt3        0             0              1               1              0
Cxt4        0             1              1               0              1

EACL-2006 Tutorial 61

First Order Vectors Can have binary values or weights associated with frequency, etc. Forms a context by feature matrix May optionally be smoothed/reduced with Singular Value Decomposition
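The first order construction can be sketched directly from the example contexts and the unigram feature set above:

```python
# First order representation: one binary dimension per unigram feature
features = ["island", "black", "curse", "magic", "child"]

contexts = [
    "There was an island curse of black magic cast by that voodoo child.",
    "Harold, a known voodoo child, was gifted in the arts of black magic.",
    "Despite their military might, it was a serious error to attack.",
    "Military might is no defense against a voodoo child or an island curse.",
]

def first_order_vector(context, features):
    # lowercase and strip punctuation so "child." matches "child"
    tokens = {w.strip(".,!?").lower() for w in context.split()}
    return [1 if f in tokens else 0 for f in features]

vectors = [first_order_vector(c, features) for c in contexts]
for v in vectors:
    print(v)
```

The four printed rows match the columns of the table above (e.g., Cxt3 contains none of the five features and gets an all-zero vector).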

More on that later The contexts are ready for clustering More on that later EACL-2006 Tutorial 62 Second Order Features First order features encode the occurrence of a feature in a context Feature occurrence represented by binary value Second order features encode something extra about a feature that occurs in a context

Feature occurrence represented by word cooccurrences Feature occurrence represented by context occurrences EACL-2006 Tutorial 63 Second Order Representation First, build word by word matrix from features Based on bigrams or co-occurrences First word is row, second word is column, cell is score (optionally) reduce dimensionality w/SVD Each row forms a vector of first order co-occurrences Second, replace each word in a context with its

row/vector as found in the word by word matrix Average all the word vectors in the context to create the second order representation Due to Schütze (1998), related to LSI/LSA EACL-2006 Tutorial 64

Word by Word Matrix

           magic  curse  might  error  child
black      123.5    0      0      0     43.2
island       0    189.2    0      0     73.2
military     0      0    100.3   54.9    0
serious      0     21.2    0     89.2    0
voodoo       0      0     69.4    0    120.0

EACL-2006 Tutorial
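A sketch of this construction using only the rows shown in the matrix above (a full matrix would also contain rows for words like curse, magic and child, so the averages on the later slide can differ slightly):

```python
# Word by word matrix from the slide (log-likelihood scores of bigrams);
# each row is a first order co-occurrence vector for that word
columns = ["magic", "curse", "might", "error", "child"]
word_vectors = {
    "black":    [123.5,   0.0,   0.0,  0.0,  43.2],
    "island":   [  0.0, 189.2,   0.0,  0.0,  73.2],
    "military": [  0.0,   0.0, 100.3, 54.9,   0.0],
    "serious":  [  0.0,  21.2,   0.0, 89.2,   0.0],
    "voodoo":   [  0.0,   0.0,  69.4,  0.0, 120.0],
}

def second_order(context):
    # replace each known word with its row vector, then average the rows
    rows = [word_vectors[w.strip(".,").lower()] for w in context.split()
            if w.strip(".,").lower() in word_vectors]
    return [sum(col) / len(rows) for col in zip(*rows)]

cxt1 = "There was an island curse of black magic cast by that voodoo child."
vec = second_order(cxt1)   # averages the black, island and voodoo rows
print([round(v, 1) for v in vec])
```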

65 Word by Word Matrix can also be used to identify sets of related words In the case of bigrams, rows represent the first word in a bigram and columns represent the second word, so the matrix is asymmetric In the case of co-occurrences, rows and columns are equivalent, so the matrix is symmetric The vector (row) for each word represents a set of first order features for that word Each word in a context to be clustered for which a vector exists (in the word by word matrix) is replaced by that vector in that context EACL-2006 Tutorial 66

There was an island curse of black magic cast by that voodoo child.

          magic  curse  might  error  child
black     123.5    0      0      0     43.2
island      0    189.2    0      0     73.2
voodoo      0      0     69.4    0    120.0

EACL-2006 Tutorial

67 Second Order Co-Occurrences Word vectors for black and island show similarity, as both occur with child black and island are second order co-occurrences of each other, since both occur with child but not with each other (i.e., black island is not observed) EACL-2006 Tutorial 68

Second Order Representation There was an [curse, child] curse of [magic, child] magic cast by that [might, child] child: each word with a row in the matrix is replaced by its vector, and the vectors are averaged: ([curse, child] + [magic, child] + [might, child]) EACL-2006 Tutorial 69

There was an island curse of black magic cast by that voodoo child.

        magic  curse  might  error  child
Cxt1    41.2   63.1   24.4     0    78.8

EACL-2006 Tutorial 70

Second Order Representation Results in a Context by Feature (Word) Representation Cell values do not indicate if the feature occurred in the context. Rather, they show the strength of association of that feature with other words that occur with a word in the context. EACL-2006 Tutorial 71

Summary First order representations are intuitive,

but Can suffer from sparsity Contexts represented based on the features that occur in those contexts Second order representations are harder to visualize, but Allow a word to be represented by the words it co-occurs with (i.e., the company it keeps) Allows a context to be represented by the words that occur with the words in the context Helps combat sparsity EACL-2006 Tutorial 72 Related Work

Pedersen and Bruce 1997 (EMNLP) presented first order method of discrimination Schütze 1998 (Computational Linguistics) introduced second order method Purandare and Pedersen 2004 (CoNLL) compared first and second order methods First order better if you have lots of data Second order better with smaller amounts of data EACL-2006 Tutorial 73 Dimensionality Reduction

Singular Value Decomposition EACL-2006 Tutorial 74 Motivation First order matrices are very sparse Context by feature Word by word NLP data is noisy (no stemming performed, synonyms) EACL-2006 Tutorial

75 Many Methods Singular Value Decomposition (SVD) SVDPACKC Multi-Dimensional Scaling (MDS) Principal Components Analysis (PCA) Independent Components Analysis (ICA) Linear Discriminant Analysis (LDA) etc EACL-2006 Tutorial 76 Effect of SVD

SVD reduces a matrix to a given number of dimensions This may convert a word level space into a semantic or conceptual space If dog and collie and wolf are dimensions/columns in a word co-occurrence matrix, after SVD they may be collapsed into a single dimension that represents canines EACL-2006 Tutorial 77 Effect of SVD The dimensions of the matrix after SVD are principal components that represent the meaning of concepts Similar columns are grouped together

SVD is a way of smoothing a very sparse matrix, so that there are very few zero valued cells after SVD EACL-2006 Tutorial 78 How can SVD be used? SVD on first order contexts will reduce a context by feature representation down to a smaller number of features Latent Semantic Analysis typically performs SVD on a feature by context representation, where the contexts are reduced SVD used in creating second order context representations Reduce word by word matrix EACL-2006 Tutorial
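A minimal sketch of the reduction step, assuming NumPy is available; the small matrix here is made up for illustration and is not the one on the next slide:

```python
import numpy as np

# a small word by word count matrix (rows: words, columns: co-occurring words)
A = np.array([
    [2., 0., 0., 1., 3.],   # pc
    [0., 3., 0., 0., 0.],   # body
    [1., 0., 0., 2., 0.],   # disk
    [2., 0., 0., 1., 3.],   # linux (identical to pc here)
])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# rank-2 approximation: keep only the two largest singular values
k = 2
A2 = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print(s.round(2))        # singular values in decreasing order
print(A2.round(2))       # smoothed matrix: similar rows pulled together
```

Keeping only the top singular values discards the weakest directions of variation, which is the smoothing effect described above.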

79 Word by Word Matrix appl e bloo d cell s ibm dat a box tissu e graphic s pc

        apple  blood  cells  ibm  data  box  tissue  graphics  memory  organ  plasma
pc        2      0      0     1    3     1     0        0         0      0      0
body      0      3      0     0    0     0     2        0         0      2      1
disk      1      0      0     2    0     3     0        1         2      0      0
petri     0      2      1     0    0     0     2        0         1      0      1
lab       0      0      3     0    2     0     2        0         2      1      3
sales     0      0      0     2    3     0     0        1         2      0      0
linux     2      0      0     1    3     2     0        1         1      0      0
debt      0      0      0     2    3     4     0        2         0      0      0

EACL-2006 Tutorial 80

Singular Value Decomposition: A = UDV'

EACL-2006 Tutorial 81

U

.05 .59 .44 .08 -.4 9 .52 -.0 9 .40 .02 .63 .20 -.00 -.02 -.0 9 -.4

4 -.04 -.6 -.02 -.01 .35 .13 .39 -.6 0 .31 .41 -.2 2 .20 -.39 .00 .03

.08 -.4 5 .25 -.0 2 .17 .09 .83 .05 -.26 -.01 .00 .29 -.6 8 -.4

5 -.3 4 -.3 1 .02 -.2 1 .01 .43 -.02 -.07 .37 -.0 1 -.3

1 .09 .72 -.4 8 -.0 4 .03 .31 -.00 .08 .46 .11 -.0 .24 -.0 EACL-2006 Tutorial

.39 .05 .08 .08 -.00 -.01 82 D 9.19 6.36 3.99 3.25 2.52 2.30 1.26 0.66 0.00 0.00 0.00 EACL-2006 Tutorial

83 V .21 .08 -.04 .28 .04 .86 -.05 -.05 -.31 -.12 .03 .04 -.37 .57

.39 .23 -.04 .26 -.02 .03 .25 .44 .11 -.39 -.27 -.32 -.30 .06 .17 .15 -.41 .58

.07 .37 .15 .12 -.12 .39 -.17 -.13 .71 -.31 -.12 .03 .63 -.01 -.45

.52 -.09 -.26 .08 -.06 .21 .08 -.0 2 .49 .27 .50 -.32 -.45 .13

.02 -.01 .31 .12 -.0 3 .09 -.51 .20 .05 -.05 .02 .29 .08 -.04

-.31 -.7 1 .25 .15 -.12 .02 -.32 .05 -.59 -.62 -.23 .07 .28 -.23 -.14 -.45 .64

.17 -.04 -.32 .12 -.0 3 84 .11 EACL-2006 Tutorial .31 Word by Word Matrix After SVD apple blood cells ibm

        apple  blood  cells  ibm  data  tissue  graphics  memory  organ  plasma
pc       .73    .00    .11   1.3   2.0    .01      .86      .77     .00    .09
body     .00   1.2    1.3    .00   .33   1.6       .00      .85     .84   1.5
disk     .76    .00    .01   1.3   2.1    .00      .91      .72     .00    .00
germ     .00   1.1    1.2    .00   .49   1.5       .00      .86     .77   1.4
lab      .21   1.7    2.0    .35   1.7   2.5       .18     1.7     1.2    2.3
sales    .73    .15    .39   1.3   2.2    .35      .85      .98     .17    .41
linux    .96    .00    .16   1.7   2.7    .03     1.1      1.0      .00    .13
debt    1.2     .00    .00   2.1   3.2    .00     1.5      1.1      .00    .00

EACL-2006 Tutorial 85

Second Order Representation

I got a new disk today! What do you think of linux?

        apple  blood  cells  ibm  data  tissue  graphics  memory  organ  plasma
disk     .76    .00    .01   1.3   2.1    .00      .91      .72     .00    .00
linux    .96    .00    .16   1.7   2.7    .03     1.1      1.0      .00    .13

These two contexts share no words in common, yet they are similar! disk and linux both occur with Apple, IBM, data, graphics, and memory The two contexts are similar because they share many second order co-occurrences EACL-2006 Tutorial 86

Relationship to LSA Latent Semantic Analysis uses a feature by context first order representation Indicates all the contexts in which a feature occurs Use SVD to reduce dimensions (contexts) Cluster features based on similarity of the contexts in which they occur Represent sentences using an average of feature vectors
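The claim can be checked directly with cosine similarity over the post-SVD rows for disk and linux:

```python
from math import sqrt

# post-SVD rows for disk and linux from the "Word by Word Matrix After SVD" slide
disk  = [.76, .00, .01, 1.3, 2.1, .00, .91, .72, .00, .00]
linux = [.96, .00, .16, 1.7, 2.7, .03, 1.1, 1.0, .00, .13]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

print(round(cosine(disk, linux), 3))  # a value very close to 1
```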

EACL-2006 Tutorial 87

Feature by Context Representation

                 Cxt1  Cxt2  Cxt3  Cxt4
black magic        1     1     0     0
island curse       1     0     0     1
military might     0     0     1     1
serious error      0     0     1     0
voodoo child       1     1     0     1

EACL-2006 Tutorial 88

References

Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K., and Harshman, R., Indexing by Latent Semantic Analysis, Journal of the American Society for Information Science, vol. 41, 1990 Landauer, T. and Dumais, S., A Solution to Plato's Problem: The Latent Semantic Analysis Theory of Acquisition, Induction and Representation of Knowledge, Psychological Review, vol. 104, 1997 Schütze, H., Automatic Word Sense Discrimination, Computational Linguistics, vol. 24, 1998 Berry, M.W., Drmac, Z., and Jessup, E.R., Matrices, Vector Spaces, and Information Retrieval, SIAM Review, vol. 41, 1999 EACL-2006 Tutorial 89 Clustering Partitional Methods Cluster Stopping Cluster Labeling

EACL-2006 Tutorial 90 Many many methods Cluto supports a wide range of different clustering methods: Agglomerative (average, single, complete link), Partitional (K-means, direct), and Hybrid (repeated bisections) SenseClusters integrates with Cluto EACL-2006 Tutorial 91 General Methodology Represent contexts to be clustered in first or second order vectors Cluster the context vectors directly (vcluster)

or convert to a similarity matrix and then cluster (scluster) EACL-2006 Tutorial 92 Partitional Methods Randomly create centroids equal to the number of clusters you wish to find Assign each context to the nearest centroid After all contexts are assigned, re-compute the centroids; the best location is decided by a criterion function Repeat until stable clusters are found Centroids don't shift from iteration to iteration EACL-2006 Tutorial 97 Partitional Methods Advantages: fast Disadvantages: Results can be dependent on the initial placement of centroids Must specify the number of clusters ahead of time (maybe not...) EACL-2006 Tutorial
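The partitional procedure described above, sketched in plain Python for small 2-d vectors:

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Plain k-means: random initial centroids, assign, re-compute, repeat."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # assign each context vector to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda ci: sum((a - b) ** 2
                                                 for a, b in zip(p, centroids[ci])))
            clusters[j].append(p)
        # re-compute each centroid as the mean of its members
        new = [tuple(sum(xs) / len(c) for xs in zip(*c)) if c else centroids[j]
               for j, c in enumerate(clusters)]
        if new == centroids:        # centroids stable: stop
            break
        centroids = new
    return clusters, centroids

# two well-separated groups of 2-d vectors
pts = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
clusters, cents = kmeans(pts, k=2)
print(sorted(len(c) for c in clusters))  # [3, 3]
```

Note the dependence on initialization mentioned above: a different seed can change which points seed the centroids, though on well-separated data the result is the same.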

98 Vectors to be clustered EACL-2006 Tutorial 99 Random Initial Centroids (k=2) EACL-2006 Tutorial 100 Assignment of Clusters EACL-2006 Tutorial 101 Recalculation of Centroids EACL-2006 Tutorial 102

Reassignment of Clusters EACL-2006 Tutorial 103 Recalculation of Centroid EACL-2006 Tutorial 104 Reassignment of Clusters EACL-2006 Tutorial 105 Partitional Criterion Functions Intra-Cluster (Internal) similarity/distance

How close together are members of a cluster? Closer together is better Inter-Cluster (External) similarity/distance How far apart are the different clusters? Further apart is better EACL-2006 Tutorial 106 Intra Cluster Similarity Ball of String (I1) How far is each member from each

other member Flower (I2) How far is each member of cluster from centroid EACL-2006 Tutorial 107 Contexts to be Clustered EACL-2006 Tutorial 108 Ball of String (I1 Internal Criterion Function) EACL-2006 Tutorial 109 Flower (I2 Internal Criterion Function)

EACL-2006 Tutorial 110 Inter Cluster Similarity The Fan (E1) How far is each centroid from the centroid of the entire collection of contexts Maximize that distance EACL-2006 Tutorial 111 The Fan (E1 External Criterion Function) EACL-2006 Tutorial

112 Hybrid Criterion Functions Balance internal and external similarity H1 = I1/E1 H2 = I2/E1 Want internal similarity to increase, while external similarity decreases Want internal distances to decrease, while external distances increase EACL-2006 Tutorial 113 Cluster Stopping EACL-2006 Tutorial

114 Cluster Stopping Many clustering algorithms require that the user specify the number of clusters prior to clustering But the user often doesn't know the number of clusters, and in fact finding that out might be the goal of clustering EACL-2006 Tutorial 115 Criterion Functions Can Help Run partitional algorithm for k=1 to deltaK

DeltaK is a user estimated or automatically determined upper bound for the number of clusters Find the value of k at which the criterion function does not significantly increase at k+1 Clustering can stop at this value, since no further improvement in solution is apparent with additional clusters (increases in k) EACL-2006 Tutorial 116 SenseClusters Approach to Cluster Stopping Will be subject of Demo at EACL Demo Session 2 5th April, 14:30-16:00 Ted Pedersen and Anagha Kulkarni: Selecting the "Right" Number of Senses Based on Clustering Criterion Functions

EACL-2006 Tutorial 117 H2 versus k T. Blair V. Putin S. Hussein EACL-2006 Tutorial 118 PK2 Based on Hartigan, 1975 When the ratio approaches 1, clustering is at a plateau Select the value of k which is closest to but outside of the standard deviation interval PK2(k) = H2(k) / H2(k-1) EACL-2006 Tutorial

119 PK2 predicts 3 senses T. Blair V. Putin S. Hussein EACL-2006 Tutorial 120 PK3 Related to Salvador and Chan, 2004 Inspired by the Dice Coefficient Values close to 1 mean clustering is improving Select the value of k which is closest to but outside of the standard deviation interval PK3(k) = 2 * H2(k) / ( H2(k-1) + H2(k+1) ) EACL-2006 Tutorial 121
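Both stopping measures can be sketched on a made-up H2 curve that climbs until k = 3 and then plateaus; the selection rule below is a simplification of the one described on the slides:

```python
from statistics import mean, stdev

# hypothetical H2 criterion scores for k = 1..6 clusters (made-up data):
# the score climbs sharply up to k = 3, then plateaus
h2 = {1: 0.50, 2: 0.55, 3: 0.80, 4: 0.81, 5: 0.815, 6: 0.82}

pk2 = {k: h2[k] / h2[k - 1] for k in range(2, 7)}
pk3 = {k: 2 * h2[k] / (h2[k - 1] + h2[k + 1]) for k in range(2, 6)}

# pick the largest k whose PK2 ratio still lies outside the
# mean +/- one standard deviation band (ratios near 1 mean that
# adding another cluster no longer improves the solution)
m, sd = mean(pk2.values()), stdev(pk2.values())
selected = max(k for k, v in pk2.items() if abs(v - m) > sd)
print(selected)  # 3
```

On this curve PK3 also peaks at k = 3, mirroring the "PK3 predicts 3 senses" result on the slide for its own data.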

PK3 predicts 3 senses T. Blair V. Putin S. Hussein EACL-2006 Tutorial 122 References Hartigan, J. Clustering Algorithms, Wiley, 1975 Mojena, R., Hierarchical Grouping Methods and Stopping Rules: An Evaluation, The Computer Journal, vol 20, 1977 basis for SenseClusters stopping method PK1 Milligan, G. and Cooper, M., An Examination of Procedures

for Determining the Number of Clusters in a Data Set, Psychometrika, vol. 50, 1985 basis for SenseClusters stopping method PK2 Very extensive comparison of cluster stopping methods Tibshirani, R. and Walther, G. and Hastie, T., Estimating the Number of Clusters in a Dataset via the Gap Statistic,Journal of the Royal Statistics Society (Series B), 2001 Pedersen, T. and Kulkarni, A. Selecting the "Right" Number of Senses Based on Clustering Criterion Functions, Proceedings of the Posters and Demo Program of the Eleventh Conference of the European Chapter of the Association for Computational Linguistics, 2006 Describes SenseClusters stopping methods EACL-2006 Tutorial 123 Cluster Labeling

EACL-2006 Tutorial 124 Cluster Labeling Once a cluster is discovered, how can you generate a description of the contexts of that cluster automatically? In the case of contexts, you might be able to identify significant lexical features from the contents of the clusters, and use those as a preliminary label EACL-2006 Tutorial 125 Results of Clustering

Each cluster consists of some number of contexts Each context is a short unit of text Apply measures of association to the contents of each cluster to determine N most significant bigrams Use those bigrams as a label for the cluster EACL-2006 Tutorial 126 Label Types The N most significant bigrams for each cluster will act as a descriptive label The M most significant bigrams that are unique to each cluster will act as a discriminating label EACL-2006 Tutorial
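A sketch of descriptive labeling; the tutorial ranks candidate bigrams with a measure of association such as log-likelihood, while this sketch uses raw frequency to stay short:

```python
from collections import Counter

def label_cluster(contexts, n=2):
    """Descriptive label: the n most frequent bigrams in the cluster."""
    counts = Counter()
    for c in contexts:
        tokens = c.lower().split()
        counts.update(zip(tokens, tokens[1:]))   # consecutive word pairs
    return [bigram for bigram, _ in counts.most_common(n)]

cluster = ["The operating system shell is bash",
           "My operating system needs a new shell",
           "Type it at the shell command line"]
print(label_cluster(cluster)[0])  # ('operating', 'system')
```

A discriminating label would then keep only those bigrams that do not appear in the labels of any other cluster.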

127 Baseline Algorithm Group all instances into one cluster; this will reach accuracy equal to the majority classifier What if the clustering said everything should be in the same cluster? EACL-2006 Tutorial 132

Baseline Performance

          S1   S2   S3   Totals
C1         0    0    0      0
C2         0    0    0      0
C3        80   35   55    170
Totals    80   35   55    170

(0+0+55)/170 = .32 if C3 is S3
(0+0+80)/170 = .47 if C3 is S1

EACL-2006 Tutorial 133

Things to Try Feature Identification

Type of Feature Measures of association Context Representation (1st or 2nd order) Automatic Stopping (or not) SVD (or not) Clustering Algorithm and Criterion Function Evaluation Labeling EACL-2006 Tutorial 138 Thank you!

Questions or comments on tutorial or SenseClusters are welcome at any time [email protected] SenseClusters is freely available via LIVE CD, the Web, and in source code form SenseClusters papers available at: EACL-2006 Tutorial 149
