Text Mining and Online Data Analysis

LAB 10

The word2vec model

A common task in text mining is building a so-called word embedding, i.e. a numerical representation of words. In a sense we have already touched on this when discussing the vector space model; that, however, is a very limited representation.

In recent years distributional semantics has found wide application in a range of natural language processing tasks. It rests on the hypothesis that words occurring in similar contexts in large text corpora have similar meanings. Word meanings are represented by numerical vectors.

One of the most popular methods in this family is word2vec, introduced in 2013 by Tomas Mikolov and colleagues at Google. It can use either of two techniques: CBOW (Continuous Bag of Words) or Skip-Gram. The two are algorithmically almost identical, except that CBOW predicts the target word from its context (e.g. the word "mat" from the rest of ``cat sits on the mat''), whereas Skip-Gram does the reverse and predicts the context words from the target word. Word2vec uses a neural network with a single hidden layer as its basic architecture, and the quantity being modeled is simply the probability \(p(w_t | h)\) of a word \(w_t\) given its history (context) \(h\), with the parameters estimated by maximum likelihood.
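
The difference between the two techniques comes down to how training examples are formed from a sliding window. A minimal sketch in base R, assuming a toy sentence and a window of 2 words on each side (both chosen here only for illustration, not taken from word2vec itself):

```r
# Toy sentence; the window size of 2 is an arbitrary choice for illustration
tokens <- c("the", "cat", "sits", "on", "the", "mat")
window <- 2

# For every position collect the target word and its surrounding context words
pairs <- lapply(seq_along(tokens), function(t) {
  idx <- setdiff(max(1, t - window):min(length(tokens), t + window), t)
  list(target = tokens[t], context = tokens[idx])
})

# CBOW: one training example per position, context -> target
cbow <- sapply(pairs, function(p)
  paste0("{", paste(p$context, collapse = ", "), "} -> ", p$target))

# Skip-Gram: one training example per (target, context word) pair, target -> context
skipgram <- unlist(lapply(pairs, function(p) paste(p$target, "->", p$context)))

print(cbow)
print(skipgram)
```

CBOW produces one example per position, while Skip-Gram produces one example per context word, so the same text yields more (and simpler) training examples in the Skip-Gram setting.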

The GloVe model

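GloVe (Global Vectors for Word Representation) is a technique proposed in 2014 by researchers at Stanford University. It works on a word co-occurrence matrix \(X\), whose element \(X_{ij}\) records how often word \(i\) appears in the context of word \(j\) within a sliding window moved over the corpus, and fits word vectors by minimizing the cost function

\[ J = \sum_{i=1}^{V} \sum_{j=1}^{V} f(X_{ij}) \left( \mathbf{w}_i^{\top} \mathbf{w}_j + b_i + b_j - \log X_{ij} \right)^2, \]

where \(V\) is the vocabulary size, \(\mathbf{w}_i\) and \(\mathbf{w}_j\) are the main and context vectors of words \(i\) and \(j\), \(b_i\) and \(b_j\) are bias terms, and \(f(X_{ij})\) is a weighting function that prevents the model from learning only from the most frequent words. The function was proposed as \(f(X_{ij}) = (X_{ij} / x_{max})^{\alpha}\) for \(X_{ij} < x_{max}\) and \(f(X_{ij}) = 1\) otherwise.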
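
The weighting function \(f\) is easy to write down directly. A minimal sketch, in which the function name `glove_weight` is hypothetical and the defaults are assumptions: `x_max = 10` is the value used later in this lab, and `alpha = 0.75` is the value commonly quoted for GloVe.

```r
# Weighting function f(X_ij) as defined in the text.
# The defaults x_max = 10 and alpha = 0.75 are assumptions made for this sketch.
glove_weight <- function(x, x_max = 10, alpha = 0.75) {
  ifelse(x < x_max, (x / x_max)^alpha, 1)
}

# Rare pairs get a small weight, frequent pairs are capped at 1
counts <- c(0.25, 1, 5, 10, 1000)
round(glove_weight(counts), 3)
```

The cap at 1 means that a pair seen 1000 times contributes to the cost with the same weight as a pair seen 10 times, which is what keeps very frequent words from dominating the fit.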

We start by installing the text2vec package (which may take a while) and then download the file needed for the analysis. Note that the data is about 100 MB!

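# EXAMPLE 10.1

# install.packages("text2vec")  # run once if the package is not installed yet
library(text2vec)
# The whole corpus is read as a single line of text
file.name <- "http://www.if.pw.edu.pl/~julas/TEXT/lab/text8"
file.in <- readLines(file.name, n = 1, warn = FALSE)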

The text is read in and tokenized, a vocabulary is built, and the rarest words (those occurring fewer than 5 times) are removed.

# EXAMPLE 10.2

# Split the text into tokens on whitespace
tokens <- space_tokenizer(file.in)
tokens[[1]][1:10]
##  [1] ""           "anarchism"  "originated" "as"         "a"         
##  [6] "term"       "of"         "abuse"      "first"      "used"
# Build an iterator over the tokens and count every term
it <- itoken(tokens, progressbar = FALSE)
vocab <- create_vocabulary(it)
vocab
## Number of docs: 1 
## 0 stopwords:  ... 
## ngram_min = 1; ngram_max = 1 
## Vocabulary: 
##               term term_count doc_count
##      1:   piagarmi          1         1
##      2: manganiyar          1         1
##      3:      yavne          1         1
##      4:  aroostock          1         1
##      5: stollwerck          1         1
##     ---                                
## 253850:         in     372201         1
## 253851:        one     411764         1
## 253852:        and     416629         1
## 253853:         of     593677         1
## 253854:        the    1061396         1
# Drop terms that occur fewer than 5 times
vocab <- prune_vocabulary(vocab, term_count_min = 5L)

The next step is to build the term co-occurrence matrix, based on a sliding window. The window size is 5, which means that it covers 5 words to the left and 5 words to the right of each word.

# EXAMPLE 10.3

# Map terms to matrix indices, then count co-occurrences within a window of 5
vectorizer <- vocab_vectorizer(vocab)
tcm <- create_tcm(it, vectorizer, skip_grams_window = 5L)
options(max.print = 1000)
tcm
## 71290 x 71290 sparse Matrix of class "dgTMatrix"
##    [[ suppressing 32 column names 'kentauros', 'tornatore', 'phantastica' ... ]]
##    [[ suppressing 32 column names 'kentauros', 'tornatore', 'phantastica' ... ]]
##                                                                                         
## kentauros      . . . .    . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
## tornatore      . . . .    . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
## phantastica    . . . .    . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
## steinhoff      . . . 0.25 . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
## minthe         . . . .    . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
## jizyah         . . . .    . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
## pommern        . . . .    . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
## iconodules     . . . .    . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
## palas          . . . .    . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
## splinters      . . . .    . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
## nears          . . . .    . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
## cashed         . . . .    . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
## dimasi         . . . .    . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
## yisra          . . . .    . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
## mistranslated  . . . .    . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
## adenocarcinoma . . . .    . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
## 
##  ..............................
##  ........suppressing 71258 columns and 71259 rows in show(); maybe adjust 'options(max.print= *, width = *)'
##  ..............................
##    [[ suppressing 32 column names 'kentauros', 'tornatore', 'phantastica' ... ]]
##                                                                             
## s     . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
## for   . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
## eight . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
## as    . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
## is    . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
## two   . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
## nine  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
## zero  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
## to    . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
## a     . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
## in    . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
## one   . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
## and   . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
## of    . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
## the   . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
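
To see how the entries of such a matrix arise, the counting can be reproduced by hand on a toy sentence. The sketch below is a simplified base R illustration, not what `create_tcm` does internally. It assumes that each co-occurrence is weighted by 1/distance, which to the best of our knowledge is the default weighting in `create_tcm` and is consistent with fractional entries such as 0.25 in the output above. The sentence, the window size of 2 and the name `tcm_toy` are made up for this example.

```r
# Toy corpus and window size (both arbitrary choices for this illustration)
tokens <- c("the", "cat", "sits", "on", "the", "mat")
window <- 2
terms <- unique(tokens)

# Empty term-by-term matrix
tcm_toy <- matrix(0, length(terms), length(terms), dimnames = list(terms, terms))

# For every pair of positions at most `window` apart add the weight 1 / distance
for (i in seq_along(tokens)) {
  for (d in seq_len(window)) {
    j <- i + d
    if (j > length(tokens)) break
    w <- 1 / d
    tcm_toy[tokens[i], tokens[j]] <- tcm_toy[tokens[i], tokens[j]] + w
    tcm_toy[tokens[j], tokens[i]] <- tcm_toy[tokens[j], tokens[i]] + w
  }
}

tcm_toy
```

Adjacent words contribute 1, words two positions apart contribute 0.5, and contributions from repeated occurrences add up.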

Finally we run the algorithm itself, setting the vector size to 50 elements.

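# EXAMPLE 10.4

# Note: this constructor call matches older text2vec releases; newer releases
# (0.6 and later, as far as we know) expect GlobalVectors$new(rank = 50, x_max = 10)
glove_model <- GlobalVectors$new(word_vectors_size = 50, vocabulary = vocab, x_max = 10)
# Fit the model on the co-occurrence matrix; the result holds the main word vectors
words_main <- glove_model$fit_transform(tcm, n_iter = 10, convergence_tol = 0.01)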
## INFO [2020-01-08 22:50:23] 2020-01-08 22:50:23 - epoch 1, expected cost 0.0876
## INFO [2020-01-08 22:50:49] 2020-01-08 22:50:49 - epoch 2, expected cost 0.0614
## INFO [2020-01-08 22:51:12] 2020-01-08 22:51:12 - epoch 3, expected cost 0.0542
## INFO [2020-01-08 22:51:43] 2020-01-08 22:51:43 - epoch 4, expected cost 0.0503
## INFO [2020-01-08 22:52:15] 2020-01-08 22:52:15 - epoch 5, expected cost 0.0477
## INFO [2020-01-08 22:52:34] 2020-01-08 22:52:33 - epoch 6, expected cost 0.0459
## INFO [2020-01-08 22:52:55] 2020-01-08 22:52:55 - epoch 7, expected cost 0.0445
## INFO [2020-01-08 22:53:17] 2020-01-08 22:53:17 - epoch 8, expected cost 0.0434
## INFO [2020-01-08 22:53:36] 2020-01-08 22:53:36 - epoch 9, expected cost 0.0425
## INFO [2020-01-08 22:53:56] 2020-01-08 22:53:56 - epoch 10, expected cost 0.0418

The output actually consists of two sets of vectors, \(\mathbf{w}_i\) and \(\mathbf{w}_j\) (main and context); in theory they should be symmetric. In practice their average or their sum is used.

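# EXAMPLE 10.5

# Context vectors are stored in the model (one column per word)
words_components <- glove_model$components
# Sum the main and context vectors to obtain the final embedding matrix
W <- words_main + t(words_components)
options(max.print = 1000)
round(W, 3)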
##                                 [,1]   [,2]   [,3]   [,4]   [,5]   [,6]   [,7]
## kentauros                      0.028  0.143  0.417 -0.260 -0.768 -0.012  0.078
## tornatore                     -0.106 -0.571  0.421 -0.610  0.258 -0.404  0.719
## phantastica                   -0.260 -0.318  0.659 -0.363 -0.366 -0.112 -0.282
## steinhoff                     -0.177 -0.348 -0.281  0.382 -0.700 -0.276  0.563
## minthe                        -0.215 -0.304 -0.421 -0.345 -0.022  0.213  0.397
## jizyah                         1.065 -0.369  1.139  0.152 -0.342 -0.307  0.420
## pommern                       -0.493 -0.050  0.227  0.393  0.247 -0.813  0.567
## iconodules                     0.025 -0.685  0.436  0.540 -0.731 -0.715  0.858
## palas                         -0.078  0.057  0.363  0.060 -0.704 -0.187  0.628
## splinters                      0.449 -0.253  0.000  0.228 -0.028 -0.245 -0.329
## nears                          0.684  0.232  0.157  0.675 -0.437 -0.504 -0.487
## cashed                         0.640  0.058  0.796 -0.358  0.388 -0.200  0.189
## dimasi                        -0.152  0.132  0.048  0.200  0.149 -0.202  0.399
## yisra                          0.581  0.142  0.486  0.417 -0.802  0.001  0.024
## mistranslated                  0.537 -0.272  0.019  0.885 -0.288 -0.192  0.110
## adenocarcinoma                 0.225  0.298  0.349  0.014  0.390 -0.172  0.183
## cheetham                       0.566 -0.282  0.570  0.187  0.373 -0.973  0.203
## cowbell                        0.330 -0.729 -0.482 -0.141 -0.237 -0.291  0.113
## wallerstein                   -0.938 -0.308 -0.084 -0.525 -0.769 -0.143  0.641
## bluie                         -0.177  0.112 -0.214 -0.023 -0.176  0.449  1.071
##                                 [,8]   [,9]  [,10]  [,11]  [,12]  [,13]  [,14]
## kentauros                      0.015  0.216 -0.360  0.706 -0.306  0.407 -0.654
## tornatore                     -0.174  0.118  0.136 -0.410 -0.761  0.112  0.381
## phantastica                   -0.098 -0.271 -0.100  0.345 -0.539  0.226  0.719
## steinhoff                      0.398  0.887  0.633  0.662 -0.197  0.255 -0.444
## minthe                         0.239  0.232 -0.447 -0.472 -0.204 -0.136  0.366
## jizyah                         0.598 -0.031  0.170  0.235 -0.493  0.104  0.344
## pommern                       -0.503 -0.636  0.080 -0.439  0.418  0.065 -0.389
## iconodules                     0.246 -0.567 -0.685  0.199  0.081  0.220 -0.700
## palas                         -0.419 -0.154  0.591  0.186  0.144  0.175  0.059
## splinters                      0.427 -0.302  0.152 -0.274 -0.356  0.194  0.295
## nears                         -0.167  0.520 -0.953  0.950 -0.380 -0.278 -0.541
## cashed                        -0.436  0.453 -0.222 -0.202  0.142  0.419  0.173
## dimasi                        -0.289  0.412 -0.743 -0.220 -0.655 -0.206 -0.636
## yisra                         -0.074  0.336 -0.157  1.409 -0.159  0.373  0.051
## mistranslated                 -0.233 -0.209 -0.542  0.127 -0.234  0.154 -0.623
## adenocarcinoma                -0.529  0.009  0.597  0.443  0.262 -0.516  0.110
## cheetham                       0.364 -0.105 -0.871  0.310 -0.621 -0.219 -0.360
## cowbell                        0.265  0.226  0.187  0.501 -0.619  0.523  0.030
## wallerstein                   -0.109  0.084 -0.017  0.447 -0.552 -0.912  0.075
## bluie                         -0.736  0.128  0.082  0.151 -1.076  0.533  0.225
##                                [,15]  [,16]  [,17]  [,18]  [,19]  [,20]  [,21]
## kentauros                     -0.004  0.733  0.416  0.225  0.867  0.150 -0.684
## tornatore                     -0.274 -0.316 -0.394 -0.582  0.732 -0.219 -0.130
## phantastica                    0.004 -0.178 -0.507  0.104  0.272 -0.097 -0.072
## steinhoff                     -0.270 -0.563 -0.678  0.278  0.370  0.331  0.132
## minthe                        -0.406 -0.047 -0.207  0.016 -0.306 -0.060 -0.270
## jizyah                        -0.518  0.233  0.356 -0.571 -0.076  0.761 -0.071
## pommern                       -0.163 -0.129  0.349  0.186 -0.006  0.574  0.495
## iconodules                     0.451  0.452 -0.261  0.284  0.045  0.549 -0.234
## palas                         -0.823  0.241  0.345 -0.371 -0.062  0.458  0.328
## splinters                      0.550 -0.568 -0.948  0.349  0.433 -0.946  0.390
## nears                         -0.137 -0.481 -0.149 -0.151  0.567 -0.590  0.222
## cashed                         0.773 -0.459 -0.442  0.435  0.326 -0.111  0.662
## dimasi                         0.648  0.178 -0.206  0.214 -0.176  0.016  0.369
## yisra                          0.076 -0.702  0.425 -0.020  0.060  0.025 -0.699
## mistranslated                  0.076 -0.277  0.613  0.011  0.212  0.016  0.489
## adenocarcinoma                 0.446 -0.242  0.221 -0.516  0.322  0.157 -0.531
## cheetham                      -0.304 -0.158  0.109  0.102 -0.459  0.064  0.191
## cowbell                       -0.619  0.282  0.437 -0.939  0.555  0.461 -0.727
## wallerstein                    0.166  0.065 -0.619  0.210 -0.157 -0.121  0.074
## bluie                          0.432 -0.030 -0.708 -0.120 -0.133 -0.201  0.341
##                                [,22]  [,23]  [,24]  [,25]  [,26]  [,27]  [,28]
## kentauros                      0.027 -0.438 -0.010 -0.289  0.268  0.428  0.025
## tornatore                     -0.450  0.024 -0.330  0.111  0.244  0.333  0.014
## phantastica                    0.484 -0.470  0.216  0.483  0.418 -0.314 -0.271
## steinhoff                     -0.506  0.031 -0.821  0.217 -0.033  1.053  0.140
## minthe                         0.299 -0.011  0.169  0.083  0.463  0.335 -0.153
## jizyah                         0.055  0.613  0.395  0.511 -0.392 -0.431  0.519
## pommern                       -0.399  0.243  0.322  0.719  0.546  0.589 -0.375
## iconodules                     0.226 -0.446 -0.116  0.600 -0.438 -0.268 -0.180
## palas                         -0.668 -0.198 -0.117  0.515  0.037 -0.534  0.095
## splinters                     -0.211  0.010  0.446  0.302  0.235  0.712 -0.190
## nears                         -0.121  1.086  0.738  0.795  0.248  0.195 -0.307
## cashed                         0.013 -0.084  0.652  0.347  0.392  0.186 -0.602
## dimasi                         0.121 -0.155 -0.285  0.841 -0.322  0.067 -0.096
## yisra                          0.229  0.361 -0.086  0.322  0.159  0.877 -0.281
## mistranslated                  0.374  0.736  0.896 -0.367  1.079  0.741  0.406
## adenocarcinoma                -0.335  0.295 -0.556  1.203  0.661 -0.381 -0.583
## cheetham                       0.282  0.049 -0.167  0.284  0.512  0.035  0.308
## cowbell                       -0.085  0.165  0.208  0.067  0.655 -0.009 -0.753
## wallerstein                    0.512  0.195 -0.117  0.928  0.751 -0.006  0.058
## bluie                          0.234 -0.253 -0.342  0.595  1.102  0.126 -0.303
##                                [,29]  [,30]  [,31]  [,32]  [,33]  [,34]  [,35]
## kentauros                     -0.700 -0.240  0.029 -0.345  0.160  1.035 -0.001
## tornatore                     -0.929 -0.174 -0.053 -0.295 -0.350  0.828  0.892
## phantastica                   -0.150 -0.784  0.147  0.163 -1.178 -0.375  0.544
## steinhoff                     -1.152 -0.240  0.181 -0.757 -0.938 -0.216 -0.300
## minthe                        -0.458  0.166  0.222  0.197 -0.016  0.675  0.145
## jizyah                        -0.747  0.018  0.196 -0.041 -0.572  0.097 -0.152
## pommern                       -0.493 -0.407 -0.134 -0.031 -0.387  0.345 -1.101
## iconodules                    -0.718  0.016 -0.242  0.612 -0.547 -0.161 -0.300
## palas                          0.135 -0.034  0.442  0.214 -0.703  0.236 -0.093
## splinters                     -0.387  0.332  0.185 -0.011  0.060  0.539  0.354
## nears                          0.538  0.265 -0.399 -0.513 -0.620 -0.198 -0.717
## cashed                         0.581  0.577 -0.012  0.510 -0.081  0.475  0.348
## dimasi                        -0.416 -0.573  0.559  0.082 -0.008  0.180 -0.158
## yisra                         -0.692 -0.614  0.000  0.058 -0.190  0.629 -0.383
## mistranslated                 -0.172 -0.433  0.827 -0.680  0.421  0.716 -0.377
## adenocarcinoma                -0.465 -0.350  0.276  0.574 -0.096  0.117  0.508
## cheetham                       0.449  0.317 -0.011  0.234 -0.856  0.078 -0.426
## cowbell                       -0.382  0.385 -0.579 -0.408 -0.315 -0.598  0.219
## wallerstein                   -0.440 -0.495 -0.639 -0.299 -0.451  0.038 -0.123
## bluie                         -0.115 -0.396  0.168 -0.316 -0.198  0.333 -0.110
##                                [,36]  [,37]  [,38]  [,39]  [,40]  [,41]  [,42]
## kentauros                     -0.094 -0.311 -0.130 -0.488  0.137 -0.753  0.044
## tornatore                     -0.500 -0.028 -1.092  0.326  0.873 -1.467  0.037
## phantastica                    0.313  0.007 -1.167 -0.132  0.678 -0.500 -0.415
## steinhoff                      0.396  0.433 -0.720  0.227  0.081 -0.049 -0.156
## minthe                        -0.452  0.483  0.140  0.748  0.954 -0.830  0.102
## jizyah                         0.538  0.712 -0.208  0.323  0.089 -0.586  0.147
## pommern                        0.461  0.270  0.061 -0.541  1.095 -0.981  0.829
## iconodules                     0.947 -0.468 -0.218 -0.010  0.382  0.001 -0.298
## palas                         -0.299 -0.122  0.229  0.155  0.456 -0.227 -0.268
## splinters                     -0.260  0.074 -0.559  0.184  0.030 -1.176 -0.007
## nears                          0.048  0.101 -0.473 -0.019  0.274 -0.137  0.181
## cashed                         0.913 -0.381 -0.288 -0.190  0.199 -0.460 -0.815
## dimasi                        -0.128  0.305 -0.507  0.779  0.320 -0.633 -0.187
## yisra                          0.752 -0.746 -0.257 -0.225  0.529 -1.008 -0.405
## mistranslated                 -0.107  0.113 -0.477 -0.203 -0.035 -1.136  0.394
## adenocarcinoma                -0.366  0.693  0.203 -0.232  0.996 -0.950 -0.197
## cheetham                       0.507 -0.450  0.337  1.038 -0.206 -1.327  0.148
## cowbell                       -0.250  0.173  0.152  0.440  0.586 -0.226 -0.780
## wallerstein                   -0.211 -0.036 -0.925  0.873  0.295 -0.396 -0.364
## bluie                          0.449  0.407 -0.823  0.078  0.344 -0.444 -0.022
##                                [,43]  [,44]  [,45]  [,46]  [,47]  [,48]  [,49]
## kentauros                     -0.684 -0.070 -0.164  0.605  0.140  0.339  0.926
## tornatore                     -0.010 -0.195 -0.101  0.223 -0.588 -0.363  0.132
## phantastica                   -0.075 -0.719  0.308 -0.990  0.465  0.583  1.340
## steinhoff                     -0.355 -0.273 -0.130 -0.414  0.140  0.510  0.404
## minthe                        -0.530  0.399  0.450  0.283  0.433 -0.103  0.851
## jizyah                        -0.568  0.666 -0.086 -0.387  0.151 -0.232  0.052
## pommern                       -0.941  0.031  0.092 -0.696  0.582  0.212  1.091
## iconodules                    -0.694  0.092  0.131  0.183  0.801 -0.198  0.234
## palas                         -1.183  0.197 -0.450 -0.026  0.436 -0.292 -0.501
## splinters                      0.075 -0.146 -0.560 -0.540  0.378 -0.022 -0.163
## nears                         -0.592 -0.102  0.276 -0.493  0.293  0.497 -0.320
## cashed                         0.253 -0.806 -0.420 -0.584  0.659 -0.072  0.030
## dimasi                         0.306 -0.200 -0.039 -0.284  0.553  0.145  0.346
## yisra                         -0.138 -0.680  0.095  0.115  0.773  0.366  0.358
## mistranslated                 -0.117 -0.326 -0.247 -0.101 -0.134  0.102 -0.380
## adenocarcinoma                 0.299 -0.426 -0.304 -0.226  0.613 -0.192 -0.015
## cheetham                       0.268 -0.787  0.340 -0.154  0.983 -0.048 -0.375
## cowbell                       -0.598 -0.420  0.380 -0.248  0.256  0.226  0.427
## wallerstein                   -0.290 -0.018  0.132 -0.612  0.412  0.479  0.432
## bluie                          0.195 -0.181 -0.301 -0.643  0.066  0.201  0.132
##                                [,50]
## kentauros                     -0.182
## tornatore                      0.053
## phantastica                    0.276
## steinhoff                      0.298
## minthe                        -0.554
## jizyah                         0.412
## pommern                        0.026
## iconodules                    -0.537
## palas                          0.275
## splinters                     -0.208
## nears                         -0.487
## cashed                        -0.261
## dimasi                         0.304
## yisra                         -1.045
## mistranslated                 -0.251
## adenocarcinoma                 0.307
## cheetham                       0.258
## cowbell                        0.030
## wallerstein                    0.083
## bluie                         -0.017
##  [ reached getOption("max.print") -- omitted 71270 rows ]

At this point we have a fully trained model, i.e. a representation of words in a 50-dimensional space. The simplest thing to do is to pick a specific word and list its similarity to other words, using cosine similarity as the measure.
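
Cosine similarity itself is simple to compute by hand: it is the dot product of two vectors divided by the product of their lengths. A minimal sketch, in which the function name `cosine` and the three-dimensional vectors are hypothetical and serve only as an illustration:

```r
# Cosine similarity between two numeric vectors
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Hypothetical 3-dimensional "embeddings" used only for illustration
u <- c(1, 2, 0)
v <- c(2, 4, 0)   # same direction as u, different length
w <- c(-2, 1, 5)  # orthogonal to u

cosine(u, v)
cosine(u, w)
```

Because the measure depends only on direction, rescaling a vector does not change it. This is also why it makes no difference for such comparisons whether the main and context vectors are summed or averaged.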

# EXAMPLE 10.6

# Cosine similarity of every word vector to the vector of "student"
query <- W["student", , drop = FALSE]
cos_sim <- sim2(x = W, y = query, method = "cosine", norm = "l2")
head(sort(cos_sim[, 1], decreasing = TRUE))
##   student  students  graduate    school   teacher  attended 
## 1.0000000 0.7268409 0.7161877 0.7129249 0.6761584 0.6676684
# EXAMPLE 10.7

# The same query for "poland"
query <- W["poland", , drop = FALSE]
cos_sim <- sim2(x = W, y = query, method = "cosine", norm = "l2")
head(sort(cos_sim[, 1], decreasing = TRUE))
##    poland    russia lithuania   austria     italy   hungary 
## 1.0000000 0.8308919 0.7721159 0.7660052 0.7534029 0.7437061

As can be seen, geographical relationships can be recovered in this way as well. An interesting property of such models is that, since we operate in a vector space, we can to some extent expect ordinary vector arithmetic to work there. In particular, addition and subtraction can lead to interesting results:

# EXAMPLE 10.8

# "paris" is to "france" as ? is to "germany"
query <- W["paris", , drop = FALSE] - W["france", , drop = FALSE] + W["germany", , drop = FALSE]
cos_sim <- sim2(x = W, y = query, method = "cosine", norm = "l2")
head(sort(cos_sim[, 1], decreasing = TRUE))
##     berlin      paris     munich    leipzig washington     vienna 
##  0.8086096  0.7262618  0.6909106  0.6657381  0.6391524  0.6360046
# EXAMPLE 10.9

# "man" is to "he" as ? is to "she"
query <- W["man", , drop = FALSE] - W["he", , drop = FALSE] + W["she", , drop = FALSE]
cos_sim <- sim2(x = W, y = query, method = "cosine", norm = "l2")
head(sort(cos_sim[, 1], decreasing = TRUE))
##       man     woman      girl    person        my      baby 
## 0.8539033 0.7822090 0.6824101 0.6512960 0.6394281 0.6182626
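
In both results above the query words themselves (paris, man) rank near the top, so in practice they are usually excluded from the list of candidates. A minimal sketch of such a lookup, using a hypothetical helper `analogy()` and a made-up two-dimensional embedding `W_toy` rather than the trained model:

```r
# Hypothetical helper: answer "a is to b as c is to ?" and skip the query words
analogy <- function(W, a, b, c, n = 3) {
  query <- W[b, ] - W[a, ] + W[c, ]
  # cosine similarity of every row of W to the query vector
  sims <- as.vector(W %*% query) / (sqrt(rowSums(W^2)) * sqrt(sum(query^2)))
  names(sims) <- rownames(W)
  sims <- sims[!names(sims) %in% c(a, b, c)]
  head(sort(sims, decreasing = TRUE), n)
}

# Toy 2-dimensional embedding (made-up numbers, not taken from the trained model):
# first coordinate ~ "country", second coordinate ~ "capital-ness"
W_toy <- rbind(
  france  = c(1.0, 0.0),
  paris   = c(1.0, 1.0),
  germany = c(2.0, 0.1),
  berlin  = c(2.0, 1.1),
  vienna  = c(3.0, 1.0)
)

analogy(W_toy, a = "france", b = "paris", c = "germany", n = 2)
```

With the matrix `W` from this lab the same helper could be called as `analogy(W, "france", "paris", "germany")`, assuming all three words are present among the row names.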

And finally perhaps the most interesting property, the so-called linguistic regularity. It turns out that embedding models of this kind can be used to visualize certain grammatical rules, e.g. those related to parts of speech. Since we are working in a 50-dimensional space, we have to use multidimensional scaling to produce meaningful plots.
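
Classical multidimensional scaling (`cmdscale` in base R) looks for low-dimensional coordinates whose pairwise distances match the original ones as closely as possible. A small sketch on made-up data, assuming six points that lie exactly on a plane inside a 50-dimensional space, so that a two-dimensional picture can reproduce their distances without loss:

```r
set.seed(42)
# Hypothetical data: 6 points in a 50-dimensional space that really live on a plane
basis <- matrix(rnorm(2 * 50), nrow = 2)   # two random directions in 50-D
coords <- matrix(rnorm(6 * 2), nrow = 6)   # planar coordinates of the points
X <- coords %*% basis                      # 6 x 50 data matrix

# Classical MDS into k = 2 dimensions (the default of cmdscale)
z <- cmdscale(dist(X), k = 2)

# Largest difference between the original and the 2-D distances
max_err <- max(abs(as.numeric(dist(z)) - as.numeric(dist(X))))
max_err
```

For real word vectors the points do not lie on a plane, so the two-dimensional plot preserves the distances only approximately and should be read as a rough picture.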

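# EXAMPLE 10.10

# Present and past tense forms, listed as consecutive pairs
q <- c("see", "saw", "have", "had", "is", "was", "buy", "bought")
# Project the selected word vectors onto 2 dimensions with classical MDS
z <- cmdscale(dist(W[q, ]))
plot(z, pch = 19, cex = 0.8, col = "red", xlab = "first dimension", ylab = "second dimension")
text(z, labels = q, pos = 2, font = 2)
# Draw an arrow from the first to the second word of each pair
invisible(sapply(seq(1, nrow(z) - 1, 2), function(i) arrows(z[i, 1], z[i, 2], z[i + 1, 1], z[i + 1, 2], length = 0.15, col = "blue")))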

# EXAMPLE 10.11

# Male and female counterparts, listed as consecutive pairs
q <- c("king", "queen", "father", "mother", "brother", "sister", "uncle", "aunt")
z <- cmdscale(dist(W[q, ]))
plot(z, pch = 19, cex = 0.8, col = "red", xlab = "first dimension", ylab = "second dimension")
text(z, labels = q, pos = 2, font = 2)
invisible(sapply(seq(1, nrow(z) - 1, 2), function(i) arrows(z[i, 1], z[i, 2], z[i + 1, 1], z[i + 1, 2], length = 0.15, col = "blue")))

# EXAMPLE 10.12

# Adjectives and their comparative forms, listed as consecutive pairs
q <- c("good", "better", "big", "bigger", "light", "lighter")
z <- cmdscale(dist(W[q, ]))
plot(z, pch = 19, cex = 0.8, col = "red", xlab = "first dimension", ylab = "second dimension")
text(z, labels = q, pos = 2, font = 2)
invisible(sapply(seq(1, nrow(z) - 1, 2), function(i) arrows(z[i, 1], z[i, 2], z[i + 1, 1], z[i + 1, 2], length = 0.15, col = "blue")))