The notes below are my own write-up of what I learned from the course, but the code comes from the course assignments and is not my original work. The deep learning API used is TensorFlow.
Course URL: https://classroom.udacity.com/courses/ud730
Word2Vec
For a machine to learn, the things to be learned must have defined features. Take tumours as an example: if we want a machine to learn to classify them as benign or malignant, the features could be defined as tumour size, density, presence of gene mutations, and so on, and the features of benign and malignant tumours are then fed to the machine for learning. Word2Vec is an algorithm that turns a single word into such a feature vector.
Text is different from image data. An image is understood mainly from the arrangement of its own pixels, so its features come from the image itself, whereas the meaning of a word is determined by its surrounding context. If two different words, say cat and kitty, always appear with the same surrounding words in a text, we can conclude that they mean the same thing, so their feature vectors should look the same. Word2Vec's approach is therefore to take a word as input and output a prediction of its surrounding words. The model is a two-layer fully connected neural network (1 hidden layer).
Suppose the vocabulary contains N words; a schematic of the model is shown below:
Input layer
The input layer is a one-hot encoding of size N. Each element of the input layer corresponds to one word in the vocabulary: 0 means that word is not the input, 1 means it is.
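As a small illustration of what the input layer looks like, here is a toy one-hot encoding with a made-up five-word vocabulary (not the course data):

import numpy as np

# Toy vocabulary of N = 5 words, made up purely for illustration.
vocab = ['the', 'cat', 'sat', 'on', 'mat']
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
  vec = np.zeros(len(vocab))
  vec[word_to_index[word]] = 1.0  # 1 marks the input word, every other element stays 0
  return vec

print(one_hot('cat'))  # [0. 1. 0. 0. 0.]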
Hidden Layer
The hidden layer is the input layer multiplied by the weight matrix (1xN * NxV = 1xV). Because the input is one-hot, the hidden layer is simply the sum of the rows pulled out of the weight matrix, and those rows' indices are exactly the indices where the input elements are 1. The input layer therefore acts as a lookup table into the rows of the weight matrix.
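A minimal NumPy sketch of this point, with made-up sizes rather than the course model: multiplying a one-hot row vector by the weight matrix returns exactly one row of that matrix, which is what tf.nn.embedding_lookup does in the full code further down.

import numpy as np

N, V = 5, 3  # made-up vocabulary size and feature-vector dimension
weights = np.arange(N * V, dtype=float).reshape(N, V)  # 1st weight matrix (NxV)

one_hot = np.zeros(N)
one_hot[2] = 1.0  # "input" the word whose index is 2

hidden_from_matmul = one_hot @ weights  # 1xN * NxV -> 1xV
hidden_from_lookup = weights[2]         # simply read out row 2

print(np.array_equal(hidden_from_matmul, hidden_from_lookup))  # True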
1st Weight Matrix = Word Feature Vector Matrix
Each row of the weight matrix corresponds to one input-layer element, and each element corresponds to a word, so each row vector of the weight matrix is that word's feature vector. The dimension of the vector equals the number of columns of the matrix and is adjustable: a larger dimension predicts words more accurately but costs more computation.
Input to Hidden Layer Weights
The weight matrices of the neural network model are trainable: we feed words into the model to predict their surrounding words, and after many training iterations the weight matrix between the input and hidden layers is the word vector matrix we want.
Random Sampled Softmax
The computation from the hidden layer (the feature vector) to the output layer is expensive, because the output dimension equals the size of the whole vocabulary; if we can reduce the number of output elements we compute, we save a lot of work. Conveniently, most of the output elements are 0, so we do not need to compute all of them. Besides keeping the element for the label, we pick a few non-label elements, and each forward/backward propagation only computes these selected elements; gradient descent then only fine-tunes the weights involved in computing them (in the weight matrix from the feature vector to the output layer). Here the output elements are picked at random, but the papers seem to describe better selection methods, which I will learn about when I run into them later.
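A rough NumPy sketch of the idea, purely for illustration (the full code below simply calls TensorFlow's tf.nn.sampled_softmax_loss, which handles the negative sampling internally): for one training example the output layer is evaluated only at the label plus a handful of randomly drawn negative words, instead of at all 50,000 vocabulary entries.

import numpy as np

vocabulary_size = 50000
embedding_size = 128
num_sampled = 64

rng = np.random.default_rng(0)
softmax_weights = rng.normal(scale=1.0 / np.sqrt(embedding_size),
                             size=(vocabulary_size, embedding_size))
softmax_biases = np.zeros(vocabulary_size)

hidden = rng.normal(size=embedding_size)  # feature vector coming out of the hidden layer
label = 42                                # index of the true context word (made up)

negatives = rng.choice(vocabulary_size, size=num_sampled, replace=False)
negatives = negatives[negatives != label]       # drop the label if it happened to be drawn
sampled = np.concatenate(([label], negatives))  # label + sampled negatives only

logits = softmax_weights[sampled] @ hidden + softmax_biases[sampled]  # ~65 values, not 50000
probs = np.exp(logits - logits.max())
probs /= probs.sum()
loss = -np.log(probs[0])  # the label sits at position 0 of `sampled`
print(loss)

Only the rows of softmax_weights indexed by `sampled` (and the biases for the same indices) take part in this computation, so only those weights would receive gradient updates in that step.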
That covers how the model runs, but how do we decide the labels for the input words? They can be obtained with two models, "Skip-gram" and "CBOW".
Skip-gram
The logic of skip-gram is to feed in one word at a time; the labels are the words within a fixed distance before and after it, so the same input word gets several labels. That context distance is one of skip-gram's parameters. Take the sentence "I have a big dog and horse" with the distance set to one word: if the input word is "dog", the labels are "big" and "and". With a distance of two words, the labels are "a", "big", "and" and "horse". A small code sketch of this sampling follows the figure below.
Input & label sampling example
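Continuing the sentence example above, here is a small stand-alone sketch (not the course's generate_batch function) that lists the (input, label) pairs skip-gram would produce:

sentence = "I have a big dog and horse".split()
skip_window = 2  # how many words on each side count as context (try 1 as well)

pairs = []
for i, word in enumerate(sentence):
  lo = max(0, i - skip_window)
  hi = min(len(sentence), i + skip_window + 1)
  for j in range(lo, hi):
    if j != i:
      pairs.append((word, sentence[j]))  # same input word, one pair per context word

for inp, lab in pairs:
  if inp == 'dog':
    print(inp, '->', lab)  # dog -> a, dog -> big, dog -> and, dog -> horse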
CBOW (Continuous Bag-of-Words)
Unlike skip-gram, where one input word produces several labels, CBOW takes the middle word of a window of text as the label and the surrounding words as the inputs, so several input words produce one output label. The window length is adjustable. A small code sketch follows the figure below.
Sampling example (window length: 3)
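The same sentence run through a small stand-alone CBOW sketch (again not the course's generate_batch): the middle word of each window becomes the label and its neighbours become the inputs.

sentence = "I have a big dog and horse".split()
window = 3       # window length: label in the middle, one word on each side
half = window // 2

for i in range(half, len(sentence) - half):
  label = sentence[i]
  inputs = sentence[i - half:i] + sentence[i + 1:i + half + 1]
  print(inputs, '->', label)
# ['I', 'a'] -> have
# ['have', 'big'] -> a
# ['a', 'dog'] -> big
# ...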
Visualizing Word Feature Vectors
Every word has its own feature vector, and the closer two vectors are, the closer the words' meanings. To visualize this, we use t-SNE to reduce the vectors to two dimensions and plot them. The result is shown in the figure below: words that sit closer together in vector space tend to be the same kind of word.
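For reference, a minimal sketch of that reduction and plotting step (it mirrors what the plot() function at the end of the full listings does); the embedding matrix here is a random stand-in, just to keep the snippet self-contained.

import numpy as np
from sklearn.manifold import TSNE
from matplotlib import pylab

embeddings = np.random.rand(200, 128)         # random stand-in for the trained word vectors
labels = ['word_%d' % i for i in range(200)]  # made-up labels

two_d = TSNE(n_components=2, perplexity=30, init='pca').fit_transform(embeddings)

pylab.figure(figsize=(10, 10))
for (x, y), label in zip(two_d, labels):
  pylab.scatter(x, y)
  pylab.annotate(label, xy=(x, y), xytext=(5, 2), textcoords='offset points')
pylab.show()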
Code (Python)
This code was written in IPython and has only actually been run in an IPython notebook.
Skip-gram model:
%matplotlib inline
from __future__ import print_function
import collections
import math
import numpy as np
import os
import random
import tensorflow as tf
import zipfile
from matplotlib import pylab
from six.moves import range
from six.moves.urllib.request import urlretrieve
from sklearn.manifold import TSNE

url = 'http://mattmahoney.net/dc/'

def maybe_download(filename, expected_bytes):
  """Download a file if not present, and make sure it's the right size."""
  if not os.path.exists(filename):
    filename, _ = urlretrieve(url + filename, filename)
  statinfo = os.stat(filename)
  if statinfo.st_size == expected_bytes:
    print('Found and verified %s' % filename)
  else:
    print(statinfo.st_size)
    raise Exception(
      'Failed to verify ' + filename + '. Can you get to it with a browser?')
  return filename

filename = maybe_download('text8.zip', 31344016)

def read_data(filename):
  """Extract the first file enclosed in a zip file as a list of words"""
  with zipfile.ZipFile(filename) as f:
    data = f.namelist()            # list type, element is the str "text8"
    data = data[0]                 # get the str "text8"
    data = f.read(data)            # read data as bytes
    data = tf.compat.as_str(data)  # convert bytes to str
    data = data.split()            # split the str on whitespace
  return data

words = read_data(filename)
print('Data size %d' % len(words))

vocabulary_size = 50000

def build_dataset(words):
  count = [['UNK', -1]]
  count.extend(collections.Counter(words).most_common(vocabulary_size - 1))  # count each word, listed as ['word', count]
  dictionary = dict()
  for word, _ in count:
    dictionary[word] = len(dictionary)  # dict key is the word, value is its index
  data = list()  # same content as the words list, but converted to dictionary indices
  unk_count = 0
  for word in words:
    if word in dictionary:
      index = dictionary[word]
    else:
      index = 0  # dictionary['UNK']
      unk_count = unk_count + 1
    data.append(index)
  count[0][1] = unk_count
  reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
  return data, count, dictionary, reverse_dictionary

data, count, dictionary, reverse_dictionary = build_dataset(words)
print('Most common words (+UNK)', count[:5])
print('Sample data', data[:10])
print('words list', words[:10])
del words  # Hint to reduce memory.

data_index = 0

def generate_batch(batch_size, num_skips, skip_window):
  global data_index  # global variable
  assert batch_size % num_skips == 0  # make sure batch_size is a multiple of num_skips
  assert num_skips <= 2 * skip_window
  batch = np.ndarray(shape=(batch_size), dtype=np.int32)
  labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
  span = 2 * skip_window + 1  # [ skip_window target skip_window ]
  buffer = collections.deque(maxlen=span)
  for _ in range(span):
    buffer.append(data[data_index])  # fill the sliding window
    data_index = (data_index + 1) % len(data)
  for i in range(batch_size // num_skips):
    target = skip_window  # index into buffer that will be picked for the label matrix
    targets_to_avoid = [skip_window]  # these indices will not be picked as labels
    for j in range(num_skips):
      while target in targets_to_avoid:
        target = random.randint(0, span - 1)  # re-draw until the target is not in targets_to_avoid
      targets_to_avoid.append(target)  # remember the picked target so it is not picked again
      batch[i * num_skips + j] = buffer[skip_window]
      labels[i * num_skips + j, 0] = buffer[target]
    buffer.append(data[data_index])  # shift the window one index to the right
    data_index = (data_index + 1) % len(data)
  return batch, labels

print('data:', [reverse_dictionary[di] for di in data[:8]])

for num_skips, skip_window in [(2, 1), (4, 2)]:
  data_index = 0
  batch, labels = generate_batch(batch_size=8, num_skips=num_skips, skip_window=skip_window)
  print('\nwith num_skips = %d and skip_window = %d:' % (num_skips, skip_window))
  print('    batch:', [reverse_dictionary[bi] for bi in batch])
  print('    labels:', [reverse_dictionary[li] for li in labels.reshape(8)])

batch_size = 128
embedding_size = 128  # Dimension of the embedding vector.
skip_window = 1       # How many words to consider left and right.
num_skips = 2         # How many times to reuse an input to generate a label.
# We pick a random validation set to sample nearest neighbors. Here we limit the
# validation samples to the words that have a low numeric ID, which by
# construction are also the most frequent.
valid_size = 16     # Random set of words to evaluate similarity on.
valid_window = 100  # Only pick dev samples in the head of the distribution.
valid_examples = np.array(random.sample(range(valid_window), valid_size))
num_sampled = 64    # Number of negative examples to sample.

graph = tf.Graph()

with graph.as_default(), tf.device('/cpu:0'):

  # Input data.
  train_dataset = tf.placeholder(tf.int32, shape=[batch_size])
  train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
  valid_dataset = tf.constant(valid_examples, dtype=tf.int32)

  # Variables.
  # tf.random_uniform(shape, minval=0, maxval=None, dtype=tf.float32, seed=None, name=None)
  embeddings = tf.Variable(
    tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
  softmax_weights = tf.Variable(
    tf.truncated_normal([vocabulary_size, embedding_size],
                        stddev=1.0 / math.sqrt(embedding_size)))
  softmax_biases = tf.Variable(tf.zeros([vocabulary_size]))

  # Model.
  # Look up embeddings for inputs.
  embed = tf.nn.embedding_lookup(embeddings, train_dataset)
  # Compute the softmax loss, using a sample of the negative labels each time.
  loss = tf.reduce_mean(
    tf.nn.sampled_softmax_loss(weights=softmax_weights, biases=softmax_biases,
                               inputs=embed, labels=train_labels,
                               num_sampled=num_sampled, num_classes=vocabulary_size))

  # Optimizer.
  # Note: The optimizer will optimize the softmax_weights AND the embeddings.
  # This is because the embeddings are defined as a variable quantity and the
  # optimizer's `minimize` method will by default modify all variable quantities
  # that contribute to the tensor it is passed.
  # See docs on `tf.train.Optimizer.minimize()` for more details.
  optimizer = tf.train.AdagradOptimizer(1.0).minimize(loss)

  # Compute the similarity between minibatch examples and all embeddings.
  # We use the cosine distance:
  norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
  normalized_embeddings = embeddings / norm
  valid_embeddings = tf.nn.embedding_lookup(
    normalized_embeddings, valid_dataset)
  similarity = tf.matmul(valid_embeddings, tf.transpose(normalized_embeddings))
  print(normalized_embeddings.get_shape())

num_steps = 100001

with tf.Session(graph=graph) as session:
  tf.global_variables_initializer().run()
  print('Initialized')
  average_loss = 0
  for step in range(num_steps):
    batch_data, batch_labels = generate_batch(
      batch_size, num_skips, skip_window)
    feed_dict = {train_dataset: batch_data, train_labels: batch_labels}
    _, l = session.run([optimizer, loss], feed_dict=feed_dict)
    average_loss += l
    if step % 2000 == 0:
      if step > 0:
        average_loss = average_loss / 2000
      # The average loss is an estimate of the loss over the last 2000 batches.
      print('Average loss at step %d: %f' % (step, average_loss))
      average_loss = 0
    # Note that this is expensive (~20% slowdown if computed every 500 steps).
    if step % 10000 == 0:
      sim = similarity.eval()
      for i in range(valid_size):
        valid_word = reverse_dictionary[valid_examples[i]]
        top_k = 8  # number of nearest neighbors
        nearest = (-sim[i, :]).argsort()[1:top_k + 1]  # argsort sorts from smallest to largest
        log = 'Nearest to %s:' % valid_word
        for k in range(top_k):
          close_word = reverse_dictionary[nearest[k]]
          log = '%s %s,' % (log, close_word)
        print(log)
  final_embeddings = normalized_embeddings.eval()

num_points = 400

tsne = TSNE(perplexity=30, n_components=2, init='pca', n_iter=5000)
two_d_embeddings = tsne.fit_transform(final_embeddings[1:num_points + 1, :])

def plot(embeddings, labels):
  assert embeddings.shape[0] >= len(labels), 'More labels than embeddings'
  pylab.figure(figsize=(15, 15))  # in inches
  for i, label in enumerate(labels):
    x, y = embeddings[i, :]
    pylab.scatter(x, y)
    pylab.annotate(label, xy=(x, y), xytext=(5, 2), textcoords='offset points',
                   ha='right', va='bottom')
  pylab.show()

words = [reverse_dictionary[i] for i in range(1, num_points + 1)]
plot(two_d_embeddings, words)
CBOW model (almost identical to skip-gram; only the part from the generate_batch function through to the training setup differs, the rest is the same):
%matplotlib inline
from __future__ import print_function
import collections
import math
import numpy as np
import os
import random
import tensorflow as tf
import zipfile
from matplotlib import pylab
from six.moves import range
from six.moves.urllib.request import urlretrieve
from sklearn.manifold import TSNE

url = 'http://mattmahoney.net/dc/'

def maybe_download(filename, expected_bytes):
  """Download a file if not present, and make sure it's the right size."""
  if not os.path.exists(filename):
    filename, _ = urlretrieve(url + filename, filename)
  statinfo = os.stat(filename)
  if statinfo.st_size == expected_bytes:
    print('Found and verified %s' % filename)
  else:
    print(statinfo.st_size)
    raise Exception(
      'Failed to verify ' + filename + '. Can you get to it with a browser?')
  return filename

filename = maybe_download('text8.zip', 31344016)

def read_data(filename):
  """Extract the first file enclosed in a zip file as a list of words"""
  with zipfile.ZipFile(filename) as f:
    data = f.namelist()            # list type, element is the str "text8"
    data = data[0]                 # get the str "text8"
    data = f.read(data)            # read data as bytes
    data = tf.compat.as_str(data)  # convert bytes to str
    data = data.split()            # split the str on whitespace
  return data

words = read_data(filename)
print('Data size %d' % len(words))

vocabulary_size = 50000

def build_dataset(words):
  count = [['UNK', -1]]
  count.extend(collections.Counter(words).most_common(vocabulary_size - 1))  # count each word, listed as ['word', count]
  dictionary = dict()
  for word, _ in count:
    dictionary[word] = len(dictionary)  # dict key is the word, value is its index
  data = list()  # same content as the words list, but converted to dictionary indices
  unk_count = 0
  for word in words:
    if word in dictionary:
      index = dictionary[word]
    else:
      index = 0  # dictionary['UNK']
      unk_count = unk_count + 1
    data.append(index)
  count[0][1] = unk_count
  reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
  return data, count, dictionary, reverse_dictionary

data, count, dictionary, reverse_dictionary = build_dataset(words)
print('Most common words (+UNK)', count[:5])
print('Sample data', data[:10])
print('words list', words[:10])
del words  # Hint to reduce memory.

data_index = 0

def generate_batch(batch_size, num_words, range_words):
  global data_index  # global variable
  # assert batch_size % num_words == 0  # make sure a whole bag of words fits in the batch
  assert num_words <= 2 * range_words
  batch = np.ndarray(shape=(batch_size, num_words), dtype=np.int32)
  labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
  span = 2 * range_words + 1  # [ range_words label range_words ]
  buffer = collections.deque(maxlen=span)
  for _ in range(span):
    buffer.append(data[data_index])  # fill the sliding window
    data_index = (data_index + 1) % len(data)
  for i in range(batch_size):
    labels[i, 0] = buffer[range_words]
    input_index = range_words  # index into buffer that will be picked for the input matrix
    input_to_avoid = [input_index]  # the label index will not be picked as an input
    for j in range(num_words):
      while input_index in input_to_avoid:
        input_index = random.randint(0, span - 1)  # re-draw until the index is not in input_to_avoid
      input_to_avoid.append(input_index)  # remember the picked index so it is not picked again
      batch[i, j] = buffer[input_index]
    buffer.append(data[data_index])  # shift the window one index to the right
    data_index = (data_index + 1) % len(data)
  return batch, labels

print('data:', [reverse_dictionary[di] for di in data[:16]])

data_index = 0
num_words = 4
range_words = 2
batch, labels = generate_batch(batch_size=8, num_words=num_words, range_words=range_words)
print('\nwith num_words = %d and range_words = %d:' % (num_words, range_words))
for i in range(8):
  print('    batch_%d:' % i, [reverse_dictionary[bi] for bi in batch[i, :]])
  print('    labels_%d:' % i, [reverse_dictionary[li] for li in labels[i]])

batch_size = 128
embedding_size = 128  # Dimension of the embedding vector.
range_words = 1  # How many words to consider left and right.
num_words = 2    # How many context words are fed in for each label.
# We pick a random validation set to sample nearest neighbors. Here we limit the
# validation samples to the words that have a low numeric ID, which by
# construction are also the most frequent.
valid_size = 16     # Random set of words to evaluate similarity on.
valid_window = 100  # Only pick dev samples in the head of the distribution.
valid_examples = np.array(random.sample(range(valid_window), valid_size))
num_sampled = 64    # Number of negative examples to sample.

graph = tf.Graph()

with graph.as_default(), tf.device('/cpu:0'):

  # Input data.
  train_dataset = tf.placeholder(tf.int32, shape=[batch_size, num_words])
  train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
  valid_dataset = tf.constant(valid_examples, dtype=tf.int32)

  # Variables.
  # tf.random_uniform(shape, minval=0, maxval=None, dtype=tf.float32, seed=None, name=None)
  embeddings = tf.Variable(
    tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
  softmax_weights = tf.Variable(
    tf.truncated_normal([vocabulary_size, embedding_size],
                        stddev=1.0 / math.sqrt(embedding_size)))
  softmax_biases = tf.Variable(tf.zeros([vocabulary_size]))

  # Model.
  # Look up embeddings for the context words and sum them.
  embed = tf.zeros([batch_size, embedding_size])
  for i in range(num_words):
    embed += tf.nn.embedding_lookup(embeddings, train_dataset[:, i])
  # Compute the softmax loss, using a sample of the negative labels each time.
  loss = tf.reduce_mean(
    tf.nn.sampled_softmax_loss(weights=softmax_weights, biases=softmax_biases,
                               inputs=embed, labels=train_labels,
                               num_sampled=num_sampled, num_classes=vocabulary_size))

  # Optimizer.
  # Note: The optimizer will optimize the softmax_weights AND the embeddings.
  # This is because the embeddings are defined as a variable quantity and the
  # optimizer's `minimize` method will by default modify all variable quantities
  # that contribute to the tensor it is passed.
  # See docs on `tf.train.Optimizer.minimize()` for more details.
  optimizer = tf.train.AdagradOptimizer(1.0).minimize(loss)

  # Compute the similarity between minibatch examples and all embeddings.
  # We use the cosine distance:
  norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
  normalized_embeddings = embeddings / norm
  valid_embeddings = tf.nn.embedding_lookup(
    normalized_embeddings, valid_dataset)
  similarity = tf.matmul(valid_embeddings, tf.transpose(normalized_embeddings))
  print(normalized_embeddings.get_shape())

num_steps = 100001

with tf.Session(graph=graph) as session:
  tf.global_variables_initializer().run()
  print('Initialized')
  average_loss = 0
  for step in range(num_steps):
    batch_data, batch_labels = generate_batch(
      batch_size, num_words, range_words)
    feed_dict = {train_dataset: batch_data, train_labels: batch_labels}
    _, l = session.run([optimizer, loss], feed_dict=feed_dict)
    average_loss += l
    if step % 2000 == 0:
      if step > 0:
        average_loss = average_loss / 2000
      # The average loss is an estimate of the loss over the last 2000 batches.
      print('Average loss at step %d: %f' % (step, average_loss))
      average_loss = 0
    # Note that this is expensive (~20% slowdown if computed every 500 steps).
    if step % 10000 == 0:
      sim = similarity.eval()
      for i in range(valid_size):
        valid_word = reverse_dictionary[valid_examples[i]]
        top_k = 8  # number of nearest neighbors
        nearest = (-sim[i, :]).argsort()[1:top_k + 1]  # argsort sorts from smallest to largest
        log = 'Nearest to %s:' % valid_word
        for k in range(top_k):
          close_word = reverse_dictionary[nearest[k]]
          log = '%s %s,' % (log, close_word)
        print(log)
  final_embeddings = normalized_embeddings.eval()

num_points = 400

tsne = TSNE(perplexity=30, n_components=2, init='pca', n_iter=5000)
two_d_embeddings = tsne.fit_transform(final_embeddings[1:num_points + 1, :])

def plot(embeddings, labels):
  assert embeddings.shape[0] >= len(labels), 'More labels than embeddings'
  pylab.figure(figsize=(15, 15))  # in inches
  for i, label in enumerate(labels):
    x, y = embeddings[i, :]
    pylab.scatter(x, y)
    pylab.annotate(label, xy=(x, y), xytext=(5, 2), textcoords='offset points',
                   ha='right', va='bottom')
  pylab.show()

words = [reverse_dictionary[i] for i in range(1, num_points + 1)]
plot(two_d_embeddings, words)