The notes below are my own write-up of what I learned from the course, but the code comes from the course assignments and is not my original work. The deep learning API used is TensorFlow.
Course URL: https://classroom.udacity.com/courses/ud730
Word2Vec
For a machine to learn, the things to be learned must have defined features. Take tumours as an example: if we want a machine to learn to classify them as benign or malignant, the features could be defined as tumour size, density, presence of gene mutations, and so on, and the features of benign and malignant tumours are then fed to the machine for learning. Word2Vec is an algorithm that turns a single word into such a feature vector.
Text is different from image data. An image is understood mainly from the arrangement of its own pixels, so its features come from the image itself, whereas the meaning of a word is determined by its surrounding context. If two different words, say cat and kitty, always appear with the same surrounding words in a text, we can conclude that they mean the same thing, so their feature vectors should look the same. Word2Vec's approach is therefore to take a word as input and output a prediction of its surrounding words. The model is a two-layer fully connected neural network (1 hidden layer).
Suppose the vocabulary contains N words; a schematic of the model is shown below:
Input layer
The input layer is a one-hot encoding of size N. Each element of the input layer corresponds to one word in the vocabulary: 0 means that word is not the input, 1 means it is.
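As a small illustration of what the input layer looks like, here is a toy one-hot encoding with a made-up five-word vocabulary (not the course data):

import numpy as np

# Toy vocabulary of N = 5 words, made up purely for illustration.
vocab = ['the', 'cat', 'sat', 'on', 'mat']
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
  vec = np.zeros(len(vocab))
  vec[word_to_index[word]] = 1.0  # 1 marks the input word, every other element stays 0
  return vec

print(one_hot('cat'))  # [0. 1. 0. 0. 0.]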
Hidden Layer
The hidden layer is the input layer multiplied by the weight matrix (1xN * NxV = 1xV). Because the input is one-hot, the hidden layer is simply the sum of the rows pulled out of the weight matrix, and those rows' indices are exactly the indices where the input elements are 1. The input layer therefore acts as a lookup table into the rows of the weight matrix.
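A minimal NumPy sketch of this point, with made-up sizes rather than the course model: multiplying a one-hot row vector by the weight matrix returns exactly one row of that matrix, which is what tf.nn.embedding_lookup does in the full code further down.

import numpy as np

N, V = 5, 3  # made-up vocabulary size and feature-vector dimension
weights = np.arange(N * V, dtype=float).reshape(N, V)  # 1st weight matrix (NxV)

one_hot = np.zeros(N)
one_hot[2] = 1.0  # "input" the word whose index is 2

hidden_from_matmul = one_hot @ weights  # 1xN * NxV -> 1xV
hidden_from_lookup = weights[2]         # simply read out row 2

print(np.array_equal(hidden_from_matmul, hidden_from_lookup))  # True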
1st Weight Matrix = Word Feature Vector Matrix
Each row of the weight matrix corresponds to one input-layer element, and each element corresponds to a word, so each row vector of the weight matrix is that word's feature vector. The dimension of the vector equals the number of columns of the matrix and is adjustable: a larger dimension predicts words more accurately but costs more computation.
Input to Hidden Layer Weights
The weight matrices of the neural network model are trainable: we feed words into the model to predict their surrounding words, and after many training iterations the weight matrix between the input and hidden layers is the word vector matrix we want.
Random Sampled Softmax
The computation from the hidden layer (the feature vector) to the output layer is expensive, because the output dimension equals the size of the whole vocabulary; if we can reduce the number of output elements we compute, we save a lot of work. Conveniently, most of the output elements are 0, so we do not need to compute all of them. Besides keeping the element for the label, we pick a few non-label elements, and each forward/backward propagation only computes these selected elements; gradient descent then only fine-tunes the weights involved in computing them (in the weight matrix from the feature vector to the output layer). Here the output elements are picked at random, but the papers seem to describe better selection methods, which I will learn about when I run into them later.
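A rough NumPy sketch of the idea, purely for illustration (the full code below simply calls TensorFlow's tf.nn.sampled_softmax_loss, which handles the negative sampling internally): for one training example the output layer is evaluated only at the label plus a handful of randomly drawn negative words, instead of at all 50,000 vocabulary entries.

import numpy as np

vocabulary_size = 50000
embedding_size = 128
num_sampled = 64

rng = np.random.default_rng(0)
softmax_weights = rng.normal(scale=1.0 / np.sqrt(embedding_size),
                             size=(vocabulary_size, embedding_size))
softmax_biases = np.zeros(vocabulary_size)

hidden = rng.normal(size=embedding_size)  # feature vector coming out of the hidden layer
label = 42                                # index of the true context word (made up)

negatives = rng.choice(vocabulary_size, size=num_sampled, replace=False)
negatives = negatives[negatives != label]       # drop the label if it happened to be drawn
sampled = np.concatenate(([label], negatives))  # label + sampled negatives only

logits = softmax_weights[sampled] @ hidden + softmax_biases[sampled]  # ~65 values, not 50000
probs = np.exp(logits - logits.max())
probs /= probs.sum()
loss = -np.log(probs[0])  # the label sits at position 0 of `sampled`
print(loss)

Only the rows of softmax_weights indexed by `sampled` (and the biases for the same indices) take part in this computation, so only those weights would receive gradient updates in that step.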
That covers how the model runs, but how do we decide the labels for the input words? They can be obtained with two models, "Skip-gram" and "CBOW".
Skip-gram
The logic of skip-gram is to feed in one word at a time; the labels are the words within a fixed distance before and after it, so the same input word gets several labels. That context distance is one of skip-gram's parameters. Take the sentence "I have a big dog and horse" with the distance set to one word: if the input word is "dog", the labels are "big" and "and". With a distance of two words, the labels are "a", "big", "and" and "horse". A small code sketch of this sampling follows the figure below.
Input & label sampling example
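Continuing the sentence example above, here is a small stand-alone sketch (not the course's generate_batch function) that lists the (input, label) pairs skip-gram would produce:

sentence = "I have a big dog and horse".split()
skip_window = 2  # how many words on each side count as context (try 1 as well)

pairs = []
for i, word in enumerate(sentence):
  lo = max(0, i - skip_window)
  hi = min(len(sentence), i + skip_window + 1)
  for j in range(lo, hi):
    if j != i:
      pairs.append((word, sentence[j]))  # same input word, one pair per context word

for inp, lab in pairs:
  if inp == 'dog':
    print(inp, '->', lab)  # dog -> a, dog -> big, dog -> and, dog -> horse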
CBOW (Continuous Bag-of-Words)
Unlike skip-gram, where one input word produces several labels, CBOW takes the middle word of a window of text as the label and the surrounding words as the inputs, so several input words produce one output label. The window length is adjustable. A small code sketch follows the figure below.
Sampling example (window length: 3)
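The same sentence run through a small stand-alone CBOW sketch (again not the course's generate_batch): the middle word of each window becomes the label and its neighbours become the inputs.

sentence = "I have a big dog and horse".split()
window = 3       # window length: label in the middle, one word on each side
half = window // 2

for i in range(half, len(sentence) - half):
  label = sentence[i]
  inputs = sentence[i - half:i] + sentence[i + 1:i + half + 1]
  print(inputs, '->', label)
# ['I', 'a'] -> have
# ['have', 'big'] -> a
# ['a', 'dog'] -> big
# ...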
Visualizing Word Feature Vectors
Every word has its own feature vector, and the closer two vectors are, the closer the words' meanings. To visualize this, we use t-SNE to reduce the vectors to two dimensions and plot them. The result is shown in the figure below: words that sit closer together in vector space tend to be the same kind of word.
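For reference, a minimal sketch of that reduction and plotting step (it mirrors what the plot() function at the end of the full listings does); the embedding matrix here is a random stand-in, just to keep the snippet self-contained.

import numpy as np
from sklearn.manifold import TSNE
from matplotlib import pylab

embeddings = np.random.rand(200, 128)         # random stand-in for the trained word vectors
labels = ['word_%d' % i for i in range(200)]  # made-up labels

two_d = TSNE(n_components=2, perplexity=30, init='pca').fit_transform(embeddings)

pylab.figure(figsize=(10, 10))
for (x, y), label in zip(two_d, labels):
  pylab.scatter(x, y)
  pylab.annotate(label, xy=(x, y), xytext=(5, 2), textcoords='offset points')
pylab.show()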
Code (Python)
This code was written in IPython and has only actually been run in an IPython notebook.
Skip-gram model:
%matplotlib inline
from __future__ import print_function
import collections
import math
import numpy as np
import os
import random
import tensorflow as tf
import zipfile
from matplotlib import pylab
from six.moves import range
from six.moves.urllib.request import urlretrieve
from sklearn.manifold import TSNE

url = 'http://mattmahoney.net/dc/'

def maybe_download(filename, expected_bytes):
  """Download a file if not present, and make sure it's the right size."""
  if not os.path.exists(filename):
    filename, _ = urlretrieve(url + filename, filename)
  statinfo = os.stat(filename)
  if statinfo.st_size == expected_bytes:
    print('Found and verified %s' % filename)
  else:
    print(statinfo.st_size)
    raise Exception(
      'Failed to verify ' + filename + '. Can you get to it with a browser?')
  return filename

filename = maybe_download('text8.zip', 31344016)

def read_data(filename):
  """Extract the first file enclosed in a zip file as a list of words"""
  with zipfile.ZipFile(filename) as f:
    data = f.namelist()            # list type, element is the str "text8"
    data = data[0]                 # get the str "text8"
    data = f.read(data)            # read data as bytes
    data = tf.compat.as_str(data)  # convert bytes to str
    data = data.split()            # split the str on whitespace
  return data

words = read_data(filename)
print('Data size %d' % len(words))

vocabulary_size = 50000

def build_dataset(words):
  count = [['UNK', -1]]
  count.extend(collections.Counter(words).most_common(vocabulary_size - 1))  # count each word, listed as ['word', count]
  dictionary = dict()
  for word, _ in count:
    dictionary[word] = len(dictionary)  # dict key is the word, value is its index
  data = list()  # same content as the words list, but converted to dictionary indices
  unk_count = 0
  for word in words:
    if word in dictionary:
      index = dictionary[word]
    else:
      index = 0  # dictionary['UNK']
      unk_count = unk_count + 1
    data.append(index)
  count[0][1] = unk_count
  reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
  return data, count, dictionary, reverse_dictionary

data, count, dictionary, reverse_dictionary = build_dataset(words)
print('Most common words (+UNK)', count[:5])
print('Sample data', data[:10])
print('words list', words[:10])
del words  # Hint to reduce memory.

data_index = 0

def generate_batch(batch_size, num_skips, skip_window):
  global data_index  # global variable
  assert batch_size % num_skips == 0  # make sure batch_size is a multiple of num_skips
  assert num_skips <= 2 * skip_window
  batch = np.ndarray(shape=(batch_size), dtype=np.int32)
  labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
  span = 2 * skip_window + 1  # [ skip_window target skip_window ]
  buffer = collections.deque(maxlen=span)
  for _ in range(span):
    buffer.append(data[data_index])  # fill the sliding window
    data_index = (data_index + 1) % len(data)
  for i in range(batch_size // num_skips):
    target = skip_window  # index into buffer that will be picked for the label matrix
    targets_to_avoid = [skip_window]  # these indices will not be picked as labels
    for j in range(num_skips):
      while target in targets_to_avoid:
        target = random.randint(0, span - 1)  # re-draw until the target is not in targets_to_avoid
      targets_to_avoid.append(target)  # remember the picked target so it is not picked again
      batch[i * num_skips + j] = buffer[skip_window]
      labels[i * num_skips + j, 0] = buffer[target]
    buffer.append(data[data_index])  # shift the window one index to the right
    data_index = (data_index + 1) % len(data)
  return batch, labels

print('data:', [reverse_dictionary[di] for di in data[:8]])

for num_skips, skip_window in [(2, 1), (4, 2)]:
  data_index = 0
  batch, labels = generate_batch(batch_size=8, num_skips=num_skips, skip_window=skip_window)
  print('\nwith num_skips = %d and skip_window = %d:' % (num_skips, skip_window))
  print('    batch:', [reverse_dictionary[bi] for bi in batch])
  print('    labels:', [reverse_dictionary[li] for li in labels.reshape(8)])

batch_size = 128
embedding_size = 128  # Dimension of the embedding vector.
skip_window = 1       # How many words to consider left and right.
num_skips = 2         # How many times to reuse an input to generate a label.
# We pick a random validation set to sample nearest neighbors. Here we limit the
# validation samples to the words that have a low numeric ID, which by
# construction are also the most frequent.
valid_size = 16     # Random set of words to evaluate similarity on.
valid_window = 100  # Only pick dev samples in the head of the distribution.
valid_examples = np.array(random.sample(range(valid_window), valid_size))
num_sampled = 64    # Number of negative examples to sample.

graph = tf.Graph()

with graph.as_default(), tf.device('/cpu:0'):

  # Input data.
  train_dataset = tf.placeholder(tf.int32, shape=[batch_size])
  train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
  valid_dataset = tf.constant(valid_examples, dtype=tf.int32)

  # Variables.
  # tf.random_uniform(shape, minval=0, maxval=None, dtype=tf.float32, seed=None, name=None)
  embeddings = tf.Variable(
    tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
  softmax_weights = tf.Variable(
    tf.truncated_normal([vocabulary_size, embedding_size],
                        stddev=1.0 / math.sqrt(embedding_size)))
  softmax_biases = tf.Variable(tf.zeros([vocabulary_size]))

  # Model.
  # Look up embeddings for inputs.
  embed = tf.nn.embedding_lookup(embeddings, train_dataset)
  # Compute the softmax loss, using a sample of the negative labels each time.
  loss = tf.reduce_mean(
    tf.nn.sampled_softmax_loss(weights=softmax_weights, biases=softmax_biases,
                               inputs=embed, labels=train_labels,
                               num_sampled=num_sampled, num_classes=vocabulary_size))

  # Optimizer.
  # Note: The optimizer will optimize the softmax_weights AND the embeddings.
  # This is because the embeddings are defined as a variable quantity and the
  # optimizer's `minimize` method will by default modify all variable quantities
  # that contribute to the tensor it is passed.
  # See docs on `tf.train.Optimizer.minimize()` for more details.
  optimizer = tf.train.AdagradOptimizer(1.0).minimize(loss)

  # Compute the similarity between minibatch examples and all embeddings.
  # We use the cosine distance:
  norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
  normalized_embeddings = embeddings / norm
  valid_embeddings = tf.nn.embedding_lookup(
    normalized_embeddings, valid_dataset)
  similarity = tf.matmul(valid_embeddings, tf.transpose(normalized_embeddings))
  print(normalized_embeddings.get_shape())

num_steps = 100001

with tf.Session(graph=graph) as session:
  tf.global_variables_initializer().run()
  print('Initialized')
  average_loss = 0
  for step in range(num_steps):
    batch_data, batch_labels = generate_batch(
      batch_size, num_skips, skip_window)
    feed_dict = {train_dataset: batch_data, train_labels: batch_labels}
    _, l = session.run([optimizer, loss], feed_dict=feed_dict)
    average_loss += l
    if step % 2000 == 0:
      if step > 0:
        average_loss = average_loss / 2000
      # The average loss is an estimate of the loss over the last 2000 batches.
      print('Average loss at step %d: %f' % (step, average_loss))
      average_loss = 0
    # Note that this is expensive (~20% slowdown if computed every 500 steps).
    if step % 10000 == 0:
      sim = similarity.eval()
      for i in range(valid_size):
        valid_word = reverse_dictionary[valid_examples[i]]
        top_k = 8  # number of nearest neighbors
        nearest = (-sim[i, :]).argsort()[1:top_k + 1]  # argsort sorts from smallest to largest
        log = 'Nearest to %s:' % valid_word
        for k in range(top_k):
          close_word = reverse_dictionary[nearest[k]]
          log = '%s %s,' % (log, close_word)
        print(log)
  final_embeddings = normalized_embeddings.eval()

num_points = 400

tsne = TSNE(perplexity=30, n_components=2, init='pca', n_iter=5000)
two_d_embeddings = tsne.fit_transform(final_embeddings[1:num_points + 1, :])

def plot(embeddings, labels):
  assert embeddings.shape[0] >= len(labels), 'More labels than embeddings'
  pylab.figure(figsize=(15, 15))  # in inches
  for i, label in enumerate(labels):
    x, y = embeddings[i, :]
    pylab.scatter(x, y)
    pylab.annotate(label, xy=(x, y), xytext=(5, 2), textcoords='offset points',
                   ha='right', va='bottom')
  pylab.show()

words = [reverse_dictionary[i] for i in range(1, num_points + 1)]
plot(two_d_embeddings, words)
CBOW model (almost identical to skip-gram; only the part from the generate_batch function through to the training setup differs, the rest is the same):
%matplotlib inline
from __future__ import print_function
import collections
import math
import numpy as np
import os
import random
import tensorflow as tf
import zipfile
from matplotlib import pylab
from six.moves import range
from six.moves.urllib.request import urlretrieve
from sklearn.manifold import TSNE

url = 'http://mattmahoney.net/dc/'

def maybe_download(filename, expected_bytes):
  """Download a file if not present, and make sure it's the right size."""
  if not os.path.exists(filename):
    filename, _ = urlretrieve(url + filename, filename)
  statinfo = os.stat(filename)
  if statinfo.st_size == expected_bytes:
    print('Found and verified %s' % filename)
  else:
    print(statinfo.st_size)
    raise Exception(
      'Failed to verify ' + filename + '. Can you get to it with a browser?')
  return filename

filename = maybe_download('text8.zip', 31344016)

def read_data(filename):
  """Extract the first file enclosed in a zip file as a list of words"""
  with zipfile.ZipFile(filename) as f:
    data = f.namelist()            # list type, element is the str "text8"
    data = data[0]                 # get the str "text8"
    data = f.read(data)            # read data as bytes
    data = tf.compat.as_str(data)  # convert bytes to str
    data = data.split()            # split the str on whitespace
  return data

words = read_data(filename)
print('Data size %d' % len(words))

vocabulary_size = 50000

def build_dataset(words):
  count = [['UNK', -1]]
  count.extend(collections.Counter(words).most_common(vocabulary_size - 1))  # count each word, listed as ['word', count]
  dictionary = dict()
  for word, _ in count:
    dictionary[word] = len(dictionary)  # dict key is the word, value is its index
  data = list()  # same content as the words list, but converted to dictionary indices
  unk_count = 0
  for word in words:
    if word in dictionary:
      index = dictionary[word]
    else:
      index = 0  # dictionary['UNK']
      unk_count = unk_count + 1
    data.append(index)
  count[0][1] = unk_count
  reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
  return data, count, dictionary, reverse_dictionary

data, count, dictionary, reverse_dictionary = build_dataset(words)
print('Most common words (+UNK)', count[:5])
print('Sample data', data[:10])
print('words list', words[:10])
del words  # Hint to reduce memory.

data_index = 0

def generate_batch(batch_size, num_words, range_words):
  global data_index  # global variable
  # assert batch_size % num_words == 0  # make sure a whole bag of words fits in the batch
  assert num_words <= 2 * range_words
  batch = np.ndarray(shape=(batch_size, num_words), dtype=np.int32)
  labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
  span = 2 * range_words + 1  # [ range_words label range_words ]
  buffer = collections.deque(maxlen=span)
  for _ in range(span):
    buffer.append(data[data_index])  # fill the sliding window
    data_index = (data_index + 1) % len(data)
  for i in range(batch_size):
    labels[i, 0] = buffer[range_words]
    input_index = range_words  # index into buffer that will be picked for the input matrix
    input_to_avoid = [input_index]  # the label index will not be picked as an input
    for j in range(num_words):
      while input_index in input_to_avoid:
        input_index = random.randint(0, span - 1)  # re-draw until the index is not in input_to_avoid
      input_to_avoid.append(input_index)  # remember the picked index so it is not picked again
      batch[i, j] = buffer[input_index]
    buffer.append(data[data_index])  # shift the window one index to the right
    data_index = (data_index + 1) % len(data)
  return batch, labels

print('data:', [reverse_dictionary[di] for di in data[:16]])

data_index = 0
num_words = 4
range_words = 2
batch, labels = generate_batch(batch_size=8, num_words=num_words, range_words=range_words)
print('\nwith num_words = %d and range_words = %d:' % (num_words, range_words))
for i in range(8):
  print('    batch_%d:' % i, [reverse_dictionary[bi] for bi in batch[i, :]])
  print('    labels_%d:' % i, [reverse_dictionary[li] for li in labels[i]])

batch_size = 128
embedding_size = 128  # Dimension of the embedding vector.
range_words = 1  # How many words to consider left and right.
num_words = 2    # How many context words are fed in for each label.
# We pick a random validation set to sample nearest neighbors. Here we limit the
# validation samples to the words that have a low numeric ID, which by
# construction are also the most frequent.
valid_size = 16     # Random set of words to evaluate similarity on.
valid_window = 100  # Only pick dev samples in the head of the distribution.
valid_examples = np.array(random.sample(range(valid_window), valid_size))
num_sampled = 64    # Number of negative examples to sample.

graph = tf.Graph()

with graph.as_default(), tf.device('/cpu:0'):

  # Input data.
  train_dataset = tf.placeholder(tf.int32, shape=[batch_size, num_words])
  train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
  valid_dataset = tf.constant(valid_examples, dtype=tf.int32)

  # Variables.
  # tf.random_uniform(shape, minval=0, maxval=None, dtype=tf.float32, seed=None, name=None)
  embeddings = tf.Variable(
    tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
  softmax_weights = tf.Variable(
    tf.truncated_normal([vocabulary_size, embedding_size],
                        stddev=1.0 / math.sqrt(embedding_size)))
  softmax_biases = tf.Variable(tf.zeros([vocabulary_size]))

  # Model.
  # Look up embeddings for the context words and sum them.
  embed = tf.zeros([batch_size, embedding_size])
  for i in range(num_words):
    embed += tf.nn.embedding_lookup(embeddings, train_dataset[:, i])
  # Compute the softmax loss, using a sample of the negative labels each time.
  loss = tf.reduce_mean(
    tf.nn.sampled_softmax_loss(weights=softmax_weights, biases=softmax_biases,
                               inputs=embed, labels=train_labels,
                               num_sampled=num_sampled, num_classes=vocabulary_size))

  # Optimizer.
  # Note: The optimizer will optimize the softmax_weights AND the embeddings.
  # This is because the embeddings are defined as a variable quantity and the
  # optimizer's `minimize` method will by default modify all variable quantities
  # that contribute to the tensor it is passed.
  # See docs on `tf.train.Optimizer.minimize()` for more details.
  optimizer = tf.train.AdagradOptimizer(1.0).minimize(loss)

  # Compute the similarity between minibatch examples and all embeddings.
  # We use the cosine distance:
  norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
  normalized_embeddings = embeddings / norm
  valid_embeddings = tf.nn.embedding_lookup(
    normalized_embeddings, valid_dataset)
  similarity = tf.matmul(valid_embeddings, tf.transpose(normalized_embeddings))
  print(normalized_embeddings.get_shape())

num_steps = 100001

with tf.Session(graph=graph) as session:
  tf.global_variables_initializer().run()
  print('Initialized')
  average_loss = 0
  for step in range(num_steps):
    batch_data, batch_labels = generate_batch(
      batch_size, num_words, range_words)
    feed_dict = {train_dataset: batch_data, train_labels: batch_labels}
    _, l = session.run([optimizer, loss], feed_dict=feed_dict)
    average_loss += l
    if step % 2000 == 0:
      if step > 0:
        average_loss = average_loss / 2000
      # The average loss is an estimate of the loss over the last 2000 batches.
      print('Average loss at step %d: %f' % (step, average_loss))
      average_loss = 0
    # Note that this is expensive (~20% slowdown if computed every 500 steps).
    if step % 10000 == 0:
      sim = similarity.eval()
      for i in range(valid_size):
        valid_word = reverse_dictionary[valid_examples[i]]
        top_k = 8  # number of nearest neighbors
        nearest = (-sim[i, :]).argsort()[1:top_k + 1]  # argsort sorts from smallest to largest
        log = 'Nearest to %s:' % valid_word
        for k in range(top_k):
          close_word = reverse_dictionary[nearest[k]]
          log = '%s %s,' % (log, close_word)
        print(log)
  final_embeddings = normalized_embeddings.eval()

num_points = 400

tsne = TSNE(perplexity=30, n_components=2, init='pca', n_iter=5000)
two_d_embeddings = tsne.fit_transform(final_embeddings[1:num_points + 1, :])

def plot(embeddings, labels):
  assert embeddings.shape[0] >= len(labels), 'More labels than embeddings'
  pylab.figure(figsize=(15, 15))  # in inches
  for i, label in enumerate(labels):
    x, y = embeddings[i, :]
    pylab.scatter(x, y)
    pylab.annotate(label, xy=(x, y), xytext=(5, 2), textcoords='offset points',
                   ha='right', va='bottom')
  pylab.show()

words = [reverse_dictionary[i] for i in range(1, num_points + 1)]
plot(two_d_embeddings, words)