Tip: You can view this post on GitHub or run it on Binder or Google Colab by clicking the icon.

Data preprocessing

The first step will always be to process the data so that it can later be used in some classification or regression model.

  • Load the required libraries
  • Load the data from a sample dataset in the nltk library (very useful for text processing)
  • Process the data:
    • Remove elements that are not useful here: hyperlinks, Twitter marks and styles
    • Tokenize the strings: split the tweets or sentences into words
    • Remove the stopwords (words with no meaning on their own; they only modify or accompany other words)
    • Apply stemming, keeping only the word root that carries the most meaning
    • Convert to lowercase
import nltk                                # Python library for NLP
from nltk.corpus import twitter_samples   # sample Twitter dataset from NLTK
import matplotlib.pyplot as plt            # library for visualization
import random                            # pseudo-random number generator
import numpy as np      
nltk.download('twitter_samples')
[nltk_data] Downloading package twitter_samples to
[nltk_data]     C:\Users\juan_\AppData\Roaming\nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!
True
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')
print('Number of positive tweets: ', len(all_positive_tweets))
print('Number of negative tweets: ', len(all_negative_tweets))

print('\nThe type of all_positive_tweets is: ', type(all_positive_tweets))
print('The type of a tweet entry is: ', type(all_negative_tweets[0]))
Number of positive tweets:  5000
Number of negative tweets:  5000

The type of all_positive_tweets is:  <class 'list'>
The type of a tweet entry is:  <class 'str'>
# print positive in green
print('\033[92m' + all_positive_tweets[random.randint(0, len(all_positive_tweets) - 1)])

# print negative in red
print('\033[91m' + all_negative_tweets[random.randint(0, len(all_negative_tweets) - 1)])
[ Singles &amp; Dating ] Open Question : How can I start and keep a long and happy long-distance relationship?: Hi :) last week I was in a…
Hate when I can't remember my dreams, I love sharing them :(
nltk.download('stopwords')
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\juan_\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.
True
import re                                  # library for regular expression operations
import string                              # for string operations

from nltk.corpus import stopwords          # module for stop words that come with NLTK
from nltk.stem import PorterStemmer        # module for stemming
from nltk.tokenize import TweetTokenizer   # module for tokenizing strings

Removing hyperlinks, Twitter marks and styles

tweet = all_positive_tweets[2277]
print('\033[92m' + tweet)
print('\033[94m')

# remove old style retweet text "RT"
tweet2 = re.sub(r'^RT[\s]+', '', tweet)

# remove hyperlinks
tweet2 = re.sub(r'https?://[^\s\n\r]+', '', tweet2)

# remove hashtags
# only removing the hash # sign from the word
tweet2 = re.sub(r'#', '', tweet2)

print(tweet2)
My beautiful sunflowers on a sunny Friday morning off :) #sunflowers #favourites #happy #Friday off… https://t.co/3tfYom0N1i

My beautiful sunflowers on a sunny Friday morning off :) sunflowers favourites happy Friday off… 

Tokenizing the strings

We split the tweets or sentences into words.

tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True,
                               reduce_len=True)
# tokenize tweets
tweet_tokens = tokenizer.tokenize(tweet2)

print(tweet_tokens)
['my', 'beautiful', 'sunflowers', 'on', 'a', 'sunny', 'friday', 'morning', 'off', ':)', 'sunflowers', 'favourites', 'happy', 'friday', 'off', '…']

Removing stopwords

These are words that have no meaning on their own; they only modify or accompany other words.

stopwords_english = stopwords.words('english') 

print('Stop words\n')
print(stopwords_english)

print('\nPunctuation\n')
print(string.punctuation)
Stop words

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

Punctuation

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
tweets_clean = []

for word in tweet_tokens: # Go through every word in your tokens list
    if (word not in stopwords_english and  # remove stopwords
        word not in string.punctuation):  # remove punctuation
        tweets_clean.append(word)

print('removed stop words and punctuation:')
print(tweets_clean)
removed stop words and punctuation:
['beautiful', 'sunflowers', 'sunny', 'friday', 'morning', ':)', 'sunflowers', 'favourites', 'happy', 'friday', '…']

Stemming

The process of converting a word to its most general form (its root or stem).

stemmer = PorterStemmer() 

# Create an empty list to store the stems
tweets_stem = [] 

for word in tweets_clean:
    stem_word = stemmer.stem(word)  # stemming word
    tweets_stem.append(stem_word)  # append to the list

print('stemmed words:')
print(tweets_stem)
stemmed words:
['beauti', 'sunflow', 'sunni', 'friday', 'morn', ':)', 'sunflow', 'favourit', 'happi', 'friday', '…']

Converting all words to lowercase

cadena= "Hola. Me llamo Pedro"
print(cadena.lower())
hola. me llamo pedro

Logistic regression classifier for sentiment analysis on tweets

In supervised machine learning you generally have an input X which, when passed through the prediction function, produces a prediction Y'.

You can then compare your prediction with the true value Y. This gives you the cost, which is used to update the parameters θ.

For more details see this post: Proceso de aprendizaje automático supervisado

The following image summarizes the process:
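
In code form, that loop looks roughly like this (a minimal sketch with made-up toy data; the real implementation used in this post is the gradientDescent function further below):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Toy data: 4 examples, 3 features each (bias + 2 features), labels 0/1
X = np.array([[1., 2., 0.], [1., 0., 3.], [1., 3., 1.], [1., 0., 2.]])
Y = np.array([[1.], [0.], [1.], [0.]])
theta = np.zeros((3, 1))   # parameters to learn
alpha = 0.1                # learning rate
m = X.shape[0]

for _ in range(100):
    h = sigmoid(X @ theta)                                        # prediction Y'
    J = -(1 / m) * (Y.T @ np.log(h) + (1 - Y).T @ np.log(1 - h))  # cost: compare Y' with Y
    theta = theta - (alpha / m) * (X.T @ (h - Y))                 # update the parameters theta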

Sentiment analysis of a tweet:

  • X -> the text (i.e., "I am happy because I am learning NLP") as features
  • Train the model
  • Classify the text

Vocabulary V (1..n): the set of unique words. Feature extraction: a 1 at the index corresponding to each word that appears in the tweet, and a 0 otherwise.

As V grows, the vector becomes more and more sparse. Sparse representations are problematic for training and prediction times.
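
As a quick illustration of this sparse representation (the vocabulary and tweet below are made up for the example):

vocabulary = ['i', 'am', 'happy', 'because', 'learning', 'nlp', 'sad', 'not']
tweet_words = ['i', 'am', 'happy', 'because', 'i', 'am', 'learning', 'nlp']

# 1 at the index of every vocabulary word that appears in the tweet, 0 otherwise
sparse_vector = [1 if word in tweet_words else 0 for word in vocabulary]
print(sparse_vector)   # [1, 1, 1, 1, 1, 1, 0, 0]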

Preparing the input data

Feature extraction with frequencies

  • Split the tweet corpus into two classes: positive and negative
  • Count how many times each word appears in each of the two classes

*** Corpus = a closed set of texts or data intended for research

Feature vector [1, 8, 11]: 1 corresponds to the bias, 8 to the positive feature and 11 to the negative feature.
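
As a toy illustration of how such a vector is built (the counts below are invented for the example, not taken from the real corpus):

# Hypothetical (word, sentiment) frequency counts
freqs_toy = {('happi', 1): 5, ('learn', 1): 3, ('sad', 0): 6, ('not', 0): 5}
tweet_words = ['happi', 'learn', 'sad', 'not']

pos = sum(freqs_toy.get((w, 1), 0) for w in tweet_words)  # 5 + 3 = 8
neg = sum(freqs_toy.get((w, 0), 0) for w in tweet_words)  # 6 + 5 = 11
print([1, pos, neg])   # [1, 8, 11] -> bias, positive feature, negative feature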

def process_tweet(tweet):
    """Process tweet.
    Input:
        tweet: a string containing one tweet
    Output:
        tweets_processed: a list of cleaned, tokenized, stemmed, lowercase words
    """
    tweet2 = re.sub(r'^RT[\s]+', '', tweet)
    tweet2 = re.sub(r'https?://[^\s\n\r]+', '', tweet2)
    tweet2 = re.sub(r'#', '', tweet2)
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True,
                               reduce_len=True)   
    tweet_tokens = tokenizer.tokenize(tweet2)
    tweets_clean = []

    for word in tweet_tokens: # Go through every word in your tokens list
        if (word not in stopwords_english and  # remove stopwords
            word not in string.punctuation):  # remove punctuation
            tweets_clean.append(word)
    stemmer = PorterStemmer() 
    tweets_stem = [] 
    for word in tweets_clean:
        stem_word = stemmer.stem(word)  # stemming word
        tweets_stem.append(stem_word)  # append to the list
    tweets_processed = [] 
    for word in tweets_stem:
        processed_word = word.lower()  # convert to lowercase
        tweets_processed.append(processed_word)  # append to the list
    return tweets_processed
def build_freqs(tweets, ys):
    """Build frequencies.
    Input:
        tweets: a list of tweets
        ys: an m x 1 array with the sentiment label of each tweet
            (either 0 or 1)
    Output:
        freqs: a dictionary mapping each (word, sentiment) pair to its
        frequency
    """
    # Convert np array to list since zip needs an iterable.
    # The squeeze is necessary or the list ends up with one element.
    # Also note that this is just a NOP if ys is already a list.
    yslist = np.squeeze(ys).tolist()

    # Start with an empty dictionary and populate it by looping over all tweets
    # and over all processed words in each tweet.
    freqs = {}
    for y, tweet in zip(yslist, tweets):
        for word in process_tweet(tweet):
            pair = (word, y)
            if pair in freqs:
                freqs[pair] += 1
            else:
                freqs[pair] = 1    
    return freqs
tweets = all_positive_tweets + all_negative_tweets
# make a numpy array representing labels of the tweets
labels = np.append(np.ones((len(all_positive_tweets))), np.zeros((len(all_negative_tweets))))
freqs = build_freqs(tweets, labels)

# check data type
print(f'type(freqs) = {type(freqs)}')

# check length of the dictionary
print(f'len(freqs) = {len(freqs)}')
type(freqs) = <class 'dict'>
len(freqs) = 13172
tweet = all_positive_tweets[2277]
print(process_tweet(tweet))
['beauti', 'sunflow', 'sunni', 'friday', 'morn', ':)', 'sunflow', 'favourit', 'happi', 'friday', '…']
print(freqs)

Extracting the features

  • The first feature is the number of times the tweet's words appear in positive tweets.
  • The second feature is the number of times the tweet's words appear in negative tweets.
def extract_features(tweet, freqs, process_tweet=process_tweet):
    '''
    Input: 
        tweet: a string containing one tweet
        freqs: a dictionary corresponding to the frequencies of each tuple (word, label)
    Output: 
        x: a feature vector of dimension (1,3)
    '''
    # process_tweet tokenizes, stems, and removes stopwords
    word_l = process_tweet(tweet)
    
    # 3 elements in the form of a 1 x 3 vector
    x = np.zeros((1, 3))    
    
    #bias term is set to 1
    x[0,0] = 1 
        
    # loop through each word in the list of words
    for word in word_l:       
        
        # increment the word count for the positive label 1
        if (word,1) in freqs:
            x[0,1] += freqs[word,1]
        
        # increment the word count for the negative label 0
        if (word,0) in freqs:            
            x[0,2] += freqs[word,0]
        
    assert(x.shape == (1, 3))
    return x
ejemplo_tweet = tweets[1]
print(ejemplo_tweet)
extract_features(ejemplo_tweet, freqs, process_tweet)
@Lamb2ja Hey James! How odd :/ Please call our Contact Centre on 02392441234 and we will be able to assist you :) Many thanks!
array([[1.000e+00, 4.613e+03, 5.180e+02]])

Word count table

data = []
 
for tweet in tweets:
    for word in process_tweet(tweet):

         # initialize positive and negative counts
        pos = 0
        neg = 0

        # retrieve number of positive counts
        if (word, 1) in freqs:
            pos = freqs[(word, 1)]

        # retrieve number of negative counts
        if (word, 0) in freqs:
            neg = freqs[(word, 0)]

        # append the word counts to the table
        data.append([word, pos, neg])
    
data

Visualizing the words according to their features

data1= data[200:250]
fig, ax = plt.subplots(figsize = (8, 8))

# convert positive raw counts to logarithmic scale. we add 1 to avoid log(0)
x = np.log([x[1] + 1 for x in data1])  

# do the same for the negative counts
y = np.log([x[2] + 1 for x in data1]) 

# Plot a dot for each pair of words
ax.scatter(x, y)  

# assign axis labels
plt.xlabel("Log Positive count")
plt.ylabel("Log Negative count")

# Add the word as the label at the same position as you added the points just before
for i in range(0, len(data1)):
    ax.annotate(data1[i][0], (x[i], y[i]), fontsize=12)

ax.plot([0, 9], [0, 9], color = 'red') # Plot the red line that divides the 2 areas.
plt.show()
data[:20]
[['followfriday', 25, 0],
 ['top', 32, 6],
 ['engag', 7, 0],
 ['member', 16, 6],
 ['commun', 33, 2],
 ['week', 83, 56],
 [':)', 3691, 2],
 ['hey', 77, 26],
 ['jame', 7, 4],
 ['odd', 2, 3],
 [':/', 5, 11],
 ['pleas', 99, 275],
 ['call', 37, 29],
 ['contact', 7, 7],
 ['centr', 2, 2],
 ['02392441234', 1, 0],
 ['abl', 8, 23],
 ['assist', 1, 0],
 [':)', 3691, 2],
 ['mani', 33, 29]]

Prediction function

It takes a regular linear regression and applies the sigmoid function to it. Depending on the value obtained, we can classify the inputs:

  • sigmoid output greater than 0.5 (that is, z greater than 0) -> positive
  • sigmoid output less than 0.5 (z less than 0) -> negative

Sigmoid function
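
The function implemented below is h(z) = 1 / (1 + e^(-z)), which maps any real value z to the interval (0, 1), so its output can be interpreted as a probability.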

def sigmoid(z): 
    '''
    Input:
        z: is the input (can be a scalar or an array)
    Output:
        h: the sigmoid of z
    '''  
    return 1 / ( 1 + np.exp(-z))     

Gradient descent

It consists of moving along the cost (error) function looking for the point where it reaches its minimum. To do this, at each position we look for the direction in which the error decreases the most. This direction is given by the partial derivative of the function at that point with respect to each parameter (the gradient of the function), which is the slope of the curve.

Once the direction in which to move has been obtained, the parameters are recalculated as follows:
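
In vector form, the update implemented in the gradientDescent function below is θ := θ - (α/m) · X^T (h - y), where α is the learning rate, m is the number of training examples and h = sigmoid(Xθ) is the vector of predictions.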

Cost (error) function
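
The cost computed in the code is the binary cross-entropy: J = -(1/m) · (y^T log(h) + (1 - y)^T log(1 - h)).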

def gradientDescent(x, y, theta, alpha , num_iters):
    '''
    Input:
        x: matrix of features which is (m,n+1)
        y: corresponding labels of the input matrix x, dimensions (m,1)
        theta: weight vector of dimension (n+1,1)
        alpha: learning rate
        num_iters: number of iterations you want to train your model for
    Output:
        J: the final cost
        theta: your final weight vector
    Hint: you might want to print the cost to make sure that it is going down.
    '''
    # get 'm', the number of rows in matrix x
    m = x.shape[0]
    
    for i in range(0, num_iters):
        
        # get z, the dot product of x and theta
        z = x @ theta
        # get the sigmoid of z
        h = sigmoid(z)
        
        # calculate the cost function
        
        J = -(1/m)* (y.T @ np.log(h)+((1-y).T @ np.log(1-h)))

        # update the weights theta
        theta = theta - alpha/m  * (x.T @ (h-y)) 
        #print(f"iter: {i}")
        
    J = float(np.squeeze(J))  # convert the (1,1) cost array to a plain float
    return J, theta

Training

  • Stack the features of all training examples into a matrix X.
  • Call the gradient descent function implemented above.
  • Repeat the process until the cost function is minimized.

train_x= tweets[:8000]  
train_y= np.reshape(np.array(labels[:8000]),(8000,1))
print("train_y.shape = " + str(train_y.shape))
train_y.shape = (8000, 1)
X = np.zeros((len(train_x), 3))
for i in range(len(train_x)):
    X[i, :]= extract_features(train_x[i], freqs)

# training labels corresponding to X
Y = train_y

# Apply gradient descent
J, theta = gradientDescent(X, Y, np.zeros((3, 1)), 1e-9, 2900)
print(f"The cost after training is {J:.8f}.")
print(f"The resulting vector of weights is {[round(t, 8) for t in np.squeeze(theta)]}")
The cost after training is 0.13893911.
The resulting vector of weights is [1.8e-07, 0.00081903, -0.00063131]

Testing and computing the accuracy of the trained model

Predict whether a tweet is positive or negative.

test_x= tweets[8000:]
test_y= labels[8000:]

def predict_tweet(tweet, freqs, theta):
    '''
    Input: 
        tweet: a string
        freqs: a dictionary corresponding to the frequencies of each tuple (word, label)
        theta: (3,1) vector of weights
    Output: 
        y_pred: the probability of a tweet being positive or negative
    '''       
    # extract the features of the tweet and store it into x
    x = extract_features(tweet, freqs)
    
    # make the prediction using x and theta
    y_pred = sigmoid(x @ theta)
   
    return y_pred
for tweet in ['I am happy', 'I am bad', 'this movie should have been great.', 'great', 'great great', 'great great great', 'great great great great']:
    print( '%s -> %f' % (tweet, predict_tweet(tweet, freqs, theta)))   
I am happy -> 0.539381
I am bad -> 0.492165
this movie should have been great. -> 0.532825
great -> 0.531703
great great -> 0.563153
great great great -> 0.594103
great great great great -> 0.624323

Checking performance on the test dataset

This gives an idea of how the model would behave on real data it has not seen before.

def test_logistic_regression(test_x, test_y, freqs, theta, predict_tweet=predict_tweet):
    """
    Input: 
        test_x: a list of tweets
        test_y: (m, 1) vector with the corresponding labels for the list of tweets
        freqs: a dictionary with the frequency of each pair (or tuple)
        theta: weight vector of dimension (3, 1)
    Output: 
        accuracy: (# of tweets classified correctly) / (total # of tweets)
    """
    m = len(test_x)
    
    # the list for storing predictions
    y_hat = []
    
    for tweet in test_x:
        # get the label prediction for the tweet
        y_pred = predict_tweet(tweet, freqs, theta)
        
        if y_pred > 0.5:
            # append 1.0 to the list
            y_hat.append(1.0)
        else:
            # append 0 to the list
            y_hat.append(0.0)

    # With the above implementation, y_hat is a list, but test_y is (m,1) array
    # convert both to one-dimensional arrays in order to compare them using the '==' operator
      
    accuracy = np.sum(test_y.T==np.asarray(y_hat))/len(test_y)
  
    return accuracy
tmp_accuracy = test_logistic_regression(test_x, test_y, freqs, theta)
print(f"Logistic regression model's accuracy = {tmp_accuracy:.4f}")
Logistic regression model's accuracy = 0.9910

Predicting a new tweet

my_tweet = 'This is a ridiculously bright movie. The plot was terrible and I was sad until the ending!'
print(process_tweet(my_tweet))
y_hat = predict_tweet(my_tweet, freqs, theta)
print(y_hat)
if y_hat > 0.5:
    print('Positive sentiment')
else: 
    print('Negative sentiment')
['ridicul', 'bright', 'movi', 'plot', 'terribl', 'sad', 'end']
[[0.47824272]]
Negative sentiment