Lecture 4.3 (Part 1) : Data Preprocessing - Part 2

1. Data Preprocessing - Text

Dataset : https://www.kaggle.com/datasets/yufengdev/bbc-fulltext-and-category

import pandas as pd
articles_df = pd.read_csv('data/bbc-text.csv')
articles_df.sample()
category text
301 politics fox attacks blair s tory lies tony blair lie...

Categories

articles_df['category'].unique()
array(['tech', 'business', 'sport', 'entertainment', 'politics'],
      dtype=object)
articles_df['category'].value_counts().to_frame()
category
sport 511
business 510
politics 417
tech 401
entertainment 386
articles_df.sample()['text'].values.tolist()
['children vote shrek 2 best film young uk film fans voted animated hollywood hit shrek 2 best film at the children s bafta awards on sunday.  more than 6 000 children voted in the only category chosen by fans. harry potter and the prisoner of azkaban  runner-up in the poll  was the choice of the bafta experts who named it best feature film. bbc one saturday morning show dick and dom in da bungalow won two awards - best entertainment and best presenters for richard mccourt and dominic wood.  former playschool presenter floella benjamin was awarded the special award for outstanding creative contribution to children s film and television. she first appeared on playschool 25 years ago and was made an obe in 2001 for services to broadcasting. south american-themed cartoon joko! jakamoko! toto! won the honour for pre-school animation and its writer tony collingwood for original writer. debbie isitt won the award for best adapted writer for her work with jacqueline wilson s the illustrated mum  which won the award for best schools drama.  schools  factual (primary) - thinking skills: think about it - hiding places  schools  factual (secondary) - in search of the tartan turban  pre-school live action - balamory  animation - brush head  drama - featherboy  factual - serious desert interactive bafta - king arthur international category - 8 simple rules for dating my teenage daughter']

Sometimes language detection is necessary.
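If the corpus might mix languages, a language-identification pass can be used to route or filter documents first. A minimal sketch, assuming the third-party langdetect package is installed (it is not used elsewhere in this notebook):

from langdetect import detect, DetectorFactory

DetectorFactory.seed = 0  # langdetect is probabilistic; fixing the seed makes results repeatable
print(detect("tony blair says he will not fight a fourth election"))  # expected: 'en'
print(detect("le président a annoncé de nouvelles élections"))        # expected: 'fr'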

Capitalization/ Lower case

articles_df['lower_case'] = articles_df['text'].apply(str.lower)
articles_df.sample()['lower_case'].values
array(['surprise win for anti-bush film michael moore s anti-bush documentary fahrenheit 9/11 has won best film at the us people s choice awards  voted for by the us public.  mel gibson s the passion of the christ won best drama  despite both films being snubbed so far at us film awards in the run-up to february s oscars. julia roberts won her 10th consecutive crown as favourite female movie star. johnny depp was favourite male movie star and renee zellweger was favourite leading lady at sunday s awards in la.  film sequel shrek 2 took three prizes - voted top animated movie  top film comedy and top sequel. in television categories  desperate housewives was named top new drama and joey  starring former friends actor matt leblanc  was best new comedy. long-running shows will and grace and csi: crime scene investigation were named best tv comedy and tv drama respectively.  nominees for the people s choice awards were picked by a 6 000-strong entertainment weekly magazine panel  and winners were subsequently chosen by 21 million online voters. fahrenheit 9/11 director michael moore dedicated his trophy to soldiers in iraq. his film was highly critical of president george w bush and the us-led invasion of iraq  and moore was an outspoken bush critic in the 2004 presidential campaign inwhich democratic challenger john kerry lost.   this country is still all of ours  not right or left or democrat or republican   moore told the audience at the ceremony in pasadena  california. moore said it was  an historic occasion  that the 31-year-old awards ceremony would name a documentary its best film. unlike many other film-makers  passion of the christ director mel gibson has vowed not to campaign for an oscar for his movie.  to me  really  this is the ultimate goal because one doesn t make work for the elite   gibson said backstage at the event.  to me  the people have spoken.'],
      dtype=object)
  • Replace Unicode characters with equivalent ASCII characters, or remove them
  • Replace HTML entity references with their actual symbols, or remove HTML tags altogether
  • Replace typos, slang, acronyms, or informal abbreviations - how aggressively depends on the situation and the domain of the NLP task, such as finance or medicine
  • List out all the hashtags / usernames, then replace them with equivalent words
  • Replace emoticons / emoji with an equivalent word meaning, such as “:)” with “smile”, or drop emojis entirely
  • Spelling correction

A few of these steps are sketched below.
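A minimal sketch of a few of these cleanup steps; the regular expressions and the emoticon map below are illustrative placeholders, not a complete solution:

import re

def basic_clean(text):
    # Strip HTML tags and decode a couple of common entity references (illustrative only)
    text = re.sub(r'<[^>]+>', ' ', text)
    text = text.replace('&amp;', '&').replace('&quot;', '"')
    # Collect hashtags before rewriting hashtags/usernames as plain words
    hashtags = re.findall(r'#\w+', text)
    text = re.sub(r'[#@](\w+)', r'\1', text)
    # Map a few emoticons to words; unmapped emoji could simply be dropped instead
    for emoticon, word in {':)': ' smile ', ':(': ' sad '}.items():
        text = text.replace(emoticon, word)
    return text, hashtags

print(basic_clean('Loved the <b>match</b> :) #football @bbc &amp; more'))
# returns the cleaned text plus the hashtags that were found, e.g. (..., ['#football'])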

Remove punctuation

import string
sample = "hey! where are you!?"
sample_processed = sample.translate(str.maketrans('', '', string.punctuation))
sample_processed
'hey where are you'
articles_df['punct_removed'] = articles_df['lower_case'].apply(lambda doc: doc.translate(str.maketrans('', '', string.punctuation)))
articles_df.sample()['punct_removed'].values 
array(['zambia confident and cautious zambia s technical director  kalusha bwalya is confident and cautious ahead of the cosafa cup final against angola on saturday in lusaka  bwalya said  nothing short of victory will do  however bwalya warned his side not to be too complacent  i don t want my team to be too comfortable or too sure of victory as it is going to be a difficult game  for me the main aim of the game is to enjoy and to win  zambia have shown their determination to win this final by recalling nine of their foreignbased players however the 41 yearold bwalya  who became the oldest player to appear in the competition when he played and scored against mauritius  is uncertain whether he will take to the field or not the chipolopolo fans however are not being so cautious with a  victory  concert already scheduled for after the match featuring some of the country s top musicians both sides are hoping to win the competition for a record third time  and so keep the trophy for good the chipolopolo won the first two editions of the regional tournament for southern african nations in 1997 and 1998 they were prevented from a third straight win by angola who knocked out the zambians at the semifinal stage in 1999 that victory for angola also marked a first defeat in 14 years for zambia at lusaka s independence stadium  where saturday s game is being played angola named just four overseasbased players in their preliminary squad the palancas negras have been unable to secure the release of many of their portugalbased players'],
      dtype=object)
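
Note that string.punctuation includes hyphens and apostrophes, so this step glues hyphenated and contracted words together (for example foreignbased in the sample above). Replacing punctuation with spaces instead of deleting it is a common alternative; a quick check of both, on a made-up sentence:

import string
sample = "the 41-year-old foreign-based striker isn't worried"
print(sample.translate(str.maketrans('', '', string.punctuation)))
# 'the 41yearold foreignbased striker isnt worried'
print(sample.translate(str.maketrans(string.punctuation, ' ' * len(string.punctuation))))
# 'the 41 year old foreign based striker isn t worried'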

Tokenise

A token is an instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing. (Source: Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan & Hinrich Schütze.)

i.e., a token is a meaningful chunk of text that we use to process and understand the information in a document. It can be a word, a phrase, or even a symbol or punctuation mark. Tokens help us break the text down into smaller pieces so that we can analyze and work with it more easily.

import nltk 

nltk.download('punkt')


from nltk.tokenize import (word_tokenize,
                          sent_tokenize,
                          TreebankWordTokenizer,
                          wordpunct_tokenize,
                          TweetTokenizer,
                          MWETokenizer)


treebank_tokenizer = TreebankWordTokenizer()
mwet_tokenizer = MWETokenizer()
tweet_tokenizer = TweetTokenizer()

text="Mr. O'Neill thinks that the boys' stories about Chile's capital aren't amusing 🙃"
text
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/enfageorge/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
"Mr. O'Neill thinks that the boys' stories about Chile's capital aren't amusing 🙃"

Example sentence adapted from Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan & Hinrich Schütze.

print("Word_tokenise : \n", word_tokenize(text), "\nLength :" , len(word_tokenize(text)))
print("\nWord Punct Tokeniser :  \n", wordpunct_tokenize(text), "\nLength :" , len(wordpunct_tokenize(text)))
print("\nTree Bank : \n", treebank_tokenizer.tokenize(text), "\nLength :" , len(treebank_tokenizer.tokenize(text)))
print("\nTweet Tokeniser: \n", tweet_tokenizer.tokenize(text), "\nLength :" , len(tweet_tokenizer.tokenize(text)))
print("\nMWE Tokeniser: \n", mwet_tokenizer.tokenize(word_tokenize(text)), "\nLength :" , len(mwet_tokenizer.tokenize(word_tokenize(text))))
Word_tokenise : 
 ['Mr.', "O'Neill", 'thinks', 'that', 'the', 'boys', "'", 'stories', 'about', 'Chile', "'s", 'capital', 'are', "n't", 'amusing', '🙃'] 
Length : 16

Word Punct Tokeniser :  
 ['Mr', '.', 'O', "'", 'Neill', 'thinks', 'that', 'the', 'boys', "'", 'stories', 'about', 'Chile', "'", 's', 'capital', 'aren', "'", 't', 'amusing', '🙃'] 
Length : 21

Tree Bank : 
 ['Mr.', "O'Neill", 'thinks', 'that', 'the', 'boys', "'", 'stories', 'about', 'Chile', "'s", 'capital', 'are', "n't", 'amusing', '🙃'] 
Length : 16

Tweet Tokeniser: 
 ['Mr', '.', "O'Neill", 'thinks', 'that', 'the', 'boys', "'", 'stories', 'about', "Chile's", 'capital', "aren't", 'amusing', '🙃'] 
Length : 15

MWE Tokeniser: 
 ['Mr.', "O'Neill", 'thinks', 'that', 'the', 'boys', "'", 'stories', 'about', 'Chile', "'s", 'capital', 'are', "n't", 'amusing', '🙃'] 
Length : 16
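
The MWE (multi-word expression) tokeniser with no registered expressions simply passes the word_tokenize output through, which is why the last two lists above are identical. It becomes useful once multi-word expressions are registered; a minimal sketch with arbitrary example phrases (sent_tokenize, imported above but not exercised, is shown as well):

mwe = MWETokenizer([('New', 'York')], separator='_')
mwe.add_mwe(('a', 'lot'))
print(mwe.tokenize(word_tokenize("I like New York a lot")))
# expected: ['I', 'like', 'New_York', 'a_lot']

print(sent_tokenize("Tokenisers differ. Sentence splitting is a separate step!"))
# expected: ['Tokenisers differ.', 'Sentence splitting is a separate step!']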
from nltk.tokenize import word_tokenize
articles_df['tokenized'] = articles_df['punct_removed'].apply(word_tokenize)
articles_df.sample()['tokenized'].values
array([list(['jungle', 'tv', 'show', 'ratings', 'drop', 'by', '4m', 'the', 'finale', 'of', 'itv1', 's', 'i', 'm', 'a', 'celebrity', 'get', 'me', 'out', 'of', 'here', 'drew', 'an', 'average', 'of', '109m', 'viewers', 'about', 'four', 'million', 'fewer', 'than', 'the', 'previous', 'series', 'the', 'fourth', 'series', 'of', 'the', 'show', 'peaked', 'on', 'monday', 'at', '119m', 'and', '492', 'of', 'the', 'audience', 'just', 'before', 'joe', 'pasquale', 'won', 'this', 'compared', 'with', 'a', 'peak', 'of', '153m', 'at', 'and', 'a', 'record', '622', 'of', 'the', 'tv', 'audience', 'when', 'kerry', 'mcfadden', 'won', 'in', 'february', 'comic', 'pasquale', 'beat', 'former', 'royal', 'butler', 'paul', 'burrell', 'who', 'came', 'second', 'nightclub', 'owner', 'fran', 'cosgrave', 'who', 'was', 'third', 'pasquale', 'follows', 'kerry', 'mcfadden', 'phil', 'tufnell', 'and', 'tony', 'blackburn', 'as', 'winners', 'of', 'the', 'show', 'singer', 'and', 'tv', 'presenter', 'mcfadden', 'was', 'the', 'show', 's', 'first', 'female', 'winner', 'when', 'cricketer', 'phil', 'tufnell', 'won', 'in', 'may', '2003', '123', 'million', 'people', '50', 'of', 'the', 'viewing', 'public', 'tuned', 'in', 'to', 'watch', 'and', 'when', 'tony', 'blackburn', 'won', 'the', 'first', 'show', 'in', '2002', '109', 'million', 'people', 'saw', 'the', 'show', 'pasquale', 'had', 'been', 'the', 'show', 's', 'hottest', 'ever', 'favourite', 'to', 'win', 'and', 'its', 'hosts', 'anthony', 'mcpartlin', 'and', 'declan', 'donnelly', 'known', 'as', 'ant', 'and', 'dec', 'said', 'monday', 's', 'deciding', 'vote', 'was', 'the', 'closest', 'in', 'the', 'programme', 's', 'history', 'pascuale', 'has', 'been', 'flooded', 'with', 'offers', 'of', 'tv', 'work', 'according', 'to', 'his', 'management', 'company', 'but', 'one', 'of', 'his', 'first', 'jobs', 'on', 'his', 'return', 'is', 'pantomime', 'before', 'joining', 'i', 'm', 'a', 'celebrity', 'he', 'had', 'signed', 'up', 'to', 'play', 'jack', 'in', 'jack', 'and', 'the', 'beanstalk', 'in', 'birmingham', 'and', 'tickets', 'for', 'the', 'show', 'have', 'become', 'increasingly', 'popular', 'since', 'he', 'joined', 'the', 'tv', 'show', 'his', 'manager', 'robert', 'voice', 'said', 'we', 've', 'had', 'interest', 'from', 'different', 'tv', 'producers', 'some', 'are', 'for', 'comedy', 'shows', 'some', 'are', 'newtype', 'projects', 'there', 'are', 'a', 'number', 'of', 'things', 'joe', 'wants', 'to', 'do', 'he', 'is', 'very', 'ambitious', 'he', 'wants', 'to', 'play', 'the', 'west', 'end', 'and', 'do', 'different', 'things', 'other', 'than', 'straightforward', 'comedy', 'we', 'are', 'talking', 'to', 'a', 'couple', 'of', 'west', 'end', 'producers', 'about', 'a', 'musical'])],
      dtype=object)

Remove Stop Words (and/or Frequent Words / Rare Words)

Stop words are the most common words in any language (articles, prepositions, pronouns, conjunctions, etc.) and do not add much information to the text. Examples of a few stop words in English are “the”, “a”, “an”, “so”, “what”.

Stop word removal can go wrong, too:

  • Mark reported to the CEO : Mark reported CEO
  • Suzanne reported as the CEO to the board : Suzanne reported CEO board

In your NLP pipeline, you might create 4-grams such as reported to the CEO and reported as the CEO. If you remove the stop words from the 4-grams, both examples would be reduced to “reported CEO”, and you would lack the information about the professional hierarchy. In the first example, Mark could have been an assistant to the CEO, whereas in the second example Suzanne was the CEO reporting to the board. Unfortunately, retaining the stop words within your pipeline creates another problem: it increases the length of the n-grams required to make use of the connections formed by the otherwise meaningless stop words. This forces you to retain at least 4-grams if you want to avoid the ambiguity of the example above. Designing a filter for stop words depends on your particular application.

Source : https://www.manning.com/books/natural-language-processing-in-action

nltk.download("stopwords")
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/enfageorge/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
True
from nltk.corpus import stopwords
stopwords_eng = set(stopwords.words('english'))
print(stopwords_eng)
{'again', "couldn't", 'but', 'nor', 'he', 'herself', "needn't", 'yours', 'then', "aren't", 'me', 'isn', 'both', 'between', 'and', 'against', "that'll", "haven't", 'such', 'further', 'y', 'what', 'after', "you've", 'hers', "hadn't", 'on', 'about', 'were', 'most', 'haven', 'does', "mightn't", 'who', 'them', 'when', 'can', "weren't", 'ourselves', 'these', 'than', 'why', 'she', 'the', "you'd", 've', 'her', 'during', 'those', 'once', 'aren', 'himself', 'weren', 'same', 'of', 'too', 'while', 'only', 'will', "hasn't", 'do', 'any', "isn't", 'an', 'which', 'needn', 'below', 'now', 'themselves', 'very', "shouldn't", 'shouldn', 'you', "doesn't", 'did', 'doing', 'their', 'i', 'over', 'because', 'into', 'it', 'where', 'we', 'doesn', 'ma', 'each', 'at', 'mightn', 'mustn', "wasn't", 'how', 'ain', 'that', 'not', "didn't", 'other', 'they', "she's", 'have', "won't", 'out', 'being', 'own', 't', 'theirs', 're', 'll', 'wouldn', 'for', "should've", 'am', 'had', 'some', 'd', 'is', 'his', 'under', 'has', 'through', 'yourselves', 'are', 'up', 'more', 'off', 'just', 'a', 'above', 'been', 'so', 'this', 'itself', 'be', "you'll", 'all', 'o', 'should', 'was', 'before', 'from', 'don', 'my', 'whom', 'm', "wouldn't", 'yourself', 'won', 'couldn', 'your', 'having', 'there', 'hasn', 's', 'didn', 'its', 'until', 'in', 'our', "it's", 'to', "mustn't", 'by', "don't", 'ours', 'or', 'down', 'hadn', 'him', 'few', 'shan', "shan't", 'with', 'as', 'wasn', 'myself', 'if', 'no', 'here', "you're"}
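A quick check of the 4-gram example from above against this stop word set:

for phrase in ["reported to the CEO", "reported as the CEO"]:
    kept = [word for word in word_tokenize(phrase.lower()) if word not in stopwords_eng]
    print(phrase, "->", kept)
# reported to the CEO -> ['reported', 'ceo']
# reported as the CEO -> ['reported', 'ceo']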
articles_df['stopwords_removed'] = articles_df['tokenized'].apply(lambda doc: [word for word in doc if word not in stopwords_eng])
articles_df.sample()['stopwords_removed'].values 
array([list(['west', 'end', 'honour', 'finest', 'shows', 'west', 'end', 'honouring', 'finest', 'stars', 'shows', 'evening', 'standard', 'theatre', 'awards', 'london', 'monday', 'producers', 'starring', 'nathan', 'lane', 'lee', 'evans', 'best', 'musical', 'ceremony', 'national', 'theatre', 'competing', 'sweeney', 'todd', 'funny', 'thing', 'happened', 'way', 'forum', 'award', 'goat', 'sylvia', 'edward', 'albee', 'pillowman', 'martin', 'mcdonagh', 'alan', 'bennett', 'history', 'boys', 'shortlisted', 'best', 'play', 'category', 'pam', 'ferris', 'victoria', 'hamilton', 'kelly', 'reilly', 'nominated', 'best', 'actress', 'ferris', 'best', 'known', 'television', 'roles', 'programmes', 'darling', 'buds', 'may', 'made', 'shortlist', 'role', 'notes', 'falling', 'leaves', 'royal', 'court', 'theatre', 'meanwhile', 'richard', 'griffiths', 'plays', 'hector', 'history', 'boys', 'national', 'theatre', 'battle', 'best', 'actor', 'award', 'douglas', 'hodge', 'dumb', 'show', 'stanley', 'townsend', 'shining', 'city', 'best', 'director', 'shortlist', 'includes', 'luc', 'bondy', 'cruel', 'tender', 'simon', 'mcburney', 'measure', 'measure', 'rufus', 'norris', 'festen', 'festen', 'also', 'shortlisted', 'best', 'designer', 'category', 'ian', 'macneil', 'jean', 'kalman', 'paul', 'arditti', 'hildegard', 'bechtler', 'iphigenia', 'aulis', 'paul', 'brown', 'false', 'servant', 'milton', 'shulman', 'award', 'outstanding', 'newcomer', 'presented', 'dominic', 'cooper', 'dark', 'materials', 'history', 'boys', 'romola', 'garai', 'calico', 'eddie', 'redmayne', 'goat', 'sylvia', 'ben', 'wishaw', 'hamlet', 'playwrights', 'david', 'eldridge', 'rebecca', 'lenkiewicz', 'owen', 'mccafferty', 'fight', 'charles', 'wintour', 'award', '£30', '000', 'bursary', 'three', '50th', 'anniversary', 'special', 'awards', 'also', 'presented', 'institution', 'playwright', 'individual'])],
      dtype=object)
sample_df = articles_df.sample()
print("Length after tokenisation : ",len(sample_df['tokenized'].values[0]))
print("Length after stopwords are removed : ",len(sample_df['stopwords_removed'].values[0]))
Length after tokenisation :  295
Length after stopwords are removed :  140

Stemming

The process of removing or changing affixes to get to the root word is called stemming.

Ex :

  • run, running, runs => run
  • programming, programmer, programs => program

There are multiple stemming algorithms; here we will use the Snowball (Porter2) stemming algorithm.
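A quick check of the examples above with the Snowball stemmer (a minimal sketch):

from nltk.stem import SnowballStemmer

snowball = SnowballStemmer('english')
print([snowball.stem(word) for word in ['run', 'running', 'runs']])
# ['run', 'run', 'run']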

from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer('english')
articles_df['snowball_stemmer'] = articles_df['stopwords_removed'].apply(lambda doc: [stemmer.stem(word) for word in doc])
example = articles_df.sample()[['stopwords_removed', 'snowball_stemmer']].values.tolist()
print(example[0][0])
['blues', 'slam', 'blackburn', 'savage', 'birmingham', 'confirmed', 'blackburn', 'made', 'bid', 'robbie', 'savage', 'managing', 'director', 'karen', 'brady', 'called', 'derisory', 'rovers', 'reportedly', 'offered', '£500', '000', 'front', 'wales', 'star', '30', 'fee', 'rising', '£22m', 'brady', 'told', 'sun', 'bid', 'waste', 'fax', 'paper', 'time', 'added', 'way', 'things', 'going', 'could', 'affect', 'relationship', 'clubs', 'got', 'robbie', 'head', 'sale', 'savage', 'future', 'birmingham', 'source', 'speculation', 'several', 'weeks', 'fans', 'criticising', 'performances', 'club', 'earlier', 'season', 'however', 'good', 'displays', 'west', 'brom', 'aston', 'villa', 'impressed', 'blues', 'fans', 'crowd', 'gave', 'massive', 'standing', 'ovation', 'came', 'saturday', 'nice', 'said', 'fantastic', 'even', 'though', 'criticised', 'number', 'recent', 'weeks', 'saturday', 'showed', 'much', 'mean', 'say', 'transfer', 'rumours', 'two', 'clubs', 'created', 'speculation', 'phoned', 'every', 'national', 'newspaper', 'saying', 'blackburn', 'trying', 'buy', 'birmingham', 'manager', 'steve', 'bruce', 'insists', 'want', 'sell', 'savage', 'lot', 'said', 'written', 'sav', 'terrific', 'birmingham', 'city', 'last', 'two', 'half', 'years', 'said', 'fans', 'love', 'epitomises', 'works', 'hard', 'like', 'people', 'like', 'many', 'like', 'hell', 'sell', 'someone', 'else', 'interested']
print(example[0][1])
['blue', 'slam', 'blackburn', 'savag', 'birmingham', 'confirm', 'blackburn', 'made', 'bid', 'robbi', 'savag', 'manag', 'director', 'karen', 'bradi', 'call', 'derisori', 'rover', 'reportedli', 'offer', '£500', '000', 'front', 'wale', 'star', '30', 'fee', 'rise', '£22m', 'bradi', 'told', 'sun', 'bid', 'wast', 'fax', 'paper', 'time', 'ad', 'way', 'thing', 'go', 'could', 'affect', 'relationship', 'club', 'got', 'robbi', 'head', 'sale', 'savag', 'futur', 'birmingham', 'sourc', 'specul', 'sever', 'week', 'fan', 'criticis', 'perform', 'club', 'earlier', 'season', 'howev', 'good', 'display', 'west', 'brom', 'aston', 'villa', 'impress', 'blue', 'fan', 'crowd', 'gave', 'massiv', 'stand', 'ovat', 'came', 'saturday', 'nice', 'said', 'fantast', 'even', 'though', 'criticis', 'number', 'recent', 'week', 'saturday', 'show', 'much', 'mean', 'say', 'transfer', 'rumour', 'two', 'club', 'creat', 'specul', 'phone', 'everi', 'nation', 'newspap', 'say', 'blackburn', 'tri', 'buy', 'birmingham', 'manag', 'steve', 'bruce', 'insist', 'want', 'sell', 'savag', 'lot', 'said', 'written', 'sav', 'terrif', 'birmingham', 'citi', 'last', 'two', 'half', 'year', 'said', 'fan', 'love', 'epitomis', 'work', 'hard', 'like', 'peopl', 'like', 'mani', 'like', 'hell', 'sell', 'someon', 'els', 'interest']

Lemmatisation

A lemma is the canonical form, dictionary form, or citation form of a set of word forms. In English, for example, break, breaks, broke, broken and breaking are forms of the same lexeme, with break as the lemma by which they are indexed.

Source : Wikipedia

nltk.download('wordnet')
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/enfageorge/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
True
import nltk

from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

nltk.download("punkt")

# Initialize the Porter stemmer and the WordNet lemmatizer
ps = PorterStemmer()
wnl = WordNetLemmatizer()

# Example inflections to reduce
example_words = ["program","programming","programer","programs","programmed"]

# Perform stemming
print("{0:20}{1:20}{2:20}".format("--Word--","--Stem--", "--Lemma--"))
for word in example_words:
   print ("{0:20}{1:20}{2:20}".format(word, ps.stem(word),wnl.lemmatize(word, pos='v')))
--Word--            --Stem--            --Lemma--           
program             program             program             
programming         program             program             
programer           program             programer           
programs            program             program             
programmed          program             program             
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/enfageorge/nltk_data...
[nltk_data]   Package punkt is already up-to-date!

Source - DataCamp

from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
articles_df['text lemma'] = articles_df['stopwords_removed'].apply(lambda x: [lemmatizer.lemmatize(word) for word in x])
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 150)
articles_df.sample()
category text lower_case punct_removed tokenized stopwords_removed snowball_stemmer text lemma
902 tech mobiles not media players yet mobiles are not yet ready to be all-singing all-dancing multimedia devices which will replace portable media players say two reports. despite moves to bring music download services to mobiles people do not want to trade multimedia services with size and battery life said jupiter. a separate study by gartner has also said real-time tv broadcasts to mobiles is unlikely in europe until 2007. technical issues and standards must be resolved first said the r... mobiles not media players yet mobiles are not yet ready to be all-singing all-dancing multimedia devices which will replace portable media players say two reports. despite moves to bring music download services to mobiles people do not want to trade multimedia services with size and battery life said jupiter. a separate study by gartner has also said real-time tv broadcasts to mobiles is unlikely in europe until 2007. technical issues and standards must be resolved first said the r... mobiles not media players yet mobiles are not yet ready to be allsinging alldancing multimedia devices which will replace portable media players say two reports despite moves to bring music download services to mobiles people do not want to trade multimedia services with size and battery life said jupiter a separate study by gartner has also said realtime tv broadcasts to mobiles is unlikely in europe until 2007 technical issues and standards must be resolved first said the report ... [mobiles, not, media, players, yet, mobiles, are, not, yet, ready, to, be, allsinging, alldancing, multimedia, devices, which, will, replace, portable, media, players, say, two, reports, despite, moves, to, bring, music, download, services, to, mobiles, people, do, not, want, to, trade, multimedia, services, with, size, and, battery, life, said, jupiter, a, separate, study, by, gartner, has, also, said, realtime, tv, broadcasts, to, mobiles, is, unlikely, in, europe, until, 2007, technical, ... [mobiles, media, players, yet, mobiles, yet, ready, allsinging, alldancing, multimedia, devices, replace, portable, media, players, say, two, reports, despite, moves, bring, music, download, services, mobiles, people, want, trade, multimedia, services, size, battery, life, said, jupiter, separate, study, gartner, also, said, realtime, tv, broadcasts, mobiles, unlikely, europe, 2007, technical, issues, standards, must, resolved, first, said, report, batteries, already, cope, services, operato... [mobil, media, player, yet, mobil, yet, readi, allsing, alldanc, multimedia, devic, replac, portabl, media, player, say, two, report, despit, move, bring, music, download, servic, mobil, peopl, want, trade, multimedia, servic, size, batteri, life, said, jupit, separ, studi, gartner, also, said, realtim, tv, broadcast, mobil, unlik, europ, 2007, technic, issu, standard, must, resolv, first, said, report, batteri, alreadi, cope, servic, oper, offer, like, video, playback, video, messag, megapi... [mobile, medium, player, yet, mobile, yet, ready, allsinging, alldancing, multimedia, device, replace, portable, medium, player, say, two, report, despite, move, bring, music, download, service, mobile, people, want, trade, multimedia, service, size, battery, life, said, jupiter, separate, study, gartner, also, said, realtime, tv, broadcast, mobile, unlikely, europe, 2007, technical, issue, standard, must, resolved, first, said, report, battery, already, cope, service, operator, offer, like,...
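
Note that WordNetLemmatizer treats every token as a noun unless a part-of-speech tag is passed, which is why pos='v' changed the result for the verb forms in the small table above but is omitted in the DataFrame apply here. A quick check:

print(lemmatizer.lemmatize("broadcasts"))        # 'broadcast' (default pos is noun)
print(lemmatizer.lemmatize("running"))           # 'running'   (treated as a noun)
print(lemmatizer.lemmatize("running", pos='v'))  # 'run'       (treated as a verb)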

Sources :

From UofA 💛 ✨: Deep Learning for Natural Language Processing: A Gentle Introduction by Mihai Surdeanu and Marco A. Valenzuela-Escárcega

Books :
  • Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan & Hinrich Schütze
  • Natural Language Processing in Action (Manning Publications)

Blogs :
  • https://www.kaggle.com/code/longtng/nlp-preprocessing-feature-extraction-methods-a-z
  • http://hunterheidenreich.com/blog/stemming-lemmatization-what/

Will be continued either in an ML example or in the week on Applications.