Remember these points before you start
Download three files from the assignments area of Canvas: stopwords.txt, Biden_DNC_Speech_2020.txt, Trump_RNC_Speech_2020.txt
Put those files in the same folder as this notebook. When writing code to access the files, DO NOT include paths, since your code will then not run on the instructors' computers. (If the files are in the same folder as the notebook, no path is needed.)
Pick one of the speech files to process. Your choice.
Process the file and provide several statistics and answers to some questions outlined in future cells.
The next few cells will provide some suggestions on an approach. DO NOT load additional modules or libraries to do this analysis. Everything can be done with standard Python.
All processing should ignore case or assume lower case. So, take the raw data and save it as lower case in any lists and dictionaries.
Step 1: Create a list of stopwords
Open the stopwords.txt file for reading
Make sure the file is in the same folder as this notebook. Do not include a path name when referencing the file. (-1 point if a path is included)
Import each word in the file as a unique element in a list named stopwords
The output of that list will look something like this:
stopwords = ['i', 'me', 'my', 'myself'....'should', 'now']
# Step 1: stopwords
stopwords = []
f = open('stopwords.txt')
line = f.readline()
while line:
    stopwords.append(line.strip('\n'))
    line = f.readline()
f.close()
print(stopwords) # print it out
Output
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', "it's", 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now']
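The same list can be built more compactly. The sketch below is one possible variant, not the required form of the assignment. The helper name read_stopwords is hypothetical, and it accepts any iterable of lines so it can be tried without the real file. Lower-casing each entry and skipping blank lines are assumptions about how stray lines should be handled.

```python
def read_stopwords(lines):
    """Return a list of lower-cased stopwords, one per non-empty line."""
    return [line.strip().lower() for line in lines if line.strip()]

# Small in-memory sample standing in for the contents of stopwords.txt
sample_lines = ['i\n', 'me\n', 'My\n', '\n', 'now\n']
sample_stopwords = read_stopwords(sample_lines)
print(sample_stopwords)
```

With the real file, a with block closes the file automatically: with open('stopwords.txt') as f: stopwords = read_stopwords(f).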
Step 2: Import the text of a speech
You choose which speech to analyze.
Make sure the file is in the same folder as this notebook. Do not include a path name when referencing the file. (-1 point if a path is included).
Each line in the file should represent a paragraph.
Import the lines to an object called speech using the readlines() method.
Strip off the extra newline characters at the end of each line.
# Step 2: Import speech
# NOTE: uncomment the line related to your file
file = 'Biden_DNC_Speech_2020'
# file = 'Trump_RNC_Speech_2020'
fullfile = file+'.txt'
#---- Your code here for import
# open the file for reading
f = open(fullfile)
# read every line (one paragraph per line) into a list
speech = f.readlines()
f.close()
# strip off the extra newline character at the end of each line
speech = [line.strip('\n') for line in speech]
speech # print it out
Output:
['Good evening.',
'Ella Baker, a giant of the civil rights movement, left us with this wisdom: Give people light and they will find a way.',
'Give people light.',
'Those are words for our time.',
'The current president has cloaked America in darkness for much too long. Too much anger. Too much fear. Too much division.',
---
---
---
'May history be able to say that the end of this chapter of American darkness began here tonight as love and hope and light joined in the battle for the soul of the nation.',
'And this is a battle that we, together, will win.',
'I promise you.',
'Thank you.',
'And may God bless you.',
'And may God protect our troops.']
Step 3: Count sentence ending punctuation
Loop through each line in speech in search of the sentence ending punctuation.
Keep track of the counts in the puncdict.
Keep separate counts for each of the punctuation symbols.
Output will look something like this:
puncdict = {'.': 30, '!': 13, '?': 3} Your numbers will be different.
# Step 3: Count sentence ending punctuation
punclist = ['.', '!', '?']
puncdict = {'.':0,'!':0,'?':0}
# Note: endswith() checks only the last character of each line (paragraph)
for line in speech:
    if line.endswith(punclist[0]):
        puncdict[punclist[0]] += 1
    elif line.endswith(punclist[1]):
        puncdict[punclist[1]] += 1
    elif line.endswith(punclist[2]):
        puncdict[punclist[2]] += 1
puncdict # print it out
Output:
{'.': 163, '!': 0, '?': 7}
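The code above looks only at the last character of each paragraph, so sentences that end in the middle of a paragraph are not counted. A hedged sketch of one alternative is to tally every occurrence of each mark with str.count. The helper name count_sentence_endings and the sample lines are illustrative assumptions. This is still an approximation, since every period is counted, including any in abbreviations. On the real speech it would give different numbers from the output shown above.

```python
def count_sentence_endings(lines, marks=('.', '!', '?')):
    """Count every occurrence of each mark across all lines."""
    counts = {mark: 0 for mark in marks}
    for line in lines:
        for mark in marks:
            counts[mark] += line.count(mark)
    return counts

# Made-up sample paragraphs, each holding more than one sentence
sample = ['Too much anger. Too much fear. Too much division.',
          'Is that right? No!']
sample_counts = count_sentence_endings(sample)
print(sample_counts)
```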
Step 4: Replace all punctuation with spaces, replace multiple spaces with a single space
Punctuation will confuse our analysis so we'll remove most of it.
Replace punctuation in the list below with spaces.
Do not replace single quotes, since they are typically important to contractions.
A removelist is provided.
Consider looping through the speech and the removelist and using the replace() method. You will likely need the enumerate() function for this loop.
Now replace all instances of two or more spaces with a single space and strip off trailing spaces for each line. (This will help when splitting the lines later.)
# Step 4: Replace all punctuation with spaces
removelist = [',', ';', '"', '“', '”', '_', '—', ':', '.', '!', '?']
#---- Your code here for punctuation
import re
newspeech = []
for line in speech:
    # replace each punctuation mark from removelist with a space
    for char in removelist:
        line = line.replace(char, " ")
    # replace all instances of two or more spaces with a single space
    line = re.sub(r'\s+', ' ', line)
    # strip off leading and trailing spaces for each line
    line = line.strip(' ')
    newspeech.append(line)
speech = newspeech
#---- end of your code
for line in speech[0:10]: # print the first ten lines to check
    print(line)
Output:
Good evening
Ella Baker a giant of the civil rights movement left us with this wisdom Give people light and they will find a way
Give people light
Those are words for our time
The current president has cloaked America in darkness for much too long Too much anger Too much fear Too much division
Here and now I give you my word If you entrust me with the presidency I will draw on the best of us not the worst I will be an ally of the light not of the darkness
It's time for us for We the People to come together
For make no mistake United we can and will overcome this season of darkness in America We will choose hope over fear facts over fiction fairness over privilege
I am a proud Democrat and I will be proud to carry the banner of our party into the general election So it is with great honor and humility that I accept this nomination for President of the United States of America
But while I will be a Democratic candidate I will be an American president I will work as hard for those who didn't support me as I will for those who did
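If the "no additional modules" instruction is read strictly, the whitespace cleanup can also be done without re. The sketch below is one possible approach: split() with no arguments splits on any run of whitespace and drops leading and trailing whitespace, so joining the pieces with a single space collapses the extra spaces. The helper name clean_line is hypothetical, and the punctuation in the sample sentence is illustrative.

```python
removelist = [',', ';', '"', '“', '”', '_', '—', ':', '.', '!', '?']

def clean_line(line, removelist):
    """Replace listed punctuation with spaces, then collapse whitespace."""
    for mark in removelist:
        line = line.replace(mark, ' ')
    # split() drops empty pieces, so runs of spaces become a single space
    return ' '.join(line.split())

# Sample sentence with illustrative punctuation
sample = "It's time for us, for We the People, to come together."
cleaned = clean_line(sample, removelist)
print(cleaned)
```

The single quote is not in removelist, so the contraction in the sample survives.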
Step 5: Calculate total number of words, paragraphs, characters
Before additional processing and transformations, calculate the total number of words.
Consider looping through each line, determining the number of words and characters in each line and adding that to running counters totalWords and totalChar.
totalPar should be equal to the number of lines.
# Step 5: Calculate total words, characters, and paragraphs
# Initialize variables
totalWords = 0
totalChar = 0
totalPar = 0
#---- Your code here for calculations
totalPar = len(speech) # one paragraph per line
for line in speech:
    words = line.split()
    totalWords += len(words)
    for word in words:
        totalChar += len(word) # characters inside words only; spaces are not counted
#---- end of your code
print(f'Total Words: {totalWords}')
print(f'Total Characters: {totalChar}')
print(f'Total Paragraphs: {totalPar}')
Output:
Total Words: 3201
Total Characters: 13920
Total Paragraphs: 181
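The same totals can be written with sum() and generator expressions. This is a minimal sketch under the same assumption as the code above, namely that characters are counted inside words only and spaces are excluded. The helper name text_totals and the two sample lines are hypothetical.

```python
def text_totals(lines):
    """Return (total words, total characters in words, total paragraphs)."""
    total_par = len(lines)
    total_words = sum(len(line.split()) for line in lines)
    total_char = sum(len(word) for line in lines for word in line.split())
    return total_words, total_char, total_par

# Two short sample paragraphs
sample = ['Good evening', 'Give people light']
words, chars, pars = text_totals(sample)
print(f'Total Words: {words}')
print(f'Total Characters: {chars}')
print(f'Total Paragraphs: {pars}')
```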
Step 6: Process Phrases and Proper Nouns
Some phrases and proper nouns have two or more words that should not be split up.
Replace the space with an underscore.
For example: 'United States of America' should become 'united_states_of_america'
Use the phrases list to hold the phrases to be corrected.
Read the speech then add to the list any additional proper noun phrases (for example, Washington Monument, Civil War, names of people)
Be sure to handle case (e.g. check for lower case of the phrases)
# Step 6: Transform some multi-word phrases to single words
phrases = ['joe biden','mr biden','donald trump','donald j trump',
           'president trump','united states of america','united states',
           'ella baker','republican party','democratic party',
           'first lady','vice president','abraham lincoln'
          ]
# Append other proper nouns to the list above.
# Locate phrases from the list in your data.
# Replace the space in the phrase with an underscore
#---- Your code here for phrases
newspeech = []
for line in speech:
    newline = line
    for phrase in phrases:
        # case-insensitive search; only the first match in the line is handled
        start_index = newline.lower().find(phrase)
        if start_index != -1:
            end_index = start_index + len(phrase)
            # join the matched words with underscores, keeping their original case
            newphrase = "_".join(newline[start_index:end_index].split())
            newline = newline[:start_index] + newphrase + newline[end_index:]
    newspeech.append(newline)
speech = newspeech
#---- end of your code
for line in speech[0:20]: # print the first 20 lines to check
    print(line)
Output:
Good evening
Ella_Baker a giant of the civil rights movement left us with this wisdom Give people light and they will find a way
Give people light
Those are words for our time
The current president has cloaked America in darkness for much too long Too much anger Too much fear Too much division
Here and now I give you my word If you entrust me with the presidency I will draw on the best of us not the worst I will be an ally of the light not of the darkness
It's time for us for We the People to come together
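The loop above handles only the first match of each phrase in a line and keeps the original capitalization. A hedged sketch of a variant that lower-cases the line, as the step description asks, and replaces every occurrence with str.replace is shown below. The helper name join_phrases and the sample sentence are hypothetical. Longer phrases are processed first so that 'united states' does not break up 'united states of america'. Plain substring matching is an assumption here and could also match inside longer words.

```python
def join_phrases(line, phrases):
    """Lower-case the line and join each listed phrase with underscores."""
    line = line.lower()
    # Longest phrases first, so a shorter phrase cannot split a longer one
    for phrase in sorted(phrases, key=len, reverse=True):
        line = line.replace(phrase, phrase.replace(' ', '_'))
    return line

sample_phrases = ['united states', 'united states of america', 'ella baker']
sample = 'Ella Baker spoke of the United States of America and the United States'
joined = join_phrases(sample, sample_phrases)
print(joined)
```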
Step 7: Populate word dictionaries
Populate the wordsdict and stopdict dictionaries with the appropriate words and frequencies for each word.
Sample substeps could include:
Loop through each line in speech
Loop through each word in the line (split might help)
Check the word against stopwords
If it is a stopword, add it to stopdict and adjust the counter value, then go to the next word
If not a stopword, add it to wordsdict and adjust the counter value
# Step 7: Populate word dictionaries
#---- Your code here for word dictionaries
wordsdict = {}
stopdict = {}
for line in speech:
    words = line.split()
    for word in words:
        if word in stopwords:
            if word in stopdict:
                stopdict[word] += 1
            else:
                stopdict[word] = 1
        else:
            if word in wordsdict:
                wordsdict[word] += 1
            else:
                wordsdict[word] = 1
#---- end of your code
print(stopdict)
print(wordsdict)
Output:
{'a': 64, 'of': 66, 'the': 135, 'with': 20, 'this': 34, 'and': 89, 'they': 5, 'will': 48, 'are': 13, 'for': 47, 'our': 48, 'has': 9, 'in': 54, 'too': 4, 'now': 3, 'you': 19, 'my': 13, 'me': 9, 'on': 19, 'not': 15, 'be': 25, 'an': 12, 'to': 82, 'no': 5, 'we': 37, 'can': 18, 'over': 5, 'am': 1, 'into': 5, 'it': 26, 'is': 44, 'that': 37, 'while': 2, 'as': 15, 'those': 8, 'who': 11, 'did': 3, 'all': 12, 'just': 5, 'or': 5, 'so': 8, 'than': 13, 'by': 7, 'he': 6, 'about': 13, 'few': 2, 'at': 7, 'have': 23, 'only': 5, 'what': 5, 'when': 4, 'most': 9, 'same': 2, 'more': 15, 'but': 5, 'very': 4, 'any': 2, 'their': 9, 'until': 2, 'its': 4, 'from': 10, 'was': 8, 'were': 3, 'up': 13, 'if': 2, 'been': 7, 'him': 2, 'your': 11, 'where': 2, 'after': 1, 'does': 1, 'do': 6, 'out': 6, 'them': 7, 'here': 2, 'again': 3, 'other': 3, 'own': 1, 'off': 1, 'each': 3, 'should': 2, 'his': 2, "it's": 7, 'how': 2, 'being': 2, 'through': 1, 'once': 4, 'down': 4, 'then': 2, 'why': 1, 'these': 2, 'some': 1, 'which': 1, 'both': 2, 'ours': 1, 'she': 4, 'her': 3, 'had': 1, 'before': 1, 'there': 2, 'under': 1}
---
---
---
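The counting pattern above can be shortened with dict.get, and a set makes the stopword lookup faster. The sketch below also lower-cases each line before splitting, which is what the opening instructions ask for. With that change, a capitalized word such as "The" would be counted as a stopword, so the counts on the real speech would differ from the output shown above. The helper name count_words and the sample data are hypothetical.

```python
def count_words(lines, stopwords):
    """Return (wordsdict, stopdict) with lower-cased word frequencies."""
    stopset = set(stopwords)  # set membership tests are fast
    wordsdict, stopdict = {}, {}
    for line in lines:
        for word in line.lower().split():
            # pick the dictionary this word belongs to, then bump its count
            target = stopdict if word in stopset else wordsdict
            target[word] = target.get(word, 0) + 1
    return wordsdict, stopdict

sample_lines = ['Give people light', 'The light will find a way']
sample_stop = ['the', 'will', 'a']
sample_words, sample_stops = count_words(sample_lines, sample_stop)
print(sample_stops)
print(sample_words)
```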
Step 8: Basic Stats
Figure out how to calculate some basic measures and statistics from the speech.
Use variable names below.
See the comment for details.
# Step 8: Calculate Statistics
# Use both previous dictionaries where appropriate
totalExcl = puncdict['!'] # total sentences ending in !
totalQues = puncdict['?'] # total sentences ending in ?
totalSent = totalPar # sentence count, approximated here by the paragraph count
aveWords = totalWords / totalSent # average words per sentence (remember previous variable)
#---- end of your code
print(f'Total Exclamations: {totalExcl}')
print(f'Total Questions: {totalQues}')
print(f'Total Sentences: {totalSent}')
print(f'Average Words/Sentence: {aveWords}')
Output:
Total Exclamations: 0
Total Questions: 7
Total Sentences: 181
Average Words/Sentence: 17.685082872928177
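The code above uses the paragraph count as the sentence count. If the sentence count is instead taken from the punctuation dictionary, it is the sum of its values. The sketch below shows that variant under stated assumptions: the helper name basic_stats is hypothetical, the sample dictionary is the example one from the Step 3 description, and the word total of 460 is made up for illustration.

```python
def basic_stats(puncdict, total_words):
    """Derive sentence statistics from punctuation counts and a word total."""
    total_sent = sum(puncdict.values())
    # guard against division by zero when no sentence endings were found
    ave_words = total_words / total_sent if total_sent else 0.0
    return {
        'totalExcl': puncdict.get('!', 0),
        'totalQues': puncdict.get('?', 0),
        'totalSent': total_sent,
        'aveWords': ave_words,
    }

sample_punc = {'.': 30, '!': 13, '?': 3}
stats = basic_stats(sample_punc, total_words=460)
print(f"Total Sentences: {stats['totalSent']}")
print(f"Average Words/Sentence: {stats['aveWords']:.2f}")
```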
Step 9: Create a CSV file of the wordsdict
Use the file variable from before to name the csv file
Be sure to append .csv to the name
The first row should include the headers 'word' and 'count' as the two column headers
Export the dictionary key:value pairs; each pair will be one row
You will probably use the fieldnames argument and the writeheader() and writerows() methods
You will use this file in the next assignment
# Step 9: Write csv file of the wordsdict
# remember to add column headings and use a standard naming convention
#---- Your code here for the csv file
import csv
outputfile = file+'.csv'
rows = []
for key,value in wordsdict.items():
    rows.append({'word':key,'count':value})
# newline='' lets the csv module control line endings (avoids blank rows on Windows)
with open(outputfile, 'w', newline='') as f:
    fnames = ['word', 'count']
    writer = csv.DictWriter(f, fieldnames=fnames, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    writer.writeheader()
    writer.writerows(rows)
#---- end of your code
# View your file
# Mac/Linux Users:
!more 'Biden_DNC_Speech_2020.csv'
#!more 'Trump_RNC_Speech_2020.csv'
# Windows Users:
#!cat 'Biden_DNC_Speech_2020.csv'
#!cat 'Trump_RNC_Speech_2020.csv'
Output:
word,count
Good,1
evening,1
Ella_Baker,1
giant,1
civil,1
rights,3
movement,1
left,2
us,18
wisdom,1
Give,2
people,15
light,10
find,3
way,7
Those,1
words,4
time,10
The,10
current,5
president,20
cloaked,1
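One way to check the export is to read the rows back with csv.DictReader. The sketch below writes to any text file object, so an in-memory io.StringIO buffer can stand in for the real file. The helper name write_word_counts and the sample counts are hypothetical. Note that values read back from a CSV are strings, not integers.

```python
import csv
import io

def write_word_counts(wordsdict, fileobj):
    """Write word/count rows with a header to an open text file object."""
    writer = csv.DictWriter(fileobj, fieldnames=['word', 'count'])
    writer.writeheader()
    writer.writerows({'word': w, 'count': c} for w, c in wordsdict.items())

# In-memory buffer standing in for the real output file
sample_counts = {'light': 10, 'people': 15}
buffer = io.StringIO()
write_word_counts(sample_counts, buffer)

# Read the rows back to confirm the header and the pairs
buffer.seek(0)
rows_back = list(csv.DictReader(buffer))
print(rows_back)
```

With a real file, the call would sit inside a with block that opens the file with newline='', for example: with open(file + '.csv', 'w', newline='') as f: write_word_counts(wordsdict, f).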