Text Classification:
Data
1. We have a total of 20 classes of documents (text files) and 18828 documents in total.
2. You can download the data from this link; it contains a documents.rar archive.
3. If you unzip it, you will get 18828 documents. Each document name follows the pattern 'ClassLabel_DocumentNumberInThatLabel', so you can extract the class label for a document from its name.
4. Our problem is to classify each document into one of these classes.
5. Below we provide a count plot of all the class labels in our data.
### Count plot of all the class labels
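A minimal sketch of how such a count plot can be produced, assuming a list `labels` holding one class label per document (extracted from the 'ClassLabel_DocumentNumber' file names); the variable name is an assumption, not part of the assignment:

import matplotlib.pyplot as plt
import seaborn as sns

# labels: one class label per document, e.g. 'alt.atheism' (hypothetical variable)
plt.figure(figsize=(12, 6))
sns.countplot(y=labels)
plt.title("Count plot of all the class labels")
plt.tight_layout()
plt.show()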
Sample document:
Subject: A word of advice
From: jcopelan@nyx.cs.du.edu (The One and Only)
In article < 65882@mimsy.umd.edu > mangoe@cs.umd.edu (Charley Wingate) writes:
>
>I've said 100 times that there is no "alternative" that should think you
>might have caught on by now. And there is no "alternative", but the point
>is, "rationality" isn't an alternative either. The problems of metaphysical
>and religious knowledge are unsolvable-- or I should say, humans cannot
>solve them.
How does that saying go: Those who say it can't be done shouldn't interrupt
those who are doing it.
Jim
--
Have you washed your brain today?
Preprocessing:
Useful links: http://www.pyregex.com/

1. Find all emails in the document, take the text after the "@", split that text on '.', remove the words whose length is less than or equal to 2, also remove the word 'com', and then join the remaining words with spaces. If one doc has 2 or more emails, process all of them. Eg: [test@dm1.d.com, test2@dm2.dm3.com] --> [dm1.d.com, dm2.dm3.com] --> [dm1, d, com, dm2, dm3, com] --> [dm1, dm2, dm3] --> "dm1 dm2 dm3". Append each result to one list/array (this gives 18828 sentences, i.e. one entry per document). Some sample output is shown below.
> In the above sample document the emails are [jcopelan@nyx.cs.du.edu, 65882@mimsy.umd.edu, mangoe@cs.umd.edu]
> Preprocessing: [jcopelan@nyx.cs.du.edu, 65882@mimsy.umd.edu, mangoe@cs.umd.edu] ==> [nyx cs du edu mimsy umd edu cs umd edu] ==> [nyx edu mimsy umd edu umd edu]
2. Replace all the emails in the original text with a space. (A sketch of steps 1 and 2 is given after the sample output below.)
# We have collected all the emails and preprocessed them; this is a sample output
preprocessed_email
output:
array(['juliet caltech edu', 'coding bchs edu newsgate sps mot austlcm sps mot austlcm sps mot com dna bchs edu', 'batman bmd trw', ..., 'rbdc wsnc org dscomsa desy zeus desy', 'rbdc wsnc org morrow stanford edu pangea Stanford EDU', 'rbdc wsnc org apollo apollo'], dtype=object)
len(preprocessed_email)
output:
18828
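A minimal sketch of steps 1 and 2; the whitespace-based email pattern is an assumption (any regex that matches the emails in your data will do):

import re

EMAIL_PATTERN = re.compile(r'\S+@\S+')

def preprocess_emails(text):
    """Step 1: collect domain words from every email in the text.
    Step 2: replace the emails in the text with a space."""
    words = []
    for email in EMAIL_PATTERN.findall(text):
        domain = email.split('@')[1]                 # text after the '@'
        for part in domain.split('.'):
            # drop parts of length <= 2 and the word 'com'
            if len(part) > 2 and part.lower() != 'com':
                words.append(part)
    cleaned = EMAIL_PATTERN.sub(' ', text)           # step 2: emails -> space
    return ' '.join(words), cleaned

Applying preprocess_emails to each of the 18828 documents and collecting the first element of each result gives the preprocessed_email array shown above.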
3. Get the subject of the text, i.e. take the lines where "Subject:" occurs, remove the words before the ":", and remove newlines, tabs, punctuation and any special characters. Eg: for the sentence "Subject: Re: Gospel Dating @ \r\r\n" you have to get "Gospel Dating". Save all this data into another list/array.
4. After you store it in the list, replace those sentences in the original text with a space.
5. Delete all the sentences that start with "Write to:" or "From:".
> In the above sample document check the 2nd line; we should remove it.
6. Delete all tags like "< anyword >".
> In the above sample document check the 4th line; we should remove "< 65882@mimsy.umd.edu >".
7. Delete all the data present in brackets. In many text documents we observed that an explanation of a sentence, or its translation into another language, is kept in brackets, so remove all of those. Eg: "AAIC-The course that gets you HIRED (AAIC - Der Kurs, der Sie anstellt)" --> "AAIC-The course that gets you HIRED".
> In the above sample document check the 4th line; we should remove "(Charley Wingate)".
8. Remove all newlines ('\n'), tabs ('\t'), "-" and "\".
9. Remove all the words that end with ":". Eg: "Anyword:"
> In the above sample document check the 4th line; we should remove "writes:".
10. Decontract, i.e. replace contractions with their full forms (please check the DonorsChoose preprocessing for this). Eg: can't -> can not, 's -> is, i've -> i have, i'm -> i am, you're -> you are, i'll -> i will.
There is no required order for points 6 to 10, but the final output has to be correct.
11. Do chunking on the text you have after the above preprocessing. Text chunking, also referred to as shallow parsing, is a task that follows part-of-speech tagging and adds more structure to the sentence: it combines some phrases and named entities into single units. After chunking, join the words of each phrase/named entity with "_", and remove the phrase/named entity if it is a "Person". You can use nltk.ne_chunk to get these; a minimal example is given below, please go through it. Useful links: https://www.nltk.org/book/ch07.html, https://stackoverflow.com/a/31837224/4084039, http://www.nltk.org/howto/tree.html, https://stackoverflow.com/a/44294377/4084039
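A minimal sketch of step 11 using nltk.ne_chunk; the NLTK resources named in the comments must be downloaded once, and the exact entity labels depend on the NLTK models:

import nltk
# one-time downloads: nltk.download('punkt'), nltk.download('averaged_perceptron_tagger'),
# nltk.download('maxent_ne_chunker'), nltk.download('words')

def chunk_text(text):
    """Join the words of each named entity with '_' and drop PERSON entities."""
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    out = []
    for node in nltk.ne_chunk(tagged):
        if isinstance(node, nltk.Tree):              # a named-entity subtree
            if node.label() == 'PERSON':
                continue                             # remove persons
            out.append('_'.join(word for word, tag in node))
        else:                                        # a plain (word, tag) pair
            out.append(node[0])
    return ' '.join(out)

print(chunk_text("New York is far away from Berlin"))
# e.g. 'New_York is far away from Berlin' (labels depend on the NLTK models)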
13. Replace all digits with a space, i.e. delete all digits.
> In the above sample document, the 6th line has the digit 100, so we have to remove it.
14. After doing the above points, we observed there might be a few words like "_word_" (i.e. starting and ending with "_"), "_word" (i.e. starting with "_") and "word_" (i.e. ending with "_"); remove the "_" from these types of words.
15. We also observed some words like "OneLetter_word" (eg: d_berlin) and "TwoLetters_word" (eg: dr_berlin); in these words we remove the "OneLetter_" (d_berlin ==> berlin) and the "TwoLetters_" (dr_berlin ==> berlin), i.e. after splitting such words on "_", remove the parts whose length is less than or equal to 2.
16. Convert all words to lower case and remove the words whose length is greater than or equal to 15 or less than or equal to 2.
17. Replace every character except "A-Za-z_" with a space.
18. Now you have the preprocessed text, email and subject. Create a dataframe with those; its columns are shown below. (A sketch of the cleanups in steps 13-17 is given after the sample row below.)
data.columns
output:
Index(['text', 'class', 'preprocessed_text', 'preprocessed_subject', 'preprocessed_emails'], dtype='object')
data.iloc[400]
output:
text From: arc1@ukc.ac.uk (Tony Curtis)\r\r\r\nSubj...
class alt.atheism
preprocessed_text said re is article if followed the quoting rig...
preprocessed_subject christian morality is
preprocessed_emails ukc mac macalstr edu
Name: 567, dtype: object
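A minimal sketch of the cleanups in steps 13-17, under one possible ordering (step 17 is applied before the word-level filters; any order that produces the same final output is fine):

import re

def clean_words(text):
    """Steps 13-17: remove digits, fix '_'-words, lowercase, length-filter."""
    text = re.sub(r'\d', ' ', text)                  # 13: digits -> space
    text = re.sub(r'[^A-Za-z_]', ' ', text)          # 17: keep only A-Za-z_
    words = []
    for word in text.split():
        # 14 and 15: split on '_' and drop the parts of length <= 2
        word = '_'.join(p for p in word.split('_') if len(p) > 2)
        word = word.lower()                          # 16: lower case
        if 2 < len(word) < 15:                       # 16: length filter
            words.append(word)
    return ' '.join(words)

print(clean_words("_word_ d_berlin dr_berlin 100 times"))   # 'word berlin berlin times'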
To get the above-mentioned data frame, try to write all the preprocessing steps in one function named preprocess, as below.

def preprocess(Input_Text):
    """Do all the preprocessing as shown above and return a tuple
    (preprocessed_emails, preprocessed_subject, preprocessed_text) for that text."""
    return (list_of_preprocessed_emails, subject, text)
Code checking: after writing the preprocess function, call it with the input text of the 'alt.atheism_49960' doc and print its output. This will help us evaluate faster; based on the output we can suggest changes if any are needed.
After writing the preprocess function, call it for each of the 18828 documents and then create a dataframe as mentioned above, as sketched below.
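A minimal sketch of this loop, assuming the documents were extracted into a folder named 'documents' and read with latin-1 encoding (both assumptions; adjust to your setup):

import os
import pandas as pd

rows = []
for fname in os.listdir('documents'):                # assumed folder name
    label = fname.rsplit('_', 1)[0]                  # 'alt.atheism_49960' -> 'alt.atheism'
    with open(os.path.join('documents', fname), encoding='latin-1') as f:
        text = f.read()
    emails, subject, body = preprocess(text)         # the function written above
    rows.append((text, label, body, subject, emails))

data = pd.DataFrame(rows, columns=['text', 'class', 'preprocessed_text',
                                   'preprocessed_subject', 'preprocessed_emails'])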
Training the Models to Classify:
1. Combine "preprocessed_text", "preprocessed_subject" and "preprocessed_emails" into one column, and use that column for modelling.
2. Now split the data into train and test; use 25% for test, and do a stratified split.
3. Analyze your text data and pad the sequences if required. The sequence length is not restricted; you can use any length of your choice, but you need to give your reasoning.
4. Tokenize, i.e. convert the text into numbers. Please be careful while doing it: if you use the tf.keras "Tokenizer" API, its default filters remove the "_", but we need it. (See the data-preparation sketch after this list.)
5. Code the models (Model-1, Model-2) as discussed below, and try to optimize them.
6. For every model use predefined GloVe vectors; don't train any word vectors while training the model.
7. Use "categorical_crossentropy" as the loss.
8. Use accuracy and micro-averaged F1 score as your key metrics to evaluate your models.
9. Use TensorBoard to plot the loss and metrics across epochs.
10. Please save your best model weights to 'best_model_L.h5' (L = 1 or 2).
11. You are free to choose any activation function, learning rate and optimizer, but you have to use the same architecture that we give below.
12. You can add some layers to our architecture, but deleting layers is not acceptable.
13. Try to use the early stopping technique or any of the callback techniques that you used in the previous assignments.
14. For every model, save the model to an image (plot the model) with shapes, include those images in the notebook markdown cells, and upload the images to Classroom. You can use "plot_model"; please refer to this if you don't know how to plot a model with shapes.
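Below is a minimal sketch of the data preparation in steps 1, 2, 4, 6 and 7, assuming the dataframe `data` built above; the GloVe file name, the 350 sequence length and the random_state are assumptions, not requirements:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

# step 1: combine the three preprocessed columns into one
data['combined'] = (data['preprocessed_text'] + ' ' +
                    data['preprocessed_subject'] + ' ' +
                    data['preprocessed_emails'])

# step 2: 75/25 stratified split
X_train, X_test, y_train, y_test = train_test_split(
    data['combined'], data['class'], test_size=0.25,
    stratify=data['class'], random_state=42)

# step 4: the default filters contain '_', so we pass them without it
tokenizer = Tokenizer(filters='!"#$%&()*+,-./:;<=>?@[\\]^`{|}~\t\n')
tokenizer.fit_on_texts(X_train)
max_len = 350                                        # assumed; justify from your data
X_train_seq = pad_sequences(tokenizer.texts_to_sequences(X_train), maxlen=max_len)
X_test_seq = pad_sequences(tokenizer.texts_to_sequences(X_test), maxlen=max_len)

# step 7: one-hot labels for categorical_crossentropy
le = LabelEncoder()
y_train_cat = to_categorical(le.fit_transform(y_train))
y_test_cat = to_categorical(le.transform(y_test))

# step 6: embedding matrix from predefined GloVe vectors (assumed file name)
embedding_dim = 300
embeddings = {}
with open('glove.840B.300d.txt', encoding='utf-8') as f:
    for line in f:
        parts = line.rstrip().split(' ')
        word = ' '.join(parts[:-embedding_dim])      # some 840B tokens contain spaces
        embeddings[word] = np.asarray(parts[-embedding_dim:], dtype='float32')

vocab_size = len(tokenizer.word_index) + 1
embedding_matrix = np.zeros((vocab_size, embedding_dim))
for word, i in tokenizer.word_index.items():
    if word in embeddings:
        embedding_matrix[i] = embeddings[word]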
Model-1: Using 1D convolutions with word embeddings
Encoding of the text: for the given text data, create a matrix with an Embedding layer as shown below. In the example we considered d = 5, but in this assignment d = the dimension of the word vectors we are using, i.e. if a sentence has a maximum of 350 words and we use 300-dimensional word vectors, the embedding layer outputs a 350*300 matrix for each sentence.
Ref: https://i.imgur.com/kiVQuk1.png
References: https://stackoverflow.com/a/43399308/4084039, https://missinglink.ai/guides/keras/keras-conv1d-working-1d-convolutional-neural-networks-keras/, How EMBEDDING LAYER WORKS
Go through this blog if you have any doubts about using predefined embedding values in an Embedding layer: https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/
Ref: https://i.imgur.com/fv1GvFJ.png
1. All are Conv1D layers, with any number of filters and any filter sizes; there is no restriction on this.
2. The concatenate layer is used to concatenate all the filters/channels.
3. You can use any pool size and stride for the max-pooling layer.
4. Don't use more than 16 filters in one Conv layer, because it will increase the number of params. (Only a recommendation, if you have less computing power.)
5. You can use any number of layers after the Flatten layer. (A sketch of this architecture is given below.)
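Below is a minimal sketch of one possible Model-1 following the reference image; it reuses max_len, vocab_size, embedding_dim, embedding_matrix and the train/test arrays from the data-preparation sketch above, and the branch count, kernel sizes and layer widths are free choices, not the required values:

from tensorflow.keras.layers import (Input, Embedding, Conv1D, MaxPooling1D,
                                     Concatenate, Flatten, Dense)
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint, TensorBoard

inp = Input(shape=(max_len,))
emb = Embedding(vocab_size, embedding_dim, weights=[embedding_matrix],
                trainable=False)(inp)                # predefined GloVe, frozen (step 6)

# parallel Conv1D branches with different kernel sizes, then concatenated
branches = []
for k in (3, 4, 5):
    c = Conv1D(filters=16, kernel_size=k, activation='relu', padding='same')(emb)
    c = MaxPooling1D(pool_size=2)(c)
    branches.append(c)
x = Concatenate()(branches)
x = Flatten()(x)
x = Dense(64, activation='relu')(x)
out = Dense(20, activation='softmax')(x)             # 20 classes

model1 = Model(inp, out)
model1.compile(optimizer='adam', loss='categorical_crossentropy',
               metrics=['accuracy'])

callbacks = [EarlyStopping(monitor='val_loss', patience=3),            # step 13
             ModelCheckpoint('best_model_1.h5', save_best_only=True),  # step 10
             TensorBoard(log_dir='logs/model1')]                       # step 9
model1.fit(X_train_seq, y_train_cat, validation_data=(X_test_seq, y_test_cat),
           epochs=10, batch_size=64, callbacks=callbacks)

For the micro-averaged F1 score (step 8) you can, for example, run sklearn.metrics.f1_score(y_true, y_pred, average='micro') on the test predictions.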
Model-2: Using 1D convolutions with character embeddings
Here are some papers based on Char-CNN:
1. Xiang Zhang, Junbo Zhao, Yann LeCun. Character-level Convolutional Networks for Text Classification. NIPS 2015
2. Yoon Kim, Yacine Jernite, David Sontag, Alexander M. Rush. Character-Aware Neural Language Models. AAAI 2016
3. Shaojie Bai, J. Zico Kolter, Vladlen Koltun. An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling
4. Use the pretrained char embeddings: https://github.com/minimaxir/char-embeddings/blob/master/glove.840B.300d-char.txt
Ref: https://i.imgur.com/EuuoJtr.png
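Below is a minimal sketch of the character-level input pipeline for Model-2, assuming the X_train/X_test splits from the data-preparation sketch above and the char-embedding file linked in point 4 downloaded locally; the 1000-character sequence length is an assumption:

import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# character-level tokenizer: empty filters, so '_' and punctuation survive
char_tokenizer = Tokenizer(char_level=True, filters='', lower=False)
char_tokenizer.fit_on_texts(X_train)
max_char_len = 1000                                  # assumed; justify from your data
X_train_char = pad_sequences(char_tokenizer.texts_to_sequences(X_train),
                             maxlen=max_char_len)
X_test_char = pad_sequences(char_tokenizer.texts_to_sequences(X_test),
                            maxlen=max_char_len)

# load the pretrained 300-d char embeddings from the linked file
char_embeddings = {}
with open('glove.840B.300d-char.txt', encoding='utf-8') as f:
    for line in f:
        parts = line.rstrip().split(' ')
        char_embeddings[parts[0]] = np.asarray(parts[1:], dtype='float32')

char_vocab = len(char_tokenizer.word_index) + 1
char_matrix = np.zeros((char_vocab, 300))
for ch, i in char_tokenizer.word_index.items():
    if ch in char_embeddings:
        char_matrix[i] = char_embeddings[ch]

# the model itself mirrors Model-1: an Embedding layer with
# weights=[char_matrix] and trainable=False on the character ids,
# followed by the Conv1D architecture shown in the image above,
# with the best weights saved to 'best_model_2.h5'.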
If you are looking to hire an expert who can help you implement text classification problems, you can contact us directly or send your requirement details to:
realcode4you@gmail.com
Here you get an affordable price without any plagiarism issues.