Text classification is one of the many applications of machine learning and AI. In this tutorial we will see how to build a news classifier app with Streamlit and Python. We will use ML models that we have already trained and saved to make our predictions.
First of all, let us install the various packages we will be using, along with the English model that spaCy needs.
pip install streamlit scikit-learn joblib spacy wordcloud pandas matplotlib
python -m spacy download en_core_web_sm
The basic structure of our ML app will consist of two main sections.
- Prediction with ML Section
- NLP with spaCy and WordCloud
We will use Streamlit’s sidebar to create a menu for selecting our activities. All our code will be in a main function called main().
    # IMPORT ALL PACKAGES HERE

    def main():
        # OUR CODE GOES HERE
        pass

    if __name__ == '__main__':
        main()
Building the News Classifier Section
We will be using Streamlit and scikit-learn to work on this section.
In building our ML app we need a way to receive input from the end user and then process that input with our models. We will use Streamlit’s text_area() function to get input from the user like this.
news_text = st.text_area("Enter Text","Type Here")
Since our models cannot work with raw text, we need to vectorize it, that is, convert it into numbers. We will use our saved CountVectorizer to turn the text into an array of word counts that our ML models can process.
    # Load our CountVectorizer (joblib.load accepts a file path directly,
    # so no file handle is left open)
    news_cv = joblib.load("models/final_news_cv_vectorizer.pkl")
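To make the idea of count vectorization concrete, here is a minimal pure-Python sketch of what a count vectorizer does: build a vocabulary from some training text, then turn any new text into a row of word counts. It is an illustration only, not scikit-learn's implementation; the function names fit_vocabulary and transform, the tokenization rule and the sample sentences are all assumptions made for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words (simplified rule)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def fit_vocabulary(documents):
    """Map every distinct word in the training documents to a column index."""
    words = sorted({word for doc in documents for word in tokenize(doc)})
    return {word: index for index, word in enumerate(words)}


def transform(document, vocabulary):
    """Return one row of counts; words outside the vocabulary are ignored."""
    counts = Counter(tokenize(document))
    return [counts.get(word, 0) for word in vocabulary]


# Hypothetical training sentences, used only to build the vocabulary.
training_docs = ["stocks rise as markets rally", "team wins the cup final"]
vocabulary = fit_vocabulary(training_docs)

row = transform("Markets rally and stocks rise, stocks soar", vocabulary)
print(list(vocabulary))
print(row)
```

Note that "and" and "soar" are dropped because they were never seen during fitting. This is why the app has to load the same vectorizer that was fitted alongside the models rather than fit a new one on the user's text.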
To make our predictions we will load our already prepared models using joblib, a serialization package.
    # Load our models
    def load_prediction_models(model_file):
        # joblib.load takes the path to the saved model file
        loaded_models = joblib.load(model_file)
        return loaded_models
Loading models that are already trained means the app does not have to train anything when it starts, and a single helper function keeps the loading code short.
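As a rough sketch of the save-once, load-many round trip behind those .pkl files, the example below writes an object to disk and reads it back. It uses Python's built-in pickle module as a stand-in so that it runs without extra packages; joblib.dump(obj, path) and joblib.load(path) follow the same pattern. The helper names save_object and load_object, the file name demo.pkl and the demo_model dictionary are hypothetical.

```python
import os
import pickle
import tempfile


def save_object(obj, path):
    """Serialize obj to the file at path."""
    with open(path, "wb") as handle:
        pickle.dump(obj, handle)


def load_object(path):
    """Read back an object previously written by save_object."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Stand-in for a trained model: any picklable Python object works the same way.
demo_model = {"name": "demo", "labels": ["business", "tech", "sport"]}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo.pkl")
    save_object(demo_model, model_path)
    restored = load_object(model_path)

print(restored == demo_model)
```

Serialized files of this kind should only be loaded from sources you trust, since unpickling can execute arbitrary code.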
Finally, for our News Classification Section, we will convert our result, which is a number, into a user-friendly label using a dictionary of our prediction labels. We will add a function, get_keys(), to do that for us.
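The lookup itself is small enough to try on its own. The sketch below uses the same label dictionary and the get_keys() helper from the full listing; the inverted dictionary id_to_label is an optional alternative added here for illustration, not part of the original app.

```python
prediction_labels = {
    'business': 0, 'tech': 1, 'sport': 2,
    'health': 3, 'politics': 4, 'entertainment': 5,
}


def get_keys(val, my_dict):
    """Return the first key whose value equals val, or None if nothing matches."""
    for key, value in my_dict.items():
        if val == value:
            return key
    return None


# Alternative (an assumption, not in the original app): invert the dictionary
# once so that each lookup is a single dictionary access.
id_to_label = {value: key for key, value in prediction_labels.items()}

print(get_keys(2, prediction_labels))
print(id_to_label[4])
```

Either form works for six labels. The inverted dictionary raises a KeyError for an unknown number, while get_keys() quietly returns None.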
Let us see the code for our ML Prediction Section.
    if choice == 'Prediction':
        st.info("Prediction with ML")
        news_text = st.text_area("Enter Text", "Type Here")
        all_ml_models = ["LR", "NB", "RFOREST", "DECISION_TREE"]
        model_choice = st.selectbox("Choose ML Model", all_ml_models)
        prediction_labels = {'business': 0, 'tech': 1, 'sport': 2, 'health': 3, 'politics': 4, 'entertainment': 5}

        if st.button("Classify"):
            st.text("Original text ::\n{}".format(news_text))
            # Convert the input text into the count vector the models expect
            vect_text = news_cv.transform([news_text]).toarray()

            # Load the model that matches the user's choice
            if model_choice == 'LR':
                predictor = load_prediction_models("models/newsclassifier_Logit_model.pkl")
            elif model_choice == 'RFOREST':
                predictor = load_prediction_models("models/newsclassifier_RFOREST_model.pkl")
            elif model_choice == 'NB':
                predictor = load_prediction_models("models/newsclassifier_NB_model.pkl")
            elif model_choice == 'DECISION_TREE':
                predictor = load_prediction_models("models/newsclassifier_CART_model.pkl")

            # predict() returns one label per input row; we passed a single row
            prediction = predictor.predict(vect_text)
            # st.write(prediction)
            final_result = get_keys(prediction[0], prediction_labels)
            st.success("News Categorized as:: {}".format(final_result))
Building the NLP Section of our App
For our natural language processing, we will be using spaCy and wordcloud. spaCy is a powerful NLP library for tasks such as tokenization, named entity recognition, dependency parsing and more. In our case we will use spaCy for tokenization, named entity recognition, lemmatization and part-of-speech tagging.
We will then display our result both as JSON and as a table using a pandas DataFrame.
We will then use wordcloud to build a picture of the most common words in our text.
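A word cloud sizes each word by how often it appears, so the underlying idea is a frequency count. The sketch below shows that counting step in plain Python; it is not how the wordcloud package is implemented, and the stop-word set, the top_words function and the sample sentence are assumptions made for this example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list for the example.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}


def top_words(text, n=5):
    """Return the n most frequent words in text, ignoring stop words."""
    words = [word for word in re.findall(r"[a-z']+", text.lower())
             if word not in STOP_WORDS]
    return Counter(words).most_common(n)


sample = ("The market rallied and the market closed higher "
          "as tech stocks led the market")
print(top_words(sample, 1))  # [('market', 3)]
```

In a word cloud, "market" would be drawn largest because it has the highest count.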
This is the entire code for the NLP Section of our App.
    if choice == 'NLP':
        st.info("Natural Language Processing")
        news_text = st.text_area("Enter Text", "Type Here")
        nlp_task = ["Tokenization", "NER", "Lemmatization", "POS Tags"]
        task_choice = st.selectbox("Choose NLP Task", nlp_task)

        if st.button("Analyze"):
            st.info("Original Text {}".format(news_text))
            docx = nlp(news_text)
            if task_choice == 'Tokenization':
                result = [token.text for token in docx]
            elif task_choice == 'Lemmatization':
                result = ["'Token':{},'Lemma':{}".format(token.text, token.lemma_) for token in docx]
            elif task_choice == 'NER':
                result = [(entity.text, entity.label_) for entity in docx.ents]
            elif task_choice == 'POS Tags':
                result = ["'Token':{},'POS':{},'Dependency':{}".format(word.text, word.tag_, word.dep_) for word in docx]
            st.json(result)

        if st.button("Tabulize"):
            docx = nlp(news_text)
            c_tokens = [token.text for token in docx]
            c_lemma = [token.lemma_ for token in docx]
            c_pos = [word.tag_ for word in docx]
            new_df = pd.DataFrame(list(zip(c_tokens, c_lemma, c_pos)), columns=['Tokens', 'Lemma', 'POS'])
            st.dataframe(new_df)

        if st.checkbox("Wordcloud"):
            wordcloud = WordCloud().generate(news_text)
            # Draw on an explicit figure and pass it to st.pyplot()
            fig, ax = plt.subplots()
            ax.imshow(wordcloud, interpolation='bilinear')
            ax.axis("off")
            st.pyplot(fig)
In summary, our entire code will look like this:

    import streamlit as st
    import joblib

    # NLP pkgs
    import spacy
    # spaCy 3 loads models by their full name; the old 'en' shortcut no longer works
    nlp = spacy.load('en_core_web_sm')

    # EDA pkgs
    import pandas as pd

    # Wordcloud
    from wordcloud import WordCloud
    import matplotlib
    matplotlib.use('Agg')  # select the non-interactive backend before importing pyplot
    import matplotlib.pyplot as plt

    # Vectorizer
    news_cv = joblib.load("models/final_news_cv_vectorizer.pkl")


    # Load our models
    def load_prediction_models(model_file):
        loaded_models = joblib.load(model_file)
        return loaded_models


    def get_keys(val, my_dict):
        # Return the label whose number matches the prediction
        for key, value in my_dict.items():
            if val == value:
                return key


    def main():
        """News Classifier App with Streamlit"""
        st.title("News Classifier ML App")
        st.subheader("NLP and ML App with Streamlit")

        activities = ["Prediction", "NLP"]
        choice = st.sidebar.selectbox("Choose Activity", activities)

        if choice == 'Prediction':
            st.info("Prediction with ML")
            news_text = st.text_area("Enter Text", "Type Here")
            all_ml_models = ["LR", "NB", "RFOREST", "DECISION_TREE"]
            model_choice = st.selectbox("Choose ML Model", all_ml_models)
            prediction_labels = {'business': 0, 'tech': 1, 'sport': 2, 'health': 3, 'politics': 4, 'entertainment': 5}

            if st.button("Classify"):
                st.text("Original text ::\n{}".format(news_text))
                vect_text = news_cv.transform([news_text]).toarray()

                if model_choice == 'LR':
                    predictor = load_prediction_models("models/newsclassifier_Logit_model.pkl")
                elif model_choice == 'RFOREST':
                    predictor = load_prediction_models("models/newsclassifier_RFOREST_model.pkl")
                elif model_choice == 'NB':
                    predictor = load_prediction_models("models/newsclassifier_NB_model.pkl")
                elif model_choice == 'DECISION_TREE':
                    predictor = load_prediction_models("models/newsclassifier_CART_model.pkl")

                # predict() returns one label per input row; we passed a single row
                prediction = predictor.predict(vect_text)
                # st.write(prediction)
                final_result = get_keys(prediction[0], prediction_labels)
                st.success("News Categorized as:: {}".format(final_result))

        if choice == 'NLP':
            st.info("Natural Language Processing")
            news_text = st.text_area("Enter Text", "Type Here")
            nlp_task = ["Tokenization", "NER", "Lemmatization", "POS Tags"]
            task_choice = st.selectbox("Choose NLP Task", nlp_task)

            if st.button("Analyze"):
                st.info("Original Text {}".format(news_text))
                docx = nlp(news_text)
                if task_choice == 'Tokenization':
                    result = [token.text for token in docx]
                elif task_choice == 'Lemmatization':
                    result = ["'Token':{},'Lemma':{}".format(token.text, token.lemma_) for token in docx]
                elif task_choice == 'NER':
                    result = [(entity.text, entity.label_) for entity in docx.ents]
                elif task_choice == 'POS Tags':
                    result = ["'Token':{},'POS':{},'Dependency':{}".format(word.text, word.tag_, word.dep_) for word in docx]
                st.json(result)

            if st.button("Tabulize"):
                docx = nlp(news_text)
                c_tokens = [token.text for token in docx]
                c_lemma = [token.lemma_ for token in docx]
                c_pos = [word.tag_ for word in docx]
                new_df = pd.DataFrame(list(zip(c_tokens, c_lemma, c_pos)), columns=['Tokens', 'Lemma', 'POS'])
                st.dataframe(new_df)

            if st.checkbox("Wordcloud"):
                wordcloud = WordCloud().generate(news_text)
                # Draw on an explicit figure and pass it to st.pyplot()
                fig, ax = plt.subplots()
                ax.imshow(wordcloud, interpolation='bilinear')
                ax.axis("off")
                st.pyplot(fig)


    if __name__ == '__main__':
        main()
You can check the entire video tutorial here.
To get more on building machine learning and natural language processing apps, you can check out this upcoming course.
Thanks For Your Time
Jesus Saves
By Jesse E.Agbe (JCharis)
Hello, I'm on Windows and I'm having problems installing the packages. Can you help me?
Hello Bishesh, what kind of problems, if I may ask?
Is it possible for you to use a virtual environment such as pipenv or virtualenv?
Installing Pipenv
pip3 install pipenv
Setting Up your Virtual environment
pipenv install streamlit pandas matplotlib
Hope it helps
I created a virtual environment, but when I try to install spaCy it shows an error… Cannot run ‘rc.exe’
The same happened when I tried to install wordcloud… Cannot run ‘rc.exe’. Please help.
Hi Bishesh, are you on Windows? If so, and you have the disk space, you can try installing Anaconda if you like. It comes with several Python packages, and with it you should be able to install spaCy and wordcloud without any issues.
Please let me know the outcome.
Thanks