Individual Investor Sentiment Mining and Analysis Using Go, VADER and Kubernetes

Apr 25, 2021

The stock market has repeatedly broken all-time highs since the COVID-19 outbreak, and the rally has attracted more individual investors: 2020 saw a 4% increase in individual investors over 2019, and double the number in 2010 (Source: https://cutt.ly/rv1rdzy). Emotion has always been an important factor in the trading decisions of individual investors, so their sentiment could serve as an indicator for predicting price movements. In addition, a series of events in which individual investors outsmarted hedge funds underlines their importance. The solution described here mines data from social media and finance platforms and analyzes the sentiment it contains.

Data Source

The most direct source of individual investors' emotions is the comments they post on social media and mainstream media. The platforms with more active users discussing the stock market were identified as Yahoo Finance, Twitter and Reddit. The following sections use data mined from Yahoo Finance conversations. (Sample link: https://finance.yahoo.com/quote/NIO/community?p=NIO)

Solution Architecture

Data Mining Microservice with Go

Microservice architecture is becoming more popular because it makes code easier to maintain and resources more efficient to manage. Go was chosen as the language for the data mining microservice because of its robust concurrency mechanism. As Yahoo Finance does not provide an API for querying conversations, the comments are scraped from the community posts instead. The comments are exported as a CSV file to a shared NFS filer.

The data mining microservice is integrated into the Web API, which is built with the Go Fiber web framework (https://github.com/gofiber/fiber). Fiber was selected because it stands out among Go web frameworks in terms of allocations and requests per second. The Web API accepts a ticker symbol, e.g. NIO, and a query time range in epoch time. It has a configurable timeout and validates the user input parameters. By default the timeout is 5 minutes and the maximum time range for a query is 7 days. Upon receiving the completion signal from the sentiment analysis microservice, the Web API returns the CSV file containing the analyzed sentiments in the response body and the weightage of each sentiment in the response header.
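The input validation described above can be sketched in plain Go. This is a minimal sketch and not the code from the repository: the type analyzeRequest, the function validate and the individual rules are assumptions, with field names taken from the sample request body shown later in this article. Only the 7-day limit comes from the description above, and the timeout handling is left out.

```go
package main

import (
	"errors"
	"fmt"
)

// maxRangeSeconds mirrors the 7-day query limit mentioned in the article.
const maxRangeSeconds int64 = 7 * 24 * 60 * 60

// analyzeRequest is a hypothetical type; its JSON field names are copied
// from the article's sample request body.
type analyzeRequest struct {
	Quote          string `json:"quote"`
	StartTime      int64  `json:"startTime"`
	EndTime        int64  `json:"endTime"`
	AnalyzerEngine string `json:"analyzerEngine"`
}

// validate returns nil when the request is acceptable, or an error that
// describes the first rule it violates. The rules are assumptions.
func validate(r analyzeRequest) error {
	switch {
	case r.Quote == "":
		return errors.New("quote must not be empty")
	case r.StartTime <= 0 || r.EndTime <= 0:
		return errors.New("startTime and endTime must be positive epoch seconds")
	case r.EndTime <= r.StartTime:
		return errors.New("endTime must be later than startTime")
	case r.EndTime-r.StartTime > maxRangeSeconds:
		return fmt.Errorf("time range must not exceed %d seconds", maxRangeSeconds)
	}
	return nil
}

func main() {
	ok := analyzeRequest{Quote: "NIO", StartTime: 1618099200, EndTime: 1618704000, AnalyzerEngine: "VADER"}
	tooLong := analyzeRequest{Quote: "NIO", StartTime: 1618099200, EndTime: 1618790400, AnalyzerEngine: "VADER"}
	fmt.Println(validate(ok))      // exactly 7 days, so no error is returned
	fmt.Println(validate(tooLong)) // 8 days, so an error is returned
}
```

In a Fiber handler, a check like this would typically run right after the request body is parsed, so that an over-long query is rejected before any scraping starts.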

Source code: chiupc/SharkDetector-MarketSentimentsMiner

Data Cleaning Microservice

The data cleaning microservice receives a signal from the data mining microservice via gRPC and removes invalid characters, including double quotes, emoticons and special characters.
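A cleaning step of this kind could look like the sketch below. It is an illustration only and not the logic from the repository: the function cleanComment and the list of punctuation that is kept are assumptions. Note that this simple version drops every non-ASCII character, which removes emoji but also accented letters.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// keepPunct lists the punctuation this sketch keeps; the choice is an assumption.
const keepPunct = ".,!?'$%-"

// cleanComment drops double quotes, emoji and other special characters,
// then collapses runs of whitespace into single spaces.
func cleanComment(s string) string {
	cleaned := strings.Map(func(r rune) rune {
		switch {
		case r > unicode.MaxASCII:
			return -1 // drop emoji and every other non-ASCII character
		case unicode.IsLetter(r), unicode.IsDigit(r):
			return r
		case unicode.IsSpace(r):
			return ' ' // normalize tabs and newlines to spaces
		case strings.ContainsRune(keepPunct, r):
			return r
		}
		return -1 // drop double quotes and remaining special characters
	}, s)
	return strings.Join(strings.Fields(cleaned), " ")
}

func main() {
	fmt.Println(cleanComment("\"NIO\" to the moon 🚀🚀 buy now!!"))
}
```

Removing double quotes and line breaks before the CSV is written also keeps each comment on a single, unambiguous row.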

Source code: chiupc/sentiment-analytic

Sentiment Analysis Microservice

The model used to analyze the sentiment of user comments is VADER, which is available in the Python package NLTK. The model scores the sentiment of a text by summing the intensity of each word in the text. (Source) It is wrapped in a microservice written in Python that other microservices can communicate with, and gRPC is used for the inter-process communication.
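To make the scoring idea concrete, here is a toy version of the "sum the word intensities" step, written in Go only to keep the examples in one language. It is a sketch and not VADER itself: the lexicon values are made up, and the real model also accounts for negation, intensifiers, punctuation and capitalization. The normalization x / sqrt(x^2 + 15) and the ±0.05 cut-offs follow the conventions of the reference VADER implementation; whether the microservice uses the same cut-offs is an assumption.

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

// toyLexicon holds made-up intensity values for illustration only;
// the real VADER lexicon is far larger and its values differ.
var toyLexicon = map[string]float64{"great": 3.0, "good": 2.0, "bad": -2.5, "crash": -2.0}

// normalize squashes an unbounded sum into the open interval (-1, 1).
func normalize(sum float64) float64 {
	return sum / math.Sqrt(sum*sum+15)
}

// compound sums the intensity of every known word and normalizes the result.
func compound(text string) float64 {
	var sum float64
	for _, w := range strings.Fields(strings.ToLower(text)) {
		sum += toyLexicon[strings.Trim(w, ".,!?")] // unknown words contribute 0
	}
	return normalize(sum)
}

// label maps a compound score to a sentiment class using the ±0.05 cut-offs.
func label(c float64) string {
	switch {
	case c >= 0.05:
		return "positive"
	case c <= -0.05:
		return "negative"
	}
	return "neutral"
}

func main() {
	for _, text := range []string{"Great stock!", "Bad crash incoming", "Holding for now"} {
		c := compound(text)
		fmt.Printf("%-20s %+.3f %s\n", text, c, label(c))
	}
}
```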

The data cleaning microservice signals the sentiment analysis microservice, which then picks up the cleaned CSV file for sentiment analysis. The processed file overwrites the cleaned file in the shared storage, and a processing-complete signal is returned to the Web API.

The microservice runs the VADER model to calculate a sentiment score for each line in the CSV file and appends the score as a new column. At the end of the analysis, it uses Pandas to calculate the distribution of each sentiment (positive, neutral and negative) across all the comments.
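The distribution step is a normalized count of the labels. The microservice does this with Pandas; the sketch below shows the same arithmetic in Go so that the examples in this article stay in one language. The function name distribution and the sample labels are assumptions.

```go
package main

import "fmt"

// distribution returns the share of each sentiment label among all comments.
// The three expected labels are always present in the result, even at 0.
func distribution(labels []string) map[string]float64 {
	dist := map[string]float64{"positive": 0, "neutral": 0, "negative": 0}
	if len(labels) == 0 {
		return dist // avoid dividing by zero when there are no comments
	}
	for _, l := range labels {
		dist[l]++ // count each label
	}
	for k := range dist {
		dist[k] /= float64(len(labels)) // turn counts into fractions
	}
	return dist
}

func main() {
	labels := []string{"positive", "positive", "neutral", "negative"}
	fmt.Println(distribution(labels))
}
```

The three fractions add up to 1, which matches the shape of the summary the Web API returns in its response header.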

Source code: chiupc/sentiment_analysis_py_grpc

Kubernetes as Infrastructure for Microservices

Kubernetes is an open-source container orchestration system for automating application deployment, scaling, and management. It makes deploying and maintaining the microservices in a production environment easier, as each of them runs as a container isolated from the others. In addition, letting Kubernetes manage the microservices allows server resources to be used more efficiently.

To deploy a microservice on Kubernetes, its Docker image has to be built first. Each microservice has a Dockerfile in its repository, which is used to build the Docker image. The Dockerfile contains the steps for setting up the microservice's dependencies and the command to run the microservice. After the Docker image was built, it was uploaded to the Docker Hub image repository so that it could be deployed as a container on the Kubernetes platform.

Scaleway is the platform provider, as it is more cost-effective than AWS, Azure and GCP. Its location in Europe is a downside, but at the current usage volume this should not be a concern.

Kubernetes files for the microservices: https://github.com/chiupc/sentiment-analytic-kube.git

Docker image for Web API and data mining microservice: https://hub.docker.com/repository/docker/chiupc/sentiments-collector

Docker image for data cleaning microservice: https://hub.docker.com/repository/docker/chiupc/sentiment-analytic-go-grpc

Docker image for sentiment analysis Python microservice: https://hub.docker.com/repository/docker/chiupc/sentimentanalytic-py

How to use

The API is exposed at the following endpoint via the POST method:

POST http://51.158.129.250:3000/v1/yf/coversations/analyze

Sample request body. Use an epoch converter to convert the start and end times to epoch time. Keep the analyzer engine at its default, “VADER”; the API might be extended to other sentiment analysis engines offered by cloud providers such as GCP and Azure.

{"quote" : "NIO","startTime" : 1618099200,"endTime" : 1618704000,"analyzerEngine" : "VADER"}
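For readers who prefer code to an online converter, the sketch below builds the same request body in Go and derives the epoch values from calendar dates. It is only a sketch: the names analyzeRequest, epoch and buildBody are hypothetical, interpreting the dates as midnight UTC is an assumption, and so is the Content-Type header. The request is constructed but not sent.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// endpoint is copied verbatim from the article.
const endpoint = "http://51.158.129.250:3000/v1/yf/coversations/analyze"

// analyzeRequest is a hypothetical type matching the sample request body.
type analyzeRequest struct {
	Quote          string `json:"quote"`
	StartTime      int64  `json:"startTime"`
	EndTime        int64  `json:"endTime"`
	AnalyzerEngine string `json:"analyzerEngine"`
}

// epoch converts a calendar date (midnight UTC, an assumption) to epoch seconds.
func epoch(year, month, day int) int64 {
	return time.Date(year, time.Month(month), day, 0, 0, 0, 0, time.UTC).Unix()
}

// buildBody serializes a request that uses the default "VADER" engine.
func buildBody(quote string, start, end int64) ([]byte, error) {
	return json.Marshal(analyzeRequest{Quote: quote, StartTime: start, EndTime: end, AnalyzerEngine: "VADER"})
}

func main() {
	body, err := buildBody("NIO", epoch(2021, 4, 11), epoch(2021, 4, 18))
	if err != nil {
		panic(err)
	}
	req, err := http.NewRequest(http.MethodPost, endpoint, bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	req.Header.Set("Content-Type", "application/json") // assumed to be expected by the API
	fmt.Println(req.Method, req.URL.String())
	fmt.Println(string(body))
	// Sending is left out of this sketch; http.DefaultClient.Do(req) would perform the call.
}
```

The dates 11 April 2021 and 18 April 2021 correspond to the startTime and endTime values in the sample body, which span exactly the 7-day maximum.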

Sample result. The Web API returns the analyzed CSV file in the response body and the distribution of sentiments in the response header:

Summary-Data
{"positive":0.4275544389,"neutral":0.327680067,"negative":0.2447654941}
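On the client side, the summary can be decoded from the response header as in the sketch below. It assumes the header is named Summary-Data, as the sample above suggests, and that its value is the JSON object shown; the type summary and the function parseSummary are hypothetical. A hand-built http.Header stands in for the header of a real response.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// summary is a hypothetical type mirroring the sample Summary-Data value.
type summary struct {
	Positive float64 `json:"positive"`
	Neutral  float64 `json:"neutral"`
	Negative float64 `json:"negative"`
}

// parseSummary decodes the JSON carried in the response header.
func parseSummary(raw string) (summary, error) {
	var s summary
	err := json.Unmarshal([]byte(raw), &s)
	return s, err
}

func main() {
	// Stand-in for resp.Header of a real response; the header name is an assumption.
	h := http.Header{}
	h.Set("Summary-Data", `{"positive":0.4275544389,"neutral":0.327680067,"negative":0.2447654941}`)

	s, err := parseSummary(h.Get("Summary-Data"))
	if err != nil {
		panic(err)
	}
	fmt.Printf("positive %.1f%%, neutral %.1f%%, negative %.1f%%\n",
		s.Positive*100, s.Neutral*100, s.Negative*100)
}
```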