Sample case studies
Case Study: Web Scraping
Background
• An AdTech company wanted to know which products/topics were trending in real time so that it could run high frequency trading of digital ad placements
• E.g. if a celebrity tweets about a product and the tweet goes viral, the company can distribute ads for that product while it is still trending, or place ads next to trending topics on publishers’ websites. This type of high frequency trading yields a higher RoI for its clients.
Approach
• Pull data from the APIs of Twitter and other social media platforms, along with leading e-commerce and publisher websites
• Create a dashboard of trending products/topics on each platform in real time. Enable the reverse process too, where the company can track how specific products/topics are trending on social media and the wider internet.
• Categorize these products/topics using natural language processing.
• Platform allows for high frequency trading.
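The real-time trending step above can be illustrated with a sliding-window counter. This is only a minimal sketch, not the company's implementation: the class name TrendingCounter, the window length, and the assumption that mentions arrive in timestamp order are all hypothetical, and a production system would read from a stream such as Amazon Kinesis rather than from in-memory calls.

```python
from collections import Counter, deque


class TrendingCounter:
    """Counts topic mentions inside a sliding time window.

    Assumption (not from the source): mentions arrive in non-decreasing
    timestamp order, with timestamps expressed in seconds.
    """

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()    # (timestamp, topic) pairs in arrival order
        self.counts = Counter()  # topic -> mentions currently inside the window

    def add(self, timestamp, topic):
        """Record one mention, then drop mentions that fell out of the window."""
        self.events.append((timestamp, topic))
        self.counts[topic] += 1
        self._expire(timestamp)

    def _expire(self, now):
        # Remove every event that is at least window_seconds older than `now`.
        while self.events and self.events[0][0] <= now - self.window_seconds:
            _, old_topic = self.events.popleft()
            self.counts[old_topic] -= 1
            if self.counts[old_topic] == 0:
                del self.counts[old_topic]

    def top(self, n):
        """Return the n most-mentioned topics in the current window."""
        return self.counts.most_common(n)
```

Calling add() for each incoming mention and top(n) on each dashboard refresh gives a rolling view of what is trending. Keeping only the events inside the window means memory grows with the window's traffic, not with the full history.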
Statistical Techniques Used
• Natural Language Processing
• Text Classification
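The case study does not say which classifier was used. As one possible illustration of text classification, the sketch below implements a small multinomial Naive Bayes classifier with add-one smoothing using only the Python standard library. The category labels, training sentences, and all function and class names are made up for the example.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing.

    Assumption: this is an illustrative stand-in, not the classifier
    used in the case study.
    """

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of documents
        self.vocabulary = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.doc_counts[label] += 1
            for word in tokenize(text):
                self.word_counts[label][word] += 1
                self.vocabulary.add(word)
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in tokenize(text):
                if word not in self.vocabulary:
                    continue  # words never seen in training carry no signal
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training data for two product categories.
classifier = NaiveBayesTextClassifier().fit(
    [
        "new running shoes on sale",
        "limited edition sneakers drop",
        "latest phone camera review",
        "laptop battery and phone charger",
    ],
    ["fashion", "fashion", "electronics", "electronics"],
)
```

With a model like this, each trending post or product title can be tagged with a category before it reaches the dashboard.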
Tech Stack
• Python Backend, Python Django Web Framework, MySQL, Amazon Kinesis for stream data
Takeaways
• The platform aggregates two thirds of data on the open web.
• Numerous case studies showed HFT delivering higher click through rates and RoI.
Case Study: Alternative Data
Background
• A financial services firm wanted to use nighttime NASA illumination data in its economic analysis of emerging markets. If a geographic area is more illuminated at night, it could mean increased economic activity (e.g. overnight factory shifts, electrification).
• This gave the firm real time intelligence on countries and reduced its reliance on faulty government data in emerging markets.
• The raw data needed to be normalized so that other factors influencing nighttime illumination magnitude (e.g. cloud cover, lunar cycle) were removed and no noise from other sources remained
Approach
• Used clustering to separate economically active sections from nearby non-active sections
• Missing values at each latitude/longitude were replaced by the mean albedo value of the respective cluster
• After missing value replacement, the albedo values were aggregated at day level and converted into a year-long time series, one per latitude/longitude pair
• For each latitude and longitude, the time series was decomposed into a lunar effect (seasonal), a cloud effect (random error), and the effect of economic activity at night (trend)
• With the seasonal and error components removed, the trend component can be used in further analysis of economic growth
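The decomposition step can be sketched as a classical additive decomposition, where each daily value is treated as trend + seasonal + residual. The code below is a simplified illustration, not the firm's pipeline: the function name is hypothetical, the sketch only supports an odd integer period, and using an integer number of days for the lunar cycle (which is roughly 29.5 days) is itself an approximation.

```python
def decompose_additive(series, period):
    """Classical additive decomposition: series = trend + seasonal + residual.

    Assumptions (not from the source): `period` is an odd integer >= 3 and
    the series covers at least two full periods. Trend and residual are None
    near the edges, where the centered moving-average window does not fit.
    """
    if period < 3 or period % 2 == 0:
        raise ValueError("this sketch expects an odd period of at least 3")
    n = len(series)
    if n < 2 * period:
        raise ValueError("need at least two full periods of data")
    half = period // 2

    # Trend: centered moving average spanning exactly one period.
    trend = [None] * n
    for i in range(half, n - half):
        window = series[i - half:i + half + 1]
        trend[i] = sum(window) / period

    # Seasonal: average detrended value at each phase of the cycle.
    phase_sums = [0.0] * period
    phase_counts = [0] * period
    for i in range(half, n - half):
        phase_sums[i % period] += series[i] - trend[i]
        phase_counts[i % period] += 1
    phase_means = [s / c for s, c in zip(phase_sums, phase_counts)]

    # Center the seasonal component so it sums to zero over one period.
    offset = sum(phase_means) / period
    seasonal = [phase_means[i % period] - offset for i in range(n)]

    # Residual: whatever trend and seasonal do not explain.
    residual = [
        None if trend[i] is None else series[i] - trend[i] - seasonal[i]
        for i in range(n)
    ]
    return trend, seasonal, residual
```

In the terms used above, the seasonal output plays the role of the lunar effect, the residual plays the role of the cloud effect, and the trend is the component carried forward into economic analysis.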
Tech Stack
• Backend: Python, Apache Spark, MySQL
• Frontend: Django, D3
Statistical Techniques
• K-means Clustering, Exponential Smoothing, Time Series Decomposition, Fourier Analysis
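The clustering and missing-value replacement steps can be sketched together. The source does not state which features were clustered, so this example assumes each location is summarized by a single long-run average brightness value and runs a one-dimensional K-means on it. The function names, the deterministic initialization, and the sample numbers are all illustrative assumptions.

```python
def kmeans_1d(values, k, iterations=100):
    """Lloyd's algorithm on scalar values; returns (labels, centroids).

    Assumption: starting centroids are spread across the sorted values so
    the result is deterministic (real pipelines often use random restarts).
    """
    ordered = sorted(values)
    centroids = [ordered[(len(ordered) - 1) * j // max(k - 1, 1)] for j in range(k)]
    labels = []
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centroid.
        labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            new_centroids.append(sum(members) / len(members) if members else centroids[c])
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return labels, centroids


def fill_with_cluster_mean(readings, labels):
    """Replace None readings with the mean observed reading of the same cluster.

    A reading stays None if its cluster has no observed readings at all.
    """
    sums, counts = {}, {}
    for reading, lab in zip(readings, labels):
        if reading is not None:
            sums[lab] = sums.get(lab, 0.0) + reading
            counts[lab] = counts.get(lab, 0) + 1
    return [
        reading if reading is not None
        else (sums[lab] / counts[lab] if lab in counts else None)
        for reading, lab in zip(readings, labels)
    ]


# Hypothetical data: long-run average brightness per location, then one
# night's readings where two locations were not observed.
average_brightness = [1.0, 1.2, 0.8, 9.0, 9.5, 10.0]
labels, centroids = kmeans_1d(average_brightness, k=2)
tonight = [1.1, None, 0.9, None, 9.4, 9.6]
filled = fill_with_cluster_mean(tonight, labels)
```

Filling from the cluster mean rather than a global mean keeps a dark rural cell from inheriting the brightness of a nearby industrial zone.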