The client is an online media company that provides financial news and analytics for the US market and attracts 3M unique visitors per month.
The client needed a solution that could collect website visitors’ behavior together with transactional data and present the transformed data as graphs, charts, and maps.
The client intended to use the obtained data for analysis and making data-driven decisions.
Task 1: Building an ETL pipeline to extract data from various sources, transform it, and load to the target system in the desired format
The data we were to transfer:
- Website traffic data including users’ fingerprint identity, location, page views, clicks, etc. (in JSON format)
- Static files (in CSV format)
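Both formats can be parsed with the Python standard library. Below is a minimal sketch; the field names and sample values are illustrative assumptions, not the client’s actual schema:

```python
import csv
import io
import json

# Hypothetical website-traffic event as it might arrive in JSON
# (field names are illustrative, not the client's real schema).
raw_event = '{"fingerprint": "fp-7d2a", "location": "US-NY", "page": "/markets", "event": "click"}'
event = json.loads(raw_event)

# Hypothetical static file delivered as CSV (e.g. pulled from Amazon S3).
raw_csv = "user_id,age,country\n42,35,US\n"
rows = list(csv.DictReader(io.StringIO(raw_csv)))
```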
Our data sources were:
- The client website
- Amazon S3 (cloud object storage)
Target systems for loading data were:
- Snowflake (cloud data warehouse)
- Postgres (open source object-relational database system)
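The loading step boils down to writing transformed records into relational tables. A minimal sketch, assuming made-up column names; in production the target would be Postgres (e.g. via psycopg2) or Snowflake (via its Python connector), with an in-memory sqlite3 database standing in here to keep the example self-contained:

```python
import sqlite3

# Hypothetical transformed records ready for loading (values are made up).
records = [("fp-7d2a", "/markets", 5), ("fp-91c0", "/news", 3)]

# sqlite3 stands in for the real Postgres/Snowflake target.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (fingerprint TEXT, page TEXT, views INTEGER)")
conn.executemany("INSERT INTO page_views VALUES (?, ?, ?)", records)
total = conn.execute("SELECT SUM(views) FROM page_views").fetchone()[0]
```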
After extracting data from the source systems, we moved to data transformation and enrichment. For this stage we compared plain Python scripts, Matillion, PySpark, and pandas, and selected pandas: it doesn’t require a cluster setup (unlike Spark) and is much more cost-effective than Matillion.
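A typical enrichment step joins traffic events with static user attributes and then aggregates. A minimal pandas sketch, with assumed column names rather than the client’s real ones:

```python
import pandas as pd

# Illustrative frames; column names and values are assumptions.
events = pd.DataFrame({
    "user_id": [1, 1, 2],
    "event": ["view", "click", "view"],
})
users = pd.DataFrame({"user_id": [1, 2], "country": ["US", "US"]})

# Enrichment: attach static attributes to each traffic event.
enriched = events.merge(users, on="user_id", how="left")

# Transformation: per-user activity counts by event type.
summary = enriched.pivot_table(index="user_id", columns="event",
                               aggfunc="size", fill_value=0)
```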
To transfer website traffic data we additionally used Kafka and Golang. All further transformation and processing (to get a structured breakdown of user activity per activity type) took place in batch mode using Python.
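The batch breakdown itself can be as simple as tallying events by type. A sketch over a hypothetical batch of newline-delimited JSON events, as they might land from the Kafka-fed stream (field names are illustrative):

```python
import json
from collections import Counter

# Hypothetical batch of newline-delimited JSON traffic events.
batch = """{"fingerprint": "fp-7d2a", "event": "page_view"}
{"fingerprint": "fp-7d2a", "event": "click"}
{"fingerprint": "fp-91c0", "event": "page_view"}"""

# Structured breakdown of user activity per activity type.
counts = Counter(json.loads(line)["event"] for line in batch.splitlines())
```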
Task 2: Visualization of the obtained data
To display correlations, relationships, ratios, dynamics, and other insights derived from the data in a user-friendly way, the Lightpoint team was to:
- choose and connect Business Intelligence tools to the client’s data repository;
- choose a set of graphs, charts, maps, and other visualization options that are most useful and relevant for the client’s business;
- set up real-time data visualization.
We did research to choose the right BI tool for the client, shortlisting Tableau and Power BI. Our selection criteria were the data sources a tool supports (file data sources, various databases, Web, Azure, online services, etc.), the visualization options it offers, the indicators it can track, its user friendliness, and the complexity of building and maintaining graphs.
We opted for Power BI because it supports real-time streaming datasets and is better at handling complex relationships between tables and data sources. Power BI also proved faster, looks more user-friendly, and is more cost-effective than Tableau.
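For real-time visualization, Power BI streaming datasets accept rows pushed over HTTP as JSON. A minimal sketch of preparing such a push; the URL below is a placeholder (a real push URL comes from the dataset’s API settings), and the row fields are made up:

```python
import json

# Placeholder push URL; the real one is generated by Power BI for the
# streaming dataset. Kept as a string only, no request is sent here.
push_url = "https://api.powerbi.com/beta/<workspace>/datasets/<dataset>/rows?key=<key>"

# Hypothetical row to stream onto a dashboard tile.
rows = [{"timestamp": "2023-01-01T00:00:00Z", "page_views": 120}]
payload = json.dumps(rows)

# In production this payload would be sent with an HTTP POST, e.g. via
# urllib.request or requests, with Content-Type: application/json.
```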
Once data is imported, it’s ready to be charted. We selected the following chart types for dashboard visualization:
- Line chart (displays changing data over time)
- Bar chart (to visually compare certain characteristics)
- Side-by-side bar (to compare characteristics across several classes on a time scale)
- Filled map (to show how a value differs across some geographical region)
- Symbol map (overlaying values onto geographical locations using symbols)
- Donut chart (to quickly assess the difference between values for the selected classes)
- Scatter plot (to detect similarities in the distribution of different classes with respect to some parameters, as well as visually find differences)
- Funnel (to analyze the performance of systems whose processes pass through several stages)
Additional charts we used: line chart represented through circles, line chart with highlighted area, horizontal bar chart, pie chart, stacked column chart and stacked bar chart, packed bubbles, treemap, heat map, circle views, side-by-side circles, dual combination, bullet graphs, Gantt chart, decomposition tree, key influencer, gauge, card and multi-row card, slicer, matrix and table, waterfall chart, percent of grand total mode, KPI, ribbon chart.
Below are some examples of the values we used for building charts:
- Number of user transactions, max count of transactions for one user
- Payment system/bank user selected
- Number of frauds, fraud/not fraud
- Age/age category, location (country, city, state, province, etc.), gender
- Monthly distribution, etc.
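Metrics like these reduce to simple aggregations over the transaction records. A sketch with invented sample data, computing the maximum transaction count for one user and the fraud ratio:

```python
from collections import Counter

# Hypothetical (user_id, is_fraud) transaction records; values are made up.
transactions = [(1, False), (1, False), (2, True), (1, False), (3, False)]

per_user = Counter(uid for uid, _ in transactions)
max_per_user = max(per_user.values())  # max count of transactions for one user
fraud_ratio = sum(is_fraud for _, is_fraud in transactions) / len(transactions)
```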
Team Composition & Project Duration
The Lightpoint team completed the project in 3 months with one DevOps engineer and two Data Engineers.
Major Tech Stack
Postgres, Snowflake, Amazon S3
Power BI, Tableau
Data Integration Platforms: