Development of an ETL-pipeline and integration with BI tools for a US-based digital publishing company

The client is an online media company that provides financial news and analytics for the US market, attracting 3M unique visitors per month.


The client needed a solution that could collect website visitors’ behavioral data together with transactional data and present the transformed data as graphs, charts, and maps. They intended to use the obtained data for analysis and data-driven decision-making.

Project Description

Task 1: Building an ETL pipeline to extract data from various sources, transform it, and load it into the target system in the desired format

The data we were to transfer:

  • Website traffic data including users’ fingerprint identity, location, page views, clicks, etc. (in JSON format)
  • Static files (in CSV format)

Our data sources were:

  • The client website
  • Amazon S3 (cloud object storage)
  • FTP

Target systems for loading data were:

  • Snowflake (cloud data warehouse)
  • Postgres (open source object-relational database system)

After data extraction from the source systems came the stage of data transformation and enrichment. For this purpose we compared plain Python scripts, Matillion, PySpark, and Python Pandas. We selected Pandas since it doesn’t require a cluster setup (unlike Spark) and is much more cost-effective than Matillion.
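As a minimal sketch of the Pandas transformation step: the snippet below parses raw JSON traffic events and enriches them with a derived date column for batch partitioning. The field names (fingerprint, country, event, ts) are illustrative assumptions, not the client’s actual schema.

```python
import json
import pandas as pd

# Hypothetical raw clickstream events as they might arrive from the website
# (field names are illustrative, not the client's real schema).
raw_events = [
    '{"fingerprint": "a1", "country": "US", "event": "page_view", "ts": "2023-01-05T10:00:00"}',
    '{"fingerprint": "a1", "country": "US", "event": "click", "ts": "2023-01-05T10:00:12"}',
    '{"fingerprint": "b2", "country": "CA", "event": "page_view", "ts": "2023-01-05T10:01:03"}',
]

def transform(events: list[str]) -> pd.DataFrame:
    """Parse JSON events, normalize types, and enrich with derived columns."""
    df = pd.DataFrame(json.loads(e) for e in events)
    df["ts"] = pd.to_datetime(df["ts"])
    df["date"] = df["ts"].dt.date  # enrichment: calendar date for batch partitioning
    return df

df = transform(raw_events)
print(df[["fingerprint", "event", "date"]])
```

A frame shaped like this can then be loaded into the target warehouse in bulk rather than row by row.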

To transfer website traffic data, we additionally used Kafka and Golang. All further transformation and processing (producing a structured breakdown of user activity per activity type) happened in batch mode using Python.
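The "breakdown per activity type" can be sketched in Pandas as a groupby-and-pivot over accumulated events; the column names below are assumptions for illustration.

```python
import pandas as pd

# Illustrative user-activity records (column names are assumptions).
activity = pd.DataFrame({
    "fingerprint": ["a1", "a1", "b2", "b2", "b2"],
    "activity_type": ["page_view", "click", "page_view", "page_view", "click"],
})

# Batch breakdown: count of events per user per activity type,
# pivoted into one row per user.
breakdown = (
    activity.groupby(["fingerprint", "activity_type"])
    .size()
    .unstack(fill_value=0)
)
print(breakdown)
```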

Task 2: Visualization of the obtained data

To display correlations, relationships, ratios, dynamics, and other insights derived from the data in a user-friendly way, the Lightpoint team was to:

  • choose and connect Business Intelligence tools to the client’s data repository;
  • choose a set of graphs, charts, maps, and other visualization options that are most useful and relevant for the client’s business;
  • set up real-time data visualization.

We researched BI tools for the client and shortlisted Tableau and Power BI. Our selection criteria were the data sources a tool supports (file-based sources, various databases, the Web, Azure, online services, etc.), the visualization options it offers, the indicators it can track, its user-friendliness, and the complexity of building and maintaining charts.

We opted for Power BI because it supports real-time streaming datasets and is better at handling complex relationships between tables and data sources. Power BI also proved to be faster, more user-friendly, and more cost-efficient than Tableau.
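Power BI streaming datasets accept rows through a "Push URL" via HTTP POST with a JSON array body. A minimal sketch of a push client is below; the URL and row fields are placeholders, not the client’s actual dataset.

```python
import json
import urllib.request

# Placeholder Push URL -- a real one is generated per streaming dataset in
# Power BI and includes a workspace, dataset id, and key.
PUSH_URL = "https://api.powerbi.com/beta/<workspace>/datasets/<dataset-id>/rows?key=<key>"

def build_payload(rows: list[dict]) -> bytes:
    """Serialize rows as the JSON array the Push URL expects."""
    return json.dumps(rows).encode("utf-8")

def push_rows(rows: list[dict]) -> None:
    """POST a batch of rows to the streaming dataset (raises on non-2xx)."""
    req = urllib.request.Request(
        PUSH_URL,
        data=build_payload(rows),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    urllib.request.urlopen(req)

# Example row for a real-time traffic dashboard (illustrative fields):
sample = [{"ts": "2023-01-05T10:00:00Z", "page_views": 42, "clicks": 7}]
```

In practice the batch job would call push_rows at the end of each processing interval so the dashboard tiles update in near real time.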

Once the data is imported, it’s ready to be charted. We selected the following chart types for the dashboard visualizations:

  • Line chart (to display changing data over time)
  • Bar chart (to visually compare certain characteristics)
  • Side-by-side bar (to compare several classes across selected characteristics on a time scale)
  • Filled map (to show how a value differs across some geographical region)
  • Symbol map (to overlay values onto geographical locations using symbols)
  • Donut chart (to quickly assess the difference between values for the selected classes)
  • Scatter plot (to detect similarities in the distribution of different classes with respect to some parameters, as well as visually find differences)
  • Funnel (to analyze the performance of systems that progress through a series of stages)

Additional charts we used: line chart rendered with circles, line chart with highlighted area, horizontal bar chart, pie chart, stacked column and stacked bar charts, packed bubbles, treemap, heat map, circle views, side-by-side circles, dual combination, bullet graph, Gantt chart, decomposition tree, key influencers, gauge, card and multi-row card, slicer, matrix and table, waterfall chart, percent of grand total, KPI, ribbon chart.

Below are some examples of the values we used for building the charts:

  • Number of user transactions, max count of transactions for one user
  • Payment system/bank selected by the user
  • Number of frauds, fraud/not fraud
  • Age/age category, location (country, city, state, province, etc), gender
  • Monthly distribution, etc.
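Values like these are simple aggregations over the transformed data. As a sketch, assuming a transaction table with hypothetical user_id, amount, and month columns:

```python
import pandas as pd

# Illustrative transaction records (schema is an assumption).
tx = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 3],
    "amount": [10.0, 25.0, 5.0, 40.0, 15.0, 8.0],
    "month": ["Jan", "Jan", "Feb", "Jan", "Feb", "Feb"],
})

per_user = tx.groupby("user_id").size()        # number of transactions per user
max_per_user = per_user.max()                  # max count of transactions for one user
monthly = tx.groupby("month")["amount"].sum()  # monthly distribution of amounts

print(max_per_user)          # 3
print(monthly.to_dict())     # {'Feb': 28.0, 'Jan': 75.0}
```

Each such aggregate feeds one chart on the dashboard (e.g., the monthly distribution drives a line or bar chart).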

Team Composition & Project Duration

The Lightpoint team completed the project in 3 months with the help of 1 DevOps and 2 Data Engineers.

Major Tech Stack

Programming Languages:

Golang, Python

Databases/Data repositories:

Postgres, Snowflake, Amazon S3

BI tools:

Power BI, Tableau

Data Integration Platforms:

Kafka, Matillion


DevOps & Infrastructure:

Jenkins, AWS


Results

We developed a solution that collects website visitors’ behavioral data and created an ETL pipeline that ensures fast, error-free transfer of data from the source systems to the target repository, as well as ongoing transformation to get the data ready for analytics and reporting.
We set up integration with Power BI and implemented a range of charts, visualizing analytical insights in real time.

Ready to talk about a similar solution for your business?
