ETL module development for a leading US subscription tech company

Client

Our client is a leading US-based subscription management technology provider that serves top-10 US news media companies and manages 1M+ subscribers annually.

The client provides ready-to-use software solutions for subscription-based businesses, enabling end-to-end subscription management and analytics and facilitating subscriber engagement, acquisition, retention, and more.

Challenge

Previously, the client had a non-scalable subscription data collection and processing system that resulted in data corruption, bugs, and malfunctions. Lightpoint’s goal was to develop an ETL (extract, transform, load) module linking the external data sources of the client’s customers (subscription-based publishing businesses) to the client’s target system.

The ETL module had to regularly collect subscription data from the client’s customers, process it in specific ways, and upload it to the client’s target system, where it would be used for subscription management, analytics, and reporting. Scalability, real-time system notifications, and timely data processing were among the major requirements for the final solution.

Project Description

We opted for a batch processing approach, handling the collected data in 24-hour intervals. The source data from multiple customers was stored on an FTP server in CSV format, though its structure was not unified. The data encompassed:

  • Subscription data (subscription type, subscriber identity and personal info, etc.)
  • Transactional data (subscriber actions on the subscription – payments, vacation holds, complaints, cancellations, etc.)
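
For illustration, these two feed types could be modeled roughly as follows once parsed; the record shapes and field names below are assumptions for the sketch, not the client’s actual schema:

```csharp
using System;

// Hypothetical canonical records that raw CSV rows were mapped to.
// Field names are illustrative; the real schema varied per customer.
public record SubscriptionRecord(
    string CustomerId,        // which publishing business the row came from
    string SubscriberId,
    string SubscriptionType,  // e.g. digital, print, bundle
    string SubscriberName,
    string Email,
    DateOnly StartDate);

public record TransactionRecord(
    string CustomerId,
    string SubscriberId,
    string Action,            // payment, vacation hold, complaint, cancellation, ...
    decimal? Amount,          // present for payments only
    DateTime OccurredAt);
```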

Two major goals of data ingestion were:

1. Gathering data for managing subscriptions and serving subscribers.
2. Aggregating data for analytics and reporting.

The initial raw data from the FTP server was loaded into a data lake, where it was standardized and transformed into a structure appropriate for each of the above-mentioned goals.
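
As a minimal sketch of the standardization step, assuming a per-customer column mapping was maintained (the customer IDs, column names, and FeedStandardizer helper below are all hypothetical):

```csharp
using System.Collections.Generic;

// Maps each customer's CSV header names onto the canonical column names
// used downstream. The mappings shown are invented for illustration.
static class FeedStandardizer
{
    static readonly Dictionary<string, Dictionary<string, string>> ColumnMaps = new()
    {
        ["publisherA"] = new() { ["sub_id"] = "SubscriberId", ["plan"] = "SubscriptionType" },
        ["publisherB"] = new() { ["SubscriberNo"] = "SubscriberId", ["Package"] = "SubscriptionType" },
    };

    // Rewrites one parsed CSV row (header -> value) into the canonical shape.
    public static Dictionary<string, string> Standardize(
        string customerId, Dictionary<string, string> rawRow)
    {
        var map = ColumnMaps[customerId];
        var canonical = new Dictionary<string, string>();
        foreach (var (rawName, value) in rawRow)
        {
            if (map.TryGetValue(rawName, out var canonicalName))
                canonical[canonicalName] = value.Trim();
            // Unmapped columns are dropped in this sketch; a production
            // module would flag them for review instead.
        }
        return canonical;
    }
}
```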

From the data lake, the data was moved to a working database (using SSIS packages) to support the subscription process and handle events like payments, onboarding, complaints, cancellations, changes to subscription parameters, and more.
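
In a .NET stack like this one, such a load step is often a bulk copy into a staging table that SSIS or T-SQL then merges into the live tables. Below is a sketch under that assumption, using the Microsoft.Data.SqlClient package; the staging table name and WorkingDbLoader helper are placeholders:

```csharp
using System.Data;
using Microsoft.Data.SqlClient;

static class WorkingDbLoader
{
    // Bulk-copies standardized rows into a staging table; a downstream
    // SSIS/T-SQL step merges staging into the live subscription tables.
    public static void LoadStaging(DataTable standardizedRows, string connectionString)
    {
        using var connection = new SqlConnection(connectionString);
        connection.Open();

        using var bulk = new SqlBulkCopy(connection)
        {
            DestinationTableName = "staging.SubscriptionFeed", // placeholder name
            BatchSize = 10_000,    // commit in chunks to limit transaction log growth
            BulkCopyTimeout = 600  // seconds; large daily batches can take a while
        };
        bulk.WriteToServer(standardizedRows);
    }
}
```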

Apart from the working database, there was a separate database for analytics. It ingested data from the working database, derived insights, and produced a set of reports used for further decision-making.

To ensure that the ETL pipeline worked accurately and without delays, we set up recurring jobs that regularly check the FTP server, download, parse, and load data files, transform and load data into the working and analytics databases, and so on.
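
A simplified C# sketch of such a recurring job is shown below; the actual scheduler may well have been SQL Server Agent, and the DailyEtlJob class, its step stubs, and the use of PeriodicTimer are assumptions for illustration, with the 24-hour interval taken from the batch window described above:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Skeleton of a recurring ETL job; every step below is a placeholder stub.
class DailyEtlJob
{
    public async Task RunAsync(CancellationToken ct)
    {
        // One full pipeline pass per 24-hour batch window.
        using var timer = new PeriodicTimer(TimeSpan.FromHours(24));
        do
        {
            CheckFtpAndDownload();   // poll the FTP server for new CSV files
            ParseAndLoadDataLake();  // parse files, land raw rows in the lake
            TransformToWorkingDb();  // standardize and load the working database
            RefreshAnalyticsDb();    // feed the analytics/reporting database
        }
        while (await timer.WaitForNextTickAsync(ct));
    }

    // Placeholder stubs; each step corresponds to a stage described above.
    void CheckFtpAndDownload() { }
    void ParseAndLoadDataLake() { }
    void TransformToWorkingDb() { }
    void RefreshAnalyticsDb() { }
}
```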

The final ETL module helped to resolve the following issues:

Issue 1: The system was unable to load a large volume of data files within a limited time frame.

To address this issue, we implemented parallel processing (we moved data loading and transformation to the middle tier, where they could scale and run in parallel), increased disk I/O speed, and improved the loading schedule to reduce concurrency.
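
As a rough illustration of that middle-tier parallelism, the sketch below processes independent customer files concurrently with Parallel.ForEach; the ParallelLoader helper and the degree-of-parallelism value are invented for the example:

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;

static class ParallelLoader
{
    // Processes independent customer files concurrently instead of one by one.
    public static void ProcessFiles(IReadOnlyList<string> filePaths)
    {
        var options = new ParallelOptions
        {
            // Cap concurrency so disk I/O and the database aren't saturated.
            MaxDegreeOfParallelism = 8   // tuned per environment; 8 is illustrative
        };

        Parallel.ForEach(filePaths, options, path =>
        {
            // Each file is self-contained, so transform + load can run in parallel.
            TransformAndLoad(path);
        });
    }

    static void TransformAndLoad(string path) { /* placeholder step */ }
}
```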

Issue 2: Inconsistent data structures and broken rows within data files, which created a bottleneck for data transformation and processing.

To fix that, we introduced validation algorithms for data structures and files. They detected corrupted data and produced alerts, prompting the team to adjust the mapping, reach out to the client to re-export the file, or manually correct the data file.
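
A minimal sketch of this kind of structural validation, assuming a fixed expected column count and a couple of illustrative field rules (the RowValidator helper and column layout are hypothetical):

```csharp
using System;
using System.Collections.Generic;

static class RowValidator
{
    // Checks one parsed CSV row against the expected structure and returns
    // the problems found, so corrupted rows can be flagged and alerted on
    // instead of silently breaking downstream transformation.
    public static IReadOnlyList<string> Validate(string[] row, int expectedColumnCount)
    {
        var errors = new List<string>();

        if (row.Length != expectedColumnCount)
        {
            errors.Add($"Expected {expectedColumnCount} columns, got {row.Length}.");
            return errors;   // column count is off; field checks would misfire
        }

        if (string.IsNullOrWhiteSpace(row[0]))
            errors.Add("SubscriberId (column 0) is empty.");      // assumed layout
        if (!DateTime.TryParse(row[1], out _))
            errors.Add($"Invalid date in column 1: '{row[1]}'."); // assumed layout

        return errors;
    }
}
```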

Issue 3: Frequent system failures.

To cope with that, we set up transparent logging of every process step, monitoring of logs and the overall health of every part of the platform, and alerts on critical errors, allowing us to react immediately and prevent downtime.
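
The sketch below shows the general step-level logging and critical-error alerting pattern; the EtlMonitor wrapper, console logging, and SendAlert hook are placeholders for the real logging and alerting infrastructure:

```csharp
using System;

static class EtlMonitor
{
    // Wraps one pipeline step so every run is logged and any failure
    // immediately raises an alert instead of failing silently.
    public static void RunStep(string stepName, Action step)
    {
        Console.WriteLine($"{DateTime.UtcNow:O} START {stepName}");
        try
        {
            step();
            Console.WriteLine($"{DateTime.UtcNow:O} OK    {stepName}");
        }
        catch (Exception ex)
        {
            Console.WriteLine($"{DateTime.UtcNow:O} FAIL  {stepName}: {ex.Message}");
            SendAlert(stepName, ex);   // placeholder: email/pager/monitoring hook
            throw;                     // keep the failure visible to the scheduler
        }
    }

    static void SendAlert(string stepName, Exception ex) { /* placeholder */ }
}
```

Wrapping each pipeline step this way keeps the logs uniform and ensures no failure goes unreported.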

Team Composition & Project Duration

The team worked on this project for 2 years and included 5 Data Engineers.

Major Tech Stack

.NET, T-SQL, SSIS, MSSQL

Results

  • Secure and streamlined communication between the external data sources of the client’s customers and the client’s target system
  • Doubled data processing throughput
  • Processing of growing data volumes, enabled by the ETL module’s scalability
  • Real-time notifications on system and data state, enabling timely error fixes and eliminating data flaws, system failures, and downtime
  • Accurate data structures, ensuring correct outcomes
  • Smooth integration with the client’s IT ecosystem

Ready to talk about a similar solution for your business?
