Our client is a leading US-based subscription management technology provider that serves top-10 US news media companies and manages over one million subscribers annually.
The client provides ready-to-use software solutions for subscription-based businesses, enabling end-to-end subscription management and analytics and facilitating subscriber acquisition, engagement, retention, and more.
Previously, the client relied on a non-scalable system for collecting and processing subscription data, which led to data corruption, bugs, and malfunctions. Lightpoint’s goal was to develop an ETL (extract, transform, load) module linking the external data sources of the client’s customers (subscription-based publishing businesses) to the client’s target system.
The ETL module had to regularly collect subscription data from the client’s customers, process it in specific ways, and upload it to the client’s target system, where the data would be used for subscription management, analytics, and reporting. Scalability, real-time system notifications, and timely data processing were among the major requirements for the final solution.
We opted for batch processing, handling the collected data at 24-hour intervals. The source data from multiple customers was stored on an FTP server in CSV format, though the data structure was not unified across customers. The data encompassed:
- Subscription data (subscription type, user and their personal details, etc.)
- Transactional data (user actions related to the subscription: payments, vacation holds, complaints, cancellations, etc.)
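The production module was built on .NET and SSIS, but the core idea of unifying per-customer CSV structures can be sketched in Python. Everything here is illustrative: the customer names, column mappings, and field names are assumptions, not the client’s actual schema.

```python
import csv
import io

# Hypothetical per-customer column mappings: each customer exports CSV
# with its own headers, which we remap onto one canonical schema.
COLUMN_MAPS = {
    "customer_a": {"sub_type": "subscription_type", "user_email": "email"},
    "customer_b": {"plan": "subscription_type", "e-mail": "email"},
}

def normalize_rows(customer_id, csv_text):
    """Yield rows remapped to the canonical field names."""
    mapping = COLUMN_MAPS[customer_id]
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        yield {canonical: row[source] for source, canonical in mapping.items()}

rows = list(normalize_rows("customer_b", "plan,e-mail\npremium,a@b.com\n"))
```

With this kind of mapping table, adding a new customer means adding one dictionary entry rather than writing a new parser.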
Two major goals of data ingestion were:
- Gathering data for managing subscriptions and serving subscribers.
- Aggregating data for analytics and reporting.
The initial raw data from the FTP server was loaded into a data lake, where it was standardized and transformed into structures suited to each of the goals above.
From the data lake, data was moved (via SSIS packages) to a working database that supported the subscription process and handled events such as payments, onboarding, complaints, cancellations, and subscription parameter changes.
Alongside the working database, a separate analytics database ingested data from the working database, derived insights, and produced a set of reports used for further decision-making.
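As an illustration of the analytics side (not the client’s actual reporting code), a simple aggregation of transactional events by type shows the kind of derivation such a database performs:

```python
from collections import Counter

# Illustrative sketch: the analytics database aggregated transactional
# data into reports; here we count subscription events by type.
def event_report(events):
    """Return the number of occurrences of each event type."""
    return Counter(e["type"] for e in events)

report = event_report([
    {"type": "payment"},
    {"type": "payment"},
    {"type": "cancellation"},
])
```

In the real system, equivalent aggregations ran in T-SQL against the analytics database rather than in application code.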
To ensure the ETL pipeline ran accurately and without delays, we set up scheduled jobs that regularly checked the FTP server; downloaded, parsed, and loaded data files; and transformed and loaded data into the working and analytics databases.
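A minimal sketch of the regular-job idea follows, assuming a fixed interval; the real solution used scheduled jobs in the Microsoft stack, and the step functions here are stand-ins:

```python
import time

# Minimal sketch of a fixed-interval batch scheduler. The steps list
# would hold functions such as "check FTP", "parse files", "load data".
def run_batch(steps):
    """Run each pipeline step in order."""
    for step in steps:
        step()

def scheduler_loop(steps, interval_seconds, max_runs=None):
    """Invoke the batch at a fixed interval; max_runs bounds the loop."""
    runs = 0
    while max_runs is None or runs < max_runs:
        run_batch(steps)
        runs += 1
        if max_runs is None or runs < max_runs:
            time.sleep(interval_seconds)
    return runs
```

The `max_runs` parameter exists only so the loop can be exercised in tests; a production scheduler would run indefinitely or be driven by the platform’s job agent.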
The final ETL module helped to resolve the following issues:
Issue 1: The system was unable to load a large volume of data files within a limited time frame.
To address this, we implemented parallel processing (moving data loading and transformation to the middle tier, where they could scale and run in parallel), increased disk I/O speed, and adjusted the loading schedule to reduce concurrency.
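The parallelization idea can be sketched with a worker pool; the client’s middle tier did this in .NET, and `load_file` here is a hypothetical stand-in for downloading, parsing, and loading one file:

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative sketch: process data files concurrently in a worker pool
# instead of loading them one by one.
def load_file(name):
    """Stand-in for downloading, parsing, and loading one data file."""
    return f"loaded:{name}"

def load_all(filenames, max_workers=4):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order in its results
        return list(pool.map(load_file, filenames))

results = load_all(["a.csv", "b.csv", "c.csv"])
```

Because file loads are I/O-bound, a thread pool like this lets several downloads and parses overlap, which is what cut the overall loading time.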
Issue 2: Inconsistent data structures and broken rows within data files, which created a bottleneck for data transformation and processing.
To fix this, we introduced validation algorithms for data structures and files. They detected corrupted data and raised alerts, prompting us to adjust the mapping, ask the client to re-export the file, or correct the data file manually.
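A simplified version of such a validation pass might look like this; the expected column set and the error format are assumptions for illustration:

```python
# Hypothetical validation pass: check each row against the expected
# columns and collect errors for broken rows so an alert can be raised.
EXPECTED_COLUMNS = {"subscription_type", "email"}

def validate_rows(rows):
    """Return (valid_rows, errors); errors describe rows with missing fields."""
    valid, errors = [], []
    for i, row in enumerate(rows):
        missing = EXPECTED_COLUMNS - row.keys()
        if missing:
            errors.append(f"row {i}: missing {sorted(missing)}")
        else:
            valid.append(row)
    return valid, errors

valid, errors = validate_rows([
    {"subscription_type": "basic", "email": "a@b.com"},
    {"subscription_type": "basic"},  # broken row: no email field
])
```

Separating valid rows from errors lets the pipeline keep processing clean data while the flagged rows are routed to alerts for manual follow-up.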
Issue 3: Frequent system failures.
To cope with this, we set up transparent logging of every process step, monitored logs and the overall health of every part of the platform, and configured alerts on critical errors so the team could react immediately and prevent downtime.
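The logging-plus-alerting pattern can be sketched as a wrapper around each pipeline step; `alert_ops` is a hypothetical hook that would send an email, Slack message, or page in a real deployment:

```python
import logging

# Sketch of per-step logging with alerting on failures. The real system
# was .NET-based; this only illustrates the pattern.
logging.basicConfig(level=logging.INFO)
log = logging.getLogger("etl")

def alert_ops(message):
    # Hypothetical alert hook; a real one would notify the on-call team.
    log.critical("ALERT: %s", message)

def run_step(name, fn):
    """Run one pipeline step, logging start/finish and alerting on failure."""
    log.info("step %s: started", name)
    try:
        result = fn()
    except Exception as exc:
        alert_ops(f"step {name} failed: {exc}")
        raise
    log.info("step %s: finished", name)
    return result

value = run_step("parse", lambda: 42)
```

Logging both the start and the finish of every step is what makes failures easy to localize: the last "started" entry without a matching "finished" points at the broken step.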
Team Composition & Project Duration
The team worked on this project for 2 years and included 5 Data Engineers.
Major Tech Stack
.Net, T-SQL, SSIS, MSSQL
Key Results
- Secure and streamlined communication between the external data sources of the client’s customers and the client’s target system
- Doubled data processing throughput
- Scalable ETL module able to process growing data volumes
- Real-time notifications on system and data state, enabling timely error fixes and eliminating data flaws, system failures, and downtime
- Accurate data structures, ensuring correct outcomes
- Smooth integration with the client’s IT ecosystem