Alibaba Cloud DataWorks – Data Integration
Alibaba Cloud Data Integration is a data synchronization platform that provides stable, efficient, and elastically scalable services. It is designed to migrate and synchronize data quickly and reliably between heterogeneous data sources in complex network environments.
Offline (batch) data synchronization
The offline (batch) data channel defines the source and target databases and datasets, and provides a set of abstract data extraction plug-ins (Readers) and data writing plug-ins (Writers). On top of this framework, it defines a simplified intermediate data transmission format, so that data can be transferred between any structured and semi-structured data sources.
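The Reader/Writer split described above can be illustrated with a minimal Python sketch. Everything here is hypothetical and only mirrors the idea: the `Record` type stands in for the intermediate transmission format, and `ListReader`, `CsvLineWriter`, and `sync` are illustrative names, not part of Data Integration.

```python
from typing import Any, Dict, Iterable, Iterator, List, Protocol

# Hypothetical intermediate format: one record is an ordered list of column values.
Record = List[Any]


class Reader(Protocol):
    """Extraction plug-in: turns a source into a stream of records."""

    def read(self) -> Iterator[Record]: ...


class Writer(Protocol):
    """Writing plug-in: consumes records and returns how many it wrote."""

    def write(self, records: Iterable[Record]) -> int: ...


class ListReader:
    """Reads in-memory dict rows and projects them onto a fixed column order."""

    def __init__(self, rows: List[Dict[str, Any]], columns: List[str]) -> None:
        self.rows = rows
        self.columns = columns

    def read(self) -> Iterator[Record]:
        for row in self.rows:
            # Missing columns become None so every record has the same width.
            yield [row.get(col) for col in self.columns]


class CsvLineWriter:
    """Writes each record as a comma-separated line into a list (a stand-in target)."""

    def __init__(self) -> None:
        self.lines: List[str] = []

    def write(self, records: Iterable[Record]) -> int:
        count = 0
        for record in records:
            self.lines.append(",".join("" if v is None else str(v) for v in record))
            count += 1
        return count


def sync(reader: Reader, writer: Writer) -> int:
    """Moves every record from the reader to the writer; returns the record count."""
    return writer.write(reader.read())
```

Because both sides only agree on the record format, any reader in this sketch can be paired with any writer, which is the property the plug-in design aims for.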
Supported data source types
Data Integration supports diverse data sources, including:
- Text storage (FTP, SFTP, OSS, multimedia files)
- Database (RDS, DRDS, MySQL, PostgreSQL)
- NoSQL (Memcache, Redis, MongoDB, HBase)
- Big data (MaxCompute, AnalyticDB, HDFS)
- MPP database (HybridDB for MySQL)
Data Integration is the data synchronization platform that Alibaba Group provides to external users. It offers offline (batch) data access channels for Alibaba Cloud’s big data computing engines, including MaxCompute, AnalyticDB for MySQL 2.0, and Object Storage Service (OSS).
The following table lists the data source types that Data Integration supports:
Data source category | Data source type | Extraction (reader) | Import (writer) | Supported methods | Supported types |
---|---|---|---|---|---|
Relational databases | MySQL | Yes | Yes | Wizard and script | Alibaba Cloud and on-premise |
Relational databases | SQL Server | Yes | Yes | Wizard and script | Alibaba Cloud and on-premise |
Relational databases | PostgreSQL | Yes | Yes | Wizard and script | Alibaba Cloud and on-premise |
Relational databases | Oracle | Yes | Yes | Wizard and script | On-premise |
Relational databases | DRDS | Yes | Yes | Wizard and script | Alibaba Cloud |
Relational databases | DB2 | Yes | Yes | Script | On-premise |
Relational databases | DM | Yes | Yes | Script | On-premise |
Relational databases | RDS for PPAS | Yes | Yes | Script | Alibaba Cloud |
MPP | HybridDB for MySQL | Yes | Yes | Wizard and script | Alibaba Cloud |
MPP | HybridDB for PostgreSQL | Yes | Yes | Wizard and script | Alibaba Cloud |
Big data storage | MaxCompute (corresponding data source name: MaxCompute) | Yes | Yes | Wizard and script | Alibaba Cloud |
Big data storage | DataHub | No | Yes | Script | Alibaba Cloud |
Big data storage | Elasticsearch | No | Yes | Script | Alibaba Cloud |
Big data storage | AnalyticDB for MySQL 2.0 | Yes | Yes | Wizard and script | Alibaba Cloud |
Unstructured storage | OSS | Yes | Yes | Wizard and script | Alibaba Cloud |
Unstructured storage | HDFS | Yes | Yes | Script | On-premise |
Unstructured storage | FTP | Yes | Yes | Wizard and script | On-premise |
Message queue | LogHub | Yes | Yes | Wizard and script | Alibaba Cloud |
NoSQL | HBase | Yes | Yes | Script | Alibaba Cloud and on-premise |
NoSQL | MongoDB | Yes | Yes | Script | Alibaba Cloud and on-premise |
NoSQL | Memcache | No | Yes | Script | Alibaba Cloud and on-premise |
NoSQL | Table Store (corresponding data source name: OTS) | Yes | Yes | Script | Alibaba Cloud |
NoSQL | OpenSearch | No | Yes | Script | Alibaba Cloud |
NoSQL | Redis | No | Yes | Script | Alibaba Cloud and on-premise |
Performance testing | Stream | Yes | Yes | Script | – |
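One way to read the table: a sync task needs a Reader for the source and a Writer for the target. The sketch below encodes a few rows of the table in a hypothetical `SUPPORT` dictionary and checks a source/target pair. The names `SUPPORT` and `can_sync` are illustrative only and are not part of any Data Integration API.

```python
from typing import Dict, Tuple

# (reader supported, writer supported), copied from a few rows of the table.
SUPPORT: Dict[str, Tuple[bool, bool]] = {
    "MySQL": (True, True),
    "MaxCompute": (True, True),
    "OSS": (True, True),
    "DataHub": (False, True),
    "Redis": (False, True),
    "OpenSearch": (False, True),
}


def can_sync(source: str, target: str) -> bool:
    """Returns True when the source has a Reader and the target has a Writer."""
    for name in (source, target):
        if name not in SUPPORT:
            raise KeyError(f"data source type not listed in this sketch: {name}")
    source_readable = SUPPORT[source][0]
    target_writable = SUPPORT[target][1]
    return source_readable and target_writable
```

For example, `can_sync("MySQL", "Redis")` holds because Redis has a Writer, while `can_sync("Redis", "MySQL")` does not because Redis has no Reader.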
Sync task development
Sync task development provides both wizard and script modes.
- Wizard: Provides a visual development guide that walks you through the details of configuring a data sync task. This mode has a low learning cost, but lacks certain advanced functions.
- Script: Allows you to write the data sync JSON script directly. This mode has a higher learning cost and is suited to advanced users, but it provides more diverse and flexible functions for fine-grained configuration management.
Note
- The code generated in wizard mode can be converted to script mode code. The conversion is one-way: script mode code cannot be converted back to wizard mode, because the script mode capabilities are a superset of the wizard mode capabilities.
- Always configure the data source and create the target table before writing code.
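To give a feel for what a data sync JSON script looks like, here is a minimal sketch that builds one in Python. It assumes the job layout of the open-source DataX project (`job.setting`, `job.content[].reader`, `job.content[].writer`) with its `streamreader` and `streamwriter` plug-ins, which correspond to the Stream performance-testing type. The keys used by DataWorks script mode may differ, so treat the field names as assumptions rather than the product's schema; `build_stream_job` is a hypothetical helper.

```python
import json
from typing import Any, Dict


def build_stream_job(record_count: int, channels: int = 1) -> Dict[str, Any]:
    """Builds a DataX-style job that generates test records and prints them."""
    return {
        "job": {
            # Assumed DataX setting: number of concurrent channels.
            "setting": {"speed": {"channel": channels}},
            "content": [
                {
                    "reader": {
                        "name": "streamreader",
                        "parameter": {
                            # Each generated record has these constant columns.
                            "column": [
                                {"type": "string", "value": "hello"},
                                {"type": "long", "value": "42"},
                            ],
                            "sliceRecordCount": record_count,
                        },
                    },
                    "writer": {
                        "name": "streamwriter",
                        "parameter": {"print": True, "encoding": "UTF-8"},
                    },
                }
            ],
        }
    }


# Serialize the job to the JSON text that would be saved as the script.
script = json.dumps(build_stream_job(record_count=10), indent=2)
```

Generating the script from code rather than writing it by hand is optional. It mainly helps when many similar tasks differ only in a few parameters.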
Description of network types
Networks are classified as classic network, VPC network, and local IDC network (planned).
- Classic network: A network deployed centrally on Alibaba Cloud's public infrastructure, and planned and managed by Alibaba Cloud. This network type suits customers who prioritize ease of use.
- VPC network: An isolated network environment created on Alibaba Cloud. In this network type, you have full control over the virtual network, including customizing the IP address range, partitioning network segments, and configuring routing tables and gateways.
- Local IDC network: The network environment of your server room, which is isolated from the Alibaba Cloud network.
Note
- Public network access is supported. For public network access, select the classic network as the network type. Note the public network bandwidth and the related network traffic charges when using this network type. We do not recommend this configuration except in special cases.
- Network connectivity for data synchronization is still being planned. In the meantime, you can transfer data by using a locally added resource with script mode, or by using the Shell + DataX scheme.
- A Virtual Private Cloud (VPC) is an isolated network environment in which you can customize the IP address range, network segments, and gateways. For RDS for MySQL, RDS for SQL Server, and RDS for PostgreSQL data sources in a VPC, Data Integration does not require you to purchase an extra ECS instance on the same network as the VPC. Instead, the system ensures connectivity automatically through a reverse proxy. Other Alibaba Cloud databases in a VPC are also supported, including PPAS, OceanBase, Redis, MongoDB, Memcache, Table Store, and HBase. For any non-RDS data source, an ECS instance on the same network is required to configure Data Integration sync tasks on the VPC network and ensure connectivity.
Limits
- Supports synchronization of structured data (such as RDS and DRDS), semi-structured data, and unstructured data (such as OSS and TXT), provided that the data to be synchronized can be abstracted as structured data. That is, Data Integration can transmit data that can be abstracted to a logical two-dimensional table. Fully unstructured data, such as an MP3 file stored in OSS, is not supported. Synchronizing such data to MaxCompute is still in development.
- Supports data synchronization and exchange within a single region and across regions. For certain regions, cross-region data transmission is supported over the classic network, but connectivity is not guaranteed. If you need this function and testing shows that the classic network is not connected, consider using a public network connection instead.
- Only data synchronization (transmission) is performed. Data Integration does not provide a way to consume data streams.
Summary
In this blog, you’ve learned a bit more about Alibaba Cloud DataWorks Data Integration and how its features can help kickstart your data processing and analytics workflow.