MySQL 到PostgreSQL 的数据迁移工具
2019獨角獸企業重金招聘Python工程師標準>>>
pg_chameleon is available on pypi for download
python 包下載地址
The documentation is available on pythonhosted
文檔鏈接:
Discussion forum available on googlegroups
google group討論
Please submit your bug reports on GitHub.
發現問題請報告地址
Platform and versions
平臺以及版本
The tool is developed using Linux Slackware 14.2.Is currently tested with python 2.7 and python 3.6. 這個工具我們是在Linux Slackware 14.2上開發,已經在Python2.7和Python2.6上測試
The tool is developed using FreeBSD as database server with 該工具的數據庫測試環境是FreeBSD 服務,同時數據為:
- MySQL: 5.6
- PostgreSQL: 9.5
Possible applications
應用場景
-
分析 (數據挖掘,數據倉庫)
-
遷移
-
數據聚合, 數據源為多個MySQL 數據庫
-
Analytics
-
Migrations
-
Data aggregation from multiple MySQL databases
Features
特性
- Read the schema and data from MySQL and restore it into a target PostgreSQL schema
- 數據MySQL 數據庫的數據模式,和存儲數據,在PostgreSQL 上 還原成相應的目標模式
-
- Setup PostgreSQL to act as a MySQL slave
- 將PostgreSQL扮演成一個MySQL從庫
- Basic DDL Support (CREATE/DROP/ALTER TABLE, DROP PRIMARY KEY/TRUNCATE)
- 基本的DDL支持(CREATE/DROP/ALTER TABLE,DROP PRIMARY KEY/TRUNCATE)
- Discards of rubbish data which is saved in the table sch_chameleon.t_discarded_rows
- Replica from multiple MySQL schema or servers
- Basic replica monitoring
- Detach replica from MySQL
Requirements
安裝需知
Python: Linux操作系統 , CPython2.7/3.3 (CPython是指官方發布的那個Python,非其他衍生)
Python: CPython 2.7/3.3+ on Linux
MySQL: 5.5+
PostgreSQL: 9.5+
需要安裝的拓展庫地址
- PyMySQL
- argparse
- mysql-replication
- psycopg2
- PyYAML
- tabulate
Optionals for building documentation
文編編譯需要的工具,該工具為可選項
- sphinx
- sphinx-autobuild
Caveats
警告
The replica requires the tables to have a primary key. Tables without primary key are initialised during the init_replica process but the replicadoesn't update them. 這個環境需要數據表擁有主鍵,沒有主鍵的數據表會在 init_replica 的過程中初始化,但是replica 不會更新
Multiple replica sources are supported. However is required a separate process for each replica. Each replica must have a unique destination schema in PostgreSQL.
支持多個復制源,但是,每一份復制需要一個單獨的進程,每個復制在PostgreSQL中的映射是唯一的 The copy_max_memory is just an estimate. The average rows size is extracted from mysql's informations schema and can be outdated.If the copy process fails for memory error check the failing table's row length and the number of rows for each slice.
copy_max_memory只是一個估計,平均行大小是從MySQL的信息架構中提取出來的,并且可能已經過時了,如果復制過程由于內存錯誤而失敗,請檢查失敗的表的行長度和每個slice的行數. 當復制為空時,捕獲DDL或MySQL切換到另一個日志段(ROTATE EVENT),都會有這樣的情況.
The batch is processed every time the replica stream is empty, when a DDL is captured or when the MySQL switches to another log segment (ROTATE EVENT).Therefore the replica_batch_size is the limit for when a write happens in PostgreSQL. The parameter controls also the size of the batch replayed by pg_engine.process_batch.
The current implementation is sequential. 目前的實現是順序的(sequential). Read the replica -> Store the rows -> Replays the stored rows. 讀取復制-> 存儲行-> 重演存儲的行.
2.0的版本會改進這個地方
支持Python3.0 但是MySQL的復制,需要Python3的版本在3.3以上
使用最后接受到的事件事件戳和PostgreSQL事件戳確定滯后信息.如果MySQL是只讀的,延遲會增加,由于沒有復制信息會產生.
detach復制進程在PostgreSQL中重演這些序列,使得PostgreSQL可以獨立工作
MySQL源中的外鍵會被提取,但是會創建為無效的,外鍵在沒有ON DELETE和ON UPDATE的語句中創建。第二次會長時創建外鍵,如果發生錯誤,會日志記錄,由于源的配置.
The version 2.0 will improve this aspect.
Python 3 is supported but only from version 3.3 as required by mysql-replication .
The lag is determined using the last received event timestamp and the postgresql timestamp. If the mysql is read only the lag will increase becauseno replica event is coming in.
The detach replica process resets the sequences in postgres to let the database work standalone.
The foreign keys from the source MySQL schema are extracted and created initially as NOT VALID. The foreign keys are created without the ON DELETE or ON UPDATE clauses.A second run tries to validate the foreign keys. If an error occurs it gets logged out according to the source configuration.
Quick Setup
- Create a virtual environment (e.g. python3 -m venv venv)
- Activate the virtual environment (e.g. source venv/bin/activate)
- Install pgchameleon with pip install pg_chameleon. If you get an error upgrade your pip first.
- Create a user on mysql for the replica (e.g. usr_replica)
- Grant access to usr on the replicated database (e.g. GRANT ALL ON sakila.* TO 'usr_replica';)
- Grant RELOAD privilege to the user (e.g. GRANT RELOAD ON . to 'usr_replica';)
- Grant REPLICATION CLIENT privilege to the user (e.g. GRANT REPLICATION CLIENT ON . to 'usr_replica';)
- Grant REPLICATION SLAVE privilege to the user (e.g. GRANT REPLICATION SLAVE ON . to 'usr_replica';)
Configuration parameters
The system wide install is now supported correctly.
The first time chameleon.py is executed it creates a configuration directory in $HOME/.pg_chameleon.Inside the directory there are two subdirectories.
- config is where the configuration files live. Use config-example.yaml as template for the other configuration files. Please note the logs and pid directories with relative path will no longer work. The you should either use an absolute path or provide the home alias. Again, check the config-example.yaml for an example.
- pid is where the replica pid file is created. it can be changed in the configuration file
- logs is where the replica logs are saved if log_dest is file. It can be changed in the configuration file
The file config-example.yaml is stored in ~/.pg_chameleon/config and should be used as template for the other configuration files.
do not use config-example.yaml directly. The tool skips this filename as the file gets overwritten when pg_chameleon is upgraded.
Is it possible to have multiple configuration files for configuring the replica from multiple source databases. It's compulsory to chose different destination schemas on postgresql.
Each source requires to be started in a separate process (e.g. a cron entry).
The configuration file is a yaml file. Each parameter controls theway the program acts.
- my_server_id the server id for the mysql replica. must be unique among the replica cluster.
- copy_max_memory the max amount of memory to use when copying the table in PostgreSQL. Is possible to specify the value in (k)ilobytes, (M)egabytes, (G)igabytes adding the suffix (e.g. 300M).
- my_database mysql database to replicate. a schema with the same name will be initialised in the postgres database.
- pg_database destination database in PostgreSQL.
- copy_mode the allowed values are 'file' and 'direct'. With direct the copy happens on the fly. With file the table is first dumped in a csv file then reloaded in PostgreSQL.
- hexify is a yaml list with the data types that require coversion in hex (e.g. blob, binary). The conversion happens on the copy and on the replica.
- log_dir directory where the logs are stored.
- log_level logging verbosity. allowed values are debug, info, warning, error.
- log_dest log destination. stdout for debugging purposes, file for the normal activity.
- my_charset mysql charset for the copy. Please note the replica library read is always in utf8.
- pg_charset PostgreSQL connection's charset.
- tables_limit yaml list with the tables to replicate. If the list is empty then the entire mysql database is replicated.
- sleep_loop seconds between a two replica batches.
- pause_on_reindex determines whether to pause the replica if a reindex process is found in pg_stat_activity
- sleep_on_reindex seconds to sleep when a reindex process is found
- reindex_app_names lists the application names to check for reindex (e.g. reindexdb). This is a workaround which required for keeping the replication user unprivileged.
- source_name this must be unique along the list of sources. The tool detects if there's a duplicate when registering a new source
- dest_schema this is also a unique value. once the source is registered the dest_schema can't be changed anymore
- log_days_keep: specifies the amount how many days to keep the logs which are rotated automatically on a daily basis
- batch_retention the max retention for the replayed batches rows in t_replica_batch. The field accepts any valid interval accepted by PostgreSQL
- out_dir the directory where the csv files are dumped during the init_replica process if the copy mode is file
Reindex detection example setup
#Pause the replica for the given amount of seconds if a reindex process is found pause_on_reindex: Yes sleep_on_reindex: 30#list the application names which are supposed to reindex the database reindex_app_names: - 'reindexdb' - 'my_custom_reindex'MySQL connection parameters
mysql_conn:host: localhostport: 3306user: replication_usernamepasswd: never_commit_passwordsPostgreSQL connection parameters
pg_conn:host: localhostport: 5432user: replication_usernamepassword: never_commit_passwordsUsage
使用
The script chameleon.py requires one of the following commands.
- drop_schema Drops the service schema sch_chameleon with cascade option.
- create_schema Create the service schema sch_chameleon.
- upgrade_schema Upgrade an existing schema sch_chameleon to an newer version.
- init_replica Create the table structure from the mysql into a PostgreSQL schema with the same mysql's database name. The mysql tables are locked in read only mode and the data is copied into the PostgreSQL database. The master's coordinates are stored in the PostgreSQL service schema. The command drops and recreate the service schema.
- start_replica Starts the replication from mysql to PostgreSQL using the master data stored in sch_chameleon.t_replica_batch. The master's position is updated time a new batch is processed. The command upgrade the service schema if required.
- list_config List the available configurations and their status ('ready', 'initialising','initialised','stopped','running')
- add_source register a new configuration file as source
- drop_source remove the configuration from the registered sources
- stop_replica ends the replica process gracefully
- disable_replica ends the replica process and disable the restart
- enable_replica enable the replica process
- sync_replica sync the data between mysql and postgresql without dropping the tables
- show_status displays the replication status for each source, with the lag in seconds and the last received event
- detach_replica stops the replica stream, discards the replica setup and resets the sequences in PostgreSQL to work as a standalone db.
the optional command --config followed by the configuration file name, without the yaml suffix, allow to specify different configurations.If omitted the configuration defaults to default.
Example
Create a virtualenv and activate it
python3 -m venv venv source venv/bin/activateInstall pg_chameleon
pip install pg_chameleonRun the script in order to create the configuration directory.
chameleon.pycd in ~/.pg_chameleon/config and copy the configuration-example.yaml to default.yaml. Please note this is the default configuration and can be omitted when executing the chameleon.py script.
In MySQL create a user for the replica.
CREATE USER usr_replica ; SET PASSWORD FOR usr_replica=PASSWORD('replica'); GRANT ALL ON sakila.* TO 'usr_replica'; GRANT RELOAD ON *.* to 'usr_replica'; GRANT REPLICATION CLIENT ON *.* to 'usr_replica'; GRANT REPLICATION SLAVE ON *.* to 'usr_replica'; FLUSH PRIVILEGES;Add the configuration for the replica to my.cnf (requires mysql restart)
binlog_format= ROW binlog_row_image=FULL log-bin = mysql-bin server-id = 1If you are using a cascading replica configuration ensure the parameter log_slave_updates is set to ON.
log_slave_updates= ONIn PostgreSQL create a user for the replica and a database owned by the user
CREATE USER usr_replica WITH PASSWORD 'replica'; CREATE DATABASE db_replica WITH OWNER usr_replica;Check you can connect to both databases from the replication system.
For MySQL
mysql -p -h derpy -u usr_replica sakila Enter password: Reading table information for completion of table and column names You can turn off this feature to get a quicker startup with -AWelcome to the MySQL monitor. Commands end with ; or \g. Your MySQL connection id is 116 Server version: 5.6.30-log Source distributionCopyright (c) 2000, 2016, Oracle and/or its affiliates. All rights reserved.Oracle is a registered trademark of Oracle Corporation and/or its affiliates. Other names may be trademarks of their respective owners.Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.mysql>For PostgreSQL
psql -h derpy -U usr_replica db_replica Password for user usr_replica: psql (9.5.5) Type "help" for help. db_replica=>Setup the connection parameters in default.yaml
--- #global settings my_server_id: 100 replica_batch_size: 1000 my_database: sakila pg_database: db_replica#mysql connection's charset. my_charset: 'utf8' pg_charset: 'utf8'#include tables only tables_limit:#mysql slave setup mysql_conn:host: derpyport: 3306user: usr_replicapasswd: replica#postgres connection pg_conn:host: derpyport: 5432user: usr_replicapassword: replicaInitialise the schema and the replica with
chameleon.py create_schema chameleon.py add_source --config default chameleon.py init_replica --config defaultStart the replica with
chameleon.py start_replica --config defaultDetaching the replica from MySQL
chameleon.py detach_replica --config default轉載于:https://my.oschina.net/innovation/blog/903312
總結
以上是生活随笔為你收集整理的MySQL 到PostgreSQL 的数据迁移工具的全部內容,希望文章能夠幫你解決所遇到的問題。
- 上一篇: 腾讯邓君:《王者荣耀》翻过的同步技术相关
- 下一篇: Linux学习第二步(Java环境安装)