Working with Avro, Arrow, and ORC data in ClickHouse
Apache has released multiple data formats actively used in analytics environments, including the popular Avro, Arrow, and Orc. ClickHouse supports importing and exporting data using any from that list.
Importing and exporting in Avro format
ClickHouse supports reading and writing Apache Avro data files, which are widely used in Hadoop systems.
To import from an avro file, we should use Avro format in the INSERT
statement:
INSERT INTO sometable
FROM INFILE 'data.avro'
FORMAT Avro
With the file() function, we can also explore Avro files before actually importing data:
SELECT path, hits
FROM file('data.avro', Avro)
ORDER BY hits DESC
LIMIT 5;
┌─path────────────┬──hits─┐
│ Amy_Poehler │ 62732 │
│ Adam_Goldberg │ 42338 │
│ Aaron_Spelling │ 25128 │
│ Absence_seizure │ 18152 │
│ Ammon_Bundy │ 11890 │
└─────────────────┴───────┘
To export to Avro file:
SELECT * FROM sometable
INTO OUTFILE 'export.avro'
FORMAT Avro;
Avro and ClickHouse data types
Consider data types matching when importing or exporting Avro files. Use explicit type casting to convert when loading data from Avro files:
SELECT
date,
toDate(date)
FROM file('data.avro', Avro)
LIMIT 3;
┌──date─┬─toDate(date)─┐
│ 16556 │ 2015-05-01 │
│ 16556 │ 2015-05-01 │
│ 16556 │ 2015-05-01 │
└───────┴──────────────┘
Avro messages in Kafka
When Kafka messages use Avro format, ClickHouse can read such streams using AvroConfluent format and Kafka engine:
CREATE TABLE some_topic_stream
(
field1 UInt32,
field2 String
)
ENGINE = Kafka() SETTINGS
kafka_broker_list = 'localhost',
kafka_topic_list = 'some_topic',
kafka_group_name = 'some_group',
kafka_format = 'AvroConfluent';
Working with Arrow format
Another columnar format is Apache Arrow, also supported by ClickHouse for import and export. To import data from an Arrow file, we use the Arrow format:
INSERT INTO sometable
FROM INFILE 'data.arrow'
FORMAT Arrow
Exporting to Arrow file works the same way:
SELECT * FROM sometable
INTO OUTFILE 'export.arrow'
FORMAT Arrow
Also, check data types matching to know if any should be converted manually.
Arrow data streaming
The ArrowStream format can be used to work with Arrow streaming (used for in-memory processing). ClickHouse can read and write Arrow streams.
To demonstrate how ClickHouse can stream Arrow data, let's pipe it to the following python script (it reads input stream in Arrow streaming format and outputs the result as a Pandas table):
import sys, pyarrow as pa
with pa.ipc.open_stream(sys.stdin.buffer) as reader:
print(reader.read_pandas())
Now we can stream data from ClickHouse by piping its output to the script:
clickhouse-client -q "SELECT path, hits FROM some_data LIMIT 3 FORMAT ArrowStream" | python3 arrow.py
path hits
0 b'Akiba_Hebrew_Academy' 241
1 b'Aegithina_tiphia' 34
2 b'1971-72_Utah_Stars_season' 1
ClickHouse can read Arrow streams as well using the same ArrowStream format:
arrow-stream | clickhouse-client -q "INSERT INTO sometable FORMAT ArrowStream"
We've used arrow-stream
as a possible source of Arrow streaming data.
Importing and exporting ORC data
Apache ORC format is a columnar storage format typically used for Hadoop. ClickHouse supports importing as well as exporting Orc data using ORC format:
SELECT *
FROM sometable
INTO OUTFILE 'data.orc'
FORMAT ORC;
INSERT INTO sometable
FROM INFILE 'data.orc'
FORMAT ORC;
Also, check data types matching as well as additional settings to tune export and import.
Further reading
ClickHouse introduces support for many formats, both text, and binary, to cover various scenarios and platforms. Explore more formats and ways to work with them in the following articles:
And also check clickhouse-local - a portable full-featured tool to work on local/remote files without the need for Clickhouse server.