Using JupySQL with ClickHouse
In this guide we'll show an integration with ClickHouse.
We will use Jupysql to run queries on top of ClickHouse. Once the data is loaded, we'll visualize it via SQL plotting.
The integration between Jupysql and ClickHouse is made possible by the use of the clickhouse_sqlalchemy library. This library allows for easy communication between the two systems, and enables users to connect to ClickHouse and pass the SQL dialect. Once connected, users can run SQL queries directly from the Clickhouse native UI, or from the Jupyter notebook directly.
# Install required packages
%pip install --quiet jupysql clickhouse_sqlalchemy
Note: you may need to restart the kernel to use updated packages.
import pandas as pd
from sklearn_evaluation import plot
# Import jupysql Jupyter extension to create SQL cells
%load_ext sql
%config SqlMagic.autocommit=False
You'd need to make sure your Clickhouse is up and reachable for the next stages. You can use either the local or the cloud version.
Note: you will need to adjust the connection string according to the instance type you're trying to connect to (url, user, password). In the example below we've used a local instance. To learn more about it, check out this guide.
%sql clickhouse://default:@localhost:8123/default
%%sql
CREATE TABLE trips
(
`trip_id` UInt32,
`vendor_id` Enum8('1' = 1, '2' = 2, '3' = 3, '4' = 4, 'CMT' = 5, 'VTS' = 6, 'DDS' = 7, 'B02512' = 10, 'B02598' = 11, 'B02617' = 12, 'B02682' = 13, 'B02764' = 14, '' = 15),
`pickup_date` Date,
`pickup_datetime` DateTime,
`dropoff_date` Date,
`dropoff_datetime` DateTime,
`store_and_fwd_flag` UInt8,
`rate_code_id` UInt8,
`pickup_longitude` Float64,
`pickup_latitude` Float64,
`dropoff_longitude` Float64,
`dropoff_latitude` Float64,
`passenger_count` UInt8,
`trip_distance` Float64,
`fare_amount` Float32,
`extra` Float32,
`mta_tax` Float32,
`tip_amount` Float32,
`tolls_amount` Float32,
`ehail_fee` Float32,
`improvement_surcharge` Float32,
`total_amount` Float32,
`payment_type` Enum8('UNK' = 0, 'CSH' = 1, 'CRE' = 2, 'NOC' = 3, 'DIS' = 4),
`trip_type` UInt8,
`pickup` FixedString(25),
`dropoff` FixedString(25),
`cab_type` Enum8('yellow' = 1, 'green' = 2, 'uber' = 3),
`pickup_nyct2010_gid` Int8,
`pickup_ctlabel` Float32,
`pickup_borocode` Int8,
`pickup_ct2010` String,
`pickup_boroct2010` String,
`pickup_cdeligibil` String,
`pickup_ntacode` FixedString(4),
`pickup_ntaname` String,
`pickup_puma` UInt16,
`dropoff_nyct2010_gid` UInt8,
`dropoff_ctlabel` Float32,
`dropoff_borocode` UInt8,
`dropoff_ct2010` String,
`dropoff_boroct2010` String,
`dropoff_cdeligibil` String,
`dropoff_ntacode` FixedString(4),
`dropoff_ntaname` String,
`dropoff_puma` UInt16
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(pickup_date)
ORDER BY pickup_datetime;
* clickhouse://default:***@localhost:8123/default
Done.
%%sql
INSERT INTO trips
SELECT * FROM s3(
'https://datasets-documentation.s3.eu-west-3.amazonaws.com/nyc-taxi/trips_{1..2}.gz',
'TabSeparatedWithNames', "
`trip_id` UInt32,
`vendor_id` Enum8('1' = 1, '2' = 2, '3' = 3, '4' = 4, 'CMT' = 5, 'VTS' = 6, 'DDS' = 7, 'B02512' = 10, 'B02598' = 11, 'B02617' = 12, 'B02682' = 13, 'B02764' = 14, '' = 15),
`pickup_date` Date,
`pickup_datetime` DateTime,
`dropoff_date` Date,
`dropoff_datetime` DateTime,
`store_and_fwd_flag` UInt8,
`rate_code_id` UInt8,
`pickup_longitude` Float64,
`pickup_latitude` Float64,
`dropoff_longitude` Float64,
`dropoff_latitude` Float64,
`passenger_count` UInt8,
`trip_distance` Float64,
`fare_amount` Float32,
`extra` Float32,
`mta_tax` Float32,
`tip_amount` Float32,
`tolls_amount` Float32,
`ehail_fee` Float32,
`improvement_surcharge` Float32,
`total_amount` Float32,
`payment_type` Enum8('UNK' = 0, 'CSH' = 1, 'CRE' = 2, 'NOC' = 3, 'DIS' = 4),
`trip_type` UInt8,
`pickup` FixedString(25),
`dropoff` FixedString(25),
`cab_type` Enum8('yellow' = 1, 'green' = 2, 'uber' = 3),
`pickup_nyct2010_gid` Int8,
`pickup_ctlabel` Float32,
`pickup_borocode` Int8,
`pickup_ct2010` String,
`pickup_boroct2010` String,
`pickup_cdeligibil` String,
`pickup_ntacode` FixedString(4),
`pickup_ntaname` String,
`pickup_puma` UInt16,
`dropoff_nyct2010_gid` UInt8,
`dropoff_ctlabel` Float32,
`dropoff_borocode` UInt8,
`dropoff_ct2010` String,
`dropoff_boroct2010` String,
`dropoff_cdeligibil` String,
`dropoff_ntacode` FixedString(4),
`dropoff_ntaname` String,
`dropoff_puma` UInt16
") SETTINGS input_format_try_infer_datetimes = 0
* clickhouse://default:***@localhost:8123/default
Done.
%sql SELECT count() FROM trips limit 5;
* clickhouse://default:***@localhost:8123/default
Done.
count() |
---|
1999657 |
%sql SELECT DISTINCT(pickup_ntaname) FROM trips limit 5;
* clickhouse://default:***@localhost:8123/default
Done.
pickup_ntaname |
---|
Morningside Heights |
Hudson Yards-Chelsea-Flatiron-Union Square |
Midtown-Midtown South |
SoHo-TriBeCa-Civic Center-Little Italy |
Murray Hill-Kips Bay |
%sql SELECT round(avg(tip_amount), 2) FROM trips
* clickhouse://default:***@localhost:8123/default
Done.
round(avg(tip_amount), 2) |
---|
1.68 |
%%sql
SELECT
passenger_count,
ceil(avg(total_amount),2) AS average_total_amount
FROM trips
GROUP BY passenger_count
* clickhouse://default:***@localhost:8123/default
Done.
passenger_count | average_total_amount |
---|---|
0 | 22.69 |
1 | 15.97 |
2 | 17.15 |
3 | 16.76 |
4 | 17.33 |
5 | 16.35 |
6 | 16.04 |
7 | 59.8 |
8 | 36.41 |
9 | 9.81 |
%%sql
SELECT
pickup_date,
pickup_ntaname,
SUM(1) AS number_of_trips
FROM trips
GROUP BY pickup_date, pickup_ntaname
ORDER BY pickup_date ASC
limit 5;
* clickhouse://default:***@localhost:8123/default
Done.
pickup_date | pickup_ntaname | number_of_trips |
---|---|---|
2015-07-01 | Bushwick North | 2 |
2015-07-01 | Brighton Beach | 1 |
2015-07-01 | Briarwood-Jamaica Hills | 3 |
2015-07-01 | Williamsburg | 1 |
2015-07-01 | Queensbridge-Ravenswood-Long Island City | 9 |
# %sql DESCRIBE trips;
# %sql SELECT DISTINCT(trip_distance) FROM trips limit 50;
%%sql --save short-trips --no-execute
SELECT *
FROM trips
WHERE trip_distance < 6.3
* clickhouse://default:***@localhost:8123/default
Skipping execution...
%sqlplot histogram --table short-trips --column trip_distance --bins 10 --with short-trips
<AxesSubplot: title={'center': "'trip_distance' from 'short-trips'"}, xlabel='trip_distance', ylabel='Count'>
ax = %sqlplot histogram --table short-trips --column trip_distance --bins 50 --with short-trips
ax.grid()
ax.set_title("Trip distance from trips < 6.3")
_ = ax.set_xlabel("Trip distance")