5-Minute Quickstart

pymapd follows the Python DB API 2.0 (PEP 249), so experience with other Python database clients will transfer to pymapd.

Note

This tutorial assumes you have an OmniSci server running on localhost:6274 with the default logins and databases, and have loaded the example flights_2008_10k dataset. This dataset can be obtained from the insert_sample_data script included in the OmniSci install directory.

Installing pymapd

pymapd

pymapd can be installed with conda using conda-forge or pip.

# conda
conda install -c conda-forge pymapd

# pip
pip install pymapd
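
To verify the installation, you can import the package and check its version (assuming pymapd exposes a __version__ attribute, as most Python packages do):

>>> import pymapd
>>> pymapd.__version__  # e.g. '0.x.y'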

If you have an NVIDIA GPU in the same machine where your pymapd code will be running, you’ll want to install cudf as well to return result sets into GPU memory as a cudf GPU DataFrame:

cudf via conda

# CUDA 9.2
conda install -c nvidia -c rapidsai -c numba -c conda-forge -c defaults cudf

# CUDA 10.0
conda install -c nvidia/label/cuda10.0 -c rapidsai/label/cuda10.0 -c numba \
    -c conda-forge -c defaults cudf

cudf via PyPI/pip

# CUDA 9.2
pip install cudf-cuda92

# CUDA 10.0
pip install cudf-cuda100

Connecting

Self-Hosted Install

For self-hosted OmniSci installs, use protocol='binary' (the default) to connect to OmniSci, as this performs better than protocol='http' or protocol='https'.

Create a Connection using the connect() method, passing user, password, host, and dbname:

>>> from pymapd import connect
>>> con = connect(user="admin", password="HyperInteractive", host="localhost",
...               dbname="omnisci")
>>> con
Connection(mapd://admin:***@localhost:6274/omnisci?protocol=binary)

Alternatively, you can pass in a SQLAlchemy-compliant connection string to the connect() method:

>>> uri = "mapd://admin:HyperInteractive@localhost:6274/omnisci?protocol=binary"
>>> con = connect(uri=uri)
Connection(mapd://admin:***@localhost:6274/omnisci?protocol=binary)

OmniSci Cloud

When connecting to OmniSci Cloud, the two connection methods shown above work the same way; however, you can only use protocol='https'. For a step-by-step walk-through with screenshots, please see this blog post.
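
A connection to OmniSci Cloud might look like the following sketch; the hostname, port, database name, and credentials below are placeholders, so substitute the values shown in your OmniSci Cloud dashboard:

>>> from pymapd import connect
>>> con = connect(user="API_KEY_NAME", password="API_KEY_SECRET",
...               host="your-instance.mapd.cloud", port=443,
...               dbname="mapd", protocol="https")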

Querying

A few options are available for getting the results of a query into your Python process.

  1. Into GPU Memory via cudf (Connection.select_ipc_gpu())

  2. Into CPU shared memory via Apache Arrow and pandas (Connection.select_ipc())

  3. Into python objects via Apache Thrift (Connection.execute())

The best option depends on the hardware you have available, your connection to the database, and what you plan to do with the returned data. In general, the third method, using Thrift to serialize and deserialize the data, will be slower than the GPU or CPU shared memory methods. The shared memory methods additionally require that the OmniSci database be running on the same machine as your Python process.

Note

We currently support the Timestamp(0|3|6) data types, i.e., seconds, milliseconds, and microseconds granularity. Support for nanoseconds, Timestamp(9), is in progress.

GPU Shared Memory

Use Connection.select_ipc_gpu() to select data into a GpuDataFrame, provided by cudf. To use this method, the Python code must be running on the same machine as the OmniSci installation AND you must have an NVIDIA GPU installed.

>>> query = "SELECT depdelay, arrdelay FROM flights_2008_10k limit 100"
>>> df = con.select_ipc_gpu(query)
>>> df.head()
  depdelay arrdelay
0       -2      -13
1       -1      -13
2       -3        1
3        4       -3
4       12        7

CPU Shared Memory

Use Connection.select_ipc() to select data into a pandas DataFrame using CPU shared memory to avoid unnecessary intermediate copies. To use this method, the Python code must be running on the same machine as the OmniSci installation.

>>> df = con.select_ipc(query)
>>> df.head()
  depdelay arrdelay
0       -2      -13
1       -1      -13
2       -3        1
3        4       -3
4       12        7

pandas.read_sql()

With a Connection defined, you can use pandas.read_sql() to read your data into a pandas DataFrame. This will be slower than using Connection.select_ipc(), but works regardless of where the Python code is running (i.e. select_ipc() must be on the same machine as the OmniSci install, pandas.read_sql() works everywhere):

>>> from pymapd import connect
>>> import pandas as pd
>>> con = connect(user="admin", password="HyperInteractive", host="localhost",
...               dbname="omnisci")
>>> df = pd.read_sql("SELECT depdelay, arrdelay FROM flights_2008_10k limit 100", con)

Cursors

After connecting to OmniSci, a cursor can be created with Connection.cursor():

>>> c = con.cursor()
>>> c
<pymapd.cursor.Cursor at 0x110fe6438>

Or by using a context manager:

>>> with con as c:
...     print(c)
<pymapd.cursor.Cursor object at 0x1041f9630>

Arbitrary SQL can be executed using Cursor.execute().

>>> c.execute("SELECT depdelay, arrdelay FROM flights_2008_10k limit 100")
<pymapd.cursor.Cursor at 0x110fe6438>

This sets the rowcount property to the number of returned rows:

>>> c.rowcount
100

The description attribute contains a list of Description objects, a namedtuple with the usual attributes required by the spec. There’s one entry per returned column, and we fill the name, type_code and null_ok attributes.

>>> c.description
[Description(name='depdelay', type_code=0, display_size=None, internal_size=None, precision=None, scale=None, null_ok=True),
 Description(name='arrdelay', type_code=0, display_size=None, internal_size=None, precision=None, scale=None, null_ok=True)]

Cursors are iterable; iterating yields one tuple per row of results:

>>> result = list(c)
>>> result[:5]
[(38, 28), (0, 8), (-4, 9), (1, -1), (1, 2)]
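
Because pymapd follows DB API 2.0, the standard fetch methods should also be available on a cursor. A sketch, reusing the cursor above (the returned rows depend on your data, so no output is shown):

>>> c.execute("SELECT depdelay, arrdelay FROM flights_2008_10k limit 100")
>>> row = c.fetchone()       # next row as a tuple, or None when exhausted
>>> rows = c.fetchmany(size=5)  # next five rows as a list of tuples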

Loading Data

The fastest way to load data is Connection.load_table_arrow(). Internally, this will use pyarrow and the Apache Arrow format to exchange data with the OmniSci database.

>>> import pyarrow as pa
>>> import pandas as pd
>>> df = pd.DataFrame({"A": [1, 2], "B": ['c', 'd']})
>>> table = pa.Table.from_pandas(df)
>>> con.load_table_arrow("table_name", table)

This accepts either a pyarrow.Table, or a pandas.DataFrame, which will be converted to a pyarrow.Table before loading.

You can also load a pandas.DataFrame using the Connection.load_table() or Connection.load_table_columnar() methods.

>>> df = pd.DataFrame({"A": [1, 2], "B": ["c", "d"]})
>>> con.load_table_columnar("table_name", df, preserve_index=False)

If you aren’t using Arrow or pandas, you can pass a list of tuples to Connection.load_table_rowwise().

>>> data = [(1, "c"), (2, "d")]
>>> con.load_table_rowwise("table_name", data)

The high-level Connection.load_table() method will choose the fastest method available based on the type of data.
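
For example, calling Connection.load_table() on the same DataFrame should delegate to the Arrow or columnar loader automatically (a sketch, assuming the table and connection from above):

>>> df = pd.DataFrame({"A": [1, 2], "B": ["c", "d"]})
>>> con.load_table("table_name", df)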

Database Metadata

Some helpful metadata are available on the Connection object.

  1. Get a list of tables with Connection.get_tables()

>>> con.get_tables()
['flights_2008_10k', 'stocks']

  2. Get column information for a table with Connection.get_table_details()

>>> con.get_table_details('stocks')
[ColumnDetails(name='date_', type='STR', nullable=True, precision=0,
               scale=0, comp_param=32),
 ColumnDetails(name='trans', type='STR', nullable=True, precision=0,
               scale=0, comp_param=32),
 ...