5-Minute Quickstart

pymapd follows the Python DB API 2.0 (PEP 249), so experience with other Python database clients will transfer to pymapd.

Note

This tutorial assumes you have an OmniSci server running on localhost:6274 with the default logins and databases, and have loaded the example flights_2008_10k dataset. This dataset can be obtained from the insert_sample_data script included in the OmniSci install directory.

Installing pymapd

pymapd

pymapd can be installed with conda using conda-forge or pip.

# conda
conda install -c conda-forge pymapd

# pip
pip install pymapd
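
To verify the installation, you can import the package and check its version (assuming pymapd exposes a __version__ attribute, as most Python packages do):

>>> import pymapd
>>> pymapd.__version__  # e.g. '0.x.y'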

If you have an NVIDIA GPU in the same machine where your pymapd code will be running, you’ll want to install cudf as well to return result sets into GPU memory as a cudf GPU DataFrame:

cudf via conda

# CUDA 9.2
conda install -c nvidia -c rapidsai -c numba -c conda-forge -c defaults cudf

# CUDA 10.0
conda install -c nvidia/label/cuda10.0 -c rapidsai/label/cuda10.0 -c numba \
    -c conda-forge -c defaults cudf

cudf via PyPI/pip

# CUDA 9.2
pip install cudf-cuda92

# CUDA 10.0
pip install cudf-cuda100

Connecting

Self-Hosted Install

For self-hosted OmniSci installs, use protocol='binary' (the default) to connect to OmniSci, as this performs better than protocol='http' or protocol='https'.

Create a Connection using the connect() method, passing user, password, host, and dbname:

>>> from pymapd import connect
>>> con = connect(user="admin", password="HyperInteractive", host="localhost",
...               dbname="omnisci")
>>> con
Connection(mapd://admin:***@localhost:6274/omnisci?protocol=binary)

Alternatively, you can pass in a SQLAlchemy-compliant connection string to the connect() method:

>>> uri = "mapd://admin:HyperInteractive@localhost:6274/omnisci?protocol=binary"
>>> con = connect(uri=uri)
Connection(mapd://admin:***@localhost:6274/omnisci?protocol=binary)

OmniSci Cloud

When connecting to OmniSci Cloud, the two connection methods shown above work the same way; however, you can only use protocol='https'. For a step-by-step walk-through with screenshots, please see this blog post.
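
A connection to OmniSci Cloud might look like the following sketch; the hostname, port, database name, and credentials below are placeholders, so substitute the values shown in your OmniSci Cloud dashboard:

>>> from pymapd import connect
>>> con = connect(user="API_KEY_NAME", password="API_KEY_SECRET",
...               host="your-instance.mapd.cloud", port=443,
...               dbname="mapd", protocol="https")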

Querying

A few options are available for getting the results of a query into your Python process.

  1. Into GPU Memory via cudf (Connection.select_ipc_gpu())

  2. Into CPU shared memory via Apache Arrow and pandas (Connection.select_ipc())

  3. Into python objects via Apache Thrift (Connection.execute())

The best option depends on the hardware you have available, your connection to the database, and what you plan to do with the returned data. In general, the third method, using Thrift to serialize and deserialize the data, will be slower than the GPU or CPU shared memory methods. The shared memory methods additionally require that the OmniSci database be running on the same machine as your Python process.

Note

We currently support the Timestamp(0|3|6) data types, i.e., seconds, milliseconds, and microseconds granularity. Support for nanoseconds, Timestamp(9), is in progress.

GPU Shared Memory

Use Connection.select_ipc_gpu() to select data into a GpuDataFrame, provided by cudf. To use this method, the Python code must be running on the same machine as the OmniSci installation AND you must have an NVIDIA GPU installed.

>>> query = "SELECT depdelay, arrdelay FROM flights_2008_10k limit 100"
>>> df = con.select_ipc_gpu(query)
>>> df.head()
  depdelay arrdelay
0       -2      -13
1       -1      -13
2       -3        1
3        4       -3
4       12        7

CPU Shared Memory

Use Connection.select_ipc() to select data into a pandas DataFrame using CPU shared memory to avoid unnecessary intermediate copies. To use this method, the Python code must be running on the same machine as the OmniSci installation.

>>> df = con.select_ipc(query)
>>> df.head()
  depdelay arrdelay
0       -2      -13
1       -1      -13
2       -3        1
3        4       -3
4       12        7

pandas.read_sql()

With a Connection defined, you can use pandas.read_sql() to read your data into a pandas DataFrame. This will be slower than using Connection.select_ipc(), but works regardless of where the Python code is running (i.e. select_ipc() must be on the same machine as the OmniSci install, pandas.read_sql() works everywhere):

>>> from pymapd import connect
>>> import pandas as pd
>>> con = connect(user="admin", password="HyperInteractive", host="localhost",
...               dbname="omnisci")
>>> df = pd.read_sql("SELECT depdelay, arrdelay FROM flights_2008_10k limit 100", con)

Cursors

After connecting to OmniSci, a cursor can be created with Connection.cursor():

>>> c = con.cursor()
>>> c
<pymapd.cursor.Cursor at 0x110fe6438>

Or by using a context manager:

>>> with con as c:
...     print(c)
<pymapd.cursor.Cursor object at 0x1041f9630>

Arbitrary SQL can be executed using Cursor.execute().

>>> c.execute("SELECT depdelay, arrdelay FROM flights_2008_10k limit 100")
<pymapd.cursor.Cursor at 0x110fe6438>

This sets the rowcount property to the number of returned rows:

>>> c.rowcount
100

The description attribute contains a list of Description objects, a namedtuple with the usual attributes required by the spec. There’s one entry per returned column, and we fill the name, type_code and null_ok attributes.

>>> c.description
[Description(name='depdelay', type_code=0, display_size=None, internal_size=None, precision=None, scale=None, null_ok=True),
 Description(name='arrdelay', type_code=0, display_size=None, internal_size=None, precision=None, scale=None, null_ok=True)]

Cursors are iterable; iterating yields one tuple per row of results:

>>> result = list(c)
>>> result[:5]
[(38, 28), (0, 8), (-4, 9), (1, -1), (1, 2)]
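
Because pymapd follows DB API 2.0, the standard fetch methods should also be available on a cursor. A sketch, reusing the cursor above (the returned rows depend on your data, so no output is shown):

>>> c.execute("SELECT depdelay, arrdelay FROM flights_2008_10k limit 100")
>>> row = c.fetchone()       # next row as a tuple, or None when exhausted
>>> rows = c.fetchmany(size=5)  # next five rows as a list of tuples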

Loading Data

The fastest way to load data is Connection.load_table_arrow(). Internally, this will use pyarrow and the Apache Arrow format to exchange data with the OmniSci database.

>>> import pyarrow as pa
>>> import pandas as pd
>>> df = pd.DataFrame({"A": [1, 2], "B": ['c', 'd']})
>>> table = pa.Table.from_pandas(df)
>>> con.load_table_arrow("table_name", table)

This accepts either a pyarrow.Table, or a pandas.DataFrame, which will be converted to a pyarrow.Table before loading.

You can also load a pandas.DataFrame using the Connection.load_table() or Connection.load_table_columnar() methods.

>>> df = pd.DataFrame({"A": [1, 2], "B": ["c", "d"]})
>>> con.load_table_columnar("table_name", df, preserve_index=False)

If you aren’t using Arrow or pandas, you can pass a list of tuples to Connection.load_table_rowwise().

>>> data = [(1, "c"), (2, "d")]
>>> con.load_table_rowwise("table_name", data)

The high-level Connection.load_table() method will choose the fastest method available based on the type of data.
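
For example, calling Connection.load_table() on the same DataFrame should delegate to the Arrow or columnar loader automatically (a sketch, assuming the table and connection from above):

>>> df = pd.DataFrame({"A": [1, 2], "B": ["c", "d"]})
>>> con.load_table("table_name", df)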

Database Metadata

Some helpful metadata are available on the Connection object.

  1. Get a list of tables with Connection.get_tables()

>>> con.get_tables()
['flights_2008_10k', 'stocks']

  2. Get column information for a table with Connection.get_table_details()

>>> con.get_table_details('stocks')
[ColumnDetails(name='date_', type='STR', nullable=True, precision=0,
               scale=0, comp_param=32),
 ColumnDetails(name='trans', type='STR', nullable=True, precision=0,
               scale=0, comp_param=32),
 ...