laforge: a low-key build system for data work.

Contents

Overview

laforge is a low-key build system designed to interoperate Python and SQL data work, originally developed internally for IRIS, the Institute for Research on Innovation and Science at the University of Michigan’s Institute for Social Research.

Features:

  • Interoperable: Read, write, and execute Python and SQL scripts/data.

  • Straightforward: Simple build INI files designed for a one-click build.

  • No lock-in: Maintain scripts independent from laforge.

An Example Build

Directory Organization

Because laforge needs to find the scripts, and because scripts will likely interact with output from other scripts, I tend to keep related project/sub-project files together:

project
├───input
│   ├───a.csv
│   ├───b.xlsx
│   └───c.csv
├───output
│   ├───results_d.csv
│   ├───results_e.csv
│   └───results_f.csv
├───.env
├───build.ini
├───g.py
├───h.sql
├───i.py
└───j.py

.env

distro = mssql
server = MSSQL
database = testdb
schema = laforge

build.ini

[DEFAULT]

Module Documentation

builder

Builder reads and executes tasks and lists of tasks.

laforge.builder.show_env(path)[source]

Show the calculated generic section environment

class laforge.builder.Verb[source]

An enumeration.

class laforge.builder.Target[source]

An enumeration.

class laforge.builder.FileCall(method, kwargs)
property kwargs

Alias for field number 1

property method

Alias for field number 0

exception laforge.builder.TaskConstructionError[source]
exception laforge.builder.TaskExecutionError[source]
class laforge.builder.TaskList(from_string, location='.')[source]

Todo

Implement cache_results=False

load_section_config(section='DEFAULT')[source]

Put together config from env, TaskList config, section config

execute()[source]

Execute each task in the list.

Todo

Restore quiet?

dry_run()[source]

List each task in the list.

class laforge.builder.BaseTask(*, identifier, verb, target, content, config)[source]

Create a task to (verb) (something)

Todo

if “:” in self.content:

previous_result_key, actual_path_content = self.content.split(“:”)

property path

For handlers where dir[verb] + content = path

class laforge.builder.FileReader(*, identifier, verb, target, content, config)[source]
class laforge.builder.InternalPythonExecutor(*, identifier, verb, target, content, config)[source]

Execute (without importing) Python script by path

Allows script adjustment via setting the run name: __main__ = 'laforge'

..todo

Allow implict/explicit return of results.
class laforge.builder.Echoer(*, identifier, verb, target, content, config)[source]
class laforge.builder.SQLQueryReader(*, identifier, verb, target, content, config)[source]
class laforge.builder.SQLExecutor(*, identifier, verb, target, content, config)[source]
class laforge.builder.SQLReaderWriter(*, identifier, verb, target, content, config)[source]
class laforge.builder.FileWriter(*, identifier, verb, target, content, config)[source]

Handles all tasks writing to file.

class laforge.builder.ExistenceChecker(*, identifier, verb, target, content, config)[source]
laforge.builder.load_env(path)[source]

Get .env values without dotenv’s default to silently pull package dir

command

Command-line interface for laforge.

distros

exception laforge.distros.SQLDistroNotFound[source]
class laforge.distros.Distro(_)[source]

Base class for SQL Distros

..note:: http://troels.arvin.dk/db/rdbms/

class laforge.distros.MySQL(_)[source]
class laforge.distros.PostgresQL(_)[source]
class laforge.distros.MSSQL(_)[source]
static add_fast_executemany(engine: sqlalchemy.engine.base.Engine)[source]

Dramatically improve pyodbc upload performance

Theoretically, just “fast_executemany”: “True” should be sufficient in newer versions of the driver.

Note

Improved 1m row upload from over 7 minutes to less than 1 under pyodbc==4.0.26, SQLAlchemy==1.3.1, pandas==0.24.2.

class laforge.distros.SQLite(_)[source]
determine_dtypes(df)[source]

SQlite does not make gradations in integers or text, so don’t try.

laforge.distros.round_up(n, nearest=1)[source]

Round up n to the nearest nearest.

Parameters
  • n

  • nearest – (Default value = 1)

sql

SQL utilities for mid-level interaction. Inspired by pathlib; powered by SQLALchemy.

Note

Supported: MSSQL, MariaDB/MySQL, PostgreSQL, SQLite. Supportable: Firebird, Oracle, Sybase.

exception laforge.sql.SQLTableNotFound[source]
exception laforge.sql.SQLChannelNotFound[source]
exception laforge.sql.SQLIdentifierProblem[source]
class laforge.sql.Channel(distro, *, server=None, database=None, schema=None, **engine_kwargs)[source]

Abstraction from Engine, other static details.

execute_statement(statement, fetch=False)[source]

Execute SQL (core method)

Todo

De-messify

laforge.sql.execute(statement, fetch=False, channel=None)[source]

Convenience method, autofetches Channel if possible

class laforge.sql.Script(query, channel=None)[source]

SQL query string, parsable by ‘go’ separation and execute()able.

execute(statements=None)[source]

Execute itsel(f|ves)

to_table()[source]

Executes all and tries to return a DataFrame for the result of the final query.

This is one of two ways that laforge retrieves tables.

Warning

This is limited by the capacity of Pandas to retrieve only the final result. For Microsoft SQL Server, if a lengthy set of queries is desired, the most reliable approach appears to be a single final query after a ‘go’ as a batch gterminator.

Warning

This will rename columns that do not conform to naming standards.

class laforge.sql.Table(name, channel=None, **kwargs)[source]

Represents a SQL table, featuring methods to read/write DataFrames.

Todo

Factor out to superclass to allow views

write(df, if_exists='replace')[source]

From DataFrame, create a new table and fill it with values

read()[source]

Return the full table as a DataFrame

drop(ignore_existence=False)[source]

Delete the table within SQL

class laforge.sql.Scalar(prox)[source]

Little helper to produce clearly typed single (upper left) ResultProxy result.

class laforge.sql.Identifier(user_input, extra=None)[source]

Single standardized variable/database/schema/table/column/anything identifier.

Todo

class InvalidIdentifierError relay_id_problem(identifier, action, reason=None, replacement=None)

tech

laforge.tech.make_first_upper(s)[source]

Uppercase the first letter of s, leaving the rest alone.

toolbox

Handful of utility functions

Note

These intentionally only depend on builtins.

Note

Some copyright information within this file is identified per-block below.

laforge.toolbox.flatten(foo)[source]

Take any set of nests in an iterator and reduce it into one generator.

‘Nests’ include any iterable except strings.

Parameters

foo

Note

flatten() was authored by Amber Yust at https://stackoverflow.com/a/5286571. This function is not claimed under the laforge license.

Contributing to Development

laforge supports Python 3.6+.

Process

Tool

Documentation

Automation

Nox

https://nox.readthedocs.io/

Test

pytest

https://docs.pytest.org/

Test coverage

pytest-cov

https://pytest-cov.readthedocs.io/

Format

Black

https://black.readthedocs.io/

Lint

Flake8

http://flake8.pycqa.org/

Lint more

Pylint

https://pylint.readthedocs.io/en/latest/

Document

Sphinx

https://www.sphinx-doc.org/

Suggested Environment

# Create virtual environment
python -m venv .venv

# Activate virtual environment with shell-specific script:
. .venv/bin/activate.fish           # fish
# $ source ./.venv/bin/activate     # bash
#  source ./.venv/bin/activate.csh  # csh
# Note that Python for Windows creates ./Scripts/ rather than ./bin/
# .\.venv\Scripts\Activate.ps1      # PowerShell
# .venv\Scripts\Activate.bat        # cmd

# Install packages
python -m pip install -r requirements.txt

# Install working copy

# If desired, optional packages for Excel or other DBs...
# python -m pip install -e .[excel]
# python -m pip install -e .[mysql]
# python -m pip install -e .[all]

# Run tests
python -m pytest

# Run the gauntlet
python -m nox

Embedded TODOs

Todo

Implement cache_results=False

original entry

Todo

Restore quiet?

original entry

Todo

if “:” in self.content:

previous_result_key, actual_path_content = self.content.split(“:”)

original entry

Todo

De-messify

original entry

Todo

Factor out to superclass to allow views

original entry

Todo

class InvalidIdentifierError relay_id_problem(identifier, action, reason=None, replacement=None)

original entry

Docstring Gaps

Undocumented Python objects
===========================
laforge.builder
---------------
Functions:
 * get_verb
 * is_verb

Classes:
 * BaseTask -- missing methods:

   - implement
   - validate_results
 * DirectoryVisit
 * Echoer -- missing methods:

   - implement
 * ExistenceChecker -- missing methods:

   - implement
 * FileReader -- missing methods:

   - implement
 * FileWriter -- missing methods:

   - implement
   - write
 * InternalPythonExecutor -- missing methods:

   - implement
 * SQLExecutor -- missing methods:

   - implement
 * SQLQueryReader -- missing methods:

   - implement
 * SQLReaderWriter -- missing methods:

   - implement
 * Task
 * TaskList -- missing methods:

   - load_tasks
   - template_content

laforge.command
---------------
Functions:
 * find_build_config
 * get_package_logger
 * run_build
 * technobabble

laforge.distros
---------------
Classes:
 * Distro -- missing methods:

   - create_engine
   - create_spec
   - determine_dtypes
   - find
   - known
 * MSSQL -- missing methods:

   - create_spec
   - find
 * MySQL -- missing methods:

   - create_spec
 * PostgresQL -- missing methods:

   - create_spec
 * SQLite -- missing methods:

   - create_spec
   - find

laforge.sql
-----------
Functions:
 * fix_bad_columns
 * is_reserved_word

Classes:
 * Channel -- missing methods:

   - clean_up_statement
   - find
   - grab
   - retrieve_engine
   - save_engine
 * Identifier -- missing methods:

   - check
 * Script -- missing methods:

   - read
 * Table -- missing methods:

   - exists
   - resolve

laforge.tech
------------
Functions:
 * capitalize_sentences

Classes:
 * ModifiableVerb
 * Technobabbler

Free Software License

Copyright 2019 Matt VanEseltine.

laforge is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

laforge is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more details. Copies are attached with this documentation and available online at https://www.gnu.org/licenses/agpl.html.