laforge: a low-key build system for data work.
Contents¶
Overview¶
laforge is a low-key build system designed to interoperate Python and SQL data work, originally developed internally for IRIS, the Institute for Research on Innovation and Science at the University of Michigan’s Institute for Social Research.
Features:
Interoperable: Read, write, and execute Python and SQL scripts/data.
Straightforward: Simple build INI files designed for a one-click build.
No lock-in: Maintain scripts independent from laforge.
An Example Build¶
Directory Organization¶
Because laforge needs to find the scripts, and because scripts will likely interact with output from other scripts, I tend to keep related project/sub-project files together:
project
├───input
│ ├───a.csv
│ ├───b.xlsx
│ └───c.csv
├───output
│ ├───results_d.csv
│ ├───results_e.csv
│ └───results_f.csv
├───.env
├───build.ini
├───g.py
├───h.sql
├───i.py
└───j.py
.env¶
distro = mssql
server = MSSQL
database = testdb
schema = laforge
build.ini¶
[DEFAULT]
Module Documentation¶
builder¶
Builder reads and executes tasks and lists of tasks.
-
class
laforge.builder.
FileCall
(method, kwargs)¶ -
property
kwargs
¶ Alias for field number 1
-
property
method
¶ Alias for field number 0
-
property
-
class
laforge.builder.
TaskList
(from_string, location='.')[source]¶ Todo
Implement cache_results=False
-
class
laforge.builder.
BaseTask
(*, identifier, verb, target, content, config)[source]¶ Create a task to (verb) (something)
Todo
- if “:” in self.content:
previous_result_key, actual_path_content = self.content.split(“:”)
-
property
path
¶ For handlers where dir[verb] + content = path
-
class
laforge.builder.
InternalPythonExecutor
(*, identifier, verb, target, content, config)[source]¶ Execute (without importing) Python script by path
Allows script adjustment via setting the run name:
__main__ = 'laforge'
..todo
Allow implict/explicit return of results.
command¶
Command-line interface for laforge.
distros¶
-
class
laforge.distros.
Distro
(_)[source]¶ Base class for SQL Distros
..note:: http://troels.arvin.dk/db/rdbms/
-
class
laforge.distros.
MSSQL
(_)[source]¶ -
static
add_fast_executemany
(engine: sqlalchemy.engine.base.Engine)[source]¶ Dramatically improve pyodbc upload performance
Theoretically, just “fast_executemany”: “True” should be sufficient in newer versions of the driver.
Note
Improved 1m row upload from over 7 minutes to less than 1 under pyodbc==4.0.26, SQLAlchemy==1.3.1, pandas==0.24.2.
-
static
sql¶
SQL utilities for mid-level interaction. Inspired by pathlib; powered by SQLALchemy.
Note
Supported: MSSQL, MariaDB/MySQL, PostgreSQL, SQLite. Supportable: Firebird, Oracle, Sybase.
-
class
laforge.sql.
Channel
(distro, *, server=None, database=None, schema=None, **engine_kwargs)[source]¶ Abstraction from Engine, other static details.
-
laforge.sql.
execute
(statement, fetch=False, channel=None)[source]¶ Convenience method, autofetches Channel if possible
-
class
laforge.sql.
Script
(query, channel=None)[source]¶ SQL query string, parsable by ‘go’ separation and execute()able.
-
to_table
()[source]¶ Executes all and tries to return a DataFrame for the result of the final query.
This is one of two ways that laforge retrieves tables.
Warning
This is limited by the capacity of Pandas to retrieve only the final result. For Microsoft SQL Server, if a lengthy set of queries is desired, the most reliable approach appears to be a single final query after a ‘go’ as a batch gterminator.
Warning
This will rename columns that do not conform to naming standards.
-
-
class
laforge.sql.
Table
(name, channel=None, **kwargs)[source]¶ Represents a SQL table, featuring methods to read/write DataFrames.
Todo
Factor out to superclass to allow views
tech¶
toolbox¶
Handful of utility functions
Note
These intentionally only depend on builtins.
Note
Some copyright information within this file is identified per-block below.
-
laforge.toolbox.
flatten
(foo)[source]¶ Take any set of nests in an iterator and reduce it into one generator.
‘Nests’ include any iterable except strings.
- Parameters
foo –
Note
flatten()
was authored by Amber Yust at https://stackoverflow.com/a/5286571. This function is not claimed under the laforge license.
Contributing to Development¶
laforge supports Python 3.6+.
Process |
Tool |
Documentation |
---|---|---|
Automation |
Nox |
|
Test |
pytest |
|
Test coverage |
pytest-cov |
|
Format |
Black |
|
Lint |
Flake8 |
|
Lint more |
Pylint |
|
Document |
Sphinx |
Suggested Environment¶
# Create virtual environment
python -m venv .venv
# Activate virtual environment with shell-specific script:
. .venv/bin/activate.fish # fish
# $ source ./.venv/bin/activate # bash
# source ./.venv/bin/activate.csh # csh
# Note that Python for Windows creates ./Scripts/ rather than ./bin/
# .\.venv\Scripts\Activate.ps1 # PowerShell
# .venv\Scripts\Activate.bat # cmd
# Install packages
python -m pip install -r requirements.txt
# Install working copy
# If desired, optional packages for Excel or other DBs...
# python -m pip install -e .[excel]
# python -m pip install -e .[mysql]
# python -m pip install -e .[all]
# Run tests
python -m pytest
# Run the gauntlet
python -m nox
Embedded TODOs¶
Todo
Implement cache_results=False
Todo
Restore quiet?
Todo
- if “:” in self.content:
previous_result_key, actual_path_content = self.content.split(“:”)
Todo
De-messify
Todo
Factor out to superclass to allow views
Todo
class InvalidIdentifierError relay_id_problem(identifier, action, reason=None, replacement=None)
Docstring Gaps¶
Undocumented Python objects
===========================
laforge.builder
---------------
Functions:
* get_verb
* is_verb
Classes:
* BaseTask -- missing methods:
- implement
- validate_results
* DirectoryVisit
* Echoer -- missing methods:
- implement
* ExistenceChecker -- missing methods:
- implement
* FileReader -- missing methods:
- implement
* FileWriter -- missing methods:
- implement
- write
* InternalPythonExecutor -- missing methods:
- implement
* SQLExecutor -- missing methods:
- implement
* SQLQueryReader -- missing methods:
- implement
* SQLReaderWriter -- missing methods:
- implement
* Task
* TaskList -- missing methods:
- load_tasks
- template_content
laforge.command
---------------
Functions:
* find_build_config
* get_package_logger
* run_build
* technobabble
laforge.distros
---------------
Classes:
* Distro -- missing methods:
- create_engine
- create_spec
- determine_dtypes
- find
- known
* MSSQL -- missing methods:
- create_spec
- find
* MySQL -- missing methods:
- create_spec
* PostgresQL -- missing methods:
- create_spec
* SQLite -- missing methods:
- create_spec
- find
laforge.sql
-----------
Functions:
* fix_bad_columns
* is_reserved_word
Classes:
* Channel -- missing methods:
- clean_up_statement
- find
- grab
- retrieve_engine
- save_engine
* Identifier -- missing methods:
- check
* Script -- missing methods:
- read
* Table -- missing methods:
- exists
- resolve
laforge.tech
------------
Functions:
* capitalize_sentences
Classes:
* ModifiableVerb
* Technobabbler
Free Software License¶
Copyright 2019 Matt VanEseltine.
laforge is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
laforge is distributed in the hope that it will be useful, but WITHOUT ANY
WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR
A PARTICULAR PURPOSE.
See the GNU Affero General Public License for more details. Copies are
attached with this documentation
and available
online at https://www.gnu.org/licenses/agpl.html.