PanDatFactory

PanDatFactory(self, **init_fields)

Defines a schema for a collection of pandas.DataFrame objects. This class is constructed with a schema. It can be used to generate PanDat objects, to write PanDat objects to different file types, or to perform bulk query operations to diagnose common data integrity failures.

Analytical code that uses PanDat objects can be used, without change, on different data sources, thus facilitating the "separate model from data" design goal.

A PanDat object is itself a collection of DataFrames that conform to a predefined schema.

:param init_fields: a mapping of tables to primary key fields and data fields. Each field listing consists of two sub-lists: first the primary key fields, then the data fields.

ex: PanDatFactory(categories=[["Name"], ["Min Nutrition", "Max Nutrition"]], foods=[["Name"], ["Cost"]], nutritionQuantities=[["Food", "Category"], ["Qty"]])

Use '*' instead of a pair of lists for generic tables

ex: PanDatFactory (typical_table = [["Primary Key Field"],["Data Field"]], generic_table = '*')
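For instance, here is a minimal sketch (table names, field names, and data values are illustrative) of constructing a factory and then building a PanDat object from conforming DataFrames:

from ticdat import PanDatFactory
import pandas as pd

# schema: each table maps to [primary key fields, data fields]
pdf = PanDatFactory(foods=[["Name"], ["Cost"]],
                    categories=[["Name"], ["Min Nutrition", "Max Nutrition"]])

# a PanDat object is built from DataFrames that conform to the schema
dat = pdf.PanDat(
    foods=pd.DataFrame({"Name": ["pizza", "salad"], "Cost": [2.5, 4.0]}),
    categories=pd.DataFrame({"Name": ["calories"],
                             "Min Nutrition": [1800.0], "Max Nutrition": [2200.0]}))

assert pdf.good_pan_dat_object(dat)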

add_parameter

PanDatFactory.add_parameter(name,
                            default_value,
                            number_allowed=True,
                            inclusive_min=True,
                            inclusive_max=False,
                            min=0,
                            max=inf,
                            must_be_int=False,
                            strings_allowed=(),
                            nullable=False,
                            datetime=False,
                            enforce_type_rules=True)

Add (or reset) a parameters option. Requires that a parameters table with one primary key field and one data field already be present. The legal parameters options will be enforced as part of find_data_row_failures. Note that if you are using this function, then you would typically read from the parameters table indirectly, by using the dictionary returned by create_full_parameters_dict.

:param name: name of the parameter to add or reset

:param default_value: default value for the parameter (used for create_full_parameters_dict)

:param number_allowed: boolean does this parameter allow numbers?

:param inclusive_min: if number allowed, is the min inclusive?

:param inclusive_max: if number allowed, is the max inclusive?

:param min: if number allowed, the minimum value

:param max: if number allowed, the maximum value

:param must_be_int: boolean : if number allowed, must the number be integral?

:param strings_allowed: if a collection - then a list of the strings allowed. The empty collection prohibits strings. If a "*", then any string is accepted.

:param nullable: boolean : can this parameter be set to null (aka None)

:param datetime: If truthy, then number_allowed through strings_allowed are ignored, and the data must be either a datetime.datetime object or a string that can be parsed into a datetime.datetime object. Note that the various readers will try to coerce strings into datetime.datetime objects on read for parameters with datetime data types. pandas.Timestamp is itself a datetime.datetime, and the bias will be to create such an object.

:param enforce_type_rules: boolean: ignore all of number_allowed through datetime, and only enforce the parameter names and default values

:return:
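As a minimal sketch (the parameter name and bounds are illustrative; the parameters table must exist in the schema):

from ticdat import PanDatFactory
import pandas as pd

pdf = PanDatFactory(parameters=[["Name"], ["Value"]])

# "Max Distance" must be a number in [0, 1000); missing rows fall back to 100
pdf.add_parameter("Max Distance", default_value=100, min=0, max=1000)

# read the parameters indirectly, via create_full_parameters_dict
dat = pdf.PanDat(parameters=pd.DataFrame({"Name": [], "Value": []}))
assert pdf.create_full_parameters_dict(dat)["Max Distance"] == 100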

add_data_row_predicate

PanDatFactory.add_data_row_predicate(
  table,
  predicate,
  predicate_name=None,
  predicate_kwargs_maker=None,
  predicate_failure_response='Boolean')

The purpose of calling add_data_row_predicate is to prepare for a future call to find_data_row_failures. See https://bit.ly/3e9pdCP for more details on these two functions.

Adds a data row predicate for a table. Row predicates can be used to check for sophisticated data integrity problems of the sort that can't be easily handled with a data type rule. For example, a min_supply column can be verified to be no larger than a max_supply column.

!!! NB!!!!

pandas will typically render NULL as nan. In rare cases, it uses None.

Don't check for None (or nan) in your predicate functions. Use pandas.isnull

!!!!!!!!!!

:param table: table in the schema

:param predicate: A one argument function that accepts a table row as an argument and returns Truthy if the row is valid and Falsey otherwise. (See below; there are other arguments that can refine how predicate works.) The row argument passed to predicate will be a dict that maps field name to data value for all fields (both primary key and data fields) in the table. Note - if None is passed as a predicate, then any previously added predicate matching (table, predicate_name) will be removed.

:param predicate_name: The name of the predicate. If omitted, the smallest non-colliding number will be used.

:param predicate_kwargs_maker: A function used to support predicate if predicate accepts more than just the row argument. This function accepts a single dat argument and is called exactly once per find_data_row_failures call. If predicate_kwargs_maker returns a dict, then this dict is unpacked for each call to predicate. An error (or a bulk row failure) results if predicate_kwargs_maker fails to return a dict.

:param predicate_failure_response: Either "Boolean" or "Error Message". If the latter then predicate indicates a clean row by returning True (the one and only literal True in Python) and a dirty row by returning a non-empty string (which is an error message).

See find_data_row_failures for details on handling exceptions thrown by predicate or predicate_kwargs_maker.

:return:
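For instance, a minimal sketch (the supply table and its fields are hypothetical) that checks min/max consistency while respecting the pandas.isnull advice above:

import pandas
from ticdat import PanDatFactory

pdf = PanDatFactory(supply=[["Site"], ["Min Supply", "Max Supply"]])

def min_max_ok(row):
    # use pandas.isnull rather than comparing against None or nan
    if pandas.isnull(row["Min Supply"]) or pandas.isnull(row["Max Supply"]):
        return False
    return row["Min Supply"] <= row["Max Supply"]

pdf.add_data_row_predicate("supply", predicate=min_max_ok,
                           predicate_name="Min Supply <= Max Supply")

# a later call to pdf.find_data_row_failures(dat) flags the rows for which
# min_max_ok returned Falsey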

add_foreign_key

PanDatFactory.add_foreign_key(native_table, foreign_table, mappings)

Adds a foreign key relationship to the schema. Adding a foreign key doesn't block the entry of child records that fail to find a parent match. It does make it easy to recognize such records (with find_foreign_key_failures()) and to remove such records (with remove_foreign_key_failures())

:param native_table: (aka child table). The table with fields that must match some other table.

:param foreign_table: (aka parent table). The table providing the matching entries.

:param mappings: For simple foreign keys, a [native_field, foreign_field] pair. For compound foreign keys an iterable of [native_field, foreign_field] pairs.

:return:
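For instance, continuing the diet-style schema from the constructor example (table and field names are illustrative), each child row must match a parent row:

# every nutritionQuantities row must reference an existing food and category
pdf.add_foreign_key("nutritionQuantities", "foods", ["Food", "Name"])
pdf.add_foreign_key("nutritionQuantities", "categories", ["Category", "Name"])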

clear_data_type

PanDatFactory.clear_data_type(table, field)

clears the data type for a field. By default, fields don't have types. Adding a data type doesn't block data of the wrong type from being entered. Data types are useful for recognizing errant data entries. If no data type is specified (the default) then no errant data will be recognized.

:param table: table in the schema

:param field: one of table's fields.

:return:

clear_foreign_keys

PanDatFactory.clear_foreign_keys(native_table=None)

clears foreign key relationships from the schema

:param native_table: optional. The table whose foreign keys should be cleared. If omitted, all foreign keys are cleared.

clone

PanDatFactory.clone(table_restrictions=None, clone_factory=None)

clones the PanDatFactory

:param table_restrictions : if None, then argument is ignored. Otherwise, a container listing the tables to keep in the clone. Tables outside table_restrictions are removed from the clone.

:param clone_factory : optional. Defaults to PanDatFactory. Can also be TicDatFactory. Can also be a function, in which case it should behave similarly to create_from_full_schema. If clone_factory=TicDatFactory, the row predicates that use predicate_kwargs_maker won't be copied over.

:return: a clone of the PanDatFactory. Returned object will be based on clone_factory, if provided.

Note - If you want to remove tables via a clone, then call it like this:

pdf_new = pdf.clone(table_restrictions=set(pdf.all_tables).difference(tables_to_remove))

Other schema editing operations are available with clone_add_a_table, clone_add_a_column, clone_remove_a_column and clone_rename_a_column.

clone_add_a_column

PanDatFactory.clone_add_a_column(table,
                                 field,
                                 field_type,
                                 field_position='append')

add a column to the PanDatFactory

:param table: table in the schema

:param field: name of the new field to be added

:param field_type: either "primary key" or "data"

:param field_position: integer between 0 and the length of self.primary_key_fields[table] (if "primary key") or self.data_fields[table] (if "data"), inclusive. Alternatively, can be "append", which will just insert the column at the end of the appropriate list.

:return: a clone of the PanDatFactory, with field inserted into location field_position for field_type

clone_add_a_table

PanDatFactory.clone_add_a_table(table, pk_fields, df_fields)

add a table to the PanDatFactory

:param table: table not in the schema

:param pk_fields: container of the primary key fields

:param df_fields: container of the data fields

:return: a clone of the PanDatFactory, with the new table added

clone_remove_a_column

PanDatFactory.clone_remove_a_column(table, field)

remove a column from the PanDatFactory

:param table: table in the schema

:param field: name of the field to be removed

:return: a clone of the PanDatFactory, with field removed

clone_rename_a_column

PanDatFactory.clone_rename_a_column(table, field, new_field)

rename a column in the PanDatFactory

:param table: table in the schema

:param field: name of the field to be renamed

:param new_field: new name for the field

:return: a clone of the PanDatFactory, with field renamed to new_field. Data types, default values and foreign keys will reflect the new field name, but row predicates will be copied over as-is (and thus you will need to re-create them as needed).

copy_from_ampl_variables

PanDatFactory.copy_from_ampl_variables(ampl_variables)

copies the solution results from ampl_variables into a new PanDat object

:param ampl_variables: a dict mapping from (table_name, field_name) -> amplpy.variable.Variable (amplpy.variable.Variable is the type of object returned by AMPL.getVariable). table_name should refer to a table in the schema that has primary key fields. field_name can refer to a data field for table_name, or it can be falsey. If the latter, then the AMPL variables that pass the filter (see below) will simply populate the primary key of table_name. Note that by default, only non-zero data is copied over. If you want to override this filter, then instead of mapping to amplpy.variable.Variable you should map to a (amplpy.variable.Variable, filter) pair, where filter accepts a data value and returns a boolean.

:return: a deep copy of the ampl_variables into a PanDat object

copy_pan_dat

PanDatFactory.copy_pan_dat(pan_dat)

copies the pan_dat object into a new PanDat object, performing a deep copy

:param pan_dat: a pandat object

:return: a deep copy of the pan_dat argument

copy_to_ampl

PanDatFactory.copy_to_ampl(pan_dat,
                           field_renamings=None,
                           excluded_tables=None)

copies the pan_dat object into a new pan_dat object populated with amplpy.DataFrame objects, performing a deep copy

:param pan_dat: a PanDat object

:param field_renamings: dict or None. If fields are to be renamed in the copy, then a mapping from (table_name, field_name) -> new_field_name. If a data field is to be omitted, then new_field_name can be falsey. table_name cannot refer to an excluded table (see below). field_name doesn't have to refer to an element of self.data_fields[t], but it does have to refer to a column in the pan_dat.table_name DataFrame.

:param excluded_tables: If truthy, a list of tables to be excluded from the copy. Tables without primary key fields are always excluded.

:return: a deep copy of the pan_dat argument into amplpy.DataFrames

create_from_full_schema

PanDatFactory.create_from_full_schema(full_schema)

create a PanDatFactory complete with default values, data types, and foreign keys

:param full_schema: a dictionary consistent with the data returned by a call to schema() with include_ancillary_info = True

:return: a PanDatFactory reflecting the tables, fields, default values, data types, and foreign keys consistent with the full_schema argument

create_full_parameters_dict

PanDatFactory.create_full_parameters_dict(dat)

create a fully populated dictionary of all the parameters

:param dat: a PanDat object that has a parameters table

:return: a dictionary that maps parameter option to actual dat.parameters value. if the specific option isn't part of dat.parameters, then the default value is used. Note that for datetime parameters, the default will be coerced into a datetime object, if possible.

find_data_row_failures

PanDatFactory.find_data_row_failures(pan_dat,
                                     as_table=True,
                                     exception_handling='__debug__',
                                     max_failures=inf)

Finds the data row failures for a pandat object

:param pan_dat: a pandat object

:param as_table: boolean - if truthy then the values of the return dictionary will be the predicate failure rows themselves. Otherwise will return the boolean Series that indicates which rows have predicate failures.

:param exception_handling: One of "Handled as Failure", "Unhandled" or "__debug__".

"Handled as Failure": Any exception generated by calling a row predicate function will indicate a data failure for that row. (Similarly, predicate_kwargs_maker exceptions create an entry in the returned failure dictionary.)

"Unhandled": Exceptions resulting from calling a row predicate (or a predicate_kwargs_maker) will not be handled by find_data_row_failures.

"__debug__": Since "Handled as Failure" makes more sense for production runs and "Unhandled" makes more sense for debugging, this option will use the latter if __debug__ is True and the former otherwise. See -O and __debug__ in the Python documentation for more details.

:param max_failures: number. An upper limit on the number of failures to find. Will short circuit and return ASAP with a partial failure enumeration when this number is reached.

:return: A dictionary constructed as follows:

The keys are namedtuples with members "table", "predicate_name".

The values are DataFrames that contain the subset of rows that exhibit data failures for this specific table, predicate pair (or the Series that identifies these rows).

If the predicate_failure_response for the predicate is "Error Message" (instead of "Boolean") and as_table is truthy, then an "Error Message" column will be added to the appropriate DataFrame in the returned dict.

If a predicate_kwargs_maker is provided and it fails (either by failing to return a dictionary or by throwing a handled exception), then the corresponding value of the dictionary will be a namedtuple with members "primary_key" and "error message". The former will be populated with '*' (indicating all the rows) and the latter will be a string describing the failure.
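A sketch of consuming this return value (assuming dat is a PanDat object for this schema and predicates were registered as above):

fails = pdf.find_data_row_failures(dat)
for (table, predicate_name), bad_rows in fails.items():
    # with as_table=True, bad_rows is the DataFrame of failing rows
    print(table, predicate_name, len(bad_rows))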

find_data_type_failures

PanDatFactory.find_data_type_failures(pan_dat,
                                      as_table=True,
                                      max_failures=inf)

Finds the data type failures for a pandat object

:param pan_dat: pandat object

:param as_table: boolean - if truthy then the values of the return dictionary will be the data type failure rows themselves. Otherwise will return the boolean Series that indicates which rows have data type failures.

:param max_failures: number. An upper limit on the number of failures to find. Will short circuit and return ASAP with a partial failure enumeration when this number is reached.

:return: A dictionary constructed as follows: The keys are namedtuples with members "table", "field". Each (table, field) pair has data values that are inconsistent with its data type. (table, field) pairs with no data type at all are never part of the returned dictionary. The values are DataFrames that contain the subset of rows that exhibit data failures for this specific (table, field) pair (or the boolean Series that identifies these rows).

Note that for primary key fields (but not data fields) with no explicit data type, a temporary filter that excludes only Null will be applied. If you want primary key fields to allow Null, you must explicitly opt-in by calling set_data_type appropriately. See issue https://github.com/ticdat/ticdat/issues/46 for more info.
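A minimal sketch (the Cost bound is illustrative) of setting a data type and then finding the errant rows:

# flag errant Cost entries: Cost must be a non-negative number
pdf.set_data_type("foods", "Cost", number_allowed=True, min=0,
                  max=float("inf"), inclusive_min=True)

for (table, field), bad_rows in pdf.find_data_type_failures(dat).items():
    print(table, field, len(bad_rows))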

find_duplicates

PanDatFactory.find_duplicates(pan_dat, keep='first', as_table=True)

Find the duplicated rows based on the primary key fields.

:param pan_dat: pandat object

:param keep: 'first': Treat all duplicated rows as duplicates except for the first occurrence. 'last': Treat all duplicated rows as duplicates except for the last occurrence. False: Treat all duplicated rows as duplicates

:param as_table: boolean : if truthy then the values of the return dictionary will be the duplicated rows themselves. Otherwise will return the boolean Series that indicates which rows are duplicated rows.

:return: A dictionary whose keys are the table names and whose values are duplicated rows (or the Series that identifies these rows)
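A minimal usage sketch (assuming dat is a PanDat object for this schema):

# tables containing rows whose primary key values repeat
for table, dup_rows in pdf.find_duplicates(dat, keep='first').items():
    print(table, len(dup_rows))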

find_foreign_key_failures

PanDatFactory.find_foreign_key_failures(pan_dat,
                                        verbosity='High',
                                        as_table=True,
                                        max_failures=inf)

Finds the foreign key failures for a pandat object

:param pan_dat: pandat object

:param verbosity: either "High" or "Low"

:param as_table: boolean : if truthy then the values of the return dictionary will be the failed rows themselves. Otherwise will return a boolean list that indicates which rows have failures. (For technical reasons, this does not return a boolean Series like the other find functions.)

:param max_failures: number. An upper limit on the number of failures to find. Will short circuit and return ASAP with a partial failure enumeration when this number is reached.

:return: A dictionary constructed as follows:

The keys are namedtuples with members "native_table", "foreign_table", "mapping", "cardinality".

The key data matches the arguments to add_foreign_key that constructed the foreign key (with "cardinality" being deduced from the overall schema).

The values are DataFrames that contain the subset of native table rows that fail to find the foreign table matching defined by the associated returned key (or the list that identifies these rows).

For verbosity = 'Low' a simpler return object is created that doesn't use namedtuples and omits the foreign key cardinality.
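A sketch of finding and then removing such failures (assuming the foreign keys registered in the add_foreign_key example above):

fk_fails = pdf.find_foreign_key_failures(dat, verbosity='High')
for key, bad_rows in fk_fails.items():
    print(key.native_table, "->", key.foreign_table, len(bad_rows))

# or simply drop the orphaned child rows (removals cascade)
pdf.remove_foreign_key_failures(dat)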

get_row_predicates

PanDatFactory.get_row_predicates(table)

return all the row predicates for a given table

:param table: a table in the schema

:return: a dictionary mapping predicate_name to RowPredicateInfo named tuple (the entries of which are based on the prior call to add_data_row_predicate).

good_pan_dat_object

PanDatFactory.good_pan_dat_object(
  data_obj,
  bad_message_handler=lambda x: None)

determines if an object is a valid PanDat object for this schema

:param data_obj: the object to verify

:param bad_message_handler: a callback function to receive a description of any failure message

:return: True if the data_obj can be recognized as a PanDat data object. False otherwise.

remove_foreign_key_failures

PanDatFactory.remove_foreign_key_failures(pan_dat)

Removes foreign key failures (i.e. child records with no parent table record)

:param pan_dat: pandat object (will be side-effected)

:return: pan_dat, with the foreign key failures removed.

Note that all foreign key removals are cascading. When a child removal results in new foreign key failures, those failures are removed as well.

remove_parameter

PanDatFactory.remove_parameter(name)

Undo a previous call to add_parameter.

:param name: name of the parameter to remove

:return:

schema

PanDatFactory.schema(include_ancillary_info=False)

Return a dictionary that summarizes the schema.

:param include_ancillary_info: if True, include all the foreign key, default, and data type information as well. Otherwise, just return table-fields dictionary

:return: a dictionary with table name mapping to a list of lists defining primary key fields and data fields. If include_ancillary_info, this table-fields dictionary is just one entry in a more comprehensive dictionary.

set_ampl_data

PanDatFactory.set_ampl_data(ampl_dat, ampl, table_to_set_name=None)

performs bulk setData on the AMPL-esque first argument.

:param ampl_dat: an AmplTicDat object created by calling copy_to_ampl

:param ampl: an amplpy.AMPL object

:param table_to_set_name: a mapping of table_name to ampl set name

:return:

set_data_type

PanDatFactory.set_data_type(table,
                            field,
                            number_allowed=True,
                            inclusive_min=True,
                            inclusive_max=False,
                            min=0,
                            max=inf,
                            must_be_int=False,
                            strings_allowed=(),
                            nullable=False,
                            datetime=False)

sets the data type for a field. By default, fields don't have types. Adding a data type doesn't block data of the wrong type from being entered. Data types are useful for recognizing errant data entries with find_data_type_failures(). Errant data entries can be replaced with replace_data_type_failures().

:param table: a table in the schema

:param field: a data field for this table

:param number_allowed: boolean does this field allow numbers?

:param inclusive_min: boolean : if number allowed, is the min inclusive?

:param inclusive_max: boolean : if number allowed, is the max inclusive?

:param min: if number allowed, the minimum value

:param max: if number allowed, the maximum value

:param must_be_int: boolean : if number allowed, must the number be integral?

:param strings_allowed: if a collection - then a list of the strings allowed. The empty collection prohibits strings. If a "*", then any string is accepted.

:param nullable: boolean : can this value contain null (aka None, aka nan, since pandas treats null as nan)

:param datetime: If truthy, then number_allowed through strings_allowed are ignored, and the data must be either a datetime.datetime object or a string that can be parsed into a datetime.datetime object. Note that the various readers will try to coerce strings into datetime.datetime objects on read for fields with datetime data types. pandas.Timestamp is itself a datetime.datetime, and the bias will be to create such an object.

:return:
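Two hedged examples (the Status and Last Updated fields are hypothetical):

# a data field restricted to a fixed set of strings, with numbers prohibited
pdf.set_data_type("foods", "Status", number_allowed=False,
                  strings_allowed=("active", "discontinued"))

# a nullable datetime field; readers will try to coerce parseable strings
pdf.set_data_type("foods", "Last Updated", datetime=True, nullable=True)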

set_default_value

PanDatFactory.set_default_value(table, field, default_value)

sets the default value for a specific field

:param table: a table in the schema

:param field: a field in the table

:param default_value: the default value to apply

Note - the data fields of a schema will have a default value of zero unless otherwise specified. The primary key fields will have no default at all (NOT None, but rather, no default). replace_data_type_failures will only perform replacements on fields for which there is a default, unless an explicit override is provided (see replace_data_type_failures for details). This is deliberate, since a bulk replacement in a primary key field is likely to create a duplication failure.

:return:

set_default_values

PanDatFactory.set_default_values(**table_defaults)

sets the default values for the fields

:param table_defaults: A dictionary of named arguments. Each argument name (i.e. each key) should be a table name. Each value should itself be a dictionary mapping data field names to default values.

Ex:

pdf.set_default_values(categories={"minNutrition": 0, "maxNutrition": float("inf")},
                       foods={"cost": 0},
                       nutritionQuantities={"qty": 0})

:return:

set_duplicates_ticdat_init

PanDatFactory.set_duplicates_ticdat_init(value)

Set the duplicates_ticdat_init for the PanDatFactory. Choices are:

--> 'assert' : an assert is raised if duplicate rows are passed to TicDat.__init__

--> 'warn' : emit a warning if duplicate rows are passed to TicDat.__init__

--> 'ignore' : don't do anything if duplicate rows are passed to TicDat.__init__

This is only relevant when using copy_to_tic_dat.

:param value: either 'assert', 'warn' or 'ignore'

:return:

set_infinity_io_flag

PanDatFactory.set_infinity_io_flag(value)

Set the infinity_io_flag for the PanDatFactory. 'N/A' (the default) is recognized as a flag to disable infinity I/O buffering.

If numeric, when writing data to the file system (or a database), float("inf") will be replaced by the infinity_io_flag and float("-inf") will be replaced by -infinity_io_flag, prior to writing. Similarly, on read, any number >= the infinity_io_flag will be replaced by float("inf") and any number <= -infinity_io_flag will be replaced by float("-inf").

If None, then +/- infinity will be replaced by None prior to writing. Similarly, subsequent to reading, None will be replaced either by float("inf") or float("-inf"), depending on field data types. Note that None flagging will only perform replacements on fields whose data types allow infinity and not None.

For all cases, these replacements will be done on a temporary copy of the data that is created prior to writing.

Also note that none of these replacements will be done on the parameters table. The assumption is that the parameters table will be serialized to a string/string database table. Infinity can thus be represented by "inf"/"-inf" in such serializations.

:param value: a valid infinity_io_flag

:return:
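For instance, a sketch using a numeric flag (the path is illustrative, and the csv writer's write_directory is assumed to be available as in the rest of the pandat IO interface):

# write float("inf") as 1e30; reads convert numbers >= 1e30 back to float("inf")
pdf.set_infinity_io_flag(1e30)
pdf.csv.write_directory(dat, "my_data_csvs")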

set_tooltip

PanDatFactory.set_tooltip(table, field, tooltip)

Set the tooltip for a table, or for a (table, field) pair.

:param table: a table in the schema

:param field: an empty string (if you want to set the tooltip for a table) or a field for this table

:param tooltip: an empty string (if you want to delete a previously set tooltip) or the tooltip you want to set

After calling this function, the tooltips property for this PanDatFactory will be appropriately adjusted.

:return:

set_xlsx_trailing_empty_rows

PanDatFactory.set_xlsx_trailing_empty_rows(value)

Set the xlsx_trailing_empty_rows for the PanDatFactory. Choices are:

--> 'prune' : (the default) when reading an xlsx/xlsm file, look for trailing all-pd.isnull rows in each table, and prune them

--> 'ignore' : retain such rows

With the move to openpyxl for xlsx/xlsm file reading, it's more likely that Excel users accidentally create trailing all-None rows.

:param value: either 'prune' or 'ignore'

:return: