rfwtools.example_set.ExampleSet

class rfwtools.example_set.ExampleSet(e_type=ExampleType.EXAMPLE, example_kwargs={}, known_zones=None, known_cavity_labels=None, known_fault_labels=None, req_columns=None)[source]

Bases: object

A class for managing a collection of examples, including metadata about the collection of examples.

Each ExampleSet supports having only one type of IExample object, and each only supports one set of kwargs.

This class has methods for building collections of examples from our standard label files or from the waveform browser webservice. It also includes many methods for visualizing and reporting.

known_zones

A list of strings identifying the minimum set of zone categories to be included in the categorical. The class version is the default set. The instance version is the set known to that instance.

known_cavity_labels

A list of strings identifying the minimum set of cavity label categories to be included in the categorical. The class version is the default set. The instance version is the set known to that instance.

known_fault_labels

A list of strings identifying the minimum set of fault label categories to be included in the categorical. The class version is the default set. The instance version is the set known to that instance.

__init__(e_type=ExampleType.EXAMPLE, example_kwargs={}, known_zones=None, known_cavity_labels=None, known_fault_labels=None, req_columns=None)[source]

Create an instance of an ExampleSet. Optionally override the default levels for zones and labels.

Parameters:
  • e_type (ExampleType) – The type of example that should be created within this ExampleSet

  • example_kwargs (dict) – A dictionary of keyword arguments that will be passed to every IExample object constructed.

  • known_zones (Optional[List[str]]) – A list of strings identifying the minimum set of zone categories to be included in the categorical.

  • known_cavity_labels (Optional[List[str]]) – A list of strings identifying the minimum set of cavity label categories to be included in the categorical.

  • known_fault_labels (Optional[List[str]]) – A list of strings identifying the minimum set of fault label categories to be included in the categorical.

  • req_columns (Optional[List[str]]) – A list of column names that are required to be in valid DataFrames used internally. These are in addition to the class-defined list of “zone”, “dtime”, etc.

Methods

__init__([e_type, example_kwargs, ...])

Create an instance of an ExampleSet.

add_label_file_data([label_files, ...])

Process and add label files' data to the ExampleSet's internal collection.

add_web_service_data([server, begin, end, ...])

Add web service data (faults labeled by in-service model) to the ExampleSet.

count_duplicated_events()

Count the number of events that appear multiple times, i.e., were labeled more than once.

count_duplicated_events_with_mismatched_labels()

Count the number of events that appear multiple times with different labels.

count_duplicated_labels()

Count the number of labeling occurrences for events that appear multiple times.

count_events()

Count the number of unique events (zone/datetime combinations).

count_labels()

Counts the number of labels (rows in label files)

count_mismatched_labels()

Count the number of times an event with mismatched labels appears in the ExampleSet.

count_unduplicated_events()

Count the number of events that appear exactly once.

display_examples_by_weekday_barplot([...])

Show example counts by the day of the week as a stacked barplot

display_frequency_barplot(x[, color_by, ...])

Display the example count against one or two different factors, as a (stacked) bar chart.

display_summary_label_heatmap([title, query])

Display a heatmap of fault vs cavity labels for all examples in this object

display_timeline([query])

Display a timeline of examples as a swarmplot

display_zone_label_heatmap([zones, query])

Display a heatmap of fault vs cavity labels for all examples in this object for each unique zone category

get_classification_report(other[, label, ...])

This prints a classification report of this ExampleSet's cavity labels considering other as ground truth.

get_duplicated_labels()

Identify the fault events that appear multiple times in the ExampleSet.

get_events_with_mismatched_labels()

Identify fault events that appear multiple times with different labels.

get_example_df()

Returns the example set as a DataFrame (copy)

get_label_file_report()

Generate a string containing a report on the processed label files

get_required_columns()

Generates the list of column names that must appear in a DataFrame for it to be a valid example_df.

get_unduplicated_events()

Identify the fault events that appear exactly once in the ExampleSet.

has_required_columns(df[, dtypes, skip_example])

Check if the given DataFrame has the required columns.

load_csv(filename[, in_dir, sep])

Read in a CSV file that has ExampleSet data.

purge_invalid_examples(validator[, report, ...])

Removes all examples from the ExampleSet that do not pass validation

remove_duplicates_and_mismatches([report])

Removes duplicate example entries and removes all instances of examples that have mismatched labels.

save_csv(filename[, out_dir, sep])

Write out the ExampleSet data as a CSV file relative to out_dir.

update_example_set(df[, ...])

Replaces the contents of this ExampleSet with the supplied DataFrame.

Attributes

label_file_dataframes

A dictionary holding label file contents, keyed on file names

add_label_file_data(label_files=None, exclude_zones=None, exclude_times=None)[source]

Process and add label files’ data to the ExampleSet’s internal collection.

Parameters:
  • label_files (Optional[List[str]]) – List of label files to process. If None, all files in Config().label_dir are read. Relative paths are resolved relative to Config().label_dir.

  • exclude_zones (Optional[List[str]]) – List of zones to exclude. Defaults to Config().exclude_zones.

  • exclude_times (Optional[List[Tuple[datetime, datetime]]]) – List of 2-tuples of datetime objects. Each 2-tuple specifies a range to exclude. A None entry implies +/-Inf.

Return type:

None
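The exclude_times semantics can be illustrated with a small standard-library sketch. It does not call rfwtools; the helper name is_excluded is hypothetical, and treating both endpoints as inclusive is an assumption, since the text above does not say how boundaries are handled.

```python
from datetime import datetime
from typing import List, Optional, Tuple

# A range is a (start, end) pair; None on either side means "unbounded".
TimeRange = Tuple[Optional[datetime], Optional[datetime]]


def is_excluded(dtime: datetime, exclude_times: Optional[List[TimeRange]]) -> bool:
    """Return True if dtime falls inside any exclusion range.

    Assumption: endpoints are inclusive. The real library may differ.
    """
    if not exclude_times:
        return False
    for start, end in exclude_times:
        after_start = start is None or dtime >= start
        before_end = end is None or dtime <= end
        if after_start and before_end:
            return True
    return False


# Exclude everything up to 2019-01-01, plus one day in September 2020.
ranges = [
    (None, datetime(2019, 1, 1)),
    (datetime(2020, 9, 22), datetime(2020, 9, 23)),
]
fault_times = [
    datetime(2018, 6, 1),
    datetime(2020, 9, 21, 6, 53, 16, 500000),
    datetime(2020, 9, 22, 6, 53, 17, 500000),
]
kept = [t for t in fault_times if not is_excluded(t, ranges)]
print(kept)
```

Only the 2020-09-21 timestamp survives: the first is caught by the open-ended range and the third by the one-day range.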

add_web_service_data(server=None, begin=None, end=None, models=None)[source]

Add web service data (faults labeled by in-service model) to the ExampleSet.

Note: This should be used instead of label file data, not alongside it, since the two sources will largely overlap.

Parameters:
  • server (Optional[str]) – The server to query for the data. If None, use the value in Config

  • begin (Optional[datetime]) – The earliest time for which a fault should be included. If None, defaults to Jan 1, 2018

  • end (Optional[datetime]) – The latest time for which a fault should be included. If None, defaults to “now”

  • models (Optional[List[str]]) – A list of model names that should be included in the results. None means include all

Return type:

None

count_duplicated_events()[source]

Count the number of events that appear multiple times, i.e., were labeled more than once.

Return type:

int

Returns:

the number of events that appear multiple times, i.e., were labeled more than once.

This would count as one since only one event appeared that did occur multiple times:

    4240  2L25  2020-09-21 06:53:16.500  5  E_Quench
    4241  2L26  2020-09-22 06:53:17.500  6  E_Quench
    4242  2L26  2020-09-22 06:53:17.500  6  E_Quench

count_duplicated_events_with_mismatched_labels()[source]

Count the number of events that appear multiple times with different labels.

Return type:

int

Returns:

The number of events that appear multiple times with different labels

This would count as one since one event appeared that had mismatched labels:

    4240  2L26  2020-09-21 06:53:16.500  5  E_Quench
    4241  2L26  2020-09-21 06:53:16.500  6  E_Quench
    4242  2L26  2020-09-21 06:53:16.500  6  E_Quench

count_duplicated_labels()[source]

Count the number of labeling occurrences for events that appear multiple times.

This is basically the number of rows in the label files that are not for unique fault events.

Return type:

int

Returns:

The number of labeling occurrences for events that appear multiple times

count_events()[source]

Count the number of unique events (zone/datetime combinations).

This would count as two since two unique zone/datetime pairs appeared:

    4240  2L25  2020-09-21 06:53:16.500  5  E_Quench
    4241  2L26  2020-09-22 06:53:17.500  6  E_Quench
    4242  2L26  2020-09-22 06:53:17.500  6  E_Quench

Return type:

int

Returns:

the number of unique events (zone/datetime combinations)
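To show how the event and label counts relate, here is a minimal standard-library sketch applied to the three example rows used for count_events(). It does not call rfwtools. The column order (zone, dtime, cavity label, fault label) is inferred from the example rows and is an assumption, as are the variable names.

```python
from collections import Counter

# Rows from the count_events() example. Assumed column order:
# (zone, dtime, cavity_label, fault_label); the leading index is dropped.
labels = [
    ("2L25", "2020-09-21 06:53:16.500", "5", "E_Quench"),
    ("2L26", "2020-09-22 06:53:17.500", "6", "E_Quench"),
    ("2L26", "2020-09-22 06:53:17.500", "6", "E_Quench"),
]

# An event is identified by its zone/datetime pair.
occurrences = Counter((zone, dtime) for zone, dtime, _, _ in labels)

n_labels = len(labels)                                          # rows
n_events = len(occurrences)                                     # unique events
n_duplicated_events = sum(1 for n in occurrences.values() if n > 1)
n_unduplicated_events = sum(1 for n in occurrences.values() if n == 1)
n_duplicated_labels = sum(n for n in occurrences.values() if n > 1)

print(n_labels, n_events, n_duplicated_events,
      n_unduplicated_events, n_duplicated_labels)
```

Under this reading, the three rows give 3 labels, 2 events, 1 duplicated event, 1 unduplicated event, and 2 duplicated labels, which matches the counts stated in the examples for count_events(), count_duplicated_events(), and count_unduplicated_events().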

count_labels()[source]

Counts the number of labels (rows in label files)

Return type:

int

Returns:

the number of labels (rows in label files)

count_mismatched_labels()[source]

Count the number of times an event with mismatched labels appears in the ExampleSet.

Return type:

int

Returns:

The number of times an event with mismatched labels appears in the ExampleSet.

This would count as three mismatched labels since one event with mismatched labels appeared three times:

    4240  2L26  2020-09-21 06:53:16.500  5  E_Quench
    4241  2L26  2020-09-21 06:53:16.500  6  E_Quench
    4242  2L26  2020-09-21 06:53:16.500  6  E_Quench
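The difference between counting mismatched events and counting mismatched labels can be sketched with the standard library alone. This is an illustration of the definitions above, not the rfwtools implementation. The column order is assumed, and so is the idea that a mismatch in either the cavity or the fault label counts.

```python
from collections import defaultdict

# Rows from the example above. Assumed column order:
# (zone, dtime, cavity_label, fault_label).
labels = [
    ("2L26", "2020-09-21 06:53:16.500", "5", "E_Quench"),
    ("2L26", "2020-09-21 06:53:16.500", "6", "E_Quench"),
    ("2L26", "2020-09-21 06:53:16.500", "6", "E_Quench"),
]

# Collect every labeling recorded for each zone/datetime event.
labelings = defaultdict(list)
for zone, dtime, cavity, fault in labels:
    labelings[(zone, dtime)].append((cavity, fault))

# An event is mismatched when its labelings are not all identical.
mismatched = {ev: ls for ev, ls in labelings.items() if len(set(ls)) > 1}

n_mismatched_events = len(mismatched)                            # events
n_mismatched_labels = sum(len(ls) for ls in mismatched.values()) # rows

print(n_mismatched_events, n_mismatched_labels)
```

One event carries conflicting cavity labels (5 and 6), so the event count is 1 while the label count is 3, in line with the examples for count_duplicated_events_with_mismatched_labels() and count_mismatched_labels().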

count_unduplicated_events()[source]

Count the number of events that appear exactly once.

Return type:

int

Returns:

The number of events that appear exactly once.

This would count as one since only one event appeared that did not occur multiple times:

    4240  2L25  2020-09-21 06:53:16.500  5  E_Quench
    4241  2L26  2020-09-22 06:53:17.500  6  E_Quench
    4242  2L26  2020-09-22 06:53:17.500  6  E_Quench

display_examples_by_weekday_barplot(color_by=None, title=None, query=None)[source]

Show example counts by the day of the week as a stacked barplot

Parameters:
  • color_by (Optional[str]) – The DataFrame column on which the bars will be split/colored.

  • title (Optional[str]) – The title to put on the plot. A reasonable default will be generated if None.

  • query (Optional[str]) – The expr argument to DataFrame.query. Subsets data before plot

Return type:

None
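The tally behind a stacked weekday barplot can be sketched without any plotting library. The snippet below only shows the grouping step (weekday, then the color_by column) and does not draw anything or call rfwtools. The timestamps are taken from the examples on this page, and using the fault label as the color_by column is an assumption.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# (dtime, fault_label) pairs; fault_label plays the role of color_by.
examples = [
    ("2020-09-21 06:53:16.500", "E_Quench"),
    ("2020-09-22 06:53:17.500", "E_Quench"),
    ("2020-09-22 06:53:17.500", "E_Quench"),
]

# Each (weekday, label) pair is one colored segment of a stacked bar.
counts = Counter()
for dtime, fault in examples:
    parsed = datetime.strptime(dtime, "%Y-%m-%d %H:%M:%S.%f")
    counts[(WEEKDAYS[parsed.weekday()], fault)] += 1

for (day, fault), n in sorted(counts.items()):
    print(day, fault, n)
```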

display_frequency_barplot(x, color_by=None, title=None, query=None)[source]

Display the example count against one or two different factors, as a (stacked) bar chart.

Parameters:
  • x (str) – The column name for which each bar will appear. Should probably be categorical.

  • color_by (Optional[str]) – The column name by which each bar will be split and colored (for a stacked bar plot). If None, then a simple bar plot will be displayed.

  • title (Optional[str]) – The title to put on the chart. If None, a reasonable default will be generated.

  • query (Optional[str]) – The expr argument to DataFrame.query. Subsets data before plot

Return type:

None

display_summary_label_heatmap(title='Label Summary', query=None)[source]

Display a heatmap of fault vs cavity labels for all examples in this object

Parameters:
  • title (str) – The title of the plot

  • query (Optional[str]) – The expr argument to DataFrame.query. Subsets data before plot

Return type:

None

display_timeline(query=None, **kwargs)[source]

Display a timeline of examples as a swarmplot

Parameters:
  • query (Optional[str]) – The expr argument to DataFrame.query. Subsets data before plot

  • kwargs – Other named parameters are passed to swarm_timeline method

Return type:

None

display_zone_label_heatmap(zones=None, query=None)[source]

Display a heatmap of fault vs cavity labels for all examples in this object for each unique zone category

Parameters:
  • zones (Optional[List[str]]) – A list of the zones to display.

  • query (Optional[str]) – The expr argument to DataFrame.query. Subsets data before plot

Return type:

None

get_classification_report(other, label='cavity_label', query=None, other_query=None)[source]

This prints a classification report of this ExampleSet’s cavity labels considering other as ground truth.

Only examples from other for which there is an example in this ExampleSet are considered

Parameters:
  • other (ExampleSet) – An ExampleSet that contains cavity labels considered the ground truth.

  • label (str) – The name of the column containing the label values to compare.

  • query (str) –

  • other_query (str) –
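The matching step described above (only events present in both sets are compared) can be sketched as follows. This is a standard-library illustration, not the rfwtools implementation. The data is made up for the example, the zone "1L10" is hypothetical, and the sketch reports plain accuracy rather than a full per-class report.

```python
# This ExampleSet's cavity labels, keyed by (zone, dtime). Illustrative data.
predicted = {
    ("2L25", "2020-09-21 06:53:16.500"): "5",
    ("2L26", "2020-09-22 06:53:17.500"): "6",
}

# Ground truth from `other`. The last event has no counterpart in
# `predicted`, so it is ignored in the comparison.
truth = {
    ("2L25", "2020-09-21 06:53:16.500"): "5",
    ("2L26", "2020-09-22 06:53:17.500"): "7",
    ("1L10", "2020-09-23 01:00:00.000"): "1",
}

# Keep only the events that appear in both sets.
shared = sorted(set(predicted) & set(truth))
y_true = [truth[event] for event in shared]
y_pred = [predicted[event] for event in shared]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(shared)
print(len(shared), accuracy)
```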

get_duplicated_labels()[source]

Identify the fault events that appear multiple times in the ExampleSet.

Return type:

DataFrame

Returns:

A DataFrame containing labels for events that appear multiple times

get_events_with_mismatched_labels()[source]

Identify fault events that appear multiple times with different labels.

Return type:

DataFrame

Returns:

A DataFrame containing the events that have mismatched labels

get_example_df()[source]

Returns the example set as a DataFrame (copy)

Return type:

DataFrame

Returns:

A copy of the internal ExampleSet DataFrame

get_label_file_report()[source]

Generate a string containing a report on the processed label files

Return type:

str

Returns:

A formatted string containing the report.

get_required_columns()[source]

Generates the list of column names that must appear in a DataFrame for it to be a valid example_df.

Return type:

List[str]

Returns:

The list of column names

get_unduplicated_events()[source]

Identify the fault events that appear exactly once in the ExampleSet.

Return type:

DataFrame

Returns:

DataFrame of the events that appear exactly once in the ExampleSet

has_required_columns(df, dtypes=False, skip_example=True)[source]

Check if the given DataFrame has the required columns.

Parameters:
  • df (DataFrame) – The DataFrame to check

  • dtypes (bool) – Check for matching dtypes of “mandatory” columns. Uses existing _example_df’s dtypes. Skips if _examples_df is None.

  • skip_example (bool) – If True, requires that the ‘example’ column name is present. Otherwise it is not checked.

Return type:

bool

Returns:

True if all required column names are present. False otherwise.

label_file_dataframes

A dictionary holding label file contents, keyed on file names

load_csv(filename, in_dir=None, sep=',')[source]

Read in a CSV file that has ExampleSet data.

Treats the ‘#’ character as the start of a comment. This includes the rfwtools-generated headers written by save_csv().

Parameters:
  • filename (str) – The filename to load. Interpreted relative to in_dir

  • in_dir (Optional[str]) – The directory to find the file in. Defaults to Config().output_dir

  • sep (str) – Delimiter string used by Pandas to parse given “csv” file

Raises:

ValueError – If the CSV file does not have the expected column names.

Return type:

None
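The comment handling and column check described above can be sketched with the standard library. This is not the rfwtools implementation, which uses Pandas. The helper name read_commented_csv and the REQUIRED list are hypothetical, and the sketch does not handle a '#' inside a quoted field.

```python
import csv

# Assumed subset of the required columns, for illustration only.
REQUIRED = ["zone", "dtime"]


def read_commented_csv(text, sep=","):
    """Parse CSV text, treating '#' as the start of a comment."""
    data_lines = []
    for line in text.splitlines():
        # Drop everything from '#' onward, then skip lines left empty.
        content = line.split("#", 1)[0].strip()
        if content:
            data_lines.append(content)

    reader = csv.DictReader(data_lines, delimiter=sep)
    missing = [c for c in REQUIRED if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"CSV is missing expected columns: {missing}")
    return list(reader)


sample = """# hypothetical comment header
zone,dtime,cavity_label,fault_label
2L25,2020-09-21 06:53:16.500,5,E_Quench
2L26,2020-09-22 06:53:17.500,6,E_Quench
"""
rows = read_commented_csv(sample)
print(rows[0]["zone"], len(rows))
```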

purge_invalid_examples(validator, report=True, progress=True)[source]

Removes all examples from the ExampleSet that do not pass validation

Parameters:
  • validator (ExampleValidator) – An object that follows the ExampleValidator interface.

  • report (bool) – Should information about what is purged be printed?

  • progress (bool) – Should a progress bar be displayed

Return type:

None

remove_duplicates_and_mismatches(report=False)[source]

Removes duplicate example entries and removes all instances of examples that have mismatched labels.

Parameters:

report (bool) – Should information about what was removed be included?

Return type:

None
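The cleaning rule can be sketched as: collapse exact duplicates to a single row, and drop every row of any event whose labels disagree. The snippet below is a standard-library illustration, not the rfwtools implementation. The helper name is hypothetical and the column order (zone, dtime, cavity label, fault label) is assumed.

```python
from collections import defaultdict


def dedupe_and_drop_mismatches(rows):
    """Keep one row per consistently labeled event; drop mismatched events."""
    # Gather the distinct labelings seen for each zone/datetime event.
    by_event = defaultdict(set)
    for zone, dtime, cavity, fault in rows:
        by_event[(zone, dtime)].add((cavity, fault))

    cleaned = []
    for (zone, dtime), labelings in by_event.items():
        if len(labelings) == 1:  # all labelings agree
            cavity, fault = next(iter(labelings))
            cleaned.append((zone, dtime, cavity, fault))
    return cleaned


rows = [
    ("2L25", "2020-09-21 06:53:16.500", "5", "E_Quench"),  # unique
    ("2L26", "2020-09-22 06:53:17.500", "6", "E_Quench"),  # duplicated
    ("2L26", "2020-09-22 06:53:17.500", "6", "E_Quench"),
    ("2L26", "2020-09-21 06:53:16.500", "5", "E_Quench"),  # mismatched
    ("2L26", "2020-09-21 06:53:16.500", "6", "E_Quench"),
]
cleaned = dedupe_and_drop_mismatches(rows)
print(cleaned)
```

Of the five input rows, two remain: the unique 2L25 event and one copy of the consistently labeled 2L26 event.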

save_csv(filename, out_dir=None, sep=',')[source]

Write out the ExampleSet data as a CSV file relative to out_dir. Only writes out example_df equivalent.

This also writes out a comment header section that includes information about ExampleSet parameters at the time the file was written.

Parameters:
  • filename (str) – The filename to save. Interpreted relative to out_dir

  • out_dir (Optional[str]) – The directory to save the file in. Defaults to Config().output_dir

  • sep (str) – Delimiter string used by Pandas to write the “csv” file

Return type:

None
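A file layout with a comment header followed by the delimited data can be sketched as follows. The header format shown here ("# key: value") is hypothetical, since the exact header that save_csv() writes is not specified on this page, and write_commented_csv is an illustrative helper rather than part of rfwtools.

```python
import csv
import io


def write_commented_csv(rows, columns, params, sep=","):
    """Return CSV text preceded by '#' comment lines describing params."""
    buf = io.StringIO()
    # Hypothetical header format: one "# key: value" line per parameter.
    for key, value in params.items():
        buf.write(f"# {key}: {value}\n")
    writer = csv.writer(buf, delimiter=sep, lineterminator="\n")
    writer.writerow(columns)
    writer.writerows(rows)
    return buf.getvalue()


text = write_commented_csv(
    rows=[("2L25", "2020-09-21 06:53:16.500", "5", "E_Quench")],
    columns=["zone", "dtime", "cavity_label", "fault_label"],
    params={"e_type": "ExampleType.EXAMPLE", "example_kwargs": {}},
)
print(text)
```

Because every header line starts with '#', a reader that treats '#' as a comment character, as load_csv() is described to do, skips the header and sees only the column row and the data.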

update_example_set(df, keep_label_file_dataframes=False)[source]

Replaces the contents of this ExampleSet with the supplied DataFrame.

Note: A copy of df is used.

Parameters:
  • df (DataFrame) – A DataFrame formatted for ExampleSet that will replace the contents of this ExampleSet.

  • keep_label_file_dataframes (bool) – Whether the dictionary of label file DataFrames should be kept. If False, the dictionary is recreated. If True, no action is taken.

Raises:

ValueError – If columns do not match

Return type:

None