rfwtools.example_set.ExampleSet
- class rfwtools.example_set.ExampleSet(e_type=ExampleType.EXAMPLE, example_kwargs={}, known_zones=None, known_cavity_labels=None, known_fault_labels=None, req_columns=None)[source]
Bases: object
A class for managing a collection of examples, including metadata about the collection of examples.
Each ExampleSet holds only one type of IExample object and supports only one set of kwargs for constructing them.
This class has methods for building collections of examples from our standard label files or from the waveform browser webservice. It also includes many methods for visualizing and reporting.
- known_zones
A list of strings identifying the minimum set of zone categories to be included in the categorical. The class version is the default set. The instance version is the set known to that instance.
- known_cavity_labels
A list of strings identifying the minimum set of cavity label categories to be included in the categorical. The class version is the default set. The instance version is the set known to that instance.
- known_fault_labels
A list of strings identifying the minimum set of fault label categories to be included in the categorical. The class version is the default set. The instance version is the set known to that instance.
- __init__(e_type=ExampleType.EXAMPLE, example_kwargs={}, known_zones=None, known_cavity_labels=None, known_fault_labels=None, req_columns=None)[source]
Create an instance of an ExampleSet. Optionally override the default levels for zones and labels.
- Parameters:
e_type (ExampleType) – The type of example that should be created within this ExampleSet.
example_kwargs (dict) – A dictionary of keyword arguments that will be passed to every IExample object constructed.
known_zones (Optional[List[str]]) – A list of strings identifying the minimum set of zone categories to be included in the categorical.
known_cavity_labels (Optional[List[str]]) – A list of strings identifying the minimum set of cavity label categories to be included in the categorical.
known_fault_labels (Optional[List[str]]) – A list of strings identifying the minimum set of fault label categories to be included in the categorical.
req_columns (Optional[List[str]]) – A list of column names that are required to be in valid DataFrames used internally. These are in addition to the class-defined list of “zone”, “dtime”, etc.
Methods
__init__([e_type, example_kwargs, ...]) – Create an instance of an ExampleSet.
add_label_file_data([label_files, ...]) – Process and add label files' data to the ExampleSet's internal collection.
add_web_service_data([server, begin, end, ...]) – Add web service data (faults labeled by in-service model) to the ExampleSet.
count_duplicated_events() – Count the number of events that appear multiple times, i.e., were labeled more than once.
count_duplicated_events_with_mismatched_labels() – Count the number of events that appear multiple times with different labels.
count_duplicated_labels() – Count the number of labeling occurrences for events that appear multiple times.
count_events() – Count the number of unique events (zone/datetime combinations).
count_labels() – Count the number of labels (rows in label files).
count_mismatched_labels() – Count the number of times an event with mismatched labels appears in the ExampleSet.
count_unduplicated_events() – Count the number of events that appear exactly once.
display_examples_by_weekday_barplot([color_by, title, query]) – Show example counts by the day of the week as a stacked barplot.
display_frequency_barplot(x[, color_by, ...]) – Display the example count against one or two different factors, as a (stacked) bar chart.
display_summary_label_heatmap([title, query]) – Display a heatmap of fault vs cavity labels for all examples in this object.
display_timeline([query]) – Display a timeline of examples as a swarmplot.
display_zone_label_heatmap([zones, query]) – Display a heatmap of fault vs cavity labels for all examples in this object for each unique zone category.
get_classification_report(other[, label, ...]) – Print a classification report of this ExampleSet's cavity labels considering other as ground truth.
get_duplicated_labels() – Identify the fault events that appear multiple times in the ExampleSet.
get_events_with_mismatched_labels() – Identify fault events that appear multiple times with different labels.
get_example_df() – Return the example set as a DataFrame (copy).
get_label_file_report() – Generate a string containing a report on the processed label files.
get_required_columns() – Generate the list of column names that must appear in a DataFrame for it to be a valid example_df.
get_unduplicated_events() – Identify the fault events that appear exactly once in the ExampleSet.
has_required_columns(df[, dtypes, skip_example]) – Check if the given DataFrame has the required columns.
load_csv(filename[, in_dir, sep]) – Read in a CSV file that has ExampleSet data.
purge_invalid_examples(validator[, report, ...]) – Remove all examples from the ExampleSet that do not pass validation.
remove_duplicates_and_mismatches([report]) – Remove duplicate example entries and remove all instances of examples that have mismatched labels.
save_csv(filename[, out_dir, sep]) – Write out the ExampleSet data as a CSV file relative to out_dir.
update_example_set(df[, ...]) – Replace the contents of this ExampleSet with the supplied DataFrame.
Attributes
label_file_dataframes – A dictionary holding label file contents, keyed on file names
- add_label_file_data(label_files=None, exclude_zones=None, exclude_times=None)[source]
Process and add label files’ data to the ExampleSet’s internal collection.
- Parameters:
label_files (Optional[List[str]]) – List of label files to process. If None, all files in Config().label_dir are read. Relative paths are resolved relative to Config().label_dir.
exclude_zones (Optional[List[str]]) – List of zones to exclude. Defaults to Config().exclude_zones.
exclude_times (Optional[List[Tuple[datetime, datetime]]]) – List of 2-tuples of datetime objects. Each 2-tuple specifies a range to exclude. None implies +/-Inf.
- Return type:
None
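One way to read the exclude_times description is sketched below in plain Python. The helper name is_excluded and the TimeRange alias are hypothetical and not part of rfwtools, and treating both endpoints as inclusive is an assumption. Only the "None means unbounded" behavior comes from the parameter description.

```python
from datetime import datetime
from typing import List, Optional, Tuple

# Hypothetical alias: either bound may be None to mean "unbounded".
TimeRange = Tuple[Optional[datetime], Optional[datetime]]


def is_excluded(dtime: datetime, exclude_times: Optional[List[TimeRange]]) -> bool:
    """Return True if dtime falls inside any excluded range.

    Assumption: both endpoints are inclusive. A None start is treated as
    -Inf and a None end as +Inf.
    """
    if not exclude_times:
        return False
    for start, end in exclude_times:
        after_start = start is None or dtime >= start
        before_end = end is None or dtime <= end
        if after_start and before_end:
            return True
    return False


ranges = [
    (None, datetime(2019, 1, 1)),                  # everything up to 2019-01-01
    (datetime(2020, 3, 1), datetime(2020, 4, 1)),  # a bounded window
]

early = is_excluded(datetime(2018, 6, 1), ranges)
inside = is_excluded(datetime(2020, 3, 15), ranges)
outside = is_excluded(datetime(2020, 9, 21), ranges)
```

Under these assumptions, the first two timestamps would be excluded and the third would be kept.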
- add_web_service_data(server=None, begin=None, end=None, models=None)[source]
Add web service data (faults labeled by in-service model) to the ExampleSet.
Note: This should be used instead of, not together with, label file data, since the two sources will largely overlap.
- Parameters:
server (Optional[str]) – The server to query for the data. If None, use the value in Config.
begin (Optional[datetime]) – The earliest time for which a fault should be included. If None, defaults to Jan 1, 2018.
end (Optional[datetime]) – The latest time for which a fault should be included. If None, defaults to “now”.
models (Optional[List[str]]) – A list of model names that should be included in the results. None means include all.
- Return type:
None
- count_duplicated_events()[source]
Count the number of events that appear multiple times, i.e., were labeled more than once.
- Return type:
int
- Returns:
The number of events that appear multiple times, i.e., were labeled more than once.
This would count as one, since only one event appeared multiple times:
4240 2L25 2020-09-21 06:53:16.500 5 E_Quench
4241 2L26 2020-09-22 06:53:17.500 6 E_Quench
4242 2L26 2020-09-22 06:53:17.500 6 E_Quench
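The counting rule for these three rows can be sketched in plain Python. This is an illustration only, not the rfwtools implementation (which works on its internal DataFrame). The meaning assigned to each field (row id, zone, datetime, cavity label, fault label) is an assumption about the row layout. The same rows also give the counts documented for count_events() and count_unduplicated_events().

```python
from collections import Counter

# Rows copied from the example. The field meanings (row id, zone, datetime,
# cavity label, fault label) are an assumption about the layout.
rows = [
    (4240, "2L25", "2020-09-21 06:53:16.500", "5", "E_Quench"),
    (4241, "2L26", "2020-09-22 06:53:17.500", "6", "E_Quench"),
    (4242, "2L26", "2020-09-22 06:53:17.500", "6", "E_Quench"),
]

# An event is identified by its zone/datetime pair.
occurrences = Counter((zone, dtime) for _, zone, dtime, _, _ in rows)

n_events = len(occurrences)                                       # unique events
n_duplicated = sum(1 for n in occurrences.values() if n > 1)      # labeled more than once
n_unduplicated = sum(1 for n in occurrences.values() if n == 1)   # labeled exactly once

print(n_events, n_duplicated, n_unduplicated)  # 2 1 1
```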
- count_duplicated_events_with_mismatched_labels()[source]
Count the number of events that appear multiple times with different labels.
- Return type:
int
- Returns:
The number of events that appear multiple times with different labels
This would count as one, since one event appeared that had mismatched labels:
4240 2L26 2020-09-21 06:53:16.500 5 E_Quench
4241 2L26 2020-09-21 06:53:16.500 6 E_Quench
4242 2L26 2020-09-21 06:53:16.500 6 E_Quench
- count_duplicated_labels()[source]
Count the number of labeling occurrences for events that appear multiple times.
This is basically the number of rows in the label files that are not for unique fault events.
- Return type:
int
- Returns:
The number of labeling occurrences for events that appear multiple times
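The difference between counting duplicated events and counting duplicated labels can be sketched as follows. The rows are hypothetical, and the reading that a "duplicated label" is any row belonging to an event labeled more than once is an interpretation of the description above, not confirmed behavior.

```python
from collections import Counter

# Hypothetical label rows as (zone, datetime, cavity label, fault label).
labels = [
    ("2L25", "2020-09-21 06:53:16.500", "5", "E_Quench"),
    ("2L26", "2020-09-22 06:53:17.500", "6", "E_Quench"),
    ("2L26", "2020-09-22 06:53:17.500", "6", "E_Quench"),
    ("1L07", "2020-09-23 10:00:00.000", "3", "E_Quench"),
    ("1L07", "2020-09-23 10:00:00.000", "3", "E_Quench"),
    ("1L07", "2020-09-23 10:00:00.000", "3", "E_Quench"),
]

per_event = Counter((zone, dtime) for zone, dtime, _, _ in labels)

# Rows that belong to an event labeled more than once.
n_duplicated_labels = sum(n for n in per_event.values() if n > 1)
# Events labeled more than once, for comparison.
n_duplicated_events = sum(1 for n in per_event.values() if n > 1)

print(n_duplicated_labels, n_duplicated_events)  # 5 2
```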
- count_events()[source]
Count the number of unique events (zone/datetime combinations).
This would count as two, since two unique zone/datetime pairs appeared:
4240 2L25 2020-09-21 06:53:16.500 5 E_Quench
4241 2L26 2020-09-22 06:53:17.500 6 E_Quench
4242 2L26 2020-09-22 06:53:17.500 6 E_Quench
- Return type:
int
- Returns:
The number of unique events (zone/datetime combinations)
- count_labels()[source]
Counts the number of labels (rows in label files)
- Return type:
int
- Returns:
The number of labels (rows in label files)
- count_mismatched_labels()[source]
Count the number of times an event with mismatched labels appears in the ExampleSet.
- Return type:
int
- Returns:
The number of times an event with mismatched labels appears in the ExampleSet.
This would count as three mismatched labels, since one event with mismatched labels appeared three times:
4240 2L26 2020-09-21 06:53:16.500 5 E_Quench
4241 2L26 2020-09-21 06:53:16.500 6 E_Quench
4242 2L26 2020-09-21 06:53:16.500 6 E_Quench
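The two mismatch counts (one mismatched event, three mismatched labels) can be reproduced for these rows with a short plain-Python sketch. It is not the rfwtools implementation, and the field meanings (row id, zone, datetime, cavity label, fault label) are an assumption about the row layout.

```python
from collections import defaultdict

# Rows copied from the example. Field meanings are assumed.
rows = [
    (4240, "2L26", "2020-09-21 06:53:16.500", "5", "E_Quench"),
    (4241, "2L26", "2020-09-21 06:53:16.500", "6", "E_Quench"),
    (4242, "2L26", "2020-09-21 06:53:16.500", "6", "E_Quench"),
]

# Collect every (cavity label, fault label) pair seen for each event.
labels_by_event = defaultdict(list)
for _, zone, dtime, cavity, fault in rows:
    labels_by_event[(zone, dtime)].append((cavity, fault))

# An event is mismatched when it carries more than one distinct label pair.
mismatched = {
    event: labels
    for event, labels in labels_by_event.items()
    if len(set(labels)) > 1
}

n_mismatched_events = len(mismatched)
n_mismatched_labels = sum(len(labels) for labels in mismatched.values())

print(n_mismatched_events, n_mismatched_labels)  # 1 3
```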
- count_unduplicated_events()[source]
Count the number of events that appear exactly once.
- Return type:
int
- Returns:
The number of events that appear exactly once.
This would count as one, since only one event appeared that did not occur multiple times:
4240 2L25 2020-09-21 06:53:16.500 5 E_Quench
4241 2L26 2020-09-22 06:53:17.500 6 E_Quench
4242 2L26 2020-09-22 06:53:17.500 6 E_Quench
- display_examples_by_weekday_barplot(color_by=None, title=None, query=None)[source]
Show example counts by the day of the week as a stacked barplot
- Parameters:
color_by (Optional[str]) – The DataFrame column on which the bars will be split/colored.
title (Optional[str]) – The title to put on the plot. A reasonable default will be generated if None.
query (Optional[str]) – The expr argument to DataFrame.query. Subsets the data before plotting.
- Return type:
None
- display_frequency_barplot(x, color_by=None, title=None, query=None)[source]
Display the example count against one or two different factors, as a (stacked) bar chart.
- Parameters:
x (str) – The column name for which each bar will appear. Should probably be categorical.
color_by (Optional[str]) – The column name by which each bar will be split and colored (for a stacked bar plot). If None, then a simple bar plot will be displayed.
title (Optional[str]) – The title to put on the chart. If None, a reasonable default will be generated.
query (Optional[str]) – The expr argument to DataFrame.query. Subsets the data before plotting.
- Return type:
None
- display_summary_label_heatmap(title='Label Summary', query=None)[source]
Display a heatmap of fault vs cavity labels for all examples in this object
- Parameters:
title (str) – The title of the plot.
query (Optional[str]) – The expr argument to DataFrame.query. Subsets the data before plotting.
- Return type:
None
- display_timeline(query=None, **kwargs)[source]
Display a timeline of examples as a swarmplot
- Parameters:
query (Optional[str]) – The expr argument to DataFrame.query. Subsets the data before plotting.
kwargs – Other named parameters are passed to the swarm_timeline method.
- Return type:
None
- display_zone_label_heatmap(zones=None, query=None)[source]
Display a heatmap of fault vs cavity labels for all examples in this object for each unique zone category
- Parameters:
zones (Optional[List[str]]) – A list of the zones to display.
query (Optional[str]) – The expr argument to DataFrame.query. Subsets the data before plotting.
- Return type:
None
- get_classification_report(other, label='cavity_label', query=None, other_query=None)[source]
This prints a classification report of this ExampleSet’s cavity labels considering other as ground truth.
Only examples from other for which there is an example in this ExampleSet are considered
- Parameters:
other (ExampleSet) – An ExampleSet that contains cavity labels considered the ground truth.
label (str) – The name of the column containing the label values to compare.
query (str) –
other_query (str) –
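The restriction to shared examples can be illustrated with a small plain-Python sketch. Everything here is hypothetical: the data, the use of zone/datetime pairs as the matching key, and the single accuracy figure standing in for a full report. The contents of the actual report are not specified on this page.

```python
# Hypothetical data: each mapping goes from (zone, datetime) to a cavity label.
predicted = {
    ("2L25", "2020-09-21 06:53:16.500"): "5",
    ("2L26", "2020-09-22 06:53:17.500"): "6",
}
ground_truth = {
    ("2L25", "2020-09-21 06:53:16.500"): "5",
    ("2L26", "2020-09-22 06:53:17.500"): "7",
    ("1L07", "2020-09-23 10:00:00.000"): "3",  # no counterpart in `predicted`
}

# Keep only ground-truth events that also appear in the predicted set.
shared = [event for event in ground_truth if event in predicted]

y_true = [ground_truth[event] for event in shared]
y_pred = [predicted[event] for event in shared]

# A single summary number in place of a full per-label report.
n_correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = n_correct / len(shared)

print(len(shared), accuracy)  # 2 0.5
```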
- get_duplicated_labels()[source]
Identify the fault events that appear multiple times in the ExampleSet.
- Return type:
DataFrame
- Returns:
A DataFrame containing labels for events that appear multiple times
- get_events_with_mismatched_labels()[source]
Identify fault events that appear multiple times with different labels.
- Return type:
DataFrame
- Returns:
A DataFrame containing the events that have mismatched labels
- get_example_df()[source]
Returns the example set as a DataFrame (copy)
- Return type:
DataFrame
- Returns:
A copy of the internal ExampleSet DataFrame
- get_label_file_report()[source]
Generate a string containing a report on the processed label files
- Return type:
str
- Returns:
A formatted string containing the report.
- get_required_columns()[source]
Generates the list of column names that must appear in a DataFrame for it to be a valid example_df.
- Return type:
List[str]
- Returns:
The list of column names
- get_unduplicated_events()[source]
Identify the fault events that appear exactly once in the ExampleSet.
- Return type:
DataFrame
- Returns:
DataFrame of the events that appear exactly once in the ExampleSet
- has_required_columns(df, dtypes=False, skip_example=True)[source]
Check if the given DataFrame has the required columns.
- Parameters:
df (DataFrame) – The DataFrame to check.
dtypes (bool) – Check for matching dtypes of “mandatory” columns. Uses existing _example_df’s dtypes. Skips if _examples_df is None.
skip_example (bool) – If True, requires that the ‘example’ column name is present. Otherwise it is not checked.
- Return type:
bool
- Returns:
True if all required column names are present. False otherwise.
- label_file_dataframes
A dictionary holding label file contents, keyed on file names
- load_csv(filename, in_dir=None, sep=',')[source]
Read in a CSV file that has ExampleSet data.
Treats the ‘#’ character as the start of a comment. This includes the rfwtools-generated headers written by save_csv().
- Parameters:
filename (str) – The filename to read. Will be relative to in_dir.
in_dir (Optional[str]) – The directory to find the file in. Defaults to Config().output_dir.
sep (str) – Delimiter string used by Pandas to parse the given “csv” file.
- Raises:
ValueError – If the CSV file does not have the expected column names.
- Return type:
None
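The comment handling can be illustrated with the standard library alone. The sketch below is a simplification and not the rfwtools code path, which relies on Pandas. The helper name read_commented_csv, the header text, and the fault_label column name are hypothetical, and only whole-line comments are handled.

```python
import csv
import io

# Hypothetical file contents: a '#' comment header followed by CSV data.
text = (
    "# example header line, not the real rfwtools header format\n"
    "# another comment\n"
    "zone,dtime,cavity_label,fault_label\n"
    "2L25,2020-09-21 06:53:16.500,5,E_Quench\n"
    "2L26,2020-09-22 06:53:17.500,6,E_Quench\n"
)


def read_commented_csv(handle, sep=","):
    """Parse CSV rows, skipping lines that start with '#'."""
    data_lines = (line for line in handle if not line.lstrip().startswith("#"))
    return list(csv.DictReader(data_lines, delimiter=sep))


records = read_commented_csv(io.StringIO(text))
print(len(records), records[0]["zone"])  # 2 2L25
```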
- purge_invalid_examples(validator, report=True, progress=True)[source]
Removes all examples from the ExampleSet that do not pass validation
- Parameters:
validator (ExampleValidator) – An object that follows the ExampleValidator interface.
report (bool) – Should information about what is purged be printed?
progress (bool) – Should a progress bar be displayed?
- Return type:
None
- remove_duplicates_and_mismatches(report=False)[source]
Removes duplicate example entries and removes all instances of examples that have mismatched labels.
- Parameters:
report (bool) – Should information about what was removed be included?
- Return type:
None
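The two removal rules can be sketched in plain Python as follows. The rows are hypothetical, and the sketch assumes that one row is kept for an event whose repeated labels agree, while every row of an event with conflicting labels is dropped. It is not the rfwtools implementation.

```python
from collections import defaultdict

# Hypothetical label rows as (zone, datetime, cavity label, fault label).
rows = [
    ("2L25", "2020-09-21 06:53:16.500", "5", "E_Quench"),  # unique
    ("2L26", "2020-09-22 06:53:17.500", "6", "E_Quench"),  # duplicated, same labels
    ("2L26", "2020-09-22 06:53:17.500", "6", "E_Quench"),
    ("1L07", "2020-09-23 10:00:00.000", "3", "E_Quench"),  # mismatched labels
    ("1L07", "2020-09-23 10:00:00.000", "4", "E_Quench"),
]

# Distinct label pairs observed for each zone/datetime event.
labels_by_event = defaultdict(set)
for zone, dtime, cavity, fault in rows:
    labels_by_event[(zone, dtime)].add((cavity, fault))

cleaned = []
seen = set()
for row in rows:
    event = row[:2]
    if len(labels_by_event[event]) > 1:
        continue  # drop every row of an event with conflicting labels
    if event in seen:
        continue  # drop repeats of an event that was already kept
    seen.add(event)
    cleaned.append(row)

print(len(cleaned))  # 2
```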
- save_csv(filename, out_dir=None, sep=',')[source]
Write out the ExampleSet data as a CSV file relative to out_dir. Only writes out example_df equivalent.
This also writes out a comment header section that includes information about ExampleSet parameters at the time the file was written.
- Parameters:
filename (str) – The filename to save. Will be relative to out_dir.
out_dir (Optional[str]) – The directory to save the file in. Defaults to Config().output_dir.
sep (str) – Delimiter string used by Pandas when writing the “csv” file.
- Return type:
None
- update_example_set(df, keep_label_file_dataframes=False)[source]
Replaces the contents of this ExampleSet with the supplied DataFrame.
Note: A copy of df is used.
- Parameters:
df (DataFrame) – A DataFrame formatted for ExampleSet that will replace the contents of this ExampleSet.
keep_label_file_dataframes (bool) – Should the dictionary of label file DataFrames be kept? If False, the dictionary is recreated. If True, no action is taken.
- Raises:
ValueError – If columns do not match
- Return type:
None