rfwtools.feature_set.FeatureSet

class rfwtools.feature_set.FeatureSet(df=None, filename=None, in_dir=None, sep=',', name='', metadata_columns=None)[source]

Bases: ExampleSet

A class for managing common operations on a collection of labeled faults and associated features

This class is an extension of ExampleSet meant to handle additional analysis operations.

__init__(df=None, filename=None, in_dir=None, sep=',', name='', metadata_columns=None)[source]

Construct a FeatureSet. Can use either a DataFrame or CSV-like file.

Parameters:
  • df (Optional[DataFrame]) – A DataFrame containing the FeatureSet data to include. The following columns must be included - [‘zone’, ‘dtime’, ‘cavity_label’, ‘fault_label’, ‘label_source’]. Any additional columns will be treated as the features. Note: A copy of df is saved.

  • filename (Optional[str]) – The filename to load. Will be relative to in_dir

  • in_dir (Optional[str]) – The directory to find the CSV-like file in. Defaults to Config().output_dir

  • sep (str) – Delimiter string used by Pandas to parse given CSV-like file

  • name (str) – A string that may be used to help identify this FeatureSet

  • metadata_columns (Optional[List[str]]) – A list of the names of the columns that are metadata (e.g., “zone” or “cavity_label”). Any supplied column names are in addition to the mandatory columns for an ExampleSet. metadata_columns are treated as required columns.
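As a sketch of the expected input, the following builds a DataFrame with the mandatory columns plus two invented feature columns (all values here are hypothetical):

```python
import pandas as pd

# Hypothetical values throughout; only the mandatory column names come from
# the documentation above. Any extra columns are treated as features.
df = pd.DataFrame({
    "zone": ["2L25", "2L26"],
    "dtime": pd.to_datetime(["2020-09-21 06:53:16.5", "2020-09-22 06:53:17.5"]),
    "cavity_label": ["5", "6"],
    "fault_label": ["E_Quench", "E_Quench"],
    "label_source": ["label_file", "label_file"],
    "feature_1": [0.12, 0.98],  # illustrative feature column
    "feature_2": [3.4, 1.7],    # illustrative feature column
})

# With rfwtools installed, construction would then be:
# from rfwtools.feature_set import FeatureSet
# fs = FeatureSet(df=df, name="demo")
print(df.shape)  # (2, 7)
```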

Methods

__init__([df, filename, in_dir, sep, name, ...])

Construct a FeatureSet.

add_label_file_data([label_files, ...])

Process and add label files' data to the ExampleSet's internal collection.

add_web_service_data([server, begin, end, ...])

Add web service data (faults labeled by in-service model) to the ExampleSet.

count_duplicated_events()

Count the number of events that appear multiple times, i.e., were labeled more than once.

count_duplicated_events_with_mismatched_labels()

Count the number of events that appear multiple times with different labels.

count_duplicated_labels()

Count the number of labeling occurrences for events that appear multiple times.

count_events()

Count the number of unique events (zone/datetime combinations)

count_labels()

Counts the number of labels (rows in label files)

count_mismatched_labels()

Count the number of times an event with mismatched labels appears in the ExampleSet.

count_unduplicated_events()

Count the number of events that appear exactly once.

display_2d_scatterplot([technique, alpha, ...])

Display a two-dimensional scatterplot of the dimensionally reduced feature set.

display_examples_by_weekday_barplot([...])

Show example counts by the day of the week as a stacked barplot

display_frequency_barplot(x[, color_by, ...])

Display the example count against one or two different factors, as a (stacked) bar chart.

display_summary_label_heatmap([title, query])

Display a heatmap of fault vs cavity labels for all examples in this object

display_timeline([query])

Display a timeline of examples as a swarmplot

display_zone_label_heatmap([zones, query])

Display a heatmap of fault vs cavity labels for all examples in this object for each unique zone category

do_pca_reduction([metadata_cols, report, ...])

Perform PCA on subset of columns of example_df and maintain example metadata in results.

get_classification_report(other[, label, ...])

This prints a classification report of this ExampleSet's cavity labels considering other as ground truth.

get_duplicated_labels()

"Identify the fault events that appear multiple times in the ExampleSet.

get_events_with_mismatched_labels()

Identify fault events that appear multiple times with different labels.

get_example_df()

Returns the example set as a DataFrame (copy)

get_label_file_report()

Generate a string containing a report on the processed label files

get_pca_df()

Get a copy of the PCA reduction as a DataFrame.

get_required_columns()

Generates the list of column names that must appear in a DataFrame for it to be a valid example_df.

get_unduplicated_events()

Identify the fault events that appear exactly once in the ExampleSet.

has_required_columns(df[, dtypes, skip_example])

Check if the given DataFrame has the required columns.

load_csv(filename[, in_dir, sep, ...])

Read in a CSV file that has FeatureSet data.

purge_invalid_examples(validator[, report, ...])

Removes all examples from the ExampleSet that do not pass validation

remove_duplicates_and_mismatches([report])

Removes duplicate example entries and removes all instances of examples that have mismatched labels.

save_csv(filename[, out_dir, sep])

Write out the ExampleSet data as a CSV file relative to out_dir.

update_example_set(df[, metadata_columns])

Update the _example_df and blank other internal data derived from it.

update_metadata_columns(metadata_columns)

This updates the metadata columns and alters other related data structures.

Attributes

metadata_columns

These columns are required in internal DataFrames and are excluded from analysis routines.

pca

The pca model.

add_label_file_data(label_files=None, exclude_zones=None, exclude_times=None)

Process and add label files’ data to the ExampleSet’s internal collection.

Parameters:
  • label_files (Optional[List[str]]) – List of label files to process. If None, all files in Config().label_dir are read. Relative paths are resolved relative to Config().label_dir.

  • exclude_zones (Optional[List[str]]) – List of zones to exclude. Defaults to Config().exclude_zones.

  • exclude_times (Optional[List[Tuple[datetime, datetime]]]) – List of 2-tuples of datetime objects. Each 2-tuple specifies a range to exclude. A None in either position implies an unbounded endpoint (+/-Inf).

Return type:

None
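As an illustrative sketch (not rfwtools internals), here is how a list of exclude_times 2-tuples can act as a filter, with None standing in for an unbounded endpoint:

```python
from datetime import datetime

def excluded(dtime, exclude_times):
    """Return True if dtime falls in any (begin, end) range; None = +/-Inf."""
    for begin, end in exclude_times:
        lo = begin if begin is not None else datetime.min
        hi = end if end is not None else datetime.max
        if lo <= dtime <= hi:
            return True
    return False

# Exclude everything from September 2020 onward (end=None means no upper bound)
ranges = [(datetime(2020, 9, 1), None)]
print(excluded(datetime(2020, 9, 21, 6, 53, 16), ranges))  # True
print(excluded(datetime(2020, 8, 15), ranges))             # False
```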

add_web_service_data(server=None, begin=None, end=None, models=None)

Add web service data (faults labeled by in-service model) to the ExampleSet.

Note: Should not be mixed with label file data since the two sources will largely overlap

Parameters:
  • server (Optional[str]) – The server to query for the data. If None, use the value in Config

  • begin (Optional[datetime]) – The earliest time for which a fault should be included. If None, defaults to Jan 1, 2018

  • end (Optional[datetime]) – The latest time for which a fault should be included. If None, defaults to “now”

  • models (Optional[List[str]]) – A list of model names that should be included in the results. None means include all

Return type:

None

count_duplicated_events()

Count the number of events that appear multiple times, i.e., were labeled more than once.

Return type:

int

Returns:

the number of events that appear multiple times, i.e., were labeled more than once.

This would count as one, since only one event occurred multiple times:

        zone  dtime                    cavity_label  fault_label
  4240  2L25  2020-09-21 06:53:16.500  5             E_Quench
  4241  2L26  2020-09-22 06:53:17.500  6             E_Quench
  4242  2L26  2020-09-22 06:53:17.500  6             E_Quench

count_duplicated_events_with_mismatched_labels()

Count the number of events that appear multiple times with different labels.

Return type:

int

Returns:

The number of events that appear multiple times with different labels

This would count as one, since one event appeared with mismatched labels:

        zone  dtime                    cavity_label  fault_label
  4240  2L26  2020-09-21 06:53:16.500  5             E_Quench
  4241  2L26  2020-09-21 06:53:16.500  6             E_Quench
  4242  2L26  2020-09-21 06:53:16.500  6             E_Quench

count_duplicated_labels()

Count the number of labeling occurrences for events that appear multiple times.

This is basically the number of rows in the label files that are not for unique fault events.

Return type:

int

Returns:

The number of labeling occurrences for events that appear multiple times

count_events()

Count the number of unique events (zone/datetime combinations)

This would count as two, since two unique zone/datetime pairs appear:

        zone  dtime                    cavity_label  fault_label
  4240  2L25  2020-09-21 06:53:16.500  5             E_Quench
  4241  2L26  2020-09-22 06:53:17.500  6             E_Quench
  4242  2L26  2020-09-22 06:53:17.500  6             E_Quench

Return type:

int

Returns:

the number of unique events (zone/datetime combinations)

count_labels()

Counts the number of labels (rows in label files)

Return type:

int

Returns:

the number of labels (rows in label files)

count_mismatched_labels()

Count the number of times an event with mismatched labels appears in the ExampleSet.

Return type:

int

Returns:

The number of times an event with mismatched labels appears in the ExampleSet.

This would count as three mismatched labels, since one event with mismatched labels appeared three times:

        zone  dtime                    cavity_label  fault_label
  4240  2L26  2020-09-21 06:53:16.500  5             E_Quench
  4241  2L26  2020-09-21 06:53:16.500  6             E_Quench
  4242  2L26  2020-09-21 06:53:16.500  6             E_Quench

count_unduplicated_events()

Count the number of events that appear exactly once.

Return type:

int

Returns:

The number of events that appear exactly once.

This would count as one, since only one event appears exactly once:

        zone  dtime                    cavity_label  fault_label
  4240  2L25  2020-09-21 06:53:16.500  5             E_Quench
  4241  2L26  2020-09-22 06:53:17.500  6             E_Quench
  4242  2L26  2020-09-22 06:53:17.500  6             E_Quench
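The counting methods above can be illustrated with plain Python over the documented example rows (a sketch of the semantics, not the library's implementation):

```python
from collections import Counter

# An "event" is a unique (zone, dtime) pair; each row is one label occurrence.
rows = [
    ("2L25", "2020-09-21 06:53:16.500"),
    ("2L26", "2020-09-22 06:53:17.500"),
    ("2L26", "2020-09-22 06:53:17.500"),
]
counts = Counter(rows)

n_events = len(counts)                                      # unique events
n_duplicated = sum(1 for c in counts.values() if c > 1)     # events seen > once
n_unduplicated = sum(1 for c in counts.values() if c == 1)  # events seen once
n_labels = sum(counts.values())                             # total label rows
print(n_events, n_duplicated, n_unduplicated, n_labels)     # 2 1 1 3
```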

display_2d_scatterplot(technique='pca', alpha=0.8, s=25, title=None, figsize=(12, 12), query=None, **kwargs)[source]

Display a two-dimensional scatterplot of the dimensionally reduced feature set.

If the requested reduction has not already been generated, an exception is raised. Note: consider passing hue=<pca_df column_name> and/or style=<pca_df column_name> to control which column colors/styles each point.

Parameters:
  • technique (str) – The type of dim reduction data to display. Currently the only supported option is pca.

  • alpha (float) – Controls point transparency

  • s (int) – Controls point size

  • title (Optional[str]) – The title of the scatterplot

  • figsize (Tuple[int, int]) – The two dimensions of the size of the figure. Passed to plt.figure.

  • query (Optional[str]) – A pd.DataFrame.query() expr argument. Used to subset the data prior to plotting.

  • **kwargs – All remaining parameters are passed directly to the scatterplot command

Return type:

None

display_examples_by_weekday_barplot(color_by=None, title=None, query=None)

Show example counts by the day of the week as a stacked barplot

Parameters:
  • color_by (Optional[str]) – The DataFrame column on which the bars will be split/colored.

  • title (Optional[str]) – The title to put on the plot. A reasonable default will be generated if None.

  • query (Optional[str]) – The expr argument to DataFrame.query. Subsets data before plot

Return type:

None

display_frequency_barplot(x, color_by=None, title=None, query=None)

Display the example count against one or two different factors, as a (stacked) bar chart.

Parameters:
  • x (str) – The column name for which each bar will appear. Should probably be categorical.

  • color_by (Optional[str]) – The column name by which each bar will be split and colored (for a stacked bar plot). If None, then a simple bar plot will be displayed.

  • title (Optional[str]) – The title to put on the chart. If None, a reasonable default will be generated.

  • query (Optional[str]) – The expr argument to DataFrame.query. Subsets data before plot

Return type:

None

display_summary_label_heatmap(title='Label Summary', query=None)

Display a heatmap of fault vs cavity labels for all examples in this object

Parameters:
  • title (str) – The title of the plot

  • query (Optional[str]) – The expr argument to DataFrame.query. Subsets data before plot

Return type:

None

display_timeline(query=None, **kwargs)

Display a timeline of examples as a swarmplot

Parameters:
  • query (Optional[str]) – The expr argument to DataFrame.query. Subsets data before plot

  • kwargs – Other named parameters are passed to swarm_timeline method

Return type:

None

display_zone_label_heatmap(zones=None, query=None)

Display a heatmap of fault vs cavity labels for all examples in this object for each unique zone category

Parameters:
  • zones (Optional[List[str]]) – A list of the zones to display.

  • query (Optional[str]) – The expr argument to DataFrame.query. Subsets data before plot

Return type:

None

do_pca_reduction(metadata_cols=None, report=True, n_components=3, **kwargs)[source]

Perform PCA on subset of columns of example_df and maintain example metadata in results.

The results are accessible through the get_pca_df() method and the fitted model through the pca attribute.

Parameters:
  • metadata_cols (Optional[List[str]]) – The column names of feature_df that contain the metadata of the events (labels, etc.). All columns not listed in metadata_cols are used in the PCA analysis. If None, it defaults to the values supplied at construction of the FeatureSet.

  • report (bool) – Should a report of explained variance be printed

  • n_components (float) – The number of principal components to calculate.

  • **kwargs – Remaining keyword arguments will be passed to sklearn.decomposition.PCA

Return type:

None
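Conceptually, the reduction centers the feature columns and projects them onto the leading principal axes, keeping the metadata columns alongside the result. The sketch below uses a plain numpy SVD on random data to show the shapes involved; rfwtools itself delegates to sklearn.decomposition.PCA:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))        # 50 examples, 8 feature columns

Xc = X - X.mean(axis=0)             # center each feature column
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
n_components = 3
scores = Xc @ Vt[:n_components].T   # principal-component coordinates per example

explained = (S ** 2) / (S ** 2).sum()  # explained-variance ratio per component
print(scores.shape)                 # (50, 3)
```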

get_classification_report(other, label='cavity_label', query=None, other_query=None)

This prints a classification report of this ExampleSet’s cavity labels considering other as ground truth.

Only examples from other for which there is an example in this ExampleSet are considered

Parameters:
  • other (ExampleSet) – An ExampleSet that contains cavity labels considered the ground truth.

  • label (str) – The column name containing the label values to compare.

  • query (str) – The expr argument to DataFrame.query. Subsets this ExampleSet's data before comparison.

  • other_query (str) – The expr argument to DataFrame.query. Subsets other's data before comparison.

get_duplicated_labels()

Identify the fault events that appear multiple times in the ExampleSet.

Return type:

DataFrame

Returns:

A DataFrame containing labels for events that appear multiple times

get_events_with_mismatched_labels()

Identify fault events that appear multiple times with different labels.

Return type:

DataFrame

Returns:

A DataFrame containing the events that have mismatched labels

get_example_df()

Returns the example set as a DataFrame (copy)

Return type:

DataFrame

Returns:

A copy of the internal ExampleSet DataFrame

get_label_file_report()

Generate a string containing a report on the processed label files

Return type:

str

Returns:

A formatted string containing the report.

get_pca_df()[source]

Get a copy of the PCA reduction as a DataFrame. Will be None if the reduction has not been done.

Each example is on its own row with its principal components.

Return type:

DataFrame

Returns:

A copy of the PCA results along with the example’s metadata.

get_required_columns()

Generates the list of column names that must appear in a DataFrame for it to be a valid example_df.

Return type:

List[str]

Returns:

The list of column names

get_unduplicated_events()

Identify the fault events that appear exactly once in the ExampleSet.

Return type:

DataFrame

Returns:

DataFrame of the events that appear exactly once in the ExampleSet

has_required_columns(df, dtypes=False, skip_example=True)

Check if the given DataFrame has the required columns.

Parameters:
  • df (DataFrame) – The DataFrame to check

  • dtypes (bool) – Check for matching dtypes of the “mandatory” columns. Uses the existing _example_df’s dtypes. Skipped if _example_df is None.

  • skip_example (bool) – If True, the ‘example’ column is not required. Otherwise it must be present.

Return type:

bool

Returns:

True if all required column names are present. False otherwise.
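A minimal sketch of this kind of check, using plain sets (the real method also optionally compares dtypes against the internal _example_df):

```python
# Mandatory column names taken from the class documentation above.
REQUIRED = ["zone", "dtime", "cavity_label", "fault_label", "label_source"]

def has_required_columns(columns, skip_example=True):
    """Return True if all required column names are present in `columns`."""
    required = set(REQUIRED)
    if not skip_example:
        required.add("example")   # only required when skip_example is False
    return required.issubset(columns)

print(has_required_columns(["zone", "dtime", "cavity_label",
                            "fault_label", "label_source", "feature_1"]))  # True
print(has_required_columns(["zone", "dtime"]))                             # False
```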

label_file_dataframes

A dictionary holding label file contents, keyed on file names

load_csv(filename, in_dir=None, sep=',', metadata_columns=None)[source]

Read in a CSV file that has FeatureSet data. A str filename is interpreted relative to in_dir.

Parameters:
  • filename (str) – The filename to load. Will be relative to in_dir

  • in_dir (Optional[str]) – The directory to find the file in. Defaults to Config().output_dir

  • sep (str) – Delimiter string used by Pandas to parse the given CSV-like file

  • metadata_columns (Optional[List[str]]) – A list of column names to treat as metadata. This updates the FeatureSet’s list. No changes are made if it is None.

Return type:

None

metadata_columns

These columns are required in internal DataFrames and are excluded from analysis routines.

pca

The pca model. Either None or is the fitted sklearn PCA object. This is left publicly accessible so users have access for custom analysis or visualization (e.g., transforming future examples). Users beware modifying this!

purge_invalid_examples(validator, report=True, progress=True)

Removes all examples from the ExampleSet that do not pass validation

Parameters:
  • validator (ExampleValidator) – An object that follows the ExampleValidator interface.

  • report (bool) – Should information about what is purged be printed?

  • progress (bool) – Should a progress bar be displayed?

Return type:

None

remove_duplicates_and_mismatches(report=False)

Removes duplicate example entries and removes all instances of examples that have mismatched labels.

Parameters:

report (bool) – Should information about what was removed be included?

Return type:

None

save_csv(filename, out_dir=None, sep=',')

Write out the ExampleSet data as a CSV file relative to out_dir. Only writes out example_df equivalent.

This also writes out a comment header section that includes information about ExampleSet parameters at the time the file was written.

Parameters:
  • filename (str) – The filename to save. Will be relative to out_dir

  • out_dir (Optional[str]) – The directory to save the file in. Defaults to Config().output_dir

  • sep (str) – Delimiter string used by Pandas when writing the CSV file

Return type:

None
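A sketch of the resulting file layout: comment lines followed by the CSV body, which pandas can skip over on read. The header text and values below are invented for illustration, and an in-memory buffer stands in for a real file:

```python
import io
import pandas as pd

# Write an illustrative comment header, then the CSV body.
buf = io.StringIO()
buf.write("# ExampleSet parameters at save time (illustrative)\n")
buf.write("# name=demo\n")
pd.DataFrame({"zone": ["2L25"], "fault_label": ["E_Quench"]}).to_csv(buf, index=False)

# Read it back, skipping the '#' comment lines.
buf.seek(0)
df = pd.read_csv(buf, comment="#")
print(df.shape)  # (1, 2)
```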

update_example_set(df, metadata_columns=None)[source]

Update the _example_df and blank other internal data derived from it.

Parameters:
  • df (DataFrame) – A DataFrame containing an example per row with additional feature information. Must be valid for this FeatureSet.

  • metadata_columns (Optional[List[str]]) – A new list of metadata columns for df

Return type:

None

update_metadata_columns(metadata_columns)[source]

This updates the metadata columns and alters other related data structures.

self.metadata_columns always includes ExampleSet._mandatory_columns. Duplicates are removed should you include those as well.

Parameters:

metadata_columns (List[str]) – A list of metadata columns.

Return type:

None