rfwtools.feature_set.FeatureSet
- class rfwtools.feature_set.FeatureSet(df=None, filename=None, in_dir=None, sep=',', name='', metadata_columns=None)[source]
Bases:
ExampleSet
A class for managing common operations on a collection of labeled faults and associated features.
This class is an extension of ExampleSet meant to handle additional analysis operations.
- __init__(df=None, filename=None, in_dir=None, sep=',', name='', metadata_columns=None)[source]
Construct a FeatureSet. Can use either a DataFrame or CSV-like file.
- Parameters:
df (Optional[DataFrame]) – A DataFrame containing the FeatureSet data to include. The following columns must be included: [‘zone’, ‘dtime’, ‘cavity_label’, ‘fault_label’, ‘label_source’]. Any additional columns will be treated as the features. Note: A copy of df is saved.
filename (Optional[str]) – The filename to load. Will be relative to in_dir.
in_dir (Optional[str]) – The directory to find the CSV-like file in. Defaults to Config().output_dir.
sep (str) – Delimiter string used by Pandas to parse the given CSV-like file.
name (str) – A string that may be used to help identify this FeatureSet.
metadata_columns (Optional[List[str]]) – A list of the names of the columns that are metadata (e.g., “zone” or “cavity_label”). Any supplied column names are in addition to the mandatory columns for an ExampleSet. metadata_columns are treated as required columns.
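As a minimal sketch, a FeatureSet can be built from an in-memory DataFrame that carries the five mandatory columns plus any number of feature columns. The feature column names (feat_1, feat_2) and the construction call at the end are illustrative; the FeatureSet line is commented out since it requires rfwtools to be installed.

```python
import pandas as pd

# The five mandatory columns, plus feature columns (any non-mandatory column
# is treated as a feature).
df = pd.DataFrame({
    "zone": ["2L25", "2L26"],
    "dtime": pd.to_datetime(["2020-09-21 06:53:16.5", "2020-09-22 06:53:17.5"]),
    "cavity_label": ["5", "6"],
    "fault_label": ["E_Quench", "E_Quench"],
    "label_source": ["label_file", "label_file"],
    "feat_1": [0.12, 0.98],   # illustrative feature column
    "feat_2": [3.4, 1.7],     # illustrative feature column
})

# from rfwtools.feature_set import FeatureSet
# fs = FeatureSet(df=df, name="demo")  # a copy of df is stored internally
```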
Methods
__init__([df, filename, in_dir, sep, name, ...])Construct a FeatureSet.
add_label_file_data([label_files, ...])Process and add label files' data to the ExampleSet's internal collection.
add_web_service_data([server, begin, end, ...])Add web service data (faults labeled by in-service model) to the ExampleSet.
count_duplicated_events()Count the number of events that appear multiple times, i.e., were labeled more than once.
count_duplicated_events_with_mismatched_labels()Count the number of events that appear multiple times with different labels.
count_duplicated_labels()Count the number of labeling occurrences for events that appear multiple times.
count_events()Count the number of unique events (zone/datetime combinations).
count_labels()Counts the number of labels (rows in label files).
count_mismatched_labels()Count the number of times an event with mismatched labels appears in the ExampleSet.
count_unduplicated_events()Count the number of events that appear exactly once.
display_2d_scatterplot([technique, alpha, ...])Display a two-dimensional scatterplot of the dimensionally reduced feature set.
display_examples_by_weekday_barplot([color_by, ...])Show example counts by the day of the week as a stacked barplot.
display_frequency_barplot(x[, color_by, ...])Display the example count against one or two different factors, as a (stacked) bar chart.
display_summary_label_heatmap([title, query])Display a heatmap of fault vs cavity labels for all examples in this object
display_timeline([query])Display a timeline of examples as a swarmplot
display_zone_label_heatmap([zones, query])Display a heatmap of fault vs cavity labels for all examples in this object for each unique zone category
do_pca_reduction([metadata_cols, report, ...])Perform PCA on subset of columns of example_df and maintain example metadata in results.
get_classification_report(other[, label, ...])This prints a classification report of this ExampleSet's cavity labels considering other as ground truth.
get_duplicated_labels()Identify the fault events that appear multiple times in the ExampleSet.
get_events_with_mismatched_labels()Identify fault events that appear multiple times with different labels.
get_example_df()Returns the example set as a DataFrame (copy).
get_label_file_report()Generate a string containing a report on the processed label files.
get_pca_df()Get a copy of the PCA reduction as a DataFrame.
get_required_columns()Generates the list of column names that must appear in a DataFrame for it to be a valid example_df.
get_unduplicated_events()Identify the fault events that appear exactly once in the ExampleSet.
has_required_columns(df[, dtypes, skip_example])Check if the given DataFrame has the required columns.
load_csv(filename[, in_dir, sep, ...])Read in a CSV file that has FeatureSet data.
purge_invalid_examples(validator[, report, ...])Removes all examples from the ExampleSet that do not pass validation
remove_duplicates_and_mismatches([report])Removes duplicate example entries and removes all instances of examples that have mismatched labels.
save_csv(filename[, out_dir, sep])Write out the ExampleSet data as a CSV file relative to out_dir.
update_example_set(df[, metadata_columns])Update the _example_df and blank other internal data derived from it.
update_metadata_columns(metadata_columns)This updates the metadata columns and alters other related data structures.
Attributes
label_file_dataframesA dictionary holding label file contents, keyed on file names.
metadata_columnsThese columns are required in internal DataFrames and are excluded from analysis routines.
pcaThe pca model.
- add_label_file_data(label_files=None, exclude_zones=None, exclude_times=None)
Process and add label files’ data to the ExampleSet’s internal collection.
- Parameters:
label_files (Optional[List[str]]) – List of label files to process. If None, all files in Config().label_dir are read. Relative paths are resolved relative to Config().label_dir.
exclude_zones (Optional[List[str]]) – List of zones to exclude. Defaults to Config().exclude_zones.
exclude_times (Optional[List[Tuple[datetime, datetime]]]) – List of 2-tuples of datetime objects. Each 2-tuple specifies a range to exclude. None implies +/-Inf.
- Return type:
None
- add_web_service_data(server=None, begin=None, end=None, models=None)
Add web service data (faults labeled by in-service model) to the ExampleSet.
Note: This should be used instead of, not in addition to, label file data since the two will largely overlap.
- Parameters:
server (Optional[str]) – The server to query for the data. If None, use the value in Config.
begin (Optional[datetime]) – The earliest time for which a fault should be included. If None, defaults to Jan 1, 2018.
end (Optional[datetime]) – The latest time for which a fault should be included. If None, defaults to "now".
models (Optional[List[str]]) – A list of model names that should be included in the results. None means include all.
- Return type:
None
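A hypothetical call sketch follows; the model name and the fs variable are assumptions, and the call itself is shown commented out because it requires a live web service and an existing FeatureSet.

```python
from datetime import datetime

# Pull faults labeled by the in-service model for calendar year 2020.
begin = datetime(2020, 1, 1)
end = datetime(2021, 1, 1)

# fs.add_web_service_data(begin=begin, end=end,
#                         models=["random_forest_v1"])  # hypothetical model name
```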
- count_duplicated_events()
Count the number of events that appear multiple times, i.e., were labeled more than once.
- Return type:
int
- Returns:
The number of events that appear multiple times, i.e., were labeled more than once.
This would count as one since only one event occurred multiple times:
4240 2L25 2020-09-21 06:53:16.500 5 E_Quench
4241 2L26 2020-09-22 06:53:17.500 6 E_Quench
4242 2L26 2020-09-22 06:53:17.500 6 E_Quench
- count_duplicated_events_with_mismatched_labels()
Count the number of events that appear multiple times with different labels.
- Return type:
int
- Returns:
The number of events that appear multiple times with different labels.
This would count as one since one event appeared that had mismatched labels:
4240 2L26 2020-09-21 06:53:16.500 5 E_Quench
4241 2L26 2020-09-21 06:53:16.500 6 E_Quench
4242 2L26 2020-09-21 06:53:16.500 6 E_Quench
- count_duplicated_labels()
Count the number of labeling occurrences for events that appear multiple times.
This is basically the number of rows in the label files that are not for unique fault events.
- Return type:
int
- Returns:
The number of labeling occurrences for events that appear multiple times.
- count_events()
Count the number of unique events (zone/datetime combinations).
This would count as two since two unique zone/datetime pairs appeared:
4240 2L25 2020-09-21 06:53:16.500 5 E_Quench
4241 2L26 2020-09-22 06:53:17.500 6 E_Quench
4242 2L26 2020-09-22 06:53:17.500 6 E_Quench
- Return type:
int
- Returns:
The number of unique events (zone/datetime combinations).
- count_labels()
Counts the number of labels (rows in label files)
- Return type:
int
- Returns:
The number of labels (rows in label files).
- count_mismatched_labels()
Count the number of times an event with mismatched labels appears in the ExampleSet.
- Return type:
int
- Returns:
The number of times an event with mismatched labels appears in the ExampleSet.
This would count as three mismatched labels since one event with mismatched labels appeared three times:
4240 2L26 2020-09-21 06:53:16.500 5 E_Quench
4241 2L26 2020-09-21 06:53:16.500 6 E_Quench
4242 2L26 2020-09-21 06:53:16.500 6 E_Quench
- count_unduplicated_events()
Count the number of events that appear exactly once.
- Return type:
int
- Returns:
The number of events that appear exactly once.
This would count as one since only one event appeared that did not occur multiple times:
4240 2L25 2020-09-21 06:53:16.500 5 E_Quench
4241 2L26 2020-09-22 06:53:17.500 6 E_Quench
4242 2L26 2020-09-22 06:53:17.500 6 E_Quench
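The semantics of the count_* family can be mirrored with plain Python on the three sample label rows used above. This is an illustrative re-implementation of the counting logic, not the library's code.

```python
from collections import Counter

# The three label rows from the examples above: (zone, dtime, cavity, fault).
rows = [
    ("2L25", "2020-09-21 06:53:16.500", "5", "E_Quench"),
    ("2L26", "2020-09-22 06:53:17.500", "6", "E_Quench"),
    ("2L26", "2020-09-22 06:53:17.500", "6", "E_Quench"),
]

# An "event" is a unique zone/dtime combination.
event_counts = Counter((zone, dtime) for zone, dtime, _, _ in rows)

count_events = len(event_counts)                  # unique zone/dtime pairs
count_labels = len(rows)                          # rows in the label files
count_duplicated_events = sum(1 for c in event_counts.values() if c > 1)
count_unduplicated_events = sum(1 for c in event_counts.values() if c == 1)

# A duplicated event is "mismatched" if its cavity/fault labels disagree.
labels_by_event = {}
for zone, dtime, cav, fault in rows:
    labels_by_event.setdefault((zone, dtime), set()).add((cav, fault))
count_mismatched = sum(1 for labs in labels_by_event.values() if len(labs) > 1)
```

On this sample: two unique events, one duplicated (2L26 appears twice), one unduplicated, and no mismatches since the duplicated rows agree on their labels.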
- display_2d_scatterplot(technique='pca', alpha=0.8, s=25, title=None, figsize=(12, 12), query=None, **kwargs)[source]
Display a two-dimensional scatterplot of the dimensionally reduced feature set.
If the type specified has not already been generated, an exception is raised. Note: consider passing hue=<pca_df column_name> and/or style=<pca_df column_name> to control which column colors/styles each point.
- Parameters:
technique (str) – The type of dim reduction data to display. Currently the only supported option is pca.
alpha (float) – Controls point transparency.
s (int) – Controls point size.
title (Optional[str]) – The title of the scatterplot.
figsize (Tuple[int, int]) – The two dimensions of the size of the figure. Passed to plt.figure.
query (Optional[str]) – A pd.DataFrame.query() expr argument. Used to subset the data prior to plotting.
**kwargs – All remaining parameters are passed directly to the scatterplot command.
- Return type:
None
- display_examples_by_weekday_barplot(color_by=None, title=None, query=None)
Show example counts by the day of the week as a stacked barplot
- Parameters:
color_by (Optional[str]) – The DataFrame column on which the bars will be split/colored.
title (Optional[str]) – The title to put on the plot. A reasonable default will be generated if None.
query (Optional[str]) – The expr argument to DataFrame.query. Subsets data before plotting.
- Return type:
None
- display_frequency_barplot(x, color_by=None, title=None, query=None)
Display the example count against one or two different factors, as a (stacked) bar chart.
- Parameters:
x (str) – The column name for which each bar will appear. Should probably be categorical.
color_by (Optional[str]) – The column name by which each bar will be split and colored (for a stacked bar plot). If None, then a simple bar plot will be displayed.
title (Optional[str]) – The title to put on the chart. If None, a reasonable default will be generated.
query (Optional[str]) – The expr argument to DataFrame.query. Subsets data before plotting.
- Return type:
None
- display_summary_label_heatmap(title='Label Summary', query=None)
Display a heatmap of fault vs cavity labels for all examples in this object
- Parameters:
title (str) – The title of the plot.
query (Optional[str]) – The expr argument to DataFrame.query. Subsets data before plotting.
- Return type:
None
- display_timeline(query=None, **kwargs)
Display a timeline of examples as a swarmplot
- Parameters:
query (Optional[str]) – The expr argument to DataFrame.query. Subsets data before plotting.
kwargs – Other named parameters are passed to the swarm_timeline method.
- Return type:
None
- display_zone_label_heatmap(zones=None, query=None)
Display a heatmap of fault vs cavity labels for all examples in this object for each unique zone category
- Parameters:
zones (Optional[List[str]]) – A list of the zones to display.
query (Optional[str]) – The expr argument to DataFrame.query. Subsets data before plotting.
- Return type:
None
- do_pca_reduction(metadata_cols=None, report=True, n_components=3, **kwargs)[source]
Perform PCA on subset of columns of example_df and maintain example metadata in results.
The results are accessible through the get_pca_df() method and the fitted model through the pca attribute.
- Parameters:
metadata_cols (Optional[List[str]]) – The column names of feature_df that contain the metadata of the events (labels, etc.). All columns not listed in metadata_cols are used in the PCA analysis. If None, it defaults to the values supplied at construction of the FeatureSet.
report (bool) – Should a report of explained variance be printed?
n_components (float) – The number of principal components to calculate.
**kwargs – Remaining keyword arguments will be passed to sklearn.decomposition.PCA.
- Return type:
None
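do_pca_reduction delegates to sklearn.decomposition.PCA on the non-metadata columns. What that projection does can be sketched with numpy alone: center the feature matrix, then project onto the top right-singular vectors. This is only a conceptual sketch, not the library's implementation (sklearn handles solver selection, whitening, etc.).

```python
import numpy as np

# Toy feature matrix: 5 examples x 4 feature columns (metadata excluded).
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))

n_components = 3

# Center each feature column, then take the SVD of the centered matrix.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Project onto the first n_components principal directions.
reduced = Xc @ Vt[:n_components].T   # shape (5, 3)

# Fraction of total variance captured by each kept component.
explained_variance_ratio = (S**2 / (S**2).sum())[:n_components]
```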
- get_classification_report(other, label='cavity_label', query=None, other_query=None)
This prints a classification report of this ExampleSet’s cavity labels considering other as ground truth.
Only examples from other for which there is a corresponding example in this ExampleSet are considered.
- Parameters:
other (ExampleSet) – An ExampleSet that contains cavity labels considered the ground truth.
label (str) – The column name containing the label values to compare.
query (str) – The expr argument to DataFrame.query. Subsets this ExampleSet's data before comparison.
other_query (str) – The expr argument to DataFrame.query. Subsets other's data before comparison.
- get_duplicated_labels()
Identify the fault events that appear multiple times in the ExampleSet.
- Return type:
DataFrame
- Returns:
A DataFrame containing labels for events that appear multiple times
- get_events_with_mismatched_labels()
Identify fault events that appear multiple times with different labels.
- Return type:
DataFrame
- Returns:
A DataFrame containing the events that have mismatched labels
- get_example_df()
Returns the example set as a DataFrame (copy)
- Return type:
DataFrame
- Returns:
A copy of the internal ExampleSet DataFrame
- get_label_file_report()
Generate a string containing a report on the processed label files
- Return type:
str
- Returns:
A formatted string containing the report.
- get_pca_df()[source]
Get a copy of the PCA reduction as a DataFrame. Will be None if the reduction has not been done.
Each example is on its own row with its principal components.
- Return type:
DataFrame
- Returns:
A copy of the PCA results along with the example’s metadata.
- get_required_columns()
Generates the list of column names that must appear in a DataFrame for it to be a valid example_df.
- Return type:
List[str]
- Returns:
The list of column names
- get_unduplicated_events()
Identify the fault events that appear exactly once in the ExampleSet.
- Return type:
DataFrame
- Returns:
DataFrame of the events that appear exactly once in the ExampleSet
- has_required_columns(df, dtypes=False, skip_example=True)
Check if the given DataFrame has the required columns.
- Parameters:
df (DataFrame) – The DataFrame to check.
dtypes (bool) – Check for matching dtypes of "mandatory" columns. Uses the existing _example_df's dtypes. Skips the check if _example_df is None.
skip_example (bool) – If True, requires that the 'example' column name is present. Otherwise it is not checked.
- Return type:
bool
- Returns:
True if all required column names are present. False otherwise.
- label_file_dataframes
A dictionary holding label file contents, keyed on file names
- load_csv(filename, in_dir=None, sep=',', metadata_columns=None)[source]
Read in a CSV file that has FeatureSet data. Relative to in_dir if filename is str.
- Parameters:
filename (str) – The filename to load. Will be relative to in_dir.
in_dir (Optional[str]) – The directory to find the file in. Defaults to Config().output_dir.
sep (str) – Delimiter string used by Pandas to parse the given "csv" file.
metadata_columns (Optional[List[str]]) – A list of column names to treat as metadata. This updates the FeatureSet's list. No changes are made if it is None.
- Return type:
None
- metadata_columns
These columns are required in internal DataFrames and are excluded from analysis routines.
- pca
The pca model. Either None or is the fitted sklearn PCA object. This is left publicly accessible so users have access for custom analysis or visualization (e.g., transforming future examples). Users beware modifying this!
- purge_invalid_examples(validator, report=True, progress=True)
Removes all examples from the ExampleSet that do not pass validation
- Parameters:
validator (ExampleValidator) – An object that follows the ExampleValidator interface.
report (bool) – Should information about what is purged be printed?
progress (bool) – Should a progress bar be displayed?
- Return type:
None
- remove_duplicates_and_mismatches(report=False)
Removes duplicate example entries and removes all instances of examples that have mismatched labels.
- Parameters:
report (bool) – Should information about what was removed be included?
- Return type:
None
- save_csv(filename, out_dir=None, sep=',')
Write out the ExampleSet data as a CSV file relative to out_dir. Only writes out example_df equivalent.
This also writes out a comment header section that includes information about ExampleSet parameters at the time the file was written.
- Parameters:
filename (str) – The filename to save. Will be relative to out_dir.
out_dir (Optional[str]) – The directory to save the file in. Defaults to Config().output_dir.
sep (str) – Delimiter string used by Pandas when writing the "csv" file.
- Return type:
None
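The CSV produced by save_csv and read back by load_csv is essentially a delimited dump of the example_df-shaped DataFrame; the round trip can be sketched with pandas alone. This is illustrative, not the library's code: save_csv additionally writes a comment header with ExampleSet parameters, which this sketch omits.

```python
import os
import tempfile

import pandas as pd

# A minimal example_df-shaped frame (mandatory columns plus one feature).
df = pd.DataFrame({
    "zone": ["2L25"],
    "dtime": ["2020-09-21 06:53:16.5"],
    "cavity_label": ["5"],
    "fault_label": ["E_Quench"],
    "label_source": ["label_file"],
    "feat_1": [0.12],
})

# Write to a file relative to an output directory, then read it back.
out_dir = tempfile.mkdtemp()
path = os.path.join(out_dir, "my_features.csv")
df.to_csv(path, sep=",", index=False)

round_trip = pd.read_csv(path, sep=",")
```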
- update_example_set(df, metadata_columns=None)[source]
Update the _example_df and blank other internal data derived from it.
- Parameters:
df (DataFrame) – A DataFrame containing an example per row with additional feature information. Must be valid for this FeatureSet.
metadata_columns (Optional[List[str]]) – A new list of metadata columns for df.
- Return type:
None
- update_metadata_columns(metadata_columns)[source]
This updates the metadata columns and alters other related data structures.
self.metadata_columns always includes ExampleSet._mandatory_columns. Deduplication is performed should you include those as well.
- Parameters:
metadata_columns (List[str]) – A list of metadata columns.
- Return type:
None