rfwtools.data_set.DataSet

class rfwtools.data_set.DataSet(label_files=None, example_validator=<rfwtools.example_validator.ExampleValidator object>, e_type=ExampleType.EXAMPLE, example_kwargs=None)[source]

Bases: object

This class manages a standard workflow from label files on disk to the generation of features.

label_files

A list of strings that are the label filenames in Config().label_dir. None implies all files.

example_validator

An ExampleValidator object. This defines the characteristics of valid objects in DataSet’s ExampleSet.

config_yaml

A YAML formatted string representing the Config() object at the time a DataSet runs save()

feature_set

A FeatureSet object containing the feature data generated by produce_feature_set()

example_set

An ExampleSet object that contains all of the examples generated from label_files by produce_example_set()

example_set_model

An ExampleSet object containing examples labeled by the production classifier during the same time span as example_set.

__init__(label_files=None, example_validator=<rfwtools.example_validator.ExampleValidator object>, e_type=ExampleType.EXAMPLE, example_kwargs=None)[source]

Create a DataSet instance that will collect events based on the provided label files and configured filters.

Some filters such as excluded zones or excluded times are set in the Config objects.

Parameters:
  • label_files (list(str)) – Filenames of the label files to parse. None means use all label files in Config().label_dir.

  • example_validator (ExampleValidator) – validation performed if None

  • e_type (ExampleType) – The type of IExample objects that will be created. Defaults to ExampleType.EXAMPLE.

  • example_kwargs (Optional[dict]) – A dictionary of keyword arguments that will be passed to the e_type constructor
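As a rough illustration of how the constructor fits into the workflow, the sketch below builds a DataSet and fills its example_set. It assumes rfwtools is installed and that Config() already points at a valid label_dir. The helper name build_example_set is hypothetical, and only signatures listed on this page are used.

```python
def build_example_set(label_files=None):
    """Create a DataSet and populate its example_set from label files."""
    # Imported lazily so this sketch can be defined without rfwtools installed.
    from rfwtools.data_set import DataSet

    # label_files=None means "use every label file in Config().label_dir".
    ds = DataSet(label_files=label_files)

    # get_model_data=False skips querying the production model's results,
    # so example_set_model is not populated in this sketch.
    ds.produce_example_set(report=False, progress=True, get_model_data=False)
    return ds
```

After a call such as `ds = build_example_set()`, `ds.get_example_array()` would return the examples as a numpy array.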

Methods

__init__([label_files, example_validator, ...])

Create a DataSet instance that will collect events based on the provided label files and configured filters.

get_example_array()

Convenience function for getting an array containing the Example objects held in self.example_set

load(filename[, in_dir])

DEPRECATED Loads a bzipped pickle file produced by save().

load_example_set(filename[, in_dir])

DEPRECATED.

load_example_set_csv(filename[, e_type, ...])

Load an ExampleSet CSV file.

load_feature_set(filename[, in_dir])

DEPRECATED.

load_feature_set_csv(filename, **kwargs)

Load a FeatureSet CSV file.

produce_example_set([report, progress, ...])

This method causes the DataSet object to produce a set of uniquely labeled examples based on label files.

produce_feature_set(extraction_function[, ...])

This method produces a FeatureSet by applying extraction_function to each Example in the example set.

save(filename[, out_dir])

DEPRECATED Saves a bzipped pickle file containing this DataSet object.

save_example_set(filename[, out_dir])

DEPRECATED.

save_example_set_csv(filename, **kwargs)

Save an ExampleSet CSV file.

save_feature_set(filename[, out_dir])

DEPRECATED.

save_feature_set_csv(filename, **kwargs)

Save a FeatureSet CSV file.

get_example_array()[source]

Convenience function for getting an array containing the Example objects held in self.example_set

Return type:

ndarray

Returns:

A numpy.ndarray generated by pd.DataFrame.values

static load(filename, in_dir=None)[source]

DEPRECATED Loads a bzipped pickle file produced by save(). Returns the resulting DataSet.

Save files made with different versions of rfwtools may be impossible to load. Please save/load ExampleSets and FeatureSets using their built in CSV methods for long-term storage.

Parameters:
  • filename (str) – Filename of the save file. Should include the .pkl.bz2 extension for clarity.

  • in_dir (Optional[str]) – The directory where the file should be found. If None, the Config().output_dir is used.

Return type:

DataSet

load_example_set(filename, in_dir=None)[source]

DEPRECATED. Loads a binary bzipped pickle file containing an example set. Old versions used pickle files.

load_example_set_csv(filename, e_type=None, example_kwargs=None, **kwargs)[source]

Load an ExampleSet CSV file. Overwrites existing example_set with new ExampleSet.

This method allows the user to specify the type of Example that will be constructed from the rows of the CSV. Some types will require specific example_kwargs values.

Parameters:
  • filename (str) – The name of the file to load. Relative to in_dir (if supplied) or Config().output_dir.

  • e_type (Optional[ExampleType]) – The type of Examples to be included in this ExampleSet

  • example_kwargs (Optional[Dict[str, Any]]) – A dictionary of keyword arguments passed on to the ExampleSet constructor

  • **kwargs – All keyword args are passed on to ExampleSet.load_csv()

load_feature_set(filename, in_dir=None)[source]

DEPRECATED. Loads a binary pickle file containing a feature set.

Return type:

None

load_feature_set_csv(filename, **kwargs)[source]

Load a FeatureSet CSV file. Overwrites existing feature_set with new FeatureSet.

Parameters:
  • filename (str) – The name of the file to load. Relative to in_dir (if supplied) or Config().output_dir.

  • **kwargs – All keyword args are passed on to FeatureSet.load_csv()

Return type:

None

produce_example_set(report=False, progress=True, get_model_data=True)[source]

This method causes the DataSet object to produce a set of uniquely labeled examples based on label files.

self.example_set will contain the resulting ExampleSet. self.example_set_model will contain the ExampleSet corresponding to the production model’s results

Parameters:
  • report (bool) – Whether a report detailing issues with the label file contents should be printed

  • progress (bool) – Whether a progress bar should be displayed during validation

  • get_model_data (bool) – Whether the model results should be queried and stored in example_set_model

Return type:

None

Returns:

None

produce_feature_set(extraction_function, max_workers=4, verbose=False, **kwargs)[source]

This method produces a FeatureSet by applying extraction_function to each Example in the example set.

extraction_function should be a callable that requires exactly one positional argument, the Example on which to operate, and returns a pandas DataFrame with one row and a constant number of columns. Because the output from each Example is appended to a master DataFrame, the extraction function should take care to produce identical column names for every Example. The extraction function is responsible for loading/unloading an Example’s data as needed. A return value of None will cause the Example to be excluded from the feature set.

Additional keyword arguments will be passed to the extraction function.

Each row represents the results of a single extraction. A row also contains the cavity label, fault label, zone, timestamp, and label_source of the example.

Note: this uses Python multithreading to allow for concurrent data loading/unloading.

Parameters:
  • extraction_function (callable) – Function that will perform feature extraction on a single Example

  • max_workers (int) – The number of parallel process workers to launch

  • verbose (bool) – Controls level of print output

  • **kwargs – All additional keyword arguments will be passed to the extraction_function

Return type:

None
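The contract described for produce_feature_set (one Example in, one row or None out, extra keyword arguments forwarded, work spread over a pool of threads) can be sketched without rfwtools. This is only an illustration, not the library's implementation. FakeExample, extract_mean, and build_rows are hypothetical names, and plain dicts stand in for the one-row pandas DataFrames that a real extraction function returns.

```python
from concurrent.futures import ThreadPoolExecutor


class FakeExample:
    """Hypothetical stand-in for an rfwtools Example."""

    def __init__(self, zone, samples):
        self.zone = zone
        self.samples = samples


def extract_mean(example, scale=1.0):
    """Return one feature row, or None to exclude the example."""
    if not example.samples:
        return None  # nothing to extract, so this example is dropped
    mean = sum(example.samples) / len(example.samples)
    # Always emit the same keys so every row has identical columns.
    return {"mean": scale * mean}


def build_rows(examples, extraction_function, max_workers=4, **kwargs):
    """Apply extraction_function to each example using a thread pool."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with examples.
        results = list(
            pool.map(lambda ex: extraction_function(ex, **kwargs), examples)
        )
    rows = []
    for ex, features in zip(examples, results):
        if features is None:
            continue  # a None result excludes the example
        # Prepend example metadata to the extracted features.
        rows.append({"zone": ex.zone, **features})
    return rows


examples = [
    FakeExample("1L22", [1.0, 3.0]),
    FakeExample("1L23", []),  # will be excluded
    FakeExample("2L05", [2.0, 2.0, 5.0]),
]
rows = build_rows(examples, extract_mean, max_workers=2, scale=2.0)
print(rows)
# [{'zone': '1L22', 'mean': 4.0}, {'zone': '2L05', 'mean': 6.0}]
```

With the real library, the equivalent call would be `ds.produce_feature_set(extraction_function, max_workers=2, scale=2.0)`, where the extraction function returns a one-row DataFrame rather than a dict.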

save(filename, out_dir=None)[source]

DEPRECATED Saves a bzipped pickle file containing this DataSet object.

This is a convenient way to save all of your work within a DataSet object. However, changes to the installed rfwtools software may make it difficult or impossible to load this object in the future. Please save individual ExampleSets and FeatureSets using their supplied method (save_csv) or the DataSet wrapper method, e.g., save_example_set_csv().

Note: this also saves a serialized version of the Config object.

Parameters:
  • filename (str) – Filename of the save file. Should include the .pkl.bz2 extension for clarity.

  • out_dir (Optional[str]) – The directory where the file should be placed. If None, the Config().output_dir is used.

Return type:

None
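Since the pickle-based save() is deprecated in favor of CSV storage, a minimal sketch of the CSV route is given below. It assumes a DataSet whose example_set and feature_set have already been produced. The helper names and the file-naming scheme are hypothetical, and only the wrapper methods documented on this page are called.

```python
def save_sets_as_csv(data_set, stem):
    """Write the example and feature sets of data_set to two CSV files."""
    # Hypothetical naming scheme: <stem>_examples.csv and <stem>_features.csv.
    example_file = f"{stem}_examples.csv"
    feature_file = f"{stem}_features.csv"
    data_set.save_example_set_csv(example_file)
    data_set.save_feature_set_csv(feature_file)
    return example_file, feature_file


def load_sets_from_csv(data_set, example_file, feature_file):
    """Replace data_set's example_set and feature_set with the CSV contents."""
    data_set.load_example_set_csv(example_file)
    data_set.load_feature_set_csv(feature_file)
    return data_set
```

Because the helpers only call methods on the object they are given, they work with any DataSet instance. With no directory keyword supplied, filenames are resolved relative to Config().output_dir.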

save_example_set(filename, out_dir=None)[source]

DEPRECATED. Save a binary bzipped pickle file of the current example set.

Return type:

None

save_example_set_csv(filename, **kwargs)[source]

Save an ExampleSet CSV file. All keyword args are passed to ExampleSet.save_csv method.

Parameters:
  • filename (str) – The name of the file to save. Relative to out_dir (if supplied) or Config().output_dir.

  • **kwargs – All keyword args are passed on to ExampleSet.save_csv()

Return type:

None

save_feature_set(filename, out_dir=None)[source]

DEPRECATED. Saves a binary pickle file containing the feature set DataFrame.

Return type:

None

save_feature_set_csv(filename, **kwargs)[source]

Save a FeatureSet CSV file. All keyword args are passed to FeatureSet.save_csv method.

Parameters:
  • filename (str) – The name of the file to save. Relative to out_dir (if supplied) or Config().output_dir.

  • **kwargs – All keyword args are passed on to FeatureSet.save_csv()

Return type:

None