Data handling
Reports
VaRA-TS manages experiment result data in the form of reports. A report file contains all data generated during an experiment, and the report class gives the user an interface to interact with that data. To simplify report handling and storage management, the report base classes provide functionality to automatically create customized filenames. In each filename, the framework encodes information such as the report type, project, revision, and a UUID that identifies the run that created the file. Report implementers can customize the filename further.
For a simple example that can help you implement your own report, take a look at the EmptyReport.
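To make the filename scheme concrete, the following minimal sketch encodes report type, project, revision, and a UUID into a file name and decodes them again. The layout, the separators, and the names `ReportFileName`, `make_report_filename`, and `parse_report_filename` are assumptions for illustration only; the report base classes of VaRA-TS define the real format.

```python
import uuid
from typing import NamedTuple


class ReportFileName(NamedTuple):
    """Hypothetical container for the parts encoded in a report file name."""
    report_type: str
    project: str
    revision: str
    run_id: str
    extension: str


def make_report_filename(
    report_type: str, project: str, revision: str, extension: str = "yaml"
) -> str:
    """Encode the parts into one file name (assumed layout, not VaRA-TS's)."""
    run_id = str(uuid.uuid4())  # identifies the run that created the file
    return f"{report_type}-{project}-{revision}_{run_id}.{extension}"


def parse_report_filename(file_name: str) -> ReportFileName:
    """Decode a name produced by make_report_filename.

    Assumes that report type and project contain no '-' and that the
    revision contains no '_'.
    """
    stem, extension = file_name.rsplit(".", 1)
    head, run_id = stem.rsplit("_", 1)
    report_type, project, revision = head.split("-", 2)
    return ReportFileName(report_type, project, revision, run_id, extension)
```

Encoding the metadata in the name lets tools select files by project or revision without opening them.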
List of provided report classes
Report module
Handling utilities for generated report files
Module for handling revision specific files.
When analyzing a project, result files are generated for specific project revisions. This module provides functionality to manage and access these revision specific files, e.g., to get all files of a specific report that have been processed successfully.
- varats.revision.revisions.is_revision_blocked(revision, project_cls)
Checks if a revision is blocked on a given project.
- Parameters
revision (str) – the revision
project_cls (Type[Project]) – the project class the revision belongs to
- Return type
bool
- Returns
True if the revision is blocked, otherwise False
- varats.revision.revisions.filter_blocked_revisions(revisions, project_cls)
Filter out all blocked revisions.
- Parameters
revisions (List[str]) – list of revisions
project_cls (Type[Project]) – the project class the revisions belong to
- Return type
List[str]
- Returns
filtered revision list
- varats.revision.revisions.get_all_revisions_files(project_name, result_file_type, file_name_filter=<function <lambda>>, only_newest=True)
Find all file paths to revision files.
- Parameters
project_name (str) – target project
result_file_type (MetaReport) – the type of the result file
file_name_filter (Callable[[str], bool]) – optional filter to exclude certain files; returns True if the file_name should not be checked
only_newest (bool) – whether to include all result files, or only the newest; if False, result files for the same revision are sorted descending by the file’s mtime
- Return type
List[Path]
- Returns
a list of file paths to revision files
- varats.revision.revisions.get_processed_revisions_files(project_name, result_file_type, file_name_filter=<function <lambda>>, only_newest=True)
Find all file paths to correctly processed revision files.
- Parameters
project_name (str) – target project
result_file_type (MetaReport) – the type of the result file
file_name_filter (Callable[[str], bool]) – optional filter to exclude certain files; returns True if the file_name should not be checked
only_newest (bool) – whether to include all result files, or only the newest; if False, result files for the same revision are sorted descending by the file’s mtime
- Return type
List[Path]
- Returns
a list of file paths to correctly processed revision files
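The interplay of `file_name_filter` and `only_newest` can be sketched as follows. This is not the VaRA-TS implementation: `select_result_files` and the `revision_of` callback are hypothetical names, and how a revision is derived from a file is left to the caller.

```python
from collections import defaultdict
from pathlib import Path
from typing import Callable, Dict, List


def select_result_files(
    files: List[Path],
    revision_of: Callable[[Path], str],
    file_name_filter: Callable[[str], bool] = lambda name: False,
    only_newest: bool = True,
) -> List[Path]:
    """Group files by revision and keep the newest one (or all, newest first)."""
    by_revision: Dict[str, List[Path]] = defaultdict(list)
    for path in files:
        if file_name_filter(path.name):  # True means: exclude this file
            continue
        by_revision[revision_of(path)].append(path)

    selected: List[Path] = []
    for candidates in by_revision.values():
        # Sort descending by modification time, so the newest file comes first.
        candidates.sort(key=lambda p: p.stat().st_mtime, reverse=True)
        selected.extend(candidates[:1] if only_newest else candidates)
    return selected
```

With `only_newest=False`, all files of a revision are returned, ordered from newest to oldest.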
- varats.revision.revisions.get_failed_revisions_files(project_name, result_file_type, file_name_filter=<function <lambda>>, only_newest=True)
Find all file paths to failed revision files.
- Parameters
project_name (str) – target project
result_file_type (MetaReport) – the type of the result file
file_name_filter (Callable[[str], bool]) – optional filter to exclude certain files; returns True if the file_name should not be included
only_newest (bool) – whether to include all result files, or only the newest; if False, result files for the same revision are sorted descending by the file’s mtime
- Return type
List[Path]
- Returns
a list of file paths to failed revision files
- varats.revision.revisions.get_processed_revisions(project_name, result_file_type)
Calculates a list of revisions of a project that have already been processed successfully.
- Parameters
project_name (str) – target project
result_file_type (MetaReport) – the type of the result file
- Return type
List[str]
- Returns
list of correctly processed revisions
- varats.revision.revisions.get_failed_revisions(project_name, result_file_type)
Calculates a list of revisions of a project that have failed.
- Parameters
project_name (str) – target project
result_file_type (MetaReport) – the type of the result file
- Return type
List[str]
- Returns
list of failed revisions
- varats.revision.revisions.get_tagged_revisions(project_cls, result_file_type, tag_blocked=True)
Calculates a list of revisions of a project tagged with the file status. If two files exist, the newest one is used to detect the status.
- Parameters
project_cls (Type[Project]) – target project
result_file_type (MetaReport) – the type of the result file
tag_blocked (bool) – whether to tag blocked revisions as blocked
- Return type
List[Tuple[str, FileStatusExtension]]
- Returns
list of tuples (revision, FileStatusExtension)
- varats.revision.revisions.get_tagged_revision(revision, project_name, result_file_type)
Calculates the file status for a revision. If two files exist, the newest one is used to detect the status.
- Parameters
revision (str) – the revision to get the status for
project_name (str) – target project
result_file_type (MetaReport) – the type of the result file
- Return type
FileStatusExtension
- Returns
the status for the revision
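The tagging logic described above (the newest file decides the status, blocked revisions are optionally tagged as blocked) might look roughly like the sketch below. The `Status` enum is a hypothetical stand-in for `FileStatusExtension`, and both the precedence of the blocked tag and the handling of revisions without files are assumptions.

```python
from enum import Enum
from typing import Callable, Dict, List, Tuple


class Status(Enum):
    """Hypothetical stand-in for FileStatusExtension."""
    SUCCESS = "success"
    FAILED = "failed"
    MISSING = "missing"
    BLOCKED = "blocked"


def tag_revisions(
    revisions: List[str],
    files: Dict[str, List[Tuple[float, Status]]],
    is_blocked: Callable[[str], bool],
    tag_blocked: bool = True,
) -> List[Tuple[str, Status]]:
    """Tag each revision with the status of its newest result file.

    `files` maps a revision to (mtime, status) pairs of its result files.
    """
    tagged: List[Tuple[str, Status]] = []
    for revision in revisions:
        if tag_blocked and is_blocked(revision):
            tagged.append((revision, Status.BLOCKED))
            continue
        candidates = files.get(revision, [])
        if not candidates:
            tagged.append((revision, Status.MISSING))
            continue
        # The entry with the largest mtime decides the status.
        _, status = max(candidates, key=lambda entry: entry[0])
        tagged.append((revision, status))
    return tagged
```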
- varats.revision.revisions.get_supplementary_result_files(project_name, result_file_type, revision=None, suppl_info_type=None)
Returns the current supplementary result files for a given project and report type. If a specific revision is specified, only the result files for that revision are returned; otherwise, all files for all available revisions are returned.
- Parameters
project_name (str) – target project
result_file_type (MetaReport) – the type of the result file
revision (Optional[str]) – the revision for which the result files should be returned
suppl_info_type (Optional[str]) – only include result files of the specified type
- Return type
List[Tuple[Path, str, str]]
- Returns
list of tuples of result file path, revision, and supplementary result file type
Data management
Report data can be accessed via different Database classes.
Each concrete database class offers its data in the form of a pandas dataframe with a specific layout.
Clients can query them for the data for a specific project or case study via the function get_data_for_project.
The database class then takes care of loading and caching the relevant result files.
You can add new database classes by creating a subclass of Database in a separate module in the directory varats/data/databases.
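As a rough illustration of what such a subclass provides, the sketch below mirrors the contract in plain Python. It is deliberately simplified: rows are lists of dicts standing in for a pandas dataframe, the `commit_map` and `case_studies` arguments are omitted, and `SketchDatabase`, `CommitCountDatabase`, and the `num_commits` column are hypothetical names that do not exist in VaRA-TS.

```python
import abc
from typing import Any, Dict, List

Row = Dict[str, Any]  # stand-in for one dataframe row


class SketchDatabase(abc.ABC):
    """Simplified stand-in for the database base class."""
    CACHE_ID: str
    COLUMNS = ["revision", "time_id"]

    @classmethod
    @abc.abstractmethod
    def _load_dataframe(cls, project_name: str, **kwargs: Any) -> List[Row]:
        """Load (and, in the real class, cache) the report data."""

    @classmethod
    def get_data_for_project(
        cls, project_name: str, columns: List[str], **kwargs: Any
    ) -> List[Row]:
        """Return only the requested columns, which must occur in COLUMNS."""
        unknown = [col for col in columns if col not in cls.COLUMNS]
        if unknown:
            raise ValueError(f"Columns not offered by {cls.__name__}: {unknown}")
        rows = cls._load_dataframe(project_name, **kwargs)
        return [{col: row[col] for col in columns} for row in rows]


class CommitCountDatabase(SketchDatabase):
    """Hypothetical database that offers one additional column."""
    CACHE_ID = "commit_count_data"
    COLUMNS = SketchDatabase.COLUMNS + ["num_commits"]

    @classmethod
    def _load_dataframe(cls, project_name: str, **kwargs: Any) -> List[Row]:
        # A real implementation would parse report files here.
        return [{"revision": "abc123", "time_id": 0, "num_commits": 42}]
```

Clients only call `get_data_for_project`; loading and caching stay hidden behind `_load_dataframe`.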
The following databases are currently available:
Module: database
Module for the base Database class.
- class varats.data.databases.evaluationdatabase.EvaluationDatabase
Bases: abc.ABC
Base class for accessing report data.
- Subclasses have to provide the following:
a list of available columns in the variable COLUMNS; this list must start with Database.COLUMNS!
an identifier for cache files CACHE_ID
a function _load_dataframe() that loads and transparently caches report data
- CACHE_ID: str
- COLUMNS = ['revision', 'time_id']
- classmethod get_data_for_project(project_name, columns, commit_map, *case_studies, **kwargs)
Retrieve data for a given project and case study.
- Parameters
project_name (str) – the project to retrieve data for
columns (List[str]) – the columns the resulting dataframe should have; all column names must occur in the COLUMNS class variable
commit_map (CommitMap) – the commit map to use
case_studies (CaseStudy) – the case studies to retrieve data for
kwargs (Any) – additional arguments that are passed to _load_dataframe()
- Return type
DataFrame
- Returns
a pandas dataframe with the given columns and the
Module: cache_helper
Utility functions and class to allow easier caching of pandas dataframes and other data.
- varats.data.cache_helper.get_data_file_path(data_id, project_name)
Compose the identifier and project into a file path that points to the corresponding cache file in the cache directory.
- Parameters
data_id (str) – identifier or identifier_name of the dataframe
project_name (str) – name of the project
Test:
>>> str(get_data_file_path("foo", "tmux"))
'data_cache/foo-tmux.csv.gz'
>>> isinstance(get_data_file_path("foo.csv", "tmux"), Path)
True
- Return type
Path
- varats.data.cache_helper.load_cached_df_or_none(data_id, project_name)
Load cached dataframe from disk, otherwise return None.
- Parameters
data_id (str) – identifier or identifier_name of the dataframe
project_name (str) – name of the project
- Return type
Optional[DataFrame]
- varats.data.cache_helper.cache_dataframe(data_id, project_name, dataframe)
Cache a dataframe by persisting it to disk.
- Parameters
data_id (str) – identifier or identifier_name of the dataframe
project_name (str) – name of the project
dataframe (DataFrame) – pandas dataframe to store
- Return type
None
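The three helpers above follow a common persist-and-reload pattern. The sketch below reproduces that pattern with the standard library only, using lists of dicts instead of pandas dataframes; the function names and the explicit `cache_dir` parameter are assumptions made to keep the example self-contained. Only the `<data_id>-<project_name>.csv.gz` naming is taken from the doctest above.

```python
import csv
import gzip
from pathlib import Path
from typing import Dict, List, Optional

Rows = List[Dict[str, str]]


def data_file_path(cache_dir: Path, data_id: str, project_name: str) -> Path:
    """Compose '<cache_dir>/<data_id>-<project_name>.csv.gz'."""
    return cache_dir / f"{data_id}-{project_name}.csv.gz"


def cache_rows(cache_dir: Path, data_id: str, project_name: str, rows: Rows) -> None:
    """Persist rows as a gzip-compressed CSV file."""
    path = data_file_path(cache_dir, data_id, project_name)
    path.parent.mkdir(parents=True, exist_ok=True)
    fieldnames = list(rows[0].keys()) if rows else []
    with gzip.open(path, "wt", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


def load_cached_rows_or_none(
    cache_dir: Path, data_id: str, project_name: str
) -> Optional[Rows]:
    """Load cached rows from disk, or return None if no cache file exists."""
    path = data_file_path(cache_dir, data_id, project_name)
    if not path.exists():
        return None
    with gzip.open(path, "rt", newline="") as handle:
        return [dict(row) for row in csv.DictReader(handle)]
```

Note that CSV stores every value as text, so a real cache has to restore column types after loading.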
- varats.data.cache_helper.build_cached_report_table(data_id, project_name, data_to_load, data_to_drop, create_empty_df, create_cache_entry_data, get_entry_id, get_entry_timestamp, is_newer_timestamp)
Build up an automatically cached dataframe.
- Parameters
data_id (str) – graph cache identifier
project_name (str) – name of the project to work with
data_to_load (List[TypeVar(InDataType)]) – list of data items to be loaded
data_to_drop (List[TypeVar(InDataType)]) – list of data items to be discarded
create_empty_df (Callable[[], DataFrame]) – creates an empty layout of the dataframe
create_cache_entry_data (Callable[[TypeVar(InDataType)], Tuple[DataFrame, str, str]]) – creates a dataframe from a data item
get_entry_id (Callable[[TypeVar(InDataType)], str]) – returns a unique identifier for one data item
get_entry_timestamp (Callable[[TypeVar(InDataType)], str]) – returns a string with information that can be used to determine which of two data items is newer
is_newer_timestamp (Callable[[str, str], bool]) – checks whether one data item is newer than another based on their timestamps
- Return type
DataFrame
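The callback parameters suggest an incremental update: entries are dropped on request, and an item is only (re)loaded when it is missing from the cache or newer than the cached entry. The sketch below shows that idea with a plain dict as cache. It is an assumption about the general technique, not the actual implementation; `update_cache` and `create_entry_rows` are hypothetical names.

```python
from typing import Callable, Dict, List, Tuple, TypeVar

Item = TypeVar("Item")
# Maps an entry id to (timestamp, cached rows).
Cache = Dict[str, Tuple[str, List[dict]]]


def update_cache(
    cache: Cache,
    data_to_load: List[Item],
    data_to_drop: List[Item],
    create_entry_rows: Callable[[Item], List[dict]],
    get_entry_id: Callable[[Item], str],
    get_entry_timestamp: Callable[[Item], str],
    is_newer_timestamp: Callable[[str, str], bool],
) -> Cache:
    """Drop discarded entries and (re)load only missing or outdated ones."""
    for item in data_to_drop:
        cache.pop(get_entry_id(item), None)

    for item in data_to_load:
        entry_id = get_entry_id(item)
        timestamp = get_entry_timestamp(item)
        cached = cache.get(entry_id)
        # Skip the expensive load when the cached entry is still current.
        if cached is None or is_newer_timestamp(timestamp, cached[0]):
            cache[entry_id] = (timestamp, create_entry_rows(item))
    return cache
```

The benefit is that the expensive step, parsing a report into rows, runs once per changed item instead of once per query.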
Module: data_manager
The DataManager module handles the loading, creation, and caching of data classes.
With the DataManager in the background, we can load files from multiple locations within the tool suite without loading the same file twice. In addition, this speeds up the reloading of files, for example, in interactive environments such as Jupyter notebooks, where re-executing code sometimes triggers a file load.
- varats.data.data_manager.sha256_checksum(file_path, block_size=65536)
Compute sha256 checksum of file.
- Parameters
file_path (Path) – path to the file
block_size (int) – number of bytes read per cycle
- Return type
str
- Returns
sha256 hash of the file
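Block-wise hashing keeps memory usage constant even for large report files. A minimal sketch of the technique with `hashlib` is shown below; the function name `sha256_of_file` is chosen for illustration and the body is not copied from VaRA-TS.

```python
import hashlib
from pathlib import Path


def sha256_of_file(file_path: Path, block_size: int = 65536) -> str:
    """Compute the sha256 hex digest of a file, reading block_size bytes per cycle."""
    digest = hashlib.sha256()
    with open(file_path, "rb") as handle:
        # iter() with a sentinel calls read() until it returns b"" (end of file).
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()
```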
- class varats.data.data_manager.FileBlob(key, file_path, data)
Bases: object
A FileBlob is a keyed data blob for everything that is loadable from a file and can be converted to a VaRA DataClass.
- Parameters
key (str) – identifier for the file
file_path (Path) – path to the file
data (TypeVar(LoadableType, bound=BaseReport)) – a blob of data in memory
- property key: str
The key used as an index to the blob.
- Return type
str
- property file_path: pathlib.Path
File path to the loaded file.
- Return type
Path
- property data: varats.data.data_manager.LoadableType
The loaded DataClass from the file.
- Return type
TypeVar(LoadableType, bound=BaseReport)
- class varats.data.data_manager.FileSignal
Bases: PyQt5.QtCore.QObject
Emit signals after the file was loaded.
- finished
- clean
- class varats.data.data_manager.FileLoader(func, file_path, class_type)
Bases: PyQt5.QtCore.QRunnable
Manages concurrent file loading in the background of the application.
- class varats.data.data_manager.DataManager
Bases: object
Manages data over the lifetime of the tool suite.
The DataManager handles the concurrent file loading, creation of DataClasses, and caching of loaded files.
- load_data_class(file_path, DataClassTy, loaded_callback)
Load a DataClass of type <DataClassTy> from a file asynchronously.
- Parameters
file_path (Path) – path to the file
DataClassTy (Type[TypeVar(LoadableType, bound=BaseReport)]) – type of the report class to be loaded
loaded_callback (Callable[[TypeVar(LoadableType, bound=BaseReport)], None]) – callback that gets called after loading has finished
- Return type
None
- load_data_class_sync(file_path, DataClassTy)
Load a DataClass of type <DataClassTy> from a file synchronously.
- Parameters
file_path (Path) – path to the file
DataClassTy (Type[TypeVar(LoadableType, bound=BaseReport)]) – type of the report class to be loaded
- Return type
TypeVar(LoadableType, bound=BaseReport)
- Returns
the loaded report file
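One way to avoid loading the same file twice is to key loaded objects by a checksum of the file contents. The sketch below shows a synchronous variant of that idea. Whether the DataManager uses exactly this key is an assumption here, and `SketchDataManager`, `load_sync`, and the `loader` callback are hypothetical names.

```python
import hashlib
from pathlib import Path
from typing import Callable, Dict, Generic, TypeVar

T = TypeVar("T")


class SketchDataManager(Generic[T]):
    """Caches loaded objects, keyed by the sha256 checksum of the file contents."""

    def __init__(self) -> None:
        self._blobs: Dict[str, T] = {}

    def load_sync(self, file_path: Path, loader: Callable[[Path], T]) -> T:
        """Load the file with loader, or return the cached object if unchanged."""
        key = hashlib.sha256(file_path.read_bytes()).hexdigest()
        if key not in self._blobs:
            self._blobs[key] = loader(file_path)
        return self._blobs[key]
```

Keying by content rather than by path means a modified file is loaded again, while an unchanged file is served from memory.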
Module: version_header
This module provides a reusable version header for all yaml reports generated by VaRA.
The version header specifies the type of the following yaml file and the version.
- exception varats.base.version_header.WrongYamlFileType(expected_type, actual_type)
Bases: Exception
Exception raised for mismatches of the file type.
- exception varats.base.version_header.WrongYamlFileVersion(expected_version, actual_version)
Bases: Exception
Exception raised for mismatches of the file version.
- exception varats.base.version_header.NoVersionHeader
Bases: Exception
Exception raised for wrong yaml documents.
- class varats.base.version_header.VersionHeader(yaml_doc)
Bases: object
VersionHeader describing the type and version of the following yaml file.
- classmethod from_yaml_doc(yaml_doc)
Creates a VersionHeader object from a yaml dict.
- Parameters
yaml_doc (Dict[str, Any]) – version header yaml document
- Return type
VersionHeader
- classmethod from_version_number(doc_type, version)
Creates a new VersionHeader object from a doc_type string and a version number.
- Parameters
doc_type (str) – type of the document that should follow the version header
version (int) – the current version number
- Return type
VersionHeader
- property doc_type: str
Type of the following yaml file.
- Return type
str
- is_type(type_name)
Checks if the type of the following yaml file is type_name.
- Parameters
type_name (str) – type name of the possible following yaml document
- Return type
bool
- raise_if_not_type(type_name)
Checks if the type of the following yaml file is type_name; otherwise, raises an exception.
- Parameters
type_name (str) – type name of the possible following yaml document
- Return type
None
- property version: int
Document version number.
- Return type
int
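The following sketch shows how a version header of this kind can guard the parsing of a report. It works on an already parsed yaml dict, so no yaml library is needed. The key names `DocType` and `Version` are assumptions, and the classes are hypothetical stand-ins rather than the VaRA-TS implementation.

```python
from typing import Any, Dict


class NoVersionHeaderError(Exception):
    """Hypothetical stand-in for NoVersionHeader."""


class WrongTypeError(Exception):
    """Hypothetical stand-in for WrongYamlFileType."""


class SketchVersionHeader:
    """Describes the type and version of the yaml document that follows."""

    def __init__(self, yaml_doc: Dict[str, Any]) -> None:
        # The key names are assumed for this sketch.
        if "DocType" not in yaml_doc or "Version" not in yaml_doc:
            raise NoVersionHeaderError("document is not a version header")
        self.doc_type: str = yaml_doc["DocType"]
        self.version: int = int(yaml_doc["Version"])

    @classmethod
    def from_version_number(cls, doc_type: str, version: int) -> "SketchVersionHeader":
        """Create a header from a document type and a version number."""
        return cls({"DocType": doc_type, "Version": version})

    def is_type(self, type_name: str) -> bool:
        """Check whether the following document has the given type."""
        return self.doc_type == type_name

    def raise_if_not_type(self, type_name: str) -> None:
        """Raise if the following document does not have the given type."""
        if not self.is_type(type_name):
            raise WrongTypeError(f"expected {type_name}, got {self.doc_type}")
```

A report parser would typically call `raise_if_not_type` before reading the rest of the file, so that a wrong file fails early with a clear error.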
Data providers
Providers are a means to supply additional data for a project. For example, the CVE Provider allows access to all CVEs that are related to a project.
You can implement your own provider by creating a subclass of Provider in its own subdirectory of provider in varats-core.
There is no restriction on the format in which data has to be provided.
The Provider abstract class only requires you to specify how to create an instance of your provider for a specific project, as well as a fallback implementation (that most likely returns no data).
If your provider needs some project-specific implementation, create a class with the name <your_provider_class>Hook and make the projects inherit from it, similar to the CVEProviderHook.
If a project does not inherit from that hook, your provider’s create_provider_for_project() should return None.
In that case, the provider factory method falls back to your default provider implementation and issues a warning.
For an example provider implementation take a look at the CVE Provider.
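The hook and fallback mechanism described above can be sketched as follows. All classes here are simplified, hypothetical stand-ins: `Project` replaces the benchbuild project class, and `ReleaseProvider`, `ReleaseProviderHook`, and `DefaultReleaseProvider` are invented for this example.

```python
import abc
import warnings
from typing import List, Optional, Type


class Project:
    """Stand-in for the benchbuild Project class."""


class ReleaseProviderHook:
    """Projects inherit from this hook to supply project-specific data."""

    @classmethod
    def get_releases(cls) -> List[str]:
        raise NotImplementedError


class SketchProvider(abc.ABC):
    """Simplified stand-in for the Provider base class."""

    def __init__(self, project: Type[Project]) -> None:
        self.project = project

    @classmethod
    @abc.abstractmethod
    def create_provider_for_project(
        cls, project: Type[Project]
    ) -> Optional["SketchProvider"]:
        """Return a provider for the project, or None if it is unsupported."""

    @classmethod
    @abc.abstractmethod
    def create_default_provider(cls, project: Type[Project]) -> "SketchProvider":
        """Return a fallback provider that works with any project."""

    @classmethod
    def get_provider_for_project(cls, project: Type[Project]) -> "SketchProvider":
        """Factory that falls back to the default provider with a warning."""
        provider = cls.create_provider_for_project(project)
        if provider is None:
            warnings.warn(f"{project.__name__} does not support {cls.__name__}.")
            return cls.create_default_provider(project)
        return provider


class ReleaseProvider(SketchProvider):
    @classmethod
    def create_provider_for_project(cls, project):
        # Only projects that inherit from the hook are supported.
        return cls(project) if issubclass(project, ReleaseProviderHook) else None

    @classmethod
    def create_default_provider(cls, project):
        return DefaultReleaseProvider(project)

    def get_releases(self) -> List[str]:
        return self.project.get_releases()


class DefaultReleaseProvider(ReleaseProvider):
    def get_releases(self) -> List[str]:
        return []  # the fallback provides no data
```

Callers always go through `get_provider_for_project` and therefore never have to handle a missing provider themselves.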
List of supported providers
Provider module
Provider interface module for projects.
Providers are a means to supply additional data for a project.
- class varats.provider.provider.Provider(project)
Bases: abc.ABC
A provider allows access to additional information about a project, e.g., which revisions of a project are releases, or which CVEs are related to a project.
- Parameters
project (Type[Project]) – the project this provider is associated with
- property project: Type[benchbuild.project.Project]
The project this provider is associated with.
- Return type
Type[Project]
- abstract classmethod create_provider_for_project(project)
Creates a provider instance for the given project if possible.
- Return type
Optional[TypeVar(ProviderType, bound=Provider)]
- Returns
a provider instance for the given project if possible, otherwise, None
- abstract classmethod create_default_provider(project)
Creates a default provider instance that can be used with any project.
- Return type
TypeVar(ProviderType, bound=Provider)
- Returns
a default provider instance
- classmethod get_provider_for_project(project)
Factory function for creating providers.
This function is guaranteed to return a valid instance of the requested provider by falling back to a default provider if necessary. A warning is issued in the latter case.
- Parameters
project (Type[Project]) – the project to create the provider for
- Return type
TypeVar(ProviderType, bound=Provider)
- Returns
an instance of this provider
Metrics
During data evaluation, one might wish to calculate different metrics for the data at hand. We collect the code for such metrics in a separate module to make these metrics reusable, e.g., in different plots.
Metrics module
This module contains functions that calculate various metrics on data.
- varats.data.metrics.lorenz_curve(data)
Calculates the values for the Lorenz curve of the data.
For more information see online Lorenz curve.
- Parameters
data (Series) – sorted series to calculate the Lorenz curve for
- Return type
Series
- Returns
the values of the Lorenz curve as a series
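For sorted, non-negative data, the Lorenz curve value at position i is the cumulative sum up to i divided by the total sum. The sketch below computes this on a plain Python sequence instead of a pandas Series; the function name `lorenz_curve_values` is chosen for illustration.

```python
from itertools import accumulate
from typing import List, Sequence


def lorenz_curve_values(sorted_data: Sequence[float]) -> List[float]:
    """Cumulative share of the total, for data sorted in ascending order.

    Requires non-empty data with a positive sum.
    """
    total = sum(sorted_data)
    return [partial / total for partial in accumulate(sorted_data)]
```

For a perfectly equal distribution the values rise linearly; the further the curve sags below that line, the more unequal the data.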
- varats.data.metrics.gini_coefficient(distribution)
Calculates the Gini coefficient of the data.
For more information see online Gini coefficient.
- Parameters
distribution (Series) – sorted series to calculate the Gini coefficient for
- Return type
float
- Returns
the Gini coefficient for the data
- varats.data.metrics.normalized_gini_coefficient(distribution)
Calculates the normalized Gini coefficient of the given data, i.e., gini(data) * (n / (n - 1)) where n is the length of the data.
- Parameters
distribution (Series) – sorted series to calculate the normalized Gini coefficient for
- Return type
float
- Returns
the normalized Gini coefficient for the data
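One common way to compute the Gini coefficient is half the relative mean absolute difference, i.e., the sum of |x_i - x_j| over all pairs divided by 2 * n^2 * mean. The sketch below uses that formula on plain Python sequences and applies the n / (n - 1) normalization from above. Whether the module uses this exact formulation is an assumption, and the function names are illustrative.

```python
from typing import Sequence


def gini(values: Sequence[float]) -> float:
    """Gini coefficient via the mean absolute difference.

    Requires non-empty data with a positive sum.
    """
    n = len(values)
    mean = sum(values) / n
    abs_diff_sum = sum(abs(x - y) for x in values for y in values)
    return abs_diff_sum / (2 * n * n * mean)


def normalized_gini(values: Sequence[float]) -> float:
    """Gini coefficient scaled by n / (n - 1); requires at least two values."""
    n = len(values)
    return gini(values) * n / (n - 1)
```

Without normalization, the maximum for n values is (n - 1) / n, which is reached when one item holds everything; the normalization rescales that maximum to 1 so that samples of different sizes are comparable.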