ebcpy package

ebcpy-Module. See readme or documentation for more information.

Subpackages

Submodules

ebcpy.data_types module

This module provides useful classes for all ebcpy. Every data_type class should include every parameter other classes like optimization etc. may need. The checking of correct input is especially relevant here as the correct format of data-types will prevent errors during simulations, optimization etc.

class ebcpy.data_types.TimeSeries(data=None, index=None, dtype: Dtype | None = None, name=None, copy: bool | None = None, fastpath: bool | lib.NoDefault = _NoDefault.no_default)[source]

Bases: Series

Overwrites pd.Series to enable correct slicing and expansion in the TimeSeriesData class

New in version 0.1.7.

class ebcpy.data_types.TimeSeriesData(data: str | Any, use_multicolumn: bool = False, **kwargs)[source]

Bases: DataFrame

Most data related to energy and building climate related problems is time-variant.

Class for handling time series data using a pandas dataframe. This class works file-based and makes the import of different file-types into a pandas DataFrame more user-friendly. Furthermore, functions to support multi-indexing are provided to efficiently handle variable passed processing and provide easy visualization and preprocessing access.

Parameters:

data (str,os.path.normpath,pd.DataFrame) – Filepath ending with either .hdf, .mat, .csv, .parquet, or .parquet.COMPRESSION_NAME containing time-dependent data to be loaded as a pandas.DataFrame. Alternative option is to pass a DataFrame directly.
key (str) – Name of the table in a .hdf-file if the file contains multiple tables.
sep (str) – separator for the use of a csv file. If none is provided, a comma (“,”) is used as a default value. See pandas.read_csv() docs for further information.
header (int, list) – Header columns for .csv files. See pandas.read_csv() docs for further information. Default is first row (0).
index_col (int,str) – Column to be used as index in .csv files. See pandas.read_csv() docs for further information. Default is first column (0).
sheet_name (str) – Name of the sheet you want to load data from. Required keyword argument when loading a xlsx-file.
default_tag (str) – Which value to use as tag. Default is ‘raw’
engine (str) – Chose the engine for reading .parquet files. Default is ‘pyarrow’ Other option is ‘fastparquet’ (python>=3.9).
variable_names (list) – List of variable names to load from .mat file. If you know which variables you want to plot, this may speed up loading significantly, and reduce memory size drastically. You can also supply wildcard patterns (e.g. “wall.layer[].T”, etc.) to match multiple variables at once.

Examples:

First let’s see the usage for a common dataframe.

>>> import numpy as np
>>> from ebcpy import TimeSeriesData
>>> tsd = TimeSeriesData({"my_variable": np.random.rand(5)})
>>> tsd.to_datetime_index()
>>> tsd.save("my_new_data.csv")

Now, let’s load the recently created file.

>>> tsd = TimeSeriesData("my_new_data.csv")

clean_and_space_equally(desired_freq, inplace: bool = True)[source]

Call to the preprocessing function ebcpy.preprocessing.clean_and_space_equally_time_series() See the docstring of this function to know what is happening.

Parameters:

desired_freq (str) – Frequency to determine number of elements in processed dataframe. Options are for example: - s: second-based - 5s: Every 5 seconds - 6min: Every 6 minutes This also works for h, d, m, y, ms etc.
inplace (bool) – If True, performs operation inplace and returns None.

Returns:

pd.DataFrame Cleaned and equally spaced data-frame

property default_tag: str: Get the default of time series data object

property filepath: str: Get the filepath associated with the time series data

property frequency

The frequency of the time series data. Returns’s the mean and the standard deviation of the index.

Returns:: float: Mean value float: Standard deviation

get_columns_by_tag(tag: str, variables: list | None = None, return_type: str = 'pandas', drop_level: bool = False)[source]

Returning all columns with defined tag in the form of ndarray.

Parameters:

tag (str) – Define the tag which return columns have to match.
variables (list) – Besides the given tag, specify the variables names matching the return criteria as well.
drop_level (boolean) – If tag should be included in the response. Default is True.
return_type (str) – Return format. Options are: - pandas (pd.series) - numpy, scipy, sp, and np (np.array) - control (transposed np.array)

Returns:

ndarray of input signals

get_tags(variable: str | None = None) → List[str][source]

Return an alphabetically sorted list of all tags

Parameters:: variable (str) – If given, tags of this variable are returned
Returns:: List[str]

get_variable_names(patterns: str | List[str] | None = None) → List[str][source]

Return an alphabetically sorted list of variable names, optionally filtered by patterns.

By default, returns all variable names found in the first level of the DataFrame’s column MultiIndex, sorted alphabetically. If patterns is provided, only names matching one or more of the given literal strings or glob-style patterns (where * matches any sequence of characters) will be returned.

Parameters:

patterns –

A single string or list of strings.
Each entry may be an exact variable name, or a pattern containing * as a wildcard.
If None, all variable names are returned.

Returns:

A list of matching variable names, in alphabetical order.

Raises:

KeyError – If any literal name or pattern does not match at least one variable in the DataFrame.

Example:: # return all wall temperatures at any layer tsd.get_variable_names(”wall.layer[].T”) [“wall.layer[1].T”, “wall.layer[2].T”, “wall.layer[3].T”]

get_variables_with_multiple_tags() → List[str][source]

Return an alphabetically sorted list of all variables that contain more than one tag.

Returns:: List[str]

low_pass_filter(crit_freq, filter_order, variable, tag=None, new_tag='low_pass_filter')[source]

Call to the preprocessing function ebcpy.preprocessing.low_pass_filter() See the docstring of this function to know what is happening. If the old multicolumn format is used, the result is stored in the multicolumn header with the new_tag.

Parameters:

crit_freq (float) – The critical frequency or frequencies.
filter_order (int) – The order of the filter
variable (str) – The variable name to apply the filter to
tag (str) – If this variable has more than one tag, specify which one
new_tag (str) – The new tag to pass to the variable. Default is ‘low_pass_filter’

moving_average(window, variable, tag=None, new_tag='moving_average')[source]

Call to the preprocessing function ebcpy.preprocessing.moving_average() See the docstring of this function to know what is happening. If the old multicolumn format is used, the result is stored in the multicolumn header with the new_tag.

Parameters:

window (int) – sample rate of input
variable (str) – The variable name to apply the filter to
tag (str) – If this variable has more than one tag, specify which one
new_tag (str) – The new tag to pass to the variable. Default is ‘low_pass_filter’

number_lines_totally_na()[source]: Returns the number of rows in the given dataframe that are filled with NaN-values.

save(filepath: str | None = None, **kwargs) → None[source]

Save the current time-series-data into the given file-format. Currently supported are .hdf, which is an easy and fast storage, and, .csv is supported as an easy-readable option. Also, .parquet, and with additional compression .parquet.COMPRESSION_NAME are supported. Compressions could be gzip, brotli or snappy. For all possible compressions see the documentation of the parquet engines. For a small comparison of these data formats see https://github.com/RWTH-EBC/ebcpy/issues/81

Parameters:

filepath (str,os.path.normpath) – Filepath were to store the data. Either .hdf, .csv, .parquet or .parquet.COMPRESSION_NAME has to be the file-ending. Default is current filepath of class.
key (str) – Necessary keyword-argument for saving a .hdf-file. Specifies the key of the table in the .hdf-file.
sep (str) – Separator used for saving as .csv. Default is ‘,’.
engine (str) – Chose the engine for reading .parquet files. Default is ‘pyarrow’ Other option is ‘fastparquet’ (python>=3.9).

Returns:

to_datetime_index(unit_of_index='s', origin=datetime.datetime(2025, 9, 30, 11, 14, 22, 517375), inplace: bool = True)[source]

Convert the current index to a float based index using ebcpy.preprocessing.convert_index_to_datetime_index()

Parameters:

unit_of_index (str) – default ‘s’ The unit of the given index. Used to convert to total_seconds later on.
origin (datetime.datetime) – The reference datetime object for the first index. Default is the current system time.
inplace (bool) – If True, performs operation inplace and returns None.

Returns:

df Copy of DataFrame with correct index for usage in this framework.

to_df(force_single_index=False)[source]

Return the dataframe version of the current TimeSeriesData object. If all tags are equal, the tags are dropped. Else, the object is just converted.

Parameters:: force_single_index (bool) – If True (not the default), the conversion to a standard DataFrame with a single index column (only variable names) is only done if no variable contains multiple tags.

to_float_index(offset=0, inplace: bool = True)[source]

Convert the current index to a float based index using ebcpy.preprocessing.convert_datetime_index_to_float_index()

Parameters:

offset (float) – Offset in seconds
inplace (bool) – If True, performs operation inplace and returns None.

Returns:

pd.DataFrame df: DataFrame with correct index.

ebcpy.data_types.index_is_numeric(index: Index)[source]: Check if pandas Index is numeric

ebcpy.data_types.load_time_series_data(data: str | Any, **kwargs) → DataFrame[source]

Load time series data from various sources into a pandas DataFrame with custom time series accessor methods available via .tsd property.

Parameters:

data (str,os.path.normpath,pd.DataFrame) – Filepath ending with either .hdf, .mat, .csv, .parquet, or .parquet.COMPRESSION_NAME containing time-dependent data to be loaded as a pandas.DataFrame. Alternative option is to pass a DataFrame directly.
key (str) – Name of the table in a .hdf-file if the file contains multiple tables.
sep (str) – separator for the use of a csv file. If none is provided, a comma (“,”) is used as a default value. See pandas.read_csv() docs for further information.
header (int, list) – Header columns for .csv files. See pandas.read_csv() docs for further information. Default is first row (0).
index_col (int,str) – Column to be used as index in .csv files. See pandas.read_csv() docs for further information. Default is first column (0).
sheet_name (str) – Name of the sheet you want to load data from. Required keyword argument when loading a xlsx-file.
engine (str) – Chose the engine for reading .parquet files. Default is ‘pyarrow’ Other option is ‘fastparquet’ (python>=3.9).
variable_names (list) – List of variable names to load from .mat file. If you know which variables you want to plot, this may speed up loading significantly, and reduce memory size drastically. You can also supply wildcard patterns (e.g. “wall.layer[].T”, etc.) to match multiple variables at once.

Returns:

pd.DataFrame DataFrame with custom .tsd accessor containing time series functionality

Examples:

Create a DataFrame with random data:

>>> import numpy as np
>>> from ebcpy import load_time_series_data
>>> df = load_time_series_data({"my_variable": np.random.rand(5)})
>>> df.tsd.to_datetime_index()
>>> df.tsd.save("my_new_data.csv")

Now, let’s load the recently created file:

>>> df = load_time_series_data("my_new_data.csv")

ebcpy.optimization module

Base-module for the whole optimization pacakge. Used to define Base-Classes such as Optimizer and Calibrator.

class ebcpy.optimization.Optimizer(working_directory: Path | str | None = None, **kwargs)[source]

Bases: object

Base class for optimization in ebcpy. All classes performing optimization tasks must inherit from this class. The main feature of this class is the common interface for different available solvers in python. This makes the testing of different solvers and methods more easy. For available frameworks/solvers, check the function self.optimize().

Parameters:

working_directory (str,Path) – Directory for storing all output of optimization via a logger.
bounds (list) – The boundaries for the optimization variables.

property bounds: List[Tuple | List]: The boundaries of the optimization problem.

property cd: Path

static get_default_config(framework: str) → dict[source]

Return the default config or kwargs for the given framework.

The default values are extracted of the corresponding framework directly.

abstract mp_obj(x, *args)[source]

Objective function for Multiprocessing.

Parameters:

x (np.array) – Array with parameters for optimization. Shape of the array is (number_of_evaluations x number_of_variables). For instance, optimizating 10 variables and evaluating 900 objectives in parallel, the shape would be 900 x 10.
n_cpu (int) – Number of logical Processors to run optimization on.

abstract obj(xk, *args)[source]

Base objective function. Overload this function and create your own objective function. Make sure that the return value is a scalar. Furthermore, the parameter vector xk is always a numpy array.

Parameters:: xk (np.array) – Array with parameters for optimization
Returns:: float result: A scalar (float/ 1d) value for the optimization framework.

optimize(framework, method=None, n_cpu=1, **kwargs)[source]

Perform the optimization based on the given method and framework.

Parameters:

framework (str) – The framework (python module) you want to use to perform the optimization. Currently, “scipy_minimize”, “dlib_minimize” and “scipy_differential_evolution” are supported options. To further inform yourself about these frameworks, please see: - dlib - scipy minimize - scipy differential evolution - ‘pymoo’ <https://pymoo.org/index.html>
method (str) – The method you pass depends on the methods available in the framework you chose when setting up the class. Some frameworks don’t require a method, as only one exists. This is the case for dlib. For any framework with different methods, you must provide one. For the scipy.differential_evolution function, method is equal to the strategy. For the pymoo function, method is equal to the algorithm.
n_cpu (int) – Number of parallel processes used for the evaluation. Ignored if the framework-method combination does not support multi-processing.

Keyword arguments:: Depending on the framework an method you use, you can fine-tune the optimization tool using extra arguments. We refer to the documentation of each framework for a listing of what parameters are supported and how to set them. E.g. For scipy.optimize.minimize one could add “tol=1e-3” as a kwarg.

Returns:: res Optimization result.

property supported_frameworks: List with all frameworks supported by this wrapper class.

property working_directory: Path: The current working directory

ebcpy.preprocessing module

This general overview may help you find the function you need:

Remove duplicate rows by averaging the values (build_average_on_duplicate_rows)
Convert any integer or float index into a datetime index (convert_index_to_datetime_index)
Resample a given time-series on a given frequency (clean_and_space_equally_time_series)
Apply a low-pass-filter (low_pass_filter)
Apply a moving average to flatten disturbances in your measured data (moving_average)
Convert e.g. an electrical power signal into a binary control signal (on-off) based on a threshold (create_on_off_signal)
Find the number of lines without any values in it (number_lines_totally_na)
Split a data-set into training and test set according to cross-validation (cross_validation)

All functions in the pre-processing module should have a doctest. We refer to the example in this doctest for a better understanding of the functions. If you don’t understand the behaviour of a function or the meaning, please raise an issue.

ebcpy.preprocessing.build_average_on_duplicate_rows(df: DataFrame | TimeSeriesData) → DataFrame[source]

If the dataframe has duplicate-indexes, the average value of all those indexes is calculated and given to the first occurrence of this duplicate index. Therefore, any dataFrame should be already sorted before calling this function.

Parameters:: df (pd.DataFame) – DataFrame with the data to process
Returns:: pd.DataFame The processed DataFame

Example:

>>> df = pd.DataFrame({"idx": np.ones(5), "val": np.arange(5)}).set_index("idx")
>>> df = convert_index_to_datetime_index(df, origin=datetime(2007, 1, 1))
>>> print(df)
                     val
idx
2007-01-01 00:00:01    0
2007-01-01 00:00:01    1
2007-01-01 00:00:01    2
2007-01-01 00:00:01    3
2007-01-01 00:00:01    4
>>> print(build_average_on_duplicate_rows(df))
                     val
idx
2007-01-01 00:00:01  2.0

ebcpy.preprocessing.clean_and_space_equally_time_series(df: DataFrame | TimeSeriesData, desired_freq: str, confidence_warning: float = 0.95) → DataFrame[source]

Function for cleaning of the given dataFrame and interpolating based on the given desired frequency. Linear interpolation is used.

Parameters:

df (pd.DataFrame,TimeSeriesData) – Unclean DataFrame. Needs to have a pd.DateTimeIndex
desired_freq (str) – Frequency to determine number of elements in processed dataframe. Options are for example: - s: second-based - 5s: Every 5 seconds - 6min: Every 6 minutes This also works for h, d, m, y, ms etc.
confidence_warning (float) – Value to check the confidence interval of input data without a defined frequency. If the desired frequency is outside of the resulting confidence interval, a warning is issued.

Returns:

pd.DataFrame Cleaned and equally spaced data-frame

Example: Note: The example is for random data. Try out different sampling frequencys. You will be warned if the samping rate is to high or to low.

>>> df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)),
>>>                   columns=list('ABCD')).set_index("A").sort_index()
>>> df = convert_index_to_datetime_index(df, origin=datetime(2007, 1, 1))
>>> clean_and_space_equally_time_series(df, "30s")
>>> import matplotlib.pyplot as plt
>>> plt.plot(df["B"], label="Raw data")
>>> df = clean_and_space_equally_time_series(df.copy(), "1500ms")
>>> plt.plot(df["B"], label="Clead and spaced equally")
>>> plt.legend()
>>> plt.show()

Changed in version 0.1.7.

ebcpy.preprocessing.convert_datetime_index_to_float_index(df: DataFrame | TimeSeriesData, offset: float = 0, inplace: bool = False) → DataFrame[source]

Convert a datetime-based index to FloatIndex (in seconds). Seconds are used as a standard unit as simulation software outputs data in seconds (e.g. Modelica)

Parameters:

df (pd.DataFrame,TimeSeriesData) – DataFrame to be converted to FloatIndex
offset (float) – Offset in seconds
inplace (bool) – If True, performs operation inplace and returns None.

Returns:

pd.DataFrame df: DataFrame with correct index

Example:

>>> import pandas as pd
>>> df = pd.DataFrame(np.ones([3, 4]), columns=list('ABCD'))
>>> print(convert_index_to_datetime_index(df, origin=datetime(2007, 1, 1)))
                       A    B    C    D
2007-01-01 00:00:00  1.0  1.0  1.0  1.0
2007-01-01 00:00:01  1.0  1.0  1.0  1.0
2007-01-01 00:00:02  1.0  1.0  1.0  1.0
>>> print(convert_datetime_index_to_float_index(df))
       A    B    C    D
0.0  1.0  1.0  1.0  1.0
1.0  1.0  1.0  1.0  1.0
2.0  1.0  1.0  1.0  1.0

ebcpy.preprocessing.convert_index_to_datetime_index(df: DataFrame | TimeSeriesData, unit_of_index: str = 's', origin: datetime = datetime.datetime(2025, 9, 30, 11, 14, 22, 515592), inplace: bool = False) → DataFrame[source]

Converts the index of the given DataFrame to a pandas.core.indexes.datetimes.DatetimeIndex.

Parameters:

df (pd.DataFrame,TimeSeriesData) – dataframe with index not being a DateTime. Only numeric indexes are supported. Every integer is interpreted with the given unit, standard form is in seocnds.
unit_of_index (str) – default ‘s’ The unit of the given index. Used to convert to total_seconds later on.
origin (datetime.datetime) – The reference datetime object for the first index. Default is the current system time.
inplace (bool) – If True, performs operation inplace and returns None.

Returns:

df Copy of DataFrame with correct index for usage in this framework.

Example:

>>> import pandas as pd
>>> df = pd.DataFrame(np.ones([3, 4]), columns=list('ABCD'))
>>> print(df)
     A    B    C    D
0  1.0  1.0  1.0  1.0
1  1.0  1.0  1.0  1.0
2  1.0  1.0  1.0  1.0
>>> print(convert_index_to_datetime_index(df, origin=datetime(2007, 1, 1)))
                       A    B    C    D
2007-01-01 00:00:00  1.0  1.0  1.0  1.0
2007-01-01 00:00:01  1.0  1.0  1.0  1.0
2007-01-01 00:00:02  1.0  1.0  1.0  1.0

ebcpy.preprocessing.create_on_off_signal(df: DataFrame | TimeSeriesData, col_names: list, threshold: float | list, col_names_new: list, tags: list | str = 'raw', new_tag: str = 'converted_signal')[source]

Create on and off signals based on the given threshold for all column names.

Parameters:

df (pd.DataFame,TimeSeriesData) – DataFrame with the data to process
col_names (list) – Column names of variables to convert to signals
threshold (float,list) – Threshold for all column-names (single float) or a list with specific thresholds for specific columns.
col_names_new (list) – New name for the signal-column
tags (str,list) – If a 2-Level DataFrame for TimeSeriesData is used, one has to specify the tag of the variables. Default value is to use the “raw” tag set in the TimeSeriesClass. However, one can specify a list (Different tag for each variable), or on can pass a string (same tags for all given variables)
new_tag (str) – The tag the newly created variable will hold. This can be used to indicate where the signal was converted from.

Returns:

pd.DataFrame Copy of DataFrame with the created signals added.

Example:

>>> import matplotlib.pyplot as plt
>>> import numpy as np
>>> df = pd.DataFrame({"P_el": np.sin(np.linspace(-20, 20, 10000))*100})
>>> df = create_on_off_signal(df, col_names=["P_el"],
>>>                           threshold=25, col_names_new=["Device On"])
>>> plt.plot(df)
>>> plt.show()

ebcpy.preprocessing.cross_validation(x, y, test_size=0.3)[source]

Split data set randomly with test_size (if test_size = 0.30 –> 70 % are training data). You can use this function for segmentation tasks. Time-series-data may not be splitted with this function as the results are not coherent (time-wise).

Parameters:

x – Indexables with same length / shape[0] as y. Allowed inputs are lists, numpy arrays, scipy-sparse matrices or pandas dataframes.
y (list,np.ndarray,pd.DataFrame) – Indexables with same length / shape[0] as x. Allowed inputs are lists, numpy arrays, scipy-sparse matrices or pandas dataframes.
test_size (float) – Value between 0 and 1 specifying what percentage of the data will be used for testing.

Returns:

list Split data into 4 objects. The order is: x_train, x_test, y_train, y_test

Example:

>>> import numpy as np
>>> x = np.random.rand(100)
>>> y = np.random.rand(100)
>>> ret = cross_validation(x, y)
>>> len(ret)
4

ebcpy.preprocessing.get_df_index_frequency_mean_and_std(df_index: Index, verbose: bool = False)[source]

Function to get the mean and std of the index-frequency. If the index is a DatetimeIndex, the seconds are converted from nanoseconds to seconds. Else, seconds are assumed as values.

Parameters:

df_index (pd.Index) – Time index.
verbose (bool) – Default false. If true, additional to the mean value and standard deviation, the standard error of the mean and number of time steps are returned.

Returns:

float: Mean value float: Standard deviation

ebcpy.preprocessing.interquartile_range(x: ndarray) → ndarray[source]

Calculate interquartile range of given array. Returns the indices of values outside of the interquartile range.

Parameters:: x (np.array) – For dataframe e.g. df[‘a_col_name’].values
Returns:: np.array iqr: Array matching the interquartile-range

Example:

>>> import numpy as np
>>> normal_dis = np.random.normal(0, 1, 1000)
>>> res = interquartile_range(normal_dis)
>>> values = normal_dis[res]

ebcpy.preprocessing.low_pass_filter(data: ndarray, crit_freq: float, filter_order: int) → ndarray[source]

Create a low pass filter with given order and frequency.

Parameters:

data (numpy.ndarray) – For dataframe e.g. df[‘a_col_name’].values
crit_freq (float) – The critical frequency or frequencies.
filter_order (int) – The order of the filter

Returns:

numpy.ndarray

Example:

>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> rand_series = np.random.rand(100)
>>> plt.plot(rand_series, label="reference")
>>> plt.plot(low_pass_filter(rand_series, 0.2, 2), label="filtered")
>>> plt.legend()
>>> plt.show()

ebcpy.preprocessing.modified_z_score(x: ndarray, limit: float = 3.5) → ndarray[source]

Calculate the modified z-score using the median and median average deviation of the given data.

Parameters:

x (np.array) – For dataframe e.g. df[‘a_col_name’].values
limit (float) – default 3.5 Lower limit for required z-score

Returns:

np.array iqr: modified z score

Example:

>>> import numpy as np
>>> normal_dis = np.random.normal(0, 1, 1000)
>>> res = modified_z_score(normal_dis, limit=2)
>>> values = normal_dis[res]

ebcpy.preprocessing.moving_average(data: ndarray, window: int) → ndarray[source]

Creates a pandas Series as moving average of the input series.

Parameters:

data (np.ndarray) – For dataframe e.g. df[‘a_col_name’].values
window (int) – sample rate of input

Returns:

numpy.array shape has (###,). First and last points of input Series are extrapolated as constant values (hold first and last point).

Example:

>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> series = np.sin(np.linspace(-30, 30, 1000))
>>> plt.plot(series, label="reference")
>>> plt.plot(moving_average(series, 10), label="window=10")
>>> plt.plot(moving_average(series, 50), label="window=50")
>>> plt.plot(moving_average(series, 100), label="window=100")
>>> plt.legend()
>>> plt.show()

ebcpy.preprocessing.number_lines_totally_na(df: DataFrame | TimeSeriesData) → int[source]

Returns the number of rows in the given dataframe that are filled with NaN-values.

Parameters:: df (pd.DataFrame,TimeSeriesData) – Given dataframe to process
Returns:: int Number of NaN-Rows.

Example:

>>> import numpy as np
>>> import pandas as pd
>>> dim = np.random.randint(100) + 10
>>> nan_col = [np.NaN for i in range(dim)]
>>> col = [i for i in range(dim)]
>>> df_nan = pd.DataFrame({"col_1":nan_col, "col_2":nan_col})
>>> df_normal = pd.DataFrame({"col_1":nan_col, "col_2":col})
>>> print(number_lines_totally_na(df_nan)-dim)
0
>>> print(number_lines_totally_na(df_normal))
0

ebcpy.preprocessing.time_based_weighted_mean(df: DataFrame | TimeSeriesData) → ndarray[source]

Creates the weighted mean according to time index that does not need to be equidistant. Further info: https://stackoverflow.com/questions/26343252/create-a-weighted-mean-for-a-irregular-timeseries-in-pandas

Parameters:: df (pd.DataFrame) – A pandas DataFrame with DatetimeIndex.
Return np.array:: A numpy array containing weighted means of all columns

Example:

>>> from datetime import datetime
>>> import numpy as np
>>> import pandas as pd
>>> time_vec = [datetime(2007,1,1,0,0),
>>>             datetime(2007,1,1,0,0),
>>>             datetime(2007,1,1,0,5),
>>>             datetime(2007,1,1,0,7),
>>>             datetime(2007,1,1,0,10)]
>>> df = pd.DataFrame({'A': [1,2,4,3,6], 'B': [11,12,14,13,16]}, index=time_vec)
>>> print(time_based_weighted_mean(df=df))
[  3.55  13.55]

ebcpy.preprocessing.z_score(x: ndarray, limit=3) → ndarray[source]

Calculate the z-score using the mea and standard deviation of the given data.

Parameters:

x (np.array) – For dataframe e.g. df[‘a_col_name’].values
limit (float) – default 3 Lower limit for required z-score

Returns:

np.array iqr: modified z score

Example:

>>> import numpy as np
>>> normal_dis = np.random.normal(0, 1, 1000)
>>> res = z_score(normal_dis, limit=2)
>>> values = normal_dis[res]