How to implement a new parser

Jens Schneider edited this page May 21, 2021 · 5 revisions

Extract Simulation Data

Already Implemented Simulation Data Types

RDPlot can currently parse data from the reference software for HEVC (HM), the HEVC extensions for 360 video, the HEVC scalability extension (SHM), and VVC (VTM). To parse data from a different source, you need to write a new parser. RDPlot currently accepts *.rd, *.log and *.xml files as input.

How to implement new Simulation Data Types

If you need to write a new parser, you have to follow a few conventions. This page explains the methods that need to be implemented and the basic structure your parser should adopt. At the end of the page you will find an example dataset and the corresponding parser, which you can use as a starting point for your own parser.

Basic structure

RDPlot provides the abstract base class AbstractSimulationDataItem. Your new parser should be a subclass of this base class. The base class declares the abstract method can_parse_file and the abstract properties data and tree_identifier_list. In addition, the helper method _is_file_text_matching_re_pattern(cls, path, pattern) is available. The basic structure of your subclass could look like this:

class YourParserName(AbstractSimulationDataItem):
    def __init__(self, path):
        super().__init__(path)
        # set identifiers and parsed values
        # note: `data` is a read-only property, so the parsed values are stored
        # under a different attribute name (`parsed_data` is a placeholder name)
        self.sequence, self.parsed_data, self.qp = self._parse_path(self.path)
        self.encoder_config = self._parse_encoder_config()
        self.additional_params = []

    def _parse_path(self, path):
        # implement parser here
        # returns parsed data
        raise NotImplementedError

    @property
    def tree_identifier_list(self):
        # returns list which describes hierarchy of identifiers
        raise NotImplementedError

    @property
    def data(self):
        # returns the parsed data as a list of pairs
        raise NotImplementedError

    @classmethod
    def can_parse_file(cls, path):
        # returns True or False
        raise NotImplementedError

    def _get_label(self, keys):
        # returns a tuple with the labels for the axes of the rate-distortion plot
        raise NotImplementedError

    # Optional: Non-abstract Helper Functions can be implemented
    # e.g. Helper Function that supports the can_parse_file method
    @classmethod
    def _enc_log_file_matches_re_pattern(cls, path, pattern):
        if path.endswith("enc.log"):
            return cls._is_file_text_matching_re_pattern(path, pattern)
        return False

The structure of your subclass may differ from the example as long as it overrides the abstract method can_parse_file and the abstract properties data and tree_identifier_list. The following sections describe how to implement the abstract method and the properties.

can_parse_file(cls, path)

def rdplot.SimulationDataItem.AbstractSimulationDataItem.can_parse_file(cls, path)

The abstract class method can_parse_file checks whether your class can parse a given file. It receives the class cls and the path to the file, and it must return True if the class can parse the file and False otherwise. The helper function _is_file_text_matching_re_pattern(cls, path, pattern) is useful here to check whether the text of the file matches a regular expression. The file name or extension can be checked in addition, as the helper _enc_log_file_matches_re_pattern in the basic structure above does with path.endswith. Note that the first class whose can_parse_file returns True for a file is used to parse it. You should therefore make this check as specific as possible.

Example:

    @classmethod
    def can_parse_file(cls, path):
        matches_class = cls._enc_log_file_matches_re_pattern(path, r'^HM \s software')
        is_finished = cls._enc_log_file_matches_re_pattern(path, r'\[TOTAL')
        return matches_class and is_finished
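
The "first match wins" rule can be illustrated with a small stand-alone sketch. It does not use RDPlot itself: text_matches, GenericLogParser, HmLogParser and pick_parser are hypothetical names, the sample log strings are made up, and the dispatch loop is only an assumption about how a list of parser classes might be tried in order.

```python
import re


def text_matches(text, pattern):
    # Stand-in for a content check: True if the regex is found anywhere in text.
    return re.search(pattern, text, re.MULTILINE) is not None


class GenericLogParser:
    # Too permissive: accepts any text that contains a "[TOTAL" marker.
    @classmethod
    def can_parse_text(cls, text):
        return text_matches(text, r'\[TOTAL')


class HmLogParser:
    # More specific: also requires the HM banner at the start of a line.
    @classmethod
    def can_parse_text(cls, text):
        return (text_matches(text, r'^HM\s+software')
                and text_matches(text, r'\[TOTAL'))


def pick_parser(text, parser_classes):
    # Return the first class that claims the text, or None if none does.
    for parser_class in parser_classes:
        if parser_class.can_parse_text(text):
            return parser_class
    return None


# Made-up sample texts, for illustration only.
hm_log = "HM software: Encoder Version\n...\n[TOTAL] 10 frames\n"
other_log = "Some other encoder\n[TOTAL] 10 frames\n"

# With the permissive class first, it shadows the specific one.
shadowed = pick_parser(hm_log, [GenericLogParser, HmLogParser])
# With the specific class first, each log goes to the expected parser.
specific = pick_parser(hm_log, [HmLogParser, GenericLogParser])
fallback = pick_parser(other_log, [HmLogParser, GenericLogParser])
```

In this sketch the permissive check claims the HM log whenever it is tried first, which is why a check that combines several patterns is the safer choice.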

tree_identifier_list(self)

def rdplot.SimulationDataItem.AbstractSimulationDataItem.tree_identifier_list(self)

The property tree_identifier_list returns a list that is used to build a tree view of the input data. The tree is shown on the left side of the plot area (see screenshot). It groups the input files according to identifiers, which are the keys of the tree. In the example, the identifiers are the parser's name, the video sequence name and the data name.

Example:

    @property
    def tree_identifier_list(self):
        return [self.__class__.__name__, self.sequence, self.data_name] 
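
To make the grouping concrete, the following sketch builds a nested dictionary from a few identifier lists, similar in spirit to the tree view. It is an illustration only: build_tree and the sample identifiers are hypothetical, and RDPlot's own tree model may be implemented differently.

```python
def build_tree(identifier_lists):
    # Each identifier list is a path from the root to a leaf of the tree.
    tree = {}
    for identifiers in identifier_lists:
        node = tree
        for key in identifiers:
            # Reuse an existing branch or create a new one for this key.
            node = node.setdefault(key, {})
    return tree


# Hypothetical identifier lists: [parser name, sequence name, data name].
items = [
    ['YourParserName', 'SequenceA', 'qp22'],
    ['YourParserName', 'SequenceA', 'qp27'],
    ['YourParserName', 'SequenceB', 'qp22'],
]
tree = build_tree(items)
```

Files that share a leading part of their identifier list end up under the same branch, so the order of the identifiers determines the hierarchy.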

data(self)

def rdplot.SimulationDataItem.AbstractSimulationDataItem.data(self)

To access the parsed data, you have to override the abstract property data. The property should return a list of tuples similar to this example:

[
    (['NewParser', 'Configuration = config1', 'EXAMPLEDATA2.log'],
     {'NewParser': {'Value': {'ParameterA': [('1', '53'), ('2', '26')],
                              'ParameterB': [('1', '78'), ('2', '34')]}}}),
    (['NewParser', 'Configuration = config1'],
     {'NewParser': {'Summary': {'SummaryA': [('2', '30')],
                                'SummaryB': [('2', '60')]}}})
]

The first element of each tuple is a list containing the identifiers of the input files associated with the data, e.g. path, sequence or config of the input file. These identifiers determine the input files that contribute to a graph. It is possible to use data from several files to generate a single graph. Note that if you want to combine the data from different input files into one graph, you have to encode this in the identifier list. In the example, the Summary graphs combine data that shares the same parser and configuration. When a graph is plotted, the files contributing data can be found in the lower part of the user interface (see screenshot).
The second element is a dictionary tree. Its keys are the variable names of your data, and its leaves hold the actual data. Each leaf must be a list of tuples, each containing an x-value and the corresponding y-value.
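
The role of the identifier list can be sketched as follows. The function collect_points and the two sample entries are hypothetical and only mimic the idea that entries with an identical identifier list contribute to the same graph. How RDPlot merges the data internally is not shown here.

```python
from collections import defaultdict


def collect_points(items, variable_path):
    # Group (x, y) points by identifier list for one variable,
    # e.g. variable_path = ('NewParser', 'Summary', 'SummaryA').
    graphs = defaultdict(list)
    for identifiers, tree in items:
        node = tree
        for key in variable_path:
            # Walk down the dictionary tree; stop if the path does not exist.
            node = node.get(key) if isinstance(node, dict) else None
            if node is None:
                break
        if node is not None:
            graphs[tuple(identifiers)].extend(node)
    return dict(graphs)


# Two hypothetical input files that share parser and configuration.
file_1 = (['NewParser', 'Configuration = config1'],
          {'NewParser': {'Summary': {'SummaryA': [('1', '76')]}}})
file_2 = (['NewParser', 'Configuration = config1'],
          {'NewParser': {'Summary': {'SummaryA': [('2', '30')]}}})

graphs = collect_points([file_1, file_2], ('NewParser', 'Summary', 'SummaryA'))
```

Because both entries carry the same identifier list, their points land in one list. Adding the file path to the identifier list would keep them apart.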

Example-dataset and corresponding parser

The example-data is stored in a .log file and has the following structure:

EXAMPLE DATA

Configuration: config1

ParameterA: (1, 34), (2, 26), (3, 65), (4, 23), (5, 78) 
ParameterB: (1, 67), (2, 46), (3, 86), (4, 55), (5, 68) 
ParameterC: (1, 35), (2, 24), (3, 24), (4, 79), (5, 98) 

SummaryA: (1, 76)
SummaryB: (1, 56)
SummaryC: (1, 54)
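
Before looking at a full parser, it can help to see how a single line of this file might be turned into a variable name and a list of (x, y) tuples. This is a minimal sketch using only the standard re module. Taking the name from the part before the colon is one possible choice, not necessarily the one RDPlot's parsers make.

```python
import re

line = "ParameterA: (1, 34), (2, 26), (3, 65), (4, 23), (5, 78) "

# Remove all whitespace so the pairs have the form "(1,34)".
compact = re.sub(r'\s+', '', line)

# The variable name is everything before the first colon.
name = compact.split(':', 1)[0]

# Each match yields one (x, y) tuple; the values stay strings.
points = re.findall(r'\((\d+),(\d+)\)', compact)
```

Note that re.findall with two groups returns tuples of strings, which matches the shape of the leaves described for the data property.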

The corresponding parser could look like this:

import re
from rdplot.SimulationDataItem import (AbstractSimulationDataItem)

class ExampleParser(AbstractSimulationDataItem):
    def __init__(self, path):
        super().__init__(path)
        self.variable_name, self.data_tuple, self.variable_name_summary, self.data_tuple_summary = self._parse_data()
        self.encoder_config = self._parse_encoder_config()
        # Optional: if you want to parse information that specifies the input file
        self.additional_params = []

    def _parse_data(self):
        with open(self.path, 'r') as log_file:
            log_text = log_file.read()              # reads the whole text file
            lines = log_text.split('\n')            # split whole text file into lines
            data_tuple = []
            variable_name_summary = []
            data_tuple_summary = []
            variable_name = []
            for line in lines:                      # iterate over each line
                line = re.sub(r'\s+', '', line)     # remove white spaces
                if re.search(r'Summary', line):
                    # you can parse the data directly in form of tuples -> the data is required in tuples (x- and y-value)
                    # and then keep the tuples in a list
                    data_tuple_summary.append(re.findall(r'(\d+),(\d+)', line))
                    # the variable names are kept in a list
                    variable_name_summary.append((re.findall(r'Summary\w', line))[0])
                elif re.search(r'Parameter', line):
                    data_tuple.append(re.findall(r'(\d+),(\d+)', line))
                    variable_name.append(re.findall(r'Parameter\w', line)[0])
                else:
                    continue
        return variable_name, data_tuple, variable_name_summary, data_tuple_summary

    @property
    def tree_identifier_list(self):
        # uncomment the following line and comment all lines following OPTIONAL
        # if you do not want to group the input files according to specifications
        # return [self.__class__.__name__] + [self.path]
        # OPTIONAL:
        if not hasattr(self, 'additional_params'):
            self.additional_params = []
        # Important: convert the dict self.encoder_config into a list, but only include
        # the parameters chosen by the user; these are stored in self.additional_params
        config_list = list(zip(self.additional_params, [self.encoder_config[i] for i in self.additional_params]))
        config_list = list(map(lambda x: ' ='.join(x), config_list))
        # the tree identifier list contains the identifiers that specify the files from which the data is plotted
        return [self.__class__.__name__] + config_list + [self.path]

    @property
    def data(self):
        values_dict = dict(zip(self.variable_name, tuple(self.data_tuple)))
        median_dict = dict(zip(self.variable_name_summary, self.data_tuple_summary))
        config_list = list(zip(self.additional_params, [self.encoder_config[i] for i in self.additional_params]))
        # use the same separator as in tree_identifier_list, giving e.g. 'Configuration = config1'
        config_list = list(map(lambda x: ' ='.join(x), config_list))
        # uncomment the following lines and comment all lines following OPTIONAL
        # if you do not want to group the input files according to specifications
        # return [
        #     (
        #         [self.__class__.__name__] + [self.path], {self.__class__.__name__: {'Value': values_dict}}
        #     ),
        #     (
        #         [self.__class__.__name__], {self.__class__.__name__: {'Summary': median_dict}}
        #     )
        # ]                   # Order of data should be
        # OPTIONAL:
        return [
            (
                [self.__class__.__name__] + config_list + [self.path], {self.__class__.__name__: {'Value': values_dict}}
            ),
            (
                [self.__class__.__name__] + config_list, {self.__class__.__name__: {'Summary': median_dict}}
            )
        ]

    # Optional: the input files might have some specifications
    # once these specifications are parsed, the input files can be grouped
    # this is realized by a dialog window showing the user all available specifications
    # the user can determine which specifications should be considered and their hierarchy
    # the dialog window will show only specification parameters that differ in their value
    def _parse_encoder_config(self):
        with open(self.path, 'r') as log_file:
            log_text = log_file.read()  # reads the whole text file
            parsed_config = re.findall(r'Configuration:\s\w+\d+',log_text)
            parsed_config = dict([parsed_config[0].split(':', 1)])
        return parsed_config        # {'SpecificationParameter1':'value', 'SpecificationParameter2':'value'}
                                    # return a dict -> allows building a diff_dict
                                    # -> this diff_dict contains all specification parameters
                                    # that differ in value from the other input files

    @classmethod
    def can_parse_file(cls, path):
        # by checking for a unique pattern in the file at path you can decide whether the data matches the class
        # this pattern must not occur in files meant for other parsers
        matches_class = cls._is_file_text_matching_re_pattern(path, r'EXAMPLE\sDATA')
        return matches_class

    def _get_label(self, keys):
        label = ('x-'+keys[-1],'y-'+keys[-1])
        return label
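
The configuration handling of the example parser can be tried out in isolation. The sketch below repeats the regular expression and the split from _parse_encoder_config and the ' =' join from tree_identifier_list on a shortened copy of the example file. The assumption that the user selected 'Configuration' in the dialog window is made up for illustration.

```python
import re

log_text = "EXAMPLE DATA\n\nConfiguration: config1\n\nParameterA: (1, 34), (2, 26)\n"

# Same pattern and split as in _parse_encoder_config above.
match = re.findall(r'Configuration:\s\w+\d+', log_text)
encoder_config = dict([match[0].split(':', 1)])

# Assumption: the user selected 'Configuration' in the dialog window.
additional_params = ['Configuration']

# The value keeps its leading space, so joining with ' =' yields 'key = value'.
config_list = [' ='.join((key, encoder_config[key])) for key in additional_params]
```

The leading space in the parsed value is what makes the ' =' separator produce an identifier of the form 'Configuration = config1', as in the data example further up.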