№2
Publication date:
16 March 2018
Articles in journal issue № 2, 2017.
1. Effective algorithm for constructing associative rules [2017-05-26]
Author: Billig V.A.
Visitors: 2505
Constructing associative rules is one of the most important tasks in extracting knowledge from databases. All modern algorithms are in some way related to the Apriori algorithm proposed by R. Agrawal and his co-authors in works published more than 20 years ago and now considered classical. The known efficient implementations of the algorithm rely on compressing the database and representing the data as a tree, which allows efficient evaluation of support and other characteristics of associative rules. The proposed ConApriori algorithm does not use this idea. We regard database transactions as an enumeration defined by a scale. This makes the basic operation of the algorithm, determining whether one set is a subset of another, almost instantaneous: the calculation reduces to a few logical machine instructions. Enumeration also allows us to treat a transaction as a number in its internal representation while preserving the meaning of the transaction elements in their external representation. Another idea used in the algorithm allows us to construct most of the confident rules from those built previously. The article proves the correctness of the algorithm and evaluates its complexity. We analyze its effectiveness in comparison with other known implementations of the algorithm. The possibility of parallelization is also considered. Keywords: data mining, apriori, associative rules, support, confidence, lift, enumeration, scale, database, transaction
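The idea of treating a transaction as a number can be illustrated with a minimal sketch. It is not the ConApriori implementation: the function names (`build_scale`, `encode`, `is_subset`, `support`) and the sample transactions are assumptions made for illustration. Each item gets one bit position, so the subset test becomes a single bitwise AND and a comparison.

```python
def build_scale(transactions):
    """Assign every distinct item a bit position (a hypothetical 'scale')."""
    items = sorted({item for t in transactions for item in t})
    return {item: pos for pos, item in enumerate(items)}

def encode(itemset, scale):
    """Encode an item set as an integer with one bit per item."""
    mask = 0
    for item in itemset:
        mask |= 1 << scale[item]
    return mask

def is_subset(a, b):
    """True if every item encoded in mask a is also present in mask b."""
    return a & b == a

def support(candidate, encoded_db):
    """Fraction of transactions that contain the candidate item set."""
    return sum(is_subset(candidate, t) for t in encoded_db) / len(encoded_db)

# Illustrative data, not taken from the article.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter"},
    {"bread", "butter"},
]
scale = build_scale(transactions)
db = [encode(t, scale) for t in transactions]

bread = encode({"bread"}, scale)
bread_milk = encode({"bread", "milk"}, scale)
print(support(bread_milk, db))                       # 0.5
print(support(bread_milk, db) / support(bread, db))  # confidence of bread -> milk
```

Python integers have arbitrary precision, so the same sketch works for more than 64 distinct items, although the test is then no longer a single machine instruction.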
2. Parallel computing when implementing web-based pattern recognition tools based on use case methods [2017-05-26]
Authors: Fomin V.V., I.V. Aleksandrov
Visitors: 2624
The article considers a software solution aimed at improving image recognition quality and the effectiveness of machine learning tools using grid technologies. The paper formulates strategic directions for developing pattern recognition tools as a software system based on the principles of distributed systems, parallelism and adaptive configuration of computing resources. The article considers the structure of web tools for pattern recognition built around the concept of an algorithm library. It also describes algorithmic solutions for parallelizing learning and recognition algorithms on the basis of classical data mining algorithms that have proved themselves in practice. Such algorithms include precedent (case-based reasoning) methods or methods based on proximity metrics. They have great potential for parallelizing computational processes and developing parallel algorithms for their implementation. The search for ways to improve computer performance, especially when implementing web tools based on resource-intensive algorithms for machine recognition and prediction, led to the decision to create a grid system. The architecture and implementation of the grid system considered in the article assume parallelization and organization of distributed computations on a multi-machine basis using Internet technologies. This provides computing power comparable to that of multiprocessor computer systems at a much lower cost. The article considers the problem of increasing the efficiency of computing resources through reconfiguration of the structure of Internet connections, including the procedure of configuring the computer network structure, the connected communication channels and the dedicated servers depending on the original algorithms and data. The paper presents the dependence of operation execution time on the service discipline that adapts the system to user requests. Here the tasks are ranked by resource intensity, and each task is allocated computing power corresponding to its rank.
3. The empirical risk minimization principle based on average loss aggregating functions for regression problems [2017-05-26]
Authors: Z.M. Shibzukhov, D.P. Dimitrichenko, M.A. Kazakov
Visitors: 1752
The paper proposes an extended principle of empirical risk minimization for solving the regression problem. It is based on using aggregate functions instead of the arithmetic mean to calculate risk. This is justified when the loss distribution contains a significant number of outliers or is skewed, which biases the risk estimate based on the average loss from the very beginning. In such cases, a robust estimate of the average risk should be used from the start when optimizing parameters in the regression problem. Such intermediate risk estimates can be constructed using avg functions, which are solutions to the problem of minimizing a penalty function for deviation from the mean. This approach makes it possible, on the one hand, to define a much broader class of averaging functions and, on the other hand, to define differentiable averaging functions that approximate non-differentiable ones such as the median or a quantile. As a result, it is possible to construct gradient methods for solving the regression problem that, in a sense, approximate robust techniques such as Least Median and Least Quantile. The paper proposes a new gradient scheme for minimizing the intermediate risk. It is an analog of the scheme used in the SAG algorithm, where the risk is calculated by the arithmetic mean. An illustrative example shows the construction of a robust procedure for estimating parameters in linear regression using an avg function that approximates the median.
4. Recursive algorithm for exact calculation of rank tests for testing statistical hypotheses [2017-05-26]
Authors: Agamirov L.V., Vestyak V.A., Agamirov V.L.
Visitors: 2653
The paper considers a method of generating exact distributions of nonparametric rank tests by means of computer combinatorics. The work is relevant because determining the exact distribution of critical values of rank tests for statistical hypothesis testing is complicated by the fact that exact tables and recurrence formulas do not exist for many of the tests. In addition, approximations often give unsatisfactory results for small numbers of observations. Calculating the distribution for a rank test requires enumerating all possible sample permutations and calculating the rank statistics together with the cumulative frequency of their occurrence. The authors developed a program that generates permutations of sample elements for nonparametric rank criteria. It is based on a recursive brute-force algorithm that directly enumerates permutations of the order statistics vector under one restriction: in all permutation options, elements from the same sample cannot be swapped. This condition is universal for the exact distributions of all rank criteria. The paper refers to an Internet resource that contains a software package implementing the considered calculation algorithm for a rank test. The package covers four nonparametric criteria: the two-sample Wilcoxon test, the Lehmann-Rosenblatt test, the series test and the Kruskal-Wallis test, whose exact distributions are of greatest interest for technical problems. The algorithm can also be used for other rank tests of statistical hypotheses. The paper presents an implementation of the method for generating exact distributions of nonparametric rank tests by computer combinatorial means. It is based on the authors' recursive direct enumeration of permutations of the order statistics vector with subsequent filtering of the results. Thus, the authors solve the problem of determining the critical values of nonparametric rank tests for testing statistical hypotheses.
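The enumeration idea can be sketched for the simplest case, the two-sample Wilcoxon rank-sum statistic without ties. This is not the authors' recursive program: it uses `itertools.combinations` in place of their recursion, and the names `rank_sum_distribution` and `lower_tail` are hypothetical. Choosing which ranks belong to the first sample is equivalent to enumerating permutations in which elements of the same sample are never swapped.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

def rank_sum_distribution(m, n):
    """Exact null distribution of the rank sum W of the first sample.

    m and n are the sample sizes; ranks 1..m+n are assumed to have no ties.
    Every choice of m ranks for the first sample is equally likely under
    the null hypothesis.
    """
    counts = Counter(
        sum(ranks) for ranks in combinations(range(1, m + n + 1), m)
    )
    total = sum(counts.values())
    return {w: Fraction(c, total) for w, c in sorted(counts.items())}

def lower_tail(dist, w):
    """P(W <= w) under the null hypothesis."""
    return sum(p for value, p in dist.items() if value <= w)

dist = rank_sum_distribution(3, 3)
print(lower_tail(dist, 6))  # 1/20
print(lower_tail(dist, 7))  # 1/10
```

The number of enumerated cases grows as the binomial coefficient C(m+n, m), so this direct approach is practical only for small samples, which is exactly where approximations are least reliable.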
5. Automated data processing system in UNIX-like systems [2017-05-26]
Authors: E.V. Palchevsky, A.R. Khalikov
Visitors: 2380
The article considers the development of distributed, modular information processing in an automated mode. The implemented system handles input and output data on a physical server at up to 2.2 GB/s while distributing the stream (all incoming network traffic on the server) across physical and logical cores. The paper shows how the load on physical resources depends on the input information. It also justifies the feasibility of using the developed hardware and software complex SDP (Speed data processing). The structure and basic operation diagrams are also presented. The first stage of creating the complex is the development of an algorithm. The second stage is its technical implementation. The article gives a fragment of source code responsible for sending email notifications about CPU load and the main running processes. It describes the main functionality with the following data: function name, function purpose, theoretical load, data transfer limit (in MB/s) and execution result. The third stage is testing the SDP complex; its average daily results over ten days are provided. The created hardware and software complex processes input and output information efficiently in an automatic mode in order to increase the throughput of receiving and sending data in the MySQL DBMS, including under DoS and DDoS attacks. One part of the complex is a web module responsible for control from both a personal computer and a mobile phone. The monitoring part of the web module can send SMS notifications about physical server usage. The developed hardware and software complex showed high stability when handling large volumes of data with a minimal load on the computer.
6. A homogeneous distribution problem based on ant colony adaptive behavior models [2017-05-26]
Authors: B.K. Lebedev, O.B. Lebedev, E.M. Lebedeva
Visitors: 2455
The paper proposes a solution to the homogeneous distribution problem. It gives the problem statement and describes the main groups of algorithms for solving it (approximate and exact) with their advantages and disadvantages. The paper proposes a new paradigm of combinatorial optimization based on modeling the adaptive behavior of an ant colony. A solution of the homogeneous distribution problem is represented graphically as a bipartite graph. New decision mechanisms are proposed to solve such problems. The ant colony metaheuristic is based on a combination of two techniques. The first basic technique is to search for the best solution using the adaptive behavior of an ant colony. An ant builds a specific solution using a built-in procedure based on a constructive algorithm. The main difference between the proposed ant algorithm and the existing canonical paradigm is that a bipartite graph is built on the solution search graph. This approach will be fairly effective for optimization problems whose solutions can be represented as bipartite graphs. The experiments showed that the ant algorithm gives higher-quality solutions than the known algorithms: the results improved by 3–4 %.
7. A contracted representation of strong associative rules in data analysis [2017-05-26]
Authors: Bykova V.V., Kataeva A.V.
Visitors: 2230
Modern methods and tools for finding association rules in big data produce a large number of rules, many of which are redundant. Redundant association rules are generally of no value and can even mislead. To solve this problem, the paper proposes the MClose algorithm, a modification of the Close algorithm. The Close algorithm is known to construct a min-max basis for strict association rules (association rules with a confidence of 1). The min-max basis consists only of min-max association rules, that is, rules with a minimal antecedent and a maximal consequent. Such rules are of interest to experts. However, the min-max basis may still contain redundant association rules. The MClose algorithm eliminates redundant association rules immediately while creating the min-max basis. The resulting basis is called the concise strong basis (CSB). Redundant association rules can always be derived from the CSB without loss of their support and confidence and without reference to the data set. The MClose algorithm is based on the Galois connection and on inference rules similar to Armstrong's axioms for functional dependencies. Experiments have shown that the running time of MClose is comparable to that of Close, while the number of association rules in the min-max basis is roughly halved. We describe a program that implements the MClose and Close algorithms.
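The Galois closure underlying Close-type algorithms can be sketched as follows. This illustrates only the closure operator and the confidence of the resulting rule, not MClose or the construction of the CSB; the function names and the sample data are assumptions. The closure of an item set is the intersection of all transactions containing it, so the rule from an item set to the rest of its closure always has confidence 1.

```python
def cover(itemset, transactions):
    """Transactions that contain every item of the item set."""
    return [t for t in transactions if itemset <= t]

def closure(itemset, transactions):
    """Galois closure: items shared by all transactions covering the set."""
    covering = cover(itemset, transactions)
    if not covering:
        # By convention, an item set that occurs nowhere closes to all items.
        return frozenset().union(*transactions)
    return frozenset.intersection(*covering)

def support(itemset, transactions):
    return len(cover(itemset, transactions)) / len(transactions)

def confidence(antecedent, consequent, transactions):
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

# Illustrative data, not taken from the article.
db = [frozenset("abc"), frozenset("abcd"), frozenset("bd"), frozenset("abc")]

a = frozenset("a")
closed = closure(a, db)
print(sorted(closed))                 # ['a', 'b', 'c']
print(confidence(a, closed - a, db))  # 1.0
```

Here `a -> b, c` is a strict rule with support 0.75. Item sets with the same closure, such as `{a}` and `{c}`, yield rules that carry the same information, which is the kind of redundancy a concise basis aims to remove.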
8. Formation of Vietnam energy development options by combinatorial modeling methods [2017-05-26]
Authors: Edelev A.V., V.I. Zorkaltsev, Đoàn Văn Bình, Nguyễn Hoài Nam
Visitors: 2304
The article describes the combinatorial modelling approach to research on energy sector development. The idea of the approach is to model the development of a system as a directed graph whose nodes correspond to the possible states of the system at certain moments of time and whose arcs characterize possible transitions from one state to another. Combinatorial modelling is a visual representation of dynamic discrete alternatives. It makes it possible to simulate the long-term process of system development under various external and internal conditions and to determine an optimal development strategy for the system under study. The procedures for forming and analyzing energy development options are implemented in the Corrective software package. A distributed computing environment is necessary to compute an energy sector development graph. In 2015, the Institute of Energy Science of the Vietnamese Academy of Science and Technology performed a study of Vietnam's sustainable energy development from 2015 to 2030. Data from this study are used to demonstrate the application of combinatorial modelling methods to the formation and analysis of Vietnam's energy development options, taking energy security requirements into account. The resulting Vietnam energy sector development graph consists of 531 442 nodes. It was computed on the cluster located at the Institute for System Dynamics and Control Theory of the Siberian Branch of the Russian Academy of Sciences (Irkutsk). The optimal path of Vietnam's sustainable energy development found in this way minimizes the costs of energy sector development and operation.
9. Statement of the problem of formation of directions of development of organizational automated systems and its solution algorithm [2017-05-26]
Authors: V.L. Lyaskovsky, I.B. Bresler, M.A. Alasheev
Visitors: 2489
The article discusses the statement of the problem of forming development directions for organizational automated information processing and control systems, together with an algorithm for solving it. The problem arises because many automated systems are created and operated for decades, while the requirements for these systems change over time. Therefore, solutions must be formed from time to time to bring an automated system into compliance with new requirements. The main performance indicator of the generated solutions is a complex index of the degree of automation of the functional processes implemented in the system. The constraints are the mandatory requirements for automating the most important functional processes and for the timeliness of their implementation, as well as the maximum permissible financial and time resources for developing the automated system. The analysis of algorithmic complexity shows that the problem cannot be solved by considering all possible options, because the number of decisions depends exponentially on the dimension of the source data. Therefore, the authors have developed a heuristic algorithm that reduces the number of options under consideration and obtains a rational solution with relatively low computational complexity. The proposed algorithm produces a decision on the development and production of complex automation equipment for an automated control system, as well as on extending the life of existing automation equipment. The algorithm is expected to be implemented in an automated decision support system that runs as software on a consumer-grade personal computer.
10. Evaluating the effectiveness of sustainability problem solving methods of distributed information system functioning [2017-05-26]
Author: Yesikov D.O.
Visitors: 2330
To support informed decisions on organizing data storage and processing so as to ensure the sustainability of distributed information systems, the paper proposes a complex of mathematical models: for optimizing the distribution of the software elements of functional tasks over network nodes; for optimizing the distribution of information resources over data storage and processing centers; for optimizing the technical means of the data storage and processing system; and for optimizing the allocation of resources over information storage and processing centers. It is shown that these problems belong to the class of optimization problems with discrete Boolean variables. The paper proposes and experimentally verifies the branch-and-bound method and the method of genetic algorithms for solving the formalized problems. To improve the efficiency of the branch-and-bound method, the paper proposes determining the variable branching order in advance by solving the dual problem once with an approximate method. The article presents an experimental verification of the effectiveness of the branch-and-bound method for problems of ensuring distributed information system sustainability, including the algorithm with a predetermined order of branching variables. The author estimates the influence of the initial data on the overall performance of the branch-and-bound method. The paper identifies the most effective variable branching strategies for the problems considered. It presents variants of the main operators and of the schemes for initializing the initial population of a genetic algorithm for these problems. To improve the quality of the solution obtained with a genetic algorithm, the author justifies the use of an adaptive reproduction scheme and an island computation scheme. He also experimentally checks the effectiveness of the proposed genetic and island genetic algorithms. The paper determines the genetic algorithm parameters that ensure maximum solution quality and shows that the accuracy of the obtained solution can be controlled by changing the algorithm parameters when a time limit is imposed on the solution. There is a comparative assessment of the branch-and-bound method and the island genetic algorithm for solving the formalized problems. The paper also shows the areas of their effective application.
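Branch and bound over Boolean variables with a branching order fixed in advance can be sketched on a stand-in problem. The example below solves a 0/1 knapsack problem, not the author's distribution models, and it orders variables by value-to-weight ratio rather than by an approximate solution of the dual problem; all names and data are assumptions made for illustration.

```python
def knapsack_branch_and_bound(values, weights, capacity):
    """Maximize total value of chosen items (x_i in {0, 1}) within capacity.

    Weights are assumed to be positive.
    """
    # Branching order fixed once, before the search starts.
    order = sorted(range(len(values)),
                   key=lambda i: values[i] / weights[i], reverse=True)
    best_value = 0
    best_choice = [0] * len(values)

    def bound(k, value, room):
        """Upper bound from the fractional relaxation over order[k:]."""
        for i in order[k:]:
            if weights[i] <= room:
                room -= weights[i]
                value += values[i]
            else:
                return value + values[i] * room / weights[i]
        return value

    def branch(k, value, room, choice):
        nonlocal best_value, best_choice
        if value > best_value:
            best_value, best_choice = value, choice[:]
        # Prune when no variables remain or the bound cannot beat the best.
        if k == len(order) or bound(k, value, room) <= best_value:
            return
        i = order[k]
        if weights[i] <= room:  # branch x_i = 1
            choice[i] = 1
            branch(k + 1, value + values[i], room - weights[i], choice)
            choice[i] = 0
        branch(k + 1, value, room, choice)  # branch x_i = 0

    branch(0, 0, capacity, [0] * len(values))
    return best_value, best_choice

print(knapsack_branch_and_bound([60, 100, 120], [10, 20, 30], 50))
# (220, [0, 1, 1])
```

A good branching order tends to produce strong incumbent solutions early, which lets the bound prune more of the search tree.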