BIS Working Papers No 930

Big data and machine learning in central banking

Sebastian Doerr, Leonardo Gambacorta and Jose Maria Serena*

Monetary and Economic Department, Bank for International Settlements
March 2021

JEL classification: G17, G18, G23, G32.
Keywords: big data, central banks, machine learning, artificial intelligence, data science.

BIS Working Papers are written by members of the Monetary and Economic Department of the Bank for International Settlements, and from time to time by other economists, and are published by the Bank. The papers are on subjects of topical interest and are technical in character. The views expressed in them are those of their authors and not necessarily the views of the BIS. This publication is available on the BIS website (www.bis.org).

© Bank for International Settlements 2021. All rights reserved. Brief excerpts may be reproduced or translated provided the source is stated.

ISSN 1020-0959 (print)
ISSN 1682-7678 (online)

Abstract

This paper reviews the use of big data and machine learning in central banking, drawing on a recent survey conducted among the members of the Irving Fisher Committee (IFC). The majority of central banks discuss the topic of big data formally within their institution. Big data is used with machine learning applications in a variety of areas, including research, monetary policy and financial stability. Central banks also report using big data for supervision and regulation (suptech and regtech applications). Data quality, sampling and representativeness are major challenges for central banks, and so is legal uncertainty around data privacy and confidentiality. Several institutions report constraints in setting up an adequate IT infrastructure and in developing the necessary human capital. Cooperation among public authorities could improve central banks’ ability to collect, store and analyse big data.

* Sebastian Doerr, Leonardo Gambacorta and Jose Maria Serena are in the Monetary and Economic Department (MED) at the Bank for International Settlements (BIS). We would like to thank Fernando Perez Cruz for his technical advice and input. For comments and suggestions we also thank Gianni Amisano, Douglas Araujo, Claudio Borio, Agustin Carstens, Stijn Claessens, Jon Frost, Julian Langer, Michel Juillard, Juri Marcucci, Luiz Awazu Pereira, Rafael Schmidt, Hyun Song Shin and Bruno Tissot. Giulio Cornelli provided excellent statistical assistance. The views expressed are those of the authors and not necessarily those of the BIS.

1. Introduction

The world is changing, and so is the way it is measured. For decades, policymakers and the private sector have relied on data released by official statistical institutions to assess the state of the economy. Collecting these data requires substantial effort, and publication often happens with a lag of several months or even years. However, recent years have seen explosive growth in the amount of readily available data. New models of data collection and dissemination enable the analysis of vast troves of data in real time. We now live in the “age of big data”.
One major factor in this development is the advent of the information age, and especially the smartphone and cloud computing: individuals and companies produce unprecedented amounts of data that are stored for future use on the servers of technology companies. For example, billions of Google searches every day reveal what people want to buy or where they want to go for dinner. Social media posts allow market participants to track the spread of information in social networks. Companies record every step of their production or selling process, and electronic payment transactions and e-commerce create a digital footprint.

An additional catalyst in the creation of big data, especially financial data, has been the Great Financial Crisis (GFC) of 2007–09. The GFC laid bare the necessity of more disaggregated data: a relatively small bank such as Lehman Brothers could bring down the financial system because it was highly interconnected. The regulation and reporting requirements set up after the GFC have increased the data reported to central banks and supervisory authorities, and further work to enhance central bank statistics is in progress (Buch (2019)).

The advent of big data coincides with a quantum leap in the technology and software used to analyse it: artificial intelligence (AI) is the topic du jour and enables researchers to find meaningful patterns in large quantities of data. For example, natural language processing (NLP) techniques convert unstructured text into structured data that machine-learning tools can analyse to uncover hidden connections. Network analysis can help to visualise relations in these high-dimensional data. For the first time in history, it is possible to produce a real-time picture of economic indicators such as consumer spending, business sentiment or people’s movements.

These developments have spurred central banks’ interest in big data. Rising interest is reflected in the number of central bank speeches that mention big data and do so in an increasingly positive light (Graph 1). And yet, big data and machine learning pose challenges, some of them general, others specific to central banks and supervisory authorities. This paper reviews the use of big data and machine learning in the central bank community, drawing on a survey conducted in 2020 among the members of the Irving Fisher Committee (IFC). The survey covers 52 respondents from all regions and examines how central banks define and use big data, as well as which opportunities and challenges they see.

1 Forbes (2012): “The Age of Big Data”, accessed 12 June 2020.

Graph 1: Central banks’ interest in big data is mounting (number of speeches)

Notes: (1) Search on the keyword “big data”. (2) The classification is based on the authors’ judgment: the score takes a value of -1 if the speech stance was clearly negative, and a value of +1 if the stance was clearly positive or a project/pilot using big data has been conducted; other speeches (not displayed) have been classified as neutral.
Sources: central bankers’ speeches; authors’ calculations.

The survey uncovers four main insights. First, central banks define big data in an encompassing way that includes unstructured non-traditional data sets, as well as structured data sets from administrative sources or those collected due to regulatory reporting requirements. Second, central banks’ interest in big data has markedly increased over the last few years.
Comparing answers from the 2020 survey with its 2015 vintage, around 80% of central banks now discuss the topic of big data formally within their institution, up from 30% in 2015. Further, over 60% of respondents report a high level of interest in the topic of big data at the senior policy level. The discussion on big data in central banks covers a wide range of topics. A key topic of discussion is the availability of big data and of the tools to process, store and analyse it. The design of legal frameworks, for example in defining access rights to confidential data, and aspects of cyber security are also at the centre of central bankers’ interest.

Third, beyond discussions, there is also action: in contrast to 2015, the vast majority of central banks are now conducting projects that involve big data. Among the institutions that currently use big data, over 70% use it for economic research, while 40% state that they use it to inform policy decisions. Several institutions use big data in the areas of financial stability and monetary policy, as well as for suptech and regtech applications. Around two thirds of respondents want to start new big data-related projects in 2020/21.

And fourth, the advent of big data poses new challenges. Several central banks report that cleaning the raw data (eg in the case of data obtained from newspapers or social media), sampling and representativeness (eg in the case of data based on Google searches or employment websites), or matching new data to existing sources are obstacles to the usefulness of big data for central banks. Another often-cited challenge relates to legal aspects around privacy and confidentiality, especially with respect to data from non-traditional sources such as web pages. For example, central banks grapple with the ethics and privacy issues that accompany the use of potentially sensitive data acquired through public sources or via web scraping. Finally, central banks also need to tackle more practical problems: the vast majority report that they face budget constraints and have difficulty in training existing or hiring new staff to work on big data-related issues. They also report that setting up an adequate IT infrastructure proves challenging.

The existing literature investigating central banks’ use of big data mainly focuses on individual countries. A notable exception is the survey conducted by the IFC (2015) among its central bank members in 2015. Against this backdrop, this paper provides an updated assessment of how central banks define big data, how their interest in and use of big data, including machine learning techniques, has evolved over the last few years, and which challenges central banks face in collecting, storing and analysing it.

The rest of the paper is organised as follows. Section 2 provides an overview of how central banks define big data. Section 3 illustrates in which fields central banks use or plan to use big data and discusses specific use cases. Section 4 discusses opportunities and challenges for central banks and supervisory authorities in the use of machine learning and big data. Section 5 discusses how cooperation among public authorities could relax the constraints on collecting, storing and analysing big data. Section 6 concludes.

2. How do central banks define big data?

Big data is commonly defined in terms of volume, velocity and variety (the so-called 3Vs). For data to be “big”, they must not only have high volume and high velocity, but also come in multiple varieties.
Central bank definitions of big data reflect these characteristics. Around one third of the respondents in the survey define big data exclusively as large non-traditional or unstructured data that require new techniques for analysis (Graph 2, left-hand panel). The remaining two thirds also include traditional and structured data sets in their definition of big data. No central bank considers traditional data alone as big data. These proportions are similar across advanced and emerging market economies (denoted in blue and red, respectively).

The encompassing definition is reflected in the variety of raw data sources used for analysis. The right-hand panel of Graph 2 shows a word cloud with the most frequently used sources, as reported by central banks in the survey. These range from structured administrative data sets such as credit registries to non-traditional data obtained from newspapers and online portals or by scraping the web. A promising avenue for central bankers and policymakers is to complement traditional data sources with non-traditional ones to inform policy decisions.

2 See IFC (2018) for a collection of country experiences. See also Cœuré (2017) and Tissot (2015).
3 Occasionally, veracity is added as a fourth V, as big data is often collected from open sources (Tissot (2019)).

Graph 2: Central bank definitions of big data and main sources. Left-hand panel: “How does your institution define big data?” (per cent); right-hand panel: word count on sources.

Notes: (1) The left-hand panel reports the share of respondents that selected each answer to the question “How does your institution define big data?”. Respondents could select multiple options. Specifically, 35% of respondents consider only non-traditional data as big data; non-traditional data include unstructured data sets that require new tools to clean and prepare, data sets that have not been part of their traditional pool, and data sets with high-frequency observations and/or a large number of cross-sectional units. The remaining 65% additionally consider structured traditional databases as big data. (2) The word cloud highlights the most frequent terms mentioned by central banks in their current and future big data projects. In a first step, open answers are transformed by removing special characters, white spaces and stopwords (such as “the”, “a” or “we”). A text-mining algorithm then counts the frequency of individual words. Words mentioned more frequently appear larger.
Sources: IFC (2020); authors’ calculations.

Central banks have substantial experience with large structured data sets, typically of a financial nature, but have only recently started to explore unstructured data. Financial data sets are analysis-ready, as they are generally collected for regulatory purposes and adhere to reporting requirements. A catalyst in the creation of financial data has been the Great Financial Crisis (GFC), which laid bare the necessity of disaggregated data (IFC (2011)). The ensuing regulation and reporting requirements have increased the data reported to central banks and supervisory authorities, and further work to enhance central bank statistics is in progress (Buch (2019)). By contrast, unstructured data are often the by-product of corporate or consumer activity. Before they are analysed, they must be cleaned and curated, ie organised and integrated into existing structures.
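To illustrate this kind of cleaning and curation step, the sketch below mirrors the text-mining procedure described in the notes to Graph 2: it strips special characters and stopwords from free-text answers and then counts word frequencies. It is a minimal example only; the use of Python, the tiny stopword list and the two invented answers are assumptions made for illustration and are not drawn from the actual IFC survey.

```python
import re
from collections import Counter

# Hypothetical free-text survey answers (not actual IFC survey responses).
answers = [
    "We scrape newspaper articles and credit registry data.",
    "The bank analyses payments data and newspaper text!",
]

# A tiny illustrative stopword list; real text-mining pipelines use longer ones.
STOPWORDS = {"the", "a", "we", "and", "of"}

def tokenise(text: str) -> list[str]:
    """Lower-case the text, strip special characters and drop stopwords."""
    cleaned = re.sub(r"[^a-z\s]", " ", text.lower())
    return [w for w in cleaned.split() if w and w not in STOPWORDS]

# Count the frequency of individual words across all answers;
# in a word cloud, more frequent words would be drawn larger.
counts = Counter(w for answer in answers for w in tokenise(answer))
print(counts.most_common(5))
```

In a real pipeline, the resulting counts would feed a visualisation step that scales each word by its frequency, producing a word cloud like the one in Graph 2.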
A short introduction to machine learning

Machine learning is a subfield of artificial intelligence that focuses on learning and extracting knowledge from data. By leveraging data collected by central banks or available from other sources, machine learning can provide real-time insights about eg inflation or consumer spending. Broadly speaking, machine learning involves the development of algorithms that use large amounts of data to autonomously infer their own parameters. Machine learning is usually classified into supervised and unsupervised learning. Supervised learning algorithms take an input (eg the characteristics of a house) to predict its most likely outcome (eg the house price). An algorithm is a “classifier” if the outcome is countable and a “regression” otherwise. To predict these outcomes, the machine learning algorithm relies on so-called labelled data, which consist of pairs of inputs and outputs sampled from the underlying data. The parameters of any machine learning algorithm are tuned to fit the labelled data by minimising a loss function, eg the classification loss or R². Machine learning algorithms are either non-parametric, ie the number of parameters grows linearly with the number of training points, or overparametrised, ie the models have more parameters than available data. In the latter case, the model has to be regularised so that it does not simply overfit the training data.
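To make the supervised learning workflow concrete, the sketch below fits a regression that maps house characteristics to prices by minimising a squared-error loss over labelled input-output pairs, mirroring the house-price example above. It is a minimal illustration only: Python with NumPy and scikit-learn, the synthetic data and the chosen features are assumptions for the example, not tools or data referenced in the paper.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)

# Synthetic labelled data: each input pairs house characteristics
# (floor area in m2, number of rooms) with an observed price (the label).
X = rng.uniform(low=[40, 1], high=[200, 6], size=(500, 2))
y = 2_000 * X[:, 0] + 15_000 * X[:, 1] + rng.normal(0, 20_000, size=500)

# Supervised "regression": fitting tunes the parameters so as to
# minimise a squared-error loss over the labelled pairs (X, y).
model = LinearRegression().fit(X, y)

# Goodness of fit on the training data (the R2 mentioned in the text).
print(f"R2 on training data: {r2_score(y, model.predict(X)):.3f}")

# Prediction for a new, unlabelled input: a 120 m2 house with 4 rooms.
print(f"Predicted price: {model.predict([[120, 4]])[0]:,.0f}")
```

A classifier would follow the same pattern, but with a countable outcome (eg loan default versus no default) and a classification loss in place of the squared error.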