FROM SECTION 1
FROM SECTION 2
FROM SECTION 3
Summary:
Using public DATA for Investigative reporting and joining journalists with programmers resulted in the production of series of articles, both in print and online LA NACION front pages, about Argentina Subsidies for Transportation (Bus) System. We´ve already worked for 13 months in this on going project.
Subsidies in gasoil and cash, grew exponentially the last ten years and the only way to show this, as well as learning about the evolution of the top 20 companies that receive the most benefits from this, was by building a database which contains 6 years of 4 monthly PDFs, converting them to CSV and Excel (13 Mb) with a DB Model and detecting the evolution of total subsidies, and the detail of more than 1300 companies that receive cash or gasoil since 2006. This data-set of more than 285.000 records (nov 2011) is composed of different kind of cash subsidies (SISTAU, RCC, CCP) and subsidies in diesel fuel (GASOIL), detailed by company and by month. We calculated the amount of cash that those m3 of Gasoil meant for each company as a difference between market price and the discount certifieds they receive.
In october 2011, after national elections passed, our goverment announced a reduction in different subsidies (including transportation), so we are using our dataset again to give a dimension of the total of subsidies and it´s evolution, focusing in the last 2 years with a monthly detail.
Since February 2012, national government is trying to transfer the Metro and Bus system to the city of Buenos Aires government without being clear of the amount of subsidies being transferred as well.
So we are using our dataset for front page stories again and again like this one: http://www.lanacion.com.ar/1455129-traspasan-a-macri-colectivos-que-cuestan-$-1000-millones
In March 2012, we implemented an open data platform http://data.lanacion.com.ar (integrated in Junar) to start opening data in Argentina, a country with no FOIA and with no government’s portal like data.gov. This platform includes sharing, embedding and multiple formats downloading tools for our readers to reuse the content.
We used the platform to publish a dataset extracting the 33 lines of buses that are being transferred to the city of Buenos Aires, with their amount of vehicles, subsidies in cash and calculate average per vehicle and a list of these bus companies ordered by amount.
The story with the embedded Table: http://www.lanacion.com.ar/1455113-un-sector-que-vive-de-los-subsidios (tiene junar embebido)
Besides, we´re opening this data with a visualization in Tableau Public, downloadable and embeddable http://public.tableausoftware.com/views/Subsidios_colectivos/Armado_final?toolbar=yes&:embed=yes&:toolbar=yes&tabs=no&:tabs=no
Finally we are creating an interactive data application presented as Argentina’s Bus Transportation Subsidies Explorer. Here are some screenshots as it is currently in beta (March 2012). Click for viewing in full screen option inside player.
Powered by Tableau The team:
Diego Cabot, lawyer and journalist -Editor of our Financial Section, specialized in public spending, subsidies and Transportation-.
Ricardo Brom, electronic engineer - IT Manager at LA NACION, developed our scrappers , normalizers and built the data sets-.
Mariana Trigo Viera, graphics designer, specialized in interactive design in lanacion.com.
Angelica Peralta Ramos, computer scientist, -Multimedia Development Manager and Data Project leader-, helped with the data models, and manually crossed data sets to calculate average subsidies per unit (bus) per month.
We have support from other members of online and infographics team depending on the story and platform we´re publishing on.
In a country with no FOIA, programming skills for web scrapping, data modeling and data analysis are as important as problem solving skills.
The Sources and Tools
Video 1: An introduction to the subsidies case and web sources.
Video 2: Processing Excel files, before transfering to the database.
Secretaría de Transporte: The official site of Transportation Agency.
CNRT: Comisión Nacional de Regulación del Transporte
http://www.cnrt.gov.ar/ Building the DATABASE:
We spent two weeks in the design and development of the first database, then we updated it monthly:
The steps were:
1) Designing a data model based in the needs of reporting on bus subsidies possible stories, and for creating data visualizations with tools that need little programming skills (We think this data sets as “tables” so we can use different tools like Google Fusion Tables or Tableau Public, and also make our own visualizations). 2) Converting PDFs to CSV 3) Analizing the results with excel 4) Filtering, sorting and cleaning data. 5) Developing a companies and counties normalizer 6) Building a unified dataset with nornalized names 7) Updating this database 8) Discovering stories in this database and extracting data to build the visualizaitons.
Regarding skills for DataJournalists:
Database mindset: for journalists to deal with structured data, the basic is to learn Excel or some spreadsheet to organize and analyze information in a structured way.The focus must be in the data quality and in the data model.
If you build a good data model thinking not only in present but in future uses and in updating data, you should include horizontal uses as for eg: other subsidies not only bus ones or be prepared for international comparisons in your data model. Besides, if you detect frequent difficulties as normalization of names and data cleaning, you can develop routines that will clean this data as part of your monthly process.
More DATA Challenges:
-
After months of being published, PDFs are being modified backwards in two situations: a) keeping the same name and b) including a (number) at the end of the name.
-
Normalization and data cleaning problems
-
Some PDFs are lacking the totals, which makes it impossible to cross check our data in plans manually
PDF With Totals: http://www.transporte.gov.ar/UserFiles/pdfs/subsidios/sistau/2011/rcc_sistau_mayo11.pdf
PDF Without totals: http://www.transporte.gov.ar/UserFiles/pdfs/subsidios/sistau/2011/rcc_sistau_abril11(1).pdf
MAIN STORIES:
Here are the main stories that emerged from analizing this data set, that we are keeping updated every month:
Articles:
JUNE 2011
Colectivos insaciables: un cheque diario de 10 millones en subsidios. http://www.lanacion.com.ar/1380725-colectivos-insaciables-un-cheque-diario-de-10-millones-en-subsidios
Interactive Graphic: http://especiales.lanacion.com.ar/multimedia/item.asp?m=263
AUGUST 2011
Sin el subsidio, el boleto de colectivo costaría $ 3,75 (Without subsidies, bus ticket price would be $ 3,75 (pesos) http://www.lanacion.com.ar/1396512-sin-el-subsidio-el-boleto-de-colectivo-costaria-375
NOVEMBER 2011:
Empresarios dicen que, sin subsidio, el boleto de colectivo debería costar 4 pesos http://www.lanacion.com.ar/1421150-colectivos
Article and Interactive Graphic in Tableau Public offering open DATA accesible from the Tableau graph- http://www.lanacion.com.ar/1421595-nego-schiavi-que-vaya-a-subir-el-boleto
El precio del boleto de colectivo en el mundo http://www.lanacion.com.ar/1421186-el-precio-del-boleto-de-colectivo-en-el-mundo
El colectivo a $4 provocó una furiosa reacción de Schiavi: “Es terrorismo mediático” http://www.lanacion.com.ar/1421410-colectivos
Retiraran los subsidios a las empresas de colectivos que no tengan SUBE. http://www.lanacion.com.ar/1423099-retiraran-los-subsidios-a-las-empresas-de-colectivos-que-no-tengan-sube
FROM SECTION 4
In many jurisdictions around the world datasets and databases can be are covered by rights which mean that you have to ask for permission before you can legally reuse them. In this section we look at how you can get - and give - a green light to reuse datasets.
Most of the new Open Data portals being set up by governments now make clear that the data that is released can be used free of charge. The same goes for information obtained through Freedom of Information Requests. If you have scraped the data from the website of a public body, on reuse. Sometimes there will be intellectual property limitations, but normally the only requirement is that you cite the source of the information.
When talking about databases and intellectual property we need to distinguish between the structure and the content of a database (when we use the term 'data' we shall mean the content of the database itself). Structural elements include things like the field names and a model for the data — the organization of these fields and their inter-relation.
In many jurisdictions it is likely that the structural elements of a database will be covered by copyright (it depends somewhat on the level of 'creativity' involved in creating this structure).
The distinction between the "contents" of a database and the collection is especially crucial for factual databases since no jurisdiction grants a monopoly right in the individual facts (the "contents") even though it may grant right(s) in them as a collection.
To illustrate, consider the simple example of a database which lists the melting point of various substances. While the database as a whole might be protected by law so that one is not allow to access, reuse or redistribute it without permission this would never prevent you from stating the fact that substance Y melts at temperature Z.
You can read more about the situation your jurisdiction in the Guide to Open Data Licensing.
If you find you are having problems with your right to reuse information, then please let the campaign group Access Info Europe know ([email protected]). Access Info will help you with legal advice and will try to find lawyers in your country should that be necessary. Success? What now…? Share the data…
When you publish your project, do the rest of us a favor and include the data you’ve collected! It would be a damned shame if all that beautiful data you dug up (cleaned up, formatted and augmented) went stale on your hard drive. Let others source-check your story and perhaps even find different stories in the data.
When you do publish data, include an explicit IP statement, in particular an open license like the Creative Commons Zero or Attribution terms or the Open Database License (ODbL). If the data is government information not covered by copyright, publish it using Creative Commons' PD Mark dedication to make it clear that the data will be available in the public domain forever and for others to reuse!
FROM SECTION 5
FROM SECTION 6