GetData
-
class gydelt.gydelt.GetData
For collecting and storing the data
-
read_from_file(path, seperator='\t', parse_dates=False, encoding=None, header=0)
Read data from a saved file into a pandas DataFrame
- Parameters :
- path (str) :
(required)
- Path of the file to be read (do not forget to include the file extension)
- seperator (str) :
default '\t'
- Delimiter to use
- parse_dates (list) :
default False
- e.g. ['Date'] -> this column is parsed as a single date column
- encoding (str) :
default None
- Encoding to use for UTF when reading/writing (e.g. 'ISO-8859-1')
- header (int) :
default 0
- Determines which row of the data is used for the heading (column names)
If no header is needed, pass header=None
- Returns :
- pandas.DataFrame
- A pandas DataFrame
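The parameters above closely mirror those of pandas.read_csv. The following is a minimal sketch of an equivalent read, assuming (this page does not confirm it) that read_from_file passes its arguments through to pandas.read_csv. The helper name read_tab_file and the sample file are hypothetical and exist only for illustration.

```python
import os
import tempfile

import pandas as pd


def read_tab_file(path, seperator="\t", parse_dates=False, encoding=None, header=0):
    """Hypothetical stand-in for read_from_file, built directly on pandas.read_csv."""
    return pd.read_csv(
        path,
        sep=seperator,
        parse_dates=parse_dates,
        encoding=encoding,
        header=header,
    )


# Write a small tab-separated sample file so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.csv")
    with open(sample_path, "w", encoding="utf-8") as handle:
        handle.write("Date\tPersons\n2018-01-01\tP1;P2\n2018-01-02\tP3\n")

    # The first row becomes the column names; 'Date' is parsed as a date column.
    frame = read_tab_file(sample_path, parse_dates=["Date"])

    # With header=None the first row is treated as data, not as column names.
    no_header = read_tab_file(sample_path, header=None)
```

Note that the keyword is spelled seperator in the signature, so that spelling is kept in the sketch.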
-
fire_query(project_id, fields_required=['DATE', 'Themes', 'Locations', 'Persons', 'Organizations', 'V2Tone'], is_search_criteria=False, get_stats=True, auth_file='', search_dict={}, limit=None, save_data=True)
Fire a query on Google's BigQuery according to your search criteria and get the data in a pandas DataFrame
- Parameters :
- project_id (str) :
(required)
- The ID of the project created on Google BigQuery, with which the query is to be executed
- fields_required (list) :
default ['DATE', 'Themes', 'Locations', 'Persons', 'Organizations', 'V2Tone']
- The names of the columns to be extracted from the GKG Table on BigQuery
- is_search_criteria (boolean) :
default False
- If True, then the user has to specify the Search Criteria (either through the console or by passing a dictionary)
- get_stats (boolean) :
default True
- If True, the amount of data to be processed is displayed before the query is executed (only if the location of auth_file is given)
- auth_file (str) :
default '' (empty)
The path of the authorization file received from BigQuery.
Note
If the path of the authorization file is provided, stats will be displayed before executing the query. If not, a message will be displayed and the user will be asked whether or not to proceed.
- search_dict (dict) :
default {} (empty)
- Contains the search criteria in the following format -
- Keys - Column of the GKG table in which the search is to be performed
- Values - The keywords to be searched for in that field/column
- The values are divided into 3 parts -
Part 1 - similar to 'Include ALL of' (boolean 'and' is applied for each keyword)
Part 2 - similar to 'Include ATLEAST ONE of' (boolean 'or' is applied for each keyword)
Part 3 - similar to 'Must NOT have ANY of' (boolean 'not' is applied for each keyword)
Delimiter for the 3 parts is a semicolon
Delimiter for keywords within each part is a comma
Note
If no keywords are to be added to a certain part, leave it empty (but do NOT omit the semicolons)
Example -
{'Persons': 'P1;P2,P3;P4', 'Organizations': ';O1,O2;'}
>>> {'Locations': 'United States,China;;', 'Persons': ';;Donald Trump'} # This would mean that the 'Locations' should have BOTH 'United States' and 'China' and 'Persons' should NOT have 'Donald Trump'
Note
The format for input taken via the console is the same.
First, enter the required fields/columns (delimited by semicolons)
Then, for each field, enter the keywords in the same format as described for search_dict
Example
>>> Enter the Field(s) : Persons;Organizations
>>> Include ALL of these in Persons: sundar pichai,narendra modi
>>> Include ATLEAST ONE of these in Persons: larry page,andrew ng
>>> Include NONE of these in Persons: donald trump
>>> Include ALL of these in Organizations: google
>>> Include ATLEAST ONE of these in Organizations: allen institute for artificial intelligence
>>> Include NONE of these in Organizations :
Note
If a dictionary is passed, it is case-sensitive. Therefore, give the values in the proper casing.
- limit (int) :
default None
The maximum number of rows to be returned from the result of the query.
Note
The maximum size of the result is 128 MB.
If your query generates more than 128 MB of data, you will need to specify the limit.
- save_data (boolean) :
default True
- If True, saves the data in the current working directory
- Returns :
- pandas.DataFrame
- A pandas DataFrame
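The three-part search_dict value format described above can be illustrated with a small parser. This is only a sketch of how such a string breaks down into its 'all', 'any' and 'none' keyword lists; parse_criteria is a hypothetical helper, not part of gydelt, and it says nothing about how fire_query builds the actual query.

```python
def parse_criteria(value):
    """Split one search_dict value into its three keyword lists (hypothetical helper)."""
    parts = value.split(";")
    if len(parts) != 3:
        # Both semicolons are mandatory, even when a part is left empty.
        raise ValueError("expected exactly two semicolons, got: %r" % value)
    all_of, any_of, none_of = (
        [keyword.strip() for keyword in part.split(",") if keyword.strip()]
        for part in parts
    )
    return {"all": all_of, "any": any_of, "none": none_of}


search_dict = {"Locations": "United States,China;;", "Persons": ";;Donald Trump"}

# Map every column to its parsed 'all' / 'any' / 'none' keyword lists.
parsed = {column: parse_criteria(value) for column, value in search_dict.items()}
```

Raising an error when a semicolon is missing is a design choice of this sketch; it simply makes the "do not omit the semicolons" rule explicit.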
-
save_data_frame(data_frame, path=None, seperator='\t', index=False)
Save a DataFrame (to a specified location/current working directory)
- Parameters :
- data_frame (pandas.DataFrame) :
- The DataFrame that needs to be stored (in the specified format)
- path (str) :
default None
The path at which the file is to be stored
It should be of the following format -
<dir>/<sub-dir>/.../<filename.csv> (recommended)
Mention the file name in the path itself (along with the extension)
Note
If no path is provided, the data frame will be saved in the current working directory.
Format of file name -
Result(YYYY-MM-DD HH.MM.SS).csv
- seperator (str) :
default '\t'
- Delimiter to use
- index (boolean) :
default False
- If True, then the index of the DataFrame will also be stored in the file
- Returns :
- None
- A success message, along with the full path of the file.
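The default file name format mentioned above, Result(YYYY-MM-DD HH.MM.SS).csv, can be reproduced with datetime.strftime. This is a sketch of the naming pattern only; default_file_name is a hypothetical helper and the library may build the name differently.

```python
from datetime import datetime


def default_file_name(now=None):
    """Build a name of the form Result(YYYY-MM-DD HH.MM.SS).csv (hypothetical helper)."""
    now = now or datetime.now()
    # Dots separate the time fields, matching the HH.MM.SS pattern documented above.
    return now.strftime("Result(%Y-%m-%d %H.%M.%S).csv")


example_name = default_file_name(datetime(2018, 3, 9, 14, 5, 7))
```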
-
ProcessData
-
class gydelt.gydelt.ProcessData(data_frame, location='Locations', person='Persons', organization='Organizations', tone='ToneData', theme='Themes')
Contains wrappers to pre-process various fields of the data collected from GDELT
-
check_country_list()
Returns those Locations (countries) which were not present in the countries list
- Returns :
- list
- A list containing the locations for which there was no match in the default country list
-
clean_locations(only_country=True, fillna='unknown')
Pre-process the 'Locations' column of the data (extract either all details available, or just the countries)
- Parameters :
- only_country (boolean) :
default True
If True, will keep only the country names for each row in the Locations column
If False, will keep whatever details are available (city, state or country)
- fillna (str) :
default 'unknown'
- To fill the Null values (NaN) with the specified value
- Returns :
- pandas.DataFrame
- A pandas DataFrame (with additional fields for Countries and States, if required)
-
clean_persons(fillna='unknown', max_no_of_words=6)
Filters the Persons column of the data. Only those names whose number of words is within a certain limit are kept
- Parameters :
- fillna (str) :
default 'unknown'
- To fill the Null values (NaN) with the specified value
- max_no_of_words (int) :
default 6
- Removes from each record/row all the names that contain more words than this value
- Returns :
- pandas.DataFrame
- A pandas DataFrame (with updated 'Persons')
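The word-count filter described for clean_persons can be sketched for a single cell as follows. This is an illustration under stated assumptions, not the library's code: it assumes names within a cell are separated by semicolons, and that a cell left with no names falls back to the fillna value. filter_names and its delimiter parameter are hypothetical.

```python
def filter_names(cell, fillna="unknown", max_no_of_words=6, delimiter=";"):
    """Keep only the names in one cell that have at most max_no_of_words words."""
    # Treat None and NaN as missing (NaN is the only value not equal to itself).
    if cell is None or cell != cell:
        return fillna
    kept = [
        name.strip()
        for name in str(cell).split(delimiter)
        if name.strip() and len(name.split()) <= max_no_of_words
    ]
    # Assumption: a cell with no surviving names is filled like a missing value.
    return delimiter.join(kept) if kept else fillna


cleaned = filter_names("barack obama;this entry has far too many words in it")
```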
-
clean_organizations(fillna='unknown')
Pre-processes the Organizations column. Removes certain invalid organizations
Note
Some countries (e.g. United States) have been mistaken for individual organizations
This function removes those organizations (which are actually countries) from each record/row
- Parameters :
- fillna (str) :
default 'unknown'
- To fill the Null values (NaN) with the specified value
- Returns :
- pandas.DataFrame
- A pandas DataFrame (with updated Organizations)
-
seperate_tones()
Creates separate columns for each value in ToneData
- Returns :
- pandas.DataFrame
- A pandas DataFrame
Note
The ToneData column has 7 values, which are converted into separate columns in the data frame. The original ToneData column remains intact
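Splitting one delimited column into several, as described for ToneData, can be sketched with pandas string methods. The comma delimiter, the sample values and the new column names (Tone_1 to Tone_7) are assumptions made for this illustration; the actual method may use different names.

```python
import pandas as pd

# Hypothetical column names; the real method may name the new columns differently.
TONE_COLUMNS = ["Tone_%d" % i for i in range(1, 8)]

frame = pd.DataFrame(
    {
        "ToneData": [
            "-1.5,2.0,3.5,5.5,20.1,0.4,300",
            "0.8,3.1,2.3,5.4,18.0,1.2,512",
        ]
    }
)

# Split each cell on commas into 7 columns and convert the pieces to floats.
tones = frame["ToneData"].str.split(",", expand=True).astype(float)
tones.columns = TONE_COLUMNS

# Join the new columns; the original ToneData column is left untouched.
result = frame.join(tones)
```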
-
clean_themes(fillna='unknown')
Fills the Null values (NaN) in the Themes column of the data
- Parameters :
- fillna (str) :
default 'unknown'
- To fill the Null values (NaN) with the specified value
- Returns :
- pandas.DataFrame
- A pandas DataFrame (with Null values in Themes filled)
-
flat_column(columns=[], fillna='unknown')
The given list of columns is flattened (using one-hot encoding) and the resulting columns are added to the DataFrame
- Parameters :
- columns (list) :
default [] (empty)
- The names of the columns to be flattened
- fillna (str) :
default 'unknown'
- To fill the Null values (NaN) with the specified value
- Returns :
- pandas.DataFrame
- A pandas DataFrame
Note
All the columns named in the list columns are flattened (one-hot encoding is used).
The returned data frame contains additional columns, one for each unique value present in the columns that were flattened.
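One-hot encoding of a delimited column, as described in the note above, can be sketched with pandas' Series.str.get_dummies. The semicolon delimiter and the sample data are assumptions for illustration; this is not necessarily how flat_column is implemented.

```python
import pandas as pd

frame = pd.DataFrame({"Persons": ["P1;P2", "P2", None]})

# Fill missing values first, mirroring the fillna parameter.
filled = frame["Persons"].fillna("unknown")

# One indicator column per unique value found in the delimited cells.
indicators = filled.str.get_dummies(sep=";")

# Add the indicator columns to the original data frame.
flattened = frame.join(indicators)
```

Each row now has a 1 in the column of every value it contained and a 0 elsewhere, so the row "P1;P2" is marked in both the P1 and P2 columns.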
-
pre_process()
A wrapper function that does all the pre-processing (except flattening)
- Returns :
- pandas.DataFrame
- A clean and processed pandas DataFrame
-
save_data_frame(path=None, seperator='\t', index=False)
Save a DataFrame (to a specified location/current working directory)
- Parameters :
- path (str) :
default None
The path at which the file is to be stored
It should be of the following format -
<dir>/<sub-dir>/.../<filename.csv> (recommended)
Mention the file name in the path itself (along with the extension)
Note
If no path is provided, the data frame will be saved in the current working directory.
Format of file name -
Result(YYYY-MM-DD HH.MM.SS).csv
- seperator (str) :
default '\t'
- Delimiter to use
- index (boolean) :
default False
- If True, then the index of the DataFrame will also be stored in the file
- Returns :
- None
- A success message, along with the full path of the file.
-