:orphan:

Quickstart
===========

Sparx is a dedicated data preprocessing library that transforms raw data into a machine-readable format. We at CleverInsight Lab took the initiative to build a better automated data preprocessing library, and here it is.

Simple Usage
~~~~~~~~~~~~

.. code-block:: python

    >>> from sparx.preprocess import *

is_categorical
~~~~~~~~~~~~~~~~~~~~~~

Returns ``True`` if the given pandas Series is a categorical variable.

.. code-block:: python

    >>> is_categorical(data[col])
    True

is_date
~~~~~~~~~~~~~~~~~~~

Returns ``True`` if the given pandas Series is of a date type.

.. code-block:: python

    >>> is_date(data[col])
    True

count_missing
~~~~~~~~~~~~~~~~~~~

Returns the count of missing values in the given pandas Series.

.. code-block:: python

    >>> count_missing(df['col_name'])
    0

missing_percent
~~~~~~~~~~~~~~~~~~~

Returns the percentage of missing values in the column.

.. code-block:: python

    >>> missing_percent(df['col_name'])
    0

types
~~~~~~~~~~~~~~~~~~~

Returns the column names of the given DataFrame, grouped by kind.

.. code-block:: python

    >>> types(df)
    {'dates': ['D'],
     'groups': ['C', 'D'],
     'keywords': ['C'],
     'numbers': ['A', 'B']}

has_keywords
~~~~~~~~~~~~~~~~~~~

Returns ``True`` if any of the first 1000 non-null values in a string ``series`` are strings that contain more than ``thresh`` (default 2) separators (space, by default).

.. code-block:: python

    >>> has_keywords(series)
    False
    >>> has_keywords(series, thresh=1)
    True

groupmeans
~~~~~~~~~~~~~~~~~~~

Yields the significant differences in average between every pair of groups and numbers.

describe
~~~~~~~~~~~~~~~~~~~

Returns a basic description of a column in a pandas DataFrame, after checking that the column is of integer or float type.

.. code-block:: python

    >>> describe(dataframe, 'Amount')
    {'min': 0, 'max': 100, 'mean': 50, 'median': 49}

geocode
~~~~~~~~~~~~~~~~~~~

Returns a ``dict`` containing the address, latitude, and longitude of the given address.

.. code-block:: python

    >>> geocode("172, 5th Avenue, Flatiron, Manhattan")
    {'latitude': 40.74111015,
     'adress': u'172, 5th Avenue, Flatiron,
     Manhattan, Manhattan Community Board 5, New York County, NYC,
     New York, 10010, United States of America',
     'longitude': -73.9903105}

unique_value_count
~~~~~~~~~~~~~~~~~~~~~

Returns the count of each unique value in each column.

.. code-block:: python

    >>> unique_value_count(data)
    {'gender': {'Male': 2, 'Female': 6},
     'age': {32: 2, 34: 2, 35: 1, 37: 1, 21: 1, 28: 1},
     'name': {'Neeta': 1, 'vandana': 2, 'Amruta': 1, 'Vikrant': 2,
     'vanana': 1, 'Pallavi': 1}}

unique_identifier
~~~~~~~~~~~~~~~~~~

Returns a list of the columns in the DataFrame that consist of unique identifiers.

.. code-block:: python

    >>> unique_identifier(df)
    ['age', 'id']

date_split
~~~~~~~~~~

Returns a ``dict`` of year, month, day, hour, minute, and second.

.. code-block:: python

    >>> date_split("march/1/1980")
    {'second': '00', 'hour': '00', 'year': '1980', 'day': '01',
     'minute': '00', 'month': '03'}

dict_query_string
~~~~~~~~~~~~~~~~~

Returns a query string built from the given dictionary.

.. code-block:: python

    >>> query = {'name': 'Sam', 'age': 20}
    >>> dict_query_string(query)
    name=Sam&age=20

encode
~~~~~~

Returns a tuple. The first element is a clean DataFrame that has been converted to ``utf8`` and in which every categorical variable is replaced by numeric labels. The second element is a dictionary (hash map) holding the label-encoding classes for each encoded column.

.. code-block:: python

    >>> encode(df)
    (...
    [150 rows x 6 columns], {'Species': {0: 'setosa', 1: 'versicolor', 2: 'virginica'}})

strip_non_alphanum
~~~~~~~~~~~~~~~~~~~

Returns a ``list`` of alphanumeric strings obtained by stripping the non-alphanumeric characters.

.. code-block:: python

    >>> strip_non_alphanum("epqenw49021[4;;ds..,.,uo]mfLCP'X")
    ['epqenw49021', '4', 'ds', 'uo', 'mfLCP', 'X']

word_freq_count
~~~~~~~~~~~~~~~~~

Returns a ``dict`` with each word as a key and its frequency count as the value. The example below passes a plain string, so the counts shown are per character.

.. code-block:: python

    >>> word_freq_count("hello how are you")
    {'a': 1, ' ': 3, 'e': 2, 'h': 2, 'l': 2, 'o': 3, 'r': 1,
     'u': 1, 'w': 1, 'y': 1}

ignore_stopwords
~~~~~~~~~~~~~~~~

Returns the list of words in the given input, ignoring stopwords.

.. code-block:: python

    >>> ignore_stopwords("I am basically a lazy person and i hate computers")
    ['I', 'basically', 'lazy', 'person', 'hate', 'computers']
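
To make the stopword filtering above concrete, here is a minimal standard-library sketch of the same idea. It is not Sparx's implementation: the ``drop_stopwords`` helper and the tiny ``STOPWORDS`` set are hypothetical, chosen only for illustration, and the case-sensitive matching is an assumption inferred from the example output, where ``'I'`` is kept but ``'i'`` is dropped.

.. code-block:: python

    # Hypothetical stopword set for illustration only; the list that
    # Sparx actually uses is not shown in this document.
    STOPWORDS = {"i", "am", "a", "and", "the", "is"}


    def drop_stopwords(text, stopwords=STOPWORDS):
        """Split ``text`` on whitespace and keep words not in ``stopwords``.

        Matching is case-sensitive (an assumption), so "I" survives
        while "i" is removed.
        """
        return [word for word in text.split() if word not in stopwords]


    words = drop_stopwords("I am basically a lazy person and i hate computers")
    print(words)

Lowercasing each word before the membership test would make the filter case-insensitive, at the cost of also removing the leading ``'I'``.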